Background on Markov Models
An order-k Markov model uses strings of k-letters to predict text, these are called k-grams.
An order-2 Markov model uses two-character strings or bigrams to calculate probabilities in generating random letters. For example suppose that in some text that we’re using for generating random letters using an order-2 Markov model, the bigram “th” is followed 50 times by the letter ‘e’, 20 times by the letter ‘a’, and 30 times by the letter ‘o’, because the sequences “the”, “tha” and “tho” occur 50, 20, and 30 times, respectively while there are no other occurrences of “th” in the text we’re modeling.
Now suppose that we want to generate random text. We generate the bigram “th” and based on this we must generate the next random character using the order-2 model. The next letter will be an ‘e’ with a probability of 0.5 (50/100); will be an ‘a’ with probability 0.2 (20/100); and will be an ‘o’ with probability 0.3 (30/100). If ‘e’ is chosen, then the next bigram used to calculate random letters will be “he” since the last part of the old bigram is combined with the new letter to create the next bigram used in the Markov process.
A Brute-Force Approach
In general here’s pseudo-code to generate random letters (and thus random text) using an order-k Markov model and a training text from which the probabilities are calculated.
seed = random k-character substring from the training text --- the initial seed repeat N times to generate N random letters for each occurrence of seed in training text record the letter that follows the occurrence of seed in a list choose a random element of the list as the generated letter C print or store C seed = (last k-1 characters of seed) + C
Using this algorithm, whose Java equivalent is here, here are a series of 200-randomly generated letters using the order-k Markov model. The training text for these models are speeches made by Barack Obama as his acceptance speech at the Democractic National Convention, John McCain in his speech to the American Legion, and Sarah Palin in her acceptance speech as McCain’s vice president (all late August, 2008). See training texts for the training data files.
Order-k | Obama | McCain | Palin |
---|---|---|---|
2 | n rentrion’t jobackeentry I knompin he and the in ard ould th cou’ll spow. We forponowelves chat, agethilonow, age thembecal th. Ted judelisted we we all Hareaven the se a nuclas nom I’ve then no muc | nd, al ditationg tationtionfits one the of minal orthe re VA lain Kostreed ve rave etens. I we Frans. Firces, me wore litypecest stareparefte and the millse wilignatte shosed make thend truslair comen | ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld prom het younto tol; and the withar kidepe, whe just whe could the wout to be ounted we ations. Oileaterall cles it st ou, This |
3 | y new dreams, have same time that prevel, I save of that eign purplummethe me from tough,” to Democrations onces. I’ve have for our smally keeps talking American it. The new lege have by are future | retary, in mand our VA heall you harity Counding in next continue that, in as of ther. I will the are to jointo elied to proverns a serve in a rought too gress well to essed. Pam elems will in an that | righter Sept. 11 of highbors allies righ school, to serving on a finisher in thank as Sen. He who he PTA and things as but to hear. Todd and polities safer is of Amercial elebrator troops and thing t |
4 | nse on is at we his countries to progress for you’ll investments across America, the United overty? Pull your own. Our grace the fuels. And just money work better an America’s promise of go to resourc | e left their cours extend upon the good. I belonger has the enterests of our respecial very invites are not force for peace, and firmly or she work of freedom the VA should make got lost conce introdu | ct a greats and to threat effort for the people who have places week public office and when Sen. Join ourse, Sen. McCain, for that I didn’t just his opport forms, and the PTA and to just at all the ol |
5 | y watched my life is just takes us from harm’s way in theirs are differences; that the best hope that young veterans to take more accountry, and science, and more accompanies that we need to record’s | es than the anti-America receive benefit is not the people of Arizona, and their home are among those complaints about aiming a few expeditiously assess, and women of us returned from misrepresential | ough, he’ll be honoring the way, Todd is only one expects us today. I never really set out of five of America, then we just your average “hockey mom” in American president of the United States. Well, |
8 | gy independence on something firmer, and more honest in our battlefields may be Democrats and Republican nominee, John McCain, I will stop giving them to companies stop discriminating against those wi | administration will do all that we can do, in less trying and treat conditions that predominantly or exclusively affect women. And here the Veterans’ service to one another, of leaving no one behind, | , Todd and I met way back in high school, and I promised him a little surprise for the right to vote. I think as well today of two other women who came before me in national Guard, that’s not why the |
Here’s the Java code that implements the brute-force solution to generate numLetters at random from the training-text in instance variable myString using an order-k Markov model.
public void brute(int k, int numLetters) { // pick random k-character substring as initial seed int start = myRandom.nextInt(myString.length() – k + 1); String seed = myString.substring(start, start + k); // copy first k characters to back to simulate wrap-around String wrapAroundString = myString + myString.substring(0,k); StringBuilder build = new StringBuilder(); ArrayList<Character> list = new ArrayList<Character>(); for (int i = 0; i < numLetters; i++) { list.clear(); int pos = 0; while ((pos = wrapAroundString.indexOf(seed, pos)) != -1 && pos < myString.length()) { char ch = wrapAroundString.charAt(pos + k); list.add(ch); pos++; } int pick = myRandom.nextInt(list.size()); char ch = list.get(pick); build.append(ch); seed = seed.substring(1) + ch; } }
The code above works fine, but to generate N letters in a text of size T the code does NT work since it rescans the text each time a character is found.