Lab 3, 2/4 & 2/7

Making Random Text

In this lab, you will write a program that reads one or more reference files and, based on the frequency and order in which characters appear in the given text, randomly generates new text that sounds like it might have been written by the author of the reference files.

The model used to generate the new random text is called an N-gram model, which assumes that the previous n characters largely determine the probabilities for the next character. For our purposes, we will use the n most recently generated characters to determine the next randomly generated character by looking at all occurrences of those n characters in the reference text.

Our algorithm starts by picking a random n-character substring as the seed that starts the random text generation. This is done simply by picking an n-character string at random from the reference text. This initial n-character string is called the predictor in the code that follows. Then, every occurrence of the predictor string is located in the reference text. The character that follows each occurrence is a candidate to be the next randomly generated character.

Examples

For example, imagine 1-gram generation as a typewriter with lots of 'e' keys, lots of 't' keys, lots of 'o' keys, but fewer 'r' keys, still fewer 'f' keys, and only one or two 'z' and 'y' keys (just like the distribution of Scrabble tiles or the letters given for the final game in Wheel of Fortune).

An order-n Markov model uses strings of n letters, called n-grams, to predict text. An order-2 Markov model uses two-character strings, or bigrams, to calculate the probabilities used in generating random letters. For example, suppose that in the text we're using to generate random letters with an order-2 Markov model, the bigram "th" is followed 50 times by the letter 'e', 20 times by the letter 'a', and 30 times by the letter 'o', because the sequences "the", "tha", and "tho" occur 50, 20, and 30 times, respectively, while there are no other occurrences of "th" in the text we're modeling.

Now suppose that in generating random text we generate the bigram "th" and, based on this, we must generate the next random character using the order-2 model. The next letter will be 'e' with probability 0.5 (50/100), 'a' with probability 0.2 (20/100), and 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next bigram used to calculate random letters will be "he", since the last character of the old bigram is combined with the new letter to create the next bigram used in the Markov process.

Here is pseudocode to generate random letters (and thus random text) using n-grams and a training text from which probabilities are calculated.

  predictor = random n-character substring from the training text --- the initial seed
  repeat numLetters times to generate numLetters random letters
     start with an empty list of possible next letters
     for each occurrence of predictor in the training text
        record the letter that follows that occurrence in the list
     choose a random element of the list as the generated letter C
     print or store C
     predictor = (last n-1 characters of predictor) + C
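
To make the pseudocode concrete, here is a minimal Java sketch of the same loop as a single standalone method. The names MarkovSketch and generate are made up for this illustration and are not part of the lab code, which instead splits the work across the methods of the class NGram described later; for simplicity this sketch skips an occurrence of the predictor at the very end of the text instead of wrapping around (a detail discussed below).

  import java.util.Random;

  public class MarkovSketch {
      // A standalone rendering of the pseudocode above.
      // Assumes the training text is longer than n characters.
      public static String generate(String training, int n, int numLetters) {
          Random rng = new Random();
          // the initial seed: a random n-character substring of the training text
          int start = rng.nextInt(training.length() - n + 1);
          String predictor = training.substring(start, start + n);
          StringBuilder result = new StringBuilder();
          for (int i = 0; i < numLetters; i++) {
              // collect every character that follows an occurrence of predictor
              StringBuilder followers = new StringBuilder();
              int pos = training.indexOf(predictor);
              while (pos != -1) {
                  if (pos + n < training.length()) {
                      followers.append(training.charAt(pos + n));
                  }
                  pos = training.indexOf(predictor, pos + 1);
              }
              if (followers.length() == 0) {
                  break;  // the only occurrence was at the very end of the text
              }
              // a uniform choice from the list reproduces the observed frequencies
              char c = followers.charAt(rng.nextInt(followers.length()));
              result.append(c);
              // keep the last n-1 characters of the predictor and append c
              predictor = predictor.substring(1) + c;
          }
          return result.toString();
      }
  }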
For example, suppose the training text is "bbbabbabbbbaba" and we're using a 3-gram to predict text.

The 3-letter string (3-gram) "bbb" occurs three times, twice followed by 'a' and once by 'b'. We represent this by saying that the 3-gram "bbb" is followed twice by "bba" and once by "bbb", since the next 3-gram in generating Markov-modeled random text is formed by taking the last two characters of the seed 3-gram "bbb", which are "bb", and adding the letter that follows the original 3-gram seed.

The 3-letter string "bba" occurs three times, each time followed by 'b'. The 3-letter string "bab" occurs three times, followed twice by 'b' and once by 'a'. However, we treat the original string/training-text as circular, i.e., the end of the string is followed by the beginning of the string. This means "bab" also occurs at the end of the string (last two characters followed by the first character) again followed by 'b'.

In processing the training text from left to right, we see the following transitions between 3-grams, starting with the left-most 3-gram "bbb":

  bbb -> bba -> bab -> abb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> aba -> bab -> abb -> bbb
This can be represented as a map from each possible 3-gram to the 3-grams that follow it:

  3-gram   following 3-grams
  bbb      bba, bbb, bba
  bba      bab, bab, bab
  bab      abb, abb, aba, abb
  abb      bba, bbb, bbb
  aba      bab
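
The map above can be computed directly from the training text. Here is a small Java sketch of one way to do so; the names TransitionMapSketch and buildMap are made up for this illustration and are not part of the lab code. Appending the first n characters of the text is one way to handle the circular treatment described above.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class TransitionMapSketch {
      // Maps each n-gram in the training text to the n-grams that follow it,
      // treating the text as circular.
      public static Map<String, List<String>> buildMap(String training, int n) {
          // append the first n characters so windows that wrap around are easy to read
          String wrapped = training + training.substring(0, n);
          Map<String, List<String>> map = new HashMap<>();
          for (int i = 0; i < training.length(); i++) {
              String gram = wrapped.substring(i, i + n);
              String next = wrapped.substring(i + 1, i + 1 + n);
              map.computeIfAbsent(gram, k -> new ArrayList<>()).add(next);
          }
          return map;
      }

      public static void main(String[] args) {
          // reproduces the contents of the table above (entry order may vary)
          System.out.println(buildMap("bbbabbabbbbaba", 3));
      }
  }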

For another example, suppose for a 3-gram the predictor is "the" and the reference file is:

these are the authentic anthems of the ethernet, I hypothesize mathematically

Then the characters that follow the predictor are, in order (note the two space characters):

s nm rsm

One of these is chosen at random to be the next character in the generated text (notice that 's', ' ', and 'm' are the most likely since there are two occurrences of each, while 'n' and 'r' are less likely since there is only one occurrence of each). Then the predictor is updated to be the last n-1 characters of the old predictor followed by the chosen character. So, if 'm' is chosen in the example, the new predictor will be "hem" (which has following characters 's' and 'a').
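
Using the strings from this example, the update step looks like the following tiny sketch (the class name is made up for this illustration, and we assume 'm' happened to be the character chosen at random):

  public class PredictorUpdateExample {
      public static void main(String[] args) {
          String predictor = "the";
          char chosen = 'm';  // suppose 'm' was the follower picked at random
          // keep the last n-1 = 2 characters and append the chosen character
          predictor = predictor.substring(1) + chosen;
          System.out.println(predictor);  // prints "hem"
      }
  }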

This process is repeated to randomly generate a text of any given length that resembles the original reference text used as its source.

Example Output

Here are samples of 200 randomly generated letters using the above model for different values of n. The training texts for these models are speeches made by Barack Obama (his acceptance speech at the Democratic National Convention), John McCain (his speech to the American Legion), and Sarah Palin (her acceptance speech as McCain's vice-presidential running mate), all from late August 2008. See training texts for the training data files.

Order-n Obama McCain Palin
2 n rentrion't jobackeentry I knompin he and the in ard ould th cou'll spow. We forponowelves chat, agethilonow, age thembecal th. Ted judelisted we we all Hareaven the se a nuclas nom I've then no muc nd, al ditationg tationtionfits one the of minal orthe re VA lain Kostreed ve rave etens. I we Frans. Firces, me wore litypecest stareparefte and the millse wilignatte shosed make thend truslair comen ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld prom het younto tol; and the withar kidepe, whe just whe could the wout to be ounted we ations. Oileaterall cles it st ou, This
3 y new dreams, have same time that prevel, I save of that eign purplummethe me from tough," to Democrations onces. I've have for our smally keeps talking American it. The new lege have by are future retary, in mand our VA heall you harity Counding in next continue that, in as of ther. I will the are to jointo elied to proverns a serve in a rought too gress well to essed. Pam elems will in an that righter Sept. 11 of highbors allies righ school, to serving on a finisher in thank as Sen. He who he PTA and things as but to hear. Todd and polities safer is of Amercial elebrator troops and thing t
4 nse on is at we his countries to progress for you'll investments across America, the United overty? Pull your own. Our grace the fuels. And just money work better an America's promise of go to resourc e left their cours extend upon the good. I belonger has the enterests of our respecial very invites are not force for peace, and firmly or she work of freedom the VA should make got lost conce introdu ct a greats and to threat effort for the people who have places week public office and when Sen. Join ourse, Sen. McCain, for that I didn't just his opport forms, and the PTA and to just at all the ol
5 y watched my life is just takes us from harm's way in theirs are differences; that the best hope that young veterans to take more accountry, and science, and more accompanies that we need to record's es than the anti-America receive benefit is not the people of Arizona, and their home are among those complaints about aiming a few expeditiously assess, and women of us returned from misrepresential ough, he'll be honoring the way, Todd is only one expects us today. I never really set out of five of America, then we just your average "hockey mom" in American president of the United States. Well,
8 gy independence on something firmer, and more honest in our battlefields may be Democrats and Republican nominee, John McCain, I will stop giving them to companies stop discriminating against those wi administration will do all that we can do, in less trying and treat conditions that predominantly or exclusively affect women. And here the Veterans' service to one another, of leaving no one behind, , Todd and I met way back in high school, and I promised him a little surprise for the right to vote. I think as well today of two other women who came before me in national Guard, that's not why the

Writing the Code

If you download the project labs/lab03, you will see a shell that shows the basic algorithm described above, but that is missing three key details. You are to complete and test the three methods of the class NGram described below that work together to generate random text. For fun, we have provided a few large data files that have very characteristic styles.

Note that this program relies primarily on randomness to achieve an interesting result. Whenever randomness is involved, it is hard to verify that a program actually works as intended; thus, it is important to test each method you write individually. Only by testing each method separately can you have any confidence that the program as a whole works. To help you test, we have also provided the class NGramTester, which calls each method individually with specific parameter values and prints the results for you to inspect. You may also want to make some smaller reference files than those we provided to help verify that each method works as expected.

  1. Choosing a Random Substring

    Complete the method getRandomSubstring of the class NGram that chooses a random substring of length subSize from the string s, both given as parameters. Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning a string of the correct length that occurs within the given string. (One possible approach is sketched after this list.)

  2. Getting all the Following Characters

    Complete the method getFollowingCharacters of the class NGram, which has been started for you. getFollowingCharacters should return all characters that immediately follow a given substring in a String. For example,

    getFollowingCharacters("bbbabbabbbbaba","bab") should return "bbab".

    The code should:

    1. Find every occurrence of parameter toFind in the string s
    2. Add the character immediately following each occurrence of toFind to the string result (without any delimiter between the added characters, as in the example given above)

    Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning all of the possible following characters given a variety of strings to find. (A sketch of one approach, treating the text as circular, appears after this list.)

  3. Putting it all together

    Complete the function makeNGram that is passed the reference text as one big string of every word in order, a value for the n in N-gram, and the length of the text to generate randomly. The function returns a new string constructed based on N-gram frequencies. You should implement the body of the loop. You will need to call both of the functions you have written previously to find the possible next characters and then choose one randomly.

    Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning the correct total number of characters and that the generated text resembles the reference text more closely as n increases.
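
For reference, here is a sketch of one possible approach to the first two methods. The signatures below (static methods, parameter order, return types) are guesses and may not match the shell code provided in the class NGram, so adapt them to what the lab gives you; the main method simply reproduces the "bbab" example from above. For the loop body of makeNGram, the Java sketch that follows the pseudocode earlier in this write-up shows one way to combine these two pieces.

  import java.util.Random;

  // Hypothetical stand-ins for two methods of the lab's NGram class; the real
  // signatures and parameter order may differ from the provided shell code.
  public class NGramSketch {
      private static final Random RNG = new Random();

      // A random substring of s with length subSize (assumes subSize <= s.length()).
      public static String getRandomSubstring(String s, int subSize) {
          int start = RNG.nextInt(s.length() - subSize + 1);
          return s.substring(start, start + subSize);
      }

      // Every character that immediately follows an occurrence of toFind in s,
      // treating s as circular (as in the "bab" example above).
      public static String getFollowingCharacters(String s, String toFind) {
          // append the start of s so occurrences that wrap around are visible
          String wrapped = s + s.substring(0, toFind.length());
          StringBuilder result = new StringBuilder();
          int pos = wrapped.indexOf(toFind);
          while (pos != -1 && pos < s.length()) {
              result.append(wrapped.charAt(pos + toFind.length()));
              pos = wrapped.indexOf(toFind, pos + 1);
          }
          return result.toString();
      }

      public static void main(String[] args) {
          System.out.println(getFollowingCharacters("bbbabbabbbbaba", "bab")); // bbab
          System.out.println(getRandomSubstring("bbbabbabbbbaba", 3).length()); // 3
      }
  }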

Submitting

Submit your source code and README.txt using assignment name lab03. Your README should describe all of the tests that you ran on your code, including any new ones that you created.
Last modified: Wed Feb 9 12:56:54 EST 2011