Introduction to Computer Science
CompSci 101 : Spring 2014

Random Text Generation

The picture shows Andrey Markov, in many ways the originator of the stochastic processes typically known as Markov chains. In this lab you will create a program that uses a Markov chain to model random text. The program you write will use an algorithm made "famous" as Mark V Shaney; you can find references to it on Wikipedia and an online version of the program here as well.

Although this lab focuses on generating random sentences, Markov models are more commonly used in the reverse direction: to identify an unknown source, for example in computational biology or in verifying authorship.

In lab you will read in a training text and, based on the frequency and order in which characters or words appear in the given text, randomly generate new text that sounds like it might have been written by the author of the reference files.

An order-k Markov model uses strings of k letters, called k-grams, to predict text. The assumption is that the previous k characters determine the probabilities for the next character. For example, suppose the training text for an order-2 Markov model contains the 2-gram "th" followed 50 times by the letter 'e', 20 times by the letter 'a', and 30 times by the letter 'o', because the sequences "the", "tha", and "tho" occur 50, 20, and 30 times, respectively, while there are no other occurrences of "th" in the training text.

Now suppose that we want to generate random text. We start with the 2-gram "th" and, based on the order-2 model, must generate the next character at random. The next letter will be 'e' with probability 0.5 (50/100), 'a' with probability 0.2 (20/100), and 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next 2-gram used to generate random letters will be "he", since the last part of the old 2-gram is combined with the new letter to create the next one.
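
One way to picture these probabilities in code (a sketch only, not part of the lab's starter code) is to keep a list with one entry per observed follower and pick from it uniformly at random; the counts below are the hypothetical "th" counts from the paragraph above.

  import random

  # One entry per occurrence: 50 'e's, 20 'a's, 30 'o's, as in the example above.
  followers = ['e'] * 50 + ['a'] * 20 + ['o'] * 30
  next_char = random.choice(followers)     # 'e' with prob 0.5, 'a' with 0.2, 'o' with 0.3
  next_gram = 'th'[1:] + next_char         # slide the window: "he", "ha", or "ho"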

Consider the following text as an example. In lab you will have text files that are more fun than this one, including free (as in speech and beer) texts such as Little Brother by Cory Doctorow:

  The big dog ate food. The big dog died.

If we break the text into three-letter substrings we get this list of substrings:

   ['The', 'he ', 'e b', ' bi', 'big', 'ig ', 'g d', ' do' ... ]

We then treat each of these substrings as a two-letter predicting sequence. This two-letter sequence generates a new letter by choosing one of the letters that follows the two-letter sequence. For example, in the training text the two-letter sequence ' d' (space-'d') is followed by 'o' (big dog) twice and by 'i' once (dog died). This means that the two-letter sequence ' d' (space-'d') will generate or predict one of ['o','o','i'] and it is more likely, two-thirds to one-third, to generate an 'o' than an 'i'.
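
A few lines of Python (a sketch of the idea, not the starter code you will snarf) can build these follower lists for the example sentence and confirm that the key ' d' maps to ['o', 'o', 'i']:

  text = 'The big dog ate food. The big dog died.'

  # Map each two-letter sequence to the list of characters that follow it.
  followers = {}
  for i in range(len(text) - 2):
      key = text[i:i+2]
      followers.setdefault(key, []).append(text[i+2])

  print(followers[' d'])    # ['o', 'o', 'i']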

The algorithm starts by picking a random k-character substring as the seed that starts the random text generation.
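
Putting the pieces together, the overall generation loop might look roughly like the sketch below for k = 2 (this is only an illustration of the idea; the functions you complete in Markov.py may be organized differently):

  import random

  text = 'The big dog ate food. The big dog died.'
  k = 2

  # Map each k-letter sequence to the list of characters that follow it.
  model = {}
  for i in range(len(text) - k):
      model.setdefault(text[i:i+k], []).append(text[i+k])

  # Pick a random k-character substring of the text as the seed.
  start = random.randrange(len(text) - k)
  seed = text[start:start+k]

  out = seed
  for _ in range(60):
      if seed not in model:                # no followers were seen for this seed
          break
      out += random.choice(model[seed])    # frequent followers are chosen more often
      seed = out[-k:]                      # the last k characters become the new seed
  print(out)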

An order-3 process uses 3 letters to predict the fourth, and so on. To see how much more English-like a higher-order process is, consider the examples below, which come from 2008 campaign speeches; each row shows excerpts of text generated at the given order k from Obama, McCain, and Palin speeches, respectively.

Order-k Obama McCain Palin
2 n rentrion't jobackeentry I knompin he and the in ard ould th cou'll spow. We forponowelves chat, agethilonow, age thembecal th. Ted judelisted we we all Hareaven the se a nuclas nom I've then no muc nd, al ditationg tationtionfits one the of minal orthe re VA lain Kostreed ve rave etens. I we Frans. Firces, me wore litypecest stareparefte and the millse wilignatte shosed make thend truslair comen ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld prom het younto tol; and the withar kidepe, whe just whe could the wout to be ounted we ations. Oileaterall cles it st ou, This
3 y new dreams, have same time that prevel, I save of that eign purplummethe me from tough," to Democrations onces. I've have for our smally keeps talking American it. The new lege have by are future retary, in mand our VA heall you harity Counding in next continue that, in as of ther. I will the are to jointo elied to proverns a serve in a rought too gress well to essed. Pam elems will in an that righter Sept. 11 of highbors allies righ school, to serving on a finisher in thank as Sen. He who he PTA and things as but to hear. Todd and polities safer is of Amercial elebrator troops and thing t
4 nse on is at we his countries to progress for you'll investments across America, the United overty? Pull your own. Our grace the fuels. And just money work better an America's promise of go to resourc e left their cours extend upon the good. I belonger has the enterests of our respecial very invites are not force for peace, and firmly or she work of freedom the VA should make got lost conce introdu ct a greats and to threat effort for the people who have places week public office and when Sen. Join ourse, Sen. McCain, for that I didn't just his opport forms, and the PTA and to just at all the ol
5 y watched my life is just takes us from harm's way in theirs are differences; that the best hope that young veterans to take more accountry, and science, and more accompanies that we need to record's es than the anti-America receive benefit is not the people of Arizona, and their home are among those complaints about aiming a few expeditiously assess, and women of us returned from misrepresential ough, he'll be honoring the way, Todd is only one expects us today. I never really set out of five of America, then we just your average "hockey mom" in American president of the United States. Well,
8 gy independence on something firmer, and more honest in our battlefields may be Democrats and Republican nominee, John McCain, I will stop giving them to companies stop discriminating against those wi administration will do all that we can do, in less trying and treat conditions that predominantly or exclusively affect women. And here the Veterans' service to one another, of leaving no one behind, , Todd and I met way back in high school, and I promised him a little surprise for the right to vote. I think as well today of two other women who came before me in national Guard, that's not why the

Snarf the starting code for this lab or browse the code directory.

Part I : Check your Understanding

These questions refer to the file short.txt which is stored as a string in the variable text:

The big dog ate food. The big dog died.

In an order-2 Markov process each two-letter sequence will be a key in a dictionary. The value associated with the key is a list of characters that follow that two-letter sequence in the file. For example, the two-letter sequence 'at' is followed by an 'e' once.

  1. There are four occurrences of 'g ' ('g'-space); what letters would be associated with that key?
    [                                               ]
  2. There are three occurrences of 'e ' ('e'-space); what letters would be associated with that key?
    [                                               ]
  3. Given a list of 3-letter sequences, how would you build the dictionary described above? Could you build it in such a way that it did not depend on the length of the letter sequences (i.e., so that it would work for 4-letter, 8-letter, or 10-letter sequences)?
      
      
      
      

In order to build the dictionary described above, you need a list of k-letter sequences. Describe the contents of the lists generated by the list comprehensions below.

  1. [ text[i:i+3] for i in range(len(text)-2) ]
      
      
      
  2. words = text.split()
     [ words[i:i+3] for i in range(len(words)-2) ]
      
      
      
  3. What needs to be changed in the code above to make it generate sequences of a different length?
      
      
  4. What needs to be changed in the first list comprehension above so that it generates a list of lists, as the second one does? This would be useful because the same algorithm could then be used to generate random text based on words instead of letters (i.e., a two-word sequence predicts the next word, with the keys and values being words instead of letters).
      
      

Part II : Complete the Code

Complete the code in Markov.py by filling in the functions:

Part III : Generalizing the Code

The process is controlled by the parameters to the main function. Explain the meaning of each of the parameter values in the following call:

main("short.txt", 3, 200, 10, getFileLetters)
  1. Change the order of the model to see if your code works for any order-k Markov model. If you get an error, explain what caused it and try to fix it.
  2. Change the function used in the call from getFileLetters to getFileWords to see if your code works for words as well as letters. If you get an error, explain what caused it and try to fix it.
  3. Generate some random text. What is your favorite? Read it aloud to your neighbors.