Compsci 6/101, Spring 2011, Lab 13

Handin with Questions about the lab. Snarf code or browse the code directory for Python starting code and free (speech and beer) texts including a copy of Little Brother (by Cory Doctorow) as well as other texts to start the text generation.

Andrey Markov on Wikipedia.

The picture shows Andrey Markov in many ways the originator of stochastic processes which are typically known as Markov Chains. In this lab you'll create a program that uses a Markov Chain to model random paragraphs/sentences. The program you write will use an algorithm made "famous" as Mark V Shaney, you can find references to it on Wikipedia and an online version of the program here as well.

Although in the lab you'll generate random sentences using a Markov process one of the common uses of Markov processes uses this method in reverse: attempting to identify an unknown source using Markov models -- this is common in computational biology, for example.

In lab you will generate random text based on a training text. First you'll use letter-sequences to predict/generate a letter, then (if time) you'll use word sequences to predict a word.

We'll use this text as an example, but in lab you'll have text files that are more fun than this one. The text is:

  The big dog ate food. The big dog died.

We'll use a Markov process to demonstrate the idea and to demonstrate the lab: use two-letters to predict the third (this is an order-2 process). If we break the text into three-letter substrings we get this list of triple substrings.

   ['The', 'he ', 'e b', ' bi', 'big', 'ig ', 'g d', ' do' ... ]

We then treat each these triples as a two-letter predicting sequence. This two-letter sequence generates a new letter by choosing one of the letters that follows the two-letter sequence. For example, in the training text the two-letter sequence ' d' (space-'d') is followed by 'o' (big dog) twice and by 'i' once (dog died). This means that the two-letter sequence ' d' (space-'d') will generate or predict one of ['o','o','i'] and it's more likely, two-thirds to one-third, to generate an 'o' than an 'i'.

Using an order-3 process would use 3-letters to predict the fourth, and so on. To see how much more English-like a higher-order process is consider the examples below which come from 2008 campain speeches.

Order-k Obama McCain Palin
2 n rentrion't jobackeentry I knompin he and the in ard ould th cou'll spow. We forponowelves chat, agethilonow, age thembecal th. Ted judelisted we we all Hareaven the se a nuclas nom I've then no muc nd, al ditationg tationtionfits one the of minal orthe re VA lain Kostreed ve rave etens. I we Frans. Firces, me wore litypecest stareparefte and the millse wilignatte shosed make thend truslair comen ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld prom het younto tol; and the withar kidepe, whe just whe could the wout to be ounted we ations. Oileaterall cles it st ou, This
3 y new dreams, have same time that prevel, I save of that eign purplummethe me from tough," to Democrations onces. I've have for our smally keeps talking American it. The new lege have by are future retary, in mand our VA heall you harity Counding in next continue that, in as of ther. I will the are to jointo elied to proverns a serve in a rought too gress well to essed. Pam elems will in an that righter Sept. 11 of highbors allies righ school, to serving on a finisher in thank as Sen. He who he PTA and things as but to hear. Todd and polities safer is of Amercial elebrator troops and thing t
4 nse on is at we his countries to progress for you'll investments across America, the United overty? Pull your own. Our grace the fuels. And just money work better an America's promise of go to resourc e left their cours extend upon the good. I belonger has the enterests of our respecial very invites are not force for peace, and firmly or she work of freedom the VA should make got lost conce introdu ct a greats and to threat effort for the people who have places week public office and when Sen. Join ourse, Sen. McCain, for that I didn't just his opport forms, and the PTA and to just at all the ol
5 y watched my life is just takes us from harm's way in theirs are differences; that the best hope that young veterans to take more accountry, and science, and more accompanies that we need to record's es than the anti-America receive benefit is not the people of Arizona, and their home are among those complaints about aiming a few expeditiously assess, and women of us returned from misrepresential ough, he'll be honoring the way, Todd is only one expects us today. I never really set out of five of America, then we just your average "hockey mom" in American president of the United States. Well,
8 gy independence on something firmer, and more honest in our battlefields may be Democrats and Republican nominee, John McCain, I will stop giving them to companies stop discriminating against those wi administration will do all that we can do, in less trying and treat conditions that predominantly or exclusively affect women. And here the Veterans' service to one another, of leaving no one behind, , Todd and I met way back in high school, and I promised him a little surprise for the right to vote. I think as well today of two other women who came before me in national Guard, that's not why the

Part I

Complete the first part of the handin questions before using the concepts from those part to write the code below.

Complete the code you snarfed in Markov.py so that the dictionary for an order-2 Markov process is created. See the handin for details. This means you should complete the functions make_substrings and make_dictionary (the latter is described in the handin document).

You should print your dictionary to be sure it's "right". You can do this using short.txt and verify by hand.

Part II

To generate random text for the order-2 process you do two things.

Part III

Generalize from order-2 Markov to order-K Markov, see the handin document for details. Then generate and print random text. What's your favorite text. Read it aloud to your neighbors.