Handin with Questions about the lab. Snarf code or browse the code directory for Python starting code and free (speech and beer) texts including a copy of Little Brother (by Cory Doctorow) as well as other texts to start the text generation.
Andrey Markov on Wikipedia.
The picture shows Andrey Markov in many ways the originator of stochastic processes which are typically known as Markov Chains. In this lab you'll create a program that uses a Markov Chain to model random paragraphs/sentences. The program you write will use an algorithm made "famous" as Mark V Shaney, you can find references to it on Wikipedia and an online version of the program here as well.
Although in the lab you'll generate random sentences using a Markov process one of the common uses of Markov processes uses this method in reverse: attempting to identify an unknown source using Markov models -- this is common in computational biology, for example.
In lab you will generate random text based on a training text. First you'll use letter-sequences to predict/generate a letter, then (if time) you'll use word sequences to predict a word.
We'll use this text as an example, but in lab you'll have text files that are more fun than this one. The text is:
The big dog ate food. The big dog died.
We'll use a Markov process to demonstrate the idea and to demonstrate the lab: use two-letters to predict the third (this is an order-2 process). If we break the text into three-letter substrings we get this list of triple substrings.
['The', 'he ', 'e b', ' bi', 'big', 'ig ', 'g d', ' do' ... ]
We then treat each these triples as a two-letter predicting sequence. This two-letter sequence generates a new letter by choosing one of the letters that follows the two-letter sequence. For example, in the training text the two-letter sequence ' d' (space-'d') is followed by 'o' (big dog) twice and by 'i' once (dog died). This means that the two-letter sequence ' d' (space-'d') will generate or predict one of ['o','o','i'] and it's more likely, two-thirds to one-third, to generate an 'o' than an 'i'.
Using an order-3 process would use 3-letters to predict the fourth, and so on. To see how much more English-like a higher-order process is consider the examples below which come from 2008 campain speeches.
Order-k
Obama
| McCain
| Palin
|
2
|
n rentrion't jobackeentry I knompin he and the in ard ould th cou'll
spow. We forponowelves chat, agethilonow, age thembecal th. Ted
judelisted we we all Hareaven the se a nuclas nom I've then no muc
|
nd, al ditationg tationtionfits one the of minal orthe re VA lain
Kostreed ve rave etens. I we Frans. Firces, me wore litypecest
stareparefte and the millse wilignatte shosed make thend truslair comen
|
ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld
prom het younto tol; and the withar kidepe, whe just whe could the wout
to be ounted we ations. Oileaterall cles it st ou, This
|
3
|
y new dreams, have same time that prevel, I save of that eign
purplummethe me from tough," to Democrations onces.
I've have for our smally keeps talking American it.
The new lege have by are future
|
retary, in mand our VA heall you harity Counding in next continue that,
in as of ther. I will the are to jointo elied to proverns a serve in a
rought too gress well to essed. Pam elems will in an that
|
righter Sept. 11 of highbors allies righ school, to serving on a
finisher in thank as Sen. He who he PTA and things as but to hear. Todd
and polities safer is of Amercial elebrator troops and thing t
|
4
|
nse on is at we his countries to progress for you'll investments across
America, the United overty? Pull your own. Our grace the fuels. And just
money work better an America's promise of go to resourc
|
e left their cours extend upon the good. I belonger has the enterests of
our respecial very invites are not force for peace, and firmly or she
work of freedom the VA should make got lost conce introdu
|
ct a greats and to threat effort for the people who have places week
public office and when Sen. Join ourse, Sen. McCain, for that I didn't
just his opport forms, and the PTA and to just at all the ol
|
5
|
y watched my life is just takes us from harm's way in theirs are
differences; that the best hope that young veterans to take more
accountry, and science, and more accompanies that we need to record's
|
es than the anti-America receive benefit is not the people of Arizona,
and their home are among those complaints about aiming a few
expeditiously assess, and women of us returned from misrepresential
|
ough, he'll be honoring the way, Todd is only one expects us today. I
never really set out of five of America, then we just your average
"hockey mom" in American president of the United States.
Well,
|
8
|
gy independence on something firmer, and more honest in our battlefields
may be Democrats and Republican nominee, John McCain, I will stop giving
them to companies stop discriminating against those wi
|
administration will do all that we can do, in less trying and treat
conditions that predominantly or exclusively affect women. And here the
Veterans' service to one another, of leaving no one behind,
|
, Todd and I met way back in high school, and I promised him a little
surprise for the right to vote. I think as well today of two other
women who came before me in national Guard, that's not why the
| |
---|
Complete the code you snarfed in Markov.py
so that the dictionary for an order-2 Markov
process is created. See the
handin for details. This
means you should complete the functions
make_substrings
and
make_dictionary
(the latter is described in the
handin document).
You should print your dictionary to be sure it's "right". You can do this using short.txt and verify by hand.
random.choice
to
select a random key from the
dictionary. Suppose this choice is '
d' (space-'d'). Call this the seed.
generate_text
.
See the handin document for details, complete and run the program using short.txt an little-brother.txt as examples.