In this lab, you will write a program that reads one or more reference files and, based on the frequency and order in which characters appear in the given text, randomly generates new text that sounds like it might have been written by the author of the reference files.
The model used to generate the new random text is called an N-gram model, which assumes that the previous n characters largely determine the probabilities for the next character. For our purposes, we will use the most recently generated n characters to determine the next randomly generated character by looking at all occurrences of those n characters in the reference text.
Our algorithm starts by picking a random n-character substring as the seed that starts the random text generation. This is done simply by picking an n-character long string at random from the reference text. This initial n-character string is called the predictor in the code that follows. Then, every occurrence of the predictor string is located in the reference text. The character that follows each occurrence is a possible candidate to be the next randomly generated character.
An order-n Markov model uses strings of n letters, called n-grams, to predict text. An order-2 Markov model uses two-character strings, or bigrams, to calculate probabilities when generating random letters. For example, suppose that in the text we're using to generate random letters with an order-2 Markov model, the bigram "th" is followed 50 times by the letter 'e', 20 times by the letter 'a', and 30 times by the letter 'o', because the sequences "the", "tha", and "tho" occur 50, 20, and 30 times, respectively, while there are no other occurrences of "th" in the text we're modeling.
Now suppose that in generating random text we generate the bigram "th" and based on this we must generate the next random character using the order-2 model. The next letter will be an 'e' with a probability of 0.5 (50/100); will be an 'a' with probability 0.2 (20/100); and will be an 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next bigram used to calculate random letters will be "he" since the last part of the old bigram is combined with the new letter to create the next bigram used in the Markov process.
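To make the probability calculation concrete, the hypothetical snippet below (not part of the lab code) builds a list holding every observed follower of "th", with the 50/20/30 counts from the example, and picks one uniformly at random; picking uniformly from the full list automatically yields the 0.5, 0.2, and 0.3 probabilities.

    // Illustration only: choosing the character that follows the bigram "th"
    // by picking uniformly at random from a list of every observed follower.
    // The 50/20/30 counts are the hypothetical ones from the example above.
    import java.util.ArrayList;
    import java.util.Random;

    public class WeightedChoiceDemo {
        public static void main(String[] args) {
            ArrayList<Character> followers = new ArrayList<>();
            for (int i = 0; i < 50; i++) followers.add('e');   // "the" occurs 50 times
            for (int i = 0; i < 20; i++) followers.add('a');   // "tha" occurs 20 times
            for (int i = 0; i < 30; i++) followers.add('o');   // "tho" occurs 30 times

            Random rng = new Random();
            char next = followers.get(rng.nextInt(followers.size()));
            // next is 'e' with probability 0.5, 'a' with 0.2, and 'o' with 0.3
            System.out.println("next character: " + next);
        }
    }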
Here is pseudocode to generate random letters (and thus random text) using n-grams and a training text from which probabilities are calculated.
    predictor = random n-character substring from the training text   (the initial seed)
    repeat numLetters times to generate numLetters random letters:
        for each occurrence of predictor in the training text
            record the letter that follows that occurrence in a list
        choose a random element of the list as the generated letter C
        print or store C
        predictor = (last n-1 characters of predictor) + C

For example, suppose the training text is "bbbabbabbbbaba" and we're using a 3-gram to predict text.
The 3-letter string (3-gram) "bbb" occurs three times, twice followed by 'a' and once by 'b'. We represent this by saying that the 3-gram "bbb" is followed twice by "bba" and once by "bbb", since the next 3-gram in generating Markov-modeled random text is formed by taking the last two characters of the seed 3-gram "bbb", which are "bb", and adding the letter that follows the original 3-gram seed.
The 3-letter string "bba" occurs three times, each time followed by 'b'. The 3-letter string "bab" occurs three times, followed twice by 'b' and once by 'a'. However, we treat the original string/training-text as circular, i.e., the end of the string is followed by the beginning of the string. This means "bab" also occurs at the end of the string (last two characters followed by the first character) again followed by 'b'.
In processing the training text from left to right, we see the following transitions between 3-grams, starting with the left-most 3-gram "bbb":

bbb -> bba -> bab -> abb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> aba -> bab -> abb -> bbb

This can be represented as a map from each 3-gram to the 3-grams that follow it:

    3-gram    following 3-grams
    bbb       bba, bbb, bba
    bba       bab, bab, bab
    bab       abb, abb, aba, abb
    abb       bba, bbb, bbb
    aba       bab
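The lab code described later rescans the reference text on every step rather than precomputing a table, but as an illustration of the map above, a sketch that builds such a map for the training text (treating it as circular) might look like the following; the class and variable names here are made up for the example.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class GramMapDemo {
        public static void main(String[] args) {
            String text = "bbbabbabbbbaba";
            int n = 3;
            // Map each n-gram to the list of n-grams that follow it,
            // treating the training text as circular.
            Map<String, List<String>> followers = new TreeMap<>();
            String wrapped = text + text;   // simple way to handle wrap-around
            for (int i = 0; i < text.length(); i++) {
                String gram = wrapped.substring(i, i + n);
                String nextGram = wrapped.substring(i + 1, i + 1 + n);
                followers.computeIfAbsent(gram, k -> new ArrayList<>()).add(nextGram);
            }
            System.out.println(followers);
            // prints {aba=[bab], abb=[bba, bbb, bbb], bab=[abb, abb, aba, abb],
            //         bba=[bab, bab, bab], bbb=[bba, bbb, bba]}
        }
    }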
For another example, suppose for a 3-gram the predictor is "the" and the reference file is:

these are the authentic anthems of the ethernet, I hypothesize mathematically

Then the characters that follow the predictor are, in order (note the two space characters):

s nm rsm

One of these is chosen at random to be the next character in the generated text (notice that 's', ' ', and 'm' are the most likely since there are two occurrences of each, while 'n' and 'r' are less likely since there is only one occurrence of each). Then the predictor is updated to be the last n-1 characters of the old predictor followed by the chosen character. So, if 'm' is chosen in the example, the new predictor will be "hem" (which has following characters 's' and 'a').
This process is repeated to randomly generate a text of any given length that resembles the original reference text used as its source.
Sample output for different orders n, generated from the provided Obama, McCain, and Palin reference files:

Order-2
  Obama:  n rentrion't jobackeentry I knompin he and the in ard ould th cou'll spow. We forponowelves chat, agethilonow, age thembecal th. Ted judelisted we we all Hareaven the se a nuclas nom I've then no muc
  McCain: nd, al ditationg tationtionfits one the of minal orthe re VA lain Kostreed ve rave etens. I we Frans. Firces, me wore litypecest stareparefte and the millse wilignatte shosed make thend truslair comen
  Palin:  ur arever ser; a going therven cal camend weetbacker, antater. Tod Theld prom het younto tol; and the withar kidepe, whe just whe could the wout to be ounted we ations. Oileaterall cles it st ou, This

Order-3
  Obama:  y new dreams, have same time that prevel, I save of that eign purplummethe me from tough," to Democrations onces. I've have for our smally keeps talking American it. The new lege have by are future
  McCain: retary, in mand our VA heall you harity Counding in next continue that, in as of ther. I will the are to jointo elied to proverns a serve in a rought too gress well to essed. Pam elems will in an that
  Palin:  righter Sept. 11 of highbors allies righ school, to serving on a finisher in thank as Sen. He who he PTA and things as but to hear. Todd and polities safer is of Amercial elebrator troops and thing t

Order-4
  Obama:  nse on is at we his countries to progress for you'll investments across America, the United overty? Pull your own. Our grace the fuels. And just money work better an America's promise of go to resourc
  McCain: e left their cours extend upon the good. I belonger has the enterests of our respecial very invites are not force for peace, and firmly or she work of freedom the VA should make got lost conce introdu
  Palin:  ct a greats and to threat effort for the people who have places week public office and when Sen. Join ourse, Sen. McCain, for that I didn't just his opport forms, and the PTA and to just at all the ol

Order-5
  Obama:  y watched my life is just takes us from harm's way in theirs are differences; that the best hope that young veterans to take more accountry, and science, and more accompanies that we need to record's
  McCain: es than the anti-America receive benefit is not the people of Arizona, and their home are among those complaints about aiming a few expeditiously assess, and women of us returned from misrepresential
  Palin:  ough, he'll be honoring the way, Todd is only one expects us today. I never really set out of five of America, then we just your average "hockey mom" in American president of the United States. Well,

Order-8
  Obama:  gy independence on something firmer, and more honest in our battlefields may be Democrats and Republican nominee, John McCain, I will stop giving them to companies stop discriminating against those wi
  McCain: administration will do all that we can do, in less trying and treat conditions that predominantly or exclusively affect women. And here the Veterans' service to one another, of leaving no one behind,
  Palin:  , Todd and I met way back in high school, and I promised him a little surprise for the right to vote. I think as well today of two other women who came before me in national Guard, that's not why the
If you download the project labs/lab03, you will see a shell
that shows the basic algorithm described above, but that is missing three key
details. You are to complete and test the three methods of the class NGram
described
below that work together to generate random text. For fun, we have provided
a few large data files that have very characteristic styles.
Note that this program relies primarily on randomness to achieve an interesting result. Whenever randomness is involved, it is hard to verify that a program works as intended; thus, it is important to test each method you write individually. Only by testing each method separately can you have any confidence that the program as a whole works. To help you test your program, we have also provided the class NGramTester, which calls each method individually with specific parameter values and prints the results for you to inspect. You may also want to make some smaller reference files than those we provided to help verify that each method works as expected.
Complete the method getRandomSubstring of the class NGram that chooses a random substring of length subSize from the string s, both given as parameters. Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning a string of the correct length that occurs within the given string.
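If it helps to see the intended behavior, here is a minimal sketch of one way to write this method; the exact signature in the provided shell may differ (for example, the shell may supply a shared Random object rather than creating one inside the method).

    // Sketch only; assumes java.util.Random is imported.
    // Returns a random substring of s containing exactly subSize characters.
    public static String getRandomSubstring(String s, int subSize) {
        Random rng = new Random();
        // the last valid starting index is s.length() - subSize
        int start = rng.nextInt(s.length() - subSize + 1);
        return s.substring(start, start + subSize);
    }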
Complete the method getFollowingCharacters of the class NGram, which has been started for you. getFollowingCharacters should return all characters that immediately follow a given substring in a String. For example, getFollowingCharacters("bbbabbabbbbaba", "bab") should return "bbab".
The code should:
- find each occurrence of the string toFind in the string s
- append the character that immediately follows each occurrence of toFind to the string result (without any delimiter between the added characters, as in the example given above)
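A minimal sketch that follows these steps is shown below; it assumes the reference text is treated as circular, as in the "bbbabbabbbbaba" example, and the shell's actual parameter names may differ.

    // Sketch only. Returns every character that immediately follows toFind
    // in s, treating s as circular so an occurrence at the very end of s
    // wraps around to the beginning.
    public static String getFollowingCharacters(String s, String toFind) {
        String result = "";
        // appending the first few characters again handles the wrap-around
        String wrapped = s + s.substring(0, toFind.length());
        int pos = wrapped.indexOf(toFind);
        while (pos != -1 && pos < s.length()) {
            result += wrapped.charAt(pos + toFind.length());
            pos = wrapped.indexOf(toFind, pos + 1);
        }
        return result;
    }

With this sketch, getFollowingCharacters("bbbabbabbbbaba", "bab") returns "bbab", matching the example above.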
Once you are confident you have completed this method, test it by using the
class NGramTester
to verify that you are returning all of the
possible following characters given a variety of strings to find.
Complete the method makeNGram that is passed the reference text as one big string, a value for the n in N-gram, and the length of the text to generate randomly. The method returns a new string constructed based on N-gram frequencies. You should implement the body of the loop. You will need to call both of the methods you have written previously to find the possible next characters and then choose one randomly.
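As a sketch of how the pieces fit together (the shell you download already supplies most of this scaffolding, so the variable names and the handling of the Random object here are assumptions):

    // Sketch only; assumes java.util.Random is imported and that the two
    // methods above are in the same class.
    public static String makeNGram(String reference, int n, int numLetters) {
        Random rng = new Random();
        String predictor = getRandomSubstring(reference, n);   // the initial seed
        StringBuilder generated = new StringBuilder();
        for (int i = 0; i < numLetters; i++) {
            // every character that follows the current predictor in the reference text
            String follows = getFollowingCharacters(reference, predictor);
            char c = follows.charAt(rng.nextInt(follows.length()));
            generated.append(c);
            // slide the window: drop the first character, append the chosen one
            predictor = predictor.substring(1) + c;
        }
        return generated.toString();
    }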
Once you are confident you have completed this method, test it by using the class NGramTester to verify that you are returning the correct total number of characters and that the generated text resembles the reference text more closely as n increases.