The grammar to be processed by your program consists of a collection of definitions and rules for each definition. Each definition in the grammar is enclosed by curly-braces. The definition consists of its name, also called a non-terminal, followed by the collection of rules for that definition. Each rule can contain any number of words, or terminals, and non-terminals separated by spaces or new-lines; it ends with a semi-colon.
The examples on the main assignment page were generated from this grammar.
Some non-terminals, like <difficult>
and <status>
do not result in
more definitions being expanded since they contain only simple words. But others, like <description>
and <adjective>
, generate more
choices and texts since the rules associated with them
contain non-terminals.
For example, the second sentence on the main assignment page:
I finished working on this week's APT assignment. The problems were like trivial and Eclipse crashed .
was generated by expanding the <start>
definition as follows:
<start>
, so it is chosen for expansion.
<status>
" is a non-terminal, and is generated in the same manner that <start>
is currently being expanded:
<description>
" is a non-terminal, and is generated in the same manner that <start>
is currently being expanded:
<adjective>
" is a non-terminal, and is generated in the same manner that <start>
and <description>
are currently being expanded:
<difficult>
" is a non-terminal, and is generated in the same manner that <start>
and <description>
are currently being expanded:
<excuse>
"is a non-terminal, and is generated in the same manner that <start>
and <description>
are currently being expanded:
By examining the randomly generated examples you can see how
sometimes a string of adjectives is generated, e.g., like, really, really, so, unbelievably. In theory the length
of this sequence of adjectives, generated by repeatedly choosing
the last of the rules for the non-terminal <adjective>
, could be arbitrarily long, but in
practice choosing this rule happens with probability 0.2 (1/5) so
choosing it repeatedly isn't too likely.
One of the most important things in Computer Science is being able to transform specially formatted files, from simple ones like CSV files to complex ones like grammars or web pages, into structured collections within your program, from simple lists to complex ones like lists of lists of values or dictionaries. The code given to you for this project does just that: it converts a formatted grammar file into a dictionary where the key is a string, the non-terminal, and the value associated with the key is a list of the rules that can be used when the non-terminal is expanded.
When run by itself, the module rsgModel
, loads the example grammar file, apt-issues.g, and prints out its keys and values for your inspection. You should study the output of that module when it is
loaded to make sure you understand how it is organized. By looking at the output
and the code together, you should be able to tell which strings that are
printed represent the
different parts of a grammar. For example, loading
the grammar file generates
output that includes the following.
You can see that the non-terminal <status>
has four rules
with a varying number of words in each rule. Because there can be several rules, the dictionary uses a list of lists of strings to make it easy to process each word (in case one might be a non-terminal).
Random text is generated
beginning with the non-terminal <start>
as it is meant to permit access to all the other definitions.
Using the dictionary representing the grammar, generate random text by finding all
the rules associated with the key <start>
and choosing one of
these rules at
random. Then loop over each string in the chosen rule and either
"print" the string if its a terminal,
or use the string to recursively repeat the process
if the string starts with a < symbol indicating it is a non-terminal
and should be the
source to generate random-text. Thus the dictionary has
rules as the value for corresponding the string/non-terminal
key in the map.
You will need to complete the function generate
that accesses the dictionary for the given non-terminal parameter,
finding its collection of rules,
then chooses one rule
at random, and finally loops over the strings in that rule "printing" or
generating text from them in the case the word is a non-terminal, e.g., by adding it to a list to return or calling generate
recursively. Note, because generate
returns a simple list of strings representing the final text, you will need to be consistent in how combine these different steps into your main list to return from each call: in the base case, it is a simple string, and in the general case, it is a list of strings.
Instead of
printing the text directly during generation, you will build a list
of strings to return. Thus when the call to generate
ends, you can loop over the list of
generated words and print them in a variety of ways that would be difficult while you were also generating new text. That is the purpose of the function pretty_print
. Currently, the function does not live up to its name since it simply prints the text one one line with every "word" separated by spaces.
For extra credit, modify this function to keep each line that is printed under 50 characters. To do that, keep track of the length of the line being printed and if the next word to be printed will make the line length go over 50 characters, print a new line instead of a space and reset the line length. Do not actually split any words across lines like some texts do, just make sure that the words printed on a single line do not total to more than 50 characters (including spaces).
Additionally, it looks strange to see stray punctuation characters separated by spaces like they were complete words. When printing, you should only print a space only if the word is actually a word (i.e., is longer than one character or actually contains alphabetic characters).
For example, the fourth sentence on the main assignment page, after pretty printing, would look like:
I gave up working on this week's APT assignment.
The problems were so, like easy and I had a
midterm.
Note, the comma after the word "so" and the period after the word "midterm" are no longer separated by a space. Also, midterm appears on the last line by itself because printing on the previous line would have made that line's length 52 characters.
To choose a random item from a list, use the function random.choice()
which selects an item randomly based on the length of the collection.
To help
with debugging, you may want to set the state of the random number generator at the top of your module, using the function call random.seed(1234)
.
This way the same sequence of random numbers is generated each time your
program generate random text. This can help debug the program if you
have an error that is hard to replicate because your program
depends on the random numbers being
generated.
To allow your program to use the grammars submitted by your classmates, you will need to change the last line in the rsgRunner
module to the following:
generate_sentences(ROOT_WEB, NUM_TO_GENERATE)
(i.e., change the first parameter from ROOT_DIR
to ROOT_WEB
).