CompSci 06/101, Fall 2011, RSG HOWTO

Grammar Format

The grammar to be processed by your program consists of a collection of definitions and rules for each definition. Each definition in the grammar is enclosed by curly-braces. The definition consists of its name, also called a non-terminal, followed by the collection of rules for that definition. Each rule can contain any number of words, or terminals, and non-terminals separated by spaces or new-lines; it ends with a semi-colon.

The examples on the main assignment page were generated from this grammar.

{ <start> I <status> working on this week's APT assignment. The problems <description> . ; } { <status> gave up ; finished ; am still ; can't believe I haven't started ; } { <description> were <adjective> <difficult> ; were <adjective> <difficult> and <excuse> ; } { <adjective> really ; unbelievably ; so ; like ; <adjective> , <adjective> ; } { <difficult> hard ; impossible ; easy ; trivial ; } { <excuse> I had a midterm ; I got h1n1 ; I couldn't find my computer ; Eclipse crashed ; <excuse> and <excuse> ; }

Some non-terminals, like <difficult> and <status> do not result in more definitions being expanded since they contain only simple words. But others, like <description> and <adjective>, generate more choices and texts since the rules associated with them contain non-terminals.

For example, the second sentence on the main assignment page:

I finished working on this week's APT assignment. The problems were like trivial and Eclipse crashed .

was generated by expanding the <start> definition as follows:

  1. There is only one rule associated with the definition of <start>, so it is chosen for expansion.
    1. "I" is a terminal, simple word, so it must be part of the final text generated.
    2. "<status>" is a non-terminal, and is generated in the same manner that <start> is currently being expanded:
      1. A rule is chosen randomly, in this case, the second one.
      2. "finished" is a terminal, so no further expansion is done.
    3. The words "working on this week's APT assignment. The problems" are all terminals, and so do not need to be expanded further.
    4. "<description>" is a non-terminal, and is generated in the same manner that <start> is currently being expanded:
      1. A rule is chosen randomly, in this case, the second one (yes that is possible :).
      2. "were" is a terminal, so no further expansion is done.
      3. " <adjective>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
        1. A rule is chosen randomly, in this case, the fourth one.
        2. "like" is a terminal, so no further expansion is done.
      4. " <difficult>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
        1. A rule is chosen randomly, in this case, the fourth one.
        2. "trivial" is a terminal, so no further expansion is done.
      5. "and" is a terminal, so no further expansion is done.
      6. "<excuse>"is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
        1. A rule is chosen randomly, in this case, the fourth one (yes that is still possible :).
        2. "trivial" is a terminal, so no further expansion is done.

By examining the randomly generated examples you can see how sometimes a string of adjectives is generated, e.g., like, really, really, so, unbelievably. In theory the length of this sequence of adjectives, generated by repeatedly choosing the last of the rules for the non-terminal <adjective>, could be arbitrarily long, but in practice choosing this rule happens with probability 0.2 (1/5) so choosing it repeatedly isn't too likely.

Grammar Data Structure

One of the most important things in Computer Science is being able to transform specially formatted files, from simple ones like CSV files to complex ones like grammars or web pages, into structured collections within your program, from simple lists to complex ones like lists of lists of values or dictionaries. The code given to you for this project does just that: it converts a formatted grammar file into a dictionary where the key is a string, the non-terminal, and the value associated with the key is a list of the rules that can be used when the non-terminal is expanded.

When run by itself, the module rsgModel, loads the example grammar file, apt-issues.g, and prints out its keys and values for your inspection. You should study the output of that module when it is loaded to make sure you understand how it is organized. By looking at the output and the code together, you should be able to tell which strings that are printed represent the different parts of a grammar. For example, loading the grammar file generates output that includes the following.

<status> ['gave', 'up'] ['finished'] ['am', 'still'] ["can't", 'believe', 'I', "haven't", 'started']

You can see that the non-terminal <status> has four rules with a varying number of words in each rule. Because there can be several rules, the dictionary uses a list of lists of strings to make it easy to process each word (in case one might be a non-terminal).

Generating Random Text Using the Grammar

Random text is generated beginning with the non-terminal <start> as it is meant to permit access to all the other definitions.

Using the dictionary representing the grammar, generate random text by finding all the rules associated with the key <start> and choosing one of these rules at random. Then loop over each string in the chosen rule and either "print" the string if its a terminal, or use the string to recursively repeat the process if the string starts with a < symbol indicating it is a non-terminal and should be the source to generate random-text. Thus the dictionary has rules as the value for corresponding the string/non-terminal key in the map.

You will need to complete the function generate that accesses the dictionary for the given non-terminal parameter, finding its collection of rules, then chooses one rule at random, and finally loops over the strings in that rule "printing" or generating text from them in the case the word is a non-terminal, e.g., by adding it to a list to return or calling generate recursively. Note, because generate returns a simple list of strings representing the final text, you will need to be consistent in how combine these different steps into your main list to return from each call: in the base case, it is a simple string, and in the general case, it is a list of strings.

Printing the Resulting Text

Instead of printing the text directly during generation, you will build a list of strings to return. Thus when the call to generate ends, you can loop over the list of generated words and print them in a variety of ways that would be difficult while you were also generating new text. That is the purpose of the function pretty_print. Currently, the function does not live up to its name since it simply prints the text one one line with every "word" separated by spaces.

For extra credit, modify this function to keep each line that is printed under 50 characters. To do that, keep track of the length of the line being printed and if the next word to be printed will make the line length go over 50 characters, print a new line instead of a space and reset the line length. Do not actually split any words across lines like some texts do, just make sure that the words printed on a single line do not total to more than 50 characters (including spaces).

Additionally, it looks strange to see stray punctuation characters separated by spaces like they were complete words. When printing, you should only print a space only if the word is actually a word (i.e., is longer than one character or actually contains alphabetic characters).

For example, the fourth sentence on the main assignment page, after pretty printing, would look like:

I gave up working on this week's APT assignment.
The problems were so, like easy and I had a
midterm.

Note, the comma after the word "so" and the period after the word "midterm" are no longer separated by a space. Also, midterm appears on the last line by itself because printing on the previous line would have made that line's length 52 characters.

Using the Random module

To choose a random item from a list, use the function random.choice() which selects an item randomly based on the length of the collection.

To help with debugging, you may want to set the state of the random number generator at the top of your module, using the function call random.seed(1234). This way the same sequence of random numbers is generated each time your program generate random text. This can help debug the program if you have an error that is hard to replicate because your program depends on the random numbers being generated.

Getting Grammars from the Web

To allow your program to use the grammars submitted by your classmates, you will need to change the last line in the rsgRunner module to the following:

generate_sentences(ROOT_WEB, NUM_TO_GENERATE)

(i.e., change the first parameter from ROOT_DIR to ROOT_WEB).