CompSci 101 : Fall 2013

Grammars are a set of rules used in recognizing and generating text (or music or art or ...). In this lab you will use context free grammars to generate (hopefully amusing) random text.

There are many examples of randomly-generated text, graphics, art, and so on. The ones referenced here all use context free grammars. The AD Generator combines slogans with Flickr images to create random ads built on real slogans. The famous SCIgen project generates random computer science papers including those that were actually accepted for publication, albeit in shady conference venues. The site has videos and a complete description of the history of SCIgen.

Context Free Art includes information on generating different images using grammars and computerized drawing.

The Random Sentence Generator was one of the original (1999) SIGCSE/Nifty Assignments. This current version uses regular expressions for parsing and so is more straightforward than the original.

For example, here are some Duke Compsci excuses generated by this grammar.

I can't believe I haven't started working on this week's APT assignment. The problems were unbelievably hard and I couldn't find my computer .
I finished working on this week's APT assignment. The problems were like trivial and Eclipse crashed .
I gave up working on this week's APT assignment. The problems were really , really , so impossible and I got unbelievably sick .
I gave up working on this week's APT assignment. The problems were so , like easy and I had a midterm .
I finished working on this week's APT assignment. The problems were like trivial .

Specifications

You will do three things for this lab — but please read the section below on grammars before starting.

Create your own grammar to get an understanding of how the grammar is organized and formatted.
Answer questions about regular expressions in the context of parsing the grammars. These questions are designed to help you understand how the grammar is parsed and stored as a data structure within Python.
Update the given module, rsgModel, to generate random sentences/texts based on a given grammar by completing the function generate.

You can snarf the starting files for this assignment or view them here.

Grammar Background

The format of the grammar used in this assignment is described briefly here, but you can reason by example from the apt-issues.g file or by browsing submitted grammars what the grammar looks like.

A grammar processed by your program consists of a collection of definitions and rules for each definition.

Each definition is enclosed by curly-braces. The curly-braces help when reading/parsing the definition.
The definition consists of the name of the non-terminal being defined followed by the rules for that definition.
The non-terminal is enclosed by angle-brackets < and > --- this helps in reading/parsing the non-terminal.
Each rule in a definition is separated from other rules by a semi-colon, this helps in reading/parsing as well.

Random text is always generated beginning with the non-terminal <start> as can be seen in the examples shown above generated by this grammar.

Some non-terminals, like <difficult> and <status> do not result in more rules/definitions being chosen. But the others do generate more choices and texts since the rules associated with the definitions also have non-terminals in them.

By examining the randomly generated examples you can see how sometimes a string of adjectives is generated, e.g., like, really, really, so, unbelievably. In theory the length of this sequence of adjectives, generated by repeatedly choosing the last of the rules for the non-terminal <adjective>, could be arbitrarily long, but in practice choosing this rule happens with probability 0.2 (1/5) so choosing it repeatedly is not too likely.

Consider this example, we will walk through how it is generated.

There is only one rule associated with the definition of <start>, so it is chosen for expansion. The rule is:

Expanding means looping over each "word" and expanding the word.
- "I" is a terminal, simple word, so it must be part of the final text generated.
- "<status>" is a non-terminal, and is generated in the same manner that <start> is currently being expanded:
  - A rule for <status> is chosen randomly, in this case, the second one.
  - Since the chose rule is "finished" and it is a terminal, no further expansion is done.
- The words "working on this week's APT assignment. The problems" are all terminals, and so do not need to be expanded further.
- "<description>" is a non-terminal, and is generated in the same manner that <start> is currently being expanded:
  - A rule for <description> is chosen randomly, in this case, the second one. The rule chosen is
  - "were" is a terminal, so no further expansion is done.
  - " <adjective>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
    - A rule is chosen randomly, in this case, the fourth one.
    - "like" is a terminal, so no further expansion is done.
  - " <difficult>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
    - A rule is chosen randomly, in this case, the fourth one.
    - "trivial" is a terminal, so no further expansion is done.
  - "and" is a terminal, so no further expansion is done.
  - "<excuse>"is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
    - A rule is chosen randomly, in this case, the fourth one (yes that is still possible :).
    - "trivial" is a terminal, so no further expansion is done.

Grammar Data Structure

One of the most important things in Computer Science is being able to transform specially formatted files, from simple ones like CSV files to complex ones like grammars or web pages, into structured collections within your program, from simple lists to complex ones like lists of lists of values or dictionaries. The code given to you for this project does just that: it converts a formatted grammar file into a dictionary where the key is a string, the non-terminal, and the value associated with the key is a list of the rules that can be used when the non-terminal is expanded.

When run by itself, the module rsgModel, loads the example grammar file, apt-issues.g, and prints out its keys and values for your inspection. You should study the output of that module when it is loaded to make sure you understand how it is organized.

Internally a dictionary is used in which the key is a non-terminal and the value corresponding to the non-terminal is a list of rules, each rule is a list. For example, loading the grammar file generates this dictionary entry:

You can see that the non-terminal <status> has four rules with a varying number of words in each rule. Because there can be several rules, the dictionary uses a list of lists of strings to make it easy to process each word (in case one might be a non-terminal).