Compsci 101, Fall 2012, Lab 12: RSG

You should snarf the lab12 files for this lab or browse the code directory for code files provided.

A grammar is a set of rules used in recognizing and generating text (or music or art or ...). In this lab you'll use context-free grammars to generate (hopefully amusing) random text.

There are many examples of randomly generated text, graphics, art, and so on. The ones referenced here all use context-free grammars like those you'll be using in this assignment. The AD Generator combines slogans with Flickr images to create random ads built on real slogans. The famous SCIgen project generates random computer science papers, including some that were actually accepted for publication, albeit in shady conference venues. The site has videos and a complete description of the history of SCIgen.

Context Free Art includes information on generating different images using grammars and computerized drawing. Grammars known as L-systems have also successfully modeled plant formation.

The Random Sentence Generator was one of the original (1999) SIGCSE/Nifty Assignments. This current version uses regular expressions for parsing and so is more straightforward than the original.

For example, here are some Duke Compsci excuses generated by the grammar shown in the Grammar Background section below.

I can't believe I haven't started working on this week's APT assignment. The problems were unbelievably hard and I couldn't find my computer .
I finished working on this week's APT assignment. The problems were like trivial and Eclipse crashed .
I gave up working on this week's APT assignment. The problems were really , really , so impossible and I got unbelievably sick .
I gave up working on this week's APT assignment. The problems were so , like easy and I had a midterm .
I finished working on this week's APT assignment. The problems were like trivial .

You'll do three things for this lab, but please read the Grammar Background section that follows them before you start.

1. Create a grammar and upload it to the Compsci101/Fall 12 grammar website so that everyone in the class can use it. The class will vote on the best grammar in several categories, with extra points for top-placing grammars. You submit the grammar online using the grammar website, where you can also browse the grammars. You can browse the Fall 10 grammars or the Spring 11 grammars as well.
2. Answer questions about regular expressions in the context of parsing the grammars (see the handin pages). These questions are designed to help you understand something about regular expressions and something about the program that generates stories, excuses, etc.
3. Answer questions about using URLs and files to read grammars (see the handin pages).

First answer the questions on the handin pages, then create and upload a grammar. We'll vote on grammars next week and you'll modify the RSG program as part of the next assignment. You might want to read about grammars below first to understand the terminology used.

After you upload your grammar to the course website, verify that it is there by loading it via the URLreader.py module.

Grammar Background

The format of the grammar used in this assignment is described briefly here, but you can also see what a grammar looks like by reasoning from the example in the apt-issues.g file or by browsing submitted grammars.

A grammar processed by your program consists of a collection of definitions and rules for each definition.

- Each definition is enclosed by curly-braces. The curly-braces help when reading/parsing the definition.
- The definition consists of the non-terminal being defined followed by the rules for that definition.
- The non-terminal is enclosed by angle-brackets < and > --- this helps in reading/parsing the non-terminal.
- Each rule in a definition is separated from other rules by a semi-colon; this helps in reading/parsing as well.

Random text is always generated beginning with the non-terminal <start>, as can be seen in the examples shown above, which were generated by this grammar:

{ <start> I <status> working on this week's APT assignment. The problems <description> .; }
{ <status> gave up ; finished ; am still ; can't believe I haven't started ; }
{ <description> were <adjective> <difficult> ; were <adjective> <difficult> and <excuse>; }
{ <adjective> really ; unbelievably ; so ; like ; <adjective> , <adjective> ; }
{ <difficult> hard ; impossible ; easy ; trivial ; }
{ <excuse> I had a midterm ; I got <adjective> sick ; I couldn't find my computer ; Eclipse crashed ; <excuse> and <excuse> ; }

Some non-terminals, like <difficult> and <status>, don't result in more rules/definitions being chosen, because their rules contain only terminals. The others do generate more choices and text, since the rules associated with their definitions also have non-terminals in them.

By examining the randomly generated examples you can see how sometimes a string of adjectives is generated, e.g., like, really, really, so, unbelievably. In theory the length of this sequence of adjectives, generated by repeatedly choosing the last of the rules for the non-terminal <adjective>, could be arbitrarily long, but in practice choosing this rule happens with probability 0.2 (1/5) so choosing it repeatedly isn't too likely.
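The arithmetic behind "isn't too likely" can be checked with a few lines of Python. This is only a sketch: the names P_RECURSIVE and chance_of_run are made up for illustration, and it assumes each of the five <adjective> rules is equally likely every time a rule is chosen.

```python
from fractions import Fraction

# Assumption: each of the five <adjective> rules is equally likely.
P_RECURSIVE = Fraction(1, 5)

def chance_of_run(k):
    """Probability that k choices in a row all pick the recursive rule."""
    return P_RECURSIVE ** k

for k in range(1, 5):
    print(k, chance_of_run(k))

# Each expansion yields one adjective with probability 4/5, or two more
# expansions with probability 1/5, so E = 4/5 + (2/5) * E.
expected_adjectives = (1 - P_RECURSIVE) / (1 - 2 * P_RECURSIVE)
print(expected_adjectives)
```

Under that assumption, picking the recursive rule three times in a row happens with probability 1/125, and a single <adjective> expands to 4/3 adjectives on average, so long chains are possible but rare.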

Consider this example; we'll walk through how it's generated.

 I finished working on this week's APT assignment. The problems were
 like trivial and Eclipse crashed .

There is only one rule associated with the definition of <start>, so it is chosen for expansion. The rule is: I <status> working on this week's APT assignment. The problems <description> .;
Expanding means looping over each "word" and expanding the word.
- "I" is a terminal, a simple word, so it becomes part of the final text generated.
- "<status>" is a non-terminal, and is generated in the same manner that <start> is currently being expanded:
  - A rule for <status> is chosen randomly, in this case, the second one.
  - Since the chosen rule is "finished" and it is a terminal, no further expansion is done.
- The words "working on this week's APT assignment. The problems" are all terminals, and so do not need to be expanded further.
- "<description>" is a non-terminal, and is generated in the same manner that <start> is currently being expanded:
  - A rule for <description> is chosen randomly, in this case, the second one (yes that is possible :). The rule chosen is: were <adjective> <difficult> and <excuse>
  - "were" is a terminal, so no further expansion is done.
  - "<adjective>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
    - A rule is chosen randomly, in this case, the fourth one.
    - "like" is a terminal, so no further expansion is done.
  - "<difficult>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
    - A rule is chosen randomly, in this case, the fourth one.
    - "trivial" is a terminal, so no further expansion is done.
  - "and" is a terminal, so no further expansion is done.
  - "<excuse>" is a non-terminal, and is generated in the same manner that <start> and <description> are currently being expanded:
    - A rule is chosen randomly, in this case, the fourth one (yes that is still possible :).
    - "Eclipse" and "crashed" are terminals, so no further expansion is done.
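
The walkthrough above can be sketched as a short recursive function. This is a minimal illustration and not the code in rsgModel: the names GRAMMAR, expand, and scripted are hypothetical, and the dictionary is typed in by hand from the grammar shown earlier. The scripted helper exists only so the walkthrough's choices can be replayed instead of being picked at random.

```python
import random

# Hand-typed copy of the APT excuse grammar: each non-terminal maps to a
# list of rules, and each rule is a list of words.
GRAMMAR = {
    "<start>": [["I", "<status>", "working", "on", "this", "week's", "APT",
                 "assignment.", "The", "problems", "<description>", "."]],
    "<status>": [["gave", "up"], ["finished"], ["am", "still"],
                 ["can't", "believe", "I", "haven't", "started"]],
    "<description>": [["were", "<adjective>", "<difficult>"],
                      ["were", "<adjective>", "<difficult>", "and", "<excuse>"]],
    "<adjective>": [["really"], ["unbelievably"], ["so"], ["like"],
                    ["<adjective>", ",", "<adjective>"]],
    "<difficult>": [["hard"], ["impossible"], ["easy"], ["trivial"]],
    "<excuse>": [["I", "had", "a", "midterm"],
                 ["I", "got", "<adjective>", "sick"],
                 ["I", "couldn't", "find", "my", "computer"],
                 ["Eclipse", "crashed"],
                 ["<excuse>", "and", "<excuse>"]],
}

def expand(symbol, grammar, choose=random.choice):
    """Return the list of terminal words generated from symbol."""
    if symbol not in grammar:          # a terminal: keep the word as-is
        return [symbol]
    rule = choose(grammar[symbol])     # pick one rule for the non-terminal
    words = []
    for word in rule:                  # expand each word of the rule in order
        words.extend(expand(word, grammar, choose))
    return words

def scripted(indexes):
    """Build a chooser that replays fixed 0-based rule indexes (for demos)."""
    remaining = iter(indexes)
    return lambda rules: rules[next(remaining)]

# Replay the walkthrough: <start> rule 1, <status> rule 2, <description>
# rule 2, then the fourth rule for <adjective>, <difficult> and <excuse>.
replay = scripted([0, 1, 1, 3, 3, 3])
print(" ".join(expand("<start>", GRAMMAR, replay)))
```

Calling expand("<start>", GRAMMAR) without the third argument falls back to random.choice, which picks each rule with equal probability and so produces a different excuse on most runs.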

Grammar Data Structure

One of the most important things in Computer Science is being able to transform specially formatted files, from simple ones like CSV files to complex ones like grammars or web pages, into structured collections within your program, from simple lists to complex ones like lists of lists of values or dictionaries. The code given to you for this project does just that: it converts a formatted grammar file into a dictionary where the key is a string, the non-terminal, and the value associated with the key is a list of the rules that can be used when the non-terminal is expanded.
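
To make the conversion concrete, here is one way a grammar in the format described above could be turned into such a dictionary with a regular expression. It is a sketch under assumptions, not the code provided for this lab: the function name parse_grammar and the DEFINITION pattern are invented for illustration, and the sketch assumes that terminals never contain a semi-colon or a closing curly-brace.

```python
import re

# Assumed pattern for one definition: "{", a non-terminal in angle
# brackets, then everything up to the next "}". The course code may differ.
DEFINITION = re.compile(r"\{\s*(<[^>]+>)(.*?)\}", re.DOTALL)

def parse_grammar(text):
    """Map each non-terminal to a list of rules; each rule is a list of words."""
    grammar = {}
    for match in DEFINITION.finditer(text):
        nonterminal, body = match.group(1), match.group(2)
        rules = []
        for rule_text in body.split(";"):   # rules are separated by semi-colons
            words = rule_text.split()       # words are separated by whitespace
            if words:                       # skip the empty piece after the last ';'
                rules.append(words)
        grammar[nonterminal] = rules
    return grammar

sample = """
{ <status> gave up ; finished ; am still ; can't believe I haven't started ; }
{ <difficult> hard ; impossible ; easy ; trivial ; }
"""
grammar = parse_grammar(sample)
print(grammar["<status>"])
```

The curly-braces, angle-brackets, and semi-colons each map onto one step of the sketch, which is why the format says they help with reading/parsing.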

When run by itself, the module rsgModel loads the example grammar file, apt-issues.g, and generates two different "stories" using the grammar. The handin includes questions about this.

Internally, a dictionary is used in which the key is a non-terminal and the value corresponding to the non-terminal is a list of rules; each rule is itself a list of words. For example, loading the grammar file generates this dictionary entry:

<status> ['gave', 'up'] ['finished'] ['am', 'still'] ["can't", 'believe', 'I', "haven't", 'started']

You can see that the non-terminal <status> has four rules with a varying number of words in each rule. Because there can be several rules, the dictionary uses a list of lists of strings to make it easy to process each word (in case one might be a non-terminal).
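
The list-of-lists shape can be explored directly in Python. The snippet below is illustrative only: the two entries are typed in by hand, and the NONTERMINAL pattern and the nonterminals_in helper are assumptions rather than names from the provided code.

```python
import re

# Hand-typed entries in the list-of-lists shape shown above.
grammar = {
    "<status>": [["gave", "up"], ["finished"], ["am", "still"],
                 ["can't", "believe", "I", "haven't", "started"]],
    "<description>": [["were", "<adjective>", "<difficult>"],
                      ["were", "<adjective>", "<difficult>", "and", "<excuse>"]],
}

NONTERMINAL = re.compile(r"<[^>]+>")   # assumed pattern: text inside angle brackets

def nonterminals_in(rule):
    """Return the words of one rule that are non-terminals, in order."""
    return [word for word in rule if NONTERMINAL.fullmatch(word)]

# Print each <status> rule with its word count and any non-terminals in it.
for rule in grammar["<status>"]:
    print(len(rule), rule, nonterminals_in(rule))
```

Because every rule is already split into words, checking each word for a non-terminal is a simple loop, with no further string parsing needed at generation time.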