OOWS, Object-Oriented and Optimized Word Stats


Write a program that will read a file and determine the words that occur most frequently in the file. Words are delimited by white space and should not have leading or trailing punctuation. All characters should be converted to lowercase equivalents. The program should print the n words that occur most frequently, where n defaults to 20, but is otherwise a parameter to the program. If no filename is specified, the program should read from standard input (cin). The program should also be able to output the words in multiple columns, where the default is one column, but otherwise could be specified as a parameter to the program. You should not use any tapestry collections in this program, but instead use C++'s STL.
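
The word-cleaning rule above can be sketched as follows. This is only one possible reading of the specification: the helper name normalize is hypothetical, and "punctuation" is assumed to mean whatever std::ispunct reports in the default C locale. Punctuation inside a word (as in "don't") is left alone, since only leading and trailing punctuation is mentioned.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the assignment): strips leading and
// trailing punctuation and converts the remaining characters to lowercase.
std::string normalize(const std::string& raw) {
    std::size_t first = 0;
    std::size_t last = raw.size();
    // Skip punctuation at the front.
    while (first < last && std::ispunct(static_cast<unsigned char>(raw[first])))
        ++first;
    // Skip punctuation at the back.
    while (last > first && std::ispunct(static_cast<unsigned char>(raw[last - 1])))
        --last;
    std::string word = raw.substr(first, last - first);
    for (char& c : word)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return word;  // may be empty if the token was all punctuation
}
```

A token made up entirely of punctuation normalizes to the empty string; whether such tokens should be counted is not stated above, so a caller would probably skip them.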

Usage

The examples below show how the program can be used.

wordcount
wordcount -f filename -n numwords -c columns
wordcount --file[=filename] --numwords[=number] --columns[=number]
For example, the following are all valid uses.
wordcount < data/poe.txt
wordcount -n 30 < data/poe.txt
wordcount --file=data/poe.txt --numwords=30
wordcount -f data/poe.txt -n 30

Output

Each line of output should contain a count, followed by two spaces, followed by a word. The count is the number of times the word occurs in the input being processed. The most frequently occurring word should be printed first and the least frequently occurring word last (of the at most n lines printed, where n is a parameter to the program that defaults to 20). The counts should be right-justified so that the words are aligned by their first letter:

100  blueberry
 12  apple
 11  berry
 11  cherry
  7  watermelon
  6  orange
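
One way to produce this layout from a std::map of counts is sketched below. The names Entry, topWords and printSingleColumn are illustrative assumptions, not required by the assignment. Ties are broken alphabetically, and std::partial_sort is used so that only the first n entries are put in order.

```cpp
#include <algorithm>
#include <iomanip>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (word, count)

// Returns at most n entries, highest count first, ties in alphabetical order.
std::vector<Entry> topWords(const std::map<std::string, int>& counts, std::size_t n) {
    std::vector<Entry> entries(counts.begin(), counts.end());
    n = std::min(n, entries.size());
    std::partial_sort(entries.begin(), entries.begin() + n, entries.end(),
        [](const Entry& a, const Entry& b) {
            if (a.second != b.second) return a.second > b.second;
            return a.first < b.first;
        });
    entries.resize(n);
    return entries;
}

// Prints "count, two spaces, word" with the counts right-justified.
void printSingleColumn(const std::vector<Entry>& entries, std::ostream& out) {
    if (entries.empty()) return;
    // The first entry has the largest count, so it determines the width.
    const int width = static_cast<int>(std::to_string(entries.front().second).size());
    for (const Entry& e : entries)
        out << std::setw(width) << e.second << "  " << e.first << '\n';
}
```

Calling printSingleColumn(topWords(counts, 20), std::cout) would then print the default report.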

The command-line option --columns[=number] or -c number should output number count/word pairs on each line (except the last line, which may not be full). Each count is followed by two spaces, and counts are right-justified within each column (as above). Each column is separated from the next by four spaces, measured from the end of the longest word in one column to the first digit of the largest count in the next column.

wordcount --file=foo.txt --columns=3
wordcount -f foo.txt -c 3
These could generate output as follows. Note that there are four spaces between the 'y' in blueberry and the first '1' in the count of 11 for berry in the next column, and four spaces between the 'y' in cherry and the 6 in the count for orange.
100  blueberry    11  berry     7  watermelon
 12  apple        11  cherry    6  orange
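
A possible way to compute this layout is sketched below. It assumes, based only on the example above, that the sorted words run down each column before moving to the next, and that trailing columns are made one entry shorter when the words do not divide evenly, so that only the last line can be short. The names Entry and printColumns are hypothetical.

```cpp
#include <algorithm>
#include <iomanip>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (word, count), already sorted

void printColumns(const std::vector<Entry>& entries, std::size_t cols, std::ostream& out) {
    if (entries.empty() || cols == 0) return;
    const std::size_t n = entries.size();
    const std::size_t rows = (n + cols - 1) / cols;  // ceiling division
    const std::size_t full = n - (rows - 1) * cols;  // entries on the last line

    // start[c] is the index of the first entry in column c. Columns at or
    // beyond `full` hold one entry fewer than the earlier columns.
    std::vector<std::size_t> start(cols + 1, 0);
    for (std::size_t c = 0; c < cols; ++c)
        start[c + 1] = start[c] + (c < full ? rows : rows - 1);

    // Per-column widths: digits of the widest count and the longest word.
    std::vector<std::size_t> countW(cols, 0), wordW(cols, 0);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t i = start[c]; i < start[c + 1]; ++i) {
            countW[c] = std::max(countW[c], std::to_string(entries[i].second).size());
            wordW[c] = std::max(wordW[c], entries[i].first.size());
        }

    for (std::size_t r = 0; r < rows; ++r) {
        std::size_t prevLen = 0;  // length of the word just printed on this line
        for (std::size_t c = 0; c < cols; ++c) {
            const std::size_t i = start[c] + r;
            if (i >= start[c + 1]) break;  // no entry in this column on row r
            // Pad the previous word to its column width, then add four spaces.
            if (c > 0) out << std::string(wordW[c - 1] - prevLen + 4, ' ');
            out << std::setw(static_cast<int>(countW[c])) << entries[i].second
                << "  " << entries[i].first;
            prevLen = entries[i].first.size();
        }
        out << '\n';
    }
}
```

The last word on each line is not padded, so no trailing spaces are written. Whether trailing spaces are acceptable is not specified above.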

Example Code

The following programs demonstrate how to use the getopt function to parse command-line arguments.
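
As a minimal sketch of the idea, the function below uses getopt_long, a GNU/BSD extension declared in <getopt.h> rather than part of standard C++. The Options struct and the parseOptions name are hypothetical. The long options are declared with optional_argument to match the --file[=filename] form shown under Usage, so a long option given without a value simply keeps its default.

```cpp
#include <getopt.h>  // getopt_long, optarg, optind (GNU/BSD extension)
#include <cstdlib>
#include <string>

// Hypothetical container for the parsed settings.
struct Options {
    std::string filename;  // empty means "read from standard input"
    int numWords = 20;     // default from the specification
    int columns = 1;       // default from the specification
    bool ok = true;        // false if an unrecognized option was seen
};

Options parseOptions(int argc, char* argv[]) {
    static const option longOpts[] = {
        {"file",     optional_argument, nullptr, 'f'},
        {"numwords", optional_argument, nullptr, 'n'},
        {"columns",  optional_argument, nullptr, 'c'},
        {nullptr, 0, nullptr, 0}
    };
    Options opts;
    optind = 1;  // restart scanning in case the function is called again
    int ch;
    // "f:n:c:" means the short options -f, -n and -c each require a value.
    while ((ch = getopt_long(argc, argv, "f:n:c:", longOpts, nullptr)) != -1) {
        switch (ch) {
        case 'f': if (optarg) opts.filename = optarg; break;
        case 'n': if (optarg) opts.numWords = std::atoi(optarg); break;
        case 'c': if (optarg) opts.columns = std::atoi(optarg); break;
        default:  opts.ok = false; break;  // getopt_long has printed a message
        }
    }
    return opts;
}
```

A main function would call parseOptions(argc, argv) once and then open opts.filename, or fall back to cin when it is empty. std::atoi does no error checking, so a real program would probably validate the numbers.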

Deliverables

  1. Version 1.0 should work correctly reading from standard input and determining the 20 words that occur most frequently (break ties alphabetically). You can process command-line options, but you don't need to. The grade will be based only on correctness, but comments on design will be given.

  2. Version 2.0 should process command-line options and be designed using classes to facilitate alternative implementations with minimal rewriting. The grade will be based 50% on design and 50% on conforming to specifications.

  3. Version 3.0 should be as fast as possible. It will be graded 80% on speed and 20% on design. Programs that produce incorrect output will receive little credit.
