CPS 108 : KWIC

Spring 2002

A Key Word in Context index is useful in looking up titles, words, and other things. Write a program that, given a set of files, generates a concordance or Keyword In Context (KWIC) index.

Think of the input files as a stream of (word, line-number, file) sets where line-number is the number of the line on which the word occurs and file is the file in which the word appears. By default, every word is printed in alphabetical order, with its context, the three words before and after the word, if it exists, followed by the name of the file in which the word appears and a list of line numbers that span the full context of the word. This might be a single line number or it might be a range of line numbers (see the sample output below).

By definition a word is any sequence of non-white space characters separated from other words by white space (or the first/last word in a file). Leading or trailing punctuation should not be included in the word, though internal punctuation should be. In the sentence below there are seven words: how, are, you, i, asked, can't, complain. The case of the words should be disregarded as well, i.e., "Apple" is the same as "apple", for comparison purposes.

   "How are you?" I asked. Can't complain.

The word whose context is given is printed in all-caps (see the output below) with every word of the context printed as it appeared in the file. The capitalized words should be justified in some manner similar to what's shown below in the sample output.

By default, every word in a file generates a line of output, in which it appears capitalized with a context. Suppose the word giant occurs in a file 20 times. It should appear in the concordance 20 times. But in what order should it appear? Ideally, the occurrences of GIANT that are printed will be printed in the same order in which they occur, i.e., the line numbers for each line in which the key word GIANT occurs will appear in increasing order.

Command-line Options

General usage of your kwic program is described below. Flags in brackets [] are optional; your program should provide the given defaults if not specified.

kwic [-before #] [-after #] [-min #] sources

All dash options, e.g., -before, may come in any order and be abbreviated by a single letter (shown in bold below). You can add other options for extra credit.

before #	maximum number of words of context to print before the keyword, default = 3
after #	maximum number of words of context to print after the keyword, default = 3
min #	minimum number of letters in a word to be considered a keyword in the concordance, default = 3

Example

Two lines from the online copy of the bible are reproduced below. Assume that these are lines 1 and 2.

And the earth was without form, and void; and darkness was
upon the face of the deep.

The output from the text above generated by the following call

kwic -min 4 -b 2 -a 3 data/simple.txt

is shown below:

   void; and DARKNESS was upon the        1-2 data/simple.txt
      of the DEEP                         2 data/simple.txt
     And the EARTH was without form,      1 data/simple.txt
    upon the FACE of the deep.            2 data/simple.txt
 was without FORM and void; and           1 data/simple.txt
darkness was UPON the face of             1-2 data/simple.txt
   form, and VOID and darkness was        1 data/simple.txt
   earth was WITHOUT form and void        1 data/simple.txt

Resources

Storage efficiency and speed are important considerations in designing your program, but the correctness and design of the program are more important (though programs that are unbelievably slow may not get graded). In particular, you should be able to improve the time/space efficiency of your program without affecting your overall design.