You are one of four or five founders of a software startup, File Utilities. The company is writing a suite of cross-platform programs based on processing files. Your first product is described below; its success is essential to future venture capital infusions into the company. This is clearly mission-critical software.

Goofi: Grep-like Object-Oriented File Indexing

The Unix utility grep supports fast look-up of regular expressions. Your project is an implementation of Goofi, a Grep-like Object-oriented File Indexing system. Unlike grep, Goofi will support matching in a large set of files and will cache results between runs. In this way it will function as a cross between the Unix utilities find and fgrep (see the man page entries for these). It is similar to, but much smaller than, the utility glimpse. (If you check the web page you'll see that glimpse is available on many kinds of systems. It is installed on CS machines, but not on acpub machines.)

Requirements

You will provide two programs: goofi and goof. The first should be written in C++, the second in Java.

GOOFI

The program goofi catalogs a directory or web hierarchy, building a structure that will facilitate searching using the program goof. The goofi program will create at least one file in which information is stored and subsequently read by goof to search for words, phrases, and regular expressions. However, you're welcome to create multiple files used by goof. If you create more than one file, these should be stored in a directory (e.g., a .goofi directory); it is not a good idea to litter a user's directory with multiple files to support an application.

General usage of goofi is described below. Flags in brackets [] are optional.

goofi source [-update] [-exclude subdir] [-source dir]
             [-ignore file] [-suffixes file] [-output dir]
             [-min int] [-depth int]
All dash options, e.g., -update, -ignore, can come in any order, and each should also be accepted when abbreviated to a single letter (shown in bold below). You can add other options for extra credit.
-update Only reads files that are newer than the last time goofi was run with the same root directory. Without this option all files in the directory hierarchy are read and indexed. With this option only new files (modified since the last run of goofi) are read and indexed.
-exclude subdir The subdirectory named as an argument is not searched in this run of goofi. Optionally subdir can be a regular expression, in which case any directory matching the regular expression is ignored.
-source dir Instead of reading the .goofi file/directory in the user's home directory, the file/directory named as an argument to source is read.
-ignore file The named file is assumed to contain whitespace-delimited words. These words are ignored and not indexed by the goofi run.
-suffixes file The named file consists of whitespace-delimited words/regular expressions. Any file encountered by goofi whose suffix matches any of those in the file is not indexed during the goofi run.
-output dir Store the results of the index in a file or directory in the user's home directory (default = ".goofi").
-min int Only words whose lengths are greater than or equal to the integer value argument int are indexed by the goofi run (default = 4).
-depth int Only index to a certain depth (default = 2, which applies only when indexing web pages).
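
The single-letter abbreviations described above could be handled by prefix matching. What follows is only a minimal sketch, not the required object-oriented getopt: parseArgs, ParsedArgs, and the spec table are hypothetical names, and the policy that an abbreviation is accepted only when it identifies exactly one option is an assumption the handout does not prescribe.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Result of parsing: named options plus leftover positional arguments.
// (Hypothetical type, introduced for illustration only.)
struct ParsedArgs {
    std::map<std::string, std::string> options;  // option name -> value ("" for flags)
    std::vector<std::string> positional;
};

// spec maps each full option name to whether it takes a value.
// An argument "-x" selects an option when "x" is its full name or a
// prefix of exactly one option name (an assumption of this sketch).
ParsedArgs parseArgs(const std::vector<std::string>& args,
                     const std::map<std::string, bool>& spec) {
    ParsedArgs result;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg.size() < 2 || arg[0] != '-') {
            result.positional.push_back(arg);  // not a dash option
            continue;
        }
        const std::string name = arg.substr(1);
        std::string match;
        int candidates = 0;
        for (const auto& entry : spec) {
            if (entry.first == name) {  // an exact match always wins
                match = entry.first;
                candidates = 1;
                break;
            }
            if (entry.first.compare(0, name.size(), name) == 0) {
                match = entry.first;  // name is a prefix of this option
                ++candidates;
            }
        }
        if (candidates != 1) {
            throw std::invalid_argument("unknown or ambiguous option: " + arg);
        }
        if (spec.at(match)) {  // option takes a value: consume the next argument
            if (i + 1 >= args.size()) {
                throw std::invalid_argument("missing value for " + arg);
            }
            result.options[match] = args[++i];
        } else {
            result.options[match] = "";
        }
    }
    return result;
}
```

With a spec containing update, exclude, min, and depth, the arguments "src -u -m 5 -exclude tmp" would yield one positional argument (src) and the options update, min = 5, and exclude = tmp. An abbreviation shared by two option names is rejected by this sketch rather than resolved.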

The source is specified by either an absolute or relative directory path or a URL for a web page that should be recursively searched to create an index for subsequent use by the goof program. This directory is called the goofi root directory. The index can be one file or multiple files stored in a directory. By default the index is stored in either a file or a directory named .goofi in the user's home directory. By default goofi should ignore all executable files and those with the following suffixes:

    .Z, .z, .zip, .tgz, .dvi, .ps, .tar, .o
In addition, the user should have the option of excluding certain files and/or directories from being indexed by storing the names of these files in a .goofi-excludes file which will be read, if it exists, by goofi.
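
The default suffix rule above can be sketched as a simple string test. This is only an illustration: endsWith, defaultSkippedSuffixes, and hasSkippedSuffix are hypothetical names, the sketch compares plain suffixes (no regular expressions), and it does not attempt the executable-file check or the .goofi-excludes file.

```cpp
#include <string>
#include <vector>

// True if name ends with the given suffix.
bool endsWith(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// The default suffixes listed in the handout. A real implementation would
// presumably merge in user-supplied suffixes as well (assumption).
const std::vector<std::string>& defaultSkippedSuffixes() {
    static const std::vector<std::string> suffixes = {
        ".Z", ".z", ".zip", ".tgz", ".dvi", ".ps", ".tar", ".o"};
    return suffixes;
}

// True if fileName ends with any suffix in the list, i.e. should not be indexed.
bool hasSkippedSuffix(const std::string& fileName,
                      const std::vector<std::string>& suffixes) {
    for (const auto& suffix : suffixes) {
        if (endsWith(fileName, suffix)) {
            return true;
        }
    }
    return false;
}
```

For example, hasSkippedSuffix("paper.ps", defaultSkippedSuffixes()) is true, while the same call with "main.cc" is false.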

Your indexing program should run in two modes: one that emphasizes keeping the index files small, and one that emphasizes the speed of goof queries. Most likely it will be difficult to have both a small index and a fast program. For some ideas see the glimpse paper.
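
One possible way to picture the size/speed trade-off is sketched below. It is an assumption for illustration, not a design taken from the handout or the glimpse paper: FastIndex and SmallIndex are hypothetical names. The fast variant stores every (file, line) occurrence so a query needs no further file access; the small variant stores only which files contain a word, so line numbers would have to be recovered by rescanning those files at query time.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Speed-oriented sketch: every occurrence is recorded, so a query can
// report file names and line numbers directly from the index.
struct FastIndex {
    std::map<std::string, std::vector<std::pair<std::string, int>>> postings;

    void add(const std::string& word, const std::string& file, int line) {
        postings[word].emplace_back(file, line);
    }
};

// Size-oriented sketch: only the set of files containing each word is
// recorded. The line number is deliberately discarded here.
struct SmallIndex {
    std::map<std::string, std::set<std::string>> files;

    void add(const std::string& word, const std::string& file, int /*line*/) {
        files[word].insert(file);
    }
};
```

If the word "index" occurs on two lines of one file and one line of another, the fast variant holds three postings for it while the small variant holds two file names.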

GOOF

The program goof searches indexes created by the program goofi to report all the files and line numbers on which a given word, or a set of words matching a given regular expression, occurs.

General usage of goof is described below. Flags in brackets [] are optional.

goof word [-source dir] [-nolines] [-file regexp] [-reg]
          [-context]
All dash options, e.g., -source, -context, can come in any order, and each should also be accepted when abbreviated to a single letter (shown in bold below). You can add other options for extra credit.
-source dir  Instead of reading the .goofi file/directory in the user's home directory, the file/directory named as an argument to source is read.
-nolines No line numbers are printed, only matching file names.
-file regexp Only files matching the regular expression regexp are searched/indexed for matches. For example, typing goof -file "\.cc$" string would find all occurrences of the word string in indexed files with a .cc suffix.
-reg The word argument to goof is treated as a regular expression instead of a word. Using goof -reg "^r...t$" would search for all five-letter words starting with r and ending with t, e.g., robot.
-context In addition to line numbers, the matched lines themselves are printed.
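
The two regular-expression options above (-file and -reg) both amount to filtering a list of strings by a pattern. The following is a minimal sketch using std::regex; filterByRegex is a hypothetical helper, and it assumes the default ECMAScript grammar of std::regex, which may differ in places from grep's syntax. The handout does not say which dialect is expected.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the items in which the pattern matches somewhere. Anchors such
// as ^ and $ in the pattern restrict the match to the whole string.
std::vector<std::string> filterByRegex(const std::vector<std::string>& items,
                                       const std::string& pattern) {
    const std::regex re(pattern);  // throws std::regex_error on a malformed pattern
    std::vector<std::string> matches;
    for (const auto& item : items) {
        if (std::regex_search(item, re)) {
            matches.push_back(item);
        }
    }
    return matches;
}
```

Applied to indexed words with the pattern "^r...t$", this keeps robot and rejects reboot; applied to file names with the pattern "\.cc$", it keeps main.cc and rejects main.cc.bak.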

By default, goof reads a .goofi file/directory in the user's home directory, searches for all occurrences of the word in the index, and prints a list of files and matching line numbers for each file on which the word occurs. Words are delimited by whitespace and punctuation characters. However, punctuation (see ispunct in ctype.h) is not considered part of a word when it comes before the first alphabetic character or after the last alphabetic character.
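
The word rule above can be read in more than one way. The sketch below takes one reading as an assumption: split each line on whitespace, then strip punctuation (as classified by ispunct) from both ends of each token, leaving interior punctuation such as an apostrophe in place. trimPunctuation and splitWords are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Removes leading and trailing punctuation characters from a token.
// The cast to unsigned char avoids undefined behavior for negative chars.
std::string trimPunctuation(const std::string& token) {
    std::size_t begin = 0;
    std::size_t end = token.size();
    while (begin < end && std::ispunct(static_cast<unsigned char>(token[begin]))) {
        ++begin;
    }
    while (end > begin && std::ispunct(static_cast<unsigned char>(token[end - 1]))) {
        --end;
    }
    return token.substr(begin, end - begin);
}

// Splits a line on whitespace and trims each token; tokens that consist
// only of punctuation are dropped.
std::vector<std::string> splitWords(const std::string& line) {
    std::vector<std::string> words;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        const std::string word = trimPunctuation(token);
        if (!word.empty()) {
            words.push_back(word);
        }
    }
    return words;
}
```

Under this reading, the line "Hello, world! (don't) -- x.y" produces the words Hello, world, don't, and x.y. Whether interior punctuation should instead split a token is one of the questions the requirements leave open.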

Implementation Notes

Your implementation must use the object-oriented versions of the C functions getopt and scandir.

Deliverables

  1. Sunday, February 27. A list of who you do and do not want to be in your group. I guarantee that you will not work with those students listed as negatives, but I cannot guarantee your positives. The best thing you can do is organize into groups of four or five and have everyone in the group mail me the other people's names. This makes it clear you have a group that wants to work together.
  2. Tuesday, February 29. A one- to two-page description of the classes you envision as part of implementing goofi and goof, and a list of issues that arise as you try to pin down the requirements, e.g., vague, ambiguous, or conflicting requirements and issues in what's asked of you. You will have one deliverable per group.
  3. Friday, March 3. Libraries that implement scandir and getopt functions, along with test programs that verify they work. A clear description of how you intend to build your small and your fast indices. The submission should include the beginnings of a user manual and a programmer manual (for the group that takes over File Utilities when you have gone public and cashed in your stock options).
  4. Tuesday, March 7. A basic version of goofi and goof that can build an index and read it. It does not have to support any of the command-line options, but should be designed to accommodate them. The beginnings of the design document and user manual must be submitted electronically.
  5. Friday, March 10. The final program, design document, and user manual must be submitted electronically.

Comments?