Autocomplete


Credits

The Autocomplete assignment was originally developed by Matthew Drabick and Kevin Wayne of Princeton University for their COS 226 class. The introductory portion of this writeup is based on the writeup for the Princeton version of the assignment, which can be found here.

The assignment was restructured for use at Duke by UTA Austin Lu ’15, UTA Arun Ganesh ’17, and Prof. Jeff Forbes.

Introduction

Welcome to the Autocomplete assignment. In this assignment, you will implement the autocomplete algorithm, which is discussed in some detail on the theory page, using binary search and trie traversal. In doing so, you will see some of the advantages that tries (or, more generally, trees/graphs) and sorted arrays have over unsorted arrays. The overall goal of autocomplete is to quickly find the most relevant matches to a query. For this project, relevance is based solely on the frequency of the term in a large data set, and you will only consider terms that begin with the query as a prefix. More sophisticated algorithms check for the query anywhere in the term rather than just at the beginning, and can also correct for typos using string-difference algorithms.
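To make the idea concrete, here is a minimal brute-force sketch in Java: filter the terms by prefix, then keep the k heaviest matches. The provided BruteAutocomplete follows this same idea, but the class and method names below are simplified illustrations, not the assignment's actual API.

    import java.util.*;

    // Simplified sketch of brute-force autocomplete. The assignment's
    // BruteAutocomplete follows the same idea with a different interface.
    public class SimpleAutocomplete {
        private final String[] terms;
        private final double[] weights; // weights[i] = frequency of terms[i]

        public SimpleAutocomplete(String[] terms, double[] weights) {
            this.terms = terms;
            this.weights = weights;
        }

        // Return up to k terms that start with prefix, heaviest first.
        public List<String> topKMatches(String prefix, int k) {
            List<Integer> matches = new ArrayList<>();
            for (int i = 0; i < terms.length; i++)
                if (terms[i].startsWith(prefix))
                    matches.add(i);
            // Sort matching indices by descending weight.
            matches.sort((a, b) -> Double.compare(weights[b], weights[a]));
            List<String> top = new ArrayList<>();
            for (int i = 0; i < Math.min(k, matches.size()); i++)
                top.add(terms[matches.get(i)]);
            return top;
        }
    }

The point of the assignment is that this linear scan touches all N terms on every query; binary search over a sorted array and trie traversal both avoid that cost.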

As always, before starting, be sure to snarf the assignment via Ambient so you can read along. Alternatively, you can download the jar file and use Eclipse's import utility.

Provided Classes

Every class necessary to write a complete autocompletor has been provided for you to fill in with your work. The following files, however, should be considered read-only. As usual, these files will be replaced in your submission, so you should not depend on any modifications to them for compilation or correctness. You are free to change them for testing purposes, of course, but you shouldn't need to make any major adjustments.

Benchmark:

Node:

Autocompletor and BruteAutocomplete:

AutocompleteMain and AutocompleteGUI:

TestBinarySearch, TestTerm, and TestTrie:

Required Classes

Term:

BinarySearchAutocomplete:

TrieAutocomplete:

Analysis

As mentioned above, Benchmark has been provided to you for timing your implementations of Autocompletor. To test a specific implementation, change the return value of the getInstance method to the desired type.
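For example, getInstance might look something like the sketch below (the exact signature in the provided Benchmark class may differ); you would swap which constructor is called:

    // Sketch: change the returned implementation to choose what gets timed.
    // The exact signature in the provided Benchmark class may differ.
    public static Autocompletor getInstance(String[] terms, double[] weights) {
        // return new BruteAutocomplete(terms, weights);
        // return new TrieAutocomplete(terms, weights);
        return new BinarySearchAutocomplete(terms, weights);
    }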

The Benchmark class will time all of the following for a source file you choose through the file chooser:

You can, of course, modify the Benchmark class to add more tests at your leisure, but you should find that the provided tests are sufficient for your analysis needs. For timing method calls, the Benchmark class runs until either 1000 trials or 5 seconds have passed (to minimize variation in the results) and reports the average. Thus, it should not be surprising if one test takes 5 seconds to run; if more than 5 seconds pass and a test has not completed, however, there is likely an infinite loop in your code.

Note that even for the largest source files (fourletterwords.txt and words-333333.txt), both BinarySearchAutocomplete and TrieAutocomplete should have runtimes in the microseconds. Thus, even averaged over 1000 trials, the results are highly susceptible to variation from OS interrupts and other sources. For the best results, I'd recommend closing all programs except Eclipse and closely examining your data for any strange data points. This isn't a requirement, of course; just be aware that your data may say something different from what you expect.
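The timing policy described above amounts to a loop like the following sketch (illustrative only; the provided Benchmark class implements its own version):

    // Sketch of the timing policy: repeat a call until 1000 trials or
    // 5 seconds have elapsed, then report the average time per call.
    public static double averageTimeMillis(Runnable methodCall) {
        final int MAX_TRIALS = 1000;
        final long MAX_NANOS = 5_000_000_000L; // 5 seconds
        long start = System.nanoTime();
        int trials = 0;
        while (trials < MAX_TRIALS && System.nanoTime() - start < MAX_NANOS) {
            methodCall.run();
            trials++;
        }
        return (System.nanoTime() - start) / (double) trials / 1_000_000.0;
    }

Averaging over many trials smooths out per-call jitter, though as noted above it cannot eliminate OS-level noise entirely.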

In addition to submitting your completed classes, you should benchmark them and answer the following questions. Use data wherever possible to justify your answers, and keep explanations brief but accurate.

  1. What is the order of growth (big-Oh) of the number of compares (in the worst case) that each of the operations in the Autocomplete data type makes, as a function of the number of terms N, the number of matching terms M, and k, the number of matches returned by topKMatches for BinarySearchAutocomplete? (See the binary-search sketch after this list of questions.)

  2. How does the runtime of topKMatches() vary with k, assuming a fixed prefix and set of terms? Provide answers for BruteAutocomplete, BinarySearchAutocomplete and TrieAutocomplete. Justify your answer, with both data and algorithmic analysis.

  3. Look at the methods topMatch and topKMatches in BruteAutocomplete and BinarySearchAutocomplete and compare both their theoretical and empirical runtimes. Is BinarySearchAutocomplete always guaranteed to perform better than BruteAutocomplete? Justify your answer.

  4. For all three of the Autocompletor implementations, how does increasing the size of the source file and increasing the length of the prefix argument affect the runtime of topMatch and topKMatches? (Tip: Benchmark each implementation using fourletterwords.txt, which contains all four-letter combinations from aaaa to zzzz, and fourletterwordshalf.txt, which contains all four-letter combinations from aaaa to mzzz. These datasets provide a very clean distribution of words and an exact 1-to-2 ratio of words between the two source files.)
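For question 1 in particular, the key primitive is a binary search that finds the first (or last) index of a match in a sorted array. Below is an illustrative sketch (the names are mine, not necessarily the assignment's API); paired with a comparator that treats two terms as equal whenever they agree on the query prefix, it locates the first match in O(log N) compares:

    import java.util.Comparator;

    public class BinarySearchSketch {
        // Return the first index i with comp.compare(a[i], key) == 0,
        // or -1 if no such index exists. Uses O(log N) compares.
        public static <T> int firstIndexOf(T[] a, T key, Comparator<T> comp) {
            int low = 0, high = a.length - 1, found = -1;
            while (low <= high) {
                int mid = low + (high - low) / 2; // avoids int overflow
                int cmp = comp.compare(a[mid], key);
                if (cmp < 0) {
                    low = mid + 1;             // match, if any, is to the right
                } else {
                    if (cmp == 0) found = mid; // record match, keep going left
                    high = mid - 1;
                }
            }
            return found;
        }
    }

A matching lastIndexOf mirrors this with the branches reversed; together the two calls bracket all M matching terms using O(log N) compares.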

Submission and Grading

Please submit the following classes, any extra classes you wrote, README.txt, and Analysis.pdf for grading:

Make sure when you submit, all of the following are present and functional:

For transparency's sake, below is a list of the aspects of your code that the automated tests will check. Your code will still be checked by a TA, and passing all of these tests does not guarantee a perfect grade. This may appear to be a large number of tests, but if you follow the guidelines set in the method headers and this writeup, you should already be passing most, if not all, of these tests:

Rubric: