Autocomplete
Credits
The Autocomplete assignment was originally developed by Matthew Drabick and Kevin Wayne of Princeton University for their COS 226 class. The introductory portion of this writeup was based on the writeup for the Princeton version of the assignment, which can be found here.
The assignment was restructured for use at Duke by UTA Austin Lu ’15, UTA Arun Ganesh ’17, and Prof. Jeff Forbes.
Introduction
Welcome to the Autocomplete assignment. In this assignment, you will implement the autocomplete algorithm, which is discussed in some detail on the theory page, using binary search and trie traversal. In doing so, you will see some of the advantages of tries (or more generally, trees/graphs) and sorted arrays over unsorted arrays. The overall goal of autocomplete is to quickly find the most relevant matches to a query. For this project, relevance is based solely on the frequency of the term in a large data set, and you'll only consider terms which begin with the query as a prefix. More sophisticated algorithms check for the query anywhere in the term rather than just at the beginning, and can also correct for typos using string difference algorithms.
As always, before starting, be sure to snarf the assignment via Ambient so you can read along. Alternatively, you can download the jar file and use Eclipse's import utility.
Provided Classes
Every class necessary to write a complete autocompletor has been provided for you to fill in with your work. The following files, however, should be considered read-only. As usual, these files will be replaced in your submission, so you should not depend on any modifications to them for compilation or correctness. You are free to change them for testing purposes, of course, but you shouldn't need to make any major adjustments.
Benchmark:
- An efficient way to analyze your code for the analysis section without having to click through the GUI multiple times. You should not need to edit this file at all, except to change the implementation returned by the getInstance method at the top of the file.
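For example, switching which implementation Benchmark times might look like the sketch below; the exact signature of getInstance in the provided file may differ, and the parameter names here are assumptions for illustration:

// Inside Benchmark: return whichever implementation you want to time.
public static Autocompletor getInstance(String[] terms, double[] weights) {
    // return new BruteAutocomplete(terms, weights);
    return new BinarySearchAutocomplete(terms, weights);
}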
Node:
- This is a completed Node class for you to use in your implementation of TrieAutocomplete. There's nothing particularly special to see except for the ReverseSubtreeMaxWeightComparator, which you'll need to use at least once when writing TrieAutocomplete.
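Though Node is complete, it helps to have a mental model of what each node stores. Below is a simplified sketch; the field names are assumptions about the provided class, so consult the real file for the authoritative definitions:

import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a trie node (NOT the provided Node class).
class NodeSketch {
    boolean isWord;            // does a word end at this node?
    double myWeight;           // weight of that word, if isWord is true
    double mySubtreeMaxWeight; // largest word weight anywhere in this subtree
    Map<Character, NodeSketch> children = new HashMap<>();
}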
Autocompletor and BruteAutocomplete:
- Autocompletor defines the interface for an autocompleting algorithm. It requires two methods, both related to finding the best matches in the underlying data structure. The first, topKMatches, finds the k best matches to the given prefix. The second, topMatch, finds only the single best match. Usually topMatch can be implemented more quickly than topKMatches, so you shouldn't implement topMatch by calling topKMatches; doing so will slow down your program. BruteAutocomplete is the most basic implementation of this interface and uses linear search and heap sort to find the best matches. BruteAutocomplete should work properly as soon as you have completed the Term class.
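As a rough sketch, the interface and the brute-force strategy look something like the following. The method signatures are taken from the submission checklist below; the min-heap selection is one reasonable reading of "linear search and heap sort", and getWord() is an assumed accessor on Term:

import java.util.PriorityQueue;

// The two methods every Autocompletor must provide.
interface Autocompletor {
    String topMatch(String prefix);             // the single best match
    String[] topKMatches(String prefix, int k); // the k best matches, best first
}

class BruteSketch {
    // Brute-force idea: scan every term, keeping the k heaviest prefix
    // matches in a min-heap so the lightest candidate is cheap to evict.
    static String[] topKMatches(Term[] terms, String prefix, int k) {
        PriorityQueue<Term> pq = new PriorityQueue<>(new Term.WeightOrder());
        for (Term t : terms) {
            if (!t.getWord().startsWith(prefix)) continue; // getWord() assumed
            pq.add(t);
            if (pq.size() > k) pq.poll(); // drop the lightest of the k+1
        }
        String[] result = new String[pq.size()];
        for (int i = result.length - 1; i >= 0; i--)
            result[i] = pq.poll().getWord(); // heap yields lightest first
        return result;
    }
}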
AutocompleteMain and AutocompleteGUI:
- These classes only launch the GUI. Don't mess with the GUI code, but feel free to change which implementation of autocomplete is used or how many matches to find at the top of AutocompleteMain.
TestBinarySearch, TestTerm, and TestTrie:
- Three basic JUnit files with test cases for each of the three required classes. Note that BinarySearchAutocomplete depends on the Term class, so you should pass the Term tests first. Also, as always, the provided testers do not guarantee full credit and do not even check for the full range of possible errors the autograder will look for. To be more confident of full credit, try writing your own additional test cases, or ask a UTA in helper hours or on Piazza for help.
Required Classes
Term:
- Term is the analog to Node for BruteAutocomplete and BinarySearchAutocomplete. It stores a String value and the corresponding positive weight of the word. It implements Comparable to give a default comparison, but there are also three Comparators defined internally which allow for sorting in different (and useful) ways. Since they are inner classes, to use any of these Comparators you'll need to specify the outer class name at construction. For example:
Comparator<Term> comp = new Term.PrefixOrder(3);
You can read more about implementing Term here.
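As a further hypothetical illustration, here is how those Comparators might be used with Arrays.sort; the Term constructor and inner class names follow the submission checklist below:

import java.util.Arrays;

Term[] terms = { new Term("apple", 100), new Term("apply", 50), new Term("ape", 75) };

// PrefixOrder(3) compares terms by only the first 3 characters of each word.
Arrays.sort(terms, new Term.PrefixOrder(3));

// ReverseWeightOrder sorts heaviest first, which is handy for ranking matches.
Arrays.sort(terms, new Term.ReverseWeightOrder());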
BinarySearchAutocomplete:
- BinarySearchAutocomplete is an improved version of BruteAutocomplete which keeps its list of terms sorted, so that searching can be done in logarithmic time instead of linear time. Otherwise, the two implementations are quite similar in approach and, in some special cases, in runtime as well. You can find more about implementing BinarySearchAutocomplete here.
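The heart of the class is a pair of modified binary searches. Below is a sketch of one common approach to firstIndexOf, using the signature from the submission checklist; the twist is that equality narrows the search leftward instead of stopping it (lastIndexOf mirrors this to the right):

import java.util.Comparator;

// Return the FIRST index in a whose element compares equal to key under
// comp, or -1 if none exists. a is assumed sorted with respect to comp.
public static int firstIndexOf(Term[] a, Term key, Comparator<Term> comp) {
    int low = 0, high = a.length - 1, found = -1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        int cmp = comp.compare(a[mid], key);
        if (cmp < 0) {
            low = mid + 1;    // everything at or left of mid is too small
        } else if (cmp > 0) {
            high = mid - 1;   // everything at or right of mid is too big
        } else {
            found = mid;      // record the hit...
            high = mid - 1;   // ...but keep searching to the left
        }
    }
    return found;
}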
TrieAutocomplete:
- TrieAutocomplete improves even further upon BinarySearchAutocomplete and comes closer to real-world implementations of autocomplete. TrieAutocomplete discards the notion of Terms and stores its words in a trie data structure. Rather than storing an entire word in any one place, each word is represented as a path from the root, one character per edge. Although this makes it incredibly fast and simple to find a single match, it also introduces a potentially huge branching factor that needs to be accounted for when searching for k matches. You can read more on TrieAutocomplete and its implementation here.
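To see how the trie pays off for topMatch, consider the greedy descent it enables: after walking down to the node matching the prefix, repeatedly step into whichever subtree carries the overall maximum weight. A sketch follows, under the assumption that each Node exposes fields like isWord, myWord, myWeight, mySubtreeMaxWeight, and a children map (verify against the provided Node class):

// Sketch: find the single heaviest word at or below the prefix's node.
// Field names are assumptions about the provided Node class.
static String topMatch(Node root, String prefix) {
    Node current = root;
    for (char ch : prefix.toCharArray()) {   // walk down the prefix
        current = current.children.get(ch);
        if (current == null) return "";      // no word starts with prefix
    }
    // The heaviest word is wherever the subtree max "lives": either at the
    // current node itself, or inside the child whose subtree max matches.
    while (!(current.isWord && current.myWeight == current.mySubtreeMaxWeight)) {
        for (Node child : current.children.values()) {
            if (child.mySubtreeMaxWeight == current.mySubtreeMaxWeight) {
                current = child;
                break;
            }
        }
    }
    return current.myWord;
}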
Analysis
As mentioned above, Benchmark has been provided to you for timing your implementations of Autocompletor. To test a specific implementation, change the return value of the getInstance method to the desired type.
The benchmark class will time all of the following, for a source file you choose through the file chooser:
- The time it takes to initialize a new instance of the class.
- If the class is TrieAutocomplete, the number of nodes in the initialized Trie.
- The time it takes to call topMatch and topKMatches (for varied k) on:
- A blank string
- A random word from the source
- Prefixes of that word
- A prefix which does not start any words in the source
You can, of course, modify the Benchmark class to add more tests at your leisure, but you should find the provided tests sufficient for your analysis needs. For timing method calls, the Benchmark class runs until either 1000 trials or 5 seconds have passed (to minimize variation in the results) and reports the average. Thus, it should not be surprising if one test takes 5 seconds to run; but if more than 5 seconds pass and a test has not completed, there is likely an infinite loop in your code. Note that even for the largest source files (fourletterwords.txt and words-333333.txt), both BinarySearchAutocomplete and TrieAutocomplete should have runtimes in the microseconds. Thus, even over 1000 trials, the results are highly susceptible to variation from OS interrupts and other sources. For the best results, close all programs except Eclipse and closely examine your data for any strange data points. This isn't a requirement, of course; just be aware of the potential for your data to say something different from what you expect.
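The timing strategy described above amounts to a loop like the following sketch (not the provided code; auto, prefix, and k stand in for whatever is being benchmarked):

// "1000 trials or 5 seconds, whichever comes first," then report the average.
long start = System.nanoTime();
int trials = 0;
while (trials < 1000 && System.nanoTime() - start < 5_000_000_000L) {
    auto.topKMatches(prefix, k); // the call being timed
    trials++;
}
double averageNanos = (System.nanoTime() - start) / (double) trials;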
In addition to submitting your completed classes, you should benchmark them and answer the following questions. Use data wherever possible to justify your answers, and keep explanations brief but accurate.
What is the order of growth (big-Oh) of the number of compares (in the worst case) that each of the operations in the Autocomplete data type makes, as a function of the number of terms N, the number of matching terms M, and k, the number of matches returned by topKMatches, for BinarySearchAutocomplete?
How does the runtime of topKMatches() vary with k, assuming a fixed prefix and set of terms? Provide answers for BruteAutocomplete, BinarySearchAutocomplete and TrieAutocomplete. Justify your answer, with both data and algorithmic analysis.
Look at the methods topMatch and topKMatches in BruteAutocomplete and BinarySearchAutocomplete and compare both their theoretical and empirical runtimes. Is BinarySearchAutocomplete always guaranteed to perform better than BruteAutocomplete? Justify your answer.
For all three of the Autocompletor implementations, how does increasing the size of the source and increasing the size of the prefix argument affect the runtime of topMatch and topKMatches? (Tip: Benchmark each implementation using fourletterwords.txt, which has all four-letter combinations from aaaa to zzzz, and fourletterwordshalf.txt, which has all four-letter combinations from aaaa to mzzz. These datasets provide a very clean distribution of words and an exact 1-to-2 ratio in the number of words between the two source files.)
Submission and Grading
Please submit the following classes, any extra classes you wrote, README.txt, and Analysis.pdf for grading:
- BinarySearchAutocomplete.java
- Term.java
- TrieAutocomplete.java
Make sure when you submit, all of the following are present and functional:
- Class: BinarySearchAutocomplete
- Constructor: BinarySearchAutocomplete(String[], double[])
- Method: public String topMatch(String)
- Method: public String[] topKMatches(String, int)
- Method: public static int firstIndexOf(Term[], Term, Comparator)
- Method: public static int lastIndexOf(Term[], Term, Comparator)
- Class: TrieAutocomplete
- Constructor: TrieAutocomplete(String[], double[])
- Method: private void add(String, double)
- Method: public String topMatch(String)
- Method: public String[] topKMatches(String, int)
- Field: protected Node myRoot
- Class: Term
- Class: Term.WeightOrder
- Method: public int compare(Term, Term)
- Class: Term.ReverseWeightOrder
- Method: public int compare(Term, Term)
- Class: Term.PrefixOrder
- Constructor: Term.PrefixOrder(int)
- Method: public int compare(Term, Term)
- Field: private final int r
- Constructor: Term(String, double)
- Method: public String toString()
- Method: public int compareTo(Term)
For transparency's sake, below is a list of aspects of your code the automated tests will check. Your code will still be checked by a TA, and passing all these tests does not guarantee a perfect grade. This may appear to be a large number of tests, but if you follow the guidelines set in the method headers and this writeup, you should already be passing most, if not all, of them:
- Do Term's compareTo and Comparator inner classes sort correctly?
- Does Term's constructor initialize correctly and throw exceptions when appropriate?
- Does Term.PrefixOrder(r).compare() take O(r) time?
- Do BinarySearchAutocomplete and TrieAutocomplete's topMatch and topKMatches return the correct results?
- Do BinarySearchAutocomplete and TrieAutocomplete's topMatch simply call topKMatches?
- Can BinarySearchAutocomplete and TrieAutocomplete's topMatch and topKMatches handle an empty string argument?
- Do BinarySearchAutocomplete and TrieAutocomplete's topKMatches handle a k = 0 argument?
- Do BinarySearchAutocomplete and TrieAutocomplete handle words with irregular character values correctly?
- Do BinarySearchAutocomplete and TrieAutocomplete modify the values they are constructed from when they shouldn't?
- Do BinarySearchAutocomplete and TrieAutocomplete use static variables unnecessarily?
- Do BinarySearchAutocomplete and TrieAutocomplete's constructors, topMatch, and topKMatches throw NullPointerExceptions if an argument is null?
- Do BinarySearchAutocomplete's firstIndexOf and lastIndexOf return the correct results?
- Do BinarySearchAutocomplete's firstIndexOf and lastIndexOf handle empty arrays?
- Do BinarySearchAutocomplete's firstIndexOf and lastIndexOf use the correct number of compares?
- Do BinarySearchAutocomplete's firstIndexOf and lastIndexOf use compare instead of equals?
- Do BinarySearchAutocomplete's firstIndexOf and lastIndexOf change the value of their arguments when they shouldn't?
- Does TrieAutocomplete's add generate the trie correctly?
- Does TrieAutocomplete's add handle calls to the same word twice correctly?
- Do TrieAutocomplete's topMatch and topKMatches avoid exploring more nodes than needed?
Rubric:
- 70% Correctness: for your implementation of Term, BinarySearchAutocomplete, and TrieAutocomplete and passing the tests described above.
- 20% Analysis: for your README, data from Benchmark, answers to the questions, and description of the tradeoffs.
- 10% Engineering: for the structure and style of your Autocompletor implementations, including formatting, comments, and naming conventions.