TrieAutocomplete

You may already have some familiarity with what a trie is, and you should have familiarity with what a tree is. Regardless, completing TrieAutocomplete is very difficult without a mastery of tries, so this page will cover the basics of Tries and their functionality.

What Is A Trie?

A trie is simply a special version of a tree. In some trees, each node has a defined number of children, and a fixed way to refer to each of them. For example, in a binary tree, each node has two children (each of which could be null), and we refer to these as a left and a right child. Tries are different in that each node can have an arbitrary number of children, and rather than having named pointers to children, the pointers are the values in a key-value map.

So, while a node in a Java binary tree might be defined as

public class Node{
    Node myLeft, myRight;
}

A node in a Java trie might look like

public class Node {
    Map<Character, Node> children;
}

(Note that the Node class given to you has much more information than this)

The keys of the map will also correspond to some value. For example, for a trie that stores many Strings (as we wish to in this assignment), the keys will be characters. In order to reach the node representing a word in a trie, we simply follow the series of pointers corresponding to the characters in the String. Below is a drawing of a sample word trie:

The top node is the root. It has three children, and to get to these children we have to use the keys t, A, and i. The word each node represents is the concatenation of the keys of pointers you have to take from the root to get to that node. So, to get to "tea" from the root, we have to follow the root's t pointer, then the e pointer, then the a pointer.

More generally, to get to a node representing a word in the String str, given a root pointer we might use the following code loop:

Node curr = root; for (int k = 0; k < str.length(); k++) { curr = curr.children.get( str.charAt(k)); }

Trie Functionality And Utility

In creating a trie, we will of course have to add values to it. Adding a value to a trie is very similar to navigating to it. To add a value, simply try navigating to that value, but anytime a node on the path to that value is missing, create that node yourself. The code for adding the word in str to a trie might look like this:

Node curr = root; for (int k = 0; k < str.length(); k++) { if (!curr.children.containsKey(str.charAt(k))) { curr.children.put(str.charAt(i), new Node()); } curr = curr.children.get(str.charAt(i)); }

(Again, please note that the Node class given to you is more detailed and this code alone is not a solution to this assignment)

We use tries because they have some useful properties, which we will take advantage of in this project:

The time it takes to find an entry in a trie is independent of how many entries are in that trie. More specifically, to navigate to a node corresponding to a word of length w in a trie that stores n words takes O(w) time as opposed to O(f(n)) for some f(n). This improvement is possible because every navigation to the same word passes through the same set of nodes, and thus takes the same time regardless of what other nodes exist.
All words represented in the subtrie rooted at the node representing some word start with that word. That is, the node representing "apple" will always be below the node representing "app" or more generally, every node below the node representing "app" represents a word starting with "app". For this reason, tries are sometimes called prefix trees. Given that the autocomplete algorithm is searching for words starting with a given prefix, this structure proves very useful.

TrieAutocomplete implements the Autocompletor interface, which means it should, given a list of terms and weights for those terms, be able to find the top match(es) for a prefix amongst those terms. To do this, you will write methods to construct a trie, and then use the trie structure to quickly filter out all terms starting with a given prefix, and then find the highest weighted terms.

Within this class, you should:

Write the trie method add
Write the interface-required method topMatch
Write the interface-required method topKMatches

The Node Class

For this entire page, it may be useful to have the Node class given to you open for reading. In addition, you should be comfortable with the basic concepts of a trie at this point.

The Node class comes with several class variables for your use in completing TrieAutocomplete: - isWord - Set to true if the current node corresponds to a word in the set of words represented by this trie. We need this because in creating the nodes representing words, we create nodes representing words that do not exist (e.g. in creating a node for "apple" we will create a node for "appl"). Thus, we need some way to distinguish between intermediary nodes (nodes between the root and word nodes that don't represent words) and word nodes. - myWord - A convenience variable, which contains the word this node represents. Should be null if isWord is false - myInfo - A convenience variable, which contains the character this node corresponds to - myWeight - The weight of the word this node corresponds to. -1 if this node does not correspond to a word. - mySubtreeMaxWeight - The maximum weight of a word in the subtree rooted at this node (the subtree includes the root node). Useful for navigating quickly to high-weight nodes. Tracking this extra piece of information will heavily speed up our autocomplete algorithm. - children - The map from characters to the corresponding children nodes - parent - A pointer to the parent of this node

Add

Look at the constructor for TrieAutocomplete. It initializes the trie's root, and then calls the void add method on every word-weight pair in the arguments for the constructor. Your task is to write the add method such that it constructs the trie correctly. That is, if you write the method correctly, then every word-weight pair should be represented as a trie, and the descriptions of the class variables for Nodes listed above should be true for every node. More specifically, when add is called on a word-weight pair, you should: - Create the node representing that word and any intermediary nodes if they do not already exist - Set the values of myWord, myInfo, isWord, and myWeight for all word nodes - Set the values of mySubtreeMaxWeight for all nodes between and including the root and the word node

To help you understand what all your add method should be doing, here's an example of a series of adds. In the trie below, each node's mySubtreeMaxWeight is red, the key to that node is the letter inside of it, and if the node represents a word, it has a green background and its weight is in blue.

We start with just a root:

Trie 1

We call add("all", 3). Notice how since the only word at this point has weight 3, all nodes have mySubtreeMaxWeight 3.

Trie 2

We call add("bat", 2). Only the nodes in the right subtree have mySubtreeMaxWeight 2, because the largest weight is still 3.

Trie 3

We call add("apes", 5). This time, we don't create a node for the first letter. We update the mySubtreeMaxWeight of the root and the "a" node.

Trie 4

We call add("ape", 1). Notice how we create no new nodes, simply modify the values of an existing node.

Trie 5

Lastly we call add("bay", 4). Notice the changes in red numbers, and that we only created one new node here.

Trie 6

Also note how in the final tree, any word which has no larger-weight words below it (every word but apes) has the same red and blue values. This is true of any trie we construct in this project, if the parameters of nodes are updated correctly - a node with the same myWeight and mySubtreeMaxWeight has no children with a larger weight.

Most importantly, notice how on the path from a node representing a prefix to the largest weighted word in the subtree rooted at that node, all the red values are the same, and this value is also the weight of the largest weighted word. We will take advantage of this fact when writing topMatch().

This example of constructing a trie using add is a more simple one - we never added a word below an existing word, or a word which already exists. Your add should be able to handle any series of calls on word-weight pairs appropriately, including corner cases.

The most notable corner case is adding a word that already in the trie, but with a lower weight - for example, if we were to call add("apes", 1). In this case, we would have to update the weight of apes, and then the mySubtreeMaxWeight of all its ancestors. This is especially tricky because not all the ancestors would not have the same new value. Be sure to take advantage of parent pointers when writing code specific to this case.

TopMatch

Fortunately, once add is written topMatch becomes very simple. As noted in the above example, the mySubtreeMaxWeight of every node from a prefix node to the node under the prefix node with the highest weight will be the same as that highest weight. Thus, the algorithm to find the top match is simply:

Navigate to the node corresponding to the prefix
Use the mySubtreeMaxWeight of the prefix node to navigate down the tree until you find the Node with the max weight

TopKMatches

topKMatches is similar, but not quite the same as topMatch. We will still be taking advantage of mySubtreeMaxWeight to quickly navigate to high-weight nodes, but this time, we will have to go down multiple branches instead of just one.

To find the top k matches as quickly as possible, we will use what is known as a search algorithm - keep a PriorityQueue of Nodes, sorted by mySubtreeMaxWeight. Start with just the root in the PriorityQueue, and pop Nodes off the PriorityQueue one by one.

Anytime we pop a node off the PriorityQueue, or "visit" it, we will add all its children to the PriorityQueue. Whenever a visited node is a word, add it to a weight-sorted list of words.

When this list has k words with weight greater than the largest mySubtreeMaxWeight in our priority queue, we know none of the nodes we have yet to explore can have a larger weight than the k words we have found. At that point, we can stop searching and return those k words. If we run out of nodes in the PriorityQueue before we find k words, that means there are not k words in the trie rooted at the prefix.