Welcome to Huffman coding, your final programming assignment of the semester. Your task for this programming assignment will be to implement a fully functional Huffman coding suite equipped with methods to both compress and decompress files. This assignment specification/guide should be sufficient to walk you through Huffman coding step-by-step. However, if you run into any difficulties or are having trouble understanding exactly how Huffman coding works, feel free to check out the Wikipedia article on Huffman coding, or, as always, the teaching staff will be available both at helper hours and on Piazza to assist you.
To begin, you'll need the user interface code available through snarf or here. You should consider these files read-only.
If you're having trouble getting the snarfed code to run, try removing your JRE library and adding it again. To do that, right-click on the project, select Build Path, and click Configure Build Path. Go to the Libraries tab, select the JRE System Library (it may also just be called jdk...), and remove it. Next click Add Library, choose JRE System Library, and use the workspace default. If this does not work, post on Piazza and a member of the teaching staff will help you get your code running.
Additionally, you should have a data folder which contains three standard compression corpora that are commonly used to test the effectiveness of compression algorithms. Both the Calgary and Canterbury corpora offer a cash prize for every byte of improvement over the previous record; unfortunately, standard Huffman coding won't quite be enough to claim any sort of prize.

More importantly, you've also been given two text files (kjv10.txt and melville.txt) and a binary file (lena.tif), along with compressed versions of these files produced by our implementation of the Huffman algorithm. You should test your own implementation by comparing your compressed files to these using the compare tab of the user interface or the JUnit tests. In order to receive full credit, you will need to be able to both compress the original files into an exact match of our compressed files and decompress our compressed files into an exact match of the original files.

For debugging purposes, the JUnit tests check the header, body, and pseudo-EOF sections separately for compression. If your header section is incorrect, this will likely cause the body and pseudo-EOF tests to fail as well, so use these tests incrementally; that is, make sure that your header is correct before trying to test the body, since the latter is dependent on the former. If you're really struggling with the header or body, you can work around this: the tests only depend on the length of each section, so you could write empty bits to fill up the expected length of the header or body section. That way the body sections of your output and the solution will line up in the JUnit tester so that the tests run correctly. As always, remember that these criteria alone will not guarantee full credit, but merely suggest that you're on the right track.
You only need to implement and submit a single file for Huffman: HuffProcessor.java. In order to get full credit, you should thoroughly read and reference this assignment write-up and follow these steps.
Please note that unlike some previous semesters this is NOT a partner project so please work alone and submit your own work.
In order to make your implementation of Huffman easier to test, run, and benchmark, a simple user interface has been provided for you to use. There are four tabs, each performing a unique and important task.
Compress: The compress tab allows you to use the compress method you wrote in HuffProcessor. Add individual files by clicking the red "No File Chosen" text in the top left. Once you have selected a file, you can either press the compress button to run your code or select a new file by clicking on the green text. When you run your code, you should notice the progress bar in the top right slowly filling up. If the bar turns green, then compression finished without any errors. This does not mean that your code is correct, merely that it's not totally wrong. If the bar becomes red instead, you encountered an error and need to begin debugging. Upon completion, the user interface will also display some basic stats about the compression, namely the size of the file before and after, the percentage of space saved relative to the original file size, and how long the algorithm ran before finishing. All of these will be useful later for the analysis. The actual concrete output of running the compress tab is a compressed file sharing the same name as the original with a ".hf" extension.
Decompress: Like the compress tab, the decompress tab allows you to test your decompress method, and the user interface works in the same way. The output file will remove the ".hf" extension and add ".dehf" in its place.
Compare: The compare tab compares two files bit by bit to ensure that they are identical. Compare will tell you where your files differ if at all. Running compare with your compressed files and the provided compressed files is the best way to test your code. Note again that successfully running compare does not guarantee full correctness, so be sure to check a number of files to cover more edge cases.
Test: The final tab is the test tab which allows you to compress an entire folder of files all at once. It works the same way that compress does, just on a list of files instead of just one. This will be especially useful for the analysis section since it minimizes clicking. There is an added user interface component here: the "test .hf files" checkbox. This allows you to distinguish between compressing and recompressing when benchmarking.
On a final note, although you can switch away from a tab and immediately run something else, it is recommended that you wait until each process is finished since running multiple things at once will likely cause errors (and headaches).
Compression algorithms work by reformatting data to use fewer bits. By default, text files use the 8-bit (one byte) codes specified by ASCII (American Standard Code for Information Interchange). While this may not seem like such a waste of space, 8 bits allows for 256 distinct characters to be expressed. This is far more than most text files will ever use (if you don't believe it, try to come up with 256 different characters that you might find in a regular English-language file). If you look at the characters which ASCII includes, you'll notice that the standard contains a host of mathematical symbols in addition to a group of control characters. Even if you argue that the mathematical notation may actually occur quite frequently, many of the control characters are relics of historical coding practices/necessities that are no longer relevant. Some applications even forbid the use of control characters. Thus, in many cases, by ignoring the characters that don't actually occur in the file and reducing the number of unique codes needed, shorter bit codes can be devised. For example, consider a file with only four unique characters: the characters could be uniquely identified with only 2 bits apiece as opposed to 8, using 00, 01, 10, and 11 to represent each one. A naive approach might do something like this and simply trim the size of the character library to 128 characters or maybe even just 64, saving 1 bit or 2 bits per character respectively. Although mildly effective in some cases, more sophisticated algorithms can capitalize on the statistical distribution of characters to compress even further. Huffman coding is one of these entropy encoding algorithms.
Entropy encoding algorithms work mainly by utilizing variable length codes combined with character frequencies. Characters that occur a lot get shorter codes while lesser used characters get longer codes. In some cases, longer codes may even exceed the 8-bit length of the standard ASCII codes. While this may seem counter-intuitive, it actually allows for more bits to be saved on higher frequency characters. Consider the following using the String "go go gophers" as an example.
Option A: use standard ASCII codes
Character | Frequency | Bits | Total |
---|---|---|---|
g | 3 | 8 | 24 |
o | 3 | 8 | 24 |
_ | 2 | 8 | 16 |
p | 1 | 8 | 8 |
h | 1 | 8 | 8 |
e | 1 | 8 | 8 |
r | 1 | 8 | 8 |
s | 1 | 8 | 8 |
Total | | | 104 |
Option B: use alternative codes based on character frequencies
Character | Frequency | Bits | Total |
---|---|---|---|
g | 3 | 7 | 21 |
o | 3 | 7 | 21 |
_ | 2 | 7 | 14 |
p | 1 | 8 | 8 |
h | 1 | 8 | 8 |
e | 1 | 9 | 9 |
r | 1 | 9 | 9 |
s | 1 | 9 | 9 |
Total | | | 99 |
As you can see, by trading 1 bit from the more frequent characters to the less frequent characters, the alternative solution created some codes that were longer than 8 bits but still saved bits overall. The goal of Huffman and other entropy encoding algorithms is to quickly generate these alternative codes in a predictable and guaranteed way. As it turns out, binary trees are excellent data structures for storing characters and generating character codes. Provided that every internal-node in the tree has two children, binary trees provide a perfect way to translate a sequence of binary digits into characters. Each binary digit tells you how to traverse down the tree: a 0 indicates that the current node should be updated to the left sub-child while a 1 indicates the right sub-child. Each leaf-node represents a character, and since each leaf-node has a unique path from the root-node, each character has a unique code. This should remind you of tries; in fact, Huffman trees are sometimes also referred to as Huffman tries. Variable length codes are also easy to construct using a binary tree: just imagine a very unbalanced tree (but remember that each internal node still always needs two children). Here's a sample Huffman binary tree of "go go gophers" and the corresponding codes.
Character | Code |
---|---|
g | 10 |
o | 11 |
_ | 001 |
p | 0100 |
h | 0101 |
e | 0110 |
r | 0111 |
s | 000 |
The important thing about binary trees for Huffman coding is the prefix property. When every internal-node contains exactly two children, each leaf-node contains a character, and the code generation conventions outlined above are followed, two related properties are fulfilled.
The code for any character is guaranteed to not be the prefix of another code. For example, a Huffman tree cannot assign the codes 01 and 010 to two distinct characters. This prevents confusion when decompressing (e.g. X has code 01 and Y has code 0101, the input 0101 could be interpreted as XX or Y).
The set of all codes generated by a Huffman tree contains prefixes for every possible binary input. This means that no matter what sequence of binary inputs is fed into a Huffman tree, there is always a valid character interpretation.
When these two properties are combined, it means that for any given Huffman tree, there is a single unique character output for every binary input and vice-versa. This implies that there is an unambiguous way to compress a file without loss of accuracy and to decompress a file and end up with the original uncompressed version (both of which seem like pretty important criteria for data compression).
There have been multiple proposed algorithms for constructing one of these magical Huffman trees, but until Huffman himself devised the algorithm you'll be implementing in a moment, no algorithm guaranteed that the resulting tree saved the most possible bits. Huffman was able to do this by constructing the tree from the bottom up. The basic idea is that you take the two nodes with the lowest frequency and combine them under an internal-parent-node. The new parent-node takes the combined frequencies of its two sub-children as its frequency and is put back with all of the other nodes. This process is repeated until there is only one node left, which is the root-node. This diagram from Wikipedia clearly shows how the algorithm works.
Now that you've seen an example, try creating a Huffman tree for yourself. Use "go go gophers" as the source for the tree so you can check your answer with the image above. WARNING: when you're done, your tree may not exactly match the picture. This is ok; your characters don't have to be in the same spots, and the structure of the tree may be slightly different. That said, you should end up with something that has essentially the same shape as the example, perhaps with some left and right sub-children flipped. If you aren't certain that your tree is correct, try again, selecting nodes and orientations to specifically emulate the example.
Before you begin to actually implement Huffman, here are some interesting dilemmas to consider.

First, compression isn't free. Your decompression algorithm needs to be able to reconstruct the tree you used to compress the file in order to decompress it. You'll have to include some information at the beginning of each of your compressed files which instructs the decompress method on how to recreate the specific tree that it needs. There are a number of ways to encode this information, but they all require a similar amount of space. This additional information, or header, is why many shorter files can't be compressed; the additional length of the header outweighs the number of bits saved by compressing the body of the file.

Second, since the decompress algorithm requires very specific formatting in order to work, inputting a file that was never compressed in the first place could yield some interesting results. In order to prevent decompress from trying to parse nonsensical information, you'll also need to include some sort of flag or indicator at the beginning of your compressed files to communicate to decompress that it is all right to proceed. There is a constant provided for you in Processor that you can use for this purpose called HUFF_NUMBER. It is highly unlikely that HUFF_NUMBER will naturally occur at the beginning of a file, so this is a nearly 100% effective approach.

Finally, since all of the Huffman bit codes are of variable length, there is no surefire mathematical way to know when you've reached the end of the file and need to stop reading bits. Although the file does eventually end, there may be some number of extra bits at the end since files are written in bytes, not bits, and so every file must have a length that is a multiple of 8 bits. This could result in additional characters being added to the end of your decompressed file which did not exist in the original uncompressed version. To combat this issue, you should include an end-of-file character at the end of each of your compressed files. The pseudo-EOF will have its own Huffman code generated from the tree. The decompression algorithm won't write this character to your decompressed file; rather, once it encounters the unique pseudo-EOF code, it will trigger your loop to stop reading bits and exit.
The following sections will provide you with detailed step-by-step directions on how to properly implement the compression algorithm. It is suggested that you write a short test file that you can compress by hand and compare against the output of your compress method as you go, although you may opt to develop from the JUnit tests instead. Be sure to build your hand-constructed tree with the same strategy your code uses, as a mismatch here can be an unnecessary source of confusion. Additionally, it is highly recommended that you write a number of helper methods rather than one long method to implement compress. It is better design and will be much easier to debug should you encounter a problem. Furthermore, use descriptive method and variable names. This is also good design and will be considered as part of your engineering score.
Since Huffman coding relies on character entropy/frequencies to generate new codes, the first step of your compression algorithm should be to count the number of occurrences of each character in the file. You can read bits from the provided BitInputStream using readBits(int howManyBits). You should read BITS_PER_WORD (which equals 8) bits at a time in order to get a single character with each call. Continue to call in.readBits(BITS_PER_WORD) until it returns -1, which indicates that you've reached the end of the file. Storing the frequencies is best done using either an array or a Map. Since Java automatically converts between char and int primitive values, you can use the indexes of an array as a faster alternative to Character keys in a Map, although both data structures are appropriate. Either way, the values should hold the accumulated frequency of each character. Be sure to reset your BitInputStream once you have finished reading all of the characters so that you can reread the file without error when you encode it later. You can do this by calling in.reset().
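As a rough sketch, the counting loop might look like the following. The array size of 256 and the variable names are illustrative assumptions; readBits, reset, and BITS_PER_WORD come from the provided classes.

```java
// Sketch of the frequency count; 256 slots is an assumption (one per 8-bit value).
int[] counts = new int[256];
int chunk = in.readBits(BITS_PER_WORD);
while (chunk != -1) {
    counts[chunk]++;                    // each 8-bit chunk is one character
    chunk = in.readBits(BITS_PER_WORD);
}
in.reset();                             // rewind so the file can be reread when encoding
```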
Now that you know how often each character occurs, you can construct a Huffman tree based on the file. To construct the tree, first create a HuffNode for each character that occurs at least once. You don't want to include characters that don't occur at all since this will make your tree much larger than it needs to be and minimize the effectiveness of your compression algorithm. Remember also that the whole point of Huffman coding is to leave out the characters that don't occur at all. The weight of each HuffNode should be the character's frequency and the value should be the character itself, converted automatically by Java to an int. Remember to include the pseudo-EOF character in the tree; its HuffNode should have a weight of 0 and a value of PSEUDO_EOF, a constant in Processor. You'll need to put each HuffNode into a data structure of your choosing. Since HuffNode implements Comparable, you could use either an array with Arrays.sort(HuffNode[] nodes), some type of List with Collections.sort(List nodes), or any other built-in Java library you please.
However, the classic solution to Huffman is to use a PriorityQueue which will automatically sort the nodes for you and return the smallest weighted node each time you call poll(). This implementation is much faster than using an array or a List since you don't have to do a full sort at each step. No matter which data structure you use, you'll need to use a while loop to continue combining nodes until you only have one left. To combine two nodes, first remove the two smallest nodes from your data structure. Then create a new node with the two nodes as its sub-children and their combined weight as its weight. Put the new node back into the data structure and repeat. The node that remains at the end is the root of your Huffman tree.
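One possible shape for this loop, using java.util.PriorityQueue, is sketched below. The HuffNode constructor signatures (value, weight) and (value, weight, left, right) and the weight() accessor are assumptions about the provided class, so adapt them to the actual API.

```java
// Sketch of bottom-up tree construction (requires import java.util.PriorityQueue).
PriorityQueue<HuffNode> pq = new PriorityQueue<>();
for (int c = 0; c < counts.length; c++) {
    if (counts[c] > 0) {
        pq.add(new HuffNode(c, counts[c]));   // one leaf per character that occurs
    }
}
pq.add(new HuffNode(PSEUDO_EOF, 0));          // pseudo-EOF leaf with weight 0

while (pq.size() > 1) {
    HuffNode left = pq.poll();                // the two smallest-weight nodes
    HuffNode right = pq.poll();
    pq.add(new HuffNode(-1, left.weight() + right.weight(), left, right));
}
HuffNode root = pq.poll();                    // the last remaining node is the root
```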
Since you have a Huffman tree in the form of a single HuffNode, you also have all of the Huffman codes you will need to compress the body of the file. Rather than querying the tree and extracting a code for every character in the file, it will be much more efficient to pre-process your tree and extract all of the codes before compressing the body of the file. As with the frequencies, you can either use an array or a Map to store the new codes. Populating your array/Map can be done with a basic recursive traversal. At the base case where you reach a leaf, simply add a new entry to the array or Map. Otherwise make a recursive call down the left and right sub-tree with the correct updated arguments. Although in theory it does not matter whether 0 indicates the left or right, our implementation uses 0 to indicate the left and 1 to indicate the right, so in order to match our process exactly and optimize your score for Huffman, please use the same pattern. It is advised that you store each Huffman code as a String of 0s and 1s to preserve the integrity of the code. Using an int may preserve the value, but it cannot reliably represent the length of the code when there are leading 0s (e.g. the codes 01 and 001 would be indistinguishable in int form). You may elect to add a second array/Map to separately store the number of bits in the code, although the String approach may be more intuitive and easier to implement with recursion.
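The traversal might be sketched as follows; the left()/right()/value() accessors are assumptions about HuffNode, and the array length of 257 assumes PSEUDO_EOF has the value 256 so that it gets its own slot.

```java
// Sketch of code extraction: codes[c] ends up holding c's code as a String of 0s and 1s.
private void extractCodes(HuffNode node, String path, String[] codes) {
    if (node.left() == null && node.right() == null) {
        codes[node.value()] = path;                 // leaf: record the accumulated code
        return;
    }
    extractCodes(node.left(), path + "0", codes);   // 0 means go left
    extractCodes(node.right(), path + "1", codes);  // 1 means go right
}

// typical call: String[] codes = new String[257]; extractCodes(root, "", codes);
```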
First, you need to write the HUFF_NUMBER. Writing bits is done by calling writeBits(int howManyBits, int value) in BitOutputStream. Since HUFF_NUMBER is an int, we want to write BITS_PER_INT bits with the value HUFF_NUMBER. Next, you need to write out the information of the tree. This is best done recursively. For the base case when you reach a leaf node, first write a single bit with the value 1. This distinguishes leaf nodes from internal nodes. Then, write out 9 bits with the value of the leaf node. Even though it should only take 8 bits in theory to store the character value of the node, with the addition of the PSEUDO_EOF node it actually takes 9 bits to fully distinguish all of the possibilities. Alternatively, for an internal node, write out a 0 and then make two recursive calls to write the left and right sub-trees in that order. Again, while technically it doesn't matter whether you write the left or right sub-tree first (so long as you read them in the same order in decompress), use this order to match our implementation and get full credit. As an example, the header for the "go go gophers" tree above would look something like:
0 0 0 1 --S-- 1 --_-- 0 0 1 --P-- 1 --H-- 0 1 --E-- 1 --R-- 0 1 --G-- 1 --O--

The spaces are added for readability, and the --x--s correspond to the 9-bit versions of the ASCII codes for each character.
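A sketch of the writing side is shown below. writeBits, BITS_PER_INT, and HUFF_NUMBER come from the provided classes, while the accessor names on HuffNode are assumptions.

```java
// In compress, after building the tree:
out.writeBits(BITS_PER_INT, HUFF_NUMBER);   // magic number first
writeHeader(root, out);                     // then the tree itself

// Recursive helper for the header.
private void writeHeader(HuffNode node, BitOutputStream out) {
    if (node.left() == null && node.right() == null) {
        out.writeBits(1, 1);                // 1 marks a leaf
        out.writeBits(9, node.value());     // 9 bits so PSEUDO_EOF fits
    } else {
        out.writeBits(1, 0);                // 0 marks an internal node
        writeHeader(node.left(), out);      // left subtree first, then right
        writeHeader(node.right(), out);
    }
}
```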
You're nearly done with compression; all that's left is writing the encoded body of the file. Similar to when you first read the file while counting character frequencies, you need to read BITS_PER_WORD bits at a time to isolate each character. For each character, use your Map or array to obtain the Huffman code. You will call writeBits() again, using the length of the String as the first parameter (or the value from the secondary Map/array holding the length if you made that choice) and the int value of the String as the second parameter. You can use Integer.parseInt(String value, int radix) to do this. The String value you obtain from the Map/array and the radix should be 2 since the number represented by the String is in binary. Once you have written out every character from the original file, make one last writeBits() call using the PSEUDO_EOF entry in your Map or array to complete the compression process.
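Sketched out, and assuming the codes array from earlier, the body-writing loop might look roughly like this:

```java
// Sketch of writing the compressed body followed by the pseudo-EOF code.
int chunk = in.readBits(BITS_PER_WORD);
while (chunk != -1) {
    String code = codes[chunk];
    out.writeBits(code.length(), Integer.parseInt(code, 2));   // radix 2: the code is binary
    chunk = in.readBits(BITS_PER_WORD);
}
String eof = codes[PSEUDO_EOF];
out.writeBits(eof.length(), Integer.parseInt(eof, 2));         // terminate with pseudo-EOF
```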
Similar to compression, it is possible to write the entire decompression algorithm in one method, but it is easier to break the problem down, and a few parts are easier to do with recursion. Your code will be easier to read and debug if broken up, and you will not lose any points for engineering. Use the test file from before (or simply the JUnit tests) to try compressing and then decompressing a file. Once these work, use a larger file (melville.txt is a good one) and use the compare tab to make sure the uncompressed and the decompressed versions are the same. Implementing decompression should go much faster than compression did, since by now you should have a better understanding of the Huffman coding algorithm.
First, check to make sure the HUFF_NUMBER is present at the beginning of the file using the BitInputStream and the readBits() method. You should read BITS_PER_INT bits. If the HUFF_NUMBER isn't there, throw a HuffException with an informative message. Once that is verified, you can move on to reading the header with a recursive helper method. First, read a single bit. If the bit is a 0, then you are at an internal node and need to make two recursive calls to build the current node's two sub-trees. Be sure to make the recursive calls in the right order so that your codes aren't flipped. If the bit was instead a 1, then you need to read the next 9 bits to get the value of the character. Create a new HuffNode with the value of the 9 bits you just read and a weight of whatever you'd like. The weight doesn't matter now since the entire structure of the tree is specified by the header. If you wrote the header correctly, then this recursive process should terminate naturally at exactly the end of the header. If instead you are encountering errors, check the write and read methods for the header for any minor mistakes.
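The check and the recursive reader might be sketched like this; the HuffNode constructors used here are assumptions about the provided class.

```java
// In decompress: verify the magic number, then rebuild the tree.
int id = in.readBits(BITS_PER_INT);
if (id != HUFF_NUMBER) {
    throw new HuffException("File did not begin with HUFF_NUMBER");
}
HuffNode root = readHeader(in);

// Recursive helper that mirrors the header writer.
private HuffNode readHeader(BitInputStream in) {
    if (in.readBits(1) == 0) {              // internal node: rebuild both subtrees
        HuffNode left = readHeader(in);
        HuffNode right = readHeader(in);
        return new HuffNode(-1, 0, left, right);
    }
    return new HuffNode(in.readBits(9), 0); // leaf: the weight no longer matters
}
```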
With a tree, all you have to do now is continue to decode characters until you reach the PSEUDO_EOF character. With a while loop, read one bit at a time. Depending on the value of that bit, move the current node to either the left or right sub-child. Include an if-check for the bit equaling -1: if a file was not compressed properly or there's an error in your decompress methods, you may miss the PSEUDO_EOF and read all the way to the end of the file. If the current bit equals -1, throw a HuffException like before with the message "Problem with the PSEUDO_EOF" so that you will know where to begin debugging if the error should occur. If the bit was valid and you move to a new node, check if it is a leaf. If so, check whether the value of the node is the PSEUDO_EOF. When you encounter the PSEUDO_EOF, you need to return so that you stop writing and reading bits. For all other characters, write BITS_PER_WORD bits using the BitOutputStream with the value of the character stored in the node. After writing the character, you also need to reset the current node pointer to the root of the tree. By now you should have a fully working compress and decompress system which you can verify using the compare tab. Once you are sure that your code works, move on to the analysis or check out the extra credit section.
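Putting that together, the decoding loop might look like the sketch below, with the usual caveat that the accessor names on HuffNode are assumptions.

```java
// Sketch of the decoding loop: walk the tree one bit at a time, write a character
// at each leaf, and stop when the pseudo-EOF leaf is reached.
HuffNode current = root;
while (true) {
    int bit = in.readBits(1);
    if (bit == -1) {
        throw new HuffException("Problem with the PSEUDO_EOF");
    }
    current = (bit == 0) ? current.left() : current.right();
    if (current.left() == null && current.right() == null) {   // reached a leaf
        if (current.value() == PSEUDO_EOF) {
            return;                                             // done decompressing
        }
        out.writeBits(BITS_PER_WORD, current.value());
        current = root;                                         // start over at the root
    }
}
```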
Answer the following questions. You do not need to write several pages, but you should provide thoughtful answers backed by data and analysis.
Benchmark your code on the given calgary and waterloo directories. Develop a hypothesis from your code and empirical data for how the compression rate and time depend on file length and alphabet size. Note that you will have to add a line or two of code to determine the size of the alphabet.
Do text files or binary (image) files compress more (compare the calgary (text) and waterloo (image) folders)? Explain why.
How much additional compression can be achieved by compressing an already compressed file? Explain why Huffman coding is or is not effective after the first compression.
Devise another way to store the header so that the Huffman tree can be recreated (note: you do not have to store the tree directly, just whatever information you need to build the same tree again).