Recitation 13 — Huffman

Compsci 201, Fall 2013

Huffman Coding Questions

 

A – Create a Huffman tree and the Huffman codes for each of the 6 characters whose freqencies are given below.

Draw the Huffman tree and then use the tree you make to generate the “1010..” codes for each of the six characters. In creating the tree, make the left child the smallest child when creating a new node with the two smallest remaining nodes as children. A ’0′ means ‘left’ and a ’1′ means right when generating codes from trees.

 'a'  's'  'p'  'r'  't'  'o'
10 7 2 6 3 9

B – In the Huffman Compression assignment the compressed file your program creates must store enough information in the file so that it can be decompressed — this requires the tree used in the compression process. This information is called the header information in the assignment/howto writeups. The idea is the compressed file stores first the header, then the compressed data. The header is needed to uncompress the compressed data. In Huffman coding, for example, the header stores information to allow the Huffman Tree/Trie to be created.

The simplest method for storing information in the header is to write out/store 256 32-bit int values representing the frequency/number of occurrences of each 8-bit chunk. Since the initial tree was created by couting characters, the same tree can be created by knowing the counts, e.g., the number of a’s, z’s, and in general the number of each 256 different 8-bit count.

As noted in the huff-howto the code below could write out 256 counts to the compressed file as part of the header where myCounts is an array of counts, one for each of the 256 8-bit chunks.

 

for(int k=0; k < ALPH_SIZE; k++){ 
    out.writeBits(BITS_PER_INT, myCounts[k]); 
}

1 – Based on the discussion above, what’s the value for ALPH_SIZE in the program? You can find this in the interface IHuffConstants and it’s reproduced at the end of this handout.

 

2 – Would the program work if you changed ALPH_SIZE from 256 to some other number like 512? Why?

3 – What’s the value for BITS_PER_INT in the Huffman program? Why?

 

C – Suppose you want to write the Huffman tree to the header instead of 256 32-bit integers. In the Twenty Questions assignment you wrote code to read a tree stored in pre-order form similar to what follows.

public TreeNode build(Scanner r) { 
    String line = null; 
    if ((line = r.nextLine()) != null) { 
        if (line.startsWith(QUESTION)) { 
            TreeNode root = new TreeNode(line,null,null); 
            root.yesLink = build(r); 
            root.noLink = build(r); 
            return root; 
        } else { 
            TreeNode root = new TreeNode(line,null,null); 
            return root; 
        } 
}

The analog to read a tree in Huff would be something like what follows. The idea is that a 0 or 1 bit is used to differentiate leaves from internal nodes. Each leaf stores a nine-bit value representing the character/chunk stored in that leaf. With 256 different characters, an 8-bit chunk would suffice, but the PSEUDO_EOF requires the extra bit for a leaf. Thus the use of the value 9 in the code below.

 

public TreeNode build(BitInputStream bis) { 
    int bits = bis.readBits(1); 
    if (bits == -1) // signal error, nothing read 
    if (bits == 0) { 
        TreeNode root = new TreeNode(0,0,null,null); 
        root.left = build(bis); 
        root.right = build(bis); return root; 
    } else { 
        bits = bis.readBits(9); 
        TreeNode root = new TreeNode(bits,0,null,null); 
        return root; 
    } 
}
  1. Explain why reading the first single bit, which is either a 0 or a 1, is like the “#Q:” in the Twenty Questions program — what purpose does the bit serve?
  2. The 9 bits read for a leaf node are meant to be the “character” or “chunk” value stored in the tree. There are 256 8-bit values, but 9 bits are used which allow 512 values to be represented. Because the PSEUDO_EOF value/character isn’t real, it’s represented by 256, which requires 9 bits to represent since 8-bits represents the values 0–255. What’s the purpose of the PSEUDO_EOF character and why weren’t 257 values written for the “simple” header described earlier in which 256 frequencies were written?

D – A student proposes the following idea instead of using the PSEUDO_EOF character. Write out, as part of the header, the number of bits in the compressed data section of the compressed file (the part that comes after the header). Since this number of bits is known when the file is compressed, the number can be written, then read during uncompression to tell when to stop reading bits (instead of stopping when PSEUDO_EOF is written.)

You decide not to use this idea, because it would limit the size of a compressed file. Why is there such a limit and what is the limit?

(describe a way around this limit)

E – Suppose you’ve created a Huffman tree whose root is stored in TreeNode myRoot. You want to store all the Huffman encodings for each chunk/value stored in the leaves of this tree, so you write this private method: private void createCodings(TreeNode root, String path) { }

  1. You call this initially with createCodings(myRoot,""). Explain in words what the purpose and relationship of parameter path is in relation to parameter root.
  2. Here are the instance variables from the definition of TreeNode: public int myValue; public int myWeight; public TreeNode myLeft; public TreeNode myRight; Given these, what are the two recursive calls for non-leaf nodes in createCodings?
  3. The base case for createCodings is a leaf node. When the base case is identified, the coding/path to the leaf is stored in a map. Write the code that identifies a leaf and stores the path appropriately in instance variable: aMap<Integer,String> myCodes. Use the myValue field of the leaf as they key in the map. Note that you’re writing code in method createCodings, so you use parameter path as the value in the map.