Compsci 100, Spring 2012, Burrows-Wheeler HOWTO

Understanding Burrows-Wheeler

This is an extra credit assignment, so we'll be relying on your ability to do a little research to understand the BWT on your own. This howto focuses on a few particular issues in integrating the transform with your Huffman code; it does not really describe the transform itself.

One fairly detailed explanation of the BWT and its use in compression is here. You also may find this Wikipedia entry useful.

You'll need a high-level understanding of how BWT works to implement this project, but you should also expect to do some amount of work modifying your existing Huffman code so it can easily accept this new transformation (that's what's described below).

You're also given a class BurrowsWheeler with TODO: annotations indicating where you must add code to make that class work as intended. There's more discussion about this below.

BWT encoding/compression

Below is the body of my method that compresses using Huffman Coding/Compression from the previous assignment. This isn't how I originally wrote it, but it represents a refactoring of the code so that separate methods are used to write the header and the compressed bits. This separation/refactoring makes it simpler to use the existing code in doing the Burrows-Wheeler transform. The idea isn't to make your code look like this, but to understand this code in the context of the block transform.

public int compress(InputStream in, OutputStream out, boolean force) throws IOException {
    BitOutputStream bos = new BitOutputStream(out);
    BitInputStream bis = new BitInputStream(in);
    int bitCount = 0;
    TreeNode root = getTree();
    makeMap(root, "");
    bitCount += writeHeader(bos, root);
    bitCount += writeCompressed(bis, bos);
    bos.flush();
    return bitCount;
}

The refactoring broke out creating the tree and codings for compression, then writing the header, and finally writing the bits for compression. Recall that the method compress above is called from the GUI/View only after chunks have been counted via the method preprocessCompress. This makes it possible to create the tree and the map from the counts.
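preprocessCompress is part of the course's Huffman code, but the counting it performs amounts to the following standalone sketch (the class and method names here are illustrative, not the assignment's actual API):

```java
import java.io.IOException;
import java.io.InputStream;

public class ChunkCounter {
    // Count occurrences of each 8-bit chunk read from the stream.
    // These counts are what later drive building the Huffman tree
    // and the chunk-to-encoding map.
    public static int[] countChunks(InputStream in) throws IOException {
        int[] counts = new int[256];
        int chunk;
        while ((chunk = in.read()) != -1) {
            counts[chunk]++;
        }
        return counts;
    }
}
```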

The basic idea for doing the Burrows-Wheeler transform (BWT) is shown below, leveraging the existing Huffman Compression code in several of the sections.

public int transform(InputStream in, OutputStream out) throws IOException {
    BitInputStream bis = new BitInputStream(in);
    BitOutputStream bos = new BitOutputStream(out);
    int bitCount = 0;
    BurrowsWheeler bw = new BurrowsWheeler();
    while (true) {
        char[] chunk = bw.transform(bis);
        if (chunk.length == 0) break;
        chunk = bw.mtf(chunk);
        byte[] array = new byte[chunk.length];
        for (int k = 0; k < array.length; k++) {
            array[k] = (byte) chunk[k];
        }
        ByteArrayInputStream bas = new ByteArrayInputStream(array);
        preprocessCompress(bas);
        int first = bw.getFirst();
        // write header information as appropriate, e.g.,
        // magic-number and first
        TreeNode root = getTree();
        makeMap(root, "");
        BitInputStream comp = new BitInputStream(new ByteArrayInputStream(array));
        bitCount += writeBWTHeader(bos, root);
        bitCount += writeCompressed(comp, bos);
    }
    bos.flush();
    return bitCount;
}

One of the key ideas here is using a ByteArrayInputStream to adapt the BWT data into a format that the existing Huffman code supports. This type of stream can be constructed from a byte[] array as shown --- once the array has been thus converted into a stream the existing stream-based methods of the Huffman program/code can be used.
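As a minimal standalone sketch of this adaptation (using only java.io, independent of the course's BitInputStream class), converting a char[] of chunks into a stream looks like this:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StreamAdapter {
    // Narrow a char[] of 8-bit chunks into a byte[] and wrap it in a
    // ByteArrayInputStream so existing stream-based code can consume it.
    public static InputStream toStream(char[] chunk) {
        byte[] array = new byte[chunk.length];
        for (int k = 0; k < array.length; k++) {
            array[k] = (byte) chunk[k];
        }
        return new ByteArrayInputStream(array);
    }

    // Read every chunk back out; InputStream.read() already returns
    // values in 0-255, so the original chunk values are recovered.
    public static int[] readAll(InputStream in) throws IOException {
        List<Integer> vals = new ArrayList<>();
        int b;
        while ((b = in.read()) != -1) {
            vals.add(b);
        }
        int[] out = new int[vals.size()];
        for (int k = 0; k < out.length; k++) {
            out[k] = vals.get(k);
        }
        return out;
    }
}
```

Note that the round trip preserves chunk values above 127 even though Java's byte is signed; the stream hands the values back in the 0-255 range.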

Note that two streams are created in the code above: one to count the occurrence of each chunk via preprocessCompress and one to actually do the compression. This mirrors the original Huffman program where one stream was passed to preprocessCompress and another to the compress method. If you look at the code in HuffViewer you'll see the two streams created similarly to the two streams in the method transform above.

BWT decoding/uncompression

You should write a different magic number at the beginning of a compressed file if it's transformed/compressed rather than simply compressed by Huffman Coding. The new IHuffConstants class has a WHEELER_MAGIC value useful for this purpose.
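In the assignment the magic number is written and read with the course's BitOutputStream/BitInputStream classes. The framing idea itself can be sketched with standard streams and a placeholder constant (the real value comes from IHuffConstants.WHEELER_MAGIC, not the one invented here):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class MagicCheck {
    // Placeholder value for illustration only; use
    // IHuffConstants.WHEELER_MAGIC in the actual assignment.
    public static final int WHEELER_MAGIC = 0x57484545;

    // Prefix a payload with the magic number identifying it as
    // BWT-transformed/compressed data.
    public static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(WHEELER_MAGIC);
        out.write(payload);
        out.flush();
        return bytes.toByteArray();
    }

    // Return true if the data starts with the BWT magic number,
    // which is how the uncompressor decides how to proceed.
    public static boolean hasMagic(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return in.readInt() == WHEELER_MAGIC;
    }
}
```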

When uncompressing you reverse the process used in compressing: read compressed data, use the unmove-to-front method BurrowsWheeler.unmtf followed by the BurrowsWheeler.decode method, which reverses the original transform. In my BWT compressed file I wrote a magic number and the first index at the beginning of every block. This makes it easy to determine whether there's another block: your code can read to find the magic number and, if it's not there, return an empty/zero-length block. It is possible to avoid writing the magic number at the beginning of every block, but you'll still need to write the value of the first index and then read it when uncompressing.
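The assignment's BurrowsWheeler class supplies mtf and unmtf for you to complete. As a rough illustration of the idea only (not the class's actual implementation), move-to-front coding and its inverse over a 256-symbol alphabet can be sketched like this:

```java
import java.util.LinkedList;

public class MoveToFront {
    // Encode: replace each symbol with its current position in the
    // alphabet list, then move that symbol to the front. Runs of the
    // same symbol become runs of zeros, which Huffman codes well.
    public static char[] mtf(char[] input) {
        LinkedList<Character> alphabet = freshAlphabet();
        char[] out = new char[input.length];
        for (int k = 0; k < input.length; k++) {
            int pos = alphabet.indexOf(input[k]);
            out[k] = (char) pos;
            alphabet.remove(pos);
            alphabet.addFirst(input[k]);
        }
        return out;
    }

    // Decode: each value is an index into the list; look up the
    // symbol there and move it to the front, mirroring the encoder.
    public static char[] unmtf(char[] input) {
        LinkedList<Character> alphabet = freshAlphabet();
        char[] out = new char[input.length];
        for (int k = 0; k < input.length; k++) {
            char sym = alphabet.remove(input[k]);
            out[k] = sym;
            alphabet.addFirst(sym);
        }
        return out;
    }

    // Both sides must start from the same initial ordering: 0..255.
    private static LinkedList<Character> freshAlphabet() {
        LinkedList<Character> alphabet = new LinkedList<>();
        for (char c = 0; c < 256; c++) {
            alphabet.add(c);
        }
        return alphabet;
    }
}
```

The key invariant is that encoder and decoder perform identical list updates, so unmtf(mtf(x)) recovers x exactly.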

As in the original Huffman assignment you'll need to store the tree or the counts used to create the tree since the tree is needed to uncompress the Huff-encoded data. Here's the code I used in my BWT untransform method.

// already read WHEELER_MAGIC before calling this method the first time
public int untransform(BitInputStream bis, OutputStream out) throws IOException {
    BurrowsWheeler bw = new BurrowsWheeler();
    int chunkCount = 1;
    while (true) {
        int first = bis.readBits(BITS_PER_INT); // read first index
        if (first == -1) {
            myViewer.showError("problem getting first index");
            break;
        }
        ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
        BitOutputStream temp = new BitOutputStream(byteOut);
        doOriginalHuffUncompressAfterMagic(bis, temp);
        byte[] array = byteOut.toByteArray();
        char[] chunks = new char[array.length];
        for (int k = 0; k < chunks.length; k++) {
            byte b = array[k];
            char bp = (char) (b & 0xff);
            chunks[k] = bp;
        }
        // TODO: write code here:
        // now that you have a char[], first call unmove-to-front,
        // then call decode to untransform (you'll need first to do this),
        // then write out each char to the OutputStream out that's a parameter
        int header = bis.readBits(BITS_PER_INT);
        if (header == WHEELER_MAGIC) {
            chunkCount++;
        } else {
            break;
        }
    }
    out.flush();
    return chunkCount;
}

I use a ByteArrayOutputStream to write the compressed data to, then convert the byte[] array associated with the stream to a char[] array, which facilitates using the BurrowsWheeler code that deals with char[] arrays. I didn't originally have a method named doOriginalHuffUncompressAfterMagic, but I could have -- it simply reads the header and the bits and writes the uncompressed data, exactly as in the original Huffman assignment, but this time the output goes to the ByteArrayOutputStream instead of to a file.
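The & 0xff mask in the conversion loop matters because Java bytes are signed. A minimal demonstration of the conversion (illustrative helper name, not course code):

```java
public class ByteToChar {
    // Widen signed bytes to chars in the 0-255 range. Without the
    // 0xff mask, a byte like (byte) 200 (stored as -56) would
    // sign-extend and become char 0xFFC8 instead of 200.
    public static char[] toChunks(byte[] array) {
        char[] chunks = new char[array.length];
        for (int k = 0; k < chunks.length; k++) {
            chunks[k] = (char) (array[k] & 0xff);
        }
        return chunks;
    }
}
```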

Your code should not necessarily look like my code, but the code above shows how to use a ByteArrayOutputStream to get the uncompressed data using already-written (but perhaps refactored) code from Huff. The code above assumes that each block compressed using BWT is preceded by both a magic number and the first index required in the BWT untransform process.