One fairly detailed explanation of the BWT and its use in compression is here. You also may find this this Wikipedia entry useful.
You'll need both a high-level understanding of how BWT works to implement this project, but also expend to do some amount of work modifying your existing huffman code to easily accept this new transformation (that's what's described below).
You're also given a class BurrowsWheeler
with TODO: annotations indicating where you must add code to make
that class work as intended. There's more discussion about this
below.
BWT encoding/compression
Below is the body of my method that compresses using Huffman
Coding/Compression from the previous assignment. This isn't how I
originally wrote it, but it represents a refactoring of the code so that
separate methods are used to
write the header and the compressed bits. This separation/refactoring
makes it simpler to use the existing code
in doing the Burrows-Wheeler transform. The idea isn't to make your code
look like this, but to understand this code in the context of the
block transform.
The refactoring broke out creating the tree and codings for compression,
then writing the header, and finally writing the
bits for compression. Recall that the method compress
is above is called from the GUI/View only after chunks
have
been counted via the method preprocessCompress
. This makes
it possible to create the tree and the map from the counts.
The basic idea for doing the Burrows-Wheeler transform (BWT) is shown below, leveraging the existing Huffman Compression code as indicated by the color/font of the different sections.
One of the key ideas here is using a ByteArrayInputStream
to adapt the BWT data into a format that the existing Huffman code
supports. This type of stream can be constructed from a
byte[]
array as shown --- once the array has been
thus converted into a stream the existing stream-based methods of
the Huffman program/code can be used.
Note that two streams are created in the code above: one to count the
occurrence of each chunk via preprocessCompress
and
one to actually do the compression. This mirrors the original
Huffman program where one stream was passed to
preprocessCompress
and another to the
compress
method. If you look at the code
in HuffViewer
you'll see
the two streams created similarly to the two streams in the method
transform
above.
BWT decoding/uncompression
You should write a different magic number at the beginning of a
compressed file if it's transformed/compressed rather than simply
compressed by Huffman Coding. The new IHuffConstants
class
has a WHEELER_MAGIC value useful for this purpose.
When uncompressing you reverse the process used in compressing: read
compressed data, use the unmove-to-front method
BurrowsWheeler.unmtf
followed by the
BurrowsWheeler.decode
method which reverses the original
transform. In my BWT compressed file I wrote a magic number and
the first index at the beginning of every block. This
makes it easy to determine if there's another block, your code can
read to find the magic number and if it's not there return an
empty/zero-length block. It is possible to avoid writing the magic
number at the beginning of every block, but you'll need to write
the value of first index and then read it when uncompressing.
As in the original Huffman assignment you'll need to store the tree or
the counts used to create the tree since the tree is needed to
uncompress the Huff-encoded data. Here's the code I used in my
BWT untransform
method.
I use a ByteArrayOutputStream
to write the compressed data
to, then convert the byte[]
array associated with the
stream to a char[]
array which facilitates using the
BurrowsWheeler
code that deals with char[]
arrays. I didn't have a method
originally named doOriginalHuffUncompressAfterMagic
but I
could have -- this simply reads the header and the bits and writes
the uncompressed file -- exactly as in the original Huffman
assignment but this time the output goes to the
ByteArrayOutputStream
instead of to a file.
Your code should not necessarily look like my code, but the code above
shows how to use a ByteArrayOutputStream
to get the
uncompressed data using already-written (but perhaps refactored)
code from Huff. The code above assumes that each block compressed
using BWT is preceeded by both a magic number and the first
index required in the BWT untransform process.