IHuffProcessor
interface named SimpleHuffProcessor. Choosing
options from the GUI using this
implementation as shown on the left, below,
generates an error-dialog as shown on the right since none of the
methods are currently implemented (they each throw an exception).
You implement the methods for this assignment.
![]() |
![]() |
When you write your methods in SimpleHuffProcessor
to read
or write bits you'll need to create either BitInputStream
or
BitOutputStream
objects to read bits-at-a-time (or write
them). Information and help on how to do this is
given below, but you should probably scan this howto completely
before coding.
If your program generates an out-of-memory error when reading large files, use the Options menu in the GUI to choose Slow Reading as shown in the screen shot below.
This makes reading files slower but the GUI/View code won't map the entire file into memory before reading when you compress or uncompress a file.
The map can be an array of the appropriate size (roughly 256, but be
careful of PSEUDO_EOF) or you can use a Map
implementation. An array
is the simplest approach for this part of the huff/compress process,
using a Map is not necessary, but it's fine to use one.
Once you've tested the code above you'll be ready to create the
compressed output file. to do this you'll read the input file a second
time, but the GUI front-end does this for you when it calls the method
To uncompress the file later, you must recreate the same Huffman tree
that was used to compress (so the codes you send will match). This tree
might be stored directly in the compressed file
(e.g., using a preorder traversal), or it might be created
from 8-bit chunk counts stored in the compressed file. In either case,
this information must be coded and transmitted along with the compressed
data (the tree/count data will be stored first in the compressed file,
to be read by unhuff. There's more
information below
on storing/reading information to re-create the tree.
(For more details, see
the complete huffman coding discussion.)
The operating system will buffer output, i.e., output to disk actually
occurs when some internal buffer is full. In particular, it is not
possible to write just one single bit-at-a-time to a file, all output is
actually done in "chunks", e.g., it might be done in eight-bit chunks or
256-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10
bits, all the bits are eventually written, but you can not be sure
precisely when they're written during the execution of your
program. Also, because of buffering, if all output is done in eight-bit
chunks and your program writes exactly 61 bits explicitly, then 3 extra
bits will be written so that the number of bits written is a multiple of
eight. Because of the potential for the existence of these "extra" bits
when reading one bit at a time, you cannot simply read bits until there
are no more left since your program might then read the extra bits
written due to buffering. This means that when reading a compressed
file, you should not use code like the loop below because
the last few bits read may not have been written by your program, but
rather as a result of buffering and writing bits in 8-bit chunks.
My program generates this error when such a file is found.
In Huffman trees/tables you use in your programs, the pseudo-EOF
character/chunk always has a count of one --- this should be done
explicitly in the code that determines frequency counts. In other words,
a pseudo-char EOF with number of occurrences (count) of 1 must be
explicitly created.
In the file
You're given a
(for more details, see
the complete huffman coding discussion.)
To create a table or map of coded bit values for each 8-bit chunk you'll
need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making
an entry in the map each time you reach a leaf. For example, if you
reach a leaf that stores the 8-bit chunk 'C', following a path
left-left-right-right-left, then an entry in the 'C'-th location of the
map should be set to 00110. You'll need to make a decision about how
to store the bit patterns in the map --- the answer for this assignment
is to use a string whose only characters are '0' and '1', the string
represents the path from the root of the Huffman tree to a leaf --
and the value in the leaf has a Huffman coding represented by the
root-to-leaf path.
This means you'll need to follow every root-to-leaf path in the Huffman
tree, building the root-to-leaf path during the traversal. When you
reach a leaf, the path is that leaf value's encoding. One way to do this
is with a method that takes a
For example, in my working program that only
works with my compressed files, not other standard formats,
I have the following code:
In general, a file with the wrong magic number should not generate an
error, but should notify the user. For example, in my program the
exception above ultimately causes the user to see what's shown
below. This is because the exception is caught and the viewer's
In my non-saving-space
code using SCF, my header is written by the following code. Note that
To write the tree for extra credit, think about the 20
questions program.
For example,
you can use a 0 or 1 bit to differentiate between internal nodes and
leaves. The leaves must store character values (in the
general case using 9-bits because of the pseudo-eof character).
For example, the sequence of 0's and 1's below represents the tree on
the right (if you write the 0's and 1's the spaces wouldn't appear, the
spaces are only to make the bits more readable to humans.)
The first 0 indicates a non-leaf, the second 0 is the left child of
the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits
that represent 97 (001100001 is 97 in binary), the Unicode/ASCII code
for 'a'. Then there's a 1 for the right child of the left child of the
root, it stores 32 (000100000 is 32 in binary), the ASCII value of a
space. The next 1 indicates the right child of the root is a leaf, it
stores the Unicode/ASCII value for a 't' which is 116 (001110100 is
116 in binary).
Your program can write these bits using a standard pre-order
traversal. You can then read them by reading a bit, then recursively
reading left/right subtrees if the bit is a zero (again, think about
the 20-questions/animal program).
Standard Tree Format in the Huff program/suite uses a pre-order
traversal, a single zero-bit for internal nodes, a single one-bit for a
leaf, and nine bits for the value stored in a leaf. For extra/A credit
your program should be able to read/write this format.
Designing debugging functions as part of the original program will
make the program development go more quickly since you will be able to
verify that pieces of the program, or certain classes, work
properly. Building in the debugging scaffolding from the start will make
it much easier to test and develop your program. When testing, use small
examples of test files maybe even as simple as "go go gophers" that help
you verify that your program and classes are functioning as intended.
You might want to write encoding bits out first as strings or
printable int values rather than as raw bits of zeros and ones which
won't be readable except to other computer programs. A Compress
class, for example, could support printAscii functions and
printBits to print in human readable or machine readable
formats.
We cannot stress enough how important it is to develop your program
a few steps at a time. At each step, you should have a functioning program,
although it may not do everything the first time it's run. By developing
in stages, you may find it easier to isolate bugs and you will be more
likely to get a program working faster. In other words, do not write
hundreds of lines of code before compiling and testing
Compressing using Huffman Coding
The three steps below summarize how compression works and provide
some advice on coding.
int
variables/valuse in your
code rather than char
. Note that the method for reading
bits-at-a-time from a BitInputStream
returns an
int
, so
using int
variables makes sense (this might be different
in the Burrows-Wheeler code you write.)
Any wording
in this write-up that uses the word character means an
8-bit chunk and this chunk-size could (in theory) change.
Do not use any variables of type byte
in your
program. Use only int variables, or perhaps char variables if
you implement the Burrows-Wheeler project.
int
value) to Huffman-codings. The map of chunk-codings
is formed by traversing the path from the root of the Huffman tree to
each leaf. Each root-to-leaf path creates a chunk-coding for the value
stored in the leaf. When going left in the tree append a zero to the
path; when going right append a one. The map has the 8-bit
int
chunks as keys and the corresponding
Huffman/chunk-coding String as the value associated with the key.
IHuffProcessor.compress
to do the compression. For each
8-bit chunk read, write the corresponding encoding of the 8-bit chunk
(obtained from the map of encodings) to the compressed file. You
write bits using a BitOutputStream
object, you
don't write Strings/chars. Instead you write one-bit, either a zero
or a one, for each corresponding character '0' or '1' in the string that
is the encoding.
Help With Coding
The sections below contain explanations and advice on different aspects
of the code you'll write to compress data.
Pseudo-EOF character
IHuffConstants
the
number of characters counted is specified by ALPH_SIZE
which has value 256. Although only 256 values can be represented by
8 bits, these values are between 0 and 255, inclusive. One character is
used as the pseudo-EOF character -- it must be a value not-representable
with 8-bits, the smallest such value is 256 which requires 9 bits to
represent. However, ideally your program should be able to work with
n-bit chunks, not just 8-bit chunks.
Priority Queues
TreeNode
that implements Comparable
. You can use this class in
storing weighted character/chunk objects in a priority queue to make a
Huffman tree.
Creating a Map/Table from a Huffman-tree
TreeNode
parameter and a
String
that represents the path to the node. Initially the
string is empty ""
and the node is the global root. When your code
traverses left, a "0"
is added to the path, and similarly a
"1"
is added when going right.
Writing Bits in the Compressed File
There are three steps in writing a compressed file from the
information your code determined and stored: the counts and
encodings. All this code is written/called from the
IHuffProcessor.compress
method which is called from the
GUI after the IHuffProcess.preprocessCompress
method has
been called to set state appropriately in your model.
IHuffConstants.MAGIC_NUMBER
value either without the IHuffConstants
modifier in your
IHuffProcessor
implementation (because the latter
interface extends the former) or using the complete
IHuffConstants.MAGIC_NUMBER
identifier. When you
uncompress you'll read this number to ensure you're reading a file
your program compressed. Your program should be able to
uncompress files it creates. For extra credit you should be
able to process standard headers, specified by magic
numbers STORE_COUNTS
and STORE_TREE
in the
IHuffConstants
interface. There's also a value for
custom headers.
showError
method called appropriately. Your code should
at least print a message, and ideally generate an error dialog as
shown.
Header Information
ALPH_SIZE
counts as int
values, but you can
also write the tree. Writing the counts will create a header in standard count
format or SCF. This is a header of 255 counts,
one 32-bit int value for each 8-bit chunk, in order from 0-255.
You don't need a count for pseudo-EOF because it's one.
BITS_PER_INT
is 32 in Java.
Tree Header for Extra Credit
0 0 1 001100001 1 000100000 1 001110100
Implementing and Debugging
It's a good idea to either create more than one class to help manage the
complexity in these programs or to add methods/code incrementally after
each has been tested. Because the same data structures need to used to
ensure that a file compressed using your huff algorithm can be
decompressed, you should be able to share several parts of the
implementation. You can use classes to exploit this similarity.
Debugging Code
Using BitInputStream
In order to read and write in a bit-at-a-time manner, two classes are
provided BitInputStream and
BitOutputStream.
Bit read/write subprograms
To see how the readBits routine works, note that the code
segment below is functionally equivalent to the Unix command cat
foo --- it reads BITS_PER_WORD bits at a time (which is
8 bits as defined in IHuffConstants) and echoes what is
read.
System.out.println(7)
. results in 32 bits being written because a
Java int uses 32
bits. Executing obs.writeBits(3,7)
results in 3 bits being
written (to the BitOutputStream obs) --- all the bits are 1
because the number 7 is represented in base two by 000111.
When using writeBits
to write a specified number of
bits, some bits may not be written immediately because of
buffering. To ensure that all bits are written, the
last bits must be explicitly flushed. The function flush
must be called either explicitly or by calling
close
.
Although readBits can be called to read a single bit at a time (by setting the parameter to 1), the return value from the method is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:
Of course exceptions may need to be caught or rethrown. For input,
you'll need to always create a
You can choose a force compression option from the GUI/Options
menu. If this is chosen/checked, the value of the third parameter to
To help with this you can use the Diff.java program. Launching this will
prompt you to choose files. You should select two files, not just
one. To select two files use either command-click or control-click
according to Mac/Windows to select the second file. The program will
then tell you whether the two files you've chosen are the same or
are different. On Linux and Macs there's a command-line tool
called diff that does this, but the Java program can be
used for purposes of the Huffman assignment.
InputStream objects
In Java, it's simple to construct one input stream from another. The
Viewer/GUI code that drives the model will send an
InputStream
object to the model for readable-files, it will
also send an OutputStream
for writeable-files. The
client/model code you write will need to wrap this stream
in an appropriate BitInputStream
or
BitOutputStream
object.
BitInputStream
object to
read chunks or bits from. For the output stream, you may need to create
a BitOutputStream
to write individual bits, so you should
create such a stream -- for uncompressing it's possible to just write
without creating a BitOutputStream
using the
OutputStream.write
method, but you'll find it simpler to use
BitOutputStream.writeBits
method.
Forcing Compression
If compressing a file results in a file larger than the file being
compressed (this is always possible) then no compressed file should be
created and a message should be shown indicating that this is the
case. Here's a screen shot from what happens in my program.
IHuffProcessor.compress
is true, and your code should
"compress" a file even though the resulting file will be
bigger. Otherwise (force is false), if the compressed file is bigger,
your program should not compress and should generate an error
such as the one shown above.
Same File Uncompressed/Diff
When you compress a file, e.g., foo.txt to foo.txt.hf
and
then uncompress it to foo.txt.unhf, you'll want to see whether
the .unhf file is the same as the original. For very small text files
you can verify this by eyeballing the file. But for large files, and
for non-text files (e.g., .jpg, .mp3) you'll need a program to help
with this.