I’m going through my list of interesting assignments that are worth writing about and now is Huffman coding‘s turn. Simply put, we had to create a completely functioning file format implementing a compression method based on Huffman’s coding. The implementation of the coding itself is well documented, but the header of the file is the part that interests us. How can we describe the dictionary for the coding the most efficiently?
The coding itself is very simple. We count the number of occurrences of each byte, then we assign each character a bit string in a way that produces the shortest compressed value. Like I said, simple… Ok, maybe not so much. It’s easier to comprehend if we consider that he number of bits required is equal to log2(fileSize / characterCount). Still not clear on how to do it? Well then you’re just going to have to hope I got it right! ;)
The Wikipedia article explains how to implement this much better than I ever can. So I will concentrate on the file header for the rest of this article.
Clean the garbage
Another problem to take care of is the garbage at the end of the file. The Windows file system will always store the file on a integer number of bytes. This means if our encoded file has 17 bits, it will be written as 24 bits. When we read it, we will read 7 bits of garbage (zeros) at the end of the file! This can translate to a magic character appearing at the end of the file. To avoid that, we compress the content and then we write a 3 bits unsigned integer (the garbage length), then a series of zeros the same length as the garbage.
In our submission, we wrote this as a header, meaning we keep the whole compressed content in memory to measure the garbage lenght, then we add the garbage as a header. This requires additional resources to compress large files because we keep the whole compressed content in memory. We cannot put this garbage at the end of the file. It is possible to use the preprocessed character frequencies and the generated character mappings to calculate the total length of the file and modulo this by 8 to obtain the trash length without keeping the output in memory. We did not do this because of a lack of time.
Note that we already cache the input because we need to calculate the character frequencies so we already waste resources. Most compression algorithms split the input into chunks to mitigate this problem.
Plant the tree
Next, we need to serialize our tree into the file header so it can be decompressed, otherwise it is impossible to know what each series of bits means or even to know how to split each character! This part is really the heart of our optimizations.
To represent the tree, we came up with a code. First, we write the characters that we have in the tree in depth first order from one side to the other. Next, we write a binary string representing the tree structure using the following code:
- 0 means we go deeper to the right
- 1 means we place the next letter in the list, then we go up in the tree until we reach a parent from the right and then we go deeper to the left (if we go went from the left, then went left, we would be navigating the same tree branch twice).
Using this encoding scheme, it is possible to encode the structure of the complete tree with as many bits as there are branches in it.
Right now, we write the character count before listing the characters. This is a waste of one byte. It is possible to trim this byte by placing the tree before the character list. This is due to the fact that the last “1″ in the tree structure definition will cause the root to return to its parent, marking the last character in the tree. Using this, we can count the number of ones in the tree structure definition and read the correct number of characters from the file.
Thanks and acknowledgements
We have used parts of the BitIO library which is LGPL licenced (and not owned by us). The rest of the code is licenced under the MIT Licence. This gives you the right to customize and distribute this as closed source, in which case you must replace the BitInputStream and BitOutputStream classes with your own to get rid of the LGPL licence on them. Simply, I just don’t want to be sued, but you can do whatever you want with this Huffman encoder other than that.
As for the Sudoku solver, thanks to my teammates: Mathieu Lacasse-Potvin and Michael Badeau.
The program was part of an assignment for the class LOG320 at École de Technologie Supérieure in Montréal during summer 2010.
Here are the sources (zipped: 10KB).
The utility is called from command line. To use it, simply pass in the letter “e” for encode or “d” for decode, followed by the input and output files. These three parameters are all you need to encode and decode files using this utility.