38.4 Huffman Coding Conceptuals
Huffman coding takes a bottom-up approach to prefix-free codes, as opposed to the top-down approach taken by Shannon-Fano codes. The algorithm is as follows:
- Calculate relative frequencies.
- Assign each symbol to a node with weight = relative frequency.
- Take the two smallest nodes and merge them into a super node with weight equal to sum of weights.
- Repeat until everything is part of a tree.
Huffman Coding: step by step example (Part 1)
Huffman Coding: step by step example (Part 2)
Let's now think about the data structures we would use for the encoding and decoding processes of the Huffman Coding process. Recall that encoding will translate symbols to code words and decoding will do the opposite. An example is as follows:
- Encoding translates
- Decoding translates
There are two options!
- 1.HashMap/TreeMap: create a map from character to bit sequence, calling the
get()method to look up each character
- 2.Array: Each index of the array would represent the character, with the bit sequence in that slot of the array. Recall that each character is just an integer. For example, the letter
What's the difference? Arrays are faster than maps but might use more memory if indices are unused.
There is really only one good data structure that can help us find longest prefixes of bit streams.
Trie: Here, we can use a Binary Trie with numbers 0 and 1. When we get the bitstream, the trie allows easy lookup to the longest prefix. Below is an image that demonstrates how the trie could look.
Now that we have talked about what Huffman Coding is, let's talk about some practical issues that arise when we want to use it. There are two main philosophies we can use.
- For each input type (English text, Chinese text, images), assemble huge numbers of sample inputs for each category. Use corpus to create a standard code for English, Chinese, etc.
- A corpus is a collection of pieces of language to be used as a sample of the language. Below is an example where we specify that we want to use the
ENGLISHcorpus to compress
$ java HuffmanEncodePh1 ENGLISH mobydick.txt
Problem: Suboptimal encoding, which means our corpus does not exactly match our input. What this means in the context of this example is that
mobydick.txtmight not match up well with the general frequencies of
ENGLISHtexts and instead might have some other quirks or specifications from the author.
- For every possible input file, create a unique code just for that file. Then, when someone receives that file, they will know how to decode it using the code we send along the compressed file.
- As seen in the example below, we do not specify a corpus and we send along another file to help the decoding process for that specific file.
$ java HuffmanEncodePh2 mobydick.txt
Problem: This approach requires us to use extra space for the codeword table in the compressed bitstream. However, this generally works better than the corpus philosophy so is used commonly in the real world.