Comment on page
39.1 Models of Compression
In the last chapter, we saw many approaches to compression. This raises an interesting question: for a given bitstream, what is the best algorithm for compression?
For example, consider compression the text of Moby Dick using different compression formats. In this case,
bzip2
yields the best compression.
Different compression formats applied to
mobydick.txt
One might argue, however, that the best possible compression algorithm for
mobydick.txt
is simply as follows:D(B):
if input == 0:
return "Call me Ishamel. ...."
else:
return the text as is
Using this as our decompression function, we can condense all of Moby Dick into a single bit!
However, there is a problem with this approach. If we include the code for the decompression algorithm as part of the compressed model (recall compression model 2 from the previous chapter), we see that Moby Dick is not compressed to one bit, but actually requires more bits than the original text!
Ultimately, the goal of a compresion algorithm is to find short sequences of bits that generate desired longer sequences of bits. Formally stated, our problem is as follows:
- Given a sequence of bits
B
, find a shorter sequenceDA + C(B)
that producesB
when fed into an interpreter.DA
represents the bits for the decompression algorithm, whileC(B)
represents the compressed version ofB
.

Our compression model applied to
mobydick.txt
.Recall the
HugPlant
example from the previous chapter. Using Huffman encoding, we can achieve a compression ratio of 25%. However, using another algorithm we'll call
MysteryX
for now, we can compress HugPlant.bmp
down to 29,432 bits! This achieves a 0.35% compression ratio. Out of the possible bistreams of length
, only one in
can achieve such a compression ratio.

A mystery compression algorithm that outperforms Huffman encoding by a large margin.
What is
MysteryX
? Well, it's simply the Java code HugPlant.java
written to generate the .bmp
file! Going back to the model of self-extracting bits, we see the power of code and interpreters in compression. This leads us to two interesting questions:- comprehensible compression: given a target bitstream
B
, can we create an algorithm that outputs useful, readable Java code? - optimal compression: given a target bitstream
B
, can we find the shortest possible program that outputs this bitstream?