39.1 Models of Compression
Last updated
Last updated
In the last chapter, we saw many approaches to compression. This raises an interesting question: for a given bitstream, what is the best algorithm for compression?
For example, consider compression the text of Moby Dick using different compression formats. In this case, bzip2
yields the best compression.
One might argue, however, that the best possible compression algorithm for mobydick.txt
is simply as follows:
Using this as our decompression function, we can condense all of Moby Dick into a single bit!
However, there is a problem with this approach. If we include the code for the decompression algorithm as part of the compressed model (recall compression model 2 from the previous chapter), we see that Moby Dick is not compressed to one bit, but actually requires more bits than the original text!
Ultimately, the goal of a compresion algorithm is to find short sequences of bits that generate desired longer sequences of bits. Formally stated, our problem is as follows:
Given a sequence of bits B
, find a shorter sequence DA + C(B)
that produces B
when fed into an interpreter. DA
represents the bits for the decompression algorithm, while C(B)
represents the compressed version of B
.
Recall the HugPlant
example from the previous chapter. Using Huffman encoding, we can achieve a compression ratio of 25%.
However, using another algorithm we'll call MysteryX
for now, we can compress HugPlant.bmp
down to 29,432 bits! This achieves a 0.35% compression ratio. Out of the possible bistreams of length , only one in can achieve such a compression ratio.
What is MysteryX
? Well, it's simply the Java code HugPlant.java
written to generate the .bmp
file! Going back to the model of self-extracting bits, we see the power of code and interpreters in compression. This leads us to two interesting questions:
comprehensible compression: given a target bitstream B
, can we create an algorithm that outputs useful, readable Java code?
optimal compression: given a target bitstream B
, can we find the shortest possible program that outputs this bitstream?