39.1 Models of Compression

Comparing Compression Algorithms

In the last chapter, we saw many approaches to compression. This raises an interesting question: for a given bitstream, what is the best algorithm for compression?

For example, consider compression the text of Moby Dick using different compression formats. In this case, bzip2 yields the best compression.

One might argue, however, that the best possible compression algorithm for mobydick.txt is simply as follows:

D(B):
    if input == 0:
        return "Call me Ishamel. ...."
    else:
        return the text as is

Using this as our decompression function, we can condense all of Moby Dick into a single bit!

However, there is a problem with this approach. If we include the code for the decompression algorithm as part of the compressed model (recall compression model 2 from the previous chapter), we see that Moby Dick is not compressed to one bit, but actually requires more bits than the original text!

Decompression Algorithms

Ultimately, the goal of a compresion algorithm is to find short sequences of bits that generate desired longer sequences of bits. Formally stated, our problem is as follows:

  • Given a sequence of bits B, find a shorter sequence DA + C(B) that produces B when fed into an interpreter. DA represents the bits for the decompression algorithm, while C(B) represents the compressed version of B.

Improving Compression

Recall the HugPlant example from the previous chapter. Using Huffman encoding, we can achieve a compression ratio of 25%.

However, using another algorithm we'll call MysteryX for now, we can compress HugPlant.bmp down to 29,432 bits! This achieves a 0.35% compression ratio. Out of the 283895942^{8389594} possible bistreams of length 8,389,5948,389,594, only one in 283601512^{8360151} can achieve such a compression ratio.

What is MysteryX? Well, it's simply the Java code HugPlant.java written to generate the .bmp file! Going back to the model of self-extracting bits, we see the power of code and interpreters in compression. This leads us to two interesting questions:

  • comprehensible compression: given a target bitstream B, can we create an algorithm that outputs useful, readable Java code?

  • optimal compression: given a target bitstream B, can we find the shortest possible program that outputs this bitstream?

Last updated