Comment on page

# 39.1 Models of Compression

In the last chapter, we saw many approaches to compression. This raises an interesting question: for a given bitstream, what is the best algorithm for compression?

For example, consider compression the text of Moby Dick using different compression formats. In this case,

`bzip2`

yields the best compression.Different compression formats applied to

`mobydick.txt`

One might argue, however, that the best possible compression algorithm for

`mobydick.txt`

is simply as follows:D(B):

if input == 0:

return "Call me Ishamel. ...."

else:

return the text as is

Using this as our decompression function, we can condense all of Moby Dick into a single bit!

However, there is a problem with this approach. If we include the code for the decompression algorithm as part of the compressed model (recall compression model 2 from the previous chapter), we see that Moby Dick is not compressed to one bit, but actually requires

*more*bits than the original text!Ultimately, the goal of a compresion algorithm is to find short sequences of bits that generate desired longer sequences of bits. Formally stated, our problem is as follows:

- Given a sequence of bits
`B`

, find a shorter sequence`DA + C(B)`

that produces`B`

when fed into an interpreter.`DA`

represents the bits for the decompression algorithm, while`C(B)`

represents the compressed version of`B`

.

Our compression model applied to

`mobydick.txt`

.Recall the

`HugPlant`

example from the previous chapter. Using Huffman encoding, we can achieve a compression ratio of 25%. However, using another algorithm we'll call

`MysteryX`

for now, we can compress `HugPlant.bmp`

down to 29,432 bits! This achieves a 0.35% compression ratio. Out of the $2^{8389594}$

possible bistreams of length $8,389,594$

, only one in $2^{8360151}$

can achieve such a compression ratio.A mystery compression algorithm that outperforms Huffman encoding by a large margin.

What is

`MysteryX`

? Well, it's simply the Java code `HugPlant.java`

written to generate the `.bmp`

file! Going back to the model of self-extracting bits, we see the power of code and interpreters in compression. This leads us to two interesting questions:**comprehensible compression:**given a target bitstream`B`

, can we create an algorithm that outputs useful, readable Java code?**optimal compression**: given a target bitstream`B`

, can we find the*shortest*possible program that outputs this bitstream?