38.5 Compression Theory

Compression Ratios

The goal of data compression is to reduce the size of a sequence of data while retaining as much information as possible. For example, the letter e appears more frequently in the English dictionary than z, so we would want to represent e with smaller bits.
Compression ratio is a measure of how much the size of the compressed data differs from the original data.
- Huffman Coding is a compression technique that represents common symbols with smaller numbers of bits, resulting in a more efficient encoding.
- Run-length encoding is another compression technique that replaces repeated characters with the character itself and the number of times it occurs.
- LZW is a compression technique that searches for common repeated patterns in the input and replaces them with a shorter code.
The general idea behind most compression techniques is to exploit any existing redundancy or order within the sequence to reduce the size of the data. However, if a sequence has no existing redundancy or order, compression may not be possible.

Self-extracting bits is a compression technique that wraps the compressed bits and the decompression algorithm into a single sequence of bits.
The goal is to simplify the compression and decompression process by combining the two steps into one. Self-extracting bits can be used to create executable files that can be run on any system with an interpreter (e.g. Java interpreter).

To compress an image file like hugplant.bmp, we can break it into 8-bit chunks and Huffman encode each chunk.
We package the compressed data plus decoder into a single self-extracting .java file, represented as a byte[] array.
When the byte[] array is passed to an interpreter, the interpreter executes the Huffman decoding algorithm and produces the original hugplant.bmp image file.
The size of the compressed bitstream and the Huffman decoding algorithm combined is smaller than the original image file, making it more efficient to store and transmit.
However, the receiver of the compressed file must have access to the appropriate interpreter to decode and reconstruct the original image.

Last updated 2 years ago