38.5 Compression Theory
- The goal of data compression is to reduce the size of a sequence of data while retaining as much information as possible. For example, the letter
e
appears more frequently in the English dictionary thanz
, so we would want to represente
with smaller bits. - Compression ratio is a measure of how much the size of the compressed data differs from the original data.
- Huffman Coding is a compression technique that represents common symbols with smaller numbers of bits, resulting in a more efficient encoding.
- Run-length encoding is another compression technique that replaces repeated characters with the character itself and the number of times it occurs.
- LZW is a compression technique that searches for common repeated patterns in the input and replaces them with a shorter code.
- The general idea behind most compression techniques is to exploit any existing redundancy or order within the sequence to reduce the size of the data. However, if a sequence has no existing redundancy or order, compression may not be possible.
- Self-extracting bits is a compression technique that wraps the compressed bits and the decompression algorithm into a single sequence of bits.
- The goal is to simplify the compression and decompression process by combining the two steps into one. Self-extracting bits can be used to create executable files that can be run on any system with an interpreter (e.g. Java interpreter).

- To compress an image file like
hugplant.bmp
, we can break it into 8-bit chunks and Huffman encode each chunk. - We package the compressed data plus decoder into a single self-extracting
.java
file, represented as abyte[]
array. - When the
byte[]
array is passed to an interpreter, the interpreter executes the Huffman decoding algorithm and produces the originalhugplant.bmp
image file. - The size of the compressed bitstream and the Huffman decoding algorithm combined is smaller than the original image file, making it more efficient to store and transmit.
- However, the receiver of the compressed file must have access to the appropriate interpreter to decode and reconstruct the original image.
Last modified 4mo ago