# 38.5 Compression Theory

## Compression Ratios

* The goal of data compression is to reduce the size of a sequence of data while retaining as much information as possible. For example, the letter `e` appears more frequently in the English dictionary than `z`, so we would want to represent `e` with smaller bits.
* **Compression ratio** is a measure of how much the size of the compressed data differs from the original data.&#x20;
  * **Huffman Coding** is a compression technique that represents common symbols with smaller numbers of bits, resulting in a more efficient encoding.&#x20;
  * **Run-length encoding** is another compression technique that replaces repeated characters with the character itself and the number of times it occurs.&#x20;
  * **LZW** is a compression technique that searches for common repeated patterns in the input and replaces them with a shorter code.&#x20;
* The general idea behind most compression techniques is to exploit any existing redundancy or order within the sequence to reduce the size of the data. However, if a sequence has no existing redundancy or order, compression may not be possible.

## Self-Extracting Bits

* Self-extracting bits is a compression technique that wraps the compressed bits and the decompression algorithm into a **single sequence of bits**.
* The goal is to simplify the compression and decompression process by combining the two steps into one. Self-extracting bits can be used to create executable files that can be run on any system with an interpreter (e.g. Java interpreter).

<figure><img src="https://2316889115-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCLYj7ccqvV6l4Pt9R0w5%2Fuploads%2FJemTikv95FxAqt71xrYG%2FScreenshot%202023-04-24%20at%209.34.44%20PM.png?alt=media&#x26;token=b0f9e455-4563-43ad-974f-63e8dd84b508" alt=""><figcaption></figcaption></figure>

## HugPlant Example

* To compress an image file like `hugplant.bmp`, we can break it into 8-bit chunks and Huffman encode each chunk.
* We package the compressed data plus decoder into a single self-extracting `.java` file, represented as a `byte[]` array.
* When the `byte[]` array is passed to an interpreter, the interpreter executes the Huffman decoding algorithm and produces the original `hugplant.bmp` image file.
* The size of the compressed bitstream and the Huffman decoding algorithm combined is **smaller** than the original image file, making it more efficient to store and transmit.
* However, the receiver of the compressed file must have access to the appropriate interpreter to decode and reconstruct the original image.
