21.5.1 Lossless compression algorithms
In a lossless compression algorithm, the input data is encoded in such a way that its length is decreased. After decoding, however, all of the input data can be reconstructed. The opposite of lossless compression is lossy compression, which is common when encoding multimedia data. Here, parts of the input data that are deemed unimportant for audiovisual perception are actually thrown away by the algorithm and cannot be reconstructed. In what follows, however, we will only be concerned with lossless compression and therefore frequently omit the attribute lossless.
A very common compression algorithm used in internet protocols (and elsewhere) is called DEFLATE. It was designed by the American computer scientist Phil Katz for version 2.0 of his archiving tool PKZIP, released in 1993. In 1996, the algorithm was also specified in RFC 1951. Technically, DEFLATE combines a so-called Huffman code with the LZ77 data compression algorithm.
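To see DEFLATE at work, the following minimal sketch uses Python's standard zlib module, whose compression routines implement DEFLATE as specified in RFC 1951 (wrapped in the zlib container format of RFC 1950). The input string is chosen arbitrarily for illustration:

import zlib

data = b"this is an example of a huffman tree " * 10   # deliberately repetitive input

compressed = zlib.compress(data)        # DEFLATE, wrapped in the zlib container
restored = zlib.decompress(compressed)  # lossless: the original bytes are recovered

print(len(data), "->", len(compressed), "bytes")
assert restored == data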
Huffman codes are a specific type of optimal prefix code commonly used in lossless and lossy data compression algorithms, for example, in ZIP and JPEG. The process of finding such codes is called Huffman coding, and it was discovered in 1952 by the American computer science pioneer David Huffman [201].
The output of Huffman coding is a variable-length code derived from the occurrence frequencies of the input symbols: frequently used symbols are encoded using fewer bits than less frequent symbols.
As an example, in the sentence this is an example of a huffman tree, the space character occurs seven times, the character "f" occurs three times, and the character "x" occurs once. The Huffman code generated from the character frequencies in that sentence encodes the space character as the 3-bit sequence 111, character "f" as the 4-bit sequence 1101, and character "x" as the 5-bit sequence 10010.
As a result, Huffman codes reduce the overall data size. In the preceding example, encoding the sentence using the Huffman code requires 135 bits as opposed to 288 bits if the 36 characters in the sentence were to be encoded using 8 bits each.
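The construction can be illustrated with the following minimal sketch, which uses only the Python standard library: it builds a Huffman code from the character frequencies of the example sentence and measures the length of the resulting encoding. Any valid Huffman code for this sentence yields the same total length of 135 bits, even though the individual code words may differ:

import heapq
from collections import Counter
from itertools import count

def huffman_code(text):
    # Build a Huffman tree bottom-up by repeatedly merging the two least
    # frequent subtrees; the counter breaks ties so heapq never compares trees.
    tie = count()
    heap = [(freq, next(tie), symbol) for symbol, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    code = {}
    def assign(node, prefix=""):
        if isinstance(node, tuple):          # internal node: descend left and right
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: record the code word for the symbol
            code[node] = prefix or "0"
    assign(heap[0][2])
    return code

sentence = "this is an example of a huffman tree"
code = huffman_code(sentence)
encoded_length = sum(len(code[ch]) for ch in sentence)
print(encoded_length, "bits vs", 8 * len(sentence), "bits uncompressed")  # 135 vs 288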
In 1977, the Israeli computer scientist Abraham Lempel and the Israeli electrical engineer Jacob Ziv published a lossless data compression algorithm, LZ77. The algorithm works by replacing repeated occurrences of symbol sequences with references to a previous occurrence of that sequence in the input data. Every repeated sequence is replaced by a pair of numbers called a distance-length pair, where the distance points back to the earlier occurrence and the length gives the number of repeated symbols, thereby reducing the size of the data.
As an example, the phrase
Google is so googley
would be encoded to
Google is so g(-13, 5)y.
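How a decoder resolves such a reference can be seen in the following toy sketch. The token format used here (literal strings and (distance, length) pairs, with the distance given as a positive number of symbols to step back) is invented purely for illustration and is not DEFLATE's actual wire format:

def lz77_decode(tokens):
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)               # literal symbols are copied verbatim
        else:
            distance, length = token
            for _ in range(length):         # copy symbol by symbol from 'distance'
                out.append(out[-distance])  # positions back, allowing overlaps
    return "".join(out)

print(lz77_decode(["Google is so g", (13, 5), "y"]))   # Google is so googley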
To achieve the compression by identifying recurring symbol sequences, LZ77 must keep track of a pre-configured amount of the most recently processed data using a data structure called the sliding window. During encoding, the sliding window is used to find repeating sequences; during decoding, it is used to resolve the references made by the encoder to earlier occurrences. The size of the sliding window (typical values being 2 KB, 4 KB, or 32 KB) determines how far back the encoder can search for repeated sequences and create references.
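The effect of the window size can be demonstrated with zlib, whose wbits parameter selects a DEFLATE window of 2^wbits bytes. In the following sketch, the only redundancy in the input lies 2 KB apart, so only windows of at least 4 KB can exploit it; the data and sizes are chosen purely for illustration:

import os
import zlib

chunk = os.urandom(2048)     # 2 KB of incompressible random bytes
data = chunk * 16            # 32 KB in total; the only redundancy is the
                             # repetition of the chunk at a distance of 2 KB

for wbits in (9, 12, 15):    # window sizes of 512 B, 4 KB, and 32 KB
    compressor = zlib.compressobj(level=9, wbits=wbits)
    size = len(compressor.compress(data) + compressor.flush())
    print(f"window {2 ** wbits:>5} bytes -> {size} bytes compressed")

With the 512-byte window, the encoder never sees the previous copy of the chunk and the output stays close to the input size; with the larger windows, each repetition is replaced by a short back-reference.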