

A Gentle Introduction to Data Compression - ramen2387
http://bertolami.com/index.php?engine=blog&content=posts&detail=fundamentals-of-data-compression

======
userbinator
I think that for beginners it's far better to talk about the LZ-style
algorithms first, since they are much easier to understand without requiring
any discussion of probabilities and the associated maths; the idea of
replacing repeated strings with a reference to them is very intuitive, and so
is the implementation of an LZ compressor/decompressor. In fact, an in-memory
LZ decompressor core can be written in less than two dozen bytes of x86
instructions, and can easily outperform the fastest Huffman decompressor.
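
For anyone who wants to see how little the decoder actually does, here is a
minimal sketch of an LZ77-style decoder in Python rather than x86 (the token
format, a literal byte or a (distance, length) back-reference, is just an
illustrative choice, not any real container format):

    def lz_decompress(tokens):
        """Decode a token stream of ('lit', byte) and ('ref', distance, length)."""
        out = bytearray()
        for tok in tokens:
            if tok[0] == 'lit':
                out.append(tok[1])
            else:
                _, distance, length = tok
                # Copy byte by byte so overlapping matches (length > distance)
                # work; that's how LZ expresses runs of repeated data.
                for _ in range(length):
                    out.append(out[-distance])
        return bytes(out)

    # "abcabcabc" as three literals plus one overlapping back-reference.
    tokens = [('lit', c) for c in b"abc"] + [('ref', 3, 6)]
    assert lz_decompress(tokens) == b"abcabcabc"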

When I was interested enough in compression to read about it, most of the
information I could find seemed highly theoretical and made me think that
Huffman was a good first step, so I wrote a simple static/dynamic Huffman
compressor/decompressor, but it didn't work out so well. It was "fast
enough", but the compression ratio was disappointing. When I discovered LZ
the difference was amazing: an _extremely_ simple algorithm, requiring about
an order of magnitude less code to implement, that compressed much better
than Huffman on data which was intuitively compressible.

In retrospect, it's quite clear that Huffman and other "variable length code"
techniques only work because of nonuniformity in symbol probabilities, so it
is ineffective on data containing repeating patterns but uniform probability.
E.g. take a string 256 bytes long containing all the byte values 0, 1, 2, ...
253, 254, 255; and repeat that string multiple times in a file. LZ can easily
compress that pattern but Huffman can't. (Does LZ rely on probability
nonuniformity? It doesn't look like it.) The fact that many common methods
like DEFLATE use Huffman only to compress the output of the LZ stage
probably reflects this, but I don't think I've seen such an intuitive
explanation for it before.
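
You can see this directly with Python's zlib, which implements DEFLATE and
exposes a Z_HUFFMAN_ONLY strategy that disables the LZ77 matching stage (a
rough sketch; exact output sizes will vary with the zlib version):

    import zlib

    data = bytes(range(256)) * 64   # 16 KiB: every byte value equally frequent

    def deflate(payload, strategy):
        c = zlib.compressobj(level=9, strategy=strategy)
        return c.compress(payload) + c.flush()

    huffman_only = deflate(data, zlib.Z_HUFFMAN_ONLY)      # entropy coding only
    lz_plus_huff = deflate(data, zlib.Z_DEFAULT_STRATEGY)  # LZ77 + Huffman

    # Huffman alone stays near the original 16 KiB because the symbol
    # frequencies are uniform; with the LZ77 stage enabled the repeats
    # collapse to a few hundred bytes.
    print(len(data), len(huffman_only), len(lz_plus_huff))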

~~~
Scaevolus
I agree. Transforms (LZ77, BWT, RLE, ...) are generally more interesting and
productive than entropy coding.
[http://mattmahoney.net/dc/dce.html](http://mattmahoney.net/dc/dce.html)
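
As a toy example of a transform, here is a run-length encoding sketch in
Python (the (count, byte) token layout and the function names are just
illustrative choices):

    def rle_encode(data):
        """Emit (count, byte) pairs; each run is capped at 255."""
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out += bytes([run, data[i]])
            i += run
        return bytes(out)

    def rle_decode(encoded):
        out = bytearray()
        for count, value in zip(encoded[::2], encoded[1::2]):
            out += bytes([value]) * count
        return bytes(out)

    assert rle_decode(rle_encode(b"aaaabbbc")) == b"aaaabbbc"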

------
infogulch
Very nice introduction.

Recently I've been following the progress of the Finite State Entropy
algorithm[1]. From the readme on the github project [2]:

> FSE is a new kind of entropy encoder, based on ANS theory, from Jarek Duda.

> It is designed to compete with Huffman and arithmetic encoders.

> While Huffman is fast, it can only represent probabilities that are powers
> of 2 (50%, 25%, etc.); arithmetic coding can represent probabilities with
> much better accuracy, but requires multiplications and divisions.

> FSE solves this dilemma by providing precise probabilities, like arithmetic
> coding does, but using only additions, masks and shifts.

> This makes FSE faster, on par with Huffman speed, and even suitable for
> low-power CPU environments.

[1]: [http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html](http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html)

[2]: [https://github.com/Cyan4973/FiniteStateEntropy](https://github.com/Cyan4973/FiniteStateEntropy)
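
The power-of-2 limitation is easy to make concrete: a Huffman code spends a
whole number of bits on every symbol, so a skewed distribution leaves a lot
on the table. A back-of-the-envelope check in Python (the two-symbol
alphabet and 90/10 split are assumptions chosen just for illustration):

    import math

    p = 0.9  # probability of the dominant symbol
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    print(entropy)  # ~0.469 bits/symbol is the theoretical floor

    # A Huffman code for two symbols must spend a whole bit on each one, so
    # it is stuck at 1.0 bit/symbol; arithmetic coding and FSE/ANS can get
    # close to 0.469 because they aren't limited to power-of-two
    # (integer-bit-length) symbol probabilities.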

~~~
terryM4h3r
Well, if it's as good as claimed, then maybe it needs to be described as a
new category, alongside Huffman, arithmetic coding & RLE.

------
simi_
I found Compressor Head quite enjoyable:
[https://www.youtube.com/watch?v=Eb7rzMxHyOk&list=PLOU2XLYxmsIJGErt5rrCqaSGTMyyqNt2H](https://www.youtube.com/watch?v=Eb7rzMxHyOk&list=PLOU2XLYxmsIJGErt5rrCqaSGTMyyqNt2H)

------
pslam
Well written! I was just looking for the Shannon Minimum Coding Rate
definition the other day, so it was nice to have a quick refresher on the
basics.

This also highlights a fundamental issue: the coding scheme is tailored to the
expected data input. That's why there's no one-size-fits-all compression
algorithm, and you can achieve far greater efficiency if you have a good model
of the shape of the input data. (The same applies to lossy compression.)

The MCR is also a nice way to demonstrate how random data can't be compressed.
Let's say you have 256 symbols, each with probability 1/256 (independent etc).
The Shannon entropy = -sum(P(x) * log2(P(x)) for x in 0..255) =
-(2^8 * 2^-8 * -8) = 8 bits. No compression possible.
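
The same arithmetic as a quick Python sanity check:

    import math

    p = [1 / 256] * 256                       # uniform, independent symbols
    H = -sum(px * math.log2(px) for px in p)  # Shannon entropy, bits/symbol
    print(H)                                  # 8.0 -- already at 8 bits/byte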

------
stared
Pet peeve: dark gray text on light gray background, see
[http://contrastrebellion.com/](http://contrastrebellion.com/).

Otherwise a great text, just its presentation is a bit distracting (at least
for my taste/eyes).

------
b6
Is there any value in thinking of the data to be compressed as a big integer,
and trying to come up with an expression or program that, when evaluated or
executed, produces the desired integer?

~~~
specular
In some sense, this is exactly what decompression does. The question, if I
understand you correctly, becomes one of "how is this decompression program
designed?" Does it require any particular inputs? etc.

Although somewhat different from what you're describing, you might be
interested in arithmetic coding, which encodes data to a single numerical
value that can later be recovered by an inverse process.
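
The big-integer view itself is easy to play with in Python; the round trip
below is just an illustration of the idea, not a compression scheme:

    data = b"hello, world"
    n = int.from_bytes(data, "big")  # the data, viewed as one big integer
    print(n)                         # about 29 decimal digits for 12 bytes

    recovered = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert recovered == data         # (leading zero bytes would need care)

    # A "program that evaluates to the integer" can be far shorter than the
    # integer itself, e.g. 10**10000 names a 10001-digit number in nine
    # characters, but for most (random) integers no such short description
    # exists.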

------
sizimon
This is a great article, but it fails to mention Weissman Score - the gold
standard of compression measurement.

------
sysk
Huffman encoding reminds me of UTF-8. What is the connection there? Is UTF-8
based on Huffman encoding?

~~~
ronreiter
It's not based on it, but the fact that ASCII/English text takes 1 byte per
character while other scripts take 2 or more does resemble the same concept,
since English text is more common than other scripts in general.
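
A quick Python check makes the comparison concrete; note that UTF-8's code
lengths are fixed by the code point's range, not derived from measured
symbol frequencies the way Huffman code lengths are:

    for ch in "aéℝ😀":
        print(ch, len(ch.encode("utf-8")))  # 1, 2, 3 and 4 bytes respectively

    # ASCII gets 1 byte, most Latin/Greek/Cyrillic letters 2, most other BMP
    # characters 3, and supplementary-plane characters (emoji, etc.) 4.
    # The "shorter codes for more common symbols" flavour is Huffman-like,
    # but the mapping is fixed by the standard, not by probabilities.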

------
jokoon
LZ77 is a little hard to explain, but I don't have a degree in CS or math
and I still managed to understand it a little, though I doubt I could
reproduce it myself.

Isn't LZ77 the most efficient text compression algorithm?

~~~
ryan-c
LZ77 is usually more efficient when combined with Huffman coding (this is what
DEFLATE does).
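
Easy to try with Python's zlib, which wraps DEFLATE (the sample string is
arbitrary):

    import zlib

    text = b"the quick brown fox jumps over the lazy dog " * 100
    packed = zlib.compress(text, 9)   # DEFLATE: LZ77 followed by Huffman coding
    print(len(text), "->", len(packed))
    assert zlib.decompress(packed) == text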

