Hacker News new | past | comments | ask | show | jobs | submit login
A Gentle Introduction to Data Compression (bertolami.com)
109 points by ramen2387 on Dec 29, 2014 | hide | past | favorite | 15 comments

I think that for beginners it's far better to talk about the LZ-style algorithms first, since they are much easier to understand without requiring any discussion of probabilities and the associated maths; the idea of replacing repeated strings with a reference to them is very intuitive, and so is the implementation of an LZ compressor/decompressor. In fact, an in-memory LZ decompressor core can be written in less than two dozen bytes of x86 instructions, and can easily outperform the fastest Huffman decompressor.

When I was interested enough in compression to read about it, the majority of the information I could find seemed highly theoretical and made me think that Huffman was a good first step so I managed to write a simple static/dynamic Huffman compressor/decompressor, but that didn't work so well. It was "fast enough", but the compression ratio was disappointing. When I discovered LZ the difference was amazing: an extremely simple algorithm, requiring about an order of magnitude less code in implementation, that compressed much better than Huffman for data which would intuitively be compressible.

In retrospect, it's quite clear that Huffman and other "variable length code" techniques only work because of nonuniformity in symbol probabilities, so it is ineffective on data containing repeating patterns but uniform probability. E.g. take a string 256 bytes long containing all the byte values 0, 1, 2, ... 253, 254, 255; and repeat that string multiple times in a file. LZ can easily compress that pattern but Huffman can't. (Does LZ rely on probability nonuniformity? It doesn't look like it.) The fact that many common methods like FLATE use Huffman only to compress the output of LZ on their input probably reflects this, but I don't think I've seen such an intuitive explanation for it before.

I agree. Transforms (LZ77, BWT, RLE, ...) are generally more interesting and productive than entropy coding. http://mattmahoney.net/dc/dce.html

Very nice introduction.

Recently I've been following the progress of the Finite State Entropy algorithm[1]. From the readme on the github project [2]:

>FSE is a new kind of Entropy encoder, based on ANS theory, from Jarek Duda.

>It is designed to compete with Huffman encoder and Arithmetic ones.

>While huffman is fast but can only represent probabilities in power of 2 (50%, 25%, etc.) arithmetic coding can represent probabilities with much better accuracy, but requires multiplications and divisions.

> FSE solve this dilemma by providing precise probabilities, like arithmetic does, but using only additions, masks and shifts.

> This makes FSE faster, on par with Huffman speed, and even suitable for low-power CPU environment.

[1]: http://fastcompression.blogspot.com/2013/12/finite-state-ent... [2]: https://github.com/Cyan4973/FiniteStateEntropy

Well, if it's as good as claimed, then maybe it needs to be described as a new category, alongside Huffman, arithmetic & Rle

The performance figures displayed on those pages look exceptional. Has anyone experimented with it already ?

I found Compressor Head quite enjoyable: https://www.youtube.com/watch?v=Eb7rzMxHyOk&list=PLOU2XLYxms...

Well written! I was just looking for the Shannon Minimum Coding Rate definition the other day, so it was nice to have a quick refresher on the basics.

This also highlights a fundamental issue: the coding scheme is tailored to the expected data input. That's why there's no one-size-fits-all compression algorithm, and you can achieve far greater efficiency if you have a good model of the shape of the input data. (The same applies to lossy)

The MCR is also a nice way to demonstrate how random data can't be compressed. Let's say you have 256 symbols, each with probability 1/256 (independent etc). The Shannon Entropy = -sum(P(x) * log2(P(x) for x in 0..255) = -2^8 * 2^-8 * -8 = 8 bits. No compression possible.

Pet peeve: dark gray text on light gray background, see http://contrastrebellion.com/.

Otherwise a great text, just its presentation is a bit distracting (at least for my taste/eyes).

Is there any value in thinking of the data to be compressed as a big integer, and trying to come up with an expression or program that, when evaluated or executed, produces the desired integer?

In some sense, this is exactly what decompression does. The question, if I understand you correctly, becomes one of "how is this decompression program designed?" Does it require any particular inputs? etc.

Although somewhat different than what you're describing, you might be interested in arithmetic coding, which encodes data to a single numerical value that can later be recovered by an inverse process.

This is a great article, but it fails to mention Weissman Score - the gold standard of compression measurement.

Huffman encoding reminds me of UTF-8. What is the connection there? Is UTF-8 based on Huffman encoding?

It's not based on it, but the fact that English takes 1 character and the rest take 2 does resembles the same concept, since English is more common than other languages in general.

lz77 is a little hard to explain, but I don't have any degree in CS nor math and I managed to understand it a little, not that I could really be able to reproduce it.

isn't lz77 the most efficient text compression algorithm ?

LZ77 is usually more efficient when combined with Huffman coding (this is what DEFLATE does).

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact