
RLE is a given. It's true that the average person rarely understands that this is what computers call compression, but everything after that involves a bit of thinking. Optimal Huffman coding certainly does.



I think the LZ family are all pretty intuitive --- just replace repeated sequences with a reference to where they occurred before. Even human languages have things like contractions, abbreviations, and acronyms. Thus it is, at least to me, somewhat surprising that LZ was only documented academically some 25 years after Huffman; perhaps it was thought to be too trivial? LZ can also be thought of as an extension of RLE.

In any case, an LZ decompressor fits in less than 100 bytes of machine instructions, and a simple compressor can be implemented in no more than a few hundred, all while providing remarkably good compression for its simplicity. It will easily outperform order-0 static or dynamic Huffman on practical files like English text, and it would make a good assignment in an undergraduate data structures/algorithms/programming course. Yet it seems more popular to assign Huffman using trees, which is somewhat ironic, since in the real world Huffman is implemented with bit operations and lookup tables, not actual tree data structures.

To give a trivial example, LZ will easily compress ABCDABCDABCDABCD while order-0 Huffman can't do much since each individual symbol has the same frequency.
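To make that concrete, here is a toy LZ77-style coder in Python. It's only a sketch of the idea, not any real format: the function names, the 255-byte window, the length-3 threshold and the (distance, length) token layout are all arbitrary choices of mine.

    def lz_compress(data):
        out, i = [], 0
        while i < len(data):
            best_len, best_dist = 0, 0
            # look for the longest match starting somewhere before position i
            for j in range(max(0, i - 255), i):
                k = 0
                while i + k < len(data) and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_len, best_dist = k, i - j
            if best_len >= 3:                      # worth a back-reference
                out.append(('ref', best_dist, best_len))
                i += best_len
            else:                                  # plain literal
                out.append(('lit', data[i]))
                i += 1
        return out

    def lz_decompress(tokens):
        out = []
        for t in tokens:
            if t[0] == 'lit':
                out.append(t[1])
            else:
                _, dist, length = t
                for _ in range(length):            # copies may overlap their own output
                    out.append(out[-dist])
        return ''.join(out)

    print(lz_compress("ABCDABCDABCDABCD"))
    # [('lit', 'A'), ('lit', 'B'), ('lit', 'C'), ('lit', 'D'), ('ref', 4, 12)]
    print(lz_decompress(lz_compress("ABCDABCDABCDABCD")))
    # ABCDABCDABCDABCD

Note the last token: distance 4 but length 12, a copy that overlaps its own output. With distance 1 that degenerates into run-length encoding, which is one way to see LZ as an extension of RLE.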


Another LZ fan here.

My guess is that the "late" development of LZ is mainly due to two reasons:

i) At that time, pattern matching algorithms were not so advanced. E.g. the suffix tree was very recent, and in the following years lots of advances occurred in that area...

ii) Although LZ can appear easier or more intuitive than Huffman, I think it is much less intuitive to prove a good bound on the compression achieved by LZ. OTOH, Huffman is built in a way that shows it achieves zeroth-order compression.


The DEFLATE[1] algorithm is actually fairly accessible, and will give a good idea of how compression works.

1: https://en.wikipedia.org/wiki/DEFLATE


I liked this explanation of DEFLATE here on HN a few months ago:

https://news.ycombinator.com/item?id=12334270


IMO Huffman is conceptually more complicated (not the implementation, but the logic) than arithmetic coding.

And Huffman isn't optimal unless you are lucky (i.e. the symbol probabilities happen to be exact powers of 1/2), unlike arithmetic coding.


I never learned AC. It's on my overflowing stack of things to read about.


AC is conceptually stupidly simple. All you do is encode a string of symbols into a range of real numbers.

To start, your range is [0, 1). For each symbol you want to encode you take your range and split it up according to your probabilities. E.g. if your symbols are 25% A, 50% B and 25% C, then you split up that range into [0, 0.25) for A, [0.25, 0.75) for B and [0.75, 1) for C.

Encoding multiple symbols is just applying this recursively. So to encode the two symbols Bx we split up [0.25, 0.75) proportionally just like we did [0, 1) before to encode x (where x is A, B or C).

As an example, A is the range [0, 0.25); splitting that range up again and taking C's slice of it gives AC the range [0.1875, 0.25).

Now to actually turn these ranges into a string of bits we choose the shortest binary representation that fits within the range. If we look at a decimal number:

    0.1875
We know that this means 1/10 + 8/100 + 7/1000 + 5/10000. A binary representation:

    0.0011
This means 0/2 + 0/4 + 1/8 + 1/16 = 0.1875. So we encode AC as 0011.
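The whole scheme fits in a few lines of Python. This is only a sketch of the concept, using exact fractions for the A/B/C model above; PROBS, encode and decode are illustrative names of mine, and a real coder uses fixed-precision integer arithmetic with renormalization and has to deal with stream termination.

    from fractions import Fraction

    PROBS = [('A', Fraction(1, 4)), ('B', Fraction(1, 2)), ('C', Fraction(1, 4))]

    def encode(message):
        low, width = Fraction(0), Fraction(1)      # current range [low, low + width)
        for sym in message:
            for s, p in PROBS:                     # narrow the range to the symbol's slice
                if s == sym:
                    width *= p
                    break
                low += width * p
        # shortest bit string whose binary fraction falls inside the final range
        bits, value, step = '', Fraction(0), Fraction(1, 2)
        while value < low:
            if value + step < low + width:
                bits += '1'
                value += step
            else:
                bits += '0'
            step /= 2
        return bits

    def decode(bits, n):                           # n = symbol count, stored externally
        value = sum(Fraction(1, 2 ** (i + 1)) for i, b in enumerate(bits) if b == '1')
        low, width, out = Fraction(0), Fraction(1), ''
        for _ in range(n):
            for s, p in PROBS:                     # find the slice containing the value
                if value < low + width * p:
                    out += s
                    width *= p
                    break
                low += width * p
        return out

    print(encode('AC'), decode('0011', 2))         # 0011 AC

(Swapping PROBS for a table that gets updated after every symbol is all it takes to make this adaptive, which is the point below.)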

---

The beauty of arithmetic coding is that after encoding/decoding any symbol we can arbitrarily change how we split up the range, giving rise to adaptive coding. Arithmetic coding can perfectly represent any data that forms a discrete string of symbols, including changes to our knowledge of data as we decode.


Or on a more abstract level to compare to Huffman encoding: Huffman turns each symbol into a series of bits like "011". Arithmetic encoding lets you use fractional bits.

A Huffman tree for digits might assign 0-5 to 3 bits and 6-9 to 4 bits. Encoding three digits will then use slightly more than 10 bits on average. Using AC lets you give the same amount of code space to each possibility, so encoding three digits costs just under 10 bits.
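To put rough numbers on that, assuming ten equiprobable digits: the Huffman code above averages (6*3 + 4*4)/10 = 3.4 bits per digit, so three digits cost about 3 * 3.4 = 10.2 bits on average, while an arithmetic coder spends essentially log2(1000) ≈ 9.97 bits to pick one of the 1000 equally likely three-digit combinations.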


Nice explanation. Can you explain how to remove ambiguity relating to string length?

"0" = 0.0b = 0 falls in the range [0,0.25) so it's a valid encoding for "A"; but isn't it also a valid encoding for "AA", "AAA", etc.?

AA = [0, 0.25) * [0, 0.25) = [0, 0.0625), and so on.

It seems that adding "A"s to a string in general doesn't change its encoding.


You either reserve a symbol for "end of stream" or externally store the length.

It's equivalent to pretending a Huffman stream never ends and is padded with infinite 0s.


Huffman seems simpler to me, but I've implemented both at various times so that might colour my perspective.


AC implementation is actually quite tricky, but conceptually IMO it's much simpler and more elegant than Huffman.



