Hacker News new | comments | show | ask | jobs | submit login

Absolutely love this:

'Suppose you have some strange coin - you've tossed it 10 times, and every time it lands on heads. How would you describe this information to someone? You wouldn't say HHHHHHHHH. You would just say "10 tosses, all heads" - bam! You've just compressed some data! Easy. I saved you hours of mindfuck lectures.'

This is a really great, simple way to explain what is otherwise a fairly complex concept to the average bear. Great work.

That's only one half of the problem - you now have an alphabet five times the size so you have actually increased the size of the message! You also need to explain how to encode this efficiently to explain compression.

Not really - the knowledge of what an alphabet can be universally agreed upon and doesn't need to be transmitted with the data. The metaphor here is that software and hardware-based decoding can now be much more powerful because the hardware is more powerful than it used to be.

And of course the truth is you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.

> you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.

Yes I get that the alphabet is already agreed upon.

But if I only transmit H or T (uncompressed) that's just one bit needed per symbol. So I can transmit HHHHHHHHH in ten bits. If I introduce this simplified compression to the system and add 0-9 to the alphabet, that now needs four bits per symbol, so the message H10 is 12 bits long (which is longer than uncompressed). And HTHTHTHTHT would be forty bits so if the message doesn't repeat simply it's now four times larger!

See what I mean? It's not successfully compressed anything.

The solution to this is easy and is Huffman coding, but it doesn't make sense to show it for a ten bit message as it won't work well, and in the trite explanation of compression of 'just the symbol and then the number of times it's repeated' this isn't mentioned, so it's only half the story and people will still be puzzled because they will see that your message contains MORE entropy after 'compression', not LESS!

You are entirely missing the point. His purpose isn't to give the reader a rigorous mathematical understanding. It is to convey a concept. It is an analogy, not a proof. And his analogy is perfectly good. Just do it: say "HHHHHHHHHH" and say "ten tosses, all heads" and get back to me which one transmits the info to another human in more compact form.

"To another human" is the key phrase, and sometimes I wonder if HN is populated with humans or androids. No offense intended to androids with feelings.

I think there's no need to be so pedantic. Replace 10 for 1000 and now the scheme "works".


> The solution to this is easy and is Huffman

Well, not. As you said, for 10 bits doesn't matter; and in general it will depend on the input; sometimes run length encoding performs better than Huffman; and also there are cases were Huffman won't capture high order entropy. Also, for zero order entropy arithmetic encoder is superior than Huffman. Unless you care about decompression speed...

Which bring me back to the fact that there is not such a thing as"the solution" in data compression. But more importantly: it was just an example to show an idea; and actually a pretty good one (run length encoding)

But actually, no. Because you could set up HTTHTTTHHHHHHHHHH the format like this:

That's sixteen bits for 17 coinflips. With no continuous sequences longer than seven, this format takes up one extra bit every seven flips.

How does it work? The first bit is a sign bit. If it's zero, the next seven bits are raw coinflips, 0 for tails, 1 for heads. If it's one, the second bit signifies whether the next sequence consists of heads or tails (again, 0 for tails, 1 for heads), and the remaining six bits tell how long said sequence is.

This is a fairly simple encoding for the strategy described in the article, which I thought of off the top of my head in about five minutes, and I know nothing about compression. It's slightly broken (what if the sequence ends mid-byte?), but it does kind of work. Somebody who actually knows about compression could probably do this better.

I know that, the point is that this kind of stuff needs some thought, it's not so simple as "HTHTHTHT" = "HT five times". The article kind of glosses over that.

In the context of the analogy, it's probably better to read it as saving time to human-parse rather than the space required to send. (And it definitely takes less time to verbally state, even if the sentence is clearly longer; caching) The general idea is the same through; compression by describing patterns rather than explicitly stating the event

Well, yes. Many programmers have trouble mapping ideas into bits: imagine how hard it is for people aren't in programming?

you know it was just an example in terms of what you might tell a friend over the phone about a coin flip right? i suppose you'd send your friend a huffman tree and then say "1110"

This entire thread makes me think of that Silicon Valley scene[1].

[1] https://www.youtube.com/watch?v=-5jF5jtMM_4

wouldn't you just convert the alphabet to 0-1? Shouldn't the 0-1 compression be the most optimal? That is, you can't find a better deal with any alphabets or numerical?

That's how compression was first explained to me, and it really stuck with me ever since. It was in the context of an image though, and instead of heads, it was red pixels.

What's really cool is that the simple explanation can be extended to explain things like why ciphertext doesn't compress well: because ciphertext has no patterns

Not compression per se, but I remember when I was reverse engineering maps for Chip's Challenge. I would often see a tile (represented as a byte) that was encoded as 0xFF0502. I ended up realizing it meant "Repeat 'tile 2' 5 times." It was fun to figure that out as a kid.

RLE a given. It's true that the average person rarely understand that this is what computers call compression, but everything after that involves a bit of thinking. Optimal huffman.

I think the LZ family are all pretty intuitive --- just replace repeated sequences with a reference to where they occurred before. Even human languages have things like contractions, abbreviations, and acronyms. Thus it is, at least to me, somewhat surprising that LZ was only documented academically several decades after Huffman; perhaps it was thought to be too trivial? LZ can also be thought of as an extension of RLE.

In any case, an LZ decompressor fits in less than 100 bytes of machine instructions and a simple compressor can be implemented in not more than several hundred, all the while providing extremely high compression for its simplicity. It will easily outperform order-0 static or dynamic Huffman on practical files like English text, and would probably make a good assignment in an undergraduate-level beginning data structures/algorithms/programming course; yet it seems more popular to give an assignment on Huffman using trees, which is somewhat ironic since in the real world Huffman is implemented using bit operations and lookup tables, not actual tree data structures.

To give a trivial example, LZ will easily compress ABCDABCDABCDABCD while order-0 Huffman can't do much since each individual symbol has the same frequency.

Another LZ fan here.

My guess is that the "late" development of LZ is due to mainly two reasons:

i) At that moment the pattern matching algorithms were not so advanced. E.g. suffix tree was very recent, and in the next years lots of advances occurred in that area...

ii) Although LZ can appear easier or more intuitive than Huffman, I think it is much less intuitive to prove a good bound in the compression achieved by LZ. OTOH, Huffman is build in a way that shows that it achieves zeroth-order compression.

The DEFLATE[1] algorithm is actually fairly accessible, and will give a good idea of how compression works.

1: https://en.wikipedia.org/wiki/DEFLATE

I liked this explanation of DEFLATE here on HN a few months ago:


IMO Huffman is conceptually more complicated (not the implementation, but the logic) than arithmetic coding.

And Huffman isn't optimal unless you are lucky, unlike arithmetic coding.

I never learned AC. It's on my overflowing stack of thing to read about.

AC is conceptually stupidly simple. All you do is encode a string of symbols into a range of real numbers.

To start your range is [0, 1). For each symbol you want to encode you take your range and split it up according to your probabilities. E.g. if your symbols are 25% A, 50% B and 25% C, then you split up that range in [0, 0.25) for A, [0.25, 0.75) for B and [0.75, 1) for C.

Encoding multiple symbols is just applying this recursively. So to encode the two symbols Bx we split up [0.25, 0.75) proportionally just like we did [0, 1) before to encode x (where x is A, B or C).

As an example, A is the range [0, 0.25), and AC is the range [0.1875, 0.25).

Now to actually turn these ranges into a string of bits we choose the shortest binary representation that fits within the range. If we look at a decimal number:

We know that this means 1/10 + 8/100 + 7/1000 + 5/10000. A binary representation:

This means 0/2 + 0/4 + 1/8 + 1/16 = 0.1875. So we encode AC as 0011.


The beauty of arithmetic coding is that after encoding/decoding any symbol we can arbitrarily change how we split up the range, giving rise to adaptive coding. Arithmetic coding can perfectly represent any data that forms a discrete string of symbols, including changes to our knowledge of data as we decode.

Or on a more abstract level to compare to Huffman encoding: Huffman turns each symbol into a series of bits like "011". Arithmetic encoding lets you use fractional bits.

A Huffman tree for digits might assign 0-5 to 3 bits and 6-9 to 4 bits. Encoding three digits will use on average slightly more than 10 bits. Using AC will let you give the same amount of space to each possibility, so that encoding three digits always uses less than 10 bits.

Nice explanation. Can you explain how to remove ambiguity relating to string length?

"0" = 0.0b = 0 falls in the range [0,0.25) so it's a valid encoding for "A"; but isn't it also a valid encoding for "AA", "AAA", etc.?

AA = [0,0.25) * [0, 0.25) = [0, 0.125), and so on.

It seems that adding "A"s to a string in general doesn't change its encoding.

You either reserve a symbol for "end of stream" or externally store the length.

It's the equivalent to pretending a Huffman stream never ends and is padded with infinite 0s.

Huffman seems simpler to me, but I've implemented both at various times so that might colour my perspective.

AC implementation is actually quite tricky, but conceptually IMO it's much simpler and more elegant than Huffman.

I definitely wouldn't say "HHHHHHHHH," since I tossed it 10 times, not 9.

Saying "10 tosses, all heads" reduces the chance of omitting a toss in data entry, which is all to the better.

You're making the assumption the other party knows English, rather than say the abstraction of 'coinflip' which in itself can be abstracted. Do they understand the concept of fairness - is it even odds or not? There's a reason that numbers are considered a more universal 'language' than other forms of communication.

You likely enjoy this read too:


On the HyperLogLog algorithm to count things.

I would probably say I found a two headed coin. I also like how "HHHHHHHHHH" is shorter than "I tossed it 10 times, all heads"

The all-heads sequence is exactly as likely as any other sequence.

right, but you can also say there are many sequences that aren't all heads.

Which is shortest, you can recover the original dataset from any of the following:

- heads, heads, heads, tails, tails, heads

- hhhtth

- h3t2h

- 3b1

FAIL. "10 tosses, all heads" is 20 characters while "HHHHHHHHHH" is 10 characters. You've conducted expansion rather than compression.

That example is about conveying the information verbally to another human, so syllables is interesting, not characters. "h h h h h h h h h h" is 10 syllables, "10 tosses, all heads" is 4 (and could be compressed further to "10 heads").

"10H" is the string you should be comparing it to (or something equivalent.)

If we're going there, H10 may be more efficient.... h10t5h1... Thinking the heads vs tails expression would be before the iteration as a better use case.

Depends which way you're iterating.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact