
That's only one half of the problem - you now have an alphabet five times the size so you have actually increased the size of the message! You also need to explain how to encode this efficiently to explain compression.



Not really - the knowledge of what the alphabet is can be universally agreed upon and doesn't need to be transmitted with the data. The metaphor here is that software- and hardware-based decoding can now be much more powerful because the hardware is more powerful than it used to be.

And of course the truth is you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.


> you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.

Yes I get that the alphabet is already agreed upon.

But if I only transmit H or T (uncompressed), that's just one bit needed per symbol, so I can transmit HHHHHHHHHH in ten bits. If I introduce this simplified compression to the system and add 0-9 to the alphabet, every symbol now needs four bits, so the message H10 is 12 bits long (which is longer than uncompressed). And HTHTHTHTHT would be forty bits, so if the message doesn't repeat simply, it's now four times larger!

See what I mean? It's not successfully compressed anything.
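The arithmetic above can be checked with a small sketch (the helper name is mine, and it assumes a naive fixed-width code for each alphabet):

```python
import math

def cost_in_bits(message, alphabet):
    # Fixed-width code: each symbol costs ceil(log2(alphabet size)) bits.
    bits_per_symbol = math.ceil(math.log2(len(alphabet)))
    return len(message) * bits_per_symbol

print(cost_in_bits("HHHHHHHHHH", "HT"))            # 10 bits, uncompressed
print(cost_in_bits("H10", "HT0123456789"))         # 12 bits: the "compressed" form is longer
print(cost_in_bits("HTHTHTHTHT", "HT0123456789"))  # 40 bits if no runs to exploit
```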

The solution to this is easy - Huffman coding - but it doesn't make sense to show it for a ten-bit message, where it won't work well. And the trite explanation of compression as 'just the symbol and then the number of times it's repeated' doesn't mention this, so it's only half the story, and people will still be puzzled because they will see that the message contains MORE entropy after 'compression', not LESS!
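A minimal Huffman sketch (my own, not from the thread) shows why it doesn't help here: with only two symbols, every symbol still gets a one-bit code, so a ten-flip message stays ten bits.

```python
import heapq
from collections import Counter

def huffman_codes(message):
    # Build a prefix code from symbol frequencies: rarer symbols get longer codes.
    freq = Counter(message)
    if len(freq) == 1:  # degenerate case: one symbol gets a 1-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("HHHHHHHTTH")
encoded = "".join(codes[s] for s in "HHHHHHHTTH")
print(len(encoded))  # 10 bits: Huffman can't beat 1 bit/symbol for a 2-symbol alphabet
```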


You are entirely missing the point. His purpose isn't to give the reader a rigorous mathematical understanding. It is to convey a concept. It is an analogy, not a proof. And his analogy is perfectly good. Just do it: say "HHHHHHHHHH" and say "ten tosses, all heads" and get back to me which one transmits the info to another human in more compact form.

"To another human" is the key phrase, and sometimes I wonder if HN is populated with humans or androids. No offense intended to androids with feelings.


I think there's no need to be so pedantic. Replace 10 with 1000 and now the scheme "works".

Regarding

> The solution to this is easy and is Huffman

Well, no. As you said, for 10 bits it doesn't matter; and in general it will depend on the input: sometimes run-length encoding performs better than Huffman, and there are cases where Huffman won't capture higher-order entropy. Also, for zero-order entropy an arithmetic coder is superior to Huffman. Unless you care about decompression speed...

Which brings me back to the fact that there is no such thing as "the solution" in data compression. But more importantly: it was just an example to show an idea, and actually a pretty good one (run-length encoding).
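For reference, run-length encoding is simple enough to sketch in a few lines (names are mine; this stores (symbol, count) pairs and ignores how they'd be packed into bits):

```python
def rle_encode(message):
    # Collapse the message into a list of [symbol, run-length] pairs.
    runs = []
    for sym in message:
        if runs and runs[-1][0] == sym:
            runs[-1][1] += 1
        else:
            runs.append([sym, 1])
    return runs

def rle_decode(runs):
    # Expand each run back into repeated symbols.
    return "".join(sym * count for sym, count in runs)

print(rle_encode("HHHHHHHHHH"))  # [['H', 10]] -- one run, compresses well
print(rle_encode("HTHTHTHTHT"))  # ten runs of length 1 -- RLE expands this input
```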


But actually, no. Because you could encode HTTHTTTHHHHHHHHHH in a format like this:

  01001000
  11001010
That's sixteen bits for 17 coinflips. With no continuous sequences longer than seven, this format takes up one extra bit every seven flips.

How does it work? The first bit is a sign bit. If it's zero, the next seven bits are raw coinflips, 0 for tails, 1 for heads. If it's one, the second bit signifies whether the next sequence consists of heads or tails (again, 0 for tails, 1 for heads), and the remaining six bits tell how long said sequence is.

This is a fairly simple encoding for the strategy described in the article, which I thought of off the top of my head in about five minutes, and I know nothing about compression. It's slightly broken (what if the sequence ends mid-byte?), but it does kind of work. Somebody who actually knows about compression could probably do this better.
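The format above can be sketched as code (my own rough implementation, with the mid-byte problem papered over by padding raw bytes and telling the decoder the total flip count):

```python
def encode(flips):
    bit = {"T": "0", "H": "1"}
    out, i = [], 0
    while i < len(flips):
        run = 1
        while i + run < len(flips) and flips[i + run] == flips[i]:
            run += 1
        if run > 7:
            # Run byte: sign bit 1, heads/tails bit, 6-bit run length (max 63).
            run = min(run, 63)
            out.append("1" + bit[flips[i]] + format(run, "06b"))
            i += run
        else:
            # Raw byte: sign bit 0, then up to seven raw flips (zero-padded).
            chunk = flips[i:i + 7]
            out.append("0" + "".join(bit[s] for s in chunk).ljust(7, "0"))
            i += len(chunk)
    return "".join(out)

def decode(bits, n):
    # n = total flip count, so trailing padding in a raw byte can be dropped.
    sym = {"0": "T", "1": "H"}
    flips = []
    for j in range(0, len(bits), 8):
        byte = bits[j:j + 8]
        if byte[0] == "0":
            flips.extend(sym[b] for b in byte[1:])
        else:
            flips.extend(sym[byte[1]] * int(byte[2:], 2))
    return "".join(flips[:n])

msg = "HTTHTTTHHHHHHHHHH"
bits = encode(msg)
print(bits, len(bits))  # 0100100011001010, 16 bits for 17 flips, as above
```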


I know that, the point is that this kind of stuff needs some thought, it's not so simple as "HTHTHTHT" = "HT five times". The article kind of glosses over that.


In the context of the analogy, it's probably better to read it as saving the time it takes a human to parse, rather than the space required to send. (And it definitely takes less time to state verbally, even if the sentence is clearly longer; caching.) The general idea is the same though: compression by describing patterns rather than explicitly stating every event.


Well, yes. Many programmers have trouble mapping ideas into bits; imagine how hard it is for people who aren't in programming.


You know it was just an example of what you might tell a friend over the phone about a coin flip, right? I suppose you'd send your friend a Huffman tree and then say "1110".


This entire thread makes me think of that Silicon Valley scene[1].

[1] https://www.youtube.com/watch?v=-5jF5jtMM_4


Wouldn't you just convert the alphabet to 0-1? Shouldn't the 0-1 encoding be the most optimal? That is, you can't get a better deal with any other alphabet or with numerals?



