'Suppose you have some strange coin - you've tossed it 10 times, and every time it lands on heads. How would you describe this information to someone? You wouldn't say HHHHHHHHH. You would just say "10 tosses, all heads" - bam! You've just compressed some data! Easy. I saved you hours of mindfuck lectures.'
This is a really great, simple way to explain what is otherwise a fairly complex concept to the average bear. Great work.
And of course the truth is you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.
Yes I get that the alphabet is already agreed upon.
But if I only transmit H or T (uncompressed) that's just one bit needed per symbol. So I can transmit HHHHHHHHH in ten bits. If I introduce this simplified compression to the system and add 0-9 to the alphabet, that now needs four bits per symbol, so the message H10 is 12 bits long (which is longer than uncompressed). And HTHTHTHTHT would be forty bits so if the message doesn't repeat simply it's now four times larger!
See what I mean? It's not successfully compressed anything.
The solution to this is easy and is Huffman coding, but it doesn't make sense to show it for a ten bit message as it won't work well, and in the trite explanation of compression of 'just the symbol and then the number of times it's repeated' this isn't mentioned, so it's only half the story and people will still be puzzled because they will see that your message contains MORE entropy after 'compression', not LESS!
"To another human" is the key phrase, and sometimes I wonder if HN is populated with humans or androids. No offense intended to androids with feelings.
> The solution to this is easy and is Huffman
Well, not. As you said, for 10 bits doesn't matter; and in general it will depend on the input; sometimes run length encoding performs better than Huffman; and also there are cases were Huffman won't capture high order entropy. Also, for zero order entropy arithmetic encoder is superior than Huffman. Unless you care about decompression speed...
Which bring me back to the fact that there is not such a thing as"the solution" in data compression. But more importantly: it was just an example to show an idea; and actually a pretty good one (run length encoding)
How does it work? The first bit is a sign bit. If it's zero, the next seven bits are raw coinflips, 0 for tails, 1 for heads. If it's one, the second bit signifies whether the next sequence consists of heads or tails (again, 0 for tails, 1 for heads), and the remaining six bits tell how long said sequence is.
This is a fairly simple encoding for the strategy described in the article, which I thought of off the top of my head in about five minutes, and I know nothing about compression. It's slightly broken (what if the sequence ends mid-byte?), but it does kind of work. Somebody who actually knows about compression could probably do this better.
What's really cool is that the simple explanation can be extended to explain things like why ciphertext doesn't compress well: because ciphertext has no patterns
In any case, an LZ decompressor fits in less than 100 bytes of machine instructions and a simple compressor can be implemented in not more than several hundred, all the while providing extremely high compression for its simplicity. It will easily outperform order-0 static or dynamic Huffman on practical files like English text, and would probably make a good assignment in an undergraduate-level beginning data structures/algorithms/programming course; yet it seems more popular to give an assignment on Huffman using trees, which is somewhat ironic since in the real world Huffman is implemented using bit operations and lookup tables, not actual tree data structures.
To give a trivial example, LZ will easily compress ABCDABCDABCDABCD while order-0 Huffman can't do much since each individual symbol has the same frequency.
My guess is that the "late" development of LZ is due to mainly two reasons:
i) At that moment the pattern matching algorithms were not so advanced. E.g. suffix tree was very recent, and in the next years lots of advances occurred in that area...
ii) Although LZ can appear easier or more intuitive than Huffman, I think it is much less intuitive to prove a good bound in the compression achieved by LZ. OTOH, Huffman is build in a way that shows that it achieves zeroth-order compression.
And Huffman isn't optimal unless you are lucky, unlike arithmetic coding.
To start your range is [0, 1). For each symbol you want to encode you take your range and split it up according to your probabilities. E.g. if your symbols are 25% A, 50% B and 25% C, then you split up that range in [0, 0.25) for A, [0.25, 0.75) for B and [0.75, 1) for C.
Encoding multiple symbols is just applying this recursively. So to encode the two symbols Bx we split up [0.25, 0.75) proportionally just like we did [0, 1) before to encode x (where x is A, B or C).
As an example, A is the range [0, 0.25), and AC is the range [0.1875, 0.25).
Now to actually turn these ranges into a string of bits we choose the shortest binary representation that fits within the range. If we look at a decimal number:
The beauty of arithmetic coding is that after encoding/decoding any symbol we can arbitrarily change how we split up the range, giving rise to adaptive coding. Arithmetic coding can perfectly represent any data that forms a discrete string of symbols, including changes to our knowledge of data as we decode.
A Huffman tree for digits might assign 0-5 to 3 bits and 6-9 to 4 bits. Encoding three digits will use on average slightly more than 10 bits. Using AC will let you give the same amount of space to each possibility, so that encoding three digits always uses less than 10 bits.
"0" = 0.0b = 0 falls in the range [0,0.25) so it's a valid encoding for "A"; but isn't it also a valid encoding for "AA", "AAA", etc.?
AA = [0,0.25) * [0, 0.25) = [0, 0.125), and so on.
It seems that adding "A"s to a string in general doesn't change its encoding.
It's the equivalent to pretending a Huffman stream never ends and is padded with infinite 0s.
Saying "10 tosses, all heads" reduces the chance of omitting a toss in data entry, which is all to the better.
On the HyperLogLog algorithm to count things.
- heads, heads, heads, tails, tails, heads