And of course the truth is you would just transmit "H_10" with the universally agreed upon knowledge that "H" is "Heads" and "_" is number of times.
Yes I get that the alphabet is already agreed upon.
But if I only transmit H or T (uncompressed) that's just one bit needed per symbol. So I can transmit HHHHHHHHH in ten bits. If I introduce this simplified compression to the system and add 0-9 to the alphabet, that now needs four bits per symbol, so the message H10 is 12 bits long (which is longer than uncompressed). And HTHTHTHTHT would be forty bits so if the message doesn't repeat simply it's now four times larger!
See what I mean? It's not successfully compressed anything.
The solution to this is easy and is Huffman coding, but it doesn't make sense to show it for a ten bit message as it won't work well, and in the trite explanation of compression of 'just the symbol and then the number of times it's repeated' this isn't mentioned, so it's only half the story and people will still be puzzled because they will see that your message contains MORE entropy after 'compression', not LESS!
"To another human" is the key phrase, and sometimes I wonder if HN is populated with humans or androids. No offense intended to androids with feelings.
> The solution to this is easy and is Huffman
Well, not. As you said, for 10 bits doesn't matter; and in general it will depend on the input; sometimes run length encoding performs better than Huffman; and also there are cases were Huffman won't capture high order entropy. Also, for zero order entropy arithmetic encoder is superior than Huffman. Unless you care about decompression speed...
Which bring me back to the fact that there is not such a thing as"the solution" in data compression. But more importantly: it was just an example to show an idea; and actually a pretty good one (run length encoding)
How does it work? The first bit is a sign bit. If it's zero, the next seven bits are raw coinflips, 0 for tails, 1 for heads. If it's one, the second bit signifies whether the next sequence consists of heads or tails (again, 0 for tails, 1 for heads), and the remaining six bits tell how long said sequence is.
This is a fairly simple encoding for the strategy described in the article, which I thought of off the top of my head in about five minutes, and I know nothing about compression. It's slightly broken (what if the sequence ends mid-byte?), but it does kind of work. Somebody who actually knows about compression could probably do this better.