

A Short, Simple Introduction to Information Theory - moultano
http://knol.google.com/k/ryan-moulton/a-short-simple-introduction-to/3kbzhsxyg4467/7#view

======
lmkg
I think one of the problems that people have with entropy is understanding the
concept of fractions of bits, and it would help to have a deeper explanation.
It's somewhat easy to understand in a probabilistic sense, but it still sort
of defies intuition that you can have less than one bit. You want to say, well
you have to round up, right? A more intuitive explanation is that with an
ideal encoding scheme, 100 non-vowel die rolls would take, on average, 258
bits to encode. This is still somewhat probabilistic, but it makes a bit more
sense, because it's no longer necessary to have a direct relationship between
die rolls and bits. And this idea of encoding a sequence leads naturally to
the topic of compression, which is one of the most useful and most natural
applications of entropy.
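
To make the arithmetic concrete, here's a quick Python sketch (just an
illustration, assuming a fair six-sided die) of where the 258 comes from:

    import math

    # Entropy of one fair six-sided die roll: about 2.585 bits.
    entropy_per_roll = math.log2(6)
    print(entropy_per_roll)        # 2.585 (approximately)

    # 100 independent rolls need about 100 * log2(6) ~= 258.5 bits on
    # average. Rounding up once for the whole sequence costs almost
    # nothing, while rounding up per roll would cost 3 * 100 = 300 bits.
    print(100 * entropy_per_roll)  # 258.5 (approximately)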

~~~
moultano
Someone else commented on this too, and I'm not sure exactly where to go with
that. Discussing block length and compression is deeper into that particular
issue than I was really hoping to go in this, but I don't know of an easier
way of explaining it. Any ideas?

~~~
lmkg
It may be "deeper," but for me information theory only really clicks in my
head in the context of compression. Without compression, I see some voodoo
hand-waving with logarithms and I don't even know what it means for an event
to have X bits of entropy, or why I should care, or if I should believe the
result. With compression, the entropy is suddenly a concrete measurable value
with real-world import, and it's cool that you can prove hard lower bounds
like that. So it may be deep, but at least to me it makes the whole thing make
sense, rather than just adding complication.

You may be able to dodge the issue by not mentioning compression explicitly.
It's nonsensical that a single die roll takes 2.58 bits, but it's perfectly
believable that 100 die rolls could fit in 258 bits. Then with a little work,
you can show that 258 is the smallest expected value of any encoding. That
uses the concept of compression as an intuitive model, without having to
formally define it or call it by name.
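
One way to make that concrete without ever saying "compression" (a toy
Python sketch, assuming a fair six-sided die): treat the whole sequence of
100 rolls as a single base-6 number.

    import math, random

    rolls = [random.randrange(6) for _ in range(100)]  # faces 0..5

    # Pack the sequence into one integer, base 6.
    n = 0
    for r in rolls:
        n = n * 6 + r

    # Every one of the 6**100 possible sequences fits in 259 bits:
    print((6 ** 100 - 1).bit_length())    # 259
    print(math.ceil(100 * math.log2(6)))  # 259, about 2.59 bits/roll

    # Unpack to verify the encoding is lossless.
    decoded = []
    for _ in range(100):
        n, r = divmod(n, 6)
        decoded.append(r)
    assert decoded[::-1] == rolls

The ceiling makes it 259 rather than 258, but that overhead is fixed: it's
already only 2.59 bits per roll, and it approaches log2(6) ~= 2.585 as the
block grows.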

~~~
moultano
Ok, check it now. I added a little bit about that, though I'm sure I could
still improve the wording.

 _Now we have something whose unit is "bits" but whose value includes
fractions of a bit. What can we do with this? After all, if we're only storing
one roll, we still need 3 bits to store 6 possibilities. The trick is that we
can use fewer bits if we are storing more rolls at once. There are 2 wasted
possibilities in those 3 bits we used for the first roll, and if you're
clever, you can use those to encode some information about the next roll. If
we're clever enough, and storing enough at once, 2.58... bits per roll is
the lower bound, and an optimal compression scheme will converge to it._
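
A tiny worked version of the "wasted possibilities" point (my own
arithmetic, not from the article): three rolls have 6^3 = 216 outcomes,
which fit in 8 bits (256 possibilities), so blocking alone already gets you
from 3 bits per roll down to 8/3 ~= 2.67.

    import math

    # Bits needed for a block of k rolls, stored as one number.
    for k in (1, 3, 10, 100):
        bits = math.ceil(k * math.log2(6))
        print(k, bits, bits / k)
    # k=1:     3 bits -> 3.000 bits/roll
    # k=3:     8 bits -> 2.667 bits/roll
    # k=10:   26 bits -> 2.600 bits/roll
    # k=100: 259 bits -> 2.590 bits/roll  (limit: log2(6) ~= 2.585)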

------
ced
It's worth emphasizing that information and entropy are _subjective_, in
the Bayesian sense. The better you know and understand something, the fewer
bits you need to encode it.

Consider the sequence:

31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97

Most people would have a hard time remembering it because it looks random...
But of course, it really is "All the primes between 30 and 100". That's _much_
easier to remember, because the brain (at least, the average brain on HN)
already has concepts for "prime number", "30" and "100".
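
The compressed form is essentially a short program (a throwaway Python
sketch, just to illustrate):

    # Fifteen "random-looking" numbers, regenerated from the much
    # shorter description "the primes between 30 and 100".
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    print([n for n in range(30, 101) if is_prime(n)])
    # [31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]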

Compression has a strong connection to AI. In fact, Occam's razor is
perhaps AI's most fundamental principle: if you have derived many programs
that can perform the same task, which one is the smartest?

A: _The shortest one_.

I would also recommend MacKay's book. It's awesome.

~~~
yters
Even if the smallest wouldn't complete in the lifespan of the universe, and a
slightly bigger one would take 5 minutes?

------
miguelpais
Halfway through the article this encoding / data compression algorithm I
learned last semester came to my mind:

Huffman coding <http://en.wikipedia.org/wiki/Huffman_coding>

It was really simple: given a file with some set of symbols, it would
rewrite those symbols as unique variable-length binary codes, so that the
symbol with the highest number of occurrences would get the code with the
smallest possible bit-length, and so on down to the symbols that occurred
least often, which would get the longest codes.
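
For anyone curious, the whole algorithm fits in a few lines (a toy Python
sketch of my own, not from the Wikipedia article):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Heap entries are (frequency, tiebreak, tree); a tree is either
        # a symbol or a pair of subtrees. The tiebreak keeps heapq from
        # ever comparing two trees directly.
        heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"  # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    print(huffman_codes("abracadabra"))
    # 'a' (5 occurrences) gets the shortest codeword; 'c' and 'd'
    # (1 occurrence each) get the longest. Exact codes depend on ties.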

~~~
moultano
Huffman coding is very closely related to this. If you have a capped block
length, Huffman coding is optimal.
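
For the die example the gap is easy to see (my own numbers, using the
standard trick that the expected Huffman codeword length equals the sum of
the merged node probabilities):

    import heapq

    def huffman_expected_length(probs):
        # Expected codeword length = sum of internal-node probabilities.
        heap = list(probs)
        heapq.heapify(heap)
        total = 0.0
        while len(heap) > 1:
            p = heapq.heappop(heap) + heapq.heappop(heap)
            total += p
            heapq.heappush(heap, p)
        return total

    for k in (1, 2, 3):  # Huffman-code blocks of k die rolls at a time
        bits_per_roll = huffman_expected_length([1 / 6 ** k] * 6 ** k) / k
        print(k, round(bits_per_roll, 3))
    # k=1: 2.667, k=2: 2.611, k=3: 2.605 -> toward log2(6) ~= 2.585

So a Huffman code over single rolls spends 2.67 bits per roll, and coding
longer blocks closes the gap toward the entropy.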

------
compay
If you're looking for a good read, I enjoyed Jeremy Campbell's "Grammatical
Man" quite a bit; it ties together information theory and linguistics.
Obviously not the most in-depth book on either topic, but fascinating
nonetheless.

http://www.amazon.com/Grammatical-Man-Information-Entropy-Language/dp/0671440616

------
brox
Shannon's original paper, "A Mathematical Theory of Communication", is
still a good read:
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

