
A Mini-Introduction to Information Theory - godelmachine
https://arxiv.org/abs/1805.11965
======
gajjanag
Witten's talk at the IAS on this subject may also be of interest:
[https://www.youtube.com/watch?v=XYugyhoohhY](https://www.youtube.com/watch?v=XYugyhoohhY)

~~~
ivan_ah
Halfway through the talk he switches to discussing quantum information theory.

Here is an excellent book on this subject "From Classical to Quantum Shannon
Theory" [https://arxiv.org/abs/1106.1445](https://arxiv.org/abs/1106.1445)

------
graycat
For a "mimi-introduction" to _information theory_ , it's essentially how many
little balls can fit inside a big ball.

Each little ball is from a particular signal as received with its errors in
transmission, and the big ball is from the power of the signal.

So, the number of little balls is how many different signals one can send, at
the assumed noise level, and still have the signals all distinct at the
receiver. So, sure, if one sends with more power, one has a larger big ball
and more signals. Also, if one has lower noise, then the signals are corrupted
less, the little balls get smaller, and one can send more signals.
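In rough numbers, that ball counting leads to the standard capacity formula
for a Gaussian channel. A minimal Python sketch (the powers are made up, just
for illustration):

    import math

    def capacity_bits_per_use(signal_power, noise_power):
        """Gaussian-channel capacity per channel use, via ball counting.

        In n dimensions the big ball has radius ~ sqrt(n * (P + N)) and
        each little noise ball has radius ~ sqrt(n * N), so roughly
        ((P + N) / N)^(n/2) little balls fit, i.e. (1/2) * log2(1 + P/N)
        bits per channel use.
        """
        return 0.5 * math.log2(1 + signal_power / noise_power)

    print(capacity_bits_per_use(10.0, 1.0))   # ~1.73 bits/use
    print(capacity_bits_per_use(100.0, 1.0))  # ~3.33 bits/use

More power or less noise means more little balls fit, exactly as above.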

Sorry, that's my best summary from the time I did look at information theory
some years ago!!!

------
painful
Good find. Also see
[https://arxiv.org/abs/1802.05968](https://arxiv.org/abs/1802.05968)

------
kitd
I realise I'm only a s/w dev and probably not the target audience, but I
stopped reading by the end of the first page. For a mini-introduction, it
assumes a lot of mathematical skill.

~~~
throwawaymath
Unfortunately, information theory is not a topic that can be made accessible
without several nontrivial prerequisites. This would normally constitute a
graduate level topic - something to dig into after you've worked through the
better part of a mathematics undergrad (particularly the upper level
analysis/probability/algebra courses).

Mathematics has a habit (which seems peculiar to those outside the domain) of
calling things like this "introductions" - it's technically true, but
unhelpful to those outside the target audience. Unless you dilute the topic to
the point of unusability, this is _basically_ what an introduction to the
subject looks like.

Still, it's a little hard for me to figure out what the target audience for
this is. Grad students would probably want a focused textbook, so I'm guessing
this is for undergrads who are only partially through the full prereqs, but
who have a lot of interest in it.

~~~
kitd
Yes, I was disappointed because the topic is fascinating in its abstract form,
so from the title I was hoping for something a little more accessible.

However it made me hunt around for more appropriate sources and I found a few,
so that at least is good.

------
amelius
> Suppose that one receives a message that consists of a string of symbols a
> or b, say aababbaaaab··· And let us suppose that a occurs with probability
> p, and b with probability 1−p. How many bits of information can one extract
> from a long message of this kind, say with N letters?

They should first define what they mean by probability. For example, what am I
supposed to fill in for p when b always occurs after exactly three a's?

~~~
nabla9
If you can't figure it out from the sentence you quoted, you are not the
target audience for that paper.

(the symbols are independent)

~~~
amelius
> (the symbols are independent)

But this does not hold for most real-life data streams.

~~~
henrikeh
Ph.D. student in information theory and signal processing here.

It is actually a very reasonable assumption for almost all kinds of data --
given that suitable compression is applied. Data that is well compressed is
essentially uniformly random.
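
A quick Python sketch of that claim (the biased source and its parameters are
my own choices, just for illustration): compress a redundant stream and
compare the byte statistics before and after.

    import math
    import random
    import zlib
    from collections import Counter

    def empirical_entropy(data: bytes) -> float:
        """Empirical entropy of the byte frequencies, in bits per byte."""
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

    random.seed(0)
    p = 0.9  # i.i.d. source: 'a' with probability p, 'b' with 1 - p
    raw = bytes(random.choices(b"ab", weights=[p, 1 - p], k=200_000))
    packed = zlib.compress(raw, 9)

    # Shannon's answer to the question quoted above: a long N-letter
    # message carries about N * H(p) bits, where
    # H(p) = -p*log2(p) - (1-p)*log2(1-p).
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    print(h)                          # ~0.47 bits per letter
    print(empirical_entropy(raw))     # matches H(p) for an i.i.d. source
    print(empirical_entropy(packed))  # close to 8 bits/byte: near-uniform
    print(len(packed) / len(raw))     # roughly tracks H(p) / 8

The compressed stream's bytes come out close to uniform, which is exactly why
"independent and uniform" is a sane model for well-compressed data.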

When p('a') = p and p('b') = 1 - p, it means that the probabilities of 'a' and
'b' do not depend on anything. The probabilities must sum to 1, since together
they form the certain event, so p('a') + p('b') = p('a' or 'b') = p + 1 - p =
1. That is, we assume that only the symbols 'a' and 'b' are possible.

If there were a relationship between 'a' and 'b', say that 'b' always occurs
after every 'aaa', then when we receive 'aaa', we know that the next symbol is
'b' -- always.

So in this case the probability of 'b' has a relationship, a condition on the
history, which would be written as p('b' | hist = 'aaa') = 1. A much more
useful framework for this is a Markov process with a history/memory of 3. A
graph for such a process can be seen here:
[https://mermaidjs.github.io/mermaid-live-editor/#/view/eyJjb...](https://mermaidjs.github.io/mermaid-live-editor/#/view/eyJjb2RlIjoiZ3JhcGggVERcbkEgLS0gYSAtLT4gQlxuQiAtLSBhIC0tPiBDXG5DIC0tIGEgLS0-IERcbkQgLS0gYiAtLT4gQVxuQiAtLSBiIC0tPiBBXG5DIC0tIGIgLS0-IEFcbkEgLS0gYiAtLT4gQSIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0In19)

Each node is a state and the edges represent the possible outputs. The rule
for such a graph is that the probabilities of the "leaving" edges must sum to
1 -- we must of course always leave the current state we are in. Notice that
it takes a sequence of 'aaa' to enter node "D", after which we _must_ output a
'b'. Using some matrix formulations it is possible to calculate the overall
probabilities of 'a' and 'b' (the stationary distribution, I think it is
called).
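
A small numpy sketch of that calculation (the value of p and all names are my
own choices): iterate the transition matrix until the state distribution
settles, then read off the symbol probabilities.

    import numpy as np

    # States = how many 'a's we have just seen in a row: A=0, B=1, C=2, D=3.
    # From A, B, C we emit 'a' with probability p and 'b' with probability
    # 1 - p; from D the next symbol is forced: p('b' | hist = 'aaa') = 1.
    p = 0.7  # arbitrary, just for illustration

    T = np.array([
        [1 - p, p,   0.0, 0.0],  # A -b-> A,  A -a-> B
        [1 - p, 0.0, p,   0.0],  # B -b-> A,  B -a-> C
        [1 - p, 0.0, 0.0, p  ],  # C -b-> A,  C -a-> D
        [1.0,   0.0, 0.0, 0.0],  # D -b-> A   (the forced 'b')
    ])

    # The stationary distribution is the left eigenvector of T with
    # eigenvalue 1; iterating the chain converges to it.
    pi = np.full(4, 0.25)
    for _ in range(1000):
        pi = pi @ T

    p_a = p * pi[:3].sum()  # 'a' can only be emitted from states A, B, C
    print(pi)               # stationary state probabilities
    print(p_a, 1 - p_a)     # overall p('a') and p('b') of the stream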

And to return to the first point: why is independence a reasonable assumption?
In the Markov process, in node D, we know that we must always output a 'b'. In
terms of information theory, if we receive 'aaa' then the 'b' is given and
provides no new information. Therefore we can perfectly predict it, and we
could also remove it (compress the data) without _losing_ information.

'abaabbaaab'

contains the same information as

'abaabbaaa'

since we _know_ that there must be a 'b'.
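
A tiny sketch of that lossless removal (the function names are mine), which
round-trips the example above:

    def drop_forced_bs(s: str) -> str:
        """Remove each 'b' that is forced by a preceding run of three 'a's.

        Assumes the stream obeys the rule, i.e. 'b' always follows 'aaa'.
        """
        out, run = [], 0
        for ch in s:
            if run == 3:  # next symbol must be 'b': it carries no information
                run = 0
                continue
            out.append(ch)
            run = run + 1 if ch == 'a' else 0
        return ''.join(out)

    def restore_forced_bs(s: str) -> str:
        """Reinsert the forced 'b' after every run of three 'a's."""
        out, run = [], 0
        for ch in s:
            out.append(ch)
            run = run + 1 if ch == 'a' else 0
            if run == 3:
                out.append('b')
                run = 0
        return ''.join(out)

    assert drop_forced_bs('abaabbaaab') == 'abaabbaaa'
    assert restore_forced_bs('abaabbaaa') == 'abaabbaaab'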

I hope that explains why independence is reasonable.

~~~
kanjus
Great write-up!

> It is actually a very reasonable assumption for almost all kinds of data --
> given that suitable compression is applied. Data that is well compressed is
> essentially uniformly random.

What kinds of data are an exception? Your explanation seems to cover
everything.

~~~
henrikeh
First of all, we have protocol data such as headers and framing, which we
might never properly get rid of. People might also send uncompressed data.
These are all practical concerns, but for analysis, assuming independence (and
even uniformity) is not wrong, just rough.

Second, you might (will) not be able to completely compress the data. A
picture might be worth a thousand words, but it still takes up a megabyte or
so on disk. That makes for about 1000 bytes per word ;) So the
entropy/information of a picture might be very small ("A dog jumping into
water"), but we have no chance of truly understanding a general source
(reality) and expressing its full machinery.

Think about the difference between JPEG and PNG (or GZIP and a JavaScript
minifier). They are designed for completely different assumptions about the
source and even the receiver. JPEG assumes that the most important part of an
image is what the human visual system perceives; PNG is lossless, but assumes
high inter-pixel dependence. GZIP assumes general bytes (I think); JS
minification assumes that there is a more fundamental representation of the
source without noise (formatting, comments, reasonable names, dead functions).

~~~
kanjus
Cheers!

