A Mini-Introduction to Information Theory (arxiv.org)
136 points by godelmachine 81 days ago | 24 comments



Witten's talk at the IAS on this subject may also be of interest: https://www.youtube.com/watch?v=XYugyhoohhY


Halfway through the talk he switches to discussing quantum information theory.

Here is an excellent book on this subject "From Classical to Quantum Shannon Theory" https://arxiv.org/abs/1106.1445


Ed Witten gave a talk on information theory? At IAS?! Well, now I've got to bookmark this for later; thanks!


For a "mimi-introduction" to information theory, it's essentially how many little balls can fit inside a big ball.

Each little ball is from a particular signal as received with its errors in transmission, and the big ball is from the power of the signal.

So, the number of little balls is how many different signals one can send, at the assumed noise level, and still have all the signals distinct at the receiver. So, sure, if one sends with more power, one gets a larger big ball and more signals. Also, with lower noise, the signals are corrupted less, the little balls get smaller, and one can send more signals.

Sorry, that's my best summary from the time I did look at information theory some years ago!!!
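
(If it helps to see numbers, here is a rough sketch of that counting, assuming a Gaussian-noise channel with signal power P and noise power N over n uses; my own toy, not from the article.)

    # Ball-packing count, roughly: the received signal lives in a big
    # ball of radius sqrt(n*(P+N)) and each little noise ball has radius
    # sqrt(n*N), so the volume ratio gives about ((P+N)/N)**(n/2)
    # distinguishable signals, i.e. 0.5*log2(1 + P/N) bits per channel
    # use -- Shannon's Gaussian-channel capacity.
    import math

    def bits_per_use(P, N):
        return 0.5 * math.log2(1 + P / N)

    for P, N in [(1.0, 1.0), (10.0, 1.0), (10.0, 0.1)]:
        print(f"P={P}, N={N}: {bits_per_use(P, N):.2f} bits per use")
    # More power -> bigger big ball -> more signals; lower noise ->
    # smaller little balls -> more signals, exactly as described above.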


Good find. Also see https://arxiv.org/abs/1802.05968


I realise I'm only a s/w dev and probably not the target audience, but I stopped reading by the end of the first page. For a mini-introduction, it assumes a lot of mathematical skill.


Unfortunately, information theory is not a topic that can be made accessible without several nontrivial prerequisites. This would normally constitute a graduate level topic - something to dig into after you've worked through the better part of a mathematics undergrad (particularly the upper level analysis/probability/algebra courses).

Mathematics has a peculiar habit (to those outside the domain) of calling things like this "introductions" - it's technically true, but unhelpful to those outside the target audience. Unless you dilute the topic to the point of unusability, this is basically what an introduction to the subject looks like.

Still, it's a little hard for me to figure out what the target audience for this is. Grad students would probably want a focused textbook, so I'm guessing this is for undergrads who are only partially through the full prereqs, but who have a lot of interest in it.


Yes, I was disappointed because the topic is fascinating in its abstract form, so from the title I was hoping for something a little more accessible.

However it made me hunt around for more appropriate sources and I found a few, so that at least is good.


Footnote 1 explains "The article is based on a lecture at the 2018 summer program Prospects in Theoretical Physics", which is a 2-week school for grad students, most of whom would not be working on exactly this.


The funny part is that this may also be the most approachable thing Witten has ever written. In case the name isn't familiar, he's a leading theoretical physicist (with a Fields Medal), and I'm sure this was his idea of taking a week's holiday.

https://en.wikipedia.org/wiki/Edward_Witten


> Suppose that one receives a message that consists of a string of symbols a or b, say aababbaaaab··· And let us suppose that a occurs with probability p, and b with probability 1−p. How many bits of information can one extract from a long message of this kind, say with N letters?

They should first define what they mean by probability. For example, what am I supposed to fill in for p when a b always occurs after exactly three a's?


If you can't figure it out from the sentence you quoted, you are not the target audience for that paper.

(p is independent)


This is bluntly stated...but true.

If for nothing else but space conservation, even relatively accessible papers and books need to constrain how fully self-contained they are. Most advanced topics have prerequisites - in this case, you should really know probability (and therefore calculus/analysis) well before tackling information theory. Linear algebra will also be required if you want to do anything practical with information theory, like error correcting codes.

Striking a balance between full exposition and reasonable length can be tricky. You could make this paper fully accessible, but it would be significantly longer for questionable benefit. Most authors are not, in fact, good writers of pedagogical material. This is especially the case if they write beyond their specialization. It turns out it's generally better to go through several more compact and focused resources than one monolithic one.


this comment is a little ironic since what does "p is independent" mean? you mean A={a} and B={b} are independent events?


touche. 'p is independent' is a brainfart.

The string is a sequence of independent and identically distributed random variables (a Bernoulli process). a and b are mutually exclusive events, not independent.


:) i appreciate your comment though that sincere learners don't start out by complaining - i will remember to remind people of that on here more often.


Please don’t discourage honest learners.


Honest learners don't start with a complaint.

I also provided the answer.


> (p is independent)

But this does not hold for most real-life data streams.


That was chapter 2.1 describing Shannon entropy. Jump into the next chapter (2.2). It has conditional entropy.

Learning advice: You shouldn't read line by line, stopping to think after every sentence. Skim each chapter first. What is the main point? You may also want to look at the next chapters to see where they go before finishing the chapter.
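
For concreteness, a tiny sketch of what 2.1 gives per symbol (my own numbers, not the paper's), plus the degenerate case behind the "b after aaa" objection: a forced symbol has zero conditional entropy, so it carries no information.

    # Binary (Shannon) entropy H(p) in bits for an i.i.d. {a, b} source.
    import math

    def H(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(H(0.5))  # 1.0 bit/symbol: N letters carry N bits
    print(H(0.9))  # ~0.469 bits/symbol: a biased source compresses well
    print(H(1.0))  # 0.0 bits: a forced 'b' after 'aaa' adds nothing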


Ph.d. student in information theory and signal processing here.

It is actually a very reasonable assumption for almost all kinds of data -- given that suitable compression is applied. Data that is well compressed is essentially uniformly random.

When p('a') = p and p('b') = 1 - p, it means that the probabilities of 'a' and 'b' do not depend on anything. The probabilities must sum to 1, since together they form the certain event, so p('a') + p('b') = p('a' or 'b') = p + 1 - p = 1. That is assuming that only the symbols 'a' and 'b' are possible.

If there was a relationship between 'a' and 'b', say that 'b' always occurs after every 'aaa', then when we receive 'aaa', we know that the next symbol is 'b' -- always.

So in this case the probability of 'b' has a relationship, a condition on the history, which would be written as p('b' | hist = 'aaa') = 1. A much more useful framework for this is a Markov process with a history/memory of 3. A graph for such a process can be seen here: https://mermaidjs.github.io/mermaid-live-editor/#/view/eyJjb...

Each node is a state and the edges represent the possible outputs. The rule for such a graph is that the probabilities of the "leaving" edges must sum to 1 -- we must of course always leave the current state we are in. Notice that it takes a sequence of 'aaa' to enter node "D", after which we _must_ output a 'b'. Using some matrix formulations it is possible to calculate the probabilities of 'a' and 'b' (the stationary distribution, I think it is called).

And to return to the first point, why independence is a reasonable assumption: in the Markov process, in node D, we know that we must always output a 'b'. In terms of information theory, if we receive 'aaa' then the 'b' is given and provides no new information. Therefore we can perfectly predict it, and we could also remove it (compress the data) without _losing_ information.

'abaabbaaab'

contains the same information as

'abaabbaaa'

since we _know_ that there must be a 'b'.

I hope that explains why independence is reasonable.
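
If someone wants to play with that chain numerically, here is a rough sketch (my own toy; the per-step probability q = 0.5 of emitting 'a' in the non-forced states is an arbitrary choice, since I didn't fix one above):

    # Four states = number of consecutive 'a's seen so far (0..3);
    # after 'aaa' (state 3) a 'b' is forced and we reset to state 0.
    import numpy as np

    q = 0.5                      # P('a') in states 0-2 -- an assumption
    T = np.zeros((4, 4))         # T[i, j] = P(next state j | state i)
    for i in range(3):
        T[i, i + 1] = q          # emit 'a', lengthen the run
        T[i, 0] = 1 - q          # emit 'b', reset the run
    T[3, 0] = 1.0                # the forced 'b'

    # Stationary distribution: left eigenvector of T for eigenvalue 1.
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()

    def H(p):
        return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

    p_a = sum(pi[i] * (q if i < 3 else 0.0) for i in range(4))
    rate = sum(pi[i] * (H(q) if i < 3 else 0.0) for i in range(4))
    print("P(a) =", round(p_a, 4), " P(b) =", round(1 - p_a, 4))
    print("entropy rate =", round(rate, 4), "bits/symbol, vs H(P(a)) =", round(H(p_a), 4))
    # The dependence pushes the entropy rate below the i.i.d. value
    # H(P(a)), which is exactly the redundancy a compressor removes.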


Great write-up!

> It is actually a very reasonable assumption for almost all kinds of data -- given that suitable compression is applied. Data, which is well-compressed, is essentially uniformly random.

What kinds of data are an exception? Your explanation seems to cover everything.


First of all, we have protocol data such as headers and framing, which we might never properly get rid of. People might also send uncompressed data. These are all practical concerns, but for analysis, assuming independence (and even uniformity) is not wrong, just rough.

Second, you might (will) not be able to completely compress the data. A picture might be worth a thousand words, but it still takes up a megabyte or so on disk. That makes for about 1000 bytes per word ;) So the entropy/information of a picture might be very small ("A dog jumping into water"), but we have no chance of truly understanding a general source (reality) and expressing its full machinery.

Think about the difference between JPEG and PNG (or GZIP and a JavaScript minifier). They are designed for completely different assumptions about the source and even the receiver. JPEG assumes that the most important part of an image is what the human visual system perceives; PNG is lossless, but assumes high inter-pixel dependence. GZIP assumes general bytes (I think); JS minification assumes that there is a more fundamental representation of the source without noise (formatting, comments, reasonable names, dead functions).


Cheers!




