Shannon Entropy Imposes Fundamental Limits on Communication (quantamagazine.org)
77 points by harscoat on Sept 7, 2022 | 49 comments



"If someone tells you a fact you already know, they’ve essentially told you nothing at all."

I think it's interesting to consider how this is/is not true in a real life context.

If someone tells you a fact you already know, then it could be communicating several different things.

  1. It might mean *they think* you don't know it.
  2. It might mean they want to make sure *you know* they know you know it.
  3. It might mean they believe you know it and want to communicate they agree.
  4. It might mean they are hoping for you to (or not to) contradict/argue/refute it.
If, however, you can't tell which of those (or some other alternative) is intended, then no information was communicated at all.

Except...there is a message communicated that they have some interest or concern regarding the topic, which is different from remaining silent.

Whenever someone says anything, whether true, false, nonsense, or a lie, they are communicating a true and possibly significant fact: that they chose to say that thing, in that circumstance.

I am thinking of "Gödel, Escher, Bach", if you can't tell.


> "If someone tells you a fact you already know, they’ve essentially told you nothing at all."

The reality of zero information transfer is more subtle than this quote makes it sound, and covers the point you raise. In reality, you must not just know the thing you are being told, but you also have to know in advance that it is the thing you are going to be told. So if I tell you "the sun is shining" when you already know it is, there is still information being transferred if I could also have said "yesterday it was raining". Only if "the sun is shining" is absolutely the only thing I could have said is it a zero information statement.
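
To put a number on that: the information in a received message is its surprisal, -log2 of the probability the receiver assigned to it beforehand. A minimal sketch in Python (the probabilities are invented purely for illustration):

  from math import log2

  def surprisal(p):
      # Information content, in bits, of receiving a message you had assigned probability p.
      return -log2(p)

  print(surprisal(1.0))   # 0 bits: the only thing the speaker could have said
  print(surprisal(0.9))   # ~0.15 bits: you were almost, but not quite, certain
  print(surprisal(0.5))   # 1 bit: a fair coin flip between two possible remarks

Only the p = 1 case is truly zero-information.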


Intent still exists here. We must also consider the communication medium. People typically use speech to communicate, which may tell me something new about the speaker's emotional state. It may tell me if you have a cold or a sore throat in the process. It's lossy and may even tell me something about the medium it passes through.

Zero information transfer is hard. I've never considered it before now, but I wonder if it's even possible. There's even a time component and a space component involved at a fundamental level, it seems. Saying something at one time versus another could differentiate things. Space clearly does. I pretty much need to know the state of everything involved in the information transfer for no information to be transferred.

In that case what is "knowing?" It's not so binary when you think about it. Human memory is lossy, filled with gaps and fictions mixed with reality. Is the information captured and stored truly accurate? Is it mutated by my conscious thought and how I transform, process, and then store it?


I think linguists have a specific name for this kind of utterance ("phatic" expressions, I believe), where the meaning is carried not so much by the literal content as by the act of saying it.

"Good morning!" What does that mean? Is it a good morning, this? Am I now compelled to have a good morning? Are you telling me about your good morning? Maybe, but it also conveys friendliness or politeness and/or other things.


There is information in the fact that a signal was sent and received, for an observer of the receiver.


If the receiver didn't assign a prior probability of 1 to the sender sending the signal then yes.


It is impossible to assign a prior probability of 1 to just about anything in this world. Perhaps death and taxes, but definitely not telecommunication!


> If someone tells you a fact you already know, then it could be communicating several different things.

You are alluding to the "meaning of communication", while information and communication theory is strictly about bits that are stored somewhere and/or sent over a channel. If you know with 100% certainty that the next bit will be zero, then receiving this bit gives you no information beyond what you already knew. Of course, if your certainty is less than 100%, then the bit delivers some fractional quantity of information. Mathematically, a communication channel couples together a random process X at the transmitter with a random process Y at the receiver. The game is to describe X while observing only Y. Mutual information is a quantitative measure of how much Y "knows" about X. There is no human meaning attached to it whatsoever.
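
To make "how much Y knows about X" concrete, here is a rough sketch (my own illustration, not from the article) computing I(X;Y) for a binary symmetric channel that flips each transmitted bit with some probability:

  from math import log2

  def mutual_information(p_x1, flip):
      # I(X;Y) in bits for a binary source with P(X=1) = p_x1 sent through a
      # binary symmetric channel that flips each bit with probability flip.
      p_x = [1 - p_x1, p_x1]
      # Joint distribution P(X=x, Y=y) = P(X=x) * P(Y=y | X=x)
      p_xy = {(x, y): p_x[x] * (flip if x != y else 1 - flip)
              for x in (0, 1) for y in (0, 1)}
      p_y = [sum(p_xy[(x, y)] for x in (0, 1)) for y in (0, 1)]
      return sum(p * log2(p / (p_x[x] * p_y[y]))
                 for (x, y), p in p_xy.items() if p > 0)

  print(mutual_information(0.5, 0.0))   # noiseless channel: 1 bit per symbol
  print(mutual_information(0.5, 0.5))   # pure noise: Y tells you nothing about X
  print(mutual_information(0.5, 0.11))  # ~0.5 bits: the receiver learns about half a bit

No meaning anywhere, just how much observing Y narrows down X.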


You could say the informative part of the message is the first X bits, from the start of the communication up to the point where you realize what is being said and can predict the rest with certainty; that is, the less you expected them to say the thing they're saying in that context, the more information is transmitted.


> That you can zip a large movie file, for example, owes to the fact that pixel colors have a statistical pattern, the way English words do. Engineers can build probabilistic models for patterns of pixel colors from one frame to the next.

This is, in fact, generally incorrect. Movie files are generally compressed via _lossy_ compression, which escapes the Shannon entropy limit. Entropy-based encoding, as described in this article, is generally only the final stage in a video "codec" (algorithm to compress and decompress video). The first stage relies on finding information in the visual field that humans are unlikely to notice the loss of, and discarding it. The second stage does interframe (between frames) compression by simply searching for blocks of pixels that match. Entropy coding is generally the final stage.

This, by the way, is why zipping a video file can make it... larger.
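
A toy illustration of why the prediction stage matters before the entropy stage (synthetic "frames" and a per-pixel delta as a much-simplified stand-in for motion search, just to show the idea):

  import random
  import zlib

  # Two fake frames: the second differs from the first in only a few pixels.
  w, h = 64, 64
  frame1 = bytes(random.randrange(256) for _ in range(w * h))
  frame2 = bytearray(frame1)
  for _ in range(50):                        # a small "moving object"
      frame2[random.randrange(w * h)] = random.randrange(256)

  delta = bytes((a - b) % 256 for a, b in zip(frame2, frame1))
  print(len(zlib.compress(bytes(frame2))))   # about full size: the raw frame looks random
  print(len(zlib.compress(delta)))           # much smaller: the delta is mostly zeros, which entropy-code well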


This is a pretty unkind reading. The author almost surely was using zip as a synonym for compress. Some people use "hoover" as a proxy for vacuum, even though they're not literally using a Hoover vacuum to clean.

Movies aren't "escaping" the Shannon entropy limit, they're falling well within the Shannon entropy limit for the appropriate domain of human vision.

Whether it's throwing out high-frequency noise at one stage or using moving blocks and keeping delta changes at another, there's an underlying assumption about what the "codeword" space is versus the source space. Even by the time it gets to the codec, it's already "compressed" by throwing out almost all the spectra not visible to us. These are implicit assumptions about the domain that you've dismissed.


That lossy compression "escapes" the Shannon entropy limit doesn't seem like an accurate way of putting it. Lossy compression throws information away, so there is less information to compress. The entropy limit is still in place.

I agree with the overall point, though, that movie files aren't "zipped" up, but are compressed using lossy compression (with few exceptions).


Lossy compression is something Shannon specifically addressed with Rate Distortion Theory https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory


> Engineers can build probabilistic models for patterns of pixel colors from one frame to the next. The models make it possible to calculate the Shannon entropy [...] That value tells you the limit of “lossless” compression — the absolute most the movie can be compressed before you start to lose information about its contents.

While this is generally true in practice, strictly speaking the shortest possible encoding of a sequence of bits could be smaller than this would suggest, since it's possible a pattern exists that has escaped notice.

For example: the first 10 billion digits of pi pass statistical tests for randomness, but can be generated by a short program which calculates them until the desired length is reached, in effect "compressing" them to the length of the generator program.
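
For the curious, one such short program is a Python rendering of Gibbons' unbounded spigot algorithm (shown here purely as an illustration; it gets slower as it goes, but it stays a handful of lines):

  from itertools import islice

  def pi_digits():
      # Gibbons' unbounded spigot: yields the decimal digits of pi one at a time.
      q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
      while True:
          if 4 * q + r - t < m * t:
              yield m
              q, r, t, k, m, x = 10 * q, 10 * (r - m * t), t, k, (10 * (3 * q + r)) // t - 10 * m, x
          else:
              q, r, t, k, m, x = q * k, (2 * q + r) * x, t * x, k + 1, (q * (7 * k + 2) + r * x) // (t * x), x + 2

  print("".join(str(d) for d in islice(pi_digits(), 20)))  # 31415926535897932384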

Because of considerations like this, Algorithmic Information Theory[1] equates the information content of a bit sequence with the length of the shortest program which will generate it - though this has the drawback of being generally uncomputable - an intriguingly different paradigm.

1: see https://en.wikipedia.org/wiki/Algorithmic_information_theory


I think your comments on encoding a sequence of bits/pi are a little misleading. For example, in the case of the first 10 billion digits of pi, you do not even need a program to calculate it. One can just use a lookup table which maps the digits 0 through 9 to the frequency with which they appear in 10 billion digits of pi, which is <40 bytes of data. You are confusing the entropy of the digits of pi, for which the order does not matter, with the entropy of the sequence of pi, for which the order does matter. In the latter, you would want to compare the probability of observing the exact sequence of pi with that of other sequences of numbers. This is an entirely different question. Because the sequence of digits of pi is deterministic, its entropy is 0, i.e., if one had a machine from which you could sample "10-billion-digit sequences of pi", you would get the same sequence every time.

>strictly speaking the shortest possible encoding of a sequence of bits could be smaller than this would suggest, since it's possible a pattern exists that has escaped notice.

If it is a deterministic pattern, then the entropy of the sequence is actually 0. Otherwise, the existence of patterns is something the entropy computation already takes into account, and is in fact the entire point. In practice, finite sample sizes introduce noise, so we can only ever estimate the true entropy, and Gibbs' inequality (a consequence of Jensen's) implies that the expected code length under any estimated model is at least the true entropy of the system, so even deterministic sequences look noisy through our imperfect models. This is perhaps the clearer version of your point.


Contrariwise, to me you seem to be the confused party.

I am contrasting the compression given by lzw or zip as discussed in the article, which would be unable to significantly compress the bit sequence of pi, with the "compression technique" of simply calculating the same sequence (in order!) with a short program.

A lookup table is irrelevant, since we're talking about printing out the first 10 billion pi digits in order by a simple algorithm for computing them, not about anything involving calculating their frequency or probability. (In any case, since those digits pass statistical randomness tests, the frequency distribution wouldn't tell you much, which is why zip-style compression fails on pi.)

The point I'm making is that running a very short program can print a 10 billion digit sequence which stats-based compression methods would fail to significantly compress. The digits of pi happen to be a good example of this, but there are likely plenty of others.

Regarding the second part of your comment: the notion of a "deterministic pattern" is intriguing, but what does it mean? If you mean "not random", then in the algorithmic-information world that equates to "can be printed by a program shorter than itself". Obviously any sequence (including a random one) can be printed by some program, but the presence of regularity (or "patterns", if you like) in a sequence allows shorter encodings, either by creating a dictionary of recurring subsequences, as in lzw- or zip-style compression, or by some other means, as in the calculation of pi's digits.

I suggest you read up on AIT or Kolmogorov complexity at the wikipedia links I posted to get an idea of what my comment was actually about :)


You're not considering the trade-off between space and time. A short program may be able to compute the digits of pi to the accuracy you want, but you're discounting the runtime and the resources it subsequently uses to compute those digits.

Readers could be left with the feeling that somehow you can get information for free, which is not the case. I can't start a program that prints all strings and claim I've achieved ideal compression because the desired output is among one of the strings printed.

It's not just incomputability but the resources involved.

I'm only barely familiar with Kolmogorov complexity and, as you point out, it looks like this is what Algorithmic Information theory is designed for but even in those scenarios, there's clearly a push and pull between run time and memory. As a heuristic, I tend to think of there being a sort of conservation law at work where sacrificing space will increase run-time and vice-versa.


>I can't start a program that prints all strings and claim I've achieved ideal compression because the desired output is among one of the strings printed.

I don't follow why this is an example of run time mattering. It sounds like an example of not including crucial information in your "compressed" output to enable decoding. You'd have to have a pointer into your sequence.

Also, why wouldn't consideration of run time be a free choice people make? I vaguely remember trying some allegedly state of the art compressors and finding them extremely slow. Probably in the PPM family. Theoretically, people are sometimes interested in the maximum compression no matter how much it takes in resources.


If, at the fastest speed a computer can physically operate, the program would take longer than the expected duration of the universe, I believe we can safely say it does not usefully represent the information, even though mathematically it may seem to.

Less absurdly, if the program takes longer to run than the MTBF of the computer it operates on, we can say the same thing with some probability.

It’s a roundabout way of saying that what seems to work in math isn’t always possible, useful, or functional in material reality. Since Kolmogorov complexity seems to point in this direction of blurring computation in material reality with mathematical information, it seems worth considering.


Yes, this is a purely theoretical observation, and not relevant to any practical compression scheme.

In "algorithmic complexity" the time taken to run a program is not taken into account. There is another interesting complexity measure, Levin complexity, which does take time into account. See http://www.scholarpedia.org/article/Universal_search#Levin_c... .

Regarding "information for free", I think your thought about the program that prints all strings is absolutely correct - to count as an encoding of a string the program must print just that string and then halt.

What is uncomputable (in the general case) is the length of the shortest program that will encode a given string in that sense, IOW the "Kolmogorov complexity" of a string is uncomputable.

See [1] for a proof of this. Roughly:

Imagine a computable function K that returns the Kolmogorov complexity of a string, and suppose its implementation occupies l bytes. Now consider a simple program P which uses K to evaluate strings one by one until it finds a string s with complexity at least l plus 10000000, then prints s and halts. P is just a tiny loop wrapped around K, so its length is well under l + 10000000. But P prints s and halts, so the Kolmogorov complexity of s is at most the length of P, which is less than l + 10000000 - contradicting the fact that s was chosen to have complexity at least l + 10000000. So there can be no computable function K.

1: https://en.wikipedia.org/wiki/Kolmogorov_complexity#Formal_p...


My apologies, I mangled the delivery of what I was trying to say.

I was trying to highlight the concerns about incomputability, because the concern is real. Finding "the smallest program" for anything is, in general, for a wide array of problems, either flat-out NP-complete, and so intractable, or just uncomputable.

But we do have programs that can encode things efficiently, and often we don't need the most efficient program; we can get away with some program that gets close to the theoretical limit. The spigot algorithm for computing the digits of pi is perhaps an example.

My point was that even if you sidestep the incomputability for a moment, you're still not getting anything for free when you move to encoding things in programs. What was laid out in memory is now exchanged for run-time performance.

I'm well aware of all the tricks that can be played to prove things incomputable. In my view these are variations on the proof that the halting problem is undecidable.

The more practical issue, in my view, is that NP-Complete problems are, essentially, a finite restatement of the halting problem. Instead of asking for the shortest program, ask what the shortest program is for a given finite memory and/or time bound. Then the problem does become computable, just intractable, and this is what NP-Complete problems are trying to address.

For this discussion, one could imagine setting up an NP-Complete problem where the solution is the encoded video. This might be efficient to describe as a program but searching the space could blow up, costing exponential time to find.


Well, yes, you're replying to me as though I was suggesting some scheme for actually compressing things.

But as I say it's just a theoretical observation, nitpicking the bold claim that statistical compression techniques determine "the absolute most the movie can be compressed before you start to lose information". If the movie consisted of digital noise which turned out to be a representation of the first 10 billion digits of PI (for example) then that claim would be incorrect.


I’ll ask it here since I’ve not seen it mentioned anywhere: the Kolmogorov work is interesting, but doesn’t it exclude the information necessary to produce a team of educated humans, a universal computer, a programming language, and a compiler/interpreter, at least? In a brief reading AIT seems to assume a universal Turing machine should exist by natural law.


There is no problem with assuming such a Turing machine exists so long as you compare only measurements of information with respect to that Turing machine. Generally what is relevant to study is entropy differences anyway.

I don't know much about AIT in specific, but it seems that Chaitin 1975 (https://dl.acm.org/doi/10.1145/321892.321894) acknowledges this, as the algorithmic entropy differs from the usual Shannon entropy by a constant term, which one might imagine is the information needed to encode the Turing machine.

I also wanted to mention that because you thought of the information theoretic implications of building a computer, etc., you might be interested to know that this kind of thinking is very interesting and cutting edge in biophysics. Organisms have to essentially build computers to do certain tasks, and in this way they must take into account both Shannon entropy (i.e., information in the abstract sense) and thermodynamic entropy! This is essentially the Maxwell's demon paradox (https://en.wikipedia.org/wiki/Maxwell's_demon).


Thanks very much for the pointer to the paper and ideas. I am quite interested in artificial life and the information theoretical + biophysics view on that is something I’ll enjoy looking into.


Regarding the interpreter aspect, that is dealt with by fixing some universal Turing machine - call it U - which runs all the programs considered, so that the program length of U becomes a constant which can either be factored out or reduced to insignificance by considering programs and strings with length much greater than U's.

It is still a fly in the theoretical ointment that the K complexity of a given specific string might differ under different choices of U, not least because programs that encode some string might find chunks of it inside a specific U: for example, a program which prints the BIOS sign-on of a particular firmware could be shorter when running on that firmware.

Regarding the biological / sociological aspects you mention, I think the theoretical framework requires only the bare existence of some UTM with a finite size.

If it is really Universal, you could just run the program which produces human society on it until we reach the present, I guess :)


Ha! Thanks for your insights. It does seem like a significant gap between the theory/maths and the implementation.


It's a mostly theoretical concept so when making comparisons you can pick whatever machine you want, pretty much. Actual compression contests (like the hutter prize: http://prize.hutter1.net/) will pick a particular platform and configuration to run on.


Could a smaller program be written that generates the code that generates the digits of pi? So the code that writes the code is the smallest information content? I wonder how many times this could recurse before there is a limit...


That's the https://en.wikipedia.org/wiki/Kolmogorov_complexity. It's uncomputable in general.


Nice, this is exactly what I meant. Actually kind of proud of myself for having a similar thought as Kolmogorov.


Does the seed of a Minecraft world fall into this category?


Yes, provided you use the same version of Minecraft each time. The world-generation algorithm might change from version to version, after all.


OK but "decompression" takes infinite time


> the first 10 billion digits of pi

No it doesn't; the program you generate produces the first 10 billion digits and then stops.


The information theory class I took in grad school was one of the most mindbending classes I've ever taken. It made doing gnarly math problems fun.

Claude Shannon was a titan. He's also known for publishing what some call the most important master's thesis of all time, https://en.wikipedia.org/wiki/A_Symbolic_Analysis_of_Relay_a....


For anyone interested in these ideas, I'd highly recommend reading Shannon's original 1948 paper (linked in the article). While technical and precise, it is actually quite digestible for someone with a bit of a quantitative background.


I'd also highly recommend watching The Bit Player documentary. Not as info-dense as an article or paper, but it's a fun ride.


Information theory is fascinating to anyone who likes the mathy underpinnings of seemingly non-mathy things, and it's everywhere. Cool stuff.

Speaking of information and decompression, my brain initially interpreted the title as a breaking news headline about a female supervillain who's also a theory nerd.

It was only the added context of the domain name that fixed the error.


I'm more interested in the problem of sending strings out into the world where more strings may "stick" to them, possibly producing more strings. You know, like life. Or software. Given a particular string, can you determine its age? Can you determine which discrete strings came into contact with it along its journey, and when? And so on. I'm not sure if this is interesting to anyone else, but it sure is interesting to me! Consider how these are the boundary conditions that justify software design decisions so primordial we don't even consider them decisions anymore.


One example I like to use (when talking about entropy):

A four digit numeric PIN (that we know) has 0 bits of entropy. There is no uncertainty about what the PIN actually is. A randomly selected one (that we do not know) has just over 13 bits.

  import math
  print(math.log(10) / math.log(2) * 4)
  # 13.28771237954945

The more entropy, the more uncertain we are.

However, humans are not random. We use the year we were born, some keyboard sequence, or some other predictable number as our PIN. We don't know exactly how much entropy these PINs have (there is some degree of uncertainty), but we do know it is significantly less than 13 bits.
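
To illustrate that last point, a rough sketch (the 10% figure for "1234" is invented for the example, not measured):

  from math import log2

  def entropy(probs):
      # Shannon entropy, in bits, of a discrete distribution.
      return -sum(p * log2(p) for p in probs if p > 0)

  n = 10_000
  uniform = [1 / n] * n
  # Suppose 10% of people pick "1234" and everyone else picks uniformly at random.
  skewed = [0.1] + [0.9 / (n - 1)] * (n - 1)

  print(entropy(uniform))  # ~13.29 bits
  print(entropy(skewed))   # ~12.43 bits: the attacker's uncertainty has dropped

Real PIN distributions are more skewed than this toy example, which is why the entropy ends up significantly below 13 bits.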


I wrote a blog post [1] with an interactive widget where you can provide an encoding for a random decimal digit and see how close you can get to the theoretical log₂(10) ≈ 3.32 bits.

[1]: https://blog.kardas.org/post/entropy/ (Average Code Length section)
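
Along the same lines, here is a rough sketch (mine, not from the post) that builds Huffman code lengths for ten equally likely digits and compares the average code length with the log₂(10) ≈ 3.32-bit entropy:

  import heapq
  from math import log2

  def huffman_lengths(probs):
      # Returns the code length Huffman coding assigns to each symbol.
      heap = [(p, i, [i]) for i, p in enumerate(probs)]   # (probability, tiebreak, symbols in subtree)
      heapq.heapify(heap)
      lengths = [0] * len(probs)
      counter = len(probs)
      while len(heap) > 1:
          p1, _, s1 = heapq.heappop(heap)
          p2, _, s2 = heapq.heappop(heap)
          for i in s1 + s2:
              lengths[i] += 1          # each merge adds one bit to every symbol beneath it
          heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
          counter += 1
      return lengths

  probs = [0.1] * 10                   # a uniformly random decimal digit
  lengths = huffman_lengths(probs)
  print(sum(p * l for p, l in zip(probs, lengths)))  # ~3.4 bits per digit
  print(log2(10))                                    # ~3.32 bits, the entropy

Huffman can't quite reach the entropy here because code lengths must be whole bits; coding digits in blocks (or using arithmetic coding) closes the gap.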


Here's a riddle.

If using my birthday reduces the entropy of my PIN, what does it do to its entropy if I happen to have the same birthday as one of the most famous people in the world? Does it matter if I am aware or not? Does it matter what they use for their PIN?

For the sake of argument, I'm thinking month and day, not year.


The important thing is that your PIN has zero entropy, regardless of its value. Entropy is a property of distributions, not of individual values. You may be thinking of the probability (or information content) your PIN is assigned when looking at the overall distribution of PINs, in which case it probably does matter how popular your birthday is (and whether it also matches common patterns people use for PINs). This does feed into the calculation of entropy for the distribution, but then it ceases to tell you anything about your PIN specifically. It also only makes sense relative to the distribution, so it matters how you specify the PINs you are comparing it to.

The 'information content' of a given outcome is the logarithm of the inverse of its probability (i.e. more unlikely events give you more information), and the entropy of a distribution is the expected value of this information content.


That is not a riddle, just a question of how to handle information about the distribution of passwords in the population. So you get the same answer as if it were four alphabetic characters and your choice were "soup".


Information theory is so cool. I feel like having a solid grasp of the fundamentals is critical for achieving mastery in software development. Closely related would be digital signal processing and frequency domain techniques, both also extremely useful in many branches of software.

The actual information content of a thing has always surprised me, especially when factoring in slightly imperfect (but passable) representations such as JPEG or MP3. To lose some information seems entirely acceptable in many cases.




