
Maximum Entropy Intuition for Fundamental Statistical Distributions - yetanothermonk
https://longintuition.com/2020/07/20/max-entropy-intuition.html
======
klodolph
> Statisticians are quick to reach for the Central Limit Theorem, but I think
> there’s a deeper, more intuitive, more powerful reason.

> The Normal Distribution is your best guess if you only know the mean and the
> variance of your data.

This is putting the cart before the horse, for sure. The reason why you only
know the mean and the variance of your data is because you chose to summarize
your data that way. And, the reason why you chose to summarize your data that
way is _in order to get the normal distribution_ as the maximum entropy
distribution.

The normal distribution appears in a lot of places because it is the limiting
case of many other distributions; this is the central limit theorem. It is
very easy to work with the normal distribution because you can add or subtract
a bunch of normal distributions and the result is just another normal
distribution. You can add or subtract a bunch of _other_ distributions and the
resulting distribution will often be closer to normal. You can do a lot of
work with the normal distribution using linear algebra techniques.
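
A quick numpy check of that closure property (a sketch of my own, not part of
the comment): the sum of two independent normals should again be normal, with
the means and the variances simply adding.

    import numpy as np

    rng = np.random.default_rng(0)

    # N(1, 2^2) + N(3, 1.5^2) should be N(4, 2.5^2): means add, variances add.
    x = rng.normal(1, 2.0, 200_000) + rng.normal(3, 1.5, 200_000)
    print(x.mean(), x.std())   # ~4.0 and ~2.5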

So, you choose to measure mean and variance in order to make the math easier.
This does not always result in the best outcome. For example, if you need more
robust statistics, you might go for the median and the mean absolute
deviation, rather than the mean and variance. Then when you choose the maximum
entropy distribution from that summary, you end up with the Laplace
distribution. The Laplace distribution is much less convenient to work with
mathematically than the normal distribution.
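
A small scipy check of that claim (my own sketch): among distributions with
the same mean absolute deviation, the Laplace has the higher differential
entropy, which is exactly the max-entropy property being invoked.

    from math import sqrt, pi
    from scipy import stats

    mad = 1.0  # target mean absolute deviation about the center

    # A Laplace with scale b has E|X - mu| = b.
    laplace = stats.laplace(scale=mad)
    # A normal with std sigma has E|X - mu| = sigma * sqrt(2/pi).
    normal = stats.norm(scale=mad / sqrt(2 / pi))

    print(laplace.entropy())  # ~1.693
    print(normal.entropy())   # ~1.645 -- lower, so the Laplace wins here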

~~~
jbay808
> This is putting the cart before the horse, for sure. The reason why you only
> know the mean and the variance of your data is because you chose to
> summarize your data that way.

No, it's not... A Gaussian is the best way to represent your knowledge of a
value if you only know its mean and variance.

So if you start with a stack of data and compress it down to a mean and
variance, you've discarded most of your knowledge, and are left with a
Gaussian as your best guess representation.

Yes, if you were to boil it down to different summary data, like a max and min,
you'd end up with a different state of knowledge and a different distribution.

But given a mean and variance, the Gaussian is your best choice of
distribution, and not because of the central limit theorem, but because it has
maximum entropy on those constraints. You don't always even have access to the
source data in the first place, maybe just the summary statistics.
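
One quick way to see that numerically (a scipy sketch of my own, not part of
the comment): hold the mean and variance fixed and compare the differential
entropies of a few familiar distributions; the Gaussian comes out on top.

    import numpy as np
    from scipy import stats

    var = 1.0  # all three distributions below have mean 0 and variance 1

    candidates = {
        "normal":  stats.norm(scale=np.sqrt(var)),
        "laplace": stats.laplace(scale=np.sqrt(var / 2)),  # Laplace var is 2*b^2
        "uniform": stats.uniform(loc=-np.sqrt(3 * var), scale=2 * np.sqrt(3 * var)),
    }
    for name, dist in candidates.items():
        print(name, dist.var(), dist.entropy())
    # normal ~1.419 > laplace ~1.347 > uniform ~1.242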

~~~
mturmon
I would like to push back against this in favor of the original comment. The
context of this remark within the article is:

> I was extremely confused as to why the Normal (Gaussian) Distribution pops
> up everywhere—in kurtotically-ignorant financial market analysis, in nature,
> everywhere. Thinking about it, the prevalence of the Gaussian is actually
> rather abnormal. Can you guess why it’s everywhere?

This is not a "compression of data" question. It's not an "uninformed
distributional choice" question.

It's a "why is this distribution prevalent in Nature" question.

In this context, I think the CLT gives a better answer. There are a lot of
averaging processes in Nature, and due to the CLT, averaging of independent
perturbations _must_ give rise to normal distributions.
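
A rough simulation of that (my own, not from the comment): average n
independent, strongly skewed perturbations and the skewness and excess
kurtosis of the average shrink toward the Gaussian values of zero.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Average n iid exponential perturbations; for exponentials the skewness of
    # the average is 2/sqrt(n) and the excess kurtosis is 6/n, both -> 0.
    for n in (1, 10, 100, 1000):
        avg = rng.exponential(size=(10_000, n)).mean(axis=1)
        print(n, round(stats.skew(avg), 3), round(stats.kurtosis(avg), 3))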

It's perhaps possible to go a step deeper than the above. In some physical
systems, you can look at the second moment as an energy -- like the
voltage-squared in electrical systems.

In this case, due to the a priori finiteness of the system energy, the
Gaussian distribution can make a claim to being "inevitable" by the maxent
argument in OP. ("In a system characterized by finite energy E, what is the
least informative distribution consistent with that constraint?")

~~~
lambdatronics
Because additive processes are common. If the variables were multiplied
together instead of summed, you'd get a different asymptotic distribution.

~~~
CrazyStat
You'd get the lognormal, since multiplication is just the exponentiation of
summation.
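
A tiny sketch of my own illustrating that: the log of a product of many
positive factors is a sum of logs, so the CLT makes the log roughly normal and
the product roughly lognormal.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Product of 200 iid positive factors (uniform on [0.9, 1.1] here).
    prod = rng.uniform(0.9, 1.1, size=(50_000, 200)).prod(axis=1)

    # The product itself is strongly right-skewed, but its log is nearly
    # symmetric -- i.e. the product is approximately lognormal.
    print(stats.skew(prod), stats.skew(np.log(prod)))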

------
ianhorn
One thing I'd add to this is that this kind of thinking makes your coordinate
system really matter.

Consider a measurement of some uncertainly sized cubes. You could describe
them with their edge length or their volume. Learning one tells you the other.
They're equivalent data. However, a maximum entropy distribution on one isn't
a maximum entropy distribution on the other.
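
A concrete version of that (my own numpy sketch): take the max-entropy
(uniform) distribution on edge length and push it through L -> L^3; the
implied distribution on volume is far from uniform, so it can't also be the
max-entropy choice in the volume coordinate.

    import numpy as np

    rng = np.random.default_rng(3)

    # Uniform (max entropy) on edge length L in [1, 2].
    L = rng.uniform(1.0, 2.0, 100_000)
    V = L**3  # the same cubes, described by their volume in [1, 8]

    # If V were also uniform on [1, 8], each of the 7 bins would hold ~1/7 of
    # the samples; instead the mass piles up at small volumes.
    counts, _ = np.histogram(V, bins=7, range=(1.0, 8.0))
    print(counts / counts.sum())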

Pragmatically, there's always something you can do (e.g. a Jeffreys prior),
but philosophically, this has always made me uneasy with justifications about
max entropy that don't also have justifications of the choice of coordinate
system.

~~~
canjobear
This seems important. Is there somewhere where this example is worked out in
more detail?

~~~
lambdatronics
[http://bayes.wustl.edu/etj/articles/prior.pdf](http://bayes.wustl.edu/etj/articles/prior.pdf)

~~~
yetanothermonk
Thank you!

------
jmoss20
Thought experiment: suppose your friend drives 80 miles to visit you. They
tell you the trip took between 2 and 4 hours. You have no further information.
How confident are you the trip took less than 3 hours?

Now they tell you they maintained a constant speed throughout the trip, a
speed somewhere between 20 and 40mph. How confident are you your friend was
driving faster than 30mph?

The principle of maximum entropy, applied to each description, gives you
different answers. P(speed > 30mph) = 0.5 implies a median trip time of
2hr40min, not 3hrs. What gives? Which is the _real_ way we should formulate
travel times?
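
Spelling out the arithmetic (my own numbers, same 80-mile setup): the two
"uninformative" uniform priors genuinely disagree about the same events.

    from scipy import stats

    dist_miles = 80.0

    # Max entropy on trip time: uniform on [2, 4] hours.
    time_prior = stats.uniform(loc=2, scale=2)
    print(time_prior.cdf(3))                    # P(trip < 3h) = 0.5
    print(time_prior.cdf(dist_miles / 30))      # P(speed > 30mph) = P(T < 2h40m) ~ 0.33

    # Max entropy on speed: uniform on [20, 40] mph.
    speed_prior = stats.uniform(loc=20, scale=20)
    print(speed_prior.sf(30))                   # P(speed > 30mph) = 0.5
    print(speed_prior.sf(dist_miles / 3))       # P(trip < 3h) = P(v > 26.7mph) ~ 0.67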

See:
[https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)](https://en.wikipedia.org/wiki/Bertrand_paradox_\(probability\))
Credit for this example: Michael Titelbaum

~~~
fractionalhare
This paradox is a good motivator for when Bayesian probability is useful.
Your confidence is a posterior probability which is conditioned on some prior
information. Initially you have little prior information, except for an
interval of time and distance. When you receive information about how the
speed behaved throughout the trip, this meaningfully updates your priors, and
so the posterior changes.

~~~
jmoss20
The upshot here is that choosing the max entropy distribution as your prior
isn't enough, you also need to choose some particular way to formulate the
problem. Particular formulations (travel time vs. speed, here) imply different
max entropy priors, even though the formulations are equivalent. Worse, there
are infinitely many equivalent formulations, all with different implied max
entropy priors.

You can get around this by choosing a non-max entropy prior, like [1], or by
deciding on the One True Formulation for your problem. But (Bayesian) updating
on the other formulations of the problem won't do it, because there isn't any
information in the other formulations -- they're equivalent (by def).

[1]:
[https://en.wikipedia.org/wiki/Jeffreys_prior](https://en.wikipedia.org/wiki/Jeffreys_prior)

------
canjobear
You can derive these distributions with a lot less algebra by characterizing
them with invariances, rather than maximum entropy under constraints.

[https://stevefrank.org/reprints-
pdf/16Entropy.pdf](https://stevefrank.org/reprints-pdf/16Entropy.pdf)

~~~
yetanothermonk
Very cool! Thank you for sharing. We can get to these distributions in a
bunch of ways, and I find that each additional way of looking at something
helps you understand it better. Now, I’m about to get nerdsniped by symmetry
and invariances.

------
martopix
With this method, you can derive all of statistical mechanics from information
theory with constraints originating from thermodynamics. The observation of
thermodynamic quantities, which are high-level observations on ensembles of
particles (i.e. related to means, etc., and not to individual particles),
imposes constraints of the same kind as the ones listed in this article. This
approach was pioneered by Jaynes (1952) "Information theory and statistical
mechanics, I":
[https://www.semanticscholar.org/paper/Information-Theory-
and...](https://www.semanticscholar.org/paper/Information-Theory-and-
Statistical-Mechanics-Jaynes/08b67692bc037eada8d3d7ce76cc70994e7c8116)
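
To make that concrete, here's a toy sketch of my own (in the spirit of Jaynes,
not taken from the paper): the max-entropy distribution over discrete energy
levels, constrained only to a given mean energy, has the Boltzmann/Gibbs form
p_i ∝ exp(-β E_i), where β is the Lagrange multiplier for the energy
constraint.

    import numpy as np
    from scipy.optimize import brentq

    # Toy energy levels (arbitrary units) and a target mean energy.
    E = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    U_target = 1.2

    def gibbs(beta):
        w = np.exp(-beta * E)
        return w / w.sum()

    # Solve for the multiplier beta so the mean energy matches the constraint.
    beta = brentq(lambda b: gibbs(b) @ E - U_target, -10, 10)

    p = gibbs(beta)
    print(beta, p, p @ E)   # p_i proportional to exp(-beta*E_i), mean energy 1.2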

~~~
kgwgk
> This approach was pioneered by Jaynes (1952)

minor correction: 1957

This is a more detailed introduction to the subject (from 1962):
[https://bayes.wustl.edu/etj/articles/brandeis.pdf](https://bayes.wustl.edu/etj/articles/brandeis.pdf)

------
carlosf
> The Normal Distribution is your best guess if you only know the mean and the
> variance of your data.

That's awful advice for some domains. If your process dynamics are badly
behaved (statistically), such as power laws and the like, it turns out the
"mean" and "variance" you're calculating from samples are probably rubbish.

Choosing a starting distribution is actually a statement about how you're
exposing yourself to risk; there is no such thing as a "best guess".
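
As an illustration (my own quick simulation): for a Pareto with tail index 1.5
the theoretical variance is infinite, so the sample variance never settles
down no matter how much data you collect.

    import numpy as np

    rng = np.random.default_rng(4)

    # Classical Pareto with x_m = 1 and tail index 1.5: finite mean (3.0),
    # infinite variance.
    for n in (1_000, 100_000, 10_000_000):
        x = rng.pareto(1.5, n) + 1.0
        print(n, x.mean(), x.var())
    # The sample mean wanders slowly toward 3.0, while the sample variance
    # jumps around wildly and tends to grow with n instead of converging.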

~~~
yetanothermonk
I’m making no statement on what your priors are, just that if you have the
mean and variance, the max entropy distribution is the Normal. If you know the
skew and kurtosis, you’ll pick something else.

------
GolDDranks
"And if we weigh this by the probability of that particular event happening,
we get info ∝ p ⋅ log2(1/p)"

I fail to see the motivation for this step, and I think that's preventing me
from seeing the argument as "intuitive". Could somebody explain?

Two steps back (info ∝ 1/p), it still makes sense to me: the rarer the event,
the bigger the resulting number, so when the event does happen, the more
"surprised" we are, and the more information is gained. However, what do we
achieve by weighting the bit count of the information by the probability?

~~~
GolDDranks
Ah, I think I got it. The point of the exercise is not to formulate the
concept of "amount of surprise (∝ amount of information gained) IN CASE the
event happens" but the "EXPECTED amount of information gain" (i.e. the
entropy), which we can know before the event happens.

That's why we need to take a middle ground between very common events, which
aren't surprising and gain us hardly anything, and rare events, which gain us
a lot of information but happen so rarely that they don't matter a whole lot.

The formula derived here manages to find the balance between these two
extremes.
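
One way to see that balance numerically (a small sketch of mine): the
surprisal log2(1/p) keeps growing as events get rarer, but the weighted term
p * log2(1/p) is what enters the expectation, and it peaks in between (at
p = 1/e) and vanishes at both extremes.

    import numpy as np

    for p in (0.999, 0.5, 1 / np.e, 0.1, 0.001):
        surprisal = np.log2(1 / p)
        print(f"p={p:.3f}  surprisal={surprisal:6.2f} bits  "
              f"weighted={p * surprisal:.4f}")
    # The weighted contribution is largest around p = 1/e (~0.53 bits) and
    # tiny for both near-certain and very rare events.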

~~~
mturmon
I would agree w/ your statement here. The entropy is the on-average (or,
expected) amount of information gained from seeing one "x".

------
jostmey
I love the article!

My only advice is to end with a list of maximum entropy distributions to
showcase the many applications of this theory. I often refer to such tables
when I have varying constraints and want the best choice for representing the
spread of the data.

See the table in
[https://en.wikipedia.org/wiki/Maximum_entropy_probability_di...](https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution)

~~~
yetanothermonk
Thank you very much! Great idea!!

------
Nesco
This approach can mislead people because, by design, it assumes that the
support is infinite and that the variance is finite, which is why it ends up
with a thin-tailed distribution in the first place.

Plus, as klodolph said, the choice to arbitrarily restrict your knowledge to
the mean and the variance as summary statistics is what leads to the Gaussian
distribution. Moreover, in practice, arbitrarily restricting your knowledge is
a violation of probability as a model of intuition, as shown by Jaynes.

------
abelaer
Logistic regression can actually also be interpreted as a maximum entropy
distribution after observing some 'training data'.
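
One way to check that connection numerically (a sketch of my own, not from the
comment): the unregularized logistic-regression MLE satisfies exactly the
max-entropy moment-matching condition, i.e. the model's expected feature
totals for the positive class equal the observed ones.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(5)

    # Toy binary classification data (bias column included in X).
    X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
    y = (X @ np.array([0.3, 1.0, -2.0]) + rng.logistic(size=200) > 0).astype(float)

    def negloglik(w):
        z = X @ w
        return np.sum(np.logaddexp(0, z) - y * z)   # -sum[y*z - log(1 + e^z)]

    w_hat = minimize(negloglik, np.zeros(3)).x
    p_hat = 1 / (1 + np.exp(-(X @ w_hat)))

    # Moment matching: expected features under the model vs. observed features
    # for the y=1 class -- these should agree closely at the MLE.
    print(X.T @ p_hat)
    print(X.T @ y)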

