
Deriving the Normal Distribution - kfrankc
https://kfrankc.com/posts/2018/10/19/normal-dist-derivation
======
zawerf
It's been years since I took probability, but why start from this particular
set of assumptions?

You've shown that the correct form and constant factors fall out from
assuming rotational invariance and independent components. But why is this
particular set of assumptions intuitively what we call the "normal
distribution"?

The same distribution should fall out starting from other, more useful
equivalent definitions, right? (e.g., the maximum entropy distribution for a
given mean/variance, or the limiting distribution under CLT assumptions, etc.)

~~~
BenoitEssiambre
I agree; to me, the maximum entropy argument is what most convincingly makes
the Gaussian one of the most fundamental distributions.

Click on chapter 7 here for a good discussion:
[http://omega.albany.edu:8008/JaynesBook.html](http://omega.albany.edu:8008/JaynesBook.html)
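
For reference, the maxent calculation itself is short. A rough sketch
(Lagrange multipliers, constraining normalization, mean, and variance):

    maximize   h(f) = -Integral f(x) ln f(x) dx
    subject to Integral f = 1, Integral x f = mu,
               Integral (x - mu)^2 f = sigma^2
    
    stationarity: -ln f(x) - 1 + a + b x + c (x - mu)^2 = 0
    => f(x) = exp(a - 1 + b x + c (x - mu)^2)
    
    normalizability forces c < 0; solving for the constraints gives
    f(x) = 1/sqrt(2 pi sigma^2) * exp(-(x - mu)^2 / (2 sigma^2))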

~~~
palmy
Thank you for the link to Jaynes' book! Really nice to see the different
approaches.

I'm intrigued by your comment on maximum entropy, as I personally struggle
with the maximum entropy derivation due to the fact that we're using
differential entropy ("continuous" entropy) to derive the Gaussian under
constraints on the first and second moments. Differential entropy does not
satisfy the same properties as entropy for a discrete distribution, some of
which are the very properties that motivated entropy as a measure of
information. Jaynes himself wrote a paper on this topic of continuous entropy
in the 60s (can dig out the reference in the morning). Even ignoring this, I
also struggle a bit with "we're only constraining the first and second
moments". Why exactly the first two? Why not the first three, etc.? One
could say it's motivated by the fact that the Gaussian is the only
distribution with finitely many nonzero cumulants, but that seems a bit
handwavey?

Would genuinely appreciate some input here, as the Principle of Maximum
Entropy is something I have a bit of trouble coming to terms with for the
reasons described above (in general, mainly because the choice of constraints
is arbitrary).
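
(To make the first worry concrete: a small numerical illustration of one way
differential entropy departs from the discrete case, namely that it can go
negative:)

    # Differential entropy of N(0, sigma^2) is 0.5*ln(2*pi*e*sigma^2),
    # which goes negative for small sigma -- discrete Shannon entropy
    # can't do that, and it also shifts under a rescaling x -> a*x.
    import numpy as np
    
    for sigma in [1.0, 0.5, 0.1, 0.01]:
        h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
        print(f"sigma={sigma:5.2f}  h={h:+.3f} nats")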

~~~
lellotope
There are at least two issues here: one is the continuous-discrete issue and
the other is the moment issue.

As for the moment issue, the short story is that once you get into third or
fourth moments, there isn't a general maximum entropy distribution anymore,
except for some special, idiosyncratic cases with three moments, I think. So
the normal is, in some ways, the most conservative distribution you can have
in a general, unspecified-scenario sense. You can specify more moments, but
then there isn't a single maxent distribution that applies across all third-
and fourth-moment scenarios in the way one does for the first two moments.
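
To sketch why an odd-moment constraint breaks down: the maxent density would
have to take the form

    f(x) = exp(l0 + l1*x + l2*x^2 + l3*x^3)

but with l3 != 0 the exponent blows up to +infinity on one side, so f can't
be normalized on the whole real line. Adding an x^4 term with a negative
coefficient can rescue normalizability, but then the existence of multipliers
matching the constraints depends on the particular moment values, and there
is no closed form.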

As for the continuous versus discrete thing, there's some caution that's
warranted, but a lot of the maxent principles apply, and there are similar,
closely related principles (minimum description length, which has been shown
to be equivalent to maximum entropy inference in a sense) that generalize to
the continuous case. If you think of everything as discretized (as is the
case with machine representation), there's some work showing that the
discretized and continuous cases are related up to a constant (doi:
10.1109/TIT.2004.836702).
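
(Concretely, the relation is the standard H(X_Delta) + ln(Delta) -> h(X) as
the bin width Delta -> 0. A quick numerical check on a standard normal:)

    # Discretize N(0,1) with bin width delta; the discrete entropy plus
    # ln(delta) should approach the differential entropy 0.5*ln(2*pi*e).
    import numpy as np
    
    h_true = 0.5 * np.log(2 * np.pi * np.e)
    for delta in [1.0, 0.1, 0.01]:
        edges = np.arange(-10, 10 + delta, delta)
        x = 0.5 * (edges[:-1] + edges[1:])            # bin midpoints
        p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) * delta
        p = p[p > 0] / p.sum()                        # bin probabilities
        H = -(p * np.log(p)).sum()                    # discrete entropy, nats
        print(f"delta={delta}: H + ln(delta) = {H + np.log(delta):.4f} "
              f"(h = {h_true:.4f})")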

I realize this is a bit hand-wavy but it is a HN post.

~~~
palmy
Thank you, I really appreciate the response. This was useful.

I do see the reasoning for choosing the normal due to it being the only
distribution with finitely many non-zero cumulants, and thus, as you nicely
pointed out, constraints on a finite number of higher-order moments will not
give a unique maxent distribution.

But, due to the issues we've now mentioned, I find myself a bit uneasy about
maxent as a derivation of, and/or as an explanation of the ubiquity of, the
normal distribution. Thus I find myself more comfortable with some of the
other derivations demonstrated by Jaynes.

And thank you for the paper reference; will have a proper look at it sometime.
It might be related to

------
jesuslop
G. C. Rota has an analysis of justifications of the univariate normal
distribution in his Fubini lectures, problem 7 (DOI:
[https://doi.org/10.1007/978-88-470-2107-5_5](https://doi.org/10.1007/978-88-470-2107-5_5));
a great place to get a sense of borderline topics in the field.

------
gpsx
As I understand it (or maybe I should say "in my opinion"), the magic of the
Gaussian distribution lies in the two assumptions you make. You have a
rotationally invariant answer (X and Y are related), but you are assuming the
distributions in X and Y are independent. And these are valid things to assume.

The Gaussian distribution is not a particularly good representation of most
real problems, in the sense that the probability of large errors decreases far
too rapidly. Maybe there are ideal cases you can say are Gaussian, but in any
real problem there are some kind of outliers. We go into a calculation
assuming we have Gaussian noise, but really we don't, and we have to add
additional logic to handle these "outlier" cases.

The thing that is magic is that the Gaussian distribution factorizes. If we
are evolving the state of a system after taking a measurement, as long as the
system had Gaussian errors and the measurement has a Gaussian error, the
system after the measurement will still be Gaussian. We can parameterize our
errors with two numbers, the center and the spread.
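
A minimal sketch of that update in the scalar case (the product of two
Gaussians is Gaussian, with precision-weighted mean, so two numbers really
do suffice):

    # Scalar Gaussian measurement update: prior N(m0, v0), measurement z
    # with noise variance r. The posterior is again Gaussian, so only
    # (mean, variance) ever need to be stored.
    def gaussian_update(m0, v0, z, r):
        k = v0 / (v0 + r)        # gain: how much to trust the measurement
        m1 = m0 + k * (z - m0)   # posterior mean
        v1 = (1 - k) * v0        # posterior variance (always shrinks)
        return m1, v1
    
    # Vague prior, fairly precise measurement at z = 2.0:
    print(gaussian_update(0.0, 10.0, 2.0, 1.0))  # -> (1.8181..., 0.9090...)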

If we didn't have this factorization, the distribution would change shape
after each measurement. We would have to keep track of far more information,
the amount of which grows geometrically with the number of variables we have.
It is just intractable.

So as I see it at least, we use this distribution because we can, more so than
because it is the correct one. (But, of course, it also still does work pretty
well!)

------
abetusk
Interesting. The assumptions are that the normal distribution is two-
dimensional, that it's rotationally invariant, and that the X and Y
coordinates are statistically independent. From this, they find that the
polar form phi(r) is proportional to the Cartesian one, f(x) * f(y): since
phi(r) = phi(sqrt(x^2 + y^2)), it must equal f(x) * f(y).

More succinctly:

    
    
        phi(r) = phi(sqrt(x^2 + y^2)) = f(x) * f(y)
        phi(sqrt(x^2 + 0)) = f(x) * f(0) = lambda * f(x)
        -> phi(x) = lambda * f(x)
        -> phi(r) = lambda * f(sqrt(x^2 + y^2)) = f(x) * f(y)
    

The last line follows from rotational invariance and statistical
independence of the two axes. I haven't followed the rest, but I assume
(maybe with some other minor assumptions?) that the equation lambda *
f(sqrt(x^2 + y^2)) = f(x) * f(y) uniquely determines the Gaussian.
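
(Sketching it out, it does seem to, given continuity: taking logs turns it
into Cauchy's functional equation. Roughly:

    let g(x) = ln(f(x) / f(0)), with lambda = f(0), so g(0) = 0
    
    lambda * f(sqrt(x^2 + y^2)) = f(x) * f(y)
    => g(sqrt(x^2 + y^2)) = g(x) + g(y)
    
    substitute u = x^2, v = y^2, h(u) = g(sqrt(u)):
      h(u + v) = h(u) + h(v)      (Cauchy's functional equation)
    
    continuous solutions: h(u) = a*u, i.e. g(x) = a*x^2
    => f(x) = f(0) * exp(a*x^2), and normalizability forces a < 0

which is the Gaussian up to normalization.)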

I've long struggled to find a clear explanation of where the normal formula
comes from. The best description I've seen derives it as the limiting
distribution of sums of uniform distributions on a unit interval, say. The
sum of independent and identically distributed random variables is a
convolution, which can be 'de-convolved' by taking the Fourier transform.
After the Fourier transform, the sum of the random variables turns into a
product, which can easily be approximated. There's extra work involved in
proving that the Fourier transform of a Gaussian is itself a Gaussian, and
some other technicalities (not to mention this is only for identical uniform
distributions), but this seems much better motivated to me than any other
description I've heard, including this one.
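
(That construction is also easy to check numerically; a rough sketch,
repeatedly convolving a Uniform(0,1) density on a grid and comparing with the
Gaussian of matching mean and variance:)

    # Convolve the Uniform(0,1) density with itself n times and compare
    # the resulting density of the n-fold sum with the matching Gaussian.
    import numpy as np
    
    dx = 0.001
    u = np.ones(int(1 / dx))            # Uniform(0,1) density on a grid
    dens = u.copy()
    n = 8                               # sum of 8 iid uniforms
    for _ in range(n - 1):
        dens = np.convolve(dens, u) * dx
    
    x = np.arange(dens.size) * dx       # support of the sum: [0, n]
    mean, var = n * 0.5, n / 12.0       # moments of the sum
    gauss = np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    print("max abs difference:", np.abs(dens - gauss).max())  # already small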

As a benefit, if I remember correctly, the same trick works to derive the
basics of Levy stable distributions as well.

------
alimw
A more enlightening derivation here (ex. 12.11):
[http://pages.physics.cornell.edu/~sethna/StatMech/EntropyOrd...](http://pages.physics.cornell.edu/~sethna/StatMech/EntropyOrderParametersComplexity.pdf#page=311)

------
debbiedowner
Not sure "derive" is the right word here :)

