
Gaussian distributions form a monoid - droque
https://izbicki.me/blog/gausian-distributions-are-monoids
======
simonbyrne
For anyone else who didn't understand what the author meant by a binary
operation on Gaussian distributions: he is really talking about combining
sufficient statistics. In fact, this property is true for any exponential
family (Gamma, Poisson, etc).
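A minimal Haskell sketch of the idea (toy names of my own, not the article's
HLearn types): for the Gaussian, the sufficient statistics are the count, the
sum, and the sum of squares, and two batches of data combine by componentwise
addition, which is exactly the monoid operation.

    -- Toy sketch, not the article's HLearn code: a Gaussian's sufficient
    -- statistics (count, sum, sum of squares) combine componentwise.
    data GaussStats = GaussStats Double Double Double   -- n, sum x, sum x^2

    instance Semigroup GaussStats where
        GaussStats n1 s1 q1 <> GaussStats n2 s2 q2 =
            GaussStats (n1 + n2) (s1 + s2) (q1 + q2)

    instance Monoid GaussStats where
        mempty = GaussStats 0 0 0

    fromSample :: [Double] -> GaussStats
    fromSample xs =
        GaussStats (fromIntegral (length xs)) (sum xs) (sum (map (^ 2) xs))

The same trick works for the rest of the exponential family: keep whatever
finite summary the family's sufficient statistic calls for, and `<>` is just
addition of summaries.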

------
graycat
> "Gaussians are ubiquitous in learning algorithms because they accurately
> describe most data."

Amazing!

Uh, might want to reconsider and check that.

~~~
morpher
This is especially true if you have multimodal distributions (such as the
income datasets in the article). Although it's true that there are simple
algebraic properties that let you calculate the mean and variance of the total
population given the same statistics for its sub-populations, it often isn't
the case that the resulting single Gaussian will be a good fit to the data.
That being said, this is a useful property for parallelizing gaussian fits.

~~~
graycat
> That being said, this is a useful property for parallelizing gaussian fits.

Might you be able to clarify this sentence? I have zero idea what might be
meant:

> this is a useful property

What is the antecedent of "this"? That is, in this phrase, what does "this"
refer to?

> gaussian fits

What is a _gaussian fit_? I have no idea. I'm comfortable with the Lindeberg-
Feller version of the central limit theorem, the weak and strong laws of large
numbers, martingale theory, the martingale proof of the strong law of large
numbers, the Radon-Nikodym theorem, and the fact that sample mean and variance
are _sufficient_ statistics for the Gaussian distribution, but, still, I can't
even guess what a _gaussian fit_ is.

> parallelizing

I can guess that what is meant by "parallelizing" is the computer software
approach of having one program try to get some work done faster by starting
several _threads_ or _tasks_ in one or several processor _cores_, processors,
or computers. Okay. But what is it about "gaussian fits" that might commonly
call for "parallelizing"?

~~~
mrow84
Fitting a distribution to data is pretty common parlance in my experience, and
there is even a wikipedia article with a relevant name [0].

I presume that the parallelisation comment refers to the point made in the
article: the calculation of means and variances can be parallelised, so large
datasets can be dealt with efficiently.
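Roughly, something like this Haskell sketch (toy code of my own, not the
article's): summarize each chunk of data independently, then merge the
summaries with a bit of algebra.

    -- Rough sketch: pooled size/mean/variance of two sub-populations.
    data Summary = Summary Double Double Double   -- size n, mean m, variance v

    merge :: Summary -> Summary -> Summary
    merge (Summary n1 m1 v1) (Summary n2 m2 v2) = Summary n m v
      where
        n = n1 + n2
        m = (n1 * m1 + n2 * m2) / n
        -- each part's E[X^2] is v + m^2; recombine, subtract the pooled mean^2
        v = (n1 * (v1 + m1 * m1) + n2 * (v2 + m2 * m2)) / n - m * m

    summarize :: [Double] -> Summary
    summarize xs = Summary n m v
      where
        n = fromIntegral (length xs)
        m = sum xs / n
        v = sum [(x - m) * (x - m) | x <- xs] / n

Each chunk can then be summarized on a separate core and the results combined
with `foldr1 merge`.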

Is there something else you are missing?

[0]
[http://en.wikipedia.org/wiki/Distribution_fitting](http://en.wikipedia.org/wiki/Distribution_fitting)

~~~
graycat
Okay, from the Wikipedia article, _distribution fitting_ appears to be what I
feared it might be.

I'd never do anything like that and would advise others not to, either.

Why? Because it is not the least bit clear just what the heck you get.

Next, likely you should not _fit_ at all. Instead, if you want to use some
distribution with parameters, e.g., Gaussian, uniform, or exponential, then
just estimate the parameters, not the _distribution_.

E.g., if you know that the data is independent, identically distributed
Gaussian, then take the sample mean and sample variance and let those be the
two parameters of the Gaussian distribution. In that case, you will know that
the expectation and variance of the distribution are the same as in your data,
and that's good.
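In code that estimation is nearly nothing (a quick Haskell sketch of my own;
dividing by n matches the data's variance exactly, while n - 1 gives the
unbiased version):

    -- Sketch: estimate the two Gaussian parameters directly from the data.
    estimateParams :: [Double] -> (Double, Double)   -- (mean, variance)
    estimateParams xs = (m, v)
      where
        n = fromIntegral (length xs)
        m = sum xs / n
        v = sum [(x - m) * (x - m) | x <- xs] / n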

That sample mean and variance are _sufficient_ statistics for the Gaussian is
also a biggie.

And look into the situation for the rest of the _exponential family_.

See also the famous

Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical
Statistics', Volume 17, Number 1, pages 34-43, 1946.

If you want to find the variance of a large data set, then how much accuracy
do you want? Generally, the sample variance from a few hundred numbers will be
okay, and then you don't need to consider execution on parallel computer
hardware.

R. Hamming once wrote, "The purpose of computing is insight, not numbers."

Along that line, finding sample mean and variance of a huge data set promises
little or no more "insight" than just sample mean and variance of an
appropriately large sample. Of course, we are assuming that the data is
independent and identically distributed so that a good sample is easy to find.

~~~
GFK_of_xmaspast
I don't know what you inferred from the wiki article, but of course "to fit a
gaussian" is to find the parameters describing it, in this case mean and
variance.

Look at the "Techniques" section; the first three are:

"Parametric methods, by which the parameters of the distribution are
calculated from the data series.[2] The parametric methods are: method of
moments, method of L-moments,[3] maximum likelihood method[4]"

which, if you work them out, give exactly what you'd expect.
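For the Gaussian in particular, "what you'd expect" is (sketching the standard
textbook result, not quoting the article): both the method of moments and
maximum likelihood land on

    m_hat = (1/N) sum_i x_i        s_hat^2 = (1/N) sum_i (x_i - m_hat)^2

i.e. the sample mean and the 1/N-normalized sample variance; the usual
unbiased estimator just swaps 1/N for 1/(N-1) in the variance.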

------
PaulHoule
Back when I was a grad student in physics we used similar thinking to prove
the central limit theorem w/ the renormalization group.

------
drostie
So the basic math here looks like this: in classical probability theory, when
you've got multiple random variables `X, Y` they have a joint probability
`f(x, y)` such that the probability that (X, Y) is in the square (x ± dx/2, y
± dy/2) is `f(x, y) dx dy`. Independent distributions have probabilities which
factor: `f(x, y) = g(x) * h(y)`, to fit the general rule that independent
discrete events have probabilities which multiply. To calculate probabilities
for a function like `Z = X + Y` you have to invent a new probability-
distribution function `p(z) = ∫ f(x, z - x) dx`, which for independent events
appears as a _convolution_ of the two probability-densities.

All of the useful statistics for a probability distribution can be gleaned
from the probability-density function by taking the Fourier transform and then
(when it's not too bad) the (complex) natural logarithm -- or, when it's
easier, just `K(s) = log(E[exp(s X)])` (having `s X` rather than `i s X` in
the exponent). These are called _cumulant-generating functions_ or CGFs.

If you have a sum of two independent random variables, the CGF of the sum is
the sum of their CGFs. This means that the Taylor coefficients of the CGFs
simply add together. Take derivatives of the CGF at zero to get the
"cumulants", which are pseudo-linear (additive over independent random
variables, and scaling as `c_n(k X) = k^n c_n(X)` for the n-th cumulant). So
these cumulants become really convenient for characterizing the distribution
as a whole: in fact, if you have all of them you can recover the original
distribution.

Now if you describe your distribution in terms of its cumulants, you have a
special monoid:

    
    
        -- Combining two independent distributions adds their cumulants termwise.
        data Distr n = Distr [n]
        instance (Num n) => Semigroup (Distr n) where
            Distr xs <> Distr ys = Distr (zipWith (+) xs ys)
        instance (Num n) => Monoid (Distr n) where
            mempty = Distr (repeat 0)
    

Gaussian distributions in particular have a _form_ for their CGF which is
preserved by this monoid operation, namely:

    
    
        if X ~ Gauss(m, s^2) then ln(E(exp(k X))) = m k + s^2 k^2 / 2
    

(Note that the constant term has to be 0, since at k = 0 we have ln(E(1)) =
ln(1) = 0.) Another way to state this is "the Fourier transform of a Gaussian
is another Gaussian."
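
Concretely, for independent X ~ Gauss(m1, s1^2) and Y ~ Gauss(m2, s2^2),
adding the two CGFs gives

    (m1 k + s1^2 k^2 / 2) + (m2 k + s2^2 k^2 / 2) = (m1 + m2) k + (s1^2 + s2^2) k^2 / 2

which is again of the Gaussian form, so X + Y ~ Gauss(m1 + m2, s1^2 + s2^2):
means and variances just add.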

So that is super-simple, and you can already see the hints of the central
limit theorem emerging: the mean of N independent, identically-distributed
variables X_i will have a CGF related to the coefficients c_m of the original
CGF by:

    
    
        ln(E(exp(k * sum_i X_i / N))) = N sum_m c_m (k/N)^m = sum_m c_m k^m / N^(m-1)
    

so the higher-order terms of the Taylor expansion get attenuated by successive
powers of N: you can approximate the CGF by truncating it.

So, the set of all distributions forms a monoid under convolution if you allow
the Dirac delta-function to act as `mempty` -- in the cumulant representation,
this appears as the termwise-summing monoid. Gaussian distributions are a sub-
monoid of this larger monoid, and they are not the only one: we could easily
go out to 3, 4, or 5 cumulants before setting the rest to 0 and find a similar
sub-family.
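
In code, the Gaussian sub-monoid can be sketched like this (a toy version of
my own, not the HLearn code from the article):

    -- Toy sketch of the Gaussian sub-monoid under convolution: means and
    -- variances add, and the Dirac delta at 0 (mean 0, variance 0) is mempty.
    data Gaussian = Gaussian Double Double   -- mean, variance
        deriving Show

    instance Semigroup Gaussian where
        Gaussian m1 v1 <> Gaussian m2 v2 = Gaussian (m1 + m2) (v1 + v2)

    instance Monoid Gaussian where
        mempty = Gaussian 0 0

e.g. `mconcat [Gaussian 1 2, Gaussian 3 4]` is `Gaussian 4.0 6.0`, the
distribution of the sum of the two independent variables.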

------
SamReidHughes
That diagram of a Gaussian distribution is... something special.

