
Gaussian distributions are monoids and why machine learning experts should care - jackpirate
http://izbicki.me/blog/gausian-distributions-are-monoids
======
dododo
all exponential family distributions may be written in a form that depends
on a set of fixed-dimension sufficient statistics
<https://en.wikipedia.org/wiki/Exponential_family>. these sufficient statistics
have the additive form described in this article (this is a consequence of
i.i.d. sampling). it is common to exploit this structure when implementing
efficient inference in, for example, mixture models.

if you combine this property with a bayesian analysis, and put a conjugate
prior on the parameters of an exponential family distribution, then the
posterior distribution and the marginal likelihood depend on the data only
through these sufficient statistics, and everything else is easily computed.
in this form, one of the sufficient statistics often has an interpretation as
a "pseudo-count": how many effective samples are encoded in your prior.

exponential family distributions include: poisson, exponential, bernoulli,
multinomial, gaussian, negative binomial, etc.
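
as a minimal sketch of the gaussian case (haskell; the names here are mine,
not the article's actual code): merging two data summaries is just adding
sufficient statistics.

    -- sufficient statistics for a 1-d gaussian: count, sum x, sum x^2
    -- (count kept as a Double so all the arithmetic stays uniform)
    data GaussStats = GaussStats { n, sx, sxx :: Double }
    
    instance Semigroup GaussStats where
      GaussStats n1 s1 q1 <> GaussStats n2 s2 q2 =
        GaussStats (n1 + n2) (s1 + s2) (q1 + q2)
    
    instance Monoid GaussStats where
      mempty = GaussStats 0 0 0
    
    fromSample :: Double -> GaussStats
    fromSample x = GaussStats 1 x (x * x)
    
    -- recover the parameters from the summary
    mean, variance :: GaussStats -> Double
    mean g = sx g / n g
    variance g = sxx g / n g - mean g ^ 2

e.g. foldMap fromSample xs summarizes a dataset, and partial summaries
computed on different machines combine with <> (associativity is what makes
the parallel/online tricks work).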

------
absherwin
That the sum of two Gaussians is a Gaussian can be found in any basic text on
statistics. That combining distributions, or calculating them in parallel or
online, is possible isn't new either. The Kalman filter, which was published
in 1961, is an example of an online algorithm for fitting a Gaussian.

This is an interesting exposition and places specific results in a more
general algebraic framework; however, the title suggests this is a
revolutionary discovery, which it isn't.

~~~
montecarl
I think you mean product. The sum of two Gaussians isn't a Gaussian.

~~~
yummyfajitas
I think he meant the sum of two gaussian random variables is a gaussian. I.e.,
if X and Y are both drawn from a gaussian distribution, then X+Y is too.

~~~
mturmon
Right. (You also need X and Y to be independent, or at least multivariate
Gaussian -- i.e., X and Y can both be Gaussian but (X,Y) can fail to be
Gaussian as a pair.)

The product of Gaussian random variables is certainly not Gaussian.
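
For the record, the precise statement is:

    X ~ N(mu1, s1^2), Y ~ N(mu2, s2^2), X and Y independent
      => X + Y ~ N(mu1 + mu2, s1^2 + s2^2)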

------
_delirium
A comment on the reddit version of this discussion points out that the
approach the article uses to combine Gaussians has numerical stability issues:
<http://www.reddit.com/r/programming/comments/13r2mh/gaussian_distributions_form_a_monoid_and_why/c76g6jo>
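
For contrast, the standard numerically stabler merge (in the style of Chan et
al.; the names below are illustrative, not the article's code) tracks the mean
and the sum of squared deviations rather than raw power sums, avoiding the
cancellation in sum(x^2)/n - mean^2:

    -- count, running mean, and sum of squared deviations from the mean
    data Stats = Stats { cnt :: Double, mu :: Double, m2 :: Double }
    
    -- merge two partial summaries without ever forming sum x^2
    -- (a real Monoid instance would need to special-case empty summaries)
    merge :: Stats -> Stats -> Stats
    merge (Stats na ma m2a) (Stats nb mb m2b) = Stats nAB mAB m2AB
      where
        nAB   = na + nb
        delta = mb - ma
        mAB   = ma + delta * nb / nAB
        m2AB  = m2a + m2b + delta * delta * na * nb / nAB
    
    variance :: Stats -> Double
    variance s = m2 s / cnt s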

------
jmount
Sorry, but this is an example of bringing in extraneous math and notation to
make something well understood seem mysterious. The fun statistics facts being
abused here are: 1) moment summaries (sum 1, sum x, sum x^2, ...) are easy to
manage; 2) many important summary statistics (count, mean, variance, kurtosis,
...) are easy functions of moments; 3) once you decide to think
parametrically, you can use the fact that many distributions are completely
determined by the first one, two, or three sufficient statistics. But since
all of the above can be clearly taught, it isn't as exciting.

~~~
jmount
Also, if we want to get all math-abusey (using graduate-level concepts to do
basic work): how about something like "moments are a co-product"? The
emphasized property in the article is that calculations on the raw data can be
done on the moments. So all the calculations of interest map through the
moment summaries (hence you can think of them as a co-product). Surely that is
more exotic than just saying "monoid." And "monoid" is kind of used up by the
crowd that says "free monoid" when all they mean is "set of all strings."

------
SatvikBeri
...so my knowledge of Abstract Algebra and Category Theory could actually have
practical applications if I learn Haskell?

Edit: I phrased this flippantly, but it's a serious question. My educational
background is in Math, specifically Abstract Algebra. Would learning Haskell
actually help me use those concepts for practical purposes?

~~~
gtani
Yes and no.

There are "issues" with those structures in GHC, i.e. how the
Functor/Applicative/Monad hierarchy evolved (Applicative was discovered a
little late):

<http://stackoverflow.com/questions/7595023/why-is-this-declaration-not-allowed-in-haskell>

<http://stackoverflow.com/questions/5730270/alternative-implementations-of-haskells-standard-library-type-classes>

But (and this is one of the beauties of Learn You a Haskell): learning these
structures (functor, applicative, monoid, through to arrows, duality, groups,
etc.) is a superb way to abstract the software analysis/development process,
and it is one of the winning features of languages with good type systems
(Haskell, the MLs incl. F#, Scala).

<http://learnyouahaskell.com/>

------
afc
Would somebody explain this step in his explanation?

<http://izbicki.me/blog/wp-content/plugins/optimized-latex/image.php?image=tex_240ba03b0952b1e7d16eb8570d9d1ee5.png>

How does he go from (a - b)^2 to a^2 - b^2? What am I missing?

~~~
jgeralnik
That's the shortcut formula for variance, which follows from the definition.
Look here: <http://en.wikipedia.org/wiki/Variance#Definition>
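
Concretely, with mu = E[X], expanding the square and using linearity of
expectation:

    Var(X) = E[(X - mu)^2]
           = E[X^2] - 2*mu*E[X] + mu^2
           = E[X^2] - mu^2

so it's not the (false) identity (a - b)^2 = a^2 - b^2; the cross term
collapses because E[X] = mu.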

------
mturmon
Strikes me mainly as a trendy recasting of the statistical concept of a
"sufficient statistic" (<http://en.wikipedia.org/wiki/Sufficient_statistic>)

~~~
bazzargh
Which would be to miss the point that monoids are the general principle of
which sufficient statistics are a specific application.

Also, not sure mathematicians wore skinny jeans in 1904.

~~~
elliptic
Surely the burden is squarely on the author to show what the generalization
(in this context) offers that the specific concept of sufficient statistics
does not? In other words, how else can a statistician or ML expert use this
foreign "monoid" concept to improve or better understand parameter estimation?

~~~
bazzargh
"show what the generalization (in this context) offers that the specific
concept ... does not"

What does group theory tell me in the context of 1+1 that knowing the answer
is 2 doesn't? Generalizations are useful because they apply to more than one
context.

That said, the interesting connection between two areas of mathematics is
what I took from the article, but I'd agree that his title (well, subtitle) is
way overblown: his argument for "why ML experts should care" seems to boil
down to speed, but the Haskell statistics package itself contains faster
algorithms that are marked unsafe, and he hasn't demonstrated safety.

------
noelwelsh
As I recall, the Gaussian is the only distribution that has the desired
property (i.e. forms a monoid). While the Gaussian is important, it is far
from the only distribution of interest in machine learning. I would like to
hear what the author has to say about handling, for example, the beta
distribution.

In summary: interesting idea, but I'm not sure it applies outside a very
narrow domain.

~~~
dododo
i wonder if you are thinking of another monoid property of a gaussian?

suppose we have:

    x ~ N(0,1)
    y|x ~ N(x, 1)

then we have:

    y ~ N(0, 2)

i.e., gaussians are closed under marginalization.
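
(concretely, by the law of total variance:
Var(y) = E[Var(y|x)] + Var(E[y|x]) = 1 + 1 = 2.)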

however, i believe gaussians are not the only distribution with this property
either: i think this property corresponds to the stable family of
distributions: <https://en.wikipedia.org/wiki/Stable_distributions>

~~~
mturmon
Stable distributions are something else, not related to marginals or
conditioning. They come up when studying laws of averages.

Gaussian distributions belong to the class of stable distributions, though,
because of another of their properties: independent Gaussians, when added, are
again Gaussian.

~~~
dododo
the particular property of stable distributions i was thinking of is "closure
under convolution", which is the above marginalisation (i believe?).

infinite divisibility is (yet another) property of gaussians!

~~~
mturmon
Nope. Closure under convolution is the same as closure under summation of the
associated random variables, which is the defining property of stable
distributions. This is explained in the first paragraphs of the wikipedia page
you linked to ;-)

Closure under marginalization is something else.

It so happens that the functional form of the gaussian satisfies both, but the
two properties are not at all the same.

    P1: X, Y independent gaussian => Z = X+Y gaussian
    P2: (X, Y) jointly gaussian   => X | Y gaussian

------
PaulHoule
This reminds me of the graduate level physics class where we used the
renormalization group to prove the central limit theorem.

------
fudged71
Very interesting to be able to subtract inputs from a trained model rather
than starting the learning over again. I had been wondering if this were
possible.
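
As a sketch of how that subtraction can work on the running sufficient
statistics (hypothetical names, not the article's code; the idea is that the
summaries form a group, i.e. a monoid with inverses):

    data GaussStats = GaussStats { n, sx, sxx :: Double }
    
    insert, delete :: Double -> GaussStats -> GaussStats
    insert x (GaussStats c s q) = GaussStats (c + 1) (s + x) (q + x * x)
    -- removing a previously inserted point is the exact inverse update,
    -- so there's no need to retrain on the remaining data
    delete x (GaussStats c s q) = GaussStats (c - 1) (s - x) (q - x * x)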

