
Gaussian Processes, not quite for dummies - danso
https://yugeten.github.io/posts/2019/09/GP/
======
bhl
Implementing Gaussian Processes in Python is a fun way to exercise your
programming and mathematical skills.

I was working on a course project that used GPs for regression, with the
inputs taken from the output of a CNN. We were trying to predict crop yield
from satellite imagery, and GPs helped us incorporate spatio-temporal
locality: think of predictions for a neighboring county, or for an image from
a previous year.

One issue we ran into while using the model from the original authors [1] was
a memory constraint: we were building on top of Google Colab instead of a
private AWS instance, and predicting on a larger dataset.

The Gaussian kernel matrix grows as O(n^2) in the number of data points, and
we had up to 22,000 datapoints to use. To get around the memory constraint, we
had to re-implement the GP from scratch, carefully avoiding the memory spikes
that came with computing the kernel and the mean predictions.
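
To illustrate the idea, here is a minimal sketch (not our actual code; the RBF
kernel, the chunk size, and a precomputed alpha = (K + sigma^2 I)^{-1} y are
assumptions):

    import numpy as np

    def rbf_kernel(A, B, ls=1.0):
        # ||a - b||^2 = |a|^2 - 2 a.b + |b|^2, computed without an (n, m, d) temporary
        sq = (A**2).sum(1)[:, None] - 2.0 * A @ B.T + (B**2).sum(1)[None, :]
        return np.exp(-0.5 * np.maximum(sq, 0.0) / ls**2)

    def predict_mean(X_test, X_train, alpha, chunk=1024):
        # alpha = (K + sigma^2 I)^{-1} y, computed once up front.
        # Walk over the test set in chunks so the full n_test x n_train
        # cross-kernel is never materialized at once.
        out = np.empty(len(X_test))
        for i in range(0, len(X_test), chunk):
            K_chunk = rbf_kernel(X_test[i:i + chunk], X_train)
            out[i:i + chunk] = K_chunk @ alpha
        return out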

It was a two-fold process. For understanding the exact equations, I would
highly recommend the resource mentioned in the article [2], Chapter 2 of
Gaussian Processes for Machine Learning. For a lecture video, I would
recommend [3], from an ML course at Cornell.

As for implementing GPs efficiently, this is still a challenge: numpy has a
bunch of helpful matrix-vector functions, but there isn't a way to exploit the
symmetry of the kernel matrix and use half the required memory.
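
For concreteness, the heart of GPML Chapter 2 (Algorithm 2.1) fits in a few
lines of numpy/scipy. A sketch, assuming precomputed kernel blocks K, K_s,
k_ss and a noise level sigma_n:

    import numpy as np
    from scipy.linalg import cholesky, cho_solve

    def gp_predict(K, K_s, k_ss, y, sigma_n=0.1):
        # K: n x n train kernel, K_s: n x m train/test cross-kernel,
        # k_ss: length-m test self-covariances, y: length-n targets.
        L = cholesky(K + sigma_n**2 * np.eye(len(y)), lower=True)
        alpha = cho_solve((L, True), y)   # (K + sigma_n^2 I)^{-1} y
        mean = K_s.T @ alpha              # predictive mean
        v = np.linalg.solve(L, K_s)       # v = L^{-1} K_s
        var = k_ss - (v**2).sum(axis=0)   # predictive variances
        return mean, var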

[1] https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14435/14067

[2] http://www.gaussianprocess.org/gpml/chapters/RW.pdf

[3] https://www.youtube.com/watch?v=R-NUdqxKjos

~~~
nestorD
Another easy solution, if your kernel is appropriate (one that goes to 0 at
infinity, such as a Gaussian, Matérn, or exponential kernel), is to set a
threshold below which the covariance is set to zero, and to use a sparse
representation for the covariance matrix. (You can be even more extreme and
fix the number of non-zero entries, giving you fixed memory use.)
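
A sketch of the thresholding step with scipy (the dense intermediate is kept
for brevity; a real implementation would build the matrix in blocks or via a
neighbor search):

    import numpy as np
    from scipy import sparse

    def sparse_rbf(X, ls=1.0, threshold=1e-4):
        sq = (X**2).sum(1)[:, None] - 2.0 * X @ X.T + (X**2).sum(1)[None, :]
        K = np.exp(-0.5 * np.maximum(sq, 0.0) / ls**2)
        K[K < threshold] = 0.0  # the kernel decays to 0, so distant pairs drop out
        return sparse.csc_matrix(K)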

~~~
mrfox321
This does not solve the issue. You still need to invert this sparse matrix for
inference. The inverse of a sparse covariance matrix is dense.

~~~
nestorD
The raw mathematical formula uses an inversion, but what you truly need is to
solve a linear system (which is why most implementations store a Cholesky
decomposition of the covariance matrix), and that can be done without turning
the matrix into a dense one.
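
For example, with scipy's sparse LU on a toy stand-in matrix (a true sparse
Cholesky, e.g. CHOLMOD via scikit-sparse, would exploit the symmetry better):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import splu

    n = 2000
    # Stand-in for a thresholded, regularized covariance matrix.
    K = sparse.random(n, n, density=1e-3, format="csc", random_state=0)
    K = (K + K.T) / 2 + sparse.identity(n, format="csc") * 2.0
    y = np.random.default_rng(0).standard_normal(n)

    lu = splu(K.tocsc())  # factor once; K is never densified or inverted
    alpha = lu.solve(y)   # solves K alpha = y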

~~~
clarkeni
What would you use to solve the system? Iterative methods, or even gradient
descent?

~~~
plus
Something like MINRES, GMRES, or BiCGSTAB. Correct me if I'm wrong, but the
covariance matrix in GPR should be diagonally dominant, so the diagonal of the
matrix can be used as a somewhat effective preconditioner.
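
A sketch with scipy, using a toy SPD matrix as a stand-in for K + sigma^2 I
(since the matrix is symmetric positive definite, plain CG applies as well):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import cg, LinearOperator

    n = 2000
    A = sparse.random(n, n, density=1e-3, format="csr", random_state=0)
    A = A @ A.T + sparse.identity(n, format="csr") * 1.01  # SPD by construction
    b = np.random.default_rng(0).standard_normal(n)

    d = A.diagonal()
    M = LinearOperator((n, n), matvec=lambda v: v / d)  # Jacobi preconditioner

    x, info = cg(A, b, M=M)
    print(info)  # 0 means converged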

------
Wookai
I would highly recommend David Duvenaud's PhD thesis [1] for anybody
interested in Gaussian Processes. I find his chapter [2] on expressing
structure with kernels super interesting and quite intuitive. It's also one of
the PhD theses with the best-looking graphics I've read!

[1] http://www.cs.toronto.edu/~duvenaud/thesis.pdf

[2] https://raw.githubusercontent.com/duvenaud/phd-thesis/master/kernels.pdf

------
graycat
The OP needs some correction:

The title is "Gaussian Processes". Okay, so, we're considering "processes". A
_Gaussian process_ is a special case of a _stochastic process_.

Here is the usual, first definition of a stochastic process: For the set of
real numbers R and for each t in R (t usually regarded as time) we have a
random variable X_t (TeX notation for subscripts) where the values of X_t are
in R. That is, X_t is a _real valued_ random variable. So, the stochastic
process is, say,

{X_t | t in R}

This stochastic process is a _Gaussian process_ if for each t in R, X_t has a
Gaussian distribution.

Commonly we continue and assume that both the expectation and the variance of
X_t are the same for each t in R.

In addition, for any s in R, we might assume that the covariance

Cov(X_t, X_{s + t})

(more TeX notation) is a constant for all t.
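
In display form, these assumptions (weak, or second-order, stationarity) read:

    E[X_t] = \mu, \qquad \mathrm{Var}(X_t) = \sigma^2, \qquad \mathrm{Cov}(X_t, X_{t+s}) = C(s) \quad \text{for all } t, s \in R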

With these assumptions we can do a lot in power spectral estimation and
extrapolation.

We can apply the central limit theorem to argue that an _independent
increments_ process is Gaussian.

For more, see, e.g.,

I. A. Ibragimov and Y. A. Rozanov, _Gaussian Random Processes_, ISBN
0-387-90302-X, Springer-Verlag, New York, 1978.

For one more: if real-valued random variables X and Y are independent, then
they are uncorrelated. Net, independence is a much stronger assumption than
being uncorrelated.

But, amazingly, if X and Y are jointly Gaussian and uncorrelated, then they
are independent.
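
You can see why from the bivariate density with standard marginals: setting
the correlation \rho = 0 makes it factor,

    p(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)} \right)
    \xrightarrow{\ \rho = 0\ } \frac{e^{-x^2/2}}{\sqrt{2\pi}} \cdot \frac{e^{-y^2/2}}{\sqrt{2\pi}} = p(x)\, p(y).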

For one more, for the example of curve fitting and interpolation, I'd suggest
not polynomials but splines or least squares splines.

For one more, for Gaussian X_t for finitely many t, the sample mean and sample
variance are _sufficient_ statistics. That means that in any statistical work,
we can keep the sample mean and variance, throw away the X_t, and in principle
do just as well. The foundational paper on sufficient statistics was by Halmos
and Savage back in the 1940s. The concept of sufficient statistics is an
application of the Radon-Nikodym theorem. E. Dynkin showed that the Gaussian
assumption is critical: with small changes the result is false.

~~~
CrazyStat
> This stochastic process is a Gaussian process if for each t in R, X_t has a
> Gaussian distribution.

This is not quite strong enough. You need (X_t1, X_t2, ..., X_tn) to jointly
have a multivariate Gaussian distribution for every finite (t1, ..., tn).
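
In symbols: for every n and every t_1, ..., t_n,

    (X_{t_1}, \ldots, X_{t_n}) \sim \mathcal{N}(\mu, \Sigma)

for some mean vector \mu and covariance matrix \Sigma; equivalently, every
finite linear combination \sum_i a_i X_{t_i} is univariate Gaussian.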

~~~
graycat
Usually! I gave a _weak_ definition!

Also, for correlation, we need to assume the expectations exist.

------
nicklovescode
Great work Yuge!

This post from Distill may also be worth checking out for those interested in
Gaussian Processes:

https://distill.pub/2019/visual-exploration-gaussian-processes/

------
leni536
> if all k components are Gaussian random variables, then X must be
> multivariate Gaussian

They need to be independent as well, or there needs to be some other
criterion. Otherwise I have counterexamples.
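
For instance, the classic sign-flip construction, checked numerically (a
quick sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.standard_normal(n)
    w = rng.choice([-1.0, 1.0], size=n)  # random sign, independent of x
    y = w * x                            # y is also standard normal

    # Each marginal is Gaussian and Cov(x, y) = E[w] E[x^2] = 0 ...
    print(np.cov(x, y)[0, 1])            # ~0

    # ... but (x, y) is not jointly Gaussian: if it were, x + y would be
    # Gaussian too, yet here it equals 0 exactly half the time.
    print(np.mean(x + y == 0))           # ~0.5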

~~~
kgwgk
> They need to be independent as well

No, the sum of correlated Gaussians is also Gaussian. [Edit: or more
precisely, it can be, if they are jointly Gaussian; the point is that they
don't need to be uncorrelated.]

~~~
bluesign
They don't need to be independent, but they do need to be jointly Gaussian;
it is a very small edge case, I guess.

~~~
AstralStorm
You can check whether this holds by inspecting the covariance matrix. It can
be degenerate, or the variables may be correlated (thus not jointly normal),
and even independent uncorrelated Gaussian ones need not be.

https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Two_normally_distributed_random_variables_need_not_be_jointly_bivariate_normal

It's best to check using the second definition from:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Equivalent_definitions

This results in the BHEP test:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Multivariate_normality_tests

~~~
kgwgk
> You can check if this holds by inspecting the covariance matrix.

Just a clarification: if that's somehow correct, then it's not about the 2x2
covariance matrix that gives the linear relationship between two variables,
because two variables can be correlated and the joint distribution still be a
multivariate Gaussian. And two variables could be uncorrelated and not be
independent; the covariance matrix doesn't contain much information about the
joint distribution (unless joint normality is assumed).

~~~
AstralStorm
Correct, that's only useful for checking degeneracy. The test is much more
certain.

And it's bigger than 2x2; it depends on the number of dimensions you're
checking: 3 variables gives a 3x3 matrix, etc. For fitting as shown (and
GANs), it gets big really fast.

~~~
kgwgk
I don't understand what covariance matrix you're talking about.

Let's say we have two variables and the joint distribution is given by the
probability density function p(x,y) which can have any form as long as it's
always >=0 and integrates to 1.

The covariance matrix is a 2x2 matrix and contains only three real numbers
(four, but the off-diagonal terms are necessarily equal).

The covariance matrix (together with the vector of means) gives a complete
description of the joint distribution only when the joint distribution is
completely specified by its first two moments, i.e. all its higher cumulants
are zero (in other words, it is a multivariate normal).

The fact that the correlation between the variables is zero doesn't
necessarily mean that the variables are independent when the joint
distribution is not a multivariate normal.

The covariance matrix alone cannot be used to tell whether the joint
distribution is a multivariate normal or not. Every possible covariance matrix
defines a multivariate normal but is also compatible with an infinity of other
distributions.

------
itissid
The most intuitive explanation of GPs came to me from McElreath's Statistical
Rethinking. What made it stick was that GPs were explained as the thing you
need for a simple regression problem that becomes hairy with continuous
variables. The example was modeling a café's wait time, which varies
continuously with different covariates like the time of day and, IIRC, the
size of the café.

~~~
wodenokoto
Is it this one?

https://xcelab.net/rm/statistical-rethinking/

~~~
itissid
Yes

------
sbt
Great article, easy to understand. I also like the unpretentious tone of it.

------
Dawny33
Was going through this article this morning, and absolutely loved it.

After that, you can go through the video lecture recommended in the article;
it makes the material easy to follow.

~~~
nabla9
May I suggest the next article to read: An intuitive, visual guide to
copulas: https://twiecki.io/blog/2018/05/03/copulas/

~~~
Dawny33
Thanks mate. This is awesome. The explanations + code made it very easy to
understand.

------
Jenz
I don't get much out of this kind of article, but I love just looking at
these beautiful graphs.

------
dbcurtis
Great read! Well done!

------
nudpiedo
Am I the only one who found the abstract definition easier and more
intuitive, but the mathematical explanation much harder to follow? I
immediately thought of branching conditions mapped to a function table (like
a knowledge-database table with pattern matching, related to a table of
functions).

I understand the higher value here is the Gaussian distribution and how to
find or create a distribution that fits complex real-world problems, which
would be severely complicated to implement, analyse, or visualize with
classic code writing.

P.S. I am speaking from an IT engineer's point of view; the years I studied
math were more than a decade ago...

~~~
wiz21c
I'd like to point out that the mathematical/formal statements are the result
of long research. That is, whoever stated them most probably had a real-life
problem first. Then multiple other ones. Then he built up an understanding of
those problems and some kind of intuition. Then he gathered all of that into
a nice formula.

When we learn that formula, we (sometimes) go the other way around: we see
the formula, the maths, the formal stuff, and then we try to understand its
consequences and see how it fits reality.

To me that's a much harder process...

Basically: when I see a bridge, I can't figure out how it was built, or why.

