Gaussian Processes, not quite for dummies | 257 points by danso 30 days ago | 47 comments

 Implementing Gaussian Processes in Python is a fun way to practice your programming and mathematical abilities. I was working on a course project that used GPs for regression, with the inputs taken from the output of a CNN. We were trying to predict crop yield from satellite imagery, and GPs helped to incorporate spatio-temporal locality: think of predictions for a neighboring county, or for an image from a previous year.

One issue we ran into while using the model from the original authors [1] was a memory constraint, from building on top of Google Colab instead of a private AWS instance and from using a larger dataset to predict on. The Gaussian kernel matrix grows as O(n^2) in the number of data points, and we had up to 22,000 data points. To get around the memory constraint, we had to re-implement the GP from scratch, carefully avoiding the memory spikes that come with computing the kernel and mean predictions.

It was a two-fold process. For understanding the exact equations, I would highly recommend the resource mentioned in the article [2], Gaussian Processes for Machine Learning, Chapter 2. For a lecture video, I would recommend [3], which is from an ML course at Cornell.

As for implementing GPs efficiently, this is still a challenge: numpy has a bunch of matrix-vector functions that are helpful, but there isn't a way to exploit the symmetry of the kernel matrix and use half the required memory.
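For anyone curious what "avoiding memory spikes" looks like in practice, here's a minimal sketch (my own function names, not the authors' code): the naive way to build an RBF kernel materializes an (n, m, d) array of pairwise differences, while computing it in row chunks keeps the peak footprint bounded.

```python
import numpy as np

def rbf_kernel_chunked(X, Y, lengthscale=1.0, chunk=2000):
    """Compute an RBF kernel matrix in row chunks, avoiding the large
    (n, m, d) temporary that broadcasting pairwise differences creates."""
    n, m = X.shape[0], Y.shape[0]
    K = np.empty((n, m))
    Y_sq = np.sum(Y**2, axis=1)  # computed once, reused for every chunk
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        Xc = X[start:stop]
        # squared distances via ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
        sq = np.sum(Xc**2, axis=1)[:, None] - 2 * Xc @ Y.T + Y_sq[None, :]
        # clamp tiny negative values caused by floating-point cancellation
        K[start:stop] = np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)
    return K
```

Peak extra memory is roughly chunk * m floats rather than n * m * d, which matters at 22k points.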
 Another easy solution, if your kernel is appropriate (it goes to 0 at infinity, such as a Gaussian, Matérn or exponential kernel), is to set a threshold under which your covariance is set to zero and to use a sparse representation for the covariance matrix (you can even be more extreme and fix the number of entries allowed to be non-zero, giving you a fixed memory use).
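A quick sketch of the thresholding idea (the helper name is mine): zero out the small entries, then store the result in a scipy sparse format.

```python
import numpy as np
from scipy import sparse

def sparsify_covariance(K, threshold=1e-6):
    """Zero out entries with magnitude below `threshold` and store the
    result in CSR format. Only sensible for kernels that decay to 0
    (RBF, Matérn, exponential). Note the truncated matrix may lose
    positive-definiteness, so a small diagonal jitter is often added
    before factorizing it."""
    K = np.where(np.abs(K) < threshold, 0.0, K)
    return sparse.csr_matrix(K)
```

For well-separated inputs the number of stored entries drops far below n^2, which is the whole point.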
 This does not solve the issue. You still need to invert this sparse matrix for inference. The inverse of a sparse covariance matrix is dense.
 The raw mathematical formula uses an inversion, but what you truly need is to solve a linear system (which is why most implementations store a Cholesky decomposition of the covariance matrix), and that can be done without turning the sparse matrix into a dense one.
 +1. This is how I solved for the optimal parameters: instead of inverting the matrix, which is memory-heavy and computationally expensive, I used the solve function from scipy.linalg. I looked into numpy and scipy methods for exploiting symmetry and sparsity, but my kernel matrix is still too dense.
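For concreteness, here's a minimal sketch of the factor-once, solve-many pattern with scipy's Cholesky helpers (the function and shapes are my own illustration, not the parent's actual code):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior_mean(K, K_star, y, noise=1e-2):
    """Posterior mean K_*^T (K + sigma^2 I)^{-1} y without forming an
    explicit inverse: one Cholesky factorization, then triangular solves.

    K      : (n, n) kernel matrix on training inputs
    K_star : (n, n_test) cross-kernel between training and test inputs
    y      : (n,) training targets
    """
    n = K.shape[0]
    c, low = cho_factor(K + noise * np.eye(n))  # still O(n^3), but no dense inverse
    alpha = cho_solve((c, low), y)              # solves (K + sigma^2 I) alpha = y
    return K_star.T @ alpha
```

The factor can be reused for the predictive variance and the log marginal likelihood as well, which is why implementations store it.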
 What would you use to solve the system? Iterative methods or gradient descent even?
 Something like MINRES or GMRES or CGSTAB. Correct me if I'm wrong, but the covariance matrix in GPR should be diagonally dominant, so the diagonal of the matrix can be used as a somewhat effective preconditioner.
 With most kernels the matrix is symmetric and positive-definite so I would probably use a conjugate gradient: https://en.wikipedia.org/wiki/Conjugate_gradient_method
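Combining the two suggestions above, a sketch of conjugate gradients with the diagonal as a Jacobi preconditioner might look like this (my own illustrative helper, assuming a symmetric positive-definite system):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def solve_gp_system(K, y, noise=1e-2):
    """Solve (K + sigma^2 I) alpha = y with preconditioned conjugate
    gradients, using the diagonal as a Jacobi preconditioner."""
    n = K.shape[0]
    A = K + noise * np.eye(n)
    d_inv = 1.0 / np.diag(A)
    # The preconditioner M approximates A^{-1}; here just 1/diag(A).
    M = LinearOperator((n, n), matvec=lambda v: d_inv * v)
    alpha, info = cg(A, y, M=M)
    assert info == 0, "CG did not converge"
    return alpha
```

The matrix only ever appears through matrix-vector products, so A can just as well be a sparse matrix or a LinearOperator that computes kernel rows on the fly, which is where the memory savings come from.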
 yes
 Regarding the challenge of an efficient implementation, pytorch+pyro attempt to tackle it. It's still a work in progress, but you might want to have a look at: https://pyro.ai/examples/gp.html
 The real issue is that inversion / solving is O(n^3). This really hammers you in high-dimensional settings (e.g. brain imaging, n > 100k).

Solutions:

* Pick a very special kernel with a known / approximable solution.

* Reduce the dimension of the problem (sparse GPs).
 The n parameter there was the number of data points, not the size of the inputs (that would've been way worse). For brain imaging, is the 100k figure also referencing data points, or the dimensionality of the input?

Chapter 8 of the Rasmussen reference does talk about approximation methods for large datasets, but I couldn't find any reference implementations to go off of.
 > number of datapoints, not the size of the inputs

For brain imaging, 100k was the dimensionality of the input (but also the number of data points). This could easily be O(10^6) though with higher-res imaging.

For discussion, I'd check out: https://www.prowler.io/blog/sparse-gps-approximate-the-poste...

For code: https://github.com/GPflow/GPflow

Based on the people involved knowing what they are talking about, rather than on experience using that particular work.
 Seeing the complications implied by the O(n^2) problem, I'm curious why it was worth it. What did using GPs to solve the problem buy you that you couldn't get with other methods?

I think GPs are interesting, but the kernel matrix is a real problem with them in practice. I have yet to understand why they are worth the cost compared to just using least squares or some other regression method. Confidence intervals are an obvious benefit, but since you can get them via Monte Carlo, at what point does the GP method pay off?
 Compared to training the CNN, running a GP on the outputs seemed like a worthwhile tradeoff of time if it resulted in a discernible difference in validation error; and the original authors of the paper we were following already had an open implementation.I am open to suggestions if you know of any "post-processing" methods that would help to incorporate locality into the overall prediction.
 What I mean is, why a GP instead of, say, polynomial regression, or training a NN on the CNN outputs? I'm curious if the decision was due to some feature of GPs that you really needed and could not easily get otherwise. I'm not doubting your choice, really; I'm just trying to get a feeling for what niche GPs fill (given their difficult computational requirements), as I have never found the opportunity to use them.

> that would help to incorporate locality into the overall prediction.

Not sure what you mean by this, so I am sure I am missing something.
 We used a Gaussian process on the outputs from a CNN, not separately.

What I meant by locality is the idea that two points which are close to each other should also have predictions close to each other, assuming that there is some underlying Gaussian variable we can't see immediately from the data or from the CNN. One example could be housing prices: if your neighbor sold their house, then, controlling for other factors like square footage or the number of bedrooms, you should value your own house similarly. We assume there is some underlying quality of the neighborhood, possibly given by the distance between houses.

The niche that GPs fill is broad. Two applications I've seen were their use in trying to find the missing MH370 flight, and in hyperparameter tuning of ML models.
 I would highly recommend David Duvenaud's PhD thesis [1] for anybody interested in Gaussian Processes. I find his chapter [2] on expressing structure with kernels super interesting and quite intuitive. It's also one of the PhD theses with the best-looking graphics I've read!
 The OP needs some correction.

The title is "Gaussian Processes". Okay, so we're considering "processes". A Gaussian process is a special case of a stochastic process.

Here is the usual, first definition of a stochastic process: for the set of real numbers R and for each t in R (t usually regarded as time), we have a random variable X_t (TeX notation for subscripts) whose values are in R. That is, X_t is a real-valued random variable. So the stochastic process is, say,

{X_t | t in R}

This stochastic process is a Gaussian process if for each t in R, X_t has a Gaussian distribution.

Commonly we continue and assume that both the expectation and the variance of X_t are the same for each t in R. In addition, for any s in R we might assume that the covariance

Cov(X_t, X_{s + t})

(more TeX notation) is a constant for all t. With these assumptions we can do a lot in power spectral estimation and extrapolation. We can apply the central limit theorem to argue that an independent-increments process is Gaussian.

For more, see, e.g.,

I. A. Ibragimov and Y. A. Rozanov, Gaussian Random Processes, ISBN 0-387-90302-X, Springer-Verlag, New York, 1978.

For one more: if real-valued random variables X and Y are independent, then they are uncorrelated. Net, independence is a much stronger assumption than being uncorrelated. But, amazingly, if X and Y are jointly Gaussian and uncorrelated, then they are independent.

For one more: for the example of curve fitting and interpolation, I'd suggest not polynomials but splines or least-squares splines.

For one more: for Gaussian X_t for finitely many t, the sample mean and sample variance are sufficient statistics. That means that in any work in statistics, we can keep the sample mean and variance, throw away the X_t, and in principle do as well. The foundational paper on sufficient statistics was by Halmos and Savage back in the 1940s. The concept of sufficient statistics is an application of the Radon-Nikodym theorem. E. Dynkin showed that the Gaussian assumption is critical: with small changes the result is false.
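To make the definition concrete: on any finite grid of times, a GP is just a multivariate Gaussian vector, so one can sample a path directly. A minimal sketch with a stationary covariance (my own choice of kernel for illustration):

```python
import numpy as np

# Stationary covariance: Cov(X_s, X_t) = exp(-(s - t)^2 / 2), mean 0.
# A tiny diagonal jitter keeps the matrix numerically positive-semidefinite.
t = np.linspace(0.0, 5.0, 100)
cov = np.exp(-0.5 * (t[:, None] - t[None, :])**2) + 1e-10 * np.eye(t.size)

rng = np.random.default_rng(0)
# Any finite collection (X_t1, ..., X_tn) is jointly Gaussian, so a sample
# path restricted to this grid is one draw from a multivariate normal.
path = rng.multivariate_normal(np.zeros(t.size), cov)
```

Note the covariance depends only on s - t, matching the stationarity assumptions above.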
 Original author here: thanks a lot for that, this is very helpful! I was a bit confused about that myself, will make corrections.
 > This stochastic process is a Gaussian process if for each t in R X_t has Gaussian distribution.This is not quite strong enough. You need (X_t1, X_t2, ..., X_tn) to jointly have a multivariate Gaussian distribution for every finite (t1, ..., tn).
 Usually! I gave a weak definition!

Also, for correlation, you need to assume the expectations exist.
 I've been teaching myself stochastics; I personally find it really interesting. I really like Peter Medvegyev's book Stochastic Integration Theory, do you have any thoughts on it? You really did a good job of summarizing the bigger picture in a few words; it really helped, so thank you. Any additional recommended resources that tie things together like this?

Edit: also liked the write-up, yugeten, it was helpful.
 Great work Yuge! This post from Distill may also be worth checking out for those interested in Gaussian Processes.
 > if all k components are Gaussian random variables, then X must be multivariate GaussianThey need to be independent as well or there should be some other criteria. Otherwise I have counterexamples.
 > They need to be independent as well

No, the sum of correlated Gaussians is also Gaussian. [Edit: or more precisely it can be, if they are jointly Gaussian; the point is that they don't need to be uncorrelated.]
 They don't need to be independent, but they do need to be jointly Gaussian; it is a very small edge case, I guess.
 You're right and the author's note is wrong. Joint Normality of a random vector implies component-wise Normality, but not the other way around, so it's a stronger condition.
 You can check if this holds by inspecting the covariance matrix. It can be degenerate or the variables may be correlated (thus not jointly normal), and even independent uncorrelated Gaussian ones need not be.

https://en.wikipedia.org/wiki/Multivariate_normal_distributi...

It's best to check using the second definition from: https://en.wikipedia.org/wiki/Multivariate_normal_distributi...

This results in the BHEP test: https://en.wikipedia.org/wiki/Multivariate_normal_distributi...
 > You can check if this holds by inspecting the covariance matrix.Just a clarification: if that’s somehow correct then it’s not about the 2x2 covariance matrix that gives the linear relationship between two variables. Because two variables can be correlated and the joint distribution be a multivariate Gaussian. And two variables could be uncorrelated and not be independent; the covariance matrix doesn’t contain much information about the joint distribution (unless joint normality is assumed).
 Correct, that's only useful to check for degeneracy. The test is much more certain.And it's bigger than 2x2, it depends on number of dimensions you're checking. 3 variables goes 3x3 etc. For fitting as shown (and GANs), it gets big really fast.
 I don't understand what covariance matrix you're talking about.

Let's say we have two variables and the joint distribution is given by the probability density function p(x,y), which can have any form as long as it's always >= 0 and integrates to 1. The covariance matrix is a 2x2 matrix and contains only three distinct real numbers (four entries, but the off-diagonal terms are necessarily equal).

The covariance matrix (together with the vector of means) gives a complete description of the joint distribution only when the joint distribution is completely specified by its first two moments and all its higher cumulants vanish (i.e. it is a multivariate normal). The fact that the correlation between the variables is zero doesn't necessarily mean that the variables are independent, when the joint distribution is not a multivariate normal.

The covariance matrix alone cannot be used to tell whether the joint distribution is a multivariate normal or not. Every possible covariance matrix defines a multivariate normal, but is also compatible with an infinity of other distributions.
 Let A ~ N(0,1) and I ~ Bernoulli(1/2) be independent random variables. Then both X1 = A and X2 = (1 - 2I)*A are Gaussian random variables. They even have zero covariance, but they are not independent. Also, X1 + X2 is not Gaussian.
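This counterexample is easy to sanity-check by simulation (just a sketch of the construction above, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
A = rng.normal(size=n)                 # A ~ N(0, 1)
I = rng.integers(0, 2, size=n)         # I ~ Bernoulli(1/2), independent of A
X1 = A
X2 = (1 - 2 * I) * A                   # randomly sign-flipped copy; still N(0, 1) marginally

cov = np.mean(X1 * X2)                 # sample covariance: close to 0, so uncorrelated
atom = np.mean(X1 + X2 == 0.0)         # X1 + X2 is exactly 0 whenever I = 1,
                                       # so it has an atom at 0 and cannot be Gaussian
```

The sum equals 2A with probability 1/2 and exactly 0 with probability 1/2, which no Gaussian distribution can do.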
 You were right on the “or there should be some other criteria” part, but not on the “They need to be independent” part.If you start with independent Gaussians A and B and accept that A+B and A-B are Gaussians you’ll easily see that the sum of those is also Gaussian (because it’s 2A) even though they are not independent.
 Yes, you need that any linear combination of the k Gaussians is Gaussian.
 The most intuitive explanation of GPs came to me from McElreath's Statistical Rethinking. What stuck most was that it was explained in terms of something needed for a simple regression problem, which becomes hairy with a continuous variable. The example was modeling a cafe's wait time, which varies continuously with different covariates like the time of day and, IIRC, the size of the cafe.
 Is it this one?
 Yes
 Great article, easy to understand. I also like the unpretentious tone of it.
 Was going through this article this morning, and absolutely loved it. After that, you can go through the video lecture recommended in the article; it makes it easy to follow.
 May I suggest next article to read: An intuitive, visual guide to copulas https://twiecki.io/blog/2018/05/03/copulas/
 Thanks mate. This is awesome. The explanations + code made it very easy to understand.
 I don’t get much out of these kinds of articles, but I love just looking at the beautiful graphs.