
Likelihood for Gaussian Processes - Zephyr314
http://blog.sigopt.com/post/134931842613/sigopt-fundamentals-likelihood-for-gaussian
======
mccourt
I am the author of these posts, and I am happy to answer questions if you have
any.

~~~
nonbel
I have a question that is only tangentially related to this so feel free to
ignore.

Say I flip 50 fair coins, P(heads)=P(tails)=0.5, 10 times each a day for 100
days. For each coin we have 100 records like HTHTHHTTHT. Now I count the
number of tails (5 in the example) for each day for each coin. So each coin
has a record with length 100 like [5, 8, 5, ...,6]. Now I can calculate the
variance of these values for each coin and end up with 50 variances. What do
we expect the distribution of these variances to look like?

~~~
mccourt
I need a bit of clarification here. In the list of length 100, each entry in
that list is a binomial random variable X~Binom(10, .5). The variance of each
of these binomials is Var(X) = 10(.5)(.5) = 2.5.

But I don't think that's the variance you're talking about. Do you mean the
sample variance of the list of length 100? I'm sure that we can figure that
out too (or at least what it should look like) but I want to confirm that's
the variance you are interested in. Off the top of my head, I think this
sample variance should have a roughly Chi-Square distribution, since Binom(10,
.5) is close to normal (sort of). The Wikipedia article explains how a Chi-Square variable can be expressed as a sum of squared normals ([https://en.wikipedia.org/wiki/Chi-squared_distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution)). I could maybe figure out the actual distribution if you really need it.
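
For concreteness, here is a rough Monte Carlo sketch of that approximation (Python/NumPy; the specific setup - 100 counts of Binom(10, .5) per coin - is my reading of your description, so treat it as illustrative rather than definitive):

    import numpy as np

    rng = np.random.default_rng(0)

    n_flips, p = 10, 0.5     # each daily count is Binomial(10, 0.5)
    n_days = 100             # 100 daily counts per coin
    n_coins = 100_000        # simulate many coins to see the whole distribution

    # one row of daily tail-counts per coin, then the sample variance of each row
    counts = rng.binomial(n_flips, p, size=(n_coins, n_days))
    sample_vars = counts.var(axis=1, ddof=1)

    # normal-theory approximation: (n-1) S^2 / sigma^2 ~ chi-square(n-1), so S^2 is
    # roughly sigma^2/(n-1) times a chi-square(n-1) draw, with sigma^2 = 10(.5)(.5) = 2.5
    sigma2 = n_flips * p * (1 - p)
    approx = sigma2 / (n_days - 1) * rng.chisquare(n_days - 1, size=n_coins)

    for name, v in [("simulated S^2", sample_vars), ("chi-square approx", approx)]:
        print(f"{name}: mean = {v.mean():.3f}, variance = {v.var():.4f}")

Both means should sit near 2.5; because the binomial has slightly negative excess kurtosis, the simulated variances will be a touch less spread out than the normal-theory chi-square suggests.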

Then you ask what should the distribution look like? Is this answered by the
answer above, or did you mean something related to the 50 samples in
particular?

~~~
nonbel
Thanks, I'm wondering how to calculate the distribution of sample variances.
Yes, each variance is calculated from the list of 100 counts of #tails. It's
just something that has bothered me for a while.

~~~
mccourt
Okay. I'll see what I can do. No promises though - I may have taught
probability recently, but I left all the hard problems to my students ;)

~~~
nonbel
I ended up using Monte Carlo, but it seems like something that should be
known.

------
gerty
I am a frequentist by training and got a little confused by Bayesian and
engineering terminology while flipping through the posts. Bear with me.

Did I understand correctly that kernel interpolation is what we'd call a kind of
non-noisy kernel regression? If it's the case and dimension d of the
regressors is large, multivariate non-parametric estimation will have very
slow convergence, won't it?

~~~
mccourt
I also get confused by the notation, so you are not alone.

I think it is fair to say that kernel interpolation is a non-noisy kernel
regression. Of course, this would also depend on your choice of terminology
... I use the word regression to mean any sort of fitting of specific basis
functions to data - the fit doesn't have to be perfect. Interpolation, at
least in my mind, means that the fit does have to be perfect.

So when I say kernel interpolation I think it is fine to think of it as
regression that perfectly passes through any data points.

Technically, I like to think of the fact that the data has noise as
independent of the strategy that you use to fit it. So you could perfectly
interpolate noisy data (a bad idea) and you could do a regression on noise-
less data (also bad, but not as bad). But yeah, I think it's probably safe to
say "non-noisy kernel regression".

As for your real question, chapter 9 of my book talks about that, and the
reason I mention that here is that it is a complicated question. Just to
confirm, when you say dimension you mean physical dimension, not the number of
pieces of data, right?

The classical convergence bounds all deal with the smoothness of the data (or
rather the function that generated the data) and the smoothness of the
reproducing kernel. The relevance of the smoothness grows with the dimension, so the dimension certainly matters. Arguably, this is why statisticians like Michael Stein would say that Gaussians are not appropriate for low dimensions (too much smoothness) but are necessary for high dimensions (because without the smoothness the convergence is too slow).

Some recent results
([http://epubs.siam.org/doi/abs/10.1137/10080138X](http://epubs.siam.org/doi/abs/10.1137/10080138X))
talk about dimension independent bounds, but with some caveats, and almost
entirely from the numerical analysis standpoint (which can be tough to read).

Sorry to sort of fork the answer into pieces there. Anything else I can help
with?

~~~
gerty
Thanks for the answer and the SIAM reference, I did manage to pick up
something interesting from there, I think. Smoothness, dimensionality and also
adaptivity are vast topics indeed. Good luck with the future work!

~~~
mccourt
I tried to do some digging to find easy-to-access (both freely available on the web and not terribly complicated) references on error bounds for kernel interpolation. I don't think there is one ... although the internet will
surely correct me if I'm wrong.

The original source that I often return to (even before I look in my book) is
Wendland's 2005 book "Scattered Data Approximation". But that book is really
tough to read, even for me. Lots of tough math. Fasshauer's 2007 book
"Meshfree Approximation Methods in Matlab" is much easier to read, but
references Wendland's book for most of the heavy lifting.

There is a newer branch of study on convergence dealing with what are called
"sampling inequalities". In many ways, the math behind these is even worse as
it usually requires a bunch of polynomial theory, but the results are more
easily accessible. In particular, Christian Rieger and Barbara Wohlmuth have outstanding content on this which is readily available on the web. For example, I was able to find Christian's PhD thesis ([https://www.deutsche-digitale-bibliothek.de/binary/JOWXLCGSV...](https://www.deutsche-digitale-bibliothek.de/binary/JOWXLCGSV4NBOC5ZMLW553MRS6X65BFB/full/1.pdf)), which provides a good journey from the start to actual results.

Sorry I can't provide a cleaner statement about this. Maybe the most basic
point I can make about convergence here would be to reference the early part
of Christian's thesis where he alludes to the theorem that says the quality of
an interpolant sort of looks like:

error of interpolant = O(h^{k-d})

That is a gross simplification, but it gives the gist of the result: h is the "fill distance" (or grid width for structured data), k is the smoothness of the kernel, and d is the dimension of the data. This indicates why smooth kernels (large k) are needed to get convergence for high d.
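
If anyone wants to poke at that behavior numerically, here is a rough sketch (Python/NumPy; the Matern-3/2 kernel, length scale, test function, and grid sizes are my own choices, and the observed order is only indicative since boundary effects and conditioning muddy the picture):

    import numpy as np

    def matern32(x, z, ell=0.2):
        """Matern-3/2 kernel (finite smoothness, so only an algebraic rate is expected)."""
        r = np.abs(x[:, None] - z[None, :]) / ell
        return (1.0 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)

    f = lambda t: np.sin(2 * np.pi * t) + t**2      # smooth test function
    xx = np.linspace(0, 1, 2000)                    # fine grid for sup-norm error estimate
    fxx = f(xx)

    prev_err = None
    for n in [10, 20, 40, 80, 160]:
        x = np.linspace(0, 1, n)                    # fill distance h is roughly 1/(n-1)
        c = np.linalg.solve(matern32(x, x), f(x))   # kernel interpolation coefficients
        err = np.max(np.abs(matern32(xx, x) @ c - fxx))
        note = "" if prev_err is None else f"   observed order ~ {np.log2(prev_err / err):.2f}"
        print(f"n = {n:4d}   h = {1.0 / (n - 1):.4f}   sup error = {err:.2e}{note}")
        prev_err = err

Each doubling of n halves h, so log2 of the error ratio estimates the algebraic order; swapping in a smoother kernel (e.g. the Gaussian) should drive the error down much faster, at least until the kernel matrix becomes too ill-conditioned to solve accurately.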

------
Xcelerate
Excellent post. Would you have a recommendation on a resource for learning
about reproducing kernel Hilbert spaces? In particular, I'm interested in
learning more about how they are applied to data analysis. I've been reading
about them in a very theoretical sense, but very few math texts provide "real-
world" examples of how they are used.

~~~
mccourt
Thanks for reading, I'm glad you enjoyed it.

You are correct that there are few texts that discuss real-world examples of
reproducing kernel Hilbert spaces without a significant amount of overhead to
get there. In reality, I'm not sure there is an easy way to talk about
reproducing kernel Hilbert spaces without all that functional analysis, which
is a tough topic. I would argue that's probably the greatest thing holding
back kernels - their high barrier to entry.

I'll spend one line here to mention my book (kernel-based approximation
methods in Matlab) and also say that it is pretty heavy on the theory; the
second half of the book deals with reproducing kernels as used in a variety of
fields, but getting there requires a lot of very tough math.

Lemme take a moment to list places (some more real-world than others) where I
know that reproducing kernels pop up - and anyone else reading this please
feel free to add to this list:

1) Approximation theory - My bread and butter and the most important role for
kernels as far as I'm concerned. When you have scattered data (especially in
higher dimensions) and you want to interpolate and predict unobserved values,
reproducing kernels give you several very nice guarantees. Through this, they
also pop up in computer vision or surface reconstruction from point cloud
data.

2) Spatial statistics - They actually play an identical role in kriging as
they do in approximation theory, albeit for totally different reasons. People
who are doing remote sensing or topographic mapping may use reproducing
kernels (or software that uses these kernels).

3) Machine learning - Both RBF networks and support vector machines use
reproducing kernels, although in two different ways. The mechanism by which
they are used in support vector machines (feature maps) is difficult for me to
understand, which is why I don't talk about them much. Support vector machines are still a very important area that uses kernels.

4) Differential equations - The same theory that makes reproducing kernels
work for approximation theory also works for numerically solving differential
equations. You could lump stuff regarding Green's functions in here too,
although those may not be reproducing kernels.

5) Numerical integration - Especially for integration in higher dimensions,
basically all the theory of convergence for quasi Monte Carlo methods relies
on reproducing kernel Hilbert spaces.

Maybe the best thing I can point you to is the same place where I first
learned about this stuff: math.iit.edu/~fass/590. That's an old (recently
updated) class website from a class I took at the Illinois Institute of
Technology when I was an undergrad there. Dr. Fasshauer does an outstanding
job putting notes and slides for his class together.

If you're feeling adventurous, I would recommend our book, but a good starting
point would be his class notes, which I still turn back to often and which are written to introduce graduate students (who know nothing about kernels) to the
topic. I'll work on putting together a list of good application-oriented
sources as well.

Thanks for your interest! Did you have a specific example/topic of kernels
from data analysis that piqued your interest? I could look into more targeted
content.

~~~
brianchu
Honestly, the biggest thing holding back kernels is deep learning
(unfortunately?).

~~~
mccourt
That may be. And I don't think kernels will ever be a hot topic. But part of
that is that they are a very old topic (you can see Gauss referring to them in
slide 22 of
[http://math.iit.edu/~fass/590/notes/Notes590_Ch1.pdf](http://math.iit.edu/~fass/590/notes/Notes590_Ch1.pdf)).
I think that kinda prevents funding agencies from pushing too hard for research into them, since a newer topic offers a more likely quick return on investment. And I don't begrudge them that - funding should be given to
people with new ideas which have the potential to revolutionize things. I'd
also mention that there are people trying to bridge the gap between how GPs
work and the benefits of neural nets
([http://arxiv.org/abs/1502.05700](http://arxiv.org/abs/1502.05700)).

Probably another thing holding them back is their high barrier to entry - you
can't do much with the theory of kernels unless you know statistics and functional analysis (and of course numerical analysis and linear algebra), which is one of the reasons I had difficulty finding research students. To draw a parallel to numerical analysis (where I am much more comfortable than in machine learning), this is why finite element methods are much more popular
than, say, boundary element methods for PDEs, despite the fact that boundary
element methods are better in many circumstances. But when implementing
things, especially in industry, simplicity and robustness count for something
and those are not strong points of kernels.

But kernels haven't gone anywhere yet, and they've been fundamental to
analysis since at least David Hilbert, so they are not going anywhere any time
soon. Deep learning ... we'll see. Maybe it has the legs to stick around or
maybe it'll be swept aside by the next hot thing. 8 years ago all the research
funding was going into uncertainty quantification, 5 years ago compressed
sensing was the hot idea (thanks, Terry Tao), and now deep learning is going to
light the world on fire. All of those are, and should be, hotter topics than
kernels, but we'll see what we're talking about in 5 years. It won't be
kernels (unless something crazy happens and I become the leader of the US, in
which case I have a great idea for a new reality show) but I don't know that
it will be deep learning either.

