
A Conceptual Introduction to Hamiltonian Monte Carlo - gwern
https://arxiv.org/abs/1701.02434
======
soVeryTired
Radford Neal's expository paper [1] on HMC is a beautiful overview of the
area, and honestly one of the most enlightening papers I've ever read.
Conceptually, HMC really isn't that difficult. The broad idea is as follows:

Borrowing from statistical physics, it's possible to link energy and
probability via P(x) ∝ exp(-E(x)), i.e. E(x) = -log P(x) up to an additive
constant. So you can interpret your MCMC target as an "energy landscape". High
points in the landscape correspond to low probabilities in the target
distribution, and valleys to high-probability regions.

Now add a source of kinetic energy (an auxiliary probability distribution),
and think of your MCMC sampler as a hockey puck sliding frictionlessly through
the energy landscape. Since kinetic plus potential energy is conserved, you
move along contours of your joint (target, auxiliary) distribution. That means
you never have to reject a sample, regardless of how far the hockey puck
travels (in practice the dynamics are simulated with a numerical integrator,
and a Metropolis step corrects for its discretization error).

This interpretation also shows why you need to use gradient information. To
know where the hockey puck is going to move next, you need to know how steep
the walls of your energy landscape are: that is, you need the gradient!
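
For concreteness, here is a rough sketch of a single HMC transition in
Python/numpy (a toy illustration, not code from the paper; the Gaussian
target, step size, and trajectory length are arbitrary choices):

    import numpy as np

    def log_prob(x):
        # Toy target: isotropic unit Gaussian. -log_prob is the "potential energy".
        return -0.5 * np.sum(x ** 2)

    def grad_log_prob(x):
        return -x

    def hmc_step(x, step_size=0.1, n_leapfrog=20, rng=np.random.default_rng()):
        p = rng.standard_normal(x.shape)      # auxiliary momentum = kinetic energy
        x_new, p_new = x.copy(), p.copy()

        # Leapfrog integration of Hamiltonian dynamics: the frictionless slide.
        p_new += 0.5 * step_size * grad_log_prob(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new
            p_new += step_size * grad_log_prob(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_prob(x_new)

        # Exact dynamics conserve H = potential + kinetic, so this would always
        # accept; the Metropolis check only corrects the integrator's error.
        h_old = -log_prob(x) + 0.5 * np.sum(p ** 2)
        h_new = -log_prob(x_new) + 0.5 * np.sum(p_new ** 2)
        return x_new if np.log(rng.uniform()) < h_old - h_new else x

    x = np.zeros(5)
    draws = []
    for _ in range(1000):
        x = hmc_step(x)
        draws.append(x)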

[1] [https://arxiv.org/pdf/1206.1901.pdf](https://arxiv.org/pdf/1206.1901.pdf)

~~~
kgwgk
I also like Neal's introduction. I find the discussion of the typical set in
the paper linked in the submission more confusing than illuminating. Why
wouldn't the typical set include the mode of the distribution, given that its
contribution is higher than that of any other region of the same volume? Of
course the "outside half" of a hypersphere (say 0.5 < R < 1) will contribute
more than the "inside half" (R < 0.5) as N grows, but it is also exponentially
larger. It's not clear to me how that defines a neighbourhood (or what one
gains from doing so; it seems the neighbourhood could be defined to be the
full hypersphere without any loss).

~~~
betanalpha
Because probability has to sum to one. The higher you go in dimension, the
more that unit of probability is spread across "boxes", until any single box,
including the one around the mode, becomes irrelevant. When you are computing
expectations you need to focus on the boxes that dominate the contributions to
the integrals, and those are all in the typical set.

The typical set is a vaguely defined notion, because you cannot pin down an
explicit neighborhood without fixing some explicit probability threshold, as
is often done in the information theory of discrete systems. The point here,
however, is not to define an explicit neighborhood but rather to motivate that
whichever neighborhood of high probability mass you did define would not lie
near the mode and would instead extend out into a shell at some distance from
it.

You are welcome to define a neighborhood as the convex hull of the typical
set, which would then include the mode (subject to some technical conditions).
That wouldn't add any appreciable probability mass, but it would introduce a
whole bunch of space that your statistical algorithm would have to explore at
additional computational cost.
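
To make the "boxes" point concrete, here is a quick numerical check (my own
toy sketch with scipy, using a standard Gaussian as the stand-in target):
compare the mass a d-dimensional unit Gaussian puts in a unit ball around the
mode with the mass in a thin shell around radius sqrt(d).

    import numpy as np
    from scipy import stats

    # For a standard d-dimensional Gaussian, ||x||^2 is chi-squared with d
    # degrees of freedom, so both masses can be read off the chi2 CDF.
    for d in [1, 2, 10, 50, 100]:
        near_mode = stats.chi2.cdf(1.0, df=d)          # mass with r < 1
        lo = max(np.sqrt(d) - 1.0, 0.0)
        shell = (stats.chi2.cdf((np.sqrt(d) + 1.0) ** 2, df=d)
                 - stats.chi2.cdf(lo ** 2, df=d))      # mass with sqrt(d)-1 < r < sqrt(d)+1
        print(f"d={d:3d}  near mode: {near_mode:.2e}  shell: {shell:.3f}")

Already at d = 50 the ball around the mode holds a vanishingly small fraction
of the mass, even though the density there is highest, while the thin shell
holds most of it.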

~~~
kgwgk
How would that add a whole bunch of space if the only reason why the
probability mass is not appreciable is that the volume is minuscule?

~~~
betanalpha
Yes, "a whole bunch" is not a technical term. :-p You are correct that
including the entire convex hull would not itself be absurdly costly. A naive
search (say grid search) would still spend most of its time evaluating points
at the boundary of the hypersphere, although the additional evaluations near
the mode would strictly add more cost. Similarly, any algorithm targeting
probability mass (like MCMC) would avoid the interior of the hypersphere
altogether. In practice you don't exclude the mode explicitly, you build an
algorithm that targets mass and it ends up inherently avoiding the mode. The
important conceptual take aways are that, for example, algorithms that try to
utilize information just in the neighborhood around the mode (for example so-
called MAP estimators) will never be able to accurately quantify a probability
distribution and that algorithms that explore neighborhoods of high
probability mass will behave in seemingly-counterintuitive ways.

~~~
kgwgk
> although the additional evaluations near the mode would strictly add more
> cost.

No, one evaluation near the mode is more efficient than one evaluation at the
boundary (because the density is higher).

I agree that the region immediately around the mode is naturally "avoided"
(because the volume is small). But your paper makes it look like we increase
efficiency by explicitly avoiding it and concentrating somewhere else. That's
what I found confusing.

~~~
betanalpha
"No, one evaluation near the mode is more efficient than one evaluation at the
boundary (because the density is higher)."

Incorrect in general. Firstly, one evaluation anywhere does not yield any
reasonably accurate estimate of expectations. We'll always need an ensemble of
evaluations, in which case the relevant question really is "how should I
distribute these evaluations". And for any scalable algorithm none of them
will be near the mode.

This is often hard to grok because people implicitly fall back to the example
of a Gaussian density, where the mode and the Hessian at the mode fully
characterize the density, which can then be used to compute integrals
analytically. But for a general target distribution we do not have any of that
structure and instead have to consider general computational strategies.

"I agree that the region immediately around the mode is naturally "avoided"
(because the volume is small). But your paper makes it look like we increase
efficiency by explicitely avoiding it and concentrating somewhere else. That's
what I found confusing."

In high dimensions the probability mass of any well-behaved probability
distribution concentrates in (or _around_, if you want to acknowledge the
fuzziness) the typical set. Hence accurate estimation of expectations requires
quantifying the typical set. Any evaluation outside of the typical set is
wasted because it offers increasingly negligible contributions to the
integrals -- a few additional evaluations may not drive the cost up
appreciably but they will still be wasted.

So it's not that the only thing that matters is avoiding the mode. Rather,
what matters is that, contrary to many people's intuition, the neighborhood
around the mode does not meaningfully inform expectation values, and so
exploring that neighborhood is insufficient for estimating expectations. That
then motivates the question of which neighborhoods do matter, which is
answered by concentration of measure and the existence of the typical set.

And in practice we don't actually do any of this explicitly. Instead we
construct algorithms that somehow quantify probability mass (MCMC, VB, etc.)
and they implicitly avoid the mode and end up working with the typical set.

~~~
kgwgk
> We'll always need an ensemble of evaluations, in which case the relevant
> question really is "how should I distribute these evaluations". And for any
> scalable algorithm none of them will be near the mode.

Why not distribute them according to the probability distribution? If I sample
one million points and one happens to be near the mode, that doesn't add any
cost. It is no worse than sampling one million points none of which happens to
be near the mode.

> Any evaluation outside of the typical set is wasted because it offers
> increasingly negligible contributions to the integrals

This is not a reason to exclude the mode from the typical set, because one
evaluation at the mode offers a larger contribution to the integrals than one
evaluation elsewhere.

> Instead we construct algorithms that somehow quantify probability mass
> (MCMC, VB, etc.) and they implicitly avoid the mode and end up working with
> the typical set.

The algorithms don't have to avoid the mode; they just have to sample from it
fairly (not often, but more often than from any other region of the same size
in the rest of the typical set).

~~~
betanalpha
May I suggest that you try sampling from a high-dimensional distribution and
see how many samples end up near the mode? For example, try a 50-dimensional
IID unit Gaussian and check how often r = sqrt(x_1^2 + ... + x_50^2) < 0.25 *
sqrt(D). You can also work this out analytically -- it's the classic example
of concentration of measure.

By construction, samples from a distribution will concentrate in neighborhoods
of high probability mass, and hence in the typical set.
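
That experiment is a few lines of numpy/scipy (my own sketch, not anyone's
official code):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    D, n = 50, 1_000_000

    # Draw a million points from a 50-dimensional IID unit Gaussian and count
    # how many land within 0.25 * sqrt(D) of the mode.
    r = np.linalg.norm(rng.standard_normal((n, D)), axis=1)
    print("empirical fraction:", np.mean(r < 0.25 * np.sqrt(D)))

    # Analytic answer: r^2 is chi-squared with D degrees of freedom.
    print("analytic probability:", stats.chi2.cdf(0.25 ** 2 * D, df=D))

The analytic probability is so small that the empirical count will essentially
always come out exactly zero.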

~~~
kgwgk
It was you who said that it was better to exclude the mode from the typical
set because "the additional evaluations near the mode would strictly add more
cost". Now you tell me that evaluations near the mode are extremely unlikely.
Something that, believe it or not, I understand. Maybe you would have liked it
better if I had talked about one sample in one trillion being near the mode.
And in that case that evaluation wouldn't be wasted because its contribution
to the computed integral would be more important than if I had sampled by
chance another point in the typical set with lower probability density.

Excluding the region with the highest probability density from the typical set
is a bit like saying that the population of New York is concentrated outside
NYC because most people live elsewhere.

------
tmalsburg2
Also worth checking out is the author's talk at this year's StanCon:

https://youtu.be/DJ0c7Bm5Djk?t=16810

Title: Everything you should have learned about Markov Chain Monte Carlo.

~~~
betanalpha
There are also talks focusing on HMC available online, the most recent of
which are
[https://www.youtube.com/watch?v=VnNdhsm0rJQ](https://www.youtube.com/watch?v=VnNdhsm0rJQ)
(more introductory) and
[https://icerm.brown.edu/video_archive/#/play/1107](https://icerm.brown.edu/video_archive/#/play/1107)
(more mathematically formal).

------
xioxox
As far as I know, you need to be able to compute the gradient of the function
being explored to use this method. Is that correct?

Most of the likelihood functions I use are nasty and complex. I've mostly been
using an affine-invariant sampler, emcee,
[http://dan.iel.fm/emcee/current/](http://dan.iel.fm/emcee/current/), which
works well on many problems.

~~~
imurray
Yes, you need the gradient. And despite what the autodiff people say, that
_might_ be hard and/or not very useful in practice, if your likelihood is
based on running a lengthy astrophysical simulation.

You can run HMC on any differentiable surrogate surface. It's only the
accept/reject step that has to use your real model and data. You might
consider that if you have a less hairy approximation to your model.
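
A rough sketch of what that can look like (toy placeholders throughout: the
"true" log density and the Gaussian surrogate are stand-ins, not anyone's real
model). The leapfrog integrator runs on the surrogate gradient, but the
Metropolis accept/reject uses the true log density, so the chain still targets
the real posterior and a poor surrogate only costs acceptance rate.

    import numpy as np

    rng = np.random.default_rng(1)

    def true_log_prob(x):
        # Stand-in for the expensive, hard-to-differentiate model.
        return -0.5 * np.sum(x ** 2) - 0.1 * np.sum(np.abs(x))

    def surrogate_grad(x):
        # Gradient of a cheap differentiable approximation (here a unit Gaussian).
        return -x

    def surrogate_hmc_step(x, step_size=0.05, n_leapfrog=30):
        p = rng.standard_normal(x.shape)
        x_new, p_new = x.copy(), p.copy()

        # Leapfrog driven only by the surrogate gradient: the map stays
        # reversible and volume-preserving even if the trajectory drifts.
        p_new += 0.5 * step_size * surrogate_grad(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new
            p_new += step_size * surrogate_grad(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * surrogate_grad(x_new)

        # Accept/reject with the *true* log density keeps the chain exact;
        # a bad surrogate just makes this test fail more often.
        h_old = -true_log_prob(x) + 0.5 * np.sum(p ** 2)
        h_new = -true_log_prob(x_new) + 0.5 * np.sum(p_new ** 2)
        return x_new if np.log(rng.uniform()) < h_old - h_new else x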

Emcee is popular in astronomy, partly because of network effects, partly
because it's a really nice package, and partly because the method's a great
fit for some of the posteriors you have: often unimodal-ish even if the
likelihood is nasty to compute, and often in not that many dimensions. It
would fall flat on its face for something like a Bayesian neural network.

~~~
betanalpha
"And despite what the autodiff people say, that might be hard and/or not very
useful in practice, if your likelihood is based on running a lengthy
astrophysical simulation."

Only if you treat the simulation as an un-interrogatable black box. All of
those astro simulations are ODE or PDE systems of one kind or another, which
can be very cleanly autodiffed to calculate exact gradients. N-body
simulations are particularly well suited to this approach.
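
As a toy illustration of the point (a hand-rolled forward sensitivity pass
through an Euler-integrated ODE, standing in for what an autodiff tool would
do to a real simulator):

    import numpy as np

    def simulate_with_sensitivity(theta, x0=1.0, t_end=1.0, n_steps=1000):
        # Integrate dx/dt = -theta * x with forward Euler while carrying the
        # sensitivity s = dx/dtheta alongside the state (forward-mode autodiff
        # written out by hand).
        dt = t_end / n_steps
        x, s = x0, 0.0
        for _ in range(n_steps):
            x, s = x + dt * (-theta * x), s + dt * (-x - theta * s)
        return x, s

    theta = 0.7
    x_T, dx_dtheta = simulate_with_sensitivity(theta)
    print(x_T, np.exp(-theta))          # true solution: x(T) = x0 * exp(-theta * T)
    print(dx_dtheta, -np.exp(-theta))   # exact derivative d x(T) / d theta

The gradient of any log likelihood that depends on x(T) then follows by the
chain rule, which is all HMC needs.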

"You can run HMC on any differentiable surrogate surface surface. It's only
the accept/reject step that has to use your real model and data. You might
consider that if you have a less hairy approximation to your model."

You _can_, but it won't work above O(10) dimensions. The problem is that the
more dimensions you have, the more any surrogate surface (and its gradients)
will drift away from the true surface, and the worse your acceptance
probabilities will be.

With exact gradients the optimal performance of HMC will drop negligibly with
dimension (you need to get up to O(2^100) dimensions or so before you really
start to see problems), assuming the target distribution itself doesn't get
more complex as its dimension increases.

