
A Handwavey Explanation of Metropolis-Hastings - luu
http://ncollins.github.io/
======
ur-whale
Somewhat OT, but: I've always felt there's some sort of connection between
Metropolis and the alias method [1]

Can someone more knowledgeable perhaps shed some light on this idea?

[1]
[https://en.wikipedia.org/wiki/Alias_method](https://en.wikipedia.org/wiki/Alias_method)

~~~
dahart
I don't pretend to be all that knowledgeable, but I've implemented & used both
methods...

There is a similarity in the sense that both try to generate samples with a
density proportional to a given distribution: Metropolis generally in the
continuous case, the alias method for sampling discrete distributions.

Apart from that, I don't see a very strong connection between the techniques.
Metropolis is for random walks, while the alias method is for direct sampling.

The very clever observation behind the alias method is that when you have N
items in your discrete distribution, you can sample from a uniformly
discretized set of buckets, each as wide as the average weight, and then
split up any larger-than-average weights and distribute the excess into
buckets holding smaller-than-average weights. It's possible to arrange
things so that no bucket ever holds more than 2 items, which lets you store
the whole data structure in an array of size 2N and gives you O(1) sampling
from your discrete weights.
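As a rough illustration of that bucket-splitting idea, here's a minimal
Python sketch of the construction (often attributed to Vose); the function
names and example weights are just for illustration, not from any particular
library:

```python
import random

def build_alias_table(weights):
    """Alias method: redistribute larger-than-average weights into
    buckets with smaller-than-average weights, so that each bucket
    ends up holding at most 2 items."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # average weight becomes 1.0
    prob = [0.0] * n   # chance of keeping bucket i's own item
    alias = [0] * n    # the "other" item sharing bucket i
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]           # donate mass to fill bucket s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                    # leftovers are exactly average
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """O(1) draw: pick a bucket uniformly, then one of its <= 2 items."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```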

Metropolis is more about taking a walk through a probability field, so each
sample is in some sense a result of the history of the previous samples, and
it spends more time in local areas where probabilities are high.

With Metropolis, you have to make sure you don't get stuck in local maxima,
and that your sampler can break through low-probability regions of your
distribution. With the alias method, you get independent samples, so there's
no risk of missing part of the space. Both methods can accept quasi-random
or low-discrepancy inputs, but both will see limited benefit from them,
unlike inverse transform sampling.
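For contrast, here's a minimal random-walk Metropolis sketch; the Gaussian
proposal, step size, and log-density interface are illustrative assumptions,
not a prescription:

```python
import math
import random

def metropolis_walk(log_target, x0, n_steps, step=1.0, rng=random):
    """Random-walk Metropolis with a symmetric Gaussian proposal.
    Unlike alias sampling, each draw depends on the previous one,
    so the walk lingers where probability is high."""
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        # Symmetric proposal, so accept with probability min(1, p(y)/p(x)),
        # done in log space for numerical stability.
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples
```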

------
mlthoughts2018
I feel like any explanation of Metropolis-Hastings that doesn’t explain
detailed balance and why it matters is so oversimplified as to be not only
useless in practice but dangerous: it gives mathematical lay persons a very
false confidence that they understand it, or that they could take the
relatively simple code of the algorithm in the symmetric-proposal case and
plug it into a real use case, without understanding autocorrelation,
thinning, burn-in, convergence, or multi-chain methods (all of which are
simply table stakes for actually using MCMC for any real purpose).
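To make the detailed balance point concrete, here's a tiny sketch (the
3-state target is a made-up example) checking that the Metropolis acceptance
rule satisfies pi(x) P(x->y) = pi(y) P(y->x), which is what makes pi the
stationary distribution of the chain:

```python
import itertools

# A made-up 3-state target distribution, just to check detailed balance
# for the Metropolis transition kernel.
pi = [0.2, 0.3, 0.5]

def transition_prob(x, y):
    """Probability that the chain at state x moves to state y (x != y),
    under a uniform (hence symmetric) proposal over the other two states."""
    propose = 1.0 / 2.0
    accept = min(1.0, pi[y] / pi[x])   # Metropolis rule, symmetric proposal
    return propose * accept

for x, y in itertools.permutations(range(3), 2):
    flow_xy = pi[x] * transition_prob(x, y)   # probability flow x -> y
    flow_yx = pi[y] * transition_prob(y, x)   # probability flow y -> x
    assert abs(flow_xy - flow_yx) < 1e-12     # detailed balance holds
```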

~~~
dahart
Just for friendly argument’s sake, allow me to disagree a little bit, and
explain why. It’s a shame if we can’t allow people to learn and share
mathematical intuition in small pieces. I like that the author saw the
boundary behavior and _thought_ about it, and questioned why it actually
works.
The intuitive argument presented does take one step in the direction of
detailed balance, it just doesn’t cover the continuous case. (And I’m not sure
- does detailed balance actually apply to the boundary? The process doesn’t
run in reverse for areas where your probability field goes to zero, right?)

I don’t necessarily think that an article on math should cover everything that
is known about a subject. It’s okay to write for an audience that doesn’t know
anything about autocorrelation and multi-chain methods. The need to
elaborate so thoroughly on every possible prerequisite, application, and
corner case is one of the reasons so many people dislike reading math on
Wikipedia ... it’s unapproachable unless you already know everything about
it. It’s becoming a reference for experts only, not something a student can
use to learn from.

Anyway, what’s the true danger of a simple or incomplete understanding? It’s
not that likely to lead to people putting the wrong thing into nuclear
reactors; isn’t it more likely to lead to someone getting the wrong answer
in a weekend software project and then spending Sunday learning a little
more about MCMC methods?

~~~
mlthoughts2018
> “Anyway, what’s the true danger of a simple or incomplete understanding?”

In practice, the danger is that someone not trained in statistics will
copy/paste the “simple” approach, generate poor chains of samples, and base
a seriously incorrect MCMC calculation on their misunderstood application.

If it’s clear this is just for teaching, then sure the risk is less. But it’s
not usually clear.

