

Markov Chain Monte Carlo Without Jargon - diego898
http://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/

======
simonbyrne
A couple of problems with the author's definition:

> Let D be a distribution over a finite set X.

In statistics, the space is almost always continuous (i.e. non-finite),
typically some moderate-to-high-dimensional Euclidean space (or subset
thereof).

> You are given black-box access to the probability distribution function f(x)
> which outputs the probability of drawing x \in X according to D.

If you know the distribution function, you can do inverse transform sampling.
MCMC is typically used when the _density function_ is only known up to
proportionality (which seems to be what he actually means).
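
(As a toy illustration, here's a minimal Python sketch of inverse transform
sampling, assuming NumPy. The point is that it needs the actual CDF in an
invertible form, which is exactly what you usually don't have:)

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_exponential(rate, n):
        # Exponential CDF: F(x) = 1 - exp(-rate*x),
        # so F^{-1}(u) = -log(1 - u) / rate
        u = rng.uniform(size=n)
        return -np.log(1.0 - u) / rate

    samples = sample_exponential(rate=2.0, n=100_000)
    print(samples.mean())  # should be close to 1/rate = 0.5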

> Design an efficient randomized algorithm A which outputs an element of X so
> that the probability of outputting x is approximately f(x).

Now, here's the problem: MCMC doesn't actually give you this. What you get
instead is a sequence of random elements, which (given some minor conditions)
will converge to the target distribution.
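
(A toy illustration of that last point, in Python with NumPy: a two-state
Markov chain whose empirical state frequencies converge to its stationary
distribution, even though no single draw is "a sample from" that
distribution:)

    import numpy as np

    rng = np.random.default_rng(0)
    # Transition matrix; its stationary distribution is pi = [0.6, 0.4],
    # since pi P = pi.
    P = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
    state, counts = 0, np.zeros(2)
    for _ in range(100_000):
        state = rng.choice(2, p=P[state])
        counts[state] += 1
    print(counts / counts.sum())  # -> approximately [0.6, 0.4]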

------
markisus
The author draws a distinction between computing the expected value of a
random variable and computing the volume of a convex set, but from the point
of view of measure theory, both are computations of integrals. So maybe the
definition given in the encyclopedia wasn't so bad, given the right prior
knowledge.

~~~
mturmon
Well, come on. Lots of things are integrals. The Riemann zeta function is an
integral. But I don't think MCMC has anything to say about that.

There is no obvious "measure" in the convex set example, so it's not clear how
MCMC applies. And there are no good general MCMC convergence results, so even
if you apply it, it's not obvious you'd be able to prove anything.

If you think of the set as being defined by an oracle (a function that reports
whether its input is inside or out), then it looks like it will require
exponentially many probes (in the dimensionality "d" of the space) to get its
volume. Basically, a curse of dimensionality argument -- you have to sample
independently along each of "d" axes. But it turns out there is an MCMC that
is polynomial-time. (Now, granted, IIRC it is like n^17, but...)
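
A quick numerical sketch of that curse-of-dimensionality point (Python with
NumPy, counting how often independent uniform probes in the box [-1,1]^d land
inside the unit ball):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    for d in (2, 5, 10, 20):
        pts = rng.uniform(-1.0, 1.0, size=(n, d))
        hits = np.sum(np.sum(pts**2, axis=1) <= 1.0)
        print(d, hits / n)  # at d=20 the true fraction is ~2.5e-8

The hit rate collapses so fast that by d=20 you typically see no hits at all.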

~~~
markisus
The Lebesgue measure could be a candidate for the measure you are asking for.
Creating a bounding box around the convex set in question and then scaling, we
get a probability measure (the measure of the whole box is 1). Some sorts of
random walks inside this bounding box will converge to this scaled Lebesgue
measure. I think this is the main idea behind the convex volume estimation
algorithms.

The oracle you mention is the indicator function of the convex set. Instead of
integrating this indicator function directly over Lebesgue measure to get
volume, we can use MCMC to get the "expectation" of the oracle.
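
Here's a toy sketch of that idea (Python with NumPy): a Metropolis random
walk whose target is the uniform distribution on the bounding box, touching
the convex set only through the membership oracle. To be clear, this only
illustrates the indicator-expectation idea; it is not the clever
polynomial-time algorithm:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3

    def oracle(x):
        # membership oracle: here, the unit ball in R^d
        return np.dot(x, x) <= 1.0

    x = np.zeros(d)
    inside = 0
    n = 200_000
    for _ in range(n):
        prop = x + 0.5 * rng.standard_normal(d)
        if np.all(np.abs(prop) <= 1.0):  # accept iff still in the box
            x = prop
        inside += oracle(x)
    vol_box = 2.0 ** d
    print(inside / n * vol_box)  # ~ vol of unit ball in R^3 = 4*pi/3 = 4.19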

It's not obvious whether we can realize the Riemann zeta function as an
integral over a finite measure space, but if we could, I think MCMC could be
used to evaluate it.

~~~
mturmon
Yes, what you just wrote is evident. I'm saying, in response to your assertion
above, that it is _not trivial_ to derive a useful result about approximating
the volume of convex sets with MCMC, and _not at all obvious_ that anything
smarter than independent probes will be useful. I happened to read the
original 1991 paper on this topic during my PhD, so I'm still vaguely familiar
with it.

To clarify: It's clear that you can approximate a convex volume with
exponentially-many independent (random or deterministic) probes. That's
therefore not an interesting problem.

Surprisingly: there is NO deterministic algorithm that can solve this problem
in polynomial time. But, surprise #2: a randomized algorithm using MCMC can.
Pretty cool.

So, this is the relevance of MCMC to the convex volume problem. Used cleverly,
it can take you from exponentially-many probes to polynomial.

And in particular: You talk about "creating a bounding box" for the reference
Lebesgue measure -- saying that this first step is easy is basically assuming
the whole problem is solved. Because: How do you create a bounding box around
a convex set in R^d, with polynomially-many probes, so that the error is
bounded? You can do it with exponentially-many probes (just go out along each
orthogonal axis). But that's not interesting -- if you have that many probes,
forget about MCMC, just use a deterministic algorithm.

Here's a short summary:
[http://www.cs.berkeley.edu/~sinclair/cs294/n16.pdf](http://www.cs.berkeley.edu/~sinclair/cs294/n16.pdf)

------
xioxox
I'm a scientist who uses MCMC to sample the parameter space of a model and
dataset, to get some idea of the uncertainties on the model's parameters. These
parameters are all continuous (albeit at the finite resolution of floating
point numbers). Can someone from a CS background explain how this set approach
connects to my idea of MCMC?

For any scientists, I'd heartily recommend emcee [1] as a sampler. It uses the
Goodman-Weare affine-invariant ensemble sampler instead of plain Metropolis-
Hastings, which works a lot better in practice, without having to figure out
proposal distributions and so on.

[1] [http://dan.iel.fm/emcee/current/](http://dan.iel.fm/emcee/current/)
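
For a flavor, basic usage with emcee 3.x looks roughly like this (a toy
Gaussian log-probability; check the docs for details):

    import numpy as np
    import emcee

    def log_prob(theta):
        # toy target: standard normal in each dimension, known only
        # up to an additive constant in log space
        return -0.5 * np.sum(theta**2)

    ndim, nwalkers = 5, 32
    p0 = np.random.randn(nwalkers, ndim)  # initial walker positions
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
    sampler.run_mcmc(p0, 5000)
    samples = sampler.get_chain(discard=500, flat=True)
    print(samples.mean(axis=0), samples.std(axis=0))  # ~0 and ~1 per dim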

~~~
dimatura
This is something addressed by probability theory, not CS. While MCMC is
easier to explain and understand for discrete spaces, it can also be used for
continuous spaces. As a non-mathematician I just think about this as
"discretizing at an infinitely high resolution", but of course to an actual
mathematician/statistician this is nonsense.

------
moultano
MCMC is waaay more general than Metropolis-Hastings, which is what all that
"bullshit" is about.

------
stoched
Let me make this easier:

Let's plug a bunch of numbers into a function and average the outputs. That
average approximates the integral of the function we plugged the numbers into
(scaled by the size of the region we drew them from). That's the Monte Carlo
part.

To get the Markov chain part, we just randomly produce each new number based
on the one we just used, and plug that back into the function.
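
Here's that whole recipe as a minimal random-walk Metropolis sketch in Python
(assuming NumPy; note the target density only needs to be known up to a
constant):

    import numpy as np

    rng = np.random.default_rng(0)

    def p_unnorm(x):
        # target density known only up to proportionality:
        # a standard normal missing its normalizing constant
        return np.exp(-0.5 * x * x)

    x = 0.0
    samples = []
    for _ in range(100_000):
        prop = x + rng.normal()  # new number based on the last one
        if rng.uniform() < p_unnorm(prop) / p_unnorm(x):
            x = prop             # accept the move...
        samples.append(x)        # ...but keep the current state either way

    samples = np.asarray(samples)
    # Monte Carlo part: averages over the chain approximate expectations
    print(samples.mean(), (samples**2).mean())  # ~0 and ~1 for N(0,1)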

------
_betty_
That post was still quite full of equations my brain didn't feel like
processing at this time of night.

Used some Google and found a Stack Exchange answer that I thought was clearer:

[https://stats.stackexchange.com/questions/165/how-would-you-...](https://stats.stackexchange.com/questions/165/how-would-you-explain-markov-chain-monte-carlo-mcmc-to-a-layperson/207#207)

------
partition
Ok, instead of reading a Wikipedia page or two, or the most elementary chapter
in a probability/statistics textbook and gaining a bunch more skills (among
them the tools needed to reason about whether your MCMC is doing anything
close to reproducing the original distribution), I can peruse a huge blog post
filled with patronizing language and it doesn't even have the Metropolis-
Hastings algorithm listed out even in pseudocode. Got it, this is what
progress looks like.

~~~
siddboots
I feel as though your comment comes very close to "gratuitous negativity" of
the type mentioned recently on the YC blog.

[http://blog.ycombinator.com/new-hacker-news-guideline](http://blog.ycombinator.com/new-hacker-news-guideline)

Clearly, a blog post primer is not an alternative to an encyclopedia article
or a textbook chapter. Rather, it provides something not fulfilled by either:
a brief introduction to a technical subject that relies on very little prior
knowledge.

I am sure that Jeremy is well aware of the challenges involved in writing in a
style that is straightforward without being "patronizing". Perhaps you don't
like the balance that he has found, but is that a good enough reason to
outright dismiss his efforts?

Lastly, much of the post is in fact a plain-language explanation of the
Metropolis-Hastings algorithm, and he adds: "... if demand is popular enough,
I could implement the Metropolis-Hastings algorithm in code".

~~~
optimusclimb
I was really surprised to view the comments and find the attitude expressed
in the GP's post sitting at #1.

If anything, I read this and thought something along the lines of, "I wish
some of the subjects I learned in college were explained this way, at least to
start."

Now, I fully appreciate understanding fundamentals and not black-boxing
things (see the ongoing framework debate posts); however, I didn't read that
article looking to be spoon-fed a black box for ML or anything... I was just
bored and curious. So it made for good reading.

Perhaps I'm generalizing too far, but it reminded me of when I was presented
with the classic Lamport paper on "Time, Clocks, and the Ordering of Events in
a Distributed System" in undergrad. This would have been around 2004. I had
been working (i.e. coding for money) for some time by then, but back then, you
could still build a lot with a relatively stock LAMP stack (or similar.)
MongoDB, Riak, Redis, Cassandra, DynamoDB, etc. were not all commonplace (much
less things you could have a cluster ready to play with by means of a simple
'vagrant up'.) So when I was given the paper, without much context, I had a
hard time really "getting it", or seeing how it applied to me. I'm pretty sure
I eventually got some commentary on it from the professor in class as he
hurriedly went through PowerPoint slides, probably anxious to get back to his
research. Then a test question or two about it later.

Years later, after working more, and "distributed systems" becoming more of a
tangible thing, I revisited it, and it made more sense. Anyway, my point is
that had I gotten some sort of "and here's what the paper is talking about,
and how it applies to something tangible, in plainspeak", it would have been
useful - so we shouldn't groan about an article like this. Someone out there
is going to understand the subject much better thanks to that post.

[Note to the reader who's very familiar with the vector clocks paper - I
realize it's not written with that much jargon, and DOES do a decent job in
the introduction section of using plain language, but the purpose was just
lost on me. I mean, it makes reference to the ARPA Net, and then goes on to
determining event ordering. It just didn't equate to anything in my world
view, and I had done a decent bit of multi-threaded/process programming by
then too.]

~~~
moultano
It would get a better reception without the intro paragraph and if it were
titled "an introduction to Metropolis-Hastings." The post really doesn't even
give you a taste of the most common MCMC algorithms, and what they are useful
for.

For instance, sampling from complicated Bayesian models and Gibbs sampling
are the two most important things to introduce to roughly cover MCMC, and
those are in the paragraph he calls "bullshit" because he doesn't know the
words.
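
For anyone curious, a Gibbs sampler can be tiny. A toy Python sketch for a
bivariate normal with correlation rho, where both conditionals are themselves
normal (not an example from the post):

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8  # correlation of the bivariate normal target
    x, y = 0.0, 0.0
    xs, ys = [], []
    for _ in range(50_000):
        # conditional of x given y is N(rho*y, 1 - rho^2), and symmetrically
        x = rng.normal(rho * y, np.sqrt(1.0 - rho**2))
        y = rng.normal(rho * x, np.sqrt(1.0 - rho**2))
        xs.append(x)
        ys.append(y)
    print(np.corrcoef(xs, ys)[0, 1])  # ~0.8, recovering the target correlation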

