
Variational Inference for Machine Learning [pdf] - alex_hirner
http://shakirm.com/papers/VITutorial.pdf
======
marmaduke
Stan and PyMC3 both implement automatic differentiation based variational
inference, so you can write down your statistical model and not care "much"
about derivatives.

[http://mc-stan.org](http://mc-stan.org)
[https://github.com/pymc-devs/pymc3](https://github.com/pymc-devs/pymc3)
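
For example, a minimal PyMC3 sketch (model and data invented here purely for
illustration) where ADVI fits the model with no hand-written gradients:

```python
import numpy as np
import pymc3 as pm

# Toy data: noisy observations of an unknown mean (made up for illustration).
data = 2.0 + np.random.randn(100)

with pm.Model():
    mu = pm.Normal('mu', mu=0.0, sd=10.0)
    sigma = pm.HalfNormal('sigma', sd=1.0)
    pm.Normal('obs', mu=mu, sd=sigma, observed=data)

    # ADVI: the variational objective and its gradients come from
    # automatic differentiation; no derivatives are written by hand.
    approx = pm.fit(n=10000, method='advi')
    trace = approx.sample(1000)

print(trace['mu'].mean(), trace['sigma'].mean())
```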

~~~
AlexCoventry
There's also the TensorFlow-based Edward:
[https://github.com/blei-lab/edward](https://github.com/blei-lab/edward)

~~~
proditus
stan and edward dev here. happy to answer any questions.

(shakir's blog posts are amazing; i recommend them all.)

~~~
murbard2
Very cool, many questions:

1) Why create a project distinct from Stan? Was it the prospect of benefiting
from all the work going into TF and focusing solely on the sampling procedures
rather than autodiff or GPU integration?

2) Are you implementing NUTS?

3) Any plans to implement parallel tempering?

4) Any plans to handle "tall" data using stochastic estimates of the
likelihood?

~~~
proditus
great questions.

1) you touch upon the right strengths of TF; that was certainly one
consideration. edward is designed to address two goals that complement stan.
the first is to be a platform for inference research: as such, edward is
primarily a tool for machine learning researchers. the second is to support a
wider class of models than stan (at the cost of not offering a "works out of
the box" solution).

our recent whitepaper explains these goals in a bit more detail:

[https://arxiv.org/pdf/1610.09787.pdf](https://arxiv.org/pdf/1610.09787.pdf)

2) no immediate plans. but we have HMC and are looking for volunteers :)

3) same answer as above :) should be relatively easy to implement tempering.

4) this is already in the works! stay tuned!

~~~
murbard2
4) Which approach are you using? The Generalized Poisson Estimator, or
estimating the convexity effect of the exponential by looking at the sample
variance of the log-likelihood? The former is purer; the latter may be more
practical, if ugly.

~~~
proditus
these are great insights.

our first approach is the simplest: stochastic variational inference. consider
a likelihood that factorizes over datapoints. stochastic variational inference
then computes stochastic gradients of the variational objective function at
each iteration by subsampling a "minibatch" of data at random.

i reckon the techniques you suggest would work as we move forward!
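
for concreteness, here's the standard objective and its minibatch estimate
(generic SVI notation; not tied to edward's API):

```latex
% ELBO for data x_1,...,x_N with latent variables z and approximation q_\lambda(z):
\mathcal{L}(\lambda) =
    \mathbb{E}_{q_\lambda}[\log p(z)] - \mathbb{E}_{q_\lambda}[\log q_\lambda(z)]
    + \sum_{n=1}^{N} \mathbb{E}_{q_\lambda}[\log p(x_n \mid z)]

% minibatch estimate: subsample S \subset \{1,\dots,N\} and rescale; this is
% unbiased for the objective (and for its gradient under autodiff):
\widehat{\mathcal{L}}(\lambda) =
    \mathbb{E}_{q_\lambda}[\log p(z)] - \mathbb{E}_{q_\lambda}[\log q_\lambda(z)]
    + \frac{N}{|S|} \sum_{n \in S} \mathbb{E}_{q_\lambda}[\log p(x_n \mid z)]
```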

~~~
murbard2
Edit: ah never mind, variational inference, got it! I was thinking stochastic
HMC!

---

Ok but that will get an unbiased estimate of the _log-likelihood_. MCMC or HMC
do work with noisy estimators, but they require unbiased estimates of the
_likelihood_.

At the very least, you need to do a convexity adjustment by measuring the
variance inside your minibatch. Or you can use the Poisson technique, which
will get you unbiased estimates of exp(x) from unbiased estimates of x (albeit
at the cost of introducing a lot of variance).
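
For reference, the standard form of that Poisson estimator (a textbook
construction, not necessarily the exact variant meant above):

```latex
% Given i.i.d. unbiased estimates \hat{x}_1, \hat{x}_2, ... of x and a tuning
% constant \lambda > 0, draw N ~ Poisson(\lambda) and return:
\widehat{E} = e^{\lambda} \prod_{i=1}^{N} \frac{\hat{x}_i}{\lambda}

% Unbiasedness follows from the Poisson generating function:
\mathbb{E}[\widehat{E}]
  = e^{\lambda}\,\mathbb{E}\!\left[(x/\lambda)^{N}\right]
  = e^{\lambda}\, e^{\lambda(x/\lambda - 1)}
  = e^{x}
% The price is variance, which can blow up when \lambda is small relative to x.
```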

~~~
proditus
great points; yes, the problem becomes considerably more challenging with
MCMC!

------
coherentpony
> Many samples needed, especially in high dimensions

This isn't true. For Monte Carlo sampling, the convergence rate of unbiased
estimators (for example, of an expectation) is independent of the dimension of
the state space. In fact, this is exactly the reason to _prefer_ Monte Carlo
integration over, say, a Riemann sum.
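
For reference, the textbook rates behind this comparison (stated here for a
Lipschitz integrand on the unit cube; not from the slides):

```latex
% Monte Carlo with n i.i.d. samples: RMSE is dimension-free in the rate
\mathbb{E}\!\left[(\hat{I}_n - I)^2\right]^{1/2} = \frac{\sigma}{\sqrt{n}}

% Grid (Riemann-type) rule with n points on [0,1]^d, Lipschitz integrand:
|\hat{I}_n^{\text{grid}} - I| = O\!\left(n^{-1/d}\right)
```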

~~~
murbard2
That's true asymptotically, but the dimension finds a way to rear its head in
the constant of the big O through the variance of your estimator. It's still
better than Riemann integration, but it sucks nonetheless.

For instance, integrate the function f(x1,x2,...,xd) = 6^d * x1 * (1-x1) * ...
* xd * (1-xd) on the d-dimensional unit cube (the answer is 1, for all d). The
variance of a single-point estimator is (6/5)^d - 1, which increases
exponentially in the number of dimensions. That's the multiplicative constant
in your big O.
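
A quick numpy check of that blow-up (purely illustrative; the formula above is
exact, this just estimates it empirically):

```python
import numpy as np

def single_point_variance(d, n=200000, seed=0):
    """Empirical variance of a one-sample MC estimator of
    f(x) = 6^d * prod_i x_i*(1-x_i) over the unit cube in d dimensions."""
    rng = np.random.RandomState(seed)
    x = rng.rand(n, d)
    f = 6.0 ** d * np.prod(x * (1.0 - x), axis=1)
    return f.var()

for d in (1, 2, 5, 10, 20):
    print(d, single_point_variance(d), (6.0 / 5.0) ** d - 1.0)
# The empirical variance tracks (6/5)^d - 1; at larger d the estimate
# itself gets noisy, which is exactly the problem being described.
```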

For a probabilistic example let x_i ~ N(0,1) for i in 1..d and let s = Sum
x_i. Try estimating the probability that -0.1 < s < 0.1 by Monte-Carlo
sampling.

If most of the probability mass is located near a lower-dimensional manifold,
as tends to be the case for natural data, your variance will be huge.

MCMC and HMC both improve this state of affairs by letting you walk or "glide"
on the manifold, but you still have to contend with curvature, multiple modes,
etc.

~~~
zump
What course did you study to get this knowledge?

~~~
murbard2
AlexCoventry is right, these are "the mathematics of Bayesian inference". To
unpack my recommendation a bit:

- Get a strong grip on linear algebra and Euclidean spaces. You should have an
intuitive feel for the key theorems (e.g. Cayley-Hamilton, the spectral
theorem) and be comfortable proving them (at least once). The point isn't that
you need to check that they're correct (they are), but being able to prove
them means you've picked up the prerequisite knowledge and gained real
familiarity with the topic: they aren't just theorems you apply, they make
sense and you understand the intuition behind them.

- Get the same feel for multivariate calculus. The intuition there is
generally easier to acquire than for linear algebra, but you need to be
comfortable with the mechanics of it. Learn to prove your results rigorously,
but also to quickly derive formulas by treating infinitesimals as variables,
like a physicist.

- Study integration, measures, distributions and the fundamentals of
probability. If the course talks about sigma-algebras, you're in the right
place. Finally, study Bayesian statistics, Monte-Carlo integration, Markov-
Chain Monte-Carlo, and some information theory.

You don't _need_ that level of rigor or that level of fundamentals to do
machine learning. Some linear algebra, some calculus and some probability
theory that you pick up along the way will generally do. However, if you're
interested in ML, I think it's worth the effort because it will make most of
the math seamless. This is a lot of math to learn, but it's not particularly
"advanced" math. The underlying intuitions are relatively concrete and a lot
of the procedures are relatively mechanical.

~~~
zump
Thanks for the wonderful reply. How long do you estimate it took for you to
get to the level you are now?

~~~
murbard2
That's hard to answer: there wasn't a precise point where I started, it's not
the only thing I've studied, and I picked up a lot of it while working
full-time jobs that didn't really require this knowledge (constructive
procrastination ftw). I started getting interested in sequential Monte-Carlo
methods around 2008, while working at a hedge fund, and I think I had a pretty
solid grasp by 2011. But I started with solid math fundamentals, so picking it
up wasn't too troublesome.

I think a talented high-schooler could learn this topic in three to four years
by studying it (and nothing else) intensively. In general, though, it tends to
happen more organically: you become proficient along the way; it's more of a
lifelong thing. And every piece you learn will be valuable and useful on its
own.

