(shakir's blog posts are amazing; i recommend them all.)
1) Why create a project distinct from Stan? Was it the prospect of benefiting from all the work going into TF and focusing solely on the sampling procedures rather than autodiff or GPU integration?
2) Are you implementing NUTS?
3) Any plans to implement parallel tempering?
4) Any plans to handle "tall" data using stochastic estimates of the likelihood?
1. you touch upon the right strengths of TF; that was certainly one consideration. edward is designed to address two goals that complement stan. the first is to be a platform for inference research: as such, edward is primarily a tool for machine learning researchers. the second is to support a wider class of models than stan (at the cost of not offering a "works out of the box" solution).
our recent whitepaper explains these goals in a bit more detail:
2) no immediate plans. but we have HMC and are looking for volunteers :)
3) same answer as above :) should be relatively easy to implement tempering.
4) this is already in the works! stay tuned!
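to show why tempering is "relatively easy": the whole algorithm fits on a page. here's a minimal pure-python sketch (my own toy code, not edward's API — the target and all the settings are made up for illustration), sampling a bimodal 1-D target that a single cold chain would get stuck in:

```python
import math
import random

def parallel_tempering(log_prob, n_chains=4, n_steps=5000, step=0.5, seed=0):
    """toy parallel tempering: one random-walk metropolis chain per
    temperature, with occasional swaps between adjacent temperatures."""
    rng = random.Random(seed)
    temps = [2.0 ** k for k in range(n_chains)]          # 1, 2, 4, 8
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_chains)]
    cold_samples = []                                     # chain at T = 1
    for _ in range(n_steps):
        # within-chain metropolis update at each temperature;
        # hotter chains see a flattened target log_prob / T
        for i, T in enumerate(temps):
            prop = xs[i] + rng.gauss(0.0, step * math.sqrt(T))
            log_alpha = (log_prob(prop) - log_prob(xs[i])) / T
            if rng.random() < math.exp(min(0.0, log_alpha)):
                xs[i] = prop
        # propose swapping a random adjacent pair of temperatures
        i = rng.randrange(n_chains - 1)
        log_swap = (log_prob(xs[i + 1]) - log_prob(xs[i])) * (
            1.0 / temps[i] - 1.0 / temps[i + 1])
        if rng.random() < math.exp(min(0.0, log_swap)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
        cold_samples.append(xs[0])
    return cold_samples

def log_prob(x):
    # bimodal target: equal mixture of gaussians centered at -3 and +3
    return math.log(math.exp(-((x + 3.0) ** 2)) + math.exp(-((x - 3.0) ** 2)))

samples = parallel_tempering(log_prob)
```

the hot chains cross the low-probability barrier between the modes, and the swap moves carry those crossings down to the cold chain, so the T = 1 samples end up visiting both modes.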
My main gripe with Theano is that OpenCL support is near non-existent, but this is also the case with TF.
our first approach is the simplest: stochastic variational inference. consider a likelihood that factorizes over datapoints. stochastic variational inference then computes stochastic gradients of the variational objective function at each iteration by subsampling a "minibatch" of data at random.
i reckon the techniques you suggest would work as we move forward!
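the subsampling trick itself is tiny. a toy sketch (my own illustration with a made-up gaussian model, not edward's code): subsample a minibatch and rescale the gradient by N / |B|, which makes the stochastic gradient an unbiased estimate of the full-data gradient.

```python
import random

# full-data log-likelihood gradient wrt mu for y_i ~ N(mu, 1):
# d/dmu sum_i log p(y_i | mu) = sum_i (y_i - mu)
random.seed(0)
N = 10_000
data = [random.gauss(2.0, 1.0) for _ in range(N)]
mu = 0.0

def full_grad(mu):
    return sum(y - mu for y in data)

def minibatch_grad(mu, batch_size=100):
    # subsample uniformly, then rescale by N / |B|:
    # the result is unbiased for the full-data gradient
    batch = random.sample(data, batch_size)
    return (N / batch_size) * sum(y - mu for y in batch)

# averaging many minibatch gradients should approach the full gradient
avg = sum(minibatch_grad(mu) for _ in range(2000)) / 2000
```

in actual stochastic variational inference the same rescaled-minibatch estimate is plugged into the gradient of the variational objective, not the raw log-likelihood, but the unbiasedness argument is the one above.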
Ok, but that gives you an unbiased estimate of the log-likelihood. MCMC and HMC do work with noisy estimators, but they require unbiased estimates of the likelihood itself — and an unbiased estimate of log p(y) does not yield an unbiased estimate of p(y), since exp is nonlinear.
At the very least, you need to do a convexity adjustment by measuring the variance inside your minibatch. Or you can use the Poisson technique, which gets you unbiased estimates of exp(x) from unbiased estimates of x (albeit at the cost of introducing a lot of variance).
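For the curious, the Poisson technique is short enough to sketch (toy code of mine, names made up): draw N ~ Poisson(λ), multiply N independent unbiased estimates of x together, and correct by exp(λ) / λ^N. The expectation telescopes to exp(E[X]).

```python
import math
import random

def poisson_estimator(sample_x, lam=3.0, rng=random):
    """Unbiased estimate of exp(E[X]), given a sampler of independent
    unbiased estimates X of x.

    Draw N ~ Poisson(lam); return exp(lam) * prod_{i<=N} X_i / lam.
    E[...] = exp(lam) * E[(E[X]/lam)^N]
           = exp(lam) * exp(lam * (E[X]/lam - 1)) = exp(E[X]).
    """
    # draw N ~ Poisson(lam) by inversion (stdlib has no Poisson sampler)
    u, n = rng.random(), 0
    p = math.exp(-lam)
    c = p
    while u > c:
        n += 1
        p *= lam / n
        c += p
    est = math.exp(lam)
    for _ in range(n):
        est *= sample_x() / lam
    return est

random.seed(1)
x_true = 0.7
noisy_x = lambda: random.gauss(x_true, 0.5)   # unbiased but noisy estimates of x
estimates = [poisson_estimator(noisy_x) for _ in range(200_000)]
mean_est = sum(estimates) / len(estimates)    # close to exp(0.7)
```

Note the promised downside: individual estimates swing wildly (they can even be negative), so you need a lot of them before the average settles near exp(x).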
edward does not completely implement advi yet. the piece that is missing is the automated transformation of constrained latent variable spaces to the real coordinate space. however, edward offers much more flexibility in specifying the dependency structure of the posterior approximation. diagonal is, just like in stan, the fastest and easiest. however, introducing structure (e.g. assigning a dense normal to some of the latent variables while assigning a diagonal to others) is much easier in edward.
I also have a naïve question: why not use the graphical structure of the model itself to add structure to the covariance? For example, in an AR model, each time point places a prior on the next, so why not assume a banded covariance? More generally, one could use a cutoff on the shortest path length (through the model's graphical structure) between parameters to decide whether they should have nonzero covariance entries.
so I'll try to do my homework before asking more questions ;)
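To make the suggestion concrete, here is a toy sketch (my own code; `sparsity_mask` and the chain graph are invented for illustration): a BFS per parameter gives shortest-path distances through the model graph, and the cutoff decides which covariance entries are allowed to be nonzero. On a chain-structured (AR-like) graph, the result is exactly a banded pattern.

```python
from collections import deque

def sparsity_mask(adjacency, cutoff):
    """Allow a nonzero covariance entry (i, j) iff the shortest path
    between parameters i and j in the model graph is <= cutoff."""
    n = len(adjacency)
    mask = [[False] * n for _ in range(n)]
    for src in range(n):
        # BFS from src, truncated at depth `cutoff`
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == cutoff:
                continue
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for j in dist:
            mask[src][j] = True
    return mask

# AR(1)-style chain: parameter t only touches t-1 and t+1
n = 6
chain = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
mask = sparsity_mask(chain, cutoff=2)
# for a chain graph this yields a banded mask with bandwidth 2
```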
A big issue I ran into with stan even with advi was scaling to large datasets since it (and Eigen) are single threaded. Would Edward answer all my prayers?
When is Riemannian HMC going to arrive?
please take a look here: http://edwardlib.org/troubleshooting
edward should answer some of your prayers :) there's still some time until stan goes parallel/gpu, though there's lots of interest there.
riemannian hmc is likely just around the corner!
I've been working with both Stan & PyMC3 on some large datasets and will definitely try Edward on them.
give it a shot and let us know!
However, if Church lets you express non-differentiable models (e.g. models containing a Heaviside function), then those models will either fail or work poorly with HMC or ADVI (the variational inference algorithm Stan uses), because both assume that the gradient of the posterior can be computed and is informative about the posterior.
This isn't true. For Monte Carlo sampling, the convergence of unbiased estimators (for example the expectation) is independent of the dimension of the state space. In fact, this is exactly the reason to prefer Monte Carlo integration over, say, a Riemann sum.
For instance, integrate the function f(x1,x2,...,xd) = 6^d * x1(1-x1) * ... * xd(1-xd) over the d-dimensional unit cube (the answer is 1 for all d). The variance of a single-point estimator is (6/5)^d - 1, which grows exponentially with the number of dimensions. That's the multiplicative constant in your big O.
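You can check the numbers with a quick simulation (toy code, my own): the Monte Carlo estimate stays near 1 in every dimension, but the sample variance tracks (6/5)^d - 1 and blows up with d.

```python
import random

random.seed(0)

def mc_estimate(d, n=100_000, rng=random):
    """Monte Carlo estimate of the integral of
    f(x1..xd) = 6^d * x1(1-x1) * ... * xd(1-xd) over [0,1]^d.
    True value is 1 for every d; single-sample variance is (6/5)^d - 1."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        f = 1.0
        for _ in range(d):
            x = rng.random()
            f *= 6.0 * x * (1.0 - x)
        total += f
        total_sq += f * f
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, var

mean2, var2 = mc_estimate(2)     # variance near (6/5)^2 - 1 = 0.44
mean10, var10 = mc_estimate(10)  # variance near (6/5)^10 - 1 ~ 5.19
```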
For a probabilistic example, let x_i ~ N(0,1) for i in 1..d and let s = Sum x_i. Try estimating the probability that -0.1 < s < 0.1 by Monte Carlo sampling: as d grows, almost none of your samples land in the interval.
If most of the probability distribution mass is located near a manifold of lower dimensions, as tends to be the case for natural data, your variance will be huge.
MCMC and HMC both improve this state of affairs by letting you walk or "glide" along the manifold, but you still have to contend with curvature, multiple modes, etc.
- Get a strong grip on linear algebra and euclidean spaces. You should be able to have an intuitive feel for the key theorems (e.g. Cayley-Hamilton, the Spectral theorem) and should be comfortable proving them (at least once). The point isn't that you need to check that they're correct (they are), but if you can prove them, it means you've learned a certain amount of prerequisite knowledge and gained enough familiarity with the topic. They aren't just theorems you apply, they make sense and you understand the intuition behind them.
- Get the same feel for multivariate calculus. The intuition there is generally easier to acquire than for linear algebra, but you need to be comfortable with the mechanics of it. Learn to prove your results rigorously, but also to quickly derive formulas by treating infinitesimals as variables, like a physicist.
- Study integration, measures, distributions and the fundamentals of probability. If the course talks about sigma-algebras, you're in the right place. Finally, study Bayesian statistics, Monte Carlo integration, Markov chain Monte Carlo, and some information theory.
You don't need that level of rigor and that level of fundamentals to do machine learning. Some linear algebra, some calculus and some probability theory that you pick along the way will generally do. However, I think if you're interested in ML it is worth the effort because it will make most of the math seamless. This is a lot of math to learn, but it's not particularly "advanced" math. The underlying intuitions are relatively concrete and a lot of the procedures are relatively mechanical.
I think a talented high-schooler could learn this topic in three to four years by studying it (and nothing else) intensively. I think it tends to happen more organically in general. You become proficient along the way, it's more of a lifelong thing. I think every piece you'll learn will be valuable and useful on its own.
If you have an O(1) jump size, it'll take O(100^2) time for the MCMC to fully sample the support of this distribution. If you had a jump size of O(100), you'd be rejecting 99% of your jumps due to the narrowness in the x-direction.
This is hardly an uncommon scenario.
You can, of course, construct examples where MCMC will do worse than simple Monte Carlo integration, but these are uncommon. They mostly illustrate the difficulty of picking an appropriate jump size.
MC has no rejections and always samples the entire distribution. You simply don't need to worry about trajectories not going everywhere they should.
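A toy random-walk Metropolis run makes the tradeoff concrete (my own sketch; the "corridor" is a Gaussian with sd 1 in x and sd 100 in y):

```python
import math
import random

def metropolis(step, n=20_000, seed=0):
    """Random-walk Metropolis on an anisotropic Gaussian:
    sd 1 in x, sd 100 in y (a long, narrow corridor)."""
    rng = random.Random(seed)

    def log_p(x, y):
        return -0.5 * (x * x + (y / 100.0) ** 2)

    x = y = 0.0
    accepted = 0
    ys = []
    for _ in range(n):
        xp = x + rng.uniform(-step, step)
        yp = y + rng.uniform(-step, step)
        if rng.random() < math.exp(min(0.0, log_p(xp, yp) - log_p(x, y))):
            x, y = xp, yp
            accepted += 1
        ys.append(y)
    return accepted / n, ys

acc_small, ys_small = metropolis(step=1.0)    # accepts often, crawls along y
acc_big, ys_big = metropolis(step=100.0)      # covers y, but mostly rejected
```

With O(1) steps the chain accepts most proposals but diffuses slowly along the long y direction; with O(100) steps almost every proposal overshoots the narrow x direction and is rejected, which is exactly the dilemma described above.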
Parallel tempering should also be exploited more.
That said, at the end of the day, the ideal sampler would be able to reason about the distribution as program and not just as a mathematical black box. It should build tractable approximations intelligently and use those to bootstrap exact sampling schemes.
I think we're going to see a wealth of better samplers come out in the next decade, following the path of combinatorial optimization towards preserving the structure of the programs.
Aren't there methods for online sparse estimates of the Hessian?
I'd expect a lot of "large" problems for which RMC is useful would have sparse structures.
> ideal sampler would be able to reason about the distribution as program
Are you a developer of Stan? If not, you might be interested.
_edit_ by online estimate of a Hessian, I meant an online numerical approximation built from the sequence of Jacobians.