* Problems with few parameters and/or few data (I was going to say toy problems, but there are sometimes important and interesting problems of this nature)
* Problems where the generative model is so complicated that you have no hope of doing any better than this and turn to it as a last resort. A bit like in combinatorial optimisation where you just say "gah, let's throw it at a SAT solver!".
(Perhaps that's not a bad analogy actually. If they can get to the point where SAT solvers are now, that would actually not be a bad proposition.)
* In particular, problems where the generative model is complicated but the complicated part of it is largely deterministic -- perhaps some kind of non-linear inverse problem where there's some simple additive observation noise tacked onto the end, for example.
What I do worry about is the suggestion that people can just start building fiendishly complicated hierarchical Bayesian models using these things and get valid, useful, robust, interpretable inferences from them without much in the way of statistical training. I suspect even a lot of statisticians would be a bit scared of this sort of thing. Make sure you really read up on things like model checking and sensitivity analysis, that you know something about the trade-offs of different model structures and priors, etc. And that's before you start to worry about the frequentist properties and failure cases of any approximate inference procedure which is magically derived for you.
Statisticians tend to favour simpler parsimonious models, not only for computational convenience but because it's easier to reason about them, understand and check their assumptions, understand their failure cases and so on.
I wish these guys lots of luck though, it is a really interesting area and the computer scientist in me really wants them to succeed!
The Geman and Geman paper from 1984 was directed at general procedures for inference using sampling, as was the work of the Brown group (Grenander et al.) throughout the 80s and 90s. BUGS (http://www.mrc-bsu.cam.ac.uk/software/bugs/) was specifically directed at that goal, and has been highly successful in real applications since the mid-1990s. Other more recent software, like Stan (http://mc-stan.org) is also targeted at this approach.
Looking at this from a different angle, the published work of many ML researchers has been directed at unifying common models with the clear view of producing software. I'm thinking about Michael Jordan's "plate" notation (http://en.wikipedia.org/wiki/Plate_notation) and the various unifying reviews of multi-component models and inference algorithms for time series (e.g., http://dl.acm.org/citation.cfm?id=309396), and the corresponding reviews for the EM algorithm done by Meng, van Dyk, and others (e.g., http://www.stat.harvard.edu/Faculty_Content/meng/JCGS01.pdf).
I think there's a lot of value (and a lot of buy-in) for probabilistic modelling languages, which impose some constraints in order to get efficient inference, which help you meet those constraints and don't provide a false promise of generality. And tonnes of value in research which looks for nice unifying formalisms to enable this sort of thing. Also in nice formalisms for inference in general, so perhaps that will be a useful side-effect of this work.
It's the idea of full-blown probabilistic programming, where you have an unconstrained, Turing-complete, non-deterministic programming language and inference "just works", where I've seen a bit more in the way of healthy skepticism. Of the "this will be nice if it ever arrives, but in the meantime I'll be getting on with doing statistics, which it doesn't obviate the need for" variety.
Similar to a computer scientist and the ultimate "sufficiently smart" compiler I suppose.
Some probabilistic standardisation would be lapped up by the applied community.
MCMC is the most general approach, but also the most inefficient, so it's got limited appeal for crunching live numbers coming out of a physical system. Some problems are intractable and it's the only way, but really you want to limit the application of MCMC as much as possible, so if you use Church you now have an interoperability problem with binding (LISP) to some ugly but efficient FORTRAN system or some such.
Exactly why I think this initiative will be highly welcome!
Generally to make inference for these kinds of models fast, you have to make it approximate.
MCMC has the nice property that, while it's an approximate method, it becomes exact in the limit of infinite samples. Meaning that when applying it in a general setting, the compiler doesn't have to make hard and final decisions about where and how to introduce approximations -- it produces MCMC code and you decide how long you can be bothered to wait for it to converge and how accurate you want the answer to be, based on various diagnostics.
Most other non-MCMC-based approximate inference methods (MAP, EM, Variational Bayes, EP, various hybrids of these and MCMC...) don't converge to the exact answer, they converge to an approximate answer which remains inexact no matter how long you run the algorithm for.
Different approximations have different strengths and weaknesses, and the best choice may depend on the model, the data, what you actually want to do with the posterior (a mode, a mean, an expected loss, predictive means, extreme quantiles?), and what frequentist properties you want for the results of the inference, especially given that this is no longer pure Bayesian inference but some messy approximation to it. Often the only practical way to decide will be to try a bunch of different things and see which does best on the application (or best predicts held-out data), even armed with full human intuition. A compiler doesn't really have a chance here.
In short, it's not going to be easy to automate fully, because it's not something you can decide on formal grounds alone. You have to decide how and where you're willing to approximate, and you have to understand the approximations used and check the resulting approximated posterior against data in order to know whether to trust the results and how to interpret them.
I'd prefer to decouple those things though, so I can benefit from the first and the second without waiting for a "sufficiently smart compiler" to arrive which can automate the whole shebang well enough to rely on in a fully general programming language setting. That was what I was expressing skepticism about, although I'd be very happy to be proven wrong.
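That "exact in the limit of infinite samples" property of MCMC is easy to see in a minimal Metropolis sampler. This is an illustrative sketch (target density, step size, and chain length all chosen arbitrarily), not production MCMC code:

```python
import math
import random

# Minimal Metropolis sampler for a standard normal target.
# The longer you run it, the better the approximation; no fixed
# approximation is baked in ahead of time.

def log_target(x):
    return -0.5 * x * x  # unnormalised log density of N(0, 1)

def metropolis(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_target(proposal) - log_target(x))):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(50_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With more samples, the empirical mean and variance drift toward the exact values (0 and 1); the only decision left to the user is how long to wait.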
Also, doesn't it strike you as strange that we are building simulations of stochastic phenomena (MCMC) using deterministically behaving components? What if we used stochastically behaving components as building blocks to begin with?
Most electronic components start to behave more or less stochastically when you try to run them with as little power as possible, and try to scale them down as small as possible. What if you could build an MCMC simulator for the problem of your choice directly from stochastically behaving components? Just think of all the transistors you nowadays use just to generate a random number for simulating a 90% probability of something... they are all components that are manufactured to such tolerances, and run at such power levels, that their probability of error is something on the order of 1e-24 (for one computation). Doesn't that strike you as humongous overkill, when all you needed was something like "1, most of the time"?
For more related cool stuff, google for imprecise hardware.
Y ~ dbin(0.5, 8) # Y is Binom(8, 0.5) - 8 trials, pSuccess = 0.5
P2 <- step(2.5 - Y) # does Y = 2, 1 or 0?
node mean sd MC error 2.5% median 97.5% start sample
P2 0.1448 0.3519 0.003317 0.0 0.0 1.0 1 10000
Y 4.004 1.417 0.01291 1.0 4.0 7.0 1 10000
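For a model this simple, the Monte Carlo output above can be checked against the exact answer; a quick sketch in Python:

```python
from math import comb

# Exact check of the Monte Carlo output above:
# Y ~ Binomial(8, 0.5), and P2 = step(2.5 - Y) is the event Y <= 2.
p2 = sum(comb(8, k) for k in range(3)) / 2 ** 8  # P(Y <= 2) = 37/256
mean_y = 8 * 0.5                                 # E[Y] = n * p

print(round(p2, 4))  # 0.1445, close to the MC estimate 0.1448
print(mean_y)        # 4.0, matching the sampled mean 4.004
```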
The big deal is being able to run the program backwards. I.e., you already have some set of output, obtained from some other source, and you want to figure out how that output could have been produced (i.e., how the source works). So, you write a probabilistic program that encodes your various hypotheses about how the source process works, and then run it backwards to figure out what the forward execution path would have had to have been to produce that output, and thus which hypothesis is correct. That's what "inference" refers to.
Inference is a basic process in scientific investigation: come up with a bunch of hypotheses, do an experiment that generates data through physical processes independently of whatever you hypothesized, and then go backwards to figure out which of your hypotheses would have had to have been true in order to get the experimental output that you did. Which hypothesis is true corresponds to a particular execution path through a probabilistic program, so having efficient implementations of probabilistic languages means we could do much better at accurate analysis of data, which is Good For Science.
What does a compiler-level implementation of probabilistic analysis have to do with accuracy?
Real problems in the wild are a mix of both types of problem: an intractable part and a part you can do exactly.
In an ideal world you would use the iterative methods ONLY for the parts that don't have an exact solution available. In practice once you have developed the general iterative solution, you might as well just use that for the easy parts too, as you are just burdening yourself with more development work and more scope for bugs by writing two inference implementations.
A language that allowed inference to be switched at no development cost would be amazing.
Both of those result in the potential for greater confidence in the accuracy of conclusions reached from experimental analysis.
As a secondary point, re-implementing inference code for each new model makes it almost certain that there are bugs in said code. So even without changing the model, automatically generated inference code is likely to have fewer bugs and thus give more accurate inferences than hand-written code. (assuming it runs to convergence; naturally there are lots of scenarios in which naively generated code will be slower to converge than something hand-tuned).
(1) Even very simple linear models in effect encode uncountably many hypotheses (i.e., y = a x + b where a and b are real numbers). You can't just pull (a,b) out of a hat until the agreement looks good, because the hat is too big.
One tool is, obviously, analytic solutions (where possible) like least squares. Beyond that, iterative methods like the EM algorithm. Beyond that, sampling-based methods like MCMC. You can view MCMC as your idea ("generate a bunch of hypotheses and test them"), but with the change that it does not ignore the old hypotheses when generating a new one to test, it uses an old hypothesis to seed a new one, using a probabilistic update rule.
In the presence of a model, you can basically compile (in the CS sense) an inference program that deduces model parameters from input data. The project to do this has been worked on, in fits and starts, since the late 1940s, by a series of geniuses/visionaries.
(2) Once you have a multi-level model with many parameters, trading off what you gain by tweaking one parameter versus another becomes hard. Suppose the data is
(x1,y1), ..., (xN,yN)
and two candidate models predict
M1: (x1,Z1_1), ..., (xN,Z1_N), i.e. Z1_i = M1(x_i), and
M2: (x1,Z2_1), ..., (xN,Z2_N), i.e. Z2_i = M2(x_i).
Which set of Z's agrees better with the y's? There is no single obvious metric. So even if you were content to keep generating hypotheses and checking agreement, you wouldn't know how to measure agreement.
Really grossly impressively inefficient.
Let's say you want to determine if a coin is fair, but you don't know any statistics. What you can do is write a program (model) that represents the fair coin flip process. It might be as simple as:
import random
flips = [random.choice([True, False]) for _ in range(10)]
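To finish the thought with a hedged sketch (the observed count of 9 heads is made up for illustration): run the model forward many times and see how often the fair-coin process produces data as extreme as what you observed:

```python
import random

random.seed(0)

def flip_fair_coin(n=10):
    # The fair-coin model from above, run forward.
    return [random.choice([True, False]) for _ in range(n)]

# Simulate the model many times and record the number of heads each time.
counts = [sum(flip_fair_coin()) for _ in range(100_000)]

# Suppose we actually observed 9 heads in 10 flips: how often does
# the fair-coin model produce something at least that extreme?
observed = 9
p_at_least = sum(c >= observed for c in counts) / len(counts)
```

If `p_at_least` is tiny (the exact value here is 11/1024, about 1%), the fair-coin model looks like a poor explanation of the observed flips.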
I am speaking from what I gathered reading the article and the other comments. Seems pretty interesting to me.
What I'm trying to say is that at the moment it looks like every programming language ever is trivially capable of expressing these sorts of programs using the random number generators that are wildly common. What is different in this programming paradigm?
In particular, this paragraph:
> Another way of thinking about this: unlike a traditional program, which only runs in the forward directions, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations.
Are these programs truly running backwards? Is there some backwards reduction relation in their semantics that I could look at?
Conceptually, the easiest way to understand things is via rejection sampling. You can "run things backward" by running things forward and just keeping the traces that match the observed output. That's Bayesian inference. Unfortunately, the probability of matching the observed output is zero for most problems, so you need to use more advanced methods. Most accounts gloss over these methods since it's a huge research area.
Running things backwards is also not a perfect analogy, because it fails to account for the probability distribution you start with.
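One standard workaround for the zero-probability-of-matching problem, when outputs are continuous, is to accept traces whose output lands within a small tolerance of the observation (approximate Bayesian computation). A toy sketch, with the prior, likelihood, and tolerance all invented for illustration:

```python
import random

random.seed(0)

# Approximate Bayesian computation: run the generative model forward
# and keep parameter draws whose simulated output lands close to the
# observed value (an exact match has probability zero here).

observed = 2.0
tolerance = 0.1

accepted = []
while len(accepted) < 1000:
    mu = random.uniform(-5, 5)         # draw a parameter from the prior
    y = random.gauss(mu, 1.0)          # simulate one observation
    if abs(y - observed) < tolerance:  # keep traces that "match"
        accepted.append(mu)

# The accepted draws approximate the posterior over mu given the data.
posterior_mean = sum(accepted) / len(accepted)
```

The catch, as the comment above notes, is efficiency: the tighter the tolerance, the better the approximation, but the more forward runs get thrown away.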
Presumably, a programming language designed around this kind of task would make it easier to explore and develop larger, more intricate models.
In one of my lives, I've calculated ballistics. When calculating ballistics, you can probably think of a lot of factors that might go into it.
- Release conditions (velocities, angles, etc.)
- Environmental / trajectory conditions (winds, gusts, different media, etc.)
- Impact conditions and behaviour (spalling, rebounding, fragmentation, yatta....)
For more reading, you can look at something like: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/2007002... (only uses probabilistic methods for transport)
Anyhow, one way to attempt modelling of this is to run a Monte Carlo over the parameter space. Effectively, you run all the permutations of release, all the permutations of those permutations for transport, and then all the permutations^3 of the impact variation.
Now, one of the advances that has come about from efforts like that, is that probabilistic models for some of these steps have been created to approximate the results in a time effective manner.
For example, in the above, they note that instead of running a Monte Carlo of trajectories and releases, a single trajectory is instead run, and then a probabilistic flight cone is created which calculates parameters based on standard deviation from the mean.
A general and robust version of a language like this would allow you to specify all of your parameters with similar types of effects. I.e., you would have a mean, and then deviation or statistical-curve envelopes you would provide around that mean, which would be propagated through the calculations and spit out an overall convolved probability at the end of the pipeline.
In ballistics for example, you might have a mean release condition, with statistical variation on release. That would then be fed into a trajectory calculation, which would do so based on a mean, but with a transport cone that does statistical variation on the mean flight. Finally, impact might be calculated around a mean value, but would then carry the statistical variation through the release and flight. At the end, you would then be able to pull a probability number, as well as impact condition at that probability from any point in the potential impact zone.
At least that's how I see the use of probabilistic languages.
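That pipeline view can be sketched as plain Monte Carlo propagation. All the stages, distributions, and numbers below are hypothetical, just to show the shape of the computation:

```python
import math
import random

random.seed(0)

# Hypothetical three-stage pipeline: each stage is a mean plus some
# statistical variation, propagated by sampling.

def release():
    v = random.gauss(100.0, 2.0)                     # launch speed (m/s)
    angle = random.gauss(math.radians(45.0), math.radians(1.0))
    return v, angle

def trajectory(v, angle):
    g = 9.81
    ideal_range = v * v * math.sin(2.0 * angle) / g  # vacuum ballistics
    return random.gauss(ideal_range, 5.0)            # transport variation

def impact(x):
    return random.gauss(x, 1.0)                      # impact scatter

# Propagate the whole pipeline many times; the spread of the results
# is the convolved uncertainty at the end of the pipeline.
impacts = [impact(trajectory(*release())) for _ in range(50_000)]
mean = sum(impacts) / len(impacts)
std = math.sqrt(sum((x - mean) ** 2 for x in impacts) / len(impacts))
```

From `mean` and `std` (or the full list of `impacts`) you can read off an impact probability at any point in the potential impact zone.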
At its simplest, the forward problem is like a functional composition:
y = F( G( H(x) ) )
@gliese1337 nearby, in an excellent comment, addresses the more complex inverse problem: when you have outputs (y's above) from such a process, how do you determine the most likely model (F, G, H) that produced it?
So for example, if you have a bunch of crater ejecta, what can you tell me about the mass of the incoming objects?
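As a toy sketch of that inverse problem: pick a forward model, then search over inputs for the one whose predicted output best matches the observation. The crater "scaling law" here is made up purely for illustration:

```python
# Toy inverse problem: recover the input (impactor mass) from an
# observed output (crater size) by searching over the forward model.
# The scaling law and numbers are invented for illustration.

def forward(mass):
    return 10.0 * mass ** (1.0 / 3.0)  # pretend crater size ~ mass^(1/3)

observed_crater = 50.0

# Brute-force search over candidate masses for the best match.
candidates = [m / 10.0 for m in range(1, 5001)]
best_mass = min(candidates, key=lambda m: abs(forward(m) - observed_crater))
```

Brute force works here only because there is one parameter and a deterministic forward model; real inverse problems are noisy and high-dimensional, which is exactly where the probabilistic machinery earns its keep.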
- Start with a value + divergence and then propagate that state through your transforms to get a mean + divergent final state.
I should note that in the given example the base divergences need to be found through some method, which is usually running a Monte Carlo around some baseline state, and then making the significant assumption that new calculations (with small deltas) follow roughly the same statistical behavior.
For the backwards case, things can be significantly more challenging, as it basically comes down to performing approximate eigenvalue decomposition. However, there are a lot of SOTA tools, like pretty much anything from this list:
I.e., the crater example is fairly close to the setup for a Hidden Markov Model, where we get final data and assume that it is based on a set of hidden states. Or, for lighter reading, almost any of the major tools used in the Netflix recommendation contest, which is effectively an enormous version of this problem.
Hmm, not in general. (I've been working on various problem-specific versions of the backwards case for most of my career.)
Disclaimer: my lab mate has been organizing these workshops for the past couple of years and I'm currently procrastinating from preparing a presentation on this material for a reading group later this week, so I've presently got mixed feelings towards it.
Years ago, when I was a grad student, I tried to tackle some NLP problems using similar approaches (i.e., unsupervised generative Bayesian models organized in hierarchical ways that try to mimic our intuitive concept of causality) but mostly failed at making much headway. I couldn't figure out some of the math, and my models usually failed to perform any better than other probabilistic algorithms. No one at my university did anything similar or knew much about Bayesian probability back then, which didn't help.
I always believed in the approach, however, and until I stumbled on this book I never knew there was significant progress being made. I wish it had been available to build upon when I was a student.
In the "Learning as Conditional Inference" chapter, the section "Inferring an Arithmetic Expression" shows the first example I have seen of a legitimate and flexible approach that could some day allow a computer to program itself. I mean, the algorithm is still far from doing general programming tasks, but I could imagine extending the approach so that, given a programming-language grammar and unit tests as inputs, a computer could write a well-chosen and simplified program that passes the tests (simplified thanks to the automatic Occam's razor inherent in Bayesian approaches). That to me is mind-blowing.
Now, not all is perfect in the book; the "Inference about Inference" chapter seems oversimplified to me. The last few chapters seem to get a bit mathematically ad hoc, with an alpha parameter introduced to tie categories and sub-categories together in hierarchical models without much mathematical justification. On the other hand, there is strong evidence for this kind of transfer of information in hierarchies, and it's not totally unreasonable to make up a parameter, assign it an uninformative prior that reflects ignorance of how this parameter should be set, and try to induce a value for it from the data. It's a perfect example of meta-probabilities, where probabilities are used to guess parameters that govern other probabilities. It's very Bayesian.
Anyways before reading this book I was convinced that general AI was not going to happen in my lifetime. Now I'm not so sure.
Yesterday's Program Synthesis Demo with MS TouchDevelop shows that the approach is viable. What I find lacking in such systems is a way to correct the inferred program in those inevitable cases when the learning system guesses wrongly; having a language to model the possible explanations for the example would allow the user to tweak the program or give hints on how to correct it.
When one can apply it though, it really shines.
I understand the current implementation of matching for xbox live is a big mess of imperative code - this is one area where knowledge of math can actually simplify the programming 
"Online gaming systems such as Microsoft’s Xbox Live rate relative skills of players playing online games so as to match players with comparable skills for game playing. The problem is to come up with an estimate of the skill of each player based on the outcome of the games each player has played so far. A Bayesian model for this has been proposed..."
The idea, in a nutshell: create a programming language where random functions are elementary primitives. The point of a program in such a language isn't to execute the code (although we can!), but to define a probability distribution over execution traces of the program. So you use a probabilistic program to model some probabilistic generative process. The runtime or compiler of the language knows something about the statistics behind the random variables in the program (keeping track of likelihoods behind the scenes).
This becomes interesting when we want to reason about the conditional distribution over execution traces after fixing some assignment of values to variables. The runtime of a probabilistic language would let us sample from the conditional distribution -- "what is a likely value of Y, given that X=4?". (in Church this is accomplished with query). A lot of models have really simple analytic solutions, but the inference engine in a probabilistic programming language would work for any probabilistic program. The semantics of this are defined by rejection sampling: run the program a bunch of times until you get an execution trace where your condition holds. This is really, really, grossly inefficient -- the actual implementation of inference in the language is much more clever.
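The rejection-sampling semantics can be sketched directly in ordinary Python. This is the deliberately naive version described above, and the X/Y model is invented for illustration:

```python
import random

random.seed(0)

# A tiny generative program: X is the number of heads in 8 fair flips,
# and Y adds one more fair coin on top of X.
def program():
    x = sum(random.randint(0, 1) for _ in range(8))
    y = x + random.randint(0, 1)
    return x, y

# Conditioning by (naive) rejection: run the program forward many times
# and keep only the execution traces where the condition X = 4 holds.
traces = [program() for _ in range(100_000)]
conditioned = [y for x, y in traces if x == 4]

# "What is a likely value of Y, given that X = 4?"
mean_y = sum(conditioned) / len(conditioned)  # close to 4.5
```

Every trace that fails the condition is wasted work, which is why real implementations replace this with something much more clever.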
An analogy to standard programming: it used to be the case that all programmers wrote assembly by hand. Then optimizing compilers came along and now almost nobody writes assembly. The average programmer spends their time thinking about higher-order problems, and lets the compiler take care of generating machine code that can actually execute their ideas.
Probabilistic programming languages aim to be the compiler for probabilistic inference. Let the runtime take care of inference, and you can spend more time thinking about the model. The task of coming up with efficient inference algorithms gets outsourced to the compiler guys, and you just have to worry about coming up with a model to fit your data.
Because you don't have to think too hard about the math behind inference, probabilistic modelling suddenly becomes accessible to a much larger subset of the population. A ton of interesting software these days relies on machine-learning theory that goes way over the heads of most programmers; that theory suddenly becomes accessible.
On the other hand, the people that do this work already are freed up to choose more expressive models and be more productive. The current paradigm is: come up with a probabilistic model, then do a bunch of math to figure out how to do efficient inference over the model given some data. Proceed to code it up in a few thousand lines of C++, and panic if the underlying model changes. The probabilistic programming approach: come up with a model, and write it in a few hundred lines of probabilistic code. Let the language runtime take care of inference. If the model changes, don't worry, because inference is automatic and doesn't depend on the specific model.
If you're interested in this, the Probabilistic Computing Group at MIT (probcomp.csail.mit.edu) has some interesting examples on their website.
A really simple example of Venture, their new probabilistic language: http://probcomp.csail.mit.edu/venture/release-0.1.1/console-...
How does logic programming (Prolog, etc.) relate to probabilistic programming?