
Why Probabilistic Programming Matters - shankysingh
https://plus.google.com/+BeauCronin/posts/KpeRdJKR6Z1
======
mjw
I think a lot of statisticians and machine learners remain to be convinced
that there's much payoff available from trying to do efficient statistical
inference in such a general setting. As the article warns, it's inherently
really hard in its full generality, and I don't think anyone expects a silver
bullet. It seems likely that the most general probabilistic programming tools
will be strongest on:

* Problems with few parameters and/or few data (I was going to say toy problems, but there are sometimes important and interesting problems of this nature)

* Problems where the generative model is so complicated that you have no hope of doing any better than this and turn to it as a last resort. A bit like in combinatorial optimisation where you just say "gah, let's throw it at a SAT solver!".

(Perhaps that's not a bad analogy actually. If they can get to the point where
SAT solvers are now, that would actually not be a bad proposition.)

* In particular, problems where the generative model is complicated but the complicated part of it is largely deterministic -- perhaps some kind of non-linear inverse problem where there's some simple additive observation noise tacked onto the end, for example.
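
A hedged sketch of that last case, in Python with a made-up `simulate` function and invented numbers standing in for a real forward model: a deterministic nonlinear simulator with simple additive Gaussian noise tacked onto the end, inverted with a plain grid over the parameter.

```python
import math
import random

# Hypothetical nonlinear "simulator": purely deterministic, standing in for
# something expensive. Only the observation noise is stochastic.
def simulate(theta):
    return math.sin(3.0 * theta) + theta ** 2

true_theta = 0.7
noise_sd = 0.1
random.seed(0)
y_obs = simulate(true_theta) + random.gauss(0.0, noise_sd)  # one noisy observation

# Because the only randomness is the additive noise, the (flat-prior) posterior
# over theta on a grid is just the Gaussian likelihood of the residual.
grid = [i / 1000.0 for i in range(-2000, 2001)]  # theta in [-2, 2]
log_post = [-0.5 * ((y_obs - simulate(t)) / noise_sd) ** 2 for t in grid]
m = max(log_post)
weights = [math.exp(lp - m) for lp in log_post]
total = sum(weights)
post = [w / total for w in weights]             # normalised posterior weights

theta_map = grid[log_post.index(m)]             # a MAP estimate from the grid
print("MAP estimate:", theta_map)
```

Note the posterior here can be multimodal (the simulator is non-injective), which is exactly the sort of thing a generic tool has to cope with.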

What I _do_ worry about is the suggestion that people can just start building
fiendishly complicated hierarchical Bayesian models using these things and get
valid, useful, robust, interpretable inferences from them without much in the
way of statistical training. I suspect even a lot of statisticians would be a
bit scared of this sort of thing. Make sure you really read up on things like
model checking and sensitivity analysis, that you know something about the
trade-offs of different model structures and priors etc. And that's before you
start to worry about the frequentist properties and failure cases of any
approximate inference procedure which is magically derived for you.

Statisticians tend to favour simpler parsimonious models, not only for
computational convenience but because it's easier to reason about them,
understand and check their assumptions, understand their failure cases and so
on.

I wish these guys lots of luck though, it is a really interesting area and the
computer scientist in me really wants them to succeed!

~~~
tlarkworthy
I disagree. People who know how to build complex Bayesian models will be over
the moon that they don't have to piece a complete system together using
matrices. I see no reason for such a system not to scale: the underlying maths
is essentially elementary, but each statistical package at the moment has a
kind of lock-in through lack of interoperability. Ideally you want to do MCMC
only as a last resort for part of the reasoning, but once you are in WinBUGS
you end up doing all the basic stuff with MCMC too, as it's too much of a pain
to tie it to a more efficient system for the bits you can reason about
efficiently. I see potential in models that are partially MCMC and partially
analytical. I see even more potential in embedding those kinds of systems
within larger smart systems.

Some probabilistic standardisation would be lapped up by the applied
community.
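
As a tiny illustration of doing "the basic stuff" analytically rather than by MCMC (a sketch with invented numbers): for conjugate pairs like Beta-Binomial the posterior is available in closed form, and sampling is only a cross-check.

```python
import random

# For a Beta(a, b) prior on a coin's bias and k heads in n flips, the
# posterior is Beta(a + k, b + n - k) -- no sampling machinery required.
a, b = 1.0, 1.0      # uniform prior on the bias
n, k = 20, 14        # observed 14 heads in 20 flips

post_a, post_b = a + k, b + (n - k)
exact_mean = post_a / (post_a + post_b)   # exactly 15/22

# For contrast, the same posterior mean by brute-force sampling:
random.seed(1)
draws = [random.betavariate(post_a, post_b) for _ in range(20_000)]
mc_mean = sum(draws) / len(draws)
print("exact:", exact_mean, "sampled:", mc_mean)
```

The appeal of a hybrid system is that it would use the closed form for parts like this and spend the MCMC budget only on the genuinely intractable parts.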

~~~
nashequilibrium
I am enjoying going through "Probabilistic Models of Cognition":
[https://probmods.org/generative-models.html](https://probmods.org/generative-models.html).
Even though I am a Python guy, the fact that it is written using a functional
probabilistic programming language called Church really makes it easy to
follow along.

~~~
tlarkworthy
Yeah, it looks like it has nice externals, but yet again all the inference is
with MCMC :s

MCMC is the most general approach, but also the most inefficient, so it has
limited appeal for crunching live numbers coming out of a physical system.
Some problems are intractable and it's the only way, but really you want to
limit the application of MCMC to as small a part of the problem as possible.
So if you use Church you now have an interoperability problem, binding (Lisp)
to some ugly but efficient FORTRAN system or some such.

Exactly why I think this initiative will be highly welcome!

~~~
mjw
So you're probably aware of this, but I think it might be interesting to try
to articulate why I fear extending fully-Bayesian inference algorithms beyond
MCMC could be a challenge, from experience with ML-focused Bayesian models and
larger datasets:

Generally to make inference for these kinds of models fast, you have to make
it approximate.

MCMC has the nice property that, while it's an approximate method, it becomes
exact in the limit of infinite samples. Meaning that when applying it in a
general setting, the compiler doesn't have to make hard and final decisions
about where and how to introduce approximations -- it produces MCMC code and
you decide how long you can be bothered to wait for it to converge and how
accurate you want the answer to be, based on various diagnostics.
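
That property can be made concrete with a toy sketch (in Python, with a standard normal standing in for whatever posterior the generated code targets; everything here is invented for illustration):

```python
import math
import random

# Toy Metropolis sampler for a standard normal target. The point is the knob:
# the estimate is approximate for any finite run, but *you* decide how long to
# run and when the diagnostics look good enough -- the "compiler" never has to
# commit to a fixed approximation.
def log_target(x):
    return -0.5 * x * x  # log-density of N(0, 1), up to a constant

random.seed(0)
x = 0.0
samples = []
for _ in range(200_000):
    proposal = x + random.gauss(0.0, 1.0)   # symmetric random-walk proposal
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal                        # accept
    samples.append(x)                       # on reject, the old x repeats

kept = samples[10_000:]                     # crude burn-in
est = sum(1 for s in kept if s > 1.0) / len(kept)
print("P(X > 1) estimate:", est)            # true value is about 0.1587
```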

Most other non-MCMC-based approximate inference methods (MAP, EM, Variational
Bayes, EP, various hybrids of these and MCMC...) don't converge to the exact
answer, they converge to an approximate answer which remains inexact no matter
how long you run the algorithm for.
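
To make the contrast concrete, here is a hedged numerical sketch (made-up numbers): a moment-matched Gaussian approximation to a skewed Beta(2, 8) posterior, the flavour of thing Laplace or variational methods produce, misestimates a tail probability by a fixed amount that no extra compute removes.

```python
import math

# Skewed "posterior": Beta(2, 8). Approximate it with a Gaussian matching its
# exact mean and variance, then compare a tail probability P(X > 0.4).
a, b = 2.0, 8.0
mean = a / (a + b)                             # 0.2
var = a * b / ((a + b) ** 2 * (a + b + 1))     # exact Beta variance

def beta_pdf(x):
    return x ** (a - 1) * (1 - x) ** (b - 1)   # unnormalised density

# Exact tail by midpoint integration on a fine grid.
n = 200_000
h = 1.0 / n
total = sum(beta_pdf((i + 0.5) * h) for i in range(n)) * h
tail_exact = sum(beta_pdf((i + 0.5) * h)
                 for i in range(int(0.4 * n), n)) * h / total

# Gaussian approximation to the same tail.
z = (0.4 - mean) / math.sqrt(var)
tail_approx = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

print("exact:", tail_exact, "gaussian approx:", tail_approx)
```

The exact tail is about 0.071, the Gaussian's about 0.049; the gap is a property of the chosen approximating family, not of how long you iterate.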

Different approximations have different strengths and weaknesses, and the best
choice may depend on the model, the data, what you actually want to do with
the posterior (a mode, a mean, an expected loss, predictive means, extreme
quantiles?), and what frequentist properties you want for the results of the
inference, especially given that this is no longer pure Bayesian inference but
some messy approximation to it. Often the only practical way to decide will be
to try a bunch of different things and see which does best on the application
(or best predicts held-out data), even armed with full human intuition. A
compiler doesn't really have a chance here.

In short, it's not going to be easy to automate fully because it's not
something you can decide on formal grounds alone. You have to decide how and
where you're willing to approximate, and you have to understand the
approximations used and check the resulting approximated posterior against
data in order to know whether to trust the results and how to interpret them.

------
Faint
I think the most exciting opportunity here is actually compiling models to run
in specialized inference hardware! Lyric Semiconductor, a.k.a. Lyric Labs of
Analog Devices, is working towards such goals
([http://dimple.probprog.org/](http://dimple.probprog.org/),
[http://arxiv.org/abs/1212.2991](http://arxiv.org/abs/1212.2991)). I hope that
they will get some hardware out at some point.

Also, doesn't it strike you as strange that we are building simulations of
stochastic phenomena (MCMC) using deterministically behaving components? What
if we used stochastically behaving components as building blocks to begin
with?

Most electronic components start to behave more or less stochastically when
you try to run them with as little power as possible and scale them down as
small as possible. What if you could build an MCMC simulator for the problem
of your choice directly from stochastically behaving components? Just think of
all the transistors you nowadays use just to generate a random number for
simulating a 90% probability of something: they are all components
manufactured to such tolerances, and run at such power levels, that their
probability of error is on the order of 1e-24 (for one computation). Doesn't
that strike you as humongous overkill, when all you need is something like
"1, most of the time"?

For more related cool stuff, google for imprecise hardware.

------
ajtulloch
For a taste of one form of "probabilistic programming", have a look at some
examples from the BUGS book [1]. Here, we'll estimate P(Z < 3), where
Z ~ Binom(8, 0.5):

    model {
        Y ~ dbin(0.5, 8)     # Y is Binom(8, 0.5): 8 trials, pSuccess = 0.5
        P2 <- step(2.5 - Y)  # indicator that Y is 2, 1, or 0
    }

When this is passed to {Open/Win}BUGS, the software constructs a graphical
model from this code and uses MCMC techniques to sample efficiently from it.
For example, you can examine the distribution of the nodes P2 and Y.

    node   mean     sd       MC error   2.5%   median  97.5%   start   sample
    P2     0.1448   0.3519   0.003317   0.0    0.0     1.0     1       10000
    Y      4.004    1.417    0.01291    1.0    4.0     7.0     1       10000

Thus, we infer that P(Z < 3) ≈ 0.1448.
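
As a cross-check (not part of the BUGS output), this tail probability has a simple closed form, so the MCMC estimate can be verified exactly, e.g. in Python:

```python
from math import comb

# For Z ~ Binom(8, 0.5), P(Z < 3) = P(Z=0) + P(Z=1) + P(Z=2), and every
# outcome has probability comb(8, k) / 2^8.
p_exact = sum(comb(8, k) for k in range(3)) / 2 ** 8
print(p_exact)  # 37/256 = 0.14453125, close to the MCMC estimate 0.1448
```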

[1]: [http://www2.mrc-bsu.cam.ac.uk/bugs/thebugsbook/examples/](http://www2.mrc-bsu.cam.ac.uk/bugs/thebugsbook/examples/)

------
mneary
Reading this reminded me of a fun quote: "Google uses Bayesian filtering the way
Microsoft uses the if statement" [1]. I imagine that having probabilistic
values as first class primitives is a step in this direction.

[1]:
[http://www.joelonsoftware.com/items/2005/10/17.html](http://www.joelonsoftware.com/items/2005/10/17.html)

------
tree_of_item
I'll admit, I looked over [https://probmods.org](https://probmods.org) and I
don't really get "probabilistic programming". It just looks like you call
random() sometimes...? Is there something going on in the language runtime
that I'm missing?

~~~
gliese1337
The power of probabilistic programming is not in running the programs forward
to generate output. That pretty much does just come down to calling random()
sometimes to alter the flow of the program. Having built-in syntax for that
would be nice for modelling stochastic processes, but we can already do that.

The big deal is being able to run the program _backwards_. I.e., you already
have some set of output, obtained from some other source, and you want to
figure out how that output could have been produced (i.e., how the source
works). So, you write a probabilistic program that encodes your various
hypotheses about how the source process works, and then run it backwards to
figure out what the forward execution path would have had to have been to
produce that output, and thus which hypothesis is correct. That's what
"inference" refers to.

Inference is a basic process in scientific investigation: come up with a bunch
of hypotheses, do an experiment that generates data through physical processes
independently of whatever you hypothesized, and then go backwards to figure
out which of your hypotheses would have had to have been true in order to get
the experimental output that you did. Which hypothesis is true corresponds to
a particular execution path through a probabilistic program, so having
efficient implementations of probabilistic languages means we could do much
better at accurate analysis of data, which is Good For Science.

~~~
j2kun
> so having efficient implementations of probabilistic languages means we
> could do much better at accurate analysis of data

What does a compiler-level implementation of probabilistic analysis have to do
with accuracy?

~~~
AlexCoventry
It makes it cheaper to experiment with different models.

~~~
j2kun
Still not really a statement about accuracy...

~~~
davmre
Experimenting with more models increases the likelihood you'll find a good
model. A good model is, by definition, a more accurate representation of your
domain than a bad model. It will also tend to generate more accurate
predictions, if that's what you care about.

As a secondary point, re-implementing inference code for each new model makes
it almost certain that there are bugs in said code. So even without changing
the model, automatically generated inference code is likely to have fewer bugs
and thus give more accurate inferences than hand-written code. (assuming it
runs to convergence; naturally there are lots of scenarios in which naively
generated code will be slower to converge than something hand-tuned).

~~~
j2kun
I don't consider bugs in code a matter of accuracy of the model. And while you
can compare the accuracy levels of various models, whether a compiler does the
inference or you do it yourself doesn't change the accuracy. I also don't
subscribe to the idea that the right model for a scientific or statistical
phenomenon is a random event.

------
mathgenius
I wonder if such a system can be used for "programming by example", i.e.,
generate by hand a bunch of example behaviors and then the system learns the
program that can produce them.

~~~
TuringTest
I was pondering that same thought here. Modeling the user's _intent_ is one of
the more difficult things in end-user development; it would be great to have a
more-or-less standard way to infer what abstraction the user had in mind when
introducing an example that ought to be generalized into a program.

Yesterday's Program Synthesis Demo [1] with MS TouchDevelop shows that the
approach is viable. What I find lacking in such systems is a way to _correct_
the inferred program in those inevitable cases when the learning system
guesses wrongly; having a language to model the possible explanations for the
example would allow the user to tweak the program or give hints on how to
correct it.

[1] [https://news.ycombinator.com/item?id=8028620](https://news.ycombinator.com/item?id=8028620)

------
platz
Something like PGM [1] (this is not a lightweight class) helps in
understanding the concepts. But it still seems like more of a niche domain
right now than a general programming technique.

When one _can_ apply it, though, it really shines.

I understand the current implementation of matchmaking for Xbox Live is a big
mess of imperative code; this is one area where knowledge of math can actually
simplify the programming [2]:

"Online gaming systems such as Microsoft’s Xbox Live rate relative skills of
players playing online games so as to match players with comparable skills for
game playing. The problem is to come up with an estimate of the skill of each
player based on the outcome of the games each player has played so far. A
Bayesian model for this has been proposed..." [3]

[1] [https://www.coursera.org/course/pgm](https://www.coursera.org/course/pgm)

[2] [http://research.microsoft.com/pubs/208585/fose-icse2014.pdf](http://research.microsoft.com/pubs/208585/fose-icse2014.pdf)

[3] [http://research.microsoft.com/en-us/um/cambridge/projects/in...](http://research.microsoft.com/en-us/um/cambridge/projects/infernet/)

------
mango_man
This article doesn't do a great job of explaining what probabilistic
programming actually is. It's about 1) making machine learning and
probabilistic modelling accessible to a larger audience, and 2) enabling
automated reasoning over probabilistic models for which analytic solutions are
inconvenient. (Sorry for the wall of text)

The idea, in a nutshell: create a programming language where random functions
are elementary primitives. The point of a program in such a language isn't to
_execute_ the code (although we can!), but to define a probability
distribution over execution traces of the program. So you use a probabilistic
program to model some probabilistic generative process. The runtime or
compiler of the language knows something about the statistics behind the
random variables in the program (keeping track of likelihoods behind the
scenes).

This becomes interesting when we want to reason about the conditional
distribution over execution traces after fixing some assignment of values to
variables. The runtime of a probabilistic language would let us sample from
the conditional distribution -- "what is a likely value of Y, given that
X=4?" (in Church this is accomplished with query). A lot of models have
really simple analytic solutions, but the inference engine in a probabilistic
programming language would work for any probabilistic program. The semantics
of this are defined by rejection sampling: run the program a bunch of times
until you get an execution trace where your condition holds. This is really,
really, grossly inefficient -- the actual implementation of inference in the
language is much more clever.
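
Those rejection-sampling semantics can be written out directly. This is a made-up miniature in plain Python, with no probabilistic runtime: a generative "program" picks a latent die and rolls it, and conditioning on the observed roll means rerunning the program and keeping only the traces where the condition holds.

```python
import random

def program():
    fair = random.random() < 0.5          # latent choice: fair or loaded die
    if fair:
        roll = random.randint(1, 6)
    else:
        # loaded die: rolls a 6 half the time, otherwise uniform on 1..5
        roll = 6 if random.random() < 0.5 else random.randint(1, 5)
    return fair, roll

random.seed(0)
# Condition on the observation roll == 6; the accepted traces' latent
# variables are draws from the conditional distribution.
accepted = [fair for fair, roll in (program() for _ in range(200_000))
            if roll == 6]
p_fair_given_six = sum(accepted) / len(accepted)
print("P(fair | roll = 6) =", p_fair_given_six)  # exact answer: 1/4
```

By Bayes' rule the exact answer is (1/2 · 1/6) / (1/2 · 1/6 + 1/2 · 1/2) = 1/4, which the accepted traces approach; the inefficiency (two thirds of the runs are thrown away even in this tiny example) is why real implementations are cleverer.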

An analogy to standard programming: it used to be the case that all
programmers wrote assembly by hand. Then optimizing compilers came along and
now almost nobody writes assembly. The average programmer spends their time
thinking about higher order problems, and lets the compiler take care of
generating machine code that can actually execute their ideas.

Probabilistic programming languages aim to be the compiler for probabilistic
inference. Let the runtime take care of inference, and you can spend more time
thinking about the model. The task of coming up with efficient inference
algorithms gets outsourced to the compiler guys, and you just have to worry
about coming up with a model to fit your data.

Because you don't have to think too hard about the math behind inference,
probabilistic modelling suddenly becomes accessible to a much larger subset of
the population. A ton of interesting software these days relies on machine
learning theory that goes way over the heads of most programmers; all of that
suddenly comes within reach.

On the other hand, the people who already do this work are freed up to choose
more expressive models and be more productive. The current paradigm is: come
up with a probabilistic model, then do a bunch of math to figure out how to do
efficient inference over the model given some data. Proceed to code it up in a
few thousand lines of C++, and panic if the underlying model changes. The
probabilistic programming approach: come up with a model, and write it in a
few hundred lines of probabilistic code. Let the language runtime take care of
inference. If the model changes, don't worry, because inference is automatic
and doesn't depend on the specific model.

If you're interested in this, the Probabilistic Computing Group at MIT
(probcomp.csail.mit.edu) has some interesting examples on their website.

A really simple example of Venture, their new probabilistic language:
[http://probcomp.csail.mit.edu/venture/release-0.1.1/console-...](http://probcomp.csail.mit.edu/venture/release-0.1.1/console-examples.html)

~~~
sitkack
Well written.

How does logic programming (Prolog, etc.) relate to probabilistic
programming?

------
pointfree
This reminds me somewhat of the Bloom programming language for "disorderly
programming in distributed systems".

[http://www.bloom-lang.net/](http://www.bloom-lang.net/)

------
JadeNB
Every time that I see a discussion of probabilistic programming, I'm reminded
that I want to try to find an excuse to use IBAL
([http://www.gelberpfeffer.net/avi.htm](http://www.gelberpfeffer.net/avi.htm)).
I used to have a binary for it lying around, but can't find it any more; does
anyone know where to get it?

~~~
davmre
IBAL is quite old; you're probably better off learning something that's still
being actively maintained. The natural choice would be Avi Pfeffer's new
language, Figaro ([https://www.cra.com/commercial-solutions/probabilistic-model...](https://www.cra.com/commercial-solutions/probabilistic-modeling-services.asp)).

~~~
JadeNB
Thank you! I will have a look at it. Do you know if there's any analogue of
the 99 Haskell / Lisp / Prolog problems list geared toward probabilistic
programming?

~~~
davmre
Not as far as I know. The closest I can think of is Josh Tenenbaum and Noah
Goodman's online book "Probabilistic Models of Cognition",
[https://probmods.org](https://probmods.org), which has lots of examples of
simple models, written in Church; these could be a good set of exercises to
try translating into Figaro. Though these systems are immature enough that
there's no guarantee Figaro will actually _work_ on the translated models (or
that Church will even work on the originals :-).

------
sriku
Functional Pearls - Probabilistic Functional Programming in Haskell -
[http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf](http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf)

------
danarlow
Re running your program "backwards": good luck ever figuring out if your
sampling of an unknown but very complex distribution has converged :/

------
basyt
The buzzwording was strong in that article. I am not an expert, but I will be
looking at this in the future.

------
contingencies
Law of Probability Dispersal: Whatever it is that hits the fan will not be
evenly distributed.

