
Probabilistic programming does in 50 lines of code what used to take thousands - ub
http://newsoffice.mit.edu/2015/better-probabilistic-programming-0413
======
jb55
If anyone wants to jump into this, Josh Tenenbaum and Noah Goodman put
together this amazing interactive book for learning probabilistic programming
with Church: [https://probmods.org/](https://probmods.org/)
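Church is a Scheme dialect, but the core idea (flip some random choices,
condition on what you observed, read off the posterior of a query) doesn't
require the functional syntax. A rough Python analogue of a classic toy
query, just to give the flavor (my sketch, not code from the book):

    # What is the probability the first coin was heads, given at least one head?
    import random

    def flip(p=0.5):
        return random.random() < p

    samples = []
    while len(samples) < 10000:
        a, b = flip(), flip()
        if a + b >= 1:              # condition on the observation
            samples.append(a)       # query: was the first coin heads?

    print(sum(samples) / len(samples))   # ~0.66, i.e. 2/3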

~~~
rrtwo
Great project but very weird choice of language. How probable is it (pun
intended) that the person coming to learn about probabilistic programming
would already know functional programming?

~~~
Retra
Isn't functional programming a standard part of any computer science
curriculum? Why would you expect programmers not to know it?

~~~
vidarh
Even in the places where it is part of the curriculum, it is often such a
small part that unless people specifically take courses related to functional
programming, you can't expect them to be able to actually _use_ it, or
remember much of it for that matter.

Heck, I spent months on a binge reading functional programming research
papers, and it still doesn't mean I know any functional languages other than
very superficially.

~~~
Retra
That's true of any language though. If you know Java, you don't automatically
know C#, but the core concepts will still mostly transfer. You'll still know
Object-Oriented Programming, even if you've never used C# and can't write
HelloWorld without looking something up.

------
stevenspasbo
Is that 50 lines of code, or 50 lines of using a library that's thousands of
lines of code?

~~~
baldfat
What do we say about languages built on C? Is it 100 lines of code, or the
hundreds of thousands of lines of code behind the higher-level language you
just coded in?

I don't think libraries should count against the line count. We all use code
to program, standing on the shoulders of those who preceded us. Calling a
library function should, for the most part, just count.

~~~
jarrettc
True, but the parent commenter is getting at something important. The article
suggests that researchers have found a new, much more concise way to express
the solutions to difficult problems. That's different from a library, which
merely packages pre-built solutions to a finite set of problems.

It's like the difference between a complete kitchen that fits in your pocket
and an iPhone app that lets you order a burrito. The article suggests
something like the former. A library which encapsulates 1000 lines of code
into a single function call is like the latter.

~~~
derefr
Even a "kitchen that fits in your pocket" isn't general enough. A library that
parses a DSL, such that you can then code in that DSL, is still a lame duck if
that library, _plus_ the encodings of all the useful solutions in the domain,
add up to more code than just the solutions would be when expressed in a
general language. The ROI of (learning the DSL + maintaining the library)
would be negative.

On the other hand, there are things like Prolog. You _can_ think of Prolog as
a backtracking constraint-solving library, and then another library that
parses a DSL for expressing facts and procedural constraints and feeds it to
the first library. But Prolog's language isn't really a DSL, because it isn't
particular to any domain: there's no closed solution-space where Prolog
applies. The efficiency gains you get from Prolog's elision of proceduralized
constraint-solution code _can_ apply to any program you write. And so its value
is unbounded; its ROI is certainly positive, whatever the cost was to
implement it.

That's the comparison that's useful here, I think. Is this something that only
solves problems in one domain? Or is this something that could be applied to
(at least some little bits of) any problem you encounter?

------
cheatsheet
> “It goes beyond image classification — the most popular task in computer
> vision — and tries to answer one of the most fundamental questions in
> computer vision: What is the right representation of visual scenes?”

Can someone knowledgeable in graphics research explain the context that this
question comes from?

If I am reading the question correctly, it suggests that there exists a right
way to reproduce the visual experience of reality. To me, this sounds like a
question that could just as well have no answer (or many answers), as with
questions in aesthetics, art, philosophy, etc.

~~~
rasz_pl
Think about dreaming. "Seeing" during a dream state works by experiencing a
pure data representation of the real world. People fluent in lucid dreaming
can tell you something funny happens when you try to thoroughly examine
objects while sleeping: constructed worlds tend to be skin deep, and fall
apart when poked. Everything is built with ideas drawn from your experience.

It's Plato's Allegory of the Cave all the way down.

Imagine "watching" a movie compressed using your very own prior knowledge.
Every scene could be described in a couple of hundred lines of plaintext.
Today we do this by reading a book :) What if we could build an algorithm
able to render movies from books?

~~~
sedachv
> What if we could build an algorithm able to render movies from books?

Bob Coyne has been working on a system for generating images of still scenes
from text descriptions for about 15 years now:

[https://www.wordseye.com/](https://www.wordseye.com/)
[http://www.cs.columbia.edu/~coyne/papers/wordseye_siggraph.p...](http://www.cs.columbia.edu/~coyne/papers/wordseye_siggraph.pdf)

------
murbard2
Yes, you can specify extremely powerful statistical models in only a few lines
of code using probabilistic programming.

However, at this point, unless you design your program in a very specific way
and use a lot of tricks, your sampler is very unlikely to converge, and you
won't get any meaningful result without a gargantuan amount of computing
power.
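To make the "few lines" part concrete: the model below is essentially three
lines, and even the naive random-walk Metropolis sampler attached to it is
only a dozen more. This toy converges fine; the trouble starts when the model
stops being a toy. (A sketch in Python, nothing to do with the Picture system
itself.)

    import math, random

    data = [1, 1, 0, 1, 1, 1, 0, 1]        # observed coin flips

    def log_posterior(p):                  # uniform prior, Bernoulli likelihood
        if not 0.0 < p < 1.0:
            return float("-inf")
        return sum(math.log(p) if x else math.log(1.0 - p) for x in data)

    def metropolis(steps=10000, scale=0.1):
        p, samples = 0.5, []
        for _ in range(steps):
            q = p + random.gauss(0.0, scale)          # random-walk proposal
            if math.log(random.random()) < log_posterior(q) - log_posterior(p):
                p = q                                 # accept
            samples.append(p)
        return samples

    print(sum(metropolis()[1000:]) / 9000)            # posterior mean of the bias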

------
dang
Also
[https://news.ycombinator.com/item?id=9363496](https://news.ycombinator.com/item?id=9363496)
from yesterday.

------
ub
No experience but looks like they are organizing a summer school on
probabilistic programming languages.
[http://ppaml.galois.com/wiki/wiki/SummerSchools/2015/Announc...](http://ppaml.galois.com/wiki/wiki/SummerSchools/2015/Announcement)

------
pliny
relevant paper:
[http://mrkulk.github.io/www_cvpr15/1999.pdf](http://mrkulk.github.io/www_cvpr15/1999.pdf)

~~~
e12e
Thanks for that. As far as I can tell the compiler for the Picture language
hasn't been published?

------
newpattern
Anybody here on HN have experience with probabilistic programming? This looks
quite disruptive if it works.

~~~
BenoitEssiambre
I've played with it a bit and, in my opinion, the principles behind it at
least (the streamlined and optimized simulation of Bayesian generative
models) are the best chance we have of solving artificial general
intelligence.

Reading probmods.org and dippl.org took me from being very pessimistic that I
would see it in my lifetime to a solid maybe.

~~~
murbard2
It's a very powerful statistical technique, yes, but I doubt it will be enough
for AI. The problem is that you need to sample efficiently from the posterior
distribution, and for anything AI related, MCMC isn't going to cut it.

Let's take a step back and consider a logical problem. Can you put N socks in
N-1 boxes such that no box contains more than one sock? Obviously not: that's
the pigeonhole principle.

Convert that question into a Boolean circuit, and throw a SAT solver at it. It
will die. In fact, a resolution proof of the pigeonhole principle (the kind of
proof a SAT solver implicitly searches for) _requires_ an exponential number
of steps. Looking at logical propositions alone is too myopic to solve the
problem; you have to formulate higher-level theories about the structure of
the problem to solve it (in this case, natural numbers).
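For the curious, here is roughly what that encoding looks like (a sketch,
using the textbook CNF for the pigeonhole principle). Note that the formula
itself stays small; it's the refutation the solver has to construct that
explodes.

    from itertools import combinations

    def pigeonhole_cnf(n):
        """n socks into n-1 boxes; var(i, j) means "sock i is in box j"."""
        def var(i, j):
            return i * (n - 1) + j + 1           # DIMACS-style variable numbers

        clauses = []
        for i in range(n):                       # every sock goes in some box
            clauses.append([var(i, j) for j in range(n - 1)])
        for j in range(n - 1):                   # no box holds two socks
            for i1, i2 in combinations(range(n), 2):
                clauses.append([-var(i1, j), -var(i2, j)])
        return clauses

    print(len(pigeonhole_cnf(10)))   # 415 clauses: tiny, yet refuting it is hard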

The same goes for probabilistic programming. As long as the paradigm is to
treat the problem as a black box energy function to sample from, it is doomed
to be inefficient. Try writing a simple HMM and the system will choke, even
though there are efficient algorithms to sample from such a model.
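The efficient algorithm in the HMM case is exact, for what it's worth:
forward filtering followed by backward sampling draws the hidden path from
the true posterior in O(T*K^2), with no Markov chain to babysit. A minimal
sketch with made-up numbers (mine, not from any paper in this thread):

    import random

    init  = [0.5, 0.5]                      # P(z_1)
    trans = [[0.9, 0.1], [0.2, 0.8]]        # P(z_t | z_{t-1})
    emit  = [[0.8, 0.2], [0.3, 0.7]]        # P(x_t | z_t)
    obs   = [0, 0, 1, 1, 0, 1]              # observed symbols

    def ffbs(obs):
        K, T = len(init), len(obs)
        alpha = []                           # filtered P(z_t | x_{1:t})
        for t, x in enumerate(obs):
            prev = alpha[-1] if alpha else init
            a = [emit[k][x] * (prev[k] if t == 0 else
                 sum(prev[j] * trans[j][k] for j in range(K)))
                 for k in range(K)]
            z = sum(a)
            alpha.append([v / z for v in a])
        path = [random.choices(range(K), weights=alpha[-1])[0]]
        for t in range(T - 2, -1, -1):       # backward sampling pass
            w = [alpha[t][k] * trans[k][path[0]] for k in range(K)]
            path.insert(0, random.choices(range(K), weights=w)[0])
        return path

    print(ffbs(obs))                         # one exact posterior sample of z_{1:T}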

If you look at deep learning techniques, they take an interesting approach,
which is to learn to approximate the generative distribution _and_ the
inference distribution at the same time. This is the basis of the work on
autoencoders and deep belief networks, and it guarantees that you can
tractably sample from your latent representation.

~~~
BenoitEssiambre
I have been out of the machine learning field for years and I haven't look
into deep learning methods even though they are intriguing (just not
sufficiently bayesian to pique my curiosity enough to spend my limited free
time on it). I do believe that automatically building a hierarchical structure
(as I assume happens in deep learning) is the way to abstract away complexity
but I think this is achievable within a generative bayesian monte carlo
approach.

For example, I have been toying with using MCMC, similarly to how it is used
in dippl.org, to write a kind of probabilistic program that generates other
programs by sampling from a probabilistic programming language grammar. A bit
like the arithmetic generator here:
[https://probmods.org/learning-as-conditional-inference.html#...](https://probmods.org/learning-as-conditional-inference.html#example-inferring-an-arithmetic-expression)
but for whole programs.
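To make that concrete, here is roughly what "sampling from a probabilistic
programming language grammar" can look like, sketched in Python rather than
Church/WebPPL, with the production probabilities exposed as the sort of
hyperparameters I mean (toy numbers, and brute-force search standing in for
the MCMC):

    import random

    # Hyperparameters: how often to stop at a leaf, emit a variable, pick each op.
    hyper = {"p_leaf": 0.6, "p_var": 0.5, "ops": {"+": 0.5, "*": 0.5}}

    def sample_expr(depth=0):
        if depth > 4 or random.random() < hyper["p_leaf"]:
            return "x" if random.random() < hyper["p_var"] else str(random.randint(0, 9))
        op = random.choices(list(hyper["ops"]), weights=list(hyper["ops"].values()))[0]
        return f"({sample_expr(depth + 1)} {op} {sample_expr(depth + 1)})"

    def score(expr, examples):
        """How many (x, y) pairs does the sampled program reproduce?"""
        return sum(eval(expr, {"x": x}) == y for x, y in examples)

    examples = [(0, 1), (1, 3), (2, 5)]      # behaviour we want: x -> 2*x + 1
    best = max((sample_expr() for _ in range(20000)),
               key=lambda e: score(e, examples))
    print(best, score(best, examples))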

After the MCMC has converged to a program that generates approximately
correct output, you can tune the grammar hyperparameters on the
higher-likelihood program samples so that next time it will converge faster.

I don't know if this counts as approximating "the generative distribution and
the inference distribution at the same time" under your definition but my hope
is that the learned grammar rules are good abstractions for hierarchical
generators of learned concepts.

Of course, the worry is that my approach will not converge very fast. But
there are reasons to think that a suitably arranged, hyperparametrized
probabilistic grammar might use the Occam's razor inherent in Bayesian
methods to produce exponentially fewer, simpler grammar rules that generate
approximations when it doesn't have enough data to converge to more complex
and precise programs, and that these simple rules, which rely on fewer
parameters, might provide the necessary intermediate steps to then pivot to
more complex rules. These smaller steps help the MCMC find a convergence
path. Not sure how well it will work for complex problems, however. My idea
is still half baked; I have tons of loose ends and details I have not figured
out, some of which I might not even have a solution for, as well as little
time to work on this (this is just a hobby for me).

Anyways, all that to say that probabilistic programming can go beyond just
hardcoding a generative model and running it.

~~~
murbard2
Using a probabilistic program to generate and interpret a program is an
interesting idea, but I think you're missing the main problem.

How do you sample from the set of programs that produce the correct (or
approximately correct) output?

You could use rejection sampling, but that would take very long as only a tiny
fraction of all possible programs produce the desired output.
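A back-of-the-envelope illustration (my own toy instruction set, nothing from
the thread): even for trivially short programs over four operations, only a
couple percent happen to satisfy a three-example specification, and that
fraction collapses further as programs and specifications grow.

    import random

    OPS = ["inc", "dec", "double", "square"]

    def run(program, x):
        for op in program:
            x = {"inc": x + 1, "dec": x - 1, "double": 2 * x, "square": x * x}[op]
        return x

    examples = [(1, 3), (3, 7), (10, 21)]    # desired behaviour: x -> 2*x + 1
    trials, hits = 100000, 0
    for _ in range(trials):
        program = [random.choice(OPS) for _ in range(random.randint(1, 6))]
        hits += all(run(program, x) == y for x, y in examples)
    print(hits / trials)   # the rejection-sampling acceptance rate, tiny even here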

You could use an MCMC method, but the problem here is designing a proposal
density. There is no reason to expect programs which are almost correct to be
syntactically close to programs which are fully correct. Changing a single
bit in a program tends to radically alter its behavior, in a very non-linear
way. Program space does not map smoothly to output space.

You mention hyperparameters, but what would those be? If you're building a
probabilistic grammar, they might be things like emission probabilities...
but tweaking the proportion of "if" statements you produce, as opposed to,
say, function calls, is hardly going to make you better at sampling from the
set of correct programs.

In general, the idea of meta-learning is a powerful one, but without a way to
guide the search, I don't see it as feasible.

~~~
BenoitEssiambre
" There is no reason to expect programs which are almost correct to be
syntactically close to programs which aren't. Changing a single bit in a
program tends to radically alter its behavior, in a very non linear way.
Program space does not map smoothly to output space."

You correctly identify the main hurdle. I admit I am not sure it will work.

However, I think it might be possible to design the programming language such
that changing a few bits doesn't usually radically alter the output.

For example, let's say you are trying to learn a grammar to draw a blue
irregular polygon. If there are primitives with fewer parameters that can get
you an approximate output, say a circle, this makes possible a learning path
that looks like "circle"->"blue circle"->"simple blue polygon"->"complex blue
polygon". In addition to that, if the grammar rules that generate similar
output can be clustered in the space of grammar rules, small jumps in this
space may give you small output changes. Bayesian priors will naturally favor
the simpler, fewer-parameter shapes first and pivot to more complex ones as
enough information is learned while, I think, keeping these more complex
rules close to the simple ones in the space of grammar rules. That is my hope
anyways. I got it working as expected-ish with a simple vision example like
the one I just described.

------
eli_gottlieb
Ah, so they wrote a DSL just for doing Bayesian inverse-vision models, and
apparently they've now got it to accuracy rates competitive with most other
major vision methods?

Good job!

------
contingencies
Probabilistic prediction: this is primarily going to be used for robots that
monitor, kill or assist with killing people.

~~~
Retra
Alternatively, one could monitor, save, or assist with saving people.

