
Stan: a probabilistic programming language - rahimiali
http://mc-stan.org
======
probdist
Stan is best viewed, in my mind, as a successor to BUGS (Bayesian Inference
Using Gibbs Sampling), which more people may have heard of. In fact, there are
lots of players in the probabilistic programming space now. Personally I like
the approach of Infer.NET [1] from Microsoft Research, as I find variational
and approximate variational inference a good solution to my problems, and I
prefer coding models in (nearly) plain C# to the Stan format.

If you want to get started as fast as possible with computational Bayesian
inference and don't need the performance and advanced features of Stan, I'd
recommend emcee [2], a lightweight Python MCMC sampler.
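
For flavor, here's roughly what a minimal emcee setup looks like (a sketch:
the target density and starting points below are made up just to show the
ensemble-sampler API):

```python
# Minimal emcee sketch: sample a 2-D standard normal with an ensemble sampler.
import numpy as np
import emcee

def log_prob(theta):
    # Log of an (unnormalized) 2-D standard normal density.
    return -0.5 * np.sum(theta ** 2)

ndim, nwalkers = 2, 32
p0 = np.random.randn(nwalkers, ndim)   # one starting point per walker

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 5000)
samples = sampler.flatchain            # shape: (nwalkers * 5000, ndim)
```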

Final recommendation to the HN crowd: Bayesian Data Analysis 3rd Edition [3]
should be on your desk if you are thinking about using any of this software.

[1] [http://research.microsoft.com/en-us/um/cambridge/projects/in...](http://research.microsoft.com/en-us/um/cambridge/projects/infernet/)

[2] [http://dan.iel.fm/emcee/current/](http://dan.iel.fm/emcee/current/)

[3] [https://www.crcpress.com/Bayesian-Data-Analysis-Third-Editio...](https://www.crcpress.com/Bayesian-Data-Analysis-Third-Edition/Gelman-Carlin-Stern-Dunson-Vehtari-Rubin/9781439840955)

~~~
dustintran
Hi, Stan dev here. I think viewing Stan as a better BUGS is helpful but
limiting.

The syntax is similar, but the class of models Stan fits is far more general.
The class of algorithms we have available also goes beyond MCMC, e.g.,
variational inference and optimization, and interfaces to Stan exist in all
the major programming languages. It's more helpful to think of Stan as its own
probabilistic programming language, and arguably the one with the largest
user base.

~~~
thisisdave
> the class of models Stan fits is far more general.

This is mostly true, but last time I checked, Stan still couldn't sample
discrete variables like BUGS can. Stan can only fit models with discrete
parameters (e.g. finite mixtures) if the programmer is smart enough to
integrate them out.
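
To make that concrete, the marginalization looks something like this: a
two-component normal mixture via PyStan, with the discrete component indicator
summed out in log space (a sketch; the data and priors are made up):

```python
# Sketch: a two-component normal mixture in Stan, with the discrete
# component indicator marginalized out via log_sum_exp.
import numpy as np
import pystan

mixture_code = """
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real<lower=0,upper=1> theta;  // mixing proportion
  ordered[2] mu;                // component means, ordered for identifiability
}
model {
  mu ~ normal(0, 10);
  for (n in 1:N)
    // log p(y[n]) = log( theta * N(y|mu1,1) + (1-theta) * N(y|mu2,1) )
    increment_log_prob(log_sum_exp(
      log(theta)   + normal_log(y[n], mu[1], 1),
      log1m(theta) + normal_log(y[n], mu[2], 1)));
}
"""

y = np.concatenate([np.random.normal(-2, 1, 100), np.random.normal(2, 1, 100)])
fit = pystan.stan(model_code=mixture_code, data={'N': len(y), 'y': y})
```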

------
dj-wonk
One of my first questions was "What are the motivations for a language over a
library?" After reading more, I see that Stan has both. There are integrations
with R, Python, Julia, C++, Stata, the command line, and more.

To ask an intentionally over-simplified question: if the key value-add of Stan
is in the functionality and approach -- why bother with the overhead of an
external DSL? Why not use an internal DSL -- e.g., borrow the data structures
and/or syntax of another language?

The arguments for an external DSL are, presumably, cross-language portability
and domain clarity for people.

But how does this play out if you want to support multiple languages and
interop?

Let's say I'm using Python, for example, and want to build up a model
programmatically. The example at
[https://github.com/stan-dev/pystan](https://github.com/stan-dev/pystan)
shows using a heredoc. That's simple and easy to start with.

However, a big downside of a plain-text DSL is that it can be hard to
manipulate programmatically. To do that, each language implementation has to
be in the 'business' of converting to and from text. This reminds me of the
SQL problem... many language SQL wrappers spend considerable time munging
text. Ugh.

Given my bias towards data-oriented programming, I see another approach: why
not use a rich data structure language (other than plain text) as the
foundation?

Why not edn, for example? If the LISPy nature of that is not preferred, then
why not something else? (Does the C-community have some kind of generic data
description language? I wouldn't recommend JSON necessarily but that would be
better than a hand-rolled DSL, in my opinion.)

Here's why I recommend starting with 'rich' data as opposed to parsing text...

I'm not saying human DSLs are bad. I'm just saying they don't _have_ to be the
foundation. If you want a human external DSL, great -- but why not convert the
human DSL to a canonical data structure? Then build everything around the
data, rather than build everything around the text format?
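
To make the idea concrete, here is a purely hypothetical sketch of a
model-as-data representation (none of this is real Stan tooling; the schema
is invented):

```python
# Purely hypothetical: a Stan-like model expressed as plain data rather than
# text, so host languages could build and transform it programmatically.
model = {
    "data": [
        {"name": "N", "type": "int", "lower": 0},
        {"name": "y", "type": "vector", "dims": ["N"]},
    ],
    "parameters": [
        {"name": "mu", "type": "real"},
        {"name": "sigma", "type": "real", "lower": 0},
    ],
    "model": [
        {"sample": "y", "dist": "normal", "args": ["mu", "sigma"]},
    ],
}

# Adding a prior is an ordinary data manipulation -- no text munging required.
model["model"].append({"sample": "mu", "dist": "normal", "args": [0, 10]})
```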

~~~
bengHN
As you mentioned, there are pros and cons to inventing the Stan DSL.
Originally, we wanted a language that was not too different from the BUGS
language because the BUGS family was what most applied Bayesians were using
(if they weren't using some specialized MCMC algorithm for their problem).

And programmatic manipulation of a Stan program, while not impossible, was not
high on our priority list. Since the underlying language is C++, you can
bypass the Stan language and write (or generate) the C++ directly, with a lot
of effort. It uses standard C++ (including Eigen) data structures.

~~~
dj-wonk
> you can bypass the Stan language and write (or generate) the C++ directly
> with a lot of effort

Did you mean "without" a lot of effort? :)

~~~
bengHN
Unfortunately no, but a few brave souls have done so. It is slowly getting
easier.

------
chmullig
Stan is a fantastic language/library/tool. (Disclosure: I've taken two classes
with Andrew Gelman[1].)

For those who haven't used it, the typical usage is to use Stan in combination
with another language (most commonly R). In the Stan language you define a
model: specifying the data you'll get, optionally transformations of that
data, then the set of parameters you want to fit, and finally a model saying
how those parameters interact with each other and the data. You can also
define priors for parameters. Then you typically save that model and, from R,
pass the variables into a Stan call. The resulting fit object is returned to
your original environment automatically.

It actually fits the work style reasonably well. All data munging happens as
before, but instead of having some complicated model expression in, say, a
glm() call, you have it in the Stan language.
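
In PyStan the same workflow looks roughly like this (a toy linear regression
with made-up data, just to show the data/parameters/model blocks and the fit
object coming back):

```python
# Toy version of the workflow described above, using PyStan instead of R.
import pystan

regression_code = """
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 5);                  // optional prior on the slope
  y ~ normal(alpha + beta * x, sigma);  // likelihood
}
"""

data = {'N': 3, 'x': [1.0, 2.0, 3.0], 'y': [2.1, 3.9, 6.2]}
fit = pystan.stan(model_code=regression_code, data=data)
print(fit)  # posterior summaries for alpha, beta, sigma
```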

If you're interested in this, Gelman has two great books: Gelman & Hill's book
on hierarchical models[2], which is applied and geared towards social science
researchers, and Bayesian Data Analysis[3].

[1] [http://andrewgelman.com/](http://andrewgelman.com/)

[2]
[http://www.stat.columbia.edu/~gelman/arm/](http://www.stat.columbia.edu/~gelman/arm/)

[3]
[http://www.stat.columbia.edu/~gelman/book/](http://www.stat.columbia.edu/~gelman/book/)

~~~
bengHN
Also, the rstanarm[1] R package (disclaimer: I co-wrote it) will be released
this month, which does not require the user to write any code in the Stan
language. Instead, you specify the likelihood of the data (for a few popular
regression-type models) using conventional R syntax and utilize Stan's
algorithms and optional priors on the parameters to draw from the posterior
distribution. In the demos/ directory of [1], we have replicated most of the
first half of Gelman & Hill's textbook and are starting on the second half,
which heavily utilizes our stan_glmer() function, compatible with the syntax
of the glmer() function in the lme4 R package.

[1] [https://github.com/stan-dev/rstanarm/](https://github.com/stan-dev/rstanarm/)

~~~
chmullig
oh neat!

------
mailshanx
Here is a well-deserved plug for Cam Davidson-Pilon's excellent book, Bayesian
Methods for Hackers:
[https://github.com/CamDavidsonPilon/Probabilistic-Programmin...](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers)

------
tel
Stan is absolutely a world-class general Bayesian sampling tool. It replaces
things like BUGS and uses fantastic algorithms for fast convergence. One of
the major people behind it, Andrew Gelman, is a leader in applied hierarchical
modeling and uses it to great effect in the social sciences.

If you have even a vague interest in applied statistics in fields with less
ability to design experimental controls, you should investigate this tool.

------
ericnovik
I found this recent paper showing how to fit and diagnose a simple non-linear
model in Stan helpful. It is written from a statistician's perspective.
[http://www.stat.columbia.edu/~gelman/research/published/stan...](http://www.stat.columbia.edu/~gelman/research/published/stan_jebs_2.pdf)

This one has more of a CS feel to it:
[http://www.stat.columbia.edu/~gelman/research/published/stan...](http://www.stat.columbia.edu/~gelman/research/published/stan-paper-revision-feb2015.pdf)

Lastly, here is my own turbulent experience with getting started with Stan:
[http://ericnovik.github.io/2015/08/14/Getting-Started-with-S...](http://ericnovik.github.io/2015/08/14/Getting-Started-with-Stan.html)

------
tristanz
The great thing about Stan is that it is geared toward practitioners doing
real work in science and social science, but it still manages to push the
boundary of stats research (NUTS, ADVI, penalized ML). Other probabilistic
programming languages are more expressive, but are typically much harder to
use for day-to-day work.

That being said, there's still a huge gulf between practitioners in science
and social science that are building parametric Bayesian models and
practitioners in the deep learning / ML community that are focused on building
more general, scalable, machine learning algorithms for tasks like machine
translation and question answering. It would be amazing if somebody could
reconcile these communities by showing deep learning models can be expressed
and fit effectively side-by-side with more parametric models.

------
skimpycompiler
Does anyone have experience with the speed of convergence for advanced models?
Or with defining sequence models in these probabilistic programming languages?

It seems most (if not all) probabilistic programming languages use sampling as
the general algorithm for training models.

So I guess it would be a hard job to express something like HMMs or CRFs in a
way that fits the framework (I suppose one can easily train HMMs/CRFs with a
fixed sequence length, but not an undefined length).

~~~
nextos
You can easily encode any generative sequence model in, e.g., Church (Scheme-
based), Anglican (Clojure-based), or PyMC (Python-based). These are Turing-
complete, so they can model any sampleable (computable) distribution.

See [http://forestdb.org/](http://forestdb.org/) for some examples, e.g., an
infinite HMM.

I have no experience with Stan. There was some controversy about whether it is
Turing-complete, but I cannot elaborate on that.

The obvious disadvantage of using a language that is too expressive for
something as simple as an HMM is that you lose the nice performance guarantees
given by specialized algorithms such as Viterbi. Theoretically, such languages
could achieve good efficiency by performing program transformations (e.g.,
with abstract interpretation), but in practice we are still a bit far from
that.
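
As an aside, the standard way to make an HMM work in a system that cannot
sample discrete parameters is the forward algorithm, which sums the hidden
state sequence out of the likelihood. A numpy sketch, with made-up matrices:

```python
# Sketch: HMM log-likelihood with the hidden state sequence summed out
# (the classic forward recursion), so only continuous parameters remain.
import numpy as np
from scipy.special import logsumexp

def hmm_loglik(log_init, log_trans, log_emit, obs):
    """log p(obs), marginalizing over all hidden state paths.

    log_init:  (K,)   log initial state probabilities
    log_trans: (K, K) log transition matrix, rows = from-state
    log_emit:  (K, V) log emission matrix, columns = observed symbol
    obs:       sequence of symbol indices (any length)
    """
    alpha = log_init + log_emit[:, obs[0]]
    for o in obs[1:]:
        # alpha[j] = log sum_i exp(alpha[i] + log_trans[i, j]) + log_emit[j, o]
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[:, o]
    return logsumexp(alpha)

# Made-up two-state example.
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.2, 0.8]])
log_emit = np.log([[0.7, 0.3], [0.1, 0.9]])
print(hmm_loglik(log_init, log_trans, log_emit, [0, 0, 1, 1, 1]))
```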

A nice trick is to implement your generative model in something very efficient
(e.g., Probabilistic C or something GPU-based). You can then forget the burden
of having to encode your own sampling procedure, while inference may still be
tractable.

~~~
mjn
There's some research on automatically deriving efficient algorithms for
classes of distributions, e.g. getting a specialized EM-like algorithm for
anything where EM-like algorithms are applicable:
[http://machinelearning.wustl.edu/mlpapers/paper_files/AA24.p...](http://machinelearning.wustl.edu/mlpapers/paper_files/AA24.pdf)

But obviously that doesn't cover classes as big as "anything expressible in
Church", so there's a pretty big gap. It's possible that languages like Church
could use techniques like that as part of an optimizing compiler that
recognizes specializable patterns, the way regular compilers do loop
transformations and auto-vectorization and such. But, like auto-vectorization,
not that easy to do.

~~~
nextos
Sure, I mentioned abstract interpretation, but as you say all optimizing
compiler techniques might be applicable.

------
proditus
Hi, I'm one of the Stan devs (I work on variational inference: ADVI).

Happy to answer any questions here.

~~~
chmullig
Yay Stan! So what's the deal with BBVI? I heard it's integrated in the newest
Stan, but can I use it? What for?

~~~
proditus
Glad to see the enthusiasm!

ADVI [1] is a variant of BBVI [2] where we fully leverage all of the amazing
things that Stan has to offer (like automatic differentiation and automatic
transformations of constrained parameters).

You can use ADVI to get an approximation to the Bayesian posterior. The
advantage of using ADVI over sampling is that ADVI is typically faster for
large models (both in terms of the number of parameters and the number of data
observations). ADVI is also a bit better at handling models with multi-modal
posteriors, such as mixture models.

ADVI is currently in CmdStan (but not yet in RStan or PyStan). We're
continuing to make the algorithm more robust.

[1] [http://arxiv.org/abs/1506.03431](http://arxiv.org/abs/1506.03431)

[2] [http://www.jmlr.org/proceedings/papers/v33/ranganath14.pdf](http://www.jmlr.org/proceedings/papers/v33/ranganath14.pdf)

------
powera
Can anyone explain to me briefly what Stan actually does that's different from
what I could do in Python without it? I don't want to have to read the entire
research paper to get a description in layman's terms.

~~~
bengHN
The number one thing, in my opinion, is that Stan's algorithms for drawing
from a posterior distribution produce samples with much less dependence among
adjacent draws than simpler MCMC algorithms like Metropolis-Hastings or Gibbs
samplers. Since it is often difficult to answer the question "How much
dependence is too much in practice?", it is prudent to use the algorithm that
yields the least dependence, because the effective sample size (from the
posterior distribution) per unit of wall time will usually be greater. PyMC3
has started to incorporate some of Stan's algorithms, although its
implementations are not as far along.
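
To put a number on "dependence among adjacent draws", here is a toy numpy
sketch (not Stan's actual estimator) of effective sample size computed from a
chain's autocorrelations:

```python
# Toy illustration: ESS = N / (1 + 2 * sum of positive autocorrelations).
# A sticky chain has far fewer effective draws than an iid one.
import numpy as np

def effective_sample_size(chain, max_lag=500):
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    centered = chain - chain.mean()
    denom = np.dot(centered, centered)
    rho_sum = 0.0
    for lag in range(1, max_lag):
        rho = np.dot(centered[:-lag], centered[lag:]) / denom
        if rho <= 0:   # truncate at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

iid = np.random.normal(size=10000)
ar1 = np.zeros(10000)              # AR(1) chain with autocorrelation 0.95
for t in range(1, 10000):
    ar1[t] = 0.95 * ar1[t - 1] + np.random.normal()

print(effective_sample_size(iid))  # close to 10000
print(effective_sample_size(ar1))  # a few hundred: many draws, little information
```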

~~~
wlievens
I know some of those words...

~~~
zmmmmm
Forget about the statistics and MCMC and all that. Imagine you have a function
that is a black box and returns a floating-point number. You can't see the
implementation, but you want to find its maximum. How do you do it?

If you're like me, you'd probably start with some kind of "grid search",
giving it evenly spaced parameters, evaluating it at each one, and then doing
some kind of gradient descent/ascent type algorithm. This will work, but in a
high-dimensional space (i.e., a function with many arguments) you have no
chance of covering even a tiny fraction of the parameter space. The key is,
you don't want to waste time evaluating a set of parameters if it doesn't tell
you anything new, i.e., if you get back a similar answer to what you already
had. This is what the GP means (I think) by trying to avoid "dependence
between adjacent draws".
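
To see why the grid breaks down, here's a toy sketch (the black_box function
is just a hypothetical stand-in for the opaque function):

```python
# Toy sketch: grid search over a hypothetical black-box function. The number
# of evaluations grows as points_per_axis ** dim, which is hopeless once the
# function takes many arguments.
import itertools
import numpy as np

def black_box(*args):
    # Hypothetical stand-in; peaks at (1, 2, 3, ...).
    return -sum((a - (i + 1)) ** 2 for i, a in enumerate(args))

def grid_search(dim, points_per_axis=10):
    axes = [np.linspace(-5, 5, points_per_axis)] * dim
    best = max(itertools.product(*axes), key=lambda p: black_box(*p))
    return best, points_per_axis ** dim

for dim in (1, 2, 3, 4):
    best, n_evals = grid_search(dim)
    print(dim, n_evals)  # 10, 100, 1000, 10000 evaluations for a coarse grid
```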

I could be wrong about all this, I'm trying to learn it too.

~~~
wlievens
Thanks, that clarifies it a bit.

------
svisser
It might give the right answers.

