
BayesDB: A probabilistic programming platform - erwan
http://probcomp.csail.mit.edu/bayesdb/
======
afpx
Anyone doing probabilistic programming with big data? I started experimenting
with probabilistic programming frameworks several years ago, but couldn't get
them to scale to the level of data I'm working with (~6 dimensions, ~100
trillion vectors), or even a small fraction of that. But I'm sure it's being
done in scientific circles somewhere.

Are there communities to collaborate on probabilistic programming? It seems
like the domain knowledge is obscure enough that all the good information is
locked up in big corporations and academia.

~~~
currymj
Things have gotten much more mature in the past few years.

Check out edwardlib.org, which is adapting TensorFlow to support probabilistic
modeling (with a heavy focus on variational inference, less so on MCMC
methods). If you’ve got trillions of observations you can use stochastic VI.
And TensorFlow can now do distributed computation graphs, or you could just go
data parallel and then average your parameters at the end.
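
To make the "go data parallel, then average" idea concrete, here is a toy
sketch in plain Python. It is not Edward or TensorFlow code, and it is not
stochastic VI: the model (a unit-variance Gaussian with unknown mean), the
function names, and all the step sizes are assumptions picked for
illustration. Each "worker" runs minibatch SGD on its own shard, and the
per-shard estimates are averaged once at the end.

```python
import random


def sgd_fit_mean(shard, steps=2000, batch_size=32, lr=0.05, seed=0):
    """Estimate the mean of a unit-variance Gaussian by minibatch SGD.

    Hypothetical helper: stands in for whatever a single worker would run.
    """
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(steps):
        batch = rng.sample(shard, batch_size)
        # Gradient of the average negative log-likelihood with respect to mu.
        grad = sum(mu - x for x in batch) / batch_size
        mu -= lr * grad
    return mu


def data_parallel_fit(data, n_workers=4):
    """Fit each shard independently, then average the per-shard parameters."""
    shards = [data[i::n_workers] for i in range(n_workers)]
    # Run sequentially here; in practice each shard would live on its own machine.
    estimates = [sgd_fit_mean(shard, seed=i) for i, shard in enumerate(shards)]
    return sum(estimates) / len(estimates)


# Synthetic data with a known mean of 3.0, so the result can be sanity-checked.
data_rng = random.Random(42)
data = [data_rng.gauss(3.0, 1.0) for _ in range(20000)]
mu_hat = data_parallel_fit(data)
print(mu_hat)
```

Averaging works here because the problem is convex with a single
well-identified parameter. For models with symmetries (mixtures, neural nets)
naive averaging of independently trained parameters can give nonsense, so
treat this as the best case.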

In general David Blei’s group at Columbia does a lot of work in scalable
probabilistic inference.

The other big option is of course Stan, which is really well optimized but I
don’t think is particularly intended for “big data” of this kind. If you have
“medium data” that fits on one machine though, it’s blazing fast.

~~~
2309kdujj
Funny you mention those.

I've been really excited about Edward but when I tried it for a project last
year I could never get it to come together in the right way. I got the sense
that it wasn't quite ready for prime time yet, although very promising. My
memory of it was that a lot of claimed flexibility in how to specify models
wasn't really implemented fully. The experience also turned me off of
TensorFlow a bit. But that was a year ago, so maybe it's improved?

I ended up doing it in Stan in part because I was more familiar with that, and
it worked out fairly well.

Just a personal anecdote.

------
3pt14159
Welcome to the project I've been waiting for _years_ to get out of alpha. It's
frustrating. If I had a hundred million dollars I'd burn a couple million
getting this funded. It seems like it will be useful to humanity.

~~~
carterschonwald
... they are funded. (I know of >=1 of the investors)
[https://empirical.com/](https://empirical.com/)

------
mayneack
previous discussions:

[https://news.ycombinator.com/item?id=6864339](https://news.ycombinator.com/item?id=6864339)

[https://news.ycombinator.com/item?id=10750900](https://news.ycombinator.com/item?id=10750900)

------
jarym
I can imagine this could be really great wrapped up as a Postgres extension.

------
stanfordkid
This sounds kind of similar to the stuff this startup called "Prior Knowledge"
was working on prior to being acquired by Salesforce:
[https://www.crunchbase.com/organization/prior-knowledge#/entity](https://www.crunchbase.com/organization/prior-knowledge#/entity)

------
indescions_2017
Glad to see this is out as well! Using probabilistic frameworks has the
potential to eliminate a lot of the human error which can easily enter a large
simulation. It's fair to say in the future probabilistic modules will become
part of every standard library in every programming language, and distribution
sampling functions will be as common as trig functions in a math library.

I am curious though how I would build up large queries in the BQL (SQL-like
query language) or MML (meta-modeling language). For the orbital example, we
conceivably only have a relatively low dimensional space. But what about a
Bayes net for investigating genetic variants in a large genomic population?
Doesn't this quickly become intractable?
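
To put a number on that worry, assuming the concern is exhaustive search over
network structures: the count of possible DAGs on n labeled variables can be
computed with Robinson's recurrence. The sketch below (function name is mine)
only illustrates why brute force is hopeless; it says nothing about how
BayesDB itself builds models, and real structure learners rely on heuristics
or approximations instead.

```python
from math import comb


def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:

    a(m) = sum_{k=1..m} (-1)^(k+1) * C(m, k) * 2^(k*(m-k)) * a(m-k), a(0) = 1
    """
    counts = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]


# Print how many decimal digits the count has as the variable count grows.
for n in (3, 5, 10, 20, 40):
    print(n, len(str(count_dags(n))))
```

Already at 5 variables there are 29281 candidate structures, and the count
grows super-exponentially from there, so a genomic study with thousands of
variants is far outside anything enumerable.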

~~~
mozartoz
Some probabilistic systems can cope with big volumes of data. E.g.
[http://i.stanford.edu/hazy/](http://i.stanford.edu/hazy/)

~~~
sanxiyn
Note that Apple recently acquired this technology. Lattice Data is the
commercialization of DeepDive.

------
kensai
Is there a comparison of its accuracy against traditional methods? Admittedly,
this machine-assisted modeling sounds really interesting.

------
elvinyung
They really missed out on an opportunity to call it _DataBayes_.

