BayesDB: A probabilistic programming platform (mit.edu)
222 points by erwan on July 9, 2017 | 15 comments

Anyone doing probabilistic programming with big data? I started experimenting with probabilistic programming frameworks several years ago, but couldn't get them to scale to the level of data I'm working with (~6 dimensions, ~100 trillion vectors), or even a small fraction of that. But I'm sure it's being done in scientific circles somewhere.

Are there communities to collaborate on probabilistic programming? It seems like the domain knowledge is obscure enough that all the good information is locked up in big corporations and academia.

Things have gotten much more mature in the past few years.

Check out edwardlib.org, which is adapting TensorFlow to support probabilistic modeling (with a heavy focus on variational inference, less so on MCMC methods). If you've got trillions of observations you can use stochastic VI. TensorFlow can now do distributed computation graphs, or you could just go data parallel and then average your parameters at the end.
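The "go data parallel and average at the end" idea can be sketched with a toy conjugate model in plain numpy. Everything here — the synthetic data, the known noise scale, the precision-weighted combination of shard posteriors — is an illustrative assumption, not Edward's or TensorFlow's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: lots of noisy readings of one quantity, known
# noise sigma = 1, weak conjugate Normal prior on the mean.
data = rng.normal(2.0, 1.0, size=1_000_000)

def posterior(x, prior_mu=0.0, prior_sd=10.0, sigma=1.0):
    """Exact conjugate posterior N(mu, sd) for a Gaussian mean."""
    prec = 1.0 / prior_sd**2 + len(x) / sigma**2
    mu = (prior_mu / prior_sd**2 + x.sum() / sigma**2) / prec
    return mu, 1.0 / np.sqrt(prec)

# Data parallel: compute a sub-posterior on each shard independently,
# then combine by precision weighting (a consensus-style average).
shards = np.array_split(data, 10)
mus, sds = zip(*(posterior(s) for s in shards))
precs = 1.0 / np.square(sds)
combined_mu = np.sum(precs * np.array(mus)) / precs.sum()

# Compare against the posterior computed on all the data at once.
full_mu, _ = posterior(data)
print(combined_mu, full_mu)  # nearly identical
```

For real models the sub-posteriors aren't Gaussian and the combination step gets subtler, but this is the basic shape of the trick.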

In general David Blei’s group at Columbia does a lot of work in scalable probabilistic inference.

The other big option is of course Stan, which is really well optimized but I don’t think is particularly intended for “big data” of this kind. If you have “medium data” that fits on one machine though, it’s blazing fast.

Funny you mention those.

I've been really excited about Edward but when I tried it for a project last year I could never get it to come together in the right way. I got the sense that it wasn't quite ready for prime time yet, although very promising. My memory of it was that a lot of claimed flexibility in how to specify models wasn't really implemented fully. The experience also turned me off of TensorFlow a bit. But that was a year ago, so maybe it's improved?

I ended up doing it in Stan in part because I was more familiar with that, and it worked out fairly well.

Just a personal anecdote.

I have not, so take my words with a grain of salt. However, with only 6 dimensions, 100 trillion vectors is almost certainly overkill (unless you're talking about time-series data?). Probabilistic programming, from what I understand, thrives in situations where you do not have a lot of data, not an overabundance of it.

You could probably achieve very interesting results by taking a much smaller sample from your large data set.

Could be. I'm just a software engineer with a high pain tolerance and a lot of tenacity who is trying to draw conclusions from lots of error-ridden data. It seems difficult to find experts in this space; I've interviewed several statistics Ph.D.s from reputable schools, but no luck yet. By the way, sampling doesn't seem to help because all of the data is relational. Or maybe I just don't know how to sample it effectively.

Anyway, in my case, I have data feeds, but none of them are 100% reliable. There is error, and I can guess the error. I want to infer things from the data, but I know that the conclusions are unreliable. So, I want to know how unreliable my conclusions are, if that makes sense.

Anyway, I'm an amateur here, but my independent research led me to things like MCMC and probabilistic programming, which allow me to model these things better.
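For what it's worth, "how unreliable are my conclusions" is exactly what MCMC gives you: a posterior distribution instead of a point estimate. A toy random-walk Metropolis sampler in plain numpy, estimating one quantity from noisy readings with a guessed measurement error (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feed: noisy readings of a true quantity, with a
# measurement error we can guess (sigma = 3.0 here).
true_value, sigma = 10.0, 3.0
obs = rng.normal(true_value, sigma, size=50)

def log_post(theta):
    # Flat prior; Gaussian likelihood with known noise sigma.
    return -0.5 * np.sum((obs - theta) ** 2) / sigma**2

# Random-walk Metropolis: the spread of the chain is the uncertainty.
samples, theta, lp = [], obs.mean(), log_post(obs.mean())
for _ in range(20_000):
    prop = theta + rng.normal(0, 1.0)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta)

samples = np.array(samples[5000:])  # drop burn-in
lo, hi = np.percentile(samples, [2.5, 97.5])
print(f"estimate {samples.mean():.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

The credible interval is the "how unreliable" answer; with relational data the model gets more structured, but the sampler's job is the same.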

Welcome to the project I've been waiting for years to get out of alpha. It's frustrating. If I had a hundred million dollars I'd burn a couple million getting this funded. It seems like it will be useful to humanity.

... they are funded. (I know at least one of the investors.) https://empirical.com/

I can imagine this could be really great wrapped up as a Postgres extension

This sounds kind of similar to the stuff this startup called "Prior Knowledge" was working on prior to being acquired by Salesforce: https://www.crunchbase.com/organization/prior-knowledge#/ent...

Glad to see this is out as well! Using probabilistic frameworks has the potential to eliminate a lot of the human error that can easily enter a large simulation. It's fair to say that in the future, probabilistic modules will become part of every language's standard library, and distribution-sampling functions will be as common as trig functions in a math library.

I am curious though how I would build up large queries in the BQL (SQL-like query language) or MML (meta-modeling language). For the orbital example, we conceivably only have a relatively low dimensional space. But what about a Bayes net for investigating genetic variants in a large genomic population? Doesn't this quickly become intractable?
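To make the question concrete, here's a rough sketch of the kind of BQL one might write for a genomics-style table. The syntax is approximated from the BayesDB documentation as I remember it, and the table, population, and column names are all hypothetical:

```sql
-- Hypothetical population over a genomics-style table.
CREATE POPULATION patients FOR patients_table WITH SCHEMA (
    GUESS STATTYPES OF (*)
);

-- How likely is it that a given variant column is predictively
-- related to the phenotype column?
ESTIMATE DEPENDENCE PROBABILITY OF variant_rs123 WITH phenotype
    FROM patients;

-- Simulate plausible phenotypes conditioned on carrying the variant.
SIMULATE phenotype FROM patients GIVEN variant_rs123 = 1 LIMIT 100;
```

The tractability worry seems real: a query like the dependence-probability one above would, naively, be asked pairwise across millions of variant columns.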

Some probabilistic systems can cope with big volumes of data, e.g. http://i.stanford.edu/hazy/

Note that Apple recently acquired this technology. Lattice Data is the commercialization of DeepDive.

Is there a comparison of its accuracy against traditional methods? Admittedly, this machine-assisted modeling sounds really interesting.

They really missed out on an opportunity to call it DataBayes.
