Are there communities to collaborate on probabilistic programming? It seems like the domain knowledge is obscure enough that all the good information is locked up in big corporations and academia.
Check out edwardlib.org, which is adapting TensorFlow to support probabilistic modeling (with a heavy focus on variational inference, less so on MCMC methods). If you’ve got trillions of observations you can use stochastic VI. And TensorFlow can now do distributed computation graphs, or you could just go data parallel and then average your parameters at the end.
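To make the "go data parallel and average your parameters" idea concrete, here is a minimal sketch in plain Python, not anything from Edward or TensorFlow. The `fit_shard` and `data_parallel_fit` names are made up for illustration, and the "model" is just the maximum-likelihood estimate of a Gaussian mean. For that estimator with equal-size shards the average matches the full-data fit; for nonlinear models, averaging is only an approximation.

```python
import random
from statistics import fmean

def fit_shard(shard):
    # Hypothetical per-worker "fit": the MLE of a Gaussian mean is the sample mean.
    return fmean(shard)

def data_parallel_fit(data, n_shards):
    # Split into equal-size shards; any remainder is dropped to keep them equal.
    size = len(data) // n_shards
    shards = [data[i * size:(i + 1) * size] for i in range(n_shards)]
    # In a real setup each call would run on a separate worker.
    estimates = [fit_shard(s) for s in shards]
    # Combine by simple parameter averaging.
    return fmean(estimates)

rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(10_000)]  # synthetic data, true mean 3.0

averaged = data_parallel_fit(data, n_shards=10)
full = fmean(data)
print(averaged, full)
```

With unequal shard sizes you would want to weight each estimate by its shard size rather than take a plain average.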
In general David Blei’s group at Columbia does a lot of work in scalable probabilistic inference.
The other big option is of course Stan, which is really well optimized, but I don’t think it is particularly intended for “big data” of this kind. If you have “medium data” that fits on one machine though, it’s blazing fast.
I've been really excited about Edward, but when I tried it for a project last year I could never get it to come together in the right way. I got the sense that it wasn't quite ready for prime time yet, although very promising. My memory of it is that a lot of the claimed flexibility in how to specify models wasn't fully implemented. The experience also turned me off of TensorFlow a bit. But that was a year ago, so maybe it's improved?
I ended up doing it in Stan in part because I was more familiar with that, and it worked out fairly well.
Just a personal anecdote.
You could probably achieve very interesting results by taking a much smaller sample from your large data set.
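A rough illustration of why a small sample can go a long way, using synthetic data and only the Python standard library (all numbers here are made up). The uncertainty of a simple estimate like a mean shrinks like 1/sqrt(n), so a couple of thousand points already pin it down fairly tightly:

```python
import math
import random
from statistics import fmean, stdev

rng = random.Random(1)
# Synthetic stand-in for the "large data set".
population = [rng.gauss(10.0, 3.0) for _ in range(200_000)]

# Draw a 1% random subsample without replacement.
sample = rng.sample(population, 2_000)

sub_mean = fmean(sample)
# Standard error of the mean: sample std dev divided by sqrt(n).
std_err = stdev(sample) / math.sqrt(len(sample))
full_mean = fmean(population)

print(f"subsample mean = {sub_mean:.3f} +/- {std_err:.3f}")
print(f"full-data mean = {full_mean:.3f}")
```

This only covers a simple summary statistic; rare events or heavy-tailed data may need a larger or stratified sample.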
Anyway, in my case, I have data feeds, but none of them are 100% reliable. There is error, and I can guess the error. I want to infer things from the data, but I know that the conclusions are unreliable. So, I want to know how unreliable my conclusions are, if that makes sense.
Anyway, I'm an amateur here. But my independent research led me to things like MCMC and probabilistic programming, which allow me to model things better.
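For the "noisy feeds with a guessed error" situation, a minimal sketch of what MCMC buys you might look like the following. Everything here is an assumption for illustration: three hypothetical feeds report the same quantity, each with a guessed Gaussian error, the prior on the true value is flat, and the sampler is a hand-rolled random-walk Metropolis rather than any particular library's. The spread of the posterior draws is the "how unreliable is my conclusion" number.

```python
import math
import random
from statistics import fmean, stdev

# Hypothetical readings: (value, guessed error std dev) for each feed.
readings = [(10.2, 0.5), (9.6, 1.0), (10.9, 2.0)]

def log_posterior(mu):
    # Flat prior on mu; Gaussian noise with a known std dev per feed.
    return sum(-0.5 * ((y - mu) / s) ** 2 for y, s in readings)

def metropolis(n_steps, step_size, start, rng):
    samples = []
    mu, lp = start, log_posterior(start)
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        lp_new = log_posterior(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            mu, lp = proposal, lp_new
        samples.append(mu)
    return samples

rng = random.Random(42)
draws = metropolis(20_000, 0.8, start=10.0, rng=rng)[2_000:]  # drop burn-in

post_mean = fmean(draws)
post_sd = stdev(draws)

# For this simple model the posterior is Gaussian, so we can check the sampler:
precision = sum(1.0 / s**2 for _, s in readings)
exact_mean = sum(y / s**2 for y, s in readings) / precision
exact_sd = 1.0 / math.sqrt(precision)

print(f"MCMC:  {post_mean:.3f} +/- {post_sd:.3f}")
print(f"exact: {exact_mean:.3f} +/- {exact_sd:.3f}")
```

This toy case has a closed-form answer, which is why it makes a handy sanity check. The point of tools like Stan is that the same approach keeps working when the model gets too complicated for a formula.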
I am curious though how I would build up large queries in the BQL (SQL-like query language) or MML (meta-modeling language). For the orbital example, we conceivably only have a relatively low dimensional space. But what about a Bayes net for investigating genetic variants in a large genomic population? Doesn't this quickly become intractable?