
BayesDB - a Bayesian database table - sebg
http://probcomp.csail.mit.edu/bayesdb/
======
tristanz
This looks like the technology behind Prior Knowledge
([http://priorknowledge.com/](http://priorknowledge.com/)), which is not
surprising given that a lot of the same people are involved. Awesome that it's
going open source.

It will be interesting to see how far the database analogy can be pushed. The
key thing to realize is that BayesDB is based on a particular model, CrossCat
([http://probcomp.csail.mit.edu/crosscat/](http://probcomp.csail.mit.edu/crosscat/)).
If your database table is an adjacency list that represents a graph, for
instance, it's not going to perform very well compared to a more tailored
model. On the other hand, a generic approach to high-dimensional imputation is
very useful.

~~~
stochastician
Eric Jonas here, former CEO of Prior Knowledge, cofounder of Navia, etc. etc.
[and obviously not speaking for my employer] Tristan is right, our tech was
based on CrossCat, a model originally developed in Josh Tenenbaum's lab at
MIT. We fundamentally believe that structured probabilistic models are the
future -- this is very much the direction that my latest research has taken,
for example.

The reality is that almost any model that lets you build up a good joint
density estimate of the data can be useful for this sort of tabular data.
CrossCat assumes that your joint density factors into several independent
factors -- that is, p(x, y, z, w) = p(x, y) p(z, w), for example. Of course,
because it's a Dirichlet process of Dirichlet processes, all of that structure
is learned from the data. But that's not always the best approach; there are
others, and it will be fun to see what else comes out of the research
community.
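To make that factorization concrete, here's a toy Python sketch (nothing to do with the real CrossCat code -- the particular densities are made up for illustration): a joint over four columns that splits into two independent "views", so that conditioning on one view tells you nothing about the other.

```python
import math

def p_xy(x, y):
    # view 1: bivariate normal with correlation 0.5 (made-up choice)
    rho = 0.5
    norm = 1.0 / (2 * math.pi * math.sqrt(1 - rho ** 2))
    return norm * math.exp(-(x * x - 2 * rho * x * y + y * y)
                           / (2 * (1 - rho ** 2)))

def p_zw(z, w):
    # view 2: two independent standard normals (made-up choice)
    return math.exp(-(z * z + w * w) / 2) / (2 * math.pi)

def joint(x, y, z, w):
    # the factorization assumption: views are mutually independent
    return p_xy(x, y) * p_zw(z, w)

def p_xy_given_zw(x, y, z, w):
    # one consequence: conditioning on (z, w) leaves p(x, y) unchanged
    return joint(x, y, z, w) / p_zw(z, w)
```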

And Tristan's also right that this is not always a great fit for your data --
if it's graph based, or if your data are super-high-dimensional, or whatever,
it's not a perfect fit. But it is a good start.

For a great review of the Dirichlet Process/CRP, the Indian Buffet Process,
and other probability distributions over infinite structures, I highly
recommend this great tech report from Zoubin and Tom:

[http://mlg.eng.cam.ac.uk/pub/pdf/GriGha11.pdf](http://mlg.eng.cam.ac.uk/pub/pdf/GriGha11.pdf)
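For the curious, the CRP itself takes only a few lines to simulate. A minimal sketch (illustrative only, not taken from any of these codebases):

```python
import random

def crp_sample(n, alpha, seed=0):
    """Simulate one partition of n customers from a Chinese Restaurant
    Process: customer i sits at an existing table with probability
    proportional to its occupancy, or at a new table with probability
    proportional to the concentration parameter alpha."""
    rng = random.Random(seed)
    counts, assignments = [], []
    for _ in range(n):
        weights = counts + [alpha]   # existing tables, then a new one
        r = rng.random() * sum(weights)
        k = 0
        while r >= weights[k]:
            r -= weights[k]
            k += 1
        if k == len(counts):
            counts.append(0)         # open a new table
        counts[k] += 1
        assignments.append(k)
    return assignments
```

Larger alpha yields more tables on average; as alpha approaches zero, everyone tends to share one table.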

~~~
jaybaxter
Agreed! CrossCat works really well for many datasets, but I am also very
excited about implementing BayesDB with other joint density estimators.

------
taliesinb
Interesting project!

I have some questions that I couldn't immediately answer from skimming either
the BayesDB documentation or the paper linked from
[http://probcomp.csail.mit.edu/crosscat/](http://probcomp.csail.mit.edu/crosscat/)

* The CrossCat paper seems to focus on binary features and categorical learning. How does BayesDB generalize that? Does it quantize continuous features first to make everything categorical, or does it generalize CrossCat somehow?

* How are we to think about what BayesDB is doing? Is the underlying model most similar to a graphical model? A Bayesian network? A Markov random field?

* On an informal level, how is the factorization structure learned?

* What's the time and memory complexity in terms of number of features and examples for different operations? Is insertion constant time? Is it storing sparse contingency tables of some kind?

~~~
jaybaxter
Thanks!

While the original CrossCat paper focused on binary features, the model is in
fact much more general: CrossCat uses a Beta-Bernoulli model for binary
features, a Normal-Gamma model for continuous data, and a
Dirichlet-Multinomial model for categorical data.

CrossCat is a generative Bayesian nonparametric probabilistic model.
Informally, the generative process assumed by CrossCat is that the columns are
clustered (into "views") according to a Dirichlet Process, then the rows
within each view are clustered by another Dirichlet Process. Then, the data is
generated by the datatype-appropriate component model for each cluster.
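That generative process is easy to forward-simulate. Here's a toy sketch -- the concentration parameter, the unit-variance Gaussian component model, and the N(0, 5^2) prior on cluster means are all made-up illustrative choices, not what CrossCat actually uses:

```python
import random

def crp(n, alpha, rng):
    """One draw from a Chinese Restaurant Process partition of n items."""
    counts, z = [], []
    for _ in range(n):
        weights = counts + [alpha]   # existing clusters, then a new one
        r = rng.random() * sum(weights)
        k = 0
        while r >= weights[k]:
            r -= weights[k]
            k += 1
        if k == len(counts):
            counts.append(0)         # open a new cluster
        counts[k] += 1
        z.append(k)
    return z

def simulate_crosscat(n_rows, n_cols, alpha=1.0, seed=0):
    """Forward-simulate the process described above: columns are
    partitioned into views by a CRP, the rows are partitioned
    independently within each view by another CRP, and each cell is
    drawn from its cluster's component model (here a unit-variance
    Gaussian with a cluster-specific mean)."""
    rng = random.Random(seed)
    view_of_col = crp(n_cols, alpha, rng)
    n_views = max(view_of_col) + 1
    row_clusters = [crp(n_rows, alpha, rng) for _ in range(n_views)]
    # cluster means drawn from a wide prior (made-up hyperparameters)
    means = [[rng.gauss(0, 5) for _ in range(max(z) + 1)]
             for z in row_clusters]
    table = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            v = view_of_col[c]
            row.append(rng.gauss(means[v][row_clusters[v][r]], 1))
        table.append(row)
    return view_of_col, row_clusters, table
```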

INFER, SIMULATE, and INSERT are all constant time, and most other operations
scale linearly with the number of rows or columns, including inference. It
doesn't store any sparse contingency tables or anything like that -- all it
stores are CrossCat posterior samples.

~~~
teraflop
What about the individual cluster models -- can they represent dependencies of
variables within a cluster? For instance, I'm thinking about the "salary
prediction" example. Is the salary variable considered to be conditionally
independent of all the other variables, given the cluster assignment? Or can
it learn something like an additive model, where categorical variables are
associated with higher or lower salaries within a cluster?

Or to use another example, can it learn correlations between two continuous
variables, to solve things like linear regression?

~~~
jaybaxter
You are correct that each variable is considered conditionally independent of
the other variables given the cluster assignment. CrossCat learns additive
models and correlations between continuous variables by using many clusters
(the clusters don't necessarily have meaningful real-world interpretations).
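To see how conditionally independent clusters can still produce a marginal correlation, here's a toy sketch (the means and mixture weights are made up for the example): within each component, x and y are drawn independently, yet the mixture as a whole shows a strong positive correlation.

```python
import random

def sample_mixture(n, seed=0):
    """Two clusters; within each, x and y are independent. Because the
    cluster means differ (+/-2 here), the pooled sample is correlated."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        m = -2.0 if rng.random() < 0.5 else 2.0
        pts.append((rng.gauss(m, 1), rng.gauss(m, 1)))
    return pts

def correlation(pts):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    cov = sum((x - mx) * (y - my) for x, y in pts) / n
    vx = sum((x - mx) ** 2 for x, _ in pts) / n
    vy = sum((y - my) ** 2 for _, y in pts) / n
    return cov / (vx * vy) ** 0.5
```

In the large-sample limit the pooled correlation here approaches 0.8 even though each cluster's correlation is 0, which is how a clusters-only model can mimic a regression-style dependence.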

------
murbard2
The crosscat paper
([http://web.mit.edu/vkm/www/crosscat.pdf](http://web.mit.edu/vkm/www/crosscat.pdf))
is super-terse. Is there a more gentle description somewhere?

I'm quite familiar with generative models and MCMC sampling and reasonably
familiar with the Dirichlet process though I've never implemented it.

~~~
jaybaxter
This computational cognitive science paper about CrossCat
([http://shaftolab.com/lab_papers/repository/shaftoetal06_cros...](http://shaftolab.com/lab_papers/repository/shaftoetal06_crosscat.pdf))
provides a different perspective that may be more approachable.

------
bravura
This looks great. It has similarities to PreQL, the query language from Prior
Knowledge (acquired by Salesforce).

I would love to see a full set of QL primitives for common data science
operations.

------
tlarkworthy
Scaling? Can I get images in it? Can I reslice my data? Is it all in memory?

~~~
jaybaxter
Hi, I'm one of the lead developers on the project. It's a relatively young
project, so keep an eye out for lots of substantial releases over the next few
months.

So far, we have been focusing on smaller dataset sizes, such as 10,000 rows by
100 columns, but everything scales linearly (both query processing and offline
inference) so you could use it for larger datasets too. The current version is
all in memory, though, so you're limited there.

Right now you must import data from csv, so no images, and you must do all
preprocessing of your data before loading it in. We hope to add more and more
of this kind of functionality in later releases. I'd love to hear suggestions!
I recommend trying out the VM installation if you want to quickly play around
with it.

~~~
jzwinck
Here's a suggestion for importing data: support HDF5. It's like CSV, but a lot
faster and better. And the set of people who use HDF5 probably overlaps a fair
bit with your target audience.

------
sbashyal
This looks awesome! If it works the way suggested, it's going to save me a
lot of time on Bayesian analysis in upcoming projects.

------
benjaminsky
I'm having trouble getting the examples to work on the VirtualBox VM.
run_dha_example gives me "None" every time. I've run client.execute with
--timing=True and --pretty=False with the same results. Am I doing something
stupid? :)

~~~
dartdog
I got the VM up but can't get it to do anything useful... no next steps after
getting the VM up? Also, it would be very nice to have a more granular
install. I just spent the day fixing my system after attempting a native
install on Ubuntu 13.10 (still have more to repair). A list of prerequisites
would be better than the current unified install.

~~~
jaybaxter
Once you have the VM up, you can try running the demo scripts located at
~/bayesdb/examples/*/*.py, after you've checked out and pulled master from
~/bayesdb and ~/crosscat.

Yes, the install process is a pain in the current release (sorry!), but the
next release (almost ready) will be much friendlier, with a more granular
install.

------
janetboreta
Hi Jay, I was talking with Anne just now and we need explanation on another
level, I am afraid. I would like to understand BayesDB, and I wonder if there
is a layman's summary available? It was great seeing you at Thanksgiving at
Beachcliff! Mimi

------
drewda
A question for the developers: Does this support time series data?

~~~
jaybaxter
The answer is both yes and no. In principle, longitudinal/panel data can be
run through BayesDB, though BayesDB would have to infer the temporal
structure. Also, we've toyed a little with forecasting by making a table where
each row is a sliding window.
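That sliding-window trick is simple to sketch (illustrative only -- this isn't BayesDB code):

```python
def sliding_window_table(series, width):
    """Turn a 1-D series into a table whose rows are consecutive
    windows of `width` values. Treating the last column as the target,
    INFER-style imputation on a new row then acts as a one-step-ahead
    forecast."""
    return [list(series[i:i + width])
            for i in range(len(series) - width + 1)]
```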

That said, BayesDB is really about the classic multivariate statistics
setting: each row is a sample from some population. We think that a streaming
Bayesian database that models sequences of timestamped UPDATEs to a DB (and
supports FORECAST in addition to INFER) is an interesting, distinct project
that we've done a little work on.

Contact us if you're interested in this kind of data and we'd be happy to talk
more.

------
seamusabshere
can we get that in postgresql 9.3.3 ok tks bye

~~~
technoir
You can get predicted columns via
[https://github.com/no0p/alps](https://github.com/no0p/alps) ... It turns out
that adding words like INFER to the lexer is kind of a big project in pgsql.
Excellent work with the language aspect of it in BayesDB -- well thought out.

~~~
seamusabshere
thanks for the tip! it was worth the karma i lost for my attempt at humor...

------
danso
This must be a joke, a cruel, cruel attempt to troll poor data programmers who
have waited forever for something like this to be made.

