

PredictionIO – open source machine learning for predictive features - rnyman
https://hacks.mozilla.org/2014/04/introducing-predictionio/

======
duaneb
I wish it didn't depend on MongoDB. Is there a good reason for not using a
more general purpose database, like SQL or any of the billion key value stores
out there?

~~~
dszeto
We picked MongoDB a while ago because of its built-in geospatial indexing
support. We'd be happy to add support for other databases. Any strong
candidates off the top of your head?
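For concreteness, here's a rough Python sketch of the kind of geospatial lookup a 2dsphere index serves. The collection name, `loc` field, and coordinates are hypothetical, not PredictionIO's actual schema, and the demo function (which needs a running mongod) is never called:

```python
# Sketch of a MongoDB $near geospatial query, assuming a hypothetical
# "items" collection whose documents store a GeoJSON point under "loc".

def near_query(lon, lat, max_meters):
    """Build a MongoDB $near query document around a GeoJSON point."""
    return {"loc": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [lon, lat]},
        "$maxDistance": max_meters,  # meters, requires a 2dsphere index
    }}}

def demo():
    # Requires a mongod on localhost; shown for illustration only.
    from pymongo import MongoClient, GEOSPHERE
    items = MongoClient().demo_db.items          # hypothetical collection
    items.create_index([("loc", GEOSPHERE)])     # 2dsphere index on "loc"
    return list(items.find(near_query(-122.27, 37.80, 5000)))
```

The query document itself is plain JSON-like data, so any store with comparable geo indexing (e.g. PostGIS) could serve the same access pattern with a different query syntax.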

~~~
bduerst
I know it's a pain to get data into it, but how about BigQuery?

~~~
tstonez
Ideally open-source. Surprised MongoDB 2.2+ is such an issue, but thanks for
the feedback and suggestions. We'll look into PostgreSQL, Riak and RethinkDB.

~~~
bduerst
It is based on an open-sourced db called Dremel:

[https://code.google.com/p/dremel/](https://code.google.com/p/dremel/)

~~~
yipjustin
But Dremel doesn't support incremental updates. Dremel is designed for read-
only data. All its columns are indexed, so the whole table needs to be rebuilt
after an update.
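A toy sketch of why that matters: if every column keeps a fully built index, a single appended row invalidates all of them. The class below is purely illustrative, not Dremel's actual design:

```python
# Illustrative (hypothetical) read-optimized columnar table: each column
# keeps a sorted index, and any append forces every index to be rebuilt,
# which is why such stores favor batch reloads over incremental updates.
class ColumnarTable:
    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}
        self.indexes = {}
        self.rebuilds = 0  # count full index rebuilds

    def load(self, rows):
        """Bulk load: one rebuild amortized over many rows."""
        for row in rows:
            for name, value in row.items():
                self.columns[name].append(value)
        self._rebuild_indexes()

    def append(self, row):
        """Incremental update: still pays for a full rebuild."""
        for name, value in row.items():
            self.columns[name].append(value)
        self._rebuild_indexes()

    def _rebuild_indexes(self):
        for name, values in self.columns.items():
            # Sorted positions of each value: a stand-in for a real index.
            self.indexes[name] = sorted(range(len(values)),
                                        key=values.__getitem__)
        self.rebuilds += 1
```

Real systems mitigate this with delta stores and periodic merges, but the read-only design point stands.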

------
lefrancaiz
It also comes with a Vagrant box if you're looking to just try it out in
your dev environment. I've tried it personally and had a good experience with
it. I spot-checked the recommendations and they felt really good.

[http://docs.prediction.io/current/installation/install-predi...](http://docs.prediction.io/current/installation/install-predictionio-with-virtualbox-vagrant.html)

------
wheaties
Sounds fantastic right up until I read that it requires MongoDB. Thankfully
it's open source and we can fix that.

~~~
berto99
Why is mongo a problem?

~~~
micro_cam
For me it is a red flag in terms of scalability, as lots of our data sets
won't fit in Mongo backed by a 1-2 TB disk even if they take up < 100 GB in
their original format (usually binary/compressed genetic data).

It also uses a ton of RAM, and performance really suffers when the data won't
fit in RAM, so it isn't a great choice if you are trying to push the limits of
what your machines can do.

They are only using it to store models and whatever "behavioral data" is, but
models for things like random forests can be really big, and you want to be
able to write/read trees from separate machines, etc.

I wonder why they chose to use mongo vs local disk or HDFS which they already
require.

~~~
smhchan
It's the real-time prediction queries, e.g. geospatial search, that make use
of Mongo's indices.

~~~
micro_cam
Thanks for the clarification; the write-up isn't clear. Have you benchmarked
against PostGIS or stock MySQL? And tried any larger-than-memory databases?

We were using Mongo in a suite of web applications that display the results
of ML and statistical analysis of cancer data, and we've found its query
performance lacking in a number of cases. I think the Mongo geospatial index
is a pretty simple geohash setup on top of their normal query engine and I
would expect it to have the same issues.

I do think this project is very interesting, just providing my feedback based
on doing similar work.

Memory overhead of both Mongo and Hadoop would actually be my biggest worry,
since, especially on desktop workstations, it is quite common for machine
learning tools in R or Python to need most of the available memory even when
tackling small problems.

------
jey
How did these guys get distribution on an official Mozilla blog despite being
a third-party project?

~~~
rnyman
I'm Robert, the Editor of Mozilla Hacks. We publish articles about anything
regarding the Open Web and open source that we believe developers can learn
from and be inspired by.

More information at
[https://hacks.mozilla.org/about/](https://hacks.mozilla.org/about/)

------
dminor
One of the really nice things about PredictionIO is that it comes with a dozen
different recommendation algorithms out of the box, and lets you simulate
their results with your data. This makes it much easier to decide which is the
right one to use.
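As a sketch of the kind of comparison this enables, one could score each algorithm's ranked output against held-out data with precision@k. The function and data below are purely illustrative, not PredictionIO's actual evaluation API:

```python
# Offline comparison of two hypothetical recommenders with precision@k:
# the fraction of the top-k recommendations the user actually liked.

def precision_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of liked ids."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical held-out data and two algorithms' ranked outputs.
relevant = {"a", "c", "f"}
algo1 = ["a", "b", "c", "d"]
algo2 = ["b", "d", "e", "a"]
best = max(("algo1", algo1), ("algo2", algo2),
           key=lambda pair: precision_at_k(pair[1], relevant, 3))[0]
```

Running the same metric over every bundled algorithm on your own data is essentially what such a simulation step automates.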

