
MLDB – Machine Learning Database - JPKab
http://mldb.ai/
======
brianshaler
Not having dug through the code or documentation yet, I'm curious if anyone
else knows if this can be used with a minimal memory footprint. The home page
touts performance with everything in memory, which is great for blasting
through a dataset and then spinning down a $1/hr server. What if I'm willing
to accept orders of magnitude slower processing to run this full-time in the
background of a low-memory server?

~~~
barneso
(Founder here).

Could you describe your use-case? This is an interesting question and I'd love
to hear more about what you are thinking of.

There are two main parts to most machine learning workloads: training and
prediction (though in online scenarios they can be mixed). The training is
what tends to be memory intensive, as it involves looking at historical data.
Prediction, especially in real-time, is typically less memory intensive and
more suited to your scenario. It is possible with MLDB to have training and
prediction on separate instances to optimize hardware cost.

Our focus to date has been on getting large real-world use cases to fit, for
both training and prediction, on a single commodity server with decent amounts
of RAM (4GB and up), especially for training. Performance there is good; you
can train user scorers or recommenders over hundreds of millions of users with
20 billion historical actions in minutes on a single server. We haven't
focused much on the absolute memory footprint and there is some overhead; you
will need a server with a good amount of memory, a couple of gigs, to have a
pleasant user experience.

(edit): The team tells me it will work decently down to 256MB of RAM or so,
for carefully designed automated workloads. For interactive use you will want
more RAM.

~~~
brianshaler
I appreciate the detailed response! My use case is yet another indie-web, run-
your-own-server kind of thing that processes content my friends post online
(from tweets to shared articles) and predicts whether or not it is relevant
to me based on extracted topics, source, my context/location, and whatnot.
Both the source data and the training data come in at a trickle (though I'm
pondering ways to propagate training data through friend-of-friend
networks) and are processed in the background rather than on demand, so my
performance characteristics are very different from those of most use cases
I've seen.

I'm trying to keep the system frugal with memory but liberal with persistent
storage, since you can run a commodity instance 24/7 and mount a pretty large
volume for fairly cheap. It'll be slow, for sure, but if there's only one user
per installation it won't need to worry about handling many queries per
second.

~~~
barneso
It seems that MLDB would be a decent fit for this use-case. You would be able
to do pre-processing in the background continuously, and predictions could do
a significant amount of work on-demand. Depending upon the size of the overall
training set, you might need to spin up a larger server for an hour or so to
retrain a model... but if you set the system up right, the model would only
need to be trained infrequently as most of the work would be done online. The
more preprocessing you can do in the background, the richer and smaller the
data that would go into the training phase.
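A minimal sketch of that pattern in pure Python (all names and the toy "is this post relevant?" features are hypothetical, standing in for MLDB's actual procedures): a model is retrained infrequently offline, serialized to disk, and a lightweight predictor scores new items as they trickle in.

```python
import json
import math
import tempfile

def train(examples, epochs=50, lr=0.5):
    """Offline phase: learn per-feature weights with simple SGD on
    logistic loss. Run infrequently, e.g. on a temporarily larger box."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            z = sum(weights.get(f, 0.0) for f in features)
            p = 1.0 / (1.0 + math.exp(-z))
            for f in features:
                weights[f] = weights.get(f, 0.0) + lr * (label - p)
    return weights

def score(weights, features):
    """Online phase: cheap, low-memory prediction for a single item."""
    z = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))

# Toy historical data: feature bags and 0/1 relevance labels.
history = [
    ({"topic:ml", "source:friend"}, 1),
    ({"topic:ml", "source:stranger"}, 1),
    ({"topic:celebrity", "source:stranger"}, 0),
    ({"topic:celebrity", "source:friend"}, 0),
]

model = train(history)

# Persist the model; the 24/7 low-memory process only reloads this file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(model, fh)
    path = fh.name

with open(path) as fh:
    loaded = json.load(fh)

print(score(loaded, {"topic:ml", "source:friend"}))           # high
print(score(loaded, {"topic:celebrity", "source:stranger"}))  # low
```

The point of the split is that the trained weights are tiny compared to the history used to produce them, so the always-on predictor never needs the training set in memory.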

MLDB can memory-map some kinds of datasets, which would also help when the
memory-to-dataset-size ratio is low.
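The general idea behind memory-mapping, sketched with Python's stdlib `mmap` module (this is not MLDB's on-disk format, which the thread doesn't describe): the OS pages data in on demand, so resident memory stays small even when the file is much larger than RAM.

```python
import mmap
import os
import struct
import tempfile

# Write a file of fixed-size records (here: one million little-endian doubles).
record = struct.Struct("<d")
path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(path, "wb") as fh:
    for i in range(1_000_000):
        fh.write(record.pack(float(i)))

# Memory-map it: only the pages we actually touch get loaded into RAM.
with open(path, "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access by record index, without reading the whole file.
    i = 123_456
    (value,) = record.unpack_from(mm, i * record.size)
    print(value)  # 123456.0
    mm.close()
```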

Please feel free to reach out (jeremy at datacratic) if you'd like to discuss
further.

------
zeckalpha
How does this compare to RecDB, other than being commercial? [http://www-
users.cs.umn.edu/~sarwat/RecDB/](http://www-users.cs.umn.edu/~sarwat/RecDB/)

~~~
nicolaskruchten
(MLDB product lead here)

I hadn't seen RecDB, thanks for the link! At a glance, other than the
licensing, the main point of comparison is that MLDB is more powerful and
general-purpose than RecDB: it provides primitives not only for creating
recommenders but also for other machine-learning tasks like classification,
clustering, dimensionality reduction, visualization, etc. MLDB also seems to
have a bit more documentation :)

~~~
boomzilla
I've used madlib[1] before and was quite happy with it. Maybe you could do a
feature comparison with similar existing solutions, as many people don't have
the bandwidth to go through your documentation.

[1] [http://madlib.net/](http://madlib.net/)

~~~
nicolaskruchten
That's a good idea, thanks! What kinds of problems were you solving with
madlib?

~~~
boomzilla
Classification problems in a medical domain. It literally took me 30 minutes
to install madlib and run logistic regression on a table. It then probably
took me a few more days to do some analysis, try out various algorithms, and
productionize the predictor.

~~~
philgo20
A feature comparison with other solutions is a great idea. We'll work on that
and share here.

