

Ask HN: Fast, In-Memory, Distributed data analysis and machine learning? - henrythe9th

We're looking to implement a new data pipeline architecture at work. The primary goal is speed (the data is small enough to fit entirely in memory, sharded across multiple machines if needed). The primary bottleneck is feature extraction, transformation, and iteration, which is both CPU- and read/write-intensive. Model building is not too slow, so there's no need to distribute training/testing as of yet.

I've heard good things about Spark/Shark and Storm. Does anyone have any experiences or recommendations? Maybe we don't even need a super sophisticated system and a Riak/Redis K-V store cluster would do?

Thanks in advance
======
karterk
Hard to offer suggestions without knowing the rough size of the data - depending
on how much money you're willing to cough up, even 1 TB is in "can fit in
memory" territory.

Having said that, Spark is really great for running iterative algorithms and
will definitely fit what you have described. I suggest staying away from
building it on your own with Riak/Redis (at least until you have ruled out
Spark), as you will run into lots of operational issues like handling
failures, resource allocation, retries, etc.

~~~
henrythe9th
Thanks for your input. We're roughly talking around 5GB of data. Data growth
should be linear over the next six months. Money is not a big concern. Speed of
iteration is key.

We frequently run different processing algorithms over the entire stored
dataset (the stored data doesn't change) and update the calculated features each
time. Not sure if this helps narrow things down. Thanks

~~~
karterk
A little bit of context: I have done a lot of Hadoop work, and I'm also well
aware of Spark and Storm. Storm is mostly well suited to handling a stream of
real-time data. Spark is specifically for running iterative algorithms: it can
read from HDFS, and with the expressiveness of Scala, it's great for building
machine-learning related stuff.

However, 5GB of data is literally nothing, and that statement holds until your
data size is at least 50-60 GB. Given that 64 GB RAM machines are now
commodity, I would just load the entire thing into RAM and write a multi-
threaded program. Sounds old school, but regardless of how well documented
Hadoop, Spark and Storm are, there is nevertheless a learning curve and a
maintenance cost, both of which are worth it only if you see your data
rapidly growing into the X TB range. Otherwise, it might just be easier to stick
it on a single machine and get stuff done.
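A minimal sketch of that old-school approach (the transform and dataset here are hypothetical stand-ins): load everything into plain arrays once, then let the JVM's parallel streams fan a CPU-bound feature pass across every core, with no cluster machinery at all.

```java
import java.util.stream.IntStream;

public class InMemoryPipeline {
    public static void main(String[] args) {
        // Fabricated stand-in for the ~5 GB dataset, loaded fully into RAM.
        double[] raw = IntStream.range(0, 1_000_000).asDoubleStream().toArray();

        // One pass of a hypothetical feature transform; .parallel() spreads
        // the work over all cores via the common fork-join pool.
        double[] features = IntStream.range(0, raw.length).parallel()
                .mapToDouble(i -> Math.sqrt(raw[i]))
                .toArray();

        System.out.println(features.length);
    }
}
```

Since the stored data never changes, re-running a different algorithm is just another parallel pass over the same arrays.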

You can stick to Scala/Java, and as long as you develop good abstractions around
your core algorithms, you can always move to Spark/Hadoop when you need to.
Feel free to send me an email if you want to talk more (email in profile).
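One way to read "good abstractions" here (interface and class names are made up for illustration): keep the core algorithm behind a small interface that knows nothing about where the data lives, so only the backend has to change if you later move to Spark.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical seam: the algorithm sees only this interface, not the backend.
interface FeatureBackend {
    double[] map(double[] rows, DoubleUnaryOperator transform);
}

// Today's backend: a plain in-process parallel loop. A future SparkBackend
// could implement the same interface without touching the algorithms.
class LocalBackend implements FeatureBackend {
    public double[] map(double[] rows, DoubleUnaryOperator transform) {
        return Arrays.stream(rows).parallel().map(transform).toArray();
    }
}

public class AbstractionDemo {
    public static void main(String[] args) {
        FeatureBackend backend = new LocalBackend();
        double[] out = backend.map(new double[]{1, 4, 9}, Math::sqrt);
        System.out.println(Arrays.toString(out)); // [1.0, 2.0, 3.0]
    }
}
```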

~~~
henrythe9th
Thanks for the suggestion. We've actually thought about just writing a
multithreaded system on a single machine. What type of in-memory storage would
you recommend in this case? (Hopefully one that could be extended to a
distributed cluster of machines if one really large machine becomes expensive.)

Thanks

~~~
karterk
I suggest storing your data in files and just memory-mapping them during
start-up. The JVM can't memory-map more than 2GB in a single mapping, so just
create logical shards and map them independently.
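A small sketch of the mapping step (the shard file here is a fabricated sample): each logical shard becomes one `MappedByteBuffer`, which is capped at `Integer.MAX_VALUE` bytes, i.e. just under 2 GB per mapping.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapShards {
    public static void main(String[] args) throws IOException {
        // Write a tiny sample shard (in practice these are your data files,
        // each kept under the ~2 GB per-mapping limit).
        Path shard = Files.createTempFile("shard0", ".bin");
        Files.write(shard, new byte[]{1, 2, 3, 4});

        // Map the shard read-only and iterate over it directly; the OS pages
        // it into memory, so start-up is cheap and reads are RAM-speed.
        try (FileChannel ch = FileChannel.open(shard, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int sum = 0;
            while (buf.hasRemaining()) sum += buf.get();
            System.out.println(sum); // prints 10
        }
        Files.delete(shard);
    }
}
```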

Since you will be mostly iterating over all records during your iterative
algorithms, storing them in a separate in-memory DB makes no sense (you'd have
to call an external process over a socket for every read).

You can then use a framework like ZooKeeper/Akka for managing nodes in the
event that you have to scale out. Even a simple master/slave setup using
Thrift services will do.

------
agibsonccc
I can vouch for Storm, if only for the fact that it's pretty easy to set up
(especially compared to Hadoop). Being able to leverage ZooKeeper gives you
some extra coordination capabilities as well. With that being said, just watch
how you build your bolts/spouts. There are lots of ways to send data into the
system, but in general, Storm's documentation has been superb to work with.

I built a mini library for myself to auto-construct topologies from a set of
named dependencies, to handle bolt/spout wiring. Aside from that, Storm's
builder interface is really nice if your data pipeline doesn't change.

There's good support for testing with a local cluster as well.

~~~
henrythe9th
Thanks for your suggestion. Do you have any specific readings for me to look
into for building bolts/spouts for sending data into the system?

Thanks

~~~
agibsonccc
Here's the root wiki:
[https://github.com/nathanmarz/storm/wiki](https://github.com/nathanmarz/storm/wiki)

Here's the system architecture:
[https://github.com/nathanmarz/storm/wiki/Concepts](https://github.com/nathanmarz/storm/wiki/Concepts)

Here's the page on non-JVM languages (specifically Python) for building spouts/bolts:
[https://github.com/nathanmarz/storm/wiki/Using-non-JVM-languages-with-Storm](https://github.com/nathanmarz/storm/wiki/Using-non-JVM-languages-with-Storm)

Here's an example project:
[https://github.com/nathanmarz/storm-starter](https://github.com/nathanmarz/storm-starter)

~~~
henrythe9th
Thanks!

------
x0x0
you should check out [http://0xdata.com/](http://0xdata.com/); it's built
from the ground up on a custom distributed key-value store (DKV) to do
in-memory ML. Reasons to check it out:

1 - it's open source
[https://github.com/0xdata/h2o](https://github.com/0xdata/h2o)

2 - it ingests data from HDFS, S3, and CSV

3 - I've built systems like what you're describing twice; the ML algorithms
are often easier to write than expected, while the data management (moving
data, sending updates, etc.), which initially seems easier, is much harder.
0xdata handles this for you.

4 - under active development

5 - it runs cleanly on your dev box with one or many nodes for development;
deploying is as simple as uploading a jar to a cluster and putting a single
file on each node naming the peers in the cluster

5a - there are scripts to walk you through doing this

disclosure: I work on it as of very recently =P

------
nihar
Have you looked at Oracle Coherence? It's pretty lightweight and has
clustering features as well.

~~~
henrythe9th
Thanks for the suggestion. Looks very interesting, but I couldn't find much
information about it outside of Oracle's own material.

How are the community and use cases for Coherence?

Thanks

~~~
nihar
Not much in the way of an open source community, but the Oracle forums have
some good support for it. Plus, the documentation that ships with the product
is pretty decent, and a lot of really large firms use the solution.

