
Storm - the Hadoop of realtime processing - lzimm
http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce
======
vannevar
Storm sounds great, but this post probably should have waited until it was
actually open-sourced. As it is, it just comes across as naked self-promotion
based on a technology that could for all we know be vaporware.

~~~
nathanmarz
(I'm the author of Storm)

Your criticism is totally fair. People have been curious about Storm so we
wanted to provide a little bit of information about it. We'll have demos soon,
and of course it will be open sourced within a few months.

If you're curious about our credibility, I think our other open source
projects speak to the quality of software we produce:

<https://github.com/nathanmarz/cascalog>
<https://github.com/nathanmarz/elephantdb>

~~~
SeoxyS
I think people here are a little too harsh. Storm sounds like an amazing
product and I can't wait to play with something like that. Right now, we run a
bunch of cron jobs every minute with intense MapReduce queries on mongodb to
generate relatively up-to-date analytics. Something like this would be
immensely useful. (As well as Mongo's new 2.0 Aggregation pipeline features.)

Now, I agree that it's kind of a bummer we can't play with it right now, but
the fact that you guys made this and are going to open source it is already
awesome in itself.

~~~
yummyfajitas
As a happy user of ElephantDB, I'd say people are definitely too harsh.
Elephantdb is awesome - my company has completely replaced HBase with
ElephantDB and MaryJane (a lightweight way of putting data into hadoop that we
wrote, <https://github.com/stucchio/MaryJane-> ).

That said, I'd love to see some code released, even if it isn't ready for
primetime.

~~~
gojomo
Can you say anything about what made ElephantDB + MaryJane better than HBase
for your workload? (Occasional batch loads that then need random reads but not
random inserts?)

~~~
yummyfajitas
Absolutely - I need batch loads, and random reads. The term "insert" is
somewhat meaningless - I have random appends and a periodic mapreduce job
compiles the randomly appended data into structured data to be served via
ElephantDB. The structured data requires random queries. In principle, HBase
should have filled my needs completely. But in practice, I couldn't make it
work.

Our HBase cluster (3 boxes serving 30 human oracles, each submitting data at a
rate of 1 record every 5-10 seconds) choked frequently - i.e., it stopped
accepting new records. Ultimately what I had to do was have the human data go
into Postgres, with a cron job flushing that into HBase every half hour or so.

I'll emphasize that this is probably my fault. I'm not claiming HBase doesn't
scale to 30 concurrent users - clearly Facebook demonstrates it can. But I
couldn't figure out how to make that happen. HBase is a complex system and I
make no claim of understanding it.

ElephantDB + MaryJane are simple. There is almost nothing that can go wrong -
put together they probably amount to 5000 lines of code and have as many as 10
minimally interacting configuration options. The effort required to manage
them is minimal - I had EDB working flawlessly in less than a day.

HBase is an enterprise tool. It works well if you are Facebook and can put a
couple of people on maintenance duty. It's overkill if you are Styloot (my
stealth mode startup, currently smaller than Backtype).

~~~
gojomo
Thanks! So on each batched load, is the previous data rewritten with
interleaved new data? Or is the key ordering such that that's never necessary?

~~~
yummyfajitas
Each batched load has no ordering. But the data I'm loading is not the same as
the data I'm reading.

The data I'm loading is stuff like tags - e.g., <itemid>\t<tagid>. In human
terms, "Dress A has a ruched collar." Mapreduce can handle data like this,
even when it comes unordered.

The data I'm reading is computational results based on the loaded data - e.g.,
an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been
tagged with tagid). E.g., "here are all the dresses with a ruched collar."

(Actually, we do considerably more than this, and we don't really need Hadoop
just for an index. But an index is the simplest example I could give.)

The original data is very boring. It's only after aggregation and calculation
that it becomes worth reading.
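The aggregation step described above is easy to sketch. Here's a hypothetical single-machine illustration (not Styloot's actual code) of turning unordered <itemid>\t<tagid> records into a tag index:

```python
from collections import defaultdict

def build_tag_index(records):
    """Build an inverted index: tagid -> sorted list of itemids.

    `records` is an iterable of (itemid, tagid) pairs in no
    particular order, mirroring the unordered batch loads
    described above.
    """
    index = defaultdict(set)
    for itemid, tagid in records:
        index[tagid].add(itemid)
    # Sort for deterministic output; a real MapReduce job would
    # shuffle and reduce by tagid instead.
    return {tag: sorted(items) for tag, items in index.items()}

# e.g. "Dress A has a ruched collar" -> ("dressA", "ruched_collar")
records = [
    ("dressA", "ruched_collar"),
    ("dressB", "ruched_collar"),
    ("dressA", "sleeveless"),
]
print(build_tag_index(records))
# {'ruched_collar': ['dressA', 'dressB'], 'sleeveless': ['dressA']}
```

The periodic MapReduce job would write the resulting index out as ElephantDB shards for random reads.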

------
bmohlenhoff
It sounds like a neat project, but I think describing it as "real time" is
misleading if you're not also providing information on latency. The majority
of the provided use cases seem to indicate a high level of scalability and
durability, as well as a high level of throughput, but these are not necessary
characteristics of a true real time system.

It's a common misconception. A real time system doesn't have to be fast,
efficient, or fault tolerant. A real time system must guarantee with 100%
certainty that in all cases it will respond to input X within a time period of
Y.

I would be interested to learn the timing issues driving the development of
this system and how you've guaranteed such a response time, especially given
that it's running on top of the JVM and must therefore deal with a non-
deterministic garbage collection process.

~~~
scott_s
You described a hard real-time system. That exists for things like the
controllers on a jet. What's becoming much more prevalent are soft real-time
systems that perform analytics. There won't be any catastrophic failure if
deadlines aren't met - and there may not even be an explicit deadline -
it's just understood that the data must be processed and analyzed as fast as
possible to be useful.

~~~
bmohlenhoff
I certainly did describe a hard real-time system. It's nice to see that other
people recognize the distinction.

Every time I see a post describing a "real time" system, I read it hoping that
what they're describing is a hard real-time system, because those are neat, but
they never are - probably because they're so difficult and expensive to build.
Also, I guess they aren't the most relevant type of system for the majority of
people here, who are dealing (as you say) with customer-facing front ends.

------
sigil
This looks interesting. Questions:

(1) What do you mean by a processing topology -- is this a data dependency
graph?

(2) How does one define a topology? Is this specified at deployment time via
the jar file, or can it be configured separately and on the fly?

(3) Must records be processed in time order, or can they be sorted and
aggregated on some other key?

~~~
nathanmarz
1. A processing topology is a graph of how data flows through workers. The
roots of the topology are "spouts" that emit tuples into the topology (an
example spout is one that reads off of a Kestrel queue). The other nodes take
in tuples as input and emit new tuples as output. Each node in the graph
executes as some number of threads spread across the cluster.

2. To deploy a topology, you give the master machine a jar containing all the
code and a topology. The topology can be created dynamically at submit time,
but once it's submitted it's static.

3. You receive the records in the order the spouts emit them. Things like
sorting, windowed joins, and aggregations are built on top of the primitives
that Storm provides. There's going to be a lot of room for higher-level
abstractions on top of Storm, à la Cascalog/Pig/Cascading/Hive on top of
Hadoop. We've already started designing a Clojure-based DSL for Storm.

------
ScottBurson
TW;DR!!

For a variety of reasons, I keep my browser windows about 900 pixels wide.
Your site requires a honking 1280 to get rid of the horizontal scrollbar --
and can't be read in 900 without scrolling horizontally for every line (i.e.
the menu on the left is much too wide).

(OT, I know, but it's a pet peeve of mine. It's been known for years how to
use CSS to make pages stretch or squish, within reason, to the user's window
width. 900 is not too narrow!)

EDITED to add: yeah, I'm willing to spend some karma points on this, if that's
what happens. Wide sites are getting more common, and this is one of the worst
I've seen.

~~~
iramiller
I had to head back and check the article out again since I had not noticed the
width being an issue. For what it is worth it scaled down and was perfectly
usable on my iPad.

I find your width comments especially relevant to me right now, because we
have started a new site design project focused on offering 5 different
width-based layouts. Your comment is proof that offering multiple content-width
options to desktop users, and not just mobile, is useful to people besides
just me.

------
jbellis
How is this different from a "traditional" CEP system like Esper?

(I mean on the actual processing front, rather than architecturally -- sounds
like Storm is a bunch of building blocks instead of a unified system.)

~~~
nathanmarz
Storm is a unified system. The key difference with other CEP systems is that
Storm executes your topologies across a cluster of machines and is
horizontally scalable. Running a topology on Storm looks like this:

storm jar mycode.jar backtype.storm.my_topology

In this example, "backtype.storm.my_topology" defines the realtime
computation as a processing graph. The "storm jar" command causes the code to
be distributed across the cluster and executes it using many processes across
the cluster. Storm makes sure the topology runs forever (or at least until you
stop the topology).

(I can't say I'm intimately familiar with every CEP system out there, so feel
free to correct me if there are distributed CEP systems. Those products tend
to have webpages which make it hard to decipher what they actually are / do)

------
scott_s
I work on a similar system that was previously discussed on HN:
<http://news.ycombinator.com/item?id=2442977>

~~~
tlrobinson
Watson? Is that the right link?

~~~
scott_s
It's the right link, but the article is poorly written. See my corrections in
the comments.

------
maxdemarzi
" To compute reach, you need to get all the people who tweeted the URL, get
all the followers of all those people, unique that set of followers, and then
count the number of uniques. It's an intense computation that potentially
involves thousands of database calls and tens of millions of follower
records."

Or you could use a Graph DB to solve a Graph problem.

URL -> tweeted_by -> users -> followed_by -> users

Try that on Neo4j.

~~~
nathanmarz
To do that query on Neo4j, you would need to store in memory on one machine
the entire Twitter social graph, all the people who tweeted every URL ever
tweeted on Twitter, and then do the computation on a single thread. Neo4j
can't handle that scale.

The reach computation on Storm does everything in parallel (across however
many machines you need to scale the computation) and gets data using
distributed key/value databases (Riak, in our case).
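As a rough illustration of the shape of the reach computation (a hypothetical single-threaded sketch with made-up db names; Storm would fan out the follower lookups in parallel across machines):

```python
def reach(url, tweeters_db, followers_db):
    """Reach of a URL: the number of unique followers of everyone
    who tweeted it.

    `tweeters_db` maps url -> list of user ids; `followers_db`
    maps user id -> list of follower ids.  In Storm, both lookups
    would hit a distributed key/value store (e.g. Riak) and the
    union would be computed in parallel.
    """
    audience = set()
    for user in tweeters_db.get(url, []):
        audience.update(followers_db.get(user, []))
    return len(audience)

tweeters_db = {"http://example.com": ["alice", "bob"]}
followers_db = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}
print(reach("http://example.com", tweeters_db, followers_db))  # 3
```

Note the de-duplication: "dave" follows both tweeters but counts once, which is exactly the part that gets expensive at Twitter scale.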

~~~
herdrick
Nathan, we'd love to hear your postmortem on BackType's experience with Neo4J,
and how Sphinx is turning out.

~~~
nathanmarz
We used Neo4j over a year ago, and it was pretty unstable when we used it. The
database files were getting corrupted pretty frequently (a few times a week),
so it just didn't work out for us. Ultimately it was for a small feature, so
rather than continue to struggle with Neo4j we just reimplemented the feature
using Sphinx. Like I said, that was a long time ago and Neo4j may have gotten
a lot better since then.

~~~
herdrick
OK, thanks! That's valuable info.

------
herdrick
This sounds great.

 _This is the traditional realtime processing use case: process messages and
update a variety of databases._

Question: I typically think of real-time as a need for user-facing things,
i.e. handling a user's requests before he gets bored and goes away. Is Storm
set up for that? Or is it mostly meant to update a database with results
rather than return them to a waiting process?

~~~
nathanmarz
It handles both cases. Storm can be used to asynchronously update databases in
realtime in a scalable way (replacing traditional systems of queues and
workers). Using Storm for Distributed RPC lets you do intense computations on
Storm and return the results to a waiting process.

~~~
herdrick
Ah yes, I should have read not scanned "3. Distributed RPC". Thanks.

------
Maro
I'm not sure if this is the same thing, but there's also a new company called
Hadapt (the commercialization of HadoopDB). It's about adapting Hadoop for
real-time analytic SQL queries by putting local SQL dbs on the Hadoop nodes
and then using the Hadoop plumbing. It's based on Daniel Abadi's research;
he's a really smart guy.

~~~
jbellis
The high-level difference is that anything Hadoop-based is oriented around
"give me a job to run, I'll go crunch over the data, and spit the result back
to you."

In "realtime" analysis, you tell the system "these are the queries I want you
to run" and it continuously updates those with answers as data arrives.
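A minimal, hypothetical sketch of that contrast, using a word count as the standing query:

```python
# Batch (Hadoop-style): crunch over all the data on demand.
def batch_count(events):
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
    return counts

# Continuous (realtime-style): register the query up front and
# keep the answer updated as each event arrives.
class ContinuousCount:
    def __init__(self):
        self.counts = {}

    def on_event(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

stream = ["a", "b", "a"]
cc = ContinuousCount()
for e in stream:
    cc.on_event(e)   # answer is up to date after every event

assert cc.counts == batch_count(stream)  # same final answer
print(cc.counts)  # {'a': 2, 'b': 1}
```

Both arrive at the same answer; the difference is that the continuous version has it available at all times, while the batch version only has it after a full pass.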

------
pnathan
Can you comment on distributing non-JAR software?

Also, this sounds faintly like the old SunGridEngine.

~~~
nathanmarz
If you're using a programming language other than Java with Storm, you simply
include those files in the jar that you submit to the cluster. For example,
you would put your Python processing code "component.py" in the resources/
subdirectory in your jar (A jar is basically just a zipfile).

------
earl
How is this different / better than Yahoo S4 [1], which does have code on
github? [2]? Why did you choose to build this, or did you start before S4
became public?

[1] <http://docs.s4.io/> [2] <https://github.com/s4/core>

~~~
nathanmarz
Great question. S4 came out right around the time we started designing Storm.

The projects share similarities. The biggest difference from S4 is that Storm
guarantees that messages will be processed, whereas S4 will just drop
messages. Getting this level of reliability takes more than just using TCP to
send messages - you need to track the processing of each message in an
efficient way and retry a message if it doesn't get completed for some reason
(like a node going down). Implementing reliability is non-trivial and affects
the design of the whole system.
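A naive, hypothetical sketch of the track-and-retry idea (Storm's real mechanism is far more efficient than keeping a dict of pending messages, but the shape is the same):

```python
import time

class ReliableSpout:
    """Sketch of at-least-once delivery: track each emitted
    message until it is acked, and flag it for replay if it
    isn't completed within a timeout (e.g. a node went down)."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.pending = {}  # msg_id -> (message, emit_time)

    def emit(self, msg_id, message):
        self.pending[msg_id] = (message, time.monotonic())
        return message  # hand off to the topology

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)  # fully processed

    def timed_out(self):
        """Msg ids that should be replayed."""
        now = time.monotonic()
        return [mid for mid, (msg, t) in self.pending.items()
                if now - t > self.timeout]

spout = ReliableSpout(timeout=0.0)
spout.emit(1, "tuple-a")
spout.emit(2, "tuple-b")
spout.ack(1)          # tuple-a fully processed
time.sleep(0.01)
print(spout.timed_out())  # [2] -- tuple-b needs a retry
```

The hard part Storm has to solve is doing this tracking cheaply when a single input tuple fans out into many downstream tuples across many machines.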

We also felt that there was a lot of accidental complexity in S4's API, but
that's a secondary issue.

------
tedjdziuba
This sounds like something that's been painfully over-engineered.

One of the main problems they solve is "distributed RPC", from TFA: "There are
a lot of queries that are both hard to precompute and too intense to compute
on the fly on a single machine."

That's generally a sign that you've made a mistake somewhere in your
application design. Pain is a response that tells you "stop doing that".

~~~
m0th87
So there are no complex distributed problems?

~~~
petrilli
Didn't you know, everything is a website that can be written single-threaded?

