

Twitter's Storm (complex event processing system) is now open source - harrigan
https://github.com/nathanmarz/storm

======
nathanmarz
Hey all, I'm the author of Storm. Just wanted to point you to a few resources:

I've written a lot of documentation on the wiki, which you can find here:
<https://github.com/nathanmarz/storm/wiki>

There's a few companion projects to Storm. These are:

One-click deploy for Storm on EC2: <https://github.com/nathanmarz/storm-
deploy>

Adapter to use Kestrel as a Spout within Storm:
<https://github.com/nathanmarz/storm-kestrel>

Starter project with example topologies that you can run in local mode:
<https://github.com/nathanmarz/storm-starter>

Feel free to ask me questions here or on Storm's mailing list (
<http://groups.google.com/group/storm-user> ), and I'll answer as best I can!

~~~
aaronblohowiak
Thank you for releasing such thorough documentation with your project; it is a
great indicator of professionalism and passion. ^.^

~~~
nathanmarz
Thanks! Good documentation is a critical part of having a usable project, and
that will always be a focus of Storm. If you have any feedback for how the
documentation can be better, please let me know.

------
phren0logy
Woah. Launch a Storm cluster on AWS with one line:

<https://github.com/nathanmarz/storm-deploy/wiki>

All of these storm projects with their project.clj files betray the Clojure
roots (using Leiningen as the build tool, which is _amazingly great_ ). Here's
to hoping for more Clojure examples/docs.

~~~
nathanmarz
The Storm deploy is awesome. It's built using an awesome tool called Pallet:
<https://github.com/pallet/pallet>

I'm going to try to get more documentation on using Storm from Clojure in the
next few weeks. I write all my topologies in Clojure using a small Clojure DSL
that ships with Storm:

[https://github.com/nathanmarz/storm/blob/master/src/clj/back...](https://github.com/nathanmarz/storm/blob/master/src/clj/backtype/storm/clojure.clj)

------
michaelbuckbee
It is a little buried in the github wiki, but this 'Rationale' page is a good
overview of the project - <https://github.com/nathanmarz/storm/wiki/Rationale>

~~~
andrewcooke
thanks. given that, can anyone answer the following?

\- if there's no intermediate queuing, how are "overloads" handled? what if
the system can't handle a transient load? is there some kind of flow control
signalling back up the path to control the rate (i assume not!)?

\- what happens if a bolt accepts data then crashes before processing. is that
data resent? if so, how does the system handle bolts that are not idempotent?
(the problem is that if you assume that data is not handled until the handling
process terminates then you may send duplicates; if you assume that data are
handled if they are received then you lose data on crashes).

\- how can "fields grouping" scale? surely if you add more bolts they won't
receive any messages unless a new field appears.

\- is there some kind of automatic restarting? how are failures handled? [ok,
this is handled by supervisor nodes and is described in the link above.
although it's difficult to understand how they can be stateless. what happens
to the managed processes if a supervisor dies?]

thanks again! [edit: thanks for the great answers; should have guessed the
consistent hashing one ;o]

~~~
nathanmarz
You generally feed a Storm topology with a queueing system at the source. For
example, we use Kestrel to feed Storm topologies. There's a setting called
"TOPOLOGY_MAX_SPOUT_PENDING" which controls how many tuples can be pending on
any given spout task at a time. This is more than sufficient for controlling
bursts of data. If things back up, they will queue on the queueing server (s).
In the future (not for a few months, at least), Storm will have auto-scaling
where it will automatically scale the topology to the data.

If a tuple fails to be processed, the tuple(s) that triggered that tuple are
replayed from the spout. See
[https://github.com/nathanmarz/storm/wiki/Guaranteeing-
messag...](https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-
processing) for more info on that. In that sense, Storm is an at-least-once
delivery system (but messages are sent more than once only in failure
scenarios). It is up to you to architect your systems to handle this. The
approach I take is to build systems using a hybrid of batch processing
(Hadoop) and realtime processing (Storm). With batch processing, you can run
idempotent functions even with duplication, which lets you correct what's
happening at the realtime layer. In that sense, Hadoop and Storm are extremely
complementary. Here are some slides from a presentation I gave about this
technique: [http://www.slideshare.net/nathanmarz/the-secrets-of-
building...](http://www.slideshare.net/nathanmarz/the-secrets-of-building-
realtime-big-data-systems)

Fields grouping uses consistent hashing underneath. So if you redeploy with
more parallelism, it scales naturally and easily.

If a supervisor dies, it starts back up like nothing happened. Most notably,
nothing happens to the worker processes. All state is kept either in disk or
in Zookeeper. All daemons in Storm are fail-fast, and the Supervisor uses kill
-9's to kill workers, so that makes things extremely robust.

I hope that answered your questions!

~~~
cosgroveb
The supervisor/worker thing sounds quite a bit like OTP.

------
buss
If you don't want to read through the wiki, here's what I've gathered (though
I may have misunderstood what they're doing).

This looks like a workflow management system, where you define a dependency
graph and their system automatically puts messages in queues, pops them, and
executes a step. It seems like it solves the boilerplate part of distributed
computing - managing message queues and fault tolerance. Please correct me if
I got this wrong or missed something.

~~~
bergie
Sounds a bit like <http://en.wikipedia.org/wiki/Flow-based_programming>

I'll have to read through their stuff to see if they have some interesting
ideas for my FBP implementation :-)

~~~
kragen
It does, doesn't it? I still haven't made it through the FBP book, and I'm
having a hard time with the author's savior-of-the-world attitude.

------
mstanley
Hi Nathan! this is awesome. I'm really excited to dive deeper.

some questions:

\- I'm trying to understand the relationship between ZeroMQ and Kestrel in
your architecture. is ZeroMQ used for message passing? and Kestrel used as a
stream source/sink - aka a sprout? in other words, my assumptions are:
zookeeper helps manage node discovery and coordination while message passing
between nimble managed bolt processes' are through zeromq. kestrel queues are
used for external integration (data stream sources). Is this correct or am I
missing something?

\- do you have any tutorials on using cascalog with Storm? are they compatible
or have you developed a different clojure programming model/DSL for working
with Storm?

thanks and again - nice work!

~~~
nathanmarz
You have it correct. ZeroMQ is used for message passing between components,
and Zookeeper is used for node discovery and coordination.

Realtime processing is fundamentally different than batch processing, so you
can't maintain the same semantics of Cascalog on top of Storm. Storm has a
small DSL for writing topologies in pure Clojure, but it's not a higher level
abstraction like Cascalog is. I've started thinking about what a great higher
level abstraction would look like, but what that should look like is still an
open question.

------
harrigan
Some background: [http://engineering.twitter.com/2011/08/storm-is-coming-
more-...](http://engineering.twitter.com/2011/08/storm-is-coming-more-details-
and-plans.html)

~~~
mhd
I just _knew_ that someone would use the "a storm is coming" quote. Let's see
if this piece of software will shake the universe…

------
beagledude
nathanmarz does the work at BackType, Twitter gets the credit

~~~
xtacy
BackType has been acquired by Twitter.

~~~
beagledude
we know that, but all the work was done pre-twitter aquisition.

~~~
artursapek
He gets the credit he deserves as it's under his Github. Who cares about the
company name?

~~~
shimon_e
Saying it is used by twitter makes it a lot easier to sell to management.

------
jevinskie
I'll be taking a look at this today. I'm most excited about it's fault
tolerance features. If this sufficiently abstracts out the details of
providing robust fault tolerance, it could be a great tool to use with cloud
computing.

------
DevX101
What are some potential applications of this?

~~~
tptacek
Any big data application where the majority of the working set changes
continuously and requires more processing than just "bucketing".

The example Twitter gives here is for trending topics, but Storm is basically
to message queues what Rails is to CRUD web apps, and so you can draw use
cases from eg everything Tibco and JMS are used for today.

Storm is nothing you couldn't have done a year ago, or ten years ago, with a
message-oriented architecture. But it has a very attractive feature, which is
that it bakes in all the fiddley details and problem solving Twitter did while
scaling it to their architecture. Systems like this tend to be easy to
prototype and a nightmare to mature and manage, so that's not a small feature.

~~~
ethank
By their very nature, decentralized processing systems have exponential
complexity, so this is very welcome to see released. But oh how I hate having
to go back into Java :)

~~~
nathanmarz
Storm can be used with any programming language. I still need to finish the
documentation on this, but I have notes here:
[https://github.com/nathanmarz/storm/wiki/Using-non-JVM-
langu...](https://github.com/nathanmarz/storm/wiki/Using-non-JVM-languages-
with-Storm)

The gist is: Storm topologies are just Thrift structures, and Storm can
execute processing components in any language by communicating with
subprocesses over a simple protocol based on JSON messages over stdin/stdout.
Storm has adapters implementing this protocol for Ruby and Python, and it's
easy to make a new adapter for any other language. The adapter libraries are
~100 lines of code and have no dependencies other than JSON.

~~~
jongraehl
Would it be difficult to use protostuff or protobuf instead? Or maybe Thrift
is better for this purpose in some way I don't anticipate?

------
mdaniel
Is it common practice for a corporation to release code via one user's Github
account? I would expect that if Twitter were open-sourcing something, it would
show up as something like <https://github.com/twitter/storm> (that link is
404, to save you the trouble).

~~~
frisco
This is really more nathanmarz's Storm more than Twitter's Storm. He wrote it
at BackType with the intention of open sourcing it, and when Twitter bought
BackType they apparently didn't ask him not to. So, it makes sense that it's
being released through his Github account. If Twitter uses it significantly
internally, they'll clone it to their organization account, like they did for
most of the projects that are on Twitter's Github profile.

------
grantjgordon
This is fantastic! My mind is spinning with the industries that you could
benefit from this, but didn't have the time/resources/focus to roll this sort
of (very difficult to scale) system on their own.

------
ldng
Really glad that you guys now release and than announce ! the best way to
avoid the let down of ending not opening something announced earlier (whatever
the reason). Keep the trend going !

------
sherkund
Just out of curiosity, to what degree was Clojure chosen because of the
ability to use Java libraries vs the language design + community?

~~~
nathanmarz
Clojure is an amazing language, and part of that is the access to a huge
breadth of Java libraries. Storm makes use of Java libraries/projects like
JZMQ, Zookeeper, and Thrift.

------
gfodor
This is awesome, really excited to see this get released!

------
scotto
Yeeeeessssss! Thank you Twitter!!!

------
schiptsov
Someone want to implement something like that without 100 jars of dependencies
and 8Gb of memory required to just run, in old-fashioned C/Lisp way (or more
modern nginx-way)? Update: _on top of Plan9?!_ ^_^

Or, at least, in more suitable Erlang? ^_^

Isn't it an obvious startup-idea?

~~~
nathanmarz
Storm does not have 100's of jars of dependencies, nor does it require much
memory to run. In fact, it's quite straightforward to spin up a Storm cluster
(see [https://github.com/nathanmarz/storm/wiki/Setting-up-a-
Storm-...](https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-
cluster) ) and Storm has a local mode where you can develop/test topologies on
your local mode completely in-process.

~~~
schiptsov
As far as I know, even clojure-contrib requires lots of jars, and I guess
Scala is also very costly, but let it go.

I appreciate your innovative idea and amount of work you have done, so this
small efficiency issue does not really matter.

btw, who cares about resources when hardware is so cheap and purchased in
ocean containers? ^_^

~~~
alnayyir
No.

~~~
getsat
Don't bother replying to this guy. Read his comment history. He's either a
professional troll or mentally ill.

