I've written a lot of documentation on the wiki, which you can find here: https://github.com/nathanmarz/storm/wiki
There are a few companion projects to Storm:
One-click deploy for Storm on EC2: https://github.com/nathanmarz/storm-deploy
Adapter to use Kestrel as a Spout within Storm: https://github.com/nathanmarz/storm-kestrel
Starter project with example topologies that you can run in local mode: https://github.com/nathanmarz/storm-starter
Feel free to ask me questions here or on Storm's mailing list ( http://groups.google.com/group/storm-user ), and I'll answer as best I can!
I have a schedule conflict this year :( but I'd recommend the conference to everyone - the first year was informative, with contagious enthusiasm. The other speakers also look cool.
A minor nit: I was looking at the code, and all the hanging parens make it look really sad :-(
All of these Storm projects, with their project.clj files, betray the Clojure roots (Leiningen as the build tool, which is amazingly great). Here's hoping for more Clojure examples/docs.
I'm going to try to get more documentation on using Storm from Clojure in the next few weeks. I write all my topologies in Clojure using a small Clojure DSL that ships with Storm.
- If there's no intermediate queuing, how are "overloads" handled? What if the system can't handle a transient load? Is there some kind of flow-control signalling back up the path to control the rate (I assume not!)?
- What happens if a bolt accepts data and then crashes before processing it? Is that data resent? If so, how does the system handle bolts that are not idempotent? (The problem is that if you assume data is not handled until the handling process terminates, then you may send duplicates; if you assume data is handled as soon as it is received, then you lose data on crashes.)
- how can "fields grouping" scale? surely if you add more bolts they won't receive any messages unless a new field appears.
- Is there some kind of automatic restarting? How are failures handled? [OK, this is handled by supervisor nodes and is described in the link above, although it's difficult to understand how they can be stateless. What happens to the managed processes if a supervisor dies?]
Thanks again! [edit: thanks for the great answers; I should have guessed the consistent hashing one ;o]
If a tuple fails to be processed, the tuple(s) that triggered that tuple are replayed from the spout. See https://github.com/nathanmarz/storm/wiki/Guaranteeing-messag... for more info on that. In that sense, Storm is an at-least-once delivery system (but messages are sent more than once only in failure scenarios). It is up to you to architect your systems to handle this. The approach I take is to build systems using a hybrid of batch processing (Hadoop) and realtime processing (Storm). With batch processing, you can run idempotent functions even with duplication, which lets you correct what's happening at the realtime layer. In that sense, Hadoop and Storm are extremely complementary. Here are some slides from a presentation I gave about this technique: http://www.slideshare.net/nathanmarz/the-secrets-of-building...
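To make the replay semantics concrete, here's a minimal sketch (in Python, and not Storm's actual API - the class and names below are hypothetical) of one way a downstream consumer can stay correct under at-least-once delivery: deduplicate on a unique tuple id before applying each update, so replays become no-ops.

```python
# Hypothetical sketch, not Storm code: a consumer that tolerates
# at-least-once delivery by deduplicating on a unique tuple id.

class DedupingCounter:
    """Counts words, ignoring replayed tuples it has already applied."""

    def __init__(self):
        self.counts = {}
        self.seen_ids = set()  # in practice this must be bounded/persisted

    def process(self, tuple_id, word):
        if tuple_id in self.seen_ids:
            return  # replayed tuple: this update was already applied
        self.seen_ids.add(tuple_id)
        self.counts[word] = self.counts.get(word, 0) + 1

counter = DedupingCounter()
counter.process("t1", "storm")
counter.process("t2", "storm")
counter.process("t1", "storm")  # replay after a failure: ignored
```

In practice the seen-id set has to be bounded or persisted somewhere, which is part of why the batch layer is an attractive place to get idempotence: rerunning an idempotent batch job over the raw data corrects any double-counting at the realtime layer.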
Fields grouping uses consistent hashing underneath. So if you redeploy with more parallelism, it scales naturally and easily.
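As a toy illustration of the idea (plain hash-based partitioning in Python - this is not Storm's implementation): the grouping field is hashed to pick a target task, so every tuple with the same field value always lands on the same task, and redeploying with more tasks simply re-partitions the values across the larger set.

```python
import hashlib

def choose_task(field_value, num_tasks):
    # Hash the grouping field and map it onto the task set.
    # (Illustrative only -- Storm's real fields grouping is not this code.)
    h = int(hashlib.md5(field_value.encode("utf-8")).hexdigest(), 16)
    return h % num_tasks

# Tuples sharing a field value always go to the same task:
tasks = {choose_task(w, 4) for w in ["storm", "storm", "storm"]}
```

The answer to the scaling question falls out of this: new bolts aren't waiting for new field values to appear; the existing value space is just spread over more tasks.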
If a supervisor dies, it starts back up like nothing happened; most notably, nothing happens to the worker processes. All state is kept either on disk or in Zookeeper. All daemons in Storm are fail-fast, and the supervisor uses kill -9 to kill workers, which makes things extremely robust.
I hope that answered your questions!
This looks like a workflow management system, where you define a dependency graph and the system automatically puts messages in queues, pops them, and executes each step. It seems to solve the boilerplate part of distributed computing: managing message queues and fault tolerance. Please correct me if I got this wrong or missed something.
I'll have to read through their stuff to see if they have some interesting ideas for my FBP implementation :-)
- I'm trying to understand the relationship between ZeroMQ and Kestrel in your architecture. Is ZeroMQ used for message passing, and Kestrel used as a stream source/sink, aka a spout? In other words, my assumptions are: Zookeeper helps manage node discovery and coordination, message passing between Nimbus-managed bolt processes goes through ZeroMQ, and Kestrel queues are used for external integration (data stream sources). Is this correct, or am I missing something?
- Do you have any tutorials on using Cascalog with Storm? Are they compatible, or have you developed a different Clojure programming model/DSL for working with Storm?
thanks and again - nice work!
Realtime processing is fundamentally different from batch processing, so you can't maintain the same semantics as Cascalog on top of Storm. Storm has a small DSL for writing topologies in pure Clojure, but it's not a higher-level abstraction the way Cascalog is. I've started thinking about what a great higher-level abstraction would look like, but that's still an open question.
The example Twitter gives here is for trending topics, but Storm is basically to message queues what Rails is to CRUD web apps, so you can draw use cases from, e.g., everything Tibco and JMS are used for today.
Storm is nothing you couldn't have done a year ago, or ten years ago, with a message-oriented architecture. But it has a very attractive feature: it bakes in all the fiddly details and problem-solving Twitter did while scaling it within their architecture. Systems like this tend to be easy to prototype and a nightmare to mature and manage, so that's not a small feature.
The gist is: Storm topologies are just Thrift structures, and Storm can execute processing components in any language by communicating with subprocesses over a simple protocol based on JSON messages over stdin/stdout. Storm has adapters implementing this protocol for Ruby and Python, and it's easy to make a new adapter for any other language. The adapter libraries are ~100 lines of code and have no dependencies other than JSON.
At least in Rubyland, JVM dependencies have tended to be a win.
Or, at least, in more suitable Erlang? ^_^
Isn't it an obvious startup idea?
I appreciate your innovative idea and the amount of work you've done, so this small efficiency issue doesn't really matter.
btw, who cares about resources when hardware is so cheap and purchased in ocean containers? ^_^