Storm - the Hadoop of realtime processing (backtype.com)
119 points by lzimm on May 26, 2011 | 46 comments


Storm sounds great, but this post probably should have waited until it was actually open-sourced. As it is, it just comes across as naked self-promotion based on a technology that could for all we know be vaporware.


(I'm the author of Storm)

Your criticism is totally fair. People have been curious about Storm so we wanted to provide a little bit of information about it. We'll have demos soon, and of course it will be open sourced within a few months.

If you're curious about our credibility, I think our other open source projects speak to the quality of software we produce:

https://github.com/nathanmarz/cascalog

https://github.com/nathanmarz/elephantdb


I think people here are a little too harsh. Storm sounds like an amazing product and I can't wait to play with something like that. Right now, we run a bunch of cron jobs every minute with intense MapReduce queries on MongoDB to generate relatively up-to-date analytics. Something like this would be immensely useful. (As well as Mongo's new 2.0 Aggregation pipeline features.)

Now, I agree that it's kind of a bummer we can't play with it right now, but the fact that you guys made this and are going to open source it is already awesome in itself.


As a happy user of ElephantDB, I'd say people are definitely too harsh. ElephantDB is awesome - my company has completely replaced HBase with ElephantDB and MaryJane (a lightweight way of putting data into Hadoop that we wrote: https://github.com/stucchio/MaryJane).

That said, I'd love to see some code released, even if it isn't ready for primetime.


Can you say anything about what made ElephantDB + MaryJane better than HBase for your workload? (Occasional batch loads that then need random reads but not random inserts?)


Absolutely - I need batch loads and random reads. The term "insert" is somewhat meaningless here - I have random appends, and a periodic MapReduce job compiles the randomly appended data into structured data to be served via ElephantDB. The structured data requires random queries. In principle, HBase should have filled my needs completely. But in practice, I couldn't make it work.

Our HBase cluster (3 boxes serving 30 human oracles, each submitting data at a rate of 1 record every 5-10 seconds) choked frequently - i.e., it stopped accepting new records. Ultimately what I had to do was have the human data go into Postgres, with a cron job flushing that into HBase every half hour or so.

I'll emphasize that this is probably my fault. I'm not claiming HBase doesn't scale to 30 concurrent users - clearly Facebook demonstrates it can. But I couldn't figure out how to make that happen. HBase is a complex system and I make no claim of understanding it.

ElephantDB + MaryJane are simple. There is almost nothing that can go wrong - put together they probably amount to 5000 lines of code and have as many as 10 minimally interacting configuration options. The effort required to manage them is minimal - I had EDB working flawlessly in less than a day.

HBase is an enterprise tool. It works well if you are Facebook and can put a couple of people on maintenance duty. It's overkill if you are Styloot (my stealth mode startup, currently smaller than Backtype).


Thanks! So on each batched load, is the previous data rewritten with interleaved new data? Or is the key ordering such that that's never necessary?


Each batched load has no ordering. But the data I'm loading is not the same as the data I'm reading.

The data I'm loading is stuff like tags - e.g., <itemid>\t<tagid>. In human terms, "Dress A has a ruched collar." MapReduce can handle data like this, even when it arrives unordered.

The data I'm reading is computational results based on the loaded data - e.g., an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been tagged with tagid). In human terms, "here are all the dresses with a ruched collar."

(Actually, we do considerably more than this, and we don't need Hadoop just to build an index. But an index is the simplest example I could give.)

The original data is very boring. It's only after aggregation and calculation that it becomes worth reading.
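
For concreteness, here's a minimal single-machine sketch of that index build (the real job runs as MapReduce on Hadoop; the class name and record values are made up for illustration):

    import java.util.*;

    public class TagIndexSketch {
        public static void main(String[] args) {
            // Loaded data: unordered <itemid>\t<tagid> records.
            List<String> records = Arrays.asList(
                "dressA\truched-collar",
                "dressB\truched-collar",
                "dressA\tsleeveless");

            // Computed result: an inverted index, <tagid> -> [<itemid1>, <itemid2>, ...].
            Map<String, List<String>> index = new HashMap<String, List<String>>();
            for (String record : records) {
                String[] fields = record.split("\t");       // fields[0] = itemid, fields[1] = tagid
                List<String> items = index.get(fields[1]);
                if (items == null) {
                    items = new ArrayList<String>();
                    index.put(fields[1], items);
                }
                items.add(fields[0]);
            }

            // "Here are all the dresses with a ruched collar."
            System.out.println(index.get("ruched-collar")); // [dressA, dressB]
        }
    }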


Cool, great to see you're making use of EDB. Would love to hear more about how you're using it, how the transition went, etc. mm@backtype.com.


I'm still waiting on Twitter's rainbird (http://www.slideshare.net/kevinweil/rainbird-realtime-analyt...) to come out!


Me too. I asked them about it the other day.

Response - http://twitter.com/#!/kevinweil/status/73263430873792512


Also, the lack of any scalability charts or diagrams of architecture is suspicious.

If you can't make it open source, at least write a serious paper to support the claims, like Google did for Bigtable.

A lot of people think their systems are scalable and fault-tolerant. Most are not. And from the information provided, we can't tell.


We've released open source projects (most notably ElephantDB and Cascalog) in the past that are successfully used in production by us as well as other companies. You should check them out if you're interested in a measure of quality, though I understand your concern.

We're a startup - we're not going to write an academic paper supporting the claims in the post. Nevertheless, Storm's an exciting project many people are curious to learn more about; that's why we've written something about it now.

We have a demo coming soon, and Storm itself will be open sourced soon enough.


I absolutely understand the issue of being resource constrained.

It seems like this is buzz-worthy (like http://mailchimp.com/omnivore/), but this pitch is nerd-focused, not potential-customer-focused. If you pitch to nerds, you want a github link. If you pitch to potential customers, highlight the benefits that are now possible due to this innovation.

At least in our batch, we got this drilled into us repeatedly: Don't talk features. Talk benefits.


Don't call it "the Hadoop of" anything if it is not open source. Hadoop is notable as an open source project, not as a new idea...


Precisely. I read the whole post looking for a link to the source on github or something, and then the last sentence was just a huge letdown.


It sounds like a neat project, but I think describing it as "real time" is misleading if you're not also providing information on latency. The majority of the provided use cases seem to indicate a high level of scalability and durability, as well as a high level of throughput, but these are not necessary characteristics of a true real time system.

It's a common misconception. A real time system doesn't have to be fast, efficient, or fault tolerant. A real time system must guarantee with 100% certainty that in all cases it will respond to input X within a time period of Y.

I would be interested to learn the timing issues driving the development of this system and how you've guaranteed such a response time, especially given that it's running on top of the JVM and must therefore deal with a non-deterministic garbage collection process.


You described a hard real-time system. That exists for things like the controllers on a jet. What's becoming much more prevalent are soft real-time systems that perform analytics. There won't be any catastrophic failure if deadlines aren't met - and there may not even be any expressed deadline - it's just understood that the data must be processed and analyzed as fast as possible to be useful.


I certainly did describe a hard real-time system. It's nice to see that other people recognize the distinction.

Every time I see a post describing a "real time" system, I read it hoping that what they are describing is a hard real-time system, because those are neat. They never are, probably because hard real-time systems are so difficult and expensive to build. I also guess they aren't the most relevant type of system for the majority of people here, who are dealing (as you say) with customer-facing front ends.


Yes, Storm is more intended for soft realtime problems.


This looks interesting. Questions:

(1) What do you mean by a processing topology -- is this a data dependency graph?

(2) How does one define a topology? Is this specified at deployment time via the jar file, or can it be configured separately and on the fly?

(3) Must records be processed in time order, or can they be sorted and aggregated on some other key?


1. A processing topology is a graph of how data flows through workers. The roots of the topology are "spouts" that emit tuples into the topology (an example spout is one that reads off of a Kestrel queue). The other nodes take in tuples as input and emit new tuples as output. Each node in the graph executes as multiple threads spread across the cluster (see the sketch after these answers).

2. To deploy a topology, you give the master machine a jar containing all the code and a topology. The topology can be created dynamically at submit time, but once it's submitted it's static.

3. You receive the records in the order the spouts emit them. Things like sorting, windowed joins, and aggregations are built on top of the primitives that Storm provides. There's going to be a lot of room for higher level abstractions on top of Storm, a la Cascalog/Pig/Cascading/Hive on top of Hadoop. We've already started designing a Clojure-based DSL for Storm.
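
For the curious, here's a rough sketch of defining and submitting a topology, based on the TopologyBuilder/StormSubmitter API that released versions of Storm expose (MyQueueSpout and MyBolt are hypothetical user classes):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class MyTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Spout: emits tuples read off a queue, as 4 threads across the cluster.
            builder.setSpout("queue-reader", new MyQueueSpout(), 4);
            // Bolt: consumes those tuples and emits new ones, as 8 threads,
            // with input tuples distributed randomly among them.
            builder.setBolt("processor", new MyBolt(), 8)
                   .shuffleGrouping("queue-reader");

            // Ships the jar to the master and runs the topology until it's killed.
            StormSubmitter.submitTopology("my-topology", new Config(),
                                          builder.createTopology());
        }
    }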


TW;DR!!

For a variety of reasons, I keep my browser windows about 900 pixels wide. Your site requires a honking 1280 to get rid of the horizontal scrollbar -- and can't be read in 900 without scrolling horizontally for every line (i.e. the menu on the left is much too wide).

(OT, I know, but it's a pet peeve of mine. It's been known for years how to use CSS to make pages stretch or squish, within reason, to the user's window width. 900 is not too narrow!)

EDITED to add: yeah, I'm willing to spend some karma points on this, if that's what happens. Wide sites are getting more common, and this is one of the worst I've seen.


I had to head back and check the article out again, since I had not noticed the width being an issue. For what it is worth, it scaled down and was perfectly usable on my iPad.

I find your width comments especially relevant right now, because we have started a new site design project focusing on offering 5 different width-based layouts. Your comment is proof that offering multiple content-width options to desktop users, and not just mobile users, is useful to people besides just me.


How is this different from a "traditional" CEP system like Esper?

(I mean on the actual processing front, rather than architecturally -- sounds like Storm is a bunch of building blocks instead of a unified system.)


Storm is a unified system. The key difference with other CEP systems is that Storm executes your topologies across a cluster of machines and is horizontally scalable. Running a topology on Storm looks like this:

storm jar mycode.jar backtype.storm.my_topology

In this example, "backtype.storm.my_topology" defines the realtime computation as a processing graph. The "storm jar" command distributes the code across the cluster and executes it using many processes. Storm makes sure the topology runs forever (or at least until you stop it).

(I can't say I'm intimately familiar with every CEP system out there, so feel free to correct me if there are distributed ones. Those products tend to have webpages that make it hard to decipher what they actually are or do.)


I work on a similar system that was previously discussed on HN: http://news.ycombinator.com/item?id=2442977


Watson? Is that the right link?


It's the right link, but the article is poorly written. See my corrections in the comments.


" To compute reach, you need to get all the people who tweeted the URL, get all the followers of all those people, unique that set of followers, and then count the number of uniques. It's an intense computation that potentially involves thousands of database calls and tens of millions of follower records."

Or you could use a Graph DB to solve a Graph problem.

URL -> tweeted_by -> users -> followed_by -> users

Try that on Neo4j.


To do that query on Neo4j, you would need to store in memory on one machine the entire Twitter social graph, all the people who tweeted every URL ever tweeted on Twitter, and then do the computation on a single thread. Neo4j can't handle that scale.

The reach computation on Storm does everything in parallel (across however many machines you need to scale the computation) and gets data using distributed key/value databases (Riak, in our case).
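
To make the shape of that computation concrete, here's a single-machine sketch of the reach logic; getTweeters and getFollowers are stubs standing in for the distributed key/value lookups, and on Storm each step fans out to many parallel workers:

    import java.util.*;

    public class ReachSketch {
        // Stubs for lookups against a distributed k/v store (Riak in BackType's case).
        static List<String> getTweeters(String url) { return Collections.emptyList(); }
        static List<String> getFollowers(String user) { return Collections.emptyList(); }

        static int reach(String url) {
            Set<String> uniqueFollowers = new HashSet<String>();
            for (String tweeter : getTweeters(url)) {           // everyone who tweeted the URL
                uniqueFollowers.addAll(getFollowers(tweeter));  // fan out to their followers
            }
            return uniqueFollowers.size();                      // reach = count of uniques
        }
    }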


Nathan, we'd love to hear your postmortem on BackType's experience with Neo4j, and how Sphinx is turning out.


We used Neo4j over a year ago, and it was pretty unstable when we used it. The database files were getting corrupted pretty frequently (a few times a week), so it just didn't work out for us. Ultimately it was for a small feature, so rather than continue to struggle with Neo4j we just reimplemented the feature using Sphinx. Like I said, that was a long time ago and Neo4j may have gotten a lot better since then.


OK, thanks! That's valuable info.


This sounds great.

This is the traditional realtime processing use case: process messages and update a variety of databases.

Question: I typically think of real-time as a need for user-facing things, i.e. handling a user's requests before he gets bored and goes away. Is Storm set up for that? Or is it mostly meant to update a database with results rather than return them to a waiting process?


It handles both cases. Storm can be used to asynchronously update databases in realtime in a scalable way (replacing traditional systems of queues and workers). Using Storm for Distributed RPC lets you do intense computations on the cluster and return the results to a waiting process.
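
For example, the waiting-process side of a Distributed RPC call is just a blocking request. A sketch using the DRPC client Storm eventually shipped (the host, port, and function name are placeholders):

    import backtype.storm.utils.DRPCClient;

    public class ReachClient {
        public static void main(String[] args) throws Exception {
            DRPCClient client = new DRPCClient("drpc.example.com", 3772);
            // Blocks while the "reach" topology computes across the cluster,
            // then hands the result back to this waiting process.
            String result = client.execute("reach", "http://example.com/some-article");
            System.out.println("reach: " + result);
        }
    }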


Ah yes, I should have read not scanned "3. Distributed RPC". Thanks.


I'm not sure if this is the same thing, but there's also a new company called Hadapt (the commercialization of HadoopDB). It's about adapting Hadoop for real-time analytic SQL queries by putting local SQL dbs on the Hadoop nodes and then using the Hadoop plumbing. It's based on Daniel Abadi's research, he's a really smart guy.


The high-level difference is that anything Hadoop-based is oriented around "give me a job to run, I'll go crunch over the data, and spit the result back to you."

In "realtime" analysis, you tell the system "these are the queries I want you to run" and it continuously updates those with answers as data arrives.


Can you comment on distributing non-JAR software?

Also, this sounds faintly like the old Sun Grid Engine.


If you're using a programming language other than Java with Storm, you simply include those files in the jar that you submit to the cluster. For example, you would put your Python processing code "component.py" in the resources/ subdirectory of your jar (a jar is basically just a zipfile).
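
Here's a sketch of the Java side of that wiring, using Storm's multilang support as it eventually shipped (the class name and output field are illustrative):

    import java.util.Map;
    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Fields;

    public class MyPythonBolt extends ShellBolt implements IRichBolt {
        public MyPythonBolt() {
            // Spawns "python component.py" as a subprocess on each worker;
            // component.py must be packaged under resources/ in the jar.
            super("python", "component.py");
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("result"));  // hypothetical output field
        }

        public Map<String, Object> getComponentConfiguration() {
            return null;  // no component-specific configuration
        }
    }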


How is this different / better than Yahoo S4 [1], which does have code on github? [2]? Why did you choose to build this, or did you start before S4 became public?

[1] http://docs.s4.io/ [2] https://github.com/s4/core


Great question. S4 came out right around the time we started designing Storm.

The projects share similarities. The biggest difference from S4 is that Storm guarantees messages will be processed, whereas S4 will just drop messages. Getting this level of reliability takes more than just using TCP to send messages - you need to track the processing of messages efficiently and retry any message that doesn't get completed for some reason (like a node going down). Implementing reliability is non-trivial and affects the design of the whole system.
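
In Storm's eventual API, that tracking surfaces as anchoring and acking inside each bolt. A minimal sketch (the uppercasing is just a stand-in transformation):

    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class ReliableBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            // Anchored emit: the output tuple is tracked as part of the input's
            // tuple tree, so failures downstream are detected.
            collector.emit(input, new Values(input.getString(0).toUpperCase()));
            // Ack marks this input fully processed; if a tuple's tree is never
            // fully acked within the timeout, the spout replays the message.
            collector.ack(input);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper"));
        }
    }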

We also felt that there was a lot of accidental complexity in S4's API, but that's a secondary issue.


This sounds like something that's been painfully over-engineered.

One of the main problems they solve is "distributed RPC", from TFA: "There are a lot of queries that are both hard to precompute and too intense to compute on the fly on a single machine."

That's generally a sign that you've made a mistake somewhere in your application design. Pain is a response that tells you "stop doing that".


So there are no complex distributed problems?


Didn't you know, everything is a website that can be written single-threaded?



