
Flying faster with Twitter Heron - Rifu
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
======
haberman
Skimming the paper, I found it hard to get a handle on exactly what kind of
abstraction and guarantees this system implements (I am a co-author on one of
the cited papers, so I'm reasonably familiar with this space). Maybe it is
easier to understand for people who are already familiar with Storm.

Most notably, I couldn't determine whether the system is stateful, or what
kind of guarantees (if any) are provided regarding stateful processing.

For example, the paper says Heron is used to compute real-time active user
counts. That implies that the system needs some way to keep track of and
"remember" how many unique users it has seen in the last N hours or whatever.
How does Heron model this state and how does it guarantee (if it does) that a
crashing node will not lose its accumulated state?

In my experience this is the hardest part, by far, of stream processing, so
when I see any work in this area it's the first thing I am curious to learn
about. A system that guarantees strong consistency (i.e., accurate counts) even
in the presence of node crashes is way, way harder to get right (and a lot
more expensive, resource-wise) than one that assumes it's ok to lose a little
bit of data.

It looks like Heron implements only at-most-once and at-least-once semantics,
so maybe that is my answer there. You need exactly-once semantics to get
robust and reliable answers, and you need to guarantee that state changes are
atomic with the exactly-once semantics.
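
To make the atomicity point concrete, here is a minimal sketch (not Heron's or Storm's actual mechanism; the class and method names are hypothetical) in which the running count and the input offset are committed as one record. After a simulated crash, the node reloads both from the same record, so events are neither dropped nor counted twice.

```python
class CheckpointedCounter:
    """Toy counter whose state and input offset are committed together."""

    def __init__(self):
        # Stand-in for durable storage: one record holding offset AND count.
        self.durable = {"offset": 0, "count": 0}
        # Volatile working copy; a crash discards whatever is not committed.
        self.offset = 0
        self.count = 0

    def commit(self):
        # A single assignment models an atomic write of (offset, count).
        self.durable = {"offset": self.offset, "count": self.count}

    def recover(self):
        # Both fields come from the same committed record, so they agree.
        self.offset = self.durable["offset"]
        self.count = self.durable["count"]

    def process(self, events, crash_at=None, checkpoint_every=2):
        """Consume events starting from the last committed offset.

        crash_at simulates a node failure just before that event index.
        """
        self.recover()
        while self.offset < len(events):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated crash")
            self.count += events[self.offset]
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.commit()
        self.commit()


events = [1, 1, 1, 1, 1]
counter = CheckpointedCounter()
try:
    counter.process(events, crash_at=3)   # dies after the checkpoint at offset 2
except RuntimeError:
    pass
counter.process(events)                   # resumes from offset 2
print(counter.count)                      # 5
```

If the offset and the count were written separately, a crash between the two writes would leave them out of step, which is exactly where lost or duplicated updates come from.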

Of course some systems are ok with their output degrading a little when nodes
crash. It's not the end of the world if the active user count is a little off.
But beware of tolerating this too much -- the bad thing about allowing data
loss is that it tends to come in storms (no pun intended). Once something is
going wrong, the answers can be way off. The error is not bounded in most
cases I've encountered.

~~~
vikkyrk
From the paper, "The design of Heron allows supporting exactly once semantics,
but the first version of Heron does not have this implementation. One reason
for tolerating the lack of exactly once semantics is that Summingbird [8]
simultaneously generates both a Heron query and an equivalent Hadoop job, and
in our infrastructure the answers from both these parts are eventually
merged".

The industry term for this approach is lambda architecture
([http://lambda-architecture.net/](http://lambda-architecture.net/))
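
A rough sketch of the serving side of that idea, assuming per-hour counts (the paper quote only says the answers are "eventually merged"; the function name, the data layout and the merge rule below are assumptions for illustration): the streaming job supplies approximate numbers quickly, and the batch job's exact numbers overwrite them once they exist.

```python
def serve_counts(batch_view, realtime_view):
    """Return per-hour counts, preferring the batch result where one exists."""
    merged = dict(realtime_view)   # approximate, low-latency values
    merged.update(batch_view)      # exact values replace them once computed
    return merged


realtime = {"10:00": 98, "11:00": 120, "12:00": 40}   # streaming job, may be off
batch = {"10:00": 100, "11:00": 121}                  # batch job, lags behind
print(serve_counts(batch, realtime))
# {'10:00': 100, '11:00': 121, '12:00': 40}
```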

------
djb_hackernews
I wonder what Nathan Marz thinks about this, as he is the guy that created
Storm (which IMO, is some of the best OSS code out there)

------
filereaper
It would be interesting to see how Heron compares to Spark wrt performance. I
keep hearing that Storm is slower than Spark; does Heron now catch up with or
exceed it?

~~~
bbulkow
I have run into the author in other contexts, and I believe (without quotable
proof) that Heron must be massively faster than Storm - if he says it is.
Storm has a wide variety of failings (I've presented about "when to use Storm"
at a few conferences), such as the difficulty of exactly-once processing
(Trident attempted to solve that).

I am looking forward to taking Heron for a spin.

~~~
nl
The linked blog post shows that it is faster than _Storm_. The OP was asking
about _Spark_ though. That's a fair question, since there is a good deal of
overlap between the typical Storm use cases and Spark use cases.

------
vicaya
Storm is not even 1.0 yet. Since Heron is API compatible with Storm, why can't
Heron be simply a code name for Storm 2.0?

Storm is a much better name than Heron, IMO.

~~~
TheHydroImpulse
Storm has been moved under the Apache umbrella so Twitter coming out with
Storm 2.0 would present a few problems.

------
ptgoetz
(DISCLAIMER: I am an Apache Storm PMC Member)

This could be a very good thing for Apache Storm depending on how Twitter
handles it.

Just to clarify, the Storm version mentioned in the paper and blog post is not
an official Apache release and doesn't include many performance improvements
included in the newer releases of Apache Storm. There are a lot, and many more
on the horizon.

That being said, the performance numbers look impressive, even though there is
no way to confirm them since no code or benchmarks have been published. IMHO,
until that happens, there's not much to see here (not that I doubt it -- I'd
just like to see proof/code).

My hope is that Twitter is dedicated to the projects it has open-sourced, and
this is not a case of NIH, but rather an honest effort on Twitter's part to
contribute back to the open source community.

@haberman:

Storm implements exactly-once processing through a higher-level API called
Trident, that I like to call Storm's "Streams API" since it's not unlike Java
8's Streams API (and largely inspired by Cascading). Trident processes data in
configurable micro-batches, as opposed to one-at-a-time, which gives it an
advantage in terms of throughput, but at the cost of latency. Trident
topologies "compile" down to Core Spout/Bolt topologies (The Trident API has a
planner implementation that figures that out -- not unlike an SQL query
planner).

The Storm Core API provides at-least-once semantics through an acking
mechanism described here [1]. The Trident API builds on top of that to support
exactly-once semantics by essentially doing a de-dupe [2].
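
As a rough illustration of that de-dupe idea (a simplified Python sketch, not Trident's Java API; the class and method names are hypothetical): each micro-batch carries a transaction id, and the store keeps the last applied id next to the value. A batch that is replayed after a failure arrives with the same id and is skipped. The sketch assumes a replayed batch contains the same tuples as the original and that batches are applied in order.

```python
class TransactionalCountStore:
    """Toy key-value store that keeps (last_txid, count) as one value per key."""

    def __init__(self):
        self.rows = {}  # key -> (last_txid_applied, count)

    def apply_batch(self, txid, key, increment):
        last_txid, count = self.rows.get(key, (None, 0))
        if last_txid == txid:
            # Same batch delivered again (at-least-once replay): do nothing.
            return count
        count += increment
        # txid and count are written together, so they cannot disagree.
        self.rows[key] = (txid, count)
        return count


store = TransactionalCountStore()
store.apply_batch(1, "active_users", 3)
store.apply_batch(2, "active_users", 2)
store.apply_batch(2, "active_users", 2)   # replay of batch 2 is ignored
print(store.rows["active_users"])         # (2, 5)
```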

I'm not sure exactly why they don't claim to support this, since Trident is
built on top of Storm's Core API.

[1] [https://storm.apache.org/documentation/Acking-framework-implementation.html](https://storm.apache.org/documentation/Acking-framework-implementation.html)

[2] [https://storm.apache.org/documentation/Trident-state.html](https://storm.apache.org/documentation/Trident-state.html)

@filereaper:

Assuming you are referring to Spark Streaming, forget about any benchmarks you
may have seen. Either can be faster than the other depending on how you
configure it, and what your use case is. See my presentation on the subject
here [3]. With either, you can configure yourself into a corner and screw your
performance.

Performance tuning distributed systems is a mysterious art. As is
benchmarking. Unfortunately, that fact is frequently exploited for
"benchmarketing" purposes. Don't trust any benchmark but your own unless it is
fully open-sourced (including configuration).

[3] [http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming](http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming)

@vicaya

Version numbers don't necessarily equate to code quality, performance, or
stability. I've seen many projects bump to 1.0 only for marketing purposes.

------
jhugg
Open source? No?

~~~
mdaniel
Bizarre, their blog post is tagged "open source" but I checked their GitHub
page and no sign of it.

~~~
jhugg
Storm is OSS, and referenced a few times.

------
shit_parade2
Developing with Twitter's open source projects and API is essentially asking
to be robbed by them if anything you make is successful.

~~~
hurrycane
Why is that? Finagle, Mesos and Aurora look like solid open source projects
from Twitter.

~~~
fernandotakai
I started looking at Finagle, Finatra and Scrooge. I love all three projects,
but at the same time, the documentation is really, really bad.

If you have the same requirements as Twitter, nice -- but if you detour even a
little bit, you are going to have a shitty time (for example, for Thrift, I
wanted to use the buffered codec instead of framed and there was not a single
document explaining how to do it. I spent ~2h perusing unit tests to find a
way, and I don't know if it's correct or not).

