
Open-Sourcing Twitter Heron - samber
https://blog.twitter.com/2016/open-sourcing-twitter-heron
======
scaleout1
As someone who has used Heron (along with MillWheel, Spark Streaming and
Storm), I feel like this announcement is too late. The biggest thing Heron
offers is raw scale, but since they decided to use the existing Storm API, it
has the same shitty spout/bolt API that Storm offers. In contrast, Spark
Streaming, Flink and Kafka Streams all offer a map/flatMap/filter/sink-based
functional API. At Twitter most teams used Summingbird on top of Heron
to get the same functional API, but Summingbird didn't get a lot of traction
outside Twitter and I am not sure how actively maintained the OSS version of
Summingbird is. Even if you bite the bullet and decide to use SB with Heron,
you will still miss out on a lot of use cases, as SB was mostly focused on
doing read/transform/aggregate/write, whereas most streaming problems that I
have noticed outside of Twitter involve doing
read/transform/aggregate/decision/write. I suppose you can implement
decisioning in SB but I haven't seen it done.
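
To make the API contrast concrete, here is a minimal Python sketch. It is not Storm, Heron or Summingbird code, and every class and function name in it is made up for illustration. It runs the same read/transform/aggregate/decision/write job twice: once as a bolt-style object that receives tuples and emits through a callback, and once as a chain of map/filter-style steps over a finite list of events.

```python
from collections import Counter

# Toy input: "<kind> <user>" strings standing in for a stream.
EVENTS = ["login alice", "login bob", "fail alice", "fail alice", "fail alice"]


class FailureCountBolt:
    """Hypothetical bolt: counts failures per user and emits one alert at a threshold."""

    def __init__(self, threshold, emit):
        self.threshold = threshold
        self.emit = emit            # downstream callback, playing the role of a collector
        self.counts = Counter()

    def execute(self, tup):
        kind, user = tup.split()                    # transform
        if kind != "fail":
            return
        self.counts[user] += 1                      # aggregate
        if self.counts[user] == self.threshold:     # decision
            self.emit(("alert", user))              # write


def run_bolt_style(events, threshold=3):
    out = []
    bolt = FailureCountBolt(threshold, out.append)
    for event in events:            # this loop plays the role of the spout
        bolt.execute(event)
    return out


def run_functional_style(events, threshold=3):
    parsed = map(str.split, events)                                    # transform
    failures = (user for kind, user in parsed if kind == "fail")       # filter
    counts = Counter(failures)                                         # aggregate
    return [("alert", u) for u, n in counts.items() if n >= threshold] # decision + write


print(run_bolt_style(EVENTS))
print(run_functional_style(EVENTS))
```

Both versions flag the same user. The difference is where the wiring and state live: in the bolt version you manage them by hand, in the functional version the pipeline shape carries them.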

Comparing Heron to Google's MillWheel is interesting because of the design
choices they made. Heron supports at-least-once and at-most-once message
guarantees, but at Twitter most jobs ran with acking turned off, so it was
at-most-once with acknowledged data loss (they had a batch job doing mop-up
work to pick up missing data). Google, on the other hand, implemented
exactly-once semantics by doing idempotent sinks/watermarking and managing
out-of-order messages, plus deduping support. Since both Flink and Spark will
be implementing the Apache Beam (MillWheel's successor) model, the only reason
I see for someone picking Heron instead of Flink/Spark is that they are
operating at a massive scale that Flink/Spark don't support yet.
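
A toy Python sketch of how those guarantees differ. It does not use any Heron or MillWheel API, and the channel and sink classes are hypothetical. The channel loses one message in flight and loses the ack for another. Sending once without retries gives at-most-once (data loss), retrying until acked gives at-least-once (duplicates), and at-least-once into a sink that dedupes by message id gives an effectively exactly-once result.

```python
class FlakyChannel:
    """Hypothetical transport: drops some messages and some acks exactly once each."""

    def __init__(self, lose_msg, lose_ack):
        self.lose_msg = set(lose_msg)
        self.lose_ack = set(lose_ack)

    def send(self, msg, sink):
        """Deliver msg to sink; return True only if the sender sees an ack."""
        msg_id, _ = msg
        if msg_id in self.lose_msg:
            self.lose_msg.discard(msg_id)
            return False            # message lost before reaching the sink
        sink(msg)
        if msg_id in self.lose_ack:
            self.lose_ack.discard(msg_id)
            return False            # delivered, but the ack never came back
        return True


class SumSink:
    """Non-idempotent sink: every delivery is added to the total."""

    def __init__(self):
        self.total = 0

    def __call__(self, msg):
        self.total += msg[1]


class IdempotentSumSink(SumSink):
    """Idempotent sink: remembers message ids and ignores redeliveries."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def __call__(self, msg):
        if msg[0] not in self.seen:
            self.seen.add(msg[0])
            super().__call__(msg)


def at_most_once(messages, channel, sink):
    for msg in messages:
        channel.send(msg, sink)     # fire and forget: acking is off


def at_least_once(messages, channel, sink):
    for msg in messages:
        while not channel.send(msg, sink):
            pass                    # retry until an ack arrives


MESSAGES = [(1, 10), (2, 20), (3, 30)]   # (message id, value); true sum is 60


def run(strategy, sink):
    strategy(MESSAGES, FlakyChannel(lose_msg={2}, lose_ack={3}), sink)
    return sink.total


print(run(at_most_once, SumSink()))            # message 2 is lost
print(run(at_least_once, SumSink()))           # message 3 is counted twice
print(run(at_least_once, IdempotentSumSink())) # duplicates are dropped
```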

~~~
eldenbishop
Storm is a low-level system for managing (optionally) transactional
multi-machine tasks. It makes no assumptions about what is being processed
(e.g. analytics, data transforms). The primitives you are talking about exist
in the child project Trident, which runs on top of Storm. Storm itself is no
more for analytics than a web server is. It is a lower-level tool.

~~~
weego
The parent also ignored the time-to-process difference, which is drastically
lower in Storm. It has its flaws, but scale is not the only metric to use as a
decider.

------
lovelearning
They have dropped clojure completely. All the critical path messaging in Storm
was done using clojure. Dropped netty too. The actual messaging ("stream
manager") seems to be C++. Perhaps that explains the latency and CPU
improvements. The architectural changes mentioned account for better cluster
utilization, fault tolerance and back pressure implementation, but don't
explain why raw streaming performance is so much better in Heron.

Edit: Not so sure what they are using for networking. Seemed like cpp-netlib
at first, but I don't think so now.

~~~
JonathonW
Looks like a homegrown networking framework:
[https://github.com/twitter/heron/tree/master/heron/common/sr...](https://github.com/twitter/heron/tree/master/heron/common/src/cpp/network)

Their default event loop implementation uses libevent, and they're using
protobuf in some of their higher-level networking classes, but the networking
code itself seems to pretty much be plain sockets (with a thin portability
layer on top in a few places).

~~~
zintinio5
It's some...interesting code.

[https://github.com/twitter/heron/blob/master/heron/common/sr...](https://github.com/twitter/heron/blob/master/heron/common/src/cpp/network/event_loop_impl.cpp#L211-L238)

~~~
untog
So _that's_ what Twitter has been working on all this time

------
zenlikethat
Feel like they should feature the link to the repo much more prominently in
the article!

For anyone curious, it's
[https://github.com/twitter/heron](https://github.com/twitter/heron)

~~~
ghayes
And they built a site for the project with a good Getting Started section:
[http://twitter.github.io/heron/](http://twitter.github.io/heron/)

Getting Started:
[http://twitter.github.io/heron/docs/getting-started/](http://twitter.github.io/heron/docs/getting-started/)

------
buremba
Great news! Here is the paper if you're interested in Heron.
[http://dl.acm.org/citation.cfm?id=2742788&CFID=620516550&CFT...](http://dl.acm.org/citation.cfm?id=2742788&CFID=620516550&CFTOKEN=92648489)

------
baby
I'm genuinely curious as to why new products are constantly written in Java.
My experience with the language is far from pleasant. Is it because people
actually like the language? Is it because there is no other alternative when
it comes to solid development? ...?

~~~
fizx
Let's say you want a GC'd language (no C, C++) on *nix (no C#), good/varied
libraries to work with (no esoteric languages), good performance (no Ruby,
Python), reasonable options for implementing concurrency (no Javascript),
what's left?

Looks like a JVM language (Java, Scala, etc), or Golang. Java has better
tooling and more mature implementations. I personally find modern Java nicer
to write than Golang (though Scala nicer than both). These days, it comes down
to whether the JVM memory overhead is a big deal on the specific project, and
if not, for the class of projects discussed in the previous paragraph, I'm
probably choosing Scala (but if my teammates object, then it's back to Java).

~~~
PeCaN
> what's left?

- Erlang (this fits your criteria at least as well as Java...)

- OCaml (concurrency isn't the best)

- Haskell (also fits your criteria very well)

- D (easy to use C libraries and sometimes C++)

- Rust (not GCed, but why is GCed a requirement?)

- Mercury (admittedly pretty obscure, but using C libraries is easy enough
and it fits everything else)

Java only seems like a good choice if you ignore all the languages that are a
better choice than Java. IMO Erlang seems like the choice here (or Elixir if
you don't like Erlang syntax).

~~~
hansjorg
I think you're underestimating the value of the enormous Java/JVM ecosystem.

That said, I'm hoping to deploy my first Rust project to production soon. It's
come a long way for being such a young language.

~~~
PeCaN
> I think you're underestimating the value of the enormous Java/JVM ecosystem.

You could be right. For a number of the languages I mentioned I banked on
really good FFIs allowing easy usage of C libraries. Perhaps in some domains,
including stream processing, there are more and better Java or C++ libraries
for them to build on.

------
abc_lisper
Wonder how this compares to Storm 1.0

~~~
abc_lisper
Why am I being down-modded? When Twitter first announced Heron (~a year ago),
they compared it with the version of Storm available then. Since then Storm
has improved its latency and ability to scale out, added back-pressure, and in
some cases is 10x faster than the previous version of Storm (0.10.0?). I was
wondering how Heron compares to Storm's performance now.

------
itaifrenkel
Does anyone know if Storm's ShellBolt is compatible with Heron, or is there a
better way in Heron for running non-JVM bolts?

------
dswalter
I've seen a decent amount of complaining about the difficulty of supporting a
Kafka infrastructure. I'd be interested in thoughts from people who have used
Heron as to how it is to run in production.

~~~
elodina
We open sourced Kafka support for Heron (a Spout, a Bolt and an example
topology):
[https://github.com/twitter/heron/pull/751](https://github.com/twitter/heron/pull/751)

We run Kafka on Mesos
([https://github.com/mesos/kafka](https://github.com/mesos/kafka)), which
allows an as-a-service implementation of broker clusters for users and apps.

Vagrant setup for running the example:
[https://github.com/elodina/heron/blob/v2/contrib/kafka9/vagr...](https://github.com/elodina/heron/blob/v2/contrib/kafka9/vagrant/README.md)

------
vemv
Hearing about a Clojure project being dropped feels like a slap in the face. I
only wonder if it was more of a political decision.

------
mark242
How does Heron compare to Spark streaming?

~~~
jsmthrowaway
Well, the big one is that Heron is built to support real-time streaming while
Spark Streaming is not, given its choice to use micro-batching. If latency
matters, that matters to you.

I'd appreciate if it were called Spark Microbatching, but we can't have
everything.

~~~
kod
The amount of FUD thrown around about micro batching is really kind of silly.

How many streaming analytics use cases are ok with the JVM, ok with 10s of
millis of latency, but not ok with 100s of millis of latency?

~~~
jsmthrowaway
Remove "analytics" from your statement and you'll arrive at the answer,
because while analytics is driving the streaming space as we speak it is
literally one of the most boring applications of streaming imaginable. I
positively cannot get excited about engagement numbers from an event pipeline.
Intrusion detection, fraud detection, about half of infrastructure monitoring
tasks, event sourcing... You're also glossing over the impact of microbatching
which looks more like seconds than milliseconds, as well as what microbatching
does to your windowing abilities (such as having to double-process to work
around the microbatch interval as applied to your desired windowing
semantics).
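
A back-of-the-envelope Python sketch of both effects, under simplifying assumptions: results only become visible at the end of the micro-batch that contains an event, processing time is ignored, and windows have to be a whole number of batches (the function names here are made up for illustration and are not any framework's API).

```python
def emit_delay_ms(event_ms, batch_ms):
    """Delay from an event's arrival to the end of the micro-batch that contains it."""
    batch_end = (event_ms // batch_ms + 1) * batch_ms
    return batch_end - event_ms


def smallest_supported_window_ms(wanted_ms, batch_ms):
    """Round a desired window length up to a whole number of micro-batches."""
    batches = -(-wanted_ms // batch_ms)   # ceiling division on integers
    return batches * batch_ms


arrivals = [0, 100, 240, 250, 999]        # event arrival times in ms
delays = [emit_delay_ms(t, 250) for t in arrivals]

print(delays)                                   # per-event wait with 250 ms batches
print(smallest_supported_window_ms(600, 250))   # a 600 ms window cannot be expressed exactly
```

An event that arrives just after a batch boundary waits almost the full interval before anything downstream can see it, and a window that is not a multiple of the interval has to be approximated or recomputed on top of the batches.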

Microbatch latency immediately rules out several useful applications of
streaming. I also didn't say anything about the JVM nor it being okay for my
purposes. You did.

I can back up my statements on streaming from Flink to Samza to Dataflow/Beam
to Storm to MillWheel to Spark "Streaming" and back again because it has been
my primary focus (literally thinking of nothing else) for a couple years.
Please accuse someone else of FUD because that's a relatively veiled way to
say "you don't know what you're talking about," and I assure you that you're
(condescendingly) wrong. I think you're also coming at that angle from
interpreting me as negative on Spark Streaming. Read carefully.

~~~
kod
> You're also glossing over the impact of microbatching which looks more like
> seconds than milliseconds

I've run spark streaming jobs at 250ms batch times. This comment you made,
right here, is why I used the word FUD.

------
ps4fanboy
If twitter's stock price continues to fall maybe they will open source the
entire stack.

