
Scalable Stream Processing: A Survey of Storm, Samza, Spark and Flink - DivineTraube
http://medium.com/baqend-blog/real-time-stream-processors-a-survey-and-decision-guidance-6d248f692056#.b1n6hnbrd
======
techwizrd
What about using Apache Beam[0] to abstract over these stream processing
frameworks? In a recent Software Engineering Daily podcast[1], Frances Perry
recommends using Flink for the data flow model.

0: [http://beam.incubator.apache.org/](http://beam.incubator.apache.org/)

1: [http://softwareengineeringdaily.com/2016/08/19/apache-beam-w...](http://softwareengineeringdaily.com/2016/08/19/apache-beam-with-frances-perry/)

~~~
nl
The problem with Beam is that it is an abstraction layer.

This sounds like a stupid comment, but when the underlying services are all
evolving as quickly as they do in this space, programming against an
abstraction layer means you need to wait for (often sorely needed) new
features.

(Frances Perry is an engineer on Beam at Google, so it would be surprising if
she recommended against it)

~~~
agibsonccc
Do you have examples of such features? Most production deployments of these
kinds of things don't upgrade THAT often.

You have a point that abstraction layers can limit you since it's another
compatibility layer, but looking at what's done in actual practice I don't see
it being that bad.

If anything, it should be evaluated on a case by case basis.

Flink is actually better tech overall for my use case, but a lot of customers
want Spark Streaming since it's already installed. Having Beam where we can do
both is kinda nice.

~~~
nl
We had to abandon Cloudera (and Hortonworks) Hadoop distributions because
they didn't ship R support for Spark quickly enough (in 1.3 or 1.2 or whatever
version that was).

We jumped from 1.5 to 1.6 because of algorithms in MLlib (although that turned
out to be a bit of a disappointment).

~~~
agibsonccc
Most organizations I see tend to just wait... this sounds more like a one-off
to me. You're at a unique employer if the data scientists have that much power.

------
marknadal
Great article, although it would be nice to also add the cost (price) of
running the systems. For instance, with our own custom solution we were able
to save 100M+ records (about 100GB) a day on 1 small machine for $10 total
(processing cost, disk cost, and backup cost), which is why I'm curious how it
compares. See
[https://www.youtube.com/watch?v=sG5qtN8E-6Q](https://www.youtube.com/watch?v=sG5qtN8E-6Q)
and
[https://www.youtube.com/watch?v=x_WqBuEA7s8](https://www.youtube.com/watch?v=x_WqBuEA7s8)
.
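
To put those figures in per-record terms, here is a rough back-of-the-envelope sketch in Python. It assumes the $10 is a per-day cost, that "100GB" means 10^11 bytes, and that the load is spread evenly over the day; none of that is stated above, so treat the results as order-of-magnitude only.

```python
# Figures quoted in the comment above.
RECORDS_PER_DAY = 100_000_000
BYTES_PER_DAY = 100 * 10**9      # assumption: decimal gigabytes
COST_PER_DAY_USD = 10.0          # assumption: "$10 total" is per day
SECONDS_PER_DAY = 24 * 60 * 60

# Average record size implied by the two volume figures.
bytes_per_record = BYTES_PER_DAY / RECORDS_PER_DAY

# Sustained rates, assuming the load is evenly spread over the day.
records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY
megabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e6

# Unit costs under the per-day assumption.
usd_per_million_records = COST_PER_DAY_USD / (RECORDS_PER_DAY / 1e6)
usd_per_gigabyte = COST_PER_DAY_USD / (BYTES_PER_DAY / 1e9)

print(f"{bytes_per_record:.0f} bytes/record")
print(f"{records_per_second:.0f} records/s, {megabytes_per_second:.2f} MB/s")
print(f"${usd_per_million_records:.2f} per million records, "
      f"${usd_per_gigabyte:.2f} per GB")
```

Under those assumptions the workload averages roughly a thousand 1 KB records per second, which is the kind of number a cost comparison against the surveyed systems would need to start from.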

------
scott_s
I work on IBM Streams, which was discussed and dismissed. The author is
correct that we need to have more public benchmarks. I think our system should
be in the upper right of the high-level view, but of course I would think
that, as someone who works on it.

Something I can point to is a modified Linear Road benchmark:
[https://github.com/IBMStreams/benchmarks/tree/master/Streams...](https://github.com/IBMStreams/benchmarks/tree/master/StreamsLinearRoadBenchmark)

This benchmark was made at the request of a potential customer. Our
implementation scaled to 200 "lanes." The other systems tested did not scale
past 50. Unfortunately, that's as much as I feel I can say until I speak with
some of the people involved in the comparison.

~~~
eternalban
Interested in learning about the architecture and internals & can't dig
anything up. Any doc links you can share?

~~~
scott_s
Official documentation:
[http://www.ibm.com/support/knowledgecenter/SSCRJU_4.1.1/com....](http://www.ibm.com/support/knowledgecenter/SSCRJU_4.1.1/com.ibm.streams.welcome.doc/doc/kc-homepage.html)

Development community:
[https://developer.ibm.com/streamsdev/](https://developer.ibm.com/streamsdev/)

Some pointers to posts I've made in the development community focusing on the
language and performance:
[http://www.scott-a-s.com/streams-posts/](http://www.scott-a-s.com/streams-posts/)

Academic paper on the language; this is an IBM technical report, and a version
of it will be published in TOPLAS:
[http://hirzels.com/martin/papers/tr14-rc25486-spl.pdf](http://hirzels.com/martin/papers/tr14-rc25486-spl.pdf)

Brief academic paper on the systems aspects of the language:
[http://hirzels.com/martin/papers/debull15-spl.pdf](http://hirzels.com/martin/papers/debull15-spl.pdf)

~~~
eternalban
Thanks.

~~~
scott_s
Turns out we do have public information on the Linear Road performance:
[http://www.slideshare.net/RedisLabs/walmart-ibm-revisit-the-...](http://www.slideshare.net/RedisLabs/walmart-ibm-revisit-the-linear-road-benchmark)

------
justinsaccount
Is there a SIMPLE stream processing library/framework?

Something like spark or kafka streaming but that doesn't depend on hundreds of
megabytes of java stuffs?

I just want to do basic windowing/counting against data streams.. I don't want
to be part of a gigantic java ecosystem :-(
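
For a sense of scale, basic windowed counting needs very little machinery on a single node. The following is a minimal sketch in plain Python (standard library only), not a replacement for any of the surveyed frameworks. It assumes events arrive in timestamp order as `(timestamp_seconds, key)` pairs; the function name `tumbling_counts` and the sample data are made up for illustration.

```python
from collections import Counter


def tumbling_counts(events, window_seconds):
    """Yield (window_start, {key: count}) for fixed, non-overlapping windows.

    Assumes `events` is an iterable of (timestamp_seconds, key) pairs that
    arrive in timestamp order. Windows with no events are skipped.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        # Align the event to the start of the window that contains it.
        start = int(timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # An event from a later window closes the current one.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        # Flush the last open window when the input ends.
        yield current_start, dict(counts)


# Hypothetical sample stream: (timestamp in seconds, event key).
sample = [(0.5, "a"), (3.0, "b"), (9.9, "a"), (10.0, "a"), (14.2, "b"), (25.0, "a")]
for window_start, window_counts in tumbling_counts(sample, window_seconds=10):
    print(window_start, window_counts)
```

Because it is a generator, it holds only the current window in memory. It does not handle out-of-order events, fault tolerance, or scale-out, which is where the heavier systems earn their footprint.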

~~~
wolle
Depending on your requirements, a single-node solution like PipelineDB may
suffice; it's a PostgreSQL extension that lets you write streaming SQL
queries.

~~~
avifreedman
PipelineDB can scale out to multiple nodes (for $).

~~~
wolle
Is there a reference regarding the number of nodes that can be deployed in the
cluster and that are feasible to maintain? From the docs, I gather that "[a]ll
DDL/non-stream DML statements are executed in a distributed transaction on all
nodes in the cluster and committed via two-phase commit." [1]

Doesn't that get in the way of low latency and availability?

[1] [http://enterprise.pipelinedb.com/docs/two-phase.html#two-pha...](http://enterprise.pipelinedb.com/docs/two-phase.html#two-phase-commits)
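
To make the latency and availability concern concrete, here is a toy model of textbook two-phase commit in Python. It is a generic sketch, not PipelineDB's implementation: the `Node` class, the latency numbers, and the timing model (two parallel message rounds, each as slow as the slowest node, with timeouts ignored) are all assumptions for illustration. Note that the quoted passage limits two-phase commit to DDL and non-stream DML statements.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member for the toy model."""
    name: str
    latency_ms: float          # assumed round-trip time to the coordinator
    available: bool = True
    state: str = "idle"

    def prepare(self):
        # Vote yes only if the node is reachable.
        if self.available:
            self.state = "prepared"
        return self.available


def two_phase_commit(nodes):
    """Run one transaction; return (committed, elapsed_ms).

    Each phase is sent to all nodes in parallel, so each round costs as
    much as the slowest node. Timeouts for unreachable nodes are ignored.
    """
    slowest = max(node.latency_ms for node in nodes)

    # Phase 1: every node must vote yes.
    votes = [node.prepare() for node in nodes]
    committed = all(votes)

    # Phase 2: broadcast the decision to the nodes that can hear it.
    for node in nodes:
        if node.available:
            node.state = "committed" if committed else "aborted"

    return committed, 2 * slowest


cluster = [Node("a", 2.0), Node("b", 5.0), Node("c", 40.0)]
print(two_phase_commit(cluster))   # the slow node sets the pace

cluster[1].available = False
print(two_phase_commit(cluster))   # one unreachable node blocks the commit
```

In this model a single slow node sets the commit latency for the whole cluster, and a single unreachable node forces an abort, which is the trade-off the question is getting at.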

