
Announcing Apache Flink 1.0.0 - jonbaer
http://flink.apache.org/news/2016/03/08/release-1.0.0.html
======
falcolas
> High throughput and low latency stream processing with exactly-once
> guarantees.

Precisely how is this possible in a real-world distributed system? At-least-
once and at-most-once are possible with different distribution methods, but
"exactly-once" is, well, I have yet to see it actually implemented (even
theoretically, see the Two Generals thought experiment). Even their
checkpointing algorithm has an assumption that the underlying transport is
"exactly once" and "FIFO ordered".

If I had to guess, I'd say that what they're really guaranteeing is "at most
once" message delivery. Unfortunately, that has a very different meaning from
"exactly once".

[http://bravenewgeek.com/you-cannot-have-exactly-once-deliver...](http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/)

~~~
jahitr
This is something that bothers me a lot.

What they don't tell you is that in order to achieve "exactly once" delivery
you need idempotent writes -- for example, inserting into a database using
primary keys.
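To illustrate the point about idempotent writes (a sketch, not tied to any particular framework -- the table and IDs here are hypothetical), a primary key turns a replayed insert into a no-op:

```python
import sqlite3

# Hypothetical events table keyed by a unique event_id (the "primary key"
# mentioned above). An in-memory SQLite DB stands in for a real store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def write_event(event_id, payload):
    # INSERT OR IGNORE makes redelivery harmless: replaying the same
    # event_id leaves exactly one row, so at-least-once delivery plus
    # this idempotent write behaves like exactly-once from the
    # consumer's point of view.
    conn.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )

write_event("evt-1", "hello")
write_event("evt-1", "hello")  # duplicate delivery after a replay

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After both calls, `count` is 1: the duplicate delivery had no observable effect.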

~~~
ptgoetz
Exactly-once in a distributed system doesn't exist, and we need to get past
that. It can be approximated with replay (at-least-once) and de-
dupe/idempotency.

Trying to ensure true exactly-once is a fool's errand. In a distributed system
it requires guarantees (down to the magnetic-material level) that are very
hard to get right. If any of those guarantees fail, you don't have it. What
you have is a close approximation.

Real world applications usually involve a lot more than counting words in
memory.

At-least-once is relatively easy. Combine it with idempotent operations and
your work is done.
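The "at-least-once plus de-dupe" recipe above can be sketched in a few lines (an illustration, not Storm code -- in production the seen-set would live in durable storage):

```python
# At-least-once delivery may replay messages after a failure; tracking
# processed message IDs makes the non-idempotent side effect run once.
seen = set()
total = 0

def process(msg_id, value):
    global total
    if msg_id in seen:    # duplicate from a replay -> drop it
        return
    seen.add(msg_id)
    total += value        # the non-idempotent operation happens once

# Delivery stream with replays of "a" and "b" after a simulated failure:
for msg_id, value in [("a", 1), ("b", 2), ("a", 1), ("b", 2), ("c", 3)]:
    process(msg_id, value)
```

Here `total` ends up 6 rather than 9: each message is counted exactly once despite being delivered more than once.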

Storm is fairly explicit in documenting what it takes to achieve this, but
it's not trivial, and every system involved has to support certain guarantees.

Spark (streaming) made some pretty big claims about exactly-once guarantees,
but it turned out that claim was riddled with holes.

In my opinion, "exactly-once" doesn't allow for exceptions to the rule --
either you have it or you don't.

Guard against dupes and you'll be fine (easier said than done, obviously), but
also know the limitations of the systems and frameworks you are working with.

EDIT: Disclaimer: I am an Apache Storm PMC Member

~~~
sewen
I disagree on some points.

First, it is crucial to distinguish between "exactly-once" semantics with
respect to state _inside_ the stream processor (for example an aggregate
computed in a window) and exactly-once delivery to external systems. The
former is built into Flink; the latter is only possible in some cases
(transactional systems) and requires extra effort.

Exactly-once for state _inside_ the stream processor is incredibly useful,
because it allows you to implement many non-idempotent operations such that
the writes to external systems are idempotent: For example, you compute the
complex aggregate in the stream processor and only periodically write the
result to the external system (overwriting previous values). Now the external
system always reflects an aggregate without duplicates.

That is very valuable and only possible if _inside_ the stream processor, you
have exactly-once semantics for state. That does imply that the stream
processor has a notion of managed state (in Flink for example the Windows,
key/value state, and generic checkpointed state).
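The pattern described above -- keep the aggregate in exactly-once internal state, then periodically overwrite the external value -- can be sketched generically (this is an illustration of the idea, not the Flink API; the dictionaries here stand in for managed state and an external key/value store):

```python
external_store = {}   # stands in for an external key/value system
counts = {}           # managed state inside the stream processor

def on_event(key):
    # Non-idempotent update, but it lives in internal state covered by
    # the processor's exactly-once checkpointing.
    counts[key] = counts.get(key, 0) + 1

def checkpoint_flush():
    # Overwrite (rather than increment) the external values: replaying
    # this flush after a failure cannot double-count anything.
    for key, value in counts.items():
        external_store[key] = value

for key in ["x", "y", "x"]:
    on_event(key)
checkpoint_flush()
checkpoint_flush()  # a replayed flush is harmless
```

Because the external write is an overwrite of a complete aggregate, the external system always reflects a duplicate-free result, as the comment above describes.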

Disclaimer: I am a Flink committer.

------
pcx
This is awesome, congrats to the Flink team!

For our current project we evaluated Spark and Storm, and went the Storm way.
I personally spent a little time on Flink, but could not convince management
to give it more serious consideration, mainly because of name recognition --
Spark and Storm are already so popular. But I am confident Flink will become a
great option, especially now that it has hit 1.0. I felt the community was
very active and the documentation pretty good.

~~~
mring33621
Flink has a Storm compatibility layer, which may aid the transition, should
your management change its mind.

~~~
ptgoetz
The Storm compatibility layer may work for word count, but not for anything
non-trivial.

Here's the fine print (no fault tolerance, etc.):
[https://github.com/apache/flink/blob/master/flink-contrib/fl...](https://github.com/apache/flink/blob/master/flink-contrib/flink-storm/README.md)

------
tristanz
"The Evolution of Massive-Scale Data Processing" presentation from Google
provides a good comparison of Flink with other tools:

[https://docs.google.com/presentation/d/10vs2PnjynYMtDpwFsqmS...](https://docs.google.com/presentation/d/10vs2PnjynYMtDpwFsqmSePtMnfJirCkXcHZ1SkwDg-s/present?slide=id.gd44d9957e_0_243)

------
jahitr
Storm is pretty much dead. Spark, from my point of view, is the current
forerunner.

It supports most use cases: batch jobs and streaming jobs, and work is being
done to reduce the batch time to milliseconds.

Commercial support from the folks at Cloudera and Databricks is very good.

My 2 cents. I hope it helps someone.

~~~
cbsmith
Storm is anything but dead. Spark streaming is definitely not all there yet.

~~~
jonstewart
As luck would have it, I'm going to write a POC next month with Spark
Streaming and Flink. Should I also throw Storm into the mix? What do you think
about Flink?

I generally can get by with Batch, but streaming makes a lot of our logic
easier, and might allow us to handle some new situations. Flink looks pretty
good on paper.

~~~
cbsmith
Flink looks really interesting to me. If it can get community momentum I might
switch to it.

I'd definitely recommend including Storm for a POC. For any POC I'd warn that
most of these streaming solutions are operationally a bit of a PITA, which
tends to distort POC results.

~~~
jonstewart
We will be at half-to-single-rack scale for a while, so I'm not too worried
about operations--we also need to target solo workstations, which is the real
challenge with any of these systems. If things go haywire, we can just restart
processing later.

~~~
cbsmith
You might want to check out Kafka Streams.

------
bsg75
Does anyone have a good comparison of Flink and Spark, especially from a use
case perspective?

Most I have found are light in actual contrast detail.

~~~
doctorcroc
The benchmarking study done by Yahoo was fairly comprehensive, and
quantitatively assesses the different stream frameworks (including Flink and
SS)
[https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...](https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at)

We also did a podcast about it if you're interested in digging deeper --
[http://softwareengineeringdaily.com/2016/02/03/benchmarking-...](http://softwareengineeringdaily.com/2016/02/03/benchmarking-stream-processing-frameworks-with-bobby-evans/)

~~~
abeppu
Thanks for sharing that. I recently was looking at a different comparison of
the _batch_ capabilities of flink and SS [1], which found that flink was
faster at terasort than SS. I'm curious to understand why it is that it looks
like SS can get higher throughput than flink in the streaming case, but less
in the batch case.

[1] [http://www.slideshare.net/ssuser6bb12d/a-comparative-perform...](http://www.slideshare.net/ssuser6bb12d/a-comparative-performance-evaluation-of-apache-flink)

~~~
sewen
I don't think SS gets higher streaming throughput than Flink. That was an
assumption stated in the Yahoo! streaming benchmark post, not the result of an
actual experiment.

------
bfrog
It seems like there's a multitude of these data analytics frameworks with
slightly different features and goals that, generally speaking, do similar
things. Are there significant differences between the three that maybe aren't
as clear from a quick overview?

------
ozten
The favicon is... unfortunate.

The squirrel mascot is cute, but doesn't reduce well.

------
anc84
Ok, but what IS it?

~~~
CaptSpify
I always hate when announcements (especially for a 1.0 release) don't start
off with "This is what this is!"

[http://flink.apache.org/index.html](http://flink.apache.org/index.html)

~~~
zeisss
The info graphic on the right does not show Hadoop, but the getting started
guide suggests that Hadoop is a requirement.

------
easytiger
> High throughput and low latency stream processing

It should be illegal to make claims like that without some comparative example
vs a C++ zmq/tcp pipeline doing the same thing.

~~~
faizshah
Here:

[https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...](https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at)

[http://data-artisans.com/extending-the-yahoo-streaming-bench...](http://data-artisans.com/extending-the-yahoo-streaming-benchmark/)

