Announcing Apache Flink 1.0.0 (apache.org)
102 points by jonbaer on March 9, 2016 | 48 comments



> High throughput and low latency stream processing with exactly-once guarantees.

Precisely how is this possible in a real-world distributed system? At-least-once and at-most-once are achievable with different distribution methods, but "exactly-once" is, well, something I have yet to see actually implemented (even theoretically; see the Two Generals problem). Even their checkpointing algorithm assumes that the underlying transport is "exactly once" and "FIFO ordered".

If I had to guess, I'd say that what they're really guaranteeing is "at most once" message delivery. Unfortunately, that has a very different meaning from "exactly once".

http://bravenewgeek.com/you-cannot-have-exactly-once-deliver...


I'm pretty sure that this is the algorithm flink uses: http://arxiv.org/pdf/1506.08603.pdf

EDIT: FYI, sewen is one of the authors of the paper and a founder of Flink


You do not need physical exactly-once. If you can "roll back" changes, then physical "at least once" is enough to realize logical "exactly once".


Rolling back changes means that at some point in time, you have "more than once". It also means you can only ever implement logic which can be rolled back. In a pipeline scenario, that is challenging to do without some sort of external locking system.


True. In most streaming systems, exactly-once refers to the state that is maintained and controlled by the streaming system. For that, Flink can guarantee such semantics. That state can also be exposed externally, see for example here: http://data-artisans.com/extending-the-yahoo-streaming-bench...

For interactions (writes) with external systems, you need transactional integration. There is work in progress to do this for transactional databases; for non-transactional systems, there are cases where it cannot be strictly solved.


"exactly once" is actually pretty common. You can read about how Google does it in their MillWheel paper: http://static.googleusercontent.com/media/research.google.co...

They also compare their guarantees to other streaming systems, and they're not the only one to offer exactly once.


Physical at-least-once is easy to transform into a logical exactly-once. Just assign a unique id to every message at the source and discard duplicates at the target.
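
A minimal sketch of that idea (all names hypothetical): the source stamps each message with a unique id, and the target drops any id it has already seen, turning physical at-least-once delivery into logical exactly-once processing:

    import java.util.Set;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    class DedupingTarget {
        // Ids already processed. A real system would persist this set (or use
        // a rotating bloom filter) so it survives restarts; otherwise
        // duplicates can slip through after a crash.
        private final Set<UUID> seen = ConcurrentHashMap.newKeySet();

        void onMessage(UUID id, String payload) {
            if (!seen.add(id)) {
                return; // redelivered duplicate: already processed, drop it
            }
            process(payload);
        }

        private void process(String payload) {
            System.out.println("processing " + payload);
        }
    }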


You're describing "at most once". The key is that a message could be lost between components, and there's nothing here which accounts for that.

EDIT: To clarify a bit more, I mean "at most once" throughout the entire system, not once per node. Combining physical at-least-once with logical "at-most-once" only works at the level of individual logical nodes, not at the system level. Some processing will occur more than once in the system, which can have negative consequences if the processing is not strictly idempotent.


Using an acking system (like that in Storm or Heron) would make it physically at-least-once with logical exactly-once by de-duping at the target.

This means that if you observe the inner system, you won't see exactly-once semantics, but looking at the system from the outside, the semantics would indeed be exactly-once at the target.


Flink guarantees exactly-once application state access/updates. There is no mention of exactly-once delivery anywhere and it is pretty much impossible to offer this for just any general sink. That does not exclude the possibility of offering a special-purpose transactional sink implementation in the future though (e.g. via external distributed locking/state reconciliation), just saying...


Even that solution would approximate exactly-once delivery (eventually) but not strongly guarantee it.


It is possible to provide an exactly-once guarantee. Trident does this on top of Storm.

http://storm.apache.org/documentation/Trident-tutorial.html


This is something that bothers me a lot.

What they don't tell you is that in order to achieve "exactly once" delivery you need idempotent writes, for example inserting into a database using primary keys.
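
A hedged sketch of what that looks like (PostgreSQL syntax; table and column names invented for illustration): the message id is the primary key, so a redelivered message becomes a no-op instead of a duplicate row:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    class IdempotentWriter {
        // Re-running this for the same messageId changes nothing: the
        // primary-key conflict turns the duplicate insert into a no-op.
        void writeEvent(Connection conn, String messageId, String payload) throws Exception {
            String sql = "INSERT INTO events (message_id, payload) VALUES (?, ?) "
                       + "ON CONFLICT (message_id) DO NOTHING";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, messageId);
                ps.setString(2, payload);
                ps.executeUpdate();
            }
        }
    }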


Exactly-once in a distributed system doesn't exist, and we need to get past that. It can be approximated with replay (at-least-once) and de-dupe/idempotency.

Trying to ensure true exactly-once is a fool's errand. In a distributed system it requires guarantees (down to the magnetic material level) that are very hard to get right. If any of those guarantees fails, you don't have it. What you have is a close approximation.

Real world applications usually involve a lot more than counting words in memory.

At-least-once is relatively easy. Combine it with idempotent operations and your work is done.

Storm is fairly explicit in documenting what it takes to achieve this, but it's not trivial, and every system involved has to support certain guarantees.

Spark (streaming) made some pretty big claims about exactly-once guarantees, but it turned out that claim was riddled with holes.

In my opinion, claiming "exactly-once" implies there are no exceptions to that rule.

Guard against dupes and you'll be fine (easier said than done, obviously), but also know the limitations of the systems and frameworks you are working with.

EDIT: Disclaimer: I am an Apache Storm PMC Member


I disagree on some points.

First, it is crucial to distinguish between "exactly-once" semantics with respect to state inside the stream processor (for example an aggregate computed in a window) and exactly-once delivery to external systems. The former is built into Flink; the latter is only possible in some cases (transactional systems) and requires extra effort.

Exactly-once for state inside the stream processor is incredibly useful, because it allows you to implement many non-idempotent operations such that the writes to external systems are idempotent: for example, you compute the complex aggregate in the stream processor and only periodically write the result to the external system (overwriting previous values). The external system then always reflects an aggregate without duplicates.

That is very valuable, and it is only possible if you have exactly-once semantics for state inside the stream processor. It does imply that the stream processor has a notion of managed state (in Flink, for example, windows, key/value state, and generic checkpointed state).
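
Something like this (a minimal word-count-style sketch with an assumed socket source and a stubbed sink, not code from the Flink docs):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class OverwritingAggregate {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(5000); // snapshots give exactly-once semantics for the state below

            env.socketTextStream("localhost", 9999)
               .map(new MapFunction<String, Tuple2<String, Integer>>() {
                   public Tuple2<String, Integer> map(String word) {
                       return new Tuple2<>(word, 1);
                   }
               })
               .keyBy(0)
               .timeWindow(Time.minutes(1))
               .sum(1) // the aggregate lives in managed, checkpointed state
               .addSink(new SinkFunction<Tuple2<String, Integer>>() {
                   public void invoke(Tuple2<String, Integer> result) {
                       // Hypothetical idempotent write: overwrite the value
                       // stored for key result.f0 instead of appending, so a
                       // replayed result cannot double-count.
                   }
               });

            env.execute("overwriting-aggregate-sketch");
        }
    }

The interesting part is the sink: because it overwrites per key rather than appending, replays after a failure leave the external system consistent.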

Disclaimer: I am a Flink committer.


Exactly-once delivery really depends on where and how you actually "deliver". Even if your writes are idempotent, you need to know whether they have been properly committed on the other side, which is not trivial. If the system you are committing your output to offers version control and/or proper transactional support, then delivery can eventually be reconciled; otherwise, with no such assumptions, it is pretty much impossible. Apache Flink's snapshotting algorithm solely guarantees exactly-once application state access, plain and simple.
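
One hypothetical shape for that reconciliation (table, columns, and epoch scheme all invented): tag every write with the checkpoint epoch that produced it, so a replayed write from an already-committed epoch overwrites rather than duplicates:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    class VersionedSink {
        // Assumes a row per key already exists (the initial INSERT is omitted
        // for brevity). Replaying an old epoch leaves newer results untouched;
        // replaying the current epoch rewrites the same value, which is a no-op.
        void write(Connection conn, String key, String value, long epoch) throws Exception {
            String sql = "UPDATE results SET value = ?, epoch = ? "
                       + "WHERE result_key = ? AND epoch <= ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, value);
                ps.setLong(2, epoch);
                ps.setString(3, key);
                ps.setLong(4, epoch);
                ps.executeUpdate();
            }
        }
    }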


This has been demonstrated for a long time with Storm's Trident.


This is awesome, congrats to the Flink team!

For our current project we evaluated Spark and Storm and went the Storm way. Personally, I spent a little time on Flink but could not convince management to consider it more seriously, mainly because of name recognition: Spark and Storm are so popular already. But I am confident Flink will become a great option, especially now that it has hit 1.0. I felt the community was very active and the documentation pretty good.


Flink has a Storm compatibility layer, which may aid the transition, should your management change its mind.


The Storm compatibility layer may work for word count, but not for anything non-trivial.

Here's the fine print (no fault tolerance, etc.): https://github.com/apache/flink/blob/master/flink-contrib/fl...


I don't think it's going to be easy to move from Storm to Flink. They are very different systems underneath. Any substantially complex application is going to end up having compatibility problems.

My bets are on Apache Beam (Google Dataflow) to create that compatibility layer. Haven't used it yet, hoping it works out though!


"The Evolution of Massive-Scale Data Processing" presentation from Google provides a good comparison of Flink with other tools:

https://docs.google.com/presentation/d/10vs2PnjynYMtDpwFsqmS...


Storm is pretty much dead. Spark, from my point of view, is the current forerunner.

It supports most use cases: batch jobs, streaming jobs. Work is being done to reduce the batch interval to milliseconds.

Commercial support from the folks at Cloudera and Databricks is very good.

My 2 cents; I hope it helps someone.


I love Spark; it has great documentation and Scala! But how is Storm dead?! Its documentation is not as good yet, but there is a strong user base and it is actively used in a lot of companies.


Storm is anything but dead. Spark streaming is definitely not all there yet.


As luck would have it, I'm going to write a POC next month with Spark Streaming and Flink. Should I also throw Storm into the mix? What do you think about Flink?

I generally can get by with Batch, but streaming makes a lot of our logic easier, and might allow us to handle some new situations. Flink looks pretty good on paper.


Flink looks really interesting to me. If it can get community momentum I might switch to it.

I'd definitely recommend including Storm for a POC. For any POC I'd warn that most of these streaming solutions are operationally a bit of a PITA, which tends to distort POC results.


We will be at half-to-single-rack scale for a while, so I'm not too worried about operations--we also need to target solo workstations, which is the real challenge with any of these systems. If things go haywire, we can just restart processing later.


You might want to check out Kafka Streams.


Does anyone have a good comparison of Flink and Spark, especially from a use case perspective?

Most I have found are light on actual points of contrast.


The benchmarking study done by Yahoo is fairly comprehensive and quantitatively assesses the different streaming frameworks (including Flink and SS): https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...

We also did a podcast about it if you're interested in digging deeper -- http://softwareengineeringdaily.com/2016/02/03/benchmarking-...


Thanks for sharing that. I was recently looking at a different comparison of the _batch_ capabilities of Flink and SS [1], which found that Flink was faster at terasort than SS. I'm curious why SS can apparently get higher throughput than Flink in the streaming case, but lower in the batch case.

[1] http://www.slideshare.net/ssuser6bb12d/a-comparative-perform...


I don't think SS gets higher streaming throughput than Flink. That was an assumption written in the Yahoo! streaming benchmark without an actual experiment.


Flink comes from a more database-oriented background. It grew out of a research project at TU Berlin [1].

I believe Flink tried to focus on query language and optimization, as you probably would in a database setting. In contrast, Spark is sometimes described as a batch processing system that provides a real-time experience by intelligently partitioning the work.

[1] http://stratosphere.eu/


That was the research project that many contributors came from.

Most of the streaming tech and developments in Flink are very disconnected from that and have little to do with database tech any more, actually.


The differences are mainly around batch-centric vs. streaming-centric models and execution.

Here is for example a video walkthrough by MapR: https://www.mapr.com/blog/apache-spark-vs-apache-flink-white...


Another major difference is that Flink's scheduler doesn't have a notion of data locality like Spark's. If you want to use data local to the node, you have to query whatever you're storing your data in (e.g. HDFS) and filter out the items that aren't on the node.


That works a bit differently in Flink and Spark.

Inside a data flow program, the scheduler tries to schedule as local as possible.

For the inputs to a streaming program (for example Kafka partitions), there is currently no locality consideration, but the locality there changes throughout a program's lifetime anyway (brokers change leadership and rebalance).

Flink's DataSet API assigns data to tasks after scheduling, and that assignment does respect locality. The lazy assignment makes it possible to handle large numbers of small files, for example.


Hmm, I didn't know that about the DataSet API. When I looked at the scheduler's code it didn't seem to have any notion of data locality except for colocation of vertices. I'll take a look at the DataSet API's code though, thanks!


If you're interested in Apache Beam (Dataflow), then Flink seems to me the best candidate to become an open source runner. Spark 2.0 may change things though.


It seems like there's a multitude of these data analytics frameworks with slightly different features and goals that, generally speaking, do similar things. Are there significant differences between the three that maybe aren't clear from a quick overview?


The favicon is... unfortunate.

The squirrel mascot is cute, but doesn't reduce well.


Ok, but what IS it?


I always hate when announcements (especially for a 1.0 release) don't start off with "This is what this is!"

http://flink.apache.org/index.html


The infographic on the right does not show Hadoop, but the getting started guide suggests that Hadoop is a requirement.


A real-time processing engine. You can think of it as an alternative to Storm or Spark Streaming.


> High throughput and low latency stream processing

It should be illegal to claim nonsense like that without some comparative example vs. a C++ ZeroMQ/TCP pipeline doing the same thing.




