> High throughput and low latency stream processing with exactly-once guarantees.
Precisely how is this possible in a real-world distributed system? At-least-once and at-most-once are possible with different distribution methods, but "exactly-once" is, well, I have yet to see it actually implemented (even theoretically, see the Two Generals thought experiment). Even their checkpointing algorithm has an assumption that the underlying transport is "exactly once" and "FIFO ordered".
If I had to guess, I'd say that what they're really guaranteeing is "at most once" message delivery. Unfortunately, that has a very different meaning from "exactly once".
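To make the FIFO assumption mentioned above concrete, here is a rough, heavily simplified sketch of barrier-style alignment (this is not Flink's actual code; every name in it is made up). FIFO ordering is what lets a barrier cleanly split each channel's records into "belongs to this checkpoint" and "belongs to the next one":

    from collections import deque

    BARRIER = object()  # marker a source injects for a given checkpoint

    class AligningOperator:
        def __init__(self, num_inputs):
            self.num_inputs = num_inputs
            self.blocked = set()      # channels that already delivered the barrier
            self.buffered = deque()   # records held back on blocked channels
            self.state = 0            # e.g. a running count

        def on_record(self, channel, record):
            if record is BARRIER:
                self.blocked.add(channel)
                if len(self.blocked) == self.num_inputs:
                    snapshot = self.state            # checkpoint the operator state
                    print("snapshot taken:", snapshot)
                    # forward the barrier downstream, then unblock and
                    # apply the records that arrived after the barrier
                    self.blocked.clear()
                    while self.buffered:
                        self._process(self.buffered.popleft())
                return
            if channel in self.blocked:
                self.buffered.append(record)         # belongs to the next checkpoint
            else:
                self._process(record)

        def _process(self, record):
            self.state += record

    op = AligningOperator(num_inputs=2)
    op.on_record(0, 1); op.on_record(1, 2)
    op.on_record(0, BARRIER)   # channel 0 is now blocked
    op.on_record(0, 5)         # buffered until alignment completes
    op.on_record(1, BARRIER)   # snapshot taken: 3, then the buffered 5 is applied

If a channel could reorder records around the barrier, the snapshot would no longer correspond to a consistent point in the stream, which is exactly why the transport assumptions matter.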
Rolling back changes means that at some point in time, you have "more than once". It also means you can only ever implement logic which can be rolled back. In a pipeline scenario, that is challenging to do without some sort of external locking system.
True. In most streaming systems, exactly-once refers to the state that is maintained and controlled by the streaming system. For that, Flink can guarantee such semantics. That state can also be exposed externally; see for example: http://data-artisans.com/extending-the-yahoo-streaming-bench...
For interactions (writes) with external systems, you need transactional integration. There is work in progress to do this for transactional databases; for non-transactional systems, there are cases where it cannot be strictly solved.
Physical at-least-once is easy to transform into a logical exactly-once. Just assign a unique id to every message at the source and discard duplicates at the target.
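A minimal sketch of that idea, assuming the source attaches a stable unique id to every message (the names here are made up for illustration):

    # Turn physical at-least-once into logical exactly-once by de-duplicating
    # at the target. In practice the 'seen' set has to be persisted/checkpointed
    # together with the sink's output.
    def deduplicating_sink(messages, apply):
        seen = set()
        for msg_id, payload in messages:
            if msg_id in seen:
                continue          # duplicate caused by an upstream retry/replay
            apply(payload)        # side effect happens once per id
            seen.add(msg_id)

    results = []
    deduplicating_sink([(1, "a"), (2, "b"), (1, "a")], results.append)
    print(results)  # ['a', 'b'] -- the replayed message is dropped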
You're describing "at most once". The key is that a message could be lost between components, and there's nothing here which accounts for that.
EDIT: To clarify a bit more, I'm looking at "at most once" as once throughout the entire system, not once per node. Combining a physical at-least-once with a logical "at-most-once" only works at the per-logical-node level, not at the system level. Some processing will occur more than once in the system, which could have negative consequences if that processing is not strictly idempotent.
Using an acking system (like that in Storm or Heron) would move it to be physically at-least-once with a logical exactly-once by de-duping at the target.
This means if you observe the inner system, you won't be seeing exactly-once semantics but looking at the system from the outside, the semantics would indeed be exactly-once at the target.
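A toy sketch of the sending half of that (an ack/retry loop in the spirit of acking, not Storm's or Heron's actual API; everything here is invented for illustration). Retries create duplicates on the wire, and the de-duplicating target collapses them again:

    def send_with_acks(messages, transmit, max_attempts=5):
        # Resend each message until the receiver acks it: at-least-once on the wire.
        for msg_id, payload in messages:
            for _ in range(max_attempts):
                if transmit(msg_id, payload):   # transmit returns True once acked
                    break
            else:
                raise RuntimeError("message %s never acked" % msg_id)

    received = []
    acks_to_drop = {1: 1}   # pretend the first ack for message 1 gets lost

    def flaky_receiver(msg_id, payload):
        received.append((msg_id, payload))      # duplicates show up across retries
        if acks_to_drop.get(msg_id, 0) > 0:
            acks_to_drop[msg_id] -= 1
            return False                        # ack "lost", sender will retry
        return True

    send_with_acks([(1, "a"), (2, "b")], flaky_receiver)
    print(received)  # [(1, 'a'), (1, 'a'), (2, 'b')] -- the de-duping sink collapses this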
Flink guarantees exactly-once application state access/updates. There is no mention of exactly-once delivery anywhere and it is pretty much impossible to offer this for just any general sink. That does not exclude the possibility of offering a special-purpose transactional sink implementation in the future though (e.g. via external distributed locking/state reconciliation), just saying...
What they don't tell you is that in order to achieve "exactly once" delivery you need idempotent writes, for example inserting into a database using primary keys.
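A sketch of that, using SQLite and a made-up table, where the primary key turns a replayed insert into a no-op:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

    def write_event(event_id, payload):
        # INSERT OR IGNORE makes the write idempotent: a replayed event changes nothing.
        conn.execute("INSERT OR IGNORE INTO events VALUES (?, ?)", (event_id, payload))
        conn.commit()

    write_event("evt-1", "hello")
    write_event("evt-1", "hello")   # duplicate delivery from a replay
    print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1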
Exactly-once in a distributed system doesn't exist, and we need to get past that. It can be approximated with replay (at-least-once) and de-dupe/idempotency.
Trying to ensure true exactly-once is a fool's errand. In a distributed system it requires guarantees (down to the magnetic material level) that are very hard to get right. If any of those guarantees fail, you don't have it. What you have is a close approximation.
Real world applications usually involve a lot more than counting words in memory.
At-least-once is relatively easy. Combine it with idempotent operations and your work is done.
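The catch is deciding which operations actually are idempotent under replay. A tiny made-up illustration: a blind increment is not safe, while a write keyed by a deterministic event id is:

    # Under at-least-once, any event may be processed more than once.
    events = [("evt-1", 1), ("evt-2", 1), ("evt-1", 1)]   # evt-1 is replayed

    # Non-idempotent: the replay inflates the result.
    count = 0
    for _, n in events:
        count += n
    print(count)                     # 3, but the true count is 2

    # Idempotent: keyed assignment, the replay just overwrites the same value.
    per_event = {}
    for event_id, n in events:
        per_event[event_id] = n
    print(sum(per_event.values()))   # 2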
Storm is fairly explicit in documenting what it takes to achieve this, but it's not trivial, and every system involved has to support certain guarantees.
Spark (streaming) made some pretty big claims about exactly-once guarantees, but it turned out that claim was riddled with holes.
In my opinion, "exactly-once" doesn't imply that there are exceptions to that rule.
Guard against dupes and you'll be fine (easier said than done, obviously), but also know the limitations of the systems and frameworks you are working with.
First, it is crucial to distinguish between "exactly-once" semantics with respect to state inside the stream processor (for example an aggregate computed in a window) and exactly-once delivery to external systems. The former is built into Flink; the latter is only possible in some cases (transactional systems) and requires extra effort.
Exactly-once for state inside the stream processor is incredibly useful, because it allows you to implement many non-idempotent operations such that the writes to external systems are idempotent: For example, you compute the complex aggregate in the stream processor and only periodically write the result to the external system (overwriting previous values). Now the external system always reflects an aggregate without duplicates.
That is very valuable and only possible if inside the stream processor, you have exactly-once semantics for state. That does imply that the stream processor has a notion of managed state (in Flink for example the Windows, key/value state, and generic checkpointed state).
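A minimal sketch of that pattern in plain Python (made-up names, not the Flink API): the non-idempotent counting lives in processor state, and the external write simply overwrites the aggregate for each (key, window), so repeating the write after a recovery is harmless:

    import sqlite3
    from collections import defaultdict

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE counts (key TEXT, window_start INTEGER, value INTEGER, "
               "PRIMARY KEY (key, window_start))")

    WINDOW = 60  # seconds
    state = defaultdict(int)   # (key, window_start) -> count, owned by the stream processor

    def on_event(key, timestamp):
        window_start = timestamp - (timestamp % WINDOW)
        state[(key, window_start)] += 1   # non-idempotent, but covered by exactly-once state

    def flush():
        # Idempotent external write: overwrite the aggregate per (key, window).
        for (key, window_start), value in state.items():
            db.execute("INSERT OR REPLACE INTO counts VALUES (?, ?, ?)",
                       (key, window_start, value))
        db.commit()

    for ts in (0, 10, 70):
        on_event("page-a", ts)
    flush()
    flush()   # re-running the flush after a recovery leaves the table unchanged
    print(db.execute("SELECT * FROM counts ORDER BY window_start").fetchall())
    # [('page-a', 0, 2), ('page-a', 60, 1)]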
Exactly-once delivery really depends on where and how you actually "deliver". Even if your writes are idempotent, you need to know whether they have been properly committed on the other side, which is not trivial. If the system you are committing your output to offers version control and/or proper transactional support, then delivery can eventually be reconciled; otherwise, with no such assumptions, it is pretty much impossible.
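When the target does offer versioning, one way reconciliation can look (a sketch with an invented schema, plain optimistic concurrency rather than any particular product's API): make the write conditional on the version you last saw, so a retried write is either recognized as already applied or rejected, never applied twice:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sink (key TEXT PRIMARY KEY, value TEXT, version INTEGER)")
    db.execute("INSERT INTO sink VALUES ('k', 'old', 1)")

    def conditional_write(key, value, expected_version):
        # Only apply if nothing (including an earlier retry of this same write)
        # has bumped the version in the meantime.
        cur = db.execute(
            "UPDATE sink SET value = ?, version = version + 1 "
            "WHERE key = ? AND version = ?",
            (value, key, expected_version))
        db.commit()
        return cur.rowcount == 1   # False: already applied or a conflicting writer won

    print(conditional_write("k", "new", expected_version=1))   # True, applied
    print(conditional_write("k", "new", expected_version=1))   # False, retry rejected
    print(db.execute("SELECT * FROM sink").fetchall())          # [('k', 'new', 2)]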
Apache Flink's snapshotting algorithm solely guarantees exactly-once application state access, plain and simple.
For our current project we evaluated Spark and Storm, and have gone the Storm way. Personally I spent a little time on Flink, but could not convince management to give it more serious consideration, mainly because of name recognition: Spark and Storm are so popular already. But I am confident Flink will become a great option, especially with it hitting 1.0. I felt the community was very active and the documentation pretty good.
I don't think it's going to be easy to move from Storm to Flink. They are very different systems underneath. Any substantially complex application is going to end up having compatibility problems.
My bets are on Apache Beam (Google Dataflow) to create that compatibility layer. Haven't used it yet, hoping it works out though!
I love Spark, it has great documentation and Scala! But how is Storm dead?! The documentation is not good enough yet, but there is a strong user base and it is actively used at a lot of companies.
As luck would have it, I'm going to write a POC next month with Spark Streaming and Flink. Should I also throw Storm into the mix? What do you think about Flink?
I generally can get by with Batch, but streaming makes a lot of our logic easier, and might allow us to handle some new situations. Flink looks pretty good on paper.
Flink looks really interesting to me. If it can get community momentum I might switch to it.
I'd definitely recommend including Storm for a POC. For any POC I'd warn that most of these streaming solutions are operationally a bit of a PITA, which tends to distort POC results.
We will be at half-rack to single-rack scale for a while, so I'm not too worried about operations. We also need to target solo workstations, which is the real challenge with any of these systems. If things go haywire, we can just restart processing later.
Thanks for sharing that. I recently looked at a different comparison of the _batch_ capabilities of Flink and SS [1], which found that Flink was faster at terasort than SS. I'm curious to understand why SS apparently gets higher throughput than Flink in the streaming case, but less in the batch case.
I don't think SS gets higher streaming throughput than Flink. That was an assumption written in the Yahoo! streaming benchmark without an actual experiment.
Flink comes from a more database-oriented background. It grew out of a research project at TU Berlin [1].
I believe Flink tried to focus on the query language and optimization, as you probably would in a database setting. In contrast, Spark is sometimes described as a batch processing system that provides a real-time experience by intelligently partitioning the work.
Another major difference is that Flink's scheduler doesn't have a notion of data locality like Spark's. If you want to use data local to a node, you have to query whatever you're storing your data in (HDFS) and filter out the items that aren't on that node.
Inside a dataflow program, the scheduler tries to schedule tasks as locally as possible.
For the inputs to a streaming program (for example Kafka partitions), there is currently no locality consideration, but the locality there changes throughout a program's lifetime anyway (brokers change leadership and rebalance).
Flink's DataSet API does assign data to tasks after the scheduling. That assignment respects locality, actually. That lazy assignment makes it possible to handle large numbers of small files, for example.
Hmm, I didn't know that about the DataSet API. When I looked at the scheduler's code it didn't seem to have any notion of data locality except for colocation of vertices. I'll take a look at the DataSet API's code though, thanks!
If you're interested in Apache Beam (Dataflow), then Flink seems to me the best candidate to become an open source runner. Spark 2.0 may change things though.
It seems like there's a multitude of these data analytics frameworks with slightly different features and goals that, generally speaking, do similar things. Are there significant differences between the three that maybe aren't as clear from a quick overview?
On why exactly-once delivery is impossible in general: http://bravenewgeek.com/you-cannot-have-exactly-once-deliver...