I do feel that calling it "Exactly-once delivery with Kafka" is slightly misleading, as it requires the applications to be written in a certain way. The title makes it sound too general and borders on claiming something that is close to impossible. I don't want to be too critical here, as the author was very honest about what this means in the blog post. Regardless of the title, this is an amazing feature.
Indeed. Idempotent operations are the mainstay of managing "at least once" message delivery. If you had a true "exactly once" system, idempotency would be unnecessary.
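To make the point concrete, here is a minimal sketch of an idempotent message handler under at-least-once delivery. All names (`apply_message`, the message shape, the dedupe set) are illustrative, not any Kafka API:

```python
# Sketch: an idempotent handler, so redelivered messages are harmless.
# Hypothetical names and message format -- not a Kafka API.

def apply_message(state, processed_ids, msg):
    """Apply msg to state; reprocessing the same msg id is a no-op."""
    if msg["id"] in processed_ids:
        return  # duplicate from an at-least-once redelivery -- ignore
    state[msg["key"]] = state.get(msg["key"], 0) + msg["amount"]
    processed_ids.add(msg["id"])

state, seen = {}, set()
m = {"id": "tx-1", "key": "alice", "amount": 5}
apply_message(state, seen, m)
apply_message(state, seen, m)  # the broker redelivered it
print(state["alice"])  # 5, not 10
```

Without the dedupe check, the redelivery would double-count, which is exactly the failure mode at-least-once forces you to design around.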
> Regardless of the title this is an amazing feature.
I feel it's a good iterative step forward, but nothing near the "revolutionary" that's being boasted.
I would not venture as far as to say this is impossible. If one day we are able to leverage the quantum entanglement phenomenon in telecommunications networks, this would certainly be very much possible.
Given a "perfect network" I think exactly-once-delivery is possible, right?
And disks that fail, and memory that bit-flips, and communication methods that are susceptible to random noise, and machines dependent on fallible power sources.
However, it's important to note that this can only provide you with exactly-once semantics provided that the state/result/output of your consumer is itself stored in Kafka (as is the case with Kafka Streams).
Once you have a consumer that, for example, makes non-idempotent updates to a database, there's the potential for duplication if the consumer exits after updating the database but before committing the Kafka offsets. Alternatively, it can lead to "message loss" if you use transactions on the database and the application exits after the offsets were committed but before the database transaction was committed.
The traditional solution to this problem is to provide your own offset storage for this consumer within the database itself, and update these offsets in the same transaction as the database modifications. However, I'm not certain that, combined with the new Producer API, this would provide exactly-once semantics.
Even if it doesn't, it's still a huge improvement, and significantly reduces the situations under which duplicate messages can be received.
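The "own offset storage in the database" pattern described above can be sketched roughly as follows, using SQLite in place of a real database and a plain list of `(offset, record)` pairs in place of a real KafkaConsumer poll. All table and function names are illustrative:

```python
# Sketch of the pattern: store consumer offsets in the SAME database
# transaction as the business updates, so both commit or roll back
# together. SQLite stands in for the real database; a plain list
# stands in for messages polled from Kafka. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (k TEXT PRIMARY KEY, v INTEGER)")
conn.execute("CREATE TABLE offsets (tp TEXT PRIMARY KEY, next_offset INTEGER)")

def process_batch(conn, tp, messages):
    # "with conn" wraps everything in one transaction: if the process
    # dies mid-batch, neither the results nor the offset bump survive,
    # so replaying the batch from Kafka is safe.
    with conn:
        for offset, (k, v) in messages:
            conn.execute("INSERT OR REPLACE INTO results (k, v) VALUES (?, ?)",
                         (k, v))
        last_offset = messages[-1][0]
        conn.execute("INSERT OR REPLACE INTO offsets (tp, next_offset) "
                     "VALUES (?, ?)", (tp, last_offset + 1))

process_batch(conn, "topic-0", [(0, ("a", 1)), (1, ("b", 2))])

# On restart, read the stored offset and seek() the consumer there,
# instead of trusting the broker-committed offsets.
row = conn.execute("SELECT next_offset FROM offsets WHERE tp = ?",
                   ("topic-0",)).fetchone()
print(row[0])  # 2
```

Note the writes are also idempotent (`INSERT OR REPLACE`), so even a replay of an already-committed batch leaves the table in the same state, which is what makes crash-and-rewind recovery safe here.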
1. State diagram: What messages are exchanged between producer and consumer (and how many round trips does it take to confirm a message)?
2. At what point does each side consider the message to have been delivered?
3. Has this been tested empirically? (i.e., set up a producer and consumer, then partition/kill each side randomly to see if messages get lost)
The one thing I don't understand is the following. The two parties communicate by message passing. At some point the message will transition to a new state (i.e., delivered). That transition cannot happen on both sides at the same time. So how do you handle the failure of sending the last message? Do you stage messages until the timeout period has passed?
3. Yes, it has been tested empirically. Quoting from the article:
> We wrote over 15,000 LOC of tests, including distributed
> tests running with real failures under load and ran them
> every night for several weeks looking for problems. This
> uncovered all manner of issues, ranging from basic
> coding mistakes to esoteric NTP synchronization issues
> in our test harness. A subset of these were distributed
> chaos tests, where we bring up a full Kafka cluster with
> multiple transactional clients, produce message
> transactionally, read these messages concurrently, and
> hard kill clients and servers during the process to
> ensure that data is neither lost nor duplicated.
This has been under development for a long time in the open; you can track the feature's history here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+E...
Once Flink (and other systems) support this, I think it will really be a game changer. It will allow things like sending an increment downstream rather than needing to keep all that state. Instead of mostly seeing stream processing as something we do for analytics use cases, it can really become the backbone for streaming applications.
Event Sourcing/CQRS has been an idea for a while, but in practice it is difficult to do because of the inherent difficulty in dealing with message semantics and consistency (see https://data-artisans.com/blog/drivetribe-cqrs-apache-flink for a good write-up about such an app). The ability to independently optimize for both reads and writes, while not having to make all messages idempotent, will in my mind make this feasible for a broad range of teams, without requiring a huge amount of work thinking up clever message formats or working through as many failure scenarios.
That being said, I fully expect this to bite hard when the caveats aren't understood (such as when interacting with external databases), and there are still other hard problems, like creating a consistent downstream view of an app in terms of business events rather than hooking into a database transaction log.
Still though, I think these are the sort of solutions needed to make distributed systems easier, even though there are lots of caveats :)
This has already caught quite a few bugs, including one that led to a change to the replication protocol which is included in this 0.11.0.0 release. See the community KIP here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+...
One cool thing that happened recently with these tests is that they were modified to make the client implementation pluggable: https://github.com/apache/kafka/pull/2048 Confluent uses this functionality to test all of its clients (librdkafka, confluent-kafka-python, confluent-kafka-go, confluent-kafka-dotnet) in addition to the Java clients. This not only makes us confident of these clients from their first release, but has also found dozens of bugs in both the clients and the broker implementation itself. Getting automated testing across many clients has really stepped up the quality and robustness of both existing and new features.
If you're interested, the tests themselves are here: https://github.com/apache/kafka/tree/trunk/tests
There is a bold claim of 'solving a problem unsolved for so many years' from Confluent CTO Neha Narkhede, quoted below from the above article.
“It’s kind of insane. It’s been an open problem for so many years and Kafka has solved it — but how do we know it actually works?” she asked, echoing the doubts of the community.
This is solved only in the consume-transform-produce scenario, addressing Kafka Streams' exactly-once requirements. That is a nice feature for Kafka Streams, but it is better to avoid tall claims like the above.
Actually, the blog post also starts with major claims, but later paragraphs make clear how it is achieved, through de-duplicating messages in the broker, and what the limitations on the consumer side are. The title of the blog post and the comments in the TechCrunch piece do not seem to be in good spirit.
The idempotent producer does not give any API for an exactly-once producer that fetches messages from source systems and sends them to topics.
I like Kafka and the design choices taken to keep things simple for users, but all these tall claims should have been avoided.
> How does this feature work? Under the covers it works in a way similar to TCP; each batch of messages sent to Kafka will contain a sequence number which the broker will use to dedupe any duplicate send.
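A toy model of that dedupe scheme, keyed by producer id and batch sequence number. This is only an illustration of the idea in the quote, not Kafka's actual broker implementation:

```python
# Toy model of per-producer sequence-number dedupe, loosely analogous
# to TCP sequence numbers. Illustrative only -- not Kafka's real code.

class Broker:
    def __init__(self):
        self.log = []
        self.next_seq = {}  # producer_id -> next expected sequence

    def append(self, producer_id, seq, batch):
        expected = self.next_seq.get(producer_id, 0)
        if seq < expected:
            return "duplicate"        # retry of an already-written batch
        if seq > expected:
            return "out_of_sequence"  # an earlier batch went missing
        self.log.extend(batch)
        self.next_seq[producer_id] = seq + 1
        return "ok"

b = Broker()
print(b.append("p1", 0, ["m1"]))  # ok
print(b.append("p1", 0, ["m1"]))  # duplicate (producer retried on timeout)
print(b.append("p1", 1, ["m2"]))  # ok
print(len(b.log))                 # 2 -- the retry was not appended twice
```

The key property: a producer that never hears an ack can safely resend the same batch with the same sequence number, and the log still ends up with exactly one copy.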
But indeed, it is interesting to see the computing cycle reimplementations spinning over and over again.
1. Please read the section entitled "Is this Magical Pixie Dust I can sprinkle on my app?" before making angry comments. Answer: for general consumer apps just consuming messages, no. However, Kafka's design, which lets the consumer control its position in the log, combined with this feature, which eliminates duplicates in the log, makes building end-to-end exactly-once messaging using the consumer quite approachable. For stream processing using Kafka's Streams API (https://www.confluent.io/blog/introducing-kafka-streams-stre...), where you are taking input streams, maintaining state, and producing output streams, the answer is that it actually kind of is like magic pixie dust: you change a single config and get exactly-once processing semantics on arbitrary code. Obviously you still need to get the resulting data out of Kafka, but when combined with any off-the-shelf Kafka connector which maintains exactly-once semantics you can get this for free. So for that style of app design you actually can get correct results end-to-end without needing to do any of the hard stuff.
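For reference, the "single config" being described is the Kafka Streams `processing.guarantee` setting introduced in 0.11 (its default is `at_least_once`):

```properties
# Kafka Streams: switch processing semantics with one setting
processing.guarantee=exactly_once
```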
2. Someone is going to come and say "FLP means that exactly once messaging is impossible!" or something else from a half-understood tidbit of distributed systems theory they picked up on a blog. Let me preempt that. FLP is about the impossibility of consensus in a fully asynchronous setting (e.g. no timeout-based failure detection). Of course as you know the vast majority of the systems you use in AWS or your own datacenter depend in deep ways on consensus. Kafka itself is a CP log, about as close of a direct analog to consensus as you could ask for. Obviously Kafka and all these systems are "impossible" in the same sense that if you can make the network or other latency issue bad enough you can make the system unavailable for writes. This feature doesn't change that at all, it just piggybacks on the existing consensus Kafka does. It doesn't violate any theorems in distributed systems theory: Kafka and any consensus-based system can't work in a fully asynchronous setting, Kafka was a CP system in the CAP sense prior to this feature and this feature doesn't change that guarantee.
For those who want a deeper dive into how it all works there is a longer write up on the design here:
The use for stream processing is described here:
The only serious objection to the idea of exactly-once delivery is that in some models of computation, you can't even solve it on a single node.
It may be impossible to process a message and record the fact that this processing has been done in a single atomic step, making any node failure between those steps unrecoverable without reprocessing.
This is exactly the objection raised in the referenced article "You Cannot Have Exactly-Once Delivery".
Though I don't think it is a particularly illuminating observation, especially since it doesn't have anything to do with the distributed nature of the system.
I write as one of the implementors of exactly once guarantees in Kafka.