Does this just mean (in TL;DR form) that Kafka producers now generate an ID for each message they send, and the broker deduplicates instead of requiring deduplication by the end consumer?
That is part of it, but there is also a general-purpose transaction feature that lets you tie together updates, state journalling, and consumption in a single transaction. This is what enables correct stream processing on top of Kafka, and it is arguably the more technically sophisticated aspect.
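For anyone curious what that looks like in practice, here is a minimal sketch of a consume-transform-produce loop using the Java client's transactional API (broker address, topic names, ids, and the transform step are placeholders; exact method overloads vary a bit across client versions):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.*;

public class TransformLoop {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("transactional.id", "transform-loop-1"); // stable id so zombie producers get fenced
        pp.put("key.serializer", StringSerializer.class.getName());
        pp.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
        producer.initTransactions();

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "transform-loop");
        cp.put("enable.auto.commit", "false");       // offsets are committed inside the transaction
        cp.put("isolation.level", "read_committed"); // downstream readers never see aborted data
        cp.put("key.deserializer", StringDeserializer.class.getName());
        cp.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
        consumer.subscribe(Collections.singletonList("input"));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("output", r.key(), r.value().toUpperCase()));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                }
                // The consumed offsets ride in the same transaction as the outputs.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // outputs and offset commits roll back together
                // a real app would also rewind the consumer to the last committed offsets here
            }
        }
    }
}
```

The point is that the output records and the offset commit either both become visible or neither does; a crash between poll() and commitTransaction() reprocesses the batch without duplicating output.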
This comment in the TechCrunch article is misleading, and you may want to clarify it. I agree that this is a nice feature for Kafka Streams or for applications with a consume-transform-produce loop. As a user/dev, I like Kafka because of the design choices that keep things simple and effective for users.
“It’s kind of insane. It’s been an open problem for so many years and Kafka has solved it — but how do we know it actually works?” she asked, echoing the doubts of the community.
" How does this feature work? Under the covers it works in a way similar to TCP; each batch of messages sent to Kafka will contain a sequence number which the broker will use to dedupe any duplicate send. "
This is basically it. I'm always amazed at how much reimplementation of TCP we see at a high level in distributed systems. Backpressure, message ordering, retries, etc. all work pretty well in TCP.
Yes, the idempotence part of this feature set is very similar to TCP (the transactional consumption and updates obviously aren't). But this isn't a reimplementation at all: TCP provides deduplication only within the context of a connection tied to a process. If that connection is lost or the process dies, duplicates may occur. The feature in Kafka is much stronger, as the "connection" is persistent and replicated with the log, so the "connection" effectively fails over if the server dies.
These extra layers of abstraction are probably needed for some use cases on specialized, higher-performance networks. Otherwise, simply using TCP would be enough.
But indeed, it is interesting to watch computing's cycle of reimplementation spin over and over again.
Re: TCP, it only maintains sequencing guarantees over the lifetime of a single connection. That is obviously too weak a guarantee for Kafka, since leaders can change in a cluster. We've built idempotence in a way that makes the sequence number and producer ID part of the Kafka log itself, so it can provide idempotence even when brokers fail and new connections are established between the producer and the broker.
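To make the failover point concrete, here is a toy model (my own sketch, not Kafka's actual broker code, which also enforces strict sequence ordering) of why storing the producer ID and sequence number in the replicated log lets deduplication survive a leader change:

```java
import java.util.*;

public class DedupLogSketch {
    // Each log entry carries the producer id and sequence, so the dedup
    // state travels with the replicated log rather than with a TCP socket.
    record Entry(long producerId, int sequence, String payload) {}

    final List<Entry> log = new ArrayList<>();          // stands in for the replicated log
    final Map<Long, Integer> lastSeq = new HashMap<>(); // in-memory dedup state per producer

    /** Append unless this sequence from this producer was already written. */
    boolean append(long producerId, int sequence, String payload) {
        Integer last = lastSeq.get(producerId);
        if (last != null && sequence <= last) return false; // duplicate retry: drop it
        log.add(new Entry(producerId, sequence, payload));
        lastSeq.put(producerId, sequence);
        return true;
    }

    /** A newly elected leader rebuilds dedup state by replaying the log. */
    static DedupLogSketch failover(List<Entry> replicatedLog) {
        DedupLogSketch next = new DedupLogSketch();
        for (Entry e : replicatedLog) next.append(e.producerId(), e.sequence(), e.payload());
        return next;
    }

    public static void main(String[] args) {
        DedupLogSketch leader = new DedupLogSketch();
        leader.append(42L, 0, "a");
        leader.append(42L, 1, "b");
        System.out.println(leader.append(42L, 1, "b"));    // false: duplicate dropped
        DedupLogSketch newLeader = failover(leader.log);   // the "connection" fails over
        System.out.println(newLeader.append(42L, 1, "b")); // still false on the new leader
    }
}
```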