
Exactly-once or not, atomic broadcast is still impossible in Kafka - 68c12c16
http://the-paper-trail.org/blog/exactly-not-atomic-broadcast-still-impossible-kafka/
======
1_2__4
This kind of fairytale claim from Kreps does real harm in the world. I spend a
not-insignificant amount of time in my job trying to convince junior
developers that no, there is no magic fairy dust that lets them stop having to
think about CAP theorems, eventual consistency, distributed consensus, split
brain failures, and so on. So when Confluent (and similar) come along with not
just wild claims but demonstrably false ones, given the limits imposed by
reality, it makes my job a million times harder (and the developers' code a
million times worse).

On the other hand when companies and maintainers are honest about the
limitations it allows for a much easier conversation about how those
limitations affect our products and how to trade off the mitigation approach
and risk.

When the vendor claims there's no need for developers to use their brains
they're all too eager to believe it.

~~~
sidlls
I'd love to have only the problem of junior developers to pass experience to.
I currently have to deal with "senior" developers and architects who
desperately want to put some "Big Data" stuff on their resumes and constantly
chase after things like Kafka even when it's not justified.

~~~
closeparen
Compared to what?

I've often heard wisdom like "it's irresponsible to use Kafka when you could
just use RabbitMQ" but why does that statement not work equally well in the
opposite direction?

It seems like some "you don't need this fad technology" sentiments don't
provide a whole lot of justification for sticking with the unfashionable
choice. Does it all come down to "the devil you know?" What if you aren't that
experienced with either choice?

~~~
sidlls
> I've often heard wisdom like "it's irresponsible to use Kafka when you could
> just use RabbitMQ" but why does that statement not work equally well in the
> opposite direction?

It could, in some circumstances. Generally, though, the two have only a very
small overlap with respect to messaging use cases. One was designed to be
general purpose and solve a variety of MQ use cases and serve even "dumb"
clients. One was designed very specifically to impose a high burden of state
tracking on clients for a very specific paradigm.

> It seems like some "you don't need this fad technology" sentiments don't
> provide a whole lot of justification for sticking with the unfashionable
> choice.

The "unfashionable choice" that works doesn't need justification. The new
shiny one does.

~~~
closeparen
>The "unfashionable choice" that works doesn't need justification

Why? That's what I'm asking.

~~~
suchire
The older choice usually by default has:

\- More people who know how to operate it

\- A greater portion of its failure modes that are known, with workarounds

\- Mature tooling and libraries

New technologies need a compelling reason that their pros will outweigh the
above incumbent benefits.

------
LgWoodenBadger
It still annoys me that a company/product that impresses me so much
(Confluent/Kafka) has doubled down so hard and so many times on their "Exactly
Once" ridiculousness.

Admit your mistake and clarify you really mean "idempotency" or "effectively
once," and only if you stay completely within the bounds of Kafka, and move on.

It's becoming a bit of a joke having to combat their fairy dust in my
profession.

~~~
doug1001
until this thread, i had not actually heard the term "effectively once";
what's more, i've always believed that "idempotent" means "at most once."

how do the two guarantees, "effectively once" and "at most once" differ?

~~~
LgWoodenBadger
"At most once" has the implication of "you may never get it" (if the one and
only attempt failed anywhere along the way)

"Effectively once" implies "you got the same thing 1000 times but ignored 999
of them because you already got it."

With an "exactly once" guarantee, I could send you a stream of integers
(1,2,3,4,5) and you could blindly/naively/simplistically add them with no
special concern and be confident that your answer of 15 was correct.

With "effectively once," you'd have to keep track of what you've seen before
so you know not to add 4 an extra 6 times and come up with the wrong sum of
39.

With "at most once" you may be sitting around with a sum of 0 and think that's
correct.
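
A minimal sketch of the difference, assuming each message carries a unique id
(the `(msg_id, value)` pairs and both function names are made up for
illustration, nothing here is Kafka's API):

```python
def naive_sum(deliveries):
    """Add every delivered value blindly; only safe under true exactly-once."""
    return sum(value for _msg_id, value in deliveries)


def dedup_sum(deliveries):
    """Effectively-once: remember ids already seen and ignore repeats."""
    seen = set()
    total = 0
    for msg_id, value in deliveries:
        if msg_id in seen:
            continue  # duplicate delivery, already counted
        seen.add(msg_id)
        total += value
    return total


# The stream 1..5, with the id equal to the value for simplicity.
sent = [(i, i) for i in range(1, 6)]

# At-least-once transport: message 4 shows up an extra 6 times.
redelivered = sent + [(4, 4)] * 6

# At-most-once transport in the worst case: nothing arrives at all.
lost = []

print(naive_sum(sent))         # 15
print(naive_sum(redelivered))  # 39
print(dedup_sum(redelivered))  # 15
print(naive_sum(lost))         # 0
```

The `seen` set is the "keep track of what you've seen before" burden: it lives
in the consumer, not in the transport.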

------
nl
I'm not sure I'm following this argument.

Surely Kafka isn't claiming "exactly-once" semantics AND availability? I
thought the claim was that they will do exactly-once, or none-at-all in the
event of an outage (until the outage is cleared, and then you'll get the
messages you were waiting for)

That seems solvable by consensus - indeed, it's the equivalent of what
Zookeeper offers.

What am I missing here?

~~~
di4na
The fact that this is not exactly-once for a couple reasons.

~~~
nl
That's not super helpful, and not completely convincing.

------
wuch
There are perfectly valid reasons to claim that exactly-once delivery is
impossible. The FLP impossibility result is not one of them. In fact, in the
FLP model the solution is trivial: just send the message exactly once and it
will eventually be delivered.

* What about network failures? In the FLP model the network is reliable, so there are no network failures.

* What about node failures? In the FLP model node failures are permanent, so there is nothing illuminating in saying that you cannot deliver a message to a node that is permanently offline.

* What if node failures were transient? If the network is still reliable and state transitions are atomic, then failures are completely unobservable.

* What if state transitions are not atomic, and you cannot process a message and record that it has been processed in a single step? That would mean that exactly-once delivery is impossible even within a single node, and has nothing to do with the distributed nature of the computation.
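
The last case can be sketched on a single node. This is a toy model, not any
real system: `Node`, `Crash`, and the `crash_between` flag are hypothetical
names, and plain attributes stand in for durable state.

```python
class Crash(Exception):
    """Simulated process failure."""


class Node:
    def __init__(self):
        self.total = 0     # effect of processing (assumed to survive a crash)
        self.done = set()  # record of which message ids were processed

    def handle(self, msg_id, value, crash_between=False):
        if msg_id in self.done:
            return             # recorded as processed, skip the redelivery
        self.total += value    # step 1: process the message
        if crash_between:
            raise Crash()      # fail after processing, before recording
        self.done.add(msg_id)  # step 2: record that it was processed


node = Node()
try:
    node.handle("m1", 5, crash_between=True)
except Crash:
    pass              # the node comes back with total=5 and an empty record

node.handle("m1", 5)  # the sender retries the same message
print(node.total)     # 10: the effect was applied twice
```

No network is involved: the duplicate comes purely from the gap between step 1
and step 2.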

~~~
frankmcsherry
I'm pretty sure the article claimed no such thing. In fact, the article
repeatedly says that it will take no position on whether EO is at all like AB,
but that AB is not something you can guarantee (modulo new model assumptions),
and probably not something that the Kafka folks should be saying they can do.

------
grogers
Using FLP to say that atomic broadcast is impossible is overly simplistic. In
the real world, where people do use atomic broadcast/consensus in real
distributed systems, it just isn't a problem.

FLP says it's always possible to not achieve consensus, but it says nothing
about the probability of it. In practical systems, unless you have a partition
such that no quorum of nodes can talk to each other, the probability of not
reaching consensus rather quickly is effectively 0. Such partitions are rare
in real systems (basically requires multiple data center failures or multiple
fiber cuts). You are much more likely to run into other problems that affect
your availability, like code bugs or failed isolation between some components.

~~~
frankmcsherry
> FLP says it's always possible to not achieve consensus, but it says nothing
> about the probability of it.

It says that the probability is not zero. That is a very important distinction
for some people.

> You are much more likely to run into other problems that affect your
> availability, like code bugs or failed isolation between some components.

All the more reason _not_ to claim that your system provides the guarantee.

~~~
grogers
> It says that the probability is not zero. That is a very important
> distinction for some people

I'm not a mathematician, but I'm pretty sure it doesn't even say the
probability of failed consensus must be nonzero. For example, the Cantor set
shows that it is possible to have a set with infinitely many elements, but zero
length. That is equivalent to an infinite number of interleavings where
consensus doesn't occur, but the probability of hitting any of them is zero.

Sure in real world systems, the probability would always be nonzero, but when
it's still so close to zero, it practically doesn't matter. Which is why in
the real world, people do build very reliable distributed systems out of
unreliable components.

~~~
frankmcsherry
FLP says that any deterministic algorithm satisfying agreement and validity
must have non-terminating executions. If you are allowed to pick the
distribution over the sequence of events in the system I'm sure that the
probability of non-termination is quickly zero. If FLP are allowed to pick the
distribution, the probability of non-termination can be one.

Getting into a quantitative evaluation of the likelihood of consensus is
beyond my ken, but different people have different beliefs about what
constitutes "in real systems", and different interpretations of "practically
doesn't matter". For example, the fact that you are talking about data centers
rather than satellites suggests (to me) that your beliefs about the scarcity
and transience of partitions may not generalize.

We can afford to be clear about what is true and what isn't, especially when
trying to build reliable distributed systems.

Note: Ben-Or's consensus algorithm uses randomization to drive the probability
of non-termination to zero, but still has non-zero probability associated with
any length of execution.
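
A rough way to see how randomization can give "terminates with probability
one" and "any finite length of execution is possible" at the same time. This
is a toy model, not Ben-Or's actual analysis: assume each round independently
reaches a decision with some fixed probability p, where 0 < p < 1.

```latex
\Pr[\text{undecided after } k \text{ rounds}] = (1-p)^{k} > 0
\quad \text{for every finite } k,
\qquad
\lim_{k \to \infty} (1-p)^{k} = 0 .
```

Every finite execution length keeps some probability, yet the probability of
never deciding is zero.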

