Exactly-once or not, atomic broadcast is still impossible in Kafka (the-paper-trail.org)
79 points by 68c12c16 9 months ago | 31 comments



This kind of fairytale claim from Kreps does real harm in the world. I spend a not-insignificant amount of time in my job trying to convince junior developers that no, there is no magic fairy dust that lets them stop having to think about the CAP theorem, eventual consistency, distributed consensus, split-brain failures, and so on. So when Confluent (and similar) come along with claims that are not only wild but demonstrably false, given the limits imposed by reality, it makes my job a million times harder (and the developers' code a million times worse).

On the other hand, when companies and maintainers are honest about the limitations, it allows for a much easier conversation about how those limitations affect our products and how to trade off mitigation approaches against risk.

When the vendor claims there's no need for developers to use their brains they're all too eager to believe it.


I'd love to have only the problem of junior developers to pass experience on to. I currently have to deal with "senior" developers and architects who desperately want to put some "Big Data" stuff on their resumes and constantly chase after things like Kafka even when it's not justified.


As in everything else, the portion of actually useful and competent decision-makers in tech is vanishingly small. There are plenty of fine worker bees, but they're all too easy for marketers to divert. Ironically, such diversion is normally done with promises that deploying NEW_TECH_Z (which is almost always OLD_TECH_Y dressed up in a different outfit) will instantly make users useful and competent.


Compared to what?

I've often heard wisdom like "it's irresponsible to use Kafka when you could just use RabbitMQ" but why does that statement not work equally well in the opposite direction?

It seems like some "you don't need this fad technology" sentiments don't provide a whole lot of justification for sticking with the unfashionable choice. Does it all come down to "the devil you know"? What if you aren't that experienced with either choice?


> I've often heard wisdom like "it's irresponsible to use Kafka when you could just use RabbitMQ" but why does that statement not work equally well in the opposite direction?

It could, in some circumstances. Generally, though, the two have only a very small overlap with respect to messaging use cases. One was designed to be general purpose and solve a variety of MQ use cases and serve even "dumb" clients. One was designed very specifically to impose a high burden of state tracking on clients for a very specific paradigm.

> It seems like some "you don't need this fad technology" sentiments don't provide a whole lot of justification for sticking with the unfashionable choice.

The "unfashionable choice" that works doesn't need justification. The new shiny one does.


>The "unfashionable choice" that works doesn't need justification

Why? That's what I'm asking.


The older choice usually by default has:

- More people who know how to operate it

- A greater portion of its failure modes that are known, with workarounds

- Mature tooling and libraries

New technologies need a compelling reason that their pros will outweigh the above incumbent benefits.


> On the other hand, when companies and maintainers are honest about the limitations, it allows for a much easier conversation about how those limitations affect our products and how to trade off mitigation approaches against risk.

Yeah well that's a poor sales strategy. Being honest does not pay the rent.


And being dishonest does?


Um, yes? We've redefined parameters around "honest" to include sales and marketing efforts because they are practically essential for modern life, but from a basic perspective, there are definite lies by omission and very likely some lies by commission in virtually all attempts to get people to buy stuff.

There's also a lowest common denominator effect, where the other guy's marketing makes false claim X, and if your marketing doesn't, you are at a big disadvantage. Most of the time competitors aren't dumb enough to make a claim that is so blatantly false as to be legally actionable; as long as they steer clear of that high bar, they will make a lot of gains by making a misleading claim (note: they may make a lot of gains even if they can be held legally accountable, as that is a slow and expensive process, and the expected cost/penalty may fall far short of the expected profit).


mongodb


There are two types of Mongo shops in the world:

1) Their product hasn't launched yet.

2) They desperately regret using Mongo and are trying to get rid of it.


You seriously underestimate the depth of commitment that people have to their decisions. MongoDB is undeniably a wreck (though I hear it's been getting better in recent years), but I've come across very few shops who took a look at the wreck and said "This was a mistake". Rather, they usually just say "Let's buy more consulting services, being on the cutting-edge is expensive but it's worth it!"

Everything is a mind game. People have identified their use of things like Mongo as being on the forefront of a developing technology, it makes them feel important and interesting. Try taking that meaning away from them and see how it goes for you. The practicalities of actually using the thing hardly matter.

And is the guy who initially advocated for Mongo going to show up with his tail between his legs and admit he made a mistake? Nope, even if he wants to, that would be a big hit for him career-wise and after our mid-20s most of us have been disabused of our egalitarian notions and know better than to do that.

RethinkDB is the "good engineering" counter to MongoDB. Didn't oversell, worked hard to build a world-class product that targeted the same general product class. Compare for yourself and see what you get by proceeding with an engineering emphasis.

Marketing is mandatory, and developers are naive if they believe they or their field is immune.


3) Incompetent and don't realize their turgid stack isn't working correctly.


Justified response, IMO. Their sales machine has had a significant negative impact on the industry by setting impossible expectations.


This is a core strategy of con-men.

The recent discussion between Sam Harris and Scott Adams might be interesting to you. Take into account that, to Sam Harris (a strict rationalist), moral and good are synonymous with truth, while Scott Adams is a consequentialist/utilitarian.


Evidence says "Hell yes. Obviously."


It still annoys me that a company/product that impresses me so much (Confluent/Kafka) has doubled down so hard and so many times on their "Exactly Once" ridiculousness.

Admit your mistake, clarify that you really mean "idempotency" or "effectively once" (and only if you stay completely within the bounds of Kafka), and move on.

It's becoming a bit of a joke having to combat their fairy dust in my profession.


"Effectively once" is a great way to describe how it actually works!


In the article this claims to debunk, it says:

Another objection I’ve heard to this is that it isn’t really “exactly once” but actually “effectively once”. I don’t disagree that that phrase is better (though less commonly understood), but I’d point out that we’re still debating the definitions of undefined terms!

I have to say I agree with this.


until this thread, i had not actually heard the term "effectively once"; what's more, i've always believed that "idempotent" means "at most once."

how do the two guarantees, "effectively once" and "at most once" differ?


"At most once" has the implication of "you may never get it" (if the one and only attempt failed anywhere along the way)

"Effectively once" implies "you got the same thing 1000 times but ignored 999 of them because you already got it."

With an "exactly once" guarantee, I could send you a stream of integers (1,2,3,4,5) and you could blindly/naively/simplistically add them with no special concern and be confident that your answer of 15 was correct.

With "effectively once," you'd have to keep track of what you've seen before so you know not to add 4 an extra 6 times and come up with the wrong sum of 39.

With "at most once" you may be sitting around with a sum of 0 and think that's correct.


I'm not sure I'm following this argument.

Surely Kafka isn't claiming "exactly-once" semantics AND availability? I thought the claim was that they will do exactly-once, or none-at-all in the event of an outage (until the outage is cleared, at which point you'll get the messages you were waiting for).

That seems solvable by consensus - indeed, it's the equivalent of what Zookeeper offers.

What am I missing here?


The fact that this is not exactly-once, for a couple of reasons.


That's not super helpful, and not completely convincing.


There are perfectly valid reasons to claim that exactly-once delivery is impossible. The FLP impossibility result is not one of them. In fact, in the FLP model the solution is trivial: just send the message exactly once and it will eventually be delivered.

* What about network failures? In the FLP model the network is reliable, so there are no network failures.

* What about node failures? In the FLP model node failures are permanent, so there is nothing illuminating about saying that you cannot deliver a message to a node that is permanently offline.

* What if node failures were transient? If the network is still reliable and state transitions are atomic, then failures are completely unobservable.

* What if state transitions are not atomic, and you cannot process a message and record that it has been processed in a single step? That would mean that exactly-once delivery is impossible even within a single node, and has nothing to do with the distributed nature of the computation.
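
To make that last bullet concrete, here is a hedged sketch (my own, in Python, not Kafka code) of a single-node consumer where applying a message and recording it as processed are two separate steps:

    # Pretend both of these live in durable storage.
    processed_ids = set()
    state = {"total": 0}

    def handle(message_id, value):
        if message_id in processed_ids:
            return                      # already handled, ignore redelivery
        state["total"] += value         # step 1: apply the effect
        # <-- a crash here loses the fact that the effect was applied,
        #     so a redelivery will apply it a second time
        processed_ids.add(message_id)   # step 2: record that we handled it

    # Exactly-once within the node requires steps 1 and 2 to happen
    # atomically, e.g. in a single transaction that updates both.

Swap the order of the two steps and a crash gives you the opposite failure: recorded as processed but never applied.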


I'm pretty sure the article claimed no such thing. In fact, the article repeatedly says that it will take no position on whether EO is at all like AB, but that AB is not something you can guarantee (modulo new model assumptions), and probably not something that the Kafka folks should be saying they can do.


Using FLP to say that atomic broadcast is impossible is overly simplistic. In the real world, where people do use atomic broadcast/consensus in real distributed systems, it just isn't a problem.

FLP says it's always possible to not achieve consensus, but it says nothing about the probability of it. In practical systems, unless you have a partition such that no quorum of nodes can talk to each other, the probability of failing to reach consensus fairly quickly is effectively zero. Such partitions are rare in real systems (basically requiring multiple data center failures or multiple fiber cuts). You are much more likely to run into other problems that affect your availability, like code bugs or failed isolation between some components.


> FLP says it's always possible to not achieve consensus, but it says nothing about the probability of it.

It says that the probability is not zero. That is a very important distinction for some people.

> You are much more likely to run into other problems that affect your availability, like code bugs or failed isolation between some components.

All the more reason not to claim that your system provides the guarantee.


> It says that the probability is not zero. That is a very important distinction for some people

I'm not a mathematician, but I'm pretty sure it doesn't even say the probability of failed consensus must be nonzero. For example, the Cantor set shows that it is possible to have a set with infinitely many elements but zero length. That is equivalent to an infinite number of interleavings where consensus doesn't occur, but the probability of hitting any of them is zero.
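
For what it's worth, the standard calculation behind the Cantor set intuition (ordinary measure theory, not anything specific to FLP): after removing middle thirds n times you are left with 2^n intervals of length 3^-n, so the total length is

    \mu(C) = \lim_{n \to \infty} 2^n \cdot 3^{-n} = \lim_{n \to \infty} \left(\tfrac{2}{3}\right)^n = 0

yet the set still contains uncountably many points.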

Sure, in real-world systems the probability would always be nonzero, but when it's still so close to zero, it practically doesn't matter. Which is why, in the real world, people do build very reliable distributed systems out of unreliable components.


FLP says that any deterministic algorithm satisfying agreement and validity must have non-terminating executions. If you are allowed to pick the distribution over the sequence of events in the system, I'm sure the probability of non-termination goes to zero quickly. If FLP are allowed to pick the distribution, the probability of non-termination can be one.

Getting into a quantitative evaluation of the likelihood of consensus is beyond my ken, but different people have different beliefs about what constitute "in real systems", and different interpretations of "practically doesn't matter". For example, the fact that you are talking about data centers rather than satellites suggests (to me) that your beliefs about the scarcity and transience of partitions may not generalize.

We can afford to be clear about what is true and what isn't, especially when trying to build reliable distributed systems.

Note: Ben-Or's consensus algorithm uses randomization to drive the probability of non-termination to zero, but there is still a non-zero probability of not having terminated within any fixed length of execution.
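
A back-of-the-envelope version of that note (my own framing, under the simplifying assumption that each round independently reaches agreement with some probability p > 0):

    \Pr[\text{not terminated after } r \text{ rounds}] \le (1 - p)^r \to 0 \quad (r \to \infty),
    \qquad \text{yet } (1 - p)^r > 0 \text{ for every finite } r.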



