A generalised solution to distributed consensus (acolyer.org)
454 points by based2 on Mar 8, 2019 | hide | past | favorite | 56 comments

Is anyone else surprised at the relatively mild reaction to this result here on Hacker News? Only a couple dozen comments, and many are addressing other topics.

Can we take a second to appreciate that this result will rapidly become the first thing taught in every single distributed systems class? And if this holds as a generalization of trustful distributed consensus as a field, then she has defined its Turing Machine equivalent. And it is even remarkably easy to understand! That is wild!!

Props to the authors. Personally, this is going on the top of my “to memorize” list.

This is a very technical topic; of course the vote-to-comment ratio is going to be very high.

I disagree. This paper generalizes variants of Paxos, certainly not the class of all consensus algorithms. Moreover, it generalizes Paxos variants with different requirements: weaker quorums are always a trade-off with liveness in the presence of faults.

tl;dr: I don't see world-changing novelty here, theoretically. There are dozens of other consensus algorithms in the world, some randomized, some not quorum-based, etc. This is interesting if you're in Paxos land, trying to tune your Paxos for different fault-tolerance requirements. It's in no way comparable to a computational model.

"This paper generalizes variants of Paxos - certainly not the class of all consensus algorithms."

This is not true. It generalizes quite a few more than that. It happens to use PaxOS as its examples.

It generalizes specifically a class of consensus algorithms that write to sequences of registers in a monotonic way, such that if a "previous" register decides v, future "registers" and associated quorums will decide v.

This is not a property of all consensus algorithms. Consensus algorithms don't need to behave this way. Paxos and derivatives behave this way, but I'm not sure Raft, etc. slide in as neatly (that would require a substantial analysis).
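To make that register framing concrete, here's a toy sketch (the names and structure are my own, not the paper's) of a sequence of write-once registers with the monotonicity property described above: once an earlier register has decided a value, later registers adopt it.

```python
# Toy illustration only -- not the paper's actual construction.
class RegisterSequence:
    def __init__(self):
        self.registers = {}  # index -> decided value, write-once

    def decide(self, index, value):
        # Write-once: the first write to a register sticks.
        if index in self.registers:
            return self.registers[index]
        # Monotonicity: if any earlier register already decided,
        # this register must adopt that decision.
        earlier = [self.registers[i] for i in sorted(self.registers) if i < index]
        if earlier:
            value = earlier[-1]
        self.registers[index] = value
        return value
```

Running it shows why a later proposer can't override an earlier decision: `decide(3, "b")` after `decide(0, "a")` still returns `"a"`.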

In particular, this generalization does not capture nondeterministic algorithms, algorithms tolerating Byzantine faults (or various other consensus problems), non-quorum-based algorithms, Nakamoto style (which is a Byzantine consensus algorithm), Ben-Or, etc.

So, my question for you: what other algorithms do you think this paper generalizes, that are useful outside of the narrow scope of optimizing Paxos?

Not really, in fact it's something I've come to expect here. There's a loss in social status for appearing too enthusiastic about a new thing (that might not be so great) and if the new thing replaces a difficult-to-use old thing then the people who invested effort in understanding the old thing don't want to participate in making it obsolete.

Don't forget that people often just want to use an algorithm and are not necessarily that interested in the underlying mathematical structure, only in whether it works reliably and does a better job than the existing solution.

I'm very skeptical that many people will want this. Data misrepresentation is rarely (never?) due to a fault in the persistence medium. It's usually a political reason or human error that perverts consensus.

What? Consensus algorithms in this class are used everywhere in web programming, from MySQL to AWS. Ironic that your comment may very well have touched several consensus algorithms on its way to my eyes. The fact that you remain blissfully unaware of them is a testament to their power.

This has nothing to do with "data misrepresentation". It's about distributed consensus. It's not even really about storage, but about multiple parties agreeing on a single truth.

Even just something as simple as which of several replicas is supposed to be the master. Not some great external truth about the world.

Great paper with an incredibly important result. It will take years to shake out in industry and implementations, but should dramatically improve situations where strong consistency is important.

Can you explain how? In many cases when a "generalized result" comes out there was already a generalized result that covered all but obscure corner cases.

This is the first paper I've seen to generalize consensus while only using immutable state.

That is, you can describe PaxOS, Raft, etc. using the generalization given here.

The result is fairly understandable.

(Of course, as mentioned, Raft is trivial in her framework because there is only one thing deciding values; Raft spends its time in the leader-election part, finding that one thing.)

Also, from the abstract, "the surprising result of this analysis is a substantial weakening to the quorum requirements of these widely studied algorithms."

The big thing here is that its analytical framework allows the authors to clearly analyze Paxos and some important variants and identify performance improvements (e.g. steps in the algorithm you can elide) that don't undermine the guarantees provided.

Would someone ELI5?

Here's my attempt:

Making multiple computers all store the same thing is hard because they have to send messages between each other, and sometimes messages get lost.

Other people have tried to write down how they solved this problem, but their explanations were hard for adults to understand.

This person has tried a new way to solve this problem, and a new way to explain their solution that uses ideas that many big kids who go to big kid school to study computers can understand.

I can’t, but I can say:

1) Heidi Howard's papers are very fun to read. She's done lots of work I've enjoyed, especially wrt Raft.

2) She touches on two things in this story's paper that I first read in another paper of hers, and they are a joy to realize: a) she made a practical improvement to the already simple Raft protocol, and b) quorum != (obvious) majority. In (e.g.) Raft she identifies that quorum actually spans two states, which are

* leader election and

* “regular” mode with a leader and requisite # followers

Read/watch on and be amazed[0][1]

[0] http://hh360.user.srcf.net/blog/2016/08/majority-agreement-i...

[1] https://youtu.be/gYkueS5sKqo

Really cool result and a very readable paper. I'm struggling to figure out how this would be applicable to iterated consensus (i.e. logs). It seems like the extension examples are focused on reducing the message round trips for specific cases that don't matter when applied to multi-Paxos or Raft.

I have a lot of questions about this paper. Parts of it aren't clear to me.

Is anyone aware of a forum where consensus algorithms are discussed? A mailing list, subreddit, or website?

I found a Raft mailing list a while ago, but I'm looking for a place for a more general discussion, where real experts hang out.

For questions specifically about this paper, you might have luck with Adrian Colyer's take on it: https://blog.acolyer.org/2019/03/08/a-generalised-solution-t...

Or Howard's blog post: https://hh360.user.srcf.net/blog/2019/02/towards-an-intuitiv...

Twitter used to be a useful place for general distsys discussions, but I haven't been around for a few years now so I'm not certain how much still takes place.

This tweet, for example, is one I would have expected to see more people chime in on.


There's a distributed systems theory Google Group, but it appears to have completely died off:


To inquire what avenues for theoretical discussion are currently active, I would suggest reaching out on Twitter to Chris Meiklejohn (cmeik), Michael Bernstein (mrb_bk) or Kyle Kingsbury (aphyr).

I believe you have to go through literature to get this kind of back-and-forth. At least, that's how I manage, even if that feels slow and opinionated.

My understanding is the cost of all these weaker quorums is reduced fault tolerance. Is that correct?

Yes. In theory, maybe it's useful to have more flexibility to trade off durability for performance (a smaller quorum size hopefully reduces deciding latency [unless your small quorum contains the straggler!]) for specific kinds of data in the same replica group.
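The trade-off being described can be illustrated with the intersection requirement from Howard et al.'s earlier Flexible Paxos work: phase-1 and phase-2 quorums need only overlap each other, i.e. |Q1| + |Q2| > n, rather than each being a majority. Shrinking the steady-state (phase-2) quorum forces the leader-election (phase-1) quorum to grow, which is where fault tolerance is lost. A toy check (my own illustration):

```python
# Flexible-quorum intersection check: any phase-1 quorum of size q1 and
# any phase-2 quorum of size q2 out of n nodes must share a member.
def quorums_intersect(n, q1, q2):
    """True iff every q1-subset and every q2-subset of n nodes overlap."""
    return q1 + q2 > n

n = 5
print(quorums_intersect(n, 3, 3))  # classic majority quorums: safe
print(quorums_intersect(n, 4, 2))  # smaller phase-2 quorum, bigger phase-1: safe
print(quorums_intersect(n, 2, 2))  # both small: unsafe, quorums may miss each other
```

So with n=5 you can decide with only 2 acceptors in the steady state, at the cost of needing 4 reachable nodes to elect a new leader after a failure.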

In practice though, it seems easier just to run separate instances of Paxos.

Forgive the tangent, but could quantum entanglement one day provide consensus with trivial (if expensive) complexity? E.g., nodes maintain a supply of pre-entangled particles, observing them at predetermined timestamps, and using that known shared randomness to make decisions?

Edit: ok after that sat there for a while I realized that generating shared pseudorandom bits would be just as effective.

If you abstract quantum-entangled particles to a communication channel with zero latency, you still have the consensus problem—you just run into the edge cases less frequently. So it would help, but consensus is still a fundamental challenge with parallel processes.

Given the handwaviness I might just be misunderstanding what you mean by "zero latency", so I just wanted to clarify: entangled particles do NOT enable FTL communication. Sorry if I am saying something you know, but others might misread your comment.

As an alternative to Paxos-related consensus algorithms, some projects may want to take a look at the Raft consensus protocol: https://www.brianstorti.com/raft/

Whereas Paxos tries to maintain a fully distributed consensus state, Raft instead proposes a protocol for automatically electing a leader node from a group of follower nodes, and coordinating the replication of the write log through that leader node.

I haven't had the pleasure of experimenting with either one though and would love to hear from people who have.

Raft is arguably more difficult to understand at a deep level. After extensive study, I've noticed that Raft hides all of its complexity under its leader election setup. There's nuance after nuance. With Paxos, the complexity is explicit and at the forefront, which is actually better, when trying to reason about a distributed system and its invariants.

That was my experience as well. The rules around leader election, terms, voting, timeouts and the like are very subtle.

Raft is basically a fixed leader elected by a single round of Paxos.

It's not that simple. Here's an overview of Paxos/Raft consensus algorithms: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/ and even that is incomplete and doesn't necessarily describe the trade-offs well. The submitted generalised solution is a better point to start experimenting from.

This paper is deliberately a generalization of all of them, actually. Raft can also be explained in terms of the rules/system she provides.

She happens to use PaxOS as examples to show they are non-optimal, in part because Raft is already deliberately non-optimal.

The other reason is that Raft is completely trivial in her framework.

It is equivalent to only having one thing decide the registers (and all the hard work is in deciding which thing).

By the way, Paxos is the name of a Greek island; it doesn’t have funky capitalisation. The name is from Lamport’s paper “the part-time parliament” which describes the algorithm in the setting of Ancient Greece.

Exactly this; not sure why people think it's a different thing: just wrap nodes into a higher level of abstraction. I believe Raft might be a better approach to scalable consensus, though there need to be more layers to the network so it can be more self-balancing and localised.

If we call "heartbeat" the frequency at which a node is synced to the network (achieves consensus), then lower-heartbeat nodes need to stay close only to nodes with a "heartbeat" faster than theirs and elect leaders in that group. The leaders in that group are also in a group with the other leaders of that layer, listening to their own leader, who belongs to a group above (like zooming out from a group and the group becoming a node in a larger graph).

In the end, there is only one state after all and the graph is the ultimate leader with the highest heartbeat. We wouldn't need many layers to cover all nodes, assuming laws of nature and all. Then just go a layer up/down if the network is getting too large, use a karmic system to position new/reactivated nodes. Seems scalable enough.

Sorry OP, it might be too early and I'm just thinking out loud in a comment box.

Can the framework model the leader election though?

They do, effectively. They show what is necessary for quorums, and show that a large number of algorithms meet these requirements.

Hence, the single leader election is not actually necessary (and this is one of their points - the existing algorithms require much stronger quorum guarantees than are actually necessary for correctness)

It took me months to do a reliable Raft implementation. The author of the Raft paper, Diego Ongaro, wrote in a comment on a Raft mailing list that it could take weeks.

The Raft paper is really easy to read and understand, but the actual code is very tricky.

What did you find most difficult about the implementation?

To make it work with any kind of performance, all message passing needs to be asynchronous, and you need an event loop to process messages one at a time. Things are much simpler if your code is effectively single-threaded when processing an inbound message. That took me a while to figure out. Futures and promises are very helpful.
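As a rough illustration of that single-threaded message loop (my own sketch, not from any real Raft implementation): all inbound messages funnel through one queue, so the handler never races with itself and needs no locks.

```python
# Sketch of a single-threaded inbox loop; message fields are hypothetical.
import queue
import threading

class Node:
    def __init__(self):
        self.inbox = queue.Queue()
        self.term = 0
        self._thread = threading.Thread(target=self._loop)
        self._thread.start()

    def _loop(self):
        while True:
            msg = self.inbox.get()   # blocks; processes one message at a time
            if msg is None:          # shutdown sentinel
                break
            self._handle(msg)        # effectively single-threaded: no locks

    def _handle(self, msg):
        # Example rule: ignore messages from an out-of-date term.
        if msg["term"] < self.term:
            return
        self.term = msg["term"]

    def stop(self):
        self.inbox.put(None)
        self._thread.join()
```

Callers (timers, RPC handlers) just enqueue and return; all state mutation happens inside `_handle`, which is what makes the invariants tractable to reason about.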

It took a while to get the building blocks together: the log file, which can be traversed, read, and rolled back/truncated (use a kv store for this, trust me), and the transport mechanism to deal with sending complex objects and handling errors and timeouts.

Tracking terms, states (leader, follower, candidate), rejecting out of date terms, voting, elections -- it's all very complex. You'll have to build some really, really good multi-threaded tests that simulate out-of-order and dropped messages, then really pound away at the system. And good logging to find that odd corner case that keeps coming up and throwing mysterious messages.

I haven't even tried membership changes yet.

> use a kv store for this, trust me

Order is important here - is this inherent in your key?


I did two implementations, one with plain text files and one with a kv store. For plain text files, I had to maintain file offsets of the records so I could traverse backward and truncate records that had to be replaced, and then break the files into segments so I could retire segments that are no longer needed. It had all kinds of little bugs.

Then I reimplemented using a kv store, and it took a tiny fraction of the time. And surprisingly had better performance.

Just create a good interface at the start, use the kv store underneath, and then later if you don't want the dependency swap it out for the more complicated code.
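Something like this minimal interface (hypothetical names; a plain dict stands in for the real kv store), where keying entries by index makes rollback a matter of deleting keys:

```python
# Sketch of a kv-backed Raft log; a dict stands in for e.g. LMDB/RocksDB.
class KVLog:
    def __init__(self):
        self.kv = {}  # index -> (term, command)

    def append(self, index, term, command):
        self.kv[index] = (term, command)

    def truncate_from(self, index):
        # Roll back conflicting entries, as AppendEntries handling requires.
        for i in [k for k in self.kv if k >= index]:
            del self.kv[i]

    def last_index(self):
        return max(self.kv, default=0)
```

Compare this with the plain-file version: no offset bookkeeping, no segment management, and truncation can't corrupt a partially written record.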

Cool - I’d like to talk about your design and how you arrived at it if you’re interested.

Happy to. I just added my email address to my profile.

Not OP, but it is very easy to make mistakes while implementing it and end up with something that is not actually reliable. You have to have the paper open and write the code line by line while looking at it. I implemented it a while ago, so that's the most I remember, but I did a bit of fuzzing to nail down every bug.

Or I could ask you one of the questions here.

Do you host diep.io on Cloudflare, Heroku, Linode, etc., or do you have your own servers?

Because I'm interested in developing something.

Also not OP, I think I went through the paper 10 times over during the implementation. There's a lot of cross-references all over the place.

Could I ask you something, or maybe on Steam?

Isn't Raft the default for Bitcoin?

No. Raft and Bitcoin's algorithms for agreeing on what the blockchain is are completely different.

AIUI, Bitcoin's "consensus" is just: whoever has the longest valid blockchain wins. The algorithm tunes the difficulty s.t. it takes ~10 minutes to find a block, so it's unlikely that two blockchains will continue to have the same length for very long. That is, one chain will find the next block before the other and, in the meantime, be able to distribute that fact to the other nodes and thus "win".

From https://en.bitcoin.it/wiki/Block_chain:

> Honest generators only build onto a block (by referencing it in blocks they create) if it is the latest block in the longest valid chain. "Length" is calculated as total combined difficulty of that chain, not number of blocks, though this distinction is only important in the context of a few potential attacks.

No. Bitcoin uses its own protocol. Raft could not be used for a secure cryptocurrency because Raft does not handle Byzantine faults[1]. It assumes nodes are not run by malicious actors and that messages sent are unaltered in transit.

[1] https://en.m.wikipedia.org/wiki/Byzantine_fault

True. Though as a trade-off it has massively increased throughput. I'm always frustrated when I see blockchains recommended as a solution for a consensus problem in anything but a zero-trust scenario.

Even in a zero trust scenario, you need a majority of nodes to be good faith, non-colluding actors. Even anarchy breaks down in the face of unequal power.

Bitcoin consensus is what is called Byzantine consensus - that is, your consensus still has to work with some nodes actively trying to send bad information to mess up the protocol.

Raft and Paxos are non-Byzantine. They assume that the nodes are actively trying to follow the protocol, though there may be such things as network partitions and crashes.

Bitcoin consensus is extremely simple: the chain with the most accumulated difficulty wins. Furthermore, blocks aren't removed or changed, so consensus is in fact a monotonic process. You don't need Paxos or Raft or any other sophisticated algorithm; just broadcast every block you believe to be part of the most difficult chain and consensus will be achieved.
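That heaviest-chain rule is simple enough to sketch in a few lines (illustrative only; block validation and difficulty adjustment are omitted, and chains are represented as bare lists of per-block difficulties):

```python
# Nakamoto-style fork choice, reduced to its core: among candidate chains,
# adopt the one with the greatest total accumulated difficulty.
def best_chain(chains):
    """chains: list of chains, each a list of per-block difficulties."""
    return max(chains, key=sum)

# A shorter chain of harder blocks beats a longer chain of easy ones,
# which is why "longest chain" really means "most total difficulty".
print(best_chain([[1, 1, 1, 1], [3, 3]]))
```

This also shows why the process is monotonic in the sense above: total difficulty only ever grows, so a node's adopted chain can change but never gets "lighter".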

The client maintains a decision table with one entry for each quorum of each register set.
