Hacker News new | past | comments | ask | show | jobs | submit login
A Review of Consensus Protocols (thomasvilhena.com)
121 points by tcgv on Oct 13, 2020 | hide | past | favorite | 45 comments



I'd throw in the "Virtual synchrony" set of protocols in the vein of Horus, Isis, VSync, and Derecho that are similar to Paxos, but older and take a slightly different view. I prefer them a bit from an ergonomic perspective. These formed the basis of the New York stock exchange, the Aegis Combat System, and the French air traffic control system, among other extremely highly available systems.

It appears the fantastic Virtual Synchrony wiki page has been unfortunately deleted, and simply replaced with a redirect to a couple sentences on the reliable multicast page (despite not relying on reliable multicast any more than Paxos does) so here's an archive.org capture of the wiki page. https://web.archive.org/web/20181231000827/https://en.wikipe...


The deletion of this page really reflects poorly on Wikipedia’s editorial policies.


It's not just Wikipedia. VS is very nearly a dead research agenda. If you look through citations from recent years, virtually all of them are just referencing it as a landmark in the prior art alongside Paxos et all. There's not much in the way of new work that's wholly within the VS paradigm. The main reason for this IMO is that the original paradigm predated people's full understanding of some important details of consensus in asynchronous networks, so some of the concepts are cross grain so to speak. People doing novel work today would rather just start on the firmer ground of everything derived from Paxos, etc.


afaict they've been pretty consistent on "no original content" since the beginning. Wikipedia "wants" to be an aggregator of knowledge, not a primary source, and that page seems very clearly on the opposite side of that divide.

whether or not that's a useful place to draw the line: I dunno. wikipedia is what it is with that line, it's hard to say if it's a cause or in spite of it without competition.


It's not really original content though, but instead a pretty accurate summary of it's citations.

And now it's been distilled down to the point of being both inaccurate and difficult to find.


A substantial original summary is original content, in the same way as it would be if it were in a blog / magazine / etc. They've also been consistent on that. It becomes something you can't factually check / has no true references.


A) You can absolutely fact check. There's veritable tomes of VS documentation. Treat it like any other wiki page of a complex topic.

B) It was much smaller than the Paxos page. Should we go delete what exists there too?


You can fact-check anything, yeah. But there's definitely a range of ease. Tens of paragraphs of zero citations is not on the "easy" side.

As to the paxos page: yeah, that looks like the thing they fairly continuously delete, despite its usefulness. There are other ways and other places to have that kind of information. Wikipedia does not seem to desire to be that location, though it'll happily be a terser summary and link to it.


Yeah, just one more example that the deletionists won years ago.


VS and Paxos operate at different layers. For VS to provide its guarantees it needs to use true consensus for group membership. You could see it as a more feature expanded version of leases used to optimize away one of the phases of paxos in the common case.


That's for sure one way to look at it, but I prefer to view VS as looking at the concept of consensus as a spectrum and providing tools in many interesting points on that spectrum. Yes, the question of "who gets to participate in the consensus protocol itself" is itself solved through total consensus in VS (and a total consensus protocol older than Paxos). But the way that is then used to bootstrap looser and more performant views of consensus as a first class concept I believe makes it still interesting to bring up in a discussion of consensus in general.


Not really. Atomic broadcast is equivalent to the consensus problem. VS itself can be turned into Paxos and vice versa via shuffling of details. All of the performance optimizations the VS sales literature pushes so hard are just as possible atop Paxos, and in fact are done now in practice. It's just a divide of concepts/terminology for the most part.


Maybe this is an example my ignorance on the subject.

Can you point to off the shelf Paxos libraries that both have a focus on mixing strengths of consensus as a first class concept and have been used in extremely high availability circumstances?


Basically all of them support some sort of lease mechanism to avoid one of the protocol phases in the common case. Virtually all of them provide snapshot reads as a performance / convenience thing. It's common for systems to use paxos as the configuration/topology layer while fine grained updates flow through something akin to viewstamped replication. Examples would be Kafka, Amazon Aurora.


Shameless plug for a similar post I've made, listing all the variations of Paxos (including Raft), etc:

https://vadosware.io/post/paxosmon-gotta-concensus-them-all/

No implementations/code but if you've ever wondered what the rest of the research landscape looks like for the paxos family tree of coordination (there's also some mention of stuff like SWIM[0], aka batched & randomized gossip at the end as well)

[0]: https://research.cs.cornell.edu/projects/Quicksilver/public_...


That's a pretty impressive summary. Thanks for writing it.


No problem -- I didn't do any of the research so it was very easy. I got a brief intriguing introduction to consensus a long time ago in university but never really appreciated how interesting the space is until much later. I really ought to send an email thanking the professor that taught the class.


Raft? The two most prominent consensus algorithms that I usually come across in 'the wild', are Paxos and Raft. Am I missing something -- or is the article missing something?


As it pertains to reaching consensus, everything in the "Basic Paxos" section of the article is the same for Raft as it is Paxos. Raft can be seen as instance of Paxos with different terminology and prescribed solutions for leader elections, solving dueling leader issues, and persisting past outcomes.

Here's a nice paper on this that dives deeper into the matter and also shows how Paxos can be just as "understandable" as Raft with a few vocabulary changes: https://arxiv.org/pdf/2004.05074.pdf


So could you say that the Raft spec is an opinionated implementation of Paxos?


Yes. Raft is in the basic category called Multi-Paxos, but with specific choices intended to make the algorithm more comprehensible and less error prone to implement. Looks like there's substance to that view as well, looking at both the literature and practice.

A more recent development is Heidi Howard's work, which provides a framework for thinking about the whole category of Paxos algorithms that many of us find a lot more approachable.


>A more recent development is Heidi Howard's work

Glad someone has mentioned her here. This talk helped me greatly https://www.youtube.com/watch?v=KTHOwgpMIiU.


Seconded, her videos and papers are fantastic. Previous discussion here: https://news.ycombinator.com/item?id=22994420


Some people don’t like it when you say that, but yes. I don’t mean to downplay Raft’s significance because its better perceived understandability helped bring a good algorithm to much more popularity, maybe even nudged future researchers to think about the “user experience” of their discoveries (not to say that Leslie Lamport wasn’t, he did try very hard with metaphors in The Part-time Parliament).


Yeah I was kinda wondering that too


There are others too. Proof of Work consensus, etc. Or this:

https://ripple.com/files/ripple_consensus_whitepaper.pdf


I wish this article had a bit more systematic approach. I've been recently looking into consensus protocols, and there are at least two dimensions to consider: permissioned (participants are known in advance) vs permissionless (anyone can join), and partially synchronous (some guarantees about message delivery time) vs asynchronous (only the delivery is guaranteed, the time can be at any point in the future). There are also some recent developments loosely based on Ben-Or protocol (HoneyBadgerBFT, HashGraph) that look a lot more promising than blockchain-based consensus.


If you want a more theoretical review of byzantine consensus protocols, you need to read through the gossip-based protocol literature, in particular byzantine broadcast protocols[1], where consensus is just one powerful primitive among others.

[1]: https://dcl.epfl.ch/site/_media/education/sdc_byzconsensus.p...


Yeah the broadcast layer is the part where a lot of the byzantine protocols breakdown IMHO. Broadcast storms are very bad. There's also the whole problem of tombstone-ing a proposed block if it doesn't achieve consensus fast enough (which is basically impossible to do as a function of time so you end up doing it as a function of a monotonic counter which has other problems).

Anyways, I love all the byzantine stuff but very little of it is actually practical or fast.


depends on your usecases or what throughput you need in your system


Practical Byzantine Fault Tolerance (PBFT) is a consensus algorithm for permissioned environments that allows a number of faulty nodes f<n/3. It does not rely on synchrony for correctness and it is safe against Byzantine faults (when f<n/3). It is used in BFS (PBFT for NFS) and in hyperledger fabric, to name the first two that came in my mind.

http://zoo.cs.yale.edu/classes/cs426/2012/bib/castro02practi...


On Paxos: The following already seems wrong. AFAIK the proposal number space must be divided across proposers. If all proposers use minNumber+1 replicas will diverge on the chosen value.

  protected override void Propose(Instance r)
  {
    proposalNumber = minNumber + 1;
    Broadcast(r, MessageType.Propose, proposalNumber);
  }


Using incrementing ballot numbers is fine. If you have multiple Proposers all sending the same ballot number in Phase 1 (Prepare & Promise) the Acceptors will reject any ballot number less than the highest ballot number they have received. In order to move to Phase 2, the Proposer needs to receive the thumbs up from a majority of the Acceptors, which means only one Proposer can ever win.

If you were to divide the ballot number space, in the event that the owner of the highest values in that space won, every other proposer would immediately need to increase their ballot numbers for all future rounds and so on and so forth or else they would always get rejected.


The problem is on the recovery path.

Recovery path needs to identify if there was a chosen value only from F+1 nodes (out of 2*F+1 nodes total).

When two proposers with same proposal-number, but different values have partially executed the protocol before the crash, then you would end up with all F+1 nodes with same proposal number indicating successful consensus previously, but with divergent values.


I'm not sure what you mean by "recovery path" or "partially executed the protocol before the crash." Paxos has specific terminology for each of the phases. Can you be more precise?


Sure, imagine all nodes are crashed (due to poweroff) in the middle of two proposers A and B doing Phase2 with the same proposal-number, but different values. Remember, they can go to Phase2 cause their proposal-number is the largest ever seen when they are the first.

Proposer-A received Phase2 votes from nodes n_1,...n_f and Proposer-B received one vote from n_f+1 when power is off and other nodes n_f+2, n_f+3...n_2f+1 don't know anything about the Phase2.

Now, consider the case when power is back and only the first F+1 nodes came back online. System is expected to continue with F+1 live nodes.

Proposer-C now does Phase-1 with the same proposal-number. He is expected to pick the already voted value associated with the max proposal-number from the Phase-1 responses, but there are two different values and he can't determine what to pick.


I think you're mistaken about the ability of 2 proposers to move to phase 2 - once an acceptor has agreed to a proposal with a certain seq number, it cannot just agree to another one with the same seq number. And that means only one of the two proposers can obtain the required majority of votes to start sending Accept messages, and thus only one of the two values can be accepted by any acceptor.

Maybe read the papers on Paxos, like the "Paxos Made Simple" one, it's not too big (though it's not all that simple either, at least to me...).


Here is what the paper says:

Phase 1.

(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.

(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Important part of the statement is

  ...with a promise not to accept any more proposals numbered less than n...
So, Phase-1 promise is not on the proposer-identity, but on the proposal-number.


You're right, my mistake!

But it still should not be an issue, since according to what you quoted an acceptor will not reply to a second proposal that bears the same number ("If an acceptor receives... number n greater than...), which means only one of the two proposers can gather the required majority of promises to move to the next phase, and thus only one of the values can end up being accepted. After the crash you describe, there will be only one accepted value the Proposer-C will receive.


Such assumptions break network request retries and/or relies on proposer-identity, etc.


I do not want spend more time on this, so I will let you explore.

Algorithm is not tied to the proposer identity (uuid or fqdn or ip, etc.) The promise in Phase-1 response is not to the proposer, but to the proposal-number.

In your model, proposer cannot change his identity, between Phase-1 and Phase-2 which is not a constraint imposed by the algorithm.


Can't we just agree to use one consensus protocol?


I concur.


Some new consensus protocols: Avalanche from AVA labs and Nightshade as used in NEAR protocol.


Another is The Stellar Consensus Protocol by David Mazières .




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: