Hacker News new | past | comments | ask | show | jobs | submit login
Paxos vs. Raft: Have we reached consensus on distributed consensus? (arxiv.org)
343 points by rbanffy 11 months ago | hide | past | favorite | 65 comments

Heidi Howard is incredible and her work on distributed consensus is illuminating. I think she has actually successfully cracked the cookie of "making consensus easy". https://www.youtube.com/watch?v=KTHOwgpMIiU

That was fantastic! Do you know what happened with her ios project?

Whilst working on Ios, I started work on a theoretical result which became known as Flexible Paxos[1]. Unfortunately, I never had time to go back to working on Ios. Maybe someday.

[1] https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIc...

I'm a huge fan of your videos. Do you have a spec for ios? I'd love to give a crack at it in erlang/elixir.

The part I found non-obvious is how to snapshot the unbounded vectors of immutable registers, which is a requirement for any real system. But I never gave it a whole lot of thought either.

To snapshot, I either need to find an epoch such that no older epoch could ever have relevant information for the current node, or I need the nodes to agree on a snapshot. Both seem complicated.

This was a very good talk. Thanks for sharing!

I can't help myself: Wouldn't we need a third consensus algorithm to reach consensus? :D

You just got your wish in the form of Viewstamped Replication.

Diego wrote a great summary of the differences between Viewstamped Replication[1,2] and Raft on the Raft mailing list a few years ago[3].

[1] http://pmg.csail.mit.edu/papers/vr.pdf [2] http://pmg.csail.mit.edu/papers/vr-revisited.pdf [3] https://groups.google.com/forum/#!topic/raft-dev/cBNLTZT2q8o

Ten minute video by co-author on this:

* https://www.youtube.com/watch?v=JQss0uQUc6o

See also her PhD dissertation, "Distributed consensus revised":

* https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.html

Love that the author recorded a YouTube summary. This is something more people should start doing!

It was actually the PaPoC workshop[1] that asked me to record a short talk and posted on YouTube. They did a wonderful job!

[1] https://papoc-workshop.github.io/2020/

Finally COVID caused something positive :)

I’m more interested in reaching consensus on “libpaxos” vs. “libraft”: Which algorithm is more amenable to practical abstraction, so that the distributed system developer need only register a handful of callbacks for the actual mechanisms of, e.g., persistent local logging?

From experience working in this area, I believe there's a significant tradeoff between performance, flexibility, and time to delivery when it comes to consensus and the things it's used for, like database replication. It's like: "good, fast, or cheap, pick two".

As one of the core authors of the Apache Kudu Raft implementation <http://kudu.apache.org/> (which is written in C++) I know that we tried to design it to be a pretty standalone subsystem, but didn't try to actually provide a libraft per se. We wanted to reuse the Raft write-ahead log as the database write-ahead log (as a performance optimization) which is one reason that making the log API completely generic eluded us a little.

That said, I'm currently at Facebook helping to adapt that implementation for another database. We are trying to make it database agnostic, and we continue to find cases where we need some extra metadata from the storage engine, new kinds of callbacks, or hacks to deal with various cases that just work differently than the Kudu storage engine. It would likely take anybody several real world integrations to get the APIs right (I'm hopeful that we eventually will :)

I created a toy raft implementation in typescript, mostly to learn typescript. my interface for the datastore is actually pretty small:


What kind of metadata do you need exactly from the datastore?

Sure, here are a couple examples of things we've had to include in the log APIs:

1. Kudu uses a special "commit" record that goes into the log for crash consistency, related to storage engine buffer flushes. So we need an API to write those into the log. They don't have a term and an index, since they are a local-engine thing, so they have to be skipped when replicating data to other nodes in the case of the current node being the leader. If we were not sharing the log with the engine, we wouldn't need this.

2. Another database I'm working with requires file format information to be written at the top of every log segment, and it has to match the version of the log events following it. That info has to be communicated to the follower up-front even when the follower resumes replicating from the middle of the leader's log. So we need plugin callbacks on both sides to handle this, in terms of packing this as the leader and unpacking it as a follower into the wire protocol metadata.

Requirements like these will come up and either you hack around them by making some kind of out-of-band call (not ideal for multiple reasons) or you bake the capability into the plugin APIs and the communication protocol.

Frankly, designing generic APIs is also one of the less sexy aspects to consider because we spend so much of our time dreaming about and building all the cool distributed systems capabilities like leader elections, dynamic membership changes, flexible quorums, proxying/forwarding messages, rack/region awareness, etc etc etc. :)

The details of long-tail stuff like this is often hammered out as it comes up during implementation.

Since I mentioned performance, one other area that makes flexibility nontrivial is how you decide to serialize different types of messages on the wire and on disk. If you don't need extensibility, it's easy to keep things pretty efficient just by using e.g. gRPC and protobuf out of the box. If you want complete flexibility, the simplest thing to do is to give your plugin interfaces blobs to write into and you end up double-serializing everything.

If some big corps are willing to open source their paxos library (I am pretty sure msft Google Amazon all have their implementation), then there should be no need of any other consideration of competition.

Does anyone know why they haven’t been open sourced? Are they too closely tied to proprietary hardware? Too dependent on underlying software platforms that would never be open sourced?

At this point, would they even offer anything superior to etcd-raft? Asking from a position of honest ignorance.

Not paxos, but if I recall, Hashicorp has open sourced their Golang Raft implementation and it seemed reasonably abstract. It’s late and I’m totally blanking on the names of their products though... it’s the etcd-like one.

Consul. Etcd also has an open source version of it.

Consul is the one you’re thinking of.

Shameless plug I've written a writeup of the different implementations/variations of Paxos[0] if you'd like to see more of the ecosystem.

I'm actually kinda wondering why this is news now -- the author of this paper is absolutely right of course but I thought this was common knowledge for anyone who knew what raft was. Raft is less robust than Paxos but simpler to implement (and implement correctly) which is why projects choose it. Basically chat (run a paxos round) to decide a leader and build the log (the history of everything that has every happened in the system, see WALs for a similar concept) at the chosen leader instead of by chatting about every single/batch of changes.

[0]: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/

That's a great article! Thanks a lot. I know about some more variants you might want to check out:

- Compartmentalized Paxos (https://mwhittaker.github.io/publications/compartmentalized_...) shows how you can compartamentalize the various roles of a Paxos node to increase scalability.

- Matchmaker Paxos (https://mwhittaker.github.io/publications/matchmaker_paxos.p...) introduces a separate set of nodes called matchmakers which are used during reconfiguration.

- PigPaxos (https://arxiv.org/abs/2003.07760) places the nodes into relay groups and revises the communication pattern to improve scalability. This seems very similar to Compartmentalized Paxos.

- Linearizable Quorum Reads in Paxos (https://www.usenix.org/system/files/hotstorage19-paper-chara...) shows how you can do linearizable quorum reads in Paxos.

Thanks -- will get started on writing another post to add these in!

This paper was published in PaPoC 2020, which was recently organized online. The rest of the papers, along with the talks are available here: https://papoc-workshop.github.io/2020/programme.html

I think all these kinds of papers are very confusing. Comparing RSM (replicated state machine) to Paxos is just like comparing a car to an engine. It makes very little or no sense.

In the original Paxos paper (https://lamport.azurewebsites.net/pubs/paxos-simple.pdf), the part 3 (RSM) is not extensively explained. There are countless ways to use Paxos to implement RSM. Multipaxos/Raft/Epaxos try to fill in that gap.

By any means, Paxos itself is 10x simpler than Raft or whatever. Every time I heard a "distributed system" engineer said Paxos is complicated, I know he/she does not have much experience in the field or at least has never implemented the core consensus part...

Indeed, in the paper they're comparing MultiPaxos to Raft.

EDIT: For others, here's a very comprehensive (as of ~2018) review of Paxos-related distributed consensus algorithms with an exposition for each one: https://vadosware.io/post/paxosmon-gotta-concensus-them-all/ That's 17 in all, excluding the original Paxos paper. IMO, it should be linked anywhere Paxos is discussed. The link has been posted twice before by others on HN, but unfortunately hasn't seen any discussion, perhaps because it speaks for itself.

There are three main stages about RSM.

1. log replication (m) 2. log consistency (n) 3. log execution (k)

Then you will have m * n * k ways of achieving your goal based on different requirements on the three stages.

That, or "Paxos" is a fantastic name, and "replicated state machine" is a mouthful.

I've implemented the consensus part of Paxos in 10 minutes. But it's a toy. It's totally useless without the other stuff.

No, simply not true at all. I've done both a few times. And yes, I have relevant experience.

What is not true? 10x simpler, no?

There are two possibilities:

1. You misunderstood Paxos, probably confused that with RSM. For example, Paxos is just the leader election part of Raft. Superficially, it seems different from Paxos, but it what it is under the surface.

2. You implemented RSM incorrectly, probably missed some important features or optimizations. For example, log compaction, membership reconfiguration, pipeline, back pressure on execution, etc.

It is VERY important to differentiate Paxos and RSM. There are tons of optimizations you can do with RSM. But on consensus, there is ONLY Paxos today. Or you do it wrong or you invite something truly new.

There are two possibilities.

1. You underestimate my understanding.

2. You underestimate my experience.

I am sure you know this: both are comparable as a shared log abstraction. Once you have that, the state machine part is trivial. For the replicated shared log abstraction, you need to do the same kind of things in both the Paxos and Raft worlds. The latter is simpler to implement because the paper is written close to an engineer's understanding. i have written several versions of both, in production.

No, Paxos is not a shard log abstraction. Paxos is about agreeing on one value rather than a set of values in sequence. So my possibility 1 is true.

I can't edit my earlier response, but on second thought, we are both right. Is that consensus? :)

I think the difference in our stances arises from the two ways in which the term 'Paxos' is used. In the part-time-parliament paper, the single-decree paxos is the fundamental building block. I'm assuming this is what you think of as Paxos, and you are right in your assertion. It is defined as the fundamental act of consensus.

However, I have always viewed the single-decree protocol as a pedagogical device; a single write-once distributed register is useless in practice, so the multi-decree protocol is where it becomes useful, and to me is the point of the paper. I say this for a few reasons: a) state machine replication has been Lamport's focus since his early papers, including the 'time clocks and ordering' paper. (b) in engineering-oriented papers such as 'paxos made live', 'chubby' and so on, the term is used as a proxy for a replicated log. The reason for the many variants of Paxos is due to the underlying impulse that basic paxos is not sufficient in and of itself. You need some variant of shared log or equivalently, atomic broadcast.

So, we are both right in our ways. We can have consensus, as long as we sacrifice consistency of the definition of 'Paxos' :)

Are you speaking of Basic paxos?

The paper's main conclusion is accurate. Raft is more understandable because of the clarity of the paper. But implementation is very tricky. As I've written elsewhere it takes weeks or months to write a solid implementation from scratch.

Raft itself - rather than any framework in which you would actually want to use it - is quite simple to implement. A few classmates of mine and I implemented a barebones Raft instance in about a weekend.

I doubt very much that you implemented a proper command log with truncation and rollback for leaders and followers, a state machine, leader election with voting, epochs, timeouts with smart backoffs, pipelining, async transport servers and clients with object serialization, quorums, initialization and discovery, persistent state, graceful shutdown, adding and removing members, a proper event queue, snapshots, and proper distributed testing.

You can build a toy implementation in a weekend if you have a cookbook. A production-ready implementation takes a bit more.

I believe ucsc has (or had?) a distributed systems class in erlang, and correctly implementing each of those features would be about 3-4 days for a skilled erlang programmer and maybe a week for an undergrad, so that's doable for a team of 2 or 3 undergrads in a term.

Maybe out of scope for an undergrad are things like testing high latency/unreliable/jittery connections.

I feel like setting up the tests to validate that your Raft implementation is actually correct would take at least a weekend by itself.

Writing tests to prove you've a correct implementation is indeed a very hard problem. I toyed with an interesting idea last year where I began to write a simple (but intentionally incorrect) Paxos implementation using P# (now known as Coyote) and wanted to see if P#/Coyote's systematic exploration of the state space will show me the various race conditions which violated the protocol's correctness. To my surprise, the technique was quite effective. P#/Coyote was able to point out to a number of bugs after I specified the safety and liveness properties which weren't too hard to do. In effect, after specifying the safety/liveness properties, I was able to use P#/Coyote's state-space exploration to ensure the implementation had solid test coverage. More details are at https://github.com/imnaseer/DiscoveringPaxos where the project starts out with a simple naive implementation, uses P#/Coyote to find bugs, makes incremental modifications till we finally have a working version faithfully following the Paxos description.

This is a Distributed Systems homework project at several universities. In mine, a Jepsen-style test harness was part of the autograder.

This happens every once in a while on HN: some mentions having done one of these assignments, and immediately gets tackled for it.

Maybe professors aren’t doing a good job conveying the limitations. But also this community is gratuitously hostile to people who have no reason to doubt that the code they wrote from the Raft paper, which passed the test suite, was Raft.

I don't sense any hostility here, just healthy skepticism. At the risk of sounding condescending, building something for a class assignment is very different from building something that you'd feel comfortable rolling out in production. Hell, one of the commenters upthread worked on the implementation of raft in Apache Kudu. To be perfectly frank, I would take their word on something before that of someone talking about their homework assignment. It's an incredibly useful learning tool, but it takes a lot more work to make it robust.

I really hope you read this gently. (As the HN guidelines say, "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.") I'm not trying to talk down to you or treat you with hostility (and I know that it's really hard to convey that via text). I would just ask that when someone who has professional, real-world experience in something says it is difficult and time-consuming to do it right, you'd avoid assuming they just don't know what they're doing, and that perhaps there are aspects that you haven't considered.

And hey, maybe you or some of the other commenters are just ridiculously smart and focused and can write it in a weekend. But if that's the case, it's pretty uncharitable to push a narrative that it's trivial. Not saying that's what's happening here, but that could be how it's coming off.

I'm not the parent. But I hope you can see how downvotes to oblivion and a bunch of people saying "no you didn't," is hostile.

This is a proportionate response to an undergraduate trying to sell you his new RDBMS. But most students really did write a b-tree. Academic programming elides the supporting infrastructure that bridges the gap between algorithm and software system. Textbooks aren't generally leaving out 100 pages of extra steps required for the list to actually be sorted or the path to actually be shortest. And so a student is not exactly out of line for thinking that the Raft he was taught is actually consistent in the presence of failure. I'll defer to the community's wisdom that he's wrong! But he's still not out of line.

If anything, I'm worried about these classes instilling false confidence. People who think they know these algorithms may go implement them professionally, and not have anyone around to tell them the full story. Cryptography education is careful to put asterisks around "Textbook RSA." Distributed systems education should probably be doing the same.

I really don't think my initial comment was that hostile, but I can see why it could be read that way. I appreciate what you're trying to say here, and should probably work on framing things positively to avoid this kind of contention.

Part of why the Raft paper is so excellent is because is does leave you feeling like you could explain/implement the algorithm. I don't want to discourage people from being excited about these ideas, because I am too.

That being said, I am generally frustrated by the lack of humility that many software engineers exhibit. "Easy" is a trigger word for me, and I really think is something that should be expunged from most of our vocabulary when referencing software.

Right, my point was just that there's no such thing as a "mostly works" consensus algorithm. By definition these algorithms are intended for systems that can't tolerate failure.

If you feel confident you've built a complex algorithm like this correctly on the first try you probably haven't. Hubris and distributed systems just don't mix.

Verifying correctness to back up our empirical claims is often the hardest and most overlooked part of software engineering.

A novel algorithm (for any well known problem) is an audacious claim. An implementation usually isn't.

well in the case of raft for Apache Kudu the commenter says that most of the complexity came from the interaction between the database-log semantics and the raft-log semantics.

this is almost independent with how hard is to build a (good) pure implementation of raft that offers a simpler API

Just figure out what to test in a weekend probably would be worth some Lamport-level smartness...

Or Ousterhout level since we're talking about Raft. His career is amazing with contribution in many different areas of CS -- not to say Lamport's hasn't been.

The comments in this list seem from really smart people.

I used to follow Jepsen https://jepsen.io/ closely. Those tests are quite comprehensive. But I think even those are not sufficient to quantify the quality of Paxos/Raft.

Tools like TLA+ probably require a few months to learn and become proficient.

Is there any open source widely used implementation of multi paxos, because single paxos doesn't seem useful by itself.

2020 video: https://www.youtube.com/watch?v=JQss0uQUc6o

Paxos vs Raft: Have we reached consensus on distributed consensus? — Heidi Howard

The title is ~amusing for many who have experienced consensus-based decision making.

But anyway, maybe this could be applied to helping people work together cooperatively.

I thought raft was the obvious choice since it is a far simpler framework.

The conclusion of the paper is that there isn't actually a significant difference between the algorithms. The Raft paper is much clearer about implementation, but (as Heidi says) the impl ideas from the Raft paper can be applied to Paxos in many cases. Raft's leader election _is_ a bit more elegant and results in a less complex implementation. The paper was a great read!

That's a great summary. Thanks for answering the question for me!

Your YouTube video mentioned earlier in the comments was fantastic, thank you for that!

Nice, I'll have to follow up on it.

Great Summary

We have theorems that say that's impossible, so I guess the answer is no? Or maybe Betteridge's law?

I haven't found a problem well suited to either as a solution. Either the performance constraints are too tight for industrial applications or you are in a P2P space and they are both predicated on nodes not being hostile so you can't use them.

In practice it just seems most efficient to be tolerant of consensus failures and focus on cAP.

A lot of real use production systems (spanner, cockroach, tidb) use the opposite approach - sacrifice reliability for consistency. Scaling constraints are usually solved via sharding (running multiple raft/paxos fsms per dataset)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact