
Paxos vs. Raft: Have we reached consensus on distributed consensus? - rbanffy
https://arxiv.org/abs/2004.05074
======
jganetsk
Heidi Howard is incredible and her work on distributed consensus is
illuminating. I think she has actually successfully cracked the cookie of
"making consensus easy".
[https://www.youtube.com/watch?v=KTHOwgpMIiU](https://www.youtube.com/watch?v=KTHOwgpMIiU)

~~~
dnautics
That was fantastic! Do you know what happened with her Ios project?

~~~
heidihoward
Whilst working on Ios, I started work on a theoretical result which became
known as Flexible Paxos[1]. Unfortunately, I never had time to go back to
working on Ios. Maybe someday.

[1]
[https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIc...](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-
OPODIS-2016-25.pdf)

~~~
dnautics
I'm a huge fan of your videos. Do you have a spec for Ios? I'd love to take a
crack at it in Erlang/Elixir.

------
flexd
I can't help myself: Wouldn't we need a third consensus algorithm to reach
consensus? :D

~~~
enitihas
You just got your wish in the form of Viewstamped Replication.

~~~
heidihoward
Diego wrote a great summary of the differences between Viewstamped
Replication[1,2] and Raft on the Raft mailing list a few years ago[3].

[1]
[http://pmg.csail.mit.edu/papers/vr.pdf](http://pmg.csail.mit.edu/papers/vr.pdf)
[2] [http://pmg.csail.mit.edu/papers/vr-
revisited.pdf](http://pmg.csail.mit.edu/papers/vr-revisited.pdf) [3]
[https://groups.google.com/forum/#!topic/raft-
dev/cBNLTZT2q8o](https://groups.google.com/forum/#!topic/raft-dev/cBNLTZT2q8o)

------
throw0101a
Ten minute video by co-author on this:

* [https://www.youtube.com/watch?v=JQss0uQUc6o](https://www.youtube.com/watch?v=JQss0uQUc6o)

See also her PhD dissertation, "Distributed consensus revised":

* [https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.html](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.html)

~~~
anogrebattle
Love that the author recorded a YouTube summary. This is something more people
should start doing!

~~~
heidihoward
It was actually the PaPoC workshop[1] that asked me to record a short talk and
posted it on YouTube. They did a wonderful job!

[1] [https://papoc-workshop.github.io/2020/](https://papoc-
workshop.github.io/2020/)

------
twoodfin
I’m more interested in reaching consensus on “libpaxos” vs. “libraft”: Which
algorithm is more amenable to practical abstraction, so that the distributed
system developer need only register a handful of callbacks for the actual
mechanisms of, e.g., persistent local logging?

~~~
mpercy
From experience working in this area, I believe there's a significant tradeoff
between performance, flexibility, and time to delivery when it comes to
consensus and the things it's used for, like database replication. It's like:
"good, fast, or cheap, pick two".

As one of the core authors of the Apache Kudu Raft implementation
[http://kudu.apache.org/](http://kudu.apache.org/) (which is written in
C++) I know that we tried to design it to be a pretty standalone subsystem,
but didn't try to actually provide a libraft per se. We wanted to reuse the
Raft write-ahead log as the database write-ahead log (as a performance
optimization) which is one reason that making the log API completely generic
eluded us a little.

That said, I'm currently at Facebook helping to adapt that implementation for
another database. We are trying to make it database agnostic, and we continue
to find cases where we need some extra metadata from the storage engine, new
kinds of callbacks, or hacks to deal with various cases that just work
differently than the Kudu storage engine. It would likely take anybody several
real world integrations to get the APIs right (I'm hopeful that we eventually
will :)

~~~
matthewaveryusa
I created a toy raft implementation in typescript, mostly to learn typescript.
my interface for the datastore is actually pretty small:

[https://github.com/matthewaveryusa/raft.ts/blob/master/src/i...](https://github.com/matthewaveryusa/raft.ts/blob/master/src/interfaces.ts#L17-L21)

What kind of metadata do you need exactly from the datastore?

~~~
mpercy
Sure, here are a couple examples of things we've had to include in the log
APIs:

1\. Kudu uses a special "commit" record that goes into the log for crash
consistency, related to storage engine buffer flushes. So we need an API to
write those into the log. They don't have a term and an index, since they are
a local-engine thing, so they have to be skipped when replicating data to
other nodes in the case of the current node being the leader. If we were not
sharing the log with the engine, we wouldn't need this.

2\. Another database I'm working with requires file format information to be
written at the top of every log segment, and it has to match the version of
the log events following it. That info has to be communicated to the follower
up-front even when the follower resumes replicating from the middle of the
leader's log. So we need plugin callbacks on both sides to handle this, in
terms of packing this as the leader and unpacking it as a follower into the
wire protocol metadata.

Requirements like these will come up and either you hack around them by making
some kind of out-of-band call (not ideal for multiple reasons) or you bake the
capability into the plugin APIs and the communication protocol.
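
To make the two examples above concrete, here is a minimal Python sketch of what "baking the capability into the plugin APIs" might look like. Everything in it is an assumption for illustration: `SharedLog`, `LogRecord`, `append_local`, `batch_for_follower` and the `segment_header` callback are hypothetical names, not the Kudu API or any real library. It only shows local-only records (no term or index) being skipped on the leader side, and a storage-engine callback supplying a header that is sent up front even when a follower resumes mid-log.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LogRecord:
    payload: bytes
    # Replicated records carry a term and an index; local-only records
    # (such as an engine "commit" marker) carry neither.
    term: Optional[int] = None
    index: Optional[int] = None

    @property
    def is_local_only(self) -> bool:
        return self.term is None or self.index is None


class SharedLog:
    """Hypothetical log shared by the consensus layer and the storage engine."""

    def __init__(self, segment_header: Callable[[], bytes]):
        # Plugin callback supplied by the storage engine (assumed name).
        self._segment_header = segment_header
        self._records: List[LogRecord] = []

    def append_replicated(self, term: int, payload: bytes) -> LogRecord:
        # Indexes count only replicated records, starting at 1.
        index = 1 + sum(1 for r in self._records if not r.is_local_only)
        record = LogRecord(payload, term, index)
        self._records.append(record)
        return record

    def append_local(self, payload: bytes) -> LogRecord:
        # Engine-only record: stored in the same log, never replicated.
        record = LogRecord(payload)
        self._records.append(record)
        return record

    def batch_for_follower(self, from_index: int) -> dict:
        """Build what a leader would send to a follower resuming at from_index."""
        entries = [
            r for r in self._records
            if not r.is_local_only and r.index >= from_index
        ]
        # The header travels with every batch, even when resuming mid-log.
        return {"segment_header": self._segment_header(), "entries": entries}
```

A follower-side plugin would need the mirror-image callback to unpack the header before applying the entries; that half is left out here to keep the sketch short.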

Frankly, designing generic APIs is also one of the less sexy aspects to
consider because we spend so much of our time dreaming about and building all
the cool distributed systems capabilities like leader elections, dynamic
membership changes, flexible quorums, proxying/forwarding messages,
rack/region awareness, etc etc etc. :)

The details of long-tail stuff like this are often hammered out as they come
up during implementation.

------
hardwaresofton
Shameless plug: I've written up the different implementations/variations of
Paxos[0] if you'd like to see more of the ecosystem.

I'm actually kinda wondering why this is news now -- the author of this paper
is absolutely right of course but I thought this was common knowledge for
anyone who knew what raft was. Raft is less robust than Paxos but simpler to
implement (and implement correctly) which is why projects choose it. Basically
chat (run a paxos round) to decide a leader and build the log (the history of
everything that has ever happened in the system, see WALs for a similar
concept) at the chosen leader instead of by chatting about every single/batch
of changes.

[0]: [https://vadosware.io/post/paxosmon-gotta-concensus-them-
all/](https://vadosware.io/post/paxosmon-gotta-concensus-them-all/)
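
For readers who want the "decide a leader first" step spelled out, here is a minimal Python sketch of the voting rule as the Raft paper describes it: a node grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own. The class and function names (`Node`, `handle_request_vote`, `wins_election`) are made up for this sketch, and it leaves out timeouts, persistence and networking entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, candidate: str, term: int,
                            cand_last_term: int, cand_last_index: int) -> bool:
        # Reject candidates from an older term.
        if term < self.current_term:
            return False
        # A newer term resets our vote for that term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours:
        # compare the last entry's term first, then its index.
        log_ok = (cand_last_term, cand_last_index) >= (
            self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def wins_election(votes_granted: int, cluster_size: int) -> bool:
    # A candidate needs votes from a strict majority of the cluster.
    return votes_granted > cluster_size // 2
```

Once a candidate wins, it is the only node appending to the log for that term, which is the "build the log at the chosen leader" part.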

~~~
judofyr
That's a great article! Thanks a lot. I know about some more variants you
might want to check out:

\- Compartmentalized Paxos
([https://mwhittaker.github.io/publications/compartmentalized_...](https://mwhittaker.github.io/publications/compartmentalized_consensus.pdf))
shows how you can compartmentalize the various roles of a Paxos node to
increase scalability.

\- Matchmaker Paxos
([https://mwhittaker.github.io/publications/matchmaker_paxos.p...](https://mwhittaker.github.io/publications/matchmaker_paxos.pdf))
introduces a separate set of nodes called _matchmakers_ which are used during
reconfiguration.

\- PigPaxos
([https://arxiv.org/abs/2003.07760](https://arxiv.org/abs/2003.07760)) places
the nodes into relay groups and revises the communication pattern to improve
scalability. This seems very similar to Compartmentalized Paxos.

\- Linearizable Quorum Reads in Paxos
([https://www.usenix.org/system/files/hotstorage19-paper-
chara...](https://www.usenix.org/system/files/hotstorage19-paper-
charapko.pdf)) shows how you can do linearizable quorum reads in Paxos.

~~~
hardwaresofton
Thanks -- will get started on writing another post to add these in!

------
codepie
This paper was published in PaPoC 2020, which was recently organized online.
The rest of the papers, along with the talks are available here:
[https://papoc-workshop.github.io/2020/programme.html](https://papoc-
workshop.github.io/2020/programme.html)

------
ideal0227
I think all these kinds of papers are very confusing. Comparing RSM
(replicated state machine) to Paxos is just like comparing a car to an engine.
It makes very little or no sense.

In the original Paxos paper ([https://lamport.azurewebsites.net/pubs/paxos-
simple.pdf](https://lamport.azurewebsites.net/pubs/paxos-simple.pdf)), the
part 3 (RSM) is not explained in much detail. There are countless ways to use
Paxos to implement RSM. Multipaxos/Raft/Epaxos try to fill in that gap.

By any measure, Paxos itself is 10x simpler than Raft or whatever. Every time I
hear a "distributed system" engineer say Paxos is complicated, I know he/she
does not have much experience in the field or at least has never implemented
the core consensus part...

~~~
wahern
Indeed, in the paper they're comparing MultiPaxos to Raft.

EDIT: For others, here's a _very_ comprehensive (as of ~2018) review of Paxos-
related distributed consensus algorithms with an exposition for each one:
[https://vadosware.io/post/paxosmon-gotta-concensus-them-
all/](https://vadosware.io/post/paxosmon-gotta-concensus-them-all/) That's 17
in all, excluding the original Paxos paper. IMO, it should be linked anywhere
Paxos is discussed. The link has been posted twice before by others on HN, but
unfortunately hasn't seen any discussion, perhaps because it speaks for
itself.

~~~
ideal0227
There are three main stages in an RSM.

1\. log replication (m)

2\. log consistency (n)

3\. log execution (k)

Then you will have m * n * k ways of achieving your goal based on different
requirements on the three stages.

------
ccleve
The paper's main conclusion is accurate. Raft is more understandable because
of the clarity of the paper. But implementation is very tricky. As I've
written elsewhere it takes weeks or months to write a solid implementation
from scratch.

~~~
prismatk
Raft itself - rather than any framework in which you would actually want to
use it - is quite simple to implement. A few classmates of mine and I
implemented a barebones Raft instance in about a weekend.

~~~
james-mcelwain
I feel like setting up the tests to validate that your Raft implementation is
actually correct would take at least a weekend by itself.

~~~
closeparen
This is a Distributed Systems homework project at several universities. In
mine, a Jepsen-style test harness was part of the autograder.

This happens every once in a while on HN: someone mentions having done one of
these assignments, and immediately gets tackled for it.

Maybe professors aren’t doing a good job conveying the limitations. But also
this community is gratuitously hostile to people who have no reason to doubt
that the code they wrote from the Raft paper, which passed the test suite, was
Raft.

~~~
kelnos
I don't sense any hostility here, just healthy skepticism. At the risk of
sounding condescending, building something for a class assignment is very
different from building something that you'd feel comfortable rolling out in
production. Hell, one of the commenters upthread worked on the implementation
of raft in Apache Kudu. To be perfectly frank, I would take their word on
something before that of someone talking about their homework assignment. It's
an incredibly useful learning tool, but it takes a lot more work to make it
robust.

I really hope you read this gently. (As the HN guidelines say, "Please respond
to the strongest plausible interpretation of what someone says, not a weaker
one that's easier to criticize. Assume good faith.") I'm not trying to talk
down to you or treat you with hostility (and I know that it's really hard to
convey that via text). I would just ask that when someone who has
professional, real-world experience in something says it is difficult and
time-consuming to do it right, you'd avoid assuming they just don't know what
they're doing, and that perhaps there are aspects that you haven't considered.

And hey, maybe you or some of the other commenters are just ridiculously smart
and focused and _can_ write it in a weekend. But if that's the case, it's
pretty uncharitable to push a narrative that it's trivial. Not saying that's
what's happening here, but that could be how it's coming off.

~~~
closeparen
I'm not the parent. But I hope you can see how downvotes to oblivion and a
bunch of people saying "no you didn't" are hostile.

This is a proportionate response to an undergraduate trying to sell you his
new RDBMS. But most students really did write a b-tree. Academic programming
elides the supporting infrastructure that bridges the gap between algorithm
and software system. Textbooks aren't generally leaving out 100 pages of extra
steps required for the list to _actually_ be sorted or the path to _actually_
be shortest. And so a student is not exactly out of line for thinking that the
Raft he was taught is actually consistent in the presence of failure. I'll
defer to the community's wisdom that he's wrong! But he's still not out of
line.

If anything, I'm worried about these classes instilling false confidence.
People who think they know these algorithms may go implement them
professionally, and not have anyone around to tell them the full story.
Cryptography education is careful to put asterisks around "Textbook RSA."
Distributed systems education should probably be doing the same.

~~~
james-mcelwain
I really don't think my initial comment was that hostile, but I can see why it
could be read that way. I appreciate what you're trying to say here, and
should probably work on framing things positively to avoid this kind of
contention.

Part of why the Raft paper is so excellent is because it _does_ leave you
feeling like you could explain/implement the algorithm. I don't want to
discourage people from being excited about these ideas, because I am too.

That being said, I am generally frustrated by the lack of humility that many
software engineers exhibit. "Easy" is a trigger word for me, and I really
think it is something that should be expunged from most of our vocabulary when
referencing software.

------
enitihas
Is there any widely used open source implementation of multi paxos? Single
paxos doesn't seem useful by itself.

------
kevindeasis
2020 video:
[https://www.youtube.com/watch?v=JQss0uQUc6o](https://www.youtube.com/watch?v=JQss0uQUc6o)

Paxos vs Raft: Have we reached consensus on distributed consensus? — Heidi
Howard

------
mirimir
The title is ~amusing for many who have experienced consensus-based decision
making.

But anyway, maybe this could be applied to helping people work together
cooperatively.

------
chapium
I thought raft was the obvious choice since it is a far simpler framework.

~~~
cmckn
The conclusion of the paper is that there isn't actually a significant
difference between the algorithms. The Raft paper is much clearer about
implementation, but (as Heidi says) the impl ideas from the Raft paper can be
applied to Paxos in many cases. Raft's leader election _is_ a bit more elegant
and results in a less complex implementation. The paper was a great read!

~~~
heidihoward
That's a great summary. Thanks for answering the question for me!

~~~
cmckn
Your YouTube video mentioned earlier in the comments was fantastic, thank you
for that!

------
stelfer
We have theorems that say that's impossible, so I guess the answer is no? Or
maybe Betteridge's law?

------
blamestross
I haven't found a problem well suited to either as a solution. Either the
performance constraints are too tight for industrial applications or you are
in a P2P space and they are both predicated on nodes not being hostile so you
can't use them.

In practice it just seems most efficient to be tolerant of consensus failures
and focus on cAP.

~~~
dilyevsky
A lot of real-world production systems (spanner, cockroach, tidb) use the
opposite approach - sacrifice reliability for consistency. Scaling constraints
are usually solved via sharding (running multiple raft/paxos fsms per dataset).
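
As a rough illustration of the sharding idea, here is a minimal Python sketch in which each key is routed to one of several independent consensus groups, so no single group has to order every write. The names (`ConsensusGroup`, `ShardedStore`, `group_for`) are hypothetical, the group is only a stand-in that appends to a local list, and hash-based routing is an assumption made for brevity; it is not a description of how spanner, cockroach or tidb actually partition data.

```python
import hashlib
from typing import List, Tuple


class ConsensusGroup:
    """Stand-in for one independent Raft/Paxos replicated state machine."""

    def __init__(self, group_id: int):
        self.group_id = group_id
        self.log: List[Tuple[str, str]] = []

    def propose(self, key: str, value: str) -> int:
        # A real group would replicate the entry to a quorum before acking.
        self.log.append((key, value))
        return len(self.log)  # position of the entry in this group's log


class ShardedStore:
    def __init__(self, num_groups: int):
        self.groups = [ConsensusGroup(i) for i in range(num_groups)]

    def group_for(self, key: str) -> ConsensusGroup:
        # Stable hash so the same key always lands in the same group.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.groups[int.from_bytes(digest[:8], "big") % len(self.groups)]

    def put(self, key: str, value: str) -> int:
        return self.group_for(key).propose(key, value)
```

Each group orders only its own keys, so anything that spans groups (a multi-key transaction, say) needs extra coordination on top of this.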

