
Paxos Made Moderately Complex - sargun
http://paxos.systems
======
mslot
Paxos is one of the simplest distributed algorithms out there.

For a proposal i with value x:

1. Ask everyone not to participate in proposals older than i, get a majority
to agree and send back any previously accepted values.

2. Ask everyone to accept proposal i with the newest previously accepted
value or x if none, get a majority to accept.

- On failure, restart with i > max(previous proposals)

That is the Paxos algorithm.
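As a sketch, here are the two steps above in illustrative Python. This is a toy single-decree Paxos: the names, the message shapes, and the direct-function-call "network" are my own simplifications, not any particular implementation.

```python
# Toy single-decree Paxos. All names are illustrative; "the network"
# is a direct function call, and failures are not simulated.

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised
        self.accepted = None    # (number, value) of last accepted proposal

    def prepare(self, n):
        # Step 1: promise to ignore proposals older than n,
        # and report any previously accepted value.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        # Step 2: accept unless a newer proposal was promised.
        # This is the receiver-side check that rejects proposals < n.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"


def propose(acceptors, n, value):
    """Run one round of Paxos for proposal number n with value x."""
    majority = len(acceptors) // 2 + 1

    # Step 1: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for verdict, acc in promises if verdict == "promise"]
    if len(granted) < majority:
        return None                  # restart with a higher n

    # Use the newest previously accepted value, or our own if none.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]        # value of highest-numbered acceptance

    # Step 2: ask for acceptance; succeed on a majority.
    votes = [a.accept(n, value) for a in acceptors]
    if votes.count("accepted") >= majority:
        return value                 # consensus reached
    return None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="x"))   # "x" is chosen
print(propose(acceptors, n=2, value="y"))   # "x" again: chosen values stick
```

The second call shows why consensus is permanent: round 1 of the new proposal returns the previously accepted value, which the proposer must adopt in place of its own.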

Its fault-tolerance comes from the fact that you can tolerate losing the
minority (1 out of 3 / 2 out of 5). There's no magic. If the majority fails,
Paxos blocks. If two proposers keep blocking each other's proposals by going
through round 1, Paxos can take forever.

The goal of Paxos is to agree on 1 value. Once consensus is reached (majority
agrees), the value can never be changed, since no one can complete round 1
without getting back the newest accepted value from at least one node.
However, you can run Paxos again and again to agree on a sequence of values
(e.g. transitions in a state machine). This is Multi-Paxos.
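The "run it again and again" idea can be sketched as one independent consensus instance per slot in a replicated log. This is a hypothetical illustration: `run_paxos` stands in for the full two-step protocol described above, and here it just records the first value decided for a slot.

```python
# Hypothetical Multi-Paxos sketch: one single-decree Paxos instance
# per log slot. run_paxos is a stand-in for the real protocol; the key
# property it models is that a slot's decided value never changes.

def run_paxos(instances, slot, value):
    # First decided value for a slot wins; later proposals for the
    # same slot learn the existing value instead of overwriting it.
    return instances.setdefault(slot, value)

log = {}
for slot, command in enumerate(["set x=1", "incr x", "del x"]):
    decided = run_paxos(log, slot, command)
    print(slot, decided)
```

Applying each decided slot, in order, to a state machine on every replica is what makes this state machine replication.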

The Paxos algorithm is optimal, but Multi-Paxos allows for a lot of
optimizations and there is no clearly defined implementation. A pretty typical
optimization is to elect a leader (through Paxos) which makes the proposals.
This avoids duelling proposers and ensures you have a node that is always
aware of the latest accepted value. Making this optimization requires you to
make decisions that fall outside the scope of Paxos, such as how long a node
remains leader, what happens if the leader fails, and what happens if two
nodes think they are the leader. These optimizations can easily cause you to
lose the fault-tolerance and consistency guarantees that Paxos gives you
within the scope of reaching consensus on a single value.

It's important to remember that Paxos itself is easy to understand and a lot
of the problems in Multi-Paxos can be solved by running Paxos. You need to
familiarize yourself with it as though it were a programming construct.

~~~
c_lebesgue
In step 2 you write that you "Ask everyone to accept proposal i", but to
satisfy the safety property of Paxos you can only accept proposals from round
i if you were a participant in round i in step 1.

~~~
mslot
Ah true, I probably should have written "all round 1 participants". It can be
enforced by the receiver as well. There are some other implied receiver-side
steps that I did not write down, such as rejecting all proposals with a
sequence number < i.

------
nicolast
At my previous employer, I worked on Arakoon, a consistent distributed key-
value store based on Multi-Paxos, implemented in OCaml, using (a fork of)
TokyoCabinet as the local storage mechanism.

Implementing the protocol in an industrial setting brings up lots of ...
interesting design and engineering issues for sure!

See [https://github.com/Incubaid/arakoon](https://github.com/Incubaid/arakoon)

------
zphds
There is a popular alternative as well, Raft.
[https://ramcloud.stanford.edu/wiki/download/attachments/1137...](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)

Has anyone used both? What are the pros and cons?

~~~
krenoten
Raft has a dirty little secret: your cluster breaks when two nodes become
partitioned from each other (but not the rest of the cluster) and one of them
is the leader.

Scenario: nodes {1, 2, 3*}, where 2 and 3 are partitioned from each other and
3 is the current leader.

Node 2 fires its leader election timer and broadcasts RequestVotes to 1 and
3. Only 1 gets it, but now 3 is ignored for the term of 2's leadership and 2
is the only node capable of quorum. In a decent implementation, 3 will be
nack'd the next time it sends an AppendEntries to 1, and 1 will pass along
the current term. 3 isn't hearing from the leader, so it broadcasts
RequestVotes (either failing once due to not having heard the current term
from 1, or jumping straight into the next term, which actually increases the
likelihood of cluster livelock). Leadership bounces back and forth rapidly,
making your cluster worthless.
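As a rough illustration, the bouncing can be reproduced in a toy model. This is not real Raft: timers, logs, and vote persistence are heavily simplified, and all names are my own.

```python
# Toy model of the scenario above (not real Raft): three nodes, the
# link between nodes 2 and 3 is down, and whichever node stops hearing
# heartbeats starts an election one term higher.

reachable = {1: {2, 3}, 2: {1}, 3: {1}}   # 2 and 3 are partitioned

def election(candidate, term, voted):
    # Candidate votes for itself and collects votes from reachable
    # peers that haven't yet voted in this term (simplified rule).
    votes = 1
    for peer in reachable[candidate]:
        if voted.get((peer, term)) is None:
            voted[(peer, term)] = candidate
            votes += 1
    return votes >= 2                      # majority of 3

leader, term, voted = 3, 1, {}
history = []
for _ in range(6):
    # The partitioned node never hears the leader's heartbeats,
    # times out, and deposes the leader with a higher term.
    challenger = 2 if leader == 3 else 3
    term += 1
    if election(challenger, term, voted):
        leader = challenger
    history.append((term, leader))

print(history)   # leadership bounces between 2 and 3 every term
```

In the model, each challenger always wins node 1's vote for the new, higher term, so leadership never stabilizes while the partition lasts.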

The current hotness is Speculative Paxos, which gets the positive trade-offs
of both Cheap Paxos and Fast Paxos, if you're able to actually implement it
in your DC (it has networking assumptions).

~~~
r0naa
How's that a "dirty little secret"? Raft not supporting asymmetric partitions
is somewhere on the third line of the original paper.

Besides that, I genuinely wonder how often this happens in the real world. I
can see a typical case where part of your cluster is behind a NAT. Otherwise,
what could cause an asymmetric partition?

~~~
kelnos
That sounds easy, actually. Take a cloud provider like AWS. They divide their
regions into "availability zones", which are supposedly in different
buildings, don't share the same power, internet connectivity, etc. Say you
have 3 AZs, and you put a node in each. A simple network connectivity issue
between AZ 1 and 2 (but not between 1 and 3 or 2 and 3) would cause this
scenario to happen, assuming I'm understanding this correctly.

------
Maro
Here's a Paxos implementation in C++ that I wrote for my old startup's
distributed database product:

[https://github.com/scalien/scaliendb/tree/master/src/Framewo...](https://github.com/scalien/scaliendb/tree/master/src/Framework/Replication/Paxos)

If you just want to agree on a leader, there's PaxosLease:

[https://github.com/scalien/scaliendb/tree/master/src/Framewo...](https://github.com/scalien/scaliendb/tree/master/src/Framework/Replication/PaxosLease)

~~~
sargun
What was Scalien, and its purpose?

~~~
Maro
Scalien was my startup, its purpose was to sell ScalienDB and make money :)

~~~
sargun
Why didn't it work out?

~~~
Maro
1. Our selling point was: "there's a bunch of others like MongoDB out there,
but they're eventually consistent; if you really care about your data, use
ScalienDB, it uses Paxos and there are no consistency issues". But (i)
EC/Mongo/etc. were so hyped that our message wasn't heard, (ii) people didn't
really understand what we were saying, and (iii) others were making false
claims in their marketing that their product could also be tuned to be
consistent. We didn't really understand how to market this. Contrast this
with the excellent execution of Mongo.

2. We got foobared by potential investors: we focused on one group, but they
eventually walked away, and by that time we had run out of money. The other
side of this was long enterprise deal lifecycles. Trying to convince somebody
who is big enough to pay for DB software to use your alpha product takes 6-12
months; that's the same timescale on which you run out of money.

3. We did have some bugs, so we were claiming consistency/reliability, but it
was an alpha product with bugs. (Of course it was; it was just the 2 of us
writing it, and we didn't even have money to buy testing infrastructure!)

In the end we quit after ~3.5 years, after we exhausted all options, at great
personal (but not monetary) cost, e.g. I got divorced shortly after.
Fortunately we were able to exit gracefully from all business affairs.

------
Everlag
Why list all of those variants without referring to Byzantine Paxos at all?
Byzantine failure is everything except a clean end of participation; perhaps
it's worth mentioning the flavor of Paxos that can deal with it.

~~~
denizz
Author of the website here: Byzantine Paxos is fundamentally a very different
protocol from Multi-Paxos because of the failure model it supports. The
variants I included are all variants of the Multi-Paxos protocol. Because of
this, even though it includes Paxos in its name, it was not in the scope of
this work. I may extend the content of the website to cover all protocols
related to Paxos, underline how they are related and how they differ, and
include Byzantine Paxos in that discussion.

------
midko
Nitpick: Paxos is a consensus protocol, not "a protocol for state machine
replication in an asynchronous environment that admits crash failures".

Consensus can be used for SMR, atomic broadcast, and leader election. One can
argue all of these replicate a sort of state machine. However, that dilutes
the meaning of state machine replication -- replication of state with
durability and performance in mind.

------
eva1984
Nice writeup. Complex protocols like this always make me wonder: how do
people even test them?

~~~
gioele
They have these things called "proofs":
[http://cstheory.stackexchange.com/questions/25851/correctnes...](http://cstheory.stackexchange.com/questions/25851/correctness-proofs-of-classic-paxos-and-fast-paxos)

Jokes aside, there are two ways out: huge simulators, or writing very, very
simple code that matches 1-to-1 the steps of the proof. And Lamport's writing
style for proofs [1] lends itself to simple matching implementations.

[1] "How to Write a 21st Century Proof" [http://research.microsoft.com/en-us/um/people/lamport/pubs/p...](http://research.microsoft.com/en-us/um/people/lamport/pubs/proof.pdf)

------
daurnimator
[http://harry.me/blog/2014/12/27/neat-algorithms-paxos/](http://harry.me/blog/2014/12/27/neat-algorithms-paxos/)
is quite a good article for an introduction to Paxos.

Last time on HN:
[https://news.ycombinator.com/item?id=8806835](https://news.ycombinator.com/item?id=8806835)

