
Neat Algorithms: Paxos - hbrundage
http://harry.me/blog/2014/12/27/neat-algorithms-paxos/?hn=1
======
krat0sprakhar
This is up for debate but this[0], IMHO, is pretty much the gold standard of
explaining distributed algorithms.

[0] -
[http://thesecretlivesofdata.com/raft/](http://thesecretlivesofdata.com/raft/)

~~~
nacs
Thanks for the presentation, it helps a lot.

Quick question though. On this slide (
[http://i.imgur.com/m02CMxx.png](http://i.imgur.com/m02CMxx.png) ) it shows a
network split condition and shows how the 2 split networks will eventually
negotiate and the 3 node split wins because it had a majority while the 2 node
side's uncommitted changes are thrown out.

What happens if the split happens right down the middle (3 active nodes on
each side instead of the 2 and 3)? Wouldn't both sides elect leaders that both
have majorities with committed data?

------
ahelwer
Software engineers love Paxos because it takes something very complex (a
distributed system) and makes it equivalent to working with a single machine:
you only ever talk to the leader. It gives you redundancy at the expense of
performance.

Paxos is used to achieve something called Strong Consistency, where each node
sees the same messages in the same order. If you think of each node as a
deterministic state machine, they are guaranteed to end up in the same state
after responding to the same sequence of messages. It's nice and intuitive,
but requiring global synchronization on every write is terrible for
performance.
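A minimal sketch of the state-machine argument above (illustrative only, not a real Paxos node): as long as every replica applies the same operations in the same order, deterministic replicas are guaranteed to end in the same state.

```python
class Counter:
    """A trivially deterministic state machine."""
    def __init__(self):
        self.value = 0

    def apply(self, op):
        # Each operation is ("add", n) or ("set", n).
        kind, n = op
        if kind == "add":
            self.value += n
        elif kind == "set":
            self.value = n

# The message order a consensus protocol like Paxos would agree on:
log = [("set", 10), ("add", 5), ("add", -3)]

replicas = [Counter() for _ in range(3)]
for r in replicas:
    for op in log:       # same ops, same order...
        r.apply(op)

assert all(r.value == 12 for r in replicas)  # ...same final state
```

The expensive part in a real system is agreeing on that log order globally, which is exactly the synchronization cost mentioned above.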

Other consistency schemes exist. A popular one is Eventual Consistency, where
writes are made immediately at any node (not just the leader) and the system
is expected to synchronize in the background and "converge" to the same state.
However, this can result in merge conflicts: if you're editing a document in
collaboration with other users, what if you edit a word in a paragraph while
another user deletes that entire paragraph? Does the system resolve this
automatically, or require user assistance? The answer to this question varies
according to system requirements. I think most HN users have experienced the
joys of resolving merge conflicts.

A newer model is something called Strong Eventual Consistency, which is
similar to Eventual Consistency but merge conflicts are impossible by design:
every update to the system must be commutative, associative, and idempotent
with other updates. It is not always possible to design your system this way.
These systems are implemented with Conflict-Free Replicated Data Types (or ad-
hoc equivalents) and have excellent liveness/throughput/performance
characteristics compared to Strong Consistency.
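To make the commutative/associative/idempotent requirement concrete, here is a sketch of a grow-only set (G-Set), one of the simplest CRDTs. Its merge operation is set union, which has all three properties, so replicas can sync in any order, any number of times, and still converge:

```python
class GSet:
    """Grow-only set CRDT: elements can be added but never removed."""
    def __init__(self):
        self.items = set()

    def add(self, x):          # local update at any replica
        self.items.add(x)

    def merge(self, other):    # background sync with another replica
        self.items |= other.items

a, b = GSet(), GSet()
a.add("x")
b.add("y")

a.merge(b); b.merge(a)         # merge order doesn't matter (commutative)
a.merge(b)                     # repeating a merge changes nothing (idempotent)
assert a.items == b.items == {"x", "y"}
```

Deletion is where it gets harder: supporting removal without merge conflicts requires richer designs such as two-phase sets or OR-Sets.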

CRDTs are not as simple as Paxos. You're forced out of the cozy one-system
world and your system must deal with two nodes concurrently holding different
values. For most applications, magic Paxos dust is all you need. For others,
CRDTs are an excellent tool.

~~~
steventhedev
I strongly suggest Shapiro's paper[0] on CRDTs. In a nutshell, the only really
problematic data types are sequences such as arrays or strings. There are some
specialized approaches specifically for those, and you can always fall back on
a LWW conflict resolution.

In general, I like to think of Paxos as an approach that uses LWW for all
value types.

[0] - [http://pagesperso-
systeme.lip6.fr/Marc.Shapiro/papers/RR-695...](http://pagesperso-
systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf)
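A sketch of the last-writer-wins (LWW) fallback mentioned above, as an LWW register: each write carries a (timestamp, node id) tag, and merge simply keeps the value with the greatest tag. (Hypothetical class, just to illustrate the idea.)

```python
class LWWRegister:
    """Single-value register where the latest tagged write wins."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.value = None
        self.tag = (0, node_id)   # (timestamp, node_id); node id breaks ties

    def write(self, value, timestamp):
        self.value = value
        self.tag = (timestamp, self.node_id)

    def merge(self, other):
        # Keep whichever write has the greater tag; ties are impossible
        # across nodes because node ids differ.
        if other.tag > self.tag:
            self.value, self.tag = other.value, other.tag

a, b = LWWRegister("A"), LWWRegister("B")
a.write("old", timestamp=1)
b.write("new", timestamp=2)
a.merge(b); b.merge(a)
assert a.value == b.value == "new"  # later write wins on both replicas
```

The price of LWW is that the "losing" concurrent write is silently discarded, which is why sequences usually get more specialized treatment.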

~~~
ahelwer
Shapiro's name is on about seven different CRDT papers :) It makes citations
difficult for the Wikipedia article. Personal opinion: this[0] 2011 paper is
probably the best one to read, where Shapiro's and Baquero's teams finally
joined forces and put out a good comprehensive paper on the subject. The one
you linked focuses a bit too heavily on TreeDoc at the expense of a good
treatment of CRDT theory. Their survey of known CRDTs[1] is also worth reading.

[0]
[https://hal.inria.fr/file/index/docid/609399/filename/RR-768...](https://hal.inria.fr/file/index/docid/609399/filename/RR-7687.pdf)

[1]
[https://hal.inria.fr/inria-00555588/document](https://hal.inria.fr/inria-00555588/document)

~~~
steventhedev
Oops. I had meant to link the survey. His lecture[0] on the subject is quite
approachable, for those who prefer visuals to papers.

[0] [http://youtu.be/ebWVLVhiaiY](http://youtu.be/ebWVLVhiaiY)

------
emin-gun-sirer
This well-illustrated post is technically about the core Synod agreement
protocol in Paxos. Building a consistent distributed service on top requires
additional scaffolding and infrastructure. Typically, people layer on a system
that implements a "replicated state machine (RSM)" on top, which maintains the
illusion of a single consistent object, even though it is composed of
distributed replicas.
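A sketch of that layering (hypothetical names, not OpenReplica's API): the RSM runs one single-value agreement (Synod) instance per log slot, and every replica then applies the chosen commands in slot order, preserving the illusion of a single consistent object.

```python
def agree(slot, proposed_command):
    # Stand-in for a full Synod round among the replicas for this slot;
    # here the proposal is simply "chosen".
    return proposed_command

def apply(state, cmd):
    # A deterministic command interpreter over a key/value state.
    op, key, arg = cmd
    if op == "set":
        state[key] = arg
    elif op == "incr":
        state[key] = state.get(key, 0) + arg

commands = [("set", "x", 1), ("incr", "x", 1), ("incr", "x", 1)]
log = {slot: agree(slot, cmd) for slot, cmd in enumerate(commands)}

# Each replica replays the agreed log in slot order and sees the same
# single logical object.
state = {}
for slot in sorted(log):
    apply(state, log[slot])
assert state == {"x": 3}
```

The real scaffolding (leader leases, gap filling, log truncation, reconfiguration) is exactly what the extra infrastructure mentioned above has to provide.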

Also keep in mind that Raft, Zab, and Viewstamped Replication (in reverse
chronological order) are alternatives to the Synod protocol in Paxos. These
protocols differ from Paxos by employing a different leader-election mechanism
and a slightly different way of maintaining their invariants.

There have been many Paxos variants. This site [1] shows the various Paxos
variants over a timeline and points out their contributions.

Those of you interested in building replicated state machines using Paxos
should take a look at OpenReplica [2]. It is a full Multi-Paxos implementation
that takes any Python object and makes it distributed and fault-tolerant, like
an RPC package on steroids.

[1] [http://paxos.systems/](http://paxos.systems/)

[2] [http://openreplica.org/faq/](http://openreplica.org/faq/)

~~~
ahelwer
It looks like you are one of the developers of OpenReplica?

~~~
emin-gun-sirer
Yes, it's an open-source project from my research group at Cornell.

------
amelius
Also interesting: [1]

> Raft is a consensus algorithm that is designed to be easy to understand.
> It's equivalent to Paxos in fault-tolerance and performance. The difference
> is that it's decomposed into relatively independent subproblems, and it
> cleanly addresses all major pieces needed for practical systems. We hope
> Raft will make consensus available to a wider audience, and that this wider
> audience will be able to develop a variety of higher quality consensus-based
> systems than are available today.

[1] [https://raftconsensus.github.io/](https://raftconsensus.github.io/)

~~~
SamReidHughes
However, see [http://www.cl.cam.ac.uk/techreports/UCAM-CL-
TR-857.pdf](http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-857.pdf) about
Raft's pitfalls in environments where the network is being difficult.

------
ash
> Honest-to-goodness real-life implementations of Paxos can be found at the
> heart of … Google’s magnificent Spanner database…

I'm not sure about Spanner and Paxos. Sebastian Kanthak said during his Google
Spanner talk:

"If you've been to the Raft talk this morning, our Paxos implementation is
actually closer to the Raft algorithm than to what you'd read in the Paxos
paper, which is… if you haven't read it, don't read it, it's horrible." (at
7:43)

[http://www.infoq.com/presentations/spanner-distributed-
googl...](http://www.infoq.com/presentations/spanner-distributed-google)

------
Animats
Nice animations.

How do you keep a broken or hostile node from advancing the sequence number to
the end of the sequence number space?

There's an algorithm from one of Barbara Liskov's grad students at MIT which
fixes this, but it seems to require one more message per cycle.
([http://pmg.csail.mit.edu/~castro/thesis.pdf](http://pmg.csail.mit.edu/~castro/thesis.pdf))
That paper later appears as a Microsoft Research paper on how to make an NFS-
like file system with this consensus protocol. Did Microsoft ever put that in
a product?

------
lordnacho
Having looked for a few minutes, it really reminds me of the routing protocols
used for distributing routes in networks. (Also Layer 2 stuff IIRC). There you
also find heartbeats, elections, etc.

Is there a connection?

Also, does it have anything to do with the Byzantine Generals Problem?

~~~
ahelwer
Heartbeats (is this system up?) and leader election (which server should I
talk to?) are common components of any distributed system. The Byzantine
Generals Problem is quite different: whereas Paxos gets a _majority_ of nodes
(a quorum) to agree on a certain value, Byzantine Generals deals with
_unanimous_ agreement among the loyal nodes when some of the generals can be
traitors actively working against the others. Unanimous agreement isn't all
that useful or interesting on its own, but production distributed systems
should handle Byzantine failures. These cover innocent events such as
corrupted data/messages and software bugs all the way up to hackers trying to
take over your system through a compromised node. Variants of Paxos exist
which handle or mitigate Byzantine failures.

~~~
neilc
> production distributed systems should handle Byzantine failures

I don't think this is true, at least not as stated. You should certainly be
thinking about more than crash-stop failures, but full-blown Byzantine fault
tolerance is rarely warranted in practice. Empirically, the number of systems
that use non-Byzantine vs. Byzantine agreement is probably on the order of
100:1.

~~~
ahelwer
Good point. Depends on your threat model. Most systems can get away without
full-blown Byzantine fault tolerance. You're probably right about the 100:1
ratio for production systems.

------
subbu
The animations are a bit fast. It would've been great if the reader could
control them; that would make it easier to learn from.

------
_almosnow
> Side note: it’s important that no two proposers ever use the same sequence
> number, and that they are sortable, so that they truly reference only one
> proposal, and precedence between proposals can be decided using a simple
> comparison.

They are moving the core problem into a different domain. Worst explanation of
Paxos ever... nice animations though.

Edit: 'Worst explanation' is just an exaggeration, obv. It is nice, but it
doesn't explain some really important issues.

~~~
zerker2000
The requirement is that they are sortable, nothing about the numbers
reflecting the actual order of proposals (inasmuch as uncommitted actions in a
distributed system can be considered to be ordered). Concatenating system time
and node number upholds this property: 4:01-A and 4:01-B produced by nodes A
and B respectively are distinct and numerically sortable, and both in turn
"predate" 4:13-C attempted later.
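The concatenation scheme described above maps naturally onto tuple comparison. A sketch: a proposal number is the pair (local timestamp, node id), which Python compares lexicographically, so numbers from different nodes are always distinct and totally ordered.

```python
def proposal_number(timestamp, node_id):
    # Tuples compare element by element: timestamp first, then node id
    # as the tiebreaker, guaranteeing no two proposers share a number.
    return (timestamp, node_id)

p1 = proposal_number(401, "A")   # "4:01-A"
p2 = proposal_number(401, "B")   # "4:01-B": same clock, tie broken by id
p3 = proposal_number(413, "C")   # "4:13-C": a later attempt

assert p1 != p2 and p1 < p2      # distinct and sortable
assert p2 < p3                   # both "predate" the later proposal
```

In practice the timestamp is usually a monotonically increasing local counter rather than wall-clock time, since clocks can go backwards; only sortability and uniqueness matter for correctness.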

------
fitshipit
> Know Paxos? Stealth-mode big data startup is hiring founding engineer

lol

