
In search of a simple consensus algorithm - justinjlynn
http://rystsov.info/2017/02/15/simple-consensus.html
======
ReidZB
The article abuses Kolmogorov complexity...

> When it is applied to the algorithms, it means that an algorithm with the
> shortest implementation is simpler.

That is misleading. Kolmogorov complexity is the length of the shortest
program (in a pre-defined language) that produces a given object. So, if the
_shortest program_ that produces Algorithm A is smaller than the _shortest
program_ that produces Algorithm B, then Algorithm A is less Kolmogorov-
complex ("simpler") than Algorithm B.

This does not mean you can take two existing implementations (in C, say) and
compare the implementation length and declare one is "simpler," unless you are
claiming that both implementations are as short as possible. Since Kolmogorov
complexity is not computable, that seems like a tall order.

Maybe they are right that Single-Decree Paxos is simpler (either in the sense
of Kolmogorov complexity or in some other sense, who knows), but invoking
Kolmogorov complexity here seems totally unwarranted -- it doesn't add
anything substantive.

~~~
rystsov
Thank you! I'll rework this paragraph to be correct. I wanted to make the
observation that the given data (two attempts to implement key-value storages
while keeping the length of the program as short as possible) favour Gryadka,
but of course it's wrong to make strong statements based on just one data point.

~~~
jerf
I would suggest just dropping Kolmogorov in general, and just referencing the
human idea that shorter is generally going to be simpler. I wouldn't have a
problem sticking with you through that. Sure, I can feed that to my pedant
mill as with anything else, but since it's not really the core idea I'll roll
with you on it.

~~~
ReidZB
Agreed. The section basically boils down to "despite maybe looking simpler,
Single-Decree Paxos is still tricky." I don't think something particular and
formal like Kolmogorov complexity even fits the tone there.

~~~
throwaway91111
Right. There's no strict value to higher or lower complexity in the Kolmogorov
sense, so you might as well assign the value to something more tangible (like
pseudocode terseness or something).

------
cube2222
I have the feeling that this article is just bashing strong-master consensus
protocols. But the truth is, yes, they incur a penalty for electing a master.

However, this really gets amortized in most workloads if the leader changes
only rarely. Additionally, in an environment with a good network connection
between nodes (a few ms), you can set the timeout to be much less than a few
seconds (it could actually be less than a second), so the window of
unavailability is shorter.
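As a rough sketch of that kind of tuning (illustrative only, not any real
implementation's defaults):

    // Pick an election timeout as a multiple of the observed RTT plus random
    // jitter: fast failure detection on a fast network, without spurious
    // elections caused by ordinary message delays.
    function electionTimeoutMs(observedRttMs) {
        var base = Math.max(10 * observedRttMs, 150); // stay well above a single RTT
        var jitter = Math.random() * base;            // spread candidates apart
        return base + jitter;                         // e.g. 150-300 ms when RTT is a few ms
    }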

There's another point he touches on, about the unavailability of the whole
cluster when the leader is down. Really, this doesn't depend on the protocols
but on the applications. If you have one paxos/raft group per replica, you
actually only get a small window of unavailability. Additionally, even
consistent reads do not need a living master to be possible.

It's worth reading the spanner paper to get more insight into high
availability/consistency achieved with a paxos based db implementation (
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-
osdi2012.pdf) ).

EDIT: And in my opinion, calling it a SPOF is missing the point a little bit.

~~~
rystsov
> consistent reads do not need a living master

It's wrong. If you're fine with stale reads then you don't need a living
master, but if you want a guarantee that the value you read is up-to-date then
a living master is necessary.

Etcd has a bug related to this -
[https://github.com/coreos/etcd/issues/741](https://github.com/coreos/etcd/issues/741)

~~~
cube2222
It may not be up to date at the time of actually READING the data, but it will
be up to date as of the moment the client requested it. You can actually
achieve this using timestamps. (So basically it's semi-stale data, you're
right, but you have a guarantee about the timestamp at which it was up to
date, and that timestamp can be recent.)

(Not talking about etcd, check out the spanner paper)

~~~
rescrv
In order to get a linearizable read in Spanner, you do need to have a
heartbeat from the leader of the Paxos tablet for the data you are reading in
order to know that all data has been replicated up to the time at which you
are taking your read. This is not free, but can be cheaper than active
communication. (referred to as t_safe in the paper)

~~~
cube2222
As I understood it, if you are actually reading at some point in time (let's
say 100 ms ago), then a non-leader node can check for itself whether it was in
contact with the master after that point in time and give you an answer.
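A minimal sketch of that idea (hypothetical field names, not Spanner's actual
API):

    // A follower may answer a read at timestamp t from local state as long as
    // it knows it has applied everything the leader committed up to some
    // time >= t (a watermark advanced by leader heartbeats).
    function readAt(replica, key, t) {
        if (replica.safeTime >= t) {
            return replica.store.get(key, t); // value as of timestamp t
        }
        throw new Error("not yet safe to read at " + t + "; retry or ask the leader");
    }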

------
irfansharif
It is my understanding that the motivation for seeking out consensus algorithms
with strong leaders (or equivalent) as opposed to horizontally weighted peer-
to-peer ones is the performance penalty imposed by the latter in the general
case. Structuring the protocol as a dissemination from the 'leader' node down
to the followers, as opposed to a bottom-up approach, fares better when your
leader is long-lived, circumventing the cost of determining 'consensus' every
single time. It's readily apparent that this would lead to a performance
penalty in the pathological case demonstrated here, when the leader node is
taken down repeatedly - but I'm skeptical that this is true for the workloads
that systems like coreos/etcd and cockroachdb/cockroach were intended to handle.

~~~
SEJeff
Funnily enough, there is a paper on _exactly_ what you're referring to from
Microsoft Research called "Flexible Paxos":

[https://arxiv.org/pdf/1608.06696.pdf](https://arxiv.org/pdf/1608.06696.pdf)

The gist of it is that there are realistically two quorums to be had: one for
reads/writes to the cluster, and one for leader election. Assuming leader
election is a relatively uncommon event (as it is in Paxos), you can require
more nodes for the leader-election quorum while allowing fewer nodes for the
read/write quorum. This means you get strong consistency with better performance.
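In other words (a small sketch of the quorum condition described in the paper):

    // Flexible Paxos only requires that every leader-election quorum (Q1)
    // intersects every replication quorum (Q2), i.e. |Q1| + |Q2| > N.
    // For example, with N = 5, q1 = 4 and q2 = 2 is valid: elections are
    // expensive but rare, while every write waits for just 2 acceptors.
    function quorumsValid(n, q1, q2) {
        return q1 + q2 > n;
    }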

Distributed systems are one of my favorite things. I literally spent part of
an evening with a beer reading a small handful of papers like this one. If you
want an absolute treasure trove of similar works, visit the blog of Adrian
Colyer:

[https://blog.acolyer.org/](https://blog.acolyer.org/)

He has a good writeup on the Flexible Paxos paper as well:

[https://blog.acolyer.org/2016/09/27/flexible-paxos-quorum-
in...](https://blog.acolyer.org/2016/09/27/flexible-paxos-quorum-intersection-
revisited/)

~~~
Xorlev
> I literally spent part of an evening with a beer

Just one? :)

Consensus algorithms are _fascinating_! And +1 for
[https://blog.acolyer.org/](https://blog.acolyer.org/); I'll often read his
blog over lunch.

Another great paper is
[https://research.google.com/archive/paxos_made_live.html](https://research.google.com/archive/paxos_made_live.html)
which talks about what it's like to realistically run Paxos. There's a ton of
corner cases, engineering realities, and optimizations to be done to make
Paxos work well in practice.

Another good resource is the Raft GitHub page[1] which links to the paper, has
an interactive visualization, and a plethora of talks by various people.

Raft is the backbone of open-source consensus systems. I'd be curious to hear
from any Googlers in the know whether there are deficiencies in Raft that lead
to the continued use of Paxos, or if it's experience (and already battle-tested
code) with Paxos that leads them to continue deploying Paxos-backed systems.

[1] [https://raft.github.io/](https://raft.github.io/)

~~~
SEJeff
Great comments and links! Pretty sure Google wrote Chubby before Raft existed.

Chubby paper: 2006
[http://dl.acm.org/citation.cfm?id=1298487](http://dl.acm.org/citation.cfm?id=1298487)

Raft paper: 2014
[http://dl.acm.org/citation.cfm?id=2643666](http://dl.acm.org/citation.cfm?id=2643666)

If you've got something battle-tested, there isn't a lot of value in ripping it
all up if it works within the given business requirements. Also, they figured
out how to implement Multi-Paxos, which is known for being difficult.

------
strictfp
When looking for a per-key strongly consistent but still highly available
datastore I found Hibari, which uses chain replication. Really interesting
tech which tries to solve the same problem:
[http://www.snookles.com/scott/publications/erlang2010-slf.pd...](http://www.snookles.com/scott/publications/erlang2010-slf.pdf)

Hibari does have a master orchestrator, similar to the GFS master server. But
it's only needed in reconfiguration events.
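For context, a rough sketch of the chain replication idea (illustrative only,
not Hibari's actual code):

    // Writes enter at the head and are applied node by node down the chain;
    // reads go to the tail, which only ever sees fully replicated values.
    function chainWrite(chain, key, value) {
        for (var i = 0; i < chain.length; i++) {  // head -> ... -> tail
            chain[i].store.set(key, value);       // apply, then forward to the next node
        }
        return true;                              // acked once the tail has applied it
    }

    function chainRead(chain, key) {
        return chain[chain.length - 1].store.get(key); // the tail serves reads
    }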

------
freels
I think this overstates the instability of strong leader-based consensus
algorithms, or at least over-generalizes Raft's instability to apply to all
consensus algorithms with a stronger leader.

In Raft, it's possible for multiple nodes to prevent leader election progress
by being overly aggressive when requesting votes. It is also possible for a
follower to knock out a perfectly healthy leader by being too quick to time
out the leader and start a new term.

Both of these limitations stem from the simplicity of Raft's leader election
algorithm. To compensate, most Raft implementations I've seen have more
conservative follower timeouts that extend the time to detect leader failure
and elect a new one.

It's possible for a more optimized algorithm to get sub-second latencies for
detecting and re-electing a leader, even in a high-latency (e.g. geo-
distributed) environment. In other words, well within the commit window for
the replica set based on network hop latencies.

Also, while the latency of individual writes in single-decree paxos can be
close to that of strong-leader protocols, it is non-trivial to achieve the
same level of throughput that is possible in Raft et al. when writing to an
ordered log, as you cannot start a paxos instance for a new log entry until
all prior instances have been resolved. Raft can just add new values in the
next append call (or just spam out appends for new messages without waiting
for the replies to previous ones).

IME, I'd say both single-decree paxos and raft are probably equivalent in
terms of understandability, but raft is a better base on which to build a fast
high-throughput consensus protocol.

~~~
rystsov
> cannot start a paxos instance for a new log entry until all prior instances
> have been resolved

It's wrong. In Gryadka (Single-Decree Paxos) all the keys are independent, so
it's possible to update them at the same time without blocking.
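A rough sketch of what that looks like from the client's side (hypothetical
API, not Gryadka's actual client):

    // Each key is its own single-decree Paxos register, so updates to
    // different keys run their own independent Paxos rounds instead of
    // queuing behind a single ordered log.
    async function incrementAll(client, keys) {
        await Promise.all(keys.map(function (key) {
            return client.update(key, function (v) { return (v || 0) + 1; }); // one round per key
        }));
    }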

Gryadka's throughput is comparable to Etcd's on the same type of machines
(4720 vs 5227 rps) and I never optimized for it (my goal was to fit 500
lines), so it's also wrong that it's "non-trivial to achieve the same level of
throughput" - I did it by accident.

So I don't understand why Raft is a better base to build a fast high-
throughput consensus protocol.

~~~
freels
This is true, but comes at the cost of not being able to preserve linearity
across arbitrary keys, and you are still limited by the throughput on updates
to a single key.

~~~
rystsov
Assuming linearity means atomic multi-key updates:

1. There are tasks which don't require atomic multi-key updates

2. Atomic multi-key updates can be implemented on the client side (see RAMP,
Percolator transactions or the Saga pattern)

3. Once the data outgrows one machine you need to shard the log, and at that
point you're in the same situation

~~~
freels
Fair enough, though at that point, you're layering on additional complexity
and getting further away from the goal of simplicity. (And to point 3, while
this is true, using a log means this problem can be dealt with a lot later,
and there are solutions that don't involve giving up linearizability, such as
a dedicated set of log replicas apart from data partitions, or implementing
something like Calvin.)

My main point was that the deficiencies of strong-leader-based consensus
protocols are overstated, and despite a (minor IMO) level of additional
starting complexity, a raft-like protocol is going to be quite a bit simpler
than a paxos-based protocol of equivalent capability.

------
startupdiscuss
I am surprised:

1. I am technically pretty strong but I have no idea what this paper is about

2. So many people know what this is about that it shot up to #1 on HN

Can someone give the lay, interested audience here a pointer (a link or two)
about what the field IS? Just a sort of intro guide for someone who knows
about programming and math, but has never heard the term paxos?

I am curious, and I am sure many others are as well.

edit: spelling

~~~
krenoten
These consensus algorithms are awesome! They are the backbones of many highly
available systems in modern DC's today. It's how people manage to get any
sleep at all when taking care of systems that require very reliable databases.

Raft is a consensus algorithm that is touted as being easy to understand. It
is well specified, compared to paxos, which leaves many implementation details
up to the creator. Raft, however, is fairly explicitly specified on page 4 of
this paper: [https://raft.github.io/raft.pdf](https://raft.github.io/raft.pdf)

Consensus algorithms allow us to build reliable castles out of constantly
failing servers (sand). Systems like zookeeper, etcd, chubby, consul, and
others use these algorithms to achieve high availability AND strong
consistency (linearizability, the strongest possible) despite up to a majority
of the cluster failing.

~~~
krenoten
Whoops, I meant to say "up to the largest minority failing"; if a majority is
lost then the jig is up! Too late to edit!

------
rdtsc
> For example, instead of using a distributed log as an Event Sourcing backend
> for building a key/value storage as an interpretation of an ordered stream
> of updates we can run a distributed log instance per key.

That is a key insight. I often wonder if people who implemented Raft and
discarded Paxos right off the bat knew this? Also I think "Paxos Made Live"
scared everyone away from Paxos for a long time.

But what is often missing is that Google implemented a distributed log system.
Paxos doesn't do a distributed log by default and just deals with reaching
consensus on a value. In practice there is often a need for a log, but not
always. If a distributed log is not needed, Paxos becomes less scary.

------
thezviad
Good to see more leaderless consensus protocol implementations, and that Raft
isn't the be-all and end-all of all consensus problems.

One advantage of leader-based consensus algorithms not mentioned in the
article is the ability to more easily implement read leases for faster reads.

Read leases can allow for fresh reads without having to run them through the
quorum protocol, by trading off availability (due to leader failure) and also
correctness in certain edge cases (since read leases depend on the ability of
individual machines to measure time deltas with reasonable accuracy, which may
not hold in some weird VM scenarios).
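A rough sketch of that trade-off (hypothetical fields, no particular system):

    // While the leader still holds a lease according to its own clock, it can
    // answer reads from local state without a quorum round trip; otherwise it
    // has to fall back to the usual quorum read (or re-acquire the lease).
    function leaseRead(leader, key, nowMs) {
        if (nowMs < leader.leaseExpiresAtMs - leader.maxClockDriftMs) {
            return leader.store.get(key); // fresh read, no quorum needed
        }
        return leader.quorumRead(key);    // lease expired or clocks untrusted
    }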

------
billsmithaustin
Maybe Aphyr can test it out between whiteboard interviews.

------
tschottdorf
I don't think that gryadka is correct, see [http://tschottdorf.github.io/if-
its-not-paxos-its-probably-w...](http://tschottdorf.github.io/if-its-not-
paxos-its-probably-wrong-gryadka).

~~~
rystsov
I responded in the comments: [http://tschottdorf.github.io/if-its-not-paxos-
its-probably-w...](http://tschottdorf.github.io/if-its-not-paxos-its-probably-
wrong-gryadka#comment-3221871145). The analysis is based on a false assumption
about the read operation.

------
agentultra
I've been curious to know whether one could express performance constraints in
a model -- or at least probabilistic bounds -- and have the checker invalidate
a design that steps over them.

------
rescrv
Paxos's complexity is often overstated. I made a simple implementation of
Paxos for a key-value store as a proof of concept to demonstrate how you can
simplify Paxos by removing leader election, multi-master management, and log
garbage collection.

Here's my blog post on the issue:
[http://hack.systems/2017/03/13/pocdb/](http://hack.systems/2017/03/13/pocdb/)

The entire implementation is 1100 lines of code including comments.

~~~
iainmerrick
How heavily have you tested it? I haven't tried implementing Paxos myself, but
anecdotally, it's very hard to get it completely right. And when it's being
used as key low-level infrastructure, it has to be completely right.

This blog post we're discussing agrees: "I planned to finish it in a couple of
days, but the whole endeavor lasted a couple of months, the first version has
consistency issues, so I had to mock the network and to introduce fault
injections to catch the bugs. [...] Single Decree Paxos seems simpler than
Raft and Multi-Paxos but remains complex enough to spend months of weekends in
chasing consistency."

Your approach of using many small Paxos instances is very interesting, though!
I'd love to see some real world performance comparisons.

~~~
rescrv
Indeed pocdb is not intended to be a fully "ready" implementation. There are
likely liveness bugs (that can be mitigated by periodically calling
work_state_machine on every write outstanding), but it is unlikely to have
safety bugs stemming from protocol misunderstandings (any safety bug is likely
a small typo).

My main projects using Paxos are Replicant[1] and Consus[2], both of which
have consumed significantly more time to get Paxos correct.

[1] [https://github.com/rescrv/replicant](https://github.com/rescrv/replicant)
[2] [http://consus.io/](http://consus.io/)

------
grogers
It's strange that I can't find anything on implementing a CAS register this
way. It seems like a relatively straightforward combination of single-decree
paxos and the ABD algorithm. Of course, reasoning about distributed algorithms
is never simple, and this still isn't...

~~~
rystsov
I described it in this post [http://rystsov.info/2015/09/16/how-paxos-
works.html](http://rystsov.info/2015/09/16/how-paxos-works.html)

With single-decree Paxos you can write a storage which provides the following
API:

    function changeQuery(key, change, query) {
        var value = this.db.get(key);
        value = change(value);
        this.db.commit(key, value);
        return query(value);
    }

By providing different implementations of change & query you can achieve
different behaviour, including CAS.
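For example, a compare-and-set could be expressed like this (a sketch built on
the API above):

    function cas(storage, key, expected, desired) {
        var result = storage.changeQuery(
            key,
            function (value) { return value === expected ? desired : value; }, // swap only on match
            function (value) { return value; }                                 // return what was committed
        );
        // true if the swap took effect (this sketch ignores the corner case
        // where the old value already equals `desired`)
        return result === desired;
    }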

------
XorNot
Interesting. I wonder if the etcd API would survive being implemented this
way?

~~~
justinjlynn
Something like that:
[https://github.com/rystsov/perseus/blob/master/gryadka/src/e...](https://github.com/rystsov/perseus/blob/master/gryadka/src/etcd-
like-api.js)

------
felixgallo
I've been implementing epaxos for a few years, slowly. It's decidedly not
simple in recovery, but Iulian has been available and kind.

------
hyperpape
Why test one store with wrk, and the other with a JS client? How do you know
the load testing framework isn't skewing the results?

~~~
rystsov
It can only skew the results for the worse, so I don't care about the true
numbers as long as the lower bound is good enough.

------
forgotpwtomain
I didn't get anything out of this blog-post except a bunch of numbers based on
the default parameters (such as leader election timeout) of various systems.
Here is the link to the EPaxos paper it purports to discuss though:

[https://www.cs.cmu.edu/~dga/papers/epaxos-
sosp2013.pdf](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf)

------
kerkeslager
Maybe I'm missing it because I'm on my phone, but where's the code?

~~~
justinjlynn
The implementation they talk about (gryadka):
[https://github.com/gryadka/js](https://github.com/gryadka/js)

------
partycoder
I wonder what the author's thoughts are on ZAB.

~~~
rystsov
I don't have a deep understanding of how ZAB works, but based on the
documentation (initLimit) and prior experience with ZooKeeper, ZAB has the
same issues as Raft when a leader dies.

Folks from Elastifile demonstrated this in their Bizur paper -
[https://arxiv.org/abs/1702.04242v1](https://arxiv.org/abs/1702.04242v1)

------
coldpie
Does every technical article need to be scattered with Impact font memes now?

