In search of a simple consensus algorithm (rystsov.info)
230 points by justinjlynn 247 days ago | 83 comments

The article abuses Kolmogorov complexity...

> When it is applied to the algorithms, it means that an algorithm with the shortest implementation is simpler.

That is misleading. Kolmogorov complexity is the length of the shortest program (in a pre-defined language) that produces a given object. So, if the shortest program that produces Algorithm A is shorter than the shortest program that produces Algorithm B, then Algorithm A is less Kolmogorov-complex ("simpler") than Algorithm B.

This does not mean you can take two existing implementations (in C, say) and compare the implementation length and declare one is "simpler," unless you are claiming that both implementations are as short as possible. Since Kolmogorov complexity is not computable, that seems like a tall order.

Maybe they are right that Single-Decree Paxos is simpler (either in the sense of Kolmogorov complexity or in some other sense, who knows), but invoking Kolmogorov complexity here seems totally unwarranted -- it doesn't add anything substantive.

Thank you! I'll rework this paragraph to be correct. I wanted to make the observation that the given data (two attempts to implement key-value storage while keeping the length of the program as short as possible) favour Gryadka, but of course it's wrong to make strong statements based on just one data point.

I would suggest dropping Kolmogorov in general, and just referencing the human intuition that shorter is generally going to be simpler. I wouldn't have a problem sticking with you through that. Sure, I can feed that to my pedant mill like anything else, but since it's not really the core idea I'll roll with you on it.

Agreed. The section basically boils down to "despite maybe looking simpler, Single-Decree Paxos is still tricky." I don't think something particular and formal like Kolmogorov complexity even fits the tone there.

Right. There's no strict value to higher or lower complexity in the Kolmogorov sense, so you might as well attach the value to something more tangible (like pseudocode terseness or something).


The problem isn't just the incomputability of Kolmogorov complexity, but the fact that Kolmogorov complexity applies only to finite strings, or to things that can be meaningfully mapped to them. In particular, Kolmogorov complexity doesn't apply directly to abstract algorithms or to programs with multiple implementations.

The concept of Kolmogorov complexity can be extended to infinite strings, see e.g. L. Staiger's _The Kolmogorov complexity of infinite words_ (https://www.sciencedirect.com/science/article/pii/S030439750...).

related: "Program-Size Complexity Computes the Halting Problem", https://www.cs.auckland.ac.nz/research/groups/CDMTCS/researc...

Kolmogorov complexity is not computable in general, but it is decidable for a substantial subset of programs.

Really? Do you have an example of a substantial set of programs where that holds? I don't see a priori how it could work unless you can do exhaustive search over the space of all programs and determine if they halt and return the right answer.

Formally, there are quite a few that are quite simple to show. For instance, for the string 'a', the shortest Python program which can produce this string is obviously print('a'), since 'a' has only one character.

Additionally, if you limit your language to a recursive language, you can compute complexity (which is no longer Kolmogorov complexity) directly. Simply begin enumerating all programs in order of length and then run them, checking the output to see if it is that string. While this is by no means efficient, it works for recursive languages since their programs must halt. For Kolmogorov complexity, there isn't really a notion of inputs, merely that some particular string should be produced as output.

For recursively enumerable (i.e. Turing complete) languages, the former method will not work because a program might run forever.

Interesting. This way you get an upper bound on the Kolmogorov complexity of any given string. Do you know how good this approximation is in the worst/best/average case? This sounds like the kind of question that would have been investigated.

The hard part here is not finding a set of programs that always halt (primitive recursive functions can compute anything you'd ever want to run on large inputs), but proving that they are correct (I'm pretty sure equivalence of primitive recursive functions is undecidable).

Edit: but if someone gives you the primitive recursive Kolmogorov complexity of a program, you can check it by running all shorter programs on all inputs until you find a counterexample for each of them. So it is semi-decidable.

Edit to edit: This would even work for the general Turing-machine definition of Kolmogorov complexity.

> The hard part here is not finding a set of programs that always halt [...], but proving that they are correct (I'm pretty sure equivalence of primitive recursive functions is undecidable).

This may be true, but:

If you use primitive recursive functions (instead of Turing-complete ones) for practicality, by the same reasoning you can cap the inputs at some insanely large number. Then these functions still "can compute anything you'd ever want to run on large inputs".

In that setting, equivalence is decidable, because you can simply run both functions over the finite set of all possible inputs.

I have the feeling that this article is just bashing strong-master consensus protocols. But the truth is, yes, they incur a penalty for electing a master.

However, this really gets amortized in most workloads if the leader changes only rarely. Additionally, in an environment with a good network connection between nodes (a few ms), you can set the timeout to much less than a few seconds (less than a second, actually); this way you get a shorter unavailability.

There's another point he touches on, about the unavailability of the whole cluster when the leader is down. Really, this isn't dependent on the protocols, but on the applications. If you have one paxos/raft group per replica, you actually only get a small unavailability. Additionally, even consistent reads do not need a living master to be possible.

It's worth reading the spanner paper to get more insight into high availability/consistency achieved with a paxos based db implementation ( https://static.googleusercontent.com/media/research.google.c... ).

EDIT: And in my opinion, calling it a SPOF is missing the point a little bit.

> consistent reads do not need a living master

It's wrong. If you're fine with stale reads then you don't need a living master, but if you want a guarantee that the read value is up-to-date then a living master is necessary.

Etcd has a bug related to this - https://github.com/coreos/etcd/issues/741

It may not be up to date at the time of actually READING the data, but it will be up to date as of the moment the client requested it. You can actually achieve this using timestamps. (So basically it's semi-stale data, you're right, but you have a guarantee about the timestamp at which it was up to date, and that timestamp can be recent.)

(Not talking about etcd, check out the spanner paper)

In order to get a linearizable read in Spanner, you do need to have a heartbeat from the leader of the Paxos tablet for the data you are reading in order to know that all data has been replicated up to the time at which you are taking your read. This is not free, but can be cheaper than active communication. (referred to as t_safe in the paper)

As I understood it, if you are actually reading at some point in time (let's say 100 ms ago), then the non-leader node can check for itself whether it was in contact with the master after that point in time, and give you an answer.

TrueTime is a very special case. I have not seen anyone else with the hardware infrastructure required to provide guarantees related to wall-clock time. Spanner also trades off transaction latency to create the wall-clock guarantees, so it's not magically free.

It isn't, I haven't said it is. The point is, there are disadvantages you are writing about, that aren't really disadvantages of the protocol itself, but disadvantages of the common implementations.

Yes, that's true: if all the communication between agents goes through the database, then the Monotonic Reads consistency level is equivalent to linearizability.

If there is communication outside the database, then this level of consistency isn't enough.

When the leader is not around, isn't a majority read sufficient?

It depends on the implementation. If you have a replication system that generates a monotonic sequence ID for each update, and you have all writes block on a majority of replicas acking an update, then you can just read from a majority and pick the result with the highest monotonic identifier. This will not be consistent unless the leadership mechanism demands that leadership is chosen based on the same criteria.

> If you have one paxos/raft group per replica, you actually only get a small unavailability

It isn't a small unavailability, it's an unavailability of the whole replica. If you don't have a lot of data then it's the whole cluster :)

Of course, we can introduce something like virtual replicas and eventually end up with a replica per key, which is almost Single-Decree Paxos but with an overhead for log compaction and snapshotting.

Well, the unavailability is around 1/(number of machines in cluster) with a sufficient amount of replicas. That said, it's small relative to a whole-cluster unavailability and with small leader timeouts, it gets much better than you described in your article.

Btw, thanks for answering my comments actively, I really appreciate it!

Just fyi, you were presumably thinking of incur rather than infer.

Thank you, I'm not a native speaker :)

It is my understanding that the motivation in seeking out consensus algorithms with strong leaders (or equivalent) as opposed to horizontally weighted peer-to-peer ones is due to the performance penalty imposed by the latter in the general case. Structuring the protocol to be a dissemination from the 'leader' node down to the followers as opposed to a bottom-up approach fares better when your leader is long-lived, circumventing the cost of determining 'consensus' every single time. It's readily apparent that this would lead to a performance penalty in the pathological case, as is demonstrated here, when the leader node is taken down repeatedly - but I'm skeptical that this is true for workloads that systems like coreos/etcd, cockroachdb/cockroach were intended to handle.

Funnily enough, there is a paper on exactly what you're referring to from Microsoft Research called "Flexible Paxos":


The gist of it is that there are realistically 2 quorums to be had: one for reads/writes to the cluster, and one for leader election. Assuming leader election is a relatively uncommon event (as it is in Paxos), you can require more nodes for the leader-election quorum while allowing fewer nodes for the read/write quorum. This means you get strong consistency with better performance.

Distributed systems are one of my favorite things. I literally spent part of an evening with a beer reading a small handful of papers like this one. If you want an absolute treasure trove of similar works, visit the blog of Adrian Colyer:


He has a good writeup on the Flexible Paxos paper as well:


> I literally spent part of an evening with a beer

Just one? :)

Consensus algorithms are _fascinating_! And +1 for https://blog.acolyer.org/ , I'll often read his blog over lunch.

Another great paper is https://research.google.com/archive/paxos_made_live.html which talks about what it's like to realistically run Paxos. There's a ton of corner cases, engineering realities, and optimizations to be done to make Paxos work well in practice.

Another good resource is the Raft GitHub page[1] which links to the paper, has an interactive visualization, and a plethora of talks by various people.

Raft is the backbone of open source. I'd be curious to hear from any Googlers in the know whether there are deficiencies in Raft that lead to the continued use of Paxos, or whether it's experience (and already battle-tested code) with Paxos that leads them to continue deploying Paxos-backed systems.

[1] https://raft.github.io/

Great comments and links! Pretty sure Google wrote Chubby before Raft existed.

Chubby paper: 2006 http://dl.acm.org/citation.cfm?id=1298487

Raft paper: 2014 http://dl.acm.org/citation.cfm?id=2643666

If you've got something battle-tested, there isn't a lot of value in ripping it all up if it works within the given business requirements. Also, they figured out how to implement Multi-Paxos, which is known for being difficult.

Btw, Gryadka uses an idea of non-standard-size quorums to change the configuration of the cluster, see http://rystsov.info/2016/01/05/raft-paxos.html#details1. I came up with this idea independently of Howard when I misread the Vertical Paxos paper :)

yup, I've been closely following Heidi's work for some time now - non-intersecting consensus groups, still hoping someone beats me to a Go implementation so I don't have to. +1 for Adrian Colyer's blog, constantly amazed at the sheer volume and depth of his analyses.

Ah got it. I was thinking, "Wow, this is literally someone describing the Flexible Paxos work. I wonder if they have read the paper?"

Distributed systems learning ftw

The raft paper states that its use of an explicit leader election was chosen for understandability, not performance. Leaderless systems can potentially improve performance, especially when replicas are far apart (so you don't have the extra network hop to the leader before you can get started). The main drawback IMHO to ePaxos (as a hopefully representative leaderless algorithm) is not performance but its complexity, since it intrudes into the application to track dependencies and conflicts between proposed commands.

If I squint at the scheme proposed here, it looks like ePaxos taken to the extreme of a single key/value per paxos group, so conflict tracking becomes trivial. It has demonstrated good performance in the case of uncontended writes; I'd be curious to see how it behaves under contention, or when a downed node rejoins the cluster.

> It is my understanding that the motivation in seeking out consensus algorithms with strong leaders (or equivalent) as opposed to horizontally weighted peer-to-peer ones is due to the performance penalty imposed by the latter in the general case.

Yeah, that was my understanding as well but without having read the paper in question (as of yet) I'm entirely short on the particular technical details of the trade-offs.

I didn't read the whole thing carefully, but... Leader Election + Terminating Reliable Broadcast is practically equivalent to building the Weakest Failure Detector and is the most feasible way of doing so. So, there are good reasons for doing leader election. This is why almost every serious consensus system ends up doing it whether they know they need to or not.

What is the performance penalty you're talking about?

Gryadka does 1 roundtrip to write a value and its performance (4720 rps, 1.68ms latency) is very similar to Etcd (5227 rps, 1.55ms).

The performance penalty is on reads. If you elect a leader with a timeout you can do strong reads without re-establishing consensus.

Etcd used this scheme to provide fast reads but Aphyr demonstrated that it may lead to stale reads https://github.com/coreos/etcd/issues/741.

If you can't afford stale reads then you should ask Etcd to wait for confirmation from the majority of followers before acknowledging a read with ?quorum=true. It makes Etcd do 1 round trip for reads (just like Gryadka).

The same is applicable to other products (you can't rely on time in distributed systems unless you're Google; consequently, you can't rely on read leases).

So there is no performance penalty.

etcd didn't implement leases; it just assumed that the last Raft election was still good.

You can elect a leader for a set period of time (say, 10 seconds) and serve strong reads for a lesser period of time (5 seconds) if you have reasonable assumptions about how well your local oscillators work and can avoid jumps. If you don't trust your local clock to any level of accuracy, why do you trust your local CPU?

Cockroach DB does this correctly. https://github.com/cockroachdb/cockroach/blob/master/docs/de...

Making implicit assumptions about the environment is wrong.

An instance may be running in a virtual environment where time freezes are possible. A human may make an error and roll back time to 1970.

It's impossible to eliminate all these factors, so yes, I don't trust time but I do trust the CPU.

Maybe CockroachDB is doing it correctly but the terrible default settings make this optimisation negligible because when the leader dies the system hangs for 12 seconds.

When looking for a per-key strongly consistent but still highly available datastore, I found Hibari, which uses chain replication. Really interesting tech which tries to solve the same problem: http://www.snookles.com/scott/publications/erlang2010-slf.pd...

Hibari does have a master orchestrator, similar to GFS master server. But it's only needed in reconfiguration events.

I think this overstates the instability of strong leader-based consensus algorithms, or at least over-generalizes Raft's instability to apply to all consensus algorithms with a stronger leader.

In Raft, it's possible for multiple nodes to prevent leader-election progress by being overly aggressive when requesting votes. It is also possible for a follower to knock out a perfectly healthy leader by being too quick to time out the leader and start a new term.

Both of these limitations stem from the simplicity of Raft's leader election algorithm. To compensate, most Raft implementations I've seen have more conservative follower timeouts that extend the time to detect leader failure and elect a new one.

It's possible for a more optimized algorithm to get sub-second latencies for detecting and re-electing a leader, even in a high-latency (e.g. geo-distributed) environment. In other words, well within the commit window for the replica set based on network-hop latencies.

Also, while the latency for individual writes in single-decree paxos can be closer to strong leader protocols, it is non-trivial to achieve the same level of throughput that is possible in Raft et al when writing to an ordered log, as you cannot start a paxos instance for a new log entry until all prior instances have been resolved. Raft can just add new values in the next append call (or just spam out appends for new messages w/o waiting for the replies for previous ones).

IME, I'd say both single-decree paxos and raft are probably equivalent in terms of understandability, but raft is a better base on which to build a fast high-throughput consensus protocol.

> cannot start a paxos instance for a new log entry until all prior instances have been resolved

It's wrong. In Gryadka (Single-Decree Paxos) all the keys are independent, so it's possible to update them at the same time without blocking.

Gryadka's throughput is comparable to Etcd on the same type of machines (4720 vs 5227 rps), and I never optimized for it (my goal was to fit in 500 lines), so it's also wrong that it's "non-trivial to achieve the same level of throughput" - I did it by accident.

So I don't understand why Raft is a better base to build a fast high-throughput consensus protocol.

This is true, but comes at the cost of not being able to preserve linearity across arbitrary keys, and you are still limited by the throughput on updates to a single key.

Assuming by linearity you mean atomic multi-key updates:

1. There are tasks which don't require atomic multi-key updates

2. Atomic multi-key updates can be implemented on the client side (see RAMP, Percolator transactions or the Saga pattern)

3. Once the data outgrows the size of one machine you need to shard the log, and at that point you're in the same situation

Fair enough, though at that point, you're layering on additional complexity and getting further away from the goal of simplicity. (And to point 3, while this is true, using a log means this problem can be dealt with a lot later, and there are solutions that don't involve giving up linearizability, such as a dedicated set of log replicas apart from data partitions, or implementing something like Calvin.)

My main point was that the deficiencies of strong-leader-based consensus protocols are overstated, and despite a (minor IMO) level of additional starting complexity, a raft-like protocol is going to be quite a bit simpler than a paxos-based protocol of equivalent capability.

I am surprised:

1. I am technically pretty strong but I have no idea what this paper is about

2. So many people know what this is about that it shot up to #1 on HN

Can someone give the lay, interested audience here a pointer (a link or two) about what the field IS? Just a sort of intro guide for someone who knows about programming and math, but has never heard the term paxos.

I am curious, and I am sure many others are as well.

edit: spelling

Leslie Lamport, Paxos inventor, describes it thus in the abstract of the original paper:

"[Paxos] provides a new way of implementing the state-machine approach to the design of distributed systems."

Original Paxos paper (written in a somewhat whimsical style):


Follow up paper to explain Paxos more simply:


See also: https://en.wikipedia.org/wiki/Paxos_(computer_science)

Wikipedia is pretty good. The page for "consensus algorithm" has links for Paxos and Raft.

The basic idea is how to coordinate multiple independent agents over an unreliable network. For example, multiple servers trying to manage a shared database. Lots of HN people know about this because it's a key building block of reliable and scalable cloud computing.

These consensus algorithms are awesome! They are the backbones of many highly available systems in modern DC's today. It's how people manage to get any sleep at all when taking care of systems that require very reliable databases.

Raft is a consensus algorithm that is touted as being easy to understand. It is well specified, compared to paxos, which leaves many implementation details up to the creator. Raft, however, is fairly explicitly specified on page 4 of this paper: https://raft.github.io/raft.pdf

Consensus algorithms allow us to build reliable castles out of constantly failing servers (sand). Systems like zookeeper, etcd, chubby, consul, and others use these algorithms to achieve high availability AND strong consistency (linearizability, the strongest possible) despite up to a majority of the cluster failing.

Whoops, I meant to say "up to the largest minority failing", if a majority is lost then the gig is up! Too late to edit!

It's basically the topic of distributed computing and algorithms for the coordination of distributed systems.

This is a great introductory course about it in my opinion: https://www.coursera.org/learn/cloud-computing

Then it's just reading papers like the one you mentioned, "The Part-Time Parliament" (Paxos).

> For example, instead of using a distributed log as an Event Sourcing backend for building a key/value storage as an interpretation of an ordered stream of updates we can run a distributed log instance per key.

That is a key insight. I often wonder if people who implemented Raft and discarded Paxos right off the bat knew this? Also I think "Paxos Made Live" scared everyone away from Paxos for a long time.

But what is often missing is that Google implemented a distributed log system. Paxos doesn't do a distributed log by default; it just deals with reaching consensus on a value. In practice there is often a need for a log, but not always. If a distributed log is not needed, Paxos becomes less scary.

Good to see more leaderless consensus protocol implementations, and that Raft isn't the be-all and end-all of all consensus problems.

One advantage of leader-based consensus algorithms not mentioned in the article is the ability to more easily implement read leases for faster reads.

Read leases can allow for fresh reads without having to run them through the quorum protocol, by trading off availability (due to leader failure) and also correctness in certain edge cases (since read leases depend on the ability of individual machines to measure time deltas with reasonable accuracy, which may not hold in some weird VM scenarios).

Maybe Aphyr can test it out between whiteboard interviews.

I don't think that gryadka is correct, see http://tschottdorf.github.io/if-its-not-paxos-its-probably-w....

I responded in the comments: http://tschottdorf.github.io/if-its-not-paxos-its-probably-w.... The analysis is based on a false assumption about the read operation.

I've been curious to know if one could encode performance constraints -- or at least probabilistic bounds -- in a model and have the checker invalidate a design that steps over them.

Paxos's complexity is often overstated. I made a simple implementation of Paxos for a key-value store as a proof of concept to demonstrate how you can simplify Paxos by removing leader election, multi-master management, and log garbage collection.

Here's my blog post on the issue: http://hack.systems/2017/03/13/pocdb/

The entire implementation is 1100 lines of code including comments.

How heavily have you tested it? I haven't tried implementing Paxos myself, but anecdotally, it's very hard to get it completely right. And when it's being used as key low-level infrastructure, it has to be completely right.

This blog post we're discussing agrees: "I planned to finish it in a couple of days, but the whole endeavor lasted a couple of months, the first version has consistency issues, so I had to mock the network and to introduce fault injections to catch the bugs. [...] Single Decree Paxos seems simpler than Raft and Multi-Paxos but remains complex enough to spend months of weekends in chasing consistency."

Your approach of using many small Paxos instances is very interesting, though! I'd love to see some real world performance comparisons.

Indeed pocdb is not intended to be a fully "ready" implementation. There are likely liveness bugs (that can be mitigated by periodically calling work_state_machine on every write outstanding), but it is unlikely to have safety bugs stemming from protocol misunderstandings (any safety bug is likely a small typo).

My main projects using Paxos are Replicant[1] and Consus[2], both of which have consumed significantly more time to get Paxos correct.

[1] https://github.com/rescrv/replicant [2] http://consus.io/

I think this is caused mainly by the fact that Leslie Lamport first wrote "The one time parliment", which was said to be complex because of the language he used there. I think this reputation is the main cause, as there are now a lot of materials available for understanding Paxos.

If you mean, "The Part-time Parliament," it's indeed a little complex.

It's worth reading his description of the paper to understand why: http://lamport.azurewebsites.net/pubs/pubs.html#lamport-paxo...

In short, not everyone has the same sense of humor.

I know it was for fun, but it did make the algorithm substantially more complex to understand.

"Paxos Made Simple" is much easier in comparison.

Gryadka is less than 500 lines but supports membership change :) Still, 3 hours on it is quite impressive.

JS vs sparse C++? Cut comments and change coding style and you can easily get close to 500 lines.

One thing I think is worth touching on is the pipelining ability of multi-paxos that is missing when you have a single register with the Synod protocol. For key-value operations, it's not a problem. For a true replicated state machine, this can hinder performance.

In Replicant[1] I added pipelining to ensure Paxos was unlikely to be a bottleneck to replicated state machines. For complex state machines, the state machine becomes CPU bound.

[1] https://github.com/rescrv/replicant

It would still lack "Consistent get operations", "Paxos group changes", be "bound to localhost" and do more than 1 round trip to commit.

What do you mean by pipelining?

It's strange that I can't find anything on implementing a CAS register this way. It seems like a relatively straightforward combination of single-decree paxos and the ABD algorithm. Of course, reasoning about distributed algorithms is never simple, and this still isn't...

I described it in this post http://rystsov.info/2015/09/16/how-paxos-works.html

With single-decree Paxos you can write a storage which provides the following API:

    function changeQuery(key, change, query) {
        var value = this.db.get(key);
        value = change(value);
        this.db.commit(key, value);
        return query(value);
    }

By providing different implementations of change & query you can achieve different behaviour, including CAS.

Interesting. I wonder if the etcd API would survive being implemented this way?

I've been implementing epaxos for a few years, slowly. It's decidedly not simple in recovery, but Iulian has been available and kind.

Why test one store with wrk, and other with a js client? How do you know the load testing framework isn't skewing the results?

It can only skew the results for the worse, so I don't care about the true numbers as long as the lower bound is sufficient.

I didn't get anything out of this blog-post except a bunch of numbers based on the default parameters (such as leader election timeout) of various systems. Here is the link to the EPaxos paper it purports to discuss though:


Maybe I'm missing it because I'm on my phone, but where's the code?

The implementation they talk about (gryadka): https://github.com/gryadka/js

I wonder what the author's thoughts on ZAB are.

I don't have a deep understanding of how ZAB works but based on the documentation (initLimit) and prior experience with ZooKeeper, ZAB has the same issues as Raft when a leader dies.

Folks from Elastifile demonstrated it in their Bizur paper - https://arxiv.org/abs/1702.04242v1

Does every technical article need to be scattered with Impact font memes now?
