
Raft Consensus Algorithm - edward
http://raftconsensus.github.io/
======
joshuak
I use etcd[1] which is an implementation of Raft for cluster configuration. I
have a hard time seeing how Raft or consensus in general scales to a general
solution for reliable consistency services.

One cannot scale by adding servers, because all potential members must be
known, and consistency is only guaranteed if you are willing to require
confirmations, forcing full round trip communication with all members for
every write. Anything more than 7 servers or so and things start getting
dodgy. Trying to improve reliability by adding servers in geographically
diverse areas is untenable altogether.

I don't see how simply having multiple servers with failover and a version
clock isn't equally or even more reliable, given its scalability. The
Byzantine generals problem[2] is more about reliable voting and trust than
about high performance and reliability.

I'd very much appreciate more insight on this.

1\. [https://github.com/coreos/etcd](https://github.com/coreos/etcd)

2\.
[http://en.wikipedia.org/wiki/Byzantine_fault_tolerance](http://en.wikipedia.org/wiki/Byzantine_fault_tolerance)

~~~
scottdw2
A few things:

1\. You don't have to round trip to every member before acking a write
request, only to a majority of them.

2\. If you use the right data structures for your state machine (like
persistent data structures), you don't have to fully commit a write before you
start the next write. That is, as long as you can quickly revert to a
previously committed state (which you can with persistent data structures, it
just becomes a CAS on a pointer) you can pipeline requests and batch commits.
That lets you start on write n+1 using the state machine results of write n,
even though write n hasn't been committed yet. You can then commit a bunch of
results (n, n-1, n-2, ...) as a single batch. That does create a dependency:
some write j may be rolled back because a write i < j failed to commit, but
you would have had that dependency anyway (you wouldn't have even started
write j unless write i had committed). So, although the latency of a single
request is bounded by the time to replicate a log entry to n/2 + 1 hosts, the
throughput is not bounded by the latency of a single write.

The Raft paper doesn't actually say you can do this (it says things like
"don't apply an entry to the state machine until it has been committed"), but
that's because it assumes arbitrary side-effecting state machines. However,
with some restrictions (like using persistent data structures), you can relax
that requirement. Relaxing that requirement can give quite good throughput.

3\. You can horizontally scale raft the same way you horizontally scale any
other distributed data store: using consistent hashing. You would use a hash
ring of raft clusters.

4\. You can lose any minority of hosts (fewer than n/2) from a raft cluster
and it will still work.

5\. Raft is "multiple servers with failover and a version clock". In fact it's
"multiple servers with failover, a version clock, and strong sequential
consistency".
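
Point 2 above can be illustrated with a minimal single-process sketch. This is not from the Raft paper or any real implementation; the class and method names are made up for illustration. Writes are applied speculatively onto immutable snapshots, a batch commit is one pointer swap, and rollback is the same swap in reverse:

```python
# Sketch only: a "persistent" (immutable) key/value state machine where
# reverting to the last committed state is a single reference swap.
class SpeculativeKV:
    def __init__(self):
        self.committed = {}                 # last committed snapshot
        self.speculative = self.committed   # tip of the write pipeline

    def apply(self, key, value):
        # Start write n+1 on top of write n's (possibly uncommitted) result.
        # The old snapshot is never mutated, so it stays valid for rollback.
        self.speculative = {**self.speculative, key: value}

    def commit_batch(self):
        # Entries n, n-1, ... replicated to a majority: the "CAS on a
        # pointer" that publishes the whole batch at once.
        self.committed = self.speculative

    def rollback(self):
        # Some write in the pipeline failed to replicate: discard every
        # uncommitted write by swapping back to the committed snapshot.
        self.speculative = self.committed

sm = SpeculativeKV()
sm.apply("x", 1)
sm.apply("y", 2)      # pipelined; neither write is committed yet
sm.commit_batch()     # both commit as one batch
sm.apply("z", 3)
sm.rollback()         # replication failed: "z" is discarded
print(sm.committed)   # -> {'x': 1, 'y': 2}
```

In a real system the committed/speculative references would be swapped with an atomic compare-and-swap, but the shape of the idea is the same: because old snapshots are immutable, rollback is O(1) regardless of how many writes were in flight.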

~~~
joshuak
Thanks; what confuses me though is why you must retain a majority of hosts,
and can't recover, at least in many cases, from a loss down to a single host.
I understand that a known majority ensures that a write will always
persist/propagate even after a network segmentation and later re-join. But
what if the network isn't segmented but the servers are simply down, or there
wouldn't be a conflict regardless because the relevant clients couldn't have
written to the segmented partition anyway? When they rejoin they would all
know they are behind the one that remained online (as one does with version
clocks), so they could just fetch all missing events and be consistent again.

The majority requirement would, it seems to me, reduce reliability compared
with the ability to fail over down to a single server.

Does anyone know of a list or test suite of all the failures a consistency
algorithm is resistant to? Then at least I could understand the reasoning. I
get the impression that there are a lot of failure modes that are handled at
some cost but are not relevant to my use cases.

~~~
scottdw2
In a leader election, a node won't vote for a candidate whose log is less up
to date than its own. Thus, writing to a majority ensures that when the
leader dies, the new leader will have all the committed data, and any
minority of nodes can be lost with a guarantee that no committed data is
lost.
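
The guarantee follows from quorum overlap: any two majorities of the same cluster share at least one node, so an election quorum always includes at least one node from every write quorum. A quick brute-force check (plain Python, purely illustrative):

```python
from itertools import combinations

def majority(n):
    # Smallest majority of an n-node cluster, e.g. 3 of 5.
    return n // 2 + 1

# In a 5-node cluster, every possible write quorum intersects every
# possible vote quorum, so a newly elected leader has heard from at
# least one node that holds each committed entry.
n = 5
nodes = range(n)
for write_quorum in combinations(nodes, majority(n)):
    for vote_quorum in combinations(nodes, majority(n)):
        assert set(write_quorum) & set(vote_quorum)
print("all pairs of majorities overlap for n =", n)
```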

If you don't need strong consistency (after a write commits, all future reads
will see the write), you can use simpler replication strategies.

Redis, for example, uses async replication, which is not guaranteed to
succeed. A Redis master may ack a write, and a subsequent read may see it,
but a loss of the master before replication occurs can result in data loss;
the failover target is not guaranteed to contain all writes. Sometimes weak
consistency is ok. Sometimes it's not, but eventual consistency is ok. Other
times strong consistency is needed.

Raft is useful when you need strong consistency.

Raft does support membership changes (adding and removing nodes from the
cluster), so it can support losing more than n / 2 - 1 nodes, just not
simultaneously.
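
The arithmetic behind "just not simultaneously" can be sketched like this (the helper names are made up for illustration, not etcd's API):

```python
def majority(n):
    # Smallest majority of an n-node cluster.
    return n // 2 + 1

def has_quorum(alive, cluster_size):
    # The cluster can elect a leader and commit writes only while a
    # majority of the configured membership is reachable.
    return alive >= majority(cluster_size)

# 5-node cluster: losing 2 nodes at once still leaves a quorum (3 of 5).
assert has_quorum(3, 5)
# Losing a 3rd node simultaneously would stall the cluster (2 < 3)...
assert not has_quorum(2, 5)
# ...but if the membership is first reconfigured down to the 3 survivors,
# one further failure is tolerable (2 of 3 is still a majority).
assert has_quorum(2, 3)
print("sequential losses survive where simultaneous losses would not")
```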

~~~
joshuak
(click) Now I finally get it! Thanks so much, very helpful.

------
alexnewman
Having implemented raft a couple of times now, I'm convinced it's too high
level and not powerful enough at the same time. I think I'm going back to two
phase commit for the data and s.d.p for providing fault tolerance to the
coordinator. Treode has a great implementation of something like that.

~~~
oh_sigh
Raft's been around for only about two years, and you've implemented it
multiple times already?

~~~
Empact
Confirmed, Alex Newman writes distributed databases in his spare time.

[https://github.com/posix4e](https://github.com/posix4e)

[https://github.com/cloud-software-foundation/c5](https://github.com/cloud-software-foundation/c5)

[https://www.linkedin.com/in/posix4e](https://www.linkedin.com/in/posix4e)

~~~
oh_sigh
Well, that makes pretty good sense then. Alex, I'd love to hear your thoughts
on fleshing out the comment you made above. Do you have a blog?

~~~
alexnewman
I don't write a blog, but you can bug me on IRC! delamancha on efnet

------
striking
"In my experience, all distributed consensus algorithms are either:

"1: Paxos,

"2: Paxos with extra unnecessary cruft, or

"3: broken. "

— Mike Burrows (possibly)

Edit: I think the reason this exists is best explained in this blog post:
[http://kellabyte.com/2013/05/09/an-alternative-to-paxos-the-raft-consensus-algorithm/](http://kellabyte.com/2013/05/09/an-alternative-to-paxos-the-raft-consensus-algorithm/).
It's not that Raft provides more than Paxos; the goal is to write something
that's simpler and easier to implement. The article here doesn't actually
explain this, and doesn't compare it meaningfully with Paxos, making it
difficult to reason about.

~~~
einarvollset
Paxos was published after the first randomized consensus algorithms, and has
been shown to be a variation on the same principles, so I'm not sure the
Paxos worship is all that warranted...

(Please read my bio before responding..)

~~~
striking
Certainly Paxos is not the be-all-end-all of randomized consensus algorithms.
However, what algorithm do you think we should worship?

~~~
DSingularity
Hmmm... Ben-Or's randomized consensus algorithm? Can't beat the simplicity of
that one...

~~~
einarvollset
Yes! In fact it was published before Paxos, studiously ignored by Lamport,
and shown to be equivalent.

------
gtirloni
Relevant previous discussions:
[https://hn.algolia.com/?query=raft%20consensus&sort=byPopula...](https://hn.algolia.com/?query=raft%20consensus&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

------
amelius
Lecture here:
[https://www.youtube.com/watch?v=YbZ3zDzDnrw](https://www.youtube.com/watch?v=YbZ3zDzDnrw)

