

Raft: Understandable Distributed Consensus - otoolep
http://thesecretlivesofdata.com/raft/

======
PeterisP
Horrible presentation - the enforced pauses (even hiding the continue button
during them!) cause frustration after _every_ _single_ _sentence_.

Why can't I read at a normal pace instead of being interrupted all the time
and having to wait while the next sentence is shown?

[edit] This behavior would be very suitable if I were making a presentation
to an audience with this content - but it's quite contrary to what's needed
when the audience views the content themselves at their own pace.

~~~
annnnd
Well, I like it - a lot.

It would be even nicer if the "Continue" button had a permanent position (and
if I could use enter/space/pagedown/... instead of the mouse), but I didn't
notice that it was hidden between animations. I guess I am slower than you
are. :)

I am not sure if the concept is valid (some other comments have issue with
that), but it was well presented. Good job, OP - keep it up!

~~~
lambda
You can use the arrow keys to navigate, which is a lot better than clicking
"Continue" each time.

~~~
lobster_johnson
Arrow keys don't help much on a tablet or a phone.

------
Illniyar
So this site is using a custom-built library called "playback.js":

[https://github.com/benbjohnson/playback.js](https://github.com/benbjohnson/playback.js)

Looks interesting.

~~~
restalis
I accidentally clicked twice on "Continue" and there wasn't any means to go
back and read the content I missed. This slide-show player wants to be clean
and simple, and ends up with holes in its functionality.

~~~
evrenesat
While this won't help on touch-screen devices, the left arrow goes back.

------
pepijndevos
In the example with 5 nodes and a split, it is my understanding that the two
nodes can't elect a leader.

While the candidate in the smaller split receives votes from a majority of
that split, there is no true majority, so no leader: each node knows the
total cluster size from its configuration, and a majority is counted against
that total, not against the nodes currently reachable.

What could happen is that an already elected leader continues to think it's
the leader for a while, while the rest of the cluster elects a new leader.
The split-off leader will however fail to commit its log entries, and will
throw them away once it rejoins.

Another important detail that's missing is that a node only votes once per
term, and only for a candidate whose term is equal to or higher than its own.
It will never vote twice or vote for an outdated node.

Changing the configuration is in fact handled in a special way at the end of
the raft paper in a way that avoids split-brain.

[edit] Oh, the 2-node split in fact already contained the leader, so it does
exactly what I described. Dur...
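
To make the voting rule concrete, here is a rough sketch in Go (hypothetical
names, not from any real implementation): the quorum is computed from the
configured cluster size, and a node remembers at most one vote per term.

    type Node struct {
        currentTerm int
        votedFor    string // "" means no vote cast in currentTerm
        clusterSize int    // fixed configuration, not the reachable count
    }

    // quorum is a strict majority of the *configured* cluster size.
    func (n *Node) quorum() int { return n.clusterSize/2 + 1 }

    // handleRequestVote grants at most one vote per term, and only to
    // candidates whose term is at least as high as our own.
    func (n *Node) handleRequestVote(candidateID string, candidateTerm int) bool {
        if candidateTerm < n.currentTerm {
            return false // never vote for an outdated candidate
        }
        if candidateTerm > n.currentTerm {
            n.currentTerm = candidateTerm
            n.votedFor = "" // new term, the single vote is available again
        }
        if n.votedFor != "" && n.votedFor != candidateID {
            return false // already voted in this term
        }
        n.votedFor = candidateID
        return true
    }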

------
FredericJ
It's important to note that this consensus model only works if all nodes are
honest.

~~~
deathanatos
Given that nodes are trusted and communicate securely, why would a node be
dishonest? (I'm thinking about this in a standard server setting: is your
point that you can't use it for distributed peer-to-peer stuff on random user
machines?)

~~~
the_schwarz
A "dishonest" node doesn't necessarily have to be explicitly malicious, it can
be simply faulty.

~~~
elbenshira
Lots of research has gone into this, under the name Byzantine fault tolerance.

[http://en.wikipedia.org/wiki/Byzantine_fault_tolerance](http://en.wikipedia.org/wiki/Byzantine_fault_tolerance)

------
grogers
Can someone explain to me how Raft differs from Viewstamped Replication? From
reading both papers (VR revisited), it looks like Raft just renamed all of
VR's nomenclature without changing anything significant. Paxos is fairly
different, since it only relies on a distinguished leader for guaranteed
progress; it "works" without one. Under the hood the mechanism is still
similar, though, as opposed to something like chain replication.

~~~
Muzzaf
See section 10 of the Raft paper.

> VR uses a leader-based approach with many similarities to Raft.

> However, Raft has less mechanism than VR or ZooKeeper because it minimizes
> the functionality in non-leaders. For example, log entries in Raft flow in
> only one direction: outward from the leader in AppendEntries RPCs. In VR log
> entries flow in both directions (leaders can receive log entries during the
> election process); this results in additional mechanism and complexity

> Raft has fewer message types than any other algorithm for consensus-based
> log replication that we are aware of. For example, we counted the message
> types VR and ZooKeeper use for basic consensus and membership changes
> (excluding log compaction and client interaction, as these are nearly
> independent of the algorithms). VR and ZooKeeper each define 10 different
> message types, while Raft has only 4 message types (two RPC requests and
> their responses).

------
toolslive
Raft itself is very nice, and the paper
[https://ramcloud.stanford.edu/wiki/download/attachments/1137...](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)
does an excellent job of explaining it, but I have some problems with this
claim: "A user study with 43 students at two universities shows that Raft is
significantly easier to understand than Paxos: after learning both algorithms,
33 of these students were able to answer questions about Raft better than
questions about Paxos".

Honestly, what I think happened is this: they first explained Paxos to the
poor students, then asked questions, and in a later session explained Raft
and asked questions. Couldn't it be that the students kept processing the
problem of distributed consensus between the sessions, and so got a better
grasp of the topic? This would mean Paxos helped them understand Raft better.
Anyway, I'm nitpicking, etc.

~~~
sbhat7
From section 9.1 of the paper:

> Each student watched one video, took the corresponding quiz, watched the
> second video, and took the second quiz. About half of the participants did
> the Paxos portion first and the other half did the Raft portion first in
> order to account for both individual differences in performance and
> experience gained from the first portion of the study. We compared
> participants’ scores on each quiz to determine whether participants showed
> a better understanding of Raft.

~~~
toolslive
This seems to be a good strategy but it is difficult to factor out the quality
of the explanations.

The reason I'm so picky about this claim is that before you know it, you have
mythical pseudo-statistical claims like "some programmers are more than 10
times as good as others" that take on a life of their own. CS has way too
many of those.

------
hardwaresofton
For those who are encountering distributed systems for the first time, it
might also do you some good to look up (or use as search fodder) "Paxos".

If you want to dive even deeper, "FLP" and "Leslie Lamport" should also open
up a can of interesting worms.

------
regularfry
Can't this lose commits? It looks like the message to commit the log entry to
the followers happens _after_ the message to the client to say that the commit
is confirmed. The client can't actually tell when their commit has been
replicated. If the leader dies before sending that confirmation to the
followers, the client will end up thinking the new leader has a commit which
it's going to have to roll back.

Is this something the full algorithm handles differently to the way the
diagrams would indicate?

~~~
skew
The response to the client is only sent after a majority of followers have the
log entry. That's described in the text in the "Protocol Overview", and nicely
animated in "Log Replication".

~~~
regularfry
Yes, they have the log entry, but it can still be rolled back, can't it?
Here's how I understand the process:

1. Client sends log entry to leader

2. Leader appends log entry, forwards it to followers

3. Majority of followers confirm

4. Leader commits the log entry

5. Leader confirms the commit to the client

6. Followers commit on the next heartbeat

What happens if the leader goes away between 5 and 6? To my eyes, it looks
like the followers will time out, elect a new leader, and have to roll back
the last log entry.
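
For reference, here are steps 2-5 sketched from the leader's side in Go
(hypothetical names; no retries, timeouts or term checks):

    type Entry struct{ Data string }

    // Follower stands in for the AppendEntries RPC to one follower.
    type Follower interface {
        AppendEntries(log []Entry) bool // true once the entry is persisted
    }

    type Leader struct {
        log         []Entry
        commitIndex int // highest log index known to be committed
        followers   []Follower
        clusterSize int
    }

    func (l *Leader) quorum() int { return l.clusterSize/2 + 1 }

    // Propose covers steps 2-5 above; step 6 (followers advancing their
    // own commit index) only happens on a later AppendEntries/heartbeat.
    func (l *Leader) Propose(e Entry) bool {
        l.log = append(l.log, e) // step 2: append locally, then replicate
        acks := 1                // the leader counts itself
        for _, f := range l.followers {
            if f.AppendEntries(l.log) { // step 3: count confirmations
                acks++
            }
        }
        if acks < l.quorum() {
            return false // no majority: nothing is committed yet
        }
        l.commitIndex = len(l.log) - 1 // step 4: commit on the leader
        return true                    // step 5: confirm to the client
    }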

~~~
DanWaterworth
If an entry has been replicated to a majority of followers, then the new
leader is guaranteed to have that entry and therefore it won't be rolled back.

~~~
regularfry
How does the new leader know it went to a majority? The only entity which
could confirm that is the old leader.

~~~
Muzzaf
That is correct. The solution to this is given in section 5.4.1 (election
restriction), section 5.4.2 (Committing entries from previous terms) and
section 8 (Client interaction) of the Raft paper.

Roughly, a newly elected leader will have all committed entries (guaranteed
by the "election restriction", 5.4.1) but it does not know precisely which are
committed. The new leader will commit a no-op log entry (section 8) and after
it has received replies from a majority of the cluster it will know which
entries have already been committed.
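
Sketched in Go (hypothetical names; Propose stands for the ordinary
replicate-to-a-majority-then-commit step):

    // onElected: a freshly elected leader's log already contains every
    // committed entry (election restriction), but the leader does not yet
    // know its commitIndex. Committing one no-op entry from its own term
    // settles it: everything up to and including the no-op is committed.
    func (l *Leader) onElected() {
        noop := Entry{} // carries no state-machine command
        if l.Propose(noop) { // replicate to a majority, then commit
            // l.commitIndex now also covers all entries inherited from
            // previous terms.
        }
    }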

~~~
regularfry
Ooh, thanks. The no-op commit is particularly interesting.

------
blutoot
This may be a bit off-topic, but I fail to understand why the top 2 textbooks
on Distributed Computing - Tanenbaum and Coulouris - don't have a dedicated
section on consensus algorithms. I learned distributed computing from
Tanenbaum and can't recall encountering the topic.

Contrary to some of the folks here, I found the presentation very cool. But
that may be because I'm a slow learner.

~~~
tigeba
You may want to check out "Reliable and Secure Distributed Programming"; it
has a few chapters on consensus.

------
lclarkmichalek
In the network partition example, you say that in the smaller partition,
changes cannot be committed because they cannot be replicated to the majority
of nodes (as the smaller partition is... smaller). How is the partition to
know this? The system can't tell the difference between a node leaving the
network and a node undergoing a (temporary) partition.

To give an example, say I have n machines in datacenter A, and n*.99 in
datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now
reject all (EDIT: where reject = not commit) requests until a human comes
along to tell it that datacenter A isn't coming back?

~~~
sbhat7
> To give an example, say I have n machines in datacenter A, and n*.99 in
> datacenter B. datacenter A gets destroyed, permanently. Does datacenter B
> now reject all (EDIT: where reject = not commit) requests until a human
> comes along to tell it that datacenter A isn't coming back?

In CAP terms, you are choosing CP with Raft. So yes, the system is
unavailable until an external agent fixes it. In other words, the system
needs to have a majority of nodes online to be "available".

~~~
lclarkmichalek
What would happen if nodes were to be added to each side of a network
partition (unknown to the other side), so that each side believed they had a
majority? Or is the "writing" side of the partition determined at partition
time, and not changed until they are restored?

~~~
sbhat7
To add a new node to the network, it

* needs to have the same data as the other nodes

* needs a round of Raft to announce its presence to the other nodes

So you can only add new nodes (automatically) when you have a 'live' system.

majority = ceil((2n + 1)/2) = n + 1: so by comparing the number of reachable
nodes in its partition against the configured cluster size, a node can figure
out whether it is in the majority or minority cluster.

See section 6 in the paper for details of its implementation.
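
In code form (a sketch; note that the cluster size comes from configuration,
never from how many nodes happen to be reachable):

    // With 2n+1 configured nodes, a majority is n+1.
    func quorum(clusterSize int) int { return clusterSize/2 + 1 }

    // A node counts the peers it can reach (plus itself) against the
    // fixed threshold; the threshold itself is never recomputed.
    func inMajorityPartition(reachable, clusterSize int) bool {
        return reachable >= quorum(clusterSize)
    }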

------
coreymgilmore
Very cool intro website.

Is this used in production somewhere already? Would love to hear more of the
details about use cases and deployment.

~~~
stonith
Raft?

It's used in etcd, Consul, Serf, and probably more.

~~~
mitchellh
Just a nitpick here, with the qualification that I'm one of the authors of
Serf: Serf doesn't use Raft. Serf is masterless, and the distributed
messaging protocol it uses is SWIM (a gossip protocol).

~~~
stonith
Apologies, thanks for the correction.

------
pandatigox
They've written a lot on the subject. For anyone who prefers arrow keys over
the 'Continue' button:
[https://speakerdeck.com/search?q=raft](https://speakerdeck.com/search?q=raft)

~~~
polskibus
The arrow key works for me (in Chrome at least).

------
rwinn
So what happens if a network partition occurs where both sides can elect a new
leader?

~~~
pepijndevos
This can't happen. You can't divide a cluster in half and still have a
majority. See my other post.

~~~
rwinn
Ah, OK, so all nodes need to be known beforehand.

~~~
knyt
Yeah, and any changes to the set of nodes (adding or removing a node from the
cluster) must be agreed upon by a majority of the existing nodes.

------
pit
Holy cow: social media is a means for distributing consensus.

------
fiatjaf
How do you do these animations?

------
EGreg
Sex.

------
stfp
Awesome

------
thrownaway2424
Where's the skip intro button on this horror show?

~~~
thrownaway2424
[https://ramcloud.stanford.edu/wiki/download/attachments/1137...](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)

------
Illniyar
If I've understood the presentation correctly, Raft is a master-slave protocol
that determines how to choose a master.

It basically relies on random chance (i.e. who receives the message first) to
elect a master, has basically no real way of resolving a tied election (i.e.
if two nodes receive the same number of votes, we re-elect ad infinitum), and
does not address the situation of two nodes having conflicting sets of data
(for instance from a network partition).

Considering all that, this protocol doesn't seem very interesting (from a use-
case point of view).

~~~
deathanatos
> and does not address the situation of two nodes having conflicting sets of
> data (for instance from a network partition).

I believe it does address this. Each log entry is either committed or not; an
entry can only be committed if it has been replicated to a majority of nodes.
Any node that lacks a committed entry cannot be elected master because of the
election rules: a node will not vote for another node less complete than
itself. Since a committed entry has been replicated to a majority, a node
lacking that entry cannot receive a majority of the votes. (Thus the committed
log entries will always be the same on all nodes (though some may be behind,
and may only have a subset), which is the purpose of the protocol.)
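
The "less complete than itself" test is made on the last log entry (the
paper's election restriction, section 5.4.1); roughly, in Go (hypothetical
names):

    // A voter grants its vote only if the candidate's log is at least as
    // up-to-date as its own: a higher last term wins; on equal last
    // terms, the longer log wins.
    func candidateLogIsCurrent(candLastTerm, candLastIndex,
        myLastTerm, myLastIndex int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }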

> It basically relies on random chance (i.e. who receives the message first)
> to elect a master, has basically no real way of resolving a tied election

This is mostly true. The PDF slides I link to below recommend that the
election timeout be much greater than the broadcast time, the idea being that
things should work out in the long run.
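
Raft also randomizes each node's election timeout so that repeated split
votes become unlikely; a sketch, using the 150-300ms range suggested in the
paper:

    import (
        "math/rand"
        "time"
    )

    // Each node waits a random interval in [150ms, 300ms) before starting
    // an election; the stagger makes it likely that one candidate asks
    // for votes and wins before anyone else times out.
    func electionTimeout() time.Duration {
        return 150*time.Millisecond +
            time.Duration(rand.Int63n(150))*time.Millisecond
    }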

Highly recommend the PDF slides here, as they explain it better than I can:
[https://ramcloud.stanford.edu/~ongaro/userstudy/](https://ramcloud.stanford.edu/~ongaro/userstudy/)
— there's also a YouTube talk here:
[https://www.youtube.com/watch?v=YbZ3zDzDnrw](https://www.youtube.com/watch?v=YbZ3zDzDnrw)

> Considering all that, this protocol doesn't seem very interesting (from a
> use-case point of view).

I'd love to hear of alternatives.

~~~
Illniyar
> I believe it does address this. Each log entry is either committed or not;
> an entry can only be committed if it has been replicated to a majority of
> nodes. Any node that lacks a committed entry cannot be elected master
> because of the election rules: a node will not vote for another node less
> complete than itself. Since a committed entry has been replicated to a
> majority, a node lacking that entry cannot receive a majority of the votes.
> (Thus the committed log entries will always be the same on all nodes (though
> some may be behind, and may only have a subset), which is the purpose of the
> protocol.)

What happens if (for instance) a 4-node cluster splits into two 2-node
clusters (i.e. a network fault between two data centers)? Does each cluster
choose a leader? How is "majority" calculated? Is the Raft protocol unable to
handle half of its nodes being taken down? What happens if two clusters break
off, both choose a leader (if that's possible), both get new writes, and then
both clusters come back together?

> I'd love to hear of alternatives.

I know of no other protocols per se, but for implementations of a
master-slave protocol there's Mongo's replica-set algorithm (one notable
difference is that each node can have a priority).

There are also master-master implementations (such as Cassandra's) that
require no election and serve, IMO, more interesting use-cases.

~~~
Muzzaf
> What happens if (for instance) a 4-node cluster splits into two 2-node
> clusters (i.e. a network fault between two data centers)? Does each cluster
> choose a leader?

A Raft cluster must have an odd number of nodes.

> How is "majority" calculated?

ceil(nodes/2).

> Is the Raft protocol unable to handle half of its nodes being taken down?
> What happens if two clusters break off, both choose a leader (if that's
> possible), both get new writes, and then both clusters come back together?

They cannot each choose a leader; see above.

~~~
Flenser
> A Raft cluster must have an odd number of nodes.

What about a 7-node cluster splitting 3 / 3 / 1?

~~~
Muzzaf
Not sure I understand. A node in each split cluster would need at least 4
votes to be elected leader. Hence no node can be elected leader since all
split clusters have strictly fewer than 4 nodes.

Theorem. With 2n + 1 nodes, there can not be two separate majorities after a
net split.

Proof. By way of contradiction, assume there _are_ two separate majorities.
Each separate majority would contain at least ceil((2n + 1)/2) = n + 1 nodes.
This implies that there are in total at least 2(n + 1) = 2n + 2 nodes in the
system, contradiction.
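
The same arithmetic, checked mechanically in Go:

    package main

    import "fmt"

    func main() {
        // Two disjoint majorities of a (2n+1)-node cluster would need at
        // least 2(n+1) = 2n+2 members, one more node than exists.
        for n := 0; n < 1000; n++ {
            total, majority := 2*n+1, n+1
            if 2*majority <= total {
                panic("two disjoint majorities fit") // never reached
            }
        }
        fmt.Println("no split yields two majorities")
    }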

