
Scaling Raft - tadasv
http://www.cockroachlabs.com/blog/scaling-raft/
======
ccleve
Yup, this is the key problem with these consensus algorithms. They don't
handle shards very well. Most systems these days either punt and don't do
updates via the consensus algorithm (Apache Kafka), or they use two-phase
commit (FoundationDB, many others).

This is important work, but I don't think they're really going to succeed in
building a truly large-scale system. The problem is the heartbeat: you can't
have every node talking to every other node every few seconds.

What we really need is someone smart to come up with a consensus algorithm
that doesn't need a heartbeat. Until then, it's two-phase commit if you want a
reliable, large-scale (if not performant) system.

Right now, if you want a system that runs fast you use a consensus algorithm
for shard metadata, but you do writes directly to the nodes without getting a
consensus first. You run the risk of losing acknowledged writes, but it's the
best you can do if you need speed.

~~~
bdarnell
(author of the original post here) Raft can be pretty relaxed about
heartbeats. It doesn't, strictly speaking, require every node to talk to
everyone else. All you really need is some health signal about the other nodes
so you A) don't call an election while the leader is still alive and B) call
an election promptly when a leader disappears. As a first step, we've reduced
our heartbeat traffic from one per group to one per node, and we can reduce it
further (e.g. by having each node send heartbeats only to a subset of its
peers and sharing the results with the other nodes).
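
The per-node coalescing described above can be sketched roughly like this (a toy model, not CockroachDB's actual code; the function names and data shapes are made up for illustration):

```python
from collections import defaultdict

def per_group_heartbeats(groups):
    """Naive scheme: one heartbeat per (group, follower) pair."""
    msgs = []
    for group_id, (leader, followers) in groups.items():
        for f in followers:
            msgs.append((leader, f, group_id))
    return msgs

def per_node_heartbeats(groups):
    """Coalesced scheme: one heartbeat per (leader node, follower node)
    pair, carrying the list of groups it vouches for."""
    pairs = defaultdict(list)
    for group_id, (leader, followers) in groups.items():
        for f in followers:
            pairs[(leader, f)].append(group_id)
    return [(src, dst, gids) for (src, dst), gids in pairs.items()]

# 1000 ranges, all led by node "n1" and replicated to "n2" and "n3":
groups = {g: ("n1", ["n2", "n3"]) for g in range(1000)}
print(len(per_group_heartbeats(groups)))  # 2000 messages per interval
print(len(per_node_heartbeats(groups)))   # 2 messages per interval
```

The point is that heartbeat traffic now scales with the number of node pairs rather than the number of Raft groups, which is what makes thousands of ranges per node tolerable.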

Also, we are using _both_ Raft and two-phase commit: Raft manages the
consistency of individual ranges, but transactions that span multiple ranges
require 2PC.
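
A minimal sketch of that layering, assuming each range is its own consensus group (the classes and method names here are illustrative, not CockroachDB's API; in the real system the prepare intents would themselves go through Raft):

```python
class Range:
    """Stand-in for one range, i.e. one Raft group. Raft orders the
    writes within this range; 2PC coordinates across ranges."""
    def __init__(self):
        self.data = {}       # committed key/value state
        self.prepared = {}   # txn_id -> pending writes (intents)

    def prepare(self, txn_id, writes):
        self.prepared[txn_id] = writes
        return True

    def commit(self, txn_id):
        self.data.update(self.prepared.pop(txn_id))

    def abort(self, txn_id):
        self.prepared.pop(txn_id, None)

def two_phase_commit(txn_id, participants):
    # Phase 1: every participating range must acknowledge the prepare.
    if all(rng.prepare(txn_id, writes) for rng, writes in participants):
        # Phase 2: commit everywhere.
        for rng, _ in participants:
            rng.commit(txn_id)
        return True
    for rng, _ in participants:
        rng.abort(txn_id)
    return False

a, b = Range(), Range()
two_phase_commit("t1", [(a, {"x": 1}), (b, {"y": 2})])
print(a.data, b.data)  # {'x': 1} {'y': 2}
```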

~~~
ccleve
There's an interesting comment on the blog post:

> Would handling all ranges' consensus traffic in bulk be the most
> effective way to handle this? Could all the ranges on a node instead be
> represented as a single state machine, handled by a plain etcd Raft
> implementation, like the approach taken in the Spanner paper?

Do you have any comment on this? I've been trying to wrap my head around the
implications. What's the downside?

~~~
bdarnell
I replied on the blog post. The downside is basically that if you have one
giant consensus group, then to add a new replica you have to replicate that
entire consensus group. This hurts load balancing and recovery times, since
you can't scatter a dead node's ranges across all the other nodes in the
cluster.
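
A rough back-of-the-envelope model of the recovery-time argument (the numbers and the function are made up for illustration, assuming transfers across distinct node pairs can proceed in parallel):

```python
def recovery_time_seconds(dead_node_gb, nodes, per_link_gbps, ranges):
    """Estimate time to re-replicate a dead node's data."""
    if ranges == 1:
        # One giant consensus group: a single new replica must copy
        # the entire state over one link.
        streams = 1
    else:
        # Many small ranges: each lost range can be re-replicated from
        # a different source to a different target, so transfers run
        # in parallel across the surviving nodes.
        streams = min(ranges, nodes - 1)
    return dead_node_gb / (per_link_gbps * streams)

# 500 GB node, 100-node cluster, 0.5 GB/s per link:
print(recovery_time_seconds(500, 100, 0.5, 1))     # 1000.0 (one group)
print(recovery_time_seconds(500, 100, 0.5, 1000))  # ~10.1 (scattered)
```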

