
RethinkDB 2.1 beta: automatic failover with Raft - coffeemug
http://rethinkdb.com/blog/rethinkdb-2.1-beta/
======
coffeemug
Hey guys, Slava @ RethinkDB here. I'll be around all day to answer questions.

We're super-excited about this beta. We've been working on Raft, automatic
failover, and all the relevant enhancements for over a year. It's a massive
engineering project that's getting very close to production quality, and it
feels so good to finally get it into people's (proverbial) hands!

~~~
scott_karana
Congratulations! Have you been testing this with Jepsen internally, out of
curiosity? :)

~~~
coffeemug
Yes, of course! We've also been working with Kyle to understand the Jepsen
codebase better. RethinkDB does really well on the Jepsen tests in our
internal testing. Of course it's up to Kyle to have the final word :)

------
RoboTeddy
I'm continually impressed with the rate and quality of rethinkdb's
advancement. There's some really fantastic software engineering going on over
there.

~~~
v3ss0n
The way the RethinkDB crew cooperates over GitHub is fantastic. Every workflow
happens through GitHub and you can transparently see how they operate. I
learned a lot from that and I apply it to my team. We are building "A Killer"
communication tool using RethinkDB + HTML5. We will be announcing it soon.

~~~
eterps
Can you give an example how their cooperation over Github is special?

~~~
v3ss0n
I am with @pest . RethinkDB's GitHub issues have become my daily read, along
with GitHub's Atom repo.

[https://github.com/rethinkdb/rethinkdb](https://github.com/rethinkdb/rethinkdb)

Just look at the issues: how they handle them, how professional their
responses are, and how quickly they resolve issue after issue with full
cooperation between issue owners, users, and contributors. They guide
contributors as well. And it is really addictive, be warned.

------
contingencies
Don't get me wrong, it's good to see people building alternative database
systems, and it's great to see the community as a whole embracing the
architectural issues of high availability through a combination of Paxos and
Raft.

However, I'm not sure that we should be supporting every damn service having
its own rolled-in implementation of its own choice of consensus algorithm.

Why not limit service scope to service provisioning, and normalize to a
reasonable service-independent solution for high availability consensus and
orchestration? This leads to far more solid and well-tested operations
practices that respect service interactions, and you don't wind up with "we
swept it under the rug" problems like the RethinkDB 2.1 beta announcement's
"with only a few seconds of unavailability". Come on guys, call a spade a
spade.

~~~
coffeemug
We explored this possibility in quite a bit of depth, but this turns out to be
much harder to do in practice than it seems.

There are currently two types of projects that offer this kind of service. The
first is built in C/C++ and packaged as a library. We tried to use these
libraries, but a project of RethinkDB's scope has a lot of abstractions for
networking, threading, memory allocation, etc. We quickly found that we
couldn't effectively integrate existing libraries into our coroutine system
and networking stack.

The second type is projects that are typically built in higher level languages
and run standalone. We tried that too, but it turns out to be a user
experience nightmare. Users would need to know how to configure these
different services, and would have to deal with deployment. The configuration
aspect is quite difficult -- it's hard to configure these services in the
right way without making a mistake (to make them work for the needs of a
database system). We also looked into abstracting that away by building a
porcelain UI, but there are a lot of challenges there that are very difficult
to overcome.

Finally, RethinkDB's needs are quite specialized and there isn't a service
that does what we need out of the box. We looked at implementing Raft
ourselves and realized it isn't actually hard -- a much bigger challenge is
properly architecting the rest of the system. To give you an idea of timing,
it took us two weeks to get a Raft implementation, and another few weeks to
polish it. It took another year to get everything else working (plus two years
of expertise in the field wrt real-world issues that users encounter).

 _> only a few seconds of unavailability_

This isn't something you can escape in a system that's based on authoritative
replicas. All other systems that use this architecture face this problem; an
external service wouldn't solve it.

TL;DR: it would be really nice if we could use an existing service, but
unfortunately this is a much trickier problem than it first appears.

~~~
contingencies
Yes, it's difficult. However, sometimes the best and most responsible thing to
do - _especially_ when you find something difficult - is to keep it out of
scope!

I was referring to the "higher level languages and run standalone" class of
solution and I agree with you about their drawbacks (hard to use). However, I
would argue that a hard to use but proven and correct solution for a general
class of cases that someone else maintains is ten times better than an easier
to use but unproven and new solution for a specific case that you have to
maintain. Your suggestion requires everyone to learn another syntax for every
damn daemon they wish to operate, along with the ongoing cognitive drain and
maintenance overhead (upgrades, etc.).

Actually upgrades are a great case in point. How does your daemon handle
updates while in clustered mode? I suppose it doesn't. This is another reason
why a proven, general solution is great to have ... you actually get an
operations process that handles the extremely common but swept under the rug
edge cases that nobody wants to talk about, like upgrades, disk failures, oops
I unplugged the cable/DDoS/network fabric issues, horizontal scaleout,
appropriate rules for (non-)co-habitation with other services, for all
services! Yes, TMTOWTDI, but I seriously doubt some random daemon knows better
how to handle failure than a proven, daemon-agnostic operations process
designed at a significantly more abstract, resource-oriented level.

Have you had a look at the OCF format resource agents? If you define one, you
can wash your hands of the whole issue:
[https://github.com/ClusterLabs/resource-agents/](https://github.com/ClusterLabs/resource-agents/)

I don't understand why "authoritative replicas" would demand a few seconds of
downtime. I find that most daemons, assuming they _fsync()_ appropriately,
come up again in well under a second using a dual master (live, standby) setup
with DRBD (essentially network RAID1, works with any filesystem) with zero
client reconfiguration required by using a shared (floating) IP to handle
failover. Check it out.

~~~
krakensden
A lot of people get into a lot of trouble with the Cluster Labs stack. I've
never really used it much myself, but integrating failover into the server
process and making sure the defaults are all correct for a very particular use
case seems like a fairly reasonable thing to do.

------
thelinuxl1ch
Having been developing with RethinkDB for a year, I can say that it's amazing
how fast this database has evolved. The API is very flexible, the developer
team cares a lot about user feedback, and it's a thriving community! Auto
failover will be a life saver, but I'm much more anxious for reliable
changefeeds :)

~~~
coffeemug
Reliable feeds (and much more) are coming in 2.2. I think we found an elegant
solution[1] that will make different types of users happy, so it'll be
exciting to ship it!

[1]
[https://github.com/rethinkdb/rethinkdb/issues/3471](https://github.com/rethinkdb/rethinkdb/issues/3471)

~~~
v3ss0n
That's very nice to hear. Can't wait for reliable changefeeds.

------
ascendantlogic
Every time I say to myself "How can Rethink get more awesome?" I am shown how.
What a great product.

~~~
v3ss0n
well said.

------
jtwebman
I have played around with this database and am looking forward to using it on
my next project! This is great news, but I would love it if they wrote an
Erlang/Elixir driver. There is one that the community wrote, but I'd love to
see someone working on it full time.

------
nodesocket
@coffeemug, can you describe a typical HA deployment? Are 3 nodes required to
start? If you want to shard across 3 nodes but also sustain a 1-node failure,
how many nodes are required? Finally, how about sharding across 3 nodes but
sustaining a two-node failure?

~~~
danielmewes
3 nodes are indeed required to have automatic failover working.

We recommend using 3 (or more) nodes and replicating all tables with a
replication factor of 3. That way each node is going to maintain a full copy
of the data. In case of a 1 node failure (with or without sharding, as long as
the replication factor is set to 3), one of the two remaining servers is going
to take over for the primary roles that might have been hosted on the missing
server.

If you want to sustain a two node failure without any data loss, you will need
5 servers and also set the replication factor to 5. 5 is the lowest number
that guarantees that if two nodes fail, you will still be left with a majority
of replicas (i.e. 3 out of 5). A majority like that is required to guarantee
data durability and to enable auto failover. If you are ok with losing a few
seconds worth of changes and do not require automatic failover, even a 3 node
setup can be enough to sustain a two node failure. In that case you will have
to perform a manual "emergency_repair" to recover availability (see
[http://docs.rethinkdb.com/2.1/api/javascript/reconfigure/](http://docs.rethinkdb.com/2.1/api/javascript/reconfigure/)
for details), but most of the data should still be there.

In addition you can shard the table into 3 shards for additional performance
gains. This is for the most part unrelated to availability and data safety.
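The arithmetic behind these recommendations is standard quorum math (a sketch in plain Python; the function names below are mine, not part of RethinkDB): tolerating f failed nodes while keeping a majority of replicas alive requires 2f + 1 replicas.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a strict majority of n."""
    return n // 2 + 1

def min_replicas(failures: int) -> int:
    """Fewest replicas needed so a majority survives `failures` lost nodes.

    We need (n - failures) >= majority(n); the smallest such n is 2f + 1.
    """
    return 2 * failures + 1

# Surviving a 1-node failure needs 3 replicas; a 2-node failure needs 5,
# since 5 - 2 = 3 survivors is still a strict majority (3 out of 5).
assert min_replicas(1) == 3
assert min_replicas(2) == 5
assert majority(5) == 3
```

This also shows why a replication factor of 4 buys nothing over 3 for failover purposes: majority(4) is also 3, so a 4-replica table still tolerates only one failure.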

~~~
sciurus
With data sharded across three nodes, why can't a replication factor of two
handle one node's failure? Why does the replication factor need to be three?

~~~
danielmewes
That's a great question.

In our current architecture, we perform failover by selecting a new primary
among the existing replicas for a given shard. We do not however add new
servers to the replica list during a failover condition. Therefore, if you
have a table with two replicas per shard and have a one node failure, you
would only have a single replica left.

Currently we do not automatically failover in that case. We always require a
majority of replicas to be present for any given shard before failing over the
primary of that shard.

I believe we could relax this constraint in this specific case, and allow
failing over despite only a single replica being left, as long as we still
have a majority for the table overall. (I'm going to double-check this with
one of our core developers...) Even if we did perform the failover, that would
not restore write availability of the table though, since there wouldn't be a
majority to acknowledge a write (unless you set write_acks to "single"). It
could still be useful to restore availability for up to date read queries. We
might add support for auto failover in this scenario in the future.
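The per-shard rule described above can be sketched as a toy predicate (the names and data structures here are mine, not RethinkDB's): failover for a shard proceeds only if a strict majority of that shard's configured replicas is still reachable.

```python
def can_failover(replicas, alive):
    """Return True if a new primary can be elected for this shard.

    `replicas` is the shard's configured replica set; `alive` is the set of
    servers currently reachable. A strict majority of the configured
    replicas must survive, mirroring the constraint described above.
    """
    survivors = [r for r in replicas if r in alive]
    return len(survivors) > len(replicas) // 2

shard = ["a", "b", "c"]
assert can_failover(shard, {"b", "c"})      # 2 of 3 alive: majority, OK
assert not can_failover(shard, {"c"})       # 1 of 3 alive: no majority
assert not can_failover(["a", "b"], {"b"})  # 1 of 2: why RF=2 can't fail over
```

The last case is the heart of the answer: with two replicas, losing one node leaves a single survivor, which is not a strict majority of two, so no failover can safely occur.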

------
thelinuxl1ch
Are there any performance issues we should know about after the Raft
implementation?

~~~
danielmewes
General query performance should be the same, if not better in this release
(the improvements are not due to Raft, but due to other changes). We use Raft
only to determine a consistent configuration. The queries themselves are
executed in the same efficient way as before.

Note that in the beta, reconfiguring tables over servers on rotational disks
can be slow
[https://github.com/rethinkdb/rethinkdb/issues/4279](https://github.com/rethinkdb/rethinkdb/issues/4279)
. If you store large documents, you might also see some increases in memory
usage during backfills
[https://github.com/rethinkdb/rethinkdb/issues/4474](https://github.com/rethinkdb/rethinkdb/issues/4474)
. We're going to address both of these issues before the final release.

~~~
DanWaterworth
> We use Raft only to determine a consistent configuration. The queries
> themselves are executed in the same efficient way as before.

So, you've made the noob mistake of using Raft where only the configuration
goes in the log and actual changes to data aren't replicated in the same
manner.

~~~
danielmewes
We have carefully designed the way data is versioned and replicated in
RethinkDB 2.1 to result in a correct and consistent system in combination with
the Raft-supported configuration management. For example we make sure that
both components use consistent quorum definitions and the same membership
information at any point.

This allows us to provide different degrees of consistency guarantees for the
data, depending on the user's need. Our default is already pretty
conservative, but allows uncommitted data to be visible by reads. In the
strongest consistency mode, RethinkDB 2.1 provides full linearizability (see
[http://docs.rethinkdb.com/2.1/docs/consistency/](http://docs.rethinkdb.com/2.1/docs/consistency/)
for details). We have confirmed this both theoretically as well as by testing
the overall system using the Jepsen test framework.
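The trade-off between the default mode and the strongest mode can be illustrated with a toy single-shard model. This is a sketch of the general idea only; it does not mirror RethinkDB's actual implementation, and all names here are made up.

```python
class ReplicatedValue:
    """Toy replicated register with n replicas and a write log."""

    def __init__(self, n_replicas):
        self.n = n_replicas
        self.log = []  # (value, acks) pairs, newest last

    def write(self, value, acks):
        """Record a write acknowledged by `acks` of the n replicas."""
        self.log.append((value, acks))

    def read(self, mode="default"):
        """'default' may return data not yet committed; 'majority' returns
        only writes acknowledged by a strict quorum of replicas."""
        if mode == "majority":
            committed = [v for v, acks in self.log if acks > self.n // 2]
            return committed[-1] if committed else None
        return self.log[-1][0] if self.log else None

r = ReplicatedValue(n_replicas=3)
r.write("v1", acks=3)  # fully replicated, committed
r.write("v2", acks=1)  # accepted by the primary only, not yet committed
assert r.read() == "v2"            # default read may see uncommitted data
assert r.read("majority") == "v1"  # quorum read sees only committed data
```

The quorum read gives stronger guarantees at the cost of waiting on a majority; the default read is faster but can observe a write that a subsequent failover would roll back.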

~~~
DanWaterworth
What do you gain? Let's say that you're right and you've actually managed to
provide the guarantees that you say you have. By adding things to Raft, you
haven't made it simpler, and, having implemented Raft myself recently, I can
attest that it's not the simplest of things to start with.

If you're looking for different degrees of consistency, I'm fairly sure you
can do that without protocol level changes to Raft.

------
aaronorosen
Awesome to see!

I tried giving this a shot, starting up 3 RethinkDB nodes with a proxy node
for my existing application. I set up 3 shards and 2 replicas. Unfortunately,
I wasn't able to get my application working. The UI had the tables going back
to outdated reads (everything was able to initialize once, though). I'll give
it another shot next week and report back. It could also be a problem with my
setup.

~~~
coffeemug
Could you submit a bug report at
[https://github.com/rethinkdb/rethinkdb/issues/new](https://github.com/rethinkdb/rethinkdb/issues/new)
? It would help immensely!

------
v3ss0n
I haven't checked the PRs yet, but does this beta also include the performance
improvement committed a few weeks ago?
[https://github.com/rethinkdb/rethinkdb/issues/4441](https://github.com/rethinkdb/rethinkdb/issues/4441)

That one improved RethinkDB's overall write performance by 2x, right?

~~~
danielmewes
Yes, that patch for soft-durability writes is in there. Looking forward to
hearing whether it improves things for you.

~~~
v3ss0n
Thanks, I am going to test after building.

------
shockzzz
What's the most mature client driver for RethinkDB, in your opinion?

~~~
coffeemug
Ruby, Python, and JavaScript are all equally mature (and developed in-house
full time). Of the community drivers, Go is probably the most mature (but
other community drivers, like the .NET one, are really nice).

