
MySQL High Availability at GitHub - samlambert
https://githubengineering.com/mysql-high-availability-at-github/
======
elvinyung
I feel like this article gets at an interesting tradeoff in
distributed databases: using consensus to store your data, or using consensus
to point at the master for the data.

A lot of the bleeding-edge DBMSes out there (e.g. CockroachDB, TiDB) use the
former, probably because of the Spanner influence, but maybe the latter is good
enough for almost everyone.

~~~
gregwebs
NewSQL databases may still use consensus to point to a leader for data; it's
just that the data is transparently partitioned and each partition can have a
different leader.

In TiDB, for example, each region (data partition) has its own leader, and only
that leader can serve writes (but the leader can change via a Raft-based
election). By default, all servers are both leaders of some partitions and
followers of others [1].
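
A minimal sketch of what that per-partition leadership means for routing a write (made-up Go types for illustration, not TiDB's actual client API):

    package main

    import "fmt"

    // Region is a contiguous key range; its leader is elected via Raft
    // among the servers holding replicas of that range. (Made-up types
    // for illustration, not TiDB's real client API.)
    type Region struct {
        StartKey, EndKey string
        Leader           string // address of the current leader replica
    }

    // routeWrite finds the region covering key; only that region's
    // leader may serve the write. Followers just replicate.
    func routeWrite(regions []Region, key string) (string, error) {
        for _, r := range regions {
            if key >= r.StartKey && key < r.EndKey {
                return r.Leader, nil
            }
        }
        return "", fmt.Errorf("no region covers key %q", key)
    }

    func main() {
        regions := []Region{
            {StartKey: "", EndKey: "m", Leader: "server-1:20160"},
            {StartKey: "m", EndKey: "\xff", Leader: "server-2:20160"},
        }
        leader, _ := routeWrite(regions, "user42")
        fmt.Println("write goes to", leader) // server-2 leads ["m", "\xff")
    }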

Putting all your data on one master stops scaling at some point, once the data
is large enough. However, even before that point, and even without
high-availability requirements, you are making trade-offs that may be good or
bad (e.g. throughput is limited on a single master, but latency is low).

[1]: [https://pingcap.github.io/docs/tikv/tikv-overview/](https://pingcap.github.io/docs/tikv/tikv-overview/)

~~~
elvinyung
Isn't sharding orthogonal to this? It seems completely possible to have a
system where the individual data partitions are asynchronously replicated, but
the metadata of leader leases and locations for all the partitions is stored
in a separate strongly consistent, fault-tolerant source. I think Vitess does
something like that, and certainly Bigtable [1] was an early implementation,
using Chubby to point to the "root" tablet.

[1] [https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf)
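
A sketch of that shape (hypothetical names, loosely in the spirit of Chubby pointing at tablet locations, not Bigtable's or Vitess's actual APIs): only tiny per-shard metadata records live behind consensus, while the shards themselves replicate asynchronously:

    package main

    import "fmt"

    // ConsistentStore stands in for Chubby/ZooKeeper/etcd: a tiny,
    // fault-tolerant, linearizable key-value store run on its own
    // consensus group. (Hypothetical sketch, not a real client API.)
    type ConsistentStore map[string]string

    func main() {
        meta := ConsistentStore{
            // Only these small records go through consensus; the shard
            // data itself is replicated asynchronously between hosts.
            "shards/users_0/master": "db-03:3306",
            "shards/users_1/master": "db-07:3306",
        }

        // Clients look up the master, then talk to it directly; bulk
        // data never touches the consensus system.
        fmt.Println("writes for users_0 go to", meta["shards/users_0/master"])

        // On failover, only this one record is updated (atomically,
        // via the store's consensus), repointing every client at once.
        meta["shards/users_0/master"] = "db-04:3306"
        fmt.Println("after failover:", meta["shards/users_0/master"])
    }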

~~~
gregwebs
Sharding is generally used to refer to a particular type of partitioning:
physically distributing data on different machines with the purpose of scaling
horizontally (to increase write capacity).

What many NewSQL databases use is a logical partitioning scheme where data is
partitioned into chunks of <= 1 GB. This makes it easy both to scale
horizontally and to achieve fault tolerance (and to change these
characteristics without interrupting service). In some systems that component
may be coupled and hidden away. In TiDB it is a separately deployed component
called the Placement Driver (which embeds etcd, a Chubby equivalent):
[https://pingcap.com/blog/2016-11-09-Deep-Dive-into-TiKV/#placement-driver](https://pingcap.com/blog/2016-11-09-Deep-Dive-into-TiKV/#placement-driver)

With sharding you do still need a system to locate your data. A big difference
is that most sharding systems are not designed to distribute transactions (or
possibly even joins) across shards. So Vitess has a special 2PC for cross-shard
transactions, with poorer performance than a single-shard transaction.
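
For intuition on the cost, here is a bare-bones generic two-phase-commit sketch (not Vitess's actual implementation, which also keeps a durable coordinator log and handles timeouts): every shard must vote before any shard commits, so a cross-shard write pays at least two round trips per participant:

    package main

    import "fmt"

    // stubShard is an in-memory stand-in for one shard's transaction
    // manager. (Illustrative only.)
    type stubShard struct{ name string }

    func (s stubShard) Prepare(txID string) error { // phase 1: persist intent, vote
        fmt.Println(s.name, "prepared", txID)
        return nil
    }
    func (s stubShard) Commit(txID string)   { fmt.Println(s.name, "committed", txID) }
    func (s stubShard) Rollback(txID string) { fmt.Println(s.name, "rolled back", txID) }

    // twoPhaseCommit: all shards prepare, then all commit; any prepare
    // failure aborts everyone. Note the extra round trips compared with
    // a single-shard transaction -- that's the performance cost.
    func twoPhaseCommit(txID string, shards []stubShard) error {
        for i, s := range shards {
            if err := s.Prepare(txID); err != nil {
                for _, p := range shards[:i] {
                    p.Rollback(txID)
                }
                return fmt.Errorf("tx %s aborted: %v", txID, err)
            }
        }
        for _, s := range shards {
            s.Commit(txID)
        }
        return nil
    }

    func main() {
        shards := []stubShard{{"shard-a"}, {"shard-b"}}
        _ = twoPhaseCommit("tx-1", shards)
    }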

------
connor11528
Vitess is another system for horizontally scaling MySQL that they use at
YouTube [https://vitess.io/](https://vitess.io/)

~~~
sethammons
Another cool one is ProxySQL. We opted for that rather than Vitess because it
allowed a crawl, walk, run approach. Vitess came across as all or nothing.

------
majidazimi
Am I mistaken, or is promoting a slave to a master a kind of fragile operation
(detecting that the master really did fail)? I mean, there are more robust
methods:

1. There is a BookKeeper cluster providing a replicated log. (No, not Kafka!
Hold on a second; see step 4.)

2. All nodes of a single MySQL cluster run a service with two
responsibilities: accepting writes and following the log. The MySQL service
itself only serves reads; writes are performed indirectly by the writer
service running beside the MySQL service.

3. BookKeeper (and DistributedLog on top of it) enforces a single writer, so
essentially one node appends to the log and the others follow it.

4. Whenever one of the slaves thinks the master has failed (or is partially
failed), it issues a FENCE request to BookKeeper, which either stops the
current master from appending to the log or fails to get a majority vote from
the BookKeeper nodes. Either way, the log stays consistent. (No two masters
exist at any single point in time.)

5. If a slave is able to fence the current master, it just notifies the
service discovery component to redirect clients. This operation doesn't have
to happen atomically, since the old master is fenced and can't append to the
log anyway, so its client operations will be safely rejected. (A sketch of
this fencing step follows after the list.)
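
A toy sketch of steps 4-5, assuming a hypothetical fencing interface (this is the general fencing idea, not BookKeeper's actual client API):

    package main

    import (
        "errors"
        "fmt"
    )

    // memLog is an in-memory stand-in for BookKeeper/DistributedLog.
    // The real thing gets its safety from a majority vote across log
    // nodes; here a single field plays that role. (Hypothetical API.)
    type memLog struct{ writer string }

    var errFenced = errors.New("writer is fenced")

    // Fence invalidates the old writer. In the real system this only
    // succeeds with a majority, so two fencers can't both win.
    func (l *memLog) Fence(oldWriter string) error {
        if l.writer == oldWriter {
            l.writer = "" // old master can no longer append
        }
        return nil
    }

    func (l *memLog) Append(writer, entry string) error {
        if l.writer != writer {
            return errFenced
        }
        fmt.Println(writer, "appended:", entry)
        return nil
    }

    func main() {
        log := &memLog{writer: "master-1"}

        // Step 4: the suspicious slave fences the old master first.
        if err := log.Fence("master-1"); err != nil {
            return // no majority: do NOT take over
        }

        // Step 5: redirect clients (service discovery update would go
        // here), then start writing. This needn't be atomic: a stale
        // client going through the old master just sees appends rejected.
        log.writer = "slave-2" // stand-in for "slave-2 opens the log as writer"
        fmt.Println(log.Append("master-1", "late write")) // -> writer is fenced
        _ = log.Append("slave-2", "first write as new master")
    }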

This approach can convert any data store into a consistent, highly available
store. It is even capable of running multiple databases grouped as one
cluster, since the replicated log is a separate module; replicating from
MongoDB to SQL Server is possible, as an example. It has the advantage that
you don't have to adopt any untested databases. Just use good old MySQL/PG/...

I understand that it requires quite a lot of work to refactor already-running
infrastructure, but it is more robust and somewhat safer. BookKeeper's
built-in fencing mechanism precludes stray writes from any partially failed
master.

------
benmmurphy
I think the Joyent Manatee HA solution
[[https://github.com/joyent/manatee](https://github.com/joyent/manatee)]
is the safest way of taking a system like PG that supports
synchronous replication and giving it good availability properties while
preserving durability.

They basically have 3 nodes:

- the master, which must synchronously commit all writes to the slave

- the slave, which accepts synchronous writes from the master and may be
promoted to master

- the asynchronous slave, which accepts writes asynchronously from the slave
[or from the master, if you support rolling back the timeline]

If the master dies, the slave can promote itself to master if it gains the
support of the asynchronous slave.

If the slave dies, then the master can no longer commit, and it tries to
promote the asynchronous slave to be the new slave. If there is some confusion
over who is dead, the asynchronous slave basically adjudicates, because it can
only be promoted to slave once, either by the master or by the original slave.

Basically, writes are only accepted if at least 2 nodes acknowledge them, and
the system can survive one node dying. But because you need the cooperation of
2 of the 3 nodes to actually commit data, the system doesn't have the
split-brain issues you get with an external system that checks for health and
then tells nodes to stop connecting to the unhealthy master.
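
A toy sketch of that adjudication rule as I read it (not Joyent's actual code): the async slave will bless exactly one promotion, so the old master and the slave can never both end up writable:

    package main

    import (
        "errors"
        "fmt"
    )

    // asyncSlave is the tie-breaker: it endorses exactly one promotion
    // per configuration epoch. (My reading of the manatee design, not
    // Joyent's actual code.)
    type asyncSlave struct{ endorsed string }

    var errAlreadyEndorsed = errors.New("already endorsed a promotion this epoch")

    func (a *asyncSlave) endorse(candidate string) error {
        if a.endorsed != "" && a.endorsed != candidate {
            return errAlreadyEndorsed
        }
        a.endorsed = candidate
        return nil
    }

    func main() {
        tieBreaker := &asyncSlave{}

        // Master appears dead: the sync slave asks to be promoted.
        if err := tieBreaker.endorse("sync-slave"); err == nil {
            fmt.Println("sync-slave is now master")
        }

        // The old master was merely partitioned; it now tries to recruit
        // the async slave as its new sync slave. It loses the race:
        if err := tieBreaker.endorse("old-master"); err != nil {
            fmt.Println("old-master must stop committing:", err)
        }
    }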

I think the actual state machine used is a bit more complicated, because they
need to support removing/adding nodes, and I think they end up requiring
ZooKeeper to coordinate.

Flynn [[https://flynn.io/docs/databases](https://flynn.io/docs/databases)]
also has an HA solution for Postgres/MySQL based on this.

Though I can see why this kind of setup might not perform as well as one where
you accept writes after committing to 2 out of N read replicas, especially
when you already need lots of read replicas.

~~~
Annatar
“the slave which accepts synchronous writes from the master and may be
promoted to the master”

That is failover, not high availability, since there is downtime, however
short. High availability means zero downtime, like with Vertica.

~~~
redwood
Sounds like you're looking for a different word from "high". Maybe
"continuous"?

------
merb
I can understand that VIPs can sometimes be awkward: ARP fails, or even worse,
VIPs aren't released correctly for unknown reasons.

However, I do wonder why they chose such a complicated approach instead of
looking into something like BGP, which would have replaced their ARP setup.

Also, they should probably have looked into K8s, since inside a cluster such
problems do not exist. (There, the problem is instead having a good LB at the
outside of your cluster.)

~~~
scurvy
+1 for BGP/OSPF-based VIPs in a layer 3 network. It's 2018. No reason why you
can't easily run a layer 3 network with rich routing protocols. Super easy.

You don't need application-level "discovery" when the IP never changes; the
network will tell you where it currently lives.

I'll take a layer 3 network over raft/consul any day and all night.

~~~
voltagex_
I don't think I've ever heard BGP described as "easy", but then again it's not
really my area.

Where could I go to read up on this without being a networking expert? Are
there any safe ways to play with BGP in a lab environment? How expensive would
said lab be to set up?

~~~
merb
“Are there any safe ways to play with BGP in a lab environment?” Actually,
there are multiple ways of running BGP in a lab. There are cheap variants that
are routed to the internet over IPv6, and there are ways to just play without
talking to the outside network.

“How expensive would said lab be to set up?” It heavily depends on what you
want. For a highly available lab you would of course need multiple
nodes/routers; it depends on what you want to do.

For just playing with BGP, using BIRD
[http://bird.network.cz/](http://bird.network.cz/) is actually good enough.

Actually, RIPE sometimes publishes articles about BGP. Here is one that sets
up anycast routing with BGP:
[https://labs.ripe.net/Members/samir_jafferali/build-your-own-anycast-network-in-nine-steps](https://labs.ripe.net/Members/samir_jafferali/build-your-own-anycast-network-in-nine-steps)
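
For a taste of how small the moving parts are, here is a minimal BIRD 1.x config sketch for announcing a /32 service VIP to an upstream router over BGP (the addresses and AS numbers are made up; check the BIRD docs before copying):

    # bird.conf -- minimal VIP announcement (BIRD 1.x syntax, sketch only)
    router id 10.0.0.2;

    protocol device { }

    # Hold the VIP as a static route so there is something to export.
    # Pair this with a health check that stops BIRD when the local
    # service is unhealthy, so the route is withdrawn automatically.
    protocol static {
      route 192.0.2.10/32 reject;
    }

    protocol bgp uplink {
      local as 65010;
      neighbor 10.0.0.1 as 65000;
      import none;
      export where source = RTS_STATIC;
    }

Announce the same VIP from several hosts and you have anycast; stop announcing it on one host and traffic shifts to whoever still does.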

~~~
scurvy
Juniper and Cisco have lab versions of their products (Olive, vMX, vSRX) for
free. Arista has something, too.

BIRD is super easy to set up, too.

~~~
voltagex_
If I'm not studying for any particular cert (I'm just looking to fill some
gaps in my knowledge) then it looks like BIRD is the way to go.

~~~
scurvy
bird is the word

------
deanmoriarty
Why not InnoDB Cluster [1] with a single primary? There is a slight decrease
in write throughput, as it is effectively similar to synchronous replication
on a majority, but you get native failover and very easy cluster membership
management.

[1] [https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster-userguide.html](https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster-userguide.html)
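
For a sense of what "very easy cluster membership management" looks like, a sketch using the MySQL Shell AdminAPI (from memory of the 8.0 docs; host names and the admin account are placeholders):

    // In mysqlsh (JS mode); hosts and the admin account are placeholders.
    shell.connect('clusteradmin@db1:3306')
    var cluster = dba.createCluster('prod')       // single-primary by default
    cluster.addInstance('clusteradmin@db2:3306')  // joins via Group Replication
    cluster.addInstance('clusteradmin@db3:3306')
    cluster.status()                              // PRIMARY/SECONDARY roles, quorum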

~~~
deanmoriarty
Would really love to know why this got downvoted instead of replying "InnoDB
Cluster is not good because..."?

It's actually the most recommended way to do replication in MySQL now. They
are heavily deprecating everything around traditional replication; they even
removed MySQL Fabric (the proxy that let clients work well with failover) in
favor of MySQL Router, which works with just InnoDB Cluster.

HN, go figure...

~~~
philaroo
Small clarification: MySQL Router does not require InnoDB Cluster, although
agreed that most Router development is focused there.

------
ton31337
How about another implementation of "master election", like the one here:
[http://blog.donatas.net/blog/2017/02/26/exazk/](http://blog.donatas.net/blog/2017/02/26/exazk/)?

------
ngrilly
I feel like I'm missing something. I see two layers of Raft consensus at play
here: one at orchestrator's level, and another at Consul's level. Would it be
possible to manage consensus at only one level?

------
Annatar
High availability requires a database cluster that provides synchronous
multi-master replication. What they have is failover, not high availability.

~~~
detaro
Failover is a valid high-availability technique. One with limitations and not
for every scenario, but still one.

~~~
Annatar
By definition, highly available means no interruption of service. Failover
does not meet this criterion, since there is an interruption.

~~~
detaro
I've never seen it defined that way, neither in industry material nor
academia, so I guess "citation needed"?

These things always have limits, and what counts as "availability" and as an
acceptable failure rate/time is application-dependent. High availability
merely describes reaching a goal higher than what a "conventional" setup can
achieve, and failover is a fairly obvious way of massively reducing
time-to-recovery.

------
dazoot
Why not TiDB ?

~~~
sanxiyn
FYI: TiDB is relevant because it is distributed and protocol-compatible with
MySQL.

~~~
elvinyung
I think at this point it's basically a non-option though. Ti is nowhere near
mature enough, the ops investment cost is extremely nontrivial, and it still
has some interesting issues to work out (e.g. [1]).

Furthermore, protocol-compatible doesn't directly imply API-compatible (I'm
thinking of very, very subtle differences in things like datatypes or
isolation-level behavior). I know a bit more about CockroachDB, which
consciously does some things differently from Postgres.

[1] [https://github.com/pingcap/tidb/issues/2712](https://github.com/pingcap/tidb/issues/2712)

~~~
sanxiyn
It's not a drop-in replacement, but several very large websites migrated from
MySQL to TiDB, so it definitely is an option.

~~~
Rafuino
Source? I'm guessing you're referring to Mobike and Yiguo... any others?

[https://www.pingcap.com/cases/](https://www.pingcap.com/cases/)

~~~
sanxiyn
Those two are just case studies. There is a fuller list here:
[https://pingcap.com/docs/adopters/](https://pingcap.com/docs/adopters/)

------
coldseattle
MS SQL Server may be a good choice, too.

------
nickthemagicman
Why not Aurora?

~~~
morgo
Besides being AWS-only, Aurora is very expensive once you factor in the IO
costs for a large deployment.

That's not to say that I'm not a fan of it, but it's for smaller-sized
deployments than GitHub's. You can see this for yourself in the customer list
here:

[https://aws.amazon.com/rds/aurora/customers/](https://aws.amazon.com/rds/aurora/customers/)

------
vitalia4
So in other words, they had to hack together some tools to get clustering
features on a crucial piece of company infrastructure. I understand that
relational databases have been around for 40 years, but very few of them come
close to offering the multi-master replication or master-election protocols
implemented in modern NoSQL systems.

~~~
deanmoriarty
Actually, MySQL has well-working multi-master replication with a great
automatic master-election protocol and automatic client routing toward the
current master. It's called InnoDB Cluster, and my other comment, where I was
trying to start a discussion around it, was downvoted.

