
How I Learned to Stop Worrying and Love Automated Database Failover - michaelfairley
https://www.braintreepayments.com/braintrust/how-i-learned-to-stop-worrying-and-love-automated-failover
======
adrianhoward
_And, now that it is in production, we regularly test and exercise the tools
involved._

This is one of the most important sentences in the article. I've seen too
many systems in my time fail because the wonderful recovery/failover system
had never really been tested in anger, or because the person who set it up
left the company and the details never quite made it into the pool of
common knowledge. Dealing with failover situations has to become _normal_.

One of the nicest pieces of advice I got, many years ago, was about naming.
Never name systems things like 'db-primary' and 'db-failover' or 'db-alpha'
and 'db-beta' - nothing with an explicit hierarchy or ordering. Name them
something arbitrary like db-bob and db-mary, or db-pink and db-yellow
instead. It helps break the mental habit of thinking that one system
_should_ be the primary, rather than that one system just _happens_ to be
the primary right now.

Once you've done that, start picking a day each week to run the failover
process on something. Like code integration - do it often enough and it
stops being painful and scary.

(Geek note: in the late nineties I worked briefly with a D&D-fanatic ops
team lead. He threw a d100 when he came in every morning. Anything over 90
and he picked a random machine to fail over 'politely'. If he threw a 100
he went to the machine room and switched something off or unplugged
something. A human chaos monkey.)
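
In script form, the ritual looks something like this - a toy sketch, where
the host inventory and the failover.sh helper are invented for
illustration:

    import random
    import subprocess

    # Hypothetical inventory - note the deliberately unordered names.
    HOSTS = ["db-bob", "db-mary", "db-pink", "db-yellow"]

    roll = random.randint(1, 100)  # the morning d100
    if roll == 100:
        # No script can walk to the machine room and pull a cable for you.
        print("Natural 100: go unplug something yourself.")
    elif roll > 90:
        victim = random.choice(HOSTS)
        print("Rolled %d: politely failing over %s" % (roll, victim))
        subprocess.call(["./failover.sh", victim])  # made-up helper
    else:
        print("Rolled %d: everyone lives today." % roll)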

------
falcolas
Automated failover, with manual recovery, is probably the best thing you can
do to get high availability with databases.

Databases just fail sometimes. The ability to be back up and running before
an admin can even respond will pay for itself after your first automated
failover (and that's before you consider that automated failover scales
well - human-based failover doesn't).

I also like their modifications to the Pacemaker resource to keep the
master role from flapping - that's really important with databases, and
often overlooked in Pacemaker setups.
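
For anyone who hasn't fought this before: flapping usually comes from
Pacemaker re-promoting whenever node scores shift slightly. A hypothetical
crm shell fragment (the resource names and agent are placeholders, not
Braintree's actual configuration) showing the kind of stickiness settings
involved:

    # Hypothetical sketch, not Braintree's config. Stickiness makes
    # Pacemaker prefer the status quo over chasing small score changes,
    # which is what keeps the master role put.
    primitive pg ocf:heartbeat:pgsql \
        op monitor interval=15s role=Master \
        op monitor interval=16s role=Slave
    ms ms_pg pg \
        meta master-max=1 clone-max=2 notify=true
    rsc_defaults resource-stickiness=INFINITY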

------
joch
I have started trying out Galera Cluster[1] for MySQL, to replace a single
MySQL server node with 3-4 nodes, all synchronously replicated. This should
hopefully solve the problem of having to split writes to the master and
reads to the slaves, and provide redundancy if a server goes down.
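
For reference, a minimal per-node my.cnf for this kind of cluster looks
roughly like the following - the host names are placeholders and the
wsrep provider path varies by distribution:

    # Minimal Galera node sketch; host names and the provider path
    # are placeholders - adjust for your own cluster.
    [mysqld]
    binlog_format=ROW
    default_storage_engine=InnoDB
    innodb_autoinc_lock_mode=2
    wsrep_provider=/usr/lib/galera/libgalera_smm.so
    wsrep_cluster_name=my_cluster
    wsrep_cluster_address=gcomm://db-one,db-two,db-three
    wsrep_sst_method=rsync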

Does anyone have any experience with Galera in a production environment? Is
the setup in this article preferable to that?

[1] <http://codership.com/content/using-galera-cluster>

~~~
falcolas
I have some experience with Galera. It's pretty slick, but you probably
won't want to split writes. It technically works, though you can run into
unexpected deadlocks due to optimistic locking on remote nodes vs.
pessimistic locking on the write node[1]. What you genuinely don't have to
worry about is replication lagging behind and losing data if you lose a
host.

On the other hand, your overall throughput is constrained by the network,
since every commit requires at least a round trip to all of the nodes.

Which setup are you referring to? MySQL 5.5, or MySQL NDB cluster?

[1] <http://www.mysqlperformanceblog.com/2012/08/17/percona-xtradb-cluster-multi-node-writing-and-unexpected-deadlocks/>
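
To make the failure mode from [1] concrete, a sketch using one connection
per node - pymysql and the host names are assumptions; any MySQL client
shows the same behavior:

    import pymysql  # assumed client library; any MySQL driver works

    # Two connections, one per Galera node (hypothetical host names).
    a = pymysql.connect(host="galera-1", user="app", password="secret",
                        database="shop", autocommit=False)
    b = pymysql.connect(host="galera-2", user="app", password="secret",
                        database="shop", autocommit=False)

    for conn in (a, b):
        with conn.cursor() as cur:
            # Each node takes its usual pessimistic row lock locally;
            # neither node knows about the other's lock yet.
            cur.execute(
                "UPDATE accounts SET balance = balance + 1 WHERE id = 1")

    a.commit()  # first committer wins certification across the cluster
    try:
        b.commit()  # loser is rolled back, surfacing as a deadlock error
    except pymysql.Error as e:
        print("certification conflict:", e)  # typically error 1213
        b.rollback()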

~~~
joch
Thanks, I greatly appreciate the link! The deadlock scenario does not look
good, though I don't think it would be a problem in this particular
application. But since a single server easily handles the load, I will
definitely take the safe route: not split the writes, and just use the
other db nodes for failover and reads.

I was initially set on using NDB cluster to provide failover, but its table
constraints (tables can't be too wide) mean that there would have to be
code changes. I recently set up and tried Galera, and everything "just
worked". The application currently uses standard MySQL 5 with InnoDB, and
going with Galera would mean that no code change would be necessary.

It just sounded too good to be true, though, so I wanted to hear that
people actually use this in production. The application handles payments,
so corrupted rows, or rows not committing on all nodes, would be a huge
problem.

