
Making Database Maintenance Fun: mysql_role_swap  - chinmoy
https://37signals.com/svn/posts/3344-making-database-maintenance-fun-mysqlroleswap
======
falcolas
What does this do that a dedicated failover tool like MySQL High Availability
[1] does not do?

[1] <https://code.google.com/p/mysql-master-ha/>

~~~
themcgruff
We talked with some popular "mysql consultants" about things like mysql-
master-ha, and there was always a lot of hand waving and murmuring about how
existing solutions were incomplete, had lots of bugs, or weren't to be trusted
in production.

It looks to me like they do mostly the same thing. We wrote our script in Ruby
because it's what we know and have the most expertise with, which makes it
easy for us to debug.

mysql_role_swap is not black magic, it doesn't do automatic failover, it
doesn't run in the background / daemonize itself. It fails gracefully in a way
we find predictable and useful.

~~~
falcolas
The problem with rolling your own is that you're going to make mistakes. MHA,
PRM, MMM, Tungsten, DRBD - all were written by people who know a lot about
MySQL database failover, and yes, they still have problems. The reason they
have problems is because doing failover properly under load is a _hard_
problem to solve.

As for them not being production ready? That sounds like a line being fed by
someone who wants to sell their own solution. I have personally seen all of
these solutions in use in production systems.

Some of the potential problems I note with your solution (and if you haven't
been bitten by these yet, you will at some point):

1) You need to break existing connections to the old master. This is the
single biggest cause of split brain that I've seen in these situations. SET
GLOBAL read_only=1 only affects new connections.

2) You should verify that the slave has read _and applied_ all of the logs
from the master, to ensure that no data is lost when the slave is stopped.

3) You really should capture the binlog file & position from the slave _after_
stopping the slave, to avoid getting inconsistent data in your two databases.

4) You may want to ensure that the arp change actually took effect on your
proxy machine: prior to unlocking the proxy, make a connection from the proxy
server to the VIP & verify that the server_id matches the slave's server_id.

MHA has a number of problems when considered as an automated failover
solution. However, it's not required to run in an automated fashion to do
manual failovers like this, and is quite good at doing them correctly; it
encapsulates the lessons learned from its predecessors, and the knowledge of
the engineers who have been bitten by doing a failover under load incorrectly.

There are some really great tools out there, and not using them (or learning
from them) is going to cause your DBAs hours of hard work to get your DBs back
in operational order when something goes wrong. Not if.

------
BryantD
Always cool to see how people handle automation.

I wonder if they considered implementing this within another framework? Seems
like this would make an excellent chef recipe.

~~~
themcgruff
We write out the cluster config and update the script using Chef.

In an emergency situation, chef-client is a wee bit slow ;)

Also we like having the script standalone so there's just one moving part.

~~~
BryantD
Heh, that's fair enough. More moving parts are always a problem. Thanks for
sharing!

