Making Database Maintenance Fun: mysql_role_swap (37signals.com)
22 points by chinmoy 1782 days ago | 7 comments

What does this do that a dedicated failover tool like MySQL High Availability [1] does not do?

[1] https://code.google.com/p/mysql-master-ha/

We talked with some popular "mysql consultants" about tools like mysql-master-ha, and there was always a lot of hand-waving and murmuring about how existing solutions were incomplete, had lots of bugs, or were not to be trusted in production.

It looks to me like they do mostly the same thing. We wrote our script in Ruby because it's what we know and have the most expertise with, which makes it easy for us to debug.

mysql_role_swap is not black magic: it doesn't do automatic failover, and it doesn't run in the background or daemonize itself. It fails gracefully in a way we find predictable and useful.

The problem with rolling your own is that you're going to make mistakes. MHA, PRM, MMM, Tungsten, DRBD - all were written by people who know a lot about MySQL database failover, and yes, they still have problems. The reason they have problems is that doing failover properly under load is a hard problem to solve.

As for them not being production-ready? That sounds like a line being fed by someone who wants to sell their own solution. I have personally seen all of these solutions in use in production systems.

Some of the potential problems I note with your solution (and if you haven't been bitten by these yet, you will at some point):

1) You need to break existing connections to the old master. This is the single biggest cause of split brain that I've seen in these situations. SET GLOBAL read_only=1 only affects new connections.

2) You should verify that the slave has read and applied all of the logs from the master, to ensure that no data is lost when the slave is stopped.

3) You really should capture the binlog file & position from the slave after stopping the slave, to avoid ending up with inconsistent data in your two databases.

4) You may want to ensure that the ARP change actually took effect on your proxy machine: prior to unlocking the proxy, make a connection from the proxy server to the VIP & verify that the server_id matches the slave's server_id.
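A minimal Ruby sketch of those four checks, expressed as pure helpers over the row hashes you'd get back from `SHOW PROCESSLIST`, `SHOW SLAVE STATUS`, and `SHOW MASTER STATUS`. The helper names and hash shapes are my assumptions for illustration, not part of mysql_role_swap:

```ruby
# 1) SET GLOBAL read_only=1 doesn't tear down existing sessions, so
#    stale connections on the old master have to be killed explicitly.
def kill_statements(processlist_rows, own_connection_id)
  processlist_rows
    .reject { |row| row["Id"] == own_connection_id }   # don't kill our own session
    .reject { |row| row["User"] == "system user" }     # skip replication threads
    .map    { |row| "KILL #{row['Id']}" }
end

# 2) The slave has applied everything it has read once the SQL thread's
#    execute position catches up to the IO thread's read position.
def slave_caught_up?(slave_status)
  slave_status["Master_Log_File"] == slave_status["Relay_Master_Log_File"] &&
    slave_status["Read_Master_Log_Pos"] == slave_status["Exec_Master_Log_Pos"]
end

# 3) After STOP SLAVE, capture the promoted box's own binlog coordinates
#    so the old master can later be pointed at them as a slave.
def binlog_coordinates(master_status)
  { file: master_status["File"], position: master_status["Position"] }
end

# 4) After the ARP change, a connection through the VIP should report the
#    slave's server_id; anything else means traffic still hits the old master.
def vip_points_at?(reported_server_id, expected_server_id)
  reported_server_id == expected_server_id
end
```

Keeping these as small, testable predicates (rather than inlining the SQL) makes each pre-flight check easy to verify independently before unlocking the proxy.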

MHA has a number of problems when considered as an automated failover solution. However, it's not required to run in an automated fashion to do manual failovers like this, and is quite good at doing them correctly; it encapsulates the lessons learned from its predecessors, and the knowledge of the engineers who have been bitten by doing a failover under load incorrectly.

There are some really great tools out there, and not using them (or learning from them) is going to cause your DBAs hours of hard work to get your DBs back in operational order when something goes wrong. Not if.

I wonder that as well... Hmm.

Always cool to see how people handle automation.

I wonder if they considered implementing this within another framework? Seems like this would make an excellent chef recipe.

We write out the cluster config and update the script using Chef.

In an emergency situation, chef-client is a wee bit slow ;)

Also we like having the script standalone so there's just one moving part.
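That split could look something like the following Chef recipe sketch: Chef renders the config and drops the standalone script in place, but never runs it. All paths, template names, and node attributes here are assumptions for illustration:

```ruby
# Hypothetical recipe: render the cluster config from node attributes.
template "/usr/local/etc/mysql_role_swap.yml" do
  source "mysql_role_swap.yml.erb"
  owner  "root"
  mode   "0644"
  variables(cluster: node["mysql_cluster"])
end

# Ship the standalone script itself; it's invoked by a human, not by Chef.
cookbook_file "/usr/local/bin/mysql_role_swap" do
  source "mysql_role_swap"
  mode   "0755"
end
```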

Heh, that's fair enough. More moving parts are always a problem. Thanks for sharing!
