I'm part of a team responsible for cloud-based video editing software. We use multi-master replication (perhaps also known as optimistic replication) with our own tools throughout, but it does require careful design: keep as much data immutable as possible, give every piece of data that might be updated by different machines at the same time its own row, and put a GUID on each row.
Each machine can generate its own local IDs, which look a lot like a timestamp with some unique stuff on the end. Each row gets a GUID and a 'version' ID column, and we only apply relayed updates if the incoming version is newer. This is largely last-timestamp-wins for conflicts (rare because of the design decisions above), but there is some Lamport-timestamp behaviour in there too when updating an existing row.
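A minimal sketch of that apply rule in Python (the row/update shapes, `rows` store, and function names are my own invention, not their actual code): replicated updates are only applied when the incoming version is newer, and local writes bump the version past anything already seen, which is the Lamport-ish part.

```python
import time

# Rows keyed by GUID; each row carries a 'version' used for conflict resolution.
rows = {}

def make_local_id(machine_id, counter):
    """Local IDs: roughly a timestamp with some unique stuff on the end."""
    return f"{time.time():.6f}-{machine_id}-{counter}"

def apply_replicated_update(row_guid, incoming_version, incoming_data):
    """Apply a relayed update only if the incoming version is newer
    (last-write-wins); otherwise drop it as stale."""
    current = rows.get(row_guid)
    if current is None or incoming_version > current["version"]:
        rows[row_guid] = {"version": incoming_version, "data": incoming_data}
        return True
    return False  # stale update, ignore

def local_update(row_guid, new_data):
    """Lamport-style bump: the new version is one past whatever
    this node has already seen for the row."""
    current = rows.get(row_guid)
    next_version = (current["version"] + 1) if current else 1
    rows[row_guid] = {"version": next_version, "data": new_data}
    return next_version
```

With immutable data and one row per independently-updated item, two machines rarely touch the same GUID concurrently, so this simple rule is usually enough.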
The main downside is still that every machine has to handle every write, but by batching incoming updates into larger transactions we've had no problems with quite a high volume of database updates across a dozen commodity machines. Obviously filtering into different shards would be an easy solution.
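The batching trick can be sketched like this (using SQLite purely as a stand-in database; the table and batch size are made up): instead of one transaction per relayed update, commit every few hundred of them at once, which is where most of the per-write overhead goes.

```python
import sqlite3

def apply_batch(conn, updates, batch_size=500):
    """Apply relayed updates in batches: one transaction per batch of
    updates, instead of one transaction per update."""
    cur = conn.cursor()
    for i in range(0, len(updates), batch_size):
        batch = updates[i:i + batch_size]
        cur.execute("BEGIN")
        cur.executemany(
            "INSERT OR REPLACE INTO kv (guid, version, value) VALUES (?, ?, ?)",
            batch,
        )
        cur.execute("COMMIT")

# isolation_level=None: manage transactions manually with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE kv (guid TEXT PRIMARY KEY, version INTEGER, value TEXT)")

# 1200 incoming updates become 3 transactions rather than 1200.
apply_batch(conn, [(f"g{i}", 1, "v") for i in range(1200)], batch_size=500)
```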
I'm looking forward to seeing what other people are doing with multi-master replication.
If B fails, we have a monitoring system in place to tell A that it is responsible for w, x, y and z at once. The monitor sets read-write permissions for databases y and z on A (through user permissions), and then notifies all of our application servers that things have shifted. The applications for their part include a piece of common code that monitors for changes in the cluster and allows the application code to cope with these changes. For web requests, if failover happened in the middle of a transaction, the request fails. For long running processes, the process will have to go to the top of the event loop and request a new database connection, etc.
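The long-running-process side of this can be sketched as follows (the `ClusterMonitor` class and job shapes are illustrative stand-ins for their shared monitoring code, not a real API): on a failed connection the process goes back to the top of its loop and asks for a fresh connection, which by then points at the new responsible node.

```python
class ConnectionFailed(Exception):
    pass

class ClusterMonitor:
    """Stand-in for the common code that watches for cluster changes and
    hands out connections to whichever node currently owns the database."""
    def __init__(self):
        self.active_node = "B"

    def failover(self, new_node):
        self.active_node = new_node

    def connect(self):
        return {"node": self.active_node}

def event_loop(monitor, jobs):
    """Long-running process: on a dead connection, return to the top of
    the loop and request a new connection rather than retrying in place."""
    results = []
    for job in jobs:
        while True:
            conn = monitor.connect()  # top of the loop: always re-resolve
            try:
                results.append(job(conn))
                break
            except ConnectionFailed:
                continue  # cluster shifted under us; get a fresh connection

    return results

# Demo: B dies mid-request, the monitor promotes A, and the job retries.
monitor = ClusterMonitor()

def flaky_job(conn):
    if conn["node"] == "B":
        monitor.failover("A")  # simulate the monitor reacting to B's failure
        raise ConnectionFailed()
    return conn["node"]

results = event_loop(monitor, [flaky_job])
```

Web requests skip the retry loop entirely and just fail, as described above, since the user can retry themselves.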
So far it's worked fairly well. We are able to achieve high availability with it, since our master nodes are in two different data centers. There are definitely issues with this approach in general, but it works for our work load.
At the end of the day, it's always going to require well thought out schemas and data layout if you want to take writes for both masters.
Multi-master is less difficult if you only write to a single node at a time (hot-standby style).
But if you can design your system to take this into account, it's pretty easy to scale out and everything Just Works(tm).
- High Performance MySQL (Schwartz et al)
- MySQL High Availability (Bell et al)
We talked a bit about how race conditions are a part of life with multi-master replication, and the ways to best avoid them.
When you know your data, you know what you can get away with.
I would like to throw the Tungsten Replicator into this discussion: http://code.google.com/p/tungsten-replicator/. I have been researching it for the past few weeks as a replacement for MySQL's built-in replication, hoping it solves a lot of the current pitfalls, especially in 5.0.x. There is a very thorough guide here: http://tungsten.sourceforge.net/docs/Tungsten-Replicator-Gui...
In your scenario, yes. The two changes would be a conflict, but hopefully (if the user makes the same change) your conflict-resolution code would realise it is the same and merge them correctly. I fear I don't know how MySQL::Replication handles this, but it has to be application-level logic, so I assume it will allow you to override a default of, say, last-modification-wins.
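A sketch of what such application-level resolution might look like (this is my own illustration of the idea, not how MySQL::Replication actually does it; the record shape is made up): identical concurrent edits merge into one, and anything else falls back to last-modification-wins.

```python
def resolve_conflict(local, remote):
    """Application-level conflict resolution for two concurrent versions
    of the same row, each carrying its payload and a modification time."""
    if local["data"] == remote["data"]:
        # Both users made the same change: not really a conflict,
        # so merge them into a single copy with the latest timestamp.
        merged = dict(local)
        merged["modified_at"] = max(local["modified_at"], remote["modified_at"])
        return merged
    # Default policy the application could override: last-modification-wins.
    return remote if remote["modified_at"] > local["modified_at"] else local
```

The point is that the default policy lives in one overridable function, so an application that knows its data can substitute a smarter merge.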
These are things that can and do happen, and you have to plan for it. Designing your applications and infrastructure with eventual consistency in mind can do wonders for both performance and redundancy.
I suspect this is a deal breaker for many. It certainly is for me.
- The tables need to be changed and the application layer needs to change to support it, which is a big hassle and very fragile. It's easy to introduce update conflicts, and it's a nightmare when dealing with a group of updates in a transaction: you can't really roll back a transaction at the replicated nodes.
- Whenever a node fails, the replication ring is broken: updates pile up at the previous node while subsequent nodes' data becomes stale. It requires immediate human attention to fix, which defeats the purpose of an HA cluster.
- Related to the above: it's very difficult to add a new master node without stopping the cluster. The "catchup" process is very manual and fragile.
- Data on different nodes becomes stale under high replication load, so clients reading different masters get stale data. They are supposed to be masters, and they serve stale data?!
- Multi-master doesn't help write scalability at all; every node has to handle every write, and MySQL's single-threaded replication apply doesn't help. For read scalability, master-slave is better.
I abandoned the design after a while and chose a different approach. I ended up using disk-based replication with DRBD. A two-machine cluster forms the master cluster, one active and one standby. Writes are replicated to both machines at the disk level, synchronously. When the active node fails, the standby becomes active within seconds, automatically, with the complete data on disk.
The beauty of this approach is the simple design and setup. The data are always in sync, no stale data. Failover is immediate and automatic. The failed node can automatically catch up when back online. The database application doesn't need any change and all the SQL semantics are preserved. The cluster has one IP so the clients don't need special connection logic. They just need to retry when connection fails.
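For reference, the DRBD half of such a setup looks roughly like this resource definition (hostnames, devices, and addresses here are invented; protocol C is DRBD's synchronous mode, where a write completes only once it hits both disks):

```
resource r0 {
  protocol C;                  # synchronous: write acked by both nodes
  on db-active {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on db-standby {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```

The single cluster IP would typically be a floating/virtual IP moved by whatever cluster manager (Heartbeat, Pacemaker, etc.) promotes the standby.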
For disaster recovery, I built another two-machine cluster in another datacenter acting as the slave, doing async replication from the master cluster. If the two-machine master cluster completely failed (as in the datacenter burnt down), the slave cluster could become master via a manual process within 30 minutes. The 30-minute SLA is for someone to get paged, look at the situation, and decide to fail over. There are too many uncertainties across datacenters to fail over automatically.
Added bonus: slaves can still hang off the master cluster for read scalability. And it works with any disk-based database, not just MySQL.