Hacker News new | comments | ask | show | jobs | submit login

Reading this post-mortem and their MySQL HA post, this incident deserves a talk titled: "MySQL semi-synchronous replication with automatic inter-DC failover to a DR site: how to turn a 47s outage into a 24h outage requiring manual data fixes"


> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. [...]

> Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.

> We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs.


> The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.

> However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact