
In my career, the worst outages (longest downtime) I can recall have been due to HA + automatic failover. Everything from early NetApp clustering solutions corrupting the filesystem to cross-country split-brain issues like this.

Admittedly, I don't recall all the incidents where automatic failover minimized downtime, and probably if a human had to intervene in each of those, the cumulative downtime would be more significant.

But boy, it sure doesn't feel like it.




In cases where you can rely on the system to self-repair, automatically moving writes quickly seems reasonable. But otherwise, it seems like you want the system to cope with the situation where writes are unavailable -- clearly a lot of things will be broken, but if writes fail fast, reads are still viable.
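To make that concrete, here's a minimal sketch in Python of the fail-fast idea; the primary/replica objects and their get/set calls are hypothetical stand-ins for whatever client library you'd actually use:

    # Hypothetical clients: replica.get(key) and primary.set(key, value, timeout=...)
    # are stand-ins, not a real library's API.
    class DegradedCapableStore:
        def __init__(self, primary, replicas, write_timeout=0.5):
            self.primary = primary        # may be dead or partitioned away
            self.replicas = replicas      # still reachable for reads
            self.write_timeout = write_timeout

        def read(self, key):
            # Reads never depend on the primary; try replicas until one answers.
            for replica in self.replicas:
                try:
                    return replica.get(key)
                except (ConnectionError, TimeoutError):
                    continue
            raise RuntimeError("no replica reachable")

        def write(self, key, value):
            # Fail fast instead of queueing writes or auto-promoting a replica;
            # a human (or a genuinely self-repairing system) picks the new master.
            try:
                return self.primary.set(key, value, timeout=self.write_timeout)
            except (ConnectionError, TimeoutError) as exc:
                raise RuntimeError("primary unavailable; write rejected") from exc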

Assuming you have that, it's OK to rely on a human to assess the situation, make sure the dead master is really dead, salvage any partially replicated transactions, and crown a new master. With the right tools, it could take only a few minutes -- a bit longer if you have to wait for the old master to boot to see if it had locally committed transactions that didn't make it to the network. If it takes 5 minutes to resolve this (including time to get to the console), you can do this ten times a year and still have three nines.
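The arithmetic holds up; roughly:

    # 99.9% uptime leaves ~525.6 minutes of downtime budget per year.
    minutes_per_year = 365 * 24 * 60               # 525,600
    three_nines_budget = minutes_per_year * 0.001  # ~525.6 minutes

    incidents = 10
    minutes_per_incident = 5                       # get to console, verify, promote
    downtime = incidents * minutes_per_incident    # 50 minutes

    print(downtime <= three_nines_budget)          # True, ~475 minutes to spare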

For the more likely case where it's a network blip, the situation resolves itself (in a nice way) by the time the operator gets to the console.


Indeed, given recent history I'd almost suggest it's better to take the site down for a few minutes than to let the automatic failover systems put you into a 24-hour degraded-service situation.


We're more aware of simple processes that don't work well than of complex ones that work flawlessly. - Marvin Minsky, MIT AI lab co-founder ... via http://github.com/globalcitizen/taoup


I share that sentiment. HA and automated failover seem to be either simple, because the application supports it, or really, really hard.

Stateless applications are simple. Systems built with this in mind, like Cassandra, Elasticsearch, or Redis+Sentinel, just do it right after you set two or three options like minimum quorum size.
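For illustration, the quorum rule those systems lean on is just "strict majority"; a simplified sketch (ignoring membership changes and the like):

    # A partition may only promote a new master if it can see a strict
    # majority of the cluster -- so at most one side of a split can promote.
    def has_quorum(visible_nodes: int, cluster_size: int) -> bool:
        return visible_nodes >= cluster_size // 2 + 1

    # 5-node cluster split 3/2: only the majority side may fail over.
    assert has_quorum(3, 5)
    assert not has_quorum(2, 5)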

But what about systems without this built in, like NFS, MySQL, or Postgres? I guess we don't hear about the successful automated failovers, but we surely hear about the really messy automated failover attempts.


Even in cases like Cassandra it's far from flawless, mostly due to the massive complexity. Failover works great, compaction works great, schema changes work great, repairs work... but what happens if two of those happen at once? Or all of them? There have been quite a few bugs over the years that involve corner cases when the various coordination and deferred-work systems interact.


I wish we had a public database of outages / postmortems somewhere that anyone could contribute to, so that each of us could move from our own slowly acquired experience to more thorough statistical data. Does such a thing already exist?



