
Yuh, I have the same question too :-). Hard to do cost/benefit analysis when you only see the costs.



In general, automated failover seems to make most small problems non-problems, but it turns some small problems into big problems. Whether that trade makes sense probably depends on the actual numbers for your app.

For some systems, I'd take getting rid of the small outages: I'll happily accept an increased risk of a projected 15-minute loss of heart function becoming a >60-minute loss if it also eliminates what would otherwise be a bunch of 5-minute losses, since even the 5-minute interruptions would be fatal.

(Or, for a better example, revolvers vs. semi-autos. A revolver is generally more reliable, but if it goes out of timing it's basically doomed, whereas a semi-auto can jam or break parts, but a monkey can clear it and a trained monkey can fix it.)
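
To put rough numbers on that tradeoff (all of these rates are made up, just to show the shape of the math):

    # Hypothetical expected annual downtime, in minutes.
    # Without automated failover: lots of short outages plus one medium one.
    manual = 12 * 5 + 1 * 15                # = 75 minutes/year
    # With automated failover: most short outages disappear, but assume a
    # 20% chance the medium outage turns into a long one instead.
    auto = 1 * 5 + 0.8 * 15 + 0.2 * 60      # = 29 minutes/year
    print(manual, auto)

Even with the occasional blow-up, expected downtime can come out well ahead; whether it actually does depends entirely on your real failure rates.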


Failover is meant to deal with hardware failures, and for those it tends to work just fine. But if the node you are failing over onto is already at 60% capacity and the failover dumps another 60% of load onto it, things are going to get worse.
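
As a back-of-the-envelope sanity check (the 60% figures are just the hypothetical ones above, and can_absorb is a made-up helper, not anything from a real product):

    # Sketch of a pre-failover check: refuse to dump the failed node's load
    # onto a target that can't absorb it. Numbers are illustrative only.
    def can_absorb(target_utilization, incoming_load, headroom=0.9):
        return target_utilization + incoming_load <= headroom

    print(can_absorb(0.60, 0.60))   # False: 120% of capacity, cascade territory
    print(can_absorb(0.30, 0.60))   # True: 90% is tight, but it fits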

The top-level systems probably need to be able to deal with increased latency or timeouts, and properly handle retries and throttling of traffic.
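
Something along these lines, for example (a generic sketch, nothing specific to any particular stack; the names are made up):

    import random, time

    def call_with_retries(fn, attempts=4, base_delay=0.2, timeout=2.0):
        # Retry a flaky call with exponential backoff plus jitter, and give up
        # after a few attempts rather than piling retries onto a sick node.
        for attempt in range(attempts):
            try:
                return fn(timeout=timeout)
            except TimeoutError:
                if attempt == attempts - 1:
                    raise
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))

    # Stand-in for a real RPC that sometimes times out:
    def flaky_rpc(timeout):
        if random.random() < 0.3:
            raise TimeoutError
        return "ok"

    print(call_with_retries(flaky_rpc))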

If you have some HA failover setup going but your alternate is already being used more for load balancing than for failover, problems like this will occur.

(I used to work on failover drivers for a SAN).


GitHub's failover problems have never been load-related. GitHub has pairs of fileservers where one is the master and the other's sole job is to follow along with the master and take over if it thinks the master is down, so when they do fail over, it is to a node with just as much capacity as the previous master.

All the failover problems I can think of since they moved to this architecture 4 years ago have been coordination problems, where something undesired happened when transitioning from one member of a pair to the other. In this case, network problems led them to a state where both members of a pair thought they were the master.
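
That's the classic split-brain failure mode. A toy illustration (emphatically not GitHub's actual code) of why "I haven't heard from the master" can't be the whole promotion rule:

    # A replica that promotes itself on a missed heartbeat alone can become a
    # second master when the real problem is its own network link. Requiring a
    # majority of independent witnesses to agree, and bumping an epoch number
    # on promotion so the old master's writes can be fenced, is one common fix.
    class Replica:
        def __init__(self, witnesses):
            # witnesses: callables answering "do you also think the master is down?"
            self.witnesses = witnesses
            self.epoch = 0
            self.is_master = False

        def maybe_promote(self, seconds_since_heartbeat, timeout=10.0):
            missed = seconds_since_heartbeat > timeout
            votes = sum(w() for w in self.witnesses)
            if missed and votes > len(self.witnesses) // 2:
                self.epoch += 1
                self.is_master = True
            return self.is_master

    # A partition that only isolates this replica: the heartbeat is missed,
    # but the witnesses still see the master, so no second master appears.
    replica = Replica(witnesses=[lambda: False, lambda: False, lambda: True])
    print(replica.maybe_promote(seconds_since_heartbeat=30))   # False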


The same argument applies to UPSes: I hear about far more outages caused by UPS failures than by the PSU or utility-supply failures they're meant to guard against.


I've never seen a redundant PSU be worse than a single PSU, actually (like the dual-line-cord modules on many servers or network devices). PSU failures have gone way down in the past 15 years or so that I've been observing them. The only power supplies I routinely see dying are external transformers on low end network devices and on systems exposed to really dirty power.

I have seen facility-scale UPSes go bad, and sometimes in weird ways, but an order of magnitude less frequently than grid power.

I think sometime in the past 10 years we passed a crossover point where designing a single facility for survivability, rather than replicating across facilities, ceased to be worthwhile for most Internet applications. It doesn't really make sense to drop $2b on a ~100k sf datacenter like AboveNet used to do for e.g. 365 Main. There are still some systems where replication is a pain, but even for those, I think metro-area replication shouldn't be that hard. Even just running 5km of fiber in a loop between a few buildings in the same town gets you a huge amount of resilience against most facility problems.



