Isn't the moral of the story, "Check your backups"? There was a defective fan in...

olefoo · on June 18, 2012

Having been affected several times by colocation facilities bouncing the power during a test of the failover system, I can tell you that such tests are not without risk. Yes, you should test redundant systems, but how often, at what cost, and what risks are you willing to run while doing so?

It's a fact of life that when dealing with complex, tightly coupled systems with multiple interactions between subsystems, you will routinely see accidents caused by improbable combinations of failures.

spartango · on June 18, 2012

I wonder if it's better to create an accidental outage during a scheduled test, or to have an outage completely out of the blue. Obviously mitigation is tricky even during a scheduled test, but perhaps it's plausible?

raverbashing · on June 18, 2012

Or do it with a non-production load

AWS could create some machines at a lower cost and lower availability, just something that doesn't affect you much if it goes down, or for one-off usages.

I'm not sure how migrating machines between nodes happens in S3 or if it's easy to do it (maybe with some downtime)

dkulchenko · on June 18, 2012

With a scheduled test, you have the benefit of having the main power actually working if the backup being tested comes crashing down; seems to me that mitigation would be much quicker in a scheduled test than in a real outage.

excuse-me · on June 18, 2012

But would you rather risk a once-in-10-years real power failure testing your backups, or would you pull the plug once per year to test it yourself?

richardw · on June 18, 2012

I suspect a monthly test - if communicated to customers as well - would drive better customer behaviour, e.g. multi-AZ usage, automated validation of EBS, adding extra machines in another AZ automatically. Maybe start with one or two AZs with an opt-in from customers.

It's the same as advice to routinely replace your live data from backups. It's not a real backup until you've tested that you can recover from it.

excuse-me · on June 18, 2012

It's an interesting and well studied area of statistics for medical tests.

E.g., if you have a test that is 99% accurate and a treatment that harms 1% of the patients and you do this screening of a million people - how common does the disease have to be before you cure more people than you kill?
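
For what it's worth, the break-even point can be worked out with a few lines of Python. This is only a rough sketch under assumptions that are not in the comment above: "99% accurate" is read as 99% sensitivity and 99% specificity, everyone who tests positive gets treated, every true positive counts as cured, and the 1% harm rate applies to everyone treated (sick or not). The function names are made up for illustration.

```python
def screening_outcomes(population, prevalence,
                       sensitivity=0.99, specificity=0.99, harm_rate=0.01):
    """Return (cured, harmed) as expected head counts.

    Assumptions (not from the original comment): every positive test leads
    to treatment, every true positive is counted as cured, and harm_rate
    applies to everyone treated, sick or healthy.
    """
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity
    false_positives = healthy * (1.0 - specificity)
    treated = true_positives + false_positives
    cured = true_positives
    harmed = treated * harm_rate
    return cured, harmed


def break_even_prevalence(sensitivity=0.99, specificity=0.99, harm_rate=0.01):
    """Prevalence at which expected cured equals expected harmed.

    Solves p*s = h*(p*s + (1 - p)*f) for p, where f = 1 - specificity.
    """
    false_positive_rate = 1.0 - specificity
    harmed_if_healthy = harm_rate * false_positive_rate
    return harmed_if_healthy / (sensitivity * (1.0 - harm_rate) + harmed_if_healthy)


if __name__ == "__main__":
    p = break_even_prevalence()
    print(f"break-even prevalence: {p:.6%} (about 1 in {1 / p:.0f})")
    for prevalence in (0.00005, 0.001):
        cured, harmed = screening_outcomes(1_000_000, prevalence)
        print(f"prevalence {prevalence}: cured {cured:.1f}, harmed {harmed:.1f}")
```

Under those assumptions the disease has to affect roughly 1 person in 10,000 before screening a million people helps more than it hurts. Change the assumptions (say, a treatment that doesn't always work) and the threshold moves accordingly.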

gameshot911 · on June 18, 2012

I agree, it sounds like this could have been discovered had the two backup power systems been properly tested.

Note how in the case of backup power, "properly tested" doesn't mean 'Does the generator turn on? Are we getting electricity from it? Ok, pass!'. It means running the backup generator in a way that is consistent with what you would expect in an actual power failure - i.e., for more than just a few minutes.

Same thing with storage backup. Checking your backups isn't just 'was a backup file/image created?', it means _actually trying to recover your systems from those backup files_.
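
To make the storage point concrete, here is a minimal sketch of a restore test in Python: it unpacks a tar archive into a scratch directory and compares checksums of every file against the source. The function names (`make_backup`, `verify_restore`) and the tar.gz format are assumptions for illustration, not anyone's actual backup tooling, and a real test would go further (boot the restored system, run the application against it).

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def make_backup(source_dir, archive_path):
    """Hypothetical backup step: tar+gzip the whole directory."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")


def verify_restore(source_dir, archive_path):
    """Actually restore the archive and compare it with the source.

    Returns True only if every file comes back byte-for-byte identical
    and nothing is missing or extra.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path) as tar:
            tar.extractall(scratch)
        return checksums(source_dir) == checksums(scratch)
```

Comparing against the live source only works if the source has not changed since the backup was taken; for anything real you would record the checksums at backup time and compare the restored files against that record instead.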

enjo · on June 18, 2012

But what about parts that are going to fail in the next 35 minutes of running, when you only run it for 30? There are too many variables to account for here, I think.

patrickgzill · on June 18, 2012

The datacenter I use has a policy to do full load runs for 30 minutes each month.

AWS would have found this, and been able to fix it in a timely fashion, if they did the same (the genset lasted for 10 minutes under load before failing).

jaggederest · on June 18, 2012

Most of the generators I'm aware of require this for maintenance purposes anyway - without running them occasionally the lubricants will freeze up and seize the engine.

dredmorbius · on June 18, 2012

All sorts of stuff.

Diesel is great food. For bacteria, that is. So it's treated, but you've still got to stir it to keep gunk from settling, you've got to rotate it (so you burn through your stock every so often), you've got to filter it. And stages of all of that can go wrong.

I recall a cruise on a twin V12 turbodiesel powered ship (hey, we've got full redundancy!) in which both engines failed. Cause? Goopy fuel and clogged filters (she spent a lot of time in port). This happened a couple of hours into cruise, fortunately on inland waterways, not open seas, shallow enough water to anchor, and numerous skiffs with which we could head to shore and find replacement parts.

More recently, a colo site I know of was hit by a similar outage: utility power went out, generators kicked in, but a transfer switch failed to activate properly. Colo dumped.

Second time it was the fire detection system which inadvertently activated. One of its activation modes is to deenergize the datacenter (if you've got a problem with too much energy liberation, not dumping more energy into the system can be a useful feature). Again, full colo dump. And APCs will only buy you enough time for a clean shutdown. If you've got them at all.

But, yes: exercise your backups.

mikiem · on June 18, 2012

Yes, generators must be run periodically. However, not all data centers actually put the full load (or any load at all) on the generators during non-emergency periodic testing.