
I've seen places that /did/ test their backup power - but they got failures anyway, because of faults the test didn't reveal.

For example you switch off the data centre circuit breakers and everything fails over to generators just fine. Test successful, right?

Then when there's a real outage you have problems because the operations team's computers have all gone off, so they can't migrate load to a different data centre. It didn't happen in testing, because they aren't in the data centre so their kit isn't connected to the breakers you turned off.

Or it turns out the wireless APs aren't on UPSes. Or it turns out there's a switch in a closet somewhere that isn't on a UPS. Or they tested for a single loss of power, but when the mains power toggles on and off every 30 seconds the UPS batteries get run down. Or they need to top up the generator and they discover you can't get fuel delivered at 9pm on a Friday. Or the generator doesn't recharge the UPS, but you have to turn off the generators to refuel them. Or a guy had a standalone UPS for his desktop, but his monitor wasn't connected as the UPS only came with IEC C13 power cables and his monitor needed IEC C5...
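To put rough numbers on the flapping-mains case, here is a minimal sketch of a UPS that drains at the full load while mains is down and recharges much more slowly while mains is up. Every figure in it (100 Wh battery, 600 W load, 60 W charger) and the function name `simulate_ups` are made up for illustration; real UPSes have non-linear discharge curves and their own charging behaviour.

```python
def simulate_ups(capacity_wh, load_w, charge_w, mains_pattern, step_s=1):
    """Return seconds until the battery is empty, or None if it survives.

    mains_pattern: one boolean per time step; True means mains is up.
    All parameters are hypothetical; this is a linear toy model.
    """
    energy_wh = capacity_wh
    for t, mains_up in enumerate(mains_pattern):
        if mains_up:
            # Mains carries the load; the charger slowly tops up the battery.
            energy_wh = min(capacity_wh, energy_wh + charge_w * step_s / 3600)
        else:
            # Battery carries the whole load.
            energy_wh -= load_w * step_s / 3600
            if energy_wh <= 0:
                return (t + 1) * step_s
    return None


# One hour, one-second steps.
# Case 1: a single 5-minute outage, then mains stays up.
single_outage = [False] * 300 + [True] * 3300

# Case 2: mains toggles every 30 seconds, starting with it down.
flapping = [(t // 30) % 2 == 1 for t in range(3600)]

print("single outage:", simulate_ups(100, 600, 60, single_outage))
print("flapping mains:", simulate_ups(100, 600, 60, flapping))
```

With these assumed numbers the single outage is survived, while the flapping pattern empties the battery well before the hour is up, because each 30 seconds of discharge costs far more than 30 seconds of charging puts back. A test that only pulls the breaker once never exercises that.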




One place I worked at found this out during the big storm in the UK: the UPS worked fine and all the Telecom Gold machines stayed up, but they had forgotten to put the modems on the UPS :-)


This is why chaos monkeys are a good thing. Regular drills too.

But you want to test the things you haven't thought about.



