Amazon had a correct setup--but not great testing.
By the way, these are great questions to ask of your datacenter provider: Are there two completely redundant power systems up to and including the PDUs and generators? How often are those tested? How do I set up my servers properly so that if one circuit/PDU/generator fails, I don't lose power?
There is a "right way" to do this--multiple power supplies in every server connected to 2 PDUs connected to 2 different generators--but it's expensive, and many/most low-end hosting providers won't set this up due to the cost.
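A rough sketch of how you might sanity-check that wiring from an inventory, assuming you can list each server's power feeds as PDU/generator pairs. The `Feed` type, the function name, and the component names are all made up for illustration, and component names are assumed to be unique across PDUs and generators.

```python
from typing import NamedTuple


class Feed(NamedTuple):
    """One power path into a server: PSU -> PDU -> generator (hypothetical model)."""
    pdu: str
    generator: str


def single_points_of_failure(feeds):
    """Return every PDU/generator whose failure alone would cut all feeds."""
    components = {f.pdu for f in feeds} | {f.generator for f in feeds}
    return {c for c in components
            if all(c in (f.pdu, f.generator) for f in feeds)}


# Hypothetical inventory: two power supplies per server.
redundant = [Feed("pdu-a", "gen-1"), Feed("pdu-b", "gen-2")]
shared_gen = [Feed("pdu-a", "gen-1"), Feed("pdu-b", "gen-1")]

print(single_points_of_failure(redundant))   # nothing shared between the feeds
print(single_points_of_failure(shared_gen))  # both feeds hang off one generator
```

An empty result means no single PDU or generator failure takes the server down; anything in the set is worth raising with the provider.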
(I ran a colocation/dedicated server company from 2001-2007.)
And "are they tested for long enough to detect a faulty cooling fan that'll let the primary generator run at normal full working load for only ~10mins, and are the secondary gensets run and loaded up long enough to ensure something that'll trip ~5mins after startup isn't configured wrong?"
While they clearly failed, I do have some sympathy for the architects and ops staff at Amazon here. I could very easily imagine a testing regime which regularly kicked both generator sets in, but without running them at working load for long enough to notice either of those failures. My guess is someone was feeling quite smug and complacent 'cause they've got automated testing in place showing months and years' worth of test switching from grid to primary to secondary and back, without ever having thought to burn enough fuel to keep the tests running the generators long enough to expose these two problems.
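To make that concrete, here is a minimal sketch of the "is the test long enough?" question. The ~10 and ~5 minute figures are the rough numbers from this thread; treating both faults as only showing up under load is my assumption, and the `Fault` type and function name are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    name: str
    minutes_to_trip: float   # rough time before the fault shows itself
    needs_load: bool         # assumption: only appears under working load


# Approximate figures from the discussion above, not measured data.
FAULTS = [
    Fault("primary cooling fan", 10, True),
    Fault("secondary misconfiguration", 5, True),
]


def exposed_faults(faults, test_minutes, under_load):
    """Names of faults that a test of this length and load would reveal."""
    return [f.name for f in faults
            if test_minutes > f.minutes_to_trip
            and (under_load or not f.needs_load)]


print(exposed_faults(FAULTS, test_minutes=2, under_load=True))    # quick switch test
print(exposed_faults(FAULTS, test_minutes=15, under_load=True))   # longer loaded run
```

Under these assumptions a quick transfer test passes every time while catching neither fault, which is exactly how years of green test results can coexist with two latent failures.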
"There is a "right way" to do this …"
There's a _very_ well known "right way" to do this in AWS - have all your stuff in at least two availability zones. Anybody running mission critical stuff in a single AZ has either chosen to accept the risk, or doesn't know enough about what they're doing… (Hell, I've designed - but never got to implement - projects that spread over multiple cloud providers, to avoid the possible failure mode of "What happens if Amazon goes bust / gets bought / all goes dark at once?")
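A quick way to audit that, sketched against a plain list of (service, availability zone) pairs rather than any real AWS API. The service names and the inventory are hypothetical; in practice you would build the list from whatever your tooling reports.

```python
from collections import defaultdict


def single_az_services(instances):
    """Return services whose instances all sit in one availability zone."""
    zones = defaultdict(set)
    for service, az in instances:
        zones[service].add(az)
    return sorted(s for s, z in zones.items() if len(z) < 2)


# Hypothetical inventory of (service, availability zone) pairs.
inventory = [
    ("web", "us-east-1a"),
    ("web", "us-east-1b"),
    ("db", "us-east-1a"),
    ("db", "us-east-1a"),
]

print(single_az_services(inventory))  # services exposed to a single-AZ outage
```

Anything this returns is either an accepted risk or an oversight; the point is knowing which.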
Run the generators and have the grid as backup
And just stop the generators to validate fallback to grid once in a while
It comes down to cost and zoning/permitting. It's much easier to get a permit to run a generator for backup use than to run one 24x7. It's also hard to get a 1-10MW plant which is per-kWh as efficient/inexpensive as the grid (although now that natural gas is about 20% of what it was when I last bought it, gas turbines actually are cheaper than industrial tariff grid power, if you have good gas access...). Being able to actually use the waste heat is what makes the combined cycle efficiency worth it.
There was a crazy plan to run a datacenter on a barge tethered to the SF waterfront, for a variety of reasons, but a primary one being power -- the SF city government wouldn't be able to regulate the engines/generators on a ship running 24x7.
They're a nice idea in principle (and were the best option back in the mainframe era), but power electronics have gotten better faster than rotational maintenance at a datacenter company. They also weren't widely deployed enough to have a great support system, and it was firmware/software which caused most of their outages.
Dual line cord for network devices, and then STSes per floor area, probably make the most sense. Basically no commodity hosting provider uses dual line cord servers on A and B buses. I love having dual line cord for anything "real" (including virtualization servers for core infrastructure), but when you're selling servers for $50/mo, you can't.
(there's the Google approach to put a small UPS in each server, too...)
No. Incorrect. There is a reason I 100% refused to move my hosting company there. I'm not going to say anything else publicly, but it wasn't the hardware that caused repeated outages there. (I moved my hosting company from San Francisco to San Jose, and lived in the Bay Area for 10 years. Everyone in the hosting industry in the Bay Area knew each other. I also hosted for years in AboveNet SJC3, which had the same flywheel setup.)
Note: I hope at this point they've fixed the issue. I've been out of the industry for a few years. I wish them the best.
But the hitech UPS was a weak link. When they sold all their facilities to someone else (DRT), that fixed most of the other issues.
I also wanted to address your point about batteries. We have a device on each battery that monitors its state, so we can find faults before they cause the entire UPS to fail.
Curious - when "testing" them, how long do you run them for and at what load?
I could see the beancounters being _very_ unhappy with the ops people saying "we want to run both gen sets at full datacenter load for more than 10 minutes at a time, every month", which is what Amazon would have to have done to detect the faulty cooling fan problem. I'm guessing there are _some_ organisations who do that, but I suspect most datacenters don't.