Usually the root cause appears simple - a dead fan, breaker set to the wrong threshold, alarm that didn't trigger, incorrect component picked during design phase, or whatever else that gets the blame - things it would seem to a software guy that good processes could mitigate.
Can any electrical engineers elaborate on why power networks fail (in my experience at least) so frequently? I guess failure modes (e.g. lightning strike) are hard to test, but surely an industry this old has techniques. Is it perhaps a cost issue?
As an example, Equinix in Chicago failed back in like 2005. Everything went really well, except there was some kind of crossover cable between generators that helped them balance load that failed because of a nick in its insulation. This caused some wonky failure cycle between generators that seemed straight out of Murphy's playbook.
They started doing IR scans of those cables regularly as part of their disaster prep. It's crazy how much power is moving around in these data centers, in a lot of way they're in thoroughly uncharted territory.
In my experience (in ~30 datacenters worldwide, and reading about more), the actual -48v Direct Current plant is usually ROCK SOLID, in comparison to the AC plant. It's almost always overprovisioned and underutilized, at least in older facilities, or those with telcos onsite (who, unlike crappy hosting customers, actually understand power).
My pro tip for datacenter reliability is to try to get as much of your core gear on the DC plant as possible -- core routers, and maybe even some of your infrastructure servers like RRs, monitoring, OOB management, etc. Ideally split stuff between DC and AC such that if either goes down, you're still sort of ok, or at least can recover quickly. DC and AC is even better than dual AC buses, since what starts out as dual AC can easily end up with a single point of failure later (like when they start running out of pdu space, power, or whatever), and dual AC is also more likely to have a closer upstream connection.
DC stuff is WAY simpler to make reliable and redundant, just uses larger amounts of copper and other materials.
- The work is usually done by outside contractors, working off of specifications that may or may not make sense.
- Some aspects of testing have the potential to be dangerous to the people doing them. (ie. if network switch fails in testings, no big deal. If some types of electrical switch break during testing, the tester is dead.) High voltage electricity is not a toy.
- IT and facilities staff usually don't talk much, and often don't understand each other when they do.
- There's no instrumentation. I get an alert when IT systems aren't configured right. Nothing from the other stuff.
- There is a wide variance in quality of electrical infrastructure that isn't obvious to someone who isn't skilled in that area. IT folks don't need to deal with computers built in 1970. Electricians deal with ancient stuff that may be completely borked all of the time.
I know that you can test in real time your electrical protection systems for almost all the possibilities you can imagine (thousands of them), for example: faults in your high voltage utility distribution system, breaker failures, coordination of the protection systems, lost of your back-up generator power. I don't know their systems or their philosophies, would be interesting for me to know why they don't parallelize groups of generators (at the backup system), so, when one generator fails, the power load are balanced to the others (and using well known schemes to avoid cascade failures).