It's incredibly complicated, and difficult to test fully. The parts of Amazon's data center that failed seem like things normal testing should catch, but the data center power failures I've dealt with in the past always involved some really precise sequence of events that caused a strange failure no one expected.
As an example, Equinix in Chicago failed back around 2005. Everything went really well, except that some kind of crossover cable between the generators, which helped them balance load, failed because of a nick in its insulation. This caused a wonky failure cycle between the generators that seemed straight out of Murphy's playbook.
They started doing IR scans of those cables regularly as part of their disaster prep. It's crazy how much power is moving around in these data centers; in a lot of ways they're in thoroughly uncharted territory.
The even crazier thing is big industrial plants that use tens or hundreds of MW and have much lower margins than datacenter companies, so they run with dual grid feeds (HV, sometimes like 132 kV) and no onsite redundancy. As in, when the grids flicker, they lose $20M of in-progress work.
I'd guess that's because "tens or hundreds of MW" of on-site backup power would be _ludicrously_ expensive to own and maintain, and accepting the risk of both feeds of their dual grid flickering at once and trashing the current batch works out cheaper. (Or maybe the power supply glitches are insurable, or there are contract penalty clauses with the power companies?)
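To make that guess concrete, here's a rough back-of-the-envelope sketch of the tradeoff. Only the $20 million per-event loss comes from this thread; the outage rate and the annualized backup cost are made-up placeholders, and the function names are mine, so treat it as an illustration of the reasoning rather than real plant economics.

```python
def expected_annual_outage_loss(events_per_year: float, loss_per_event: float) -> float:
    """Expected yearly loss from dual-feed outages: rate times cost per event."""
    return events_per_year * loss_per_event


def backup_pays_off(events_per_year: float, loss_per_event: float,
                    backup_annual_cost: float) -> bool:
    """True if owning backup power costs less per year than the expected loss."""
    return backup_annual_cost < expected_annual_outage_loss(events_per_year, loss_per_event)


# Per-event loss quoted upthread.
loss_per_event = 20_000_000

# ASSUMPTION: both feeds drop at once roughly once every ten years.
events_per_year = 0.1

# ASSUMPTION: annualized cost to own and maintain on-site backup generation.
backup_annual_cost = 5_000_000

expected_loss = expected_annual_outage_loss(events_per_year, loss_per_event)

# Outage rate at which backup power would start to pay for itself.
break_even_rate = backup_annual_cost / loss_per_event

print(f"expected annual loss: ${expected_loss:,.0f}")
print(f"break-even outage rate: {break_even_rate:.2f} events/year")
print("backup pays off:", backup_pays_off(events_per_year, loss_per_event, backup_annual_cost))
```

With those placeholder numbers the plant would need a simultaneous dual-feed outage about once every four years before backup generation paid for itself, which is the shape of the argument above. Insurance or penalty clauses would lower the effective loss per event and push the break-even rate even higher.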