I also wanted to address your point about batteries. We have a device on each battery that monitors its state, so we can find faults before they cause the entire UPS to fail.
Curious - when "testing" them, how long do you run them for and at what load?
I could see the beancounters being _very_ unhappy with the ops people saying "we want to run both gen sets at full datacenter load for more than 10 minutes at a time, every month", which is what Amazon would have had to do to detect the faulty cooling fan problem. I'm guessing there are _some_ organisations that do that, but I suspect most datacenters don't.