I've been noticing a trend recently when reading about large scale failures of any system: it's never just one thing.
AWS EBS outage, Fukushima, Chernobyl, even the great Chicago Fire (forgive me for comparing AWS to those events).
Sure there's always a "root" cause, but more importantly, it's the related events that keep adding up to make the failure even worse. I can only imagine how many minor failures happen world wide on a daily basis where there's only a root cause and no further chain of events.
Once a system is sufficiently complex, I'm not sure it's possible to make it completely fault-tolerant. I'm starting to believe that there's always some chain of events which would lead to a massive failure. And the more complex a system is, the more "chains of failure" exist. It would also become increasingly difficult to plan around failures.
This is an interesting point you hit on and something that is stressed in scuba diving. Basically whenever you go out on a dive you only want to change one thing at a time. Only one thing can be "new" or "untested" or "new to you". Otherwise you run the risk of being task overloaded which leads to cascading failure - potentially catastrophic and/or nonrecoverable.
A similar point is made in Gene Weingarten's "Fatal Distraction" (http://www.pulitzer.org/works/2010-Feature-Writing), which was about parents who forget a child in the car. Excerpt: "[British psychologist James Reason] likens the layers to slices of Swiss cheese, piled upon each other, five or six deep. The holes represent small, potentially insignificant weaknesses. Things will totally collapse only rarely, he says, but when they do, it is by coincidence -- when all the holes happen to align so that there is a breach through the entire system."
I've been an AWS (S3, EC2, SQS) user for over 3 years now and this article detailing their systems at a mid-level is kind of scaring me off of their platform. It just sounds so complicated and I'm not sure I want to rely on it for anything critical until I can really understand it myself.
Also, a couple other complex systems for your trend are; financial markets and commercial jets.
The examples he draws from are nuclear power plant failures (TMI in particular), civil aviation and oil transport. But the basics will be recognizable to anyone who has dealt with large computing installations; interactive complexity, tight coupling and cascading failures.
It is not a reassuring book, you won't be able to look at any complex system without asking yourself what sequence of simple, predictable failures of widely separated parts could tip it into a catastrophic failure mode.