Besides, it will be fixed soon enough. Except, then something else comes up, and the last fault goes unresolved.
If our work culture were less focussed on success and blame and more focussed on communication and effort, fewer catastrophes would happen, I'm sure. Unfortunately, the world doesn't work like that.
Accurate communication is possible only in a non-punishing situation.
From Robert Anton Wilson's Illuminatus! trilogy.
From Wikipedia: "'communication occurs only between equals.' Celine calls this law 'a simple statement of the obvious' and refers to the fact that everyone who labors under an authority figure tends to lie to and flatter that authority figure in order to protect themselves either from violence or from deprivation of security (such as losing one's job). In essence, it is usually more in the interests of any worker to tell his boss what he wants to hear, not what is true."
Management, above all, is responsible for an organization that can enable quality through systematic means. There are no other means.
At the largest scale, I find the analysis of Diamond and Tainter comes into play. The capacity to survive smaller crises and overcome them just increases the magnitude of your final failure, though Diamond suggests a few means by which failure may be averted (Tainter seems to find it inevitable).
Ultimately, the resources required to maintain a system prove insufficient.
I've seen, just to list a few:
Load balancers which failed due to software faults (they'd hang and reboot, fortunately fairly quickly, but resulting in ~40 second downtimes), back-up batteries which failed, back-up generators which failed, fire-detection systems which tripped, generator fuel supplies which clogged due to algae growth, power transfers which failed, failover systems which didn't, failover systems which did (when there wasn't a failure to fail over from), backups which weren't, password storage systems which were compromised, RAID systems which weren't redundant (critical drive failures during rebuild or degraded mode, typically), far too many false alerts from notification systems (a very common problem even outside IT: see http://redd.it/1x0p1b on hospital alarms), and disaster recovery procedures which were incomplete, out of date, or otherwise in error.
That's all direct personal experience.
It generally assumes there's one super-cause, and maybe some things that contributed. (Usually even pre-specified as "root cause" analysis when trying to find a problem)
The ultimate (although unstated) goal is almost always to find out how a specific person messed up, and then note what they did wrong. (Kind of human nature)
The culture usually assumes humans are inherently unsafe (i.e., they don't create safety), and we're protecting them from themselves. (This probably does fit the statement that complex systems are heavily layered with protections against failure.)
It often assumes that we can achieve a level of omniscient safety, where no-one is ever unsafe and we see all problems before they occur (safety culture names that imply "less than zero problems" or "we make you safer working here").
The probabilistic nature of accidents is not acknowledged, and it's usually whack-a-mole instead. (This often ties in with hindsight bias, to note how a practitioner messed up the perfect safety system.)
Problem is, I'm not sure how you would actually implement a good, probabilistic safety system that largely keeps people safe, but acknowledges that bad, random things occasionally happen, and that line folks are your best defense for seeing and stopping them. It's counter to the whole leadership meme of decisive action and quick resolution to project strength. It's not very satisfying to hear "we could have spent $1M more on our safety program, but Bob still would have been burnt because it was due to three unlikely things occurring in quick succession."
Through the engineering process, though, you can generally get an idea of where your weakest or least safe points are, based on previous studies. I see no reason that one couldn't stack those failure points into a probabilistic matrix and then apply mitigation methods around those points.
The acceptance of random failure as something largely unavoidable, though, can't be engineered away; it's a human trait. A tire blowout on an 18-wheeler doesn't necessarily mean you failed in the safety design of that tire; the subsequent shift in load balance is the unrecognized catastrophe mitigation built into the system. Yet people will still focus on the tire.
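To make the "probabilistic matrix" idea a little more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the failure points, their per-month probabilities and the severity scores are invented, and the joint probabilities assume the failures are independent, which real ones often are not (shared causes make coincidences more likely than this suggests).

```python
from itertools import combinations
from math import prod

# Hypothetical numbers, for illustration only:
# name -> (probability of failing in a given month, severity score 1-10)
FAILURE_POINTS = {
    "backup generator fails to start": (0.02, 8),
    "failover does not trigger": (0.05, 9),
    "fuel supply clogged": (0.01, 6),
    "false alert ignored": (0.10, 4),
}


def rank_by_expected_severity(points):
    """Order failure points by probability * severity, worst first."""
    return sorted(points, key=lambda n: points[n][0] * points[n][1], reverse=True)


def joint_probability(points, names):
    """Chance that all named failures happen in the same period.

    Assumes independence, so this is a lower bound when failures share a cause.
    """
    return prod(points[n][0] for n in names)


def likeliest_coincidences(points, size=3, top=3):
    """Return the `top` most probable combinations of `size` simultaneous failures."""
    scored = [(joint_probability(points, combo), combo)
              for combo in combinations(points, size)]
    return sorted(scored, reverse=True)[:top]


if __name__ == "__main__":
    print("Mitigate first:", rank_by_expected_severity(FAILURE_POINTS)[0])
    for probability, combo in likeliest_coincidences(FAILURE_POINTS):
        print(f"{probability:.6f}", " + ".join(combo))
```

The second listing is the "three unlikely things in quick succession" case: each combination is rare, but it never comes out as zero, and that is the part a whack-a-mole safety culture has trouble accepting.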
I wonder if it might be possible to blind investigators to whether they are looking at facts preceding an accident or from an audit without a following incident.
is an interesting read and provides a more concrete example of how to run a highly concurrent and fault tolerant application.
It doesn't talk about the social or psychological bullet points in this article; it is more technical. But I found it very readable.
As an addition, I can think of these patterns (just thinking about it for a minute, mostly remembering Erlang talks I've listened to, some from practice):
* Build the system out of isolated components. Isolation will prevent failures from propagating. In Erlang, just launch a process and don't use custom compiled C modules loaded in the VM. In other cases, launch an OS process (or container).
* If your service is running on one single machine, it is not fault tolerant.
* Don't handle errors locally. Build a supervision tree where one part of the system does just the work it is intended to do (e.g., handling a client's request), and another (isolated) part does the monitoring and error handling. Have one process monitor others, one machine monitor another, etc.
Once a segfault or malloc fault has occurred, installing a handler and trying to recover might not be the best solution. Restarting an OS (or Erlang) process might be easier. Another way to put it: once the process has been fouled up, don't trust it to heal itself. Trust another one to clean up after it and spawn a new instance of it.
* Try not to have a master or a single point of failure. Sometimes having a master is unavoidable to create a consistent system, so maybe it can be elected with a well-defined algorithm or library (Paxos, Zab, Raft, etc.).
* Try to build a crash-only system, so that isolated units (OS or Erlang processes) can be killed instantaneously for any reason, at any time, and the system would still work. If you are in control of the system you are building, use atomic file renames, append-only logs, and SIGKILL (or technologies that use those underneath). Don't rely on orderly shutdowns. Sometimes you are forced to use databases, hardware, or systems that already don't behave nicely. Then you might not have a choice.
* Always test failure modes as much as possible. Randomly kill or mess with your isolated units (kill your processes), degrade your network, simulate switch failures, power failures, storage failures. Then simulate multiple failures simultaneously -- your software crashes while you are detecting a hardware failure, and so on.
* As a consequence of the first point and the crash-only property, think very carefully about your storage. To be able to restart a process, it may need to have saved a sane checkpoint of its state. That means having a reliable, stable and fault-tolerant storage system. Sometimes recomputing the state works as well.
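A few of the patterns above (isolation in an OS process, an external supervisor, crash-only restarts, atomic-rename checkpoints) can be sketched in a few lines of Python. This is only a toy under stated assumptions, not how Erlang/OTP does it: `WORKER_SOURCE`, `supervise`, the JSON checkpoint format and the "crash on every third step" rule are all hypothetical names and behaviour made up for the example.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical worker, run as a separate OS process so a crash cannot corrupt
# the supervisor. It counts up to a target and checkpoints after every step.
WORKER_SOURCE = r'''
import json, os, sys, tempfile

path, target = sys.argv[1], int(sys.argv[2])

def load():
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"count": 0}          # first start: no checkpoint yet

def save(state):
    # Write to a temp file in the same directory, fsync, then rename over the
    # old checkpoint. A kill at any instant leaves the old or the new file intact.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

state = load()
while state["count"] < target:
    state["count"] += 1
    save(state)
    if state["count"] % 3 == 0:
        os._exit(1)                  # simulated hard crash: no cleanup runs
'''


def supervise(checkpoint_path, target, max_restarts=10):
    """Restart the worker until it exits cleanly; return the number of restarts."""
    cmd = [sys.executable, "-c", WORKER_SOURCE,
           os.path.abspath(checkpoint_path), str(target)]
    restarts = 0
    while True:
        if subprocess.run(cmd).returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError("worker keeps crashing; escalate instead of looping")
        restarts += 1                # no repair attempt: just start a fresh process


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "state.json")
        restarts = supervise(path, target=7)
        with open(path) as f:
            print(restarts, json.load(f))   # prints: 2 {'count': 7}
```

The worker contains no error handling at all; recovery is just "start again from the last checkpoint", and the supervisor is the only part that decides when to give up. Swapping `os._exit(1)` for random kills from outside would be a crude version of the failure testing bullet.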
Moreover, Cook's piece is very broadly applicable; it doesn't apply just to software systems.
These two documents are complementary, not mutually exclusive.
> 18) Failure free operations require experience with failure.
> Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”.
This maps interestingly to the work of strategic thinker John Boyd (http://en.wikipedia.org/wiki/John_Boyd_%28military_strategis...). (I summarized the general thrust of Boyd's thought in a blog post here: http://jasonlefkowitz.net/2013/03/how-winners-win-john-boyd-...)
In analyzing what separates organizations that win victories from those that do not, Boyd wrote of a quality he called Fingerspitzengefühl -- a German word that can be understood as something like "intuition." (The literal translation of the German word is "fingertip feeling," as in how a successful baseball pitcher can tell where the ball is going to go solely from how it feels rolling out of his hand.) His point was that winning organizations exposed their people to both training (good) and experience (better!) enough so that they could learn to react to emergent situations on instinct, rather than by consulting a manual or waiting for instructions from above. The point quoted above sounds like a call for people working on complex systems to get opportunities to develop their own Fingerspitzengefühl.
Which leads to the thought that maybe a completely failure-free system is not something we should strive for. After all, in a completely failure-free system, nobody would ever get enough experience groping around the edge of the envelope to learn how to intuit where the other edges are. All they'd have is "here there be Dragons!" warnings from the past, which would become less compelling the farther into the past they come from. People are quick to discount warnings that contrast with their personal experience, and if your experience is that the System never fails, it's not hard to imagine people starting to believe that the System cannot fail. Which is fine, until it does fail, and nobody has any idea what to do to fix it.
It's sort of the same thing that happened to the financial sector in the US. After the Crash of 1929 and the Great Depression, a whole set of legal and institutional safeguards were put in place to prevent those things from happening again. But as time passed and generations grew up that had not experienced those crises directly, people began to decry those safeguards as needless bureaucracy. Eventually enough people did so that most of the safeguards were stripped away; at which point the system promptly collapsed again.