
How Complex Systems Fail (1998) - mr_golyadkin
http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
======
empressplay
A fundamental issue here is that people inherently misjudge the competency of
their peers / the state of other projects / components / departments. Thus
when they have a "problem" they assume their single point of failure won't be
catastrophic (because everything else is okay), and that there's no need to
sound an alarm over it (and potentially jeopardise their career).

Besides, it will be fixed soon enough. Except, then something else comes up,
and the last fault goes unresolved.

If our work culture were less focussed on success and blame and more focussed
on communication and effort, fewer catastrophes would happen, I'm sure.
Unfortunately, the world doesn't work like that.

~~~
dredmorbius
Closely related are Celine's Laws, particularly the 2nd:

Accurate communication is possible only in a non-punishing situation.

From Robert Anton Wilson's _Illuminatus!_ trilogy.

From Wikipedia: "'communication occurs only between equals.' Celine calls this
law 'a simple statement of the obvious' and refers to the fact that everyone
who labors under an authority figure tends to lie to and flatter that
authority figure in order to protect themselves either from violence or from
deprivation of security (such as losing one's job). In essence, it is usually
more in the interests of any worker to tell his boss what he wants to hear,
not what is true."

[http://en.wikipedia.org/wiki/Celine%27s_laws](http://en.wikipedia.org/wiki/Celine%27s_laws)

~~~
calinet6
This reveals the main point emergent from (but not necessarily within) the
article: the failure of complex systems is most often the result of poor
management practices.

Management, above all, is responsible for an organization that can enable
quality through systematic means. There are no other means.

~~~
dredmorbius
In the case of smaller organizational systems (projects, companies, possibly
even states), yes.

At the largest scale, I find the analysis of Diamond and Tainter comes into
play. The capacity to survive smaller crises and overcome them just increases
the magnitude of your final failure, though Diamond suggests a few means by
which failure may be averted (Tainter seems to find it inevitable).

Ultimately, the resources required to maintain a system prove insufficient.

------
siliconc0w
At a previous place I worked we used to run a couple of DRBD'd databases. For
every downtime they averted, they seemed to cause two. It's really tempting to
just build simple monolithic apps on a single high-performance server. A lot
of companies where tech isn't their core competency can more or less scale
with Moore's law and avoid the complexity of distributed systems altogether.

~~~
dredmorbius
Your fail-safes are themselves the source of faults and failures.

I've seen, just to list a few:

* Load balancers which failed due to software faults (they'd hang and reboot, fortunately fairly quickly, but resulting in ~40-second downtimes)

* Back-up batteries which failed

* Back-up generators which failed

* Fire-detection systems which tripped

* Generator fuel supplies which clogged due to algae growth

* Power transfers which failed

* Failover systems which didn't, and failover systems which did (when there wasn't a failure to fail over from)

* Backups which weren't

* Password storage systems which were compromised

* RAID systems which weren't redundant (critical drive failures during rebuild or degraded mode, typically)

* Far too many false alerts from notification systems (a very common problem even outside IT: [http://redd.it/1x0p1b](http://redd.it/1x0p1b) on hospital alarms)

* Disaster recovery procedures which were incomplete / out of date / otherwise in error

That's all direct personal experience.

------
akkartik
As it happens, I was just reading
[http://www.macroresilience.com/2012/02/21/the-control-
revolu...](http://www.macroresilience.com/2012/02/21/the-control-revolution-
and-its-discontents-the-uncanny-valley) today, the high point of which is
describing "defense in depth" as a fallacy because of its tendency to increase
fragility in complex systems in the long term.

------
araes
Working in the design of large-scale engineering systems, and watching safety
/ risk reviews, it's a little disturbing to read this and note that almost all
hazard and safety analysis performed goes almost completely counter to these
concepts.

It generally assumes there's one super-cause, and maybe some things that
contributed. (Usually even pre-specified as "root cause" analysis when trying
to find a problem)

The ultimate (although unstated) goal is almost always to find out how a
specific person messed up, and then note what they did wrong. (Kind of human
nature)

The culture usually assumes humans are inherently unsafe (i.e., they don't
create safety), and we're protecting them from themselves. (Does probably meet
the statement that complex systems are heavily layered with protections
against failure)

It often assumes that we can achieve a level of omniscient safety, where no
one is ever unsafe and we see all problems before they occur (safety-culture
names that imply "less than zero problems" or "we make you safer working
here").

The probabilistic nature of accidents is not acknowledged; it's usually
whack-a-mole instead. (This often ties in with the hindsight bias of noting
how a practitioner messed up the perfect safety system.)

Problem is, I'm not sure how you would actually implement a good,
probabilistic safety system that largely keeps people safe but acknowledges
that bad, random things occasionally happen, and that line folks are your best
defense for seeing and stopping them. It's counter to the whole leadership
meme of decisive action and quick resolution to project strength. It's not
very satisfying to hear "we could have spent $1M more on our safety program,
but Bob still would have been burnt because it was due to three unlikely
things occurring in quick succession."

~~~
AndrewKemendo
_Problem is, I'm not sure how you would actually implement a good,
probabilistic safety system that largely keeps people safe, but acknowledges
bad, random things occasionally happen, and that line folks are your best
defense for seeing and stopping it._

Through the engineering process, though, you can generally get an idea of
where your weakest/unsafe points are based on previous studies. I see no
reason one couldn't stack those failure points into a probabilistic matrix and
then apply mitigation methods around those points.
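
That "probabilistic matrix" could be sketched roughly like this (a toy Python illustration; the failure points and probabilities are invented, not taken from any real study): model each known weak point as an independent annual failure probability, combine them, and compare the overall risk before and after a mitigation.

```python
# Hypothetical failure points with illustrative per-year probabilities.
failure_points = {
    "tire_blowout": 0.020,
    "brake_fade": 0.005,
    "load_shift": 0.010,
}

def p_any_failure(points):
    """Probability that at least one (assumed independent) failure occurs:
    1 minus the product of the individual survival probabilities."""
    p_none = 1.0
    for p in points.values():
        p_none *= 1.0 - p
    return 1.0 - p_none

def apply_mitigation(points, name, reduction):
    """Return a copy with one failure point's probability cut by `reduction`."""
    mitigated = dict(points)
    mitigated[name] *= 1.0 - reduction
    return mitigated

baseline = p_any_failure(failure_points)
halved_blowouts = p_any_failure(
    apply_mitigation(failure_points, "tire_blowout", 0.5))
print(f"baseline risk:       {baseline:.4f}")        # ~0.0347
print(f"blowout risk halved: {halved_blowouts:.4f}")  # ~0.0248
```

The independence assumption is the weak spot: correlated failures (the article's combinations of small failures) are exactly what such a simple matrix misses.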

The acceptance of random failure as something largely unavoidable, though, is
something that can't be engineered away; it's a human trait. A tire blowout on
an 18-wheeler doesn't necessarily mean you failed in the safety design for
that tire; the way the truck handles the subsequent load shift is the
unrecognized catastrophe mitigation built into the system. Yet people will
still focus on the tire.

------
dllthomas
_" Hindsight bias remains the primary obstacle to accident investigation,
especially when expert human performance is involved."_

I wonder if it might be possible to blind investigators to whether they are
looking at the facts preceding an accident or at facts from an audit with no
subsequent incident.

------
rdtsc
I think Joe Armstrong's thesis:

[https://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf](https://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf)

is an interesting read and provides a more concrete example of how to run a
highly concurrent and fault-tolerant application.

It doesn't talk about the social or psychological bullet points in this
article; it is more technical. But I found it very readable.

As an addition, I can think of these patterns (just thinking about it for a
minute, mostly remembering Erlang talks I've listened to, some from practice):

* Build the system out of isolated components. Isolation will prevent failures from propagating. In Erlang, just launch a process and don't use custom compiled C modules loaded into the VM. In other cases, launch an OS process (or container).

* If your service is running on a single machine, it is not fault tolerant.

* Don't handle errors locally. Build a supervision tree where some part of the system does just the work it is intended to do (e.g., handling a client's request), and another, isolated part does the monitoring and error handling. Have one process monitor others, one machine monitor another, etc.

Once a segfault or malloc fault has occurred, installing a handler and trying
to recover might not be the best solution. Restarting an OS (or Erlang)
process might be easier. To put it another way: once the process has been
fouled up, don't trust it to heal itself. Trust another one to clean up after
it and spawn a new instance of it.

* Try not to have a master or a single point of failure. Sometimes having a master is unavoidable to keep the system consistent, so maybe it can be elected with a well-defined algorithm or library (Paxos, ZAB, Raft, etc.).

* Try to build a crash-only system, so that isolated units (OS or Erlang processes) can be killed instantaneously, for any reason, at any time, and the system will still work. If you control the system you are building, use atomic file renames, append-only logs, and SIGKILL (or technologies that use those underneath). Don't rely on orderly shutdowns. Sometimes you are forced to use databases/hardware/systems that don't behave nicely; then you might not have a choice.

* Always test failure modes as much as possible. Randomly kill or mess with your isolated units (kill your processes), degrade your network, simulate switch failures, power failures, storage failures. Then simulate multiple failures simultaneously -- your software crashing while you are handling a hardware failure, and so on.

* As a side effect of the first point and the crash-only property: think very carefully about your storage. Being able to restart a process means it may have had to save a sane checkpoint of its state, which in turn means having a reliable, stable, and fault-tolerant storage system. Sometimes recomputing the state works as well.
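
The supervision and crash-only points above can be sketched in a few lines of toy Python (Erlang/OTP supervisors are the real home of this pattern; the checkpoint filename, crash schedule, and POSIX fork/waitpid plumbing here are all illustrative assumptions, not anything from the thesis):

```python
import json
import os
import tempfile

CHECKPOINT = "worker_state.json"  # hypothetical checkpoint path

def save_checkpoint(state, path=CHECKPOINT):
    """Crash-only persistence: write a temp file, fsync, then atomically
    rename. A restarted worker sees the old state or the new state, never
    a partial write -- so dying at any instruction is safe."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", prefix=".ckpt-")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)

def load_checkpoint(path=CHECKPOINT):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"count": 0}

def worker():
    """Does the actual work, checkpointing after every step, and 'crashes'
    abruptly every 5 steps -- no orderly shutdown, by design."""
    state = load_checkpoint()
    while True:
        state["count"] += 1
        save_checkpoint(state)
        if state["count"] % 5 == 0:
            os._exit(1)  # simulated segfault-style death

def supervise(restarts=3):
    """The isolated error handler: it doesn't try to heal the fouled-up
    worker, it just reaps the corpse and spawns a fresh instance."""
    for _ in range(restarts):
        pid = os.fork()        # POSIX-only sketch
        if pid == 0:
            worker()           # child: never returns
        os.waitpid(pid, 0)     # parent: block until the worker dies
        print("worker died; restarting from last checkpoint")

if __name__ == "__main__":
    supervise()
    print("final count:", load_checkpoint()["count"])  # state survives the crashes
```

The atomic rename is what makes SIGKILL (or a real segfault) tolerable here: the worker can die at any point and the supervisor's fresh instance still loads a consistent state.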

~~~
dredmorbius
Armstrong's thesis is more complete, but at 295 pages vs. 5, it lacks the
ready assimilation of the 5-page treatment. There are trade-offs.

Moreover, Cook's piece is very broadly applicable; it doesn't apply just to
software systems.

~~~
jacquesm
Nothing beats reading papers like Armstrong's and making your own condensed
notes and eventual summary.

These two documents are complementary, not mutually exclusive.

------
smacktoward
I was intrigued by this observation, near the end:

 _> 18) Failure free operations require experience with failure._

 _> Recognizing hazard and successfully manipulating system operations to
remain inside the tolerable performance boundaries requires intimate contact
with failure. More robust system performance is likely to arise in systems
where operators can discern the “edge of the envelope”._

This maps interestingly to the work of strategic thinker John Boyd
([http://en.wikipedia.org/wiki/John_Boyd_%28military_strategis...](http://en.wikipedia.org/wiki/John_Boyd_%28military_strategist%29)).
(I summarized the general thrust of Boyd's thought in a blog post here:
[http://jasonlefkowitz.net/2013/03/how-winners-win-john-
boyd-...](http://jasonlefkowitz.net/2013/03/how-winners-win-john-boyd-and-the-
four-qualities-of-victorious-organizations/))

In analyzing what separates organizations that win victories from those that
do not, Boyd wrote of a quality he called _Fingerspitzengefühl_ -- a German
word that can be understood as something like "intuition." (The literal
translation of the German word is "fingertip feeling," as in how a successful
baseball pitcher can tell where the ball is going to go solely from how it
feels rolling out of his hand.) His point was that winning organizations
exposed their people to both training (good) and experience (better!) enough
so that they could learn to react to emergent situations on instinct, rather
than by consulting a manual or waiting for instructions from above. The point
quoted above sounds like a call for people working on complex systems to get
opportunities to develop their own _Fingerspitzengefühl_.

Which leads to the thought that maybe a _completely_ failure-free system is
not something we should strive for. After all, in a completely failure-free
system, nobody would ever get enough experience groping around the edge of the
envelope to learn how to intuit where the other edges are. All they'd have is
"here there be Dragons!" warnings from the past, which would become less
compelling the farther into the past they come from. People are quick to
discount warnings that contrast with their personal experience, and if your
experience is that the System never fails, it's not hard to imagine people
starting to believe that the System _cannot_ fail. Which is fine, until it
_does_ fail, and nobody has any idea what to do to fix it.

It's sort of the same thing that happened to the financial sector in the US.
After the Crash of 1929 and the Great Depression, a whole set of legal and
institutional safeguards were put in place to prevent those things from
happening again. But as time passed and generations grew up that had not
experienced those crises directly, people began to decry those safeguards as
needless bureaucracy. Eventually enough people did so that most of the
safeguards were stripped away; at which point the system promptly collapsed
again.

~~~
nahname
That seems similar to the anti-fragile mindset.

