
How complex systems fail (2002) [pdf] - mpweiher
http://web.mit.edu/afs.new/athena/course/2/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
======
ryanmarsh
_The system practitioners operate the system in order to produce its desired
product and also work to forestall accidents. This dynamic quality of system
operation, the balancing of demands for production against the possibility of
incipient failure is unavoidable. Outsiders rarely acknowledge the duality of
this role. In non-accident filled times, the production role is emphasized.
After accidents, the defense against failure role is emphasized. At either
time, the outsider’s view misapprehends the operator’s constant, simultaneous
engagement with both roles._

So much this. I coach and train my clients (Fortune 500) in extreme
programming (unit testing, TDD, CI, CD, etc.). The duality these developers
live with, and management's obliviousness to it to their own detriment, create
a toxic and anxiety-inducing work environment. It's very upsetting to me how
these developers live with the manager/stakeholder constantly breathing down
their necks to cut corners and "get it done" while holding them accountable
for any mistakes.

~~~
Jtsummers
jdjebc82747's dead comment:

> "get it done" meanwhile holding them accountable for any mistakes. Sorry, is
there an area of business that is not like this?

It's not that other areas of business _aren't_ like this, it's that
management creates an environment where mistakes are more likely (and in some
cases almost certain) to happen, and then penalizes people when they happen.

An example: Bosses refuse to provide funding for materials and time to
automate the test framework (embedded systems). So testing is done mostly
manually, which consumes a great deal of time, _or_ tests don't get conducted
due to the lack of time or capability (I can't flip a switch 10 times in a
second, or at a particular and precise time). So either we don't have enough
time to take the test feedback and correct the system, or we never get the
test feedback (because some tests aren't done) in order to correct the system.
Errors are virtually guaranteed to slip into production if you're operating on
either short schedules or complex systems under these circumstances.
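
To make the "I can't flip a switch 10 times in a second" point concrete, here's a rough sketch of the kind of stimulus an automated rig can generate and a human can't. The `FakeSwitch` class and function names are illustrative, standing in for a real relay/GPIO driver that the comment doesn't specify:

```python
import time

class FakeSwitch:
    """Stand-in for a real relay/GPIO driver; records toggle timestamps."""
    def __init__(self):
        self.events = []

    def toggle(self):
        self.events.append(time.monotonic())

def toggle_at_rate(switch, toggles, interval_s):
    """Toggle `switch` a fixed number of times at a precise interval --
    the kind of repeatable stimulus manual testing cannot produce."""
    start = time.monotonic()
    for i in range(toggles):
        # Sleep until the scheduled instant rather than a fixed duration,
        # so timing error does not accumulate across iterations.
        deadline = start + i * interval_s
        while time.monotonic() < deadline:
            time.sleep(0.0001)
        switch.toggle()
    return switch.events

switch = FakeSwitch()
events = toggle_at_rate(switch, toggles=10, interval_s=0.1)  # 10 toggles in ~1s
gaps = [b - a for a, b in zip(events, events[1:])]
```

With real hardware the same harness would drive the device under test and capture its responses, turning an impossible manual procedure into a repeatable regression test.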

Management expects perfect results, but ties the engineers' hands so much
that they aren't able to execute effectively, and then blames (and often
dismisses) the engineers as a result.

------
rossdavidh
This all looks spot on from my experience, except I have this nagging voice in
my head that says the methods they are criticizing seem to be exactly what
the airline industry has done for decades in the wake of plane accidents. The
result has been an ever-improving safety record, without preventing normal
operation; although it is a pain to get through security at the airport,
people do in fact continue to get from here to there. So, how do the methods
they criticize here differ from the airline industry's, or if they don't,
what's the flaw in this analysis?

~~~
adrianratnapala
I know nothing of the industry, so everything that follows is guesswork.

I'd say airlines are a sweet spot because we are probably near a technological
local-optimum. We are not constantly evolving new paradigms for the industry,
rather just incrementally improving the old one.

If post-mortem culture in the industry results in "fighting the last war",
then that probably does little harm; at worst it adds incremental cost. In
the meantime there is still incremental technological improvement giving us
more headroom in the cost vs. safety trade-off.

~~~
rossdavidh
Well, this makes some sense, since the speed of sound pretty much stopped
technological progress toward ever-higher speeds a few decades ago.

~~~
jodrellblank
Really?

The ancient ones, the humans who lived before my birth - those who launched
rockets to the moon and detonated city destroying bombs with atomic power,
feats no country has matched in these declining years of civilisation - they
also had faster than sound passenger aircraft.

Although it might be the tendency to exaggerate stories with time, rose-tinted
glasses and the myth of the noble-savage.

~~~
taneq
There's a common theme amongst many of those heroic feats: They tended to
occasionally catch fire, fall out of the sky and/or explode at inopportune
times.

It's the difference between a prototype and a production model. The Saturn V
rocket and the Concorde were both prototype-grade vehicles. The marvel of them
is that they worked at all. Now that we've proved the point, we won't put
either "into production" unless it's reliable enough to be boring.

~~~
jodrellblank
'Production quality' things spontaneously catch fire these days, too.

[http://www.theclever.com/20-dangerous-cars-with-the-highest-...](http://www.theclever.com/20-dangerous-cars-with-the-highest-risk-of-spontaneous-combustion/)

------
stcredzero
This episode is the most relevant, but see the whole series:

[https://www.youtube.com/watch?v=3HaqpSPVhW8](https://www.youtube.com/watch?v=3HaqpSPVhW8)

------
qualitytime
TL;DR: Spend 80% on expensive consultants qualified in a "project management
method" such as PRINCE2/V-Model and make a f*****g nightmare for the people
who actually try to make it work.

A picture is worth a thousand words:

[https://upload.wikimedia.org/wikipedia/commons/8/8f/Systems_...](https://upload.wikimedia.org/wikipedia/commons/8/8f/Systems_Engineering_and_Verification.jpg)

~~~
xenity7
It seems you've been hurt recently by a consultant :).

The institutional response to managing the risk of complex systems is often to
introduce layers of approval and process that appear to be dealing with a
problem ("this system can't fail, therefore please check with everyone
involved in the system before making a change") but have little real-world
value except driving everyone crazy and incurring enormous cost. They
also shield individuals from real responsibility (how carefully do you review
something you are the only approver on? What if there are 5 approvers? What
about 15?).

A better answer is to find ways to conduct real-world tests with subsets of
your system where you can roll back bad consequences. As far as I can tell,
this is the approach of Google/Facebook and other newer tech companies,
pushing small changes to subsets of customers for testing.
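
The mechanics of that subset-and-rollback approach can be sketched as a simple canary rollout. This is an illustration of the general pattern, not Google's or Facebook's actual implementation; the feature name and user IDs are made up:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a user into a feature rollout.
    The same user always gets the same answer for a given feature,
    so raising `percent` only ever adds users, and setting it to 0
    "rolls back" the change for everyone at once."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # 0..99
    return bucket < percent

# Expose the risky change to ~5% of users first; watch error rates,
# then ramp up -- or drop percent back to 0 to roll back instantly.
canary_users = [u for u in (f"user{i}" for i in range(1000))
                if in_rollout(u, "new-checkout", 5)]
```

Because the bucketing is a pure function of the user and feature, there's no state to clean up on rollback: flipping the percentage is the entire undo.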

Legacy enterprise companies are woefully ill-equipped to do this for the most
part, both technically and culturally.

Some industries have regulations that make this approach difficult (financial
services) or human consequences to failure that make the cost of
experimentation too high (medicine, airlines, etc.).

------
csours
If this is interesting to you, I highly recommend "The Field Guide to
Understanding Human Error" by Sidney Dekker - it covers these points with
examples. [0]

Another note: I wondered for a number of years what the root cause of the
financial meltdown was, but looking at it from this point of view, it's
obvious that a number of things have to go wrong simultaneously; it is just
not obvious beforehand which failed elements, broken processes, and bypassed
limits will lead to catastrophe.

For your own business/life, think about things that you live with that you
know are not in a good place. Add one more problem and who knows what gives.

This is not intended to scare or depress, but maybe have some compassion when
you hear about someone else's failure.

0: [https://www.amazon.com/Field-Guide-Understanding-Human-Error...](https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/1472439058)

------
udkl
Previous relevant discussions of similar papers:

[https://news.ycombinator.com/item?id=8282923](https://news.ycombinator.com/item?id=8282923)

[https://news.ycombinator.com/item?id=14127543](https://news.ycombinator.com/item?id=14127543)

------
aw3c2
Here it is directly from MIT instead of the cancerous ResearchGate silo:
[http://web.mit.edu/afs.new/athena/course/2/2.75/resources/ra...](http://web.mit.edu/afs.new/athena/course/2/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf)
(the 1998 version, to be fair, but the sole addition seems to be an
infographic of the contents)

~~~
sctb
Thanks, we've updated the link from
[https://www.researchgate.net/publication/228797158_How_compl...](https://www.researchgate.net/publication/228797158_How_complex_systems_fail).

