
Examining and Learning from Complex Systems Failures - wallflower
https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
======
a3n
> 16\. Safety is a characteristic of systems and not of their components.

> 18\. Failure-free operations require experience with failure (Cook 1998).

These two things in particular caught my eye, especially as I'm not in system
design.

So with 16, you can't just slap together components with sufficient -ility
ratings; you have to study and design the safety of the system as a whole.
Those components, in two different designs, are exposed to different stresses
in different ways, and live within different monitoring and mitigation
systems.

With 18 ... I wonder what the value would be to a promising new engineer who
opted not to take the big offer from Google or Apple, and instead worked down
among the unwashed for a while, gathering experience. There's probably a lot of
value in having experienced failures in a system that has no relevant
mitigation, where you have to figure out what to do next, as the first person
to have seen it.

Luxurious to work at Google, where if it hits the fan, there are lots of
systems and experience to handle it. Character building if you alone had to
figure out how to keep the ship from sinking, from an "iceberg" that no one had
thought of and planned for.

------
awinter-py
very much want to read this but uptimeinstitute is currently, and very
ironically, down

(maybe because of a complex system failure? or just load)

~~~
yagibear
Archived at
https://web-beta.archive.org/web/20170417232443/https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/
or use the Google cache version from the web link at the top of the thread

~~~
a3n
Thanks for the Google cache reminder. When I view that, it also spins forever,
with no page display below Google's header. If I click on the Text link, then
I see the text version. I wonder if there's a problem in the source itself,
rather than the server.

