
Myths of Robust Systems - vinnyglennon
https://www.verica.io/top-seven-myths-of-robust-systems/
======
mwfunk
I'm not sure what to draw from this: the 7 "myths" being busted here aren't
actually shown to be wrong, just not the entire truth 100% of the time, and
AFAICT no alternative approaches are suggested.

For example, item 2 ("Simplify") tries to debunk the notion that simpler
systems are more robust. The author seems to be trying to say, "simplifying a
system isn't always going to make it more robust". Likewise, item 1 ("Add
redundancy") tries to debunk the notion that redundant systems are more
robust. Item 7 ("Remove the people who cause accidents") tries to debunk the
notion that if you have someone who is always screwing stuff up, getting rid
of that person will help you.

But...OK, I get it, none of these things are silver bullets or sacred cows or
whatever. But what's the suggestion? Make systems more complicated? Avoid
redundancy? Keep people around who always screw stuff up? This would make a
lot more sense if it were presented as things that might help but shouldn't be
thought of as silver bullets that will automatically fix everything. Instead
it's presented as myths being debunked, implying that the "myths" aren't just
imperfect but wrong, which doesn't make sense to me. Simpler is better than
overly complicated. Redundancy doesn't solve everything but can certainly
help. Getting rid of terrible employees who never seem to improve is generally
a good idea. The way the list is presented seems unactionable at best and
misleading at worst.

~~~
joe_the_user
The article seems like contrarian clickbait. Take a bunch of things that are
well known to be true and challenge them in an intelligent way - except the
stuff really is true and the challenges are just "sometimes not true" and
"even a good thing can be overdone". It's a completely opportunistic, low-
quality discussion framework even if done by a qualified person saying a bunch
of technically correct things.

------
crocal
What is written here is sometimes wrong and sometimes over-simplifies what is
needed to build robust systems.

If we look at the list:

7/ You absolutely must remove people who demonstrate they are not skilled for
the role assigned to them in the system. Role / skill match is one of the
first things an assessor will look at.

6/ Documenting best practice is the only way we have to ensure a system will
survive the demise of its creators. This is captured in a quality management
system and, again, will be scrutinized by an assessor. Check EN 50126 in
railway (my field).

5/ This is ridiculous. Failure is the greatest teacher. It shows us where we
failed, and by analyzing failures we progress. Dismissing this is hubris. Many
safety techniques have been introduced due to prior incidents (e.g. collision-
avoidance systems in planes).

4/ Procedures are not meant to make anyone feel clever, but to ensure robust /
safe operation. While you may wish procedures were made as smooth as possible
through automation, you can't just ignore them. Some procedures are simple and
save lives routinely (e.g. stop at a red light when driving your car).

3/ Risks must be analyzed, quantified and mitigated. But when a risk can be
eliminated entirely at acceptable cost, it should be. EN 50126 is again a good
read here.

2/ Simplicity is said to be achieved when you cannot remove anything. But a
system from which nothing more can be removed is not necessarily free of
complexity. Saying something complex is inherently more robust does not make
sense. The simpler a system, the better it can be understood by others and
thus made more robust by peer review and contributions.

1/ This is not black and white. Sometimes redundancy is needed, sometimes it’s
a rotten idea. Reliability (MTBF, MTTR, MTBSF) calculations are needed to
determine the right balance.
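
As a rough illustration of the kind of calculation I mean (the numbers are
made up, and it assumes the two units fail and are repaired independently,
which is rarely entirely true in practice):

    # Rough sketch only: hypothetical MTBF/MTTR figures, independence assumed.
    MTBF = 10_000.0  # mean time between failures, hours (made up)
    MTTR = 4.0       # mean time to repair, hours (made up)

    # Steady-state availability of a single unit.
    a_single = MTBF / (MTBF + MTTR)

    # A 1-out-of-2 redundant pair is down only when both units are down,
    # assuming independent failures and repairs.
    a_pair = 1 - (1 - a_single) ** 2

    print(f"single unit availability:    {a_single:.8f}")  # about 0.9996
    print(f"redundant pair availability: {a_pair:.8f}")    # about 0.9999998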

------
mjb
Much of this is so overstated as to be wrong.

> runbooks, etc actually provide little value in preventing incidents.

There's a great deal of evidence that checklists have value in preventing
accidents of omission. In some systems we can build around needing checklists
by adding automation, but in others we can't (or it's not worth it). Runbooks
are good, but you shouldn't have too many of them.

> Unfortunately it turns out catastrophic failures in particular tend to be a
> unique confluence of contributing factors and circumstances, so protecting
> yourself from prior outages, while it shouldn’t hurt, also doesn’t help very
> much.

Things have causes. In complex systems there can be many causes, and the
causes themselves can be systemic and complex. That doesn't mean that it
"doesn't help" to defend against some of those causes. You shouldn't try to
paper over every single little gap, but at the same time it's worth fixing
some things. Adding mechanisms like runbooks, for example. Or backpressure. Or
timeouts. Or unit tests. Or whatever.

There not being a single root cause shouldn't be taken to mean that you can't
fix anything.
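
As one small, concrete example of the sort of mechanism I mean, here's a
timeout around a call to a slow dependency (the URL and numbers are purely
illustrative, standard library only):

    import urllib.request
    import urllib.error

    def fetch_health(url="https://example.com/health", timeout_s=2.0):
        """Fail fast instead of hanging the caller when a dependency is slow."""
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.status
        except (urllib.error.URLError, TimeoutError):
            # The caller can retry, serve a cached value, or degrade gracefully
            # instead of letting the slowness propagate upstream.
            return None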

> The end result is that the additional bureaucracy inhibits the adaptive
> capacity of the organization and ultimately makes the system less safe.

Again, overstated. Of course too many procedures and too much bureaucracy are
bad. But no procedures isn't the solution either - depending on people's good
intentions to do things right is extremely failure-prone, even when those
intentions are genuine. Procedures are mechanisms that can help us avoid those
failures.

> They prevent people from interacting with the risks in a system in a way
> that teaches them where the safety boundary actually is.

Just letting people break stuff is a great way to help them learn, but can
also be an extremely expensive one.

> So rather than trying to simplify or remove complexity, learn to live with
> it.

I don't even know what to say about this. Sure, complex systems are complex.
But accidental complexity is bad and adds risk.

> The number one myth we hear out in the field is that if a system is
> unreliable, we can fix that with redundancy.

This one seems right. Redundancy adds complexity (and hence risk), and only
fixes unreliability if the cause of the unreliability is uncorrelated across
the redundant systems. Is the cause of your unreliability uncorrelated? If
it's random system failure, maybe. If it's software, no.
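
A tiny back-of-the-envelope sketch of that point (the probabilities are made
up): with independent failures a redundant pair is down roughly p^2 of the
time, while a shared software bug takes out both replicas at once, so the pair
is no better than a single instance.

    import random

    random.seed(0)
    TRIALS = 100_000
    p_fail = 0.01  # made-up per-window failure probability

    indep_outages = 0
    shared_bug_outages = 0
    for _ in range(TRIALS):
        # Independent failures: the service is down only if both replicas fail.
        if random.random() < p_fail and random.random() < p_fail:
            indep_outages += 1
        # Correlated failure: the same buggy build runs on both replicas, so a
        # single bad event takes down the whole "redundant" pair.
        if random.random() < p_fail:
            shared_bug_outages += 1

    print(f"outage rate, independent failures: {indep_outages / TRIALS:.5f}")       # ~ 0.0001
    print(f"outage rate, shared software bug:  {shared_bug_outages / TRIALS:.5f}")  # ~ 0.01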

------
meristem
The writer is trying to convey issues in Reliability Engineering, Human
Factors and Systems Theory, but with a lot of shortcuts that muddle the
message.

7/ I believe the intent was to talk about holistic problem solving, not to
suggest that low performers get a pass. The contemporary understanding of
human error is that although people can introduce errors, in complex systems
(software included) most situations go downhill due to multiple issues. In
many industries people do get fired as a way to fix the 'cause' of an
incident. The Silicon Valley equivalent would be firing the human closest to
the faulty configs that interacted with a bug and caused Google Cloud's June
outage, without addressing the issues that allowed the network to come to a
halt; or firing a Cloudflare engineer who pushed a change that led to
Cloudflare losing most of its traffic for a while in July, instead of digging
deep into the reasons that brought their system to its knees.

6/"...we have a strong case that thorough documentation...actually provide
little value in preventing incidents." There are 2.5 ideas conflated here: 1/
complex systems are dynamic, and documentation/practices have to to be just as
dynamic otherwise will documents a past state. 2/ complex systems are dynamic
and incidents will be most likely caused by unexpected interactions not
covered in runbooks. 2.5/ Expertise involves explicit and implicit knowledge;
translating implicit knowledge is a challenge.

5/ The central issue with RCA is that in complex systems there is not a
single, isolated 'cause'. There are often multiple reasons an incident
happened, and these reasons ripple through the whole organization. Adding
mitigation strategies to address a prior incident can worsen a situation (or
create a new one) especially when the choice of mitigation strategy does not
address the actual reasons that created the initial incident.

4/"...people designing the procedures have a theory..." Whenever reviewing
incidents, find out how the work is _done_, not written. Chances are the
written procedure has mutated to make it more efficient. That change can now
have highlighted a previously-unknown weakness in the system design.
(Corollaries: procedures are a living document; people doing the work should
write the procedures)

3/ This is not about operational risk mitigation... It is about knowing how
systems behave at their safety boundary. One idea is that never having to deal
with the system at its safety edge leads engineers to solutions that ignore
that boundary, or to a decrease in skills.

An example from aviation is automation diminishing a pilot's ability to aviate
manually.

2/ A slight edit would have helped here. Basically, some systems at their
simplest work OK but will work better if complexity is added, for a particular
vector of 'better'. So simplicity is not always best. There may be an
assumption here that all complexity is unnecessary. While complexity limits
observability, there is such a thing as thoughtful system design without
unnecessary complexity.

1/"...redundancy is often orthogonal to robustness,..." This item is
conjoining system reliability, robustness, observability. The super short
version: redundancy is one aspect of robustness. It alone will not make your
system robust or reliable.

