
The Cult of the Root Cause - fanf2
http://reinertsenassociates.com/the-cult-of-the-root-cause/
======
bartread
It's also worth pointing out that you may not be able to fix the root cause.
The sailing example is great here:

\- Can you fix the crack in your hull whilst you're out at sea? Almost
certainly not.

\- Can you even _tell_ you have a crack in the hull until you've reduced the
water in the bilges by running the pumps for a time? Again, quite possibly
not.

Treating the symptom is really the only sensible option, unless it's serious
enough that you need to put out a Mayday. Again, not a course of action that
addresses the root cause but, in some situations, absolutely the right thing
to do. To take things up many, many notches, the sinking of the Titanic was an
appalling tragedy from which relatively few people were saved, but I guarantee
that nobody at all would have been saved if the people on the ship had opted
for a series of committee meetings about how to solve the problem of the
iceberg. (Not to say there weren't a very large number of hideous blunders in
the management of that situation going all the way back to the ship's design
and fit-out.)

Moreover, another problem with Five Whys is that, applied heedlessly, it's an
extraordinarily arrogant philosophy, because it makes the assumption that you
can know the answer to those five whys. Often you can't, at least not without
going on a journey, and fixing a few things along the way, and without that,
you can simply be wasting your time trying to answer questions that you can't
answer in your present situation.

And _that_ really, and somewhat poignantly, cuts to the root cause of why I
view the kind of thinking frameworks/fads popular in business with a degree of
scepticism: over-applied or misapplied, they paralyse people into inaction,
and thereby provide fertile stock for breeding mediocrity.

~~~
roenxi
If you get the iceberg as a root cause then you are really doing something
wrong; in fact I would suggest the point of a 5-Whys would be to move people
past thinking that the iceberg is the problem and get to the actual root cause
:P.

Why is the ship sinking? Hit an iceberg. Why did we hit an iceberg? [Equipment
or command failure] What is wrong with [maintenance strategy/command
structure] that allowed this mistake to slip through? [technical details]

If you have a team of people and something goes wrong, it is overwhelmingly
likely that a human could make a decision differently or do some task that is
not being done that would mitigate the worst of the damage.

People absolutely are saved by the committee process, in the same way that
items tend to roll downhill. Pretending that humans have perfect control over
their environment and could have done something more has proven remarkably
successful at getting results. It isn't very impressive, it feels very
unreasonable, and it isn't going to work on its own, but it is a very useful
tool to let people stand up and ask "sure something is going on that is out
of our control, so why aren't we ready for it? This happens sometimes and we
need to be prepared".

Basically, if you just ask why 5 times without any sense of personal
responsibility you'll get stupid results. True of any process. But if an
uncontrollable event has impacted your endeavor, it is absolutely worth asking
"why are we exposing ourselves to a risk we can't control? could we somehow
have avoided this".

~~~
jjoonathan
Exactly which part of this is supposed to help the ship that just hit an
iceberg and is in the process of sinking?

If your answer is "the relevant committee would have met before the accident"
or "the committee would reduce / mitigate future accidents," you have missed
GP's point.

~~~
Moto7451
Root Cause Analysis isn’t really meant for active emergencies like that where
you can’t take time to analyze “why”. It’s a retrospective tool. If you’d like
a mental model in your toolbox for an active emergency the OODA loop[0] is
well tested.

RCA is good for figuring out how to keep others out of the mess you’re in
after the fact.

[0] [https://www.fs.blog/2018/01/john-boyd-ooda-loop/](https://www.fs.blog/2018/01/john-boyd-ooda-loop/)

~~~
jschwartzi
While the article starts to digress into a pseudoscientific mess at the end,
this method of problem solving is pretty damn effective when you're under the
gun. While you don't have time to deeply analyze the situation, you must make
some time for information gathering and analysis, whether your deadline is in
5 minutes or 5 days. And the higher the stakes, the more important it is that
you don't take shortcuts.

------
maccam94
This article is creating a straw man. You don't do a 5 Whys and fix what you
think the root cause was, you fix the issues that were most serious. If there
are problems too big to tackle immediately, put in short term mitigations and
incorporate your new understanding of the system's reliability into your
future plans.

~~~
hinkley
Doubly so because you don't use the 5 whys during the emergency. You use it
_during the post mortem_. After you've unplugged the burning computer. After
you've gotten the ship out of the immediate existential threat. If the
computers are burning due to faulty wiring no amount of triage is going to
stop that from happening next week. If ships are getting holes because the
currents have shifted and bergs are appearing in places where hobby sailors
frequent then the maps and some public outreach are the right solution.

I dunno who taught these people about the 5 Whys, but someone (possibly
themselves) has done them a tremendous disservice.

------
phlakaton
If there is a cult of the root cause, I have yet to meet it.

Here's what doing an exercise similar to 5 Why's gets you:

\- An understanding of where issues come from. Whether your plan of action
starts from the top, bottom, or middle, taking the time to step back and
broaden your perspective before you jump in to fix a problem helps to make
sure you're going after the right things for the right reasons.

\- A culture of _not_ just picking the most expedient and facile solution
every time issues come up, and going with that. In companies I've been in, the
pressure is almost always on to find the dumbest, hackiest, absolutely fastest
path out of trouble. Spend multiple years solving every problem that way, and
you are in deep trouble! It takes institutional courage to push back against
that, and having a practice in place to force you to stop and think now and
again gives you an opportunity to summon that courage.

\- A culture of ownership. This seems a little counterintuitive to me, since
if you follow root causes deep enough you're liable to stumble onto people and
process problems that are way out of your control and pay grade. Looking at
root causes this way, you might think it's a process of passing the buck.
However, by shining a light on such things, and finding people to address
those things where they have no owner, you can push towards a better
collective ownership of the real issues that face your company.

No good management idea is free from abuse, of course. You must exercise taste
and judgment in deciding how deep to push with root causes, and what to do
with the discoveries. I would think it's rather self-evident that 5 Why's
doesn't mean you always ask exactly five questions in a strictly linear
pattern. But for heaven's sake, make sure you ask more than one!

~~~
cirgue
I would say it's even more basic than that: the 5 whys are a way to push
people to gather information and talk to one another before making decisions
about solutions. The point is not to achieve perfection, but to consistently
_not_ make stupid and easily avoidable mistakes.

------
11thEarlOfMar
I've never taken the 5 Whys literally. It's obvious to me that all root causes
are not '5 Deep', therefore, this can't be a literal objective. I see it as a
metaphor for being an effective problem solver, as a reminder to second guess
the cause I've identified and ask myself or my team, "Is there a deeper,
underlying cause?".

------
ssivark
The most interesting cases are complex systems where a fault/event results from
a combination of multiple factors. In that case, asking for a "root" cause is
unproductive, and a better question to ask would be: how might the system be
patched, with the least pernicious side effects. This is also why I don't
always like the drive for more "accountability". As can be seen in the other
HN thread on the front page about data driven medicine and the side effects on
the healthcare system, you need to be very careful about which of the causes
you decide to intervene on.

When you closely manage something to reduce variation, you also lose any
information you might get about the system from the variation of that
quantity. This point is nicely made in another post on the same blog:
[http://reinertsenassociates.com/the-dark-side-of-robustness/](http://reinertsenassociates.com/the-dark-side-of-robustness/)

Especially with reflexive systems (involving humans), the appropriate response
might sometimes involve performing no intervention, or performing an
intervention downstream, to modify its assumptions about what it receives (eg:
adding error handling).

------
osteele
My ops postmortem template tried to elicit breadth. Once you’ve got a forest
of causes, you can apply Five Whys to add depth.

There’s some overlap among the following questions. The intent is to elicit
observations and ideas, not to uniquely categorize them.

* What are all the factors that could have prevented the incident?

* What are all the factors that could have detected the issue before production?

* What are all the factors that could have detected the issue sooner when it did occur?

* What are all the factors that could have accelerated mitigation? (Including, especially, changes that could have reduced the risks of mitigations considered too risky to apply.)

* What are all the factors that could have accelerated remediation?

* What could have reduced the scope or impact?

It’s common to come out of this with a laundry list that overfits the last
incident and, if applied, would increase the complexity of the system and add
risk. We’d typically apply one or two fixes, and stockpile the rest to see if
any of them would have addressed any future incident. Usually most of the
“solutions” turn out to be specific to the single incident that prompted them.

~~~
jwatte
5y is about finding ways to prevent the conditions that let incidents happen,
not just preventing incidents, though. That's kind of the point of asking 5 why
questions. "We fell over because we lost a database server and didn't have
enough spare capacity to run peak load on the standby. We don't have full N+1
because it hasn't been funded. It hasn't been funded because the business
didn't have a good model for the risk adjusted cost. Action item: add the cost
of this outage to the next budget forecast, and add a requirement for risk
adjusted cost estimates to all future financial plans."
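One simple way to read "risk adjusted cost" is as an expected yearly loss: the probability of the outage times what it costs. That reading, the function names, and the numbers below are all assumptions for illustration, not the model described in the comment.

```python
def risk_adjusted_cost(outage_probability_per_year, cost_per_outage):
    """Expected yearly loss from an outage: probability times impact."""
    if not 0.0 <= outage_probability_per_year <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return outage_probability_per_year * cost_per_outage


def standby_is_worth_funding(expected_loss, yearly_standby_cost):
    """Fund full N+1 when the expected loss exceeds what the spare costs."""
    return expected_loss > yearly_standby_cost


# Made-up numbers for illustration only; none come from the thread.
expected_loss = risk_adjusted_cost(0.25, 400_000)
worth_it = standby_is_worth_funding(expected_loss, 60_000)
```

With a figure like this in the budget forecast, the "it hasn't been funded" step has something concrete to be compared against.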

------
mirceal
My favorite, go-to, on this one: [https://blog.acolyer.org/2016/02/10/how-
complex-systems-fail...](https://blog.acolyer.org/2016/02/10/how-complex-
systems-fail/)

The truth is that by “fixing” the root cause you will sometimes destabilize
the complex system you are running.

~~~
dbenhur
Yes, root cause analysis and corrective action should only be done with
Cook's insights in mind.

"Post-accident attribution to a ‘root cause’ is nearly always wrong." "Post-
accident remedies usually increase the coupling and complexity of the system.
This increases the potential number of latent failures and also makes the
detection and blocking of accident trajectories more difficult."

How Complex Systems Fail is short but loaded with value; if you haven't read
it, go do so now!

~~~
BaronSamedi
I agree with you. Root cause analysis should be informed by an understanding
of complex, dynamic systems. The article's assumption, however, that RCA and
systems thinking are somehow at odds is incorrect. Root cause doesn't
necessarily, as the author implies, mean a single, isolated cause. It can
designate the linking of "multiple contributors" as the author advocates.

------
mbesto
> Second, it assumes that the best location to intervene in this chain of
> causality is at its source: the root cause.

It doesn't though, that's just a built-in assumption for the lazy. The point
of Root Cause Analysis and the "5 Whys" isn't necessarily to get to the root
and fix the root...it's to provide a framework for traversing a problem set.
The point of this methodology is so that you traverse the problem,
understanding each step along the way...not that you simply jump to the root
and try to fix it blindly.

------
twelve40
Most of these examples don't seem relevant to RCA at all.

> shifted from pumping, to plugging, to hull repair

Pumping and plugging are immediate response, just like a decision to
temporarily shut down the website when a compromise is discovered, or pulling
the plug on a smoking computer. What do these have to do with root cause
analysis?

~~~
jlgaddis
The point was that, sometimes, the best thing is not to worry about finding
the root cause -- not _now_ anyways -- but, instead, to "treat the symptom".

That is, when your boat is sinking, immediately focusing on fixing the root
cause (the crack in the hull) is not necessarily the best course of action.
Instead, treat the obvious symptoms that you can to "stop the bleeding" (plug
the hole, pump the water out) and you can deal with the root cause when you're
in a better position to do so (back in port, not miles out to sea).

See also: metaphor.

~~~
chc
Are there actually people on the other side of this argument? Like, is there
someone who is saying "Immediately do a root cause analysis — just that, and
nothing else — no matter what the situation is"? This seems obvious on the
same level as "a piece of kale is not suitable for use as a pacemaker."

~~~
williamdclt
Yes and no.

First, I think the author is (as someone else said and as you seem to agree)
raising a strawman.

Second, I don't think there are people on the other side of the _argument_ ,
but there are people on the other side _in situ_ : I mean that even if
rationally they would never admit it, they hijack the immediate "fire
extinguishing" or "pulling of the plug" or "pumping of water" to discuss where
the problem is coming from.

I've seen those people lack discernment: not only do they get in the way of
the short-term action, they are also pretty bad at cause analysis and confound
it with blaming people or "I told you so".

This is a generalization of course, but the TLDR is that even if those people
wouldn't argue against it, they act against it.

~~~
hpcjoe
This. It's highly problematic when these people are in positions where they are
able to derail efforts to stabilize by insisting upon RCA first-and-only.

Seen this everywhere. Been on both sides of this. Learned to be humble from
the experience.

------
ratacat
I totally appreciate OP's insights here. It can be so tempting to see things
linearly. But obviously the real world is anything but. There's a curiously
wicked theory out there by David Abram that part of this seeming human
predilection for linearity has occurred somehow through neural conditioning
involved with adopting systems of writing, compared to the crazy fractal ocean
of sensory input that living in a real living landscape begets. The writing
has a start. A finish. It progresses in a single visual direction, and what's
more, the pieces themselves only represent reality. The letters themselves are
completely different from the things they are describing. And they might be
one of the first objects in human history that work like this...

Anyway, the book is called the Spell of the Sensuous, and it's dense af, but
bursting with fascinating lines of inquiry.

------
myWindoonn
People are notoriously bad at playing five-whys. If you haven't reached an
ideological or metaphysical problem by your third 'why' then you aren't
playing well enough.

~~~
williamdclt
Once again that's a problem of work culture. The five-whys is super simple;
it's up to people to be smart using it. If I had an ideological problem after
a 5 whys, my team would kindly ask me what the fuck I am doing.

------
motohagiography
My experience with Root Cause Analysis is it's often an exercise for poor
managers to deflect accountability by diluting it among more junior people.

However, 5-whys is very useful as a design principle instead of using it to
respond to failure.

It goes something like:

0: build a thing

why 1: because customer asked

why 2: because thing is what they think they need

why 3: because it is one solution to a gap in their ability to achieve something.

why 4: because that something is an economic need.

why 5: because market opportunity to fill that gap with something, maybe this thing, maybe something better.

~~~
amygdyl
I suspect that this is because a human propensity exists to allocate and
apportion blame, particularly in cases where the interactions involved in the
immediately prior actions are unclear.

I have actually not even once encountered this 5 Why's method.

Nor had I heard of it before.

But I founded my company in 1996, starting out with almost two hundred years
of experience surrounding my incredulous and lucky younger self, including
several PhDs and former Fortune 500 board members.

I will hazard that this 5whys technique is fundamentally flawed and easily
susceptible to manipulation for procuring a scapegoat.

I only hope that explains why I have never encountered this before. I hope
more so that I can feel a little like something was going on in the right way,
in my business, to filter and reject what I think is, and definitely comes
across as, bogus to me.

------
jschwartzi
The biggest problem with the 5 whys is that the whole concept is taken out of
context. In a manufacturing context every repeatable problem is a problem with
your process. So in that context it makes a lot of sense to search for a
process change that resolves your problem.

Let's say you make light bulbs, and every fifth bulb comes out misshapen. You
would use the 5 whys to trace it back to the molding station, where you
discover that bulbs cool at a different rate in one of the machines because
the mold uses a better insulator. You could stop there and replace the
insulator. But if your job is to increase yield, then you can save the company
a lot of future money if you figure out how that mold got there in the first
place. You might find that purchasing subbed a cheaper replacement based on an
incomplete spec. Or you might find that the supplier recently switched
materials.

The point is that when you're trying to establish a controlled, repeatable
process, you need to understand where your controls break down.

Once you understand the process problem, then you make a business decision
about what to change. It was never meant to be applied to R&D problems. R&D
processes are not as concerned with repeatedly doing something correctly.
They're concerned with making sure something can be done in the first place.
It's a different class of problem.

------
sethammons
This reminds me of what we've been doing for a while, the "blameless
postmortem." The technique is championed by Etsy, and you can read more about
it in an article that introduces their debriefing guide:

[https://codeascraft.com/2016/11/17/debriefing-
facilitation-g...](https://codeascraft.com/2016/11/17/debriefing-facilitation-
guide/)

From the linked PDF from the article:

> “Adaptability and learning. We learn through honest, blameless reflection on
> lessons and surprises. We believe that traditional root cause analysis makes
> learning from mistakes difficult. Our blameless post-mortem process is a
> widely-cited technique that we believe is becoming best practice among
> organizations that value innovation. Blameless postmortems drive a
> significant percentage of our development as we analyze what about our
> production environment was less than optimal and rapidly make corresponding
> adjustments.” (Etsy, Inc., 2015)

The idea, boiled down, is to inspect timelines, procedures, and actions and
develop a narrative of how an incident came to be. The goal is for everyone to
walk away with a (better) understanding of everything. With this, people are
better armed to put into place solutions.

One example from the text is where an engineer pushes out a change because
they thought the build system had zero failures. The push breaks the system
and causes a regression that should have been caught in the tests. During the
postmortem, the engineer says, "I thought the tests had zero failures. I guess
I need to be more diligent in the future." Upon further timeline
investigation, it is noted that the tests actually had eight failures, but the
font had eights and zeros looking very similar. The fix was not "be more
diligent;" the fix is maybe to have a better font or use colors for pass/fail.
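The "use colors for pass/fail" idea comes down to not letting one easily misread digit carry the result. A minimal sketch, assuming a hypothetical `format_test_summary` helper (nothing here is from Etsy's guide or any real build system):

```python
def format_test_summary(failures):
    """Render a test summary whose status does not hinge on one digit.

    Hypothetical helper: the word PASS or FAIL carries the result, so an
    '8' that looks like a '0' can no longer be misread as success.
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    status = "PASS" if failures == 0 else "FAIL"
    return f"{status}: {failures} test failure(s)"
```

A color or an exit code would serve the same purpose; the point is that the signal is redundant with the number rather than encoded in it.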

Overall, I like the ideas proposed in this blameless postmortem style. It runs
counter to the natural tendency to "find a problem and fix it" because it
feels like we are talking less about the problem and the fix and talking more
about the narrative of the failure. But what I've seen is folks gaining better
understanding of how everyone else works, learning about tools and tricks, and
about assumptions. And knowing more about the narrative leads to better
solutions.

~~~
dingaling
> Our blameless post-mortem process

Theirs?

Accident investigation agencies such as the AAIB and NTSB have been following
"blameless" processes for decades. Find the causes and save lives. Who pressed
the button or forgot to connect the oil line is irrelevant compared to the
fact that the failure modes were possible.

~~~
detaro
Yes, their process, as in "the process they have implemented and documented",
not as in "they've invented the concept". They widely reference prior art and
experiences in other fields (e.g. the second sentence of the post the parent
linked)

------
maxxxxx
Shouldn't it be the cult of the single root cause? Most problems seem to have
several contributing factors. You can almost randomly pick one factor, improve
it, and the whole situation will get better.

You see this a lot in public debate like education or health care. Instead of
fixing one of the many problems a lot of time and energy is wasted on finding
THE root cause that will fix everything.

~~~
Moto7451
This was what came to my mind as well. The fire strawman presented exemplifies
this. Finding the root cause is simply a mental model. Discovering that there
are three and picking the best one to fix (assuming you’re limited by
time/money/complexity/etc or it’s undesirable to fix others) is 100% ok. The
existence of instances of multiple root causes does not really say anything
negative about Root Cause Analysis.

In cases where you can’t discover the root cause (i.e. a plane that explodes
and destroys the root cause) you simply have to go as deep as is reasonable
and work from there.

If someone is unwilling to be reasonable and accept a number of root causes
between M-N then the issue is with them, not Root Cause Analysis.

------
Illniyar
I wouldn't take such advice. For me getting to the root cause of things is
one of the core values of a good programmer (and for operations as well).

Cause #2 is also fictitious, the 5 whys never say anything about fixing the
problem, only understanding it. In fact, for me, not fixing the problem is
just as valid a solution - as long as you know what caused it, you can
determine if it's worth fixing it at the root or even at all.

As to the linearity of cause and effect, while it's true that many problems
have multiple causes, a solution to a linear problem will prevent alternative
causes below it. Besides, the grand majority of issues arising in mature
systems arise from a single cause and have linear cause and effect.

------
placebo
Reminded me of this entertaining clip:

[http://vooza.com/videos/the-5-whys/](http://vooza.com/videos/the-5-whys/)

I doubt that most people who use the five whys really take it to be as
simplistic as the author of this article suggested, but to those that do, it's
a good wake up call.

In fact, life is even more complex than the article suggests, when you throw
in effects of chaos, feedback loops, missing information, unknown influences -
just to mention a few. Still, tracking down the order in processes has got
humanity quite far (at least as far as being able to predict and engineer
accordingly) so it's obviously effective.

~~~
mianos
To save some time watching the vid: it's 'because the Illuminati or
something'. It is a great little sketch and does illustrate the major
objection to the OP's essay.

------
himom
Seems hand-waving, thin on value and promoting a consulting business.

It would’ve been better to talk about the real-world including TPMS and the
NTSB investigation approach... making cars very reliable and very complex
aircraft safer with strict regulations.

~~~
anoncept
For a deeper look from the systems engineering side, check out
[http://mit.edu/psas](http://mit.edu/psas), specifically the book-length
treatment in “Engineering a Safer World”.

For applications of similar ideas to cloud software and devops-style
environments, [https://www.kitchensoap.com/2012/02/10/each-necessary-but-
on...](https://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-
sufficient/) is also helpful!

------
emgee_1
I think that 5W and 5W2H are means to develop a network of logical reasoning.
These diagrams (fishbone and Ishikawa) are used as a means to discuss what
attack areas you have concerning a problem. They are used in many ways and in
many different fields. They are very useful when designing experiments (DoE).
8D and 10D teams use them extensively in (high tech) manufacturing.

------
gerbilly
> There are often multiple causes for an effect,

I find that some of the toughest bugs to solve are the ones where the
undesirable effect has more than one cause.

To be precise it's the kind of situation where the bug can be triggered by 2,
3 or more independent causes, i.e. each cause is sufficient on its own to
cause the bug.

Often when attempting to solve a bug like that I'll find one likely cause,
address it but because the bug persists, I end up undoing the fix for cause
#1, then finding cause #2 and ping ponging between them till I realise that I
have to address multiple causes to make the bug go away.
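The ping-ponging follows from the bug being a logical OR over its causes: removing one leaves the symptom unchanged. A small sketch, with made-up cause names that are purely illustrative:

```python
def bug_triggers(active_causes):
    """The bug shows up if any single cause is still active (logical OR)."""
    return len(active_causes) > 0


# Hypothetical cause names, for illustration only.
causes = {"stale_cache", "race_on_startup"}

# Fixing only cause #1 leaves the bug in place, which tempts you to revert.
after_first_fix = causes - {"stale_cache"}
still_broken = bug_triggers(after_first_fix)

# Only fixing every sufficient cause makes the symptom disappear.
after_all_fixes = causes - {"stale_cache", "race_on_startup"}
fixed = not bug_triggers(after_all_fixes)
```

So a fix that doesn't change the symptom isn't necessarily a wrong fix; it may just be one of several needed.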

------
Nomentatus
Looking at this (thanks anoncept):
[http://psas.scripts.mit.edu/home/get_file.php?name=STPA_hand...](http://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf)

I'm not sure it takes us all _that_ much further (for general purposes) than
John Stuart Mill writing on causation in the nineteenth century - he did very
well creating a philosophical foundation for causation-talk. (And the STPA
handbook is excellent, as well.) In any case, Mill is the original source for
the modern understanding of plural and complex causation.

------
jwatte
This article feels very straw man to me.

First, the 5y I've learned and run in the last 10 years takes pains to
identify a fix or mitigation at each step - not just the root.

In fact for small failures and big costs, the root cause is deliberately not
worked on, because analysis says the finer grain fixes are better/cheaper.

Second, root cause trees have been quite common, because there's almost never
just one chain, especially when you're running a system that has plugged all
the easy holes and fixed all the obvious first level problems.

Straw man article IMO, but I can't figure out for what purpose.

------
0xBA5ED
Not a fan of this term "the root cause". It appears to be causing some
semantic trouble for the author as well. If you never get it in your head that
you're going to find the "root cause" of something (initial conditions of the
universe? lol), then you won't be looking for it. You'll be looking for the
most promising point of intervention, which is what we do.

------
tdrd
The examples in this article are all straw men; where's the compelling
specific case where 5 whys lead to some demonstrable mistake?

Nothing to see here.

------
monksy
Whilst I agree with the article that context matters a lot, there
are a lot of great things about RCAs.

A few of the things:

1\. It gives visibility into the engineer's abilities

2\. It attempts to show weaknesses of technology choices that happen above the
people who support it

3\. It attempts to show a weakness in the process. (Similar to a retro) Yet,
in practice, rarely is this addressed.

~~~
Too
Stopping before reaching the process is a big mistake many make when doing RCA.
For example: Why did program crash: [Divide by zero]. Why? [Loop iterating
array not terminated] Why? [Input arguments not validated] Why? [Developer was
"sloppy"] and then they stop there.

The answer to the last question should instead be one or many of: no unit
tests, no code reviews, working overtime, no QA before shipping, can't
concentrate in open landscape, compiler warnings disabled, using too low-level
programming language for high level logic, developers not educated on current
tech stack, and so on. With follow up whys on all of those.
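To make the chain concrete, here is a minimal sketch of the "input arguments not validated" step together with one of the process-level answers (a missing unit test). The `mean` function and the test are hypothetical, not taken from any real incident:

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers.

    Hypothetical helper: validating the input up front turns a late
    ZeroDivisionError into an early, descriptive ValueError.
    """
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_rejects_empty_input():
    """The kind of unit test whose absence is a process-level cause."""
    try:
        mean([])
    except ValueError:
        return True
    return False
```

The validation fixes this one crash; the test, and a process that requires such tests, is what addresses the "why" behind it.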

------
legulere
The problem usually is not finding and fixing the causes but that you could
have spotted bugs earlier when the damage would have been smaller. People will
always make mistakes, but catching as many as possible and as early as
possible lessens the consequences.

------
bbbbyyyy
This article makes me angry. I think the author has some useful things to say
(like, there's usually not a single root cause) but the tone is super
arrogant, and most of the arguments start with strawmen.

------
megaman22
Ultimately, the root cause is always "you fucked up." At least that's what
people always seem to be pushing towards when they start dragging out terms
like root cause.

------
liveoneggs
I recently had to do a five-whys write up after typo-ing a config file. I'm
pretty sure it was some kind of shaming exercise.

------
mmjaa
Chains are chains are chains. Can you construct a sail?

------
jlgaddis
(2011)

