Imagine you're built on a public cloud in a single Data Centre. That DC has an issue. Your 5 Whys shouldn't end with 'the Data Centre had an issue'. It should continue with 'my service was only in one DC... Why?' and carry on from there.
Our service was down.
Because the platform it runs on went down.
Why only one platform?
Because management doesn't want us to spend several years engineering for multiple platforms.
I find it hard to imagine a realistic path of 5 whys that leads us back to a cause within our own team.
Yes, and that's fine. It's okay to end up with an RCA like that and note it as a known risk. If a cost-benefit analysis shows that the risk is dwarfed by the cost of the solution, then it's not worth fixing. Knowing the existing risks of your system and reevaluating these risks in the face of changes (more people available, inefficient context switching, etc) is a key component of making progress.
However, most real life situations I've seen show that there's always room to improve.
For example, how quickly did your service recover once the underlying platform recovered? Rare events (like DC failure) can be a bad thing for operational training. Teams forget how to recover. No runbooks, no training, or tooling that's infrequently used no longer works.
You don't just shruggie and move on with "that ship has sailed", at least if you're doing it right.
Because management declined to approve purchase orders for sufficient hardware to allow redundancy.
And your career just hit a wall. Passing the buck up the chain tends to do that.
If your management is bad, then there are a lot more risks to your career than trying to work with them to improve the process. It is more likely that a given employee has dead-ended their career because they don't know how to explain a problem without blaming people, or lack the humility to see things from other people's point of view; both are useful skills for getting a promotion.
Is your superior's superior also incompetent? What about his superior? Does your superior have someone else on his level that isn't incompetent? If you genuinely believe the company is top to bottom incompetent then what are you still doing there? Otherwise identify the competent people and start working that angle, either to get yourself transferred under someone more competent or to move critical decision making power from your superior to someone more competent (perhaps yourself)?
Or, if you're that sort of person, find a way to use their incompetence to mask your lack of giving a fuck and just slack and collect a pay check until they fire you.
Management only comes into play when causes like "under-funding", "constantly changing requirements" or "too tight deadlines" are identified.
> We should want and expect software engineers to have the agency and bigger-picture thinking to do more than just design for inadequate inputs.
Saying software engineers should have bigger-picture thinking doesn't help if the software engineers aren't empowered by management to make the necessary changes.
Getting back to the question at hand. "5 whys = 1 how". Because if we do the root cause analysis, we will definitely find a solution to our problem. What???? Because there is always a solution to our problem??? It's that same skipping over of the most important detail: you aren't doing the most important problems first because that way there will be space for the less important problems. You are doing the most important problem first because then when you run out of time, you've done the most important thing! Similarly, you aren't finding a solution when you are doing root cause analysis; you are finding problems. 5 whys is a tool for helping search the problem space. That is all.
The "5 whys = 1 how" quote is really unfortunate and I think it's taken completely out of context. You don't just do 5 whys once and say, "Oh now we know how to fix our problem". You keep doing the 5 whys over and over and over again so you can identify your problems. And at that point you can start to get a handle on your solutions. "5 whys = 1 how" doesn't mean there is only "1 how" that you need.
And just to sum up, if one of your 5 whys ends up with "An external entity failed us", then you might need to find a way that you aren't relying on that entity. However, run some other 5 whys to help you search the problem space a bit more. You may find another "how" that will help you better.
For example, "I trusted this dependency which broke" doesn't mean you Did Something Bad, it just means that the cause of the failure was the lack of a way to validate the dependency and fall back on a known-good version. A good example of this is build pipelines relying on the ability to download npm/nuget/debian packages from a remote server instead of vendoring them locally - this is an intrinsically unreliable choice to make. The remote server being unreachable is not your fault, but there are steps you can take to fix this problem in the future. Avoiding external blame in your 5Ys (without ignoring the external factors) encourages you to make local changes.
As always you have to apply reason here and not adhere to dogma, but it's a useful dogma.
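The "validate the dependency and fall back on a known-good version" idea above could be sketched roughly like this in Python. Everything named here (load_dependency, fetch_remote, the pinned checksum, the vendored path) is hypothetical and made up for illustration; it is not how npm, nuget or apt work internally.

```python
import hashlib
from pathlib import Path
from typing import Callable, Tuple


def load_dependency(
    fetch_remote: Callable[[], bytes],
    expected_sha256: str,
    vendored_copy: Path,
) -> Tuple[bytes, str]:
    """Return (payload, source), preferring the remote but never trusting it blindly.

    fetch_remote is any callable that downloads the artifact (hypothetical).
    expected_sha256 is the checksum pinned in your own repository.
    vendored_copy is a known-good copy kept alongside the code.
    """
    try:
        payload = fetch_remote()
        # Validate what the remote handed us before using it.
        if hashlib.sha256(payload).hexdigest() == expected_sha256:
            return payload, "remote"
    except OSError:
        # Remote unreachable: not our fault, but we planned for it.
        pass

    # Fall back on the vendored, known-good version.
    payload = vendored_copy.read_bytes()
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("vendored copy does not match the pinned checksum")
    return payload, "vendored"
```

The point is the local change: the build no longer depends on the remote server being up or honest, only on the checksum and copy you control.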
For some products an AWS S3 outage might be influential enough that they add a redundant second storage at another provider; for others it might be "this is too expensive and time consuming to work around for the problems it caused, let's hope Amazon gets their shit together".
On the big picture, I agree with GP - different people will come to very different conclusions with the "Why" game - because the world is complex, and many factors come into play when something goes wrong. You don't need to fix all of them, but it does become a game of figuring out the most convenient aspect to fix (and no, another level of "Why" won't fix that).
Somewhat related: I had a 2nd level manager who in meetings kept saying "In an ideal world, ..." and using that as a starting point for requirements, problem solving, etc. At one point, I interrupted him and said "Nope. In an ideal world, I would not need to work for a living. Your ideal is already far from it, so let's not make distinctions between ideal and non-ideal."
It's easy to use the "whys" to come up with answers like "Because X needs to make a living" and "Because the incentives with the manager/department/org/company create a situation where testing is devalued (despite all claims by management to the contrary)" or "Because we cannot retain talent due to our compensation policies" or "Because people in the team are unwilling to learn version control systems newer than cvs."
These are not facetious answers - they should be treated on an equal footing with technical solutions. In my experience, management that does not want to deal with solutions that involve other teams/people/policies are ones where the work has been miserable. It's also been my experience that SW people seem to prefer technical solutions over alternatives, which is tragic. When you study things like negotiation, it's almost inverted. They have their own push for "Why", but the push is usually in the motives direction. The person you're dealing with is presenting a concrete demand that may seem objective, but behind it is almost always a fairly human reason (not looking bad, higher status, etc). And they always emphasize trying to root cause to those and addressing them.
Many here know the saying: In the top tech companies, all problems are social ones. They have the talent to achieve anything. If any piece of SW fails, it's not because they didn't have technically capable people. It's because they failed in the social/organizational domain. So when I hear management say they don't want to address that in the 5 Why's, I see management that is solving the wrong problem.
I took a systems engineering class once. The thing they emphasized throughout is "If every aspect of a product is designed perfectly as its own unit, you'll get a crappy product that will not win in the marketplace." The idea is that a perfect widget in isolation may not integrate well, and there are budgetary issues as well. There has to be a certain amount of compromise to integrate with other widgets in the product, and the teams involved need to talk to one another and have dependencies on one another. The role of the system engineer is to oversee that this is happening.
Insisting that your SW or service must be bulletproof against failures of other services, when both are related and in the same product, is highly suboptimal (as is the other extreme). Often the optimal solution is to say "Our service will work, provided service Y works".
I was on good terms with him so I got away with it.
Seems like 5 Whys in that scenario would lead you to look at things like insufficient test coverage, no code review, code reviewers missed the bug because...
Those paths of inquiry lead to your team changing things under your direct control that could make it less likely for similar bugs to reach production in the future.
For instance, when ext4 gained traction, some users found a number of zeroed-out files after a crash.
Why? Because the programs managing those files updated them by truncating them and then writing them.
Why? Because fopen() has no "replace file" mode, even though the "w" mode seems to do just that as long as the system doesn't crash (it does O_WRONLY|O_CREAT|O_TRUNC).
Why? Because file systems used to not have delayed allocation when fopen() was created. ext4 introduced the concept to the ext family.
Why? Because ext4 was designed to improve performance compared to ext3, and that change does that without breaking POSIX.
Why? Because, even with ext3, the only way to correctly implement "replace file" on POSIX is to write the replacement file in the same directory, fsync() the file, close the file descriptor, rename the file over the original, and fsync() the directory.
At that point, the whys have shifted the blame through userland programmers, POSIX fopen(), ext4, and back to userland programmers. Depending on where you stop, the solution can be:
1. fix all userland programs,
2. change ext4 to force sync upon closing a truncated file's descriptor (which got implemented),
3. upgrade the POSIX standard to have a dedicated "replace file" function call.
The truth is, userland is buggy, and the POSIX standard is arcane, so they both share blame. (But ext4, which was correctly implemented, is the only thing that got patched.)
Implementing it in ext4 increased the risk of zeroing out a file from “having all files updated in the past 20ms be zeroed out” to “having all files updated in the past 4 seconds be zeroed out” in the event of a system crash.
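For what "fix all userland programs" would mean in practice, here is a minimal sketch of the "replace file" sequence in Python, assuming a POSIX system. The function name replace_file is made up for illustration. This sketch does the directory fsync() after the rename, which is the ordering usually recommended so that the rename itself is durable; treat that ordering as my assumption rather than something the thread settles.

```python
import os
import tempfile


def replace_file(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))

    # 1. Write the replacement in the same directory (same file system),
    #    so the later rename is a plain metadata operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # 2. Force the new contents to disk before they become visible.
            os.fsync(tmp.fileno())
        # 3. Closing happened on leaving the with-block; now rename over the target.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # 4. fsync() the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Compare that with open(path, "w"), which truncates first and writes afterwards, and it is easy to see why most programs never did it the long way.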
Quite often when I'm doing 5 Why's I find myself socially engineering Why 3-5 to lead us to an action item we can do something about. There are any number of Why #5's someone can generate but a lot of them do not represent progress the team can get behind.
The fact that this happens with so many different teams and managers makes me wonder if it's me or there's just an obvious pattern of misuse going on with 5 Whys.
This is the big issue. Ideology, bias and self-interest creep in and you can easily end up blaming whatever it is that you personally disagree with.
And that's to not even mention the "unknown unknowns", which by their very nature cannot be factored into a "5 Whys" process, no matter their real world relevance to the problem at hand.
How is this better than "Just So Stories?"
In the author's context (healthcare), 5 Whys may in fact be dangerous to promote. In software, where many of us are building CRUD sites and consumer apps, it's a huge improvement over the norm. I'd be very skeptical of the idea of an engineering team trying to employ CAST (linked in the article to analyze cardiovascular surgery mistakes) unless they're building something mission-critical.
My point is: consider when to use a tool. In many contexts, the 5 Whys is a fantastic tool and should be used. In other contexts (likely this author's), it can oversimplify.
This problem is similar to what happens in tech with big new ideas. People do present it as orthodoxy, people do present it as a silver bullet, people do present it as always being sufficient. Then people inevitably find it's not quite that hard-and-fast and get upset.
See the massive backlash against 'agile'. It was supposed to be a flexible way to approach doing things and some general ideas. But a generation of programmers have been told it's an absolute rule that you must do things a certain way - you must have a standup and it must be so many minutes long - and now they're angry.
Of course. But you shouldn't stop at one why. You should keep digging until you find more. I'd say... five of them should be enough.
To be fair, the idea of multiple root causes (and the difficulty 5 whys has in uncovering multiple causes) is important. But that's not grounds for total dismissal. Because the value of 5 whys isn't "ask why 5 times and you arrive at the root cause", but rather "Don't stop at the first answer you find". Since rather a lot of people stop at the first cause they see rather a lot of the time, it's a valuable approach.
(Occasionally 2 Whys turns into a toxic environment based on shaming. Why are we down? Because Steve screwed up. Don't be like Steve!)
The argument "tool is misusable and therefore has no value" is absurd. Of course any discussion framework is going to rely on intelligent people using it sincerely, and even then will only provide some additional insight. It's a tool to help people make 10% better conclusions.
I think the same may be true for networked software services as well. At a certain scale "root cause" becomes a matter of opinion or interpretation. Attempting to use a 5W methodology will lead you to one of many different, possible, causes but will ignore other confounding factors depending on who is doing the investigation and how it is conducted.
In my experience finding the root cause of a particularly nasty error in a production cloud system required a formal model to prove that a particular error was triggered under very specific circumstances. The error was not reproducible and a test case elusive and yet it frequently ate up our error budget. There was no way any kind of root-cause analysis could have detected that error. It was a behavior comprised of over a dozen steps where the conditions had to be right to trigger it. Humans are not good at finding things like that on their own and simply asking, "Why?" like a parrot is not helpful... the system is simply too complex.
Instead of root cause I like to think in terms of factors. What is the probability that a given factor has a high impact on our SLOs? Let's fix that. If we don't understand what the source of the problem is let's write a formal specification and test our assumptions and find the factor. Most of all, build for o11y and make reliability an engineering concern.
Causes are something we create and construct afterwards. They aren’t a primitive that make up incidents. They don’t fundamentally exist to be found.
Causal explanations limit what we find and learn and the irony is that root cause analysis is built on an idea that incidents can be fully comprehended—they can’t!
Instead think about the conditions that allowed an incident to occur. Separate out everything you think looks like a cause and explore each of them.
Talk about the properties and attributes that were present. There are so many aspects to incidents that aren’t even causes at all, and they don’t follow a linear chain of this-led-to-that.
People are presented with RCA and 5 whys as really getting to the bottom of what matters, but the reality is this approach is a linear simplification. We need to kick it up a notch and practice more holistic investigations of incidents.
Stop getting at why people didn’t do what they thought they should have done, and start getting to the point of what actually happened and how those actions seemed reasonable at the time.
On a dark night, a blind man jaywalked and was nearly hit by a drunk driver who swerved and killed a pedestrian wearing black on the sidewalk. What was the root cause? Was it because the man was jaywalking? What if he was jaywalking because he was blind and didn't know any better? If the driver wasn't drunk, could he have braked instead of swerving? If the pedestrian was wearing brighter colors, could the driver have swerved a different direction? Sometimes there is no singular root cause, but rather a complex interaction of multiple failures that may or may not arise if you take one failure away. There may be a better approach than to use a tool that is designed to home in on a singular root cause.
Unfortunately, this only sometimes works. The rest of the time, they actually do try to pick apart the example and claim there is a root cause. The only recourse then is to present an alternative singular root cause and say that by trying to force a single root cause interpretation instead of accepting the complexity of reality, your root cause will ultimately reflect your own biases rather than reality (for example, who do you hate more: jaywalkers, drunk drivers, or pedestrians that wear black at night?).
I still wish I had a better way of explaining this succinctly, in a way that makes it more clear.
I'm not actually joking - applying this process seriously requires looking at not just who did what (people) but the presence or absence of mechanisms. Corrective actions that rely on the good intentions of people "remembering to do the right thing" rarely work.
I review several RCAs a week in my job. If I saw one for a scenario like this where the actions or inattention of any of the humans involved was identified as the ultimate root cause, it'd get kicked back for re-work.
Problem: The Washington Monument is deteriorating
Why? Harsh chemicals are being used to clean the monument
Why? The monument is covered in pigeon droppings
Why? Pigeons are attracted by the large number of spiders at the monument
Why? Spiders are attracted by the large number of midges at the monument
Why? Midges are attracted by the fact that the monument is first to be lit at night.
Solution: Turn on the lights one hour later.
Without getting into a massive amount of detail, the following things are important to note about 5 whys in lean.
5 whys isn't meant to arrive at a single conclusion or solution; it is meant to provide deeper analysis so that an appropriate and proportional set of countermeasures can be proposed and considered. The key concepts being a "set" of countermeasures and "considered".
5 whys does not need to exist in a vacuum as a problem solving tool. It is often just one such tool in a more thorough approach such as the A3 process.
There is no reason to limit the number of whys to 5 or even go that far in some cases.
The countermeasures that come out of a 5 whys analysis are not expected to be implemented and immediately forgotten about. They are part of a plan that expects follow-ups, measurement and observation of the results.
The underlying causes of the analysis are not assumed to be the only causes or problems. The purpose of 5 whys is to limit over-analysis. This may not make sense in life-or-death situations, but for most business processes it is appropriate and efficient to make decisions at a reasonable point in the analysis and move forward rather than investing a disproportionate amount of resources in that analysis.
The concept of a "proportional response" is important to lean. If the outcome of a problem is a patient's death, then the proportional response can be expected to be considerably more significant than just doing 5 whys.
If all a team does is ask 5 why questions then propose and implement a solution based on the last answer, this would be a pretty textbook example of a lean cargo cult.
This is indeed the problem, at least according to people close to me that work in healthcare. Lean does not have a good reputation in these circles.
I'm not sure what can really be done about that though. It isn't my industry so my understanding of the problem is pretty peripheral.
Requiring causal analysis to make direct claims about cause and effect, which 5 whys does, is important; otherwise it's hard to check or even identify those claims or evaluate the evidence. It's especially good for getting labor and suppliers to think, which they often are not paid to do. It also often lays bare shallowness in that thinking. I can easily see that other tools might be better in other circumstances, but management is still going to need explicit claims about cause/effect and evidence to support those claims in order to make process improvements.
It's also worth pointing out that there are other shop floor tools in the quality toolbox we've inherited from manufacturing engineering other than the 5 whys which are easy to apply and can be very useful and helpful (https://en.wikipedia.org/wiki/Seven_basic_tools_of_quality). Pareto charts for example are often seen in computer monitoring systems and can be super helpful in seeing changing or developing failure patterns. Control charts are less often seen in the computer biz but are very useful for visualizing variability and understanding when processes and systems are and are not stable.
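As a rough idea of how little it takes to get the numbers behind a Pareto chart, here is a small Python sketch. The function name pareto_table and the failure labels in the usage note are made up for illustration; feed it whatever categories your monitoring or incident tracker actually records.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto_table(failure_labels: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (label, count, cumulative percent) rows, most frequent failure first."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    # most_common() sorts by descending count, which is the Pareto ordering.
    for label, count in counts.most_common():
        cumulative += count
        rows.append((label, count, round(100 * cumulative / total, 1)))
    return rows
```

With ten hypothetical incidents, six labelled "timeout", three "disk_full" and one "bad_deploy", the cumulative column reads 60.0, 90.0, 100.0, which is the "fix these two first" view a Pareto chart gives you. Re-running it per week or per release is one way to see the failure pattern shifting.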
I find it odd that the paper cites so many theoretical problems with "5 whys" and then elevates these other methods as preferred solutions without any discussion. It's possible RCA committees already recognize the limitations of "5 whys" and use it as a heuristic that helps focus their efforts. That doesn't mean these committees all sit around a table and say "well, X sounds like a major problem, but it doesn't fit our '3rd why,' so we have to ignore it."
> the potential for users to rely on off-the-cuff deduction, rather than situated observation when developing answers, as well as difficulty in prioritising causes[, and thinking the causes are a linear list rather than a branching tree and the problem therefore with 5 Whys being greedy in picking a single branch of that tree]
"Not all problems have a single root cause. If one wishes to uncover multiple root causes, the method must be repeated asking a different sequence of questions each time." - https://en.wikipedia.org/wiki/5_Whys
When you work on incidents, you should already know that the incident you're looking at could have wide-spread effects, and may only be a symptom of a series of complex events. You may be able to find a cause with 5 Whys, but probably not the cause, because the causal chain is almost always complex. 5 Whys is just a shoot-from-the-hip way to find the first, most obvious cause from the perspective of the person doing 5 Whys.
You might not be able to identify the "real" root cause right away, but by documenting each incident's cause and linking them into a value chain, and reflecting these in runbooks, you can over time eventually see where the root causes are coming from. But you also don't need to immediately solve the root cause to begin adding safeguards around the known incidents.
Is 5 whys normally just 5 causes in a line?
In short, finding one root cause rather than all of them avoids analysis paralysis and gets a solution in place quickly.
That said, I really wish the healthcare industry would stop using ideas from manufacturing. LEAN, 5 whys, etc. came from manufacturing widgets; humans (patients) are not widgets. They should use their brains and come up with unique ideas for their unique problems rather than piggybacking off ideas that aren't meant to deal with one-offs.
This is also why I left healthcare and will not go back. Such an important sector, ripe for disruption, it seems to be lacking so much.
^ That's the best section of the article. Also, I think the causal tree they included for an incident at a hospital (patient received the wrong medicine) is fascinating (see Figure 1; I included a direct link to the JPG below). They explored one chain of causal subtrees to a depth of 11, and the total number of nodes is 78 (although some subtrees are repeated - it would have been better as a DAG instead of a tree). It exposes significant policy and work culture issues at the facility and gives the reader a sense that they already know a lot about what it must be like for the nurses who work there.
I’m always left wondering “how am I supposed to really get something out of this?”
T5W is the sanity check before you leap to problem solving, and perhaps solve the wrong problem - which often is no solution at all.
Read this. Read it again. RCA FTW!
I particularly appreciate the treatment DevOps Handbook gives them as far as spelling out Systems are Complex, there will be many contributing factors, you need to take in multiple perspectives to fully illustrate what happened.
This lines up with the article's analysis, although the conclusion is a bit sensationalist. I'm loving the long list of 'other techniques' they present right before it.
5 Whys should really be considered a place to start and encourage people to go deeper into failure analysis, then work towards more formal methods.
I can choose an arbitrary answer to a why question every single time.
We are mostly satisfied with a how answer to a why question. And most why questions are in fact how questions or just used for something else (like defiance from a kid asking why they have to go to bed).
Also we stop asking why when we are subjectively satisfied with the answer, not when we've found "the cause". And we've been conditioned since we were kids to feel it's rude to ask why more than once or twice.
That's why "5 Whys" is so important -- to re-kindle the intellectual curiosity that is snuffed out in children.
But in general it is good to ask questions, and it's good to allow multiple answers to the same question.
Like... no one held a gun to your head and said "find a single causal pathway!" That's something you inflicted on yourself.
There are plenty of problems I could call up with regard to 5 Whys and how I've seen it practiced, but this is just a silly one.
"HROs commonly aim for a reliability rate of ‘six sigma’ (three errors per million opportunities). By these measures, healthcare is struggling to move beyond two sigma (308 500 errors per million opportunities, or a 30.85% error rate). [...]
But healthcare is far more complex than automobile manufacturing, and takes place amid processes and systems that are woefully underdesigned in comparison to a modern factory. [...]
As a result, approaches developed for solving problems in the automotive manufacturing context may not be as effective in the healthcare arena."
So, in a way, the problem isn't really with 5 Whys. The problem is that it's applied to processes where errors/defects are much more common.
I've seen it applied to my own domain as well, keeping computer systems (web sites, mainly) running. If we have one incident with possible downtime per year, there are probably still way too many potential root causes for 5 Whys to be useful... and that's still probably at least an order of magnitude better than anything I've personally worked on.
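For anyone wondering where the sigma figures in the quote above come from, here is a small Python sketch. It assumes the conventional Six Sigma "1.5 sigma shift" (my assumption; the quote doesn't spell it out), under which six sigma works out to about 3.4 defects per million opportunities and two sigma to about 308,500.

```python
from statistics import NormalDist


def defects_per_million(sigma_level: float, shift: float = 1.5) -> float:
    """Defects per million opportunities for a given sigma level.

    Assumes the conventional 1.5 sigma long-term shift used in Six Sigma tables.
    """
    # Fraction of opportunities falling beyond (sigma_level - shift) standard deviations.
    defect_fraction = 1 - NormalDist().cdf(sigma_level - shift)
    return 1_000_000 * defect_fraction
```

Calling defects_per_million(2) and defects_per_million(6) shows the gap the quote is pointing at: roughly five orders of magnitude between the two error rates.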
In manufacturing you should ideally fix one problem at a time. In healthcare you don't have that opportunity, you have to deal with comorbidities.
In manufacturing you can fire troublemakers (eventually). In healthcare you have to treat them.
"why?" easily leads to cycles and going back down the abstraction hierarchy, where "for what purpose?" always goes up the abstraction ladder.
- why are you eating? because it's 7pm
- for what purpose are you eating? to satisfy my hunger
We see this pattern in databases, where there are 1..n relationships out of something, or, in reverse, n..1 relationships into something.
The "5 Whys" described whys only as having a single cause feeding into each successive cause. This article describes the problem with that, which is simply that in the real world, a single result, a single "why", might have multiple (1..n) causes feeding into it, and multiples into those.
Now... get ready to have your mind blown... Here's the "problem" with this article (I know, this article makes more sense than the "five whys" alone, but hear me out!)
Think of "The Five Whys" as level 1 reasoning and this article as level 2.
Level 2 is smarter because it considers more.
So what's level 3?
Level 3 then, which I think will be discovered (or at least expounded upon by some person vastly brighter than myself), is using the level 2 method, and expanding that set of predecessor causes even more to everything in life, which theoretically should result in the observation that everything is circular.
That is, all events, causes and circumstances -- eventually feed back on themselves! Some to lesser degrees, some to greater degrees, but I think it will be discovered that all causes move in circular loops, even though the cause->effect->cause chain (or chains) might have thousands, perhaps hundreds of millions of elements...
I think the ancient Greeks (well, some of them, the enlightened ones anyway) had an understanding of this. I think that Shakespeare and Dante Alighieri understood this, at least in terms of the human character.
Consider Toyota... If Toyota is asking "why" some failure occurred and tracing it back 5 levels, then one might ask "Why does Toyota have to make cars in the first place?", because that sets the stage for all other "whys" that come next...
Eventually, with enough "whys", everything has to be circular... You just need to go enough why levels "deep" to see this. And some causes feed other causes to lesser degrees, some to greater degrees.
And I believe that people in antiquity who were men of great learning... somehow knew or intuited this... while not provable, it seems "logical"...
Loops inside of loops inside of loops... all with different loop lengths...
So rather than something technical like "process crashed", most of my investigations begin with "user is frustrated because they did not accomplish their task". Then my why-tree will have branches for, say, "the label on this control was unclear", "there was no documentation for this feature", "it was not clear how to get to the correct documentation page", "user was afraid to click it -> because there was no Undo", "user didn't know this feature already exists", etc.
Usually, I have to go a few questions deep before I get to a geeky technical solution that can be solved by more programming. I'll typically come up with 4 to 6 actionable improvements, and half will be completely non-technical, and most of the rest will be less than one line of code (e.g., to rearrange or clarify).
Users don't like going to the effort of filing bug reports, so whenever someone does, it's a good indicator to me that 10 or 100 other people also failed at a similar task with my software, or will soon. Users are also resourceful, and will try 2 or 3 ways to solve something, so if they ultimately fail, it means the software failed at every path they tried. By attacking every problem at multiple levels, it's possible to improve results for a whole lot of people at once.
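A why-tree like the one described above is easy to write down as a data structure, which also makes the difference from a single 5 Whys chain obvious. This is only a sketch; the Why class and causal_paths function are hypothetical names, and the branches are the ones listed in the comment above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Why:
    statement: str
    causes: List["Why"] = field(default_factory=list)


def causal_paths(node: Why, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """List every root-to-leaf chain; a classic 5 Whys follows only one of them."""
    path = prefix + (node.statement,)
    if not node.causes:
        return [path]
    paths: List[Tuple[str, ...]] = []
    for cause in node.causes:
        paths.extend(causal_paths(cause, path))
    return paths


tree = Why(
    "user is frustrated because they did not accomplish their task",
    [
        Why("the label on this control was unclear"),
        Why("there was no documentation for this feature"),
        Why("it was not clear how to get to the correct documentation page"),
        Why("user was afraid to click it", [Why("there was no Undo")]),
        Why("user didn't know this feature already exists"),
    ],
)
paths = causal_paths(tree)
```

Each entry in paths is one candidate chain of whys, and each leaf is a candidate for an actionable improvement, technical or not.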
5 Whys also needs to be preceded by a detailed representation of what happened and followed by multiple next actions that are tracked to completion.