The Problem with ‘5 Whys’ (2016) (bmj.com)
237 points by zdw on Apr 10, 2019 | 116 comments



Many years ago, I worked at Amazon, and at the time they LOVED "5 Why" analysis. I had to write a couple of them. The second and final one I wrote came back from my manager, who told me that it was unacceptable for my "5 Whys" to blame something outside our team's control when the most immediate cause was a bug written by our team, and I needed to come up with a different ultimate cause. This was really frustrating, but it was easy enough to completely change the cause without being dishonest. "5 Why" style analysis is really easy to steer toward whatever you want to blame.


You can't end a 5 Whys with 'another team I depend on screwed up'. You have to own your software or service. If you depend on something, you need checks, alarms, and whenever possible the ability to handle dependency failure well.

Imagine you're built on a public cloud in a single Data Centre. That DC has an issue. Your 5 Whys shouldn't end with 'the Data Centre had an issue'. It should continue with 'my service was only in one DC.... Why?' and continue from there.


> You can't end a 5 Whys with 'another team I depend on screwed up'. You have to own your software or service.

Our service was down.

Why?

Because the platform it runs on went down.

Why only one platform?

Because management doesn't want us to spend several years engineering for multiple platforms.

---

I find it hard to imagine a realistic path of 5 whys that leads us back to a cause within our own team.


> Because management doesn't want us to spend several years engineering for multiple platforms.

Yes, and that's fine. It's okay to end up with an RCA like that and note it as a known risk. If a cost-benefit analysis shows that the risk is dwarfed by the cost of the solution, then it's not worth fixing. Knowing the existing risks of your system and reevaluating these risks in the face of changes (more people available, inefficient context switching, etc.) is a key component of making progress.


True. In a case this simple, you likely wouldn't have to write an RCA.

However, most real-life situations I've seen show that there's always room to improve. For example, how quickly did your service recover once the underlying platform recovered? Rare events (like DC failure) can be a bad thing for operational training. Teams forget how to recover. No runbooks, no training, or tooling that's infrequently used no longer works.


You actually want to assign a root cause and track it. Then over time you build up data on whether it's worth staying on one platform or not. E.g., doing this lets you build a risk profile to counter the $/time argument against it. Then you get to make an informed decision.

You don't just shrug and move on with "that ship has sailed", at least not if you're doing it right.


Alternatively:

Our service was down.

Why?

Because the platform it runs on went down.

Why only one platform?

Because management declined to approve purchase orders for sufficient hardware to allow redundancy.

---

And your career just hit a wall. Passing the buck up the chain tends to do that.


So you don't word it in a way that assigns blame. You say that the previous cost-benefit analysis identified X as a known trade-off of the lack of redundancy. This incident either confirms the trade-off (finance will usually support you on this), or you say that an additional X factor, not previously known, needs to be taken into account and the trade-off revised.


Plot twist: X factor was known but dismissed by management. Management knows this and knows how to read between lines. Your career hit a brick wall and you didn't even get to openly snub the management.


You're pushing the scenario into foolish territory. If management decisions are the root cause of a problem and they won't accept that, then the quality of the process used to identify the root cause is irrelevant. You could use the best or worst process and it wouldn't matter.

If your management is bad, then there are a lot more risks to your career than trying to work with them to improve the process. It is more likely that a given employee has dead-ended their career because they don't know how to explain a problem without blaming people and lack the humility to see things from other people's point of view; both are useful skills for getting a promotion.


Interesting. I'm certain you're right: don't blame your superiors if they hold your fate in their hands. But what do you do when you know they are incompetent? Cover their ass and feed their incompetence for your own gain? I'm not sure.


> But what do you do when you know they are incompetent?

Is your superior's superior also incompetent? What about his superior? Does your superior have someone else on his level that isn't incompetent? If you genuinely believe the company is top to bottom incompetent then what are you still doing there? Otherwise identify the competent people and start working that angle, either to get yourself transferred under someone more competent or to move critical decision making power from your superior to someone more competent (perhaps yourself)?

Or, if you're that sort of person, find a way to use their incompetence to mask your lack of giving a fuck and just slack and collect a pay check until they fire you.


That indeed sounds much better.


How do you make effective action items with that root cause?


This reminds me of the Twitter [0]/HN [1] thread by someone trying to argue that the Boeing 737 MAX couldn't be blamed on software engineers, because it was the aerodynamic/mechanical/systems engineers and business executives who demanded too much (and also, the pilots for just not being better trained). Yes, that's all technically true. We could even put more blame on the human desire for convenient speedy travel as more of a direct factor. But software engineering is part of systems engineering for any modern project. We should want and expect software engineers to have the agency and bigger-picture thinking to do more than just design for inadequate inputs.

[0] https://twitter.com/trevorsumner/status/1106934369158078470

[1] https://news.ycombinator.com/item?id=19414775


The buck stops with senior management; the buck always stops with senior management. That is the essential lesson of the 5 whys and the essential problem with trying to use it without being fully committed to the Toyota Production System. If senior management aren't willing to cop the blame for literally everything that ever goes wrong, the whole effort is just a dog-and-pony show.

https://en.wikipedia.org/wiki/Toyota_Production_System


I disagree. If an outage in the product your team manages was caused by a bug your team introduced, 5 Whys can point at several root causes and solutions that your team will be able to control and implement.

Management only comes into play when causes like "under-funding", "constantly changing requirements" or "too tight deadlines" are identified.


Management have ultimate responsibility for your team - staffing, training, resources, workload and everything else. If one of those is wrong, it's their fault. The alternative is pure scapegoating. Blaming a team or an individual doesn't get to the root cause - why did that employee underperform, why was that team badly led, why did nobody identify or report these problems until it was too late? The answer might be as simple as "we hired a complete moron", but management have to take ownership of that error if they want to prevent similar errors in future.


Reminds me I need to finish Extreme Ownership. Got about half way through, a really excellent philosophy.


Management also has to support the idea of blamelessness and the costs of investigation and remediation and to participate in reviews. Without that support it may work some of the time but often won't.


Management is still at fault for not putting in place enough quality to ensure that the bug didn't happen, or was found first. There are a number of ways to do this. They can spend more time and money testing the product before it goes to production. They can spend more money/time on automated testing infrastructure. They can spend time/money on failover to the last good version. They can spend time/money on up-front design. They can decide that the costs of all of the above are too high and so they will accept the risk. Depending on your exact situation, any of the above can be the most cost-effective (often a combination; in particular, accepting some risk is eventually required no matter how much you spend on something else).


The fundamental point of the argument that software wasn't to blame was that the problem wasn't with the way the software was implemented, but with the way it was specced, and that it was performing to spec, and therefore blaming the software was pinning the blame on the wrong folks as the software folks didn't write the spec (and the people who wrote the spec didn't control the factors that led to the need for the spec, etc).

> We should want and expect software engineers to have the agency and bigger-picture thinking to do more than just design for inadequate inputs.

Saying software engineers should have bigger-picture thinking doesn't help if the software engineers aren't empowered by management to make the necessary changes.


Sometimes a problem that affects you is outside your immediate ownership, and the right answer is to escalate and seek to fix it. If you limit yourself to only considering changes that your immediate team can make on its own without consulting others, you're doing your organization a disservice.


sure, if your service behaved as well as possible during your dependency's outage, yes.


I think there is an underlying misunderstanding with the application of "5 whys" which is a bit like the misunderstanding that people have with prioritisation. Let me digress a bit on prioritisation: a lot of people know about the story of how to fit large rocks, medium sized rocks, small rocks and sand into a jar. You start with the largest size and work your way down to the smallest size (sand). The smaller rocks fit in the spaces of the large ones. The moral of the story is to look at the items that are most important as being large rocks and the ones that are least important as being sand. And by doing the most important things first, you'll get everything done. Because... um... less important tasks fit into the spaces of more important tasks??? What??? No. This is a bad analogy that looks like it's related, but isn't related in any way!

Getting back to the question at hand. "5 whys = 1 how". Because if we do the root cause analysis, we will definitely find a solution to our problem. What???? Because there is always a solution to our problem??? It's that same skipping over of the most important detail: you aren't doing the most important problems first because that way there will be space for the less important problems. You are doing the most important problem first because then when you run out of time, you've done the most important thing! Similarly, you aren't finding a solution when you are doing root cause analysis, you are finding problems. 5 whys is a tool for helping search the problem space. That is all.

The "5 whys = 1 how" quote is really unfortunate and I think it's taken completely out of context. You don't just do 5 whys once and say, "Oh now we know how to fix our problem". You keep doing the 5 whys over and over and over again so you can identify your problems. And at that point you can start to get a handle on your solutions. "5 whys = 1 how" doesn't mean there is only "1 how" that you need.

And just to sum up, if one of your 5 whys ends up with "An external entity failed us", then you might need to find a way that you aren't relying on that entity. However, run some other 5 whys to help you search the problem space a bit more. You may find another "how" that will help you better.


Where does it stop though? Because in that case all externalities will become an "I trusted that dependency but it failed, why did I trust it and not work around it?".


the causes of failures that you explicitly enumerate which were under your control aren't always You Did A Bad Thing.

For example, "I trusted this dependency which broke" doesn't mean you Did Something Bad, it just means that the cause of the failure was the lack of a way to validate the dependency and fall back on a known-good version. A good example of this is build pipelines relying on the ability to download npm/nuget/debian packages from a remote server instead of vendoring them locally - this is an intrinsically unreliable choice to make. The remote server being unreachable is not your fault, but there are steps you can take to fix this problem in the future. Avoiding external blame in your 5Ys (without ignoring the external factors) encourages you to make local changes.
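A minimal Python sketch of the "validate the dependency and fall back on a known-good version" idea. The names `load_dependency`, `fetch_remote`, and `vendored_copy` are hypothetical and not taken from any real build tool, and pinning a SHA-256 checksum is just one assumed way to do the validation:

```python
import hashlib
from pathlib import Path
from typing import Callable


def load_dependency(
    fetch_remote: Callable[[], bytes],  # hypothetical downloader supplied by the caller
    vendored_copy: Path,                # hypothetical path to a locally vendored package
    expected_sha256: str,               # checksum pinned when the dependency was last reviewed
) -> bytes:
    """Prefer the remote package, but only if it arrives and matches the pinned checksum."""
    try:
        data = fetch_remote()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    except OSError:
        # Network failures (ConnectionError, TimeoutError, ...) are subclasses of OSError.
        pass
    # Remote was unreachable or failed validation: use the known-good local copy.
    return vendored_copy.read_bytes()
```

The remote server failing is still not your fault, but with something like this in the pipeline the failure no longer takes the build down with it.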

As always you have to apply reason here and not adhere to dogma, but it's a useful dogma.


It's okay to conclude after a root cause analysis that the root cause is not something you want to work to prevent, because the cost-benefit is skewed.

For some products an AWS S3 [1] outage might be influential enough that they add a redundant second storage at another provider; for others it might be "this is too expensive and time consuming to work around for the problems it caused, let's hope Amazon gets their shit together".

[1] https://aws.amazon.com/message/41926/


I think what you're touching on is the crux of the article too. Depending on the "why" that gets delivered, you end up with very different conclusions.


But that would be 6 whys.


>You can't end a 5 Whys with 'another team I depend on screwed up'. You have to own your software or service. If you depend on something, you need checks, alarms, and whenever possible the ability to handle dependency failure well.

On the big picture, I agree with GP - different people will come to very different conclusions with the "Why" game - because the world is complex, and many factors come into play when something goes wrong. You don't need to fix all of them, but it does become a game of figuring out the most convenient aspect to fix (and no, another level of "Why" won't fix that).

Somewhat related: I had a 2nd level manager who in meetings kept saying "In an ideal world, ..." and using that as a starting point for requirements, problem solving, etc. At one point, I interrupted him and said "Nope. In an ideal world, I would not need to work for a living. Your ideal is already far from it, so let's not make distinctions between ideal and non-ideal."[1]

It's easy to use the "whys" to come up with answers like "Because X needs to make a living" and "Because the incentives with the manager/department/org/company create a situation where testing is devalued (despite all claims by management to the contrary)" or "Because we cannot retain talent due to our compensation policies" or "Because people in the team are unwilling to learn version control systems newer than cvs."

These are not facetious answers - they should be treated on an equal footing with technical solutions. In my experience, management that does not want to deal with solutions that involve other teams/people/policies are ones where the work has been miserable. It's also been my experience that SW people seem to prefer technical solutions over alternatives, which is tragic. When you study things like negotiation, it's almost inverted. They have their own push for "Why", but the push is usually in the motives direction. The person you're dealing with is presenting a concrete demand that may seem objective, but behind it is almost always a fairly human reason (not looking bad, higher status, etc). And they always emphasize trying to root cause to those and addressing them.

Many here know the saying: In the top tech companies, all problems are social ones. They have the talent to achieve anything. If any piece of SW fails, it's not because they didn't have technically capable people. It's because they failed in the social/organizational domain. So when I hear management say they don't want to address that in the 5 Why's, I see management that is solving the wrong problem.

I took a systems engineering class once. The thing they emphasized throughout is "If every aspect of a product is designed perfectly as its own unit, you'll get a crappy product that will not win in the marketplace." The idea is that a perfect widget in isolation may not integrate well, and there are budgetary issues as well. There has to be a certain amount of compromise to integrate with other widgets in the product, and the teams involved need to talk to one another and have dependencies on one another. The role of the system engineer is to oversee that this is happening.

Insisting that your SW or service must be bulletproof against failures of other services, when both are related and in the same product, is highly suboptimal (as is the other extreme). Often the optimal solution is to say "Our service will work, provided service Y works".

[1] I was on good terms with him so I got away with it.


I don't understand, if the bug was written by your team, how could the ultimate cause be something outside of your team?

Seems like 5 Whys in that scenario would lead you to look at things like insufficient test coverage, no code review, code reviewers missed the bug because...

Those paths of inquiry lead to your team changing things under your direct control that could make it less likely for similar bugs to reach production in the future.


Broken expectations are a cause.

For instance, when ext4 gained traction, some users found a number of zeroed-out files after a crash.

Why? Because the programs managing those files updated them by truncating them and then writing them.

Why? Because fopen() has no "replace file" mode, even though the "w" mode seems to do just that as long as the system doesn't crash (it does O_WRONLY|O_CREAT|O_TRUNC).

Why? Because file systems used to not have delayed allocation when fopen() was created. ext4 introduced the concept to the ext family.

Why? Because ext4 was designed to improve performance compared to ext3, and that change does that without breaking POSIX.

Why? Because, even with ext3, the only way to correctly implement "replace file" on POSIX is to write the replacement file in the same directory, fsync() the file, close the file descriptor, fsync() the directory, and rename the file.
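For reference, a sketch of that "replace file" sequence in Python on a POSIX system. The function name `replace_file` is hypothetical. Note one assumption: this sketch fsyncs the directory after the rename, which is a commonly used ordering so that the new directory entry itself is flushed, rather than the exact order listed above.

```python
import os
import tempfile


def replace_file(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same file system) as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # push the kernel's buffer to disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Compare that with `open(path, "w")`, which truncates first and leaves a window in which a crash yields an empty file.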

At that point, the whys have shifted the blame through userland programmers, POSIX fopen(), ext4, and back to userland programmers. Depending on where you stop, the solution can be:

1. fix all userland programs,

2. change ext4 to force sync upon closing a truncated file's descriptor (which got implemented),

3. upgrade the POSIX standard to have a dedicated "replace file" function call.

The truth is, userland is buggy, and the POSIX standard is arcane, so they both share blame. (But ext4, which was correctly implemented, is the only thing that got patched.)

http://dream.thunk.org/tytso/blog/2009/03/12/delayed-allocat...


To me it seems like ext4 was the problem here -- it tried to provide a feature that wasn't adequately supported by the underlying OS.


Delayed allocation was adequately supported.

Implementing it in ext4 increased the risk of zeroing out a file from “having all files updated in the past 20ms be zeroed out” to “having all files updated in the past 4 seconds be zeroed out” in the event of a system crash.


Team two promised us this service would never fail, so we didn't check for errors...


The root cause may be something outside your team, but that doesn't mean you're saying you'll sit on your hands. "The service we depend on failed" can still mean, "and we need to implement better retries or redundancy" as a remedy.
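As a rough illustration of what "better retries or redundancy" can look like in code, here is a small Python sketch. Everything in it (`call_with_fallback`, the `providers` list, the default attempt count and delay) is hypothetical and only meant to show the shape of the remedy:

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(
    providers: Sequence[Callable[[], T]],  # primary first, then redundant alternatives
    attempts: int = 3,                     # tries per provider (assumed default)
    base_delay: float = 0.1,               # seconds; doubles after each failed try
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry each provider with exponential backoff, then move on to the next one."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(attempts):
            try:
                return provider()
            except Exception as exc:
                last_error = exc
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all providers failed") from last_error
```

The root cause in the write-up can still be "the service we depend on failed"; the action item is that your own service calls it through something like this.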


So you know how in Refactoring you're supposed to be doing bottom-up design, but when the problem gets big it turns into top-down design masquerading as bottom-up? (aside: the Mikado method allows you to fix this)

Quite often when I'm doing 5 Why's I find myself socially engineering Why 3-5 to lead us to an action item we can do something about. There are any number of Why #5's someone can generate but a lot of them do not represent progress the team can get behind.

The fact that this happens with so many different teams and managers makes me wonder if it's me or there's just an obvious pattern of misuse going on with 5 Whys.


Kinda like ‘doing agile’ can mean anything, it seems.


Kinda like `Object Oriented` Programming, which is not about Objects.


OOP has some very clearly defined concepts, object being one of them, and the concepts are almost always baked into the languages themselves.


It's "Object Oriented Programming" not "Object Only Programming".


> "5 Why" style analysis is really easy to steer toward whatever you want to blame.

This is the big issue. Ideology, bias and self-interest creep in and you can easily end up blaming whatever it is that you personally disagree with.

And that's not even to mention the "unknown unknowns", which by their very nature cannot be factored into a "5 Whys" process, no matter their real-world relevance to the problem at hand.


> "5 Why" style analysis is really easy to steer toward whatever you want to blame.

How is this better than "Just So Stories?"


Exactly


This argument seems rather pedantic to me, possibly because the author's context is different from mine. Of course you shouldn't treat the 5 Whys as orthodoxy. Of course it's not a silver bullet in RCA. Of course it's not always sufficient. It's a useful tool for getting yourself into the right mindset for RCA: when you think you have an answer, dig a little deeper and you might find something more fundamentally wrong.

In the author's context (healthcare), 5 Whys may in fact be dangerous to promote. In software, where many of us are building CRUD sites and consumer apps, it's a huge improvement over the norm. I'd be very skeptical of the idea of an engineering team trying to employ CAST (linked in the article to analyze cardiovascular surgery mistakes) unless they're building something mission-critical.

My point is: consider when to use a tool. In many contexts, the 5 Whys is a fantastic tool and should be used. In other contexts (likely this author's), it can oversimplify.


> Of course you shouldn't treat the 5 Whys as orthodoxy. Of course it's not a silver bullet in RCA. Of course it's not always sufficient.

This problem is similar to what happens in tech with big new ideas. People do present it as orthodoxy, people do present it as a silver bullet, people do present it as always being sufficient. Then people inevitably find it's not quite that hard-and-fast and get upset.

See the massive backlash against 'agile'. It was supposed to be a flexible way to approach doing things and some general ideas. But a generation of programmers have been told it's an absolute rule that you must do things a certain way - you must have a standup and it must be so many minutes long - and now they're angry.


Yes I've observed the same thing. It doesn't mean we should give up on coherent frameworks though. It just means we need to emphasize and be vocal about the dangers of orthodoxy (which is what I'm constantly trying to do).


For me it's not about orthodoxy. It's rather about teaching WHY we should do certain things instead of HOW. If we focus on the why, it is much easier for people to decide whether this fits or not.


> For me it's not about orthodoxy. It's rather about teaching WHY we should do certain things instead of HOW. If we focus on the why, it is much easier for people to decide whether this fits or not.

Of course. But you shouldn't stop at one why. You should keep digging until you find more. I'd say... five of them should be enough.


The problem with 5 Whys? Taking it too literally, and then using this as basis for an argument criticizing the whole concept.

To be fair, the idea of multiple root causes (and the difficulty 5 whys has in uncovering multiple causes) is important. But that's not grounds for total dismissal. Because the value of 5 whys isn't "ask why 5 times and you arrive at the root cause", but rather "Don't stop at the first answer you find". Since rather a lot of people stop at the first cause they see rather a lot of the time, it's a valuable approach.


It's the XY problem. Stopping at 1 or 2 Whys almost always results in a large body of bandaids, duct tape and baling wire. If you go three or four deep you start digging into cross-cutting concerns that create the same class of error in multiple ways. By 5 you've probably found a useful culprit in 99% of cases.

(Occasionally 2 Whys turns into a toxic environment based on shaming. Why are we down? Because Steve screwed up. Don't be like Steve!)


Agreed.

The argument "tool is misusable and therefore has no value" is absurd. Of course any discussion framework is going to rely on intelligent people using it sincerely, and even then will only provide some additional insight. It's a tool to help people make 10% better conclusions.


I read the conclusion much differently. It said that in a system with a high rate of error the 5W analysis is insufficient at directing investigators to a reasonable solution. A definition for reasonable in the case of healthcare is given. It suggested that context was important and that Toyota's engineering practices are sufficiently different from health care that the technique cannot be recommended.

I think the same may be true for networked software services as well. At a certain scale "root cause" becomes a matter of opinion or interpretation. Attempting to use a 5W methodology will lead you to one of many different, possible, causes but will ignore other confounding factors depending on who is doing the investigation and how it is conducted.

In my experience finding the root cause of a particularly nasty error in a production cloud system required a formal model to prove that a particular error was triggered under very specific circumstances. The error was not reproducible and a test case elusive and yet it frequently ate up our error budget. There was no way any kind of root-cause analysis could have detected that error. It was a behavior comprised of over a dozen steps where the conditions had to be right to trigger it. Humans are not good at finding things like that on their own and simply asking, "Why?" like a parrot is not helpful... the system is simply too complex.

Instead of root cause I like to think in terms of factors. What is the probability that a given factor has a high impact on our SLOs? Let's fix that. If we don't understand what the source of the problem is let's write a formal specification and test our assumptions and find the factor. Most of all, build for o11y and make reliability an engineering concern.
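A back-of-the-envelope Python sketch of ranking factors by how much of the error budget they consume. The function names and the downtime figures in the usage note are hypothetical and purely illustrative; the only assumption baked in is that the budget is the fraction of the window the SLO allows you to be down:

```python
from typing import Dict, List, Tuple


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def rank_factors(
    downtime_by_factor: Dict[str, float],  # minutes of SLO-violating time attributed to each factor
    slo: float,
    window_days: int = 30,
) -> List[Tuple[str, float]]:
    """Return (factor, share of error budget consumed), largest share first."""
    budget = error_budget_minutes(slo, window_days)
    shares = {factor: minutes / budget for factor, minutes in downtime_by_factor.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_factors({"dependency timeouts": 20.0, "bad deploys": 8.0}, slo=0.999)` (made-up numbers) puts the timeouts first, at a bit under half of a roughly 43-minute monthly budget.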


Of course 5 whys is no substitute for the scientific method. It isn't meant to be. Criticizing 5 whys because it's not good at science is unhelpful. It's a way to dig pretty deep into a problem very quickly with a simple heuristic. A 5 whys analysis of anything shouldn't take more than a couple of minutes. If your problem is going to need more than a couple of minutes of analysis, don't do 5 whys.


I really should find a name for this sort of fallacy (I'm sure some clever person has a clever name for it). "This is not perfect, therefore it is useless". I've seen this particular fallacy quite a bit on HN lately, and it's annoying me to no end.


We haven’t acknowledged that trying to ‘find causes’ at all is misleading!

Causes are something we create and construct afterwards. They aren’t a primitive that make up incidents. They don’t fundamentally exist to be found.

Causal explanations limit what we find and learn and the irony is that root cause analysis is built on an idea that incidents can be fully comprehended—they can’t!

Instead think about the conditions that allowed an incident to occur. Separate out everything you think looks like a cause and explore each of them.

Talk about the properties and attributes that were present. There are so many aspects to incidents that aren’t even causes at all, and they don’t follow a linear chain of this-led-to-that.

People are presented with RCA and 5 whys as really getting to the bottom of what matters, but the reality is this approach is a linear simplification. We need to kick it up a notch and practice more holistic investigations of incidents.

Stop getting at why people didn’t do what they thought they should have done, and start getting to the point of what actually happened and how those actions seemed reasonable at the time.

https://www.oreilly.com/ideas/the-infinite-hows


Basically, causal analyses are graphical data structures. They're not linear or strictly hierarchical. They're directed graphs.
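To make that concrete, here is a small Python sketch that stores contributing factors as a directed graph (effect pointing to its causes) and walks every branch instead of a single chain. The graph contents are a hypothetical incident loosely pieced together from examples elsewhere in this thread, and the function names are made up:

```python
from typing import Dict, List, Set

# effect -> contributing causes (hypothetical incident, for illustration only)
CAUSES: Dict[str, List[str]] = {
    "service down": ["platform outage", "no redundancy", "slow recovery"],
    "no redundancy": ["hardware purchase not approved"],
    "slow recovery": ["no runbook", "recovery tooling rarely used"],
}


def contributing_factors(graph: Dict[str, List[str]], incident: str) -> Set[str]:
    """Every factor reachable from the incident, following all branches."""
    seen: Set[str] = set()
    stack = [incident]
    while stack:
        node = stack.pop()
        for cause in graph.get(node, []):
            if cause not in seen:  # also guards against cycles
                seen.add(cause)
                stack.append(cause)
    return seen


def root_factors(graph: Dict[str, List[str]], incident: str) -> Set[str]:
    """Factors with no recorded cause of their own: where the asking stopped."""
    return {f for f in contributing_factors(graph, incident) if not graph.get(f)}
```

A single 5 Whys pass is one path through this structure; `root_factors(CAUSES, "service down")` returns four different stopping points for the same incident.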


This is a rant that I've had for so long, and it's so hard to concisely summarize. This does a great job of explaining in detail, but it is much harder to tell your boss to read a long blog post when you disagree with him about the correct approach. I found a metaphorical approach to sometimes be helpful. I use this:

On a dark night, a blind man jaywalked and was nearly hit by a drunk driver who swerved and killed a pedestrian wearing black on the sidewalk. What was the root cause? Was it because the man was jaywalking? What if he was jaywalking because he was blind and didn't know any better? If the driver wasn't drunk, could he have braked instead of swerving? If the pedestrian was wearing brighter colors, could the driver have swerved a different direction? Sometimes there is no singular root cause, but rather a complex interaction of multiple failures that may or may not arise if you take one failure away. There may be a better approach than to use a tool that is designed to home in on a singular root cause.

Unfortunately, this only sometimes works. The rest of the time, they actually do try to pick apart the example and claim there is a root cause. The only recourse then is to present an alternative singular root cause and say that by trying to force a single-root-cause interpretation instead of accepting the complexity of reality, your root cause will ultimately reflect your own biases rather than reality (for example, who do you hate more: jaywalkers, drunk drivers, or pedestrians that wear black at night?).

I still wish I had a better way of explaining this succinctly, in a way that makes it more clear.


5 Whys doesn't find a "single root cause". I don't understand why you wrote a great example of how 5 Whys helps uncover issues to address -- drunk driving (which you could go deeper into -- why did the drinker need to drive? Why wasn't the drinker prevented from driving, or a driver prevented from drinking?), pedestrian visibility (and you could expand on that side -- why isn't the sidewalk more physically separated from the roadway?) -- and then claimed that it's a failure.


The real man at fault here, of course, was the bartender that served the driver his last drink. Isn't it obvious? /s


Courts have held that to be true before and even gone farther, finding the bar ultimately responsible for failing to train or supervise the bartender to prevent such an outcome. This is probably mostly because bars usually have larger insurance policies than drunk drivers.


No, it's the car that allowed the drunk person to drive it. The corrective action is cars should have breathalyzer interlocks as standard equipment.

I'm not actually joking - applying this process seriously requires looking at not just who did what (people) but the presence or absence of mechanisms. Corrective actions that rely on the good intentions of people "remembering to do the right thing" rarely work.

I review several RCAs a week in my job. If I saw one for a scenario like this where the actions or inattention of any of the humans involved was identified as the ultimate root cause, it'd get kicked back for re-work.


Hey now, the street also allowed jaywalking.


Pedestrians and heavy vehicles shouldn't be routed on the same level (/altitude/z-layer) in the first place.


Don't bother explaining why it's wrong. Suggest something better.


    Problem: The Washington Monument is deteriorating

        Why? Harsh chemicals are being used to clean the monument

        Why? The monument is covered in pigeon droppings

        Why? Pigeons are attracted by the large number of spiders at the monument

        Why? Spiders are attracted by the large number of midges at the monument

        Why? Midges are attracted by the fact that the monument is first to be lit at night.

    Solution: Turn on the lights one hour later.

This is a massive oversimplification of the 5 whys process. The author seems to be implying that this is the core of the process when it is not; it is merely a simplified example of what the process would look like to someone unfamiliar with it.

Without getting into a massive amount of detail, the following things are important to note about 5 whys in lean.

5 whys isn't meant to arrive at a single conclusion or solution; it is meant to provide deeper analysis so that an appropriate and proportional set of countermeasures can be proposed and considered. The key concepts being a "set" of countermeasures and "considered".

5 whys does not need to exist in a vacuum as a problem solving tool. It is often just one such tool in a more thorough approach such as the A3 process.

There is no reason to limit the number of whys to 5 or even go that far in some cases.

The countermeasures that come out of a 5 whys analysis are not expected to be implemented and immediately forgotten about. They are part of a plan that expects follow-ups, measurement and observation of the results.

The underlying causes of the analysis are not assumed to be the only causes or problems. The purpose of 5 whys is to limit over analysis. This may not make sense in life-or-death situations, but for most business processes it is appropriate and efficient to make decisions at a reasonable point in the analysis and move forward rather than investing a disproportionate amount of resources in that analysis.

The concept of a "proportional response" is important to lean. If the outcome of a problem is a patient's death, then the proportional response can be expected to be considerably more significant than just doing 5 whys.

If all a team does is ask 5 why questions then propose and implement a solution based on the last answer, this would be a pretty textbook example of a lean cargo cult.


"this would be a pretty textbook example of a lean cargo cult"

This is indeed the problem, at least according to people close to me that work in healthcare. Lean does not have a good reputation in these circles.


I've heard similar from people I know in healthcare. Healthcare has bought into the six-sigma consultants hard and "the gemba" is being taken along for the ride in something that very much appears to be the antithesis of lean.

I'm not sure what can really be done about that though. It isn't my industry so my understanding of the problem is pretty peripheral.


Awesome answer!


The 5 whys are part of the justly famed Toyota Production System for managing factories. The key to TPS and the part that's always transferable is to get teams to think about their work and its results (cause and effect) in an effective way and then to use the results of that thinking to improve their products and processes. Anything that contributes to that goal is well founded. Anything else is cargo cult.

Requiring causal analysis to make direct claims about cause and effect, which 5 whys does, is important; otherwise it's hard to check or even identify those claims or evaluate the evidence. It's especially good for getting labor and suppliers to think, which they often are not paid to do. It also often lays bare shallowness in that thinking. I can easily see that other tools might be better in other circumstances, but management is still going to need explicit claims about cause/effect and evidence to support those claims in order to make process improvements.

It's also worth pointing out that there are other shop floor tools in the quality toolbox we've inherited from manufacturing engineering other than the 5 whys which are easy to apply and can be very useful and helpful (https://en.wikipedia.org/wiki/Seven_basic_tools_of_quality). Pareto charts for example are often seen in computer monitoring systems and can be super helpful in seeing changing or developing failure patterns. Control charts are less often seen in the computer biz but are very useful for visualizing variability and understanding when processes and systems are and are not stable.
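To make the Pareto idea concrete, here is a minimal sketch in Python of the table behind such a chart: causes sorted by frequency, with a running cumulative percentage. The function name `pareto_table` and the incident labels are made up for illustration; a real monitoring system would feed in its own failure categories.

```python
from collections import Counter


def pareto_table(failure_causes):
    """Return (cause, count, cumulative_percent) rows, most frequent first."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical incident log: one entry per failure, labelled by cause.
incidents = ["disk full"] * 5 + ["bad deploy"] * 3 + ["dns"] * 2

for cause, count, cumulative in pareto_table(incidents):
    print(f"{cause:<12}{count:>3}{cumulative:>8}%")
```

Re-running the same table over successive time windows is one simple way to spot the "changing or developing failure patterns" mentioned above.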


The conclusion is basically this (the last sentence before the "conclusion" section):

> But it does mean that we cannot afford to compound these problems through the use of an RCA tool that is so deeply and fundamentally flawed. Other more systems-focused techniques, such as fishbone [75] or lovebug diagrams [72], causal tree diagrams [21], Causal Analysis based on Systems Theory (CAST) [76] or even prospective risk assessment approaches [77–81], should be considered instead.

I find it odd that the paper cites so many theoretical problems with "5 whys" and then elevates these other methods as preferred solutions without any discussion. It's possible RCA committees already recognize the limitations of "5 whys" and use it as a heuristic that helps focus their efforts. That doesn't mean these committees all sit around a table and say "well, X sounds like a major problem, but it doesn't fit our '3rd why,' so we have to ignore it."


Those other systems are either based on 5 Whys or have essentially the same core. The author's complaint is that there exists an informal approximation to his technical system that's easy to understand and can be used effectively if you don't have an RCA team that's gone through a bunch of certification classes. It's like a Java programmer mad that someone launched a product using Python.


The problem is

> the potential for users to rely on off-the-cuff deduction, rather than situated observation when developing answers, as well as difficulty in prioritising causes[, and thinking the causes are a linear list rather than a branching tree and the problem therefore with 5 Whys being greedy in picking a single branch of that tree]


About half the 5 Whys analyses I've seen at my company end up doing this anyway, because anyone paying attention realizes this immediately. It's pretty easy to do a branching 5 Why analysis that branches off in different directions. It's not an orthodoxy, it's a tool and is so simple that it's easy to modify for your use cases.


The value of the 5 whys, IMO, isn't to find absolute truth, it's to avoid making stupid and easily avoidable mistakes by forcing people to talk to one another while developing solutions. In many places, people have multiple concurrent demands on their time, and that can force people into a reactionary get-it-done-quick mindset. The 5 whys are a way to minimize the impact of time pressure by having a thorough problem-solving rubric that is expected to be followed. Basically, it's cover for people to devote more time to communication and critical thought.


5 Whys is not intended to solve problems, or find a 'definitive' cause. It is just one way to identify a cause.

"Not all problems have a single root cause. If one wishes to uncover multiple root causes, the method must be repeated asking a different sequence of questions each time." - https://en.wikipedia.org/wiki/5_Whys

When you work on incidents, you should already know that the incident you're looking at could have wide-spread effects, and may only be a symptom of a series of complex events. You may be able to find a cause with 5 Whys, but probably not the cause, because the causal chain is almost always complex. 5 Whys is just a shoot-from-the-hip way to find the first, most obvious cause from the perspective of the person doing 5 Whys.

You might not be able to identify the "real" root cause right away, but by documenting each incident's cause and linking them into a value chain, and reflecting these in runbooks, you can over time eventually see where the root causes are coming from. But you also don't need to immediately solve the root cause to begin adding safeguards around the known incidents.


When we did 5 whys at a company I used to work for, we built a tree of 'whys', 5 levels deep, many branches wide. Then picked several things that could be improved.

Is 5 whys normally just 5 causes in a line?
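For anyone wondering what the branching version looks like as a data structure, here is a rough Python sketch: each answer to a "why?" can have several deeper causes, and every root-to-leaf path is one classic linear chain of whys. The `Why` class, the `causal_chains` helper, and the example causes (borrowed loosely from the pedestrian example upthread) are all illustrative assumptions, not part of any official 5 Whys method.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Why:
    """One answer to a "why?" question, with zero or more deeper causes."""
    cause: str
    because: list[Why] = field(default_factory=list)


def causal_chains(node, prefix=()):
    """Yield every root-to-leaf path through the tree as a tuple of causes."""
    path = prefix + (node.cause,)
    if not node.because:
        yield path
        return
    for child in node.because:
        yield from causal_chains(child, path)


# Hypothetical why-tree; the labels are invented for illustration.
root = Why("pedestrian was hit", [
    Why("driver was drunk", [
        Why("bar kept serving"),
        Why("car has no interlock"),
    ]),
    Why("pedestrian was hard to see", [
        Why("dark clothing at night"),
    ]),
    Why("pedestrian was jaywalking"),
])

for chain in causal_chains(root):
    print(" <- ".join(chain))
```

A strictly linear 5 Whys corresponds to following just one of the printed chains; the tree version keeps all of them so several countermeasures can be picked.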


Short answer is no. Five is not a magical limit, nor is one root cause.

https://en.m.wikipedia.org/wiki/5_Whys


I don't think I have heard the monument story before. It sounds quite ridiculous, and is about as counterproductive as an example as one could think of, as it gets the 5 whys thinking completely wrong. Obviously there is nothing magic about the number 5 nor is it the case that every problem has a unique cause which has its own unique cause in turn. The point to asking "why" as many times as you have to is that instead of just solving one particular problem you want to prevent similar problems in the future. More specifically, instead of saying "the problem is human error and we fired the guy who fucked up", you want to improve processes to make human error less likely and/or less damaging when it occurs.


5 whys has never claimed to find all root causes. The idea is you don't really know when you have found all the root causes - they are known unknowns, meaning you know that there is at least one root cause, but you don't know how many are in the full tree, so you can never be sure you are done analyzing. However, you don't need to find them all: by finding any root cause and putting something into place to stop it, you have made things better.

In short, finding one root cause rather than all of them avoids analysis paralysis and gets a solution in place quickly.


I think "5 whys" works great for software dev; I still rely on it for RCA often.

That said, I really wish the healthcare industry would stop using ideas from manufacturing. LEAN, 5 whys, etc. came from manufacturing widgets; humans (patients) are not widgets. They should use their brains and come up with unique ideas for their unique problems rather than piggybacking off ideas that aren't meant to deal with one-offs.

This is also why I left healthcare and will not go back. Such an important sector, ripe for disruption, it seems to be lacking so much.


> The real problem with ‘5 whys’ is not how it is used in RCA [root cause analysis], but rather that it so grossly oversimplifies the process of problem exploration that it should not be used at all. It forces users down a single analytical pathway for any given problem, insists on a single root cause as the target for solutions, and assumes that the most distal link on the causal pathway (the fifth ‘why’) is inherently the most effective and efficient place to intervene.

^ That's the best section of the article. Also, I think the causal tree they included for an incident at a hospital (patient received the wrong medicine) is fascinating (see Figure 1; I included a direct link to the JPG[0] below). They explored one chain of causal subtrees to a depth of 11, and the total number of nodes is 78 (although some subtrees are repeated - it would have been better as a DAG[1] instead of a tree). It exposes significant policy and work culture issues at the facility and gives the reader a sense that they already know a lot about what it must be like for the nurses who work there.

[0] https://qualitysafety.bmj.com/content/qhc/26/8/671/F1.large....

[1] https://en.wikipedia.org/wiki/Directed_acyclic_graph
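To illustrate the tree-versus-DAG point, here is a small Python sketch that stores causes as an adjacency mapping and compares how many nodes you get when shared causes are drawn repeatedly (tree) versus once (DAG). The graph below is a made-up miniature, not the article's Figure 1, and the function names are my own.

```python
def tree_size(graph, node):
    """Number of nodes if the DAG is drawn as a tree (shared causes repeated)."""
    return 1 + sum(tree_size(graph, child) for child in graph.get(node, []))


def dag_size(graph, node, seen=None):
    """Number of distinct causes reachable from node (shared causes counted once)."""
    seen = set() if seen is None else seen
    if node in seen:
        return 0
    seen.add(node)
    return 1 + sum(dag_size(graph, child, seen) for child in graph.get(node, []))


# Hypothetical cause graph: each key maps an effect to its direct causes.
# "nurse interrupted" feeds two different branches, so a tree repeats it.
causes = {
    "wrong medicine given": ["label misread", "double-check skipped"],
    "label misread": ["nurse interrupted"],
    "double-check skipped": ["nurse interrupted"],
    "nurse interrupted": ["understaffed shift"],
}

print(tree_size(causes, "wrong medicine given"))  # 7 nodes when drawn as a tree
print(dag_size(causes, "wrong medicine given"))   # 5 distinct causes as a DAG
```

The gap between the two numbers grows quickly when a cause like staffing sits underneath many branches, which is roughly why the repeated subtrees in the figure stand out.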


Whenever I try to do a ‘5 Whys’ analysis on a belief I have (“I believe X.” “Why?”) I almost always have multiple answers (“Because Y and Z”). Then I have to recursively apply another “Why” to both Y and Z, and it ends up being a pretty wide tree of answers.

I’m always left wondering “how am I supposed to really get something out of this?”


5 Whys can't solve something that isn't a problem.


The most interesting thing that I learned in the first six months working at the NASA Enterprise Applications Competency Center was that Root Cause Analysis meetings were the punishment for screwing up. If you didn't have a good story and the political capital to push it, you were in the hot seat.


I'm a fan of T5W. But even I don't see it as a panacea. For me it's a reliable way to vet your original assumption. That is, have you identified a (root) problem or only a symptom.

T5W is the sanity check before you leap to problem solving, and perhaps solve the wrong problem - which often is no solution at all.


Here: https://web.mit.edu/2.75/resources/random/How%20Complex%20Sy...

Read this. Read it again. RCA FTW!


I was working with a large aircraft manufacturer in Seattle, let's call them Doeing. We, as a supplier, had an issue and put together a 5 why workshop with all the stakeholders from both sides. We got a good list of primary whys, and worked all of them out. Then the Doeing lead got up, pointed to the two top causes and stated, "we have to eliminate these, because we can't tell management that the failure is on our side." So we finished the workshop with secondary failures on our side and spent a year and who knows how much money on a fix that never worked.


I've had my head buried in software post-mortems lately.

I particularly appreciate the treatment DevOps Handbook gives them as far as spelling out Systems are Complex, there will be many contributing factors, you need to take in multiple perspectives to fully illustrate what happened.

This lines up with the article's analysis although the conclusion is a bit sensationalist, I'm loving the long list of 'other techniques' they present right before it.

5 Whys should really be considered a place to start and encourage people to go deeper into failure analysis, then work towards more formal methods.


The main problem is that why just doesn't exist. There is no why.

I can choose an arbitrary answer to a why question every single time.

We are mostly satisfied with a how answer to a why question. And most why questions are in fact how questions or just used for something else (like defiance from a kid asking why they have to go to bed).

Also we stop asking why when we are subjectively satisfied with the answer, not when we've found "the cause". And we've been conditioned since we were kids to feel it's rude to ask why more than once or twice.


> And we've been conditioned since we were kids to feel it's rude to ask why more than once or twice.

That's why "5 Whys" is so important -- to re-kindle the intellectual curiosity that is snuffed out in children.


My kids also use the "5 whys". It's quite powerful together with "Explain Like I'm Five". (five year olds are actually smart, while consequences may seem obvious for a grown-up, it seems consequences are difficult to predict and are mostly learned by experience!? for example, it's not obvious that the Washington Monument is deteriorating because of Midges )


I've never understood the "5 Whys" method to be taken that literally. In my experience at a major manufacturing company it was taught as a way of reminding people to think deeper about a problem and not stop at the first level and think that you have arrived at the root cause. Sometimes it's four Whys, sometimes it's six.


It seems to me it is more like: something bad happened; what could have prevented it? Then find out multiple things that could have prevented it. Then figure out which of those ways would be most economically feasible and would also avoid many other similar problems.

But in general it is good to ask questions, and it's good to allow multiple answers to the same question.


I only skimmed the article, but it looks like a textbook case in "doing 5 Whys wrong".

Like... no one held a gun to your head and said "find a single causal pathway!" That's something you inflicted on yourself.

There are plenty of problems I could call up with regard to 5 Whys and how I've seen it practiced, but this is just a silly one.


I think the conclusion is super interesting:

"HROs commonly aim for a reliability rate of ‘six sigma’ (three errors per million opportunities). By these measures, healthcare is struggling to move beyond two sigma (308 500 errors per million opportunities, or a 30.85% error rate). [...]

But healthcare is far more complex than automobile manufacturing, and takes place amid processes and systems that are woefully underdesigned in comparison to a modern factory. [...]

As a result, approaches developed for solving problems in the automotive manufacturing context may not be as effective in the healthcare arena."

So, in a way, the problem isn't really with 5 Whys. The problem is that it's applied to processes where errors/defects are much more common.

I've seen it applied to my own domain as well, keeping computer systems (web sites, mainly) running. If we have one incident with possible downtime per year, there are probably still way too many potential root causes for 5 Whys to be useful... and that's still probably at least a magnitude better than anything I've personally worked on.
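For reference, the sigma figures in the quote can be reproduced with a few lines of Python. This is a sketch assuming the conventional Six Sigma bookkeeping (a 1.5-sigma long-term shift and a one-sided normal tail), which is my assumption about how the quoted numbers were derived; the function name `dpmo` is made up.

```python
from statistics import NormalDist


def dpmo(sigma_level, shift=1.5):
    """Defects per million opportunities for a given sigma level.

    Assumes the conventional 1.5-sigma long-term shift and counts only the
    one-sided tail beyond (sigma_level - shift) of a standard normal.
    """
    tail = 1.0 - NormalDist().cdf(sigma_level - shift)
    return tail * 1_000_000


for level in (2, 3, 4, 5, 6):
    print(f"{level} sigma: {dpmo(level):,.1f} defects per million")
```

Under those assumptions two sigma comes out at roughly 308,500 per million and six sigma at roughly 3.4 per million, which lines up with the quoted 30.85% and "three errors per million".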


In manufacturing you strive to reduce exceptions (problem states, differences, inconsistencies), but in healthcare you are pretty much dealing with exceptions by default - healthy people generally don't go to the doctor.

In manufacturing you should ideally fix one problem at a time. In healthcare you don't have that opportunity, you have to deal with comorbidities.

In manufacturing you can fire troublemakers (eventually). In healthcare you have to treat them.


after experimenting for a while, I now prefer "for what purpose?" instead of "why?".

"why?" easily leads to cycles and going back down the abstraction hierarchy, where "for what purpose?" always goes up the abstraction ladder.

- why are you eating? because it's 7pm

- for what purpose are you eating? to satisfy my hunger


That sounds weird. The starting point of a post-mortem is usually something like "Why was x down?". "For what purpose was x down?" is not the correct question there, it's not like you were purposely bringing it down for maintenance.


For what purpose only works if there is a purpose. "For what purpose are you hungry" doesn't really work.


You can probably make up reasons. "To induce me to ingest nutrients" "To provide me with energy" "To fulfill some biological imperative" is one possible route.


Pretty valid reasoning it seems though it doesn't provide much in the way of alternatives.


Why does HN consistently upvote ill-informed poor analyses to the front page?


The article summarized: Basically, for any given event, there may be two or more "whys" which feed into it, each of which in turn may have two or more "whys" which feed into it.

We see this pattern in databases, where there are 1..n relationships out of something, or, in reverse, n..1 relationships into something.

The "5 Whys" described whys only as having a single cause feeding into each successive cause. This article describes the problem with that, which is simply that in the real world, a single result, a single "why", might have multiple (1..n) causes feeding into it, and multiples into those.

Now... get ready to have your mind blown... Here's the "problem" with this article (I know, this article makes more sense than the "five whys" alone, but hear me out!)

Think of "The Five Whys" as level 1 reasoning and this article as level 2.

Level 2 is smarter because it considers more.

So what's level 3?

Level 3 then, which I think will be discovered (or at least expounded upon by some person vastly brighter than myself), is using the level 2 method, and expanding that set of predecessor causes even more to everything in life, which theoretically should result in the observation that everything is circular.

That is, all events, causes and circumstances -- eventually feed back on themselves! Some to lesser degrees, some to greater degrees, but I think it will be discovered that all causes move in circular loops, even though the cause->effect->cause chain (or chains) might have thousands, perhaps hundreds of millions of elements...

I think the ancient Greeks (well, some of them, the enlightened ones anyway) had an understanding of this. I think that Shakespeare and Dante Alighieri understood this, at least in terms of the human character.

Consider Toyota... If Toyota is asking "why" some failure occurred and tracing it back 5 levels, then one might ask "Why does Toyota have to make cars in the first place?", because that sets the stage for all other "whys" that come next...

Eventually, with enough "whys", everything has to be circular... You just need to go enough why levels "deep" to see this. And some causes feed other causes to lesser degrees, some to greater degrees.

And I believe that people in antiquity who were men of great learning... somehow knew or intuited this... while not provable, it seems "logical"...

Loops inside of loops inside of loops... all with different loop lengths...


This is also something I hear people talk about and never use.


Very few people have the discipline to do proper postmortems, and even fewer industries and companies have the discipline to enable those people to do them.


I have actually used it for root cause analysis of serious software defects, and that led to some valuable process and training improvements.


Do you have other examples? FWIW I find 5-why's useful for personal introspection, but never found it very efficient in professional settings.


I find it tremendously helpful with software -- but I also always include the user as a component of the system (e.g., in architecture diagrams), which apparently isn't very common.

So rather than something technical like "process crashed", most of my investigations begin with "user is frustrated because they did not accomplish their task". Then my why-tree will have branches for, say, "the label on this control was unclear", "there was no documentation for this feature", "it was not clear how to get to the correct documentation page", "user was afraid to click it -> because there was no Undo", "user didn't know this feature already exists", etc.

Usually, I have to go a few questions deep before I get to a geeky technical solution that can be solved by more programming. I'll typically come up with 4 to 6 actionable improvements, and half will be completely non-technical, and most of the rest will be less than one line of code (e.g., to rearrange or clarify).

Users don't like going to the effort of filing bug reports, so whenever someone does, it's a good indicator to me that 10 or 100 other people also failed at a similar task with my software, or will soon. Users are also resourceful, and will try 2 or 3 ways to solve something, so if they ultimately fail, it means the software failed at every path they tried. By attacking every problem at multiple levels, it's possible to improve results for a whole lot of people at once.


We use it often, it can be very effective but it isn't the only tool we have.


I have used it to success, but you have to allow for multiple answers to be given at each level that create their own branches of a tree. Which is effectively what the author illustrates as completely different outcomes. The group should make an effort to give multiple reasons in each iteration and follow them each separately.

5 Whys also needs to be preceded by a detailed representation of what happened and followed by multiple next actions that are tracked to completion.


what was the solution to the washington monument/lincoln memorial deteriorating? was it just using less harsh cleaning chemicals?


Water. Chemists call water the universal solvent for a reason: it dissolves nearly anything.


Including the Washington Monument? :P


Given enough of it for enough time...



