
Why Blameless Post-Mortems - tptacek
https://medium.com/tock/why-blameless-post-mortems-80f9f446fb77
======
malisper
Expanding on the post, one corollary of "people are rarely the cause of the
outage" is that "you can't fix people". Given that Dave accidentally caused an
outage by pressing some button he shouldn't have, explaining to him why he
shouldn't press the button won't prevent anyone else from making the same
mistake.

Instead, you need to look at the reasons underlying why Dave pressed the
button. Is the button confusingly labeled? Is there no playbook on what the
process for pressing the button should be? If you fix those underlying
problems, you will prevent other people from making the same mistake Dave
made.

~~~
closeparen
If your response to every problem is "we need a process to make this idiot
proof," pretty soon you are going to be up to your eyeballs in CYA training
webinars and permission requests and committee reviews. Small groups of
competent people who can be trusted to wield powerful tools will start to run
circles around you.

My employer does a reasonable job with this. Our tools usually make the path
of least resistance the right thing. We have guardrails that help prevent
accidents. But at the end of the day, our controls presume competence and good
faith, and there's always an "I know what I'm doing" button. Actions taken
after pressing it incur a bit more personal responsibility.
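
A minimal sketch of that kind of guardrail, assuming a hypothetical `run_action` helper and an in-memory `audit_log` (neither is from any real tool): the safe path needs no extra steps, a risky action is refused by default, and the explicit override is recorded against whoever used it.

```python
class GuardrailError(Exception):
    """Raised when a risky action is attempted without an explicit override."""


# Hypothetical in-memory audit trail; a real tool would persist this somewhere.
audit_log = []


def run_action(name, operator, risky=False, i_know_what_im_doing=False):
    """Run an action, requiring an explicit override for risky ones."""
    if risky and not i_know_what_im_doing:
        # Default path: refuse, and say how to proceed deliberately.
        raise GuardrailError(
            f"{name} is risky; pass i_know_what_im_doing=True to proceed"
        )
    if risky:
        # The override is allowed, but it is attributed to the operator.
        audit_log.append((operator, name))
    return f"{name} executed by {operator}"
```

The override does not block a competent operator; it only makes the risky choice deliberate and attributable.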

~~~
jodrellblank
When parent poster says “look at the underlying reasons why”, you jump to
untrustworthy idiot employees hampered by CYA red tape.

When your employer does the same thing, it leads to “good tools with sensible
affordances and guard rails” for competent responsible people.

Why is that?

~~~
closeparen
It's a balance. "Harden the process" is good and necessary in moderation for
the reasons already mentioned, but can be taken too far, to the point that the
process is so rigid it's paralyzing.

A finding of "the process was reasonable and the individual wasn't" or "a
process guaranteed to stop this kind of action would be unreasonably heavy"
may sometimes need to be on the table.

------
Thersites
Ties to Warren Buffett, someone probably smarter than me ;) "Criticize
generally, praise specifically" or "Praise by name, criticize by category". In
my experience the biggest problem with criticizing specific actions is that it
destroys loyalty and undermines trust.

------
wgerard
My $0.02:

Despite being a strong believer in it, I've worked at places that did
blameless post-mortems and it was sometimes frustrating because remediation
often felt like a second-class citizen. Remediation is often _much more_
difficult in these environments because systemic changes can obviously be much
more complex than "fixing" an individual's behavior.

If someone pushes the big red button accidentally I'm fully on-board with the
"well what led them to push it and why were there no safeguards?" but it also
kinda sucks if you spend so much time on that conversation that there's no
time left for "ok, who's going to install the confirmation screen for the big red
button?"

Sometimes the "remediation" was even just "well, that's just the way it is and
we can't do anything about it easily, but hopefully the post-mortem itself
will serve as a way to inform people not to push the big red button" which is
obviously frustrating because memories are short and people come and go quite
frequently in the tech industry.

Anyway, and this might be obvious to some, but remediation is important and
sometimes incredibly difficult in these environments. Adopting such an
environment requires a commitment to actually fixing the systemic issues. It
can sometimes feel _more_ frustrating than "blameworthy" post-mortems
especially if it's the same individual(s) over and over again.

~~~
panopticon
> _I've worked at places that did blameless post-mortems and it was sometimes
> frustrating because remediation often felt like a second-class citizen_

Maybe I'm missing something, but that seems orthogonal to the blamelessness.
Blame won't necessarily bring remediation front-and-center either (and will
probably do far more harm to that end).

------
bonestamp2
All good reasons. I'll take the first one a little further... when such an
event happens, it may cost the company actual money. If you fire or create an
environment where someone wants to look for another job, then whatever that
event cost you is no longer an investment in the education and experience of
your employee -- you've now invested that money in someone else's employee.

A person who plays a part in a costly error is going to be much more cautious
in the future and be very helpful in improving systems so it can't happen
again. You want to foster that approach rather than discourage it.

~~~
jandrese
> A person who plays a part in a costly error is going to be much more
> cautious in the future

Unless they're just a fucking cowboy all the time and you should just fire him
now because he's going to do it again. Maybe gauge the level of surprise of
his colleagues. Are they shocked at how this could happen or are they going
"sounds like Jerry".

~~~
folkhack
To add to this - if there's a pattern of having to take part blame/ownership
for people who should know better I'm going to move on to greener pastures.

Some people work more carefully than others. This is absolutely a thing in
software development/operations (my fields). When I was young, I had more
screw-ups because I was still learning "how to work carefully" with certain
things... It's hard for me to admit this but it is 100% the truth. When things
went wrong, I felt the need to take it on _my_ shoulders vs. to put it on the
rest of the team who likely had nothing to do with the error.

As with all things this isn't black and white, which is why I'm making this
comment.

~~~
lonelappde
If you keep making mistakes that get caught in testing, you need to be
careful. If you keep making mistakes that get caught in production, you need
to write better tests.

~~~
folkhack
> If you keep making mistakes that get caught in testing

Ha, yea. I can get behind this. There are times that one dumb thing causes the
entirety of my tests to crumble and I'm always admittedly laughing saying
"that's why we have tests!"

------
brlewis
To expand on the CYA reason why blaming causes problems, people who want to
avoid blame want to say there's nothing they could have done that would have
prevented the incident. In a blameless environment people are more likely to
figure out a way they can contribute to prevention.

------
heelix
Honestly, I expected the article to be hot trash -- considering how I saw the
implementation at our shop. I've changed my mind. We just go about it wrong.
What went right, what went wrong, where did we get lucky... If we answered
those questions rather than dodge responsibility... this could be useful.

~~~
DoreenMichele
Think of it this way: _Blameless_ means looking for solutions (which involves
identifying root causes), not scapegoats.

Sometimes, it really is one asshole and they need to be fired. Most of the
time, it's not.

------
hitekker
> As a rule, people are not the cause of an outage.

But leaders are.

Most enterprise post-mortems I've read will deliver pages and pages explaining
(or justifying) poor system architectures. But they will never, ever criticize
the executive or tech lead who decided on that structure to begin with.

Everyone makes mistakes, for sure. One stray outage is likely technical in
nature. A simple oversight that can be corrected. But if an org's systems are
constantly failing or breaking, then it falls to the leader of that org to
reshape its structure.

Or they can pretend that blame doesn't exist. The system is fine. A band-aid
will suffice.

Without commitment to fix systemic failures, "blamelessness" enables the
abdication of leadership.

~~~
vmchale
> Most enterprise post-mortems I've read will deliver pages and pages
> explaining (or justifying) poor system architectures. But they will never,
> ever criticize the executive or tech lead who decided on that structure to
> begin with.

Certainly explains their popularity lol

