Ask HN: What is the best postmortem you've seen?
63 points by pbohun on March 5, 2023 | 35 comments
It seems like there are a lot of examples of companies handling a security breach or loss of service poorly. Are there examples of a company handling an incident well, especially with a great postmortem writeup?



Edward Tufte's analysis of the Space Shuttle Columbia disaster[0] is by far the most informative post mortem I've seen. It has directly impacted everything I've written since reading it.

If you hit the link, you'll see the page appears to be a wall of text, not a simple slide or two. As you read deeper, you'll understand that's an intentional aspect of the report. (I'll also note this is the Columbia disaster, not the better-known Challenger O-ring disaster post-mortem discussed by Richard Feynman in his autobiography[1], even though that's a great post mortem as well.)

[0]https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=...

[1]https://www.amazon.com/What-Care-Other-People-Think/dp/03933...


An old colleague of mine worked pretty extensively with a backend engineer on a script to do some data migration on user accounts. The other engineer did most of the work on the script itself. Mind you, the company had millions of users and didn't use staging databases. (No idea if this is still true.)

Come D-day, my colleague ran the script on a limited group of users (100k or so) to validate. I forget the details, but something in the script was incorrect and ended up breaking some features for all of those users.

Once reports started coming in, my colleague was super worried and semi-freaked out. A war room was set up that day and all the people involved jumped in.

One of the first things the group did after determining what had happened was to calm my colleague down and reassure them they weren't in trouble. After that, the group worked on a solution for a few hours and established a plan to fix everything.

I was actually surprised that the response was so well handled. There was no finger-pointing, just a group effort to fix the problem. To me, that's how every problem should be handled, rather than instilling fear of losing your job if something goes wrong.

To anyone who wants to leave replies about staging databases, bad dev practices, etc., please don't bother. This was years ago and it was how things were done at the company. Our team was not part of the backend or infra teams and worked with lots of areas of engineering on different issues.


This is a fair anecdote about responding to an active production incident. While there is a live incident, the focus should be on resolving it and restoring service to customers.

But this anecdote doesn't say anything about a postmortem or the other organizational habits that can be used to review historical incidents or near misses (after the incidents have been resolved and everyone has had time to rest), diagnose risks, identify direct or indirect contributing causes, and then use that feedback and learning to set goals and drive changes that reduce the risk of similar incidents recurring in the future.


While I can't add too much to what I already said, I heard semi-recently that the company is finally working on dev or staging databases. Can't say the anecdote in question contributed to that, but it's nice to consider.


Which is precisely why I don't understand the purpose of even having postmortems for 95% of outages. If everyone is aware of what went wrong and the issue is unlikely to ever happen again, what is the point?

Well, at companies of the size I work with, it is to point fingers, make PMs feel more important, and give people talking points.


> If everyone is aware of what went wrong and the issue is unlikely to ever happen again, what is the point?

Because the only way you can make everyone aware is to write it down. Anything else is hearsay.

And going through the process of a thorough postmortem can ensure you do know exactly what went wrong and why, and how you can prevent the same and similar issues from happening again in the future.

Perhaps this example serves as documented proof that work on setting up staging databases needs to be prioritised and invested in? Maybe it's that scripts like this should be reviewed by another engineer before running? Maybe the standard operating procedure is updated so a backup is taken immediately before running any scripts that write to the database? Maybe you create a rule to limit the blast radius in the future and do smaller rollouts to 1k users instead of 100k? Maybe scripts should be developed with a dry-run feature?
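
To make those last two concrete, here's a rough, hypothetical sketch of a migration runner with a dry-run flag and a batch-size cap. The names and structure are invented for illustration; this isn't the script from the anecdote:

    # Hypothetical sketch: a migration runner with --dry-run and a batch limit.
    # Function names are illustrative stubs, not real application code.
    import argparse

    def load_account_ids() -> list[int]:
        # Stub; a real script would query the user database here.
        return list(range(100))

    def migrate_account(account_id: int) -> None:
        # Stub; a real script would perform the account update here.
        print(f"migrated account {account_id}")

    def main() -> None:
        parser = argparse.ArgumentParser()
        parser.add_argument("--dry-run", action="store_true",
                            help="report what would change without writing")
        parser.add_argument("--batch-size", type=int, default=1000,
                            help="cap how many accounts one run can touch")
        args = parser.parse_args()

        batch = load_account_ids()[:args.batch_size]
        for account_id in batch:
            if args.dry_run:
                print(f"would migrate account {account_id}")
            else:
                migrate_account(account_id)

    if __name__ == "__main__":
        main()

The idea is simply that a first pass with --dry-run and a small --batch-size surfaces mistakes before they touch a large slice of users.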


If you're having enough outages that postmortems are burdensome, I don't think the problem is the number of postmortems.


A postmortem is, in most cases, really just a formal process to ensure everyone (including leadership) knows what went wrong and how to prevent the issue from happening again, with some associated ceremony around actually tracking the relevant action items so the issue doesn't recur.

It's pretty rare that you run into an issue that can't recur, or that won't have similarly shaped recurrences.


> If everyone is aware of what went wrong

Big assumption. Chances are at least someone was out on vacation or something. A postmortem can help spread the word - and perhaps calm those downstream who weren't in the loop as much, through a show of clear and open communication and ownership, instead of trying to bury it.

> and the issue is unlikely to ever happen again,

Also a big assumption. You can almost certainly https://en.wikipedia.org/wiki/Five_whys your way into a broader pattern of something that will happen again, and can review what worked to expectations, what didn't, what should be improved, and what can be left alone - even if it will perhaps happen again, because it may be more expensive to try to fix than to simply accept the occasional failure as a price of doing business.


> everyone is aware of what went wrong

This is never true past a certain company size, and even at smaller companies, once you start asking the "five whys" you see there is never just one root cause. Even for straightforward cases where everyone generally agrees, it can still be informative to capture the analysis for future reference and for reviewing reliability patterns.

Incidents are a learning opportunity. If people are pointing fingers then that's a sign of a bigger cultural issue, one which will not be solved by avoiding discussing and documenting incidents.


Interesting anecdote, but this isn't about postmortems at all.


My colleague wrote a very detailed postmortem document covering the causes that led to the incident, along with the impact and estimated time to fix.

No one got fired, the issue was handled professionally, accounts were fixed, and the original ticket was eventually resolved to everyone's satisfaction.

Sorry if I didn't capture certain details in my original post. I kind of just assumed most people reading this have been through the "everything is on fire" experience and would get that the incident was resolved and the postmortem handled successfully.

Not everything has to end in a hanging or massive change to an organization.


The best postmortem I've seen was actually a conference talk: "Debugging Under Fire: Keep your Head when Systems have Lost their Mind" by Bryan Cantrill (at GOTO Chicago, 2017).

Here's a YouTube video of it: https://www.youtube.com/watch?v=30jNsCVLpAE

Here's the slides: https://gotochgo.com/2017/sessions/86/keynote-debugging-unde...

It goes into detail about a pretty bad outage (an entire data center was brought down), the human aspects, automation, how they handled it, the various risks, the architecture, how things fail, and software development in general.


Unfortunately I can't share any of ours because they're all proprietary client work, but I think my teams have done a really masterful job; they're among the best I've read anywhere. For me, the standout traits of a good postmortem are:

* Honesty about the reality of the situation; no sugarcoating, no spin

* Blameless, factual tone that avoids the passive voice

* Describes technical details at a level helpful for practitioners

* Makes use of other resources as needed (e.g. references corporate wiki, external ideas, blog posts)

* Good writing that's easy to read and is free of grammatical ambiguities and spelling errors


A young guy driving on a country road without a seatbelt rolled his vehicle and died of his head injuries. Quite fascinating to discover that his only real problem was a badly bruised, contused and bleeding brain. Mind you, he probably would have had heart problems in his 50s, because his coronary arteries already had plaque and he was only in his twenties.

That was the best because it was the only postmortem I've seen.


Was hoping to read an answer like this, rather than the same old computer bullshit everyone else is rambling on about.


Back then, most cars did not have seat belts fitted. Within two weeks I had seat belts installed in my car.


Computer BS is fine, was just hoping it wasn’t gonna be sprint retros.


I love premortems at work. Gary Klein came up with the idea of asking why a project could fail before you start it: https://hbr.org/2007/09/performing-a-project-premortem




… the description of this topic didn't follow the title in a way that I was expecting at all…

I immediately thought of the old Gamasutra (now GameDeveloper.com) postmortems: interviews with members of the teams behind many classic videogames; great late-night reading. https://www.gamedeveloper.com/audio/10-seminal-game-postmort...


I used to have a blog compiling a bunch of them, along with articles on best practices for writing postmortems. Unfortunately it never made any money, and I took it offline when cPanel put their prices up 1000% and the hosting cost became too much.

I still have a backup somewhere and the domain names. Could maybe put it back up one day if I could spare the time and find a very cheap solution. It was a WordPress blog.

I realise this doesn't help the OP. I just wanted to vent :-)


I think a GitHub Pages Jekyll site or a similar static-site blog is probably the easiest option?


Agreed, if I could find time to do it. Probably someone has written a WordPress-to-Jekyll converter.


There was a really great podcast called The Downtime Project that dissected and discussed a postmortem in each episode. There were about a dozen episodes in the first season, and they never did make a second one. Pity, I really, really liked it. Might be up your alley; it's only a couple of years out of date now.



I love this part:

"5. Why did the backup procedure fail silently? - Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors. 6. Why were the Emails rejected? - Emails were rejected by the receiving mail server due to the Emails not being signed using DMARC."

Ouch



We had a payment system for vaccination field workers in Africa that stopped working, so people did not get paid. There was a section in the post mortem template that went something like:

"what is the impact of the error: there is an angry mob with torches demanding to get paid outside"


Don't have a link at hand, but the report investigating the infamous Therac-25 incidents tops my list.


https://gvnshtn.com/posts/maersk-me-notpetya/

It's a long read, but it gives insight into how the NotPetya ransomware crippled Maersk and how they recovered.

Looking at how the earlier ransomware WannaCry crippled the crown jewels of many countries highlights a weakness in non-diverse systems.

I even know who is behind them, but I can't prove it, so why even mention it? Because I'm getting closer to proving it, which makes this game all the more interesting; even they have weaknesses they have failed to identify!

The WannaCry weekend was when I met Dame Stella Rimington and Baron Jonathan Evans hillwalking at Scafell Pike, and you can call me the world-famous Walter Mitty, because people are so obedient to authority.


Our lord and savior Jesus Christ


Do you mind answering a few more questions? There's quite a knowledge gap on things that have happened since then, and there's little better than living memory to fill it.


The boulder did not suffice. What other steps can we take to prevent recurrence?



