Lessons Learned from Reading Post Mortems (danluu.com)
156 points by ingve on Aug 20, 2015 | 25 comments

If you would like to read postmortems, I maintain a list of them here (over 250 so far): https://pinboard.in/u:peakscale/t:postmortem/

This is great - thank you!

In a similar vein is the Google+ Postmortems community: https://plus.google.com/u/0/communities/11513614020301839179...

Wow, this is a great list.

It would be fantastic to take this set and visualise their causes. It would be really interesting to see whether the causes are that different for large corporations vs smaller startups. I suspect that, as in the article, configuration, error handling and human causes remain the most common regardless of whether you have vast quantities of process or money for tooling.

Thanks for this - I've been working my way through it, and it's fascinating, far more so than I would have imagined.

> Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages, and nothing else even seems close.

The funny thing about this conclusion is that configuration and code are not dramatically different concepts when you think about it. One of them is "data" and the other "code", but both affect the global behavior of the system. Config variables are often played up as being simpler to manage, but they are actually more complicated from an engineering standpoint, since code is required to support said configuration.

The process is what's dramatically different. "Write a story with acceptance criteria, get it estimated by engineers, get it prioritized by management, wait two weeks for the sprint to be over, wait for QA acceptance, deploy in the middle of the night," vs. "Just change this field located right here in the YAML file..."

Also, I can't speak for all companies, but where I work configuration is how we define the differences between our test and production environments.

If your config files are intentionally different, because in test you should use authentication server testauth.example.com and in production you should use auth.example.com, then how can you avoid violating test-what-you-fly-and-fly-what-you-test?

Obviously, you could add an extra layer of abstraction (make the DNS config different between test and production and both environments could use auth.example.com) but that's just moving the configuration problem somewhere else :)

This is how we do it: a) Have a regression test suite running continuously, with alerts that fire when tests fail. Include a minimal set of config values in the regression suite and alert when those checks fail. b) Set up monitoring for your components and trigger alerts based on thresholds. c) With (a) and (b) set up, roll out your bits to a canary environment and, if all looks good, trigger a rolling deployment to your prod environment.
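A minimal sketch of the gate between the canary and the rolling deployment described above. The report fields and both thresholds are hypothetical, chosen only for illustration; a real setup would read them from its own regression suite and monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryReport:
    regression_failures: int  # failed checks from the continuous regression suite
    error_rate: float         # fraction of canary requests that errored
    p99_latency_ms: float     # 99th percentile latency observed on the canary


# Hypothetical thresholds; sensible values depend entirely on the service.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 500.0


def canary_blockers(report: CanaryReport) -> List[str]:
    """Return the reasons to stop the rollout; an empty list means proceed."""
    blockers = []
    if report.regression_failures > 0:
        blockers.append(f"{report.regression_failures} regression test(s) failing")
    if report.error_rate > MAX_ERROR_RATE:
        blockers.append(f"error rate {report.error_rate:.2%} above threshold")
    if report.p99_latency_ms > MAX_P99_LATENCY_MS:
        blockers.append(f"p99 latency {report.p99_latency_ms:.0f} ms above threshold")
    return blockers
```

The rolling deployment to prod would only be triggered when `canary_blockers(...)` comes back empty.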

You automate the deployment, and that automation runs checks. If things don't work out, it refuses to deploy.
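One check such automation could run, sketched under assumptions: compare the config that was tested with the config about to ship, and refuse to deploy unless they differ only in keys that are known to be environment-specific. The key names and the allowlist are hypothetical; the hostnames are the ones from the question above.

```python
from typing import Dict, List, Set

# Keys expected to differ between environments (an assumption for this sketch).
ALLOWED_DIFFERENCES: Set[str] = {"auth_server"}


def config_problems(test_cfg: Dict[str, str], prod_cfg: Dict[str, str]) -> List[str]:
    """List every way the production config departs from what was tested."""
    problems = []
    for key in sorted(set(test_cfg) | set(prod_cfg)):
        if key not in test_cfg:
            problems.append(f"{key}: set in production but never exercised in test")
        elif key not in prod_cfg:
            problems.append(f"{key}: tested but missing from production")
        elif test_cfg[key] != prod_cfg[key] and key not in ALLOWED_DIFFERENCES:
            problems.append(f"{key}: differs from the tested value")
    return problems


def deploy(test_cfg: Dict[str, str], prod_cfg: Dict[str, str]) -> bool:
    """Refuse to deploy (return False) if any unexplained difference exists."""
    problems = config_problems(test_cfg, prod_cfg)
    for problem in problems:
        print("refusing to deploy:", problem)
    return not problems
```

This does not remove the intentional differences; it only keeps the set of untested values small and explicit.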

"The lack of proper monitoring is never the sole cause of a problem, but it’s often a serious contributing factor."

I am continuously amazed that downtime issues go undetected until a) a customer notifies you or b) things go downhill and alarms are blazing. Our central principle is that monitoring and alerting have to be part of your deployment. The way we apply this at my workplace is that every design doc has a monitoring section which has to be filled out.

Well, the thing about human mistakes is that they are easy to blame on some part of the hardware or software. If I configure an incorrect data element and the system dies, I won't write down "I configured something wrong so the system died"; I will write a bug that says "software didn't catch that kind of misconfiguration". Also, it's not reasonable to admit too many mistakes publicly. People who haven't thought about all the mistakes they make will start to think that you are not able to perform well in your job.

Technically all outages are "human mistakes". Humans build hardware, write software, configure and maintain systems, and manage other humans. Which is why explaining an outage as "human error" is not constructive.

There are known methods to create systems that are resistant to human error: automation, checklists, testing, etc. Humans will make mistakes; that is a certainty. The solution to an outage is to employ these techniques, not to tell your team members not to make mistakes.

A lightning strike or earthquake is not a human mistake.

Human mistakes probably led to the lightning strike not being grounded properly or the earthquake causing structural damage.

As a recent convert to functional programming, I'd like to point out that more functional styles of error handling help one address errors properly rather than sweep them under the rug.

This ties into the article's first point about how poor error handling is a common source of bugs.

How exactly does functional programming help here?

Sum types allow a function to return either a result or an error. Not accounting for both possibilities is a compile time type error.

Go mimics this behavior, so it's not only a functional thing: http://blog.golang.org/error-handling-and-go

I guess he's referring to things like the maybe monad.
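A rough sketch of the sum-type idea from this subthread, with one loud caveat: Python itself does not enforce this at compile time. Languages with native sum types do; here the exhaustiveness check only happens if a static checker such as mypy is run over the code. `parse_port`, `Ok`, `Err` and `describe` are names invented for the example.

```python
from dataclasses import dataclass
from typing import NoReturn, Union


@dataclass
class Ok:
    value: int


@dataclass
class Err:
    message: str


# The sum type: a parse result is exactly one of Ok or Err.
ParseResult = Union[Ok, Err]


def assert_never(value: NoReturn) -> NoReturn:
    """A static checker flags calls to this if some variant was left unhandled."""
    raise AssertionError(f"unhandled variant: {value!r}")


def parse_port(text: str) -> ParseResult:
    """Return Ok(port) or Err(reason); the error is a value, not an exception."""
    try:
        port = int(text)
    except ValueError:
        return Err(f"not a number: {text!r}")
    if not 1 <= port <= 65535:
        return Err(f"out of range: {port}")
    return Ok(port)


def describe(result: ParseResult) -> str:
    # Both variants must be handled before falling through to assert_never.
    if isinstance(result, Ok):
        return f"listening on {result.value}"
    if isinstance(result, Err):
        return f"config error: {result.message}"
    assert_never(result)
```

Deleting the `Err` branch in `describe` still runs, but a type checker would then reject the `assert_never` call, which is the closest Python gets to the behaviour described.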

Daily load tests against both staging and production (yes, really) can help catch a lot of issues of the kind described in the article. You still need a solid monitoring and alerting setup, though.

Curious, how do you perform load tests in production for an e-commerce site?

Well, the gist of it is that it's not that hard technically, but could be more difficult to implement organisationally. Production load-testing is something that needs to be agreed with various teams/people across the organisation (easier if you're a small startup), think ops, marketing, analytics etc.

The basic thing is to make sure you can separate real and synthetic requests (with a special header for example). This will allow you to mock/no-op certain operations like attempting to charge the user's card or reducing the quantity of stock you have. It'll also allow you to remove synthetic requests from your analytics data, so that marketing does not get excited by the sudden influx of new users. If you have user accounts on your system, make all fake users register with @somenonexistentdomain.com so you can filter for that too etc.
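To make that concrete, here is a minimal sketch of tagging and short-circuiting synthetic traffic. The header name, the `Order` and `Checkout` classes, and the in-memory lists standing in for the payment provider and analytics pipeline are all assumptions for illustration; only the placeholder email domain comes from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical header name; any agreed-upon marker would do.
SYNTHETIC_HEADER = "X-Synthetic-Load-Test"
# Placeholder domain used for fake user accounts.
SYNTHETIC_EMAIL_DOMAIN = "@somenonexistentdomain.com"


@dataclass
class Order:
    email: str
    amount_cents: int
    # Exact-match lookup for simplicity; real HTTP headers are case-insensitive.
    headers: Dict[str, str] = field(default_factory=dict)


def is_synthetic(order: Order) -> bool:
    """True if the request carries the marker header or a fake-user address."""
    return (order.headers.get(SYNTHETIC_HEADER) == "1"
            or order.email.endswith(SYNTHETIC_EMAIL_DOMAIN))


@dataclass
class Checkout:
    charges: List[int] = field(default_factory=list)    # stands in for the payment provider
    analytics: List[str] = field(default_factory=list)  # stands in for the analytics pipeline

    def place(self, order: Order) -> str:
        if is_synthetic(order):
            # Accept the order but skip charging the card and recording analytics.
            return "accepted (synthetic)"
        self.charges.append(order.amount_cents)
        self.analytics.append(order.email)
        return "accepted"
```

The important property is that the decision is made in one place, so every side effect that must not happen for load-test traffic is guarded by the same check.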

Obviously start slow and ramp up over time as you iron out issues.

JustEat.co.uk run daily load-tests in production at +20-25% of their peak traffic. As in: extra 20% simulated load during their peak hours, which happen to be between 6-9pm every day. They process a very respectable number of real-money transactions every second, a number that a lot of ecommerce sites would be very happy with. (Source: a presentation at ScaleSummit in London this year)

Feel free to @message me if you want to talk more about this.

One could learn a lot from observing the methods and practices applied to legacy systems.

Legacy systems tend to ossify and not change much. How is that a good lesson?

I think OP means more along the lines of the fact that legacy systems were very hard to change, so changes went through a much more rigorous review process.

The problem with that theory is that because current systems are so easy to change, the cost of failure is much lower, so the upfront cost to avoid failure no longer has as good an ROI.

Define legacy system.

I have a tendency to overlook a valuable lesson, perhaps the most valuable lesson, when I redefine post mortem in this way.
