
Lessons Learned from Reading Post Mortems - ingve
http://danluu.com/postmortem-lessons/
======
timf
If you would like to read postmortems, I maintain a list of them here (over
250 so far):
[https://pinboard.in/u:peakscale/t:postmortem/](https://pinboard.in/u:peakscale/t:postmortem/)

~~~
cheeseprocedure
This is great - thank you!

In a similar vein is the Google+ Postmortems community:
[https://plus.google.com/u/0/communities/11513614020301839179...](https://plus.google.com/u/0/communities/115136140203018391796?cfem=1)

------
tboyd47
> Configuration bugs, not code bugs, are the most common cause I’ve seen of
> really bad outages, and nothing else even seems close.

The funny thing about this conclusion is that configuration and code are not
dramatically different concepts when you think about it. One of them is "data"
and the other "code", but both affect the global behavior of the system.
Config variables are often played up as being simpler to manage, but from an
engineering standpoint they're actually more complicated, since there is always
code required to support said configuration.

The process is what's dramatically different. "Write a story with acceptance
criteria, get it estimated by engineers, get it prioritized by management,
wait two weeks for the sprint to be over, wait for QA acceptance, deploy in
the middle of the night," vs. "Just change this field located right here in
the YAML file..."
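
To make that point concrete, here's a minimal, hypothetical Go sketch (the
setting name and limits are made up): even one "simple" config field implies
supporting code that has to load it, validate it, and fail safely, and that
code can break in exactly the ways application code does.

    // Hypothetical example: a single "max_connections" setting still needs
    // code to load it, validate it, and refuse to start when it's wrong.
    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    type Config struct {
        MaxConnections int `json:"max_connections"`
    }

    func loadConfig(path string) (Config, error) {
        cfg := Config{MaxConnections: 100} // default if the field is absent

        data, err := os.ReadFile(path)
        if err != nil {
            return cfg, fmt.Errorf("reading config: %w", err)
        }
        if err := json.Unmarshal(data, &cfg); err != nil {
            return cfg, fmt.Errorf("parsing config: %w", err)
        }
        // A typo like 0 or -1 here is exactly the kind of "just change this
        // field" mistake that takes systems down, so validate before use.
        if cfg.MaxConnections <= 0 || cfg.MaxConnections > 10000 {
            return cfg, fmt.Errorf("max_connections out of range: %d", cfg.MaxConnections)
        }
        return cfg, nil
    }

    func main() {
        cfg, err := loadConfig("app.json")
        if err != nil {
            fmt.Fprintln(os.Stderr, "refusing to start:", err)
            os.Exit(1)
        }
        fmt.Println("starting with max connections =", cfg.MaxConnections)
    }

(JSON is used only to keep the sketch dependency-free; the same applies to a
field in a YAML file.)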

~~~
michaelt
Also, I can't speak for all companies, but where I work configuration is how
we define the differences between our test and production environments.

If your config files are intentionally different, because in test you should
use authentication server testauth.example.com and in production you should
use auth.example.com, then how can you avoid violating test-what-you-fly-and-
fly-what-you-test?

Obviously, you could add an extra layer of abstraction (make the DNS config
different between test and production and both environments could use
auth.example.com) but that's just moving the configuration problem somewhere
else :)
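
One common mitigation (a hypothetical sketch, not something from the comment
above): shrink the environment-specific surface to a single data value looked
up at startup, so the code path that uses it is identical in test and
production and only the value itself goes untested.

    // Hypothetical sketch: the only per-environment difference is one value.
    package main

    import (
        "fmt"
        "os"
    )

    // authHost is the single knob that differs between environments;
    // every line that uses it is the same code in test and production.
    func authHost() string {
        if h := os.Getenv("AUTH_HOST"); h != "" {
            return h // e.g. testauth.example.com in the test environment
        }
        return "auth.example.com" // production default
    }

    func main() {
        fmt.Println("authenticating against", authHost())
    }

This doesn't remove the problem; it only narrows the untested part down to a
single string.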

~~~
suryaj
This is how we do it:

a) Have a regression test suite running continuously, with alerts that pop up
when tests fail. Keep a minimal set of config values in the regression suite
so config regressions also trigger alerts.

b) Set up monitoring for your components and trigger alerts based on
thresholds.

c) With (a) and (b) in place, roll out your bits to a canary environment and,
if all looks good, trigger a rolling deployment to your prod environment (a
rough sketch of that gate follows).
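
For step (c), a hypothetical sketch of the kind of gate that sits between the
canary and the rolling deployment (the metric shape and threshold are made up
for illustration): promote only if the canary's error rate stays under a
threshold for the whole observation window.

    // Hypothetical canary gate: promote only when every sample in the
    // observation window stays under the error-rate threshold.
    package main

    import "fmt"

    type Sample struct {
        Requests int
        Errors   int
    }

    // healthy reports whether every sample is below maxErrorRate.
    func healthy(window []Sample, maxErrorRate float64) bool {
        for _, s := range window {
            if s.Requests == 0 {
                continue
            }
            if float64(s.Errors)/float64(s.Requests) > maxErrorRate {
                return false
            }
        }
        return true
    }

    func main() {
        window := []Sample{{Requests: 1000, Errors: 3}, {Requests: 1200, Errors: 2}}
        if healthy(window, 0.01) {
            fmt.Println("canary healthy: trigger rolling deployment")
        } else {
            fmt.Println("canary unhealthy: halt and alert")
        }
    }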

------
bbali
"The lack of proper monitor is never the sole cause of a problem, but it’s
often a serious contributing factor."

I am continuously amazed that downtime issues go undetected until a) customer
notifies you b) things go downhill and alarms are blazing. Our central
principle is that monitoring and alerting has to be part of your deployment.
The way we apply at my work place is that every design doc has a monitoring
section which has to be filled out.

------
erikb
Well, the thing about human mistakes is that they are easy to blame on some
part of the hardware or software. If I configure an incorrect data element and
the system dies, I won't write down "I configured something wrong so the
system died"; I will file a bug that says "software didn't catch that kind of
misconfiguration". It's also not reasonable to admit too many mistakes
publicly. People who haven't thought about all the mistakes they themselves
make will start to think that you are not able to perform well in your job.

~~~
protonfish
Technically all outages are "human mistakes". Humans build hardware, write
software, configure and maintain systems, and manage other humans. Which is
why explaining an outage as "human error" is not constructive.

There are known methods to create systems that are resistant to human error:
automation, checklists, testing, etc. Humans will make mistakes; that is a
certainty. The solution to an outage is to employ these techniques, not to
tell your team members not to make mistakes.

~~~
quadrangle
A lightning strike or earthquake is not a human mistake.

~~~
grogers
Human mistakes probably led to the lightning strike not being grounded
properly or the earthquake causing structural damage.

------
ionforce
As a recent convert to functional programming, I'd like to point out that more
functional styles of error handling help one address errors properly rather
than sweep them under the rug.

This ties into the article's first point about how poor error handling is a
common source of bugs.

~~~
dkarapetyan
How exactly does functional programming help here?

~~~
acconsta
Sum types allow a function to return either a result or an error. Not
accounting for both possibilities is a compile time type error.

Go mimics this behavior, so it's not only a functional thing:
[http://blog.golang.org/error-handling-and-go](http://blog.golang.org/error-handling-and-go)
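
For illustration, a minimal sketch of the multi-value return style the linked
post describes (the lookup function here is hypothetical): the error is part
of the return value, so the caller sees it right next to the result, though
unlike a true sum type Go won't force handling it at compile time.

    // Minimal sketch of Go's value-plus-error return style.
    package main

    import (
        "errors"
        "fmt"
    )

    var ErrNotFound = errors.New("user not found")

    func lookupUser(id int) (string, error) {
        users := map[int]string{1: "alice", 2: "bob"}
        name, ok := users[id]
        if !ok {
            // The error is a first-class return value, not an exception
            // thrown somewhere far from the call site.
            return "", ErrNotFound
        }
        return name, nil
    }

    func main() {
        name, err := lookupUser(3)
        if err != nil {
            fmt.Println("handled:", err)
            return
        }
        fmt.Println("found:", name)
    }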

------
hassy
Daily load tests against both staging _and_ production (yes, really) can help
catch a lot of the kinds of issues described in the article. You still need a
solid monitoring & alerting setup, though.

~~~
suryaj
Curious, how do you perform load tests in production for an e-commerce site?

~~~
hassy
Well, the gist of it is that it's not that hard technically, but it can be
more difficult to implement organisationally. Production load-testing needs to
be agreed with various teams/people across the organisation (easier if you're
a small startup): think ops, marketing, analytics, etc.

The basic thing is to make sure you can separate real and synthetic requests
(with a special header for example). This will allow you to mock/no-op certain
operations like attempting to charge the user's card or reducing the quantity
of stock you have. It'll also allow you to remove synthetic requests from your
analytics data, so that marketing does not get excited by the sudden influx of
new users. If you have user accounts on your system, make all fake users
register with @somenonexistentdomain.com so you can filter for that too etc.

Obviously start slow and ramp up over time as you iron out issues.
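
A minimal sketch of the request-separation idea, assuming a made-up header
name and endpoint: tag synthetic traffic, exercise the full request path, but
no-op the side effects you don't want a load test to trigger.

    // Hypothetical sketch: separate synthetic load-test traffic from real
    // traffic with a header, and skip the dangerous side effects for it.
    package main

    import (
        "fmt"
        "net/http"
    )

    // isSynthetic reports whether the request came from the load generator.
    // The header name is made up; use whatever your tooling sends.
    func isSynthetic(r *http.Request) bool {
        return r.Header.Get("X-Synthetic-Test") == "1"
    }

    func checkoutHandler(w http.ResponseWriter, r *http.Request) {
        if isSynthetic(r) {
            // Exercise the full request path, but skip charging the card,
            // decrementing stock, and recording the order in analytics.
            fmt.Fprintln(w, "synthetic order accepted (no side effects)")
            return
        }
        // ... real payment, stock, and analytics calls go here ...
        fmt.Fprintln(w, "order placed")
    }

    func main() {
        http.HandleFunc("/checkout", checkoutHandler)
        http.ListenAndServe(":8080", nil)
    }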

JustEat.co.uk run daily load-tests in production at +20-25% of their peak
traffic. As in: extra 20% simulated load during their peak hours, which happen
to be between 6-9pm every day. They process a very respectable number of real-
money transactions every second, a number that a lot of ecommerce sites would
be very happy with. (Source: a presentation at ScaleSummit in London this
year)

Feel free to @message me if you want to talk more about this.

------
sengork
One could learn a lot from observing the methods and practices applied to
legacy systems.

~~~
dkarapetyan
Legacy systems tend to ossify and not change much. How is that a good lesson?

~~~
jedberg
I think the OP means that because legacy systems were very hard to change,
changes went through a much more rigorous review process.

The problem with that theory is that because current systems are so easy to
change, the cost of failure is much lower, so the upfront cost to avoid
failure no longer has as good an ROI.

------
0xdeadbeefbabe
I have a tendency to overlook a valuable lesson, perhaps the most valuable
lesson, when I redefine post mortem in this way.

