
A collection of postmortems - luu
https://github.com/danluu/post-mortems
======
dewey
I think it'd be easier to browse if it were organized in a directory structure
like:

    
    
        Microsoft/
            Outage 1
            Outage 2
    

In case anyone feels like contributing:

[https://hn.algolia.com/?query=postmortem&sort=byPopularity&p...](https://hn.algolia.com/?query=postmortem&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

[https://hn.algolia.com/?query=post-mortem&sort=byPopularity&...](https://hn.algolia.com/?query=post-mortem&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

~~~
avz
I'd rather have them arranged by root cause category (e.g. bad service config,
bad network config, units mismatch, inconsistent binary version, software bug,
etc) than by which company it affected.

~~~
Sir_Cmpwn
How about both?

    
    
        Organization/
            Microsoft/
                Outage 1
                Outage 2
        Category/
            Network/
                Microsoft 1 -> ../../Organization/Microsoft/Outage 1
    

symlinks are awesome!
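
A minimal sketch of how that layout could be built, assuming Python and a throwaway temp directory. The function name `build_layout` and the placeholder file contents are made up for illustration and are not part of the repo. The link target is relative, matching the `../../Organization/...` form above. Creating symlinks on Windows may need extra privileges.

```python
import os
import tempfile
from pathlib import Path


def build_layout(root: Path) -> Path:
    """Create Organization/ and Category/ trees and link the second to the first."""
    # The canonical copy of the postmortem lives under Organization/.
    outage = root / "Organization" / "Microsoft" / "Outage 1"
    outage.parent.mkdir(parents=True, exist_ok=True)
    outage.write_text("placeholder postmortem text\n")  # hypothetical content

    # The category view holds only a symlink back to the canonical copy.
    link = root / "Category" / "Network" / "Microsoft 1"
    link.parent.mkdir(parents=True, exist_ok=True)

    # Use a relative target so the tree keeps working after it is moved or cloned.
    target = os.path.relpath(outage, start=link.parent)
    link.symlink_to(target)
    return link


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    link = build_layout(root)
    link_target = os.readlink(link)
    same_file = link.resolve() == (root / "Organization" / "Microsoft" / "Outage 1").resolve()
    link_text = link.read_text()
```

Reading through `Category/Network/Microsoft 1` returns the same content as the file under `Organization/`, so each postmortem only has to be stored once.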

------
shoo
Similarly, see also the RISKS-FORUM digest, with thirty years of archives [1]:

> Its intent is to address issues involving risks to the public in the use of
> computers. As such, it is necessarily concerned with whether/how critical
> requirements for human safety, reliability, fault tolerance, security,
> privacy, integrity, and guaranteed service (among others) can be met (in
> some cases all at the same time), and how the attempted fulfillment or
> ignorance of those requirements may imply risks to the public. We will
> presumably explore both deficiencies in existing systems and techniques for
> developing better computer systems -- as well as the implications of using
> computer systems in highly critical environments.

[1] [http://catless.ncl.ac.uk/Risks/](http://catless.ncl.ac.uk/Risks/)

------
leothekim
Relevant: Foursquare's post-mortem with mongodb.
[https://groups.google.com/forum/m/#!topic/mongodb-user/UoqU8...](https://groups.google.com/forum/m/#!topic/mongodb-user/UoqU8ofp134)

~~~
technion
I'm hoping I'm reading this wrong. Because what I'm reading is "A database
grew to 67GB on a server with only 66GB RAM". And apparently in that situation
you can expect performance to tank until a service is unusable?

~~~
mjevans
You did not read that wrong.

The performance tanked because the working set now required a disk hit.

Each and every query required a disk hit.

That many IOs/sec, irrespective of the size, could be enough.
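
A back-of-envelope sketch of why disk hits hurt so much, in Python. The latency figures (100 ns for RAM, 10 ms for a spinning disk) are rough assumptions, not numbers from the Foursquare report, and `effective_latency` is a made-up helper name. The 1/67 miss fraction assumes uniform access over 67GB with 66GB cached, which is also only an illustration.

```python
RAM_LATENCY_S = 100e-9   # assumed: ~100 ns per in-memory access
DISK_LATENCY_S = 10e-3   # assumed: ~10 ms per random disk read


def effective_latency(miss_fraction: float) -> float:
    """Average access time when miss_fraction of accesses go to disk."""
    hit_fraction = 1.0 - miss_fraction
    return hit_fraction * RAM_LATENCY_S + miss_fraction * DISK_LATENCY_S


all_in_ram = effective_latency(0.0)
small_miss = effective_latency(1 / 67)   # hypothetical uniform-access case
every_query_hits_disk = effective_latency(1.0)

# How many times slower than the all-in-RAM case each scenario is.
slowdown_small_miss = small_miss / all_in_ram
slowdown_all_disk = every_query_hits_disk / all_in_ram
```

Under these assumed numbers, even the small miss fraction comes out more than a thousand times slower on average than the all-in-RAM case, and the every-query-hits-disk case is around a hundred thousand times slower.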

------
josemagana
I like how the healthcare.gov one is the only one without a description

------
luxpir
Really interesting repo and kind of a relief that mishaps happen to everyone,
more or less. We can try to mitigate as best as possible, but these are just
the high profile cases. Imagine the millions of internal post mortems carried
out daily by all manner of successful companies.

My outage anxiety is reducing, even if contingencies are in place.

------
ekianjo
It's not really what people usually call post-mortems though. A lot of those
stories are just issues regarding products or services and how they were
uncovered/fixed. Post-mortems are not limited to those kinds of stories at all.

~~~
kabouseng
And post mortems also aren't only limited to startups that failed :D

------
avitzurel
When we have a horrible bug or an outage, I always share a post-mortem email
company wide, not just engineering.

I think post-mortems are a huge opportunity to learn about resiliency and
common system and engineer errors. I know I grew as an engineer with each of
those.

Thanks for sharing!

------
yuvipanda
[https://wikitech.wikimedia.org/wiki/Incident_documentation](https://wikitech.wikimedia.org/wiki/Incident_documentation)
has a fairly complete list of Wikimedia's Post Mortems.

------
0x400614
I like how HealthCare.gov doesn't have a description

------
Vecrios
I found it hilarious that they didn't even bother writing a description for
Healthcare.gov.

~~~
Too
You are the third post to remark on this without adding any further value to
the discussion, in a thread of just 11 comments. Surely you could have read
through them all before posting this?

