A collection of postmortems (github.com)
152 points by luu on Aug 8, 2015 | 16 comments



I think it'd be easier to browse if it were organized in a directory structure like:

    Microsoft/
        Outage 1
        Outage 2
In case anyone feels like contributing:

https://hn.algolia.com/?query=postmortem&sort=byPopularity&p...

https://hn.algolia.com/?query=post-mortem&sort=byPopularity&...


I'd rather have them arranged by root-cause category (e.g. bad service config, bad network config, units mismatch, inconsistent binary versions, software bug, etc.) than by which company was affected.


How about both?

    Organization/
        Microsoft/
            Outage 1
            Outage 2
    Category/
        Network/
            Microsoft 1 -> ../../Organization/Microsoft/Outage 1
symlinks are awesome!
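
A minimal sketch of wiring that up, assuming the hypothetical Outage 1 layout above; the Category/ entries are plain relative symlinks back into Organization/:

    # Sketch only: the directory and outage names are the hypothetical
    # examples from above, not the repo's actual layout.
    import os

    os.makedirs("Organization/Microsoft", exist_ok=True)
    os.makedirs("Category/Network", exist_ok=True)

    # The canonical copy lives under Organization/.
    open("Organization/Microsoft/Outage 1", "w").close()

    # The Category/ view is just a relative symlink back to it.
    os.symlink("../../Organization/Microsoft/Outage 1",
               "Category/Network/Microsoft 1")

Git tracks symlinks as first-class objects, so the second view costs almost nothing to keep in the repo.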


See also the RISKS-FORUM digest, with thirty years of archives [1]:

> Its intent is to address issues involving risks to the public in the use of computers. As such, it is necessarily concerned with whether/how critical requirements for human safety, reliability, fault tolerance, security, privacy, integrity, and guaranteed service (among others) can be met (in some cases all at the same time), and how the attempted fulfillment or ignorance of those requirements may imply risks to the public. We will presumably explore both deficiencies in existing systems and techniques for developing better computer systems -- as well as the implications of using computer systems in highly critical environments.

[1] http://catless.ncl.ac.uk/Risks/


Relevant: Foursquare's post-mortem with MongoDB. https://groups.google.com/forum/m/#!topic/mongodb-user/UoqU8...


I'm hoping I'm reading this wrong, because what I'm reading is "a database grew to 67GB on a server with only 66GB of RAM", and apparently in that situation you can expect performance to tank until the service is unusable?


You did not read that wrong.

Performance tanked because the working set no longer fit in RAM: each and every query now required a disk hit.

That many IOs/sec, irrespective of request size, can be enough to make a service unusable.
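
Rough arithmetic makes the point; the ~100 IOPS figure below is an assumed ballpark for a single 7200 rpm disk, not a number from the post-mortem:

    # Back-of-envelope sketch: all numbers here are assumptions.
    DISK_IOPS = 100         # random reads/sec one spinning disk can serve
    QUERIES_PER_SEC = 5000  # assumed incoming query load
    SEEKS_PER_QUERY = 1     # once the working set spills out of RAM

    disks_needed = QUERIES_PER_SEC * SEEKS_PER_QUERY / DISK_IOPS
    print(disks_needed)  # -> 50.0 disks' worth of random seeks

Served from RAM the same load is trivial, which is why everything falls off a cliff the moment the working set no longer fits.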


I like how the healthcare.gov one is the only one without a description


Really interesting repo, and kind of a relief that mishaps happen to everyone, more or less. We can try to mitigate as best we can, but these are just the high-profile cases. Imagine the millions of internal post-mortems carried out daily by all manner of successful companies.

My outage anxiety is easing, even with contingencies in place.


It's not really what people usually call post-mortems, though. A lot of these stories are just issues with products or services and how they were uncovered and fixed. Post-mortems are not limited to that kind of story at all.


And post-mortems also aren't limited to startups that failed :D


When we have a horrible bug or an outage, I always share a post-mortem email company-wide, not just with engineering.

I think post-mortems are a huge opportunity to learn about resiliency and about common system and engineering errors; I know I grew as an engineer with each one.

Thanks for sharing!


https://wikitech.wikimedia.org/wiki/Incident_documentation has a fairly complete list of Wikimedia's Post Mortems.


I like how HealthCare.gov doesn't have a description


I found it hilarious that they didn't even bother writing a description for Healthcare.gov.


Yours is the third comment to remark on this without adding any further value to the discussion. In a thread of just 11 comments, surely you could have read through them all before posting?



