"Lessons learned from reading post-mortems" (http://danluu.com/postmortem-lessons/) is a good place to dig deeper.

The first graph it quotes, from a survey paper, is a classic that fits the GCE outage well:

Initial error --92%--> Incorrect handling of errors explicitly signaled in software

Engineering a Safer World (https://mitpress.mit.edu/books/engineering-safer-world) is also an excellent resource that more people who care about post-mortems should read.

(As background: the author, MIT Prof. Nancy Leveson, summarizes decades of work in the field, offers groundbreaking new theoretical tools that scale up to some of the world's most complex accidents, and has the experience and evidence to back up their relevance, e.g. via work on Therac-25, the Space Shuttle Columbia, and Deepwater Horizon, to name just a few.)

