In a similar vein is the Google+ Postmortems community: https://plus.google.com/u/0/communities/11513614020301839179...
It would be fantastic to take this set and visualise the causes. It would be really interesting to see whether the causes differ much between large corporations and smaller startups. I suspect, as in the article, that configuration, error handling and human error remain the most common causes regardless of whether you have vast quantities of process or money for tooling.
The funny thing about this conclusion is that configuration and code are not dramatically different concepts when you think about it. One of them is "data" and the other "code", but both affect the global behavior of the system. Config variables are often played up as being simpler to manage, but they're actually more complicated from an engineering standpoint, since there is always code required to support said configuration.
The process is what's dramatically different. "Write a story with acceptance criteria, get it estimated by engineers, get it prioritized by management, wait two weeks for the sprint to be over, wait for QA acceptance, deploy in the middle of the night," vs. "Just change this field located right here in the YAML file..."
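To make the "config is still code" point concrete, here's a rough Go sketch (the max_retries field, the gopkg.in/yaml.v3 dependency and the retry example are my own invention, not from the article): even one innocent-looking field needs a struct, parsing, validation and a default behind it, and all of that is code that can break in production like any other code.

    package main

    import (
        "fmt"
        "log"

        "gopkg.in/yaml.v3" // assumed to be available; any config parser makes the same point
    )

    // Config mirrors the YAML file; the field is hypothetical.
    type Config struct {
        MaxRetries int `yaml:"max_retries"`
    }

    func loadConfig(raw []byte) (Config, error) {
        var c Config
        if err := yaml.Unmarshal(raw, &c); err != nil {
            return Config{}, fmt.Errorf("parsing config: %w", err)
        }
        // Validation and defaulting: the "simple" field still needs code to support it.
        if c.MaxRetries < 0 {
            return Config{}, fmt.Errorf("max_retries must be >= 0, got %d", c.MaxRetries)
        }
        if c.MaxRetries == 0 {
            c.MaxRetries = 3 // default
        }
        return c, nil
    }

    func main() {
        cfg, err := loadConfig([]byte("max_retries: 5\n"))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("retries:", cfg.MaxRetries)
    }

And the "just change this field" path skips every safeguard that the code path gets for free.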
If your config files are intentionally different, because in test you should use authentication server testauth.example.com and in production you should use auth.example.com, then how can you avoid violating test-what-you-fly-and-fly-what-you-test?
Obviously, you could add an extra layer of abstraction (make the DNS config different between test and production, so that both environments can use auth.example.com), but that's just moving the configuration problem somewhere else :)
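One compromise I've seen (a sketch with made-up names, not a recommendation): keep a single code path everywhere and inject the per-environment value at deploy time, so the only thing that differs between test and production is data, never code. It doesn't make the problem disappear, but it shrinks the untested surface to one value.

    package main

    import (
        "fmt"
        "os"
    )

    // authServer returns the auth endpoint for this environment.
    // The code path is identical everywhere; only the injected value differs,
    // e.g. AUTH_HOST=testauth.example.com in test.
    func authServer() string {
        if host := os.Getenv("AUTH_HOST"); host != "" {
            return host
        }
        return "auth.example.com" // production default
    }

    func main() {
        fmt.Println("authenticating against", authServer())
    }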
I am continuously amazed at how often downtime goes undetected until a) a customer notifies you, or b) things go downhill and alarms are blazing. Our central principle is that monitoring and alerting have to be part of your deployment. The way we apply this at my workplace is that every design doc has a monitoring section which has to be filled out.
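As a rough illustration of "monitoring ships with the service" (the counter names and handler are invented): even the standard library's expvar package is enough to export counters from the very first deploy, so an alert can be wired to the error rate before a customer ever has to tell you about it.

    package main

    import (
        "expvar"
        "log"
        "net/http"
    )

    // Counters exist from the first deploy; expvar exposes them at /debug/vars.
    var (
        requestsTotal = expvar.NewInt("requests_total")
        errorsTotal   = expvar.NewInt("errors_total")
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Add(1)
        if err := doWork(); err != nil {
            errorsTotal.Add(1) // whatever does the alerting watches this counter
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        w.Write([]byte("ok"))
    }

    func doWork() error { return nil } // stand-in for the real work

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }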
There are known methods to create systems that are resistant to human error: automation, checklists, testing, etc. Humans will make mistakes; that is a certainty. The solution to an outage is to employ these techniques, not to tell your team members not to make mistakes.
This ties into the article's first point about how poor error handling is a common source of bugs.
Go mimics this behavior, so it's not only a functional-language thing:
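Something like this (my own minimal example, not from the article; the file name is invented): the error comes back as an ordinary value, and you either handle it or very visibly discard it with _.

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // The error is just a second return value; ignoring it is a
        // deliberate, visible act rather than the default behavior.
        data, err := os.ReadFile("config.yaml")
        if err != nil {
            fmt.Fprintln(os.Stderr, "could not read config:", err)
            os.Exit(1)
        }
        fmt.Printf("read %d bytes\n", len(data))
    }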
The basic thing is to make sure you can separate real and synthetic requests (with a special header, for example). This will allow you to mock or no-op certain operations, like attempting to charge the user's card or reducing the quantity of stock you have. It'll also allow you to exclude synthetic requests from your analytics data, so that marketing does not get excited by a sudden influx of new users. If you have user accounts on your system, make all fake users register with an @somenonexistentdomain.com address so you can filter on that too, etc.
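In practice that separation can be one middleware check (a sketch; the X-Synthetic-Request header and handler names are made up):

    package main

    import (
        "log"
        "net/http"
    )

    // isSynthetic reports whether a request came from the load-test harness.
    // Any header your real clients will never send works here.
    func isSynthetic(r *http.Request) bool {
        return r.Header.Get("X-Synthetic-Request") == "1"
    }

    func checkoutHandler(w http.ResponseWriter, r *http.Request) {
        if isSynthetic(r) {
            // No-op the side effects: don't charge the card, don't touch stock,
            // and tag the event so analytics can filter it out.
            log.Println("synthetic checkout: skipping charge and stock update")
            w.Write([]byte("ok (synthetic)"))
            return
        }
        // ... the real charge and stock decrement go here ...
        w.Write([]byte("ok"))
    }

    func main() {
        http.HandleFunc("/checkout", checkoutHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }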
Obviously start slow and ramp up over time as you iron out issues.
JustEat.co.uk run daily load-tests in production at 20-25% on top of their peak traffic. As in: an extra 20% of simulated load during their peak hours, which happen to be between 6-9pm every day. They process a very respectable number of real-money transactions every second, a number that a lot of ecommerce sites would be very happy with. (Source: a presentation at ScaleSummit in London this year)
Feel free to @message me if you want to talk more about this.
The problem with that theory is that because current systems are so easy to change, the cost of failure is much lower, so the upfront cost to avoid failure no longer has as good an ROI.