
Agreed, especially regarding the culture, but isn't this pretty much the same explanation they gave a few years ago when something similar happened?

I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.

I could be misremembering, but it's important that these lessons be applied across the whole company (or at least across AWS), so it would be a bigger mark against AWS if this resulted from tooling similar to what caused a previous outage.

Pretty sure that one was a Microsoft Azure outage.

(Source: am a self-identified post-mortems connoisseur. :)

Not a bad plan. If you don't make enough mistakes on your own, ya gotta learn from the mistakes of others as a preventative.

Do you by chance keep a public log of your postmortem collection :)?

I don't, but danluu does! https://github.com/danluu/post-mortems

Yeah, an EC2 engineer switched over traffic to a backup network connection that had significantly less bandwidth, triggering cascading failures.
