Google Cloud outage post-mortem (cloud.google.com)
8 points by antoncohen on June 6, 2019 | 2 comments


Some interesting things for multi-region availability planning. For example, Pub/Sub is global, but it seems that if a Publish happens in a region that is having issues, a Subscribe from a region without issues might fail, even if the Publish succeeds.

On the other hand, Cloud Storage multi-region buckets seem to really work. The 'us' location experienced a 3.5% error rate, higher than the unaffected single regions (~1%), but still much lower than the affected regions (43% - 96%).

It is unclear how Cloud Spanner fared when the instance had only two regions and one of them was affected.


> ... to allow an adequate window for recovery with no user impact.

Which made me think: it would be cool if companies did "pre-mortems". Something like: "today we had a configuration problem, but because of our defense in depth, it slowed our systems by 7.8%". Or maybe Googlers already have something like that internally?



