Google Cloud outage post-mortem

antoncohen · on June 6, 2019

Some interesting takeaways for multi-region availability planning. For example, Pub/Sub is global, but it seems that if a Publish happens in a region that is having issues, a Subscribe from a region without issues might fail, even if the Publish succeeds.

On the other hand, Cloud Storage multi-region buckets seem to really work. The 'us' location experienced a 3.5% error rate, higher than the unaffected single regions (~1%), but still much lower than the affected regions (43% to 96%).
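
To put those error rates side by side, here is a quick back-of-the-envelope sketch. The percentages are the ones quoted above; the helper name `times_higher` is made up for illustration, and the "~1%" figure is approximate, so treat the ratios as rough.

```python
# Error rates quoted in the comment above, in percent.
# The "~1%" figure for unaffected single regions is approximate.
MULTI_REGION_US = 3.5
UNAFFECTED_SINGLE = 1.0
AFFECTED_LOW, AFFECTED_HIGH = 43.0, 96.0


def times_higher(rate: float, baseline: float) -> float:
    """Return how many times larger `rate` is than `baseline` (hypothetical helper)."""
    return rate / baseline


if __name__ == "__main__":
    us_vs_unaffected = times_higher(MULTI_REGION_US, UNAFFECTED_SINGLE)
    low = times_higher(AFFECTED_LOW, MULTI_REGION_US)
    high = times_higher(AFFECTED_HIGH, MULTI_REGION_US)
    print(f"'us' multi-region vs unaffected single regions: {us_vs_unaffected:.1f}x")
    print(f"affected regions vs 'us' multi-region: {low:.1f}x to {high:.1f}x")
```

So the multi-region bucket saw roughly 3.5 times the error rate of an unaffected region, while the affected regions were roughly 12 to 27 times worse than the multi-region bucket.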

It is unclear how Cloud Spanner fared when the instance had only two regions and one of them was affected.

kozziollek · on June 6, 2019

> ... to allow an adequate window for recovery with no user impact.

Which made me think: it would be cool if companies were doing "pre-mortems". Like: "today we had a configuration problem, but because of our defense in depth, it slowed our systems by 7.8%". Or maybe Googlers already have something like that internally?