
Google Cloud outage post-mortem - antoncohen
https://status.cloud.google.com/incident/cloud-networking/19009
======
antoncohen
Some interesting things for multi-region availability planning. For example,
Pub/Sub is global, but it seems that if a Publish happens in a region that is
having issues, a Subscribe from a region without issues might fail, even if
the Publish succeeds.

On the other hand, Cloud Storage multi-region buckets seem to really work. The
'us' location experienced a 3.5% error rate, higher than the non-affected
single regions (~1%), but still much lower than the affected regions
(43% - 96%).
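
To put those figures side by side, here is a rough sketch of the ratios. The
percentages are the ones quoted above; the variable names are made up for
illustration, and the "~1%" baseline is approximate, so treat the results as
ballpark numbers only.

```python
# Error rates quoted in the comment above, in percent.
multi_region_us = 3.5                      # Cloud Storage 'us' multi-region
unaffected_single = 1.0                    # "~1%", an approximate baseline
affected_low, affected_high = 43.0, 96.0   # range for the affected regions

# How much worse 'us' was than an unaffected single region.
vs_unaffected = multi_region_us / unaffected_single

# How much worse the affected regions were than 'us' (low and high end).
vs_affected_low = affected_low / multi_region_us
vs_affected_high = affected_high / multi_region_us

print(f"'us' vs unaffected single regions: {vs_unaffected:.1f}x the error rate")
print(
    f"affected regions vs 'us': {vs_affected_low:.1f}x "
    f"to {vs_affected_high:.1f}x the error rate"
)
```

So the multi-region bucket was a few times worse than an untouched region,
but roughly an order of magnitude better than the regions that were hit.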

It is unclear how Cloud Spanner fared when the instance had only two regions
and one of them was affected.

------
kozziollek
> ... to allow an adequate window for recovery with no user impact.

Which made me think: it would be cool if companies were doing "pre-mortems".
Like: "today we had a configuration problem, but because of our defense in
depth, it slowed our systems by 7.8%". Or maybe Googlers already have
something like that internally?

