
Google cloud outage - thomassharoon
https://status.cloud.google.com/incident/cloud-networking/20004
======
qmarchi
Heyo Googler here.

The problem was a mix between another cloud provider and GCP.

Dare I say, there should be little customer impact as of 13:37 PST.....

The status dashboard is going to be your best source of information.

~~~
svacko
Is the other cloud provider AWS? I could see tons of connection timeouts
between GCP & S3/Elasticsearch service.

Hope everything is resolved now for good.

~~~
judge2020
Seems to be AWS; connections to Gmail's SMTP relay also started timing out.

------
nammi
We were seeing timeouts in east-1. I don't know what "normal" looks like, but
Pingdom's map seems to show the whole east coast as affected
[https://livemap.pingdom.com/](https://livemap.pingdom.com/)

------
svacko
yeah, our GKE pods running in us-east1 were dying ~90 minutes ago like crazy...
hope they are gonna resolve this soon. not the luckiest day for Google, nor us

------
x__x
I was bummed out when Siteground moved all their cloud accounts over to Google,
without telling their customers beforehand

------
kgraves
This is extremely concerning as somebody looking to move or build on top of
GCP for the long term. I wonder why anyone would choose GCP if outages are
occurring on a regular basis.

~~~
pgodzin
Any evidence they happen more frequently than on the other clouds?

------
tagux
"We had a router failure in Atlanta".

WHAT? You kidding us?

Urs Hölzle, senior vice president of technical infrastructure at Google Cloud,
said, "We're very sorry about that! We had a router failure in Atlanta, which
affected traffic routed through that region. Things should be back to normal
now. Just to make sure: This wasn't related to traffic levels or any kind of
overload, our network is not stressed by COVID-19."

~~~
ocdtrekkie
Was it like... a hardware failure? If you serve more than 100 people you
probably should have redundant routers. Was it a configuration issue that
replicated over to multiple devices at least, I hope?

~~~
toast0
Have you worked with redundant routers? They certainly reduce the number of
outages, but sometimes the hardware (or software) fails in exciting ways that
don't engage the redundancy, or don't engage it properly, and you still
get an outage (or you get an outage that wouldn't have happened). Or
sometimes, one circuit is out of service for repair or upgrade, and the other
circuit is connected to the router that failed. And routing for the AS that
travels on that circuit was set not to fall back to transit because the last
time that happened, it caused major issues.

I have no specific knowledge of today's events, but this sort of thing
happens. You can get the number of incidents down pretty low, but not to zero.

~~~
ocdtrekkie
I have. I am just highlighting that the problem surely should be more complex
than described. Or that their redundancy for these events was not adequately
devised.

~~~
toast0
Google often releases a pretty solid post-mortem, which will give the detail
of the event. The level of detail appropriate for same-day release is really
'router failure' or 'power failure' or 'software failure' or 'vehicle drove
into the building failure'. Expecting more than 'we know what it was, and we
fixed it' or 'we don't know what it was, but it stopped happening' or 'yes,
we're working on it' on a same-day Twitter post is unreasonable.

