
Google Cloud Partial Outage - sz4kerto
https://status.cloud.google.com/incident/zall/20003
======
pm90
From a strictly “reliability” perspective it’s not a good look for Google.
Their reputation for developing high uptime systems has been endangered by the
recent series of outages and service interruptions. When I signed on to GCP, I
assumed that the same people who wrote the SRE handbook were in charge here...
how could this stuff happen?

However, as an engineer... I can empathize with their situation. GCP has less
public cloud experience than AWS. They’re somewhat transparent about outages
(so much better than being gaslit by AWS). Shit happens and at least the same
problem doesn’t happen twice, which is what I really care about.

However, my non-technical manager won’t think this way and neither will other
companies considering Google Cloud. GCP is in a tough spot from a marketing
perspective if they get a reputation for being unreliable due to these
incidents.

~~~
izacus
Is GCP actually less reliable than AWS or Azure, or is that only the HN effect
where everything bad about Google gets voted to the top?

I'm hearing a lot of talk from my contacts about AWS just simply denying and
hiding any outages until they're undeniably proven - so perhaps the issue is
just the fact that GCP is too honest?

~~~
buttersbrian
Curious about this as well. Google hate here can be palpable.

~~~
zepto
Wind back 5 years or more and this place was a Google love fest.

Somehow it seems that Google has managed to disappoint a lot of people.

~~~
izacus
> Somehow it seems that Google has managed to disappoint a lot of people.

Yeah, and those people aren't wrong - it's not like Google didn't do their own
share of messups.

Still, playing up Google's faults while downplaying other companies' faults
(even when they're the same!) has not been a good look for this community
lately.

------
tlackemann
I like how I found out about this here and not through an email. Tons of
customer reports from our site of being logged out randomly (we use Firebase)
and we had no idea why.

~~~
tlackemann
The outage has been going on for almost 12 hours. Quick math says that's
99.86% availability (measured over a full year).

On top of this, they want to start charging $70/mo for GKE because they "[...]
are also introducing a Service Level Agreement (SLA) that's financially backed
with a guaranteed availability of 99.95%" (pulled from pricing[0])

This is unacceptable.

[0] [https://cloud.google.com/kubernetes-engine/pricing](https://cloud.google.com/kubernetes-engine/pricing)
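For anyone who wants to check the quick math: a rough sketch, assuming the 99.86% figure spreads the roughly 12 hours of downtime over a 365-day year. The comment doesn't state the window, so the yearly and 30-day windows here are assumptions, and the function names are made up for illustration.

```python
def availability_percent(downtime_hours: float, window_hours: float) -> float:
    """Share of the window (in percent) during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / window_hours)


def allowed_downtime_minutes(sla_percent: float, window_hours: float) -> float:
    """Downtime budget (in minutes) that a given SLA leaves within the window."""
    return (1.0 - sla_percent / 100.0) * window_hours * 60.0


HOURS_PER_YEAR = 365 * 24   # assumed window: one 365-day year
HOURS_PER_MONTH = 30 * 24   # assumed window: one 30-day month

yearly = availability_percent(12, HOURS_PER_YEAR)
monthly = availability_percent(12, HOURS_PER_MONTH)
budget = allowed_downtime_minutes(99.95, HOURS_PER_MONTH)

print(f"12h down over a year:      {yearly:.2f}%")       # 99.86%
print(f"12h down over a month:     {monthly:.2f}%")      # 98.33%
print(f"99.95% budget, 30-day month: {budget:.1f} min")  # 21.6 min
```

Under these assumptions, a 12-hour outage is well outside what a 99.95% target leaves in a single 30-day month, though how an SLA actually counts downtime depends on its own terms.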

~~~
BilalBudhani
I'm guessing because we are going through a pandemic they might be taking more
time than usual to apply fixes.

~~~
pm90
They’ve been pretty vocal about the fact that this has nothing to do with the
pandemic.

~~~
quicklime
The outage itself might not have been caused by anything related to the
pandemic, but the resolution is likely being done by engineers adjusting to
working from home.

------
FigmentEngine
The bad part here is that it's impacting multiple regions. It's really hard for
customers to build high-availability solutions when isolation boundaries can
be breached this way. (Disclosure: ex-AWS.)

------
pritambarhate
[https://status.cloud.google.com/incident/compute](https://status.cloud.google.com/incident/compute)

It just looks horrific from the reliability standpoint. Maybe Google should
consider adding columns for affected regions! Maybe they are more honest in
admitting failures than other providers, but that page paints a very bad
picture.

~~~
breakingcups
I'm glad they do. Compared to AWS this transparency is a breath of fresh air.

And I'm no Google fan either.

------
breakingcups
Already looking forward to that post-mortem. It might be morbid, but they are
often my favorite things to read.

------
angristan
This is a disaster...

> We are currently investigating an issue affecting Dataflow, BigQuery,
> DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions,
> Cloud Monitoring, Cloud MemoryStore, Cloud Spanner, Cloud Storage, Cloud
> Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute
> Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud
> AI, Firebase Machine Learning, Data Catalog and Cloud Console.

~~~
lonelappde
When you see that, it's almost certainly a low-level networking outage
(networking hardware failure or a very low-level networking config change that
isn't possible to canary with the vendor's current technology, possibly
combined with something like insufficient failover resources due to other
issues).

------
Jyaif
Apparently the outage was caused by a third-party router bug:
[https://twitter.com/uhoelzle/status/1243398255083311105](https://twitter.com/uhoelzle/status/1243398255083311105)

------
brianwawok
What region is this? I have not seen a blip in central.

------
jiveturkey
resolved as of 2020-03-27 06:54 US/Pacific

~~~
a012
What UTC time is it?
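A minimal sketch of the conversion, assuming US/Pacific was on daylight time (PDT, UTC-7) on that date, since US daylight saving time began on 2020-03-08. The fixed offset is hard-coded for illustration; the standard-library `zoneinfo.ZoneInfo("America/Los_Angeles")` would pick the offset automatically where time zone data is installed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT (UTC-7) was in effect for US/Pacific on 2020-03-27.
PDT = timezone(timedelta(hours=-7), name="PDT")

resolved_local = datetime(2020, 3, 27, 6, 54, tzinfo=PDT)
resolved_utc = resolved_local.astimezone(timezone.utc)

print(resolved_utc.strftime("%Y-%m-%d %H:%M UTC"))  # 2020-03-27 13:54 UTC
```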

------
amelius
Even Google needs humans to keep its services running.

~~~
naringas
I'll bet it was also humans who caused this outage in the first place.

------
F117-DK
Oh my! GCP is on fire?

------
rainboiboi
I guess the number of comments here reflects how many people are using Google
Cloud - apparently not many?

~~~
dijit
There's nothing really to say about it.

I run about 8,000 CPU cores in GCP and what am I going to say?

I like the transparency. In my experience it's more reliable than AWS,
although AWS does _not_ report problems until they're absolutely undeniable.

Other than that, shit happens, this is upsetting but eh.

------
higooglebye
Google Cloud was the least reliable major cloud provider in the last 2 years.
Maybe they are too busy killing their other services, so they don't have time
for less important things like cloud.

~~~
Kye
Is this based on independent research, or officially-reported downtime? There
are enough anecdotes in threads here about cloud reliability to suggest AWS is
bad at reporting outages.

