
Google Cloud Networking Incident - mau
https://status.cloud.google.com/incident/cloud-networking/17002
======
Danack
We've been having 'fun' with ongoing issues for a site since 6pm UTC
yesterday, which got dramatically worse this morning and have been
recurring during the day.

Having multi-hour outages makes me really want to go back to hiring a
couple of physical servers in a rack somewhere.

~~~
slackingoff2017
The "cloud" is a hilarious failure at the main marketing point. Decentralized
just means you have no idea where your servers are.

Scaling is the only real selling point of cloud, and far more people think they
need it than actually do.

~~~
joshribakoff
I have no idea where my power or water come from, or where the cell towers are
located for my phone, or the satellites they communicate with. I don't
want/need to know. Most cloud providers tell you which region of which country
you're in; most people don't need to know much more (and if you do, then don't
use cloud).

~~~
slackingoff2017
The difference is you're not concerned about where those come from. It doesn't
matter whose water it is or where the power is coming from. And they're both
simple commodities with few metrics.

We're trying to fit something that's generally very centralized into the same
model. Where the servers are does matter. What OS they run and how reliable
they are matters on an individual level. The environment your server runs on
is quite important and it's definitely your server, not "any server will do"
by a long shot.

If the cloud were just some source of CPU instructions we would have a
government-regulated source of CPU power for everyone. But depending on what
you're running, RAM size, cache size, network latency, CPU architecture, drive
type, and endless other variables come into play, and they are all important.

Depending on hardware that you can't control to deliver metrics you definitely
need to control is going to make the system less reliable, and that's what
we're seeing now with cloud computing.

~~~
redwood
I lived in a country where everyone has a power generator in their building.
Let's just say the quality of life was significantly lower. This cloud shift
is like an unstoppable tidal wave. I'm always surprised when I hear people
make your argument. Are you willing to imagine that in a couple of years things
may change your perspective?

~~~
slackingoff2017
The key is that most services don't need to be reliable. I think the cloud has
huge promise here. Engineers tend to think their app needs five-nines
reliability when we live in a world where the banks close twice a week.

I don't think on-prem will ever die out. It's like owning vs renting your
office. There are pros and cons to each and we'll eventually hit some kind of
equilibrium.

------
tostaki
For anyone wanting a quick workaround, try removing all the backends from your
load balancer, then adding them again. No idea why, but it did work for us.

~~~
ABS
I can confirm several people have had luck doing this (not us though :-( )

~~~
ABS
According to Google's support (....) the workaround works but will likely fix
it only temporarily.

They are currently rolling out a change that should completely fix the issue,
and somewhere else I read they are rolling back a config change they made
recently.

------
gtaylor
Just to verify my understanding:

* This is a multi-region (us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1) outage (yeesh)

* Incident began at 2017-08-29 18:35 PDT. At the time of this comment, we're over 13 hours in (??)

Correct?

~~~
wwayer
It's now past 12:00 US/Pacific and the incident is still ongoing. I'm thinking
that they don't have as many customers as they'd like us to believe. There's
been very little discussion and no outrage. If AWS had an incident lasting
this long, there would be a lot more noise.

~~~
user5994461
If AWS had an incident this long, there would be no noise and no public post
about it. And the customer support would keep repeating "have you tried buying
bigger instances?"

------
0xbear
"A distributed system is one in which the failure of a computer you didn't
even know existed can render your own computer unusable." — Leslie Lamport

I think Leslie needs to update this for Cloud, however.

------
wwayer
Our load is not being "balanced", as not all backends are being utilized
because of this.

~~~
bdimcheff
This is our experience as well. On one service with 3 backends, we lost all
connections to one at about 05:30 UTC, then the second at about 10:10. The third
backend has been able to handle all of our traffic so far, but we're also
seeing intermittent connection resets or failed connections.

------
ABS
We've been suffering from this since 3:07am UK time and it's sad to see Google
still hasn't learnt how to do support/communicate with their paying customers
after all the flak they always take about this.

The number of mistakes (silly and otherwise) keeps increasing, and I'm not
talking about the technical problem itself! OK, end of rant.

~~~
mdekkers
_Google still hasn't learnt how to do support/communicate with their paying
customers_

You are confusing "don't know how" with "absolutely not incentivised to spend
money on this".

~~~
ABS
nope I'm not :-)

Writing poorly on the status dashboard, messing/mixing up timestamps, changing
them retroactively hoping people don't notice, forgetting and/or mistakenly
swapping the names of affected regions, consistently writing "next update at x
o'clock" and then invariably publishing the update several to 10+ minutes
after x o'clock: these are all mistakes made by whoever is already paid to
update that page and communicate with customers.

------
manigandham
Stuff goes wrong everywhere. The only problem is thinking that any cloud
vendor or service is 100% reliable.

------
pbarnes_1
Does this affect the global load balancer too?

We haven't seen any issues, so presumably not? Or is it just a subset?

------
notyourday
Whenever the sun starts shining, the clouds disappear.

------
ninjakeyboard
Yikes! That's a huge failure.

