
Google Cloud networking issues in us-east1 - decohen
https://status.cloud.google.com/incident/cloud-networking/19016
======
boulos
Disclosure: I work on Google Cloud (but I'm not in SRE, oncall, etc.).

As the updates to [1] say, we're working to resolve a networking issue. The
Region isn't (and wasn't) "down", but obviously network latency spiking up for
external connectivity is bad.

We are currently experiencing an issue with a subset of the fiber paths that
supply the region. We're working on getting that restored. In the meantime,
we've removed almost all Google.com traffic out of the Region to prefer GCP
customers. That's why the latency increase is subsiding, as we're freeing up
the fiber paths by shedding our traffic.

Edit: (since it came up) that also means that if you’re using GCLB and have
other healthy Regions, it will rebalance to avoid this congestion/slowdown
automatically. That seemed the better trade-off given the reduced network
capacity during this outage.

[1] [https://status.cloud.google.com/incident/cloud-
networking/19...](https://status.cloud.google.com/incident/cloud-
networking/19016)
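The rebalancing behavior described above can be sketched roughly as follows. This is a hypothetical illustration only: the region names, the health map, and the all-or-nothing weighting are made up, and the real GCLB is far more dynamic than this.

```python
def serving_weights(regions, healthy):
    """Split traffic evenly across healthy regions; drained regions get 0."""
    candidates = [r for r in regions if healthy.get(r, False)]
    if not candidates:
        # Nothing healthy: serve everywhere rather than serve nothing.
        candidates = list(regions)
    share = 1.0 / len(candidates)
    return {r: (share if r in candidates else 0.0) for r in regions}

# us-east1 drained: its traffic rebalances to the remaining healthy regions.
weights = serving_weights(
    ["us-east1", "us-central1", "europe-west1"],
    {"us-east1": False, "us-central1": True, "europe-west1": True},
)
```

The point is just that a client behind the balancer never has to do anything: the weight on the congested region drops to zero and the remaining regions pick up its share.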

~~~
ricardobeat
Tangential question: does Google allow employees, not directly tasked with it,
to represent the company online as they wish? Most companies I know of have a
strict ‘do not speak for the company’ policy.

~~~
boulos
As kyrra says below, you're in the clear if you state that this is just your
opinion. Naturally, prefacing something terrible as "just your opinion"
doesn't make it fine.

In my case, Cloud PR knows me, but I also knowingly risk my job (I clearly
believe I have good enough judgment in what I post). If Urs and Ben think I
should be fired, I'm okay with that, as it would represent a significant
enough difference in opinion, that I wouldn't want to continue working here
anyway.

Finally, for what it's worth, I have been reported before for "leaking
internal secrets" here on HN! It turned out to be a totally hilarious
discussion with the person tasked with questioning me. Still not fired, gotta
try harder :).

~~~
bardworx
That’s...that’s some petty fucking shit. I didn’t go through your comments but
considering your email is in your profile, someone really had to have a hard-
on to report you for leaks.

I would love to understand the thought process of someone going out of their
way to take away someone’s livelihood because of a comment on HN (in the
normal circumstance of adding additional information or correcting a
misconception — I’m clearly not saying that bonehead comments shouldn’t have
consequences.)

~~~
Thorrez
You're assuming that the person making the report said "boulos needs to be
fired!".

Maybe the person making the report said "Hey, I found some internal details on
this external site. I'm not sure if this is allowed. Maybe someone who knows
more should take a look at it, here's the link to the page."

~~~
bardworx
Their email is in their profile. I would think it is sensible to reach out to
them directly or speak with your manager to get a second opinion.

Submitting a complaint to an internal review because “you’re not sure it’s
allowed” is really petty.

In my opinion, and experience, folks who have good intentions usually pull you
to the side to get a feel for a situation before filing a formal complaint.

------
harshreality
Hacker News: The real status page and help desk for the internet.

Do companies realize how absurd this is?

ETA: It seems someone at Google had a change of heart, and most of what boulos
posted in this thread has been added as updates to the official google status
page. Better late than never, I guess, especially if this is the start of a
trend in outage reporting.

~~~
notatoad
seriously, they've got a text field on the official status page, why not put
the text boulos posted here in that instead of the meaningless text they've
got there?

~~~
boulos
Can you expand on why you find it “meaningless”? As my other comment says, I’m
not in SRE and the real people fixing it are trying their best to remediate
the problem. I agree that the text I posted (with blessing from SRE!) gives
you some more detail, but you can’t do anything differently with it, right?
What about the new text do you prefer? (We’re happy to improve!)

~~~
toufka
Your description, even a brief one, is interpretable by your clients and some
customers - and is actually really informative. It helps estimate the
magnitude of the issue, and the types of downstream problems to expect or
avoid.

Knowing an asteroid took out the entire continent tells you something about
repairability and the resources required to fix the problem, and generally
provides context for later updates, as opposed to other causes like a cut
fiber line, a burning datacenter, or a bad power supply.

------
fastest963
Here's the original issue: [https://status.cloud.google.com/incident/cloud-
networking/19...](https://status.cloud.google.com/incident/cloud-
networking/19015)

Not sure why they closed that one at 9:12 just to open a new one at 10:25. We
didn't see any traffic coming to us-east1 during that time period so I would
assume the original issue is still the root cause.

~~~
boulos
Yeah, that happens sometimes based on which team notices, thinks it _might_ be
different, and then opens a new incident.

Sorry for the confusion, and yes, the fiber link issue is the root cause.
Draining the Google.com traffic presumably resolved the issue for you, though
you may still be seeing elevated latency as the updates suggest.

~~~
fastest963
Since we use GCP Global LBs, I presume that "draining the Google.com traffic"
also meant diverting all global LB traffic, which is what we see. The second
incident (the OP's link) indicates that, but at first it was very confusing as
a customer when the first issue was marked as resolved while we still saw no
traffic being sent to us-east1 via our global LBs. If that makes sense.

~~~
boulos
This part was somewhat nuanced, so I wasn’t sure whether to post it: yes, if
you are using GCLB, and have more than 1 healthy Region, we will also
rebalance to avoid us-east1 for now (though not as statically as that sounds,
mumble mumble).

Edit: added this to the top level comment so more folks see it.

~~~
edwintorok
There were reports of 404 from Google Cloud Run earlier today (I can confirm
that I got both a 404 and a successful load after retrying that website):
[https://news.ycombinator.com/item?id=20336102](https://news.ycombinator.com/item?id=20336102)
Was it related? It seems a bit odd to get a 404 instead of a 50x.

~~~
boulos
Sorry, I hadn't seen your post earlier. No, the Cloud Run (intermittent) 404s
were unrelated.

------
mehrdadn
Does anybody else feel like there have been a lot of outages in recent months?
And I don't mean Google -- I mean lots of others too (I seem to recall
CloudFlare, Facebook, etc.)... are they really increasing or are we just
hearing more about them? Seems a bit odd.

~~~
m0zg
That's more or less inevitable. As complexity increases (which it does
naturally, if there's no effort to decrease it) at some point it begins to
outstrip the limits of human understanding.

I've been saying this repeatedly (and downvoted for it repeatedly): if you
want truly reliable systems, use simple, boring technology, and don't fuck
with it after it's set up, and run it yourself. 99.99% of all these outages
are due to screwing up something that already works, something that if it was
in your own rack you could just leave alone and not touch at all.

~~~
dodobirdlord
> 99.99% of all these outages are due to screwing up something that already
> works

Fiber optic cables are a great technology, but they don't react well to being
cut in half by a backhoe. Is the solution you are recommending that we stop
using fiber optic cables, or that we stop using backhoes?

~~~
m0zg
Depending less on remote datacenters, where it's unnecessary, would be a good
start.

~~~
ithkuil
I'm a remote employee of a distributed company. Where do you suggest we deploy
our code/services?

------
Thaxll
Looks like an external issue. "The Cloud Networking service (Standard Tier)
has lost multiple independent fiber links within us-east1 zone. Vendor has
been notified and are currently investigating the issue."

~~~
imroot
It's not independent fiber links if they use the same tube to get into the
building...just ask any backhoe operator.

~~~
fredthomsen
My brother-in-law's construction company actually did just that. The ground
wasn't properly marked and the fiber got cut, multiple links.

~~~
zamadatix
It's not uncommon to see 500 strands in one tube get cut by a backhoe. So much
so that it's even jargon at this point: [http://www.catb.org/jargon/html/F/fiber-
seeking-backhoe.html](http://www.catb.org/jargon/html/F/fiber-seeking-
backhoe.html)

~~~
mitchs
500? Those are rookie numbers.

------
thsowers
Why so many problems at Google lately? Calendar down two weeks ago[0], and
Google Cloud had a larger outage a month ago[1]

[0]:
[https://news.ycombinator.com/item?id=20213092](https://news.ycombinator.com/item?id=20213092)

[1]:
[https://news.ycombinator.com/item?id=20077421](https://news.ycombinator.com/item?id=20077421)

~~~
tscanausa
Terrance here from Google Cloud Support.

There are only 3 things I can say about this situation. 1) These issues are
currently unrelated. 2) We learn a lot from these situations. 3) A lot of
these types of issues can be mitigated by running in more than 1 region.

I really can't promise that today's situations will never happen again. There
are a lot of moving pieces in our system and sometimes there are things
outside of Google's control.

~~~
mathattack
“You should be using more than 1 region” could also be “you should be using
more than one provider”, no?

~~~
BurritoAlPastor
Well, sure, if you hate your devops team and you want to make sure they can’t
use any of the proprietary functionality of either provider. At which point,
if you want to be managing a fleet of vanilla Linux boxes yourself, why use a
cloud provider at all?

~~~
toomuchtodo
* You should not be locking yourself into a cloud provider's proprietary functionality unless you want what happened to Oracle customers (getting raked over the coals) to happen to you.

* DevOps teams can be multi-cloud relatively easily when using infrastructure-as-code tooling (Terraform, Packer, etc.) and traditional DevOps practices.

* Why manage a fleet of vanilla boxes yourself when you can use vanilla boxes with Kubernetes and not get gouged by cloud providers in the first place?

You don't need to jump off the hype train if you never got on in the first
place.

~~~
tomcam
if I voluntarily choose a provider at a price that’s acceptable to me am I
being gouged?

~~~
tjr225
Not yet, but it seems obvious to me that the GP was referring to a situation
where the price changes and then you are getting gouged. That's exactly what
the negative connotations of lock-in refer to.

------
mrmattyboy
To whomever commented something like 'laughs in AWS' (the comment was removed
before I submitted this one)...

please don't...

glass houses and all that... but I also share the same glass house as you... I
don't want bad luck

... and it's only a fluke that this happened to Google in us-east1 and not AWS
in X region, and then you (and I) would be having a hell of a time! :/

~~~
outworlder
Google seems to be more forthcoming with their issues. We have seen incidents
in AWS where the status never got updated, but support confirmed issues.

~~~
deanCommie
Show me a GCP post-mortem that's as detailed and proactive about future
improvement as
[https://status.aws.amazon.com/s3-20080720.html](https://status.aws.amazon.com/s3-20080720.html)

Their last one was laughable in its lack of self-awareness.

~~~
joshuamorton
[https://status.cloud.google.com/incident/cloud-
networking/19...](https://status.cloud.google.com/incident/cloud-
networking/19009)

Can you explain what's better about the AWS one? They both do, approximately,
the same thing: provide a few paragraphs of background, approximately one
paragraph describing the actual issue, and a few paragraphs describing
concrete followups. The AWS one has more timestamps.

You aren't confusing this[0] with the postmortem, are you?

[0]: [https://cloud.google.com/blog/topics/inside-google-
cloud/an-...](https://cloud.google.com/blog/topics/inside-google-cloud/an-
update-on-sundays-service-disruption)

------
inlined
Holy crap. It’s an outage in all zones? What’s the point of AZs if you lose
whole DCs at a time?

~~~
klodolph
Availability is hierarchical.

~~~
neonate
Can you explain that more?

~~~
klodolph
There is no service with 100% availability. You put multiple AZs in one
region, but nobody was ever pretending that regional failures were impossible,
just that single-AZ failures are more common than regional failures. If you
want high availability, you want multi-regional. Above that, you want
multi-provider.

The same decisions that make regions fail also make intra-region traffic
cheaper. This is true for all large cloud providers. If you are okay paying
more for internal network traffic, you can go multi-regional. But multi-AZ is
still better than single-AZ. It’s up to you to decide if it’s worth it. For
that you need good SLAs and (IMO) support contracts.
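The hierarchy can be put in rough numbers. These figures are illustrative only, not any provider's real availability, and the model assumes failure domains are independent, which is exactly the assumption that correlated regional failures (like this one) violate:

```python
def combined_availability(per_domain, n):
    """Availability when any one of n independent domains surviving suffices."""
    return 1 - (1 - per_domain) ** n

single_az = 0.999                                # assume ~99.9% per AZ
multi_az = combined_availability(single_az, 3)   # 3 AZs in one region
# On paper, 3 independent 99.9% AZs give ~0.999999999. Correlated regional
# failures are why the real number is lower, and why multi-region exists.
```

The same formula applies one level up: treat each region (or provider) as a domain and the paper availability compounds again, with the same caveat about correlated failures.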

~~~
neonate
Thanks, I understand what you meant now.

------
username444
Cloudflare was returning a 502 this morning, wonder if they're related. Lots
and lots of sites down for about an hour, including all of Shopify.

~~~
boulos
As jgrahamc (Cloudflare CTO) noted below, these aren't related. They had a
push that they rolled back, we lost some fiber links.

~~~
jhgg
Cloudflare took us down this morning, but also shielded us from the impact of
this fiber cut, due to direct peering with google (I’m assuming over different
fiber paths.)

------
estsauver
In a moment that's likely to be very, very frustrating for a large number of
you that have businesses and customers that depend on G cloud, let's try to
remember that somewhere there's an engineer or an SRE having a really hard day
just trying to fix things.

Please, be kind and decent to each other, especially when things are hard.

~~~
danaur
I don't follow comments like these. Should people refrain from criticizing
giant companies because there are people working at them? I don't understand
the purpose of this comment.

~~~
highesttide
Complaining about the communication and response time of a company is
different from yelling in the direction of some stressed engineer that they
are useless and incompetent at everything they do. Sadly you get too much of
the latter around the Internet.

~~~
StreamBright
Who is yelling in the direction of the "stressed engineer"? Does anybody have
a direct channel to those guys, or do you think they rigorously monitor the
comment section of HN for yelling in the middle of an outage?

------
pupdogg
> The disruptions with Google Cloud Networking and Load Balancing have been
> root caused to physical damage to multiple concurrent fiber bundles serving
> network paths in us-east1.

I am assuming some sort of construction zone at or near the facility, and the
backhoe operator dug in and accidentally cut the cables?

------
verdverm
I've been working out of us-east1 all day and haven't noticed

------
partiallypro
It's been down for 4 hours and it's just now being posted on HN? Is it
intermittent?

~~~
boulos
Disclosure: I work on Google Cloud.

There were (and continue to be) connectivity issues due to a subset of the
fiber links having trouble. But that’s different from being “down”; it’s
“just” an outage. We won’t declare the outage over until the impact is
minimal.

------
digitalsanctum
I routinely see notices of outages like this posted on HN while HN itself
never seems to be impacted. This begs the question: Where and how is HN hosted
in a way that avoids being impacted by widespread network and provider
outages?

~~~
hunter2_

      $ host news.ycombinator.com
      news.ycombinator.com has address 209.216.230.240
    

[https://whois.arin.net/rest/net/NET-209-216-230-0-1/pft?s=20...](https://whois.arin.net/rest/net/NET-209-216-230-0-1/pft?s=209.216.230.240)

M5 Computer Security

[https://www.m5hosting.com](https://www.m5hosting.com)

Unrelated: [https://begthequestion.info/](https://begthequestion.info/)

~~~
MisterPea
The begs-the-question site is one of my pet peeves. Language is not moderated
by a select few who want to claim it; this isn't France.

This is why Ebonics is still a valid form of English, as long as it is used
consistently.

If everyone uses "begs the question" and everyone else understands it as
"raises the question" then it is perfectly valid.

~~~
hunter2_
Vernacular creates validity, for sure. The site's author acknowledges this,
but nonetheless maintains that preserving this particular phrase is useful
because there's not really another synonymous and popular phrase that means
this particular fallacy, just the Latin _petitio principii_ and the modern
translation "laying claim to the principle" which is pretty clumsy if you ask
me. Not that "begging the question" is crystal clear either, but at least it's
googleable.

------
pkaye
Looks like all that high-end engineering talent and all those processes still
have their limits.

------
dragonwriter
> The disruptions with Google Cloud Networking and Load Balancing have been
> root caused to physical damage to multiple concurrent fiber bundles

Is this concurrent damage to separated bundles or damage to colocated bundles?

------
saltminer
The title says "almost 4 hours" (was posted at around 3 PM EST), but the
incident was created at 10:25 AM PST, which is 1:25 PM EST. Has it been more
like 2 hours or is there more to this incident?

------
bob33212
Down or just high latency? For some folks that is the same thing.

~~~
crankylinuxuser
Is 4-hour latency still "latency"?

You make a good point though. Downtime seems to be awfully overloaded.

~~~
geogram
In our tests the latency is surprisingly low (20-40ms), but the error rate is
10-30%.

------
hnaccy
What's the actual number of 9s for the major cloud services these days?

My impression from their PR doesn't seem to match the number of outages and
issues lately.

~~~
Johnny555
AWS EC2 promises 4 9's (4.3 minutes of downtime/month) before their SLA kicks
in, but they only give a 10% discount until availability dips below 99% (7.5
hours of downtime/month) when they give a 30% discount. If availability is
below 95% (36 hours) in a month, they give a full refund.

For an individual instance, they only promise 90% availability.
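Those tiers translate to allowed downtime roughly as follows. This uses a flat 30-day month; the figures quoted above appear to mix month lengths slightly, and the actual AWS SLA terms may differ from this sketch:

```python
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def downtime_minutes(availability):
    """Minutes of downtime per month implied by an availability target."""
    return MONTH_MINUTES * (1 - availability)

four_nines = downtime_minutes(0.9999)  # ~4.3 min before the SLA kicks in
two_nines = downtime_minutes(0.99)     # ~7.2 h; end of the 10% credit band
ninety_five = downtime_minutes(0.95)   # 36 h; full refund below this
```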

~~~
user5994461
Availability of what? I've noticed entire afternoons where it wasn't possible
to provision instances of some types, back when I was working with AWS daily.

~~~
sudosteph
Availability of network access to existing instances.

What you're talking about with provisioning capacity is a totally different
matter. Provisioning availability is not guaranteed (unless you purchase
reserved instances) and there are frequently periods where certain instance
types are not available in certain AZs, though they do try to resolve that as
fast as practicality allows them to. It really stinks sometimes though -
especially if you get into a situation where something fails in your
autoscaling group and there is no capacity available for a replacement
instance. Usually you can get around that, though, by making sure your ASG is
set up for multiple AZs, or worst case by changing instance types (though that
can be problematic in its own way).

source: I used to work for AWS Support.

~~~
sofaofthedamned
Yeah, you can now provision multiple instance types in an ASG which mitigates
this somewhat.

I think people sometimes forget that the cloud isn't magic, and a sudden burst
of requests for new instances needs somebody to actually rack up some servers.

------
wwwpppddd
App Engine and Cloud functions were apparently returning error rates of > 30
percent overall between 11 a.m. and 3 p.m., with some projects experiencing a
100 percent error rate. GCS was also experiencing issues for the first half,
which was attributed to the networking issues. Google said the networking
issues were resolved initially but then stated they were investigating the GAE
issues. Those issues were resolved, and the networking issue has been reopened
as of 2:35 eastern: [https://status.cloud.google.com/incident/cloud-
networking/19...](https://status.cloud.google.com/incident/cloud-
networking/19016).

GAE and all other services still show green here, of course:
[https://status.cloud.google.com/](https://status.cloud.google.com/)

------
noncoml
Bad config push again?

~~~
gaogao
Running a betting pool on cloud service outage root causes would be fairly
fun.

I'm going to guess load balancer cascading failures.

~~~
notriddle
Nope. Physical destruction of fiber-optic cables is to blame, according to the
GC status page. [https://status.cloud.google.com/incident/cloud-
networking/19...](https://status.cloud.google.com/incident/cloud-
networking/19016)

------
z3t4
When choosing a big cloud provider, people forget that it's many orders of
magnitude more complicated to run something at Google scale than to maintain
_one_ single server. For example, the whole Stack Overflow website runs on one
or two servers. World of Warcraft also used to run on a single (blade) server.
Chances are one server will be good enough for most use cases. And if you
don't want it in your closet, there are plenty of dedicated hosting and
colocation providers.

~~~
avocado4
How can Stack Overflow run on a single server? Do you mean single cluster?

~~~
davedunkin
As of 2016, Stack Overflow ran on dozens of servers in two data centers.

[https://nickcraver.com/blog/2016/03/29/stack-overflow-the-
ha...](https://nickcraver.com/blog/2016/03/29/stack-overflow-the-
hardware-2016-edition/)

~~~
mehrdadn
Also interesting what their minimum requirements were in 2014:
[https://nickcraver.com/blog/2013/11/22/what-it-takes-to-
run-...](https://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-
overflow/#core-hardware)

------
mountainofdeath
Another day, another Google outage. It feels like it's once a month this year

------
rco8786
2019 has been a really rough year for GCP

------
geogram
Longer than 4 hours. We have Stackdriver set up to monitor uptime/latency and
it's been acting up since 2am PST.

~~~
tlynchpin
ObPedant: notice that Google's status page says "...as of Tuesday, 2019-07-02
09:11 US/Pacific." That notation is useful because it's stable year round. I
don't recommend 'PDT'; instead say, colloquially, 'out here on the left
coast', or, specifically, US/Pacific.

~~~
geogram
Thanks. Good point. Regardless, GCloud has been having issues for nearly 12
hours (timezone agnostic).

------
garyb2
Notice they did not get around to posting the next status update on time.

------
awinter-py
Do they not have extra hands on staff to dedupe the messages? What's with the
identical messages at 14:31, :44, and :48? This happened last time too.

------
lgats
Pretty sure I've read before that us-east1 is one of the older Google data
centers presumably with older equipment

~~~
boulos
Disclosure: I work on Google Cloud.

I think you’re thinking of AWS’s us-east-1 in Virginia. I don’t recall when
us-east1 for us was constructed, but this wasn’t any sort of “old equipment”
issue. Even there, while your experience may vary, AWS certainly has both old
and new equipment.

------
spullara
Wow. GCP is always a networking issue. Their QA on networking changes needs
work. Maybe they should spend 20% on it.

------
dx87
Kind of related to this, but these types of outages are why I moved from
Google Play to Spotify for streaming music. Their infrastructure seems so
large that things that should be a standalone service, like streaming music,
are bound to be collateral damage when they mess something up on another
service. Having everything provided by one company is convenient until it all
goes down at the same time and you can't access your email, videos, or music
because they all run on the same infrastructure.

~~~
tomschlick
Spotify is hosted on google cloud: [https://www.wired.com/2016/02/spotify-
moves-itself-onto-goog...](https://www.wired.com/2016/02/spotify-moves-itself-
onto-googles-cloud-lucky-for-google/)

~~~
crusader76
I think the point OP was trying to make was relating to google services and
their dependencies on each other.

~~~
lern_too_spel
I have the opposite conclusion as OP. Google doesn't use Google Cloud for
anything critical, so I wouldn't use Google Cloud for anything critical or
services that run on Google Cloud for that matter.

------
codingslave
Google engineers ran into a coding problem that wasnt on leetcode

~~~
dang
Please don't post unsubstantive comments here.

