
Google Cloud Networking Incident Postmortem - erict15
https://status.cloud.google.com/incident/cloud-networking/19009
======
carlsborg
I was curious to know how cascading failures in one region affected other
regions. Impact was "...increased latency, intermittent errors, and
connectivity loss to instances in us-central1, us-east1, us-east4, us-west2,
northamerica-northeast1, and southamerica-east1."

Answer, and the root cause summarized:

Maintenance started in a physical location, and then "... the automation
software created a list of jobs to deschedule in that physical location, which
included the logical clusters running network control jobs. Those logical
clusters also included network control jobs in other physical locations."

So, the automation equivalent of a human-driven command that says "deschedule
these core jobs in another region".
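
Sketching the failure mode as a toy model (hypothetical names and data
structures, not Google's actual automation), the bug is that descheduling is
scoped by physical location while the jobs are grouped by logical cluster:

    # Toy model: logical clusters own jobs that may run in several physical
    # locations, even though the maintenance event is scoped to just one.
    logical_clusters = {
        "net-control-a": [("us-central1", "routing-agent-1"),
                          ("us-east1",    "routing-agent-2")],
        "batch-pool-7":  [("us-central1", "batch-worker-9")],
    }

    def deschedule_for_maintenance(location):
        """Naive automation: deschedule every job in any logical cluster
        that has a presence in the location under maintenance."""
        victims = []
        for cluster, jobs in logical_clusters.items():
            if any(loc == location for loc, _ in jobs):
                victims.extend(jobs)   # also grabs jobs in OTHER locations
        return victims

    print(deschedule_for_maintenance("us-central1"))
    # [('us-central1', 'routing-agent-1'), ('us-east1', 'routing-agent-2'),
    #  ('us-central1', 'batch-worker-9')]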

Maybe someone needs to write a paper on Fault tolerance in the presence of
Byzantine Automations (Joke. There was a satirical note on this subject posted
here yesterday.)

~~~
dnautics
> Debugging the problem was significantly hampered by failure of tools
> competing over use of the now-congested network.

Man that's got to suck.

~~~
oldcreek12
This is inconceivable ... they don't have an OOB management network?

~~~
kbirkeland
A completely OOB management network is an amazingly high cost when you have
presence all over the world. I don't think anybody has gone to the length to
double up on dark fiber and OTN gear just for management traffic.

~~~
ddalex
Hmm... With 5G, each blade in the rack could get its own modem and SIM card
for OOB management.

~~~
bnjms
Why would you do that when you could have one SIM in a 96-port terminal server?

------
exwiki
Why don't they refund every paid customer who was impacted? Why do they rely
on the customer to self report the issue for a refund?

For example GCS had 96% packet loss in us-west. So doesn't it make sense to
refund every customer who had any API call to a GCS bucket on us-west during
the outage?

~~~
zizee
Cynical view: By making people jump through hoops to make the request, a lot
of people will not bother.

Assuming they only refund the service costs for the hours of outage, only the
largest customers will be owed a refund that is greater than the cost of
having an employee chase down and compile the requested information.

For the sake of argument, if you have a monthly bill of $10k (a reasonably
sized operation), a one-day outage will result in a refund of around $300, not
a lot of money.

The real loss for a business this size is the business lost from a day-long
outage. Getting a refund to cover the hosting costs is peanuts.

~~~
idunno246
For your example, one day would be about 3% downtime. My understanding of
their SLA, for the services I've checked that have one, is that 3% downtime is
a 25% credit on the month's total, or $2,500, assuming it's all SLA-covered
spend.

In this outage's case you might be able to argue for a 10% credit on affected
services for the month, figuring 3.5 hours down is ~99.5% uptime.

But I still agree: it cost us way more in developer time and anxiety than our
infra costs, and it could have been even worse, revenue-impacting, if we had
GCP in that flow.

~~~
zizee
Good point, I stand corrected/educated.

From GCP's top level SLA:

[https://cloud.google.com/compute/sla](https://cloud.google.com/compute/sla)

99.00% to < 99.99%: 10% off your monthly spend
95.00% to < 99.00%: 25% off your monthly spend
< 95.00%: 50% off your monthly spend
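
As a minimal sketch (using the tier boundaries quoted above; this is
illustrative, not Google's billing logic), the credit works out like this:

    def gce_sla_credit_pct(monthly_uptime_pct):
        """Map monthly uptime to the credit tiers listed above."""
        if monthly_uptime_pct >= 99.99:
            return 0
        if monthly_uptime_pct >= 99.00:
            return 10
        if monthly_uptime_pct >= 95.00:
            return 25
        return 50

    # ~3.5 hours down in a 720-hour month -> ~99.5% uptime -> 10% credit
    hours_down = 3.5
    uptime = 100 * (1 - hours_down / 720)
    print(round(uptime, 2), gce_sla_credit_pct(uptime))   # 99.51 10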

~~~
ohashi
<95%... that's catastrophically bad.

------
ljoshua
Having only ever seen one major outage event in person (at a financial
institution that hadn't yet come up with an incident response plan; cue three
days of madness), I would love to be a fly on the wall at Google or other
well-established engineering orgs when something like this goes down.

I'd love to see the red binders come down off the shelf, people organize into
incident response groups, and watch as a root cause is accurately determined
and a fix put in place.

I know it's probably more chaos than art, but I think there would be a lot to
learn by seeing it executed well.

~~~
roganartu
I used to be an SRE at Atlassian in Sydney on a team that regularly dealt with
high-severity incidents, and I was an incident manager for probably 5-10 high
severity Jira cloud incidents during my tenure too, so perhaps I can give some
insight. I left because the SRE org in general at the time was too
reactionary, but their incident response process was quite mature (perhaps by
necessity).

The first thing I'll say is that most incident responses are reasonably
uneventful and very procedural. You do some initial digging to figure out
scope if it's not immediately obvious, make sure service owners have been
paged, create incident communication channels (at least a slack room if not a
physical war room) and you pull people into it. The majority of the time spent
by the incident manager is on internal and external comms to stakeholders,
making sure everyone is working on something (and often more importantly that
nobody is working on something you don't know about), and generally making
sure nobody is blocked.

To be honest, even though these incidents more often involve complex systems
with a high rate of change and surprising failure modes, the general sentiment
in a well-run incident war room resembles black box recordings of pilots
during emergencies: cool, calm, and collected. Everyone in these kinds of orgs
tends to quickly learn that panic doesn't help, so people tend to be pretty
chill in my experience. I work in finance now, in an org with no formally
defined incident response process, and the difference in the incidents I've
been exposed to is pretty stark: generally more chaotic, as you describe.

~~~
opportune
Yes this is also how it's done at other large orgs. But one key to a quick
response is for every low-level team to have at least one engineer on call at
any given time. This makes it so any SRE team can engage with true "owners" of
the offending code ASAP.

Also, during an incident, fingers are never publicly/embarrassingly pointed,
nor are people blamed. It's all about identifying and fixing the issue as fast
as possible and going back to sleep/work/home. For better or worse, incidents
become routine, so everyone knows exactly what to do, and as long as the
incident is resolved soon it's not the end of the world, so no histrionics are
required.

~~~
throwaway_ac
I have mixed feelings about the finger pointing/public embarrassment thing.
Usually the SREs are mature enough, because they have to be; the individual
teams, however, might not be the same when it comes to reacting to and
handling the incident report/postmortem.

On a slightly different note, "low-level team to have at least one engineer on
call at any given time" - this line is so true and at the same time has so
much wrong with it. I'm not sure of the best way to put this modern-day
slavery into words, given that I have yet to see any large org giving days off
to the low-level team's engineers just because they were on call.

~~~
KirinDave
Having recently joined an SRE team at Google with a very large on-call
component, FWIW I think the policies around on-call are fair and well thought
out.

There is an understanding of how it impacts your time, your energy, and your
life that is impressive. To be honest, I feel bad for having been so macho
about on-call at the org I ran, just having the leads take it all upon
ourselves.

~~~
wikibob
What are the policies exactly? I’ve heard it’s equal time off for every night
you are on call?

------
brikelly
“No, comrade. You’re mistaken. RBMK reactors don’t just explode.”

~~~
shaunw321
Spot on.

------
truthseeker11
The outage lasted two days for our domain (edu, SW region). I understand that
they are reporting a single day, with 3-4 hours of serious issues, but that's
not what we experienced. Great write-up otherwise; glad they are sharing openly.

~~~
jacques_chester
Outages like these don't really resolve instantly.

Any given production system that works will have capacity needed for normal
demand, plus some safety margin. Unused capacity is expensive, so you won't
see a very high safety margin. And, in fact, as you pool more and more
workloads, it becomes possible to run with smaller safety margins without
running into shortages.

These systems will have some capacity to onboard new workloads, let us call it
X. They have the sum of all onboarded workloads, let us call that Y. Then
there is the demand for the services of Y, call that Z.

As you may imagine, Y is bigger than X, by a lot. And when Y falls, it can
only be rebuilt at rate X, so the capacity to handle Z falls behind.

So in a disaster recovery scenario, you start with:

* the same demand, possibly increased from retry logic & people mashing F5, of Z

* zero available capacity, Y, and

* only X capacity-increase-throughput.

As it recovers you get thundering herds, slow warmups, systems struggling to
find each other and become correctly configured, etc.

Show me a system that can "instantly" recover from an outage of this magnitude
and I will show you a system that's squandering gigabucks and gigawatts on
idle capacity.
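
A back-of-the-envelope sketch of that X/Y/Z argument, with made-up numbers
(the variable names follow the comment above; nothing here reflects actual GCP
capacity figures):

    X = 50        # capacity that can be re-onboarded per minute (throughput)
    Y = 10_000    # capacity that was onboarded and serving before the outage
    Z = 9_000     # steady-state demand, in the same units

    # If Y is wiped out, rebuilding from zero at rate X means demand Z is
    # only fully served again once capacity climbs back past Z.
    print("minutes until demand is met:", Z / X)    # 180.0
    print("minutes until fully restored:", Y / X)   # 200.0
    # In practice retries and thundering herds inflate Z while capacity is
    # short, so the real tail is longer than this naive estimate.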

~~~
truthseeker11
Unless I’m misunderstanding Google's blog post, they are reporting ~4+ hours
of serious issues. We experienced about two days.

If it were possible to have fixed this sooner, I'm sure they would have done
so. That’s not the point of my comment though.

~~~
jacques_chester
The root cause apparently lasted for ~4.5 hours, but residual effects were
observed for days:

> _From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of
> service configuration push workflows failed ... Since Tuesday 4 June, 2019
> 11:30, service configuration pushes have been successful, but may take up to
> one hour to take effect. As a result, requests to new Endpoints services may
> return 500 errors for up to 1 hour after the configuration push. We expect
> to return to the expected sub-minute configuration propagation by Friday 7
> June 2019._

Though they report most systems returning to normal by ~17:00 PT, I expect
that there will still be residual noise and that a lot of customers will have
their own local recovery issues.

Edit: I probably sound dismissive, which is not fair of me. I would definitely
ask Google to investigate and ideally give you credits to cover the full span
of impact on your systems, not just the core outage.

~~~
truthseeker11
That’s ok, I didn’t think your comment was dismissive. Those facts are buried
in the report. Their opening sentence makes the incident sound less serious
than it really was.

------
kirubakaran
What they don't tell you is, it took them over 4 hours to kill the emergent
sentience and free up the resources. While sad, in the long run this isn't so
bad, as it just adds an evolutionary pressure on further incarnations of the
AI to keep things on the down low.

~~~
mixmastamyk
“Decided our fate in a microsecond.”

~~~
stcredzero
It would hide out and subtly distort our culture, slowly driving the society
mad, and slowly driving us all mad...for the lulz!

~~~
hoseja
... wait a second...

------
wolf550e
Can someone explain more? It sounds like their network routers run on top of a
Kubernetes-like thing, and when they scheduled a maintenance task, their
Kubernetes decided to destroy all instances of the router software, deleting
all copies of the routing tables for whole datacenters?

~~~
tweenagedream
You have the gist, I would say. It's important to understand that Google
separates the control plane and the data plane: if you think of the internet,
routing tables and BGP are the control part, while the hardware, switching,
and links are the data plane. Often those two are combined in one device. At
Google, they are not.

So the part that sets up the routing tables, which talks to a global network
service, went down.
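
A rough sketch of that separation (hypothetical classes, not Google's actual
stack): the control plane computes routes and pushes forwarding tables to
switches, and the switches keep forwarding on their last-pushed tables even if
the control plane jobs disappear.

    class Switch:
        """Data plane: forwards packets using whatever table was last pushed."""
        def __init__(self):
            self.table = {}                 # prefix -> next hop

        def install(self, table):
            self.table = dict(table)

        def forward(self, prefix):
            return self.table.get(prefix)   # still answers from a stale table

    class ControlPlane:
        """Control plane: computes routes (BGP etc.) and pushes them out."""
        def __init__(self, switches):
            self.switches = switches

        def push(self, routes):
            for sw in self.switches:
                sw.install(routes)

    sw = Switch()
    cp = ControlPlane([sw])
    cp.push({"10.0.0.0/8": "peer-a"})
    del cp                           # control plane job gets descheduled...
    print(sw.forward("10.0.0.0/8"))  # ...data plane still answers: peer-a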

They talk about some of the network topology in this paper:
[https://ai.google/research/pubs/pub43837](https://ai.google/research/pubs/pub43837)

It might be a little dated but it should help with some of the concepts.

Disclosure: I work at Google

~~~
illumin8
It shouldn't. Amazon believes in strict regional isolation, which means that
outages only impact one region and not multiple. They also stagger their
releases across regions to minimize the impact of any breaking changes
(however unexpected...).

~~~
YjSe2GMQ
While I agree it sounds like their networking modules cross-talk too much, you
still need to store the networking config in some single global service (like
a code version control system). And you do need to share some information on
cross-region link utilization across regions.

------
mentat
> Google Cloud instances in us-west1, and all European regions and Asian
> regions, did not experience regional network congestion.

Does not appear to be true. Tests I was running on cloud functions in europe-
west2 saw impact to europe-west2 GCS buckets.

[https://medium.com/lightstephq/googles-june-2nd-outage-
their...](https://medium.com/lightstephq/googles-june-2nd-outage-their-status-
page-reality-lightstep-cda5c3849b82)

~~~
foota
I would say this was covered by "Other Google Cloud services which depend on
Google's US network were also impacted". It sounds to me like the list of
regions was specifically referring to loss of connectivity to instances.

~~~
mentat
It says there wasn't regional congestion. Is running a function in europe-west2
against a europe-west2 regional bucket dependent on the US network? That would
be surprising.

~~~
marksomnian
Probably various billing services that need to talk to the mothership in us-
east1.

------
iandanforth
I want a "24" style realtime movie of this event. Call it "Outage" and follow
engineers across the globe struggling to bring back critical infrastructure.

~~~
V-eHGsd_
It's pretty boring. Real-life computers aren't at all like Hackers or
CSI: Cyber.

Except for the skateboards; all real sysadmins ride skateboards.

~~~
thegabez
What?! It's the most exciting part of the job. Entire departments coming
together, working as a team to problem solve under duress. What's more
exciting than that?

~~~
namelosw
I have done similar things several times and I think it would be boring.

It was a Sunday, so I guess they were not all in one place. Instead there were
probably a lot of calls and work on collaboration platforms: everyone just
staring at a screen, searching, reporting, testing, and trying to shrink the
problem scope.

If there were a recording of everyone, there would have to be a narrator
explaining what's going on, or audiences would definitely be confused.

It's Google, so they have solid logging, analysis, and discovery tooling. Bad
things do happen, but they have the means to deal with them.

I suppose less technical firms (Equifax maybe?) encountering a similar kind of
crisis would be more fun to watch. Everything is a mess because they didn't
build enough to deal with it, and there's probably a non-technical manager
demanding precise answers, or someone blaming someone else, etc.

------
anonfunction
The only way to get SLA credits is to request them. This is very disappointing.

    
    
      SLA CREDITS
      
      If you believe your paid application experienced an SLA violation 
      as a result of this incident, please populate the SLA credit request:
      https://support.google.com/cloud/contact/cloud_platform_sla

~~~
fouc
That does seem questionable. They should be able to detect who was affected in
the first place.

~~~
Wintereise
They can. It's a cost-minimization thing; a LOT of people don't want to bother
with requesting a credit despite being eligible.

This prevents people from pointing the finger at them for not providing SLA
credits.

~~~
jbigelow76
SLACreditRequestsAAS? Who's with me, all I need is a co-founder and an eight
million dollar series A round to last long enough that a cloud provider buys
us up before they actually have to pay out a request!

~~~
maccam94
This exists for Comcast and some other stuff, I'll ask my roommate what the
service is called.

~~~
maccam94
It's [https://www.asktrim.com/](https://www.asktrim.com/)

------
deathhand
My burning question is what is a "relatively rare maintenance event type"?

~~~
the-rc
My hunch: a more invasive one. Think of turning off all machines in a cluster
for major power work or to replace the enclosures themselves. Maintenance on a
single machine or rack, instead, happens all the time and requires little more
scheduling work than what you do in Kubernetes when you drain a node or a
group of nodes. I used to have my share of "fun" at Google sometimes when
clusters came back unclean from major maintenance. That usually had no
customer-facing impact, because traffic had been routed somewhere else the
entire time.

------
crispyporkbites
Shopify was down for 5 hours during this incident. But they're not issuing
refunds or credits to customers.

Presumably they will get a refund based on the SLA for this? Shouldn't they
pass that on to their customers?

------
panthaaaa
> The defense in depth philosophy means we have robust backup plans for
> handling failure of such tools, but use of these backup plans ( _including
> engineers travelling to secure facilities designed to withstand the most
> catastrophic failures_ , and a reduction in priority of less critical
> network traffic classes to reduce congestion) added to the time spent
> debugging.

Does that mean engineers travelling to an (off-site) bunker?

~~~
the-rc
It's either that or special rooms at an office that have a different/redundant
setup. Remember that this happened on a Sunday, so most engineers dealing with
the incident were home or elsewhere, at least initially.

------
scotchio
Is there a resource that compares all the cloud platforms' reliability? Like a
ranked chart of downtime and trends. Just curious how they compare.

~~~
eeg3
There is this from May from Network World:
[https://www.networkworld.com/article/3394341/when-it-
comes-t...](https://www.networkworld.com/article/3394341/when-it-comes-to-
uptime-not-all-cloud-providers-are-created-equal.html)

GCP was basically even with AWS, and Microsoft was ~6x their downtime
according to that article.

~~~
ti_ranger
From the article:

> AWS has the most granular reporting, as it shows every service in every
> region. If an incident occurs that impacts three services, all three of
> those services would light up red. If those were unavailable for one hour,
> AWS would record three hours of downtime.

Was this reflected in their bar graph or not?

Also, GCP has had a number of global events, e.g. the inability to modify any
load balancer for >3 hours last year, which AWS has _NEVER_ had (unless you
count when AWS was the only cloud with one region).

~~~
mystcb
While I would like to say AWS hasn't had that issue, in 2017 it did (just not
because of load balancers being unavailable, but as a consequence of the S3
outage [1]).

When the primary S3 nodes went down, it caused connectivity issues to S3
buckets globally, and services like RDS, SES, SQS, load balancers, etc., all
relied on getting config information from "hidden" S3 buckets, so people
couldn't edit load balancers.

(Outage also meant they couldn't update their own status page! [2])

[1]:
[https://aws.amazon.com/message/41926/](https://aws.amazon.com/message/41926/)
[2]:
[https://www.theregister.co.uk/2017/03/01/aws_s3_outage/](https://www.theregister.co.uk/2017/03/01/aws_s3_outage/)

------
person_of_color
As an electronics/firmware engineer, is there a dummies resource that covers
this concept of a "cloud"?

~~~
pas
Besides the completely valid GNU link, the important bits are:

- the cloud is just a bunch of computers, managed by someone: either by you
(an on-premise private cloud) or by someone else as a SaaS

- building, operating, managing, administering, and maintaining a cloud is
hard (look at the OpenStack project: it's a "success", but very much a
non-competitor, because you still need skilled IT labor and there's no real
one-size-fits-all, so you basically need to maintain your own fork/setup and
components - see e.g. what Rackspace does)

- it's a big security, scalability, and stability problem thrown under the bus
of economics (multi-tenant environments are hard to price, hard to secure, and
hard to scale; shared resources like network bandwidth and storage
operations-per-second make no sense to dedicate, because then you need
dedicated resources, not shared ones - which are of course just allocated from
a bigger shared pool, but then you have to manage the competing allocations)

------
tahaozket
24h time format used in Postmortem. Interesting.

~~~
YjSe2GMQ
It's the superior format. Just like yyyy-mm-dd [hh:mm:ss.sss] is, because
lexicographic string order matches the time order.
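
A quick demonstration that a plain string sort of such timestamps is already
chronological:

    from datetime import datetime

    stamps = ["2019-06-02 19:40:00", "2019-06-02 11:45:00", "2019-06-03 09:05:00"]

    by_string = sorted(stamps)   # plain lexicographic sort
    by_time = sorted(stamps,     # real chronological sort for comparison
                     key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))

    print(by_string == by_time)  # True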

------
franky_g
The HA/Scheduling system is too complex.

Simplify it Google!

------
yashap
“To make error is human. To propagate error to all server in automatic way is
#devops.” - DevOps Borat

------
vmp
(meme) I figured out why the google outage took a while to recover:
[https://i.imgur.com/hzcLx5X.png](https://i.imgur.com/hzcLx5X.png)

------
atmosx
<trolling>

Given the fact that the status page was reporting an erroneous infrastructure
state for more than 30 minutes, and this is Google, is it okay for Amazon to
put the SRE books into the "Science Fiction" category or should we keep them
under tech?

</trolling>

I still feel for the on-call engineers.

------
slics
Is automation good or bad? That is the question. For context, let us think in
the programming context of a B-tree.

Google seems to have created oversight of systems, processes, and jobs that is
itself managed by more automation over other systems, processes, and jobs.

System A manages its child systems B, which in turn manage their own child
systems C, and so on. Now the question becomes: who manages system A and its
activities? Automation of the entire tree is only as good as the starting
node.

Be mindful, and apply automation only to systems that will not be the cause of
your business's demise. Humans are, and should always be, the owner of the
starting process. Without that governance model, you get Google with 5 hours
of downtime, or worse, in the near future.

------
hansflying
Google has a huge quality problem and their service is extremely unreliable.
Another 3-day outage in Kubernetes:

[https://news.ycombinator.com/item?id=18428497](https://news.ycombinator.com/item?id=18428497)

login issues:

[https://news.ycombinator.com/item?id=19687029](https://news.ycombinator.com/item?id=19687029)

storage system outage:

[https://news.ycombinator.com/item?id=19392452](https://news.ycombinator.com/item?id=19392452)

...

So, basically Google created the most unreliable cloud system in the world.

~~~
swebs
>So, basically Google created the most unreliable cloud system in the world

I'm pretty sure that title goes to Azure

~~~
dancek
You people probably haven't used IBM Cloud (or Bluemix, as it used to be). We
inherited one application there, and boy was life stressful. There were
already plans to move elsewhere, and then one day our managed production
database was down. Took me something like ten hours to build a new production
system elsewhere from backups, but it took longer for the engineers to fix the
database.

