
Google Cloud Downtime Postmortem - tosh
https://status.cloud.google.com/incident/cloud-networking/18012?m=1
======
RestlessMind
> One of the features contained a bug which would cause the GFE to restart;
> this bug had not been detected in either of testing and initial rollout.

I wish they had elaborated more on what type of bug that was, one that wasn't
caught by testing or the initial rollout. Either the tests must be poorly
written or the bug must be very subtle.

On an unrelated note: kudos to Google for publishing this postmortem; I hope
this becomes an industry-wide practice. I also wish they'd publish (belated)
ones about Google+ and their throng of messaging apps over the years.

~~~
morrbo
With all due respect, it's not a postmortem, it's an advert. It doesn't really
say anything beyond "We had a problem, we fixed it." There are virtually no
technical details in there other than "something would restart spontaneously,
which shifted the load somewhere else". Maybe I'm a bit jaded by the
Cloudflare and AWS writeups, but this really isn't anything special or worth
reading.

~~~
tylerl
I dunno, it looks comprehensive enough to me. What were you looking for,
exactly? Source code? Packet traces? Pew-pew maps?

A _useful_ PM includes a summary, an impact analysis, a root cause analysis,
and a comprehensive (and realistic) set of measures that will prevent
recurrence.

After reading what they provided, I have a reasonable understanding of what
went wrong (sufficient for me to plan my own safeguards if necessary), a
useful measure of the team's response and remediation capabilities, and enough
information to judge my comfort with their preventative measures.

Anything else is just cleverly-disguised marketing.

~~~
occams_chainsaw
It's pretty bare-bones next to a postmortem like
[https://www.epicgames.com/fortnite/en-US/news/postmortem-
of-...](https://www.epicgames.com/fortnite/en-US/news/postmortem-of-service-
outage-at-3-4m-ccu)

~~~
forgot-my-pw
I feel that's way too much info.

------
westoque
Amazing that they handled the downtime in such a short time span. Boosts my
confidence to actually use GCP. Huge props!

Meanwhile... still waiting for the Amazon Prime Day post mortem.

~~~
throwaway5752
+1'ed you on the props to GCP, but the Amazon Prime Day outage was only on
their retail site, right? Do vendors there have an SLA, or is there otherwise
an obligation to publish a post-mortem on the incident? I think it's different
when it's a platform/service provider, but the Prime Day outage only hurt
Amazon. My recollection is that AWS service outages had prompt and thorough
published incident reports.

~~~
paulddraper
> I think it's different when it's a platform/service provider but the prime
> day outage only hurt Amazon.

Have you not used Amazon? It's where half the country does their shopping.
Many businesses live and die on Amazon just as much as they do on AWS.

~~~
tedunangst
How many businesses died as a result of the prime day outage?

~~~
ikeboy
None. Sales were still way above those of a typical day.

The outage was way over-hyped.

------
nashadelic
> Google engineers were alerted to the issue within 3 minutes and began
> immediately investigating

As a user of their service, our engineers were notified within <30 seconds of
the issue starting. Given that GCP had a large impacted population, how is it
that it took them much longer to acknowledge?

> The GFE development team was in the process of adding features to GFE to
> improve security and performance. These features had been introduced into
> the second layer GFE code base but not yet put into service. One of the
> features contained a bug which would cause the GFE to restart; this bug had
> not been detected in either of testing and initial rollout.

Something going down after a deployment is the most common source of issues.
Monitoring system KPIs for abnormalities after a rollout, and performing a
near-instant automatic rollback on regression, should be common practice.
Also, doesn't GCP perform dark launches and partial launches? Launch to 1%,
check the KPIs, increase to 5%, and so on?
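
The ramp-and-watch loop described above could look roughly like this (a
minimal Python sketch; `error_rate` is a hypothetical stand-in for a real
metrics query, not any actual GCP tooling):

```python
def error_rate(percent_traffic):
    """Hypothetical KPI probe: the error rate observed while
    `percent_traffic` percent of requests hit the new release."""
    # Stand-in for a real monitoring query; healthy baseline in this sketch.
    return 0.001

def canary_rollout(stages=(1, 5, 25, 50, 100), kpi=error_rate, threshold=0.01):
    """Ramp a release through traffic stages, rolling back on KPI regression."""
    for percent in stages:
        if kpi(percent) > threshold:
            # KPI abnormality at this stage: revert immediately instead of
            # letting the broken release reach 100% of traffic.
            return ("rolled_back", percent)
    return ("fully_deployed", 100)
```

With a healthy KPI this walks all the way to 100%; feed it a regressed KPI
(e.g. `kpi=lambda p: 0.5`) and it bails out at the 1% stage.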

~~~
foobaw
Maybe they were on reddit. Seriously though, 30 seconds vs 3 minutes is a long
time in the reliability world but not atrocious.

~~~
smueller1234
It's the difference between being at home, sober, and with a laptop handy vs.
being actively logged in and ready to go at all times.

In terms of quality of life for people on call, it's an entirely different
world. And in a setup where your on-call engineers are extremely highly
skilled and have all the choice in the world in terms of where to work, that
little bit of respect for their time is a necessary investment.

------
tubaguy50035
Did they learn nothing from the Azure of yore? You don't roll out a change
globally, ever.

While the postmortem is appreciated, I'd rather they just didn't roll out
changes globally.

------
piinbinary
I'm very impressed that they can go from deploying a fix (12:44) to it being
effective (12:49) in just 5 minutes.

~~~
hilbertseries
Reading the post mortem, I'm guessing it's more like it took them 30 minutes
to figure out what was happening; then they rolled back the deploy and it was
fixed.

Edit: Reading further down, they actually admit that it was just a rollback:

> At 12:44 PDT, the team discovered the root cause, the configuration change
> was promptly reverted, and the affected GFEs ceased their restarts

------
zilchers
I love GCP’s postmortems - they’re open, honest, insightful, and I wish I
could get my company to OK us releasing details like this when we have
outages. It’s part of the reason I personally like GCP more than AWS (and
certainly Azure; those guys don’t admit to shit regardless of how bad the
outage is).

Edit: Wow, downvotes because I like transparency from my cloud hosting
provider? Super interesting...

~~~
sovnade
You're getting downvotes because this was not a transparent and open report.
It was vague and was more advertise-y than postmortem-y.

Not saying Amazon is perfect by any means either, but there's a lot of room
for improvement. Good postmortems give everyone ideas on how to solidify their
own processes and prevent other issues. This was just fluff.

------
aequitas
A little off-topic, but why is it called a postmortem?

When I was first introduced to the concept of incident reports, it was under
the name "postmortem"; I worked for a mainly English-speaking company then and
didn't think twice about it. But earlier this week I mentioned it to a
colleague, and he found it a rather macabre term for something like an
incident report. When you think about it, nothing really died (maybe some
engineers died a little inside because their design was not as 100% reliable
as they thought). For the rest it was just a temporary state, nothing
permanent like death. All other uses of this word (e.g. medical) seem to
relate strictly to death.

Maybe it's because "incident report" just sounds too formal, or is there an
etymology of this term in the IT world?

~~~
jldugger
It's called a 'post mortem' in the medical world because the patient died. It
involves a high level of inspection into what went wrong and what can be
learned to prevent it. I assume the term was adopted from there into project
management.

It might make a little more sense in the world of shipping software in retail
boxes, where products/projects had a 'done' date. The project is dead; what
contributed to its demise? Or you might generalize death into failure, and
that's why we use the term instead of 'post-incident'.

~~~
swozey
Not all post-mortems are for failures or dead products, which might add even
more confusion. For instance, Gamasutra runs a game-development post-mortem
section where developers of popular (and unpopular) games can hop on and
describe difficult situations they've encountered, how they did what they did,
why they did that thing, etc.

It's really fun to read.

[http://www.gamasutra.com/features/postmortem/](http://www.gamasutra.com/features/postmortem/)

edit: Wow, there hasn't been one since 2014. I wonder why this died out.
There's 10 pages of them since 2007..

~~~
s_ngularity
Ironic. Apparently we need a postmortem postmortem

------
meesterdude
This was amusing for me, because I am JUST starting to test out Google's cloud
offerings and got hit by this outage on basically day 1. Luckily, it wasn't
too long before I figured out it was them and not me.

~~~
EugeneOZ
Bobby Tables?

~~~
dylan604
Recently I ran into an issue with a provider's CSV export not quoting some of
its text fields. It was a minor annoyance: the text fields contained user
input where the user had used commas, so importing into Excel mangled the
columns with the extra commas.

I then suggested signing in with Bobby Drop Tables as a user name. The silence
in the room quickly reminded me I was not in the company of other developers.
Such a waste
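
For reference, that mangling is exactly what CSV quoting prevents, and
Python's standard `csv` module does it by default (the field values here are
made up for illustration):

```python
import csv
import io

# A free-text field containing commas must be quoted, or every embedded
# comma becomes a column separator on import.
row = ["42", "O'Brien, Robert", "robert@example.com"]

buf = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only the fields that need it, i.e.
# those containing the delimiter, the quote character, or a newline.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)

# Reading it back preserves all three columns instead of mangling them.
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row
```

Excel (and any conforming parser) treats the quoted middle field as a single
column, commas and all.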

------
nodesocket
I believe the load balancer outage only affected global load balancers not
regional load balancers. Is that still accurate?

~~~
swozey
My understanding is yes; I only use regional load balancers, which encountered
no outages. This hugely affected App Engine users. I haven't encountered many
GKE users that use global LBs yet (I've run in GKE for roughly 4 years).

------
djhworld
A good write up, thanks.

I remember a configuration change rolled out by an automated system causing a
problem on GCP a few years ago; it's an interesting area that's probably quite
hard to fix.

------
tolk460
I'm an SRE for a large software company. We call it an "incident
retrospective" for this reason.

------
grogers
It sounds like the impact wasn't limited to particular users or regions. Why
would they deploy the configuration worldwide at the same time?

~~~
buahahaha
It's a global load balancer.

------
jaimex2
Bet Tesla are glad they made the switch to MapBox and Valhalla months ago.

~~~
iKSv2
genuine question, whats Valhalla? Google returned something Norse.

~~~
crazysim
Tack on the "Tesla" keyword in your search. "Tesla Valhalla".

~~~
jwommack
That does it, but you can also find more detailed information on their GitHub
page:
[https://github.com/teslamotors/valhalla](https://github.com/teslamotors/valhalla)

------
exabrial
Basically, due to the lack of customer service, Google Cloud is not for
serious business, just casual side projects. Reading through this and the
article about getting their servers cut off made my stomach lurch.

~~~
asfasgasg
All major cloud platforms have occasional unplanned downtime. AWS had an
outage last year that took many sites offline for hours. A single instance of
Google having such unplanned downtime is meaningless without more datapoints.

As for whether it's useful for serious business: well, proof by example? This
has sixteen pages of case studies:
[https://cloud.google.com/customers/](https://cloud.google.com/customers/).
That is by no means all of GCP's serious customers.

~~~
exabrial
That sort of proves my point :( unless you are a big fish with marketing
potential, you're nothing to Google

~~~
robinwassen
We have been using GCP for 6 years (and are one of the companies in the linked
case studies), and I must say that their customer service is good.

I think I've had over 30 touches with their support and key account managers
regarding everything from billing to minor issues with services to just
asking for advice. They have always delivered.

The expectation on our side has been that we pay the $150/mo support package
fee.

