
Post-Mortem for Google Compute Engine’s Global Outage on April 11 - sgrytoyr
https://status.cloud.google.com/incident/compute/16007?post-mortem
======
brianwawok
This is a very good Post-Mortem.

As I assumed, it was kind of a corner-case bug meets corner-case bug meets
corner-case bug.

This is also why I am afraid of self-driving cars and other such life-critical
software. There are going to be weird edge cases; what prevents you from
reaching them?

Making software is hard....

~~~
ben_jones
Self-driving cars don't have to be perfect. They just have to be safer than
driving is today [1].

The real question is whether society can handle the unfairness of death by
random software error vs. death by negligent driving. It's easy to blame
negligent driving on the driver; we're clearly not negligent, so it really
doesn't affect us, right? But a software error might as well be an act of god;
it's something that might actually happen to me!

[1]:
[https://en.wikipedia.org/wiki/List_of_motor_vehicle_deaths_i...](https://en.wikipedia.org/wiki/List_of_motor_vehicle_deaths_in_U.S._by_year)

~~~
gliderShip
Well, no. There is an upper limit on the damage a bad driver can do by, say,
crashing his car into a bus or something like that. Imagine a bug or malware
triggered at the same moment world-wide. It could kill millions. So it's not
as simple as 'it just has to be better than a human'.

~~~
essayist
I've been itching to release this terror movie plot into the wild:

 _It's 2025 and more than 10% of the cars on the road in the US are self-
driving. It's rush hour on a busy Friday afternoon in Washington, DC. Earlier
that day, there'd been a handful of odd reports of self-driving Edsels (so as
not to impugn an actual model) going haywire, and the NTSB has started its
investigation._

 _But then, at 4:30pm, highway patrol units around the DC beltway notice three
separate multi-Edsel phalanxes, drivers obviously trapped inside, each phalanx
moving towards the Clara Barton Parkway, which enters DC from the west. Other
units notice four more phalanxes, one comprising 20 Edsels, driving into DC
from the east side, on Pennsylvania Avenue._

 _At this point, traffic helicopters see similar car clusters, more than two
dozen, all over DC, all converging on a spot that looks to be between the
Washington Monument and the White House._

 _We zoom in on the headquarters of the White House Secret Service. A woman is
arguing vociferously that these cars have to be stopped before they get any
closer to the White House. A colleague yells back that his wife is in one of
those commandeered cars, and she, like the rest of the "hackjacked" drivers
and passengers, is innocent._

~~~
Afforess
I'm afraid Daemon (the novel) beat you to the punch. It's an excellent novel
about fairly similar situations.

[http://www.goodreads.com/book/show/6665847-daemon](http://www.goodreads.com/book/show/6665847-daemon)

~~~
pavel_lishin
And a really fun read. If someone here decides to get this book, get the
sequel as well - Daemon ends on kind of a cliff-hanger.

~~~
mitchellhislop
Freedom (TM) is the sequel, and the author (Daniel Suarez) has a few other
near-term what-if-this-all-goes-skynet books which are equally good.

~~~
Cerium
Thank you! I didn't know there was a sequel. I really enjoyed the first book.

------
teraflop
> There are a number of lessons to be learned from this event -- for example,
> that the safeguard of a progressive rollout can be undone by a system
> designed to mask partial failures -- ...

This is a really important point that should be more generally known. To quote
Google's own "Paxos Made Live" paper, from 2007:

> In closing we point out a challenge that we faced in testing our system for
> which we have no systematic solution. By their very nature, fault-tolerant
> systems try to mask problems. Thus they can mask bugs or configuration
> problems while insidiously lowering their own fault-tolerance.

As developers we can try to bear this principle in mind, but as Monday's
incident demonstrated, mistakes can still happen. So, has anyone managed to
make progress toward a "systematic solution" in the last 9 years?

~~~
nostrademons
Google actually does have a systematic solution: fault injection. Google's
systems are designed so that you can (manually, if you have the right
privileges) tell an RPC to fail regardless of whether it would otherwise have
succeeded, and then test the response of the system as a whole.
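
A minimal sketch of what such a hook might look like (the names and shape are
my invention, not Google's actual API):

    # Hypothetical fault-injection wrapper around an RPC stub. A privileged
    # operator flips failure_rate on out-of-band, then watches how the rest
    # of the system copes with the forced failures.
    import random

    class FaultInjectingStub:
        def __init__(self, stub, failure_rate=0.0):
            self.stub = stub
            self.failure_rate = failure_rate  # 1.0 = always fail this RPC

        def call(self, method, request):
            if random.random() < self.failure_rate:
                raise ConnectionError("injected fault: %s" % method)
            return getattr(self.stub, method)(request)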

The problem is that these failure cases are exercised much less frequently
than the "normal execution" code paths are. For example, every year Google
does DiRT [1] exercises which test system responses to a large calamity, e.g. a
California earthquake that kills everyone in Mountain View and SF including
the senior leadership, and also knocks out all west coast datacenters. The
half-life of code at Google (in my observation) is roughly 1 year, which means
that half of all code has never gone through a DiRT exercise. The same applies
to other, less serious fault injection mechanisms: they may get executed once
every year or two, and serious bugs can crop up in the meantime. Automated
testing of fault injection isn't really feasible, because the number of
potential faults grows combinatorially with the number of independent RPCs in
the system.

I'd be willing to bet that the two bugs that caused this outage were less than
6 months old. In my tenure at Google, the vast majority of bugs that showed up
in postmortems were introduced < 3 months before the outage.

[1] [http://everythingsysadmin.com/2012/09/devops-google-reveals-...](http://everythingsysadmin.com/2012/09/devops-google-reveals-their-di.html)

~~~
peterwwillis
Testing doesn't detect failure; it only detects the failure of a test. Real
failures happen more often than test failures, for the same test on the same
code with the same input and output. The best systematic solution would detect
real failures, not see what happens when you fail a test.

~~~
nostrademons
That's monitoring, then. As Steve Yegge's Platforms Rant [1] mentioned,
testing and monitoring are two sides of the same coin. Google does both, but
the original thread-starter here was asking about how to detect failures when
the system itself is designed to mask & recover from failures. (FWIW, most
such systems do log when they've encountered a failure condition and recovered
from it, and this stat is available to the monitoring system.)
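
As a toy illustration of that last point (made-up names, nothing
Google-specific):

    # A failover path that exports a counter, so monitoring can alert on
    # failures the system otherwise masks from its callers.
    import collections

    METRICS = collections.Counter()

    def read_with_failover(primary, replica, key):
        try:
            return primary[key]
        except KeyError:  # stand-in for "primary lookup failed"
            METRICS["masked_primary_failures"] += 1  # scraped by monitoring
            return replica[key]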

[1]
[https://plus.google.com/+RipRowan/posts/eVeouesvaVX](https://plus.google.com/+RipRowan/posts/eVeouesvaVX)

~~~
peterwwillis
Basically, yes. But we don't have to make a traditional monitor, or have it be
an extra component. Monitoring all the facets of, say, a code deployment, or a
software build, or performance testing, is a dynamic thing. It may fail, or it
may succeed, or it might be _suspicious_.

Normally we design systems for humans to determine that third part; in this
case, there should have been a system where humans could see the one or two
pieces of unusual activity and investigate. But there wasn't, or it didn't
work right. So a "fix" would be to develop software that adapts to
nondeterministic behavior the way a human does. I wouldn't exactly call that
monitoring, though.

------
cosud
Great writeup! PS: "To make error is human. To propagate error to all server
in automatic way is devops." -DevOps Borat

------
cjbprime
It looks like there were at least three catastrophic bugs present:

1. The system evaluated a configuration change before the change had finished
syncing across all configuration files, resulting in rejecting the change.

2. So it tried to reject the change, but actually just deleted everything
instead.

3. Something was supposed to catch changes that break everything, and it
detected that everything was broken, but its attempt to fix it failed.

It is hard to imagine that this system has good test coverage.

~~~
mjibson
I'm attempting to even imagine how one would build a useful way to test this.
Would they have to have a secondary, world-wide datacenter network with all
their various services behind it?

~~~
manquer
While testing would have been quite difficult, any simple canary release or
timed-release mechanism would have prevented this or at least limited the
damage. For such mission-critical systems, applying a global change in this
manner is asking for it. DevOps can also be a SPOF, and this seems to be one
such case.

~~~
mgw
They had a canary release mechanism in place. This is described in the post
mortem.

> These safeguards include a canary step where the configuration is deployed
> at a single site and that site is verified to still be working correctly,
> and a progressive rollout which makes changes to only a fraction of sites at
> a time, so that a novel failure can be caught at an early stage before it
> becomes widespread. In this event, the canary step correctly identified that
> the new configuration was unsafe. Crucially however, a second software bug
> in the management software did not propagate the canary step’s conclusion
> back to the push process, and thus the push system concluded that the new
> configuration was valid and began its progressive rollout.

Taking no confirmation from the canary testing process as a signal to go
ahead, though, is not just a bug but a design flaw IMO.
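
A sketch of the fail-closed alternative (hypothetical names; obviously not the
real push system):

    # Treat anything other than an explicit PASS from the canary as a
    # failure -- including timeouts, crashes, and lost messages.
    import enum

    class Verdict(enum.Enum):
        PASS = 1
        FAIL = 2

    def should_continue_rollout(fetch_canary_verdict, timeout_s=600):
        try:
            verdict = fetch_canary_verdict(timeout_s)
        except Exception:  # timeout, RPC error, bug in the canary itself
            return False   # no news is bad news
        return verdict is Verdict.PASS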

------
stcredzero
_In this event, the canary step correctly identified that the new
configuration was unsafe. Crucially however, a second software bug in the
management software did not propagate the canary step’s conclusion back to the
push process, and thus the push system concluded that the new configuration
was valid and began its progressive rollout._

Classic Two Generals. "No news is good news" generally isn't a good design
philosophy for systems designed to detect trouble. How do we know that
stealthy ninjas haven't assassinated our sentries? Well, we haven't heard
anything wrong...

~~~
fixermark
It may not be good design, but it might be necessary / practical design. If
you have enough machines that some percentage of them are down or unreachable
at any given time, you can't wait for a full go-ahead before proceeding; you'll
never get a full go-ahead. So you're left with probabilistic solutions, and as
T approaches infinity, the probability of at least one false positive
approaches 1.
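
A sketch of that probabilistic middle ground (thresholds made up): require a
quorum of explicit ACKs, so a few unreachable machines can't block forever,
but any explicit failure still halts the rollout.

    def quorum_go_ahead(acks, nacks, total, quorum=0.9):
        # Any explicit failure report aborts immediately.
        if nacks > 0:
            return False
        # Silence from a small minority is tolerated; silence from
        # most (or all) machines is not.
        return acks / total >= quorum

    # quorum_go_ahead(acks=95, nacks=0, total=100) -> True
    # quorum_go_ahead(acks=0,  nacks=0, total=100) -> False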

~~~
stcredzero
The whole point of the canary sub-population, though, is that 1) it's not your
whole population, and 2) you want to find out empirically if something's wrong.

------
Gravityloss
I'm waiting for the time when they push over the air updates to airplanes in
flight.

"You can fly safely, we have canaries and staged deployment"

A year forward:

"Unfortunately because the canary verification as well as the staged
deployment code was broken, instead of one crash and 300 dead, an update was
pushed to all aircraft, which subsequently caused them to crash, killing
70,000 people."

I'm not 100% sure why they don't do the staged deployment for Google-scale
server networking over a few days (or even weeks in some cases) instead of a
few hours, but I don't know the details here...

It's good that they had a manually triggerable configuration rollback and a
pre-set policy, so it was resolved quickly.

~~~
joering2
Yeah, the part with the canary code rubbed me the wrong way too.

 _These safeguards include a canary step where the configuration is deployed
at a single site and that site is verified to still be working correctly_

This sounds very unprofessional imho. "Touch this cable to see if there is
electricity running" sort of thing.

Is that really how it should be done?

~~~
oconnor663
If you're doing electrical work, _eventually_ you're going to have to touch
the cable!

------
ndesaulniers
At Google, they do these really awesome post-mortems when there's a major
failure. They provide a point of reflection, and are usually well-written,
entertaining reads. I didn't know they made (some?) public.

Writing one is a good learning exercise, and it's treated more as a learning
exercise than a punishment.

~~~
advisedwang
It's worth noting that the publicly posted postmortem is not the same as the
internal postmortems (which include much more detail, specific action items,
timelines etc). The SRE book
([https://landing.google.com/sre/book.html](https://landing.google.com/sre/book.html))
has a whole chapter on our internal postmortems, which is probably a better
learning exercise in how to write one.

Source: I work on the team that writes these external postmortems.

------
dylanz
Completely off topic, but this thread is an example of why I (and a lot of
people) want collapsible comments native to HN. I'm on my phone, in Safari,
and I had to scroll for over 20 seconds just to reach the second comment. The
first comment was a tangent about self-driving cars, which, while relevant, I
didn't want to read about.

~~~
OldSchoolJohnny
Especially considering that nearly every post on HN features an often
tangential first comment that goes on and on and on...

------
ikeboy
> However, in this instance a previously-unseen software bug was triggered,
> and instead of retaining the previous known good configuration, the
> management software instead removed all GCE IP blocks from the new
> configuration and began to push this new, incomplete configuration to the
> network.

> Crucially however, a second software bug in the management software did not
> propagate the canary step’s conclusion back to the push process, and thus
> the push system concluded that the new configuration was valid and began its
> progressive rollout.

I assume the software was originally tested to make sure it works in case of
failure. It would be interesting to know exactly what the bug was and why it
didn't show in tests.

~~~
djfergus
Network management software complexity is supposed to be one of the things
that SDN was built to solve (by introducing more modularity and defined
interfaces). But in this case the fault was at the edge, with BGP route
updates, which the internet has been doing for decades. I share your curiosity
about the specific bug.

However, this is a great detailed post-mortem from a service provider. Your
Telco or ISP will never provide this much detail...

------
pjlegato
Attention startups: _this_ is what incident post-mortems should look like.

~~~
wyldfire
Many startups probably don't have anywhere near the same level of SLA or
revenue as GCE.

~~~
pjlegato
Of course. You should still have a goal to strive towards.

------
eranation
This is very interesting. From the little I understand (sorry for using AWS
terms, as I am more versed in AWS than GCE), this can happen to AWS as well,
right? Even if your software is deployed to multiple AZs / multiple regions,
if a bad routing / network configuration makes it through the various
protection mechanisms, then basically no amount of redundancy can help if your
service is part of the non-functional IP block. It seems that no matter how
redundant you are, there will always be a single point of failure somewhere
along the line; even if it has multiple mechanisms to prevent failure, if all
of those mechanisms fail, it's still a single point. What prevents this from
happening at Azure / AWS? Is there anything that general internet routing
protocols need to change to prevent it from happening?

E.g., I'm sure we will never hear that Bank of X transferred a billion dollars
to an account, but because of propagation errors published only the credit and
never finished the debit, and now we have two billionaires. This two-or-more-
phase commit is pretty much bulletproof in banking as far as I know, and banks
are not known to be technologically more advanced than Google. How come
internet routing is so prone to errors that can make an entire cloud service
unavailable for even a small period of time? I'm far from knowing much about
networking (although I took some graduate networking courses, I still feel I
know practically nothing about it...), so I would appreciate it if someone
versed in this could ELI5 whether this can happen in AWS and Azure regardless
of how redundant you are (which leads to the notion of cross-cloud-provider
redundancy, which I'm sure is used in some places), whether the banking
analogy is fair and relevant, and whether there are any RFCs to make
world-blackout routing nightmares less likely to happen.
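
To be concrete, the banking pattern I mean is roughly two-phase commit; a toy
sketch (participant objects with prepare/commit/abort methods are assumed):

    # Toy two-phase commit: the debit and the credit either both apply or
    # neither does.
    def two_phase_commit(participants, op):
        # Phase 1: every participant must vote yes before anything is final.
        if not all(p.prepare(op) for p in participants):
            for p in participants:
                p.abort(op)
            return False
        # Phase 2: only now is the change made durable everywhere.
        for p in participants:
            p.commit(op)
        return True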

~~~
poooogles
I'm not sure the AWS network follows the same setup; AWS has very distinct
blocks between US/EU/APAC, compared to GCP, where you can inherit the same IP
if you quickly delete/recreate instances in different regions.

~~~
Swannie
I was going to post the same comment too.

My understanding, from the odd bits and bobs of information I have, is that
AWS regions are typically managed somewhat independently.

------
wyldfire
> Internal monitors generated dozens of alerts in the seconds after the
> traffic loss became visible at 19:08 ... revert the most recent
> configuration changes ... the time from detection to decision to revert to
> the end of the outage was thus just 18 minutes.

It's certainly good that they detected it as fast as they did. But I wonder if
the fix time could be improved upon? Was the majority of that time spent
discussing the corrective action to be taken? Or does it take that much time
to replicate the fix?

~~~
VLM
Having worked in ISP operations on BGP stuff (admittedly more than 10 years
ago), it was both too slow and too fast.

If the rollout took 12 hours instead of 4 or the VPN failure to total failure
was multiple hours instead of minutes, they'd have had enough time to noodle
it out. Eventually at a slow enough deploy rate they'd have figured it out. It
only took 18 hours to make the final report after all, so an even slower 24
hour deploy would have been slow enough, if enough resources were allocated.

On the opposite side, most of the time when you screw up routing, the
punishment is extremely brutal and fast. If the whole thing had croaked in
five minutes, it's "OK, who hit enter within the last ten minutes..." and five
minutes later it's all undone. What happened instead was: dude hits enter, all
is well hours later, although average latency is increasing very slowly as
anycast sites shut down. Maybe there's even a shift change in the middle.
Then, hours later, it finally all hits the fan, meanwhile the guy who hit
enter is thinking "it can't be me, I hit enter over four hours ago followed by
three hours of normal operation... must be someone else's change or a memory
leak or a novel cyberattack or ..."

Theoretically, if you're going to deploy anycast you could deploy a monitoring
tool that traceroutes to see that each site is up; however, you deploy anycast
precisely so that it never drops... It's the Titanic effect: this is why it's
unsinkable, so why would you bother checking to see if it's sinking? And just
like the Titanic, if you break them all in the same accident, that sucker is
eventually going down, even if it takes hours to sink.
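
Theoretically it'd be something like this per-site probe (addresses are made
up, and ping is standing in for traceroute):

    # Probe each anycast site via its per-site management address, since the
    # anycast address itself will happily keep answering from surviving sites.
    import subprocess

    SITE_MGMT_ADDRS = {"site-a": "192.0.2.1", "site-b": "198.51.100.1"}

    def down_sites():
        return [name for name, addr in SITE_MGMT_ADDRS.items()
                if subprocess.call(["ping", "-c", "3", addr],
                                   stdout=subprocess.DEVNULL) != 0]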

~~~
djfergus
Hmm. Seems like this begs for a different way to solve the problem, like
alarming on major changes to configuration files or better recognition of
invalid configs, e.g. Google should be able to make a rule that says "if I
ever blackhole x% of my network then alarm"...
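
A sketch of such a rule (hypothetical config shape, made-up threshold):

    # Refuse to push any configuration that drops more than 5% of the
    # currently advertised IP blocks in one step.
    def validate_new_config(old_blocks, new_blocks, max_removed_frac=0.05):
        removed = old_blocks - new_blocks
        if old_blocks and len(removed) / len(old_blocks) > max_removed_frac:
            raise ValueError("refusing to push: drops %d of %d IP blocks"
                             % (len(removed), len(old_blocks)))

    # A config that removes every block would fail loudly here:
    # validate_new_config({"a", "b", "c", "d"}, set())  -> ValueError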

~~~
VLM
The first one is alarm fatigue. Like the "Terror Thermometer" or whatever its
called where we're in eternal mauve alert meaning nothing to anyone. All our
changes are color coded as magenta now. Or its turned down such that one
boring little ip block isn't a major change. After all, it isn't. Of course
you (us) developers could run crazy important multinational systems on what to
us networking guys was one boring little IP block who cares about such as
small block of space.

The second one is covered in the article: their system for that purpose
crashed, and then the system that babysits that crashed, and then whatever
they use to monitor the monitors' monitor didn't notice. Probably showed up in
some dude's nightly syslog dump the next day. Oh well. If your monitoring tool
breaks due to complexity (as they often do), it needs to simplicate and add
lightness, not slather more complexity on. Usually monitoring is more
complicated and less reliable than operating; it's harder computationally and
procedurally to decide right from wrong than to just do it.

The odds of cascaded failure happening are very low. Given fancy enough backup
systems that means all problems will be weird cascaded failure modes. That
might be useful in training.

When I was doing this kind of stuff I was doing higher-level support, so (see
above) at least some of my stories are weird cascaded "impossible" failures. A
slower rollout would have saved them. Working all by myself, I like to think I
could have figured it out by comparing BGP looking-glass results and
traceroute outputs from multiple very slowly arriving latency reports to
router configs, with papers all over my desk and multiple monitors, in at most
maybe two days. Huh, it's almost like anycast isn't working at more sites
every couple hours, huh. Of course their automated deployment completes in
only 4 hours, which means all problems that take "your average dude" more than
4 hours of BAU time to fix are going to totally explode the system and result
in headlines instead of a weird bump on a graph somewhere. Given that
computers are infinitely patient, slowing down the rollout of automated
deployments from 4 hours to 4 days would have saved them for sure. Don't
forget that normal troubleshooting shops will blow the first couple hours on
written procedures and scripts, because honestly most of the time those DO
work. So my ability to figure it out all by myself in 24 hours is useless if
the time from escalation to hitting the fan is only an hour because they roll
out so fast. Once it hit the fan, a total company effort fixed it a lot faster
than I could have fixed it as an individual.

Or the strategy I proposed, where computers are also infinitely fast: roll out
in five minutes, one minute to say WTF, five minutes to roll back. An
11-minute outage is better than what they actually got. It's not like Google
is hurting for computational power. Or money.

I'm sure there are valid justifications for the awkward four-hour rollout
that's both too fast and too slow. I have no idea what they are, but the
Google guys probably put some time into thinking about it.

------
obulpathi
> Finally, to underscore how seriously we are taking this event, we are
> offering GCE and VPN service credits to all impacted GCP applications equal
> to (respectively) 10% and 25% of their monthly charges for GCE and VPN.

These credits exceed what is promised by Google Cloud in their SLAs for
Compute Engine and VPN service!

~~~
duskwuff
... which is precisely (almost word-for-word) what the post-mortem goes on to
say. Is there something specific you're trying to call attention to here?

~~~
obulpathi
Nope. Probably did too much copy-pasting :( Merely wanted to highlight the
point.

------
balls187
Nice post-mortem.

That outage gives GCE at best four nines of reliability for 2016.

~~~
daveguy
Based on the higher level status page:

[https://status.cloud.google.com/summary](https://status.cloud.google.com/summary)

It looks like GCE uptime is _well_ below four nines of reliability for a
sliding one-year timeframe.

~~~
dgacmu
Traynor was quoted in a Network World article last year saying they aim for
three and a half nines (99.95%). But you need to read into the incidents more
carefully -- figuring out actual "uptime" is quite hard. Consider the
longest-lasting incident:
    
    
      "On Tuesday 23 February 2016, for a duration of 
       10 hours and 6 minutes, 7.8% of Google Compute Engine
       projects had reduced quotas.  ...  Any resources that
       were already created were unaffected by this issue."
    

I'm not sure off the top of my head how I'd try to compute the overall
availability numbers from that one. One could try to determine and sum the
effects on the individual customers, but we can't from the information
provided. It's certainly less overall downtime than just counting it as a
10-hour failure, though.
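
The crudest possible weighting, treating the incident as total downtime for
the affected 7.8% of projects and no downtime for everyone else:

    10.1 hours * 0.078 ≈ 0.79 hours ≈ 47 minutes (project-weighted)

versus the full 10+ hours if you count it as a global outage.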

~~~
daveguy
Agreed. It is difficult to tell. But if the bug is preventing you from
processing (because you can't save the existing results), then it's
essentially downtime for new processing. There are also connectivity issues by
region and DNS issues. It is difficult to get exact downtime considering
partial failures.

That said, this is the second major asia-east1 downtime in 90 days:

[https://status.cloud.google.com/incident/compute/16002](https://status.cloud.google.com/incident/compute/16002)

------
huula
I always like Google's serious attitude towards engineering: even when they
have made mistakes, they never try to hide anything.

------
totally
> However, in this instance a previously-unseen software bug was triggered,
> and instead of retaining the previous known good configuration, the
> management software instead removed all GCE IP blocks from the new
> configuration

> Crucially however, a second software bug in the management software did not
> propagate the canary step’s conclusion back to the push process

I'm sure the devil is in the details, but generally speaking, these are two
instances of critical code that gets exercised infrequently, which is a good
place for bugs to hide.

------
pbreit
Do SLAs even matter in the slightest? Or are they just sort of "feel-good"
things or ways for negotiators to demonstrate their worth?

~~~
duskwuff
SLAs aren't about guaranteeing uptime. They're about setting consequences for
downtime.

~~~
cbr
But once there are strong consequences for downtime the service provider is
going to set up training, monitoring, oncall, etc to make sure things stay
within the SLA limits. So you are effectively negotiating uptime.

------
heisenbit
"Lessons learned from reading post-mortems" [http://danluu.com/postmortem-
lessons/](http://danluu.com/postmortem-lessons/) is a good place to dig deeper

The first graph, quoted from a survey paper, is a classic that fits the GCE
outage well:

Initial error --92%--> Incorrect handling of errors explicitly signaled in
software

~~~
anoncept
[https://mitpress.mit.edu/books/engineering-safer-world](https://mitpress.mit.edu/books/engineering-safer-world) is also an
excellent resource that more people who care about post-mortems should read.

(As background, the author, MIT Prof. Nancy Leveson, summarizes decades of
work in the field, offers groundbreaking new theoretical tools that scale up
to some of the world's most complex accidents, and has the experience and
evidence to back up their relevance e.g. via work on Therac-25, the Columbia
Space Shuttle, and Deepwater Horizon to name just a few...)

------
simonebrunozzi
I love his signature: "Benjamin Treynor Sloss | VP 24x7".

------
rdtsc
> However, in this instance a previously-unseen software bug was triggered,
> and instead of retaining the previous known good configuration, the
> management software instead removed all GCE IP blocks from the new
> configuration and began to push this new, incomplete configuration to the
> network.

Always test your crash / exception handling / special case
termination+recovery code in production.

I have seen this too often. Most often in "every day" cases where a service
has a "nice" graceful way of stopping and recovering, and then a separate
"killed by SIGKILL / immediate power failure" crash-and-recovery path. This
last bit never gets tested, yet it runs in production.

One day a power failure happens, and the service restarts and tries to
recover. Code that almost never runs now runs, and the whole thing goes into
an unknown broken state.
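
A sketch of what testing that path can look like (the paths and the health
check are hypothetical):

    # Kill the service the hard way, then assert recovery actually works.
    import os, signal, subprocess, time

    def test_sigkill_recovery():
        proc = subprocess.Popen(["./my_service", "--data-dir=/tmp/d"])
        time.sleep(5)                      # let it accept some writes
        os.kill(proc.pid, signal.SIGKILL)  # skip the graceful shutdown path
        proc.wait()
        proc = subprocess.Popen(["./my_service", "--data-dir=/tmp/d"])
        time.sleep(5)
        assert check_health(proc)          # hypothetical health probe
        proc.terminate()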

~~~
senderista
See [https://en.wikipedia.org/wiki/Crash-only_software](https://en.wikipedia.org/wiki/Crash-only_software)

------
halayli
This isn't the first time a config system at Google has caused a major outage.

[https://googleblog.blogspot.com/2014/01/todays-outage-for-se...](https://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html)

~~~
rrdharan
That's entirely unsurprising. The recent major Facebook outage was also caused
by bad configuration IIRC.

See: [http://danluu.com/postmortem-lessons/](http://danluu.com/postmortem-lessons/)

> Configuration bugs, not code bugs, are the most common cause I’ve seen of
> really bad outages. When I looked at publicly available postmortems,
> searching for “global outage postmortem” returned about 50% outages caused
> by configuration changes. Publicly available postmortems aren’t a
> representative sample of all outages, but a random sampling of postmortem
> databases also reveals that config changes are responsible for a
> disproportionate fraction of extremely bad outages. As with error handling,
> I’m often told that it’s obvious that config changes are scary, but it’s not
> so obvious that most companies test and stage config changes like they do
> code changes.

~~~
contingencies
Great link there! Also check out his list of public postmortems at
[https://github.com/danluu/post-mortems](https://github.com/danluu/post-mortems)

PS. On HN you should use asterisks to _italicize_ instead of > for quoting.

------
DanielDent
My post yesterday seems even more relevant today:
[https://news.ycombinator.com/item?id=11477552](https://news.ycombinator.com/item?id=11477552)

It's a shame it's not easier or more common for people to create clones of
(most|all) of their infrastructure for testing purposes.

Something like half of outages are caused by configuration oopsies.

If you accept that configuration _is_ code, then you also come to the
following disturbing conclusion: the usual test environment for critical
network-related code in most environments is the production environment.

~~~
aiiane
The main issue there is that "environments" are defined by configuration, so
if you try to set up a configuration test environment, you run into a direct
logical impasse: either your configs are production configs, and thus not a
separate environment, or they're different from production configs, and thus
may produce different test results from production.

~~~
DanielDent
While I agree with you, I think we could get closer to "production" than is
common right now.

In an AWS environment, imagine a setup where all that differs is the API keys
used (the API keys of the production vs test environment). What gets tricky is
dealing with external dependencies, user data, and simulating traffic.

For an example more relevant to today's issue: imagine a second simulated
"internet" in a globally distributed lab environment, with BGP configs, fake
external BGP sessions, servers receiving production traffic, etc.

I get that it's a lot of work to set up and would require ongoing work to
maintain, and that it's hard or impossible to correctly simulate the many
nuances of real-world traffic, and yet I also think in many cases it would be
sufficient to prevent issues from making it into production.

------
zaroth
For the amount this cost them, they should have bought CloudFlare. If you play
with [global BGP anycast] you are bound to get burned. This is not the first
time that BGP took out your entire routing. This is probably not the last time
that BPG will take out your entire routing. Whoever's job it was to watch the
routing, I am sorry.

Pulling your own worldwide routes because you have too much automation: it
will make a good story once it's filtered down a bit! Icarus was barely up in
the air, too early for a fall.

------
swills
The thing that stood out for me was:

"...team...worked in shifts overnight..."

~~~
delroth
(Usual disclaimer: I speak for myself, not for my employer, etc.)

The team in charge of solving this particular problem is located in two sites
in two different timezones. This is true of most critical SRE teams at Google,
and it is precisely to be able to have 24h coverage in these time-sensitive
situations.

In the 2+ years I have spent in SRE I have never heard of a single instance of
an SRE being asked or even encouraged to stay after hours (let alone
overnight) for incident remediation. There is quite a lot of emphasis being
put on work/life balance.

~~~
senderista
Wow, that's amazing to read, having served as a de-facto SRE (like every other
SDE) at an unnamed competitor to GCE, where I was expected to stay up all
night if necessary to resolve an issue (relatively few teams had follow-the-
sun coverage). I swore I would never carry a pager again after that, but maybe
Google really is different.

------
grogers
How important for redundancy/quality of service is the feature of advertising
each region's IP blocks from multiple points in Google's network? It seems
like region isolation is the most important quality that Google's network
could provide, and their current design is what made something like this
possible, not just the bugs in the configuration propagation. They mention the
ability of the internet to route around failures, so why not rely on that
instead?

------
trhway
As DevOps Borat was saying all along, automated propagation of an error is the
main root cause here. An error (a new configuration) should be rolled out site
by site: OK on us-east1, move on to us-west1... OK, move on to... A canary
site may be the first in the sequence, yet success ("no failure reported")
can't be a big "OK" for an automated push to all sites at the same time.

------
mjevans
I hope that one of their solutions is the obvious one: make change-control
testing a closed loop instead of an open loop. (Watch for /success/ to be
reported instead of watching for failure notifications.)

------
platz
> configuration file

Configuration files strike again - remember Knight Capital?

------
nickysielicki
What does Google use for BGP? Quagga, OpenBGPD, BIRD, their own?

Also, does anyone have a link to statistics on global BGP software usage? I'm
curious what the marketshare looks like.

~~~
kijiki
Google has contributed IS-IS and BGP code to Quagga in the past, as well as
funding some testing at the OSRF. Presumably they use it in at least some
parts of their operations.

------
Tistel
The postmortem used the word "quirk." They might consider drilling down on the
specifics there, especially if that is the heart of the bug/accident.

------
JustUhThought
Just a thought: maybe save the name 'post-mortem' for events that actually are
post-mortems, and call anything before that something else.

------
sengork
Networking issues in either the storage or communication subsystems of any
platform normally result in widespread disruptions.

------
itaifrenkel
What is the reason different GCE regions use the same IP blocks?

------
hvass
What is defense in depth? It is mentioned as a core principle.

~~~
koalaman
[https://en.wikipedia.org/wiki/Defense_in_depth_(computing)](https://en.wikipedia.org/wiki/Defense_in_depth_\(computing\))

------
awinter-py
chaos monkey?

------
hsod
> Crucially however, a second software bug in the management software did not
> propagate the canary step’s conclusion back to the push process, and thus
> the push system concluded that the new configuration was valid and began its
> progressive rollout.

Perhaps the progressive rollout should wait for an affirmative conclusion
instead of assuming no news is good news? I'm not being snarky, there may be
some reason they don't do this.

~~~
windwake12
Presumably it received a false positive (or it was interpreted as such). This
really seems like the root cause, and I suspect a case of happy path
engineering striking again.

------
contingencies
TLDR; they simply didn't test their (global!) custom route announcement
management software. An edge case was triggered in production, and they gee-
whiz-automatically went offline. Epic fail.

PS. To the downvoters, truth hurts.

~~~
mmel
I think you're getting downvoted due to the snarky tone more than any "truth"
you are stating.

~~~
contingencies
Well, how to phrase the same thing briefly without sounding snarky?

~~~
Estragon
You only need to change a few words:

"In other words, they simply didn't test their (global!) custom route
announcement management software. An edge case was triggered in production,
and unsurprisingly they automatically went offline."

~~~
contingencies
There is no accounting for taste.

------
herrvogel-
A bit off topic, but it really bugs me that the banner at the top is so
pixelated.

------
qaq
DRY "The inconsistency was triggered by a timing quirk in the IP block removal
- the IP block had been removed from one configuration file, but this change
had not yet propagated to a second configuration file also used in network
configuration management."

~~~
senderista
Yes, DNS was clearly designed by idiots who had never heard of DRY.

------
NetStrikeForce
I think most people are missing the main failure point: why does one change
propagate automatically to all regions?

All this could have been contained if they deployed changes to different
regions at different times. That would also mean screwing your overseas users
less by running maintenance at 10am their local time :-)

~~~
aiiane
> These safeguards include a canary step where the configuration is deployed
> at a single site and that site is verified to still be working correctly,
> and a progressive rollout which makes changes to only a fraction of sites at
> a time, so that a novel failure can be caught at an early stage before it
> becomes widespread. In this event, the canary step correctly identified that
> the new configuration was unsafe. Crucially however, a second software bug
> in the management software did not propagate the canary step’s conclusion
> back to the push process, and thus the push system concluded that the new
> configuration was valid and began its progressive rollout.

The system _does_ do progressive rollouts, which are essentially what you are
referring to (albeit perhaps at a different pace). The number of changes being
rolled out means that it's not really feasible to roll configurations out to
different regions by hand, so the checks are automated. In this case, the
automated checks failed as well.

~~~
senderista
Waiting a longer time between regional rollouts (so monitoring systems would
have time to detect serious failures) would sacrifice deployment latency, but
not deployment throughput (assuming deployments can be made in parallel). For
example, with a one-day soak per region and rollouts pipelined, each change
would take several days to reach everywhere, but one change would still finish
rolling out every day. For continuous deployment, throughput really matters
more than latency.

