
Google Compute Engine Incident #16015 - Artemis2
https://status.cloud.google.com/incident/compute/16015
======
piinbinary
It seems like most work done to make distributed systems reliable is aimed at
handling machines or groups of machines going down (e.g. the leader node in
one region goes down at the same time as an entire other region). This half
seems to be a solved problem.

The postmortems published by Google, Amazon, and Azure (as well as postmortems
internal to the company I work for) are nearly always due to some type of
change (code or configuration) being rolled out. It seems to me that we need
some help from computers to make these systems reliable -- something like
static type checking in a programming language, but applied to a distributed
system. Perhaps your architecture wouldn't "compile" if network traffic would
go to the wrong place, or if a rate limit exceeds the capacity something is
expected to handle, or if the change would impact too many servers at once.
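
Even a crude "compile step" over a declarative description of the system could
catch examples like these. A toy sketch in Python, where every field name
(`rate_limit_qps`, `capacity_qps`, `servers_in_rollout`) is invented for
illustration:

```python
# Toy "compile-time" checks for a distributed-system config change.
# All field names here are made up; a real system would derive them from
# its own service definitions and capacity plans.
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    capacity_qps: int        # what the service is provisioned to handle
    rate_limit_qps: int      # what upstream callers are allowed to send
    total_servers: int
    servers_in_rollout: int  # how many servers this change touches at once

MAX_ROLLOUT_FRACTION = 0.25  # arbitrary policy: never touch >25% at once

def check(service: Service) -> list:
    """Return a list of 'compile errors' for this config change."""
    errors = []
    if service.rate_limit_qps > service.capacity_qps:
        errors.append(f"{service.name}: rate limit {service.rate_limit_qps} qps "
                      f"exceeds capacity {service.capacity_qps} qps")
    if service.servers_in_rollout > service.total_servers * MAX_ROLLOUT_FRACTION:
        errors.append(f"{service.name}: rollout touches {service.servers_in_rollout} "
                      f"of {service.total_servers} servers at once")
    return errors

if __name__ == "__main__":
    change = Service("frontend", capacity_qps=10_000, rate_limit_qps=50_000,
                     total_servers=200, servers_in_rollout=150)
    for err in check(change):
        print("ERROR:", err)  # a non-empty list means the change doesn't "compile"
```

The hard part, of course, is keeping a model like this faithful to the real
network and capacity numbers; the check is only as good as its declared inputs.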

~~~
BinaryIdiot
> Perhaps your architecture wouldn't "compile" if network traffic would go to
> the wrong place, or if a rate limit exceeds the capacity something is
> expected to handle, or if the change would impact too many servers at once.

This is an interesting idea. I'm always terrified when I have to deploy a
minor configuration change into a production system that gets distributed out;
if there were an easy way to apply some sort of check against all of it (you
know, without having to build something explicitly to do this), that would be
awesome.

I wonder if the future of something like this might be using containers. Each
piece that gets deployed lives in its own container, networked together with
the rest; you stand the whole thing up in a staging environment, ensure the
pieces are talking to each other correctly through some set of integration
tests, and then push it into the production environment.
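
Even a dumb gate along those lines would catch the obvious breakage before
anything reaches production. A minimal sketch, assuming Docker Compose is
available and using placeholder file names and health-check URLs:

```python
# Minimal staging gate: stand the stack up, hit each service's health
# endpoint, and refuse to promote if anything fails. "staging.yml" and the
# endpoint URLs are placeholders for whatever the real stack exposes.
import subprocess
import sys
import time
import urllib.request

STAGING_ENDPOINTS = [
    "http://localhost:8080/healthz",   # hypothetical frontend
    "http://localhost:8081/healthz",   # hypothetical API
]

def main() -> int:
    subprocess.run(["docker", "compose", "-f", "staging.yml", "up", "-d"],
                   check=True)
    time.sleep(10)  # crude wait; a real gate would poll with a deadline
    failures = []
    for url in STAGING_ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status != 200:
                    failures.append(f"{url}: HTTP {resp.status}")
        except Exception as exc:
            failures.append(f"{url}: {exc}")
    if failures:
        print("Staging checks failed, not promoting:")
        for failure in failures:
            print("  -", failure)
        return 1
    print("Staging looks healthy; safe to promote.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```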

~~~
DigitalJack
I wish the law could be statically checked. In the US, at least, you are at
any given moment likely breaking a law, because for each and every law there
is another that contradicts it (at least partially).

~~~
tener
Quite likely you'd first have to invent strong AI, which would give you a
chance of defining a coherent theory of law -- the law as it stands is really
a tangled mess. Pretty sure no human has the capacity to do it on their own in
the foreseeable future.

~~~
DigitalJack
I think I'd approach the problem from the other end: find a way to write laws
such that they are checkable with assertions, properties, and constraints.

~~~
BinaryIdiot
This would be amazing. If we had this, the first line of judges could just be
automated. Feed evidence in, computer finds you guilty or innocent, done. Then
if you appeal, you could seek a human judge in case judgment calls / exceptions
need to be made (so if you technically broke a law but it was actually
necessary / a good thing, you'd have recourse). First appeals could be very
quick, there'd be less traffic to the first human judge, etc.

Hell, even without automating the justice system, just doing as you suggested
would be amazing. I've often wondered about putting laws in Git or something
similar.

~~~
vertex-four
You realise that humans are very, very good at figuring out what _just_
doesn't count as breaking an extremely strictly defined law, right?

~~~
im4w1l
This could be addressed with proportional punishments. Then _just_ not
breaking the law and _just_ breaking the law give similar outcomes (nothing
vs. a slap on the wrist).

------
jread
According to network checks we have running against VMs in each GCE region,
this event resulted in about 1.6 hours of concurrent ICMP timeouts in every
region (except us-west1, which was affected for only about 10 minutes). We use
Panopta for monitoring, which verifies outages from multiple external network
paths. When outages trigger, we also use RIPE Atlas to confirm them using
hundreds of last-mile network paths, of which 85-95% resulted in timeouts.
This is the second global GCE networking outage this year. These outages are
particularly problematic because even multi-region load balancing will not
avert downtime.
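
For reference, a bare-bones version of a per-region reachability probe might
look like the sketch below (the hostnames are placeholders, the ping flags are
Linux-style, and unlike Panopta/RIPE Atlas it measures from a single vantage
point, so it can't distinguish a provider outage from a local network issue):

```python
# Ping one VM per GCE region and flag ICMP timeouts.
import subprocess

REGION_PROBES = {
    "us-west1":     "probe-us-west1.example.com",
    "us-central1":  "probe-us-central1.example.com",
    "europe-west1": "probe-europe-west1.example.com",
    "asia-east1":   "probe-asia-east1.example.com",
}

def icmp_ok(host: str, count: int = 3, timeout_s: int = 5) -> bool:
    """Return True if at least one ICMP echo reply comes back (Linux ping flags)."""
    result = subprocess.run(["ping", "-c", str(count), "-W", str(timeout_s), host],
                            capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    for region, host in REGION_PROBES.items():
        print(f"{region:15s} {'OK' if icmp_ok(host) else 'TIMEOUT'}")
```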

[https://cloudharmony.com/status-for-google](https://cloudharmony.com/status-for-google)

Prior global outage - April 11:
[https://status.cloud.google.com/incident/compute/16007](https://status.cloud.google.com/incident/compute/16007)

Disclaimer: I am the founder of CloudHarmony.

Edit: Outages were triggered due to ICMP timeouts.

~~~
sshykes
Not accusing you of being intentionally misleading, but it would be nice if
you put a clear disclaimer pointing out that you work for CloudHarmony.

~~~
jread
I founded CloudHarmony and now work for Gartner, which acquired CloudHarmony
in 2015.

~~~
simonebrunozzi
I think it would still be nice to mention that you work there.

------
bkeroack
The most disturbing part of this incident--to me--is that Google
Search/YouTube/GMail/etc. _never went down_. Not even a blip.

That means that Google is not dogfooding GCE to the degree I would hope and
expect before risking my business on the service. Disappointing, to say the
least.

Say what you will about AWS (and I've said many critical things, and will
continue to do so, publicly and privately), but when AWS has a major outage it
also affects Amazon digital products. They have major skin in the game, while
it appears that GCE is a special snowflake service completely separate from
the important, money-making services at Google.

~~~
manigandham
This always comes up, but Google has been pretty explicit in saying that they
share some internal infrastructure, while lots of existing stuff isn't running
on GCP. They have decades of customized software/hardware running already.

GCP is pretty new, and there are plenty of big customers running on it that
you can reference, like Spotify and Snapchat. Also, it _is_ a rather important
and money-making service, on track to potentially eclipse their entire ad
business.

~~~
bkeroack
Just because they say it doesn't mean it's good, or even acceptable.

"[O]n track to potentially eclipse..." sounds like the worst form of
quarterly-report spin.

~~~
manigandham
What? Good/acceptable in what sense?

This is just the size of the market: cloud computing is already a major
industry and has only just started. The potential upside for a major cloud
player dwarfs the entirety of digital advertising.

------
zzzcpan
Networking is so fragile. I feel like DNS failover to another AS in another
datacenter is the only solution for web services to actually have resilience
against such failures.

~~~
BinaryIdiot
I feel like DNS is one of the best ways to do service discovery, but its
complexity is always baffling to me. I mean, I understand _roughly_ how it all
works, but if you asked me to stand up DNS internally for various servers to
use, it... might take me a while.

I've wondered if it's simply because I don't know it well enough, or if we
need an easier, simpler solution to replace DNS in internal networks. There
are a ton of ways to do service discovery.

~~~
eropple
_> I've wondered if it's simply because I don't know it well enough_

IMO, and emphatically not ripping on you: it's this one. DNS is a pretty
simple protocol and at "you don't have ten thousand developers" scale managing
it is pretty easy. It might take you a while the first time, for sure--but it
won't the second, third, or twelfth time. =)

In particular, even setting aside explicit failover cases (which do require
some outside monitoring and staging to make useful, i.e. when do you
failover?), round-robin DNS actually does a _lot_ to provide redundancy,
internal to a single data center or deployment, when coupled with SRV records.
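
As a rough illustration of how little client code SRV-based discovery needs
(assuming the dnspython library and a made-up internal zone):

```python
# Resolve an SRV record and return (host, port) pairs, best candidates first.
# The service name below is invented; swap in whatever your zone defines.
import dns.resolver  # pip install dnspython

def discover(service: str = "_api._tcp.internal.example.com"):
    answers = dns.resolver.resolve(service, "SRV")
    # Lower priority wins; among equal priorities, higher weight is preferred.
    records = sorted(answers, key=lambda r: (r.priority, -r.weight))
    return [(str(r.target).rstrip("."), r.port) for r in records]

if __name__ == "__main__":
    for host, port in discover():
        print(f"{host}:{port}")
```

Round-robin on the A records behind each target then spreads load across
replicas without any extra infrastructure beyond the DNS server itself.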

~~~
BinaryIdiot
Fair enough. Any suggested reading material regarding this subject?

~~~
thenewwazoo
I honestly recommend just setting up a toy split-view DNS service with a
hidden master using BIND9. Get zone transfers working and maybe play with
dynamic updates, and you'll basically know all you need to know. I set up an
internal DNS infra at a previous job and was pretty surprised at how simple it
really ended up being.
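
Once a toy setup like that is running, both features can be exercised from a
few lines of Python (assuming dnspython; the zone name and master address
below are placeholders, and BIND has to be configured to allow the transfer
and update, e.g. via allow-transfer/allow-update or TSIG):

```python
# Pull a zone via AXFR, then push a dynamic update.
import dns.query
import dns.update
import dns.zone

MASTER = "10.0.0.53"        # hypothetical hidden master
ZONE = "corp.example.com"   # hypothetical internal zone

# 1. Zone transfer: fetch every record in the zone.
zone = dns.zone.from_xfr(dns.query.xfr(MASTER, ZONE))
for name, ttl, rdata in zone.iterate_rdatas():
    print(name, ttl, rdata)

# 2. Dynamic update: point an A record at a new address.
update = dns.update.Update(ZONE)
update.replace("build-box", 300, "A", "10.0.0.99")
response = dns.query.tcp(update, MASTER)
print("update rcode:", response.rcode())
```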

~~~
BinaryIdiot
Thanks! I'll take a look.

------
petewailes
There's a refreshing humility here that I admire: admitting how it went wrong
and what they've done to fix it, without mentioning any names. No attempt to
lay the blame at someone's feet, just an admission that a process failed and
that they've amended it so it doesn't happen again.

Kudos.

~~~
nine_k
Google's culture of blameless post-mortems is commendable.

In internal post-mortems, names are named when necessary for disambiguation,
without consequences to the persons named. This results in all participants
being candid, and in the real source of the problem being uncovered more
easily. (I wrote two post-mortems for small incidents I triggered.)

~~~
BinaryIdiot
> without consequences to the persons named

I would hope so (everyone makes mistakes, some of which may lead to HUGE
issues), and not giving people a chance to learn from them would be a bit
unrealistic. At the same time, however, I have been part of organizations
where something went wrong and we engineers basically all had to "take the
blame"; otherwise the single person responsible likely would have been fired.
I'm confident Google is not like that, but it's always something I think about.

~~~
ben_jones
We live in a culture where we invoke a draconian system of punishment as a
"justice" system. The idea, which does not translate to software, is that fear
will keep the local systems in line. Fear of this battle station. So the fact
that certain companies recognize a huge societal flaw and actively operate
against it is rare (imo).

------
pjlegato
Attention startups: _THIS_ is what your public postmortems should look like.

This level of detail and honesty, this level of specificity about the steps
being taken to prevent recurrence.

75% of startups try to sweep downtime under the rug and don't even acknowledge
problems happened at all -- or only do so privately in a personal support
email after hours of interrogation. Another 20% just write "we had some minor
issues but everything is fine now."

Be like Google.

------
RantyDave
I'm constantly amazed by the competence of these 'big cloud' teams; I can't
imagine the amount of preparatory work they put in. It does, however, make it
clear how badly the Internet is starting to creak.

------
itaysk
Shit happens to everyone, and this kind of error could have happened to other
clouds as well. It's the responsibility towards customers and the lessons
learned that matter.

------
josephb
Some detailed info here, which is great; however, I am more interested in
knowing 1) why, or what, led to the wrong decision being made, and 2) why it
took so long to notice and revert.

~~~
d4l3k
These configurations are absurdly finicky. If you remember, a couple of years
ago Pakistan took down YouTube with a similar Border Gateway Protocol (BGP)
misconfiguration. These things are usually caught pretty quickly, but the
configuration changes take time to propagate (in this case 30 minutes).

[http://www.cnet.com/news/how-pakistan-knocked-youtube-offline-and-how-to-make-sure-it-never-happens-again/](http://www.cnet.com/news/how-pakistan-knocked-youtube-offline-and-how-to-make-sure-it-never-happens-again/)
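
For a sense of how simple the missing guardrail can be, here's a toy
pre-flight filter in the spirit of prefix-list checks: refuse to announce any
prefix that isn't covered by the prefixes the network is actually authorized
to originate. The prefixes here are invented for the example:

```python
# Reject BGP announcements for prefixes outside our authorized set.
import ipaddress

AUTHORIZED = [ipaddress.ip_network(p)
              for p in ("203.0.113.0/24", "198.51.100.0/24")]

def allowed(announcement: str) -> bool:
    net = ipaddress.ip_network(announcement)
    return any(net.subnet_of(auth) for auth in AUTHORIZED)

if __name__ == "__main__":
    for prefix in ("203.0.113.0/25", "208.65.153.0/24"):  # the second isn't ours
        print(prefix, "OK" if allowed(prefix) else "REJECTED: not authorized to originate")
```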

------
karmicthreat
So in instances like this, where everything goes wrong, does Google have the
equivalent of a revert button to undo whatever infrastructure changes were
made?

~~~
asuffield
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an
SRE at Google. My team is oncall for this service and I know exactly what
happened here; I probably can't answer most questions you might have.)

Let's go with "yes", as the most accurate answer. As soon as I or whoever is
oncall has figured out what change was responsible, we can usually revert it
quickly and easily. Usually, if I'm oncall and I have reason to even suspect a
recent change might be the cause, I'll revert it and see if the problem goes
away.

The difficulty becomes more apparent when you realise the sheer number of
infrastructure changes being made every hour, some of which will be fixes to
other outages, and some of which will be things you can't revert because they
are of the form "that location has fallen offline; probably lost networking"
or "we are now at peak time and there are more users online". So if your
question is "can we just roll the whole world back one day" \- no, too much
has changed in that time.

~~~
karmicthreat
I know it's late, but thanks for this. It kind of reinforces the amazing size
of the system and the number of people making changes to it.

