
GCE down in all regions - vox_mollis
https://status.cloud.google.com/incident/compute/16007
======
gtaylor
I've been overall very impressed with the direction of Google Cloud over the
last year. I feel like their container strategy is much better than Amazon's
ECS in that the core is built on open source technology.

This can wipe away a lot of goodwill, though. A worldwide outage is
catastrophic and embarrassing. AWS has had some pretty spectacular failures in
us-east (which has a huge chunk of the web running within it), but I'm not
sure that I can recall a global outage. To my understanding, these systems are
built specifically _not_ to let failures spill over to other regions.

Ah well. Godspeed to anyone affected by this, including the SREs over at
Google!

~~~
MichaelGG
I'm totally impressed with gcloud. Slick, smooth interface. Cheap pricing. The
fact the UI spits out API examples for doing what you're doing is really cool.
And it's oh-so-fast. (From what I can tell, gcloud's SSD is either 10x faster
than AWS's or 1/10th the cost.)

And this is coming from a guy who really dislikes Google overall. I was
working on a project that might qualify for Azure's BizSpark Plus (they give
you something like $5K a month in credit), and I'd still prefer to pay for
gcloud than get Azure for free.

~~~
Artemis2
Same here; I was considering GCP for the future, but this is bad. I'm not
using them without some kind of redundancy with another provider. I hope they
write a good post-mortem; these are always interesting at large scale.

~~~
Gratsby
How bad is it really? They started investigating at 18:51, confirmed a problem
in asia-east1 at 19:00, the problem went global at 19:21, and was resolved at
19:26.

They posted that they will share results of their internal investigation.

That kind of rapid response and communication is admirable. There will be
problems with cloud services - it's inevitable. It's how cloud providers
respond to those problems that is important.

In this situation, I am thoroughly impressed with Google.

~~~
Artemis2
It's bad because it affected all their regions at the same time, while
competing providers have mitigations against this in place. AWS, for instance,
completely isolates its regions [1], so they can fail independently without
affecting anything else. That Google let an issue (or even a cascade of
problems) affect all its geographic points of presence really shows a lack of
maturity in the platform. I don't want to make too many assumptions, though,
and that specific problem could have affected AWS in the same way, so let's
wait for more details on their part.

The response times are what's expected when you are running one of the biggest
server fleets in the world.

1: [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html)

~~~
Gratsby
Expecting that problems which happen everywhere else won't happen with a
cloud provider is a pipe dream. They might be better at it because of scale,
but no cloud provider can always be up. It happened at Amazon, and now it's
happened at Google. Eventually, finding a provider that has never gone down
will be like finding an airline that has never crashed.

Operating across regions decreases the chances of downtime, it does not
eliminate them.

> The response times are what's expected when you are running one of the
> biggest server fleets in the world.

That may be true, but actually delivering on that expectation is a huge
positive. And more than having the right processes in place, they had the
right people in place to recognize and deal with the problem. That's not a
very easy thing to make happen when your resources cross global borders and
time zones.

Look at what happened with Sony and Microsoft - they were both down for days
and while Microsoft was communicative, Sony certainly was not. Granted, those
were private networks, but the scale was enormous and they were far from the
only companies affected.

~~~
Artemis2
> It happened at Amazon

AWS has _never_ had a worldwide outage of anything (feel free to correct me).
It's not about finding "the airline that never crashed", it's finding the
airline whose planes don't crash all at the same time. It's pretty surprising
coming from Google because 15 years ago they already had a world-class
infrastructure, while Amazon was only known for selling books on the Internet.

Regarding the response times, I recognize that Amazon could do better on
communication during an outage. They tend to wait until there is a complete
failure in an availability zone before putting the little "i" on their green
availability checkmark, and don't signal things like elevated error rates.

~~~
Gratsby
Here's an example from this thread:
[http://status.aws.amazon.com/s3-20080720.html](http://status.aws.amazon.com/s3-20080720.html)

~~~
Artemis2
I stand corrected, my statement was too broad.

AWS had two regions in 2008 [1]. That was 7 years ago, and I think you would
agree that running a distributed object storage system across an ocean is a
whole different beast than ensuring individual connectivity to servers in
2016.

1: [https://aws.amazon.com/about-aws/global-infrastructure/](https://aws.amazon.com/about-aws/global-infrastructure/)

------
avolcano
Spotify is down due to this, which is, uh, pretty hilarious:
[https://news.spotify.com/us/2016/02/23/announcing-spotify-infrastructures-googley-future/](https://news.spotify.com/us/2016/02/23/announcing-spotify-infrastructures-googley-future/)

~~~
jasonjei
Were Google services (search, email, drive, apps) impacted at all?

~~~
asadlionpk
No, they don't use gcloud to host their own apps. Yes, that's ridiculous.

~~~
morgante
Which is exactly why I won't use GCE. If Google isn't confident enough to use
it themselves, neither will I.

The fact that Amazon dogfoods AWS is a major advantage for them.

~~~
brown9-2
Even if Amazon.com uses AWS (the extent of which may be a mixture of marketing
hype and urban legend), there are many ways AWS could fail that affect
customers but leave Amazon.com unaffected.

------
silverlight
That was nuts. Interested to read the post-mortem on this one. Our site went
down as well. What could cause a sudden all-region meltdown like that? Aren't
regions supposed to be more isolated to prevent this type of thing?

Seems to have only been down for about 10 minutes, so I'm thinking some sort
of misconfiguration that got deployed everywhere... they were working to fix a
VPN issue in a specific region right before it went down...

~~~
ninkendo
To quote @DEVOPS_BORAT:

To make error is human. To propagate error to all server in automatic way is
devops.

~~~
chiph
So, you're saying the servers are aladeen?

------
HorizonXP
So in all seriousness, how do folks deal with this?

In this case, it ended up being a multi-region failure, so your only real
solution is to spread it across providers, not just regions.

But I imagine it's a similar issue to scaling across regions, even within a
provider. We can spin up machines in each region to provide fault tolerance,
but we're at the mercy of our Postgres database.

What do others do?

~~~
pjlegato
Most people just deal with it and accept that their site will go down for 20
minutes every 3-4 years or so, even when hosting on a major cloud, because:

1) the cost of mitigating that risk is much higher than the cost of just
eating the outage, and

2) their high traffic production site is routinely down for that long anyway,
for unrelated reasons.

If you really, really can't bear the business costs of an entire provider ever
going down, even that rarely (e.g. you're doing life support, military
systems, big finance), then you just pay a lot of money to rework your entire
system into a fully redundant infrastructure that runs on multiple providers
simultaneously.

There really aren't any other options besides these two.

~~~
ghshephard
If you are doing life support, military systems, or HA big finance, then you
are quite likely to be running on dedicated equipment, with dedicated
circuits, and quite often highly customized/configured non-stop
hardware/operating systems.

You are unlikely to be running such systems on AWS or GCE.

~~~
creshal
And that's why IBM is still in the server business: There's nothing like a
mainframe when it comes to uptime.

~~~
ghshephard
HP also has some good products in the highly available space -
[http://h20195.www2.hp.com/v2/getpdf.aspx/4aa4-2988enw.pdf](http://h20195.www2.hp.com/v2/getpdf.aspx/4aa4-2988enw.pdf)
, likely from their acquisition of Tandem.

~~~
creshal
> likely from their acquisition of Tandem.

Yep. Those were originally Itanium-only, so their success was somewhat…
limited, compared to IBM's "we're backwards compatible to punch cards"
mainframes.

Only recently did Intel start porting mission-critical features like CPU
hotswap over to Xeons so they can finally let the Itanic die; hopefully we're
going to see more x86 devices with mainframe-like capabilities.

------
alecbaldwinlol
Guys, I've got it- instead of locking ourselves in with one vendor's platform
and being subject to their mismanagement and arbitrary rules, why don't we buy
our own hardware and do it ourselves?

We can free up our OpEx budget too! My sales rep sent me a TCO that shows it
is way cheaper to run a data center than to pay a cloud subscription!

I'm calling the CFO!

~~~
magic_man
Will your own hardware have the same reliability as Google's?

~~~
thawkins
I think it was sarcasm......

------
simonebrunozzi
AWS had several major outages in the past, especially between 2009 and 2012.
In some cases it was not only downtime but also data loss, which is the
hardest part. There are 8,760 hours in a year; if you are down for a total of
less than 8.7 hours, you're in the >99.9% uptime category (also called "three
9s"). Four 9s (99.99%) is considered a nice plus. Very few businesses really
need that.

However, uptime is one thing. Data loss is another different beast.

AFAIK, this one for Google is only downtime. Being able to keep GCE up for
most of the year, except for a few hours, means more than 99.9% availability,
which is what most customers need.

Operational excellence, or the ability to have your cloud up and running,
comes only with a large customer set; Google is now gaining a lot of
significant customers (Spotify, Snapchat, Apple, etc), and therefore I expect
them to learn what's needed over the coming months.

2016 will be ok. 2017, in my view, will be near perfect.

If Google wants to differentiate themselves from AWS, they should offer an SLA
on data integrity (at a premium, obviously). That's how you get thousands of
enterprise customers.

Shameless plug: I've also extensively written about AWS, Azure and GCE here:
[https://medium.com/simone-brunozzi/the-cloud-wars-of-2016-3f87e0a03d18](https://medium.com/simone-brunozzi/the-cloud-wars-of-2016-3f87e0a03d18)

------
ikeboy
Great timing for that new book :)

------
Artemis2
Seems kinda misleading for Google to claim repeatedly that they are hosted on
the very same infrastructure as GCP, yet not go down with it.

EDIT: Switched from "dishonest" to "misleading"; while it's abundantly clear
that Google doesn't run on GCP, GCP feels like a second-class citizen at
Google because you just cannot get Google uptime with it.

~~~
dsymonds
Google and GCP run on the same infrastructure, but this was a GCP problem, not
a problem with that common infrastructure.

~~~
remosi
(I'm a Google SRE, I'm on the team that dealt with this outage)

This did impact common infrastructure. Some (non-cloud) Google services were
impacted. We've spent years working on making sure gigantic outages are not
externally visible for our services, but if you looked very closely at latency
to some services you might have been able to see a spike during this outage.

My colleagues managed to resolve this before it stressed the non-cloud Google
services to the point that the outage was "revealed". If this was not
mitigated, the scope of the outage would have increased to include non-cloud
Google services.

------
jpatokal
Back up as of 19:27 US/Pacific:
[https://status.cloud.google.com/incident/compute/16007](https://status.cloud.google.com/incident/compute/16007)

------
mrdrozdov
I do not envy the current on-call rotation.

~~~
spyspy
I don't envy anyone with an on-call job.

~~~
jethro_tell
What do you recommend? I figure that if you're working on something without
oncall, no one probably cares about it anyway. I'd rather have a good rotation
than no rotation.

~~~
dredmorbius
Staffing such that on-call is handled by staff presently in the office. This
is, as I understand it, pretty much what Google does. When you're in the
office, you're in the office, but when you're not, you're not. Having global
coverage means ops in several timezones, and this is what Google accomplishes.

Not knowing when, at any time, your phone or pager will go off wears on you in
interesting ways over time.

~~~
kyrra
It depends on the team and the type of oncall rotation for the service. My
team (a SWE team) has its own oncall rotation, as we don't have dedicated SREs
for all of our services.

Since we're US-based only, it means the oncall person will have pager duty
while they sleep. Our pager can be a bit loud at night due to the nature of
our services, so it's definitely not for everyone (luckily it's optional).

~~~
dredmorbius
Is this at Google?

I'll note you're SWE, not SRE. I'm talking _mostly_ about dedicated Ops crews
on pager.

It's one thing if you're responding to pages resulting from other groups'
coding errors or failure-to-build sufficiently robust systems. Another if
you're self-servicing.

One of my own "take this job and shove it" moments came after pages started
rolling in at 2am, bringing me on-site until 6am. I headed back for sleep,
showed up that afternoon, and commented on the failure of anyone on the dev
team to answer calls/pages/texts (the site was falling over, and I had
exceptionally limited access capabilities and was new on the team). The
response was shrugs.

Mine was "That wasn't your ass being hauled out of bed. See ya."

~~~
kyrra
_The opinions stated here are my own, not necessarily those of Google._

Yes, it is at Google. Our important and high visibility bits have SREs that
help monitor our services (SREs actually approached us to take over some bits
that were more important).

Google has a lot of oncall people who are never going to go into a data center
(most Googlers never see a data center). So there are lots of oncall rotations
with an SLA that can still be handled from bed if something happens at 2am.

(I sadly can't give any examples)

------
avs733
There has to be a certain level of karma/schadenfreude in this happening the
week they are pushing their SRE book... did they handle it well? It seems so,
but a lot of their book is about an ounce of prevention over a pound of
on-call pagers going off.

~~~
packetslave
Prevention is a big part of SRE, but an equally big part is formalizing a
process to learn from the inevitable outages that come with running a large,
complex, distributed system built by fallible humans.

You figure out what went wrong and fix it, of course, but more importantly,
you figure out where your existing systems and processes (failover,
monitoring, incident response, etc.) did and didn't work, and you improve them
for the next time.

------
paulsutter
Ask HN: Is anyone using different cloud providers for failover and what's your
DNS configuration?

Do any cloud providers allow announcing routes for anycast DNS?

~~~
retrogradeorbit
I, too, would be interested in info on any cloud providers that support
anycast.

~~~
Artemis2
I'm no networking expert but packet.net has a page on this:
[https://www.packet.net/bare-metal/network/anycast/](https://www.packet.net/bare-metal/network/anycast/)

~~~
NetStrikeForce
I see the benefit of using anycast for your DNS, but is anycast actually a
better option than DNS load balancing for my site? The idea behind using
anycast is to use at least two different providers, so having only packet.net
doesn't really cut it. Also, I can do DNS load balancing with any provider by
using something like Azure's Traffic Manager, so I struggle to see the
advantages.

------
Scarbutt
So which Google services went down with this? Looks like they are not eating
their own dog food?

~~~
chipperyman573
Google is self-hosted, but they might not use the same hardware GCE uses.

~~~
benley
Different hardware isn't really part of the equation - it's more that most of
Google's internal systems aren't _on_ GCE, but _adjacent to_ GCE. There's a
cloud beneath that cloud, so to speak.

------
tempestn
Google Custom Search also seems to have gone down globally today. Likely
related; although GCE is back up, CSE is still out, leaving many sites without
an international search feature.

------
kahwooi
I did not choose GCE because it does not have a server in Asia Pacific, while
Microsoft Azure, DigitalOcean, and AWS each have one. Sorry, correction: what
I mean is South East Asia.

~~~
gresrun
They have three zones in Asia Pacific[0].

[0]
[https://cloud.google.com/compute/docs/zones#available](https://cloud.google.com/compute/docs/zones#available)

~~~
kahwooi
Sorry, I mean South East Asia. [https://azure.microsoft.com/en-us/status/](https://azure.microsoft.com/en-us/status/)

------
salilpa
i migrated from aws to google yesterday. fml

------
cellularmitosis
Time Warner Cable's DNS server went down at roughly the same time (in Austin).
I'm hoping that's just a coincidence.

------
novaleaf
Correct me if I'm wrong, but it looks like Cloud VPN went down, not all of
GCE.

FYI I run about 15 servers in Asia, USAEast, and Europe on GCE with external
monitoring and didn't get a peep from my error checking emails during that
timeframe.

------
qaq
I know this will get downvoted, but clouds suck, and this is just one more
manifestation of why. Unless you have a very spiky workload, save yourself
long-term pain and don't go this route (applies if your monthly AWS/GCE/Azure
bill is over a few K).

~~~
virmundi
I'm not seeing why not. Your own data center could go down for a myriad of
reasons (your ISP goes down, HDs fail, someone trips on a power cable, etc.).
If that happens, you're pretty much screwed. You could compensate by having
multiple data centers with different infrastructure providers. If you do,
you're probably spending more than the few K you referenced in your post.

Yes, it's bad that apparently all of the regions failed. Google will hear
about it. People will get in trouble. But a screw-up at this level is rare. If
you use a cloud, or even a VPS provider like Linode, you get auto-failover and
someone who is contractually obligated to deal with failures.

~~~
qaq
You are paying a penalty in complexity, latency, and poor tenant isolation
when running on "cloud infrastructure," and when things blow up you have no
recourse.

~~~
Aoreias
Do you have any examples of poor tenant isolation in AWS, GCE, or Azure?

Cloud complexity is also lower because you don't have to worry about power,
cooling, upstream connectivity, capacity budgeting, etc. If 99.9-99.95%
availability is fine for your application then you probably don't have to
worry about your provider either.

~~~
qaq
On AWS, Netflix consumes enough resources that if they spike 40-50%, everyone
is screwed. The software required to run a cloud like AWS is orders of
magnitude more complex than what the average project would need, and it
results in major screwups. Both major AWS outages were due to control plane
issues; the second was the result of a massive Netflix migration that
triggered throttles for everyone in the affected AZs. The throttles had been
put in place after the first major outage, which lasted many hours.

------
awinter-py
five nines = 45?

~~~
fixermark
No; 999.99% uptime.

~~~
awinter-py
1000% = 10 days per year

------
max_
so much for the SRE book they published last week

~~~
kyrra
16-17 minutes of downtime isn't all that bad if you consider that the SLA for
GCE is 99.95%:
[https://cloud.google.com/compute/sla](https://cloud.google.com/compute/sla)

So they can have 262 minutes of downtime a year and still be within their SLA.
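For what it's worth, the error-budget math works out like this (a quick
sketch; the 17 minutes and 99.95% figures are from this thread, not an
official calculator):

```python
# Annual downtime budget implied by a 99.95% uptime SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

sla = 0.9995
budget = MINUTES_PER_YEAR * (1 - sla)
print(f"{budget:.0f} minutes/year of allowed downtime")  # ~263 minutes

outage = 17  # this incident, in minutes
print(f"{outage / budget:.0%} of the annual budget")  # ~6%
```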

------
chris_wot
Hmmm... would this have affected Netflix?

~~~
TheDong
It's fairly well known that Netflix runs primarily on AWS.

So probably not.

~~~
fs111
How come Netflix is working on IPv6, then, when AWS does not offer it?

~~~
notpeter
ELBs do IPv6 at the edge and everything else (ELB->EC2) is IPv4.

~~~
fidget
Though notably, that applies to EC2-Classic ELBs only.

