
Google Kubernetes Engine's third consecutive day of service disruption - rlancer
https://status.cloud.google.com/incident/container-engine/18005
======
shareometry
I am currently evaluating GCP for two separate projects. I want to see if I
understand this correctly:

1) For three whole days, it was questionable whether or not a user would be
able to launch a node pool (according to the official blog statement). It was
also questionable whether a user would be able to launch a simple compute
instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions.
Therefore, in consideration of item 1 above, it was questionable/unpredictable
whether or not a user could launch a node pool or even a simple node anywhere
in GCP at all.

3) The sum total of information about this incident consists of a few one- or
two-sentence blurbs on Google's blog. No explanation or outline of the scope
of affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by
Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP
support, even over a time span of 40+ hours since the support request was
submitted.

6) Google says they'll provide some information when the next business day
rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do
correct me if I got something wrong in this summary.

~~~
manigandham
When everything works, GCP is the best. Stable, fast, simple, reliable.

When things stop working, GCP is the worst. Slow communications and they
require way too much work before escalating issues or attempting to find a
solution.

They already have the tools and access, so most issues should take them
minutes to gather diagnostics, but instead they keep sending tickets back for
"more info", inevitably followed by a hand-off to another team in a different
time zone. We have spent days trying to convince them there was an issue
before, which just seems unacceptable.

I can understand support costs, but there should be a test (with all vendors)
where I can officially certify that I know what I'm talking about and don't
need to go through the "prove it's actually a problem" phase every time.

~~~
laurencei
As someone who works for Government and Enterprise - all I care about
sometimes is how a company behaves when everything goes wrong.

For the Government organizations I have dealt with, the issue with outages is
rarely the outage itself - what matters is strong communication about what is
occurring, realistic approximate ETAs, and options for mitigation.

Being able to tell the Directors/Senior managers that issues have been
"escalated" and providing regular updates are critical.

If all I could say was that a "support ticket" was logged and we are waiting
on a reply (hours later) - I guarantee the conversation after the outage is
going to be about moving to another solution provider with strong SLAs.

~~~
totallyashill
Very similar thing at our office. Considering the scale at which we run
things, any outage could mean a potential loss of millions _every minute_.

Sure, we use support tickets with vendors for small things. Console button
bugging out, etc. But for large incidents, every vendor has a representative
within an hour driving distance and will be called into a room with our
engineers to fix the problem. This kind of outage, with zero communication,
means the dropping of a contract.

Communication is critical for trust, especially if we're running a business
off it.

~~~
y4mi
Going single cloud on that scale is simply irresponsible though.

You need failovers to different providers, and ideally also your own hardware
for general workloads.

And suddenly the CEO doesn't care anymore if one of your potential failovers
behaves flaky in specific circumstances.

Not saying it's good as it is.. communication as a SaaS provider is - as you
said - one of the most important things... But this specific issue was not as
bad as some people in this thread insinuate.

~~~
softawre
Agree, if we are really talking about millions per minute (woah), then you can
afford to fail over to AWS.

------
usmannk
We had an issue a few weeks ago where the Google front-end servers were
mangling responses from Pub/Sub and returning 502 responses, making the
service completely unusable and knocking over a number of things we have
running in production. Despite paying for enterprise support and having a P1
ticket filed, we had to spend Friday to Sunday gathering evidence to prove to
the support staff that there was indeed a problem, because their monitoring
wasn't detecting it. Right now I'm doing something similar (and have been
since Friday!) but for TLS issues they're having. Again, because their support
reps don't believe there's a problem. There are so many more problems than
they ever show on their status page...

~~~
Jedi72
They work for Google so obviously they are much smarter than you. If there's a
problem it's probably the customer's fault. /sarcasm

~~~
mbrumlow
I was so mad to read that until you said /sarcasm :p

That being said, I really do think there is a difference between who is
working at Google today and the Google we all fell in love with pre-2008.

I am sure there are amazing people still working at Google, but nowhere near
as many as there were.

The way I like to think about Google is that some amazing people made an
awesome train that builds tracks in front of it -- you could call them gods,
maybe -- but those people are gone -- or at least the critical mass required
to build such a train has dwindled to dust. What we have left is an awesome
train full of people pulling the many levers left behind.

To make things even worse, my last interview as an SRE left me wondering
whether the people who are there know this as well, and whether they are
actually working hard to keep out those who might shed light on it. I don't
say that because I did not get the job -- I am actually happy I was not
extended an offer.

I say this with one exception: the old-timer who was my last interviewer. I
could tell he was dripping with knowledge and eager to share it with anyone
who would listen. I came out of his 45-minute session having learned many
things -- I would actually pay to work with a guy like that.

I would also like to point out that the work ethic was not what I expected. I
was told that when on call, my duty was to figure out whether the root cause
was in the segment I was responsible for. I don't know about you, but if my
phone rings at night I am going to see the problem through to resolution and
understand it in full -- even if it is not in the segment I was assigned.

/end rant

~~~
blub
You've managed to roll the myth of 10x engineers, start-up geniuses, nostalgia
and gut feelings into one message of very dubious veracity.

At the same time you ignored the massive complexity and size of Google
compared to what they were at the beginning.

This is voodoo organisational analysis.

~~~
engineeringwoke
10x engineers are real. Start-up geniuses are real. The large majority of
people have their heads up their asses. Wake up and smell the coffee

~~~
mav3rick
You can't find and hire geniuses for every component. Systems should scale
with the average engineer in mind (so should code). We have all smelt the
coffee and it smells even better when your team is efficient and well rested.

~~~
engineeringwoke
I don't disagree. Building around 10x'ers in this way creates god complexes
and unhappy "senior" engineers (since they don't do any of the cool work).
Having one very early still juices your productivity

------
Jedi72
"The data says engagement is down 46%, I think its time we drop the product."

\- Someone at Google right now, probably.

~~~
justinsb
I can assure you that's not the case! Also, while people like to repeat this
meme, Google Cloud does have a formal deprecation policy
([https://cloud.google.com/terms/](https://cloud.google.com/terms/)), whose
intent is to give you some assurances.

(I work at Google, on GKE, though I am not a lawyer and thus don't work on the
deprecation policy)

~~~
chrisseaton
> Google may discontinue any Services or any portion or feature for any reason
> at any time without liability to Customer

for any reason

at any time

~~~
uluyol
Nice job cherry picking text.

> 7.1 Discontinuance of Services. Subject to Section 7.2, Google may
> discontinue any Services or any portion or feature for any reason at any
> time without liability to Customer.

Let's take a look at Section 7.2:

> 7.2 Deprecation Policy. Google will announce if it intends to discontinue or
> make backwards incompatible changes to the Services specified at the URL in
> the next sentence. Google will use commercially reasonable efforts to
> continue to operate those Services versions and features identified at
> [https://cloud.google.com/terms/deprecation](https://cloud.google.com/terms/deprecation)
> without these changes for at least one year after that announcement, unless
> (as Google determines in its reasonable good faith judgment):
>
> (i) required by law or third party relationship (including if there is a
> change in applicable law or relationship), or
>
> (ii) doing so could create a security risk or substantial economic or
> material technical burden.
>
> The above policy is the "Deprecation Policy."

To me that looks like a reasonable deprecation policy.

~~~
scoot
> To me that looks like a reasonable deprecation policy.

It might be, until they jack up the prices 15X with limited notice (looking at
you, Google maps [1]). No deprecation needed, just force users off the
platform unless they're willing to pay a massive premium.

[1]
[https://www.google.com/search?q=google+maps+price+increase](https://www.google.com/search?q=google+maps+price+increase)

~~~
jkaplowitz
Google Maps has never been subject to that policy, unlike GCP services. These
org chart divisions are real but only clear to Googlers, Xooglers (I'm in this
category), and people who pay extremely close attention.

The fact that they're all Google makes reputation damage bleed across
meaningfully different parts of what's in truth now a conglomerate under the
umbrella name Google.

~~~
jjeaff
Except all the Google maps setup and API keys are generated from the gcp UI
and the billing happens on the cloud platform as well. While maps didn't start
as a gcp product, they seem to have rolled it in to gcp fully.

~~~
jkaplowitz
Not fully. Really what happened is they did a re-org that gave them Google
Cloud as an umbrella brand including GCP, Google Maps Platform (this new
version of Google Maps as a commercial service), Chrome, Android, G Suite...

The bit of Maps Platform integration for management of the billing and API
layer was called out in the announcement blog as an integration with the
console specifically, and the docs and other branding around Maps Platform
remain distinct from GCP still in excessively subtle ways that Googlers pay
more attention to than everyone else, like hosting the docs on
developers.google.com instead of cloud.google.com and having Platform in its
name separately from Cloud Platform.

This stuff makes sense to Googlers not only because of the org chart but also
because Google has a pretty unified API layer technology and because Google
put in a lot of work to unify billing tech & management. Reusing that is
efficient but not always clear.

But you're right to be confused. Their branding is a mess and always has been.
This is the same company that thought Google Play Books makes sense as a
product name.

Google's product / PR / comms / exec people are very bad at understanding how
external people who don't know Google's org chart and internal tech will
perceive these things, or at least bad at prioritizing those concerns.

They live and breathe their corporate internals too much to realize this. Some
Google engineers and tech writers realize the confusion but pick other battles
to fight instead (like making good quality products).

They do at least document which services are subject to the GCP Deprecation
Policy (Maps is not there):
[https://cloud.google.com/terms/deprecation](https://cloud.google.com/terms/deprecation)

As for what products are actually part of GCP, it's the parts of this page
that aren't an external partner's brand name, aren't called out separately
like G Suite or Cloud Identity or Cloud Search, and aren't purely open source
projects like Knative and Istio (as opposed to the productized versions within
GCP), with the caveat that the level so far of integration into GCP of Google
acquisitions like Apigee, Firebase, and Stackdriver varies depending on per-
company specifics:
[https://cloud.google.com/products/](https://cloud.google.com/products/)

G Suite and Cloud Identity accounts can be used with GCP, just like any other
Google accounts. They are part of Google Cloud but not Google Cloud Platform.

Hope I waded through the mess correctly for you. :)

------
justinsb
Hi - I work at Google on GKE - sorry about the problems you're experiencing.
There's a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update
the status dashboard correctly. But we're double checking that and looking
into some of the additional things you all have reported here.

~~~
antpls
The status dashboard is inaccurate and/or a lie. It only mentions the GKE
incident, while in fact the problem also impacts Google Compute Engine users.
I was unable to create any google compute instance today, not even a basic
1vcpu, on NA and Europe-west.

As another comment pointed out, what's the point of having so many zones and
redundancy around the globe if such a global failure can still happen? I
thought the "cloud" was supposed to make this kind of failure impossible.

~~~
carbocation
> I was unable to create any google compute instance today, not even a basic
> 1vcpu, on NA and Europe-west.

I've been creating GCP instances in us-central1-a and us-central1-c today
without issue. Which zone were you using in NA?

I have been noticing unusual restarts, but I haven't been able to pin down the
cause yet (may be my software and not GCP itself).

~~~
antpls
Tried on us-east, us-north, europe-west, also tried asia, with different
instance sizes and with both UI and CLI. None worked for me.

~~~
pfd1986
Same here.

------
hacknat
Question to Google employees:

Why do you guys suffer global outages? This is your second major global outage
in less than 5 years. I'm sorry to say this, but from a trust perspective it
is the equivalent of going bankrupt. I need to see some blog posts about how
you guys are rethinking whatever design can lead to this - twice - or you are
never getting a cent of money under my control. You have the most feature-rich
cloud (particularly your networking products), but downtime like this is
unacceptable.

~~~
toomuchtodo
> I’m sorry to say this, but it is the equivalent of going bankrupt from a
> trust perspective.

It's the opposite really: the expectation that service providers have no
unexpected downtime is unrealistic, and it's strange this idea persists.

~~~
Twirrim
(disclaimer: I work for another cloud provider)

I agree that, in general, outages are almost inevitable, but global outages
shouldn't occur. It suggests at least one of two things:

1) Bad software deployments, without proper validation. A message elsewhere in
this post on HN suggests that problems have been occurring for at least 5
days, which makes me think this is the most likely situation. If so, given
that this is multiple days into the issue, rolling back presumably isn't an
option. That doesn't say good things about their testing or deployment
stories, and possibly their monitoring of the product: even if the deployment
validation processes failed to catch it, you'd really hope alarming would
have.

or:

2) Regions aren't isolated from each other. Cross-region dependencies are bad,
for all sorts of obvious reasons.
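The deployment story in (1) comes down to staged, per-region rollout with validation gates between stages. A minimal sketch of the idea, assuming nothing about any provider's real tooling (the region names and the `deploy`/`validate` callables are hypothetical stand-ins):

```python
def staged_rollout(regions, deploy, validate):
    """Deploy to one region at a time, stopping at the first failed validation.

    `deploy(region)` pushes the new build; `validate(region)` returns True
    when the region looks healthy afterwards. Both are supplied by the
    (hypothetical) deployment system. Returns the regions rolled out so far.
    """
    done = []
    for region in regions:
        deploy(region)
        if not validate(region):
            # A bad build is caught here: one region is affected,
            # instead of every region at once.
            break
        done.append(region)
    return done
```

With this shape, a bad build that slipped past pre-release testing only takes out the first region it reaches rather than going global, which is exactly the isolation the comment above says was missing.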

~~~
toomuchtodo
They shouldn't, but they do. S3 goes down [1]. The AWS global console goes
down, right after Prime Day outages [2]. Lots of Google Cloud services go down
[3, current thread]. Tens of Azure services go down hard [4].

Are software development and release processes improving to mitigate these
outages? We don't know. You have to trust the marketing. Will regions ever be
fully isolated? We don't know. Will AWS IAM and console ever not be global
services? We don't know.

Blah blah blah "We'll do better in the future". Right. Sure. Some service
credits will get handed out and everyone will forget until the next outage.

Disclaimer: Not a software engineer, but have worked in ops most of my career.
You will have downtime, I assure you. It is unavoidable, even at global scale.
You will never abstract and silo everything per region.

[1]
[https://www.theregister.co.uk/2017/03/01/aws_s3_outage/](https://www.theregister.co.uk/2017/03/01/aws_s3_outage/)

[2] [https://www.cnbc.com/2018/07/16/aws-hits-snag-after-
amazon-p...](https://www.cnbc.com/2018/07/16/aws-hits-snag-after-amazon-prime-
day-downtime.html)

[3] [https://www.cnet.com/news/google-cloud-issues-causes-
outages...](https://www.cnet.com/news/google-cloud-issues-causes-outages-for-
snapchat-spotify-and-others/)

[4] [https://www.datacenterknowledge.com/uptime/microsoft-
blames-...](https://www.datacenterknowledge.com/uptime/microsoft-blames-
severe-weather-azure-cloud-outage)

~~~
ignoramous
Can't speak for Google, but Facebook and Salesforce chose Cells for HA.

[http://highscalability.com/blog/2012/5/9/cell-
architectures....](http://highscalability.com/blog/2012/5/9/cell-
architectures.html)

~~~
toomuchtodo
Doesn't look like it was all that helpful to Facebook (as of 1542038976).
Facebook.com errors out currently.

> Facebook Platform Appears to be down

> A check of
> [https://developers.facebook.com/status/dashboard/](https://developers.facebook.com/status/dashboard/)
> returns an error and I'm unable to login with facebook to some of my mobile
> apps.

[https://news.ycombinator.com/item?id=18434262](https://news.ycombinator.com/item?id=18434262)

------
scarface74
Say I were a CTO (I’m nowhere near it), why would I choose GCP over AWS or
Azure? Even if after doing a technical assessment and I thought that GCP was
technically slightly better, if something happened, the first question I would
be asked is “why did you choose GCP over AWS?”

No one would ever ask why you chose AWS. The old "no one ever got fired for
buying IBM".

Even if you chose Azure because you're a Microsoft shop, no one would question
your choice of MS. Besides, MS is known for their enterprise support.

From a developer/architect standpoint, I’ve been focused the last year on
learning everything I could about AWS and chose a company that fully embraced
it. AWS experience is much more marketable than GCP. It’s more popular than
Azure too, but there are plenty of MS shops around that are using Azure.

~~~
013a
- Native integration with G-Suite as an identity provider. Unified
permissions modeling from the IDP, to work apps like email/Drive, to cloud
resources, all the way into Kubernetes IAM.

- Security posture. Project Zero is class-leading, and there's absolutely a
"fear-based" component there, with the open question of who Project Zero will
share a newly discovered exploit with before going public. The upcoming
Security Command Center product looks miles ahead of the disparate and poorly
integrated solutions AWS or Azure offer.

- Cost. Apples to apples, GCP is cheaper than any other cloud platform.
Combine that with easy-to-use models like preemptible instances, which can
reduce costs further; deploying a similar strategy on AWS takes substantially
more engineering effort.

- Class-leading software talent. Google has proven to be at the forefront of
new CS research and to pivot it into products that software companies depend
on; you can look all the way back to BigQuery, their AI work, or more recently
Spanner or Kubernetes.

- GKE. It's miles ahead of the competition. If you're on Kubernetes and it's
not on GKE, then you've got legacy reasons for being where you are.

Plenty of great reasons. Reliability is just one factor in the equation, and
GCP definitely isn't that far behind AWS. We have really short memories as
humans, but too soon we seem to forget Azure's global outage just a couple of
months ago due to a weather issue at one datacenter, or AWS's massive us-
east-1 S3 outage caused by a human incorrectly entering a command. Shit
happens, and it's alright. As humans, we're all learning, and as long as we
learn from this and get better, that's what matters.

~~~
scarface74
Your response is from a geek's viewpoint. No insult intended; I'm first and
foremost a 30-year computer geek myself - started programming in 65C02
assembly in 6th grade and still mostly hands-on.

But, whether it is right or not, as an architect/manager, etc., you have to
think about more than just what's best technically. You also have to manage
your reputational risk if things go south and, less selfishly, how quickly you
can find someone with the relevant experience.

From a reputation standpoint, even if AWS and GCP have the same reliability,
no one will blame you if AWS goes down, provided you followed best practices.
If an AWS resource went down globally, you're in the same boat as a ton of
other people. If everyone else was up and running fine but you weren't because
you were on the distant third cloud provider, you don't have as much cover.

I went out on a limb and chose HashiCorp's Nomad as the basis of a make-or-
break-my-job project I was the dev lead/architect for, hoping like hell things
didn't go south, because the first thing people were going to ask me was why I
chose it. No one had heard of Nomad, but I needed a "distributed cron" type
system that could run anything, and it was on-prem. It was the right decision,
but I took a chance.

From a staffing standpoint, you can throw a brick and hit someone who at least
thinks they know something about AWS or Azure; GCP, not so much.

It's not about which company is technically better, but I didn't want to
ignore your technical arguments...

 _Native integration with G-Suite as an identity provider. Unified permissions
modeling from the IDP, to work apps like email/Drive, to cloud resources, all
the way into Kubernetes IAM._

You can also do this with AWS - use a third party identity provider and map
them to native IAM user and roles.

[https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_cr...](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-
idp.html)

 _Cost. Apples to apples, GCP is cheaper than any other cloud platform.
Combine that with easy-to-use models like preemptible instances which can
reduce costs further; deploying a similar strategy to AWS takes substantially
more engineering effort._

The equivalent would be spot instances on AWS.

From what (little) I know about preemptible instances, it seems somewhat
random when they get reclaimed, but Google tries to be fair about it. The
analogous thing on AWS would be spot instances, where you set the amount you
want to pay.

 _Class leading software talent. Google is proven to be on the forefront of
new CS research, then pivoting that into products that software companies
depend on; you can look all the way back to BigQuery, their AI work, or more
recently in Spanner or Kubernetes._

All of the cloud providers have managed Kubernetes.

As far as BigQuery. The equivalent would be Redshift.

[https://blog.panoply.io/a-full-comparison-of-redshift-and-
bi...](https://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery)

 _Reliability is just one factor in the equation, and GCP definitely isn't
that far behind AWS_

Things happen. I never made an argument about reliability.

~~~
maktouch
> The equivalent would be spot instances on AWS.

They're equivalent in the sense that you have nodes that can die at any time,
but spot is much more complicated. You could technically get a much lower cost
on AWS by aggressively bidding low, but we've had a few instances where the
node only lived a few minutes.

Preemptible nodes live at most 24h, and from our stats, they really do live
around that long. I think the lowest we've had was a node dying after 22h.

You also save out of the box, because a discount is applied automatically when
your instance runs for a certain number of hours.

You can get a further discount by agreeing to committed use, which you pay for
per month instead of one-shot, unlike AWS.
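To make the automatic discount concrete: at the time, GCP's sustained use discount billed each successive quarter of the month at a lower rate (100%, 80%, 60%, 40% of list price for N1-style machines), for up to 30% off a full month. A rough sketch of that math, with the tier values assumed from GCP's published pricing docs:

```python
def sustained_use_multiplier(fraction_of_month: float) -> float:
    """Effective price multiplier under GCP-style sustained use discounts.

    Each successive quarter of the month is billed at a lower rate; the
    tier values below are assumed from GCP's published N1 pricing at the
    time (not a guaranteed-current schedule).
    """
    tiers = [(0.25, 1.00), (0.25, 0.80), (0.25, 0.60), (0.25, 0.40)]
    remaining = min(max(fraction_of_month, 0.0), 1.0)
    if remaining == 0.0:
        return 1.0  # nothing used: list price by convention
    total = remaining
    billed = 0.0
    for width, rate in tiers:
        used = min(remaining, width)
        billed += used * rate  # this tier's share at its discounted rate
        remaining -= used
        if remaining <= 0.0:
            break
    return billed / total
```

Running an instance the whole month gives a 0.70 multiplier (30% off) and half a month gives 0.90 (10% off) -- the discount kicks in by itself, with no reservation or up-front commitment required.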

I'm going to add a few more reasons to the above reply:

- The UI and CLI are so much better in GCP.

I don't have to switch between 20 regions to see my instances/resources. From
one screen, I can see them all and filter however I like.

- GCP encourages creating different projects under the same billing account.

It's doable in AWS too, of course, but coupled with the fact that you have
different projects and regions, and you can't see all instances of a project
at once, this makes for a super bad experience.

- Networks are so much better in GCP.

Out of the box, your regions are connected and have their own CIDR ranges.
Doing that in AWS is complicated.

- BigQuery integration is really good.

A lot of logs and analytics can be exported to BigQuery, such as billing or
storage access. Couple that with Data Studio and you have non-technical people
building dashboards.

- Kubernetes inside GCP is a lot better than AWS'

[https://blog.hasura.io/gke-vs-aks-vs-
eks-411f080640dc](https://blog.hasura.io/gke-vs-aks-vs-eks-411f080640dc)

- Firewall rules > EC2 security groups

- A lot of small quality-of-life touches that make the experience a lot better
overall

... like automatically managing SSH keys for instances, instead of having a
master SSH key and sharing it.

Here's the thing though, a lot of GCP can be replicated, just like what you
linked for the identity provider. With GCP, there's a lot of stuff out of the
box -- so dev and ops can focus on the important stuff.

Overall, AWS is just a confusing mess and offers a very bad UX. Moving to GCP
was the best move we've made.

~~~
halbritt
Not much to add here other than this reflects my experience as well.

Moved for bizdev reasons, and really appreciated the improved quality of life.

------
AlexB138
This has been going on longer than three days. We have been dealing with this
exact issue since at least Monday (11/5) morning in us-central1.

~~~
splap
same here. using gcloud, not web console

------
marcinzm
>Nov 09, 2018 05:59

>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

Wait, did the people tasked with fixing this just take the weekend off?

~~~
jasonlotito
The people tasked with fixing this aren't the ones providing the updates.

~~~
marcinzm
Fair point, but it still seems odd that the people providing updates took the
weekend off during a large-scale, customer-impacting issue. I'm sure all the
people spending the weekend trying to mitigate the impact of this on their
infrastructure would love to have timely updates.

~~~
izacus
It's weekend, why wouldn't you take it off? It's just silly software.

~~~
jschwartzi
More to the point, why would you depend on Google for any critical
infrastructure after this?

~~~
user5994461
Because you tried to run your own infrastructure and it was so much worse.

------
rlancer
The status page is inaccurate, as the issues don't only affect the web UI;
the same operations are failing via the CLI.

~~~
pm90
Its kinda strange that HN seems to be the most effective way to give feedback
to Google Cloud :/

~~~
Draiken
I also find it weird that on HN, where people are normally very skeptical of
any argument without data backing it, when it comes to this outage people
assume everything written here affects everyone.

Perhaps some of the issues are localized? Perhaps it's even user error (it
happens, you know?). But because a small number of HN users say "it's
everywhere!", suddenly people reach for their pitchforks.

Sometimes we just don't have all the information.

------
scarface74
A generic question: Our company is completely dependent on AWS. Sure we have
taken all of the standard precautions for redundancy, but what happened here
could just as easily happen with AWS - a needed resource is down globally.

What would a small business do as a contingency plan?

~~~
geggam
Infrastructure as code.

Terraform using AMIs, plus Chef recipes that work in the cloud and on bare
metal. Don't use AWS-specific services.

This would let you spin over to another cloud provider, vSphere, or bare metal
with minimal work.

~~~
xchaotic
I think you are downplaying minimal

~~~
Twirrim
I don't think OP was intending "minimal" to mean it would be easy to get to
the stage where it's possible, just that once you've got all your
infrastructure-as-code stuff set up correctly, you ought to be able to just be
pressing buttons / running scripts and have your infrastructure up and running
in another cloud provider.

Even when working in small companies with small infrastructure, I've kept
recreation of infrastructure as one of my high priorities (one reason it
really bugged me in one job to have to depend on Oracle Databases that I
couldn't automate to the same degree.)

In my mind, it's no different from the importance of having, and testing
restoration of, backups. If your infrastructure gets compromised somehow, or
you find yourself up the creek with your provider, you've _got_ to be able to
rebuild everything from scratch.

~~~
user5994461
The infrastructure has always been the easy part, as long as the company is
willing to pay for multiple datacenters.

Then you realize a lot of software and databases can only run as a single
instance, with zero support for multiple regions, and you're not going to
rewrite everything, so resiliency just can't happen.

------
rlancer
UPDATE: Got some clarity: these issues are caused by "resource exhaustion",
meaning there are no resources left to be allocated.

~~~
halbritt
I'm curious to see if this is true.

I faced some pretty serious resource allocation issues earlier in the year.
The us-west1-a zone was oversubscribed. I was unable to get any real
information from support with regard to capacity. Eventually my rep gave me
some qualitative information that I was able to act on.

------
7ewis
I honestly don't mind if providers have outages - we can't expect 100.00%
uptime, and I know the systems I manage certainly don't achieve it.

One thing I _do_ care about, though, is root cause analysis. I love reading a
good RCA; it restores my faith in the company and makes me trust them more.

(I'm not affected by the GKE outage, so opinions may differ right now!)

------
locusm
Do not use GCP without paying for support. We have had resource allocation
errors for weeks, as have a lot of other people. Check out the posts in their
forum, where folks on basic support get zero love.
[https://groups.google.com/forum/?utm_medium=email&utm_source...](https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/gce-
discussion)

------
thwy12321
Been trying to spin up VM instances all day; I had to try every single zone
just to get one up. Not only is this incredibly harmful to a technology
business dependent on this infra, it wasn't obvious to me what the issue was
until I tried creating instances. Nothing says "hey, resources are constrained
here, try this one." Just about ready to bite the bullet and move to AWS.
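The zone-hopping can at least be scripted rather than clicked through by hand. A hedged sketch of that loop (the exception class and the `try_create` callable are made-up stand-ins, not the real GCP client API):

```python
class ZoneResourceExhausted(Exception):
    """Stand-in for the 'resource exhaustion' error a full zone can return."""


def create_in_first_available_zone(zones, try_create):
    """Try each zone in turn; return (zone, instance) for the first success.

    `try_create(zone)` is a hypothetical callable that provisions a VM and
    raises ZoneResourceExhausted when the zone has no capacity left.
    """
    errors = {}
    for zone in zones:
        try:
            return zone, try_create(zone)
        except ZoneResourceExhausted as exc:
            errors[zone] = exc  # remember the failure, move to the next zone
    raise RuntimeError(f"no capacity in any zone: {sorted(errors)}")
```

It's a workaround, not a fix: it automates exactly the tedium described above, and still fails outright if every zone is exhausted.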

~~~
pfd1986
Same here. We have spent 2 days trying to create instances and migrate images
just to figure out later they can't start.

Right when I convinced our project to get migrated from AWS...

~~~
Masiosare
Same question... why would you do that? AWS is super stable most of the time.
I have been running k8s over EC2 (not EKS) for a year and it works like a charm.
I've even run experiments using spot instances and it's pretty good (no
guarantee there of course).

~~~
pfd1986
We did. Our customer had shared a bucket with a role they created, and we
spent a week going back and forth trying to mount that bucket on our instance
using FUSE. Mounting with gcsfuse took 5 minutes (although no role was used,
so perhaps an unfair comparison). In general, I found Google Cloud a lot
easier to work with.
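For context, the gcsfuse step mentioned really is roughly a one-liner. A minimal sketch of a helper that builds that command line (bucket and mount-point names are hypothetical, and I'm assuming the `--implicit-dirs` flag is wanted):

```python
def gcsfuse_cmd(bucket, mountpoint, readonly=False):
    """Build the command line for gcsfuse, Google's FUSE adapter that
    mounts a GCS bucket as a local filesystem."""
    cmd = ["gcsfuse", "--implicit-dirs"]  # expose nested paths as directories
    if readonly:
        cmd += ["-o", "ro"]  # standard read-only mount option
    cmd += [bucket, mountpoint]
    return cmd
```

Hand the result to `subprocess.run` (or just type it in a shell) on a machine with gcsfuse installed and bucket access.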

------
sladey
Seems to be some weird underlying issue going on at GCP at the moment. We had
Cloud Build webhooks returning a 500 error. Noticed we were at 255 images, and
deleting some fixed the issue. Created a P2 ticket about the issue before we
managed to solve it, and haven't had a response in 40+ hours.

The timeline of this disruption matches when we started experiencing cloud
build errors.

~~~
lstamour
Outsider here, but I believe Cloud Build runs on GKE Jobs, so if they’re
having trouble, it does indeed sound related.

------
ernsheong
"Third consecutive day of service disruption" is not an accurate statement, is
it? The latest update, on Nov 11, says things were resolved on Nov 9.

[https://status.cloud.google.com/incident/container-
engine/18...](https://status.cloud.google.com/incident/container-engine/18005)

~~~
ernsheong
If all nodes in GKE clusters were down for 3 days, I would consider this
newsworthy and shocking. This... is not. Come on, people.

------
013a
Cloud providers have all of the potential in the world to make each region
truly isolated. I shouldn't have to architect my application to be multi-
cloud, at least for stability reasons.

Yet, somehow every major cloud provider experiences global outages.

That old AWS S3 outage in us-east-1 was an interesting one; when it went down,
many services which rely on S3 also went down, in regions besides us-east-1,
because they were using us-east-1 buckets. I have a feeling this is more
common than you'd think: globally-redundant services which rely on some single
point of geographical failure for some small part.

~~~
threeseed
AWS regions are very much isolated from each other.

We know because we are still waiting here in ap-southeast-2 for services such
as EKS to be made available. Pretty sure that any reliance within their
backend services on us-east-1 was just a temporary bug and nothing systemic.

------
spiderPig
Our company is dependent on this as well and the way customer service has been
handling this has been abysmal thus far.

------
qaq
There is no magic. Public clouds have incredibly complex control planes, and
marketing fluff aside, you would very likely experience much better uptime at
a single top-tier DC than at a cloud provider.

------
arunoda
This is not only GKE, but GCE as well. I cannot create instances in almost
all zones. I tried both preemptible and normal instances.

It always says resources are not available. My account is a pretty new account.

In contrast, a friend of mine has a pretty old account which is very active.
He has no such issue.

So I think that, due to this issue, Google has enabled some resource
limitation for new accounts.

But they should properly communicate this.

------
gigatexal
Oh man must be a tough time to be an SRE at google cloud. But... they’re
Google. They have been doing internal cloud for years and years. Borg — which
K8s is a reimplementation of — has been the heart of Google for so long now
you’d think they’d be able to architect their systems to have no outages
whatsoever. I mean nobody is perfect but this looks bad.

~~~
Jedi72
Goes to show outsourcing infrastructure is more about blame shifting, so that
when things go wrong it's "not our fault", than about reducing actual downtime.

------
closeparen
Doesn’t GKE “just” run an independent Kubernetes cluster on customer VMs? How
is a widespread outage like this possible?

~~~
regnerba
GKE handles creating and setting up the VMs, joining them to the cluster, and
applying labels, for example.

The specific issue appears to be about creating new "node pools". Creating
standard VMs in GCP works fine however, so this is specific to GKE and their
internal tooling that integrates with the rest of GCP.

GKE doesn't (at least to my knowledge) allow you to create VMs separately and
join them to the cluster in any kind of easy fashion.

~~~
kenan_warren
It's actually not just GKE, there have been issues creating normal VMs since
late Friday night. It seems anything that required creating VMs gave back
resource exhaustion errors. I finally got a cluster set up in us-east1 last
night, so it looks like the resource issues are clearing up though.

------
fizzledbits
As of this morning, I am _still_ unable to reliably start my docker-machine
autoscaling instances. In all cases the error is "Error: The zone <my project>
does not have enough resources available to fulfill the request".

An instance in us-central1-a has refused to start since last Thursday or
Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail
midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That
worked Saturday and Sunday, but this morning, it is failing to start.
Fortunately my us-west2-c instance has begun to work again, but I'm having
doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

Is this typical of others' experiences?

------
wijowa
Right now we're experiencing an issue where a small percentage of end users on
our GKE site are getting super slow speeds. The issue is ISP related as they
can switch to a 4G hot spot in the same location and get normal speeds... and
inside our system the timing looks normal. So there's a slowdown either TO the
load balancer or FROM the load balancer. Took a week to convince Google's
support contractor to even believe it wasn't an issue with our site and their
advice is generally along the lines of Turn it off and Turn it back on again
(which might actually fix the problem) though that's easier said than done in
GCP.

------
nielsole
I use preemptible machines in autoscaling, and for the first time did not have
any machines available for multiple hours yesterday. I am wondering whether
this falls under normal preemptible behaviour or is part of this service
degradation.

------
wb3tech
If anyone is interested, here is my documented experience with this issue. I
freaking love GCP and GKE, although I now have no production environment, as
it was an HA cluster in us-central1. Working on federation now.

[https://stackoverflow.com/questions/53244471/gke-cluster-
won...](https://stackoverflow.com/questions/53244471/gke-cluster-wont-
provision-in-any-region)

------
regnerba
Is this just about creating new pools? I haven't noticed an issue with our
existing pools scaling.

~~~
rlancer
You were able to add more nodes to your pool? Are you using any autoscaling?

------
_wmd
When guerilla marketing backfires

------
bdibs
As someone currently trying to decide between GCP and AWS for a project, is
this a regular occurrence?

And for those who have used both, which would you go with today?

------
franky_g
Had it affected all regions or just some?

Is there another status page, Google? Because the last update I'm looking
at... is dated the 9th.

~~~
justinsb
The general page is at
[https://status.cloud.google.com/;](https://status.cloud.google.com/;) you can
scroll down to see GKE, and my (unofficial) belief is that
[https://status.cloud.google.com/incident/container-
engine/18...](https://status.cloud.google.com/incident/container-engine/18006)
should have closed out [https://status.cloud.google.com/incident/container-
engine/18...](https://status.cloud.google.com/incident/container-engine/18005)

_If_ that's the case, something else is causing the error messages other
people are seeing.

------
fulafel
Offtopic but are there some documented exceptions to the "keep the original
title" rule?

------
whatshisface
Why do cloud providers have more global outages than major flagship websites
like google.com?

~~~
betaby
They don't run on the same infra. Amazon.com doesn't run on AWS.

~~~
talonx
On the contrary, it does. They made the transition gradually.

------
fergie
Things break after everybody has gone home on a Friday? 3-day disruption.

------
thomasfl
I'd like to upvote, but 666 points seemed relevant.

------
haosdent
Time to use Mesos.

------
shiftnight
I have a question. At what point does k8s make sense?

I have a feeling that a microservice architecture is overkill for 99% of
businesses. You can serve a lot of customers on a single node with the
hardware available today. Oftentimes, sharding on customers is rather trivial
as well.

Monolith for the win! Opinions?

~~~
nstart
As someone whose daily work happens on k8s, I'd say you better be paining a
lot before you move to k8s. I take great care to avoid this, but if you aren't
careful, you can end up "feeling" productive on k8s without actually being
productive. K8s gives a lot of room for one to tweak workflows, discuss
deployment strategies, security, "best practices", etc. And you can get things
done reasonably fast. But that's like a developer spending all day fine-tuning
their editor, comparing and writing plugins, and claiming that they are being
productive.

The key issue here is that k8s was written with very large goals in mind. That
a small business can easily spin it up quickly and run a few microservices or
even a monolith + some workers is just coincidental. It is NOT the design
goal. And the result of that is that a lot of the tooling and writing around
k8s reflects that. A lot of the advice around practices like observability and
service meshes comes from people who've worked in the top 1% (or less) of
companies in terms of computing complexity. What I'm personally seeing is that
this advice is starting to trickle down into the mainstream as gospel. Which
strangely makes sense. No one else has the ability to preach with such
assurance because not many people in small companies have actually been in the
scenarios of the big guns. The only problem is that it's gospel without
considering context.

So at what point does k8s make sense? Only when you have answers to the
following:

* Getting started is easy; maintaining and keeping up with the goings-on is a full-time job - Do you have at least 1 engineer you can spare to work on maintaining k8s as their primary job? It doesn't mean full time. But if they have to drop everything else to go work on k8s and investigate strange I/O performance issues, are you ready to allow that?

* The k8s ecosystem is like the JS framework ecosystem right now - There are no set ways of doing anything. You want to do CI/CD? Should you use Helm charts? Helm charts inherited from a chart folder? Or are you fine using the PATCH API/kubectl patch commands to upgrade deployments? Who's going to maintain the pipeline? Who's going to write the custom code for your GitHub deployments or your Brigade scripts or your custom in-house tool? Who's going to think about securing this stuff and the UX around it? That's just CI/CD, mind you. We aren't anywhere close to the weeds of deciding whether you want to use Ingresses vs load balancers and how you are going to run into service-provider limits on certain resources. Are you ready to have at minimum 1 developer working on this stuff and taking time to talk to the team about it?

* Speaking about the team, k8s and Docker in general are a shift in thinking - This might sound surprising, but the fact that Jessie Frazelle (y'all should all follow her btw) is occasionally seen reiterating the point that containers are NOT VMs is a decent indicator that people don't understand k8s or Docker at a conceptual level. When you adopt k8s, you are going to pass that complexity to your developers at some point. Either that or your devops team takes on that full complexity, and that's a fair amount to abstract away from the developers, which will likely increase the workload of devops and/or their team size. Are you prepared for either path?

* Oh also, what do your development environments start to look like? This is partly related to microservices but are you dockerizing your applications to work on the local dev environment? Who's responsible for that transition? As much as one tries to resist it, once you are on k8s you'll want to take advantage of it. Someone will build a small thing as a microservice or a worker that the monolith or other services depend on. How are you going to set that up locally? And again, who's going to help the devs accumulate that knowledge while they are busy trying to build the product. (Please don't put your hopes on devs wanting to learn that after hours. That's just cruel).
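On the `kubectl patch` route mentioned in the CI/CD bullet above: the fiddly part is usually constructing the patch body. A minimal sketch of building a strategic-merge patch that swaps one container's image (deployment and container names here are hypothetical):

```python
import json

def image_patch(container, image):
    """Strategic-merge patch body for `kubectl patch deployment NAME -p '<body>'`.

    Kubernetes merges the containers list by the `name` key, so only the
    named container is touched; everything else in the spec is left alone.
    """
    return json.dumps({
        "spec": {"template": {"spec": {"containers": [
            {"name": container, "image": image}
        ]}}}
    })
```

You'd then shell out with something like `kubectl patch deployment web -p "$BODY"`; whether that beats Helm or a full pipeline is exactly the kind of decision the parent comment is warning takes real engineering time.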

I can't write everything else I have in mind on this topic. It'd go on for a
long long time. But the common theme here is that the choice around adopting
k8s is generally put on a table of technical pros and cons. I'd argue that
there's a significant hidden cost of human impact as well. Not all these
decisions are upfront but it is the pain that you will adopt and have to
decide on at some point.

Again, at what point does k8s make sense? Like I said, you ideally should be
paining before you start to consider k8s because for nearly every feature of
k8s, there is a well documented, well established, well secured parallel that
already exists in the myriad of service providers. It's a matter of taking
careful stock of how much upfront pain you are trading away for pain that you
WILL accumulate later.

PS - If anyone claims that adopting a newer technology is going to make things
outright less painful, that's a good sign of immaturity. I've been there and
I picture myself smashing my head into a table every now and then when I think
of how immature I used to be. Apologies to people I've worked with at past
jobs.

PPS - From the k8s site: "Designed on the same principles that allows Google
to run billions of containers a week, Kubernetes can scale without increasing
your ops team." <-- is the kind of claim that we need to take flamethrowers
to. On paper, 1 dev with the kubectl+kops CLI can scale services to run with
thousands of nodes and millions of containers. But realistically, you don't
get there without having incurred significantly more complex use cases. So no,
nothing scales independently.

~~~
raarts
I fully agree with you, and personally have taken the path of using Docker
Swarm as a step-up to k8s, as it was so much easier to get along with. I would
certainly recommend this to smaller businesses.

------
aaaaaaaaaab
Daily reminder that there's no "cloud", just other people's computers. ( ͡° ͜ʖ
͡°)

------
spullara
If a hosting service is down and nobody uses it, is there really any
disruption?

