Global Google Kubernetes Engine Outage

bittermandel · on Sept 15, 2023

This doesn't look like an outage at all.

> Diagnosis: GKE customers using Kubernetes version 1.25 may experience issues with Persistent Disk creation or deletion failures.

eVeechu7 · on Sept 15, 2023

Agree. The word outage isn't used in the notice. They say 'incident' which is more accurate. The title should be changed.

voytec · on Sept 15, 2023

Don't read into announcements like this too much. Status pages and outage notices are often political.

Status pages are rarely dynamic and updates require blessing from upstairs. And more often than not complete outages are referred to as "degraded performance affecting some users".

oooyay · on Sept 15, 2023

I don't know how status pages work at Google, but I do work in reliability engineering and I sometimes make recommendations to update the status pages.

Some context before I go on: reliability is often measured by mapping critical features to services and their degradation. This gets more challenging as a feature starts to map to more than a couple of services and those services begin to have dependencies. When your reliability on average can be measured in its number of nines, as opposed to its significant preceding digits, your signal interpretation game has to step up significantly. These two situations make it far more complex to state whether a given service degradation in a chain of services is truly having external customer impact at a given time. That's why a human needs to make the call to update the status page, and why status page availability numbers are different from internal numbers.
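To illustrate why chains of services make this harder, here is a minimal sketch of how availability compounds when a feature needs every service in a chain to be up. It assumes the services fail independently, which real systems rarely do, and the function name `serial_availability` is made up for this example.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a feature that needs every listed service to be up.

    Assumption: failures are independent, so the individual availabilities
    (fractions between 0 and 1) simply multiply.
    """
    values = list(availabilities)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
    return prod(values)


if __name__ == "__main__":
    # Hypothetical chain: three services at "three nines" each.
    chain = [0.999, 0.999, 0.999]
    print(f"feature availability: {serial_availability(chain):.6f}")
```

Under that assumption, a feature that depends on three "three nines" services is itself below three nines, before graceful degradation and retries are taken into account.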

I spend a good portion of nearly every sprint hunting down systemic issues that'll pop up across the ecosystem of services from a bird's-eye view. Often, knowing whether external customer impact will be felt for this series of errors relies heavily on knowing the current configuration of services in a chain, their graceful failure mechanisms, what failure manifests as client side, and whether that failure is critical to an SLA.

I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.

re-thc · on Sept 15, 2023

> I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.

The status page is tied to public SLAs = impact on $$$. Internally you can track anything. What's public is the problem.

oooyay · on Sept 15, 2023

No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.

re-thc · on Sept 15, 2023

> No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

And how can I as a customer calculate this? We're not going to sue each time there's a breach of SLA to get the real data. Whatever the status page says will trigger customers to decide if they should claim SLA credits. A lower number (delayed update of the status page) will skip payouts or reduce it.

> The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.

That's what you assume and that's what it's supposed to be. It's long been abused otherwise. Amazon for example will require explicit approval to update the page. They and others have famously delayed updating the status page as late as they can get away with often attempting to not even call an outage. It will say something like "increased error rates".

oooyay · on Sept 15, 2023

Five nines of availability works out to about 5 minutes of downtime per year; you can calculate up and down from there. If you don't want to do the conversion from percentage to minutes, there are lots of calculators like this one: https://uptime.is/five-nines
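For reference, the conversion is simple arithmetic. A minimal sketch, assuming a 365-day year as the measurement window (actual SLAs define their own windows, often monthly); the function name `downtime_budget_minutes` is made up for this example:

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the downtime allowed, in minutes, for an availability target.

    availability_pct: target as a percentage, e.g. 99.999 for five nines.
    period_days: length of the measurement window (assumed 365 by default).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year = downtime_budget_minutes(pct)
        per_month = downtime_budget_minutes(pct, period_days=30)
        print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min per 30 days")
```

Each extra nine divides the budget by ten, so five nines over a year is a little over five minutes.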

I wasn't assuming what status pages are used for; I was speaking to my experience working in reliability engineering. I can't speak to Amazon's practices as I've not worked there, but when I've seen this happen it's because we struggled to identify customer impact. The systems you're talking about are vast, and a single application or even a subset of applications reporting errors doesn't mean there's going to be customer impact. That's why I mentioned it usually takes a human who knows that system and its upstreams to know if there'll be customer impact from a particular error.

I'd encourage you to read the wording of an SLA in a contract. They're often very specific in terms of time and the features they cover. Increased error rates tells me you'll probably run into retry scenarios, which depending on your contract may not actually affect an SLA. Error rates are generally an SLO or an SLI, which are not contractually actionable.

mdekkers · on Sept 15, 2023

> And how can I as a customer calculate this?

Either your shit works, or it doesn’t. You do monitor, don’t you?

re-thc · on Sept 15, 2023

> Either your shit works, or it doesn’t. You do monitor, don’t you?

That then becomes a he said she said problem with the vendor you're claiming against. Does everyone have time for it? You will submit the SLA credit claim and chances are unless it's WAY off you'll accept the vendor's nerfed version and move on. Something is better than nothing.

zug_zug · on Sept 15, 2023

I'm an SRE and I've seen it firsthand at multiple companies.

eddythompson80 · on Sept 15, 2023

Not sure why you’re being downvoted. Status pages for big companies are never hooked up to automation. It’s just bad PR to show red across the board.

If there is a networking outage, everything on a status page should be red but then that looks bad for PR. So you just set “networking outage” but everything else is green even though everything is realistically down.

aldarisbm · on Sept 15, 2023

it's also not only bad PR, but CSPs are subject to SLAs.

chucky · on Sept 15, 2023

Cloud providers (and everyone else) are unfortunately always downplaying their incidents though, so I don't trust that information. I have no idea about this particular case though since I'm not a GKE user.

Would be interesting to hear from actual users how serious this is.

dustedcodes · on Sept 15, 2023

I run GKE clusters in europe-west2. Completely unaffected. I run on the latest k8s version. I have an uptime monitoring service which runs in AWS and it reports zero downtime over the last 24 hours.

voytec · on Sept 15, 2023

> Completely unaffected. I run on the latest k8s version.

The announcement mentions specific version: "issue with Google Kubernetes Engine impacting customers using Kubernetes version 1.25"

vel0city · on Sept 15, 2023

Well yeah that's kind of their point. GKE makes it pretty easy to stay on a recent version.

mwarkentin · on Sept 15, 2023

We're using their "stable" release channel which means most of our clusters are currently on a version of 1.25. We only have a few deployments using PVCs so impact is pretty minimal anyways for us as far as I can tell.

vel0city · on Sept 15, 2023

Even ignoring the fact this only affects one old version and Google makes it very easy to upgrade (and know when you're behind), I can't imagine this is affecting many workloads. The vast majority of the workloads I run (and I think get run, generally) don't rely on PVs. And of those that do, most aren't making wholly new volumes all the time; they're generally just consuming already existing volumes.

cdogl · on Sept 15, 2023

A big part of the Kubernetes value proposition is “autoscaling”, used in the loosest sense. Pods will come and go over time, in response to events, etc. as part of normal operations for many systems.

If I still had to deploy to Kubernetes, I’d consider this an outage.

cmckn · on Sept 15, 2023

I would expect the Venn diagram of:

1) actually using autoscaling in prod

2) using it for a stateful workload (with PVs)

3) needing to scale that workload during this time window

4) using this specific k8s version

To be razor thin.

I’m honestly surprised they’ve reported the impact on this dashboard; but good to see they did.

smif · on Sept 15, 2023

It seems like 1.25 is a month and a half out from retirement, maybe it's related to that.

Kubernetes 1.25 (in Danny Glover voice): I'm getting too old for this shit!

waych · on Sept 15, 2023

autoscaling isn't necessary to elicit downtime.

Need to deploy a dev copy of your cluster? How about the CI/CD pipelines that are blocked? Go sit on your hands and wait it out. Almost 2 days now.

cmckn · on Sept 15, 2023

Woah, I completely missed that this has been going on since the 13th?! That’s outrageous, surely something can just be rolled back.

vel0city · on Sept 15, 2023

This isn't preventing pods from moving from node to node, and for a lot of workloads it wouldn't even prevent making new pods. The issue is that if you make a new storage claim (a PVC), the underlying storage that fills that claim isn't being created at the moment, for one particular version of Kubernetes. Existing volumes are not impacted. Every other version is not impacted.

Loads of pods don't even use PVCs. In the several hundred deployments I routinely manage, there's only a handful of PVCs, and they aren't exactly dynamically created and destroyed. I've got many GKE clusters and no workloads I'm running are affected, and I imagine most existing workloads in GKE clusters aren't affected. I'm then also doubly not affected, because I'm not running 1.25.
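To make that scoping concrete, here is a rough sketch of the two conditions from the notice quoted in this thread (Kubernetes 1.25, and Persistent Disk creation or deletion). The function names and the version strings in the example are hypothetical; this is not how Google determines impact.

```python
import re

# Minor version named in the notice quoted in this thread.
AFFECTED_MINOR = (1, 25)


def minor_version(version: str):
    """Extract (major, minor) from a version string such as '1.25.10-gke.1200'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(version: str, creates_or_deletes_volumes: bool) -> bool:
    """True only if the cluster is on the affected minor version AND the
    workload needs disks created or deleted during the incident window."""
    return minor_version(version) == AFFECTED_MINOR and creates_or_deletes_volumes


if __name__ == "__main__":
    # Hypothetical clusters, for illustration only.
    print(is_affected("1.25.10-gke.1200", creates_or_deletes_volumes=True))
    print(is_affected("1.25.10-gke.1200", creates_or_deletes_volumes=False))
    print(is_affected("1.27.3-gke.100", creates_or_deletes_volumes=True))
```

Both conditions have to hold at once, which is why clusters on other versions, or clusters that only consume existing volumes, see nothing.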

ithkuil · on Sept 15, 2023

yeah, as a user of that service you may or may not be affected by this particular problem.

mrweasel · on Sept 15, 2023

That has to be annoying to fix. It's one thing when it's "just" your own cluster and something explodes in your face; it's quite another when it's other people's persistent volumes and you can't just assume that they'll be able to redeploy, because you have no idea what is in those storage claims.

Glad it's not me having to deal with this. Kubernetes is still a black box of magic in many respects, and there are probably more than a few abstractions down to the actual disks.

BobbyJo · on Sept 15, 2023

Most of the abstraction between disk and container is external to K8s. You can use a single box (CPU + disk) with K8s if you want; that's just not a configurable product offering, so that's not how any of the cloud providers expose disk. K8s itself is entirely unconcerned with most of the implementation details; it's more a framework for those kinds of things.

dustedcodes · on Sept 15, 2023

All my services are up and running and I had zero downtime over the last 24 hours, so can't be that "global".

glitchcrab · on Sept 15, 2023

Did you actually read the summary? It affects volumes for customers running 1.25 only - it is global but only for a subset of customers.

tacticus · on Sept 15, 2023

It affects the creation or deletion of volumes. Anything that already exists will be fine and continue operating. It's on the second-oldest version of kube currently supported, which is also older than the version in every release channel atm. So it's a subset of a subset of a subset of an old version.

pyrale · on Sept 15, 2023

You seem to miss the point that "global" is a geographic indication. It does not mean "all features" or "all users".

> anything that already exists will be fine and continue operating.

Any pod that already exists will be fine, but that doesn't extend to services, as they may autoscale volumes along with pods. I'm not sure many users on such an antiquated kube version would have done that, but if no one did, there would probably not be a notice.

dangoodmanUT · on Sept 15, 2023

Clickbait title -_-

mrjin · on Sept 15, 2023

I found one assertion so true: Cloud is just someone-else's computer.

keyle · on Sept 15, 2023

I might blow your mind if I said that serverless runs on servers?

fragmede · on Sept 15, 2023

server-fewer just doesn't roll off the tongue as nicely

re-thc · on Sept 15, 2023

It's read "server less", i.e. on fewer servers, not zero.

dpkirchner · on Sept 15, 2023

Software is not actually soft.

dijit · on Sept 15, 2023

Eh, this is a colloquialism I can get behind, though "firm"-ware is actually just a different type of software.

"Hardware" meaning that it's very inflexible, "software" meaning that it is comparatively flexible, "wetware" meaning that it inherently lacks order. ;)

handsclean · on Sept 15, 2023

And grocery stores are just selling somebody else’s corn, but it’s that or become a farmer.

madmulita · on Sept 15, 2023

You mean that they can make the corn unavailable in the middle of my dinner?

robertlagrant · on Sept 15, 2023

Serverless and cloud servers can both do that.

intelVISA · on Sept 15, 2023

The cloud is just digital feudalism, except we even tricked people into selling their own hardware willingly thinking it was 'smart'.

dijit · on Sept 15, 2023

> digital feudalism

Can I steal this?

I think it's really apt.

intelVISA · on Sept 15, 2023

of course; doubtless many others have observed the same reality of cloud compute

gchamonlive · on Sept 15, 2023

Which is exactly what you want. You don't want users to access your laptop.

mrjin · on Sept 15, 2023

Not exactly. I don't want to get into a situation where there is nothing I can do when it's down, either.

verdverm · on Sept 15, 2023

You still have outages when you run your own hardware and systems.

What would that look like for your company compared to the cloud?

kynetic · on Sept 15, 2023

Depends really. I am the entire ops team for a non-profit volunteer org, I host their services on the cloud specifically so when something fails I don't have to do anything. Would much rather have a cloud computing company's ops team working to fix the issue instead and we'll just pay our monthly dues.

The same can be said of many smaller companies where uptime can be maintained cheaper and more consistently by offloading ops work to a cloud computing company.

gchamonlive · on Sept 15, 2023

You don't have to if you plan your disaster recovery scenarios correctly.

donor20 · on Sept 15, 2023

Doesn’t seem like a global outage - where is this headline coming from?

eddythompson80 · on Sept 15, 2023

The term "global outage" in public clouds is used whenever there is an outage with the same root cause that impacts more than 2 independent regions.

barkingcat · on Sept 15, 2023

global is about geography.

glintik · on Sept 15, 2023

Users of dedicated servers: LOL.