[flagged] Global Google Kubernetes Engine Outage (cloud.google.com)
98 points by talonx on Sept 15, 2023 | 54 comments



This doesn't look like an outage at all.

> Diagnosis: GKE customers using Kubernetes version 1.25 may experience issues with Persistent Disk creation or deletion failures.


Agree. The word outage isn't used in the notice. They say 'incident' which is more accurate. The title should be changed.


Don't read into announcements like this too much. Status pages and outage notices are often political.

Status pages are rarely dynamic and updates require blessing from upstairs. And more often than not complete outages are referred to as "degraded performance affecting some users".


I don't know how status pages work at Google, but I do work in reliability engineering and I sometimes make recommendations to update the status pages.

Some context before I go on: reliability is often measured by mapping critical features to services and their degradation. This gets more challenging as a feature starts to map to more than a couple of services and those services begin to have dependencies. When your reliability on average can be measured in its number of nines, as opposed to its significant preceding digits, your signal interpretation game has to step up significantly. These two situations make it far more complex to state whether a given service degradation in a chain of services is truly having external customer impact at a given time. That's why a human needs to make the call to update the status page, and why status page availability numbers are different from internal numbers.
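To make the dependency point concrete, here is a minimal sketch of how availability compounds when a feature needs every service in a chain to be up. It assumes the services fail independently, which real systems rarely do, and the per-service numbers are made up for illustration.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a feature that needs every service in the chain.

    Assumes independent failures, so the availabilities simply multiply.
    """
    return prod(availabilities)


# Hypothetical per-service availabilities for one feature's call chain.
chain = [0.9999, 0.9995, 0.999]

feature = chain_availability(chain)
print(f"feature availability: {feature:.4%}")
print(f"weakest single service: {min(chain):.4%}")
```

Under that assumption the feature is always less available than its weakest dependency, which is part of why a single service's error rate doesn't map cleanly to customer impact.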

I spend a good portion of nearly every sprint hunting down systemic issues that'll pop up across the ecosystem of services from a bird's-eye view. Often, knowing whether external customer impact will be felt for a given series of errors relies heavily on knowing the current configuration of services in a chain, their graceful failure mechanisms, what failure manifests as client side, and whether that failure is critical to an SLA.

I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.


> I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.

The status page is tied to public SLAs = impact on $$$. Internally you can track anything. What's public is the problem.


No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.


> No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.

And how can I as a customer calculate this? We're not going to sue each time there's a breach of SLA to get the real data. Whatever the status page says will trigger customers to decide if they should claim SLA credits. A lower number (from a delayed update of the status page) will skip payouts or reduce them.

> The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.

That's what you assume and that's what it's supposed to be. It's long been abused otherwise. Amazon, for example, will require explicit approval to update the page. They and others have famously delayed updating the status page for as long as they can get away with, often attempting not to call it an outage at all. It will say something like "increased error rates".


Five nines of availability works out to roughly 5 minutes of downtime per year; you can calculate up and down from there. If you don't want to do the conversion from percentage to minutes, there are lots of calculators like this one: https://uptime.is/five-nines
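If you'd rather do the conversion yourself, it's a one-liner. This is a rough sketch that assumes a 365.25-day year and counts every minute of the period; an actual SLA defines its own measurement window and exclusions, so treat the numbers as ballpark only.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed over a period for a given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


for label, pct in [
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]:
    minutes = downtime_budget_minutes(pct)
    print(f"{label:>11} ({pct}%): {minutes:9.2f} min/year")
```

Each extra nine cuts the budget by a factor of ten, so five nines comes out at about 5.26 minutes per year.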

I wasn't assuming what status pages are used for, I was speaking from my experience working in reliability engineering. I can't speak to Amazon's practices as I've not worked there, but when I've seen this happen it's because we struggled to identify customer impact. The systems you're talking about are vast, and a single application, or even a subset of applications, reporting errors doesn't mean there's going to be customer impact. That's why I mentioned it usually takes a human who knows that system and its upstreams to know if there'll be customer impact from a particular error.

I'd encourage you to read the wording of an SLA in a contract. They're often very specific in terms of time and the features they cover. Increased error rates tells me you'll probably run into retry scenarios, which depending on your contract may not actually affect an SLA. Error rates are generally an SLO or an SLI, which are not contractually actionable.


> And how can I as a customer calculate this?

Either your shit works, or it doesn’t. You do monitor, don’t you?


> Either your shit works, or it doesn’t. You do monitor, don’t you?

That then becomes a he-said-she-said problem with the vendor you're claiming against. Does everyone have time for it? You will submit the SLA credit claim and chances are, unless it's WAY off, you'll accept the vendor's nerfed version and move on. Something is better than nothing.


I'm an SRE and I've seen it firsthand at multiple companies.


Not sure why you’re being downvoted. Status pages for big companies are never hooked up to automation. It’s just bad PR to show red across the bar.

If there is a networking outage, everything on a status page should be red but then that looks bad for PR. So you just set “networking outage” but everything else is green even though everything is realistically down.


It's not only bad PR; CSPs are also subject to SLAs.


Cloud providers (and everyone else) are unfortunately always downplaying their incidents though, so I don't trust that information. I have no idea about this particular case though since I'm not a GKE user.

Would be interesting to hear from actual users how serious this is.


I run GKE clusters in europe-west2. Completely unaffected. I run on the latest k8s version. I have an uptime monitoring service which runs in AWS and it reports zero downtime over the last 24 hours.


> Completely unaffected. I run on the latest k8s version.

The announcement mentions a specific version: "issue with Google Kubernetes Engine impacting customers using Kubernetes version 1.25"


Well yeah that's kind of their point. GKE makes it pretty easy to stay on a recent version.


We're using their "stable" release channel which means most of our clusters are currently on a version of 1.25. We only have a few deployments using PVCs so impact is pretty minimal anyways for us as far as I can tell.


Even ignoring the fact that this only affects one old version and Google makes it very easy to upgrade (and know when you're behind), I can't imagine this is affecting many workloads. The vast majority of the workloads I run (and, I think, that get run generally) don't rely on PVs. And of those that do, most of the time you're not making wholly new volumes; you're generally just consuming already existing volumes.


A big part of the Kubernetes value proposition is “autoscaling”, used in the loosest sense. Pods will come and go over time, in response to events, etc. as part of normal operations for many systems.

If I still had to deploy to Kubernetes, I’d consider this an outage.


I would expect the Venn diagram of:

1) actually using autoscaling in prod

2) using it for a stateful workload (with PVs)

3) needing to scale that workload during this time window

4) using this specific k8s version

To be razor thin.

I’m honestly surprised they’ve reported the impact on this dashboard; but good to see they did.


It seems like 1.25 is a month and a half out from retirement, maybe it's related to that.

Kubernetes 1.25 (in Danny Glover voice): I'm getting too old for this shit!


autoscaling isn't necessary to elicit downtime.

Need to deploy a dev copy of your cluster? How about the CI/CD pipelines that are blocked? Go sit on your hands and wait it out. Almost 2 days now.


Woah, I completely missed that this has been going on since the 13th?! That’s outrageous, surely something can just be rolled back.


This isn't preventing pods from moving from node to node, and for a lot of workloads it wouldn't even prevent making new pods. The issue is that if you make a new storage claim (a PVC), the underlying storage that fills that claim isn't being created at the moment, for one particular version of Kubernetes. Existing volumes are not impacted. Every other version is not impacted.

Loads of pods don't even use PVCs. In the several hundred deployments I routinely manage, there's only a handful of PVCs, and they aren't exactly dynamically created and destroyed. I've got many GKE clusters and no workloads I'm running are affected, and I imagine most existing workloads in GKE clusters aren't affected. I'm then also doubly not affected, because I'm not running 1.25.
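Put as a check, the blast radius described above looks something like this. It's only a sketch of the conditions as stated in the notice (version 1.25, plus creating or deleting persistent disks); the `Workload` type and the example version strings are hypothetical, not anything GKE exposes.

```python
from dataclasses import dataclass

AFFECTED_MINOR = (1, 25)  # per the notice quoted in this thread


@dataclass
class Workload:
    cluster_version: str              # hypothetical, e.g. "1.25.10-gke.1200"
    creates_or_deletes_volumes: bool  # provisions or removes PVC-backed disks


def minor_version(version):
    """Return (major, minor) from a version string like '1.25.10-gke.1200'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_affected(workload):
    """True only when both conditions from the notice hold."""
    on_affected_version = minor_version(workload.cluster_version) == AFFECTED_MINOR
    return on_affected_version and workload.creates_or_deletes_volumes


print(is_affected(Workload("1.25.10-gke.1200", True)))   # on 1.25, churning volumes
print(is_affected(Workload("1.25.10-gke.1200", False)))  # on 1.25, existing volumes only
print(is_affected(Workload("1.27.3-gke.100", True)))     # newer version
```

Only the first case hits the problem; a workload on 1.25 that just keeps using its existing volumes doesn't.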


yeah, as a user of that service you may or may not be affected by this particular problem.


That has to be annoying to fix. It's one thing when it's "just" your own cluster and something explodes in your face; it's quite another when it's other people's persistent volumes and you can't just assume that they'll be able to redeploy, because you have no idea what is in those storage claims.

Glad it's not me having to deal with this. Kubernetes is still a black box of magic in many respects, and there are probably more than a few abstractions on the way down to the actual disks.


Most of the abstraction between disk and container is external to K8s. You can use a single box (CPU + disk) with K8s if you want; that's just not a configurable product offering, so that's not how any of the cloud providers expose disk. K8s itself is entirely unconcerned with most of the implementation details; it's more a framework for those kinds of things.


All my services are up and running and I had zero downtime over the last 24 hours, so can't be that "global".


Did you actually read the summary? It affects volumes for customers running 1.25 only - it is global but only for a subset of customers.


It affects the creation or deletion of volumes. Anything that already exists will be fine and continue operating. And it's on the second-oldest version of kube currently supported, which is also older than the version in every release channel atm. So it's a subset of a subset of a subset of an old version.


You seem to miss the point that "global" is a geographic indication. It does not mean "all features" or "all users".

> anything that already exists will be fine and continue operating.

Any pod that already exists will be fine, but that doesn't extend to services, as they may autoscale volumes along with pods. I'm not sure many users on such an antiquated kube version would have done that, but if no one did, there would probably not be a notice.


Clickbait title -_-


I found one assertion so true: Cloud is just someone-else's computer.


I might blow your mind if I said that serverless runs on servers?


server-fewer just doesn't roll off the tongue as nicely


It's read as "server less", i.e. on fewer servers, not zero.


Software is not actually soft.


Eh, this is a colloquialism I can get behind, though "firm"-ware is actually just a different type of software.

"Hardware" meaning that it's very inflexible, "software" meaning that it is comparatively flexible, "wetware" meaning that it inherently lacks order. ;)


And grocery stores are just selling somebody else’s corn, but it’s that or become a farmer.


You mean that they can make the corn unavailable in the middle of my dinner?


Serverless and cloud servers can both do that.


The cloud is just digital feudalism, except we even tricked people into selling their own hardware willingly thinking it was 'smart'.


> digital feudalism

Can I steal this?

I think it's really apt.


of course; doubtless many others have observed the same reality of cloud compute


Which is exactly what you want. You don't want users to access your laptop.


Not exactly. I don't want to get into a situation where there is nothing I can do when it's down either.


You still have outages when you run your own hardware and systems.

What would that look like for your company compared to the cloud?


Depends really. I am the entire ops team for a non-profit volunteer org, I host their services on the cloud specifically so when something fails I don't have to do anything. Would much rather have a cloud computing company's ops team working to fix the issue instead and we'll just pay our monthly dues.

The same can be said of many smaller companies where uptime can be maintained cheaper and more consistently by offloading ops work to a cloud computing company.


You don't have to if you plan your disaster recovery scenarios correctly.


Doesn’t seem like a global outage - where is this headline coming from?


The term "global outage" in public clouds is used whenever there is an outage with a single root cause that impacts more than two independent regions.


global is about geography.


Users of dedicated servers: LOL.



