This doesn't look like an outage at all.
> Diagnosis: GKE customers using Kubernetes version 1.25 may experience issues with Persistent Disk creation or deletion failures.
Don't read too much into announcements like this. Status pages and outage notices are often political.
Status pages are rarely dynamic, and updates require a blessing from upstairs. More often than not, complete outages are referred to as "degraded performance affecting some users".
I don't know how status pages work at Google, but I do work in reliability engineering and I sometimes make recommendations to update the status pages.
Some context before I go on: reliability is often measured by mapping critical features to services and their degradation. This gets more challenging as a feature starts to map to more than a couple of services and those services begin to have dependencies of their own. When your reliability can, on average, be measured in its number of nines as opposed to its significant preceding digits, your signal-interpretation game has to step up significantly. These two situations make it far more complex to state whether a given service degradation in a chain of services is truly having external customer impact at a given time. That's why a human needs to make the call to update the status page, and why status page availability numbers are different from internal numbers.
I spend a good portion of nearly every sprint hunting down systemic issues that pop up across the ecosystem of services from a bird's-eye view. Often, knowing whether external customer impact will be felt for a given series of errors relies heavily on knowing the current configuration of the services in a chain, their graceful-failure mechanisms, what the failure manifests as client-side, and whether that failure is critical to an SLA.
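To make that concrete, here's a rough sketch of the kind of call that gets made. This isn't how any particular company does it; the service names, error rates, threshold, and the `graceful_failure` flag are all hypothetical, purely to show why a scary internal error rate doesn't automatically mean customer-visible impact:

```python
# Hypothetical sketch: per-service errors only count against a customer-facing
# feature if the failing dependency has no graceful fallback on the client path.
from dataclasses import dataclass

@dataclass
class ServiceSignal:
    name: str
    error_rate: float        # fraction of requests failing, from internal metrics
    graceful_failure: bool   # does the caller degrade gracefully (cache, retry, default)?

# A "feature" maps to a chain of services; only some are load-bearing
# for the externally promised behaviour.
checkout_chain = [
    ServiceSignal("frontend", 0.0002, graceful_failure=False),
    ServiceSignal("pricing-cache", 0.0900, graceful_failure=True),   # falls back to origin
    ServiceSignal("payments", 0.0004, graceful_failure=False),
]

CUSTOMER_IMPACT_THRESHOLD = 0.001  # a 99.9% feature-level SLO, purely illustrative

def estimated_customer_error_rate(chain):
    # Errors hidden behind a graceful fallback are assumed invisible to the customer.
    visible = [s.error_rate for s in chain if not s.graceful_failure]
    # Crude upper bound: independent failures on the critical path add up.
    return sum(visible)

rate = estimated_customer_error_rate(checkout_chain)
print(f"estimated customer-visible error rate: {rate:.4%}")
print("status page update warranted?", rate > CUSTOMER_IMPACT_THRESHOLD)
```

In this toy example the cache is throwing 9% errors internally, which looks terrible on a dashboard, but the customer never sees it. That gap is exactly why the internal numbers and the status page numbers diverge.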
I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.
No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.
The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.
> No, not really. SLAs are calculated on a per customer basis and generally have a legal definition in contracts if they're actual, functioning SLAs.
And how can I, as a customer, calculate this? We're not going to sue each time there's a breach of SLA just to get the real data. Whatever the status page says is what triggers customers to decide whether they should claim SLA credits. A lower number (from a delayed status page update) will skip payouts or reduce them.
> The status page's purpose is generally to head off a flood of customer-reported issues. This is why you'll usually see issues that affect a broader subset of users on that page.
That's what you assume, and that's what it's supposed to be. It's long been abused otherwise. Amazon, for example, requires explicit approval to update the page. They and others have famously delayed updating the status page for as long as they can get away with, often trying to avoid calling it an outage at all. Instead it will say something like "increased error rates".
Five nines of availability works out to roughly 5 minutes of downtime per year; you can calculate up and down from there. If you don't want to do the conversion from percentage to minutes, there are lots of calculators like this one: https://uptime.is/five-nines
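If you'd rather do the conversion yourself, it's just a percentage of the period. A minimal sketch, assuming a plain 365-day year:

```python
# Convert an availability target into allowed downtime over a year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years

def allowed_downtime_minutes(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes of downtime per year")
# 99.999% ("five nines") comes out to about 5.26 minutes per year.
```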
I wasn't assuming what status pages are used for; I was speaking to my experience working in reliability engineering. I can't speak to Amazon's practices as I've not worked there, but when I've seen this happen it's because we struggled to identify customer impact. The systems you're talking about are vast, and a single application (or even a subset of applications) reporting errors doesn't mean there's going to be customer impact. That's why I mentioned it usually takes a human who knows that system and its upstreams to know if there'll be customer impact from a particular error.
I'd encourage you to read the wording of an SLA in a contract. They're often very specific in terms of time and the features they cover. "Increased error rates" tells me you'll probably run into retry scenarios, which, depending on your contract, may not actually affect an SLA. Error rates are generally an SLO or an SLI, which are not contractually actionable.
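As a concrete illustration of why "increased error rates" and "SLA breach" aren't the same thing: most well-behaved clients retry transient failures, so an elevated server-side error rate can be invisible in the request success rate a contract actually measures. A minimal sketch, with the backoff parameters and error class being illustrative rather than from any particular contract or SDK:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. a 5xx from the provider."""

def call_with_retries(fn, attempts=4, base_delay=0.2):
    # If the call eventually succeeds within the retry budget, the request still
    # counts as a success for SLA purposes, even though the provider's internal
    # error-rate metric (an SLI) took a hit on every failed attempt.
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # only now does the failure become customer-visible
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```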
> Either your shit works, or it doesn’t. You do monitor, don’t you?
That then becomes a he-said-she-said problem with the vendor you're claiming against. Does everyone have time for that? You'll submit the SLA credit claim, and chances are, unless it's WAY off, you'll accept the vendor's nerfed version and move on. Something is better than nothing.
Not sure why you're being downvoted. Status pages for big companies are never hooked up to automation. It's just bad PR to show red across the board.
If there is a networking outage, everything on the status page should be red, but that looks bad for PR. So you just mark "networking outage" while everything else stays green, even though realistically everything is down.
Cloud providers (and everyone else) are unfortunately always downplaying their incidents though, so I don't trust that information. I have no idea about this particular case though since I'm not a GKE user.
Would be interesting to hear from actual users how serious this is.
I run GKE clusters in europe-west2. Completely unaffected. I run on the latest k8s version. I have an uptime monitoring service which runs in AWS and it reports zero downtime over the last 24 hours.
We're using their "stable" release channel which means most of our clusters are currently on a version of 1.25. We only have a few deployments using PVCs so impact is pretty minimal anyways for us as far as I can tell.
Even ignoring the fact that this only affects one old version and Google makes it very easy to upgrade (and to know when you're behind), I can't imagine this is affecting many workloads. The vast majority of the workloads I run (and, I think, that get run generally) don't rely on PVs. And even then, you're usually not creating wholly new volumes all the time; you're generally just consuming the volumes that already exist.
A big part of the Kubernetes value proposition is “autoscaling”, used in the loosest sense. Pods will come and go over time, in response to events, etc. as part of normal operations for many systems.
If I still had to deploy to Kubernetes, I’d consider this an outage.
This isn't preventing pods from moving from node to node, and for a lot of workloads it wouldn't even prevent creating new pods. The issue is that if you make a new storage claim (a PVC), the underlying storage that fills that claim isn't being created at the moment, and only on one particular version of Kubernetes. Existing volumes are not impacted. Every other version is not impacted.
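For anyone who doesn't live in Kubernetes day to day, the affected operation is roughly this kind of request, where a new claim asks a storage class to dynamically provision a fresh disk. A sketch only, assuming the official kubernetes Python client; the claim name, namespace, and the GKE-style `standard-rwo` storage class are just example values:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
core = client.CoreV1Api()

# Creating a *new* claim triggers dynamic provisioning of a new Persistent Disk.
# That provisioning step is what the notice covers; already-bound volumes keep working.
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "example-claim"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "standard-rwo",  # example PD-backed class
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc_manifest)
```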
Loads of pods don't even use PVCs. In the several hundred deployments I routinely manage, there's only a handful of PVCs, and they aren't exactly dynamically created and destroyed. I've got many GKE clusters and no workloads I'm running are affected, and I imagine most existing workloads in GKE clusters aren't affected. I'm then also doubly not affected, because I'm not running 1.25.
That has to be annoying to fix. It's one thing when it's "just" your own cluster and something explodes in your face; it's quite another when it's other people's persistent volumes and you can't just assume they'll be able to redeploy, because you have no idea what's in those storage claims.
Glad it's not me having to deal with this. Kubernetes is still a black box of magic in many respects, and there are probably more than a few layers of abstraction between it and the actual disks.
Most of the abstraction between disk and container is external to K8s. You can use a single box with CPU + disk with K8s if you want; that's just not a configurable product offering, so that's not how any of the cloud providers expose disk. K8s itself is entirely unconcerned with most of the implementation details; it's more a framework for those kinds of things.
It affects the creation or deletion of volumes; anything that already exists will be fine and continue operating. And this is on the second-oldest version of kube currently supported, which is also older than the version in every release channel atm. So it's a subset of a subset of a subset of an old version.
You seem to miss the point that "global" is a geographic indication. It does not mean "all features" or "all users".
> anything that already exists will be fine and continue operating.
Any pod that already exists will be fine, but that doesn't extend to services, since they may autoscale volumes along with pods. I'm not sure many users on such an antiquated kube version would have done that, but if no one had, there probably wouldn't be a notice.
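The usual way volumes scale with pods is a StatefulSet with `volumeClaimTemplates`: bump the replica count and Kubernetes mints a brand-new PVC for each new pod, which is exactly the operation affected here. A trimmed, hypothetical fragment of such a spec (names and sizes made up, selector and pod template omitted):

```python
# The piece of a StatefulSet spec that causes new PVCs to be created on scale-up.
stateful_set_spec_fragment = {
    "replicas": 3,  # raise this and each new pod gets its own freshly provisioned PVC
    "volumeClaimTemplates": [
        {
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "standard-rwo",
                "resources": {"requests": {"storage": "20Gi"}},
            },
        }
    ],
}
```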
Eh, this is a colloquialism I can get behind, though "firm"-ware is actually just a different type of software.
"Hardware" meaning that it's very inflexible, "software" meaning that it is comparatively flexible, "wetware" meaning that it inherently lacks order. ;)
Depends, really. I'm the entire ops team for a non-profit volunteer org, and I host their services on the cloud specifically so that when something fails I don't have to do anything. I'd much rather have a cloud computing company's ops team working to fix the issue while we just pay our monthly dues.
The same can be said of many smaller companies where uptime can be maintained cheaper and more consistently by offloading ops work to a cloud computing company.