(Eng lead for AKS here)
While lots of people have had great success with AKS, we're always concerned when someone has a bad time. In this particular case the AKS engineering team spent over a day helping identify that the user had over-scheduled their nodes by running applications without memory limits, resulting in the kernel OOM (out of memory) killer terminating the Docker daemon and kubelet. As part of this investigation we increased the system reservation for both Docker and kubelet to ensure that, in the future, if a user over-schedules their nodes, the kernel will only terminate their applications and not the critical system daemons.
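For context on what a reservation like that involves, the kubelet has flags for carving out node resources for the OS and for the Kubernetes daemons themselves. A rough sketch, with made-up values rather than AKS's actual settings:

```sh
# Reserve memory/CPU for OS daemons and for kubelet/Docker, and start
# evicting pods before the node runs completely dry (values illustrative).
kubelet \
  --system-reserved=memory=500Mi,cpu=100m \
  --kube-reserved=memory=1Gi,cpu=200m \
  --eviction-hard=memory.available<200Mi \
  --enforce-node-allocatable=pods
```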
Does it seem weird to anybody else that a vendor would semi-blame the customer in public like this? I can't imagine seeing a statement like this from a Google or Amazon engineer.
It also seems to ignore a number of the points, especially how support was handled. I think it's bad form to respond only to the one thing that can be rebutted and ignore the rest. And personally, I would have apologized for the bad experience here.
While it might be phrased in a way that implies the customer is partly to blame, the actual details would indicate the main problem was with Azure Kubernetes Service. Critical system daemons going down because the application uses too much memory is not a reasonable failure mode (and the AKS team rightfully fixed it).
Exactly. The whole point of offering a service to the public is that you know more than other people. So of course customers will do wrong things, be confused, etc.
In Microsoft's shoes, I would have strongly avoided anything that sounded like customer blame. E.g.: "We really regret the bad experience they had here. They were using the platform in a way we didn't expect, which led to an obviously unacceptable failure mode. We appreciate their bringing it to our attention; we've made sure it won't happen going forward. We also agree that some of the responses from support weren't what they should have been and will be looking at how to improve that for all Azure users."
The goal with a public statement like this isn't to be "right". It isn't even to convince the customer. It's to convince everybody else that their experience will be much better than what is hopefully a bad outlier. The impression I'm left with is that a) Azure isn't really owning their failures, and b) if I use their service in a way that seems "wrong" to them, I shouldn't expect much in the way of support.
My understanding, which may be incorrect, is that they also count all SPLA revenue as cloud revenue.
(SPLA is the licensing paid by service providers to lease infrastructure running Microsoft products to their customers. So if you pay some VPS or server provider $30/mo or whatever they charge for Server 2012, and they turn around and send $28 of it to MS, MS reports that $28 as cloud revenue.)
Well, this is on the front page, the top comment is misinformation, the posters left out details that made them look bad, and they seem to be going on a smear campaign out of spite on every platform they have. At what point is any of this in good faith?
What makes you think it's not in good faith? As far as I can tell, Prashant Deva had a series of bad experiences on Azure, including significant downtime. He's mad, and he's saying so.
From his perspective he was using it right; from Azure's apparently he was using it wrong. A difference in perspective isn't bad faith.
Probably the part where he doesn't actually ever say it's a difference in perspective - that's your take. He says AKS is terrible, etc., etc. You're giving him the benefit of the doubt, which I appreciate, but he's gone too far in his bias. Maybe there's a real issue underlying it, one that Hacker News clearly wants to indulge, but the threshold has been crossed.
He doesn't have to say it's a difference in perspective. He's giving his perspective. That's what blog posts generally are.
I note that you don't say your comments here are just your perspective as you trash-talk him. Does that mean you're pursuing a smear campaign and not acting in good faith? Why should he be held to a standard you yourself aren't willing to follow?
I would prefer that a vendor respond publicly rather than request a private message. It's possible that one side was angry, and a blog post that makes it onto HN will surely get a ton of negative attention. If that's the case, they should have the right to clear up anything they'd like. I didn't read it as blame, but as explanation.
I think it's good to listen to both sides. But the response from Azure eng could be more professional. Customers have the right to do anything, whether technically right or wrong. But the original post's attitude reads more like blaming and throwing out random tech details than like explanation.
Well, we really don't know what was said as the blog didn't actually provide any of the original communications. It's a he said she said thing at this point. Frankly the author comes across as having a huge axe to grind. That may be with good reason but it's hard for me to judge the quality of the Azure support when we never see any of their communications, just paraphrases.
>AKS engineering team spent over a day helping identify that the user had over-scheduled their nodes by running applications without memory limits, resulting in the kernel OOM (out of memory) killer terminating the Docker daemon and kubelet.
I'm a bit confused why the cluster nodes don't come configured like this out of the box... Kubernetes users aren't supposed to have to worry about OOM on the underlying system killing ops-side processes, are they?
In this case, the cluster admin would be whoever is provisioning the cluster nodes. In Google Kubernetes Engine, the "Capacity" and "Allocatable" figures shown on the nodes are different (I see some memory/CPU reserved, presumably for system stuff). This makes me think GKE automatically subtracts what's reserved for the system from the node's capacity.
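For anyone who wants to check this on their own cluster, the gap is visible in `kubectl describe node`; the node name and numbers below are an illustrative sketch, not real GKE output:

```sh
# Allocatable = Capacity minus system/kube reservations and the eviction
# threshold; it's what the scheduler will actually hand out to pods.
kubectl describe node my-node
# Capacity:
#   cpu:     2
#   memory:  7658952Ki
#   ...
# Allocatable:
#   cpu:     1930m
#   memory:  5778888Ki
#   ...
```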
Note that it also needs to match the node configuration (specifically how cgroups are set up), so I doubt this works well on EKS, which is BYO node. Maybe that's the issue with AKS too - I don't know enough about how it works...
AKS now reserves 20% of memory on each agent node, plus a small amount of CPU, so that the Docker daemon and kubelet can keep functioning alongside misbehaving customer pods. However, that just means customers' pods will be evicted, or will have nowhere to schedule, once all the remaining resource is used up. This is something we now see in customer support cases.
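For what it's worth, when that eviction path kicks in it tends to look like this from the outside (pod name and output are illustrative):

```sh
kubectl get pods
# NAME          READY   STATUS    RESTARTS   AGE
# myapp-7d4f9   0/1     Evicted   0          3m

kubectl describe pod myapp-7d4f9 | grep -E 'Reason|Message'
# Reason:   Evicted
# Message:  The node was low on resource: memory.
```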
'My stuff didn't work on AKS' is one thing; 'my stuff brought AKS and the dashboard down' is a fundamental failure that is in no way mitigated by this comment, and it feels very dishonest to try to redirect the blame for it.
My experience with Azure has been reasonably positive, but even I've seen some weird stuff where things randomly don't work (AAD) or the dashboard just refuses to show anything for a while.
That this is a widespread endemic problem in Azure seems entirely plausible...
It is unclear what this response hopes to achieve. It is mentioned in the post that our containers do crash. That should under no condition cause the underlying node to go down. This has even been pointed out by others responding to this thread.
It is interesting, though, that none of the other issues in the blog post are brought up.
Setting aside the workarounds and safety margins discussed in other comments, I would expect a reasonable operating system to allow explicitly prioritizing processes so that the important ones can only run out of memory after all user processes have been preemptively terminated to reclaim their memory. I would also expect a good container platform to restart system processes reliably, even if they crash.
Scheduling is only really going to work well if you set limits, requests and quotas for containers. Please do this if you're running containers in production. I know it's a pain, as it's non-trivial to figure out how much resource your containers need, but the payoff is you avoid the issues described in the article.
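If you have existing deployments without any resource spec, one low-effort way to retrofit them is `kubectl set resources`; the deployment name and numbers here are placeholders, so size them to your own workload:

```sh
# Requests drive scheduling decisions; limits are the hard ceiling the
# kernel enforces (the container is OOM-killed if it exceeds its memory limit).
kubectl set resources deployment myapp \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```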
The system reservation change was very welcome for me as well.
Note that a service like AKS also draws in new customers who may not yet have years of Kubernetes experience. I'm one of those, for example: I created an AKS cluster so we could deploy short-lived environments for branches of our product. We're using GitLab and its 'Review Apps' integration with Kubernetes.
The instability experienced by the author of this article is something I experienced as well, and I spent a lot of time draining, rebooting, and scaling nodes to try to find out what was happening. I would not have been able to guess that the absence of resource limits could kill a node.
Fortunately these instabilities disappeared a couple of weeks ago after a redeployment of the AKS instance, and it has been stable ever since. I guess the system reservation change was included there? From my perspective that was also the moment AKS truly started feeling like a GA product.
Ah, and Hyper-V supports dynamic memory, so the system reservation backing can effectively be thin provisioned. That's nice. (Hm, dynamic memory probably got switched on from the start.)
Thanks for posting this here. It would be nice if there were a way to hold application users to account without needing to chase viral Internet posts and do your best to pin some accurate reporting on them slightly after the fact. A tricky general problem.
If there's one thing I miss with Azure (and AWS), it's the perpetually-free 600MB RAM KVM VM GCloud gives everyone to play with. It only has 1GB outbound, but inbound bandwidth is free, and I can do pretty much whatever I want with it. But anyways...
I don't think Azure ever uses dynamic memory for VMs - if I SSH into a VM I see the full allocation of whatever size it was supposed to be right off the bat.
I think this has to do with cgroups and ensuring the OOM killer doesn't target what is essentially the `init` process of a Kubernetes cluster - the docker daemon or kubelet.
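You can see how a node protects those processes by reading their OOM score adjustments straight out of procfs (assuming single kubelet/dockerd instances so pidof returns one pid):

```sh
# -1000 exempts a process from the OOM killer entirely; -999 makes it a
# last resort. Managed k8s nodes typically pin kubelet/dockerd near the bottom.
cat /proc/$(pidof kubelet)/oom_score_adj
cat /proc/$(pidof dockerd)/oom_score_adj
```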
This is a pretty bad mistake on the customer's part, if true. If not already done, it would probably be good to expose Prometheus metrics for CPU/memory usage per node.
Yes, this is true on its face: it's bad to deploy containers to k8s without appropriate resource limits. However, this should in no way affect the operation of the node, so the implied transfer of responsibility for this incident from AKS to the customer is invalid, imo.
Lol, so AKS forgot to provision enough resources and possibly set up enforcement, and you're blaming the user? The user should be able to run as close to the edge of "allocatable" as possible, or even go over it and get OOM-killed, without bringing down the entire node. This functionality is even built into the kubelet already. There's no way you can twist this into user error.
More generally I should be able to choose to run an interruptible workload that I know to leak memory. I should expect that if I don’t, one of my coworkers will, and the node will stay up. Not leaving enough RAM for the node’s core resources is a mistake, but far from the worst thing in the world.
>the AKS engineering team spent over a day helping identify that the user had over-scheduled their nodes by running applications without memory limits, resulting in the kernel OOM (out of memory) killer terminating the Docker daemon and kubelet.
Sounds like a bunch of people have just learned about the OOM killer for the first time. I mean, production systems with overcommit, the OOM killer running loose, and I bet without swap... And they blame the customer. Sounds like a PaaS MVP quickly slapped together by an alpha-stage startup. You may want to look into the man pages, in particular OOM scoring and the value -17.
Actually Kubelet should already be adjusting OOM scores to make sure that user pods (containers) get killed over Kubelet or the Docker daemon. Why didn't that work here?
Adjusting scores for other processes skews the chances but doesn't guarantee anything. The way to guarantee it for a given process is to disable the killer for that particular process.
Interesting. The kubelet seems to use varying negative OOMAdjusts to prioritise killing[0] but if I'm reading the kernel code right anything at -999/-998 would return 1 from the badness function and essentially be equally valid to kill unless it was using over 99.9% of available memory.[1]
I see OOMScoreAdjust=-999 being used for the kubelet, but why not -1000? -999 seems like it would be just as likely to be killed as -998, unless the for_each_process(p) macro always goes from the first process to the last?
Seems that way to me too - everything at that level, like the kubelet and "guaranteed" containers, gets a score of 1.
>unless the for_each_process(p) macro always goes first to last processes?
It seems it would usually go through the earliest processes first - per the macro - i.e. it would get to the "top" processes like kubelet, docker, etc. before the containers.
Given that "chosen" is updated only "if (points > chosen_points)", it seems the first listed process with a score of 1 will stay the "chosen" one in that situation, i.e. it will be one of the top processes like the [-999] kubelet, not a [-998] container.
From a provider of Azure's class I'd have expected that they wouldn't rely on that machinery and would instead go the route of disabling the killer for the top processes outright.
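For completeness, "disabling the killer for that particular process" is just writing the minimum adjustment; a sketch (run as root, and again assuming a single dockerd process):

```sh
# -1000 (OOM_SCORE_ADJ_MIN) tells the kernel never to pick this process,
# rather than merely making it very unlikely the way -999/-998 do.
echo -1000 > /proc/$(pidof dockerd)/oom_score_adj
```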
At Google you can't even run anything on Borg until you specify how much memory it will use. You also have to specify how many cores you need and how much local (ephemeral) disk. And the memory limit is hard: your task is killed without any warning if it attempts to exceed the limit. I was actually puzzled to discover that these limits are not required on k8s. Not only does this lead to screwups like this one, it also makes it impossible to schedule workloads optimally, because you simply don't know how much of each resource each job is going to use.
That's not actually how this works on Borg these days (and by "these days" I mean the past 5+ years), and there's nothing about k8s not requiring limits by default that led to this.
I'll let current googlers comment on that. That's how it worked 3 years ago when I was there. You could also let Borg learn how much a job is going to use, but no serious service that I'm aware of used this for anything in Prod.
The slide merely says "most Borg users use Autopilot", which could easily be true. Heck, I used it myself for non-production batch jobs. Those jobs were run as me. Any engineer at Google can spin up a job, and I'd venture to guess that most of them run at least something there every now and then. That's ~40k logical "users" as of 2018. The interesting question (which I admit I don't know the answer to as of today) is whether users that run search, ads, spanner, bigtable, and other shared service behemoths use Autopilot. FWIW my team did not use it at all.
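On the earlier point about k8s not requiring limits: you can opt into Borg-style enforcement per namespace with a compute ResourceQuota, which makes the API server reject pods that don't declare requests and limits. A minimal sketch (namespace and numbers are placeholders):

```sh
# Once this quota exists in the namespace, every new pod must specify
# cpu/memory requests and limits or it is rejected at admission time.
kubectl create quota compute-budget -n my-team \
  --hard=requests.cpu=4,requests.memory=8Gi,limits.cpu=8,limits.memory=16Gi
```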
When I deploy to Amazon ECS, the upper limit of my service's resource geometry is checked, and if it exceeds the capacity available in the underlying cluster, the deploy is refused. I understand k8s has similar features. It reads like Azure doesn't have their k8s configured correctly.
If the containers in a pod request more RAM than is available on any node in the cluster, the pod will fail to schedule and will remain in the Pending state, which can be seen in the events for the controller (replicaset, daemonset, etc.) using, for example, `kubectl describe replicaset myreplicaset`. We've gotten ourselves into this situation a few times on GKE. It's easily resolvable by tuning the resource requests or scaling the node pool, and it has no adverse effect on the operation of the cluster.
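A quick illustration of what that looks like (names and output are made up):

```sh
kubectl get pods
# NAME        READY   STATUS    RESTARTS   AGE
# bigapp-0    0/1     Pending   0          5m

kubectl describe pod bigapp-0
# ...
# Events:
#   Warning  FailedScheduling  default-scheduler
#            0/3 nodes are available: 3 Insufficient memory.
```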