(Eng lead for AKS here)
While lots of people have had great success with AKS, we're always concerned when someone has a bad time. In this particular case the AKS engineering team spent over a day helping identify that the user had over-scheduled their nodes by running applications without memory limits, resulting in the kernel OOM (out of memory) killer terminating the Docker daemon and kubelet. As part of this investigation we increased the system reservation for both Docker and kubelet to ensure that, in the future, if a user over-schedules their nodes, the kernel will only terminate their applications and not the critical system daemons.
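For context on what a reservation like that involves, the kubelet has flags for carving out node resources for the OS and for the Kubernetes daemons themselves. A rough sketch, with made-up values rather than AKS's actual settings:

```sh
# Reserve memory/CPU for OS daemons and for kubelet/Docker, and start
# evicting pods before the node runs completely dry (values illustrative).
kubelet \
  --system-reserved=memory=500Mi,cpu=100m \
  --kube-reserved=memory=1Gi,cpu=200m \
  --eviction-hard=memory.available<200Mi \
  --enforce-node-allocatable=pods
```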
Does it seem weird to anybody else that a vendor would semi-blame the customer in public like this? I can't imagine seeing a statement like this from a Google or Amazon engineer.
It also seems to ignore a number of the points, especially how support was handled. I think it's bad form to respond only to the one thing that can be rebutted and ignore the rest. And personally, I would have apologized for the bad experience here.
While it might be phrased in a way that implies the customer is partly to blame, the actual details would indicate the main problem was with Azure Kubernetes Service. Critical system daemons going down because the application uses too much memory is not a reasonable failure mode (and the AKS team rightfully fixed it).
Exactly. The whole point of offering a service to the public is that you know more than other people. So of course customers will do wrong things, be confused, etc.
In Microsoft's shoes, I would have strongly avoided anything that sounded like customer blame. E.g.: "We really regret the bad experience they had here. They were using the platform in a way we didn't expect, which led to an obviously unacceptable failure mode. We appreciate their bringing it to our attention; we've made sure it won't happen going forward. We also agree that some of the responses from support weren't what they should have been and will be looking at how to improve that for all Azure users."
The goal with a public statement like this isn't to be "right". It isn't even to convince the customer. It's to convince everybody else that their experience will be much better than what is hopefully a bad outlier. The impression I'm left with is that a) Azure isn't really owning their failures, and b) if I use their service in a way that seems "wrong" to them, I shouldn't expect much in the way of support.
My understanding, which may be incorrect, is that they also count all SPLA revenue as cloud revenue.
(SPLA is the licensing paid by service providers to lease infrastructure running Microsoft products to their customers. So if you pay some VPS or server provider $30/mo or whatever they charge for Server 2012, and they turn around and send $28 of it to MS, MS reports that $28 as cloud revenue.)
Well, this is on the front page, the top comment is misinformation, the posters left out details that made them look bad, and they seem to be going on a smear campaign out of spite on every platform they have. At what point is any of this in good faith?
What makes you think it's not in good faith? As far as I can tell, Prashant Deva had a series of bad experiences on Azure, including significant downtime. He's mad, and he's saying so.
From his perspective he was using it right; from Azure's apparently he was using it wrong. A difference in perspective isn't bad faith.
Probably the part where he doesn't actually ever say it's a difference in perspective - that's your take. He says AKS is terrible, etc., etc. You're giving him the benefit of the doubt, which I appreciate, but he's gone too far in his bias. Maybe there's a real issue underlying it, one that Hacker News clearly wants to indulge, but the threshold has been crossed.
He doesn't have to say it's a difference in perspective. He's giving his perspective. That's what blog posts generally are.
I note that you don't say your comments here are just your perspective as you trash-talk him. Does that mean you're pursuing a smear campaign and not acting in good faith? Why should he be held to a standard you yourself aren't willing to follow?
I would prefer that a vendor respond publicly rather than request a private message. It's possible that one side was angry, and a blog post that makes it onto HN will surely get a ton of negative attention. If that's the case, they should have the right to clear up anything they'd like. I didn't read it as blame, but as explanation.
I think it's good to listen to both sides. But the response from Azure eng could be more professional. Customers have the right to do anything, whether technically right or wrong. But the original post's attitude reads more like blaming and throwing out random tech details than like explanation.
Well, we really don't know what was said as the blog didn't actually provide any of the original communications. It's a he said she said thing at this point. Frankly the author comes across as having a huge axe to grind. That may be with good reason but it's hard for me to judge the quality of the Azure support when we never see any of their communications, just paraphrases.
>AKS engineering team spent over a day helping identify that the user had over-scheduled their nodes by running applications without memory limits, resulting in the kernel OOM (out of memory) killer terminating the Docker daemon and kubelet.
I'm a bit confused why the cluster nodes don't come configured like this out of the box... Kubernetes users aren't supposed to have to worry about OOM on the underlying system killing ops-side processes, are they?
In this case, the cluster admin would be whoever is provisioning the cluster nodes. In Google Kubernetes Engine, the "Capacity" and "Allocatable" figures shown on the nodes are different (I see some memory/CPU reserved, presumably for system stuff). This makes me think GKE automatically subtracts what's reserved for the system from the node's capacity.
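For anyone who wants to check this on their own cluster, the gap is visible in `kubectl describe node`; the node name and numbers below are an illustrative sketch, not real GKE output:

```sh
# Allocatable = Capacity minus system/kube reservations and the eviction
# threshold; it's what the scheduler will actually hand out to pods.
kubectl describe node my-node
# Capacity:
#   cpu:     2
#   memory:  7658952Ki
#   ...
# Allocatable:
#   cpu:     1930m
#   memory:  5778888Ki
#   ...
```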
Note that it also needs to match the node configuration (specifically how cgroups are set up), so I doubt this works well on EKS, which is BYO node. Maybe that's the issue with AKS too - I don't know enough about how it works...
AKS now reserves 20% of memory on each agent node, plus a small amount of CPU, so that the Docker daemon and kubelet can keep functioning alongside misbehaving customer pods. However, that just means customers' pods will be evicted, or will have nowhere to schedule, once all the remaining resource is used up. This is something we now see in customer support cases.
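For what it's worth, when that eviction path kicks in it tends to look like this from the outside (pod name and output are illustrative):

```sh
kubectl get pods
# NAME          READY   STATUS    RESTARTS   AGE
# myapp-7d4f9   0/1     Evicted   0          3m

kubectl describe pod myapp-7d4f9 | grep -E 'Reason|Message'
# Reason:   Evicted
# Message:  The node was low on resource: memory.
```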
'My stuff didn't work on AKS' is one thing; 'my stuff brought AKS and the dashboard down' is a fundamental failure that is in no way mitigated by this comment, and it feels very dishonest to try to redirect the blame for it.
My experience with Azure has been reasonably positive, but even I've seen some weird stuff where things randomly don't work (AAD) or the dashboard just refuses to show anything for a while.
That this is a widespread endemic problem in Azure seems entirely plausible...
It is unclear what this response hopes to achieve. It is mentioned in the post that our containers do crash. That should under no condition cause the underlying node to go down. This has even been pointed out by others responding to this thread.
It is interesting, though, that none of the other issues in the blog post are brought up.
Setting aside the workarounds and safety margins discussed in other comments, I would expect a reasonable operating system to allow explicitly prioritizing processes so that the important ones can only run out of memory after all user processes have been preemptively terminated to reclaim their memory. I would also expect a good container platform to restart system processes reliably, even if they crash.
Scheduling is only really going to work well if you set limits, requests and quotas for containers. Please do this if you're running containers in production. I know it's a pain, as it's non-trivial to figure out how much resource your containers need, but the payoff is you avoid the issues described in the article.
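If you have existing deployments without any resource spec, one low-effort way to retrofit them is `kubectl set resources`; the deployment name and numbers here are placeholders, so size them to your own workload:

```sh
# Requests drive scheduling decisions; limits are the hard ceiling the
# kernel enforces (the container is OOM-killed if it exceeds its memory limit).
kubectl set resources deployment myapp \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```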
The system reservation change was very welcome for me as well.
Note that a service like AKS also draws in new customers who may not yet have years of Kubernetes experience. I'm one of those, for example: I created an AKS cluster so we could deploy short-lived environments for branches of our product. We're using GitLab and its 'Review Apps' integration with Kubernetes.
The instability experienced by the author of this article is something I experienced as well, and I spent a lot of time draining, rebooting, and scaling nodes to try to find out what was happening. I would not have been able to guess that the absence of resource limits could kill a node.
Fortunately these instabilities disappeared a couple of weeks ago after a redeployment of the AKS instance, and it has been stable ever since. I guess the system reservation change was included there? From my perspective that was also the moment AKS truly started feeling like a GA product.
Ah, and Hyper-V supports dynamic memory, so the system reservation backing can effectively be thin provisioned. That's nice. (Hm, dynamic memory probably got switched on from the start.)
Thanks for posting this here. It would be nice if there were a way to hold application users to account without needing to chase viral Internet posts and do your best to pin some accurate reporting on them slightly after the fact. A tricky general problem.
If there's one thing I miss with Azure (and AWS), it's the perpetually-free 600MB RAM KVM VM GCloud gives everyone to play with. It only has 1GB outbound, but inbound bandwidth is free, and I can do pretty much whatever I want with it. But anyways...
I don't think Azure ever uses dynamic memory for VMs - if I SSH into a VM I see the full allocation of whatever size it was supposed to be right off the bat.
I think this has to do with cgroups and ensuring the OOM killer doesn't target what is essentially the `init` process of a Kubernetes cluster - the docker daemon or kubelet.
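You can see how a node protects those processes by reading their OOM score adjustments straight out of procfs (assuming single kubelet/dockerd instances so pidof returns one pid):

```sh
# -1000 exempts a process from the OOM killer entirely; -999 makes it a
# last resort. Managed k8s nodes typically pin kubelet/dockerd near the bottom.
cat /proc/$(pidof kubelet)/oom_score_adj
cat /proc/$(pidof dockerd)/oom_score_adj
```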
This is a pretty bad mistake on the customer's part, if true. If not already done, it would probably be good to expose Prometheus metrics for CPU/memory usage per node.
Yes, this is true on its face: it's bad to deploy containers to k8s without appropriate resource limits. However, this should in no way affect the operation of the node, so the implied transfer of responsibility for this incident from AKS to the customer is invalid, imo.
Lol, so AKS forgot to provision enough resources and possibly set up enforcement, and you're blaming the user? The user should be able to run as close to the edge of "allocatable" as possible, or even go over it and get OOM-killed, without bringing down the entire node. This functionality is even built into the kubelet already. There's no way you can twist this into user error.
More generally I should be able to choose to run an interruptible workload that I know to leak memory. I should expect that if I don’t, one of my coworkers will, and the node will stay up. Not leaving enough RAM for the node’s core resources is a mistake, but far from the worst thing in the world.
>the AKS engineering team spent over a day helping identify that the user had over-scheduled their nodes by running applications without memory limits, resulting in the kernel OOM (out of memory) killer terminating the Docker daemon and kubelet.
Sounds like a bunch of people have just learned about the OOM killer for the first time. I mean, production systems with overcommit, the OOM killer running loose, and I bet without swap... And they blame the customer. Sounds like a PaaS MVP quickly slapped together by an alpha-stage startup. You may want to look into the man pages, in particular OOM scoring and the value -17.
Actually Kubelet should already be adjusting OOM scores to make sure that user pods (containers) get killed over Kubelet or the Docker daemon. Why didn't that work here?
Adjusting scores for other processes skews the chances but doesn't guarantee anything. The way to guarantee it for a given process is to disable the killer for that particular process.
Interesting. The kubelet seems to use varying negative OOMAdjusts to prioritise killing[0] but if I'm reading the kernel code right anything at -999/-998 would return 1 from the badness function and essentially be equally valid to kill unless it was using over 99.9% of available memory.[1]
I see OOMScoreAdjust=-999 being used for the kubelet, but why not -1000? -999 seems like it would be just as likely to be killed as -998, unless the for_each_process(p) macro always goes from the first process to the last?
Seems that way to me too - everything at that level, like the kubelet and "guaranteed" containers, gets a score of 1.
>unless the for_each_process(p) macro always goes first to last processes?
It seems it would usually go through the earliest processes first - per the macro - i.e. it would get to the "top" processes like kubelet, docker, etc. before the containers.
Given that "chosen" is updated only "if (points > chosen_points)", it seems the first listed process with a score of 1 will stay the "chosen" one in that situation, i.e. it will be one of the top processes like the [-999] kubelet, not a [-998] container.
From a provider of Azure's class I'd have expected that they wouldn't rely on that machinery and would instead go the route of disabling the killer for the top processes outright.
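For completeness, "disabling the killer for that particular process" is just writing the minimum adjustment; a sketch (run as root, and again assuming a single dockerd process):

```sh
# -1000 (OOM_SCORE_ADJ_MIN) tells the kernel never to pick this process,
# rather than merely making it very unlikely the way -999/-998 do.
echo -1000 > /proc/$(pidof dockerd)/oom_score_adj
```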
At Google you can't even run anything on Borg until you specify how much memory it will use. You also have to specify how many cores you need and how much local (ephemeral) disk. And the memory limit is hard: your task is killed without any warning if it attempts to exceed the limit. I was actually puzzled to discover that these limits are not required on k8s. Not only does this lead to screwups like this one, it also makes it impossible to schedule workloads optimally, because you simply don't know how much of each resource each job is going to use.
That's not actually how this works on Borg these days (and by "these days" I mean the past 5+ years), and there's nothing about k8s not requiring limits by default that led to this.
I'll let current googlers comment on that. That's how it worked 3 years ago when I was there. You could also let Borg learn how much a job is going to use, but no serious service that I'm aware of used this for anything in Prod.
The slide merely says "most Borg users use Autopilot", which could easily be true. Heck, I used it myself for non-production batch jobs. Those jobs were run as me. Any engineer at Google can spin up a job, and I'd venture to guess that most of them run at least something there every now and then. That's ~40k logical "users" as of 2018. The interesting question (which I admit I don't know the answer to as of today) is whether users that run search, ads, spanner, bigtable, and other shared service behemoths use Autopilot. FWIW my team did not use it at all.
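On the earlier point about k8s not requiring limits: you can opt into Borg-style enforcement per namespace with a compute ResourceQuota, which makes the API server reject pods that don't declare requests and limits. A minimal sketch (namespace and numbers are placeholders):

```sh
# Once this quota exists in the namespace, every new pod must specify
# cpu/memory requests and limits or it is rejected at admission time.
kubectl create quota compute-budget -n my-team \
  --hard=requests.cpu=4,requests.memory=8Gi,limits.cpu=8,limits.memory=16Gi
```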
When I deploy to Amazon ECS, the upper limit of my service's resource geometry is checked, and if it exceeds the capacity available in the underlying cluster, the deploy is refused. I understand k8s has similar features. It reads like Azure doesn't have their k8s configured correctly.
If the containers in a pod request more RAM than is available on any node in the cluster, the pod will fail to schedule and will remain in the Pending state, which can be seen in the events for the controller (replicaset, daemonset, etc.) using, for example, `kubectl describe replicaset myreplicaset`. We've gotten ourselves into this situation a few times on GKE. It's easily resolvable by tuning the resource requests or scaling the node pool, and it has no adverse effect on the operation of the cluster.
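A quick illustration of what that looks like (names and output are made up):

```sh
kubectl get pods
# NAME        READY   STATUS    RESTARTS   AGE
# bigapp-0    0/1     Pending   0          5m

kubectl describe pod bigapp-0
# ...
# Events:
#   Warning  FailedScheduling  default-scheduler
#            0/3 nodes are available: 3 Insufficient memory.
```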