DISCLAIMER: I work for Red Hat Consulting as an OpenShift/k8s consultant.
This is such a bad idea. And I get that their point is to reduce latency. But the point of k8s is to describe your workload accurately and allow it to make decisions on your behalf. The no-brainer way to fix this is to set the CPU Requests and Limits to the same value and add an HPA. Setting CPU Requests and Limits to the same value usually gives people the behavior they're expecting. Having more pods can also reduce latency. But, taking away the Limits hides information about the workload while working around the issue at low to medium workloads. If they were ever to get Black Friday or other 2.5x workload peaks, I'd worry that the Limits removal would cause k8s not to be able to schedule the workload appropriately even if they had enough resources on paper. Remember, the idea of k8s is to scale atomically and horizontally while ensuring availability. If you're making something vertically scale, you'd likely want to re-evaluate that workload.
This doesn't seem as dangerous as is being suggested -- and in a world with the kernel bug and some separation of workloads it seems very viable.
Obviously, in a world without the kernel bug it makes much less sense to not set limits, but as far as scheduling goes, well-set requests (+/- a VPA[0]), with a HPA[1] should be enough to handle sudden increases in scale, and for truly large increases that are a complete surprise (otherwise you could have planned for it) elastic infrastructure via a cluster autoscaler[2].
How are the limits incorporated into scheduling? I assumed that was based on requests.
What does "scale atomically" even mean? How does removing limits relate to horizontal vs vertical? HPA is based on request utilization, not limits, afaik.
>How does removing limits relate to horizontal vs vertical?
Vertical -> give more resources to the program
Horizontal -> run more instances of the program
Removing limits gives your pods more resources (scaling them vertically) whereas creating more pods creates more copies (scaling horizontally).
Assuming parent meant scaling by whole units with "scale atomically", that is you have one or two running programs, not "1.5" if you just give it 50% more resources.
tpxl gets me. :D Even the "scale atomically" part.
People seem to have inferred that I believe that Limits are used by the Scheduler. I don't. But if we set "Request = Limits", we're guaranteeing to the Scheduler that our pod workload will never need more than what is Requested, or, we scale up to a new pod.
It seems to me latency is a symptom of the actual issue, not the actual problem.
If a workload idles at 25% of Request, 12.5% of Limits (as in TFA), and peaks at 50% of Request, 25% of Limits that seems hugely wasteful. What's more, the workload has several "opportunities" to optimize latency. And uncapping the CPU Limit reduces the latency. If it were me, I'd be asking, "Why does my workload potentially need access to (but not utilization?) 4, 6, 8, 16, 32 cores to reduce its latency?"
More often than not, I've been able to help customers reduce their latency by DECREASING the Pod's Requests and Limits, but also INCREASING the replica count (via HPA or manually). It's not a silver bullet, and whether a workload is node.js, JBoss EAP, Spring Boot, or Quarkus does matter to some extent. The first thing I reach for in my k8s toolbox is to scale out. "Many hands make light work" is an old adage. N+1 workloads can usually respond to more traffic than N workloads in a shorter amount of time. k8s' strength is that it is networked and clustered. Forcing one node or a set of nodes to work harder (TFA mentions "isolating" the workload) or vertically scaling is an anti-pattern in my book. Especially when you understand the workload pattern well. What is being done here is that nodes (which are likely VMs) are being over-committed [0]. Now, those VMs live on physical hypervisors which are likely -guess what- over-committed. Turtles of (S)POFs all the way down I say.
Also, TFA mentions
> In the past we’ve seen some nodes going to a "notReady" state, mainly because some services were using too much resources in a node.
and
> The downsides are that we lose in “container density”, the number of containers that can run in a single node. We could also end up with a lot of “slack” during a low traffic time. You could also hit some high CPU usage, but nodes autoscaling should help you with it.
So they acknowledge the risk is real and they've encountered it. For most of my customers, failing nodes, reduced "container density", and "slack" are unacceptable. That translates into increased engineer troubleshooting time and higher cloud provider bills. What's worse is that the suggestion that the Cluster Autoscaler will protect you also comes with increased costs (licenses, VM, storage, etc.). Not the solution I want. Seems like a blank check to your cloud provider.
But I get it. I've fought with customers that tell me, "By removing the Limit, my container starts up in half the time." Great. Then they get to Perf Testing and they get wildly inconsistent speed up when scaling out (or way sublinear), or they're limited by resource in their ability to scale up especially when metrics tells them they have resources available, or there is unchecked backpressure, or downstream bottlenecks, or this one workload ends up consuming an entire worker node, or ...
Isn't the solution painfully obvious? Remove the limits around the time you expect extreme loads. Like you said, it works most of the time. Take a hit for unexpected workload spikes. It's a design decision.
Since this started by citing me, I feel somewhat obligated to defend my guidance.
I stand by it.
In an ideal world where apps are totally regular and load is equally balanced and every request is equally expensive and libraries don't spawn threads, sure. Maybe it's fine to use limits. My experience, on the other hand, says that most apps are NOT regular, load-balancers sometimes don't, and the real costs of queries are often unpredictable.
This is not to say that everyone should set their limits to `1m` and cross their fingers.
If you want to do it scientifically:
Benchmark your app under a load that represents the high end of reality. If you are preparing for BFCM, triple that.
For these benchmarks, set CPU request = limit.
Measure the critical indicators. Vary the CPU request (and limit) up or down until the indicators are where you want them (e.g. p95 latency < 100ms).
If you provision too much CPU you will waste it. Maybe nobody cares about p95 @50ms vs @100ms. If you provision too little CPU, you won't meet your SLO under load.
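The "vary the CPU request until the indicators are where you want them" step could be sketched roughly like this. It is only an illustration: `fake_benchmark` is a made-up latency model standing in for a real load test, the 50m step and the search bounds are arbitrary, and the search assumes p95 never gets worse when you add CPU.

```python
from typing import Callable, Optional


def smallest_cpu_meeting_slo(
    measure_p95_ms: Callable[[int], float],
    target_ms: float,
    low_m: int = 100,
    high_m: int = 4000,
    step_m: int = 50,
) -> Optional[int]:
    """Binary-search the smallest CPU setting (millicores) with p95 < target.

    Each call to measure_p95_ms is meant to be one benchmark run with
    CPU request = limit = the given value.
    """
    lo, hi = 0, (high_m - low_m) // step_m
    if measure_p95_ms(low_m + hi * step_m) >= target_ms:
        return None  # even the largest setting misses the SLO
    while lo < hi:
        mid = (lo + hi) // 2
        if measure_p95_ms(low_m + mid * step_m) < target_ms:
            hi = mid  # SLO met, try a smaller setting
        else:
            lo = mid + 1  # SLO missed, need more CPU
    return low_m + lo * step_m


def fake_benchmark(cpu_millicores: int) -> float:
    # Hypothetical model, NOT real data: 40ms floor plus a CPU-bound part.
    return 40.0 + 29000.0 / cpu_millicores


print(smallest_cpu_meeting_slo(fake_benchmark, target_ms=100.0))
```

With a real benchmark each probe is expensive and noisy, so in practice you would probably repeat runs per setting before trusting the comparison.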
Now you can ask: How much do I trust that benchmark? The truth is that accurate benchmarking is DAMN hard. However hard you think it is, it's way harder than that. Even within Google we only have a few apps that we REALLY trust the benchmarks on.
This is where I say to remove (or boost) the CPU limit. It's not going to change the scheduling or feasibility. If you don't use it, it doesn't cost you anything. If you DO use it, it was either idle or you stole it from someone else who was borrowing it anyway.
When you take that unexpected spike - some query-of-doom or handling more load than expected or ... whatever - one of two things happens. Either you have extra CPU you can use, or you don't. When you set CPU limits you remove one of those options.
As for HPA and VPA - sure, great, use them. We use that a LOT inside Google. But those don't act instantly - certainly not on the timescale of seconds. Why do you want a "brick-wall" at the end of your runway?
What's the flip-side of this? Well, if you are wildly off in your request, or if you don't re-run your benchmarks periodically, you can come to depend on the "extra". One day that extra won't be there, and your SLOs will be demolished.
Lastly, if you are REALLY sophisticated, you can collect stats and build a model of how much CPU is "idle" at any given time, on average. That's paid-for and not-used. You can statistically over-commit your machines by lowering requests, packing a bit more work onto the node, and relying on your stats to maintain your SLO. This works best when your various workloads are very un-correlated :)
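A toy version of that kind of model, to show why un-correlated workloads matter. The usage samples below are invented, and the nearest-rank percentile is just one simple choice; a real model would use far more data and a more careful estimator.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def peak_comparison(usage_by_workload, pct=99):
    """Return (sum of per-workload peaks, peak of the combined usage).

    usage_by_workload maps a name to CPU usage samples (cores), with all
    workloads sampled at the same timestamps.
    """
    sum_of_peaks = sum(percentile(s, pct) for s in usage_by_workload.values())
    combined = [sum(t) for t in zip(*usage_by_workload.values())]
    return sum_of_peaks, percentile(combined, pct)


# Hypothetical samples: the two workloads spike at different times.
usage = {
    "a": [1, 1, 1, 4, 1, 1],
    "b": [4, 1, 1, 1, 1, 1],
}
print(peak_comparison(usage))
```

Sizing each workload for its own peak needs 8 cores here, while the node as a whole never sees more than 5. If the spikes lined up, the two numbers would be equal and there would be nothing to over-commit.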
TL;DR burstable CPU is a safety net. It has risks and requires some discipline to use properly, but for most users (even at Google) it is better than the alternative. But don't take it for granted!
Removing CPU limits seems like a bad idea now that there's a kernel fix. But putting that aside...
I don't understand why pods without CPU limits would cause unresponsive kubelets. For a long time now Kubernetes has allocated a slice for system services. While pods without CPU limits are allowed to burst, they are still limited to the amount of CPU allocated to kubernetes pods.
Run "systemd-cgls" on a node and you'll see two toplevel slices: kubepods and system. The kubelet process lives within the system slice.
If you run "kubectl describe <node>" you can see the resources set aside for system processes on the node. Processes in the system slice should always have (cpu_capacity - cpu_allocatable) available to share, no matter what happens in the kubepods slice.
It might depend a lot on the distribution and how kubernetes is started. How much CPU time to reserve for system services from the scheduler (Allocatable as you pointed out) needs to be passed to kubelet, and I think only really applies to guaranteed pods.
What I did on the distribution I work on, is tune the cgroup shares so control plane services are allocated CPU time ahead of pods (whether guaranteed, burstable, or best effort). We don't run anything as static containers, so this covers all the kube services, etcd, system services, etc.
Before this change in our distribution, IIRC, pods and control plane had equal weighting, which allowed the possibility for kubelet or other control plane services to be starved if the system was very busy.
There are also lots of other problems that can lead to kubelet bouncing between ready/not ready that we've observed which wouldn't be triggered by the limits.
Even without the bug it will have negative effect on latency and generally is not really needed for un-metered workloads (there’re posts by thockin on reddit and github that describe this in detail)
To answer your other question - I believe kops ships without system reserved by default
It’s mentioned elsewhere on this thread but essentially, with the default cfs quota period of 100ms, it’s really easy for a multithreaded process to exhaust the quota and just sit there idle until the next period. Another thing is if you have spare cycles that you presumably already paid for why not just use them?
> Removing CPU limits seems like a bad idea now that there's a kernel fix.
Actually, why? Sure those guys may starve the ones without limits but they won't starve each other just because Linux will simply time-share the processes. And for services on the critical path (what they turned it off for) that seems like correct behaviour.
I should have said that it seems like the wrong fix to the problem. But I have since learned that limits can cause excessive throttling. And of course you may want your pods to be burstable, but that would just be a question of setting appropriate limits.
A container with a request but without a limit should be scheduled as Burstable, and it should only receive allocations in excess of its request when all other containers have had their demand <= request satisfied.
A container without either request or limit is twice-damned, and will be scheduled as BestEffort. The entire cgroup slice for all BestEffort pods is given a cpu.shares of 2 milliCPUs, and if the kernel scheduler is functioning well, no pod in there is going to disrupt anything but other BestEffort pods with any amount of processor demand. Throw in a 64 thread busyloop and no Burstable or Guaranteed pods should notice much.
Of course that's the ideal. There is an observable difference between a process that relinquishes its scheduler slice and one that must be pre-empted. But I wouldn't call that a major disruption. Each pod will still be given its full requested share of CPU.
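For anyone who wants the class rules spelled out, here is a rough sketch of how a pod ends up BestEffort, Burstable or Guaranteed. It is a simplification: resources are plain integers (millicores, bytes) so no quantity parsing is needed, and only cpu and memory are considered.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod from its containers' requests and limits.

    containers: list of dicts such as
        {"requests": {"cpu": 100}, "limits": {"cpu": 100, "memory": 256}}
    An unset request is treated as equal to the limit.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            if resource in requests or resource in limits:
                any_set = True
            if resource not in limits:
                guaranteed = False  # every resource needs a limit
            elif requests.get(resource, limits[resource]) != limits[resource]:
                guaranteed = False  # request must equal limit
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": 100}}]))  # request, no limit
print(qos_class([{}]))                          # nothing set
```

The first call is the "request but no limit" case and comes out Burstable; the second is the "twice-damned" case and comes out BestEffort.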
I wrote a little fork+spinloop program w/100 subprocesses and deployed it with a low (100m) CPU request and no limit. It's certainly driving CPU usage to nearly all 8 of the 8 cores on the machine, but the other processes sharing the node are doing fine.
Prometheus scrapes of the kubelet have slowed down a bit, but are still under 400ms.
Prometheus scrape latency for the node kubelet has increased, but it's still sub-500ms.
Note that this cluster (which is on EKS) does have system reserved resources.
I don't really understand why Buffer (or anyone else, for that matter) would choose to remove CPU limits from services where they are extremely important rather than upgrading to a kernel version that doesn't have this bug.
Is upgrading a kernel of a docker host that straightforward? I would worry about keeping everything compatible, with a lot of testing before any upgrade of this kind. It looks like they're running k8s with kops, and the fix was merged just a few weeks ago.
I'm not in a good position to say for sure, as I've only used managed Kubernetes distributions, but I think it probably works out to less work than removing CPU limits. Kernel upgrades are at least semi-routine, so most shops that run Kubernetes themselves are going to have a process for them. Conversely, removing CPU limits and migrating the critical path to a different set of tainted nodes is a substantial one-off change with a long tail of failure scenarios that need to be tested. Thus, I would expect that a kernel upgrade would be easier than doing what Buffer did.
By and large, you can just upgrade the kernel and nothing will break. It's one of the biggest things kernel developers worry over, in fact.
The reason, I think, K8s people might do this is that they often do not have the access, experience, or skills to do platform upgrades. It's easier to just futz around on the layer they feel they understand well; the world of containers.
It should be. Kernel updates are a critical path for bug fixes and security patches. If you can’t upgrade your kernel using a clear, premeditated plan, you are not doing your job right.
You can test out your shit by creating a new node with the recent kernel version installed. Then schedule some apps for testing on that node and see if everything works. Swap it out when you're done testing.
Yes, you just install the kernel image, kernel headers and kernel modules and reboot. You are not upgrading the distro. With Kubernetes it would be quite easy to test in isolation by simply tainting the worker node(s) that have the upgraded kernel.
Can’t remember the last time I had a centos kernel update (kernel-ml) break anything over the past 7 odd years I’ve been running it across hundreds of servers.
The core principle most readers miss is that CPU limits are tied to CPU throttling, which is markedly different than CPU time sharing. I would argue that in 99% of cases, you truly do not need or want limits.
limits cause CPU throttling, which is like running your process in a strobe light. If your quota period is 100ms, you might only be able to make progress for 10ms out of every 100ms period, regardless of whether or not there is CPU contention, just because you've exceeded your limit.
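To put numbers on the strobe light, here is a back-of-the-envelope model of quota throttling. It is idealized on purpose: the work is perfectly parallel, there are enough idle cores for every thread, and the process starts exactly at a period boundary. The function name and parameters are mine, not anything from the kernel.

```python
def wall_time_ms(cpu_ms_needed, limit_millicores, threads=1, period_ms=100.0):
    """Wall-clock time to burn cpu_ms_needed of CPU under a CFS-style quota.

    Each period the cgroup may use limit_millicores/1000 * period_ms of CPU
    time; once that is spent it sits throttled until the next period.
    """
    if limit_millicores <= 0 or threads < 1:
        raise ValueError("need a positive limit and at least one thread")
    quota_ms = limit_millicores * period_ms / 1000.0
    # A period cannot consume more CPU than the threads can physically burn.
    budget = min(quota_ms, threads * period_ms)
    elapsed = 0.0
    remaining = float(cpu_ms_needed)
    while remaining > budget:
        remaining -= budget   # run until the quota is gone...
        elapsed += period_ms  # ...then wait out the rest of the period
    return elapsed + remaining / threads


# 50ms of CPU work with a 100m limit: 10ms of progress per 100ms period.
print(wall_time_ms(50, limit_millicores=100))
# 400ms of CPU work on 8 threads with a 2-CPU limit.
print(wall_time_ms(400, limit_millicores=2000, threads=8))
```

In this model the first request takes 410ms of wall time for 50ms of work. The second takes 125ms, where the same 8 threads with no quota would be done in 50ms, even though nothing else wanted the CPU.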
requests -> CFS time sharing. This ensures that out of a given period of time, CPU time is scheduled fairly and according to the request as a proportion of total request (it just so happens that the Kube scheduler won't schedule such that sum[requests] > capacity, but theoretically it could because requests are truly relative when it comes to how they are represented in cgroups)
Here is the fundamental assertion: requests ensure fair CPU scheduling in the event of CPU contention (more processes want CPU than can be scheduled). Given that you are using requests, why would you want limits? You might think "limits prevent a process from taking too much CPU" but that's just not true. If that process DID try to use up too much CPU, CFS would ensure it does not via fair time sharing. If no other running processes needed the CPU, why enforce CPU throttling which has very bad effects on tail latency?
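A small model of what "fair time sharing according to the request" means in practice. This is not how CFS is implemented (it works on virtual runtimes, not a global solve); it is a weighted max-min allocation that approximates the steady-state outcome, with invented workload names and numbers.

```python
def cpu_allocation(capacity_cores, shares, demand_cores):
    """Split capacity among workloads in proportion to their shares.

    Nobody gets more than they ask for; CPU a workload does not want is
    re-offered to the others, again in proportion to shares.
    """
    alloc = {name: 0.0 for name in shares}
    active = {name for name in shares if demand_cores[name] > 0}
    remaining = float(capacity_cores)
    while active and remaining > 0:
        total = sum(shares[n] for n in active)
        # Workloads whose whole demand fits inside their proportional slice.
        satisfied = {
            n for n in active
            if demand_cores[n] <= remaining * shares[n] / total
        }
        if not satisfied:
            # Everyone left wants more than their slice: split by weight.
            for n in active:
                alloc[n] = remaining * shares[n] / total
            break
        for n in satisfied:
            alloc[n] = demand_cores[n]
            remaining -= demand_cores[n]
        active -= satisfied
    return alloc


shares = {"api": 1024, "spinloop": 102}  # roughly 1 CPU vs 100m requested
print(cpu_allocation(8, shares, {"api": 2.0, "spinloop": 100.0}))
print(cpu_allocation(8, shares, {"api": 8.0, "spinloop": 100.0}))
```

In the first case the spinloop soaks up the 6 idle cores and the api still gets the 2 it wants. In the second, once the api wants everything, the spinloop is squeezed down to about 0.72 cores, with no limit involved.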
+1 The only good reason to use cpu limits I can think of is if you sell metered compute and run it on k8s. I’d be curious to know if anyone actually does this though
the genesis of cfs_quota and cpu throttling in general has to do with modulating power consumption of a chip, iirc. It's truly a fallacy that limits are needed to prevent noisy neighbor type stuff.
Huh, didn’t know about the reason behind cfs quota, thanks. Yeah, it always seemed of dubious usefulness to me. Considering I can probably trash cpu caches without using many cycles and do other things with disk and network io, I’m a bit surprised people worry about cfs quota so much
This seems like a bad trade-off, at least for 99% of us who haven’t been using Kubernetes in production for the last 5 years and manage it ourselves.
Putting all the “user facing” services in a state where one of them consuming all the CPU could affect all the others feels like a disaster waiting to happen.
If there are several services all with the same share of CPU resources and with no configured limits, and they are all runnable, then none of them will be able to starve the others. The kernel will schedule them each in turn.
If you configure a static limit what you get is services that don't run even when there is CPU time available, which is bad.
The number of times I’ve seen CPU limits kill off pods during even mild spikes and causing pretty much downtime and “disaster” is just as surprising. Work on autoscaling nodes instead, don’t use cpu limits.
This advice is confusing. CPU is a "compressible" resource -- pods don't get killed for (trying to) exceed it. Pods don't get evicted from nodes based on CPU starvation, so autoscaling your node count won't help if you end up with a set of pods on a node that need more CPU than the node can provide. They'll just starve each other.
If your service allows horizontal scalability, you can use autoscaling of pods with Horizontal Pod Autoscaler (ideally also with a cluster autoscaler) to increase pod count for a given service when some percentage of the requested CPU is exceeded, whether or not you set a CPU limit. Setting the cpu_request appropriately for your pods is critical to ensure that node CPU is not oversubscribed by the Kubernetes pod scheduler.
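For reference, the scaling rule the HPA documents is simple enough to sketch. The 0.1 tolerance matches the documented default as far as I know; min/max replica bounds, stabilization windows and unready pods are all left out here.

```python
import math


def desired_replicas(current_replicas, current_utilization_pct,
                     target_utilization_pct, tolerance=0.1):
    """HPA-style calculation: scale by the ratio of observed to target.

    Utilization is average CPU usage as a percentage of the CPU *request*,
    which is why this works with or without a CPU limit.
    """
    ratio = current_utilization_pct / target_utilization_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough, do nothing
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90, 60))   # pods at 90% of request, target 60%
print(desired_replicas(4, 63, 60))   # within tolerance
```

Four pods running at 90% of their request against a 60% target become six; at 63% nothing happens.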
Pods where mem & CPU requests = limits are given the highest class of service ("guaranteed"). For your most critical and latency sensitive services, this is the best approach when also coupled with HPA. Assuming a 4.19 kernel or later, I suppose.
Maybe I'm misunderstanding you, but I'm pretty sure CPU limits will only limit the amount of CPU used even if there is more available. It will not kill off the pod.
Memory limits however will kill the pod if the pod uses more than the limit.
It is possible to get into this state. CPU starvation can be so severe that containers start failing their liveness probes and are killed. This is obviously very different than things like memory limits where the kernel OOMKills you, but will look similar to the untrained observer. Their app is serving 503s and the containers are in a restart loop -- looks like a Kubernetes problem.
In general, the problem is that people don't understand how these complex systems interact -- what do limits do, what are the consequences of limits, how do you decide on correct limits, what do liveness and readiness probes do, what is the kubelet's role in the probes, wait what's a kubelet, etc.
That may be more likely with limits, but it doesn’t require a limit. I’ve had lots of fun with that in Elasticsearch pods with no limit. And then you get to enjoy a nice cascading failure.
This isn't exclusive to containers, you had this problem with VMs and even hardware. Back in the physical server days, when you didn't have enough CPU to service the current rate of inbound requests, eventually your healthchecks start to fail and load balancers would evict you from the pool. Now instead of a physical CPU, it sounds like you're running into a CPU limit implemented in the kernel but the same basic principles apply.
Kubernetes changes how you create the infrastructure but you still have the same problems you had in a distributed system before.
> kops: Since June 2020, kops 1.18+ will start using Ubuntu 20.04 as the default host image. If you’re using a lower version of kops, you’ll have to probably to wait the fix. We are currently in this situation.
We're running Ubuntu 20.04 on Kops 1.17 in production just fine, thank you very much. It wasn't a happy path since it wasn't officially supported then - stuff about forcing iptables-legacy instead of nftables - but with a couple hacks we got it to work just fine (Kops was in a bad situation where CoreOS was hitting EOL and there were no officially supported distributions running updated kernels that patched the CPU throttling issues, so we worked with the maintainers to figure out what we needed to do, as the maintainers were also running Ubuntu 20.04 on versions of Kops which didn't formally support it).
This whole blog post is dangerous. CPU limits are really important for cluster stability, as I'm sure the author will find out soon enough. Why bother with dangerous workarounds for problems that have actual solutions? This makes no sense to me.
I encountered that issue on my company Mesos cluster. Here are some details.
We ran our largest application from bare-metal to Mesos (https://medium.com/criteo-labs/migrating-arbitrage-to-apache...) and observed performance was not as good as expected (especially on 99pctl latency).
Other applications were showing similar behavior.
We ended up finding the issue with cfs bandwidth cgroup, considered several alternatives and eventually moved to cpusets instead.
cpusets give you:
- a better mental model (it's far easier to reason about "dedicated cpus")
- a net performance gain (5% to 10% less cpu consumption)
- more consistent latency (if nothing else runs on the same cpu as your app, you benefit from good scheduling and possibly avoid cpu cache issues)
When the fixed kernel was released, we decided to upgrade to it and keep our new model of cpu isolation.
At a previous job we made an argument for moving away from cfs and to look at only full core allocation, often pinning with NUMA. The speed up was noticeable, since it removed the cfs overhead and memory access was now local.
We then got stuck in discussions around partial core allocation. We didn’t have that many jobs configured to use less than a full core, but it did impact our container packing.
That's why we put the CPUThrottlingHigh alert into the kubernetes-mixin project. It at least lets folks know. The Node Exporter for example is always throttled and I don't mind. For the user facing parts I'd rather not be in the same situation. Ultimately latency should tell me though.
I've seen CPU throttling occur when limits aren't exhausted even on 5.4 kernels, so I don't believe the underlying kernel bug is fixed.
One option not mentioned in the post is to enable k8s' static CPU scheduler policy. With this option in place workloads in the "guaranteed" quality of service class that are allocated an integer CPU limit will be given exclusive use of their CPUs. I've found this also avoids the CFS bugs and eliminates CPU throttling, without removing CPU limits.
One thing to keep in mind is that this bug mostly impacts workloads that spin up more threads than they have allocated CPUs. For golang workloads you can set GOMAXPROCS to be equal to your CPU allocation and eliminate most throttling that way too, without messing with limits or the static scheduler policy
Enabling the static CPU scheduler policy currently requires setting a Kubelet flag, and that puts it out of reach of most people running managed Kubernetes distributions.
Because EKS supports custom launch templates? Good luck trying to finagle that into supporting the exact Kubelet flags that you want to enable, while staying abreast of upstream updates so that your cluster doesn't break when AWS tries to keep it up-to-date. Not anywhere close to a simple "extra_kubelet_flags: array[text]" kind of field.
In the low latency trading world, these concerns are addressed by partitioning resources (for CPU, with affinities). This seems like a simpler mechanism that doesn’t require the kernel/daemon to track resource usage and to impose limits.
I see only upsides to performance (bandwidth and latency) and availability by partitioning resources — so what are the benefits of the alternative, using limits, beyond being able to stuff more apps onto a machine? That’s not to trivialize that benefit.
Video Games industry is the same, in fact it was one of the reasons we went with google cloud over alternatives, at the time Amazon was not using KVM (or, HVM as they seem to call it)- and GCP was at least attempting CPU affinity on the VMs, this caused quite a variance in latency when using amazon which did not exist on GCP.
To answer your question: I believe there is 'pinning' in Kubernetes which can solve it, but kubernetes has other overheads in terms of latency (iptables pod routing with conntrack enabled for instance) so I personally would avoid using it for low latency applications.
For videogames, you should not be subject to the iptables bits - Agones encourages use of the `hostPort` networking mode, which doesn't create or require special iptables routing.
If you’re talking about core pinning (cpuset.cpu_exclusive) Google had famously used this for some workloads and when they turned it off by accident performance got better
I have experience with affinities (I don’t know the k8s name for them) where it is crucial to achieving low latency (way sub microsecond RTT). Depending on your app’s architecture, it can be a major boon to bandwidth as well.
EDIT: Sorry, I wrote that thinking you were referring to k8s. Just sched_setaffinity and isolcpus are sufficient. YMMV
Yeah, Kubernetes only uses cpusets to pin workloads to cores, and only when you have this enabled explicitly. sched_setaffinity is a separate mechanism not used by k8s
99% of typical kubernetes workloads don't have those kinds of latency requirements, and it may be detrimental to their throughput to only use a subset of cores (classic throughput vs latency tradeoff).
Here's a story that might make you not want to do that.
We ran Kubernetes with the standard scheduler and node autoscaling for a long time, and used to allow developers in our (simplified) manifests define resource requests and limits. We saw that with our current config, we always had some unused capacity (that we wanted) since the scheduler spread out workloads while the autoscaler was only throwing nodes away with less than 70% load. So we started ignoring the limits provided by developers. This was initially a great success, our response times in the 99th went down drastically, even during sudden traffic spikes.
2 years later, and nobody cares about resource allocation for new services anymore. We can essentially never disable bursting again, because too many services (100+) use the extra capacity constantly, and due to our organizational structure we can't really _make_ these teams fix their allocations.
As an unsolicited recommendation, if you _wanted_ to attempt to chase this down, you should look at using the Vertical Pod Autoscaler (https://github.com/kubernetes/autoscaler/tree/master/vertica...) in recommendation only mode in order to get a good idea of what services are actually using, and set new limits based on that.
It definitely won't be optimal, but should let you get to a place where you at least have some limits set.
We at Mux have removed nearly all limits but set up alerts that trigger when a container consistently bursts above the request so we can chase those down (temporary bursts are ignored). Never had any issues
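The "consistently bursts, but ignore temporary bursts" check can be as simple as requiring N samples in a row above the request. This is only a sketch of the idea, not the actual alert; the threshold of 5 consecutive samples is an arbitrary placeholder.

```python
def sustained_burst(usage_millicores, request_millicores, min_consecutive=5):
    """True if usage stays above the request for min_consecutive samples in a row."""
    run = 0
    for sample in usage_millicores:
        run = run + 1 if sample > request_millicores else 0
        if run >= min_consecutive:
            return True
    return False


spiky = [100, 900, 120, 110, 950, 100, 130, 990, 100]
steady = [100, 600, 620, 640, 610, 650, 630, 100, 100]
print(sustained_burst(spiky, request_millicores=500))   # short spikes only
print(sustained_burst(steady, request_millicores=500))  # six samples in a row
```

The spiky series never trips the check; the steady one does, which is the signal that the request is probably set too low.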
We enforce cpu request (and memory request and limit) via process and plan on adding automation to enforce that so shouldn’t be a problem even with scaled up team since you can only hurt yourself by setting request too low. Not sure how the number 30 was chosen...
Would work better? If the problem is that you are getting throttled at a lower rate than your specified limit, maybe bumping that would help. But you still get to use your target “request” for bin packing / node resource tracking.
This would depend on the throttle level being proportional to the specified limit and not something orthogonal like number of processes - but if you don’t want to turn off limits entirely it might at least help.
"The danger of not setting a CPU limit is that containers running in the node could exhaust all CPU available."
My assumptions have been:
1. cpu request tells you how much cpu a pod gets MINIMUM always, independently of how much other pods use it or not
2. on GKE you can't request 100% cpu due to google reserving cpu for the node
3. if you have hard limits, your cluster utilisation will be bad -> we do remove cpu limits due to this.
The reason a container with no limit can exhaust CPU is that kubernetes CPU requests map to the cpushares accounting system, and CPU limits map to the Completely Fair Scheduler's cpuquota system. The cpushares system divides a core into 1024 shares, and guarantees a process gets the number of shares it reserves, but it does not limit the process from taking more shares if other processes aren't consuming them. The cpuquota system divides CPU time into periods of... I think... 100k microseconds by default, and hard limits a process at the number of microsecs per periods it requests. So if you don't set limits you're only using the cpushares system, and are free to take up as much idle CPU as you can grab.
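The two mappings can be written down roughly as follows. The constants (minimum of 2 shares, maximum of 262144 shares, minimum quota of 1000 microseconds, 100000 microsecond period) are my understanding of the kubelet's cgroup v1 behavior and should be treated as assumptions; cgroup v2 uses different knobs.

```python
MIN_SHARES = 2            # assumed floor for cpu.shares
MAX_SHARES = 262144       # assumed ceiling for cpu.shares
MIN_QUOTA_US = 1000       # assumed floor for cpu.cfs_quota_us
DEFAULT_PERIOD_US = 100_000


def milli_cpu_to_shares(request_millicores):
    """CPU request -> cpu.shares (relative weight, 1024 per full CPU)."""
    if request_millicores == 0:
        return MIN_SHARES
    shares = request_millicores * 1024 // 1000
    return max(MIN_SHARES, min(MAX_SHARES, shares))


def milli_cpu_to_quota_us(limit_millicores, period_us=DEFAULT_PERIOD_US):
    """CPU limit -> cpu.cfs_quota_us (hard cap per period), None if no limit."""
    if limit_millicores == 0:
        return None  # no limit set: only the shares system applies
    return max(MIN_QUOTA_US, limit_millicores * period_us // 1000)


print(milli_cpu_to_shares(500), milli_cpu_to_quota_us(500))
print(milli_cpu_to_shares(100), milli_cpu_to_quota_us(0))
```

A 500m request becomes 512 shares, and a 500m limit becomes 50000 microseconds of CPU time per 100000 microsecond period. With no limit there is no quota at all, so the weight is the only thing holding the container back, and it only matters under contention.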
1 is correct, 2 is partially correct, and 3 is not correct.
It is absolutely true Kubernetes will reserve the amount of CPU you request, although it will also allow you to exceed that request if you attempt to and there is free CPU time to service you. 2 is correct in so far as Google run daemonsets on GKE which themselves have CPU requests and limits, and thus there will never be a node which has 100% cpu free for you to request. 3 is simply incorrect - it may be true that for some combinations of nodes and workloads it is not possible for the Kubernetes scheduler to bin-pack efficiently, but for large clusters with diverse workloads this should not be a problem.
Excluding kernel bugs, CPU limits just provide an upper bound on burst capacity. That controls oversubscription of CPU on a node. As with any other kind of oversubscription of a resource based on variable demand, there is a tradeoff. Allowing one pod to burst over its request is both unreliable and potentially impacting other neighboring pods. Whether that improves your cluster efficiency or introduces intolerably high variability in service latency and throughput depends on your mix of workloads and how the scheduler distributes your various pods.
Buffer's solution of having different flavors of node, onto which mutually compatible workloads are scheduled in isolation from incompatible ones, is a very reasonable thing to do, even if this particular case is a bit of a head-scratcher.
From a traditional cluster perspective, we've been doing this for years.
depending on the goal of your service and cluster, it might be preferable to over subscribe your CPU.
Compared to Memory oversubscription, CPU over sub isn't anywhere near as much of a show stopper, so long as your service degrades well when it can't get the CPU it needs.
Where cost is an issue, it's very much worth oversubscribing your CPU by 20% to ensure you are rinsing the CPU.
As a mainframer once told me, "There's nothing wrong with having 100% CPU usage. We paid for it, we'd better use it".
On an interesting note, in mainframes it's normal to pay for a machine with n CPUs and get a n+m CPU machine delivered and installed. The extra CPUs are inactive until you pay for the upgrade and receive an activation code. In order to reduce downtime, during startup it's possible to have more than your licensed CPUs active to speed up the boot process and to catch up with any missed jobs.
The problem is the lack of controls at the timescales the CPU scheduler uses, which do not necessarily match the timescales of applications. This is a classic statistical multiplexing and burstiness problem, often encountered in the network queueing world. A couple of months ago I wrote a blog post and a couple of synthetic benchmarks that highlight the issues, which you might find interesting: https://medium.com/engineering-at-palo-alto-networks/kuberne...
The Go runtime also locks in some unwarranted assumptions at process start time, and never changes its parameters if the number of available CPUs changes.
Explicitly setting GOMAXPROCS is probably the cleanest way to limit CPU among the runtimes that are out there, however. For example, if you set requests = 1, limits = 1, GOMAXPROCS=1, then you will never run into the latency-increasing cfs cpu throttling; you would be throttled if you used more than 1 CPU, but since you can't (modulo forks, of course), it won't happen. There is https://github.com/uber-go/automaxprocs to set this automatically, if you care.
You are right that by default, the logic that sets GOMAXPROCS is unaware of the limits you've set. That means GOMAXPROCS will be something much higher than your cpu limit, and an application that uses all available CPUs will use all of its quota early on in the cfs_period_us interval, and then sleep for the rest of it. This is bad for latency.
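A back-of-the-envelope model of that effect, assuming the default 100 ms period and a process whose threads are all CPU-bound for the whole period. The function name is made up, and real workloads are burstier than this.

```python
# Toy model of CFS quota exhaustion within one period.
# Assumption: `busy_threads` threads each burn one full CPU until the quota is gone.

def throttle_profile(limit_cpus, busy_threads, period_ms=100.0):
    """Return (ms running, ms throttled) within a single CFS period."""
    quota_ms = limit_cpus * period_ms      # CPU time allowed per period
    # N parallel threads consume N ms of CPU time per wall-clock ms.
    run_ms = min(period_ms, quota_ms / busy_threads)
    return run_ms, period_ms - run_ms


if __name__ == "__main__":
    # limit = 1 CPU, GOMAXPROCS left at 8 on an 8-core node
    print(throttle_profile(1, 8))   # (12.5, 87.5)
    # limit = 1 CPU, GOMAXPROCS = 1
    print(throttle_profile(1, 1))   # (100.0, 0.0)
```

In the first case a request that arrives just after the quota runs out waits up to 87.5 ms before any of its work is scheduled; in the second the process never hits the quota.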
Setting GOMAXPROCS explicitly is the best practice in my experience. The runtime latches in a value for runtime.NumCPU() based on the population count of the cpumask at startup. The cpumask can change if Kubernetes schedules or de-schedules a "guaranteed" pod on your node and the kubelet is using the static CPU management policy, and it will vary from node to node if you have various types of machines. You don't want to have 100 replicas of your microservice all using different, randomly chosen values of GOMAXPROCS.
Last time I was knee-deep in managing prod Kube infrastructure, limits were also there for scheduling purposes. It's hard to properly allocate services across nodes when there is no concept of the requirements they have. I guess you can get around that by setting a request rather than a limit?
Pod scheduling is based on Requests so you can definitely go without setting a limit. I don't think the limit itself plays a role in Kubernetes scheduling (unless you don't specify a request, then limit equals request).
And you are definitely right: scheduling a pod without request/limit is like giving a blank check.
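As a rough sketch of that rule (simplified to CPU only, with hypothetical function names, and ignoring everything else the real scheduler considers):

```python
# Simplified model of request-based placement. Limits are only consulted
# when a container sets a limit but no request.

def effective_request(request_milli, limit_milli):
    """Request used for scheduling; falls back to the limit if no request is set."""
    if request_milli is not None:
        return request_milli
    if limit_milli is not None:
        return limit_milli
    return 0  # no request, no limit: the "blank check" case


def fits(allocatable_milli, scheduled_requests_milli, pod_request_milli):
    """A pod fits if the sum of requests stays within the node's allocatable CPU."""
    return sum(scheduled_requests_milli) + pod_request_milli <= allocatable_milli


if __name__ == "__main__":
    node = 4000  # 4 allocatable cores, in millicores
    running = [1000, 2000]
    print(fits(node, running, effective_request(1000, None)))   # True
    print(fits(node, running, effective_request(None, 1500)))   # False
    print(fits(node, running, effective_request(None, None)))   # True, reserves nothing
```

The last case is the blank check: the pod always fits because it reserves nothing, so the scheduler has no idea what it will actually consume.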
This is good advice if used carefully. OpenVZ still did this the best way. It allowed you to set guaranteed minimums and no max, so you could guarantee CPU time to the host node. It scaled containers and used resources wonderfully efficiently.
Would the latency of the system be improved by reserving some amount of CPU for the container? For example if a container always got a few milliseconds per period, or if you even reserved a part or all of a CPU for the container.
I work on a team that operates multitenant GKE clusters for other engineers at our company.
Earlier this year I read this blog post [1] about a CFS bug in the Linux kernel that unnecessarily throttles workloads. Kernel versions 4.19 and higher have been patched.
I asked GCP support which GKE versions included this patch. They told me 1.15.9-gke.9.
But my team at work is still getting reports of CPU throttling causing increased latencies on GKE workloads in these clusters.
This means one of the following:
1. We're using a kernel that doesn't contain the patch.
2. The patch wasn't sufficient to prevent unnecessary CPU throttling.
3. The latency is caused by something other than CPU throttling.
To rule out 1, I again checked that our GKE clusters (which are using nodes with Container Optimized OS [COS] VM images) are on a version that contains the CFS patch.
```
dxia@one-of-our-gke-nodes ~ $ uname -a
Linux one-of-our-gke-nodes 4.19.112+ #1 SMP Sat Apr 4 06:26:23 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux
```
Kernel version is 4.19.112+ which is a good sign. I also checked the COS VM image version.
gke-11512-gke3-cos-77-12371-227-0-v200605-pre
The cumulative diff in the [COS release notes][2] for cos-stable-77-12371-227-0 shows this lineage (see "Changelog (vs ..." in each entry).
cos-stable-77-12371-227-0
77-12371-208-0
77-12371-183-0
77-12371-175-0
77-12371-141-0 <- This one's notes say "Fixed CFS quota throttling issue."
Now looking into 2:
See this dashboard [5]. The top graph shows an example Container's CPU limit, request, and usage. The bottom graph shows the number of seconds the Container was CPU throttled, as measured by sampling the local kubelet's Prometheus metric `container_cpu_cfs_throttled_seconds_total` over time. CPU usage data is collected from resource usage metrics for Containers from the [Kubernetes Metrics API][6], which returns metrics from the [metrics-server][7].
The first graph shows usage is not close to the limit, so there shouldn't be any CPU throttling happening.
The first drop in the top graph is from decreasing the CPU limit from 24 to match the CPU request of 16. That decrease actually caused CPU throttling to increase. We removed CPU limits from the Container on 8/31 at 12:00, which brought the number of seconds of CPU throttling down to zero. This makes me think the kernel patch wasn't sufficient to prevent unnecessary CPU throttling.
This K8s GitHub issue ["CFS quotas can lead to unnecessary throttling #67577"][8] is still open. The linked [kernel bug][9] has a comment saying it should be marked fixed. I'm not sure if there are still CPU throttling issues with CFS that aren't tracked in issue #67577 though.
Because of the strong correlation in the graphs between removing CPU limits and CPU throttling, I'm assuming the kernel patch named "Fixed CFS quota throttling issue." in COS 77-12371-141-0 wasn't enough.
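For anyone who wants to cross-check the Prometheus numbers on the node itself, here is a small sketch that summarizes a cgroup v1 `cpu.stat` file (fields `nr_periods`, `nr_throttled`, `throttled_time` in nanoseconds). The sample values are made up for illustration and are not from our clusters; where the file lives depends on the cgroup layout, so the script only parses text you feed it.

```python
# Summarize CFS throttling counters from the text of a cgroup v1 cpu.stat file.
# The SAMPLE below is invented data, used only to show the calculation.

SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 12500000000
"""


def parse_cpu_stat(text):
    """Parse 'key value' lines into a dict of ints, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(stats):
    """Return (fraction of periods throttled, total seconds throttled)."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    ratio = throttled / periods if periods else 0.0
    seconds = stats.get("throttled_time", 0) / 1e9  # nanoseconds -> seconds
    return ratio, seconds


if __name__ == "__main__":
    print(throttle_summary(parse_cpu_stat(SAMPLE)))  # (0.25, 12.5)
```

Taking two readings a known interval apart and diffing them gives a rate comparable to what the dashboard plots.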
Questions
1. Anyone else using GKE run into this issue?
2. Does anyone have a link to the exact kernel patch that the COS entry "Fixed CFS quota throttling issue." contains? A Linux mailing list ticket or patch would be great so I can see if it's the same patch that various blog posts reference.
3. Anyone aware of any CPU throttling issues in the current COS version and kernel we're using? 77-12371-227-0 and 4.19.112+, respectively.
Hey David, we talked on a podcast once :) Please raise a support case and send me the ticket number; I'll see if we can get to the bottom of this for you.
They should have just upgraded the kernel to a fixed one, which definitely does not require upgrading the whole distribution.
Also if they are using Kubernetes normally there is no reason to not upgrade the whole distribution as well, since only Kubernetes will be running on it, and of course that's widely tested (the containers each choose their own distribution, only the kernel is shared).
You should not run more than one application/service in a VM if you are worried about performance. Then you don't need to worry about configuring CPU limits. Kubernetes doesn't only slow down your application; it also increases your operating cost and team size by several orders of magnitude.
No, you still have a CPU request, and the HPA is based on utilization of that.
Also, even without limits I believe CPU is prioritized based on the request. So if 1 pod requests 100 millicpu and another pod requests 200 millicpu, if they both try to use all the CPU on a node the one that requested 200 millicpu will use 2/3 of the CPU and the other will use 1/3.
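A quick sketch of that split, assuming requests are converted to cpu.shares weights at 1024 per core and that every pod is trying to use as much CPU as it can. The function name is hypothetical.

```python
# Proportional CPU split under full contention, weighted by cpu.shares.
# Assumption: every pod is runnable all the time and no limits are set.

def contended_split(requests_milli):
    """Map pod name -> fraction of node CPU it gets when all pods are CPU-bound."""
    shares = {name: max(2, milli * 1024 // 1000)
              for name, milli in requests_milli.items()}
    total = sum(shares.values())
    return {name: s / total for name, s in shares.items()}


if __name__ == "__main__":
    split = contended_split({"small": 100, "large": 200})
    for name, fraction in split.items():
        print(f"{name}: {fraction:.3f}")   # small: 0.333, large: 0.667
```

The weights only matter under contention; if the 200m pod is idle, the 100m pod can still use the whole node.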
Fixed in the following COS stable images back in January:
cos-stable-79-12607-80-0
cos-stable-77-12371-141-0
cos-stable-73-11647-415-0
cos-stable-78-12499-89-0
Hey, didn't notice your comment when I posted mine. But I'm seeing similar behavior on GKE 1.15.12-gke.3: CPU throttling even when CPU usage < CPU limits. https://news.ycombinator.com/item?id=24351566
This is standard practice in some domains. Reserve 1 or 2 cores for admin and possibly for the interrupt daemon, and isolate/affinitize app processes to the remaining cores -- ideally giving more resources to your "hot" threads, and better resources (the core "closest" to your NIC) to your network thread, etc.