This is such a bad idea. And I get that their point is to reduce latency. But the point of k8s is to describe your workload accurately and allow it to make decisions on your behalf. The no-brainer way to fix this is to set the CPU Requests and Limits to the same value and add an HPA. Setting CPU Requests and Limits to the same value usually gives people the behavior they're expecting. Having more pods can also reduce latency.

But taking away the Limits hides information about the workload while working around the issue at low to medium workloads. If they were ever to get Black Friday or other 2.5x workload peaks, I'd worry that removing the Limits would cause k8s not to be able to schedule the workload appropriately even if they had enough resources on paper. Remember, the idea of k8s is to scale atomically and horizontally while ensuring availability. If you're making something scale vertically, you'd likely want to re-evaluate that workload.
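A minimal sketch of what I mean, with made-up values (the names and numbers are placeholders, not recommendations):

    # Container spec fragment: requests == limits gives the Guaranteed QoS class,
    # and an HPA (not shown here) adds replicas when CPU utilization rises.
    containers:
    - name: app
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"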
This doesn't seem as dangerous as is being suggested -- and in a world with the kernel bug and some separation of workloads it seems very viable.
Obviously, in a world without the kernel bug it makes much less sense to not set limits, but as far as scheduling goes, well-set requests (+/- a VPA), together with an HPA, should be enough to handle sudden increases in scale, and for truly large increases that are a complete surprise (otherwise you could have planned for it) elastic infrastructure via a cluster autoscaler.
What does "scale atomically" even mean? How does removing limits relate to horizontal vs vertical? HPA is based on request utilization, not limits, afaik.
What's your take on the arguments against limits in the comment at https://news.ycombinator.com/item?id=24356073 ?
Vertical -> give more resources to the program
Horizontal -> run more instances of the program
Removing limits gives your pods more resources (scaling them vertically) whereas creating more pods creates more copies (scaling horizontally).
Assuming the parent meant scaling by whole units with "scale atomically": you have one or two running programs, not "1.5" as you would if you just gave one 50% more resources.
People seem to have inferred that I believe that Limits are used by the Scheduler. I don't. But if we set "Request = Limits", we're guaranteeing to the Scheduler that our pod workload will never need more than what is Requested, or we scale out to a new pod.
It seems to me latency is a symptom of the actual issue, not the problem itself.
If a workload idles at 25% of Request, 12.5% of Limit (as in TFA), and peaks at 50% of Request, 25% of Limit, that seems hugely wasteful. What's more, the workload has several "opportunities" to optimize latency, and uncapping the CPU Limit reduces the latency. If it were me, I'd be asking, "Why does my workload need access to (but not utilization of) 4, 6, 8, 16, 32 cores to reduce its latency?"
More often than not, I've been able to help customers reduce their latency by DECREASING the Pod's Requests and Limits, but also INCREASING the replica count (via HPA or manually). It's not a silver bullet, and whether a workload is node.js, JBoss EAP, Spring Boot, or Quarkus does matter to some extent. The first thing I reach for in my k8s toolbox is to scale out. "Many hands make light work" is an old adage. N+1 workloads can usually respond to more traffic than N workloads in a shorter amount of time. k8s' strength is that it is networked and clustered.

Forcing one node or a set of nodes to work harder (TFA mentions "isolating" the workload) or vertically scaling is an anti-pattern in my book, especially when you understand the workload pattern well. What is being done here is that nodes (which are likely VMs) are being over-committed. Now, those VMs live on physical hypervisors which are likely -guess what- over-committed. Turtles of (S)POFs all the way down, I say.
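To illustrate the "smaller pods, more replicas" approach, a hedged sketch of an HPA (the target name, bounds, and threshold are made up; the exact API version depends on your cluster):

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: example-api              # hypothetical Deployment
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example-api
      minReplicas: 4
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70   # percent of the pods' CPU *request*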
Also, TFA mentions
In the past we’ve seen some nodes going to a "notReady" state, mainly because some services were using too much resources in a node.
The downsides are that we lose in “container density”, the number of containers that can run in a single node. We could also end up with a lot of “slack” during a low traffic time. You could also hit some high CPU usage, but nodes autoscaling should help you with it.
But I get it. I've fought with customers that tell me, "By removing the Limit, my container starts up in half the time." Great. Then they get to Perf Testing and they see wildly inconsistent (or way-sublinear) speed-ups when scaling out, or they're resource-limited in their ability to scale up even when metrics tell them they have resources available, or there is unchecked backpressure, or downstream bottlenecks, or this one workload ends up consuming an entire worker node, or ...
I stand by it.
In an ideal world where apps are totally regular and load is equally balanced and every request is equally expensive and libraries don't spawn threads, sure. Maybe it's fine to use limits. My experience, on the other hand, says that most apps are NOT regular, load-balancers sometimes don't, and the real costs of queries are often unpredictable.
This is not to say that everyone should set their limits to `1m` and cross their fingers.
If you want to do it scientifically:
Benchmark your app under a load that represents the high end of reality. If you are preparing for BFCM, triple that.
For these benchmarks, set CPU request = limit.
Measure the critical indicators. Vary the CPU request (and limit) up or down until the indicators are where you want them (e.g. p95 latency < 100ms).
If you provision too much CPU you will waste it. Maybe nobody cares about p95 @50ms vs @100ms. If you provision too little CPU, you won't meet your SLO under load.
Now you can ask: How much do I trust that benchmark? The truth is that accurate benchmarking is DAMN hard. However hard you think it is, it's way harder than that. Even within Google we only have a few apps that we REALLY trust the benchmarks on.
This is where I say to remove (or boost) the CPU limit. It's not going to change the scheduling or feasibility. If you don't use it, it doesn't cost you anything. If you DO use it, it was either idle or you stole it from someone else who was borrowing it anyway.
When you take that unexpected spike - some query-of-doom or handling more load than expected or ... whatever - one of two things happens. Either you have extra CPU you can use, or you don't. When you set CPU limits you remove one of those options.
As for HPA and VPA - sure, great use them. We use that a LOT inside Google. But those don't act instantly - certainly not on the timescale of seconds. Why do you want a "brick-wall" at the end of your runway?
What's the flip-side of this? Well, if you are wildly off in your request, or if you don't re-run your benchmarks periodically, you can come to depend on the "extra". One day that extra won't be there, and your SLOs will be demolished.
Lastly, if you are REALLY sophisticated, you can collect stats and build a model of how much CPU is "idle" at any given time, on average. That's paid-for and not-used. You can statistically over-commit your machines by lowering requests, packing a bit more work onto the node, and relying on your stats to maintain your SLO. This works best when your various workloads are very un-correlated :)
TL;DR burstable CPU is a safety net. It has risks and requires some discipline to use properly, but for most users (even at Google) it is better than the alternative. But don't take it for granted!
I don't understand why pods without CPU limits would cause unresponsive kubelets. For a long time now Kubernetes has allocated a slice for system services. While pods without CPU limits are allowed to burst, they are still limited to the amount of CPU allocated to kubernetes pods.
Run "systemd-cgls" on a node and you'll see two toplevel slices: kubepods and system. The kubelet process lives within the system slice.
If you run "kubectl describe <node>" you can see the resources set aside for system processes on the node. Processes in the system slice should always have (cpu_capacity - cpu_allocatable) available to share, no matter what happens in the kubepods slice.
What I did on the distribution I work on is tune the cgroup shares so control plane services are allocated CPU time ahead of pods (whether guaranteed, burstable, or best effort). We don't run anything as static containers, so this covers all the kube services, etcd, system services, etc.
Before this change in our distribution, IIRC, pods and the control plane had equal weighting, which allowed the possibility for kubelet or other control plane services to be starved if the system was very busy.
There are also lots of other problems that can lead to kubelet bouncing between ready/not ready that we've observed which wouldn't be triggered by the limits.
To answer your other question - I believe kops ships without system reserved by default
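For anyone who wants to reserve it explicitly, a rough sketch of the relevant kubelet config fields (values are illustrative only):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    systemReserved:
      cpu: "500m"        # CPU held back for the OS / system slice
      memory: "512Mi"
    kubeReserved:
      cpu: "500m"        # CPU held back for kubelet, container runtime, etc.
      memory: "512Mi"
    # node allocatable = capacity - systemReserved - kubeReserved - eviction thresholds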
The arguments for allowing containers to burst make plenty of sense to me. I do it on most of my services!
thockin's reddit post for reference: https://www.reddit.com/r/kubernetes/comments/all1vg/on_kuber...
Another interesting bit of context describing some of the non-intuitive impacts of CPU limits: https://github.com/kubernetes/kubernetes/issues/51135
Edit: added links
Actually, why? Sure, those guys may starve the ones without limits, but they won't starve each other, because Linux will simply time-share the processes. And for services on the critical path (what they turned it off for) that seems like correct behaviour.
Live and learn!
A container without either request or limit is twice-damned, and will be scheduled as BestEffort. The entire cgroup slice for all BestEffort pods is given a cpu.shares value of 2 (about 2 milliCPUs), and if the kernel scheduler is functioning well, no pod in there is going to disrupt anything but other BestEffort pods with any amount of processor demand. Throw in a 64 thread busyloop and no Burstable or Guaranteed pods should notice much.
Of course that's the ideal. There is an observable difference between a process that relinquishes its scheduler slice and one that must be pre-empted. But I wouldn't call that a major disruption. Each pod will still be given its full requested share of CPU.
If that's not the case, I'd love to know!
Prometheus scrapes of the kubelet have slowed down a bit, but are still under 400ms.
Prometheus scrape latency for the node kubelet has increased, but it's still sub-500ms.
Note that this cluster (which is on EKS) does have system reserved resources.
[root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
[root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
[root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/user.slice/cpu.shares
This is a more detailed post on the same thing - part two indicates changes have been back-ported to a number of kernel versions:
Linux-stable: 4.14.154+, 4.19.84+, 5.3.9+
Ubuntu: 4.15.0-67+, 5.3.0-24+
Red Hat Enterprise Linux:
RHEL 7: 3.10.0-1062.8.1.el7+
RHEL 8: 4.18.0-147.2.1.el8_1+
The reason, I think, K8s people might do this is that they often do not have the access, experience, or skills to do platform upgrades. It's easier to just futz around on the layer they feel they understand well: the world of containers.
limits cause CPU throttling, which is like running your process in a strobe light. If your quota period is 100ms, you might only be able to make progress for 10ms out of every 100ms period, regardless of whether or not there is CPU contention, just because you've exceeded your limit.
requests -> CFS time sharing. This ensures that out of a given period of time, CPU time is scheduled fairly and according to the request as a proportion of total requests (it just so happens that the Kube scheduler won't schedule such that sum[requests] > capacity, but theoretically it could, because requests are truly relative when it comes to how they are represented in cgroups)
Here is the fundamental assertion: requests ensure fair CPU scheduling in the event of CPU contention (more processes want CPU than can be scheduled). Given that you are using requests, why would you want limits? You might think "limits prevent a process from taking too much CPU" but that's just not true. If that processes DID try to use up too much CPU, CFS would ensure it does not via fair time sharing. If no other running processes needed the CPU, why enforce CPU throttling which has very bad effects on tail latency?
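To make the mechanics concrete, rough numbers assuming the default 100ms CFS period:

    resources:
      requests:
        cpu: "500m"   # -> cgroup cpu.shares ~= 512; only matters under contention
      limits:
        cpu: "500m"   # -> cpu.cfs_quota_us = 50000 with cpu.cfs_period_us = 100000,
                      #    i.e. throttled after 50ms of CPU time in every 100ms window,
                      #    even if the rest of the node is idle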
Putting all the “user facing” services in a state where one of them consuming all the CPU could affect all the others feels like a disaster waiting to happen.
If you configure a static limit what you get is services that don't run even when there is CPU time available, which is bad.
If your service allows horizontal scalability, you can use autoscaling of pods with Horizontal Pod Autoscaler (ideally also with a cluster autoscaler) to increase pod count for a given service when some percentage of the requested CPU is exceeded, whether or not you set a CPU limit. Setting the cpu_request appropriately for your pods is critical to ensure that node CPU is not oversubscribed by the Kubernetes pod scheduler.
Pods where mem & CPU requests = limits are given the highest class of service ("guaranteed"). For your most critical and latency sensitive services, this is the best approach when also coupled with HPA. Assuming a 4.19 kernel or later, I suppose.
Memory limits however will kill the pod if the pod uses more than the limit.
In general, the problem is that people don't understand how these complex systems interact -- what do limits do, what are the consequences of limits, how do you decide on correct limits, what do liveness and readiness probes do, what is the kubelet's role in the probes, wait what's a kubelet, etc.
Kubernetes changes how you create the infrastructure but you still have the same problems you had in a distributed system before.
We're running Ubuntu 20.04 on Kops 1.17 in production just fine, thank you very much. It wasn't a happy path since it wasn't officially supported then - stuff about forcing iptables-legacy instead of nftables - but with a couple hacks we got it to work just fine (Kops was in a bad situation where CoreOS was hitting EOL and there were no officially supported distributions running updated kernels that patched the CPU throttling issues, so we worked with the maintainers to figure out what we needed to do, as the maintainers were also running Ubuntu 20.04 on versions of Kops which didn't formally support it).
This whole blog post is dangerous. CPU limits are really important for cluster stability, as I'm sure the author will find out soon enough. Why bother with dangerous workarounds for problems that have actual solutions? This makes no sense to me.
We ran our largest application from bare-metal to Mesos (https://medium.com/criteo-labs/migrating-arbitrage-to-apache...) and observed performance was not as good as expected (especially on 99pctl latency).
Other applications were showing similar behavior.
We ended up finding the issue with cfs bandwidth cgroup, considered several alternatives and eventually moved to cpusets instead.
cpusets allow us to get:
- a better mental model (it's far easier to reason about "dedicated cpus")
- a net performance gain (5% to 10% less cpu consumption)
- more consistent latency (if nothing else runs on the same cpu as your app, you benefit from good scheduling and possibly avoid cpu cache issues)
When the fixed kernel was released, we decided to upgrade to it and keep our new model of cpu isolation.
We then got stuck in discussions around partial core allocation. We didn’t have that many jobs configured to use less than a full core, but it did impact our container packing.
One option not mentioned in the post is to enable k8s' static CPU manager policy. With this option in place, workloads in the "guaranteed" quality of service class that are allocated an integer CPU limit will be given exclusive use of their CPUs. I've found this also avoids the CFS bugs and eliminates CPU throttling, without removing CPU limits.
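Roughly, in the kubelet config (an illustrative sketch; the static policy also needs some CPU reserved for the system):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cpuManagerPolicy: static    # Guaranteed pods with integer CPU request == limit
                                # get exclusive cores; everything else shares the rest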
One thing to keep in mind is that this bug mostly impacts workloads that spin up more threads than they have allocated CPUs. For golang workloads you can set GOMAXPROCS to be equal to your CPU allocation and eliminate most throttling that way too, without messing with limits or the static scheduler policy.
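One way to wire that up is the downward API (a sketch; as I understand it, fractional limits get rounded up to a whole number here):

    env:
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu   # expose the container's CPU limit to the Go runtime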
I see only upsides to performance (bandwidth and latency) and availability by partitioning resources — so what are the benefits of the alternative, using limits, beyond being able to stuff more apps onto a machine? That’s not to trivialize that benefit.
Does kubernetes even allow for “affinitizing”?
To answer your question: I believe there is 'pinning' in Kubernetes which can solve it, but kubernetes has other overheads in terms of latency (iptables pod routing with conntrack enabled, for instance) so I personally would avoid using it for low latency applications.
EDIT: Sorry, I wrote that thinking you were referring to k8s. Just sched_setaffinity and isolcpus are sufficient. YMMV
99% of typical kubernetes workloads don't need those kinds of latency requirements, and it may be detrimental to their throughput to use only a subset of cores (the classic throughput vs latency tradeoff).
This would depend on the throttle level being proportional to the specified limit and not something orthogonal like number of processes - but if you don’t want to turn off limits entirely it might at least help.
"The danger of not setting a CPU limit is that containers running in the node could exhaust all CPU available."
My assumptions have been:
1. cpu request tells you the minimum CPU a pod is always guaranteed, independent of how much other pods use
2. on GKE you can't request 100% cpu due to google reserving cpu for the node
3. if you have hard limits, your cluster utilisation will be bad -> we do remove cpu limits due to this.
It is absolutely true Kubernetes will reserve the amount of CPU you request, although it will also allow you to exceed that request if you attempt to and there is free CPU time to service you. 2 is correct insofar as Google run daemonsets on GKE which themselves have CPU requests and limits, and thus there will never be a node which has 100% CPU free for you to request. 3 is simply incorrect - it may be true that for some combinations of nodes and workloads it is not possible for the Kubernetes scheduler to bin-pack efficiently, but for large clusters with diverse workloads this should not be a problem.
Buffer's solution of having different flavors of node, onto which mutually compatible workloads are scheduled in isolation from incompatible ones, is a very reasonable thing to do, even if this particular case is a bit of a head-scratcher.
depending on the goal of your service and cluster, it might be preferable to oversubscribe your CPU.
Compared to memory oversubscription, CPU oversubscription isn't anywhere near as much of a showstopper, so long as your service degrades well when it can't get the CPU it needs.
Where cost is an issue, it's very much worth oversubscribing your CPU by 20% to ensure you are rinsing the CPU.
On an interesting note, in mainframes it's normal to pay for a machine with n CPUs and get a n+m CPU machine delivered and installed. The extra CPUs are inactive until you pay for the upgrade and receive an activation code. In order to reduce downtime, during startup it's possible to have more than your licensed CPUs active to speed up the boot process and to catch up with any missed jobs.
We ran Kubernetes with the standard scheduler and node autoscaling for a long time, and used to allow developers in our (simplified) manifests define resource requests and limits. We saw that with our current config, we always had some unused capacity (that we wanted) since the scheduler spread out workloads while the autoscaler was only throwing nodes away with less than 70% load. So we started ignoring the limits provided by developers. This was initially a great success, our response times in the 99th went down drastically, even during sudden traffic spikes.
2 years later, and nobody cares about resource allocation for new services anymore. We can essentially never disable bursting again, because too many services (100+) use the extra capacity constantly, and due to our organizational structure we can't really _make_ these teams fix their allocations.
It definitely won't be optimal, but should let you get to a place where you at least have some limits set.
Point being: if you can't keep teams in check through process rather than pure technical capability, reconsider.
Java, for example, makes some tuning decisions based on this that you're not gonna like.
You are right that by default, the logic that sets GOMAXPROCS is unaware of the limits you've set. That means GOMAXPROCS will be something much higher than your cpu limit, and an application that uses all available CPUs will use all of its quota early on in the cfs_period_us interval, and then sleep for the rest of it. This is bad for latency.
And you are definitely right: scheduling a pod without request/limit is like giving a blank check.
I was expecting a discussion about CPU limits and all that is here is a workaround for a bug.
1. we're using a kernel that doesn't contain the patch.
2. the patch wasn't sufficient to prevent unnecessary CPU throttling
3. latency is caused by something other than CPU throttling
To rule out 1, I again checked that our GKE clusters (which are using nodes with Container Optimized OS [COS] VM images) are on a version that contains the CFS patch.
dxia@one-of-our-gke-nodes ~ $ uname -a
Linux one-of-our-gke-nodes 4.19.112+ #1 SMP Sat Apr 4 06:26:23 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux
Kernel version is 4.19.112+ which is a good sign. I also checked the COS VM image version.
The cumulative diff of the [COS release notes] for cos-stable-77-12371-227-0 shows this lineage (see "Changelog (vs ..." in each entry).
77-12371-141-0 <- This one's notes say "Fixed CFS quota throttling issue."
Now looking into 2:
This dashboard's top graph shows an example Container's CPU limit, request, and usage. The bottom graph shows the number of seconds the Container was CPU throttled, as measured by sampling the local kubelet's Prometheus metric for `container_cpu_cfs_throttled_seconds_total` over time. CPU usage data is collected from resource usage metrics for Containers from the [Kubernetes Metrics API], which returns metrics from the [metrics-server].
The first graph shows usage is not close to the limit. So there shouldn't be any CPU throttling happening.
The first drop in the top graph was from decreasing the CPU limit from 24 to match the CPU requests of 16. The decrease of the CPU limit from 24 to 16 actually caused CPU throttling to increase. We removed CPU limits from the Container on 8/31 12:00, which decreased the number of seconds of CPU throttling to zero. This makes me think the kernel patch wasn't sufficient to prevent unnecessary CPU throttling.
This K8s Github issue ["CFS quotas can lead to unnecessary throttling #67577"] is still open. The linked [kernel bug] has a comment saying it should be marked fixed. I'm not sure if there are still CPU throttling issues with CFS not tracked in issue #67577 though.
Because of the strong correlation in the graphs between removing CPU limits and CPU throttling, I'm assuming the kernel patch named "Fixed CFS quota throttling issue." in COS 77-12371-141-0 wasn't enough.
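For anyone else digging into this, the ratio of throttled CFS periods to total periods is what I've found most useful to chart. A sketch of a Prometheus recording rule using the usual cAdvisor metrics (the rule name and window are arbitrary):

    groups:
    - name: cpu-throttling
      rules:
      - record: container:cpu_cfs_throttling:ratio
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            /
          rate(container_cpu_cfs_periods_total[5m])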
1. Anyone else using GKE run into this issue?
2. Does anyone have a link to the exact kernel patch that the COS entry "Fixed CFS quota throttling issue." contains? A Linux mailing list ticket or patch would be great so I can see if it's the same patch that various blog posts reference.
3. Anyone aware of any CPU throttling issues in the current COS version and kernel we're using? 77-12371-227-0 and 4.19.112+, respectively.
Also, if they are using Kubernetes, normally there is no reason not to upgrade the whole distribution as well, since only Kubernetes will be running on it, and of course that's widely tested (the containers each choose their own distribution; only the kernel is shared).
Also, even without limits I believe CPU is prioritized based on the request. So if 1 pod requests 100 millicpu and another pod requests 200 millicpu, if they both try to use all the CPU on a node the one that requested 200 millicpu will use 2/3 of the CPU and the other will use 1/3.
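Right. Roughly, requests translate into cgroup cpu.shares, so under contention CPU time is divided in proportion to requests (illustrative numbers):

    # Pod A
    resources:
      requests:
        cpu: "100m"    # -> cpu.shares ~= 102
    # Pod B
    resources:
      requests:
        cpu: "200m"    # -> cpu.shares ~= 204
    # If both try to burn the whole node, CFS splits the CPU ~1:2 in B's favor,
    # with or without limits set.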
According to https://cloud.google.com/container-optimized-os/docs/release...
If they saw the issue, then either they have misconfigured their nodes somehow, or perhaps they're running something very old?
I'm quite curious to see a proper test bench