CPU Throttling for containerized Go applications explained (kanishk.io)
96 points by imiric 17 days ago | 17 comments



I would say that this has relatively little to do with Kubernetes in the end. The Kubelet just turns the knobs and pulls the levers that Linux offers. If you understand how Linux runs your program, then what K8s does will seem obvious.

A detail I would like to quibble about: GOMAXPROCS is not by default the number of CPUs "on the node" as the article states. It is the number of set bits in the task's CPU mask at startup. This will not generally be the number of CPUs on the node, since that mask is determined by the number of other tenants and their resource configurations. "Other tenants" includes the kubelet and whatever other system containers are present.

The problem with this default is that GOMAXPROCS is latched in once at startup, but the actual CPU mask may change while the task is running, and if you start 100 replicas of something on 100 different nodes, they may each end up with a different GOMAXPROCS value, which will affect the capacity of each replica. So it is better to explicitly set GOMAXPROCS to something reasonable.
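
For what it's worth, "explicitly set GOMAXPROCS" can be as simple as exporting GOMAXPROCS=n alongside the CPU request in the pod spec, or deriving it from the container's CPU quota at startup (go.uber.org/automaxprocs does the latter). A minimal sketch of the quota-based approach, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup; treat it as an illustration, not a hardened implementation:

    package main

    import (
        "fmt"
        "math"
        "os"
        "runtime"
        "strconv"
        "strings"
    )

    // quotaProcs derives a GOMAXPROCS value from the container's cgroup v2 CPU
    // quota instead of the startup CPU mask. It falls back to runtime.NumCPU()
    // when no quota is set or the file is missing.
    func quotaProcs() int {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
        if err != nil {
            return runtime.NumCPU()
        }
        fields := strings.Fields(string(data)) // "<quota_us> <period_us>" or "max <period_us>"
        if len(fields) != 2 || fields[0] == "max" {
            return runtime.NumCPU() // no quota configured
        }
        quota, err1 := strconv.ParseFloat(fields[0], 64)
        period, err2 := strconv.ParseFloat(fields[1], 64)
        if err1 != nil || err2 != nil || period <= 0 {
            return runtime.NumCPU()
        }
        n := int(math.Ceil(quota / period)) // e.g. a 2.5-CPU quota gets 3 procs
        if n < 1 {
            n = 1
        }
        return n
    }

    func main() {
        runtime.GOMAXPROCS(quotaProcs())
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }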


> I would say that this has relatively little to do with Kubernetes in the end.

It does. E.g., this issue does not exist with LXD. LXD mounts a custom procfs inside the container that exposes the correct values of system resources allotted to the container. K8s doesn't, probably because k8s started out as a way to run docker containers, and docker couldn't care less about doing things the right way.

See for yourself by running htop in an LXD container and dynamically changing the CPU and Memory limits of the container. Unlike k8s, there's no need to restart the container for the new limits to apply; they update live.


I think it kind of has to do with kubernetes, in that kubernetes embeds assumptions in its design and UI about the existence of a kernel capability which is almost, but not quite, entirely unlike the cpu.max cgroup knob, and then tries to use cpu.max anyway. Leaving CPUs idle when threads are runnable is not normally a desirable thing for a scheduler to do, CPU usage is not measured in "number of cores", and a concurrency limit is about the least-energy-efficient way to pretend you have a slower chip than you really do.

There is a reason these particular users keep stepping on the same rake.

cpu.uclamp.max is a little closer to the mental model k8s is teaching people, but it violates the usage=n_cores model too, and most servers are using the performance governor anyway.


Or just update it at runtime every minute or something.


The Go runtime isn't really dynamic in that regard.


It has been from the first version: https://pkg.go.dev/runtime#GOMAXPROCS
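
So the "update it at runtime" idea upthread is workable. A sketch, reusing the hypothetical quotaProcs helper from the earlier snippet (plus the time and runtime imports), and keeping in mind that runtime.GOMAXPROCS briefly stops the world, so it shouldn't be called too often:

    // Periodically re-derive GOMAXPROCS from the cgroup quota while running.
    // quotaProcs is the hypothetical cpu.max reader from the earlier sketch.
    func watchQuota() {
        ticker := time.NewTicker(time.Minute)
        defer ticker.Stop()
        for range ticker.C {
            if n := quotaProcs(); n != runtime.GOMAXPROCS(0) {
                runtime.GOMAXPROCS(n) // only touch it when the value actually changes
            }
        }
    }

Start it with `go watchQuota()` from main.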


You can tail some devices, can't you?


In an ideal world, it’s far better to not use limits but instead have applications set their CPU requests. That way, if the system has CPU available, applications can use more than their requested CPU (and won’t get throttled), but if CPU becomes saturated, the kernel will ensure no process gets more than its fair share.

Unfortunately, in practice, without limits, noisy neighbors can interfere with well-behaved apps. For example, on a 64-core machine, if you have a process that requests 2 CPUs and another process using all the rest of the cores, the 2-CPU process’s share will not be perfectly consistent, and for latency-sensitive apps (like Redis) you’ll see response times fluctuate.

It’s probably better to use newer Kubernetes features for extremely latency-sensitive applications to pin them to particular CPUs. That way, their latency shouldn’t be affected by noisy neighbors, and everything else can fight over the rest of the host’s CPUs.

With limits, unless you can guarantee your app will never use more than its assigned max CPU, any temporary burst of CPU utilization will hit throttling (your app will sleep until the next scheduling period), which can destroy p95 response times. Having an app essentially melt down when the box has gobs of CPU available is never fun.


The other problem with not setting limits is that it's very easy to use more than your requests routinely, and you won't know that you're misconfigured until the one day you have a noisy neighbor and you only get what you asked for.

Monitoring helps, but requires some nuance. For example, your average CPU might look fine at 50%, but in truth you're using 200% for 500ms followed by 0% for 500ms, and when CPU is scarce your latency unexpectedly doubles.

While it doesn't eliminate it entirely (as you rightly point out), enforcing limits even when there's excess CPU available will mostly ensure that your performance doesn't suddenly change due to outside factors, which IMO is more valuable than having higher performance most-but-not-all of the time.


>For example, your average CPU might look fine at 50%, but in truth you're using 200% for 500ms followed by 0% for 500ms, and when CPU is scarce your latency unexpectedly doubles.

That is exactly the behavior that cgroups' cpu.max has, except it'd have to be 50 ms instead of 500 with the default period.

The problem with cpu.max is that people want a "50%" CPU limit to make the kernel force-idle your threads in the same timeslice size you'd get with something else competing for the other 50% of the CPU, but that is not actually what cpu.max does. Perhaps that is what it should do, but unfortunately, the `echo $max_us $period_us >cpu.max` thing is UAPI (both values are in microseconds, not a ratio). Although, I don't know if anyone would complain if one day the kernel started interpreting that as a rational fraction and ignoring the absolute values of the numbers.

This makes me really want to write a program that RDTSCs in a loop into an array, and then autocorr(diff()) the result. That'd probably expose all kinds of interesting things about scheduler timeslices, frequency scaling, and TSC granularity.
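
A rough Go stand-in for that experiment (the monotonic clock instead of raw RDTSC, and sorted diffs instead of autocorrelation, so treat it as a sketch rather than a calibrated tool): run it with and without a cpu.max limit, and throttled windows should show up as gaps on the order of the cgroup period.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    func main() {
        const n = 5_000_000 // roughly a few hundred ms of busy sampling
        start := time.Now()
        stamps := make([]int64, n)
        for i := range stamps {
            stamps[i] = int64(time.Since(start)) // nanoseconds since start
        }

        diffs := make([]int64, n-1)
        for i := 1; i < n; i++ {
            diffs[i-1] = stamps[i] - stamps[i-1]
        }
        // Instead of autocorrelation, just look at the biggest gaps.
        sort.Slice(diffs, func(i, j int) bool { return diffs[i] > diffs[j] })

        fmt.Println("largest gaps between consecutive samples:")
        for _, d := range diffs[:10] {
            fmt.Println(" ", time.Duration(d))
        }
    }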


Yes, in that scenario of 500ms of 200% CPU for a request/response type workload (assuming a limit of one CPU), roughly 50% of responses will have an extra ~25ms of response time tacked on: bursting at 200% exhausts the quota halfway through each default 100ms period, so the task sleeps for the remaining ~50ms of each scheduling period, and requests landing in that throttled window wait about 25ms on average for the quota to refresh.

This goes into detail: https://docs.kernel.org/scheduler/sched-bwc.html


If you don’t let people burst, you lose a benefit of multi-tenancy. Each workload is provisioned conservatively to ensure it never throttles, and your nodes end up very underutilized since you can’t share that headroom amongst workloads.

With autoscaling, if a workload is using more than its allocated CPU, more containers will be brought online to bring down CPU utilization, which will get the system back into balance.


I feel like there is great potential to be explored here in adjusting cgroups dynamically: not in a machine-learning way, but allowing bursts, finding good request/limit ratios to apply (1s/10s or 0.1s/1s?), and voluntarily kicking out (evicting) stateless workloads.

I even pursued my PhD on it until I quit (for unrelated reasons). There was a startup doing this with ML, but I forgot their name.


I am working on something super similar. If you remember the name of the startup, I would appreciate it deeply.


I looked at my PhD files:

https://stormforge.io/

Enjoy! Please reach out to me if you want.


tl;dr: don't set CPU limits in Kubernetes - especially for multi-threaded applications - unless you strictly require CPU bandwidth control [1].

[1]: https://docs.kernel.org/scheduler/sched-bwc.html


You had me until “what the frick”



