
Kubernetes: Make your services faster by removing CPU limits - iansinnott
https://erickhun.com/posts/kubernetes-faster-services-no-cpu-limits/
======
oso2k
DISCLAIMER: I work for Red Hat Consulting as an OpenShift/k8s consultant.

This is such a bad idea. And I get that their point is to reduce latency.
But the point of k8s is to describe your workload accurately and allow it to
make decisions on your behalf. The no-brainer way to fix this is to set the
CPU Requests and Limits to the same value and add an HPA. Setting CPU Requests
and Limits to the same value usually gives people the behavior they're
expecting. Having more pods can also reduce latency. But taking away the
Limits hides
information about the workload while working around the issue at low to medium
workloads. If they were ever to get Black Friday or other 2.5x workload peaks,
I'd worry that the Limits removal would cause k8s not to be able to schedule
the workload appropriately even if they had enough resources on paper.
Remember, the idea of k8s is to scale atomically and horizontally while
ensuring availability. If you're making something vertically scale, you'd
likely want to re-evaluate that workload.

~~~
sciurus
How are the limits incorporated into scheduling? I assumed that was based on
requests.

What does "scale atomically" even mean? How does removing limits relate to
horizontal vs vertical? HPA is based on request utilization, not limits,
afaik.

What's your take on the arguments against limits in the comment at
[https://news.ycombinator.com/item?id=24356073](https://news.ycombinator.com/item?id=24356073)
?

~~~
tpxl
>How does removing limits relate to horizontal vs vertical?

Vertical -> give more resources to the program

Horizontal -> run more instances of the program

Removing limits gives your pods more resources (scaling them vertically)
whereas creating more pods creates more copies (scaling horizontally).

Assuming the parent meant scaling by whole units with "scale atomically": you
have one or two running instances of the program, not "1.5" of them from
giving one instance 50% more resources.

~~~
oso2k
tpxl gets me. :D Even the "scale atomically" part.

People seem to have inferred that I believe that Limits are used by the
Scheduler. I don't. But if we set "Requests = Limits", we're guaranteeing to
the Scheduler that our pod workload will never need more than what is
Requested; beyond that, we scale out to a new pod.

It seems to me latency is a symptom of the actual issue, not the actual
problem.

If a workload idles at 25% of Request (12.5% of Limits, as in TFA) and peaks
at 50% of Request (25% of Limits), that seems hugely wasteful. What's more, the
workload has several "opportunities" to optimize latency. And uncapping the
CPU Limit reduces the latency. If it were me, I'd be asking, "Why does my
workload potentially need access to (but not utilization of?) 4, 6, 8, 16, 32
cores to reduce its latency?"

More often than not, I've been able to help customers reduce their latency by
DECREASING the Pod's Requests and Limits while INCREASING the replica count
(via HPA or manually). It's not a silver bullet, and whether a workload is
node.js, JBoss EAP, Spring Boot, or Quarkus does matter to some extent. The
first thing I reach for in my k8s toolbox is to scale out. "Many hands make
light work" is an old adage. N+1 workloads can usually respond to more traffic
than N workloads in a shorter amount of time. k8s' strength is that it is
networked and clustered. Forcing one node or a set of nodes to work harder
(TFA mentions "isolating" the workload) or vertically scaling is anti-pattern
in my book. Especially when you understand the workload pattern well. What is
being done here is that nodes (which are likely VMs) are being over-committed
[0]. Now, those VMs live on physical hypervisors which are likely -guess what-
over-committed. Turtles of (S)POFs all the way down I say.

Also, TFA mentions

    
    
         In the past we’ve seen some nodes going to a "notReady" state, mainly because some services were using too much resources in a node.
    

and

    
    
         The downsides are that we lose in “container density”, the number of containers that can run in a single node. We could also end up with a lot of “slack” during a low traffic time. You could also hit some high CPU usage, but nodes autoscaling should help you with it.
    

So they acknowledge the risk is real and they've encountered it. For most of
my customers, failing nodes, reduced "container density", and "slack" are
unacceptable. That translates into increased engineer troubleshooting time
and higher cloud provider bills. What's worse, the suggestion that the Cluster
Autoscaler will protect you also comes with increased costs (licenses, VMs,
storage, etc.). Not the solution I want. Seems like a blank check to your
cloud provider.

But I get it. I've fought with customers that tell me, "By removing the Limit,
my container starts up in half the time." Great. Then they get to Perf Testing
and they get wildly inconsistent speedups when scaling out (or way-sublinear
ones), or they're limited by resources in their ability to scale up especially when
metrics tells them they have resources available, or there is unchecked
backpressure, or downstream bottlenecks, or this one workload ends up
consuming an entire worker node, or ...

[0] [https://www.openshift.com/blog/full-cluster-
part-2-protectin...](https://www.openshift.com/blog/full-cluster-
part-2-protecting-nodes)

------
rsanders
Removing CPU limits seems like a bad idea now that there's a kernel fix. But
putting that aside...

I don't understand why pods without CPU limits would cause unresponsive
kubelets. For a long time now Kubernetes has allocated a slice for system
services. While pods without CPU limits are allowed to burst, they are still
limited to the amount of CPU allocated to kubernetes pods.

Run "systemd-cgls" on a node and you'll see two toplevel slices: kubepods and
system. The kubelet process lives within the system slice.

If you run "kubectl describe <node>" you can see the resources set aside for
system processes on the node. Processes in the system slice should always have
(cpu_capacity - cpu_allocatable) available to share, no matter what happens in
the kubepods slice.

    
    
        Capacity:
            cpu:                         8
            ephemeral-storage:           83873772Ki
            memory:                      62907108Ki
        Allocatable:
            cpu:                         7910m
            ephemeral-storage:           76224326324
            memory:                      61890276Ki
            pods:                        58
    

Granted, it's not a large proportion of CPU.

~~~
Thaxll
It's pretty simple: limits only work when everyone is using them. If you have
one pod that does not enforce limits, it can disrupt the entire node.

~~~
rsanders
A container with a request but without a limit should be scheduled as
Burstable, and it should only receive allocations in excess of its request
when all other containers have had their demand <= request satisfied.

A container without either request or limit is twice-damned, and will be
scheduled as BestEffort. The entire cgroup slice for all BestEffort pods is
given a cpu.shares of 2 (i.e. 2 milliCPUs' worth), and if the kernel scheduler
is functioning well, no pod in there is going to disrupt anything but other
BestEffort pods, no matter how much processor it demands. Throw in a 64-thread
busyloop and no Burstable or Guaranteed pods should notice much.

Of course that's the ideal. There is an observable difference between a
process that relinquishes its scheduler slice and one that must be pre-empted.
But I wouldn't call that a major disruption. Each pod will still be given its
full requested share of CPU.

If that's not the case, I'd love to know!

~~~
Thaxll
Are you sure that BestEffort QOS do not disrupt the entire node? I remember in
the past a single pod would freeze the entire VM.

~~~
rsanders
I wrote a little fork+spinloop program w/100 subprocesses and deployed it with
a low (100m) CPU request and no limit. It's certainly driving CPU usage to
nearly all 8 of the 8 cores on the machine, but the other processes sharing
the node are doing fine.

Prometheus scrapes of the kubelet have slowed down a bit, but are still under
400ms.

Note that this cluster (which is on EKS) _does_ have system reserved
resources.

    
    
        [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
        1024
        [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
        8099
        [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/user.slice/cpu.shares
        1024

------
ledneb
>
> [https://engineering.indeedblog.com/blog/2019/12/unthrottled-...](https://engineering.indeedblog.com/blog/2019/12/unthrottled-
> fixing-cpu-limits-in-the-cloud/)

This is a more detailed post on the same thing - part two indicates changes
have been back-ported to a number of kernel versions:

    
    
        Linux-stable: 4.14.154+, 4.19.84+, 5.3.9+
        Ubuntu: 4.15.0-67+, 5.3.0-24+
        Redhat Enterprise Linux:
            RHEL 7: 3.10.0-1062.8.1.el7+
            RHEL 8: 4.18.0-147.2.1.el8_1+
        CoreOS: v4.19.84+

~~~
ravedave5
Know which version of alpine linux would have gotten this fix? I'm having a
hard time walking it back from the commit.

~~~
Thaxll
Your host is unlikely to be Alpine; more likely your pods are.

------
YawningAngel
I don't really understand why Buffer (or anyone else, for that matter) would
choose to remove CPU limits from services where they are extremely important
rather than upgrading to a kernel version that doesn't have this bug.

~~~
burgerquizz
Is upgrading the kernel of a docker host that straightforward? I would worry
about keeping everything compatible, with a lot of testing before any upgrade
of this kind. It looks like they're running k8s with kops, and the fix was
merged just a few weeks ago.

~~~
YawningAngel
I'm not in a good position to say for sure, as I've only used managed
Kubernetes distributions, but I think it probably works out to less work than
removing CPU limits. Kernel upgrades are at least semi-routine, so most shops
that run Kubernetes themselves are going to have a process for them.
Conversely, removing CPU limits and migrating the critical path to a different
set of tainted nodes is a substantial one-off change with a long tail of
failure scenarios that need to be tested. Thus, I would expect that a kernel
upgrade would be easier than doing what Buffer did.

------
fierro
The core principle most readers miss is that CPU limits are tied to CPU
_throttling_, which is markedly different from CPU _time sharing_. I would
argue that in 99% of cases you truly do not need or want limits.

limits cause CPU throttling, which is like running your process in a strobe
light. If your quota period is 100ms, you might only be able to make progress
for 10ms out of every 100ms period, regardless of whether or not there is CPU
contention, just because you've exceeded your limit.

requests -> CFS time sharing. This ensures that, over a given period of time,
CPU time is scheduled fairly, in proportion to each container's request
relative to the total requests. (It just so happens that the Kube scheduler
won't schedule such that sum[requests] > capacity, but theoretically it could,
because requests are purely relative in how they are represented in cgroups.)

Here is the fundamental assertion: requests ensure fair CPU scheduling in the
event of CPU contention (more processes want CPU than can be scheduled).
_Given that_ you are using requests, why would you want limits? You might
think "limits prevent a process from taking too much CPU," but that's just not
true. If that process DID try to use up too much CPU, CFS would ensure it
could not, via fair time sharing. If _no other_ running processes needed the
CPU, why enforce CPU throttling, which has very bad effects on tail latency?

~~~
dilyevsky
+1 The only good reason to use cpu limits I can think of is if you sell
metered compute and run it on k8s. I’d be curious to know if anyone actually
does this though

~~~
fierro
the genesis of cfs_quota and cpu throttling in general has to do with
modulating power consumption of a chip, iirc. It's truly a fallacy that limits
are needed to prevent noisy neighbor type stuff.

~~~
dilyevsky
Huh didn’t know about reason behind cfs quota, thanks. Yeah it always seemed
of dubious usefulness to me. Considering i can probably trash cpu caches
without using much cycles and do other things with disk and network io I’m a
bit surprised people worry about cfs quota so much

------
davewritescode
This seems like a bad trade-off, at least for 99% of us who haven’t been using
Kubernetes in production for the last 5 years and manage it ourselves.

Putting all the “user facing” services in a state where one of them consuming
all the CPU could affect all the others feels like a disaster waiting to
happen.

~~~
sheeshkebab
The number of times I've seen CPU limits kill off pods during even mild
spikes, causing pretty much downtime and "disaster", is just as surprising.
Work on autoscaling nodes instead; don't use cpu limits.

~~~
nebster
Maybe I'm misunderstanding you, but I'm pretty sure CPU limits only cap the
amount of CPU used, even if there is more available. They will not kill off
the pod.

Memory limits, however, will kill the pod if it uses more than the limit.

~~~
jrockway
It is possible to get into this state. CPU starvation can be so severe that
containers start failing their liveness probes and are killed. This is
obviously very different than things like memory limits where the kernel
OOMKills you, but will look similar to the untrained observer. Their app is
serving 503s and the containers are in a restart loop -- looks like a
Kubernetes problem.

In general, the problem is that people don't understand how these complex
systems interact -- what do limits do, what are the consequences of limits,
how do you decide on correct limits, what do liveness and readiness probes do,
what is the kubelet's role in the probes, wait what's a kubelet, etc.

~~~
rsanders
That may be more likely with limits, but it doesn’t require a limit. I’ve had
lots of fun with that in Elasticsearch pods with no limit. And then you get to
enjoy a nice cascading failure.

------
solatic
> kops: Since June 2020, kops 1.18+ will start using Ubuntu 20.04 as the
> default host image. If you’re using a lower version of kops, you’ll have to
> probably to wait the fix. We are currently in this situation.

We're running Ubuntu 20.04 on Kops 1.17 in production just fine, thank you
very much. It wasn't a happy path since it wasn't officially supported then -
stuff about forcing iptables-legacy instead of nftables - but with a couple
hacks we got it to work just fine (Kops was in a bad situation where CoreOS
was hitting EOL and there were no officially supported distributions running
updated kernels that patched the CPU throttling issues, so we worked with the
maintainers to figure out what we needed to do, as the maintainers were also
running Ubuntu 20.04 on versions of Kops which didn't formally support it).

This whole blog post is dangerous. CPU limits are really important for cluster
stability, as I'm sure the author will find out soon enough. Why bother with
dangerous workarounds for problems that have actual solutions? This makes no
sense to me.

------
kamaradclimber
I encountered that issue on my company Mesos cluster. Here are some details.

We moved our largest application from bare-metal to Mesos
([https://medium.com/criteo-labs/migrating-arbitrage-to-
apache...](https://medium.com/criteo-labs/migrating-arbitrage-to-apache-
mesos-3f474179ec0b)) and observed that performance was not as good as expected
(especially on 99pctl latency). Other applications were showing similar
behavior.

We ended up tracing the issue to the cfs bandwidth cgroup controller,
considered several alternatives, and eventually moved to cpusets instead.

cpusets allow us to get:

- a better mental model (it's far easier to reason about "dedicated cpus")

- a net performance gain (from -5% to -10% cpu consumption)

- more consistent latency (if nothing runs on the same cpu as your app, you
benefit from good scheduling and possibly avoid cpu cache issues)

When the fixed kernel was released, we decided to upgrade to it and keep our
new model of cpu isolation.

------
gpapilion
At a previous job we made an argument for moving away from cfs and looking at
only full-core allocation, often pinning with NUMA. The speedup was
noticeable, since it removed the cfs overhead and memory access was now local.

We then got stuck in discussions around partial core allocation. We didn’t
have that many jobs configured to use less than a full core, but it did impact
our container packing.

------
toomanymike
I've seen CPU throttling occur when limits aren't exhausted even on 5.4
kernels, so I don't believe the underlying kernel bug is fixed.

One option not mentioned in the post is to enable k8s' static CPU scheduler
policy. With this option in place workloads in the "guaranteed" quality of
service class that are allocated an integer CPU limit will be given exclusive
use of their CPUs. I've found this also avoids the CFS bugs and eliminates CPU
throttling, without removing CPU limits.

One thing to keep in mind is that this bug mostly impacts workloads that spin
up more threads than they have allocated CPUs. For golang workloads you can
set GOMAXPROCS equal to your CPU allocation and eliminate most throttling that
way too, without messing with limits or the static scheduler policy.

~~~
solatic
Enabling the static CPU scheduler policy currently requires setting a Kubelet
flag, and that puts it out of reach of most people running managed Kubernetes
distributions.

~~~
rsanders
It looks possible on EKS now. [https://aws.amazon.com/about-aws/whats-
new/2020/08/amazon-ek...](https://aws.amazon.com/about-aws/whats-
new/2020/08/amazon-eks-managed-node-groups-now-support-ec2-launch-templates-
custom-amis/)

~~~
solatic
Because EKS supports custom launch templates? Good luck trying to finagle that
into supporting the exact Kubelet flags that you want to enable, while staying
abreast of upstream updates so that your cluster doesn't break when AWS tries
to keep it up-to-date. Not anywhere close to a simple "extra_kubelet_flags:
array[text]" kind of field.

------
MetalMatze
That's why we put the CPUThrottlingHigh alert into the kubernetes-mixin
project. It at least lets folks know. The Node Exporter, for example, is
always throttled and I don't mind. For the user-facing parts I'd rather not be
in the same situation. Ultimately, latency should tell me, though.

------
dundarious
In the low latency trading world, these concerns are addressed by partitioning
resources (for CPU, with affinities). This seems like a simpler mechanism that
doesn’t require the kernel/daemon to track resource usage and to impose
limits.

I see only upsides to performance (bandwidth and latency) and availability by
partitioning resources — so what are the benefits of the alternative, using
limits, beyond being able to stuff more apps onto a machine? That’s not to
trivialize that benefit.

Does kubernetes even allow for “affinitizing”?

~~~
dijit
The video games industry is the same; in fact it was one of the reasons we
went with Google Cloud over alternatives. At the time Amazon was not using KVM
(or HVM, as they seem to call it) and GCP was at least attempting CPU affinity
on the VMs. This caused quite a variance in latency when using Amazon which
did not exist on GCP.

To answer your question: I believe there is 'pinning' in Kubernetes which can
solve it, but kubernetes has other overheads in terms of latency (iptables pod
routing with contrack enabled for instance) so I personally would avoid using
it for low latency applications.

[https://builders.intel.com/docs/networkbuilders/cpu-pin-
and-...](https://builders.intel.com/docs/networkbuilders/cpu-pin-and-
isolation-in-kubernetes-app-note.pdf)

~~~
GauntletWizard
For videogames, you should not be subject to the iptables bits - Agones
encourages use of the `hostPort` networking mode, which doesn't create or
require special iptables routing.

------
theptip
I’m wondering if using a bursty limit like

    
    
        requests:
          cpu: 100m
        limits:
          cpu: 200m
    

Would work better? If the problem is that you are getting throttled at a lower
rate than your specified limit, maybe bumping that would help. But you still
get to use your target “request” for bin packing / node resource tracking.

This would depend on the throttle level being proportional to the specified
limit and not something orthogonal like number of processes - but if you don’t
want to turn off limits entirely it might at least help.

------
Schwan
Is this really right?

"The danger of not setting a CPU limit is that containers running in the node
could exhaust all CPU available."

My assumptions have been:

1. cpu request tells you how much cpu a pod gets MINIMUM, always,
independently of how much other pods use

2. on GKE you can't request 100% cpu, due to google reserving cpu for the node

3. if you have hard limits, your cluster utilisation will be bad -> we remove
cpu limits due to this

~~~
markbnj
The reason a container with no limit can exhaust CPU is that kubernetes CPU
requests map to the cpushares accounting system, and CPU limits map to the
Completely Fair Scheduler's cpuquota system. The cpushares system divides a
core into 1024 shares, and guarantees a process gets the number of shares it
reserves, but it does not limit the process from taking more shares if other
processes aren't consuming them. The cpuquota system divides CPU time into
periods of (I think) 100k microseconds by default, and hard-limits a process
at the number of microsecs per period it requests. So if you don't set limits
you're only using the cpushares system, and are free to take up as much idle
CPU as you can grab.

------
KaiserPro
From a traditional cluster perspective, we've been doing this for years.

depending on the goal of your service and cluster, it might be preferable to
over subscribe your CPU.

Compared to Memory oversubscription, CPU over sub isn't anywhere near as much
of a show stopper, so long as your service degrades well when it can't get the
CPU it needs.

Where cost is an issue, it's very much worth oversubscribing your CPU by 20%
to ensure you are rinsing the CPU.

~~~
rbanffy
As a mainframer once told me, "There's nothing wrong with having 100% CPU
usage. We paid for it, we'd better use it".

On an interesting note, in mainframes it's normal to pay for a machine with n
CPUs and get an n+m CPU machine delivered and installed. The extra CPUs are
inactive until you pay for the upgrade and receive an activation code. In
order to reduce downtime, during startup it's possible to have more than your
licensed CPUs active to speed up the boot process and to catch up with any
missed jobs.

------
sascha_sl
Here's a story that might make you not want to do that.

We ran Kubernetes with the standard scheduler and node autoscaling for a long
time, and used to let developers define resource requests and limits in our
(simplified) manifests. We saw that with our current config we always had some
unused capacity (that we wanted), since the scheduler spread out workloads
while the autoscaler only threw away nodes with less than 70% load. So we
started ignoring the limits provided by developers. This was
initially a great success, our response times in the 99th went down
drastically, even during sudden traffic spikes.

2 years later, and nobody cares about resource allocation for new services
anymore. We can essentially never disable bursting again, because too many
services (100+) use the extra capacity constantly, and due to our
organizational structure we can't really _make_ these teams fix their
allocations.

~~~
dilyevsky
We at Mux have removed nearly all limits, but set up alerts that trigger when
a container consistently bursts above its request so we can chase those down
(temporary bursts are ignored). Never had any issues.

~~~
sascha_sl
This might work if your ratio of developers (who do not have to care about
operational things at all) to SREs (and not even dedicated SREs) is not 30...

Point being, if you don't have the capability to somehow keep teams in check
through process and not pure capability, reconsider.

~~~
dilyevsky
We enforce cpu request (and memory request _and_ limit) via process, and plan
on adding automation to enforce that, so it shouldn't be a problem even with a
scaled-up team, since you can only hurt yourself by setting the request too
low. Not sure how the number 30 was chosen...

------
dstiliadis
The problem is the lack of control over the timescales the CPU scheduler
uses, which do not necessarily match the timescales of applications. This is
the classic statistical multiplexing and burstiness problem often encountered
in the network queueing world. A couple of months ago I wrote a blog post and
a couple of synthetic benchmarks that highlight the issues, which you might
find interesting: [https://medium.com/engineering-at-palo-alto-
networks/kuberne...](https://medium.com/engineering-at-palo-alto-
networks/kubernetes-scheduling-and-timescales-e98d8e31d304)

------
uberduper
Bit of a warning. If you do not set cpu requests, your pods may end up with
cpu.shares=2.

Java, for example, makes some tuning decisions based on this that you're not
gonna like.

~~~
jeffbee
The Go runtime also locks in some unwarranted assumptions at process start
time, and never changes its parameters if the number of available CPUs
changes.

~~~
jrockway
Explicitly setting GOMAXPROCS is probably the cleanest way to limit CPU among
the runtimes that are out there, however. For example, if you set requests =
1, limits = 1, GOMAXPROCS=1, then you will never run into the latency-
increasing cfs cpu throttling; you would be throttled if you used more than 1
CPU, but since you can't (modulo forks, of course), it won't happen. There is
[https://github.com/uber-go/automaxprocs](https://github.com/uber-
go/automaxprocs) to set this automatically, if you care.

You are right that by default, the logic that sets GOMAXPROCS is unaware of
the limits you've set. That means GOMAXPROCS will be something much higher
than your cpu limit, and an application that uses all available CPUs will use
all of its quota early on in the cfs_period_us interval, and then sleep for
the rest of it. This is bad for latency.

~~~
jeffbee
Setting GOMAXPROCS explicitly is the best practice in my experience. The
runtime latches in a value for runtime.NumCPU() based on the population count
of the cpumask at startup. The cpumask can change if kubernetes schedules or
de-schedules a "guaranteed" pod on your node and the kubelet is using the
static CPU management policy, and it will vary from node to node if you have
various types of machines. You don't want to have 100 replicas of your
microservice all using different, randomly-chosen values of GOMAXPROCS.

------
eightnoteight-1
I think cpu pinning would also have reduced the impact by a significant
factor, if all workloads had full vcpu cores.

------
whalesalad
Last time I was knee-deep in managing prod Kube infrastructure, limits were
also there for scheduling purposes. It's hard to properly allocate services
across nodes when there is no concept of the resources they require. I guess
you can get around that by setting a request instead of a limit?

~~~
Sayrus
Pod scheduling is based on Requests, so you can definitely go without setting
a limit. I don't think the limit itself plays a role in Kubernetes scheduling
(unless you don't specify a request, in which case the request defaults to the
limit).

And you are definitely right: scheduling a pod without request/limit is like
giving a blank check.

------
klohto
TL;DR: Remove limits if you're running a kernel below 4.19, due to a bug
present there. Update your AMI if running on EKS.

I was expecting a discussion about CPU limits, but all that is here is a
workaround for a bug.

~~~
proptecher
Where can I find an EKS worker node AMI with 4.19+? The EKS documentation
shows latest is 4.14.x in the current 1.17 k8s.

~~~
klohto
You don't need to, it was backported
[https://news.ycombinator.com/item?id=24353080](https://news.ycombinator.com/item?id=24353080)

------
marsdepinski
This is good advice if used carefully. OpenVZ still did this the best way: it
allowed you to set guaranteed minimums and no max, so you could guarantee CPU
time to the host node. It scaled containers wonderfully and used resources
efficiently.

------
NortySpock
Would the latency of the system be improved by reserving some amount of CPU
for the container? For example if a container always got a few milliseconds
per period, or if you even reserved a part or all of a CPU for the container.

~~~
jeffbee
Using cpu masks to exclude other processes from your CPUs will certainly
improve latency. It just costs more.

------
david_xia
I work on a team that operates multitenant GKE clusters for other engineers at
our company. Earlier this year I read this blog post [1] about a bug in the
Linux kernel that unnecessarily throttles workloads due to a CFS bug. Kernel
versions 4.19 and higher have been patched. I asked GCP support which GKE
versions included this patch. They told me 1.15.9-gke.9. But my team at work
is still getting reports of CPU throttling causing increased latencies on GKE
workloads in these clusters.

This means one of the following:

1. we're using a kernel that doesn't contain the patch

2. the patch wasn't sufficient to prevent unnecessary CPU throttling

3. latency is caused by something other than CPU throttling

To rule out 1, I again checked that our GKE clusters (which are using nodes
with Container Optimized OS [COS] VM images) are on a version that contains
the CFS patch.

```
dxia@one-of-our-gke-nodes ~ $ uname -a
Linux one-of-our-gke-nodes 4.19.112+ #1 SMP Sat Apr 4 06:26:23 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux
```

Kernel version is 4.19.112+ which is a good sign. I also checked the COS VM
image version.

gke-11512-gke3-cos-77-12371-227-0-v200605-pre

The cumulative diff of the [COS release notes][2] for cos-stable-77-12371-227-0
shows this lineage (see "Changelog (vs ..." in each entry):

cos-stable-77-12371-227-0, 77-12371-208-0, 77-12371-183-0, 77-12371-175-0,
77-12371-141-0 <- this one's notes say "Fixed CFS quota throttling issue."

Now looking into 2:

This dashboard [5] has two graphs. The top graph shows an example Container's
CPU limit, request, and usage. The bottom graph shows the number of seconds
the Container was CPU throttled, as measured by sampling the local kubelet's
Prometheus metric `container_cpu_cfs_throttled_seconds_total` over time. CPU
usage data is collected from the resource usage metrics for Containers from
the [Kubernetes Metrics API][6], which returns metrics from the
[metrics-server][7].

The first graph shows usage is not close to the limit. So there shouldn't be
any CPU throttling happening.

The first drop in the top graph was decreasing the CPU limit from 24 to match
the CPU requests of 16. The decrease of CPU limit from 24 to 16 actually
caused CPU throttling to increase. We removed CPU limits from the Container
on 8/31 12:00, which decreased the number of seconds of CPU throttling to
zero. This makes me think the kernel patch wasn't sufficient to prevent
unnecessary CPU throttling.

This K8s Github issue ["CFS quotas can lead to unnecessary throttling
#67577"][8] is still open. The linked [kernel bug][9] has a comment saying it
should be marked fixed. I'm not sure if there are still CPU throttling issues
with CFS not tracked in issue #67577 though.

Because of the strong correlation in the graphs between removing CPU limits
and CPU throttling, I'm assuming the kernel patch named "Fixed CFS quota
throttling issue." in COS 77-12371-141-0 wasn't enough.

Questions

1\. Has anyone else using GKE run into this issue?

2\. Does anyone have a link to the exact kernel patch that the COS entry
"Fixed CFS quota throttling issue." contains? A Linux mailing list ticket or
patch would be great so I can see if it's the same patch that various blog
posts reference.

3\. Is anyone aware of any CPU throttling issues in the current COS version and
kernel we're using? 77-12371-227-0 and 4.19.112+, respectively.

[1]: https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718

[2]: https://cloud.google.com/container-optimized-os/docs/release-notes#cos-stable-77-12371-227-0

[5]: https://share.getcloudapp.com/o0u8KoEn

[6]: https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/

[7]: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/metrics-server

[8]: https://github.com/kubernetes/kubernetes/issues/67577

[9]: https://bugzilla.kernel.org/show_bug.cgi?id=198197

[COS]: https://cloud.google.com/container-optimized-os/docs

~~~
crb
Hey David, we talked on a podcast once :) Please raise a support case and send
me the ticket number; I'll see if we can get to the bottom of this for you.

------
devit
They should have just upgraded the kernel to a fixed one, which definitely
does not require upgrading the whole distribution.

Also, if they are using Kubernetes normally, there is no reason not to upgrade
the whole distribution as well, since only Kubernetes will be running on it,
and of course that's widely tested (the containers each choose their own
distribution; only the kernel is shared).

------
poisonta
You should not run more than one application/service in a VM if you are
worried about the performance. Then, you don't need to worry about
configuration CPU limits. Kubernetes doesn't only slow down your application
performance, it also increases your operating cost and team by several
magnitudes.

------
alex88
If you remove the cpu limit you won't be able to use HPA though, right?

~~~
sciurus
No, you still have a CPU request, and the HPA is based on utilization of that.

Also, even without limits I believe CPU is prioritized based on the request.
So if 1 pod requests 100 millicpu and another pod requests 200 millicpu, if
they both try to use all the CPU on a node the one that requested 200 millicpu
will use 2/3 of the CPU and the other will use 1/3.
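
The 1:2 split follows from how Kubernetes translates CPU requests into cgroup
`cpu.shares` (shares = millicpu × 1024 / 1000), which the CFS scheduler then
weighs proportionally under contention. A quick check of the arithmetic:

```shell
# Kubernetes' MilliCPUToShares: shares = milliCPU * 1024 / 1000 (minimum 2).
shares_for() { echo $(( $1 * 1024 / 1000 )); }

shares_for 100   # 102
shares_for 200   # 204
# Ratio 102:204 = 1:2, i.e. roughly 1/3 vs 2/3 of the node's CPU under
# full contention.
```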

------
javiercr
Does anyone know if this bug is present on Google Kubernetes Engine (GKE)?

~~~
cptomlly
Fixed in the following COS stable images back in January:

cos-stable-79-12607-80-0
cos-stable-77-12371-141-0
cos-stable-73-11647-415-0
cos-stable-78-12499-89-0

According to https://cloud.google.com/container-optimized-os/docs/release-notes#cos-stable-77-12371-141-0

------
miiiiiike
This is terrible advice.

------
secondcoming
Couldn't the unresponsive kubectl issue be resolved by isolating CPUs and
controlling where processes go yourself?
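
One hedged sketch of what that could look like: kubelet (v1.17+) supports an
explicit reserved CPU set for system and Kubernetes daemons, so workload pods
can't starve the node agent. The CPU list below is illustrative:

```shell
# Keep CPUs 0-1 for system daemons and the kubelet itself; pods use the rest.
kubelet --reserved-cpus=0-1 \
        --cpu-manager-policy=static   # pin Guaranteed pods to exclusive CPUs
```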

~~~
Schwan
I'm not sure if this is a real issue; normally (on GKE, for example) you can't
use 100% of the CPU because of this.

If they saw the issue, then either they haven't configured their nodes
correctly, or perhaps they're running something very old?

I'm quite curious to see a proper test bench.

~~~
dilyevsky
Kops (at least a few years ago) did not set any reservations for system
components by default

