1. We're using a kernel that doesn't contain the patch.
2. The patch wasn't sufficient to prevent unnecessary CPU throttling.
3. Latency is caused by something other than CPU throttling.
To rule out 1, I again checked that our GKE clusters (whose nodes use Container-Optimized OS [COS] VM images) are on a version that contains the CFS patch.
dxia@one-of-our-gke-nodes ~ $ uname -a
Linux one-of-our-gke-nodes 4.19.112+ #1 SMP Sat Apr 4 06:26:23 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux
Kernel version is 4.19.112+ which is a good sign. I also checked the COS VM image version.
The cumulative diff in the [COS release notes] for cos-stable-77-12371-227-0 shows this lineage (see "Changelog (vs ..." in each entry).
77-12371-141-0 <- This one's notes say "Fixed CFS quota throttling issue."
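The version check above can also be written down as a small script. This is only a sketch: the helper names are made up for illustration, and it assumes that two builds are comparable when their first two numeric fields (here 77 and 12371) match, which is my reading of the release notes rather than a documented rule.

```python
from typing import Tuple


def parse_cos_build(version: str) -> Tuple[int, ...]:
    """Split a COS build string such as '77-12371-227-0' into integers."""
    # Keep only the numeric fields, so a prefix like 'cos-stable-' is ignored.
    return tuple(int(part) for part in version.split("-") if part.isdigit())


def is_at_or_after(running: str, fixed_in: str) -> bool:
    """True if `running` is on the same release line and not older than `fixed_in`."""
    run, fix = parse_cos_build(running), parse_cos_build(fixed_in)
    # Assumption: the first two fields identify the release line.
    if run[:2] != fix[:2]:
        raise ValueError("builds are from different release lines")
    return run >= fix


# Our image vs. the build whose notes say "Fixed CFS quota throttling issue."
print(is_at_or_after("cos-stable-77-12371-227-0", "77-12371-141-0"))
```

This only confirms ordering within the same release line. It says nothing about which kernel patch the release note refers to.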
Now looking into 2:
In this dashboard, the top graph shows an example Container's CPU limit, request, and usage. The bottom graph shows the number of seconds the Container was CPU throttled, as measured by sampling the local kubelet's Prometheus metric `container_cpu_cfs_throttled_seconds_total` over time. CPU usage data is collected from the resource usage metrics for Containers exposed by the [Kubernetes Metrics API], which returns metrics from the [metrics-server].
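For anyone who wants to reproduce the bottom graph from raw samples, here is a rough sketch of turning a cumulative counter like `container_cpu_cfs_throttled_seconds_total` into throttled seconds per second. The function name and the sample values are made up for illustration, and the reset handling (treating a decrease as a restart from zero) is an assumption modeled on how counter rates are usually computed.

```python
from typing import List, Tuple


def throttled_rate(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert (timestamp_s, counter_value) samples into (timestamp_s, seconds throttled per second)."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order samples
        # A counter that goes down is assumed to have restarted from zero.
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / (t1 - t0)))
    return rates


# Hypothetical samples scraped every 60 s; the last one follows a counter reset.
samples = [(0.0, 10.0), (60.0, 16.0), (120.0, 16.0), (180.0, 3.0)]
for ts, rate in throttled_rate(samples):
    print(f"t={ts:.0f}s throttled {rate:.3f} s/s")
```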
The first graph shows usage is not close to the limit. So there shouldn't be any CPU throttling happening.
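One way to cross-check the Prometheus numbers is to read the kernel's own counters from the Container's `cpu.stat` file in the cpu cgroup. The sketch below assumes cgroup v1, where that file has `nr_periods`, `nr_throttled`, and `throttled_time` (in nanoseconds). The function names and the example values are hypothetical, and the exact path to the file on a node is not shown because it depends on the node setup.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v1 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(stats: Dict[str, int]) -> Dict[str, float]:
    """Summarize how often and for how long the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    return {
        # Fraction of enforcement periods in which the cgroup hit its quota.
        "throttled_period_fraction": throttled / periods if periods else 0.0,
        # throttled_time is reported in nanoseconds.
        "throttled_seconds": stats.get("throttled_time", 0) / 1e9,
    }


# Made-up file contents, not taken from our nodes.
example = """nr_periods 1000
nr_throttled 250
throttled_time 5000000000
"""
print(throttle_summary(parse_cpu_stat(example)))
```

Reading the file twice a few minutes apart and diffing the two summaries gives a throttling figure that doesn't depend on the metrics pipeline.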
The first drop in the top graph is where we decreased the CPU limit from 24 to 16 to match the CPU request. That decrease actually caused CPU throttling to increase. We removed CPU limits from the Container on 8/31 at 12:00, which brought the number of seconds of CPU throttling down to zero. This makes me think the kernel patch wasn't sufficient to prevent unnecessary CPU throttling.
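For reference, here is the arithmetic behind a CPU limit as I understand it. A limit is enforced as a CFS quota per period, and with the default 100 ms period a limit of 16 CPUs is 1,600,000 µs of CPU time per period. The sketch below uses an idealized model where a hypothetical number of threads all run flat out for one period. The thread count of 32 is invented for illustration, and I'm not claiming this is what our Container does.

```python
CFS_PERIOD_US = 100_000  # default CFS period (100 ms); assumed unchanged on our nodes


def cfs_quota_us(cpu_limit: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) the cgroup may use per period for a given CPU limit."""
    return int(cpu_limit * period_us)


def throttled_us_in_period(runnable_threads: int, quota_us: int,
                           period_us: int = CFS_PERIOD_US) -> int:
    """Wall-clock time the cgroup cannot run in one period (idealized: every thread is always runnable)."""
    demand_us = runnable_threads * period_us  # CPU time wanted during this period
    if demand_us <= quota_us:
        return 0
    # The quota runs out after quota_us / runnable_threads of wall-clock time.
    return period_us - quota_us // runnable_threads


for limit in (24, 16):
    quota = cfs_quota_us(limit)
    stalled = throttled_us_in_period(32, quota)  # 32 threads is a hypothetical burst
    print(f"limit={limit} quota={quota}us stalled={stalled}us per period")
```

In this model a short burst stalls for longer under the 16 CPU limit than under the 24 CPU limit, even when usage averaged over a longer window is well below either limit. Whether that matches what the graphs show is exactly what I can't tell.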
This K8s GitHub issue ["CFS quotas can lead to unnecessary throttling #67577"] is still open. The linked [kernel bug] has a comment saying it should be marked fixed. I'm not sure whether there are still CPU throttling issues with CFS that aren't tracked in issue #67577, though.
Because throttling tracked the CPU limit so closely in the graphs (it rose when we lowered the limit and went to zero when we removed it), I'm assuming the kernel patch named "Fixed CFS quota throttling issue." in COS 77-12371-141-0 wasn't enough.
1. Anyone else using GKE run into this issue?
2. Does anyone have a link to the exact kernel patch that the COS entry "Fixed CFS quota throttling issue." contains? A Linux mailing list ticket or patch would be great so I can see if it's the same patch that various blog posts reference.
3. Anyone aware of any CPU throttling issues in the current COS version and kernel we're using? 77-12371-227-0 and 4.19.112+, respectively.