
I work on a team that operates multitenant GKE clusters for other engineers at our company. Earlier this year I read this blog post [1] about a CFS bug in the Linux kernel that unnecessarily throttles workloads. Kernel versions 4.19 and higher have been patched. I asked GCP support which GKE versions included this patch, and they told me 1.15.9-gke.9. But my team is still getting reports of CPU throttling causing increased latencies on GKE workloads in these clusters.

This means one of the following:

1. We're using a kernel that doesn't contain the patch.

2. The patch wasn't sufficient to prevent unnecessary CPU throttling.

3. The latency is caused by something other than CPU throttling.

To rule out 1, I again checked that our GKE clusters (which are using nodes with Container Optimized OS [COS] VM images) are on a version that contains the CFS patch.

```
dxia@one-of-our-gke-nodes ~ $ uname -a
Linux one-of-our-gke-nodes 4.19.112+ #1 SMP Sat Apr 4 06:26:23 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux
```
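For anyone who wants to script this check across many nodes, here is a minimal sketch of comparing a kernel release string against the 4.19 threshold mentioned above. The function names are my own, and a version comparison is only a coarse signal: vendors can backport or omit fixes, which is why the release notes are worth checking too.

```python
import re
from typing import Tuple


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a release string such as '4.19.112+'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def at_least(release: str, minimum: Tuple[int, int, int]) -> bool:
    """Return True if the release is at or above the minimum version."""
    return parse_kernel_release(release) >= minimum


# The release string from the uname output above.
print(at_least("4.19.112+", (4, 19, 0)))  # True
```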

Kernel version is 4.19.112+ which is a good sign. I also checked the COS VM image version.

gke-11512-gke3-cos-77-12371-227-0-v200605-pre

The cumulative diff in the [COS release notes][2] for cos-stable-77-12371-227-0 shows this lineage (see "Changelog (vs ..." in each entry).

cos-stable-77-12371-227-0

77-12371-208-0

77-12371-183-0

77-12371-175-0

77-12371-141-0 <- This one's notes say "Fixed CFS quota throttling issue."

Now looking into 2:

See this dashboard [5]. The top graph shows an example Container's CPU limit, request, and usage. The bottom graph shows the number of seconds the Container was CPU throttled, measured by sampling the local kubelet's Prometheus metric `container_cpu_cfs_throttled_seconds_total` over time. CPU usage data comes from the Containers' resource usage metrics in the [Kubernetes Metrics API][6], which returns metrics from the [metrics-server][7].
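To cross-check the Prometheus metric on a node, the same counters can be read from the container cgroup's `cpu.stat` file. The sketch below assumes cgroup v1, where `cpu.stat` exposes `nr_periods`, `nr_throttled`, and `throttled_time` (in nanoseconds). The sample values are made up for illustration, not taken from our clusters, and the function names are my own.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v1 cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(before: dict, after: dict):
    """Return (fraction of periods throttled, seconds throttled) between two samples."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    seconds = (after["throttled_time"] - before["throttled_time"]) / 1e9
    ratio = throttled / periods if periods else 0.0
    return ratio, seconds


# Hypothetical samples taken some time apart (values invented for the example).
sample_before = parse_cpu_stat(
    "nr_periods 1000\nnr_throttled 100\nthrottled_time 5000000000\n"
)
sample_after = parse_cpu_stat(
    "nr_periods 1600\nnr_throttled 250\nthrottled_time 12500000000\n"
)

print(throttle_summary(sample_before, sample_after))  # (0.25, 7.5)
```

The fraction of throttled periods is often easier to reason about than raw throttled seconds, since it doesn't depend on how many threads were being throttled at once.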

The top graph shows usage is not close to the limit, so there shouldn't be any CPU throttling happening.

The first drop in the top graph is from decreasing the CPU limit from 24 to 16 to match the CPU request. That decrease actually caused CPU throttling to increase. We removed CPU limits from the Container on 8/31 at 12:00, which decreased the number of seconds of CPU throttling to zero. This makes me think the kernel patch wasn't sufficient to prevent unnecessary CPU throttling.
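For context on what the limit change means at the kernel level, here is a rough sketch of how a CPU limit maps to a CFS quota. It assumes the default CFS period of 100 ms (100,000 µs); I haven't confirmed that our nodes use the default, and the function name is my own.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period of 100 ms


def cfs_quota_us(cpu_limit: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds a container may use per CFS period for a given limit."""
    return int(round(cpu_limit * period_us))


# The two limits from the graphs above.
for limit in (24, 16):
    print(limit, cfs_quota_us(limit))
# 24 2400000
# 16 1600000
```

Under that assumption, lowering the limit from 24 to 16 shrinks the per-period budget from 2.4 s to 1.6 s of CPU time. The quota is enforced per period rather than against an average, so a usage graph sampled at a coarser interval may not show bursts that are shorter than the sampling interval.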

This K8s GitHub issue ["CFS quotas can lead to unnecessary throttling #67577"][8] is still open. The linked [kernel bug][9] has a comment saying it should be marked fixed. I'm not sure whether there are still CPU throttling issues with CFS that aren't tracked in issue #67577, though.

Because the graphs show a strong correlation between removing CPU limits and CPU throttling dropping to zero, I'm assuming the kernel patch described as "Fixed CFS quota throttling issue." in COS 77-12371-141-0 wasn't enough.

Questions

1. Anyone else using GKE run into this issue?

2. Does anyone have a link to the exact kernel patch that the COS entry "Fixed CFS quota throttling issue." contains? A Linux mailing list ticket or patch would be great so I can see if it's the same patch that various blog posts reference.

3. Anyone aware of any CPU throttling issues in the current COS version and kernel we're using? 77-12371-227-0 and 4.19.112+, respectively.

[1]: https://medium.com/omio-engineering/cpu-limits-and-aggressiv...

[2]: https://cloud.google.com/container-optimized-os/docs/release...

[5]: https://share.getcloudapp.com/o0u8KoEn

[6]: https://kubernetes.io/docs/tasks/debug-application-cluster/r...

[7]: https://github.com/kubernetes/kubernetes/tree/master/cluster...

[8]: https://github.com/kubernetes/kubernetes/issues/67577

[9]: https://bugzilla.kernel.org/show_bug.cgi?id=198197

[COS]: https://cloud.google.com/container-optimized-os/docs




Hey David, we talked on a podcast once :) Please raise a support case and send me the ticket number; I'll see if we can get to the bottom of this for you.



