
Excluding kernel bugs, CPU limits just provide an upper bound on burst capacity; that controls oversubscription of CPU on a node. As with any other kind of oversubscription of a resource based on variable demand, there is a tradeoff. Allowing one pod to burst over its request is both unreliable and potentially disruptive to neighboring pods. Whether that improves your cluster efficiency or introduces intolerably high variability in service latency and throughput depends on your mix of workloads and how the scheduler distributes your various pods.

Buffer's solution of having different flavors of node, onto which mutually compatible workloads are scheduled in isolation from incompatible ones, is a very reasonable thing to do, even if this particular case is a bit of a head-scratcher.


Removing CPU limits seems like a bad idea now that there's a kernel fix. But putting that aside...

I don't understand why pods without CPU limits would cause unresponsive kubelets. For a long time now, Kubernetes has allocated a slice for system services. While pods without CPU limits are allowed to burst, they are still limited to the amount of CPU allocated to Kubernetes pods.

Run "systemd-cgls" on a node and you'll see two toplevel slices: kubepods and system. The kubelet process lives within the system slice.

If you run "kubectl describe node <node-name>" you can see the resources set aside for system processes on the node. Processes in the system slice should always have (cpu_capacity - cpu_allocatable) available to share, no matter what happens in the kubepods slice.

    Capacity:
        cpu:                         8
        ephemeral-storage:           83873772Ki
        memory:                      62907108Ki
    Allocatable:
        cpu:                         7910m
        ephemeral-storage:           76224326324
        memory:                      61890276Ki
        pods:                        58
Granted, it's not a large proportion of CPU.
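Put concretely, the CPU guaranteed to the system slice is just the capacity/allocatable gap. A quick sketch using the numbers above:

```python
# CPU the system slice can always fall back on is capacity minus allocatable.
# Values taken from the node description above.
capacity_m = 8 * 1000      # 8 cores, in millicores
allocatable_m = 7910       # "cpu: 7910m" under Allocatable
reserved_m = capacity_m - allocatable_m
print(f"{reserved_m}m reserved for system.slice")  # 90m
```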


It might depend a lot on the distribution and how Kubernetes is started. How much CPU time to reserve for system services from the scheduler (Allocatable, as you pointed out) needs to be passed to kubelet, and I think it only really applies to guaranteed pods.

What I did on the distribution I work on is tune the cgroup shares so control plane services are allocated CPU time ahead of pods (whether guaranteed, burstable, or best effort). We don't run anything as static containers, so this covers all the kube services, etcd, system services, etc.

Before this change in our distribution, IIRC, pods and the control plane had equal weighting, which allowed the possibility for kubelet or other control plane services to be starved if the system was very busy.

There are also lots of other problems that can lead to kubelet bouncing between ready/not ready that we've observed which wouldn't be triggered by the limits.


Even without the bug it will have a negative effect on latency, and it's generally not really needed for un-metered workloads (there are posts by thockin on Reddit and GitHub that describe this in detail)

To answer your other question - I believe kops ships without system reserved by default


Can you explain how having a CPU limit set (at any level) has a negative effect on latency? That's an important factor to understand.

The arguments for allowing containers to burst make plenty of sense to me. I do it on most of my services!

thockin's reddit post for reference: https://www.reddit.com/r/kubernetes/comments/all1vg/on_kuber...

Another interesting bit of context describing some of the non-intuitive impacts of CPU limits: https://github.com/kubernetes/kubernetes/issues/51135

Edit: added links


It’s mentioned elsewhere in this thread, but essentially, with the default CFS quota period of 100ms, it’s really easy for a multithreaded process to exhaust its quota and then just sit there idle until the next period. Another thing: if you have spare cycles that you’ve presumably already paid for, why not just use them?
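A back-of-the-envelope sketch of that effect (the limit and thread count here are made-up examples):

```python
# CFS grants a container quota = limit * period of CPU time per period.
# With many runnable threads, that quota burns fast and the remainder of
# the period is forced idle (throttled). Numbers are illustrative only.
period_ms = 100            # default cfs_period_us = 100ms
limit_cpus = 2             # hypothetical container CPU limit
threads = 16               # threads all runnable at once

quota_ms = limit_cpus * period_ms          # 200ms of CPU per 100ms window
burn_ms = quota_ms / threads               # wall time to exhaust it: 12.5ms
throttled_ms = period_ms - burn_ms         # idle until the window resets
print(f"busy {burn_ms}ms, throttled {throttled_ms}ms of every {period_ms}ms")
```

So even though the container averages well under its limit, its threads spend most of each scheduling window stalled.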


> Removing CPU limits seems like a bad idea now that there's a kernel fix.

Actually, why? Sure, those guys may starve the ones without limits, but they won't starve each other, because Linux will simply time-share the processes. And for services on the critical path (what they turned it off for) that seems like correct behaviour.


I should have said that it seems like the wrong fix to the problem. But I have since learned that limits can cause excessive throttling. And of course you may want your pods to be burstable, but that would just be a question of setting appropriate limits.

Live and learn!


It's pretty simple: limits work only when everyone is using them. If you have one pod that doesn't enforce limits, it can disrupt the entire node.


A container with a request but without a limit should be scheduled as Burstable, and it should only receive allocations in excess of its request when all other containers have had their demand <= request satisfied.

A container without either request or limit is twice-damned, and will be scheduled as BestEffort. The entire cgroup slice for all BestEffort pods is given a cpu.shares of 2 (the equivalent of 2 milliCPUs), and if the kernel scheduler is functioning well, no pod in there is going to disrupt anything but other BestEffort pods with any amount of processor demand. Throw in a 64-thread busyloop and no Burstable or Guaranteed pods should notice much.

Of course that's the ideal. There is an observable difference between a process that relinquishes its scheduler slice and one that must be pre-empted. But I wouldn't call that a major disruption. Each pod will still be given its full requested share of CPU.
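To put numbers on those BestEffort shares, here's a rough sketch of how CFS would divide CPU under full contention (the Burstable figure is a made-up example; CFS allocates in proportion to cpu.shares among runnable siblings):

```python
# Kubernetes gives the whole BestEffort slice cpu.shares = 2.
besteffort_shares = 2
# Hypothetical: Burstable slice shares tracking ~6 cores of summed requests.
burstable_shares = 6000

total = besteffort_shares + burstable_shares
pct = 100 * besteffort_shares / total
print(f"BestEffort slice share under full contention: {pct:.3f}%")
```

In other words, a runaway BestEffort pod can only claim a vanishingly small fraction of CPU while any other slice wants to run.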

If that's not the case, I'd love to know!


Are you sure that BestEffort QOS do not disrupt the entire node? I remember in the past a single pod would freeze the entire VM.


I wrote a little fork+spinloop program w/100 subprocesses and deployed it with a low (100m) CPU request and no limit. It's certainly driving CPU usage to nearly all 8 cores on the machine, but the other processes sharing the node are doing fine.
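For reference, a minimal version of that fork+spinloop load generator might look like this (bounded to a fraction of a second here so it terminates; the original presumably spun indefinitely):

```python
import multiprocessing
import time

def spin(seconds):
    # Busy-loop: consumes a full core for `seconds` if the scheduler allows.
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=spin, args=(0.2,))
             for _ in range(100)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```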

Prometheus scrapes of the kubelet have slowed down a bit, but are still under 400ms.

Note that this cluster (which is on EKS) does have system reserved resources.

    [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
    1024
    [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
    8099
    [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/user.slice/cpu.shares
    1024


Could you please elaborate on why that's so?


This advice is confusing. CPU is a "compressible" resource -- pods don't get killed for exceeding (or trying to exceed) it. Pods don't get evicted from nodes based on CPU starvation either, so autoscaling your node count won't help if you end up with a set of pods on a node that need more CPU than the node can provide. They'll just starve each other.

If your service allows horizontal scalability, you can use autoscaling of pods with Horizontal Pod Autoscaler (ideally also with a cluster autoscaler) to increase pod count for a given service when some percentage of the requested CPU is exceeded, whether or not you set a CPU limit. Setting the cpu_request appropriately for your pods is critical to ensure that node CPU is not oversubscribed by the Kubernetes pod scheduler.
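For reference, the HPA's core scaling rule is just a ratio (simplified here; the real controller also handles readiness, missing metrics, and stabilization windows):

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct):
    # Per the Kubernetes docs:
    # desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

# 4 pods averaging 90% of requested CPU against a 60% target -> scale to 6.
print(desired_replicas(4, 90, 60))
```

Note that the metric is measured against the CPU *request*, which is another reason setting requests accurately matters whether or not you set limits.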

Pods where mem & CPU requests = limits are given the highest class of service ("guaranteed"). For your most critical and latency sensitive services, this is the best approach when also coupled with HPA. Assuming a 4.19 kernel or later, I suppose.

https://medium.com/better-programming/the-kubernetes-quality...


I thought Red Hat announced last year they were abandoning Btrfs and enhancing XFS with similar features.


> Red Hat supports Fedora well, in many ways. But Fedora already works closely with, and depends on, upstreams. And this will be one of them. That's an important consideration for this proposal. The community has a stake in ensuring it is supported. Red Hat will never support Btrfs if Fedora rejects it. Fedora necessarily needs to be first, and make the persuasive case that it solves more problems than alternatives. Feature owners believe it does, hands down.

I guess the dynamic here is that Fedora wants to convince Red Hat to support Btrfs by adopting it themselves.


Thanks! I didn't realize they were such distinct personalities.


The support for multi-arch docker images is getting there, but there's not much of an ecosystem. Between Raspberry Pi, ARM instances on AWS, and ARM-based Macs, it ought to get to critical mass before too long.


Retrying transactions is something every app ought to handle, but it's rare enough that most codebases I've seen just punt on it.
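The wrapper itself is only a few lines; this sketch assumes a hypothetical `TransientTxnError` standing in for whatever retryable serialization-failure exception your database driver actually raises:

```python
import random
import time

class TransientTxnError(Exception):
    """Stand-in for a driver's retryable serialization-failure error."""

def run_with_retries(txn_fn, max_attempts=5):
    # Re-run the whole transaction function on transient conflicts.
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except TransientTxnError:
            if attempt == max_attempts:
                raise
            # Jittered exponential backoff before retrying.
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))
```

The important discipline is that `txn_fn` must be safe to re-execute from the top, which is exactly what many codebases aren't structured for.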


The thing is, before CRDB the application was operating just fine under MySQL's default isolation level.


My last company used Terraform to manage Kubernetes. The main issue is that the TF Kubernetes provider supports a limited subset of K8S object types, and of fields within those K8S objects. For example: TF didn't even support Deployment objects until sometime in mid/late 2019 (I may be wrong on timing, but it was long after they were the primary method for general scheduling of long-running containers).

We ended up using TF's Helm provider, sometimes with hacks like a helm chart which deploys an arbitrary YAML file (the so-called "raw" chart). At that point, Terraform is blind to what's actually happening inside K8S. You can still benefit from the ability of TF to pass data from your other infra automation into the Helm charts, of course, but it's really Helm actually managing the configuration of your K8S cluster. And that's the app we all love to hate.

The situation may have been improved, but my conclusion was that it would always be a somewhat incomplete interface.


The terraform provider has caught up a bit in the last 6 months. It is still missing things like CRD support.

For those things we use a direct kubectl yaml provider.

I wish there was an istio provider!


I believe that if you have a Parquet file meeting certain criteria, it's directly parallelizable as multiple Spark partitions without any shuffling. The splits would occur at Parquet row group boundaries, I believe.

See https://stackoverflow.com/questions/27194333/how-to-split-pa..., https://parquet.apache.org/documentation/latest/, etc.

Whether it's better to have multiple Parquet files or a single parallelizable Parquet file is dependent on your environment and application. At my company, we've tended to have a single row group per file (and one HDFS block per file), in part due to historical reasons.
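A toy sketch of the splitting idea (the byte ranges are invented purely for illustration):

```python
# Hypothetical row-group byte ranges within a single Parquet file (MB).
# Each row group is independently decodable, so each range can become
# its own Spark partition with no shuffle.
row_group_ranges = [(0, 128), (128, 256), (256, 384), (384, 512)]

tasks = [f"task {i}: read bytes {a}MB-{b}MB"
         for i, (a, b) in enumerate(row_group_ranges)]
print(len(tasks))  # one parallel task per row group
```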


The situation isn't terribly different for private health insurance. You're still spending out of a shared pool of dollars, and you don't have complete freedom in how you choose to do so.


And I would never choose a health care system mostly funded out of private insurance either.


My company has about half of our services inside Kubernetes, and multiple K8s clusters, so this is a dream come true. We'd already been eyeing Connect as a much simpler service mesh we could use both inside and outside K8S.

It does seem that Hashicorp has been slow to embrace K8S, perhaps in part due to pushing their Nomad scheduler. I'm glad that is changing. Let each product succeed on its own merits and serve the market best without trying to advantage the others.

