I'm always a bit confused about the CPU limit (for the pod), some guides (and tools) advice to always set one, but this one [0] doesn't.
Ops people I worked with almost always want to lower that limit and I have to insist for raising it (no way they disable it).
Is there an ultimate best practice for that?
CPU limits are harmful if they strand resources that could have been applied. I usually skip them for batch deployments, use them for latency-sensitive services. Doesn’t seem like a security topic though.
They are actually even worse for latency sensitive workload because cfs with 100ms default period will cause crap tail latency (especially for multithreaded processes such as most go programs)
Interesting. It's my impression too. I understand that CPU limit will artificially throttle CPU, when not necessarily needed, wasting CPU cycles I could use.
(Java programs in my case but I imagine it's comparable to Go ones)
Do you recommend to disable CPU limit? In the general case.
We don’t set them anywhere in prod and generally didn’t have any issues. We always set cpu requests and alert if those are exceeded for prolonged periods and always set memory req=limit
I think this is backwards. How are you planning on “sticking to it” when you’re serving unpredictable user traffic? If requests are set appropriately everywhere then it won’t really starve batch as kernel would just scale everything to their respective cpu.shares when cpu is fully saturated. This would allow you to weather spiky load with minimum latency impact and minimize spend
It's weird that apparently you are a borg user from google, according to other discussions we have exchanged, but you question the value of hard-capping for latency-sensitive processes.
Borg sre even ;) (former) and yes i do question them. For one borg aint using 100ms cfs period and it wasn’t even standard cfs if i recall so yes i do question that outside of limited borg usecase
[0] https://learnk8s.io/production-best-practices