Linux kernel cgroups writeback high CPU troubleshooting (dasl.cc)
178 points by mesto1 72 days ago | 27 comments



The skill required to cut through so many layers, going from the PHP application down to the Linux kernel. Impressed and jealous!


Think about the dopamine release and persistence of it too!


The author has identified the issue as memcg reparenting causing a spike in CPU usage. Reparenting mostly solves a problem with zombie memcgs, where a memcg lingers because some resource is still charged to it. In the extreme case you can end up with tens of thousands of zombies. The zombie memcg problem is not unique to cgroup v2, but reparenting is fairly recent.

The article solves the CPU spike by disabling the IO or memory controller, but if one would like to keep using those controllers, a better way to charge memory would be nice.

It is unfortunate that, even though it's clear where the memory should be charged, the kernel does not provide a reliable way to deterministically charge it. If anyone has any design ideas, please feel free to chime in!


Looking at the linked Launchpad bug, it seems the issue is lock contention.

Simply pacing the reparenting would solve the problem, and reworking the locking (perhaps to allow the reparenting process to work in batches?) would make sure it finishes relatively quickly.


This is an interesting problem. The OP should have a look into how the vm.dirty_ratio, vm.dirty_background_ratio, vm.dirty_bytes, and vm.dirty_background_bytes (and other similarly prefixed) sysctl parameters control when the kernel starts flushing dirty pages to disk. Last time I checked, different distros defaulted things like dirty_ratio to somewhere between 10 and 50 percent, mostly for legacy reasons.

This is really not great in situations where you're bootstrapping a fresh server. Here's what happens:

- you boot up a server with, say, 1tb RAM

- your default dirty ratio is 10 (best case)

- you quickly write 90gb of files to your server (images, whatever)

- you get mad unblocked throughput as the page cache fills up & the kernel hasn't even tried flushing anything to disk yet

- your application starts, takes up 9gb memory

- starts to serve requests, writes another 1gb of mem mapped cache

- the kernel starts to flush, realises disk is slower than it thought, starts to trip over itself and aggressively throttle IO until it can catch up

- your app is now IO bound while the kernel thrashes around for a bit

This can be tuned by adjusting the vm.dirty_* defaults, and is well worth doing IMO. The defaults that kernels still ship with are from a long time ago when we didn't have this much memory available.
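To put rough numbers on it, something like this works (a sketch only: the *_bytes knobs override the ratios when set, and the ratios technically apply against available rather than total memory, but it's close enough to see the scale of the problem):

    # Rough sketch: translate the dirty ratios into absolute amounts of dirty
    # page cache the kernel will tolerate before background / blocking
    # writeback kicks in. Total memory is used as an approximation.
    def sysctl(name):
        with open(f"/proc/sys/vm/{name}") as f:
            return int(f.read())

    with open("/proc/meminfo") as f:
        mem_total_bytes = int(f.readline().split()[1]) * 1024  # first line is MemTotal (kB)

    for knob in ("dirty_background_ratio", "dirty_ratio"):
        ratio = sysctl(knob)
        limit_gib = ratio / 100 * mem_total_bytes / 2**30
        print(f"{knob} = {ratio}% -> roughly {limit_gib:.1f} GiB of dirty data")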

My memory of this next bit is flaky at best, so happy to be corrected here, but I remember this also being a big problem with k8s. With cgroups v1, a node would get added to your cluster and a pod would get scheduled there. The pod would be limited to, say, 4gb memory - way more memory than it actually uses - but it would have a lot of IO operations. Because the node still had a ton of free memory and the dirty data was way below the default writeback ratio/bytes thresholds, none of the IO operations would get flushed to disk for ages, but the dirty pages in the page cache would still be counted towards that pod's memory usage, even though they weren't 'real' memory and were completely out of the control of the pod (or kubernetes, really). Before you knew it, bOOM. Pod oomkilled for seemingly no reason, and no way to do anything about it. I remember some issues where people skirted around it by looking off into the middle distance and saying the usual things about k8s not being for stateful workloads, but it was really lame and really not talked about enough.
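If anyone wants to see that charging in action, a rough sketch along these lines shows how much of a cgroup's memory charge is actually dirty page cache (cgroup v2 file names assumed; the process has to be in a non-root cgroup, which it will be on a systemd box):

    # Rough sketch: compare a cgroup's total memory charge against how much of
    # it is dirty page cache vs. anonymous memory. Uses cgroup v2 files.
    from pathlib import Path

    # /proc/self/cgroup looks like "0::/user.slice/..." on a v2-only system.
    cg_rel = Path("/proc/self/cgroup").read_text().strip().rsplit("::", 1)[-1]
    cg = Path("/sys/fs/cgroup") / cg_rel.lstrip("/")

    stats = {}
    for line in (cg / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)

    current = int((cg / "memory.current").read_text())
    print(f"charged to cgroup: {current / 2**20:.1f} MiB")
    for key in ("anon", "file", "file_dirty", "file_writeback"):
        print(f"  {key}: {stats.get(key, 0) / 2**20:.1f} MiB")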

This might seem unrelated, but you guessed it, it was fixed in cgroups v2, and I imagine that the fix for that problem either directly or indirectly explains why OP saw a difference in behaviour between cgroups v1 and v2.

Also, slightly related, I remember discovering a while back that for workloads like this where you've got a high turnover of files & processes, having the `discard` (trim) flag set on your SSD mount could really mess you up (definitely in ext4, not sure about xfs). It would prevent the page cache from evicting pages of deleted files without forcing writeback first, which is obviously the opposite of what it was designed to do (protect/trim the SSDs). Not to mention cause all sorts of zombifications when terminated processes still had mem-mapped files that hadn't been flushed to disk, etc.

AFAIK it's still a problem, though it's been years since I profiled this stuff. At peak load with IO-intensive workloads, you could end up with SSDs making your app run slower. Try remounting without the `discard` flag (and running fstrim manually on a schedule), or use `discard=async`, and see what difference it makes.
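If you want to check your own mounts, parsing /proc/mounts is enough, something like:

    # Quick sketch: list mounted filesystems that carry a discard-style mount
    # option, by parsing /proc/mounts (fields: device, mountpoint, fstype, options, ...).
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            opts = options.split(",")
            if any(o == "discard" or o.startswith("discard=") for o in opts):
                print(f"{mountpoint} ({fstype} on {device}): {options}")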


Hi, I'm the author of the article! Thank you for the awesome description of the various vm.dirty_* sysctls.

The problem described in my post was not _directly_ related to the kernel flushing dirty pages to disk. As such, I'm not sure that tweaking these sysctls would have made any difference.

Instead, we were seeing the kernel use too much CPU when it moved inodes from one cgroup to another. This is part of the kernel's cgroup writeback accounting logic. I believe this is a related but slightly different flavor of writeback problem :)


Hey, I agree that tweaking these probably wouldn't have made much difference, but tuning/reducing the dirty_bytes could calm the writeback stampede and smooth that bump, potentially getting rid of whatever race might have been happening. Regardless, disabling the cgroup accounting there is the right thing to do, especially as you don't need it. Tbh, the main reason I wrote most of that was as background to explain the cgv1 vs v2 differences and why they're there (and because I was stuck in traffic for like 45 mins :/)

If you're ever in the mood to revisit that problem, you should try disabling that discard flag and see if it makes a difference. Also, if it were me, I'd have tried setting LimitNOFILE to whatever it is in my shell and seeing if the rsync still behaved differently.

Anyway - thoroughly enjoyed your article. You should write more :)


I found that info about the `discard` behavior quite interesting. And thank you for the kind, inspiring words -- cheers!


> moved inodes from one cgroup to another

`cgroup.memory=nokmem` avoids this.


TIL, thanks for sharing. We ended up solving our problem another way by adding this `DisableControllers` stanza to the service's systemd configuration: https://gist.github.com/dasl-/87b849625846aed17f1e4841b04ecc...

I believe the kernel's cgroup writeback accounting features are enabled / disabled based on this code: https://github.com/torvalds/linux/blob/c291c9cfd76a8fb92ef3d...


> it was fixed in cgroups v2

I would say it was changed in cgroups v2.

Cgroups v1 was written by a company where only one process on a machine is allowed to do block I/O, and that program is carefully written to not use kernel caches.

Cgroups v2 was written by a company that uses lots of off-the-shelf Linux applications that do ordinary block I/O in the usual naïve way. That's why v2 focuses so much on "pressure".


BTW company-1 == Google and company-2 == FB/Meta.

In addition, Google has completely removed local storage from their servers, so there is no disk I/O at all.


> In addition, Google has completely removed local storage from their servers, so there is no disk I/O at all.

What does that mean? There should be disks somewhere to store Gmail messages anyway.


Usually Google applications use high-level network storage services like Colossus, BigTable, or Spanner, and these high-level services are backed by dedicated storage appliances, where they bypass the kernel for SSDs and use direct block IO for slow disks. For networking, they are moving towards userspace networking [1].

[1] https://research.google/pubs/snap-a-microkernel-approach-to-...


https://static.googleusercontent.com/media/sre.google/en//st...

It sounds a bit vanilla on paper, since things like NFS and iSCSI have existed forever.


Funnily enough, Google is still on cgroup v1. Writeback is also very aggressive, such that most of the page cache is clean.


I'm looking at why my KVM VMs are getting OOM-killed right now, and dear sir, this is gold. It sounds like exactly what's happening, because they get killed during a nightly DB maintenance job.


Another way to fix this: your one-off write of code, assets, etc. at boot time should use O_DIRECT and sidestep the page and/or buffer cache altogether. The performance will be slower, but you won't have a massive page cache overhang for a one-time operation.
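The catch with O_DIRECT is the alignment rules -- the buffer, offset, and length all need to be aligned to the device's logical block size. A rough sketch (Linux only, 4 KiB block size assumed, and the target path needs to live on a real block-backed filesystem, since tmpfs rejects O_DIRECT):

    # Minimal sketch: write a payload with O_DIRECT so it bypasses the page
    # cache entirely. Assumes a 4 KiB logical block size.
    import mmap
    import os

    BLOCK = 4096  # assumed logical block size

    def write_direct(path, data):
        padded = (len(data) + BLOCK - 1) // BLOCK * BLOCK
        buf = mmap.mmap(-1, padded)          # anonymous mmap => page-aligned buffer
        buf.write(data)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
        try:
            os.write(fd, buf)                # aligned length, goes straight to disk
            os.ftruncate(fd, len(data))      # trim the zero padding afterwards
        finally:
            os.close(fd)
            buf.close()

    write_direct("/var/tmp/bootstrap-asset.bin", b"x" * 10_000)  # hypothetical path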


Or run `sync` after the copy "finishes" -- less focused, but very easy to do.


sync is not guaranteed to do anything. And sometimes it does way more than it should. Direct I/O semantics are the correct thing here because they bypass the cache entirely.

I've had a lot of issues writing partition and disk images using dd on modern Linux systems because of caching. These all kept happening even though I would use `sync` like you describe. But setting oflag=direct resolved all of the issues I was having.


The article states this doesn't change anything.


Yeah, even on consumer hardware, a dirty ratio of 10% is waaay too much. These settings can also be tuned in bytes (vm.dirty_bytes and vm.dirty_background_bytes), and I tune these to 128-256MB on my desktop.
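For reference, persisting something in that range as a sysctl.d drop-in would look roughly like this (values are in bytes, 256 MiB / 128 MiB here; note that setting the *_bytes knobs makes the kernel ignore the *_ratio ones):

    # e.g. /etc/sysctl.d/90-dirty-writeback.conf  (hypothetical file name)
    vm.dirty_bytes = 268435456
    vm.dirty_background_bytes = 134217728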


It's not the infamous 2.6.32, but … kernel 3.10 up until very recently? Oof.

(3.10 released 2013-06-30)


It's a Red Hat kernel, so it's not... I mean, it is 3.10, but it's not the 3.10 that was released in 2013; it's frankensteined with backports from newer kernels.


I am working on some related BPF tooling. Check out https://yeet.cx


"This investigation was a collaboration between myself and my colleagues."


I think this is a good thing to be writing on a blog site which is named after the author himself. The author doesn't want to give the impression that he's taking all the credit, but he also doesn't want to mix up his own blog with a corporate voice. The team he's talking about is easily identified elsewhere, at the bug tracker comment he links to: https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.5/+bug...



