Using cgroups to limit I/O (andrestc.com)
179 points by ggarnier on Oct 19, 2017 | 11 comments



It's important to note that if you want to use the blkio controller while running the CFQ I/O scheduler[1] (which is the default on quite a lot of distributions), then using the blkio cgroup (even if you just join a non-root cgroup and don't have any limits set) can have massive performance penalties for O_DSYNC (and even some "regular") operations. It's caused by an implementation detail of blkio weighting within CFQ. We have a bug open in runc that details it[2].
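
For anyone who wants to see this for themselves, here's a minimal sketch (assuming cgroup v1 is mounted under /sys/fs/cgroup, the disk is /dev/sda, and the cgroup name is made up; run as root):

    cat /sys/block/sda/queue/scheduler
    # -> noop deadline [cfq]
    mkdir /sys/fs/cgroup/blkio/demo
    echo $$ > /sys/fs/cgroup/blkio/demo/cgroup.procs
    # even with no limits written, O_DSYNC writes from this shell
    # can now be dramatically slower under CFQ:
    dd if=/dev/zero of=testfile bs=4k count=1000 oflag=dsync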

Also, while this article is quite rosy-eyed about cgroupv2, there are many many issues that mean that cgroupv2 is not really usable by container runtimes today -- despite what systemd might be trying to do with their "hybrid mode". A friend of mine from the LXC community gave a talk about the wide variety of issues that occur due to cgroupv2's weirdly restrictive API[3]. The blkio writeback cache limits are very useful, but they're really not worth the pain of breaking how container runtimes have to work.

Also this comment is not _entirely_ accurate:

> Since in cgroups v1, different resources/controllers (memory, blkio) live in different hierarchies on the filesystem, even when those cgroups have the same name, they are completely independent. So, when the memory page is finally being flushed to disk, there is no way that the memory controller can know what blkio cgroup wrote that page.

You can mount a single cgroupv1 hierarchy with multiple controllers attached to it (that's what 'cpu,cpuset' are), which then could (in principle) know about each other. But they don't, since a lot of the cgroupv2 code is locked behind cgroup_on_dfl(...) checks.
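
For reference, a co-mount looks roughly like this (a sketch; it only works if neither controller is already attached to another hierarchy, so on most distributions you'd have to unmount the per-controller mounts first):

    mkdir /sys/fs/cgroup/blkio_memory
    mount -t cgroup -o blkio,memory cgroup /sys/fs/cgroup/blkio_memory
    mkdir /sys/fs/cgroup/blkio_memory/test
    ls /sys/fs/cgroup/blkio_memory/test
    # -> blkio.* and memory.* files side by side in one cgroup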

[1]: https://www.kernel.org/doc/Documentation/block/cfq-iosched.t...

[2]: https://github.com/opencontainers/runc/issues/861

[3]: https://www.youtube.com/watch?v=P6Xnm0IhiSo


Author here.

>Also, while this article is quite rosy-eyed about cgroupv2, there are many many issues that mean that cgroupv2 is not really usable by container runtimes today -- despite what systemd might be trying to do with their "hybrid mode". A friend of mine from the LXC community gave a talk about the wide variety of issues that occur due to cgroupv2's weirdly restrictive API[3]. The blkio writeback cache limits are very useful, but they're really not worth the pain of breaking how container runtimes have to work.

I knew there are some problems with cgroup v2 but haven't quite looked into it. I'm going to watch the talk and probably add a note about that. Thanks!

>You can mount a single cgroupv1 hierarchy with multiple controllers attached to it (that's what 'cpu,cpuset' are), which then could (in principle) know about each other. But they don't, since a lot of the cgroupv2 code is locked behind cgroup_on_dfl(...) checks.

That's interesting, I didn't know about that! I'll add a note to the post.


Great article. Thanks


There has been a lot of work looking at this issue inside of Google. One of the natural outcomes of trying to schedule a bunch of different 'types' of services on a machine and maximize utilization is to look at each subsystem (memory, network, disk I/O) and work on ways of partitioning it. Of course, if a resource can block everything (like disk I/O can), then you need to take extra precautions.

Co-processes using the same spindle (the same physical disk drive, even if they aren't on the same partition) can easily suffer priority inversion if the low-priority process is doing a lot of disk I/O.
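
cgroup v1's proportional blkio weights are one mitigation, though they only take effect under CFQ. A sketch with made-up group names (weights range from 10 to 1000, default 500):

    mkdir /sys/fs/cgroup/blkio/batch /sys/fs/cgroup/blkio/latency
    echo 100 > /sys/fs/cgroup/blkio/batch/blkio.weight
    echo 900 > /sys/fs/cgroup/blkio/latency/blkio.weight
    # $BATCH_PID is a placeholder for the low-priority process:
    echo "$BATCH_PID" > /sys/fs/cgroup/blkio/batch/cgroup.procs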


Yeah, but Google doesn't (or didn't until a few years ago) rely on cgroups to do a lot of the finer-grained I/O scheduling.


Only allow scatter-gather (vectored I/O) in the guest.


A plug: I wrote a process launcher that uses cgroups to set a per process limit for memory and swap use (that was before Docker).

It was for an online game service; the challenge was to run multiple instances of the game app on the cluster with minimal latency for the users, but also to make sure that no user could hog server resources by spawning multiple instances of the game.

https://github.com/geokat/cgfy
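
Under the hood it boils down to the v1 memory controller's knobs; roughly this (the group name and limit are made up, this isn't cgfy's actual code, and the memsw file needs swap accounting enabled, e.g. swapaccount=1):

    mkdir /sys/fs/cgroup/memory/game-instance
    echo 512M > /sys/fs/cgroup/memory/game-instance/memory.limit_in_bytes
    # memory+swap cap; must be >= the plain memory limit:
    echo 512M > /sys/fs/cgroup/memory/game-instance/memory.memsw.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/game-instance/cgroup.procs
    exec /path/to/game   # placeholder binary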


Thanks! I use cgroups (manually) to memory constrain applications that I know leak memory (e.g., rtorrent, chrome at one time). This looks like a nice way of making that a bit less manual.


No problem. I've just checked and it looks like libcgroup is no longer maintained and has fallen behind cgroups development. This may work instead:

    systemd-run --scope -p MemoryMax=4M rtorrent

You can also set other quotas - `man systemd.resource-control` for more info.
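
Other properties stack the same way on one command line, e.g. (assuming a systemd new enough to support both):

    systemd-run --scope -p MemoryMax=4M -p CPUQuota=25% rtorrent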


When they hit a threshold, send them a checkpoint event, then kill and restart them. Too bad Chrome causes such a thundering herd when it comes back to life.


Sorry, I thought the OP was the author.



