It's important to note that if you want to use the blkio controller while running the CFQ I/O scheduler[1] (which is the default on quite a lot of distributions), then using the blkio cgroup (even if you just join a non-root cgroup and don't have any limits set) can have massive performance penalties for O_DSYNC (and even some "regular") operations. It's caused by an implementation detail of blkio weighting within CFQ. We have a bug open in runc that details it[2].
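To make that scenario concrete, here is a minimal sketch (mine, not from the bug report) of what "just joining a non-root blkio cgroup" looks like under cgroup v1 -- the group name, file paths, and the timing loop are illustrative, and it needs root to write into the cgroup filesystem:

    import os, time

    # Hypothetical non-root blkio cgroup under the v1 hierarchy; assumes the
    # blkio controller is mounted at the usual /sys/fs/cgroup/blkio location.
    CGROUP = "/sys/fs/cgroup/blkio/demo"
    os.makedirs(CGROUP, exist_ok=True)

    # "Joining" the cgroup is just writing our PID into cgroup.procs -- no limits set.
    with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))

    # Time a handful of O_DSYNC writes; under CFQ these are the operations that
    # reportedly slow down once the process sits in a non-root blkio cgroup.
    fd = os.open("/tmp/dsync-test", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    buf = b"\0" * 4096
    start = time.monotonic()
    for _ in range(100):
        os.write(fd, buf)
    print("100 O_DSYNC writes took %.3fs" % (time.monotonic() - start))
    os.close(fd)

Running the same loop with the process left in the root blkio cgroup gives a baseline to compare against.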
Also, while this article is quite rosy-eyed about cgroupv2, there are many many issues that mean that cgroupv2 is not really usable by container runtimes today -- despite what systemd might be trying to do with their "hybrid mode". A friend of mine from the LXC community gave a talk about the wide variety of issues that occur due to cgroupv2's weirdly restrictive API[3]. The blkio writeback cache limits are very useful, but they're really not worth the pain of breaking how container runtimes have to work.
Also this comment is not _entirely_ accurate:
> Since in cgroups v1, different resources/controllers (memory, blkio) live in different hierarchies on the filesystem, even when those cgroups have the same name, they are completely independent. So, when the memory page is finally being flushed to disk, there is no way that the memory controller can know what blkio cgroup wrote that page.
You can mount a single cgroupv1 hierarchy with multiple controllers attached to it (that's what 'cpu,cpuset' are), which then could (in principle) know about each other. But they don't, since a lot of the cgroupv2 code is locked behind cgroup_on_dfl(...) checks.
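For anyone who wants to see this on their own machine, here is a small sketch (again mine, not from the post) that lists which v1 controllers are co-mounted on the same hierarchy by parsing /proc/self/cgroup:

    # List which cgroup v1 controllers are co-mounted on the same hierarchy for
    # the current process. /proc/self/cgroup has one line per hierarchy in the
    # form "<hierarchy-id>:<controller,list>:<path>"; controllers mounted
    # together (e.g. via "mount -t cgroup -o cpu,cpuset cgroup <dir>") show up
    # on a single line.
    with open("/proc/self/cgroup") as f:
        for line in f:
            hier_id, controllers, path = line.rstrip("\n").split(":", 2)
            if "," in controllers:
                print("hierarchy %s co-mounts [%s] at %s" % (hier_id, controllers, path))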
> Also, while this article is quite rosy-eyed about cgroupv2, there are many many issues that mean that cgroupv2 is not really usable by container runtimes today -- despite what systemd might be trying to do with their "hybrid mode". A friend of mine from the LXC community gave a talk about the wide variety of issues that occur due to cgroupv2's weirdly restrictive API[3]. The blkio writeback cache limits are very useful, but they're really not worth the pain of breaking how container runtimes have to work.
I knew there were some problems with cgroup v2 but hadn't quite looked into them. I'm going to watch the talk and probably add a note about that. Thanks!
> You can mount a single cgroupv1 hierarchy with multiple controllers attached to it (that's what 'cpu,cpuset' are), which then could (in principle) know about each other. But they don't, since a lot of the cgroupv2 code is locked behind cgroup_on_dfl(...) checks.
That's interesting, I didn't know about that! I'll add a note to the post.
There has been a lot of work looking at this issue inside Google. One of the natural outcomes of trying to schedule a bunch of different 'types' of services on CPU and maximize utilization is to look at each subsystem (memory, network, disk I/O) and work on ways of partitioning them. Of course, if they can block everything (like disk I/O can), then you need to take extra precautions.
Co-processes using the same spindle (physical disk drive, even if they aren't on the same partition) can easily become priority-inverted if the low-priority process is doing a lot of disk I/O.
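The v1 knob for this (under CFQ) is blkio.weight, which gives each cgroup a proportional share of disk time. A rough sketch of deprioritizing a noisy batch job that shares a spindle with a latency-sensitive process -- the group names, weights, and PIDs are placeholders:

    import os

    BLKIO_ROOT = "/sys/fs/cgroup/blkio"   # assumes the v1 blkio controller is mounted here

    def add_to_group(name, weight, pid):
        # Create the cgroup, set its proportional disk-time weight, and move
        # the process into it (the weight only matters relative to siblings).
        path = os.path.join(BLKIO_ROOT, name)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "blkio.weight"), "w") as f:
            f.write(str(weight))
        with open(os.path.join(path, "cgroup.procs"), "w") as f:
            f.write(str(pid))

    frontend_pid, batch_pid = 1234, 5678          # placeholder PIDs of the two co-processes
    add_to_group("frontend", 800, frontend_pid)   # keeps the lion's share of disk time
    add_to_group("batch", 100, batch_pid)         # heavy I/O, but deprioritized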
A plug: I wrote a process launcher that uses cgroups to set a per-process limit for memory and swap use (that was before Docker).
It was for an online game service; the challenge was to run multiple instances of the game app on the cluster with minimal latency for the users, but also to make sure that no user could hog server resources by spawning multiple instances of the game.
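Roughly, the cgroup v1 version of that boils down to creating a memory cgroup, writing the memory and memory+swap limits, and moving the child into the group before exec. A sketch under those assumptions (the group name, limit, and fork/exec shape are illustrative, not the actual launcher):

    import os, sys

    # Minimal cgroup v1 "launcher" sketch: cap memory + swap for a child process
    # before exec'ing it. Writing memory.memsw.limit_in_bytes requires swap
    # accounting to be enabled on the kernel.
    def launch_limited(argv, name="game-instance", mem_bytes=512 * 1024 * 1024):
        cg = os.path.join("/sys/fs/cgroup/memory", name)
        os.makedirs(cg, exist_ok=True)
        with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
            f.write(str(mem_bytes))
        with open(os.path.join(cg, "memory.memsw.limit_in_bytes"), "w") as f:
            f.write(str(mem_bytes))          # memory + swap capped at the same value

        pid = os.fork()
        if pid == 0:                         # child: join the cgroup, then exec
            with open(os.path.join(cg, "cgroup.procs"), "w") as f:
                f.write(str(os.getpid()))
            os.execvp(argv[0], argv)
        return pid

    if __name__ == "__main__":
        launch_limited(sys.argv[1:] or ["sleep", "60"])

Setting memory.limit_in_bytes before memory.memsw.limit_in_bytes matters, since the memory+swap limit can't be set below the memory limit.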
Thanks! I use cgroups (manually) to memory-constrain applications that I know leak memory (e.g., rtorrent, Chrome at one time). This looks like a nice way of making that a bit less manual.
When they hit a threshold, send them a checkpoint event, then kill and restart them. Too bad Chrome causes such a thundering herd when it comes back to life.
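As a sketch of that pattern (polling memory.usage_in_bytes rather than using the v1 eventfd-based threshold notifications, to keep it short; the cgroup path, threshold, and "checkpoint" signal are all placeholders):

    import os, signal, time

    # Illustrative watcher: poll a v1 memory cgroup's usage, and when it crosses
    # a soft threshold, ask the process to checkpoint (here just SIGUSR1), then
    # kill it and let a supervisor restart it.
    CG = "/sys/fs/cgroup/memory/rtorrent"
    THRESHOLD = 1 * 1024 * 1024 * 1024      # 1 GiB soft limit

    def pids(cg):
        with open(os.path.join(cg, "cgroup.procs")) as f:
            return [int(p) for p in f.read().split()]

    while True:
        with open(os.path.join(CG, "memory.usage_in_bytes")) as f:
            usage = int(f.read())
        if usage > THRESHOLD:
            for pid in pids(CG):
                os.kill(pid, signal.SIGUSR1)   # "checkpoint now" convention
            time.sleep(5)                       # give it a moment to flush state
            for pid in pids(CG):
                os.kill(pid, signal.SIGTERM)    # supervisor restarts it afterwards
            break
        time.sleep(10)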
[1]: https://www.kernel.org/doc/Documentation/block/cfq-iosched.t...
[2]: https://github.com/opencontainers/runc/issues/861
[3]: https://www.youtube.com/watch?v=P6Xnm0IhiSo