
Using cgroups to limit I/O - ggarnier
https://andrestc.com/post/cgroups-io/
======
cyphar
It's important to note that if you're using the CFQ I/O scheduler[1] (which is
the default on quite a lot of distributions), then using the blkio cgroup
(even if you just join a non-root cgroup and don't have any limits set) can
have _massive_ performance penalties for O_DSYNC (and even some "regular")
operations. It's caused by an implementation detail of blkio weighting within
CFQ. We have a bug open in runc that details it[2].
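
A quick way to check whether a device is even using CFQ (the device name here
is just an example):

```shell
# The active scheduler for the device is shown in brackets.
cat /sys/block/sda/queue/scheduler
```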

Also, while this article is quite rosy-eyed about cgroupv2, there are many
_many_ issues that mean that cgroupv2 is not really usable by container
runtimes today -- despite what systemd might be trying to do with their
"hybrid mode". A friend of mine from the LXC community gave a talk about the
wide variety of issues that occur due to cgroupv2's weirdly restrictive
API[3]. The blkio writeback cache limits are very useful, but they're really
not worth the pain of breaking how container runtimes have to work.

Also this comment is not _entirely_ accurate:

> Since in cgroups v1, different resources/controllers (memory, blkio) live in
> different hierarchies on the filesystem, even when those cgroups have the
> same name, they are completely independent. So, when the memory page is
> finally being flushed to disk, there is no way that the memory controller
> can know what blkio cgroup wrote that page.

You _can_ mount a single cgroupv1 hierarchy with multiple controllers attached
to it (that's what 'cpu,cpuset' is), which then could (in principle) know
about each other. But they don't, since a lot of the cgroupv2 code is locked
behind cgroup_on_dfl(...) checks.
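
A rough sketch of what that co-mounting looks like (needs root; the mount
point is just an example):

```shell
# Attach two v1 controllers to a single hierarchy; their control files
# then live side by side in the same directory tree.
mkdir -p /sys/fs/cgroup/cpu,cpuset
mount -t cgroup -o cpu,cpuset cgroup /sys/fs/cgroup/cpu,cpuset
```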

[1]: [https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt](https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt)
[2]:
[https://github.com/opencontainers/runc/issues/861](https://github.com/opencontainers/runc/issues/861)
[3]:
[https://www.youtube.com/watch?v=P6Xnm0IhiSo](https://www.youtube.com/watch?v=P6Xnm0IhiSo)

~~~
andrestc
Author here.

>Also, while this article is quite rosy-eyed about cgroupv2, there are many
many issues that mean that cgroupv2 is not really usable by container runtimes
today -- despite what systemd might be trying to do with their "hybrid mode".
A friend of mine from the LXC community gave a talk about the wide variety of
issues that occur due to cgroupv2's weirdly restrictive API[3]. The blkio
writeback cache limits are very useful, but they're really not worth the pain
of breaking how container runtimes have to work.

I knew there are some problems with cgroup v2 but haven't quite looked into
it. I'm going to watch the talk and probably add a note about that. Thanks!

>You can mount a single cgroupv1 hierarchy with multiple controllers attached
to it (that's what 'cpu,cpuset' are), which then could (in principle) know
about each other. But they don't, since a lot of the cgroupv2 code is locked
behind cgroup_on_dfl(...) checks.

That's interesting, I didn't know about that! I'll add a note to the post.

~~~
shaklee3
Great article. Thanks

------
ChuckMcM
There has been a lot of work looking at this issue inside of Google. One of
the natural outcomes of trying to schedule a bunch of different 'types' of
services on CPU and maximize utilization, is to look at each subsystem
(memory, network, disk i/o) and work on ways of partitioning them. Of course
if they can block everything (like disk i/o can) then you need to take extra
precautions.

Co-processes using the same spindle (physical disk drive, even if they aren't
on the same partition) can easily become priority-inverted if the low-priority
process is doing a lot of disk I/O.

~~~
puzzle
Yeah, but Google doesn't (or didn't until a few years ago) rely on cgroups to
do a lot of the finer-grained I/O scheduling.

------
geoka9
A plug: I wrote a process launcher that uses cgroups to set a per-process
limit on memory and swap use (that was before Docker).

It was for an online game service; the challenge was to run multiple instances
of the game app on the cluster with minimal latency for the users, but also to
make sure that no user could hog server resources by spawning multiple
instances of the game.

[https://github.com/geokat/cgfy](https://github.com/geokat/cgfy)

~~~
loeg
Thanks! I use cgroups (manually) to memory constrain applications that I know
leak memory (e.g., rtorrent, chrome at one time). This looks like a nice way
of making that a bit less manual.
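
For what it's worth, the manual version is just a few lines of cgroupfs poking
(cgroupv1 paths, needs root; the names and sizes are examples):

```shell
# Create a memory cgroup, cap it, and move the current shell in so
# children (e.g. rtorrent) inherit the limit.
mkdir /sys/fs/cgroup/memory/rtorrent
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/memory/rtorrent/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/rtorrent/cgroup.procs
rtorrent
```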

~~~
geoka9
No problem. I've just checked and it looks like libcgroup is not maintained
any more and has fallen behind cgroups development. This may work instead:

systemd-run --scope -p MemoryMax=4M rtorrent

You can also set other quotas - `man systemd.resource-control` for more info.
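
For example (the values here are illustrative):

```shell
# CPU and I/O caps alongside the memory one; IOWriteBandwidthMax takes
# a device path and a bandwidth.
systemd-run --scope -p MemoryMax=512M -p CPUQuota=50% rtorrent
systemd-run --scope -p "IOWriteBandwidthMax=/dev/sda 1M" rtorrent
```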

------
thallesr
Sorry, I thought the OP was the author.

