
Open sourcing oomd, a new approach to handling OOMs - kiyanwang
https://code.fb.com/production-engineering/open-sourcing-oomd-a-new-approach-to-handling-ooms/
======
scottlamb
The "pressure stall information" looks interesting. It's exponentially
weighted moving averages of the fraction of time "some" or "full" (all non-
idle) tasks are stalled on each of CPU, memory, and IO. Also a counter of
absolute time. [1]

Printing this seems like a good addition to "Linux Performance Analysis in
60,000 Milliseconds". [2] More useful than the load average printed by uptime,
which hasn't aged well. Linux mixes cpu-bound and io-bound tasks into one
bucket [3], which is a terrible idea, and the number of tasks is confusing at
best when you have a mix of thread-per-CPU and thread-per-connection stuff
running.

Looks like it's also exported on a per-cgroup basis, which is great.

[1] http://git.cmpxchg.org/cgit.cgi/linux-psi.git/tree/Documentation/accounting/psi.txt

[2] https://medium.com/netflix-techblog/linux-performance-analysis-in-60-000-milliseconds-accc10403c55

[3] http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
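For reference, psi.txt describes each /proc/pressure file as lines of "some"/"full" records with 10s/60s/300s averages plus a total stall time. A minimal sketch of parsing that format (the sample values here are made up):

```python
# Minimal sketch: parse the PSI text format described in psi.txt.
# /proc/pressure/{cpu,memory,io} expose "some" (and, for memory and io,
# "full") lines with 10s/60s/300s averages and a total stall time in
# microseconds.

def parse_psi(text):
    """Parse PSI output into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats = {}
        for field in fields:
            key, value = field.split("=")
            # avg* are percentages; total is stall time in microseconds
            stats[key] = int(value) if key == "total" else float(value)
        result[kind] = stats
    return result

sample = """\
some avg10=1.23 avg60=0.40 avg300=0.10 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=789
"""
print(parse_psi(sample)["some"]["avg10"])  # -> 1.23
```

On a patched kernel the same function should work on the contents of /proc/pressure/memory directly.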

~~~
peterwwillis
I was excited at first, and then realized this is a kernel patch and not a
userland tool. Boooo. FB says "PSI tracks three major system resources — CPU,
memory, and I/O — and provides a canonical view into how the usage of these
resources changes over time." It should have been possible to do this in
userland and make this (somewhat) kernel version-independent.

I'm sure I'll use this once there are OSes running a stable kernel with these
patches, but I don't run non-vendor kernels in production [unless I'm forced
to], and depending on kernel features makes it hard to support legacy systems.

~~~
Hello71
how would that work

edit: you seem to be implying that FB is stupid and likes writing kernel code
for no good reason.

------
LinuxBender
That looks like an interesting way to intercept vmscan's OOM killer. Personally,
I'd rather avoid it altogether.

I prefer to set vm.overcommit_ratio=0 and then increase vm.vfs_cache_pressure
to somewhere between 400 and 10000 depending on the server role, then set
vm.min_free_kbytes, vm.admin_reserve_kbytes and vm.user_reserve_kbytes higher
based on the amount of ram in the system using a simple formula.

For systems that are ephemeral, I also set vm.panic_on_oom to 2 so that they
self heal. Avoid doing this on databases.
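A sketch of what that looks like as a sysctl fragment. The values below are placeholders, not recommendations; the point above is that they should be derived from the server role and total RAM:

```
# Illustrative /etc/sysctl.d/99-oom-tuning.conf sketch.
# All values are placeholders; tune per server role and total RAM.
vm.overcommit_ratio = 0
vm.vfs_cache_pressure = 400        # somewhere in 400-10000 depending on role
vm.min_free_kbytes = 262144        # scale with total RAM
vm.admin_reserve_kbytes = 131072
vm.user_reserve_kbytes = 262144
# Ephemeral systems only, so they self-heal on OOM (never on databases):
# vm.panic_on_oom = 2
```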

Of course, THP should also be set to madvise and memory defrag disabled unless
you really need them. That also frees up some leaked memory and avoids
artificial pauses.

Beyond these things, I agree that cgroups are another way to set memory
constraints around applications to further protect the server. Many systems
won't have cgroups for a while, however, as they are still on older
non-systemd distributions.

~~~
JdeBP
One does not need systemd in order to have control groups. Control groups are
a kernel mechanism. However note that the headlined software needs version 2
of Linux control groups.

~~~
LinuxBender
Agreed. I tried to use it on CentOS 6, but it was very incomplete. The kernel
is just too old. I started using kernels from ELRepo to work around this, but
not many will do that in a production environment.

~~~
mdaniel
> _The kernel is just too old_

For my curiosity, is there a technical reason for the distro to use an old
kernel? I would think one would wish to have all the bug fixes available, and
I don't think the kernel has a bad backward compatibility story, so from the
outside it seems like a sure thing -- leading me to believe there must be more
to the story.

~~~
LinuxBender
Enterprise distros such as Red Hat fork the kernel, then patch it for bugs and
vulnerabilities, to avoid adding too many new features within a major version
release of the OS. This actually makes a lot of sense from a stability and
predictability perspective. It would be significantly harder to support the
upstream kernel releases, as new features and behavioral changes occur rather
often.

------
stuaxo
There needs to be a signal apps can be sent when resources are low... quite a
few apps could do things like empty various caches under RAM pressure, for
instance.

Similarly, when disk space is low or the system is under IO pressure, some
apps could hold off checking certain files or throttle things for a while.
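Nothing standard exists for this today, but an app and a monitor could agree on a convention. A hypothetical sketch where SIGUSR1 means "drop your caches":

```python
import os
import signal

# Hypothetical convention: some external monitor sends SIGUSR1 when the
# system is under memory pressure, and the app responds by dropping caches.

cache = {}

def drop_caches(signum, frame):
    cache.clear()

signal.signal(signal.SIGUSR1, drop_caches)

cache["rendered_page"] = "x" * 1_000_000   # some expensive cached data

# Simulate the monitor signalling us:
os.kill(os.getpid(), signal.SIGUSR1)
print(len(cache))  # -> 0
```

The hard part is the monitor side, which is exactly what PSI gives you a clean trigger for.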

~~~
voidlogic
This is a good point. In Go, all instances of sync.Pool (Go's generic object
pool with per-thread caching) are registered with the allocator/GC so that
they can be drained under heap pressure; it would be great if the OS could
request this draining too.
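As a language-neutral sketch of that idea (a hypothetical API, not Go's actual sync.Pool internals): pools register themselves in a global set, and a pressure event drains them all at once:

```python
import weakref

# Hypothetical sketch of the sync.Pool drain idea: pools register
# themselves, and a pressure handler can drain every pool at once.

_pools = weakref.WeakSet()

class Pool:
    def __init__(self, factory):
        self._factory = factory
        self._free = []
        _pools.add(self)

    def get(self):
        return self._free.pop() if self._free else self._factory()

    def put(self, obj):
        self._free.append(obj)

def drain_all_pools():
    """What the GC, or hypothetically the OS, would trigger under pressure."""
    for pool in _pools:
        pool._free.clear()

buffers = Pool(lambda: bytearray(4096))
buffers.put(buffers.get())   # one free buffer retained
drain_all_pools()
print(len(buffers._free))    # -> 0
```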

~~~
skyde
On Android this is done using an LRU cache of background apps.

If your app has a cached process and it retains memory that it currently does
not need, then your app, even while the user is not using it, affects the
system's overall performance. As the system runs low on memory, it kills
processes in the LRU cache beginning with the process least recently used. The
system also accounts for processes that hold onto the most memory and can
terminate them to free up RAM.
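The eviction order described above can be sketched with an ordered map. This is just the data-structure idea, not Android's actual implementation:

```python
from collections import OrderedDict

# Sketch of LRU eviction as described for Android's cached-process list:
# when over capacity, the least recently used entry is removed first.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def use(self, key, value):
        """Touch an entry; return the evicted key, if any."""
        if key in self._items:
            self._items.move_to_end(key)   # mark as most recently used
        self._items[key] = value
        if len(self._items) > self.capacity:
            evicted, _ = self._items.popitem(last=False)  # evict LRU
            return evicted
        return None

lru = LRUCache(2)
lru.use("mail", "proc1")
lru.use("maps", "proc2")
evicted = lru.use("camera", "proc3")   # "mail" is now least recently used
print(evicted)  # -> mail
```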

------
AndyKelley
I would be interested to see this compared to what I consider to be the null
hypothesis: disabling overcommit altogether and handling out of memory at the
application level.

~~~
Hello71
poor, because 1. fork exists, and 2. the application incurring OOM is
statistically unlikely to be the one that should be killed on multi-
application servers, and 3. 99.9% of applications do not correctly handle
malloc returning NULL in 100% of cases. some make an effort, but except for
stuff like SQLite, almost nobody tests it, so it's usually broken at least 1%
of the time.

~~~
toast0
> 3. 99.9% of applications do not correctly handle malloc returning NULL in
> 100% of cases.

0% of applications correctly handle being unable to write to a page they were
given as writable. It's certainly true that few applications correctly handle
a failed malloc, and in many cases there's not really a correct way to handle
it, but at least it's available to handle.

------
1996
Not convinced this has any advantage compared to a naive approach of:

- introspection at the application level: exit if memory usage goes way above
usual (say, by 2x)

- management by systemd to gracefully restart the application when it exits
(dependencies etc.), and also monitor memory use in case the application
doesn't monitor it well (say, kill it at 2.5x the usual)

It could be made better, but at the cost of more complexity.
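A sketch of the systemd half of that approach. The service name and limits are placeholders, and MemoryHigh/MemoryMax require cgroup v2 (the unified hierarchy):

```
# /etc/systemd/system/mydaemon.service -- hypothetical sketch
[Service]
ExecStart=/usr/local/bin/mydaemon
Restart=always            # restart after a self-initiated exit or a kill
RestartSec=5
MemoryHigh=400M           # throttle/reclaim before the hard cap
MemoryMax=512M            # cgroup hard cap (~2.5x "usual" usage)

[Install]
WantedBy=multi-user.target
```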

~~~
ebikelaw
I guess this would be a good way to operate dunning-kruger.com, but if you
want your site to stay up this seems problematic. If your introspective
application forms some belief about "usual" memory usage while it's idle and
unloaded, and you suddenly shift all of your traffic from another datacenter
to this one, and the heap usage on that application increases by 10x, then all
your replicas will kill themselves.

~~~
1996
If the memory usage can increase 10x while this is not reflected in the
assumptions, both the measurements and the assumptions are flawed.

To give more context: I run many copies of a daemon with known memory leaks,
on machines each with a load close to 0.7 and about 200 MB of free RAM left. I
consider that a heavy load.

This approach restarts one of the copies of the daemon when its own leaks
become dangerous to the rest of the system.

It is not perfect, but good enough when the numbers are measured at peak. The
200 MB of free RAM is all the extra leeway I care to leave.

------
kevinoid
Interesting approach. I'm curious to try it out.

After playing around with vm.overcommit_ratio, different swap sizes,
earlyoom[1], and a few other variables, I still haven't found a happy medium
between high memory utilization and low risk of swapping to death.
vm.overcommit_ratio=0 is safe, but on systems where occasional swapping is
tolerable and memory is limited (e.g. my laptop), I'd rather allow some
overcommit.

The risk is that if many cold or unallocated pages get touched while the
system is under high memory pressure, the system can become totally
unresponsive. At the moment I use "Magic SysRq"+f to manually start the
oom_killer, when possible. Obviously it's not a great solution. Is there some
kernel tunable to keep the system responsive that I'm unaware of? What do you
guys do for desktop/laptop systems?

1. [https://github.com/rfjakob/earlyoom](https://github.com/rfjakob/earlyoom)
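For what it's worth, SysRq+f only works if the sysrq bitmask allows process signalling, e.g. via a sysctl fragment like:

```
# /etc/sysctl.d/99-sysrq.conf -- enable only the SysRq functions needed.
# Bit 64 allows signalling processes, which includes SysRq+f (oom_kill).
kernel.sysrq = 64
```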

------
__flo
Android also has its own low memory killer:
[https://android.googlesource.com/platform/system/core/+/mast...](https://android.googlesource.com/platform/system/core/+/master/lmkd/)

------
voidlogic
Does doing this in user space have any advantage beyond letting them develop a
PoC faster? It seems like building a better OOM killer and having it live in
user space are orthogonal.

~~~
testvox
It's because this supports plugins which are intended to interact with other
user-space programs in order to allow for cooperative resource management.
These plugins would be much harder to write if they had to be implemented in
kernel space (you couldn't use standard user-space libs for making RPC calls,
and it would be easy to introduce kernel corruption if you don't know kernel
programming). So yes, it's for development reasons, but not to allow for a
faster PoC. It's to allow non-kernel devs to easily write plugins.

------
wheaties
That's neat, but I question whether many of us are at the level of Facebook
w.r.t. the need for this. To wit, who here runs their own data center?

~~~
dagenix
What does running their own data center have to do with it? As I understand
it, this daemon could be deployed to a VM instance.

