
Non-uniform memory access meets the OOM killer - r4um
http://rachelbythebay.com/w/2018/03/30/oom/
======
adrianmonk
The OOM killer uses a heuristic to figure out what to kill. If the primary
purpose of your system is to run some process that hangs on to a lot of RAM,
that heuristic is exactly the opposite of what you need, so it would be a good
idea to disable it or exempt that process.

Also, while I'm talking prophylactics: if you have monitoring and alerting in
your production environment (which you should), it seems like there should be
an alert for whenever the OOM killer activates. Assuming you are
allocating resources carefully enough that you expect everything to fit, if it
fires, it's almost always a sign that things are not going according to plan
and need to be investigated sooner rather than later.
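
For the "exempt that process" part, a minimal sketch (the process name "mydb"
and the pidof lookup are placeholders for whatever your main workload is):
writing -1000 to /proc/<pid>/oom_score_adj tells the kernel never to select
that process, while 1000 would make it the preferred victim. Needs root or
CAP_SYS_RESOURCE.

    import subprocess

    def exempt_from_oom_killer(pid: int) -> None:
        # -1000 = never select this process; 1000 = preferred victim
        with open(f"/proc/{pid}/oom_score_adj", "w") as f:
            f.write("-1000\n")

    if __name__ == "__main__":
        # hypothetical: look up the main workload by name
        pid = int(subprocess.check_output(["pidof", "-s", "mydb"]))
        exempt_from_oom_killer(pid)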

~~~
greenleafjacob
Yes, OOM killer activation is on its face evidence of an incident. Comments on
other threads put it well: “If your process ran out of RAM, you get to quit.
Why offload it onto some other random process? This is how your database
process runs out of memory and your web workers get killed (or vice versa).”
The scenario depicts a capacity shortage regardless of what the system decides
to do in response.

------
saagarjha
A lot of time seems to go into tricking the watchdogs on single purpose
machines. I heard a story once of a guy who wanted to get some computation
done, but the process was being deprioritized by the scheduler because it
seemed like it was a hung process that kept asking for CPU time. The solution
he came up with was voluntarily relinquishing compute access right before
anyone would check up on it, making it appear as if the process was great at
sharing time with others. By doing this, he could get that one process’s
instructions running something like 99% of the time.

~~~
the8472
That's a pretty odd workaround considering that you can reserve cores to the
point where not even kernel tasks run on them and then pin a single userspace
thread to that core so it can run without ever being preemptively descheduled.
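
A sketch of the pinning half, assuming core 3 was reserved at boot with the
isolcpus=3 kernel parameter (the core number is a placeholder; adjust to your
setup):

    import os

    ISOLATED_CORE = 3  # hypothetical core reserved via isolcpus=3 at boot

    # Restrict the calling process (pid 0 = self) to the isolated core;
    # the scheduler will never migrate it elsewhere.
    os.sched_setaffinity(0, {ISOLATED_CORE})
    print("pinned to CPUs:", os.sched_getaffinity(0))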

~~~
dward
This wouldn't even work with the completely fair scheduler, which is the Linux
default scheduler.

[https://en.m.wikipedia.org/wiki/Completely_Fair_Scheduler](https://en.m.wikipedia.org/wiki/Completely_Fair_Scheduler)

~~~
the8472
CFS obeys core isolation and task sets, so it would work

~~~
dward
I was speaking of the "odd workaround", not about using cpu isolation.

------
cthalupa
>This new version also had this wacky little "feature" where it tried to bind
itself to a single NUMA node.

This is 100% a feature. If you care at all about memory access latency, you
want to remain local to the NUMA node. Foreign memory access is significantly
slower. If you have NUMA enabled and your applications are not NUMA aware, and
there are shared pages being accessed by applications running on both nodes, the
NUMA rebalancing can actually cause even worse performance as it constantly
moves the pages from one node to the other.

Any application that cares about memory access latency should 100% be written
to be NUMA aware, and if it is not, you should be using numactl to bind the
application to the proper node.

This also goes for PCI-E devices (including nvme drives!) as they are going to
be bound to a NUMA node as well. If you have an application that is accessing
an nvme volume, or using a GPU, you should 100% make sure that it is running
on the same node as the pci-e bus for that device.
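
A rough sketch of the device half (the device name nvme0 is a placeholder;
assumes a NUMA-enabled kernel, where sysfs reports -1 for unknown or non-NUMA
hardware): find the node the controller is attached to and pin the process to
that node's CPUs. numactl --cpunodebind/--membind does the same thing from the
shell.

    import os

    def node_of_pci_device(sysfs_dev):
        # PCI devices expose their NUMA node in sysfs (-1 if unknown)
        with open(os.path.join(sysfs_dev, "numa_node")) as f:
            return int(f.read())

    def cpus_of_node(node):
        # Parse a cpulist like "0-7,16-23" into a set of CPU ids
        cpus = set()
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    node = node_of_pci_device("/sys/class/nvme/nvme0/device")  # hypothetical device
    if node >= 0:
        os.sched_setaffinity(0, cpus_of_node(node))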

------
smarks
Time to re-up this classic from Andries Brouwer:

[https://lwn.net/Articles/104185/](https://lwn.net/Articles/104185/)

------
speedplane
It's now commonplace for even medium-size companies to run dozens of servers.
Memory resources (as well as disk and CPU) are always being stretched. The OOM
killer may have been sufficient for single-server environments, where you could
always provision an extra 40%, but it's far too blunt of a tool.

Most environments I've worked with have to define an instance size (in memory
and CPU), and determine how many parallel threads/processes will run on it.
Plus you need to determine when and how to scale up to more instances. To
reduce costs, the goal is 100% utilization, but also with the capability to
deal with spikes in traffic and workload, and all with an acceptable error
rate.

Unfortunately, doing this type of sizing/scaling analysis is incredibly
difficult. The opaque effects of the OOM killer make it even more difficult.
I'm sure the OOM killer uses a deterministic algorithm, but it's complex enough
that most don't know it or account for it. In a server environment, if the OOM
killer kills a service, your app and all other services are likely hosed. It
would be far more preferable if the OOM killer had a straightforward,
consistent, and deterministic method of dealing with low memory. This way
programmers would know to look out for it, and could handle it more
consistently.

------
ParrotyError
The OOM killer was a misfeature when it was designed. Why is it still in the
kernel? Solaris solved this problem 20 years ago.

~~~
aristidb
Pardon my ignorance: How did Solaris solve this?

~~~
ParrotyError
I can't remember but I did sit in on a presentation about 15 years ago where
they explained it. I lent the notes to a senior developer and never got them
back.

~~~
RantyDave
It doesn't have an OOM killer. Even more remarkably, a call to allocate memory
can't fail, but it may not return either. When Solaris (well, SmartOS in my
case) runs completely out of memory, all hell breaks loose.

------
n_t
That's why one needs to be aware of the memory and other load characteristics
of the system, particularly if it is an enterprise system. Various processes
should be put in different cgroups with defined resources; cgroups also provide
memory pressure notifications and other goodies. If it is an embedded system,
it is probably best to turn off overcommit. Finally, for critical processes,
set oom_score_adj so that the process can be excluded from being killed.
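
A minimal sketch of the cgroup part, assuming a v1 memory controller mounted at
/sys/fs/cgroup/memory and root privileges (the group name "worker" and the
2 GiB limit are placeholders): the capped process hits its own limit instead of
taking the whole box down.

    import os

    CG = "/sys/fs/cgroup/memory/worker"   # hypothetical group name
    os.makedirs(CG, exist_ok=True)

    # Hard memory limit for everything in this group (2 GiB here)
    with open(os.path.join(CG, "memory.limit_in_bytes"), "w") as f:
        f.write(str(2 * 1024**3))

    # Move the current process into the group
    with open(os.path.join(CG, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))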

------
StreamBright
This is the reason I am a big fan of running any software with separate users
and setting ulimit to a low value so that something stupid like this cannot
impact the production service. I would be super keen to try to replicate this
scenario on my test cluster and see if my settings catch it. Does anybody
know if the software in question is an open-source tool?
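
A minimal sketch of the ulimit part done programmatically (the service path and
the 4 GiB cap are placeholders): setrlimit(RLIMIT_AS, ...) is the same knob as
ulimit -v, so a leak gets a clean allocation failure instead of an OOM kill.

    import resource
    import subprocess

    LIMIT = 4 * 1024**3  # 4 GiB address-space cap, placeholder value

    def cap_address_space():
        # Applied in the child between fork and exec
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT, LIMIT))

    subprocess.run(["/usr/local/bin/my-service"],  # hypothetical service
                   preexec_fn=cap_address_space)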

~~~
ams6110
This is the approach I take also. I'm also looking at totally disabling the
OOM killer because it seems to be pretty useless. Anytime I see stuff killed by
the OOM killer, the culprit is usually and obviously some runaway Java process,
but it inevitably picks the SSH daemon to kill, which doesn't help anything, and the
box continues to swap so badly that it just seems unrecoverable. I'd rather
just have the box panic and reboot if it's truly out of memory.

~~~
yjftsjthsd-h
I have not looked into it at all, but can you not exempt sshd from the OOM
killer?

~~~
ams6110
I looked into it a little bit. There are ways to tune it but I didn't see a
way to exempt processes by name. It may be possible.

The scenario I described above is HPC clusters in a university environment.
The problem is students running programs that are poorly written. I'd rather
reboot the node and tell them to fix their code than deal with trying to
accommodate their careless / naive programming.

------
jschwartzi
At my last job I wrote a build system that built maybe 30 or 40 executables
from several hundred source files. Sometimes when I'd run make -j with no
constraint my desktop environment would crash.

It turned out that the OOM killer was triggering because I was filling up
memory with compiler invocations.

I was really proud of that bug.

~~~
bmurphy1976
I seem to end up building a lot of stuff on memory constrained devices for
some reason. The OOM killer is always a problem, but it's easily avoided by
provisioning an excessive amount of swap. It's slow, but slow is faster than
never. Did you have any swap at the time?

~~~
jschwartzi
Yeah, probably. It was just a desktop Ubuntu system.

------
amelius
Reminds me of:

[https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/)

------
ben_bai
What happened to good old returning NULL when no memory is available?

No, let's do overcommit (malloc always works) and OOM-kill some random process
when under memory pressure!

~~~
JeremyBanks
We collectively decided that this is less annoying than introducing thousands
of difficult error cases to handle in every application.

~~~
concrete-faucet
Well, what about adding a new signal (SIGXMEM) with a default action of
ignore? If the system is running low on memory it can send this to some or all
processes and wait for a little bit to see if things get better.

This is how iOS has handled things since version 2.0:
[https://developer.apple.com/documentation/uikit/uiapplicatio...](https://developer.apple.com/documentation/uikit/uiapplicationdelegate/1623063-applicationdidreceivememorywarni)
> It is strongly recommended that you implement this method. If your app does
not release enough memory during low-memory conditions, the system may
terminate it outright.

~~~
geofft
See section 11 "Memory Pressure" of
[https://www.kernel.org/doc/Documentation/cgroup-v1/memory.tx...](https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt)
- there's a way to get notified via eventfd() if your current cgroup's memory
gets low. I believe you can just do this on the root memory cgroup
(/sys/fs/cgroup/memory/memory.pressure_level) if you're not setting up actual
cgroups for your application.
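
A rough sketch of that mechanism against the root v1 memory cgroup (assumes
cgroup v1 mounted at /sys/fs/cgroup/memory, Python 3.10+ for os.eventfd, and
enough privileges; "medium" is one of the low/medium/critical levels):

    import os

    CG = "/sys/fs/cgroup/memory"  # root memory cgroup (v1)

    efd = os.eventfd(0)
    pfd = os.open(os.path.join(CG, "memory.pressure_level"), os.O_RDONLY)

    # Register: "<eventfd> <pressure fd> <level>" into cgroup.event_control
    with open(os.path.join(CG, "cgroup.event_control"), "w") as f:
        f.write(f"{efd} {pfd} medium")

    while True:
        os.read(efd, 8)  # blocks until the kernel signals memory pressure
        print("memory pressure: drop caches / shed load now")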

(Signals for asynchronous conditions are an awkward interface because they can
interrupt you between any two assembly instructions. You're not able to
release memory in the handler itself; you have to set a flag that gets handled
by the main loop. So eventfd makes sense here. I'm assuming iOS is doing
something similar by queueing an Objective-C method call. Signals make a lot
more sense for segfaults and the like, where you're being interrupted at the
exact instruction that isn't working and you need to handle it before
executing any more instructions.)

------
dis-sys
Being able to write NUMA-aware applications like the one described in the
article is a luxury denied to ALL Go users. The current Go runtime doesn't have
any NUMA awareness.

As of today, you can get a two-NUMA-node processor (AMD Threadripper 1900X)
for as little as $449.

------
BrainInAJar
Memory overcommit is the most hostile, idiotic misfeature to ever ship in any
mainstream operating system. It's such a great example of why one should pay
absolutely no attention to Linus.

