The Linux OOM killer came about based on the assumption that most processes, when faced with a NULL return from malloc, will unceremoniously exit. Thus, running out of memory would effectively randomly kill whatever process next attempted to make an allocation, and suddenly the system would have a bit more memory available. So rather than kill a random process, the OOM killer tries to kill a process actually responsible for using a huge amount of memory.
Not always a sensible plan, and possibly not even something that should be on by default, but understandable.
1. Rank processes by "badness" and kill the baddest one. This is not just based on memory usage, but is a fairly complicated and expensive algorithm. (There's a sketch of the knob a process can use to nudge its own ranking just after this list.)
2. Kill the process that caused the request that failed.
3. Do a kernel panic. This way if your server runs only a single process you care about, it'll just get rebooted, rather than killing the only process you actually want to keep running.
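For anyone who wants to poke at option 1: modern Linux exposes the computed badness as /proc/<pid>/oom_score, and a process can nudge its own ranking via /proc/<pid>/oom_score_adj (-1000 means "never pick me", 1000 means "pick me first"). A rough sketch in C (the helper name is mine, error handling is minimal):

    #include <stdio.h>

    /* Ask the OOM killer to prefer other victims over this process.
     * Lowering the value below its current setting generally requires
     * CAP_SYS_RESOURCE; raising it does not. */
    static int set_oom_score_adj(int adj)
    {
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", adj);
        return fclose(f);
    }

    int main(void)
    {
        if (set_oom_score_adj(-500) != 0)
            perror("oom_score_adj");
        return 0;
    }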
Also, note that the article is talking about memory allocation inside the kernel, and not malloc() which is used in userland. It's one thing when your random Postgres process is out of RAM, it's quite another when the kernel is.
Good luck getting programs to process SIGNOMEM without allocating any more memory!
Besides, if programs are using memory that could be discarded, it would be better if the kernel knew this in the first place and could therefore do the MM on their behalf.
You could have the system reserve a little tiny bit of extra memory (somewhat similar to the "disk space reserved for root" in ext2) that it would only ever hand out in response to malloc() attempts inside a SIGNOMEM handler.
You're right, though; metadata on the allocations themselves marking them as release-on-memory-pressure would be much better.
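Linux has actually since grown something close to that metadata: madvise(MADV_FREE) (since 4.5) marks anonymous pages as cheap to reclaim under memory pressure, while they stay usable until the kernel actually takes them; once reclaimed, reads return zeros. A rough sketch of how a rebuildable cache might use it (names are mine, and a real implementation would also have to handle the race where pages are reclaimed between the validity check and use):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    #define CACHE_BYTES (64 * 1024 * 1024)
    #define CACHE_MAGIC 0xC0FFEEu   /* vanishes if the kernel reclaimed the pages */

    struct cache {
        unsigned magic;
        /* ... cached, rebuildable data follows ... */
    };

    static struct cache *cache_create(void)
    {
        void *p = mmap(NULL, CACHE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        struct cache *c = p;
        c->magic = CACHE_MAGIC;
        return c;
    }

    /* Mark the pages as reclaimable under memory pressure.  Their contents
     * remain valid unless and until the kernel actually frees them. */
    static void cache_mark_discardable(struct cache *c)
    {
        madvise(c, CACHE_BYTES, MADV_FREE);
    }

    /* If the magic is gone, the pages were reclaimed (and now read as
     * zeros), so the cache must be rebuilt. */
    static int cache_still_valid(const struct cache *c)
    {
        return c->magic == CACHE_MAGIC;
    }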
No it wouldn't. If a program has memory that it doesn't need and can live without, it ought to just release it. Otherwise that is called a memory leak. I believe this was tried in early Android releases and was disastrous. Basically, during an out-of-memory condition the kernel would ask every process "hey, can you release some memory?" And each process would reply with "nope, I need it all!"
I don't see how the metadata is any different. What happens when the kernel yanks a page of memory from under a running process?
Programs often make time vs memory tradeoffs. In some cases, it is even possible for them to adjust these tradeoffs during runtime.
The most common example is file-system caches. Pretending (for the sake of argument) that the kernel did not automatically cache the file-system, a program may reasonably make this optimization [1]. In this case, the program can easily release some memory by clearing its cache.
You could also be running a program with garbage collection that normally wouldn't bother doing a full sweep until it hit some memory usage threshold; again, it can do this on request.
I'm sure that people can come up with other examples.
[1] In fact, a program can request that the kernel not cache its file-system requests, in which case it would have actual reason to cache what it thinks it might need again.
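On Linux, the footnote's "don't cache my reads" can be done with O_DIRECT at open() time, or after the fact with posix_fadvise(); a minimal sketch of the latter (the helper name is mine):

    #include <fcntl.h>
    #include <unistd.h>

    /* Read from a file and then tell the kernel we won't want its pages
     * again, so the page cache isn't grown on our behalf.  The program is
     * then free to keep -- and later drop -- its own cache of just the
     * parts it cares about. */
    static ssize_t read_uncached(const char *path, void *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        ssize_t n = read(fd, buf, len);

        /* Hint: drop whatever page-cache pages this file now occupies. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return n;
    }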
Of course. But in practice here's what's going to happen. The good apps, say Postgres, will implement the ability to do this. They will create caches, and release them upon request from the OS. This will significantly hinder the performance of Postgres because it will keep losing its caches. Note that this is worse than just emptying the cache: you actually lose the allocation and have to start over. In some cases you'll effectively disable the cache, which is there for a reason!
Now, here comes, say, MongoDB, which says "Yeah, I have caches, and I need them all! I won't release any allocations because I have to have them." Let's say, for whatever reason you run both Postgres and MongoDB on the same box and it's running out of RAM. Now you are punishing Postgres, the good citizen, and rewarding MongoDB, the bad citizen, and only because Postgres bothered to implement the ability to give up its caches.
I cannot find the reference to this, but I believe this was tried in early Android, and the consequences were that it became unusable as soon as you installed at least one memory hog app that never gave up any memory.
> No it wouldn't. If a program has memory that it doesn't need and can live without, it ought to just release it. Otherwise that is called a memory leak.
I disagree. A process could usefully cache the results of computation in RAM (e.g. a large lookup table). This RAM is useful to the process (will increase performance) but if it is discarded due to a spike in memory pressure it could be rebuilt.
> What happens when the kernel yanks a page of memory from under a running process?
> [...] that it would only ever hand out in response to malloc() attempts inside a SIGNOMEM handler.
POSIX doesn't allow malloc() to be called inside a signal handler at all. Also, when free()ing memory that was allocated with malloc(), there is absolutely no guarantee that the memory is ever returned to the OS (i.e., via munmap() or sbrk()), due to possible memory fragmentation.
In other words, it just wouldn't be practically possible to create an application that uses standard memory allocation functions and can reliably free some memory back to the kernel.
malloc() and free() are what I meant by "standard memory allocation functions". I'd say a program that is in a position to use madvise() effectively has implemented its own heap allocator.
Also, MADV_DONTNEED is only usable in some specific situations, like caches. I don't see how it could be used to implement things like "on low memory, trigger garbage collection and trim the heap to the smallest possible size with munmap()".
You said: "In other words, it just wouldn't be practically possible to create an application that uses standard memory allocation functions and can reliably free some memory back to the kernel."
So I thought you'd be interested to know that you can do just this with the standard functions mmap() and madvise().
No, it's not a replacement for malloc/free, but it does have value to some applications for some use cases.
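To make that concrete, here's a rough sketch of a buffer that can hand its physical pages back to the kernel on demand while keeping the virtual mapping (on Linux, after MADV_DONTNEED on an anonymous private mapping, the next touch simply faults in fresh zero pages):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    #define BUF_BYTES (16 * 1024 * 1024)

    /* Allocate straight from the kernel, bypassing malloc(), so heap
     * fragmentation can't pin these pages. */
    static void *buf_alloc(void)
    {
        void *p = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Give the physical pages back to the kernel right now.  The mapping
     * stays valid; the next write faults in zero-filled pages again. */
    static int buf_release_memory(void *p)
    {
        return madvise(p, BUF_BYTES, MADV_DONTNEED);
    }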
1. This is the one where if you "less" a giant log file, your web server will get killed. Awesome.
2. This could be handled by returning NULL.
3. If your server only needs to run a single process, making that one process terminate would be much faster than rebooting the entire machine (commercial-grade servers can take ten minutes to POST).
2. No, because most allocations by the kernel are not the result of the program calling malloc. (They can be the result of a page fault when the program tries to write to some memory.)
There are a couple of different things going on here:
malloc() generally uses mmap() which is a syscall. It also can use brk()/sbrk(). Some implementations use both.
There is also kmalloc() which is what's broken based on TFA. This is inside the kernel only and is not a syscall.
So you are right, malloc() is not a syscall, but it may or may not use a syscall when you call it. You can, of course, write better code yourself: statically allocate the big chunk of memory you need up front, then do your own sub-allocation out of that chunk. This way you are guaranteed not to cause the system any grief. Either your process starts, or it does not. After it starts, it is in complete control of its own memory.
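A minimal sketch of that kind of fixed arena (names are mine; the memset is there because, under Linux overcommit, even static BSS pages aren't actually backed until they're touched):

    #include <stddef.h>
    #include <string.h>

    #define ARENA_BYTES (8 * 1024 * 1024)

    static unsigned char arena[ARENA_BYTES];
    static size_t arena_used;

    /* Touch every page once so they are actually backed before we rely
     * on them (untouched BSS pages aren't committed yet). */
    static void arena_init(void)
    {
        memset(arena, 0, sizeof arena);
    }

    /* Hand out 16-byte-aligned chunks; returns NULL once the budget chosen
     * at build time is exhausted.  Nothing is ever freed. */
    static void *arena_alloc(size_t n)
    {
        size_t aligned = (arena_used + 15) & ~(size_t)15;
        if (n > ARENA_BYTES - aligned)
            return NULL;
        arena_used = aligned + n;
        return arena + aligned;
    }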
First of all, malloc implementations that rely on brk are simply broken. The whole idea of a process-wide data segment is broken and leads not only to awkwardness around thread safety, but also to the inability to decommit memory when it's no longer needed. Decent malloc implementations use mmap.
Second, the whole idea that a page isn't really "allocated" until you touch it is just stupid, lazy, and terrible; it doesn't exist on sane operating systems like NT. On decent systems, mmap (or its equivalent) sets aside committed memory and malloc carves out pieces from that block.
It's frustrating to see people who grew up with Linux not only fail to understand how terrible and broken Linux's memory management is, but also not realize that it could be any other way.
NT doesn't have fork. Supporting that requires copy-on-write semantics. And you don't want to stop fork just because you don't have enough memory for a second complete copy of the current process; you might be about to throw all that away with an exec, and even if not, you're likely to keep most of the shared memory shared.
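The pattern being defended is the classic fork-then-exec below: the child "copies" a potentially huge address space only on paper, writes to almost none of it, and immediately replaces it. Strict commit accounting would have to charge (or refuse) the full copy up front. A minimal sketch:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Imagine gigabytes of heap are mapped at this point.  fork() only
         * copies page tables; the pages stay shared, copy-on-write. */
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: never writes to the inherited heap, so nothing is
             * actually copied before the whole address space is replaced. */
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);   /* only reached if exec failed */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }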
Well, first off, NT is... different. Not sure if it's sane or superior or whatever.
Second, you may be right that this behavior causes problems, but it clearly has advantages as well. You advocate copy-on-write semantics elsewhere. Well this behavior is basically that.
Third, dlmalloc(), which is pretty widely used, uses sbrk().
Fourth, I sense a whole lot of frustration targeted at "these damn Linux kids". Sorry you feel that way. On the plus side, it's Friday, and this particular Linux kid would gladly buy you a beer to help cheer you up.
(Technically, I grew up with NetBSD, not Linux. Wrote a toy allocator for it at one point, and am trying to work on one that works for Linux as well.)
That's not really true. Not all dynamic allocation schemes rely on mmap; some use the brk and sbrk system calls.
"In the original Unix system, brk and sbrk were the only ways in which applications could acquire additional data space; later versions allowed this to also be done using the mmap call." -https://en.wikipedia.org/wiki/Sbrk
Also, mmap itself is a system call; it was usually just called prior to the failing page allocation, not at the time of the allocation failure.
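For completeness, a tiny sketch of the brk()-style path the quote describes: sbrk() moves the end of the data segment, and really is a system call each time it grows the segment (glibc treats it as legacy, but it still works):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Current end of the data segment (the "program break"). */
        void *before = sbrk(0);

        /* Grow the data segment by one page -- a real system call, unlike a
         * malloc() satisfied from memory the allocator already holds. */
        if (sbrk(4096) == (void *)-1) {
            perror("sbrk");
            return 1;
        }

        void *after = sbrk(0);
        printf("break moved from %p to %p\n", before, after);
        return 0;
    }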