The Linux OOM killer came about based on the assumption that most processes, when faced with a NULL return from malloc, will unceremoniously exit. Thus, running out of memory would effectively randomly kill whatever process next attempted to make an allocation, and suddenly the system would have a bit more memory available. So rather than kill a random process, the OOM killer tries to kill a process actually responsible for using a huge amount of memory.
Not always a sensible plan, and possibly not even something that should be on by default, but understandable.
1. Rank processes by "badness" and kill the baddest one. This is not just based on memory usage, but is a fairly complicated and expensive algorithm. (There's a sketch of the knob a process can use to nudge its own ranking just after this list.)
2. Kill the process that caused the request that failed.
3. Do a kernel panic. This way if your server runs only a single process you care about, it'll just get rebooted, rather than killing the only process you actually want to keep running.
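For anyone who wants to poke at option 1: modern Linux exposes the computed badness as /proc/<pid>/oom_score, and a process can nudge its own ranking via /proc/<pid>/oom_score_adj (-1000 means "never pick me", 1000 means "pick me first"). A rough sketch in C (the helper name is mine, error handling is minimal):

    #include <stdio.h>

    /* Ask the OOM killer to prefer other victims over this process.
     * Lowering the value below its current setting generally requires
     * CAP_SYS_RESOURCE; raising it does not. */
    static int set_oom_score_adj(int adj)
    {
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", adj);
        return fclose(f);
    }

    int main(void)
    {
        if (set_oom_score_adj(-500) != 0)
            perror("oom_score_adj");
        return 0;
    }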
Also, note that the article is talking about memory allocation inside the kernel, and not malloc() which is used in userland. It's one thing when your random Postgres process is out of RAM, it's quite another when the kernel is.
Good luck getting programs to process SIGNOMEM without allocating any more memory!
Besides, if programs are using memory that could be discarded, it would be better if the kernel knew this in the first place and could therefore do the MM on their behalf.
You could have the system reserve a little tiny bit of extra memory (somewhat similar to the "disk space reserved for root" in ext2) that it would only ever hand out in response to malloc() attempts inside a SIGNOMEM handler.
You're right, though; metadata on the allocations themselves marking them as release-on-memory-pressure would be much better.
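Linux has actually since grown something close to that metadata: madvise(MADV_FREE) (since 4.5) marks anonymous pages as cheap to reclaim under memory pressure, while they stay usable until the kernel actually takes them; once reclaimed, reads return zeros. A rough sketch of how a rebuildable cache might use it (names are mine, and a real implementation would also have to handle the race where pages are reclaimed between the validity check and use):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    #define CACHE_BYTES (64 * 1024 * 1024)
    #define CACHE_MAGIC 0xC0FFEEu   /* vanishes if the kernel reclaimed the pages */

    struct cache {
        unsigned magic;
        /* ... cached, rebuildable data follows ... */
    };

    static struct cache *cache_create(void)
    {
        void *p = mmap(NULL, CACHE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        struct cache *c = p;
        c->magic = CACHE_MAGIC;
        return c;
    }

    /* Mark the pages as reclaimable under memory pressure.  Their contents
     * remain valid unless and until the kernel actually frees them. */
    static void cache_mark_discardable(struct cache *c)
    {
        madvise(c, CACHE_BYTES, MADV_FREE);
    }

    /* If the magic is gone, the pages were reclaimed (and now read as
     * zeros), so the cache must be rebuilt. */
    static int cache_still_valid(const struct cache *c)
    {
        return c->magic == CACHE_MAGIC;
    }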
No it wouldn't. If a program has memory that it doesn't need and can live without, it ought to just release it. Otherwise that is called a memory leak. I believe this was tried in early Android releases and was disastrous. Basically, during an out-of-memory condition the kernel would ask every process "hey, can you release some memory?" And each process would reply with "nope, I need it all!"
I don't see how the metadata is any different. What happens when the kernel yanks a page of memory from under a running process?
Programs often make time vs memory tradeoffs. In some cases, it is even possible for them to adjust these tradeoffs during runtime.
The most common example is file-system caches. Pretending (for the sake of argument) that the kernel did not automatically cache the file-system, a program may reasonably make this optimization [1]. In this case, the program can easily release some memory by clearing its cache.
You could also be running a program with garbage collection that normally wouldn't bother doing a full sweep until it hit some memory usage threshold; again, it can do this on request.
I'm sure that people can come up with other examples.
[1] In fact, a program can request that the kernel not cache its file-system requests, in which case it would have actual reason to cache what it thinks it might need again.
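On Linux, the footnote's "don't cache my reads" can be done with O_DIRECT at open() time, or after the fact with posix_fadvise(); a minimal sketch of the latter (the helper name is mine):

    #include <fcntl.h>
    #include <unistd.h>

    /* Read from a file and then tell the kernel we won't want its pages
     * again, so the page cache isn't grown on our behalf.  The program is
     * then free to keep -- and later drop -- its own cache of just the
     * parts it cares about. */
    static ssize_t read_uncached(const char *path, void *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        ssize_t n = read(fd, buf, len);

        /* Hint: drop whatever page-cache pages this file now occupies. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return n;
    }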
Of course. But in practice here's what's going to happen. The good apps, say Postgres, will implement the ability to do this. They will create caches, and release them upon request from the OS. This will significantly hinder the performance of Postgres because it will keep losing its caches. Note that this is worse than just emptying the cache: you actually lose the allocation and have to start over. In some cases you'll effectively disable the cache, which is there for a reason!
Now, here comes, say, MongoDB, which says "Yeah, I have caches, and I need them all! I won't release any allocations because I have to have them." Let's say, for whatever reason you run both Postgres and MongoDB on the same box and it's running out of RAM. Now you are punishing Postgres, the good citizen, and rewarding MongoDB, the bad citizen, and only because Postgres bothered to implement the ability to give up its caches.
I cannot find the reference to this, but I believe this was tried in early Android, and the consequences were that it became unusable as soon as you installed at least one memory hog app that never gave up any memory.
> No it wouldn't. If a program has memory that it doesn't need and can live without, it ought to just release it. Otherwise that is called a memory leak.
I disagree. A process could usefully cache the results of computation in RAM (e.g. a large lookup table). This RAM is useful to the process (will increase performance) but if it is discarded due to a spike in memory pressure it could be rebuilt.
> What happens when the kernel yanks a page of memory from under a running process?
> [...] that it would only ever hand out in response to malloc() attempts inside a SIGNOMEM handler.
POSIX doesn't allow malloc() to be called inside a signal handler at all. Also, when free()ing memory that was allocated with malloc(), there is absolutely no guarantee that the memory is ever returned to the OS (i.e., via munmap() or sbrk()), due to possible memory fragmentation.
In other words, it just wouldn't be practically possible to create an application that uses standard memory allocation functions and can reliably free some memory back to the kernel.
malloc() and free() are what I meant by "standard memory allocation functions". I'd say a program that is in a position to use madvise() effectively has implemented its own heap allocator.
Also, MADV_DONTNEED is only usable in some specific situations, like caches. I don't see how it could be used to implement things like "on low memory, trigger garbage collection and trim the heap to the smallest possible size with munmap()".
You said: "In other words, it just wouldn't be practically possible to create an application that uses standard memory allocation functions and can reliably free some memory back to the kernel."
So I thought you'd be interested to know that you can do just this with the standard functions mmap() and madvise().
No, it's not a replacement for malloc/free, but it does have value to some applications for some use cases.
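To make that concrete, here's a rough sketch of a buffer that can hand its physical pages back to the kernel on demand while keeping the virtual mapping (on Linux, after MADV_DONTNEED on an anonymous private mapping, the next touch simply faults in fresh zero pages):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    #define BUF_BYTES (16 * 1024 * 1024)

    /* Allocate straight from the kernel, bypassing malloc(), so heap
     * fragmentation can't pin these pages. */
    static void *buf_alloc(void)
    {
        void *p = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Give the physical pages back to the kernel right now.  The mapping
     * stays valid; the next write faults in zero-filled pages again. */
    static int buf_release_memory(void *p)
    {
        return madvise(p, BUF_BYTES, MADV_DONTNEED);
    }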
1. This is the one where if you "less" a giant log file, your web server will get killed. Awesome.
2. This could be handled by returning NULL.
3. If your server only needs to run a single process, making that one process terminate would be much faster than rebooting the entire machine (commercial-grade servers can take ten minutes to POST).
2. No, because most allocations by the kernel are not the result of the program calling malloc. (They can be the result of a page fault when the program tries to write to some memory.)
There are a couple of different things going on here:
malloc() generally uses mmap() which is a syscall. It also can use brk()/sbrk(). Some implementations use both.
There is also kmalloc() which is what's broken based on TFA. This is inside the kernel only and is not a syscall.
So you are right, malloc() is not a syscall, but it may or may not use a syscall when you call it. You can, of course, write better code yourself: statically allocate the big chunk of memory you need up front, then do your own sub-allocation out of that chunk. This way you are guaranteed not to cause the system any grief. Either your process starts, or it does not. After it starts, it is in complete control of its own memory.
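A minimal sketch of that kind of fixed arena (names are mine; the memset is there because, under Linux overcommit, even static BSS pages aren't actually backed until they're touched):

    #include <stddef.h>
    #include <string.h>

    #define ARENA_BYTES (8 * 1024 * 1024)

    static unsigned char arena[ARENA_BYTES];
    static size_t arena_used;

    /* Touch every page once so they are actually backed before we rely
     * on them (untouched BSS pages aren't committed yet). */
    static void arena_init(void)
    {
        memset(arena, 0, sizeof arena);
    }

    /* Hand out 16-byte-aligned chunks; returns NULL once the budget chosen
     * at build time is exhausted.  Nothing is ever freed. */
    static void *arena_alloc(size_t n)
    {
        size_t aligned = (arena_used + 15) & ~(size_t)15;
        if (n > ARENA_BYTES - aligned)
            return NULL;
        arena_used = aligned + n;
        return arena + aligned;
    }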
First of all, malloc implementations that rely on brk are simply broken. The whole idea of a process-wide data segment is broken and leads not only to awkwardness around thread safety, but also to the inability to decommit memory when it's no longer needed. Decent malloc implementations use mmap.
Second, the whole idea that a page isn't really "allocated" until you touch it is just stupid, lazy, and terrible; it doesn't exist on sane operating systems like NT. On decent systems, mmap (or its equivalent) sets aside committed memory and malloc carves out pieces from that block.
It's frustrating to see people who grew up with Linux not only fail to understand how terrible and broken Linux's memory management is, but also not realize that it could be any other way.
NT doesn't have fork. Supporting that requires copy-on-write semantics. And you don't want to stop fork just because you don't have enough memory for a second complete copy of the current process; you might be about to throw all that away with an exec, and even if not, you're likely to keep most of the shared memory shared.
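The pattern being defended is the classic fork-then-exec below: the child "copies" a potentially huge address space only on paper, writes to almost none of it, and immediately replaces it. Strict commit accounting would have to charge (or refuse) the full copy up front. A minimal sketch:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Imagine gigabytes of heap are mapped at this point.  fork() only
         * copies page tables; the pages stay shared, copy-on-write. */
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: never writes to the inherited heap, so nothing is
             * actually copied before the whole address space is replaced. */
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);   /* only reached if exec failed */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }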
Well, first off, NT is... different. Not sure if it's sane or superior or whatever.
Second, you may be right that this behavior causes problems, but it clearly has advantages as well. You advocate copy-on-write semantics elsewhere. Well this behavior is basically that.
Third, dlmalloc(), which is pretty widely used, uses sbrk().
Fourth, I sense a whole lot of frustration targeted at "these damn Linux kids". Sorry you feel that way. On the plus side, it's Friday, and this particular Linux kid would gladly buy you a beer to help cheer you up.
(Technically, I grew up with NetBSD, not Linux. Wrote a toy allocator for it at one point, and am trying to work on one that works for Linux as well.)
That's not really true. Not all dynamic allocation schemes rely on mmap; some use the brk and sbrk system calls.
"In the original Unix system, brk and sbrk were the only ways in which applications could acquire additional data space; later versions allowed this to also be done using the mmap call." -https://en.wikipedia.org/wiki/Sbrk
Also, mmap itself is a system call; it was usually just called prior to the failing page allocation, not at the time of the allocation failure.
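For completeness, a tiny sketch of the brk()-style path the quote describes: sbrk() moves the end of the data segment, and really is a system call each time it grows the segment (glibc treats it as legacy, but it still works):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Current end of the data segment (the "program break"). */
        void *before = sbrk(0);

        /* Grow the data segment by one page -- a real system call, unlike a
         * malloc() satisfied from memory the allocator already holds. */
        if (sbrk(4096) == (void *)-1) {
            perror("sbrk");
            return 1;
        }

        void *after = sbrk(0);
        printf("break moved from %p to %p\n", before, after);
        return 0;
    }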