The “too small to fail” memory-allocation rule

willvarfar · on Jan 23, 2015

Somewhat related, the classic "Respite from the OOM killer" by Andries Brouwer:

An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

http://lwn.net/Articles/104179/

goodwilly · on Jan 23, 2015

Got a chuckle out of me, but if you equate processes to human lives you get an infinite amount of absurdity anyway, making this a rather meaningless analogy.

willvarfar · on Jan 23, 2015

Do you think that as Linux gets used in medical, military and home contexts perhaps lives may really be at risk?

stingraycharles · on Jan 23, 2015

Then you enter the realm about hardened / realtime systems, where guarantees about eg. execution time are required. Those usually require a different aproach to kernel development anyway -- I do know there is something like RT Linux, but I have no idea how big it is.

justincormack · on Jan 23, 2015

Real time Linux is lacking funding, and is not being developed much any more [1]

[1] https://lwn.net/Articles/617140/

TickleSteve · on Jan 23, 2015

Although there are versions of Linux that are hard-realtime, the approach they take is to have a proper RTOS "underneath" Linux and effectively run Linux as a non realtime process.

Linux as a whole is far too large and dynamic for deterministic response.

Linux does get used a lot for soft realtime applications tho, especially in the military world.

xorcist · on Jan 23, 2015

Not necessarily. Most systems within the medical context does not have any realtime requirements, after all. A watchdog restart may be only a nuisance.

It would be nice of embedded systems engineers thought about things such as memory overcommit, but in real life they are as most of us: Stuck with old code bases that someone built right before leaving.

drzaiusapelord · on Jan 23, 2015

You probably wouldn't, and shouldn't, use Linux for RTOS duties. This is why we see Linux overwhelmingly on mobile phones, servers, etc and not powering your pacemaker, spacecraft, or nuclear plant. NASA uses VxWorks for a reason.

http://en.wikipedia.org/wiki/Comparison_of_embedded_computer...

sp332 · on Jan 23, 2015

It's called hyperbole. The point isn't that processes are like human lives, it's to point out that killing processes is the wrong strategy.

erlkonig · on Jan 23, 2015

Enabling overcommit machine-wide is a puerile, broken approach that not only converts your server to an unreliable toy, but encourages other idiots to rely on the same broken behavior in their libraries, language implementations, and so forth, basically leading the current plethora of collection libraries that don't even bother to monitor their own memory use or check malloc's return. It is software engineering plague, a rot on the underbelly of allegedly-solid code. oomkiller's unpredictability causes any number of problems in actual production environments, usually by killing the wrong process, and secretly ripping the stability out of programs whose code does check malloc's return. The answer is:

{ echo 'vm.overcommit_memory = 2' ; echo 'vm.overcommit_ratio = 100' ; } >/etc/sysctl.d/10-no-overcommit.conf

Which restores classical semantics and allow processes to identify memory allocation failures and respond to them responsibly in a number of ways (garbage collect being an obvious one, clean, safe exits after logging being another).

Now, if we could say that a specific process was allowed to overcommit because we could guarantee it would use the bogus memory allocation, then we'd have something vaguely useful.

joosters · on Jan 23, 2015

And after following this advice, you end up with a system that can fail to fork() even when half of the computer's memory is free. This can also turn your once-working server into an unreliable toy.

(Also, see the comments in the original article that talk about vm.overcommit_memory=2 not actually doing what it claims to do...)

angersock · on Jan 23, 2015

I was under the impression that fork() basically set up copy-on-write pages for the child process, and thus if it exec()'ed it would lose all those pages anyways and start fresh. Are those VM settings changing that behavior?

dezgeg · on Jan 23, 2015

It wouldn't affect the copy-on-write optimization or exec(), it just causes fork() to return an error earlier. 'vm.overcommit_memory = 2' attempts to guarantee that after a fork, all of the copy-on-write-pages can still be dirtied (i.e. causing a copy) without the system running out of RAM.

quotemstr · on Jan 23, 2015

Yes, but they _can_ vfork, and you should be using vfork anyway when all you're going to do after fork is exec.

joosters · on Jan 23, 2015

Good luck making sure that all your running programs use vfork(). And why would they? If you read the man page, it is hardly encouraging:

It is rather unfortunate that Linux revived this specter from the past. The BSD man page states: "This system call will be eliminated when proper system sharing mechanisms are implemented. Users should not depend on the memory sharing semantics of vfork() as it will, in that case, be made synonymous to fork(2)."

So no-one has been encouraged to use vfork()

benmmurphy · on Jan 23, 2015

some apps also use fork and the copy on write behaviour deliberately to implement features. redis uses it to create its backup file (http://redis.io/topics/faq)

>> Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits.

IgorPartola · on Jan 23, 2015

Isn't vfork() 11 years deprecated?

quotemstr · on Jan 23, 2015

Deprecated my ass. It's stable and supported ABI.

vezzy-fnord · on Jan 23, 2015

It may not be deprecated, but it's a rather unsafe interface: http://ewontfix.com/7/

quotemstr · on Jan 23, 2015

It's possible to use safely, and the benefits are worth it --- no commit charge problems and no time spent copying page tables. (Even in the memory itself is copy-on-write, you still have to set up the child's address space.)

I'm sick of people cargo-culting ideas like "vfork is bad" without really understanding the issues.

rwmj · on Jan 23, 2015

If you run a mission critical server on Linux, then you need to engage your brain and understand the requirements of your workload. That's what system administrators are paid to do.

The question here is what are suitable defaults for non-critical desktop uses of Linux, given that there will always be limits on the amount of RAM we can put into a machine and/or badly behaving processes written by CADTs.

chongli · on Jan 23, 2015

Enabling overcommit machine-wide is a puerile, broken approach that not only converts your server to an unreliable toy

Why should the reliability of your server matter (beyond a certain point)? For years Erlang developers have been following the "let it crash and a supervisor will restart it" model. They seem to have the uptime numbers to back them up.

jimktrains2 · on Jan 23, 2015

Because erlang is also built around that concept. Often when a server fails it needs to be brought up by hand and then wait until it can be brought back into rotation (depending on how it's used &c).

Additionally, some applications, like a database often only run on a single host (sure you have hot spares, but fail over is often manual and recovering is defiantly manual).

So while I get your point, we're not going to throw out everything we have simply because it wasn't built around "let it fail".

fit2rule · on Jan 23, 2015

Its not exactly true that these error-recovery paths are untested - in the context of the broader collective it can be said that there is no certainty.

But the Linux kernel has been used in countless industries requiring precisely that level of testing. I myself have been involved in SIL-4 certification of embedded Linux kernels for the transportation industry, and we ran into this memory-alloc issue years ago; its been quite widely understood already, and accommodated by the extremely rigorous testing thats required to get the Linux kernel in use in places where human lives are on the line.

So what I would suggest anyone working on this issue do, is contact the folks who are using the Linux kernel in the SIL-4 context, and try to get support on releasing the tests that have been developed to exercise exactly this issue. Its not a new issue - all safety kernels have to be tested and certified (and have 100% code coverage completion) on the subject of out-of-memory conditions, and if this is not done there is no way that Linux can be used. Fact is, in 38+ countries around the world, the Linux kernel is keeping the trains on the rails already - the work has been done. Its maybe just not open/obvious to the LWN collective, as is often the case.

kryptiskt · on Jan 23, 2015

The path can still be lightly tested because every such system may have configured away the OOM killer with vm.overcommit_memory=2

derefr · on Jan 23, 2015

You mean, the subset of the linux kernel module set that was used in these projects has been tested. Presumably they didn't, say, test every hardware driver; that would require a lot of hardware :)

fit2rule · on Jan 23, 2015

I mean that the Linux kernel memory allocation behaviour was tested. Yes, drivers and modules - and userspace apps - all undergo their own testing, but to be clear I was referring to the memory allocation and management subsystem.

Of course there are other rules that factor in here too - in safety critical, you don't use malloc() much.

jakub_g · on Jan 23, 2015

> But it is worse than that: since small allocations do not fail, almost none of the thousands of error-recovery paths in the kernel now are ever exercised.

I've started noticing the similar thing with Firefox a year or two ago. Probably no one is heavily testing browser's behavior in low mem situation.

Basically in low memory conditions, things are going crazy. Apart from low responsiveness, there is stuff happening like very strange rendering artifacts and occasional browser cache corruption.

The manifestation of the last one was pretty funny once for me, I started a chess-like game (figures were rendered as PNG images) and the computer had multiple kings and rooks ;) Took me a while to figure out the issue was on browser's my side.

quotemstr · on Jan 23, 2015

> in low memory conditions

That's why Firefox is switching to "infallible memory allocation" [1]. It's okay for Firefox to do that because Firefox is a top-level application, not a damn OS kernel.

[1] https://developer.mozilla.org/en-US/docs/Infallible_memory_a...

mccr8 · on Jan 23, 2015

"new" is already infallible by default in Firefox. Most places outside of the JS engine are using infallible allocation, and have been for a number of years.

The basic problem is that error handling for tiny allocations is unlikely to be ever tested, and thus have potentially critical security bugs. This is not merely a theoretical concern: one of the Pwn2Own exploits for Firefox last year relied on an error in some OOM handling code, resulting in remote code execution.

Instead, Firefox just crashes if an allocation fails (in most places), and Mozilla gets a crash report. If a particular allocation fails enough, usually due to being a large allocation, then it will show up in our crash statistics, and that particular location can be made fallible. The error handling code is thus being run at least sometimes, making it a little safer.

IgorPartola · on Jan 23, 2015

I know! In this case the OOM killer should kill the process that requested the XFS operation in the first place! To avoid deadlocks it should just KILL it not TERM it. I don't see any problems with that solution :).

In all seriousness, wow. This is the type of thing really must hurt. It'll be interesting to see which path they choose.

wereHamster · on Jan 23, 2015

KILL doens't necessarily end the process. It may be stuck in uninterruptible state (TASK_UNINTERRUPTIBLE).

IgorPartola · on Jan 23, 2015

Yup, when it is waiting on the kernel, but in this case we know the entire context so the kernel can actually detect this and clean up properly.

jkot · on Jan 23, 2015

Interesting. I had similar problem with recursive memory allocation while working on database engine. Solution was relatively simple, reorder method calls inside alocator, so that memory is allocated BEFORE cleanup progresses.

I think Linux memory allocator devs could keep small preallocated buffer, return allocated space, and schedule independent maintenance after buffer gets low.

dezgeg · on Jan 23, 2015

> I think Linux memory allocator devs could keep small preallocated buffer, return allocated space, and schedule independent maintenance after buffer gets low.

It's already a possibilty. The gfp flags argument to kmalloc() may have the __GFP_HIGH flag set: "This allocation has high priority and may use emergency pools". However, it's use is generally discouraged and can really be only used in extremely specific situations.

angersock · on Jan 23, 2015

What is the BSD answer to the OOM killer? Doesn't have one, right?

JoshTriplett · on Jan 23, 2015

As far as I know it doesn't.

The Linux OOM killer came about based on the assumption that most processes, when faced with a NULL return from malloc, will unceremoniously exit. Thus, running out of memory would effectively randomly kill whatever process next attempted to make an allocation, and suddenly the system would have a bit more memory available. So rather than kill a random process, the OOM killer tries to kill a process actually responsible for using a huge amount of memory.

Not always a sensible plan, and possibly not even something that should be on by default, but understandable.

IgorPartola · on Jan 23, 2015

The OOM actually has three modes:

1. Rank processes by "badness" and kill the baddest one. This is not just based on memory usage, but is a fairly complicated and expensive algorithm.

2. Kill the process that caused the request that failed.

3. Do a kernel panic. This way if your server runs only a single process you care about, it'll just get rebooted, rather than killing the only process you actually want to keep running.

Also, note that the article is talking about memory allocation inside the kernel, and not malloc() which is used in userland. It's one thing when your random Postgres process is out of RAM, it's quite another when the kernel is.

rcxdude · on Jan 23, 2015

>1. Rank processes by "badness" and kill the baddest one. This is not just based on memory usage, but is a fairly complicated and expensive algorithm.

AFAIK the algorithm has been simplified greatly: it's now just amount of memory weighted by the oom_score which can be set from userspace.

nothrabannosir · on Jan 23, 2015

But what we need is a 4th option: a new SIGNOMEM that gives the process some time to release memory, or shut down cleanly if it can't.

dezgeg · on Jan 23, 2015

There is actually a possibility for Linux processes to do exactly that if they use the vmpressure_fd() (http://lwn.net/Articles/524742/ ).

EDIT: Now that I look closer, it's possible that vmpressure_fd() was never merged into mainline and therefore is not really available.

joosters · on Jan 23, 2015

Good luck getting programs to process SIGNOMEM without allocating any more memory!

Besides, if programs are using memory that could be discarded, it would be better if the kernel knew this in the first place and could therefore do the MM on their behalf.

derefr · on Jan 23, 2015

You could have the system reserve a little tiny bit of extra memory (somewhat similar to the "disk space reserved for root" in ext2) that it would only ever hand out in response to malloc() attempts inside a SIGNOMEM handler.

You're right, though; metadata on the allocations themselves marking them as release-on-memory-pressure would be much better.

IgorPartola · on Jan 23, 2015

No it wouldn't. If a program has memory that it doesn't need and can live without it ought to just release it. Otherwise that is called a memory leak. I believe this was tried in early Android releases and was disastrous. Basically during an out of memory condition the kernel would ask every process "hey, can you release some memory?" And each process would reply with "nope, I need it all!"

I don't see how the metadata is any different. What happens when the kernel yanks a page of memory from über a running process?

gizmo686 · on Jan 23, 2015

Programs often make time vs memory tradeoffs. In some cases, it is even possible for them to adjust these tradeoffs during runtime.

The most common example is is file-system caches. Pretending (for the sake of argument) that the kernel did not automatically cache the file-system, a program may reasonably make this optimization [1]. In this case, the program can easily release some memory by clearing its cache.

You could also be running a program with garbage collection that normally wouldn't bother doing a full sweep until it hit some memory usage threshold; again, it can do this on request.

I'm sure that people can come up with other examples.

[1] In fact, a program can request that the kernel not cache its file-system requests, in which case it would have actual reason to cache what it thinks it might need again.

IgorPartola · on Jan 23, 2015

Of course. But in practice here's what's going to happen. The good apps, say Postgres, will implement the ability to do this. They will create caches, and release them upon request from the OS. This will significantly hinder the performance of Postgres because it will keep losing its caches. Note that this is worse than just emptying the cache, you actually lose the allocation and have to start over. In some cases you'll effectively disable the cache, which is there for a reason!

Now, here comes, say, MongoDB, which says "Yeah, I have caches, and I need them all! I won't release any allocations because I have to have them." Let's say, for whatever reason you run both Postgres and MongoDB on the same box and it's running out of RAM. Now you are punishing Postgres, the good citizen, and rewarding MongoDB, the bad citizen, and only because Postgres bothered to implement the ability to give up its caches.

I cannot find the reference to this, but I believe this was tried in early Android, and the consequences were that it became unusable as soon as you installed at least one memory hog app that never gave up any memory.

jbert · on Jan 23, 2015

> No it wouldn't. If a program has memory that it doesn't need and can live without it ought to just release it. Otherwise that is called a memory leak.

I disagree. A process could usefully cache the results of computation in RAM (e.g. a large lookup table). This RAM is useful to the process (will increase performance) but if it is discarded due to a spike in memory pressure it could be rebuilt.

> What happens when the kernel yanks a page of memory from über a running process?

There is a madvise() flag for this, MADV_DONTNEED - http://man7.org/linux/man-pages/man2/madvise.2.html

You get zero pages. Which is fine - you can detect that with a canary value and then you know you need to rebuild your cache.

dezgeg · on Jan 23, 2015

> [...] that it would only ever hand out in response to malloc() attempts inside a SIGNOMEM handler.

POSIX doesn't allow malloc() to be called inside a signal handler at all. Also, when free():ing memory that is allocated with malloc(), there is absolutely no guarantee that the memory is ever returned back to the OS (i.e. via munmap() or sbrk()), due to possible memory fragmentation.

In other words, it's just wouldn't be practically possible to create an application that uses standard memory allocation functions and reliably can free some memory back to the kernel.

jbert · on Jan 23, 2015

madvise() MADV_DONTNEED allows this - http://man7.org/linux/man-pages/man2/madvise.2.html

It's also a flag in posix_madvise(), so the concept is reasonably portable.

dezgeg · on Jan 23, 2015

malloc() and free() is what I meant by "standard memory allocation functions". I'd say program that is in the position to use madvise() effectively has implemented it's own heap allocator.

Also, MADV_DONTNEED is only usable in some specific situations, like caches. I don't see how it could be used to implement things like "on low memory, trigger garbage collection and trim the heap to the smallest possible with munmap()".

jbert · on Jan 25, 2015

You said: "In other words, it's just wouldn't be practically possible to create an application that uses standard memory allocation functions and reliably can free some memory back to the kernel."

So I thought you'd be interested to know that you can do just this with the standard functions mmap() and madvise().

No, it's not a replacement for malloc/free, but it does have value to some applications for some use cases.

jbert · on Jan 23, 2015

> You're right, though; metadata on the allocations themselves marking them as release-on-memory-pressure would be much better.

http://man7.org/linux/man-pages/man2/madvise.2.html - MADV_DONTNEED allows you to provide such meta data now.

derefr · on Jan 24, 2015

Right, this is what I was thinking of when I said it; I was just being vague because I didn't remember the name of it :)

jzwinck · on Jan 23, 2015

1. This is the one where if you "less" a giant log file, your web server will get killed. Awesome.

2. This could be handled by returning NULL.

3. If your server only needs to run a single process, making that one process terminate would be much faster than rebooting the entire machine (commercial-grade servers can take ten minutes to POST).

ogoffart · on Jan 23, 2015

2. No. Because most allocations by the kernel are not a result of the program calling malloc. (They can be the result of a pagefault when the program tries to write on some memory)

quotemstr · on Jan 23, 2015

> They can be the result of a pagefault when the program tries to write on some memory

How do you think malloc works? There's no malloc system call.

IgorPartola · on Jan 23, 2015

There are a couple of different things going on here:

malloc() generally uses mmap() which is a syscall. It also can use brk()/sbrk(). Some implementations use both.

There is also kmalloc() which is what's broken based on TFA. This is inside the kernel only and is not a syscall.

So you are right, malloc() is not a syscall, but it may or may not use a syscall when you call it. You can, of course, write better code yourself: statically allocate the memory you need, then use malloc() on that big memory chunk. This way you are guaranteed not to cause the system any grief. Either your process starts, or it does not. After it starts, it is in complete control of its own memory.

quotemstr · on Jan 23, 2015

First of all, malloc implementations that rely on brk are simply broken. The whole idea of a process-wide data segment is broken and leads not only to awkwardness around thread safety safety, also the inability to decommit memory when it's no longer needed. Decent malloc implementations use mmap.

Second, the whole idea that a page isn't really "allocated" until you touch it is just stupid, lazy, and terrible; it doesn't exist on sane operating systems like NT. On decent systems, mmap (or its equivalent) sets aside committed memory and malloc carves out pieces from that block.

It's frustrating to see people who grew up with Linux not only understand how terrible and broken Linux's memory management is, but not realizing that it could be any other way.

JoshTriplett · on Jan 23, 2015

NT doesn't have fork. Supporting that requires copy-on-write semantics. And you don't want to stop fork just because you don't have enough memory for a second complete copy of the current processes; you might be about to throw all that away with an exec, and even if not you're likely to keep most of the shared memory shared.

IgorPartola · on Jan 23, 2015

Well, first of NT is... different. Not sure if it's sane or superior or whatever.

Second, you may be right that this behavior causes problems, but it clearly has advantages as well. You advocate copy-on-write semantics elsewhere. Well this behavior is basically that.

Third, dlmalloc() which is pretty widely used, uses sbrk().

Forth, I sense a whole lot of frustration targeted at "these damn Linux kids". Sorry you feel that way. On the plus side, it's Friday, and this particular Linux kid would gladly buy you a beer to help cheer you up.

(Technically, I grew up with NetBSD, not Linux. Wrote a toy allocator for it at one point, and am trying to work on one that works for Linux as well.)

voidlogic · on Jan 23, 2015

>There's no malloc system call.

Thats not really true. Not all dynamic allocation schemes rely on mmap, some use the brk and sbrk system calls.

"In the original Unix system, brk and sbrk were the only ways in which applications could acquire additional data space; later versions allowed this to also be done using the mmap call." -https://en.wikipedia.org/wiki/Sbrk

Also mmap itself is also a system call, it usually just was called prior to the failing page allocation, not at the time of allocation failure.

toast0 · on Jan 23, 2015

The FreeBSD OOM killer finds the biggest process (swap + maybe resident), and kills it. But in my experience, you often get a good amount of allocation failures before the OOM killer gets triggered.

angersock · on Jan 23, 2015

Cool. What does OpenBSD do?

feld · on Jan 23, 2015

I'm not certain that all the BSDs share a similar OOM killer approach. It would be interesting to see if they have diverged.

edit: after some quick googling the OOM killer code on FreeBSD is certainly different than other BSDs because of changes needed to better cooperate with ZFS. Also, EMC has been working on code in this area too:

https://wiki.freebsd.org/AvgPageoutAlgorithm

Memory classes here are well described: https://wiki.freebsd.org/Memory

Virtual memory system: https://www.freebsd.org/doc/en/books/arch-handbook/vm.html

Code: https://svnweb.freebsd.org/base/head/sys/vm/vm_pageout.c?vie...

Lots of good comments in there.

        /*
         * If we are critically low on one of RAM or swap and low on
         * the other, kill the largest process.  However, we avoid
         * doing this on the first pass in order to give ourselves a
         * chance to flush out dirty vnode-backed pages and to allow
         * active pages to be moved to the inactive queue and reclaimed.
         */

zqfm · on Jan 23, 2015

My first thought is the kernel should pre-allocate some space for running a recovery/cleanup/analysis process when malloc fails. Is anything like this done already? Can it defer to the user to decide what to do when that happens?

SolarNet · on Jan 23, 2015

Well on the first case there is, the OOM killer is preallocated, the memory deallocation algorithms are preallocated, etc.

On the second... not so much. It comes back to the same problem. How do you notify the user without using the memory subsystem again (allocating text, or graphics buffers). How do you differentiate a memory call used to notify the user from a normal one? It's the same problem the OOM killer was having.

iopq · on Jan 24, 2015

This thread is hilarious. That Ido guy keeps posting his do /once/ while (false); loops and ignoring everyone who tells him that's a horrible replacement for the goto.

raldi · on Jan 23, 2015

Couldn't the filesystem code release its locks before calling the OOM killer, then reacquire them?