
The “too small to fail” memory-allocation rule - kakakiki
http://lwn.net/Articles/627419/
======
willvarfar
Somewhat related, the classic "Respite from the OOM killer" by Andries
Brouwer:

 _An aircraft company discovered that it was cheaper to fly its planes with
less fuel on board. The planes would be lighter and use less fuel and money
was saved. On rare occasions however the amount of fuel was insufficient, and
the plane would crash. This problem was solved by the engineers of the company
by the development of a special OOF (out-of-fuel) mechanism. In emergency
cases a passenger was selected and thrown out of the plane. (When necessary,
the procedure was repeated.) A large body of theory was developed and many
publications were devoted to the problem of properly selecting the victim to
be ejected. Should the victim be chosen at random? Or should one choose the
heaviest person? Or the oldest? Should passengers pay in order not to be
ejected, so that the victim would be the poorest on board? And if for example
the heaviest person was chosen, should there be a special exception in case
that was the pilot? Should first class passengers be exempted? Now that the
OOF mechanism existed, it would be activated every now and then, and eject
passengers even when there was no fuel shortage. The engineers are still
studying precisely how this malfunction is caused._

[http://lwn.net/Articles/104179/](http://lwn.net/Articles/104179/)

~~~
goodwilly
Got a chuckle out of me, but if you equate processes to human lives you get an
infinite amount of absurdity anyway, making this a rather meaningless analogy.

~~~
willvarfar
Do you think that as Linux gets used in medical, military and home contexts
perhaps lives may really be at risk?

~~~
stingraycharles
Then you enter the realm of hardened / realtime systems, where guarantees
about e.g. execution time are required. Those usually require a different
approach to kernel development anyway -- I do know there is something like RT
Linux, but I have no idea how big it is.

~~~
justincormack
Real time Linux is lacking funding, and is not being developed much any more
[1]

[1] [https://lwn.net/Articles/617140/](https://lwn.net/Articles/617140/)

------
erlkonig
Enabling overcommit machine-wide is a puerile, broken approach that not only
converts your server to an unreliable toy, but encourages other idiots to rely
on the same broken behavior in their libraries, language implementations, and
so forth, basically leading to the current plethora of collection libraries
that don't even bother to monitor their own memory use or check malloc's
return. It is a software-engineering plague, a rot on the underbelly of allegedly-solid
code. oomkiller's unpredictability causes any number of problems in actual
production environments, usually by killing the wrong process, and secretly
ripping the stability out of programs whose code _does_ check malloc's return.
The answer is:

{ echo 'vm.overcommit_memory = 2' ; echo 'vm.overcommit_ratio = 100' ; } \
    > /etc/sysctl.d/10-no-overcommit.conf

This restores classical semantics and allows processes to identify memory
allocation failures and respond to them responsibly in a number of ways
(garbage collection being an obvious one; clean, safe exits after logging
being another).

Now, if we could say that a _specific_ process was allowed to overcommit
because we could guarantee it would use the bogus memory allocation, then we'd
have something vaguely useful.

~~~
joosters
And after following this advice, you end up with a system that can fail to
fork() even when half of the computer's memory is free. This can also turn
your once-working server into an unreliable toy.

(Also, see the comments in the original article that talk about
vm.overcommit_memory=2 not actually doing what it claims to do...)

~~~
quotemstr
Yes, but they _can_ vfork, and you should be using vfork anyway when all
you're going to do after fork is exec.

~~~
IgorPartola
Hasn't vfork() been deprecated for 11 years?

~~~
quotemstr
Deprecated my ass. It's a stable and supported ABI.

~~~
vezzy-fnord
It may not be deprecated, but it's a rather unsafe interface:
[http://ewontfix.com/7/](http://ewontfix.com/7/)

~~~
quotemstr
It's possible to use safely, and the benefits are worth it --- no commit-
charge problems and no time spent copying page tables. (Even if the memory
itself is copy-on-write, you still have to set up the child's address space.)

I'm sick of people cargo-culting ideas like "vfork is bad" without really
understanding the issues.

------
fit2rule
It's not exactly true that these error-recovery paths are untested - it's
only in the context of the broader collective that no certainty can be
claimed.

But the Linux kernel has been used in countless industries requiring
_precisely_ that level of testing. I myself have been involved in SIL-4
certification of embedded Linux kernels for the transportation industry, and
we ran into this memory-alloc issue years ago; it's been quite widely
understood already, and accommodated by the extremely rigorous testing that's
required to get the Linux kernel in use in places where human lives are on the
line.

So what I would suggest anyone working on this issue do, is contact the folks
who are using the Linux kernel in the SIL-4 context, and try to get support on
releasing the tests that have been developed to exercise exactly this issue.
It's not a new issue - all safety kernels have to be tested and certified (and
have 100% code coverage completion) on the subject of out-of-memory
conditions, and if this is not done there is no way that Linux can be used.
Fact is, in 38+ countries around the world, the Linux kernel is keeping the
trains on the rails _already_ - the work has been done. It's maybe just not
open/obvious to the LWN collective, as is often the case.

~~~
derefr
You mean, the subset of the Linux kernel module set that was used in these
projects has been tested. Presumably they didn't, say, test every hardware
driver; that would require a lot of hardware :)

~~~
fit2rule
I mean that the Linux kernel memory allocation behaviour was tested. Yes,
drivers and modules - and userspace apps - all undergo their own testing, but
to be clear I was referring to the memory allocation and management subsystem.

Of course there are other rules that factor in here too - in safety-critical
work, you don't use malloc() much.

------
jakub_g
> But it is worse than that: since small allocations do not fail, almost none
> of the thousands of error-recovery paths in the kernel now are ever
> exercised.

I started noticing a similar thing with Firefox a year or two ago. Probably
no one is heavily testing the browser's behavior in low-memory situations.

Basically, in low-memory conditions things go crazy. Apart from low
responsiveness, there is stuff happening like very strange rendering artifacts
and occasional browser-cache corruption.

The manifestation of the last one was pretty funny for me once: I started a
chess-like game (figures were rendered as PNG images) and the computer had
multiple kings and rooks ;) Took me a while to figure out the issue was on
the browser's side, not mine.

~~~
quotemstr
> in low memory conditions

That's why Firefox is switching to "infallible memory allocation" [1]. It's
okay for Firefox to do that because Firefox is a top-level application, not a
damn OS kernel.

[1] [https://developer.mozilla.org/en-US/docs/Infallible_memory_allocation](https://developer.mozilla.org/en-US/docs/Infallible_memory_allocation)

~~~
mccr8
"new" is already infallible by default in Firefox. Most places outside of the
JS engine are using infallible allocation, and have been for a number of
years.

The basic problem is that error handling for tiny allocations is unlikely
ever to be tested, and thus may have potentially critical security bugs. This is not
merely a theoretical concern: one of the Pwn2Own exploits for Firefox last
year relied on an error in some OOM handling code, resulting in remote code
execution.

Instead, Firefox just crashes if an allocation fails (in most places), and
Mozilla gets a crash report. If a particular allocation fails enough, usually
due to being a large allocation, then it will show up in our crash statistics,
and that particular location can be made fallible. The error handling code is
thus being run at least sometimes, making it a little safer.

------
IgorPartola
I know! In this case the OOM killer should kill the process that requested the
XFS operation in the first place! To avoid deadlocks it should just KILL it
not TERM it. I don't see any problems with that solution :).

In all seriousness, wow. This is the type of thing that really must hurt. It'll be
interesting to see which path they choose.

~~~
wereHamster
KILL doesn't necessarily end the process. It may be stuck in an
uninterruptible state (TASK_UNINTERRUPTIBLE).

~~~
IgorPartola
Yup, when it is waiting on the kernel, but in this case we know the entire
context so the kernel can actually detect this and clean up properly.

------
jkot
Interesting. I had a similar problem with recursive memory allocation while
working on a database engine. The solution was relatively simple: reorder
method calls inside the allocator so that memory is allocated BEFORE cleanup
progresses.

I think Linux memory allocator devs could keep small preallocated buffer,
return allocated space, and schedule independent maintenance after buffer gets
low.

~~~
dezgeg
> I think Linux memory allocator devs could keep small preallocated buffer,
> return allocated space, and schedule independent maintenance after buffer
> gets low.

It's already a possibility. The gfp flags argument to kmalloc() may have the
__GFP_HIGH flag set: "This allocation has high priority and may use emergency
pools". However, its use is generally discouraged; it can really only be used
in extremely specific situations.

------
angersock
What is the BSD answer to the OOM killer? Doesn't have one, right?

~~~
JoshTriplett
As far as I know it doesn't.

The Linux OOM killer came about based on the assumption that most processes,
when faced with a NULL return from malloc, will unceremoniously exit. Thus,
running out of memory would effectively randomly kill whatever process next
attempted to make an allocation, and suddenly the system would have a bit more
memory available. So rather than kill a _random_ process, the OOM killer tries
to kill a process actually responsible for using a huge amount of memory.

Not always a sensible plan, and possibly not even something that should be on
by default, but understandable.

~~~
IgorPartola
The OOM actually has three modes:

1. Rank processes by "badness" and kill the baddest one. This is not just
based on memory usage, but is a fairly complicated and expensive algorithm.

2. Kill the process that caused the request that failed.

3. Do a kernel panic. This way, if your server runs only a single process you
care about, it'll just get rebooted, rather than killing the only process you
actually want to keep running.

Also, note that the article is talking about memory allocation inside the
kernel, and not malloc() which is used in userland. It's one thing when your
random Postgres process is out of RAM, it's quite another when the kernel is.
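For reference, modes 2 and 3 map onto real sysctls; mode 1 (badness ranking) is the default when both are off. A sketch of the knobs (the filename is illustrative):

```
# /etc/sysctl.d/oom-behaviour.conf
# Mode 2: kill the task whose allocation triggered the OOM condition:
vm.oom_kill_allocating_task = 1
# Mode 3: panic on OOM instead, and reboot shortly afterwards:
vm.panic_on_oom = 1
kernel.panic = 10
```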

~~~
nothrabannosir
But what we need is a 4th option: a new SIGNOMEM that gives the process some
time to release memory, or shut down cleanly if it can't.

~~~
joosters
Good luck getting programs to process SIGNOMEM without allocating any more
memory!

Besides, if programs are using memory that could be discarded, it would be
better if the kernel knew this in the first place and could therefore do the
MM on their behalf.

~~~
derefr
You could have the system reserve a little tiny bit of extra memory (somewhat
similar to the "disk space reserved for root" in ext2) that it would only ever
hand out in response to malloc() attempts inside a SIGNOMEM handler.

You're right, though; metadata on the allocations themselves marking them as
release-on-memory-pressure would be much better.

~~~
IgorPartola
No it wouldn't. If a program has memory that it doesn't need and can live
without, it ought to just release it. Otherwise that is called a memory leak.
I believe this was tried in early Android releases and was disastrous:
basically, during an out-of-memory condition the kernel would ask every
process "hey, can you release some memory?" and each process would reply with
"nope, I need it all!"

I don't see how the metadata is any different. What happens when the kernel
yanks a page of memory from under a running process?

~~~
gizmo686
Programs often make time vs memory tradeoffs. In some cases, it is even
possible for them to adjust these tradeoffs during runtime.

The most common example is file-system caches. Pretending (for the sake of
argument) that the kernel did not automatically cache the file-system, a
program may reasonably make this optimization [1]. In this case, the program
can easily release some memory by clearing its cache.

You could also be running a program with garbage collection that normally
wouldn't bother doing a full sweep until it hit some memory usage threshold;
again, it can do this on request.

I'm sure that people can come up with other examples.

[1] In fact, a program can request that the kernel not cache its file-system
requests, in which case it would have actual reason to cache what it thinks it
might need again.

~~~
IgorPartola
Of course. But in practice here's what's going to happen. The good apps, say
Postgres, will implement the ability to do this. They will create caches, and
release them upon request from the OS. This will significantly hinder the
performance of Postgres because it will keep losing its caches. Note that this
is worse than just emptying the cache: you actually lose the allocation and
have to start over. In some cases you'll effectively disable the cache, which
is there for a reason!

Now, here comes, say, MongoDB, which says "Yeah, I have caches, and I need
them all! I won't release any allocations because I have to have them." Let's
say, for whatever reason you run both Postgres and MongoDB on the same box and
it's running out of RAM. Now you are punishing Postgres, the good citizen, and
rewarding MongoDB, the bad citizen, and only because Postgres bothered to
implement the ability to give up its caches.

I cannot find the reference to this, but I believe this was tried in early
Android, and the consequences were that it became unusable as soon as you
installed at least one memory hog app that never gave up any memory.

------
zqfm
My first thought is the kernel should pre-allocate some space for running a
recovery/cleanup/analysis process when malloc fails. Is anything like this
done already? Can it defer to the user to decide what to do when that happens?

~~~
SolarNet
Well, for the first case there is: the OOM killer is preallocated, the
memory-deallocation algorithms are preallocated, etc.

On the second... not so much. It comes back to the same problem. How do you
notify the user without using the memory subsystem again (allocating text, or
graphics buffers)? How do you differentiate a memory call used to notify the
user from a normal one? It's the same problem the OOM killer was having.

------
iopq
This thread is hilarious. That Ido guy keeps posting his do { _once_ } while
(false); loops and ignoring everyone who tells him that's a horrible
replacement for the goto.

------
raldi
Couldn't the filesystem code release its locks before calling the OOM killer,
then reacquire them?

