
Linux memory overcommit (compared to airlines selling tickets) - acqq
http://opsmonkey.blogspot.com/2007/01/linux-memory-overcommit.html
======
sigil
Along these lines, I had a fun encounter with the Linux OOM killer the other
day.

The scenario was roughly this. Process A consumes 55% of available memory but
has stable memory usage and has been running for days. Process B has a memory
allocation bug and rapidly goes from low memory usage to consuming the
remaining 45%. Which process gets killed first on my system? Process A.

I'm not up on the latest OOM scoring heuristics (they keep changing), but if
it's only looking at instantaneous memory usage when things go critical, and
not the trend, that seems dumb. I'd be willing to bet memory leaks are an
especially common cause of OOM.

...then again, putting all these heuristics in the kernel with no tunables
seems dumber. Does anyone know if delegating to a userspace OOM killer is
possible or being contemplated?

~~~
rcxdude
It seems the latest kernel doesn't use the more complex heuristics it used to
(preserving longer-running tasks, for example) and just goes off the
proportion of memory used by the process, with a 3% bonus for the superuser.
However, it does also adjust the score by the value in /proc/pid/oom_score_adj
(which can completely disable or completely favour killing the process), so
you could probably write a userspace daemon which updates this value based on
whatever heuristics you want (although you'd have to do it before the OOM
condition).
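
As a rough sketch (in C; the policy deciding which pid gets which score is up
to you, and note that lowering a score typically needs CAP_SYS_RESOURCE or a
matching uid), the write itself is trivial:

    #include <stdio.h>
    #include <sys/types.h>

    /* Sketch: pin a process's OOM score from userspace.
       -1000 exempts it from the OOM killer entirely; +1000 makes it
       the preferred victim. */
    static int set_oom_score_adj(pid_t pid, int score)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);

        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;           /* no such pid, or no permission */

        int rc = (fprintf(f, "%d", score) < 0) ? -1 : 0;
        fclose(f);
        return rc;
    }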

------
amalcon
I liked the article. It was a simplified explanation, which is a good thing in
its own right: not everyone reading it cares about the gory details. I'll
provide some such details for the curious, though.

Technically, malloc() is not a system call; it is a library function that
calls either brk()/sbrk() or mmap() (or just uses memory it already "has").
Which one it calls depends mostly on the size of the request: glibc, for
instance, serves large allocations with a dedicated mmap() and small ones from
the brk() heap.

mmap() engages the VM subsystem, backing a chunk of address space with part of
a given file on disk (in this case with the MAP_ANONYMOUS flag, which skips
the file and just hands out memory). Of course this will never require actual
RAM until you go to use it: that's the whole point of the call. Normally
mmap() will not cause OOM issues, because a file-backed mapping is backed by
the file itself, and its pages can always be dropped and re-read from disk.
That's not true of an anonymous mapping, though: nothing stands behind it but
RAM (and swap), so it is precisely the kind of memory that gets overcommitted.
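
For the curious, a minimal sketch of that flavor of mmap() call, the kind
malloc() itself makes for a large request:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1 << 20;   /* 1 MiB */

        /* No file involved: MAP_ANONYMOUS hands back zero-filled,
           demand-paged memory. The call succeeds without touching RAM. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(p, 0x42, len);   /* physical pages are faulted in only here */
        munmap(p, len);
        return 0;
    }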

brk() and sbrk() are the more "normal" ways to get extra memory, by expanding
the heap. Fresh pages arrive zero-filled from the kernel; past that,
initializing the memory is the runtime's problem (or not, as it prefers). The
main reason not to do this all the time is that the heap itself must be
contiguous. Using mmap() lets malloc() return pages to the system under more
varied conditions and work around parts of the address space that contain
other things[1].
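
A tiny sketch of growing the heap by hand (never do this in a program that
also uses malloc(), since malloc() assumes it owns the break):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        void *old_end = sbrk(0);           /* current end of the heap */

        if (sbrk(4096) == (void *)-1) {    /* grow the heap by one page */
            perror("sbrk");
            return 1;
        }

        ((char *)old_end)[0] = 1;          /* the new page is usable at once */
        printf("heap now extends past %p\n", old_end);
        return 0;
    }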

Calls to mmap()/brk()/sbrk() are rather expensive compared to malloc() calls.
They are system calls, after all. That's why programs will normally ask for
more memory than they actually need at the moment: it reduces the amortized
runtime cost of malloc(). Linux tries to take advantage of this.

So, the strange Linux behavior is that if a process calls mmap()/brk()/sbrk()
(and doesn't hit the addressable limit) Linux will simply record that the
process has requested mapping of a particular region of memory. Then, later,
the VM subsystem hands out the memory when the process tries to do something
to it. The question is, what to do if there isn't any memory to hand out?

There's no way to cause an arbitrary memory access to fail with ENOMEM, so
Linux has this hacky workaround where it semi-arbitrarily selects and kills
something.
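
This is easy to see for yourself. A sketch (assumes a 64-bit build; keep the
figure below your RAM + swap total, since even the default heuristic refuses
single requests that are obviously impossible):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        /* GiB to request, from the command line (default 8). Pick
           something comfortably above your free RAM. */
        size_t gib = (argc > 1) ? strtoul(argv[1], NULL, 10) : 8;
        size_t len = gib << 30;

        char *p = malloc(len);
        if (p == NULL) {                  /* the case overcommit makes rare */
            puts("malloc failed");
            return 1;
        }

        puts("malloc succeeded; touching the pages...");
        memset(p, 1, len);                /* pages fault in here: if memory
                                             runs out, it's the OOM killer
                                             that responds, not an ENOMEM */
        puts("survived");
        return 0;
    }

With the default settings the malloc() returns instantly; the memset() is
where the machine starts to hurt.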

The historical way to avoid this was to have a huge amount of swap, hence
suggestions like "twice the amount of RAM in the system". The system would
become unusably slow before the OOM killer would engage, and the user would
manually kill some relatively unimportant processes.

[1]- In older versions of the Linux kernel, the program code would be stored
in memory about 1G away from the heap. So, as soon as you used about 1G, you'd
run into your code and malloc() would no longer be able to brk() for more
memory.

~~~
acqq
The post linked from the article is really nice:

<http://lwn.net/Articles/104185/>

"in rare occasions however the amount of fuel was insufficient, and the plane
would crash. This problem was solved by the engineers of the company by the
development of a special OOF (out-of-fuel) mechanism. In emergency cases a
passenger was selected and thrown out of the plane. (When necessary, the
procedure was repeated.)"

I always thought that mapping virtual memory should succeed only if there is
enough virtual memory available (i.e. if not in RAM, then in the page file).
Why is the overcommit concept better? Because malloc takes way too much VM
unnecessarily? Then why not just fix malloc?

~~~
amalcon
Well, malloc() is very much doing the right thing in a world where (virtual)
memory is plentiful. The performance of the program improves substantially,
without impacting memory usage that much.

I never understood why the system doesn't do the accounting strictly, even if
it waits to actually hand out individual pages. Then, it could give an error
on brk() or mmap() when the memory would be "overbooked". Memory "saved" this
way would be unusable to other processes, but it would be usable as buffer
cache, and fork() would still be fast.

edit: Another possible solution would be a special "out-of-memory" signal. The
program could trap that signal and do the right thing with it. The problem
with that is that it's difficult to program for out-of-memory conditions.
Something as simple as printf() won't work at all, and could cause a signal
loop depending on implementation. Given that, almost no programs would do it,
so there would be little benefit.

~~~
gsg
Why would you deny allocation requests larger than immediately available
physical memory when they could be serviced perfectly well by paging?

On modern systems malloc and friends manage address space, not physical
memory. Failure should only be reported if _that_ resource is exhausted.

------
billnapier
Overcommit was a design decision that seems wrong at first, but taken with a
wide viewpoint it makes some sense. The motivating reason for overcommit is to
make fork() cheap (the article even briefly touches on this). It's a trade-off
between the speed of spawning new processes (and threads) and the ease of
understanding how your system will behave in low-memory situations. I think
they made the right choice.

Another thing to note: the systems in the article were in trouble regardless
of overcommit; overcommit just hid the problem and made it surface in a
non-obvious place.

------
pc1234
So, if I'm reading this correctly, my usual null-checking of malloc() returns
is insufficient?

What the hell? How did this ever get put in?

As an application developer trying to write code that can run on machines I
have no control over, how is this reasonable? This seems like it pretty much
neuters my ability to accurately detect potential out-of-memory conditions
with a hackish workaround.

~~~
ajross
Your null-checking has always been insufficient. Nothing you can do in the
context of a single app can fix things anyway. Actually try it some time: hack
up a malloc() that "fails" beyond a certain heap size and start testing all
your "handling" code. I guarantee you it won't do what you think it does.

And this isn't about "malloc" anyway. It's about mmap. Mappings are
overcommitted because mappings are routinely undercommitted. How much stack do
you use for each of your threads? How much stack is mapped? If you don't know
the answers to these questions (or worse, don't know how to find out), then
you need to do a little more research before pontificating on kernel memory
allocation strategies.

Sorry, pet peeve. This is one of those areas where too many people think they
understand something that they simply don't.

~~~
pc1234
The thing that bugs me about this is that I'm pretty sure there is some way to
at least make a damned good effort at saving results out to disk or dying
gracefully even when malloc starts asploding.

Also, this means that I can screw up other processes for no good reason, even
though only my proc should get killed for going over quota.

"Mappings are overcommited because mappings are routinely undercommitted"

Could you please explain this again? I'm not quite sure I parsed that out
correctly. Thanks!

~~~
ajross
Saving results to disk requires disk buffers, which can require the kernel to
allocate, which won't work on an OOM system. This is how you get deadlocks:
process trying to "handle" OOM ends up trying to allocate, and the system
freezes. Which is why the OOM killer exists.

I should have said "underutilized" above. But the point is very real:
applications map stuff all over the place. It's quite rare that they actually
fault in those pages.

The stack thing is the obvious example: measure the stack usage of your
application (an easy trick is to subtract the address of a local variable
"deep" in the call tree from the address of one of the arguments to main, or
pthread_create, or wherever the top is). Now check the size of the mapping
that contains the stack, and tell me if you actually want to reserve all that
memory.
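
A sketch of that measurement (pointer subtraction across separate objects is
technically undefined, but it works fine as a diagnostic on the usual
flat-stack platforms where the stack grows downward):

    #include <stdio.h>

    static char *top;                  /* address near the top of the stack */

    static void deep(int n)
    {
        volatile char frame[512];      /* keeps each frame from being elided */
        frame[0] = (char)n;
        if (n > 0)
            deep(n - 1);
        else
            /* stack grows downward here, so the difference is positive */
            printf("~%ld bytes of stack used\n",
                   (long)(top - (char *)&frame[0]));
    }

    int main(int argc, char **argv)
    {
        (void)argv;
        top = (char *)&argc;           /* an argument of main: near the top */
        deep(100);
        return 0;
    }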

------
z2amiller
What a nice surprise to wake up and check HN to find an article I'd written
four years ago trending on the front page. :-)

~~~
acqq
Inspiration came reading the comments here:

http://blogs.msdn.com/b/oldnewthing/archive/2011/05/12/10163578.aspx

------
willvarfar
I've always felt this should be settable per heap.

I would happily set it in my programs explicitly.

~~~
zorked
You can disable overcommit globally (/proc/sys/vm/overcommit_memory: 0 is the
default heuristic, 1 always overcommits, 2 turns overcommit off).

At the application level, you can allocate as much memory as you want and then
write to it all. You are not guaranteed not to die of OOM while doing this,
but at least you will know right away that you will have enough memory in the
long run.
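
A sketch of that allocate-then-touch pattern (writing one byte per page is
enough to force each page to be faulted in):

    #include <stdlib.h>
    #include <unistd.h>

    /* Allocate everything up front, then touch one byte per page so the
       kernel commits the memory now, not at some random later moment. */
    static void *reserve(size_t len)
    {
        char *p = malloc(len);
        if (p == NULL)
            return NULL;

        long page = sysconf(_SC_PAGESIZE);
        for (size_t i = 0; i < len; i += (size_t)page)
            p[i] = 0;                  /* fault this page in */
        return p;
    }

Even then, as noted, the OOM killer can pick the process off later anyway;
mlock() is the stronger guarantee, but large locked regions need
RLIMIT_MEMLOCK headroom or privilege.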

The point of overcommitting is that it reduces the possibility of OOM errors.
I think people are just offended that the kernel is killing their process.
With overcommit turned off, the OOM would have happened sooner, and I doubt
most applications would have handled it any better than the kernel did. If you
think the kernel is wrong, or if you don't like this behavior, you can both
customize the OOM killer and turn off overcommitting.

Also, overcommit is by no means specific to Linux. FreeBSD, OS X, HP-UX and
so on all have it. I _think_ Solaris is the exception here. But if you want
the Solaris behavior, you can have it under Linux too.

