The FAQ is hosted by the NOMMULinux project (nomen omen, a Linux distribution that does not use virtual memory; it was discussed on HN earlier today).
I found it to be a very nice summary of how an OS manages memory, one that is quite interesting even without that context; that's why I posted it. It is a little light on details, but in my opinion it shows memory management as a complete process, which many other materials fail to do (at least the ones I've seen).
Some of us disagree violently with that FAQ's discussion of the OOM killer and strict overcommit. The arguments aren't even coherent.
For example, I disable overcommit and run my machines without any swap whatsoever. (Though this doesn't resolve the problem entirely on Linux.) The way to deal with the reality of finite memory isn't to pretend it's infinite and then randomly kill processes when you hit a wall, nor to bake magic constants into user processes that try to guess the maximum number of logical tasks that can be handled for a given amount of system RAM (which you could never do correctly or consistently as a general matter for non-trivial purposes).
Rather, I would argue that overcommit encourages people to allocate huge amounts of swap in an attempt to avoid the OOM killer. And I would argue that overcommit and swap make it much easier to DoS a site, all things being equal. But certainly overcommit makes it more difficult to implement resilient software.
The solution is to fail allocation requests as soon as possible, and to fail them inline (not as signals). That way user processes can unwind and isolate the logical task attempting the allocation. Or, if that's too difficult or inconvenient, a process can simply choose to panic. Without overcommit a process can choose the policy most suitable for the manner in which it was implemented. Among other things, that creates actionable back-pressure bound to the logical request.
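As a minimal sketch of that inline-failure pattern (the handle_request function and the 1 MiB request size are hypothetical, and malloc only reliably returns NULL like this once overcommit is disabled):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-request handler: the allocation failure is reported
 * inline, so only this logical task fails; the process as a whole can
 * keep serving other requests. */
static int handle_request(size_t payload_len)
{
    char *buf = malloc(payload_len);
    if (buf == NULL) {
        /* Actionable back-pressure bound to the logical request:
         * unwind and let the caller reject or queue the work. */
        return -1;
    }
    memset(buf, 0, payload_len);   /* ... do the actual work ... */
    free(buf);
    return 0;
}

int main(void)
{
    if (handle_request(1 << 20) != 0)
        fprintf(stderr, "request rejected: out of memory\n");
    return 0;
}
```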
The point about ELF shared libraries is just plain wrong. A shared library is a file, which already has a backing store in the filesystem. You need neither overcommit nor swap space to map a shared library file into virtual address space and share it across multiple processes. Lots of operating systems do this without overcommitting physical RAM.
And because all modern systems have a unified buffer cache, you actually get better performance by keeping those pages backed by the regular file system. If you treat them like anonymous pages, you have to copy them back out to swap space under memory pressure, whereas if they remain file-backed you don't have to write them out at all.
Likewise, if a process wants to speculatively allocate a huge buffer that might well go unused, then rather than using malloc or anonymous mmap, it should create and map a temporary file. For a dozen different reasons this is preferable. The only possible negative is that it doesn't kowtow to broken software that didn't think about memory issues. But I don't want to install that software anyhow, because if the author couldn't put in the minimal effort needed to work on systems without overcommit, they're unlikely to have put much thought into buffer overflows, integer overflows, and many other correctness and security issues.
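For instance, a sketch of such a file-backed speculative buffer (the /tmp path and 1 GiB size are arbitrary choices for illustration):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1UL << 30;              /* 1 GiB, arbitrary */

    /* Create an unlinked temporary file to act as the backing store. */
    char path[] = "/tmp/bigbuf-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) { perror("mkstemp"); return 1; }
    unlink(path);                              /* vanishes on close */

    if (ftruncate(fd, (off_t)len) < 0) { perror("ftruncate"); return 1; }

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(buf, "hello");                      /* only touched pages use RAM */

    munmap(buf, len);
    close(fd);
    return 0;
}
```

Untouched pages of the mapping cost nothing, and under memory pressure dirty pages are written back to the file rather than to swap.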
Finally, because Linux has to obey traditional Unix resource limits, as well as cgroup limits, it has to maintain an accounting of allocated pages whether or not overcommit is enabled. That fact should be a strong hint that proper memory accounting is a security concern, not simply an annoyance that we can wave away.
I dislike overcommit as well (although I have made use of it by relying on lazy allocation), but I think it is very hard to implement unix fork semantics without it: on a fork the kernel would have to reserve memory for a copy of the whole process, even though the next thing the child is most likely to do is throw it away by calling exec.
Unless all software were to be changed to use vfork or posix_spawn, I think we are stuck with it.
I would love to be able to disable overcommit on a per-process basis though.
It's not "hard" to implement fork without overcommit. tl;dr: have a big swap file. Don't worry, that swap file won't actually be used.
Really, fork works fine in a strict accounting world. You can turn off overcommit now on any Linux system, and in practice nothing will go wrong. While it's true that fork requires that the system prepare for the possibility that the child might dirty all of its parent's memory, that's just a matter of book-keeping and involves no actual inefficiency. If you have enough swap, the kernel can use that swap as memory "reserves" and allow even huge processes to fork.
Big programs (of which there are few) really should all be modified to use vfork, though, which doesn't make COW mappings and so doesn't incur the commit charge hit.
You implement fork semantics identically. You just fail the fork if you can't commit the memory. You don't have to literally copy the pages, you just have to do the accounting.
Solaris does this strictly and will never overcommit, period. Linux does the accounting on most distributions in order to enforce resource limits. Until the most recent kernels the relevant code path was
Although, because the memory manager was written to assume overcommit, I believe the accounting checks are racy.
The most commonly cited reason for overcommit is to allow a process with a large amount of anonymous pages to fork+exec. But the solution to that has always been vfork+exec (which is how posix_spawn and system are implemented on most Unices, including Linux). That pattern never went away, and it's still slightly faster than a regular fork even on Linux.
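As a rough illustration, a minimal posix_spawn sketch; spawning echo is just a placeholder:

```c
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void)
{
    /* posix_spawn avoids duplicating the parent's address space, so a
     * large process can launch a child without a commit-charge hit. */
    char *argv[] = { "echo", "spawned without fork", NULL };
    pid_t pid;

    int err = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```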
And that reason has always seemed specious to me. Large processes like PostgreSQL and MySQL don't generally fork+exec subprocesses _after_ they've entered their steady state.
I think the only real-world benefit of overcommit was that, at least in the 1990s, it allowed some software written for beefy Unix machines to work on desktop Linux machines. Back then, it was de rigueur to speculatively allocate buffer space, and often (but not always) to allow a sysadmin to control, through a configuration file, how much buffer and cache space to preallocate.
But this is 2016: there is no such thing as a sysadmin in the traditional sense, that is, someone attending to a specific machine, manually reallocating resources to various processes in response to load and functional requirements. In 2016 (and for at least the past 15 years or so) we expect software to react dynamically to load. And in order to do that reliably and consistently you _need_ malloc, fork, et al. to fail properly.
Also, ironically, Linux has such a blazing fast virtual memory manager that it doesn't make sense anymore to preallocate buffer and cache space upfront. Combined with modern, enhanced virtual memory APIs, it's become an unnecessary contrivance.
`posix_spawn` looks interesting. I'm kind of scared of fork in a multi-threaded context; I see very little discussion about what happens (especially if you have many threads executing I/O).
Is there anywhere to go to read about the interactions of threads with syscalls like `fork`?
It is specified by POSIX and SUSv4 [1], but it is still hard to use safely. Only the forking thread will exist in the child process; assume that mutexes are in an unspecified state, including those you do not know about (for example, those inside libc's malloc).
Friends do not let friends fork in a multithreaded process. Prefork before spawning any threads if required.
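If you must fork from a threaded process anyway, the one broadly safe pattern is to call only async-signal-safe functions in the child and exec immediately. A sketch (assuming /bin/true exists):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* After fork() in a multithreaded process, another thread may have
 * held the malloc lock, so the child must avoid anything that might
 * allocate (no malloc, no stdio) and should exec right away. */
static pid_t spawn_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);            /* async-signal-safe, unlike exit() */
    }
    return pid;                /* -1 on failure */
}

int main(void)
{
    pid_t pid = spawn_child();
    if (pid < 0) { perror("fork"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}
```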
The POSIX specification is one of the best resources, especially
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html and
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09
You can download the HTML copy for free. It's useful to have the HTML frames version locally as you can grep it, and it's easier to jump around from the navigation frame.
The Linux manual pages are generally atrocious. The more recently updated pages literally copy+paste the POSIX standard, so it's better to just go straight to it. And because the POSIX standard is so well written, I think it's best to refer to it first and only look to other resources for particular questions left unanswered (or raised) by the standard. There's too much cargo-cult advice out there, often written by people who have at best a passing familiarity with the standard or the actual implementations.
As always, when in doubt refer to the actual source code. That should always be possible with Linux.
Forking from a threaded process is doable, and the standard and implementations are carefully written to make it possible and practical, though generally the only sensible thing to do is to exec another process. The posix_spawn implementation should do the right thing. But the posix_spawn API can be painful to use for non-trivial purposes, especially compared to the simplicity of fork+dup+close+exec. A good implementation to study is the one in musl libc. (glibc code is not for the faint of heart.)
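To give a feel for that trade-off, here is a sketch of the posix_spawn counterpart of fork+open+dup2+close+exec, redirecting the child's stdout via file actions (echo and out.txt are placeholders):

```c
#include <fcntl.h>
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void)
{
    /* File actions replace the open/dup2/close dance a child would
     * normally do between fork and exec. */
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "out.txt",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

    char *argv[] = { "echo", "redirected", NULL };
    pid_t pid;
    int err = posix_spawnp(&pid, "echo", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (err != 0) {
        fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```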
> Some of us disagree violently with that FAQ's discussion of the OOM killer and strict overcommit.
Agreed. Android, for example, has made major architectural changes due to not having swap. I wasn't around during the early days, and I'm by no means an expert, but it looks to me like the following decisions were made:
1. No swapping: flash at the time was very slow and limited in capacity, and it still has a limited write lifetime. Introduce an OOM killer. [0]
2. SysV semaphores and shared memory can leak when killing a process. [1][2] It looks like these have since been added to bionic (Android's libc). [3]
3. Introduce the Binder kernel driver for IPC instead. Binder handles reference counting and can do neat things like death notifications. [4]