+1 on the date, but generally relevant to most modern x86/x86-64 hardware IMO.
(The rest of this comment is largely my own inflated opinion.)
From a developer's perspective, whether it's _useful_ when operating at a certain level of abstraction is another question. I really do think it's good to know this stuff, but if you're writing, say, Ruby or Python web apps you probably don't _need_ to know it.
For low-latency/high-throughput/systems-level work it starts to become more relevant. That said, there have certainly been operational scenarios where knowing some of this stuff can help narrow down misconfigured systems and/or squeeze more life out of existing hardware (e.g. configuring a JVM to use huge pages).
Another benefit of huge pages: Some database systems use a model where client processes operate on a massive pool of shared memory (typically a buffer cache). This allows the system to leverage the OS process model for isolation, privileges, etc. Huge pages greatly reduce the overhead of new client processes lazily mapping that pool into their address space.
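Roughly what that looks like, as a hypothetical sketch (Linux-specific, assuming huge pages have been reserved via vm.nr_hugepages; the pool size and everything else here is illustrative, not lifted from any particular database):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define POOL_SIZE (1UL << 30)   /* 1 GiB buffer pool, for illustration */

    int main(void) {
        /* MAP_SHARED so client processes forked later all see one pool;
         * MAP_HUGETLB asks for 2 MiB pages, so the pool costs far fewer
         * page-table entries per client process than 4 KiB pages would. */
        void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (pool == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* e.g. no huge pages reserved */
            return 1;
        }
        /* ... fork client processes that inherit the mapping ... */
        munmap(pool, POOL_SIZE);
        return 0;
    }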
The FAQ's tone of dismissive superiority with respect to overcommit is very off-putting. The author suggests that only people who don't know how virtual memory works can oppose overcommit; it's curious, then, that the author reveals some virtual-memory knowledge gaps of his own. Specifically, the difference between address space reservation (which does not require backing storage) and commit charge (which does) seems lost on him.
Yes, you need backing storage to handle the possibility that a process might actually make copies of all COW pages (that's what we mean by "commit charge": the number of pages the kernel has voluntarily promised to deliver at some point). The right way to reduce the pressure on commit that these mappings cause is to not have so many of them in the first place. There's no reason PIC shared libraries need to be PROT_WRITE; there's no reason big programs like web browsers need to use fork(2) instead of the VM-efficient vfork(2). (Small programs can use fork and its COW commit charge without causing real problems, even on non-sloppy (err, "strict overcommit") systems.)
Linux's default overcommit policy is terrible in part because it discourages people from learning about the difference between reserving address space and actually reserving memory, and the "FAQ" just confuses the issue more.
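For anyone following along, here's a rough Linux-flavoured sketch of that distinction; the sizes are arbitrary (64-bit assumed) and the behaviour described in the comments is my understanding of the Linux accounting rules, not gospel:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t reserve = 1UL << 36;   /* 64 GiB of address space, not memory */

        /* Address space reservation: PROT_NONE, so no pages are promised and
         * (on Linux) nothing is charged against the commit limit. */
        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("reserve"); return 1; }

        /* Commit charge: making a slice writable is what the kernel has to
         * account for, and what can fail under strict overcommit. */
        if (mprotect(base, 1UL << 20, PROT_READ | PROT_WRITE) != 0) {
            perror("commit");
            return 1;
        }
        base[0] = 1;   /* touch a now-committed page */
        return 0;
    }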
POSIX specifies that after vfork you cannot do so much as close an fd or call setenv before exec, both of which are very common things to want to do to set up an environment before something is run.
I disagree with the author too about overcommit, but vfork isn't a realistic solution.
In your vfork, you can exec another process that does all these things in a fresh environment. Realistically, you'll want a posix_spawn interface that uses vfork internally.
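Roughly what that looks like with posix_spawn (a sketch, not a drop-in recipe; the path, log file and environment are made up): the fd setup the grandparent worries about becomes declarative file actions, and the setenv setup becomes an explicit envp, rather than code run between fork and exec.

    #include <fcntl.h>
    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        /* "redirect stdout" expressed as a file action */
        posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "/tmp/child.log",
                                         O_WRONLY | O_CREAT | O_TRUNC, 0644);

        /* "setenv" expressed as an explicit environment for the child */
        char *argv[]     = { "env", NULL };
        char *childenv[] = { "MODE=batch", NULL };

        pid_t pid;
        int rc = posix_spawn(&pid, "/usr/bin/env", &fa, NULL, argv, childenv);
        posix_spawn_file_actions_destroy(&fa);
        if (rc != 0) { fprintf(stderr, "posix_spawn: %d\n", rc); return 1; }

        waitpid(pid, NULL, 0);
        return 0;
    }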
Some paragraphs are weirdly duplicated, such as "The OOM killer's process-killing capability is a reasonable way to deal with runaway processes [...]" and "People who don't understand how virtual memory works often insist on tracking the relationship between virtual and physical memory [...]"
I always wondered if forking a giant process (i.e., one with a large memory footprint) to create a small process caused a lot of needless memory to be allocated, but now I understand why it's no big deal:
> For example, the fork/exec combo creates transient virtual memory usage spikes, which go away again almost immediately without ever breaking the copy on write status of most of the pages in the forked page tables. Thus if a large process forks off a smaller process, enormous physical memory demands threaten to happen (as far as overcommit is concerned), but never materialize.
Note that this also occasionally results in "failure to fork" with the default Linux overcommit setting. At fork time, if there isn't enough memory to duplicate the entire process, the fork call fails even though you never intended to use that memory. You can use vfork to get around this.
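For reference, the vfork+exec pattern being referred to looks roughly like this (a minimal sketch; per POSIX the child must restrict itself to _exit or an exec function):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = vfork();
        if (pid < 0) {
            perror("vfork");       /* no COW commit charge to fail on */
            return 1;
        }
        if (pid == 0) {
            /* child: borrows the parent's address space until exec */
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);            /* exec failed */
        }
        int status;
        waitpid(pid, &status, 0);
        return 0;
    }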
Under your TLB section, you mention:
"The TLB is a cache for the MMU. All memory in the CPU's L1 cache must have an associated TLB entry, and invalidating a TLB entry flushes the associated cache line(s)."
Many CPUs, such as x86 ones, don't actually flush the L1 on a TLB eviction; there would be no reason to. Eventually the line will be evicted through a snoop, a capacity eviction, a C6 transition, etc. Of course, if a later memory access comes along and re-reads the line, the TLB will have to be repopulated in order to have a mapping.
Also, under "the appeal of huge pages", I would list the negatives. Huge pages have the appeal of reducing TLB pressure for things like big data/HPC workloads with large contiguous data sets, but they can also increase page fault latency and memory usage for more fragmented workloads that touch many pages. For example, on persist, Redis forks a process that may touch many pages. Each COW fault would then have to find room for and allocate 2 MB rather than just 4 KB.
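If that trade-off bites, Linux does let a process opt a specific region out of transparent huge pages with madvise; a hedged sketch (the region size and use case are illustrative):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 256UL << 20;   /* 256 MiB region, e.g. a cache */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* With THP enabled, each COW fault after fork() may allocate and
         * copy a 2 MiB page; opting out means sparse writes in the child
         * only cost 4 KiB each. */
        if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
            perror("madvise(MADV_NOHUGEPAGE)");
        return 0;
    }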
The FAQ is hosted by the NOMMULinux project (nomen omen: a Linux distribution that does not use virtual memory; it was discussed on HN earlier today).
I found it to be a very nice summary of how the OS manages memory, one that is quite interesting even without that context. That's why I posted it. It is a little light on details, but in my opinion it manages to show memory management as a complete process, which many other materials fail to do (at least the ones I've seen).
Some of us disagree violently with that FAQ's discussion of the OOM killer and strict overcommit. The arguments aren't even coherent.
For example, I disable overcommit and run my machines without any swap whatsoever. (Though this doesn't resolve the problem entirely in Linux.) The way to deal with the reality of finite memory isn't to pretend like it's infinite and then randomly kill processes when you hit a wall; or to use magic constants in user processes that try to guess the maximum number of logical tasks that can be handled for some amount of system RAM (which you could never possibly do correctly or consistently as a general matter for non-trivial purposes).
Rather, I would argue that overcommit encourages people to allocate huge amounts of swap in an attempt to avoid the OOM killer. And I would argue that overcommit and swap make it much easier to DoS a site, all things being equal. But certainly overcommit makes it more difficult to implement resilient software.
The solution is to fail allocation requests as soon as possible, and to fail them inline (not as signals). That way user processes can unwind and isolate the logical task attempting the allocation. Or if that's too difficult and inconvenient, a process can simply choose to panic. Without overcommit a process can choose the policy most suitable for the manner in which it was implemented. Among other things, that creates actionable back-pressure bound to the logical request.
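A toy sketch of that "fail inline and unwind the logical task" policy; the names are made up, and the point is only that the failure is reported to the caller of this one request rather than arriving later as an OOM kill:

    #include <errno.h>
    #include <stdlib.h>

    struct request;  /* hypothetical per-request state */

    int handle_request(struct request *req, size_t need) {
        (void)req;
        void *buf = malloc(need);
        if (buf == NULL) {
            /* With overcommit disabled this path is actually reachable:
             * reject just this request and keep serving the others. */
            return -ENOMEM;
        }
        /* ... do the work ... */
        free(buf);
        return 0;
    }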
The point about ELF shared libraries is just plain wrong. A shared library is a file, which already has a backing store in the filesystem. You need neither overcommit nor swap space to be able to map a shared library file into virtual address space and share it across multiple processes. Lots of operating systems do this without overcommitting physical RAM.
And because all modern systems have a unified buffer cache, you actually get better performance by keeping those pages backed by the regular file system. If you treat them like anonymous pages, you have to copy them back out to swap space under memory pressure, whereas if they remained file-backed you don't have to write them out at all.
Likewise, if a process wants to speculatively allocate a huge buffer that might possibly go unused, rather than using malloc or anonymous mmap, create and map a temporary file. For a dozen different reasons this is preferable. The only possible negative is that it doesn't kow-tow to broken software that didn't think about memory issues. But I don't want to install that software, anyhow, because if the author couldn't put in the minimal amount of effort to be able to work on systems without overcommit, they're unlikely to have put much thought into buffer overflows, integer overflows, and many other correctness and security issues.
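A sketch of that file-backed approach (O_TMPFILE is Linux-specific; mkstemp plus unlink is the portable spelling; the size and path are arbitrary):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t len = 4UL << 30;   /* 4 GiB speculative buffer */

        /* unnamed temporary file in /var/tmp, deleted automatically */
        int fd = open("/var/tmp", O_TMPFILE | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, len) != 0) { perror("tmpfile"); return 1; }

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);   /* the mapping keeps the file alive */

        /* Pages are allocated lazily and, under memory pressure, written
         * back to the file rather than to swap. */
        return 0;
    }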
Finally, because Linux has to obey traditional Unix resource limits, as well as cgroup limits, it has to maintain an accounting of allocated pages whether or not overcommit is enabled. That fact should be a strong hint that proper memory accounting is also a security concern, not simply an annoyance that we can wave away.
I dislike overcommit as well (although I have made use of it by relying on lazy allocation), but I think it is very hard to implement unix fork semantics without it: on a fork the kernel would have to reserve memory for a copy of the whole process, even though the next thing the child is most likely to do is throw it away by calling exec.
Unless all software were to be changed to use vfork or posix_spawn, I think we are stuck with it.
I would love to be able to disable overcommit on a per-process basis though.
It's not "hard" to implement fork without overcommit. tl;dr: have a big swap file. Don't worry, that swap file won't actually be used.
Really, fork works fine in a strict accounting world. You can turn off overcommit now on any Linux system, and in practice, nothing will go wrong. While it's true that fork requires that the system prepare for the possibility that the child might dirty all its parent's memory, it's just a matter of bookkeeping and involves no actual inefficiency. If you have enough swap, the kernel can use that swap as memory "reserves" and allow even huge processes to fork.
Big programs (of which there are few) really should all be modified to use vfork, though, which doesn't make COW mappings and so doesn't incur the commit charge hit.
You implement fork semantics identically. You just fail the fork if you can't commit the memory. You don't have to literally copy the pages, you just have to do the accounting.
Solaris does this strictly and will never overcommit, period. Linux does the accounting on most distributions in order to enforce resource limits. Until the most recent kernels the relevant code path was
Although, because the memory manager was written to assume overcommit, I believe the accounting checks are racy.
The most commonly cited reason for overcommit is to allow a process with a large amount of anonymous pages to fork+exec. But the solution to that has always been vfork+exec (which is how posix_spawn and system are implemented on most unices, including Linux). That pattern never went away, and it's still slightly faster than a regular fork even on Linux.
And that reason has always seemed specious to me. Large processes like PostgreSQL and MySQL don't generally fork+exec subprocesses _after_ they've entered their steady state.
I think the only real-world benefit of overcommit was that, at least in the 1990s, it allowed some software written for beefy Unix machines to work on desktop Linux machines. Back then, it was de rigueur to speculatively allocate buffer space, and often (but not always) to allow a sysadmin to control through a configuration file how much buffer and cache space to preallocate.
But this is 2016--there is no such thing as a sysadmin in the traditional sense; that is, someone attending to a specific machine, manually reallocating resources to various processes in response to load and functional requirements. In 2016 (and for at least the past 15 years or so) we expect software to react dynamically to load. And in order to do that reliably and consistently you _need_ malloc, fork, et al to fail properly.
Also, ironically, Linux has such a blazing fast virtual memory manager that it doesn't make sense anymore to preallocate buffer and cache space upfront. Combined with modern, enhanced virtual memory APIs, it's become an unnecessary contrivance.
`posix_spawn` looks interesting. I'm kind of scared about fork in a multi-threaded context -- I see very little discussion about what happens (especially if you have many threads executing I/O).
Is there anywhere to go to read about the interactions of threads with syscalls like `fork`?
It is specified by POSIX and SUSv4 [1], but it is still hard to use safely. Only the forking thread will exist in the child process; assume that mutexes are in an unspecified state, including those you do not know about (for example, those inside libc's malloc).
Friends do not let friends fork in a multithreaded process. Prefork before spawning any threads if required.
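For completeness, the standard's partial answer to the mutex-state problem is pthread_atfork, though it only covers locks your own code knows about, not e.g. libc-internal ones. A sketch with made-up names:

    #include <pthread.h>

    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Take the lock before fork, release it on both sides, so the child
     * never sees it held by a thread that no longer exists. */
    static void prepare(void)      { pthread_mutex_lock(&cache_lock); }
    static void parent_after(void) { pthread_mutex_unlock(&cache_lock); }
    static void child_after(void)  { pthread_mutex_unlock(&cache_lock); }

    /* Call once, e.g. from the library's init routine. */
    void mylib_init(void) {
        pthread_atfork(prepare, parent_after, child_after);
    }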
The POSIX specification is one of the best resources, especially
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html and
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09
You can download the HTML copy for free. It's useful to have the HTML frames version locally as you can grep it, and it's easier to jump around from the navigation frame.
The Linux manual pages are generally atrocious. The more updated pages literally copy+paste the POSIX standard. Better to just go straight to it. And because the POSIX standard is so well written, I think it's better to refer to it first and only look to other resources when you have more particular questions unanswered or raised by the standard. There's too much cargo-cult advice out there, often written by people who have at best only a passing familiarity with the standard or actual implementations.
As always, when in doubt refer to the actual source code. That should always be possible with Linux.
Forking from a threaded process is doable, and the standard and implementations are carefully written to make it possible and practical, though generally the only sensible thing to do is to exec another process. The posix_spawn implementation should do the right thing. But the posix_spawn API can be painful to use for non-trivial purposes, especially compared to the simplicity of fork+dup+close+exec. A good implementation to study is the one in musl libc. (glibc's code is not for the faint of heart.)
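For contrast, the fork+dup+close+exec idiom mentioned above looks roughly like this (a sketch with most error handling trimmed; the child command is arbitrary):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int pipefd[2];
        if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            dup2(pipefd[0], STDIN_FILENO);   /* read end becomes stdin */
            close(pipefd[0]);
            close(pipefd[1]);                /* drop the unused write end */
            execlp("wc", "wc", "-l", (char *)NULL);
            _exit(127);                      /* exec failed */
        }
        close(pipefd[0]);
        dprintf(pipefd[1], "one line\n");
        close(pipefd[1]);
        waitpid(pid, NULL, 0);
        return 0;
    }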
> Some of us disagree violently with that FAQ's discussion of the OOM killer and strict overcommit.
Agreed. Android, for example, has made major architectural changes due to not having swap. I wasn't around during the early days, and am by no means an expert, but it looks to me like the following decisions were made:
1. No swapping: flash at the time was very slow and limited in capacity, and it still has a limited lifetime. Introduce the OOM killer. [0]
2. SysV semaphores and shared memory can leak when killing a process. [1][2] It looks like these have since been added to bionic (Android's libc). [3]
3. Introduce the Binder kernel driver for IPC instead. Binder handles reference counting and can do neat things like death notifications. [4]