~$ curl -v http://nommu.org/memory-faq.txt > /dev/null
> GET /memory-faq.txt HTTP/1.1
> Host: nommu.org
< HTTP/1.1 200 OK
< Server: GitHub.com
< Last-Modified: Thu, 05 May 2016 01:37:27 GMT
Jun 8, 2015 is the last commit to this particular file.
(The rest of this comment is largely my own inflated opinion.)
From a developer's perspective, whether it's _useful_ when operating at a certain level of abstraction is another question. I really do think it's good to know this stuff, but if you're writing, say, Ruby or Python web apps you probably don't _need_ to know it.
For low-latency, high-throughput, or systems-level work it starts to become more relevant. That said, there have certainly been operational scenarios where knowing some of this stuff can help narrow down misconfigured systems and/or squeeze more life out of existing hardware (e.g. configuring a JVM to use huge pages).
Yes, you need backing storage to handle the possibility that a process might actually make copies of all COW pages (that's what we mean by "commit charge": the number of pages the kernel has voluntarily promised to deliver at some point). The right way to reduce the pressure on commit that these mappings cause is to not have so many of them in the first place. There's no reason PIC shared libraries need to be PROT_WRITE; there's no reason big programs like web browsers need to use fork(2) instead of the VM-efficient vfork(2). (Small programs can use fork and its COW commit charge without causing real problems, even on non-sloppy (err, "strict overcommit") systems.)
Linux's default overcommit policy is terrible in part because it discourages people from learning about the difference between reserving address space and actually reserving memory, and the "FAQ" just confuses the issue more.
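To make the address-space versus memory distinction concrete, here is a minimal sketch (Linux only, since it reads /proc/self/statm). The helper names `vm_and_rss_bytes` and `reserve_then_touch` are made up for illustration. It maps a large private anonymous region, touches only a few pages, and reports how much the virtual size and the resident size each grew. Note that this measures address space against resident pages; under strict accounting a writable private mapping like this one would also be charged against the commit limit at mmap time, which is a third number this sketch does not show.

```python
import mmap
import resource

PAGE = resource.getpagesize()

def vm_and_rss_bytes():
    """Return (virtual size, resident size) of this process in bytes (Linux /proc)."""
    with open("/proc/self/statm") as f:
        size_pages, resident_pages = f.read().split()[:2]
    return int(size_pages) * PAGE, int(resident_pages) * PAGE

def reserve_then_touch(reserve_bytes, touch_pages):
    """Map reserve_bytes of anonymous memory, write to touch_pages pages.

    Returns (growth in virtual size after mmap, growth in resident size after touching).
    """
    vm0, _ = vm_and_rss_bytes()
    region = mmap.mmap(-1, reserve_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    vm1, rss1 = vm_and_rss_bytes()
    for i in range(touch_pages):
        region[i * PAGE] = 1          # first write to a page makes the kernel back it
    _, rss2 = vm_and_rss_bytes()
    region.close()
    return vm1 - vm0, rss2 - rss1
```

Calling `reserve_then_touch(64 << 20, 16)` should show the virtual size growing by the full 64 MiB while the resident size grows by roughly the handful of pages that were written.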
I also disagree with the author about overcommit, but vfork isn't a realistic solution.
Probably supposed to read "for which there is no current memory mapping" ?
Another poster pointed out that it's on github, so you could always submit a pull request:
For example, the fork/exec combo creates transient virtual memory usage
spikes, which go away again almost immediately without ever breaking the
copy on write status of most of the pages in the forked page tables. Thus
if a large process forks off a smaller process, enormous physical memory
demands threaten to happen (as far as overcommit is concerned), but never
Many CPUs, such as x86 ones, don't actually flush the L1 on a TLB eviction; there would be no reason to. Eventually the line will be evicted through a snoop, capacity eviction, C6 transition, etc. Of course, if a later memory access comes to re-read the line, the TLB will have to be repopulated in order to have a mapping.
Also, on the "appeal of huge pages", I would add the negatives. Huge pages reduce TLB pressure for workloads such as big data/HPC with large contiguous data sets, but they can also increase page fault latency and memory usage for more fragmented workloads that touch many pages. For example, on persist, Redis forks a process that may touch many pages. Each COW would have to find room for and allocate 2MB rather than just 4K.
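The arithmetic behind that last point can be sketched in a few lines. This is a worst-case estimate under a stated assumption: every write after the fork lands in a different page, so each one triggers a full-page copy. The function name and the figure of 1000 writes are illustrative, not taken from Redis.

```python
SMALL_PAGE = 4 * 1024            # 4K base page
HUGE_PAGE = 2 * 1024 * 1024      # 2MB huge page

def cow_copy_bytes(distinct_pages_written, page_size):
    """Bytes copied if each write hits a different copy-on-write page."""
    return distinct_pages_written * page_size

small = cow_copy_bytes(1000, SMALL_PAGE)   # 4,096,000 bytes, about 3.9 MiB
huge = cow_copy_bytes(1000, HUGE_PAGE)     # 2,097,152,000 bytes, about 2000 MiB
ratio = huge // small                      # 512
```

Writes that cluster inside the same huge page would share one copy, so real workloads fall somewhere between the two figures.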
I found it to be a very nice summary of how an OS manages memory, one that is quite interesting even without context. That's why I posted it. It is a little bit light on the details, but in my opinion it is able to show memory management as a complete process, which many other materials fail to do (at least the ones that I've seen).
For example, I disable overcommit and run my machines without any swap whatsoever. (Though this doesn't resolve the problem entirely in Linux.) The way to deal with the reality of finite memory isn't to pretend like it's infinite and then randomly kill processes when you hit a wall; or to use magic constants in user processes that try to guess the maximum number of logical tasks that can be handled for some amount of system RAM (which you could never possibly do correctly or consistently as a general matter for non-trivial purposes).
Rather, I would argue that overcommit encourages people to allocate huge amounts of swap in an attempt to avoid the OOM killer. And I would argue that overcommit and swap make it much easier to DoS a site, all things being equal. But certainly overcommit makes it more difficult to implement resilient software.
The solution is to fail allocation requests as soon as possible, and to fail them inline (not as signals). That way user processes can unwind and isolate the logical task attempting the allocation. Or if that's too difficult and inconvenient, a process can simply choose to panic. Without overcommit a process can choose the policy most suitable for the manner in which it was implemented. Among other things, that creates actionable back-pressure bound to the logical request.
The point about ELF shared libraries is just plain wrong. A shared library is a file, which already has a backing store in the filesystem. You need neither overcommit nor swap space to be able to map a shared library file into virtual address space and share it across multiple processes. Lots of operating systems do this without overcommitting physical RAM.
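As a rough illustration of failing inline, the sketch below (Linux only) temporarily lowers this process's RLIMIT_AS so that an oversized mapping request is refused at the call site, and the task that asked for it is the only thing that fails. The names `handle_request` and `serve_under_limit` are hypothetical, and using an address-space limit to stand in for strict commit accounting is my assumption for the demo, not something the kernel's overcommit setting does.

```python
import mmap
import resource

PAGE = resource.getpagesize()

def current_vsize():
    """Current virtual size of this process in bytes (Linux /proc)."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[0]) * PAGE

def handle_request(nbytes):
    """One logical task: True if its buffer was granted, False if refused inline."""
    try:
        buf = mmap.mmap(-1, nbytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    except (OSError, MemoryError):
        return False                 # unwind just this task; the process carries on
    buf.close()
    return True

def serve_under_limit(request_sizes, headroom):
    """Run each request with the address-space limit set to current size + headroom."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    cap = current_vsize() + headroom
    if soft != resource.RLIM_INFINITY:
        cap = min(cap, soft)
    if hard != resource.RLIM_INFINITY:
        cap = min(cap, hard)
    resource.setrlimit(resource.RLIMIT_AS, (cap, hard))
    try:
        return [handle_request(n) for n in request_sizes]
    finally:
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))   # restore the old limit
```

With 64 MiB of headroom, a 1 MiB request should succeed, a 1 GiB request should be refused with an error the caller can handle, and a following 1 MiB request should succeed again.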
And because all modern systems have a unified buffer cache, you actually get better performance by keeping those pages backed by the regular file system. If you treat them like anonymous pages, you have to copy them back out to swap space under memory pressure, whereas if they remained file-backed you don't have to write them out at all.
Likewise, if a process wants to speculatively allocate a huge buffer that might possibly go unused, rather than using malloc or anonymous mmap, create and map a temporary file. For a dozen different reasons this is preferable. The only possible negative is that it doesn't kow-tow to broken software that didn't think about memory issues. But I don't want to install that software, anyhow, because if the author couldn't put in the minimal amount of effort to be able to work on systems without overcommit, they're unlikely to have put much thought into buffer overflows, integer overflows, and many other correctness and security issues.
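A minimal sketch of the temporary-file approach, assuming a POSIX system; the function name `map_scratch_file` is made up. The file is extended with ftruncate, so it starts out sparse, and it is mapped shared, so dirty pages can be written back to the file rather than needing swap. One caveat to keep in mind: if the temporary directory is a tmpfs, the "file" is itself memory-backed, so you would want to pass a directory on a real filesystem.

```python
import mmap
import os
import tempfile

def map_scratch_file(nbytes, directory=None):
    """Return a writable mapping of nbytes backed by an unnamed temporary file."""
    f = tempfile.TemporaryFile(dir=directory)
    try:
        os.ftruncate(f.fileno(), nbytes)      # sparse: no blocks allocated yet
        buf = mmap.mmap(f.fileno(), nbytes)   # default flags are MAP_SHARED
    finally:
        f.close()   # safe: the mmap object keeps its own duplicate descriptor
    return buf
```

Usage looks the same as any other buffer: `buf = map_scratch_file(8 << 20)`, then slice assignment such as `buf[:5] = b"hello"`, and `buf.close()` when done.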
Finally, because Linux has to obey traditional Unix resource limits, as well as cgroup limits, it has to maintain an accounting of allocated pages whether or not overcommit is enabled. That fact should also be a strong hint that proper memory accounting is also a security concern, not simply an annoyance that we can wave away.
Unless all software were to be changed to use vfork or posix_spawn, I think we are stuck with it.
I would love to be able to disable overcommit on a per-process basis though.
Really, fork works fine in a strict accounting world. You can turn off overcommit now on any Linux system, and in practice, nothing will go wrong. While it's true that fork requires that the system prepare for the possibility that the child might dirty all its parent's memory, it's just a matter of book-keeping to do and involves no actual inefficiency. If you have enough swap, the kernel can use that swap as memory "reserves" and allow even huge processes to fork.
Big programs (of which there are few) really should all be modified to use vfork, though, which doesn't make COW mappings and so doesn't incur the commit charge hit.
Solaris does this strictly and will never overcommit, period. Linux does the accounting on most distributions in order to enforce resource limits. Until the most recent kernels the relevant code path was
fork -> security_vm_enough_memory_mm -> __vm_enough_memory
The most commonly cited reason for overcommit is to allow a process with a large amount of anonymous pages to fork+exec. But the solution to that has always been vfork+exec (which is how posix_spawn and system are implemented on most unices, including Linux). That pattern never went away, and it's still slightly faster than a regular fork even on Linux.
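For reference, launching a child through posix_spawn rather than fork+exec might look like the sketch below. It uses Python's `os.posix_spawn` wrapper (Python 3.8+); whether the underlying libc uses a vfork-style clone is an implementation detail I am taking from the comment above rather than something this code can verify. `spawn_and_wait` and `DEMO_CMD` are illustrative names.

```python
import os
import sys

# Illustrative command: a child interpreter that exits with status 7.
DEMO_CMD = [sys.executable, "-c", "raise SystemExit(7)"]

def spawn_and_wait(argv):
    """Start argv[0] with posix_spawn and return its exit status (-1 if killed)."""
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

The parent never duplicates its own page tables as COW mappings in this pattern, which is the property the large-process case cares about.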
And that reason has always seemed specious to me. Large processes like PostgreSQL and MySQL don't generally fork+exec subprocesses _after_ they've entered their steady state.
I think the only real-world benefit of overcommit was that, at least in the 1990s, it allowed some software written for beefy Unix machines to work on desktop Linux machines. Back then, it was de rigueur to speculatively allocate buffer space, and often (but not always) to allow a sysadmin to control through a configuration file how much buffer and cache space to preallocate.
But this is 2016--there is no such thing as a sysadmin in the traditional sense; that is, someone attending to a specific machine, manually reallocating resources to various processes in response to load and functional requirements. In 2016 (and for at least the past 15 years or so) we expect software to react dynamically to load. And in order to do that reliably and consistently you _need_ malloc, fork, et al to fail properly.
Also, ironically, Linux has such a blazing fast virtual memory manager that it doesn't make sense anymore to preallocate buffer and cache space upfront. Combined with modern, enhanced virtual memory APIs, it's become an unnecessary contrivance.
Is there anywhere to go to read about the interactions of threads with syscalls like `fork`?
Friends do not let friends fork in a multithreaded process. Prefork before spawning any threads if required.
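The ordering being recommended can be sketched as follows: fork the helper process first, while the program is still single-threaded, and only then start threads. This is a toy under stated assumptions: the caller has not started any threads yet, the child does nothing but wait on a pipe, and `prefork_then_thread` is a made-up name.

```python
import os
import threading

def prefork_then_thread(n_threads):
    """Fork a worker before any threads exist, then run n_threads threads."""
    r, w = os.pipe()
    pid = os.fork()                  # assumed to run while single-threaded
    if pid == 0:
        # Child: wait for a job on the pipe, report its length as exit status.
        os.close(w)
        data = os.read(r, 64)
        os._exit(len(data))          # skip the parent's cleanup handlers
    os.close(r)

    # Parent: safe to go multithreaded now that the fork is behind us.
    results = []
    threads = [threading.Thread(target=results.append, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    os.write(w, b"job")              # hand the preforked worker its job
    os.close(w)
    _, status = os.waitpid(pid, 0)
    return sorted(results), os.WEXITSTATUS(status)
```

Because the child was created before any thread existed, it never inherits a lock held by a thread that does not exist on its side of the fork.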
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html and
The Linux manual pages are generally atrocious. The more updated pages literally copy+paste the POSIX standard. Better to just go straight to it. And because the POSIX standard is so well written, I think it's better to refer to it first and only look to other resources when you have more particular questions unanswered or raised by the standard. There's too much cargo-cult advice out there, often written by people who have at best only a passing familiarity with the standard or actual implementations.
As always, when in doubt refer to the actual source code. That should always be possible with Linux.
Forking from a threaded process is doable and the standard and implementations are carefully written to make it possible and practical, though generally the only sensible thing to do is to exec another process. The posix_spawn implementation should do the right thing. But the posix_spawn API can be painful to use for non-trivial purposes, especially compared to the simplicity of fork+dup+close+exec. A good implementation to study is the implementation in musl libc. (glibc code is not for the faint of heart.)
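To give a feel for how the fork+dup+close+exec sequence maps onto posix_spawn, here is a small sketch using the `file_actions` parameter of Python's `os.posix_spawn`. The dup2 and close steps that would normally run in the child between fork and exec are instead declared up front as a list. `spawn_capture_stdout` is a hypothetical helper, and the sketch ignores the edge case where the pipe's write end already happens to be descriptor 1.

```python
import os
import sys

def spawn_capture_stdout(argv):
    """Run argv with its stdout redirected to a pipe; return (output, wait status)."""
    r, w = os.pipe()
    actions = [
        (os.POSIX_SPAWN_DUP2, w, 1),   # child: dup2(w, STDOUT_FILENO)
        (os.POSIX_SPAWN_CLOSE, r),     # child: close(r)
        (os.POSIX_SPAWN_CLOSE, w),     # child: close(w)
    ]
    pid = os.posix_spawn(argv[0], argv, dict(os.environ), file_actions=actions)
    os.close(w)                        # parent keeps only the read end
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:                  # EOF once the child closes its stdout
            break
        chunks.append(chunk)
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return b"".join(chunks), status
```

For a single redirection this stays readable; the pain mentioned above shows up once the child setup needs steps that have no corresponding file action or spawn attribute.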
Agreed. Android, for example, has made major architectural changes due to not having swap. I wasn't around during the early days, and am by no means an expert, but it looks to me like the following decisions were made:
1. No swapping: flash at the time was very slow, limited in capacity, and still has limited lifetime. Introduce OOM killer.
2. SysV semaphores and shared memory can leak when killing a process. It looks like these have since been added to bionic (Android's libc).
3. Introduce The Binder kernel driver for IPC instead. The Binder handles reference counting, and can do neat things like death notifications.
4. Introduce ashmem for shared memory.
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-st... (extremely readable at 212 LOC)
 See `man 7 sem_overview`'s and `man 7 shm_overview`'s Persistence sections.
 https://android.googlesource.com/platform/bionic/+/master/li..., https://android.googlesource.com/platform/bionic/+/master/li...