
Memory FAQ - mbel
http://nommu.org/memory-faq.txt
======
jdi
[https://github.com/0xAX/linux-insides](https://github.com/0xAX/linux-insides)
A different source of documentation on Linux internals, for those who don't
already know about it.

~~~
bogomipz
Yeah this is a really great resource, I'm looking forward to seeing the final
book.

------
kensai
Interesting information. But this FAQ collection should have a date. Is it
always relevant?

~~~
schmichael

      ~$ curl -v http://nommu.org/memory-faq.txt > /dev/null
      > GET /memory-faq.txt HTTP/1.1
      > Host: nommu.org
      >
      < HTTP/1.1 200 OK
      < Server: GitHub.com
      < Last-Modified: Thu, 05 May 2016 01:37:27 GMT
    

Hm, github you say?

[https://github.com/nommu/nommu.org](https://github.com/nommu/nommu.org)

Jun 8, 2015 is the last commit to this particular file.

------
twoodfin
Another benefit of huge pages: Some database systems use a model where client
processes operate on a massive pool of shared memory (typically a buffer
cache). This allows the system to leverage the OS process model for isolation,
privileges, etc. Huge pages greatly reduce the overhead of new client
processes lazily mapping that pool into their address space.
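
A minimal sketch of reserving such a pool with explicit huge pages on Linux;
real systems set this up in various ways (SysV shm, hugetlbfs, or anonymous
mappings inherited across fork), and the size here is invented:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t pool_size = (size_t)1 << 30;   /* hypothetical 1 GiB buffer cache */

        /* MAP_SHARED so forked client processes see the same pool;
           MAP_HUGETLB fails cleanly unless huge pages have been reserved
           via vm.nr_hugepages */
        void *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (pool == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }

        /* with 2 MiB pages this pool is ~512 page-table entries instead of
           ~262144 4 KiB entries, so each new client faults it in with far
           fewer page faults and TLB misses */
        memset(pool, 0, pool_size);

        munmap(pool, pool_size);
        return 0;
    }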

~~~
eloff
This helps e.g. Postgres, which has a process-per-connection model. MySQL uses
threads, which are lighter weight and don't have this problem.

------
quotemstr
The FAQ's tone of dismissive superiority with respect to overcommit is very
offputting. The author suggests that only people who don't know how virtual
memory works can oppose overcommit; it's curious then how the author reveals
some virtual memory knowledge gaps. Specifically, the difference between
address space reservation (which does not require backing storage) and commit
charge (which does) seems lost on him.

Yes, you need backing storage to handle the possibility that a process might
actually make copies of all COW pages (that's what we mean by "commit charge":
the number of pages the kernel has voluntarily promised to deliver at some
point). The right way to reduce the pressure on commit that these mappings
cause is to not have so many of them in the first place. There's no reason PIC
shared libraries need to be PROT_WRITE; there's no reason big programs like
web browsers need to use fork(2) instead of the VM-efficient vfork(2). (Small
programs can use fork and its COW commit charge without causing real problems,
even on non-sloppy (err, "strict overcommit") systems.)

Linux's default overcommit policy is terrible in part because it discourages
people from learning about the difference between reserving address space and
actually reserving memory, and the "FAQ" just confuses the issue more.

~~~
Jasper_
POSIX specifies that after vfork you cannot even do so much as close an fd or
call setenv before exec, both of which are very common things to want to do to
set up an environment before something is run.

I disagree with the author about overcommit too, but vfork isn't a realistic
solution.

~~~
quotemstr
From your vfork'd child, you can exec another process that does all these
things in a fresh environment. Realistically, you'll want a posix_spawn
interface that uses vfork internally.
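
A minimal sketch of that pattern; recent glibc implements posix_spawn with a
vfork-style clone internally, and the fd and environment values below are
invented for illustration:

    #include <fcntl.h>
    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* an fd we want closed in the child (the setup work mentioned above) */
        int fd = open("/dev/null", O_RDONLY);

        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        posix_spawn_file_actions_addclose(&fa, fd);

        /* instead of calling setenv() in the child, hand exec a fresh environment */
        char *child_env[]  = {"GREETING=hello", NULL};
        char *child_argv[] = {"env", NULL};

        pid_t pid;
        int rc = posix_spawnp(&pid, "env", &fa, NULL, child_argv, child_env);
        posix_spawn_file_actions_destroy(&fa);
        if (rc != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(rc));
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }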

------
AceJohnny2
> _Virtual address ranges for which there is current memory mapping are said
> to be "unmapped"_

Probably supposed to read "for which there is _no_ current memory mapping" ?

~~~
davidsong
Yeah I spotted that too.

Another poster pointed out that it's on GitHub, so you could always submit a
pull request:

[https://github.com/nommu/nommu.org](https://github.com/nommu/nommu.org)

------
zokier
Some paragraphs are weirdly duplicated, such as "The OOM killer's process-
killing capability is a reasonable way to deal with runaway processes [...]"
and "People who don't understand how virtual memory works often insist on
tracking the relationship between virtual and physical memory [...]"

------
waynecochran
I always wondered if forking a giant process (i.e., one with a large memory
footprint) to create a small process caused a lot of needless memory to be
allocated, but now I understand why it's no big deal:

    For example, the fork/exec combo creates transient virtual memory usage
    spikes, which go away again almost immediately without ever breaking the
    copy on write status of most of the pages in the forked page tables. Thus
    if a large process forks off a smaller process, enormous physical memory
    demands threaten to happen (as far as overcommit is concerned), but never
    materialize.

~~~
paulfurtado
Note that this also occasionally results in "failure to fork" with the default
Linux overcommit setting. At fork time, if there isn't enough memory to
duplicate the entire process, the fork call fails, even though you never
intended to use that memory. You can use vfork to get around this.
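
A rough sketch of that workaround, with the usual caveat that after vfork the
child may only call exec or _exit; the command being spawned is made up:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        char *argv[] = {"ls", "-l", NULL};   /* made-up child command */

        /* vfork shares the parent's address space until exec, so no commit
           charge is taken for a copy of the (possibly huge) parent */
        pid_t pid = vfork();
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);   /* only exec*() and _exit() are safe after vfork */
        }
        if (pid < 0) {
            perror("vfork");
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }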

------
derricgilling
Under your TLB section, you mention: "The TLB is a cache for the MMU. All
memory in the CPU's L1 cache must have an associated TLB entry, and
invalidating a TLB entry flushes the associated cache line(s)."

Many CPUs, such as x86 ones, don't actually flush the L1 on a TLB eviction;
there would be no reason to. Eventually the line will be evicted through a
snoop, capacity eviction, C6 transition, etc. Of course, if a later memory
access comes to re-read the line, the TLB will have to be repopulated in order
to have a mapping.

Also, under "the appeal of huge pages", I would list the negatives. Huge pages
have the appeal of reducing TLB pressure for stuff like big data/HPC workloads
with large contiguous data sets, but they can also increase page fault latency
and memory usage for more fragmented workloads that touch many pages. For
example, on persist, Redis forks a process that may touch many pages. Each COW
fault would have to find room for and allocate 2MB rather than just 4K.

------
zymhan
That was interesting to read, but I don't understand what the context is at
all.

~~~
mbel
The FAQ is hosted by the NOMMU Linux project (nomen omen: a Linux distribution
that does not use virtual memory; it was discussed on HN earlier today).

I found it to be a very nice summary of how an OS manages memory that is quite
interesting even without context. That's why I posted it. It is a little light
on details, but in my opinion it shows memory management as a complete
process, which many other materials fail to do (at least the ones I've seen).

~~~
wahern
Some of us disagree violently with that FAQ's discussion of the OOM killer and
strict overcommit. The arguments aren't even coherent.

For example, I disable overcommit and run my machines without any swap
whatsoever. (Though this doesn't resolve the problem entirely in Linux.) The
way to deal with the reality of finite memory isn't to pretend like it's
infinite and then randomly kill processes when you hit a wall; or to use magic
constants in user processes that try to guess the maximum number of logical
tasks that can be handled for some amount of system RAM (which you could never
possibly do correctly or consistently as a general matter for non-trivial
purposes).

Rather, I would argue that overcommit encourages people to allocate huge
amounts of swap in an attempt to avoid the OOM killer. And I would argue that
overcommit and swap make it much easier to DoS a site, all things being equal.
But certainly overcommit makes it more difficult to implement resilient
software.

The solution is to fail allocation requests as soon as possible, and to fail
them inline (not as signals). That way user processes can unwind and isolate
the logical task attempting the allocation. Or if that's too difficult and
inconvenient, a process can simply choose to panic. Without overcommit a
process can choose the policy most suitable for the manner in which it was
implemented. Among other things, that creates actionable back-pressure bound
to the logical request.
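
A toy sketch of what failing inline and unwinding the logical task can look
like; the buffer size and the load-shedding response are invented:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* hypothetical per-request handler: if the allocation fails, only this
       logical task is abandoned and the rest of the process keeps running */
    static int handle_request(size_t need) {
        char *buf = malloc(need);
        if (buf == NULL) {
            /* inline failure is actionable back-pressure: shed load, return
               an error to the client, drop the connection, etc. */
            fprintf(stderr, "request needs %zu bytes, shedding load\n", need);
            return -1;
        }
        memset(buf, 0, need);   /* under strict overcommit these pages are really ours */
        /* ... do the actual work ... */
        free(buf);
        return 0;
    }

    int main(void) {
        return handle_request((size_t)1 << 20) == 0 ? 0 : 1;
    }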

The point about ELF shared libraries is just plain wrong. A shared library is
a file, which already has a backing store in the filesystem. You need neither
overcommit nor swap space to be able to map a shared library file into virtual
address space and share it across multiple processes. Lots of operating
systems do this without overcommitting physical RAM.

And because all modern systems have a unified buffer cache, you actually get
better performance by keeping those pages backed by the regular file system.
If you treat them like anonymous pages, you have to copy them back out to swap
space under memory pressure, whereas if they remained file-backed you don't
have to write them out at all.

Likewise, if a process wants to speculatively allocate a huge buffer that
might possibly go unused, rather than using malloc or anonymous mmap, create
and map a temporary file. For a dozen different reasons this is preferable.
The only possible negative is that it doesn't kowtow to broken software that
didn't think about memory issues. But I don't want to install that software,
anyhow, because if the author couldn't put in the minimal amount of effort to
be able to work on systems without overcommit, they're unlikely to have put
much thought into buffer overflows, integer overflows, and many other
correctness and security issues.
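
A rough sketch of the temp-file approach on Linux; the directory, size, and
access pattern are made up:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t len = (size_t)8 << 30;              /* speculative 8 GiB buffer */

        /* an unlinked temporary file acts as the backing store */
        char path[] = "/var/tmp/bigbuf-XXXXXX";
        int fd = mkstemp(path);
        if (fd < 0) { perror("mkstemp"); return 1; }
        unlink(path);                              /* gone when fd is closed */

        if (ftruncate(fd, (off_t)len) < 0) { perror("ftruncate"); return 1; }

        /* file-backed MAP_SHARED: dirty pages can be written back to the file
           under memory pressure instead of consuming swap or commit charge */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        memset(buf, 0, 4096);                      /* touch only what's used */

        munmap(buf, len);
        close(fd);
        return 0;
    }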

Finally, because Linux has to obey traditional Unix resource limits, as well
as cgroup limits, it has to maintain an accounting of allocated pages whether
or not overcommit is enabled. That fact should be a strong hint that proper
memory accounting is also a security concern, not simply an annoyance we can
wave away.

~~~
gpderetta
I dislike overcommit as well (although I have made use of it by relying on
lazy allocation), but I think it is very hard to implement unix fork semantics
without it: on a fork the kernel would have to reserve memory for a copy of
the whole process, even though the next thing the child is most likely to do
is throw it away by calling exec.

Unless all software were to be changed to use vfork or posix_spawn, I think we
are stuck with it.

I would love to be able to disable overcommit on a per-process basis though.

~~~
KayEss
`posix_spawn` looks interesting. I'm kind of scared about fork in a multi-
threaded context -- I see very little discussion about what happens
(especially if you have many threads executing I/O).

Is there anywhere to go to read about the interactions of threads with
syscalls like `fork`?

~~~
gpderetta
It is specified by POSIX and SUSv4 [1], but it is still hard to use safely.
Only
the forking thread will exist in the child process; assume that mutexes are in
an unspecified state, including those you do not know about (for example those
inside libc malloc).

Friends do not let friends fork in a multithreaded process. Prefork before
spawning any threads if required.

[1]
[http://pubs.opengroup.org/onlinepubs/9699919799/](http://pubs.opengroup.org/onlinepubs/9699919799/)

------
valarauca1
Really good read. Saved in my book mark. Solid information dense document.

