What they don’t tell you about demand paging in school (offlinemark.com)
71 points by luu on Oct 17, 2020 | 24 comments



"In Operating Systems 101 we learn that operating systems are “lazy” when they allocate memory to processes. When you mmap() an anonymous page, the kernel slyly returns a pointer immediately. It then waits until you trigger a page fault by “touching” that memory before doing the real memory allocation work”.

Some operating systems do that, some don't. IBM's AIX variant of UNIX allocated space when you asked for it, not when an instruction touched the page. The advantage is that you find out if you don't have enough memory when you make the system call to request it, not by having the program killed.
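
To make the Linux-style lazy path concrete, here's a minimal sketch (assuming Linux defaults; the minor-fault counter from getrusage() is what exposes the deferred allocation work):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/resource.h>

  int main(void) {
      size_t len = 64 * 1024 * 1024;  /* 64 MiB of anonymous memory */
      struct rusage before, after;

      /* The kernel hands back an address range immediately... */
      char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      getrusage(RUSAGE_SELF, &before);
      memset(p, 1, len);   /* ...physical pages appear only as we touch them */
      getrusage(RUSAGE_SELF, &after);

      printf("minor faults while touching: %ld\n",
             after.ru_minflt - before.ru_minflt);
      munmap(p, len);
      return 0;
  }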

(It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)


Did AIX actually _allocate_ and map the pages you requested, or merely _reserve_ them by bumping a counter? Solaris at least historically did the latter. The heavy lifting of determining which physical page you would get, and actually mapping it, was generally still deferred until a fault required it.

While paging out to rotational media is a tremendous latency issue, paging out to an NVMe SSD or something like Optane doesn't seem like it is nearly as bad. One could imagine a tiered system of paging also: first by compressing into another RAM location, but then evicting the tail of that compressed page set to SSD as required to make more room in RAM.


MacOS does the compression thing. It can be done on Linux too.


You're right, I probably should have said "some operating systems". I was generalizing based on my own education, which used Linux. I believe Windows does not do this either, but I'm not sure how many OS 101 classes teach the Windows kernel :)

> The advantage is that you find out if you don't have enough memory when you make the system call to request it, not by having the program killed.

I wonder how much of an advantage this is. Memory availability is constantly in flux, and Linux, at least, goes to great lengths to reclaim memory before invoking the OOM killer, which is a last resort. What if memory frees up immediately after the application gets the error return, but before the app touches it?

I'd trust the kernel more to manage this than applications themselves, because the kernel can make much better decisions on how to juggle things (evicting caches and whatnot). Plus, we all know how well we all test our error handling paths :)

> (It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)

In case anyone is interested in learning more, this is an excellent article discussing the nuances of paging to disk: https://chrisdown.name/2018/01/02/in-defence-of-swap.html

Re: mobile, does Android disable paging file-backed mappings out to disk? That happens even if swap is disabled (swap only affects anonymous mappings).


> I wonder how much of an advantage this is. Memory availability is constantly in flux, and Linux, at least, goes to great lengths to reclaim memory before invoking the OOM killer, which is a last resort.

Whether or not it goes to "great lengths" before invoking the OOM killer is irrelevant. 1) On heavily loaded systems the OOM killer can trigger regularly, causing all manner of havoc because it can non-deterministically kill unrelated processes[1], and 2) a good kernel should go to great lengths to service a memory request, but that merely raises the question of whether those great lengths should involve shooting down unrelated processes.

> What if memory frees up immediately after the application gets the error return, but before the app touches it?

What if what? What if I try to create a TCP connection to a host which rejected the connection, or the request timed out, but if the request had been delayed just a little longer it would have succeeded? What does that have to do with whether the kernel should start killing unrelated TCP streams?

> I'd trust the kernel more to manage this than applications themselves, because the kernel can make much better decisions on how to juggle things (evicting caches and whatnot). Plus, we all know how well we all test our error handling paths :)

The kernel has very little information with which to determine which process to kill as an appropriate response to resource exhaustion, presuming killing any process is even the correct response. It's a difficult problem, but the OOM killer provides only one solution, one acceptable to those who wouldn't otherwise even care, at the expense of crippling software that cares strongly and could implement correct, deterministic handling itself. Moreover, we're talking about the kernel here. Generally speaking, kernels should provide mechanism, not policy. By heavily and deeply conflating mechanism and policy, the OOM killer is a fundamentally and fatally flawed approach for a general purpose kernel. There is no way to "fix" an OOM killer approach without effectively erasing the line between application software and the kernel. That might be fine for embedded systems, but for a general purpose, multi-user system (which in the age of k8s is making a comeback), it's just plain wrong.

[1] For most of 2019 a QoI (not correctness) regression in Linux's memory reclamation and page buffer code resulted in JVM processes on our k8s clusters doing heavy disk I/O and generating a lot of buffer cache churn, indirectly triggering OOM on a daily basis and causing unrelated, twice-removed processes to be terminated as various allocation requests (often internal to the kernel) timed out. Worse, processes could hang indefinitely on various mutexes in the kernel that are taken as part of the "great lengths" Linux goes through. Sometimes very important services would get killed, like systemd or docker.

Facebook and Google understand the potential for this problem well, which is why on all their clusters they run their own userspace daemons which try to predict imminent invocation of the OOM killer and throttle or shoot down processes according to their much more sophisticated heuristics and policies. The whole charade of infinite regress is plainly ridiculous, IMO.

Almost everybody would be better off with a simple rule that the process requesting an allocation is the one killed. That's not as good as strict accounting giving developers and processes the option to continue or die at the point of commitment (much software does in fact exit on malloc failure, while a not insignificant amount of software could indeed successfully and correctly continue), but it would still be preferable. Then a ton of overwrought and buggy code in the kernel could disappear overnight, and smart applications would have a better path forward for achieving more correct, more deterministic behavior. But what we have instead is a horrible hack deeply embedded in the kernel so that mostly mythical programs (i.e. software that preallocated most of system memory but then forked) could run (non-deterministically, of course) on early Linux systems.


Thanks for sharing in this detail. I don't have production SRE experience, so this is fascinating to read. I understand that the OOM killer has caused you quite a lot of pain!

I shouldn't have even mentioned the OOM killer. My point was merely: Due to those great lengths, memory availability is highly dynamic. Committing memory up front prevents optimizations where the kernel can reclaim memory "just in time" to allow a mapping to succeed and not require killing anything.


> I believe Windows does not do this

Windows can do both: https://docs.microsoft.com/en-us/windows/win32/memory/reserv...
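
For the curious, a rough sketch of the reserve/commit distinction that page documents, under the usual Win32 semantics (not the article's Linux example):

  #include <windows.h>
  #include <stdio.h>

  int main(void) {
      SIZE_T len = 64 * 1024 * 1024;

      /* Reserve address space only: nothing is charged against commit yet. */
      void *base = VirtualAlloc(NULL, len, MEM_RESERVE, PAGE_NOACCESS);
      if (!base) { printf("reserve failed\n"); return 1; }

      /* Commit one page out of the reservation: commit charge rises now,
         though the physical page still tends to arrive on first touch. */
      void *page = VirtualAlloc(base, 4096, MEM_COMMIT, PAGE_READWRITE);
      if (!page) { printf("commit failed\n"); return 1; }

      ((char *)page)[0] = 1;  /* first touch */

      VirtualFree(base, 0, MEM_RELEASE);
      return 0;
  }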


Paging out to disk is not going to go away until memory gets as cheap as disk, or until we get much better at not loading stuff into memory that isn't useful.

When you have a server with virtual machines, you have a bunch of stuff in memory that just isn't all that useful. This might be libraries and functionality nobody uses, parts of process memory that are used only at startup, etc.

It is always better to page that memory to the disk so that the memory can be actively used to serve demand.

While it is nice to have an application fail in a controlled manner when it allocates memory, the reality of datacenter life is that once a process hits its memory limit it can no longer be trusted to behave correctly and should be restarted, ideally along with the entire instance.

It is far better to have something kill the process than to have it try to continue after a failed allocation request, as most processes are not well tested against failing at a random allocation and then continuing with no possibility of allocating more memory.


> When you have a server with virtual machines, you have a bunch of stuff in memory that just isn't all that useful. This might be libraries and functionality nobody uses, parts of process memory that are used only at startup, etc.

> It is always better to page that memory to the disk so that the memory can be actively used to serve demand.

That's a good point. But interestingly, not a strong enough one to make, e.g., Google turn on paging out to disk or SSD. (At least not a few years ago. No clue if they do that these days.)

Perhaps what you would want is a hybrid: let the OS page stuff out by default so you can free up the libraries and startup code and data structures etc. But if your disk hits some thrashing threshold because the working sets get larger than memory, let some variant of OOM killer loose?


> Perhaps what you would want is a hybrid: let the OS page stuff out by default so you can free up the libraries and startup code and data structures etc. But if your disk hits some thrashing threshold because the working sets get larger than memory, let some variant of OOM killer loose?

Linux does a form of this. File mappings are paged out to disk (reclaimed) by default, and the OOM killer is only triggered if this is not enough. This gets you pretty far, as it includes all binaries and libraries.


Oh, true. My suggestion would mostly only add paging out of seldom used data structures.


The only reason mobile devices don't swap to disk is that they invariably use bottom-of-the-barrel eMMC for storage, which can't handle the wear and tear of swapping. That has nothing to do with the performance implications of swapping "cold" memory pages to disk.


Disk paging should go away, but I think only if the system also supports memory compression. iOS has had page compression for a long time. Compressing a page with a reasonable algorithm is orders of magnitude faster than writing it to disk, even for SSDs.

I've got it enabled on my Pis and little Intel machines[0] with soldered-on RAM. I wish I didn't need to do it as a "hack", but it works, and I'm not doing anything crazy where the performance hit is meaningful.

[0] https://hackaday.com/2020/05/20/zram-boosts-raspberry-pi-per...


> (It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)

A small swap area (like half a gig) can give a useful indicator of excessive memory use that can be hard to get otherwise. If paging to disk hits a certain rate, or swap usage hits 50%, or it grows rapidly, you have a problem you need to solve right away. Sometimes that gives you enough time to move load off the machine and avoid a messy OOM death.
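
As a hedged sketch of that kind of check on Linux, reading SwapTotal/SwapFree from /proc/meminfo (the 50% threshold is just the example above, not a recommendation):

  /* Crude early-warning check: complain when more than half of swap is in use.
     Reads SwapTotal/SwapFree from /proc/meminfo (values are in kB). */
  #include <stdio.h>

  int main(void) {
      FILE *f = fopen("/proc/meminfo", "r");
      if (!f) { perror("/proc/meminfo"); return 1; }

      char line[256];
      long total = -1, free_kb = -1;
      while (fgets(line, sizeof line, f)) {
          sscanf(line, "SwapTotal: %ld kB", &total);
          sscanf(line, "SwapFree: %ld kB", &free_kb);
      }
      fclose(f);

      if (total > 0 && free_kb >= 0) {
          long used = total - free_kb;
          printf("swap used: %ld of %ld kB\n", used, total);
          if (used * 2 > total)
              printf("WARNING: more than 50%% of swap in use\n");
      }
      return 0;
  }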


> (It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)

Note that the kind of paging described in this article, as well as decisions regarding when to allocate space, does not use or require disk at all. This is all about allocated but unused memory.


> (It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)

The “RAM is cheap” argument is one of the reasons why we have lazy developers creating apps like MS Teams (on Windows 10) that take a huge amount of memory (in my case, it’s about 1GB on startup without checking any chats, and I don’t even have many chats or teams in it). It somehow eventually runs out of memory, crashes and restarts. I have 8GB of RAM and allocated swap of the same amount. I’m not running many other RAM-intensive applications (except for the Brave browser with less than a handful of tabs).


Teams is an Electron app.


I know it is. My point on developer laziness still stands.


The author asks for someone to contact them with the answer to this question -- and I'd be interested as well:

"A small mystery remains: if the statistics (from /proc/*/statm, which are used by htop) sync every 64 page faults, how can 1.5 MB of error accumulate? Syncing every 64 faults suggests that the maximum error would be 4 KB times 63 faults = 252 KB of error."
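
For reference, a minimal sketch of reading those counters directly; /proc/*/statm reports values in pages, and the first two fields are total program size and resident set size:

  /* Where those numbers come from: /proc/self/statm, counted in pages. */
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      long size, resident;
      FILE *f = fopen("/proc/self/statm", "r");
      if (!f || fscanf(f, "%ld %ld", &size, &resident) != 2) {
          perror("statm");
          return 1;
      }
      fclose(f);

      long page_kb = sysconf(_SC_PAGESIZE) / 1024;
      printf("virtual: %ld kB, resident: %ld kB\n",
             size * page_kb, resident * page_kb);
      return 0;
  }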



Thanks for surfacing the question, agf. Hopefully someone here can help :)


The Linux overcommit behavior is configurable via sysctl (and maybe per cgroup?); this article describes what happens with the defaults.
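
The knob in question is vm.overcommit_memory; a minimal sketch that just reports the current setting (value meanings per proc(5)):

  /* Report the current overcommit policy. Per proc(5):
     0 = heuristic overcommit (the default), 1 = always overcommit,
     2 = strict accounting (no overcommit). */
  #include <stdio.h>

  int main(void) {
      FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
      int mode;
      if (!f || fscanf(f, "%d", &mode) != 1) {
          perror("overcommit_memory");
          return 1;
      }
      fclose(f);

      const char *desc = mode == 0 ? "heuristic overcommit (default)"
                       : mode == 1 ? "always overcommit"
                       : mode == 2 ? "strict accounting (no overcommit)"
                       : "unknown";
      printf("vm.overcommit_memory = %d (%s)\n", mode, desc);
      return 0;
  }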


It would be interesting to see how the behaviour changes with different overcommit settings.


I read this as "demand padding in school" and thought - wow, now there is a term for that too. Demand padding - the act of adding more and more subjects and homework onto children until they get to the breaking point.



