Some operating systems do that, some don't. IBM's AIX variant of UNIX allocated space when you asked for it, not when an instruction touched the page. The advantage is that you find out if you don't have enough memory when you make the system call to request it, not by having the program killed.
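Under strict accounting of that sort (or on Linux with `vm.overcommit_memory=2`), failure surfaces at allocation time, so the classic NULL check is actually meaningful. A minimal sketch (`checked_alloc` is a hypothetical helper, not a standard API):

```c
#include <stdlib.h>
#include <string.h>

/* Under strict accounting, malloc returns NULL when the system
 * cannot back the request, so the caller can degrade gracefully
 * instead of being killed later on first touch. */
static void *checked_alloc(size_t n) {
    void *p = malloc(n);
    if (p != NULL)
        memset(p, 0, n); /* under strict accounting, touching the
                            pages is safe once malloc succeeded */
    return p;
}
```

Under an overcommitting kernel the NULL check still catches absurd requests, but a "successful" allocation can still get the process killed when the pages are first touched.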
(It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)
While paging out to rotational media is a tremendous latency problem, paging out to an NVMe SSD or something like Optane seems nowhere near as bad. One could also imagine a tiered paging system: first compress pages into another RAM location, then evict the tail of that compressed page set to SSD as needed to make more room in RAM.
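For what it's worth, Linux already implements roughly this tiering as zswap (a compressed RAM pool in front of the swap device). A config sketch, assuming a kernel built with CONFIG_ZSWAP; the paths are the standard sysfs knobs:

```shell
# Enable zswap: pages being swapped out are first compressed into a
# RAM pool; only when that pool fills are they written out to the
# backing swap device.
echo 1 > /sys/module/zswap/parameters/enabled
# Cap the compressed pool at 20% of RAM.
echo 20 > /sys/module/zswap/parameters/max_pool_percent
```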
> The advantage is that you find out if you don't have enough memory when you make the system call to request it, not by having the program killed.
I wonder how much of an advantage this is. Memory availability is constantly in flux, and Linux, at least, goes to great lengths to reclaim memory before invoking the OOM killer, which is a last resort. What if memory frees up immediately after the application gets the error return, but before the app would have touched it?
I'd trust the kernel more to manage this than applications themselves, because the kernel can make much better decisions on how to juggle things (evicting caches and whatnot). Plus, we all know how well we all test our error handling paths :)
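For context, Linux makes this tradeoff tunable via the `vm.overcommit_memory` sysctl (documented in proc(5) and the kernel's overcommit-accounting notes):

```shell
# 0: heuristic overcommit (default) - obviously excessive requests
#    are refused, the OOM killer handles the rest
# 1: always overcommit - never refuse an allocation
# 2: strict accounting - commit limit = swap + overcommit_ratio% of
#    RAM; allocations beyond it fail at the system call
sysctl vm.overcommit_memory
```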
> (It may be time for paging out to disk to go away. It's a huge performance hit. Mobile doesn't do it. And RAM is so cheap.)
In case anyone is interested in learning more, this is an excellent article discussing the nuances of paging to disk: https://chrisdown.name/2018/01/02/in-defence-of-swap.html
Re: mobile, does Android disable paging of file mappings to disk? That happens even if swap is disabled (swap only affects anonymous mappings).
Whether or not it goes to "great lengths" before invoking the OOM killer is irrelevant. 1) On heavily loaded systems the OOM killer can trigger regularly, causing all manner of havoc because it can non-deterministically kill unrelated processes, and 2) a good kernel should go to great lengths to service a memory request, but that merely raises the question of whether those great lengths should involve shooting down unrelated processes.
> What if memory frees up immediately after the application gets the error return, but before the app touches it?
What if what? What if I try to create a TCP connection and it is refused, or times out, when an attempt a moment later would have succeeded? What does that have to do with whether the kernel should start killing unrelated TCP streams?
> I'd trust the kernel more to manage this than applications themselves, because the kernel can make much better decisions on how to juggle things (evicting caches and whatnot). Plus, we all know how well we all test our error handling paths :)
The kernel has very little information with which to decide which process to kill in response to resource exhaustion, presuming killing any process is even the correct response. It's a difficult problem, and the OOM killer provides one solution that is acceptable to those who wouldn't otherwise care, at the expense of software that cares strongly and wants to implement correct, deterministic handling. Moreover, we're talking about the kernel here. Generally speaking, kernels should provide mechanism, not policy. By deeply conflating mechanism and policy, the OOM killer is a fundamentally and fatally flawed approach for a general-purpose kernel. There is no way to "fix" an OOM killer without effectively erasing the line between application software and the kernel. That might be fine for embedded systems, but for a general-purpose, multi-user system (which in the age of k8s is making a comeback), it's just plain wrong.
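To be fair, Linux does expose one small mechanism for feeding policy into the OOM killer: per-process `oom_score_adj` (range -1000 to 1000). A sketch of a process volunteering itself as a preferred victim; raising your own score requires no privileges, while exempting yourself (-1000) needs CAP_SYS_RESOURCE:

```c
#include <stdio.h>

/* Write an OOM score adjustment for the calling process. Positive
 * values make it a preferred victim; -1000 exempts it entirely
 * (privileged). Returns 0 on success, -1 on failure. */
static int set_oom_score_adj(int adj) {
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) return -1;
    int rc = fprintf(f, "%d\n", adj) > 0 ? 0 : -1;
    if (fclose(f) != 0) rc = -1;
    return rc;
}
```

This is still just a knob on the same policy engine, which is rather the point of the comment above.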
For most of 2019 a QoI (not correctness) regression in Linux's memory reclamation and page buffer code resulted in JVM processes on our k8s clusters doing heavy disk I/O and generating a lot of buffer cache churn, indirectly triggering OOM on a daily basis and causing unrelated, twice-removed processes to be terminated as various allocation requests (often internal to the kernel) timed out. Worse, processes could hang indefinitely on various mutexes in the kernel that are taken as part of the "great lengths" Linux goes through. Sometimes very important services would get killed, like systemd or docker. Facebook and Google understand the potential for this problem well, which is why on all their clusters they run their own userspace daemons which try to predict imminent invocation of the OOM killer and throttle or shoot down processes according to their much more sophisticated heuristics and policies.

The whole charade of infinite regress is plainly ridiculous, IMO. Almost everybody would be better off with a simple rule that the process requesting an allocation gets killed. That's not as good as strict accounting, which gives developers and processes the option to continue or die at the point of commitment (much software does in fact exit on malloc failure, while a not insignificant amount of software could indeed successfully and correctly continue), but it's still preferable. Then a ton of overwrought and buggy code in the kernel could disappear overnight, and smart applications would have a better path forward for achieving more correct, more deterministic behavior. But what we have instead is a horrible hack deeply embedded in the kernel so that mostly mythical programs (i.e. software that preallocated most of system memory but then forked) could run (non-deterministically, of course) on early Linux systems.
I shouldn't have even mentioned the OOM killer. My point was merely: Due to those great lengths, memory availability is highly dynamic. Committing memory up front prevents optimizations where the kernel can reclaim memory "just in time" to allow a mapping to succeed and not require killing anything.
Windows can do both: https://docs.microsoft.com/en-us/windows/win32/memory/reserv...
When you have a server with virtual machines, you have a bunch of stuff in memory that just isn't all that useful. This might be libraries and functionality nobody uses, parts of process memory that are used only at startup, etc.
It is always better to page that memory to the disk so that the memory can be actively used to serve demand.
While it is nice to have an application fail in a controlled manner when it allocates memory, the reality of datacenter life is that once a process hits a memory limit it can no longer be trusted to behave correctly and should be restarted, ideally along with the entire instance.
It is far better to have something kill the process than to have it try to continue after a failed allocation request, as most processes are not well tested against failing at a random allocation and continuing with no possibility of allocating more memory.
> It is always better to page that memory to the disk so that the memory can be actively used to serve demand.
That's a good point. But interestingly, not a strong enough one to make e.g. Google turn on paging out to disk or SSD. (At least not a few years ago. No clue if they do that these days.)
Perhaps what you would want is a hybrid: let the OS page stuff out by default so you can free up the libraries and startup code and data structures etc. But if your disk hits some thrashing threshold because the working sets get larger than memory, let some variant of OOM killer loose?
Linux does a form of this. File mappings are paged out to disk (reclaimed) by default, and the OOM killer is only triggered if this is not enough. This gets you pretty far, as it includes all binaries and libraries.
I've got it enabled on my Pis and little Intel machines with soldered-on RAM. I wish I didn't need to do it as a "hack", but it works and I'm not doing anything crazy where the performance hit is meaningful.
A small swap area (like half a gig) can give a useful indicator of too much memory use that can be hard to get otherwise. If your disk paging hits a certain rate, or you hit 50% used, or rapid growth, you have a problem you need to solve right away. Sometimes that gives you enough time to move load and avoid a messy OOM death.
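A sketch of that indicator, assuming the standard Linux /proc/meminfo fields (the function name and thresholds are illustrative):

```c
#include <stdio.h>

/* Return swap usage as a percentage (0-100), or -1 if no swap is
 * configured or /proc/meminfo cannot be read. An alerting daemon
 * might page someone when this crosses, say, 50. */
static int swap_used_percent(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    long total = -1, freekb = -1;
    char line[256];
    while (fgets(line, sizeof line, f)) {
        sscanf(line, "SwapTotal: %ld kB", &total);
        sscanf(line, "SwapFree: %ld kB", &freekb);
    }
    fclose(f);
    if (total <= 0 || freekb < 0) return -1;
    return (int)(100 * (total - freekb) / total);
}
```

Watching the rate of change of this number (or of pswpin/pswpout in /proc/vmstat) catches the "rapid growth" case mentioned above.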
Note that the kind of paging described in this article, as well as decisions regarding when to allocate space, does not use or require disk at all. This is all about allocated but unused memory.
The “RAM is cheap” argument is one of the reasons why we have lazy developers creating apps like MS Teams (on Windows 10) that takes a huge amount of memory (in my case, it’s about 1GB on startup without checking any chats, and I don’t even have many chats and teams in it). It somehow eventually runs out of memory, crashes and restarts. I have 8GB of RAM and allocated swap of the same amount. I’m not running many other RAM intensive applications (except for Brave browser with less than a handful of tabs).
A small mystery remains: if the statistics from /proc/*/statm which are used by htop
(Sorry for the weird formatting: I wanted to include my comment around what statistics he's referring to, but asterisks can't be used unless preformatted, AFAICT.)