It's important to remember the biggest reason why overcommit exists on Linux and macOS: fork(). When a process forks, the vast majority of the child process's memory is safely shared with the parent process, due to copy-on-write. But a strict accounting would say that the total memory usage of the system has doubled, which is too conservative in most cases. Since forking is so common on Unix, overcommit ends up being necessary pretty quickly. Otherwise, fork/exec would stop working in any process using more than half of the system memory.
wouldn't you just account the COW pages against the parent until they are copied?
Kicking the can down the road means there is no longer a reasonable correction available (failing the allocation); instead we get to drive around randomly trying to find something to kill.
this is particularly annoying if you are running a service. there is no hope for it to recover - for example by flushing a cache. instead the OS looks around - sees this fat process just sitting there, and .. good news, we have plenty of memory now.
> wouldn't you just account the COW pages against the parent until they are copied?
That's what Linux does. But how do you return ENOMEM when the copy does happen and now the system is out of memory? Memory writes don't return error codes. The best you could do is send a signal, which is exactly what the OOM killer does.
That's why you ensure you have enough swap to cover the worst case scenario.
You use swap not because you need to actually use it, but in order to be able to guarantee there is enough memory available if the worst-case scenario happens. In normal circumstances, swap should never really be utilised.
> You use swap not because you need to actually use it […] In normal circumstances, swap should never really be utilised.
Swap is actually there so anonymous pages can be evicted, and is often used long before there is memory contention, during normal operation.
Not having swap means that only file-backed pages can be evicted. During memory contention, this can cause thrashing. During "normal" operation, it degrades performance.
Swap only being used when memory runs out, as kind of "emergency RAM", is a very widespread misunderstanding.
But what happens when the kernel needs to copy one and has run out of memory? You still get a random process killed.
(I note that Windows has a different approach, with "reserve" vs "commit", but nobody regards that as a preferential reason for using Windows as a server OS)
> But what happens when the kernel needs to copy one and has run out of memory?
Don't allow it. Fork the process with read only pages except for the ranges passed to fork(). Count read-write pages as used memory by the child process.
If the forked process wants to write to a page that's read-only it'll have to do a system call to turn it read-write. That call can then fail if there's not enough free memory to copy the pages.
> Don't allow it. Fork the process with read only pages except for the ranges passed to fork().
This doesn't work because of the horrible fork/exec design. If I am a huge process and I want to run `ls`, I will first have to clone myself using fork(), which may trigger an OOM, even if the first action of my clone would have been exec(), ignoring all of that memory.
I am always surprised that no one has added a sane 'spawn process' primitive to replace fork/exec. Especially since fork() without exec() only really works in single-threaded processes.
Interesting, I'd never learned about that. If I understand correctly, this actually used to be implemented as a fork()+exec() on many systems, though. It's actually vfork() that addressed the problem and allowed an actually useful implementation of posix_spawn in Linux at least.
You'd presumably need the parent to also do a system call to mark pages read-write after forking, and that would need to include the stack pages, and that sounds like lots of fun; especially since sometimes library functions fork (although most of that is to fork/exec and more specific apis for that exist now). Not that it isn't loads of fun anyway, but woe betide thee who forks and threads and makes the other threads have a read-only stack.
Probably you'd get people just marking as read-write and returning in sigsegv handlers, which I'm sure has great security properties... OTOH, at least there's an opportunity to deny the remap in the handler and get a decent crashdump from the program or a sliver of hope for managing the situation.
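The pattern being described - catch SIGSEGV, flip the page to read-write in the handler, and retry - looks roughly like this. A minimal sketch (hypothetical as a fork design, though GCs and some databases use the same trick for write barriers):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    /* Round the faulting address down to a page boundary. */
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));
    /* Try to make it writable; mprotect isn't formally async-signal-safe,
     * but it's a plain syscall and works in practice. If it fails, there
     * is no sane recovery here. */
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);   /* or re-raise SIGSEGV to get a crash dump */
    /* Returning retries the faulting instruction, which now succeeds. */
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* A read-only page standing in for a COW page the child may not write. */
    char *p = mmap(NULL, page_size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 'x';   /* faults; the handler flips protection and the write retries */
    printf("wrote %c\n", p[0]);
    return 0;
}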
That also means that if you used seccomp to restrict mprotect syscall, the child process may never ever write again, which is an unfortunate design choice.
> there is no hope for it to recover - for example by flushing a cache. instead the OS looks around - sees this fat process just sitting there, and .. good news, we have plenty of memory now.
Overcommit was a godsend in the times of expensive memory, when people used virtual memory on disk (so low-use memory pages would spill there instead of triggering a kill). Of course these days, with the abundance of cheap memory and people no longer configuring virtual memory, we get the situation you describe.
Virtual memory is always on... If you actually hit swap the system effectively deadlocks anyway because the swap daemon is too dumb by being too fair. Persistent memory might change this but it needs so much work.
It was a trip down memory lane: in the years before my first Linux in 1995 I had been dealing with Windows, where configuring swap was called "configuring virtual memory", so I automatically used that term.
> If you actually hit swap the system effectively deadlocks anyway
That depends. In many cases in the past the options would be either with swap and thus slow, or pretty much not at all. And again, these days there is so much memory that there is always a way to avoid swapping. Though I've met funny situations in recent years, like when a several-terabyte-sized database process would get killed by the OOM killer on a machine with overcommit left on and no swap configured.
OOM kill exists because swap is not being useful. A process failure is something you can deal with, livelock isn't. If you have an algorithm that works out of core then you'll never hit swap because you maintain the working set below memory limits.
Indeed. It's totally dumb that the OS is allowed to lie to you when you've asked for some resources. And because of this behaviour people no longer check for success from malloc(), etc., because of laziness. It's a bad situation.
Please don’t blindly declare it ‘totally dumb’. If you disallow overcommit, you can end up with a system that can’t fork and run /bin/true, even if there are gigabytes of memory left.
Both styles of memory allocation have their uses, and their drawbacks, but please understand them before declaring many OS designers as stupid and dumb.
I agree with you on all counts, but it's worth highlighting that the parent did not call the designers dumb, but rather the situation and implementation.
I know it can seem like a distinction without a difference, but I think it's fair to critique work. I think they could have been more articulate and considerate towards the designers, and have some empathy that many people have worked very hard on it, and did their best in the problem/solution space they were working with.
I just think it's important that people stay objective on if the quality of the people themselves is in question. It's not ok if it is, but I don't think it was here.
Now, people that don't check what malloc() returns being blanketly labeled as lazy is an example of it being about people. People don't ignore the possibility of a nullptr return from malloc because they're lazy. They ignore it because it's hard -- even if you do catch it, there's very little that you can actually do. You can't dynamically allocate... so I hope you've got enough stack space to do what you've gotta do. And even then, you have to be able to propagate that there's no memory all the way up the stack as it unwinds... and check every single allocation.
The cpp world is mildly better in that it can throw std::bad_alloc...but if you have something that winds up doing an allocation in a destructor, I imagine that's not a fun time.
Most of the time, there's not really a better thing to do than crash -- and there's not a lot of incentive to put any work into it. It's not something that should be happening on any sort of regular basis.
But don't you need to check the return value of malloc to then reliable crash/terminate the program? It is like if your car runs out of gas you need to stop, but you don't abandon the steering wheel at high speed and hope for the best.
I suppose vfork might help with that (though I don't understand why they don't directly add fork_and_execve, it would seem much easier).
Also, the problem with Linux is not having overcommit, but not being able to choose when to overcommit and when not to. Windows makes that easier, AFAIU.
They do. Man posix_spawn. It comes with its own dsl to be able to support a small subset of all operations that are often performed between fork and execv.
vfork is usually a better solution.
Edit: and yes, I would love to be able to disable overcommit per process.
I think the biggest downside to vfork is (from the Linux manpage) "the behavior is undefined if the process created by vfork() [...] calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions."
So strictly speaking I think doing any of those operations documented by posix_spawn yourself between vfork and execv is undefined behavior. In practice, I believe it's fine, and on glibc posix_spawn is apparently written in terms of vfork, but libc is allowed to make assumptions that portable programs shouldn't.
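Since posix_spawn() keeps coming up, here is a minimal usage sketch (nothing here is specific to any one libc): run `ls -l` with stdin redirected to /dev/null and wait for it.

#include <fcntl.h>
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

extern char **environ;

int run_ls(void) {
    pid_t pid;
    char *argv[] = { "ls", "-l", NULL };

    /* The "DSL": file actions (and spawn attributes, unused here) stand in
     * for the code you would otherwise run between fork() and exec(). */
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_addopen(&fa, 0, "/dev/null", O_RDONLY, 0);

    int err = posix_spawnp(&pid, "ls", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (err != 0) {   /* posix_spawn returns an errno value, not -1 */
        fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
        return -1;
    }

    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}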
Story time: I once worked on a Unix-based system with no MMU. The fork implementation did something insane: it looked for things that were possibly pointers (four-byte aligned memory locations which can be interpreted as valid physical memory locations on this system allocated to that program) and adjusted them. It was kind of like a conservative GC in that it treated anything that looked like a pointer as if it were a pointer. But no reasonable person would call modifying memory that may or may not be pointers to be "conservative". Amazingly, it worked most of the time, but sometimes strings got corrupted, and as a workaround there were a bunch of places where string buffers were fully zeroed where just a NUL byte would otherwise do.
This was early in my career, and I didn't design the system anyway. If I were working on it today I'd remove the fork implementation and make everything use vfork and/or posix_spawn instead. Apparently that's what posix_spawn was made for.
Another weird problem with the lack of MMU: the compiler also didn't use register-based addressing (what's the term? like position-independent code but for the data segment?), so global variables were truly global, not just global to the process. A bunch of code needed to be "deglobalized" to deal with this. I wanted to improve the compiler, but long story short I got offered another job first.
No-MMU Linux usually just disables fork and uses vfork instead. The really annoying part is that you can't use shared libraries without doubling the size of function pointers (FD-PIC ABI).
> The really annoying part is that you can't use shared libraries without doubling the size of function pointers (FD-PIC ABI).
Interesting, thanks. That didn't come up on our system—not only did we not use shared libraries but we also linked the whole system into one binary, kernel and all. Link times were atrocious in combination with identical code folding and big VLIW sentences, but it worked.
This paper [1] argues that fork needs to be deprecated in favor of posix_spawn() or other more modern solutions. It claims, among many other reasons, that fork encourages overcommit and that programs such as Redis are extraordinarily constrained if overcommit is disabled - because of fork.
> They do. Man posix_spawn. It comes with its own dsl to be able to support a small subset of all operations that are often performed between fork and execv.
In other words, you write your C code inside a C string instead of in C.
It’s a weakness of the fork()+exec() model, for sure. However, creating a fork_and_execve() API is extremely tricky. Just think of all the innumerable setup options you would need to give it, e.g. what file handles should be closed or left open? What directory should it start in? What environment variables should be set - or cleared? And on and on and on…
the flexibility of a separate fork() then exec() means you can set up the initial state of a new process exactly as you want, by doing whatever work is needed between the two calls. If you merge them into one, then you will never be able to encapsulate all of that.
I mean, posix_spawn exists. It's a messy function, but its job is messy for exactly the reasons you describe. (FWIW, there are very few things you can legally perform between fork and exec.)
Are there any things that are illegal between fork and exec? It is perfectly legal to exec whenever you want, and it it perfectly legal to fork whenever you want. I'm not aware of any requirements around the fork/exec sequence.
For a single-threaded process, sure. But in a multi threaded context, almost any operation done in the child after fork() can royally mess up the system. There are many versions of libcs where malloc() after fork() is likely to deadlock between the parent and child processes, as they share the internal malloc() locks.
“After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2)“
For example, you can't do any dynamic memory allocation or call a function that may allocate.
There are no restrictions in that sense. However the computer might not do what you intended if you were to do something like fputc('x',stdout);fork();fflush(0);
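Spelled out, that example looks like the sketch below; it's perfectly legal, it just duplicates the buffered output:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    fputc('x', stdout);   /* 'x' sits in stdio's userspace buffer */
    fork();               /* the child gets a copy of that buffer */
    fflush(NULL);         /* parent and child each flush their own 'x' */
    return 0;             /* output: "xx" */
}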
I think you mean between vfork and exec. The Linux manpage says:
Standard description
(From POSIX.1) The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
Note that modern POSIX recommends just calling fork() because "[vfork] was previously under-specified". The glibc man pages shouldn't say undefined since that's too strong a word. There is so much you can actually do in practice during vfork(). Pretty much all system calls are just as safe. It's mostly a question of whether or not your userspace libraries might clobber some global, due to how vfork puts the process in a state where implementation details that are normally private become part of the public surface area of these apis, and not many library authors are disposed to offer those kinds of assurances.
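For reference, the strictly-conforming pattern is exec-or-_exit-only; anything beyond that leans on the practical behavior described above. A sketch:

#include <unistd.h>

/* The strictly-conforming vfork() pattern: the child does nothing except
 * exec or _exit. */
pid_t spawn_true(void) {
    pid_t pid = vfork();
    if (pid == 0) {
        /* Child: still sharing the parent's address space. */
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);   /* exec failed; do NOT call exit() or return */
    }
    return pid;       /* parent resumes only after the child execs or exits */
}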
> The proper semantics for starting a new process is..
:) Ah.. I guess when "I was younger" (TM) I was also so dogmatic on most things computer-related.
Anyway, when we need to do something a bit more complex than the equivalent of system(), it quickly becomes evident that in many cases we need to prepare ourselves for the future execve().
Here's the list of syscalls/c-funcs, which are called in two random projects I maintain, after fork() and before execve() (or execveat() or fexecve()).
alarm(0); /* disable alarms */
setenv(); /* A couple of required envs, like MALLOC_PERTURB_ */
prctl(PR_SET_DUMPABLE, 1) /* regarding ptrace()-attach */
syscall(__NR_personality, ADDR_NO_RANDOMIZE); /* disable ASLR for debugging, if needed */
socketpair() /* reliable execve success detection, some form of witchcraft */
setpriority()
prctl(PR_SET_PDEATHSIG, SIGKILL); /* die upon parents death */
setrlimit(); /* set or reset rlimits */
lseek(fd, 0, SEEK_SET); /* rewind input file for this specific subprocess */
/* prepare arguments (argv) for execve dynamically */
sysconf(_SC_NPROCESSORS_ONLN); pthread_setaffinity_np(); /* pin subprocess to a list of CPUs */
/*
LOTS of functions here
if we wanted to use net/process/mount namespacing
e.g:
assigning IP addresses to interfaces
creating custom views of the filesystem tree
modifying capability sets
*/
open("/proc/self/oom_score_adj"), write(), close(); /* adjustment of oom score */
open("/proc/self/fd", O_DIRECTORY); getdents(); fcntl(F_GETFD); fcntl(F_SETFD, FD_CLOEXEC); close() /* closing fds upon exec */
setsid(); /* new session */
sigprocmask(empty_set); /* reset signal mask */
open("/dev/null"); dup2(null, 0..1); /* close fd 0,1,2 */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER); /* application of sandboxing */
and finally execv() or execveat()
Granted, the projects I maintain are probably on the heavy side of things when it comes to process manipulation before execve, but putting all of that into some control structure would be close to impossible for me. Such a structure would have to be so extensible that it'd effectively have to be some form of VM, I guess. So.. having the ability to simply call a couple of syscalls from the context of a regular new process, before execve(), is quite good here.
Sure.. maybe we should have some simple form of fork/execv, for those who want to call system() or popen() and not hit the memory overcommit related crashes.
But not as a replacement, rather a new syscall. Even so, debugging why a process creation/execution failed would be madness, given that you would simply get EINVAL, and the failure could be related to any of a dozen parameters in a process-creation control structure.
int process_handle = spawn("/bin/true", SP_PAUSED); // create process but don't execute, and return a handle
// most API would take a process handle, thus you could do stuff (such as prctl) on the new process
unpause(process_handle);
There's a bit of a move in this direction in the Linux API with the PIDFD stuff.
Sounds reasonable. I guess most process controlling syscalls (ptrace, prctl, seccomp, bpf, remotely operating on pid's FDs, setsid, signal masks) will have to be replaced to be able to operate on remote process or on a fd, but yeah, this can work, potentially.
If it's all syscalls (i.e. no libraries messing with hidden state) and you don't mess with parent's memory, you might be able to do this all with clone(CLONE_VM|CLONE_VFORK) instead of fork(). This is what musl uses to implement posix_spawn().
I think setenv() will use malloc() (it might use mutexes), same for error printing if it uses printf() or perror().
Also, this is all a mess if fork() is called from a multi-threaded context. Esp. via clone(), which up to a certain glibc version cached the getpid() result and returned the parent's PID in the child (for performance reasons, of course ;).
Allocated-but-unavailable is a totally reasonable part of the memory hierarchy.
Main Memory
=> zswap (compressed memory)
=> swap
In this case, the pages may be logically allocated or not -- the assurance is that the data will be the value you expect it to be when it becomes resident.
Should those pages be uninitialized, the "Swapped" state is really just "Remember that this thing was all zeros."
We could do computing your way, but it'd be phenomenally more expensive. I know this because every thing we introduce to the hierarchy in practice makes computing phenomenally less expensive.
"We could do computing your way, but it'd be phenomenally more expensive"
It must be viable - Windows prevents overcommit. But it has slow child-process-creation (edit: previously said "forking"), and this steers development towards native threads which is its own set of problems.
I had never previously joined the dots on the point pcwalton makes at the top of this thread. It is a dramatic trade-off.
I'm not sure there's any causality between Windows preventing overcommit and Windows having slow process creation- there's nothing inherently slower about CreateProcess()/posix_spawn() than fork() + exec() as an API.
It seems more that overcommit is a workaround for the way fork() can potentially lead to copying the entire address space into new pages, but usually doesn't. Because CreateProcess() knows more precisely how much to allocate before it returns to the child process, it can just reserve that amount and signal an error immediately if there's not enough memory+swap to back it.
(And on the other hand, Windows has a lot of legacy and backwards compatibility behavior around processes that could easily explain the slower process creation independent of the API.)
Let's say you malloc some memory and the computer actually has everything available.
Everything is great, up until some other process on your system does a fork bomb of an infinitely recursive program that allocates nothing on the heap. You've just got a whole lot of quickly growing stacks hoovering up your physical memory pages.
Stack vs Heap is a userspace level concept. As far as the kernel memory manager knows, it's all just allocated memory - some is allocated by the process- or thread-spawning routines, some is allocated by malloc(), but they're using the exact same pool.
If overcommit is disabled and someone has allocated most system memory, fork() and exec() and pthread_create() etc. will theoretically fail with ENOMEM.
A bigger problem on Linux at least is that the kernel will swap out all possible memory before returning an allocation error. Even if you have not allocated any swap space, it will swap out any memory mapped files.
And even if none of your programs have explicitly mmap()ed anything, the actual application code is mapped, so it will start swapping out all code pages to disk before it refuses an allocation. And now this means that your system has become entirely unusable and will have to be hard rebooted, because it is now swapping data to and from disk on every instruction execution after every context switch. At least your CPU will get to stay nice and cool for a while.
The biggest reason overcommit exists is because it allows the system to operate more efficiently. The reality is most applications touch only some of the pages they allocate, and it's silly for the system to fail a malloc. Often times other expensive cleanup activities can be deferred (you don't really want to drop a handy directory entry cache just so an app can be sure it got physically backed memory for its request, 99.99% of the time).
IIUC Linux was really the first OS to make overcommit so prominent. Most systems were a lot more conservative.
Replying up here instead of way down in the branches.
I don't think disabling of overcommit implies that physical pages are mapped immediately. If caches are instantly droppable, you can use a page that's allocated but unused for cache, and drop the cache page (and zero it) when the allocated page is written to.
You'd still have all of your caches until you have memory pressure with actual data written (but of course, with overcommit, you'd drop caches then too), but if you attempt to allocate more than you have (including through fork attempts as discussed elsewhere), you get a system call failure rather than an OOM kill.
In theory, a system without overcommit can run just as efficiently, IF you have reserved huge amounts of swap space. As long as swap is available, the OS can do all the same COW and efficiency tricks. It’s not the physically backed memory is the limiting factor, it’s RAM+swap
I've found the linux memory system to be far too complicated to understand for quite some time, compared to what's documented in, for example, The Design and Implementation of The FreeBSD Operating System, for a more comprehensible system.
dentry caches exist with and without overcommit, but you get higher cache hit rates with overcommit, because you flush them less often. Depending on workload, this can matter a lot. It mattered more in the time of hard drives.
If overcommit is disabled, then the system drops its internal caches (dentry cache, pagecache) to satisfy a memory allocation. Since most applications don't touch the pages they allocate, that means the OS could have avoided dropping the caches. Since it did drop the caches, other parts of the system will then have to do more work to reconstruct them (looking up a dentry explicitly, or loading a page from disk instead of RAM).
Everything I'm describing is about a busy server with heterogeneous workloads of specific types.
I think what joosters is trying to say is that instead of overcommitting, a system could just allocate memory directly in swap. Both creating the page in swap and swapping it in on first write can be mostly optimized away apart from the bookkeeping, but you guarantee that you always have enough RAM+swap to provide what you have committed to.
> if overcommit is disabled, then the system drops its internal caches (dentry cache, pagecache) to satisfy a memory allocation.
Couldn't the system just reserve the pages for future use by the application but still use them for caching until the application actually tries to use them?
It doesn't need to reserve a specific page during allocation, just ensure that the number of reserved pages is smaller than the total number of pages. I would expect at least read-only caches to mostly behave like they currently do on a system with overcommit enabled.
> Otherwise, fork/exec would stop working in any process using more than half of the system memory.
Somehow Solaris manages just fine.
And don't forget that swap memory exists. Ironically, using overcommit without swap is asking for trouble on Linux. Overcommit or no overcommit, the Linux VM and page buffer systems are designed with the expectation of swap.
Solaris does have this problem! If you have a huge program running, fork()ing it can fail on Solaris, even if you only want to exec a tiny program. The key way of avoiding this is to ensure you have lots and lots of swap space.
Even without overcommit, swap makes the system work better, as unused pages can be written to disk and that memory used for disk cache. I think it can also help with defragmentation of memory, though I'm not sure how that process actually works.
Solaris has strict memory accounting; fork will fail if the system can't guarantee space for all non-shared anonymous memory. macOS has overcommit (all BSDs do to some extent, at least for fork), but it also automatically creates swap space so you rarely encounter issues one way or another in practice.
fork and malloc can also fail in Linux even with overcommit enabled (rlimits, but also OOM killer racing with I/O page dirtying triggering best-effort timeout), so Linux buys you a little convenience at the cost of making it impossibly difficult to actually guarantee behavior when it matters most.
I don't use Solaris, but I see they have vfork [1] which should be unaffected. Likewise posix_spawn{,p}. (I think those are often implemented in userspace via vfork; unsure if there's a separate system call on Solaris.)
Most processes that fork know that they are going to fork. Therefore they can pre-fork very early, so as to commit the memory when they only have a bare minimum allocated. Your typical daemon does this anyway.
Some other type of process, like an interpreter that can subshell out, doesn't know how big its allocation is going to get and would have to pre-fork early on.
In this way, you wouldn't "need" overcommit and the Linux horror of OOM. Well, perhaps you don't need it so badly. Programs that use sparse arrays without mmap() probably need overcommit or lots of swap.
I first learned about this while trying to understand processes being killed by OOM in production. We had python 2.x batch jobs being executed by long-running python worker processes -- some of the arbitrary application code in some of the arbitrary batch jobs would occasionally want to execute some command line tool in a new process, and to create the new process under the hood python's subprocess library would fork/exec, and if the parent worker process had already accumulated a large virtual memory footprint, linux's approximate memory accounting heuristics would kick in during the attempted "fork", decide that we were obviously going to run out of physical memory, and kill the process.
We didn't actually want to fork anything and share gigabytes of virtual memory with the child process, we wanted to spawn an almost entirely independent process to do something and report results, but that got implemented under the hood by fork.
Spawning processes is one area where Windows is more elegant than Linux: Windows offers spawn. Apparently macOS and Solaris implement posix_spawn in a way that avoids the complications of fork/exec.
Linux offers posix_spawn, which apparently may or may not call fork under the hood depending on which libc you're using. If libc implements posix_spawn by calling fork then you're back in the same mess with Linux's heuristic memory accounting and overcommit. E.g. old versions of glibc will fork when you posix_spawn, newer versions of glibc may vfork; musl apparently will always vfork.
It looks like cpython's subprocess.Popen was patched in python 3.8 to detect some cases where posix_spawn can be used -- it reads as if it will only kick in on linux if it detects a sufficiently new version of glibc: https://github.com/python/cpython/blob/main/Lib/subprocess.p...
Sounds to me like you (or Python devs?) should be looking at clone(), which is basically a lower level version of fork() that provides precise control over every detail with which the new process is created, including what (if any) parent memory you want to share.
This. When my team hit the issue we just made a subprocess replacement which relied on clone instead. Before that, we would trigger OOM when trying to run external commands while still having 30% free memory.
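A rough sketch of that kind of clone()-based replacement (similar in spirit to what musl's posix_spawn does; the function names here are made up for illustration): share the parent's address space so there is no COW commit charge, and block like vfork until the child execs.

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct spawn_args { char **argv; };

static int child_fn(void *arg) {
    struct spawn_args *a = arg;
    execvp(a->argv[0], a->argv);   /* still sharing the parent's memory here */
    _exit(127);                    /* exec failed */
}

/* Start `argv` without duplicating the parent's address space. */
pid_t spawn_no_fork(char **argv) {
    const size_t stack_size = 64 * 1024;
    /* The child needs its own stack because CLONE_VM shares ours. */
    char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    struct spawn_args a = { argv };
    /* CLONE_VM: no COW copy to account for; CLONE_VFORK: the parent sleeps
     * until the child execs or exits, so `a` and `stack` stay valid. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &a);

    munmap(stack, stack_size);     /* child has already exec'd or died */
    return pid;
}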
it looks like cpython is improving and cpython 3.10 now uses vfork when launching subprocesses in linux
Years ago when I first hit this in production, we ended up working around it by rewriting our application code to use a pure-python library that did the equivalent thing as the separate command line tool we were trying to launch.
I once complained about malloc happily allocating memory that the physical memory system couldn't satisfy (never let your hand write a check your ass can't cash?) but the more experienced programmer asked me if I'd heard of fractional reserve banking, and if not, whether it bothered me too.
> I once complained about malloc happily allocating memory that the physical memory system couldn't satisfy (never let your hand write a check your ass can't cash?) but the more experienced programmer asked me if I'd heard of fractional reserve banking, and if not, whether it bothered me too.
Oh. But is there a scenario where it might be useful to check if a certain amount of memory will be available? Like, say you know a certain process uses 6 gigs of memory but will take a while to get to that point and then fail; is it not safer to just error out earlier?
So, uh, how do you feel about fractional reserve banking? Nearly all banks worldwide practice it. Statistically, it's not impossible that the entire world financial system could collapse due to uncorrelated bank runs.
It is impossible. Only a moron of magnificent magnitude would fail to print additional cash to cover the run.
The problems caused by the Fed's failure to lend to First Bank of America during the Great Depression are well understood by the central banks.
What would likely happen is the overnight rate would go up to 12%, and additional money would be printed to cover withdrawals for the month or two most people would be willing to forgo 12% interest in a potentially inflationary economy.
I mean, the world financial system did almost just collapse, not that long ago, https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80... more or less due to the confluence of several adverse events, each of which probably could have been buffered on its own.
When you say additional money would be printed, I assume you mean the M0 money supply would be increased?
It's not impossible, no, but in the case of my computer a single user - me - controls memory usage.
I guess it's beneficial in >99% of use-cases, and the <1% of other cases can turn it off. Still I guess I'm naive enough to hope a correct program would not crash.
If you're smart, the answer is yes, over reliance on statistical multiplexing scares the shit out of you, because it's all fun and games till the check is due.
Is your issue that the system call doesn't return enough diagnostic information? If so, how would you have done it differently? I'm asking out of curiosity, not out of a reflexive instinct to defend C (which has many problems).
The difficulty is that the lack of memory might be discovered days after it was allocated (an over-committing allocator simply doesn’t know at the time whether there will be memory or not) - how do you asynchronously tell the program that this allocation has now failed in a high-level way?
Generally, UNIX will let you know about the missing memory by sending the process a signal. But by that point, there’s not much that can be done to fix things up - and remember, all the fix up code would have to run without allocating any more memory itself. That’s extremely tricky in C, and nigh-on impossible in other languages.
The first thing that comes to mind is a "reliable" (that won't overcommit) memory allocation syscall, but one might fear that it will be abused (everyone wants the best stuff even if they might not use it after all). The second thing that comes to mind is being able to probe a (virtual) address (or address range) within your address space, to check if it's actually there without doing weird stuff, and eventually just wait until it becomes available. But at this point maybe you could just use file-backed mmap for big data (effectively managing your own swap space)?
My first issue is with the article: while it seems to say it is about C, it is about a different problem which happens to be illustrated with a C program. So what I would do differently (now that memory is cheap): do not do overcommitting, so the OOM killer is out of the picture. Take a different approach to fork, so you can exec commands and don't need to account for the whole address space.
Actually just using the ms windows memory model would be fine.
That’s one of the many reasons I think Windows NT kernel is generally better than Linux.
Windows doesn’t do that. When you don’t have enough memory and not enough page file space either, these allocation functions usually do fail returning nullptr.
But this feature is also why malloc on Windows can take a long time to return - seconds or more if you malloc too much. I recently bumped into this problem on the GPU. Since GPU memory in the Windows Display Driver Model requires it to be backed by the host machine’s virtual memory, a malloc on the GPU when running low on space can end up searching & even defragging virtual memory to satisfy the request. Crazy! Just asking for RAM can cause you to land in heavy disk swap. I’ve seen cases of cudaMalloc taking minutes because of how Windows handles allocation, and the same is true of CPU mallocs.
I have a faint memory of writing custom memory manager and allocating memory in bulk upon program start. But it was still a mess. Eventually had to periodically pause and defragment the memory allocations, rewriting pointers and so on. Well, at least it was fast.
> Since GPU memory in the Windows Display Driver Model requires it to be backed by the host machine’s virtual memory
Wow, this seems really unintuitive to me, especially on an OS that doesn't overcommit. Is it as unintuitive as it sounds, or is there a good reason for this?
There are good reasons, not all of which I know or understand, but my basic mental model is that the backing is required for various scenarios where virtual memory might need to get relocated, either in response to a malloc or to something else. I’m told it’s complicated by having the GPU drive the display, and by scenarios involving multiple GPUs in a system.
User can easily alt+tab into another program which also uses GPU. When many processes are using GPU concurrently, it's possible their combined VRAM use exceeds the amount of physical memory available on the GPU.
Assuming that the work needs to be done? You can either do that on allocation (which you can predict, batch, etc) or you can do that when you need the next page to be used. Which is far harder to predict and work with.
I've found overcommit useful for some scientific computing applications, but you're right it's not the most intuitive default. You can disable memory overcommit on Linux for the same behavior as Windows, if you want.
Disabling overcommit on Linux doesn't mean that library and utility developers have used memory responsibly, so it can be hard to run a system with common software that way.
I keep it disabled on one of my systems for exactly that reason: I want to make sure my own code works with overcommit disabled, as it'll always be on Windows or OpenBSD.
If you really want that you can just tell Linux not to overcommit. I can foresee your next question though, "Wait, why did I run out of memory? Stupid Linux, I have plenty of RAM".
Well, it would be fair to call overcommit-less Linux stupid. There are a number of aspects of Linux that are only reasonable if you assume that memory overcommit is a thing, and would be radically stupid to include in an OS that does not overcommit by default. For example, fork(), which only makes sense with memory overcommit and shared copy-on-write pages. Process is using 50.1% of the physical memory? Guess it can’t spawn processes, since it has to fork() before it can exec() and there’s not enough memory to copy the process!
I’m sure there are other horrendously degraded modes that Linux can theoretically operate in (read-only file system?, unreliable system clock?), but disabling overcommit, while possible, turns Linux into an exceedingly crummy OS.
I'm a little disappointed that the article didn't answer the question, or at least try to. A discussion of using read/write vs mincore vs trying to catch a SIGSEGV would've been a nice addition.
The answer is that you kind of can't. You're at the mercy of the OS to give you accurate information, and malloc as an interface isn't set up to distinguish between virtual memory and "actual" memory. We could imagine a separate interface that would allow the OS to communicate this distinction (or hacks like you allude to), but I don't know of any standard approach.
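For what it's worth, mincore() is the closest thing to the "is this page actually there" probe mentioned above, but it only reports which pages are resident at the instant of the call - it cannot promise a later write won't trigger the OOM killer. A sketch, assuming you're probing a page-aligned (e.g. mmap'd) region:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns how many of the `len` bytes starting at `addr` are currently
 * resident in RAM. `addr` must be page-aligned (mincore requires it), so
 * this is meant for mmap'd regions rather than arbitrary malloc'd blocks. */
size_t resident_bytes(void *addr, size_t len) {
    long ps = sysconf(_SC_PAGESIZE);
    size_t pages = (len + ps - 1) / ps;
    unsigned char *vec = malloc(pages);
    if (!vec)
        return 0;

    size_t resident = 0;
    if (mincore(addr, len, vec) == 0) {
        for (size_t i = 0; i < pages; i++)
            if (vec[i] & 1)          /* low bit set = page is in core */
                resident += ps;
    }
    free(vec);
    return resident;
}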
It has always led to all kinds of problems, like computers becoming inaccessible, and corrupted data because of segfaults between disk accesses.
And also nobody ever checked the malloc result anyway. Competent programmers just ensured that the segfault wasn't a huge problem. So at the best of days, all it did was reduce the capacity of the computer.
This difference between "malloc() succeeded" and "physical/swap memory is actually available" has a corresponding impact on how you measure memory usage.
One approach is RSS, the memory in physical RAM... but what if you're swapping? Then again, maybe you swapped out memory you don't actually need and ignoring swap is fine.
The other approach is "how much memory you allocated", and then you hit fun issues mentioned in this article, like "the OS doesn't actually _really_ allocate until you touch the page".
Facebook put out something about measuring memory use on Linux after I left, and I haven't looked at it... But the best way I've seen is to have swap of size min(512M, 2x RAM) and measure the usage of that. There's some cases where sometimes something bigish gets swapped and you're actually fine, but often you really want to address that anyway.
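To make the two views concrete, here's a rough sketch of reading both from inside a Linux process - peak RSS from getrusage() and current RSS/swap from /proc/self/status; neither number accounts for pages shared with other processes (see the PSS discussion below).

#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Peak RSS via getrusage(), current RSS and swap via /proc/self/status. */
void print_memory_usage(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("peak RSS: %ld KiB\n", ru.ru_maxrss);   /* kilobytes on Linux */

    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0 || strncmp(line, "VmSwap:", 7) == 0)
            fputs(line, stdout);   /* current resident and swapped-out memory */
    fclose(f);
}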
IMO strict memory accounting misses the point. It can ensure that all allocated pages fit in VM (or return an error during allocation), but the more pragmatic memory constraint is whether the working set of pages (including code/mmap-based pages) fits in physical RAM. If that is not satisfied, the system/application crawls to a halt due to page thrashing and some kind of OOM killer is needed. And that may happen even if strict memory accounting is satisfied.
Similarly on macOS. There is a limit of 64 gigs in the VM compressor (in-core and on-disk compressed "segments" combined); when this is reached, a process that owns more than 50% of the compressed memory can be killed.
You can tell the FreeBSD / xnu devs take their job more seriously. A failure in the VM compressor sounds so much more professional than being OOM killed.
It will absolutely fail (first because size_t is only 32b).
What's behind malloc() is what matters.
That being said, I never would have expected that code to ever succeed! Shows how much I take memory allocation for granted on more sophisticated systems. I can't remember the last time I malloc'd more than a few megabytes.
> If you use a system-level tool that reports the memory usage of your processes, you should look at the real memory usage.
Things are a bit more complicated than that, because RSS will contain memory mapped into your process that could also be mapped by other processes. That is, the sum of RSS across your machine can be higher than your physical memory.
That includes libraries dynamically linked to your executable, but more importantly shared memory mmapped to your process.
A more "fair" estimate exists in the form of PSS (or USS), that will list all mapped regions from all process, and account each process a proportional share of the region.
e.g. If 2 processes mmap `/dev/shm/foo` of 1GB, both will inherit 500GB by PSS computation.
if you want to allocate physical memory, allocate address space with mmap() (malloc with options) and use mlock() to wire the physical pages to the addresses in your process. mlock() will fail if there's not enough physical ram to satisfy your request.
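A sketch of that approach; note that unprivileged processes are normally capped by RLIMIT_MEMLOCK (often only a few MiB), so in practice this needs a raised limit or CAP_IPC_LOCK:

#include <stdio.h>
#include <sys/mman.h>

/* Allocate `size` bytes and wire them into physical RAM, or fail cleanly. */
void *alloc_wired(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;                 /* no address space (or strict-commit failure) */

    if (mlock(p, size) != 0) {       /* can't wire the pages: back out */
        perror("mlock");
        munmap(p, size);
        return NULL;
    }
    return p;                        /* resident and exempt from swapping */
}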
Kubernetes made writing poor code a breeze. At work we have microservices crashing 20 times a week but SLOs are not affected since traffic is routed to surviving pods. So we can concentrate on churning features fast instead of writing good code.
K8s might save you from crashes affecting availability, but what about data corruption, logical errors, sound APIs, etc.? I'm guessing if code quality is neglected all of these suffer.
Code quality is not neglected to the point we have logical errors. Architecture is reasonable and we have an extensive suite of end to end tests and integration tests.
No funding needed, we are corporate owned. The business guys demand features like there's no tomorrow so we have to make some trade-offs. If the dev team owned the thing we would have taken other decisions.
the basic pattern is having a load balancer sitting in front of N services, and then having a service manager keep an eye on each service and restart them if they crash. so kubernetes can get the job done, but you can do essentially the same thing with a load balancer & some VMs & using the service manager that comes with your operating system (even windows has one)
> At work we have microservices crashing 20 times a week but SLOs are not affected since traffic is routed to surviving pods.
This probably makes a lot of people have strong negative emotions, but at the same time I feel that you're not wrong and it's the only way to deal with modern web dev, where clients/business push for features instead of quality, versus something like kernel/system software development, where there is more pushback against this for historical and cultural reasons.
At work, we have this one monolith system that's in the center of everything else within a particular project - it's not really scalable and it has multiple scheduled processes within it, as well as serves a lot of external API requests, oh and also has an administrative UI. So far, my attempts to warn people against having a single point of failure like this have fallen on deaf ears and we still have outages where the JVM misbehaves or scheduled processes gobble up all of the server's memory and GC slows everything down on a regular basis.
Contrast this to me finally getting to implement something more like microservices in another project - the services are containerized and run on servers that have been configured with Ansible, are horizontally scalable and have proper load balancing. Furthermore, the scheduled process functionality and others can sit behind feature flags and be enabled within a particular instance, all while not having multiple separate projects and keeping things simple with a single, modular codebase. I actually dubbed this approach "moduliths", horizontally scalable and modular monoliths, since there is no way that this org can handle "proper" microservices, about which I wrote more here: https://blog.kronis.dev/articles/modulith-because-we-need-to...
That said, even the older monolith projects can benefit from modern approaches like Ansible for configuration management (which also prevents situations where environment configuration diverges over time and no one has any idea why) as well as being put into containers - the horrible monolith application now also lives within a container (not yet in prod, sadly) and has built in health checks. Were it ever to break and fail to recover in a set time, then it will automatically restart, making an outage that lasts an hour and possibly makes someone get paged in the middle of the night instead be a minute long interruption before everything restarts.
Personally, I think that with the direction the industry is headed, all services and even servers should be restarted every now and then anyway, since with JVM/CLR you sometimes get weird things happening after a service has been up for months or years. Knowing why that happens would be nice, of course, but no one actually has the time to address those.
> Knowing why that happens would be nice, of course, but no one actually has the time to address those.
More than likely bad user code. Perhaps even race conditions. But in case of the JVM, with flight recorder and other forms of logging you could find out the problem with quite a good chance.
> ...you could find out the problem with quite a good chance.
It's not that it's impossible to do so due to technical limitations. Even without JFR, there's still VisualVM and any number of APM solutions, like JavaMelody, Apache Skywalking, Stagemonitor etc.
It's rather a problem of telling the clients/business:
Hey, look, for the next X days/weeks I won't be developing any new features or tending to your user stories, but instead will attempt to track down this persistent, yet somewhat hard to reproduce problem.
And because of limitations in place that pertain to accessing production environments, this process will likely take much longer than it otherwise should, especially in case of blocking synchronous communications when asking for production logs or heap dumps, which are sometimes wrongly exported after the server restart, which makes them meaningless.
Alternatively, i will spend a similar amount of time attempting to first get the application instrumented and then we'll run into similar challenges regarding the access permissions for those, before returning to the aforementioned attempts to debug and solve the application issues, because adding instrumentation doesn't magically solve those.
Depending on the environment that you work in, this proposition might be accepted, you might find yourself fighting an uphill battle, or people might just look at you like you have two heads; without the proper backing of the other engineers you'll find yourself critiqued both for wasting time on debugging with no guarantee of an actual payoff in the end, and for the application quite possibly still not working.
I'm actually in the middle of implementing an APM solution to hopefully give better insights into how the application works, but in many of the environments out there this will be a Catch 22: https://www.merriam-webster.com/dictionary/catch-22
So, if you have control over the application from day one instead of being onboarded into a maintenance project with SRE not having been a concern throughout its development, consider building for failure - treat it as a "when?" question instead of "whether?" and do what you can to mitigate the actual user impact even when components may fail.
Horizontal scaling is one way to achieve that, and a pretty decent one, as long as you don't attempt to scale your single source of truth.
This made me curious to check the real and virtual memory of some processes on my laptop (MacBook Air M1).
The real memory size of Safari is ~160MB but virtual memory size is 392GB which doesn't look right. I checked other processes and all the processes have similar virtual memory size which is around ~390GB.
I wonder if this is a bug in Activity Monitor or the virtual memory allocations really are this big for each process.
That sounds about right. The runtime of a certain language actually allocates 1TB of virtual memory at startup, and then hands out memory from that pool. It's just reserving 1TB of virtual address space.
Overcommit should not be necessary in order to just reserve a range of VM without committing it. Windows gets this right (requiring explicit commit to use memory if you explicitly reserved it earlier). Demand paging on Linux is super-convenient (and my designs exploit it heavily), but you can’t really write robust programs that way.
One of my interview questions starts with "Can a program allocate more memory than is physically available on the server?"
Everybody gets this wrong (which is funny for a binary question) but it starts an interesting discussion through which I hope to learn how much they know about OS and virtual memory.
Should the question be like this?
"Can a program ~~allocate~~ use more memory than is physically available on the server (assuming no swap)?"
Because malloc allocates from virtual memory, not from physical memory. Only when we access the memory explicitly by read/write for the first time does the page fault happen and page allocation begin. If a page can't be allocated, then the program gets terminated with a signal.
Here, malloc was successful since the allocation from VM was successful. But that is no guarantee that we actually have all the memory we asked for. We can still run out of memory even though we allocated everything at the start of the program... It is so unpredictable...
So, I assume, your question is asked with above intention.
Please correct me if I am wrong. Your question and whole thread made me to think!
-p
No, it's worse than that - the answer is "yes", because virtual memory + overcommit means that most of the time the OS will happily allow you to allocate more memory than physical+swap, and essentially gamble that you won't actually need all of it (and this is implemented because apparently that's almost always a winning bet).
The OS doesn't have to gamble that you won't actually need all the memory you allocate, it could just be a gamble that another memory hogging process exits before you need to use all of your memory, or that you don't need to use all of your memory at the same time.
Yeah. And the issue is that the actual problem happens sometime later when the application actually tries to use that memory. So you replaced an error that is relatively simple to handle with something that is impossible to handle reliably.
So the operating system very much doesn't like to admit it doesn't have physical memory to back the area you are trying to use. Now it does not have a simple way to signal this to the application (there is no longer an option to return an error code) and so either everything slows down (as OS hopes that another process will return a little bit of memory to get things going for a little while) or one of the processes gets killed.
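A tiny sketch of that failure mode: the allocation itself succeeds, and the trouble only starts when the pages are first touched (don't run this with a size larger than RAM+swap unless you want to meet the OOM killer):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t size = (size_t)8 * 1024 * 1024 * 1024;   /* 8 GiB of address space */
    char *p = malloc(size);
    if (!p) {                /* with strict accounting you'd fail here, cleanly */
        perror("malloc");
        return 1;
    }
    puts("allocated - but nothing is physically backed yet");

    /* Touching one byte per page forces the kernel to supply real pages;
     * on an overcommitted system this loop is where you can get killed. */
    for (size_t off = 0; off < size; off += 4096)
        p[off] = 1;

    puts("touched every page");
    free(p);
    return 0;
}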
Interestingly, Linux's OOMKiller actually gives you (the sysadmin or system designer) more control over what happens when the system is low on memory than disabling overcommit.
In a system without overcommit, every process is taking memory out of the shared pool, until some random process is the unlucky one that can't allocate more. In a happy case, that unlucky process also has some data that it can let go of. But this is entirely random - you could have a bunch of gigabyte-sized application caches in half of your processes, but the NTP daemon might be the one that ends up failing because it can't allocate a few more bytes. Even worse, it could be the SSH server or bash failing to spawn a new shell, preventing any kind of intervention on the system.
With OOMKiller, you can at least define some priorities, and ensure some critical processes are never going to be killed or stalled.
With overcommit disabled, the program will fail in a more predictable way. Moreover, if it allocates the memory successfully nothing bad can happen to it later. So you have the option to allocate the memory right at the startup of your program and be sure it is not going to fail later.
With overcommit enabled you technically have more memory to work with. You could say that if the program has to fail anyway, then it might be better if it fails later at a higher memory usage.
There is userspace integration available as well for OOM killers; afaik Facebook uses (and has developed) one. They can take into account much more fine-grained selection criteria before killing a process.
And we additionally end up in a feedback loop where coders don't check the return value from malloc (!), since why bother when it never errors. Then we don't have enough resilience capital there either.
I thought this was going to be about the more fun point that malloc is allowed to return NULL even on successful allocation (because malloc(0) is allowed to return NULL). This is a quick way to check if the authors of malloc wrappers know what they are doing.
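The check being alluded to - NULL is only an error when the requested size was non-zero. A wrapper sketch:

#include <stdio.h>
#include <stdlib.h>

void *xmalloc(size_t size) {
    void *p = malloc(size);
    if (p == NULL && size != 0) {   /* NULL for size == 0 is a valid success */
        fprintf(stderr, "out of memory (%zu bytes)\n", size);
        abort();
    }
    return p;
}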
If allocation didn't succeed, you have bigger problems. I personally dislike articles like this one. It is apparent they have never run into the problem in a production scenario.
I used to disable overcommit, but I ran into a bunch of compilers that would crash because of it despite having 30 GiB of free memory available (as reported by free(1)).
Embedded systems (the ones where you would disallow malloc) don't generally have virtual memory by virtue of having no MMU, so on those you can't do overcommitment since there is no page fault mechanism.
No, the reason is simply that by statically allocating all memory you can avoid entire classes of program faults and bugs. There are no memory leaks and you don't need to solve the NP-complete problem of "is there an execution path where a dynamic memory allocation will fail". Keep in mind that it is not just about the total amount of dynamically allocated memory, but also the order of allocations (and frees).
That's right - some "safe" coding standards like MISRA C go as far as forbidding use of malloc() without explicit justification.
If you still need dynamic allocation, you might choose to have a custom allocator working from a fixed-size pool or arena created in code (possibly itself carved out of RAM with malloc(), but importantly only once at first setup, where you have a better guarantee that the allocation will succeed).
(All that said, embedded systems sometimes don't have virtual memory, so the original problem stated in the link is just not a thing...)
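A bare-bones sketch of the fixed-pool idea above: one static buffer and a bump allocator with no free(). Real embedded allocators (FreeRTOS heap_4, TLSF, ...) add freeing and coalescing, but the key property is the same - the worst case is fixed at build time.

#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE  (16 * 1024)
#define POOL_ALIGN 8u

static uint8_t pool[POOL_SIZE];
static size_t  pool_used;

/* Hand out aligned chunks from the static pool; NULL when it's exhausted. */
void *pool_alloc(size_t size) {
    size_t aligned = (size + POOL_ALIGN - 1) & ~(size_t)(POOL_ALIGN - 1);
    if (aligned > POOL_SIZE - pool_used)
        return NULL;                 /* deterministic failure, no OOM killer */
    void *p = &pool[pool_used];
    pool_used += aligned;
    return p;
}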
> working from a fixed-size pool or arena created in code (possibly itself carved out of RAM with malloc(), but importantly only once at first setup, where you have a better guarantee that the allocation will succeed)
And I should add to this that you probably want to access all of the pool/arena to do setup, or just ensure it's physically allocated if you are running in a virtual memory space. This is something that is reasonable at setup time, though.
I’m working on a project now and trying my best to make it MISRA compatible because FreeRTOS did the same. You don’t use malloc() but rather pvPortMalloc() which pulls from an already allocated pool. Their HEAP4 system tries to keep the chunks organized. So far I’ve been very happy with it.
I wouldn't say this is the reason. Embedded systems typically don't have virtual memory to start.
I would expect (but verify) a malloc implementation on embedded to return null if it can't satisfy the allocation.
But even with that assumption malloc in embedded is often a bad idea. You need to plan for worst case anyway and you can not afford memory fragmentation.
TLDR: "You don't." (Because malloc hands you virtual memory and actually trying to use it might reveal that the system doesn't have the real memory to handle your request.
I kept reading hoping that there was going to be a solution, but not really; there are comments discussing disabling overcommit, but even that's a tradeoff (it does fix this failure mode, but you might not want to actually run a system like that).
Technically the allocation succeeded. That's why you have a pointer and not NULL. So the answer is the classic one: "you check if the pointer is NULL or not".
This entire thing is a wrong question to be asking, anyway.
Even if you are sure right after malloc that "you have the memory 100% available to you", who's to say that, one nanosecond afterwards, some other process won't come in and, well, ask for more memory?
The memory you were so sure you had for yourself right after malloc may no longer be there (either swapped out or even discarded if possible), and you will be killed for trying to access it again. Oops.
At no point can you say "I have this memory and it's mine and just mine!" unless you are root and mlock (or equivalents). And I hope you don't do that, anyway. Just let the OS do what it is designed to do.
There's gotta be some way to programmatically determine it, right? It may not be portable. It may require some system calls or something, but there's gotta be a way, right?
Maybe a system call that checks whether writing to a page would result in a fault? What if we had memory file descriptors for pages? We'd be able to use them with epoll and io_uring to asynchronously check whether it's safe to write to the pages they represent.
I'm not an expert enough to tell you; I just read the article and decided that it was too long so summarized for others. There's some discussion upthread about catching SIGSEGV and other methods, FWIW.