What use is "full use" if your system live locks!? This is similar logic to that of overcommit--more "efficient" use of RAM in most cases, with the tiny, insignificant cost of your processes randomly being killed?
What happens in practice is that people resort to over-provisioning RAM anyhow. But even then that's no guarantee that your processes won't be OOM-killed. We're seeing precisely that in serious production environments--the OOM killer shooting down processes even though there's plenty (e.g. >10%, tens of gigabytes) of unallocated memory, because it can't evict the buffer cache quickly enough--where "quickly enough" is defined by some magic heuristics deep in the eviction code.
And before you say that it's not random but can be controlled... you really have no idea of the depth of the problem. Non-strict, overcommit-dependent logic is baked deep into the Linux kernel. There are plenty of ways to wedge the kernel so that it ends up shooting down the processes that are supposed to be the most protected, sometimes shooting down much of the system. In many cases people simply reboot the server entirely rather than wait around to see what the wreckage looks like. This is 1990s Windows revisited--reboot and move along, unreliability is simply the nature of computers....
Ugh, so much this. Where I work (until the 16th) we've moved from an environment of stability, where problems are investigated and fixed and stay fixed, to our acquirer's environment where things break randomly at all times, and it's not worth investigating system problems because nothing will stay fixed anyway. #totallynotbitter
Although XPoint has been a disappointment so far.
Arguably we currently have three options to prevent livelocking (short of fixing the kernel). All three have significant cons.
1. Have a big swap partition. Historically people have recommended not to have swap at all because (arguably) the kernel was too swappy and would swap out stuff that was needed. And also because some people prefer misbehaving processes to get OOM-killed instead of slowing the system to a crawl on old HDDs. I was in the latter camp, but I'm now experiencing this bug too (nothing gets OOM-killed, the system locks up instead) and considering reenabling swap.
2. Use the new memory pressure information the kernel provides with a userspace tool like earlyoom to kill misbehaving processes before the pressure is significant enough to slow the system (a rough sketch of reading that pressure interface follows this list). I tried this one out, but earlyoom repeatedly killed the X server under low-memory conditions. On a desktop system, X is likely the parent process of everything you want to run, but this still might be a bug?
3. Disabling overcommit entirely. This supposedly breaks a bunch of userspace programs, and even assumptions made in the kernel itself. Worse still, processes will probably end up killing themselves anyway when memory requests fail, but without the advantage (on a correctly working Linux system) that the worst-behaving process on the system gets killed instead of whichever one happened to request too much memory.
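For reference, the pressure information option 2 refers to is exposed under /proc/pressure/ (PSI). A minimal sketch of watching it, assuming a PSI-enabled kernel; the 10% threshold, the polling interval, and the "just print a warning" action are placeholders of mine, not what earlyoom or any other tool actually does:

    // Poll /proc/pressure/memory and warn when the 10-second "some" average
    // crosses a threshold. A real tool would pick a victim and kill it here.
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    // Returns the "some avg10" value (percent), or -1.0 if PSI is unavailable.
    static double memory_pressure_avg10() {
        std::ifstream psi("/proc/pressure/memory");
        std::string line;
        while (std::getline(psi, line)) {
            if (line.rfind("some", 0) == 0) {
                auto pos = line.find("avg10=");
                if (pos != std::string::npos)
                    return std::stod(line.substr(pos + 6));
            }
        }
        return -1.0;
    }

    int main() {
        const double threshold = 10.0;  // percent, purely illustrative
        for (;;) {
            double avg10 = memory_pressure_avg10();
            if (avg10 < 0) { std::cerr << "no PSI on this kernel\n"; return 1; }
            if (avg10 > threshold)
                std::cout << "memory pressure high (avg10=" << avg10 << "%)\n";
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }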
Are you arguing that killing 100% of the running programs and losing all the data is superior to killing <100% of the programs?
> Because what are you going to OOM-kill, some non-important process? Why would you run non-important processes in the first place.
I often use Alt + SysRq to manually activate the OOM killer, in case the kernel doesn't detect memory exhaustion fast enough on its own. And it works pretty nicely for me. I found that, in the worst-case scenario, nearly the entire desktop gets killed (rarely; usually it just kills the offending process), but I can still restart my desktop in a minute instead of spending five minutes rebooting the hardware.
We (application developers) will follow and adjust our programs to correctly handle malloc() failure -- after all it's quite easy to fix that even in existing applications.
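To make the claim concrete, here's the kind of plumbing I mean--a sketch only, with made-up names (Request, process_request): fail the unit of work when the allocation comes back empty, instead of the whole process:

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    struct Request { std::size_t payload_size; };

    // Returns true on success, false if this request was dropped for lack of memory.
    bool process_request(const Request& req) {
        char* buf = static_cast<char*>(std::malloc(req.payload_size));
        if (buf == nullptr) {
            // Refuse this one request and keep serving the others.
            std::fprintf(stderr, "out of memory, dropping %zu-byte request\n",
                         req.payload_size);
            return false;
        }
        std::memset(buf, 0, req.payload_size);  // ...the actual work goes here...
        std::free(buf);
        return true;
    }

Of course, with overcommit on, that null check almost never fires--the process gets killed on first write instead--which is exactly the behaviour being argued about here.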
One thing that's needed is efficient ways to ask for, and release, memory from the OS. I feel like Linux isn't doing so well on that front.
For example, Haskell (GHC) switched from calling brk() to allocating 1 TB virtual memory upfront using mmap(), so that allocations can be done without system calls, and deallocations can be done cheaply (using madvise(MADV_FREE)). Of course this approach makes it impossible to observe memory pressure.
Many GNOME applications do similar large memory mappings, apparently for security reasons (Gigacage).
It seems to me that these programs have to be given some good, fast APIs so that we can go into a world without overcommit.
Can you elaborate on why this is easy? It seems really difficult to me. Wouldn't you need to add several checks to even trivial code like `string greeting = "Hello "; greeting += name;`, because you need to allocate space for a string, allocate stack space to call a constructor, allocate stack space to call an append function, allocate space for the new string?
Even Erlang with its memory safety and its crash&restart philosophy kills the entire VM when running out of memory.
>Haskell (GHC) switched from calling brk() to allocating 1 TB virtual memory upfront
The choice of 1TB was a clever one. Noobs frequently confuse VM with RAM, so this improbably large value has probably prevented a lot of outraged posts about Haskell's memory usage.
Or in the worst case, terminating the program (as opposed to letting Linux thrash/freeze the computer for 20 minutes); but most programs have some "unit of work" that can be aborted instead of shutting down completely.
Adding those checks is some effort of plumbing, sure, but not terribly difficult work.
> Wouldn't you need to add several checks
In the case of C++, I'd say it's even easier, because malloc failure throws std::bad_alloc, and you can handle it "further up" conveniently without having to manually propagate malloc failure up like in C.
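Roughly like this--a sketch, where handle_one_request() is just a stand-in for whatever your unit of work is:

    #include <cstddef>
    #include <iostream>
    #include <new>
    #include <vector>

    // Placeholder unit of work: anything inside it that allocates (new, vector
    // growth, string append, ...) may throw std::bad_alloc.
    static void handle_one_request(std::size_t bytes) {
        std::vector<char> scratch(bytes);
        // ... do the actual work with `scratch` ...
    }

    // Returns false if this one unit of work was abandoned for lack of memory.
    bool run_one_request(std::size_t bytes) {
        try {
            handle_one_request(bytes);
            return true;
        } catch (const std::bad_alloc&) {
            // The stack has unwound; destructors already freed what the request held.
            std::cerr << "request aborted: out of memory\n";
            return false;
        }
    }

    int main() { return run_one_request(1 << 20) ? 0 : 1; }

That said, this only helps once the system actually reports failure at allocation time rather than OOM-killing you later.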
> Even Erlang with its memory safety
Memory safety is quite different from malloc-safety (out-of-memory-safety) though. In fact I'd claim that the more memory-safe a language is (Haskell, Java, or Erlang as you say), the higher the chance that it doesn't offer methods to recover from allocation failure.
That's the correct thing to do for almost all failed allocations. Crash, with a nice crashdump if possible/configured.
Wouldn't it be nicer if I can easily make it fail e.g. the current request, instead of crashing the process completely?
The problem is, I know for sure my supervision trees aren't proper; and I have doubts about the innermost workings of OTP --- did the things they reasonably expected to never fail get supervised properly? Will my code that expects things like pg2 to just be there work properly if it's restarting? How sad is mnesia going to be if it loses a table?
I'm much happier with too much memory used, shut it all down and start with a clean slate.
> OOM killer shooting down processes even though there's plenty (e.g. >10%, tens of gigabytes) of unallocated memory because it can't evict the buffer cache quickly enough
Is that due to dirty pages? Have you tried tweaking the dirty ratio to get dirty files flushed sooner? I believe there were also some improvements in the default behavior towards the end of the 4.x series that are supposed to result in more steady flushing.
You can't completely disable the OOM killer. Some in-kernel allocations will trigger an OOM reap and kill regardless. Too many parts of Linux were written with the assumption of overcommit.
Heck, as this issue shows even when there's technically free memory the OOM killer can still kick in.
I had heard about these issues before but never saw them in practice--at least not often enough that I looked closely. I knew Google and Facebook and others have been relying on userspace OOM killers for years to improve reliability and consistency. But only after recently joining a DevOps team at a large enterprise with large-enterprise, big-data customers did I begin to see the depth and breadth of the problem.
Before getting into the hell-hole of helpless despair that is DevOps I primarily did systems programming. And, for the record, I always made sure to handle allocation failure in those languages where I could (C, Lua, etc) because as a software engineer it was always clear to me that, much as with security, an over-reliance on policy-based mechanisms invariably resulted in the worst possible QoS. (Of course, in Linux even disabling overcommit is no guarantee that a page fault won't trigger an OOM kill, but there are other techniques to improve reliability under load.)
> Is that due to dirty pages?
Yes, people have tweaked and turned those knobs and reduced incidence rate by maybe 20-30%. Yet nobody is sleeping better at night.
Computing shouldn't be a black art. The irony is that overcommit and loose memory accounting more generally was originally intended to remove the necessity for professional system administrators to babysit a system to keep it humming along. But now we've come completely full circle. And it was entirely predictable.
And before anyone says that cloud computing means you increase reliability by scaling horizontally--well, "cloud scale" systems are exactly what's being run here. But unless you massively over-provision, and especially when certain tasks, even when split across multiple nodes, can take minutes or even hours, an OOM-kill incident can reverberate and cascade. Again ironic, because the whole idea of loose memory accounting is supposedly to obviate the need to over-provision.
By "full use" it doesn't mean "get to the point of live-locking". It means that if the heuristic is conservative enough to prevent live-locking, it will necessarily also prevent usage patterns that wouldn't have live-locked.
Safety and reliability improvements are very rarely free. It's my opinion, and I assume wahern's, that Linux should be reliable by default, even if this reduces efficiency in the non-failure case.
A better comparison is a car that will only go to 100 km/h on a 120 km/h road. It's safer, less accident prone and more energy efficient and you will never need to brake because of hitting the speed limit, but you're not making full use of the car.
Watch some documentaries about the time wearing seatbelts became mandatory and listen to people bitch about how inconvenient, uncomfortable and dangerous (what if the car catches fire and the seatbelt gets stuck???) they are...
Their friends will still blame anything and anyone other than them for the circumstances leading up to their death, but in unguarded moments, they’ll say “I wish they’d worn a seatbelt”. They blame the person that died for failing to take an obvious step of self-preservation, even if they don’t realize it.
Swap paging files have not reached this level of awareness, but we’re certainly a lot closer than we used to be to understanding as a community why modern system design requires paging files to operate in a dynamic memory allocation, generic work task, variable usage and specification environment. If you don’t, you’ll end up crashing into the overcommit windshield someday.
Is overcommit good? No more or less good than cars that can go 150mph (as their speedometers universally claim). They’re both design choices that no one has seriously objected to, and that means that seatbelts and swap files are necessary.
People do not learn. No safety margin means things will fail so you need to expect failure.
Even if you do expect failure, there is a chance of unexpected kind of failure. Failures tend to be much more costly than the extra capacity.
And that's in predictable systems, not something as unpredictable as say an ssh server.
To create reliable applications you need to design them to work within limits of memory allocated to them. Unattended applications (like RDBMS or application container) should not allocate memory dynamically based on user input.
By definition, a reliable application will not fail because of external input. Allocating memory from the OS cannot be done reliably unless that memory was set aside somehow. If the memory was not set aside then we are in an overprovisioning situation, and this means we accept that a process wanting to allocate memory might fail to get it.
So the solution (the simplest but not the only one) is to allocate the memory ahead of time for the maximum load the app can experience and to make a limit on the load (number of concurrent connections, etc.) so that this limit is never exceeded.
Some systems do it halfway decently. For example Oracle will explicitly allocate all its memory spaces and then work within those.
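A toy sketch of that pattern--reserve a budget once at startup, satisfy requests from it, and refuse work when it runs out. The sizes and names are illustrative only, not how Oracle actually does it:

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    class FixedArena {
    public:
        explicit FixedArena(std::size_t capacity)
            : base_(static_cast<char*>(std::malloc(capacity))),
              capacity_(base_ ? capacity : 0), used_(0) {
            if (base_) std::memset(base_, 0, capacity_);  // pre-fault: make the reservation real
        }
        ~FixedArena() { std::free(base_); }
        bool ok() const { return base_ != nullptr; }

        // Returns nullptr when the budget is exhausted: callers shed load
        // instead of asking the OS for more.
        void* allocate(std::size_t n) {
            n = (n + 15) & ~static_cast<std::size_t>(15);  // keep 16-byte alignment
            if (used_ + n > capacity_) return nullptr;
            void* p = base_ + used_;
            used_ += n;
            return p;
        }
        void reset() { used_ = 0; }  // reclaim everything between units of work

    private:
        char* base_;
        std::size_t capacity_;
        std::size_t used_;
    };

    int main() {
        FixedArena arena(64UL << 20);  // decide the budget once: 64 MiB
        if (!arena.ok()) { std::fprintf(stderr, "cannot reserve arena\n"); return 1; }
        if (arena.allocate(1024) == nullptr)
            std::fprintf(stderr, "over budget: refusing work\n");
        return 0;
    }

The memset matters on Linux: without touching the pages, an overcommitting kernel hands you address space rather than memory, and the "reservation" can still blow up later.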
The OS really is a big heap of compromises. It does not know anything about the processes it runs but it is somehow expected to do the right thing. We see it is not behaving gracefully when memory runs out, but the truth is, the OS is built for the situation where it is being used mostly correctly. If memory is running out it means the user already made a mistake in designing their use of resources (or didn't think about it at all) and there is not really much the OS can do to help.
This is incredibly easy to say, and incredibly hard to do for any application of even moderate complexity. Getting people to accurately tune thread counts or concurrency limits is painful when working with diverse workloads, and getting a guaranteed upper bound on memory usage would be a lot harder. It's much more common to apply higher level techniques such as concurrency limits and load shedding to avoid OOMs and other resource starvation issues.
The other thing we can do is reduce the impact of OOMs, and accept that they exist as a tail event. There is software running on your system that doesn't have hard memory limits, and you will always have to consider OOMs and other similar failure cases when ensuring your system is resilient. As long as we can prevent those tail events becoming a correlated failure across our fleet, we can get a pretty reliable system.
I agree. But this is what it takes to build really reliable applications.
For example, MISRA rules (C and C++ standards for automotive applications) forbid dynamic memory allocation completely after the application has started. They also, if I remember well, forbid anything that could make it impossible to statically calculate stack requirements, like recursion.
This is to make sure it is possible to calculate memory requirements statically.
It makes little sense for Excel to pre-allocate 4GB of memory to account for the user wanting to create a gigantic analysis in their spreadsheet over the next weeks, if all that the user wanted was to make a small sum of 100 narrow rows.
That's how we used to write software until Linux ended up running everywhere, and that was the only way to write software for platforms without a MMU. As recently as a decade ago mobile phones, before Android, often had strict memory management. Each allocation could fail, and these failure paths had to be tested.
Explicitly committing after a fork() would just make fork() cost a bit more but it would work. The fork() call would fail if there's no memory to duplicate all anonymous pages for the child process. But once it would succeed the child process would know it won't cause an OOM merely by writing into a memory location.
The other common case is to actually fork() a child process. Even then it would suffice to just reserve enough physical pages to make sure the child won't OOM-by-write, but only copy pages when they're actually being written to so you wouldn't be copying a gigabyte of memory even if you never used most of it.
The kernel would allow applications to allocate at most physical mem size + swap size - any memory reserved for the kernel. Thus you could still use the MMU to push the least-used pages into swap, and you could run more or larger programs than would fit in your RAM, but when memory is low these programs would always fail at the point of allocation, not memory access.
(The user is at fault but they do not know why.)
vm.overcommit_memory should never have been the default... It's hard to change it back.
On top of this is the potential for so-called swap death by thrashing.
Computer resources were managed similarly - the most important and most CPU-heavy resources would be tied to a core and given exclusive use of that core.
The only caveat here is that doing this all takes engineering effort and fairly deep knowledge of your OS. If you don't have mission-critical applications like this to write it's probably overkill.
(One kinda-cool side effect of the memory all being pre-allocated was that you could just browse the memory space of the process, everything was at a predictable address even without debug symbols.)
My understanding is that this requires "arcane knowledge" because most frameworks and developers simply don't care. All new languages and frameworks are built to supposedly make things easy and fun, and counting your objects or calculating your buffers isn't exactly fun.
Making things dynamic and hiding underlying memory layout is seen as a tool to write software faster but it disconnects developers from understanding what actually happens and makes it even more difficult to write reliable software.
If writing software the correct way was prevalent this would not be arcane knowledge but common sense.
If a human can do it, then an advanced toolchain could do it.
Probably too expensive for most use cases.
There are many reasons why this is not a good trade-off for the vast majority of use cases--unless you are running something like a traditional Oracle deployment, where you're giving the entire host over to the database anyway and the database is enough of a cash cow that they can afford to re-implement much of an operating system's functionality.
It generally makes a lot more sense to let the OS manage the buffer pool using the available free memory on the system (this is what Postgres does, for example).
It also makes more sense for the vast majority of deployments to allow process memory consumption to be dynamic and deploy the appropriate limits and monitoring to keep things healthy - e.g. by terminating malfunctioning processes and reconfiguring and migrating applications as needed.
This doesn't meet some traditional ideals about how programs should be written, but actually seems to result in better systems in practice.
The main reason is that the error handling code has a high probability of being buggy because allocating memory is pervasive in most code and it's unlikely that you're going to have tests that provide adequate coverage of all the interesting code paths. The consequences of incorrect behaviour are often worse than the consequences of a process crash.
1. When you allocate memory but do not write into it yet, those pages are not mapped in RAM and thus don't occupy actual space.
2. When you're done with a specific page of memory, you can madvise(MADV_FREE) it, which means the kernel can discard those pages from RAM and use them for caches and buffers. But you still hold on to the virtual memory allocation, so you can just start writing into the page again when you need more memory and the kernel will map it again.
If I understand all that correctly, you can have your allocator work in such a way that it keeps a large reserve of preallocated memory pages, but the corresponding amount of RAM can be used for caches and buffers when it doesn't need everything. An interesting question would be how that scenario appears in ps(1) and top(1), i.e. whether those MADV_FREE'd pages would count towards RSS.
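A sketch of those two mechanisms together (Linux-specific; MADV_FREE needs a 4.5+ kernel, and the sizes are arbitrary):

    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        const size_t reserve = 1UL << 30;  // 1 GiB of address space, not of RAM
        char* pool = static_cast<char*>(
            mmap(nullptr, reserve, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0));
        if (pool == MAP_FAILED) { std::perror("mmap"); return 1; }

        // (1) Nothing is resident yet; touching pages is what makes them real.
        std::memset(pool, 1, 4 << 20);        // first 4 MiB now occupy RAM

        // (2) Done with them for now: let the kernel reclaim the RAM lazily,
        //     but keep the virtual range so we can simply write to it again later.
        madvise(pool, 4 << 20, MADV_FREE);

        munmap(pool, reserve);
        return 0;
    }

As for ps(1)/top(1): if I remember correctly, MADV_FREE'd pages keep counting toward RSS until the kernel actually reclaims them (unlike MADV_DONTNEED), which is part of why RSS has become such a murky number.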
Currently, OOM killing.
Then RHEL6 came along, and it swapped all the time. Gone was our warning. The stats showed tens of gigabytes of cache engaged. WUT? How do we have tens of gigabytes of memory doing nothing? Before you could finish that thought, the OOM killer was killing off programs due to memory pressure. WTF? The system is swallowing RAM to cache I/O, but couldn't spare a drop for programs? ...I could go on, but simply put, RHEL6 was garbage. And really I mean the RHEL6 kernels and the way that Red Hat tweaked whatever they did for it.
RHEL7 was a little better, but still seeing echoes of the ugliness of RHEL6. RHEL5 was just a faded pleasant dream.
The last 3 Fedoras on the other hand, the memory management seems like we're finally digging ourselves out of the nightmare. That nightmare lasted almost a full decade.... sheesh
What are you going to do anyway, just start deallocating the memory your application needs to actually work, so that you can display an "I'm sorry, we're out of memory" message? Might as well crash then.
The amount of memory available also varies wildly from user to user and you can't put a memory requirement on an application.
So what do people do? Disregard the issue. Even the best games on the App Store have tons of one star reviews by people with older devices.
The flip side is that applications that are prone to crash are pressured to optimize for a relatively quick and painless restart.
On a server the process using the most memory might be the mail server so killing it is a bad idea... unless it normally never uses more than 2GB and for some reason it’s eating 14GB at the time.
You have to define quality of service, set relative priorities within those categories, be able to correlate memory usage with load to detect abnormal spikes vs normal spikes, etc. For example a daemon’s memory usage is probably correlated with its open socket count.
That in turn relies on a system much smarter than init scripts that understands the system is under memory load so re-starting a daemon that was just killed might not be appropriate (which gets into policy questions too).
For interactive use things are even more complicated; you need a mechanism for the active application being used by the user to donate its ultra-high priority to any daemons or child processes it is talking to. If image_convert is running because I initiated it as the user then it should get more memory (and CPU/IO). If it is running because my desktop window manager is refreshing its icon cache then it should get its priority smashed hard - if the system is under pressure or burning battery it should even be killed and prevented from restarting temporarily. Who is going to do all the work in the kernel and userspace libraries, then get all the apps to properly tag their threads and requests with the correct priority?
tl;dr: setting policy is hard and anyway it becomes the herding cats problem.
The option to kill processes is more interesting on personal machines, where sometimes adding RAM is not even an option. My anecdotal experience with a laptop without swap since 2014 (16 GB, then 32) is that I got near the limit a couple of times. I wouldn't mind if Linux killed any of the browsers, emacs (which is tiny nowadays), Thunderbird. They can recover. Maybe kill even databases. Leave me XOrg, the desktop environment and a terminal to investigate the problem.
If you know you want to keep bare Xorg alive with no window manager, taskbar or desktop, there's the oom_adj magic knob, accessible via systemd.
And memory cgroups.
But on a shell server, you actually want to kill that Xorg but preserve background tasks.
On a shell server there is no Xorg running, not even on machines with a partial or complete X11 installation because (for example) they need fonts to generate PDFs. I just checked a few of them.
On a desktop/laptop Xorg is the last thing I want to hang or be killed because it makes it harder to check what went wrong. I'll look into oom_adj, thanks.
I ran the script to display the oom_score_adj of the processes running on my laptop. It's 0 for all of them except the Blink-based browsers (including Slack). For those processes it's either 200 or 300, which means they'll be the first ones to go. Apparently their developers played nice with the system.
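(In case anyone wants to reproduce this without the script: the check amounts to nothing more than walking /proc. A rough equivalent, error handling trimmed:

    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        namespace fs = std::filesystem;
        for (const auto& entry : fs::directory_iterator("/proc")) {
            const std::string pid = entry.path().filename();
            if (pid.find_first_not_of("0123456789") != std::string::npos)
                continue;  // not a process directory
            std::ifstream adj(entry.path() / "oom_score_adj");
            std::ifstream comm(entry.path() / "comm");
            std::string score, name;
            if (std::getline(adj, score) && std::getline(comm, name))
                std::cout << pid << "\t" << score << "\t" << name << "\n";
        }
    }

Higher oom_score_adj means "kill me first"; the range is -1000 to +1000.)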
The oom killer only kicks in sometimes, e.g. when programs make truly egregious allocation requests.
The benefit of having swap is that it turns things into a soft degradation since it's much easier for the system to start with swapping out rarely used pages. The gradual loss of performance makes it easier for the human to intervene compared to the cliff you encounter when it starts dropping shared code pages.
This is a myth. Allocation failures happen at least as much on small allocations as on big ones. In fact, I see OOMs every day and the vast majority of the time the trigger was a small allocation. For example, the kernel trying and failing to allocate a socket buffer.
And that's really the root of the issue. You have a giant application with a 200GB committed working memory set doing important, critical work; and it gets shot down because some other process just tried to initiate an HTTP request. It's a ludicrous situation. And people defending Linux here by saying the same problem exists everywhere else are wishful apologists--the situation is absolutely not the same everywhere else.
Even setting aside the issue of strict memory accounting--which, BTW, both Windows and Solaris are perfectly capable of doing, and do by default--Linux could still do dramatically better. Clearly there's some level of unreliability people are willing to put up with for the benefits of efficiency, but Linux blew past that equilibrium long ago.
No, you are truly out of RAM at that point - the amount of RAM all processes need exceeds the amount of memory the system has, and the user has indicated that no swapping should be done.
Now, if the kernel truly wanted to handle this case gracefully by swapping disk-backed files, I think it should also tell the process scheduler about this, and enter a special mode where only processes whose code is currently resident would be allowed to run, until they hit a hard wait. This might prevent the thrashing behavior in many cases (assuming processes don't interact too much with memory-mapped files). Otherwise, every context switch initiated by the scheduler is likely to cause another page fault event.
The word "need" does the heavy lifting here. Strictly speaking it does not. The CPU only needs a few data and instruction pages at a time to execute code. The former may be, and the latter frequently will be, backed by memory-mapped files.
If your program contains huge unicode lookup tables, megabytes of startup code and so on then it is perfectly reasonable to discard those pages. It would be wasteful to keep those resident, especially on memory-constrained devices. Not having swap is not the same as not wanting paging to happen.
What the human wants is totally different from what the kernel needs to keep chugging along (at glacial pace). Bridging the gap is what is discussed further downstream in the mailing list linked by the article, and it's only possible based on previous work (PSI) that was added fairly recently.
We're in this crazy situation because somebody decided to sacrifice reliability to increase efficiency. Efficiency is easier to measure so they got away with it. Fixing the live-lock problem makes things slightly better, but it's not solving the underlying problem. Andries Brouwer's classic comment remains as relevant as ever:
Of course a browser with leaks could not be expected to handle oom gracefully.
At the point where the kernel knows that it should be returning a null pointer, you're probably already operating in a severely degraded state. You will be scanning the page table constantly, looking for pages to reclaim, wasting tons of CPU. You will have reclaimed all of the reclaimable pages you can reclaim, so io utilisation will be through the roof - calling a random function will cause a disk read to page the executable into memory. Your system will not be functional. And that's ignoring the reality that modern MMUs don't actually let the kernel know when all the memory is in use.
If you want to handle memory pressure better for a given application, shove the application in a cgroup and use a user space oom killer. It's not possible for a program to react gracefully to a system OOM.
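In cgroup v2 terms that's only a handful of file writes. A sketch, assuming the unified hierarchy is mounted at /sys/fs/cgroup and you have the needed privileges; the group name and the limits are made-up examples:

    #include <unistd.h>
    #include <filesystem>
    #include <fstream>
    #include <iostream>

    int main() {
        namespace fs = std::filesystem;
        const fs::path group = "/sys/fs/cgroup/myapp";     // made-up group name
        fs::create_directory(group);

        std::ofstream(group / "memory.max")  << (2UL << 30);  // hard cap: 2 GiB
        std::ofstream(group / "memory.high") << (3UL << 29);  // start reclaim at 1.5 GiB

        // Move this process into the group; the app exec'd afterwards inherits it.
        std::ofstream(group / "cgroup.procs") << getpid();

        // A userspace killer (oomd, earlyoom, ...) can watch this group's
        // memory.events and PSI files; and when the group hits memory.max the
        // kernel picks a victim inside the group rather than across the system.
        std::cout << "confined; exec the real application here\n";
        return 0;
    }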
That's only currently the case. It doesn't need to be. As per your example, you could just as easily provide an upper bound to the page table scan, and return NULL when enough pages can't be found in 100us.
> It's not possible for a program to react gracefully to a system OOM.
A program should never receive a system OOM, the fact that Linux' memory accounting is so bad the system itself can run OOM is the problem. It is perfectly possible for a system kernel to execute in bounded memory, and never let programs claim parts of that. Linux just isn't designed that way.
> It is perfectly possible for a system kernel to execute in bounded memory
Is it possible while maintaining anything like the feature set that linux provides currently?
Also comes to mind--while not generic swap, more of an edge case--a modern version of swap, i.e. extending virtual memory space onto flash storage: Facebook's replacement of some RAM with NVM https://research.fb.com/wp-content/uploads/2018/03/reducing-...
Maybe I can invoke Cunningham's law? I use 16GB on an NVMe SSD, which is equal to the size of my RAM.
tl;dr - 2x RAM was true in the early 90s
At some amount of active memory use Linux will grind to a halt. This happens with or without swap. It's a problem of how it handles low memory situations.
But if you have swap, this point comes much later. Often many gigabytes of unimportant data can be moved out of memory. If you can swap out 6 gigabytes without causing problems, and you only needed an extra 4 gigabytes of ram, then swap saves your day with almost no downside.
Obligatory link: https://chrisdown.name/2018/01/02/in-defence-of-swap.html
Now, note that (a) the root set are the hottest objects in memory, and (b) they are the ones that gc algorithms touch first. As a result, when you are doing a gc, the LRU (least-recently-used) heuristic for managing swap ends up putting all the hottest data in the program onto disk. This is pretty much the pessimal thing for the kernel to do, and can tank performance by triple-digit factors.
However, this is entirely fixable!
The gc and the kernel need to cooperate to figure out what is okay to push to disk or not. Matthew Hertz, Yi Feng and Emery Berger have a really nice PLDI 2005 paper, Garbage Collection without Paging, which shows how you can make gc performance hundreds of times better with just a little kernel assistance.
However, because kernel developers do not program in garbage collected languages (they program in C, because that's what the kernel is written in), they don't understand garbage collection and are hostile to kernel changes to support it. Emery Berger told me they tried and failed to get the Linux kernel to support their changes, and people who worked on .NET told me they had similar experiences with the Windows kernel team.
Maybe one should solve the problem: "At what eviction-induced latency should the OOM killer be invoked." Thanks to the infrastructure around latencytop, the latency might be available already.
Of course, the never-ending dilemma of what process to kill is still there.
POSIX has no such API. It was designed in a simple time.
You cannot get away with that and POSIX applications. They have zero or bad session support. There's no global always available database to help you (Android has 4), nor event scheduling bus. (Dbus is a joke, it does not allow sending event to non-existing endpoint to be delivered later.)
Every application using the standard mechanisms will get relaunched and activated as needed if it got shut down. It will receive a Bundle with state it managed to save and with original launch Intent. It can access an sqlite database to be made available via a restartable ContentProvider. SharedPreferences are also stored. Etc.
In Linux world, there are no such de facto standards and what is most common is utterly broken.
Same in Windows (only registry is persistent, it's not meant for data storage) and in OS X.