The Linux kernel's inability to gracefully handle low memory pressure (lkml.org)
647 points by emkemp on Aug 5, 2019 | 455 comments

Similarly, there are many annoying Linux bugs:

`pthread_create` can sometimes return a garbage thread value or crash your program entirely without any way to catch or detect it [1]. High-speed threading is hard enough as it is without the kernel acting non-deterministically.

Un-killable processes after copy failure (D or S state) [2]. If the kernel is completely unable to recover from this failure, is it really best to make the process hang forever, where your only available option is to restart the machine? I ran into this when copying onto a network drive with a spotty connection. The actual file itself really didn't matter, but there was no way to tell the kernel this.

Out Of Memory (OOM) "randomly" kills off processes without warning [3]. There doesn't appear to be a way to mark something as low-priority or high-priority, and if you have a few things running, it's just "random" what you end up losing. From a software-writing standpoint this is frustrating to say the least and makes recovery very difficult: who restarts who, and how do you tell why the other process is down?

[1] https://linux.die.net/man/3/pthread_create

[2] https://superuser.com/questions/539920/cant-kill-a-sleeping-...

[3] https://serverfault.com/questions/84766/how-to-know-the-caus...

Point 3 is wrong. OOM killing is not random. Each process is given a score according to its memory usage, and the highest score is chosen by the kernel. The way to mark priority in killing is to adjust this score through /proc. All of this is documented in `man 5 proc` from `/proc/[pid]/oom_adj` to `/proc/[pid]/oom_score_adj`.
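
To make the /proc knob concrete, here is a minimal sketch of adjusting the score from inside a program. The helper names `write_oom_score_adj` and `read_oom_score_adj` are made up for this example; the file name and the -1000..1000 range are the ones described in `man 5 proc`. The path is passed in as a parameter so the same helper can target `/proc/self/oom_score_adj`, another PID, or a scratch file while experimenting.

```cpp
#include <fstream>
#include <string>

// Range documented in proc(5): -1000 exempts the process from the OOM
// killer, 1000 makes it the preferred victim.
constexpr int kOomScoreAdjMin = -1000;
constexpr int kOomScoreAdjMax = 1000;

// Hypothetical helper: writes `value` to an oom_score_adj-style file.
// Returns false if the value is out of range or the write fails
// (for example because the caller lacks permission).
bool write_oom_score_adj(const std::string& path, int value) {
    if (value < kOomScoreAdjMin || value > kOomScoreAdjMax) return false;
    std::ofstream out(path);
    if (!out) return false;
    out << value << '\n';
    out.flush();
    return static_cast<bool>(out);
}

// Hypothetical helper: reads the value back, or returns `fallback`
// if the file cannot be opened or parsed.
int read_oom_score_adj(const std::string& path, int fallback) {
    std::ifstream in(path);
    int value = fallback;
    if (!(in >> value)) return fallback;
    return value;
}
```

A low-priority batch job could call `write_oom_score_adj("/proc/self/oom_score_adj", 500)` at startup to volunteer itself as a victim. As far as I know, raising your own value works unprivileged, while lowering it generally needs extra privileges, so treat a `false` return as a normal outcome.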


I haven't toyed with that in a long time (probably about a decade really) but back when I did it was still very difficult to get the OOM killer to behave the way you wanted. IIRC the scoring is fairly complex and takes child processes into account, so while it's technically completely deterministic it can still be fairly tricky to anticipate how it's going to work out in practice. And I did know about `oom_adj`. Often it worked. Sometimes it didn't. Sometimes it would work too well and not kill a process that was clearly using an abnormal amount of memory. Finding the right `oom_adj` value is an art more than a science.

Overall I ended up in the camp "you shouldn't throw passengers out of the plane"[1]. The best way to have the OOM killer behave well is not to have it run at all. If I don't have enough RAM just panic and let me figure out what needs to be done.

[1] https://lwn.net/Articles/104185/

And the OOM Killer has been entirely rewritten about twice in the past decade. Christoph Lameter (one of my coworkers, who also wrote the SLUB memory allocator) wrote the very first one. Very little if any of his original code is in the current Linux OOM Killer.

The current approach does indeed work much better. You can entirely disable the OOM Killer for a given workload with those procfs handles.

When I have multi-tenant shared servers that regularly experience low-memory conditions, that's now one of the first things that I turn off. I set sysctl vm.oom_dump_tasks=0 and vm.oom_kill_allocating_task=1, because I've found that without those settings and a comfortable amount of swap, the likelihood that a rogue process uses up all the memory and leaves the system completely unrecoverable without power cycling seems to go way up.

As I understand the way the score is calculated, processes with large memory footprints that are not particularly long-running, and especially those not owned by root, are the most likely to be killed. My shared servers are running unicorn for Rails apps, so they almost all look the same under that lens.

I want the process with the runaway memory usage to be killed, so that person becomes aware of the problem when they find their app is down. (There are many better ways to solve this, but in our current system design, there's nothing to tell us that a service is down, so killing some other disused service means it's down until some user comes along and is impacted.) It seems like killing the process that made the last allocation is the most likely way to get this behavior I'm looking for, where the failures are noticed right away.

But I'm not convinced I've fixed anything, even if the behavior seems better and I haven't had any servers going rogue and killing themselves lately. I am not the sysadmin, and I'm pretty sure the sysadmin just comes and over-provisions the machine so this doesn't happen again, which seems to be the best advice out there. And as I've learned from the discussion around this article, having a fast SSD, where paging things out to swap and expiring them happens fast enough to look almost like memory, almost completely neuters the OOM killer anyway.

Our whole situation is wrong, don't even get me started; I'd like to solve this with containers, where we can set a policy that says "all containers must have memory request and limit" and thanks to cgroups, this problem never comes back.

> thanks to cgroups, this problem never comes back.

cgroups doesn't save you from the OOM killer. In fact, we're seeing persistent OOM issues in production even when there's more than enough physical memory to satisfy all allocation requests.

Similar to the issue discussed in the leading LKML thread, I/O contention and stalls when evicting pageable memory (e.g. buffer cache) can result in the OOM killer shooting down processes when an allocation request (particularly an in-kernel request, such as for socket buffers) can't be satisfied quickly enough--quickly according to some complex and opaque heuristics.

The fundamental problem is that overcommit has infected everything in the kernel. The assumption of overcommit is baked into Linux in the form of various heuristics and hacks. You can't really get away from it :(

> OOM killing is not random.

Yeah I know, hence the use of "random".

> The way to mark priority in killing is to adjust this score through /proc.

Haven't heard about this, thanks for the heads up!

> > The way to mark priority in killing is to adjust this score through /proc.

> > Haven't heard about this, thanks for the heads up!

While going down that road is technically correct, it is a road full of pain.

A slightly less painful strategy is to disable overcommit. That way, if memory pressure is high and a process calls `malloc`, that call will fail if there is not enough memory, and that process will fail. If you only have a couple of processes in your system that are using most of the memory and you can control them, it is simpler to just make them resilient to this kind of error than to try to mess with the process score to control the OOM killer.
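
For what "resilient" can look like in practice, here is a rough sketch. The `Buffer` type and the `grow`/`release` helpers are invented for this example; the only real API used is `std::realloc`, which returns a null pointer on failure and leaves the old block intact. With overcommit left on, an allocation like this rarely fails up front, so the check mostly pays off under strict accounting (`vm.overcommit_memory=2`).

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical growable buffer, used only for illustration.
struct Buffer {
    char* data = nullptr;
    std::size_t size = 0;
};

// Tries to grow `buf` to `new_size` bytes. On allocation failure the
// existing contents are left untouched and false is returned, so the
// caller can reject the request or shed load instead of crashing.
bool grow(Buffer& buf, std::size_t new_size) {
    if (new_size <= buf.size) return true;        // nothing to do
    void* p = std::realloc(buf.data, new_size);   // nullptr on failure
    if (p == nullptr) return false;               // old block is still valid
    buf.data = static_cast<char*>(p);
    std::memset(buf.data + buf.size, 0, new_size - buf.size);  // zero the new tail
    buf.size = new_size;
    return true;
}

// Frees the buffer and resets it to the empty state.
void release(Buffer& buf) {
    std::free(buf.data);
    buf.data = nullptr;
    buf.size = 0;
}
```

The point of the design is that a failed `grow` is an ordinary, recoverable result: the caller still holds a valid buffer and decides what to give up.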

Unfortunately, the fork/exec method of spawning processes doesn't work for memory-heavy processes (such as, say, a Java server) without overcommit and copy-on-write memory. Not that I think fork/exec is the best method to spawn processes, but it is the standard way in Unix-like systems.

The rationale for disabling overcommit only really makes sense if physical and virtual memory consumption are in the same ballpark. That's true for some workloads but not generally true.

There are totally legitimate use cases for processes that use significantly more virtual memory than physical memory (since virtual memory is relatively cheap, but physical memory isn't). A lot of programs are going to touch and not release all virtual memory they allocate from the kernel, but there are plenty of important counterexamples.

fork()/exec() is one example (which I've been burned by personally), but there are plenty of others. Any program that uses TCMalloc and has fluctuating memory consumption will have a lot of virtual address space allocated but not backed by physical pages. Sophisticated programs like in-memory caches or databases can also safely exploit a larger virtual memory space while keeping the amount of physical memory bounded.

Virtual memory and overcommit aren't the same thing. Virtual memory means using disk space as memory. Overcommit means that the OS allows more memory to be allocated than it can guarantee is available. It is exactly like an airline booking 1,000 passengers for a flight with 500 seats, hoping that only half of them will actually show up. I don't think overcommit is ever needed in a modern system or is even a useful feature. For example, Windows doesn't support it at all.

> I don't think overcommit is ever needed in a modern system or is even a useful feature.

If you want to spawn child processes from a process that uses half the system's memory (not uncommon in server environments) using fork/exec, it is useful. In case you aren't familiar with how that works, the parent process makes a copy of itself, including a virtual copy of all the memory assigned to that process. That memory isn't actually allocated or copied until the child process tries to write to it (and then only the specific pages that are written to). Typically, the child process then calls `exec` to replace itself with a new program, which replaces the process memory. Without overcommit or swap, if the parent process is large enough, the fork syscall fails due to insufficient memory.
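
The sequence described above looks roughly like this in code. It is only a sketch: `spawn_and_wait` is a name made up here, and the 127 exit status for a failed exec is a shell convention borrowed for the example, not something the kernel mandates. The `fork()` call is the step that can fail with insufficient memory when the parent is large and neither overcommit nor swap is available.

```cpp
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `args` (program name followed by its arguments) in a child process.
// Returns the child's exit status, or -1 if fork/wait failed or the child
// did not exit normally. A failed exec is reported as exit status 127.
int spawn_and_wait(const std::vector<std::string>& args) {
    if (args.empty()) return -1;

    // Build argv before forking so the child does no allocation of its own.
    std::vector<char*> argv;
    for (const std::string& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    pid_t pid = fork();       // child gets a copy-on-write view of our memory
    if (pid < 0) return -1;   // fork itself failed, e.g. not enough memory
    if (pid == 0) {
        execvp(argv[0], argv.data());  // replaces the child's memory image
        _exit(127);                    // only reached if exec failed
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `spawn_and_wait({"sh", "-c", "exit 3"})` should come back with 3 on a system that has `sh` on the PATH. Between `fork` and `execvp` the child touches almost nothing, which is why the "virtual copy" of a huge parent is normally never paid for.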

In a desktop environment using swap/virtual memory is fine. But in a server environment, where the disk may be network-attached (higher latency) and just big enough for the OS and applications, needing significant swap space is often undesirable.

Windows supports overcommit, it's just not the default, and it's not how typical Windows runtimes allocate memory. And the nice thing about making it opt-in is that processes which didn't ask for overcommit won't get shot down when overcommitted memory can't be faulted in.

No, it doesn't. See quotemstr's explanation here https://lwn.net/Articles/627632/

Thank you for the clarification.

What's left out, though (and perhaps the source of my confusion), is that you typically commit a reserved page from an exception handler, and if the commit fails then presumably in the vast majority of situations the process will simply exit. See https://docs.microsoft.com/en-us/windows/win32/memory/reserv...

If the code dereferencing the reserved-but-uncommitted memory was prepared to handle commit failure beforehand it would normally have done so explicitly inline. I can't imagine very many situations where I would pass a pointer to some library expecting the library to handle an exception when dereferencing it. There are some situations--and they're one reason why SEH is better than Unix-style signals--but extremely niche.[1]

Either way, my point was that Windows does strict accounting, and while you can accomplish something like overcommit explicitly, nobody else has to pay the price for such a memory management strategy. Only the processes gambling on lazy commit end up playing Russian Roulette.

[1] In a POSIX-based library I once used guard pages, a SIGSEGV handler, per-thread signal stacks, and longjmp to implement an efficient contiguous stack structure. This was in an extremely performance-critical loop (an NFA generated by Ragel) where constantly checking memory bounds on each push operation had substantial costs (as in multiples slower). AFAICT, it was all C and POSIX compliant. (Perhaps with the exception of whether SIGBUS or SIGSEGV was delivered.) Though, because neither POSIX nor Linux supports per-thread signal handlers, you could effectively only use this trick in one component of a process (you had to hog SIGSEGV handling) without coordination of signal handlers. SEH would have resolved this dilemma. This being such a niche use case, that wasn't much of a problem, though.

That is solved by adding a reasonable amount of swap space. In practice the swap will never be used, but it's there as a buffer to guarantee that things will work if the forked process doesn't immediately exec.

But any process can be trying to allocate at the time your system runs out of memory, and most applications are not authored to handle malloc failing. Process failure seems easier to work around, from what I’ve seen. Would love to hear more from someone who has contrary experience.

At least you can then properly blame the software for doing the wrong thing and potentially patch it. The kernel should not implement global behavior that encourages improper memory allocation failure handling.

Yes so much. The overcommit approach seems to be to say, “Most people are lazy so we shouldn’t allow anyone to brush their teeth.” Handling malloc failures isn’t rocket science.

Do you have an example of a nontrivial user-space project that has malloc failure handling that actually works, with tests and all? The one I'm aware of is SQLite. Then there is DBus, whose author disagrees with your assessment.


Lua, various OS kernels

Like any aspect of writing software, how you approach the problem affects the complexity of the final product. If someone doesn't make a habit of handling OOM, then of course their solutions are going to be messy and complex; they're going to "solve" the problem in the most direct and naive way, which is rarely the best way.

For example, unless I have reason to use a specialized data structure, I use intrusive list and RB-tree implementations, which cuts down on the number of individual allocations many fold. Once I allocate something like a request context, I know that I can add it to various lists and lookup trees without having to worry about allocation failure. Most of my projects have more points where they do I/O operations than memory allocations. Should people just ignore whether their reads or writes failed, too?
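
To illustrate the intrusive-list idea for anyone who hasn't seen it, here is a small sketch. All the names (`ListNode`, `Request`, `List`, `request_from_hook`) are invented for the example and are not from any particular library. The link fields live inside the object, so once the `Request` itself exists, adding it to a list is a handful of pointer writes that cannot fail.

```cpp
#include <cstddef>

// Link fields embedded in the object itself, so linking never allocates.
struct ListNode {
    ListNode* prev = nullptr;
    ListNode* next = nullptr;
};

// Hypothetical request context; the hook is part of its one allocation.
struct Request {
    int id = 0;
    ListNode hook;   // membership in a "pending" list
};

// Circular doubly linked list with a sentinel head node.
struct List {
    ListNode head;
    List() { head.prev = head.next = &head; }
    List(const List&) = delete;             // the sentinel points at itself
    List& operator=(const List&) = delete;

    bool empty() const { return head.next == &head; }

    void push_back(ListNode& n) {           // no allocation, so no failure path
        n.prev = head.prev;
        n.next = &head;
        head.prev->next = &n;
        head.prev = &n;
    }

    static void unlink(ListNode& n) {
        n.prev->next = n.next;
        n.next->prev = n.prev;
        n.prev = n.next = nullptr;
    }
};

// Recovers the Request that owns a hook (the container_of idiom).
inline Request* request_from_hook(ListNode* n) {
    return reinterpret_cast<Request*>(
        reinterpret_cast<char*>(n) - offsetof(Request, hook));
}
```

A request that needs to sit on several lists or trees at once just carries one hook per container, which is how the allocation count stays at one per request.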

The JVM, and any other runtime which allocates its heap up front. You could argue that's cheating, but it does give you predictable behaviour.

Incorrectly handling malloc failures has been responsible for a lot of security problems. We are all better off acknowledging that C programmers are broadly incapable of writing correct OOM handling at scale and figuring out what to do with this fact, instead of wishing that it weren't this way.

There is now a way to mark processes as OOM-killer exempt


Part of the issue with processes stuck in D state (waiting for the kernel to do something) is that it is deeply tied to kernel assumptions about things like NFS. NFS is stateless, so in theory servers can appear and disappear at will, and operations will keep working when the server comes back. You can make NFS a hell of a lot less annoying in this regard by mounting it with the soft or intr flags; however, if the network disappears or hiccups, you WILL lose data (the network is NEVER reliable; in fact the entire model of NFS is arguably wrong to begin with).

That's not a problem with NFS, that's a fundamental issue with computers. Things fail. Nothing protects you from losing data, your local log-structured filing systems won't save you either. They'll help you protect an existing state from corruption, but they don't protect you from loss.

That new request you just received when a hardware failure occurred? Say goodbye to it, you've no way of ensuring it will make it to storage when the disks have caught fire. Later on, when you've put out the blaze, all that algorithms can do is tell you when things started to get lost.

What you are talking about is unrelated to NFS's broken model. It is still important, though, because these are the reasons why we don't need fsync() to actually flush data to disk, for example: it can be asynchronous and only needs to add a checkpoint to your log-structured filesystem that will be written to disk later, once enough data is buffered or enough time has passed.

Often ordering would be enough, but sometimes you do want to wait until the changes are on disk. It's a mess.

This is a bad excuse for the really flaky nature of NFS; other network filesystems do not have its issues.

Like which ones?

AFS is a good example of a network filesystem that takes properties of the network into account; however, it's a little weird and not fully compliant with POSIX filesystem semantics.

However, it scales like crazy and is very powerful and reliable.

I ran AFS for a very long time, and it does have issues which other network filesystems perhaps don't even notice. It has issues with clients behind NAT, for instance: the callbacks won't work. The limit on files in a directory is (or was) lower than for local filesystems, so Maildirs with long mail filenames could get into something that feels like an "out of inodes but not space" situation. All in all, it was very nice though.

> the entire model of NFS is arguably wrong to begin with

On local networks (everything attached to a single switch) with good hardware, it is reliable, and soft/intr is the worse choice among others.

To wit NFS is one of the commonly supported VM storage options (libvirt, VMware, etc).

Best of the bad options doesn't make it a good option. It's far from reliable and one of the most common reasons I encounter for deadlocked Linux systems. Well, to be fair, it also hangs a fair share of BSDs and sometimes even a Solaris. If at all possible, I'd suggest avoiding it.

What do you propose for a shared file system?

There is no use in specifying (no)intr as of kernel 2.6.25, it can always be interrupted by sigkill. See e.g. https://access.redhat.com/solutions/157873

There are many things "known" about NFS that are legacy.

What are you referring to regarding pthread_create()? Last time I checked, it would return an undefined thread* only when also returning a nonzero return code. That certainly isn't handled in a lot of multi-threaded applications, but it can be checked before anything else is done with the newly created thread.
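
For reference, the check being described is roughly the following. This is a sketch, with `double_it` and `run_checked` as made-up names; the real API facts it relies on are that `pthread_create` and `pthread_join` return an error number directly (they do not set `errno`) and that the thread ID is only meaningful when `pthread_create` returned 0.

```cpp
#include <pthread.h>
#include <cstdio>
#include <cstring>

// Trivial thread body for illustration: doubles the int it is handed.
void* double_it(void* arg) {
    int* value = static_cast<int*>(arg);
    *value *= 2;
    return nullptr;
}

// Starts a thread and waits for it. Returns 0 on success, otherwise the
// error number reported by pthread_create or pthread_join.
int run_checked(int* value) {
    pthread_t tid;
    int rc = pthread_create(&tid, nullptr, double_it, value);
    if (rc != 0) {
        // `tid` is undefined here and must not be joined or detached.
        std::fprintf(stderr, "pthread_create: %s\n", std::strerror(rc));
        return rc;
    }
    return pthread_join(tid, nullptr);
}
```

If the failure being reported really does come with a zero return code, a wrapper like this would not catch it, which is what would make it a genuine bug rather than a missed check.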

There appear to be cases under high create/destroy scenarios where it returns zero but failed to allocate memory for a thread during create. This is with tonnes of available memory (hence not a mapping error) and no exceptions thrown.

That said, it's still entirely possible I've made a mistake. Please see here: https://github.com/electric-sheep-uc/black-sheep/blob/master...

The idea is that you do preliminary processing of a camera frame before sending a neural network over the top.

> This is with tonnes of available memory (hence not a mapping error) and no exceptions thrown.

Could be a memory fragmentation issue

EDIT: Also, since you're using C++ threads: You should really use move semantics, because right now you have two points of failure acting on the same thing: the `new` operator may fail when creating the thread instance, and the underlying pthread_create may fail as well.

A comment on my mention of move semantics, because I feel that this really ought to be pointed out:

In all the C++ standard library implementations of std::thread the only member variable of that class is the native handle to the system thread itself; there are no additional member variables! This means that the size of a std::thread object is equal to the size of a native handle, usually the size of a pointer, but sometimes smaller.

If you create a std::thread with `new`, you're essentially creating a pointer to a "possibly a pointer", which comes with all the associated inefficiencies: double indirection, and small allocations that tend to fragment memory. And at the end of the day, to actually use it, you have to at least lug that outer pointer around on the stack anyway.

So there is zero benefit to using dynamic allocation for std::thread. Don't do it! Just create the std::thread instances on the stack; they're just handles/pointers with "smarts" around them, and you can copy them around just as efficiently as you can copy a pointer or an integer. Better yet, if you're not trying to "outsmart" the compiler, you'll often get copy elision where applicable.
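
A small sketch of what holding threads by value can look like; `run_workers` is a name invented for this example. One caveat worth stating: std::thread's copy constructor is deleted, so "copying them around" in practice means moving them, which is what `std::move` does below.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Starts `n` threads held by value in a vector, joins them all, and
// returns how many of them ran. No `new` is used for the thread objects.
int run_workers(int n) {
    if (n <= 0) return 0;
    std::atomic<int> ran{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        std::thread t([&ran] { ran.fetch_add(1); });
        workers.push_back(std::move(t));   // ownership moves into the vector
    }
    for (std::thread& t : workers) t.join();
    return ran.load();
}
```

The only remaining failure point is thread creation itself, which the std::thread constructor reports by throwing `std::system_error`.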

std::thread has its copy constructor deleted. This means that creating it on the heap is often your only option if you have to mix it with other types that don't handle moving properly, because move semantics are strictly opt-in.

> std::thread has its copy constructor deleted

and for good reason

> your only option if you have to mix it with other types that don't handle moving properly

It's not the only option. The other option is to implement move semantics on the containing type. Properly implemented move semantics give you assurance about ownership and that you don't use the thread interface inappropriately.
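
As a rough idea of what that involves, here is a sketch of a containing type that is movable but not copyable. The `Task` class and its member names are hypothetical. The one non-obvious detail is in move assignment: assigning to a std::thread that is still joinable calls `std::terminate`, so the sketch joins its own thread first.

```cpp
#include <functional>
#include <thread>
#include <utility>

// Hypothetical owner of a background thread: movable, not copyable,
// mirroring std::thread itself.
class Task {
public:
    explicit Task(std::function<void()> work) : thread_(std::move(work)) {}

    Task(Task&&) noexcept = default;
    Task& operator=(Task&& other) noexcept {
        if (this != &other) {
            join();                              // finish our own thread first
            thread_ = std::move(other.thread_);  // then take over the other one
        }
        return *this;
    }

    Task(const Task&) = delete;
    Task& operator=(const Task&) = delete;

    ~Task() { join(); }

    // True while this object is responsible for a thread.
    bool owns_thread() const { return thread_.joinable(); }

    void join() {
        if (thread_.joinable()) thread_.join();
    }

private:
    std::thread thread_;
};
```

After `Task b(std::move(a));`, `a.owns_thread()` is false and `b` is the single owner, so there is exactly one place that will join the thread.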

Deleting the copy constructor is a very arbitrary design decision without any good reason.

Move semantics are a can of worms in themselves. You assume in your comment that you can modify the other types that interact with the types that are only movable. This is only possible if you own all relevant types, which is actually the exception rather than the norm. And even if you own the relevant related types, move semantics transitively enforce themselves onto containing types, which turns their introduction into a sprawling mess of cascading changes with hidden surprises.

Here's something to ponder on: What are the proper semantics for copying a thread? What is it you want to express by doing that?

You'll find that usually the copy constructor has been deleted only for those classes where the semantics of a copy are not well defined.

So let's assume you work around that by encapsulating that thread in a std::shared_ptr or a std::weak_ptr. What are the constraints you must work within when using that thread reference?

Usually when you run into "problems" caused by an object not being "move aware", triggered by encapsulating a non-copyable type, this is a red flag that something in your code's architecture is off. Think of it as a weaker variant of the strong typing of functional languages. You probably don't want to have a shared_ptr on a thread inside your object (and the object being copyable), but wrap that object in a shared_ptr (or weak_ptr) and pass those around.

Your very first question is already leading you down the wrong path: std::thread is a thread handle, not the thread itself. Equating the handle with the thread (a complex construct of a separate stack, separate processor state, separate scheduling state etc.) is folly.

There are more software architectures between heaven and earth than exist in your philosophy. C++ especially is an old language and most code was written before C++11 started to be adopted. So a pure C++11 style codebase that follows the associated design best practices may be able to deal with std::thread and similarly restricted classes with little friction. But this just isn't the norm. Most big important codebases are too far down different roads to adjust them to play nice with move semantics.

> std::thread is a thread handle, not the thread itself

While technically true, semantically there's not much of a difference. Yes, you can copy around a handle, but then you have the burden of tracking all of these copies, so that you don't end up with a dangling handle to a thread long dead… or even worse, a reused handle for an entirely new thread that happens to have gotten the same handle value.

This is why you should not think of std::thread being a handle, but the actual thread. Yes, from a system level point of view there's the handle, and all the associated jazz that comes with threads (their dedicated stacks, maybe TLS, affinity masks, etc.), all of which are non-portable and hence not exposed by std::thread, because essentially you're not supposed to see all of that stuff.

> C++ especially is an old language and most code was written before C++11 started to be adopted.

That is true. Heck, I've still got some C++ code around which I wrote almost 25 years ago. But if you use a feature that was introduced only later, then you should use it within the constraints supported by the language version that introduced it and not shoehorn workarounds "just to make it work".

std::thread is just fundamentally flawed. The way it encapsulates the thread itself is just one of the things. Thread CPU affinity cannot be managed. Code cannot query which thread it is running on. There are no thread ids (the handle would be a workable substitute if it were copyable). Threads cannot be killed. In other words, if you take threading seriously std::thread is useless.

I need all of these things except for killing threads. So this is not just an academic list for me.

> Thread CPU affinity cannot be managed

That's because the capability of doing so depends on the target environment. C++ is all-target. If you need that you can use `std::thread::native_handle` + the OS's API for that.

> Code cannot query which thread it is running on

You're wrong assuming that. std::this_thread::get_id() exists: https://en.cppreference.com/w/cpp/thread/get_id

> There are no thread ids

Yes, there are. std::thread::get_id() exists (also see above): https://en.cppreference.com/w/cpp/thread/thread/get_id
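
A quick sketch tying the two calls together; `ids_agree` is just a name made up for the example. It checks that the ID a thread sees for itself via `std::this_thread::get_id()` matches what its handle reports via `std::thread::get_id()`.

```cpp
#include <thread>

// Returns true if the id a thread reports about itself matches the id
// reported by its std::thread handle, and differs from the caller's id.
bool ids_agree() {
    std::thread::id seen_inside;
    std::thread t([&seen_inside] { seen_inside = std::this_thread::get_id(); });
    // Must be read before join(): a joined std::thread no longer
    // represents a thread and returns a default-constructed id.
    const std::thread::id seen_outside = t.get_id();
    t.join();
    return seen_inside == seen_outside
        && seen_outside != std::this_thread::get_id();
}
```

`std::thread::id` is comparable and hashable, so it can be used as a map key even though it does not expose the underlying OS thread number.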

> Threads cannot be killed.

Not all runtime environments actually support doing this kind of thing. Also, within the semantics of C++ the ability to kill threads opens a gargantuan can of worms. For example, how would you implement RAII-style deallocation and deinitialization of objects created within the scope of a thread?

Or even one level lower: How do you deal with locks held within such a thread? Not all OSes define semantics for what to do with synchronization objects that are held by a thread that's been killed. Windows implicitly releases them. Pthreads defines mutex consistency, but after killing a thread holding a mutex, the state of the affected mutex is indeterminate until a locking attempt on the mutex is made.

Killing threads really is something that should be avoided if possible. Not just since C++11 but since forever, because it causes a lot more problems than it solves. If you need something that can be killed without going through too much trouble, spawn a process and use shared memory.

std::thread is very limited because C++ is an all-purpose, all-operating-system, all-environment language and it must limit itself to greatest common denominator of threading support you can expect. And realistically this boils down to: 1. there are threads. And 2. threads are created, run for some time, may terminate and you can wait for termination.

That's it. Anything beyond that is utterly dependent on the runtime environment. And because of that std::thread does give you std::thread::native_handle, to be able to interface with that.

Many features of the STL are optional. So the existence of operating systems that are incapable of providing certain features is not a valid argument for leaving out features that are essential to using threads in any meaningful way on others.

Killable threads are not rocket science, either. You're just limited in the kinds of things these threads can do. But there's no need to get all hung up on that particular feature.

> Please see here: https://github.com/electric-sheep-uc/black-sheep/blob/0735de...

Creating and destroying threads in a tight loop (and blocking on their destruction, thereby reducing the point of having many) seems like a bad idea. Conceptually you have a maximum of only two threads in any given part of this snippet: the one running the loop and the current instance of visThread. My guess is also that the loop thread spends most of its time waiting for the recently created threads to die. Why not have visThread get created only once and process a queue of events?
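
In case it helps, here is a minimal sketch of the create-once-and-feed-a-queue shape. The `Worker` class is invented for this example and is not taken from the linked repository. One long-lived thread drains a queue of jobs, and the destructor lets it finish whatever is still queued before joining.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// One long-lived worker that drains a queue of jobs, instead of one
// short-lived thread per job.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}

    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        thread_.join();   // queued jobs are finished before the thread exits
    }

    Worker(const Worker&) = delete;
    Worker& operator=(const Worker&) = delete;

    // Queues a job; it runs on the worker thread in submission order.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(job));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;   // woken for shutdown, nothing left
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job();   // run outside the lock so submit() is never blocked by a job
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread thread_;   // declared last so the members above exist first
};
```

In the camera case, the loop thread would do its preprocessing and then `submit` a job that runs the network on that frame; thread creation happens once at startup instead of once per frame.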

Anyway, without additional evidence, you have potentially flagged a bug in the C++ standard library rather than in pthread_create.

Firstly, thanks for taking the time to look.

> Creating and destroying threads in a tight loop (and blocking on their destruction, thereby reducing the point of having many) seems like a bad idea.

The purpose is that it's pretty much 100% always running a visThread (it's a neural network that takes about 100ms per image). The pre-process on the other hand runs in about 10ms, but there's no reason why it can't be run in advance (+). The neural network can't really be run on multiple cores safely, but it does have OpenMP (parallel loops).

(+) It does create some latency in the output compared to the real world, but it's not a massive deal when it comes down to it.

> Why not have visThread only get created once and process a queue of events?

This is probably the best way to do this, but this was the lazy way with not too much overhead (I think) :)

> Anyway without additional evidence you potentially flagged a bug in the c++ standard library rather than pthread_create.

Possibly, I did run it through GDB and Valgrind, it reliably seemed to die in pthread_create, but that of course could have only been the trigger. It could also be the aggressive optimization [1].

[1] https://github.com/electric-sheep-uc/black-sheep/blob/0735de...

What do you recommend doing when Linux Freezes? It doesn't come up a lot, but when it does it can be kind of unnerving since the three-finger-salute doesn't work.

I would also love to know if anybody has a solution for getting video to play properly in Firefox. I know it's not a bug per se, but it would be nice to not have to switch between browsers all the time.

I've been using Ubuntu for about a year now and otherwise it's been a very positive experience.

> What do you recommend doing when Linux Freezes? It doesn't come up a lot, but when it does it can be kind of unnerving since the three-finger-salute doesn't work.

Unfortunately, I don't really have a solution for this. I have an SSD as the main disk and _even now_, when I hit this too hard Linux grinds to a halt. No mouse, no keyboard, just heat, fans and disk light.

One thing that sometimes works for me is the old CTRL+ALT+FX mashing, but not always. Once you can get a shell you can type into you're okay, but of course this doesn't always work.

> I would also love to know if anybody has a solution for getting video to play properly in Firefox. I know it's not a bug per se, but it would be nice to not have to switch between browsers all the time.

What do you mean? It's been reliable for quite a long while. There were two issues I used to have: one was not having my graphics card set up (it was running from the CPU) and the other was not having Flash (when that was something).

> I've been using Ubuntu for about a year now and otherwise it's been a very positive experience.

Yeah, I think it makes one of the better daily drivers.

One of those magic SysRq combinations used to work for me. I used to use it a lot before I upgraded to 12 GB RAM.

Thank you for your answer. It's good to know that dealing with freezes is not a problem only faced by newbs like myself.

Re: playing video, for some reason I can't play South Park on either Firefox or Chrome. Videos on twitter also won't load with Firefox, though they work fine with Chrome.

> Thank you for your answer. It's good to know that dealing with freezes is not a problem only faced by newbs like myself.

Yeah, it's another bug with desktop-based Linux. The problem is that Linux "basically" treats the GUI like any other program, so when things get heavy everything gets roughly evenly screwed. OSes designed around a GUI, on the other hand, usually guarantee that GUI-related processes get a minimum amount of time on the CPU to ensure they don't freeze.

Linux should 100% be doing this. Doing lots of hardware I/O shouldn't mean you lose the mouse or keyboard. When you lose control of input, you think the machine isn't doing something, when in actual fact it's doing tonnes, it's just not showing you. Even something like Android suffers from this under heavy load, it's really crap.

The real joke is, it's probably a difficult kernel fix. You would need some kind of watchdog timer for the kernel to make sure it's not getting too bogged down with any one particular task and then interrupt ones that are (some tasks don't like to be interrupted) [1]. You then need to make sure that none of the heavy kernel calls make guarantees about the call being completed (i.e. blocking), which to be completely honest should be the default position to take anyway.

I'm not completely up-to-date on this, but my bet is that the issues come from everywhere. Programs can read/write arbitrarily large amounts of data (RAM, disk, network, bus, etc), when I believe you should be able to ask the kernel what size it would like you to read/write based on I/O activity and the capabilities of the device. If your program is bogging down the kernel, it should lower your recommended block read/write size. Better yet, this would be compatible with existing software, as they could choose to ignore this, possibly with it "punishing" programs that eat up lots of kernel time by making them wait longer for their next opportunity. There's a bunch of algorithms for time slicing tasks, but the most optimal appears to be max-min with very little organizing overhead [2].
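For what it's worth, the max-min idea can be sketched in a few lines of C. This is only an illustration, assuming every program declares a demand (say, bytes per scheduling round) and there is a fixed capacity to hand out; `max_min_share` and its parameters are made-up names, not an existing kernel interface.

```c
#include <stddef.h>

/*
 * Max-min fair sharing (progressive filling), as a sketch.
 * Small demands are satisfied in full; whatever capacity is left
 * is split evenly among the demands that cannot be fully met.
 * Assumes all demands are non-negative.
 */
void max_min_share(double capacity, const double *demand,
                   double *alloc, size_t n)
{
    size_t unsettled = n;
    size_t i;

    for (i = 0; i < n; i++)
        alloc[i] = -1.0;                 /* -1 marks "not settled yet" */

    while (unsettled > 0) {
        double share = capacity / (double)unsettled;
        size_t settled_now = 0;

        /* Anyone asking for no more than an equal share gets it all. */
        for (i = 0; i < n; i++) {
            if (alloc[i] < 0.0 && demand[i] <= share) {
                alloc[i] = demand[i];
                capacity -= demand[i];
                settled_now++;
            }
        }

        /* Nobody fits any more: the rest split what is left evenly. */
        if (settled_now == 0) {
            for (i = 0; i < n; i++)
                if (alloc[i] < 0.0)
                    alloc[i] = share;
            break;
        }
        unsettled -= settled_now;
    }
}
```

With a capacity of 10 and demands of 2, 2.6, 4 and 5, the two small demands are met in full and the two large ones end up with roughly 2.7 each.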

> Re: playing video, for some reason I can't play South Park on either Firefox or Chrome. Videos on twitter also won't load with Firefox, though they work fine with Chrome.

Hmm, that shouldn't happen, sounds like you potentially have some system-wide badness. A few things to try:

* Disable any customization you made (extensions, add-ons, etc) - see if it is one of these interfering

* Make sure you have all updates and you're running an up-to-date version of Ubuntu (these issues may already have been patched)

* Make sure you have the correct drivers installed for your GPU

But... One thing I did note was that JavaScript coin miners have gotten so bad that I can't run certain sites without ad-blockers anymore (uBlock Origin is generally recommended). I remember my CPU sitting at one core maxed out just because of the JS engine. I generally run uBlock Origin + NoScript on every page and manually enable temporary scripts on pages I trust. One of the biggest offenders for crazily heavy JS was actually Facebook.

[1] https://en.wikipedia.org/wiki/Watchdog_timer

[2] https://uhra.herts.ac.uk/bitstream/handle/2299/19523/Accepte...

I thought about your comment re: video and I did some more poking around online since I was hoping to avoid having to do a fresh install. I tried updating my video codecs and now Netflix/Hulu works! Hooray!

>> JavaScript coin miners have gotten so bad that I can't run certain sites without ad-blockers anymore

Whoa, is that the reason why some websites are eating a ton of memory?? Is this common?

>> I generally run uBlock Origin + NoScript on every page and manually enable temporary scripts on pages I trust.

Thank you for the recommendations, I'll check those out.

> I thought about your comment re: video and I did some more poking around online since I was hoping to avoid having to do a fresh install. I tried updating my video codecs and now Netflix/Hulu works! Hooray!

Yeah, it's important to pull updates regularly :) In general many of the bugs you'll come across will end up being fixed sooner or later, so it's worth checking regularly.

> Whoa, is that the reason why some websites are eating a ton of memory?? Is this common?

That and just the poor use of JS. Most websites really don't need JS. I use mbasic.facebook.com instead of facebook.com as it'll run without JS, plus you have to manually refresh to get any kind of notification.

I have to say, reading all these replies about the OOM killer makes Linux look quite bad. These proc scores are not an elegant solution. I far prefer Darwin’s launchd which lets you set actual memory limits (soft and hard) that gives you warnings before you cross a threshold. Now this is more consumer OS oriented, but something equivalent for servers that let you express preferences in a more natural way seems desirable.

systemd lets you configure soft and hard limits on a per-service level, almost identically to launchd. See MemoryHigh= and MemoryMax= [1].

This does depend on cgroupsv2, but it works on most modern distros.

[1]: https://www.freedesktop.org/software/systemd/man/systemd.res...

Oh nice! Thanks for the pointer!

You can use ulimit to set soft and hard limits for all sorts of system resources (including memory) on Linux.

..but if the problem is programs not checking malloc() return codes since it will not return failures, then ulimits will in themselves not help the program. It will help the OS to stay alive which is good, but we still need to deal with the programs who expect to run all the way into a swamp and sink without malloc giving them "bad news".

True — I forgot about this. But can you do that on a process via some config _before_ the process is created?

Yeah, usually you call ulimit before calling the process. The new process inherits the limits. If you want to modify the limits of an _existing_ process, you can use prlimit.
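The same thing can be done programmatically. A minimal sketch, assuming Linux and glibc: the child lowers its own address-space limit with `setrlimit(RLIMIT_AS, ...)` right after `fork()`, which is where a launcher would normally go on to `exec` the real program. The helper name `alloc_fails_under_limit` is made up for the example.

```c
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child, caps its address space at limit_bytes, and has it try
 * to malloc request_bytes. Returns 1 if the allocation failed, 0 if it
 * succeeded, and -1 if the experiment itself could not be run.
 */
int alloc_fails_under_limit(rlim_t limit_bytes, size_t request_bytes)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the limit set here is inherited by anything it execs. */
        struct rlimit lim;
        void *volatile p;   /* volatile so the malloc is not optimized away */

        lim.rlim_cur = limit_bytes;
        lim.rlim_max = limit_bytes;
        if (setrlimit(RLIMIT_AS, &lim) != 0)
            _exit(2);

        p = malloc(request_bytes);
        _exit(p == NULL ? 1 : 0);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

Under a 64 MiB limit a 1 GiB request comes back as NULL, so a program that checks malloc() gets its "bad news" instead of the machine sinking into swap. For a process that is already running, `prlimit(2)` takes a pid and does the equivalent.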

memory limits are also available in linux via cgroups or ulimit.

There is nothing more frustrating than unkillable processes stuck in iowait (D) state. There's no reason for this behavior to exist. And it's so easy to hang forever: a network blink, your NFS client gets stuck, and your programs too.

Ah, yes, that bug.

Few programs can handle a fail return from "malloc", and Linux perhaps tries too hard to avoid forcing one. Most programs just aren't very good at getting a "no" to "give me more memory". Browsers should be better at this, since they started using vast amounts of memory for each tab.

I used to hit a worse bug on servers. If you did lots of MySQL activity, so that many blocks of open files were in memory, and then started creating processes, you'd often hit a situation where the Linux kernel needed a page of memory but couldn't evict a file block due to some lock being set. Crash. That was years ago; I hope it's been fixed by now.

> Browsers should be better at this,

Browsers are quite good at this actually. Major web browsers run on Windows (and even 32-bit windows!), where there is no overcommit, so malloc can return "no" any time, which happens quite often when you are limited to 4Gb of memory per process.

The only apps that suck at this are Linux-only apps that are never used anywhere else and just assume that all Linux systems have overcommit enabled.

>Most programs just aren't very good at getting a "no" to "give me more memory"

I suspect that overcommitting is one of the reasons for this though. Many programmers in the Linux world have internalized that "malloc can't fail" and the only error handling they bother doing is calling abort() if malloc fails.

Of course the fact that C doesn't provide any sane way to implement error handling probably doesn't help.

Handling malloc() failure is almost never done for short lived programs. For instance git used to fail as soon as an error popped up (be it malloc(), open(), etc...). It just is much simpler and more convenient to do so.
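The "just die" approach usually boils down to a tiny wrapper. A sketch of the general pattern, not any particular project's code; the name `xmalloc_or_die` is made up here:

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Allocate or terminate: callers never see NULL, so there is no
 * error path to write (or to forget) at each call site.
 */
void *xmalloc_or_die(size_t size)
{
    /* malloc(0) may legally return NULL, so ask for at least one byte. */
    void *p = malloc(size ? size : 1);

    if (p == NULL) {
        fprintf(stderr, "fatal: out of memory allocating %zu bytes\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

For a short-lived command line tool this is hard to beat: the process exits, the OS reclaims everything, and the user reruns the command.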

While C has no special error handling mechanism in place, error handling can still be done reasonably. IMO, the big reason why malloc() errors are rarely handled is that it is quite hard to come up with a viable fallback strategy.

>Handling malloc() failure is almost never done for short lived programs.

True, and that makes sense for something like git. But in my experience many long-lived programs don't bother to handle ENOMEM gracefully either.

But I guess I'm veering off-topic here, I'm mostly fine with applications crashing of their own volition when they don't have enough memory. I agree with you that in many cases there's no clear recovery path for an application that's out of RAM. It's the OOM-killer I have a problem with.

>While C has no special error handling mechanism in place, error handling can still be done reasonably.

I very much disagree with that. There are a few factors that make error handling in C a pain:

- No RAII, so you have to explicitly handle cleanup at every point you may have to early-return an error (goto fail etc...).

- No convenient way to return multiple values from a function. That means that in general functions signal errors returning some special value like 0 or -1 (even that is very much nonstandard, often even within the same library).

Oh you want to be able to signal several error conditions? Uh, maybe use several negative codes then? Oh you need those to return actual results? Well maybe set errno then? Don't forget to read `man errno` though, because it's easy to get it wrong. Oh you had a printf in DEBUG builds in there that overwrote errno before you could test it? Oops. Don't do that!

What's that, your function returns a pointer and not an integer? Ah, mmh, well maybe return NULL in case of error? You want to return several error codes? Well maybe you can just cast the integer code into a pointer and return that, then use macros to figure out which is which. It's terribly ugly? Well the kernel does it so... It can't be that bad right? Oh and what about errno? Remember that?

What's that, NULL is a valid return value for your function? Uh, that's annoying. Maybe use an output parameter then? Oh, or maybe some token value like 0xffffffff, that probably won't ever happen in practice right? After all that's what mmap does.

So no I wouldn't consider C error handling reasonable in any way shape or form. "Non-existent" is more accurate. You can always work around it but it always gets in the way.

I try to always implement comprehensive error checking in my programs. I do a significant amount of kernel/bare metal work, so it's really important. It's not rare that I end up with functions that contain more error-handling-related code than actual functional code.
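For readers who haven't seen the "goto fail" style mentioned above, here is roughly what it looks like. A minimal sketch with a made-up function, `load_prefix`, that acquires two resources and releases them on a single exit path:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Reads up to `want` bytes from `path` into a scratch buffer.
 * Returns 0 on success or a negative errno-style code;
 * *out_len receives the number of bytes actually read.
 */
int load_prefix(const char *path, size_t want, size_t *out_len)
{
    int rc = 0;
    FILE *f = NULL;
    char *buf = NULL;

    *out_len = 0;

    f = fopen(path, "rb");
    if (f == NULL) {
        rc = -errno;
        goto out;
    }

    buf = malloc(want);
    if (buf == NULL) {
        rc = -ENOMEM;
        goto out;
    }

    *out_len = fread(buf, 1, want, f);
    if (ferror(f))
        rc = -EIO;

out:
    /* Single cleanup path: every early exit above funnels through here. */
    free(buf);                  /* free(NULL) is a no-op */
    if (f != NULL)
        fclose(f);
    return rc;
}
```

Even in this toy, the error handling and cleanup take up more lines than the actual work, which is pretty much the complaint.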

You are making things sound way more complicated than they need to be, the situation is actually very simple: if you need to return multiple error codes, use a return value for the error code and give back things via an output parameter, otherwise just use a sentinel value for error (0, -1 or NULL depending on context, they aren't totally random you know, 0 and nonzero are used for false/true, -1 is used when you expect some index and NULL when you expect some object). When in doubt just use an error return code everywhere (e.g. what many Microsoft APIs - even some C++ ones - do with HRESULT).
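A sketch of that convention, with the status in the return value and the result handed back through an output parameter. The enum and the function name `parse_long` are invented for the example:

```c
#include <errno.h>
#include <stdlib.h>

typedef enum {
    PARSE_OK = 0,
    PARSE_ERR_EMPTY,
    PARSE_ERR_NOT_A_NUMBER,
    PARSE_ERR_OUT_OF_RANGE
} parse_status;

/*
 * Parses a base-10 integer. The return value is only ever a status;
 * the parsed number goes through *out, so every long (including 0
 * and -1) stays available as a legitimate result.
 */
parse_status parse_long(const char *text, long *out)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return PARSE_ERR_EMPTY;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return PARSE_ERR_NOT_A_NUMBER;
    if (errno == ERANGE)
        return PARSE_ERR_OUT_OF_RANGE;

    *out = value;
    return PARSE_OK;
}
```

Note that `strtol` itself is an example of the mixed conventions being argued about: it reports overflow through errno and "no digits" through the end pointer.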

If it's not that complicated please explain why OpenSSL, the Linux kernel, curl, and a multitude of very popular C libraries don't do what you describe. Clearly it's complicated enough that even talented C coders try to cut some corners when given the chance.

C error handling ergonomics are non-existent which means that everybody bakes ad-hoc library-specific conventions that are extremely error-prone.

You could argue that they're doing it wrong and you might have a point but if almost everybody gets it wrong maybe it's fair to blame the language itself a little bit.

I already gave an example of APIs that do this - pretty much all COM APIs use HRESULT. I do not know why not everyone does this as i'm not everyone and as such i cannot tell what sort of considerations (if any) were going on. At best i can make some guesses.

BTW curl does seem to do what i wrote above, for example `curl_easy_init` returns a `CURL` object on success or NULL if there was an error [1] and `curl_easy_perform` returns a `CURLcode` value [2] that looks like it is used across the API to indicate errors.

[1] https://curl.haxx.se/libcurl/c/curl_easy_init.html

[2] https://curl.haxx.se/libcurl/c/curl_easy_perform.html

The kernel very much returns sentinel values, if something more complicated has to be transmitted error codes are commonly used. I see nothing wrong with it.

I'm not arguing that the kernel devs are doing it wrong. I'm only pointing out that, in my opinion, the way C deals with error handling (that is, by not doing anything at all) is far from reasonable and the cause of many bugs. It's terrible ergonomics.

If you have a kernel function returning a pointer and you think that you're supposed to check for NULL when it actually returns an ERR_PTR in case of errors you will not only fail to do the check but on top of that end up with a garbage pointer somewhere in your program. If you have a MMU and you try to de-reference the pointer you'll have a violent crash, which at least shouldn't be too hard to debug. If you feed the pointer to some hardware module or if you're working on an MMU-less system then Good Luck; Have Fun.

C doesn't have your back here. It doesn't let you signal how a function reports errors, it doesn't even let you tag nullable pointers.
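Here is the trap in miniature. This is a userspace imitation of the kernel's `ERR_PTR` / `IS_ERR` / `PTR_ERR` helpers, written from memory as a sketch rather than copied from the kernel; the lowercase names, `MAX_ERR`, and `find_device` are all made up for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the top 4095 addresses never hold a valid object. */
#define MAX_ERR 4095

/* Smuggle a negative error code inside a pointer value. */
static inline void *err_ptr(intptr_t err)
{
    return (void *)err;
}

/* Recover the error code from such a pointer. */
static inline intptr_t ptr_err(const void *p)
{
    return (intptr_t)p;
}

/* True if the pointer falls in the reserved "error" range. */
static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERR;
}

static int the_device = 1;

/* Hypothetical lookup: only id 0 exists. */
void *find_device(int id)
{
    if (id != 0)
        return err_ptr(-ENODEV);
    return &the_device;
}
```

A caller that writes `if (find_device(7) == NULL)` sails right past the error, because the returned value is non-NULL garbage near the top of the address space. Nothing in the function's signature says which check is the right one.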

Often you need to return error objects. Consider a function for parsing something. You want to return not only the error code, but also the line and column number of the parse error, and a description of it. So you need two output parameters; one for the result and one for the error. Your declaration becomes something like this:

    bool parse(inp_type *a, out_type **b, out_error **c);

where the return value false indicates an error. In C++, you'd just have written something like:

    out_type parse(const inp_type& a);

and thrown an exception on error.

In C you can return a struct, however a better approach is to use a context object which also contains error information, like:

    ctx_t* ctx = ctx_new();
    if (!ctx) { /* ... fail ... */ }
    if (!ctx_parse(ctx, code)) {
        show_error_message(ctx_erline(ctx), ctx_ercol(ctx));
        /* ... more fail ... */
        ctx_free(ctx); /* often done in a goto'd section to avoid missing frees */
    }

This also allows you to extend the API's functionality, error information, etc in the future while remaining backwards compatible.
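A self-contained version of the context-object idea might look like the following. It is a sketch with a toy "parser" that accepts only digits, spaces and newlines; `parse_ctx` and the function bodies are invented for illustration. One deliberate variation from the snippet above: the caller owns the context's storage, so creating it cannot fail.

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    int  err_line;       /* 1-based line of the first error, 0 if none */
    int  err_col;        /* 1-based column of the first error */
    char err_msg[64];    /* human-readable description */
} parse_ctx;

/* Reset a caller-provided context; no allocation involved. */
void ctx_init(parse_ctx *ctx)
{
    memset(ctx, 0, sizeof *ctx);
}

/* Returns 1 on success, 0 on error (details are left in ctx). */
int ctx_parse(parse_ctx *ctx, const char *code)
{
    int line = 1, col = 1;
    const char *p;

    for (p = code; *p != '\0'; p++) {
        if (*p == '\n') {
            line++;
            col = 1;
            continue;
        }
        if (*p != ' ' && (*p < '0' || *p > '9')) {
            ctx->err_line = line;
            ctx->err_col = col;
            snprintf(ctx->err_msg, sizeof ctx->err_msg,
                     "unexpected character '%c'", *p);
            return 0;
        }
        col++;
    }
    return 1;
}
```

A caller declares `parse_ctx ctx;` on the stack, calls `ctx_init(&ctx)`, and reads `ctx.err_line`, `ctx.err_col` and `ctx.err_msg` if `ctx_parse` returns 0. New fields can be added to the struct later, though with caller-owned storage that changes the struct size, so it is less ABI-friendly than the opaque-pointer version.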

Which is great, except that ctx_new() requires a malloc, which then can fail, and now you can't even explain why the thing failed, as you have no context info.

You also have to worry about all of the ctx objects you've created along the way, to free them up as you recover from the low memory error.

Yep, you're absolutely right. But don't tell me that is simple! :)

That is very similar to the way I handled errors back in my C days.

> No RAII, so you have to explicitly handle cleanup at every point you may have to early-return an error (goto fail etc...).

I think RAII can be useful, but I've never found any use for it in systems level code that I write. Most of the time I'm dealing with resources that were allocated inside a systems library or an external component which just gives me a handle to the resource. I think this is a common enough scenario in systems code that I don't think it's just me.


    X = CreateResource()
    Y = TransformResource(X)
    ProcessNewResource(Y)
    Z = TransformResource(Y)
    etc. etc.

And so as you transform that resource, you will have multiple ways to unwind the resource depending on where the failure occurs. Even if you wrap X in some RAII container, you don't know what your destructor is going to look like.

Another con to RAII, especially when paired with shared-ownership smart pointers, is you lose predictability over your resource deallocs. You never know when the last pointer is going to go out of scope, and if it's a 'heavy' resource with a complicated unwind, you're going to get a CPU spike at an indeterminate time. I deal primarily with industrial automation code and I much prefer to have a smooth/even CPU graph. I think this issue is more relevant to systems code which is the context of this thread.

I've just checked and it appears that c++'s std::vector::resize may throw std::bad_alloc when malloc fails, but rust's Vec::resize's interface doesn't leave any room to report any errors, so I guess that it will panic...? That's sad.

The standard library assumes infallible allocation, yeah. We have plans to add fallible stuff eventually, but we’re still working on our allocator APIs.

Rust in general will panic on memory allocation failure. There was some discussion about oom handling a while back but I don't know the current state.

Now you regret not putting in exceptions.

> vast amounts of memory for each tab

What underlies this? I am astounded to see 1GB of memory returned when I close a couple of tabs.

Chrome and Firefox both seem like this.

It's spread across all parts of the browser, but speaking as a Firefox graphics engineer, we use quite a lot of memory. Painting web pages can be slow, so we try to cache as much as possible. When elements scroll separately, or can be animated, we need to cache them in separate buffers. If we get the heuristics wrong (and it's hard to get it right for every web page out there) this can be explosive. It's not helped by the fact that graphics drivers can frequently bring down the whole process when they run out of memory. It's a hard problem, but webrender will help as it needs to cache less.

Maybe the browser should try to discard some cached data when the system is out of memory. Then some things in the browser would be slower, but the operating system wouldn't hang.

The browser _does_ do that. The hard part is detecting "the system is out of memory". Some OSes notify you when that happens, and Firefox listens to those notifications and will flush caches. Some OSes will at least fail malloc and let you detect out-of-memory that way. Linux does neither, last I checked.

Disclaimer: I work on Firefox, but not the details of the OS "listen for memory pressure" integration.

Any userspace process can see how low the memory is though, firefox could do it itself. Still, if a notification system is already used in other OSes, a very easy solution would be to add such notification channel in userspace so that any process could ask firefox to free memory. Right now I am using earlyoom to save my system from freezing. It sometimes kills firefox, sometimes dbeaver, sometimes VMs. But if it could tell firefox to chill for a bit and free memory, then I could avoid the massacre (at least sometimes).

> Any userspace process can see how low the memory is though

How, exactly?

Or put another way, how do you reliably tell apart "we are seriously thrashing" and "resident memory is getting close to the physical memory limits but there is plenty of totally cold stuff to swap out and it won't be a problem" from userspace? The kernel is the only thing that can make that determination somewhat reliably.

Maybe there's a better way than this, but the same way earlyoom decides it's time to kill processes (% RAM usage) firefox would decide it's time to free cache. While using 100% of RAM might be the optimal state if you aren't going to use more than that, it's not safe.
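A crude version of that check is easy to write. A sketch, assuming the `MemTotal` and `MemAvailable` fields of `/proc/meminfo` and a simple percentage threshold; the function names are made up, and this is not a description of how earlyoom or Firefox actually decide. The functions take the file's text as a string so the logic can be tried without touching `/proc`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Finds a line like "MemAvailable:   800000 kB" in /proc/meminfo-style
 * text and returns the number (in kB), or -1 if the key is absent.
 */
long long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtoll(line + klen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}

/*
 * Returns 1 if available memory is below min_percent of total,
 * 0 if it is not, and -1 if the fields could not be found.
 */
int memory_is_low(const char *text, int min_percent)
{
    long long total = meminfo_kb(text, "MemTotal");
    long long avail = meminfo_kb(text, "MemAvailable");

    if (total <= 0 || avail < 0)
        return -1;
    return avail * 100 < total * (long long)min_percent;
}
```

This is exactly the kind of heuristic the parent is skeptical about: it says how much memory looks free, not whether the system is actually thrashing.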

So let's say I'd like to write a memory-efficient web page, what should I avoid then?

Based on my experience as a Firefox developer investigating memory usage reports, the worst-performing "normal" web pages in terms of memory have:

1) Lots of script (megabytes).

2) Possibly loaded multiple times (e.g. in multiple iframes).

3) Possibly sticking objects into global arrays and never taking them out (aka a memory leak for the lifetime of the page).

4) Loading hundreds of very large images all at the same time.

5) Loading hundreds or thousands of iframes that all have nontrivial, if not huge, amounts of script. Social media "like" buttons often fall in this bucket.

There are obviously also pathological cases: if your HTML has 1 million elements in it (not a hypothetical; I've seen this happen), memory usage is going to be high, obviously. And arguably having a page with thousands or hundreds of thousands of JS functions is "pathological" too, but it's pretty normal nowadays...

For video memory/tile memory usage, avoid anything fancy that's hard to rasterize, for starters: Think things like rounded borders, drop shadows, opacity, transparent background images, etc. The more complex it is to draw your page the more likely that it will end up being cached into temporary surfaces and composited and stuff. For a while for some absurd reason Twitter's main layout had a bunch of rectangles with opacities of like 0.95 or 0.99, so all the layers had to get cached into separate surfaces even though you could barely tell it was happening. Getting them to fix that made the site faster for basically every Firefox and Chrome user. They hadn't noticed.

For JS and DOM memory usage you can use the browser's built in profiler to get a pretty good estimate of where things are going and what you've done wrong.

Rather than guess at what to avoid, you should make use of the memory profilers which Firefox and Chromium developer tools provide. Apparently Firefox's memory profiler is an add-on: https://developer.mozilla.org/en-US/docs/Mozilla/Performance...

Yeah, but this is a pigeonholing principle.

I don’t want to spend my time developing something only to discover it doesn’t perform well due to some reason.

I would prefer to not use any of the performance killers in the first place.

Firstly, avoid leaking memory (including objects like images and DOM nodes) in JavaScript. Leaking memory here means retaining a reference beyond the end of the object's use. The garbage collector only collects memory which is no longer referenced; it does not attempt to analyse when a reference is no longer used.

Secondly, avoid including unnecessary resources. Many web pages include many libraries which are then mostly unused. Some packaging tools can help eliminate such unused code.

A memory profiler helps in both cases: it detects leaks, and it measures the cost of resources, allowing you to make educated decisions about their inclusion.



Is that a joke, or are you essentially saying that if I used WebAssembly, then most of the memory usage would go down?

He's not wrong. If you disable JS by default tabs will take much, much less RAM. Sites that require JS aren't worth the time anyway. Unless you're literally writing an application there's no reason to require executing code to render in text and images. And there's absolutely no reason to not have a no-JS fallback. In fact, there should be a real HTML skeleton first upon which you write JS enhancements.

But these are all things you do if you want to make a webpage for people. If your main concern is corporate profit or saving institutional funds then SPAs and requiring JS for obfuscation makes sense in the anti-user way that corps tend to. It's just faster. Who cares if it's anti-user when profits are on the line?

Is there a setting to disable all of that caching? In other words, prioritize memory usage over performance?

To be honest: I do not know. But given how fat most websites are today, I am not that surprised that so much memory is needed. Yes, there is still a major leap from a couple of megabytes to a full gigabyte, but with so many DOM nodes and JS objects I can imagine that even a resource-conservative browser will have trouble keeping memory usage low.

Or are there any browsers where you observe significantly less memory usage on the same websites? (Ignoring limited browsers like Lynx of course)

Most modern browsers cache a lot of data in memory as well to make things like navigating back snappy and avoiding a full page re-load. It's also necessary to cache most of the state of the previous page to allow retaining form fields if you accidentally navigated away (as some forms are only in the DOM, created by JS, and not part of the original HTML).

I believe I read somewhere that at least Chrome listens for low-memory notifications broadcast by the OS and will evict such caches. So while it uses a lot of memory as long as memory is available, it will also release much of it if necessary.

This makes sense but is somewhat annoying since a web browser is not the only program running on my computer (but apparently wants to be) and eats up RAM that could be used for the OS file cache.

I wish there was some sort of allocation API specifically for allocating caches so that recently accessed files could kick out a web browser's cache of a not-so-recently accessed tab or vice versa.

POSIX does have something sort of like this in madvise(), but I couldn't find a specific option for the semantics you described.

On a low memory computer I had configured FF to not show images unless I alt-clicked them. Together with not using JavaScript this meant using significantly less memory. I suspect browsers still have to have raw bitmaps in memory, which for the 2 MB JPEGs you are fed everywhere quickly adds up...

They have to be in memory somewhere. They technically don't have to be kept in RAM if they are uploaded to VRAM though.

Assuming RGBA it's just over 8MB for a 1920x1080 image.
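The arithmetic, as a one-liner in C (assuming 8 bits per channel and no row padding; `decoded_rgba_bytes` is just a name for the example):

```c
#include <stddef.h>

/* Size of an uncompressed RGBA bitmap: 4 bytes (R, G, B, A) per pixel. */
size_t decoded_rgba_bytes(size_t width, size_t height)
{
    return width * height * 4;
}
```

For 1920x1080 that is 1920 * 1080 * 4 = 8,294,400 bytes, i.e. about 8.3 MB (7.9 MiB), regardless of how small the compressed JPEG was.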

Images could be kept compressed until they are painted. Most GPUs support texture compression, so don't need to keep a bitmap for compositing.

Texture compression is inherently lossy, so it isn't an option.

It's also really dependent on your textures and what you want to do with them; you don't want the browser to just give it its best shot at compressing your company logo.

You want your browser to be fast. Browsers are often the only thing running. You have a bunch of ram. Unused ram is pointless. Disks are large and fast and are used for swap. It's not clear why you're surprised, what the problem is, or why anyone would put any effort into optimising a browser for memory usage. The number of people who have tens or hundreds of tabs open in the real world isn't as large as it is on HN and other tech sites, and for many people who do just keeping the url around so the site can be reloaded is probably good enough.

> Browsers are often the only thing running.

If I had to choose one program that is proportionally used the least alone, I would have voted browsers.

Javascript? I use noscript and it's less than 200MB/tab, even with major news sites and what not open.

200MB/tab just to display some text with markdown and maybe some small images? Isn't that insane?

Answer is too simple. Exactly what is taking up what memory? I'd love to see the annotated memory map of these processes. M Kb of javascript text. N MB of image data, O MB of this, P MB of that.

Unpopular opinion time:

My guess is most developers don't care and are not even looking at this anymore, either during development or after release. Nobody seems to even know how much memory their program allocates and how quickly it allocates that memory under various running conditions. I used to challenge my fellow developers: Stop in the debugger right now. About how much memory should the process be using? Nobody even seems to have an order-of-magnitude guess anymore. It's your program, dude! Shouldn't you know this?

You can ask any embedded software engineer exactly how much memory his/her program uses, what's the stack size, what's the heap size, what's statically and dynamically allocated. Sadly, this discipline is pretty much gone outside of that specialized area.

For browsers, this problem is "once removed". Firefox and Chrome both go to heroic efforts to reduce memory usage. However, given the html and javascript which fill most websites, and users expecting responsiveness, it turns out to be very hard to use less memory.

Browsers are little OSes with encapsulated virtual machines. As every website tends to be a mix of web trackers, ads, and server-dependent functionality, the whole thing can be a big mess.

If it helps any I have ~2500 tabs open in palemoon (well overdue for an afternoon of tab culling but anyway) and it's ~1.5GB. I never allow JS. So my guess is it's either JS directly, or perhaps JS pulling in extra resources when allowed to run.

I think a personal wiki with clickable links to 2500 sites would be more useful than a browser with 2500 tabs open. It could even be easily curated by more than one person.

That said, Firefox made some strides in handling very large numbers of tabs starting in version 55.


Unfortunately Palemoon is more or less firefox 38 still isn't it?

Just tidying up after myself would fix most of that. I don't know what palemoon~firefox relationship is versionwise.

But my point was more that sans JS it really seems to use up far less memory. Honestly, try nuking js for 1/2 hour and see how it feels.

I get annoyed at all the sites that don't work without js.

With JS, my usual range is from 10 tabs using 400 MB of RAM to 30 tabs using less than 1 GB, with the former scenario being more common.

1 GiB of virtual memory, I assume. IOW, not the same thing as 1/32 of your 32 GiB of RAM, for example.

I think that's not that bug at all. When memory runs out, the entire system stalls, including the UI, but nothing crashes. If these issues are frequent, the system is basically frozen.

I have this in Matlab on Linux. Matlab can actually deal with worker processes being killed, but my machine just locks up. Therefore, we have to run these specific simulations under Windows, where this doesn't occur.

I witnessed MySQL bringing Linux servers down too.

In my case it happens like this:

I have a long running PHP process that constantly fires away mostly SELECT but also a bunch of INSERT and UPDATE statements and also some DELETEs.

Since the DB and the key files do not fit into memory, it's all disk-bound work.

All tables are MyISAM.

Like clockwork, this stalls the virtual machine once per day.

All I can do is to hard power down the VM and restart it. Afterwards the table data is corrupted beyond repair.

Not sure it is related to memory though. Because the memory usage of PHP and MySQL seem to be constant. Most RAM seems to be used by Linux for caches.

The most common cause here is something causing a situation where some queries hang or take a long time to complete, while also locking access to something, while new queries keep coming in. This builds up quickly.

A good way to catch this would be to have something log the list of running queries every couple of seconds. Look at this log after the crash and you'll hopefully be able to identify which are the long running processes, and which are the regular queries that build up.

Fixing it would be a combination of changing the queries that cause the locking so that they lock less, perhaps putting a limit on how many queries can build up, and implementing a way for the regular queries that build up to time out or fail more quickly and gracefully.

I fire the queries sequentially. So there is never more than one query running.

The "once per day" part is suspicious. Maybe a scheduled backup is what's running the other query ;)

I had the same suspicion. Especially since the provider indeed runs a nightly backup of the VMs. But even after turning that off, the VM stalled the next night again.

Try doing some rate limiting in order to not cause the deadlock. You should probably also disable the write cache. And if it still doesn't work, switch to a bare metal machine. And give it a lot of swap and up the swappiness. Swapping is a much better alternative than crashing. VPS providers don't like swap because it wears out their SSDs, so the swap and swappiness are probably preset too low.

I tried sleeping 0.1s every 5s or so. It did not help. Still crashed.

I don't think swapping would even occur since neither mysql nor php grow their memory usage over time.

It's not an SSD. It's good old rusty HD.

>Afterwards the table data is corrupted beyond repair.

That should not happen with a DB even if you turn off the power. Are you sure the hardware is good?

GP is using the MyISAM storage engine. It's not crash safe. This is sad but expected behavior.

Don't use MyISAM!

Is there any reason to ever use it (or: Why does it still exist)? In-memory databases for caches or other things that are not critical? I have to admit, I was astounded when I first got the error message from MySQL that a table was corrupted and that I should run REPAIR TABLE. That sounded like very weird behavior for a database.

Any remaining reasonable use cases would be sufficiently corner-casey that the first order approximation is "if you want it to behave like a database, no, you do not want MyISAM".

This being said, at least some years ago, a use case I saw that held SOME water then was generating MyISAM tables offline, importing them as-is into a running MySQL (or taking an instance offline and bringing it back up) and then serving from it read-only. At least at the time, this provided better RO performance than InnoDB. I wouldn't be surprised if that was still true. Please don't do that at home!

Also, I think until the previous-to-most-recent release, some internal tables were still MyISAM, causing MySQL overall to have some very rare cases of not being crash safe. Again, I think that's since been resolved in 8.0.

    Is there any reason to ever use it
It is faster, uses less disk space and has a more logical filesystem layout.

What do you mean by more logical? innodb_file_per_table has existed for a while now.

A flag with that name exists, yes. But it does not separate table data into one file per table. It will still put stuff related to the tables into the central ibdata1 file.

Google "ibdata1 one file per table" to see all the pain it causes.


> But it does not separate table data into one file per table

That's because if you didn't have it set when creating the database, turning the setting on later won't move existing data to the new fs layout without an OPTIMIZE. If you had it on from the beginning, table data is stored per file. I literally just did an ls on my /var/lib/mysql and there's a folder per database, in which there are 2 files per table (.frm and .ibd).

When innodb_file_per_table is on, and the database has been OPTIMIZEd, only the following is stored in ibdata1 [0]:

- data dictionary aka metadata of InnoDB tables

- change buffer

- doublewrite buffer

- undo logs

[0] https://www.percona.com/blog/2013/08/20/why-is-the-ibdata1-f...

   only the following is stored in ibdata1
You say "only", I say "clusterfuck".

Just look at the very page you linked to. It's a totally confusing concept that befuddles users and causes questions "we often receive", starts "panic", can "unfortunately" not easily be analyzed and you might need to "kill threads" and initiate "rollbacks" to fix the problems it brings.

MyISAM got that right. One dir per database.

Ok but I don't get why you're so obsessed with the fs layout anyway. You should mostly treat it as a black box. And the point of ibdata1 is safety, which as you stated higher up is a serious problem with MyISAM. Even if it's not oom situations, you'll end up stuck sooner or later. You have been warned.

    You should mostly treat it as a black box
Again, check the very link you posted. People do that. Until shit hits the fan. And then they have to take that black box apart. Which is not easy.

welcome to MySQL

In general, however, the out-of-memory condition doesn't always come from the Linux kernel; it can come from the underlying memory allocator, which is typically the one in libc (the C runtime). Just because some process's memory allocator returned NULL or threw bad_alloc doesn't mean the system as a whole is running out of memory.

When the kernel is running out of memory it will just start the OOM killer which will kill a process with low "nice" value.

Actually, I'd say that if malloc (or equivalent) returns NULL then the system really is out of memory. Every general-purpose memory allocator is going to contact the OS to ask for more memory if it doesn't have anything free in its own buffers.

But... it's still no good saying 'make your program behave nicely when malloc fails' - even if your own code is perfect, what are the chances that every library you use does the same thing? And even then, Linux by default will optimistically over-allocate memory (and rightfully so!) - with the result that you'll never catch every out of memory condition.

IMO, 'out of memory' is not a property that each single process should try to manage, rather it should be the OS or some other process with a global oversight that monitors memory usage and takes measures when memory gets tight.

You're right the memory allocator ultimately gets the memory it manages from the OS but as a programmer you're looking at it from the abstraction that its API provides and assuming any particular condition that would cause a NULL to be returned or bad_alloc to be thrown may or may not be correct.

The other point is that there's a distinction between the kernel's view of an OOM condition and some memory manager's OOM condition. Consider that you run two processes, both allocate X gigs of memory and both succeed. However, once they start running and committing the memory, you'll get a kernel OOM condition and one process is killed. This is the overcommitting you mentioned.

Personally I don't see why people make such a big fuss about dealing with memory allocation failures. Memory is just a resource, same as any other OS resource: socket, mutex, pipe, whatever. Normally in a well designed application you throw on these conditions, unwind your stack and ultimately report the error to the user or print it to the log and perhaps try again later. Just because it's "memory" should not make it special IMHO.

It's the transparency of memory allocation that makes it so difficult to deal with failures. Even 'trivial' library functions could allocate memory, hell even calling a tiny dumb function might cause the stack to require a new page of memory, leading to failure. Just checking that all malloc calls check for NULL isn't even half of it.

Exception handlers won't save you either. Unless you consciously consider every memory allocation failure, your exception handlers will be too high level and result in your program either aborting by itself or becoming unusable. Did you pre-allocate enough resources to pop up an 'out of memory' error window? Good luck failing gracefully.

Memory allocation is special.

Yes, it's an imperfect world and you can't control what happens in a library but the attitude "it doesn't matter if my code is messed up or not, some library will still do the wrong thing" doesn't help. All you can do is make your code work properly and that's what you (and everyone else) should do.

Again, it's an imperfect world, but saying "it won't work because x, y, z will happen" is not the right attitude. Most of your code should treat it as just a resource allocation failure, and in a sane program that is indicated by propagating an exception up the stack. Now you might be right that the program might fail when it'd be the time to display a message box to the user or whatever. But somewhere in the middle layers of the code you don't have that context, you don't know that it will fail. Therefore that part of your code really should be (exception) neutral just like in any resource allocation case.

> Memory allocation is special.

I think the Zig lang treats it as special, therefore making you write code that handles the case that a malloc fails explicitly.

> I'd say that if malloc (or equivalent) returns NULL then the system really is out of memory.

That's very much not true when 32-bit processes are involved. You can easily be out of (non-fragmented) address space in a 32-bit process (whether it's all resident or not) while the overall system is nowhere close to being out of memory.

Even in a 64-bit process you can exhaust the address space without being out of memory if you try hard enough; you just have to try much harder.

That said, even on Linux allocators will return NULL when they're just out of address space; there's no overcommit going on there.
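A way to see an allocation fail purely from address-space exhaustion, with the machine nowhere near out of memory, is to cap a process's virtual address space and then ask for more than the cap. What follows is only a minimal sketch, assuming Linux and CPython; the helper name `try_allocation_under_cap` and the 1 GiB / 2 GiB figures are made up for illustration, and `RLIMIT_AS` stands in for the natural limit a 32-bit process would hit.

```python
import subprocess
import sys
import textwrap

# Runs in a child interpreter so the cap never affects the calling process.
CHILD = textwrap.dedent("""
    import resource
    import sys
    cap, request = int(sys.argv[1]), int(sys.argv[2])
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        cap = min(cap, hard)      # the soft limit cannot exceed the hard limit
    resource.setrlimit(resource.RLIMIT_AS, (cap, hard))
    try:
        buf = bytearray(request)  # fails inside the allocator, no OOM killer involved
    except MemoryError:
        print("MemoryError")
    else:
        print("allocated")
""")


def try_allocation_under_cap(cap_bytes, request_bytes):
    """Return the child's report: 'MemoryError' or 'allocated'."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(cap_bytes), str(request_bytes)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling `try_allocation_under_cap(1 << 30, 2 << 30)` asks for 2 GiB under a 1 GiB cap, so the request is refused up front regardless of how much RAM is free.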

> That said, even on Linux allocators will return NULL when they're just out of address space; there's no overcommit going on there.

Try calling fork() in that process then. By rights, the new process should inherit its own copy of all the address space of the old process, and is free to overwrite it with whatever it wants. Linux (by default) won't make fork() fail for a process with N GB of RAM when total memory (+swap) < 2N GB, yet there simply isn't enough memory around for both processes. There's your overcommit.
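The arithmetic behind that fork() example can be sketched as below. This assumes the commit-limit formula the kernel documents for strict accounting (`vm.overcommit_memory=2`): swap plus `overcommit_ratio` percent of RAM, with huge pages ignored. It also assumes all of the process's memory is private and writable and that nothing else on the machine has memory committed. The function names and sample sizes are illustrative only.

```python
def commit_limit_bytes(ram_bytes, swap_bytes, overcommit_ratio=50):
    """Commit limit under strict accounting (vm.overcommit_memory=2), huge pages ignored."""
    return swap_bytes + ram_bytes * overcommit_ratio // 100


def fork_fits_strictly(process_bytes, ram_bytes, swap_bytes, overcommit_ratio=100):
    """Would the parent plus a full private copy for the child fit under the commit limit?"""
    limit = commit_limit_bytes(ram_bytes, swap_bytes, overcommit_ratio)
    return 2 * process_bytes <= limit


GIB = 1 << 30

# A 6 GiB process on 8 GiB RAM + 2 GiB swap: memory+swap is 10 GiB, which is
# below 2N = 12 GiB, so a kernel that refuses to overcommit would fail the fork.
print(fork_fits_strictly(6 * GIB, 8 * GIB, 2 * GIB))  # False

# The same process with 8 GiB of swap (16 GiB in total) would be allowed.
print(fork_fits_strictly(6 * GIB, 8 * GIB, 8 * GIB))  # True
```

With the default heuristic mode the first fork() succeeds anyway, which is exactly the overcommit being described.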

Would it be possible for the kernel to suspend the process in scenarios where malloc would fail instead of returning a failure? Either until enough becomes available for it to succeed, or until something tells the kernel to renew/revive/resume the process and try the malloc again?

It could, but if that process is using most of a system's memory, that will lock up forever, because while the process is frozen it won't release any of its memory.

You provide limited information but it is not clear the scenario you explain is a bug. If too much memory is locked into resident memory with mlock then this sounds like the expected and correct behavior.

Then I prefer the unexpected and incorrect behaviour of Windows, which freezes the offending application and continues to be responsive, allowing me to kill it if I wish to do so...

> Few programs can handle a fail return from "malloc"

Fewer than should, that's for sure, but hardly a trivial number. A lot of old-school C programs are very careful about this, and would handle such a failure passably well. Unfortunately, just about every other language tends to achieve greater "expressiveness" by making it harder to check for allocation failure. How many constructors were invoked by this line of code? By this simple manipulation of a list, map, or other collection type? How many hidden memory allocations did those involve? I'm not saying such expressiveness is a bad thing, but it does make memory-correctness more difficult and so most programmers won't even try.

As the world moves more and more toward "higher level" languages, returning an error from malloc becomes a less and less viable strategy. Might as well just terminate immediately, since "most frequent requester is most likely to die" is better than 99% of the OOM-killer configurations I've ever seen.

Glad to see this issue raised! My system hangs for minutes sometimes and is very frustrating compared to Windows and OSX which seem to handle out of memory in a much more user-friendly way. Which seems to be: suspending the offending program and letting the user decide what to do from there. I'm sure there's a reason the Linux kernel doesn't do something similar, but can anyone enlighten me? :)

Probably lack of integration; if NT hits a memory issue, it can just pass notice to the tightly-coupled userland and GUI. If Linux runs out of memory, even if it internally knows what process to blame... What would it do that makes sense for a headless server, TiVo, and Android phone? Keeping in mind that the kernel folks don't even work that closely with many userspace vendors.

OSX handles this with a kqueue event that can notify userland when the system moves between various memory pressure states; this is hooked into by libdispatch and other userland libraries which will discard caches and so on.

I don't see why Linux couldn't do the same; open /sys/kernel/something and epoll on it.

This already exists: applications can receive memory pressure events (such as the system reaching "medium" level, where you may want to start freeing some caches) via /sys/fs/cgroup/memory/.../memory.pressure_level. See https://www.kernel.org/doc/Documentation/cgroup-v1/memory.tx....

The first two nota benes explicitly describe this document being outdated and not what most people expect when it comes to “memory controller”. I am not certain that citing this is a great example.

What about this? Seems to do what they want.


Windows (server and desktop versions) will throw up a message dialog on the screen. It will also start to kill off processes just enough to resolve the low memory situation.

During this - unlike Linux - you can actually use the mouse, CLI and close programs yourself.

On top of that, server applications like IIS have built-in watchdogs. If an IIS process grows to use too much memory (60% IIRC) or excessive CPU, the watchdog will recycle the process.

I think Windows kernels do not use overcommit, so memory allocation will fail if you run out of memory.

You could use the message bus to post a message to the service that handles out of memory decisions, which in turn could either

1. Show a GUI with a choice

2. Show a message on the current terminal and ask what to do

3. Just return "kill it now" if there is no interactive session

And if there is no such service, just default to 3. The problem really is that the state cannot be captured and communicated to the user. I doubt the NT kernel itself shows a GUI window, it's probably a service that gets woken up by a kernel exception, which in turn shows the window. Basically, the Linux kernel needs more pluggable functionality for user interactions. It's absolutely fine and even recommended to not have an entire GUI in the kernel, it needs to just provide a mechanism for userspace to capture the event and decide what to do with it.

Throw a signal like it would do if the process were out of memory completely and about to be killed? (for clarification, no snark intended, actual question)

What signal is sent when a process is out of memory? I thought malloc would either start returning NULL or you’d fault when trying to access overcommitted memory.

Yes, but ideally you want to be throwing some ‘memory pressure’ signals before absolutely running out of memory, so that programs can take simple actions like emptying caches, etc.
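As a rough illustration of what "acting on memory pressure before running out" could look like from userspace: newer kernels (4.20 and later, when built with PSI support) expose pressure stall information in `/proc/pressure/memory`. The sketch below only parses that text format and applies a threshold; the 10% threshold and the function names are assumptions, and a real program would more likely register a trigger and poll the file rather than re-read it.

```python
def parse_psi(text):
    """Parse PSI text into e.g. {'some': {'avg10': 0.0, ...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            values[key] = float(value)
        result[kind] = values
    return result


def under_pressure(psi, threshold_percent=10.0):
    """True if tasks stalled on memory for at least threshold_percent of the last 10 s."""
    return psi.get("some", {}).get("avg10", 0.0) >= threshold_percent


def read_memory_pressure(path="/proc/pressure/memory"):
    """Return parsed PSI data, or None if the kernel does not provide the file."""
    try:
        with open(path) as f:
            return parse_psi(f.read())
    except OSError:
        return None


# Sample text in the documented format, used here instead of live data.
SAMPLE = (
    "some avg10=12.50 avg60=3.10 avg300=0.70 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

if __name__ == "__main__":
    if under_pressure(parse_psi(SAMPLE)):
        print("memory pressure: time to drop caches")
```

The point of the design is that the application decides what "emptying caches" means; the kernel only reports how much time tasks spent stalled.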

Catching an otherwise-fatal out-of-memory fault and recovering would be too complicated / bug-prone.

Android sends low memory events and kills processes based on heuristics.

Then again, Android has a customized Linux kernel.

This describes my general Linux experience well: A very stable kernel, with which I never had serious issues on a headless server. But applications in the userspace (apart from the standard GNU packages) are usually a tossup anywhere between system-crashing garbage and perfectly working pieces of software.

I used to run into this problem all the time in grad school. Once a month or so I'd load a data set, do some dumb Python operation on it that took significantly more memory than I predicted, and BAM! I'd have to restart my laptop.

I just kinda assumed that's how computers worked until I got a Mac a couple of months ago...

The link suggests that there might be some default parameters you could change to protect against this behavior. Does anyone have any suggestions on what settings to change?

A Mac is certainly better at handling these kinds of issues but it's by no means totally safe. It tries to compress memory and dynamically allocate more swap, but there's still a limit and you can see that if you accidentally run programs with way higher RAM requirement than you have. I've had multiple occasions where my program used so much RAM that even moving the cursor is an exercise in patience, never mind switching to a terminal window and typing commands to kill the process.

A Mac will keep creating virtual memory swap up to some limit (some multiple of the amount of physical RAM — can't quite remember, possibly 5x) and then it will produce a kind of vague dialog box saying "You've run out of application memory" with a list of applications to force quit.

But at least you can recover rather cleanly from the issue.

If you've foolishly decided to run without swap (like the original post), then suspending the offending program does nothing.

This is because the offending program has allocated a lot of private dirty pages, which can't be dropped from memory because without swap space, there is nowhere for it to go.

Linux use cases tend to be servers where user interaction is unexpected at 3AM? No one around to make a choice, so automate a choice.

IMO despite the standout behavior, I prefer my software to deal with itself.

Systems designed to wait for user input end up having design choices intent on keeping a user using them.

Software is just a tool. Not a lifestyle. Set and forget this shit as much as possible

If the alternative is simply killing a process or crashing the kernel, then surely a better approach would be to suspend the offending process and call a handle that does something. If you want that something to restart the machine, fine. You want it to notify the administrator, fine.

There is the use case of Android phones. One of the answers to the OP is about that case. It seems that Google developed a user space process to monitor those events: https://lkml.org/lkml/2019/8/5/1121

From that reply it seems that Facebook implemented something similar, I guess for their servers.

> Linux use cases tend to be servers where user interaction is unexpected at 3AM? No one around to make a choice, so automate a choice.

Even if it's a small percentage of the overall "computing" population, there are still millions of people running Linux on the desktop (roughly 2% out of 3.2 billion people using the internet makes for 64 million, the population of a large European country). That's 64 million people for whom this behaviour is a pain in the arse.

Does FreeBSD handle this issue better?

And you've turned swap off on Windows?

Windows will not BSOD due to memory pressure.

I think it happens only if you have a broken device driver, applications cannot cause BSOD with memory allocation.

Well, I beg to differ: using an unlimited swap file can quickly lead to hard issues after 64 GB of swap use. At that point mallocs in the Windows UI that apparently are not meant to fail do fail (timeouts or something?), e.g. fonts missing from the shutdown menu, the system being unable to shut down, etc.

Does it BSOD if it runs out of swap space though?

I'm not sure but the assumption might be that there's generally no user to ask as the computer might be a server.

Right but if there is a user to ask then it should ask!!

Ehm, and how does the OS know which is the “offending” process?

I think you are confusing the issue raised here with your desktop experience.

Currently the Linux kernel computes a score for each process based on some heuristics. There's a good introductory article on LWN:


Yep and it's about as good as just picking a random process and killing it.

It's awesome when you run out of memory and you try to log in only to have it kill sshd.

A classic from [1]:

> An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

[1] https://lwn.net/Articles/104185/

Fortunately, engineers have invented a way to attach a Strolling Wheelbarrow After Plane, where you can stash the sleeping passengers without ejecting them out of the plane entirely. This has the unpleasant side effect of slowing down the journey for everyone when passengers wake up inorderly (and God forbid everyone wake up at the same time), though.

What I still do not understand is why people continue to turn a blind eye to this instead of switching to SmartOS. I just don't get it.

How does Solaris/SmartOS handle that situation?

It doesn't get in that situation, because malloc() can return null on Solaris (i.e. it never¹ overcommits).

While in general I think this is vastly better than the somewhat insane Linux OOM killer, you can get in awkward situations where you can't start any processes (including a root shell) because you're out of memory.

I rather like the FreeBSD solution to this, which is to not overcommit, but after a certain number of allocation failures it kills the process using the most memory. This prevents situations where you can't start any processes.

There's no one-size-fits-all solution to handling low memory conditions, but the Linux solution manages to almost never do what you want which is kind of impressive in a way.

¹ I seem to recall hearing somewhere that you can allow allocations to overcommit on a per-application basis on later versions of Solaris, but don't quote me on this.

> FreeBSD solution to this, which is to not overcommit

Where did this myth come from? Did y'all just assume that the vm.overcommit sysctl actually makes sense and zero means "no overcommit"? :)


But indeed, OOM killer kills the largest process, which makes more sense in most scenarios than Linux's "badness" scoring.

Huh, I had no idea it worked like that. That's bizarre.

Running sshd as an on-demand (Type=socket) service would probably work better, since then the sshd process would be new and thus have a better heuristic score - also not be tying up memory sitting unused in the meantime.

systemd still seems to run it (Type=notify) with the -D option all the time though, at least on the systems I can check.

Dropbear is configured by default as a Type=socket service though.

This is sort of just kicking the problem down the road. Your idea actually might work for (presumably) low-volume use ssh, but what about the next important service? When does the work-around to a papered-over work-around to a virtual problem that is supposed to just be RAM-backed or handled at

  ptr = malloc(42);
  if(!ptr) exit_error();

Well, there probably needs to be a way to override the heuristic at least, sort of a 'this process is important, don't auto-kill it if trying to find memory'.
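For what it's worth, such an override exists as `/proc/[pid]/oom_score_adj`, which takes values from -1000 (never pick this process) to 1000 (pick it first). A minimal sketch of reading and writing it follows; the helper names and the `proc_root` parameter are made up here so the code can be pointed at a scratch directory, and lowering the value below its current setting normally requires extra privileges (CAP_SYS_RESOURCE).

```python
import os


def set_oom_score_adj(value, pid="self", proc_root="/proc"):
    """Write oom_score_adj for a process; valid range is -1000 to 1000."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))


def get_oom_score_adj(pid="self", proc_root="/proc"):
    """Read the current oom_score_adj of a process as an integer."""
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path) as f:
        return int(f.read().strip())
```

An unprivileged process can volunteer itself as a victim with `set_oom_score_adj(500)`, while protecting something like sshd with a negative value is a job for root or for the service manager.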

As for ssh specifically, I rarely ssh into my desktop machine, but I keep sshd running for just this kind of situation where I might need to try and rescue a swamped machine. So in most cases low-volume sshd use is exactly what is called for.

If you're running into the memory purge of doom on a server that's probably a whole different nightmare scenario.

malloc returning NULL has been a broken assumption for a long time though, and that isn't going to change afaik.

I blame the specific algorithm for that, not the basic concept. Nothing with less than 10MB of memory use should ever get killed unless you're in some kind of fork bomb.

Probably it would be good to include the amount of overcommit in the heuristic. Processes that overcommit should be killed first. What is a good way to measure/inspect overcommit of the process?
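One rough way to inspect this from userspace is to compare a process's virtual size with what is actually resident or swapped, using the `VmSize`, `VmRSS` and `VmSwap` fields of `/proc/[pid]/status`. This is only a proxy and a sketch: `VmSize` also counts file-backed mappings such as shared libraries, so the difference overstates anonymous overcommit. The function names and the sample values are invented for illustration.

```python
def parse_status_kib(text):
    """Extract the fields reported in kB from /proc/[pid]/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            fields[key] = int(parts[0])
    return fields


def unbacked_kib(fields):
    """Virtual size minus resident and swapped memory: promised but not yet backed."""
    return fields["VmSize"] - fields["VmRSS"] - fields.get("VmSwap", 0)


# Sample text with made-up numbers, in the layout the kernel uses.
SAMPLE = (
    "Name:\tdemo\n"
    "VmSize:\t  204800 kB\n"
    "VmRSS:\t   51200 kB\n"
    "VmSwap:\t    1024 kB\n"
)

if __name__ == "__main__":
    print(unbacked_kib(parse_status_kib(SAMPLE)), "kB promised but not backed")
```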

Further to the comments about the pager hammering the disk to read clean pages (mainly but not exclusively binaries) even if swapping is disabled: In many cases adding swap space will reduce the amount of paging which occurs.

Many long-lived processes are completely idle (when was the last time that `getty ttyv6` woke up?) or at a minimum have pages of memory which are never used (e.g. the bottom page of main's stack). Evicting these "theoretically accessible but in practice never accessed" pages for memory frees up more memory for the things which matter.

Unfortunately enabling swap in linux has a very annoying side effect, linux will preferentially push out pages of running programs that have been untouched for X time for more disk cache, pretty much no matter how much ram you have.

This comes into play when you copy or access huge files that are going to be read exactly once, they will start pushing out untouched program pages to disk, in exchange for disk cache that is completely 100% useless, even to the tune of hundreds of gigabytes of it.

Programs can reduce the problem with madvise(MADV_DONTNEED), but that only applies to files you are mmap()ing, and every single program under the sun needs to be patched to issue these calls.

You can adjust the vm.swappiness sysctl to make X larger, but no matter what, programs will start to get pushed out to disk eventually, and cause unresponsiveness when activated. You can reduce vm.swappiness to 1, but if you do, the system only starts swapping in an absolutely critical low-RAM situation, and you encounter anywhere from 5 minutes to 1+ hour of total, complete unresponsiveness when that happens.
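To check what a machine is currently set to, the sysctl can be read straight from `/proc/sys`. The sketch below assumes the usual mapping of dotted names to paths, which holds for `vm.*` settings but not for every sysctl (some network names contain dots of their own); the helper names and the `root` parameter are hypothetical.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as 'vm.swappiness' to its file under /proc/sys."""
    return os.path.join(root, *name.split("."))


def read_sysctl_int(name, root="/proc/sys"):
    """Read an integer-valued sysctl; raises OSError if the file is missing."""
    with open(sysctl_path(name, root)) as f:
        return int(f.read().split()[0])


if __name__ == "__main__":
    try:
        print("vm.swappiness =", read_sysctl_int("vm.swappiness"))
    except OSError:
        print("vm.swappiness is not available on this system")
```

Changing the value is a write to the same file as root, or `sysctl -w vm.swappiness=...`; reading needs no privileges.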

There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

> There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

Here's the thing: a mapped program page is just another page in the page cache. Now, you could maybe say that "any page cache page that is mapped into at least one process will be pinned", but the problem there is that means that any unprivileged process can then pin an unlimited amount of memory, which is an obvious non-starter.

A workable alternative might be to add an extended file attribute like 'privileged.pinned_mapping', which if set indicates that any pages of the file that have active shared mappings are pinned. That means the superuser can go along and mark all the normal executables in this way, and the worst-case memory consumption a user can cause is limited by the total size of all the executables marked in this way that the user has access to.

SuSE solves this in their SuSE Linux Enterprise Server (SLES) with a new sysctl tunable, which soft-limits the size of the page cache.


It is quite effective, although historically there have been issues with bugs causing server lockups in the kernel code around this tunable. It seems to be quite stable in SLES 15, however.

While the tunable is available in their regular SLES product, it is only supported in the "SLES for SAP". The two share the same kernel, that is probably why.

There's no reason extra data cannot be added to entries in the page cache to make smarter decisions. That’s how Windows and OS X do it in their equivalent subsystems.

Nobody is suggesting these pages be pinned which is an extreme measure.

The problem I'm trying to point out here is that if the extra metadata in the page cache is entirely under user control (like for example "is mapped shared" and/or "is mapped executable") then it amounts to a user-specified QOS flag.

That might be OK on a single-user system but it doesn't fly on a multi-user one. That's why I suggested you could gate that kind of thing behind some kind of superuser control.

Why can’t a user make QoS decisions for their own pages? Root controlled pages should obviously have higher priority.

The kernel could still “fairly” evict pages across users - just letting them choose which N pages they prefer to go first.

> Why can’t a user make QoS decisions for their own pages?

Because then you just get everyone asking for maximum QOS / don't-page-me-out on everything they can.

The pages in the page cache are not owned by a particular user, they're shared. If there are three users running /usr/bin/firefox, they'll all have shared read-only executable mappings of the same page cache pages. If you do a buffered read of a file immediately after I do the same, we both get our data copied from the same page cache page. So it's not at all clear how you'd do the accounting on this to implement that user-based fairness criterion.

> but that only applies to files you are mmap()ing

fadvise provides the same for file descriptors. some tools such as rsync make use of it to prevent clobbering the page cache when streaming files.
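A sketch of what that looks like for a read-once streaming job, here hashing a file: the program hints sequential access up front and tells the kernel after each chunk that it will not need those pages again. The function name is made up; `os.posix_fadvise` is Python's wrapper for the syscall. The advice is only a hint, and this simple version will also evict pages that some other process wanted cached.

```python
import hashlib
import os


def sha256_without_caching(path, chunk_size=1 << 20):
    """Hash a file while asking the kernel to drop each chunk from the page cache."""
    digest = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            # We are done with this range; let the kernel reclaim it early.
            os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
    return digest.hexdigest()
```

Kernel readahead still works here because the reads are buffered; only the already-consumed ranges are handed back.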

nice! was not aware of that syscall, however, patching the entire world remains...

it might be possible to create an LD_PRELOAD'd library that wraps open-type syscalls (i.e. those that return an fd; might just be open, haven't kept up with all of linux's syscalls) and that, based on a config file, calls fadvise on those fds that correspond to specific files/paths on disk. Won't help for statically linked binaries or those that make syscalls directly without glibc's shims, but that should be a small number of programs.

heck, if I were still a phd student, I'd want to run performance numbers on this in many different scenarios and see how performance behaves. feel like there could be a paper here.

@the8472, hah, so someone thought of the same thing. I'd try to leverage /etc/ld.so.preload with a config file as a more transparent solution, but your link proves the point that it's possible.

You probably don't want it in ld preload globally because it would also clobber the page cache in processes that do benefit from it.

And if you only do it in a container you can also limit the page cache size of the container to avoid impacting the other workloads.

hence why I said a config file based (i.e. include the path that stores one's media, won't matter what program you use to play it), but yes, page cache does play a role (but hence why I also said it be interesting to explore how different applications behave with and without it and how that impacts other system performance)

i.e. I really wonder for desktop workloads if one only caches "executable" data, how would that negatively impact perceived performance. I'd imagine it have some impact, but I'd be interested in seeing it quantified.

Can't the file cache detect streaming loads and skip caching it? ZFS does this for its L2ARC[1].

[1]: https://wiki.freebsd.org/ZFSTuningGuide#L2ARC_discussion

Is there a way to enable such an option for an entire process, in the same way as e.g. ionice(1)?

Whenever I take a backup of my computer it winds up swapping everything else out to disk. Normally I'm perfectly happy letting unused pages get evicted in favor of more cache, but for this specific program this behavior is very much not ideal. I'm asking here since I've done some searching in the past and not found anything, but I'm not sure if I was using the right keywords.

Follow-up, for people who encounter this thread in the future: I did some more hunting and found `nocache` (https://github.com/Feh/nocache , though I installed it via the Ubuntu repositories) which does this by intercepting the open() system call and calling fadvise() immediately afterwards.

> Unfortunately enabling swap in linux has a very annoying side effect, linux will preferentially push out pages of running programs that have been untouched for X time for more disk cache, pretty much no matter how much ram you have.

THIS. I ended up disabling swap because my kernel insisted on essentially reserving 50% of RAM for disk buffers; meaning even with 16GiB of RAM, I'd have processes getting swapped out and taking forever to run, because everything was stuck in 8GiB of RAM, and Firefox was taking 6GiB of that. I couldn't for the life of me figure out a way to get Linux to make that something more reasonable, like 20%. (And yes, I tried playing with `vm.swappiness`.)

Programs should really use unbuffered i/o for large files read only once (yes i know Linus doesn’t like unbuffered i/o but he’s wrong)

> This comes into play when you copy or access huge files that are going to be read exactly once

Readahead is still useful for large files read sequentially once, and that needs to be buffered. Such programs should use posix_fadvise().

you can readahead as far as you like with unbuffered io

If you are reading unbuffered (ie. O_DIRECT) then you are reading directly into the memory block the user supplied, so you cannot read ahead - there's nowhere to put the extra data.

Of course you can read ahead using multiple buffers, you can issue as many reads as you want concurrently

I think it is pretty clear I was referring to kernel-mediated readahead. Sure, you can achieve the same thing in userspace using async IO or threads.

I think it’s pretty clear that I was referring to reading large files using direct io, and your comment was totally irrelevant

Reading large files using direct IO defeats kernel read-ahead, which means you have to take on the complexity of reimplementing it in userspace, or the performance hit of not having it.

This is a good reason for programs to use buffered IO even when they are reading a large file once, so yes my comment was entirely on point.

Your comments were misleading, even if you want to argue they were on point. I know that you're knowledgeable about the kernel; I read your other comments. The following comments are just not true, except in bizarro world where all code must be in the kernel:

- "Readahead... needs to be buffered"

- "reading unbuffered (ie. O_DIRECT)... you cannot read ahead"

Good application code needs to work around kernel limitations all the time. Your suggestion of fadvise is reasonable, except in practice fadvise rarely makes a measurable difference. The only way to get what you really want is to code it the way you want it, and fortunately, there are multiple ways to do that.

> There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

Did you try different vm.vfs_cache_pressure values?
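
For reference, the current values can be checked before experimenting. A small sketch, assuming the usual `/proc/sys/vm` layout; `read_vm_sysctls` is a hypothetical helper name:

```python
from pathlib import Path


def read_vm_sysctls(names=("swappiness", "vfs_cache_pressure"), root="/proc/sys/vm"):
    """Return {name: int} for each vm sysctl readable under `root`."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file, no permission, or not a plain integer: skip it.
            continue
    return values
```

Changing the values needs root and is normally done with `sysctl`, so this sketch only reads them.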

Probably the best solution to this is something like memlockd where you explicitly tell the kernel what memory must always be in resident set.

Linux resource scheduling and prioritization is pretty awful given its popularity.

TBH, there are very few OSes that get high-pressure resource scheduling and prioritization right under nearly all normal circumstances.

The hackaround for decades on Linux has been to add a tiny swap device, say 64-256 MiB on a fast device, in order to 0) detect average high memory pressure with monitoring tools and 1) prevent oddities under load without swap (as in the OP's example).
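
One way a monitoring tool could use that tiny swap device as a pressure signal is to watch the `pswpin`/`pswpout` counters in `/proc/vmstat`. This is a sketch of the idea, not how any particular tool does it, and the function names are made up:

```python
def parse_vmstat(text):
    """Parse /proc/vmstat text ("name value" per line) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after, interval_s):
    """Pages swapped in and out per second between two parsed samples."""
    return {
        key: (after.get(key, 0) - before.get(key, 0)) / interval_s
        for key in ("pswpin", "pswpout")
    }
```

Take two samples of `open("/proc/vmstat").read()` a few seconds apart and feed them through these. Sustained non-zero rates mean the box is actively swapping; what threshold counts as "high pressure" is a judgment call.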

SGI IRIX nailed this, FWIW.

I would have thought some of the IRIX scheduler made it into Linux by now.

No way, any sufficiently advanced technology is indistinguishable from magic and IRIX was so very, very advanced. IRIX hasn't been in development for almost two decades now and it's still more advanced in aspects like guaranteed I/O and software management (inst(1M))... What does that say about it and what does it say about the engineers who worked on it?

And you are right on all counts. Inst was magic. The things I did, often on live systems...

XFS is still better than any version of that ext dreck!

That's debatable. Better at losing file contents after hard reset? Maybe.

Makes me want to gather them in a room and continue the process. Irix was so very good.

Irix even had "virtual swap", which had no backing store (or a far too small one), just to handle all the super-large allocations of which only a tiny amount is ever used.

As crazy as it sounds, can that fast device be a ramdisk?

Not crazy at all, very useful.

Or Armbian, a Debian derivative for many RaspberryPi-like ARM boards, where it helps avoid trashing the usual SD cards and holds the logs, with different compression algorithms for /var/log, /tmp and swap.

Their script is here:


I really like how this enables a tiny NanoPi Neo2 with 1 GB of RAM, booting from a 64 GB SD card in an aluminum case with a SATA adapter and a mostly idle 2.5" HDD, to draw only about 1 W from the wall socket, while clocking up from about 400 MHz to 1.3 GHz if it needs to. It's not the fastest, but sufficient.

Exactly what I thought. Losing <5% of your RAM as swap space probably won't make much difference.

But if it solves problems, why doesn't the kernel automatically assign part of RAM as a virtual swap device, if the user allows it? It could help monitoring tools, and because the kernel knows it is not a real swap device, it could optimize its use.

Swap has a side effect that's not very nice: It makes memory non-deterministic, as disk is non-deterministic.

Linux does unfortunately have serious issues with latency spikes, which make it behave horribly with realtime tasks such as pro audio. It's so bad that it's often perceivable in desktop usage.

linux-rt does mitigate this considerably, but it's still not very good.

I'm hopeful for genode (with seL4 specifically), haiku, helenos and fuchsia.

Yes, this is exactly the cause. The pager hammers the disc to read clean pages, because they don't count as swap.

And I agree that a small amount of swap can actually reduce the paging that occurs, if you assume that the amount of RAM required is independent of the amount of RAM available. However, as we all know, stuff grows to fill the available space, and if you do configure swap you just delay the inevitable, not prevent it.

Having said that, having swap available means that when memory pressure occurs, you have a more graceful degradation in service, because the first you know about it is when the kernel starts swapping out an idle process to free memory for the new memory hog that you are interacting with. This slows down your interactive session, but not as much as if you have no swap available - in that case, the system suddenly and drastically reduces in performance because it is trying to swap in your interactive process. The more graceful degradation of having some swap available gives you a chance to realise that you are doing more than your computer can cope with, and stop.

As far as I see it, there are three solutions:

1. Disable overcommit. This tends to not play very nicely with things that allocate loads of virtual memory and not use it, like Java. And if you do have a load of processes that actually use all the memory they allocated, then you can still have the same problem occur. The solution to that one is to get the kernel to say no early, before the system actually runs out of RAM.

2. Get the OOM killer to kill things much earlier, before the system starts throwing away clean pages that are actually being used. On my system with 384GB RAM, I have installed earlyoom, and instructed it to trigger the OOM killer if the free RAM falls below 10% (and remember that stuff you are actually using, but happens to be clean, counts as free RAM). This is the easiest and quickest solution right now. If your main objection to this is that you are inviting the system to kill things that might be important, remember that the kernel already does this, and if you don't like it you should use option 1 above (and really hope that all your software handles malloc failure correctly).

3. Introduce a new system in the kernel to mark pages that are actually being regularly used but are clean as "important", and no longer count as free RAM for the purposes of calculating memory pressure. This could either be as a new madvise (but it would be impractical to get all software developers to start using this), or by marking all binary text by default (which would neglect the large read-only databases that some programs hammer), or by some heuristics. This would then trigger the OOM killer (or the allocator to say no, depending on overcommit) when actual free RAM is low.
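
Options 1 and 2 both come down to numbers the kernel already exposes in `/proc/meminfo`. The sketch below shows the kind of check involved. It is not earlyoom's code, the function names are invented, and it assumes the `MemAvailable`, `CommitLimit` and `Committed_AS` fields are present (they are on reasonably recent kernels).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (KiB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def available_percent(info):
    """MemAvailable as a percentage of MemTotal.

    MemAvailable is the kernel's own estimate and includes reclaimable
    cache, i.e. clean pages still count as available here.
    """
    return 100.0 * info["MemAvailable"] / info["MemTotal"]


def should_kill_early(info, min_available_percent=10.0):
    """Option 2: act once available memory drops below the threshold."""
    return available_percent(info) < min_available_percent


def commit_headroom_kib(info):
    """Option 1: how much more can be committed before the limit is reached.

    Only meaningful as a hard limit when strict overcommit accounting
    (vm.overcommit_memory=2) is enabled.
    """
    return info["CommitLimit"] - info["Committed_AS"]
```

On a live system the input would be `open("/proc/meminfo").read()`. Note that this check inherits exactly the weakness option 3 is about: hot clean pages are counted as available, so the threshold can look healthy while the machine is already thrashing.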

This exact bug has been a huge issue for me when developing large simulations in Matlab.

Things get swapped around and memory is often close to the limit. Linux then becomes unresponsive, and basically stalls. Theoretically it recovers, but that process is so slow that the next stall is already happening.

It is therefore impossible to run large scale Matlab simulations on my Linux machine, while it is no issue in Windows.

As far as I can see, Linux is only usable with enough RAM that you are guaranteed never to run out. I don't know why this has never been treated as an issue; I guess because it is a server OS where RAM can be planned, or because it happens very infrequently?

Try zram memory compression with the zstd compression algorithm if your kernel supports it, or at least with lz4hc. I use this setup to compile Chromium and run a few memory-hungry processes, and the system stays responsive during compilation. Here is the `free -h` output:

              total        used        free      shared  buff/cache   available
Mem:           15Gi       6.6Gi       4.0Gi       295Mi       4.8Gi       8.3Gi
Swap:          27Gi        10Gi        17Gi

Note that Swap is just a piece of compressed memory. I have no real swap and swappiness is set to 100.
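
The "used" figure that `free` reports for zram swap is the uncompressed size, so it doesn't say how much RAM the compressed data really occupies. That can be read from the device's `mm_stat` file. A small sketch, assuming the device is `zram0` and relying only on the first three fields (original size, compressed size, total memory used, all in bytes); `parse_mm_stat` is a made-up name:

```python
def parse_mm_stat(text):
    """Summarise the first three fields of /sys/block/zram0/mm_stat."""
    fields = [int(f) for f in text.split()]
    orig, compr, used = fields[0], fields[1], fields[2]
    return {
        "orig_data_size": orig,    # bytes stored, before compression
        "compr_data_size": compr,  # bytes after compression
        "mem_used_total": used,    # RAM actually consumed, incl. overhead
        "ratio": orig / used if used else None,
    }
```

Call it with `open("/sys/block/zram0/mm_stat").read()`. A ratio around 3 would mean that 10Gi of swapped data costs roughly 3.3Gi of real RAM.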

Thank you, I will check if we can enable zram somehow.

Add swap (Windows does that, too) and never use more memory than you have RAM (edit: ..in one process). Storing swap on fast storage adds to the fun and the price.

The one trick SSDs don't like.

Isn't swap the standard configuration? I did not set up this server, but I don't see why it would not have swap. Nevertheless, this problem occurs. I'll check.

I know the solution is to never use more memory than I have RAM, but that's just what happens - and I know there is no way to "solve" the issue and make it magically run. The issue is HOW it is handled... It's weird that other OSs can deal with this, while Linux needs to be restarted.

I think the issue is that only a handful of (Matlab) processes eat up all the RAM, so this "OOM" can not really do anything - there's no use killing off other processes. What should probably happen is to kill Matlab or one of its processes, even though it is in use. I'd be fine with that. Give some out of memory error and kill or suspend the process. At least then we know.

Instead, the system just locks up completely (because the other threads keep trying to push stuff into memory), but is not actually dead, so we don't even catch the issue. Also, because EVERY process is essentially stalled, you can't even kill Matlab yourself. Or suspend it and dump the data, which would be useful. No, you have to hard reset the machine.

Swap doesn't resolve this on a HDD though. The UI/terminal still locks up, and you still can't recover once you hit the point of thrashing.

What really confuses me is that this kernel was developed when SSDs didn't exist, so how on earth did "The system becomes irrecoverably unresponsive if a single application uses too much RAM" get missed?

> so how on earth did "The system becomes irrecoverably unresponsive if a single application uses too much RAM" get missed?

I don't know, there are/were several similar issues (very basic situations, frequently encountered by everyone or at least many people) which are/were not fixed for years (we might say decade(s)): that one dealing with memory exhaustion; then right after that, the problem which follows when memory is freed but the system is still unresponsive for several minutes(!); freezing when writing to a USB disk; freezing when something goes wrong on an NFS mount...

I never understood why those common and really important issues were not tackled (or were tackled only after many, many years). IMO these are such basic functions, which a proper OS is expected to perform reliably before anything else, that they should have been given the highest priority.

It didn't get missed; it's just that nobody cared enough.

Servers can be provisioned with far more memory than their use case needs, and can have spares configured to take up the load if one has to be killed.

On the desktop side the issue happens far less if you have more than enough memory. A developer running vim and firefox/chrome on a machine with 32GB of RAM is vastly less likely to experience this issue than someone on a cheap laptop with 4GB of RAM.

It didn't get missed. We kill -9d and rebooted a lot.
