The Linux kernel's inability to gracefully handle low memory pressure (lkml.org)
647 points by emkemp on Aug 5, 2019 | 455 comments



Similarly, there are many annoying Linux bugs:

`pthread_create` can sometimes return a garbage thread value or crash your program entirely without any way to catch or detect it [1]. High-speed threading is hard enough as it is without the kernel acting non-deterministically.

Un-killable processes after a copy failure (D or S state) [2]. If the kernel is completely unable to recover from this failure, is it really best to make the process hang forever, where your only available option is to restart the machine? I ran into this with a copy onto a network drive with a spotty connection; the actual file itself really didn't matter, but there was no way to tell the kernel this.

Out Of Memory (OOM) "randomly" kills off processes without warning [3]. There doesn't appear to be a way to mark something as low-priority or high-priority, and if you have a few things running, it's just "random" what you end up losing. From a software-writing standpoint this is frustrating to say the least and makes recovery very difficult: which process restarts which, and how do you tell why the other process is down?

[1] https://linux.die.net/man/3/pthread_create

[2] https://superuser.com/questions/539920/cant-kill-a-sleeping-...

[3] https://serverfault.com/questions/84766/how-to-know-the-caus...


Point 3 is wrong. OOM killing is not random. Each process is given a score according to its memory usage, and the process with the highest score is chosen by the kernel. The way to mark priority in killing is to adjust this score through /proc. All of this is documented in `man 5 proc`, from `/proc/[pid]/oom_adj` to `/proc/[pid]/oom_score_adj`.

http://man7.org/linux/man-pages/man5/proc.5.html
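
For example, here's a minimal sketch (error handling omitted) of making the current process much less attractive to the OOM killer; note that lowering the score below its current value typically needs CAP_SYS_RESOURCE:

    #include <fstream>

    int main() {
        // Range is -1000 (never OOM-kill this process) to +1000 (kill it first).
        std::ofstream("/proc/self/oom_score_adj") << -500;
    }

The same works for any other process via /proc/[pid]/oom_score_adj.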


I haven't toyed with that in a long time (probably about a decade, really), but back when I did it was still very difficult to get the OOM killer to behave the way you wanted. IIRC the scoring is fairly complex and takes child processes into account, so while it's technically completely deterministic, it can still be fairly tricky to anticipate how it's going to work out in practice. And I did know about `oom_adj`. Often it worked. Sometimes it didn't. Sometimes it would work too well and not kill a process that was clearly using an abnormal amount of memory. Finding the right `oom_adj` is an art more than a science.

Overall I ended up in the camp "you shouldn't throw passengers out of the plane"[1]. The best way to have the OOM killer behave well is not to have it run at all. If I don't have enough RAM just panic and let me figure out what needs to be done.

[1] https://lwn.net/Articles/104185/


And the OOM Killer has been rewritten entirely about twice in the past decade. Christoph Lameter (one of my coworkers, who also wrote the SLUB memory allocator) wrote the very first one. Very little, if any, of his original code is in the current Linux OOM Killer.

The current approach does indeed work much better. You can entirely disable the OOM Killer for a given workload with those procfs handles.


When I have multi-tenant shared servers that are regularly experiencing low-memory conditions, that's now one of the first things that I turn off. I set sysctl vm.oom_dump_tasks=0 and vm.oom_kill_allocating_task=1, because without those (and a comfortable amount of swap) I've found that the likelihood of a rogue process using up all the memory and the system becoming completely unrecoverable without power cycling goes way up.

The way that the score is calculated, as I understand it, the processes with large memory footprints that are not particularly long-running, and especially not owned by root, are the most likely to be killed. My shared servers are running unicorn for Rails apps, so they almost all look the same under that lens.

I want the process with the runaway memory usage to be killed, so that its owner becomes aware of the problem when they find their app is down. (There are many better ways to solve this, but in our current system design there's nothing to tell us that a service is down, so killing some other disused service means it's down until some user comes along and is impacted.) It seems like killing the process that made the last allocation is the most likely way to get the behavior I'm looking for, where the failures are noticed right away.

But I'm not convinced I've fixed anything, even if the behavior seems better and I haven't had any servers going rogue and killing themselves lately. I'm not the sysadmin; I'm pretty sure the sysadmin just comes along and over-provisions the machine so this doesn't happen again, which seems to be the best advice out there. And as I've learned from the discussion around this article, having a fast SSD, on which paging things out and expiring them from swap happens fast enough to look almost like memory, almost completely neuters the OOM killer anyway.

Our whole situation is wrong, don't even get me started; I'd like to solve this with containers, where we can set a policy that says "all containers must have memory request and limit" and thanks to cgroups, this problem never comes back.


> thanks to cgroups, this problem never comes back.

cgroups doesn't save you from the OOM killer. In fact, we're seeing persistent OOM issues in production even when there's more than enough physical memory to satisfy all allocation requests.

Similar to the issue discussed in the leading LKML thread, I/O contention and stalls when evicting pageable memory (e.g. buffer cache) can result in the OOM killer shooting down processes when an allocation request (particularly an in-kernel request, such as for socket buffers) can't be satisfied quickly enough--quickly according to some complex and opaque heuristics.

The fundamental problem is that overcommit has infected everything in the kernel. The assumption of overcommit is baked into Linux in the form of various heuristics and hacks. You can't really get away from it :(


> OOM killing is not random.

Yeah, I know; hence the quotes around "random".

> The way to mark priority in killing is to adjust this score through /proc.

Haven't heard about this, thanks for the heads up!


> > The way to mark priority in killing is to adjust this score through /proc.

> Haven't heard about this, thanks for the heads up!

While going down that road is technically correct, it is a road full of pain.

A slightly less painful strategy is to disable overcommit. That way, if memory pressure is high and a process calls `malloc`, that call will fail if there is not enough memory, and that process alone fails. If you only have a couple of processes in your system that are using most of the memory and you can control them, it is simpler to just make them resilient to this kind of error than to try to mess with the process scores to control the OOM killer.
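
For reference, strict accounting is `vm.overcommit_memory=2` (with `vm.overcommit_ratio` or `vm.overcommit_kbytes` setting the commit limit). Once that is set, the failure shows up at the allocation site, roughly like this sketch:

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // With strict accounting, an unbackable request fails right here
        // instead of the OOM killer shooting some process later.
        void *buf = std::malloc(8ull * 1024 * 1024 * 1024);   // 8 GiB
        if (buf == nullptr) {
            std::fputs("allocation refused: shed load, flush caches, or exit cleanly\n", stderr);
            return 1;
        }
        std::free(buf);
        return 0;
    }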


Unfortunately, the fork/exec method of spawning processes doesn't work for memory-heavy processes (such as, say, a Java server) without overcommit and copy-on-write memory. Not that I think fork/exec is the best method to spawn processes, but it is the standard way in Unix-like systems.


The rationale for disabling overcommit only really makes sense if physical and virtual memory consumption are in the same ballpark. That's true for some workloads but not generally true.

There are totally legitimate use cases for processes that use significantly more virtual memory than physical memory (since virtual memory is relatively cheap, but physical memory isn't). A lot of programs are going to touch and not release all virtual memory they allocate from the kernel, but there are plenty of important counterexamples.

fork()/exec() is one example (which I've been burned by personally), but there are plenty of others. Any program that uses TCMalloc and has fluctuating memory consumption will have a lot of virtual address space allocated but not backed by physical pages. Sophisticated programs like in-memory caches or databases can also safely exploit a larger virtual memory space while keeping the amount of physical memory bounded.


Virtual memory and overcommit aren't the same thing. Virtual memory (in the sense people usually mean here, swap) means using disk space as memory. Overcommit means that the OS allows more memory to be allocated than it can guarantee is available. It is exactly like an airline booking 1,000 passengers for a flight with 500 seats, hoping that only half of them will actually show up. I don't think overcommit is ever needed in a modern system or is even a useful feature. For example, Windows doesn't support it at all.


> I don't think overcommit is ever needed in a modern system or is even a useful feature.

If you want to spawn child processes from a process that uses half the system's memory (not uncommon in server environments) using fork/exec, it is useful. In case you aren't familiar with how that works, the parent process makes a copy of itself, including a virtual copy of all the memory assigned to that process. That memory isn't actually allocated or copied until the child process tries to write to it (and then only the specific pages that are written to). Typically, the child process then calls `exec` to replace itself with a new program, which replaces the process memory. Without overcommit or swap, if the parent process is large enough, the fork syscall fails due to insufficient memory.
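
A minimal sketch of that pattern; the child never touches the "copied" pages before exec, which is exactly what overcommit is betting on:

    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        pid_t pid = fork();          // virtual copy of the parent's address space;
        if (pid == -1) {             // with overcommit disabled and a large parent,
            perror("fork");          // ENOMEM shows up here
            return 1;
        }
        if (pid == 0) {
            // Child: immediately replace the image, so almost none of the
            // copy-on-write pages ever need physical backing.
            execlp("ls", "ls", "-l", (char *)nullptr);
            _exit(127);              // only reached if exec itself failed
        }
        int status;
        waitpid(pid, &status, 0);
        return 0;
    }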

In a desktop environment using swap/virtual memory is fine. But in a server environment, where the disk may be network-attached (higher latency) and just big enough for the OS and applications, needing significant swap space is often undesirable.


Windows supports overcommit, it's just not the default, and it's not how typical Windows runtimes allocate memory. And the nice thing about making it opt-in is that processes which didn't ask for overcommit won't get shot down when overcommitted memory can't be faulted in.


No, it doesn't. See quotemstr's explanation here https://lwn.net/Articles/627632/


Thank you for the clarification.

What's left out, though (and perhaps the source of my confusion), is that you typically commit a reserved page from an exception handler, and if the commit fails then presumably in the vast majority of situations the process will simply exit. See https://docs.microsoft.com/en-us/windows/win32/memory/reserv...
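
For concreteness, the explicit reserve-then-commit path (rather than the exception-handler variant) looks roughly like this sketch:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        // Reserve 1 GiB of address space: no physical memory or pagefile
        // charge yet, just a hole in the address space.
        char *base = (char *)VirtualAlloc(NULL, 1ull << 30, MEM_RESERVE, PAGE_NOACCESS);
        if (base == NULL) return 1;

        // Commit only the first 64 KiB; this is where strict accounting can
        // say "no", and the failure is handled inline.
        if (VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE) == NULL) {
            fputs("commit failed\n", stderr);
            VirtualFree(base, 0, MEM_RELEASE);
            return 1;
        }

        base[0] = 42;                // safe: this page is committed
        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }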

If the code dereferencing the reserved-but-uncommitted memory was prepared to handle commit failure beforehand it would normally have done so explicitly inline. I can't imagine very many situations where I would pass a pointer to some library expecting the library to handle an exception when dereferencing it. There are some situations--and they're one reason why SEH is better than Unix-style signals--but extremely niche.[1]

Either way, my point was that Windows does strict accounting, and while you can accomplish something like overcommit explicitly, nobody else has to pay the price for such a memory management strategy. Only the processes gambling on lazy commit end up playing Russian Roulette.

[1] In a POSIX-based library I once used guard pages, a SIGSEGV handler, per-thread signal stacks, and longjmp to implement an efficient contiguous stack structure. This was in an extremely performance-critical loop (an NFA generated by Ragel) where constantly checking memory bounds on each push operation had substantial costs (as in, multiple times slower). AFAICT, it was all C and POSIX compliant. (Perhaps with the exception of whether SIGBUS or SIGSEGV was delivered.) Though, because neither POSIX nor Linux support per-thread signal handlers, you could effectively only use this trick in one component of a process--you had to hog SIGSEGV handling--without coordination of signal handlers. SEH would have resolved this dilemma. This being such a niche use case, that wasn't much of a problem, though.


That is solved by adding a reasonable amount of swap space. In practice the swap will never be used, but it's there as a buffer to guarantee that things will work if the forked process doesn't immediately exec.


But any process can be trying to allocate at the time your system runs out of memory, and most applications are not authored to handle malloc failing. Process failure seems easier to work around, from what I’ve seen. Would love to hear more from someone who has contrary experience.


At least you can then properly blame the software for doing the wrong thing and potentially patch it. The kernel should not implement global behavior that encourages improper memory allocation failure handling.


Yes so much. The overcommit approach seems to be to say, “Most people are lazy so we shouldn’t allow anyone to brush their teeth.” Handling malloc failures isn’t rocket science.


Do you have an example of a nontrivial user-space project that has malloc failure handling that actually works, with tests and all? The one I'm aware of is SQLite. Then there is DBus, whose author disagrees with your assessment.

https://blog.ometer.com/2008/02/04/out-of-memory-handling-d-...


Lua, various OS kernels

Like any aspect of writing software, how you approach the problem affects the complexity of the final product. If someone doesn't make a habit of handling OOM, then of course their solutions are going to be messy and complex; they're going to "solve" the problem in the most direct and naive way, which is rarely the best way.

For example, unless I have reason to use a specialized data structure, I use intrusive list and RB-tree implementations, which cuts down on the number of individual allocations many fold. Once I allocate something like a request context, I know that I can add it to various lists and lookup trees without having to worry about allocation failure. Most of my projects have more points where they do I/O operations than memory allocations. Should people just ignore whether their reads or writes failed, too?
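
A rough sketch of what I mean; the link nodes live inside the object, so adding it to a list can never fail (the names are made up, not any particular library):

    #include <new>

    struct ListNode { ListNode *prev, *next; };

    struct IntrusiveList {              // circular list with a sentinel node
        ListNode head;
        IntrusiveList() { head.prev = head.next = &head; }
        void push_back(ListNode *n) {   // no allocation, cannot fail
            n->prev = head.prev; n->next = &head;
            head.prev->next = n; head.prev = n;
        }
    };

    struct Request {                    // one allocation covers everything
        int      id = 0;
        ListNode by_state;              // link for the pending/active/done list
        ListNode by_client;             // link for the per-client list
    };

    int main() {
        IntrusiveList pending, client_reqs;
        Request *r = new (std::nothrow) Request;   // the single point of failure
        if (!r) return 1;                          // handle OOM once, right here
        pending.push_back(&r->by_state);           // these cannot fail
        client_reqs.push_back(&r->by_client);
        delete r;                                  // real code would unlink first
    }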


The JVM, and any other runtime which allocates its heap up front. You could argue that's cheating, but it does give you predictable behaviour.


Incorrectly handling malloc failures has been responsible for a lot of security problems. We are all better off acknowledging that C programmers are broadly incapable of writing correct OOM handling at scale and figuring out what to do with this fact, instead of wishing that it weren't this way.


There is now a way to mark processes as OOM-killer exempt

https://backdrift.org/oom-killer-how-to-create-oom-exclusion...

Part of the issue with processes stuck in D state (waiting for the kernel to do something) is that it is deeply tied into kernel assumptions about things like NFS. NFS is stateless, and theoretically servers can appear and disappear at will, and operations will keep working when the server comes back. You can make NFS a hell of a lot less annoying in this regard by mounting it with the soft or intr flags, but if the network disappears or hiccups, you WILL lose data (the network is NEVER reliable; in fact, the entire model of NFS is arguably wrong to begin with).


That's not a problem with NFS, that's a fundamental issue with computers. Things fail. Nothing protects you from losing data; your local log-structured file systems won't save you either. They'll help you protect an existing state from corruption, but they don't protect you from loss.

That new request you just received when a hardware failure occurred? Say goodbye to it, you've no way of ensuring it will make it to storage when the disks have caught fire. Later on, when you've put out the blaze, all that algorithms can do is tell you when things started to get lost.


What you are talking about is unrelated to NFS's broken model. It is still important, though, because it is the reason why we don't need fsync() to actually flush data to disk, for example: it can be asynchronous and only needs to add a checkpoint to your log-structured filesystem that will be written to disk later, once enough data is buffered or enough time has passed.


Often ordering would be enough, but sometimes you do want to wait until the changes are on disk. It's a mess.


This is a bad excuse for the really flaky nature of NFS; other network filesystems do not have its issues.


Like which ones?


AFS is a good example of a network filesystem that takes the properties of the network into account; however, it's a little weird and not fully compliant with POSIX filesystem semantics.

However, it scales like crazy and is very powerful and reliable.


I ran AFS for a very long time; it does have issues which perhaps other network filesystems don't even notice. It has issues with clients behind NAT, for instance: the callbacks won't work. The limit on files in a directory is (or was) lower than for local filesystems, so Maildirs with long mail filenames could get into something that feels like an "out of inodes but not space" situation. All in all, it was very nice though.


> the entire model of NFS is arguably wrong to begin with

On local networks (everything attached to a single switch) with good hardware, it is reliable, and soft/intr is the worse choice among the available options.

To wit, NFS is one of the commonly supported VM storage options (libvirt, VMware, etc.).


Best of the bad options doesn't make it a good option. It's far from reliable and one of the most common reasons I encounter for deadlocked Linux systems. Well, to be fair, it also hangs a fair share of BSDs and sometimes even a Solaris. If at all possible, I'd suggest avoiding it.


What do you propose for a shared file system?


There is no use in specifying (no)intr as of kernel 2.6.25; it can always be interrupted by SIGKILL. See e.g. https://access.redhat.com/solutions/157873

There are many things "known" about NFS that are legacy.


What are you referring to regarding pthread_create()? Last time I checked, I thought it would return an undefined thread* only when giving a nonzero return code, which, while certainly not handled in a lot of multi-threaded applications, could be checked before anything else is done with the newly created thread.


There appear to be cases under high create/destroy scenarios where it returns zero but fails to allocate memory for a thread during creation. This is with tonnes of available memory (hence not a mapping error) and no exceptions thrown.

That said, it's still entirely possible I've made a mistake. Please see here: https://github.com/electric-sheep-uc/black-sheep/blob/master...

The idea is that you do preliminary processing of a camera frame before sending a neural network over the top.


> This is with tonnes of available memory (hence not a mapping error) and no exceptions thrown.

Could be a memory fragmentation issue

EDIT: Also, since you're using C++ threads: You should really use move semantics, because right now you have two points of failure acting on the same thing: the `new` operator may fail when creating the thread instance, and the underlying pthread_create may fail as well.


A comment on my mentioning of move semantics, because I feel that this really ought to be pointed out:

In all the C++ standard library implementations of std::thread the only member variable of that class is the native handle to the system thread itself; there are no additional member variables! This means that the size of a std::thread object is equal to the size of a native handle, usually the size of a pointer, but sometimes smaller.

If you create a std::thread with `new`, you're essentially creating a pointer to "possibly a pointer", which comes with all the inefficiencies associated with that: double indirection, and small allocations tend to fragment memory. And at the end of the day, to actually use it you still have to lob that outer pointer around on the stack anyway.

So there is zero benefit at all to using dynamic allocation for std::thread. Don't do it! Just create the std::thread instances on the stack; they're just handles/pointers with "smarts" around them, and you can move them around just as efficiently as you can copy a pointer or an integer. Better yet, if you're not trying to "outsmart" the compiler you'll often get copy elision where applicable.
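
Something like this sketch, with a trivial worker function standing in for real work (build with -pthread):

    #include <thread>
    #include <utility>
    #include <vector>

    void work(int) { /* ... */ }

    int main() {
        std::thread t(work, 0);            // plain automatic ("stack") object
        std::vector<std::thread> pool;
        pool.push_back(std::move(t));      // not copyable, but moving is as cheap
                                           // as copying the underlying handle
        for (int i = 1; i < 4; ++i)
            pool.emplace_back(work, i);    // constructed in place, no new/delete
        for (auto &th : pool)
            th.join();
    }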


std::thread has its copy constructor deleted. This means that creating it on the heap is often your only option if you have to mix it with other types that don't handle moving properly, because move semantics are strictly opt-in.


> std::thread has its copy constructor deleted

and for good reason

> your only option if you have to mix it with other types that don't handle moving properly

It's not the only option. The other option is to implement move semantics on the containing type. Properly implemented move semantics give you assurances about ownership and that you don't use the thread interface inappropriately.


Deleting the copy constructor is a very arbitrary design decision without any good reason.

Move semantics are a can of worms in themselves. You assume in your comment that you can modify the other types that interact with the types that are only movable. This is only possible if you own all relevant types, which is actually the exception rather than the norm. And even if you own the relevant related types, move semantics transitively enforce themselves onto containing types, which turns their introduction into a sprawling mess of cascading changes with hidden surprises.


Here's something to ponder on: What are the proper semantics for copying a thread? What is it you want to express by doing that?

You'll find that usually the copy constructor has been deleted only for those classes where the semantics of a copy are not well defined.

So let's assume you work around that by encapsulating that thread in a std::shared_ptr or a std::weak_ptr. What are the constraints you must work within when using that thread reference?

Usually when you run into "problems" caused by an object not being "move aware", triggered by encapsulating a non-copyable type, it is a red flag that something in your code's architecture is off. Think of it as a weaker variant of the strong typing of functional languages. You probably don't want to have a shared_ptr to a thread inside your object (with the object being copyable), but rather wrap that object in a shared_ptr (or weak_ptr) and pass those around.


Your very first question is already leading you down the wrong path: std::thread is a thread handle, not the thread itself. Equating the handle with the thread (a complex construct of a separate stack, separate processor state, separate scheduling state, etc.) is folly.

There are more software architectures between heaven and earth than exist in your philosophy. C++ especially is an old language and most code was written before C++11 started to be adopted. So a pure C++11 style codebase that follows the associated design best practices may be able to deal with std::thread and similarly restricted classes with little friction. But this just isn't the norm. Most big important codebases are too far down different roads to adjust them to play nice with move semantics.


> std::thread is a thread handle, not the thread itself

While technically true, semantically there's not much of a difference. Yes, you can copy around a handle, but then you have the burden of tracking all of these copies, so that you don't end up with a dangling handle to a thread long dead… or even worse, a reused handle for an entirely new thread that happens to have gotten the same handle value.

This is why you should not think of std::thread being a handle, but the actual thread. Yes, from a system level point of view there's the handle, and all the associated jazz that comes with threads (their dedicated stacks, maybe TLS, affinity masks, etc.), all of which are non-portable and hence not exposed by std::thread, because essentially you're not supposed to see all of that stuff.

> C++ especially is an old language and most code was written before C++11 started to be adopted.

That is true. Heck, I've still got some C++ code around which I wrote almost 25 years ago. But if you use a feature that was introduced only later, then you should use it within the constraints supported by the language version that introduced it and not shoehorn workarounds "just to make it work".


std::thread is just fundamentally flawed. The way it encapsulates the thread itself is just one of the things. Thread CPU affinity cannot be managed. Code cannot query which thread it is running on. There are no thread ids (the handle would be a workable substitute if it were copyable). Threads cannot be killed. In other words, if you take threading seriously std::thread is useless.

I need all of these things except for killing threads. So this is not just an academic list for me.


> Thread CPU affinity cannot be managed

That's because the capability of doing so depends on the target environment. C++ is all-target. If you need that you can use `std::thread::native_handle` + the OS's API for that.

> Code cannot query which thread it is running on

You're wrong assuming that. std::this_thread::get_id() exists: https://en.cppreference.com/w/cpp/thread/get_id

> There are no thread ids

Yes, there are. std::thread::get_id() exists (also see above): https://en.cppreference.com/w/cpp/thread/thread/get_id

> Threads cannot be killed.

Not all runtime environments actually support doing this kind of thing. Also, within the semantics of C++, the ability to kill threads opens a gargantuan can of worms. For example, how would you implement RAII-style deallocation and deinitialization of objects created within the scope of a thread?

Or even one level lower: How do you deal with locks held within such a thread? Not all OSes define semantics for what to do with synchronization objects that are held by a thread that's been killed. Windows implicitly releases them. Pthreads defines mutex consistency, but after killing a thread holding a mutex, the state of the affected mutex is indeterminate until a locking attempt on the mutex is made.

Killing threads really is something that should be avoided if possible. Not just since C++11 but since forever, because it causes a lot more problems than it solves. If you need something that can be killed without going through too much trouble, spawn a process and use shared memory.

std::thread is very limited because C++ is an all-purpose, all-operating-system, all-environment language and it must limit itself to the lowest common denominator of threading support you can expect. And realistically this boils down to: 1. there are threads, and 2. threads are created, run for some time, may terminate, and you can wait for their termination.

That's it. Anything beyond that is utterly dependent on the runtime environment. And because of that std::thread does give you std::thread::native_handle, to be able to interface with that.
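
For example, pinning a thread to one CPU through the native handle, as a Linux-only sketch: it assumes native_handle_type is pthread_t (true for libstdc++ and libc++ on Linux) and uses the GNU extension pthread_setaffinity_np.

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    int main() {
        std::thread t([] { /* work */ });

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                  // allow CPU 0 only
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);

        t.join();
    }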


Many features of the STL are optional. So the existence of operating systems that are incapable of providing certain features is not a valid argument for leaving out features that are essential to using threads in any meaningful way on others.

Killable threads are not rocket science, either. You're just limited in the kinds of things these threads can do. But there's no need to get all hung up on that particular feature.


> Please see here: https://github.com/electric-sheep-uc/black-sheep/blob/0735de...

Creating and destroying threads in a tight loop (and blocking on their destruction, thereby reducing the point of having many) seems like a bad idea. Conceptually you have a maximum of only two threads in any given part of this snippet: the one running the loop and the current instance of visThread. My guess is also that the loop thread spends most of its time waiting for the recently created threads to die. Why not have visThread only get created once and process a queue of events?
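
Something along these lines, as a rough sketch; Frame is a stand-in for whatever your capture code produces:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Frame { /* placeholder for one camera frame */ };

    std::queue<Frame>       frames;
    std::mutex              m;
    std::condition_variable cv;
    bool                    done = false;

    void visWorker() {                     // one long-lived thread
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !frames.empty() || done; });
            if (frames.empty() && done) return;
            Frame f = std::move(frames.front());
            frames.pop();
            lock.unlock();
            // run the (slow) neural network on f here
        }
    }

    void submit(Frame f) {                 // called from the capture loop
        { std::lock_guard<std::mutex> lock(m); frames.push(std::move(f)); }
        cv.notify_one();
    }

    int main() {
        std::thread worker(visWorker);     // created exactly once
        for (int i = 0; i < 100; ++i)
            submit(Frame{});               // the capture loop would go here
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        worker.join();
    }

That way pthread_create drops out of the hot path entirely.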

Anyway without additional evidence you potentially flagged a bug in the c++ standard library rather than pthread_create.


Firstly, thanks for taking the time to look.

> Creating and destroying threads in a tight loop (and blocking on their destruction, thereby reducing the point of having many) seems like a bad idea.

The purpose is that it's pretty much 100% always running a visThread (it's a neural network that takes about 100ms per image). The pre-process on the other hand runs in about 10ms, but there's no reason why it can't be run in advance (+). The neural network can't really be run on multiple cores safely, but it does have OpenMP (parallel loops).

(+) It does create some latency in the output compared to the real world, but it's not a massive deal when it comes down to it.

> Why not have visThread only get created once and process a queue of events?

This is probably the best way to do this, but this was the lazy way with not too much overhead (I think) :)

> Anyway without additional evidence you potentially flagged a bug in the c++ standard library rather than pthread_create.

Possibly. I did run it through GDB and Valgrind; it reliably seemed to die in pthread_create, but that of course could have been only the trigger. It could also be the aggressive optimization [1].

[1] https://github.com/electric-sheep-uc/black-sheep/blob/0735de...


What do you recommend doing when Linux Freezes? It doesn't come up a lot, but when it does it can be kind of unnerving since the three-finger-salute doesn't work.

I would also love to know if anybody has a solution for getting video to play properly in Firefox. I know it's not a bug per se, but it would be nice to not have to switch between browsers all the time.

I've been using Ubuntu for about a year now and otherwise its been a very positive experience.


> What do you recommend doing when Linux Freezes? It doesn't come up a lot, but when it does it can be kind of unnerving since the three-finger-salute doesn't work.

Unfortunately, I don't really have a solution for this. I have an SSD as the main disk and _even now_, when I hit this too hard Linux grinds to a halt. No mouse, no keyboard, just heat, fans and disk light.

One thing that sometimes works for me is the old CTRL+ALT+FX mashing, but not always. Once you can get a shell you can type into you're okay, but of course this doesn't always work.

> I would also love to know if anybody has a solution for getting video to play properly in Firefox. I know it's not a bug per se, but it would be nice to not have to switch between browsers all the time.

What do you mean? It's been reliable for quite a long while. There were two issues I used to have: one was not having my graphics card set up (it was running on the CPU), and the other was not having Flash (back when that was something).

> I've been using Ubuntu for about a year now and otherwise its been a very positive experience.

Yeah, I think it makes one of the better daily drivers.


One of those magic SysRq combinations used to work for me. I used to use it a lot before I upgraded to 12 GB of RAM.


Thank you for your answer. It's good to know that dealing with freezes is not a problem only faced by newbs like myself.

Re: playing video, for some reason I can't play South Park on either Firefox or Chrome. Videos on twitter also won't load with Firefox, though they work fine with Chrome.


> Thank you for your answer. It's good to know that dealing with freezes is not a problem only faced by newbs like myself.

Yeah, it's another bug with desktop Linux. The problem is that Linux "basically" treats the GUI like any other program; when things get heavy, everything gets roughly evenly screwed. OSes designed to be centered around a GUI, on the other hand, usually guarantee that GUI-related processes get a minimum amount of time on the CPU to ensure they don't freeze.

Linux should 100% be doing this. Doing lots of hardware I/O shouldn't mean you lose the mouse or keyboard. When you lose control of input, you think the machine isn't doing anything, when in actual fact it's doing tonnes; it's just not showing you. Even something like Android suffers from this under heavy load, and it's really crap.

The real joke is, it's probably a difficult kernel fix. You would need some kind of watchdog timer for the kernel to make sure it's not getting too bogged down with any one particular task, and then interrupt the ones that are (some tasks don't like to be interrupted) [1]. You then need to make sure that all of the heavy kernel calls don't make guarantees about the call being completed (i.e. blocking), which to be completely honest should be the default position to take anyway.

I'm not completely up to date on this, but my bet is that the issues come from everywhere. Programs can read/write arbitrarily large amounts of data (RAM, disk, network, bus, etc.), when I believe you should be able to ask the kernel what size it would like you to read/write based on I/O activity and the capabilities of the device. If your program is bogging down the kernel, it should lower your recommended block read/write size. Better yet, this would be compatible with existing software, as programs could choose to ignore it, possibly with the kernel "punishing" programs that eat up lots of kernel time by making them wait longer for their next opportunity. There are a bunch of algorithms for time-slicing tasks, but the most optimal appears to be max-min, with very little organizing overhead [2].

> Re: playing video, for some reason I can't play South Park on either Firefox or Chrome. Videos on twitter also won't load with Firefox, though they work fine with Chrome.

Hmm, that shouldn't happen, sounds like you potentially have some system-wide badness. A few things to try:

* Disable any customization you made (extensions, add-ons, etc) - see if it is one of these interfering

* Make sure you have all updates and you're running an up-to-date version of Ubuntu (these issues have possibly been patched already)

* Make sure you have the correct drivers installed for your GPU

But... One thing I did note was that JavaScript coin miners have gotten so bad that I can't run certain sites without ad-blockers anymore (uBlock Origin is generally recommended). I remember my CPU sitting at one core maxed out just because of the JS engine. I generally run uBlock Origin + NoScript on every page and manually enable temporary scripts on pages I trust. One of the biggest offenders for crazily heavy JS was actually Facebook.

[1] https://en.wikipedia.org/wiki/Watchdog_timer

[2] https://uhra.herts.ac.uk/bitstream/handle/2299/19523/Accepte...


I thought about your comment re: video and I did some more poking around online since I was hoping to avoid having to do a fresh install. I tried updating my video codecs and now Netflix/Hulu works! Hooray!

>> JavaScript coin miners have gotten so bad that I can't run certain sites without ad-blockers anymore

Whoa, is that the reason why some websites are eating a ton of memory?? Is this common?

>> I generally run uBlock Origin + NoScript on every page and manually enable temporary scripts on pages I trust.

Thank you for the recommendations, I'll check those out.


> I thought about your comment re: video and I did some more poking around online since I was hoping to avoid having to do a fresh install. I tried updating my video codecs and now Netflix/Hulu works! Hooray!

Yeah, it's important to pull updates regularly :) In general many of the bugs you'll come across will end up being fixed sooner or later, so it's worth checking regularly.

> Whoa, is that the reason why some websites are eating a ton of memory?? Is this common?

That and just the poor use of JS. Most websites really don't need JS. I use mbasic.facebook.com instead of facebook.com as it'll run without JS, plus you have to manually refresh to get any kind of notification.


I have to say, reading all these replies about the OOM killer makes Linux look quite bad. These proc scores are not an elegant solution. I far prefer Darwin’s launchd which lets you set actual memory limits (soft and hard) that gives you warnings before you cross a threshold. Now this is more consumer OS oriented, but something equivalent for servers that let you express preferences in a more natural way seems desirable.


systemd lets you configure soft and hard limits on a per-service level, almost identically to launchd. See MemoryHigh= and MemoryMax= [1].

This does depend on cgroupsv2, but it works on most modern distros.

[1]: https://www.freedesktop.org/software/systemd/man/systemd.res...


Oh nice! Thanks for the pointer!


You can use ulimit to set soft and hard limits for all sorts of system resources (including memory) on Linux.


...but if the problem is programs not checking malloc() return codes because it never returns failures, then ulimits will not in themselves help the program. It will help the OS stay alive, which is good, but we still need to deal with the programs that expect to run all the way into a swamp and sink without malloc giving them the "bad news".


True — I forgot about this. But can you do that on a process via some config _before_ the process is created?


Yeah, usually you call ulimit before calling the process. The new process inherits the limits. If you want to modify the limits of an _existing_ process, you can use prlimit.
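
Programmatically, the same thing looks roughly like this sketch (`some_program` is just a placeholder):

    #include <cstdio>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        pid_t pid = fork();
        if (pid == 0) {
            // Cap the child's address space at 512 MiB before exec; the limit
            // is inherited across exec into the new program image.
            struct rlimit lim;
            lim.rlim_cur = lim.rlim_max = 512ul * 1024 * 1024;
            if (setrlimit(RLIMIT_AS, &lim) != 0)
                perror("setrlimit");
            execlp("some_program", "some_program", (char *)nullptr);
            _exit(127);
        }
        int status;
        waitpid(pid, &status, 0);
        return 0;
    }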


Memory limits are also available on Linux via cgroups or ulimit.


There is nothing more frustrating than unkillable processes stuck in iowait (D) state. There's no reason for this behavior to exist. And it's so easy to hang forever: a network blip, your NFS client gets stuck, and so do your programs.


Ah, yes, that bug.

Few programs can handle a fail return from "malloc", and Linux perhaps tries too hard to avoid forcing one. Most programs just aren't very good at getting a "no" to "give me more memory". Browsers should be better at this, since they started using vast amounts of memory for each tab.

I used to hit a worse bug on servers. If you did lots of MySQL activity, so that many blocks of open files were in memory, and then started creating processes, you'd often hit a situation where the Linux kernel needed a page of memory but couldn't evict a file block due to some lock being set. Crash. That was years ago; I hope it's been fixed by now.


> Browsers should be better at this,

Browsers are quite good at this, actually. Major web browsers run on Windows (and even 32-bit Windows!), where there is no overcommit, so malloc can return "no" at any time, which happens quite often when you are limited to 4 GB of memory per process.

The only apps that suck at this are Linux-only apps that are never used anywhere else and just assume that all Linux systems have overcommit enabled.


>Most programs just aren't very good at getting a "no" to "give me more memory"

I suspect that overcommitting is one of the reasons for this, though. Many programmers in the Linux world have internalized the idea that "malloc can't fail", and the only error handling they bother doing is calling abort() if malloc fails.

Of course the fact that C doesn't provide any sane way to implement error handling probably doesn't help.


Handling malloc() failure is almost never done for short lived programs. For instance, git used to fail as soon as an error popped up (whether it be from malloc() or open(), etc.). It is just much simpler and more convenient to do so.

While C has no special error handling mechanism in place, error handling can still be done reasonably. IMO, the big reason why malloc() errors are rarely handled is that it is quite hard to come up with a viable fallback strategy.


>Handling malloc() failure is almost never done for short lived programs.

True, and that makes sense for something like git. But in my experience many long-lived programs don't bother to handle ENOMEM gracefully either.

But I guess I'm veering off-topic here, I'm mostly fine with applications crashing of their own volition when they don't have enough memory. I agree with you that in many cases there's no clear recovery path for an application that's out of RAM. It's the OOM-killer I have a problem with.

>While C has no special error handling mechanism in place, error handling can still be done reasonably.

I very much disagree with that. There are a few factors that make error handling in C a pain:

- No RAII, so you have to explicitly handle cleanup at every point you may have to early-return an error (goto fail etc...).

- No convenient way to return multiple values from a function. That means that in general functions signal errors returning some special value like 0 or -1 (even that is very much nonstandard, often even within the same library).

Oh you want to be able to signal several error conditions? Uh, maybe use several negative codes then? Oh you need those to return actual results? Well maybe set errno then? Don't forget to read `man errno` though, because it's easy to get it wrong. Oh you had a printf in DEBUG builds in there that overwrote errno before you could test it? Oops. Don't do that!

What's that, your function returns a pointer and not an integer? Ah, mmh, well maybe return NULL in case of error? You want to return several error codes? Well, maybe you can just cast the integer code into a pointer and return that, then use macros to figure out which is which. It's terribly ugly? Well, the kernel does it so... it can't be that bad, right? Oh, and what about errno? Remember that?

What's that, NULL is a valid return value for your function? Uh, that's annoying. Maybe use an output parameter then? Oh, or maybe some token value like 0xffffffff, that probably won't ever happen in practice right? After all that's what mmap does.

So no I wouldn't consider C error handling reasonable in any way shape or form. "Non-existent" is more accurate. You can always work around it but it always gets in the way.

I try to always implement comprehensive error checking in my programs. I do a significant amount of kernel/bare metal work, so it's really important. It's not rare that I end up with functions that contain more error-handling-related code than actual functional code.


You are making things sound way more complicated than they need to be; the situation is actually very simple. If you need to return multiple error codes, use the return value for the error code and give back results via an output parameter; otherwise just use a sentinel value for errors (0, -1 or NULL depending on context; they aren't totally random, you know: 0 and nonzero are used for false/true, -1 is used when you expect some index, and NULL when you expect some object). When in doubt, just use an error return code everywhere (e.g. what many Microsoft APIs - even some C++ ones - do with HRESULT).


If it's not that complicated, please explain why OpenSSL, the Linux kernel, curl, and a multitude of very popular C libraries don't do what you describe. Clearly it's complicated enough that even talented C coders try to cut corners when given the chance.

C error handling ergonomics are non-existent which means that everybody bakes ad-hoc library-specific conventions that are extremely error-prone.

You could argue that they're doing it wrong and you might have a point but if almost everybody gets it wrong maybe it's fair to blame the language itself a little bit.


I already gave an example of APIs that do this: pretty much all COM APIs use HRESULT. I do not know why not everyone does this, as I'm not everyone and as such I cannot tell what sort of considerations (if any) were going on. At best I can make some guesses.

BTW, curl does seem to do what I wrote above; for example `curl_easy_init` returns a `CURL` object on success or NULL if there was an error [1], and `curl_easy_perform` returns a `CURLcode` value [2] that looks like it is used across the API to indicate errors.

[1] https://curl.haxx.se/libcurl/c/curl_easy_init.html

[2] https://curl.haxx.se/libcurl/c/curl_easy_perform.html


The kernel very much returns sentinel values; if something more complicated has to be transmitted, error codes are commonly used. I see nothing wrong with it.


I'm not arguing that the kernel devs are doing it wrong. I'm only pointing out that, in my opinion, the way C deals with error handling (that is, by not doing anything at all) is far from reasonable and the cause of many bugs. It's terrible ergonomics.

If you have a kernel function returning a pointer and you think that you're supposed to check for NULL when it actually returns an ERR_PTR in case of errors, you will not only fail to do the check but also end up with a garbage pointer somewhere in your program. If you have an MMU and you try to dereference the pointer you'll have a violent crash, which at least shouldn't be too hard to debug. If you feed the pointer to some hardware module, or if you're working on an MMU-less system, then Good Luck; Have Fun.

C doesn't have your back here. It doesn't let you signal how a function reports errors, it doesn't even let you tag nullable pointers.


Often you need to return error objects. Consider a function for parsing something. You want to return not only the error code, but also the line and column number of the parse error, and a description of it. So you need two output parameters; one for the result and one for the error. Your declaration becomes something like this:

    bool parse(inp_type *a, out_type **b, out_error **c);
where the return value false indicates an error. In C++, you'd just have written something like:

    out_type parse(const inp_type& a);
and thrown an exception on error.


In C you can return a struct, however a better approach is to use a context object which also contains error information, like:

    ctx_t* ctx = ctx_new();
    if (!ctx) ... fail ...
    if (!ctx_parse(ctx, code)) {
        show_error_message(ctx_erline(ctx), ctx_ercol(ctx));
        ... more fail ...
        ctx_free(ctx); /* often done in a goto'd section to avoid missing frees*/
    }
This also allows you to extend the APIs functionality, error information, etc in the future while remaining backwards compatible.


Which is great, except that ctx_new() requires a malloc, which then can fail, and now you can't even explain why the thing failed, as you have no context info.

You also have to worry about all of the ctx objects you've created along the way, to free them up as you recover from the low memory error.


That is very similar to the way I handled errors back in my C days.


Yep, you're absolutely right. But don't tell me that is simple! :)


> No RAII, so you have to explicitly handle cleanup at every point you may have to early-return an error (goto fail etc...).

I think RAII can be useful, but I've never found any use for it in systems-level code that I write. Most of the time I'm dealing with resources that were allocated inside a systems library or an external component which just gives me a handle to the resource. I think this is a common enough scenario in systems code that I don't think it's just me.

e.g.

    1. X = CreateResource()
    2. Y = TransformResource(X)
    3. ProcessNewResource(Y)
    4. Z = TransformResource(Y)
    5. etc. etc.
And so as you transform that resource, you will have multiple ways to unwind the resource depending on where the failure occurs. Even if you wrap X in some RAII container, you don't know what your destructor is going to look like.

Another con of RAII, especially when paired with shared-ownership smart pointers, is that you lose predictability over your resource deallocations. You never know when the last pointer is going to go out of scope, and if it's a 'heavy' resource with a complicated unwind, you're going to get a CPU spike at an indeterminate time. I deal primarily with industrial automation code and I much prefer to have a smooth/even CPU graph. I think this issue is more relevant to systems code, which is the context of this thread.


I've just checked, and it appears that C++'s std::vector::resize may throw std::bad_alloc when the underlying allocation fails, but Rust's Vec::resize interface doesn't leave any room to report errors, so I guess that it will panic...? That's sad.
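
On the C++ side it looks roughly like this sketch (64-bit build assumed). One caveat: with Linux overcommit enabled, a request that is merely "too big for RAM" may appear to succeed and you meet the OOM killer later instead, so the example asks for more than a typical 47-bit user address space to force the exception path:

    #include <iostream>
    #include <new>
    #include <vector>

    int main() {
        std::vector<char> v{'a', 'b', 'c'};
        try {
            v.resize(1ull << 48);   // ~256 TiB: the backing allocation fails
        } catch (const std::bad_alloc &) {
            // resize has the strong guarantee here, so v still holds "abc"
            std::cerr << "resize failed, old contents kept\n";
        }
    }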


The standard library assumes infallible allocation, yeah. We have plans to add fallible stuff eventually, but we’re still working on our allocator APIs.


Rust in general will panic on memory allocation failure. There was some discussion about oom handling a while back but I don't know the current state.


Now you regret not putting in exceptions.


> vast amounts of memory for each tab

What underlies this? I am astounded to see 1GB of memory returned when I close a couple of tabs.

Chrome and Firefox both seem like this.


It's spread across all parts of the browser, but speaking as a Firefox graphics engineer, we use quite a lot of memory. Painting web pages can be slow, so we try to cache as much as possible. When elements scroll separately, or can be animated, we need to cache them in separate buffers. If we get the heuristics wrong (and it's hard to get it right for every web page out there) this can be explosive. It's not helped by the fact that graphics drivers can frequently bring down the whole process when they run out of memory. It's a hard problem, but webrender will help as it needs to cache less.


Maybe the browser should try to discard some cached data when the system is out of memory. Then some things in the browser would be slower, but the operating system wouldn't hang.


The browser _does_ do that. The hard part is detecting "the system is out of memory". Some OSes notify you when that happens, and Firefox listens to those notifications and will flush caches. Some OSes will at least fail malloc and let you detect out-of-memory that way. Linux does neither, last I checked.

Disclaimer: I work on Firefox, but not the details of the OS "listen for memory pressure" integration.


Any userspace process can see how low the memory is though; Firefox could do it itself. Still, if a notification system is already used on other OSes, a very easy solution would be to add such a notification channel in userspace so that any process could ask Firefox to free memory. Right now I am using earlyoom to save my system from freezing. It sometimes kills Firefox, sometimes DBeaver, sometimes VMs. But if it could tell Firefox to chill for a bit and free memory, then I could avoid the massacre (at least sometimes).


> Any userspace process can see how low the memory is though

How, exactly?

Or put another way, how do you reliably tell apart "we are seriously thrashing" and "resident memory is getting close to the physical memory limits but there is plenty of totally cold stuff to swap out and it won't be a problem" from userspace? The kernel is the only thing that can make that determination somewhat reliably.


Maybe there's a better way than this, but the same way earlyoom decides it's time to kill processes (% RAM usage), Firefox would decide it's time to free caches. While using 100% of RAM might be the optimal state if you aren't going to use more than that, it's not safe.
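
A rough sketch of a process checking this for itself on Linux, using the MemAvailable line from /proc/meminfo (present since kernel 3.14); the threshold is a hand-tuned heuristic, not something the kernel tells you:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Returns the kernel's estimate of available memory in kB, or -1 on error.
    long mem_available_kb() {
        std::ifstream meminfo("/proc/meminfo");
        std::string line;
        while (std::getline(meminfo, line))
            if (line.rfind("MemAvailable:", 0) == 0)
                return std::stol(line.substr(13));   // e.g. "   123456 kB"
        return -1;
    }

    int main() {
        long avail = mem_available_kb();
        if (avail >= 0 && avail < 512 * 1024)        // e.g. under 512 MiB left
            std::cout << "low memory: time to drop caches\n";
    }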


So let's say I'd like to write a memory-efficient web page, what should I avoid then?


Based on my experience as a Firefox developer investigating memory usage reports, the worst-performing "normal" web pages in terms of memory have:

1) Lots of script (megabytes).

2) Possibly loaded multiple times (e.g. in multiple iframes).

3) Possibly sticking objects into global arrays and never taking them out (aka a memory leak for the lifetime of the page).

4) Loading hundreds of very large images all at the same time.

5) Loading hundreds or thousands of iframes that all have nontrivial, if not huge, amounts of script. Social media "like" buttons often fall in this bucket.

There are obviously also pathological cases: if your HTML has 1 million elements in it (not a hypothetical; I've seen this happen), memory usage is going to be high, obviously. And arguably having a page with thousands or hundreds of thousands of JS functions is "pathological" too, but it's pretty normal nowadays...


For video memory/tile memory usage, avoid anything fancy that's hard to rasterize, for starters: Think things like rounded borders, drop shadows, opacity, transparent background images, etc. The more complex it is to draw your page the more likely that it will end up being cached into temporary surfaces and composited and stuff. For a while for some absurd reason Twitter's main layout had a bunch of rectangles with opacities of like 0.95 or 0.99, so all the layers had to get cached into separate surfaces even though you could barely tell it was happening. Getting them to fix that made the site faster for basically every Firefox and Chrome user. They hadn't noticed.

For JS and DOM memory usage you can use the browser's built in profiler to get a pretty good estimate of where things are going and what you've done wrong.


Rather than guess at what to avoid, you should make use of the memory profilers which Firefox and Chromium developer tools provide. Apparently Firefox's memory profiler is an add-on: https://developer.mozilla.org/en-US/docs/Mozilla/Performance...


Yeah, but this is a pigeonholing principle.

I don’t want to spend my time developing something only to discover it doesn’t perform well due to some reason.

I would prefer to not use any of the performance killers in the first place.


Firstly, avoid leaking memory (including objects like images and DOM nodes) in JavaScript. Leaking memory here means retaining a reference beyond the end of the object's use. The garbage collector only collects memory which is no longer referenced; it does not attempt to analyse when a reference is no longer used.

Secondly, avoid including unnecessary resources. Many web pages include many libraries which are then mostly unused. Some packaging tools can help eliminate such unused code.

A memory profiler helps in both cases: it detects leaks, and it measures the cost of resources, allowing you to make educated decisions about their inclusion.


HTML3+


JavaScript


Is that a joke, or are you essentially saying that if I used WebAssembly, then most of the memory usage would go down?


He's not wrong. If you disable JS by default, tabs will take much, much less RAM. Sites that require JS aren't worth the time anyway. Unless you're literally writing an application, there's no reason to require executing code to render text and images. And there's absolutely no reason not to have a no-JS fallback. In fact, there should be a real HTML skeleton first, upon which you write JS enhancements.

But these are all things you do if you want to make a webpage for people. If your main concern is corporate profit or saving institutional funds then SPAs and requiring JS for obfuscation makes sense in the anti-user way that corps trend to. It's just faster. Who cares if it's anti-user when profits are on the line?


Is there a setting to disable all of that caching? In other words, prioritize memory usage over performance?


To be honest: I do not know. But given how fat most websites are today, I am not that surprised that so much memory is needed. Yes, there is still a major leap from a couple of megabytes to a full gigabyte, but with so many DOM nodes and JS objects I can imagine that even a resource-conservative browser will have trouble keeping memory usage low.

Or are there any browsers where you observe significantly less memory usage on the same websites? (Ignoring limited browsers like Lynx of course)


Most modern browsers cache a lot of data in memory as well to make things like navigating back snappy and avoiding a full page re-load. It's also necessary to cache most of the state of the previous page to allow retaining form fields if you accidentally navigated away (as some forms are only in the DOM, created by JS, and not part of the original HTML).

I believe to have read somewhere that at least Chrome listens to low-memory situations broadcast by the OS and will evict such caches. So while it uses a lot of memory as long as memory is available, it will also release much of it if necessary.


This makes sense but is somewhat annoying since a web browser is not the only program running on my computer (but apparently wants to be) and eats up RAM that could be used for the OS file cache.

I wish there was some sort of allocation API specifically for allocating caches so that recently accessed files could kick out a web browser's cache of a not-so-recently accessed tab or vice versa.


POSIX does have something sort of like this in madvise(), but I couldn't find a specific option for the semantics you described.
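
The closest Linux-specific knob I'm aware of is MADV_FREE (kernel 4.5+). It doesn't arbitrate between a browser's cache and the page cache, but it does let the kernel reclaim the advised anonymous pages under memory pressure instead of swapping them out. A hedged sketch:

    #include <cstddef>
    #include <sys/mman.h>

    int main() {
        const std::size_t len = 64ul * 1024 * 1024;   // 64 MiB cache buffer
        void *cache = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (cache == MAP_FAILED) return 1;

        // ... fill the cache and use it ...

        // Tell the kernel these pages are cheap to throw away. Under memory
        // pressure they may be dropped (and later read back as zeros), so the
        // cache must be re-validated before reuse; writing to a page cancels
        // the advice for that page.
        madvise(cache, len, MADV_FREE);

        munmap(cache, len);
        return 0;
    }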


On a low-memory computer I had configured FF to not show images unless I alt-clicked them. Together with not using JavaScript, this meant using significantly less memory. I suspect browsers still have to keep raw bitmaps in memory, which for the 2 MB JPEGs you are fed everywhere quickly adds up...


They have to be in memory somewhere. They technically don't have to be kept in RAM if they are uploaded to VRAM though.

Assuming RGBA (4 bytes per pixel), a decoded 1920x1080 image is 1920 × 1080 × 4 ≈ 8.3 MB, so just over 8MB.


Images could be kept compressed until they are painted. Most GPUs support texture compression, so they don't need to keep an uncompressed bitmap around for compositing.


Texture compression is inherently lossy, so it isn't an option.

It's also really dependent on your textures and what you want to do with them; you don't want the browser to just give it its best shot at compressing your company logo.


You want your browser to be fast. Browsers are often the only thing running. You have a bunch of ram. Unused ram is pointless. Disks are large and fast and are used for swap. It's not clear why you're surprised, what the problem is, or why anyone would put any effort into optimising a browser for memory usage. The number of people who have tens or hundreds of tabs open in the real world isn't as large as it is on HN and other tech sites, and for many people who do, just keeping the url around so the site can be reloaded is probably good enough.


> Browsers are often the only thing running.

If I had to pick the one program least likely to be the only thing running, I would have voted for the browser.


Javascript? I use noscript and it's less than 200MB/tab, even with major news sites and what not open.


200MB/tab just to display some text with markdown and maybe some small images? Isn't that insane?


Answer is too simple. Exactly what is taking up what memory? I'd love to see the annotated memory map of these processes. M Kb of javascript text. N MB of image data, O MB of this, P MB of that.

Unpopular opinion time:

My guess is most developers don't care and are not even looking at this anymore, either during development or after release. Nobody seems to even know how much memory their program allocates and how quickly it allocates that memory under various running conditions. I used to challenge my fellow developers: Stop in the debugger right now. About how much memory should the process be using? Nobody even seems to have an order-of-magnitude guess anymore. It's your program, dude! Shouldn't you know this?

You can ask any embedded software engineer exactly how much memory his/her program uses, what's the stack size, what's the heap size, what's statically and dynamically allocated. Sadly, this discipline is pretty much gone outside of that specialized area.


For browsers, this problem is "once removed". Firefox and Chrome both go to heroic efforts to reduce memory usage. However, given the html and javascript which fill most websites, and users expecting responsiveness, it turns out to be very hard to use less memory.



Browsers are little OSes with encapsulated virtual machines, and every website tends to be a mix of web trackers, ads, and server-dependent functionality. The whole thing can be a big mess.


If it helps any I have ~2500 tabs open in palemoon (well overdue for an afternoon of tab culling but anyway) and it's ~1.5GB. I never allow JS. So my guess is it's either JS directly, or perhaps JS pulling in extra resources when allowed to run.


I think a personal wiki with clickable links to 2500 sites would be more useful than a browser with 2500 tabs open. It could even be easily curated by more than one person.

That said, Firefox made some strides in handling very large numbers of tabs starting in version 55.

https://www.techradar.com/news/firefoxs-blazing-speed-with-h...

Unfortunately Palemoon is still more or less Firefox 38, isn't it?


Just tidying up after myself would fix most of that. I don't know what the Palemoon/Firefox relationship is, version-wise.

But my point was more that sans JS it really seems to use up far less memory. Honestly, try nuking js for 1/2 hour and see how it feels.


I get annoyed at all the sites that don't work without js.

My most common scenario with JS is anywhere from 10 tabs using 400MB of RAM to 30 tabs using less than 1GB, with the former being more common.


1 GiB of virtual memory, I assume. IOW, not the same thing as 1/32 of your 32 GiB of RAM, for example.


I think that's not that bug at all. When memory runs out, the entire system stalls, including the UI, but nothing crashes. If these issues are frequent, the system is basically frozen.

I have this in Matlab on Linux. Matlab can actually deal with worker processes being killed, but my machine just locks up. Therefore, we have to run these specific simulations under Windows, where this doesn't occur.


I witnessed MySQL bringing Linux servers down too.

In my case it happens like this:

I have a long running PHP process that constantly fires away mostly SELECT but also a bunch of INSERT and UPDATE statements and also some DELETEs.

Since the DB and the key files do not fit into memory, it's all disk-bound work.

All tables are MyISAM.

Like clockwork, this stalls the virtual machine once per day.

All I can do is to hard power down the VM and restart it. Afterwards the table data is corrupted beyond repair.

Not sure it is related to memory though. Because the memory usage of PHP and MySQL seem to be constant. Most RAM seems to be used by Linux for caches.


The most common cause here is something causing a situation where some queries hang or take a long time to complete, while also locking access to something, while new queries keep coming in. This builds up quickly.

A good way to catch this would be to have something log the list of running queries every couple of seconds. Look at this log after the crash and you'll hopefully be able to identify which are the long-running processes, and which are the regular queries that build up.

To fix it, you'd combine making the queries that cause the locking less lock-heavy, perhaps putting in a limit on how many queries can pile up, and implementing a way for the regular queries that build up to time out or fail more quickly and gracefully.


I fire the queries sequentially. So there is never more than one query running.


The "once per day" part is suspicious. Maybe a scheduled backup is what's running the other query ;)


I had the same suspicion. Especially since the provider indeed runs a nightly backup of the VMs. But even after turning that off, the VM stalled the next night again.


Try doing some rate limiting in order to not cause the deadlock. Should probably also disable the write cache. And if it still doesn't work, switch to a bare metal machine. And give it a lot of swap and up the swappiness. Swapping is a much better alternative than crashing. VPS providers don't like swap because it wears out their SSDs, so the swap and swappiness are probably preset too low.


I tried sleeping 0.1s every 5s or so. It did not help. Still crashed.

I don't think swapping would even occur since neither mysql nor php grow their memory usage over time.

It's not an SSD. It's good old rusty HD.


>Afterwards the table data is corrupted beyond repair.

That should not happen with a DB even if you turn off the power. Are you sure the hardware is good?


GP is using the MyISAM storage engine. It's not crash safe. This is sad but expected behavior.

Don't use MyISAM!


Is there any reason to ever use it (or: Why does it still exist)? In-memory databases for caches or other things that are not critical? I have to admit, I was astounded when I first got the error message from MySQL that a table was corrupted and that I should run REPAIR TABLE. That sounded like very weird behavior for a database.


Any remaining reasonable use cases would be sufficiently corner-casey that the first-order approximation is "if you want it to behave like a database, no, you do not want MyISAM".

This being said, at least some years ago, a use case I saw that held SOME water then was generating MyISAM tables offline, importing them as-is into a running MySQL (or taking an instance offline and bringing it back up) and then serving from it read-only. At least at the time, this provided better RO performance than InnoDB. I wouldn't be surprised if that was still true. Please don't do that at home!

Also, I think until the previous-to-most-recent release, some internal tables were still MyISAM, causing MySQL overall to have some very rare cases of not being crash safe. Again, I think that's since been resolved in 5.8(?).


    Is there any reason to ever use it
It is faster, uses less disk space and has a more logical filesystem layout.


What do you mean by more logical? innodb_file_per_table has existed for a while now.


A flag with that name exists, yes. But it does not separate table data into one file per table. It will still put stuff related to the tables into the central ibdata1 file.

Google "ibdata1 one file per table" to see all the pain it causes.


False.

> But it does not seperate table data into one file per table

That's because if you didn't have it set when creating the database, it won't move data to the new fs layout when you turn the setting on, without an OPTIMIZE. If you had it on from the beginning, table data is per file. I literally just did an ls on my /var/lib/mysql and there's a folder per database, in which there are 2 files per table (.frm and .ibd).

When innodb_file_per_table is on, and the database has been OPTIMIZEd, only the following is stored in ibdata1 [0]:

- data dictionary aka metadata of InnoDB tables

- change buffer

- doublewrite buffer

- undo logs

[0] https://www.percona.com/blog/2013/08/20/why-is-the-ibdata1-f...


   only the following is stored in ibdata1
You say "only", I say "clusterfuck".

Just look at the very page you linked to. It's a totally confusing concept that befuddles users and causes questions "we often receive", starts "panic", can "unfortunately" not easily be analyzed and you might need to "kill threads" and initiate "rollbacks" to fix the problems it brings.

MyISAM got that right. One dir per database.


Ok but I don't get why you're so obsessed with the fs layout anyway. You should mostly treat it as a black box. And the point of ibdata1 is safety, which as you stated higher up is a serious problem with MyISAM. Even if it's not oom situations, you'll end up stuck sooner or later. You have been warned.


    You should mostly treat it as a black box
Again, check the very link you posted. People do that. Until shit hits the fan. And then they have to take that black box apart. Which is not easy.


welcome to MySQL


In general, the out-of-memory condition doesn't always come from the Linux kernel; it often comes from the underlying memory allocator, typically the allocator in the C runtime in libc. Just because some process's memory allocator returned NULL or threw bad_alloc doesn't mean the system as a whole is running out of memory.

When the kernel itself is running out of memory it will just start the OOM killer, which kills the process with the highest OOM score (not simply the one with a low "nice" value).


Actually, I'd say that if malloc (or equivalent) returns NULL then the system really is out of memory. Every general-purpose memory allocator is going to contact the OS to ask for more memory if it doesn't have anything free in its own buffers.

But... it's still no good saying 'make your program behave nicely when malloc fails' - even if your own code is perfect, what are the chances that every library you use does the same thing? And even then, Linux by default will optimistically over-allocate memory (and rightfully so!) - with the result that you'll never catch every out of memory condition.

IMO, 'out of memory' is not a property that each single process should try to manage, rather it should be the OS or some other process with a global oversight that monitors memory usage and takes measures when memory gets tight.


You're right that the memory allocator ultimately gets the memory it manages from the OS, but as a programmer you're looking at it through the abstraction that its API provides, and any assumption about which particular condition causes a NULL to be returned or bad_alloc to be thrown may or may not be correct.

The other point is that there's a distinction between the kernel's view of an OOM condition and some memory manager's OOM condition. Consider you run two processes, both allocate X gigs of memory and both succeed. However, once you start running and committing the memory you'll get a kernel OOM condition and one process is killed. This is the overcommitting you mentioned.

Personally I don't see why people make such a big fuss about dealing with memory allocation failures. Memory is just a resource, same as any other OS resource: socket, mutex, pipe, whatever. Normally in a well designed application you throw on these conditions and unwind your stack and ultimately report the error to the user or print it to the log and perhaps try again later. Just because it's "memory" should not make it special IMHO.


It's the transparency of memory allocation that makes it so difficult to deal with failures. Even 'trivial' library functions could allocate memory, hell even calling a tiny dumb function might cause the stack to require a new page of memory, leading to failure. Just checking that all malloc calls check for NULL isn't even half of it.

Exception handlers won't save you either. Unless you consciously consider every memory allocation failure, your exception handlers will be too high level and result in your program either aborting by itself or becoming unusable. Did you pre-allocate enough resources to pop up an 'out of memory' error window? Good luck failing gracefully.

Memory allocation is special.


Yes, it's an imperfect world and you can't control what happens in a library but the attitude "it doesn't matter if my code is messed up or not, some library will still do the wrong thing" doesn't help. All you can do is make your code work properly and that's what you (and everyone else) should do.

Again, it's an imperfect world; saying "it won't work because x, y, z will happen" is not the right attitude, and is a bad attitude. Most of your code should treat it as just a resource allocation failure, and in a sane program that is indicated by propagating an exception up the stack. Now you might be right that the program might fail when it'd be time to display a message box to the user or whatever. But somewhere in the middle layers of the code you don't have that context, you don't know that it will fail. Therefore that part of your code really should be (exception) neutral, just like in any resource allocation case.


> Memory allocation is special.

I think the Zig language treats it as special, making you explicitly write code that handles the case where an allocation fails.


> I'd say that if malloc (or equivalent) returns NULL then the system really is out of memory.

That's very much not true when 32-bit processes are involved. You can easily be out of (non-fragmented) address space in a 32-bit process (whether it's all resident or not) while the overall system is nowhere close to being out of memory.

Even in a 64-bit process you can exhaust the address space without being out of memory if you try hard enough; you just have to try much harder.

That said, even on Linux allocators will return NULL when they're just out of address space; there's no overcommit going on there.


> That said, even on Linux allocators will return NULL when they're just out of address space; there's no overcommit going on there.

Try calling fork() in that process then. By rights, the new process should inherit its own copy of all the address space of the old process, and is free to overwrite it with whatever it wants. Yet Linux (by default) won't fail the fork() for a process with N GB of RAM even when total memory(+swap) < 2N GB, although there simply isn't the memory around for both processes. There's your overcommit.


Would it be possible for the kernel to suspend the process in scenarios where malloc would fail instead of returning a failure? Either until enough becomes available for it to succeed, or until something tells the kernel to renew/revive/resume the process and try the malloc again?


It could, but if that process is using most of a system's memory, that will lock things up forever, because while the process is frozen it won't release any of its memory.


You provide limited information but it is not clear the scenario you explain is a bug. If too much memory is locked into resident memory with mlock then this sounds like the expected and correct behavior.


Then I prefer the unexpected and incorrect behaviour of Windows, which freezes the offending application and continues to be responsive, allowing me to kill it if I wish to do so...


> Few programs can handle a fail return from "malloc"

Fewer than should, that's for sure, but hardly a trivial number. A lot of old-school C programs are very careful about this, and would handle such a failure passably well. Unfortunately, just about every other language tends to achieve greater "expressiveness" by making it harder to check for allocation failure. How many constructors were invoked by this line of code? By this simple manipulation of a list, map, or other collection type? How many hidden memory allocations did those involve? I'm not saying such expressiveness is a bad thing, but it does make memory-correctness more difficult and so most programmers won't even try.

As the world moves more and more toward "higher level" languages, returning an error from malloc becomes a less and less viable strategy. Might as well just terminate immediately, since "most frequent requester is most likely to die" is better than 99% of the OOM-killer configurations I've ever seen.


Glad to see this issue raised! My system hangs for minutes sometimes and is very frustrating compared to Windows and OSX which seem to handle out of memory in a much more user-friendly way. Which seems to be: suspending the offending program and letting the user decide what to do from there. I'm sure there's a reason the Linux kernel doesn't do something similar, but can anyone enlighten me? :)


Probably lack of integration; if NT hits a memory issue, it can just pass notice to the tightly-coupled userland and GUI. If Linux runs out of memory, even if it internally knows what process to blame... What would it do that makes sense for a headless server, TiVo, and Android phone? Keeping in mind that the kernel folks don't even work that closely with many userspace vendors.


OSX handles this with a kqueue event that can notify userland when the system moves between various memory pressure states; this is hooked into by libdispatch and other userland libraries which will discard caches and so on.

I don't see why Linux couldn't do the same; open /sys/kernel/something and epoll on it.


This already exists: applications can receive memory pressure events (such as the system reaching "medium" level, where you may want to start freeing some caches) via /sys/fs/cgroup/memory/.../memory.pressure_level. See https://www.kernel.org/doc/Documentation/cgroup-v1/memory.tx....
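
For reference, a minimal sketch of how that registration works with the cgroup-v1 interface (using the root memory cgroup path for illustration; real code would use its own cgroup and proper error handling):

    #include <sys/eventfd.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);
        int pfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
        int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
        if (efd < 0 || pfd < 0 || cfd < 0)
            return 1;

        /* register: "<eventfd> <pressure_level fd> <level>", level is low/medium/critical */
        char reg[64];
        int n = snprintf(reg, sizeof reg, "%d %d medium", efd, pfd);
        if (write(cfd, reg, n) != n)
            return 1;

        for (;;) {
            uint64_t count;
            if (read(efd, &count, sizeof count) == sizeof count)
                fprintf(stderr, "memory pressure event -- time to drop caches\n");
        }
    }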


The first two nota benes explicitly describe this document being outdated and not what most people expect when it comes to “memory controller”. I am not certain that citing this is a great example.


What about this? Seems to do what they want.

https://serverfault.com/a/949045


Windows (server and desktop versions) will throw up a message dialog on the screen. It will also start to kill off processes just enough to resolve the low memory situation.

During this - unlike Linux - you can actually use the mouse, CLI and close programs yourself.

On top of that, server applications like IIS have built-in watchdogs. If an IIS process grows to use too much memory (60% IIRC) or excessive CPU, the watchdog will recycle the process.


I think Windows kernels do not use overcommit, so memory allocation will fail if you run out of memory.


You could use the message bus to post a message to the service that handles out of memory decisions, which in turn could either

1. Show a GUI with a choice

2. Show a message on the current terminal and ask what to do

3. Just return "kill it now" if there is no interactive session

And if there is no such service, just default to 3. The problem really is that the state cannot be captured and communicated to the user. I doubt the NT kernel itself shows a GUI window, it's probably a service that gets woken up by a kernel exception, which in turn shows the window. Basically, the Linux kernel needs more pluggable functionality for user interactions. It's absolutely fine and even recommended to not have an entire GUI in the kernel, it needs to just provide a mechanism for userspace to capture the event and decide what to do with it.


Throw a signal like it would do if the process were out of memory completely and about to be killed? (for clarification, no snark intended, actual question)


What signal is sent when a process is out of memory? I thought malloc would either start returning NULL or you’d fault when trying to access overcommitted memory.


Yes, but ideally you want to be throwing some ‘memory pressure’ signals before absolutely running out of memory, so that programs can take simple actions like emptying caches, etc.

Catching an otherwise-fatal out-of-memory fault and recovering would be too complicated / bug-prone.


Android sends low memory events and kills processes based on heuristics.

Then again, Android has a customized Linux kernel.


This describes my general Linux experience well: A very stable kernel, with which I never had serious issues on a headless server. But applications in the userspace (apart from the standard GNU packages) are usually a tossup anywhere between system-crashing garbage and perfectly working pieces of software.


I used to run into this problem all the time in grad school. Once a month or so I'd load a data set, do some dumb Python operation on it that took significantly more memory than I predicted, and BAM! I'd have to restart my laptop.

I just kinda assumed that's how computers worked until I got a Mac a couple of months ago...

The link suggests that there might be some default parameters you could change to protect against this behavior. Does anyone have any suggestions on what settings to change?


A Mac is certainly better at handling these kinds of issues but it's by no means totally safe. It tries to compress memory and dynamically allocate more swap, but there's still a limit and you can see that if you accidentally run programs with way higher RAM requirement than you have. I've had multiple occasions where my program used so much RAM that even moving the cursor is an exercise in patience, never mind switching to a terminal window and typing commands to kill the process.


A Mac will keep creating virtual memory swap up to some limit (some multiple of the amount of physical RAM — can't quite remember, possibly 5x) and then it will produce a kind of vague dialog box saying "You've run out of application memory" with a list of applications to force quit.


But at least you can recover rather cleanly from the issue.


If you've foolishly decided to run without swap (like the original post), then suspending the offending program does nothing.

This is because the offending program has allocated a lot of private dirty pages, which can't be dropped from memory because without swap space, there is nowhere for it to go.


Linux use cases tend to be servers where user interaction is unexpected at 3AM? No one around to make a choice, so automate a choice.

IMO despite the standout behavior, I prefer my software to deal with itself.

Systems designed to wait for user input end up having design choices intent on keeping a user using them.

Software is just a tool. Not a lifestyle. Set and forget this shit as much as possible


If the alternative is simply killing a process or crashing the kernel, then surely a better approach would be to suspend the offending process and call a handler that does something. If you want that something to restart the machine, fine. If you want it to notify the administrator, fine.


There is the use case of Android phones. One of the answers to the OP is about that case. It seems that Google developed a user space process to monitor those events: https://lkml.org/lkml/2019/8/5/1121

From that reply it seems that Facebook implemented something similar, I guess for their servers.


> Linux use cases tend to be servers where user interaction is unexpected at 3AM? No one around to make a choice, so automate a choice.

Even if it's a small percentage of the overall "computing" population, there are still millions of people running Linux on the desktop (roughly 2% out of 3.2 billion people using internet makes for 64 million - a large european country). It's 64 million of people for which this behaviour is a pain in the arse.


Does FreeBSD handle this issue better?


And you've turned swap off on Windows?


Windows will not BSOD due to memory pressure.


I think it happens only if you have a broken device driver, applications cannot cause BSOD with memory allocation.


Well, I beg to differ: using an unlimited swap file can quickly hit hard issues after 64GB of swap use. At that point mallocs in the Windows UI fail (timeouts or something?) that apparently are not meant to, e.g. fonts missing from the shutdown menu, the system being unable to shut down, etc.


Does it BSOD if it runs out of swap space though?


I'm not sure but the assumption might be that there's generally no user to ask as the computer might be a server.


Right but if there is a user to ask then it should ask!!


Ehm, and how does the OS know which is the “offending” process?

I think you are confusing the issue raised here with your desktop experience.


Currently the Linux kernel computes a score for each process based on some heuristics. There's a good introductory article on LWN:

https://lwn.net/Articles/317814/


Yep and it's about as good as just picking a random process and killing it.

It's awesome when you run out of memory and you try to log in only to have it kill sshd.


A classic from [1]:

> An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

[1] https://lwn.net/Articles/104185/


Fortunately, engineers have invented a way to attach a Strolling Wheelbarrow After Plane, where you can stash the sleeping passengers without ejecting them out of the plane entirely. This has the unpleasant side effect of slowing down the journey for everyone when passengers wake up inorderly (and God forbid everyone wake up at the same time), though.


What I still do not understand is why people continue to turn a blind eye to this instead of switching to SmartOS. I just don't get it.


How does Solaris/SmartOS handle that situation?


It doesn't get in that situation, because malloc() can return null on Solaris (i.e. it never¹ overcommits).

While in general I think this is vastly better than the somewhat insane Linux OOM killer, you can get in awkward situations where you can't start any processes (including a root shell) because you're out of memory.

I rather like the FreeBSD solution to this, which is to not overcommit, but after a certain number of allocation failures it kills the process using the most memory. This prevents situations where you can't start any processes.

There's no one-size-fits-all solution to handling low memory conditions, but the Linux solution manages to almost never do what you want which is kind of impressive in a way.

¹ I seem to recall hearing somewhere that you can allow allocations to overcommit on a per-application basis on later versions of Solaris, but don't quote me on this.


> FreeBSD solution to this, which is to not overcommit

Where did this myth come from? Did y'all just assume that the vm.overcommit sysctl actually makes sense and zero means "no overcommit"? :)

https://news.ycombinator.com/item?id=20623919

But indeed, OOM killer kills the largest process, which makes more sense in most scenarios than Linux's "badness" scoring.


Huh, I had no idea it worked like that. That's bizarre.


Running sshd as an on-demand (Type=socket) service would probably work better, since then the sshd process would be new and thus have a better heuristic score - also not be tying up memory sitting unused in the meantime.

systemd still seems to run it (Type=notify) with the -D option all the time though, at least on the systems I can check.

Dropbear is configured by default as a Type=socket service though.


This is sort of just kicking the problem down the road. Your idea actually might work for (presumably) low-volume ssh use, but what about the next important service? When does the work-around to a papered-over work-around to a virtual-memory problem that is supposed to either be RAM-backed or handled at

  ptr = malloc(42);
  if(!ptr) exit_error();
end?


Well, there probably needs to be a way to override the heuristic at least, sort of a 'this process is important, don't auto-kill it if trying to find memory'.

As for ssh specifically, I rarely ssh into my desktop machine, but I keep sshd running for just this kind of situation where I might need to try and rescue a swamped machine. So in most cases low-volume sshd use is exactly what is called for.

If you're running into the memory purge of doom on a server that's probably a whole different nightmare scenario.

malloc returning NULL has been a broken assumption for a long time though, and that isn't going to change afaik.


I blame the specific algorithm for that, not the basic concept. Nothing with less than 10MB of memory use should ever get killed unless you're in some kind of fork bomb.


Probably it would be good to include the amount of overcommit in the heuristic. Processes that overcommit should be killed first. What is a good way to measure/inspect overcommit of the process?


Further to the comments about the pager hammering the disk to read clean pages (mainly but not exclusively binaries) even if swapping is disabled: In many cases adding swap space will reduce the amount of paging which occurs.

Many long-lived processes are completely idle (when was the last time that `getty ttyv6` woke up?) or at a minimum have pages of memory which are never used (e.g. the bottom page of main's stack). Evicting these "theoretically accessible but in practice never accessed" pages frees up more memory for the things which matter.


Unfortunately enabling swap in linux has a very annoying side effect, linux will preferentially push out pages of running programs that have been untouched for X time for more disk cache, pretty much no matter how much ram you have.

This comes into play when you copy or access huge files that are going to be read exactly once, they will start pushing out untouched program pages to disk, in exchange for disk cache that is completely 100% useless, even to the tune of hundreds of gigabytes of it.

Programs can reduce the problem with madvise(MADV_DONTNEED), but that only applies to files you are mmap()ing, and every single program under the sun needs to be patched to issue these calls.

You can adjust the vm.swappiness sysctl to make X larger, but no matter what, programs will start to get pushed out to disk eventually, and cause unresponsiveness when activated. You can reduce vm.swappiness to 1, but if you do, the system only starts swapping in an absolutely critical low-ram situation, and you encounter anywhere from 5 minutes to 1+ hour of total, complete unresponsiveness in a low-ram situation.

There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.


> There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

Here's the thing: a mapped program page is just another page in the page cache. Now, you could maybe say that "any page cache page that is mapped into at least one process will be pinned", but the problem there is that means that any unprivileged process can then pin an unlimited amount of memory, which is an obvious non-starter.

A workable alternative might be to add an extended file attribute like 'privileged.pinned_mapping', which if set indicates that any pages of the file that have active shared mappings are pinned. That means the superuser can go along and mark all the normal executables in this way, and the worst-case memory consumption a user can cause is limited by the total size of all the executables marked in this way that the user has access to.


SuSE solves this in their SuSE Linux Enterprise Server (SLES) with a new sysctl tunable, which soft-limits the size of the page cache.

https://www.suse.com/documentation/sles-for-sap-12/book_s4s/...

It is quite effective, although historically there have been issues with bugs causing server lockups in the kernel code around this tunable. It seems to be quite stable in SLES 15, however.

While the tunable is available in their regular SLES product, it is only supported in the "SLES for SAP". The two share the same kernel, that is probably why.


There's no reason extra data cannot be added to entries in the page cache to make smarter decisions. That’s how Windows and OS X do it in their equivalent subsystems.

Nobody is suggesting these pages be pinned which is an extreme measure.


The problem I'm trying to point out here is that if the extra metadata in the page cache is entirely under user control (like for example "is mapped shared" and/or "is mapped executable") then it amounts to a user-specified QOS flag.

That might be OK on a single-user system but it doesn't fly on a multi-user one. That's why I suggested you could gate that kind of thing behind some kind of superuser control.


Why can’t a user make QoS decisions for their own pages? Root controlled pages should obviously have higher priority.

The kernel could still “fairly” evict pages across users - just letting them choose which N pages they prefer to go first.


> Why can’t a user make QoS decisions for their own pages?

Because then you just get everyone asking for maximum QOS / don't-page-me-out on everything they can.

The pages in the page cache are not owned by a particular user, they're shared. If there's three users running /usr/bin/firefox, they'll all have have shared read-only executable mappings of the same page cache pages. If you do a buffered read of a file immediately after I do the same, we both get our data copied from the same page cache page. So it's not at all clear how you'd do the accounting on this to implement that user-based fairness criterion.


> but that only applies to files you are mmap()ing

fadvise provides the same for file descriptors. some tools such as rsync make use of it to prevent clobbering the page cache when streaming files.
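
Roughly, the pattern looks like the sketch below (untested; a real tool would batch the DONTNEED calls instead of issuing one per read, and check return values):

    #include <fcntl.h>
    #include <unistd.h>

    /* stream a large file once without leaving it in the page cache */
    int stream_once(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* we'll read sequentially, so ask for aggressive readahead */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[1 << 16];
        ssize_t n;
        off_t done = 0;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* ... process buf ... */
            done += n;
            /* already-consumed pages can be dropped from the cache */
            posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
        }

        close(fd);
        return n < 0 ? -1 : 0;
    }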


nice! was not aware of that syscall, however, patching the entire world remains...


it might be possible to create an LD_PRELOAD'd library that wraps open-type syscalls (i.e. those that return an fd; might just be open, haven't kept up with all of Linux's syscalls) and that, based on a config file, calls fadvise on those fds that correspond to specific files/paths on disk. Won't help for statically linked binaries or those that call syscalls directly without glibc's shims, but that should be a small number of programs.

heck, if I were still a phd student, I'd want to run performance numbers on this in many different scenarios and see how performance behaves. feel like there could be a paper here.
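
The core of the interposer is only a few lines; a sketch that just wraps close() and drops the fd's clean cached pages on the way out (a real implementation would track which fds are regular files, honor a config of paths, handle dirty data, etc. -- all names here are illustrative). Build with something like `gcc -shared -fPIC -o dropcache.so dropcache.c -ldl` and run with `LD_PRELOAD=./dropcache.so someprogram`:

    #define _GNU_SOURCE          /* for RTLD_NEXT */
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* intercepted close(): before really closing, hint that this fd's cached
       pages are no longer needed (only clean pages actually get dropped;
       on fds where the hint makes no sense it just fails harmlessly) */
    int close(int fd)
    {
        static int (*real_close)(int);
        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        return real_close(fd);
    }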



@the8472, hah, so someone thought of the same thing. I'd try to leverage /etc/ld.so.preload with a config file as a more transparent solution, but your link proves the point that its possible.


You probably don't want it in ld preload globally because it would also clobber the page cache in processes that do benefit from it.

And if you only do it in a container you can also limit the page cache size of the container to avoid impacting the other workloads.


hence why I said config-file based (i.e. include the path that stores one's media; it won't matter what program you use to play it), but yes, the page cache does play a role (and hence why I also said it'd be interesting to explore how different applications behave with and without it and how that impacts other system performance)

i.e. I really wonder, for desktop workloads, if one only caches "executable" data, how that would negatively impact perceived performance. I'd imagine it'd have some impact, but I'd be interested in seeing it quantified.


Can't the file cache detect streaming loads and skip caching it? ZFS does this for its L2ARC[1].

[1]: https://wiki.freebsd.org/ZFSTuningGuide#L2ARC_discussion


Is there a way to enable such an option for an entire process, in the same way as e.g. ionice(1)?

Whenever I take a backup of my computer it winds up swapping everything else out to disk. Normally I'm perfectly happy letting unused pages get evicted in favor of more cache, but for this specific program this behavior is very much not ideal. I'm asking here since I've done some searching in the past and not found anything, but I'm not sure if I was using the right keywords.


Follow-up, for people who encounter this thread in the future: I did some more hunting and found `nocache` (https://github.com/Feh/nocache , though I installed it via the Ubuntu repositories) which does this by intercepting the open() system call and calling fadvise() immediately afterwards.


> Unfortunately enabling swap in linux has a very annoying side effect, linux will preferentially push out pages of running programs that have been untouched for X time for more disk cache, pretty much no matter how much ram you have.

THIS. I ended up disabling swap because my kernel insisted on essentially reserving 50% of RAM for disk buffers; meaning even with 16GiB of RAM, I'd have processes getting swapped out and taking forever to run, because everything was stuck in 8GiB of RAM, and Firefox was taking 6GiB of that. I couldn't for the life of me figure out a way to get Linux to make that something more reasonable, like 20%. (And yes, I tried playing with `vm.swappiness`.)


Programs should really use unbuffered i/o for large files read only once (yes i know Linus doesn’t like unbuffered i/o but he’s wrong)

> This comes into play when you copy or access huge files that are going to be read exactly once


Readahead is still useful for large files read sequentially once, and that needs to be buffered. Such programs should use posix_fadvise().


you can readahead as far as you like with unbuffered io


If you are reading unbuffered (ie. O_DIRECT) then you are reading directly into the memory block the user supplied, so you cannot read ahead - there's nowhere to put the extra data.


Of course you can read ahead using multiple buffers, you can issue as many reads as you want concurrently


I think it is pretty clear I was referring to kernel-mediated readahead. Sure, you can achieve the same thing in userspace using async IO or threads.


I think it’s pretty clear that I was referring to reading large files using direct io, and your comment was totally irrelevant


Reading large files using direct IO defeats kernel read-ahead, which means you have to take on the complexity of reimplementing it in userspace, or the performance hit of not having it.

This is a good reason for programs to use buffered IO even when they are reading a large file once, so yes my comment was entirely on point.


Your comments were misleading, even if you want to argue they were on point. I know that you're knowledgeable about the kernel, I read your other comments. The following comments are just not true, except in bizarro world where all code must be in the kernel:

- "Readahead... needs to be buffered"

- "reading unbuffered (ie. O_DIRECT)... you cannot read ahead"

Good application code needs to work around kernel limitations all the time. Your suggestion of fadvise is reasonable, except in practice fadvise rarely makes a measurable difference. The only way to get what you really want is to code it the way you want it, and fortunately, there are multiple ways to do that.


> There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

Did you try different vm.vfs_cache_pressure values?


Probably the best solution to this is something like memlockd where you explicitly tell the kernel what memory must always be in resident set.
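
A minimal sketch of that idea, pinning one file's pages in RAM (roughly what memlockd does for a configured list of binaries and libraries; the example paths are just illustrations, and this needs CAP_IPC_LOCK or a big enough RLIMIT_MEMLOCK):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* map a file and lock its pages into RAM so they never get evicted */
    int pin_file(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);              /* the mapping keeps the pages reachable */
        if (p == MAP_FAILED)
            return -1;

        return mlock(p, st.st_size);
    }

    /* e.g. pin_file("/usr/sbin/sshd"); pin_file("/bin/busybox"); */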


Linux resource scheduling and prioritization is pretty awful considering its popularity.

TBH, there are very few OSes that get high-pressure resource scheduling and prioritization right under nearly all normal circumstances.

The hackaround for decades on Linux is always adding a tiny swap device, say 64-256 MiB on a fast device in order to 0) detect average high memory pressure with monitoring tools 1) prevent oddities under load without swap (as in OP example).


Sgi IRIX nailed this, FWIW.

I would have thought some of the IRIX scheduler made it into Linux by now.


No way, any sufficiently advanced technology is indistinguishable from magic and IRIX was so very, very advanced. IRIX hasn't been in development for almost two decades now and it's still more advanced in aspects like guaranteed I/O and software management (inst(1M))... What does that say about it and what does it say about the engineers who worked on it?


And you are right on all counts. Inst was magic. The things I did, often on live systems...


XFS is still better than any version of that ext dreck!


That's debatable. Better at losing file contents after hard reset? Maybe.


Makes me want to gather them in a room and continue the process. Irix was so very good.


Irix even had "virtual swap" which had no (or very insufficient sized) backing store for it, just to handle all the superlarge allocations from which it only uses a tiny amount.


As crazy as it sounds, can that fast device be a ramdisk?


Not crazy at all, very useful.

https://en.wikipedia.org/wiki/Zram

https://wiki.gentoo.org/wiki/Zram

https://wiki.debian.org/ZRam

Or Armbian, a Debian derivative for many RaspberryPi-like ARM boards, where it helps avoid trashing the usual SD cards, and holds the logs, with different compression algorithms for /var/log, /tmp and swap.

Their script is here:

https://github.com/armbian/build/blob/master/packages/bsp/co...

I really like how this enables a tiny NanoPi Neo2 with one GB Ram booting from a 64GB SD-Card in an aluminum case with SATA-adapter and a 2,5" HDD mostly idling to draw only about 1W from the wall socket, while clocking up from about 400Mhz to 1,3Ghz if it needs to. It's not the fastest, but sufficient.


Exactly what I thought. Losing <5% of your RAM as swap space probably won't make much of a difference.

But if it solves problems, why doesn't the kernel automatically assign part of RAM as a virtual swap device, if the user allows it? It can help monitoring tools, and because the kernel knows it is not a real swap device, it can optimize its use.



Swap has a side effect that's not very nice: It makes memory non-deterministic, as disk is non-deterministic.

Linux does unfortunately have serious issues to do with peaks of latency which make it behave horribly with realtime tasks such as pro audio. It's so bad that it's often perceivable in desktop usage.

linux-rt does mitigate this considerably, but it's still not very good.

I'm hopeful for genode (with seL4 specifically), haiku, helenos and fuchsia.


Yes, this is exactly the cause. The pager hammers the disc to read clean pages, because they don't count as swap.

And I agree that a small amount of swap can actually reduce the paging that occurs, if you assume that the amount of RAM required is independent of the amount of RAM available. However, as we all know, stuff grows to fill the available space, and if you do configure swap you just delay the inevitable, not prevent it.

Having said that, having swap available means that when memory pressure occurs, you have a more graceful degradation in service, because the first you know about it is when the kernel starts swapping out an idle process to free memory for the new memory hog that you are interacting with. This slows down your interactive session, but not as much as if you have no swap available - in that case, the system suddenly and drastically reduces in performance because it is trying to swap in your interactive process. The more graceful degradation of having some swap available gives you a chance to realise that you are doing more than your computer can cope with, and stop.

As far as I see it, there are three solutions:

1. Disable overcommit. This tends to not play very nicely with things that allocate loads of virtual memory and not use it, like Java. And if you do have a load of processes that actually use all the memory they allocated, then you can still have the same problem occur. The solution to that one is to get the kernel to say no early, before the system actually runs out of RAM.

2. Get the OOM killer to kill things much earlier, before the system starts throwing away clean pages that are actually being used. On my system with 384GB RAM, I have installed earlyoom, and instructed it to trigger the OOM killer if the free RAM falls below 10% (and remember that stuff you are actually using, but happens to be clean, counts as free RAM). This is the easiest and quickest solution right now. If your main objection to this is that you are inviting the system to kill things that might be important, remember that the kernel already does this, and if you don't like it you should use option 1 above (and really hope that all your software handles malloc failure correctly).

3. Introduce a new system in the kernel to mark pages that are actually being regularly used but are clean as "important", and no longer count as free RAM for the purposes of calculating memory pressure. This could either be as a new madvise (but it would be impractical to get all software developers to start using this), or by marking all binary text by default (which would neglect the large read-only databases that some programs hammer), or by some heuristics. This would then trigger the OOM killer (or the allocator to say no, depending on overcommit) when actual free RAM is low.


This exact bug has been a huge issue for me when I am developing with Matlab. Those are large simulations.

Things get swapped around and memory is often close to the limit. Linux then becomes unresponsive, and basically stalls. Theoretically it recovers, but that process is so slow that the next stall is already happening.

It is therefore impossible to run large scale Matlab simulations on my Linux machine, while it is no issue in Windows.

As far as I can see, Linux is only usable with enough RAM that it is guaranteed you never run out. I don't know why this has never been treated as an issue; I guess because it is mostly a server OS where RAM is plannable, or running out is very infrequent?


Try using zram memory compression with the zstd compression algorithm if your kernel supports it, or at least with lz4hc. I use this setup to compile Chromium and run a few memory-hungry processes, and the system stays responsive during compilation. Here is the free -h output:

                total        used        free      shared  buff/cache   available
  Mem:           15Gi        6.6Gi       4.0Gi       295Mi       4.8Gi       8.3Gi
  Swap:          27Gi         10Gi        17Gi

Note that "Swap" here is just a piece of compressed memory. I have no real swap and swappiness is set to 100.


Thank you, I will check if we can enable zram somehow.


Add swap (Windows does that, too) and never use more memory than you have RAM (edit: ..in one process). Storing swap on quick storage adds to the fun and the price.

The one trick, SSDs don't like.


Isn't swap the standard configuration? I did not set up this server, but I don't see why it would not have swap. Nevertheless, this problem occurs. I'll check.

I know the solution is to never use more memory than I have RAM, but that's just what happens - and I know there is no way to "solve" the issue and make it magically run. The issue is HOW it is handled... It's weird that other OSs can deal with this, while Linux needs to be restarted.

I think the issue is that only a handful of (Matlab) processes eat up all the RAM, so this "OOM" can not really do anything - there's no use killing off other processes. What should probably happen is to kill Matlab or one of its processes, even though it is in use. I'd be fine with that. Give some out of memory error and kill or suspend the process. At least then we know.

Instead, the system just locks up completely (because the other threads keep trying to push stuff into memory), but is not actually dead, so we don't even catch the issue. Also, because EVERY process is essentially stalled, you can't even kill Matlab yourself. Or suspend it and dump the data, which would be useful. No, you have to hard reset the machine.


Swap doesn't resolve this on a HDD though. The UI/terminal still locks up, and you still can't recover once you hit the point of thrashing.

What really confuses me is that this kernel was developed when SSDs didn't exist, so how on earth did "The system becomes irrecoverably unresponsive if a single application uses too much RAM" get missed?


> so how on earth did "The system becomes irrecoverably unresponsive if a single application uses too much RAM" get missed?

I don't know, there are/were several similar issues (very basic situations, frequently encountered by everyone or at least many people) which are/were not fixed for years (we might say decade(s)): that one dealing with memory exhaustion; then right after that, the problem which follows when memory is freed but the system is still unresponsive for several minutes(!); freezing when writing to a USB disk; freezing when something goes wrong on a NFS mount...

I never understood why those common and really important issues were not tackled (or not tackled before many, many years). IMO they were such basic functionalities, which a proper OS is expected to perform reliably as a basis for and before all the rest, it should have been dealt with and granted highest priority.


It didn't get missed, it's just that nobody cared enough.

Servers can be provisioned with way more than enough memory for its use case and can have spares configured to take up the load if it has to be killed.

On the desktop side the issue happens far less if you have more than enough memory. A developer running vim and firefox/chrome on his 32GB of ram machine is vastly less likely to experience this issue than a cheap laptop with 4GB of ram.


It didn't get missed. We kill -9d and rebooted a lot.


I'm so happy someone has made a clear bug report here. Because damn, this is a thing.


Yeah. I didn't really think of it as a bug at first, but I'm glad someone called it that. I wish the system would just kill the browser or low priority processes instead of freezing everything in an instant.


Why does it have to kill the browser? Why can't it tell it "nope, no more memory for you" before it's all gone?


You can configure Linux to return ENOMEM errors. The theory is that most applications will effectively treat this as fatal and die; so, would you rather kill the app that happened to make the most recent memory allocation or would you rather kill the app using all the memory (and have some control over keeping processes like sshd from dying)?

I've tried using a VM with overcommit turned off and a modest amount of memory. Among other things, my mail reader, mutt, used more than half the system's memory when looking at my mail archive, so it couldn't fork to exec an editor to write a new mail.


Why can't it be smart about it? Processes have CPU priorities, why can't they have memory allocation priorities?

It doesn't have to wait until it actually runs out before issuing ENOMEM. Most processes don't deserve to starve the system completely.


The fundamental problem here is that the kernel can't time-travel into the future far enough to make sure no higher-priority processes will try to allocate more memory, as the system has to make the decision to return ENOMEM at the time of the allocation. There is a reasonable(-ish?) default to deny allocations that request more memory than currently available (using the Linux definition of available, that is total memory without the sum of resident sets of all processes and a bit more). Of course, this limit only works properly if other processes do not add anything to their working sets, but again, the kernel's fortune-telling abilities are quite limited.


It can't look into the future, but it certainly could use the information from the past.

Has this application already allocated 90% of the memory? Has it been steadily growing the allocation without releasing much back? Well then, why let the system run out completely? Why not stop it at say 10% or 5% left?


That application might be the whole point of the system and should get to use every last drop of RAM.


For a desktop OS that would be very unusual in my experience.

For a server, perhaps, but I'd say even for a server it would be better that say the core application returns a 500 internal error or whatever than forcing the system to start killing random processes.


A lot of processes don't handle ENOMEM well. Your mail client asks for memory for a buffer to write an email and it gets ENOMEM, what's it going to do? Silently fail to do anything? Pop up an error message (which will probably take memory to display)? Exit?


> Pop up an error message (which will probably take memory to display)?

I don't know how common this is on the Linux/POSIX side, but in DOS and Windows it is usual practice to allocate some amount of memory at startup and use that for error-handling code, so that things like showing error messages will not cause any more allocations.
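
The same trick is portable enough; a minimal sketch of one way it can look (sizes and names are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define EMERGENCY_POOL_SIZE (256 * 1024)

    static char *emergency_pool;

    /* call once at startup: reserve (and touch) memory for the error path */
    void init_error_reserve(void)
    {
        emergency_pool = malloc(EMERGENCY_POOL_SIZE);
        if (emergency_pool)
            memset(emergency_pool, 0, EMERGENCY_POOL_SIZE);
    }

    /* call when an allocation fails: give the reserve back so the error
       path itself (formatting strings, showing a dialog, flushing logs)
       has something to work with */
    void report_out_of_memory(void)
    {
        free(emergency_pool);
        emergency_pool = NULL;
        fprintf(stderr, "out of memory -- aborting the current operation\n");
    }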


You can configure your kernel for that, but you can't configure most software written for linux not to make absurd allocations.

It's also hard to get a good idea of how much memory a program is really using, which makes setting reasonable limits for things tricky.

The best you can do is to have a small swap space, and alert when it gets to 50% and fix whatever. But then you have the problem of filesystem pages evicting anonymous pages to swap which ruins the utility of the swap usage as a gauge and/or drives you to configure way more swap than is reasonable. Although, maybe a bigger swap space and alerts on swap i/o rate might be ok, other than it's awful painful to have swap space of even 0.5x ram if you have 1TB of ram.


IMO, that's exactly what it should do. The time when RAM was expensive is gone. Hard-drive swap was always more than a bit of a kludge. So the OS needs to list available options to the user (they -should- choose), and tell apps 'you're SOL, so do whatever you need to do."


It can. System-wide it's one extreme or another. With overcommit enabled, you'll pretty much never get refused. With overcommit disabled, you'll get refusals as soon as you reach the max memory, which means lots of mapped pages wasting space.

The middle ground is your own config unfortunately - cgroups can limit available memory, but you'll have to set it up by hand.
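
By hand it's only a handful of writes into the cgroup filesystem. A minimal sketch, assuming a mounted cgroup v2 hierarchy with the memory controller enabled, enough privileges, and made-up group/program names:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* write a short string into a cgroup control file */
    static int write_str(const char *path, const char *s)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int rc = (fputs(s, f) >= 0) ? 0 : -1;
        fclose(f);
        return rc;
    }

    int main(void)
    {
        mkdir("/sys/fs/cgroup/demo", 0755);
        write_str("/sys/fs/cgroup/demo/memory.max", "512M");

        /* move ourselves into the group, then exec the real workload */
        char pid[32];
        snprintf(pid, sizeof pid, "%d", getpid());
        write_str("/sys/fs/cgroup/demo/cgroup.procs", pid);

        /* beyond ~512M this workload gets reclaimed/OOM-killed on its own,
           without taking the rest of the system with it */
        execlp("memory-hungry-program", "memory-hungry-program", (char *)NULL);
        return 1;
    }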


An operating system should never overallocate memory because one cannot build reliable applications and infrastructure on top of a kernel which is lying to the application.


Windows will never overcommit memory. As a result on my PC unless I dedicate a substantial (over 20%) fraction of my drive to swap (most of which will never, ever be touched) I will 'run out of RAM' far before even half of the physical RAM in my PC has been used. This seems extremely wasteful.


> one cannot build reliable applications and infrastructure on top of a kernel which is lying to the application.

You cannot make reliable systems thinking like that. The point of resilience is __not__ to rely on correctness of kernel's advertised behavior nor correctness of your assumptions about it.


Actually I can, and have: I've engineered ultra-reliable production systems which ran for over ten years without issues. I'd come back to work at a former employer and the system would still be there, serving. This was not an isolated scenario.

One cannot build highly available systems and networks on unreliable, lying software. Basic guarantees are required. If one has 10,000 systems and one loses power on even a portion of such lying systems without basic guarantees, no amount of distribution will guarantee data consistency. I'm no stranger to building very large distributed clusters, but those builds start with an operating system and software which do not overcommit and which are paranoid about data integrity and correctness of operation. In fact, I'm specialized in designing such networks and systems, from hardware to storage all the way up to application software.


Mapped memory is part of that lying. If you want to use efficient mmap of huge files without lots of manual bookkeeping, you don't want the system to actually allocate as much memory as you specify. With overcommit off you're setting aside memory you're almost never going to use.

Many databases rely on the overcommit being possible.


Isn't the memory for memory mapped files backed by the files themselves? Certainly they are on Windows.


Partially. Without overcommit the system needs to guarantee you have enough backing memory to allow you changing the whole file before the drive writes back anything.


Then you disable overcommit.

For the general case though, because of the way programs have been written, it is easier to have overcommit on.
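
Concretely, that's a pair of sysctls (the ratio here is just an example):

    sudo sysctl vm.overcommit_memory=2   # refuse allocations beyond the commit limit
    sudo sysctl vm.overcommit_ratio=80   # commit limit = swap + 80% of RAM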


There's the rub. Different people in different situations will want different outcomes from a low-memory situation. One of those different outcomes has to be the default.


Actually, most Linux users are blissfully unaware that Linux overallocates memory, and of those that know about it, very few understand the consequences for applications (such as data corruption or data loss). I doubt anyone who fully understood the consequences would want to run their infrastructure on something so unreliable; to me, it's only logical that they wouldn't.


You got it backwards. Those of us who care about reliability would never rely on reliability of a single system to safely store data. And overallocating memory is not even remotely that important compared to everything else that can go wrong on a system.


One cannot build a reliable system on unreliable software. If you believe you can, there is a lesson in store waiting for you.


IPv4 on its own is not reliable, but TCP adds reliability on top of it. Aren't distributed systems in general also designed to be reliable despite being composed of unreliable components?


Yes, fault tolerance is the reason the distributed systems field even exists. Literally to make something reliable out of unreliable parts.


Really, I'd be fine with anything dying as long as it does not freeze. As it stands, when this happens, the only option left is to force a poweroff.


Yep, even with no swap whatsoever, performance is completely trashed (talking even the mouse lagging for 30+ seconds at a time) for a solid 5+ minutes before the oomkiller triggers. With swap, you might as well just reboot, because the system will take perhaps an hour to start responding.

Linux is completely useless with ram that is almost full in a way that OSX and windows absolutely are not.


Linux absolutely, positively, requires a decent chunk of free memory because the kernel's algorithms simply do not work and in my 15+ years of Linux experience they NEVER worked. If Linux starts thrashing, and it does so easily, then the box takes what surely is an uncomputable amount of time to recover.

One way to make relatively sure that this is always the case is to use a userspace daemon like earlyoom.

Though to be fair, all desktop OSes behave badly when put under memory pressure; it's just that Linux is an order of magnitude or so worse.


are any of the BSDs better in your experience?


Well, my memory of BSD is not fully clear, but when I looked at this back in 2008 BSD handled it better than Linux. The king of sorting this out was Solaris. It was rock solid. I would still argue that Solaris is better than Linux as a server, but it does not matter as Linux won. Also, I want my Amiga back :)


I only ditched the Solaris train for FreeBSD when all hope was gone after Schwartz’s reign came to an end. Solaris got so many things right architecturally that are damn hard to shoehorn in after the fact. Until today, I don’t know if any other *nix has sorted out ABI compatibility across architectures, which allowed running applications cross-compiled for another target to run against the memory-resident kernel without virtualization (only instruction emulation when needed). In practice, it meant that you could run “universal” x86 binaries against both the x86 and x86_64 kernels, even calling into drivers (!!), with guaranteed compatibility. The last I had checked, FreeBSD had made it a goal to unify all numeric values of IOCTLs across architectures, but I don’t believe they are there yet.


Well, a 32-bit Linux binary on 64-bit FreeBSD can call into the GPU driver and render 3D, at least :) But I don't think anything is guaranteed for sure. 32-bit crap is not exactly a priority, haha.

On the other hand, syscalls across architectures of the same bitness are exactly the same, minus the newer architectures just not having some historical abominations like sbrk. (IIRC, syscalls are different between even aarch64 and amd64 on Linux, which is just… why?!?)


Hey, pretty sure I recognize your name from other FreeBSD-related comments you've made on here in the past that I found worthy of saving!

We are stuck with needing to support x86 and prefer to do so as seamlessly as possible for the end user (call us old school like that); to that end we have a single bootable ISO image containing our appliance that needs to run on generic x86 hardware ranging from cutting edge to 15+ years old (I've posted about that with tons of accolades plus patches maximizing compatibility to the FreeBSD dev ML [some|a long] time ago but never heard back). We ported the software over from a very messy Linux-based implementation to a much cleaner architecture under FreeBSD 9 what I now realize can be considered a pretty long time ago.

With that background out of the way, we had written a custom scfb-powered "guaranteed compatible" X display driver (back before there was an upstream scfb driver, although I am sorry to see even with FreeBSD 12.0 that xf86-video-scfb does not work on quite a number of hardware profiles) that would work so long as the framebuffer did (and it usually does). I remember running into an issue (under 10.0) where even basic FBIO_* ioctl defines had different values when compiled for AMD64 vs i686, so although our x86 Xorg module would run under x64 (with the i386 compatibility module loaded, iirc?), it would fail to work as the wrong driver routine was being called into. I was honestly surprised it didn't work oob, as these were very basic ioctl calls.

Of course today with UEFI, signed drivers, and everything else, it's a miracle we haven't already been forced to distribute separate ISOs, but that's really only because we gave up on signed booting and simply require a legacy CSM for UEFI hardware, which is more steps but generally universally available; I don't think there's any real impetus behind getting UEFI support for x86, so we still boot the old-fashioned way.

Getting completely off topic, honestly both Linux and FreeBSD are quite literally decades behind when it comes to being able to guarantee unaccelerated video output at {,e,s,x}vga resolutions without requiring hardware-specific drivers; I don't think I have ever seen a MS-DOS/Windows PC that wouldn't at least boot in VGA mode going back at least all the way to early 90s. We certainly spent no time agonizing over getting video to work when Microsoft used to allow us to license WinPE to power our appliance images. I have an unfinished patch somewhere for the vt VGA driver that adds support for the various ioctls, but it seems the world has unfortunately decided that going with hardware-specific (k)drm drivers is the way forward. We decided to bite the bullet and now ship with those too, but it's about the farthest thing from bulletproof compatibility and neither the vendors nor the reverse-engineered community driver devs really bother formally testing these modules and it unfortunately very much shows. And now nVidia has discontinued providing proprietary x86 drivers for FreeBSD and Linux (although they notably still do for Solaris, almost certainly because of the aforementioned ABI compatibility guarantees there), so we're probably going to need to split our legacy/i686 and the uefi/x64 images soon enough.


Haha, that's like an epic combo of things I dislike — x86, legacy BIOS, 32-bit OS on 64-bit CPU, unaccelerated graphics, and X :)

> signed booting

It's not like FreeBSD was an early adopter of that. Some kind of secure boot support things have landed in the last couple months, and veriexec for signed stuff, but very few people have started using this.

> separate ISOs

Or a "fat" ISO that has both 32 and 64 bit versions?

> nVidia proprietary drivers

With these, there's already an issue with which versions of that support which cards..


> Until today, I don’t know if any other *nix has sorted out ABI compatibility across architectures, which allowed running applications cross-compiled for another target to run against the memory-resident kernel without virtualization (only instruction emulation when needed).

QEMU supports user-space emulation on Linux, for example you can run an ARM binary on top of an amd64 kernel. x86 on top of amd64 kernels works OOTB.


Interesting, thanks for sharing that info! I imagine that is restricted to running code that doesn't really make any calls against the hardware layer much?

Presumably it must also include some sort of compatibility layer rather than merely translating machine instructions, as the syscall payloads vary considerably. But qemu is certainly small and (when used correctly) fast enough that depending on that should not be a dealbreaker if you really have no better options.


I just ordered my capacitor kit for re-capping my Amigas.

Anyway, Linux won for now, but Solaris lives on vicariously through illumos and SmartOS, which handle this the same way as Solaris does (it's the same code), and the codebase is very actively developed, so it's not over yet no matter how much Linux is winning now: I'm waiting for folks to figure out just how shitty Linux is under the hood, and the fact this is a topic on HN tells me it's starting to happen as people gain more and more experience. Linux won the battle, but with such shitty code underneath, whether it wins the war as history will judge it is another matter entirely. Meanwhile SmartOS continues to be developed and works correctly in these situations.


FreeBSD is probably better than Linux at this, but that doesn't necessarily mean it's good.

Prior to FreeBSD 10, I'm paraphrasing but essentially it would pretty much not swap anonymous pages until memory pressure got really high -- it would evict clean disk pages and try to clean dirty disk pages first. But if the pressure continued, it would start marking anonymous pages and eventually swap those out too. A big problem was there was a massive slowdown when it hit the threshold; tons of pages would be marked as inactive, so using them in the process triggers a page fault (and they then get marked active), and all the page faults and page table activity slows things way down.

In FreeBSD 10, a patch from NetApp changed the page scanning behavior so it was always running. This is good, because you don't get that massive slowdown while the pages are determined to be active or inactive; instead the kernel always has a much better idea of which pages are used; but it's also not great because some of the inactive anonymous pages are going to get paged out, and it also makes memory accounting trickier. In FreeBSD 9, you could look at the active memory stat on a system that wasn't at the brink and figure that was pretty much the amount of ram you needed -- so if it was growing over time, you might have a memory leak, or real growth. In FreeBSD 10, it's harder, because your programs may need things loaded but not access them for some time, and the pages could get marked inactive, so you really have to guess; and swap usage isn't a great metric either because those idle pages might get swapped out. You can set the vm.defer_swapspace_pageouts sysctl to get something similar to old swap behavior, but you still lost the memory usage gauge.

OOM behavior seems better, but also not necessarily great. Most of my experience is from running Erlang servers with very little other stuff going on, so it might not apply to a more conventional server with lots of random things, and probably not to a desktop either. Overcommit is disabled by default (and I think it may be substantially different from Linux overcommit anyway), and mostly what happens in my experience is the big allocator (beam.smp for us) gets its allocations failed when you run out of swap, and then it crashes and everything is in an OK state. Every once in a while, something else will get failed allocations and die before the big allocator, but then the big allocator still dies; this wasn't great because sometimes it would be an important but small daemon like ntpd, and when this happened we'd often reboot to get back to a known good state. There's also an OOM killer that can be triggered depending on exactly how things go down -- it just kills the biggest process; for us, that's the right thing, because if the VM is eating all the RAM, killing off gettys isn't going to do any good.

However --- we saw several times that the kernel would just hang. Unfortunately, when we were seeing this often, I hadn't figured out the kernel debugger, and it was usually in the middle of an incident anyway, so we'd reboot and go on with life. A hung kernel is better than a thrashing kernel, but only a little bit.


> Overcommit is disabled by default

No way. Processes allocating terabytes (GHC Haskell compiled programs, AddressSanitizer, etc) work fine out of the box.

Reading tuning(7):

> Setting bit 0 of the vm.overcommit sysctl causes the virtual memory system to return failure to the process when allocation of memory causes vm.swap_reserved to exceed vm.swap_total. Bit 1 of the sysctl enforces RLIMIT_SWAP limit (see getrlimit(2)). Root is exempt from this limit. Bit 2 allows to count most of the physical memory as allocatable, except wired and free reserved pages (accounted by vm.stats.vm.v_free_target and vm.stats.vm.v_wire_count sysctls, respectively).

Soooo vm.overcommit=0 does not mean "no overcommit" (which would make sense), no, it seems to mean something like "no special flags like disabling overcommit" :D


You are totally right. I've failed at reading quite a bit on this one.


Yeah, any time I end up in that situation I just hold the power button and pray to the journaling gods. It's a serious issue; I'm extremely glad it looks like there's actually some progress on fixing it.


Enable SysRq and use SysRq+F. https://news.ycombinator.com/item?id=16561746
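
If it's locked down by default on your distro, something like this enables it (the bitmask values come from the kernel's sysrq documentation):

    # 1 enables everything; 64 enables just process signalling, which includes the manual OOM kill
    echo 64 | sudo tee /proc/sys/kernel/sysrq
    echo 'kernel.sysrq = 64' | sudo tee /etc/sysctl.d/90-sysrq.conf   # persist across reboots
    # then Alt+SysRq+F asks the kernel to run the OOM killer immediately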


I don't think that's a fair comparison: Do you normally run OSX with the swap files completely disabled, or Windows with the pagefile completely disabled? That's what this bug is describing. I'd bet things get pretty nasty on OSX and Windows too, if you tried that.

Perhaps the real bug is that Linux distros make it easy to run swapless.


The behavior is even worse if you leave swap enabled, as I already detailed in my post...


Isn’t iOS basically a flavor of macOS that runs without swap ?


Yes, but it's also a heavily integrated environment that aggressively quits background programs on memory pressure.


But isn't that exactly what the linked article advocates Linux should also do ?

Before quitting background applications it first sends them a request to free memory; in a well-behaved iOS program you use this to clean up your caches and ensure you don't use more RAM than you absolutely need. You should also persist your state to disk when your app is backgrounded so you can just continue where you left off if your app is killed.

Many macOS apps also do this, you can forcefully restart a Mac and after a reboot it'll restore your session to pretty much the exact state you left it in, including any open 'unsaved' files.

Linux could implement a similar mechanism to signal apps to clean themselves up and maybe a 'save your state, you're about to get killed' signal.


> Linux could implement a similar mechanism to signal apps to clean themselves up and maybe a 'save your state, you're about to get killed' signal.

Isn't that pretty much what "memory.pressure_level" [1] is?

[1]: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.tx...


That is the only working solution for HTML- and JS-based applications or apps using GC. Applications should save their state and quit (or be ready to quit) instead of using memory in background mode.


Does the time until oomkiller change dramatically on spinning metal versus SATA SSD versus NVMe, in a swap-off scenario?


There are _numerous_ similar bugs reported in the last 10+ years, in redhat's bugzilla, ubuntu, suse and others.

Almost all of them get closed because they get old and no-one is willing to do the necessary work to refine the bug report to dependable reproducibility.

And that's not surprising really; it's hard, time-consuming work which may be obsoleted on the next kernel version.

That said, some have written memory stressors and have been able to crash or stall a machine but it's still a bit hit and miss.


Agree, but this is only a decent bug report.

A better one would just make a C program that mallocs more memory than is available. The "open enough tabs so that it crashes" part is like "banana for scale": it is incredibly unspecific. You could probably open HackerNews 50x more than espn.com
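
For a quick-and-dirty reproducer you don't even need a compiler; the classic trick is:

    # tail keeps buffering the single never-ending "line" it reads from /dev/zero,
    # so its memory use grows without bound -- run it in a throwaway VM or under a cgroup limit
    tail /dev/zero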


Why? The situation in which it occurs is unambiguous. Allocate more memory than is available and watch it fail less than gracefully. What you're suggesting is merely technical gatekeeping; there is nothing to gain from writing a special purpose OOM-causer, no less specifying that it must be written in C.


> Your disk LED will be flashing incessantly (I'm not entirely sure why).

The VM is basically paging all clean pages in and out constantly as their tasks become runnable. A pretty standard case of thrashing.


There's several situations that need to be addressed in every VMM:

Cold pages - not being used

Warm/normal pages - average use pages

Hot pages - being used a lot

1. LRU/histogramming cold RO-code/RO-data pages to be paged out (dropped) or compressed when other pages are hotter and there's memory pressure.

2. Compressing or paging out to disk volatile RW-data when cold and other pages are hotter and there's memory pressure.

3. Pinning in memory (compressed and/or uncompressed) pages that are hot for performance reasons, unlocking them when they become less hot or others become hotter.

4. Having the ability (like on macOS) to signal applications that the system is under severe memory pressure (say "SIGMEM", "SIGLOWMEM" or "SIGOOM"), and to drop volatile data that can be regenerated or is ephemeral.


In this case the RW data (private dirty) can't be paged out even if cold because the OP is inadvisedly running without swap. This leaves the VM very few targets to drop.


This is with swap disabled.


It should have been clarified, because it is not obvious:

The kernel can evict memory-mappings of executable files which are currently running. When they jump to a part of the executable that is no longer in memory, it can page that part back in from the executable file on disk.

This is pretty cool. But when memory is very low, the kernel will evict practically all user-space executable mappings from memory, and will be reading back in and evicting back out executable file contents on practically every single context switch. It's just trying so hard to squeeze out some space to make its tasks fit in memory and complete successfully.

I think this was the desired behavior of big-iron batch-processing in the 90s? Not sure why it has persisted so long. I'm a big fan of linux and this is my biggest pet-peeve.


What's the other option, though? It gets a fault for a page somewhere, it needs memory, and it has none. You have to evict something. What's your choice if not swap or filesystem-backed pages?

Systems have never operated well under true VM pressure. Not in the 90's, not now. When the working set goes beyond available memory performance falls off a cliff.

And the report in the article doesn't seem to have a good comparison anyway. I mean, do we seriously believe that Windows or OS X can handle this condition (4G, no swap) better? I mean, swap is default for a reason.


It just has to kill something. The choice is between killing some things, or freezing everything, for way too long (many minutes).

I'll agree that Windows and macOS aren't perfect here. But I expect better, that's why I prefer Linux. (Better networking, better filesystems, better configuration and debugging of core system services, easier installation of stable C libraries for development or scripting ...)


Can't speak for OS X, but I dual booted Win 7 with swap disabled on my first machine with an SSD. Windows would just kill the Chrome tab that was eating all my RAM and I'd move right along. Linux locks up and is totally unresponsive even if you manually trigger the OOM killer. Windows' behavior is much, much better for desktop users.


I'd much rather have OOMkiller trigger than have my system sit completely unresponsive for huge chunks of time, while I slowly struggle to kill something.


While this might add convenience when all you do is use a web browser, Linux is huge in the server space. Having the OS suddenly and somewhat unpredictably killing programs is absolutely not an option for most workloads.


The oom killer already exists on servers and already can kill programs.

If you want to turn off overcommit and have the system power off when it runs out of memory, the kernel allows that.
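
Roughly these knobs, if you really do want the machine to give up rather than kill anything (values are illustrative):

    sudo sysctl vm.overcommit_memory=2   # strict accounting: allocations can simply fail
    sudo sysctl vm.panic_on_oom=1        # panic instead of invoking the OOM killer
    sudo sysctl kernel.panic=10          # and reboot 10 seconds after the panic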

Whatever knob they add will certainly be configurable, and ubuntu desktop can configure it one way while ubuntu server configures it the other, if it turns out people would prefer that.

In practice, people running servers seem to want oom killers to kick in before the server barfs. One example of this is facebook's oomd [1]. I assure you, they're running that on their servers, not their web-browser-machines.

[1]: https://github.com/facebookincubator/oomd


If you work in the cloud for any length of time you learn this is not true. Every system must be able to handle processes being killed unpredictably otherwise you will have problems. Better to deal with it and plan for it than expect a system to have 100% uptime. Because you never have 100% uptime.


I've had that argument countless times.

The last thing you want during operations (whatever your actual job description) is a server that's essentially not serving, not remotely available for troubleshooting and not crashing.


I don't think having a system sit completely unresponsive to anything for a huge amount of time is exactly optimal either.


Connection refused is way better for the average server stack than timeouts or even worse accepted connections with transfer rates of 30 seconds per byte. "Connection refused" is a huge signal flare going off instantly, timeouts and extreme slowness will by definition take longer to notice while impacting systems, even if your monitoring were perfectly set up to look for it.


Sure, for modern systems "transfer rates of 30 seconds per byte" would also be "a huge signal flare going off instantly" where metrics track basic software-hardware interaction (latency and throughput, not just on/off error rates).

Nor do these metrics necessarily take longer, by definition, to notice gray failures. You can detect a gray-failure transfer rate without having to wait 30 seconds. As another example, a second or two is enough to notice disk latency issues (and retry against a faster disk), which is probably the same granularity as the window for detecting "Connection refused".


Absolutely, which is why configurable knobs are things that exist. I already have to mess with the OOMkiller on my servers; I'm fine with fiddling with knobs on my desktop box too. This is Linux, after all.

Wasn't one of the constant talking points in the war over the Init System That Shall Not Be Named that Linux needs to scale from embedded to big iron?


Having the system thrashing as described is worse than most alternatives, including those below for most workloads. There are probably a few workloads where thrashing is better, as maybe an operator can get in despite the thrashing and do the right thing.

a) killing the malfunctioning process (this is often hard)

b) killing the biggest process

c) kernel panic / locking up

d) killing more or less random processes until memory pressure is relieved

Thrashing in a distributed environment is probably going to end up with partially failed health checks and all the nastiness that comes with flapping. If you're lucky, failing health checks will reduce load and memory usage, and get you back to a happy place; if you're realistic, you probably got into the thrashing situation because of a burst of latency or traffic that resulted in slower processing for a bit and the resulting retries killed the system.


With no script it's rare for Firefox to eat all my memory (with 4GB of RAM); more often it's GCC. I really just need to be involved when that happens; going and killing things isn't the answer.

One thing I do want to try is sandboxing the browser using UML (which I've been using a lot of lately). This way, since it gets its own kernel, I can totally limit everything.


Then make it very predictable. Set a minimum amount of disk cache. Now you know that the OOM killer will trigger exactly X hundred megabytes sooner, and also that your performance won't tank because running code is all paged out.


> *You have to evict something. What's your choice if not swap or filesystem-backed pages?*

How about fail to allocate more memory than the system actually has?


> ... on practically every single context switch ...

How frequently do context switches happen? Can this be reduced, so that there is less disk thrashing? I.e. whatever process needs lots of memory swapped in, just don't switch to it for up to maybe a few seconds, rather than milliseconds or whatever? This seems less bad than having the OOM killer just kill the thing.


Swap and paging are different concepts. The VM can page clean pages (mmap'd files, like most code your system is running) from disk in and out all day if it wants, in fact, when Linux is thrashing, that's exactly what it's doing.


You can think of all your executables and shared libraries on disk as a kind of read-only swap.

If the system is low on memory, a page of program code may be dropped from RAM and re-fetched from disk when it is needed again, i.e. when that section of the code is being executed again.

(this effect is not limited to program code, though)


Anything mmaped and read-only (SO included) could be and would be ditched


Also, when you run low on memory, iptables can crash on you: https://bugzilla.kernel.org/show_bug.cgi?id=200651

Poof, no more networking


No, just not 'that operation' afaics.


I committed the grave mistake of purchasing a laptop with only 8GB ram and I constantly run out of memory as a result. When it happens, I just repeatedly mash alt+sysrq+f until it kills off some chromium tabs and unfreezes my machine. It essentially behaves like one of those extensions that lets you unload tabs. If needed, you can get the tab back by just reloading the page. The machine slows down to a crawl at 96% usage, and freezes at 97% usage (according to my i3 bar).


A few months ago I upgraded my system to 8GB RAM and I don't see the behavior you describe. It does require a little more care when choosing which programs to use[1], but not significantly so. However, you do need to make sure you have swap enabled to let the kernel efficiently manage its resources. It's not unusual for me to have 1-2 GB of stuff in swap; this doesn't affect performance significantly since it's parts of the system that don't need to run, but if you insisted that they all stay resident then it would put a considerable strain on the system in low memory conditions.

[1] The big one for me is that I can't run the Atom text editor, Firefox, and a virtual machine all at the same time.


I run 8GiB without swap. It kills processes off quite nicely when hitting OOM as there is no time wasted with paging out to slow storage.


> slow storage

Linux recently gained the ability to swap huge pages to NVMe without having to split them. Combined with THP this can be quite fast.


How recent? Did this happen back in the 4.x series or is it new to the 5.x series?


4.14 contains some of the changes I read about. But the patch notes are different than what I originally read, I'm not sure whether this is the full feature or whether some of it stalled.


>[1] The big one for me is that I can't run the Atom text editor, Firefox, and a virtual machine all at the same time.

Is this on Linux? I have no problem with that workload on an 8GB Macbook Air (2019). I was apprehensive about going with the 8GB model, but I've not really had any issues.

I wonder if memory compression on OS X helps here.


Yes, this is Linux. I say "can't run" but what I really mean is that things start to get slower after a while until eventually it becomes annoying. This is probably in part due to my habit of accumulating open tabs and windows, and also because I run KDE which is pretty memory-hungry on its own. If you don't do this or you're the sort of person who shuts down your computer every night you might never see this problem so it's hard to compare just based on the list of programs.


Are you swapping to the “VM” section of your APFS container on an insanely fast PCIe (NVMe?) SSD? That’s my guess. Although macOS does a great job of handling poorly behaved programs that use tons of RAM just to begin with.


I'm sure the fast SSD helps to mitigate the relatively small amount of RAM, yes. I can't say to what extent.


And I cannot install more than 4 GB (thanks to Intel for planned obsolescence) and have to use several Electron-based programs (each takes not less than 500 MB). When there is not enough memory, the system can freeze to the point that I cannot even move the mouse pointer.


> have to use several Electron-based programs

Have you thought about running browser-based versions? I tend to find that these behave a lot better.


Maybe I will try it, but in my case I run them isolated under another user account so it means that I will have to run several instances of a browser (but I'll try to compare the memory usage). Because I don't want them to install cookies and track me across the Internet.


Firefox Containers is a great way of doing this without having to run separate browsers

https://addons.mozilla.org/en-US/firefox/addon/multi-account...

Also incidentally electron apps afaik don't install cookies on your main browser, they should be separate anyway


How do you manage with only 512 MB RAM?


> I committed the grave mistake of purchasing a laptop with only 8GB ram and I constantly run out of memory as a result.

I recommend that you try ZRAM. It has made a difference in a 8GB laptop that I use: https://www.cnx-software.com/2018/05/14/running-out-of-ram-i...

For 8GB I run a 2GB ZRAM swap.
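
For reference, the manual setup is only a few lines (compressor availability depends on your kernel; zram-tools or a zram systemd generator can automate all of this):

    # as root:
    modprobe zram
    echo lz4 > /sys/block/zram0/comp_algorithm   # choose the compressor before setting the size
    echo 2G > /sys/block/zram0/disksize
    mkswap /dev/zram0
    swapon -p 100 /dev/zram0                     # higher priority than any disk swap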


How much “extra” RAM does that translate to for your workload?


https://github.com/rfjakob/earlyoom will help you. It will kill those processes for you.
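
Setup is about as small as it gets (package and flag names per the earlyoom README; adjust the threshold to taste):

    sudo apt install earlyoom              # packaged in Debian/Ubuntu these days
    sudo systemctl enable --now earlyoom
    # or run it in the foreground, killing once less than 10% of RAM is available:
    earlyoom -m 10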


Almost every other laptop comes with an empty RAM slot nowadays; I bet you can plug an extra few gigs into it.


Not the ThinkPad X1 Carbon.


I've never liked the approach the Linux kernel and userland take to memory exhaustion. Many people confidently assert that it never happens. The somewhat better-informed suggest that it's unreasonable to write programs that recover from memory exhaustion because unwinding requires allocation --- a curious belief, because there are many existence proofs of the contrary. Then we get a feedback loop where everyone uses overcommit because everyone believes that programs can't recover from OOM, and people avoid writing OOM recovery code because they believe that everyone is using overcommit and allocation failure is unavoidable. And then they write kernel code and bring this attitude there.

Memory is just a resource. If you can recover from disk space exhaustion, you can recover from memory exhaustion. I think the current standard of memory discipline in the free software world is inadequate and disappointing.


It's not just memory discipline that is pretty bad (and not just in the OSS world). I've recently seen several newer languages refuse to deal elegantly with low-level errors.

For example, in a Java server application, if one request encounters some buggy code that tries to read past the end of an array, that request will fail, but all others will succeed - this gives the system a good chance of remaining usable, and of producing a good bug report, with system-generated diagnostics, for the buggy requests.

However, in Go or Rust, the same scenario panics and kills the entire process by default - turning a potentially minor bug in some obscure part of the system into a system-wide crash.

OOM is obviously harder to deal with (e.g. if one request is using too much memory, there's no guarantee that it won't be other requests actually seeing the OOM errors first), so if we don't even want to deal with the easy stuff, how can we hope to deal gracefully with the hard ones?


That's because you are supposed to do bounds-checking yourself in Rust, not because they don't want to handle it gracefully.

It's like what happens when you read out of bounds in C, except it fails more reliably.


You're supposed to write correct code in Java as well. The reality is that it doesn't always happen. C doesn't claim to handle the issue at all, and doesn't verify it, which is at least a performance gain. Rust does verify it, but issues an error type that is not guaranteed to be recoverable at all.


Completely agree - languages like Rust prefer to fail fast and hard.

It's certainly an easy to understand solution and is getting the program into a well-known state, but it's also low effort and user-unfriendly. They could have done better, but they would need real exception support for that.


How are all the people talking about Windows here getting it to behave better? In my experience, when you run out of memory on Windows, the whole machine locks up hard for ten or fifteen minutes while it thrashes the disk before finally killing the offending process. (Admittedly that's on spinning metal; SSD would probably do better.)


On Windows I could consistently open a 32GB matrix in Matlab with 16GB of RAM on my laptop and perform operations on the matrix. The disk would spin, and it would take 20 minutes to do a simple operation because of the swapping, but I could open it, perform the operation, save, and exit successfully. I could easily background Matlab and do email or other common tasks such as browsing with very little impact to those applications. On Linux Mint that same task locks the mouse and brings the system to its knees, I can't even kill Matlab and would typically resort to a hard reboot. I learned quickly that I can't do the same things on Linux Mint that I used to do pretty easily on Windows.


For me, Windows (7, 64 bit) behaves exactly as you report for Linux. I would love to be able to get Windows to behave like it does for you. What version were you using? Did you tweak any settings?


I haven't been running windows for a year or two now, so this was a while back, but I think I was on Windows 10, possibly it was 8 back then, no tweaks. But I was quite successful at this in Matlab specifically. If you overload memory across many processes perhaps you can get into a bad place, but when it was just one process that was abusing swap, Windows is quite good about making sure other processes aren't dramatically affected in my experience (there was some lag, but it was usable).


Have you tried this with swap space disabled? In my experience it hits a wall very suddenly.


No, the above is with the default swap settings. I have tried Windows with swap disabled, but on an 8GB machine, it for whatever reason ran out of memory with only about 4GB of stuff loaded, so I decided it was better to put up with the default settings.


The most annoying thing about OOM is when a process goes crazy and starts using a lot of memory: the OOM killer looks at the system, sees that process is really active, and so kills mysql/ssh/apache/postgres to make room for the runaway.

I've set up monitoring that pages me when "dmesg" includes "OOM".
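
Roughly like this (the pager command at the end is a placeholder for whatever alerting you use):

    # follow the kernel log and alert on OOM kills
    dmesg --follow | grep --line-buffered -i 'out of memory' |
        while read -r line; do page-me "OOM on $(hostname): $line"; done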


You can actually adjust oom_score_adj on those long-lived important processes to make the OOM killer avoid them.
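
For example (as root; -1000 removes a process from consideration entirely, while positive values make it a preferred victim):

    echo -1000 > /proc/"$(pidof -s sshd)"/oom_score_adj            # never OOM-kill sshd
    echo 500 > /proc/"$(pidof -s my-batch-job)"/oom_score_adj      # placeholder name: kill this first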


And I've always meant to go in and set SSH to be immune to OOM, and deprioritizing others like databases, but I've just never gotten around to it. Looks like oomd could be useful, once we get to kernels that support it in production (looks like 4.20, Ubuntu 18.04 has 4.15).


Another interesting aspect is to set SSH to realtime. When you log in, you set your session back to non-realtime if your work is not important, to avoid sinking the server into oblivion with a simple command.

It is possible to do this on shared servers as well, by letting root log in on a separate port and setting that process to realtime.

Thoughts in my head, but I never got around to doing it.
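
The mechanics would be roughly this (untested sketch; the priority is arbitrary):

    sudo chrt --rr -p 50 "$(pidof -s sshd)"   # realtime round-robin for the ssh daemon
    chrt --other -p 0 $$                      # drop your own session back to normal scheduling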


Try setting vm.oom_kill_allocating_task. I find it more useful than the default behavior and it seems to be closer to what you want.


Apple have solved some of it by limiting the maximum number of processes before the user gets a warning and has to close some of the old ones. Not sure if they handle memory any better; from my experience, no.


If there's one process going crazy and using tonnes of memory, how does a maximum number of processes help? We're not talking about a forkbomb here, just one rogue process.


I wonder how they did with Android? Especially in the early days, not with today's 8GB+ monstrosities.

My first Android device was a Nexus One. 512MB of RAM for what is essentially a full Linux system. Able to run a browser and multiple Java apps, all isolated and running their own VM. Task managers often reported near 100% RAM use and things still worked fine.

And my understanding is that they've optimized things further since, but given how overpowered phones are today and how bloated apps are, it is hard to tell.


Android is described further in this thread https://lkml.org/lkml/2019/8/5/1121



FreeDesktop.org solution (being deployed on several GNOME distributions today): https://gitlab.freedesktop.org/hadess/low-memory-monitor

Used in combination with compressed swap (ZRAM), it greatly alleviates this problem on the (at least GNOME-based) Linux Desktop.

Still, browsers really need to do something about the memory problems they're causing. They're Windows 95-level bad at managing their high-memory/leak cases - just leave a browser with more than a few dozen tabs open overnight. Especially with a tab that does background fetches (e.g. Facebook or Twitter or something with a lot of timer-driven Ajax queries).

I assert that if it weren't for browsers, there'd be no memory problems on modern desktops.


https://github.com/rfjakob/earlyoom has worked well for me in the past.


It's even in debian/ubuntu repos now. I hadn't realized that.


Yeah. OOM-kill handling wants to be a silver bullet, sort of. For instance, Linux kernel provides a number of I/O schedulers or net schedulers, etc. to pick from, but OOM kill is "one size fits all". And it doesn't really look like things are going to change [1][2][3].

[1] https://lore.kernel.org/lkml/alpine.DEB.2.21.1810221406400.1...

[2] https://lore.kernel.org/lkml/20181024155454.4e63191fbfaa0441...

[3] https://lore.kernel.org/lkml/20181023055655.GM18839@dhcp22.s...


android has something similar, the low memory killer daemon (lmkd).

both use the recently added pressure stall information (PSI)[0] infrastructure in the kernel to determine when the system is overloaded.

[0] https://lwn.net/Articles/775971/
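
PSI (kernel 4.20+ built with CONFIG_PSI) is also easy to eyeball directly, and the same data is exposed per cgroup:

    cat /proc/pressure/memory
    # some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    # full avg10=0.00 avg60=0.00 avg300=0.00 total=0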


A couple weeks ago, one of my physical sticks of RAM completely stopped working after yet another Linux out-of-memory force-poweroff situation. No idea if that was the actual cause, but I do find it a little funny.

I just arrived at this thread after my entire system stalled completely in yet another low-memory situation.

Let's just say I'm extremely grateful to discover some of these userspace early-OOM solutions in this thread.


I hit this bug yesterday on my laptop (16GB of RAM / 1GB of swap) with 2 instances of Firefox (about 60 tabs), Slack, Insomnia (Electron-based Postman clone) and a couple of `node` processes watching and transpiling.. stuff. `kswapd0` was running at 100% CPU, I guess trying to free up some RAM by moving things to swap (the swap partition was full by this point). Luckily I managed to recover the system by switching to another tty and killing kswapd0 and the node instances.

Sometimes instructing the kernel to clear its caches helps: `echo 1 | sudo tee /proc/sys/vm/drop_caches` [1]

[1]: https://serverfault.com/questions/696156/kswapd-often-uses-1...


> I guess trying to free up some RAM by moving things to swap (the swap partition was full by this point)

In that case, it cannot be moving things to swap since it was already full. What probably happened is that since the swap was already full and not enough, it started to thrash the system by paging out the executable files in memory. Ironically enough, you'd probably have had a much smaller risk of thrashing had the system had more swap space.


I wouldn't enable swap on a desktop Linux system. When you run out of memory, if you have swap, the system grinds to a halt and you pretty much can't do anything to save it, or at least it is a battle.

Without swap it just kills processes until there is enough memory, which is what you would have done anyway!

I think the main annoyance with Linux here is that in Windows you get to choose what to kill, whereas in Linux it can't really communicate with you (because the kernel doesn't know about such modern things as GUIs), so it has to pick more or less randomly.


It's not really that random. To see what the OOM killer heuristic currently considers its top 5 targets on your machine:

    for P in /proc/[0-9]*; do echo $(cat $P/oom_score) $(cat $P/comm); done | sort -n | tail -5
For me right now that shows 4 "Web Content" processes (firefox tabs) and a firefox-esr. That seems to check out.


Having some swap with low swappiness allows the system to page out pages that are unlikely to be used or haven't been used in a long time but aren't backed by a file.

In normal conditions this frees up memory for more useful data and helps you avoid getting to perverse conditions.


That's how it should work yes. Unfortunately it doesn't actually happen like that, hence this entire discussion.


It actually DOES happen like this. When the entire working set for actively used apps fits in memory, swap lets the system page out things that are little used. This works perfectly fine.

This is to say that swapping out little used stuff delays the point where you are actually out of memory and performance goes straight to hell.

This means the optimal arrangement for desktop use is some swap and low swappiness.

One could imagine that perhaps something like

https://github.com/rfjakob/earlyoom

might be an easier route to better behavior, especially as you can more easily tell it what it ought to kill.

The behavior of the kernel could probably be improved, but it is probably inherently lacking the data required to make a truly optimal choice, along with a GUI to communicate with users. Going forward, desktop-oriented distros should probably come with some out-of-the-box GUI built into their graphical environments to handle this situation before it gets to the point of dysfunction.


I'm not sure how he's getting swap even with swap off, but this seems to be the big disadvantage to having overcommit --- the memory allocator won't ever say NO, so an application can keep allocating memory even if that memory becomes uselessly slow to actually access.

Then again, this "allocation will never fail" mentality has also led to applications being written with such an assumption, and when allocations do fail, they crash. (Arguably, that's better than thrashing the rest of the system.) I don't know if the modern browsers will actually stop letting you open new tabs and just give an "out of memory" error instead of crashing, but that's how most Windows programs are usually written --- without the assumption that allocations can never fail, because on Windows, they can.


> I'm not sure how he's getting swap even with swap off

It's not data swap, it's executables. Linux knows it can reread the executable from disk if needed, so it reuses those memory pages for other things, and reads them back in when needed.


This is why one uses mlockall.


He’s not getting swap. Or in a sense he is... you can still thrash: the read-only pages of the executable and any memory-mapped files are still eligible to be paged out. When you get into a memory-pressure situation you end up with a handful of executable pages of all the active programs getting faulted in on every context switch.


All memory is conceptually backed by some file. Under memory pressure, the kernel (and this mechanism is the same on any modern kernel) frees memory by writing pages to their backing storage, then discarding them from RAM.

There's nothing really special about anonymous memory except that it's backed by swap instead of by a named file on some filesystem. On a system with swap disabled, the backing store is still conceptually swap, but since the swap doesn't exist, pages backed by that imaginary swap can't be evicted from RAM. Pages backed by other backing stores can certainly be evicted from RAM, however, and that's how you "swap" on a swapless system.

Note that executable code is almost all mapped so that it can "swap" in this way.


This is a very old problem; I used to see it decades ago when making tape backups. Tar would move the entire disk through the buffer cache, so that eventually everything in it was paged out. The classic solution was to use unbuffered versions of the disk device for backups.

What I've always thought is that there should be a working set size limit on a process which includes the buffer cache somehow. The idea is that the process may not use more RAM than this size: if it exceeds it, it must either fail or swap out its own pages, not those of any other process. This would fix the problem for tar; it only needs a tiny amount of memory.

I think the situation is very similar with the web-browser example. The browser should not be allowed to force all unrelated data to be paged out.


> working set size limit on a process which includes the buffer cache

You can control this with cgroups. Plug a process into a separate cgroup and set the `memory.limit_in_bytes` knob to whatever your heart desires.

I use it to limit qBittorrent's memory usage on my machine. `firejail` is very convenient for doing this. If I don't set a limit (30% of RAM in my case), it eats up all the memory with a uselessly large file cache, which does not improve upload speeds at all.
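
Under the hood it's just a couple of writes to the memory controller (cgroup v1 paths shown; the group name and process lookup are illustrative):

    # as root:
    mkdir /sys/fs/cgroup/memory/torrent
    echo 2G > /sys/fs/cgroup/memory/torrent/memory.limit_in_bytes
    echo "$(pidof -s qbittorrent)" > /sys/fs/cgroup/memory/torrent/cgroup.procs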

https://access.redhat.com/documentation/en-us/red_hat_enterp...


So... running Linux swapless is a thing? How popular is it?


I don't know what other people do, but I think the better option would be to set vm.swappiness to 0. Swap space is a good safety valve. You should never really have to use it, so a good way to detect that something is going really wrong, and take action before it brings the system down, is to watch for swap filling up.

Also, if someone opens up an application that grabs huge chunks of RAM but leaves a lot of it idle, and turns swap off completely, they should not be surprised. I don't know why people see this as a bug, but perhaps I've just been spending time in the UNIX family tree for too many decades.
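
i.e. something like:

    sudo sysctl vm.swappiness=0
    echo 'vm.swappiness = 0' | sudo tee /etc/sysctl.d/99-swappiness.conf   # make it persistent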


Kubernetes, for example, doesn't even support swap. Some bug reports say it won't even run with swap enabled, though I didn't test myself. ¯\_(ツ)_/¯


I don't know about that but it is very common to specify CPU and memory limits for docker containers. Exceeding those automatically leads to the process being killed. The reasoning is very simple: any form of swapping is completely unacceptable on a production server because it randomly and massively degrades server performance. If you have a cluster of stuff and one node is misbehaving like that, you kill it because that is completely unacceptable. If that is a regular thing, your servers are obviously mis-configured in some way and you fix it by provisioning more hardware or tweaking the limits.
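
e.g. with plain docker run flags (the image name is a placeholder):

    docker run --memory=512m --memory-swap=512m --cpus=1.5 some-service:latest
    # --memory-swap equal to --memory means the container gets no swap at all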

12 years ago, before I got a mac, I had a windows XP laptop with enough memory (8GB) to disable the swap file (which world+dog will insist is a very stupid thing to do). This was great and vastly extended the useful life of my laptop. Alt+tabs were instant and I could run e.g. JVM applications with sane heap settings as well as a browser, office stuff and a few small things I needed with zero issues. On the rare occasion that something did run out of memory, it died or I killed it. Laptop disks were stupendously slow at the time; any form of swapping on a slow laptop disk is extremely disruptive. SSDs are much better but there too it tends to be mostly redundant.

IMHO most forms of swapping are highly undesirable on both servers and end user hardware. Swapping to free up memory for file caching is simply unacceptable when you can instead just evict cache pages. If you don't have enough memory left to cache effectively, that just means things like memory mapped files will get a lot slower. If something allocates more memory than you have just kill it.


> The reasoning is very simple: any form of swapping is completely unacceptable on a production server because it randomly and massively degrades server performance. If you have a cluster of stuff and one node is misbehaving like that, you kill it because that is completely unacceptable. If that is a regular thing, your servers are obviously mis-configured in some way and you fix it by provisioning more hardware or tweaking the limits.

A small swap space (~ 1G on a 64+G ram server) is a reasonable backstop against a slow memory leak. Assuming you don't have filesystem pages evicting anonymous pages, swap use is a clear indicator of too much memory use and points you in the direction of something to fix; and gives you a little bit of time to fix it on the running system. As long as swap is very small relative to ram, it's not going to enable thrashing -- a big leak or burst in use isn't going to fit in swap and you're going to be dead anyway.


> If you don't have enough memory left to cache effectively, that just means things like memory mapped files will get a lot slower.

Understatement of the year. If you can't cache your memory mapped executable pages, your system will be just as slow as in the nightmare swap scenario.


On old Win95 machines, you couldn't disable swap, because DirectX used it to buffer audio in several games (Jedi Knight comes to mind) and so your game wouldn't have any sound with virtual memory disabled.


Yes, here is the link to the issue: https://github.com/kubernetes/kubernetes/issues/53533

Still open.


No idea how popular it is, but that’s how I run my ThinkPad X220 with 4 GB RAM and a mechanical hard drive. It’s still incredibly snappy.


I used to disable swap because it supposedly reduced the life span of SSDs, but I don't care about that anymore — when it dies, it dies.


When working on Ubuntu 16.04 LTS, this is such a productivity killer. Quite annoyed at the time lost from this behavior, after coming from a Mac. In shells where I run a program that may load a larger data set (e.g. before ipython), I now regularly run `ulimit -v 50000000` to limit the shell's virtual memory to ~50 GB of the available 64 GB on this machine.

If the program tries to use more RAM it'll then just die, and not drag down the whole system with it. Works fine, but I really shouldn't have to do this.


It is interesting how Android, which typically has less memory than desktop systems, solves such problems. It kills inactive applications and background browser pages. A program that can save its state is more complicated, but it works better with a limited amount of memory. Today there are many applications written using languages like HTML or JS, or garbage-collected languages, and unless you can unload them from memory, there will never be enough of it.


Wish desktop browsers did this by default, except for pinned tabs or the last 20 pages. But then frontend guys make heavy apps with animated meme loading screens.


I've been living in denial about this being a Linux-specific issue until I saw this post. Even though I encounter this problem frequently on Linux and it has almost never happened to me on OSX or Windows, I've just been telling myself that it was because of the hardware I was using in each case. If they found a way to fix this, where just the browser froze and not the entire OS, it would be a huge improvement.


Do you run Windows without Pagefile/Swapfile?


I don't know. But I've had the problem described above on Linux with swap enabled.


lkml.org has been unstable for the past few... umm years, so The Linux Foundation runs its own lkml archive - lore.kernel.org/lkml/

Alternative link, just in case: https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca48...


That html code in lore.kernel.org is weird, I wonder how it's generated.


It's a somewhat common trick, I believe. The idea is this: you want newlines in between your tags, but if you have HTML code like `<div>foo</div>\n<div>bar</div>`, you end up with an unwanted text node containing a space in between the divs, which changes how the page looks. By putting the newline inside the tags instead of between them, you don't have any unwanted text nodes.


lkml.org is not official


Does the Raspberry Pi suffer from this? All but the latest models have less than 4GB of RAM, and their storage is often slow SD flash (technically it could be fast, but most people have cheap SD cards), so it fits this scenario perfectly. I guess most users aren't pushing a lot into memory like GUI browsers do.


It has nothing to do with the hardware. This will happen on any system with a default Linux install.


I didn't mean to imply it had anything to do with hardware rather I intended to point out that there should be a large base of users (owners of RPis) experiencing the issue since the hardware is exactly what you need to reproduce it (limited RAM & slowish disk speed).


In Windows on a 16GB RAM laptop, I've often fired up Matlab, opened a 32GB matrix, and performed a few simple operations on it. In Windows Matlab dutifully chugs away on the problem, the disk spins like mad, and I put Matlab in the background and do email for 20 minutes. This identical use case completely cripples my Linux Mint OS, the mouse hangs, nothing functions, and I've never gotten it to even complete the operation. I just can't operate on a 32GB matrix with 16GB of RAM in Linux, but I can in Windows with relative ease.

To me, this is the Linux kernel's biggest weakness against Windows. Most other gripes about Linux (poor power management, poor driver support, etc.) belong outside the kernel's domain, but this one is a glaring win for Windows over Linux.


I use a swap file these days because in the 4 years since I purchased my computers I went from never hitting 32 GB of memory used at the same time to hitting it once a week. The worst offenders are browsers and the JVM. The swap file saves me from those 20 seconds of distraction when a variable-memory workload suddenly jumps over the limit and hardlocks the computer for hours on end. If I'm doing something important I'll wait for the OOM killer to maybe reap the evil children, but otherwise I just power cycle the system and add a note to put the swap file in fstab.
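For anyone who hasn't set one up, the swap-file dance looks roughly like this (the 8G size and /swapfile path are just examples):

    # create and enable a swap file
    sudo fallocate -l 8G /swapfile   # or dd if=/dev/zero of=/swapfile bs=1M count=8192
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile

    # make it survive reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab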


This is an obvious idea, so I presume there's a reason why it wouldn't work, but what would happen if you had different rules for uid=0 pids and for everything else? If processes running as root were never eligible for oom-killing, and could force mallocs by triggering an oom-kill of user processes as necessary, wouldn't you always be able to recover a thrashing system from a root console? Or is it too hard to isolate console IO from the rest of the system in that situation?


The issue isn't that the OOM killer is too aggressive and killing the consoles or shells you're trying to use to rescue your system. The issue is that when you're thrashing, the system becomes unresponsive; it's hard to recover the system because it takes many minutes just to switch to a TTY.


That's kind of where I'm trying to get to, I think: how can you segregate the system such that root processes get a priority such that they don't get affected by the thrashing?


Read the source code of the OOM killer: the way it computes the badness score favours userland processes, but root processes must also be considered, because it's possible they misbehave too.


Possible, yes, but the smaller number of root processes and their general higher-risk status implies that a different policy entirely might be better than the same policy with different parameters.

Besides which, I'm looking at https://github.com/torvalds/linux/blob/master/mm/oom_kill.c and I genuinely can't see where root processes get privileged. I can see a reference to a `root bonus` in a comment, but other than that... maybe my kernel source reading skills are just too rusty.


Many systems (embedded and phones) don't have a proper root console, and people still run daemons which can leak memory as root.


Yeah, not having a root console on those systems means they're in no better (or worse) a situation than today. I think misbehaving root processes are a different case, though: any misbehaviour as root can have dire consequences, which is why running as root is a Bad Plan in general. Would adding "...and oom behaviour can wedge the box in new and interesting ways" to the list make things worse than they are today?


I'm seriously impressed by the quality of the discussion this generated on the mailing list, a great example of online collaboration imho!


I've noticed a similar problem with low free disk space with every OS I've tried it on. All kinds of erratic behavior, hangs, etc.


The program using all the memory here is Chrome (or Firefox). It has more information about what is going on than the kernel does. It should be smarter about memory use when it is trying to consume more memory than is available. Perhaps it could page out background tabs to disk or something similar if memory is low.


Those were just examples. This issue can be reproduced using any process.


Yes, it has been bugging me a lot. My swap space remains empty while my RAM runs out, and that situation is frustrating because I can't close my applications or even restart the display server to free up memory.

Something needs to be done here for real; otherwise, Linux is nice software.


What are these settings he's referring to? Genuinely asking - I used to run into this issue _all the time_ and even if it's not a default I'd love to toggle some flags and get a responsive system even under low-memory situations


After reading through the comments I would like to know the following: Why are there no priorities?

I can't figure it out from the answers. I think with root privileges it should just be possible to say "GUI has higher priority" etc... Then when there is a memory issue you kill some low-priority processes to get the memory back.

But whatever any sane person considers part of the operating system, because it is the bare minimum required to do stuff (filesystem, GUI, ...), needs to have priority and always be fast. This can be defined by the distribution using the startup privileges.

So, why is this so difficult?


Read the comment by idoubtit in the thread below to learn how to prioritize:

idoubtit 2 hours ago

Point 3 is wrong. OOM killing is not random. Each process is given a score according to its memory usage, and the highest score is chosen by the kernel. The way to mark priority in killing is to adjust this score through /proc. All of this is documented in `man 5 proc` from `/proc/[pid]/oom_adj` to `/proc/[pid]/oom_score_adj`. http://man7.org/linux/man-pages/man5/proc.5.html
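Concretely, that looks something like this (the PIDs and values here are only illustrative; -1000 means never kill, +1000 means kill first, and lowering the score below zero requires root):

    # make an important process much less likely to be OOM-killed
    echo -900 | sudo tee /proc/1234/oom_score_adj

    # make a disposable memory hog the preferred victim
    echo 1000 > /proc/5678/oom_score_adj

    # inspect the badness score the kernel currently computes
    cat /proc/1234/oom_score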


My biggest issue is that everything eventually gets swapped out in favor of disk cache. I know there are settings, but... it is just wrong.


What is the actual setting for exactly this behavior? At the very least I'd like to disable it.


vm.vfs_cache_pressure

I'm not sure if it'll exactly accomplish what you want though.
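Either way, the knob can be poked like this (200 is just an example value; 100 is the default, and higher values make the kernel reclaim the dentry/inode caches more aggressively):

    # check the current value
    sysctl vm.vfs_cache_pressure

    # change it for this boot only
    sudo sysctl -w vm.vfs_cache_pressure=200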


Thanks, I'll take a look. I remember playing with it for 32 MB flash devices.


I had exactly this issue 3 years ago, when I was still on 4 GB of RAM and working with heavy frontend stacks (gulp, webpack1, ...).


What is the solution here?

Are you recommending a swap space be created automatically on behalf of the user?

One could also use compressed memory (`zram`).


swap on zram (RAMSIZE - 1G) and earlyoom
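Roughly, and assuming util-linux's zramctl plus a distro that packages earlyoom (the 15G size is just an example for a 16 GB machine):

    # create a compressed RAM-backed block device and use it as swap
    sudo modprobe zram
    sudo zramctl --find --size 15G    # prints e.g. /dev/zram0
    sudo mkswap /dev/zram0
    sudo swapon -p 100 /dev/zram0     # higher priority than disk swap

    # earlyoom kills the biggest offender before the kernel starts thrashing
    sudo apt install earlyoom && sudo systemctl enable --now earlyoom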


I used to have a cgroup just for Chrome to limit how much total RAM it could use, because of this exact thing.
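Something along these lines with cgroup v1 (the group name and the 4G limit are made up; on a pure cgroup v2 system the file is memory.max under a different path):

    # create a memory-limited group and start the browser from inside it
    sudo mkdir /sys/fs/cgroup/memory/browser
    echo 4G | sudo tee /sys/fs/cgroup/memory/browser/memory.limit_in_bytes
    echo $$ | sudo tee /sys/fs/cgroup/memory/browser/cgroup.procs   # this shell and its children
    google-chrome &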


The elephants in the room are Chromium and Firefox. They turned the job of displaying an HTML page into a CPU- and memory-inefficient nightmare. Displaying 4 KB of information, which is the informational weight of a typical webpage, must not take up 400.000 MB of memory and millions of CPU instructions. Look out for simpler HTML plumbers, e.g. Dillo and w3m, and help refine their table rendering.


At one point we used to brag about using less memory. Common discussions were: "see my memory usage" (while running `free -m` on the CLI), "see, my kernel is just 500 KB, I chose just the right modules", and so on…


I fixed this bug myself... I bought extra RAM. :)


I increased swap to avoid that on my laptop.


I read somewhere that Linus Torvalds recommends setting swappiness (https://en.wikipedia.org/wiki/Paging#Swappiness) to 90 and letting the kernel decide what and when to swap.
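If anyone wants to try it (90 is the value mentioned above; most distros default to 60):

    # temporary, until reboot
    sudo sysctl -w vm.swappiness=90

    # persistent, assuming a sysctl.d-style distro
    echo 'vm.swappiness=90' | sudo tee /etc/sysctl.d/99-swappiness.conf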


I've tried OOM killers and thrash-protect. I've tried numerous tweaks to the vm and swap settings. Nothing works. Memory use gets into the 90%s and the system freezes, hard.

Nonetheless, I'm surprised someone is calling this a bug. Let's face it, Linux is just not a desktop operating system. It's a server operating system, and it expects to be professionally administered and tightly controlled to prevent OOM situations. That OOM situations occur on servers too is beside the point. There are reasons for the Linux memory system to work as it does, reasons Linus will yell at you about if you complain.



