Do we really need swap on modern systems? (redhat.com)
46 points by omnibrain 1 hour ago | 58 comments

I hate swap. My experience with it is that once a disk-backed machine (as opposed to SSD) has started swapping, it's essentially unusable until you manually force all anonymous pages to be paged in by turning off swap ("sudo swapoff -a" on Linux) or reboot.

My hunch is that the OS is swapping stuff back in stupidly. Once memory is available, I'd like it to page everything back proactively, preferring stuff from swap and then from file-backed mmaps. But instead it seems to be purely reactive, each major page fault requiring a disk seek to page in what's needed with little if any readahead. Basically the whole VM space remains a minefield until you stumble over and detonate each mine in your normal operation. Much better to reboot and have a usable system again.

On my Linux systems, I've turned off swap.
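For anyone trying the `swapoff -a` trick, it may be worth checking first that everything sitting in swap can actually fit back into RAM. A minimal Python sketch, assuming the usual `/proc/meminfo` field names (`SwapTotal`, `SwapFree`, `MemAvailable`); the helper names `parse_meminfo` and `swap_fits_in_ram` are made up for illustration:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_fits_in_ram(meminfo):
    """True if the amount currently swapped out is no larger than available RAM."""
    swap_used = meminfo["SwapTotal"] - meminfo["SwapFree"]
    return swap_used <= meminfo["MemAvailable"]


# Sample text in the /proc/meminfo format; on a real Linux box you would pass
# open("/proc/meminfo").read() instead of this hypothetical snapshot.
sample = (
    "MemTotal:        8000000 kB\n"
    "MemAvailable:    3000000 kB\n"
    "SwapTotal:       2000000 kB\n"
    "SwapFree:         500000 kB\n"
)
print(swap_fits_in_ram(parse_meminfo(sample)))
```

If the check fails, turning swap off is likely to push the machine straight into the OOM killer rather than rescue it.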

On OS X...last I checked, I wasn't able to find a way to do this. I'd like to turn off swap entirely, or failing that, have some equivalent way to force all of swap to be paged in now so I don't have to reboot when I hit swap. Anyone know of a way?

Something seems to be seriously wrong with the swap implementation on modern systems.

20 years ago on Windows 98 it just started swapping, but it was no big deal. If something became too slow to be usable, you could just press ctrl+alt+del and kill that swapped program and everything worked fine afterwards.

On the other hand, on my modern Linux laptop, it starts swapping, and it swaps and swaps and you can do nothing, not even move the mouse, until 30 minutes later something crashes.

Isn't this related to this change on kernel 4.10? https://kernelnewbies.org/Linux_4.10#head-f6ecae920c0660b7f4...

Possibly, however since the writeback behavior is configurable I expect you could test that thesis by changing the aggressiveness of the writeback draining.

Could this be a reflection of the increasing gulf between RAM speed and HD speed? Even with NVMe drives, which one probably shouldn't be swapping to anyway, RAM is orders of magnitude faster.

I think, among other things, it has to do with the size of the swap space relative to the speed of the swap device. IME high disk i/o combined with large swap space means swap never fills up and the OOM killer doesn't kick in. On systems with less RAM and swap, OOM conditions were hit much sooner, even with slower disks.

Default settings for dirty ratio and dirty background ratio exacerbate the issue: more data is held onto before it is written. Once the background ratio is hit, the kernel starts flushing in the background, and once the dirty ratio itself is hit, any application writing to disk will block.

With SSDs, disk is not that slow.

SSDs are only ~4x faster than magnetic last I checked. If RAM is 100ns per access, and hd access is down from say, 1ms to 0.25ms, that's still a huge huge gap. 4x isn't even an order of magnitude.

EDIT: see comment below for more accurate numbers.

From the article:

>A typical reference to RAM is in the area of 100ns, accessing data on a SSD 150μs (so 1500 times of the RAM) and accessing data on a rotating disk 10ms (so 100.000 times the RAM).

Thank you for the correction. I should have read more carefully. Still, we're talking 3 orders of magnitude for SSD vs RAM.
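The ratios are easy to check. A small Python sketch using only the three latencies quoted from the article (100 ns, 150 μs, 10 ms); the constant and function names are just for illustration:

```python
import math

# Latencies as quoted from the article, all converted to nanoseconds.
RAM_NS = 100            # ~100 ns per RAM reference
SSD_NS = 150_000        # 150 microseconds
HDD_NS = 10_000_000     # 10 milliseconds


def slowdown(device_ns, ram_ns=RAM_NS):
    """How many times slower one access is compared with RAM."""
    return device_ns / ram_ns


print(slowdown(SSD_NS))  # 1500.0
print(slowdown(HDD_NS))  # 100000.0

# Orders of magnitude between SSD and RAM (a bit over 3).
ssd_orders = math.log10(slowdown(SSD_NS))
```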

What I've always been specifically confused about, is if there's any point in giving a VM a swap partition inside its virtual disk, rather than just giving it a lot of regular virtual memory (even overcommitting compared to the host's amount of memory) and then letting the host swap out some of that RAM to its swap partition.

Personally, I've never given VMs swap. I'd rather have memory pressure trigger horizontal scaling (or perhaps vertical rescaling, for things like DBMS nodes) than let individual VMs struggle along under overloaded+degraded conditions.

Ah, this is a great idea. It'd also be easier to understand and see service degradation (ie. physical memory being used on the host) directly from something like vCenter instead of relying upon Solarwinds to tell me the host is out of memory.

One usage of swap in modern systems: hibernation. If you need to use hibernation, that means swap must exist, either as a swapfile (pre-allocated, as uswsusp requires a fixed offset on the disk to resume) or as a partition.

I've been reading these stories for ten years. About 8 years ago I started taking them seriously and stopped using swap. Turns out not having swap works much better. I'm amazed how slowly the consensus seems to be moving though.

Systems are used for vastly different purposes, with different memory usages and expected operation.

There can be no consensus because there is no one answer.

Yeah. I've had issues with this on some systems.

On Windows without swap, once you get even remotely low on RAM, things start going really poorly for some reason: random latency. So even with 16 GB of RAM I can't disable swap on Windows without some really strange performance characteristics. I run SSDs, so I really wanted it off, and I just stuffed more RAM in my box; with 32 GB it isn't a problem.

On Linux, however, you can pretty much turn it off and everything will run smoothly until you're actually out of memory. Then you lag badly for a moment, Linux's oom-killer does its thing, and all is good again within the span of a few seconds.

> I've been reading these stories for ten years. About 8 years ago I started taking them seriously and stopped using swap.

Not sure what you're referring to here. This story doesn't recommend eliminating swap...

"Systems without swap can make sense and are supported by Red Hat - just be sure the behaviour of such a system under memory pressure is what you want"

So, it doesn't exclusively recommend it, but it concedes that there are use cases where it makes sense.

Two examples of why I have swap:

* On a laptop to hibernate, which results in zero power consumption vs suspend which will drain the battery in a day or so

* I use tmpfs for /tmp and using swap as the backing is far more performant than regular filesystems

> * I use tmpfs for /tmp and using swap as the backing is far more performant than regular filesystems

This seems absurd. You're running an in-memory filesystem backed by memory-on-disk? You weren't comparing to a journalled filesystem or something like that?

Aren't there legacy applications which expect swap where otherwise with modern applications swap isn't necessary? Or, at least that is my current (mis)-understanding...

This is by far my biggest pet peeve in the space: the "rule of thumb" that you need 2x RAM as swap. Even 10 years ago this "rule" was ancient and useless, but it was always a constant challenge educating customers as to why, and that yes - we really did know better than your uncle Rob.

Once a server hits swap, it's dead. There is no recovering it other than for exceptional cases. If you are swapping out, you've already lost the battle.

I tend to configure servers with 512MB to 1GB swap simply so the kernel can swap out a couple hundred MB of pages it never uses - but that's really more to make people feel better than it really being useful at all.

My desktop at work has 16G of RAM. I didn't bother setting up swap, and I find the old guidance (2x RAM) pretty absurd at this point. I've had the OOM-killer render the system unresponsive a couple of times, but only because I'd written a program that was leaking memory and I was pushing it to misbehave. If you really want virtual memory on purpose, you can still set up a memory-mapped file for your big data structure.

Putting spinning-rust-backed swap on a 16G system is absurd. By the time such a system is into swap, it probably isn't trying to swap three or four megabytes, it's probably trying to swap three or four gigabytes, and that can literally take hours. Simply writing that much data to a hard drive can take a non-trivial amount of time, and swap doesn't generally just cleanly run out to the hard drive with nothing else interfering, it's a lot messier. Given the speeds of everything else involved, a 16GB RAM system trying to swap to a hard drive, even a good one to say nothing of those slow-writing SMR hard drives [1], is basically a system that has completely failed and it might as well just start OOM-killing things.
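A rough back-of-envelope for the "hours" figure, as a Python sketch. The assumptions are mine, not the commenter's: 4 KiB pages, every page costing one random access at the article's 10 ms rotating-disk latency, and a 100 MB/s sequential rate for comparison. Real swap traffic sits somewhere between the two extremes.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def random_io_seconds(total_bytes, seconds_per_access=0.010, page_size=PAGE_SIZE):
    """Worst case: every page needs its own random disk access."""
    pages = -(-total_bytes // page_size)  # ceiling division
    return pages * seconds_per_access


def sequential_seconds(total_bytes, bytes_per_second=100e6):
    """Best case: one clean sequential stream at an assumed 100 MB/s."""
    return total_bytes / bytes_per_second


four_gib = 4 * 1024**3
hours_random = random_io_seconds(four_gib) / 3600
print(f"random 4 KiB I/O: {hours_random:.1f} hours")
print(f"sequential:       {sequential_seconds(four_gib):.0f} seconds")
```

Under these assumptions 4 GiB is about a million pages, which works out to roughly three hours of pure seeking versus well under a minute of sequential writing.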

A system backed by an SSD does degrade more nicely, though. The system visibly slows down but doesn't go to outright unresponsive like it does on a hard drive. You can make a case for letting that happen and having human intervention select the processes to kill, rather than letting the kernel do it. So, even though it still isn't really useful as an extension of RAM, it can still be useful in recovering from systems that you've run yourself out of memory on. Since putting an SSD in my systems I've actually gone back to running with some swap space. Though the fact I like hibernation sometimes is also a reason I run with swap in Linux on my laptop.

[1]: Swap will almost certainly completely blow out the buffers on those things and you'll be stuck with the raw hardware write speeds pretty quickly.

I don't have swap either. On 8GB it is pretty annoying, because a program I often use frequently overcommits and the system hangs.

Is there any way to tell the OOM killer which program to kill first?

The fun OOM analogy [1] that comes up when people propose different OOM killer designs:

> An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

[1] https://lwn.net/Articles/104185/

From the article: Without swap, the system will call the OOM when the memory is exhausted. You can prioritize which processes get killed first in configuring oom_adj_score.

>Is there any way to tell the OOM killer which program to kill first?

From TFA:

>Without swap, the system will call the OOM when the memory is exhausted. You can prioritize which processes get killed first in configuring oom_adj_score.

The linked solution document is only available to registered RH users, though, and the name is actually oom_score_adj and not oom_adj_score.

`man 5 proc` has details, but tl;dr is set /proc/<pid>/oom_score_adj to -1000 to make a process OOM-killer-invincible.
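A minimal Python sketch of what that looks like, assuming the documented range of -1000 to 1000 for `/proc/[pid]/oom_score_adj`. The function name `set_oom_score_adj` and the `proc_root` parameter are hypothetical, the latter only there so the function can be tried against a scratch directory. Positive values make a process a more likely victim, which is closer to what the question asked for; lowering the value generally needs elevated privileges.

```python
import os

OOM_SCORE_ADJ_MIN = -1000  # process is exempt from the OOM killer
OOM_SCORE_ADJ_MAX = 1000   # process is the preferred victim


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an OOM score adjustment for `pid` and return the path written.

    `proc_root` is an illustrative parameter, not part of any real API.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

For example, `set_oom_score_adj(os.getpid(), 500)` would ask the kernel to prefer killing the current process, and `-1000` is the "invincible" setting mentioned above.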

Use earlyoom: https://github.com/rfjakob/earlyoom

By default, it'll start killing processes when free memory drops below 10%, though you can configure the threshold. I had the same problem for years, and then I started using earlyoom and I don't have to deal with it anymore.
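To make the 10% idea concrete, here is a Python sketch of the threshold check only. It is not how earlyoom itself is implemented; the function names are made up, and it assumes the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`.

```python
import re


def available_percent(meminfo_text):
    """Percentage of total memory that /proc/meminfo-style text reports as available."""
    fields = dict(re.findall(r"^(\w+):\s+(\d+)", meminfo_text, flags=re.MULTILINE))
    return 100.0 * int(fields["MemAvailable"]) / int(fields["MemTotal"])


def should_act(meminfo_text, threshold_percent=10.0):
    """True once available memory has dropped below the threshold."""
    return available_percent(meminfo_text) < threshold_percent


# Hypothetical snapshot; on Linux you would read /proc/meminfo instead.
low_memory = (
    "MemTotal:       16000000 kB\n"
    "MemFree:          200000 kB\n"
    "MemAvailable:     800000 kB\n"
)
print(available_percent(low_memory))  # 5.0
print(should_act(low_memory))         # True
```

A real daemon would poll this in a loop and then pick a victim, which is the part earlyoom handles for you.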

> I've had the OOM-killer render the system unresponsive a couple of times

Use earlyoom instead of relying on oom-killer.

https://github.com/rfjakob/earlyoom

To quote from the description:

> The oom-killer generally has a bad reputation among Linux users. This may be part of the reason Linux invokes it only when it has absolutely no other choice. It will swap out the desktop environment, drop the whole page cache and empty every buffer before it will ultimately kill a process. At least that's what I think what it will do. I have yet to be patient enough to wait for it.

[...]

> This made people wonder if the oom-killer could be configured to step in earlier: superuser.com , unix.stackexchange.com.

> As it turns out, no, it can't. At least using the in-kernel oom killer.

And earlyoom exists to provide a better alternative to oom-killer in userspace that's much more aggressive about maintaining responsiveness.

I've found you definitely need swap if you don't have 8GB of memory. I personally have nowhere near the amount of patience required to wait for the OOM killer.

I wish we took the path of EROS [0] rather than "RAM and DISK are separate". A lot of problems stem from that incompatible viewpoint of computing. Computer Science is about hiding complexity under layers of abstraction that continually provide safer states and constraints on the things built on top of them. Our abstraction that RAM and DISK are separate is not safer, nor does it provide constraints that are simple to navigate. Thinking about this the other way, where DISK is all you need and memory is just a write-through cache, is much safer in my opinion and leads to some really cool application design.

If RAM and DISK are the same, then writing a file system is just writing an in-memory tree. No need to pull data from the disk, just navigate the tree in your program's memory and pull the blob data out. Want to persist across reboots, protect against power outages, or save user settings? Just set a variable and it'll be there.

The benefits are much greater than the costs.

[0] - https://web.archive.org/web/20031029002231/http://www.eros-o...

The AS/400 (or whatever they call it now) had an approach like that. Everything was on disk and RAM was just a cache of disk. That also meant every "object" had an address and could be accessed by any process with suitable permissions. There are lots of other things they do, with a very different approach than Unix, Windows etc.

Frank Soltis' book is recommended reading: https://www.amazon.com/dp/1882419669/

The challenge with this is that abstracting away disk in a way that isn't horribly leaky is incredibly hard as long as one lets us manipulate individual bits and the other requires us to write whole sectors.

Note that EROS is not providing a write-through cache. It's providing a write-back cache using checkpointing coupled with a journalling capability and ability to explicitly sync data.

So it's leaky: your application needs to know that it needs to structure its writes to memory so that they will make sense if the system comes back up with some of the data missing, and needs to know how to use the journalling functionality.

It can't just act as if it's running in RAM forever.

You might want to investigate Mumps:

https://en.wikipedia.org/wiki/MUMPS

Setting data in memory is the same as setting data on disk, the only difference is the name of the variable:

s X=1 ; store 1 in variable named X, in memory.

s ^X=X ; store 1 in variable named X, on disk.

s X=^X ; load disk to memory

I never understood the rule of thumb where swap space was proportional to the amount of physical RAM. It seems to me it should be the size of your largest expected allocation (system wide) minus the amount of physical RAM or something like that. If you had a nicely configured system and took out half the RAM it doesn't make sense that you'd want less swap space.
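The two sizing rules side by side, as a Python sketch. The 6 GB peak is an assumed example workload, and both function names are made up for illustration:

```python
def swap_by_peak(peak_gb, ram_gb):
    """Swap sized as largest expected allocation minus physical RAM, never negative."""
    return max(0, peak_gb - ram_gb)


def swap_by_rule_of_thumb(ram_gb, factor=2):
    """The traditional rule: swap proportional to RAM."""
    return factor * ram_gb


PEAK_GB = 6  # assumed system-wide peak memory demand

for ram_gb in (4, 2):
    print(ram_gb, swap_by_peak(PEAK_GB, ram_gb), swap_by_rule_of_thumb(ram_gb))
```

Halving RAM from 4 GB to 2 GB raises the peak-based figure from 2 GB to 4 GB, while the rule of thumb drops from 8 GB to 4 GB, which is the mismatch described above.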

Now, desktops can have 32 GB of RAM, but everyone just uses it to run Chrome.

A system that has way more swap than RAM will run out of 'performance is acceptable' way before it runs out of memory.

That was different in the early days, but that was because people accepted worse performance (GC that stops the world for seconds can be better than no GC, even when running a GUI).

Certainly nowadays, if you take out half the RAM, you will want to take out half the processes, too.

But you choose the amount of RAM depending on maximum memory usage. Therefore swap space (being proportional to RAM) becomes dependent on the largest expected allocation too. It wouldn't be wise to build a system with 2 GB of RAM and 4 GB of swap space when you need 6 GB of memory at peaks: such a system would be slooow. It may not be wise to buy 8 GB of RAM when 5 GB is the maximum that might be needed.

The real issue is not the amount of swap but thrashing.

E.g., several large processes sleeping in memory on a desktop would be fine if only one or two are used at the same time. OTOH, clustered nodes well tuned for a single task may not need swap.

In any case, it is a metric for thrashing that should be used to initiate culling.

Swap seems like a nice safety valve. Preferable, I think, to suddenly shutting down an important program in use because it's OOM.

It must depend on the use case, but I prefer that a program about to use swap (usually one where I accidentally allocate a way too big buffer) fail automatically, rather than having to try to use the now unresponsive system UI to kill it.

You are right, it depends. Building Firefox from source needs several gigs of RAM, while normal functioning of my system does not need more than 4 GB. For a couple of years I used swap just for those big /usr/bin/ld processes. Now I have 8 GB of RAM and linking FF or LibreOffice is not an issue anymore.

it never works for me on windows

it just slows my system down to a crawl, requiring me to force a reboot

it probably depends on your hardware

and if i disable the pagefile, windows update stops working and at 75% memory usage it starts panicking and closing programs

All the laptops at my workplace have the minimum storage (Apple...), it becomes frustrating when I open photoshop and almost my entire free space suddenly vanishes while my 16gb of ram isn't even 20% utilised

How do you hibernate with no swap? Do you need a special hibernation partition to write to?

The way I've done it is to create a swap file and set its swappiness to 0 so nothing actually gets paged into it. Hibernation forces the writes, so it will get used on hibernate.

That article takes a system with 2 GB of RAM as an example. For a modern system that is pretty unrealistic; even laptops have more. My system has 12.

I missed the mention of zram. Zram can create ramdisks, and compress them. It can create a compressed swapdisk in ram, basically making your ram last longer in case you really run out of memory. In my experience that is a good alternative to having a bit of swapspace as reserve, as the article recommends.

So, the argument the article makes is:

1. Swap is slow

2. If using swap, your system starts to thrash

3. If thrashing, you can't close programs to free memory

4. If you can't close programs, you have to wait until the task is killed by the OS

5. If you have no swap (or very little), you don't have to wait.

Except with an SSD, swap isn't slow enough to cause that issue. So really this article only seems to apply to servers, not desktops.

Not entirely true, I've had swap thrashing with SSDs on desktops too.

Though it tends to mean you're boned, or going to be waiting a while while all i/o is dedicated to swapping for minutes at a time.

I have 32G in one machine, and 16G in another. I recently moved over to the 16G machine to do my dev work in, and I run a few VMs in it.

I've found myself wanting to upgrade it to 32G ram, but honestly that's about the only use case (besides production servers) where I would ever consider swap, and at that point I consider it a problem of not enough memory rather than swap being necessary.

What about supporting hibernation? Is that possible via a swap file now?

It depends upon what filesystem you're writing it to, but the answer is mostly yes.

Answer: yes

I haven't used swap in years, and more recently I've accompanied that by using earlyoom [0] to start killing processes when RAM usage rises above 90%.

Both changes have made my computers much more usable. Systems should be designed to fail fast when memory is low instead of slowing down.

[0] https://github.com/rfjakob/earlyoom

