
Almost Always Add Swap Space - ashitlerferad
https://haydenjames.io/linux-performance-almost-always-add-swap-space/
======
scottlamb
This is bad advice when the swap is backed by spinning disk rather than SSD,
in large part because, once the problem is over, the kernel is not aggressive
enough about paging things back in.

Let me make it concrete: say 1 GiB of your application's RAM gets swapped out
then is later paged in only as needed (4 KiB pages) with no readahead. Now the
application's VM space is a minefield: there are up to 262,144 times it can
stall for a ~10 ms disk seek (for a total of ~40 minutes). Sequentially
reading in 1 GiB of RAM from disk would take only ~5 seconds.
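The arithmetic checks out directly (the ~10 ms seek time and ~200 MB/s sequential throughput are rough assumptions about spinning disk):

```shell
# Paging 1 GiB back in as random 4 KiB faults vs. one sequential read.
pages=$(( 1024 * 1024 * 1024 / 4096 ))      # 262144 potential faults
random_min=$(( pages * 10 / 1000 / 60 ))    # ~43 minutes of 10 ms seeks
seq_sec=$(( 1024 / 200 ))                   # ~5 s at 200 MB/s sequential
echo "$pages pages: ~${random_min} min random vs ~${seq_sec} s sequential"
```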

Hopefully OSs use some readahead, but I'm sure Linux and macOS don't use
enough. I find a system that has ever swapped to be totally unusable until I
do "sudo swapoff -a" (on Linux) to force everything to be paged in, or just
reboot (on macOS, I haven't found any other way).

Some of this can happen even without swap: the OS will still drop clean file-
backed pages, so unless you mlock() your executables after startup (my
production binary at work does this), you can still have major page fault-
induced latency spikes.

Swap backed by SSD (or compressed RAM) should be more reasonable, but I've had
enough bad experiences with swap on spinning disk that authoritative-sounding
articles that encourage swap without mentioning this problem piss me off.

In contrast, if you don't have swap and run out of RAM, something will die and
get restarted. In many cases this is a much better failure mode than
continuing to run slowly.

~~~
raverbashing
Swap sucks, but OOM killer sucks more. It certainly isn't a great solution,
but it's good for temporary spikes.

You can adjust "swappiness" if it's moving stuff to disk too aggressively.
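For reference, tuning it looks roughly like this (the value 10 is illustrative; the default is 60, and lower means less eager to swap anonymous pages out):

```shell
cat /proc/sys/vm/swappiness                                    # current value
sudo sysctl vm.swappiness=10                                   # until reboot
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swap.conf  # persistent
```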

~~~
scottlamb
> Swap sucks, but OOM killer sucks more. It certainly isn't a great solution,
> but it's good for temporary spikes.

I disagree. The OOM killer usually kills the right thing, and if it doesn't a
quick restart is usually better than continuing to run slowly.

You can also just put things in containers and set per-container limits.
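A sketch of the per-container-limit approach, using a transient systemd scope (MemoryMax/MemorySwapMax need cgroup v2; `./my-server` is a made-up binary):

```shell
# Run a command under a transient cgroup with a hard memory cap; when
# the cap is exceeded, the kernel OOM-kills within this cgroup only.
sudo systemd-run --scope -p MemoryMax=512M -p MemorySwapMax=0 ./my-server
```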

> You can adjust "swappiness" if it's moving stuff to disk too aggressively.

What I want is to adjust un-swappiness: aggressiveness of paging stuff back
in. Once the memory spike is over, it should page everything back in
automatically.
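Lacking such a knob, the blunt workaround is to cycle swap by hand, which forces everything resident again (it stalls while reading swap contents back from disk):

```shell
sudo swapoff -a && sudo swapon -a
```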

~~~
mchristen
The OOM killer is just fine for stateless server applications, but pretty much
anything else and it's a disaster waiting to happen.

Containers don't really solve your problem; they just add a layer of
indirection between the processes hogging RAM and the OOM killer, which in my
experience adds even more uncertainty about what the OOM killer is going to do.

~~~
scottlamb
> The OOM killer is just fine for stateless server applications, but pretty
> much anything else and it's a disaster waiting to happen.

I won't say there's no circumstance in which the OOM killer is worse, but in
general I disagree with you.

A stateful server application: maybe a standard ACID DBMS like PostgreSQL. I'd
much rather it get killed (failing whatever transactions were in progress) and
restart in a healthy state than continue to run slowly. Here I'm relying on a
correct implementation of Durability, but if you don't have that, you have
other problems.

Even more so for a quorum-based database server in which the leader dying (and
another taking over) is much better than continuing to run slowly.

And yes, even more so in stateless servers, particularly when there are
several of them. Going down and restarting quickly (a brief loss of capacity,
a brief spike of errors if the client doesn't retry) is a lot better than
continuing to be slow (a sustained loss of capacity, possibly leading to
cascading overload; and user-visible latency problems if there's no hedging).

For something like a desktop application, it's debatable, but personally I'd
still rather endure the up-front nuisance of having to restart it than have
the OS try to paper over it until I finally realize why it's so painful and do
something about it. (Insert anecdote about frog boiling here; I hear frogs
actually do jump out before they boil, yet somehow the story still rings
true, if you know what I mean.)

~~~
mchristen
Databases are just one kind of stateful server application, though, and yes,
for that use case I would agree with you: relying on its underlying durability
is a good thing, because it's built with that in mind.

Instead I would point out the ubiquity of things like nodejs, which makes it
all too easy for under-experienced developers to shoot themselves in the foot
with a chaingun with respect to data design and management. It's not really
the internal state of the nodejs process itself, it's the external state that
is inevitably being manipulated by the node process.

~~~
scottlamb
I'd rather go full chaos monkey and shake out such bugs (/design flaws) than
try to avoid them.

~~~
mchristen
Completely agree there.

Sometimes that luxury doesn't exist, though, and you just have to get shit
done and hope playing fast and loose doesn't come back to bite you.

------
mason55
I was hoping this post was going to say something like turning on some small
amount of swap will allow the OS to cache some things that normally it
wouldn't cache at all, and so you get some improvement there. What's actually
in this post doesn't make any sense to me. If I have enough RAM why do I want
things to swap to begin with? The author never explained why putting rarely
used pages in swap will improve performance if I have adequate RAM.

And using swap as some kind of "RAM emergency cover" does not make any sense
to me. Personally, I have never had a case where I could gracefully recover a
host that had started swapping.

First, my production servers are not doing many other things besides being
production servers. It's not like they are running a bunch of unnecessary
services, and if I kill some then my application can recover. If a process is
out of control then it's almost assuredly something important that I'm going
to have to kill anyway.

Second, I find it's much harder to detect degraded performance than it is to
detect a dead process. It's very, very easy to have a health check that will
detect a host that has stopped listening and drop that host from the LB. And
alerting on that scenario is very easy as well. The alternative is a host
that's operating in a degraded state, which I need to detect with more
sophisticated health check + alerting, and in the end my resolution is just
going to be to kill everything anyway.

In a properly designed HA environment the loss of a host should be no big
deal. Architecture should be focused on making sure a host goes down ASAP if
it's having problems, not letting it survive in some kind of zombie state.

~~~
deathanatos
> _If I have enough RAM why do I want things to swap to begin with? The author
> never explained why putting rarely used pages in swap will improve
> performance if I have adequate RAM._

The memory freed from a page that got pushed to swap might be better used for
serving disk cache, for example.

If you _don't_ have enough RAM (even for spikes), a preemptive push when
things aren't busy might mean that when they do become busy, you won't have to
wait.

At least, that's the theory as I understand it. Whether or not it works _in
practice_, IDK.

> _And using swap as some kind of "RAM emergency cover" does not make any
> sense to me. Personally, I have never had a case where I could gracefully
> recover a host that had started swapping._

I have successfully recovered hosts that began swapping, and even hosts that
filled their swap. But it's beyond painful, and in most of the cases where I
have a host doing such swapping, an OOM kill would have been a welcome
reprieve.

> _Second, I find it's much harder to detect degraded performance than it is
> to detect a dead process._

Agreed, and restarting something that's been OOM'd is simple enough too. Our
hosts tend to stop reporting metrics when they start swapping, simply b/c so
little is actually getting done, so they get labelled as "down". Linux also
exposes a metric called "major page faults" (IIRC, it's "pgmajfault" in
/proc/vmstat) that records the number of major (required service from disk)
page faults; if the rate of that value is too steep for too long, that ≅ swap
thrashing.
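A crude way to watch that counter (the field name is as it appears in /proc/vmstat on Linux; the one-second sample window is arbitrary):

```shell
# Sample the cumulative major-page-fault counter twice and print the rate.
a=$(awk '$1 == "pgmajfault" {print $2}' /proc/vmstat)
sleep 1
b=$(awk '$1 == "pgmajfault" {print $2}' /proc/vmstat)
echo "major faults/sec: $(( b - a ))"
```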

~~~
mnw21cam
The key point is that if you have no swap, and you run low on memory, then the
OS has no choice but to throw away stuff that is clean. This includes stuff
that is being actively used right now, like the code that is in your
processes. So your system will still thrash, even though it has no swap,
because it is trying to load pages of running code in on-demand, while
constantly throwing them away again because it doesn't have enough space. This
will incapacitate a server even more thoroughly than if there is swap, and the
working set is larger than physical memory. The more RAM a server has, the
more severe this incapacitation will be.

For this reason, you want an OOM killer to kill stuff way earlier than when
you actually run out of RAM, because if you ever reach that point it is too
late. I use a program called earlyoom, which has saved my server from having
to have the big red button pushed quite a few times, when fumble-fingered PhD
students "accidentally" consume all the RAM in their pet projects.
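For reference, an illustrative earlyoom invocation (the thresholds are examples; check earlyoom's own docs for current flags):

```shell
# Start killing the largest process once free RAM and free swap both
# drop below 10%, instead of waiting for the kernel OOM killer.
earlyoom -m 10 -s 10
```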

~~~
scottlamb
> I use a program called earlyoom, which has saved my server from having to
> have the big red button pushed quite a few times, when fumble-fingered PhD
> students "accidentally" consume all the RAM in their pet projects.

Along those lines, take a look at this: [1]

> Pressure stall information were added to Linux 4.20 as a way to quantify
> resource pressure in the system in a better way than the traditional load
> average. PSI aggregates and reports the overall wallclock time in which the
> tasks in a system (or cgroup) wait for cpu, io or memory.

> This release [Linux 5.2] lets users configure sensitive thresholds and
> use poll() and friends to be notified when a certain pressure threshold is
> breached within a user-defined time window. With this mechanism, Android can
> monitor for, and ward off, mounting memory shortages before they cause
> problems for the user. For example, using memory stall monitors in
> userspace like the low memory killer daemon (lmkd) can detect mounting
> pressure and kill less important processes before the device becomes visibly
> sluggish. In memory stress testing, psi memory monitors produce roughly 10x
> fewer false positives compared to vmpressure.

I haven't tried it yet, but it sounds promising.

[1]
[https://kernelnewbies.org/Linux_5.2#Improved_Presure_Stall_I...](https://kernelnewbies.org/Linux_5.2#Improved_Presure_Stall_Information_for_better_resource_monitoring)
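A rough sketch of the interface (paths and trigger format per my reading of the kernel's Documentation/accounting/psi.rst; the thresholds are illustrative):

```shell
# Current memory pressure (kernel 4.20+ with CONFIG_PSI):
cat /proc/pressure/memory
# Register a trigger (5.2+): notify when tasks are stalled on memory for
# >=150 ms total within any 1 s window (values are microseconds). A real
# monitor must keep the fd it wrote open and poll() it for POLLPRI.
echo 'some 150000 1000000' | sudo tee /proc/pressure/memory
```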

------
tutfbhuf
> The Linux Kernel will move memory pages which are hardly ever used into swap
> space to ensure that even more cachable space is made available in-memory
> for more frequently used memory pages (a page is a piece of memory).

Why should I care if I always have enough RAM? Gedankenexperiment: say you
had infinite RAM; would swap make any sense?

~~~
vectorEQ
What if an application contains a memory leak? It could hypothetically use up
your infinite RAM, and Linux will then start randomly killing stuff to try to
recover. Adding swap allows you to trigger on memory issues / overload before
random stuff gets killed: you could then kill the offending process yourself,
prompted by some memory usage monitor or whatever, instead of running the
risk of other critical processes being killed.

~~~
saalweachter
Well to start, slap default resource limits on your processes so they can't
use all the RAM in the world. Then your leaky processes will crash when their
mallocs fail and everyone will be as happy as they deserve.
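In shell terms, a sketch of that (bash's `ulimit -v` sets RLIMIT_AS, in KiB; `./leaky-program` is a made-up binary):

```shell
# Cap the subshell (and its children) at 2 GiB of address space; malloc
# in the child starts failing once the cap is reached.
( ulimit -v $(( 2 * 1024 * 1024 )); exec ./leaky-program )
```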

~~~
scottlamb
I agree with the idea, but a nit: on 64-bit Linux with the default
vm.overcommit_memory=0, their mallocs will likely continue to succeed. Those
just expand the address space, which is probably not what your resource limits
are controlling. [1] The actual limit [2] is hit when they try to use the
virtual address space, causing a minor page fault (when it's backed by a zero-
filled page of physical RAM).

[1] You can add in such a limit via RLIMIT_AS, but it'd have a lot of
collateral damage like preventing LMDB from mmap()ing a big database. I don't
recommend this.

[2] I'd recommend setting limits on a cgroup; with a systemd unit file, see
MemoryMax= and friends in the systemd.resource-control(5) manpage.
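As an illustration of [2], a hypothetical drop-in for a service named `myapp` (the unit name and limits are made up):

```shell
sudo mkdir -p /etc/systemd/system/myapp.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/myapp.service.d/memory.conf
[Service]
MemoryMax=2G
MemorySwapMax=0
EOF
sudo systemctl daemon-reload && sudo systemctl restart myapp
```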

------
brylie
I have manually added swap on SSD VPS nodes that come without it by default. I
was under the impression that the hosting providers were trying to preserve
the hardware. Is swap space still bad for SSD drives?

------
parliament32
It's not quite as black and white as the author makes it out to be. I've
worked on (specialized) systems with 512 GB of RAM and a 180 GB SSD
installed... No, you're not going to see any performance improvement at all by
adding swap to the already insane amount of RAM, and if you run into an OOM
issue where swap would be your parachute, you're already fucked.

------
Waterluvian
I have 32GB Ram on Ubuntu 18.04. Can I just turn swap off? If I do, what's the
worst that will happen?

~~~
serf
in a sane OS? Nothing until you run out of ram.

In a Microsoft OS? Depends on the app -- some are dependent (many games) on
having swap available (and may err silently or under false pretenses without
one).

I've found that a _lot_ of MMO style games include cheat detection systems
(see : rootkit) that rely on swap, and fail without it. As to why they do
that, I'm unsure.

~~~
Waterluvian
Fascinating. Maybe just as a part of the fingerprinting mechanism?

------
LargoLasskhyfv
Swap is good. If it is in RAM. Just use zram. Like so:

[https://github.com/armbian/build/blob/master/packages/bsp/co...](https://github.com/armbian/build/blob/master/packages/bsp/common/usr/lib/armbian/armbian-zram-config)

zramctl output from a tiny NanoPiNeo2 with 1GB Ram, running from a 64GB
Sandisk SDXC card:

    NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
    /dev/zram1 lzo         496.7M   4K   76B   12K       4 [SWAP]
    /dev/zram0 zstd           50M 2.5M  449K  832K       4 /var/log

Edit: Formatting (useless)
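The gist of what such a script sets up, boiled down (the size and algorithm are illustrative; zramctl is from util-linux):

```shell
sudo modprobe zram
dev=$(sudo zramctl --find --size 512M --algorithm zstd)  # allocates /dev/zramN
sudo mkswap "$dev"
sudo swapon --priority 100 "$dev"   # prefer zram over any disk swap
```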

------
LinuxBender
Almost never add swap. [1]

[1] -
[https://github.com/ohdns/sysctl_and_thp_test](https://github.com/ohdns/sysctl_and_thp_test)

------
yellowapple
One thing the article doesn't mention for those of us who run Linux on laptops
(There are dozens of us! Dozens!) is that having a swap partition at least as
big as one's physical RAM (if not bigger) is mandatory for hibernation. Linux
(like any OS kernel that supports hibernation) needs to save the current
contents of RAM somewhere, and (last I checked) it'll only work with swap
partitions (swap files don't work for some reason).

------
zzzcpan
I agree that these days you should still use swap, but it should always be on
zram, or on zswap if you can't use zram for some reason (too little RAM, can't
load the zram module, or something like that). Raw swap on disk is just not
acceptable anymore, for performance, security, and wear-out reasons.

~~~
gnode
> performance, security, wearout reasons

If memory pressure is low, and the swap is used to page out rarely / never
used pages in favour of more disk cache, then there is a performance gain to
be had, and wear should be minimal.

You can encrypt swap with a randomly generated key to aid security.
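For example, via /etc/crypttab with a throwaway key generated at each boot (the device path is a placeholder):

```shell
# /etc/crypttab:
#   cryptswap  /dev/sdX2  /dev/urandom  swap,cipher=aes-xts-plain64,size=256
# /etc/fstab:
#   /dev/mapper/cryptswap  none  swap  sw  0 0
```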

Ideally memory compression and disk swap would be used together; there's a
gain to be had by shrinking lesser used pages, but no sense in taking up any
space in RAM for pages showing no sign of life.

~~~
zzzcpan
More vfs cache for swap space is not a trade off here, but rather a mostly
incorrect oversimplification. The value of more cache is very non linear, it
depends on the workload and usually only matters when nearly all of the disk
I/O is already exhausted, the exact situation you want to prevent by using
more of the same disk I/O for swap. Essentially trading more cache for abrupt
performance degradation in terms of latency and responsiveness. I mean, sure,
if you can't use zram, say you are on a system that's a bit behind on
performance features, like freebsd, very light swappiness might be ok, might
free up some of very unused memory and just not be aggressive enough to swap
something that causes latency, responsiveness problems, although this is not
certain. With memory compression you can swap and compress much more of it
more aggressively and be certain that no abrupt performance degradation can
happen. It feels like having more ram that's gradually affecting tail latency
the more you use it.

~~~
Dylan16807
Zram is nice when you're barely using it, but every time I've seen it get low
on memory it degrades quite abruptly, much worse than ssd swap.

------
paulcarroty
Swap doesn't make much difference; it's just a parachute for your apps: it
gives some guarantee of getting work done, but makes things slower and eats
into your SSD's lifespan.

I prefer to add as much RAM as my tasks need.

~~~
mnw21cam
Without swap, your server could potentially become suddenly unresponsive, as
memory pressure forces all your running code out. The system will still be
thrashing, even worse than if it did have swap. If you do have swap, then you
get a slightly more graceful degradation, and hopefully a chance to fix the
problem first.

What you really want is to never have a severe out of memory situation, and
you can do that either by getting enough RAM that you never use it all, or by
killing stuff nice and early when memory starts getting low.

~~~
paulcarroty
> Without swap, your server could potentially become suddenly unresponsive

A server will never become unresponsive because of a lack of swap. I don't
like the logic of "we just add swap and maybe it will be a bit slower, but
still working".

------
IHLayman
For another perspective, the Kubernetes community has had a long discussion
about using swap space over the course of the last 2 1/2 years, as you can see
in this feature thread:
[https://github.com/kubernetes/kubernetes/issues/53533](https://github.com/kubernetes/kubernetes/issues/53533)
TL;DR: As of version 1.8, any system running k8s has to disable swap space or
formally mark in the kubelet that you are turning the check for it off. There
are good arguments on both sides of this issue, and the discussion is still
continuing.

------
everybodyknows
OP serves up animated ads directly from the origin URL -- unblockable by
umatrix. Very sophisticated indeed.

------
lone_haxx0r
Is it true that swap reduces the life expectancy of SSDs by writing too
frequently?

~~~
JustSomeNobody
I think this used to be the case with older SSDs, but newer ones are able to
survive much longer.

------
sgt
Learned about atop. Wish it existed on macOS too.

~~~
IHLayman
Also check out glances:
[https://nicolargo.github.io/glances/](https://nicolargo.github.io/glances/) A
good terminal display overview of processes, network, disk, and battery.

------
sys_64738
SSD V Spinner? Metal V VM?

