
Linux Performance: Why You Should Almost Always Add Swap Space - ashitlerferad
https://haydenjames.io/linux-performance-almost-always-add-swap-space/
======
mnw21cam
Swap engineering is definitely way more complicated than the article lets on.

What people often don't realise is that Linux distinguishes between dirty and
clean memory: dirty memory has no up-to-date copy on disc, clean memory is
already stored somewhere on disc, and _program code is counted as clean memory
that just happens to live somewhere other than swap_.

Therefore, under memory pressure (especially if you set swappiness to zero),
you will be preferentially swapping out your program code (because it is
always clean) in preference to your program data. If you have _no_ swap, then
this is what causes the system to grind to a halt when the RAM is full: all
your program code gets discarded from RAM, and nothing can run without reading
it from disc again.

The recommendation to have a little bit of swap is absolutely fine. However,
as the amount of RAM in your system increases, the penalty for running out of
RAM increases as well. On larger systems (for example 256GB RAM), I recommend
using something like EarlyOOM[1] to kill off tasks before the pathological
swapping case occurs. Otherwise, you could end up with an unresponsive system.
If you have lots of RAM, the kernel OOM killer waits far too late, and the
system is already unresponsive.

[1] [https://github.com/rfjakob/earlyoom](https://github.com/rfjakob/earlyoom)
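
A minimal sketch of the EarlyOOM approach (package and unit names vary by
distro and release; thresholds can be tuned via its command-line flags, see
the project's README):

    apt install earlyoom               # or: dnf install earlyoom
    systemctl enable --now earlyoom    # kills the largest offender before the kernel stalls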

~~~
borman
> Therefore, under memory pressure (especially if you set swappiness to zero),
> you will be preferentially swapping out your program code (because it is
> always clean) in preference to your program data.

Regarding systems with near-100% utilization, wouldn't it be better advice to
pin all executable code to memory via tmpfs (at the system level) or
mlockall() (at the application level)?

Encountered this case on some heavily-loaded batch data processing workers: at
some point, client programs would slow down and generate random disk I/O
(which in turn was detected by monitoring and throttled). It turned out that
processes were taking major page faults at random moments, when their
executable code pages were discarded from memory and subsequently re-read from
disk.
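
A minimal sketch of how that symptom can be spotted (standard procps tooling;
the `sar` line assumes sysstat is installed):

    # processes that have taken the most major page faults (pages re-read from disk)
    ps -eo pid,maj_flt,comm --sort=-maj_flt | head
    # system-wide major fault rate over time
    sar -B 5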

~~~
mnw21cam
I'd love to have a configuration option on Linux that elevates the penalty for
evicting executable pages, or just makes them non-discardable. I think it
would help a lot.

------
nisa
This is dangerous advice. At least on Ubuntu 16.04 with the stock kernel,
running into swap due to memory pressure kills the system: basically
everything times out and you can only hard-reboot. Without swap you get an
OOM kill, which is at least something you can recover from. If you need swap,
just add a little, like 1-4GB, because if you have spinning disks 64GB of swap
won't help you at all; every request to swap is orders of magnitude slower.
It might be different with NVMe, but it's still noticeable.

~~~
tfha
Not just Ubuntu. We tell our engineers to remove their swap partitions because
sometimes the test suite will consume all remaining memory and push into swap.
If that happens, pretty much the only thing you can do is a hard poweroff. If
you're fast, you can Ctrl+C and only lose a minute or two, but often you're
just completely stuck.

Much better to OOM and have the test suite killed early.

~~~
cosarara97
Can ulimit help here?

~~~
AstralStorm
Ulimit is per process, it won't do. vm.overcommit_memory=2 with some setting
of vm.overcommit_ratio or bytes will help though.
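
A minimal sketch of that setup (the ratio is only an example; pick it per
workload):

    sysctl -w vm.overcommit_memory=2   # strict accounting: allocations fail instead of overcommitting
    sysctl -w vm.overcommit_ratio=80   # commit limit = swap + 80% of RAM
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo   # check the resulting limit and current usage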

I'd say overcommit heuristics break applications and cause them to eat too
much memory, since they don't know when to stop. The only trouble is KVM,
which for some reason takes double the process space allocated to the VM for
no good reason, and perhaps memory-intensive Java.

You need to tune the latter anyway.

------
kevin_nisbet
In my experience, engineering swap usage is more complex than this
article lets on. While adding more swap may make sense for many systems, I've
personally worked on several systems where disabling swap made a huge
difference. It's just one of those things where there isn't a hard rule.

One of the issues I've encountered more than once with Java software is what
appears to be longer-than-expected pauses in the GC, which I've linked to swap
usage. The theory is that if you have a reference that lives long enough to
get promoted to an old generation, but is then released, the memory in
question won't be used for quite some time. The kernel can notice that those
pages aren't being touched and swap them to disk, but when it's time to do GC
the process is frozen while the collector runs, and work gets backed up.
Depending on the workload this might be fine; however, both platforms where I
encountered this issue were real-time applications which were engineered to
have enough memory available for the application (and did, and also had no
benefit from using the freed memory for I/O cache). So in this case, disabling
swap gave us a real and measurable performance boost, getting rid of some of
the variability in our GC runtime and pauses.
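
A minimal sketch of the blunt fix described here, plus a narrower per-service
variant (the per-unit directive assumes cgroup v2 and a reasonably recent
systemd):

    swapoff -a   # disable all swap now; remove swap entries from /etc/fstab to make it stick
    # or keep swap globally but forbid it for one latency-sensitive service by
    # setting MemorySwapMax=0 in that unit's [Service] section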

What counts as acceptable under low-memory conditions is also subjective. If
I'm running a distributed, highly available system doing message routing in a
telco environment, I would rather have the OOM killer kill the process and
fail over to a backup than have the system up but unresponsive because it's
spending the majority of its time swapping to/from disk. Again, this totally
depends on the platform.

------
grahn
> As a last resort, the Kernel will deploy OOM killer to nuke high-memory
> process(es)

Yes! That is exactly what I _want_ to happen!

When the system runs out of RAM, things will generally stop functioning, swap
enabled or not. The only question is _how_ you want it to stop functioning
when that happens.

In almost every situation, I'll easily take the kernel killing whatever single
process it thinks is most appropriate to get rid of, and keeping everything
else up and running smoothly, over grinding _the entire system_ to a halt by
upping the effective memory access time by orders of magnitude.

Simply put: If everything doesn't fit in memory, then don't try to run
everything!

Properly designed software nowadays is designed to be able to crash without
corrupting data. As far as I'm concerned, it is almost always preferable to
kill and restart instead of giving CPR to processes that don't fit in the
working memory.
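
For what it's worth, which process the kernel picks can be steered; a minimal
sketch ("batch-worker" is a made-up name):

    echo 500   > /proc/$(pidof -s batch-worker)/oom_score_adj   # prefer killing this one
    echo -1000 > /proc/$(pidof -s sshd)/oom_score_adj           # never pick this one
    # or persistently, per systemd unit: OOMScoreAdjust=500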

~~~
parenthephobia
I don't want this to happen.

On servers, in almost every situation, I'd rather have the server reboot than
have some random process crash and leave the system in a potentially broken
state.

On desktops, in every situation, I'd rather have X, my window manager and an
emergency terminal pinned to RAM so I can always decide what to kill for
myself.
([https://github.com/stiletto/angrymlocker](https://github.com/stiletto/angrymlocker)
helps with setting this up.)

~~~
Dylan16807
> I'd rather have the server reboot than have some random process crash and
> leave the system in a potentially broken state.

Set the OOM killer to trigger a reboot? Crawling to a swapping halt is the
worst of both worlds, it's like a full system crash but the server never comes
back.
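
If that's the preference, the kernel can be told to do exactly this; a minimal
sketch:

    sysctl -w vm.panic_on_oom=1   # panic on OOM instead of picking a victim
    sysctl -w kernel.panic=10     # reboot 10 seconds after a panic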

~~~
AstralStorm
Set the overcommit policy and malloc will just fail, which is handled in
MySQL and will abort the query instead of invoking the OOM killer.

And a crashing service should get restarted, which involves journal file
recovery to a consistent state.

~~~
Dylan16807
> Set the overcommit policy and malloc will just fail, which is handled in
> MySQL and will abort the query instead of invoking the OOM killer.

So many things depend on so many gigabytes of overcommit that this seems like
a pretty bad way to go about it in the general case.

> And a crashing service should get restarted, which involves journal file
> recovery to a consistent state.

I think you're agreeing with me with this sentence?

------
bryanlarsen
Kubernetes forces you to start it with a flag to tell it "I know what I'm
doing, I know this machine has swap, but start Kubernetes anyway even though
it's normally a bad idea to run Kubernetes on a machine with swap".
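
For reference, a sketch of the override in question (the kubelet refuses to
start on a host with swap enabled unless told otherwise):

    kubelet --fail-swap-on=false ...
    # or the equivalent field in a KubeletConfiguration file: failSwapOn: false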

------
hannob
I've wondered about this for a while; unfortunately, a major question I have
remains unanswered in the text: what about SSDs and SSD-only systems? (Like...
any average laptop.) Should I be worried that putting swap on an SSD will
cause it to wear out quickly due to many write cycles?

~~~
sevensor
I used to work in flash manufacturing. We would joke, "I hope nobody ever uses
this for swap!" The numbers I saw for endurance (basically write-erase cycles
before failure) were not encouraging at all. As far as I can tell the only
reason MLC flash works at all is that the controller does some magic to
present a whole bunch of pretty flaky cells as a single reliable unit.

~~~
warrenm
>I used to work in flash manufacturing

How long ago? r/w cycles haven't really been an issue on SSDs in a long time

~~~
sevensor
2010\. I've been following the industry since I left, and I promise you
neither the physics of FN flash nor the manufacturing process has
fundamentally changed since then. Nor have the true endurance numbers
substantially improved. If anything the situation is worse now with smaller
device dimensions. The only thing standing between your collection of cat gifs
and oblivion is the flash controller making intelligent guesses about what it
just read.

~~~
warrenm
Sorry, but you must not be "following the industry" very well.

SSDs have lifespans measured in _years_ of constant use.

If you think it isn't the case, you need to go back and learn more about those
devices you claim to know so much about.

~~~
sevensor
I'm afraid we're misunderstanding each other here. You're talking about the
fully-integrated SSD. Not my area of expertise. I'm talking about the
endurance of the flash cell transistors. Nobody's claiming significantly
higher endurance for the transistors themselves -- what's improved is the
stuff on top of them. SSD controllers have gotten better at not trashing any
given cell with repeated program / erase cycles. If you measure endurance of
modern SSDs versus seven years ago, sure, you'll see better endurance, but
that's because there's better logic on top, not because the cell transistors
are any better than they were.

~~~
detaro
But as a user putting the SSD into my machine I don't care about the low-level
cells, I care about what's in the datasheet of the device I bought?

~~~
sevensor
Fair point. As with eating sausages, so with using flash memory. Maybe easier
to make a rational decision about it if you haven't seen it made.

------
jacquesm
Cargo cult system administration.

No, you should not 'almost always add swap space'. What you should do instead
is tune your system for its intended use.

If you need swap as an 'early warning system' that you're about to run out of
memory, you're already doing it wrong. The OOM killer is a piece of code whose
default settings can be tuned, and the same goes for the virtual memory
manager in the kernel.

[http://www.oracle.com/technetwork/articles/servers-storage-d...](http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html)

[https://access.redhat.com/documentation/en-us/red_hat_enterp...](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-captun)

Select appropriate equivalent for your kernel where applicable, and don't
forget about the per-process limits in ulimit.

This sort of 'almost useless' advice is pretty annoying; it shows that the
writer has an interest in the matter but didn't bother to dig deep enough to
make the advice truly useful.

Also, the comment on that article implying that swap allows you to update a
running process because it is backing the image is wrong: the backing is
simply done through the filesystem, and swap space has nothing to do with it.
The practical upshot is that you're not going to be able to reclaim the space
held by the binary if you remove it, because the 'file is busy', until the
last instance of that process exits.

In the meantime, if you really want to run the newer version under the old
name or path, you're free to unlink it or 'mv' it to a different location and
start the new binary from the default location. On some Unixes you can even
overwrite the binary using an mv command, making it look like the old one has
disappeared, but under the hood it will still be held open; you can check this
using lsof or whatever your local equivalent is to see that the file is indeed
still present. In a pinch (and you really, really have messed up if you ever
need this trick) you can re-link the running image by hard linking to the
inode.
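
A minimal sketch of the 'mv' approach ("mydaemon" is a made-up name):

    mv /usr/local/bin/mydaemon /usr/local/bin/mydaemon.old   # the running process keeps its inode open
    cp mydaemon.new /usr/local/bin/mydaemon                  # new invocations pick up the new binary
    lsof -p "$(pidof -s mydaemon)" | grep mydaemon           # the old image is still mapped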

~~~
warrenm
You should just about always add at least _some_ swap.

Anyone who thinks otherwise hasn't read much or seen very many systems.

start here:

\-
[https://www.kernel.org/doc/gorman/html/understand/understand...](https://www.kernel.org/doc/gorman/html/understand/understand013.html)

\-
[https://www.kernel.org/doc/gorman/html/understand/understand...](https://www.kernel.org/doc/gorman/html/understand/understand014.html)

\- [https://www.linuxquestions.org/questions/linux-kernel-70/why...](https://www.linuxquestions.org/questions/linux-kernel-70/why-is-my-linux-server-using-swap-when-inactive-memory-is-available-4175580818)

\- [https://kerneltalks.com/disk-management/swap-addition-in-lin...](https://kerneltalks.com/disk-management/swap-addition-in-linux)

\- [https://linuxaria.com/howto/linux-memory-management](https://linuxaria.com/howto/linux-memory-management)

\-
[https://kernelnewbies.org/Linux_4.8#head-a0b674ff33af7e3e7b8...](https://kernelnewbies.org/Linux_4.8#head-a0b674ff33af7e3e7b80fcfa266a9d366f094e93)

\-
[https://kernelnewbies.org/Linux_4.11#head-680ed3eee137456d6a...](https://kernelnewbies.org/Linux_4.11#head-680ed3eee137456d6a0665b3191be34d9a17bab2)

\-
[https://kernelnewbies.org/Linux_4.14#head-5327850512ba28763d...](https://kernelnewbies.org/Linux_4.14#head-5327850512ba28763de4c8838246b4145df59033)

\-
[https://help.ubuntu.com/community/SwapFaq#How_much_swap_do_I...](https://help.ubuntu.com/community/SwapFaq#How_much_swap_do_I_need.3F)

\-
[https://access.redhat.com/solutions/15244](https://access.redhat.com/solutions/15244)

\-
[https://serverfault.com/a/684800/2321](https://serverfault.com/a/684800/2321)

\-
[https://serverfault.com/q/329928/2321](https://serverfault.com/q/329928/2321)

\-
[https://serverfault.com/q/25653/2321](https://serverfault.com/q/25653/2321)

\-
[https://serverfault.com/a/825915/2321](https://serverfault.com/a/825915/2321)

\- [https://askubuntu.com/q/184217/3544](https://askubuntu.com/q/184217/3544)

\- [https://antipaucity.com/2011/08/08/why-technical-intricacies...](https://antipaucity.com/2011/08/08/why-technical-intricacies-matter/#.WjfiqlQ-fgo)

I've been in environments ranging from a few dozen individual servers to 100s
of 1000s. In companies ranging from mom-and-pop shops to international
investment banks. Every last one is configured with swap. Because unless you
have done all the profiling to truly _know_ you don't need it, you should be
running it.

~~~
jacquesm
The majority of your links seem to have been chosen simply because they have
the word 'swap' on them somewhere, not because they make the case for why
having 'some swap' is a must.

I wrote a kernel from scratch, besides that I did quite a bit of transaction
oriented stuff and built the server side of a very successful messaging
platform. That does not make me an expert either but I do have some minimal
understanding of what goes on under the hood. If you are simply configuring
swap space on a better-safe-than-sorry basis then I would rather not have you
near a system whose response has to be deterministic because it almost
certainly will fail in some unpredictable way sooner rather than later.

Yes it is difficult stuff, no you do not get a free pass just because you
messed with important stuff on a duct-tape-and-glue basis. See, swap is just a
stay of execution, once a system starts swapping it is as good as out of
control anyway so in the vast majority of cases that I am familiar with you
want that situation to announce itself loud and clear and in a way that gives
you back control. You then do a postmortem and fix it so it will not happen
again. That is far preferable to having systems that remain partially broken
but pretend to be fine.

I know plenty of people run after each other claiming great insight by quoting
the various help-yourself sites but it amounts to absolutely nothing when you
suddenly find yourself trying to figure out why your transaction queues are
overflowing because some system in the pipe suddenly decided 100ms is as good
as 10 as long as things keep flowing.

If on the other hand your systems are not important enough to properly set
them up with static resources committed to long running processes and with
strict limits on what logged-in users can do from the command line, then this
advice is going to fall on deaf ears, because you did not need reliability in
the first place. Note that in many business contexts latency is _far_ more
important than throughput.

~~~
cat199
if swap is 'cargo cult administration', pretending it won't save you a few
times, even with all the safeguards you claim to apply perfectly, is
'narcissist administration'.

if your systems are so advanced that you've done all of this up-front
profiling, you'd _know_ why the system in the pipe decided 100ms is as good as
10, because you'd be monitoring and alerting on memory/swap and would get the
message when it hits 80%-90%, and be logging in while the leaky process is
blowing up into your swap space.

no need for a 'postmortem' if you can have a surgery..

~~~
jacquesm
If you're architecting your systems in such a way that there is a dynamic
component that can cause you to hit swap, then _you could also simply buy that
much more memory_, so it won't save you.

All you need to do is to monitor your memory usage, any growth that you do not
understand is a reason to stop what you are doing immediately and to figure
out what is going on.

A few GB of extra swap space will not save your bacon; it will make it _much_
harder to get the system back under control, because the machine is no longer
responsive, compared to a process being killed and a supervisory process
restarting it immediately and logging a fault.

And if your systems can't handle that you have bigger problems, likely you
then also won't be able to deal with hardware faults, crashes, power failures
and other errors.

A postmortem is far preferable when it is about a process that is re-launched
in a small fraction of a second after which the system is back to normal if
the alternative is a system that causes a whole cascade of stuff to go out of
whack down the line.

Pretty much only Erlang/OTP gets this sort of thing right to begin with.

Errors - including out of memory errors - should be expected and should be
dealt with in a deterministic manner.

~~~
mmjaa
There have been many times, in 30 years, where I've been glad I've got swap
enabled so I can recover a nearly-out-of-control production system in time,
with the right procedures, instead of everything just being killed.

That's enough for me to turn it on. But, also, gigs of RAM is another must-do
cargo-cult thing...

~~~
jacquesm
Did you do root cause analysis and did you make sure that same condition could
never happen again or did you figure that since the swap file saved you that
no further action was needed?

The 'many times' has me worried.

~~~
mmjaa
Many times .. over 30 years. And yes, it was mostly due to bugs in the code,
which I wouldn't have been able to assess if my machine had just OOM'ed
everything. That's the point: swap gives you a little leeway for these
analyses.

------
voidmain0001
Interesting; Jeff Atwood of Coding Horror/Stackoverflow/Discourse forces swap
creation as part of the installation of Discourse as a Docker container. He
prefers slow performance to OOM.
[https://meta.discourse.org/t/create-a-swapfile-for-your-linu...](https://meta.discourse.org/t/create-a-swapfile-for-your-linux-server/13880)

~~~
warrenm
Pretty much all smart sysadmins want swap.

The kernel _expects_ it, and performance is almost always worse without it.

[https://news.ycombinator.com/item?id=15952447](https://news.ycombinator.com/item?id=15952447)

~~~
AstralStorm
Only slightly worse. Technically, the kernel will sometimes swap out live
executable code unless you set swappiness to 0. This makes performance
completely unpredictable under load.

The kernel does not "expect" things; the defaults are just heuristics that
avoid failing malloc past the amount of RAM you have. This causes all of the
mentioned behaviour: apps allocating too large an RSS, for instance, or
deferring GC and heap compaction until too late, causing the kernel to unload
executable code.

And once you really run out of RAM, swap will kill any performance even on a
fast SSD, unless you're using only hugepages.

------
troisdetroie
I've often disabled swap on systems running high-throughput databases like
Elasticsearch and Cassandra, because paging to disk will cause one or several
nodes to slow down, which affects the performance of the whole cluster. The
better thing to do in those cases is to let the node fail right away by
disabling swapping.

In fact, Elasticsearch prefers to be run with `swapoff`:
[https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration-memory.html)
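
A sketch of the options that page discusses (the values are the documented
suggestions, not universal rules):

    swapoff -a                  # disable swap entirely; also remove swap entries from /etc/fstab
    sysctl -w vm.swappiness=1   # or keep swap but make the kernel very reluctant to use it
    # a third option is locking the heap in RAM via elasticsearch.yml: bootstrap.memory_lock: true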

~~~
hyperpape
There's even a name for a slow node stalling an entire cluster:
[https://danluu.com/limplock](https://danluu.com/limplock).

------
mcculley
When running an instance where the biggest process is a garbage-collected
runtime, one does not want any swap at all. If the garbage collector is forced
to walk through all of the pages in swap hunting for live references,
performance is terrible. It is better, then, to have the process see only as
much virtual memory as there is RAM in the system. It is a failure of
Unix/Linux that there is no sensible API for managing real RAM from the point
of view of the process.

------
djb_hackernews
What do the people that use cloud instances that don't have local disks do?
You definitely don't want swap to be on EBS...

~~~
LinuxBender
Allocate 20+% more memory than you could ever need. Use cgroups, lxc, systemd
to constrain applications and containers to specific amounts of memory, cpu,
etc. Properly engineered systems will have enough memory for everything else
under the hood beyond the application. VM and container abstraction does not
negate the requirement to calculate this.
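
A minimal sketch of constraining one workload with cgroup v2 directly (the
group name, limits, and $APP_PID are examples; limits are in bytes):

    mkdir /sys/fs/cgroup/batch
    echo $((4*1024*1024*1024)) > /sys/fs/cgroup/batch/memory.max       # hard cap at 4 GiB
    echo 0                     > /sys/fs/cgroup/batch/memory.swap.max  # no swap for this group
    echo "$APP_PID"            > /sys/fs/cgroup/batch/cgroup.procs     # move the app into the group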

------
zzzcpan
Nowadays Linux has this thing called a zram disk. It compresses data in
memory on the fly. I think general swapping advice would serve most people
better if it said to always add swap on zram first, before swap on SSD or HDD.
The difference is that getting data in and out of zram is so fast that you can
actually use it as a tier of slower RAM, transparently compressed.
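
A minimal sketch of setting one up by hand (size and compressor are examples;
the lz4 line assumes the kernel provides that compressor):

    modprobe zram
    echo lz4 > /sys/block/zram0/comp_algorithm
    echo 4G  > /sys/block/zram0/disksize
    mkswap /dev/zram0
    swapon -p 100 /dev/zram0   # higher priority than any disk-backed swap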

~~~
therealmarv
Is this activated by default today? The last information I have, from Ubuntu
16.04 times, is that it is not activated by default. I know that macOS and
Windows do something similar by default.

~~~
zzzcpan
Some systems have it activated by default; on Ubuntu it was a matter of
installing the zram-config package with its default configuration.

------
mrweasel
I'm not going to disagree with the article, because I haven't done any testing
myself that would suggest that the points made are wrong. I do however find it
amusing to log in to a system with 32 or 64GB of memory and a 2GB swap. It
seems humorous that those 2GB should somehow do anything if you've already
used 64GB of RAM.

------
makz
System tuning is largely dependent on workload and use case. You can't just
give generic advice like this.

------
bandrami
So, wouldn't the _best_ answer clearly be to buy even more RAM, and use a
RAMdisk for swap?

~~~
dingaling
That would be the ideal response, but unfortunately there are cases when it's
not possible.

1\. System is already maxed-out at its board limit

2\. Specific RAM simply isn't available any more from preferred vendors

------
LinuxBender
If you must use swap (you probably don't need to), then at least set
zswap.enabled=1 in the kernel boot options, typically in grub. This will
enable lzo compression of swap in memory so there is less writing to disk.
Some newer kernels use lz4.

Either way, what most folks are actually missing are the correct kernel
settings for the amount of memory they have. Sadly, the kernel does not
dynamically adjust these based on your total amount of memory. Below is based
loosely on Red Hat suggestions. Please note that if you fall below the min
free threshold, the kernel will decide what to do next based on your oom and
panic settings. The settings below will have the kernel free cache and other
memory earlier, so that you do not hit those stalling conditions, and can even
prevent some OOM race conditions. Some might suggest tuned, but use tuned with
caution, or at least read up on everything it does.

    
    
        MEM=`grep ^MemTotal /proc/meminfo | awk '{print $2}'`   # total RAM in kB
        if   [ ${MEM} -gt 1129241478 ] ; then
            sysctl -q -w vm.min_free_kbytes=16384000
        elif [ ${MEM} -gt 564620739 ] ; then
            sysctl -q -w vm.min_free_kbytes=8192000
        elif [ ${MEM} -gt 352887962 ] ; then
            sysctl -q -w vm.min_free_kbytes=4096000
        elif [ ${MEM} -gt 176443981 ] ; then
            sysctl -q -w vm.min_free_kbytes=1024000
        elif [ ${MEM} -gt 88221990 ] ; then
            sysctl -q -w vm.min_free_kbytes=524288
        else
            sysctl -q -w vm.min_free_kbytes=262144
        fi
    

If you have small VM's, then perhaps set the default above to something a
little smaller. You can of course free up about 128MB on default installations
by removing "crashkernel" from your grub config and rebooting.

Then do this regardless, because overcommit set to 0 does not mean off, so we
set the ratio to 0 as well. Overcommit is good for developers testing code and
finding the correct ways to manage memory in their applications during
development.

    
    
        sysctl -q -w vm.overcommit_ratio=0
    

And of course, cache pressure plays into early evacuation of the right cache
based on your usage:

    
    
        ## default is 100 (optimal for file servers).  4000+ for in memory databases.
        ## 10000 means always prefer page cache.
            sysctl -q -w vm.vfs_cache_pressure=1000
    

And if you have people oversubscribing a lot: (adjust based on your memory
capacity)

    
    
        sysctl -q -w vm.admin_reserve_kbytes=131072
        sysctl -q -w vm.user_reserve_kbytes=262144
    

Please do read up on all of these before testing on your test machines. [1]

[1]
[https://www.kernel.org/doc/Documentation/sysctl/](https://www.kernel.org/doc/Documentation/sysctl/)

Then finally, make sure you have Transparent Huge Pages disabled unless you
know for sure you need it. THP can leak a lot of memory and it is nearly
impossible to see without extensive kernel debugging.

In grub, set this and reboot:

    
    
        transparent_hugepage=madvise
    

Or to manually disable THP during run-time,

    
    
        echo -n "madvise" > /sys/kernel/mm/transparent_hugepage/enabled
        echo -n "never" > /sys/kernel/mm/transparent_hugepage/defrag
    

Then restart your applications. THP defrag can also cause stalling and lag
spikes, especially in large memory java deployments (it will look like FGC's)
and in MongoDB, Cassandra, others.

If you did this manually, stop your apps, flush cache, compact memory, then
start your apps.

    
    
        sync;sync;sync
        echo 3 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
    

Some will say 3 sync's is not required. This is mostly true, but some old raid
controllers treat this differently.

~~~
warrenm
You probably _do_ need swap.

Start here to know why:
[https://news.ycombinator.com/item?id=15952447](https://news.ycombinator.com/item?id=15952447)

~~~
LinuxBender
The reason those articles suggest swap is the settings I described above,
which most people are missing. The kernel is not evacuating cache early enough
and it gets wedged. Kernel devs even argue among themselves about this. A
properly engineered system would never need swap, that much is for certain.

And if you must use it, then at least know when you need to encrypt your swap.
If you have customer data in memory that is encrypted at rest, then you must
encrypt your swap.

Some people use crypttab for this, but I think that is a mistake. Rather,
people should have a swap volume or partition and, on each system boot, run
cryptsetup with a long randomly generated password, then mkswap -f and swapon
it.
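
A minimal sketch of that boot-time sequence (the partition name is a
placeholder; plain mode here falls back to cryptsetup's default cipher):

    cryptsetup open --type plain --key-file /dev/urandom /dev/sdb2 cryptswap
    mkswap -f /dev/mapper/cryptswap
    swapon /dev/mapper/cryptswap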

Most companies have policies about encrypting customer data. If you have swap,
it plays into that policy.

------
dennisjac
I always see people caring way too much about the _amount_ of swap space used
and not enough about the swap activity going on. It isn't the amount of swap
space that slows a system down but actual swapping in/out of memory pages. In
vmstat you want to pay attention to the "si" and "so" columns and in your
alerting/graphing you want to keep track of the values "pswpin" and "pswpout"
in /proc/vmstat. If these are almost always exactly or near zero then that
means the fact that some memory pages are swapped out has virtually no impact
on the performance of your system.
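
A minimal sketch of watching activity rather than usage:

    vmstat 5                   # "si"/"so" columns: memory swapped in/out per second
    grep '^pswp' /proc/vmstat  # cumulative pswpin/pswpout counters, useful for graphing deltas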

There are two other important issues to take into account though. 1) Even if
the memory pages swapped out are not generally accessed, they might be forced
back into memory because of some specific action. One example is a database
that has all the hot records in memory but other records that are accessed
only very rarely swapped out to disk. In general everything will perform fine
in this situation but the moment you do e.g. a table scan and a lot of these
records need to be moved back into memory you might see a disk I/O peak that
might be quite a kick on the neck for overall performance if the database is
really busy.

2) If you don't have any swap space configured you might still run into
problems with swap which seem to be caused by a bug in the memory handling in
the kernel. I've seen this on some KVM hypervisors which were CentOS 7
systems. These systems were equipped with 128G of RAM and had two virtual
machines running which both were configured with 32G of virtual RAM. They ran
fine until one day the kswapd kernel process ran with 100% cpu usage even
though no swap was configured whatsoever (to avoid the situation mentioned
above). The "fix" was to dump the system's caches with "echo 3 >
/proc/sys/vm/drop_caches", which seemed to calm down kswapd again. As best as I
can tell, what happened is that the system used all the free RAM for the page
cache and buffers, and when the system needed some memory it apparently
preferred to swap pages out to disk (even though no swap was configured) rather
than reclaiming page cache, of which there was plenty to reclaim. Unfortunately
that means there seems to be no bulletproof way to say "only use physical ram
and never try to swap anything out to disk". Even /proc/sys/vm/swappiness can
be dangerous as a value of "0" doesn't actually tell the system to only swap
if absolutely necessary but can lead to OOM situations even if swap space is
still available (see
[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux...](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=fe35004fbf9eaf67482b074a2e032abb9c89b1dd) for details).

TL;DR: 1) Don't just pay attention to the amount of swap space but to actual
swap activity over time 2) Be aware of corner cases and bugs relating to swap

~~~
JdeBP
I've been making the same point about pagefile size versus paging I/O for a
couple of decades now, finally turning it into an FGA a decade ago. (-:

* [http://jdebp.eu./FGA/dont-throw-those-paging-files-away.html](http://jdebp.eu./FGA/dont-throw-those-paging-files-away.html)

------
Kenji
I'm sorry, but not a single reason was presented in favour of swap when you're
100% confident that you will not max out your RAM. Swap = completely useless
slowdown if you have total control over your system.

Better use ulimit to prevent processes with e.g. memory leaks from causing
havoc.
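
A minimal sketch of that approach ("leaky-program" is a made-up name; the cap
is only an example):

    ulimit -v 4194304   # cap this shell and its children to 4 GiB of address space (value in KiB)
    ./leaky-program     # allocations beyond the cap now fail instead of dragging the box into swap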

~~~
efaref
The most compelling reason to enable at least some swap space is that the
Linux kernel's memory handling algorithms can't cope with there being no swap.
If you have no swap, then you will encounter stalls and OOMs even if you don't
actually run out of memory. Adding even a tiny amount of swap prevents this
from happening.

~~~
albertzeyer
This sounds as if it is a serious issue and should be fixed. Is there a bug
report about it?

~~~
lolc
This is due to memory overcommit, it's a performance feature. You can turn it
off.

~~~
kevin_nisbet
Interesting, I sort of remember an interaction with NUMA as well. I don't
believe this is a bug, just a misunderstanding of the way the underlying
system works. From what I remember, when swappiness is 0 there are lots of
reports of processes that use more than a single NUMA node's worth of memory
getting killed by the OOM killer, even with plenty of free RAM available. I
unfortunately don't remember the details, but this prevented the memory from
being allocated on the additional node.

I've heard of this mostly with mysql, where it's common to have a big server
with lots of RAM, but a single large process that uses most of the system RAM.
The way we got around this, was by setting the process to allocate interleaved
among the NUMA nodes.

I'll have to dig into the overcommit.

------
martin_andrino
Swap isn’t needed at all in this age. It’s just there for people who can’t
afford more RAM and therefore have to resort to hacky solutions like this.

~~~
bogomipz
What an absurd statement.

Do you use a laptop? I do and I'm using swap space right now. I would like to
buy a laptop with 32 or 64 Gigs of RAM, but I can't, as most (all?) laptop
makers don't ship with a memory controller that permits more than 16 Gigs of
RAM. And I can afford to buy more RAM.

~~~
cztomsik
What is the rule for setting swap size? And can I just use a swap file, like
on win/mac? (I heard it will disable hibernation then.)

~~~
warrenm
The basic rule is `RAM + 2GB`

I usually peak at 34GB, unless I have reason (like needing/wanting to enable
full hibernation) to use 66GB, 130GB, etc.
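
A minimal sketch of that rule applied as a swap file (the path and the use of
fallocate are conventional choices, not requirements; dd works too):

    MEM_GB=$(( $(grep ^MemTotal /proc/meminfo | awk '{print $2}') / 1024 / 1024 ))
    fallocate -l "$(( MEM_GB + 2 ))G" /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    # add "/swapfile none swap sw 0 0" to /etc/fstab to make it persistent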

~~~
cztomsik
what if you increase your RAM after installing the system?

~~~
LinuxBender
It's a circular argument that people get into because they have the incorrect
kernel settings. Even if you have 4TB of ram, someone will say to add more
memory on disk. It just means the system is not configured correctly. I have
30k servers and not a single one of them has swap.

~~~
warrenm
>I have 30k servers and not a single one of them has swap.

Physical? VM? Cloud?

I've never seen an environment with more than a couple carefully-tuned
machines that didn't run swap on every last one

~~~
LinuxBender
30k physical servers. 512GB to 1TB ram each.

~~~
warrenm
>30k physical servers. 512GB to 1TB ram each.

You running an AWS data center?

~~~
LinuxBender
Not a public cloud. It does involve some private cloud virtualization and
containers.

