Linux Performance: Why You Should Almost Always Add Swap Space (haydenjames.io)
46 points by ashitlerferad on Dec 18, 2017 | 109 comments



Swap engineering is definitely way more complicated than the article lets on.

What people often don't realise is that Linux distinguishes between dirty and clean memory: dirty memory has no up-to-date copy anywhere on disc, clean memory is already stored somewhere on disc, and program code is counted as clean memory that just happens to live somewhere other than swap.

Therefore, under memory pressure (especially if you set swappiness to zero), you will be preferentially swapping out your program code (because it is always clean) in preference to your program data. If you have no swap, then this is what causes the system to grind to a halt when the RAM is full: all your program code gets discarded from RAM, and nothing can run without reading it from disc again.

The recommendation to have a little bit of swap is absolutely fine. However, as the amount of RAM in your system increases, the penalty for running out of RAM increases as well. On larger systems (for example 256GB RAM), I recommend using something like EarlyOOM[1] to kill off tasks before the pathological swapping case occurs. Otherwise, you could end up with an unresponsive system. If you have lots of RAM, the kernel OOM killer waits far too late, and the system is already unresponsive.

[1] https://github.com/rfjakob/earlyoom


> Therefore, under memory pressure (especially if you set swappiness to zero), you will be preferentially swapping out your program code (because it is always clean) in preference to your program data.

Regarding systems with near 100% utilization, wouldn't it be better advice to pin all executable code to memory via tmpfs (at system level) or mlockall() (application level)?

Encountered this case on some heavily loaded batch data processing workers: at some point, client programs would slow down and generate random disk I/O (which in turn is detected by monitoring and throttled). It turned out processes were going through major page faults at random moments, when their executable code pages were discarded from memory and subsequently re-read from disk.
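
A minimal sketch of those two approaches, for illustration (paths are made up; instead of calling mlockall() from inside the application, the third-party vmtouch utility can lock a file's pages from the outside):

    # Option 1: lock the binary's pages in RAM from outside the process
    # (vmtouch -l uses mlock(2); -d daemonizes and keeps holding the lock)
    vmtouch -dl /usr/local/bin/worker

    # Option 2: run from a tmpfs copy, so the code pages are RAM-backed
    # (they can only go to swap, never be dropped and re-read from the original binary)
    mkdir -p /opt/pinned && mount -t tmpfs -o size=256m tmpfs /opt/pinned
    cp /usr/local/bin/worker /opt/pinned/ && /opt/pinned/worker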


I'd love to have a configuration option on Linux that elevates the penalty for evicting executable pages, or just makes them non-discardable. I think it would help a lot.


This is dangerous advice. At least on Ubuntu 16.04 with the stock kernel, running into swap due to memory pressure kills the system: basically everything times out and you can only hard-reboot. Without swap you get an OOM, which is at least something you can recover from. If you need swap, just add a little, like 1-4GB, because if you have spinning disks 64GB of swap won't help you at all; every access to swap is orders of magnitude slower. It might be different with NVMe, but it's still noticeable.


Not just ubuntu. We tell our engineers to remove their swap partitions because sometimes the test suite will consume all remaining memory and push into swap. If that happens, pretty much the only thing you can do is a hard poweroff. If you're fast, you can Ctrl+C and only lose a minute or two, but often you're just completely stuck.

Much better to OOM and have the test suite killed early.


Wouldn’t it make much more sense to limit the maximum amount of memory available to something like (system memory - 500MB) using cgroups? I know configuring cgroups can be difficult because there is no command line script for it, but it might be worth it if it’s such a common problem with your test suites. An easy way to configure cgroups would be to use Docker, but that adds a lot of overhead in management (not in performance - but you need to have every dev install Docker and run the suite through Docker).
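
A minimal sketch of that idea with the cgroup v1 memory controller (the group name and the 7500M limit are illustrative):

    # create a memory cgroup capped below physical RAM
    sudo mkdir /sys/fs/cgroup/memory/testsuite
    echo 7500M | sudo tee /sys/fs/cgroup/memory/testsuite/memory.limit_in_bytes
    # move the current shell into the group, then run the tests; the OOM killer
    # now fires inside the cgroup instead of stalling the whole machine
    echo $$ | sudo tee /sys/fs/cgroup/memory/testsuite/tasks
    make test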

I use swap quite a lot, though I’m not sure what my system moves to swap. A large portion seems to be inactive Google Chrome tabs (whenever I open an inactive tab, Chrome freezes for a second and I suddenly use ~70MB less swap).


> If that happens, pretty much the only thing you can do is a hard poweroff.

Sounds like a non-swap system/kernel tuning issue - you should still be able to slowly switch to a virtual console or ssh in, etc, and kill the test suite..

Also, most linux distros run without any real usage of user/group resource limiting - setting up these appropriately is useful to limit the overall impact of runaway jobs


Can ulimit help here?


ulimit is per process, so it won't do. vm.overcommit_memory=2 with some setting of vm.overcommit_ratio or bytes will help though.

I'd say overcommit heuristics break applications and cause them to eat too much memory, since they don't know when to stop. The only trouble is KVM, which for some reason takes double the process space allocated to the VM, and perhaps memory-intensive Java.

You need to tune the latter anyway.
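
For reference, a sketch of the strict-accounting settings mentioned above (the ratio is illustrative and should match your RAM/swap layout):

    # commit limit = swap + overcommit_ratio% of RAM; allocations beyond it fail with ENOMEM
    sudo sysctl -w vm.overcommit_memory=2
    sudo sysctl -w vm.overcommit_ratio=80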


Indeed, on Ubuntu 16.04 64-bit, when my Chrome has 20+ tabs, one new tab can trigger memory thrashing that uses up all the memory and eats up a few GB of disk; CPU load jumps up 10x and the computer basically becomes unusable.

I would rather the OOM killer kill the Chrome process in this case; it wastes less of my time, and re-opening Chrome keeps all the history anyway.

Also, on embedded systems where RAM is fairly limited, no hard drive is attached, and storage is measured in MB instead of GB, you simply cannot have a swap partition.

I want to run the system without swap actually.


>when my chrome has 20+ tabs

That's not a swap problem, that's a Chrome problem.


It is a swap problem. Swap is supposed to make things work and not stop.

I suppose this is actually a problem of Linux optimistic malloc policy.


[flagged]


Thanks for the list of resources--I've been struggling to find good information about whether to use swap.

A friendly suggestion: you obviously care a lot about this, and want to bring people around to your point of view. I think you'd be more effective at that, if you wrote this up as a blog post, or a github gist. Cite the relevant information from those links (or leave some of them as further reading). As it stands, there are 16 links, and I'm anticipating slogging through them, with no idea what I'm getting out of it.


> with no idea what I'm getting out of it.

Not a whole lot. I've looked at them and they are for the most part outdated, do not make the case for 'some swap' or are simply fluff to make the list look more impressive.


Half of them are older, half are new/current

In short, unless you really know you don't need it, you should be running it


In my experience, engineering swap usage is more complex than this article lets on. While adding more swap may make sense for many systems, I've personally worked on several systems where disabling swap made a huge difference. It's just one of those things where there isn't a hard rule.

One of the issues I've encountered more than once with Java software is what appear to be longer-than-expected pauses in the GC, which I've linked to swap usage. The theory is: if you have a reference that lives long enough to get promoted to an old generation, but then is released, the memory in question won't be used for quite some time. This can lead the kernel to detect that those memory pages are not being used and swap them to disk, but then when it's time to do GC, the process is frozen while the GC runs and gets backed up. Depending on the workload this might be fine; however, both of the platforms where I encountered this issue were real-time applications which were engineered to have memory available for the application (and which got no benefit from using the freed memory for I/O cache). So in this case, disabling swap gave us a real and measurable performance boost, getting rid of some of the variability in our GC runtime and pauses.

What to do under low-memory conditions is, I think, also subjective. If I'm running a distributed / highly available system that's doing message routing in a telco environment, I would rather have the OOM killer kill the process and fail over to a backup than have the system be up but unresponsive because it's spending the majority of its time swapping to/from disk. Again, this totally depends on the platform.


> As a last resort, the Kernel will deploy OOM killer to nuke high-memory process(es)

Yes! That is exactly what I want to happen!

When the system runs out of RAM, things will generally stop functioning, swap enabled or not. The only question is how you want it to stop functioning when that happens.

In almost every situation, I'll easily take the kernel killing whatever single process it thinks is most appropriate to get rid of, and keeping everything else up and running smoothly, over grinding the entire system to a halt by upping the effective memory access time by orders of magnitude.

Simply put: If everything doesn't fit in memory, then don't try to run everything!

Properly designed software nowadays is designed to be able to crash without corrupting data. As far as I'm concerned, it is almost always preferable to kill and restart instead of giving CPR to processes that don't fit in the working memory.


Funny story though, we had a database server get nuked because a client process used too much memory. We hadn't done the tuning correctly and someone decided to run their queries locally. So it was entirely our fault. But mysqld - 220GB, mysql (client) - 30GB, let's nuke the server process.

Lesson learned, make sure you adjust the OOM killer on things like services that use lots of RAM.
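
One hedged example of what "adjusting the OOM killer" can look like: lowering the score of the daemon you care about (the -900 value and the unit name are illustrative; -1000 makes a process effectively unkillable, so use with care):

    # one-off, for a running process
    echo -900 | sudo tee /proc/$(pidof mysqld)/oom_score_adj
    # or persistently, via a systemd drop-in for the service
    # /etc/systemd/system/mysql.service.d/oom.conf:
    #   [Service]
    #   OOMScoreAdjust=-900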


I don't want this to happen.

On servers, in almost every situation, I'd rather have the server reboot than have some random process crash and leave the system in a potentially broken state.

On desktops, in every situation, I'd rather have X, my window manager and an emergency terminal pinned to RAM so I can always decide what to kill for myself. (https://github.com/stiletto/angrymlocker helps with setting this up.)


> I'd rather have the server reboot than have some random process crash and leave the system in a potentially broken state.

Set the OOM killer to trigger a reboot? Crawling to a swapping halt is the worst of both worlds, it's like a full system crash but the server never comes back.
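
If that is the behaviour you want, the kernel can be told to panic on OOM and reboot after a delay; a small sketch (the 10-second delay is arbitrary):

    sudo sysctl -w vm.panic_on_oom=1   # panic instead of invoking the OOM killer
    sudo sysctl -w kernel.panic=10     # reboot 10 seconds after a panic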


Set overcommit policy and malloc would just fail, which is handled in mysql and would abort the query instead of invoking oom killer.

And a crashing service should get restarted, involving journal file recovery which is consistent state.


> Set overcommit policy and malloc would just fail, which is handled in mysql and would abort the query instead of invoking oom killer.

So many things depend on so many gigabytes of overcommit, that seems like a pretty bad way to go about it in the general case.

> And a crashing service should get restarted, involving journal file recovery which is consistent state.

I think you're agreeing with me with this sentence?


[flagged]


That is not a very helpful comment.

Anyway, the quote from the article I was responding to above was talking about using swap to handle a situation where you have insufficient RAM, so you're off mark.


Nope, I'm not off the mark.

But nice try at deflecting your lack of understanding.


Please don't be uncivil on Hacker News, regardless of how much someone knows about swap.

https://news.ycombinator.com/newsguidelines.html


Kubernetes forces you to start it with a flag to tell it "I know what I'm doing, I know this machine has swap, but start Kubernetes anyway even though it's normally a bad idea to run Kubernetes on a machine with swap".
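
For reference, the override in question is (as far as I know) the kubelet's --fail-swap-on flag, added around Kubernetes 1.8:

    # kubelet refuses to start with swap enabled unless you explicitly override it
    kubelet --fail-swap-on=false    # plus the rest of your normal kubelet arguments
    # the usual alternative is simply to disable swap on the node
    sudo swapoff -a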


I wondered this for a while, unfortunately a major question I have remains unanswered in the text: What about SSDs and SSD-only systems? (Like... any average laptop.) Should I be worried that putting a swap on an SSD will cause it to wear out fast due to many write cycles?


> Should I be worried that putting a swap on an SSD will cause it to wear out fast due to many write cycles?

This risk is hugely overstated for most use cases based on people having 'baggage' from 'dumb' flash memory cards.

eMMC is another story (because these are basically 'dumb' flash memory cards)

Recently read an article by Samsung where they claim that 3D NAND SSD drives can handle ~1x full drive write per day (e.g. 1TB drive -> 1TB/day) and still last the rated lifetime, and that V-NAND can handle something like 5-10x (I don't remember the precise figures offhand, but it was easily googleable). Yes, this is a vendor figure - but it should be at least ballpark accurate, I'd think.

Also, many SMART-capable drives show their estimated write cycle lifetime in SMART status (again, vendor figure, but still) .. if you check this on your laptop I bet you will be surprised at how low the figure is.
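
A quick way to check this yourself (smartmontools required; attribute names vary by vendor):

    # SATA SSDs: look for attributes like Wear_Leveling_Count,
    # Media_Wearout_Indicator or Total_LBAs_Written
    sudo smartctl -A /dev/sda
    # NVMe drives report an explicit "Percentage Used" figure
    sudo smartctl -A /dev/nvme0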


No.

There is no worry putting swap on SSD.

Given the choice, you should prefer swap on SSD (since it's so much faster than spinny disks).

If swap on SSD was a bad idea, do you honestly think every laptop, desktop, and server manufacturer that uses exclusively SSD would still enable swap?

Every mainstream OS wants/needs swap - it helps all of them


I've swapped hundreds of terabytes to consumer SSDs. Haven't noticed any problems yet (visibly).


By the time you start to see r/w issues on anything that resembles a modern SSD, you're probably wanting to replace the overall device anyway


Same here. Could a swap file (cf. https://wiki.archlinux.org/index.php/swap#Swap_file_creation) help by letting the underlying file system distribute wear across the SSD? Combined with recreating the swap file on every boot (say, on a frequently rebooted desktop machine), this could make a difference.
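
The swap-file route from that wiki page is only a few commands; a sketch (the 2G size is illustrative, and on some filesystems you need dd instead of fallocate):

    sudo fallocate -l 2G /swapfile    # or: dd if=/dev/zero of=/swapfile bs=1M count=2048
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # recreating it at every boot is just a matter of putting the same commands in a boot script/unit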


I used to work in flash manufacturing. We would joke, "I hope nobody ever uses this for swap!" The numbers I saw for endurance (basically write-erase cycles before failure) were not encouraging at all. As far as I can tell the only reason MLC flash works at all is that the controller does some magic to present a whole bunch of pretty flaky cells as a single reliable unit.


> As far as I can tell the only reason MLC flash works at all is that the controller does some magic to present a whole bunch of pretty flaky cells as a single reliable unit.

I would refuse to even use SLC flash without that magic.

But once you have all that ECC and balancing, durability almost stops being a problem outside of heavy database loads.


>I used to work in flash manufacturing

How long ago? r/w cycles haven't really been an issue on SSDs in a long time


2010. I've been following the industry since I left, and I promise you neither the physics of FN flash nor the manufacturing process has fundamentally changed since then. Nor have the true endurance numbers substantially improved. If anything the situation is worse now with smaller device dimensions. The only thing standing between your collection of cat gifs and oblivion is the flash controller making intelligent guesses about what it just read.


Sorry, but you must not be "following the industry" very well.

SSDs have lifespans measured in years of constant use.

If you think it isn't the case, you need to go back and learn more about those devices you claim to know so much about.


I'm afraid we're misunderstanding each other here. You're talking about the fully-integrated SSD. Not my area of expertise. I'm talking about the endurance of the flash cell transistors. Nobody's claiming significantly higher endurance for the transistors themselves -- what's improved is the stuff on top of them. SSD controllers have gotten better at not trashing any given cell with repeated program / erase cycles. If you measure endurance of modern SSDs versus seven years ago, sure, you'll see better endurance, but that's because there's better logic on top, not because the cell transistors are any better than they were.


But as a user putting the SSD into my machine I don't care about the low-level cells, I care about what's in the datasheet of the device I bought?


Fair point. As with eating sausages, so with using flash memory. Maybe easier to make a rational decision about it if you haven't seen it made.


The magic of the SSD controller can introduce rather surprising failure modes that you should probably care about.


I set vm.swappiness to 0 on my Fedora laptop.

Meaning the kernel will only swap to save the system.

I've been doing this for years and had no issues.

Edit: Actually in recent kernels, which I am using, 0 means it's disabled. I was thinking of 1.
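
For anyone wanting the same setup, a minimal sketch (using 1 rather than 0, given the newer-kernel semantics mentioned in the edit; the file name is arbitrary):

    sudo sysctl -w vm.swappiness=1
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-swappiness.conf   # persist across reboots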


It does not help; it just makes the problem appear later, in a more dire way. Try overcommit settings instead.


I googled that and was pleasantly surprised to learn something new.[1]

But if I'm reading that right, I would probably want to do both, because the overcommit settings seem to be a memory-saving technique: avoiding the need for swap by decreasing the allowed overcommit memory.

Anyways, in the many years that I've been changing vm.swappiness I've never had any issues and I've always had a minimum of 8G RAM so I don't think this will be a big issue.

1. http://engineering.pivotal.io/post/virtual_memory_settings_i...


Cargo cult system administration.

No, you should not 'almost always add swap space'. What you should do instead is tune your system for its intended use.

If you need swap as an 'early warning system' that you're about to run out of memory you're already doing it wrong and the OOM killer is a piece of code that has default settings that can be tuned, ditto for the virtual memory manager in the kernel.

http://www.oracle.com/technetwork/articles/servers-storage-d...

https://access.redhat.com/documentation/en-us/red_hat_enterp...

Select appropriate equivalent for your kernel where applicable, and don't forget about the per-process limits in ulimit.

This sort of 'almost useless' advice is pretty annoying, it shows that the writer has an interest in the matter but didn't bother to dig in deep enough to make the advice truly useful.

Also, the comment on that article that implies that swap will allow you to update a running process because it is backing the image is wrong; the backing is simply done through the filesystem, and swap space has nothing to do with it. The practical upshot of this is that you're not going to be able to reclaim the space held by the binary if you remove it, because the 'file is busy' until the last instance of that process exits. In the meantime, if you really want to run the newer version under the old name or path, you're free to unlink it or 'mv' it to a different location and start the new binary from the default location. On some Unixes you can even overwrite the binary using an mv command, making it look like the old one has disappeared, but again, under the hood it will still be held open; you can check this using lsof or whatever your local equivalent is to see that the file is indeed still present. In a pinch (and you really, really have messed up if you ever need this trick) you can re-link the running image by hard linking to the inode.



The majority of your links seem to be chosen simply because they have the word 'swap' in them somewhere, not because they make the case for why having 'some swap' is a must.

I wrote a kernel from scratch; besides that, I did quite a bit of transaction-oriented stuff and built the server side of a very successful messaging platform. That does not make me an expert either, but I do have some minimal understanding of what goes on under the hood. If you are simply configuring swap space on a better-safe-than-sorry basis then I would rather not have you near a system whose response has to be deterministic, because it almost certainly will fail in some unpredictable way sooner rather than later.

Yes, it is difficult stuff; no, you do not get a free pass just because you messed with important stuff on a duct-tape-and-glue basis. See, swap is just a stay of execution: once a system starts swapping it is as good as out of control anyway, so in the vast majority of cases that I am familiar with you want that situation to announce itself loud and clear and in a way that gives you back control. You then do a postmortem and you fix it so it will not happen again. That is far preferable to having systems that remain partially broken but pretend to be fine.

I know plenty of people run after each other claiming great insight by quoting the various help-yourself sites but it amounts to absolutely nothing when you suddenly find yourself trying to figure out why your transaction queues are overflowing because some system in the pipe suddenly decided 100ms is as good as 10 as long as things keep flowing.

If, on the other hand, your systems are not important enough to properly set them up with static resources committed to long-running processes and with strict limits on what logged-in users can do from the command line, then this advice is going to fall on deaf ears, because you did not need reliability in the first place. Note that in many business contexts latency is far more important than throughput.


If swap is 'cargo cult administration', pretending it won't save you a few times, even with all the safeguards you claim to apply perfectly, is 'narcissist administration'.

If your systems are so advanced that you've done all of this up-front profiling, you'd know why the system in the pipe decided 100ms is as good as 10, because you'd be monitoring and alerting on memory/swap, would get the message when it hits 80-90%, and would be logging in while the leaky process is blowing up into your swap space.

No need for a 'postmortem' if you can do surgery.


If you're architecting your systems in such a way that there is a dynamic component that can cause you to hit swap, then you could also simply buy that much more memory, so swap won't save you.

All you need to do is to monitor your memory usage, any growth that you do not understand is a reason to stop what you are doing immediately and to figure out what is going on.

A few GB of extra swap space will not save your bacon; it will make it much harder to get the system back under control, because the system is no longer responsive, compared to a process being killed and a supervisory process restarting it immediately and logging a fault.

And if your systems can't handle that you have bigger problems, likely you then also won't be able to deal with hardware faults, crashes, power failures and other errors.

A postmortem is far preferable when it is about a process that is re-launched in a small fraction of a second, after which the system is back to normal, if the alternative is a system that causes a whole cascade of stuff to go out of whack down the line.

Pretty much only Erlang/OTP gets this sort of thing right to begin with.

Errors - including out of memory errors - should be expected and should be dealt with in a deterministic manner.


There have been many times, in 30 years, when I've been glad I had swap enabled so I could recover a nearly-out-of-control production system in time, with the right procedures, instead of everything just being killed.

That's enough for me to turn it on. But, also, gigs of RAM is another must-do cargo-cult thing...


Did you do root cause analysis and did you make sure that same condition could never happen again or did you figure that since the swap file saved you that no further action was needed?

The 'many times' has me worried.


Many times .. over 30 years. And yes, it was mostly due to bugs in the code, which I wouldn't have been able to assess if my machine had just OOM'ed everything. That's the point: swap gives you a little leeway for these analyses.


>you do not get a free pass just because you messed with important stuff on a duct-tape-and-glue basis

Hahah. That made me laugh.

Before you go insulting people, maybe listen to what highly-experienced folks, like myself, have done.


Interesting; Jeff Atwood of Coding Horror/Stackoverflow/Discourse forces swap creation as part of the installation of Discourse as a Docker container. He prefers slow performance to OOM. https://meta.discourse.org/t/create-a-swapfile-for-your-linu...


Pretty much all smart sysadmins want swap.

The kernel expects it, and performance is almost always worse without it.

https://news.ycombinator.com/item?id=15952447


Only slightly worse. Technically, the kernel will sometimes swap out live executable code unless you set swappiness to 0. This makes performance completely unpredictable under load.

The kernel does not "expect" anything; the defaults are just heuristics that avoid failing malloc() beyond the amount of RAM you have. This causes all of the mentioned behaviour: apps allocating too big an RSS, for instance, and deferring GC and heap compaction until too late, causing the kernel to unload executable code.

And once you really run out of RAM, swap will kill performance even on a fast SSD, unless you're using only hugepages.


I've often disabled swap on systems running high-throughput databases like ElasticSearch and Cassandra because paging to disk will cause one or several nodes to slow down, which affects the performance of the whole cluster. The better thing to do in those cases is let the node fail right away by disabling swapping.

In fact, Elasticsearch prefers to be run with `swapoff`: https://www.elastic.co/guide/en/elasticsearch/reference/curr...
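
A sketch of the two usual options on such nodes (the memory_lock setting is from the Elasticsearch docs linked above; the systemd override is an assumption about how the service is run):

    # Option 1: no swap on the node at all
    sudo swapoff -a          # and remove the swap entry from /etc/fstab
    # Option 2: keep swap, but let Elasticsearch lock its heap in RAM
    #   elasticsearch.yml:  bootstrap.memory_lock: true
    #   systemd override:   LimitMEMLOCK=infinity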


There's even a name for a slow node stalling an entire cluster: https://danluu.com/limplock.


When running an instance where the biggest process is a garbage collected runtime, one does not want any swap at all. If the garbage collector is forced to walk through all of the pages in swap hunting for live references, performance is terrible. It is better then to have the process see only as much virtual memory as there is RAM in the system. It is a failure of Unix/Linux that there is no sensible API for managing real RAM from the point of view of the process.


What do the people that use cloud instances that don't have local disks do? You definitely don't want swap to be on EBS...


Allocate 20+% more memory than you could ever need. Use cgroups, LXC, or systemd to constrain applications and containers to specific amounts of memory, CPU, etc. Properly engineered systems will have enough memory for everything else under the hood beyond the application. VM and container abstraction does not negate the requirement to calculate this.
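
As a concrete (hedged) example of constraining a service with systemd rather than raw cgroups (the unit name and limits are illustrative):

    # cap a service at 2G of RAM and two CPUs' worth of time
    sudo systemctl set-property myapp.service MemoryLimit=2G CPUQuota=200%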


It's a good question, but the swap advice is suspicious. If I would normally purchase a 4GB RAM instance with 4GB of swap, then what happens when I purchase an 8GB instance? Do I still need 4GB of swap? Hopefully this helps illustrate that what really matters is the applications you intend to run; swap doesn't mean the system gets more efficient.


Nowadays Linux has this thing called a zram disk. It compresses data in memory on the fly. I think general swapping advice would serve most people better if it recommended always adding swap on zram first, before swap on SSD or HDD. The difference is that getting data in and out of zram is so fast that you can actually use it as a tier of slower RAM, transparently compressed.
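
A minimal manual setup, for illustration (size and compression algorithm are arbitrary; most distros offer a package that does this for you, as mentioned below):

    sudo modprobe zram num_devices=1
    echo lz4 | sudo tee /sys/block/zram0/comp_algorithm   # if the kernel supports it
    echo 4G | sudo tee /sys/block/zram0/disksize
    sudo mkswap /dev/zram0
    sudo swapon -p 100 /dev/zram0    # higher priority than any disk-backed swap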


Is this activated by default today? The last information I have, from Ubuntu 16.04 times, is that it is not activated by default. I know that macOS and Windows do something similar by default.


Some systems have it activated by default; on Ubuntu it was a matter of installing the zram-config package with its default configuration.


Distro-dependent. I think at least some distros for lightweight desktop systems and ChromeOS use it by default, RedHat trusts it enough to officially support it, but I don't think any of the major distros has it on by default.


zram is cool, but still considered experimental. Even zswap is still fairly young. I have used both on my own systems. zswap is much easier to configure.


Not exactly helpful.

Swap is vital in modern systems: https://news.ycombinator.com/item?id=15952447


I don't think the 2.6-kernel-era reasons are relevant anymore; in fact, they apply better to RAM compression techniques than to old raw swap on disk.


I'm not going to disagree with the article, because I haven't done any testing myself that would suggest the points made are wrong. I do however find it amusing to log in to a system with 32 or 64GB of memory and see a 2GB swap. It seems humorous that those 2GB should somehow do anything if you've already used 64GB of RAM.


System tuning is largely dependent on workload and use case. You can't just give generic advice like this.


So, wouldn't the best answer clearly be to buy even more RAM, and use a RAMdisk for swap?


That would be the ideal response, but unfortunately there are cases when it's not possible.

1. System is already maxed-out at its board limit

2. Specific RAM simply isn't available any more from preferred vendors


If you must use swap (you probably don't need to), then at least set zswap.enabled=1 in the kernel boot options, typically in grub. This will enable lzo compression of swap in memory so there is less writing to disk. Some newer kernels use lz4.
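
For example, on a grub-based system that would look something like this (lz4 only if your kernel supports it):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... zswap.enabled=1 zswap.compressor=lz4"
    # then regenerate the config and reboot
    sudo update-grub    # or: grub2-mkconfig -o /boot/grub2/grub.cfg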

Either way, what most folks are actually missing is the correct kernel settings for the amount of memory they have. Sadly, the kernel does not dynamically adjust these based on your total amount of memory. Below is based loosely on Red Hat suggestions. Please note that if you fall below the min free threshold, the kernel will decide what to do next based on your OOM and panic settings. The settings below will have the kernel free cache and other memory earlier, so that you do not hit those stalling conditions, and can even prevent some OOM race conditions. Some might suggest tuned, but use tuned with caution, or at least read up on everything it does.

    MEM=`grep ^MemTotal /proc/meminfo | awk {'print $2'}`
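    # MemTotal in /proc/meminfo is reported in kB, as are the vm.min_free_kbytes values set below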
    if   [ ${MEM} -gt 1129241478 ] ; then
        sysctl -q -w vm.min_free_kbytes=16384000
    elif [ ${MEM} -gt 564620739 ] ; then
        sysctl -q -w vm.min_free_kbytes=8192000
    elif [ ${MEM} -gt 352887962 ] ; then
        sysctl -q -w vm.min_free_kbytes=4096000
    elif [ ${MEM} -gt 176443981 ] ; then
        sysctl -q -w vm.min_free_kbytes=1024000
    elif [ ${MEM} -gt 88221990 ] ; then
        sysctl -q -w vm.min_free_kbytes=524288
    else
        sysctl -q -w vm.min_free_kbytes=262144
    fi
If you have small VM's, then perhaps set the default above to something a little smaller. You can of course free up about 128MB on default installations by removing "crashkernel" from your grub config and rebooting.

Then do this regardless, because overcommit set to 0 does not mean off, so we set the ratio to 0 as well. Overcommit is good for developers testing code and finding the correct ways to manage memory in their applications during development.

    sysctl -q -w vm.overcommit_ratio=0
And of course, cache pressure plays into early evacuation of the right cache based on your usage:

    ## default is 100 (optimal for file servers).  4000+ for in memory databases.
    ## 10000 means always prefer page cache.
        sysctl -q -w vm.vfs_cache_pressure=1000
And if you have people oversubscribing a lot: (adjust based on your memory capacity)

    sysctl -q -w vm.admin_reserve_kbytes=131072
    sysctl -q -w vm.user_reserve_kbytes=262144
Please do read up on all of these before testing on your test machines. [1]

[1] https://www.kernel.org/doc/Documentation/sysctl/

Then finally, make sure you have Transparent Huge Pages disabled unless you know for sure you need it. THP can leak a lot of memory and it is nearly impossible to see without extensive kernel debugging.

In grub, set this and reboot:

    transparent_hugepage=madvise
Or to manually disable THP during run-time,

    echo -n "madvise" > /sys/kernel/mm/transparent_hugepage/enabled
    echo -n "never" > /sys/kernel/mm/transparent_hugepage/defrag
Then restart your applications. THP defrag can also cause stalling and lag spikes, especially in large memory java deployments (it will look like FGC's) and in MongoDB, Cassandra, others.

If you did this manually, stop your apps, flush cache, compact memory, then start your apps.

    sync;sync;sync
    echo 3 > /proc/sys/vm/drop_caches
    echo 1 > /proc/sys/vm/compact_memory
Some will say 3 syncs are not required. This is mostly true, but some old RAID controllers treat this differently.


You probably do need swap.

Start here to know why: https://news.ycombinator.com/item?id=15952447


The reason those articles are suggesting swap is due to the settings I linked above that most people are missing. The kernel is not evacuating cache early enough and it gets wedged. Kernel devs even argue among themselves about this. A properly engineered system would never need swap, that much is for certain.

And if you must use it, then at least know when you need to encrypt your swap. If you have customer data in memory that is encrypted at rest, then you must encrypt your swap.

Some people use crypttab for this, but I think that is a mistake. Rather, people should have a swap volume or partition and, on each system boot, run cryptsetup with a long randomly generated password, then mkswap -f and swapon.
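
A sketch of that boot-time sequence (the device name is illustrative; plain dm-crypt with a throwaway random key, so swap contents are unrecoverable after poweroff):

    sudo cryptsetup open --type plain --key-file /dev/urandom /dev/sdb2 cryptswap
    sudo mkswap -f /dev/mapper/cryptswap
    sudo swapon /dev/mapper/cryptswap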

Most companies have policies about encrypting customer data. If you have swap, it plays into that policy.


Those 3 sync's in rapid succession are secret-monkey-code for "tell the tape unit to rewind..", just FYI .. you don't strictly need 3. ;)


hehe it was also the secret monkey code on some old raid controllers to tell them to commit their cache to disk. It isn't even strictly required these days, but old bugs and features find their way into systems all the time, so I just keep the old incantations around as paranoid habits. :)


Yeah, in my case its pure muscle-memory from the 80's. Can't stop myself from doing it the moment I start typing "sync"...


Good stuff here.


I always see people caring way too much about the amount of swap space used and not enough about the swap activity going on. It isn't the amount of swap space that slows a system down but the actual swapping in/out of memory pages. In vmstat you want to pay attention to the "si" and "so" columns, and in your alerting/graphing you want to keep track of the values "pswpin" and "pswpout" in /proc/vmstat. If these are almost always at or near zero, then the fact that some memory pages are swapped out has virtually no impact on the performance of your system.
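
Concretely, the fields to watch look like this (the 5-second interval is arbitrary):

    vmstat 5                                  # watch the "si" and "so" columns
    grep -E 'pswp(in|out)' /proc/vmstat       # cumulative counters worth graphing over time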

There are two other important issues to take into account though. 1) Even if the swapped-out memory pages are generally not accessed, they might be forced back into memory by some specific action. One example is a database that has all the hot records in memory but other, rarely accessed records swapped out to disk. In general everything will perform fine in this situation, but the moment you do e.g. a table scan and a lot of those records need to be moved back into memory, you might see a disk I/O peak that might be quite a kick in the neck for overall performance if the database is really busy.

2) If you don't have any swap space configured you might still run into problems with swap, which seem to be caused by a bug in the memory handling in the kernel. I've seen this on some KVM hypervisors which were CentOS 7 systems. These systems were equipped with 128G of RAM and had two virtual machines running, each configured with 32G of virtual RAM. They ran fine until one day the kswapd kernel process ran at 100% CPU usage even though no swap was configured whatsoever (to avoid the situation mentioned above). The "fix" was to dump the system's caches with "echo 3 > /proc/sys/vm/drop_caches", which seemed to calm kswapd down again. As best I can tell, what happened is that the system used all the free RAM for the page cache and buffers, and when the system needed some memory it apparently preferred to swap pages out to disk (even though no swap was configured) rather than reclaiming page cache, of which there was plenty to reclaim. Unfortunately that means there seems to be no bulletproof way to say "only use physical RAM and never try to swap anything out to disk". Even /proc/sys/vm/swappiness can be dangerous, as a value of "0" doesn't actually tell the system to only swap if absolutely necessary but can lead to OOM situations even if swap space is still available (see https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux... for details).

TL;DR: 1) Don't just pay attention to the amount of swap space but to actual swap activity over time. 2) Be aware of corner cases and bugs relating to swap.


I've been making the same point about pagefile size versus paging I/O for a couple of decades now, finally turning it into an FGA a decade ago. (-:

* http://jdebp.eu./FGA/dont-throw-those-paging-files-away.html


Ensure Transparent Huge Pages are disabled. There is a known bug/leak. The only fix is to set it to madvise or disable it altogether.


That sounds like a bug worthy of reporting.


I'm sorry, but not a single reason was presented in favour of swap when you're 100% confident that you will not max out your RAM. Swap = completely useless slowdown if you have total control over your system.

Better use ulimit to prevent processes with e.g. memory leaks from causing havoc.


The most compelling reason to enable at least some swap space is that the Linux kernel's memory handling algorithms can't cope with there being no swap. If you have no swap, then you will encounter stalls and OOMs even if you don't actually run out of memory. Adding even a tiny amount of swap prevents this from happening.


This sounds as if it is a serious issue and should be fixed. Is there a bug report about it?


This is due to memory overcommit, it's a performance feature. You can turn it off.


Interesting; I sort of remember an interaction with NUMA as well. I don't believe this is a bug, just a misunderstanding of the way the underlying system works. From what I remember, when swappiness is 0 there are lots of reports of processes that use more than a single NUMA node's worth of memory getting killed by the OOM killer, even with plenty of free RAM available. I unfortunately don't remember the details, but this prevented the memory from being allocated on the additional node.

I've heard of this mostly with MySQL, where it's common to have a big server with lots of RAM but a single large process that uses most of the system RAM. The way we got around this was by setting the process to allocate interleaved among the NUMA nodes.

I'll have to dig into the overcommit.


It's not a serious issue at all. It's intentional.


I agree with this; lack of swap makes the OOM killer too aggressive. The question is what's worse: no app (and reacting to the OOM) or an app that's swapping?


OOM is tunable.


They presented one reason: the system will swap out more or less unused memory and give that much more space for disk cache, etc.

Now, I'm not sure I find this a particularly compelling reason. I suppose it depends on how much RAM you have available for cache vs how much unused junk gets swapped out. If you've got 5G of unused RAM and 300M extra of stale allocations that get swapped out, then it's probably not going to impact your system in any measurable way.

I personally like a quick OOM, myself, rather than a machine slowing down to a crawl for multiple minutes while swap gets exhausted.


RAM freed by swapping some infrequently accessed memory out to disk becomes available for cache use. If you have total control of your system, you know when you can safely allow data to be paged, and when you need to lock memory to physical ram.


Can you ever be 100% confident?


Yes. ulimit and cgroups do wonders.


Swap isn’t needed at all in this age. It’s just there for people who can’t afford more RAM and therefore have to resort to hacky solutions like this.


The article specifically lists reasons why you should enable swap when there is plenty of RAM. Can you elaborate on what part of the article's reasoning you disagree with?


> It’s just there for people who can’t afford more RAM

So I should throw away my 5-year-old notebook that's still working perfectly fine just because the RAM is soldered in? Doesn't sound like efficient resource usage (both money and rare earths).


Disk is still way cheaper than RAM, and I always add some swap space to my servers just to lessen the chance of getting ENOMEM errors if/when an unexpected or unusual workload would come along.


What an absurd statement.

Do you use a laptop? I do, and I'm using swap space right now. I would like to buy a laptop with 32 or 64 gigs of RAM, but I can't, as most (all?) laptop makers don't ship a memory controller that permits more than 16 gigs of RAM. And I can afford to buy more RAM.


What is the rule for setting swap size? And can I just use a swap file, like on Windows/macOS? (I heard that disables hibernation.)


The basic rule is `RAM + 2GB`

I usually pick 34GB, unless I have reason (like needing/wanting to enable full hibernation) to use 66GB, 130GB, etc.


What if you increase your RAM after installing the system?


It's a circular argument that people get into because they have the incorrect kernel settings. Even if you have 4TB of RAM, someone will say to add more memory on disk. It just means the system is not configured correctly. I have 30k servers and not a single one of them has swap.


>I have 30k servers and not a single one of them has swap.

Physical? VM? Cloud?

I've never seen an environment with more than a couple carefully-tuned machines that didn't run swap on every last one


30k physical servers. 512GB to 1TB ram each.


>30k physical servers. 512GB to 1TB ram each.

You running an AWS data center?


Not a public cloud. It does involve some private cloud virtualization and containers.



