Swap, swap, swap, and bad places to work (rachelbythebay.com)
213 points by r4um 66 days ago | 119 comments



Swap (her solution), xor the kernel should just OOM-kill Chrome; but spinning the disk for half an hour to accomplish borderline nothing for a human who has long since gotten bored and left is pointless. The behavior noted in the LKML post is severely annoying, and not productive.

My previous team was also a "no swap in prod" shop, and this behavior bit us more than I care to admit. The devs were occasionally on the side of "swap for safety", ops was religiously no-swap, and ugh. It can take 10+, even 30+ minutes for systems encountering this to resolve to some meaningful conclusion, and half the time I'm desperately trying to ssh in so as to kill -9 the errant task anyway, but ssh is paged out, and I wish the OOM killer would just do it for me instead of Linux trying to page everything through what feels like a single 4KiB page. I need to play around with sysctls more on some sort of test rig.
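
For anyone headed down the same road, the knobs I'd start poking at are something like this (values here are illustrative guesses, not recommendations):

  # Bias the kernel away from swapping anonymous pages (0-100, default 60)
  sysctl vm.swappiness=10
  # Keep a larger emergency pool of free pages so reclaim starts earlier
  sysctl vm.min_free_kbytes=131072
  # 2 = strict accounting: allocations fail instead of overcommitting
  sysctl vm.overcommit_memory=2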

On AWS instances with EBS disks (most instances), disk is basically network.

I once suggested "cgroup'ing" (loosely speaking) the entire system into two rough buckets: one for SSH, with enough dedicated RAM that ssh will never get swapped, and one for everything else.
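
On a cgroup-v2 system with systemd you can get a rough version of this today. A sketch (assuming the unit is called ssh.service; on some distros it's sshd.service):

  # /etc/systemd/system/ssh.service.d/reserve-memory.conf
  [Service]
  # cgroup v2 memory.min: pages under this are not reclaimed under pressure
  MemoryMin=64M

  systemctl daemon-reload && systemctl restart ssh

Whether 64M is actually enough to keep a login usable is exactly the open question discussed further down.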

Also, I feel like the number of devs out there who understand that mmap'd files — including binaries/libraries — are basically mini-swap files when memory pressure is high is really low; more than once I've diagnosed a machine as "page thrashing" only to get back "what? but it has no swap, that cannot be?". Well, pgmajfault and disk I/O metrics don't lie.
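
A quick way to check, for the skeptics:

  # cumulative major faults: pages that had to come back from disk
  grep pgmajfault /proc/vmstat
  # or watch the per-second rate ("majflt/s") system-wide, via sysstat
  sar -B 1

If that counter is climbing fast on a "no swap" box, it is thrashing its file-backed pages.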


> I once suggested "cgroup'ing" (loosely speaking) the entire system into two rough buckets: one for SSH, with enough dedicated RAM that ssh will never get swapped, and one for everything else.

I feel like I've been waiting 25 years for someone to implement this by default in every operating system, and yet no one seems to do it. (Arguably iOS does this with Jetsam priorities, btw.) Why is it so hard to ensure that these base components get guaranteed RAM? It was always extra infuriating on Windows: you'd end up in swap hell and need to kill something because the computer was running insanely slow; so you'd hit ctrl-alt-del and the UI would instantly respond with a menu that worked great... and then you'd click Task Manager, and rather than that functionality being implemented in that menu--or being designed to never swap--you'd get thrown back into swap hell in the hope that eventually the taskman would load. It was insane, and so easily fixed in numerous ways (the simplest probably being to add a trivial task manager, with only memory usage and a kill button, to that OS escape menu).


> I feel like I've been waiting 25 years for someone to implement this by default in every operating system, and yet no one seems to do this.

Damn straight. The Xbox builds of Windows do this too (so the console is still controllable while a game is running), and I swear the lack of control in desktop OSs (you can't start Task Manager to kill the thing that's stopping you from starting anything) will kill desktop computing if mobile OSs keep increasing their capabilities.

A device that isn't responsive to input is indistinguishable from the device being broken.


This principle of reserving resources for OAM is something people should really think about in the network, too. Provide some dedicated resources for OAM traffic. I've seen too many people be anti-QoS for no real reason (it's too complicated for them, or, surprisingly often, they believe QoS is for "dumb enterprises" or for "legacy telco service providers, not us cool new guys"). This includes one very large, well-known cloud provider that still resisted this even after traffic from multiple DoS events interrupted their monitoring and management access.


For what it's worth, Control+Shift+Esc will bring up the Task Manager without roundtripping through the other menu. Won't help if the system is thrashing though.


> Also, I feel like the number of devs out there who understand that mmap'd files — including binaries/libraries — are basically mini-swap files when memory pressure is high is really low; more than once I've diagnosed a machine as "page thrashing" to get back "what? but it has no swap that cannot be?".

This is also one of the reasons that putting swap on a different physical drive than /usr and /bin can help a lot. Once you hit memory pressure, I/O caching is going to disappear at the same time as your I/O is saturated doing swap. Just being able to read /usr/bin/ssh from a different drive than the one swap is thrashing can be night and day.


> I once suggested "cgroup'ing" (loosely speaking) the entire system into two rough buckets: one for SSH, with enough dedicated RAM that ssh will never get swapped, and one for everything else.

You can use memlockd for that: https://manpages.debian.org/buster/memlockd/memlockd.8.en.ht...
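
If memory serves, the config is just a list of files to lock in /etc/memlockd.cfg, where a leading + means "also lock the libraries this binary needs" (check the man page; I may be misremembering the syntax):

  /bin/bash
  +/usr/sbin/sshd
  +/usr/bin/top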


Sorry, but that’s not enough. You need more ram to actually make the ssh connection useful. You can lock sshd and then be blocked from running your shell or any commands in that shell. You really do need reserved unused memory.


This approaches understanding why this isn't done; it's not just reserved memory for SSH, it's reserved memory for SSH plus bash plus the vim, top, grep, kill, etc. commands you'll be running once you SSH in. How much RAM does top need? What about vim? Now realize you're imposing this penalty on every embedded, low-memory Linux device which might not ever have this problem.


Linux was quite usable in 4MiB 25 years ago. Just give yourself a minimal busybox environment for rescue activities.


Why would you be imposing this on every device, when I would assume it would be controlled by a settings file in /etc?


But for the actual goal you don't have to reserve it as empty, you just have to prioritize it.


I don't have problems with my desktop (16G RAM, no swap). I do run into OOM on some servers (I've just migrated a service from a 512MB instance to a 2G one because it crashed every few months).

When I have had swap, the machine simply hangs rather than the OOM killer saving it.

What I don't understand is how 500M of swap can help on a 16G machine. Why would it be any better than an 18G machine?


Maybe if the kernel behaved optimally it wouldn't be any different, but empirically it seems like it does help.

My guess would be that if it's all RAM, then the kernel will happily use all of it for IO buffers, etc. and make no attempt to reduce its usage until actually necessary. If some of it is swap, then the kernel will (over time) try to reduce the size of its IO buffers until swap is no longer needed, but that de-allocation doesn't need to happen immediately, or block the rest of the system whilst it does happen.

Maybe there is a way to have that "high water mark" without needing swap - reserve 500M to be only used in "emergencies" - I don't know the details of the linux kernel well enough to know if that's a possibility.


Now I'm suddenly wondering if this isn't really easy after all (or rather: should be really easy if all were well). If Linux behaves better when it has some slack space past the normal RAM, space it doesn't like to put stuff in but will when necessary (i.e. swap), let it do the exact same thing with the last portion of your RAM. In other words: make a swap file of, say, 10% of your memory size, and put it in a tmpfs. Linux probably won't allow that, I believe, but if it worked, it would give the results you want.


> make a swap file of, say, 10% of your memory size, and put it in a tmpfs

You're looking for zram, I believe.
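
Setting one up by hand is only a few commands (module and option availability depend on your kernel and util-linux versions):

  modprobe zram
  zramctl --find --size 2G           # prints the device it grabbed, e.g. /dev/zram0
  mkswap /dev/zram0
  swapon --priority 100 /dev/zram0   # prefer it over any disk swap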


Of course, having more ram would be more useful than having swap in general, but swap provides for a small cushion in case your usage grows over time.

If your load fits into 14gb and you get 16gb of ram and add 0.5gb of swap, when you see swap start to fill, you know it's time to look for memory leaks and/or get more ram. If you have a bigger memory leak / burst allocation, you have a chance of being able to connect in and shut the thing down cleanly.

There are some issues with kernels swapping out 'the wrong' pages and filling swap early. But, assuming that's tuned, I'm not aware of a better indicator that you need more RAM than your swap being full / swap I/O being high.


We had a Hadoop cluster that ran no-swap; it killed our ability to use all the RAM, and when it did exhaust RAM, which was all the time, it would often kernel-panic the whole machine. No swap is not the fix. I love the idea of protecting core components; maybe even give them a core or two and dedicated RAM.


With all due respect, this is poor design on the part of Hadoop and/or the Hadoop configuration. Hadoop workers should be bound to a certain amount of hardware resources, and killed if they surpass them, as opposed to killing random processes / kernel panicking / etc. On Linux, this is realised via cgroups, which originated in Borg and today are widely used by Kubernetes.


I am not-incredibly-surprised that none of these many talks about swap have mentioned Kubernetes. Kubernetes does not solve this problem conclusively, but it does solve it to a huge extent. Your application cgroups have lower RAM priority than system tasks. You allocate enough RAM for system-level tasks in kubernetes (--system-reserved and --kube-reserved). You run your application tasks in a fixed reservation, and any that are over their reservation are first to be killed. If application tasks are repeatedly getting killed, your node is faulty and gets restarted. All of this is done with little fuss. At some higher level, faulty processes are easy to identify because they are the most evicted (and if your system-level daemons are what's at fault, it gets really obvious, because no application stands out as the culprit).

https://kubernetes.io/docs/tasks/administer-cluster/reserve-...
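
For the curious, the reservations are just kubelet flags (or the matching KubeletConfiguration fields); the numbers here are made up:

  --system-reserved=memory=1Gi,cpu=500m
  --kube-reserved=memory=1Gi,cpu=500m
  --eviction-hard=memory.available<500Mi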


I was only "a dev" and had an extremely difficult time getting cluster configuration changes pushed through. It was only because I bribed a sysop that we got a swap file added. We had our own Mr. No to deal with.

Lots of things could/should and would be done differently.


One thing many people don't realize is that Linux OOM killer is nearly broken in a lot of cases.

For example, if you don't have swap space.

Without swap, Linux winds up flushing file caches, mmapped files, etc., and usually re-reading them all over again... the machine is technically not out of memory or "swapping", but effectively it's swapping, and it's unresponsive.


>It can take 10+, 30+ minutes for systems encountering this to resolve to some meaningful conclusion, and half the time, I'm desperately trying to ssh in so-as to kill -9 the errant task anyway but ssh is paged out, and I wish the OOM killer would just do it for me instead of Linux trying to page everything through what feels like a single 4KiB page.

Sounds like you have a redundancy problem, not a swap problem. You should just be able to kill a machine that gets into a bad way like that and move on. What if it wasn't swap but one of the million other things that can make your server crawl?


Generally speaking, swap/page thrashing is very easy to pick out from the metrics on a VM. (We use a system called Prometheus[1], which records and transmits metrics to a centralized service.)

In particular, a machine that is swap/page thrashing will generally show as having no available RAM and a high number (especially relative to baseline) of major page faults (the ones that require disk reads), and often a CPU profile spending a lot of time in I/O wait too, I think, though I usually just use out-of-RAM + page faulting. Also, the metrics service tends to go dark shortly afterwards — it's having the same trouble as everything else on the VM at getting CPU time.

The major page fault count, perhaps the key stat for "this machine is page thrashing" since it directly corresponds to it, is found in /proc/vmstat and is called "pgmajfault". Though like I said, we generally had Prometheus and Grafana to turn these into pretty graphs, and to export them out of the VM itself, since when something is page thrashing, getting it to do anything is hard.

CPU contention lacks the "out of RAM" part, and won't knock the metrics offline. Network contention can knock out the metrics, but often doesn't, and lacks the other signals: out of RAM / page faults. Disk I/O contention lacks the out-of-RAM part and doesn't knock the metrics out, since they don't require disk I/O (beyond being paged in). (And those — CPU, RAM, network, and disk-ish — are about the only real resource dimensions on a VM.)
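
If you also run node_exporter, that fingerprint is a couple of PromQL expressions (metric names as of recent node_exporter versions):

  # major page faults per second: the page-thrashing signature
  rate(node_vmstat_pgmajfault[5m])
  # available RAM as a fraction of total
  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
  # share of CPU time stuck in I/O wait
  rate(node_cpu_seconds_total{mode="iowait"}[5m])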

Alternatively, if you let it play out and the VM eventually recovers, it might nonetheless decide to OOM kill a thing or two along the way, and those show up in dmesg / on the console, if you can get to those.

> You should just be able to kill a machine that gets into a bad way like that and move on.

I must admit that the production systems I loved and cared for were not always perfect. Often, yes, I could, but there were a few spots where things weren't so rosy. Even when I could, I generally wanted to have some understanding of why the VM went under, so as to not have the problem come back again later on a different VM. Page thrashing, in particular, is basically always symptomatic of a bug. And in distributed systems, the bugs are also distributed.

The economics of getting devs enough time to develop to that quality vs. management wanting new features has been one of the hardest challenges of my career.

Also, for a while I lacked permission to actually kill a machine, since ops was locking permissions down because "devs shouldn't have access to the actual machines"; like she says in one of the linked posts: fine, have my pager, you deal with the pages.

[1]: https://prometheus.io/


Swap, the sub religion.

I've wobbled back and forth on swap. In the early days I used to be annoyed by how much disk space it'd take (an 8 gig disk with 1 gig for swap is too much).

I've run an 8-core machine with only 2 gigs of RAM and tried to compile something with boost in it. Swap allowed me to kill the build and recover the system.

I've run VMs with no swap, some swap and loads.

However, what I've never done is actually benchmark the same workload on machines with no, some, and loads of swap. Still, I generally defer to Rachel, because Rachel has been there and been bitten by that before.

On this point, this remark from http://rachelbythebay.com/w/2018/04/28/meta/ (which everyone should read and digest) jumped out:

>"This is so easy to test for"

If I ever say this and never qualify it, it should read: "haha, yeah, I made that same mistake, that's why I test for it now."

The only reason why I am "better"[I hope] than my younger self is that I've made a bucket load of mistakes before. Some of them are technical, but to be honest a load of them are societal. (as in, bleating like Cassandra and not being able to effect change.)

If we, as "engineers", are to grow as a class of people, we have to actually learn from other people's mistakes, not just use them as bias confirmation. This is why I like blog posts that lay out the problem, fallout, cause, workaround, and eventual solution.


It's not always about performance. Years ago, while traveling, I installed Linux to a flash drive to carry my entire OS, apps, and data with me. While I did back up often, I was well aware of the danger that excessive write wear posed in killing my stick, so swap was disabled. To this very day, and for the same reason, my Pi also has swap disabled.

As a result, I've encountered this flaw in the Linux kernel design multiple times. And it is a flaw, as you only have to run the same test on Windows to see how an OS should behave when you run out of RAM: a hiccup, followed by your offending tab being terminated by the OOM killer. Locking up the system for 10 min. is just not acceptable.


> Some of them are technical, but to be honest a load of them are societal. (as in, bleating like Cassandra and not being able to effect change.)

This is where I am now: I'm comfortable making technical mistakes and know enough to avoid the most grievous pitfalls, but I struggle to expand the sphere of things affected by my changes beyond a few people or a single team/product. Some people get into management for this, but I've seen that management is mostly planning, budgeting, hiring/firing, and non-technical communication.


This may seem like dumb wordplay, but while management may seem to be about hiring / firing, budgets, etc, _leadership_ is about _influence_.

Influence and leadership are force multipliers. If you can lead people well and influence decisions (be it as a people leader, a technical leader, or both), more and more people can learn from your experience.

There is no doubt that leadership involves doing 'management' more often than not. But it's just one aspect, it's not the end in itself.


Depends on the company.


>benchmarked

This is what this story is really about. People not instrumenting their systems, no benchmarks, no real clue how things will perform under load.

Swap or no swap: it doesn't matter what you do if you haven't benchmarked the system before you set it up. In that case it's your customers who will tell you, eventually, how your software behaves...


>However, what I've never done is actually benchmarked the same workload on machines with no, some and loads of swap. However, I generally defer to Rachel, because Rachel has been there and been bitten by that before.

I have. A well utilized machine is going to absolutely tank once it hits swap. Do you want to engineer your application to be able to cope with two radically different performance regimes, or do you simply want to ensure that your working set stays bounded?


If you ensure your working set stays bounded, having swap enabled (say, with swappiness 1) is not a big deal. If you're worried about the performance characteristics of swap, you've already conceded that you aren't confident in the bounds of your working set. So it doesn't seem like the relevant question is "Do we want to have well-bounded memory or use swap?" so much as "Can you tolerate some temporary slowness more easily than a server crash?"


I have too. I've also been in the opposite scenario, where not having swap caused runaway service kill-and-restart cycles. Reading between the lines, it looks like Rachel has too, which is why she advises a "not too much" swap approach. As usual, the answer is not one extreme or the other; it's much more nuanced than that.


>I've also been in the opposite scenario, where not having swap caused runaway service kill-and-restart cycles.

Seems like a good time to just kill it and move on. "Cattle not pets."


The funny thing about cattle is that they stampede. One host goes down the load increases on the others. They go down too. That cascades until they are all in a kill/restart cycle. I've seen lack of swap cause this. I'm betting Rachel has too.


It's still not as simple as that.

Swap is there to relieve memory allocation pressure. Memory allocation pressure is an incredibly dense concept but it basically means "if lots of people are asking for fresh pages, how quickly can I service them"?

There are also other types of memory pressure. One is "if lots of people are reading and writing to pages, how quickly can I service them?"

This depends on the state of the mapping for the virtual page being read or written. If that virtual page has an associated physical page then the answer is "not too slowly". If the physical page is in one of the caches then the answer is "more quickly". If the relevant part of the page is in a register then the answer is "in one clock cycle".

On the other hand, if the read or write is associated with a page that needs to be mapped in, either from a disk file or a swap file, then the answer is "slowly".

These types of pressure need to be balanced with each other. The idea isn't just to keep as much data in physical RAM as possible. The idea is to use the RAM as effectively as possible (perhaps by keeping as much relevant data in RAM as possible) whilst also being able to respond effectively to requests for fresh pages.

In general, under pressure and load, the Linux kernel tries to keep a few MiB of ready-to-allocate pages. When it runs out of these it raids the page cache. It's reasonably quick to get (non-dirty) pages from the page cache because they can just be zero'd and handed out. The most difficult pages to reclaim are those that must be copied to swap before they are zero'd. So the kernel tries to minimise that.

How does it do that? Well, it has some tricks:

When processes load and run they allocate a bunch of pages. Sometimes, and not infrequently, they allocate and write to pages which are never accessed again!

I see this on my laptop all the time. I never use even close to the physical amount of RAM I have in the machine. Right now I'm using 4,959MiB of 15,799MiB and it rarely surpasses 6 or 7 GiB. However, if I leave it running for any length of time (a number of days or weeks) then I start to see a little bit of swap getting allocated. I've currently been up about 98 days and right now I'm using 317MiB of swap.

What's happened here is that the kernel has swapped out pages that it thinks are never going to be used again. That way, if memory allocation pressure suddenly increases it's got those physical pages ready to service those requests.

If there was no swap, those pages would be unnecessarily pinned into physical memory even tho' they would never be used.

Another commenter asked a question like "What's the difference between a machine with some memory and some swap and a machine with more memory?" Well, there are a few subtle differences, but this is one of them. The machine with swap will have a higher percentage of physical memory available and will be able to respond faster to larger allocations.

Another difference is price. RAM is still expensive compared to disk. If you have a 16GiB box with a bit of swap then you have a 16GiB box that you can use for data that's actively being read and written. If you don't have the swap then you're paying for some of that RAM that gets written and never used. This doesn't matter so much in the small scale but when you have a few machines you want to be getting the most out of them (and what that really means is the subject of another post!).

Swap space gets you more bang for your buck.

So swap space is really about giving the kernel a mechanism to manage the different types of memory pressure without taking too many compromises on the different trade-offs. Even if you're never using all your RAM, swap will still get used so that the machine can respond optimally to as many types of memory activity as possible.

One thing I'd really like to know is whether the number `free` gives me for "Swap used" is the amount of data that's in swap and nowhere else or whether it's the amount of swap space that's used even if the pages are also still in physical memory.


You make a good point about swapping out unused pages, and I'd love to discover how to balance maintaining the "fail fast" of swap-off against the ability to swap out never-used pages with swap-some. While it would be nice to have that extra RAM, it's not worth the risk, in my world at least.


> how to balance maintaining the "fail fast" of swap-off against the ability to swap out never-used pages with swap-some

A monitoring process like earlyoom can do this for you. Features like cgroups, setrlimit(), and prlimit() can also help; the last one can even be used to dynamically adjust memory limits for each process and avoid the kernel's weird OOM behavior. This is all in accordance with fail-fast, which requires monitoring processes that actually deal with all the failures.
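
A sketch of both approaches (flags from memory; double-check against your versions):

  # cap an already-running process at ~2GiB of address space (util-linux)
  prlimit --pid 1234 --as=2147483648
  # or let earlyoom kill the biggest offender before the thrash sets in:
  # act when less than 5% RAM is free; -s 100 effectively ignores swap
  earlyoom -m 5 -s 100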


A non-swap comment:

> For everyone else, you'd probably cry too. I sure did.

I remember a colleague (employee) crying because a third party vendor screwed up and tried to blame us which would have sent a multimillion dollar project down the tubes. What saddened me was she felt the need to apologize. It was a pure expression of frustration, anger, sadness, and exhaustion on a project we were all deeply committed to, produced by the brazen unfairness of this contract house.

It's not good to live in a culture that denigrates human expression. I'm glad rachelbythebay was able to express this.

* We were able to apportion blame properly, get a proper result from someone else, and make the regulators happy with no funny business.


Desktop machine? Swap. For some reason you have a single server and don't care about performance? Swap. Running an application that might have a working set that's larger than RAM, where the application doesn't know how to do its own disk paging? Swap's good there!

Larger scale systems with redundancy? No swap.

Having swap in systems like this still doesn't make sense to me. It treads heavily on the "cattle not pets" philosophy. I shouldn't be ssh-ing into a machine that's swapping to see what's up. It should be killed. One server in the cluster starts swapping and falls out of step with its peers? It should be killed. When a machine starts swapping it falls into a whole different performance regime from the rest of your systems, and now you've got more variance in your response times. Not good when you care about your response times. Unless you have memory-pretending-to-be-disk for swap (in which case, why isn't it just memory?).

I've never seen a machine 'act funny' because it didn't have swap; it's always the other way around. I don't think I've ever encountered a machine that used so much memory that the kernel didn't have buffers, but not so much that it invoked the OOM killer, unless there was a woefully misconfigured process running on the machine.

If a machine is well utilized CPU wise it is going to get absolutely crushed when it starts swapping.

Time and time again I see swap being an issue. For the past year I've been in a large-scale shop which, for some ungodly reason, has swap (nowhere else I've been in the past 10 years has had swap as a general rule).

Don't even get me started with EBS IOPS exhaustion when you start swapping onto an EBS volume.


> I don't think I've ever encountered a machine that used so much memory that the kernel didn't have buffers, but not so much that it invoked OOM killer.

Note that the reason this topic is currently in vogue is that this situation has become a lot easier to hit recently. If you run the system on a modern low-latency SSD, the current OOM killer algorithm often fails to kill anything before the entire system is on its knees with approximately 0 pages left for I/O and non-anonymous memory, at which point the OOM killer will never run because the machine is so thoroughly locked. The proper way to fix this, of course, is to make the OOM killer trigger earlier.


I would like OOM killer to be smarter and maybe easier to configure. I'm glad you can instrument it better with BPF now, at least.


> Larger scale systems with redundancy? No swap.

Why not give them swap, set off pagers, and _maybe_ kill them? There could still be something worth investigating there, and having swap will make that easier.

You also don't want to have a cascading failure where a massive leak makes all your machines fill their ram, and start killing everything like crazy.


This is a plausible route, but it still requires some engineering, specifically tweaking the swappiness setting. Otherwise the swap will get used even with plenty of memory available, which in my experience can still cause havoc for GCed processes with high allocation rates on non-SSD disks.


Why let them live? Why wake myself up? Now your swapping systems are introducing a performance degradation.


Quote: "maybe kill some of them" Quote: "You also don't want to have a cascading failure where a massive leak makes all your machines fill their ram, and start killing everything like crazy."

Cascading failures are a very real thing that have knocked whole systems offline.

It sounds like the real solution is a balanced solution involving some engineering: kill them if you aren't killing _everything_. Page if the problem is ongoing, not if a couple of machines have a problem.

Either way, you can add swap _and_ kill them. One does not preclude the other.


I largely agree.

Although I do find the scenario of the kernel evicting mmapped pages and causing performance degradation interesting and plausible, I haven't personally witnessed this behavior.

Where I see swap tend to get especially detrimental is with GCed processes. I've spent significant effort tracking down long GC pauses to the collector getting blocked on swapped-out pages (although the software was unoptimized and partly responsible as well, and this was spinning rust). But in line with the article and your comments, this depends on engineering the system to have headroom.

IIRC processes that use more than a NUMA node worth of memory also run into some issues with the OOM killer with swap disabled, unless set to interleaved on the NUMA policy. So that's another thing to look out for when dropping swap, although I forget exactly why it happens.


I wonder why we talk so often about swap but rarely about using zram. I mean, isn't it much simpler to add some zram as swap instead of messing with partitions? In the end it should solve the problem equally well, shouldn't it?

I have seen this being done on Android devices and wondered why it is being used so rarely in other areas (Desktops/Servers).


Swap dates to the 1960s. zram was introduced to the mainline Linux kernel in 2014.[1] zswap presumably came later, though Wikipedia states it was added in 2013.[2] Hrm, actually, they appear to be two distinct features.

There's a lot of institutional knowledge, and mythology, around swap. Less so around zram/zswap, and what there is has to compete with other capabilities and lore.

I've been wrangling boxen since the late 1990s, and using Unix since at least the late 1980s. I'd only run across references to zswap/zram a few weeks ago when attempting to compile OpenWRT, and didn't look into them until seeing your comment (one reason for writing copiously footnoted HN comments -- I might accidentally learn something).

zswap might very well be The Answer We've All Been Looking for, but, well, All Of Us Realising that is another stage in the Hierarchy of Failures in Problem Resolution.[3]

________________________________

Notes:

1. https://en.m.wikipedia.org/wiki/Zram

2. https://en.m.wikipedia.org/wiki/Zswap

3. https://old.reddit.com/r/dredmorbius/comments/2fsr0g/hierarc...


Thanks for the info. For everybody who wants to use it with Arch Linux, the following link should give a good overview:

https://wiki.archlinux.org/index.php/Improving_performance#Z...

If you want to use the systemd solution, you can do it like this:

  pacman -S systemd-swap
  vim /etc/systemd/swap.conf   # disable zswap, enable zram
  systemctl start systemd-swap.service
  systemctl enable systemd-swap.service


Using zram swap in a recent Debian is as simple as installing "zram-tools" (and changing the size in /etc/default/zramswap if you're not happy with the default).
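
Something like the following, if memory serves (the exact variable names live in the comments of that file):

  apt install zram-tools
  # /etc/default/zramswap
  PERCENT=25    # size the zram device as a percentage of RAM
  systemctl restart zramswap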


... and knowing to do that. And knowing to do that over other methods.

Institutional knowledge, mindshare, and documentation are all Things.

Debian's documentation (Debian Administrator's Handbook, Debian Installation Guide, Debian FAQ) do not appear to mention either zram or zswap at all. DAH/DIG do mention swap configuration, but only in terms of traditional swap patterns.

There is mention on the Debian Wiki: https://wiki.debian.org/ZRam But not under the Swap topic: https://wiki.debian.org/Swap

"As easy as" really doesn't mean much if the information isn't accessible. It's also harder to to advocate if it's not at least mentioned in standard documentation.

If you're aware of any standard Linux documentation mentioning zram/zswap as options, please let me know.

Again: this is not an argument against the technical merits or advisability of zram or zswap. It's an argument that knowledge of these options is not widely disseminated or assimilated. I'd commented recently on the matter of intergenerational knowledge transfer, both general and specific (https://news.ycombinator.com/item?id=20617656). This would be a case of that.


Zswap is still backed by disk. It's better than raw swap, of course, but disk is still a problem. I fall back to zswap only if I can't use zram on a system.


And you probably can't even find an environment these days where swap on disk works better than swap on zram; people are just unaware that it exists and of how well it performs. You can squeeze so much out of zram, it's ridiculous. I'm thinking of dropping the FreeBSD systems I have just because of how much more I can fit on a system with zram, and FreeBSD doesn't have it.


We are hoping to have something similar in the next FreeBSD release, FWIW.


Yeah, a small bit of zram with alerts or automated scaling when it starts to fill up seems like the right answer here.

zram and zswap are also a lifesaver on low memory devices like chromebooks, especially if they have anything slower than nvme storage.


I don't know if this is just the big corporations in Silicon Valley; guys in general around here (in tech) seem like that. There's a whole movement around empathy and vulnerability, but that just makes the competition more veiled.


You mean that some people just perform empathy and vulnerability?


I've run clusters of several thousand machines, installed with petabytes of RAM and no swap (or even disks).

It works just fine; however, you need to keep appropriate headroom to allow the kernel to do its thing with caches, as indicated, otherwise things get very weird very quickly.

Containers are very helpful in this regard for helping explicitly divide a machine up between processes without allowing any one to get out of hand.


I guess this works if you run a few well-known apps that manage memory well (yeah, right). Maybe recycle/restart at regular intervals. Some programs have 4-8GB limits and will crash by themselves. Or you have a watcher service and recycle/restart past some threshold. And presumably your distributed system can handle a few nodes going down. But without memory they will start to act weird. With swap they will usually keep working, just slower, and might eventually grind to a halt anyway, but it makes a graceful shutdown/restart easier.


If your application starts running slower it should be killed. Redundancy means not having to worry about graceful anything.


You've stated this multiple times in this thread now.

Maybe consider that not everyone has the same use case as you: some people are running larger chunks of computation per node or even stateful applications. Even when they could be killed and are redundant, running a bit slower for a moment until they recover may be preferable to restarting the node (which also takes time).


There are lots of different application models. The issue here is the ton of rule-of-thumb regression from "redundancy is important" to "swap is okay after all, I guess."


> I’ve run clusters of several thousands machines installed with petabytes of ram

Do you mean a cluster of many computers that are collectively seen by the OS as a single machine with petabytes of RAM? Can I do that too?


Sorry if it’s not clear.

The cluster had thousands of individual machines, each performing their own discrete tasks with aggregate ram across the cluster in the petabytes.

There was no swap on any of those machines.


As an aside, the commentary at the bottom of the post is very insightful and a great example of how important it is to carefully choose what is being optimised for.


On the swap aspect:

I absolutely hate Linux's behavior with swap enabled, as described in a previous thread: https://news.ycombinator.com/item?id=20479622

It makes sense that it can also be broken with swap disabled: paging out too many file-backed pages can also lead to an unresponsive system.

> Earlier this week, a post to the linux-kernel mailing list talking about what happens when Linux is low on memory and doesn't have swap started making the rounds. ... Now, here we are in 2019, and we have a fresh set of people still fighting over it, like it's some kind of brand new dilemma. It's not.

The problem isn't new, but the approach I saw them discussing (use the new PSI stuff to OOM kill early) is new—PSI was only added ~a year ago, iirc. So I think this comment is unnecessarily dismissive.

I've seen the systems behave badly without swap. I don't see the bad swapless behavior as often personally, but I believe it exists. (In particular, I haven't tried the reproduction instructions in the lkml thread.) I don't know how the "tinyswap" approach is supposed to help—I'd love details. Swapless with the PSI-based OOM killing is an approach that actually makes sense to me in theory.
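
For anyone who hasn't seen it, PSI is just a couple of files under /proc/pressure (kernel >= 4.20, built with CONFIG_PSI); the numbers are the share of recent wall-clock time tasks spent stalled waiting on memory:

  cat /proc/pressure/memory
  # some avg10=0.00 avg60=0.00 avg300=0.00 total=0
  # full avg10=0.00 avg60=0.00 avg300=0.00 total=0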


> I stand by my original position: have some swap. Not a lot. Just a little.

Is there some exact figure for this? Like, what percentage of RAM size should be allocated as swap space?


512MiB, no matter how much RAM you have. Keep in mind this is based on "it seems to do the trick" rather than any proper reasoning. But hey, so is the article, I guess.

My reasoning is as follows (and concludes that swap does nothing): 4GB RAM, no swap. Firefox allocates all RAM. Linux starts evicting the program code and data sections and re-reading them from disk. Serious thrashing. Hang.

4GB, some swap. Firefox allocates all RAM. Some stuff is in swap. Firefox allocates more memory. Swap is full of Firefox's stuff. RAM is full of Firefox's stuff. Linux starts evicting the program code and data sections and re-reading them from disk. Serious thrashing. Hang.

It's the same thing, just more complicated and thus harder to debug. It will behave better if you never manage to fill RAM+swap. But today's programs simply allocate more. And the kernel says yes.

IMO the real solution, given that we overcommit, is for Firefox to watch that % RAM usage you see in top(1) and try not to let it go above 95% by freeing cache, and for the kernel to do OOM killing early.
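
If you want to try the 512MiB experiment, a swap file is enough; no repartitioning needed (dd rather than fallocate, since swap files must not be sparse on some filesystems):

  dd if=/dev/zero of=/swapfile bs=1M count=512
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  echo '/swapfile none swap sw 0 0' >> /etc/fstab   # survive reboots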


> IMO the real solution, given that we overcommit, is for Firefox to watch that % RAM usage

Doesn't it do that already? I remember 3-4 years ago I used to watch in amazement as people complained that Firefox ate over 2GB of memory, whereas I ran it on a laptop with 2GB total, and FF rarely hit 1GB even with dozens of tabs open.

EDIT: Mozillazine seems to confirm my experience: "These numbers will vary because Firefox is configured by default to use more memory on systems that have more memory available and less on systems with less" (http://kb.mozillazine.org/Reducing_memory_usage_-_Firefox)


This is annoying on Android. The system would normally kill (or suspend, or whatever exactly Android does in those circumstances) other applications if Firefox ate too much memory. And I would be happy with that, but unfortunately, under memory pressure, Firefox kills less important tabs instead. With a memory hog in the background (those wonderful mobile games), I can jump between maybe 2 or 3 tabs before Firefox starts killing.


There are small systems in which 512 MB (or MiB) isn't available, but in which some swap might conceivably be useful. Mostly networking gear, which still ships with absurdly low storage/memory allocations (my DSL router has 64 MB RAM and 8 MB flash, as an example).

Not unviable under the stock OEM firmware, but confining when switching to, say, OpenWRT. Not that I'd know about this or anything.


The default for Debian 9 was 100% of RAM. I have 64 gigs, so it was a bit much (especially as I only have 128 gigs of disk).

The only time I've ever seen my[1] swap go above 500 megs was when I was running qterminal with a memory leak. It was a long-running terminal, with a long-running process that was extraordinarily chatty.

4 gigs is probably enough for most people under most conditions.

[1] this is on workstations. On servers, I've seen it happen all the time. Crucially, it gives one time to react.


I just did a fresh install of Debian 10 and it is still the default. I understand that they want to support hibernation out of the box but it creates gigantic swaps for a feature that I believe is rarely used, nowadays sleep is good enough and reboots are super fast.

It would be great if the installer optionally asked how much swap to allocate without having to do the full partitioning manually.


Interestingly enough, the Debian installer freaks the fuck out if you have more RAM than disk and choose the auto-partition option. We've run into this a few times with servers that have 512GB RAM and a 120GB SSD installed. The installer doesn't seem to recognize this and pick some sort of sane swap amount; instead it gives you some cryptic error messages about a bad disk and hangs.


You mention time to react. How do you get notified that memory usage is getting out of hand before the system is actually unusable?


Depends on the service it's running, really.

I'd assume that you'd have key metrics plotted in Grafana, so it's a case of following up on/alerting on those.

In a previous case it was fairly simple: an alert fired because the queue size ballooned. Looking at the machine stats, we saw that all the RAM cache had been ejected just as performance started to drop.

Either way, it's pretty trivial to put an alert on a stat in Grafana.

however, your mileage may vary.


It depends on your workload, does it not? Generally you'll want swap to be mostly unused, so you don't need a lot, but if you want to e.g. hibernate a laptop, you'll need at least as much swap as you have RAM.

Also, if you use tmpfs, you should have enough swap to cover the size of your tmpfs partitions in their entirety, so that a bad application won't eat all your memory.


It depends on what you're doing on the box. Lots of disk I/O and lots of different apps each causing a spike of RAM usage at a different time, like a typical LAMP server? You probably need a bit more swap. Just one app with a very consistent RAM usage, like a database server with an explicitly configured buffer size? Then you don't need much swap. The old rule "2x RAM" actually isn't bad on a desktop PC, but it's a terrible default on a server.

Linode has been configuring 256MB swap partitions by default on their VPS for a long time, even for large plans. Maybe it's bigger for the really large plans, but I haven't tested them. Anyway it feels like a nice default, and seemed to work fine with most kinds of loads in 10+ years of usage. Some of the newer VPS services (DigitalOcean, Lightsail) come with no swap by default, and I don't feel comfortable about it so I add a 256MB swapfile on them. I do turn down the swappiness a bit, though.


Just as a gut-feeling guess, I feel like 256MB or 512MB should be fine. I don't think we need to deal with percentages; the number is likely independent of the total RAM amount. Not enough to run things on (and you don't want to run things on swap anyway), but enough to get you out of the weeds when the system is under heavy memory pressure.


If you don’t hibernate or other reasons to use swap, what does adding half a gig of swap do that adding that as extra ram doesn’t?

Or more practically: if I'm doing stuff with 8 gigs and then I upgrade to 16 gigs of RAM, why would I need to keep my 512MB of swap? Surely the extra 8 gigs comfortably covers that extra room? Am I missing something?


> If you don’t hibernate or other reasons to use swap, what does adding half a gig of swap do that adding that as extra ram doesn’t?

Linux (and generally most popular OS) really don't like having no swap whatsoever (https://lkml.org/lkml/2019/8/4/15).

Plus if you're running out of RAM (regardless of how much you have, sure put more in if you have it) the system starts visibly degrading but remains recoverable when it starts swapping. If it can't swap, it pretty much just dies.


> If it can't swap, it pretty much just dies.

The point with my example was that if 512 MB of swap would save you, then why wouldn't 8 GB of extra RAM save you?

Yes, if you run out of RAM, you're in trouble, but if you run out of RAM+swap you're in trouble too, so what's the point in adding a small 256 or 512 MB swap?

OK, so the point is that it's visibly bad with swap before it actually dies, giving you time to recover. How about setting up some kind of alert for when you're in your last half gig or gig of RAM, then? It seems like that would give a much better recovery experience. If it's a server, you need an alert anyway, since you're most likely not going to notice the bad performance until it's too late (you're not watching it 24/7, I assume).

Of course, from the article:

> I stand by my original position: have some swap. Not a lot. Just a little. Linux boxes just plain act weirdly without it.

That's fair enough, if the reason is to stop linux from acting weirdly, then fine.


> The point with my example was that if 512 MB of swap would save you, then why wouldn't 8 GB of extra RAM save you?

Having 8 GB more RAM might avoid the issue, but it won't visibly degrade the system so you will not see that you're in trouble.

And again if you can have both, have both.

> Yes, if you run out of RAM, you're in trouble, but if you run out of RAM+swap you're in trouble too, so what's the point in adding a small 256 or 512 MB swap?

Because if you run out of RAM and have no swap, the system dies. If you run out of RAM and have swap, it starts swapping, which is noticeable.

> How about setting up some kind of alert when you're in your last half a gig or gig of RAM then? It seems like this would give a much better recovery experience. If its a server, you need an alert anyway since you're not going to notice the bad performance until its too late, most likely (you're not watching it 24/7 I assume).

If it's a server you need to do that anyway because the swapping will probably not be noticeable.


> Because if you run out of RAM and have no swap, the system dies. If you run out of RAM and have swap, it starts swapping, which is noticeable.

Only if you're using the system, or are reacting to monitoring fast enough (if you can monitor it while the thing you're monitoring fails).

I'd rather have a dead system than one that's not working

But in any case shouldn't OOM killer come to the rescue?


> But in any case shouldn't OOM killer come to the rescue?

Indeed. It's similar to the argument that segfaults are good because of fail-fast. Wouldn't it be better to fail fast due to OOM than to hobble along?

If you're actively using the system, OK, you might get a chance to save your work or whatever first, but in my personal, anecdotal experience that process is pretty much dead anyway and I have to kill it. Yes, my system doesn't get taken down, but if OOM takes the system down, why isn't the kernel killing the process that's eating all the memory? It sounds to me like swap is just masking the problem.


> I'd rather have a dead system than one that's not working

A swapping system is degraded, not "not working".

> But in any case shouldn't OOM killer come to the rescue?

The OOM killer will heuristically kill random crap. It could be the process with an unbounded memory growth, or it could be your text editor.


I think the point is that you want to know when you are approaching the limit so that you can start shutting down things instead of having your system crash on you.


Have an alert trigger when your RAM gets low? On desktop systems, I've literally never had a problem where I wasn't already aware that my RAM was low (and therefore made sure I saved my work, etc.), and on servers I'm not watching them 24/7, so I need alerting anyway. A tiny amount of swap only buys you a tiny amount of time.


For mixed-workload servers I tend to go for 1GB swap, vm.swappiness=0, and then alerts at around 300-400MB of swap usage. I found alerting on swap to be the best indicator of impending doom, whether it's from a slow memory leak or a buildup from an I/O bottleneck.
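
In Prometheus terms that's roughly this (a hypothetical rule; metric names assume node_exporter):

  - alert: SwapUsageHigh
    expr: node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes > 300e6
    for: 10m
    annotations:
      summary: "swap in use; impending doom likely, check for leaks"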


Usually the square root of your total RAM. Zram is another option, as it's a block device that acts as compressed virtual RAM.


Square root of what unit? That's going to give different results if you measure in gigabytes, bytes or bits.


Almost certainly gigabytes.

Try some numbers yourself.

Anything kilobyte or below gives you effectively zero swap.

Megabytes gives you 90MB of swap on a machine with 8GB ram, that's an unusually small number and if it was supposed to be that small the advice would probably just be "128MB for all".

Gigabytes fit the idea that if you have a smaller amount of ram, you have swap comparable to it, but once you scale up you don't add a lot more for each gig.

Terabytes gives you hilariously large numbers for computers with 64GB of ram or less.



If you don't allocate swap, you have to do other things to compensate, like reduce or eliminate overcommit.

At one company where I worked over a decade ago, we ran some Linux-based equipment without swap as well. To prevent executables from being evicted under memory pressure, I put a hack into the kernel: executables and shared libs were mapped such that they were nailed into memory (MAP_LOCKED).


I would assume the consensus is clear these days: swap is good and should be enabled in most cases for Linux > 4.0.

Of course, real life often differs from theory. Does your machine have a spinning disk or an SSD? I am much quicker to enable swap on an SSD, since it won't be painfully slow should we ever get into a situation where our RAM is saturated.

What happens in cloud VMs? They use network disk storage (transparently to us), and often writes need to be sent over the network more than once (for redundancy). How would extensive swapping behave in such an environment?

As for saying no: it's important to set some rules to avoid chaos, but it's also important to trust our senior people to make decisions. If they need to go against a rule, I would expect a good explanation in their commit —because, infrastructure as code— and documentation. If a junior wants to go against a rule, they can consult a senior. Issuing a no and expecting everyone to follow it blindly is the worst form of micromanagement. :)


It seems to me that if you hit the point where you really need swap, you're already in trouble. Maybe that swap gives you a little buffer before things get really bad, but chances are it will just keep you unaware of your impending problems until they go critical (unless you have lots of good monitoring/alerting).


You would think so, but as Rachel points out the Linux kernel displays some pathological behavior with noswap that even a tiny amount of swap lets it get around. Something weird is going on in the memory management code.


I'm not sure what she's observing other than "feels weird", but my observation is the exact opposite, and my observations usually have some measurements behind them.


It requires a perfect storm of just shy of 100% used memory and a lot of mmapped I/O. In that case the mmapped pages can get shunted through a handful of pages (or even one page), so you lose all ability to do any block I/O larger than one page size, and every page of memory involves a fully blocking I/O request. It's most certainly real for certain workloads (databases are common).


"Feels weird" when she logs on to investigate serious performance problems that she's getting paged for.


> Item: If you allocate all of the RAM on the machine, you have screwed the kernel out of buffer cache it sorely needs. Back off.

Why not just permanently allocate enough RAM for the kernel? If I have 16GB of RAM but the kernel needs 1GB to do its job, then just tell me that I have 15GB to work with.


That would be nice, but quite a lot of kernel code does not work like that, specifically the disk cache. And 1GB is not enough for 32k PIDs' worth of process kernel stacks.


This is an example of where desktop and server engineers could benefit from having embedded design experience.

It's certainly possible to create a small, protected area of memory that contains a kernel-level interrupt handler (which itself allocates no memory) whose sole job is to run a couple of times a second and check for thrashing and OOM. If it sees memory problems, it takes over the computer, determines which processes are using the most memory and kills the ones that are expendable. ("Expendable" is a list configurable by the user and yeah, Chrome would be right at the top for a desktop system.)

Embedded systems designers routinely build such watchdogs into their systems. It could probably be added to Linux as a kernel patch.


In a first for one of these "bad places to work" stories, I recognized the project she described in the "A patch which wasn't good enough (until it was)" post linked from this one, so I looked up the history. Sure enough, I know the developer she was complaining about in both posts.

In the patch case, he asked about testing, and they realized the ssh/scp versions she tested with weren't the same as the ones the code was using. She promised to follow up with best-practice testing and didn't. (Without knowing the reason, this isn't unusual: people get busy and drop things all the time.) I didn't get the same sense of rejection or hostility she did. And the second developer (who got her patch accepted) credited her in the code review, tested in a middle way (better than she originally did, worse than she promised to do later), requested the review from a different person than she had (why I don't know), and got a review question with a similar tone before it was accepted. None of the parties' behavior looked unusual/red-flag-worthy to me.

I don't fault her for imperfectly describing an interaction that was five years ago when she wrote that post and is twelve years ago now. I'm trying to figure out what the lesson is and who should be learning it. A few unorganized ideas:

* Much of what people are thinking and feeling is left unwritten/unsaid, so two people can have very different ideas of what happened. (A reminder I suppose to listen to both sides before making a judgement on something.)

* I don't want to dismiss her feeling about bad team dynamics, even if I don't see them in this particular interaction. "At the end of the day people won't remember what you said or did, they will remember how you made them feel." - Maya Angelou

* A (imo typical) code review question can seem intimidating or hostile from a senior developer when "you're already not sure you belong there at all". Maybe an in-person follow-up would have helped, either then or later ("hey, did you have a chance to try writing that test? can I help? I want to get your change in"). I've been on both sides of this one. The junior developer often wants some extra help and attention, and the senior developer is often feeling overwhelmed by the volume of questionable-quality things coming in, such that they can go into more of a gatekeeper role than trying to mentor each person thoroughly in each interaction. (I think this is what she's talking about with "Any lazy fool can deny a request and get you to 'no.' It takes actual effort to appreciate and recognize what they're trying to accomplish and try to help them get to a different 'yes'.")


On every dedicated box I keep a swap partition, with an alert raised when it's used beyond a certain threshold. For all VMs: no swap, because as far as they are concerned, disk = network.

Then again, anything above 80% memory utilisation and we begin looking at adding another box to the cluster, because occasional spikes in usage can easily put us beyond what swap can protect against, and that just causes a shit storm.


Swap is best seen as a component of the "bad idea trinity", next to fork() and overcommit.


If you need a network server to respond in 40 ms p99, letting it start swapping is crazy. But it made sense for timesharing undersized computers running jobs that take minutes or hours. You overcommit because your institution couldn't afford two computers.

I/O also used to have a smaller speed penalty. There was just one core and it wasn't so dramatically faster than the rest of the machine. Hell, there used to be faster disks for swap than for the filesystem, and tuning was about scheduling jobs and placing inputs and outputs so as to fully utilize the fs disks.


I don't know if I would say 'used to be'; things like Optane are essentially an SSD memory/disk tier. If it's running in byte-addressable mode you can consider it a swap tier between memory and disk/network.


Situations like the ones described are the reason that the Bastard Operator From Hell is still a wet dream for some of us.

http://bofh.bjash.com/


One problem with Linux in low memory situations is that the OOM killer is a really blunt force instrument. It would be nice if it were a lot more configurable. Simple OOM scores don't cut it, IMO.


You can do it completely in userspace, that's the ultimate configurability.


You can adjust PID scores from userspace or turn it off; there's literally nothing else you can do.


You can monitor memory, swap usage, lots of things about each process and kill processes all from a userspace program.


And there are existing userspace daemons; see: https://code.fb.com/production-engineering/oomd/


Did I miss the point? Is this a rant about incorrectly configuring swap space, or is this a rant about some kind of bad team dynamics?

Anyway, isn't this the kind of argument that should be replaced by gathering objective data? Otherwise, the low/no swap space problems really appear to be symptoms of someone irresponsibly experimenting in production.


Yes, you missed the point, this is not a rant.


Then what is the point?


"Nope" is not a strategy.



