
Swap, swap, swap, and bad places to work - r4um
http://rachelbythebay.com/w/2019/08/08/swap/
======
deathanatos
Swap (her solution), xor the kernel should just OOM kill Chrome; but spinning
the disk for half an hour to accomplish borderline nothing for a human who
has long since gotten bored and left is pointless — the behavior noted in the
LKML post is severely annoying, and not productive.

My previous team was also a "no swap in prod" shop, and this behavior bit us
more than I care to admit. The devs were occasionally on the side of "swap for
safety", ops was religiously no-swap, and ugh. It can take 10+, 30+ minutes for
systems encountering this to resolve to some meaningful conclusion, and half
the time I'm desperately trying to ssh in so as to kill -9 the errant task
_anyway_ — but ssh is paged out, and I wish the OOM killer would just do it for
me instead of Linux trying to page everything through what feels like a single
4KiB page. I need to play around with sysctls more on some sort of test rig.
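For reference, the sysctls people usually mean by that (these are real knob
names; the values are purely illustrative starting points, not recommendations):

```shell
# /etc/sysctl.d/99-swap-testing.conf -- illustrative values only
vm.swappiness = 10          # how eagerly to swap anon pages vs. drop page cache
vm.vfs_cache_pressure = 100 # reclaim pressure on dentry/inode caches (100 = default)
vm.min_free_kbytes = 65536  # reserve so the kernel can always make forward progress
# Apply with: sysctl --system (or sysctl -w for one-off experiments)
```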

On AWS instances with EBS disks (most instances), disk is basically network.

I once suggested "cgroup'ing" (loosely speaking) the entire system into two
rough buckets: one for SSH, with enough dedicated RAM that ssh will never get
swapped, and one for everything else.
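On a modern systemd box, a rough sketch of that two-bucket idea (assuming
cgroup v2; the unit name and sizes here are illustrative — e.g. Debian calls
the unit ssh.service, RHEL calls it sshd.service):

```shell
# Give sshd a guaranteed floor of RAM that memory reclaim won't touch,
# and soft-cap everything running in user sessions. Requires root.
systemctl set-property ssh.service MemoryMin=256M
systemctl set-property user.slice MemoryHigh=90%
# Persist across reboots by writing the same settings as drop-ins, e.g.
#   /etc/systemd/system/ssh.service.d/override.conf  ->  [Service] MemoryMin=256M
```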

Also, I feel like very few devs out there understand that mmap'd files —
including binaries/libraries — are basically mini swap files when memory
pressure is high; more than once I've diagnosed a machine as "page thrashing"
only to get back "what? but it has no swap, that cannot be?".
Well, pgmajfault and disk I/O metrics don't lie.
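For what it's worth, that metric is right there in `/proc/vmstat`: a swapless
box thrashing its mmap'd executables will show `pgmajfault` climbing even
though swap usage reads zero.

```shell
# Major page faults: pages that had to be read back from disk,
# including executable/library pages on a machine with no swap at all.
grep '^pgmajfault' /proc/vmstat
# To watch it climb under memory pressure:
#   while sleep 1; do grep '^pgmajfault' /proc/vmstat; done
```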

~~~
saurik
> I once suggested "cgroup'ing" (loosely speaking) the entire system into two
> rough buckets: one for SSH, with enough dedicated RAM that ssh will never
> get swapped, and one for everything else.

I feel like I've been waiting 25 years for someone to implement this by
default in every operating system, and yet no one seems to do this. (Arguably
iOS does this with Jetsam priorities, btw.) Why is it so hard to ensure that
these base components get guaranteed RAM? It was always extra infuriating on
Windows, as you'd end up in swap hell and need to kill something because the
computer is now running insanely slow; so you hit ctrl-alt-del and the UI
would _instantly_ respond with a menu that would work _great_... and then
you'd click Task Manager, and rather than that functionality being implemented
in that menu — or being designed to never swap — you would get thrown back into
the swap hell in the hope that eventually the taskman would load. It was
_insane_, and so easily fixed in numerous ways (the simplest probably being
to add a trivial task manager, with only memory usage and a kill button, to
that OS escape menu).

~~~
nailer
> I feel like I've been waiting 25 years for someone to implement this by
> default in every operating system, and yet no one seems to do this.

Damn straight. The Xbox builds of Windows do this too (so the console is still
controllable while a game is running), and I swear the lack of control in
desktop OSs (you can't start Task Manager to kill the thing that's stopping
you from starting anything) will kill desktop computing if mobile OSs increase
their capabilities.

A device that isn't responsive to input is indistinguishable from the device
being broken.

------
KaiserPro
Swap, the sub religion.

I've wobbled back and forth on swap. In the early days I used to be annoyed
by how much disk space it'd take (an 8 gig disk with 1 gig for swap is too
much).

I've run an 8-core machine with only 2 gigs of RAM and tried to compile
something with Boost in it. Swap allowed me to kill it and recover the
system.

I've run VMs with no swap, some swap and loads.

However, what I've never done is actually benchmark the same workload on
machines with no, some, and loads of swap. So I generally defer to
Rachel, because Rachel has been there and been bitten by that before.

On this point:
[http://rachelbythebay.com/w/2018/04/28/meta/](http://rachelbythebay.com/w/2018/04/28/meta/)
which everyone should read and digest, this remark jumped out:

>"This is so easy to test for"

If I ever say this and never qualify it, it should read: "haha, yeah, I made
that same mistake, that's why I test for it now."

The only reason why I am "better" [I hope] than my younger self is that
I've made a bucketload of mistakes before. Some of them are technical, but to
be honest a load of them are societal (as in, bleating like Cassandra and not
being able to effect change).

If we, as "engineers" are to grow as a class of people, we have to actually
learn from other people's mistakes, not just use them as bias confirmation.
This is why I like blog posts where they lay out the problem, fallout, cause,
workaround and eventual solution.

~~~
VectorLock
>However, what I've never done is actually benchmark the same workload on
machines with no, some, and loads of swap. So I generally defer to
Rachel, because Rachel has been there and been bitten by that before.

I have. A well utilized machine is going to absolutely tank once it hits swap.
Do you want to engineer your application to be able to cope with two radically
different performance regimes, or do you simply want to ensure that your
working set stays bounded?

~~~
zaphar
I have too. I've also been in the opposite scenario, where not having swap
caused runaway kill-and-restart cycles for services. Reading between the
lines, it looks like Rachel has too, which is why she advises a "not too much"
swap approach. As usual, the answer is not one extreme or the other; it's
more nuanced than that.

~~~
VectorLock
>I've also been in the opposite scenario where not having swap caused runaway
kill-and-restart cycles for services.

Seems like a good time to just kill it and move on. "Cattle not pets."

~~~
zaphar
The funny thing about cattle is that they stampede. One host goes down the
load increases on the others. They go down too. That cascades until they are
all in a kill/restart cycle. I've seen lack of swap cause this. I'm betting
Rachel has too.

------
gumby
A non-swap comment:

> For everyone else, you'd probably cry too. I sure did.

I remember a colleague (employee) crying because a third party vendor screwed
up and tried to blame us which would have sent a multimillion dollar project
down the tubes. What saddened me was she felt the need to apologize. It was a
pure expression of frustration, anger, sadness, and exhaustion on a project we
were all deeply committed to, produced by the brazen unfairness of this
contract house.

It's not good to live in a culture that denigrates human expression. I'm glad
rachelbythebay was able to express this.

* We were able to apportion blame properly, get a proper result from someone else, and make the regulators happy with no funny business.

------
VectorLock
Desktop machine? Swap. For some reason you have a single server and don't care
about performance? Swap. Running an application that might have a working set
that's larger than RAM, and the application doesn't understand how to do its
own disk paging? Swap's good there!

Larger scale systems with redundancy? No swap.

Having swap in systems like this still doesn't make sense to me. It treads
heavily on the "cattle not pets" philosophy. I shouldn't be ssh-ing into a
machine that's swapping to see what's up. It should be killed. One server in
the cluster starts swapping and falls out of step with its peers? It should be
killed. When a machine starts swapping it falls into a whole different
performance regime than the rest of your systems, and now you've got more
variance in your response times. Not good when you care about your response
times. Unless you have memory-pretending-to-be-disk for swap (in which case,
why isn't it just memory?).

I've never seen a machine 'act funny' because it didn't have swap; it's always
the other way around. I don't think I've ever encountered a machine that used
so much memory that the kernel didn't have buffers, but not so much that it
invoked the OOM killer — unless there was a woefully misconfigured process
running on the machine.

If a machine is well utilized CPU wise it is going to get absolutely crushed
when it starts swapping.

Time and time again I see swap being an issue. This past year I've been in a
Large Scale shop which for some ungodly reason has swap (nowhere I've worked
in the past 10 years has had swap as a general rule).

Don't even get me started with EBS IOPS exhaustion when you start swapping
onto an EBS volume.

~~~
mankyd
> Larger scale systems with redundancy? No swap.

Why not give them swap, set off pagers, and _maybe_ kill them? There could
still be something worth investigating there, and having swap will make that
easier.

You also don't want to have a cascading failure where a massive leak makes all
your machines fill their ram, and start killing everything like crazy.

~~~
VectorLock
Why let them live? Why wake myself up? Now your swapping systems are
introducing a performance degradation.

~~~
mankyd
Quote: "maybe kill some of them" Quote: "You also don't want to have a
cascading failure where a massive leak makes all your machines fill their ram,
and start killing everything like crazy."

Cascading failures are a very real thing that have knocked whole systems
offline.

It sounds like the real solution is a balanced one involving some
engineering: kill them if you aren't killing _everything_; page if the problem
is ongoing, not if a couple of machines have a problem.

Either way, you can add swap _and_ kill them. One does not preclude the other.

------
arendtio
I wonder why we talk so often about swap but rarely about using zram. I mean,
isn't it much simpler to add some zram as swap instead of messing with the
partitions? And in the end it should solve the problem equally well, shouldn't
it?

I have seen this being done on Android devices and wondered why it is being
used so rarely in other areas (Desktops/Servers).

~~~
dredmorbius
Swap dates to the 1960s. zram was introduced to the mainline Linux kernel in
2014.[1] zswap presumably later, though Wikipedia states it was added in 2013,[2]
which might suggest zswap matured earlier. Hrm — actually, they appear to be
two distinct features.

There's a lot of institutional knowledge, and mythology, around swap. Less so
around zram/zswap, and that knowledge has to compete with other capabilities
and lore.

I've been wrangling boxen since the late 1990s, and using Unix since at least
the late 1980s. I'd only run across references to zswap/zram a few weeks ago
when attempting to compile OpenWRT, and didn't look into it until seeing your
comment (one reason for writing copiously footnoted HN comments — I might
accidentally learn something).

zswap might very well be The Answer We've All Been Looking for, but, well, All
Of Us Realising that is another stage in the Hierarchy of Failures in Problem
Resolution.[3]

________________________________

Notes:

1\.
[https://en.m.wikipedia.org/wiki/Zram](https://en.m.wikipedia.org/wiki/Zram)

2\.
[https://en.m.wikipedia.org/wiki/Zswap](https://en.m.wikipedia.org/wiki/Zswap)

3\.
[https://old.reddit.com/r/dredmorbius/comments/2fsr0g/hierarc...](https://old.reddit.com/r/dredmorbius/comments/2fsr0g/hierarchy_of_failures_in_problem_resolution/)

~~~
mrob
Using zram swap in a recent Debian is as simple as installing "zram-tools"
(and changing the size in /etc/default/zramswap if you're not happy with the
default).
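For the curious, the whole configuration surface is a handful of variables in
that file (names as shipped by zram-tools, from memory — check the packaged
file; the values here are illustrative):

```shell
# /etc/default/zramswap -- read by the zramswap service from zram-tools
ALGO=zstd      # compression algorithm for the zram device
PERCENT=25     # size the device as a percentage of total RAM
PRIORITY=100   # swap priority; higher than disk swap so zram is used first
```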

~~~
dredmorbius
... and knowing to do that. And knowing to do that over other methods.

Institutional knowledge, mindshare, and documentation are all Things.

Debian's documentation (Debian Administrator's Handbook, Debian Installation
Guide, Debian FAQ) do not appear to mention either zram or zswap at all.
DAH/DIG do mention swap configuration, but only in terms of traditional swap
patterns.

There _is_ mention on the Debian Wiki:
[https://wiki.debian.org/ZRam](https://wiki.debian.org/ZRam) But not under the
Swap topic: [https://wiki.debian.org/Swap](https://wiki.debian.org/Swap)

"As easy as" really doesn't mean much if the information isn't accessible.
It's also harder to advocate for something if it's not at least mentioned in
standard documentation.

If you're aware of any standard Linux documentation mentioning zram/zswap as
options, please let me know.

Again: _this is not an argument against the technical merits or advisability
of zram or zswap._ It's an argument that knowledge of these options is not
widely disseminated or assimilated. I'd commented recently on the matter of
intergenerational knowledge transfer, both general and specific
([https://news.ycombinator.com/item?id=20617656](https://news.ycombinator.com/item?id=20617656)).
This would be a case of that.

------
lambdasquirrel
I don’t know if this is just the big corporations in Silicon Valley; guys in
general around here (in tech) seem like that. There’s a whole movement around
empathy and then vulnerability, but that just makes the competition more
veiled.

~~~
_nalply
You mean that some people perform empathy and vulnerability?

------
georgebarnett
I’ve run clusters of several thousand machines with petabytes of RAM installed
and no swap (or even disks).

It works just fine; however, you need to keep appropriate headroom to allow
the kernel to do its thing with caches, as indicated, otherwise things get
very weird very quickly.

Containers are very helpful here for explicitly dividing a machine up between
processes without allowing any one of them to get out of hand.

~~~
z3t4
I guess this works if you run a few well-known apps which manage memory well
(yeah, right). Maybe recycle/restart at regular intervals. Some programs have
4-8GB limits and will crash by themselves. Or you have a watcher service and
recycle/restart after some threshold. And probably your distributed system can
handle a few nodes going down. But without memory they will start to act
weird. With swap they will usually work, but slower, and might eventually
grind to a halt anyway — but it will make it easier to do a graceful
shutdown/restart.

~~~
VectorLock
If your application starts running slower it should be killed. Redundancy
means not having to worry about graceful anything.

~~~
jmiserez
You've stated this multiple times in this thread now.

Maybe consider that not everyone has the same use case as you: some people are
running larger chunks of computation per node or even stateful applications.
Even when they could be killed and are redundant, running a bit slower for a
moment until they recover may be preferable to restarting the node (which also
takes time).

~~~
VectorLock
There are lots of different application models; the issue here is a ton of
rule-of-thumb regression from "redundancy is important" to "swap is okay
after all, I guess."

------
scottlamb
On the swap aspect:

I absolutely hate Linux's behavior with swap enabled, as described in a
previous thread:
[https://news.ycombinator.com/item?id=20479622](https://news.ycombinator.com/item?id=20479622)

It makes sense that it can also be broken with swap disabled: paging out too
many file-backed pages can also lead to an unresponsive system.

> Earlier this week, a post to the linux-kernel mailing list talking about
> what happens when Linux is low on memory and doesn't have swap started
> making the rounds. ... Now, here we are in 2019, and we have a fresh set of
> people still fighting over it, like it's some kind of brand new dilemma.
> It's not.

The problem isn't new, but the approach I saw them discussing (use the new PSI
stuff to OOM kill early) is new—PSI was only added ~a year ago, iirc. So I
think this comment is unnecessarily dismissive.

I've seen the systems behave badly without swap. I don't see the bad swapless
behavior as often personally, but I believe it exists. (In particular, I
haven't tried the reproduction instructions in the lkml thread.) I don't know
how the "tinyswap" approach is supposed to help—I'd love details. Swapless
with the PSI-based OOM killing is an approach that actually makes sense to me
in theory.
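For anyone who hasn't met PSI: it's exposed as `/proc/pressure/memory` on
kernels ≥ 4.20, and `avg10` is the fraction of the last 10 seconds that tasks
were stalled waiting on memory — exactly the "thrashing" signal a PSI-based
killer would trigger on. A self-contained sketch of extracting it (a literal
sample line is used so it runs anywhere):

```shell
# Format of a /proc/pressure/memory line; a literal sample is used here
# so the snippet runs even on kernels without PSI.
psi_line='some avg10=1.23 avg60=0.50 avg300=0.10 total=123456'
avg10=${psi_line#*avg10=}   # strip everything through "avg10="
avg10=${avg10%% *}          # keep up to the next space
echo "$avg10"               # prints 1.23
```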

------
psibi
> I stand by my original position: have some swap. Not a lot. Just a little.

Is there some exact figure for this? Like, what percentage of RAM should be
allocated as swap space?

~~~
KaiserPro
The default for Debian 9 was 100% of RAM. I have 64 gigs, so it was a bit much
(especially as I only have 128 gigs of disk).

The only time I've ever seen my[1] swap go above 500 megs was when I was
running qterminal with a memory leak. It was a long-running terminal, with a
long-running process that was extraordinarily chatty.

4 gigs is probably enough for most people under most conditions.

[1] This is on workstations. On servers, I've seen it happen all the time.
Crucially, it gives one time to react.

~~~
cosarara
You mention time to react. How do you get notified that memory usage is
getting out of hand before the system is actually unusable?

~~~
KaiserPro
Depends on the service it's running, really.

I'd assume that you'd have key metrics plotted in Grafana, so it's a case of
following up on / alerting on those.

In previous cases it was fairly simple: an alert fired because the queue
size ballooned. Looking at the machine stats, we saw that all the RAM cache
had been ejected just as performance started to drop.

Either way, it's pretty trivial to put an alert on a stat in Grafana.

However, your mileage may vary.

------
mzs
link to flattened lkml discussion thread:
[https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca48...](https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/T/#u)

some background on anonymous memory:
[https://utcc.utoronto.ca/~cks/space/blog/unix/NoSwapConseque...](https://utcc.utoronto.ca/~cks/space/blog/unix/NoSwapConsequence)

------
kazinator
If you don't allocate swap, you have to do other things to compensate, like
reduce or eliminate overcommit.

At one company where I worked over a decade ago, we ran some Linux-based
equipment without swap too. To prevent executables from being evicted under
memory pressure, I put a hack into the kernel: executables and shared libs
were mapped such that they were nailed into memory (MAP_LOCKED).

------
andmarios
I would assume the consensus is clear these days: swap is good and should be
enabled in most cases on Linux > 4.0.

Of course, real life often differs from theory. Does your machine have a
spinning disk or an SSD? I am much quicker to enable swap on an SSD, since it
won't be painfully slow should we ever get into a situation where our RAM is
saturated.

What happens in cloud VMs? These things use network disk storage (transparent
to us), and writes often need to be sent over the network more than once (for
redundancy). How would extensive swapping behave in such an environment?

As for saying no: it's important to set some rules to avoid chaos, but it's
also important to trust our senior people to make decisions. If they need to
go against a rule, I would expect a good explanation in their commit —because,
infrastructure as code— and documentation. If a junior wants to go against a
rule, they can consult a senior. Issuing a "no" and expecting everyone to
follow it blindly is the worst form of micromanagement. :)

------
blunte
It seems to me that if you hit the point where you really need swap, then
you're already in trouble. Maybe that swap gives you a little buffer before
things get really bad, but chances are it will just keep you unaware of your
impending problems until they go critical (unless you have lots of good
monitoring/alerting).

~~~
azernik
You would think so, but as Rachel points out, the Linux kernel displays some
pathological behavior with no swap that even a tiny amount of swap works
around. Something weird is going on in the memory management code.

~~~
VectorLock
I'm not sure what she's observing other than "feels weird", but my observation
is the exact opposite — and my observations usually have some measurements
behind them.

~~~
aardvarklegend
It requires a perfect storm of just shy of 100% used memory and a lot of
mmapped I/O. In that case the mmapped pages can get shunted to a handful of
pages (or even one page), and so you lose all ability to do any block I/O
larger than one page size. And every page of memory involves a fully blocking
I/O request. It's most certainly real for certain workloads (databases are
common).

------
rcfox
> Item: If you allocate all of the RAM on the machine, you have screwed the
> kernel out of buffer cache it sorely needs. Back off.

Why not just permanently allocate enough RAM for the kernel? If I have 16GB of
RAM but the kernel needs 1GB to do its job, then just tell me that I have 15GB
to work with.

~~~
AstralStorm
That would be nice, but quite a lot of kernel code does not work like that,
specifically the disk cache. And 1GB is not enough for 32k PIDs' worth of
process kernel stacks.

------
dreamcompiler
This is an example of where desktop and server engineers could benefit from
having embedded design experience.

It's certainly possible to create a small, protected area of memory that
contains a kernel-level interrupt handler (which itself allocates no memory)
whose sole job is to run a couple of times a second and check for thrashing
and OOM. If it sees memory problems, it takes over the computer, determines
which processes are using the most memory and kills the ones that are
expendable. ("Expendable" is a list configurable by the user and yeah, Chrome
would be right at the top for a desktop system.)

Embedded systems designers routinely build such watchdogs into their systems.
It could probably be added to Linux as a kernel patch.

------
scottlamb
In a first for one of these "bad places to work" stories, I recognized the
project she described in the "A patch which wasn't good enough (until it was)"
post linked from this one, so I looked up the history. Sure enough, I know the
developer she was complaining about in both posts.

In the patch case, he asked about testing, and they realized the ssh/scp
versions she tested with weren't the same as the ones the code was using. She
promised to follow up with best-practice testing and didn't. (Without knowing
the reason, this isn't unusual: people get busy and drop things all the time.)
I didn't get the same sense of rejection or hostility she did. And the second
developer (who got her patch accepted) credited her in the code review, tested
in a middle way (better than she originally did, worse than she promised to do
later), requested the review from a different person than she had (why I don't
know), and got a review question with a similar tone before it was accepted.
None of the parties' behavior looked unusual/red-flag-worthy to me.

I don't fault her for imperfectly describing an interaction that was five
years ago when she wrote that post and is twelve years ago now. I'm trying to
figure out what the lesson is and who should be learning it. A few unorganized
ideas:

* Much of what people are thinking and feeling is left unwritten/unsaid, so two people can have very different ideas of what happened. (A reminder I suppose to listen to both sides before making a judgement on something.)

* I don't want to dismiss her feeling about bad team dynamics, even if I don't see them in this particular interaction. "At the end of the day people won't remember what you said or did, they will remember how you made them feel." \- Maya Angelou

* A (imo typical) code review question can seem intimidating or hostile from a senior developer when "you're already not sure you belong there at all". Maybe an in-person follow-up would have helped, either then or later ("hey, did you have a chance to try writing that test? can I help? I want to get your change in"). I've been on both sides of this one. The junior developer often wants some extra help and attention, and the senior developer is often feeling overwhelmed by the volume of questionable-quality things coming in, such that they can go into more of a gatekeeper role than trying to mentor each person thoroughly in each interaction. (I think this is what she's talking about with "Any lazy fool can deny a request and get you to 'no.' It takes actual effort to appreciate and recognize what they're trying to accomplish and try to help them get to a different 'yes'.")

------
c12
On every dedicated box I keep a swap partition, with an alert raised when it's
used beyond a certain threshold. For all VMs: no swap, because as far as they
are concerned, disk = network.

Then again, anything above 80% memory utilisation and we begin looking at
adding another box to the cluster, because occasional spikes in usage can
easily put us beyond what swap can protect against, and that just causes a
shitstorm.

------
gok
Swap is best seen as a component of the "bad idea trinity", next to fork() and
overcommit.

~~~
erik_seaberg
If you need a network server to respond in 40 ms p99, letting it start
swapping is crazy. But it made sense for timesharing undersized computers
running jobs that take minutes or hours. You overcommit because your
institution couldn't afford _two_ computers.

I/O also used to have a smaller speed penalty. There was just one core and it
wasn't so dramatically faster than the rest of the machine. Hell, there used
to be _faster disks_ for swap than for the filesystem, and tuning was about
scheduling jobs and placing inputs and outputs so as to fully utilize the fs
disks.

~~~
pixl97
I don't know if I would say 'used to be'; things like Optane are essentially
an SSD memory/disk tier. If it's running in byte-addressable mode you can
consider it a swap tier between memory and disk/network.

------
tempodox
Situations like the ones described are the reason that the Bastard Operator
From Hell is still a wet dream for some of us.

[http://bofh.bjash.com/](http://bofh.bjash.com/)

------
hacknat
One problem with Linux in low memory situations is that the OOM killer is a
really blunt force instrument. It would be nice if it were a lot more
configurable. Simple OOM scores don't cut it, IMO.

~~~
zzzcpan
You can do it completely in userspace, that's the ultimate configurability.

~~~
hacknat
You can adjust pid scores in userspace or turn it off, there's literally
nothing else you can do.

~~~
zzzcpan
You can monitor memory, swap usage, lots of things about each process and kill
processes all from a userspace program.
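As a sketch of how little that takes, a toy version (the threshold and the
biggest-RSS victim policy are arbitrary examples; real daemons like oomd or
earlyoom are far more careful):

```shell
#!/bin/sh
# Toy userspace OOM watchdog: if available memory drops below a threshold,
# pick the process with the largest resident set as the victim.
# Illustrative only -- the actual kill is left commented out.
THRESHOLD_KB=262144   # 256 MiB, an arbitrary example
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt "$THRESHOLD_KB" ]; then
    victim=$(ps -eo pid= --sort=-rss | head -n 1)
    echo "low memory (${avail_kb} kB available), would kill PID ${victim}"
    # kill -9 "$victim"
fi
```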

~~~
scottlamb
And there are existing userspace daemons; see:
[https://code.fb.com/production-engineering/oomd/](https://code.fb.com/production-engineering/oomd/)

------
gwbas1c
Did I miss the point? Is this a rant about incorrectly configuring swap space,
or is this a rant about some kind of bad team dynamics?

Anyway, isn't this the kind of argument that should be replaced by gathering
objective data? Otherwise, the low/no swap space problems really appear to be
symptoms of someone irresponsibly experimenting in production.

~~~
loriverkutya
Yes, you missed the point, this is not a rant.

~~~
gwbas1c
Then what is the point?

------
raldi
"Nope" is not a strategy.

