
The MySQL “swap insanity” problem and the effects of the NUMA architecture - admiun
http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
======
guard-of-terra
One behavior I've noticed with Linux: if you read files sequentially from
disk (for example, doing scp), Linux fills all the memory with those files'
contents and then swaps out everything but the (obviously useless) disk
caches. So you'll have all the memory filled with data you'll never need
again, and trying to do anything causes a large and painful bout of swapping
back in (this had the side effect of halting my qemu).

This is true insanity. Sure, you can disable swap or tune swappiness, but
what's the reason for this crazy default behavior?
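For reference, the swappiness knob mentioned above is an ordinary sysctl; a minimal sketch (the value 10 is illustrative, not a recommendation, and both commands need root):

```shell
# Lower vm.swappiness so the kernel prefers dropping page cache over
# swapping out anonymous pages (range 0-100; distro default is usually 60).
sysctl -w vm.swappiness=10

# Persist the setting across reboots via /etc/sysctl.conf
# (or a drop-in file under /etc/sysctl.d/):
echo 'vm.swappiness = 10' >> /etc/sysctl.conf
```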

~~~
justincormack
Use rsync. It now preserves the buffer cache status that files had before so
it does not stomp on your allocations.

<http://insights.oetiker.ch/linux/fadvise/>
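The fadvise trick from that link can also be approximated from the shell; a hedged sketch using GNU dd's `nocache` flags (coreutils 8.11+), which advise the kernel not to keep the copied pages in the page cache (demo file paths under /tmp are made up so this is safe to run):

```shell
# Create a small demo file, then copy it while advising the kernel to drop
# the pages from the page cache on both the read and write side.
dd if=/dev/zero of=/tmp/demo.src bs=1M count=4 2>/dev/null
dd if=/tmp/demo.src of=/tmp/demo.dst bs=1M iflag=nocache oflag=nocache 2>/dev/null

# Verify the copy is intact; only the caching behavior differs.
cmp -s /tmp/demo.src /tmp/demo.dst && echo "copy ok"
```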

~~~
mceachen
Also, consider using --bwlimit to throttle the copy speed so the spindle can
still respond to other IO requests (25-50% of unthrottled speed seemed to be
a reasonable tradeoff).
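A sketch of what that throttling might look like (the 120 MB/s baseline is a made-up number; classic rsync's --bwlimit takes KBytes per second):

```shell
# Suppose an unthrottled copy saturates the spindle at ~120 MB/s; throttle
# rsync to roughly 25% of that so other IO still gets serviced.
UNTHROTTLED_KBPS=$((120 * 1024))       # 120 MB/s expressed in KB/s
LIMIT_KBPS=$((UNTHROTTLED_KBPS / 4))   # ~25% of unthrottled speed

# Print the resulting command (paths are placeholders):
echo "rsync -a --bwlimit=${LIMIT_KBPS} /src/ host:/dst/"
```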

------
andreasvc
I still don't understand why one needs swap at all. All the explanations I
have come across talk about not having enough memory. Given that one has at
least 8 GB of memory, or maybe even >100 GB, why on earth would you need swap?
Sure, some process might allocate even more than that, but maybe it's better
to refuse such a request than to slow down the whole system with thrashing.

I get the idea that the reason might be that a lot of programs allocate
memory they don't actually need regularly, which is then very convenient to
swap out. But rather than enabling this bad habit with slow disk storage, it
would be much better to expect programs to be more frugal, or at least to
signify whether something should be kept in memory or not.

~~~
makmanalp
Because most data is critical and you can't afford to just drop it on the
ground whenever you please. A better option would be for the application/db
to have its own swap routines, optimized for its own purposes, rather than
letting the OS do a catch-all swap.

~~~
jcrites
Even beyond that: it's desirable for a machine to be able to compute
arbitrarily large data sets. If the data set can't efficiently fit in memory,
the machine should still make progress, just more slowly, using disk.

It is not desirable for a machine to have a "wall" which, upon being hit,
becomes a harsh restriction on its capabilities. This is because we often
encounter the "wall" unexpectedly, at a time that might be critical.

~~~
vidarh
But they still have a wall - it just takes a bit more to hit it. On Linux
systems swap is usually 2x memory. With swap set like that, all swap does is
raise the wall to 3x what it previously was.

But for a lot of systems your service will fail shortly after you start
swapping anyway, because the performance cost of swapping is so high that it
often starts a death spiral (can't handle enough requests, so they start
piling up, eating even more memory, until your system dies or you hit
connection limits etc.).

So "best case" in a typical configuration is that the wall is a bit higher.
Worst case you gain nothing at all from the swap.

Personally I treat it as a failure if we ever hit swap - it means connection
limits etc. have been set too high.

~~~
ars
That's not why you need swap. Swap exists because many applications will use
memory when starting and then never touch it again.

You can therefore swap it out and use the extra memory for cache.

Most long-running applications need only a small fraction of their startup
memory.

~~~
justincormack
Then surely they could free it?

~~~
ars
If you allocate memory "after" that memory, it's not possible to return the
earlier memory to the OS.

Also, suppose you need the memory only for startup and shutdown (things like
logfiles, network connections, command line parsing, etc).

~~~
justincormack
Yes it is. Memory allocators are heaps, not queues.

Things like network connections and logfiles are used all the time, so they
won't be swapped out (actually, file handles live on the kernel side, so
they're never swapped anyway). You can free the command-line parse after
setting the options.

And clean shutdown is overrated: long running programs can just terminate
fairly gracelessly if necessary, the OS cleans everything up.

~~~
ars
Memory allocators might be heaps, but behind the scenes it's just a
contiguous area of memory.

If you grow the memory available to you (sbrk), you can only shrink it again
if no memory is allocated between the new area and the end of it.

In practice the memory is never returned, and applications rely on swap to
deal with that.

It's not the logfile (and network) handle that is swapped out - it's the code
for deciding where it is, and opening it. Also initialization code.

Some programs can abort, but others will require a (slow) consistency check of
their data if that happens to them.

And finally, theory is all well and good, but in actual practice about 3/4 of
the memory used by running programs can be swapped out.

------
xxjaba
I am very impressed with how well written this article is. A brief description
of the problem, links to relevant discussions for less informed readers to
come up to speed, and clear examples of how key pieces of information were
gathered. I learned more from this article about the topic at hand than I
have from any blog post in recent memory.

~~~
jeremycole
Thanks! I am glad you learned something, and happy to get great feedback!

~~~
finnh
I'll second this.

The intro was so well written that by the time I got to the first numa_maps
output ("2aaaaad3e000 default anon=13240527 dirty=13223315 swapcache=3440324
active=13202235 N0=7865429 N1=5375098") I immediately thought "well geez look
at that N0/N1 imbalance, there's your problem right there".

Point being, I haven't dealt with low-level hardware details since college,
and yet your article's delightfully clear intro got me sufficiently educated
to feel like I was right there with you.

A question well phrased is half answered...

------
sciurus
Another good article is [https://kevinclosson.wordpress.com/2009/05/14/you-buy-
a-numa...](https://kevinclosson.wordpress.com/2009/05/14/you-buy-a-numa-
system-oracle-says-disable-numa-what-gives-part-ii/)

On commodity servers, unless you have specific reasons to do otherwise, just
switch from NUMA to SUMA. There are two things you should do:

* Change a BIOS setting. The term for this will vary by manufacturer. For Dell, it means enabling node interleaving.

* Pass numa=off to the Linux kernel (e.g. edit grub.cfg)
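As a sketch, the kernel-parameter half of this on a GRUB2 system might look like the following (shown against a scratch copy of the file so it is safe to run; on a real box you would edit /etc/default/grub and regenerate grub.cfg):

```shell
# Scratch copy standing in for /etc/default/grub:
cat > /tmp/grub.demo <<'EOF'
GRUB_CMDLINE_LINUX="quiet"
EOF

# Append numa=off to the kernel command line:
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 numa=off"/' /tmp/grub.demo

grep GRUB_CMDLINE_LINUX /tmp/grub.demo   # GRUB_CMDLINE_LINUX="quiet numa=off"

# On a real system: regenerate the config (grub-mkconfig or update-grub,
# depending on distro), reboot, then confirm with `numactl --hardware`
# that only a single node is reported.
```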

------
larsberg
Yes, NUMA effects really will kill you, though how much depends on the
particular quad-processor topology. I have some measurements for the
interested in a small workshop paper I put together (I gathered the numbers
in the context of tuning our garbage collector anyway):

[http://dl.dropbox.com/u/1620890/website/writings/mspc12-stre...](http://dl.dropbox.com/u/1620890/website/writings/mspc12-stream.pdf)

------
WALoeIII
Will this optimization help on virtualized machines like Xen? Or does all
memory appear to be the same?

On an EC2 m1.xlarge:

    $ numactl --hardware
    available: 1 nodes (0)
    libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
    node 0 cpus:
    node 0 size: <not available>
    node 0 free: <not available>
    libnuma: Warning: Cannot parse distance information in sysfs: No such file or directory

------
jakejake
For those of us mere mortals, would it be safe to assume that adding the
suggested line to mysqld_safe would be OK to do?

cmd="/usr/bin/numactl --interleave all $cmd"

------
corford
Interesting read. Does anyone know if things have improved/changed
significantly since the article was posted (Sep 2010)?

~~~
jeremycole
They have not changed in any way; however, there is currently a patchset
proposed to change how NUMA works a bit. It's unclear whether it will change
this situation.

~~~
corford
Would that be a MySQL or Linux kernel patch (I assume the latter)? Also, I
want to echo xxjaba's comment further down - thanks for doing the work on
that post, it was really enlightening!

~~~
jeremycole
Yes, a Linux kernel patch. I've got a lot of ideas on NUMA optimization for
MySQL directly though, so keep an eye out for that, perhaps some time this
year.

~~~
corford
>I've got a lot of ideas on NUMA optimization for MySQL directly though, so
keep an eye out for that

Thanks Jeremy, I will do.

------
lawnchair_larry
The title is inaccurate. It should say "the Linux swap insanity problem",
because this is entirely related to the Linux kernel. It just happens to
affect MySQL and similar workloads, but it is not MySQL's fault, and MySQL
doesn't behave that way on other platforms.

~~~
jeremycole
Pretty hard to make you happy, eh?

I agree with you on a purely technical basis; however, this article was
written for the MySQL community, was tested (only) on MySQL, and this has
primarily affected me only on MySQL systems, which I (and the others
referenced in the article) primarily run on Linux.

------
defen
Are the lessons here applicable to other commonly used databases (mongo,
postgres, redis, etc)?

~~~
wmf
This applies to any case where you want a single process to use more than half
the server's RAM.

------
j2labs
tl;dr - If you're running a database, or a generally memory-intensive system,
on multiple CPUs, you should run this command:

    echo 0 > /proc/sys/vm/zone_reclaim_mode

But the article is great. You should definitely read it.

~~~
finnh
Except that's not the TL;DR at all.

From the article:

"An aside on zone_reclaim_mode

The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory
reclamation policies in a NUMA system. Subject to some clarifications from the
linux-mm mailing list, it doesn’t seem to help in this case."

The real TL;DR is "run your mysql command under the auspices of
'/usr/bin/numactl --interleave all' so that your big pool allocation is split
evenly across nodes"

And an even better solution would be if _only_ the big pool allocation used
interleaved allocation, and all the rest used normal node-bound allocation.
This would require some sort of change to the malloc calls though, yes? All
of the solutions listed in the article operate at the granularity of a
process (or higher), not down to the individual allocation.

~~~
j2labs
My mistake, you are correct that I forgot the second command.

The new tl;dr is:

    numactl --interleave=all /path/to/daemon
    echo 0 > /proc/sys/vm/zone_reclaim_mode

This file helps explain the different ways one can tweak memory with Linux:
<http://www.kernel.org/doc/Documentation/sysctl/vm.txt>
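Putting the thread's two knobs together, a minimal sketch (the mysqld path is an assumption for illustration; both commands need root):

```shell
# Disable per-node reclaim so the kernel doesn't evict local pages while
# other NUMA nodes still have free memory:
sysctl -w vm.zone_reclaim_mode=0   # same as: echo 0 > /proc/sys/vm/zone_reclaim_mode

# Start the daemon with its allocations interleaved across all NUMA nodes
# (daemon path is hypothetical; mysqld_safe can wrap this the same way):
numactl --interleave=all /usr/sbin/mysqld
```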

