
Mysql "Swap Insanity"  - xal
http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
======
stephenjudkins
This is all very interesting, and the author has clearly done a lot of
research. But where are the benchmarks? I'd love to see some sort of
replicable evidence that using this command helps things that much.

Linux's NUMA policy does seem broken for this use case. If all the memory on
node 0 is used up (but plenty is free in node 1) and a thread in node 0
attempts to allocate memory, why not, instead of swapping out pages from node
0, simply move them to node 1? Alternatively just allocate memory in node 1,
as the author suggests. I'm not a kernel programmer. Anyone who's more
familiar with this care to answer?
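
For anyone who didn't click through: the command in question is the numactl
interleave workaround the article proposes; roughly this (the mysqld path is
illustrative):

```shell
# Start mysqld with its memory interleaved round-robin across all NUMA
# nodes, instead of the default local-node allocation policy:
numactl --interleave=all /usr/sbin/mysqld

# Inspect node layout and per-node free memory:
numactl --hardware

# Check the policy actually applied to a running mysqld's mappings:
grep interleave /proc/$(pidof mysqld)/numa_maps | head
```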

~~~
illumin8
The problem is that there is a latency hit when a thread running on node 0
accesses memory on node 1. Furthermore, this goes over HyperTransport on AMD
or QPI on Intel, which have limited bandwidth, so if you get too many off-
node memory accesses, performance begins to suffer.

The real solution to this issue is for MySQL to become NUMA aware and place
threads and the cached data blocks those threads are accessing more
intelligently on nodes that have enough space. Other more robust databases
like Oracle already do this, having been running on NUMA architectures for
decades now.

~~~
jemfinch
> The problem is that there is a latency hit when a thread running on node 0
> accesses memory on node 1. Furthermore, this goes over HyperTransport on
> AMD or QPI on Intel, which have limited bandwidth, so if you get too many
> off-node memory accesses, performance begins to suffer.

Does it suffer more than _hitting the disk_?

~~~
illumin8
No need to be snarky. You take a one-time large hit in performance to page
out some data, as opposed to many small, continual hits in performance going
across the interconnect between nodes.

Which would you rather suffer? A 50 ms one-time hit to page some data out, or
many thousands of 500-microsecond hits and interconnect saturation over time?
The kernel engineers looked at these trade-offs and determined it was better
to page out data. After all, the kernel does not know how long you'll need
your data, and if it allowed memory to be allocated haphazardly all over a
NUMA system, after many hours or days you could end up with a very slow-
running system where every other thread had to access memory on a different
node.

I find it rather puzzling that DBAs think they know more about how a kernel
should page memory than a kernel developer like Linus Torvalds.

The answer seems clear - if your software relies on huge amounts of memory,
make it NUMA aware. Oracle did this a long time ago and I don't see any
strange swap activity on our 8-socket 48-core 128GB NUMA systems (AMD
Opteron).

~~~
jemfinch
> You take a one time large hit in performance to page out some data as
> opposed to many small continual hits in performance going across the
> interconnect between nodes.

You misunderstand the OP. He's not saying "Just put it in Node 1 and access it
from there." He's saying to _swap it out_ to Node 1, and then when it is
needed in Node 0 again, swap it back to Node 0. That's certainly cheaper than
swapping to disk and back.

~~~
illumin8
I see, so he's essentially proposing a memory to memory swap functionality as
opposed to just memory to disk. It sounds like a workable solution, although
it would require some engineering in the kernel paging algorithms. You'd also
need to make changes to the scheduler so that you could intelligently schedule
threads on the node where their memory is. It sounds doable, but it seems that
this is a lot of work for kernel engineers to do that could be done by
software that is NUMA aware.

------
mmaunder
The author doesn't mention anything about load, e.g. queries per second. I
wonder if the situation is: huge data set, low usage, and very fast response
time required.

There are a host of problems with low traffic stacks including:

-Persistent database connections going away and the first user who hits the system having to wait for the web server to connect to the DB.

-Cache warmup time taking a long time because it takes a long time to get enough queries for the cache to figure out what needs to be cached.

-App server warmups taking a long time. Low traffic means it takes a long time for all apache children (or ruby app server or whatever) to get a hit and compile code into mem so the next request is fast.

-And the author's [I'm assuming] problem of systems deciding that low usage means the data can be expired from cache.

I find the innodb buffer pool cache to be the best in the business on high
traffic sites using the author's configuration. [Random aside: I also use
Redis and Memcached in production extensively.]

~~~
jeremycole
Hi,

In this case, the traffic level is somewhat irrelevant. However, swapping can
be demonstrated at levels of a few thousand queries per second. Read the
article: this has nothing to do with swapping due to non-use (which would
logically be OK).

Regards,

Jeremy

------
nwilkens
What about using the large-pages (huge pages) MySQL option:
http://dev.mysql.com/doc/refman/5.0/en/large-page-support.html

From the kernel documentation at
http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage.txt :
"Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under memory
pressure."
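
For anyone wanting to try it, the setup is roughly this (the option names are
from the linked docs; the sizing values here are placeholders you'd tune to
your machine):

```shell
# Reserve huge pages at the OS level (illustrative: 4096 x 2 MB = 8 GB;
# size this to cover the InnoDB buffer pool):
sysctl -w vm.nr_hugepages=4096

# Allow mysqld's group to use hugetlb shared memory:
sysctl -w vm.hugetlb_shm_group=$(id -g mysql)

# Then enable it in my.cnf:
#   [mysqld]
#   large-pages
```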

~~~
jeremycole
Hi,

I think hugepages etc., serve to solve the symptom of the problem (making
mysqld itself unswappable) rather than the actual problem (the system needs
memory on a certain node and there is none to be had). Making mysqld's memory
unswappable will just mean that something else gets swapped, or in the worst
case something gets OOM-killed or the allocation just fails. Those situations
could be worse.

Regards,

Jeremy

------
gaius
You would think MySQL types would simply "shard" their servers by starting a
MySQL process on each board.

(This problem was fixed in Oracle in the 90s originally for deployment on
Sequent hardware).
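
The mechanical part of that is easy enough with numactl; something like this
(ports and data directories made up):

```shell
# One mysqld per NUMA node, with CPUs and memory pinned together:
numactl --cpunodebind=0 --membind=0 mysqld --port=3306 --datadir=/data/node0 &
numactl --cpunodebind=1 --membind=1 mysqld --port=3307 --datadir=/data/node1 &
```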

~~~
apenwarr
That would probably not help much here; you'd then end up caching pretty much
the same stuff on node 0 and 1, so it would be like having two 32 GB nodes
instead of one big 64 GB node. They would both be maximally fast, within the
32 GB memory constraint, but they would _both_ have to swap if you wanted to
exceed 32 GB of cached information.

Real sharding - storing totally different stuff on the two shards - would
probably give a really good performance improvement. But real sharding is much
more of a pain than just starting two copies of mysqld.

------
forkqueue
I've worked on several servers of similar specs to this (64GB RAM, MySQL 5.0
or 5.1) and have never seen this issue.

All systems were running RHEL or CentOS, so perhaps Red Hat have fixed the
problem.

~~~
metageek
What CPUs? As the article says, the NUMA characteristics show up with
Opterons and Nehalems; older Intel chips didn't have them.

------
mikey_p
Does MariaDB (and other MySQL variants) suffer from this issue?

~~~
nwmcsween
It's an operating system issue - specifically, Linux's handling of NUMA is
generally broken.

~~~
timthorn
I'm not sure I agree; the application can be written with an understanding of
how best to utilise the memory architecture, but the OS must manage memory for
the common case unless instructed otherwise. I'm not commenting on Linux's
handling of NUMA - rather that applications shouldn't assume that a generic OS
can provide an optimal hardware abstraction. All memory is equal - but some
memory is more equal than others.

~~~
JoachimSchipper
Yes, the OS must assume the common case. However, by the time one node has
90% of its memory in use and the other node has 1% of its memory in use, it's
clear that this is not the common case, and this should be handled more
gracefully.
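
You can see exactly that kind of imbalance from sysfs (Linux; a single-node
box will just show node0):

```shell
# Per-node memory totals and free memory, straight from sysfs:
for node in /sys/devices/system/node/node*; do
    echo "== ${node##*/} =="
    grep -E 'MemTotal|MemFree' "$node/meminfo"
done
```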

------
irv
There's a (superficially) similar problem in MS SQL Server / Windows Server.
You can fix that by setting "lock pages in memory".

------
fragmede
sync; echo 3 > /proc/sys/vm/drop_caches

drop_caches was originally added for benchmarking purposes, but I've found
that running it every N minutes seems to help system responsiveness. (I've
been unable to quantify it, unfortunately.)

~~~
jeremycole
This is a tremendously bad idea, but rather than explain why here, just read:

http://www.listware.net/201009/linux-kernel/48874-rfc-patch-update-procsysvmdropcaches-documentation.html

~~~
fragmede
The thread doesn't at all clarify why it is, and actually gives a
counterexample.

For those who compare the before and after output of 'free', stop it. Yes,
the numbers are (sometimes drastically) different. It doesn't matter. The
kernel drops pages when it needs a page, and for the general case, this
_does_ work. But, as the linked LKML thread states, there may be a
pathological case that you are not expected to hit (10 million+ files and
40 GB of RAM). For that specific use case, it _did_ make sense.

The reason this is a bad idea is that dirty pages cannot be freed, which is
why it is recommended to run 'sync' first. Unfortunately, on a busy system,
pages will get dirty in between the sync and the drop_caches, more so if
you're doing it in a shell. Those dirty pages can then never be reclaimed
(due to how drop_caches works, and because drop_caches is only intended for
benchmark testing).

(Link to the same thread, but threaded:
http://lkml.indiana.edu/hypermail//linux/kernel/1009.1/02943.html )
