Linux's NUMA policy does seem broken for this use case. If all the memory on node 0 is used up (but plenty is free in node 1) and a thread in node 0 attempts to allocate memory, why not, instead of swapping out pages from node 0, simply move them to node 1? Alternatively just allocate memory in node 1, as the author suggests. I'm not a kernel programmer. Anyone who's more familiar with this care to answer?
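Not a kernel programmer either, but for what it's worth the kernel does expose this exact trade-off as a tunable. A sketch (the sysctl name is real; whether your distro defaults it to 1 varies):

```shell
# vm.zone_reclaim_mode controls what happens when the local node is full:
#   0 - fall back and allocate on another node (e.g. node 1)
#   1 - reclaim (page out) on the local node before going off-node
cat /proc/sys/vm/zone_reclaim_mode

# Prefer off-node allocation over local reclaim (needs root):
echo 0 > /proc/sys/vm/zone_reclaim_mode
```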
I can see the logic where allocating on a non-local node is potentially a mistake. Depending on how many times the memory will be accessed, it may well be worth taking the immediate hit of swapping a page to disk to keep all your accesses local. With the swap, you at least have evidence that the page hasn't been used recently and so may never be used again. It would be sad to work yourself into a corner where lots of long-lived processes are constantly cross-allocating.
Edit to add:
Looks like a good reference paper here: http://www.kernel.org/pub/linux/kernel/people/christoph/.../numamemory.pdf
I've only skimmed it, but it makes it sound like 'page migration' is already in place.
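If so, it's even exposed to userspace. A sketch, assuming the numactl package's migratepages(8) tool is installed and $PID is the process in question:

```shell
# Move all of process $PID's pages that currently live on node 0 to node 1
# (requires appropriate privileges; migratepages(8) ships with numactl):
migratepages "$PID" 0 1
```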
I'm particularly interested in the idea of migration partly because it might help provide an answer to my recent StackOverflow question: http://stackoverflow.com/questions/3784434/inserting-pages-i...
I didn't feel that benchmarks were necessary in this case, since the result is clearly visible: either it does or does not swap under a given workload. We did run benchmarks, but only to prove that the performance was nominal with and without the setting in place, swap behavior aside, to ensure that this doesn't introduce some regression.
The real solution to this issue is for MySQL to become NUMA-aware and place threads, and the cached data blocks those threads access, more intelligently on nodes that have enough space. Other, more robust databases like Oracle already do this; they have been running on NUMA architectures for decades now.
Does it suffer more than hitting the disk?
Which would you rather suffer: a 50 ms one-time hit to page some data out, or many thousands of 500-microsecond hits plus interconnect saturation over time? The kernel engineers looked at these trade-offs and decided it was better to page data out. After all, the kernel does not know how long you'll need your data, and if it allowed memory to be allocated haphazardly all over a NUMA system, then after many hours or days you could end up with a very slow system where every other thread had to access memory on a different node.
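Taking those illustrative figures at face value, the break-even point is simple arithmetic:

```python
# Back-of-envelope break-even, using the (illustrative) numbers above:
swap_out_cost_us = 50_000   # one-time cost to page data out: 50 ms
remote_hit_cost_us = 500    # extra cost per remote-node access: 500 us

# After this many accesses, the one-time page-out would have been cheaper
# than leaving the new allocation on the remote node:
break_even_accesses = swap_out_cost_us // remote_hit_cost_us
print(break_even_accesses)  # 100
```

So under these assumed numbers, any hot allocation touched more than ~100 times already favors the page-out, and real workloads touch hot data far more often than that.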
I find it rather puzzling that DBAs think they know more about how a kernel should page memory than a kernel developer like Linus Torvalds.
The answer seems clear: if your software relies on huge amounts of memory, make it NUMA-aware. Oracle did this a long time ago, and I don't see any strange swap activity on our 8-socket, 48-core, 128 GB NUMA systems (AMD Opteron).
You misunderstand the OP. He's not saying "Just put it in Node 1 and access it from there." He's saying to swap it out to Node 1, and then when it is needed in Node 0 again, swap it back to Node 0. That's certainly cheaper than swapping to disk and back.
There are a host of problems with low-traffic stacks, including:
- Persistent database connections going away, so the first user who hits the system has to wait for the web server to reconnect to the DB.
- Cache warmup taking a long time, because it takes a long time to accumulate enough queries for the cache to figure out what needs to be cached.
- App server warmup taking a long time. Low traffic means it takes a long time for all the Apache children (or Ruby app server processes, or whatever) to get a hit and compile code into memory so the next request is fast.
- And the author's [I'm assuming] problem of systems deciding that low usage means the data can be expired from cache.
I find the innodb buffer pool cache to be the best in the business on high traffic sites using the author's configuration. [Random aside: I also use Redis and Memcached in production extensively.]
In this case, the traffic level is somewhat irrelevant. However, swapping can be demonstrated at levels of a few thousand queries per second. Read the article: this has nothing to do with swapping due to non-use (which would logically be OK).
From the kernel documentation @ http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage.... :
"Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure."
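For reference, the huge-pages route looks roughly like this (values are illustrative; `large-pages` is MySQL's option name for hugetlb-backed buffers):

```shell
# Reserve 2048 x 2 MiB huge pages (= 4 GiB) and allow mysql's group to use
# them (gid 27 is illustrative -- use the mysql group's actual gid):
sysctl -w vm.nr_hugepages=2048
sysctl -w vm.hugetlb_shm_group=27

# Then in my.cnf:
#   [mysqld]
#   large-pages
```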
I think hugepages etc., serve to solve the symptom of the problem (making mysqld itself unswappable) rather than the actual problem (the system needs memory on a certain node and there is none to be had). Making mysqld's memory unswappable will just mean that something else gets swapped, or in the worst case something gets OOM-killed or the allocation just fails. Those situations could be worse.
(This problem was fixed in Oracle in the 90s originally for deployment on Sequent hardware).
Real sharding - storing totally different stuff on the two shards - would probably give a really good performance improvement. But real sharding is much more of a pain than just starting two copies of mysqld.
All systems were running RHEL or CentOS, so perhaps Red Hat have fixed the problem.
Drop caches was originally added for benchmarking purposes, but I've found running it every N minutes seems to help system responsiveness. (I've been unable to quantify it, unfortunately.)
For those who compare the before and after output of 'free', stop it. Yes, the numbers are (sometimes drastically) different. It doesn't matter. The kernel drops pages when it needs a page, and in the general case this does work. But, as the linked LKML thread states, there may be a pathological case that you are not expected to hit (10 million+ files and 40 GB of RAM). For that specific use case, it did make sense.
The reason this is a bad idea is that dirty pages cannot be freed, which is why it is recommended to run 'sync' first. Unfortunately, on a busy system, pages will get dirtied between the sync and the drop_caches, more so if you're doing it from a shell. Those dirty pages can then never be reclaimed (due to how drop_caches works, and because drop_caches is only intended for benchmark testing).
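For anyone following along, this is the sequence in question; the gap between the two commands is exactly where pages get re-dirtied on a busy box:

```shell
sync                               # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches  # 1=pagecache, 2=dentries+inodes, 3=both
```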
(link to the same thread, but threaded: http://lkml.indiana.edu/hypermail//linux/kernel/1009.1/02943...)