
Optimizing Linux Memory Management for Low-latency, High-throughput Databases - dhruvbird
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
======
WestCoastJustin
You can also optimize memory and CPU management through Linux control groups.
Oracle published a pretty good description (see example 1: NUMA Pinning) of
how to assign dedicated CPUs and memory to a process or group of processes
[1], but you can also read about the supporting cpuset and memory cgroup
subsystems too [2, 3].

P.S. I recently created a screencast about control groups (cgroups) for
anyone interested @ [http://sysadmincasts.com/episodes/14-introduction-to-linux-control-groups-cgroups](http://sysadmincasts.com/episodes/14-introduction-to-linux-control-groups-cgroups)

[1] [http://www.oracle.com/technetwork/articles/servers-storage-admin/resource-controllers-linux-1506602.html](http://www.oracle.com/technetwork/articles/servers-storage-admin/resource-controllers-linux-1506602.html)

[2] [https://www.kernel.org/doc/Documentation/cgroups/memory.txt](https://www.kernel.org/doc/Documentation/cgroups/memory.txt)

[3] [https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt)

------
MichaelGG
Very interesting note about the reclaiming. Yet another thing to watch out
for when running transparently on a NUMA system.

NUMA can be a real pain. You can take a 40% hit on remote memory access, and
far worse if you're modifying a cache line owned by another processor. On one
of our VoIP workloads, we noticed a major (250%+) improvement in performance
and CPU stability after splitting a very thread-intensive process into
multiple processes, each set with affinity to a particular core.

OSes try to help you, but it seems like they're primarily concerned with
multiple processes, not huge processes like databases. Such processes should
become NUMA aware and handle things themselves for best performance.

It might even make sense to ask if you can split the machine on NUMA
boundaries and just act like they're separate systems. RAM's getting very
cheap, and RAM/core is going up faster than CPU power is (it seems to me,
anyways).

Also, is there a reason not to use large pages directly for the mmap'd sets if
you know you're going to have them hot at all times? (I assume they read the
entire file on start?)

~~~
apurvamehta
Hi, post author here.

> Also, is there a reason not to use large pages directly for the mmap'd sets
> if you know you're going to have them hot at all times? (I assume they read
> the entire file on start?)

We could use large pages directly. But, as I mentioned in the article, the
performance gains would be negligible compared to the gains that come from
having things in memory in the first place. These are not very large memory
systems, and the page table / TLB miss overhead doesn't seem to be biting us.
We are just following the mantra 'premature optimization is the root of all
evil' :)

~~~
erichocean
In my experience, most people don't know they have TLB problems because,
effectively, it's always bad.

It's only when you start getting to the metal to see what your hardware is
actually capable of that the TLB stands out as a glaring source of
inefficiency.

Put another way: yeah, the TLB is making your app slow, but it's doing so
_always_, so you don't notice. Instead, you mistakenly think your hardware is
just slower than it really is.

------
introspectif
"after rolling out our optimizations, we saw our error rates (ie. the
proportion of slow or timed out queries) drop by up to 400%"

There is some good shared knowledge in the post (unlike this comment, to be
fair), but what does drop by 400% mean?

If a rate drops by 100% it becomes zero. I get that.

If it increases by 400%, the outcome is slightly ambiguous (do we add 400%
for a total of 500%, or do we multiply, ending up at 400% of the original
value?).

But a rate decreasing by 400% - am I the only person who finds that (not
uncommon) expression hard to conceptualize?

~~~
erichocean
I understood it to mean that it had 1/4 the error rates they were previously
seeing.

~~~
apurvamehta
This is exactly right :)

~~~
caf
Then it dropped by 75%.

~~~
apurvamehta
Yes. It was a blunder. The post has been updated to reflect this.

------
caf
In regard to conclusion 2, there is another approach here - when you're
finished with an old segment, posix_fadvise(..., POSIX_FADV_DONTNEED) can be
used to drop it from the page cache.

~~~
apurvamehta
That would work if we were using C++. Unfortunately, all our code is in
Scala, and we use the Java NIO libraries to memory-map our files. AFAIK, they
don't give us the option of using these POSIX calls.

~~~
pquerna
Cassandra binds to posix_fadvise to do exactly this when writing out new
SSTables:

[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/CLibrary.java#L145-L162](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/CLibrary.java#L145-L162)

~~~
apurvamehta
Wow, that's great to know. We will definitely investigate this approach.
Thanks for sharing! :)

------
Erwin
I was hit by transparent huge pages on RHEL 6.2 in my workload. If you find
your ordinary processes randomly taking up huge amounts of CPU time --
system CPU time -- when doing apparently ordinary tasks, you might be affected
too. That was a real pain to diagnose when you're used to trusting the kernel
not to do anything that weird. Running "perf top" helped narrow down what
the system was REALLY doing.

I didn't have LinkedIn-size databases -- just a dozen Python processes, each
allocating perhaps 300 MB and all restarting at the same time, were enough to
trigger it, taking 10 minutes rather than 2 seconds to start up.

------
krakensden
According to LWN, this is probably going to be automatic in the future:

[http://lwn.net/Articles/568870/](http://lwn.net/Articles/568870/)
(subscriber-only now, will be free in a week)

------
dllthomas
_" we saw our error rates (ie. the proportion of slow or timed out queries)
drop by up to 400%."_

Should that be 80%?

Edited to add: Apparently it should be 75%, per comments elsewhere.

~~~
apurvamehta
Thanks, the 400% number is wrong. It was a last-minute edit... I should learn
not to do that. I have updated the post to say that the error rates have
dropped by 1/4th.

~~~
apurvamehta
Yikes. I meant that they dropped TO 1/4th the original.

------
vosper
Does the information in this article apply to VMs (specifically AWS) or is it
only relevant when you're running directly on hardware?

~~~
apurvamehta
We did our experiments directly on hardware. I don't think that AWS VMs
simulate multiple physical sockets. If they don't, then this article will not
apply to them.

------
dhruvbird
> On small setting for Linux, one dramatic performance improvement for
> LinkedIn!

should be...

you know what it should be ;)

