

Redis latency spikes and the Linux kernel: a few more details - mickeyben
http://antirez.com/news/84

======
antirez
Unfortunately I just spotted that I'm totally wrong: the huge pages are
apparently only used by jemalloc, and I just mis-read the outputs since this
seemed so obvious. So, on the contrary, it appears that the high latency is
due to the huge pages for some unknown reason, and it is malloc that, while
NOT using huge pages, is going much faster. I've no idea what is happening
here, so please disregard the blog post's conclusions (all the rest is
hopefully correct).

EDIT: Oh wait... since the problem is huge pages, this is actually MUCH
better, because we can simply disable them. And I just verified that it works:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This means a free latency improvement for all the Redis users out there
during fork (more or less 20x in my tests), just with the line above.
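
(For anyone who wants to double check, the pid below is just a placeholder for
your redis-server process:)

    cat /sys/kernel/mm/transparent_hugepage/enabled    # the bracketed value is the active mode
    grep AnonHugePages /proc/meminfo                    # THP-backed anonymous memory, system wide
    grep AnonHugePages /proc/<redis-pid>/smaps | awk '{s+=$2} END {print s, "kB"}'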

~~~
sandGorgon
This is a recommended setting even for Oracle [1] [2] and Couchbase [3].

The docs recommend adding "transparent_hugepage=never" to the kernel boot line
in the "/etc/grub.conf" file (rough sketch after the links).

[1] [http://oracle-base.com/articles/linux/configuring-huge-pages...](http://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64.php#disabling-transparent-hugepages)

[2] [https://support.oracle.com/epmos/faces/DocContentDisplay?id=...](https://support.oracle.com/epmos/faces/DocContentDisplay?id=1557478.1)

[3] [http://blog.couchbase.com/often-overlooked-linux-os-tweaks](http://blog.couchbase.com/often-overlooked-linux-os-tweaks)
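
Roughly, that boils down to something like this (paths differ per distro, so
treat it as a sketch, not a recipe):

    # persistent: append to the kernel line in /etc/grub.conf (grub-legacy),
    # or to GRUB_CMDLINE_LINUX in /etc/default/grub (grub2)
    transparent_hugepage=never
    
    # immediate, without a reboot (e.g. from /etc/rc.local)
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag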

~~~
antirez
Yes, though it was never reported by Redis users so far, maybe because in
Redis it does not have the same impact it has in other databases, and instead
shows up as a more subtle latency issue.

EDIT: Actually in the CouchBase case they talk about "page allocation delays",
which looks potentially related. I started the investigation mainly because
the Stripe graph looked suspicious for a modern EC2 instance type that has
good fork times, so there was indeed something more going on (unless the test
was performed with tens of GB of data).

------
mrb
Your jemalloc case is using 1564672 kB, which means you only have 764 huge
pages. Since the RAM access patterns in the parent process are random, you
only need ~1000 accesses to trigger copy-on-write (COW) faults totaling most
of the ~1.5GB. redis-benchmark simulates 50 clients by default and you use a
pipelining factor of 4, so you have 200 in-flight requests at any one time. I
am not familiar with how the redis server prioritizes the requests to be
served, but if it is like most network servers it is not deterministic
(whatever comes first from the network event loop). So while on average a
request may wait for 200 other ones to be served, a small fraction of them
will have to wait longer because of this non-deterministic factor. It is not
unimaginable that some have to wait for 500, maybe 1000 other requests.
Waiting for 1000 requests means waiting for a COW of most of the ~1.5GB which
_would_ take a few hundred milliseconds and would absolutely explain your
latency spike.

On the other hand your malloc case is using 4kB pages. If a request has to
wait for 1000 other in-flight requests to be executed, that means waiting for
at most 1000 pages (4MB) to be COW'd, which would take on the order of 0.1-1.0
milliseconds. This is why the latency is much lower.

tl;dr: smaller pages (4kB vs 2MB) allow finer granularity of the COW
mechanism, and lead to lower latencies.
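
Back-of-the-envelope, using the numbers above (rough arithmetic only, nothing
measured):

    echo $(( 764 * 2048 )) kB    # huge pages: ~1000 random touches COW most of the 1564672 kB
    echo $(( 1000 * 4 )) kB      # 4 kB pages: the same ~1000 touches COW only ~4 MB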

~~~
antirez
This was my first thought as well, but actually this is not what is happening
AFAIK, and the performance hit is likely due to inefficient huge page
allocation. There are reasons I believe this, but I'm actually checking in a
more systematic way right now before saying random things.

EDIT: you were exactly right. This is what happens: there are 50 clients in
the benchmark with many queued requests, and since the benchmark is designed
to touch all the keys evenly, every client served in a given event loop cycle
has a big chance of hitting a page fault. This seemed unrealistic to me, since
I saw the spike in a single event-loop cycle, but that is actually how it
works. Thanks!
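
For reference, the benchmark invocation is along these lines (key space and
request counts are placeholders here, not my exact numbers):

    redis-benchmark -c 50 -P 4 -r 1000000 -n 1000000 -t set,get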

------
diem_perdidi
I find the back-and-forth ("<this is wrong>", "EDIT", "<wrong advice>"...) to
be quite interesting and instructive, better than just editing the whole post
with the 'correct' explanation and workaround. Thanks for that!

~~~
sswaner
I agree! I also appreciate the effort over the past few days to dig into this
after it was originally reported. It is fascinating to watch the analysis
unfold.

------
gpvos
PTE = Page Table Entry

------
makmanalp
Pardon my ignorance - but isn't this a problem that's existed in database
software in general, and one that's been dealt with in the past?

This seems to be a similar problem to how to do replication while also serving
requests - on one side you want to preserve consistency, on the other side you
want to make a full snapshot. It shouldn't matter how long the replication
process takes (i.e. how slow fork is), seconds or days, as long as the
incoming changes aren't faster than the replication rate. In this case though,
your replication target is the disk.

It seems like there should never have to be a full lock of everything.

~~~
seiji
_(i.e. how slow fork is)_

The problem is fork() is a blocking system call, and the side effects of fork
also cause some latency when accessing the COW memory.

 _there should never have to be a full lock of everything._

There's no software lock involved from the Redis side, but fork() behavior has
the effect of a pseudo-lock (because the system call blocks the entire process
from running until it returns).
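
Redis reports the direct cost of that blocking fork itself, if you want to
check a given box (assuming redis-cli can reach the instance):

    redis-cli info stats | grep latest_fork_usec
    # latest_fork_usec:<microseconds the last BGSAVE/AOF-rewrite fork took>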

~~~
makmanalp
Thank you, this clears up the issue a bit more in my mind.

------
Erwin
I had a similar transparent huge pages issue some time back -- the symptoms
were incomprehensible spikes in CPU usage _sometimes_. Apparently this had
something to do with memory allocation pressure when my Python processes
restarted. At the time, however, I thought these were weird power throttling
issues causing the CPU to run at lower performance (which would explain why a
normal startup took 1 CPU minute instead of 10 seconds).

I should have spotted this by looking at the split between user and system
CPU: when the THP problem happened SYS% spiked, but I was looking at a dozen
processes pegged at 100% CPU.

I eventually tracked this down via "perf top" which lets you see what
everything on the system is doing: both kernel and user space symbols are
tracked. I saw significant kernel-space work in an odd memory-related symbol
and found someone else with the same problem.
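
(If you haven't used it, the invocation is just the following, assuming the
perf tool for your kernel is installed and you can run it as root:)

    perf top            # live, system-wide view of the hottest kernel and user-space symbols
    perf top -p <pid>   # or narrow it to a single process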

------
matkam
I'm having trouble figuring out how to disable transparent hugepages without a
reboot:

    ># uname -a
    Linux ip-10-214-7-159 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
    
    ># grep TRANSPARENT /boot/config-3.2.0-36-virtual
    CONFIG_TRANSPARENT_HUGEPAGE=y
    # CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
    CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
    
    ># echo never > /sys/kernel/mm/transparent_hugepage/enabled
    bash: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
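
Is the knob maybe exposed under a different path on this kernel, or just not
active at all? This is me guessing:

    ls /sys/kernel/mm/ | grep -i huge      # is any hugepage directory there at all?
    grep AnonHugePages /proc/meminfo       # non-zero would mean THP is actually in use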

------
jbert
Dumb question - if the problem is a latency spike in the parent, is there any
chance you can continue serving in the _child_ and let the parent do the
BGSAVE (and take the latency spike)?

(The child should be able to inherit all open fds?)

~~~
antirez
I'm not sure but probably both sides are blocked when copying the shared PTEs.

~~~
amluto
This issue has nothing to do with copying the PTE [1] and everything to do
with copying the data.

After fork, the parent and the child have write-protected references in their
page tables pointing at the same data. (It makes very little difference
whether said write-protected reference is an actual entry in the page tables
or just in the logical VMA data.)

As soon as one of the processes tries to modify the data, the data must be
copied so that the original data doesn't change (because the other process can
still see it). That means that the kernel needs to find somewhere to put the
copy, and that's almost certainly the expensive part with THP if
defragmentation is on. Once the kernel finds a place to copy to, it can update
the vm data structures (page tables or otherwise) to give the writer a private
copy.

[1] Minor nitpick: for hugepages, there aren't actually any PTEs involved, but
that's really just a terminology issue. It's actually an entry in a higher
level table in the tree.
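
If defragmentation is indeed the expensive part, the separate defrag knob is
probably worth a look as well (same caveats as the main switch; a suggestion,
not something measured here):

    cat /sys/kernel/mm/transparent_hugepage/defrag
    echo never > /sys/kernel/mm/transparent_hugepage/defrag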

~~~
antirez
Hello amluto, we were talking about PTEs here because in Linux the PTE copying
does not happen during fork() as it does in other OSes, but is lazy. So Linux
defers to a later time _both_ the copying of the PTEs (or huge page entries)
and the copying of the pages themselves.

------
msandford
This is why open-source software is so rad. I suspect that you'd never see an
Oracle customer complaining publicly and then an Oracle engineer publicly
tracking down the source of their complaints.

~~~
mjfisher
I'm a huge fan of antirez in particular for this. So many other devs would
have shrugged off the occasional latency as acceptable, or as a poor measure
of true performance, or blamed it on nebulous EC2 performance issues. All of
which happened, to a greater or lesser extent. But three blog posts later he's
still digging into the root causes and has eventually (through no small
effort) come up with the cause, and a fix.

Open source is good. It's individuals like antirez who make it great.

