Unfortunately I just spotted that I'm totally wrong: the huge pages are apparently only used by jemalloc. I misread the outputs because the conclusion seemed so obvious. So, on the contrary, it appears that the high latency is due to the huge pages thing for some reason I don't yet understand, and it is malloc that, while NOT using huge pages, is going much faster. I have no idea what is happening here, so please disregard the blog post's conclusions (all the rest is hopefully correct).
EDIT: Oh wait... since the problem is huge pages, this is MUCH better, because we can simply disable them. And I just verified that it works:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
This means a free latency improvement during fork for all the Redis users out there (more or less 20x in my tests), just with the line above.
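For anyone trying this: the bracketed value in the same sysfs file shows which setting is active, so you can verify the change (the echo does not survive a reboot, so you may also want to re-apply it from a boot script such as /etc/rc.local or your distro's equivalent):
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]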
Yes, it was never reported by Redis users so far, maybe because in Redis it does not have the same impact it has in other databases, showing up instead as a more subtle latency issue.
EDIT: Actually, in the CouchBase case they talk about "page allocation delays", which looks potentially related. I started the investigation mainly because the Stripe graph looked suspicious for a modern EC2 instance type with good fork times, so there was indeed something more going on (unless the test was performed with tens of GB of data).
Replying to myself for an update. The hypothesis of a few users here is correct, even if it was hard to believe. The latency spike is due to the fact that, even with the 50 clients of the benchmark, it is possible to touch all the huge pages composing the process in the space of a single event loop cycle. This is why I was observing this initial spike and nothing more. This seemed unrealistic to me with 50 clients, but I remembered that one of the Redis optimizations is to try to serve multiple requests from the same client in the same event loop cycle if there are many in queue, so this is what happens: 50 clients x N queued requests = enough requests to touch all the memory pages, or at least a large enough percentage to block for a long time once, and never again.
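For context, this is the load pattern you get with a redis-benchmark invocation along these lines (just an illustrative command, the exact one from the post is not quoted here): 50 clients, a pipeline of 4, touching keys evenly across the keyspace:
redis-benchmark -c 50 -P 4 -n 1000000 -r 1000000 -t set,get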
Have you tried disabling only defrag? The performance issues I'm familiar with are generally related to defrag running at allocation time. I would be curious to see your results with THP still enabled but:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
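That way the experiment would be huge pages on, synchronous defrag off. A quick sketch to confirm the combination before re-running the benchmark (again, the active value is the one in brackets):
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag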
Awesome hint, I'll try it. However, if huge pages mean a 2MB copy-on-write for each fault, you still want them turned off for Redis. It also does not help with jemalloc memory reclaiming.
That makes sense. Since jemalloc is using huge pages, even if only one byte inside a huge page changes, the kernel needs to copy an entire 2MB per modified entry (worst case locality), whereas with a non-hugepage allocation it will only copy 4kB. That's why you saw MOVs in the stack trace - the kernel was busy copying the entire huge page over. With small pages, it can be more granular about its copying.
I don't think that's the reason actually: there is a big spike and then all the other COWs are fine, and copying 2MB is nowhere near the 300 ms I was observing. I'm not sure whether for some reason most pages were copied, or it was some other issue entirely. Maybe it could be the code that splits the huge page into multiple pages? MOV -> fault -> split of the huge page -> COW of a single 4k page.
You're probably right, but a quick way to test this theory is to check how much private (non-COW) memory is being consumed by the forked child with and without THP. This should tell you how much data is actually being copied vs shared. Not sure how you get this number from Linux though.
Exactly! Doing this now. Fortunately there are ways to do it (Redis actually does this already in order to report the amount of copy-on-write performed during the save; you may want to check the code if you are curious, it is pretty trivial, and it is possible thanks to the /proc/<pid>/smaps info).
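For anyone following along from the shell, summing the Private_Dirty fields of the child's smaps gives the same kind of number (replace <pid> with the PID of the saving child; this is basically what the Redis code does, just in C):
awk '/Private_Dirty/ {sum += $2} END {print sum " kB"}' /proc/<pid>/smaps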
Disabling transparent hugepages has been common advice in the Postgres world for a while, and for the same reason. I suspect anything that forks a lot should have them off.
Could you confirm that the MOV is copying PTEs? I suspect it may instead be copying the heap, because of copy-on-write: after fork(), the parent writes to the heap randomly, so it will copy more data with jemalloc than with libc (because of the huge pages).
I just tested: while running the redis-benchmark command you listed after a BGSAVE, smaps lists all anonymous memory as AnonHugePages. So it ends up as THP, although I suppose it's possible that it is broken into 4k pages and then adjacent 4k pages are merged back into huge pages. I'll see if I can figure out how the lazy PTE copying works.
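For reference, the quick check I used (assuming a single redis-server instance) just sums the AnonHugePages fields in smaps to see how much of the anonymous memory is backed by huge pages:
awk '/AnonHugePages/ {sum += $2} END {print sum " kB"}' /proc/$(pidof redis-server)/smaps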
I do see it now... I wasn't testing it right before. It goes away, as you say, when disabling THP. KSM has no effect, contrary to what I thought earlier.
But an immutable data structure is not that different from the copy-on-write pages the kernel gives you, and likely involves the same amount of copying. And disk IO is blocking, so you need at least a thread anyway, which makes this a natural strategy.
Your jemalloc case is using 1564672 kB, which means you only have about 764 huge pages. Since the RAM access patterns in the parent process are random, you only need ~1000 accesses to trigger enough COWs (copy-on-write) to cover most of the ~1.5GB. redis-benchmark simulates 50 clients by default, and you use a pipelining factor of 4, so you have 200 in-flight requests at any one time. I am not familiar with how the Redis server prioritizes the requests to be served, but if it is like most network servers it is not deterministic (whatever comes first from the network event loop). So while on average a request may wait for 200 other ones to be served, a small fraction of them will have to wait longer because of this non-determinism. It is not unimaginable that some have to wait for 500, maybe 1000 other requests. Waiting for 1000 requests means waiting for a COW of most of the ~1.5GB, which would take a few hundred milliseconds and would absolutely explain your latency spike.
On the other hand your malloc case is using 4kB pages. If a request has to wait for 1000 other in-flight requests to be executed, that means waiting for at most 1000 pages (4MB) to be COW'd, which would take on the order of 0.1-1.0 milliseconds. This is why the latency is much lower.
tl;dr: smaller pages (4kB vs 2MB) allow finer granularity of the COW mechanism, and lead to lower latencies.
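For the record, the back-of-the-envelope arithmetic (shell, using the numbers quoted above):
echo $((1564672 / 2048))   # ~764 huge pages of 2MB backing the ~1.5GB heap
echo $((764 * 2))          # ~1528 MB copied if each huge page is COW'd once
echo $((1000 * 4))         # ~4000 kB copied by ~1000 writes when pages are 4kB instead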
This was my first thought as well, but this is not what is happening AFAIK, and the performance hit is likely due to inefficient huge page allocation. There are reasons why I believe this, but I'm checking in a more systematic way right now before saying random things.
EDIT: you were exactly right. This is what happens: there are 50 clients in the benchmark, each with many queued requests, and since the benchmark is designed to touch all the keys evenly, every request served in a given event loop cycle has a big chance of hitting a page fault. This seemed unrealistic to me, since I saw the spike within a single event-loop cycle, but this is actually how it works. Thanks!
I find the back-and-forth ("<this is wrong>", "EDIT", "<wrong advice>"...) to be quite interesting and instructive, better than just editing the whole post with the 'correct' explanation and workaround. Thanks for that!
I agree! I also appreciate the effort over the past few days to dig into this after it was originally reported. It is fascinating to watch the analysis unfold.
Pardon my ignorance - but isn't this a problem that's existed in database software in general, and one that's been dealt with in the past?
This seems to be a similar problem to how to do replication while also serving requests - on one side you want to preserve consistency, on the other side you want to make a full snapshot. It shouldn't matter how long the replication process takes (i.e. how slow fork is), seconds or days, as long as the incoming changes aren't faster than the replication rate. In this case though, your replication target is the disk.
It seems like there should never have to be a full lock of everything.
The problem is that fork() is a blocking system call, and the side effects of fork also cause some latency later, when the COW memory is accessed.
"there should never have to be a full lock of everything."
There's no software lock involved from the Redis side, but fork() behavior has the effect of a pseudo-lock (because the system call blocks the entire process from running until it returns).
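If you want to see how long that pseudo-lock actually lasts on your instance, Redis reports the duration of the last fork in its INFO output, so something like this (at least on reasonably recent versions) gives a quick number:
redis-cli INFO | grep latest_fork_usec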
I had a similar transparent huge pages issue some time back -- the symptoms were occasional incomprehensible spikes in CPU usage. Apparently this had something to do with memory allocation pressure when my Python processes restarted. At the time, however, I thought it was perhaps a weird power throttling issue that made the CPU run at lower performance (which would have explained why a normal startup took 1 CPU minute instead of about 10 seconds).
I should have spotted this by looking at the split between user and system CPU: when the THP problem happened SYS% spiked, but I was looking at a dozen processes pegged at 100% CPU.
I eventually tracked this down via "perf top" which lets you see what everything on the system is doing: both kernel and user space symbols are tracked. I saw significant kernel-space work in an odd memory-related symbol and found someone else with the same problem.
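For anyone who wants to do the same kind of digging, the invocation is nothing fancy; run it as root, with -g adding call graphs so you can see which kernel path the memory-related work is coming from:
perf top -g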
I'm having trouble figuring out how to disable transparent hugepages without a reboot:
# uname -a
Linux ip-10-214-7-159 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# grep TRANSPARENT /boot/config-3.2.0-36-virtual
CONFIG_TRANSPARENT_HUGEPAGE=y
# CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
bash: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
Dumb question - if the problem is a latency spike in the parent, is there any chance you could keep serving from the child and let the parent do the BGSAVE (and take the latency spike)?
(The child should be able to inherit all open fds?)
The problem happens when one of the processes (parent or child) tries to modify a page. Modifying a page triggers the copy-on-write mechanism and blocks the process making the modification - it doesn't matter whether it's the parent or the child. In Redis' case this works out fine, since the BGSAVE child doesn't do any writes to the dataset: it's always the server process that does the writes.
This issue has nothing to do with copying the PTE [1] and everything to do with copying the data.
After fork, the parent and the child have write-protected references in their page tables pointing at the same data. (It makes very little difference whether said write-protected reference is an actual entry in the page tables or just in the logical VMA data.)
As soon as one of the processes tries to modify the data, the data must be copied so that the original data doesn't change (because the other process can still see it). That means that the kernel needs to find somewhere to put the copy, and that's almost certainly the expensive part with THP if defragmentation is on. Once the kernel finds a place to copy to, it can update the vm data structures (page tables or otherwise) to give the writer a private copy.
[1] Minor nitpick: for hugepages, there aren't actually any PTEs involved, but that's really just a terminology issue. It's actually an entry in a higher level table in the tree.
Hello amluto, we were talking about PTEs here because in Linux the PTE copying does not happen during fork() like in other OSes, but is lazy. So Linux defers to a later time both the copying of the PTEs (or huge page entries) and of the pages themselves.
This is why open-source software is so rad. I suspect that you'd never see an Oracle customer complaining publicly and then an Oracle engineer publicly tracking down the source of their complaints.
I'm a huge fan of antirez in particular for this. So many other devs would have shrugged off the occasional latency as acceptable, or as a poor measure of true performance, or blamed it on nebulous EC2 performance issues - all of which happened here, to a greater or lesser extent. But three blog posts later he's still digging into the root causes and has eventually (through no small effort) come up with the cause, and a fix.
Open source is good. It's individuals like antirez who make it great.