Redis latency spikes and the 99th percentile (antirez.com)
168 points by r4um on Oct 30, 2014 | 51 comments

A coworker pointed me to this interesting post noting that in a browser-facing web server (not the same as hits to a datastore), most users will experience the 99th percentile for at least one resource on the page.
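To put a number on that claim: if each resource request independently has a 1% chance of landing at or above the 99th percentile, the chance that a page with n requests sees at least one such request is 1 - 0.99^n. A minimal Python sketch, assuming independence (which the reply below questions); the function names are made up for illustration:

```python
def p_at_least_one_slow(n_requests: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of n independent requests is an outlier."""
    return 1.0 - (1.0 - p_slow) ** n_requests


def min_requests_for_majority(p_slow: float = 0.01) -> int:
    """Smallest n for which more than half of page loads see an outlier."""
    n = 1
    while p_at_least_one_slow(n, p_slow) <= 0.5:
        n += 1
    return n
```

Under these assumptions a page that triggers 69 or more requests crosses the 50% mark, and a page with 100 requests sees at least one outlier about 63% of the time.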


This is only true if these latency outliers are evenly distributed over time. If they instead happen once every 30 minutes, then only users requesting a page at that moment will be affected.

Excellent point, yes.

That ignores that the 99th percentile applies only to the dynamically generated content, not to the scripts/stylesheets/images, which should be served close to the user from a CDN.

Sometimes even creating the HTML you serve requires many DB calls: tens is not uncommon.

But the root of this thread was talking about users encountering 99th percentile for at least one resource on the page. But the common pattern is that most of the resources on a page are static (even if the generation of the page itself is dynamic).

I'll note that I'm excluding API calls here. I don't commonly hear API calls referred to as 'resources' on a page, so I'm assuming that this wasn't being referred to.

XHRs appear under the "Resources" tab of my web inspector, so I guess by that right they're resources on the page :) But you're right -- the latency histograms for static resources are very different than for dynamic resources.

So while the probabilities do suggest that most users will encounter at least one 99th percentile latency, it's very different if that latency is for a static resource vs a dynamic one. Thanks for pointing it out.

A nice follow-up and details digging into yesterday's "This is why I can't have conversations using Twitter" post:



I really wanted to comment "just always make a blog post in reaction, twitter is not a good medium for explaining complex ideas" on that thread, but thought it would be unproductive.

Please just do this every time, people who care will read, people who don't care will probably not comment for fear of looking stupid.

Why is forking on xen slow? Most google hits for "xen forking slow" seem to point to some discussions about redis, but I guess other software would suffer from that too.

It's important to remember two things: First, in a more general way, fork() is kind of slow just by virtue of being a system call and involving several steps, so applications try to avoid making lots of calls to fork (e.g. webservers long ago stopped doing the naïve fork-for-every-request model).

Redis uses fork in a way nobody else seems to. Applications using tens of gigabytes of RAM -- databases, media editing, etc. -- just don't usually fork except during a startup process.

Redis uses fork for persistence. It's clever, and it works great in general, but it's weird! Unique, even. It seems much more likely to hit a fork() weakness in a noticeable way than almost any other application out there.

Totally true. I always think that eventually some kernel change may totally break how Redis works and I'll have to resort to implementing the same stuff in user space. A simpler parallel path is to use Redis Cluster / Sentinel and provide good safety for when persistence is disabled, and achieve data safety via redundancy in replicas. However even in such a scenario, it will be kind of handy that Redis can persist so that restarts performed with SHUTDOWN (for example for upgrades) will still retain the dataset when needed.

Fork has always been a mutant feature. Split the entire process into twins? So that the caller can immediately exec() and discard the tediously-cloned twin and become a different program? It's the most egregiously inefficient feature ever to grace an operating system. Page tables, stacks, heap allocations, file handles - all cloned and then discarded.

In this regard, Redis is the first system using it the proper way, I guess ;-) I mean, all this copying is not discarded but used to create a point-in-time snapshot.

No, every Unix command shell uses the copied program and file descriptors.

Though these days it's worth noting that much of that is COW, cutting down the actual copying (at the expense of some bookkeeping). There's certainly still needless work, however.

A thread on xen-devel, similar discussion. http://lists.xen.org/archives/html/xen-devel/2012-07/msg0008...

AFAIK The kernel needs to perform calls to the hypervisor in order to copy the page table, but I'm not expert enough to provide you with details about this unfortunately. It is for sure not inherently due to virtualization, since for example VMware does not have this issue.

That sounds like a PV vs. HVM issue. Using hardware virtualization extensions to handle virtual memory is almost always faster than paravirtualization these days, which is why Xen introduced PVH mode in 4.4.

Yep. And EC2's "HVM"-type instances are now actually PVHVM, not pure HVM.

Since this change, there has been absolutely no reason to use anything other than HVM AMIs, and pure paravirtual instances can basically be considered a deprecated feature in EC2. New EC2 instance classes (e.g. t2) don't even support PV.

Basically, PV instances are just there to support current customers who are relying on already-built PV AMIs and have too much inertia to be nudged into switching over.

Me sees you've struggled with the problem for a while too: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1815 filed April 2012 and not even an ack.

Yes, at some point I got a report about Xen 3.0 (if I remember correctly) fixing the issue, but AFAIK I have never seen things improve much in the real world. Here is a table of fork times in different environments, just to show how bad it is:

Linux beefy VM on VMware 6.0GB: 12.8 milliseconds per GB.

Linux running on physical machine (Unknown HW): 13.1 milliseconds per GB.

Linux running on physical machine (Xeon @ 2.27GHz): 9 milliseconds per GB.

Linux VM on 6sync (KVM): 23.3 milliseconds per GB.

Linux VM on EC2 (Xen): 239.3 milliseconds per GB.

Linux VM on Linode (Xen): 424 milliseconds per GB.

Around 30 times slower than bare metal, and I'm talking about old physical servers with slow memory compared to today's.
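To make those numbers concrete, here is a small Python sketch that scales the per-GB figures from the table linearly to a dataset size. Linear scaling is an assumption, and the dictionary keys and function names are just labels chosen for this example:

```python
# Fork cost in milliseconds per GB of process memory, copied from the table above.
FORK_MS_PER_GB = {
    "VMware": 12.8,
    "physical (unknown HW)": 13.1,
    "physical (Xeon 2.27GHz)": 9.0,
    "6sync (KVM)": 23.3,
    "EC2 (Xen)": 239.3,
    "Linode (Xen)": 424.0,
}


def estimated_fork_ms(environment: str, dataset_gb: float) -> float:
    """Estimated time the process is stalled inside fork(), assuming linear scaling."""
    return FORK_MS_PER_GB[environment] * dataset_gb


def slowdown(environment: str, baseline: str = "physical (unknown HW)") -> float:
    """How many times slower fork is in `environment` than in `baseline`."""
    return FORK_MS_PER_GB[environment] / FORK_MS_PER_GB[baseline]
```

For a 10 GB dataset this gives roughly 0.13 seconds on the unknown physical machine, about 2.4 seconds on EC2, and about 4.2 seconds on Linode.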

EDIT: Make sure to read this -> https://redislabs.com/blog/testing-fork-time-on-awsxen-infra...

Have you tried to use huge page to reduce the time of fork()?

There is also a similar problem for the master of MFS; it would be great if you have some experience with it.

EDIT: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

At least one user of Redis tried to enable huge pages and reported that there were issues: https://groups.google.com/d/msg/redis-db/3VCDYKXhDdI/rIaPnK9...

This mailing list post:


suggests that it's page table validation time, and it can be avoided if you use a PVHVM guest. Has anyone checked whether these redis problems apply only to PV guest images on EC2?

Side note: you can set your Y axis format to "ms" in Grafana to make the values more descriptive, and add the "Points" chart option under display styles to make the mean values visible, which are obscured by the 90th percentile bar in your chart. Also, I assume the label is wrong: it says 90th percentile in the chart, but you speak of the 99th percentile.

Graphs are not mine, it's from Stripe engineers, so I've no control in how they are generated. About 90th vs 99th, I was talking about 99th but in the case of Redis latency spikes due to fork, you would get exactly the same graph as all the requests are delayed in this moment.

Missed the "Stripe blog post" part, sorry for the misdirection.

np at all, thanks.

I'm kind of glad Redis did the fork approach first. It's the reason I went with a userspace COW implementation in my work instead of forking, and that paid huge dividends. It's the difference between starting COW in 10-20 milliseconds versus seconds, and most of that time is distributed coordination, not flipping the COW boolean.

When you crank up the density per node to 256 or 512 gigabytes, even bare metal is problematic, and in some domains like telecommunications they don't care that the spikes are concentrated, because they cause cascading failures.

I think a userspace COW implementation in Redis would be a big project because you would need a different strategy for every data structure. Being single threaded also makes it challenging to get other/more cores to do the serialization and IO. It's very doable, just not within the current philosophy of thou shalt have one thread per process.

I think "userspace COW" is the wrong approach here.

The entire idea of getting a point-in-time snapshot of something subject to rapid change is problematic. Your options boil down to (1) make a "snapshot" and save that (Redis's current approach) or (2) accept that point-in-time consistency might be impossible, and work around it.

I wonder whether it'd be possible to have two persistence strategies in Redis: "consistent" and "low-latency". "Consistent" would use the current fork(2) COW behavior, "low-latency" would do some kind of one-chunk-at-a-time block copy and amortize the latency spike over the entire operation, while having the overall effect of less of a "cliff" to latency.
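A rough Python sketch of what such a "low-latency" mode could look like: copy a bounded number of keys per event-loop iteration instead of forking. This is purely illustrative and not how Redis works; `chunked_copy` and `chunk_size` are hypothetical names. The result is not a point-in-time snapshot, since writes that land between chunks show up in later chunks:

```python
from typing import Dict, Hashable, Iterator


def chunked_copy(data: Dict[Hashable, object], chunk_size: int) -> Iterator[Dict[Hashable, object]]:
    """Yield shallow copies of `data`, at most `chunk_size` keys at a time.

    The caller is expected to do other work (serve requests) between chunks.
    """
    keys = list(data.keys())  # key list is fixed up front; values are read lazily
    for start in range(0, len(keys), chunk_size):
        chunk = {}
        for key in keys[start:start + chunk_size]:
            if key in data:  # the key may have been deleted since we started
                chunk[key] = data[key]
        yield chunk
```

Each step does a bounded amount of work, so the stall per request stays small, at the cost of the saved copy mixing old and new values.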

cc1.4xlarge is on par with bare metal apparently...

That would make sense if the other EC2 Xen hosts were running PV guests (as was required for the smaller / older EC2 hosts, IIRC?) and the cc1.4xlarge was PVHVM, according to: http://lists.xen.org/archives/html/xen-devel/2012-06/msg0102...

cc1.4xlarge was the first Linux HVM EC2 instance. All new families support HVM (and some are HVM only)

Should probably add a disclosure here. I ran the EC2 instance product management team for several years and continue to be involved.

Looks like a smoking gun then.

Curious, why fork in the main thread? Forking traditionally is a pretty heavyweight operation. Perhaps versioning might be more performant?

It looks like redis is using fork to generate checkpoints of the database. Which is quite a neat hack - use the copy-on-write memory properties of modern unix fork() implementations to implement persistent checkpoints of database state.

It does mean that forking the entire process is pretty much the point of the exercise however.
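The idea can be sketched in a few lines of Python on a Unix system. This is an illustration of the technique, not Redis's actual code: the child process sees the data exactly as it was at the moment of fork(), so it can write it out while the parent keeps mutating its own copy. `snapshot_to_disk` and the JSON format are assumptions made for this example:

```python
import json
import os


def snapshot_to_disk(data: dict, path: str) -> int:
    """Fork a child that writes `data` as it was at fork time; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy-on-write view of the parent's memory.
        status = 1
        try:
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(data, f)
            os.replace(tmp_path, path)  # atomic rename, so readers never see a partial file
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup handlers
    # Parent: returns immediately and can keep serving writes.
    return pid
```

The parent should eventually call `os.waitpid(pid, 0)` to reap the child. In CPython, reference counting touches object headers, so the child ends up copying more pages than a C program would; the sketch only shows the control flow.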

Redis uses a reactive architecture using non-blocking I/O. They fork to get a point-in-time consistent snapshot that can be written to disk. The problem is that fork blocks, and while blocked, it stalls the event loop, subjecting incoming requests to stalls.
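The effect is easy to reproduce with any single-threaded event loop. A minimal sketch using Python's asyncio, with `time.sleep` standing in for a slow fork(); the function names are made up for this example:

```python
import asyncio
import time


async def _ticker(gaps: list, stop: asyncio.Event) -> None:
    """Wake up every millisecond and record how long each wake-up really took."""
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(0.001)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def _run(block_seconds: float) -> float:
    gaps: list = []
    stop = asyncio.Event()
    task = asyncio.create_task(_ticker(gaps, stop))
    await asyncio.sleep(0.02)   # let the ticker run normally for a while
    time.sleep(block_seconds)   # blocking call on the loop thread: nothing else can run
    await asyncio.sleep(0.02)
    stop.set()
    await task
    return max(gaps)


def max_stall(block_seconds: float = 0.1) -> float:
    """Longest gap (in seconds) seen by the ticker while the loop was blocked."""
    return asyncio.run(_run(block_seconds))
```

The ticker stands in for every other client: while the blocking call runs, its 1 ms wake-up is delayed by the full duration of the block.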

In Unix, blocking means going into IO wait. In this case it's just slow when emulated ("paravirtualized") by Xen.

Pretty much this: doing blocking operations in your poller thread is a big no, even if it appears clever.

Btw, I am not sure one can call it a "reactive architecture" if it uses a single thread.

Does "versioning" have a specific meaning in this context?


I know that redis really wants to be a persistent kvstore. I had a problem with a large website when I increased caching by 2 orders of magnitude (enough RAM to play with). When it came time to write a snapshot to disk, everything died for 5 minutes. Turned it off and haven't thought about it since. I'm not sure I'll ever shed my RDBMS predilections.

This is a huge problem on wall street, where trades must have predictable latencies. Stop-the-world garbage collectors are another source of latency "catastrophes".

Does fork not use copy-on-write? I would expect it to add overhead to all overlapped memory operations going forward, but would be really surprised if it literally duplicated the entire memory contents.

It does, but the kernel must still copy the page tables in full. For a very large process on a 64-bit machine, the amount of copying can be substantial.
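As a back-of-the-envelope check, assuming x86-64 with 4 KiB pages and 8-byte page table entries, and counting only the last-level entries (the upper levels add comparatively little):

```python
GIB = 2 ** 30
MIB = 2 ** 20


def page_table_bytes(resident_bytes: int, page_size: int = 4096, entry_size: int = 8) -> int:
    """Approximate size of the last-level page table entries mapping `resident_bytes`."""
    pages = -(-resident_bytes // page_size)  # ceiling division
    return pages * entry_size
```

For a 24 GiB process that is 48 MiB of page table entries to copy on every fork. With 2 MiB huge pages the same arithmetic gives 96 KiB.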

There can be vast differences in fork times between different virtualization schemes, even if your code is CoW-friendly (and Redis's is).
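If you want to compare environments yourself, a rough Python sketch for timing fork() with a given amount of resident memory looks like this. It is Unix-only, the function name is hypothetical, and interpreter overhead means the absolute numbers are only useful for comparing machines against each other:

```python
import os
import time


def measure_fork_ms(resident_mb: int = 64) -> float:
    """Time one fork() call in the parent while holding `resident_mb` MB of touched memory."""
    buf = bytearray(resident_mb * 1024 * 1024)
    for offset in range(0, len(buf), 4096):
        buf[offset] = 1  # touch every page so it is actually mapped
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    os.waitpid(pid, 0)  # reap the child
    return elapsed_ms
```

Running it with a few different sizes on each machine and dividing by the size gives a per-GB figure in the same spirit as the table earlier in the thread.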

Actually duplicating the memory contents would be slower still, I'm sure. What seems to be going on here is simply a consequence of paravirtualization needing to implement certain functions in software that would otherwise be taken care of in hardware when using HVM or bare metal.

Memory fragmentation seems to play a role too. Even if you have (barely) enough free RAM, machines can start swapping and a Redis server in that state is bound to wake you up at night... :(

If you run a server with anything but "swapoff -a" (or with any swap partition in /etc/fstab), you're doing it wrong. If the server has to swap, get more RAM or scale your stuff out, but never swap.
