Hacker News new | past | comments | ask | show | jobs | submit login
Deep inside the Linux kernel: a network latency spike (cloudflare.com)
220 points by jgrahamc on Nov 19, 2015 | hide | past | web | favorite | 28 comments

Great detective work.

Before people go off reducing tcp_rmem too low, I thought I'd explain its actual purpose.

A TCP sender will only send data of the receiver has buffer space available. With 100msec of latency, at 1Gbit/sec there can be 12MBytes of data "on the wire". If a retransmit is necessary, you can multiply that by a number of round trips (say 2 or 3?) to get the needed amount of receive buffer to keep the sender sending. Otherwise per-connection throughput is reduced linearly with the tcp_rmem size.

I mention this because objects in a CDN are usually pretty small, and lower values of tcp_rmem may be ok for them.

And really I mention this hoping it motivates someone to fix the bug.

> if the receiver has buffer space available

nitpick: if the sender believes the entire path between sender and receiver has buffer space available. I.e. it will send up to the minimum of the receive window and the congestion window.

Retransmits cause the cwnd to be dramatically reduced; in my experience huge buffers really only work under perfect network conditions.

I had always associated the two together. There is one "window" slot in a TCP packet. How does the sender know the receiver's receive window separately from the congestion window?

The sender maintains cwnd as a completely separate parameter that does not go in any packets. It doesn't "know" anything; the value of this parameter is arrived at by some set of heuristics.

See for instance https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algor...

SystemTap (or any other dynamic tracing tool) is very powerful; I suggest to everyone who does system-level or performance-oriented work that you learn how to use it. It's not as difficult as it first looks. If you understand what a syscall is, it's useful. If you understand more about the kernel insides, e.g. the various vfs functions, scheduling, memory mapping and allocation, it gets so much more powerful. It basically lets you run arbitrary code anywhere inside a kernel (or userspace). That is, at any line of code, you can break and jump to your own code and return again. Your code can do pretty much anything - there is a way to embed C in a SystemTap script. Usually, you're just keeping counters, timers, and backtraces; there is a nice little scripting language that SystemTap compiles to C for those cases.

I have done a very similar analysis recently. I had an HTTP service with latency spikes. I used SystemTap to discover occasional long-running syscalls; the contents of the syscalls pointed out the exact problem. Once those latency spikes were clear, I still had a few very occasional spikes (as in this article). I created a SystemTap script that collected backtraces anytime our process was rescheduled after being asleep for longer than some time (using [1][2] to start/stop a timer). I then used that to plot a flamegraph. After that, the problem was pretty clear and I was able to intelligently mess with some kernel tuning parameters and eventually discover that it wasn't enough - I just need to deal with the problem on my side, from there the fix was pretty easy.

[1] https://sourceware.org/systemtap/tapsets/API-scheduler-cpu-o... [2] https://sourceware.org/systemtap/tapsets/API-scheduler-cpu-o...

> Setting the rmem sysctl to only 2MiB is not recommended as it could affect the performance of high throughput, high latency connections

So what I took away from this article is that they'd set some absurdly high receive buffer, which predictably caused something to break, and never quite got around to explaining how they arrived at the original number. That buffer is for data the kernel has received but has not yet been read() by userspace. Seriously, 5MB of RAM dedicated to each socket in the kernel?

Assuming these are Nginx proxy connections, why did you choose to tweak a kernel tunable rather than increasing the corresponding userspace application buffer sizes instead?

edit: so the answer seems to be that Nginx is wholly reliant on the kernel to handle buffering, which is completely asinine to me, but perhaps there is good rationale here I've somehow missed. The take home becomes "our app stack is so broken we managed to break additional extreme-battle-tested components in a bid to to get it to function sensibly"

Note that these buffers are not only relevant for unread data of the user space application, they also reflect the TCP window size - this means the sending side knows how much room the receiver has at all times , and can throw that much data at it without receiving a ACK - so there's a fine dance between network speed , latency, and this buffer size.

> Seriously, 5MB of RAM dedicated to each socket in the kernel?

Debian 8 seems to have a default of 6 MiB. So, the "absurdly high receive buffer" might not be cloudflare's fault.

The TCP protocol needs to be able buffer one round trip time (RTT) worth of data at both sender and receiver in the kernel. It needs one RTT of data at the receiver to cope with waiting for a retransmission of a lost packet, or just so it doesn't drop incoming packets when the receiving app doesn't read immediately. The same amount of data must be buffered at the sender because it may need to be retransmitted if a packet was lost. So, 6MiB of buffering with a fairly typical 100ms RTT gives you 6*8/0.1 Mbit/s = 480Mbit/s. In practice you may need more buffering than this, but that's the maximum rate you can achieve.

Now, if you've got an overbuffered router at the bottleneck, as is common on home broadband links, you can build a queue there. It's not really all that unusual to see queues build to 1 second, especially on ADSL. If you've a 1 second RTT, the maximum rate you can support with 6MiB of buffering at both sender and receiver drops to 48Mbit/s. It seems to me that 6MiB is a fairly reasonable default value, though if you're running a really busy server, you might benefit from reducing this, especially if you're mostly serving small files.

Well one rationale could be that if nginx did buffering too, you're essentially doing buffering twice. Doing something once is often faster than doing it twice.

The collapsing and pruning of tcp buffers shows up in network statistics also, could have been the first hint.

    ‣ netstat -s | grep -E '(prune|collap)'
    233 packets pruned from receive queue because of socket buffer overrun
    1581 packets collapsed in receive queue due to low socket buffer
Usually system defaults work just fine, unless you want to tune for a very specific type of workload.

Looking at netstat makes sense as long as the numbers are low. It's hard to correlate one of 167 counters to the unpredictable latency event you are just debugging :)

  $ netstat -s | grep -E '(prune|collap)'
  4445504 packets pruned from receive queue because of socket buffer overrun
  150606 packets pruned from receive queue
  278333723 packets collapsed in receive queue due to low socket buffer

> To make things worse all our usual monitoring hadn't caught the problem

I suspect this is because most monitoring focuses on average instead of percentiles or maximum. Outliers are the most interesting data points. There is an almost cult-like following of the average as a KPI in monitoring tools.

No, they explain why in the article: Their monitoring was looking at the application (nginx), so packets getting stuck in the kernel weren't reflected in that.

Did I misread the article or did they actual prove that the tcp_collapse was the root cause of their 1000ms spike?

The working theory is that "tcp_collapse" causes spikes of ~20 ms, which in turn make Linux not receive packets in time, which makes the network buffers to fill, which means new packets will be lost, which may be visible as 1s... latency spike end-to-end.

This is for the TCP case. For the ICMP, indeed, the 1.8s times are not explained, but it's totally possible that 100 of 18ms events accumulated. Remember that the ping was running for 32 hours, while the system tap scripts for 30 seconds.

Not specifically tcp_collapse, but the article theorizes that some memory management done by tcp_collapse (i.e. GC) is the cause. I'd place my bets on memory calls being the cause too.

But... they never went to the next step and confirmed that tcp_collapse was making lengthy system memory calls. They kind of gloss over the details of why reducing rmem alleviates the problem.

Is it because with smaller rmem tcp_collapse aggregates segments into smaller sk_buffs, and these smaller-size memory calls don't take as long to return?

Or is it because tcp_collapse has less memory "stitching" to do, since it can't fit as many segments into the smaller sk_buffs?

My understanding of the article was that the amount of data that they could be trying to collapse had been up to 32 MB, and now was never more than 2 MB.

Doing a global operation on a large data structure takes longer than doing a global operation on a small data structure. Average throughput may be better if you do that, but maximum latency was the issue here.

Thanks, I missed the detail that their maximum buffer size was 32 MB. They speak about the default size but then change the maximum size and show that the longer delays don't happen.

I read it as a smaller buffer meant that when garbage collection kicked in, it was working on a smaller amount of memory and was therefore faster to complete.

Can anyone explain how they go from the flame graph to identifying the function of interest net_rx_action? I can't see anything that grabs my attention, but maybe that's because the graph is too small to highlight the issues.

The page on flame graphs [1] shows a lot more details as to how they are useful for findings like this. They gave the ten miles up view, and from there, you can click on hotspots to drill deeper and deeper and find the slow points. Cloudflare didn't give the svg, just the png, so you can't get the deeper levels.

[1] http://brendangregg.com/FlameGraphs/cpuflamegraphs.html

OK thanks for the explanation.

Great post. I would like to see a follow-up post about what this exposes in your tests, and how one might put stap or something similar to use as a part of their tests/monitoring.

And we're back to the basics, even if net.ipv4.tcp_rmem is mentioned nearly in every write up on network performance, you should still tune it with care, if you want to save time on future investigations.

Nice work there!

Any idea if Cloudflare is hosted on AWS and faced this problem on their EC2 instance? Anyone?

CloudFlare is a CDN, so needs to be hosted in data centres all over the place, often collocated with ISPs.

Very nice!

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact