
Deep inside the Linux kernel: a network latency spike - jgrahamc
https://http2.cloudflare.com/the-story-of-one-latency-spike/
======
paulsutter
Great detective work.

Before people go off reducing tcp_rmem too low, I thought I'd explain its
actual purpose.

A TCP sender will only send data if the receiver has buffer space available.
With 100msec of latency, at 1Gbit/sec there can be 12MBytes of data "on the
wire". If a retransmit is necessary, you can multiply that by a number of
round trips (say 2 or 3?) to get the needed amount of receive buffer to keep
the sender sending. Otherwise, per-connection throughput is capped in
proportion to the tcp_rmem size.
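
A rough back-of-the-envelope version of that, as a sketch (the 1 Gbit/s and
100 ms figures are the ones above; the 2-3x retransmit allowance is the same
rough guess):

    # Bandwidth-delay product: how much data can be "on the wire" at once.
    rate_bps = 1e9          # 1 Gbit/s link
    rtt_s = 0.100           # 100 ms round trip

    bdp_bytes = rate_bps * rtt_s / 8
    print(f"BDP: {bdp_bytes / 1e6:.1f} MB")   # ~12.5 MB in flight

    # Allow a couple of extra round trips' worth to ride out retransmissions.
    for round_trips in (2, 3):
        need = round_trips * bdp_bytes
        print(f"{round_trips} RTTs: ~{need / 1e6:.0f} MB of receive buffer")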

I mention this because objects in a CDN are usually pretty small, and lower
values of tcp_rmem may be ok for them.

And really I mention this hoping it motivates someone to fix the bug.

~~~
bboreham
> if the receiver has buffer space available

nitpick: if the sender believes the entire path between sender and receiver
has buffer space available. I.e. it will send up to the minimum of the receive
window and the congestion window.

Retransmits cause the cwnd to be dramatically reduced; in my experience huge
buffers really only work under perfect network conditions.
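
In sketch form the limit looks something like this (made-up numbers, not real
kernel code):

    # The sender may only have min(cwnd, rwnd) bytes unacknowledged at once.
    cwnd = 10 * 1460            # congestion window, kept locally by the sender
    rwnd = 6 * 1024 * 1024      # receive window advertised by the receiver
    in_flight = 5 * 1460        # bytes sent but not yet acknowledged

    can_send = max(0, min(cwnd, rwnd) - in_flight)
    print(f"sender may transmit up to {can_send} more bytes")

    # After a loss is detected, cwnd is cut sharply (roughly halved), so the
    # effective window can shrink well below the advertised rwnd.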

~~~
planckscnst
I had always conflated the two. There is only one "window" field in the TCP
header. How does the sender know the receiver's receive window separately from
the congestion window?

~~~
bboreham
The sender maintains cwnd as a completely separate parameter that does not go
in any packets. It doesn't "know" anything; the value of this parameter is
arrived at by some set of heuristics.

See for instance [https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm](https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm)
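
A very rough sketch of the kind of heuristic involved (Reno-style AIMD,
heavily simplified; an illustration, not what any particular kernel actually
does, and the MSS value is just an example):

    MSS = 1460  # bytes, illustrative segment size

    def on_ack(cwnd, ssthresh):
        # Slow start: grow exponentially; congestion avoidance: ~1 MSS per RTT.
        if cwnd < ssthresh:
            return cwnd + MSS
        return cwnd + MSS * MSS // cwnd

    def on_loss(cwnd):
        # Multiplicative decrease: roughly halve the window after a loss.
        ssthresh = max(cwnd // 2, 2 * MSS)
        return ssthresh, ssthresh   # new cwnd, new ssthresh

    cwnd, ssthresh = MSS, 64 * 1024
    for _ in range(20):
        cwnd = on_ack(cwnd, ssthresh)
    cwnd, ssthresh = on_loss(cwnd)
    print(cwnd, ssthresh)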

------
planckscnst
SystemTap (or any other dynamic tracing tool) is very powerful; I suggest that
everyone who does system-level or performance-oriented work learn how to use
it. It's not as difficult as it first looks. If you understand what a
syscall is, it's useful. If you understand more about the kernel insides, e.g.
the various vfs functions, scheduling, memory mapping and allocation, it gets
so much more powerful. It basically lets you run arbitrary code anywhere
inside a kernel (or userspace). That is, at any line of code, you can break
and jump to your own code and return again. Your code can do pretty much
anything - there is a way to embed C in a SystemTap script. Usually, you're
just keeping counters, timers, and backtraces; there is a nice little
scripting language that SystemTap compiles to C for those cases.

I have done a very similar analysis recently. I had an HTTP service with
latency spikes. I used SystemTap to discover occasional long-running syscalls;
the contents of the syscalls pointed out the exact problem. Once those latency
spikes were clear, I still had a few very occasional spikes (as in this
article). I created a SystemTap script that collected backtraces anytime our
process was rescheduled after being asleep for longer than some time (using
[1][2] to start/stop a timer). I then used that to plot a flamegraph. After
that, the problem was pretty clear and I was able to intelligently mess with
some kernel tuning parameters, eventually discovering that it wasn't enough: I
just needed to deal with the problem on my side, and from there the fix was
pretty easy.

[1] [https://sourceware.org/systemtap/tapsets/API-scheduler-cpu-on.html](https://sourceware.org/systemtap/tapsets/API-scheduler-cpu-on.html)
[2] [https://sourceware.org/systemtap/tapsets/API-scheduler-cpu-off.html](https://sourceware.org/systemtap/tapsets/API-scheduler-cpu-off.html)

------
_wmd
> Setting the rmem sysctl to only 2MiB is not recommended as it could affect
> the performance of high throughput, high latency connections

So what I took away from this article is that they'd set some absurdly high
receive buffer, which predictably caused something to break, and never quite
got around to explaining how they arrived at the original number. That buffer
is for data the kernel has received but has _not yet been read() by
userspace_. Seriously, 5MB of RAM dedicated to each socket in the kernel?

Assuming these are Nginx proxy connections, why did you choose to tweak a
kernel tunable rather than increase the corresponding userspace application
buffer sizes?

edit: so the answer seems to be that Nginx is wholly reliant on the kernel to
handle buffering, which is completely asinine to me, but perhaps there is good
rationale here I've somehow missed. The take-home becomes "our app stack is so
broken we managed to break additional extremely battle-tested components in a
bid to get it to function sensibly".

~~~
creshal
> Seriously, 5MB of RAM dedicated to each socket in the kernel?

Debian 8 seems to have a default of 6 MiB. So, the "absurdly high receive
buffer" might not be cloudflare's fault.

~~~
mhandley
The TCP protocol needs to be able to buffer one round-trip time (RTT) worth of
data at both sender and receiver in the kernel. It needs one RTT of data at
the receiver to cope with waiting for a retransmission of a lost packet, or
just so it doesn't drop incoming packets when the receiving app doesn't read
immediately. The same amount of data must be buffered at the sender because it
may need to be retransmitted if a packet was lost. So, 6MiB of buffering with
a fairly typical 100ms RTT gives you 6*8/0.1 Mbit/s = 480Mbit/s. In practice
you may need more buffering than this, but that's the maximum rate you can
achieve.

Now, if you've got an overbuffered router at the bottleneck, as is common on
home broadband links, you can build a queue there. It's not really all that
unusual to see queues build to 1 second, especially on ADSL. If you've a 1
second RTT, the maximum rate you can support with 6MiB of buffering at both
sender and receiver drops to 48Mbit/s. It seems to me that 6MiB is a fairly
reasonable default value, though if you're running a really busy server, you
might benefit from reducing this, especially if you're mostly serving small
files.
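
The same arithmetic as a quick sketch (treating 6 MiB loosely as 6 MB to match
the round numbers above):

    # Rough ceiling on single-connection throughput given buffer size and RTT.
    buf_bytes = 6e6                     # ~6 MB of socket buffer
    for rtt_s in (0.100, 1.0):          # typical RTT vs. a badly bufferbloated path
        rate_mbit = buf_bytes * 8 / rtt_s / 1e6
        print(f"RTT {rtt_s * 1000:.0f} ms -> at most ~{rate_mbit:.0f} Mbit/s")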

------
r4um
The collapsing and pruning of TCP buffers also shows up in the network
statistics; it could have been the first hint.

    
    
    $ netstat -s | grep -E '(prune|collap)'
    233 packets pruned from receive queue because of socket buffer overrun
    1581 packets collapsed in receive queue due to low socket buffer
    

Usually system defaults work just fine, unless you want to tune for a very
specific type of workload.

~~~
majke
Looking at netstat makes sense as long as the numbers are low. It's hard to
correlate one of 167 counters to the unpredictable latency event you are just
debugging :)

    
    
      $ netstat -s | grep -E '(prune|collap)'
      4445504 packets pruned from receive queue because of socket buffer overrun
      150606 packets pruned from receive queue
      278333723 packets collapsed in receive queue due to low socket buffer

------
laichzeit0
> To make things worse all our usual monitoring hadn't caught the problem

I suspect this is because most monitoring focuses on average instead of
percentiles or maximum. Outliers are the most interesting data points. There
is an almost cult-like following of the average as a KPI in monitoring tools.

~~~
luhn
No, they explain why in the article: Their monitoring was looking at the
application (nginx), so packets getting stuck in the kernel weren't reflected
in that.

------
mey
Did I misread the article, or did they actually prove that tcp_collapse was
the root cause of their 1000ms spike?

~~~
gricardo99
Not specifically tcp_collapse, but the article theorizes that some memory
management done by tcp_collapse (i.e. GC) is the cause. I'd place my bets on
memory calls being the cause too.

But... they never went to the next step and confirmed that tcp_collapse was
making lengthy system memory calls. They kind of gloss over the details of why
reducing rmem alleviates the problem.

Is it because with smaller rmem tcp_collapse aggregates segments into smaller
sk_buffs, and these smaller-size memory calls don't take as long to return?

Or is it because tcp_collapse has less memory "stitching" to do, since it
can't fit as many segments into the smaller sk_buffs?

~~~
btilly
My understanding of the article was that the amount of data that they could be
trying to collapse had been up to 32 MB, and now was never more than 2 MB.

Doing a global operation on a large data structure takes longer than doing a
global operation on a small data structure. Average throughput may be better
with the larger buffer, but maximum latency was the issue here.
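
As a toy illustration of that point (nothing to do with the kernel's actual
code path, just the scaling):

    # A full pass over 32 MB takes ~16x longer than over 2 MB, even though
    # per-byte throughput is the same. It's the tail latency that hurts.
    import time

    for size_mb in (2, 32):
        buf = bytes(size_mb * 2**20)
        t0 = time.perf_counter()
        checksum = sum(buf)          # stand-in for touching every byte once
        dt_ms = (time.perf_counter() - t0) * 1000
        print(f"{size_mb} MB pass: {dt_ms:.1f} ms (checksum {checksum})")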

~~~
acqq
Thanks, I missed the detail that their maximum buffer size was 32 MB. They
speak about the default size but then change the maximum size and show that
the longer delays don't happen.

------
abrookewood
Can anyone explain how they go from the flame graph to identifying the
function of interest net_rx_action? I can't see anything that grabs my
attention, but maybe that's because the graph is too small to highlight the
issues.

~~~
Sanddancer
The page on flame graphs [1] shows in a lot more detail how they are useful
for findings like this. They give the view from ten miles up, and from there
you can click on hotspots to drill deeper and deeper and find the slow points.
Cloudflare didn't share the SVG, just the PNG, so you can't get to the deeper
levels.

[1]
[http://brendangregg.com/FlameGraphs/cpuflamegraphs.html](http://brendangregg.com/FlameGraphs/cpuflamegraphs.html)

~~~
abrookewood
OK thanks for the explanation.

------
nickysielicki
Great post. I would like to see a follow-up post about what this exposes in
your tests, and how one might put stap or something similar to use as a part
of their tests/monitoring.

------
zenlot
And we're back to basics: even if net.ipv4.tcp_rmem is mentioned in nearly
every write-up on network performance, you should still tune it with care if
you want to save time on future investigations.

------
tylerconscious
Nice work there!

Any idea if Cloudflare is hosted on AWS and faced this problem on their EC2
instance? Anyone?

~~~
bboreham
CloudFlare is a CDN, so needs to be hosted in data centres all over the place,
often collocated with ISPs.

------
abc_lisper
Very nice!

