Before people go off reducing tcp_rmem too low, I thought I'd explain its actual purpose.
A TCP sender will only send data if the receiver has buffer space available. With 100 ms of latency at 1 Gbit/s, there can be about 12.5 MB of data "on the wire" (the bandwidth-delay product). If a retransmit is necessary, multiply that by a few round trips (say 2 or 3?) to get the receive buffer needed to keep the sender sending; otherwise per-connection throughput drops roughly linearly with the tcp_rmem size.
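For the curious, the same arithmetic as a back-of-the-envelope sketch in Python (the 2-3 round-trip multiplier is just the rule of thumb above, not a derived constant):

    # Bandwidth-delay product: how much data can be "in flight" at once.
    link_bps = 1_000_000_000          # 1 Gbit/s
    rtt_s = 0.100                     # 100 ms round trip

    bdp_bytes = link_bps / 8 * rtt_s
    print(f"BDP: {bdp_bytes / 1e6:.1f} MB")               # ~12.5 MB

    # Budget a few RTTs of buffer so one retransmit doesn't stall the sender.
    for rtts in (2, 3):
        print(f"{rtts} RTTs: {bdp_bytes * rtts / 1e6:.1f} MB of receive buffer")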
I mention this because objects in a CDN are usually pretty small, and lower values of tcp_rmem may be ok for them.
And really I mention this hoping it motivates someone to fix the bug.
nitpick: only if the sender believes the entire path between sender and receiver has buffer space available, i.e. it will send up to the minimum of the receive window and the congestion window.
Retransmits cause the cwnd to be dramatically reduced; in my experience huge buffers really only work under perfect network conditions.
See for instance https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algor...
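The model in the nitpick above, sketched in Python (a toy illustration of the min(), not any kernel's actual code; the 64 KB post-loss cwnd is a made-up figure):

    def send_window(cwnd_bytes: int, rwnd_bytes: int) -> int:
        # The sender keeps at most this much unacknowledged data in flight.
        return min(cwnd_bytes, rwnd_bytes)

    def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
        # Window-limited throughput bound: one window per round trip.
        return window_bytes * 8 / rtt_s

    # A big receive window (6 MiB) doesn't help once a loss has slashed cwnd:
    bps = max_throughput_bps(send_window(64_000, 6 * 2**20), 0.100)
    print(f"{bps / 1e6:.2f} Mbit/s")    # cwnd-limited, ~5 Mbit/s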
I did a very similar analysis recently on an HTTP service with latency spikes. I used SystemTap to discover occasional long-running syscalls, and the contents of those syscalls pointed out the exact problem. Once those latency spikes were cleared, I still had a few very occasional spikes (as in this article), so I wrote a SystemTap script that collected backtraces any time our process was rescheduled after being asleep for longer than some threshold (using  to start/stop a timer), then used those to plot a flamegraph. After that the problem was pretty clear. I was able to mess with some kernel tuning parameters intelligently, eventually discovered that wasn't enough, and just dealt with the problem on my side; from there the fix was pretty easy.
So what I took away from this article is that they'd set some absurdly high receive buffer, which predictably caused something to break, and never quite got around to explaining how they arrived at the original number. That buffer holds data the kernel has received but userspace hasn't yet read(). Seriously, 5 MB of kernel RAM dedicated to each socket?
Assuming these are Nginx proxy connections, why did you choose to tweak a kernel tunable rather than increasing the corresponding userspace application buffer sizes instead?
edit: so the answer seems to be that Nginx is wholly reliant on the kernel to handle buffering, which seems asinine to me, though perhaps there's a good rationale I've somehow missed. The take-home becomes "our app stack is so broken we managed to break additional extremely battle-tested components in a bid to get it to function sensibly".
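For what it's worth, an application can size the kernel receive buffer per socket with setsockopt(SO_RCVBUF) instead of relying on the global tunable. Note that on Linux this disables receive-buffer autotuning for that socket, and the kernel doubles the requested value for bookkeeping and clamps it at net.core.rmem_max. A minimal Python sketch:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request a 256 KiB receive buffer for this socket only.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
    # Linux reports back the doubled (and possibly clamped) value.
    print("effective rcvbuf:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))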
Debian 8 seems to have a default of 6 MiB, so the "absurdly high receive buffer" might not be Cloudflare's fault.
Now, if you've got an overbuffered router at the bottleneck, as is common on home broadband links, you can build a queue there; it's not at all unusual to see queues build to 1 second, especially on ADSL. With a 1 second RTT, the maximum rate you can support with 6 MiB of buffering at sender and receiver drops to 48 Mbit/s. So 6 MiB seems to me a fairly reasonable default, though if you're running a really busy server you might benefit from reducing it, especially if you're mostly serving small files.
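Plugging in your own numbers (same window/RTT bound as above; 48 vs ~50 Mbit/s depends on whether you count the 6 MiB in decimal megabytes):

    window_bytes = 6 * 2**20     # Debian 8's reported 6 MiB default
    rtt_s = 1.0                  # a badly overbuffered ADSL link
    print(f"{window_bytes * 8 / rtt_s / 1e6:.0f} Mbit/s max")   # ~50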
$ netstat -s | grep -E '(prune|collap)'
233 packets pruned from receive queue because of socket buffer overrun
1581 packets collapsed in receive queue due to low socket buffer
$ netstat -s | grep -E '(prune|collap)'
4445504 packets pruned from receive queue because of socket buffer overrun
150606 packets pruned from receive queue
278333723 packets collapsed in receive queue due to low socket buffer
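If you'd rather watch these counters move than eyeball snapshots, here's a quick sketch that diffs two samples (the line wording varies across net-tools/kernel versions, hence the loose match):

    import re
    import subprocess
    import time

    def tcp_pressure_counters():
        """Return {description: count} for prune/collapse lines of `netstat -s`."""
        out = subprocess.run(["netstat", "-s"], capture_output=True, text=True).stdout
        counters = {}
        for line in out.splitlines():
            m = re.match(r"\s*(\d+)\s+(.*(?:prune|collaps).*)", line)
            if m:
                counters[m.group(2)] = int(m.group(1))
        return counters

    before = tcp_pressure_counters()
    time.sleep(60)
    for desc, count in tcp_pressure_counters().items():
        print(count - before.get(desc, 0), desc, "(last 60 s)")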
I suspect this is because most monitoring focuses on averages instead of percentiles or maxima. Outliers are the most interesting data points, yet there's an almost cult-like devotion to the average as a KPI in monitoring tools.
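A toy illustration with made-up latencies, in Python: one 1.8 s stall among ten thousand 5 ms requests barely moves the mean, while the max tells the whole story.

    import statistics

    latencies_ms = [5] * 9999 + [1800]    # 9,999 fast requests, one big stall

    print("mean:", statistics.mean(latencies_ms), "ms")    # ~5.2 ms: looks healthy
    print("max: ", max(latencies_ms), "ms")                # 1800 ms: the actual incident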
This is for the TCP case. For ICMP, indeed, the 1.8 s times are not explained, but it's entirely possible that a hundred 18 ms events accumulated (100 × 18 ms = 1.8 s). Remember that the ping ran for 32 hours, while the SystemTap scripts ran for only 30 seconds.
But... they never went to the next step and confirmed that tcp_collapse was performing lengthy memory operations. They kind of gloss over the details of why reducing rmem alleviates the problem.
Is it because with a smaller rmem, tcp_collapse aggregates segments into smaller sk_buffs, and those smaller allocations don't take as long to return?
Or is it because tcp_collapse has less memory "stitching" to do, since it can't fit as many segments into the smaller sk_buffs?
Doing a global operation on a large data structure takes longer than doing the same operation on a small one. Average throughput may be better with the larger structure, but maximum latency was the issue here.
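A crude demonstration of the scaling (pure Python, nothing to do with sk_buffs; the consolidation pass just stands in for a "global operation"):

    import time

    def collapse(chunks):
        # Stand-in for consolidating a receive queue into contiguous memory.
        return b"".join(chunks)

    for total_kib in (512, 5 * 1024):              # ~0.5 MB vs ~5 MB queued
        chunks = [b"x" * 1024] * total_kib
        t0 = time.perf_counter()
        collapse(chunks)
        print(f"{total_kib:>5} KiB: {(time.perf_counter() - t0) * 1e3:.2f} ms")

Capping the queue (smaller tcp_rmem) caps how long any single pass can take, even if the bigger queue would win on average throughput.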
Any idea if Cloudflare is hosted on AWS and faced this problem on their EC2 instances? Anyone?