
Hidden latency in the Linux network stack - jgrahamc
https://blog.cloudflare.com/revenge-listening-sockets/?p=0
======
nkurz
Stray thoughts:

1) Any chance this bug would have manifested as "connection reset" errors when
accessing HN? I exchanged email with Dan a couple months ago trying to figure
out why about 10% of my requests were failing, but we never figured out the
root cause before (after some weeks) the problem went away.

2) As others have pointed out, doubling the number of hash buckets seems like
a band-aid. But other than the scolding comment, is there any reason not to go
to an appropriately sized hash? If you know in advance that you are going to
have 16K addresses (i.e., not the use case the original code anticipated), it
would seem beneficial to choose a data structure that has a fighting chance of
providing good performance.

3) This seems like a wonderful argument _against_ running your high
performance DNS server on the same machine as your other services. Would
containerization have helped here, possibly with each OS pinned to a set of
cores? Is the cost of splitting it off onto a separate physical machine
prohibitive? At the optimization level you are aiming for, "process isolation"
seems like a pretty leaky abstraction.

4) Going farther down the path of reducing jitter, have you considered running
a tickless (NOHZ_FULL) kernel? Perhaps you are already doing it, but quieting
down the cores running your important services can make a significant
difference. I've been exploring this direction, and have found it rewarding.
Info on one way to measure is here:
[https://kernel.googlesource.com/pub/scm/linux/kernel/git/frederic/dynticks-testing/+/master](https://kernel.googlesource.com/pub/scm/linux/kernel/git/frederic/dynticks-testing/+/master)

~~~
Florin_Andrei
If the issue is in the kernel, containers won't solve it.

~~~
nkurz
You are correct. As evidenced by the mention of "each OS", I mangled that
sentence and wrote containerization where I should have said virtualization. I
meant something like "While containerization wouldn't have helped here, would
virtualization (perhaps with each OS pinned to a different set of cores) have
helped?"

~~~
Florin_Andrei
Yeah, that sounds better.

Either way, if it's a critical service, I'd rather have it running on hardware
where there isn't much competition for resources, so not a whole lot of
virtualization.

------
charliedevolve
Why did increasing the size of the hash help? Wouldn't the %64 (or whatever
the new value was) just send all the port 53 sockets into the same bucket
again? It seems you'd need a different hash function that provides more
uniformity.

~~~
kbenson
_For TCP connections our DNS server now binds to ANY_IP address (aka:
0.0.0.0:53, :53). We call this "bind to star". While binding to specific IP
addresses is still necessary for UDP, there is little benefit in doing that
for the TCP traffic. For TCP we can bind to star safely, without compromising
our DDoS defenses._

I suspect that's the real fix. Now all those (16k) bound addresses aren't
creating hash table entries, so other connections that happen to use a port
that hashes to 21 (or 53 after enlarging the table) aren't being shoved into a
hash bucket that _starts_ with 16k entries already in it.
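
In socket-API terms, "bind to star" is just a wildcard bind; a minimal sketch
(not CloudFlare's actual code, and with error handling omitted):

    /* One listening socket on 0.0.0.0:53 replaces the 16K per-IP
     * listeners, so the listening hash table holds a single entry
     * for that port. Error handling omitted for brevity. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>
    
    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
    
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY); /* "star": any local IP */
        addr.sin_port = htons(53);  /* binding to 53 needs root or
                                       CAP_NET_BIND_SERVICE */
    
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);
        /* ... accept() loop would go here ... */
        close(fd);
        return 0;
    }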

The enlarging of the hash table I think is less a fix for this problem
(although it would halve the number of later connections being put in the
bucket), and more just a good fix they happened to do at the same time.

~~~
creshal
> The enlarging of the hash table I think is less a fix for this problem
> (although it would halve the number of later connections being put in the
> bucket), and more just a good fix they happened to do at the same time.

Yes. It just reduces the risk that they run into this problem again with a
different port constellation.

~~~
stingraycharles
A bit pedantic, but it doesn't so much reduce the risk as the impact: using
64 buckets would only have half as many connections go into a bad bucket.
This, however, does not in any way decrease the chance of the problem
occurring again.

~~~
sinxoveretothex
Using twice as many buckets, there will be half as many destination ports in
the same bucket (65535 / 32 ≈ 2048, 65535 / 64 ≈ 1024), but since the "bad"
connections described in the blogpost all use the same destination port, it
won't change anything wrt that.

It does, however, reduce the overall impact when all connections are
considered.
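
A minimal sketch of that arithmetic (assuming, as the post describes, that
the bucket is simply the destination port modulo the table size): the port-53
listeners share one bucket at any size, while unrelated ports that collided
with port 53 under %32 stop doing so under %64.

    /* Toy demonstration: port 53 maps to bucket 21 under %32 and bucket
     * 53 under %64; ports 21 and 85 (chosen because they also map to
     * bucket 21 under %32) no longer share port 53's bucket at size 64. */
    #include <stdio.h>
    
    int main(void) {
        int sizes[] = {32, 64};
        int ports[] = {53, 21, 85};
        for (int s = 0; s < 2; s++)
            for (int p = 0; p < 3; p++)
                printf("port %2d, table size %2d -> bucket %d\n",
                       ports[p], sizes[s], ports[p] % sizes[s]);
        return 0;
    }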

------
kazinator
In the internet protocol suite, there is something called ARP. That will cause
a small latency spike whenever an ARP cache entry expires.

The host wants to send a datagram to some IP address Y somewhere. It knows
that the route for that IP goes through some local gateway with IP address X,
so it must actually send the datagram to X, for which it needs the MAC.
Normally, it knows the MAC of that gateway, because X is associated with the
MAC in the ARP cache.

From time to time, the ARP cache entry expires. In that case, before the
packet can be sent to the remote host Y, an ARP query must be performed: an
ARP "who has X?" broadcast request is generated, to which the gateway will
reply. Only then can that host send the packet to Y, through X.

This extra ARP exchange shouldn't take anywhere near 100 milliseconds, of
course. But that is beyond the control of the host that is querying.
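
One quick way to watch those entries age out on Linux is `ip neigh show`, or
reading the kernel's ARP table directly; a minimal sketch:

    /* Dump the kernel's ARP cache: /proc/net/arp lists the IP address,
     * hardware (MAC) address, and device for each neighbour entry. An
     * entry that vanishes and returns has expired and been re-resolved. */
    #include <stdio.h>
    
    int main(void) {
        FILE *f = fopen("/proc/net/arp", "r");
        if (!f)
            return 1;
        char line[256];
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }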

------
deegles
I love these mysterious bug deep dives. I should start aggregating them
somewhere...

------
ecma
I'm not entirely convinced that increasing the size of LHTABLE solves
anything. True, it may remove some collisions in the hash table but note that
63925 % 64 = 53. Given that the two slow ports listed seem to be arbitrary and
assigned to customers, they're probably just a symptom of the overload on
53/UDP. I'm not suggesting they chose 64 (that's unclear), but whatever they
chose probably just shifts the problem elsewhere. Increasing it /would/
inherently reduce the frequency of the events, though, so you can call that a
win.

A naïve solution would be to choose a bucket based on the destination port as
well as the source port if one is available (e.g. TCP). This might help
balance load affecting particular local ports since we can assume the source
port for TCP will be random enough. However, it doesn't solve the problem -
it'll just hide it. Random spikes in latency for connections to random
customers? Sounds undesirable.

A reasonable solution might be to work out a way to map gateway 53/UDP to a
diverse set of ports which are bound to rrdns processes on the boxes which
currently have 16K IP addresses. For UDP packets, this would be possible by
doing on-wire modifications to the transport header and recalculating any
checksums. Perhaps that just shifts the burden though.
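
On the checksum point, a full re-sum isn't necessary: RFC 1624 defines an
incremental update for when a single 16-bit field (such as the destination
port) is rewritten. A sketch, with made-up example values and byte-order
details ignored:

    /* RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m'), where m is the old field
     * value and m' the new one, using one's-complement addition. */
    #include <stdint.h>
    #include <stdio.h>
    
    static uint16_t cksum_update(uint16_t cksum, uint16_t oldval,
                                 uint16_t newval) {
        uint32_t sum = (uint16_t)~cksum + (uint16_t)~oldval + newval;
        sum = (sum & 0xffff) + (sum >> 16);  /* fold the carries */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }
    
    int main(void) {
        /* e.g. rewriting destination port 53 -> 1053 in a UDP header;
         * 0x1c2b is an arbitrary stand-in for the original checksum */
        printf("patched checksum: 0x%04x\n", cksum_update(0x1c2b, 53, 1053));
        return 0;
    }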

~~~
eridius
You can't include the source port in the hash, because this is a table of
listening sockets, i.e. no connection has been established yet and the socket
needs to see packets from ANY source as long as they go to the right
destination.

You could suggest including the bound destination IP in the hash, but then
you'd also need a separate hashtable for sockets that are bound to any IP
(instead of being bound to a specific IP).
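
To make the double probe concrete, here is a toy sketch (not the kernel's
actual structures) of a lookup that mixes the IP into the bucket choice and
therefore has to probe twice:

    /* Toy listener table: hashing on (IP, port) means a lookup must try
     * the exact local IP first, then retry with 0.0.0.0 for wildcard
     * listeners, since the two can land in different buckets. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    
    #define NBUCKETS 64
    
    struct listener {
        uint32_t ip;            /* bound local IP, 0 for INADDR_ANY */
        uint16_t port;
        struct listener *next;  /* hash chain */
    };
    
    static struct listener *table[NBUCKETS];
    
    static size_t bucket(uint32_t ip, uint16_t port) {
        return (ip ^ port) % NBUCKETS;  /* toy hash mixing IP and port */
    }
    
    static struct listener *find_in(size_t b, uint32_t ip, uint16_t port) {
        for (struct listener *l = table[b]; l; l = l->next)
            if (l->ip == ip && l->port == port)
                return l;
        return NULL;
    }
    
    static struct listener *lookup(uint32_t dst_ip, uint16_t dst_port) {
        struct listener *l = find_in(bucket(dst_ip, dst_port),
                                     dst_ip, dst_port);
        if (l)
            return l;
        /* second probe: wildcard listeners are hashed with ip == 0 */
        return find_in(bucket(0, dst_port), 0, dst_port);
    }
    
    int main(void) {
        struct listener star = { 0, 53, NULL };  /* a 0.0.0.0:53 listener */
        table[bucket(0, 53)] = &star;
        /* a packet to any local IP on port 53 falls through to it */
        printf("found: %s\n", lookup(0x01020304, 53) ? "yes" : "no");
        return 0;
    }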

~~~
ecma
Good catch. I did mean that mixing in the source would apply to the
established table, if it's constructed in the same way but doesn't do so
already (which, now that I think about it properly, would surprise me).

I don't think you'd need a separate table for star-bound listeners if the IP
is mixed in, since you could just hash in 0.0.0.0. But then you'd need to
check both the real IP and the special value, which is a potentially damaging
performance hit. It's probably done with just the destination port for a good
reason.

------
Gratsby
Why not look into changing the hash keys so that they are not bound to the
listening port, or alternatively put a new lookup and hash in place for this
specific purpose?

While bind to star works, it feels like you answered an operational concern
but left the design consideration on the table.

------
forgotmypassw
Even though I'm not using CloudFlare myself, I always enjoy reading their
blog posts; these guys are amazing hackers.

------
alberth
What about DragonflyBSD?

I'm curious whether Cloudflare has investigated using DragonflyBSD, given
that it has a lockless network stack.

~~~
caf
This wasn't a locking issue.

------
ishtu
Can anyone suggest a good read on the differences between the Linux and
FreeBSD network stacks?

~~~
lazyant
here's a diagram for Linux
[http://www.linuxfoundation.org/images/1/1c/Network_data_flow_through_kernel.png](http://www.linuxfoundation.org/images/1/1c/Network_data_flow_through_kernel.png)

------
Zenst
I wonder whether this problem would still exist under IPv6, or whether IPv6
has design aspects that would negate this issue.

Would the lookup be different, or does the same design aspect produce what
many would class as an edge case, one that is only going to become more
common as companies grow?

------
triplenineteen
It seems like they are using a non-standard netstat switch:

    $ netstat -ep4ln --udp

Is that '4' a typo or something?

~~~
dice
It's present in my netstat, from `net-tools 2.10-alpha`. Man page says it's
shorthand for `--protocol=inet` (as opposed to `inet6`) which shows only IPv4
sockets.

------
lttlrck
It would be helpful to show the improvement for the two changes made
individually instead of lumping them together.

------
js2
What did you increase the size of LHTABLE to? I don't see that mentioned in
the post.

~~~
majke
In order to reduce the number of packets hitting the bad bucket. With an
LHTABLE of size 32, port 16725 will hit the bad bucket (16725 % 32 = 21, the
same bucket as port 53). With other sizes, port 16725 may not hit the naughty
bucket.

~~~
nkurz
I think the question was what number you used for the new LHTABLE, rather than
the purpose of increasing it. That is, did you double it from 32 to 64? Choose
a prime? Something else?

~~~
js2
Correct, that was my question. The post doesn't say what they chose for the
LHTABLE size when they recompiled the kernel, only that it was increased from
32. BTW, according to the patch the article linked to, the size has to be a
power of 2.
[http://patchwork.ozlabs.org/patch/79014/](http://patchwork.ozlabs.org/patch/79014/)

~~~
wrigby
That makes sense - since port numbers are unsigned integers, you can just use
a bitwise AND rather than a more costly MOD. I haven't actually looked at the
internals of the hash, so I'm just assuming it's simply BUCKET = DPORT & MASK.
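
That identity is easy to check: for unsigned values and a power-of-two size
N, x % N equals x & (N - 1). A quick sketch (the 63925 example matches the
% 64 arithmetic mentioned above):

    /* Verify x % N == x & (N - 1) for every 16-bit port when N is a
     * power of two, then show the bucket for port 63925 at size 64. */
    #include <assert.h>
    #include <stdio.h>
    
    int main(void) {
        unsigned size = 64;  /* must be a power of two */
        unsigned mask = size - 1;
        for (unsigned port = 0; port < 65536; port++)
            assert(port % size == (port & mask));
        printf("port 63925 -> bucket %u\n", 63925u & mask);  /* 53 */
        return 0;
    }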

------
known
Can't we fix/configure it in /etc/sysctl.conf?

~~~
wrigby
That would actually be pretty cool, but the data would have to be
redistributed across the new set of buckets when the parameter was changed.
That could get really messy...
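
A toy sketch of that redistribution (power-of-two sizes assumed, and the
locking a live kernel table would need is omitted):

    /* Rebucket every entry under the new mask: each chain is walked and
     * its entries are relinked into the freshly allocated bucket array. */
    #include <stdlib.h>
    
    struct sock {
        unsigned port;
        struct sock *next;
    };
    
    static struct sock **rehash(struct sock **old, unsigned old_size,
                                unsigned new_size) {
        struct sock **fresh = calloc(new_size, sizeof(*fresh));
        for (unsigned b = 0; b < old_size; b++) {
            for (struct sock *s = old[b], *next; s; s = next) {
                next = s->next;
                unsigned nb = s->port & (new_size - 1);  /* new bucket */
                s->next = fresh[nb];
                fresh[nb] = s;
            }
        }
        free(old);
        return fresh;
    }
    
    int main(void) {
        struct sock dns = { 53, NULL };
        struct sock **t = calloc(32, sizeof(*t));
        t[53 & 31] = &dns;      /* bucket 21 under the old mask */
        t = rehash(t, 32, 64);  /* the entry moves to bucket 53 */
        free(t);
        return 0;
    }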

