
Conntrack tales – one thousand and one flows - jgrahamc
https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/
======
majke
Author here. There are more conntrack stories to tell...

In this one I tried to do a quick "what happens when the table is full", and ended
up with a rather rich and complex blog post. In it I discuss:

\- EPERM

\- implicitly dropped packets

\- weird kernel behaviour on loose=0 and ACK being dropped

\- iptables policy counters having their gotchas

\- the problem of SYN floods and conntrack

and more.

It's not that this is super breathtaking, but modern deployments depend on
conntrack (I'm looking at you, k8s), even though conntrack is notoriously
under-documented and misunderstood. I hope the linked scripts will let folks
reproduce the weird conntrack behaviour and spread knowledge of the
under-documented corner cases.

~~~
wahern
The conntrack(1) utility also has some fun brokenness. For example, there's a
TOCTTOU issue when you flush connection state, as Calico and other k8s CNIs do
regularly. The utility queries the kernel state, builds a list in user space,
and then iteratively deletes each session. But if a session expires between
when it's reported and when the utility tries to delete it (for example,
hitting the standard UDP session timeout, or a TCP FIN, which happens all the
time), the delete fails with ENOENT and the utility immediately exits without
flushing the rest of the state.

Calico and other controllers resolve this by calling conntrack(1) in a loop
until success or a limit (e.g. 3, 5, or whatever magic number) is reached. But
even then you regularly get one more error than the limit, triggering alarms.

I submitted a patch last year that added a command-line switch for suppressing
certain errors, but the netdev mailing-list is high traffic and I never got
any feedback. I never expected my patch to be accepted on the first round; not
even the _approach_. I can think of several alternative ways to address
the issue, such as not exiting on error until working through the queue,
utilizing exit status codes for reporting the reason, etc. But it's
unfortunate I got no feedback whatsoever considering it's a substantial pain
point in the wild and a clear (IMO) defect in the implementation--because the
error is _inevitable_ and to be expected given the TOCTTOU race.
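
To make the "keep working through the queue" alternative concrete, here is a minimal sketch. It is a toy model, not how conntrack(1) is implemented: `FakeTable` is a hypothetical in-memory stand-in for the kernel table, and `delete_all` is an illustrative name. The only point is that ENOENT on a delete is counted and skipped rather than treated as fatal.

```python
import errno


class FakeTable:
    """Hypothetical stand-in for the kernel conntrack table."""

    def __init__(self, flows):
        self.flows = set(flows)

    def dump(self):
        # Snapshot copied to "user space"; it can go stale immediately.
        return list(self.flows)

    def delete(self, flow):
        if flow not in self.flows:
            raise OSError(errno.ENOENT, "flow already gone")
        self.flows.remove(flow)


def delete_all(table, snapshot):
    """Delete every flow in the snapshot, tolerating entries that vanished."""
    deleted = vanished = 0
    for flow in snapshot:
        try:
            table.delete(flow)
            deleted += 1
        except OSError as exc:
            if exc.errno != errno.ENOENT:
                raise  # anything else is still a real error
            vanished += 1  # expired between dump and delete: expected
    return deleted, vanished
```

If a flow expires after the snapshot is taken, the loop still clears everything else and reports how many entries had already disappeared, which a caller could map to a distinct exit status.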

Of course, if _I_ were writing a controller like Calico I probably wouldn't be
shelling out to conntrack(1), but instead making the netlink calls in-process.
But I'm not most people, especially in this particular case where everybody
else manifestly treats conntrack(1) as the de facto programmatic interface.

~~~
jkbs
Try resubmitting to the netfilter-devel mailing list:

[http://vger.kernel.org/vger-lists.html#netfilter-devel](http://vger.kernel.org/vger-lists.html#netfilter-devel)

------
the8472
I wish iptables had an option to use full-cone NAT (endpoint-independent
mapping + filtering per RFC 5128) instead of always doing symmetric NAT
(endpoint-dependent). It would dramatically cut down on the number of table
entries needed, possibly down to 1 if you use source port binding.
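
A rough way to see the difference is to count the distinct keys each scheme would need. This is only a counting model under the assumption that an endpoint-independent mapping is keyed on the internal (address, port) pair while an endpoint-dependent one is keyed on the full source and destination tuple; it says nothing about how netfilter actually stores state, and `mapping_count` is a made-up helper.

```python
def mapping_count(flows, endpoint_independent):
    """Count distinct NAT mapping keys for (src_ip, src_port, dst_ip, dst_port) flows."""
    keys = set()
    for src_ip, src_port, dst_ip, dst_port in flows:
        if endpoint_independent:
            # One mapping per internal socket, whatever the destination.
            keys.add((src_ip, src_port))
        else:
            # One mapping per internal socket and remote endpoint.
            keys.add((src_ip, src_port, dst_ip, dst_port))
    return len(keys)


# One socket bound to source port 5000, talking to 1000 different peers.
flows = [
    ("10.0.0.2", 5000, f"198.51.100.{i % 250 + 1}", 1024 + i)
    for i in range(1000)
]
```

With these flows, the endpoint-independent count is 1 and the endpoint-dependent count is 1000, which is the "possibly down to 1" case with source port binding.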

------
Thaxll
The conntrack table can fill up pretty quickly on Kubernetes if you don't
cache DNS...
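
A back-of-the-envelope sketch of why: assuming every uncached lookup leaves from a fresh source port and so creates its own entry, and that each entry lingers for a fixed timeout, the steady-state number of entries is roughly rate times timeout. Both numbers in the example are assumptions chosen for illustration, not measurements, and `steady_state_entries` is a hypothetical helper.

```python
def steady_state_entries(lookups_per_second, entry_timeout_s):
    """Approximate entries alive at once: arrival rate times entry lifetime."""
    return lookups_per_second * entry_timeout_s


# Assumed figures: 2000 uncached lookups per second, 30 s entry lifetime.
dns_entries = steady_state_entries(2000, 30)
```

Under those assumptions DNS alone would hold about 60000 entries, so a local cache that cuts the lookup rate shrinks that figure proportionally.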

