
Linux kernel bug delivers corrupt TCP/IP data - twakefield
https://medium.com/vijay-pandurangan/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.ay7af76xj
======
nn3
If you rely on the TCP checksum to protect your data, you are delusional
anyway. The TCP checksum is a simple additive checksum that is very weak and
cannot detect wide classes of data corruption. It is well known that it won't
catch many problems; classic papers (e.g. Stone and Partridge's "When the CRC
and TCP Checksum Disagree") describe this. All it can do is catch very obvious
"big" mistakes.
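
To see how weak: the checksum (RFC 1071) is just a ones'-complement sum of
16-bit words, so it is order-independent and any reordering of those words
(or offsetting errors that cancel) passes undetected. A minimal sketch, with
illustrative names:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* RFC 1071 Internet checksum: fold a 16-bit ones'-complement sum. */
    static uint16_t inet_checksum(const uint16_t *words, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += words[i];
        while (sum >> 16)                     /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t a[4] = { 0x1234, 0x5678, 0x9abc, 0xdef0 };
        uint16_t b[4] = { 0x5678, 0x1234, 0xdef0, 0x9abc }; /* words swapped */
        /* Prints the same checksum twice: the corruption is invisible. */
        printf("%04x %04x\n", inet_checksum(a, 4), inet_checksum(b, 4));
        return 0;
    }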

That is why link layers always have other stronger CRCs too.

This is actually one of the better reasons to always use TLS. The MAC
authentication it uses is much stronger.

~~~
felixge
Does TLS support transparent retransmits of corrupted packets?

~~~
vbezhenar
I don't think so. Probably it'll just reset the connection.

~~~
felixge
I guess that's arguably better than accepting corrupted data, but still a step
backwards from the abstraction TCP is intended to provide with regard to
checksums ...

------
jsolson
Google Container Engine VMs should be protected from this by the virtual NIC
advertised to them. In particular, it advertises support for TCP checksum
verification offload, which all modern Linux guests negotiate (including the
kernel used in GKE). If this feature is negotiated the host-side network
(either in hardware or software) verifies the TCP checksum on the packet on
behalf of the guest and marks it as having been validated prior to delivering
it to the guest.
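
(For context, the guest half of that negotiation looks roughly like this in
the Linux virtio_net driver's receive path; paraphrased, and details vary by
kernel version:)

    /* Paraphrased from the rx path of drivers/net/virtio_net.c. If the
     * host-side dataplane already verified the checksum, it sets
     * VIRTIO_NET_HDR_F_DATA_VALID in the virtio-net header, and the guest
     * marks the skb so the stack skips its own verification. */
    if (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID)
        skb->ip_summed = CHECKSUM_UNNECESSARY;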

Older Linux kernels have an additional (I believe distinct) veth-related bug
that requires us to do some extra work for externally verified packets (and
jumbograms): in particular, we must set up the packet we are delivering
assuming that it might be routed beyond the guest _and_ that the guest will
not remove/reapply the virtio-net header as an intermediate step (this is a
pretty leaky abstraction in the Linux virtio-net driver, but one we're aware
of and have worked to accommodate).

Of course, none of the above changes the somewhat fragile nature of TCP
checksums, generally.

(Note: I wrote the virtio-net NIC we advertise in GCE/GKE, though very little
of the underlying dataplane, but I double-checked with the GKE team about the
kernel versions we typically run.)

~~~
evanj
Thanks for the details! This is basically what Vijay and I were guessing from
the fact that the MTU is something less than 1500 on Google Compute Engine.

The demonstration that Vijay added to the post was done on Google Container
Engine, using Kubernetes. The packets were _sent corrupted_ using netem. We
tested a few configurations and were unable to get corrupt packets to be
delivered to a Google Container Engine instance, so I agree with your
assessment. Most importantly: it appears that the Container Engine TCP load
balancer drops corrupt packets from the public Internet.

However: If someone is using some weird routing or VPN configuration, it
_might_ be possible (but this seems unlikely). Notably: I seem to recall that
if you send corrupt packets to a Compute Engine instance, they are received
corrupted (through the Compute Engine NAT). So if you used your own custom
routing to get packets to a Google Container Engine application, this might
apply. But again, you would have to really _try_ to have this happen :)
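
(And if you did want to _try_: here's a hedged sketch of hand-crafting a TCP
segment with a bad checksum, as an alternative to netem. It needs root, and
the target address is made up. A healthy stack should drop it silently; a
host with the veth bug may hand it to the application:)

    #include <arpa/inet.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* Raw TCP socket: the kernel builds the IP header for us, but it
         * does NOT fill in the TCP checksum, so our bogus value goes out
         * on the wire as-is. */
        int s = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
        if (s < 0) { perror("socket (need root)"); return 1; }

        struct tcphdr th;
        memset(&th, 0, sizeof(th));
        th.source = htons(12345);
        th.dest   = htons(80);
        th.seq    = htonl(1);
        th.doff   = 5;                 /* 5 x 32-bit words, no options */
        th.syn    = 1;
        th.window = htons(65535);
        th.check  = htons(0xdead);     /* deliberately wrong checksum */

        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr); /* hypothetical target */

        if (sendto(s, &th, sizeof(th), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");
        close(s);
        return 0;
    }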

------
fred256
Sounds similar to bug #3 in this article: [https://www.pagerduty.com/blog/the-discovery-of-apache-zooke...](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/)

~~~
MBCook
Looks like a similar issue with a different cause (in this case the kernel
seems to be assuming that IPSec packet contents were already checksummed by
the outer IPSec packet and thus correct).

------
packetized
This is a great write-up.

Also, for those saying that TLS is a panacea: encrypting and/or HMAC'ing all
TCP data in and out of a box is operationally ridiculous unless you're in some
sort of ultra-high-security environment.

~~~
ajross
Sorry, what's ridiculous about it? It's a very achievable thing. On modern
CPUs with instruction support, AES encryption can be done faster than DRAM
bandwidth. There are definitely latency costs in connection setup that will
penalize "transaction-like" protocols I guess. It's not 100% free, but
relative to the other performance issues you're looking at it's surely way way
way down the list of priorities.

~~~
packetized
Please note my use of the word 'operationally'.

~~~
ajross
How does that change things? What's "operationally" ridiculous about it?

~~~
packetized
The overhead involved in piling on encryption management in an environment
that doesn't specifically warrant it is a waste of resources.

------
betaby
From the link I didn't understand why the hardware didn't do checksum
checking. Or does it in fact do so, and only someone using a) veth and b) a
NIC without hardware checksumming is affected?

~~~
vijayp
Oops, I probably should have been clearer.

Hardware verification IS performed. For various reasons, the NIC never itself
drops packets that are corrupt; packets are instead marked by the hardware as
either verified or unverified. When a packet is marked as unverified, the
kernel should verify it and potentially reject it before delivery to the
application. The bug in the veth driver causes the kernel to treat packets
marked unverified as "verified".
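
For the curious, the offending logic in the transmit path of
drivers/net/veth.c looked roughly like this (paraphrased; the fix linked
elsewhere in this thread simply deletes it):

    /* Paraphrased from veth_xmit() before the fix. CHECKSUM_NONE means
     * "not yet verified"; CHECKSUM_UNNECESSARY means "already verified
     * by hardware". Blindly promoting the former to the latter is what
     * let corrupt packets reach applications. */
    if (skb->ip_summed == CHECKSUM_NONE &&
        rcv->features & NETIF_F_RXCSUM)
        skb->ip_summed = CHECKSUM_UNNECESSARY;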

~~~
csandanov
If this problem affects all veth drivers, why is Docker's IPv4 NAT safe?

------
derFunk
Is there a tl;dr? Who would be affected? Sounds scary.

~~~
tyingq
It does mention toggling off "checksum offloading" on the veth device as a
valid workaround.

~~~
evanj
I've been trying to get Docker to include this workaround when it creates
containers to ensure people don't run into it, but this has not gotten any
attention:
[https://github.com/docker/docker/issues/18776](https://github.com/docker/docker/issues/18776)

Mesos has a workaround like this in it now.
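
For reference, the workaround boils down to clearing the rx-checksumming
feature on the veth device (ethtool -K <veth> rx off), which defeats the
NETIF_F_RXCSUM branch shown upthread. A minimal sketch of doing the same
programmatically via the legacy ethtool ioctl (disable_rx_csum is my own
illustrative name, not Docker's or Mesos's code):

    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Disable rx checksum offload on a device, equivalent to
     * "ethtool -K <dev> rx off". Returns 0 on success, -1 on error. */
    static int disable_rx_csum(const char *dev)
    {
        struct ethtool_value val = { .cmd = ETHTOOL_SRXCSUM, .data = 0 };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&val;
        int rc = ioctl(fd, SIOCETHTOOL, &ifr);
        close(fd);
        return rc;
    }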

------
felixge
Here is another Linux TCP horror story: An application I'm working on was
experiencing slow database query performance under "load" [1]. Restarting the
database temporarily "fixed" the issue, only for it to reappear a short time
later.

Luckily I was able to recreate the problem in a test environment (our
secondary backup cluster), allowing me to study it. What I found was that I
could reliably put the database cluster into a "bad state" by sending a burst
of > ~200 concurrent requests to it. After this, I observed a bi-modal
response time distribution with some requests completing quickly as expected
(<10ms) and some taking much longer (consistently ~6s for one particular
request). My initial instinct was to blame the database, but some SYN cookie
[2] flood warnings in the kernel logs caused me to consider the network as
well.

So I started using tcpdump and Wireshark to go deeper and found the following:
The burst of traffic from the application also caused a burst of traffic
between the database cluster nodes which were performing some sort of result
merging. To make things worse, the inter-node requests of the database cluster
were using http, which meant a lot of connections were created in the process.
Some of these connections were interpreted as a SYN flood by the Linux kernel,
causing it to SYN-ACK them with SYN cookies. Additionally, these connections
would get stuck with very small TCP windows (as low as 53 bytes), and also
suffer from really high ACK latencies (200ms), so a 1600 byte inter-node http
request wound up taking 6s! Disabling SYN cookies "fixed" the issue (and so
did increasing somaxconn, but that's effectively the same), but despite my
best effort, I was unable to understand why SYN cookies should impact the TCP
window.

To make this even more mysterious, this problem only occurred in one of our
data centers, and we narrowed it down to the router being the only difference.
Replacing the router also "fixed" the issue.

I wish my team had the resources and expertise to debug problems like this
down to the kernel, but I was too far out of my depth trying to understand the
gnarly code that makes up the Linux TCP SYN cookie and congestion control
implementations ... :(.

Anyway, I'm posting this in the vague hope that somebody may have seen
something similar before, or becomes inspired to go on a kernel bug hunt :).

Additionally this experience gave me a new appreciation for TCP/IP and how
amazing it is that it usually "just works" for me as an application developer.
This is not to say that we can't improve upon it, but I think there is a lot
to learn from the philosophy and approach that went into designing TCP/IP.

[1] By "load" I mean bursts of hundreds of concurrent http requests created
due the application performing JOIN requests on behalf of the NoSQL database
which doesn't provide this feature. My journey of replacing this database with
one that's more suited for the task at hand is being written as we speak :).
[2]
[https://en.wikipedia.org/wiki/SYN_cookies](https://en.wikipedia.org/wiki/SYN_cookies)

~~~
makomk
I believe SYN cookies affect the TCP window because the kernel can't actually
guarantee it has enough RAM to allocate a reasonably sized buffer when it
sends the SYN-ACK, as by design it doesn't allocate any memory until after the
ACK is returned from the other side.

~~~
cawoodfield
One potential explanation, depending on your kernel version: traditional SYN
cookies cannot encode TCP extensions, including window scaling, which will
(nearly) always result in slower transfers. Newer kernels (the Googles aren't
giving me a kernel version, unfortunately) use the TCP timestamp option in
addition to the sequence number to encode a limited number of TCP options,
including WS.
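
To make the limitation concrete, here's a schematic of the classic cookie
layout as described in the SYN cookies Wikipedia article linked upthread (a
sketch only; Linux's actual formula is additive and differs in detail):

    #include <stdint.h>

    /* Classic SYN cookie layout (after D. J. Bernstein): all server-side
     * connection state must fit in the 32-bit initial sequence number
     * that the client echoes back in its final ACK.
     *
     *   bits 31..27: t mod 32        (coarse time counter, 5 bits)
     *   bits 26..24: MSS table index (3 bits -> only 8 canned MSS values)
     *   bits 23..0 : keyed hash over (saddr, sport, daddr, dport, t)
     *
     * Nothing is left over for negotiated options such as window scaling,
     * which is why syncookie'd connections can end up with tiny windows. */
    static uint32_t make_syn_cookie(uint32_t t, uint32_t mss_index,
                                    uint32_t hash24)
    {
        return ((t % 32) << 27) | ((mss_index & 7) << 24)
             | (hash24 & 0xffffff);
    }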

More info: [http://www.inesc-id.pt/pt/indicadores/Ficheiros/165.pdf](http://www.inesc-id.pt/pt/indicadores/Ficheiros/165.pdf)

------
dgpl
Very interesting. Does it affect libvirt / lxc? I wonder how frequent this
problem is.

~~~
evanj
It might. I don't know enough about them. Anything that routes packets without
NAT to a veth device is affected.

------
outworlder
Shouldn't this affect anything that uses veths, not just containers? Such as
Openstack.

~~~
evanj
Yes, with some caveats: Lots of configurations that use veths use NAT to share
the IP address of the host. For example, this is Docker's default
configuration. In this case, the host kernel checks the TCP checksum no matter
what, so this issue doesn't apply.

This problem only happens when the packets are _routed_ from the host to the
container. This happens in Kubernetes, which assigns each container its own IP
address, in Docker's IPv6 configuration (same), and in Mesos, which uses Linux
Traffic Control rules to share a range of ports with each container.

------
pmontra
Ubuntu 12.04 LTS (still 14 months to go) with the latest Hardware Enablement
Stack (12.04.5) runs on the 3.13 kernel. I hope that Canonical will backport
the fix to 3.13 and not only to 3.14 as hinted by the article.

~~~
vijayp
I think it's in the 3.13 queue already:
[http://kernel.ubuntu.com/git/ubuntu/linux.git/commit/?h=linu...](http://kernel.ubuntu.com/git/ubuntu/linux.git/commit/?h=linux-3.13.y-queue&id=ebd355e41f090a3d3a6fb24e1e5921464710b2c2)

------
doggydogs94
I like the "goto drop" statement.

~~~
seba_dos1
Glad you shared that with us.

------
sz4kerto
Does this affect Docker overlay networks?

------
bobinator606
use romana.io instead of veth

