Linux kernel bug delivers corrupt TCP/IP data (medium.com)
221 points by twakefield on Feb 12, 2016 | 57 comments

If you rely on the TCP checksum to protect your data you are delusional anyways. The TCP checksum is a simple additive checksum that is very weak and cannot detect wide classes of data corruption. It is well known that it won't catch many problems, with classical papers describing this. All it can do is catch very obvious "big" mistakes.
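To make the "simple additive checksum" point concrete, here is a minimal sketch of the 16-bit one's-complement sum described in RFC 1071, applied to a made-up payload. The payload and the two corruptions are illustrative assumptions, not data from the article; they only show two classes of damage that an additive sum cannot see.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, as in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"PAY 1000 TO ALICE, 9999 TO BOB!!"  # hypothetical payload

# Corruption 1: swap two aligned 16-bit words. Addition is commutative,
# so the sum (and therefore the checksum) is unchanged.
words = [original[i:i + 2] for i in range(0, len(original), 2)]
words[2], words[5] = words[5], words[2]
reordered = b"".join(words)

# Corruption 2: add 1 to one word and subtract 1 from another.
# The two errors cancel in the sum.
tmp = bytearray(original)
tmp[1] += 1  # low byte of word 0
tmp[3] -= 1  # low byte of word 1
compensated = bytes(tmp)

for name, data in [("reordered", reordered), ("compensated", compensated)]:
    same = internet_checksum(data) == internet_checksum(original)
    print(f"{name}: data changed={data != original}, checksum unchanged={same}")
```

A single flipped bit still changes the sum, which is why the checksum does catch the obvious cases.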

That is why link layers always have other stronger CRCs too.

This is actually one of the better reasons to always use TLS. The MAC authentication it uses is much stronger.

Yeah, TCP's checksum is indeed weak and TLS is indeed a good answer.

However, I think delusional is a bit of a strong word to use here. "Performance of Checksums and CRCs over Real Data" has a bunch of interesting data about the number and types of common errors detected by simple checksums and CRCs: http://ccr.sigcomm.org/archive/1995/conf/partridge.pdf

You misunderstand the purpose of checksums. Checksums and error correcting codes do not eliminate errors, they reduce the frequency of undetected errors by a known factor.

If a system was designed with the usage of TCP checksums in mind, then removing TCP checksums will create a system with an error rate 4 orders of magnitude higher than before.
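A rough way to see where a figure like "4 orders of magnitude" can come from, assuming corruption is random enough that a damaged segment passes a 16-bit checksum about once in 2^16 tries. That uniformity is an assumption; real error patterns are not uniform, as the paper linked above discusses.

```python
import math

CHECKSUM_BITS = 16  # width of the TCP checksum field

# Assumption: a corrupted segment matches its checksum by chance
# with probability about 1 / 2**16.
pass_probability = 1 / 2 ** CHECKSUM_BITS
orders_of_magnitude = math.log10(2 ** CHECKSUM_BITS)

print(f"chance a corrupt segment slips through: {pass_probability:.2e}")
print(f"reduction in undetected errors: about 10^{orders_of_magnitude:.1f}")
```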

Does SCTP have a better CRC/checksum than TCP?

EDIT: yes, it does. As of RFC 3309, it uses CRC32c (the one which has an instruction in SSE).
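For reference, a slow but self-contained bitwise sketch of CRC32c (the Castagnoli polynomial, in its reflected form 0x82F63B78). Real implementations use lookup tables or the hardware instruction; this version is only meant to show the algorithm. The second input pair is an illustrative assumption: it swaps two adjacent 16-bit words, a change that an additive checksum cannot see but that a 32-bit CRC always detects, because the damage fits within a 32-bit burst.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if a 1 fell off.
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


# Standard check value for this CRC.
print(hex(crc32c(b"123456789")))

# Swapping two adjacent 16-bit words changes the CRC.
print(crc32c(b"ABCDpayload") != crc32c(b"CDABpayload"))
```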

Does TLS support transparent retransmits of corrupted packets?


This article talks about TLS and corruption, including DTLS. I found it educational!

I don't think so. Probably it'll just reset connection.

I guess that's arguably better than accepting corrupted data, but still a step backwards from the abstraction TCP is intended to provide in regards to checksums ...

The TCP checksum and TLS's checksum/MAC both have their use cases. Not every protocol built on TCP supports TLS. If the TCP checksum is broken, we have to fix it. So I agree with the reply below that "delusional" is too strong. If a simple checksum is buggy, how can I trust anything about said TCP implementation? How can you be sure that other data integrity implementations aren't buggy? This bug isn't really about whether the TCP checksum is weak or not; rather, it's just a bug that I suppose most people ignore.

Checksums are a cost-efficient way to detect hardware errors statistically.

It's just that no modern engineers seem to care to check the checksums.

Which only makes me more curious about piping things over udp, tcp included.

Google Container Engine VMs should be protected from this by the virtual NIC advertised to them. In particular, it advertises support for TCP checksum verification offload which all modern Linux guests negotiate (including the kernel used in GKE). If this feature is negotiated the host-side network (either in hardware or software) verifies the TCP checksum on the packet on behalf of the guest and marks it as having been validated prior to delivering it to the guest.

Older Linux kernels have an additional (I believe distinct) veth related bug that requires we do some extra work for externally verified packets (and jumbograms): in particular we must set up the packet we are delivering assuming that it might be routed beyond the guest and that the guest will not remove/reapply the virtio-net header as an intermediate step (this is a pretty leaky abstraction of the Linux virtio-net driver, but one we're aware of and have worked to accommodate).

Of course, none of the above changes the somewhat fragile nature of TCP checksums, generally.

(note: I wrote the virtio-net NIC we advertise in GCE/GKE, although very little of the underlying dataplane, but I double checked with the GKE team in terms of underlying kernel versions that we typically run).

Thanks for the details! This is basically what Vijay and I were guessing by the fact that the MTU is something less than 1500 on Google Compute Engine.

The demonstration that Vijay added to the post was done on Google Container Engine, using Kubernetes. The packets were sent corrupted using netem. We tested a few configurations and were unable to get corrupt packets to be delivered to a Google Container Engine instance, so I agree with your assessment. Most importantly: it appears that the Container Engine TCP load balancer drops corrupt packets from the public Internet.

However: If someone is using some weird routing or VPN configuration, it might be possible (but this seems unlikely). Notably: I seem to recall that if you send corrupt packets to a Compute Engine instance, they are received corrupted (through the Compute Engine NAT). So if you used your own custom routing to get packets to a Google Container Engine application, this might apply. But again, you would have to really try to have this happen :)

Update: Actually under rare circumstances we'll validate the checksum on the host side, realize it's wrong, pass it to the guest as CHECKSUM_NEEDED and the guest will happily fail to checksum it and mark it as already validated. So, GCE/GKE is not currently completely immune to this unfortunately :(

Practically speaking any traffic from the internet would not be affected as we require a valid checksum before doing the internet->guest NAT (see the GCE docs for what I mean here if this is unclear), but inter-VM traffic can potentially be impacted. On the other hand, inter-VM traffic has other verification applied to it that's stronger than a TCP checksum. Basically: if you try to trigger this by sending bad packets you might succeed (although even there it's not 100% guaranteed, but the details of what will/won't trigger it delve into implementation arcana that I'm not comfortable sharing).

Sounds similar to bug #3 in this article: https://www.pagerduty.com/blog/the-discovery-of-apache-zooke...

Looks like a similar issue with a different cause (in this case the kernel seems to be assuming that IPSec packet contents were already checksummed by the outer IPSec packet and thus correct).

Just skimmed it; very interesting. I hadn't heard of this bug before, but will read it in detail soon. Thanks!

This is a great write up.

Also, for those saying that TLS is a panacea: encrypting and/or HMAC'ing all TCP data in and out of a box is operationally ridiculous unless you're in some sort of ultra high security environment.

Sorry, what's ridiculous about it? It's a very achievable thing. On modern CPUs with instruction support, AES encryption can be done faster than DRAM bandwidth. There are definitely latency costs in connection setup that will penalize "transaction-like" protocols I guess. It's not 100% free, but relative to the other performance issues you're looking at it's surely way way way down the list of priorities.

Please note my use of the word 'operationally'.

How does that change things? What's "operationally" ridiculous about it?

The overhead involved in piling on encryption management in an environment that doesn't specifically warrant it is a waste of resources.

From the link I didn't understand why the hardware didn't verify the checksum. Or does it in fact do so, and only setups using a) veth and b) a NIC without hardware checksumming are affected?

Oops I probably should have been clearer.

Hardware verification IS performed. For various reasons, the NIC itself never drops packets that are corrupt; packets are instead marked by HW as either verified or unverified. When a packet is marked as unverified, the kernel should verify and potentially reject the packet before delivery to the application. The bug in the veth driver causes the kernel to treat packets marked unverified as "verified".

If this problem affects all veth drivers, why is Docker's NAT IPv4 configuration safe?

The hardware does do checksumming at the link layer (usually). This is talking about the TCP checksum, which is an essentially redundant (and 16 bit!) sum higher up the protocol stack. Honestly it's mostly useless as an error detection mechanism. But it's required per spec to be validated, and the veth devices apparently didn't.

It's actually very useful, although very weak. The link layer check, at least in the case of Ethernet, really only protects your data "on the wire". It doesn't protect it inside the switches. It turns out there are failure modes, such as the one that caused this issue, where the packets get corrupted inside the switch (probably due to bad RAM). Details about how this can happen: http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html

After this kernel bug was fixed, the TCP checksum did its job and discarded all the corrupt packets.
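A toy model of why a per-hop check can miss this, assuming a hypothetical switch that verifies the frame check sequence on ingress, corrupts the payload while it sits in memory, and computes a fresh FCS on egress. `zlib.crc32` uses the same CRC-32 polynomial as Ethernet; the framing and the `faulty_switch` behaviour are simplified assumptions, not a description of any specific device.

```python
import zlib


def add_fcs(payload: bytes) -> bytes:
    """Append a CRC-32 frame check sequence (simplified framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def fcs_ok(frame: bytes) -> bool:
    """Check that the trailing 4 bytes match the CRC-32 of the payload."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == fcs


def faulty_switch(frame: bytes) -> bytes:
    """Hypothetical switch: valid frame in, bit flip in RAM, fresh FCS out."""
    if not fcs_ok(frame):
        raise ValueError("bad FCS on ingress, frame dropped")
    payload = bytearray(frame[:-4])
    payload[3] ^= 0x10  # simulated memory corruption inside the switch
    return add_fcs(bytes(payload))


sent = add_fcs(b"hello, zookeeper")
received = faulty_switch(sent)

print("link-layer check passes:", fcs_ok(received))
print("payload intact:", received[:-4] == sent[:-4])
```

The receiver's link-layer check passes even though the payload changed, so only a check computed by the original sender (the TCP checksum, or a MAC at a higher layer) has a chance of noticing.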

Is there a tl;dr? Who would be affected? Sounds scary.

Virtual ethernet devices, which are used in some container deployments (you / your ops team probably know if you're using them), do not check TCP checksums. This appears to be because the original programmer was thinking about applications on the same machine communicating across virtual ethernet devices, but the optimization also affected traffic from a physical network that was routed onto a virtual network.

If your physical network corrupts data, TCP is supposed to notice the checksum mismatch and drop the packet, and wait for it to be retransmitted. Because of this bug, Linux's TCP implementation was not validating checksums, which allowed corrupt data to reach the application.

This requires a faulty physical network, which is rare but nowhere near nonexistent. (The kernel is not introducing corruption to these packets.)

You'll be affected if you use a veth (becoming more popular with docker/container schedulers) and have a corrupt packet floating through your network.

... and are not using authenticated encryption as you should.

In my experience, there is a bit of hardware (which was the root cause in the article's case) between SSL termination and application servers. So even using encryption, you are still vulnerable.

In many situations, you might still have unencrypted traffic, even if your app is using authenticated encryption. Like, for example, if you're doing DNS lookups, or syslog to a remote host, etc.

It does mention toggling off the veth device "checksum offloading" as a valid workaround.

I've been trying to get Docker to include this workaround when it creates containers to ensure people don't run into it, but this has not gotten any attention: https://github.com/docker/docker/issues/18776

Mesos has a workaround like this in it now.

yes, the code is as follows in the broken veth:

if (skb->ip_summed == CHECKSUM_NONE && rcv->features & NETIF_F_RXCSUM)

Checksum offloading is encapsulated in the rcv->features bitmap, so disabling it will hide this bug.
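A simplified model of that condition, not kernel code. The enum names mirror the kernel constants, but everything else (the `Packet` class, `veth_xmit`, `tcp_receive`) is a hypothetical stand-in. The body of the `if` is assumed from the description elsewhere in this thread: unverified packets end up treated as verified.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IpSummed(Enum):
    # Names mirror the kernel constants; the values here are arbitrary.
    CHECKSUM_NONE = auto()         # nobody has verified the checksum yet
    CHECKSUM_UNNECESSARY = auto()  # already verified, stack may skip the check


@dataclass
class Packet:
    checksum_valid: bool  # ground truth: does the checksum match the data?
    ip_summed: IpSummed


def veth_xmit(pkt: Packet, rx_csum_offload: bool, buggy: bool) -> Packet:
    """Hand a packet to the peer veth device (toy model)."""
    if buggy and pkt.ip_summed is IpSummed.CHECKSUM_NONE and rx_csum_offload:
        # Assumed effect of the broken branch: promote unverified to verified.
        return Packet(pkt.checksum_valid, IpSummed.CHECKSUM_UNNECESSARY)
    return pkt


def tcp_receive(pkt: Packet) -> bool:
    """Return True if the packet would be delivered to the application."""
    if pkt.ip_summed is IpSummed.CHECKSUM_UNNECESSARY:
        return True            # trust the earlier verification
    return pkt.checksum_valid  # otherwise verify in software


corrupt = Packet(checksum_valid=False, ip_summed=IpSummed.CHECKSUM_NONE)

print("buggy, offload on :", tcp_receive(veth_xmit(corrupt, True, buggy=True)))
print("buggy, offload off:", tcp_receive(veth_xmit(corrupt, False, buggy=True)))
print("fixed, offload on :", tcp_receive(veth_xmit(corrupt, True, buggy=False)))
```

In this model the corrupt packet is delivered only when the buggy branch runs with receive checksum offload enabled, which matches why turning offload off hides the bug.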

You can do something like this within your container to disable it (from memory, might be slightly off):

$ ethtool --offload VETH_DEVICE_NAME rx off tx off
$ ethtool -K VETH_DEVICE_NAME gso off

Here is another Linux TCP horror story: An application I'm working on was experiencing slow database query performance under "load" [1]. Restarting the database temporarily "fixed" the issue, only to reappear again after a short time.

Luckily I was able to recreate the problem in a test environment (our secondary backup cluster), allowing me to study it. What I found was that I could reliably send the database cluster into a "bad state" by sending a burst of > ~200 concurrent requests to it. After this, I observed a bi-modal response time distribution with some requests completing quickly as expected (<10ms) and some taking much longer (consistently ~6s for one particular request). My initial instinct was to blame the database, but some SYN Cookie flood warnings in the kernel logs caused me to consider the network as well.

So I started using tcpdump and Wireshark to go deeper and found the following: The burst of traffic from the application also caused a burst of traffic between the database cluster nodes which were performing some sort of result merging. To make things worse, the inter-node requests of the database cluster were using http, which meant a lot of connections were created in the process. Some of these connections were interpreted as a SYN flood by the Linux kernel, causing it to SYN-ACK them with SYN cookies. Additionally, these connections would get stuck with very small TCP windows (as low as 53 bytes), and also suffer from really high ACK latencies (200ms), so a 1600 byte inter-node http request wound up taking 6s! Disabling SYN cookies "fixed" the issue (and so did increasing somaxconn, but that's effectively the same), but despite my best effort, I was unable to understand why SYN cookies should impact the TCP window.
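A back-of-the-envelope check of those numbers, assuming the sender can only keep one window of data in flight and has to wait one ACK latency before sending the next window. That stop-and-wait behaviour is a simplifying assumption, not something read from the packet capture.

```python
import math

request_bytes = 1600   # size of the inter-node HTTP request
window_bytes = 53      # smallest TCP window observed
ack_latency_s = 0.200  # observed ACK latency

# Assumption: one window per ACK round, nothing else in flight.
round_trips = math.ceil(request_bytes / window_bytes)
estimated_seconds = round_trips * ack_latency_s

print(f"{round_trips} rounds, roughly {estimated_seconds:.1f}s")
```

Under that assumption the estimate lands close to the ~6s that was measured, so the tiny window plus the slow ACKs would be enough to explain the slow requests.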

To make this even more mysterious, this problem only occurred in one of our data centers, and we narrowed it down to the router being the only difference. Replacing the router also "fixed" the issue.

I wish my team had the resources and expertise to debug problems like this down to the kernel, but I was too far out of my depth trying to understand the gnarly code that makes up the Linux TCP SYN cookie and congestion control implementation ... :(

Anyway, I'm posting this in the vague hope that somebody may have seen something similar before, or becomes inspired to go on kernel bug hunt :).

Additionally, this experience gave me a new appreciation for TCP/IP and how amazing it is that it usually "just works" for me as an application developer. This is not to say that we can't improve upon it, but I think there is a lot to learn from the philosophy and approach that went into designing TCP/IP.

[1] By "load" I mean bursts of hundreds of concurrent http requests created due to the application performing JOIN requests on behalf of the NoSQL database which doesn't provide this feature. My journey of replacing this database with one that's more suited for the task at hand is being written as we speak :).

[2] https://en.wikipedia.org/wiki/SYN_cookies

> To make this even more mysterious, this problem only occurred in one of our data centers, and we narrowed it down to the router being the only difference. Replacing the router also "fixed" the issue.

Did you confirm the packets were the same on both sides of the router? This pattern sounds like possibly network microbursts, which can lead to dropped packets -- maybe the replacement router had a larger buffer so you didn't drop any packets, and things worked better.

I've also seen some exciting bugs with SYN cookies on other platforms, it's possible there's some encoding error leading you to a very small window -- something wrong with the window scaling negotiation perhaps? If you've got enough incoming syns that syn+ack state is being dropped (so syncookies are being used to finish the connection), you're also going to not retransmit syn+acks, so then you need to wait for the SYN to be retransmitted.

There was definitely packet loss and retransmission due to it, so the packets were not the same on both sides of the router. But this doesn't explain why TCP connections would be stuck with small windows for their entire lifetime after the burst is over? I was actually also able to reproduce the problem in one of our other DCs by reducing somaxconn to a very low value (e.g. 2 IIRC).

Window scaling was definitely on my radar, but IIRC the Linux SYN cookie implementation is supposed to encode if this TCP option is set in the initial SYN as well, so I don't see how this would explain things unless the implementation is broken. I'm not sure if I understand what you're saying about SYN+ACKs not being retransmitted, I think they should be if they get lost, it's only the final ACK from the client that doesn't get retransmitted if it's lost, but that would cause different symptoms.

> Window scaling was definitely on my radar, but IIRC the Linux SYN cookie implementation is supposed to encode if this TCP option is set in the initial SYN as well, so I don't see how this would explain things unless the implementation is broken.

Yeah, it's supposed to be encoded, but it's possible it's broken. Given that you can reproduce it with a low somaxconn, it seems like the router wasn't the problem, the router just was able to more easily trigger the problem.

> I'm not sure if I understand what you're saying about SYN+ACKs not being retransmitted, I think they should be if they get lost, it's only the final ACK from the client that doesn't get retransmitted if it's lost, but that would cause different symptoms.

SYN+ACK will normally be retransmitted, but if you're in synflood conditions, it means the kernel is going to throw away state for SYN+ACKs, and then it no longer has the state to retransmit. It will reconstruct the state if it gets an ACK that is a decodable syncookie, or if it gets a retransmitted SYN, since it'll make a new SYN+ACK for that (although that may end up with a different sequence number).

This reminds me of a similar problem observed in a distributed file system. Eventually it was tracked down to a single bad network card. Why do you think it was a kernel issue? Hardware has "bugs" too.

I don't know if it was a kernel issue, but I was able to reproduce the problem in another DC by setting somaxconn artificially low (to 2 IIRC) in order to artificially force Linux into SYN cookie mode, and that reproduced this issue as well.

Given my knowledge of TCP and SYN cookies, I was not able to come up with another theory for why individual TCP connections would be stuck with a tiny window, because it seems that the entire idea behind congestion control is to increase the window once the congestion is gone ... That's why I think it might be a kernel issue. Given the complexity of TCP, I'd say there is an at least equally high chance I just don't know enough ;)

I believe SYN cookies affect the TCP window because the kernel can't actually guarantee it has enough RAM to allocate a reasonably sized buffer when it sends the SYN-ACK, as by design it doesn't allocate any memory until after the ACK is returned from the other side.

One potential explanation depending on your kernel version - traditional SYN cookies cannot encode TCP extensions, including window scaling, which will (nearly) always result in slower transfers. Newer kernels (The Googles aren't giving me a kernel version, unfortunately) use the TCP timestamp option in addition to the sequence to encode a limited number of TCP options, including WS.

More info: http://www.inesc-id.pt/pt/indicadores/Ficheiros/165.pdf
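To illustrate why a traditional cookie has no room for window scaling, here is a sketch of the classic 32-bit layout (5 bits of a slow counter, 3 bits of MSS index, 24 bits of keyed hash). This is not Linux's implementation: the MSS table, the SHA-256 stand-in hash and the function names are all assumptions made for the example.

```python
import hashlib

# Hypothetical MSS table; 3 bits would allow up to 8 entries.
MSS_TABLE = [536, 1300, 1440, 1460]


def make_cookie(conn: bytes, t: int, mss: int, secret: bytes) -> int:
    """Pack counter, MSS index and a keyed hash into a 32-bit sequence number."""
    # Largest table entry not exceeding the client's MSS (index 0 as fallback).
    mss_index = max((i for i, m in enumerate(MSS_TABLE) if m <= mss), default=0)
    digest = hashlib.sha256(secret + conn + t.to_bytes(8, "big")).digest()
    mac = int.from_bytes(digest[:3], "big")  # 24 bits of the hash
    return ((t % 32) << 27) | (mss_index << 24) | mac


def decode_cookie(cookie: int) -> dict:
    """Recover the only connection state the cookie carries."""
    return {"t_mod_32": cookie >> 27, "mss": MSS_TABLE[(cookie >> 24) & 0x7]}


cookie = make_cookie(b"10.0.0.1:5000->10.0.0.2:80", t=70, mss=1460, secret=b"k")
print(decode_cookie(cookie))
```

Everything the server remembers has to fit in those 32 bits, so options such as window scaling or SACK are simply lost unless they are carried somewhere else, for example in the timestamp option as described above.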

Sure, but after the connection is established, and "memory pressure" [1] is gone, shouldn't the tcp implementation grow the window again?

[1] I'm putting this in quotes, because the machines in question have ridiculous amounts of memory, and the pressure merely came from SOMAXCONN being at its default value of 128.

Very interesting, does it affect libvirt / lxc? I wonder what is the frequency of this problem.

It might. I don't know enough about them. Anything that routes packets without NAT to a veth device is affected.

lxc uses the same fake ethernet devices (they're easy to create, check out brctl). KVM guests too, I'll bet.

Shouldn't this affect anything that uses veths, not just containers? Such as Openstack.

Yes, with some caveats: Lots of configurations that use veths use NAT to share the IP address of the host. For example, this is Docker's default configuration. In this case, the host kernel checks the TCP checksum no matter what, so this issue doesn't apply.

This problem only happens when the packets are routed from the host to the container. This happens in Kubernetes, which assigns each container its own IP address, Docker's IPv6 configuration (same), and Mesos, which uses Linux Traffic Control rules to share a range of ports with each container.

Ubuntu 12.04 LTS (still 14 months to go) with the latest Hardware Enablement Stack (12.04.5) runs on the 3.13 kernel. I hope that Canonical will backport the fix to 3.13 and not only to 3.14 as hinted by the article.

I think it's in the 3.13 queue already: http://kernel.ubuntu.com/git/ubuntu/linux.git/commit/?h=linu...

Anyone know if RedHat plans to fix this for their distros (and downstream like CentOS)? Poking around in their bugzilla, I can't find anything yet.

I like the "goto drop" statement.

Glad you shared that with us.

Does this affect Docker overlay networks?

use romana.io instead of veth
