That is why link layers always have their own, stronger CRCs too.
This is actually one of the better reasons to always use TLS: the MAC authentication it uses is much stronger.
However, I think "delusional" is a bit of a strong word to use here. "Performance of Checksums and CRCs over Real Data" has a bunch of interesting data about the number and types of common errors detected by simple checksums and CRCs: http://ccr.sigcomm.org/archive/1995/conf/partridge.pdf
If a system was designed with TCP checksums in mind, then removing them creates a system with an error rate roughly four orders of magnitude higher than before (a 16-bit checksum lets through only about 1 in 65,536 (2^16) random corruptions).
EDIT: yes, it does. As of RFC 3309, it uses CRC32c (the one with a dedicated instruction in SSE4.2).
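For the curious, here's a minimal sketch of using that instruction via the SSE4.2 intrinsics (assumes x86-64, compiled with -msse4.2; the function name is mine):

    #include <nmmintrin.h>   /* SSE4.2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* CRC32c over a buffer using the SSE4.2 crc32 instruction, which
     * implements the Castagnoli polynomial that RFC 3309 adopted. */
    uint32_t crc32c(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = ~0u;                  /* conventional initial value */
        while (len >= 8) {
            uint64_t chunk;
            memcpy(&chunk, p, sizeof chunk); /* avoid unaligned-access UB */
            crc = (uint32_t)_mm_crc32_u64(crc, chunk);
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *p++);
        return ~crc;                         /* conventional final inversion */
    }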
This article talks about TLS and corruption, including DTLS. I found it educational!
It's just that modern engineers don't seem to care to check the checksums.
Older Linux kernels have an additional (I believe distinct) veth-related bug that requires us to do some extra work for externally verified packets (and jumbograms): in particular, we must set up the packet we are delivering assuming that it might be routed beyond the guest and that the guest will not remove and reapply the virtio-net header as an intermediate step (this is a pretty leaky abstraction in the Linux virtio-net driver, but one we're aware of and have worked to accommodate).
Of course, none of the above changes the somewhat fragile nature of TCP checksums, generally.
(Note: I wrote the virtio-net NIC we advertise in GCE/GKE, although very little of the underlying dataplane, but I double-checked with the GKE team about the underlying kernel versions that we typically run.)
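For context, the virtio-net header being discussed is the small per-packet preamble that carries offload metadata between guest and host. A sketch of inspecting one, using the definitions from <linux/virtio_net.h> (the describe_hdr helper is mine):

    #include <linux/virtio_net.h>
    #include <stdio.h>

    /* Per-packet metadata prepended to frames on a virtio-net queue.
     * VIRTIO_NET_HDR_F_NEEDS_CSUM means "checksum not yet computed;
     * fill it in at csum_start + csum_offset" -- exactly the kind of
     * offload state that leaks if an intermediate hop fails to strip
     * and reapply the header. */
    static void describe_hdr(const struct virtio_net_hdr *h)
    {
        if (h->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM)
            printf("checksum still needed at %u + %u\n",
                   (unsigned)h->csum_start, (unsigned)h->csum_offset);
        if (h->gso_type != VIRTIO_NET_HDR_GSO_NONE)
            printf("GSO packet, segment size %u\n", (unsigned)h->gso_size);
    }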
The demonstration that Vijay added to the post was done on Google Container Engine, using Kubernetes. The corrupt packets were injected using netem. We tested a few configurations and were unable to get corrupt packets delivered to a Google Container Engine instance, so I agree with your assessment. Most importantly: it appears that the Container Engine TCP load balancer drops corrupt packets from the public Internet.
However: if someone is using some weird routing or VPN configuration, it might be possible (but this seems unlikely). Notably, I seem to recall that if you send corrupt packets to a Compute Engine instance, they are received corrupted (through the Compute Engine NAT). So if you used your own custom routing to get packets to a Google Container Engine application, this might apply. But again, you would have to really try to make this happen :)
Practically speaking, any traffic from the Internet would not be affected, as we require a valid checksum before doing the Internet->guest NAT (see the GCE docs for what I mean here if this is unclear), but inter-VM traffic can potentially be impacted. On the other hand, inter-VM traffic has other verification applied to it that's stronger than a TCP checksum. Basically: if you try to trigger this by sending bad packets, you might succeed (although even there it's not 100% guaranteed, but the details of what will and won't trigger it delve into implementation arcana that I'm not comfortable sharing).
Also, for those saying that TLS is a panacea: encrypting and/or HMAC'ing all TCP data in and out of a box is operationally ridiculous unless you're in some sort of ultra-high-security environment.
Hardware verification IS performed. For various reasons, the NIC never itself drops packets that are corrupt; packets are instead marked by the hardware as either verified or unverified. When a packet is marked as unverified, the kernel should verify it and potentially reject it before delivery to the application. The bug in the veth driver causes the kernel to treat packets marked unverified as "verified".
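To make the marking scheme concrete, here's a tiny user-space model of that contract (the names mirror <linux/skbuff.h>, but this is an illustration, not kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    /* User-space model of the kernel's checksum contract. */
    enum { CHECKSUM_NONE, CHECKSUM_UNNECESSARY };

    struct packet {
        int  ip_summed;       /* how the NIC marked the packet */
        bool checksum_valid;  /* ground truth, for the demo */
    };

    static bool deliverable(const struct packet *p)
    {
        if (p->ip_summed == CHECKSUM_UNNECESSARY)
            return true;               /* HW already verified it */
        return p->checksum_valid;      /* kernel must verify in software */
    }

    int main(void)
    {
        struct packet corrupt = { CHECKSUM_NONE, false };
        /* The veth bug effectively flipped ip_summed to
         * CHECKSUM_UNNECESSARY here, so the software check was
         * skipped and this packet reached the application. */
        printf("delivered: %d\n", deliverable(&corrupt));
        return 0;
    }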
After this kernel bug was fixed, the TCP checksum did its job and discarded all the corrupt packets.
If your physical network corrupts data, TCP is supposed to notice the checksum mismatch and drop the packet, and wait for it to be retransmitted. Because of this bug, Linux's TCP implementation was not validating checksums, which allowed corrupt data to reach the application.
This requires a faulty physical network, which is rare but nowhere near nonexistent. (The kernel is not introducing corruption to these packets.)
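For reference, the TCP checksum is just the RFC 1071 ones'-complement sum, computed over a pseudo-header plus the segment. A minimal user-space version (function name mine) looks like this:

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071 ones'-complement checksum over a buffer. */
    uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;
        while (len > 1) {
            sum += ((uint32_t)p[0] << 8) | p[1];
            p += 2;
            len -= 2;
        }
        if (len)                       /* odd trailing byte, zero-padded */
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)              /* fold carries back into 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

Being a 16-bit sum, plenty of corruption patterns (e.g. two offsetting flips in the same bit position) slip straight through it, which is why the link-layer CRCs and end-to-end checks discussed elsewhere in this thread still matter.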
Mesos has a workaround like this in it now.
if (skb->ip_summed == CHECKSUM_NONE && rcv->features & NETIF_F_RXCSUM)
    skb->ip_summed = CHECKSUM_UNNECESSARY;
Checksum offloading is encoded in the rcv->features bitmap, so disabling it will hide this bug.
You can do something like this within your container to disable it (from memory, so it might be slightly off):
$ ethtool --offload VETH_DEVICE_NAME rx off tx off
$ ethtool -K VETH_DEVICE_NAME gso off
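If you'd rather check the setting from code than from the shell, the legacy SIOCETHTOOL ioctl can read the same RX-checksum bit (a sketch with minimal error handling; rx_csum_enabled is my name for it):

    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Read the RX-checksum offload bit that `ethtool --offload ... rx`
     * toggles. Returns 1 if enabled, 0 if disabled, -1 on error. */
    int rx_csum_enabled(const char *ifname)
    {
        struct ethtool_value ev = { .cmd = ETHTOOL_GRXCSUM };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ev;
        int rc = ioctl(fd, SIOCETHTOOL, &ifr);
        close(fd);
        return rc < 0 ? -1 : (int)ev.data;
    }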
Luckily, I was able to recreate the problem in a test environment (our secondary backup cluster), allowing me to study it. What I found was that I could reliably send the database cluster into a "bad state" by sending a burst of > ~200 concurrent requests to it. After this, I observed a bi-modal response time distribution, with some requests completing quickly as expected (<10ms) and some taking much longer (consistently ~6s for one particular request). My initial instinct was to blame the database, but some SYN cookie flood warnings in the kernel logs caused me to consider the network as well.
So I started using tcpdump and Wireshark to dig deeper, and found the following: the burst of traffic from the application also caused a burst of traffic between the database cluster nodes, which were performing some sort of result merging. To make things worse, the inter-node requests of the database cluster were using HTTP, which meant a lot of connections were created in the process. Some of these connections were interpreted as a SYN flood by the Linux kernel, causing it to SYN-ACK them with SYN cookies. Additionally, these connections would get stuck with very small TCP windows (as low as 53 bytes) and also suffer from really high ACK latencies (200ms), so a 1600-byte inter-node HTTP request wound up taking 6s! Disabling SYN cookies "fixed" the issue (and so did increasing somaxconn, but that's effectively the same), but despite my best efforts, I was unable to understand why SYN cookies should impact the TCP window.
To make this even more mysterious, this problem only occurred in one of our data centers, and we narrowed it down to the router being the only difference. Replacing the router also "fixed" the issue.
I wish my team had the resources and expertise to debug problems like this down to the kernel, but I was too far out of my depth trying to understand the gnarly code that makes up the Linux TCP SYN cookie and congestion control implementation ... :(
Anyway, I'm posting this in the vague hope that somebody may have seen something similar before, or becomes inspired to go on a kernel bug hunt :).
Additionally, this experience gave me a new appreciation for TCP/IP and how amazing it is that it usually "just works" for me as an application developer. This is not to say that we can't improve upon it, but I think there is a lot to learn from the philosophy and approach that went into designing TCP/IP.
 By "load" I mean bursts of hundreds of concurrent http requests created due the application performing JOIN requests on behalf of the NoSQL database which doesn't provide this feature. My journey of replacing this database with one that's more suited for the task at hand is being written as we speak :).
Did you confirm the packets were the same on both sides of the router? This pattern sounds like it could be network microbursts, which can lead to dropped packets -- maybe the replacement router had a larger buffer, so you didn't drop any packets and things worked better.
I've also seen some exciting bugs with SYN cookies on other platforms; it's possible there's some encoding error leading you to a very small window -- something wrong with the window scaling negotiation, perhaps? If you've got enough incoming SYNs that SYN+ACK state is being dropped (so syncookies are being used to finish the connection), you're also not going to retransmit SYN+ACKs, so then you need to wait for the SYN to be retransmitted.
Window scaling was definitely on my radar, but IIRC the Linux SYN cookie implementation is supposed to encode whether this TCP option is set in the initial SYN as well, so I don't see how this would explain things unless the implementation is broken. I'm not sure if I understand what you're saying about SYN+ACKs not being retransmitted; I think they should be if they get lost. It's only the final ACK from the client that doesn't get retransmitted if it's lost, but that would cause different symptoms.
Yeah, it's supposed to be encoded, but it's possible it's broken. Given that you can reproduce it with a low somaxconn, it seems like the router wasn't the problem -- it just made the problem easier to trigger.
> I'm not sure if I understand what you're saying about SYN+ACKs not being retransmitted; I think they should be if they get lost. It's only the final ACK from the client that doesn't get retransmitted if it's lost, but that would cause different symptoms.
SYN+ACKs will normally be retransmitted, but if you're in SYN-flood conditions, the kernel is going to throw away the state for SYN+ACKs, and then it no longer has the state to retransmit them. It will reconstruct the state if it gets an ACK that is a decodable syncookie, or if it gets a retransmitted SYN, since it'll make a new SYN+ACK for that (although that one may end up with a different sequence number).
Given my knowledge of TCP and SYN cookies, I was not able to come up with another theory for why individual TCP connections would be stuck with a tiny window, because it seems that the entire idea behind congestion control is to increase the window once the congestion is gone ... That's why I think it might be a kernel issue. Given the complexity of TCP, I'd say there is at least an equally high chance that I just don't know enough ;)
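To make the state-loss argument concrete, here's a toy SYN-cookie encoder, loosely after the classic scheme and emphatically not the kernel's code. The entire cookie must fit in the server's 32-bit initial sequence number, so anything not squeezed in there (like the peer's window scale) is forgotten unless it can ride along in the TCP timestamp option:

    #include <stdint.h>

    static uint32_t keyed_hash(uint32_t saddr, uint32_t daddr,
                               uint16_t sport, uint16_t dport, uint32_t t)
    {
        /* stand-in for a real keyed cryptographic hash */
        uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u
                   ^ (((uint32_t)sport << 16) | dport) ^ (t * 3266489917u);
        return h ^ (h >> 15);
    }

    /* Toy cookie: a keyed hash of the 4-tuple plus a coarse time
     * counter, with the low 3 bits carrying an index into a table of
     * canned MSS values. There is no room left for window scale or
     * SACK bits, which is why Linux stashes those in the timestamp
     * option when it can. */
    uint32_t make_syn_cookie(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport,
                             uint32_t client_isn,
                             uint32_t t,         /* coarse time counter */
                             unsigned mss_index) /* 0..7 */
    {
        uint32_t c = keyed_hash(saddr, daddr, sport, dport, t) + client_isn;
        return (c & ~7u) | (mss_index & 7u); /* low 3 bits: MSS slot */
    }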
More info: http://www.inesc-id.pt/pt/indicadores/Ficheiros/165.pdf
I'm putting this in quotes because the machines in question have ridiculous amounts of memory, and the pressure merely came from SOMAXCONN being at its default value of 128.
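For anyone surprised that a default of 128 matters: the backlog you pass to listen() is silently clamped to net.core.somaxconn, so bursts can overflow the queues and trip the SYN-cookie path even on an otherwise idle machine. A sketch (make_listener is my name for it):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* The kernel clamps the listen() backlog to net.core.somaxconn,
     * so this asks for an effective backlog of min(4096, somaxconn);
     * with somaxconn at 128, a burst of ~200 connection attempts can
     * overflow the queue and trigger SYN cookies. */
    int make_listener(unsigned short port)
    {
        struct sockaddr_in sa;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(port);
        if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0)
            return -1;
        if (listen(fd, 4096) < 0)
            return -1;
        return fd;
    }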
This problem only happens when packets are routed from the host to the container. That's the case in Kubernetes, which assigns each container its own IP address; in Docker's IPv6 configuration (same); and in Mesos, which uses Linux Traffic Control rules to share a range of ports with each container.