
How both TCP and Ethernet checksums fail (2015) - mmastrac
http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
======
rwg
At ${PREVIOUS_JOB}, I changed offices and discovered my window wasn't the only
thing I lost in the move: SSH sessions kept dying with MAC errors. It happened
with multiple computers with different NICs and running different OSes, so it
wasn't computer-related. I tried swapping the patch cable, but that didn't
change anything.

On a whim, I took the faceplate off the box containing my (100 Mbps) Ethernet
and phone jacks and discovered both drops were provided over a single four
pair cable originally installed in the late 1980s. More alarmingly, the outer
jacket of the cable had been cut off about a foot from the punchdowns on the
backs of the keystone jacks, and the entire foot of conductors emerging from
the jacket were all untwisted and balled up in the box.

Cutting off about 10 inches of each wire, twisting the pairs back together,
and punching the wires back down onto the backs of the jacks fixed the SSH
problem...

~~~
adrianN
You should have used Mosh /s

~~~
js2
While getting my degree I worked as a sys admin assistant for the CS dept.
This was back in the 10BASE2 days. It wasn't unheard of to find a professor who
had re-arranged his or her office and needed a longer patch cable to connect
their workstation, so he or she found some 75 ohm coax and used that and then
wondered why the networking wasn't so good.

------
throwawayish
I develop backup software among other things and I can tell you that a CRC is
neat, but I've gotten many, many reports where a CRC did not catch data
corruption (but SHA-2 did). Far fewer reports indicated that the CRC caught
it first.

My conclusion: A CRC32 over a couple megs of data is only good enough to
likely catch "gross" corruption, like truncating the data, not finer
corruption like memory and disks tend to do it, especially when we're talking
about bit rot.

Also, unless you're using an optimized CRC implementation (zlib's is not),
e.g. PCLMUL or at least a slice-by-4/8 variant, it's a waste of time,
because e.g. BLAKE2b isn't much slower than zlib CRC, but will catch
All The Corruption!1 [1] (Also, Python 3.6 added the full might of BLAKE2 to
the stdlib, although it's the reference implementation and not some fancy
SSE4/AVX(2) impl.)

[1] A technical note: There is a tradeoff between checksum length and data
length; with a good checksum, increasing its length eventually increases the
error rate over the same medium (aka false positives [the checksum is
corrupted instead of the data]). However, it's usually more important these
days to be certain about correctness and not to minimize total indicated BER
over a certain channel.
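
A minimal sketch of computing both checks side by side, assuming Python 3.6+
where `hashlib.blake2b` is in the stdlib. The buffer contents and the flipped
byte offset are made up for illustration. Note that a single bit flip is a case
CRC-32 is guaranteed to detect, so this only shows how the two values are
computed and compared, not a case where the CRC misses.

```python
import hashlib
import zlib


def checksums(data):
    """Return (CRC-32 as int, BLAKE2b hex digest) for the same buffer."""
    crc = zlib.crc32(data) & 0xFFFFFFFF          # 32-bit check value
    digest = hashlib.blake2b(data).hexdigest()   # 64-byte digest by default
    return crc, digest


# Hypothetical "backup block" used only for this example.
original = b"backup block " * 1000

corrupted = bytearray(original)
corrupted[1234] ^= 0x01  # flip one bit somewhere in the middle

crc_a, blake_a = checksums(original)
crc_b, blake_b = checksums(bytes(corrupted))

print("CRC-32 differs: ", crc_a != crc_b)
print("BLAKE2b differs:", blake_a != blake_b)
```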

~~~
beagle3
I think it's an apple to orange comparison; sha1 needs to be compared to a
proper crc160, and sha2 to a proper crc256.

(Proper means one nontrivial cycle, e.g. an irreducible generating polynomial
of a proper size)

I would be surprised if there's material difference in error detection between
sha2 and crc255 (other than the 1 bit difference).

However, crc is useless against an adversary - preimage attacks are trivial.
At the same size, it becomes an apple to apple comparison, and which one you
choose depends on speed/preimage requirements.

~~~
mikeash
Is it really an apple to orange comparison? CRC can't do everything SHA does,
but SHA can do what CRC does, so it's more like an apple to apple-and-orange
comparison. And that's an easy comparison: pick the second one! You get a free
orange!

Is there something that a proper CRC256 would be better at than SHA-256?

~~~
beagle3
Crc is much simpler; depending on the polynomial, it can be as little as 2 xor
gates and a shift register; if you do your own asic/fpga, I suspect it could
be 10-100 times smaller and faster than sha256. In a modern CPU, you can do
something like 100 bits/clock of CRC, 10-100 times slower for SHA.

The other thing that crc can do that sha256 cannot is act as an error
correcting code of sorts; if your errors are limited to a small subset (say,
only bursts), you can tabulate the crc of those errors, and then (expected-crc
xor computed-crc) is the crc of the error; if it's in your list, you know what
the error is. I had actually worked on a device where this was useful some 20
years ago.
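
A rough sketch of that tabulation trick, assuming the error set is "exactly one
flipped bit" and using `zlib.crc32` as the CRC. The function names are
hypothetical. Because zlib's CRC-32 has a nonzero init and final xor, it is
affine rather than linear, so the error's signature is taken relative to the
CRC of an all-zero message of the same length. It also assumes the expected CRC
itself arrived intact.

```python
import zlib


def build_syndrome_table(length):
    """Map (crc of single-bit error pattern) -> bit index, for `length` bytes."""
    base = zlib.crc32(bytes(length))  # CRC of the all-zero message
    table = {}
    for bit in range(length * 8):
        err = bytearray(length)
        err[bit // 8] ^= 1 << (bit % 8)
        # crc(m ^ e) ^ crc(m) == crc(e) ^ crc(0...0) for equal-length inputs
        table[zlib.crc32(bytes(err)) ^ base] = bit
    return table


def correct_single_bit(received, expected_crc, table):
    """Return the repaired message, or None if the error is not in the table."""
    syndrome = zlib.crc32(received) ^ expected_crc
    if syndrome == 0:
        return bytes(received)        # nothing to fix
    bit = table.get(syndrome)
    if bit is None:
        return None                   # not a single-bit error
    fixed = bytearray(received)
    fixed[bit // 8] ^= 1 << (bit % 8)
    return bytes(fixed)


message = b"sixteen byte msg"
table = build_syndrome_table(len(message))
expected = zlib.crc32(message)

damaged = bytearray(message)
damaged[5] ^= 0x10                    # inject a single-bit error

repaired = correct_single_bit(bytes(damaged), expected, table)
print(repaired == message)
```

The same idea extends to short bursts by tabulating every burst pattern at
every offset, at the cost of a larger table.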

~~~
mikeash
That second bit is pretty cool, I didn't know about that. I actually
experimented a bit with using cryptographic hashes as an error correcting
code, but you have to brute force it, so it's only practical for small amounts
of data and small errors.

------
kozak
This is what worries me about the HTTP "Cache-Control: immutable" proposal.
Immutable caching should come with a reliable way to be sure that the content
that is going to become immutable is not corrupted in the first place.

~~~
fivesigma
Not if it's over HTTPS, since it guarantees data integrity. Corrupted network
bits will cause the connection to fail.

If the file is corrupted at the edge server an SRI attribute [1] can detect
it, although it's not supported on IE and Safari yet.

[1] [https://developer.mozilla.org/en-US/docs/Web/Security/Subres...](https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity)
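
For reference, an SRI `integrity` value is a hash algorithm name, a dash, and
the base64 of the raw digest of the file. A small sketch of computing one,
assuming sha384 and a made-up script body; `sri_value` is a hypothetical helper
name, not part of any library.

```python
import base64
import hashlib


def sri_value(content, algorithm="sha384"):
    """Build an SRI string such as 'sha384-<base64 digest>' for `content`."""
    digest = hashlib.new(algorithm, content).digest()
    return algorithm + "-" + base64.b64encode(digest).decode("ascii")


# Hypothetical file contents, for illustration only.
script_bytes = b"console.log('hello');\n"
integrity = sri_value(script_bytes)

# The value would go into e.g. <script src="..." integrity="...">.
print(integrity)
```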

~~~
m_eiman
Better hope the file isn't corrupted while it's being written to disk (or
while read, or in place, or…).

IMHO, anything immutable should be identified with its SHA2+ hash.

------
avian
Another case I've seen where TCP checksum failed was in old Eee PCs. There,
the TCP checksum check on incoming packets was offloaded to the network
interface. Unfortunately, some Linux driver and/or hardware bug occasionally
corrupted the packet payload _after_ the check passed. The workaround was to
disable the hardware check.

[https://www.tablix.org/~avian/blog/archives/2010/10/eeepc_no...](https://www.tablix.org/~avian/blog/archives/2010/10/eeepc_notworking/)

------
evanj
See the previous Hacker News discussion from October 2015:
[https://news.ycombinator.com/item?id=10360108](https://news.ycombinator.com/item?id=10360108)

I'm glad people are still interested in this subject. :)

------
aftbit
Can we somehow strengthen the TCP checksum? Asking all application-layer
protocols to implement something that the lower layer is supposed to give for
free seems unwise in the long run.

~~~
fidget
The end to end principle says no, you need application level checksums anyway

------
gwern
The end to end principle strikes again.

------
wfunction
I still don't understand how 4-byte CRCs are sufficient. Can someone explain?

~~~
billforsternz
It's a statistical thing I suppose. 2^32 is 4 billion. So the chances of a
garbled packet happening to have the same CRC as your original packet are 1 in
4 billion. Ultimately no system is completely impervious to unlikely freak
collisions. For example even if you duplicated each packet there is still a
chance your original and duplicated packet could freakishly be corrupted in
exactly the same way. Having said all that I suppose a 64 bit CRC would get
you a lot closer to monkeys typing Shakespeare territory.

~~~
hueving
Consider the volume of traffic going over big backbones though (e.g. 100 Gbps).
Let's say one of the devices terminating this backbone has a piece of memory
go bad that mangles IP payloads. At 8.3 Mpps (1500-byte MTU line rate), a 1 in
4 billion chance would trigger on average every 9 minutes or so.

That assumes only random probability of checksum collisions, but it's
interesting to see how the volumes of traffic moved around the Internet now
have made 'rare' events much more frequent.
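
The back-of-the-envelope numbers can be checked with a few lines. This assumes
every packet on the link is corrupted and that each corrupted packet slips past
a 32-bit check with probability 1 in 2^32, which is the idealized random case,
not a measured rate.

```python
LINK_BITS_PER_SECOND = 100e9   # 100 Gbps link
PACKET_BYTES = 1500            # full-size packets at the usual Ethernet MTU

packets_per_second = LINK_BITS_PER_SECOND / (PACKET_BYTES * 8)

# Idealized chance that a corrupted packet still matches a 32-bit check.
undetected_probability = 1 / 2**32

seconds_between_misses = 1 / (packets_per_second * undetected_probability)
minutes_between_misses = seconds_between_misses / 60

print("packets per second: %.2f million" % (packets_per_second / 1e6))
print("minutes between undetected errors: %.1f" % minutes_between_misses)
```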

~~~
phicoh
The Ethernet CRC is designed for packets on the wire. On the wire you have a
bit error rate and sometimes burst errors. For bit errors, as long as no
packet has 32 bit errors the CRC will always catch them. For burst errors, if
the error sets all bits to zero (or to one) then the CRC will catch it.

This is different from random memory corruption.

Note that while a packet is in the memory of a router you typically only have
the TCP checksum to protect the packet, which is a rather weak 16-bit sum. So
it is safe to assume that TCP will fail to detect many failures, and if you
care, add a sha2 hash.

Why sha2? Because while md5 is perfectly fine for random errors, we should
make sure to kill all use of insecure hash functions. Otherwise they
keep popping up in contexts where errors are not random.
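
To illustrate how weak a 16-bit ones' complement sum is, here is a small sketch
of the Internet checksum algorithm (the one TCP uses) applied to a bare payload.
It skips the TCP pseudo-header and the byte strings are made up. Reordering
16-bit words, or adding to one word what you subtract from another, leaves the
sum unchanged, while CRC-32 notices both.

```python
import zlib


def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"               # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original    = b"\x12\x34\xab\xcd\x00\x01\xff\xfe"
swapped     = b"\xab\xcd\x12\x34\x00\x01\xff\xfe"   # first two words exchanged
compensated = b"\x12\x35\xab\xcc\x00\x01\xff\xfe"   # +1 in one word, -1 in another

for name, payload in (("swapped", swapped), ("compensated", compensated)):
    same_sum = internet_checksum(payload) == internet_checksum(original)
    same_crc = zlib.crc32(payload) == zlib.crc32(original)
    print(name, "| same 16-bit sum:", same_sum, "| same CRC-32:", same_crc)
```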

------
woliveirajr
Interesting that it took 4 months to get the kernel bug too.

