The little ssh that sometimes couldn't (2012)

karlshea · on July 10, 2015

Previous discussion: https://news.ycombinator.com/item?id=4709438

foxhill · on July 10, 2015

tldr; single bit flips in a hop to a remote server.

the moral of this story is that the number of layers and abstractions between our code (even our shell scripts - cron jobs in this case) and the network layer is so large that even the most subtle bug in one of them is a massive pain to track down.

i am in awe of the tenacity of these bug hunters.

digi_owl · on July 11, 2015

Another thing is that TCP does not have a facility for reporting what the problem is.

So you basically have to dump signals down the wire and hope something comes out of it.

gpvos · on July 10, 2015

Is there some kind of TCP signal that the kernel could reasonably send back to the originator if it detected packet corruption?

toast0 · on July 11, 2015

There are some ways to coax a retransmission (duplicate acking, maybe selective ack?), but retransmission doesn't really help here, since a given socket was always running through the same route and getting corrupted the same way. I guess an explicit 'got bad data' message would have shown up better in tcpdump though.

derefr · on July 11, 2015

Sounds like a session-layer/presentation-layer sort of thing. TLS or IPsec might have such a protocol message.

gpvos · on July 12, 2015

Because the TCP checksum was incorrect, the packet would never reach a higher level such as TLS. TCP or ICMP would be the only options; maybe IPsec.
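
To make the checksum point concrete, here is a minimal sketch of the 16-bit ones' complement Internet checksum (the algorithm described in RFC 1071, which TCP uses) and what a single flipped bit does to it. The payload string and the helper names `internet_checksum` and `flip_bit` are made up for illustration. A real TCP checksum also covers the TCP header and a pseudo-header built from the IP addresses, which this sketch leaves out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical payload, not taken from the article.
payload = b"example payload crossing a flaky hop"
original = internet_checksum(payload)
damaged = internet_checksum(flip_bit(payload, 42))

# The two values differ, so the receiver's TCP stack discards the segment
# before anything above it (TLS, SSH) gets to look at the data.
print(hex(original), hex(damaged))
```

Any single flipped bit changes this checksum, so the receiving kernel silently drops the segment and the sender only ever sees a missing ACK.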

keeperofdakeys · on July 11, 2015

Besides debugging, would this help though? A corrupt packet is the same as not receiving the (correct) packet at all. It will be retransmitted when no ACK is received. If the problem is ephemeral, it will be resolved on the retransmission. If it's not, timing out the connection is the only course of action.
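
The two outcomes described above (a transient fault that a retransmission fixes, versus a persistent one that ends in a timeout) can be sketched as a toy retry loop. Everything here is hypothetical: `send_with_retries`, the two link functions, and the attempt limit are invented for illustration, and `zlib.crc32` stands in for the integrity check rather than the real TCP checksum.

```python
import zlib


def flip_first_bit(data: bytes) -> bytes:
    """Corrupt data by inverting the lowest bit of its first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]


def transient_link(data: bytes, attempt: int) -> bytes:
    """Corrupts only the first attempt, like a one-off glitch."""
    return flip_first_bit(data) if attempt == 1 else data


def persistent_link(data: bytes, attempt: int) -> bytes:
    """Corrupts every attempt, like a bad hop on a fixed route."""
    return flip_first_bit(data)


def send_with_retries(payload: bytes, link, max_attempts: int = 5):
    """Return the attempt number that got through, or None on give-up."""
    expected = zlib.crc32(payload)
    for attempt in range(1, max_attempts + 1):
        received = link(payload, attempt)
        if zlib.crc32(received) == expected:
            return attempt  # receiver would ACK this copy
        # Bad check value: receiver drops it silently, sender retransmits.
    return None  # sender gives up, which the application sees as a timeout


print(send_with_retries(b"cron job output", transient_link))
print(send_with_retries(b"cron job output", persistent_link))
```

From the sender's side the persistent case looks identical to plain packet loss, which is why nothing in the connection itself points at corruption.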