The little ssh that sometimes couldn't (2012) (naguib.ca)
90 points by jasonmp85 on July 10, 2015 | 8 comments




tl;dr: single-bit flips at one hop on the way to a remote server.

the moral of this story: there are so many layers and abstractions between our code (even our shell scripts, cron jobs in this case) and the network layer that the most subtle bug in any one of them is a massive pain to track down.

i am in awe of the tenacity of these bug hunters.


Another thing is that TCP does not have a facility for reporting what the problem is.

So you basically have to dump signals down the wire and hope something comes out of it.


Is there some kind of TCP signal that the kernel could reasonably send back to the originator if it detected packet corruption?


There are some ways to coax a retransmission (duplicate ACKs, maybe selective ACK?), but retransmissions don't really help here, since a given socket always ran through the same route and kept getting corrupted. I guess an explicit 'got bad data' message would have shown up better in tcpdump though.


Sounds like a session-layer/presentation-layer sort of thing. TLS or IPsec might have such a protocol message.


Because the TCP checksum was incorrect, the packet would never reach a higher level such as TLS. TCP or ICMP would be the only options; maybe IPsec.
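To make the checksum point concrete, here is a minimal sketch of the 16-bit one's-complement checksum in the style of RFC 1071, showing that a single flipped bit changes the result. It is simplified: the real TCP checksum also covers the TCP header and a pseudo-header, and `payload`, `flip_bit` and the bit position are made-up names and values for illustration only.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (hypothetical helper)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


payload = b"example segment payload"      # illustrative data, not from the article
sent_checksum = internet_checksum(payload)

corrupted = flip_bit(payload, 42)          # simulate one bit flipped in transit
checksum_matches = internet_checksum(corrupted) == sent_checksum
print("checksum still matches after bit flip:", checksum_matches)
```

A receiver that sees the mismatch simply discards the segment, which is why nothing above TCP ever gets a chance to complain about it.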


Besides debugging, would this help though? A corrupt packet is the same as not receiving the (correct) packet at all. It will be retransmitted when no ACK is received. If the problem is ephemeral, it will be resolved on the retransmission. If it's not, timing out the connection is the only course of action.
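A toy model of that argument, as a sketch only: the sender learns nothing except "no ACK arrived", so it retries until it either succeeds or gives up. All names here (`send_with_retries`, `ephemeral_fault`, `persistent_fault`, the attempt limit) are hypothetical, and the equality check stands in for the receiver's checksum verification.

```python
from typing import Callable, Optional


def send_with_retries(
    segment: bytes,
    path: Callable[[bytes, int], bytes],
    max_attempts: int = 5,
) -> Optional[int]:
    """Return the attempt number that got through, or None if the sender gives up."""
    for attempt in range(1, max_attempts + 1):
        received = path(segment, attempt)
        if received == segment:  # stand-in for "checksum verified, ACK sent"
            return attempt
        # Receiver drops the bad segment silently; the sender only
        # notices the missing ACK and retransmits.
    return None  # equivalent to the connection timing out


def _flip_last_bit(segment: bytes) -> bytes:
    return segment[:-1] + bytes([segment[-1] ^ 0x01])


def ephemeral_fault(segment: bytes, attempt: int) -> bytes:
    """Corrupts only the first transmission."""
    return _flip_last_bit(segment) if attempt == 1 else segment


def persistent_fault(segment: bytes, attempt: int) -> bytes:
    """Corrupts every transmission, like a bad hop on a fixed route."""
    return _flip_last_bit(segment)


ephemeral_result = send_with_retries(b"hello", ephemeral_fault)
persistent_result = send_with_retries(b"hello", persistent_fault)
print(ephemeral_result, persistent_result)
```

In the persistent case an explicit "bad data" signal would not change the outcome, only how quickly someone looking at a packet capture could tell corruption from loss.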



