TL;DR: single bit flips on one hop to a remote server.
The moral of this story: the number of layers and abstractions between our code (even our shell scripts, cron jobs in this case) and the network layer is so large that even the most subtle bug in one of those layers is a massive pain to track down.
There are some ways to coax a retransmission (duplicate ACKs, maybe selective ACK?), but retransmissions don't really help, since a given socket was always going over the same route and getting corrupted. I guess an explicit 'got bad data' message would have shown up better in tcpdump, though.
Besides debugging, would this help, though? A corrupt packet is the same as not receiving the (correct) packet at all: it will be retransmitted when no ACK is received. If the problem is ephemeral, the retransmission resolves it. If it's not, timing out the connection is the only course of action.
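To make the "corrupt is the same as lost" point concrete, here's a rough sketch of the receiver side. It assumes a bare 16-bit ones' complement checksum in the style of RFC 1071 over the payload only (real TCP also covers the header and a pseudo-header), and the payload, `flip_bit` and `receiver_accepts` are made up for illustration. The point is just that a flipped bit makes the checksum mismatch and the segment gets dropped without any 'got bad data' reply.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Hypothetical helper: return a copy of data with one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def receiver_accepts(data: bytes, checksum: int) -> bool:
    """Simplified receiver: accept on a checksum match, otherwise drop silently."""
    return internet_checksum(data) == checksum


# Made-up payload, standing in for whatever the cron job was sending.
payload = b"0 3 * * * /usr/local/bin/backup.sh\n"
sent_checksum = internet_checksum(payload)
damaged = flip_bit(payload, 42)

print(receiver_accepts(payload, sent_checksum))  # True
print(receiver_accepts(damaged, sent_checksum))  # False
```

In this model the sender never hears about the `False` case. It only notices the missing ACK and retransmits, which doesn't help if the retransmission takes the same bad hop.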