That is an awesome story. If you're in devops, I would suggest you look at the sequence of events, especially the debugging decision tree. You can't always get access to all of the machines, but you can create 'views' by going through them. Sort of like astronomers using a gravitational lens.
We had a similar issue at Blekko, where a 10G switch we were using would not pass a certain bit pattern in a UDP packet fragment. The fragment just vanished. Annoying as heck. The fix was to add random data to the packet on retries so that at least one datagram made it through intact.
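For anyone who wants to try the same trick, here is a rough sketch of what "add random data on retries" could look like. This is not the Blekko code. The framing (one pad-length byte, then the padding, then the payload), the names `frame`, `unframe`, `send_with_retries`, and `MAX_PAD`, and the `b"ACK"` reply are all made up for illustration. Putting the padding in front of the payload is also my assumption; the idea is that each retry puts different bytes on the wire and shifts the payload relative to the fragment boundaries.

```python
import os
import socket

MAX_PAD = 32  # hypothetical upper bound on random padding bytes


def frame(payload: bytes, attempt: int) -> bytes:
    """Build a datagram: [pad_len][random padding][payload].

    The first attempt carries no padding. Each retry gets 1..MAX_PAD
    fresh random bytes, so the bytes on the wire differ between attempts.
    """
    if attempt == 0:
        pad = b""
    else:
        pad_len = 1 + os.urandom(1)[0] % MAX_PAD
        pad = os.urandom(pad_len)
    return bytes([len(pad)]) + pad + payload


def unframe(datagram: bytes) -> bytes:
    """Strip the pad-length byte and the padding, returning the payload."""
    if not datagram:
        raise ValueError("empty datagram")
    pad_len = datagram[0]
    if len(datagram) < 1 + pad_len:
        raise ValueError("truncated datagram")
    return datagram[1 + pad_len:]


def send_with_retries(sock, addr, payload, retries=5, timeout=0.5) -> bool:
    """Send payload over UDP, re-framing with new padding on each retry.

    Assumes the receiver answers with b"ACK" (a hypothetical protocol).
    """
    sock.settimeout(timeout)
    for attempt in range(retries):
        sock.sendto(frame(payload, attempt), addr)
        try:
            ack, _ = sock.recvfrom(16)
            if ack == b"ACK":
                return True
        except socket.timeout:
            continue  # lost somewhere; retry with different padding
    return False
```

The receiver calls `unframe` on whatever arrives, so it never has to know which attempt got through. The price is up to `MAX_PAD + 1` extra bytes per datagram.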