Well, while we're at it, here's my crazy MTU-related war story, although not as crazy as that one!
I was troubleshooting with a user of an audio streaming application running over a LAN. The user could stream classical music but not rock music. Seriously. Classical was fine, but when streaming rock, the connection would drop after a few minutes.
The application took chunks of audio, compressed them with a lossless codec, and then sent each chunk in a separate UDP packet to the other end. It tried to use IPv6 whenever possible because it was generally more reliable in the LAN environment, although it would happily use IPv4 if need be.
After a huge amount of boring troubleshooting going back and forth with this guy, I finally figured it out. Somehow, he had set his network interface's MTU to 1200 bytes. IPv6 won't perform automatic IP-level fragmentation for MTUs below 1280 bytes, so larger packets simply could not be sent at all. The streaming application would try to send an audio packet larger than 1200 bytes, get an error, and bail out of the connection.
Why did it only happen with rock music? Turns out to be pretty simple. Lossless codecs are necessarily variable bitrate, and classical music compresses better than rock music. When streaming classical, each chunk of audio consistently compressed to less than 1200 bytes, but rock music produced occasional packets over the threshold.
The user didn't know why his MTU was turned down and didn't need it, so we turned it back up and everything worked just fine.
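For anyone curious what that failure path looks like from the application's side, here's a minimal sketch (Python, with a placeholder documentation address and port, assuming a Linux host whose interface MTU has been forced below IPv6's 1280-byte minimum): the oversized datagram comes back as an immediate send error rather than being fragmented, and a client that treats that error as fatal behaves exactly like the streaming app did.

    import errno
    import socket

    # 2001:db8::/32 is the IPv6 documentation prefix; the address and port
    # here are placeholders, not anything from the story above.
    DEST = ("2001:db8::1", 5004)

    def send_chunk(sock: socket.socket, chunk: bytes) -> bool:
        """Try to send one compressed audio chunk as a single UDP datagram.

        Returns False when the datagram is too large for the (misconfigured)
        MTU instead of tearing the whole session down, which is roughly the
        fix the application needed."""
        try:
            sock.send(chunk)
            return True
        except OSError as exc:
            if exc.errno == errno.EMSGSIZE:
                # A variable-bitrate chunk compressed poorly and exceeded the
                # interface MTU; a robust client would split or re-encode it.
                return False
            raise

    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.connect(DEST)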
I love this sort of insane-sounding problem description. Here's one that happened to me: WiFi disconnects when I visit Gmail, and doesn't reconnect until I reboot Linux. Cause: Gmail's chat thingy uses Flash, which for some reason does some webcam initialization, which triggers a bug (in combination with crappy hardware) in the uvcvideo kernel module, which leads to a timeout, which leads to the whole USB bus going down. Which includes the WiFi chip.
Well, back in the day I did a lot of network management on ATM and frame-relay networks and MTU-related problems were quite common there. It left me with a habit of always checking packets of different sizes and looking closely for fragmentation-related issues (also ICMP filtering).
But this one is different -- it's not just the packet size that is the problem, it's that certain packets above a certain size get corrupted. Much more difficult to trace, so kudos to the authors. In fact, it seems I was bitten by the exact same problem, but could not trace it down, as I checked and ICMP packets of various sizes passed OK.
Having a smaller MTU is fine. The problem comes when people start blocking the ICMP fragmentation needed packets, presumably due to some assumption that this will in some way help with security.
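For what it's worth, here's a rough sketch of watching path MTU discovery do its job (or fail to) from user space; it's Linux-specific, the socket option values are the numeric ones from <linux/in.h>, and the host/port are placeholders. If routers on the path can send ICMP "fragmentation needed" back to you, the kernel's cached value drops to the real path MTU; if someone filters those messages, the value never shrinks and the large packets simply black-hole.

    import socket

    # Linux-specific option values from <linux/in.h>.
    IP_MTU_DISCOVER = 10
    IP_PMTUDISC_DO = 2      # always set the DF bit; never fragment locally
    IP_MTU = 14             # read back the path MTU the kernel has cached

    def probe_path_mtu(host: str, port: int = 33434) -> int:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect((host, port))   # IP_MTU is only readable on a connected socket
        for size in (1472, 1400, 1200, 1000, 576):
            try:
                s.send(b"\x00" * size)   # EMSGSIZE once a smaller path MTU is learned
            except OSError:
                pass
            # A real probe would pause here to give ICMP feedback time to arrive.
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)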
This is insane. The closest scenarios to this I've seen in my career:
1) A private frame relay network that one day stopped passing packets over a certain size. Worked around by lowering the MTU at both ends until I was able to convince the frame relay provider that yes, the problem was in their network. This was relatively straightforward to diagnose, but it was still odd being able to ssh into a box, then have the connection hang once I did something that sent a full-size packet (cat a large file, ls -l in a big directory, etc.).
2) A paging gateway program I wrote (email to SMS) that worked fine when testing on my Mac, but couldn't establish connections to a particular Verizon web site when I ran it from a Linux box. Turned out that the Linux TCP stack had ECN enabled and at the time the Verizon website was behind a buggy firewall that blocked any packets with ECN bits set.
3) A Solaris box that could randomly be connected to, but not always. Turned out someone had deleted its own MAC address from its ARP table (yes, you can do this with Solaris) so it wasn't replying to ARP packets for itself. As I recall, it could make outbound connections, and then you could connect to it from that same peer until the peer timed out the ARP entry. Then the peer couldn't reach the Solaris box again.
None of these are nearly as complex as the scenario in this story.
So, I was working at a very small internet service provider in a rural area in the mid-nineties. For lack of affordable hardware, we were using Linux machines for routing, and a lot of "unconventional" solutions were necessary to work around the underpowered gear. Tunnelling and other virtual interfaces of all kinds were used often.
I remember one particular case where we were running both routed IP and bridged Ethernet over a single frame-relay link, and we had to resort to a fixed Ethernet-to-IP mapping (turning off ARP) on the bridged link for some reason I really can no longer remember.
> How do you attain the skills required to do this while not also learning not to?
Upvoted for being one of the greatest ways I've ever seen to put what is a VERY common problem. I will quote this mercilessly in the future, if I may. Thanks.
Reminds me of a problem I had with a T1 circuit corrupting packets.
Shortly after bringing up a second T1 into a remote location we discovered that some web pages would show broken JPG images at the remote site.
Some troubleshooting revealed that this only happened when traffic was routed over the new T1. The old T1 worked just fine. Pings, and other IP traffic seemed to work over either line but we kept seeing the broken image icon for some reason when traffic came over the new T1.
We tried several times to confirm with the telco that the T1 was provisioned correctly and that our equipment matched those telco parameters. Still had some mangled bits going over that new T1.
Finally had the telco check the parameters on every span in the new (long-distance) T1 circuit, and they eventually found one segment that was configured for AMI instead of B8ZS (if I remember correctly; it was certainly a misconfigured segment, in any case).
The net result is that certain user-data patterns that didn't include sufficient 0/1 transitions would lead to loss of clock synchronization over that segment and corrupted packets. Those patterns were most likely to occur in JPGs.
Once they corrected the parameters on that segment, everything worked as expected.
Quite a bit of head scratching with that one and lots of frustration as the layer-1 telco culture just couldn't comprehend that layer-2/3 Internet folks could accurately diagnose problems with their layer-1 network.
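A toy way to see why the payload mattered (my own numbers and threshold, not the telco's): AMI has no zero substitution, so the line relies on the data itself carrying enough ones to keep the receiver's clock locked, whereas B8ZS swaps every run of eight zeros for a recognizable bipolar-violation pattern. Something like this flags the kind of data, long zero runs, with a /dev/zero stream as the extreme case, that an AMI-provisioned span chokes on:

    def longest_zero_run(data: bytes) -> int:
        """Length of the longest run of consecutive 0 bits in the payload."""
        run = best = 0
        for byte in data:
            for i in range(7, -1, -1):      # scan bits MSB first
                if byte & (1 << i):
                    run = 0
                else:
                    run += 1
                    best = max(best, run)
        return best

    # 8+ consecutive zero bits is exactly the case B8ZS substitutes for and
    # plain AMI does not, so a block like this starves an AMI span of transitions.
    print(longest_zero_run(b"\x00" * 32))   # 256
    print(longest_zero_run(b"\xaa" * 32))   # 1: alternating bits, no problem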
One of my fellow ISP admins (silicon.net, I was elite.net) had the same problem around 1995. GIFs would load on web pages but not JPEGs. They'd load for some amount of time and then hang.
The way he diagnosed it was to do a transfer of /dev/zero out one link. It worked. But it stopped almost immediately out the problem link. It turned out to be the same problem -- no zero bit stuffing configured on the line.
By the way, this same technique, known as "weak bits", was used as a floppy and CD/DVD copy protection scheme.
That is an awesome story. If you're in devops I would suggest you look at the sequence of events, especially the debugging decision tree. You can't always get access to all of the machines but you can create 'views' by going through them. Sort of like astronomers using a gravitational lens.
We had a similar issue at Blekko where a 10G switch we were using would not pass a certain bit pattern in a UDP packet fragment. Just vanished. Annoying as heck, the fix was to add random data to the packet on retries so that at least one datagram made it through intact.
On a related note, I used to be the person who got to go see customers who had problems with our software that the support desk couldn't solve. This often meant one- or two-day trips to glamorous industrial estates on the edge of various cities all around the world.
About 3 visits in a row I went to look at problems (core dumps or errors) that the customer could reproduce at will, only for them to be unable to replicate the problem with me present on site.
I sat at one customer (in sunny Minneapolis) for 2 hours in the morning with the customer getting increasingly baffled as to why he couldn't get it to fail; it had been happily failing for him the previous evening when I was talking to him on the 'phone. We gave up and went for lunch (mmm, Khan's Mongolian Barbeque). A colleague of his called him midway through lunch to tell him that the software was failing again. Excellent I thought, we'll finally get to the bottom of it. Back to their office and ... no replication; it was working fine.
As a joke I said I should leave a clump of my hair taped to the side of the E450 it was running on. The customer took me up on that offer and, as far as I know (definitely for a few years at least), the software ran flawlessly at that customer.
It's the closest I've got to a "'more magic' switch" story of my own.
This is why good DevOps people are worth their weight in platinum. As a developer who has done just enough administration to be dangerous, I can easily say that my job is always far more enjoyable when there are good DevOps folks around to keep my systems happy and shield me from the crazy place that is the Internet's wiring.
The striking thing about this story is that even after the problem was solved by re-routing traffic around the bad hardware, the author continued to investigate until the ultimate cause was tracked down. This almost obsessive desire to understand the true causes of problems (whether they be related to operations, software development, or whatever) is one of the things that makes people really good at what they do.
The numbers in this story do not add up. For a connection to be made within 3ms, there needs to be a round-trip within that time, which reduces the maximum possible radius by half.
The 'Current' travels along a wire at anything between 97% and 60% of c (depending on insulation; more insulation == slower).
Individual electrons in copper wire travel much more slowly. The 'drift velocity' is roughly proportional to the voltage; for low-voltage DC it is on the order of millimeters per hour.
With AC, individual electrons have no net movement at all, since they just oscillate back and forth with the alternating current.
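As a back-of-envelope check on that "millimeters per hour" figure (with my own assumed numbers: a 10 mA signal current in a 1 mm^2 copper conductor), the drift velocity is v = I / (n * e * A):

    # Assumed example values, not taken from the comment above.
    I = 0.010          # current in amperes (10 mA signal current)
    A = 1e-6           # conductor cross-section in m^2 (1 mm^2)
    n = 8.5e28         # free electrons per m^3 in copper
    e = 1.602e-19      # elementary charge in coulombs

    v = I / (n * e * A)        # drift velocity in m/s
    print(v * 3600 * 1000)     # ~2.6 mm per hour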
A dev once submitted code and broke our build; when we looked at what he had submitted, there appeared to be random syntax errors in it. On his workstation, however, the code was correct.
We tracked it down to a switch that was corrupting packets enough that the TCP checksum wasn't sufficient protection, and the packets would simply pass their checksum despite having been altered.
The outcome was that we now always use compression, or encryption, as an added layer of protection.
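To make the "checksum wasn't sufficient" part concrete, here's a toy demonstration (not the actual corruption pattern that switch produced): the TCP/IP checksum is a 16-bit ones'-complement sum, so any corruption that adds to one 16-bit word what it subtracts from another slips straight through.

    import struct

    def inet_checksum(data: bytes) -> int:
        """RFC 1071 ones'-complement checksum over 16-bit words."""
        if len(data) % 2:
            data += b"\x00"
        total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
        while total >> 16:                       # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    original  = bytes.fromhex("0102030405060708")
    corrupted = bytes.fromhex("0103030405060707")   # one word +1, another -1
    assert original != corrupted
    assert inet_checksum(original) == inet_checksum(corrupted)

A compressed or encrypted stream adds a much stronger integrity check on top (gzip's CRC32, or an encryption MAC), which is why that workaround catches what the 16-bit checksum misses.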
Hey, I had this problem! Exactly the same symptoms, although I never got as far as dumping actual packet contents. But I did verify that packet loss (of various sizes) was not the culprit. It was SSH (and some monitoring TCP connections) that failed (hung), always precisely at the same moment.
I suspected the VM code at the time, but it is very likely that my packets had to go through the same router (geography would support this).
I'm so glad somebody debugged this problem. Also, I'm quite glad that at least this time I'm not the only person with a weird issue (I have a knack for breaking things).
Off-topic: can someone provide a good reason why SSH w/ the HPN patches is not the default for every SSH install on every platform?
Today, people are relying on SSH for binary transfer more than ever. SFTP and SCP are the new de facto standards for machine-to-machine file transfer over a secured connection. Source control systems like Git (or even SVN) make heavy use of binary transfers over SSH. The performance benefit to the entire world is immeasurable. Yet unless you explicitly go out of your way to manually compile and install SSH-HPN, you don't get it.
That said, given how slow SSH is on Windows (Git pushes and pulls are dramatically slower than on *nix or OS X), does anyone have a good link to a PuTTY HPN build?
I have to admit I felt pretty ignorant for not knowing what you were talking about. So, for anyone else in a similar situation:
http://www.psc.edu/index.php/hpn-ssh
Every host it's installed on has to be properly tuned. Fine for large setups where finely tuned TCP stacks are the norm and maintaining your own ssh isn't much overhead, probably not fine for most setups where the 2MB buffer does the job.
"To compute the BDP, we need to know the speed of the slowest link in the path and the Round Trip Time (RTT)". Do you know the slowest link in the path for everything you want to conceivably connect to?
The patches were an exercise in trying to max out high bandwidth connections using scp under ideal lab conditions, nothing more.
Even when enabled, it only disables encryption for the bulk data transfer; TTY input remains encrypted. Obviously many times that is not an option, but sometimes it is.
So you are not worried about the confidentiality of your data?
Leaving aside the crypto worries and concerns over why it was not merged upstream, can you imagine being the Debian package maintainer? Having to manage and triage bug reports with two upstreams? And then having to keep track of whether the bug occurred when HPN initiated a connection to pristine upstream, pristine connected to HPN, or HPN connected to HPN? If you want to get an idea of the headache involved, search site:debian.org ssh hpn.
Have you read why upstream never merged it? The pleas for funding and lack of maintainer time do not give you cause for concern?
My first inclination would be a firmware problem: was it upgraded recently? Are there any known problems with the version that was installed? Did you build it yourself? If we have access to the code, the information about the 'shape' of 576 may come in handy. Or maybe we just need to look at your build environment.
Assuming the firmware had not changed recently (very likely in gear that sits quietly doing its job without human intervention for long periods of time), then failing hardware becomes the suspect. Maybe a memory module is going bad and this particular byte is normally avoided (see intermittent failures in OP.) Maybe it's using flash to store transitory data and a particular cell is going bad. Maybe the unit has suffered vibration damage and a solder point related to memory has come loose.
Some of these are far less likely than others. I think of all these, I'd put money on a bad memory stick.
576 bytes is the minimum datagram size that must be accepted by all nodes in an IPv4 internet.
In practice, on the modern Internet, you'll see tons of larger packets. So who knows; perhaps some massively outdated optimization in an ASIC somewhere resulted in different hardware paths for packets over 576 bytes.
I'd be pissed at a transit provider who mangled packets like this.
The more ambiguous situation is that early Juniper routers would fairly frequently re-order packets. That's nominally allowed, but a lot of protocols didn't like it.
There are way weirder things on satellite or other networks (spoofing acks, etc.).
Great story - I've had MTU and firewall fun before but nothing so subtly treacherous.
I've been wondering about something not entirely unrelated that we see sporadically from a small but widespread number of users. We serve deep zoom images, and the client appears to run normally but sends malformed image tile requests - e.g. in the URLs, "service" is consistently garbled as "s/rvice" and "dzi" as "d/i". I've seen this from IPs on every continent and from user agents for most common browsers, as well as both iOS and Android. My current theory is that it's some sort of tampering net filter, as a fair number of the IPs have reverse DNS / Whois info suggesting educational institutions, but I have thus far failed to confirm this, particularly since none of the users have contacted us.
I had a similar problem, less hairy, involving a bad bit in a disk drive's cache RAM. Took a day or so to figure out a solid repro.
Stuff like this does happen. Handling bit errors in consumer electronics storage systems is an interesting problem, and one that I'd love to see more attention paid to.
I'm curious how the kernel was able to diagnose that a single bit always being set in the 15th or 16th byte of the packet was corruption. That sounds like some intense algorithmic profiling, especially if it's being applied to every packet.
The kernel is rejecting packets that fail the checksum verification. It just happens that the source of the bit errors was behaving in this predictable way.
So this is one of those times that I find myself torn between labels.
A little background... I was brought up in the network ranks: I worked as a network / sys admin in high school, ended up working for an ISP as a junior network engineer in college (while attending one of the first Cisco NetAcad baccalaureate programs, which was a combo of network study and Cisco curriculum and certifications), and have gone on to work in every major vertical over the past 10+ years since: government, finance, healthcare, retail, telecom, etc. I always tell clients and potential employers that having a network background gives me somewhat of an edge in the industry I primarily focus on, security, and I still study for Juniper and Cisco tests and work on labs just to stay current. Most software devs and security folks I've run into (keeping in mind there are a lot of really good folks who have a better grasp of networking than many seasoned engineers do) tend to be overconfident that they truly understand IP from a debugging and troubleshooting standpoint.
Case in point: I interviewed for a "Network Architect" position with a very well known online backup company (think top 4). The interview process was the most bizarre I've ever had, not because it spanned more than 5 interviews, but because every time they posed a complex network problem, it was generally solvable within 5 to 10 minutes of pointed questions. The software dev interviewing me was baffled at how I could reach, that quickly, a reasonable solution to problems that had in some cases taken them over a week - and it was pretty simple: 1) I've seen something similar before, and 2) that's what I have studied and still have a passion for over the course of 20+ years (since I found the Internet in 1991).
Most of the time when I run across a "magical" problem, it's because someone hasn't looked at it from L1 up. As this article showcases, you generally have two generic angles to approach the stack from: application back down to physical, or the inverse. Having been in network support, I know that by the time you get a problem like this it's often so distorted by crazy outliers that have nothing to do with the real problem that your best bet is to start at L1 and work back up through the stack. Reading into the problem the author describes, I think some key data was missed and/or misinterpreted. There surely would have been key indicators in the TCP checksum errors, and that was glossed over pretty lightly in the explanation - but it's interesting how those items of interest are often cast aside when digging into something like this. Nobody in this thread has pointed out that a bit error rate test, or even something as simple as iperf, would have been able to more accurately showcase/reproduce the problematic network condition.
But back to the labels remark: I don't believe, as some people have said, that this is largely a DevOps role. I don't mean to cut down on DevOps folks; at some level, if you're a jack-of-all-trades in an org, then that's your role, and it is what it is. However, this is a problem best suited to a professional network engineer - and you don't see much of that need in the startup space until people start dealing with actual colo / DC type environments; otherwise the network is often very simple and not architected with significant depth or for specific use cases.
Long story short: network professionals are worth the money for designing, building, and fixing issues that may seem complex to others but can be found and solved in minutes when you know what you're looking at. That being said, I'm impressed that the OP dug in far enough to get to the point where he could ask a specific person (who was probably a network engineer / tech of some level) to validate and fix his claim.