I was troubleshooting with a user of an audio streaming application running over a LAN. The user could stream classical music but not rock music. Seriously. Classical was fine, but when streaming rock, the connection would drop after a few minutes.
The application took chunks of audio, compressed them with a lossless codec, and then sent each chunk in a separate UDP packet to the other end. It tried to use IPv6 whenever possible because it was generally more reliable in the LAN environment, although it would happily use IPv4 if need be.
After a huge amount of boring troubleshooting going back and forth with this guy, I finally figured it out. Somehow, he had set his network interface's MTU to 1200 bytes. IPv6 requires every link to support an MTU of at least 1280 bytes and performs no automatic IP-level fragmentation below that, so larger packets simply could not be sent at all. The streaming application would try to send an audio packet larger than 1200 bytes, get an error, and bail out of the connection.
Why did it only happen with rock music? Turns out to be pretty simple. Lossless codecs are necessarily variable bitrate, and classical music compresses better than rock music. When streaming classical, each chunk of audio consistently compressed to less than 1200 bytes, but rock music produced occasional packets over the threshold.
The user didn't know why his MTU was turned down and didn't need it, so we turned it back up and everything worked just fine.
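The variable-bitrate effect is easy to demonstrate with any lossless compressor. Here's a sketch using DEFLATE as a stand-in for an audio codec: repetitive data (like the predictable structure in classical passages) compresses under the story's 1200-byte budget, while noise-like data (closer to dense rock recordings) doesn't.

```python
import os
import zlib

def compressed_size(chunk: bytes) -> int:
    """Size of a chunk after lossless (DEFLATE) compression."""
    return len(zlib.compress(chunk, 9))

MTU_BUDGET = 1200  # the story's lowered MTU

# Repetitive data (a stand-in for easily compressed audio) shrinks a lot...
easy = bytes(range(256)) * 16   # 4096 bytes, highly predictable pattern
# ...while noise-like data (a stand-in for dense audio) barely shrinks at all.
hard = os.urandom(4096)

print(compressed_size(easy), compressed_size(hard))
```

With a lossless codec the output size is entirely at the mercy of the input, so "works for classical, fails for rock" is exactly what a hard per-packet size limit produces.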
But this one is different -- it's not just the packet size that is the problem, it's that certain packets above a certain size get corrupted. Much more difficult to trace, so kudos to the authors. In fact, it seems I was bitten by the exact same problem, but could not trace it down, as I checked and ICMP packets of various sizes passed OK.
1) A private frame relay network that one day stopped passing packets over a certain size. Worked around by lowering the MTU at both ends until I was able to convince the frame relay provider that yes, the problem was in their network. This was relatively straightforward to diagnose, but it was still odd being able to ssh into a box, then have the connection hang once I did something that sent a full-size packet (cat a large file, ls -l in a big directory, etc.).
2) A paging gateway program I wrote (email to SMS) that worked fine when testing on my Mac, but couldn't establish connections to a particular Verizon web site when I ran it from a Linux box. Turned out that the Linux TCP stack had ECN enabled and at the time the Verizon website was behind a buggy firewall that blocked any packets with ECN bits set.
3) A Solaris box that could randomly be connected to, but not always. Turned out someone had deleted its own MAC address from its ARP table (yes, you can do this with Solaris) so it wasn't replying to ARP packets for itself. As I recall, it could make outbound connections, and then you could connect to it from that same peer until the peer timed out the ARP entry. Then the peer couldn't reach the Solaris box again.
None of these are nearly as complex as the scenario in this story.
- Is there ever a valid reason to do this?
- How do you attain the skills required to do this while not also learning not to?
I remember one particular case where we were running both routed IP and bridged Ethernet over a single frame-relay link, and there we had to resort to a fixed Ethernet-to-IP mapping (turning off ARP) on the bridged link, for some reason I really can no longer remember.
Upvoted for being one of the greatest ways I've ever seen to put what is a VERY common problem. I will quote this mercilessly in the future, if I may. Thanks.
Half-understood StackOverflow answers, natch.
Shortly after bringing up a second T1 into a remote location we discovered that some web pages would show broken JPG images at the remote site.
Some troubleshooting revealed that this only happened when traffic was routed over the new T1. The old T1 worked just fine. Pings, and other IP traffic seemed to work over either line but we kept seeing the broken image icon for some reason when traffic came over the new T1.
We tried several times to confirm with the telco that the T1 was provisioned correctly and that our equipment matched those telco parameters. Still had some mangled bits going over that new T1.
Finally had the telco check the parameters over every span in the new (long-distance) T1 circuit and they eventually found one segment that was configured for AMI instead of B8ZS (if I remember correctly; it was certainly a misconfigured segment, in any case).
The net result is that certain user-data patterns that didn't include sufficient 0/1 transitions would lead to loss of clock synchronization over that segment and corrupted packets. Those patterns were most likely to occur in JPGs.
Once they corrected the parameters on that segment, everything worked as expected.
Quite a bit of head scratching with that one and lots of frustration as the layer-1 telco culture just couldn't comprehend that layer-2/3 Internet folks could accurately diagnose problems with their layer-1 network.
The way he diagnosed it was to do a transfer of /dev/zero out one link. It worked. But it stopped almost immediately out the problem link. It turned out to be the same problem -- no zero bit stuffing configured on the line.
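That's why /dev/zero was such an effective probe. A quick sketch that measures the longest run of zero bits in a payload makes the mechanism visible: AMI without zero substitution can lose clock on long zero runs, whereas B8ZS breaks up any run of eight or more zeros.

```python
def longest_zero_bit_run(data: bytes) -> int:
    """Length of the longest run of consecutive 0 bits in a byte string."""
    longest = run = 0
    for byte in data:
        for i in range(7, -1, -1):      # MSB first, as sent on the wire
            if (byte >> i) & 1:
                run = 0
            else:
                run += 1
                longest = max(longest, run)
    return longest

# /dev/zero is the worst case: nothing but zeros, so no clock transitions.
print(longest_zero_bit_run(b"\x00" * 4))   # 32
# Alternating bits give the line plenty of transitions.
print(longest_zero_bit_run(b"\xAA\xAA"))   # 1
```

JPEG data hits long zero runs often enough to trip the misconfigured span, which is why the failure showed up as broken images rather than as a dead line.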
By the way, this same technique, known as "weak bits", was used as a floppy and CD/DVD copy protection scheme.
We had a similar issue at Blekko where a 10G switch we were using would not pass a certain bit pattern in a UDP packet fragment. Just vanished. Annoying as heck, the fix was to add random data to the packet on retries so that at least one datagram made it through intact.
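A minimal sketch of that retry trick (the helper name and the fixed pad length are my assumptions, not Blekko's code; the real receiver would need the true payload length from an application-level header, which is not shown):

```python
import os

def pad_for_retry(payload: bytes, attempt: int, pad_len: int = 16) -> bytes:
    """On retries, append random bytes so each retransmission carries a
    different bit pattern; a receiver that knows the real payload length
    strips the padding."""
    if attempt == 0:
        return payload                     # first try: send as-is
    return payload + os.urandom(pad_len)   # retries: shift the bit pattern

datagram = b"\xde\xad\xbe\xef" * 32
retries = [pad_for_retry(datagram, n) for n in range(3)]
```

The idea is simply that if the switch eats one specific pattern, making each retry's bits different gives at least one copy a chance of getting through intact.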
About 3 visits in a row I went to look at problems (core dumps or errors) that the customer could reproduce at will, only for them to be unable to replicate the problem with me present on site.
I sat at one customer (in sunny Minneapolis) for 2 hours in the morning with the customer getting increasingly baffled as to why he couldn't get it to fail; it had been happily failing for him the previous evening when I was talking to him on the 'phone. We gave up and went for lunch (mmm, Khan's Mongolian Barbeque). A colleague of his called him midway through lunch to tell him that the software was failing again. Excellent I thought, we'll finally get to the bottom of it. Back to their office and ... no replication; it was working fine.
As a joke I said I should leave a clump of my hair taped to the side of the E450 it was running on. The customer took me up on that offer and, as far as I know (definitely for a few years at least), the software ran flawlessly at that customer.
It's the closest I've got to a "'more magic' switch" story of my own.
For electricity: Paraphrasing https://en.wikipedia.org/wiki/Speed_of_electricity
The current travels along a wire at anywhere between 60% and 97% of c (depending on insulation; more insulation means slower).
Individual electrons in copper wire travel much more slowly. Their 'drift velocity' is roughly proportional to the current; for low-voltage DC it is on the order of millimeters per hour.
With AC, individual electrons have no net movement at all, since they just oscillate back and forth with the alternating current.
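As a rough sanity check on those orders of magnitude, the drift velocity follows from v = I / (n * A * q). The numbers below are illustrative (a 10 mA current in a 2 mm^2 copper conductor), not taken from the Wikipedia article:

```python
# Drift velocity v = I / (n * A * q) in a copper conductor
I = 0.01        # current in amperes (10 mA, a low-current DC case)
n = 8.5e28      # free-electron density of copper, per m^3
A = 2.0e-6      # wire cross-section in m^2 (a 2 mm^2 conductor)
q = 1.602e-19   # elementary charge in coulombs

v = I / (n * A * q)              # metres per second
mm_per_hour = v * 1000 * 3600
print(f"{mm_per_hour:.2f} mm/hour")
```

That works out to roughly 1.3 mm/hour, consistent with the "millimeters per hour" figure above.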
There are many more. Unless specifically noted, they all happened to me at some point.
* F-Secure's blog and some other AV companies' blogs
* Raymond Chen's The Old New Thing
* Anything written by Mark Russinovich
* Slightly off topic, but http://www.devttys0.com/ manages some crazy reverse engineering feats, which I feel are similar to this story.
We tracked it down to a switch that was corrupting packets enough that the TCP checksum wasn't sufficient protection, and the packets would simply pass their checksum despite having been altered.
The outcome was that we now always use compression, or encryption, as an added layer of protection.
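For illustration, here is a sketch of the 16-bit Internet checksum (RFC 1071) that TCP uses, showing one class of corruption it cannot catch: reordered 16-bit words. (I don't know what that switch actually did to the packets; this just demonstrates how weak the checksum is.)

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length data
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                   # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

good = b"\x12\x34\x56\x78"
bad  = b"\x56\x78\x12\x34"   # two 16-bit words swapped: corrupted payload
print(internet_checksum(good) == internet_checksum(bad))  # True: undetected
```

Because the sum is commutative, any reordering of 16-bit words (and various complementary bit flips) slips through, which is why an extra integrity layer like compression or encryption catches corruption the checksum misses.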
I suspected the VM code at the time, but it is very likely that my packets had to go through the same router (geography would support this).
I'm so glad somebody debugged this problem. Also, I'm quite glad that at least this time I'm not the only person with a weird issue (I have a knack for breaking things).
Today, people rely on SSH for binary transfer more than ever. SFTP and SCP are the de facto standards for machine-to-machine file transfer over a secured connection. Source control systems like Git (or even SVN) make heavy use of binary transfers over SSH. The performance benefit to the entire world is immeasurable. Yet unless you explicitly go out of your way to manually compile and install SSH-HPN, you don't get it.
That said, given how slow SSH is on Windows (Git pushes and pulls are dramatically slower than on *nix or OS X), does anyone have a good link to a PuTTY HPN build?
Every host it's installed on has to be properly tuned. Fine for large setups where finely tuned TCP stacks are the norm and maintaining your own ssh isn't much overhead, probably not fine for most setups where the 2MB buffer does the job.
"To compute the BDP, we need to know the speed of the slowest link in the path and the Round Trip Time (RTT)". Do you know the slowest link in the path for everything you want to conceivably connect to?
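For reference, the bandwidth-delay product from that quote is just link speed times round-trip time, and it shows why a fixed 2 MB buffer is fine for most paths but not for fat long-haul ones:

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep a path full."""
    return link_bps / 8 * rtt_s

# A 1 Gbit/s path with 50 ms RTT needs ~6.25 MB in flight -- well past a
# fixed 2 MB window -- while a 100 Mbit/s path at the same RTT needs ~625 KB.
print(bdp_bytes(1e9, 0.050))    # 6250000.0
print(bdp_bytes(100e6, 0.050))  # 625000.0
```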
The patches were an exercise in trying to max out high bandwidth connections using scp under ideal lab conditions, nothing more.
Even when enabled, it only skips encryption for the bulk data stream; TTY input remains encrypted. Obviously many times that is not an option, but sometimes it is.
Leaving aside the crypto worries and concerns over why it was not merged upstream, can you imagine being the Debian package maintainer? Having to manage and triage bug reports with two upstreams? And then having to keep track of whether the bug occurred when HPN initiated a connection to pristine upstream, pristine connected to HPN, or HPN connected to HPN? If you want to get an idea of the headache involved, search "site:debian.org ssh hpn".
Have you read why upstream never merged it? The pleas for funding and lack of maintainer time do not give you cause for concern?
576 decimal looks like this in other common bases:
binary: 0000 0010 0100 0000
hex: 0x240
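The conversions are easy to double-check, and they show the pattern that makes a hardware fault plausible: only two bits are set.

```python
n = 576
print(bin(n))   # 0b1001000000
print(hex(n))   # 0x240
# 576 = 512 + 64: just bits 9 and 6 are set, so a single stuck or
# flipped bit in a size field lands exactly on a "round" value like this.
```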
Assuming the firmware had not changed recently (very likely in gear that sits quietly doing its job without human intervention for long periods of time), then failing hardware becomes the suspect. Maybe a memory module is going bad and this particular byte is normally avoided (see intermittent failures in OP.) Maybe it's using flash to store transitory data and a particular cell is going bad. Maybe the unit has suffered vibration damage and a solder point related to memory has come loose.
Some of these are far less likely than others. I think of all these, I'd put money on a bad memory stick.
In practice, on the modern Internet, you'll see tons of packets larger. So who knows; perhaps some massively outdated optimization in an ASIC somewhere resulted in different hardware paths for >576 bytes.
The more ambiguous situation is that early Juniper routers would fairly frequently re-order packets. That's nominally allowed, but a lot of protocols didn't like it.
There are way weirder things on satellite or other networks (spoofing acks, etc.).
I've been wondering about something not entirely unrelated that we see sporadically from a small but widespread number of users. We serve deep zoom images, and the client appears to run normally but sends malformed image tile requests -- e.g. in the URLs, "service" is consistently garbled as "s/rvice" and "dzi" as "d/i". I've seen this from IPs on every continent and from user agents for most common browsers, as well as both iOS and Android. My current theory is that it's some sort of tampering net filter, since a fair number of the IPs have reverse DNS / Whois info suggesting educational institutions, but I have thus far failed to confirm this, particularly since none of the users have contacted us.
I had a similar problem, less hairy, involving a bad bit in a disk drive's cache RAM. Took a day or so to figure out a solid repro.
Stuff like this does happen. Handling bit errors in consumer electronics storage systems is an interesting problem, and one that I'd love to see more attention paid to.
A little background... I was brought up in the network ranks: I worked as a network / sys admin in high school, then for an ISP as a junior network engineer while in college (at one of the first Cisco NetAcad baccalaureate programs, a combo of network study plus Cisco curriculum and certifications), and since then I've worked in every major vertical over the past 10+ years: government, finance, healthcare, retail, telecom, etc. I always tell clients and potential employers that having a network background gives me somewhat of an edge in the industry I primarily focus on, security, and I still study for Juniper & Cisco tests and work on labs just to stay current. Most software devs and security folks I've run into (keep in mind there are a lot of really good folks who have a better grasp of networking than many seasoned engineers do) are overconfident in the belief that they truly understand IP from a debugging and troubleshooting standpoint.
Case in point: I interviewed for a "Network Architect" position with a very well known online backup company (think top 4). The interview was the most bizarre I've ever had -- not because it spanned more than 5 interviews, but because every complex network problem they posed was solvable within 5 to 10 minutes of pointed questions. The software dev interviewing me was baffled at how quickly I reached reasonable solutions to problems that had, in some cases, taken them over a week. It was pretty simple, really: 1) I'd seen something similar before, and 2) this is what I've studied and been passionate about for 20+ years (since I found the Internet in 1991).
Most of the time when I run across a "magical" problem, it's because someone hasn't looked at it from L1 up. As this article showcases, you generally have two generic angles of approach through the stack: application down to physical, or the inverse. Having been in network support, by the time you get a problem like this it's often so distorted with crazy outliers that have nothing to do with the problem that your best bet is to start at L1 and work back up the stack. Reading the author's description, I think some key data was missed and/or misinterpreted. There surely would have been key indicators in TCP checksum errors, and those were glossed over pretty lightly in the explanation -- it's interesting how often those items of interest get cast aside when digging into something like this. Nobody in this thread has pointed out that a bit error test, or even something as simple as iperf, would have been able to more accurately showcase and reproduce the problematic network condition.
But back to the labels remark - I don't believe, as some people have said, that this is a DevOps role largely. I don't mean to cut down on DevOps folks because I think, at some level, if you're a jack-of-all in any org then that's your role, it is what it is. However, this would be a problem most suited towards a professional network engineer - and you don't see much of that need in the startup space until people get into dealing with actual colo / DC type environments, otherwise it's often very simple and not architected with significant depth or specific use cases.
Long story short: network professionals are worth the money for designing, building, and fixing issues that may seem complex to others but can be solved or found in minutes when you know what you're looking at. That being said, I'm impressed that the OP dug into it far enough to ask a specific person (who was probably a network engineer / tech of some level) to validate and fix his claim.