Since we're all sharing wacky debugging stories....
I had a user of an audio streaming product who suffered from random disconnects. He could only stream for about five minutes before it quit. But, and this was the crazy part, the problem only happened when streaming rock and roll. Classical music worked fine.
After many weeks going back and forth, I tracked it down to a misconfiguration on his network interface. The MTU was set lower than the IPv6 spec allows (or something like that - forgive me if I get this detail wrong) and the OS wouldn't fragment our UDP packets with this setting.
Our packets usually were under the limit, but occasionally went over and triggered an error. Why was classical music immune? The lossless codec we were using did a better job on classical than rock, so classical (or at least this guy's classical) never went over the size limit.
I'm assuming this is due to the lower average signal level of symphonic recordings? Rock music is mastered closer to 0 dB due to the well-publicized loudness war.
Obviously, this doesn't affect bandwidth in uncompressed audio, where every sample takes its full bit depth regardless of content, so filesize = bit-depth * sample-rate (Hz) * T (seconds). But in compressed audio, average signal level can make a significant difference in filesize/bandwidth.
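Just to make the arithmetic concrete, here's a quick sketch of the uncompressed case (my own numbers - 16-bit stereo at 44.1 kHz - not anything from the original story):

    #include <stdio.h>

    /* Uncompressed PCM size depends only on the format, never on the content:
     * bytes = channels * (bit_depth / 8) * sample_rate * seconds             */
    int main(void) {
        const long channels    = 2;      /* stereo          */
        const long bit_depth   = 16;     /* bits per sample */
        const long sample_rate = 44100;  /* Hz              */
        const long seconds     = 60;

        long bytes = channels * (bit_depth / 8) * sample_rate * seconds;
        printf("1 minute of CD-quality PCM: %ld bytes (about %.1f MB)\n",
               bytes, bytes / 1e6);      /* ~10.6 MB, rock or Rachmaninoff */
        return 0;
    }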
(Who would have thought a network would be a casualty of the loudness war... harhar)
I think rock is just less regular at a small scale. Consider a flute versus an electric guitar. The flute is at least somewhat like a pure tone which would be easy to compress, while an electric guitar is a mishmash of stuff that will look much more random on a sample by sample basis.
Cool -- I didn't even consider the shape of the waveforms (sine-like for flute, saw/square-like for anything distorted) having an effect on the data, but now that you tell me... duh!
To be clear, I surmised this from what I described, I didn't actually get down and poke at how the codec works on these different types of music. But I'm pretty sure that's what's going on.
These lossless codecs work by attempting to predict the next sample from all the previous ones, and then encoding the difference between the prediction and reality. So a more regular waveform will compress better than a crazy-looking one. It's somewhat counterintuitive since rock music tends to be less complicated from a high-level human perspective.
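Roughly the idea, as a toy sketch - a first-order predictor, far simpler than what a real FLAC-style codec does, and purely my own illustration - showing why a smooth waveform leaves smaller residuals than a noisy one:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy first-order predictor: guess that each sample equals the previous one
     * and store the residual (the part a real codec would entropy-code).
     * Smooth, flute-like signals leave tiny residuals; noisy, distorted-guitar-
     * like signals leave big random-looking ones that barely compress.        */
    static void residuals(const short *in, int *out, int n) {
        int prediction = 0;
        for (int i = 0; i < n; i++) {
            out[i]     = in[i] - prediction;  /* can exceed 16 bits, so use int */
            prediction = in[i];
        }
    }

    int main(void) {
        enum { N = 1024 };
        short smooth[N], noisy[N];
        int   res[N];
        long  sum_smooth = 0, sum_noisy = 0;

        for (int i = 0; i < N; i++) {
            smooth[i] = (short)(10000 * sin(2.0 * 3.14159265 * i / 64.0)); /* tone  */
            noisy[i]  = (short)(rand() % 20001 - 10000);                   /* noise */
        }

        residuals(smooth, res, N);
        for (int i = 0; i < N; i++) sum_smooth += abs(res[i]);
        residuals(noisy, res, N);
        for (int i = 0; i < N; i++) sum_noisy += abs(res[i]);

        /* The noise-like signal's residuals come out roughly 10x larger. */
        printf("mean |residual|  tone: %ld   noise: %ld\n",
               sum_smooth / N, sum_noisy / N);
        return 0;
    }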
I once built a MS Excel add-in in C++ that worked perfectly when compiled by Visual Studio in debug mode but would not run when compiled in release mode. Obviously, debugging in release mode was a huge pain.
Long story short: at some point Excel stores the file path of the add-in in a string with (for some reason) a max length of 128. The file path "C:/.../Debug/addin.xll" totaled 127 characters and the path "C:/.../Release/addin.xll" was 129.
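For illustration only (this is obviously not Excel's actual code, just the shape of the bug): a 127-character path plus its NUL terminator fits a fixed 128-byte buffer exactly, while 129 characters don't fit at all.

    #include <stdio.h>
    #include <string.h>

    #define MAX_ADDIN_PATH 128   /* hypothetical fixed buffer, as in the story */

    /* Returns 0 if the path fits (length + NUL <= 128), -1 otherwise.
     * A 127-character Debug path squeaks in; a 129-character Release path fails. */
    static int store_addin_path(char dest[MAX_ADDIN_PATH], const char *path) {
        if (strlen(path) + 1 > MAX_ADDIN_PATH)
            return -1;               /* silently refused - add-in never loads */
        strcpy(dest, path);
        return 0;
    }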
Back in the late 80's I was working on an X.25 gateway. We had it installed at a Wall Street firm. It had a problem where it would accumulate thousands of CRC errors in just a few minutes.
We knew the problem was on the customer premises because we had loaded the gateway down using the same test suite. We ran them through everything. Their point was that when they plugged the scope onto both sides of the connection and loaded the link with a loopback test, they had no problem.
I get on a plane. We have the scopes hooked up on both sides and the guys in the data center are on the phone talking. We load up the test suite and no errors occurred. Scratching heads, we watched it not fail for a half hour. The consensus was we needed to regroup and come back at the problem from a different angle. The data center guys agreed. Then, suddenly on our scope, we got like 500 CRC errors...
I said to the DC guys "what did you just do? We're getting CRC errors"
"Nothing"
"You didn't touch anything?"
"No we just unplugged the scope from the patch panel"
"Try plugging it back in please"
And the CRC errors stopped. Faulty patch panel diagnosed.
The firm probably paid $25k in consulting fees for that diagnosis. But for me, the lesson that my tools might alter what I am observing was priceless.
I used to work as a fire alarm and security system engineer, and my boss was a dodgy maverick, picking up contracts here and there. He somehow had me responsible for a shopping centre that was about 150 miles away, and it housed the CCTV for the entire town centre (CCTV is big in the UK). I was on a salary, so didn't get overtime. One Saturday morning I got a call saying the whole CCTV system was down, and that I needed to go sort it out ASAP!
Drove for 4 hours, and got to the control room, where I was confronted with a totally dead system - all the monitors were on, but all black screens. I asked the guy what happened, mostly to stall for a bit as I'd never been on site before and there were no schematics available, and I was expecting to be there for days.
He told me that he'd gone to put the kettle on, and when he came back, it was all dead. I asked where the kettle was, and he looked bemused, but showed me. Instead of being in a rest room, it was located on a shelf behind the equipment rack. I took a closer look, and saw that it was plugged into the power strip that was built into the rack, and the cable was twisted round other power cables, one of which fed... the demultiplexer that carried all the CCTV camera signals from round the town, and which had been pulled out (IEC mains lead, easily done). Pushed it back in by 5mm, went round the front to see all the monitors on... Said "thanks for that" to the guy, got the paperwork signed and left. Spent a total of 15 minutes on site, and over 8 hours driving as the traffic was terrible on the way home.
Funny, I didn't understand what you were referring to and quickly forgot about your comment. I then read the 500-mile story, got to the end and thought "units? oh cool, new command... wait, wasn't there a comment about `units`?"
Just the kind of bug which seems ridiculous as a bug report, but turns out to be true.
Like the "OpenOffice.org won’t print on Tuesdays" bug, or the bug which crashed the computer when the general visited. Or the one when the server went down whenever a certain guy had a support ticket.
It was in a military facility in the seventies, and whenever the general visited, the computers crashed. This happened too often to be a demonstration effect; it turned out to be the metal in the general's shoes interfering with the electronics.
Around 2000: a colleague told me about his slow PC, rattling noises, random crashes. I did what any responsible workmate would do: told him to call IT support - I'm a web designer (well, actually I was a frontend and backend developer, database guy and project manager, but division of labor in web development hadn't happened yet, so my job title was web designer, anyway...)!
5 weeks of various visits by the IT support later, including full replacement of all PC hardware and software - still random crashes.
So one day my direct boss tells me to take a look, as the corporate IT support "couldn't do it". I just went over there and encountered a desktop PC covered in refrigerator sticky-note magnets...
Turns out he removed them every time "so that the IT guys have better access to the computer"...
I had a similar thing happen to a friend of the family whose business I provided IT services for. He had 2 servers, 2 rack-mounted UPSes, and a handful of other devices that all fit into the same rack. One day I needed to do some maintenance that required shutting one of his servers down. This was a backup server that wasn't actually used for anything as long as the primary was up and functional.
The problem was that as soon as I shut the backup server down, his entire network stopped working. I was trying to do the maintenance over a VPN, so I immediately lost access. I don't really have a clever way of telling this right now, but after a lot of frustration trying to figure out the problem remotely (and wondering if someone had pwn'd his servers and was using the backup server to MITM all his traffic) I drove out there and noticed the problem right away. The UPS that the backup server was connected to was faulty: once the server was no longer pulling electricity from it, the UPS failed to power the other equipment that was plugged in, one of those things being a critical switch. As soon as the server was powered back on, all the other equipment attached to that UPS immediately powered up too.
Sounds like one of those surge protectors that control the power to other devices based on whether or not a central device is powered on. Some powerbars for home theater setups have a similar feature; if you turn off the TV, all the other stuff turns off with it, and if you turn on the TV, everything turns back on.
Weird that a rackmount UPS would have that sort of feature, but hey, it's possible.
About 10 years ago I was working for one of the Big 4 banks here in Australia, doing 3rd level support. An issue came up with, if memory serves, Siebel. Desktop issue, don't remember the details, but it was pretty serious - a team couldn't do the thing that they do.
I was on this thing for weeks. Hacking away in whatever that tool from sysinternals was called (Procmon?), monitoring calls at the process level, running multiple tests on multiple machines, the whole lot. It's the most complex troubleshooting I've ever done.
I found nothing.
The guys in the team were in an office a few miles away from me so one day I said, hey, I'm going to come down. I need to see this with my own eyes rather than over a remote connection.
I went down there. We started up the desktop. They launched the software. They clicked the button to do The Thing That Wasn't Working.
And it worked. It just did the thing it was meant to do.
Side note: that scream makes a superb notification tone for people you don't like, or PagerDuty.
I also heard a good yarn once where a customer's DSL service inexplicably stopped working at exactly the same time every day, and after weeks of troubleshooting and changing absolutely everything, with pretty much everyone involved ready to give up, a senior network engineer sitting in the guy's living room happened to notice the street light directly outside turning on at the same time. Turns out the lamp was poorly shielded and throwing enough RF into the poor customer's apartment to light up Pittsburgh. Never heard what came of that, though I imagine that call to public works was interesting.
We had a similar thing happen with our cable... we'd get high upload latency and packet loss when it rained.
Turned out that the contractor who replaced the coax on our block didn't use the right grade of cable... the weight of the water plus wind would stretch the cable and open a crack in the housing.
The old phone line here has a similar problem. The bandwidth goes to shit when the weather turns bad. Likely there is a crack somewhere and a combo of wind and rain gets inside.
What an interesting analysis of the scene. I always got the impression Linux was looked down on early on (I learned about Linux very late), and there you go.
From what I have seen around the net, the (ex-)Sun people are still salty about the outcome. Keep in mind that Solaris can be traced back to the original Bell Labs UNIX.
"Keep in mind that Solaris can be traced back to the original Bell Labs UNIX."
Kind of. Solaris' history is a bit quirky due to it being built on SVR4 (which is the part that "can be traced back to the original Bell Labs UNIX"), SVR4 having in turn been based on a hodge-podge of "good parts" from all sorts of Unix implementations (including BSD - both on its own and by way of SunOS - and Xenix).
It's thanks to Solaris, though, that we have the only (AFAIK) FOSS implementation of "real" Unix in any form: OpenSolaris (which was then violently murdered by Oracle, but it lives on as illumos and the various distributions thereof, so not all is lost).
I reckon the biggest reason for the eventual outcome is that Linux was FOSS from pretty much the start and wasn't dealing with a big legal conflict (unlike BSD, which was still dealing with AT&T lawsuits and such). By the time Sun released OpenSolaris, it was by far too late for them to really curb Linux's momentum.
Linux was also Free. It's hard to beat Free of acceptable quality - the kids would install it to check it out, see that it's cool, then leave high school ready to be Linux sysadmin interns. In contrast, they might get a few minutes per week of mainframe time in University.
Similar quote from McNealy with a little more context:
"People say, 'Why haven't you retired?' I said, 'I can't leave my kids to a world of Control, Alt, Delete,' " he said, referring to the function in Microsoft Windows for rebooting after a system crash. "I can't leave my kids to MSN. I think you [developers should be out there helping your families [by spreading Java use.]"
Doesn't sound that far-fetched. Here's a slightly newer story where cosmic rays caused address lines to generate spurious signals, and radioactive lead solder caused L1 cache corruption on a BG/L machine: http://spectrum.ieee.org/computing/hardware/how-to-kill-a-su...
A newly hired employee had the strangest problem with his ethernet when he started.
Whenever he plugged in ethernet it would only give him 10 Mbps. But if he unplugged it and plugged it back in, it'd switch over to 1000.
What was stranger: powering the router or the hardware off and on, rebooting - nothing would switch it over to the higher speed. We tested it a dozen times. But if you plugged it in, unplugged it, then plugged it in a second time: voila, a 1 Gbps connection. Always 10 Mbps the first time. Tried four cables and three different switches on multiple computers, so the problem was his computer. We even reformatted the computer and tried different hardware ethernet ports. Never fully resolved why it always needed to be plugged in twice to get the faster connection, though.
In this BeOS debugging story ("A Testing Fairy Tale"), their floppy disk stress tests can run all day but fail when run overnight. The morning sunlight through the office window triggered the test machine's floppy disk write-protection mechanism, causing a write failure during the test.
This reminds me of the time when I managed to diagnose a packet storm issue after two days of methodically excluding software, then hardware, followed by tracing each Ethernet cable to the switch. Turned out that the re-connection of the cables had created a network loop through a dumb switch - hence the packet storm!
Well my best-worst debugging story concerns a friend's work based email account and Microsoft Outlook (in 2012). Occasionally it would fail to connect to send email to the server, just randomly.
Obvious troubleshooting ensued, web traffic worked, could ping the email server, could connect with telnet and read email that way; Thunderbird worked. Created a new account, that would work for a while and then fail again.
Less obvious troubleshooting, traced route to server whilst running the connection - route worked, connection failed. Outlook logs showed attempts to connect to the correct URL but the connection wasn't being made. Checked for malware. Reset router, actually I think we replaced it. Pulled out a sysinternals tool, tcpview IIRC, watched the connections being made ... hang on, what's that IP address??
Turns out Windows was querying DNS and getting the right IP address, but somewhere it was reversing the dotted quad, and when Outlook said it was connecting to the relay.example.com server - let's say 6.7.8.9 - it was instead attempting to connect to 9.8.7.6 ...
I didn't track whether it was MS Windows or Outlook that was in error, I just dropped the correct address in as a line in the HOSTS file on the three affected computers. Fixed.
Very satisfying to find the way the problem arose and an easy fix; but would love to have seen internally where the error was arising and exactly why. I did find one other report that sounded like the same problem IIRC. My only idea was that an automated reverse-IP hostname like some ISPs use - like "9-8-7-6.ispnet.com" - was for some reason getting parsed in as the IP, but I wasn't about to start reverse engineering stuff to find out.
Sounds like someone forgot to call this function, and maybe most of the systems were big-endian, so it didn't matter, but one was little-endian: https://linux.die.net/man/3/ntohl
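Something like this, as a sketch (POSIX sockets on a little-endian box; obviously not the actual Windows/Outlook code): skipping the ntohl conversion is exactly the kind of bug that reverses the dotted quad.

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>   /* inet_addr, ntohl */

    /* An IPv4 address comes back from the resolver in network byte order.
     * Treat it as a host-order integer on a little-endian machine and the
     * dotted quad comes out reversed: 6.7.8.9 turns into 9.8.7.6.          */
    static void print_quad(const char *label, uint32_t addr_host_order) {
        printf("%-15s %u.%u.%u.%u\n", label,
               (addr_host_order >> 24) & 0xff, (addr_host_order >> 16) & 0xff,
               (addr_host_order >>  8) & 0xff,  addr_host_order        & 0xff);
    }

    int main(void) {
        uint32_t net = inet_addr("6.7.8.9");      /* network byte order */

        print_quad("with ntohl:",    ntohl(net)); /* 6.7.8.9 - correct            */
        print_quad("without ntohl:", net);        /* 9.8.7.6 on little-endian x86 */
        return 0;
    }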
Endianness was the first thing I thought of as well. Or maybe one component of the stack thought that IP addresses are char[4] while another thought they're u32_t, though you'd expect that to be caught by the typechecker.
Similar to the UK academic network JANET's problems with computer science department emails in the late eighties/early nineties. JANET used X.25 before transitioning to TCP/IP, with its own idiosyncratic email addressing that reversed the order of the domain name segments (relative to DNS order) in an email address. So, a University of Edinburgh CS department member with an address like 'grkvlt@cs.ed.ac.uk' had it translated into user 'grkvlt' and host 'uk.ac.ed.cs', then promptly sent off to Czechoslovakia by overly-keen international mail gateways. This led to many CS mail server sub-domains gaining an initial departmental 'd', thus 'dcs', giving 'dcs.ed.ac.uk', 'dcs.gla.ac.uk' and so on...
At a prior employer, we had racks of Dell servers, each with their own disk boxes attached to RAID controllers.
Sometimes, more than one of the servers at the same time would decide that their respective disk enclosures had disappeared and reappeared, and the RAID controllers would be unhappy and mark the volume as foreign until a human intervened.
Windows and Linux both did this, so it wasn't an OS problem, and it was multiple machines in multiple racks, ostensibly UPSed with line filtering.
The odder thing was when we noticed it was almost always the ones on the upper half of the racks.
The best, though, was what "resolved" the issue - the power exchange next to the building blew up one day (AIUI one of the phase lines for the three phase connected to another phase's busbar, and BOOM, the room was covered in a fine copper mist), and after all the power equipment in the exchange was (eventually) replaced, the problem went away.
Wacky story, but this is very typical. They had unsecured wireless devices attached to their network. A misconfigured device, misconfigured due to the hurricane, ended up causing internal problems. I must ask, why was a WiFi device so ready to connect to some random device? Be glad this wasn't a rogue device.
Original commenter here. Talk about a blast from the past!
It was indeed surprising and scary for me too. Hard to recall precise details from a decade ago but I think that both the devices connected because I used the same password on them and while the reset caused the test WAP to become a repeater, it still had the credentials necessary to connect to the production WAP.
I did not have enough time or resources to find the root cause because we were still getting back up, I believe after Hurricane Charley.
In college we had a similar scenario when a UPS delivery truck making its routine stop at a fairly consistent time each day caused connectivity to degrade. If I remember correctly, the solution was discovered following a UPS strike in the late 90s.
Thanks! That was one of the more Kafkaesque debug stories. Going to reread it for old times' sake, and I encourage anyone who hasn't to give it a look over.
Going to come across as mean, but this post illustrates the difference between a sysadmin and a network engineer that specializes in things like this (and why you should not be using dumb unmanaged switches in an enterprise environment).
Remove the stuff about wireless repeaters and pallets of shampoo and this is a basic L2 loop that should have taken 10 minutes to track down.
Yes, as pointed out on many debugging stories: problems are easy to find if you look in the right place (hint: knowing where to look is the hard part). Unfortunately for the people living them, there are wireless repeaters and pallets of shampoo that make looking in the right place harder. You might as well have said that Waldo is easy to find if you remove all of the other people on the page.
As for it being "a basic L2 loop that should have taken 10 minutes to track down." If it were a "normal" L2 loop, and he could have taken 10 minutes tracking it down; sure. But when it only happens for ~2 minutes at a time, getting 10 minutes of actual tracking in is hard.
Sure, having managed switches would have made finding it easier. But most debugging stories have something that could have made them easier if done beforehand.