I had a user of an audio streaming product who suffered from random disconnects. He could only stream for about five minutes before it quit. But, and this was the crazy part, the problem only happened when streaming rock and roll. Classical music worked fine.
After many weeks going back and forth, I tracked it down to a misconfiguration on his network interface. The MTU was set lower than the IPv6 spec allowed (or so, forgive me if I get this detail wrong) and the OS wouldn't fragment our UDP packets with this setting.
Our packets usually were under the limit, but occasionally went over and triggered an error. Why was classical music immune? The lossless codec we were using did a better job on classical than rock, so classical (or at least this guy's classical) never went over the size limit.
Obviously, this doesn't affect bandwidth in uncompressed audio, where every sample contains every bit, so filesize = bit-depth * sample-rate(hz) * T(seconds). But in compressed audio, average signal level can make a significant difference in filesize/bandwidth.
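To put rough numbers on the uncompressed formula above, here is a small sketch. The CD-style parameters (16-bit, 44.1 kHz, stereo) and the `channels` factor are my own assumptions for illustration; the formula as written covers a single channel and gives bits, not bytes.

```python
def pcm_size_bytes(bit_depth, sample_rate_hz, seconds, channels=1):
    """Uncompressed PCM size: every sample carries every bit, whatever the music."""
    # 'channels' is an assumed extra factor; the formula in the comment is per channel.
    bits = bit_depth * sample_rate_hz * seconds * channels
    return bits // 8  # convert bits to bytes


# One minute of CD-style stereo audio (assumed example parameters).
cd_minute = pcm_size_bytes(16, 44_100, 60, channels=2)
print(cd_minute)  # 10584000 bytes, about 10.6 MB
```

The result does not depend on the content at all, which is why only the compressed stream could be tripped up by what was being played.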
(Who would have thought a network would be a casualty of the loudness war... harhar)
These lossless codecs work by attempting to predict the next sample from all the previous ones, and then encoding the difference between the prediction and reality. So a more regular waveform will compress better than a crazy-looking one. It's somewhat counterintuitive since rock music tends to be less complicated from a high-level human perspective.
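A toy version of that predict-and-encode-the-difference idea, as a sketch only. It uses a fixed second-order predictor (guess the next sample by extending the line through the previous two), which is a simplification I picked for illustration and not necessarily what the codec in this story did. The sine wave and the random noise are stand-ins for a "regular" and a "crazy-looking" waveform.

```python
import math
import random


def residuals(samples):
    """Predict x[n] as 2*x[n-1] - x[n-2] and keep only the prediction error."""
    return [samples[n] - (2 * samples[n - 1] - samples[n - 2])
            for n in range(2, len(samples))]


def mean_abs(values):
    """Average magnitude: a rough proxy for how many bits each value needs."""
    return sum(abs(v) for v in values) / len(values)


random.seed(0)  # fixed seed so the noise is repeatable
n = 2000
# A smooth 220 Hz tone sampled at 44.1 kHz, amplitude 10,000 (assumed values).
smooth = [round(10_000 * math.sin(2 * math.pi * 220 * t / 44_100)) for t in range(n)]
# Uniform noise over the same amplitude range.
noisy = [random.randint(-10_000, 10_000) for _ in range(n)]

print("smooth residual:", mean_abs(residuals(smooth)))
print("noisy residual: ", mean_abs(residuals(noisy)))
```

The residuals for the smooth tone stay tiny compared with its amplitude, so they can be stored in a few bits each, while the residuals for the noise are as large as the signal itself. Bigger residuals mean bigger frames, which is how a denser waveform could push a packet over a size limit.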
Long story short: at some point Excel stores the file path of the add-in in a string with (for some reason) a max length of 128. The file path "C:/.../Debug/addin.xll" totaled 127 characters and the path "C:/.../Release/addin.xll" was 129.
We knew the problem was on the customer premises because we had loaded the gateway down using the same test suite. We ran them through everything. Their point was that when they plugged the scope in on both sides of the connection and loaded the link with a loopback test, they had no problem.
I get on a plane. We have the scopes hooked up on both sides and the guys in the data center are on the phone talking. We load up the test suite and no errors occurred. Scratching heads, we watched it not fail for a half hour. The consensus was we needed to regroup and come back at the problem from a different angle. The data center guys agreed. Then, suddenly on our scope, we got like 500 CRC errors...
I said to the DC guys "what did you just do? We're getting CRC errors"
"You didn't touch anything?"
"No we just unplugged the scope from the patch panel"
"Try plugging it back in please"
And the CRC errors stopped. Faulty patch panel diagnosed.
The firm probably paid $25k in consulting fees for that diagnosis. But for me, the lesson that my tools might alter what I am observing was priceless.
Drove for 4 hours, and got to the control room, where I was confronted with a totally dead system - all the monitors were on, but all black screens. I asked the guy what happened, mostly to stall for a bit as I'd never been on site before and there were no schematics available, and I was expecting to be there for days.
He told me that he'd gone to put the kettle on, and when he came back, it was all dead. I asked where the kettle was, and he looked bemused, but showed me. Instead of being in a rest room, it was on a shelf behind the equipment rack. I took a closer look and saw that it was plugged into the power strip built into the rack, and its cable was twisted round other power cables, one of which was... the one for the demultiplexer that carried all the CCTV camera signals from round the town, and which had been pulled out (IEC mains lead, easily done). Pushed it back in by 5mm, went round the front to see all the monitors on... Said "thanks for that" to the guy, got the paperwork signed and left. Spent a total of 15 minutes on site, and over 8 hours driving as the traffic was terrible on the way home.
(As per https://news.ycombinator.com/newsguidelines.html I've been allowed to say that for some time)
Like the "OpenOffice.org won’t print on Tuesdays" bug, or the bug which crashed the computer when the general visited. Or the one when the server went down whenever a certain guy had a support ticket.
It was in a military facility in the seventies, and the computers always crashed when the general visited. This happened too often to be a demonstration effect; it turned out to be the metal in the general's shoes interfering with the electronics.
5 weeks of various visits by the IT support later, including full replacement of all PC hardware and software - still random crashes.
so one day my direct boss tells me to take a look as the corporate IT support "couldn't do it". i just went over there and encountered a desktop PC covered in refrigerator sticky-note magnets.....
turns out he removed them every time "so that the IT guys have better access to the computer"...
Most of all it was reproducible. With magnets ... slower, slower, rattling, slower ... 1, 2, 3 days later crash.
The problem was that as soon as I shut the backup server down, his entire network stopped working. I was trying to do the maintenance over a VPN, so I immediately lost access. I don't really have a clever way of telling this right now, but after a lot of frustration trying to figure out the problem remotely (and wondering if someone had pwn'd his servers and was using the backup server to MITM all his traffic) I drove out there and noticed the problem right away. The UPS that the backup server was connected to was faulty. Once the server was no longer pulling electricity from the faulty UPS, it stopped powering the other equipment plugged into it, one of those things being a critical switch. As soon as the server was powered back on, all the other equipment attached to the UPS immediately powered up too.
Weird that a rackmount UPS would have that sort of feature, but hey, it's possible.
I was on this thing for weeks. Hacking away in whatever that tool from sysinternals was called (Procmon?), monitoring calls at the process level, running multiple tests on multiple machines, the whole lot. It's the most complex troubleshooting I've ever done.
I found nothing.
The guys in the team were in an office a few miles away from me so one day I said, hey, I'm going to come down. I need to see this with my own eyes rather than over a remote connection.
I went down there. We started up the desktop. They launched the software. They clicked the button to do The Thing That Wasn't Working.
And it worked. It just did the thing it was meant to do.
And the problem never came back.
Oh, and back when Sun Microsystems said random server crashes were due to alpha particles. http://web.archive.org/web/20020202013942/http://www.compute...
Side note: that scream makes a superb notification tone for people you don't like, or PagerDuty.
I also heard a good yarn once where a customer's DSL service inexplicably stopped working at exactly the same time every day, and after weeks of troubleshooting and changing absolutely everything, with pretty much everyone involved ready to give up, a senior network engineer sitting in the guy's living room happened to notice the street light directly outside turning on at the same time. Turns out the lamp was poorly shielded and throwing enough RF into the poor customer's apartment to light up Pittsburgh. Never heard what came of that, though I imagine that call to public works was interesting.
Turned out that the contractor who replaced the coax on our block didn't use the right grade of cable... the weight of the water plus wind would stretch the cable and open a crack in the housing.
That gave me more than a few giggles.
What an interesting analysis of the scene. I always got the impression Linux was looked down on early on (learned about Linux very late), and there you go.
Kind of. Solaris' history is a bit quirky due to it being built on SVR4 (which is the part that "can be traced back to the original Bell Labs UNIX"), SVR4 having in turn been based on a hodge-podge of "good parts" from all sorts of Unix implementations (including BSD - both on its own and by way of SunOS - and Xenix).
It's thanks to Solaris, though, that we have the only (AFAIK) FOSS implementation of "real" Unix in any form: OpenSolaris (which was then violently murdered by Oracle, but it lives on as illumos and the various distributions thereof, so not all is lost).
I reckon the biggest reason for the eventual outcome is that Linux was FOSS from pretty much the start and wasn't dealing with a big legal conflict (unlike BSD, which was still dealing with AT&T lawsuits and such). By the time Sun released OpenSolaris, it was by far too late for them to really curb Linux's momentum.
> Q: Do you ever think about retiring?
> A: Every day. No, not really. I can't leave my kids to Microsoft. The government won't fight the battle. The government won't enforce the laws.
i'm not old enough to know the context, and out of context this is pretty funny.
"People say, 'Why haven't you retired?' I said, 'I can't leave my kids to a world of Control, Alt, Delete,' " he said, referring to the function in Microsoft Windows for rebooting after a system crash. "I can't leave my kids to MSN. I think you [developers] should be out there helping your families [by spreading Java use]."
Whenever he plugged in ethernet it would only give him 10 Mbps. But if he unplugged it and plugged it back in, it'd switch over to 1000.
What was more strange: nothing else would switch it over to the higher speed. Turning the interface off and on, power-cycling the router or the hardware, rebooting, nothing. We tested it a dozen times. But if you plug it in, unplug it, then plug it in a 2nd time, voilà, 1 Gbps connection made. Always 10 Mbps the first time. Tried 4 wires and 3 different switches on multiple computers. Problem is his computer. Even reformatted the computer and tried different hardware ethernet ports. Never fully resolved why it always needed to be plugged in twice for the faster connection, though.
Obvious troubleshooting ensued, web traffic worked, could ping the email server, could connect with telnet and read email that way; Thunderbird worked. Created a new account, that would work for a while and then fail again.
Less obvious troubleshooting, traced route to server whilst running the connection - route worked, connection failed. Outlook logs showed attempts to connect to the correct URL but the connection wasn't being made. Checked for malware. Reset router, actually I think we replaced it. Pulled out a sysinternals tool, tcpview IIRC, watched the connections being made ... hang on, what's that IP address??
Turns out Windows was querying and getting the IP address, but somewhere it was reversing the dotted quad, and when Outlook said it was connecting to the relay.example.com server - let's say 126.96.36.199 - it was instead attempting to connect to 188.8.131.52 ...
I didn't track whether it was MS Windows or Outlook that was in error, I just dropped the correct address in as a line in the HOSTS file on the three affected computers. Fixed.
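For anyone wondering what a reversed dotted quad looks like in practice, here is a minimal sketch. The address `192.0.2.10` is a documentation-range placeholder I chose, not the real server, and the byte-order mix-up in the second half is only one guess at how such a reversal could arise; nothing above establishes that this was the actual cause.

```python
import ipaddress


def reverse_quad(addr):
    """Reverse the four octets of a dotted-quad IPv4 string."""
    return ".".join(reversed(addr.split(".")))


intended = "192.0.2.10"  # placeholder address (assumption, documentation range)
mangled = reverse_quad(intended)
print(mangled)  # 10.2.0.192

# One hypothetical way to end up there: reading the 4 address bytes
# with the wrong byte order somewhere between the lookup and the connect.
packed = ipaddress.ip_address(intended).packed
wrong_endian = ipaddress.ip_address(int.from_bytes(packed, "little"))
print(wrong_endian)  # 10.2.0.192
```

Both routes give the same wrong address, which is still a perfectly valid IP, so nothing complains until the connection goes nowhere.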
Very satisfying to find the way the problem arose and an easy fix; but would love to have seen internally where the error was arising and exactly why. I did find one other report that sounded like the same problem IIRC. My only idea was that an automated reverse-IP hostname like some ISPs use - like "9-8-7-6.ispnet.com" - was for some reason getting parsed in as the IP, but I wasn't about to start reverse engineering stuff to find out.
I.e. not escaping the .
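A quick sketch of how an unescaped dot could make that happen. Both patterns are hypothetical and written only for illustration; there is no evidence here that Windows or Outlook uses a regex like either of them.

```python
import re

# Hypothetical "looks like an IPv4 address" patterns.
# In the first one '.' is unescaped, so it matches ANY character, including '-'.
unescaped = re.compile(r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}")
# In the second one '\.' only matches a literal dot.
escaped = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

hostname = "9-8-7-6.ispnet.com"

print(unescaped.search(hostname).group())  # 9-8-7-6
print(escaped.search(hostname))            # None
```

With the unescaped pattern, the reverse-DNS style hostname "contains an IP address" with its octets in whatever order the ISP happened to write them.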
Sometimes, more than one of the servers at the same time would decide that their respective disk enclosures had disappeared and reappeared, and the RAID controllers would be unhappy and mark the volume as foreign until a human intervened.
Windows and Linux both did this, so it wasn't an OS problem, and it was multiple machines in multiple racks, ostensibly UPSed with line filtering.
The odder thing was when we noticed it was almost always the ones on the upper half of the racks.
The best, though, was what "resolved" the issue - the power exchange next to the building blew up one day (AIUI one of the phase lines for the three phase connected to another phase's busbar, and BOOM, the room was covered in a fine copper mist), and after all the power equipment in the exchange was (eventually) replaced, the problem went away.
It was indeed surprising and scary for me too. Hard to recall precise details from a decade ago but I think that both the devices connected because I used the same password on them and while the reset caused the test WAP to become a repeater, it still had the credentials necessary to connect to the production WAP.
I did not have enough time or resources to find the root cause because we were still getting back up, I believe after Hurricane Charley.
I think this post perfectly illustrates the major difference between "theory & practice".
Remove the stuff about wireless repeaters and pallets of shampoo and this is a basic L2 loop that should have taken 10 minutes to track down.
As for it being "a basic L2 loop that should have taken 10 minutes to track down": if it were a "normal" L2 loop, and he could have spent 10 minutes tracking it down, sure. But when it only happens for ~2 minutes at a time, getting 10 minutes of actual tracking in is hard.
Sure, having managed switches would have made finding it easier. But most debugging stories have something that could have made them easier if done beforehand.