
Every time we lift a pallet from the shipping room, the server times out (2006) - raldi
https://www.reddit.com/r/reddit.com/comments/vunp/the_case_of_the_500mile_email/cvw60/
======
mikeash
Since we're all sharing wacky debugging stories....

I had a user of an audio streaming product who suffered from random
disconnects. He could only stream for about five minutes before it quit. But,
and this was the crazy part, the problem only happened when streaming rock and
roll. Classical music worked fine.

After many weeks going back and forth, I tracked it down to a misconfiguration
on his network interface. The MTU was set lower than the IPv6 spec allowed (or
so, forgive me if I get this detail wrong) and the OS wouldn't fragment our
UDP packets with this setting.

Our packets usually were under the limit, but occasionally went over and
triggered an error. Why was classical music immune? The lossless codec we were
using did a better job on classical than rock, so classical (or at least this
guy's classical) never went over the size limit.

~~~
nerflad
I'm assuming this is due to the lower average signal level of symphonic
recordings? Rock music is mastered closer to 0 dB due to the well-publicized
loudness war.

Obviously, this doesn't affect bandwidth in uncompressed audio, where every
sample contains every bit, so filesize = bit-depth * sample-rate(hz) *
T(seconds). But in compressed audio, average signal level can make a
significant difference in filesize/bandwidth.

(Who would have thought a network would be a casualty of the loudness war...
harhar)
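
The uncompressed-size formula above can be checked with a few lines of Python. This is only a sketch: the function name `pcm_bytes` and the `channels` parameter are assumptions added for illustration (the formula as written covers a single channel and yields bits, not bytes).

```python
def pcm_bytes(bit_depth, sample_rate_hz, seconds, channels=1):
    """Size of raw (uncompressed) PCM audio in bytes.

    bits = bit-depth * sample-rate(Hz) * T(seconds), times the channel
    count, which is an addition not present in the formula above.
    """
    bits = bit_depth * sample_rate_hz * seconds * channels
    return bits // 8


# One minute of CD-style audio: 16-bit, 44.1 kHz, stereo.
cd_minute = pcm_bytes(16, 44_100, 60, channels=2)
print(cd_minute)  # 10584000
```

The result is the same whether the minute holds silence or a wall of distortion, which is the point being made: only a compressed stream varies in size with the content.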

~~~
mikeash
I think rock is just less regular at a small scale. Consider a flute versus an
electric guitar. The flute is at least somewhat like a pure tone which would
be easy to compress, while an electric guitar is a mishmash of stuff that will
look much more random on a sample by sample basis.

~~~
nerflad
Cool -- I didn't even consider the shape of the waveforms (sine-like for
flute, saw/square-like for anything distorted) having an effect on the data,
but now that you tell me... duh!

~~~
mikeash
To be clear, I surmised this from what I described, I didn't actually get down
and poke at how the codec works on these different types of music. But I'm
pretty sure that's what's going on.

These lossless codecs work by attempting to predict the next sample from all
the previous ones, and then encoding the difference between the prediction and
reality. So a more regular waveform will compress better than a crazy-looking
one. It's somewhat counterintuitive since rock music tends to be _less_
complicated from a high-level human perspective.
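
A rough sketch of the predict-then-encode-the-difference idea described above. It assumes a fixed second-order predictor (guess the next sample by extending the line through the previous two), and it uses uniform random noise as a deliberately exaggerated stand-in for a "crazy-looking" waveform. It is not the codec from the story, and a real lossless codec would also choose its predictor adaptively and entropy-code the residuals.

```python
import math
import random


def residuals(samples):
    """Errors of a fixed second-order predictor: guess x[n] = 2*x[n-1] - x[n-2]."""
    return [samples[n] - (2 * samples[n - 1] - samples[n - 2])
            for n in range(2, len(samples))]


def mean_abs(values):
    """Average magnitude; smaller residuals need fewer bits to store."""
    return sum(abs(v) for v in values) / len(values)


RATE = 44_100  # samples per second (assumed)
N = 4096       # number of samples to generate

# A pure 440 Hz tone, loosely "flute-like".
tone = [round(10_000 * math.sin(2 * math.pi * 440 * n / RATE)) for n in range(N)]

# Uniform noise at the same peak level, seeded so runs are repeatable.
rng = random.Random(0)
noise = [rng.randint(-10_000, 10_000) for _ in range(N)]

print("tone residual :", mean_abs(residuals(tone)))
print("noise residual:", mean_abs(residuals(noise)))
```

The tone's residuals come out far smaller than the samples themselves, while the noise's residuals are no smaller than the noise, so the tone would pack into much smaller frames.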

------
benp84
I once built an MS Excel add-in in C++ that worked perfectly when compiled by
Visual Studio in debug mode but would not run when compiled in release mode.
Obviously, debugging in release mode was a huge pain.

Long story short: at some point Excel stores the file path of the add-in in a
string with (for some reason) a max length of 128. The file path
"C:/.../Debug/addin.xll" totaled 127 characters and the path
"C:/.../Release/addin.xll" was 129.

------
CodeWriter23
Back in the late 80's I was working on an X.25 gateway. We had it installed at
a Wall Street firm. It had a problem where it would accumulate thousands of
CRC errors in just a few minutes.

We knew the problem was on the customer premises because we had loaded the
gateway down using the same test suite. We ran them through everything. Their
point was when they plugged the scope on both sides of the connection and
loaded the link with a loop back test, they had no problem.

I get on a plane. We have the scopes hooked up on both sides and the guys in
the data center are on the phone talking. We load up the test suite and no
errors occurred. Scratching heads, we watched it not fail for a half hour. The
consensus was we needed to regroup and come back at the problem from a
different angle. The data center guys agreed. Then, suddenly on our scope, we
got like 500 CRC errors...

I said to the DC guys "what did you just do? We're getting CRC errors"
"Nothing" "You didn't touch anything?" "No we just unplugged the scope from
the patch panel" "Try plugging it back in please"

And the CRC errors stopped. Faulty patch panel diagnosed.

The firm probably paid $25k in consulting fees for that diagnosis. But for me,
the lesson that my tools might alter what I am observing was priceless.

------
djaychela
I used to work as a fire alarm and security system engineer, and my boss was a
dodgy maverick, picking up contracts here and there. He somehow had me
responsible for a shopping centre that was about 150 miles away, and it housed
the cctv for the entire town centre (cctv is big in the UK). I was on a
salary, so didn't get overtime. One Saturday morning I got a call saying all
the cctv system was down, and that I needed to go sort it out ASAP!

Drove for 4 hours, and got to the control room, where I was confronted with a
totally dead system - all the monitors were on, but all black screens. I asked
the guy what happened, mostly to stall for a bit as I'd never been on site
before and there were no schematics available, and I was expecting to be there
for days.

He told me that he'd gone to put the kettle on, and when he came
back, it was all dead. I asked where the kettle was, and he looked bemused, but
showed me. Instead of it being in a rest room, it was located on a shelf
behind the equipment rack. I took a closer look, and saw that it was plugged
into the power strip that was built into the rack, and the cable was twisted
round other power cables, one of which was... The demultiplexer that carried
all the cctv camera signals from round the town, and which had been pulled out
(IEC mains lead, easily done). Pushed it back in by 5mm, went round the front
to see all the monitors on... Said "thanks for that" to the guy, got the
paperwork signed and left. Spent a total of 15 minutes on site, and over 8
driving as the traffic was terrible on the way home.

------
noxin
Stories like this always remind me of the car allergic to vanilla ice cream.

[http://www.cgl.uwaterloo.ca/smann/IceCream/humor.html](http://www.cgl.uwaterloo.ca/smann/IceCream/humor.html)

~~~
robert_tweed
Not forgetting the classic "magic / more magic" switch:

[http://catb.org/esr/jargon/html/magic-story.html](http://catb.org/esr/jargon/html/magic-story.html)

~~~
deleted_soon
Or the email server that couldn't send emails more than 500 miles:

[https://www.ibiblio.org/harris/500milemail.html](https://www.ibiblio.org/harris/500milemail.html)

~~~
mathewsanders
First time hearing this story, and also just discovered `units` which I really
love :)

~~~
tvmalsv
Funny, I didn't understand what you were referring to and quickly forgot about
your comment. I then read the 500-mile story, got to the end and thought
"units? oh cool, new command... wait, wasn't there a comment about `units`?"

------
franze
Around 2000: a colleague told me about his slow PC, rattling noises, random
crashes. I did what any responsible workmate would do: told him to call IT
support, I'm a web-designer (well actually I was a frontend, backend
developer, database guy and project manager but division of labor in
webdevelopment didn't happen yet so my job title was webdesigner, anyway...)!

5 weeks of various visits by the IT support later, including full replacement
of all PC hardware and software - still random crashes.

so one day my direct boss tells me to take a look as the corporate IT support
"couldn't do it". I just went over there and encountered a desktop PC covered
in refrigerator-sticky-note-magnets.....

turns out he removed them every time "so that the IT guys have better access
to the computer"...

~~~
bobbles
I doubt fridge magnets would have anything to do with problems in a PC, they
would need to be about 100x stronger than that to have any sort of impact

~~~
dTal
Sounds like it was somehow causing a fan to stick, hence the rattling and
subsequent instability.

------
Dagwoodie
I had a similar thing happen to a friend of the family whose business I
provided IT services for. He had 2 servers and 2 rack mounted UPS's and a
handful of other devices that all fit into the same rack. One day I needed to
do some maintenance that required shutting one of his servers down. This was a
backup server that wasn't actually used for anything as long as the primary
was up and functional.

The problem was that as soon as I shut the backup server down, his entire
network stopped working. I was trying to do the maintenance over a VPN, so
immediately I lost access. I don't really have a clever way of telling this
right now, but after a lot of frustration trying to figure out the problem
remotely (and wondering if someone had pwn'd his servers and was using the
backup server to MITM all his traffic) I drove out there and noticed the
problem right away. The UPS that the backup server was connected to was
faulty. What was happening was that once the server was no longer pulling
electricity from the faulty UPS, it failed to power the other equipment that
was plugged into it, one of those things being a critical switch. As soon as
the server was powered back on, all the other equipment attached to it
immediately powered up too.

~~~
yellowapple
Sounds like one of those surge protectors that control the power to other
devices based on whether or not a central device is powered on. Some powerbars
for home theater setups have a similar feature; if you turn off the TV, all
the other stuff turns off with it, and if you turn _on_ the TV, everything
turns back on.

Weird that a rackmount UPS would have that sort of feature, but hey, it's
possible.

~~~
Fomite
I have a rackmount UPS that was expressly intended for AV purposes that does
that.

------
jen729w
About 10 years ago I was working for one of the Big 4 banks here in Australia,
doing 3rd level support. An issue came up with, if memory serves, Siebel.
Desktop issue, don't remember the details, but it was pretty serious - a team
couldn't do the thing that they do.

I was on this thing for _weeks_. Hacking away in whatever that tool from
sysinternals was called (Procmon?), monitoring calls at the process level,
running multiple tests on multiple machines, the whole lot. It's the most
complex troubleshooting I've ever done.

I found nothing.

The guys in the team were in an office a few miles away from me so one day I
said, hey, I'm going to come down. I need to see this with my own eyes rather
than over a remote connection.

I went down there. We started up the desktop. They launched the software. They
clicked the button to do The Thing That Wasn't Working.

And it worked. It just did the thing it was meant to do.

And the problem never came back.

------
tyingq
Similar story:
[http://patrickthomson.tumblr.com/post/2499755681/the-best-de...](http://patrickthomson.tumblr.com/post/2499755681/the-best-debugging-story-ive-ever-heard)

Oh, and back when Sun Microsystems said random server crashes were due to
alpha particles.
[http://web.archive.org/web/20020202013942/http://www.compute...](http://web.archive.org/web/20020202013942/http://www.computerworld.com/storyba/0,4125,NAV47_STO66102,00.html)

~~~
waterhouse
Someone recently posted a list of these.

[http://beza1e1.tuxen.de/lore/](http://beza1e1.tuxen.de/lore/)

~~~
yellowapple
The "Crash Cows" one is absolutely horrifying.

~~~
VanV
E

------
Namrog84
A newly hired employee had the strangest problem with his ethernet when he
started.

Whenever he plugged in ethernet it would only give him 10 Mbps. But if he
unplugged it and plugged it back in, it'd switch over to 1000.

What was stranger: powering the router or other hardware off and on, or
rebooting, would never switch it over to the higher speed. We tested it a
dozen times. But if you plugged it in, unplugged it, then plugged it in a
second time: voilà, a 1 Gbps connection. Always 10 Mbps the first time. Tried
4 wires and 3 different switches on multiple computers. The problem was his
computer. Even reformatted the computer and tried different hardware ethernet
ports. Never fully resolved why it always needed to be plugged in twice to
get the faster connection.

~~~
fit2rule
Intel's IEGBE drivers used to have a bug that would cause exactly this
problem. (I know, because I found the bug and fixed it once, years ago..)

~~~
throwanem
Disabling autonegotiation and forcing 1Gbps full-duplex on the NIC might solve
it, too.

------
cpeterso
In this BeOS debugging story ("A Testing Fairy Tale"), their floppy disk
stress tests can run all day but fail when run overnight. The morning sunlight
through the office window triggered the test machine's floppy disk write-
protection mechanism, causing a write failure during the test.

[https://www.haiku-os.org/legacy-docs/benewsletter/Issue4-22....](https://www.haiku-os.org/legacy-docs/benewsletter/Issue4-22.html)

------
kuahyeow
This reminds me of the time when I managed to diagnose a packet storm issue
after two days of methodically excluding software then hardware followed by
tracing each Ethernet cable to the switch. Turned out that the re-connection
of the cables caused a network loop within a dumb switch, a la packet storm!

------
pbhjpbhj
Well my best-worst debugging story concerns a friend's work-based email
account and Microsoft Outlook (in 2012). Occasionally it would fail to connect
to send email to the server, just randomly.

Obvious troubleshooting ensued, web traffic worked, could ping the email
server, could connect with telnet and read email that way; Thunderbird worked.
Created a new account, that would work for a while and then fail again.

Less obvious troubleshooting, traced route to server whilst running the
connection - route worked, connection failed. Outlook logs showed attempts to
connect to the correct URL but the connection wasn't being made. Checked for
malware. Reset router, actually I think we replaced it. Pulled out a
sysinternals tool, tcpview IIRC, watched the connections being made ... hang
on, what's that IP address??

Turns out Windows was querying and getting the IP address but somewhere it was
reversing the dotted quad and when Outlook said it was connecting to the
relay.example.com server - let's say 6.7.8.9 - it was instead attempting to
connect to 9.8.7.6 ...

I didn't track whether it was MS Windows or Outlook that was in error, I just
dropped the correct address in as a line in the HOSTS file on the three
affected computers. Fixed.

Very satisfying to find the way the problem arose and an easy fix; but would
love to have seen internally where the error was arising and exactly why. I
did find one other report that sounded like the same problem IIRC. My only
idea was that an automated reverse-IP hostname like some ISPs use - like
"9-8-7-6.ispnet.com" - was for some reason getting parsed in as the IP, but I
wasn't about to start reverse engineering stuff to find out.

~~~
raldi
Sounds like someone forgot to call this function, and maybe most of the
systems were big-endian, so it didn't matter, but one was little-endian:
[https://linux.die.net/man/3/ntohl](https://linux.die.net/man/3/ntohl)

~~~
andreareina
Endianness was the first thing I thought of as well. Or maybe one component of
the stack thought that IP addresses are char[4] while another thought they're
u32_t, though you'd expect that to be caught by the typechecker.
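
The byte-order theory in the two comments above is easy to reproduce. This is a minimal sketch of how a skipped or mismatched conversion would turn 6.7.8.9 into 9.8.7.6. The function name `misread_address` is made up for illustration, and nothing here confirms that this is what Windows or Outlook actually did.

```python
import socket
import struct


def misread_address(dotted):
    """Reverse a dotted quad the way a byte-order mix-up would."""
    # inet_aton returns the 4 address bytes in network (big-endian) order.
    packed = socket.inet_aton(dotted)
    # The hypothetical bug: one layer reads those bytes as little-endian...
    (value,) = struct.unpack("<I", packed)
    # ...and the next layer writes the integer back out as big-endian.
    return socket.inet_ntoa(struct.pack(">I", value))


print(misread_address("6.7.8.9"))  # 9.8.7.6
```

The explicit `<` and `>` format prefixes make the result the same on any host, whereas a real `ntohl` mix-up would only show up on little-endian machines, which fits the guess that most systems involved never exposed it.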

------
ajkjk
Is anyone else weirded out that there are posts on reddit that are ten years
old?

------
rincebrain
At a prior employer, we had racks of Dell servers, each with their own disk
boxes attached to RAID controllers.

Sometimes, more than one of the servers at the same time would decide that
their respective disk enclosures had disappeared and reappeared, and the RAID
controllers would be unhappy and mark the volume as foreign until a human
intervened.

Windows and Linux both did this, so it wasn't an OS problem, and it was
multiple machines in multiple racks, ostensibly UPSed with line filtering.

The odder thing was when we noticed it was almost always the ones on the upper
half of the racks.

The best, though, was what "resolved" the issue - the power exchange next to
the building blew up one day (AIUI one of the phase lines for the three phase
connected to another phase's busbar, and BOOM, the room was covered in a fine
copper mist), and after all the power equipment in the exchange was
(eventually) replaced, the problem went away.

------
sandworm101
Wacky story, but this is very typical. They had unsecured wireless devices
attached to their network. A misconfigured device, misconfigured due to the
hurricane, ended up causing internal problems. I must ask, why was a wifi
device so ready to connect to some random device? Be glad this wasn't a rogue
device.

~~~
chime
Original commenter here. Talk about a blast from the past!

It was indeed surprising and scary for me too. Hard to recall precise details
from a decade ago but I think that both the devices connected because I used
the same password on them and while the reset caused the test WAP to become a
repeater, it still had the credentials necessary to connect to the production
WAP.

I did not have enough time or resources to find the root cause because we were
still getting back up, I believe after Hurricane Charley.

~~~
raldi
Hey, a reunion! In case you were wondering, the reason I submitted this was
that it recently came up in /r/TenYearsAgoOnReddit:
[https://www.reddit.com/r/TenYearsAgoOnReddit/comments/5jo2dk...](https://www.reddit.com/r/TenYearsAgoOnReddit/comments/5jo2dk/20061223_the_case_of_the_500mile_email_redditcom/)
(then ctrl-f "original post")

~~~
chime
That's awesome. Thanks for bringing back the nostalgia :)

------
hlesesne
In college we had a similar scenario when a UPS delivery truck making its
routine stop at a fairly consistent time each day caused connectivity to
degrade. If I remember correctly, the solution was discovered following a UPS
strike in the late 90s.

------
vonklaus
I'll try and dig my favorite debugging story up about all emails failing to
send past ~500 miles. That was pretty crazy, near sorcery as well.

~~~
cldellow
Do you perhaps mean
[http://www.ibiblio.org/harris/500milemail.html](http://www.ibiblio.org/harris/500milemail.html)?
:)

~~~
vonklaus
Thanks! That was one of the more Kafkaesque debug stories. Going to reread it
for old times' sake and encourage anyone who hasn't to give it a look over.

------
Aardwolf
This reminds me of "COMPUTER-RELATED HORROR STORIES, FOLKLORE, AND ANECDOTES"
which is full of this type of story:

[http://wiretap.area.com/Gopher/Library/Techdoc/Lore/rumor.ne...](http://wiretap.area.com/Gopher/Library/Techdoc/Lore/rumor.net)

------
WillKirkby
I'm reminded somewhat of the PS1 cartridge saving bug:
[http://www.gamasutra.com/blogs/DaveBaggett/20131031/203788/M...](http://www.gamasutra.com/blogs/DaveBaggett/20131031/203788/My_Hardest_Bug_Ever.php)

------
rawdan
haha, talk about hard code debugging.

I think this post perfectly illustrates the major difference between "theory &
practice".

~~~
hprotagonist
"in theory, there's no difference between theory and practice. "

~~~
drfuchs
... but in practice, there is. - Jan van de Snepscheut

~~~
mirimir
But in practice, theories are _only_ hypotheses, which at best can just be
falsified ;)

------
ww520
This is a delicious debugging story.

------
Ericson2314
Clearly, the solution is software defined networks.

