
Software Folklore – A collection of weird bug stories - tpaschalis
http://beza1e1.tuxen.de/lore/index.html
======
AnimalMuppet
I've posted this story before, but it fits here rather nicely.

I had a function that looked like this:

    
    
      void f() {
        bool flag = true;
        while (flag) {
          g();
        }
      }
    

This function would sometimes exit. But that's really all there was to the
function. Somehow flag was becoming false, even though nothing ever wrote to
it.

So you might think about g() smashing the stack, when a variable is
mysteriously changing, but you'd expect the return address to also get
written, and it wasn't - the function returned from g() to f(), found flag to
be false, exited the loop, and returned from f().

Eventually I got desperate enough to look at the assembly code produced by the
compiler, and I became enlightened. (This was g++ on an ARM, by the way.) flag
was being stored in R11, not in memory. (Might have been R12 - it's been a
while.) When g() was called, f() just pushed the return address. Then g()
pushed R11, because it was going to have its own variable to stash there, and
then created space for its stack variables. And one of those variables was
smashing the stack by 4 bytes, over-writing the saved flag value from f().

Worse, the way the stack was getting smashed was on a call to mesgrecv(). This
takes a pointer to a structure and a size, but the relationship between the
two isn't what you'd expect. The size isn't the size of the structure, but
rather the size of a substructure within that structure. A contractor had
gotten that detail wrong when they used that mechanism for IPC between two
chips. (They'd gotten it wrong on the sending side, too, so the data stayed in
sync.)

The net result was that the flag got cleared when four next-door-but-unrelated
bytes _on another CPU_ were all zero. It took me a month, off and on, to
figure that out.
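
For anyone who hasn't met this kind of size convention: the System V `msgrcv()` call works the same way, with the size argument counting only the payload member and not the whole structure. Assuming `mesgrecv()` followed a similar convention (an assumption; the real structure isn't shown here), a rough sketch with Python's `ctypes` shows how far the write runs past the buffer when the size of the whole structure is passed. The `Message` layout and `TEXT_LEN` are made up for illustration.

```python
import ctypes

TEXT_LEN = 64  # hypothetical payload size, chosen only for this sketch


class Message(ctypes.Structure):
    """Illustrative layout modelled on the System V msgbuf convention."""
    _fields_ = [
        ("mtype", ctypes.c_long),             # header: message type
        ("mtext", ctypes.c_char * TEXT_LEN),  # payload that is actually received
    ]


def overrun_bytes(struct_type, payload_field):
    """Bytes written past the buffer if sizeof(struct) is passed as the size."""
    correct_size = getattr(struct_type, payload_field).size
    wrong_size = ctypes.sizeof(struct_type)
    return wrong_size - correct_size


if __name__ == "__main__":
    extra = overrun_bytes(Message, "mtext")
    print(f"passing sizeof(struct) overruns the buffer by {extra} bytes")
```

The overrun equals the size of the header field. On a 32-bit ARM target `long` is 4 bytes, which would be consistent with the 4-byte overwrite described above.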

~~~
tonyarkles
Crazy thing to go with that... if you'd compiled with different (more
aggressive) optimization flags, it might have gone away!

~~~
AnimalMuppet
It already went away when I tried to print out the address of the variable, so
that I could watch it in the debugger (because, in order to take the address
of it, it had to become a stack variable).

~~~
yitchelle
In the end, do you remember what tools you used to confirm that R11 was
overwritten? The tools and the path to the root cause are also quite
interesting.

~~~
AnimalMuppet
I first looked at R11 because of the assembly output. There are flags that you
can give g++ to produce the assembly output when it compiles. That showed me
that the variable was in R11, and where it wound up on the stack in the g()
function body.

From there, it was a question of how g() was smashing the stack. (I hadn't
looked at that before, because I assumed that it had to be _f()_ smashing the
stack in order to change the variable.) Well, the next thing on the stack was
the structure for mesgrecv. If too much got read into it, it would overwrite
the stored copy of R11. That led me to look _very_ carefully at the mesgrecv
call. Checking the parameters against the man page showed up the unexpected
(at least to me) requirements for the size parameter.

I never "verified" that the stored copy of R11 was being overwritten, except
by changing the size parameter and noting that the loop in f() never
terminated any longer.

------
gumby
I love these toughies, especially the full ssh one! A true debugging wizard.

“Dumb” problems can happen to anyone too. I once walked by the desk of
$well_known_open_source_developer, who was struggling with a mysterious bug.
He’d narrowed it down to the specific function and was groveling in the
function setup code (what the compiler generates before your code is called).
He asked me to take a look and within seconds I saw an uninitialized variable
being read.

This is not because he was a bozo! He had decades of experience. It’s simply
that sometimes we get slightly wedged and can’t see the thing that is “staring
us in the face”. He was embarrassed (so not mentioning his name) but he should
not have been. If anything it simply proves that it can happen to anyone.

Related to this: at one organization my debugging skills were (spoiler:
undeservedly) legendary...literally word got around until some new hire asked
me about it months later.

Why? I came in one morning to find some folks trying to get some new model of
terminals to work with the mainframe. Back then you needed the right combo of
byte length, stop bits, etc. They asked me if I could fix it and I said sure.
As one does, I poked at the setting switches and walked off to get my coffee so
I could come back and think clearly. By the time I came back all the terminals
were in use, so I just went on with my day.

Apparently I had randomly toggled the necessary bit. But the way the story was
told, I had walked in, agreed to help, rubbed my chin, then simply pushed the
right button and walked off without another word. Which in some sense is true,
but gave me completely undeserved credit.

~~~
ensiferum
When I was a kid (teenager) I worked at an indoor shooting range, mind you,
not real guns but just BB guns. I was supervising a bunch of school kids do
some practice shooting at biathlon targets (just 10m however) and one of them
had an issue with the gun with the pellet getting stuck somehow. I had a look
at the gun, sorted the pellet out and fired the gun off the hip and hit a
bullseye without aiming. Pure luck of course but the kids were like "woaaah"
and of course I never told them that it was just luck and not my mad leet
shooting skills XD

------
notacoward
Here's the craziest one that actually happened to me.

The company I worked for had installed what's best described as a mini-
supercomputer (though we avoided the term) at a site in Boulder. We started
getting reports of failures on the internal communication links between the
compute nodes ... only at high load, late in the day. Since I was responsible
for the software that managed those links, I got sent out. Two days in a row,
after trying everything we could to reproduce or debug the problem, I got
paged minutes after I'd left (and couldn't get back in) to tell me that it had
failed again.

Our original theory was that it had to do with cosmic rays causing bit-flips.
This was a well known problem with installations in that area, having caused
multi-month delays for some of the larger supercomputer installations in the
area. But we'd already corrected for that. It wasn't the problem.

What it ultimately turned out to be was airflow and cooling. The air's thinner
up there, so it carries less heat. But it wasn't the processors or links that
were getting too hot. It was the power supply. When a power supply gets warmer
it gets less efficient. Earlier in the day or with shorter runs as we tried
different things this wasn't enough to cause a problem. With it being warmer
later in the day, continuous load for longer periods was enough to cause
slight brown-outs, and _those_ were making our links flaky. And of course it
would always restart just fine because it had cooled down a bit.

The fix ended up being one line in a fan-controller config.

~~~
newswasboring
> Our original theory was that it had to do with cosmic rays causing bit-
> flips. This was a well known problem with installations in that area, having
> caused multi-month delays for some of the larger supercomputer installations
> in the area. But we'd already corrected for that.

Wow, I sense a more interesting story in here. Care to reveal how it was first
found out and how common it actually is?

~~~
notacoward
In a nutshell, cosmic rays causing bit-flips really is a thing, and it's more
of a thing at higher altitude because of less atmosphere. It's rarely a
problem at sea level. At higher altitude you really need to use ECC memory,
and do some sort of scrubbing (in Linux it's called Error Detection And
Correction or EDAC) to correct single-bit errors before they accumulate and
some word somewhere becomes uncorrectable.
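
To make the "before they accumulate" part concrete, here is a toy Hamming(7,4) code in Python. It is only a sketch: real ECC DIMMs use wider codes over 64-bit words, and this has nothing to do with the actual EDAC driver code. It just shows that one flipped bit is repaired, while a second flip in the same word before a scrub gets "corrected" into the wrong data.

```python
def encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] as a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(word):
    """Repair at most one flipped bit, then return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]   # parity over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]   # parity over positions 2, 3, 6, 7
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]   # parity over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = encode(data)
    word[2] ^= 1                   # first bit-flip
    print(correct(word) == data)   # a scrub at this point would repair the word
    word[5] ^= 1                   # second flip arrives before any scrub
    print(correct(word) == data)   # the "correction" now yields wrong data
```

Scrubbing amounts to running the correction step and writing the repaired word back often enough that a second flip rarely lands in a word that still carries the first.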

The incident that brought this home to a lot of people was at either NCAR or
UCAR, both near Boulder. Whichever it was, they were installing a new system -
tens of thousands of nodes - and had _not_ been careful about the EDAC
settings. Therefore, EDAC wasn't running often enough, and wasn't catching
those single-bit errors. Therefore^2, uncorrectable errors were bringing down
nodes constantly. According to rumor, this caused a huge delay and almost
torched the entire project. It's easy to say _in retrospect_ that they should
have checked the EDAC settings first, but as it happened they probably only
got to that after multiple rounds of blaming the vendor for flaky hardware
(which would generally be the more likely cause especially when you're on the
bleeding edge).

~~~
yjftsjthsd-h
> It's easy to say in retrospect that they should have checked the EDAC
> settings first, but as it happened they probably only got to that after
> multiple rounds of blaming the vendor for flaky hardware (which would
> generally be the more likely cause especially when you're on the bleeding
> edge).

Yeah, part of the nightmare of cosmic-ray bitflips (or any random bitflips, I
suppose) is precisely that they don't look like anything. A server randomly
locks up. A packet has a bad checksum (and is silently resent). A process gets
into an unexpected state. That buggy batch job fails 1% more frequently than
it used to. Nothing ever points to memory errors, except that there is no
pattern.

------
wolfspider
"Fail on certain moon phases" reminds me of a C++ bug I encountered while
trying to set up the demo for PSIP (Digital TV Guide) destined for NAB in Las
Vegas. We had programming schedules resembling Excel spreadsheets and my job
was just to create a good one for the demo. I would spend all night making one
and send it to my boss, and each morning I would get in trouble for sending in
blank schedules, with no idea why. On one occasion I happened to be editing
at 3am and noticed all of my edits rolling back one by one. It was actually
visible on the screen, as if someone had taken control of Excel and was rolling
back each field. My immediate thought was that I really needed to get some
sleep, but later we found the auto-save feature inverted itself after exactly
3am and would go through each delta one by one, rolling itself back as it had
been edited. The bug was found in the calculation of the vernal equinox, which
moves from 3am to 9pm to 3pm. Since it was triggering the leap year code, 6
hours of time would get rolled back, edits and all! This was of course 2008,
the year of the digital transition from analog cable, which also happened to
be a leap year.

------
rich_sasha
I can't scan documents when my daughter is asleep. When she is awake, all is
fine, but the minute she goes to sleep, and I'd like to use my free time to
scan documents and suchlike, forget it. I could still print documents on the
same device though. Here's what I found:

The printer-scanner was connected to wi-fi. The wi-fi router was in my
daughter's room, as that is where the cable socket was, tucked just behind a
bench in her room. It was also near that bench that her baby monitor camera
was standing. It wasn't wi-fi connected, but for whatever reason it interfered
with the wi-fi signal. Same with the receiver, if I put it near my laptop, the
wi-fi connection would die.

The monitor was off most of the time, and on precisely when my daughter was
asleep.

As for why I could still print, just not scan: presumably that's something to
do with the bandwidth, I'm guessing it took more wi-fi bandwidth to send a
scanned image than to print a document (I never printed pictures on this
printer).

~~~
GnarfGnarf
You really want to think about moving the Wi-fi out of the baby's room. Get a
GQ-390 meter and you will see the torrent of dangerous radiation flooding her
room, above recommended levels.

I was able to get my Internet provider to relocate the device to my basement.

~~~
chipuni
I just exposed myself to radiation from the largest fusion reactor in the
solar system! Should I panic?

....errr....

Wait.

I have just been slipped a note that the first sentence I wrote means "I just
went for a walk in sunshine."

~~~
jrockway
I have a fun story about the sun.

I used to work for an IPTV provider (who also happens to make a search
engine). We received TV feeds from TV stations via satellite; we had an
antenna farm, received their signals there, compressed/encoded the signal, and
then sent them to customers over the network. Because we only had one antenna
farm, we would have TV outages throughout the year -- sometimes the satellite
happens to be directly in front of the sun, and the sun is a huge RF emitter
that would overwhelm our receivers. (I asked if we can just move the
satellites, but was told that we didn't have the delta-V budget. We eventually
built another antenna farm. Another question I asked is why can't the TV
stations just send us video files over the network. Apparently it simply isn't
done; they have used satellites for decades, so why switch?)

------
angarg12
I got a tiny one of these.

One time I was writing some code in C. I found a bug, the solution seemed
pretty obvious, so I fixed it, recompiled the code, and ran it again. The bug
was still there.

I took a look at the rest of the code in case I had missed something. I
couldn't find anything, so I added a few print statements and recompiled. I
ran the code and nothing came up.

Interesting, apparently the code is not executing the branches it should. I
verified the input data and code. It didn't make sense, there had to be some
serious bug there that I didn't consider. I added a bunch more prints.

Recompile and execute. Still nothing. Wait a minute, THAT doesn't look good. I
added a print statement right at the entry point of the program. Nothing.

At this point the root problem became apparent; my changes just weren't
getting compiled. Phew, problem solved! I cleaned all the cached files and
recompiled the source code. Those print statements still weren't coming up.

In the end I had to move my source code to another machine and compile it
there to get it working. I suspect some global variables or path trickery to
be involved, but to this day I still haven't got a clue what was wrong, nor
have I seen it happen again.

~~~
smallstepforman
.o file stamp newer than .c timestamp. Make figures out file can be skipped.
Run recursive touch on these problematic files and headers.
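
A rough sketch of the comparison make is doing, written in Python rather than as a Makefile; the file names and timestamps are made up. An object file stamped later than the source makes every edit look already built, and bumping the source's timestamp (what `touch` does) restores the rebuild.

```python
import os
import tempfile


def needs_rebuild(source, target):
    """Mimic make's rule: rebuild only if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "main.c")
        obj = os.path.join(tmp, "main.o")
        for path in (src, obj):
            with open(path, "w") as handle:
                handle.write("")
        now = 1_000_000_000
        os.utime(src, (now, now))
        os.utime(obj, (now + 3600, now + 3600))  # object stamped an hour "ahead"
        print(needs_rebuild(src, obj))           # the edit to main.c gets skipped
        os.utime(src, (now + 7200, now + 7200))  # in effect, `touch main.c`
        print(needs_rebuild(src, obj))
```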

~~~
yxhuvud
The most fun is when the build then restarts, and does a build cycle until the
clock passes some magical timestamp making the build succeed.

~~~
russfink
More magical, your object files reside on a network shared drive that has a
clock slightly different than that of the compiler machine.

------
rogierhofboer
Display intermittently blanking, flickering or losing video signal:

[https://support.displaylink.com/knowledgebase/articles/73861...](https://support.displaylink.com/knowledgebase/articles/738618-display-intermittently-blanking-flickering-or-los)

"Surprisingly, we have also seen this issue connected to gas lift office
chairs. When people stand or sit on gas lift chairs, they can generate an EMI
spike which is picked up on the video cables, causing a loss of sync. If you
have users complaining about displays randomly flickering it could actually be
connected to people sitting on gas lift chairs. Again swapping video cables,
especially for ones with magnetic ferrite ring on the cable, can eliminate
this problem. There is even a white paper about this issue."

~~~
polytely
I've seen this one on Twitter:
[https://twitter.com/royvanrijn/status/1214162400666103808?la...](https://twitter.com/royvanrijn/status/1214162400666103808?lang=en)

There's probably some civilizational complexity limit where the unexpected
interactions between seemingly isolated pieces of tech become so bad that we
cannot introduce anything new without introducing legion of weird bugs.

~~~
no-s
>There's probably some civilizational complexity limit where the unexpected
interactions between seemingly isolated pieces of tech become so bad that we
cannot introduce anything new without introducing legion of weird bugs.

Come to think of it, I believe that meme was wandering about on Usenet from
before the early days. I think Vernor Vinge alluded to it in one of his
novels, (paraphrasing here) some interplanetary civilization crashing because
in-transit space traffic was so dense no new launches could occur in a useful
time frame due to safety-lock outs, and they didn't want to accept the risk in
changing the safety margins...

------
VBprogrammer
My favourite crazy bug was during a university course on autonomous robotics.
One of the other groups was using a metal castor at the back of the robot
along with 2 driven wheels. After a little while their robot would completely
crash and stop responding.

I'd previously encountered a similar issue which was due to the library code
we'd been given which opened a new /dev/i2c file for each motor command,
eventually exceeding the max file handles and killing the program. So I
assumed it was something sensible like that.
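
Roughly what that leak looks like, as a Python sketch (the real library was not Python as far as I recall, and the class names here are invented). A temporary file stands in for the `/dev/i2c` device, and this assumes a POSIX system.

```python
import os
import tempfile


class LeakyMotor:
    """Opens the device on every command and never closes it (the bug)."""

    def __init__(self, device_path):
        self.device_path = device_path
        self.open_fds = []  # kept only so the sketch can count and clean up

    def command(self, payload):
        fd = os.open(self.device_path, os.O_WRONLY)
        os.write(fd, payload)
        self.open_fds.append(fd)  # no os.close(fd): one descriptor leaks per call


class Motor:
    """Opens the device once and reuses the same descriptor."""

    def __init__(self, device_path):
        self.fd = os.open(device_path, os.O_WRONLY)

    def command(self, payload):
        os.write(self.fd, payload)

    def close(self):
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as stand_in:  # stands in for /dev/i2c
        leaky = LeakyMotor(stand_in.name)
        for _ in range(100):
            leaky.command(b"\x01")
        print(len(leaky.open_fds), "descriptors still open")
        for fd in leaky.open_fds:
            os.close(fd)
```

Run long enough, the leaky version hits the per-process descriptor limit and every later open fails.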

Some time later they got all excited and called us over to explain the real
reason it was crashing. Their robot would initially work fine for a reasonable
period of time. Then when the robot drove over the metallic tape on the floor
of the arena it would die. The robot must have been building up a static
charge while moving around which would eventually be dissipated when the metal
touched the tape.

I wouldn't have believed it had they not set up two tests, one outside the
normal arena and one inside. Changing the metal castor for a bit of lego fixed
the problem.

------
schemescape
Obvious in retrospect, but very surprising to my inexperienced past self:

I'd been working on some C code for an hour or two. It wasn't behaving how I
expected it to (and at the time I knew nothing of debuggers), so I added a
print statement and recompiled. I got a compilation error: something like
"Syntax error on line 123: #incl5de <stdio.h>". Shocked, I scrolled to that
line in my text editor to fix the typo, but it wasn't there. I compiled the
same code again and there were no errors.

Turns out there wasn't a bug in my code! I immediately shut down my computer
because my RAM was going bad. To this day, what surprises me most is that my
computer was able to successfully boot and behave normally for an hour or two,
even though random bits were apparently being flipped.

~~~
ignoramous
Reminds me of a rogue NIC, flipping a single bit every now and then, that took
down S3:
[https://youtube.com/watch?v=swQbA4zub20&t=46m02s](https://youtube.com/watch?v=swQbA4zub20&t=46m02s)

Ref:
[https://news.ycombinator.com/item?id=13859733](https://news.ycombinator.com/item?id=13859733)

~~~
mrtksn
I recall a talk about someone registering domains similar to google.com and,
due to occasional bit flips, people landing on these domains.
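
A small sketch of how such look-alike names can be enumerated, assuming a single flipped bit in one ASCII character of the name; the function name is made up and this is not taken from the talk.

```python
import string

HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")


def single_bit_flips(label):
    """Return every valid hostname label that differs from `label` by one bit."""
    variants = set()
    for index, char in enumerate(label):
        for bit in range(8):
            # DNS names are case-insensitive, so fold the flipped character.
            flipped = chr(ord(char) ^ (1 << bit)).lower()
            if flipped in HOSTNAME_CHARS and flipped != char:
                variants.add(label[:index] + flipped + label[index + 1:])
    return sorted(variants)


if __name__ == "__main__":
    for name in single_bit_flips("google"):
        print(name + ".com")
```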

~~~
baud147258
An article on something like this was linked on HN a few years ago:
[https://news.ycombinator.com/item?id=2944445](https://news.ycombinator.com/item?id=2944445)

------
arethuza
I remember reading one years ago where someone had a problem installing new
software on some embedded device - whatever they did it came up "checksum is
bad".

After much testing they eventually realised that the checksum literally was
the hex "bad".

------
stevesimmons
I have two favourite bugs, one weird and one dumb.

Weirdest one was an IDE where the colorizer gave up on source lines longer
than 998 chars. Instead it rendered the whole line as background, i.e.
invisible. I once wasted two hours debugging a program with an invisible line
of code!

The dumbest was a postage billing system for a bank using a third party Print-
and-Mail company. Somehow the billing system went live adding the previous
day's total postage costs to itself, then adding the new day's postage. These
exponentially growing totals were then paid automatically by the accounting
system each night... So it goes live, ... and a week later Finance gets an
alert the account is overdrawn... They actually paid out nearly $1b in postage
costs before hitting their internal credit limit with the bank's treasury.
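
The arithmetic, as a sketch of one reading of that bug (each day's total is yesterday's total added to itself, plus today's postage). The daily figure is invented, not the bank's real number.

```python
def buggy_totals(daily_postage, days):
    """Running totals when yesterday's total is added to itself before today's postage."""
    totals = []
    total = 0
    for _ in range(days):
        total = total + total + daily_postage  # bug: previous total counted twice
        totals.append(total)
    return totals


if __name__ == "__main__":
    # Hypothetical constant postage of 100 per day, over one week.
    for day, total in enumerate(buggy_totals(100, 7), start=1):
        print(f"day {day}: {total}")
```

With a constant daily amount d, the total after n days is d * (2**n - 1), so the payments roughly double every night.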

~~~
lainga
Which IDE was it?

~~~
Drdrdrq
_That_ is what you want to know? :D

~~~
stevesimmons
A highly customised Visual Studio I think, with lots of in-house extensions.
To be fair, this was long, long ago. Nowadays you'd write an extension and
leverage the Language Server Protocol.

~~~
Drdrdrq
Cool, good to know... Now, did they get that $1B back? :)

------
jtlienwis
My favorite was when I was working at SGI after it had taken over Cray
Research. I was one of the lowly Cray guys in Wisconsin working with the
wunderkinds in California. I was to run the regression tests on the chip being
designed in California using some software that they had provided. I would run
the tests, but some days they would crash in the middle of the night. Then the
California guys would be angry that they got no test results. I started
debugging the code and got to a program called lswalk that would dole out jobs
to the dozens of servers to be 'run'. The code was written by a hot shot young
MIT graduate, but I was sure that the problem was with this code. I got the
source code and started looking for problems, and one thing I found was that if
one of the servers responded with an error, the code would print out an error
message. One problem though... The error message printed an uninitialized
string, so the printf routine would search for an end of string that was never
there, probably overwriting buffers and crashing code all over the
place. So one lesson is that even the best and brightest make mistakes.
Sometimes I wonder how we accomplished anything in those days with software
that had so many trap doors beneath it.

------
indymike
I love this article.

My most recent bizarre bug: a coworker came to me with a bug where no matter
what he tried, he could not get an if some_var is null to be true. The
debugger would show the value was null. The logger showed the value was null,
but the if statement would not work. After a morning of trying to fix it, he
asked if I would take a look. I told him to put the null in quotes in the if.
It worked. Turned out a JavaScript library had a bug where it would use the
string "null" instead of null.
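
The same trap translated into Python (the original was JavaScript), using JSON since that's a common place for a stringified null to sneak in. The payloads and the `is_missing` helper are made up for illustration.

```python
import json

good = json.loads('{"some_var": null}')    # a real null, decoded as None
bad = json.loads('{"some_var": "null"}')   # the four-character string "null"


def is_missing(value):
    """Treat both a real null and the stringified "null" as missing."""
    return value is None or value == "null"


if __name__ == "__main__":
    print(good["some_var"] is None)   # the check you would normally write
    print(bad["some_var"] is None)    # the same check fails on the string
    print(is_missing(bad["some_var"]))
```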

~~~
imjared
There's a popular progressive API I've been forced to work with that uses
"true" and "false" instead of their respective booleans. The most egregious
that I've ever seen in this class of errors was an API (from Google!) that
returned " false".

------
rikroots
The very first time my company farmed me out to work onsite with a client. Day
1, Job 1: download the client's website code and get it running on my laptop.

... It just wouldn't work. Everything I tried - failed. Everything else on my
laptop was working fine, except this code. Everyone else who had ever
downloaded the code had managed to get it working on their machines within a
couple of hours. Colleagues working onsite with me tried to help, but
everything they tried - failed. Finally the decision was taken to reset my
laptop to factory defaults and reinstall everything. That took up half of Day
2. Tried to get the client's site running - failed! Things were beginning to
get really embarrassing - all this was happening in full view of the clients.
In desperation, my company called me back to their offices and issued me with
a new laptop. Back onsite, the code downloaded ... and worked first time!

Turned out that the issue was that my hard drive filesystem had been set up
(not by me!) as case-sensitive, and the client code included a file with an
all-caps filename, which the code called using a lowercase string. Almost lost
my job over that one.
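
In hindsight, a check like the following would have found it in minutes. It is only a sketch: the function and the file names are hypothetical, and it compares names in memory instead of touching the disk.

```python
def find_case_mismatch(requested, existing_names):
    """Return the on-disk name that matches `requested` only when case is ignored."""
    if requested in existing_names:
        return None  # exact match: loads on any filesystem
    for name in existing_names:
        if name.lower() == requested.lower():
            return name  # loads only where the filesystem ignores case
    return None


if __name__ == "__main__":
    on_disk = ["CONFIG.PHP", "index.php"]
    print(find_case_mismatch("config.php", on_disk))
```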

------
hnick
My weirdest that I can recall right now was a PDF file that would not print.
Since the printer was typically unhelpful with the error message, as was
support (this is a room-sized commercial printer but we didn't get the help
I'd really expect), I had to dive into it myself.

Long story short, whatever had produced the PDF had also embedded a TrueType
font where one character was named //something. This is fine. The character
just has a weird name, but it works. It's technically up to spec AFAIK, and I
got it out of the PDF with ttfdump to have a look at it.

Well the printer's internal RIP, unknown to us, converted the PDF to
Postscript when rasterising. And //something is called an "immediately
evaluated name" which I forget the details for, but basically this font
character, interpreted as postscript, was causing a lookup for a named
variable which did not exist. Hence the crash.

I had a similar one where Adobe InDesign had been used to make a PDF where
someone had selected the words to change the font, but not the spaces between
(or perhaps they did, and it was a bug). This meant that the PDF included a
subset font that only included the space character. Since the space character
is not drawn, this resulted in a 0 byte long glyf table. Based on my reading
of the TrueType font spec at the time, this isn't really proper.

Printer didn't like that one bit and died as it does to anything that smells
slightly wrong. Adobe said it was fine and up to spec though, apparently
'TrueType' has a different meaning inside a PDF :)

------
newswasboring
My favorite out of these is the 500 miles email limitation one. I work mostly
on big bulky manufacturing equipment but my job is to abstract out the
computing part. This story reminds me that every time I want to do something I
am still limited by physics. I am reminded of this story whenever the hardware
people ask me to insert an artificial delay in computation.

~~~
Zenst
Yes, the old limits-of-the-time style bugs are great. Some arbitrary limit, or
a variable sized to hold a value deemed way more than enough at the time, only
to jump out and catch you years or decades later. Y2K was one of the well
known ones, but there have been many bugs of that type.

------
ljm
I love programmer stories like this. My favourite personal experience was on
my first Ruby on Rails project after first moving to London. I was pretty
green at the time, having had only a few years of PHP experience under my belt
and little else.

We had to build a Rails app around a poker game. We didn't own the source to
the poker game or its API, but we had to embed it nonetheless. We had this
really strange issue where some people, under a certain circumstance, couldn't
get into the game. It would just boot them out. Me and my team mate must have
pored over the Ruby code dozens and dozens of times and found no evidence
of this bug, no ability to reproduce it; bearing in mind I was still learning
the ropes and jumping head first into an unfamiliar codebase is quite
daunting.

Eventually I decide to get my hands dirty and I start poking into this game
engine. We embedded it as a Flash widget, but the server doing most of the
work was written in a mix of C++ and Python. I didn't fully comprehend what I
was looking at but, even though things looked suspicious, I couldn't put my
finger on an actual problem until I looked at the API written in Flask and
noticed that one line of code didn't look like any other.

    
    
        some_value = params['some_key']
    

If the request didn't contain the parameter `some_key` then this would raise a
KeyError.

After maybe three solid weeks of trying to debug this thing, I submitted a one
line patch:

    
    
        some_value = params.get('some_key')
    
    

It's not quite as weird or as fun as most examples but for me personally it
was such a great lesson in debugging and being curious about unfamiliar stuff,
rather than closed off or afraid.
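
A stand-alone illustration of the difference between the two lines, using a plain dict as a stand-in for whatever `params` was in that codebase (that part is an assumption; the helper names are made up too).

```python
def read_strict(params):
    """Original style: raises KeyError when 'some_key' is absent."""
    return params['some_key']


def read_lenient(params):
    """Patched style: returns None when 'some_key' is absent."""
    return params.get('some_key')


if __name__ == "__main__":
    request_params = {'other_key': '1'}  # a request that omits some_key
    try:
        read_strict(request_params)
    except KeyError as missing:
        print("strict lookup failed on", missing)
    print("lenient lookup returned", read_lenient(request_params))
```

Note that `.get()` only moves the problem: whatever uses `some_value` afterwards has to cope with `None`.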

------
teddyh
See also:

“COMPUTER-RELATED HORROR STORIES, FOLKLORE, AND ANECDOTES”

[https://www.cs.earlham.edu/~skylar/humor/Unix/computer.folkl...](https://www.cs.earlham.edu/~skylar/humor/Unix/computer.folklore.from.net.rumors.html)

“Computer Stupidities”

[http://www.rinkworks.com/stupid/](http://www.rinkworks.com/stupid/)

------
darekkay
There's a GitHub repository for such stories [1]. I've even contributed one of
my stories: "Script crashes before 10 a.m." [2]

[1]: [https://github.com/danluu/debugging-stories](https://github.com/danluu/debugging-stories)

[2]: [https://darekkay.com/blog/script-crashes-before-10/](https://darekkay.com/blog/script-crashes-before-10/)

------
nervousvarun
This scratches a "thedailywtf" itch I had forgotten I had :)

------
BitwiseFool
We're actually working on a collection of such stories internal to our
division. We've found that these tales are a great way of helping people
understand the complexities and quirks of our nearly 3 decade old code base.

~~~
qznc
I think story telling is an underrated technique in our profession.

In all projects there are coding rules (like "make destructors noexcept"). A
rule sticks much better if you also tell a story about some debugging caused
by _not_ following the rule.

~~~
disgruntledphd2
I worked at a place once that had a style guide for the main front-end language
used, with links to terrible things that had happened as a result of breaking
the style guide's rules.

It was surprisingly effective, so I completely agree with you on the story-
telling.

------
yxhuvud
One favorite in the category is always the good old "Can't print on Tuesdays"
Ubuntu bug that has been submitted here a bunch of times:
[https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...](https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161/comments/28)

------
KingOfCoders
I had this one.

We were using mSQL in the 90s for web projects. A very important customer
wanted a "real" database so we bought DB2. Because we didn't have an IBM
platform or Solaris we went with Windows NT.

Everything went fine, until one day we noticed the website being slow.
Investigating pointed to the database as the culprit. So I went there, logged
into the NT box in the data center and checked DB2. Everything was fast.
Back to my desk, and the database was slow again after some time. Back to the
NT server, and the same thing happened.

After quite a long time I found the real culprit: the NT GL pipes screen
blanker, which rendered in software. After some time without interaction the
screen blanker started up and took all the CPU, so the database and the website
went slow. Someone had set the screen blanker to the nice GL pipes renderer.

[Searching the web, IBM introduced DB2 for Windows NT 31.10.1995 and I went to
Cebit that year to check it out]

~~~
villuv
This reminds me of a nice day we spent at a customer's premises trying to figure
out why DB2 wouldn't install or start properly on a win2k box. Weird error
messages etc. The problem was that it didn't like that the box was named 'DB2'...

------
Tade0
Was looking for the 500-mile email - it's there.

------
marcodave
First month or so at my new employer, big consultancy firm for a financial
institution. Had a fairly complex distributed monolithic application
integrated with Tibco EMS, Oracle DB and distributed XA transactions.

Regularly, but randomly, in production, after receiving a good amount of
messages in the input queues, (which then got rerouted to other event queues
for parallel processing) some DB transactions simply were getting stuck. Not
rolled back, but stuck in limbo -- after a while the DB simply refused new
transactions because so many were stuck. Nobody had a clue why that was
happening, it meant regular manual restart of the services and re-feeding of
the failing messages. Users started to get fed up and the project threatened
to fail.

I got into it, and after a couple of weeks of investigation and trial and error
with all possible weird flags, it turned out that the version of Tibco EMS had
a _weird_ behavior with distributed transactions when the queues got full of
messages (queues had a 50MB size limit).

Instead of rolling back gracefully the JMS+JDBC XE transaction, it...kinda
exited with an IO error.

Turned out that newer versions of Tibco EMS fixed that issue, but there was no
way to ask ops to install a new version. Since upgrading was out of the
question, the actual fix was to enable message compression to limit the size
of the messages coming into the queues. Turned out that the XML messages we
sent there were up to 1.5MB (!)

After discovering that, became basically a war hero and respected by the
client as the "savior of the project". Good times.

~~~
tlavoie
Your compression workaround reminded me of an issue I ran into a while back.

My team at work uses a reporting tool for vulnerability assessments and
pen-tests; basically you can import a bunch of data files, review them in the
web app, and generate a report.

I would run into cases where I couldn't upload one of my data files. The web
app is JS-heavy, lots of things going on in the background without much
visible feedback. It turns out that the programmers had implemented the upload
as this async task with a hard-coded timeout for completion, and they likely
wrote it while they had great network speed.

I'm on DSL, and generally, it gets the job done. However, upload speed is only
1Mbit/s, so with a big file, my upload would time out. It's hard-coded,
remember, so it didn't matter that the upload was still functioning when it
got clobbered.

It occurred to me that some file formats, like WAR or Office documents, are
basically Zip archives under the hood, so I put my large XML file into one,
and tried that.... and it worked! Something on the back-end quietly unzipped
my upload and imported the file it contained.
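
As a rough illustration of why the trick helped, here is a minimal Python sketch. It is not the tool's code: `zip_bytes`, `upload_seconds` and the sample XML are made up for the example, and the transfer time ignores protocol overhead. It zips a repetitive XML payload in memory and estimates the upload time on a 1 Mbit/s link:

```python
import io
import zipfile


def zip_bytes(name: str, payload: bytes) -> bytes:
    """Return a Zip archive (as bytes) holding `payload` under `name`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def upload_seconds(size_bytes: int, bits_per_second: int = 1_000_000) -> float:
    """Idealised transfer time, ignoring protocol overhead and latency."""
    return size_bytes * 8 / bits_per_second


# Hypothetical scan export: highly repetitive, like much machine-written XML.
xml = b"<finding><host>10.0.0.1</host><port>443</port></finding>\n" * 200_000
zipped = zip_bytes("scan.xml", xml)

print(f"raw:    {len(xml):>10} bytes, ~{upload_seconds(len(xml)):.0f} s")
print(f"zipped: {len(zipped):>10} bytes, ~{upload_seconds(len(zipped)):.0f} s")
```

Real exports are less repetitive than this sample, so the ratio will be smaller, but XML usually shrinks enough to matter against a fixed timeout.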

Funnier is that when I mentioned it to the devs, this behaviour was not
something they expected. Probably built into a library they use.

------
me_again
I had a function which only failed at 8 or 9 minutes past the hour.

It parsed a string containing the timestamp and "08" or "09" was interpreted
as an invalid octal number. Argh.
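
A minimal Python sketch of the failure mode. The original language isn't important here; `parse_auto_base` is a hypothetical stand-in for any parser that treats a leading zero as "this is octal", and `parse_minutes` is the obvious fix of forcing base 10:

```python
def parse_auto_base(text: str) -> int:
    """Hypothetical parser with C-style base detection: leading 0 means octal."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)  # "08" and "09" contain digits that octal lacks
    return int(text, 10)


def parse_minutes(text: str) -> int:
    """The fix: always state the base explicitly."""
    return int(text, 10)


def parses(text: str) -> bool:
    try:
        parse_auto_base(text)
        return True
    except ValueError:
        return False


# Which zero-padded minute values break the auto-detecting parser?
failing = [minute for minute in range(60) if not parses(f"{minute:02d}")]
print(failing)  # → [8, 9]
print(parse_minutes("08"), parse_minutes("09"))
```

"00" to "07" happen to mean the same thing in octal and decimal, which is why only two minutes per hour misbehaved.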

~~~
mankeysee
The beauty of loose dynamic typing

------
coreyp_1
This one happened last night. A student contacted me because her Anaconda
Jupyter notebook (installed on her laptop) just wouldn't connect to the Python
kernel. (The notebook itself would load, though, meaning that the server was
running fine. It's just that the kernel and its websocket were failing.) I
should point out that, because of COVID-19, this troubleshooting was over
Zoom, which complicated the diagnosis a bit.

She had not been using Jupyter for several months, as we have been writing
stand-alone programs in class using Spyder (the editor that comes with
Anaconda), and the command line, and Jupyter had worked the last time that she
tried it.

We restarted everything, and still the problem was there. I helped her to
update everything, but that didn't solve the problem.

Finally, I looked at the error messages in the console where the Jupyter
server is running. It had a huge list of errors, all relating to the pickle
library.

We had done an exercise with pickle in the class, but nobody had reported a
similar problem. When we looked in her classwork directory, though, we saw
that she had created a "pickle.py" file when she was testing something with
pickle. But, at that point in the class, we were working in the command line,
and everything (including Spyder) still worked just fine.

Evidently, this was the cause of Jupyter's problem. When starting the Python
kernel, Jupyter imported pickle, and it picked up her test file rather than
the actual library. The fix was simple: we renamed her test file, and
everything worked perfectly.
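
For anyone who wants to see the mechanism, here is a minimal sketch, assuming the kernel's working directory ended up ahead of the standard library on the module search path (I didn't verify the exact path order on her machine). `resolve_module` is a helper made up for this example; it asks Python's path-based finder which file an import would load for a given search path, without actually importing anything:

```python
import importlib.machinery
import sys
import tempfile
from pathlib import Path
from typing import List, Optional


def resolve_module(name: str, search_path: List[str]) -> Optional[str]:
    """Return the file that `import name` would load for this search path."""
    spec = importlib.machinery.PathFinder.find_spec(name, search_path)
    return spec.origin if spec else None


with tempfile.TemporaryDirectory() as classwork:
    # Stand-in for the student's test file in her classwork directory.
    (Path(classwork) / "pickle.py").write_text("# just testing something\n")

    # Earlier entries win, so the classwork directory shadows the real module.
    print("classwork dir first:", resolve_module("pickle", [classwork] + sys.path))
    print("default search path:", resolve_module("pickle", sys.path))
```

A quick diagnostic in the same spirit is `import pickle; print(pickle.__file__)`, which shows immediately whether the real library or a stray local file got loaded.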

------
throwaway3563
Had a flaky unit test that would randomly fail with some random Chinese
character in the output.

The test was running a log parsing tool against a temporary file that had a
pseudo-SQL syntax where you could “select ... from
c:\Users\\...\temp\abcd1234.xyz\testdata.dat”. The temporary directory was a
randomly generated name so that the folder was guaranteed to be empty before
every execution of the test.

The test failed on the rare occasion that the randomly generated temp dir
consisted of the letter ‘u’ plus four characters that were valid hex digits.
When this happened the randomly generated dir name interacted with the
backslash before it and became a Unicode escape sequence. It was easy to fix
but that test was flaky for months before anyone worked out why.
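
A minimal Python sketch of the collision. This is not the tool's real parser: `unescape` is a hypothetical stand-in for whatever processed `\uXXXX` escapes in the query text, and both paths are invented examples:

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape(text: str) -> str:
    """Replace every \\uXXXX sequence with the character it names."""
    return _UNICODE_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


# Usual case: the random directory name does not look like an escape.
harmless = r"c:\temp\abcd1234.xyz\testdata.dat"

# Unlucky case: the name starts with 'u' followed by four hex digits.
unlucky = r"c:\temp\u4e2dab12.xyz\testdata.dat"

print(unescape(harmless))  # unchanged
print(unescape(unlucky))   # the directory name now starts with a CJK character
```

The path separator plus the first five characters of the directory name get swallowed into a single character, so the file is no longer found and the odd character shows up in the output.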

------
chiph
A bank I did some contracting for had a problem where their Token-Ring network
would crash at random intervals during the day in one of their branches. It
would also crash at night, but the times when it would happen were more
predictable.

And that was the clue they needed to solve the problem - it turned out that
the wiring installers had run the cable up the elevator shaft. When the
elevator stopped at a certain floor the door motor was sometimes interfering
with the signal. The more-regular nightly disruptions were because of the
security guard making his rounds.

It turned out that run was pretty close to the length limit for 16 Mbps
Token-Ring, so they added a repeater in the middle to boost the signal
strength.

------
nieve
I eventually gave up without finding the issue, but somewhere deep inside one
version of the Sphinx full text search software was a bug that would sometimes
switch which query got which result set. It only happened sometimes when queries were
within a few seconds of each other, but it wouldn't happen with only one front
end process even in multithreaded mode and would disappear if requests were
_too_ close together. If I'd found a way to reproduce it I'd have submitted it
to the Sphinx team, but after a few days of potentially private info leaking I
gave up and moved to PostgreSQL's FTS.

------
benibela
I just had a weird bug in a programming competition.

You basically had to sort the English letters that occur in a text in
descending order of frequency. Except the one letter that occurs the least
needs to be sorted as if it occurs the most.

The expected output of the sample case was TPFOXLUSHB

I ran my program on the sample, the output looked correct; then I submitted
it, and the judge said it failed the sample case. In fact, it was printing
ͲPFOXLUSHB

That nearly looks like the correct output.

I had confused two variables, and it was printing the frequency count as a
code point rather than the letter. But such a coincidence that it looks the
same.
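
A minimal Python sketch of that kind of mix-up. The values are hypothetical: the count 882 is picked for illustration because it equals 0x372, the code point of a Greek letter that resembles a Latin T:

```python
# Hypothetical values for illustration only.
letter, count = "T", 882

intended = letter     # what the program should print
printed = chr(count)  # the mix-up: the count interpreted as a code point

print(intended, printed, f"U+{ord(printed):04X}")
```

On screen the two characters are easy to confuse, but a byte-for-byte comparison by a judge will not be fooled.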

------
MaxBarraclough
Glad to see the More Magic story in the list.

------
arafalov
I used to work in senior technical support for BEA WebLogic and had all
sorts of crazy situations to debug remotely. Including one time when I had to
get a person to edit a config file in Vim (which they had never used before),
on Unix (which they had never used before), with me guiding them by phone (no
visuals).

This is the one I recorded that seems to fit into the current theme:
[https://www.outerthoughts.com/2004/10/perfect-multicast-
stor...](https://www.outerthoughts.com/2004/10/perfect-multicast-storm/)
(tldr: multicasting on 237.0.0.1 is bad).

And if somebody really understands networking and multicast, I would love to know
whether I actually nailed the problem or just made it go away accidentally. I
have no problems with being wrong, especially this much later :-)

------
kazinator
Similar to the train being stopped by a toilet flush, in the 90's, I worked
with devices based on Microsoft's Pocket PC OS. These were equipped with
wireless radios. The transmission of a packet caused some interference that
the device interpreted as a click on the screen. The cursor was over the [X]
to close the application window, so the application would just quit, looking
like it crashed.

------
villuv
IBM Java 1.1.8 that was embedded into Lotus Notes 5 (if I remember correctly)
didn't have the 29th of April if it happened to be a Tuesday.

When you constructed a Date object for the 29th of April in a year where it
was a Tuesday, you got the 30th of April when you read back the value. Took a
while to figure out why date calculations were sometimes off. The flux of
expletives was impressive when we finally did...

~~~
swsieber
Any ideas on why it happened? I'm impressed you found that bug though.

~~~
villuv
No idea, it was years ago. JVMs were extremely buggy. We were quite n00bs at
that time also, so we didn't dig into the depths of it; we had to fix a
production issue that we had already spent a lot of time on. We found the bug
by just narrowing down a broken case further and further. Finally we wrote
some code that just tried a range of dates one by one over several years to
find the pattern. Unfortunately it wasn't possible to upgrade the JVM as it
was embedded into Lotus Notes, so we wrote our own date implementation (yay!)
that satisfied our needs. It was fixed in a later Notes version, but our
sh*tty date implementation lived much much longer...
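
The brute-force scan was roughly this shape. This is a Python sketch written from memory of the idea, not the original Java; `buggy_round_trip` is a made-up stand-in that imitates the faulty runtime so the scan has something to find:

```python
from datetime import date, timedelta
from typing import Callable, List


def healthy_round_trip(d: date) -> date:
    """Construct a date from its parts and read it back."""
    return date(d.year, d.month, d.day)


def buggy_round_trip(d: date) -> date:
    """Hypothetical faulty runtime: 29 April on a Tuesday reads back as the 30th."""
    if d.month == 4 and d.day == 29 and d.weekday() == 1:  # Monday is 0
        return d + timedelta(days=1)
    return healthy_round_trip(d)


def find_round_trip_failures(
    round_trip: Callable[[date], date], start: date, end: date
) -> List[date]:
    """Try every date in [start, end] and collect those that don't survive."""
    failures = []
    current = start
    while current <= end:
        if round_trip(current) != current:
            failures.append(current)
        current += timedelta(days=1)
    return failures


for bad in find_round_trip_failures(buggy_round_trip, date(1990, 1, 1), date(2000, 12, 31)):
    print(bad.isoformat(), bad.strftime("%A"))
```

Printing the weekday next to each failing date is what makes the pattern jump out.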

------
themeiguoren
Not software, but along the lines of cool engineering stories one of my
favorites is this one about fixing a 230 kV, many-hundred-amp, 10-mile-long
coax cable in Southern California.

[https://www.jwz.org/blog/2002/11/engineering-
pornography/](https://www.jwz.org/blog/2002/11/engineering-pornography/)

------
blurryroots
Thanks, this is hilarious! "Okay! I'm braking now", definitely my new
going-to-the-toilet catchphrase.

------
teddyh
I'm very disappointed that the very first entry in the list is a bogus story
_confirmed_ to be false.

I mean, if you simply want general computing legends of unconfirmed veracity,
read “The Devouring Fungus” by Karla Jennings.

~~~
qznc
The list desperately needs another story where the title starts with A or B.

------
raverbashing
The SSH one is brilliant

I find that most of these "hard problems" come down to something small:
something subtle enough that things don't break immediately and still kinda
work.

Now, finding exactly what it is, that's the trick.

------
DesiLurker
For me it was a 21-day bug in a broadcast video encoder: some internal frame
counter was coded as int instead of int64 and would reset every 21 days. Fun
to debug it was not!
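
For a sense of scale, a minimal Python sketch of how long a counter of a given width lasts before it wraps. The tick rates below are hypothetical examples, not the encoder's real rate, and `days_until_wrap` is a helper invented for this illustration:

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(bits: int, ticks_per_second: float, signed: bool = True) -> float:
    """Days until a counter of the given width overflows at a steady tick rate."""
    capacity = 2 ** (bits - 1 if signed else bits)
    return capacity / ticks_per_second / SECONDS_PER_DAY


# Hypothetical tick rates, chosen only to show the orders of magnitude involved.
for rate in (60, 1_000, 90_000):
    print(
        f"{rate:>6} ticks/s: "
        f"32-bit lasts {days_until_wrap(32, rate):10.2f} days, "
        f"64-bit lasts {days_until_wrap(64, rate):.3e} days"
    )
```

Anything that counts faster than about once per frame eats a 32-bit counter in days or weeks, which is too slow to show up in normal testing and too fast to survive in production.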

------
smitty1e
It is good to pore over these legendary tales. They come in handy when we need
to break out of the moment and try something outrageous to solve the problem.

------
robocat
Crash cows: “There were often significant food shortages in the Soviet Union,
and the government plan was to mix the meat from Chernobyl-area cattle with
the uncontaminated meat from the rest of the country. This would lower the
average radiation levels of the meat without wasting valuable resources.”

There is some sense to that: low levels of radiation are not a cancer risk,
last I read - everything we eat is slightly radioactive. That said, I can’t
think how significantly radioactive cows could be “diluted” enough.

~~~
Y_Y
It's worth knowing that all the cows you eat are radioactive, just like
bananas. If the cow didn't die, you've got a good chance too.

------
andrepd
The Crash Bandicoot story is my favourite.

~~~
wodenokoto
This was my first thought too - "Wonder if that story with the PS1 controller
is in there" - man that must have sucked to debug!

~~~
dmbaggett
It did.

------
pascalmahe
Always impressed by Mel's story.

~~~
krallja
That describes a real machine, the LGP-30, which really had absolutely no
business being as powerful for the price and era as it was. It used an
oscilloscope for its (tiny) debug display: the operator’d read the voltage
waveform directly as binary.

[https://en.wikipedia.org/wiki/LGP-30](https://en.wikipedia.org/wiki/LGP-30)

------
superice
So at my previous employer we still had an old frontend running an atrociously
old version of EmberJS. I run the build but it suddenly fails. We were on a
three-weekly release schedule, so I figure it must have happened somewhere in
the past three weeks, about 400 commits. So I start git bisecting, teaching it
to a handful of my more junior colleagues as I go along. It took us way too
long to figure out, however, that the original build also failed.

So my teammates wish me good luck, and I go off on a journey debugging what
the issue actually is. As it turns out, the horribly old Ember.JS CLI package
version we're using is a version called '0.2.0-beta'. That did not bode well.
This frontend of course did not use the nice yarn dependency pinning, just a
regular old package.json file, so I go tracing the error into the
dependencies.

Eventually I trace the thing to a dependency nested three layers deep or so. A
library added a deprecation warning when being used. That in itself is not so
bad, but it did that using an injected logging framework from the package
using that library. Except that wasn't introduced until a much later version.
Of course this tiny little addition could never cause any breakage, so this
was released as a semver bugfix release.

The commit time shows it was 23:00 local time
([https://github.com/goldenice/ember-cli-
babel/commit/c4c95d6f...](https://github.com/goldenice/ember-cli-
babel/commit/c4c95d6f1637bfb8988f68dcacbaa436c6eb94bb)) when I figured the
problem out and committed a fix. So I submit a PR to the library, figuring
that if the author happens to be awake I won't have to figure out a way to pin
the dependency to an earlier version (which would have been easy in a regular
dependency, but this was a global dependency, where it's not as trivial as
switching from npm 2.x or 3.x to yarn)

The author thankfully responds almost immediately, asking me to provide a
fallback to console.warn instead of skipping the deprecation entirely. Makes
sense, so I update my PR, submit it within a few minutes, and I see that the
author immediately publishes a new version. Finally something works out for
me. Or so I thought.
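
The shape of the fix, sketched in Python rather than the original JavaScript. All names here are made up for illustration (`write_deprecation` stands for the logger hook that only newer hosts inject, and `warnings.warn` plays the role of console.warn):

```python
import warnings


def deprecate(message: str, ui=None) -> None:
    """Report a deprecation through the host's logger when it offers one."""
    write = getattr(ui, "write_deprecation", None)  # hypothetical hook name
    if callable(write):
        write(message)
    else:
        # Older hosts never injected the hook: fall back instead of crashing.
        warnings.warn(message, DeprecationWarning, stacklevel=2)


class NewHostUI:
    """Host version that injects the logging hook."""

    def __init__(self):
        self.lines = []

    def write_deprecation(self, message):
        self.lines.append(message)


class OldHostUI:
    """Host version that predates the hook entirely."""


new_ui, old_ui = NewHostUI(), OldHostUI()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    deprecate("option X is deprecated", new_ui)  # goes to the injected logger
    deprecate("option X is deprecated", old_ui)  # falls back to warnings.warn

print(new_ui.lines, [str(w.message) for w in caught])
```

The whole point is the `callable` check: a one-line condition like that is easy to break with an innocent-looking style tweak.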

As it turns out the author made a tiny stylistic fix in my code. Except that
tiny stylistic fix butchers my carefully crafted if statement, and now the
code is broken again. It took me a while to figure out that the new version of
the dependency WAS being used, but was also broken.

So I contact the author again, explain the situation. They changed the code
immediately, pushing out another update. In the meantime I figured out how to
do dependency pinning and all was well with the world again.

And that, kids, is the story of how I came to appreciate transitive dependency
pinning as a really useful feature.

(It's still amazing to me by the way that I can contact somebody that wrote
some random code that our code happens to rely on and get a response within
half an hour.)

