
My Hardest Bug Ever (2013) - shawndumas
http://www.gamasutra.com/blogs/DaveBaggett/20131031/203788/My_Hardest_Bug_Ever.php
======
kabdib
Ah yes:

\- An OS would write the "please wake me up on the next interrupt" value to a
hardware register, then to a shadow RAM location. An interrupt happened between
the two writes, so the task scheduler wrote the wrong value to the register and
stopped. Finding the problem: Two weeks. Fixing the problem: Swapped two
instructions and rebuilt. Write-only registers are evil badness.

\- USB controller with a temperature problem: It would also work first thing
in the morning, and after lunch. After maybe four days of correlating this the
"Aha!" was visible on my face at lunch and I left the table rather abruptly.
The initial fix was a bag of ice on the chip. Real fix took several weeks of
increasingly loud conversations with the chip vendor.

\- Cache on a disk drive had a bad bit, occasionally corrupting file contents
or (even more fun) directory structures. Memo: All the ECC in the world won't
help you if your I/O system is flappin' in the breeze. End-to-end checksum
anything you truly care about (in fact, we caught this one with a Merkle
tree).
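
A minimal sketch of the end-to-end idea in the last item, assuming SHA-256 over fixed-size blocks and a simple binary Merkle tree. The block size and the helper names (`block_hashes`, `merkle_root`, `first_bad_block`) are illustrative assumptions, not the tooling described above:

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4096  # assumed block size, purely illustrative


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash each fixed-size block of the payload (the leaves of the tree)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b""]
    return [hashlib.sha256(block).digest() for block in blocks]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise until a single root hash remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def first_bad_block(expected: list[bytes], data: bytes) -> int | None:
    """Return the index of the first block whose hash differs, or None."""
    for index, (want, got) in enumerate(zip(expected, block_hashes(data))):
        if want != got:
            return index
    return None


# Record hashes before the data goes through the I/O path...
original = bytes(range(256)) * 64            # 16 KiB of sample data
written_hashes = block_hashes(original)
root = merkle_root(written_hashes)

# ...then simulate one flipped bit coming back from a bad cache cell.
corrupted = bytearray(original)
corrupted[5000] ^= 0x01
read_back = bytes(corrupted)
```

Comparing `merkle_root(block_hashes(read_back))` against `root` says whether anything changed; `first_bad_block(written_hashes, read_back)` narrows it down to a block.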

Never mind the heap corruptions by code you don't even have source for, the
async callbacks that fire long after you've nuked the objects and assumed
everything was dead, the DMAs that came from outer space because you set them
up hours ago and they finally fired, wiping out something random every time,
the incoherent cache coherency systems, the serial ports that drop bytes if
you send too fast . . . it's amazing anything sufficiently complex actually
works.

~~~
tonyarkles
I think I would love your job. That's the kind of stuff I _love_ chasing down.

~~~
jovdg
Agreed. Also, I usually blame the hardware quite fast... Have seen enough
faulty GBICs, or wrongly plugged cables. But then that's what you see as a
sysadmin, not as a (game) developer...

~~~
digi_owl
Yep. If there was one thing a networking (the wires and cards kind) teacher
drilled into me, it was to check wiring first and then work my way up the stack.

------
13of40
About a thousand years ago, I bought a new 286 motherboard from the local
computer shop, and when I brought it home, DOS would boot on it, things like
edlin and debug worked, but anything like a game (Lemmings!) would hang. I
tore my hair out over this for a while until I noticed another symptom -- when
I ran the 'date' command, the system time never changed. It turned out the
clock on the motherboard was faulty, so it never fired the interrupt to tell
the computer time was passing, so anything that depended on that interrupt for
poor-man's threading would hang. But DOS booted up and ran like a champ.

~~~
acomjean
Speaking of time..

I inherited a bug that only showed up in the last 5 minutes of an hour. Except I
didn't know that. The first time I worked on it, I reproduced the error but
before I got the debugger started the time went by and the new hour caused the
bug to go away...(why does this bug not show when the debugger is on?)

Eventually by hand tracing I figured it out. Some weird time function was
rounding the minutes to the nearest 10 minutes. When the time crept past 55
minutes, it got rounded up to 60 and passed to a time function that didn't like...
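
A small sketch of that failure mode, assuming round-half-up to the nearest 10 minutes and using Python's `datetime.time` as a stand-in for the picky time function. The original code and language aren't given, so both function names here are hypothetical:

```python
from datetime import datetime, time, timedelta


def round_minutes_naive(hour: int, minute: int) -> time:
    """Round minutes to the nearest 10 and build a time; breaks for minute >= 55."""
    rounded = (minute + 5) // 10 * 10   # 55..59 becomes 60
    return time(hour, rounded)          # time() raises ValueError for minute=60


def round_minutes_safe(moment: datetime) -> datetime:
    """Same rounding, but let timedelta carry the overflow into the next hour."""
    rounded = (moment.minute + 5) // 10 * 10
    base = moment.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(minutes=rounded)
```

The naive version is fine for 55 minutes out of every hour, which is exactly why it hides from a debugger session that starts a few minutes too late.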

Only bug that was harder was a serial cable with flipped wires. Which kinda
worked, but gave out garbage. We figured that out with a scope.

------
JoshTriplett
This kind of hardware quirk is _really_ common in old consoles. In the NES and
SNES era, not only were these kinds of issues passed around by tribal
knowledge, but half of them were repurposed as features and used to push the
hardware further. Not random save corruption issues, obviously, but many other
hardware quirks became semi-documented "features" of the console, and years
later, emulators would have to reproduce them faithfully or games wouldn't run
correctly.

~~~
Liru
Got any examples of this? It seems interesting.

~~~
JoshTriplett
Super Mario Bros 3 did diagonal scrolling, which was not an intended feature
of the hardware; it also used some careful timing to split the screen between
the play area and the status area on the bottom.

_Many_ console games reprogrammed or switched palettes partway through
scanout to get more colors on the screen than normally possible; some did so
with careful timing in the middle of horizontal scanlines. Emulators commonly
don't implement this, because it adds significant overhead to the scanout
fast-path. See the screenshot of Air Strike Patrol in
[http://arstechnica.com/gaming/2011/08/accuracy-takes-
power-o...](http://arstechnica.com/gaming/2011/08/accuracy-takes-power-one-
mans-3ghz-quest-to-build-a-perfect-snes-emulator/) , where failing to emulate
intra-scanline changes results in the plane's shadow disappearing, which is a
critical gameplay component that makes it easier to aim.

------
Taniwha
I chased a similar bug a long time ago. We were putting Unix on a late '80s era
platform (which I won't name), and some systems' keyboards failed if you used the
floppy, but not others.

Turns out the floppy code sat in a tight loop polling the clock in a timer
chip whenever it waited for sectors to pass by. When the hardware did this, the
clock to the keyboard controller changed (got faster), and it turns out some
keyboard chips had old firmware (they swore they didn't) that couldn't
tolerate the faster clock.

A couple of no-ops in the floppy loop fixed it.

As someone pointed out above, metastability issues are the real impossible bug,
but the hardware guys (I wear both hats) should have got that right in the
first place.

------
DrScump
This reminds me of when I was a staff Consultant at an RDBMS company 25 years
ago.

A major telecom company was getting crashes in the server because of alleged
I/O errors (mostly writes) in the raw i/o version of the server (no
filesystem)... just a few times per several hundred thousand (or million)
transactions. Subsequent disk and controller diagnostics were always fine.
(This was before any fault tolerance was programmed into the server... my code
for this problem was the first ever.) Most failures were writes, a few were
reads. No obvious pattern (time of day, load, transaction type, lunar phase,
sunspots, strength of coffee, etc.) This was on a DEC with Unix System V and
DEC disk hardware, so only one vendor to deal with.

Anyway, seeing nothing "wrong" with the related code (other than no tolerance
for error codes from the raw i/o calls), I theorized that maybe the "errors"
were spurious and no actual fault or corruption was resulting, so I wrapped
the i/o calls in retry loops (with the number of retry attempts tunable by the
user) and logged any "failures" and results of the retry attempts. So, I did a
build with my changes, had the customer run from my directory, then wait and
watch for the carnage...
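
A wrapper along those lines might look like the sketch below. It is in Python rather than the server's own language, and `with_retries`, `max_retries`, and the `FlakyWriter` used to exercise it are hypothetical names, not the original code:

```python
import logging

log = logging.getLogger("rawio")


def with_retries(io_call, *args, max_retries: int = 3):
    """Call io_call(*args); on OSError, log it and retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            result = io_call(*args)
        except OSError as err:
            attempt += 1
            log.warning("I/O failure on attempt %d: %s", attempt, err)
            if attempt > max_retries:   # retries exhausted: surface the error
                raise
        else:
            if attempt:
                log.warning("I/O succeeded after %d retry(ies)", attempt)
            return result


class FlakyWriter:
    """Hypothetical device that reports spurious errors on its first few calls."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def write(self, data: bytes) -> int:
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError("spurious I/O error")
        return len(data)
```

With `FlakyWriter(failures=1)`, a call like `with_retries(writer.write, b"abcd", max_retries=2)` logs one failure, succeeds on the retry, and returns the byte count. The log is what tells you whether the errors were ever real.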

Turns out that every retry was successful. In fact, all but one of the dozen
or two per day were successful on the first try, and none needed more than two. No
actual flaw in data was ever found.

Anyway, it turned out to be some spurious error specific to a specific drive
type with that specific controller running that specific firmware... and
apparently their filesystem code knew to work around it.

Client site personnel were really nice to me, too.

------
bargl
This was an awesome article. It's been up here before. For relevant discussion
check
[https://news.ycombinator.com/item?id=6654905](https://news.ycombinator.com/item?id=6654905)

------
vmorgulis
It's a little like a row hammer bug
([https://en.wikipedia.org/wiki/Row_hammer](https://en.wikipedia.org/wiki/Row_hammer)).

------
castratikron
Quantum mechanics? Just sounds like plain old induction to me.

~~~
danso
The OP was originally published by Baggett on Quora, and then subsequently
republished on Gamasutra. The Quora posting has a few footnotes, including him
admitting that "quantum mechanics" was mostly a flourish. What he meant was
that, unlike other software bugs, _"the behavior was -- at least at the level
of the source code -- non-deterministic"_.

[http://www.quora.com/Whats-the-hardest-bug-youve-
debugged](http://www.quora.com/Whats-the-hardest-bug-youve-debugged)

\------

Footnotes for posterity:

A few people have pointed out that this bug really wasn't a product of quantum
mechanical effects, any more than any other bug is. Of course I was being
hyperbolic mentioning quantum mechanics. But this bug did feel different to
me, in that the behavior was -- at least at the level of the source code --
non-deterministic.

Some people have said I should have taken more electronics classes. That is
absolutely true; I consider myself a "full stack" programmer, but my stack
really only goes down to hand-writing assembly code, not to playing with
transistors. Perhaps some day I will learn more about the "bare metal"...

Finally, a few have questioned whether a better development methodology would
have prevented this kind of bug in the first place. I don't think so, but it's
possible. I use test-driven development for some coding tasks these days, but
it's doubtful we could have usefully applied these techniques given the
constraints of the systems and tools we were using.

------
userbinator
Reminds me of this 30-year-old hardware bug, related to metastability:
[https://news.ycombinator.com/item?id=5314959](https://news.ycombinator.com/item?id=5314959)

(The article's location has changed, it's now at
[http://www.pouet.net/prod.php?which=61024#c637759](http://www.pouet.net/prod.php?which=61024#c637759)
)

More information at:
[http://www.linusakesson.net/scene/safevsp/index.php](http://www.linusakesson.net/scene/safevsp/index.php)

------
GuiA
Debugged a (somewhat similar) problem a few months back on a USB device: it
would only happen when the device was plugged into non-grounded computers (e.g.
laptops running on battery). Working in hardware is fun.

------
davesque
Yeah, I remember this being posted before. This reminds me of another article
which I'm pretty sure was posted on HN as well about a single bit being
randomly flipped on someone's `expr` binary as it was loaded in memory. The
author, sort of jokingly, suggested that the bit flip was caused by a cosmic
ray:

[https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_...](https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1)

------
rbritton
I used to work at a hotel that ran a Nomadix gateway appliance to get around
guests having static IPs configured from their office environments. Without it
we were sure to get a call about them being unable to connect.

One year the hotel purchased another across the street and connected it via
fiber. As our network infrastructure was somewhat old we used a converter on
each end to take a standard patch cable from each switch and transmit the data
stream over the fiber. Everything worked as expected except for one key part:
computers attempting to access the guest network would never receive a DHCP-
assigned IP address. Static IP addresses worked just fine.

After quite a bit of packet sniffing and digging into specs I found the
problem: the packet size of the initial DHCP responses from the Nomadix
gateway was smaller than the minimum packet size of the converters. The fix
was to switch to a different model converter that operated at layer 2 instead
of layer 3 and thus didn't have the packet size minimum.

------
bwy
This has been reposted a few times now. Past discussion:
[https://news.ycombinator.com/item?id=6654905](https://news.ycombinator.com/item?id=6654905)

------
digi_owl
Got me thinking about a Bil Herd video about working on the C116.

[https://www.youtube.com/watch?v=xPD5N43VIsk](https://www.youtube.com/watch?v=xPD5N43VIsk)

Specifically the part where he talks about sorting out the joystick port.

More hardware than software but still...

------
amelius
Lesson: make sure you can always (re)run your code in a fully deterministic
environment.

~~~
Tloewald
Such as?

------
aerovistae
That is cool as hell.

