
My Hardest Bug - peterlundgren
http://www.peterlundgren.com/blog/my-hardest-bug/
======
lmm
One from my first job (not mine though).

Java app, deployed to four servers (by rsyncing a zip file and unzipping it).
One of the four fails on startup, in some way that's hard to trace (I think it
might have been a JNI crash?). But all four servers have the same OS, same
version, same packages installed, same JVM, and it was the same zip file.
Check the md5sum on the zip, it matches. In desperation one of my colleagues
writes a script to recursively go through the unpacked version of the app and
check the md5sums of all the files. Still matches perfectly, and the same
files are present on all machines.

We get the dev team to try the app - there's a bit more variety in our
devboxes than servers. Two of them can reproduce the failure, but there's no
obvious correlation - one's java 1.5, one's java 1.6. One's Debian, one's
Gentoo. For every combination there's another developer with a similar machine
where it works fine.

Turns out that one server had been installed with a different filesystem from
the other three (reiserfs?), which meant that the directory entries for the
files were in a different order. The JVM just lists all the classes in the
directory and then loads them in on-disc order, so classes were getting
initialized in a different order, and it was that that was ultimately
triggering the bug.

~~~
pasbesoin
I recall averting what would have been a rather opaque security weakness by
pointing out that occasionally someone would goof and vary the default locale
of some JVM instances. (With a particular knock-on effect that I'll omit
here.)

Paying attention to the system you'll land on is something that can get lost
on even senior developers. And, therein, details, details, details...

Another reason to have a strong ops (or devops) team: Providing/enforcing
proper, intended, and thoughtful context and runtime.

P.S. As I now recall, there was also the use case for locale to deliberately
vary, although that use case / those instances should have remained orthogonal
to those of other locales. Nonetheless, one more possible driver of a mistake
that would enable this weakness to occur.

~~~
lmm
> Another reason to have a strong ops (or devops) team: Providing/enforcing
> proper, intended, and thoughtful context and runtime.

I have to disagree, actually. This kind of problem is exactly where you really
benefit from having full-stack people who understand both sides of the system;
it would have been very hard for an ops person who didn't know about Java's
quirks or a pure dev who didn't know about the unix filesystem to diagnose.

~~~
pasbesoin
I see your point. And... I had to escalate to a very senior person of that
sort (full stack, or full picture -- in detail) in order to get that
particular problem paid attention to and resolved.

(Such a perspective is also how I determined the problem in the first place --
which actually resulted from a botched fix to a prior problem that I'd
identified.)

However, despite the variance -- or risk of same -- that I described, in
general we had some very capable and dedicated devops people who put a lot of
effort and care into our environments. I get rather uncomfortable considering
how things would have been had that not been the case.

It's been a while, and maybe I'm mixing my stories a bit. But I left that role
and product mix -- which had very significant security requirements and
ramifications -- quite impressed with the role those devops folks played in
keeping us safe.

Perhaps what I meant by "strong" goes somewhat in the direction of your
description of "full stack". Our senior devops people tended to trend in that
direction.

And we didn't have "turf wars". Instead, devops was a partner often throughout
the development lifecycle. It helped make sure that the final destination was
appropriately, safely, and consistently configured. (My "locale" situation
aside; and in such an instance, the setting would subsequently receive
heightened and sustained scrutiny, putting a curb on unintended variance of
the setting as well as fixing the code that such variance would impact.)

Strong in knowledge and ability, as opposed to simply or foremost in an
authority to dictate.

~~~
lmm
I think we agree that it's good to have these skills somewhere on your team,
and to work together during development. I just don't think dividing people
into "ops" and "dev" is helpful. There it was something like one guy who spent
80% of his time on traditionally-ops stuff and 20% development, another guy
who was 70% development and 30% database admin, one who was basically full-
time ops but happened to be the maintainer of an open-source library that we
used in our products, one who had been hired as a developer but was starting
to pick up ops tasks because he preferred that...

------
TwistedWeasel
My favorite technical interview question is to ask the candidate about the
worst bug they ever wrote. I let them choose their own definition of "worst",
and see if they choose something that was hard to debug, or caused a lot of
trouble or was just something they thought was a dumb mistake.

It usually provides them with an opportunity to talk technically about
something they know and helps me understand how well they communicate problems
and solutions. Plus, it's sometimes fun to hear the stories.

My own personal worst bug (where by worst I mean "had the worst impact") was
when we disabled a large chunk of the southern Beijing cell phone system for a
short time during the night whilst deploying a field test of new base station
hardware. That was a stressful rollout.

~~~
finnw
> My favorite technical interview question is to ask the candidate about the
> worst bug they ever wrote.

Are you sure you are getting honest answers? Some candidates may be thinking
"He'll never hire me if I admit how stupid I was, so I'll use this secondhand
or dumbed-down story instead."

~~~
groby_b
Irrelevant - they talk about a technical issue they're familiar with. I can
ask them questions about that issue. It's a good way to gauge their skill
level while keeping them on familiar terrain.

It's usually fairly straightforward to see if somebody talks about their own
experiences or is retelling a story they simply heard.

~~~
TwistedWeasel
Exactly. Even if it's not their own story, if they are able to talk
intelligently about the subject matter and explain the problem and solution
well, then I've gained valuable information about the candidate.

Unfortunately, I've had a couple of candidates whose response to the question
was "I don't really write bugs". Those are the ones I know for sure are lying.

------
varelse
My hardest bug:

Memory corruption in a video game that I was developing in the 1990s. It took
2-3 days of running attract mode to trigger it, whereupon the game would crash
catastrophically.

Solution: Videotaped attract mode for 2 days until it happened. Then I single
frame advanced through the 4 frames between its first manifestation and the
complete crash of the program. 15 minutes later I knew exactly what was going
on and fixed it shortly thereafter.

These days, any bug that survives my best efforts for more than a day usually
ends up being a HW or driver issue. I've learned a lot since
then.

------
EvanMiller
While we're swapping war stories and tall tales, here is the chronicle of my
hardest bug:

[http://www.evanmiller.org/winkel-tripel-warping-
trouble.html](http://www.evanmiller.org/winkel-tripel-warping-trouble.html)

~~~
flym4n

> I spent three days in coffee shops with a pen and paper taking first and
> then second derivatives of the Winkel Tripel formula

I felt bad for you reading that. There are loads of tools to do it
automatically (matlab, wolfram, etc).

~~~
sanskritabelt
The stuff coming out of those computer algebra things is not always exactly
the form you want to be stuffing in your code. Working it out manually focuses
the mind.

------
twoodfin
I don't know if it was the hardest bug I've ever tracked down, but I really
enjoyed discovering that HP-UX executables have an attribute bit that controls
whether or not they'll segfault upon dereferencing a null pointer. If I'm not
mistaken, the non-faulting behavior was the default in their toolchain.

Turns out that HP was also shipping a build of Kerberos/GSS libraries that
actually relied on this behavior to function properly, and since our own
project linking in these libraries did the sane thing and enabled the faults,
the Kerberos code would crash.

I won't forget debugging through the library assembly code, wondering how the
hell it ever worked. Luckily my education had included a good survey of the
computer architecture zoo, where surely someone once upon a time had thought
it was a great idea to spec a zero page at low memory addresses.

And of course today I know enough to never believe that a platform must
segfault on a null pointer dereference. C undefined behavior works in
mysterious ways.

------
sergiotapia
Another very interesting bug was the Case of The 500-Mile Email:
[http://www.ibiblio.org/harris/500milemail.html](http://www.ibiblio.org/harris/500milemail.html)

Mirror:
[https://gist.github.com/sergiotapia/d60b5866a61fd0ae8c9c](https://gist.github.com/sergiotapia/d60b5866a61fd0ae8c9c)

~~~
enkephalin
I knew of that one, but had forgotten the cause, and just enjoyed reading it a
second time.

------
sytelus
I used to be in awe of articles like this. For a long time I believed the most
difficult bugs are the ones where the debugger itself has an issue. But having
worked in machine learning and distributed computing, all these bugs look like
pretty stories in a kids' book. Imagine this: when your machine-learned model
doesn't "work", there are no breakpoints to set, no watches to watch, not even
code to reflect on. All you have is data and probabilistic statements from
which you need to statistically infer the root cause. It's the same way in
distributed computing. One of my scripts was over 1600 lines of dense
high-level statements that ultimately get translated into hundreds of map
reduces. Again, you don't get any exceptions when things go wrong. All you can
see is some statistical pattern that "doesn't look right", and it's always
interesting to build your probabilistic "if-then" arguments leading to a root
cause.

~~~
username223
Add some consistency checks on your probability distributions (do the
marginals sum to one?).

What makes articles like this impressive is that they involve a problem at a
different level of abstraction. For example, what if your model didn't work
because multiplication broke for specific inputs?

------
danso
Of all the "this is how we coded things", debugging stories are my favorite,
and are probably the things I re-read most...especially since most of them
seem to happen in low-level areas, which I don't get to deal with on a regular
basis but still find the intuition and principles to be transferrable.

That said, every time I read one of these... I'm always reminded of a
legendary bug report... or at least an old one, because I thought I remembered
reading it in a Usenet newsgroup posting... it involved some malfunctioning
hardware whose cause was related to the floor panels and the proximity of
people walking around... and I can't find it to store in my bookmarks. Someone
here must know what I'm referring to, yet I can't summon it on Google.

~~~
hfsktr
I think this might be it:
[http://patrickthomson.tumblr.com/post/2499755681/the-best-
de...](http://patrickthomson.tumblr.com/post/2499755681/the-best-debugging-
story-ive-ever-heard)

~~~
danso
I believe that's the one... _wow_ , my memory is horrible. You can't get much
further from "Usenet newsgroup" than "on a Tumblr". I guess that the story as
told took place so long in the past, my brain associated it with the days of
reading from a text-browser.

------
willvarfar
They say you try and suppress your worst memories, which explains why I can't
remember anything much worse than this:

On Symbian OS, the window manager managed all the screen drawing. All visible
apps would be asked to send draw ops to the WM and it would draw them clipped
to the apps's windows.

And at UIQ we were adding theme wallpapers and memory-hungry graphics faster
than our licensees were adding RAM.

And a real problem was running out of RAM drawing the screen. Doing rectangle
intersections actually requires allocation, so drawing isn't constant memory.

To speed up drawing we made the WM retain the draw ops. This was transparent
to the apps, but a massive performance win. We made a 'transition engine' to
smoothly slide between windows and smooth scroll windows and things, at a time
when some Nokians confidently told me it wasn't possible :)

But what if our cleverness caused an out-of-memory in the WM? I had a cunning
plan...

We intercepted the malloc and, if it failed, called out to a memory manager
app to start zapping things. And if a second alloc attempt failed, we started
discarding draw op buffers and unloading theme assets.

And this seemingly worked! By making our graphics adapt dynamically to RAM
usage rather than ring fencing we got much better app switching because
background apps weren't getting unloaded.

And then, just as the first phone with this tech was being tested (manually,
by a small army), it would sometimes crash with meaningless stacks.

My team jumped into the challenge thinking we were elite clever sods and all
bugs were shallow.

After a few days, I had to start making excuses; we were stumped. The
thousand-monkey tests showed no pattern, only that it happened often. Where
was the crash coming from?

A lunchtime walk cleared our heads a bit and suddenly the horrid realisation
was before us: if the allocation that failed was a bitmap data block, the
bitmap itself might be reaped, but the stack would resume initialising RAM
that malloc no longer thought it had; in the end some other random bit of
data would be interpreted as a memory address and eventually the WM would
blit to it...

The phone never shipped because the plug was pulled on UIQ, but I think this
bug was fixed and forgotten before then.

------
batbomb
Hardest bug #1:

With some custom embedded electronics, the ADC would work fine in a lab
setting but always throw out garbage in the field. Now, about 500 of these had
been running fine in the field since February. It turned out there was a bug
in the FPGA that controlled the ADC: somewhere around 80 degrees Fahrenheit,
there was enough of a propagation delay that the ADC wouldn't start up
correctly. Since the other units were started in February when it was 25
degrees, and only a few had been restarted, it wasn't noticed.

That was frustrating.

Another fun one was the rapid degradation of a database when the write-back
cache battery on the RAID controller failed on the write-ahead logging disk
and nobody was notified.

Right now I've been battling a random corruption NFS bug for a few weeks.
Recently thought it was the automounter but the bug has appeared in a few
other nodes since:

[http://stackoverflow.com/questions/20460238/random-
corruptio...](http://stackoverflow.com/questions/20460238/random-corruption-
in-file-created-updated-from-shell-script-on-a-singular-client)

------
YZF
Do other people who have been doing this for a while feel like those bugs end
up as a blur? As new programmers we all went through hard to debug
timing/memory corruption/threading issues but I like to think I've learned
enough to avoid those classes of issues...

War story #1: I designed and programmed a board based on a TI fixed point DSP
(5x series). Problem is the software ran for a very short while and then the
board would crash. I went through everything I could think of, the software,
verifying the reset sequence, memory accesses. Everything looked good. Called
TI support. Couldn't figure it out. After I think two weeks of checking
everything it turned out that one of the ground pins that was supposed to be
connected (it was in my schematic) was left unconnected by the PCB designer.
When we pulled up the PCB design we saw the via to the ground plane, but a
tiny little segment between the pad and the via (under the chip) was left
unconnected. If you don't hook up all the Vcc and Gnd pins, you get undefined
behaviour...

War story #2: Odd intermittent very rare behaviour in an application we worked
on. Turned out we were using some implementation of shared pointers that used
interlocked increment for incrementing the count but didn't use interlocked
decrement for decrementing. So very rarely two threads on two cores would hit
that and someone would end up with an invalid pointer. That one also took a
long time with trying to get some semi-reproducible behaviour to even know
where to start looking.

EDIT: One thing I've learnt over the years is that bugs that look impossible
to figure out will eventually yield. The magic time period is around two weeks
for those rare super-hard bugs: from having no clue what's going on, with
intermittent weird failures that look impossible to figure out, all you have
to do is "do the time" and you can figure them out. I've seen people simply
give up, live with things not working, and believe that those issues are
"unsolvable"...

~~~
aidenn0
Worst bug I saw took nearly a year with some really smart people working on
it. It reproduced only after about 6 machine years (had ~100 test machines,
and one of them would randomly crash every 3 weeks).

~~~
anabis
Yeah, reproducibility would determine how long it takes to zoom in, but being
rare should also reduce the severity.

------
philh
The bug I think I've spent most time on was when my Racket program would run
for a while and then segfault randomly. Core dumps showed a stack overflow,
but unsurprisingly, examining the racket source code didn't enlighten me. I
had to git bisect through a number of commits, running them on several
instances and tentatively marking them 'good' if I didn't find anything after
a few hours. (One time I was too quick to mark a commit 'good'...)

The guilty commit seemed fairly innocent, but it prompted me to try running
`(loop (thread (const null)))`, which immediately segfaulted. `(loop (thread
(thunk null)))` didn't. At this point we handed off to the racket devs, and
replaced our `(const null)` callbacks with `(thunk null)`. After a few days
they worked out what was going on and fixed it.

------
andersthue
I remember a friend of mine whom I and all our friends tried to help figure
out why two lines of C code never worked when running normally, but always
worked when debugging.

It was a simple loop, and as I remember it took two weeks to spot the
mistake: a missing = 0 in

for (int c; c < x; c++)

The debugger initialized all memory to zero, so it never failed when
debugging :)

------
chrisbennet
Around 26 years ago, I was working on some C code that had been ported from
assembler. It was too long ago to remember the actual bug, but I was stumped
by how a certain variable "C" was getting set. It wasn't set explicitly
anywhere in the code. It turned out that it was declared like this:

int A, B, C;

and it was being set something like this:

int* pA = &A; // pointer to contents of "A"

pA[2] = 3; // WTF?

A, B, and C were assumed to be contiguous in memory, and the code was treating
"C" like the third item in an array starting at "A".

------
lesterbuck
In the mid-eighties, I was working on some app that did serial port
communications, probably under PC-DOS. The program was failing to communicate,
and I spent most of a week stepping through the debugger, watching it send the
characters down the cable, working perfectly. But it was failing at normal
speed. It finally occurred to me that maybe the hardware wasn't quite working,
so I ran a diagnostic and it immediately failed. It was probably a lightning
surge from months before that made that port "special". The psychic scar of
that week made such an impression on me that I bought a used 5KW 120lb ultra-
isolation transformer for $75, and I have run all my various equipment behind
it ever since. I've never had another hardware failure due to electrical
surge, and my cat loves to sleep snuggled against it for warmth.

------
saalweachter
My favorite bug came just after starting a job. There was a C++ program that
was being run 32-bit on servers with 8GB of memory, back when that was a lot,
because it would crash when compiled 64-bit. I was assigned the task of making
it work.

After many, many fruitless debug sessions, the problem turned out to be that
the _structure packing_ was different between two different compilation units. In
some compilation units a particular structure was 56 bytes, in others 48 (or
something like that). This was Bad.

There was an unterminated pragma-pack which was included in some compilation
units but not others. In 32-bit mode it didn't cause any problems, because the
structures were optimally packed anyway, but in 64-bit mode, when pointers
were 8 bytes, the structures packed differently when the unterminated pragma-
pack was included in the header before them.

------
dugmartin
It's bugs like those that both make me miss embedded system development and
make me glad I don't do it anymore.

------
o_nate
Worst I can remember recently, since I never really solved it, was a .NET
Runtime Fatal Execution Engine Error that occurred somewhat randomly in a
long-running, multi-threaded console application, but only on one machine.
Eventually just moved to a different machine.

------
moron4hire
Once had an issue with a network of wireless mesh devices that I had to try to
debug from across the country (I was in Philadelphia, the client was having an
issue with a network installed in a hotel in Las Vegas). Getting ready to
leave for a long weekend, I got a call from the client at 4:45pm (knowing
this client, I suspected they did it on purpose. They had called me on my
cell, so clearly they didn't expect me to be in the office). Basically, every
time you'd try to reset the room for a new guest, the blinds would shake, half
the lights would fail to turn on, and the door would unlock. If you kept at
it, it would eventually work. But then, setting the thermostat would sometimes
make half the lights go out. And the door would unlock.

They had a packet sniffer that we had built for them, so we went to trying to
diagnose the issue. I'd send them new versions of the programmer tool, they'd
flash the room (which required pinging each and every device in the room and
tapping a physical button on it, in one of the largest single-room
installations of this brand of mesh network in the world), and they'd send me
the logs of the sniffed packets. We could see that the packets for any device
you happened to be standing next to would be completely fine, but if you
turned your back on it and walked across the room, it started acting up. But
they would be fine again if you walked back to it. "It's like it knows you're
watching it", said the guy on the phone.

They kept insisting that I had let a virus into their network. Never mind that
it was only possible to rewrite the configuration over the air, while rewriting
the ROM itself required physical access to the board in the case; clearly it was my
fault for all the "unnecessary fiddling" I had been doing recently (i.e. a
slew of bug fixes they had requested all involving my predecessor's lack of
understanding of the pitfalls of threading and UI in .NET).

I kept telling the client that all the symptoms suggested radio interference
from an outside source. They insisted they had never heard of such a thing,
ever, in any context, including static on their car's radio. Me being "merely"
an applications developer and not an electrical engineer, they lacked faith in
my explanation and insisted that I "just fix it". How I was supposed to be
knowledgeable enough to fix it when I was apparently not knowledgeable enough
to understand what was going on with it was beyond me, but whatever.

"Put it back to what it was before you started f __*ing around with it. "
Revert the code through source control (thank God I had installed SVN when I
first arrived at that company, because apparently EEs don't understand that
it's not a good idea to keep dated copies of code directories around as
"backups"). "This isn't working, I said give me the old version." Send them
links to the installer on their own server. "You must have changed it on our
server! How did you get access to our servers? This isn't working!"

I finally gave up around midnight and drove the 3 hours to my parents' house
for Thanksgiving the next day. I nearly got fired for it.

We shipped them a radio spectrum analyzer and determined for sure that it was
radio interference. The hotel opened the room next door and found a baby
monitor, still on, fallen behind the dresser. They turned it off and the room
responded flawlessly.

I should have quit then, but I needed the money and I was going through some
depression issues so I really thought it was my fault. I eventually did get
fired from that place, the only place I've ever gotten fired from, for "not
working enough overtime" because I was only doing 50 hours a week when the
intern fresh out of college needed 60 to get his much simpler tasks done (and
often leaving me blocked because of it, but I wasn't allowed to help him with
anything because "you're not an electrical engineer", where apparently only
electrical engineers know how to code in C). I don't regret it, biggest piece
of shit place I've ever been, and just the motivation I needed to get off my
ass and finally change my relationship to work. I've been freelancing ever
since.

I guess that's not so much "my hardest bug", but I did actually fix a bunch of
bugs in the process of trying to convince them it was interference and not
some mythical radio virus that could corrupt packets in mid-air. And all of it
based on phone calls and emails with hex dumps of sniffed radio packets, while
being "nothing more" than a lowly applications programmer.

------
mark-r
Whenever I see a problem that just _has_ to be a compiler or hardware bug, I
keep digging in my code because I know I'm wrong. 99% of the time that turns
out to be the case.

------
GSimon
As a front end web developer, my hardest bug is Internet Explorer.

Most recently, trying to get Javascript to run correctly with images that
have already been cached (issue described here:
[http://mir.aculo.us/2005/08/28/internet-explorer-and-ajax-
im...](http://mir.aculo.us/2005/08/28/internet-explorer-and-ajax-image-
caching-woes/))

That was a fun 2 days of troubleshooting, I think I 'solved' the problem 3
times before it was actually solved cross-browser.

------
signa11
On the topic of debugging, I would just like to heartily recommend "The
Medical Detectives", which has a Dr. House-esque streak to it, but it is
_real_...

------
remon
Great article. I understood quite a few of those words!

------
fayyazkl
Apparently, most of the bugs categorized as hard seem to be something related
to hardware, i.e. not a software mistake by some programmer. I'll narrate a
couple that were not the case.

a) I used to work on deep packet inspection software for a multicore network
processor. It was a kind of C, but with restricted APIs and some unique
concepts related to multicore. Among those concepts: the same binary is run on
multiple cores to process packets, but still with no hardware locks, because
an implicit tag (a kind of hash computed on the 5-tuple of src/dst IP, ports,
and protocol) ensures only one core gets packets from one session / 5-tuple.

So the scenario was a protocol parser whose job was to parse some other info
along with the IP and call an external API to add a subscriber. When this
parser was run for 10-15 minutes on a live setup, it would seg fault after
processing some 60-70 million packets. The behavior was reproducible, but did
not occur at the same time, nor in the same piece of code.

Narrowing down didn't exactly work, since the fault stopped occurring when
either the subscriber-addition API call or the parser was commented out. But
each worked perfectly on its own.

Finally, after a couple weeks of long debug cycles and notes, it turned out
to be an IMPLICIT tag switch inside the subscriber-addition API. Since we
were not locking through APIs, the tag switch would lead to the same packet
being sent to multiple cores, and anywhere along the line in the follow-up
code, an allocation (which becomes redundant), a shared memory access, or a
deletion (free) could turn into a seg fault.

Now, the implicit tag switch in the subscriber API was also a documented and
needed feature of the hardware. It just should have been DOCUMENTED IN BOLD
on the API, which was not the case.

b) In the same DPI product, we once added two fields to look for in the
incoming traffic which should not have matched, but were still matching in
the results. The unique thing was, they only failed when both were present
together and worked fine independently.

Going deeper into the code showed a strncpy, intended as a safety measure
over strcpy, but called with MAX_STRING_SIZE. So when the actual string was
much shorter, it would wipe the entire remaining length of the buffer with
padded zeros, thereby overwriting the fields that had been appended after it.
The author seemed to have missed the following note in strncpy's definition:

"If the end of the source C string (which is signaled by a null-character) is
found before num characters have been copied, destination is padded with zeros
until a total of num characters have been written to it."

Since then, I have been really careful about choosing to use strncpy instead
of strcpy, as is often mistakenly advised in general.

~~~
JoachimSchipper
Ouchies. On the narrow subject of strncpy(): strlcpy() and friends are the
"correct" API, IMHO, and it's easy enough to copy-paste them from e.g. the
OpenBSD code.

~~~
fayyazkl
Oh, so you mean the openbsd version doesn't suffer from these issues /
features? :) I would definitely like to take a look. Thanks

------
themodelplumber
My Hardest Kick (dolflundgren.com)

------
rurban
[http://hospitalityrisksolutions.files.wordpress.com/2013/02/...](http://hospitalityrisksolutions.files.wordpress.com/2013/02/roflbot51-e1360768553167.jpg)

