
My Hardest Bug to Debug (2018) - jsnell
https://www.programminginsteeltoecaps.com/my-hardest-bug-to-debug/
======
AnimalMuppet
Here's my worst:

We had a function that looked like this:

    
    
      void f() {
        bool flag = true;
        while (flag)
          g();
      }
    

This function exited sometimes, which should not be possible.

You say, "Duh, Mr. AnimalMuppet, clearly g is smashing the stack!" But the
return address wasn't getting trashed - just the value of flag.

This bug also would disappear if you did something like, for example, print
out the address of flag so you could watch it in a debugger.

I chased this bug for a month, off and on. Finally I got desperate enough to
print out the assembly produced by the compiler, and things got clearer.

flag was a register variable. (This was gcc compiling for an ARM CPU, by the
way.) It lived in R11 (or maybe R12, it's been a long time). When f called g,
it just pushed the return address on the stack. But g was going to have its
own value to put in R11, so it pushed R11 onto the stack just before
allocating space for its own variables. So _f's_ local variable wound up in
_g's_ stack frame...

... and g was smashing the stack. Duh.

In particular, g was using queues (msgsnd and msgrcv) to exchange messages
with another CPU. The API here is misleading. These functions expect to send a
packet of the form

    
    
      struct msgbuf {
        long mtype;
        char mtext[1];
      };
    

and they take a size parameter, among others, because you actually pass in a
different structure, with an mtext array big enough to carry your actual
information. But the size you pass to the function is expected to be the size
of the actual mtext array, _not_ the size of the actual structure (that is,
the size doesn't include the size of the mtype field).

The contractors who wrote the code didn't know that. They assumed that they
could just pass sizeof(type_similar_to_msgbuf) into msgsnd/msgrcv, and it
would work. In fact, they were sending and receiving 4 bytes too many. Which
kind of worked, in that the sender and receiver never got out of sync. But
when the receiver got four bytes too many, it trashed what was just past it on
the stack, which turned out to be the stored value of R11.

The net result was that the flag would get cleared if, _on a different CPU_,
four bytes of unrelated memory were zero.
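
A minimal sketch of the size rule described above. The struct `sensor_msg`, its 16-byte payload, and the helper names are hypothetical; the original code is not shown in the story. The point is only that the length passed to msgsnd/msgrcv counts the mtext bytes, and that passing the size of the whole struct lets msgrcv write past the end of the buffer by the size of the mtype header (4 bytes on the 32-bit ARM target in the story).

```c
#include <stddef.h>

/* Hypothetical message layout, standing in for the contractors' type. */
struct sensor_msg {
    long mtype;      /* required first field, must be > 0 when sending */
    char mtext[16];  /* payload actually carried by the queue */
};

/* Correct msgsz argument for msgsnd()/msgrcv(): payload bytes only. */
size_t msg_payload_size(void) {
    return sizeof(((struct sensor_msg *)0)->mtext);
}

/* What the buggy code passed: the whole struct, mtype included. */
size_t msg_buggy_size(void) {
    return sizeof(struct sensor_msg);
}

/*
 * msgrcv() starts writing at mtext, so with the buggy size the write can
 * end at offsetof(mtext) + sizeof(struct), past the end of the struct.
 */
size_t msg_overrun_bytes(void) {
    return offsetof(struct sensor_msg, mtext) + msg_buggy_size()
           - sizeof(struct sensor_msg);
}
```

With this in place the receive call would look like `msgrcv(qid, &msg, msg_payload_size(), 0, 0)`, where `qid` and `msg` are whatever queue id and buffer the caller already has.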

~~~
saagarjha
Psst…I'd mark that flag volatile or atomic if I were you; a smart C11 compiler
might mark your loop as terminating ;)

~~~
2fast4you
Wait, why would it do that?

~~~
saagarjha
From a recent
([http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf))
draft, section 6.8.5, Iteration statements:

> An iteration statement whose controlling expression is not a constant
> expression, that performs no input/output operations, does not access
> volatile objects, and performs no synchronization or atomic operations in
> its body, controlling expression, or (in the case of a for statement) its
> expression-3, may be assumed by the implementation to terminate.

~~~
the_why_of_y
For that condition to be met, the compiler would have to prove that g()
doesn't do IO, etc - but the OP said it calls msgrcv(2), so it's not an issue.

------
shdon
I think my hardest thing to debug wasn't really a bug at all. It was having to
clean up some legacy code I'd inherited and one function which was a
bottleneck had a few hundred lines of horribly complicated code with lots of
calculations which didn't make sense to me. It took a lot of time to figure
out that the calculations were wholly unnecessary. The results of the
calculations either got discarded or unconditionally overwritten. In the end,
the refactoring consisted of a single press of the "delete" key, but building
up the confidence to press it was what cost a lot of time and effort.

~~~
bballer
Maybe your case was more complicated but most IDEs now will tell you about
unused variables, methods/functions etc. So if variables storing the
calculations that were being built up never ended up being used in the end
context of the function you may have gotten a nice gray/red line. It becomes
harder to trace if those values are being updated into some table which then
never gets used.

It is amusing that sometimes huge swaths of complicated code exist simply
because everyone inheriting them is too scared to touch them even though they
are basically orphaned and have a net 0 effect on outcome.

~~~
shdon
This was code written in PHP 3 and part of pricing and tax calculations on an
import with data coming in from a custom made inventory management system
module for an ERP solution running on Windows outputting the most god awful
XML in UCS-2 with node names that contained accented characters, and converted
that data into temporary MySQL tables for the custom made web shop.

Fun fact: the import, to take into account the fact that product IDs could
change (as the entire product database with over 50k products was rebuilt with
every import), would also update the IDs of all items in customers' shopping
carts at the time of the import. Great fun if the customers had a product in
their cart that was discontinued or changed.

That single function was 56 kilobytes of code.

------
voidmain
It sounds like the camera was sending match results using a blocking send or
write call, and no one was reading them, so once the total size of match
results exceeds the total size of socket buffers the camera software blocks and
stops working.
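
A small sketch of that hypothesis, not the camera's actual code. It uses a local Unix socket pair as a stand-in for the TCP connection and a hypothetical function name. One end writes while nobody reads the other end; the writer is put in non-blocking mode so that the point where a blocking send would hang shows up as EAGAIN instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Writes to one end of a socket pair while the other end is never read.
 * Returns the number of bytes the kernel accepted before its buffers were
 * full, or -1 on an unexpected error. A blocking writer would hang there.
 */
long bytes_until_send_would_block(void) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* Non-blocking mode turns "block forever" into an observable error. */
    int flags = fcntl(fds[0], F_GETFL, 0);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    char chunk[4096];
    memset(chunk, 'x', sizeof chunk);

    long total = 0;
    for (;;) {
        ssize_t n = write(fds[0], chunk, sizeof chunk);
        if (n > 0) {
            total += n;           /* still room in the socket buffers */
            continue;
        }
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                /* buffers full: a blocking send stalls here */
        total = -1;               /* anything else is unexpected */
        break;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

The exact byte count depends on the OS and its socket buffer settings; what matters is that it is finite when the reader never drains the stream.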

~~~
msandford
That was what I thought too! I've been bitten by a similar issue where I
thought I was dodging it by using a size unlimited Python Queue but the
underlying buffer filled up anyhow.

~~~
ohazi
You'll run out of memory eventually... an "unlimited size" queue usually just
means that the original developer won't be around to deal with the fallout.

------
mc3
I hate questions like "What's the hardest bug you've ever debugged?". It means
as I work I need to keep a scrapbook of "will what I did today make a great
interview answer in 5 years time". I'd rather just fix the bug and forget
about it.

~~~
locusofself
I kind of agree. My memory doesn't work like this, I could spend a week
troubleshooting, following various twists and turns and eventually finding a
solution with a big aha! moment, and not remember much of anything about it a
week later, at least in terms of spontaneous voluntary recollection.

~~~
qiqitori
And you don't remember any bugs that took much longer to get to the bottom of
than the others? Or bugs that required debugging at a much lower level than
usual? Or bugs that involved a huge black box that required a lot of guessing
to fix?

------
blantonl
I'm not going to lie, but this article left me absolutely hanging off a huge
cliff with no one to rescue me.

The actual bug was not fixed!!

~~~
lisper
Yes, it was.

> We looked at how we communicate with the camera. We opened a TCP connection
> the first time it was needed, and left it open for changes to camera
> settings via the application. We modified this to close the connection once
> we had sent the required information and open it again if we needed it. We
> tested this thoroughly over the next few days and it looked solid.

Furthermore...

> I still don't understand how this caused the camera to lock up. We were
> receiving the TCP results via Telnet but we weren't reading the stream. Did
> it just build up in some buffer? How did this cause the camera to lock up?

My guess: the buffer filled up and the result was that the computer's OS
stopped sending ACK packets to the camera. The camera then locked up waiting
for ACKs that never came.

Kinda straightforward actually.

~~~
watt
What if the author made a minimal test case that makes the camera
lock up, and reported it as a vulnerability to the camera manufacturer? I
think the camera vendor is obligated to fix the defect. It's really annoying
to work around such obvious issues when communicating with hardware, and frankly
the manufacturer should be held to a higher standard.

~~~
dfox
This behavior is not a bug much less a security vulnerability. If the camera
would try to handle this in some "clever" way (and thus started to drop data),
that would be a bug.

------
makach
Weird. He did not understand what caused the bug. To me, it is always about
understanding the issue so that we can make sure that it doesn't exist in a
similar form elsewhere in the code. Just removing the bug is half the job of
debugging; the other half is understanding it, which is also where I
have my biggest a-ha moments.

~~~
nighthawk648
It sounds like, because they were not processing the data, there was some queue
of files building up until the buffer was full. Once the buffer was full,
the request would be denied by their transport protocols, maybe a security
feature run awry?

It seems once you unplug either via Ethernet or hard reset the pointer to the
MAC address / up would get refreshed thus the full load dumped. Which would be
consistent as to why in 2.5 hours the bug would persist again.

I agree it’s funny the author never found out what caused the bug.

------
dblohm7
There is a subreddit devoted to these kinds of topics:
[https://www.reddit.com/r/TalesFromDebugging/](https://www.reddit.com/r/TalesFromDebugging/)

------
pbadenski
Broken printing in an enterprise Java app - it was relatively serious as the
customer's business process relied on collaborating using printouts. There
was no message, no error & it was only failing for one specific customer.

It took us 2 weeks to find the problem. From what I remember some entry in
the middle of the Windows PATH variable had a problem. While loading DLLs this
caused a silent failure, and caused printing DLLs not to be loaded. It was all
in a Citrix environment, which definitely did not help.

------
axilmar
My hardest bug ever was the following:

A weapon's camera tracked the simulated target perfectly, but there were
occasional hiccups, where the camera jumped randomly.

The camera rotation data were confirmed, through Wireshark, to be valid,
except for very specific values.

It turned out to be the following problem: a function that converted the
endianness of float32 values returned the endianness-switched value as a
result, rather than changing the destination buffer directly.

Somewhere in the process, the x86 hardware expanded the switched float value
to 80 bits, stored it in the internal FPU stack, and then this value was
retrieved from the stack as a float32 and sent over the network.

This round trip altered only some values, which is why the camera jumps were
infrequent.

The code that did this conversion was deep inside a library that had been given
to us and that was widely used in the contractor's company for all
simulation needs.

The problem went away when I replaced the PDU handling with custom code which
switched the endianness of floats in an appropriate manner (i.e. directly in
the destination buffer).

The problem was most probably created because the 'standard' functions for
integer endianness swap have the form

    
    
        uint32_t ntohl(uint32_t netlong);
        uint16_t ntohs(uint16_t netshort);
    

and the author of the library thought that the following:

    
    
        float ntohf(float f);
    

was going to be the same as the above.
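
A minimal sketch of the "swap directly in the destination buffer" approach, with hypothetical function names; this is not the library's or the replacement code's actual implementation. The swapped bit pattern is only ever handled as an integer or as raw bytes, so it never passes through a floating-point register. One plausible reason the by-value version misbehaves (an assumption, the story does not say) is that some swapped patterns are signaling NaNs, which the x87 unit rewrites when loading them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverses the 4 bytes at buf in place, treating them as a plain integer. */
void swap_float32_in_buffer(unsigned char *buf) {
    assert(buf != NULL);
    uint32_t bits;
    memcpy(&bits, buf, sizeof bits);
    bits = ((bits & 0x000000FFu) << 24) |
           ((bits & 0x0000FF00u) << 8)  |
           ((bits & 0x00FF0000u) >> 8)  |
           ((bits & 0xFF000000u) >> 24);
    memcpy(buf, &bits, sizeof bits);
}

/* Copies a host-order float into buf, then reverses its bytes there. */
void put_swapped_float32(unsigned char *buf, float value) {
    assert(buf != NULL);
    memcpy(buf, &value, sizeof value);  /* value is still a valid float here */
    swap_float32_in_buffer(buf);        /* swapped form exists only as bytes */
}
```

The key design choice is the signature: the swapped result goes out through a byte pointer, never through a `float` return value.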

------
seanhunter
My hardest ever bug to debug was a core dump deep in the guts of some nasty
C++ code.

What's so hard about that? Just load the core dump into a debugger! Well
loading the core dump into the debugger caused the debugger to dump core...

~~~
chopin
I can relate. An older version of the Eclipse debugger would run into a stack
overflow when the debugged process had a stack overflow and you paused the
affected thread. It took me a while to figure that one out...

------
infinity0
[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82677](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82677)
incorrect use of assembly integer division affecting all projects that have
copy-pasted code from GMP (including linux/openssl/etc)

[https://github.com/fredrik-johansson/arb/issues/194](https://github.com/fredrik-johansson/arb/issues/194)
was the original exhibitor of buggy behaviour

------
taneq
My hardest in terms of pure awfulness wasn't a single bug, but was a
constellation of crashes in a commercial MMO game engine (which shall remain
nameless). From the architecture of the engine it seems to have started life
as a model viewer, then they'd stuck that code in a for() loop to display
multiple models, bolted on SpeedTree, mashed the whole resultant mess through
Umbra (we can occlusion cull, YAY!), and called it a AAA game engine.

And then it got really ugly, because MMO engines need to be able to stream
content from disk.

So they took the model loading code and put it in its own thread, and added a
few _bLoaded_ flags to various structs to let the renderer know which of the
models, textures etc. had finished loading. And called it done.

It kind of worked.

Along with a couple of other coders I was handed this and a bug tracker full
of "Was doing <random thing X> and the game crashed" bugs. To make it worse,
there was some obscure build issue that meant switching between Debug and
Release builds required a full rebuild, which took about 25 minutes. I spent
about six months trying to add sufficient synchronization to make it stable.
By the time the company folded it was playable, but there were still race
conditions and it was still a bit flaky, and to this day I still don't like
threads unless they're really necessary.

--

I think the trickiest bug I ever had, in the spirit of the article, was when I
was commissioning a piece of mining equipment. It'd been shipped with the
controls hardware in a half-finished state and I had to rebuild chunks of it
onsite, which meant over a kilometer underground. Eventually I had it all
running as intended, it worked perfectly all day, and I thought we were done.
Then just as we were packing up for the day, the machine started shutting
down. No alarms, no faults, no visible cause, it'd just... stop. I spent an
hour pulling my hair out before we had to leave for the day due to scheduled
blasting. On the drive back to the surface, I was going over code to figure
out what could possibly cause the shutdowns.

Halfway up, it hit me. The only way that the system could stop without a fault
was if the 'run' signal from the radio control system was turning off. And the
only way that would happen is if the radio transmitter turned off or lost
comms. And that's when I realised I'd never fitted the antenna to the radio
receiver. So when the battery in the radio transmitter started wearing down
after a day of use, the signal strength started getting just low enough for
the connection to drop and the machine to shut down.

------
ggambetta
Oh, I have a couple of fun ones.

"The most impossible one" was during my first "professional" job, while still
in college. I was writing part of the GUI of an app - a screen that showed a
grid with controls and data. This was ~20 years ago, so C++ and MFC; the
control was, IIRC, a regular listbox with some custom drawing.

We got a bug report from a user that was something like _"if I do this and
that, the empty cells become '3'"._ Some of my very senior colleagues had
drilled into me that _CUSTOMERS ALWAYS LIE_, so my first reaction was to be
sceptical. This was, after all, an impossible bug - most of these cells were
absolutely empty (as in, it was impossible for them to have content), and the
action the user was describing was absolutely unrelated to this control
anyway.

So I tried to reproduce the bug, and to my utter disbelief, the blank cells
now showed a "3".

After much debugging, I found the root cause. I was using some API call to
lock the char* buffer of an entirely unrelated CString, for some direct
manipulation. IIRC the second argument was the length of the string, but there
was the bug: I incorrectly thought this arg was the offset, not the length, so
I was passing 0 there.

Somehow MFC must have been doing something like interning strings, because by
locking this CString of size 0, MFC was giving me a pointer to the char buffer
_of the application-wide empty string_ - and I was writing a "3" there for
unrelated reasons. So every empty string in the program (or at least for some
controls) now was "3".

The fix was trivial once I identified the bug, but I always remember this as
the most impossible "can't happen" bug that did, in fact, happen.

-----

"The funniest bug title" was "HANDBAG TURNS INTO A BANANA". By then I was
making games ([http://www.mysterystudio.com](http://www.mysterystudio.com)).
This was one of the Sherlock games, IIRC, of the "hidden object" genre -
there's a list of objects you have to find in a very cluttered scene. When you
click on the object, it flies to your inventory. In particular, if the object
was partially obscured by the scenery or another object, it would appear
unobscured (think "in the foreground") while it flew to the inventory.

To do this, we had a static background with a bunch of "filler" objects, but
no "target" objects. Then, for each "target" object, we had a partial (P) and
a full (F) image. We'd show the P image of every object, then when the player
clicked it, we'd show the F image and animate it to the inventory.

What the tester was reporting was that there was a handbag that, when clicked,
turned into a banana.

The root cause was again trivial. For some reason, the artist had made a
naming mistake while exporting assets, so we had a P asset of the handbag, but
its matching F asset was a banana.

-----

"The most difficult to debug" was at Improbable. In the very early days, part
of the world simulation code used Scala and Akka, with a pubsub system (this
has been rewritten using more sensible technologies and architectures). During
one of our nightly stress tests, we had a bunch of entities essentially
walking the extents of the world, looping endlessly for hours. And after hours
of running, some entities (but not all) would get stuck, as in "stopped
moving", at the boundary of two contiguous simulation regions (still very
likely in the same machine).

What followed was a three-week investigation that truly tested my sanity. This
was a massively distributed system, and the bug was very hard to reproduce. I
spent countless hours staring at logs, diffing logs from "good" runs and "bad"
runs, trying to reduce the scenario to isolate the bug. Everything was suspect
at some point or other, from our pubsub implementation, to the implementation
of hash maps in the Scala standard library.

In the end it turned out to be an extremely subtle race condition in some
cache somewhere in the pubsub subscription and desubscription logic. It took 3
weeks of suffering, doubting my own sanity, and being tempted to abandon it
(at that point, we were considering just rewriting a big chunk of the code,
just to nuke this bug from orbit). I'm glad I persevered and fixed it - it
would have haunted me to this day if I hadn't. I still remember its name, and
I probably always will. F*** you, ENG-168.

------
dimman
Long story short: Unexpected OpenSSL hard crashes in our application. Turns
out our HW was reporting support for unaligned access whereas it was actually
disabled in the CPU due to buggy HW (ARM platform).
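
A small sketch of the usual defensive pattern for code that cannot trust unaligned word loads. This is not OpenSSL's code, and `load_le32` is a hypothetical helper name. It reads a 32-bit little-endian value one byte at a time, so the source code never asks for a word load from an odd address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reads a little-endian 32-bit value from any address. */
uint32_t load_le32(const unsigned char *p) {
    assert(p != NULL);
    /* Byte loads have no alignment requirement, so p may be odd. */
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

A compiler may still merge the four byte loads into one word load if it believes the target allows unaligned access, so the build flags have to match what the hardware really does (on GCC for ARM, that would be `-mno-unaligned-access`).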

------
saagarjha
> Most people like to regale war stories of a particular missing semi-colon

I’m curious if anyone has any real stories of a missing semicolon causing a
problem in their running software.

~~~
reptation
Not a semi-colon but -

Setting up a GPO to push Microsoft updates to clients in an AD domain, I spent
~8 hours debugging why the clients couldn't connect to the WSUS server. It
turned out that there was trailing whitespace in the server URL text field
that was getting parsed.

As a side note, this type of story IMO doesn't actually play that well in
interview situations.

~~~
rurounijones
Every time I receive input from a human source I remove leading and trailing
whitespace. (I have never had a situation where it was needed)

Far too many instances of copy/pastes including those bloody things.
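
A minimal sketch of that habit in C. The function name is hypothetical, and the URL in the usage note is a made-up example, not the one from the story.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Strips leading and trailing whitespace from s in place and returns s. */
char *trim_in_place(char *s) {
    assert(s != NULL);

    /* Find the end of the text, ignoring trailing whitespace. */
    size_t len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;

    /* Find the first non-whitespace character. */
    size_t start = 0;
    while (start < len && isspace((unsigned char)s[start]))
        start++;

    /* Slide the kept part to the front and terminate it. */
    memmove(s, s + start, len - start);
    s[len - start] = '\0';
    return s;
}
```

For example, calling it on a buffer holding `"  http://wsus.example.local:8530 \n"` leaves `"http://wsus.example.local:8530"` in the same buffer.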

------
goshx
Debugging often takes more time than necessary because of flawed assumptions.
When you assume something can’t be related and you don’t test the theory, you
may end up in a path that will lead you nowhere. I learned over the years to
behave in a naive way and always check the assumptions first. That led me to
more successes in solving hard bugs than coworkers who I consider smarter than
me.

------
andrey_utkin
Has anybody here solved their actual bugs with record & replay technology, aka
reverse debugging, like Undo, rr, MS TTD etc? Would love to hear the stories.

------
zedware
Old gcc/g++ doesn't guarantee the thread safety of static variables, and I
have a core dump triggered by it.

------
smabie
If an interviewer asked you what your hardest bug is and you mention a
missing semicolon, he should probably just stop the interview right there.

~~~
ken
A missing semicolon has an entirely different connotation for C programmers
and Lisp programmers.

~~~
smabie
Well if you're missing a semi-colon in a Lisp program and can't figure it out
then you _really_ have some problems.

~~~
kazinator
Missing semicolon:

    
    
      (foo bar))
               ^
    

Loads now:

    
    
      (foo bar);)

