
The bug that hides from breakpoints - luu
http://drewdevault.com/2014/02/02/The-worst-bugs.html
======
userbinator
Having worked with embedded systems before, I've seen my fair share of bugs
like this; they're almost always related to timing or race conditions with
external events. Looks like this one is no exception. An instruction trace (or
a logic analyser logging the data/address buses) is _extremely_ useful, since
you can mentally execute the code it's tracing and see where it went wrong.
Configure it to start/stop logging at the appropriate points and you won't
have millions of instructions to look through. Binary search is very helpful.

In my experience the _truly insanely irritating_ Heisenbugs are those that
involve physical phenomena like marginal logic levels/crosstalk/noise, and in
that case even attaching (or in one case removing) the extra debugging
hardware can cause it to disappear.

~~~
pdkl95
Once, while _writing_ a new program-loader-and-debugger/"bios" for a Z80
clone, I had a particularly nasty heisenbug.

The problem was that my new code "usually" worked. About 25% of the time it
would simply fail to do anything (it worked perfectly the other 75% - "umm...
aren't computers supposed to be deterministic?"). Of course, every single time
I did _anything_ to try to extract useful information from the device, it
would work 100% of the time. Even tossing in a quick 2-byte sequence to flip
an I/O pin made the problem completely go away.

The problem was, of course, that the hardware wasn't quite done rebooting, and
needed to be left alone a _tiny_ bit longer after hitting the RST pin - about
8 cycles longer than the point at which the hardware engineers assured me it
would be "completely safe". Any time I added debug code, it simply acted as
the needed delay. The eventual fix was a _very_ carefully documented sequence
of NOOPs.
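
The shape of that fix, translated into C with GCC-style inline assembly (a
sketch only - the real code was raw Z80 assembly, and the exact count shown is
illustrative):

    /* DO NOT REMOVE: the part is not actually ready when the datasheet
     * says it is; it needs a few extra cycles after RST before it will
     * reliably run code. Any debug code inserted here "fixes" the boot
     * failure by accident, simply by acting as this delay. */
    static inline void post_reset_settle(void)
    {
        __asm__ volatile ("nop; nop; nop; nop; nop; nop; nop; nop");
    }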

What made this particularly infuriating, though, was its seemingly arbitrary
sensitivity to when the real code was started. Depending on how much you
padded with NOOPs, several behaviors were observed:

    
    
    #NOOP | Behavior
    ------+------------------------------------------------------
      0   | original 75/25 ok/fail split
      1   | always failed
      2   | always worked
      3   | always worked
      4   | worked ~95% of the time
      5   | ~80/20 ok/fail split
      6   | ~50/50 ok/fail split (not a lot of data)
      7   | *almost* always worked (I think I saw it fail once)
      8+  | always worked
    

Of course, the various stuff I would add for debugging tended to line up with
the cases when it DID work.

 _sigh_

I miss working with that chip. Z80 assembly is kind of fun. Well, except for
when my boss told me my new loader had to work on our slowest 4 MHz part.
Receiving a program over a 115 kbaud serial line, verifying CRC-8, and writing
it to flash is pretty hard when you have to do it in less than ~100
clocks/byte....
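
For the curious, the usual way to make CRC-8 fit a budget like that is a
256-entry lookup table: one XOR and one table read per byte. A sketch in C -
the 0x07 polynomial is an assumption, not necessarily what that loader used:

    #include <stdint.h>

    /* Byte-at-a-time CRC-8, polynomial x^8 + x^2 + x + 1 (0x07).
     * A 256-entry table trades 256 bytes of ROM for a very cheap
     * per-byte update, which matters with ~100 clocks per byte. */
    static uint8_t crc8_table[256];

    void crc8_init(void)
    {
        for (int i = 0; i < 256; i++) {
            uint8_t crc = (uint8_t)i;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                                   : (uint8_t)(crc << 1);
            crc8_table[i] = crc;
        }
    }

    uint8_t crc8_update(uint8_t crc, uint8_t byte)
    {
        return crc8_table[crc ^ byte];  /* one lookup per byte */
    }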

~~~
SpikedCola
Reminds me of the read-modify-write problem on a lot of PIC devices. Our
teacher taught us to randomly place NOP instructions until it worked; we all
thought he was crazy until we realized what he was working around.

(The _actual_ solution was to write to the LATx registers instead of PORTx)
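
For anyone who hasn't hit it: reading PORTx returns the live pin voltages, so
a bit-set on PORTx is a read-modify-write of whatever the pins happen to be at
that instant, while LATx reads back the output latch. A sketch in C, for a
hypothetical PIC18-style part under XC8:

    #include <xc.h>  /* Microchip XC8 header; provides PORTB/LATB */

    void rmw_hazard_demo(void)
    {
        /* Hazard: each |= is a read-modify-write on PORTB, and the read
         * returns the live pin voltages. If RB0 hasn't finished rising
         * when the second write samples the port, that write stores
         * RB0 = 0 back into the latch, silently undoing the first. */
        PORTB |= (1 << 0);   /* set RB0 */
        PORTB |= (1 << 1);   /* set RB1 - may clear RB0 again! */

        /* Fix: LATB reads the output *latch*, not the pins, so the
         * read-modify-write always sees what was last written. */
        LATB |= (1 << 0);
        LATB |= (1 << 1);
    }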

------
Taniwha
The canonical "bug that hides from breakpoints" was used for copy protection
in some early pc games it involved writing code into the instruction stream
but not executing it because that location was already in the prefetch buffer.
Try an single step through there or set a breakpoint and the debugger would
see the written code not the original

------
ddevault
I never thought I'd see my blog on HN. Hope you liked this post.

FWIW, this is my first venture into kernels, and I've made a lot of mistakes.
In fact, this project is on the fourth rewrite and its followers are
desperately trying to convince me to leave the known cruft in place in favor
of a more timely release. I would love to hear your feedback, though, because
I want to polish this as much as possible in the future.

------
hcarvalhoalves

        The debugging described by this blog post took approximately three weeks.
    

I now have a new standard for "hard to debug" bugs.

~~~
Karellen
Try 30 years, while being attacked by some of the best demo scene coders out
there...

[http://www.linusakesson.net/scene/safevsp/index.php](http://www.linusakesson.net/scene/safevsp/index.php)

I found out about this from
[https://news.ycombinator.com/item?id=5314959](https://news.ycombinator.com/item?id=5314959)
but the original article has disappeared. I think the one above, which I found
via Google, gives a reasonable account, though.

Also, the demo on youtube gives an explanation in the scrolling top line:

[http://www.youtube.com/watch?v=vXcA4OWx0vo](http://www.youtube.com/watch?v=vXcA4OWx0vo)

~~~
userbinator
From my experience, demoscene coders are mainly software people (and they are
_amazingly_ knowledgeable at that), and while most of them likely understand
how CPUs work at the logical level, they won't be as familiar with the
analogue nature of hardware and electronics.

The original article just moved here:
[http://www.pouet.net/prod.php?which=61024#c637759](http://www.pouet.net/prod.php?which=61024#c637759)

------
dmbass
One of the worst embedded bugs I ever had the displeasure of looking into was
on a data streaming device developed by a contractor. One of the I/O processes
would lock up after something like 6 or 7 days of continuously running without
error. My boss and I (the only programmers) scoured the codebase for any
condition that looked like it might maybe somehow cause a lockup and added
debugging outputs. We ran multiple devices in parallel in an attempt to
increase the likelihood of the bug occurring.

The weeks kept rolling by without progress and there was more and more
pressure to ship the thing. We ended up writing code that would detect the
lockup and perform a soft reset. Luckily there were only a few instances where
resetting would cause a perceivable problem but it still hurts to know that
you shipped incorrect software.
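
The usual shape of that workaround is a software watchdog: each I/O process
bumps a heartbeat counter, and a supervisor soft-resets the device when a
counter stops moving. A sketch in C - the names, task count, and soft_reset()
are all illustrative, since the actual detection code isn't described:

    #include <stdint.h>

    #define NUM_TASKS    4
    #define STALL_LIMIT  1000  /* supervisor ticks with no progress */

    /* Each I/O task increments its heartbeat from its main loop. */
    static volatile uint32_t heartbeat[NUM_TASKS];

    static uint32_t last_seen[NUM_TASKS];
    static uint32_t stalled_for[NUM_TASKS];

    extern void soft_reset(void);  /* hypothetical platform reset hook */

    /* Called periodically from a timer interrupt or supervisor loop. */
    void watchdog_tick(void)
    {
        for (int i = 0; i < NUM_TASKS; i++) {
            if (heartbeat[i] != last_seen[i]) {
                last_seen[i] = heartbeat[i];   /* task made progress */
                stalled_for[i] = 0;
            } else if (++stalled_for[i] > STALL_LIMIT) {
                soft_reset();                  /* task i is wedged */
            }
        }
    }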

------
jevinskie
I believe this feat is only possible thanks to an effort to crack the 512-bit
RSA signatures used to sign the firmware. Is this only the case on newer
calculators like the TI-83+ SE, or did even the earlier calculators use RSA?

~~~
dTal
Depends what you mean by "newer". The TI-89 is considerably older than the
TI-83+ and it also uses RSA. All of the TI calculators with updatable firmware
(i.e. all the ones with flash memory instead of ROM) use RSA and have now had
their keys factored [note]. Here's the comprehensive list:

[http://db48x.net/TI-keys/keys.shtml](http://db48x.net/TI-keys/keys.shtml)

[note] apart from the very newest ARM-based TI Nspire, which is a completely
different generation.

------
raverbashing
Yeah, mechanical switches are complicated in a microprocessor context, and
sometimes debouncing causes more problems than it solves
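
The debounce pattern that tends to misbehave least is to sample the switch at
a fixed rate and only accept a change after several consecutive agreeing
samples. A sketch in C, where read_pin() and the sample count are
placeholders:

    #include <stdbool.h>
    #include <stdint.h>

    #define DEBOUNCE_SAMPLES 5  /* ticks of agreement before accepting */

    extern bool read_pin(void);  /* hypothetical raw (bouncy) input */

    /* Call at a fixed rate, e.g. every 1-2 ms from a timer interrupt.
     * Returns the debounced state of the switch. */
    bool debounced_state(void)
    {
        static bool stable = false;  /* last accepted state */
        static uint8_t count = 0;    /* consecutive disagreeing samples */

        if (read_pin() != stable) {
            if (++count >= DEBOUNCE_SAMPLES) {
                stable = !stable;    /* change has persisted; accept it */
                count = 0;
            }
        } else {
            count = 0;               /* agreement resets the counter */
        }
        return stable;
    }

Called from a ~1 ms timer tick, this adds a bounded, predictable latency
instead of the ad-hoc delays that tend to cause more problems than they solve.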

And debugging things in embedded contexts is very complicated

------
krazydad
I've taken to calling them _heisenbugs_.

~~~
slavak
You're a little late to the party, then.[1]

[1]
[http://en.wikipedia.org/wiki/Heisenbug](http://en.wikipedia.org/wiki/Heisenbug)

~~~
sergiotapia
Hey nice annotation number on a 1 line response. /cringe

------
frozenport
Would using a higher level language (C?) have prevented this problem?

~~~
auvrw
i was going to say something similar. since the problem appears, essentially,
to be that threaded programs are difficult to write, perhaps a language or
library with a model of concurrency other than threads would've prevented the
problem. but then, since this is an embedded system, compiling the entire
program down from anything other than C might incur too much performance
overhead, depending on the toolchain. plus, the article mentions a legacy
codebase.

although i have very limited experience with embedded systems, there are
points here that i think apply just as well to web programming:

* concurrency is a fundamentally difficult problem that we've yet to find a really great answer to for all domains. just the other day, there was an article about how threads are better than node-style async, but this article is a good description of the kind of difficulties threaded programming presents, so much so that the first and second rules of thread programming, imo, are both "don't use threads."

* since different languages are suited to different problem domains, it seems like the thing to do is rather than choose a "low/high level" language (quotes b/c the distinction is actually for implementations, not languages; don't forget that lisp used to run on bare metal) for an entire program (and deal with the ensuing tradeoffs of performance vs. code clarity, etc.), it ought to be more common practice to write in multiple languages.

so probably not C, but perhaps lua would've helped?

~~~
josteink
> compiling the entire program down from anything other than C incur too much
> performance overhead

On a system with 32KB of RAM, I think the biggest concern is the size of the
final generated binary. Writing things in C or assembly allows you to optimize
for a smaller resulting image. Using higher-level languages, with libraries
and whatnot, can add considerably to the generated output, and with a 32KB
maximum size that's a very real concern.

And here I thought I was having troubles when I was hacking Android and needed
to keep the recovery-kernel for my particular tablet below 5MB. The luxuries
of modern day computing, eh? :)

