The bug that hides from breakpoints (drewdevault.com)
107 points by luu on May 2, 2014 | 28 comments



Having worked with embedded systems before, I've seen my fair share of bugs like this and it's almost certainly one related to timing/race conditions with external events. Looks like this one is no exception. An instruction trace (or a logic analyser logging the data/address buses) is extremely useful, since you can mentally execute the code that it's tracing and see where it went wrong. Configure it to start/stop logging at the appropriate points and you won't have millions of instructions to look through. Binary search is very helpful.

In my experience the truly insanely irritating Heisenbugs are those that involve physical phenomena like marginal logic levels/crosstalk/noise, and in that case even attaching (or in one case removing) the extra debugging hardware can cause it to disappear.


Once, while writing a new program-loader-and-debugger/"bios" for a Z80 clone, I had a particularly nasty heisenbug.

The problem was my new code "usually" worked. About 25% of the time it would simply fail to do anything (it worked perfectly the other 75% - "umm... aren't computers supposed to be deterministic?"). Of course, every single time I did anything to try to extract any useful information from the device, it would work 100% of the time. Even tossing a quick 2-byte sequence to flip an I/O pin made the problem completely go away.

The problem was, of course, that the hardware wasn't quite done rebooting, and needed to be left alone a tiny bit longer after hitting the RST pin: about 8 cycles longer than the point the hardware engineers assured me would be "completely safe". Any time I added debug code, it simply acted as the needed delay, and the final fix was a very carefully documented sequence of NOOPs.

What made this particularly infuriating, though, was its seemingly arbitrary sensitivity to when the real code was started. Depending on how much you padded with NOOPs, several behaviors were observed:

    #NOOP | Behavior
    ---------------
      0   | original 75/25 ok/fail split
      1   | always failed
      2   | always worked
      3   | always worked
      4   | worked ~95% of the time
      5   | ~80/20 ok/fail split 
      6   | ~50/50 ok/fail split (not a lot of data)
      7   | *almost* always worked (I think I saw it fail once)
      8+  | always worked
Of course, the various stuff I would add for debugging tended to line up with the cases when it DID work.

sigh

I miss working with that chip. Z80 assembly is kind of fun. Well, except for when my boss told me my new loader had to work on our slowest 4MHz part. Receiving a program over a 115kbaud serial line, verifying CRC-8, and writing it to flash is pretty hard when you have to do it in less than ~100 clocks/byte....
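
(For the curious: the usual way to keep the per-byte cost that low is a table-driven CRC. A generic sketch in C, not the actual loader code; the polynomial and initial value here are just placeholders.)

    #include <stdint.h>

    static uint8_t crc8_table[256];

    /* Build the lookup table once at startup; 0x07 is a placeholder polynomial. */
    void crc8_init(void) {
        for (int i = 0; i < 256; i++) {
            uint8_t c = (uint8_t)i;
            for (int b = 0; b < 8; b++)
                c = (c & 0x80) ? (uint8_t)((c << 1) ^ 0x07) : (uint8_t)(c << 1);
            crc8_table[i] = c;
        }
    }

    /* One XOR and one table lookup per received byte. */
    uint8_t crc8_update(uint8_t crc, uint8_t data) {
        return crc8_table[crc ^ data];
    }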


Reminds me of the Read-Modify-Write procedure on a lot of PIC devices. Our teacher taught us to randomly place NOP instructions until it worked; we all thought he was crazy until we realized what he was working around.

(The actual solution was to write to the LATx registers instead of PORTx)
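
For anyone who hasn't hit this: a minimal sketch of the hazard in C, assuming a PIC18-class part with an XC8-style device header (PORTAbits/LATAbits are the usual names there; your registers may differ).

    #include <xc.h>   /* device header; register names vary by part */

    /* Each bit-field assignment compiles to a read-modify-write of the
       whole register, and reading PORTA returns the actual pin levels. */
    void set_two_pins_buggy(void) {
        PORTAbits.RA0 = 1;  /* read PORTA, set bit 0, write back */
        PORTAbits.RA1 = 1;  /* if RA0 hasn't risen to a valid high yet
                               (e.g. a capacitive load), this read sees
                               RA0 as 0 and the write-back clears it */
    }

    /* LATA reads the output latch instead of the pins, so earlier
       writes are never lost. */
    void set_two_pins_fixed(void) {
        LATAbits.LATA0 = 1;
        LATAbits.LATA1 = 1;
    }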


On some x86 PC BIOSes, there is a do-nothing delay loop (I've seen 20K and 64K cycles) not far from the reset vector, and I had figured that it was there for this exact same reason: to give the system some more time to reach a stable state. If the voltages haven't completely settled, or there are still some undefined levels somewhere, then the CPU might be able to execute some if not most instructions correctly, but behave incorrectly for others --- overclockers are probably quite familiar with this "analogue" failure characteristic.


This one is just code that the author didn't understand. He probably wrote it, but didn't understand it. Actually

    call _
    jr z, _
    xor a
    ret
    _

takes 4 bytes on a z80. Should have done something like

    call _
    ret z

which saves three bytes (ret z is a one-byte instruction). I guess he may have forgotten about the conditional return instructions.

And, key debounce is important. In the future, when the key switches become marginal (when, not if), there will be a lot of make/break contact events. Software debounce was very common for this class of system. Missing that, and the effect this had on the kernel being written? Very sloppy work.
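
(For readers who haven't written one: a minimal counter-based debounce sketch in C. This is generic, not the kernel's Z80 code, and the tick rate and threshold are placeholders.)

    #include <stdint.h>
    #include <stdbool.h>

    #define DEBOUNCE_TICKS 5   /* samples of stability required; tune to the switch */

    /* Call from a periodic tick (every few ms) with the raw key level.
       The debounced state only changes after the raw level has been
       stable for DEBOUNCE_TICKS consecutive samples. */
    bool debounce(bool raw)
    {
        static bool stable = false;   /* last debounced state */
        static uint8_t count = 0;     /* consecutive samples that differ */

        if (raw == stable) {
            count = 0;                /* nothing pending */
        } else if (++count >= DEBOUNCE_TICKS) {
            stable = raw;             /* change confirmed */
            count = 0;
        }
        return stable;
    }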

3 weeks to track this down? Back in the '70s, when the Z80 was new, we typically HAD NO DEBUGGER (unless you wrote it yourself), and the hardware being developed may have been flaky. Hell, even the KIM-1 needed software debounce.


Author here. The `xor a` was important - it pretended no key was pressed. I'm familiar with conditional RET.

Thanks for the rest of your kind words, though.


I've experienced plenty of similar bugs that don't manifest when running under a debugger while working on modern ARM systems (not that any of it is necessarily specific to ARM, it's just what I work on).

Usually it's one of the following causes:

The boring one, which should show up as a warning in the compiler: use of code with undefined behavior, which could be something as simple as an uninitialized variable or, in manually written assembly, use of an instruction in a context in which it has undefined behavior.

Bugs caused by poor understanding of the memory model: ARM has a weakly ordered memory model, which allows for various hardware optimisations, but the (systems) programmer has to understand it. The basic idea is that memory accesses could be observed in different order in different coherent cores, so synchronisation points require carefully placed barriers.
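
As a rough illustration (in C11 atomics rather than raw barriers, and only a sketch): publishing data through a flag on a weakly ordered core needs release/acquire ordering, which the compiler lowers to the appropriate DMB or load-acquire/store-release instructions on ARM.

    #include <stdatomic.h>
    #include <stdint.h>

    static uint32_t payload;     /* plain data, written before publishing */
    static atomic_bool ready;    /* publication flag, zero-initialized to false */

    /* Core A: the release store orders the payload write before the flag
       becomes visible (roughly a dmb + store on ARMv7, stlr on AArch64). */
    void publish(uint32_t value)
    {
        payload = value;
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    /* Core B: the acquire load pairs with the release store. Without the
       ordering, this core could legally observe ready == true and still
       read a stale payload. */
    int try_consume(uint32_t *out)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire)) {
            *out = payload;
            return 1;
        }
        return 0;
    }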

Bugs caused by not accounting for the modified Harvard architecture: the data and instruction caches aren't coherent. If you do JIT you'll notice odd behavior soon enough when some executable code is overwritten, but if it's just one-time dynamic code generation, there's a fair chance that most of the time there won't be any stale data in the instruction cache. Until there is; then you get odd errors, and if you inspect the code in the debugger it will look just fine, since it's fetched through the data cache.
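
A sketch of the usual fix when generating code: GCC/Clang expose the required instruction-cache maintenance as a builtin (the buffer handling here is made up).

    #include <string.h>

    /* After copying freshly generated machine code into buf, the new bytes
       live in the data cache; the instruction cache (and prefetch) may
       still hold whatever was there before. __builtin___clear_cache emits
       the appropriate cache maintenance (or syscall) for the target. */
    void install_code(void *buf, const void *code, size_t len)
    {
        memcpy(buf, code, len);
        __builtin___clear_cache((char *)buf, (char *)buf + len);
        /* only now is it safe to branch to buf */
    }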

DMA bugs: if the DMA peripheral isn't connected through a cache coherent bus, it can be quite tricky to ensure you won't accidentally transfer stale data.
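
A hedged sketch of the manual cache maintenance this usually requires on, say, a Cortex-M7 with CMSIS (the driver call is hypothetical, and in practice the buffers also need to be cache-line aligned and padded, or neighbouring data gets corrupted):

    #include <stdint.h>
    /* Assumes the device's CMSIS header is included; on Cortex-M7 it
       provides SCB_CleanDCache_by_Addr()/SCB_InvalidateDCache_by_Addr(). */

    extern void start_dma_tx(const void *buf, uint32_t len);   /* hypothetical driver call */

    void dma_send(uint32_t *buf, uint32_t len_bytes)
    {
        SCB_CleanDCache_by_Addr(buf, (int32_t)len_bytes);      /* write CPU data back to RAM */
        start_dma_tx(buf, len_bytes);
    }

    void dma_receive_done(uint32_t *buf, uint32_t len_bytes)
    {
        SCB_InvalidateDCache_by_Addr(buf, (int32_t)len_bytes); /* drop stale cached lines */
        /* buf now reflects what the DMA engine actually wrote */
    }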

Finally, something that's not so much influenced by using a debugger; it's just that the debugger is completely useless: silicon bugs. Just a few weeks ago, I found a bug in an MCU which causes it to send multiple copies of an SPI transfer for certain ratios between the peripheral's clock and the core's clock. It was relatively easy to find since SPI is an external bus which I could probe with a scope, but if a similar issue happened on-chip, it would be a lot trickier.


One of the most irritating ones I found was in an FPGA. There was a bug in the FPGA code's initial state that would fail to initialize an ADC correctly, but only if the temperature was somewhere around 75F.

So at room (lab) temperature, it was fine. Out in the "field", which was 1000 feet away from the lab, nothing would work.


Did you run a post-PAR (timing) simulation on that? On designs with low timing slack or large area occupation this is really important; otherwise you may easily violate setup/hold time constraints in what are usually very nondeterministic ways. This is, in fact, one of the reasons why ASIC development is ~80% verification.

We had a bug of that kind once, where a Microblaze core would fail to boot the Linux image after what seemed like a trivial modification to the peripheral code, but it turned out we were packing the logic so tightly that this change blew the timing on some critical net. Ah, the joys of digital design.


I just want to point out that nowadays people don't usually catch timing issues in digital designs with simulation. It's mostly done by static timing analysis tools such as PrimeTime, which is much more accurate, efficient, and thorough than running logic simulations with annotated delays.


The canonical "bug that hides from breakpoints" was used for copy protection in some early PC games: it involved writing code into the instruction stream but not executing it, because that location was already in the prefetch buffer. Try to single-step through there or set a breakpoint, and the debugger would see the written code, not the original.


I never thought I'd see my blog on HN. Hope you liked this post.

FWIW, this is my first venture into kernels, and I've made a lot of mistakes. In fact, this project is on the fourth rewrite and its followers are desperately trying to convince me to leave the known cruft in place in favor of a more timely release. I would love to hear your feedback, though, because I want to polish this as much as possible in the future.


    The debugging described by this blog post took approximately three weeks.
I now have a new standard for "hard to debug" bugs.


Try 30 years, while being attacked by some of the best demo scene coders out there...

http://www.linusakesson.net/scene/safevsp/index.php

I found out about this from https://news.ycombinator.com/item?id=5314959, but the original article has disappeared. I think the one above, which I found via Google, gives a reasonable account.

Also, the demo on YouTube gives an explanation in its scrolling top line:

http://www.youtube.com/watch?v=vXcA4OWx0vo


In my experience demoscene coders are mainly software people (and amazingly knowledgeable at that), and while most of them likely understand how CPUs work at the logical level, they won't be as familiar with the analogue nature of hardware and electronics.

The original article just moved here: http://www.pouet.net/prod.php?which=61024#c637759


This is my current champion:

https://github.com/django/django/commit/8a0fa75839

Took the better part of a year to finally understand, at which point the two-line fix was simple.


Would you mind producing a lil' write up?


So, this was in the early, early days of Django with the old ORM and the old forms system (which went by the name of "manipulators"), though the bug persisted past the ORM rewrite, since it was actually in the forms system.

When a "manipulator" was creating or modifying an ORM model instance, it would need to figure out which fields from the model and, sometimes, which fields from related (via foreign-key or many-to-many relations) models to include.

For related models, the code would generate a dictionary, called "follow", listing the fields to, well, follow across the relation and include in the manipulator. The method "get_manipulator_fields()" on the class representing the related object would then iterate over its own fields, and if a field name turned up in the "follow" dictionary it would add that one to the under-construction list of fields for the manipulator.

Except sometimes that code would crash with an exception: "AttributeError: 'bool' object has no attribute 'get'". This was rather puzzling, and although there did seem to be patterns to when it would happen, it wasn't always possible to consistently reproduce it.

That two-line fix came from realizing that the exception was a symptom of an underlying problem: it was coming from a situation where the name of a foreign-key field on one model was the same as the internal-bookkeeping name Django had generated for another model class. In that case, and only in that case, the manipulator-generating code would get confused and end up on the wrong code path, which is how get_manipulator_fields() was receiving a boolean argument where it expected a dictionary (this was compounded by the fact that everything which could end up adding fields to a manipulator did so via a method of that name).

So the fix was to ensure that in the code which handled related objects, Django would always use a name that couldn't conflict and throw the manipulator code down the wrong path.

Of course, not long after that the manipulator system was ripped out and replaced with the much-saner django.forms module.


One of the worst embedded bugs I ever had the displeasure of looking into was on a data streaming device developed by a contractor. One of the I/O processes would lock up after something like 6 or 7 days of continuously running without error. My boss and I (the only programmers) scoured the codebase for any possible condition that looked like it might maybe somehow cause a lockup and added debugging outputs. We ran multiple devices in parallel in an attempt to increase the likelihood of the bug occurring.

The weeks kept rolling by without progress and there was more and more pressure to ship the thing. We ended up writing code that would detect the lockup and perform a soft reset. Luckily there were only a few instances where resetting would cause a perceivable problem but it still hurts to know that you shipped incorrect software.
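
(For anyone curious what that mitigation looks like in practice, here's a generic sketch in C of a heartbeat-based lockup detector; the names and the reset hook are made up, not the contractor's code.)

    #include <stdint.h>

    static volatile uint32_t io_heartbeat;   /* bumped by the I/O task whenever it makes progress */

    void io_task_step(void)
    {
        /* ... do one unit of I/O work ... */
        io_heartbeat++;
    }

    #define STALL_LIMIT_TICKS 1000U   /* several times the worst-case legitimate gap */

    extern void soft_reset(void);     /* hypothetical platform-specific reset hook */

    /* Called periodically from a timer: if the heartbeat stops moving for
       too long, assume the lockup has happened and force a soft reset. */
    void supervisor_tick(void)
    {
        static uint32_t last_seen;
        static uint32_t stalled_ticks;

        if (io_heartbeat != last_seen) {
            last_seen = io_heartbeat;
            stalled_ticks = 0;
        } else if (++stalled_ticks >= STALL_LIMIT_TICKS) {
            soft_reset();             /* last resort: the root cause was never found */
        }
    }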


I believe this feat is only possible thanks to an effort to crack the 512-bit RSA signatures used to sign the firmware. Is this only the case on newer calculators like the TI-83+ SE, or did even the earlier calculators use RSA?


Depends what you mean by "newer". The TI-89 is considerably older than the TI-83+ and it also uses RSA. All of the TI calculators with updatable firmware (i.e. all the ones with flash memory instead of ROM) use RSA and have now had their keys factored [note]. Here's the comprehensive list:

http://db48x.net/TI-keys/keys.shtml

[note] apart from the very newest ARM-based TI Nspire, which is a completely different generation.


Yeah, mechanical switches are complicated in a microprocessor context, and sometimes debouncing causes more problems than it solves.

And debugging things in embedded contexts is very complicated.


I've taken to calling them heisenbugs.


You're a little late to the party, then.[1]

[1] http://en.wikipedia.org/wiki/Heisenbug


Hey nice annotation number on a 1 line response. /cringe


Would using a higher level language (C?) have prevented this problem?


I was going to say something similar. Since the problem appears, essentially, to be that threaded programs are difficult to write, perhaps a language or library with a model of concurrency other than threads would've prevented the problem. But then, since this is an embedded system, compiling the entire program down from anything other than C would incur too much performance overhead, depending on the toolchain. Plus, the article mentions a legacy codebase.

Although I have very limited experience with embedded systems, there are points here that I think apply just as well to web programming:

* Concurrency is a fundamentally difficult problem that we've yet to find a really great answer to for all domains. Just the other day, there was an article about how threads are better than node-style async, but this article is a good description of the kind of difficulties threaded programming presents, so much so that the first and second rules of thread programming, IMO, are both "don't use threads."

* Since different languages are suited to different problem domains, rather than choosing a "low/high level" language (quotes because the distinction really applies to implementations, not languages; don't forget that Lisp used to run on bare metal) for an entire program, and dealing with the ensuing tradeoffs of performance vs. code clarity, etc., it ought to be more common practice to write in multiple languages.

So probably not C, but perhaps Lua would've helped?


> compiling the entire program down from anything other than C would incur too much performance overhead

On a system with 32KB of RAM, I think the biggest concern is the size of the final generated binary. Writing things in C or assembly lets you optimize for a smaller resulting image. Using higher-level languages, with libraries and whatnot, can add considerably to the generated output, and with a 32KB maximum size, that's a very real concern.

And here I thought I was having troubles when I was hacking Android and needed to keep the recovery-kernel for my particular tablet below 5MB. The luxuries of modern day computing, eh? :)



