In my experience the truly insanely irritating Heisenbugs are those that involve physical phenomena like marginal logic levels/crosstalk/noise, and in that case even attaching (or in one case removing) the extra debugging hardware can cause it to disappear.
The problem was my new code "usually" worked. About 25% of the time it would simply fail to do anything (it worked perfectly the other 75%: "umm... aren't computers supposed to be deterministic?"). Of course, every single time I did anything to try to extract any useful information from the device, it would work 100% of the time. Even tossing in a quick 2-byte sequence to flip an I/O pin made the problem completely go away.
The problem was, of course, that the hardware wasn't quite done rebooting, and needed to be left alone a tiny bit longer after hitting the RST pin. About 8 cycles longer than when the hardware engineers assured me it would be "completely safe". Any time I added debug code, it simply served as the needed delay; the fix ended up being a very carefully documented sequence of NOOPs.
What made this particularly infuriating, though, was its seemingly arbitrary sensitivity to when the real code was started. Depending on how much you padded with NOOPs, several distinct behaviors were observed.
#NOOP | Behavior
0 | original 75/25 ok/fail split
1 | always failed
2 | always worked
3 | always worked
4 | worked ~95% of the time
5 | ~80/20 ok/fail split
6 | ~50/50 ok/fail split (not a lot of data)
7 | *almost* always worked (I think I saw it fail once)
8+ | always worked
I miss working with that chip. Z80 assembly is kind of fun. Well, except for when my boss told me my new loader had to work on our slowest 4 MHz part. Receiving a program over a 115 kbaud serial line, verifying CRC-8, and writing it to flash is pretty hard when you have to do it in less than ~100 clocks/byte....
(The actual solution was to write to the LATx registers instead of PORTx)
takes 4 bytes on a Z80. Should have done something like a conditional return, ret z, which saves three bytes (ret z is a one-byte instruction).
I guess he may have forgotten about the conditional return instructions.
And, key debounce is important. In the future, when the key switches become marginal (when, not if), there will be a lot of make/break contact events. Software debounce was very common for this class of system. Missing that, and the effect it had on the kernel being written? Very sloppy work.
3 weeks to track this down? Back in the '70s when the Z80 was new, we typically HAD NO DEBUGGER (unless you wrote it yourself). And the hardware being developed may itself have been flaky. Hell, even the KIM-1 needed software debounce.
Thanks for the rest of your kind words, though.
Usually it's one of the following causes:
The boring one, which should show up as a compiler warning: code with undefined behavior. That could be something as simple as an uninitialized variable, or, in hand-written assembly, an instruction used in a context where its behavior is undefined.
Bugs caused by poor understanding of the memory model: ARM has a weakly ordered memory model, which allows for various hardware optimisations, but the (systems) programmer has to understand it. The basic idea is that memory accesses can be observed in different orders by different coherent cores, so synchronisation points require carefully placed barriers.
Bugs caused by not accounting for the modified Harvard architecture: the data and instruction caches aren't coherent. If you do JIT you'll notice odd behavior soon enough when some executable code is overwritten, but if it's just one-time dynamic code generation, there's a fair chance that most of the time there won't be any stale data in the instruction cache. Until there is; then you get odd errors, and if you inspect the code in the debugger it will look just fine, since it's fetched through the data cache.
DMA bugs: if the DMA peripheral isn't connected through a cache coherent bus, it can be quite tricky to ensure you won't accidentally transfer stale data.
Finally, something that's not so much influenced by using a debugger; it's just that the debugger is completely useless: silicon bugs. Just a few weeks ago, I found a bug in an MCU which causes it to send multiple copies of an SPI transfer for certain ratios between the peripheral's clock and the core's clock. It was relatively easy to find since SPI is an external bus which I could probe with a scope, but if a similar issue happened on-chip, it would be a lot trickier.
So at room (lab) temperature, it was fine. Out in the "field", which was 1000 feet away from the lab, nothing would work.
We had a bug of that kind once, where a MicroBlaze core would fail to boot the Linux image after what seemed like a trivial modification to the peripheral code, but it turned out we were packing the logic so tightly that the change blew the timing on some critical net. Ah, the joys of digital design.
FWIW, this is my first venture into kernels, and I've made a lot of mistakes. In fact, this project is on the fourth rewrite and its followers are desperately trying to convince me to leave the known cruft in place in favor of a more timely release. I would love to hear your feedback, though, because I want to polish this as much as possible in the future.
The debugging described by this blog post took approximately three weeks.
I found out about this from https://news.ycombinator.com/item?id=5314959 but the original article has disappeared. I think the one above, which I found via Google, gives a reasonable account, though.
Also, the demo on youtube gives an explanation in the scrolling top line:
The original article just moved here: http://www.pouet.net/prod.php?which=61024#c637759
Took the better part of a year to finally understand, at which point the two-line fix was simple.
When a "manipulator" was creating or modifying an ORM model instance, it would need to figure out which fields from the model and, sometimes, which fields from related (via foreign-key or many-to-many relations) models to include.
For related models, the code would generate a dictionary, called "follow", listing the fields to, well, follow across the relation and include in the manipulator. The method "get_manipulator_fields()" on the class representing the related object would then iterate over its own fields, and if a field name turned up in the "follow" dictionary it would add that one to the under-construction list of fields for the manipulator.
Except sometimes that code would crash with an exception: "AttributeError: 'bool' object has no attribute 'get'". This was rather puzzling, and although there did seem to be patterns to when it would happen, it wasn't always possible to consistently reproduce it.
That two-line fix came from realizing that the exception was a symptom of an underlying problem: it was coming from a situation where the name of a foreign-key field on one model was the same as the internal-bookkeeping name Django had generated for another model class. In that case, and only in that case, the manipulator-generating code would get confused and end up on the wrong code path, which is how get_manipulator_fields() was receiving a boolean argument where it expected a dictionary (this was compounded by the fact that everything which could end up adding fields to a manipulator did so via a method of that name).
So the fix was to ensure that in the code which handled related objects, Django would always use a name that couldn't conflict and throw the manipulator code down the wrong path.
Of course, not long after that the manipulator system was ripped out and replaced with the much-saner django.forms module.
The weeks kept rolling by without progress, and there was more and more pressure to ship the thing. We ended up writing code that would detect the lockup and perform a soft reset. Luckily there were only a few instances where resetting would cause a perceptible problem, but it still hurts to know that you shipped incorrect software.
[note] apart from the very newest ARM-based TI Nspire, which is a completely different generation.
And debugging things in embedded contexts is very complicated
although i have very limited experience with embedded systems, there are points here that i think apply just as well to web programming:
* concurrency is a fundamentally difficult problem that we've yet to find a really great answer to for all domains. just the other day, there was an article about how threads are better than node-style async, but this article is a good description of the kind of difficulties threaded programming presents, so much so that the first and second rules of thread programming, imo, are both "don't use threads."
* since different languages are suited to different problem domains, it seems like the thing to do is rather than choose a "low/high level" language (quotes b/c the distinction is actually for implementations, not languages; don't forget that lisp used to run on bare metal) for an entire program (and deal with the ensuing tradeoffs of performance vs. code clarity, etc.), it ought to be more common practice to write in multiple languages.
so probably not C, but perhaps lua would've helped?
On a system with 32KB of RAM, I think the biggest concern is the size of the final generated binary. Writing things in C or assembly will let you optimize for a smaller resulting image. Using higher-level languages, with libraries and whatnot, can add considerably to the generated output, and with a 32KB maximum size that's a very real concern.
And here I thought I was having troubles when I was hacking Android and needed to keep the recovery-kernel for my particular tablet below 5MB. The luxuries of modern day computing, eh? :)