Hacker News

> you can't trust libraries blindly, even one of the most used and broadly adopted ones

There is a corollary in development and debugging. When things break in mysterious ways, we tend to go through a familiar song and dance. As experience, skills and even personal networks grow, we can find ourselves diving ever deeper down the following chain.

1. "It must be in my code." -- hours of debugging

2. "Okay, it must be somewhere in our codebase." -- days of intense debugging and code spelunking

3. "It HAS TO be in the third party libraries" -- days of issue tracker excavations and never-before-enabled profiling runs

4. "It can't possibly be in stdlib..." -- more of the same, but now profiling the core runtime libraries

5. "Please let this not be a compiler bug" -- you become intensely familiar with mailing list archives

6. "I will not debug drivers. I will not debug drivers. I will not debug drivers."

7. "I don't even know anyone who could help me figure out the kernel innards."

8. "NOBODY understands filesystems!"

9. "What do you mean 'firmware edge case'?"

And the final stage, the one I have witnessed only one person ever achieve:

10. "Where is my chip lab grade oscilloscope?"

Apart from bullheadedness, this chain also highlights another trait of a good developer. Humility.




From my experience this is normal for embedded development; particularly for consumer electronics. Part of the reason some developers in this space have to wear so many hats is that the pace in consumer electronics is unforgiving. I don't think my current employer is unusual either.

Hopefully, the number of frameworks at the top and the size of your individual programs are relatively small (so that 1-3 aren't nightmares by themselves).

In my experience, 4-5 are seldom the problem (thanks Linaro!). I suspect the ratio of C to C++ is significantly larger in embedded systems, though.

In general, PowerPC/MIPS/ARM toolchains and drivers are not as mature as x86/AMD64. 6-8 tend to occur because CPU vendors usually have their own "blessed" toolchains and BSPs that have diverged from their upstream projects. Fortunately, this means that it's often the case that someone else has already fixed the problem. It's just as often that a driver has not been tested for your use-case since the last time that particular driver's infrastructure was refactored inside the kernel. Or... you wrote the driver and made the mistake (or it might be something from 9/10).

9-10 happen because we're often using hardware that is new and has not had all of its errata discovered yet.

When products need to ship, we're regularly going through this stack. I've seen every one of these, even in just the last 4 years.


Can confirm. I had a trippy experience where one monitor showed some RTL+simulation for our chip, another showed the PCB schematic I had helped design, the third had the GUI and embedded toolchain development environments up, and on my desk an oscilloscope was measuring that PCB running that firmware. It was basically rolling through the list and really fun!


Indeed. I too once had the dubious pleasure of having an oscilloscope on my desk, between two computers and a prototype.


I've done the oscilloscope thing, though it was only a vanilla couple-of-hundred MHz scope on some pins that were bugging me, and not the full deal: "Gang way, we're cracking the lid of that thing and going in." That sounds exciting; I'd love to see it.

Once, I used a chunk of ice to cool down a chip, and that made it work. The hardware guys were unimpressed. But hey, they've got cans of Chill and they use them a lot, and this software guy took a while to realize the reason the board worked in the morning and was dead by lunch, and worked for a little while again after lunch, was temperature related.

There were some devs who tracked down a nasty bug in a processor's TLB. I only heard about that one, wish I had been there. I only had to deal with the fallout in the hypervisor. Note: If you have to spend 20ms hunting down and killing lies with all interrupts turned off and everything basically stopped in its tracks, you are no longer a real-time operating system.


Heh. It could be telling that I had to look up the expansion for TLB. CPU cache implementation... holy crap.

My ex-coworker has done the vanilla scope thing too and has a 400 MHz scope at home. For some reason, people like this are not too uncommon in the Finnish oldskool[tm] IT scene. I remember how he isolated a latency-and-concurrency bug to an expensive interrupt handler. Rewriting isolated parts of core kernel code to make a really tricky problem go away was one of his more hardcore skills.

I'm not even near his level. My own experience is limited to slightly nibbling the edges of file system and block cache behaviour. It's a brave person who dares dive into that code. Not me.

But I do know one person who regularly works with decapped chips. He works for a company who do extremely low-level hardware investigations. Now that's hardcore.


Cache bugs are one of the fun ones. You think you're losing your mind and the people around you would probably agree. A couple of weeks go by, your spouse is ready to fire you, your boss wants to divorce you, and every waking moment is full of race conditions. Four-way stops on the drive to work are a source of stress and you punch buttons in the elevator and worry about firmware bugs. Then you get to your desk and there's the setup, a laughably small board for all the trouble it's made, and it's time for single combat, Sherlock Holmes style.

When you find the problem it's usually a blinding flash of realization that illuminates a tiny, eensy bit of code that you tweak and make right in a couple of minutes. Invariably the mistake was pretty stupid. The glory moment is over quickly because you know all the test cases will pass and that you've just nailed another one.

You've got bragging rights during one lunch, but that's it. It's off to more mundane bugs in the mortal world, and you feel a little sad.

I need to do hardware again.



I remember a 3G network signalling simulation I worked on back in about 2002. We ran it on a rack of custom servers. The CPU load was pretty hellish, and the only way we could get it to run reliably without segfaults was to install gaming cooling systems and underclock the CPUs ... ran like a charm then!


In the late '90s and early 2000s it was commonplace to fix OS panics by opening the computer and pointing a fan at it.

I would try it even before going after the harder software problems, because it's so easy.


Ever squeaked the chips on an Amiga?


People couldn't import computers into my country back in the Amiga's day.

I had a locally made Spectrum clone. It didn't overheat, but I lost a multimeter on its power supply.


> 10. "Where is my chip lab grade oscilloscope?"

11. "Shit, where do I borrow a spectrum analyzer and a set of near field probes? These things cost an arm and a leg!"

Yes, STM32F1 MCUs generate interference that jams GPS receivers. No, it's not documented anywhere.


And here's the documentation for future generations! :P


12. "Try our new 7nm fab they said. You'll be ahead of 10nm competition with few issues. Now, gotta call engineers at the fab to see if it's materials or production messing my custom stuff up. (sighs)"


You'd be surprised how much you can find out just with a $10 TV tuner dongle and a piece of coax with a short section of the outer braid trimmed off at one end.


You're describing my college tv antenna. Coax is cheap. An actual digital antenna is like... 50 ramen equivalents.


50 ramen = $10 USD


RTL-SDR is the Arduino of EMC work. :)


any MCU can "jam" a GPS receiver if the board is laid out improperly or without enough shielding


A bare, free-floating STM32F103 with literally nothing but a LiPo battery connected with two wires, running the blinky.c demo, will completely jam many GPS receivers when placed next to the antenna.


11. "Hm maybe I should check my code again... ah there's the bug"


"oh, this config shouldn't be linked to /dev/null..."


    if (featureFlags[HN_DEBUG_HIER_FLAGS] = null) {
    /* Why won't this trigger!?!?!?
    */
    }
...oh god, kill me now.


I've done the oscilloscope thing, but it was for IoT stuff - debugging a broken I2C communication with an Arduino (8-bit 16 MHz ATmega CPU).

The Arduino software stack is not huge; there is no operating system involved. Our application is the only thing running on bare, slow hardware with very limited memory. But this also makes debugging harder. The IDE is limited, you debug over serial output. You have to reflash the flash-memory after every re-compile, which can take a minute.

Building an IoT system for very specific tasks that has to run reliably for years without interruption, I would still use a tiny 8-bit ATmega CPU (e.g. Arduino), paired with a 32-bit ARM CPU (e.g. RPi) to control the tiny CPU and handle the networking with a control center.


> The IDE is limited, you debug over serial output. You have to reflash the flash-memory after every re-compile

uh, you know that AVRs have debugWIRE (smaller parts) or JTAG (bigger parts)?


The furthest I have got down the list was trying to bring up the first prototype of a board that had been designed with overly long PCB traces between the SoC and DRAM. If you tried to read a location in memory, you got the value of the page table entry for that address rather than the contents at that address.


I once had to debug a poorly-designed board where the CPU would lock up if you did a DRAM burst write with at least 3 of the 5 highest bits of the word set (yes, I narrowed the test case down that far). A quick look at the layout confirmed that those traces were routed directly under the crystal oscillator without any form of ground shielding...

(We ended up underclocking the CPU by about 20% because there wasn't enough time for a redesign. Sigh. It's a miracle the thing even worked in the first place...)


... then your power supply goes marginal (because it will) and well . . . never been there :-)


I once had the opportunity to spend about a week debugging an incorrect SDRAM configuration by the BSP team. At first I blamed a third-party library with no source code available. Then it occurred to me that my initial SDRAM tests were doing word-by-word access. The third-party library used memset, which was optimised to use DMA for bulk transfers, and that failed to write subsequent words in the same transaction.

An easy, one-bit change in the SDRAM configuration registers fixed it. A week well spent!


Similar: my new driver crashes the machine. A couple days debugging. Triple-check every register value. All good. It doesn't crash when I single-step! A couple more days debugging. Finally get it: the machine crashes when two ports are enabled close enough in time. Go talk to the hardware guys. “Yeah, we know, power traces are fixed in the next rev.”


I feel like "I can reliably make the bug disappear by turning on my debug harness" is a reliable sign that things are about to get weird.


Ooofff, that list made my stomach churn, more stuff of nightmares! All debugging post-mortems of this level should be written in Lovecraftian style.

... it's not widely known, whispers attribute it to a transcription error, unsure when it started, copied through ancient manuscripts, that the Dead Thing that lies dreaming at the bottom of the ocean, is actually named ... C++hulhu


I have also seen developers far too keen to blame the library before exhausting the most likely case: that the issue is caused by the local code (steps 1 and 2), or at any rate is fixable in it.

It's a well-known syndrome. The classic motto for it is "'SELECT' isn't broken" https://blog.codinghorror.com/the-first-rule-of-programming-...


This too. For every one time it's the parent article, 99 times it's my code.


Regarding #10: Oh Lord, I've been there too many times to count... One of the more memorable times was with an old timing distribution system. The thing would pretty much just send out clock pulses to networked machines, and doing this properly cost a lot of money (very abridged here). This particular one was acting 'funky' and came back in. In testing it, we got really weird behavior. True to your list, I think we went 1, 2, 5, 6 (no drivers, per se), 7 (for about 5 days), 8, 9, 10.

At 10, we finally plugged in the o-scope and started debugging the PCB vias and connections themselves. Things were getting really wacky now. The Faraday cage that was the testing room had to be re-grounded, we thought, as the wires themselves were still carrying current even when the power was disconnected. One of the guys brought in his old hand-held impact hammer to drive a new copper stake into the peeled-up linoleum and through the foundation of the building. Still, we got strange results. Really strange results that, to us, were worthy of a Nobel Prize, as we had thus far proved to ourselves that physics herself was broken inside the lab. For reference: a lot of people worked in there, so having stuff about in all kinds of disrepair was typical.

I remember, long after the pizza had gone cold and the Mt. Dew was flat, looking up at the ceiling of the room. I saw an old RF horn hanging from the roof, kinda held together with the connecting wires. 'Hey, if that thing was on, would it do anything?' The other techs' eyes all lit up. Turns out, one of the guys had been doing something with the horn for some other test. He had left for an extended backpacking vacation and accidentally left the thing on. The broadcasting from the horn was adding a small amount of current to all our wires, throwing the whole box just far enough out of whack to cause all the issues.

At about 4 am, we finally got the box re-configured, the original problem from the customer solved, and all of it packed up and ready to overnight to the customer when the UPS store opened at 7 am, about 3 hours from then. The poor guy got back from vacation to that mess of an email inbox and many meetings. It was an honest mistake, and he bought us all 12-packs for the trouble. Still, when you think you have proved that physics is broken, I think that will qualify as step 12.


11. "Do we have the IP core for this?"

12. "Where is my electron microscope?"


13. "We're gonna need some time on the FIB workstation" [1]

[1] http://www.electronicdesign.com/eda/fib-circuit-edit-becomes...


That's the one I should've thought of. I said fab but who trusts them to know what's on it!? Haha.


This is the worst thing I've ever found, still not solved: https://github.com/crystal-lang/crystal/issues/4127


You must be kidding me. We hit a bug with the mysql2 gem where the client randomly crashes in libmariadbclient (but not libmysqlclient) only on debian (using Arch Linux and OS X for dev, but exact same versions of everything) and only for database names of length 25. And 28, but we cannot reproduce it on the repro docker image we made. And only if there are enough aliases in the query (could be as low as 5 but could need as much as 20+). And only if 'active_record' is require'd, but even when it's not used at all. And never ever under GDB or Valgrind, making it the perfect heisenbug.

That's a lot of stars to align there, but when they do, hell breaks loose just often enough to be sure it's not completely random, and obviously this hit one of our most finicky customers, and only in production because of course "#{customer}_production".size == 28 (and not 25, because nah, that'd have been too easy to be able to reproduce the bug right away).

[0] bug: https://github.com/brianmario/mysql2/issues/822

[1] repro: https://github.com/adhoc-gti/mysql2_pointer_bug



I've only seen level 5 personally, with a C# compiler bug that would omit totally valid `else` branches.


My favorite level 5 was a bug in clang that caused it to occasionally emit incorrect code when calling a vararg function. However, the bug was harmless when combined with clang's vararg function prologue. When calling a vararg function compiled with gcc, the clang bug would cause gcc's prologue to jump to a quasi-random address vaguely nearby and continue execution in the middle of some other function. That was great fun. I wrote it up here:

https://www.mikeash.com/pyblog/friday-qa-2013-06-28-anatomy-...


Care to elaborate? Which C# compiler?


The official one, I think it was in .NET 3, but it was a few years ago at an old job, so I'm a bit hazy on the details.

Basically we had a bug where a whole conditional branch was being skipped, and we traced it down to the branch being omitted entirely from the compiled IR.

And no, it was nothing fancy, just something like:

    if (customer.country == "US") {
      doSomething();
    } else {
      doDifferentThing();
    }
The whole `else` branch was simply missing from the compiled program.

If I remember correctly, we got around it by doing something like:

    bool isUsCustomer = customer.country == "US";
    if (isUsCustomer) {
      doSomething();
    } else {
      doDifferentThing();
    }
Anyway, the point is that the compiler fucked up its handling of if/else statements, but only at that specific part of the code, leading to a few wasted days of effort tracking down the problem.


It can get even more "fun" with Java. Your code can start running through an interpreter, then after a while suddenly be transformed by a JIT engine. The interpreter and the JIT engines (there's more than one JIT engine) have different bugs. The optimizations made by the JIT engine can depend on the data which went through your method before the JVM decided to optimize it.

I'm not finding it right now, but I recall seeing a few weeks ago a presentation with several of these sorts of bugs in a recent version of Java (all reported and fixed): after a number of iterations, it suddenly starts returning wrong results.


Sounds like an optimization going haywire, deducing that the condition in question would always evaluate the same way. It's valid to optimize an else branch out if it can never be reached (dead-code elimination). Was there something akin to this in the statement?


> Was there something akin to this in the statement?

It probably was the optimiser at fault, but there wasn't anything special about this conditional, and certainly nothing that _should_ have caused the optimiser to throw away the else branch.

If memory serves right it was comparing a string field of an object to a static string, like `someObject.foo == "some string"`.


Sorry, I don't buy it. I've seen countless cases where developers conclude that "compiler has a bug" and it never ultimately did. There are also cases where they never bother to figure it out, change the code a bit and continue with their lives thinking they've found a bug in the compiler. But they didn't.


> Anyway, the point is that the compiler fucked up it's handling of if/else statements, but only at that specific part of the code

It would have to be specific. Put it this way: if this was a general bug and the "else" was always omitted, how long would it last before being found and fixed?

Related, if you were to say to me "I found the issue, the compiler isn't correctly handling if/else statements" Then my first thought would be about your medication not about the compiler.


> if you were to say to me "I found the issue, the compiler isn't correctly handling if/else statements" Then my first thought would be about your medication not about the compiler

And yet, it happened :)

And the senior engineers at the company looked at it and confirmed it was a compiler bug. Their best guess was that something about that part of the code was putting the compiler in a funny state, causing it to skip that particular `else` branch.

We reported it to Microsoft, but never heard anything back.


went through these kinds of stages a few times in my career.

once it led me to discover a leak in a major travel website's purchase flow caused by Java's Thread class, related to thread groups.

most recently, I was writing some Linux auth code in C, reached a point where I could rule out my code, and found a bug in sudo. freaking sudo.

(also related to groups, though the Linux user kind, not Java threads.)


For devs using higher level languages it is more like:

1. "It must be my code" -- minutes of debugging

2. "It must be in our codebase" -- hours of debugging

3. "Third-party library or framework" -- If it's a library, use a different library; if it's a framework, accept the bug and work around it whilst cursing the framework choice.


This one goes to eleven: 11. "On some setups, clients get a corrupted stack when swapped back in from the kernel."

(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/745836)


This is the most concise list of debugging layers I have seen. I'm mostly commenting so that I get this into my comment feed and can locate it easier.


In that case, feel free to bookmark this one: http://bostik.iki.fi/aivoituksia/random/developer-debugging-...

This got more attention than I thought possible, so I decided to pull it out as an item all by itself.


Spot on. The experience level of a developer is directly proportional to how fast he/she assumes that the problem is in someone else's code. ;-)


That one depends a lot on who your coworkers are too.


I was thinking more of the code in the OS, compiler, run-time, etc. I should have made that more clear in my comment.


I feel lucky to only have reached item 4; I wouldn't feel confident beyond it, anyway.


I don't think anybody is ever confident beyond level 1. The worst part is those "why the fuck does everything work when I plug in a logic analyzer?" moments.


I'm usually well into #3 before I realize I'm in too deep



