Debugging hardware is hard (supermechanical.com)
74 points by jkestner 7 months ago | 31 comments



True, dat.

My first job was as a bench tech for an RF/microwave manufacturer (defense).

I think that experience made me a much better debugger when I switched to software.

A lot of my work was analog (Spectrum Analyzers, Oscilloscopes, Signal Generators, etc.), but we also did a lot of digital debugging.

One of the coolest tools we had was an "ICE" (In-Circuit Emulator).

You yanked out the processor and plugged the ICE in its place. You could view everything going on inside the processor: registers, accumulators, the whole kitchen sink.

These days, it would be impossible to make an ICE for current processors (although AI-assisted design might give us some surprises).

A lot of issues came about at the intersection of analog and digital. Ringing could wreak havoc on digital buses, and in those days we weren't as good at handling GHz frequencies as we are now. Every solder burr became a microwave broadcast antenna.


Many chips come with on-board debugging accessible via JTAG that lets you do most of what you'd do with an ICE in the good old days.

Another option I've seen is to simulate/emulate at lower speeds with FPGAs.

I also had a lot of fun debugging electronics. My favorite story is spending more than a month debugging a TI DSP-based board I designed that wouldn't boot properly. It turned out that some of the ground pins (it had many) were left unconnected: the guy who did the PCB layout (not me; I did the layout for many of my designs, but not this one) didn't notice a tiny segment that needed to be wired from the pin to the ground plane.

An important lesson I learned from debugging hardware is that you have to persevere. I learnt this from a very experienced EE I worked with: all problems can be found with enough perseverance. Kids these days (lol) give up too soon...


I do some retro work, and in order to debug a system (this was at a time when I knew less about this type of system) I once plugged a Raspberry Pi Pico into a Z80 socket (not directly of course) and ran commands to read/write from/to memory/IO/ROM to see if I could find the problem. Makes you feel quite omnipotent :)

Course, my setup didn't really emulate the Z80 to run code; it just emulated its signals for a handful of instructions, which I instructed the Pico to do on demand via its UART. (Pictures and some code: https://blog.qiqitori.com/2023/02/two-raspberry-pi-picos-pre...)
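
Roughly, a single Z80-style memory read from the Pico looked something like the sketch below. The pin mapping here is made up for illustration (not my actual wiring, which is in the blog post), and you still need level shifting between the 5V bus and the 3.3V Pico:

  // Sketch only: bit-bang a Z80-style memory read from a Pico using the
  // pico-sdk. Pin assignments are hypothetical, purely for illustration.
  #include <stdio.h>
  #include "pico/stdlib.h"
  #include "hardware/gpio.h"

  #define ADDR_MASK  0x0000FFFFu            // GPIO0-15 = A0-A15 (assumed)
  #define DATA_SHIFT 16                     // GPIO16-23 = D0-D7 (assumed)
  #define DATA_MASK  (0xFFu << DATA_SHIFT)
  #define PIN_MREQ   26                     // assumed wiring, active low
  #define PIN_RD     27                     // assumed wiring, active low

  static uint8_t z80_bus_read(uint16_t addr) {
      gpio_put_masked(ADDR_MASK, addr);     // put the address on A0-A15
      gpio_put(PIN_MREQ, 0);                // assert /MREQ
      gpio_put(PIN_RD, 0);                  // assert /RD
      sleep_us(1);                          // far slower than a real Z80 cycle; static ROM/RAM doesn't care
      uint8_t data = (uint8_t)((gpio_get_all() & DATA_MASK) >> DATA_SHIFT);
      gpio_put(PIN_RD, 1);                  // release the bus again
      gpio_put(PIN_MREQ, 1);
      return data;
  }

  int main(void) {
      stdio_init_all();
      gpio_init_mask(ADDR_MASK | DATA_MASK | (1u << PIN_MREQ) | (1u << PIN_RD));
      gpio_set_dir_masked(ADDR_MASK | (1u << PIN_MREQ) | (1u << PIN_RD),
                          ADDR_MASK | (1u << PIN_MREQ) | (1u << PIN_RD)); // address + strobes out, data in
      gpio_put(PIN_MREQ, 1);
      gpio_put(PIN_RD, 1);
      for (uint16_t a = 0; a < 16; a++)     // dump the first 16 bytes of ROM
          printf("%04x: %02x\n", (unsigned)a, (unsigned)z80_bus_read(a));
  }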


Why is it impossible to make an ICE for current processors? Just because they're too complex and/or data exfiltration at speed is too hard?


They run at GHz frequencies, and are orders of magnitude more complex than the 8-bit processors of yore. ICEs were big suckers, and they cost a great deal.

That was for linear-socket, sub-MHz-clocked, 8-bit processors like the 8085.


The umbilical cord for a 500-700 pad BGA would be...impressive.


Other tools made it irrelevant, like boundary scanning on the JTAG bus, the EJTAG interface of the MIPS CPUs, the ETM of ARM chips, and so on. The required adapters are still somewhat pricey, but there are cost-effective solutions for both HW and SW.


Intel processors have DP/XDP/ITP, which does much the same thing as an ICE, except it's integrated on the chip.


Someone else had a similar issue 6 years ago: https://electronics.stackexchange.com/questions/334012/hsi-a...

It sounds like the sampling clock frequency is not what is expected (but that's quite easy to check based on the transmitted signal, so I'm quite confused).

UARTs are nice if you are constrained on pins, but SPI is always a safer bet when you don't have the necessary high-accuracy clocks.


I had something like that 30 years ago on a Motorola 68332 processor.

It could generate the high-speed CPU clock using the 32 kHz clock as a reference. What I found was that the layout guy had run a digital trace through the pins of the 32 kHz xtal. There was enough coupling that digital transitions on that line would cause the high-speed clock to swing wildly, which showed up as garbage on the UARTs. The joyous thing is that most of the time that line was idle.

I cut and rerouted the trace and the problem went away.


My guess is that the receiver clock glitches in some way when the MSI auto calibration runs, but it never showed up on the transmitter (and the device on the other side of the connection has never had a reception issue).

I ended up disabling the auto cal feature during a UART reception and then turning it back on when the reception is done.
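
In code it's roughly the following, assuming an STM32L4-class part where the auto-cal is the MSI PLL-mode bit (MSIPLLEN in RCC->CR); the helper functions are placeholders for this sketch, not my actual code:

  // Workaround sketch: pause the MSI auto-calibration while the UART is receiving.
  #include <stddef.h>
  #include "stm32l4xx.h"   // CMSIS device header (assumed part family)

  static inline void msi_autocal_off(void) {
      RCC->CR &= ~RCC_CR_MSIPLLEN;          // stop hardware trimming against LSE
  }

  static inline void msi_autocal_on(void) {
      RCC->CR |= RCC_CR_MSIPLLEN;           // resume trimming once the bus is idle
  }

  extern void send_request(void);                              // placeholder
  extern void uart_receive_blocking(uint8_t *buf, size_t len); // placeholder

  // We're the bus master, so we know when a reply is coming:
  void request_and_receive(uint8_t *buf, size_t len) {
      msi_autocal_off();                    // keep the UART clock undisturbed
      send_request();
      uart_receive_blocking(buf, len);
      msi_autocal_on();
  }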

SPI is definitely better as far as clocking, but MCU support as a SPI receiver is sometimes a lot less convenient to deal with.

A lot of UARTs have a synchronous mode which adds a dedicated clock signal - I've used that before out to a couple MHz.

In this application though, I'm only running 1 MHz so I really didn't think I should need a separate clock (and, it turns out, still don't).


According to the documentation there is no calibration as such; the MSI clock simply runs in a phase-locked loop (PLL) configuration with the LSE (32.768 kHz). For example, in 1 MHz mode the MSI is set up to run at approximately 1 MHz; this clock then goes into a divider that divides it by 31 down to approximately 32 kHz, and this is compared to the LSE clock to generate feedback for the MSI. When locked, the MSI runs at 1015.8 kHz (32.768 * 31), so it's out by about 1.58%.

It's also possible that the design hasn't been thoroughly tested and the PLL doesn't lock in certain conditions, which could leave you with an unstable clock.


The lack of status bits on the auto-cal is really unfortunate.

Turning it off during a UART transaction definitely "fixes" it.

I'm somewhat tempted to do the manual calibration of the HSI instead.


Yeah, a PLL without a status flag to indicate it is locked isn't good. I think there are also issues with stabilisation when using it with stop modes: https://community.st.com/t5/stm32-mcus-products/msi-pll-mode...

If you really need the accuracy, then regularly time the LSE clock using a timer clocked from the MSI and apply the best trim values, as described in app note AN4736 ("How to calibrate STM32L4 Series microcontrollers internal RC oscillator").
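
Very roughly, that approach looks like the sketch below, assuming an STM32L4-class part where the trim field is MSITRIM in RCC->ICSCR; the measurement helper stands in for a timer input-capture routine, so treat this as an outline rather than drop-in code:

  // Sketch: sweep the MSI trim and keep the value whose measured frequency
  // (timed against LSE) is closest to target. STM32L4 register/field names assumed.
  #include <stdint.h>
  #include "stm32l4xx.h"

  extern uint32_t measure_msi_ticks_per_lse_period(void);   // placeholder; in practice
                                                             // time many LSE periods for resolution

  void msi_manual_trim(uint32_t target_ticks) {              // e.g. ~31 at 1 MHz MSI
      uint32_t best_trim = 0;
      uint32_t best_err  = UINT32_MAX;
      for (uint32_t trim = 0; trim < 256; trim++) {          // MSITRIM is 8 bits on the L4
          MODIFY_REG(RCC->ICSCR, RCC_ICSCR_MSITRIM, trim << RCC_ICSCR_MSITRIM_Pos);
          uint32_t ticks = measure_msi_ticks_per_lse_period();
          uint32_t err = (ticks > target_ticks) ? ticks - target_ticks
                                                : target_ticks - ticks;
          if (err < best_err) { best_err = err; best_trim = trim; }
      }
      MODIFY_REG(RCC->ICSCR, RCC_ICSCR_MSITRIM, best_trim << RCC_ICSCR_MSITRIM_Pos);
  }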


It's been forever since I used UARTs, but I remember them being fairly resilient creatures. They center on the start bit, so as long as the clocks are close enough that you can sample the rest of the bits, you should be good. Often the problem is configuration and not clock accuracy, e.g. how many stop bits or parity.
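
Back-of-the-envelope on how close "close enough" is (assuming 8N1 framing and mid-bit sampling, so this is a rule of thumb rather than any particular UART's spec):

  frame = start + 8 data + stop = 10 bits
  the last bit is sampled ~9.5 bit-times after the start edge
  that sample has to land inside the bit  =>  total drift < 0.5 bit
  0.5 / 9.5 ≈ 5% total, shared between both ends  =>  roughly ±2-3% per side

So a clock that's 1.58% off (as computed above) already eats a big chunk of the budget before jitter and edge-detection error.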


Great writeup! You show that things are difficult to debug even when you have a board where all of your signals of interest are easily accessible.

It's a bad week when you have a bug that is only reproducible on form-factor hardware. Imagine something like a tiny earbud where the only pin accessible while the device is assembled is a single bidirectional UART pin. Ouch. Then, even if you can manage to disassemble the earbud, the PCBs are usually so small that most signals never appear on the outer layers, so you can't probe them even if you want to! Oh, and the issue only appears on one out of every few thousand earbuds? Better not break your failing unit while taking it apart! Good luck!

Just watched the Kickstarter video too -- looks like a great product y'all! Best of luck :)


I couldn’t find anything in the docs or on the Internet about why this might happen with the autocal, and there’s nothing that details exactly how it works either.

A quick search found this document on the internal RC oscillator calibration, which explains all you need to know: http://nic.vajn.icu/PDF/STMicro/ARM/STM32F0/STM32F0xx_intern...

It is recommended to stop all application activities before the calibration process, and to restart them after calling the calibration functions. Therefore, the application has to stop the communications, the ADC measurements and any other processes (except when using the ADC for the calibration, refer to Step 5. below). These processes normally use clock configurations that are different from those used in the calibration process. Otherwise, errors might be introduced in the application: errors while reading/sending frames, ADC reading errors since the sampling time has changed, and so on.


This is a good resource; however, it didn't apply in my situation because it describes the manual calibration process, not the auto-cal (which the F0 probably doesn't even have).

I still haven't come across anything that explains in detail how the auto-cal works and precautions one needs to take when it is running. The reference manual section is something like one paragraph and can be summarized as: "You can turn this on and it will calibrate your clock. You can also turn it off."

If I had to guess, it probably does something similar to the manual process, but just in the MCU logic. It's the lack of detail that got me: I basically ran out of things to try on the UART itself and started looking around at other parts of the chip to see what could at least be indirectly related.


Is it possible that the autocal is in the process of shifting the phase of the clock it controls to match the reference clock, and that the user is supposed to wait until that's done before running phase-sensitive operations? I'm unfamiliar with the chip in question, just making guesses or shots in the dark.


It really seems like it has to be something like that. The problem is there is no detail in the docs and no status bits in the chip. There's no way to know when the auto-cal runs.

One of the several things I did to eliminate the problem was to disable the auto-cal during a UART reception (the STM32 is the bus master, so it knows when it will be receiving) and re-enable it when it is finished. That absolutely confirmed where the glitch comes from, but I don't think I'll ever get a true "why" unless an ST engineer wants to chime in!


I've had enough bad experiences with ST to just avoid their MCUs altogether when possible. Last year I wasted several days trying to figure out why a certain ST chip wouldn't respond to the programmer I was using. The chip claimed to support flashing from UART, but the bootloader just never answered. It responded to other interfaces like ST's two wire interface, whatever it was called. Searching around online, it seems like this chip has a silicon bug or a bad ROM and ST is just happy to keep selling it as is for years. They haven't even published errata acknowledging that it's broken and doesn't answer on all interfaces.


I enjoyed reading through your debugging process and as someone who has been trying to debug a custom board for a few weeks now I feel your pain. I still cannot say that my issue is hardware, firmware, or software -_-

I do have some UART devices that really seem to like it when I just disconnect and reconnect the GND wire when they start to act up.


It would be incredibly useful to output the user clock and monitor it with a logic analyzer.
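
On STM32 parts the usual trick is the MCO pin (PA8). Something like the following with the ST HAL, assuming an L4-class part where MSI is a valid MCO source (worth double-checking the macro names for the exact chip):

  // Sketch: route the MSI clock out on the MCO pin (PA8) so a scope or
  // logic analyzer can watch it directly.
  #include "stm32l4xx_hal.h"

  void expose_msi_on_mco(void) {
      __HAL_RCC_GPIOA_CLK_ENABLE();                           // MCO is on PA8
      HAL_RCC_MCOConfig(RCC_MCO1, RCC_MCO1SOURCE_MSI, RCC_MCODIV_1);
  }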


Reminds me of Bret Victor's Seeing Spaces talk: https://www.youtube.com/watch?v=klTjiXjqHrQ


As someone who is interested in hardware engineering and reverse engineering... this truly is an understatement.

I have an old router I wanted to dump the firmware from as a learning experience to see if I could go from firmware dumping to finding a bug.

You've got to constantly question whether it's your hardware or the device that's faulty... you've got to double-check and make sure everything is connected.

Want to do low-level silicon reverse engineering? Yeah, that's not cheap, as the tools, chemicals, and PPE are very expensive.

But I'd argue I shouldn't treat the hardware world as magic. And with how expensive these devices are (some of it justified, other times a company like Nvidia just sets an insane markup), it's reasonable to want to learn how this stuff works and how one could do it themselves.

This hardware reverse engineering is also how we find/look for potential security issues or backdoors.


What are those PCB holders he is using?


Those are Hakko Omnivise PCB Holders. I have one and it's a joy to use. Super heavy and stable.


Google Image search?

Btw, for through hole soldering and rework, Idea-Tek PCSA is on a completely different level of efficiency.


Ideal-Tek, not Idea-Tek.


Correct. Iphone autocorrected incorrectly, but you know what I meant.


Yup, it's the Hakko Omnivise. They might seem a little pricey, but they are worth every penny.



