Extreme debugging - a tale of microcode and an oven (alanwinfield.blogspot.com.br)
193 points by flaviojuvenal on March 9, 2013 | hide | past | favorite | 17 comments



I've worked at a couple of jobs (telecom, semiconductors) with environment chambers and equipment of that ilk. Heat guns were a common bench testing tool.

Fast forward ten years and I'm doing contract work for a local company. Of course the contractor gets the shittiest workstation, which dies every afternoon like clockwork. So I move the POS white box away from the window and out of the sunlight. "Magically" it only crashes once a week. So I file a ticket in the queue, stating the obvious; the IT folks quibble and finally admit the motherboard needs to be replaced.


Cool story, bro.


Beautiful example of great engineering. I like how, at least as the story is told, they first formed a hypothesis based on their knowledge, found a systematic way to test and particularize it, and then built a fix for production as a separate step.


The scientific method can get iterative: observe / purpose / hypothesis / experiment / result / conclusion. It gets harder when all one's hypotheses' test cases conclude with "Well, it wasn't that", or when the bug is intermittent with no apparent relation to one's domain knowledge.

That's the time one must apply heuristics such as (A) expand one's domain knowledge through research or talking to others, or (B) expand one's observations on the subject, or (C) expand one's hypotheses into the realm of "Oh it's just not possible that...", or "Let's try something that looks really 'stupid'...".

Of course, heuristic (C), being presumption-checking in face-saving disguise, has been the most fruitful heuristic in all my years. Those successes often become the best "teaching moments".

But here's a success story that followed heuristic (B). It goes back to the early days of the MITS Altair 8800[1] and its "S-100 bus"[2], a physically large passive backplane bus for the 8080 CPU. My employer sold complete systems comprising that hardware and general-purpose business software we wrote (using 1979-and-later versions of Bill Gates' first commercial product [3]).

We had systems that would intermittently "cold-crash" with no observed relationship to anything "rational".

So, observe: Pull the boards out and inspect them for bent pins on chips, cold solder joints... nothing. OK look at the backplane connectors, topside: No dust or bent pins. Finally turn the emptied chassis upside-down and look at the 100 solder connections for each of the 18 or so card slots. Get out a magnifying glass because this will take some time...

...and there they were: Three "cakey" looking broken solder joints on the bus. Evidently, the backplane board flexed enough (or the S-100 connector pins themselves communicated enough flexure to their solder joints) upon board insertion and removal to break a weak solder joint on the bottom of the big PC board that was the backplane bus. Resolder all 1,800 pins and problem solved for the lifetime of the machine. Found the same failure mode in a couple of other boxes in the 7 years I worked there, otherwise maintaining software.

[1] https://en.wikipedia.org/wiki/Altair_8800
[2] https://en.wikipedia.org/wiki/IEEE-696
[3] https://en.wikipedia.org/wiki/Altair_BASIC


You've accurately described the predicament Boeing is in with the 787 battery packs.

There obviously is time to pursue further domain knowledge of how these batteries fail in the field, but pressures of the market are now in full force.


This really is a fantastic article. Mayhaps you should resubmit it Monday morning, or some other time when it might get more traction.


I like Forth. I personally have played with it by making Grobots teams.

Creeper World will actually use a stack-based language like Forth for its next release.

(Specifically, it will use a derivative called CRPL for scripting.)


I enjoyed reading this story. Hunting down those kinds of bugs is not something most of us will ever be able to do.


I'm surrounded at work by at least a dozen HALT (Highly Accelerated Life Test) chambers, and many, many of our engineers and QA people debug and test code with thermocouples connected to the system they're working on. Anybody with an EE degree will have lots of stories about debugging temperature-related bugs on their systems: temperature is a (relatively) easy variable to quantify, measure, and replicate, and the components you're working with are all temperature rated.

Combine RF + temperature + antennas and you're in for several months of interesting debugging, particularly if you want to optimize for performance at low transmit power over a wide range of temperatures (-55 to +85 Celsius). The math gets hinky really quickly.


We had one particularly bad customer experience on some wireless lighting control systems (power electronics and RF in a tight package) where some nodes would randomly drop from the network and couldn't be controlled.

After two sets of embedded engineers trying to solve the problem with software updates, our RF guru got involved.

He traced the problem to the RF transceiver losing PLL lock at high temperatures while failing to set the PLL status bit correctly. We had to patch our uC firmware to reset the RF chip every X minutes, based on worst-case temperature-rise scenarios.
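The workaround described (a timed, unconditional reset because the status bit can't be trusted) can be sketched roughly like this. This is a minimal illustration, not the actual firmware: `reset_fn` and the interval are hypothetical stand-ins, and real uC code would hang this off a hardware timer rather than a polled loop.

```python
import time

class RfResetGuard:
    """Periodically force-reset an RF transceiver whose PLL-lock status
    bit cannot be trusted at high temperature (hypothetical sketch)."""

    def __init__(self, reset_fn, interval_s):
        self.reset_fn = reset_fn      # callback that hard-resets the chip
        self.interval_s = interval_s  # worst-case time before PLL may drift
        self.last_reset = time.monotonic()

    def poll(self):
        """Call from the main loop; resets the chip once the interval elapses."""
        now = time.monotonic()
        if now - self.last_reset >= self.interval_s:
            self.reset_fn()
            self.last_reset = now
            return True
        return False
```

The key design point matches the story: because the chip's own status reporting is unreliable, the reset is driven purely by elapsed time against a worst-case thermal scenario, not by reading the PLL status bit.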

No one else would have figured this out for months. The bug had gone unfixed through three generations of this transceiver, and the guru had dealt with the issue years ago.


It depends hugely on which industry you work in. I develop SW for mobile phones, and not too long ago I had to use a climate chamber to debug a memory chip that was sensitive to temperature. There was also an issue where USB failed to enumerate at -20C.

I haven't used it yet but we have some really cool (ha!) equipment that will adjust the temperature from -20 to +60 (Celsius) in minutes.


How large are those chambers? Where I work we have massive thermal chambers (about 20 feet by 40 feet), but there's no way they can achieve that rapid a temperature swing.

I really hate going in those things when they're testing at high temp and high humidity, especially when I'm dressed for winter weather.


That's not really true; be careful not to fall into a "Golden Age Fallacy". I'm sure many of us will be able to debug stuff even crazier than that. Just look at all the hardware startups popping up lately. Let's check back in 20 years.


I agree that there are some very interesting hardware startups and that we will read some great stories in 20 years, but I disagree that most of us will have a chance to work on them. I'm under the impression that the average HNer is working on some location-aware social-networking mobile thingy.

This is not an "in the days of old, when men were men and women were men, too" kind of thing. Most of us will probably spend the next 20 years doing web/mobile stuff.


And the rest of us are working on fancy "CRUD" applications.

I enjoy working on hardware and RF projects as a hobby. Not much fun doing it at a large corporation anymore....


I was first exposed to Forth by a member of the PSU Timex-Sinclair UG (while I was still in high school) and (much) later used it to program a system that controlled the terminals used by stock traders. I'm not saying I'd want to go back to writing software in Forth, but it was a strong language for control systems... perhaps the Erlang of its day?


Why not? It's much better than assembly. I believe that even today, a sound asynchronous stack design (much simpler, and therefore with asynchronicity still a viable option) would consume fewer joules per unit of computation than any current ARM/x86/RISC nonsense. (You don't actually have to program in the Forth machine code; it's just that it's still a good medium for many interesting high-level languages.)
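To make the "Forth as a medium" point concrete, here is a minimal sketch of a Forth-style evaluator: a data stack plus a dictionary of words, which is essentially the whole execution model. This is an illustrative toy in Python, not any real Forth dialect; the word set is a hypothetical subset.

```python
def forth_eval(program, stack=None):
    """Evaluate a whitespace-separated, Forth-like program on a data stack.

    Supports integer literals and a few core words (dup, swap, drop, +, *).
    A sketch of the stack-machine execution model, not a real Forth.
    """
    stack = [] if stack is None else stack
    words = {
        "+":    lambda s: s.append(s.pop() + s.pop()),
        "*":    lambda s: s.append(s.pop() * s.pop()),
        "dup":  lambda s: s.append(s[-1]),        # duplicate top of stack
        "swap": lambda s: s.append(s.pop(-2)),    # exchange top two items
        "drop": lambda s: s.pop(),                # discard top of stack
    }
    for token in program.split():
        if token in words:
            words[token](stack)    # execute a defined word
        else:
            stack.append(int(token))  # literals are pushed directly
    return stack
```

For example, `forth_eval("2 3 + 4 *")` leaves `[20]` on the stack. The simplicity is the point the comment is making: there is no register allocation and almost no decoding, which is why stack machines map onto very small, simple hardware.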



