
Extreme debugging - a tale of microcode and an oven - flaviojuvenal
http://alanwinfield.blogspot.com.br/2013/03/extreme-debugging-tale-of-microcode-and.html
======
xradionut
I've worked at a couple of jobs (telecom, semiconductors) with environment
chambers and equipment of that ilk. Heat guns were a common bench testing
tool.

Fast forward ten years and I'm doing contract work for a local company. Of
course the contractor got the shittiest workstation, which dies every
afternoon like clockwork. So I move the POS white box away from the window and
out of the sunlight. "Magically" it only crashes once a week. So I send a ticket
into the queue, stating the obvious, then the IT folks quibble and finally
admit the motherboard needs to be replaced.

~~~
sneak
Cool story, bro.

------
konstruktor
Beautiful example of great engineering. I like how they, at least from how the
story is told, first formed a hypothesis based on their knowledge, found
a systematic way to test/particularise it and then built a fix for production
in a separate step.

------
EEGuy
The scientific method can get iterative. Observe / purpose / hypothesis /
experiment / result / conclusion. It gets harder when all one's hypotheses'
test cases conclude with "Well, it wasn't that". Or when the bug's
intermittent with no apparent relation to one's domain knowledge.

That's the time one must apply heuristics such as (A) expand one's domain
knowledge through research or talking to others, or (B) expand one's
observations on the subject, or (C) expand one's hypotheses into the realm of
"Oh it's just not possible that...", or "Let's try something that looks really
'stupid'...".

Of course, heuristic (C), being presumption-checking in face-saving disguise,
has been the most fruitful heuristic in all my years. Those successes often
become the best "teaching moments".

But here's a success story that followed heuristic (B). It goes back to the
early days of the MITS Altair 8800[1] and its "S-100 bus"[2], a physically
large passive backplane bus for the 8080 CPU. My employer sold complete
systems comprised of that hardware with general purpose business software we
wrote (using 1979-and-later versions of Bill Gates' first commercial product
[3]).

We had systems that would intermittently "cold-crash" with no observed
relationship to anything "rational".

So, observe: Pull the boards out and inspect them for bent pins on chips, cold
solder joints... nothing. OK look at the backplane connectors, topside: No
dust or bent pins. Finally turn the emptied chassis upside-down and look at
the 100 solder connections for each of the 18 or so card slots. Get out a
magnifying glass because this will take some time...

...and there they were: Three "cakey" looking broken solder joints on the bus.
Evidently, the backplane board flexed enough (or the S-100 connector pins
themselves communicated enough flexure to their solder joints) upon board
insertion and removal to break a weak solder joint on the bottom of the big PC
board that was the backplane bus. Resolder all 1,800 pins and problem solved
for the lifetime of the machine. Found the same failure mode in a couple other
boxes in the 7 years I worked there, otherwise maintaining software.

[1] <https://en.wikipedia.org/wiki/Altair_8800>
[2] <https://en.wikipedia.org/wiki/IEEE-696>
[3] <https://en.wikipedia.org/wiki/Altair_BASIC>

~~~
joezydeco
You've accurately described the predicament Boeing is in with the 787 battery
packs.

There obviously is time to pursue further domain knowledge of how these
batteries fail in the field, but pressures of the market are now in full
force.

------
ctoth
This really is a fantastic article. Mayhaps you should resubmit it Monday
morning or some time when it might get more traction.

------
GhotiFish
I like Forth. I personally have played with it by making grobot teams.

Creeper World will actually use a stack-based language like Forth for its
next release.

(specifically, it will use a derivative called crpl for scripting)

------
vukmir
I enjoyed reading this story. Hunting down that kind of bug is not something
most of us will ever be able to do.

~~~
ghshephard
I'm surrounded at work by at least a dozen HALT (Highly Accelerated Life Test)
chambers, and many, many of our engineers and QA people debug and test code
with thermocouples connected to the system they are working with. Anybody with
an EE will have lots of stories of debugging temperature related bugs on their
systems - it's a (relatively) easy variable to quantify, measure, and
replicate - and the components you are working with are all temperature rated.

Combine RF + Temperature + Antennas - Then you are in for several months of
interesting debugging - particularly if you want to optimize for performance
at low power transmission over a wide range of temperatures (-55 to +85
Celsius). The math gets hinky really quickly.

~~~
Dwolb
We had one particularly bad customer experience on some wireless lighting
control systems (power electronics and RF in a tight package) where some nodes
would randomly drop from the network and couldn't be controlled.

After two sets of embedded engineers trying to solve the problem with software
updates, our RF guru got involved.

He traced the problem to the RF transceiver losing PLL lock at high
temperatures and the transceiver not setting the PLL status bit correctly. We
had to fix our uC to reset the RF chip every X minutes based on worst-case
temperature rise scenarios.

No one else would have figured this out for months. The bug had not been fixed
through three generations of this transceiver and the guru had dealt with the
issue years ago.
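
For anyone curious what that workaround looks like in practice, here is a
rough sketch in Python rather than real firmware. Everything in it is an
assumption for illustration: the names (reset_interval_s, PeriodicResetter,
reset_chip), the idea that the PLL tolerates some fixed temperature drift
after a reset, and the example numbers. It is not the actual code or the
actual figures from that product.

```python
from dataclasses import dataclass
from typing import Callable


def reset_interval_s(lock_window_c: float,
                     worst_case_rise_c_per_min: float,
                     safety_factor: float = 2.0) -> float:
    """Seconds between forced transceiver resets.

    lock_window_c: ASSUMED temperature drift (deg C) the PLL tolerates after
        a reset before it may lose lock. Hypothetical parameter.
    worst_case_rise_c_per_min: ASSUMED fastest temperature rise (deg C/min).
    safety_factor: reset this many times more often than strictly needed.
    """
    if lock_window_c <= 0 or worst_case_rise_c_per_min <= 0:
        raise ValueError("window and rise rate must be positive")
    if safety_factor < 1:
        raise ValueError("safety_factor must be >= 1")
    minutes_to_drift_out = lock_window_c / worst_case_rise_c_per_min
    return 60.0 * minutes_to_drift_out / safety_factor


@dataclass
class PeriodicResetter:
    """Resets the chip on a timer, since the status bit cannot be trusted."""
    interval_s: float
    reset_chip: Callable[[], None]   # hypothetical hook into the RF driver
    last_reset_s: float = 0.0

    def tick(self, now_s: float) -> bool:
        """Call from the main loop; returns True if a reset was issued."""
        if now_s - self.last_reset_s >= self.interval_s:
            self.reset_chip()
            self.last_reset_s = now_s
            return True
        return False
```

The point of resetting on a timer instead of polling the PLL status bit is
that the bit itself was the unreliable part, so the interval has to come from
the worst-case thermal numbers rather than from anything the chip reports.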

------
smoyer
I was first exposed to Forth by a member of the PSU Timex-Sinclair UG (while I
was still in high school) and (much) later used it to program a system that
controlled the terminals used by stock traders. I'm not saying I'd want to go
back to writing software in Forth, but it was a strong language for control
systems ... perhaps the Erlang of its day?

~~~
gngeal
Why not? Much better than assembly. I believe that even today, a sound
asynchronous stack design (much simpler, therefore with asynchronicity still
being a viable option) would consume fewer joules per unit of computation than
any current ARM/x86/RISC nonsense. (You don't actually have to program in the
Forth machine code, it's just that it's still a good medium for many
interesting high-level languages.)

