
My Hardest Bug Ever - danso
http://gamasutra.com/blogs/DaveBaggett/20131031/203788/My_Hardest_Bug_Ever.php
======
fpgaminer
In the hardware/embedded engineering world, this is what is known as The
Moment. How long it takes for a developer to experience The Moment is
correlated with how close they are to the metal. I'm not sure if it's the
solder, the silicon, or the flux, but it seeps into your mind and slowly, but
surely, The Moment will happen.

The project is going swimmingly well. Smoother than normal, actually. The calm
before the storm. Suddenly, abruptly, your device fails. Maybe you were there,
maybe someone else reported it; neither makes it less mysterious. You brush it
off with more important things to do. It happens again. Now it bothers you;
top priority. You try for hours to make it happen consistently. No amount of
tin foil, bunny ears, or interpretive dance will make it happen again. So you
give up, you start steel plating the code. Paranoid error checking everywhere,
setting variables twice, resetting, extra padding on all the timing
parameters. PARANOIA. Your code begins to look like a lunatic wrote it. But the
bug never happens again. Your paranoia and second-guessing fixed the problem.

The Moment! From then on, whenever something doesn't quite _feel_ right with
your projects, you start inserting strange, almost ritualistic incantations in
your code. If anyone asks why you decided to sleep for 14.23ms instead of the
prescribed 10ms, you just call them a devil and run away back to your ancient
cross compiling GCC and make a sacrifice to the Upstream Gods.

I feel for the author. From that Moment on, his code will never be the same.
His sanity, now left on the cutting room floor along with the trust he once
had in Datasheets and Programming Manuals.

Embedded 4 Life.

~~~
X-Istence
This post spoke to me in ways I had never imagined. This is what embedded
programming is like!

------
AceJohnny2
"As a programmer, you learn to blame your code first, second, and third... and
somewhere around 10,000th you blame the compiler. Well down the list after
that, you blame the hardware."

I dunno, as an embedded programmer working on new HW and sometimes with a
custom compiler, blaming the HW became the 5th thing, and the compiler the
6th...

~~~
jakejake
I'd probably be more likely to blame the hardware or compiler if my work was
programming new hardware or custom compilers!

I think the situation is that if you're quick to blame others for a problem
which turns out to be your own mistake - it has a tendency to really piss off
co-workers. Experienced programmers learn to challenge their own code first as
a habit out of consideration for others on their team. It's embarrassing to
swear up and down that there is a hardware/compiler/whatever problem, only to
have a bunch of people look into it and find a mistake that you made.

~~~
Zak
I think it's more a matter of hardware and compilers _usually_ being
widely-used and well-tested. You're just not very likely to discover a bug in
an Intel CPU or GCC, while your newly-written code, being newly-written code,
hasn't been well-tested yet.

If you're using one-off hardware or a brand new compiler, it isn't so
unreasonable to suspect those pieces of having bugs.

------
josh2600
Just in case you guys don't know this, since I don't see it mentioned
anywhere, Dave is also the co-founder of ITA software which is basically the
engine holding up modern transportation. It powers almost all of the websites
that deal in transportation (hipmunk, kayak, etc.), and this is, and I believe
this is Dave's term, a fractal problem.

Fractal problems are those that look really simple from 100,000 feet up (like
a dot) and reach unimaginable complexity as you get closer (like a fractal).
There are a lot of problems of this nature (like email, transportation,
ecommerce, taxation, etc.) and solving them is worth a lot of money but
requires a LOT of work.

~~~
philsnow
Wouldn't a fractal problem look only somewhat hairy from 100k feet, and yet
also be hairy at 10k, 1k, 100, 10, 1 feet ? It should look somewhat similar at
all zoom levels.

Problems that look simple from 100k feet but get complex as you get up close
are what I'd call "normal" problems.

------
neilk
The same story was making the rounds yesterday from Quora.

[http://www.quora.com/Programming-Interviews/Whats-the-hardes...](http://www.quora.com/Programming-Interviews/Whats-the-hardest-bug-youve-debugged/answer/Dave-Baggett)

~~~
teej
I'm surprised this story didn't credit Quora as the original published source.
I imagine that's possibly because Gamasutra reached out directly to Baggett or
he decided himself to cross-post the story after its success on Quora.

I find the phenomenon of Quora posts becoming "real" articles quite
fascinating. I've actually been "published" in Slate and Forbes online just
for spending some time writing answers.

~~~
BenderV
Using Quora (or Reddit) is really a smart first move. It allows you to find
interesting questions/subjects that you could answer (which is not trivial).
It also allows you to directly reach a base of interested readers.

Kind of what HN does for ideas/start-up.

I wonder if this also exists for other types of markets (product, music, ...)
...

~~~
Kiro
>I wonder if this also exists for other types of markets (product, music, ...)
...

Please elaborate on that thought!

~~~
BenderV
I think it's about services that are stream-centered and sectioned.

HN, Reddit, Upbeatapp, Quora (?) are stream-centered. Information comes and
goes, which makes it very easily accessible.

Reddit, Upbeatapp, and Quora are also sectioned: there are subreddits, there
are subgenres in Upbeatapp, and in Quora you can follow
topics/interests/questions. HN is not sectioned but is already very specific.
That gives you access to your pre-targeted audience.

The two factors combined make for easy access to a large interested
audience.

(Note that this is not the case on Twitter, YouTube, SoundCloud, Tumblr...) I
don't know Medium well enough to be sure, but I think they are trying to do
the same (sections + stream).

I'm trying to find online services that use this pattern for other types of
markets (movies, food recipes, etc.)

------
JonFish85
How is this a quantum effect? Clocks can be noisy, so if your board isn't
designed to keep everything isolated properly, you'll pick up noise all over
the place. It's an electromagnetic effect, sure, but a quantum effect?

~~~
bobz
Technically, quantum refers to physics at the atomic / subatomic scale, and
not necessarily to the spooky effects like particle wave duality and such that
we usually attribute to the term.

So, I believe this would be an "effect on the quantum level," even if it can
be understood through the lens of more traditional electromagnetic physics as
well.

~~~
beambot
Calling an electrical noise / timing bug "quantum mechanics" is hyperbole.
Otherwise, every EE that touches hardware is a quantum physicist (they're
not).

EDIT: Not trying to diminish the OP's impressive feat of debugging though.
Hardware errors can be beastly to diagnose. When wire-wrapping an 8086
computer, I used a spool of wire with occasional (random) breaks that would
intermittently open. Worst. Bugs. Ever.

------
DenisM
I had to debug a problem in our program where an MMX register would get
corrupted under a new sampling profiler. Turns out the profiler would forget
to restore MMX registers - the profiler devs never used MMX and it did not
occur to them that a component they called would do that. That took a while
to debug.

Another fun bug was when an alpha version of the CLR failed to restore one of
the two registers used to control loop execution on the Itanium. (Yes, IA-64
had two registers - one for the loop variable as seen by the program and one
to actually control the loop execution).

------
cpeterso
Here is a recent Firefox crash which, after some heroic debugging, appears to
be an AMD CPU bug involving a CPU race condition "after a not-taken branch
that ends on the last byte of an aligned quad-word"! The gory debugging
details:

[https://bugzilla.mozilla.org/show_bug.cgi?id=772330#c21](https://bugzilla.mozilla.org/show_bug.cgi?id=772330#c21)

------
ddoolin
There were only two developers for that game? Wow, I would've thought there'd
be a lot more.

~~~
jere
As you go back in time, the number of people working on a game drops
dramatically. _Crash Bandicoot_ was released in 1996 and had 2 programmers
[http://en.wikipedia.org/wiki/Crash_Bandicoot_(video_game)](http://en.wikipedia.org/wiki/Crash_Bandicoot_\(video_game\))

Three years earlier, _Doom_ was released. It had 4 programmers.

Six years before that, _Final Fantasy_ had 1 programmer.

If you stick around for the credits in a modern video game*, it's clear
hundreds of people are involved.

~~~
eru
Unless you look at the Indies.

~~~
jere
Of course. I think the idea of the 2-3 person indie team or even solo
developer is quite romantic. For instance:
[http://jere.in/sneaking-into-rohrers-castle-part-1](http://jere.in/sneaking-into-rohrers-castle-part-1)

With that last comment though, I was mainly referencing AAA games (e.g.
_Bioshock Infinite_ ).

~~~
eru
Thanks. You might have just stolen all my free time.

~~~
jere
Ha! You mean TCD? It _is_ pretty addictive, but the crazy part is I could
hardly convince any of my friends to play. Anyway, I hope you enjoy it. I'm
pretty active on the forums if you have any questions.

~~~
eru
Reminds me to get back into Frozen Synapse. (Though I did manage to get
friends to play.) Frozen Synapse is probably the perfect execution of the idea
"`UFO: Enemy Unknown' tactical battles in multiplayer".

------
vijayboyapati
It sounds pretty bad; anything where the bug appears randomly sucks. But, for
me, the worst bugs are random+multithreaded+statistical. E.g., a random bug in
a distributed machine learning system=bug of death

------
raverbashing
And here's my pet peeve:

"He called me and, in his broken English and my (extremely) broken Japanese,
we argued. I finally said, "just let me send you a 30-line test program that
makes it happen when you wiggle the controller." He relented. This would be a
waste of time, he assured me, and he was extremely busy with a new project,
but he would oblige because we were a very important developer for Sony. I
cleaned up my little test program and sent it over."

Really? Being humble doesn't hurt.

At the same time I love when a smug face melts with a concrete proof and "I
told you so". Save face and be a little more humble.

~~~
brisance
I found this to largely be a cultural thing with the Japanese. Japanese are
extremely proud of their work and to mention ANY kind of mistake or
improvement is tantamount to insulting their mother. Of course there are
exceptions but this is the general attitude they bring to work.

~~~
dmbaggett
This is how I interpreted it as well. I was not offended by his reaction;
perhaps I should have noted that in the piece.

~~~
raverbashing
I understand you weren't offended. It's just that he was putting his pride
before the work.

Also, yes, I know this may be more common in Japanese people, still I didn't
want to associate it with the stereotype since I've seen it in lots of people
(and why not, maybe even with me), with several different backgrounds.

------
Eliezer
> This is the only time in my entire programming life that I've debugged a
> problem caused by quantum mechanics.

Technically, _all_ bugs are caused by quantum mechanics.

~~~
mistercow
And even more technically, it doesn't actually make sense to say "caused by
quantum mechanics".

~~~
dmbaggett
:)

------
kabdib
Interference is a pretty good one.

I've diagnosed temperature problems (grab ice from the freezer, apply to chip,
see it work...), clocking problems (you insert strategic delays, sometimes on
the order of thousands of instructions) and just badly documented registers
(make sure mystery bit number 13 gets toggled just right, or it's curtains).

It's fun stuff.

------
peapicker
Dave Baggett wrote a couple of nice text adventures for the TADS system before
he worked on Crash back in the early 1990s, I remember enjoying both
"Unnkulian Unventure II: The Secret of Acme" and "The Legend Lives!"
(Unnkulian episode 5)...

~~~
dmbaggett
Now _that_ is a blast from the past. :)

~~~
peapicker
What can I say, we had some nice chats back on rec.arts.int-fiction as well
waaaaaaay back when....

------
Isamu
Yes. We've all had to do similar things on embedded systems - you just cut
code out until you narrow down the problem. And yes, you have to convince the
hardware engineer that there is a problem with his board and that's always an
uphill climb. Sometimes he'll fix the board, but more often you're stuck with
a workaround.
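
The "cut code out until you narrow down the problem" approach can be sketched as a simple halving search. This is only an illustration, assuming a single guilty part and a failure that reproduces reliably (which intermittent hardware bugs often do not); the module names and the `fails` predicate are hypothetical.

```python
from typing import Callable, List, Sequence


def find_culprit(parts: Sequence[str], fails: Callable[[List[str]], bool]) -> str:
    """Halve the set of enabled parts until one failing part remains.

    Assumes exactly one part triggers the failure and that `fails` is
    deterministic for a given set of enabled parts.
    """
    candidates = list(parts)
    if not fails(candidates):
        raise ValueError("the full build does not fail")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]


# Hypothetical module names; "pad_poll" stands in for the guilty code.
modules = ["audio", "video", "pad_poll", "memcard", "cdrom"]
culprit = find_culprit(modules, lambda enabled: "pad_poll" in enabled)
print(culprit)
```

In practice each call to `fails` is a rebuild and a long test run, so the number of halvings matters more than the code itself.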

Nowadays I'm working on these big distributed clusters, far from the bare
metal. But you know, just now I rather miss those days on the little embedded
systems.

------
malkia
Not my own bug, but a bug that happened in some other studio, retold by one of
our ex-producers:

Basically the bug was in the Nintendo GameBoy, where it only happens if you
press two buttons (left and right, or up and down) at the same time. Now you
can't do that normally, since the controller won't allow it. But if you are
hard-core QA (like the guy this producer told me about) - he ripped the
controller open, and manually wired some stuff - so he'd get LEFT and RIGHT
pressed down at the same time, and only then the game would crash...
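
One defensive option for this class of bug is to filter "impossible" input before game logic sees it. The sketch below is an assumption about how that might look, not how any real game handled it; the bit layout is hypothetical.

```python
# Hypothetical bit layout for a d-pad state word.
LEFT, RIGHT, UP, DOWN = 0x01, 0x02, 0x04, 0x08


def sanitize_dpad(buttons: int) -> int:
    """Drop any axis on which both opposing directions are reported."""
    if buttons & LEFT and buttons & RIGHT:
        buttons &= ~(LEFT | RIGHT)
    if buttons & UP and buttons & DOWN:
        buttons &= ~(UP | DOWN)
    return buttons


# LEFT and RIGHT cancel out, UP survives.
print(bin(sanitize_dpad(LEFT | RIGHT | UP)))
```

Cancelling both directions is only one policy; preferring the most recently pressed direction is another.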

But then as we say sometimes here, for such things - it might get the... WNF
(Will-not-fix).

~~~
chilldream
This type of bug is very well-known in the tool-assisted speedrun community;
many NES TASes rely upon similar bugs.

------
adam-f
I once solved a bug where i=5 didn't execute (single stepping through the
assembly also failed to assign the variable), the fix?

    i=5;
    i=5;
    i=5;
    i=5;

and, I believe, we shipped it like that.

~~~
chris_wot
The optimizer didn't remove that?

~~~
makomk
If it was an embedded system, some of the widely-used compilers are apparently
rather lacking in functionality compared to the desktop ones, so it might not
be able to.

------
brooksbp
Corruption on a management bus; transactions were getting corrupted .01% of
the time: writes turning into reads, reads turning into writes, etc. SI
issues.

As an embedded software engineer, you need to understand hardware nearly as
much as the software. And depending on how far along the hardware is (pre
'gerber out', or deployed in field) it's usually up to software to "patch it
over," "hide the issue," "fix it in software," if possible...
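
A common shape for "fix it in software" on an unreliable bus is write, read back, and retry. The sketch below simulates that with a toy bus; `FlakyBus` and `write_verified` are hypothetical names, and the approach assumes the register can be read back without side effects and that the read itself is not corrupted.

```python
class FlakyBus:
    """Toy bus that corrupts the first `bad_writes` writes (simulation only)."""

    def __init__(self, bad_writes: int) -> None:
        self.regs = {}
        self.bad_writes = bad_writes

    def write(self, addr: int, value: int) -> None:
        if self.bad_writes > 0:
            self.bad_writes -= 1
            value ^= 0xFF  # simulate a corrupted transaction
        self.regs[addr] = value

    def read(self, addr: int) -> int:
        return self.regs.get(addr, 0)


def write_verified(bus, addr: int, value: int, retries: int = 3) -> int:
    """Write, read back, and retry on mismatch. Returns the attempts used."""
    for attempt in range(1, retries + 1):
        bus.write(addr, value)
        if bus.read(addr) == value:
            return attempt
    raise IOError(f"register {addr:#x} did not verify after {retries} tries")


bus = FlakyBus(bad_writes=2)
attempts = write_verified(bus, 0x10, 0x5A)
print(attempts, hex(bus.read(0x10)))
```

This hides the symptom rather than the signal-integrity cause, which is the trade-off the comment describes.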

------
aidenn0
My worst was an issue where a load-to-register, followed by a jump to the
register, when the jump was on the last word in a cache-line, would about 1 in
10 million times jump to the value that was in the register before the load.
It was a race-condition in the hardware. That took many, many months to track
down, localize, and then convince the HW vendor it was an issue.

------
ww520
Having a consistent reproduction scenario is half of the battle in debugging,
but sometimes getting to that point is hard.

------
ctdonath
When USB 2 was about to be introduced, I added it to a printer for Kodak using
beta versions of driver chips. There was a hardware bug connecting two
unrelated bits. Took a month and a whole lot of intuition to figure that one
out via software. Chip maker was very happy I found it.

------
robomartin
I can see how a software developer could put hardware last, particularly when
working with an established platform. I get that.

Coming from the other side of the fence, a place where we develop the hardware
first and then bring it to life through software, things are very different.

The first step in bringing up a non-trivial board is to go through a full
characterization phase. This is where electrical, mechanical and short
software tests are performed in order to determine if the hardware is
operating according to requirements. Depending on the nature of the hardware
this period can last months and require many re-spins (iterations where
something is fixed or modified).

While this is taking place, and depending on the nature of the team, the
application software is probably starting to be assembled on prototype
hardware. In some cases this can't happen until you have actual hardware that
works reasonably close to specs. Perhaps rev-1 hardware is used to jump-start
software development while the hardware team goes through many revs in order
to make adjustments and fix problems.

Seemingly weird hardware problems abound. I have been in situations where the
signal is good at one end of a trace or cable and not so good at the other
end. In the case of high speed design this can easily happen if there are
problems with the design of the transmission lines carrying the signals. You
can easily end up with reflections that will wreak havoc on the signals as
they go down the transmission line.

Another "weird" hardware issue in high speed design is signals that don't
arrive at the destination within a specific timing window. Dynamic RAM designs
are one example of this. A clock is used to gate various signals at both ends
of the transaction. Everything is sampled relative to this clock. If some
signals, for example, control signals, arrive before, after or staggered with
respect to their acceptance window you can have really weird effects.
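
The acceptance window can be expressed as a setup/hold check around the sampling clock edge. This is a simplified sketch with made-up numbers; the function name and the nanosecond values are assumptions for illustration only.

```python
def meets_timing(valid_from_ns: float, valid_until_ns: float,
                 edge_ns: float, setup_ns: float, hold_ns: float) -> bool:
    """True if the signal is stable for the whole setup/hold window.

    The signal must be valid from (edge - setup) through (edge + hold).
    """
    return (valid_from_ns <= edge_ns - setup_ns
            and valid_until_ns >= edge_ns + hold_ns)


# Clock edge at 10 ns, 1.5 ns setup, 0.5 ns hold (hypothetical values).
print(meets_timing(8.0, 12.0, 10.0, 1.5, 0.5))   # arrives early enough
print(meets_timing(9.0, 12.0, 10.0, 1.5, 0.5))   # arrives 0.5 ns too late
```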

With large FPGA designs you can have issues related to faulty design of the
power distribution system. Power and signal integrity are major fields of
study and truly necessary parts of modern electronics design. Traces on a
board are like capacitors that need to be charged and discharged. If you have
200 traces switching from 0 to 1 simultaneously a lot of current will be
required of the power system within nanoseconds (or picoseconds). If the power
distribution system on the board isn't designed to deal with such transients
you end-up with all manner of weird effects. For example, transmission lines
might be perfect in impedance, crosstalk and time of flight yet signals arrive
with lots of jitter and all over the place in terms of timing. The power
distribution system on a board is like your heart, if it can't deal with
demand you are not going to go from sitting to standing and then running
without major problems.
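
To put a rough number on the 200-trace example: treating each trace as a lumped capacitor, the average current during the edge is I = N * C * dV / dt. Only the trace count comes from the text above; the 10 pF load, 3.3 V swing and 1 ns edge are assumed values for illustration.

```python
def switching_current(n_traces: int, cap_farads: float,
                      swing_volts: float, edge_seconds: float) -> float:
    """Average current to charge n identical trace capacitances.

    Lumped-capacitor approximation: I = N * C * dV / dt.
    """
    return n_traces * cap_farads * swing_volts / edge_seconds


# 200 traces, 10 pF each, 3.3 V swing, 1 ns edge (last three are assumptions).
amps = switching_current(200, 10e-12, 3.3, 1e-9)
print(f"{amps:.1f} A")
```

Under these assumptions the supply has to source several amps for about a nanosecond, which is why decoupling and plane design matter so much.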

This is only the tip of the iceberg. I could go on for pages and probably
write a book about this. I've made enough mistakes.

And so, from the vantage point of a software engineer who also happens to be a
hardware engineer blaming the hardware almost always comes first until the
hardware is proven to be operating according to requirements.

As for the PlayStation issue in the original post: from my perspective this
is simply bad engineering on the part of those who designed the hardware. OK,
this isn't the engine control computer on a Toyota. The sentiment is the same.
Fault tolerant design is important, even for toys. Think consumer drones.

~~~
webhat
On the Coursera course mosfet-001 this was one of the first things which was
discussed. Coming from a non-hardware background, it was an eye opener to
realize that minute changes in the composition of components could have such a
large impact.

------
Zenst
Had many fun bugs in my time:

Buffer size on line printer out of sync with what the program (COBOL) was
sending and would produce variable results every so often when WRITE BEFORE
and WRITE AFTER got used.

Then the ones which work fine with debug libraries and not on production ones,
as it turns out the debug libraries had an unintended fix to a bug that was
yet to be known. Also had programs that work fine when you don't use debug
libraries.

Then we have driver bugs, which you will have encountered if you have ever
touched graphics, more so when there is a shift in driver model, as ME
introduced and which was stable enough by Windows XP; same with Vista, which
is stable in 7 and 8.

Suffice to say, the hardest bugs almost always turn out to be hardware, or
some external software being documented wrong or having undocumented features.
Also, new revisions in hardware can cause the smallest of issues in the niches
of uses, but it can be you, and it is a small lonely world when you try to fix
one of those.

Though for me the hardest bugs are always the ones where you know whereabouts
they are, but you are just unable to prove it to those able to investigate it
to the level of being able to prove it is their issue.

Also, bugs that are hard for some are easier for others, and we have all had
a bit of code or aspect of life where we seem blinkered from seeing what is
wrong; somebody else could look at it and solve it in seconds. Then we have
also been that somebody, looking at others' code or issue and seeing the
problem and solution. Sometimes we see the problem before they can see it as
a problem.

Had a classic in the 80's doing a mailing list update to add on postal codes
(same as USA ZIP codes, only UK flavour). I advised that some addresses would
already have a postal code tagged onto the address lines, and in some cases
could end up with not only 2 post codes on the address but also not always
the same ones, as the one entered may have been slightly wrong. This was
dismissed until 5pm Friday, while prepping to head 300 miles home south for
the weekend. Was late heading home that weekend. Though that would be an
annoying bug more than a hard bug, however hard it was upon my weekend. But
we learn how to highlight issues better, or overhighlight. I have learned
much from weather warnings at least, and how they lean more towards the
worst case situation these days than, say, in the 80's.

Lastly though, we have those intermittent bugs, so rare and niche, and with
impact so negligible and negatable, that they are easier to work around than
to try to fix. Mostly such fixes would cost more to identify fully and address
than any workaround. I suspect we have encountered more of those than you
realise. That is how hacking was born, after all. Without bugs, would we have
had the hackers history gave us? That in itself is worth a thought.

------
joshguthrie
Coolest bug ever.

------
abraininavat
I wonder if this experience had any effect on how the author writes code. He
basically backed into making the code testable. The process of identifying the
clock as part of the problem would presumably have been much easier if he
had adhered to SOLID principles.
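
On the software side, "testable" here usually means passing the time source in rather than reading it directly, so a test can substitute a fake. A minimal sketch, assuming nothing about the original game code; `poll_with_timeout` and `FakeClock` are hypothetical names.

```python
import time
from typing import Callable


def poll_with_timeout(ready: Callable[[], bool], timeout: float,
                      now: Callable[[], float] = time.monotonic) -> bool:
    """Poll `ready` until it returns True or `timeout` seconds elapse."""
    start = now()
    while now() - start < timeout:
        if ready():
            return True
    return False


class FakeClock:
    """Test double that advances by a fixed step on every reading."""

    def __init__(self, step: float) -> None:
        self.t = 0.0
        self.step = step

    def __call__(self) -> float:
        self.t += self.step
        return self.t


# A device that never becomes ready times out without any real waiting.
timed_out = not poll_with_timeout(lambda: False, timeout=1.0, now=FakeClock(0.25))
print(timed_out)
```

This isolates code from the clock it reads; it does nothing to isolate one hardware subsystem from electrical interference by another.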

~~~
crazy1van
Much easier said than done. Does your latest web app have code that allows you
to isolate and test the motherboard's clock generator circuit?

At some point you just have to take it on faith that lower level components
work correctly. And yeah, every once in a while you'll hit a sticky issue
like in this article.

