My Hardest Bug Ever (gamasutra.com)
345 points by danso on Nov 1, 2013 | 79 comments

In the hardware/embedded engineering world, this is what is known as The Moment. How long it takes for a developer to experience The Moment is correlated with how close they are to the metal. I'm not sure if it's the solder, the silicon, or the flux, but it seeps into your mind and slowly, but surely, The Moment will happen.

The project is going swimmingly well. Smoother than normal, actually. The calm before the storm. Suddenly, abruptly, your device fails. Maybe you were there, maybe someone else reported it; neither makes it less mysterious. You brush it off with more important things to do. It happens again. Now it bothers you; top priority. You try for hours to make it happen consistently. No amount of tin foil, bunny ears, or interpretive dance will make it happen again. So you give up, you start steel plating the code. Paranoid error checking everywhere, setting variables twice, resetting, extra padding on all the timing parameters. PARANOIA. Your code begins to look like a lunatic wrote it. But the bug never happens again. Your paranoia and second-guessing fixed the problem.

The Moment! From then on, whenever something doesn't quite _feel_ right with your projects, you start inserting strange, almost ritualistic incantations in your code. If anyone asks why you decided to sleep for 14.23ms instead of the prescribed 10ms, you just call them a devil and run away back to your ancient cross compiling GCC and make a sacrifice to the Upstream Gods.

I feel for the author. From that Moment on, his code will never be the same. His sanity, now left on the cutting room floor along with the trust he once had in Datasheets and Programming Manuals.

Embedded 4 Life.

This post spoke to me in ways I had never imagined. This is what embedded programming is like!

"As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware."

I dunno, as an embedded programmer working on new HW and sometimes with a custom compiler, blaming the HW became the 5th thing, and the compiler the 6th...

I'd probably be more likely to blame the hardware or compiler if my work was programming new hardware or custom compilers!

I think the situation is that if you're quick to blame others for a problem which turns out to be your own mistake - it has a tendency to really piss off co-workers. Experienced programmers learn to challenge their own code first as a habit out of consideration for others on their team. It's embarrassing to swear up and down that there is a hardware/compiler/whatever problem, only to have a bunch of people look into it and find a mistake that you made.

I think it's more a matter of hardware and compilers usually being widely-used and well-tested. You're just not very likely to discover a bug in an Intel CPU or GCC, while your newly-written code, being newly-written, hasn't been well-tested yet.

If you're using one-off hardware or a brand new compiler, it isn't so unreasonable to suspect those pieces of having bugs.

The PlayStation came out in Dec 94 (Japan) / Sep 95 (North America). Crash came out in 96. I think it had a custom compiler called ccpsx (found by googling around).

I was thinking the same thing. At my last job we had one dev board (TI ARM/DSP SOC + lots of custom hardware) that we could reset with a static discharge into the carpet 6 - 8 feet away. That took us a while to figure out :p

Yes. Clocks and reactances in circuit always spring to mind as likely candidates almost immediately when you include hardware. A PCB's EMI can be altered measurably by software decisions. An acquaintance asked the other day why his toshiba portable was locking up in boot. First question I asked was did you change the ram clocks in the bios.

Many have seen timing and slew, with all its reasons, while overclocking. Nothing at all to do with quantum effects, per se.

Just in case you guys don't know this, since I don't see it mentioned anywhere, Dave is also the co-founder of ITA software which is basically the engine holding up modern transportation. It powers almost all of the websites that deal in transportation (hipmunk, kayak, etc.), and this is, and I believe this is Dave's term, a fractal problem.

Fractal problems are those that look really simple from 100,000 feet up (like a dot) and reach unimaginable complexity as you get closer (like a fractal). There are a lot of problems of this nature (like email, transportation, ecommerce, taxation, etc.) and solving them is worth a lot of money but requires a LOT of work.

Wouldn't a fractal problem look only somewhat hairy from 100k feet, and yet also be hairy at 10k, 1k, 100, 10, 1 feet ? It should look somewhat similar at all zoom levels.

Problems that look simple from 100k feet but get complex as you get up close are what I'd call "normal" problems.

Current fractal problem: email. (See my profile for details.)

The same story was making the rounds yesterday from Quora.


I'm surprised this story didn't credit Quora as the original published source. I imagine that's possibly because Gamasutra reached out directly to Baggett or he decided himself to cross-post the story after its success on Quora.

I find the phenomenon of Quora posts becoming "real" articles quite fascinating. I've actually been "published" in Slate and Forbes online just for spending some time writing answers.

I'm guessing he ran it on Quora to see if it would gain any traction, and since it did, either submitted it to Gamasutra or was asked by them to publish it there.

Since it's his answer, he can reproduce it anywhere he wants without crediting Quora as a source (since he's the source here), though it'd be sweet if he gave Quora a nod :)

Should have done that. Updated.

Thanks :)

Using Quora (or Reddit) is really a smart first move. It allows you to find interesting questions/subjects that you could answer (which is not trivial). It also allows you to directly reach a base of interested readers.

Kind of what HN does for ideas/start-ups.

I wonder if this also exists for other types of markets (products, music, ...) ...

>I wonder if this also exists for other types of markets (products, music, ...) ...

Please elaborate on that thought!

I think it's about services that are stream-centered and sectioned.

HN, Reddit, Upbeatapp, Quora (?) are stream-centered. Information comes and goes, which makes it very easily accessible.

Reddit, Upbeatapp, and Quora are also sectioned: there are subreddits, there are subgenres in Upbeatapp, and in Quora you can follow topics/interests/questions. HN is not sectioned but is already very specific. That gives you access to a pre-targeted audience.

The two factors combined give easy access to a large, interested audience.

(Note that this is not the case for Twitter, YouTube, SoundCloud, Tumblr...) I don't know Medium well enough to be sure, but I think they are trying to do the same (sections + stream).

I'm trying to find online services that use this pattern for other types of markets (movies, food recipes, etc.)

You can put your music on YouTube.


See... not exactly.

Because if I create an account on Soundcloud and put up some awesome music that I created, what will happen? Nothing. Same for YouTube. I would have to be noticed by someone with a huge number of followers to take off.

If I write an awesome post about start-ups or about an experience, and I post it on HN, I could go from 0 to lots of good (read "targeted") traffic.

If I write a great answer on Quora, I could go from nothing to lots of readers.

That can happen. But on soundcloud you also get followers, replays, loves etc. I find artists there all the time I would never find anywhere else. myristica is one: https://soundcloud.com/myristica/into-the-night or this: https://soundcloud.com/myristica/beyond-the-glow so good. I'd say anything good is built slowly, iteratively, each fan/follower appreciated and eventually you keep putting content on as many channels as you can, one will pick up or hit some sort of momentum and you run with it. New artists are discovered all the time on Soundcloud. I can't live without BLOOD BROS. I have shipped tons of product to this: https://soundcloud.com/maddecent/blood-bros-iii-back-in

Not against Quora in any way. I like it. Get traction however you can, wherever you can, and especially in places where people are looking for what you make.

Looks like it credits quora now.

<<I find the phenomenon of Quora posts becoming "real" articles quite fascinating.>>

I'm assuming that before they become real, they exist in an indeterminate quantum state?

Noted...though this link has the advantage of not being behind a login-wall.

This was a fascinating read. Thanks.

How is this a quantum effect? Clocks can be noisy, so if your board isn't designed to keep everything isolated properly, you'll pick up noise all over the place. It's an electromagnetic effect, sure, but a quantum effect?

It's not. What he means is that he's not a hardware guy, and this problem was caused by something even beyond the problems that software people normally call 'hardware issues'.

Technically, quantum refers to physics at the atomic / subatomic scale, and not necessarily to the spooky effects like particle wave duality and such that we usually attribute to the term.

So, I believe this would be an "effect on the quantum level," even if it can be understood through the lens of more traditional electromagnetic physics as well.

Calling an electrical noise / timing bug "quantum mechanics" is hyperbole. Otherwise, every EE that touches hardware is a quantum physicist (they're not).

EDIT: Not trying to diminish the OP's impressive feat of debugging though. Hardware errors can be beastly to diagnose. When wire-wrapping an 8086 computer, I used a spool of wire with occasional (random) breaks that would intermittently open. Worst. Bugs. Ever.

Electromagnetism and quantum physics have quite a bit to do with one another. Einstein won his Nobel on that one ;)

He probably should have said "quantization error".

I had to debug a problem in our program where an MMX register would get corrupted under a new sampling profiler. Turns out the profiler would forget to restore MMX registers: the profiler devs never used MMX, and it did not occur to them that a component they called would do so. That took a while to debug.

Another fun bug was when an alpha version of the CLR failed to restore one of the two registers used to control loop execution on the Itanium. (Yes, IA-64 had two registers: one for the loop variable as seen by the program and one to actually control the loop execution.)

Here is a recent Firefox crash which, after some heroic debugging, appears to be an AMD CPU bug involving a CPU race condition "after a not-taken branch that ends on the last byte of an aligned quad-word"! The gory debugging details:


There were only two developers for that game? Wow, I would've thought there'd be a lot more.

As you go back in time, the number of people working on a game drops dramatically. Crash Bandicoot was released in 1996 and had 2 programmers http://en.wikipedia.org/wiki/Crash_Bandicoot_(video_game)

Three years earlier, Doom was released. It had 4 programmers.

Six years before that, Final Fantasy had 1 programmer.

If you stick around for the credits in a modern video game*, it's clear hundreds of people are involved.

I've beaten FF many times before and have seen the early credits screen...but I would've never guessed that just a single programmer was behind it:


> Although Final Fantasy is one of the most popular video games on the NES, programming an RPG proved somewhat difficult for Nasir. According to Sakaguchi, "it was the first time he had programmed anything like an RPG". Gebelli did not fully understand what an RPG was and how the battle system for such a game should work. I believe that Final Fantasy may have suffered from so many bugs and glitches due to Gebelli's understanding of RPGs. Nevertheless, Final Fantasy is still a fantastic game and Gebelli did not cease to amaze. Players were in awe at the battle system that Nasir programmed; being able to use four characters at once, the turn-based combat, and especially the spells and their glorious 8-bit animations.

I don't remember there being "so many bugs" in FF. As far as I remember, it was the only released-in-America FF to have any real challenge.

The bugs I know of boiled down to abilities that either did nothing, or worked far less often than they were supposed to. It's quite possible to beat the game without consciously noticing the bugs (although you might simply stop casting certain spells that never seem to work, not knowing that they actually never worked).

Unless you look at the Indies.

Of course. I think the idea of the 2-3 person indie team or even solo developer is quite romantic. For instance: http://jere.in/sneaking-into-rohrers-castle-part-1

With that last comment though, I was mainly referencing AAA games (e.g. Bioshock Infinite).

Thanks. You might have just stolen all my freetime.

Ha! You mean TCD? It is pretty addictive, but the crazy part is I couldn't convince hardly any of my friends to play. Anyway, I hope you enjoy it. I'm pretty active on the forums if you have any questions.

Reminds me to get back into Frozen Synapse. (Though I did manage to get friends to play.) Frozen Synapse is probably the perfect execution of the idea "`UFO: Enemy Unknown' tactical battles in multiplayer".

What I usually see in the credits for AAA video games is a handful of programmers - 5 - 10 - for the game itself, a bunch more for the engine - if they are credited at all. Most 'programming' done in video games is actually scripting, i.e. player touches button, door opens.

Well, they were writing in Lisp ;).

That was my thought too. For a studio that was so important to Sony that a hardware engineer investigated a "hardware" bug just to make a software engineer happy, I would assume they would have a larger team.

Also, having played the game - wow, two engineers.

There would be a lot more artists.

It sounds pretty bad; anything where the bug appears randomly sucks. But, for me, the worst bugs are random+multithreaded+statistical. E.g., a random bug in a distributed machine learning system=bug of death

And here's my pet peeve:

"He called me and, in his broken English and my (extremely) broken Japanese, we argued. I finally said, "just let me send you a 30-line test program that makes it happen when you wiggle the controller." He relented. This would be a waste of time, he assured me, and he was extremely busy with a new project, but he would oblige because we were a very important developer for Sony. I cleaned up my little test program and sent it over."

Really? Being humble doesn't hurt.

At the same time I love when a smug face melts with a concrete proof and "I told you so". Save face and be a little more humble.

Note about general concept: it's not about saving face; it's more about losing face (for yourself or others). "Saving face" has selfish and individualistic connotations, and implies actively maintaining one's face or ego. That's not the point of the concept, and people don't try to maintain face or their ego. Losing is more significant than saving or gaining.

And it's not about trying to prevent loss of face, it's about the act of losing face. You don't actively try to maintain face. You only criticize and say something about losing face when you actually do because it's not a good thing to lose face, and generally people don't lose face (if they are competent or do things right). So when you do mess up, that's when you say you lost face.

And also, if one thinks that Japanese (substitute any other) society is all about saving face or ego, then that's a wrong conception. People (everywhere) don't really think about losing face or maintaining ego anyways. If everyone did, then societies would be quite narcissistic and self-interested, and that's not the case. The concept is only invoked when someone loses face. That's it.

I found this to largely be a cultural thing with the Japanese. Japanese are extremely proud of their work and to mention ANY kind of mistake or improvement is tantamount to insulting their mother. Of course there are exceptions but this is the general attitude they bring to work.

This is how I interpreted it as well. I was not offended by his reaction; perhaps I should have noted that in the piece.

I understand you weren't offended. It's just that he was putting his pride before the work.

Also, yes, I know this may be more common in Japanese people, still I didn't want to associate it with the stereotype since I've seen it in lots of people (and why not, maybe even with me), with several different backgrounds.

> This is the only time in my entire programming life that I've debugged a problem caused by quantum mechanics.

Technically, all bugs are caused by quantum mechanics.

And even more technically, it doesn't actually make sense to say "caused by quantum mechanics".


Interference is a pretty good one.

I've diagnosed temperature problems (grab ice from the freezer, apply to chip, see it work...), clocking problems (you insert strategic delays, sometimes on the order of thousands of instructions) and just badly documented registers (make sure mystery bit number 13 gets toggled just right, or it's curtains).

It's fun stuff.

Dave Baggett wrote a couple of nice text adventures for the TADS system before he worked on Crash, back in the early 1990s. I remember enjoying both "Unnkulian Unventure II: The Secret of Acme" and "The Legend Lives!" (Unnkulian episode 5)...

Now that is a blast from the past. :)

What can I say, we had some nice chats back on rec.arts.int-fiction as well waaaaaaay back when....

Yes. We've all had to do similar things on embedded systems - you just cut code out until you narrow down the problem. And yes, you have to convince the hardware engineer that there is a problem with his board and that's always an uphill climb. Sometimes he'll fix the board, but more often you're stuck with a workaround.
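The "cut code out until you narrow down the problem" approach is essentially a bisection over whatever is still enabled. A minimal Python sketch of the idea, assuming a hypothetical `fails(enabled)` check that reports whether the bug reproduces with a given subset enabled, and assuming a single culprit that fails reliably (which intermittent bugs usually don't, so in practice each check may mean many runs):

```python
from typing import Callable, Sequence


def bisect_culprit(modules: Sequence[str],
                   fails: Callable[[Sequence[str]], bool]) -> str:
    """Return the single module whose presence triggers the failure.

    Assumes exactly one culprit and a deterministic `fails` check; both
    are simplifications for illustration.
    """
    candidates = list(modules)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        if fails(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Hypothetical module names; the bug only shows up when "controller" is enabled.
modules = ["audio", "video", "controller", "memcard", "cdrom"]
culprit = bisect_culprit(modules, lambda enabled: "controller" in enabled)
print(culprit)
```

If the bug needs two pieces of code interacting, this simple version no longer applies and you end up removing things one at a time instead.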

Nowadays I'm working on these big distributed clusters, far from the bare metal. But you know, just now I rather miss those days on the little embedded systems.

Not my own bug, but a bug that happened in some other studio, retold by one of our ex-producers:

Basically the bug was on the Nintendo GameBoy, and it only happened if you pressed two opposite buttons (left and right, or up and down) at the same time. Normally you can't do that, since the controller won't allow it. But the guy this producer told me about was hard-core QA: he ripped the controller open and manually wired some stuff so he'd get LEFT and RIGHT pressed down at the same time, and only then would the game crash...
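For what it's worth, a common defensive measure against this class of bug is to sanitize the raw button state before the game logic ever sees it. A rough sketch in Python; the bit assignments below are made up for illustration and are not the real GameBoy layout:

```python
# Hypothetical bit assignments, chosen for illustration only.
RIGHT, LEFT, UP, DOWN = 0x01, 0x02, 0x04, 0x08


def sanitize_dpad(buttons: int) -> int:
    """Clear any pair of opposite directions that are pressed together."""
    if buttons & LEFT and buttons & RIGHT:
        buttons &= ~(LEFT | RIGHT)
    if buttons & UP and buttons & DOWN:
        buttons &= ~(UP | DOWN)
    return buttons
```

Dropping both directions is only one possible policy; keeping the most recently pressed one is another, but it needs the previous frame's state.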

But then as we say sometimes here, for such things - it might get the... WNF (Will-not-fix).

This type of bug is very well-known in the tool-assisted speedrun community; many NES TASes rely upon similar bugs.

I once solved a bug where i=5 didn't execute (single stepping through the assembly also failed to assign the variable), the fix?

and, I believe, we shipped it like that.

Tangent: there are lots of points to debate about how code should be commented, but this is the kind of code that must always, always be commented explicitly.

The OP is a great story where the coder had the time (well, was forced to have the time!) and insight to actually find an underlying hardware issue behind the bug -- and could also contact the hardware developers directly, present the details, and get it fixed.

Most of the time these kinds of bugs are never narrowed down to the hardware, compiler, VM, scripting engine, whatever... they're attacked with the "change things until it works" approach, and then the developer moves on.

It can work, though of course it's fragile because it's based on ignorance... but for practical reasons it's probably going to stay the most common approach for mysterious bugs.

The most important thing is documentation -- because the fix is fragile (maybe a tiny change in timing elsewhere will bring back the bug), and it's likely some other dev will need to dig deeper to provide a better fix. If you've documented exactly what you found so far, what worked and what didn't, etc., you can save a lot of wasted time and frustration.

The optimizer didn't remove that?

If it was an embedded system, some of the widely-used compilers are apparently rather lacking in functionality compared to the desktop ones, so it might not be able to.

Corruption on a management bus; transactions were getting corrupted .01% of the time: writes turning into reads, reads turning into writes, etc. SI issues.

As an embedded software engineer, you need to understand hardware nearly as much as the software. And depending on how far along the hardware is (pre 'gerber out', or deployed in field) it's usually up to software to "patch it over," "hide the issue," "fix it in software," if possible...

My worst was an issue where a load-to-register, followed by a jump to the register, when the jump was on the last word in a cache-line, would about 1 in 10 million times jump to the value that was in the register before the load. It was a race-condition in the hardware. That took many, many months to track down, localize, and then convince the HW vendor it was an issue.

Having a consistent reproduction scenario is half of the battle in debugging, but sometimes getting to that point is hard.
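To put a number on why reproduction is so hard at those odds: if each attempt fails independently with probability p, you need roughly ln(1 - c) / ln(1 - p) attempts to see at least one failure with confidence c. A quick Python sketch, assuming independent attempts (real hardware races often aren't):

```python
import math


def trials_needed(p_fail: float, confidence: float) -> int:
    """Attempts needed to observe at least one failure with the given confidence.

    Assumes every attempt fails independently with probability p_fail.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_fail))


# 1-in-10-million failure rate, 95% chance of seeing it at least once.
n = trials_needed(1e-7, 0.95)
print(n)
```

That works out to about 30 million executions of the load-and-jump sequence for a 95% chance of a single hit.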

When USB 2 was about to be introduced, I added it to a printer for Kodak using beta versions of driver chips. There was a hardware bug connecting two unrelated bits. Took a month and a whole lot of intuition to figure that one out via software. Chip maker was very happy I found it.

I can see how a software developer could put hardware last, particularly when working with an established platform. I get that.

Coming from the other side of the fence, a place where we develop the hardware first and then bring it to life through software, things are very different.

The first step in bringing up a non-trivial board is to go through a full characterization phase. This is where electrical, mechanical and short software tests are performed in order to determine if the hardware is operating according to requirements. Depending on the nature of the hardware, this period can last months and require many re-spins (iterations where something is fixed or modified).

While this is taking place, and depending on the nature of the team, the application software is probably starting to be assembled on prototype hardware. In some cases this can't happen until you have actual hardware that works reasonably close to specs. Perhaps rev-1 hardware is used to jump-start software development while the hardware team goes through many revs in order to make adjustments and fix problems.

Seemingly weird hardware problems abound. I have been in situations where the signal is good at one end of a trace or cable and not so good at the other end. In the case of high speed design this can easily happen if there are problems with the design of the transmission lines carrying the signals. You can easily end up with reflections that will wreak havoc on the signals as they go down the transmission line.

Another "weird" hardware issue in high speed design is signals that don't arrive at the destination within a specific timing window. Dynamic RAM designs are one example of this. A clock is used to gate various signals at both ends of the transaction. Everything is sampled relative to this clock. If some signals, for example control signals, arrive before, after or staggered with respect to their acceptance window, you can have really weird effects.

With large FPGA designs you can have issues related to faulty design of the power distribution system. Power and signal integrity are major fields of study and truly necessary parts of modern electronics design. Traces on a board are like capacitors that need to be charged and discharged. If you have 200 traces switching from 0 to 1 simultaneously, a lot of current will be required of the power system within nanoseconds (or picoseconds). If the power distribution system on the board isn't designed to deal with such transients, you end up with all manner of weird effects. For example, transmission lines might be perfect in impedance, crosstalk and time of flight, yet signals arrive with lots of jitter and all over the place in terms of timing. The power distribution system on a board is like your heart: if it can't deal with demand, you are not going to go from sitting to standing and then running without major problems.
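A back-of-the-envelope version of the "traces are capacitors" point: the charging current is roughly I = N * C * dV/dt. The Python sketch below uses illustrative numbers that are my own assumptions, not measurements from any real board (10 pF per trace, a 3.3 V swing, a 1 ns rise time):

```python
def transient_current(n_traces: int, cap_farads: float,
                      swing_volts: float, rise_seconds: float) -> float:
    """Approximate current needed to charge n_traces capacitive loads at once.

    Uses I = N * C * dV/dt with a linear ramp; ignores inductance,
    decoupling capacitors and termination, so it is only an estimate.
    """
    return n_traces * cap_farads * swing_volts / rise_seconds


# Assumed values: 200 traces, 10 pF each, 3.3 V swing, 1 ns rise time.
peak = transient_current(200, 10e-12, 3.3, 1e-9)
print(f"{peak:.1f} A")
```

With those assumed values the estimate comes to several amps for about a nanosecond, which is the kind of transient the decoupling network has to supply locally.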

This is only the tip of the iceberg. I could go on for pages and probably write a book about this. I've made enough mistakes.

And so, from the vantage point of a software engineer who also happens to be a hardware engineer, blaming the hardware almost always comes first, until the hardware is proven to be operating according to requirements.

As for the PlayStation issue in the original post: well, from my perspective this is simply bad engineering on the part of those who designed the hardware. OK, this isn't the engine control computer on a Toyota, but the sentiment is the same. Fault-tolerant design is important, even for toys. Think consumer drones.

On the Coursera course mosfet-001 this was one of the first things discussed. Coming from a non-hardware background, it was an eye-opener to realize that minute changes in the composition of components could have such a large impact.

Had many fun bugs in my time:

Buffer size on line printer out of sync with what the program (COBOL) was sending and would produce variable results every so often when WRITE BEFORE and WRITE AFTER got used.

Then the ones which work fine with debug libraries and not with production ones, as it turns out the debug libraries had an unintended fix for a bug that was not yet known. Also had programs that work fine only when you don't use debug libraries.

Then we have driver bugs, which, if you have ever touched graphics, you will have encountered, more so when there is a shift in driver model, such as the one ME introduced and which was stable enough by Windows XP; same with Vista's, which is stable in 7 and 8.

Suffice to say, the hardest bugs almost always turn out to be hardware, or some external software being documented wrong, or undocumented features. Also, new revisions of hardware can cause the smallest of issues in niche uses, but it can be you who hits them, and it is a small, lonely world when you try to fix one of those.

Though for me the hardest bugs are always the ones where you know whereabouts they are, but you are unable to prove it to those able to investigate it to the level of being able to prove it is their issue.

Also, some bugs are easier for some than for others, and we have all had a bit of code or aspect of life where we seem blinkered from seeing what is wrong; somebody else could look at it and solve it in seconds. Then we have also been that somebody, looking at others' code or issue and seeing the problem and solution. Sometimes we see the problem before they can see it as a problem. Had a classic in the 80s doing a mailing list update to add postal codes (same as USA ZIP codes, only UK flavour). I advised that some addresses would already have a postal code tagged onto the address lines, and that in some cases we could end up with not only 2 post codes on the address but two different ones, as the one entered may have been slightly wrong. This was dismissed until 5pm Friday, as I was prepping to head 300 miles home south for the weekend. Was late heading home that weekend. Though that would be an annoying bug more than a hard bug, however hard it was on my weekend. But we learn how to highlight issues better, or over-highlight. I have learned much from weather warnings, at least, and how they lean more towards the worst-case situation these days than they did in, say, the 80s.
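The check I was asking for on that mailing list job would look something like this today. A rough Python sketch (obviously not what we ran in the 80s); the pattern is a simplified approximation of the UK postcode shape, not a full validator, and the function names are made up for illustration:

```python
import re

# Simplified approximation of the UK postcode shape; not a full validator.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")


def normalize(postcode: str) -> str:
    """Uppercase and strip whitespace so 'ls1 4ap' and 'LS14AP' compare equal."""
    return re.sub(r"\s+", "", postcode.upper())


def add_postcode(address_lines, new_postcode):
    """Return (lines, conflict).

    Appends new_postcode only if no postcode is already present; if one is
    present but differs from new_postcode, conflict is True and the record
    is left unchanged for manual review.
    """
    for line in address_lines:
        match = POSTCODE_RE.search(line.upper())
        if match:
            conflict = normalize(match.group()) != normalize(new_postcode)
            return list(address_lines), conflict
    return list(address_lines) + [new_postcode], False
```

For example, `add_postcode(["1 High St", "Leeds LS1 4AB"], "LS1 4AP")` leaves the address alone and flags a conflict instead of producing two different post codes on one label.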

Lastly, though, we have those intermittent bugs, so rare and niche that the impact is negligible, and they are easier to work around than to try to fix. Mostly such fixes would cost more to identify fully and address than any workaround. I suspect we have encountered more of those than you realise. That is how hacking was born, after all. Without bugs, would we have had the hackers that history gave us? That in itself is worth a thought.

Coolest bug ever.

I wonder if this experience had any effect on how the author writes code. He basically backed into making the code testable. The process of identifying the clock as part of the problem would presumably have been much easier if he had adhered more strongly to SOLID principles.
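To illustrate what I mean by testable (this is a generic Python sketch of dependency inversion, not anything resembling the actual PlayStation code; the `Debouncer` and `FakeClock` names are hypothetical): if the time source is passed in rather than hard-wired, it can be swapped out or held still when you are trying to isolate it.

```python
import time
from typing import Callable, Optional


class Debouncer:
    """Accepts an event only if `interval` seconds passed since the last one."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval = interval
        self._clock = clock  # injected time source, replaceable in tests
        self._last: Optional[float] = None

    def accept(self) -> bool:
        now = self._clock()
        if self._last is None or now - self._last >= self._interval:
            self._last = now
            return True
        return False


class FakeClock:
    """Manually advanced clock for tests."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now
```

With a `FakeClock` in place of `time.monotonic`, timing-dependent behaviour can be stepped through deterministically, which at least tells you whether the logic or the real clock is at fault.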

Much easier said than done. Does your latest web app have code that allows you to isolate and test the motherboard's clock generator circuit?

At some point you just have to take it on faith that lower level components work correctly. And yeah, every once in a while you'll hit a sticky issue like in this article.
