The project is going swimmingly. Smoother than normal, actually. The calm before the storm. Suddenly, abruptly, your device fails. Maybe you were there, maybe someone else reported it; neither makes it less mysterious. You brush it off; there are more important things to do. It happens again. Now it bothers you; top priority. You try for hours to make it happen consistently. No amount of tin foil, bunny ears, or interpretive dance will make it happen again. So you give up and start steel-plating the code. Paranoid error checking everywhere, setting variables twice, resetting, extra padding on all the timing parameters. PARANOIA. Your code begins to look like a lunatic wrote it. But the bug never happens again. Your paranoia and second-guessing fixed the problem.
The Moment! From then on, whenever something doesn't quite _feel_ right with your projects, you start inserting strange, almost ritualistic incantations in your code. If anyone asks why you decided to sleep for 14.23ms instead of the prescribed 10ms, you just call them a devil and run away back to your ancient cross compiling GCC and make a sacrifice to the Upstream Gods.
I feel for the author. From that Moment on, his code will never be the same. His sanity now lies on the cutting room floor, along with the trust he once had in Datasheets and Programming Manuals.
Embedded 4 Life.
I dunno, as an embedded programmer working on new HW and sometimes with a custom compiler, blaming the HW became the 5th thing, and the compiler the 6th...
I think the situation is that if you're quick to blame others for a problem which turns out to be your own mistake, it has a tendency to really piss off co-workers. Experienced programmers learn to challenge their own code first as a habit, out of consideration for others on their team. It's embarrassing to swear up and down that there is a hardware/compiler/whatever problem, only to have a bunch of people look into it and find a mistake that you made.
If you're using one-off hardware or a brand new compiler, it isn't so unreasonable to suspect those pieces of having bugs.
Many have seen timing and slew issues, with all their causes, while overclocking. Nothing at all to do with quantum effects, per se.
Fractal problems are those that look really simple from 100,000 feet up (like a dot) and reach unimaginable complexity as you get closer (like a fractal). There are a lot of problems of this nature (like email, transportation, ecommerce, taxation, etc.) and solving them is worth a lot of money but requires a LOT of work.
Problems that look simple from 100k feet but get complex as you get up close are what I'd call "normal" problems.
I find the phenomenon of Quora posts becoming "real" articles quite fascinating. I've actually been "published" in Slate and Forbes online just for spending some time writing answers.
Since it's his answer, he can reproduce it anywhere he wants without crediting Quora as a source (since he's the source here), though it'd be sweet if he gave Quora a nod :)
Kind of like what HN does for ideas/start-ups.
I wonder if this also exists for other types of markets (products, music, ...) ...
Please elaborate on that thought!
HN, Reddit, Upbeatapp, Quora (?) are stream-centered. Information comes and goes, which makes it very easily accessible.
Reddit, Upbeatapp, and Quora are also sectioned; there are subreddits, there are subgenres on Upbeatapp, and on Quora you can follow topics/interests/questions.
HN is not sectioned but is already very specific.
That gives you access to a pre-targeted audience.
The two factors combined give easy access to a large, interested audience.
(Note that this is not the case on Twitter, YouTube, SoundCloud, Tumblr...)
I don't know Medium well enough to be sure, but I think they are trying to do the same (sections + stream).
I'm trying to find online services that use this pattern for other types of markets (movies, food recipes, etc.)
Because if I create an account on SoundCloud and put up some awesome music that I created, what will happen?
Same for YouTube. I would have to be discovered by someone with a huge number of followers to take off.
If I write an awesome post about start-ups or about an experience and post it on HN, I could go from 0 to lots of good (read "targeted") traffic.
If I write a great answer on Quora, I could go from nothing to lots of readers.
Not against Quora in any way, I like it. Get traction however you can, wherever you can, and especially in places where people are looking for what you make.
I'm assuming that before they become real, they exist in an indeterminate quantum state?
So, I believe this would be an "effect on the quantum level," even if it can be understood through the lens of more traditional electromagnetic physics as well.
EDIT: Not trying to diminish the OP's impressive feat of debugging though. Hardware errors can be beastly to diagnose. When wire-wrapping an 8086 computer, I used a spool of wire with occasional (random) breaks that would intermittently open. Worst. Bugs. Ever.
Another fun bug was when an alpha version of the CLR failed to restore one of the two registers used to control loop execution on the Itanium. (Yes, IA-64 had two registers: one for the loop variable as seen by the program and one to actually control loop execution.)
Three years earlier, Doom was released. It had 4 programmers.
Six years before that, Final Fantasy had 1 programmer.
If you stick around for the credits in a modern video game*, it's clear hundreds of people are involved.
> Although Final Fantasy is one of the most popular video games on the NES, programming an RPG proved somewhat difficult for Nasir. According to Sakaguchi, "it was the first time he had programmed anything like an RPG". Gebelli did not fully understand what an RPG was and how the battle system for such a game should work. I believe that Final Fantasy may have suffered from so many bugs and glitches due to Gebelli's understanding of RPGs. Nevertheless, Final Fantasy is still a fantastic game and Gebelli did not cease to amaze. Players were in awe at the battle system that Nasir programmed; being able to use four characters at once, the turn-based combat, and especially the spells and their glorious 8-bit animations.
I don't remember there being "so many bugs" in FF. As far as I remember, it was the only released-in-America FF to have any real challenge.
With that last comment though, I was mainly referencing AAA games (e.g. Bioshock Infinite).
Also, having played the game - wow, two engineers.
"He called me and, in his broken English and my (extremely) broken Japanese, we argued. I finally said, "just let me send you a 30-line test program that makes it happen when you wiggle the controller." He relented. This would be a waste of time, he assured me, and he was extremely busy with a new project, but he would oblige because we were a very important developer for Sony. I cleaned up my little test program and sent it over."
Really? Being humble doesn't hurt.
At the same time, I love it when a smug face melts when confronted with concrete proof and an "I told you so". Save face and be a little more humble.
And it's not about trying to prevent loss of face, it's about the act of losing face. You don't actively try to maintain face. You only criticize and say something about losing face when you actually do because it's not a good thing to lose face, and generally people don't lose face (if they are competent or do things right). So when you do mess up, that's when you say you lost face.
And also, if one thinks that [Japanese](substitute for any other) society is all about saving face or ego, then that's a misconception. People (everywhere) don't really think about losing face or maintaining ego anyway. If everyone did, then societies would be quite narcissistic and self-interested, and that's not the case. The concept is only invoked when someone loses face. That's it.
Also, yes, I know this may be more common among Japanese people, but I still didn't want to associate it with the stereotype, since I've seen it in lots of people (and why not, maybe even in me) with several different backgrounds.
Technically, all bugs are caused by quantum mechanics.
I've diagnosed temperature problems (grab ice from the freezer, apply to chip, see it work...), clocking problems (you insert strategic delays, sometimes on the order of thousands of instructions) and just badly documented registers (make sure mystery bit number 13 gets toggled just right, or it's curtains).
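For the badly documented register case, here's a minimal sketch of what that kind of fix tends to look like; the register address, bit position, and delay count are all hypothetical, invented for illustration rather than taken from any real part:

```c
/* Hypothetical sketch of the "mystery bit 13" kind of workaround.
 * The register address, bit, and delay length are made up; the real
 * values typically come from trial and error, not the datasheet. */
#include <stdint.h>

#define MYSTERY_CTRL   (*(volatile uint32_t *)0x40001000u)  /* undocumented control reg (hypothetical) */
#define MYSTERY_BIT_13 (1u << 13)                           /* "toggle it just right, or it's curtains" */

static void strategic_delay(volatile uint32_t cycles)
{
    /* Busy-wait; sometimes thousands of iterations are needed. */
    while (cycles--) { }
}

void init_flaky_peripheral(void)
{
    MYSTERY_CTRL |=  MYSTERY_BIT_13;   /* set the mystery bit... */
    strategic_delay(5000);             /* ...wait an empirically chosen while... */
    MYSTERY_CTRL &= ~MYSTERY_BIT_13;   /* ...then clear it before continuing */
}
```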
It's fun stuff.
Nowadays I'm working on these big distributed clusters, far from the bare metal. But you know, just now I rather miss those days on the little embedded systems.
Basically the bug was in the Nintendo GameBoy, where it only happened if you pressed two buttons (left and right, or up and down) at the same time. Now, you can't do that normally, since the controller won't allow it. But if you are hard-core QA (like the guy this producer told me about), you rip open the controller and manually wire some stuff so that LEFT and RIGHT are pressed down at the same time, and only then would the game crash...
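A common defensive response to that kind of bug is to sanitize "impossible" pad states before game logic ever sees them. A minimal sketch, assuming a hypothetical pad_read() bitmask rather than the real GameBoy joypad register layout:

```c
/* Hedged sketch of dropping contradictory d-pad inputs.
 * PAD_* masks and pad_read() are assumed/hypothetical. */
#include <stdint.h>

#define PAD_LEFT  (1u << 0)
#define PAD_RIGHT (1u << 1)
#define PAD_UP    (1u << 2)
#define PAD_DOWN  (1u << 3)

extern uint8_t pad_read(void);  /* assumed hardware read, returns pressed-button bitmask */

uint8_t pad_read_sanitized(void)
{
    uint8_t pad = pad_read();

    /* A stock d-pad can't report opposite directions at once, but a rewired
     * controller (or flaky hardware) can. Drop the contradictory pair rather
     * than letting it reach game logic that assumes it's impossible. */
    if ((pad & (PAD_LEFT | PAD_RIGHT)) == (PAD_LEFT | PAD_RIGHT))
        pad &= (uint8_t)~(PAD_LEFT | PAD_RIGHT);
    if ((pad & (PAD_UP | PAD_DOWN)) == (PAD_UP | PAD_DOWN))
        pad &= (uint8_t)~(PAD_UP | PAD_DOWN);

    return pad;
}
```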
But then, as we sometimes say here, such things might get the... WNF (Will-not-fix).
The OP is a great story where the coder had the time (well, was forced to have the time!) and insight to actually find an underlying hardware issue behind the bug -- and could also contact the hardware developers directly, present the details, and get it fixed.
Most of the time these kinds of bugs are never narrowed down to the hardware, compiler, VM, scripting engine, whatever... they're attacked with the "change things until it works" approach, and then the developer moves on.
It can work, though of course it's fragile because it's based on ignorance... but for practical reasons it's probably going to stay the most common approach for mysterious bugs.
The most important thing is documentation -- because the fix is fragile (maybe a tiny change in timing elsewhere will bring back the bug), and it's likely some other dev will need to dig deeper to provide a better fix. If you've documented exactly what you found so far, what worked and what didn't, etc., you can save a lot of wasted time and frustration.
As an embedded software engineer, you need to understand the hardware nearly as much as the software. And depending on how far along the hardware is (pre 'gerber out', or deployed in the field), it's usually up to software to "patch it over," "hide the issue," "fix it in software," if possible...
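One common shape of "fix it in software" is re-reading a glitchy status register until consecutive reads agree. A minimal sketch, with an invented register address and retry limit:

```c
/* Hypothetical workaround: a status register that occasionally glitches
 * is read until two consecutive reads agree. Address and retry count
 * are made up for illustration. */
#include <stdint.h>

#define FLAKY_STATUS (*(volatile uint32_t *)0x40002004u)  /* hypothetical address */
#define MAX_RETRIES  8u

uint32_t read_status_stable(void)
{
    uint32_t prev = FLAKY_STATUS;

    for (uint32_t i = 0; i < MAX_RETRIES; i++) {
        uint32_t cur = FLAKY_STATUS;
        if (cur == prev)
            return cur;        /* two reads agree: trust it */
        prev = cur;
    }
    return prev;               /* give up, return the last value read */
}
```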
Coming from the other side of the fence, a place where we develop the hardware first and then bring it to life through software, things are very different.
The first step in bringing up a non-trivial board is to go through a full characterization phase. This is where electrical, mechanical and short software tests are performed in order to determine whether the hardware is operating according to requirements. Depending on the nature of the hardware, this period can last months and require many re-spins (iterations where something is fixed or modified).
While this is taking place, and depending on the nature of the team, the application software is probably starting to be assembled on prototype hardware. In some cases this can't happen until you have actual hardware that works reasonably close to specs. Perhaps rev-1 hardware is used to jump-start software development while the hardware team goes through many revs in order to make adjustments and fix problems.
Seemingly weird hardware problems abound. I have been in situations where the signal is good at one end of a trace or cable and not so good at the other end. In the case of high speed design this can easily happen if there are problems with the design of the transmission lines carrying the signals. You can easily end up with reflections that will wreak havoc on the signals as they go down the transmission line.
Another "weird" hardware issue in high speed design are signals that don't arrive at the destination within a specific timing window. Dynamic ram designs are one example of this. A clock is used to gate various signals at both ends of the transaction. Everything is sampled relative to this clock. If some signals, for example, control signals, arrive before, after or staggered with respect to their acceptance window you can have really weird effects.
With large FPGA designs you can have issues related to faulty design of the power distribution system. Power and signal integrity are major fields of study and truly necessary parts of modern electronics design. Traces on a board are like capacitors that need to be charged and discharged. If you have 200 traces switching from 0 to 1 simultaneously, a lot of current will be required from the power system within nanoseconds (or picoseconds). If the power distribution system on the board isn't designed to deal with such transients, you end up with all manner of weird effects. For example, transmission lines might be perfect in impedance, crosstalk and time of flight, yet signals arrive with lots of jitter and all over the place in terms of timing. The power distribution system on a board is like your heart: if it can't deal with demand, you are not going to go from sitting to standing and then running without major problems.
This is only the tip of the iceberg. I could go on for pages and probably write a book about this. I've made enough mistakes.
And so, from the vantage point of a software engineer who also happens to be a hardware engineer, blaming the hardware almost always comes first, until the hardware is proven to be operating according to requirements.
As for the PlayStation issue in the original post: well, from my perspective this is simply bad engineering on the part of those who designed the hardware. OK, this isn't the engine control computer on a Toyota, but the sentiment is the same. Fault-tolerant design is important, even for toys. Think consumer drones.
The buffer size on the line printer was out of sync with what the program (COBOL) was sending, which would produce variable results every so often when WRITE BEFORE and WRITE AFTER got used.
Then there are the ones which work fine with debug libraries and not with production ones, as it turns out the debug libraries had an unintended fix for a bug that was yet to be known. I've also had programs that only worked fine when you don't use the debug libraries.
Then we have driver bugs, which anyone who has even touched graphics will have encountered, more so when there is a shift in driver model, like the one ME introduced that was only stable enough by Windows XP; same with Vista's, which is stable in 7 and 8.
Suffice to say, the hardest bugs almost always turn out to be hardware, some external software being documented wrong, or undocumented features. Also, new revisions of hardware can cause the smallest of issues in niche uses, but it can be just you, and it is a small, lonely world when you try to fix one of those.
Though for me the hardest bugs are always the ones where you know whereabouts they are, but you are unable to prove it to those able to investigate it to the level of proving it is their issue.
Also, some bugs that are hard for some are easier for others; we have all had a bit of code or an aspect of life where we seem blinkered from seeing what is wrong. Somebody else could look at it and solve it in seconds. Then we have also been that somebody, looking at someone else's code or issue and seeing the problem and solution. Sometimes we see the problem before they can even see it as a problem. I had a classic one in the '80s, doing a mailing list update to add postal codes (same as USA ZIP codes, only UK flavour). I advised that some addresses would already have a postal code tagged onto the address lines, and in some cases we could end up with not only two post codes on the address but also not always the same one, as the one entered may have been slightly wrong. This was dismissed until 5pm Friday, just as I was prepping to head 300 miles home south for the weekend. I was late heading home that weekend. That was an annoying bug more than a hard bug, however hard it was on my weekend. But we learn how to highlight issues better, or over-highlight. I have learned much from weather warnings at least, and how they lean more towards the worst-case situation these days than they did in the '80s.
Lastly, though, we have those intermittent bugs, so rare and niche that the impact is so negligible and negatable that they are easier to work around than to try to fix. Mostly, such fixes would cost more to identify fully and address than any workaround. I suspect we have encountered more of those than we realise. With that said, that is how hacking was born, after all. Without bugs, would we have had the hackers history gave us - that in itself is worth a thought.
At some point you just have to take it on faith that lower level components work correctly. And yeah, every once in a while you'll hit a sticky issue like in this article.