Whose bug is this anyway? (codeofhonor.com)
267 points by experiment0 on Dec 18, 2012 | 42 comments


"Incidentally, this is one of the reasons that crunch time is a failed development methodology, as I’ve mentioned in past posts on this blog; developers get tired and start making stupid mistakes."

Totally. Strangely enough, Founders at Work (http://www.amazon.com/Founders-Work-Stories-Startups-Problem...) is chock full of startup founders extolling the virtues of overwork. Is this some survivorship bias or does overwork in startups really lead to shipping sooner and achieving product/market fit faster?

It's never been my experience that sustained overwork of software developers leads to actual, measurable productivity increases, due to the "two steps forward, one step back" phenomenon. Yeah, you can ship a feature "sooner," but it'll be buggy and disappointing to the end users (probably causing them to hesitate to pay - are you really achieving product/market fit with a buggy product?).

We encourage every developer to find a sustainable pace (it's different for everyone) with the guidance that it's almost always less than 50 hours a week. Why is it that software companies think that overworking software developers is a net positive?


I think it's important not to reason in absolutes. Taking an obviously multi-person, multi-year project like Starcraft and arbitrarily setting a 2-month deadline "because some code is there from Warcraft II" is not a good use of crunch time. If you and a friend have an idea and can pull together a prototype in a week, then overworking that week may not be a bad thing (as long as it's not followed by another such week).


That's a good point. I believe the research (and my experience) suggests overwork can provide benefit in the short run (2-3 weeks), but much beyond that you get diminishing and then negative marginal returns - followed by a "time off" recovery period.

Strategically used, overwork can provide benefit, but it sounds like Starcraft was a year of overwork, which is a disaster. Many of the "Founders at Work" stories glorify overwork and make it sound like it was part of the norm of the culture, so it seems like it went on much longer than a few weeks.

Which is why I'm confused - did long-running overwork cause these companies to succeed, or was it just not bad enough to cause them to fail?


If your business or project plan requires herculean overtime, then maybe it's not a very good plan. :)


Or maybe that's your moat. "The only people who'd even be capable of competing with me have to have a team of people willing to work 140 hours/week with no pay for 9 months - once I crack this I'll have no competitors!"


And then 8 months later you find out about a group that had the same plan, but got started a month before you did...


Whenever a fellow programmer or I start even considering that the OS, .NET Framework, JVM etc. might be responsible for a bug, I have to smile.

In that moment, 2 things are almost certainly true:

1. The bug is in your code.

2. You are too tired/stressed/overfocused to see it.

So when I find myself in this state, I instantly stop working and go for a walk (or go home when it's late enough).

Then, when rested, I tackle the problem again. I will find my bug, or, in the very rare case, find a way to prove that it really is the environment my code runs in.

But most importantly, I will have a sharp tool for the job.


I used to reason exactly like this, but in the last few months these supposedly rare bugs in the OS, environment, or even hardware started turning up a little too often. Although I might be biased by spending inordinate amounts of time on these bugs (like one day each). Various X11-related bugs (well, when the X server crashes, you can be pretty sure the bug is not in _your_ code), Windows driver bugs, Windows hotfixes fixing one bug and escalating the impact of another from "minor annoyance" to "does not work", an RTC chip with errata longer than its datasheet, weird firmware<->kernel interactions, generally failing hardware, and so on.


"The bug was easily fixed by upgrading the build server, but in the end we decided to leave assertions enabled even for live builds. The anticipated cost-savings in CPU utilization (or more correctly, the anticipated savings from being able to purchase fewer computers in the future) were lost due to the programming effort required to identify the bug, so we felt it better to avoid similar issues in future."

I'd like to ask: for anyone here who's ever worked on a large C++ codebase, were assertions ever actually observed to be a noticeable detriment to performance? I'm sort of naively assuming that a good branch predictor would make their impact negligible, but I ain't exactly a system programmer.


My primary experience with C++ involves game development, so my perspective is a bit skewed. The cheaper chipsets in consoles tend to have weaker branch prediction, so any branching can be a big hit in aggregate. And everything is in aggregate because all your code is running in a tight frame loop at 30 or 60 Hz.

That said, the typical large C++ codebase is probably losing a lot more performance to bad algorithms than it is to problems that generally are only measurable in micro-benchmarks. There's just something about C++ that makes a lot of people obsess over performance to a degree that doesn't even affect the mindset of most C hackers. And because your brain is so preoccupied with performance in the small, you often miss opportunities for performance in the large.

Unless, of course, you're a AAA game, in which case you're fine tuning at the individual instruction and cache line levels for your most inner loops. I'm sure there are other, similar use cases for C++, but desktop software probably isn't on that list outside of a key component or two.


> Unless, of course, you're a AAA game, in which case you're fine tuning at the individual instruction and cache line levels for your most inner loops.

I think this is the most important part:

Those micro-optimizations have exactly one place: the innermost loops. Nowhere else! At the larger scale, better algorithms and code readability provide more performance than micro-optimizations ever could.


> There's just something about C++ that makes a lot of people obsess over performance to a degree that doesn't even affect the mindset of most C hackers.

Interesting observation. My first guess is that it is a lot easier in C++ to hide a stupid bottleneck, for example passing an object to a function by value (which will call a copy constructor). So in the experience of a C++ dev, there is low-hanging fruit. On the other hand, in pure C it is a lot harder to hide this type of complexity, and therefore C optimizations tend to be a lot more subtle.
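For example, something like this (Item is a made-up type, just to illustrate the hidden copy):

  #include <string>
  #include <vector>

  struct Item {
    std::string name;
    std::vector<float> samples;
  };

  // Pass by value: every call copies name and samples (copy constructor,
  // possibly two heap allocations), even though the function only reads.
  float firstSampleByValue(Item item) {
    return item.samples.empty() ? 0.0f : item.samples[0];
  }

  // Pass by const reference: same call syntax, no copy.
  float firstSampleByRef(const Item& item) {
    return item.samples.empty() ? 0.0f : item.samples[0];
  }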


It is also often pretty evident that C++ induces people to obsess over performance bottlenecks that were relevant on '80s hardware, and while doing so introduce other (often more severe) bottlenecks relevant for modern CPUs. See for example C++ developers' affinity for inline functions and templates expanding to huge amounts of inlined code; another common belief is that there is a profound performance difference between virtual and non-virtual methods.


Do you have evidence that there is not a profound performance difference between virtual and non-virtual methods? Virtual methods require two memory lookups (the vtable address, then the function address) and hence often two cache misses, compared to non-virtual methods which can be static addresses. If you've got a (common in games) loop like:

  for (Object& object : objects) {   // some list of ~5000 objects
    object.update();                 // virtual call: load the vtable, then the function pointer
  }
Then those cache misses will add up. That is my experience anyway, though I don't have statistics to back it up.


Presumably it depends on the complexity of the logic that has to be evaluated to determine whether the assertion passes.

I've seen math libraries where, for example, the invert-matrix function finishes with an assertion that the input matrix multiplied with the output matrix is the identity matrix. That's a reasonable enough test, but it means when you enable assertions you see a major performance hit.
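A toy version of that pattern (2x2 only, not any real library, just to show where the cost comes from):

  #include <cassert>
  #include <cmath>

  struct Mat2 { float a, b, c, d; };   // row-major 2x2

  Mat2 invert(const Mat2& m) {
    float det = m.a * m.d - m.b * m.c;
    Mat2 inv = { m.d / det, -m.b / det, -m.c / det, m.a / det };
    // Sanity check: m * inv should be (approximately) the identity.
    // It costs a full matrix multiply, so with asserts enabled the
    // function does roughly twice the work.
    assert(std::fabs(m.a * inv.a + m.b * inv.c - 1.0f) < 1e-4f &&
           std::fabs(m.c * inv.b + m.d * inv.d - 1.0f) < 1e-4f);
    return inv;
  }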


I have little experience in performance-critical code. However (in non-performance-critical code), many of the assertions I have written are not simple equality tests, but rather depend on the outcome of one or two method calls, which may require a non-trivial amount of CPU.


Many instances of this type are properly unit tests, e.g. assert brute_method(foo) == optimized_method(foo) for foo in cases.
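Roughly this shape, that is (with a toy popcount pair standing in for brute_method/optimized_method):

  #include <cassert>
  #include <cstdint>

  // Reference implementation: obviously correct, not fast.
  int popcountBrute(std::uint32_t x) {
    int n = 0;
    for (; x; x >>= 1) n += x & 1;
    return n;
  }

  // Optimized version under test (Kernighan's trick).
  int popcountFast(std::uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
  }

  int main() {
    const std::uint32_t cases[] = { 0u, 1u, 0xFFu, 0xDEADBEEFu, 0xFFFFFFFFu };
    for (std::uint32_t c : cases)
      assert(popcountBrute(c) == popcountFast(c));
  }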


If the number of cases is small and can be completely enumerated, sure. But there are plenty of algorithms out there where the number of cases is effectively infinite -- the matrix inversion function someone else mentioned is a great example -- where it is prudent to routinely check results to make sure the algorithm is working. (Mind you, also having a unit test to make sure it works on a small set of carefully chosen cases is a great idea.)


Sure, but isn't it awesome if, instead of being just one unit test, you can have some critical checks running in every unit test (and when running the code as a whole)?

However, if you're going this route, it might be worth having a second assert macro that can allow you more fine-grained control of enabling/disabling fast asserts vs slow asserts.
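Something along these lines, say (the ENABLE_* flags and HandleAssertFailure are made up; substitute whatever your build system and crash-reporting hook look like):

  // Cheap asserts: on in all internal builds, maybe even in live builds.
  // HandleAssertFailure is a stand-in for your engine's report/crash hook.
  #ifdef ENABLE_ASSERTS
  #  define ASSERT(x)      do { if (!(x)) HandleAssertFailure(#x, __FILE__, __LINE__); } while (0)
  #else
  #  define ASSERT(x)      ((void)0)
  #endif

  // Expensive asserts (re-multiplying matrices, walking whole containers...):
  // only compiled in when explicitly requested.
  #ifdef ENABLE_SLOW_ASSERTS
  #  define SLOW_ASSERT(x) ASSERT(x)
  #else
  #  define SLOW_ASSERT(x) ((void)0)
  #endif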


In Guild Wars' case, it might also have been that you should not leave too much debug info in your release build, to make it harder to reverse engineer and write cheats/bots.


It depends where they are. Some games use asserts at a very low level to check things like ensuring that vectors or matrices are valid (according to some expected properties) after every calculation. This is very useful for bug fixing, but means you suddenly have asserts in the most performance-intensive of inner loops.

Now those particular asserts aren't often turned on in the default debug build, but turning them on will have a significant effect on performance.
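For concreteness, the sort of thing I mean (made-up Vec3; in practice the check would sit behind one of those opt-in "slow assert" switches rather than a plain assert):

  #include <cassert>
  #include <cmath>

  struct Vec3 { float x, y, z; };

  inline bool isFinite(const Vec3& v) {
    return std::isfinite(v.x) && std::isfinite(v.y) && std::isfinite(v.z);
  }

  inline Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    Vec3 out = { v.x / len, v.y / len, v.z / len };
    assert(isFinite(out));   // catches NaN/Inf, e.g. from a zero-length input
    return out;
  }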


I really like the part about detecting hardware failures and guiding users to a computer maintenance page.

Computer enthusiasts, of whom there are many among gamers, are very eager for this kind of information; having a game you like tell you "check or do this to get a better gaming experience" must be a wonderful and exciting discovery.


Yeah, that was brilliant.

One idea that struck me: given the classical role of the operating system, doesn't this sound like something an OS should be able to provide?

I imagine an OS service that, if requested, sits in the background and does what that game code did, in order to detect those kinds of errors. Does any OS have this? It really seems semi-obvious, now ...
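Something like this running as a low-priority service, presumably (a very rough sketch; the article only loosely describes what the game code actually checked):

  #include <chrono>
  #include <cstdint>
  #include <iostream>
  #include <thread>

  // Deterministic busywork: any run that disagrees with the baseline means
  // the CPU or memory produced a wrong result at least once.
  std::uint64_t burnCpu() {
    std::uint64_t h = 1469598103934665603ull;   // FNV-1a style mixing
    for (std::uint32_t i = 0; i < 10000000u; ++i) {
      h ^= i;
      h *= 1099511628211ull;
    }
    return h;
  }

  int main() {
    // A real service would compare against a precomputed known-good value
    // rather than a baseline taken at startup.
    const std::uint64_t expected = burnCpu();
    for (;;) {
      std::this_thread::sleep_for(std::chrono::minutes(10));
      if (burnCpu() != expected)
        std::cerr << "Hardware self-check failed: inconsistent result\n";
    }
  }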


Indeed. Having personally experienced power supply issues a few times (either due to malfunction or the mentioned problem of a super-hungry GPU) and the resulting random crashes, I would have been greatly helped by this kind of detection in the OS.

It seems that in the PC world there is very little functionality in place to detect, isolate, and nail down hardware issues. Or if it exists somewhere deep in the firmware, there is at least no standardized way to access it.

On the positive side, I was recently very surprised when Linux started to give errors about a certain CPU core after programs started hanging. Somehow one of my cores had failed without crashing the OS(!). After disabling that core in the BIOS with the next reboot the issues went away.

So there is some level of hardware problem detection and robustness in modern OSes, but maybe not enough.


While I also agree that it's a great idea, the challenge is that it's a complex system of hardware components. An error in one component won't necessarily show up directly associated with the component generating the error.

A simple mem test like the article discussed is a nice test, but what if the comparison tables are corrupted? Then it would falsely claim the memory was bad, when it might be the network interface, hard drive, or bus!
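For reference, the kind of simple test I mean (expected values generated on the fly rather than stored, though it still can't tell you which component is actually at fault):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Write a generated pattern through a buffer, then read it back.
  bool memPatternTest(std::size_t bytes) {
    std::vector<std::uint8_t> buf(bytes);
    for (std::size_t i = 0; i < bytes; ++i)
      buf[i] = static_cast<std::uint8_t>((i * 31u) ^ 0xA5u);
    for (std::size_t i = 0; i < bytes; ++i)
      if (buf[i] != static_cast<std::uint8_t>((i * 31u) ^ 0xA5u))
        return false;   // mismatch: could be RAM, cache, bus...
    return true;
  }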


The author mentions it in passing, and I do wish Starcraft/Broodwar and Warcraft III were open-sourced. Blizzard has understandably chosen to stop updating both games for newer Mac architectures (there is no build for Intel Macs for either, and thus no support for OSX 10.7+). The community would be happy to port the games to new platforms, increasing the popularity of the lucrative franchises.


I believe you meant the first Starcraft. I'd also love for these games to be open-sourced, along with the first Diablo, though I'm pretty sure WC3 and SC are still making money for them.

Also, here's a blog post I wrote on how to get these games working on OSX 10.7+ using Windows builds and Wineskin (shameless plug) http://marzzbar.wordpress.com/2012/11/06/how-to-play-classic...


Yep! I meant SC/Broodwar. Also, throw in Diablo II to that list.

And great blog post! Thanks for that. I tried using wineskin for Starcraft before but got a little lost. I'll give your walkthrough a try. Sometimes I feel like playing a classic campaign to take a break from laddering on SC2.


See also: "select" isn't broken:

http://pragmatictips.com/26

http://www.codinghorror.com/blog/2008/03/the-first-rule-of-p...

There's nothing wrong with re-discovering things that other people had written about decades before. But discovering the wisdom of the ancients is also helpful.


I think the build system not matching developer systems is a not-uncommon source of production issues. It always starts out matching, but no one ever updates the build machine (mostly out of fear of breaking something, I think).

A good strategy I have used is that the first job of a new hire is to "make a build machine" on his or her own machine. Just having new eyes go through the steps on a semi-regular basis catches a lot of stuff.


My stackoverflow profile is a sad list of questions like "what's wrong with this library?" which have an answer 2 days later, often by me, saying "it was in this bit of the code."

On several occasions I've actually left the bugged lines out of my initial submission, because I thought they weren't important...


This is such an awesome post. When I think about the history of bug fixing it seems like we haven't gone very far in the last couple of years. Measuring how long a bug takes to fix is still voodoo.


99 times out of 100, the bug is yours. Sometimes you really do run into oddball stuff though. The most memorable for me was a bug that only surfaced after you printed something - the printer driver modified the floating point rounding mode and didn't restore it, resulting in some subtle failures of floating point calculations that followed.
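For anyone who hits something similar: the rounding mode lives in the floating-point environment and can be checked and restored around the suspect call, roughly like this:

  #include <cfenv>
  #include <cstdio>

  // Run a third-party routine (e.g. kicking off a print job) without letting
  // it leak a changed rounding mode into the rest of the program.
  void callSuspectCode(void (*fn)()) {
    const int saved = std::fegetround();   // typically FE_TONEAREST
    fn();
    if (std::fegetround() != saved) {
      std::fprintf(stderr, "callee changed the rounding mode; restoring\n");
      std::fesetround(saved);
    }
  }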


Interesting that both Guild Wars and Redis have reached similar conclusions about hardware failure.


Maybe I'm just slow, but what was the bug in that code?


He explains it right in the text.

  if (someBool)
    return A;

  ...

  if (!someBool)
    return B;
Now - both cases are covered. The result will be A or B, never anything else. Statements after the second return are unreachable.


Ah .. He expected it to reach the unreachable code? I thought that it did reach `return R`, which would be totally strange and could indeed be explained by a compiler bug. (Or more likely, by side effects in the code between the two checks)


Almost. The result could be something else if it comes from the code in the middle. It just won't ever get to whatever's after that return B.


I just love his articles: reminds me of the good old days playing these great games.

Just a note, he writes:

   "Overheating: Computers don’t much like to be hot and malfunction more frequently in those conditions..."
One of the reasons there are far fewer spurious crashes of desktop / game apps than back in those days is that most CPUs now have a built-in temperature sensor that automatically reduces the clock speed in the case of any kind of CPU overheating.

So the piece of code they wrote performing computations and comparing to known-good results, which could find as much as 1% of broken systems (!!!), would probably not find anywhere near that number nowadays: the CPU simply slows down if it overheats.

Not sure what happens when GPUs overheat, whether they too now have built-in protection, or whether they were trying to detect faulty GPUs as well.

Another common "symptom" was an aunt, uncle, friend of parents, calling and explaining that : "my computer works fine for a while then it doesn't work anymore"... And you'd go check their system, open it, and find a CPU full of dust (which TFA mentions).

Nowadays I don't get these calls from those people anymore: they're still using computers, but once their fans get clogged the system simply works at a slower speed and doesn't overheat anymore.


Interesting. I recall how we had to decrease the clock speed of the processors on a high-performance machine in order to reduce errors - these were detected by simple reproducibility tests. With code optimised to sustain a huge percentage of peak speed and a thousand or so processors, such problems can become manifest, even in the chilly environment in which the machine was housed.


GPUs have throttling as well. They happily burst up to the maximum clock speed, but in benchmarks that stress the GPU at max capacity for prolonged periods, one will often see the clock rate drop by as much as half to prevent overheating.

... at least, that's what my Nvidia GTX580 does!


Simply had to add my name to this praise; this article in particular hits home so well it should be required reading for many.



