

Whose bug is this anyway? - experiment0
http://www.codeofhonor.com/blog/whose-bug-is-this-anyway

======
MattRogish
"Incidentally, this is one of the reasons that crunch time is a failed
development methodology, as I’ve mentioned in past posts on this blog;
developers get tired and start making stupid mistakes."

Totally. Strangely enough, Founders at Work (<http://www.amazon.com/Founders-Work-Stories-Startups-Problem-Solution/dp/1430210788>) is chock full of startup founders extolling the virtues of overwork. Is this some survivorship bias, or does overwork in startups really lead to shipping sooner and achieving product/market fit faster?

It's never been my experience that sustained overwork of software developers leads to actual, measurable productivity increases, due to the "two steps forward, one step back" phenomenon. Yeah, you can ship a feature "sooner", but it'll be buggy and disappointing to the end users, probably causing them to hesitate to pay. Are you really achieving product/market fit with a buggy product?

We encourage every developer to find a sustainable pace (it's different for
everyone) with the guidance that it's almost always less than 50 hours a week.
Why is it that software companies think that overworking software developers
is a net positive?

~~~
eliben
I think it's important not to reason in absolutes. Taking an obviously multi-person, multi-year project like Starcraft and arbitrarily setting a two-month deadline "because some code is there from Warcraft II" is not a good use of crunch time. If you and a friend have an idea and can pull together a prototype in a week, then overworking that week may not be a bad thing (as long as it's not followed by another such week).

~~~
MattRogish
That's a good point. I believe the research (and my experience) suggests overwork can provide benefit in the short run (2-3 weeks), but much beyond that you get diminishing and then negative marginal returns, followed by a "time off" recovery period.

Used strategically, overwork can provide benefit, but it sounds like Starcraft was a year-long overwork, which is a disaster. Many of the "Founders at Work" stories glorify overwork and make it sound like part of the norm of the culture, so it seems like it went on much longer than a few weeks.

Which is why I'm confused: either long-running overwork caused these companies to succeed, or it simply wasn't bad enough to cause them to fail?

------
mduerksen
Whenever a fellow programmer or I start even considering that the OS, .NET Framework, JVM, etc. might be responsible for a bug, I have to smile.

In that moment, two things are almost certainly true:

1. The bug is in _your_ code.

2. You are too tired/stressed/overfocused to see it.

So when I find myself in this state, I instantly stop working and go for a walk (or go home when it's late enough).

Then, when rested, I tackle the problem again. I will find my bug or, in the very rare case, find a way to _prove_ that it really is the environment my code runs in.

But most importantly, I will have a sharp tool for the job.

~~~
dfox
I used to reason exactly like this, but in the last few months those rare cases of bugs in the OS, environment, or even hardware started turning up a little too often. Although I might be biased by spending inordinate amounts of time on these bugs (like one day each): various X11-related bugs (well, when the X server crashes, you can be pretty sure it is not a bug in _your_ code), Windows driver bugs, Windows hotfixes fixing one bug and escalating the impact of another from "minor annoyance" to "does not work", an RTC chip with errata longer than its datasheet, weird firmware<->kernel interactions, generally failing hardware, and so on.

------
kibwen
_"The bug was easily fixed by upgrading the build server, but in the end we
decided to leave assertions enabled even for live builds. The anticipated
cost-savings in CPU utilization (or more correctly, the anticipated savings
from being able to purchase fewer computers in the future) were lost due to
the programming effort required to identify the bug, so we felt it better to
avoid similar issues in future."_

I'd like to ask: for anyone here who's ever worked on a large C++ codebase,
were assertions ever actually observed to be a noticeable detriment to
performance? I'm sort of naively assuming that a good branch predictor would
make their impact negligible, but I ain't exactly a system programmer.
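
For reference, a minimal sketch of what "leaving assertions enabled for live builds" can look like in C++: a custom macro that, unlike the standard `assert`, is not compiled out when `NDEBUG` is defined (the macro name and error handling here are invented for illustration):

    
    
      #include <cstdio>
      #include <cstdlib>
      
      // Unlike assert(), this is NOT disabled by NDEBUG,
      // so it still fires in release/live builds.
      #define ALWAYS_ASSERT(cond)                                        \
          do {                                                           \
              if (!(cond)) {                                             \
                  std::fprintf(stderr, "assertion failed: %s (%s:%d)\n", \
                               #cond, __FILE__, __LINE__);               \
                  std::abort();                                          \
              }                                                          \
          } while (0)
    

Since the branch almost never fires, it is trivially predictable, which is part of why the runtime cost is often little more than the compare itself plus some code size.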

~~~
snprbob86
My primary experience with C++ involves game development, so my perspective is
a bit skewed. The cheaper chipsets in consoles tend to have weaker branch
prediction, so any branching can be a big hit in aggregate. And _everything_
is in aggregate because all your code is running in a tight frame loop at 30
or 60 Hz.

That said, the typical large C++ codebase is probably losing a lot more
performance to bad algorithms than it is to problems that generally are only
measurable in micro-benchmarks. There's just something about C++ that makes a
lot of people obsess over performance to a degree that doesn't even affect the
mindset of most C hackers. And because your brain is so preoccupied with
performance in the small, you often miss opportunities for performance in the
large.

Unless, of course, you're working on a AAA game, in which case you're fine-tuning your innermost loops at the level of individual instructions and cache lines. I'm sure there are other, similar use cases for C++, but desktop software probably isn't on that list outside of a key component or two.

~~~
yk
> There's just something about C++ that makes a lot of people obsess over
> performance to a degree that doesn't even affect the mindset of most C
> hackers.

Interesting observation. My first guess is that it is a lot easier in C++ to hide a stupid bottleneck, for example by passing an object to a function by value (which will call the copy constructor). So in the experience of a C++ dev, there are low-hanging fruits. In pure C, on the other hand, it is a lot harder to hide this type of complexity, and therefore C optimizations tend to be a lot more subtle.
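
A tiny sketch of the kind of hidden copy being described (the types and names are invented for illustration):

    
    
      #include <cstddef>
      #include <string>
      #include <vector>
      
      struct Record {
          std::vector<std::string> fields;  // expensive to copy
      };
      
      // Pass by value: every call silently runs Record's copy constructor.
      std::size_t countByValue(Record r) { return r.fields.size(); }
      
      // Pass by const reference: identical call syntax, no copy.
      std::size_t countByRef(const Record& r) { return r.fields.size(); }
    

At the call site both versions look the same, which is exactly what makes this bottleneck easy to hide.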

~~~
dfox
It is also often pretty evident that C++ induces people to obsess over performance bottlenecks that were relevant on 80's hardware, and in doing so to introduce other (often more severe) bottlenecks relevant to modern CPUs. See, for example, C++ developers' affinity for inline functions and templates that expand into huge amounts of inlined code; another common belief is that there is a profound performance difference between virtual and non-virtual methods.

~~~
NickPollard
Do you have evidence that there is _not_ a profound performance difference between virtual and non-virtual methods? Virtual methods require two memory lookups (the vtable address, then the function address) and hence often two cache misses, compared to non-virtual methods, which can be called at static addresses. If you've got a loop like this (common in games):

    
    
      for (GameObject& object : someList) {  // ..some list of 5000 objects..
          object.update();  // virtual call: vtable load, then function load
      }
    

Then those cache misses will add up. That is my experience anyway, though I
don't have statistics to back it up.
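
For what it's worth, the usual mitigation in game code is to keep objects of the same concrete type in contiguous arrays so the call can be non-virtual; a minimal sketch (invented types, not anyone's actual engine code):

    
    
      #include <vector>
      
      struct Particle {
          float x = 0, vx = 0;
          void update() { x += vx; }  // non-virtual: direct call, easily inlined
      };
      
      void updateAll(std::vector<Particle>& particles) {
          for (Particle& p : particles)  // contiguous storage, cache-friendly
              p.update();
      }
    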

------
valisystem
I really like the part about detecting hardware failure and guiding users to a computer maintenance page.

Computer enthusiasts, of whom there are many among gamers, are very eager for this kind of information; having a game you like tell you "check this or do that to have a better gaming experience" must be a wonderful and exciting experience.

~~~
unwind
Yeah, that was brilliant.

One idea that struck me: given the classical role of the operating system,
doesn't this sound like something an OS should be able to provide?

I imagine an OS service that, if requested, sits in the background and does
what that game code did, in order to detect those kinds of errors. Does any OS
have this? It really seems semi-obvious, now ...
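
A minimal sketch of the idea, in the spirit of what the article describes (run a known computation, compare against a known-good result); all the names here are invented:

    
    
      #include <chrono>
      #include <cstdint>
      #include <thread>
      
      void reportHardwareProblem();  // hypothetical: warn the user, log, etc.
      
      // Deterministic workload: a mismatch with the known-good result
      // points at failing CPU/RAM rather than a software bug.
      std::uint64_t workload() {
          std::uint64_t h = 1469598103934665603ULL;  // FNV-1a offset basis
          for (std::uint64_t i = 0; i < 1000000; ++i)
              h = (h ^ i) * 1099511628211ULL;        // FNV-1a prime
          return h;
      }
      
      void hardwareSanityLoop(std::uint64_t knownGood) {
          for (;;) {
              if (workload() != knownGood)
                  reportHardwareProblem();
              std::this_thread::sleep_for(std::chrono::minutes(10));
          }
      }
    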

~~~
wladimir
Indeed. Having personally experienced power supply issues a few times (either
due to malfunction or the mentioned problem of super-hungry GPU) and the
resulting random crashes, I would have been greatly helped by this kind of
detection in the OS.

It seems that in the PC world there is very little functionality in place to detect, isolate, and nail down hardware issues. Or if it exists somewhere deep in the firmware, there is at least no standardized way to access it.

On the positive side, I was recently very surprised when Linux started to give
errors about a certain CPU core after programs started hanging. Somehow one of
my cores had failed without crashing the OS(!). After disabling that core in
the BIOS with the next reboot the issues went away.

So there is some level of hardware problem detection and robustness in modern
OSes, but maybe not enough.

------
jacoblyles
The author mentions it in passing, and I do wish Starcraft/Broodwar and
Warcraft III were open-sourced. Blizzard has understandably chosen to stop
updating both games for newer Mac architectures (there is no build for Intel
Macs for either, and thus no support for OSX 10.7+). The community would be
happy to port the games to new platforms, increasing the popularity of the
lucrative franchises.

~~~
barbs
I believe you meant the first Starcraft. I'd also love for these games to be
open-sourced, along with the first Diablo, though I'm pretty sure WC3 and SC
are still making money for them.

Also, here's a blog post I wrote on how to get these games working on OSX 10.7+ using Windows builds and Wineskin (shameless plug): <http://marzzbar.wordpress.com/2012/11/06/how-to-play-classic-blizzard-games-on-mac-os-x-mountain-lion/>

~~~
jacoblyles
Yep! I meant SC/Broodwar. Also, add Diablo II to that list.

And great blog post! Thanks for that. I tried using Wineskin for Starcraft before but got a little lost. I'll give your walkthrough a try. Sometimes I feel like playing a classic campaign to take a break from laddering on SC2.

------
SideburnsOfDoom
See also: "select" isn't broken:

<http://pragmatictips.com/26>

<http://www.codinghorror.com/blog/2008/03/the-first-rule-of-programming-its-always-your-fault.html>

There's nothing wrong with re-discovering things that other people had written
about decades before. But discovering the wisdom of the ancients is also
helpful.

------
shawn-butler
I think a build system that doesn't match developer systems is a not-uncommon source of production issues. It always starts out matching, but no one ever updates the build machine (mostly out of fear of breaking something, I think).

A good strategy I have used is to make a new hire's first job to "make a build machine" on his or her own machine. Just having new eyes go through the steps on a semi-regular basis catches a lot of stuff.

------
tehwalrus
My stackoverflow profile is a sad list of questions like "what's wrong with
this library?" which have an answer 2 days later, often by me, saying "it was
in _this_ bit of the code."

On several occasions I've actually left the bugged lines out of my initial
submission, because I thought they weren't important...

------
jtchang
This is such an awesome post. When I think about the history of bug fixing it
seems like we haven't gone very far in the last couple of years. Measuring how
long a bug takes to fix is still voodoo.

------
mark-r
99 times out of 100, the bug is yours. Sometimes you really do run into
oddball stuff though. The most memorable for me was a bug that only surfaced
after you printed something - the printer driver modified the floating point
rounding mode and didn't restore it, resulting in some subtle failures of
floating point calculations that followed.
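
For anyone curious, C99/C++11 expose the rounding mode through <cfenv>; a sketch of the save/restore discipline the driver apparently skipped:

    
    
      #include <cfenv>
      
      void withRoundingMode(int mode) {
          const int saved = std::fegetround();  // remember the caller's mode
          std::fesetround(mode);                // e.g. FE_DOWNWARD
          // ... work that needs the special rounding mode ...
          std::fesetround(saved);               // restore, so later FP math
                                                // isn't subtly wrong
      }
    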

------
lucian1900
Interesting that both Guild Wars and Redis have reached similar conclusions
about hardware failure.

------
troels
Maybe I'm just slow, but what _was_ the bug in that code?

~~~
darklajid
He explains it right in the text.

    
    
      if (someBool)
        return A;

      ...  // code in between that does not change someBool

      if (!someBool)
        return B;

      return R;  // unreachable: one of the two returns above always fires
    

Now - both cases are covered. The result will be A or B, never anything else.
Statements after the second return are unreachable.

~~~
troels
Ah .. He _expected_ it to reach the unreachable code? I thought that it _did_
reach `return R`, which would be totally strange and could indeed be explained
by a compiler bug. (Or more likely, by side effects in the code between the
two checks)

------
martinced
I just _love_ his articles: they remind me of the good old days playing these great games.

Just a note, he writes:

    
    
       "Overheating: Computers don’t much like to be hot and malfunction more frequently in those conditions..."
    

One of the reasons there are far fewer spurious crashes of desktop/game apps than back in those days is that most CPUs now have a built-in temperature sensor that automatically reduces the clock speed if the CPU overheats.

So the piece of code they wrote that performs computations and compares them against known-good results, which could flag as much as 1% of systems as broken (!!!), would probably not find anywhere near that number nowadays: the CPU simply slows down if it overheats.
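
On Linux you can watch that sensor yourself through sysfs; a quick sketch (the zone number and path vary by machine):

    
    
      #include <fstream>
      #include <iostream>
      
      int main() {
          // Common sysfs location; zone numbering differs across machines.
          std::ifstream temp("/sys/class/thermal/thermal_zone0/temp");
          long milliCelsius = 0;
          if (temp >> milliCelsius)
              std::cout << "CPU temp: " << milliCelsius / 1000.0 << " C\n";
      }
    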

Not sure what happens when GPUs overheat, whether they too now have built-in protection, or whether the detection code was also trying to find faulty GPUs.

Another common "symptom" was an aunt, uncle, or friend of your parents calling to explain that "my computer works fine for a while, then it doesn't work anymore"... You'd go check their system, open it up, and find the CPU heatsink full of dust (which TFA mentions).

Nowadays I don't get these calls from those people anymore: they're still using computers, but once their fans get clogged, the system simply runs at a slower speed and doesn't overheat anymore.

~~~
zbyszek
Interesting. I recall how we had to decrease the clock speed of the processors
on a high-performance machine in order to reduce errors - these were detected
by simple reproducibility tests. With code optimised to sustain a huge
percentage of peak speed and a thousand or so processors, such problems can
become manifest, even in the chilly environment in which the machine was
housed.

