
Sometimes the bug isn't in your code, it's in the CPU - there
http://leaf.dragonflybsd.org/mailarchive/kernel/2012-03/msg00000.html
======
jaylevitt
As someone who found four compiler bugs in three weeks - in a five-nines
fault-tolerant OS, yet! - and who found a PostgreSQL optimizer bug within
weeks of learning SQL, I think the key to being "that guy" is playing five-
whys with every single bug you encounter.

I work with some very talented developers who, when they try something and it
doesn't work, try something else. I am fundamentally incapable of that. If it
doesn't work, I MUST KNOW WHY. Even if that requires building a debug version
of my entire stack, adding all sorts of traces, and wolf-fence debugging until
I have a minimal fail case.

It's a real limitation; if I hit an undebuggable brick wall, I have no ability
to attack the problem from a different angle. Luckily, there are few things
that are fundamentally undebuggable.

~~~
jacques_chester
I _had_ to google "wolf-fence debugging".

I found I knew of it under a different name: binary search debugging. Git
includes built-in support under the bisect command.

~~~
mbateman
I also looked this up, and found the original paper that coined the term,
which starts:

> The "Wolf Fence" method of debugging time-sharing programs in higher
> languages evolved from the "Lions in South Africa" method that I have taught
> since the vacuum-tube machine language days. It is a quickly converging
> iteration that serves to catch run-time errors.

<http://dl.acm.org/citation.cfm?id=358695> (if you have access)

Anyone know what the "Lions in South Africa" method is? I couldn't find it via
Google, it just kept turning up references to the same paper.

~~~
prolepunk
After a quick googling I found this explanation of wolf-fence debugging:

[http://coreygoldberg.blogspot.com/2008/12/wolf-fence-debuggi...](http://coreygoldberg.blogspot.com/2008/12/wolf-fence-debugging.html)

It stipulates that the state of Alaska has exactly one wolf, so you build a
fence across the middle of the state to find which side the wolf howls on,
then subdivide the problem, etc...

I'm assuming in your case the wolf got replaced with a lion and Alaska with
South Africa.
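
For anyone who wants the fence-building spelled out, here is a minimal sketch
of the same idea as a binary search over an ordered history. The names
`first_bad`, `is_bad`, and the `r0`..`r99` revision labels are made up for
illustration, and it assumes the first item is known good, the last is known
bad, and everything after the first bad item is also bad:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad item.

    Assumes items[0] is good, items[-1] is bad, and that once an item
    is bad every later item is bad too.
    """
    lo, hi = 0, len(items) - 1  # invariant: items[lo] is good, items[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2  # build the fence across the middle
        if is_bad(items[mid]):
            hi = mid  # the wolf howls on the earlier side (or at the fence)
        else:
            lo = mid  # the wolf howls on the later side
    return hi


# Hypothetical history of 100 revisions where the bug appeared in r37.
revisions = [f"r{i}" for i in range(100)]
culprit = first_bad(revisions, lambda rev: int(rev[1:]) >= 37)
print(revisions[culprit])
```

Each probe halves the remaining range, so the number of test runs grows with
the logarithm of the history length rather than with the length itself.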

~~~
slavak
Personally I always preferred the mathematician's method for catching a lion
in the Sahara...

~~~
gaius
First, place a lion in Egypt so you know your algorithm will terminate...

------
jdfreefly
First off, I would say that is some pretty awesome work by this guy to chase
this down. Including his work with the manufacturer to help them reliably
recreate the issue.

Second, I would say that over the course of my 10 year career in managing
developers, I've heard many, many times that the bug was in the kernel, or in
the hardware, or in the compiler, or in the other lower level thing the
developer had no control over. This has been the correct diagnosis exactly
once. If I had to guess, I would say about 5%.

~~~
VBprogrammer
I would go as far as to say 'finding' bugs in the Compiler / Optimizer / OS /
Hardware is a warning signal of a poor programmer.

Always expect you are doing it wrong. It will so rarely be the case that this
expectation is wrong that you can discount it as insignificant.

~~~
JOnAgain
Disagree. In fact, were I to come up with a rule of thumb, I'd say the
opposite is true.

Want to find bugs in Sun's Java 6 compiler for X64 Linux, use annotations
(yeah, I found one in their V30 release last week). Want to find bugs in MS'
C++ compiler, write your own templates (this was a few years ago, maybe it's
better?). The best programmers push the limit of their tools because they know
what's "supposed to happen".

Poor programmers hit something that doesn't work, and just try something else,
cause, well they're just trying shit. I would go so far to say that poor
programmers, in fact, are unable to find compiler, optimizer, OS, or hardware
bugs because, by definition, they probably don't have a firm handle on what's
"supposed to happen".

~~~
sparsevector
I think what VBprogrammer meant is that _thinking_ you've found a bug in a
compiler / OS / CPU is often a warning sign you're a poor programmer. Often
times a beginner will have a bug in their code that is too subtle for them to
identify, so they end up attributing it to some external factor. _Actually_
finding a bug in a compiler / OS / CPU is as you suggest likely a sign you're
doing something advanced or unusual and therefore are perhaps more
knowledgeable than most.

~~~
VBprogrammer
Yeah, that's exactly what I meant. Sorry if the sarcasm didn't quite carry.

I know these things can and do happen. I've come across one or two of these
strange ones before, but too often I've seen people jump to the conclusion
that someone / something else was to blame, with no real evidence other
than that they have exhausted their shallow bank of talent.

------
16s
Please stop referring to him as "this guy", he's well known in the BSD and
Linux worlds. He had commit access to FreeBSD before many things we take for
granted today even existed. His name is Matt Dillon and he's one hell of a
hardware/OS hacker.
[http://en.wikipedia.org/wiki/Matt_Dillon_%28computer_scienti...](http://en.wikipedia.org/wiki/Matt_Dillon_%28computer_scientist%29)

~~~
etrain
I'm an offender of the 'this guy' thing. I have heard of Matt before, but come
on, there is almost no context as to who he is given in the posting, and do
you really expect everyone in this community to know every semi-significant
kernel hacker of the last 2 decades?

The link is nice for everyone's education, but I, for one, would appreciate a
little less condescension.

~~~
ghshephard
Referring to Matt Dillon as "This guy" is akin to referring to Linus Torvalds,
Theo De Raadt, Jony Ives, Zed Shaw, John Gruber, etc. as "This Guy" -
particularly in this community - everyone should know who Matt Dillon is.

And, yes, I would expect everyone in this community to recognize who these
people are, and roughly what their contributions have been.

~~~
netdog
_Matt Dillon, Linus Torvalds, Theo De Raadt, Jony Ives, Zed Shaw, John Gruber_

One of these guys is not like the others.

~~~
philwelch
_All_ of those guys are not like the others!

~~~
phillmv
Only two of those people are even on the same order of magnitude of influence.

~~~
masklinn
I'd say three, although the work of the third one (which I'd assert is Theo)
is more obscure it's pretty darn critical and influential[0]. Unless the two
you were thinking of are Linus and Theo?

[0] <http://en.wikipedia.org/wiki/OpenSSH>

~~~
phillmv
Oh true that, I'd forgotten all about how SSH wasn't originally Free.

------
etrain
My hat's off to this guy for the work he did, and indeed, finding a CPU bug is
quite the accomplishment.

That said - what is it about the hardware manufacturers that makes them
relatively immune to this sort of thing? Is it formal verification and rigid
engineering process? Is it that they spend so much money developing these
things that they better do them right, god dammit?

Sometimes I think that the whole industry would be much better off if everyone
up the stack was held to these kinds of standards. If that were the case
though, where would we be? We'd have rock solid systems, but how sophisticated
would they be? Would UNIX exist? What about (a more bulletproof and less
feature complete) Java?

~~~
sliverstorm
_Is it formal verification and rigid engineering process? Is it that they
spend so much money developing these things that they better do them right,
god dammit?_

All of the above- with the minor correction that it's not about money spent
developing per se. Producing silicon masks is obscenely expensive, so catching
a bug before tape-out vs. after tape-out can be a difference of hundreds of
thousands of dollars. So, think of it as "you better get it right the first
time, god dammit"

~~~
riffraff
Is there some high level description of the design/production process used by
large chip producers? I'd be very interested in reading about it.

(nitpick: "per se" <http://en.wiktionary.org/wiki/per_se>)

~~~
sdbbp
There's little open development, so there's little incentive to write up
public articles. You could try something like Bob Colwell's "The Pentium
Chronicles".

------
bebop
Great job tracking down a hardware bug! That must be really exciting, and you
get your name in the AMD errata I assume?

One of my comp sci professors found a bug in an Intel chip and got his name in
the errata. I think that gives you +100 to nerd credibility :)

~~~
methoddk
+100 nerd cred indeed. That's something that trumps any normal bug.

~~~
rhizome
He already had it. Matt Dillon is the business, and that's not a fake name of
his. He quit FreeBSD-core because they chafed at his awesomeness, from which
he went to start a FreeBSD fork, Dragonfly BSD, one of whose goals is process
and state portability across CPUs and machines. He was also the technical stud
behind Best Internet, one of the earliest and largest and highest-performance
ISPs of the Mom&Pop era of the Internet (~93-97).

------
gue5t
Here are some more details about this particular bug:
[http://leaf.dragonflybsd.org/mailarchive/commits/2011-12/msg...](http://leaf.dragonflybsd.org/mailarchive/commits/2011-12/msg00259.html)

~~~
there
Such fun work to be doing on Christmas day...

------
augustl
In order to reliably reproduce the bug, he wrote his own operating system. A
small one, but still, an operating system. That's pretty badass..

------
bgrainger
If you're interested in the types of bugs that are present in modern CPUs, AMD
makes their errata documentation publicly available. (As far as I know,
Intel's errata are not public. Edit: See tedunangst's comment below for a
correction.)

The errata documentation for AMD Family 10h Processors (Athlon, Opteron,
Phenom, etc.) is here:
[http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_G...](http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf)

The errata for AMD Family 12h Processors (A-Series APU, etc.):
[http://support.amd.com/us/Processor_TechDocs/44739_12h_Rev_G...](http://support.amd.com/us/Processor_TechDocs/44739_12h_Rev_Gd.pdf)

I found this out when an AMD engineer confirmed an AMD CPU bug for me:
[http://stackoverflow.com/questions/7004728/is-this-should-no...](http://stackoverflow.com/questions/7004728/is-this-should-not-happen-crash-an-amd-fusion-cpu-bug)

~~~
tedunangst
Intel includes errata in updated spec sheets for each CPU.

~~~
yuhong
"Specification Updates", to be more precise.

------
throwawayderp
Nice catch.

It would be interesting if he has accidentally triggered a backdoor, such as
mentioned in this post.

[http://theinvisiblethings.blogspot.com.au/2009/03/trusting-h...](http://theinvisiblethings.blogspot.com.au/2009/03/trusting-hardware.html)

------
ot
Original thread with all the analysis performed before the bug was attributed
to the CPU:

<http://thread.gmane.org/gmane.os.dragonfly-bsd.kernel/14471>

(Check out in particular the section "EFFORTS AT FINDING A KERNEL BUG THAT
WASN'T A KERNEL BUG")

------
sjwright
When a CPU bug is discovered, what options are available for remedying the
situation?

~~~
forgotusername
Patch around it in microcode (applied by the OS or BIOS on every boot;
releasable as an OS update), disable the related CPU feature if possible
(twiddling bits during OS initialization), or trap any related exception the
CPU throws, detect the bug's condition, and patch up the running task's state
from the exception handler (again, another OS update).

If all else fails, issue a product recall or downplay the bug's severity.

~~~
troymc
or recall all CPUs and replace them with new repaired ones. That's what they
do in Wonderland. :D

On a more serious note, I wonder what auto manufacturers do. (There are many
CPUs in modern automobiles, and auto manufacturers are often compelled to do
recalls.)

~~~
marshray
Traditional embedded devices (i.e. not smartphones) tend to use very mature
and well understood CPUs. More so for things holding life and property like
cars. If a bug does occur and cause a crash, normally a watchdog timer will
reset the CPU quickly enough to avoid unrecoverable problems.

They're certainly not pushing the bleeding edge at all like the 3 GHz
desktop/laptop processors.
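
To make the watchdog idea concrete, here is a toy simulation, not real
firmware. The `Watchdog` class, the tick-based timing, and the `run` helper
are all assumptions for illustration; on real hardware the countdown is a
peripheral and the "reset" is an actual CPU reset:

```python
class Watchdog:
    """Countdown timer that forces a reset unless it is kicked in time."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self) -> None:
        """Healthy code calls this regularly to restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True if the watchdog fired a reset."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = self.timeout_ticks  # countdown restarts after reset
            return True
        return False


def run(total_ticks: int, hang_at: int, timeout_ticks: int = 3) -> list[int]:
    """Simulate a main loop that hangs at tick hang_at; return reset ticks."""
    dog = Watchdog(timeout_ticks)
    hung = False
    resets = []
    for t in range(total_ticks):
        if t == hang_at:
            hung = True  # simulated crash: the loop stops kicking the dog
        if not hung:
            dog.kick()
        if dog.tick():
            hung = False  # the reset brings the system back to a known state
            resets.append(t)
    return resets


print(run(total_ticks=10, hang_at=4))
```

The point is only that recovery time is bounded by the watchdog timeout, no
matter what caused the hang.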

------
daenz
Amazing. I'm happy his sanity survived!

------
dhruvbird
wow! this is quite a rare thing...

------
comice
Next time my code isn't working as expected, I'm going to shout "cpu bug!" and
cite this article.

