

Bug in the Pentium FPU (1994) - shubhamjain
http://www.trnicely.net/pentbug/bugmail1.html

======
wscott
Ah yes, Dr Nicely caused quite a bit of excitement at Intel. I was on the p6
architecture team at the time. (p6 == Pentium Pro) Our FPU was formally
verified and didn't have the same bug.

To be nice to Dr Nicely we sent him a pre-release p6 development system to
test with his program to demonstrate that his bug was fixed. He was working on
a prime number sieve program at the same time and came back reporting that the
p6 ran at 1/2 the speed of a Pentium for his code. Wow, another black
eye/firestorm caused by Dr. Nicely. He had too big an audience for us to risk
him telling the world that this new processor was slower.

So I got to spend a lot of time learning how the sieve works and what was
happening. For the most part it allocates a huge array in memory with each
byte representing a number. You walk the array with a stride of known primes
setting bytes and whatever is left must be prime. i.e. every 3 is not prime,
every 5 is not prime, every 7 ....

So in the steady state you are writing a single byte to a cache line without
reading anything. And every write hits a different cache line.

Now p6 had a write allocate cache, but the Pentium would only allocate on
read, so on the Pentium a write that misses the cache would become a write to
memory. On the p6 that write would need to load the cache line from memory
into the cache and then the line in the cache was modified. And since every
line in the cache was also modified we had to flush some other cache line
first to make room. So every 1 byte write would become a 32-byte write to
memory followed by a 32-byte read from memory.

Normally write allocate is a good thing, but in this case it was a killer. We
were stumped.

Then the magic observation: 99% of these writes were marking a space that was
already marked. When you get up to walking by large strides most of those were
already covered by one of the smaller factors.

So if you changed the code from:

    
    
         array[N] = 1
    

to:

    
    
         if (!array[N]) array[N] = 1
    
    

Now suddenly we are doing a read first, and after that read we skip the write
so the data in the cache doesn't become modified and can be discarded in the
future.

Also the p6 was a superscalar machine that ran multiple iterations of this
loop in parallel and could have multiple reads going to memory at the same
time. With that small tweak the program got 4X faster and we went from being
1/2X the speed of a Pentium to being twice the speed. And this was at the same
clock frequency! The test hardware ran at 100 MHz, we released at 200 MHz and
went up from there.
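
A minimal Python sketch of the sieve described above, with the read-before-write guard. It is not Dr. Nicely's program and Python will not show the cache effect; it only counts how many marking attempts the guard turns into plain reads. The name `sieve_guarded` and its counters are made up for illustration, and starting each stride at `p * p` is a common shortcut assumed here, not something stated in the thread.

```python
def sieve_guarded(limit):
    """Mark composites below `limit`, skipping stores to already-marked cells.

    Returns (flags, attempted, stored): flags[n] == 1 means n is composite,
    attempted counts marking attempts, stored counts actual writes.
    """
    flags = bytearray(limit)  # one byte per number, 0 = not yet marked
    attempted = stored = 0
    p = 2
    while p * p < limit:
        if not flags[p]:  # p is prime: walk the array with stride p
            for n in range(p * p, limit, p):
                attempted += 1
                if not flags[n]:  # read first, write only when needed
                    flags[n] = 1
                    stored += 1
        p += 1
    return flags, attempted, stored


flags, attempted, stored = sieve_guarded(100)
primes = [n for n in range(2, 100) if not flags[n]]
print(primes)
print(f"{attempted - stored} of {attempted} marking attempts were redundant")
```

The redundant fraction grows with the size of the array and the stride, which is the effect the guard exploits.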

~~~
simonbyrne
One thing I always wondered about this: why was he using the FPU? These all
seem like integer operations.

~~~
wscott
Look here:
[http://www.mersenne.org/various/math.php](http://www.mersenne.org/various/math.php)
at the Lucas-Lehmer test which uses floating point FFTs to square large
numbers. I am not sure if that is what he was doing originally, but I suspect
it was.

We used that prime95 program from mersenne.org quite a bit in testing because
it was very close to our best max-power test for processors. It would keep
both the FP and integer ALUs saturated and validated all the results so if
anything was wrong it would start complaining.
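
For reference, the Lucas-Lehmer test itself is short. The sketch below is only an illustration: it uses Python's built-in big integers for the squaring, whereas prime95 does that step with floating point FFTs, and the function name `lucas_lehmer` is an assumption of this example.

```python
def lucas_lehmer(p):
    """Return True if the Mersenne number 2**p - 1 is prime (p an odd prime)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # the squaring is the expensive step for large p
    return s == 0


print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 31) if lucas_lehmer(p)])
# prints [3, 5, 7, 13, 17, 19, 31]
```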

------
userbinator
A more detailed explanation of the cause:

[http://www.cs.earlham.edu/~dusko/cs63/fdiv.html](http://www.cs.earlham.edu/~dusko/cs63/fdiv.html)

5 entries out of the 1066-entry lookup table were wrong. They probably didn't
use test vectors that exercised all the entries.

But in general, testing complex ICs is _hard_. There are analogue effects too
- if an instruction happens to make the right set of transistors switch in a
certain way, going past the estimated margins, power supply fluctuations and
crosstalk could flip a bit or two. Sometimes these bit-flips don't cause any
problems since it happens in an unused part of the logic, but sometimes they
do. As the enthusiasts who like to overclock have shown, it's easy to get
something that looks like it works most of the time, but then completely
crashes when executing just the right instructions.

~~~
thesz
My favorite tale about testing is from AMD's effort. They hired ACL experts to
verify their FPU. These experts built a translator from Verilog to ACL and
then formally verified the resulting ACL code. They found several bugs that
had slipped past a suite of 84 million tests.

~~~
carussell
This was done by Boyer and Moore (the same names behind the Boyer–Moore string
search algorithm) at Computational Logic, Inc. The automated reasoning
software that thesz is referring to here is ACL2, and not that other Austin
export that goes by the name ACL.

~~~
gonzo
CLI is no more, it all but stopped in 1997.
[http://computationallogic.com/news/index.html](http://computationallogic.com/news/index.html)

While you might be referring to Austin City Limits (the television show), you
might also be referring to the ACL Festival, which didn't launch until 2002.

------
julianpye
In the end, despite the expensive recall, this was a big win for the Intel
brand. The early processor years were tumultuous, with the 486 clones from
Cyrix and TI losing on the battleground. Intel went the trademark way with the
Pentium and the media embraced the incident as something that would sink them.

Although very few people would ever come across the bug, Intel allowed every
processor to be exchanged. No matter if you were a gaming consumer or a giant
corporation using coprocessor-heavy software.

So I remember a UPS driver coming by my student flat with an exchange
processor and picking my faulty unit up a week later or so. It was incredible
service that made Intel as a brand very reliable.

~~~
dspillett
_> the media embraced the incident as something that would sink them._

It could have given them a damn good kicking in the consumer retail market and
this was new(ish) territory for Intel so they pulled out all the stops to make
sure there was no way it could be made to look like they were trying to fob
off the end user.

In reality bugs were present in CPUs all the time in the past, errata were
published and libraries & compilers & such were adjusted to avoid the problems
with no one making a big fuss - look at the Linux source for remnants of these
issues in the x86 line of processors (there were a few in 386/486 era chips,
some time after FDIV came the "F00F" bug in the Pentium line, and so on).
Other chip lines are similarly affected: I remember from my youth tinkering
with assembly language there being a bug in some 6502s that meant indirect
jumps referencing the last byte of a page would not work as expected.
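
The 6502 behaviour can be sketched as a tiny emulation. On the original NMOS parts, an indirect jump whose pointer sits at the last byte of a page fetches the high byte of the target from the start of the same page instead of from the next page. The helper `jmp_indirect_target` and the memory contents below are hypothetical, chosen only for illustration.

```python
def jmp_indirect_target(memory, pointer, nmos_bug=True):
    """Return the 16-bit address an emulated JMP (pointer) would load."""
    low = memory[pointer]
    if nmos_bug and (pointer & 0xFF) == 0xFF:
        # Page wrap: the high byte comes from the start of the same page.
        high_addr = pointer & 0xFF00
    else:
        high_addr = (pointer + 1) & 0xFFFF
    return (memory[high_addr] << 8) | low


memory = bytearray(0x10000)
memory[0x10FF] = 0x34  # low byte of the intended target
memory[0x1100] = 0x12  # high byte of the intended target ($1234)
memory[0x1000] = 0x56  # byte the buggy fetch picks up instead

print(hex(jmp_indirect_target(memory, 0x10FF, nmos_bug=False)))  # 0x1234
print(hex(jmp_indirect_target(memory, 0x10FF)))                  # 0x5634
```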

But the FDIV bug came to attention around the time when CPUs were first being
marketed directly at end users rather than PC makers in a big way (as the next
phase of the battle you mention in the i486 era), with the man on the street
suddenly being aware that they might be able to make the value choice between
the alternatives. That is in part why the Pentium line got a name instead of
just a code/number (80586, i586, ...): Intel found they couldn't stop people
using a number directly in their product names, which would have made it
harder to differentiate their products from the competition (of course the
workaround for this that everyone used was to call their alternative chips
"pentium class"). Even ignoring that, a name tends to be much easier for
marketing to work with, but I digress... The common consumer had different
expectations of how flaws were dealt with, and Intel couldn't risk trying to
placate people with "this has happened before, your software will be
recompiled and everything will be fine, in fact you are likely not to be
affected anyway, stay calm, we've got this" because the masses probably
wouldn't take that, especially as Intel's competitors would capitalise on the
situation in any way they were given time to. So they instead took the route
already common in direct consumer markets: the face-saving recall and free
replacement.

~~~
pdw
For some "fun" reading, here are the 149 public errata of Haswell CPUs:
[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf)

------
caf
PKZIP 2.04g, now there's a version number that fires a long-dormant set of
neurons.

------
krupan
This, people, is how you write a bug report. Very detailed, he even told them
which version of pkzip he used. Very nice of him.

------
eitally
I grew up in Lynchburg and was in high school at this time. It was by far one
of the most exciting reasons my town ever hit national headlines. :) Thanks
for sharing this -- it was nice to relive the memories.

~~~
lfowles
Heh, thought these comments sounded familiar

[https://news.ycombinator.com/item?id=1742088](https://news.ycombinator.com/item?id=1742088)

~~~
eitally
hehe -- history repeats itself! :)

------
kstrauser
Well, huh. "bc" is still that bad:

    
    
      % bc -l
      bc 1.06
      Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
      This is free software with ABSOLUTELY NO WARRANTY.
      For details type `warranty'.
      (824633702441.0)*(1/824633702441.0)
      .99999999224129613242
      (824633702441.0)*(1/824633702441.0)-0.999999996274709702
      -.00000000403341356958
      ((824633702441.0)*(1/824633702441.0))/0.999999996274709702
      .99999999596658641539
    

Python on the same machine is not:

    
    
      % python
      Python 2.7.10 (default, May 26 2015, 13:01:57)
      [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> (824633702441.0)*(1/824633702441.0)
      0.9999999999999999

~~~
yuubi
In bc, write

    
    
        scale=1000
    

(or however many digits you want) first, and you'll get better results.
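
The reason is that `bc -l` starts with `scale=20`, and division keeps only that many digits after the decimal point, truncating the rest. The reciprocal of a 12-digit number therefore keeps only about nine significant digits. A rough Python imitation using the `decimal` module follows; `bc_divide` is a made-up helper for this example, not part of bc or Python.

```python
from decimal import Decimal, ROUND_DOWN, getcontext

getcontext().prec = 60  # plenty of working precision for this demo


def bc_divide(a, b, scale):
    """Divide like bc: keep `scale` digits after the point, truncating."""
    return (a / b).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_DOWN)


n = Decimal("824633702441.0")
for scale in (20, 40):
    reciprocal = bc_divide(Decimal(1), n, scale)
    print(scale, n * reciprocal)
```

With `scale` at 20 the product matches the bc session above to all 20 digits; raising it to 40 pushes the error far below anything a double could show.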

------
dmfdmf
_"The bug has been observed on all Pentiums I have tested or had tested to
date, including a Dell P90, a Gateway P90, a Micron P60, an Insight P60, and a
Packard-Bell P60."_

This made me laugh... we used to call them Packard-Smells.

