
Intel Underestimates Error Bounds by 1.3 Quintillion - ghusbands
http://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/
======
luu
This is one of those things that's well known by people who spend a lot of
time doing low-level nonsense that really ought to get wider press. You really
don't want to use x86 hardware transcendental instructions.

I've seen people try to measure the effectiveness of their approximation by
comparing against built-in hardware, but their approximation is probably
better than the hardware! If you're not writing your own assembly, you'll be
ok on most modern platforms if you just link with "-lm" and use whatever math
library happens to be lying around as the default, but it's still possible to
get caught by this. On obscure platforms, it's basically a coin flip.
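
To make that concrete, here's a minimal sketch of the kind of comparison I mean, assuming GCC-style inline asm on x86 (the "t" constraint is the top of the x87 register stack); whatever sin() your libm provides will generally beat the raw instruction near pi:

    #include <math.h>
    #include <stdio.h>
    
    /* Hypothetical harness: pit the x87 fsin instruction against
       whatever sin() your libm provides, near double(pi). */
    static double fsin_x87(double x) {
        double r;
        __asm__("fsin" : "=t"(r) : "0"(x));
        return r;
    }
    
    int main(void) {
        double x = 3.141592653589793;  /* double(pi) */
        printf("fsin: %.17g\n", fsin_x87(x));
        printf("libm: %.17g\n", sin(x));
        return 0;
    }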

I used to work for another CPU vendor, and when we implemented more accurate
sin/cos functions, some benchmarks would fail us for getting the wrong result.

Turns out those benchmarks hardcode a check for a couple results, and that
check is based on what Intel has been doing for ages. My recollection is that
we had a switch to enable the extra accuracy, but that it wasn't enabled by
default because it was too much of a risk to break compatibility with Intel.

If that sounds too risk averse, there's a lot of code out there that depends
on your processor precisely matching Intel's behavior on undefined flags and
other things that code has no business depending on. It's basically the same
thing Raymond Chen is always talking about, but down one level of abstraction.
I've seen what happens if you just implement something that matches the "spec"
(the Intel manual) and do whatever you want for "undefined" cases. You get a
chip that's basically useless because existing software will crash on it.

~~~
JoeAltmaier
The problem is not so much "Intel stuff is old and broken" as it is "Intel
wrote the standard and there's no process to move it forward"?

~~~
sillysaurus3
OpenGL is a great example of how painful it is to write apps that are capable
of not breaking when the standard they rely on changes. You basically end up
with capability checks or version checks, which complicate your code.

I suppose there could've been a function that returned the implementation
version of fsin/fcos/etc, which was incremented whenever Intel changed the
implementation. That way benchmarks could test against that particular
version. But it'd be hard to come up with a precise, consistent way to
implement that. For example, do you increment the version whenever any
transcendental function changes? If you do that, then you get benchmark code
riddled with special cases for very specific versions, which someone has to
constantly look up in documentation somewhere. You'd basically have to
memorize the effect of every version increment. On the other hand, you could
implement a more elaborate scheme, like having a version number for each of
the transcendental functions, but then you'd either need to hardcode the
number of version numbers returned by the API, or expose a way to add to the
total number of version numbers in case someone wanted to add another
function, which is extra API complexity.

I'm not necessarily arguing against having some kind of process by which apps
could have the implementation of sin/cos/etc changed, just explaining how it
gets complicated pretty quickly.

~~~
ObviousScience
One thought on the version number: it could be the product of prime powers.

If function 1 is on version 5, function 2 on version 3, function 3 on version
7, and function 4 on version 1, we can encode this as 2^5 * 3^3 * 5^7 * 7^1 =
472500000. This lets us unambiguously increment orthogonal values while still
keeping them represented as a single number. We could even easily add a
function 5 by multiplying the number by 11.

One problem is that it's not super dense (storing those 4 values takes ~29
bits), but a 128-bit or 256-bit number would hold more than most applications
are likely to need.

One benefit is that software can check for the functions that it's interested
in (as long as we can agree on what prime corresponds to what function - a
relatively mild standard, if we just use order of introduction) while ignoring
any portion of the result it doesn't understand.
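
A minimal sketch of the decoding side (assuming primes are assigned to functions in order of introduction, as above):

    #include <stdint.h>
    #include <stdio.h>
    
    static const uint64_t primes[] = { 2, 3, 5, 7, 11 };
    
    /* The version of function i is the exponent of its prime in the
       encoded value. */
    static unsigned version_of(uint64_t encoded, int i) {
        unsigned v = 0;
        while (encoded % primes[i] == 0) {
            encoded /= primes[i];
            v++;
        }
        return v;
    }
    
    int main(void) {
        uint64_t enc = 472500000;  /* 2^5 * 3^3 * 5^7 * 7^1, from above */
        for (int i = 0; i < 4; i++)
            printf("function %d: version %u\n", i + 1, version_of(enc, i));
        return 0;
    }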

~~~
Danieru
> Storing those 4 values takes ~29 bits.

Just storing those 4 values in 4 bits is more efficient.

Checking that a bit is set is also a simple AND + CMP, which beats a DIV +
CMP.

Sorry, I just hadn't considered that this isn't common knowledge outside C or
C++ land.

~~~
njr123
How do you store a version number in a single bit?

~~~
rjaco31
Well, bit n represents version n: if that bit is set to 1, the version is
supported; otherwise it's not.
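
For example (a minimal sketch in C):

    #include <stdint.h>
    #include <stdio.h>
    
    int main(void) {
        /* Bit n set means version n is supported; here, versions 3 and 5. */
        uint32_t mask = (1u << 3) | (1u << 5);
        printf("version 3 supported: %u\n", (mask >> 3) & 1u);  /* 1 */
        printf("version 4 supported: %u\n", (mask >> 4) & 1u);  /* 0 */
        return 0;
    }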

------
CurtHagenlocher
Of possible interest, I researched a related issue once and here's what I
found:

The particular approximation used by the FPU instructions fsin and fcos is
valid in the domain [-π/4, π/4]. Inputs outside this domain are reduced by
taking the remainder when dividing by π/4 and then using trigonometric
identities to fix up values in the other three quadrants (i.e. sin(x) =
sin(π/2 - x) for x in [π/4, π/2]).

The Intel FPU is “off” because it uses an approximation for π/4 when
performing the reduction. In the mid-90s, some engineers at AMD figured out
how to do a more precise reduction and implemented this reduction in the K5
FPU. AMD went back to being bit-for-bit compatible with the x87 behavior in
the very next generation of their CPUs, presumably because it either broke or
appeared to break too many applications.
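
For the curious, a sketch of the two-word trick involved (the head/tail constants are the standard fdlibm split of pi/2; a real implementation uses more terms, plus a Payne-Hanek path for huge arguments):

    #include <math.h>
    #include <stdio.h>
    
    int main(void) {
        double x = 3.141592653589793;      /* double(pi) */
        double k = nearbyint(x / M_PI_2);  /* nearest multiple of pi/2 */
        double pio2_hi = 1.57079632673412561417e+00;  /* head of pi/2 */
        double pio2_lo = 6.07710050650619224932e-11;  /* tail of pi/2 */
        /* A single rounded constant cancels catastrophically: */
        double r_naive = x - k * M_PI_2;
        /* The two-word version keeps the bits that matter: */
        double r_cw = (x - k * pio2_hi) - k * pio2_lo;
        printf("naive:      %.17g\n", r_naive);  /* 0 */
        printf("cody-waite: %.17g\n", r_cw);     /* ~ -1.2246e-16 */
        return 0;
    }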

~~~
sillysaurus3
It's not possible to implement a more precise fsin() without breaking apps?

One scenario where I can imagine an app breaking is if someone executed
fsin(x), pasted the result into their code, and then tested some other float
against it using the == operator.

I suppose a more common breakage scenario would be if someone did something
like floor(x + fsin(x)), since that'd very slightly change where the floor
"triggers" with respect to the input. But how could that cause an app to
break? Every scenario I can think of comes back to someone using == to test
floats, which everyone knows is a no-no for dynamic results, such as the
output of a function like fsin().

I guess a programmer wouldn't really expect the result of "x + fsin(x)" to
ever change for a given value of x. So maybe they executed it with that value,
stored the result, and did an equality comparison later; that would always
work unless the implementation of fsin() changed, and the thought process went
something like, "This should be reliable, because realistically the
implementation of fsin() won't change."

Can anyone think of some obvious way that making fsin() more accurate could
cause serious breakage in an app that isn't testing floats with the equality
operator?
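
A contrived sketch of that last mode, with stand-in values for x + fsin(x):

    #include <math.h>
    #include <stdio.h>
    
    int main(void) {
        /* Two results one ulp apart that straddle an integer boundary: */
        double old_sum = 1.0;                  /* old implementation */
        double new_sum = nextafter(1.0, 0.0);  /* one ulp smaller */
        printf("floor(old): %g\n", floor(old_sum));  /* 1 */
        printf("floor(new): %g\n", floor(new_sum));  /* 0 */
        return 0;
    }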

~~~
Someone
I expect that Microsoft could get some angry customers if Excel's computations
varied with the CPU it runs on. Imagine discussing a spreadsheet with a
colleague and seeing wildly different data.

Some people would say "That's what you deserve if you use an ill-conditioned
algorithm or have an ill-conditioned problem", but most would not even
understand that statement.

~~~
vitd
But doesn’t that already happen? Didn’t we see that when the Pentium flaw was
discovered? In fact, I thought there was an Excel spreadsheet you could
download and run to test whether you had the flaw or not.

~~~
Someone
For this argument, the Pentium flaw is not the best example. As long as all
CPUs were equally flawed, everything was fine, if you define 'fine' as 'we see
the same results on all systems'.

I guess Excel will have the problem in some cases on rare CPUs. For 'common'
ones, I expect that Excel contains workarounds for errata that might affect
its computations.

------
conistonwater
There is a mathematical reason why this is not completely horrible, even
though it would catch you off guard. When x is approximately pi, the function
x -> sin(x) is ill-conditioned. [0] So the relative error (ybar-y)/y between
the computed value ybar=fl(sin(x)) and the true value y=sin(x) is quite large.

However: you can find a number xbar that is very close to x, in the sense that
(xbar-x)/x is very small, such that the computed "inaccurate" value
ybar=fl(sin(x)) matches sin(xbar) exactly. Thus this procedure is backward-
stable, but not forward-stable. For "typical" numerically-stable algorithms,
this might not be such an issue.
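
A quick way to see the conditioning (the relative condition number of sin at x is |x*cos(x)/sin(x)|, which blows up as x approaches pi):

    #include <math.h>
    #include <stdio.h>
    
    int main(void) {
        double xs[] = { 1.0, 3.0, 3.14, 3.1415926 };
        for (int i = 0; i < 4; i++) {
            double x = xs[i];
            /* tiny relative input error -> cond * that error in output */
            printf("x = %-10g cond = %g\n", x, fabs(x * cos(x) / sin(x)));
        }
        return 0;
    }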

Very interesting writeup, anyway. I also found this post by James Gosling [1]
describing the same issue, and the slower software workaround implemented in
the JVM; the post is from 2005. The paper that explains the K5's
implementation is "The K5 transcendental functions" by Lynch, Ahmed, Schulte,
Callaway, and Tisdale, but it is behind an academic paywall.

[0]
[http://en.wikipedia.org/wiki/Condition_number](http://en.wikipedia.org/wiki/Condition_number)

[1]
[https://blogs.oracle.com/jag/entry/transcendental_meditation](https://blogs.oracle.com/jag/entry/transcendental_meditation)

------
acqq
[https://software.intel.com/en-us/forums/topic/289702](https://software.intel.com/en-us/forums/topic/289702)

"Shane Story (Intel) Sun, 06/27/2010 - 11:44

(...) rigorous error bounds for most of the x87 transcendental functions
(f2xm1, fsincos, fyl2x...) were provided when the Pentium processor released.
I recall an appendix in the original Pentium manual which showed a collection
of ulp (unit in the last place) charts - don't recall the upper bound for all
of them but it was something like 1.8 ulps. This would be for double-extended
80 bit fp inputs. Results for the fsin/fcos were for arguments < pi/4 because
for the reduction for larger inputs, a 66-bit value of pi (or pi/2) was used."

It's really a documentation bug. Sometimes important details get lost in
documentation rewrites.

------
jhallenworld
Related: here is a very good article and comments on estimating transcendental
functions:
[http://www.coranac.com/2009/07/sines/](http://www.coranac.com/2009/07/sines/)

I'm annoyed that one of the links in the comments is broken: there was a page
where you entered your function and it produced a Chebyshev polynomial
approximating it to a particular precision.

EDIT: well here is another one:
[http://goddard.net.nz/files/js/chebyshev/cheby.html](http://goddard.net.nz/files/js/chebyshev/cheby.html)
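
For reference, the core of what such a page computes is small. Here's a sketch of the standard coefficient formula, for sin on [-pi/4, pi/4] (a real tool would also evaluate the resulting polynomial with the Clenshaw recurrence):

    #include <math.h>
    #include <stdio.h>
    
    #define N 8  /* number of Chebyshev coefficients */
    
    int main(void) {
        double a = -M_PI / 4, b = M_PI / 4;
        for (int j = 0; j < N; j++) {
            double sum = 0;
            for (int k = 0; k < N; k++) {
                double t = cos(M_PI * (k + 0.5) / N);  /* node in [-1,1] */
                double x = 0.5 * (b - a) * t + 0.5 * (b + a);  /* to [a,b] */
                sum += sin(x) * cos(M_PI * j * (k + 0.5) / N);
            }
            /* c[0] is conventionally used with an extra factor of 1/2 */
            printf("c[%d] = %+.3e\n", j, 2.0 * sum / N);
        }
        return 0;
    }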

~~~
edwintorok
There is some interesting info here as well on how to compute most of libm
(including transcendental functions) in a correctly-rounded/precise way:
[http://lipforge.ens-lyon.fr/frs/download.php/153/crlibm-1.0b...](http://lipforge.ens-lyon.fr/frs/download.php/153/crlibm-1.0beta3.pdf)

~~~
jhallenworld
Also, the old standard reference for this was "Computer Approximations" by
Hart and Cheney.

------
colanderman
_Note that sin(double(pi)) should not return zero. sin(pi) is, mathematically,
zero, but since float/double/long-double can’t represent pi it would actually
be incorrect for them to return zero when passed an approximation to pi._

Huh? sin(pi ± epsilon) better damn well return 0 if epsilon is closer to 0
than the minimum representable floating-point value. Otherwise the processor
is rounding incorrectly.

(I realize epsilon in this case is _likely_ to be much larger than the minimum
representable float, but it's not _guaranteed_ and thus the article's
reasoning is flawed.)

~~~
nhaehnle
Actually, epsilon is _guaranteed_ to be much larger than the minimum
representable float: double(pi) is 1.57... * 2^1, so double(pi) - pi is going
to be on the order of 2^-m, where m is the size of the mantissa. The only way
that double(pi) - pi could be much smaller is if pi magically had a _lot_ of
0s in its binary representation (it doesn't).

This caught me off-guard at first as well, but it's really quite logical.
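
You can see this directly: since sin(pi - e) ≈ e for tiny e, a good sin() handed double(pi) should return roughly pi - double(pi), not zero. With a decent libm:

    #include <math.h>
    #include <stdio.h>
    
    int main(void) {
        /* Prints ~1.2246467991473532e-16, i.e. pi - double(pi): */
        printf("sin(M_PI) = %.17g\n", sin(M_PI));
        return 0;
    }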

~~~
colanderman
_The only way that double(pi) - pi could be much smaller is if pi magically
had a lot of 0s in its binary representation (it doesn't)._

Yes, this is what I was getting at. Obviously the statement is true for pi and
the majority of reals; I just don't understand why it should be a _rule_ as
the article implies.

------
preillyme
Wow, when doing argument reduction from the 80-bit long-double format there
will be almost complete cancellation? Damn.

------
leoh
I wonder what a great standard for this is. Incidentally, do TI or HP
calculators give good results?

~~~
StefanKarpinski
Wolfram Alpha (and Mathematica) are generally completely accurate at the cost
of being less efficient. I use those to check most things I want to verify.

------
nroets
This is a storm in a tea cup. "The absolute error ... was fairly constant at
about 4e-21, which is quite small. For many purposes this will not matter"

Good programmers know that most floating point calculations introduce small
errors and find ways to make their programs cope.

~~~
lvh
That's pretty bad for a double that you're about to do a bunch of operations
with. It's pretty easy to throw numerical precision away.

------
Aardwolf
> However this technique relies on printing the value of the double-precision
> approximation of pi to 33 digits, and up to VC++ 2013 this doesn’t work

How can an IDE prevent printing a number? If all else fails, you can
reinterpret_cast it to an integer and handle the bits yourself?
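
Something like this, say (memcpy instead of reinterpret_cast to stay clear of aliasing rules; %a also prints the exact hex-float):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    
    int main(void) {
        double pi = 3.141592653589793;
        uint64_t bits;
        memcpy(&bits, &pi, sizeof bits);  /* copy the raw bit pattern */
        printf("bits: 0x%016llx\n", (unsigned long long)bits);
        printf("hex float: %a\n", pi);    /* exact, no decimal rounding */
        return 0;
    }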

~~~
azernik
First, a nitpick: VC++ is a compiler toolchain (including libraries), not just
an IDE.

Also, the next sentence:

    So, you either need *custom printing code* or you need to wait for Dev 14 which has improved printing code.

------
msie
I'm not worthy, I'm not worthy...

Edit:

The intel engineer was probably smart, but then there's this guy. And then
there's the commenters here. Lesson: There's always someone out there smarter
than you.

~~~
bsder
The lesson is actually "This was designed 35+ years ago under different
constraints."

Memory and ROM were small and expensive, and processors were in-order and
slow. Putting the transcendentals in microcode made sense back then.

Now, memory is huge; processors are superscalar and speculative; vector units
exist; and everybody ignores the built-in transcendentals, but Intel can't
dump them, for backwards compatibility.

~~~
msie
Yes, but from the text it seemed like fsin was broken to begin with:

 _It is surprising that fsin is so inaccurate. I could perhaps forgive it for
being inaccurate for extremely large inputs (which it is) but it is hard to
forgive it for being so inaccurate on pi which is, ultimately, a very ‘normal’
input to the sin() function._

------
pedrow
Possible solution: revive the Indiana Pi Bill[0] (assuming 3.2 has an exact
representation in binary?)

[0]:[http://en.wikipedia.org/wiki/Indiana_Pi_Bill](http://en.wikipedia.org/wiki/Indiana_Pi_Bill)

~~~
leni536
_assuming 3.2 has an exact representation in binary?_

It does not have an exact representation in binary.

~~~
thaumasiotes
Easy solution: amend the value to 3.25, or 3.125

