I've seen people try to measure the effectiveness of their approximation by comparing against built-in hardware, but their approximation is probably better than the hardware! If you're not writing your own assembly, you'll be ok on most modern platforms if you just link with "-lm" and use whatever math library happens to be lying around as the default, but it's still possible to get caught by this. On obscure platforms, it's basically a coin flip.
I used to work for another CPU vendor, and when we implemented more accurate sin/cos functions, some benchmarks would fail us for getting the wrong result.
Turns out those benchmarks hardcode a check for a couple results, and that check is based on what Intel has been doing for ages. My recollection is that we had a switch to enable the extra accuracy, but that it wasn't enabled by default because it was too much of a risk to break compatibility with Intel.
If that sounds too risk averse, there's a lot of code out there that depends on your processor precisely matching Intel's behavior on undefined flags and other things that code has no business depending on. It's basically the same thing Raymond Chen is always talking about, but down one level of abstraction. I've seen what happens if you just implement something that matches the "spec" (the Intel manual) and do whatever you want for "undefined" cases. You get a chip that's basically useless because existing software will crash on it.
...unless you're doing size-optimised graphics intros, where tremendous accuracy is not essential and the stack-based nature of x87 makes for some extremely compact code.
I think the main reason why this issue hasn't been so significant is that the majority of code that uses those instructions isn't dependent on extreme accuracy - code that cares would likely use its own routines or even arbitrary-precision arithmetic.
I suppose there could've been a function that returned the implementation version of fsin/fcos/etc, which was incremented whenever Intel changed the implementation. That way benchmarks could test against that particular version. But it'd be hard to come up with a precise, consistent way to implement that. For example, do you increment the version whenever any transcendental function changes? If you do that, then you get benchmark code riddled with special cases for very specific versions, which someone has to constantly look up in documentation somewhere. You'd basically have to memorize the effect of every version increment. On the other hand, you could implement a more elaborate scheme, like having a version number for each of the transcendental functions, but then you'd either need to hardcode the number of version numbers returned by the API, or expose a way to add to the total number of version numbers in case someone wanted to add another function, which is extra API complexity.
I'm not necessarily arguing against having some kind of process by which apps could have the implementation of sin/cos/etc changed, just explaining how it gets complicated pretty quickly.
If function 1 is on version 5, function 2 on version 3, function 3 on version 7, and function 4 on version 1, we can encode this as 2^5 * 3^3 * 5^7 * 7^1 = 472500000. This lets us unambiguously increment orthogonal values while still keeping them represented as a single number. We could even easily add a function 5 by multiplying the number by 11.
One problem is that it's not super dense (storing those 4 values takes ~29 bits), but a 128-bit or 256-bit number would store a larger value than is likely needed by most applications.
One benefit is that software can check for the functions that it's interested in (as long as we can agree on what prime corresponds to what function - a relatively mild standard, if we just use order of introduction) while ignoring any portion of the result it doesn't understand.
Just storing those 4 values in 4 bits is more efficient.
Checking that a bit is set is also a simple AND + CMP, which beats a DIV + CMP.
Sorry I just had not considered this isn't common knowledge outside C or C++ land.
That doesn't really work for version numbers > 1. They're not just flags.
Just store, say, a byte per value, and concat them into your 128 bit number.
So "function 1 is on version 5, function 2 on version 3, function 3 on version 7, and function 4 on version 1" is four functions starting with 1 = (1000,0010),1010,0110,1110,0010 = 24 bits.
It gets better quick with larger numbers of functions.
BTW, the size of the prime scheme is log2(nthprime(function))*version bits for each function. If you don't know ahead of time which functions there might be, then you have to do a prime factorization, which gets expensive for large numbers. I guess if you used really large numbers you could have a cryptographically secure way of indicating which versions of which functions are in your library.
If you start with version 0 (and why not?), I think the Fundamental Theorem of Arithmetic implies maximum possible density?
I think I would just represent the version of each function with a byte.
The 55th prime is 257 > 256, so for function 55 and beyond, incrementing the version number adds more than one full byte to your number.
fn 2 @ v1, fn 3 @ v1 = 2^1 * 3^1 = 6
fn 2 @ v2, fn 3 @ v1, fn 5 @ v3 = 2^2 * 3^1 * 5^3 = 2 * 2 * 3 * 5 * 5 * 5 = 1500
EX: Multiplying by 2 adds a full bit to your number every time you increment it. And it just gets worse from there.
If you really want to store mostly zeros, you're probably better off just using a count of the functions > 0 and then either a bit per function to show it's > 0, or an index if things are sparse.
I just think it's fun for low-value things, and interesting if you're in a situation where you have a huge feature set relative to the number of active features, since it incurs no penalty for fields that are 0. Any time you have ~1% of features present, it can make a difference. Example: if you have about 5 of 500 fields with 16 possible values represented, storing all the 0s is expensive. Using a naive 4-bits-per-field encoding, you end up with 2000 bits, whereas using my method you only use ~620. Even using a 500-bit number to flag which fields are present and then listing just those fields in order only saves you about 100 bits over my method.
Plus, I manage to spark off a discussion about math and the problem, rather than just whining about it being hard, which I consider to be a good thing.
Or just a bit flag for active (500 bits) then 4 bits * 5 for your counts = 520 bits.
And if you really want to compress things, start with a count so the 500-bit flag field is only used if your count is over 50.
PS: in your case you also need an index of the number of bits/bytes used.
Edit: If there is a reasonable chance your counts are greater than 1, just appending one bit per count (vs using higher exponents) saves space, i.e. for the 3 numbers 1, 3, 2 append 100101 = first 1, then 3, then 2.
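My reading of that appended-bits example is a unary encoding where a count n is written as (n-1) zeros followed by a terminating 1; a small sketch under that assumption:

```python
# Unary count encoding (assumed from the "100101 = 1, 3, 2" example):
# count n is (n-1) zeros followed by a terminating 1.
def encode_counts(counts):
    return "".join("0" * (n - 1) + "1" for n in counts)

def decode_counts(bits):
    # Each run of zeros before a 1 is one count; split on the terminators.
    return [len(run) + 1 for run in bits.split("1")[:-1]]

print(encode_counts([1, 3, 2]))  # "100101"
print(decode_counts("100101"))   # [1, 3, 2]
```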
The particular approximation used by the FPU instructions fsin and fcos is valid on the domain [-π/4, π/4]. Inputs outside this domain are reduced by taking the remainder when dividing by π/4 and then using trigonometric identities to fix up values in the other three quadrants (e.g. sin(x) = sin(π/2 - x) for x in [π/4, π/2]).
The Intel FPU is “off” because it uses an approximation for π /4 when performing the reduction. In the mid 90s, some engineers at AMD figured out how to do a more precise reduction and implemented this reduction in the K5 FPU. AMD went back to being bit-for-bit compatible with the x87 behavior in the very next generation of their CPUs, presumably because it either broke or appeared to break too many applications.
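To see concretely why the precision of π in the reduction matters: the double closest to π misses the true value by about 1.2e-16, and since sin(π - d) ≈ d for tiny d, a correctly rounded sin of that double must return roughly that representation error rather than 0. A quick check using only the standard library (the 50-digit π constant is hardcoded for the comparison):

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50
PI_50 = Decimal("3.1415926535897932384626433832795028841971693993751")

# Representation error of the double nearest pi; Decimal(math.pi) is exact.
delta = PI_50 - Decimal(math.pi)

print(float(delta))        # ~1.2246467991473532e-16
print(math.sin(math.pi))   # should match delta on a good libm
```

A reduction done with only a 66-bit π cannot recover this error term in full, which is exactly the inaccuracy being discussed.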
One scenario I can imagine an app breaking is if someone executed fsin(x), pasted the result into their code, and then tested some other float against it using the == operator.
I suppose a more common breakage scenario would be if someone did something like floor(x + fsin(x)), since that'd very slightly change where the floor "triggers" with respect to the input. But how could that cause an app to break? Every scenario I can think of comes back to someone using == to test floats, which everyone knows is a no-no for dynamic results, such as the output of a function like fsin().
I guess a programmer wouldn't really expect the result of "x + fsin(x)" to ever change for a certain value of x, so maybe they executed it with that certain value, stored the result, and did an equality comparison later, which would always work unless the implementation of fsin() changed, and the thought process went something like, "This should be reliable because realistically the implementation of fsin() won't change."
Can anyone think of some obvious way that making fsin() more accurate could cause serious breakage in an app that isn't testing floats with the equality operator?
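The floor() scenario can be made concrete with a hypothetical pair of results one ulp apart (the values here are contrived for illustration, not from a real fsin):

```python
import math

# Suppose the old fsin made x + fsin(x) land exactly on 2.0, and a more
# accurate fsin yields a sum one ulp smaller. floor() then flips from
# 2 to 1 even though the change is the smallest one possible.
y_old = 2.0
y_new = math.nextafter(2.0, 0.0)   # one ulp below 2.0

print(math.floor(y_old))  # 2
print(math.floor(y_new))  # 1
```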
It is, but it's A) slow as hell (it requires something like 384 bits of Pi to reduce properly) and B) nobody really cares.
Anybody who cares pulls out Cody and Waite and writes their own so that they know exactly what the errors and performance are. This is especially true for modern processors which have vector units and whose mainline performance is probably just as good as its microcode performance.
Anybody who doesn't care is really looking for an approximation anyway, so they're not going to use the transcendentals because they're too slow.
You're absolutely correct that a software implementation may be several times faster than the legacy x87 fsin instruction, while delivering well-rounded results. There shouldn't be a need to write your own implementation, however. High-quality library implementations are pretty widely available these days.
While this sounds grandiose, thanks to aggressive inlining this could be the case even if you wrote this in your logic:
sin(x) == sin(constant)
abs(sin(x) - sin(constant)) < delta
I guess since sin(x) is computed by an FPU in hardware, then it's inherently "dynamically linked." But in that case, why is it valid for a compiler to transform sin(constant) into a hardcoded result? Is there some standard somewhere which says that the implementation of sin/cos/etc can't ever change, and therefore it's valid for the compiler to do that? Or do compilers just make an assumption that the implementation of sin(x) won't ever change?
A floating expression may be contracted, that is, evaluated as though it were an atomic operation, thereby omitting rounding errors implied by the source code and the expression evaluation method. The FP_CONTRACT pragma in <math.h> provides a way to disallow contracted expressions. Otherwise, whether and how expressions are contracted is implementation-defined.
I honestly don't remember all the rules here, but it does in fact do so, legally.
Just as a compiler is free to evaluate strlen("foo") at compile-time, it's free to evaluate sin(0.1) at compile-time.
Compilers generally don't use specialized local-architecture instructions to compute their results, because they can't (it would break cross-compiling or targeting different platforms in a lot of cases).
In fact, it's more likely the other way around:
The compiler substitutes the proper value; actual calls to sin(x) on the local platform give the wrong answer.
You are right, though, that there's a lot of software out there that assumes determinism without being anywhere near sufficiently circumspect.
Some people would say "That's what you deserve if you use an ill-conditioned algorithm or have an ill-conditioned problem", but most would not even understand that statement.
I guess Excel will have the problem in some cases on rare CPUs. For 'common' ones, I expect that Excel contains workarounds for errata that might affect its computations.
However: you can find a number xbar that is very close to x, in the sense that (xbar-x)/x is very small, such that the computed "inaccurate" value ybar=fl(sin(x)) matches sin(xbar) exactly.
Thus this procedure is backward-stable, but not forward-stable. For "typical" numerically-stable algorithms, this might not be such an issue.
Very interesting writeup, anyway. I also found this post by James Gosling describing the same issue, and the slower software workaround implemented in the JVM; the post is from 2005. The paper that explains the K5's implementation is "The K5 transcendental functions" by Lynch, Ahmed, Schulte, Callaway, Tisdale, but it is behind an academic paywall.
"Shane Story (Intel)
Sun, 06/27/2010 - 11:44
(...) rigorous error bounds for most of the x87 transcendental functions (f2xm1, fsincos, fyl2x...) were provided when the Pentium processor released. I recall an appendix in the original Pentium manual which showed a collection of ulp (unit in the last place) charts - don't recall the upper bound for all of them but it was something like 1.8 ulps. This would be for double-extended 80 bit fp inputs. Results for the fsin/fcos were for arguments < pi/4 because for the reduction for larger inputs, a 66-bit value of pi (or pi/2) was used."
It's really a documentation bug. Sometimes some important details get lost after the documentation rewrites.
I'm annoyed that one of the links in the comments is broken: there was a page where you entered your function and it produced a Chebyshev polynomial approximating it to a particular precision.
EDIT: well here is another one: http://goddard.net.nz/files/js/chebyshev/cheby.html
Huh? sin(pi ± epsilon) better damn well return 0 if epsilon is closer to 0 than the minimum representable floating-point value. Otherwise the processor is rounding incorrectly.
(I realize epsilon in this case is likely to be much larger than the minimum representable float, but it's not guaranteed and thus the article's reasoning is flawed.)
This caught me off-guard at first as well, but it's really quite logical.
Yes, this is what I was getting at. Obviously the statement is true for pi and the majority of reals; I just don't understand why it should be a rule as the article implies.
I take issue with this assertion:
It would be absolutely wrong to round a non-zero number to zero.
Why? I know IEEE floats have subnormals and subnormals can be VERY small, but if the error is less than half the minimum representable float, why should it be rounded away from zero?
Yes, zero is "special" because you can't divide by it, but for stability, my understanding is that you shouldn't be dividing by anything that can be so close to zero as to potentially round to it anyway.
The relative error introduced in rounding 1e-16 down to 0 is 100%, or approximately 1e16 ulp, which is why that would be unacceptable. Floating point arithmetic cares about relative errors much much more than absolute.
Also, while dividing by zero is a problem, dividing by 1e-16 should not be a problem, at all. 1e-16 is a perfectly ordinary representable number.
What is missing from your claim is where you say, "Specifically, the delta between pi and its floating-point approximation is greater than half the minimum representable float." This is almost certainly true of pi, but it is NOT true of all real numbers! 1+(1e-1000), for example, fails this test. You cannot – logically! – make the claim that the error of any close approximations of real numbers should not round to zero because the error is not small enough; clearly in many cases it is!
Yes, dividing by 1e-16 is not a problem! My question is whether this is a desirable thing to do in an algorithm that should be stable.
Floating point arithmetic cares about relative errors much much more than absolute.
This hints more at a reason, but why is this true? Yes, I know floating-point numbers are spaced logarithmically, and yes, relative error is obviously the correct measure when dealing with multiplication and division. But if you are not dividing due to stability concerns, so your value eventually finds its way to an addition, isn't absolute error more important?
Edit: Oh my god now I get what you are saying. There are plenty of implicit reasons why the value would end up being non-zero -- in particular, any unit-scale continuous function would have to be artificially toxic for epsilon to be smaller than the smallest float. There is no reason for the article to point out that nit.
We're talking about sin(x) with x ~ pi. In this situation the output's absolute value is much much smaller than the input's absolute value. So the output will have much finer resolution than the input.
The output being appreciably non-zero comes from the output having much higher resolution than the input. The output wouldn't be anywhere near small enough to be subnormal.
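The resolution argument can be checked directly with the standard library: one ulp of the input near π is larger than the entire output value, so the result is nowhere near the subnormal range:

```python
import math

# Near pi the *input* grid is coarse: one ulp of pi is about 4.4e-16,
# while sin of the double nearest pi is about 1.2e-16 -- smaller than
# one input spacing, yet enormously larger than the subnormal range.
print(math.ulp(math.pi))   # ~4.440892098500626e-16
print(math.sin(math.pi))   # ~1.2246467991473532e-16
```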
Good programmers know that most floating point calculations introduce small errors and find ways to make their programs cope.
How can an IDE prevent printing a number? If all else fails, you can reinterpret_cast it to an integer and handle the bits yourself.
Also, the next sentence:
So, you either need *custom printing code* or you need to wait for Dev 14 which has improved printing code.
The Intel engineer was probably smart, but then there's this guy. And then there are the commenters here. Lesson: there's always someone out there smarter than you.
Memory and ROM were small and expensive and processors were in-order and slow. Putting the transcendentals in microcode made sense back then.
Now, memory is huge; processors are superscalar with speculative execution; vector units exist; and everybody ignores the built-in transcendentals, but Intel can't drop them for backwards-compatibility reasons.
It is surprising that fsin is so inaccurate. I could perhaps forgive it for being inaccurate for extremely large inputs (which it is), but it is hard to forgive it for being so inaccurate on pi, which is, ultimately, a very ‘normal’ input to the sin() function.
It does not have an exact representation in binary.