Intel Underestimates Error Bounds by 1.3 Quintillion (randomascii.wordpress.com)
424 points by ghusbands on Oct 9, 2014 | 73 comments



This is one of those things that's well known by people who spend a lot of time doing low-level nonsense, but that really ought to get wider press. You really don't want to use x86 hardware transcendental instructions.

I've seen people try to measure the effectiveness of their approximation by comparing against built-in hardware, but their approximation is probably better than the hardware! If you're not writing your own assembly, you'll be ok on most modern platforms if you just link with "-lm" and use whatever math library happens to be lying around as the default, but it's still possible to get caught by this. On obscure platforms, it's basically a coin flip.

I used to work for another CPU vendor, and when we implemented more accurate sin/cos functions, some benchmarks would fail us for getting the wrong result.

Turns out those benchmarks hardcode a check for a couple results, and that check is based on what Intel has been doing for ages. My recollection is that we had a switch to enable the extra accuracy, but that it wasn't enabled by default because it was too much of a risk to break compatibility with Intel.

If that sounds too risk averse, there's a lot of code out there that depends on your processor precisely matching Intel's behavior on undefined flags and other things that code has no business depending on. It's basically the same thing Raymond Chen is always talking about, but down one level of abstraction. I've seen what happens if you just implement something that matches the "spec" (the Intel manual) and do whatever you want for "undefined" cases. You get a chip that's basically useless because existing software will crash on it.


> You really don't want to use x86 hardware transcendental instructions

...unless you're doing size-optimised graphics intros, where tremendous accuracy is not essential and the stack-based nature of x87 makes for some extremely compact code:

http://www.pouet.net/prod.php?which=53816

https://www.youtube.com/watch?v=R35UuntQQF8

I think the main reason why this issue hasn't been so significant is that the majority of code that uses those instructions isn't dependent on extreme accuracy - code that does would likely use its own routines or even arbitrary-precision arithmetic.


The problem is not so much "Intel stuff is old and broken" as it is "Intel wrote the standard and there's no process to move it forward"?


OpenGL is a great example of how painful it is to write apps that keep working when the standard they rely on changes. You basically end up with capability checks or version checks, which cause your code to become complicated.

I suppose there could've been a function that returned the implementation version of fsin/fcos/etc, which was incremented whenever Intel changed the implementation. That way benchmarks could test against that particular version. But it'd be hard to come up with a precise, consistent way to implement that. For example, do you increment the version whenever any transcendental function changes? If you do that, then you get benchmark code riddled with special cases for very specific versions, which someone has to constantly look up in documentation somewhere. You'd basically have to memorize the effect of every version increment. On the other hand, you could implement a more elaborate scheme, like having a version number for each of the transcendental functions, but then you'd either need to hardcode the number of version numbers returned by the API, or expose a way to add to the total number of version numbers in case someone wanted to add another function, which is extra API complexity.

I'm not necessarily arguing against having some kind of process by which apps could have the implementation of sin/cos/etc changed, just explaining how it gets complicated pretty quickly.


One thought on version number is that the version number could be the product of prime powers.

If function 1 is on version 5, function 2 on version 3, function 3 on version 7, and function 4 on version 1, we can encode this as 2^5 * 3^3 * 5^7 * 7^1 = 472500000. This lets us unambiguously increment orthogonal values while still keeping them represented as a single number. We could even easily add a function 5 by multiplying the number by 11.

One problem is that it's not super dense (storing those 4 values takes ~29 bits), but a 128bit or 256bit number would store a larger value than is likely needed by most applications.

One benefit is that software can check for the functions that it's interested in (as long as we can agree on what prime corresponds to what function - a relatively mild standard, if we just use order of introduction) while ignoring any portion of the result it doesn't understand.
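For illustration, here's a minimal sketch of the encode/decode in C (the function-to-prime assignment and overflow handling are hand-waved; this is just the scheme above, not a proposal for a real API):

    #include <stdio.h>

    /* Primes assigned to functions by order of introduction (hypothetical). */
    static const unsigned long long primes[] = {2, 3, 5, 7, 11};

    /* Encode: multiply prime[i]^version[i] for every function. */
    static unsigned long long encode(const unsigned *versions, int n) {
        unsigned long long code = 1;
        for (int i = 0; i < n; i++)
            for (unsigned v = 0; v < versions[i]; v++)
                code *= primes[i];      /* no overflow check in this sketch */
        return code;
    }

    /* Decode one function's version: count how often its prime divides the code. */
    static unsigned version_of(unsigned long long code, int fn) {
        unsigned v = 0;
        while (code % primes[fn] == 0) { code /= primes[fn]; v++; }
        return v;
    }

    int main(void) {
        unsigned versions[] = {5, 3, 7, 1};
        unsigned long long code = encode(versions, 4);
        printf("%llu\n", code);                                  /* 472500000, as above */
        printf("fn 3 is at version %u\n", version_of(code, 2));  /* 7 */
        return 0;
    }

Software that only knows about the first few primes can still decode those and ignore the rest, which is the compatibility property described above.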


> Storing those 4 values takes ~29 bits.

Just storing those 4 values in 4 bits is more efficient.

Checking that a bit is set is also a simple AND + CMP. Which beats out a DIV + CMP.

Sorry I just had not considered this isn't common knowledge outside C or C++ land.


> Just storing those 4 values in 4 bits is more efficient.

That doesn't really work for version numbers > 1. They're not just flags.


How do you store a version number in a single bit?


Well, bit n represents version n: if that bit is set, the version is supported; otherwise it's not.


A few bits per value (depending how big you predict they will get). One bit per value means you can only increment version of each function once :)


That's a crazy scheme on a computer.

Just store, say, a byte per value, and concat them into your 128 bit number.
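A minimal sketch of that layout in C (the 16-byte/128-bit size is just the figure from above):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* One byte per function version, packed into a 128-bit (16-byte) blob. */
        uint8_t versions[16] = {5, 3, 7, 1};   /* fn 1..4; remaining slots stay 0 */
        versions[4] = 2;                       /* adding a 5th function later is trivial */
        printf("fn 3 is at version %u\n", versions[2]);
        return 0;
    }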


It's easy to come up with much more compact schemes that are still completely generic. For example, to represent a variable length number use xxx0, 0xxxxxx1, 0xxxxxx1xxxxxx1, etc. Normally you need a pair of these to specify (version,function). But if we assume functions are packed, you can have a run-length scheme (run-length,starting-function-number).

So "function 1 is on version 5, function 2 on version 3, function 3 on version 7, and function 4 on version 1" is four functions starting with 1 = (1000,0010),1010,0110,1110,0010 = 24 bits.

It gets better quickly with larger numbers of functions.

BTW, the size of the prime scheme is log2(nthprime(function))*version bits for each function. If you don't know ahead of time which functions there might be, then you have to do a prime factorization, which is a hard problem. I guess if you used really large numbers you could have a cryptographically secure way of indicating which versions of which functions are in your library.


This is called Gödel numbering (http://en.wikipedia.org/wiki/Godel_numbering).


> One problem is that it's not super dense...

If you start with version 0 (and why not?), I think the Fundamental Theorem of Arithmetic implies maximum possible density?


You have maximum density for version 0, but when versions get sufficiently high relative to the number of functions, there are sequences of bits with no valid interpretation, because they represent primes (or multiples of primes) greater than the number of functions. Also, increasing the versions of functions represented by larger primes requires more bits than increasing the versions of functions represented by smaller primes.

I think I would just represent the version of each function with a byte.


D'oh! Yes, any possible library contains only a finite number of functions to version. Even if we order the functions in decreasing order of update frequency, this scheme will overflow quickly if more than one or two functions update regularly.


NM, that's space-efficient for 0s, but only for 0 values. You're probably better off with something easy to decode, aka 2 bytes per function for an ID = 50,000 functions for 100kb, which is hardly an issue. And if you really need to have more than 65,000 versions of the same function, you can always deprecate it. And if you're really only tracking 10 functions, then that's still only 10 bytes, which is hardly an issue.

55th prime > 256. So incrementing the version number adds more than one full byte to your number.


I think the suggestion was to multiply the powers of primes. e.g.

fn 2 @ v1, fn 3 @ v1 = 2^1 * 3^1 = 6

fn 2 @ v2, fn 3 @ v1, fn 5 @ v3 = 2^2 * 3^1 * 5^3 = 2 * 2 * 3 * 5 * 5 * 5 = 1500


I guess that might be useful if the vast majority of things are on version 0. But it quickly gets out of hand.

EX: Multiplying by 2 adds a full bit to your number every time you increment it. And it just gets worse from there.

If you really want to store mostly zeros, you're probably better off just using a count of the functions > 0 and then either using a bit for each function to show it's > 0 or an index if things are sparse.


It's certainly true that it has a nasty growth curve (exponential in version number; probably worse in number of active terms).

I just think it's fun for low value things, and interesting if you're in the situation you have a huge feature set relative to the number of active features, since it incurs no penalty for fields that are 0. Any time you have about ~1% features being present, it can make a difference. Example: if you have about 5 of 500 fields with 16 possible values represented, storing all the 0s is expensive. Using a naive 4-bits-per-field, you end up with 2000 bits, whereas, using my method, you only use ~620. Even using a 500bit number to flag which fields are present and then listing just those fields in order only saves you about ~100 bits over my method.

Plus, I manage to spark off a discussion about math and the problem, rather than just whining about it being hard, which I consider to be a good thing.


What's wrong with indexes? 1024 = 10 bits * 5 = 50 bits, then 4 bits * 5 = 70 bits total.

Or just a bit flag for active (500 bits), then 4 bits * 5 for your count = 520 bits.

And if you really want to compress things, start with a count so the 500-bit flag is only used if your count is over 50.

PS: in your case you also need an index of the number of bits/bytes used.

Edit: If there is a reasonable chance your counts are greater than 1, just appending one bit per count vs. using higher exponents saves space, aka for 3 numbers append 100101 = first 1, then 3, then 2.


Of possible interest, I researched a related issue once and here's what I found:

The particular approximation used by the FPU instructions fsin and fcos is valid in the domain [-π/4, π/4]. Inputs outside this domain are reduced by taking the remainder when dividing by π/4 and then using trigonometric identities to fix up values in the other three quadrants (i.e. sin(x) = sin(π/2 – x) for x in [π/4, π/2]).

The Intel FPU is “off” because it uses an approximation for π/4 when performing the reduction. In the mid 90s, some engineers at AMD figured out how to do a more precise reduction and implemented this reduction in the K5 FPU. AMD went back to being bit-for-bit compatible with the x87 behavior in the very next generation of their CPUs, presumably because it either broke or appeared to break too many applications.
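To see why the precision of the reduction constant is the whole ballgame, here's a minimal C sketch. The two-term split of pi is an illustrative Cody-and-Waite-style trick, not what any particular FPU or libm actually uses, and the pi_lo value is just pi minus the nearest double, rounded to a double:

    #include <stdio.h>

    int main(void) {
        double x = 3.141592653589793;           /* the double closest to pi */

        /* Reduce with a 53-bit pi: the residual is exactly 0, so any
           polynomial applied afterwards evaluates sin as 0 - the
           information is gone. */
        double pi_hi = 3.141592653589793;
        double naive = x - pi_hi;

        /* Reduce with pi split across two doubles (pi ~ pi_hi + pi_lo):
           the tiny residual survives, and since x = pi + residual,
           sin(x) = -sin(residual) ~ -residual. */
        double pi_lo = 1.2246467991473532e-16;  /* pi - pi_hi, rounded to double */
        double split = (x - pi_hi) - pi_lo;

        printf("53-bit reduction residual:   %g\n", naive);   /* 0 */
        printf("two-term reduction residual: %g\n", split);   /* ~ -1.22e-16 */
        printf("so sin(x) ~ %g, not 0\n", -split);
        return 0;
    }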


It's not possible to implement a more precise fsin() without breaking apps?

One scenario I can imagine an app breaking is if someone executed fsin(x), pasted the result into their code, and then tested some other float against it using the == operator.

I suppose a more common breakage scenario would be if someone did something like floor(x + fsin(x)), since that'd very slightly change where the floor "triggers" with respect to the input. But how could that cause an app to break? Every scenario I can think of comes back to someone using == to test floats, which everyone knows is a no-no for dynamic results, such as the output of a function like fsin().

I guess a programmer wouldn't really expect the result of "x + fsin(x)" to ever change for a certain value of x, so maybe they executed it with that certain value, stored the result, and did an equality comparison later, which would always work unless the implementation of fsin() changed, and the thought process went something like, "This should be reliable because realistically the implementation of fsin() won't change."

Can anyone think of some obvious way that making fsin() more accurate could cause serious breakage in an app that isn't testing floats with the equality operator?


> It's not possible to implement a more precise fsin() without breaking apps?

It is, but it's A) slow as hell (it requires something like 384 bits of Pi to reduce properly) and B) nobody really cares.

Anybody who cares pulls out Cody and Waite and writes their own so that they know exactly what the errors and performance are. This is especially true for modern processors, which have vector units and whose mainline performance is probably just as good as their microcode performance.

Anybody who doesn't care is really looking for an approximation anyway, so they're not going to use the transcendentals because they're too slow.


It requires more than 1100 bits to do correct argument reduction for double, actually.

You're absolutely correct that a software implementation may be several times faster than the legacy x87 fsin instruction, while delivering well-rounded results. There shouldn't be a need to write your own implementation, however. High-quality library implementations are pretty widely available these days.


Agreed. I'm stunned that there is a compiler currently in existence that actually uses the built-in Intel transcendentals rather than their own library.


> One scenario I can imagine an app breaking is if someone executed fsin(x), pasted the result into their code, and then tested some other float against it using the == operator.

While this sounds grandiose, thanks to aggressive inlining this could be the case even if you wrote in your logic:

    sin(x) == sin(constant)
Or even the following depending on your choice of delta.

    fabs(sin(x) - sin(constant)) < delta


Is it even valid for a compiler to transform sin(constant) into a hardcoded constant embedded into the executable?

I guess since sin(x) is computed by an FPU in hardware, it's inherently "dynamically linked." But in that case, why is it valid for a compiler to transform sin(constant) into a hardcoded result? Is there some standard somewhere which says that the implementation of sin/cos/etc can't ever change, and therefore it's valid for the compiler to do that? Or do compilers just make an assumption that the implementation of sin(x) won't ever change?


Your mistake is in assuming that the language C and its standard library are somehow defined in terms of some hardware, and that there is somehow a mapping between C functions and operators and CPU instructions - there isn't. The standard specifies for what input values an operation is defined and which conditions have to be met by the result; nowhere does it say how an implementation is supposed to compute the result. That's for the compiler to decide - as far as the standard is concerned, the compiler may compile all arithmetic to only ANDs and NOTs, though most practical compilers tend to choose instructions that give the best performance for a given computation, or just compute the result at compile time where possible.


From the C standard:

    A floating expression may be contracted, that is, evaluated as though it were an atomic
    operation, thereby omitting rounding errors implied by the source code and the
    expression evaluation method. The FP_CONTRACT pragma in <math.h> provides a
    way to disallow contracted expressions. Otherwise, whether and how expressions are
    contracted is implementation-defined.


"but if it's dynamic, is it even valid for a compiler to transform sin(constant) into a hardcoded constant embedded into the executable?"

Yes.

I honestly don't remember all the rules here, but it does in fact do so, legally.


Yes - the standard that defines what the sin() function does is the C standard itself.

Just as a compiler is free to evaluate strlen("foo") at compile-time, it's free to evaluate sin(0.1) at compile-time.
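A tiny example of where that bites (illustrative only; whether folding actually happens depends on the compiler, optimization flags, and math settings):

    #include <math.h>
    #include <stdio.h>

    volatile double x = 0.1;   /* volatile keeps the compiler from folding sin(x) */

    int main(void) {
        double folded  = sin(0.1);   /* a candidate for compile-time evaluation */
        double runtime = sin(x);     /* forced through whatever sin() the runtime provides */
        /* If the compiler's compile-time sin differs from the runtime one
           (e.g. a correctly rounded libm vs. x87 fsin), this can print 0. */
        printf("%d\n", folded == runtime);
        return 0;
    }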


It breaks binary compatibility. The compiler substitutes some value (e.g. the broken value), so if the sin(x) evaluation changes (to the correct one), then binary execution will fail.


Why would the compiler substitute the broken value unless the compiler was broken?

Compilers generally don't use specialized local architecture instructions to compute their results, because they can't (it would break cross-compiling or targeting different platforms in a lot of cases). In fact, it's more likely the other way around:

The compiler substitutes the proper value; actual calls to sin(x) on the local platform give the wrong answer.


One case that comes to mind is multiplayer games that rely on each participant running identical inputs on identical simulation code producing identical outputs. Doing so on floating point inputs is arguably madness, especially using vendor-provided functions or instructions as part of the process, but that doesn't mean it doesn't get done. Small differences can accumulate and desync the game rapidly.


IEEE floats are (nominally) deterministic in part, although that part [0] doesn't happen to include the transcendental functions.

You are right, though, that there's a lot of software out there that assumes determinism without being anywhere near sufficiently circumspect.

[0] http://randomascii.wordpress.com/2013/07/16/floating-point-d...


I expect that Microsoft could get some angry customers if Excel's computations varied with the CPU it runs on. Imagine discussing a spreadsheet with a colleague and seeing wildly different data.

Some people would say "That's what you deserve if you use an ill-conditioned algorithm or have an ill-conditioned problem", but most would not even understand that statement.


But doesn’t that already happen? Didn’t we see that when the Pentium flaw was discovered? In fact, I thought there was an Excel spreadsheet you could download and run to test whether you had the flaw or not.


For this argument, the Pentium flaw is not the best example. As long as all CPUs were equally flawed, everything was fine, if you define 'fine' as 'we see the same results on all systems'.

I guess Excel will have the problem in some cases for rare CPUs. For 'common' ones, I expect that Excel contains workarounds for errata that might affect its computations.


Well there's a difference between trusting the CPU to get arithmetic right and trusting the CPU to get sin(x) right (or to just know it gets it wrong but not caring).


Is the change really big enough that you'd get "wildly different data" in a normal situation?


I just noticed that Bruce has moved from Valve to Google... Interesting!


There is a mathematical reason why this is not completely horrible, even though it would catch you off guard. When x is approximately pi, the function x -> sin(x) is ill-conditioned. [0] So the relative error between the computed value ybar=fl(sin(x)) and the true value y=sin(x) is (ybar-y)/y and is quite large.

However: you can find a number xbar that is very close to x, in the sense that (xbar-x)/x is very small, such that the computed "inaccurate" value ybar=fl(sin(x)) matches sin(xbar) exactly. Thus this procedure is backward-stable, but not forward-stable. For "typical" numerically-stable algorithms, this might not be such an issue.

Very interesting writeup, anyway. I also found this post by James Gosling [1] describing the same issue, and the slower software workaround implemented in the JVM; the post is from 2005. The paper that explains the K5's implementation is "The K5 transcendental functions" by Lynch, Ahmed, Schulte, Callaway, Tisdale, but it is behind an academic paywall.

[0] http://en.wikipedia.org/wiki/Condition_number

[1] https://blogs.oracle.com/jag/entry/transcendental_meditation


https://software.intel.com/en-us/forums/topic/289702

"Shane Story (Intel) Sun, 06/27/2010 - 11:44

(...) rigorous error bounds for most of the x87 transcendental functions (f2xm1, fsincos, fyl2x...) were provided when the Pentium processor released. I recall an appendix in the orginal Pentium manual which showed a collection of ulp (unit in the last place) charts - don't recall the upper bound for all of them but it was something like 1.8 ulps. This would be for double-extended 80 bit fp inputs. Results for the fsin/fcos were for arguments < pi/4 because for the reduction for larger inputs, a 66-bit value of pi (or pi/2) was used."

It's really a documentation bug. Sometimes important details get lost in documentation rewrites.


Related: here is a very good article and comments on estimating transcendental functions: http://www.coranac.com/2009/07/sines/

I'm annoyed that one of the links in the comments is broken: there was a page where you entered your function and it provided a Chebyshev polynomial approximating it to a particular precision.

EDIT: well here is another one: http://goddard.net.nz/files/js/chebyshev/cheby.html
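For a flavour of what those generated polynomials look like, here's a bare-bones C version using plain Taylor coefficients; a real Chebyshev/minimax fit, as in the articles above, would redistribute the error across the interval for a smaller worst case:

    #include <math.h>
    #include <stdio.h>

    /* sin(x) on roughly [-pi/4, pi/4]: x - x^3/3! + x^5/5! - x^7/7!,
       evaluated with Horner's rule. Worst-case error is about 3e-7 at
       the ends of the interval (the first dropped Taylor term). */
    static double sin_poly(double x) {
        double x2 = x * x;
        return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120 + x2 * (-1.0/5040))));
    }

    int main(void) {
        printf("%.17g vs %.17g\n", sin_poly(0.5), sin(0.5));
        return 0;
    }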


There is some interesting info here as well on how to compute most of libm (including transcendental functions) in a correctly-rounded/precise way: http://lipforge.ens-lyon.fr/frs/download.php/153/crlibm-1.0b...


Also, the old standard reference for this was "Computer Approximations" by Hart and Cheney


> Note that sin(double(pi)) should not return zero. sin(pi) is, mathematically, zero, but since float/double/long-double can’t represent pi it would actually be incorrect for them to return zero when passed an approximation to pi.

Huh? sin(pi ± epsilon) better damn well return 0 if epsilon is closer to 0 than the minimum representable floating-point value. Otherwise the processor is rounding incorrectly.

(I realize epsilon in this case is likely to be much larger than the minimum representable float, but it's not guaranteed and thus the article's reasoning is flawed.)


The article is correct. double(pi) is exactly 884279719003555/281474976710656, which is a perfectly valid number to apply the sin function to. sin(884279719003555/281474976710656) is a transcendental number close to 1.2246467991473532e-16. In fact, that is the closest 64-bit float value to the true value of sin(884279719003555/281474976710656) and thus the best answer that the sin function can give. There is no way to ask the sin(double x) function what the sine of the transcendental number π is since the argument must be a double and there's no double value for π. Since the sin function is not in the business of guessing which of the uncountably many real numbers you really wanted to take the sine of, it can only give you the best answer for the value you actually passed it.
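You can check this yourself. Both the numerator and the denominator (2^48) are exactly representable, so the division reproduces double(pi) exactly; the last line assumes a reasonably accurate libm:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 884279719003555.0 / 281474976710656.0;    /* both operands exact */
        printf("x equals the double(pi) literal: %d\n", x == 3.141592653589793);
        printf("sin(x) = %.17g\n", sin(x));   /* ~1.2246467991473532e-16, not 0 */
        return 0;
    }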


Actually, epsilon is guaranteed to be much larger than the minimum representable float: double(pi) is 1.57... * 2^1, so double(pi) - pi is going to be on the order of 2^-m, where m is the size of the mantissa. The only way that double(pi) - pi could be much smaller is if pi magically had a lot of 0s in its binary representation (it doesn't).

This caught me off-guard at first as well, but it's really quite logical.


> The only way that double(pi) - pi could be much smaller is if pi magically had a lot of 0s in its binary representation (it doesn't).

Yes, this is what I was getting at. Obviously the statement is true for pi and the majority of reals; I just don't understand why it should be a rule as the article implies.


No, it is not pi that must get rounded, but sin(pi). So in sin(double(pi)), first double(pi) is computed, which is not exactly pi, then an approximation to sin(double(pi)) is computed, which is not exactly zero. It would be absolutely wrong to round a non-zero number to zero.


Yes, we're both talking about sin(pi) (hence why I claimed "sin(pi ± epsilon) better damn well return 0").

I take issue with this assertion:

> It would be absolutely wrong to round a non-zero number to zero.

Why? I know IEEE floats have subnormals and subnormals can be VERY small, but if the error is less than half the minimum representable float, why should it be rounded away from zero?

Yes, zero is "special" because you can't divide by it, but for stability, my understanding is that you shouldn't be dividing by anything that can be so close to zero as to potentially round to it anyway.


sin(pi+eps) is a quantity on the order of eps. Not only is it non-zero, it is also so far from zero that it can be represented quite accurately with a floating-point number (which is a separate concern from the accuracy with which it is computed).

The relative error introduced in rounding 1e-16 down to 0 is 100%, or approximately 1e16 ulp, which is why that would be unacceptable. Floating point arithmetic cares about relative errors much much more than absolute.

Also, while dividing by zero is a problem, dividing by 1e-16 should not be a problem, at all. 1e-16 is a perfectly ordinary representable number.


> [eps] is also so far from zero that it can be represented quite accurately with a floating-point number

What is missing from your claim is where you say, "Specifically, the delta between pi and its floating-point approximation is greater than half the minimum representable float." This is almost certainly true of pi, but it is NOT true of all real numbers! 1+(1e-1000), for example, fails this test. You cannot – logically! – make the claim that the error of any close approximations of real numbers should not round to zero because the error is not small enough; clearly in many cases it is!

Yes, dividing by 1e-16 is not a problem! My question is whether this is a desirable thing to do in an algorithm that should be stable.

> Floating point arithmetic cares about relative errors much much more than absolute.

This hints more at a reason, but why is this true? Yes, I know floating-point numbers are spaced logarithmically, and yes, relative error is obviously the correct measure when dealing with multiplication and division. But if you are not dividing due to stability concerns, so your value eventually finds its way to an addition, isn't absolute error more important?


Nobody here besides you is talking about numbers that are less than half the minimum representable float.

Edit: Oh my god now I get what you are saying. There are plenty of implicit reasons why the value would end up being non-zero -- in particular, any unit-scale continuous function would have to be artificially toxic for epsilon to be smaller than the smallest float. There is no reason for the article to point out that nit.


Well, thanks at least for taking the time to understand my point. Though I don't consider it a "nit" when the article implies that "never round to zero" is some sort of rule (such as "round toward even LSBs" is), as opposed to a consequence of the properties of floats and the constants and functions involved.


The sign-exponent-mantissa nature of IEEE format means (basically by design) that you have a much finer resolution near 0 (smaller absolute value) than you do away from 0 (larger absolute value).

We're talking about sin(x) with x ~ pi. In this situation the output's absolute value is much much smaller than the input's absolute value. So the output will have much finer resolution than the input.

The output being appreciably non-zero comes from the output having much higher resolution than the input. The output wouldn't be anywhere near small enough to be subnormal.
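Concretely, a quick check of the spacing (using the sin(double(pi)) value quoted elsewhere in the thread):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double near_pi  = 3.141592653589793;
        double near_sin = 1.2246467991473532e-16;   /* ~ sin(double(pi)) */
        printf("spacing near pi:      %g\n", nextafter(near_pi, INFINITY) - near_pi);    /* ~4.4e-16 */
        printf("spacing near sin(pi): %g\n", nextafter(near_sin, INFINITY) - near_sin);  /* ~2.5e-32 */
        return 0;
    }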


Wow, when doing argument reduction from the 80-bit long-double format there will be almost complete cancellation? Damn.


I wonder what a great standard for this is. Incidentally, do TI or HP calculators give good results?


Wolfram Alpha (and Mathematica) are generally completely accurate at the cost of being less efficient. I use those to check most things I want to verify.


Nowhere near double precision.


This is a storm in a tea cup. "The absolute error ... was fairly constant at about 4e-21, which is quite small. For many purposes this will not matter"

Good programmers know that most floating point calculations introduce small errors and find ways to make their programs cope.


That's pretty bad for a double that you're about to do a bunch of operations with. It's pretty easy to throw numerical precision away.


> However this technique relies on printing the value of the double-precision approximation of pi to 33 digits, and up to VC++ 2013 this doesn’t work

How can an IDE prevent printing a number? If all else fails, you can reinterpret_cast it to integer and handle the bits yourself?


First, a nitpick: VC++ is a compiler toolchain (including libraries), not just an IDE.

Also, the next sentence:

    So, you either need *custom printing code* or you need to wait for Dev 14 which has improved printing code.
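If you do want the do-it-yourself route from the question above, the portable way is memcpy rather than a raw pointer cast. A small sketch:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        double pi = 3.141592653589793;
        uint64_t bits;
        memcpy(&bits, &pi, sizeof bits);             /* avoids aliasing problems of a cast */
        printf("bits: 0x%016" PRIx64 "\n", bits);    /* 0x400921fb54442d18 */
        printf("%.33g\n", pi);                       /* only as good as the runtime's printf */
        return 0;
    }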


I'm not worthy, I'm not worthy...

Edit:

The intel engineer was probably smart, but then there's this guy. And then there's the commenters here. Lesson: There's always someone out there smarter than you.


The lesson is actually "This was designed 35+ years ago in different constraints."

Memory and ROM were small and expensive and processors were in-order and slow. Putting the transcendental in microcode made sense, back then.

Now, memory is huge; processors are superscalar and use speculative execution; vector units exist; and everybody ignores the built-in transcendentals, but Intel can't dump them for backwards compatibility.


Yes, but from the text it seemed like fsin was broken to begin with:

> It is surprising that fsin is so inaccurate. I could perhaps forgive it for being inaccurate for extremely large inputs (which it is) but it is hard to forgive it for being so inaccurate on pi which is, ultimately, a very ‘normal’ input to the sin() function.


It's hard for the constraints of old to justify lying about your accuracy in the documentation you produce, though...?


Possible solution: revive the Indiana Pi Bill[0] (assuming 3.2 has an exact representation in binary?)

[0]:http://en.wikipedia.org/wiki/Indiana_Pi_Bill


> assuming 3.2 has an exact representation in binary?

It does not have an exact representation in binary.


Easy solution: amend the value to 3.25, or 3.125



