Intel Underestimates Error Bounds by 1.3 quintillion (2014) (randomascii.wordpress.com)
175 points by sgillen 3 days ago | 39 comments

Worth noting this is for the x87 fsin instruction. For x86_64 typically a library function is called (which in turn uses SSE2 math instead of x87) rather than generating inline code.

More "pervasively" than merely typically. The x87 registers still exist in a technical sense, but except for compatibility in legacy binaries those instructions are never used.

They are used by gcc on x86_64 Ubuntu when dealing with long double. I just tried it, with a trivial function that adds arguments and returns the resulting sum.

I think this is true on any x86_64 platform for which sizeof(long double) exceeds 8.
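A quick way to check that condition on a given machine, as a small Python sketch. It only reports what the platform's C compiler ABI uses for `long double`; it does not prove which instructions gcc emits. The variable names are made up for illustration.

```python
import ctypes

# Storage sizes as seen by the platform's C ABI (padding included).
double_bytes = ctypes.sizeof(ctypes.c_double)
long_double_bytes = ctypes.sizeof(ctypes.c_longdouble)

# On x86_64 Linux the 80-bit x87 format is usually padded to 16 bytes;
# on some other platforms long double is just a plain 8-byte double.
wider_than_double = long_double_bytes > 8

print(f"sizeof(double)      = {double_bytes}")
print(f"sizeof(long double) = {long_double_bytes}")
print(f"long double wider than 8 bytes: {wider_than_double}")
```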

Sure, because "long double" is fundamentally an x86-specific compiler extension[1] designed to exploit the 80 bit FPU precision in the 8087 architecture. Nothing for an SSE (or any other 64 bit IEEE FPU) will be using long double, except for compatibility.

[1] Yeah, I think there's a software emulation layer to give you 128 bits on other platforms or somesuch. Again, not important nor relevant to a discussion about FSIN.

I’m puzzled by what you mean here: you refer to an x87 instruction, which makes me think it is assembly/binary level and then you’re saying that a library function is called. Where is the magic? Is there an x86 feature where you can instruct the CPU to jump to an address when a specific instruction is encountered? Or are you maybe using “library” for microcode in the CPU itself?

Edit: or are you referring to fsin at the C language level (in which case it is hard to see how x87 relates)?

There is an extension to the original x86 instruction set called x87 which has an instruction called fsin. Unfortunately, it's very slow and somewhat inaccurate. On newer processors it's actually faster to call a function using SSE2 the normal way instead of emitting a single fsin instruction.

To somewhat clarify, x87 is the instruction set of the Intel 8087 and its successors, which were floating point coprocessors designed to work with x86 family processors. The coprocessor was eventually integrated into x86 chips. The design was, in retrospect, rather poor, and so SSE2 was designed with (mostly) replacing x87 in mind (along with narrow SIMD operations), and since then x87 instructions have largely been second-class.

The integration of the x87 FPU (floating point unit) occurred with the 80486 series chips.

Why wasn't the x87 instruction updated to use the implementation of the SSE2 instruction?

In order for that to work well, I have a hunch that they would need to do some very dirty tricks with register renaming, and add an 80-bit ("long double") mode to the SSE hardware, and I suspect that's why they are not as fast as each other (whether that's as a result of fully separate implementation, or very conservative microcode [possibly involving fencing and swapping registers, emulation of arithmetic quirks of x87]) in practice.

Though, disclaimer: I am not even really an amateur at this level of analysis, so take what I've said with its very own salt lick.

Because then it would no longer comply with the x87 specification.

No. SSE and SSE2 both simply don't have any instruction to calculate sin, also not for the reduced range, like in x87.

They could microcode it, it would just be a fairly large sequence, and might depend on a table.

What I wanted to point out is that neither the SSE nor the SSE2 instruction set has a "sine" function at all, which was, the way I understood it, implied by the question asked ("Why wasn't the x87 instruction updated to use the implementation of the SSE2 instruction?")

What exists are different libraries that calculate the "sine" better than fsin from x87 instruction set, and for all of these, it is actually not important that they use SSE or SSE2 -- it is completely possible to implement the better sine algorithms with the basic x87 functions too, or with anything that doesn't have SSE and SSE2. The effect of not having the sine function at all in SSE and SSE2 sets is that if you decide to use only these instructions for your x86_64 library, you have to implement everything with the basic instructions that you have, that is, you'd surely have to use some library code, even in the range in which fsin would suffice.

However, if the question was actually "why wasn't the implementation of the x87 fsin instruction updated to be simply better" (which certainly could be implemented in the microcode), the answer is that AMD apparently tried exactly that with their K5 (1996) and then, for later processors, had to revert to the "worse" behavior to keep compatibility with existing programs. This is covered in the original article or in its comments.

The x87 instructions are both single and double precision, 32 and 64 bits respectively. The design motivation was around engineering/scientific calculations. SSE instructions are mostly single precision floating point motivated by audio and graphics for multi-media.

> The x87 instructions are both single and double precision, 32 and 64 bits respectively.

x87 supports 80-bits calculations, others are the result of setting the configuration register to shorter widths.

That's one of its advantages over both SSE and SSE2. There are still some use cases where it's reasonable to use x87.

SSE has single precision (32-bits) instructions only.

SSE2 has double precision (64-bits) instructions.

Sorry for not being clear. I was giving an historic rationale for why x87 didn't change to use SSE. The x87 instruction set has a much longer history than SSE and came out of Intel. SSE started at AMD and the multimedia focus was a competitive advantage at introduction while x87 was becoming IEEE 754.

> I was giving an historic rationale for why x87 didn't change to use SSE.

When stated as "why x87 didn't change to use SSE" the question is like asking "why a dog didn't change to use a cat": both x87 and SSE are instruction sets, defined differently from the start, and given that, one can't "use" the other.

The original question, however, referred to the fsin instruction of the x87 instruction set, but also reflected some confusion, as neither SSE nor SSE2 ever had an instruction to calculate sine.

And the answer in which you apparently gave a "historic rationale" had incorrect statements, the correction of which is: 1) x87 didn't "have 32 and 64 bits", the x87 was ambitiously designed to do 80-bit precise calculations with the shorter results as the additional modes. 2) SSE was 32-bit only, but SSE2 added 64-bit instructions too. Still, x87 could not "use the implementation" of fsin from SSE2 as SSE2 doesn't provide the sine function. Finally, if the question was why wasn't the "fsin" ever improved, please see the other responses here, including mine.

By "SSE" I mean SSE. Sorry for the confusion.

Pay attention to the "References" section (including references to Linus Torvalds' messages on the GCC mailing list, some bugs related to this issue, etc.)

Yes. It obviously shows where Linus steps out of the bounds in which he is an expert. He blindly believed Intel's false documentation as early as 2001, and, if I understood correctly, the result is that glibc had a bad FP sin for quite a while, since the maintainers also didn't try to check anything, and:

- either there was no user who seriously used it (to detect the error)

- or the user who seriously used the sin and detected the error didn't pass through the maintainers' "wall blocking the casual contributions" (a major maintainer for a few years was kind of legendary for being very dismissive and hard to communicate with).

- or the detector of the error remained silent.

It seems that Bruce was the first to manage to induce the change in glibc, and he needed to reach Intel first for that.

Linus being wrong in believing the documentation instead of checking it himself:


I'm curious when Java fixed their bad assumptions:


It seems it was much faster, as Gosling already knew the truth in 2005:


"the x87 fsin/fcos use a particular approximation to pi, which effectively means the period of the function is changed, which can lead to large errors outside [-pi/4, pi/4]."

"What we do in the JVM on x86 is moderately obvious: we range check the argument, and if it's outside the range [-pi/4, pi/4] we do the precise range reduction by hand, and then call fsin."

Gosling actually quotes "Joe Darcy, our local Floating Point God."
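A minimal Python sketch of the scheme described in that quote, for illustration only: range-check the argument, reduce it by hand with a more precise pi, then pass the small remainder to the basic routine. Here `math.sin` and `math.cos` stand in for fsin/fcos, `reduced_sin` is a hypothetical name, and the 50-digit pi is an assumption that only suffices for moderately sized arguments. This is not the JVM's actual code.

```python
from fractions import Fraction
import math

# pi to 50 decimal places; enough for moderate arguments, not for huge ones.
PI = Fraction("3.14159265358979323846264338327950288419716939937510")
HALF_PI = PI / 2


def reduced_sin(x: float) -> float:
    """sin(x) via an explicit range check and exact-rational range reduction."""
    # Inside [-pi/4, pi/4] the basic routine is trusted as-is.
    if abs(x) <= math.pi / 4:
        return math.sin(x)

    xf = Fraction(x)                # the double's exact value
    k = round(xf / HALF_PI)         # nearest multiple of pi/2
    r = float(xf - k * HALF_PI)     # remainder, now within [-pi/4, pi/4]

    # sin(r + k*pi/2) cycles through sin, cos, -sin, -cos.
    quadrant = k % 4
    if quadrant == 0:
        return math.sin(r)
    if quadrant == 1:
        return math.cos(r)
    if quadrant == 2:
        return -math.sin(r)
    return -math.cos(r)


print(reduced_sin(math.pi))
print(reduced_sin(1e6), math.sin(1e6))
```

The subtraction is done in exact rational arithmetic, so the only rounding before the final call is the conversion of the remainder back to a double.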

Joseph D. Darcy was also a co-author of:

"How Java's Floating-Point Hurts Everyone Everywhere" with W. Kahan (http://www.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf)

and his master's thesis was:

"Borneo: Adding IEEE 754 floating point support to Java" http://sonic.net/~jddarcy/Borneo/

"a dialect of the Java language designed to have true support for the IEEE 754 floating point standard." "Unfortunately, Java's specification creates several problems for numerical computation" ... "useful IEEE 754 features are either explicitly forbidden or omitted from the Java specification."

To be fair he did say "assuming they don't have an fdiv-like bug ;)"

> he did say "assuming they don't have an fdiv-like bug

The actual situation was not too similar. The fsin instruction, which is meant to speed up the calculation and return the correct bits within a specific range, actually behaves as intended: its designers assumed that users would know its limitation, namely that it is not the "general" function, and the "bug" was just in the misleading documentation. With fdiv, you had an instruction that didn't behave as intended, being wrong for some very specific inputs and not only outside of the designed range.

> It seems that Bruce was the first managing to induce the change in glibc

glibc changed their implementation before I reported on this.

So is it wrong for me to interpret this as an imprecision of a partial ULP on the input number, thereby making the accuracy of the output just fine?

Is there a use case for such a numerically unstable function use except as a party trick to get pi?

The input number is, unavoidably, not a perfect representation of pi. However the fsin instruction's job is to calculate the sin of the actual input number, not try to guess what input was intended.

So it doesn't make sense to interpret this as an imprecision of the input number - the fsin instruction failed to give a result which was as accurate as its documentation promised. The documentation has been fixed now.

I don't know how often this actually matters, but it's worth noting that library implementations of sin() don't use fsin anymore, and I think that is partially because of this flaw.
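A rough Python sketch of the size of the discrepancy, using exact rational arithmetic instead of the real instruction. It assumes the internal constant is pi truncated to 66 significant bits, as the linked article reports; `PI_66` and the other names are made up for illustration, and the "emulated" value only models the range reduction, not the rest of fsin.

```python
from fractions import Fraction
import math

# pi to 50 decimal places, used as the "true" value for this sketch.
PI = Fraction("3.14159265358979323846264338327950288419716939937510")

# Assumption: pi truncated to 66 significant bits
# (2 integer bits + 64 fraction bits).
PI_66 = Fraction(math.floor(PI * 2**64), 2**64)

x = Fraction(math.pi)         # the double nearest pi, taken exactly

# Near pi, sin(x) = sin(pi - x), and pi - x is tiny, so sin(x) ~ pi - x.
exact = float(PI - x)         # what a correct sin(double(pi)) should return
emulated = float(PI_66 - x)   # what reducing with the 66-bit constant yields

rel_error = abs(emulated - exact) / exact
print(f"exact    {exact!r}")
print(f"emulated {emulated!r}")
print(f"relative error {rel_error:.1e}")
```

The two results agree only in their first few significant digits, even though the absolute difference is tiny.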

> However the fsin instruction's job is to calculate the sin of the actual input number, not try to guess what input was intended.

Not guessing what was 'intended', just treating a number like "1.234567" as "any number that could have rounded to 1.234567000, whatever is most convenient". Does it cause any practical problems to do this? How often do people actually expect/need a double to be treated as having infinite precision, rather than 53 bits or slightly more than 53 bits of precision?
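To put numbers on that question, here is a small Python sketch (requires Python 3.9+ for `math.ulp`). It computes the range of sines over every real number that rounds to the double nearest pi, and checks whether a result reduced with a 66-bit pi falls inside it. The 66-bit figure is an assumption taken from the linked article, and the variable names are made up. This only illustrates the "any number that could have rounded to x" view; it does not settle whether that view is the right contract for fsin.

```python
from fractions import Fraction
import math

PI = Fraction("3.14159265358979323846264338327950288419716939937510")
# Assumption: internal pi truncated to 66 significant bits.
PI_66 = Fraction(math.floor(PI * 2**64), 2**64)

x = Fraction(math.pi)                        # the double nearest pi
half_ulp = Fraction(math.ulp(math.pi)) / 2   # half the spacing of doubles at pi

# sin(t) ~ pi - t near pi, so the sines of all reals rounding to x span:
lowest = float(PI - (x + half_ulp))
highest = float(PI - (x - half_ulp))

correct = float(PI - x)          # sine of the double itself
fsin_like = float(PI_66 - x)     # result of reducing with the 66-bit pi

print(f"sines of inputs rounding to x: [{lowest!r}, {highest!r}]")
print(f"correct for x itself: {correct!r}")
print(f"66-bit reduction:     {fsin_like!r}")
print(f"inside the interval:  {lowest < fsin_like < highest}")
```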

Um, it was known at least as far back as the AMD K5 (which has something like 300+ bits of Pi) that argument reduction on x86 didn't have enough bits of Pi.

You need a lot of bits of pi if you want to have an even vaguely accurate result when you do sin(MAX_DOUBLE).
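A back-of-the-envelope Python sketch of why. If x is about 2**e, then x contains about 2**e multiples of pi, so any error in pi is multiplied by roughly 2**e during reduction; to keep 53 good bits in the remainder, pi needs at least e + 53 bits. `min_pi_bits` is a hypothetical helper, and the figure is a rough lower bound: arguments that land very close to a multiple of pi/2 need additional guard bits.

```python
import math
import sys


def min_pi_bits(x: float, mantissa_bits: int = 53) -> int:
    """Rough lower bound on the bits of pi needed to reduce x accurately."""
    _, exponent = math.frexp(x)   # x = m * 2**exponent with 0.5 <= |m| < 1
    return max(exponent, 0) + mantissa_bits


for value in (1.0, 1e6, 1e22, sys.float_info.max):
    print(f"{value!r}: at least {min_pi_bits(value)} bits of pi")
```

For the largest double this comes out to over a thousand bits.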

"Not so much!" ~Intel motto

From 2014, FYI

Thanks, added.

What's that? Floating point math strikes again? /s

Let's build something better.

IEEE 754 is one of the rare successes of computer engineering and standardization.

Turns out it's just hard to implement continuous math on physical hardware.

I'm curious (and a bit skeptical) about any efforts that think they can do better.

John Gustafson has developed several promising alternatives, such as unums, posits and valids:


I also recommend his book, The End of Error - Unum Computing:


A transcript of a debate with William Kahan is here:


It will take many more years for hardware implementations of IEEE interval arithmetic to become usable... https://standards.ieee.org/findstds/standard/1788-2015.html

Intervals in interval arithmetic can rapidly go to infinity even when the problem is relatively well behaved (in matrix inversion, for example) and standard floating point arithmetic produces a useful answer.

What better thing do you have in mind? I'm not aware of any general-purpose floating point math replacement that is mature enough to be embedded in billions of processors.
