Consistency: How to defeat the purpose of IEEE floating point (2008) (yosefk.com)
38 points by aw1621107 on March 5, 2020 | 29 comments



> 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into "numerical analysis"

In the event that they "work great" (really, great?), it's only because:

- The code depends on properties of IEEE FP which were designed exactly so that it's harder for a casual user to shoot himself in the foot -- and these properties were intentionally designed into IEEE FP by people who DID invest a lot in "numerical analysis" and in the practical consequences of potential bad decisions.

- The code depends on libraries that were designed with much more effort than the author of the above statement can imagine.

In short, yes, we do need all features of IEEE FP. And to produce anything non-trivial one should indeed learn more about all that, and care.

> Summary: use SSE2 or SSE, and if you can't, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

That was, and I guess it hasn't changed, the default with Microsoft's compilers on Windows for decades already, and it's probably a sensible default for non-Microsoft scenarios too, especially when you need "consistency" across compilers, which matches the title of the article. Oh, and make sure that the compiler doesn't do any optimization that produces unstable results.
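
For what it's worth, a minimal sketch of that "production default" (the _controlfp_s call is MSVC/Windows-specific; with gcc or clang on 32-bit x86 the usual route is the -msse2 -mfpmath=sse flags instead):

    #include <float.h>

    /* Force the x87 control word to 53-bit precision so intermediates are
       rounded like 64-bit doubles instead of 80-bit extended. */
    void use_double_intermediates(void) {
        unsigned int current;
        _controlfp_s(&current, _PC_53, _MCW_PC);
    }

    int main(void) {
        use_double_intermediates();
        /* ... FP work now rounds intermediates to double precision ... */
        return 0;
    }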

That's about the "production" default. However, I still believe that during the development of anything non-trivial the evaluation of the results using different numbers of bits is worth doing.


One additional problem is that IEEE floating point fails to require that addition and multiplication be commutative.

"WHAT?", you say? Surely it has to be commutative!

Well, it is, except in cases where both operands are "NaN" (Not a Number). You see, there's not just one NaN, but many, with different "payloads", intended to indicate the source of the error leading to a NaN. The payload gets propagated through arithmetic. But what happens when both operands are NaN, with different payloads? The standard says that the result is one or the other of these NaNs, but leaves unspecified which.

The old Intel FPU chose the NaN with the larger payload, which gives results independent of the operand order. But SSE uses the payload from the first operand. And so we get non-commutative addition and multiplication.
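
You can see this for yourself with a small sketch like the one below. Which NaN's bits come out of each addition depends on the hardware (x87 vs. SSE) and on how the compiler evaluates the expression, so treat the output as illustrative:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Build a quiet NaN with a chosen payload in the low significand bits */
    static double nan_with_payload(uint64_t payload) {
        uint64_t bits = 0x7FF8000000000000ULL | payload;
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    static uint64_t bits_of(double d) {
        uint64_t b;
        memcpy(&b, &d, sizeof b);
        return b;
    }

    int main(void) {
        /* volatile so the additions happen at run time on this machine's FPU */
        volatile double a = nan_with_payload(1);
        volatile double b = nan_with_payload(2);
        printf("a + b -> %016llx\n", (unsigned long long)bits_of(a + b));
        printf("b + a -> %016llx\n", (unsigned long long)bits_of(b + a));
        return 0;
    }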

The compilers, of course, assume these operations are commutative, so the results are completely arbitrary.

One practical effect: In R, missing data - NA - is implemented as a NaN with a particular payload. So in R, if you write something like NA+sqrt(-1), you arbitrarily get either NA or NaN as the result, and you probably get the opposite for sqrt(-1)+NA. And both might vary depending on the context in which the computation occurs (eg, in vector arithmetic or not).


This is also an issue in video game programming, where this lack of consistency causes problems in the implementation of replays or lockstep networking. The core idea of both is to store/share the inputs for each frame, such that the game's state can be derived from them. Even small inconsistencies each frame can compound into large divergence over the sheer number of frames.

If you think this article is interesting, you may also be interested in learning about posits.

They are an alternative to floats with better precision for values near 1 in magnitude, which, the authors claim, makes them superior for things like machine learning. Relevant to this article is the fact that they are defined to be consistent, so if they become popular this will never be an issue again.

Here is an article from the authors of posits which explains their advantages. http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf

Here is a more nuanced look at posits, which explains their disadvantages. https://hal.inria.fr/hal-01959581v3/document


> Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field – you know, associativity, distributivity, the works.

Of course, this largely depends on how "YOLO" your compiler is. I believe GCC and Clang try reasonably hard to follow IEEE 754, while ICC is much more lax.


Compilers do no such thing: they do not assume that floating point + and * are associative, because they most definitely aren't. The roundoff errors differ between groupings, and the effect of this can be huge.

Consider (a+b)+c vs a+(b+c). Now suppose a is 1, b is 2.0^60, c is -b. The first expression evaluates to 0, the second evaluates to 1.
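
A minimal, compilable version of that example (the volatile qualifiers are only there to stop the compiler from folding the expressions at compile time):

    #include <stdio.h>

    int main(void) {
        volatile double a = 1.0;
        volatile double b = 1152921504606846976.0;   /* 2^60 */
        volatile double c = -1152921504606846976.0;  /* -2^60 */

        printf("(a + b) + c = %g\n", (a + b) + c);   /* 0: the 1 is rounded away in a + b */
        printf("a + (b + c) = %g\n", a + (b + c));   /* 1: b and c cancel exactly */
        return 0;
    }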

Scientific and engineering code is carefully designed for numerical stability and compilers must not mess this up.

gcc has a "fast math" flag you can use if you don't care about the accuracy of your results.


I don't understand your example, why would

(1 + 2.060) - 2.060 == 0

while

1 + (2.060 - 2.060) == 1?

Am I just misunderstanding what you wrote?


No, the site messed up my comment. b is supposed to be 2 raised to the 60th power. The two asterisks were removed. Let's try ^, b is 2^60, c is minus b. I edited my original comment.


There are flags for this, no? At least MSVC has them.

Most code survives being sloppy around this; some doesn't. We had some code which relied on the propagation of infinities, so we had to fiddle with some compiler options for that file to ensure the compiler wasn't being too clever.


Yeah, this is exactly the sort of thing I'd expect to get if I use the "fun, safe math optimizations" flag.



Many of the older, business-driven mainframe designs have hardware BCD instructions--for faithful/performant implementation of grade-school-style dollars-and-cents base-10 arithmetic.

On the other hand, a great deal of PC evolution has been driven by games--where performance is king. Hard to beat IEEE floating point on performance & storage efficiency!

Then there are the rusty sharp edges of x86, but that is life...

I wonder if `-O0` would solve the inconsistency? I don't particularly trust many compiler optimizations--too much temptation for a compiler writer to go performance-crazy, and start treating this computer voodoo like it was actual algebra.


> I don't particularly trust many compiler optimizations

They're pretty decent most of the time, unless you go do something undefined.


the basic math in his article wasn't undefined


How well does floating point work for 3D games/programs and gpus? That seems to be a very large category of floating point usage but I have no knowledge on whether it works well in that space. Would gpus be x% faster if they didn't have to do floating point, would games have more or less rendering problems without floating point?


Generally floating point works fine for most things. Every now and then it doesn't, and you have to be aware enough of its weaknesses to (a) detect this, (b) understand what the hell is going on, and (c) find a workaround for it.


Indeed. Had a fun issue once where a colleague had prematurely optimized an expression to (a * c * e) / (b * d * f); it was in a somewhat hot path, hence he wanted to eliminate divisions.

Turned out that in certain cases, the factors were all tiny and so both the numerator and denominator became denormalized and the division returned a NaN, ruining all further computations.

After a lot of debugging to find the source of the NaNs and to understand why exactly this happened, the solution was quite clear: simply expand it back out to (a/b) * (c/d) * (e/f).

This was because, due to the nature of the math being implemented, those ratios would always be roughly of order unity, even though each individual variable could potentially contain a very small number.

This cost us something like 0.1% in performance, but made the code handle all inputs without issue.
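
A toy reconstruction of the failure mode, with made-up magnitudes (the real code and values were of course different):

    #include <stdio.h>

    int main(void) {
        /* Every factor is tiny, but every ratio a/b, c/d, e/f is of order one */
        double a = 1e-200, c = 1e-200, e = 1e-200;
        double b = 2e-200, d = 2e-200, f = 2e-200;

        /* Both products underflow to 0, so the "optimized" form is 0/0 = NaN */
        printf("(a*c*e)/(b*d*f)   = %g\n", (a * c * e) / (b * d * f));

        /* Each ratio is 0.5, so the expanded form gives 0.125 as expected */
        printf("(a/b)*(c/d)*(e/f) = %g\n", (a / b) * (c / d) * (e / f));
        return 0;
    }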


This comes up a lot in calculating odds, or ratio of probabilities. How one implements these in code is a good indicator of how much experience a person has with real world scenarios. One of those shibboleths. Another telltale giveaway is accumulating the dot product of float32s in float32s.
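
For the dot-product shibboleth, a quick illustrative sketch of why the accumulator type matters (the sizes and values are made up):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 20000000;   /* 20 million products of 1.0f * 1.0f */
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 1.0f; }

        float  acc32 = 0.0f;   /* the telltale mistake */
        double acc64 = 0.0;    /* the usual fix */
        for (int i = 0; i < n; i++) {
            acc32 += x[i] * y[i];
            acc64 += (double)x[i] * y[i];
        }
        printf("float32 accumulator: %.0f\n", acc32);  /* stalls at 2^24 = 16777216 */
        printf("float64 accumulator: %.0f\n", acc64);  /* 20000000 */
        free(x); free(y);
        return 0;
    }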


Serious probability calculations are usually best made in log-space, turning multiplication into addition.
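
A tiny illustration with made-up numbers: a long product of small probabilities underflows to zero, while the sum of their logs stays well within range:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double p = 1.0, log_p = 0.0;
        for (int i = 0; i < 500; i++) {
            p *= 0.01;           /* eventually underflows to 0 */
            log_p += log(0.01);  /* stays comfortably representable */
        }
        printf("direct product: %g\n", p);      /* 0 */
        printf("log-space sum:  %g\n", log_p);  /* about -2302.6 */
        return 0;
    }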



See this is why we keep you maths people around. Sometimes all that abstract wizardry is really useful! :)


I just hope that wasn't bitter sarcasm.

I think this would have been a more relevant link

https://en.wikipedia.org/wiki/LogSumExp#log-sum-exp_trick_fo...


I had a similar problem: the multiplication results had to stay in the range 0 < x < 1, because beyond that some log operations produced NaN. At first I used if/else clamping, which slowed things down noticeably; in the end I applied a sigmoid, which sped things up a bit. I still don't know the exact cause, and I'm not brave enough to dive into that mess again.


Works great for most things. Things can get a bit complicated for scenes/models with large coordinates and 32 bit floats, but that's not something that fixed precision integers/decimals would solve. An easy way to solve these issues is to use floats for some parts of the rendering pipeline and doubles for others, e.g. double precision matrices on CPU side, then compute a combined world-view matrix where huge camera and huge vertex coordinates cancel each other out. You can then cast the double precision world-view matrix to a single precision matrix and use that for rendering on the GPU.
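
A much-simplified sketch of that idea, reduced to a single coordinate (the real thing involves full matrices, but the principle is the same):

    #include <stdio.h>

    int main(void) {
        /* Object and camera both sit ~10^7 units from the origin (illustrative) */
        double object_x = 10000000.25;
        double camera_x = 10000000.00;

        /* Narrow to float first: the fractional part is already gone */
        float bad = (float)object_x - (float)camera_x;

        /* Subtract in double first (e.g. while building the world-view matrix
           on the CPU), then narrow the small camera-relative result to float */
        float good = (float)(object_x - camera_x);

        printf("float-first:  %f\n", bad);   /* 0.000000 */
        printf("double-first: %f\n", good);  /* 0.250000 */
        return 0;
    }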


Many years ago I wrote some 3D code for the 486SX, back when floating point units for PCs were optional extras. I used a fixed-point approach with 9 bits after the fixed point in 32-bit integer arithmetic. It worked pretty well, but doing that does force you to design the world coordinate system to avoid objects that are too small and suffer loss of precision errors.
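
Roughly along these lines (a sketch with made-up helper names, not the original code):

    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit fixed point with 9 fractional bits, as described above */
    typedef int32_t fix9;
    #define FIX_ONE (1 << 9)

    static fix9   fix_from_double(double x) { return (fix9)(x * FIX_ONE); }
    static double fix_to_double(fix9 x)     { return (double)x / FIX_ONE; }

    /* Multiply through a 64-bit intermediate so the product doesn't overflow
       before shifting back down to 9 fractional bits */
    static fix9 fix_mul(fix9 a, fix9 b) {
        return (fix9)(((int64_t)a * b) >> 9);
    }

    int main(void) {
        fix9 a = fix_from_double(3.5);
        fix9 b = fix_from_double(2.25);
        printf("3.5 * 2.25 = %g\n", fix_to_double(fix_mul(a, b)));  /* 7.875 */
        return 0;
    }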

I don't think it would entirely do away with things like z-fighting and seam alignment issues, but it would effectively "round" them off in more cases.

I'd definitely encourage any game developer trying to represent a "big world" to investigate fixed-point systems to avoid precision-loss problems creeping in when far away from the origin.

Floating-point is slightly more future proof in a world of increasing resolutions.


I think fixed point is more appropriate for a Cartesian coordinate system. Floating point will have better resolution only near (0, 0); the grid will get increasingly coarse as you move away from the origin.


Numeric pros are not that happy w/ IEEE numbers. The main intellectual effort involved was that Intel had some freshers make a floating point coprocessor, then the standard just documented what the chip did.


This seems ... not very accurate?

My understanding of the history:

Intel hired William Kahan (a professor at Berkeley, already quite eminent, and familiar with FP on existing mainframes) to help get the FP design right, precisely because they hoped to make it a standard.

Then other microprocessor companies got an IEEE standardization effort going. Kahan went along to the first meeting, went back to Intel and persuaded them to take part too, and brought a proposal based on the (still in progress) 8087 design to the meeting. There were rival proposals from other companies.

The Intel proposal won largely because Intel had thought through the details better than the others.

(For instance: The biggest fight was over something called gradual underflow. Intel wanted it, DEC didn't. DEC hired a numerics expert to look into gradual underflow, with the expectation that he would report that it wasn't useful. He looked into it and reported that in fact it was a good idea and ought to be done.)

So: (1) it wasn't just "some freshers", it included at least one really big name in the field; (2) the standard and the 8087 were being developed at the same time (and indeed the 8087 didn't quite do what the standard said; later 80x87 generations did); (3) the standard was an inter-company effort and if it ended up being more or less what Intel proposed that was because Intel's proposal was actually better.

I have to admit that my summary above is based to some extent on things written by William Kahan, who of course was the guy Intel hired as a consultant to make their floating-point design better. So there may be some bias. If anything above is wrong, I'd be glad of corrections.


I don’t think that’s an accurate assessment. Kahan, the “father” of IEEE 754, is a professor and Turing award recipient. The designers of Intel’s 8087 were influenced by his writing. Here’s a bit of a history of it, with gripping subtitles such as “The Battle over Gradual Underflow” and “The Showdown”, and this conclusion from Kahan:

“In the usual standards meetings everybody wants to grandfather in his own product. I think it is nice to have at least one example -- IEEE 754 is one -- where sleaze did not triumph.”

https://people.eecs.berkeley.edu/~wkahan/ieee754status/754st...


The problem is that instructions for IEEE 754 values use the full precision of those values (or greater), when you almost never need that much. And if you leave the results as-is, you build up bias.

As your calculations progress, your results slowly build up significant digit bias (which will be different depending on the architecture and libraries). To get around this, you'd have to round regularly, but that also slows things down (and is difficult to do in binary float).

If you're taking the results of calculations at their full precision, you're just asking for trouble. 32-bit binary IEEE 754 may be able to represent 7 digits of precision, but I sure as hell wouldn't take the results of 32-bit float operations to more than 6!

The alternative is to get a contract from the compiler that everything will be done in the same precision with the same bias for the specified type, and just accept the buildup (which we're currently doing without that guarantee, and getting burned by it).



