Essential facts about floating point calculations (mortoray.com)
46 points by nkurz on July 6, 2015 | 45 comments



A huge roadblock I recently ran into: floating point operations may yield different results depending on the compiler or CPU architecture.

As far as I know, it's not possible to run a simulation that depends on floats on different machines and expect the same output. Some people recommend switching to integer operations and keeping floating point arithmetic to a minimum. Is this the only way to get deterministic math across machines?

http://gafferongames.com/networking-for-game-programmers/flo...

Edit:

I ran into this while trying to procedurally generate a map based on a seed value. Heightmap, zones/biomes, rivers, towns, etc are all based on a seed. The result must be completely reproducible on different machines, otherwise players would see different maps. The terrain is large and therefore chunked. There's a seed per chunk and only seeds of chunks visible to the client are sent to him.

I now believe that the only way to implement this correctly is to pass the full, complete state of a chunk to the player after generation. There's no sane way to implement the whole generation with integer math only. But there's a positive side effect: more traffic, but less client-side computation to regenerate the chunk.

Actually it's simpler to work around this problem in a physics simulation, because the authoritative server can simply sync the properties across the players frequently.


You have to set compiler and runtime options to:

- force IEEE compliance on

- force Intel systems to disable extended-precision mode (80 bit internal registers)

See e.g. http://stackoverflow.com/questions/6683059/are-floating-poin...
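
For gcc/clang and MSVC the usual knobs look roughly like this (a sketch only; check your compiler's documentation, and note that library functions such as sin() can still differ between implementations):

    # gcc/clang: keep all math in 64-bit SSE registers, don't fuse a*b+c
    gcc -msse2 -mfpmath=sse -ffp-contract=off ...
    # MSVC: strict IEEE 754 semantics
    cl /fp:strict ...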

Fixed-point integer is generally easier to handle but you have to do it "manually" as language support is scarce. And then watch out for overflow. It also may not be as fast as FP on modern systems (especially GPUs).
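
A minimal sketch of what the "manual" part looks like in C, using a hypothetical 16.16 format (add and subtract are plain integer ops; multiply and divide need a widening step):

    #include <stdint.h>

    /* hypothetical 16.16 fixed-point type: 16 integer bits, 16 fraction bits */
    typedef int32_t fix16_16;
    #define FIX_ONE 65536

    static fix16_16 fix_from_double(double d) { return (fix16_16)(d * FIX_ONE); }
    static double   fix_to_double(fix16_16 a) { return a / (double)FIX_ONE; }

    /* widen to 64 bits so the intermediate product doesn't overflow;
       overflow of the final result is still the caller's problem */
    static fix16_16 fix_mul(fix16_16 a, fix16_16 b) {
        return (fix16_16)((int64_t)a * b / FIX_ONE);
    }
    static fix16_16 fix_div(fix16_16 a, fix16_16 b) {
        return (fix16_16)((int64_t)a * FIX_ONE / b);
    }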

Edit: one could write an alternate history where the first PC 3D accelerators were targeted at Quake and implemented fast fixed point. It was a common design decision for 3D engines. http://gamedev.stackexchange.com/questions/26823/why-did-the...

Arguably fixed-point's limitations are easier to understand. Floating will give you within epsilon of the "right" answer, but as we have seen this may vary in weird ways across platforms. Fixed will run into precision loss problems a lot earlier, forcing you to confront and deal with them. You also have to confront truncate vs. round earlier.

A cite for "floating point is faster than fixed point": http://cboard.cprogramming.com/brief-history-cprogramming-co...


Mostly nonsense. You are making the same mistake as Matlab, throwing away all the internal Intel precision, just because you don't know the boundaries and limitations, or you don't care. In short, Matlab did what you describe in v6.0, and it was wrong.

Read Kahan and Matlab's (non-)defense: http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf http://blogs.mathworks.com/cleve/2014/07/07/floating-point-n...


Extended-precision mode is a disaster because it's confined to the FPU. If it were an 80-bit datatype it wouldn't be a problem. Because of this property, it's only predictable if you write all your math in assembler. If you write C, the generated code is very vulnerable to perturbation when the compiler changes when variables are written back to memory.
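
A rough illustration (whether you actually see a nonzero difference depends entirely on the compiler, the optimization level, and whether x87 or SSE2 code is generated):

    #include <stdio.h>

    int main(void) {
        volatile double a = 1.0 / 3.0;  /* volatile blocks constant folding */
        double t = a * 3.0;             /* may stay in an 80-bit x87 register */
        volatile double u = a * 3.0;    /* forced through a 64-bit memory slot */
        /* with x87 codegen this can print a tiny nonzero value;
           with SSE2 (or -ffloat-store) it prints 0 */
        printf("%g\n", t - u);
        return 0;
    }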

It may well give you more accurate results, but if you're doing something that's inherently numerically unstable (in our case, ASIC timing feeding into k-means clustering) it can give you different results in response to tiny code changes. Force-truncating trades accuracy for stability across changes to the software.

And obviously if you want C code that gives the same results on Intel and non-Intel hardware, you can't use Intel 80-bit FPU.

(Edit: I'll concede somewhat on "within epsilon of the right answer", because there are some examples in the Matlab case where 80-bit gets much closer to the right answer than 64-bit does.)


So, do you have something we can learn from?

As it stands, pjc50 has some on-point advice that directly addresses the OP's concerns about consistent results from seed values, and about testing. Your reply is: "that is nonsense."

Great. Now I still know nothing.


If you want compatibility between x86 CPUs and ARM/PowerPC/MIPS then you have to not use the extended 80-bit results. The 80-bit results can be more accurate (although not always, it depends on exactly how they are used) but the OP wasn't concerned with maximum accuracy but with exact repeatability.

Similarly, if you want cross-architecture consistency you need to not use fmadd instructions because fmadd gives different results from separate multiply and add. The fmadd results are more accurate, but that is not the point in this case.
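
A small example of that difference, assuming a C99 libm fma() and that the compiler doesn't contract a*b + c into a fused operation on its own (e.g. compile with -ffp-contract=off and link with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + 0x1p-27;   /* exactly representable */
        double b = 1.0 - 0x1p-27;
        double c = -1.0;
        /* exact a*b = 1 - 2^-54; rounded to a double that is exactly 1.0,
           so the separate multiply and add loses the whole answer */
        printf("mul+add: %.17g\n", a * b + c);    /* 0 */
        printf("fma:     %.17g\n", fma(a, b, c)); /* -2^-54 ~ -5.55e-17 */
        return 0;
    }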


There's an additional caveat that may be of interest.

If you execute code in parallel, you may pick up different round-off errors unless you take care to combine the results from parallel tasks in the same order every time.

This can be a problem when simulating non-linear dynamical systems (such as the Solar System).
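
The root cause is that floating point addition is not associative, so the combination order matters. A tiny example:

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        /* (a + b) + c = 1, but a + (b + c) = 0 because b + c rounds back to -1e16 */
        printf("%g vs %g\n", (a + b) + c, a + (b + c));
        return 0;
    }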


The important question to ask is: do you really need bit-accurate results? If so, then floating point arithmetic probably isn't what you want. However, running the same binary on different (compatible, ie. not ARM vs. x86) CPU architectures should produce identical results almost always (can anyone give a counterexample?).

But in reality, it is often acceptable to have one or two least significant bits differing. Especially when computing with doubles, a relative error on the order of 2^-53 ≈ 10^-16 isn't really significant. Even 32-bit floats give a precision of 2^-24 ≈ 6*10^-8.

This is usually good enough for e.g. games, although multiplayer games need some kind of authoritative server model to avoid cheating.

Finally, the x87 floating point unit shouldn't be used any more because SSE2 is ubiquitous at this point so issues with 80 bit internal precision should be a thing of the past.

I'll conclude with a link to an article about a (very) competitive sim racing game where a least significant bit in a float ended up making a practical difference: http://www.iracing.com/q-circles-the-crop-circles-of-iracing...


I'm not sure 80-bit floats are a thing of the past. Many calculations still benefit from the extended precision compared to 64-bit. The issue also crops up again on some platforms that can do 128-bit float, but are then truncated down to 64 bits for storage.

I agree there are ways to avoid that problem, but the general issue of truncation across storage/calculation I think remains an important concept.


> Many calculations still benefit from the extended precision compared to 64-bit.

Out of curiosity, can you name some examples?

64-bit doubles are really precise. To give a practical figure: a 64-bit float is good enough to calculate the orbit of Neptune to a precision of half a millimeter (0.497 mm). That's a distance of about 30 AU = 4.5 * 10^12 m = four and a half billion kilometers.

> The issue also crops up again on some platforms that can do 128-bit float, but are then truncated down to 64 bits for storage.

Which CPU architectures use 128 bit floats internally? Is it common to use such platforms?

As for x87, I'd imagine that the difference in speed compared to SSE is already a good enough reason not to use 80-bit floats in most practical applications.

Perhaps some nuclear simulations or particle accelerator physics deal with numbers that vary in magnitude so greatly that increased precision may be useful but these cases are really rare (albeit important).


Technical calculations do care, e.g. FMAC in Matlab. http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf

In short, Java went ahead and declared imprecise but architecture-independent float results better than more precise but architecture-dependent results. Matlab followed suit with v6.0 and found big errors in mxmuleps compared to Matlab v5.0.

Such a little rounding problem in the 6th digit, for example, caused a big problem in one architectural project of mine, where it cost several millions. For the real-world costs of rounding problems, see comp.risks or e.g. https://www.ima.umn.edu/~arnold/disasters/


Basically everything that can be done with 80-bit floats can be done with techniques such as compensated summation and double-double arithmetic. This requires several 64-bit operations per operation, but those operations can take advantage of SIMD capabilities when processing several inputs in parallel.
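
For reference, the core building block is an error-free transformation such as Knuth's TwoSum, which captures exactly what a single addition rounds away (a sketch; it assumes round-to-nearest and no fast-math reassociation):

    /* after this call, s + err is exactly equal to a + b */
    static void two_sum(double a, double b, double *s, double *err) {
        *s = a + b;
        double bv = *s - a;                 /* the part of b that made it into s */
        *err = (a - (*s - bv)) + (b - bv);  /* what was rounded off */
    }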


This is the case, for example, with the x86 .NET JIT. The problem usually arises when values are moved to 64-bit locations rather than kept in 80-bit FPU registers. This was a huge headache for our regression tests for a .NET engineering/FEM application. .NET has no option to force "strict" FPU mode, so there is no way of getting reproducible results. What saved us was that the x64 JIT uses only 64-bit math, which makes it deterministic.


It's not impossible, it's just hard. I covered the things you have to worry about a while ago in this article:

https://randomascii.wordpress.com/2013/07/16/floating-point-...

As long as you avoid unspecified instructions (reciprocal square-root estimate, fsin) and avoid library functions or code-gen that takes different paths when SSE2/AVX/whatever are available and as long as you run the same binary on every machine (compatibility between PowerPC and x86 is much harder) you can definitely do it. Many games have.

So, if you can restrict yourself to x86 CPUs it's manageable, at least for modest sized chunks of code.


I once ran into this with Python: two interpreters giving different results. It turned out one interpreter had been compiled with some --I-don-t-care-about-correct-flops-as-long-as-they-are-fast flags. Still traumatised.


    if( abs(x - y) < 1e-6 ).
Testing for absolute error is often incorrect. Here is a better floating point comparator:

http://floating-point-gui.de/errors/comparison/
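
A common shape for such a comparator, for reference (a sketch only; the tolerances are application-specific and NaN/infinity need extra care):

    #include <math.h>

    int nearly_equal(double a, double b, double abs_tol, double rel_tol) {
        double diff = fabs(a - b);
        if (diff <= abs_tol)          /* covers values at or near zero */
            return 1;
        /* otherwise require the difference to be small relative to the inputs */
        return diff <= rel_tol * fmax(fabs(a), fabs(b));
    }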


I (the author) agree. I'm mainly trying to point out all the issues with floating point rather than give a full solution. I believe the problem stems from people simply not realizing many of these issues exist. Certainly I'm not offering the complete solution here, and any one of those points could spawn several articles of its own.

So yes, links are appreciated so readers can get more into the details.


> Testing for absolute error is often incorrect

True, but in this case it was to test for near-zero values so absolute error comparison is good enough.

Here's a few macros I have been using to compare floating point values for approximate equality. It should be true if the values are equal to about 6-7 decimal places (in base-10). I use this in testing some code that deals with algorithms that have known numerical precision issues. The errors are inherent to the algorithms, not specifically floating point errors (e.g. solving Kepler's equation for near-parabolic trajectories).

Note that it isn't specified whether "a" or "b" is the "gold" value to compare against, so it's measuring the relative error against a rough average of the two magnitudes.

    #define ZEROF(x) ((x)*(x) < DBL_EPSILON)
    #define EQF(a, b) ((ZEROF(a) && ZEROF(b)) || \
        ZEROF(((a)-(b))*((a)-(b))/((a)*(a) + ((b)*(b)))))
I am no scientist, so if anyone has improvements to the code above, I'll gladly take feedback.


but in this case it was to test for near-zero values so absolute error comparison is good enough

Unless you are working with small numbers? Suppose you are dividing 1e-30 by 1e-30: the result should still be 1, so if you return 0 when the second argument is < 1e-6 then you're screwed.

I think boost's close_at_tolerance does what your macro tries to do; basically you say 'here's two numbers, check if their values differ no more than x% of their value' With the advantage that it is not a macro and guards against overflow etc: http://www.boost.org/doc/libs/1_37_0/boost/test/floating_poi....


> Unless you are working with small numbers? Suppose you are dividing 1e-30 by 1e-30: the result should still be 1, so if you return 0 when the second argument is < 1e-6 then you're screwed.

A very good point which I didn't think of! Obviously the 1e-6 figure is specific to the application.

How should one avoid division by zero then? You can't figure out the relative error without a division, can you? For normal comparisons, using a relative error is the way to go but the zero case is different.

On the other hand, using quantities of magnitudes in 1e-30 is already using half the exponent, a change of units may be due here.


How should one avoid division by zero then?

Good question, and in this case I really can't tell; it depends on the problem domain (i.e. is your normal data set numbers greater than 1, or rather smaller than 1e-100), and I'm not sure whether it is possible (in a feasible way) to write one generic function which works for all cases.


I've seen those "abs(fp_number)<some_tiny_value" in front of divisions, because people fear that the division will "blow up" in some way if a zero or sufficiently small number is passed to it.

    if (fabs(q) >= 1e-6)
        rel_val = abs_val / q;
    else
        overflow_flag = true;
But it's quite pointless, as the semantics of floating point calculations give you the indication of overflow right in the division result: it will come out as positive or negative infinity (or NaN for 0/0).

    rel_val = abs_val / q;
    if (isnan(rel_val) || !isfinite(rel_val) || fabs(rel_val) > some_limit)
        overflow_flag = true;
(of course your FPU should be configured to not produce exceptions in the case of division by zero)


That still won't prevent a zero-ish noisy `q` from causing a totally messed-up `rel_val`.


I thought we had to use ulps and (worst-case) error bounds specified by a library in ulps?

https://en.wikipedia.org/wiki/Unit_in_the_last_place

http://docs.oracle.com/javase/7/docs/api/java/lang/Math.html


I am not sure I understand what you are saying exactly, but I believe in some circumstances you may be working with numerical values which you know to have errors far larger than an ulp.

For example, suppose you are working with data that you've obtained from some physical measurement, and your knowledge of the measurement process means you know it will have around 1% relative error.

Or another example: suppose you are solving a linear system of equations Ax=b with highly precise right-hand-side data b (< \epsilon floating-point error), but the condition number of the matrix is large. Then perhaps it is only meaningful to expect the solution to be accurate to within k(A) \epsilon, where k(A) is the condition number of A.

(I haven't thought about this stuff for a while, please correct me if this is wrong).

edit: here's another reason. Suppose there's a heavy tradeoff between runtime and accuracy for some algorithm, so for some application it makes sense to use some cheap but somewhat inaccurate approximation. Then you're probably going to have and expect far larger errors. But, you might still be able to get some kind of (worst case?) bounds on the result, and want to assert something about that in test cases or sanity checks.


Good article! Just a couple corrections / comments:

> n * x != x + … + x

Not true if n == 2! For n = 2 they are exactly the same.

Why is this? Because "2 * x" and "x + x" are both exactly one operation, which means they only get rounded once, and since both are mathematically identical they get rounded to the same value.

> I’m not sure why, but the default string formatting of floating point numbers in most languages, even domain-specific math languages, does not have high enough precision to store the complete number.

I think it's probably related to the fact that printf() doesn't have any format specifiers for float that do What You Really Want, which is: print the shortest number that will parse back to the right value.

So unless people want to implement this algorithm themselves (and, fascinatingly, the best known algorithm to do this was published only five years ago: http://dl.acm.org/citation.cfm?doid=1809028.1806623), you have to choose between having all your numbers be annoyingly long or losing precision for some numbers.
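
A quick C illustration of that tradeoff (17 significant digits always round-trip a double, but the output is rarely the shortest form):

    #include <stdio.h>

    int main(void) {
        double x = 0.1;
        printf("%.17g\n", x);   /* round-trips, but prints 0.10000000000000001 */
        printf("%g\n", x);      /* prints 0.1, but keeps only 6 significant digits */
        return 0;
    }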


> Not true if n == 2! For n = 2 they are exactly the same.

Not true in many other cases either; a trivial example is n=1, or x=1 and n can be represented exactly. In all of these cases the "!=" in the headings is meant to be read "is not necessarily equal", or "there exist expressions e, e' for which e != e'"


> I think it's probably related to the fact that printf() doesn't have any format specifiers for float that do What You Really Want, which is: print the shortest number that will parse back to the right value.

Lots of languages support the proper behaviour these days. Haskell/GHC does, for example.

[https://docs.python.org/3/tutorial/floatingpoint.html]:

> Historically, the Python prompt and built-in repr() function would choose the one with 17 significant digits, 0.10000000000000001. Starting with Python 3.1, Python (on most systems) is now able to choose the shortest of these and simply display 0.1.


Reference link to What Every Computer Scientist Should Know About Floating-Point Arithmetic :

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.ht...


I can't count the number of times I've seen this text posted. Yet it takes something like a degree in math/CS/... to read and understand it completely; it is really not suitable for everyone.

As such, I bookmarked "Essential facts about floating point calculations", so the next time a psychologist or biologist comes asking me whether I knew Matlab has a bug when comparing 0 to 0, I can point them to this document. And to some proper code to compare floating point numbers using a relative error or so.


For anyone noticing the “Oracle” in the domain name: The same document used to be at http://docs.sun.com/source/806-3568/ncg_goldberg.html


Despite what this page says, you can round-trip between floating point and strings. Take a look at:

https://randomascii.wordpress.com/2012/03/08/float-precision...

Specifically, printf(“%1.8e”, f); ensures that a float will round-trip to decimal and back


I do mention that if you simply increase the length of the formatted string you can round-trip. It should be noted however (as I do) that the decimal number is still not equivalent to the actual floating point number, only that when converted back it is the same. This point could, in theory, be an issue if your serialized form crosses systems that use different floating point systems.


If you are trying to get exact compatibility between systems with different floating-point systems then serializing to text is the least of your problems.


It should be noted however (as I do) that the decimal number is still not equivalent to the actual floating point number, only that when converted back it is the same.

Since 2 divides evenly into 10, every binary number has an exact decimal representation, and you can produce a decimal string which is equivalent to any floating point number. You may need quite a few digits, though.
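
For instance, since every finite double is an integer times a power of two, its decimal expansion is finite; printing enough digits shows it (this assumes a libc, such as glibc, that prints exact digits rather than padding with zeros):

    #include <stdio.h>

    int main(void) {
        /* prints the full decimal expansion of the double nearest to 0.1 */
        printf("%.55f\n", 0.1);
        return 0;
    }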


> It is possible to get the same number back though, even when using decimal. If the truncation length is simply extended long enough the parsing will yield the same number back.

But I agree that actual code would have been clearer.


> Just consider that with 64-bit floats 0.3 + 0.6 = 8.9999999999999991.

> It really took no effort at all to get a precision problem!

That sure looks like a serious precision problem :)


I suppose in hindsight my example doesn't look as disturbing as it's meant to. :)

It's hard to show concise examples that truly show how this little bit of error eventually compounds and destabilizes a calculation.


I think he is referring to the fact that the result should be around 0.8999, not 8.9999.


It looks plenty disturbing to me. I don't even think it should be qualified as a 'precision problem'; it's more like a 'total failure'.


In cases where the iteration can’t be avoided, it’s vital to understand you won’t get the expected result. A certain amount of rounding error will always accumulate.

You can track the error and account for it during iteration to avoid errors accumulating. See, for example, Kahan summation: https://en.wikipedia.org/wiki/Kahan_summation_algorithm
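
A minimal sketch of Kahan summation (the compensation variable c carries the low-order bits a naive sum would drop; note that fast-math style reassociation can optimize it away):

    #include <stddef.h>

    double kahan_sum(const double *x, size_t n) {
        double sum = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;   /* apply the correction carried so far */
            double t = sum + y;    /* low-order bits of y are lost here... */
            c = (t - sum) - y;     /* ...and recovered into c */
            sum = t;
        }
        return sum;
    }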


Maybe I don't fully understand the issues at play here but I remember reading about so-called 'unums'[0], an alternative binary number representation that seems quite amazing. I don't understand why it doesn't get more attention, apart from the fact it would need to be supported in CPU hardware to be of any real use.

    [0]: http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf



I recently ran into minus zero in JavaScript and was like, WTF is minus zero?

https://en.wikipedia.org/wiki/Signed_zero

Interesting stuff.
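
The same thing exists in C doubles. A tiny sketch, assuming IEEE semantics: -0.0 compares equal to 0.0, but its sign still shows up, e.g. as a divisor:

    #include <stdio.h>

    int main(void) {
        volatile double pz = 0.0, nz = -0.0;
        printf("%d\n", nz == pz);   /* 1: minus zero compares equal to plus zero */
        printf("%g\n", 1.0 / nz);   /* -inf: the sign matters here */
        printf("%g\n", 1.0 / pz);   /* inf */
        return 0;
    }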


See also "why I use integers whenever possible"



