
Essential facts about floating point calculations - nkurz
http://mortoray.com/2015/07/06/essential-facts-about-floating-point-calculations/
======
zubspace
A huge roadblock I recently ran into: it's possible that floating-point
operations may yield different results depending on the compiler or CPU
architecture.

As far as I know, it's not possible to run a simulation that depends on floats
on different machines and expect the same output. Some people recommend
switching to integer operations and keeping floating-point arithmetic to a
minimum. Is this the only way to get deterministic math across machines?

[http://gafferongames.com/networking-for-game-programmers/flo...](http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/)

Edit:

I ran into this while trying to procedurally generate a map based on a seed
value. Heightmap, zones/biomes, rivers, towns, etc are all based on a seed.
The result must be completely reproducible on different machines, otherwise
players would see different maps. The terrain is large and therefore chunked.
There's a seed per chunk and only seeds of chunks visible to the client are
sent to him.

I now believe that the only way to implement this correctly is to pass the
full, complete state of a chunk to the player after generation. There's no
sane way to implement the whole generation with integer math only. But there's
a positive side effect: more traffic, but less computation needed to
regenerate the chunk on the client.

Actually it's simpler to work around this problem in a physics simulation,
because the authoritative server can simply sync the properties across the
players frequently.

~~~
exDM69
The important question to ask is: do you really need bit-accurate results? If
so, then floating-point arithmetic probably isn't what you want. However,
running the same binary on different (compatible, i.e. not ARM vs. x86) CPU
architectures should almost always produce identical results (can anyone give
a counterexample?).

But in reality, it is often acceptable to have one or two least significant
bits differ. Especially when computing with doubles, a _relative_ error on
the order of 2^-53 = ~10^-16 isn't really significant. Even 32-bit floats give
a precision of 2^-24 = ~6*10^-8.

This is usually good enough for e.g. games, although multiplayer games need
some kind of authoritative server model to avoid cheating.

Finally, the x87 floating point unit shouldn't be used any more, since SSE2
is ubiquitous at this point, so issues with 80-bit internal precision should
be a thing of the past.

I'll conclude with a link to an article about a (very) competitive sim racing
game where the least significant bit of a float ended up making a practical
difference: [http://www.iracing.com/q-circles-the-crop-circles-of-iracing...](http://www.iracing.com/q-circles-the-crop-circles-of-iracing/)

~~~
mortoray
I'm not sure 80-bit floats are a thing of the past. Many calculations still
benefit from the extended precision compared to 64-bit. The issue also crops
up again on some platforms that can do 128-bit floats but then truncate them
down to 64 bits for storage.

I agree there are ways to avoid that problem, but the general issue of
truncation across storage/calculation I think remains an important concept.

~~~
exDM69
> Many calculations still benefit from the extended precision compared to
> 64-bit.

Out of curiosity, can you name some examples?

64-bit doubles are really precise; to give a practical figure: a 64-bit float
is good enough to calculate the orbit of Neptune to a precision of half a
millimeter (0.497 mm). That's a distance of about 30 AU = 4.5 * 10^12 m = four
and a half billion kilometers.

> The issue also crops up agian on some platforms that can do 128-bit float,
> but are then truncated down to 64-bits for storage.

Which CPU architectures use 128 bit floats internally? Is it common to use
such platforms?

As for x87, I'd imagine that the difference in speed compared to SSE is
already a good enough reason not to use 80-bit floats in most practical
applications.

Perhaps some nuclear simulations or particle accelerator physics deal with
numbers that vary so greatly in magnitude that increased precision may be
useful, but these cases are really rare (albeit important).

~~~
rurban
Technical calculations do care, e.g. FMAC in Matlab.
[http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf](http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf)

In short, Java went ahead and declared imprecise but architecture-independent
float results better than more precise but architecture-dependent ones.
Matlab followed suit with v6.0 and found big errors in mxmuleps compared to
Matlab v5.0.

Such a little rounding problem in the 6th digit caused a big problem in one
architectural project of mine, where it cost several millions. For the costs
of rounding problems in the real world, see comp.risks or e.g.
[https://www.ima.umn.edu/~arnold/disasters/](https://www.ima.umn.edu/~arnold/disasters/)

------
thisisanacc

        if( abs(x - y) < 1e-6 )
    

Testing for absolute error is often incorrect. Here is a better floating point
comparator:

[http://floating-point-gui.de/errors/comparison/](http://floating-point-gui.de/errors/comparison/)

~~~
exDM69
> Testing for absolute error is often incorrect

True, but in this case it was to test for near-zero values so absolute error
comparison is good enough.

Here are a few macros I have been using to compare floating point values for
_approximate_ equality. They should be true if the values are equal to about
6-7 decimal places (in base 10). I use this in testing code that deals with
algorithms that have known numerical precision issues. The errors are inherent
to the algorithms, not specifically floating point errors (e.g. solving
Kepler's equation for near-parabolic trajectories).

Note that it isn't specified whether "a" or "b" is the "gold" value to compare
against, so it's measuring the relative error against a rough average of the
two magnitudes.

    
    
        #define ZEROF(x) ((x)*(x) < DBL_EPSILON)  /* |x| < sqrt(DBL_EPSILON) ~ 1.5e-8 */
        #define EQF(a, b) ((ZEROF(a) && ZEROF(b)) || \
            ZEROF(((a)-(b))*((a)-(b))/((a)*(a) + ((b)*(b)))))  /* squared relative error */
    

I am no scientist, so if anyone has improvements to the code above, I'll
gladly take any feedback.

~~~
stinos
_but in this case it was to test for near-zero values so absolute error
comparison is good enough_

Unless you are working with small numbers? Suppose you are dividing 1e-30 by
1e-30; then the result should still be 1, so if you return 0 when the second
argument is < 1e-6 then you're screwed.

I think boost's close_at_tolerance does what your macro tries to do; basically
you say 'here are two numbers, check that their values differ by no more than
x% of their value'. It has the advantage that it is not a macro and guards
against overflow etc.:
[http://www.boost.org/doc/libs/1_37_0/boost/test/floating_poi...](http://www.boost.org/doc/libs/1_37_0/boost/test/floating_point_comparison.hpp).

~~~
exDM69
> Unless you are working with small numbers? Suppose you are dividing 1e-30 by
> 1e-30; then the result should still be 1, so if you return 0 when the second
> argument is < 1e-6 then you're screwed.

A very good point which I didn't think of! Obviously the 1e-6 figure is
specific to the application.

How should one avoid division by zero then? You can't figure out the relative
error without a division, can you? For normal comparisons, using a relative
error is the way to go but the zero case is different.

On the other hand, quantities with magnitudes around 1e-30 are already using
half the exponent range; a change of units may be due here.

~~~
stinos
_How should one avoid division by zero then?_

Good question, and in this case I really can't tell; it depends on the problem
domain (i.e. whether your normal data set is numbers greater than 1 or rather
smaller than 1e-100) and I'm not sure it is possible (in a feasible way) to
write one generic function which works for all cases.

------
haberman
Good article! Just a couple corrections / comments:

> n * x != x + … + x

Not true if n == 2! For n = 2 they are exactly the same.

Why is this? Because "2 * x" and "x + x" are both exactly one operation, which
means they only get rounded once, and since both are mathematically identical
they both get rounded to the same value.

> I’m not sure why, but the default string formatting of floating point
> numbers in most languages, even domain-specific math languages, does not
> have high enough precision to store the complete number.

I think it's probably related to the fact that printf() doesn't have any
format specifiers for float that do What You Really Want, which is: print the
shortest number that will parse back to the right value.

So unless people want to implement this algorithm themselves (and,
fascinatingly, the best known algorithm to do this was published only five
years ago:
[http://dl.acm.org/citation.cfm?doid=1809028.1806623](http://dl.acm.org/citation.cfm?doid=1809028.1806623)),
you have to choose between having _all_ your numbers be annoyingly long or
losing precision for some numbers.

~~~
Sharlin
> Not true if n == 2! For n = 2 they are exactly the same.

Not true in many other cases either; a trivial example is n=1, or x=1 when n
can be represented exactly. In all of these cases, the "!=" in the headings is
meant to be read "is not _necessarily_ equal", i.e. "there exist expressions
e, e' for which e != e'".

------
kqr2
Reference link to _What Every Computer Scientist Should Know About Floating-
Point Arithmetic_ :

[http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.ht...](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

~~~
stinos
I can't count the number of times I've seen this text posted. Yet it takes
something like a degree in math/CS/... to read and understand it completely;
it is really not suitable for everyone.

As such, I bookmarked _Essential facts about floating point calculations_ so
that next time a psychologist or biologist comes asking me whether I knew
Matlab has a bug when comparing 0 to 0, I can point him/her to this document.
And to some proper code for comparing floating point numbers using a relative
error or so.

------
joosters
Despite what this page says, you _can_ round-trip between floating point and
strings. Take a look at:

[https://randomascii.wordpress.com/2012/03/08/float-precision...](https://randomascii.wordpress.com/2012/03/08/float-precision-from-zero-to-100-digits-2/)

Specifically, _printf(“%1.8e”, f); ensures that a float will round-trip to
decimal and back_

~~~
mortoray
I do mention that if you simply increase the length of the formatted string,
you can round-trip. It should be noted however (as I do) that the decimal
number is still not equivalent to the actual floating point number, only that
when converted back it is the same. This could, in theory, be an issue if your
serialized form crosses systems that use different floating point
representations.

~~~
brucedawson
If you are trying to get exact compatibility between systems with different
floating-point systems then serializing to text is the least of your problems.

------
ucho
> Just consider that with 64-bit floats 0.3 + 0.6 = 8.9999999999999991.

> It really took no effort at all to get a precision problem!

That sure looks like a serious precision problem :)

~~~
mortoray
I suppose in hindsight my example doesn't look as disturbing as it's meant to.
:)

It's hard to show concise examples that truly show how this little bit of
error eventually compounds and destabilizes a calculation.

~~~
jobigoud
I think he is referring to the fact that the result should be around 0.8999,
not 8.9999.

------
panic
_In cases where the iteration can’t be avoided, it’s vital to understand you
won’t get the expected result. A certain amount of rounding error will always
accumulate._

You can track the error and account for it during iteration to avoid errors
accumulating. See, for example, Kahan summation:
[https://en.wikipedia.org/wiki/Kahan_summation_algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm)

------
lorddoig
Maybe I don't fully understand the issues at play here, but I remember reading
about so-called 'unums'[0], an alternative binary number representation that
seems quite amazing. I don't understand why it doesn't get more attention,
apart from the fact that it would need to be supported in CPU hardware to be
of any real use.

    
    
        [0]: http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf

------
nsajko
Anybody used decimal floating point?
[https://en.wikipedia.org/wiki/Decimal64_floating-point_forma...](https://en.wikipedia.org/wiki/Decimal64_floating-point_format)

------
thewarrior
I recently ran into minus zero in JavaScript and was like: WTF is MINUS zero?

[https://en.wikipedia.org/wiki/Signed_zero](https://en.wikipedia.org/wiki/Signed_zero)

Interesting stuff.

------
lectrick
See also "why I use integers whenever possible"

