
Demystifying Floating Point Precision - ingve
https://blog.demofox.org/2017/11/21/floating-point-precision/
======
gilbetron
Years ago I created an educational solar system simulation called the
Astronomicon. It was a fascinating education for me, not only about astronomy
and our solar system, but about floating point precision and general numerical
stability issues. The problem really showed up with Pluto (back then still a
planet ;) ). 32-bit floats seemed fine, except when you started up the
simulation and saw Pluto jumping in discrete steps along its orbit - you
could actually see the bits, basically!
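
To put rough numbers on "seeing the bits", here is a quick sketch assuming
NumPy (Pluto orbits at roughly 5.9e12 m from the Sun):

    import numpy as np

    # Gap between adjacent representable values at Pluto's distance:
    print(np.spacing(np.float32(5.9e12)))  # 524288.0 -> positions jump ~524 km
    print(np.spacing(np.float64(5.9e12)))  # 0.0009765625 -> sub-millimeter steps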

Graphics cards (they weren't called GPUs yet!) were just transitioning to 64
bit, so our high-end cards could handle 64-bit precision, and everything was
fine. But we were stuck with having to support some 32-bit cards. Then I
figured that if I scaled everything by ... 10,000 ... I think ... it would
work out OK. The exact details escape me - maybe the mantissa wasn't enough
for the needed precision, so multiplying by 10,000 before sending it to the
GPU was enough. Or maybe the opposite. Regardless, it meant there was an
annoying constant we had to deal with :)

Nothing like diving in and dealing with an issue like that to do some
learnin'!

~~~
madez
Fixed-precision floating point arithmetic is useful when you work over a
large range of numbers and care about relative accuracy. That also makes
floats useful when the range is not known in advance. However, the range is
not irrelevant: leaving the supported range can ruin accuracy, and shifting
the numbers back into it via a constant factor can restore it. From what you
say, that is what happened in your case; the numbers got too small and you
shifted them into the supported range, which for IEEE 754-2008 32-bit numbers
is 2^(−126) to (1 − 2^(−24)) × 2^128 in magnitude.
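
To make "too small" concrete, a small sketch assuming NumPy: below 2^(−126),
float32 goes subnormal and sheds significand bits, so relative error blows up
until you rescale.

    import numpy as np

    # Below 2**-126, float32 becomes subnormal and loses significand bits:
    x = 1e-40                                   # in the subnormal range
    print(abs(float(np.float32(x)) - x) / x)    # ~4e-06 relative error

    # Rescaling *before* values underflow keeps full 24-bit precision:
    y = x * 1e4                                 # 1e-36 is a normal float32
    print(abs(float(np.float32(y)) - y) / y)    # back to ~6e-08 or better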

------
okdana
Related: A fun video that illustrates floating-point precision by way of Super
Mario 64:

[https://youtu.be/9hdFG2GcNuA](https://youtu.be/9hdFG2GcNuA)

~~~
nerdponx
I hate these text-on-picture-with-music videos. Even the robot-voice videos
are better. I know it's using the Super Mario music, but still. It's like a
blog post where you can't copy or paste anything, can't scroll back and
forth, and which can't be cached or indexed, etc.

~~~
katastic
The guy did it for free, man. Give him a little slack. He's not even a
programmer; he does Mario speed runs. He didn't post his video here, someone
else did.

And I absolutely disagree that a robot voice would be better. That's
horrific.

~~~
chillingeffect
I don't think he's picking on the authors themselves but on the broader
problem:

How to make decent auto-play presentations on the web.

Most of us hate it when scrolling gets broken. Many of us like to simply
watch a lecture go by without interaction, e.g. for deeper contemplation.
Many of us would like to just hear the lecture.

Stuffing it into a video is a tradeoff.

This is just a typical problem of a highly-mediated era: we have core
information that can be rendered or published across numerous formats - book,
blog post, tweets, video, Google Slides presentation, etc. How do you deliver
the information in all formats, to be everything to everyone?

------
zmonx
Speaking of floats, I highly recommend John Gustafson's book, _The End of
Error: Unum Computing_:

[https://www.crcpress.com/The-End-of-Error-Unum-Computing/Gus...](https://www.crcpress.com/The-End-of-Error-Unum-Computing/Gustafson/p/book/9781482239867)

It is an amazing compendium of the many ways in which floats fail to give any
guarantees in practice - in some cases due to the format itself, and in
others because many subtle issues are signalled via internal processor values
and flags that are simply not accessible from high-level languages. In
addition, the book presents unums, an alternative number format with several
extremely interesting properties that make computations much less error-prone.

Prof. Gustafson has since also presented several additional formats with much
stronger guarantees than IEEE floats:

[http://www.johngustafson.net/unums.html](http://www.johngustafson.net/unums.html)

~~~
wruza
In short, how does the u-bit solve the multiplication problem when two
u-intervals produce more than one interval? Like (1.01+ * 1.01+) = 1.02+ ..
1.04+ at 3-decimal precision.

~~~
zmonx
In such cases, you get a ubound that contains the exact value. When the inputs
are exact, the result is exact too.

Here is a sample implementation in Julia:

[https://github.com/JuliaComputing/Unums.jl](https://github.com/JuliaComputing/Unums.jl)

------
dragontamer
I've written some notes on floats before, and I think this blog post is
pretty good overall.

The one thing that trips up a lot of people (and that wasn't mentioned in
this post) is that floats are non-associative. Try the following in Python
(or whatever language that uses doubles):

    >>> 1 + 1 + (2. ** 53)
    9007199254740994.0
    >>> 1 + (1 + (2. ** 53))
    9007199254740992.0

(claytonjy noted a point of confusion; the above statement has been edited
for clarity.)

The number 53 is chosen because there are 53 bits in the double significand,
which is the point at which "+1" becomes too small and drops off the double.

In the first case, 1+1 becomes 2, which is large enough to be added to
2. ** 53, so you end up with the "correct" answer. But (2. ** 53 + 1) rounds
down (because the 1 drops off), and the additional +1 afterwards also drops
off due to rounding error.

Therefore, floating point math is commutative but NOT associative. And that's
really what makes things confusing for most people, in my experience.

x86 with the x87 coprocessor actually performs arithmetic in 80-bit floats to
deal with this issue, so you could afford to lose a fair number of bits on
older x86 platforms. But modern C/C++ code typically compiles to the faster
SSE2 instructions, which only keep 64 bits per operation. This demonstrates
that rounding behavior is not only architecture-specific, but also
compiler-option specific!

~~~
colejohnson66
Is there a way to force 80-bit floats if you need them?

~~~
dragontamer
Well, it seems odd to me that anybody would want "exactly 80-bit floats".

1. In the case of "doubles", it seems more important that your code works the
same on all platforms. So you can enable precise IEEE 754 behavior, to ensure
that your rounding errors and whatnot are the same from platform to platform.

You'll still need to somehow ensure that all of your operations occur in the
same order, however; IEEE 754 only specifies when and how 64-bit numbers get
rounded. So sorting your numbers and adding them up from smallest magnitude
to largest is still important (see the sketch at the end of this comment).

2. In the case that 64 bits of precision is insufficient, you should be using
something more precise but also consistent across different platforms.

In case #1, it's usually more important that all code "makes the same
rounding errors" than that "some code rounds more poorly than other code".

In case #2, you'll want ALL of your code to have better rounding behavior.
And "long double" is still implementation-specific. It's probably best to go
to a pure software solution (i.e. BigNums of some kind) if precision is
important to you.

And there's a case #3: some people don't care about precision and are willing
to accept lower precision as long as the results are "nearly correct". Video
game programmers are far more interested in speed, for example: the "fast
inverse square root" barely has 10 bits of precision, but its far greater
speed wins out in most video game situations.
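
As promised above, a small sketch of why summation order matters
(standard-library Python; the exact digits will vary by platform):

    import math

    # Adding tiny terms to an already-large running total discards their
    # low-order bits; summing smallest-first keeps more of them.
    terms = [1.0 / n**2 for n in range(1, 10**6)]

    fwd = sum(terms)            # largest terms first
    bwd = sum(reversed(terms))  # smallest terms first: typically closer
    ref = math.fsum(terms)      # exactly rounded reference

    print(fwd - ref, bwd - ref) # the two orders differ in the low bits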

------
slededit
I've always thought that floats should be accompanied by a precision value
specifying the number of significant digits, ideally updated by the hardware
as part of the same operation.

All floating point code has subtle bugs if you don't track error accumulation.

~~~
kraghen
This sounds like unums, a proposed alternative to IEEE floats where, roughly
speaking, the significand has variable size and thus carries only as much
precision as the accuracy warrants.

~~~
slededit
Does it solve the equality issue? An epsilon tracker would allow you to say
two values are equal within the margin of error of the computation.

If so, that makes them much more interesting to me. Floating point bugs
manifest themselves at the non-linear parts, such as the equality and
comparison operators.
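
A minimal sketch of such an epsilon tracker in Python (the Tracked class and
its names are hypothetical, not an existing library; math.ulp needs Python
3.9+):

    import math

    class Tracked:
        """A value paired with an accumulated absolute error bound."""
        def __init__(self, value, err=0.0):
            self.value, self.err = value, err

        def __add__(self, other):
            v = self.value + other.value
            # propagate input errors, plus up to half an ulp of new rounding
            return Tracked(v, self.err + other.err + 0.5 * math.ulp(v))

        def approx_eq(self, other):
            # "equal within the margin of error for the computation"
            return abs(self.value - other.value) <= self.err + other.err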

~~~
kraghen
I haven't really seen the issue of equality dealt with explicitly, but it
appears that you can extract the bounds of the interval implied by a given
unum. Essentially, it seems like an alternative to interval arithmetic with
potentially tighter bounds and faster computation.

~~~
zmonx
Yes, I think these are fair statements.

John Gustafson's presentation also includes a few important points about
interval arithmetic:

[http://www.johngustafson.net/presentations/Right-SizingPreci...](http://www.johngustafson.net/presentations/Right-SizingPrecision1.pdf)

------
madez
Floating point numbers are suboptimal for representing timers. They use more
bits than needed, aren't exact across the bits they do use, and require more
effort per operation.

Unsigned integers are better for representing timers. If the tick is 1/30 s
as in the example, then with a 32 bit unsigned integer you could cover about
4.5 years (2^32 ticks) with perfect numerical accuracy and minimal
computational effort. With 32 bit floating point numbers you get only about 6
days of exactly counted ticks (2^24), while also needing to worry about
accuracy and paying a higher computational cost. At 64 bits it is roughly 19
billion years for unsigned integers vs about 9.5 million years (2^53 ticks)
for floating point. And if your game doesn't depend on state older than the
covered span, then wraparound arithmetic means unsigned integers allow
unlimited runtime.

Don't blindly use floats.
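
A quick illustration, assuming NumPy, with the counters held in ticks:

    import numpy as np

    # float32 silently drops whole ticks once the counter passes 2**24:
    t = np.float32(2**24)
    print(t + np.float32(1) == t)   # True: the tick vanished

    # uint32 counts every tick exactly until it wraps at 2**32:
    u = np.uint32(2**24)
    print(u + np.uint32(1) == u)    # False: still exact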

~~~
pjc50
Floating point numbers are _terrible_ at representing times and money
quantities. Which is a shame, given that Excel represents them that way.

------
ska
Worth a read if you haven't seen it: “What Every Computer Scientist Should
Know About Floating-Point Arithmetic”. It includes a more thorough discussion
of all the points in the blog post.

[https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

------
xaduha
There's no reason not to use a BigNum-equivalent type in your program where
computation-intensive tasks are not expected, as long as the language
supports them well enough.
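
In Python, for instance, the standard library already ships exact types; a
tiny sketch:

    from fractions import Fraction

    print(0.1 + 0.2)                          # 0.30000000000000004
    print(Fraction(1, 10) + Fraction(2, 10))  # 3/10, exact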

------
saagarjha
Psst… your LaTeX formulas could be improved if you touched them up a bit. You
can put text in \text{}, and use the actual commands for math functions such
as floor and log by prefixing them with a \.
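
For example, a small before/after sketch (the formula itself is made up for
illustration, and \text{} assumes amsmath):

    % before: floor, log, and "bits" all render as run-together italic variables
    $floor(log_2(x)) + 1 bits$
    % after: real operators, and \text{} for words
    $\lfloor \log_2(x) \rfloor + 1 \text{ bits}$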

~~~
Atrix256
Thanks. Probably ought to do that (:

------
yodacola
I try to avoid using floats when doing business calculations. Base 2 doesn't
work well with our base-10 currency.

Real-world example: 1.09375 - 1.09

Double: 0.00374999999999992

Decimal: 0.00375
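
The same comparison in Python, as a sketch (Decimal here is the
standard-library decimal.Decimal):

    from decimal import Decimal

    print(1.09375 - 1.09)                        # 0.00374999999999992
    print(Decimal("1.09375") - Decimal("1.09"))  # 0.00375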

~~~
wruza
Base 2 doesn’t work well with the military either:

[http://www-users.math.umn.edu/~arnold/disasters/patriot.html](http://www-users.math.umn.edu/~arnold/disasters/patriot.html)

------
0xbear
Key point that takes some time to sink in: there are only 2^32 32-bit floats
in total, meaning you can easily enumerate them all.
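
A sketch of how you might exploit that, assuming NumPy: every 32-bit pattern,
viewed as float32, is some value (including NaNs and infinities), so a
float32 function can be tested exhaustively, chunk by chunk:

    import numpy as np

    # Visit all 2**32 float32 bit patterns, 2**24 at a time:
    for start in range(0, 2**32, 2**24):
        bits = np.arange(start, start + 2**24, dtype=np.int64).astype(np.uint32)
        values = bits.view(np.float32)   # reinterpret bits, 16M floats per chunk
        # ... test your float32 function on `values` here ...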

