
What Every Computer Scientist Should Know About Floating-Point Arithmetic (1991) - brudgers
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
======
dang
Posted 30 times, but most of the comments are about how many times it has been
posted. Here are the on-topic threads:

2017 (just a bit):
[https://news.ycombinator.com/item?id=13431299](https://news.ycombinator.com/item?id=13431299)

2012:
[https://news.ycombinator.com/item?id=4815399](https://news.ycombinator.com/item?id=4815399)

2010 (not a good year):
[https://news.ycombinator.com/item?id=1982332](https://news.ycombinator.com/item?id=1982332)

2009 (also pretty bad):
[https://news.ycombinator.com/item?id=687604](https://news.ycombinator.com/item?id=687604)

[https://hn.algolia.com/?query=What%20Every%20Computer%20Scie...](https://hn.algolia.com/?query=What%20Every%20Computer%20Scientist%20Should%20Know%20About%20Floating-Point%20Arithmetic&&sort=byDate&dateRange=all&type=story&storyText=none&prefix)

------
s9w
That's quite a lot of material. By far the most important point for me: floats
guarantee roughly 7 significant decimal digits, and doubles roughly 15. Many
devs are regularly surprised by how low that is.
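A quick way to see the float limit (a Python sketch; `struct` squeezes a double into 32-bit storage and back):

```python
import struct

def to_float32(x):
    # Round-trip a Python float (a 64-bit double) through 32-bit storage.
    return struct.unpack('<f', struct.pack('<f', x))[0]

x = 0.123456789
print(to_float32(x))  # only about 7 significant digits survive
```

The same round-trip through `'<d'` (64-bit storage) preserves the value exactly.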

~~~
kstenerud
It may seem low, but it's really not. At 6 significant digits, you'd have a
margin of error of a thousandth of an inch when measuring a football field.

~~~
T-hawk
Sure, that's low, depending on context. At 6 significant digits, a million-
dollar bank account would have a margin of error of up to +/- 5 dollars.

And if you add up a million of any item with 6 significant digits, that's
enough for inaccuracy to propagate all the way into your _first_ significant
digit. (It would require systematic inaccuracy in one direction, but there are
certainly data sets and contexts where that would happen.)
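That drift is easy to demonstrate in a Python sketch: 0.01 has a small one-sided representation error in binary, and a million additions push it into the visible digits:

```python
total = 0.0
for _ in range(10**6):
    total += 0.01  # the stored value is slightly off from 1/100
print(total)       # close to, but not exactly, 10000.0
```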

~~~
kstenerud
[https://dzone.com/articles/never-use-float-and-double-for-mo...](https://dzone.com/articles/never-use-float-and-double-for-monetary-calculatio)

------
randcraw
Some historical context... IEEE 754 (the current floating point standard in
H/W and S/W) arose in 1985, only six years before this paper. Before that,
floating point representations were often proprietary (e.g. Cray or IBM). Back
then, most microprocessors did not have floating point hardware, so most non-
scientific apps avoided floating point operations and boolean tests on their
results. Few coders had much experience with floating point unless they were
engineers or scientists.

This paper broke new-ish ground in introducing 'the rest of us' to the
nonlinear representation space of digital scientific numbers and subtleties of
roundoff & inaccuracy inherent in math-intensive libraries and scientific math
functions. It still usefully covers more ground in that space than most of
today's BSCS degree programs.

------
seren
Oddly enough, I ran into an issue with floats this week: two systems
communicate by transmitting floats (rather needlessly, I would add), and the
conversion from float seems to be implemented differently on each system,
sometimes resulting in discrepancies.

This is the kind of textbook example of an obvious bug that is still
surprising when you find it.

~~~
enriquto
I do not understand what you mean. If you "convert" a float to a float, it
shouldn't change anything. As long as you stick to floats, there should be no
problem.

~~~
nestorD
You could run into endianness problems (little- vs. big-endian) if you do a
binary transfer or, more likely, have a lossy float->string->float conversion.
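A minimal sketch of the lossy text round-trip (fixed-precision formatting of the kind a text protocol might use):

```python
x = 0.1 + 0.2          # a 64-bit double, not exactly 0.3
s = "%.6f" % x         # sender encodes with 6 decimal places: "0.300000"
y = float(s)           # receiver parses the text back to a double
print(y == x)          # False: the round-trip lost the low bits
```

With shortest-round-trip formatting (what `repr` produces in modern Python) the text round-trip is exact; fixed-precision formatting is what loses information.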

~~~
dahfizz
Unless you are using an IBM mainframe, big-endian is all but extinct. Besides,
it is about 10 lines of C to detect endianness and reorder bytes if necessary.

~~~
pedrocr
> Besides, it is about 10 lines of C to detect endianness and reorder bytes if
> necessary.

Proper code to do this doesn't do any tests. Just grab the bytes from the
order of the transfer format (in this case network order) into the order of
the abstract machine in C. The compiler will turn that into whatever is needed
depending on whether it's compiling for a big-endian or little-endian machine,
and may use byteswap instructions if needed. Having tests in the code is a
code smell that creates unchecked paths in the code that are much more likely
to be wrong, since they are not exercised most of the time.
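A sketch of the pattern in Python (in C it is the same shifts and ORs): index the bytes in transfer order and let the arithmetic define the value, with no endianness test anywhere:

```python
def load_be32(b):
    # Assemble 4 bytes in network (big-endian) order; works identically
    # on any host, because only byte indices and shifts are involved.
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]

print(hex(load_be32(bytes([0x12, 0x34, 0x56, 0x78]))))  # 0x12345678
```

In C, a good compiler recognizes this shifts-and-OR pattern and emits a single load plus a byte-swap instruction on little-endian targets.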

~~~
account42
> and may use byteswap instructions if needed

Or it may not because often compilers fail to make simple optimizations:

[https://godbolt.org/z/NAt3uX](https://godbolt.org/z/NAt3uX)

~~~
pedrocr
It does fine if you actually write the correct code :)

[https://godbolt.org/z/qpLeWP](https://godbolt.org/z/qpLeWP)

~~~
account42
There is nothing incorrect about the code I posted - clang will compile it to
a single read + bswap just fine. And you don't need to go that far back for
your code to not be recognized - GCC 4.9 will produce individual loads and
shifts for that too.

The point is that you can't rely on compilers consistently to recognize
complex patterns.

~~~
pedrocr
Incorrect was too strong but it's a weird pattern to do this with sums. The OR
pattern is what is used pretty much everywhere and conveys the intention of
the code much more clearly.

And even if the compiler doesn't always optimize ideally my original point
still stands. Delegating this to the compiler instead of trying to manually
call swap instructions is a much better way of going about it.

------
bch
Original paper (PDF):
[https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf](https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf)

------
kristopolous
I think what every programmer should know is probably only 4 words:

They're approximations. Use accordingly.

------
phkahler
I'm still hoping posits make it one of these days.

------
henvic
This is gold. I've been forwarding this link to everyone who says they're
using floats to store currency values (money, $$$) since forever.

~~~
todd8
I've often used floating point for currency. What can go wrong? Actually, a
number of things can go wrong when using floating point as the article pointed
out. The main concern nowadays is that floating point doesn't represent values
with infinite precision and consequently is not capable of exactly
representing every possible value.

In the past, number formats were often a concern. The first assembly language
program I ever wrote was for an IBM mainframe and required converting some
numbers from zoned decimal format to packed decimal format. There were
actually native assembly instructions for doing calculations in these
different decimal formats. The IEEE 754 standard for floating point that we
use today came about in 1985. Before that, floating point might involve 60-bit
values (as on CDC mainframes) or hexadecimal-based floating point (as on the
IBM 360/370 mainframes). Double precision was not widely used because of
memory limitations. Floating point was much slower. Programming languages
didn't provide good or consistent facilities for non-integer values (it was
virtually impossible to predict which implicit conversions were being done by
PL/1 when doing mixed calculations; COBOL and FORTRAN handled calculations
wildly differently). I believe that some of the current general advice about
handling financial calculations stems from considerations that made sense
during these Middle Ages of programming.

Now, with double precision available everywhere and standard, specified
rounding modes, and the other benefits of IEEE 754, I think it's safe to
consider using floating point for currency calculations. _The most widely used
software for financial calculations uses floating point_ (Microsoft Excel).

If Excel uses floating point, why is there a widely promulgated admonition to
avoid it for currency? I believe it made sense in the Middle Ages of
computing, but it is less relevant now.

While it is true that some quantities cannot be represented exactly in
floating point, for example 0.01, the same is true about decimal fixed point
where 1/3 cannot be represented exactly. Common financial calculations can
fail to be exact no matter the number format.

Consider calculating the payments for a fixed-rate 30-year mortgage. At a 5%
interest rate, a $200,000 loan will have a monthly interest rate of 0.05/12
and 12 * 30 payments. The amount of each payment will be:

    
    
        200000 * (0.05/12) / (1 - (1 + (0.05/12))^(-(12*30)))
    

This formula cannot be calculated exactly in decimal floating point for two
reasons. First, the fraction (0.05/12) is not exact in decimal, and second,
there is unlikely to be a direct way to do exponentiation of decimal values.
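Evaluated in ordinary binary double precision, the formula is a one-liner (a sketch; the inexactness lives far below the cent that gets rounded at the end):

```python
r = 0.05 / 12                               # monthly rate, inexact in binary
n = 12 * 30                                 # number of payments
payment = 200000 * r / (1 - (1 + r) ** -n)  # standard annuity formula
print(round(payment, 2))                    # about 1073.64 dollars per month
```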

Some languages (like Common Lisp) support exact rational numbers. This allows
exact calculations with any rational fraction, but this still doesn't allow
calculations involving irrational numbers, like sqrt(2) to be represented
exactly. Consider calculating the cost of a circular plate when priced in
dollars per gram. This involves using pi.

Care must always be exercised when doing financial calculations if they need
to match how financial institutions are doing their calculations. Reports must
round fractional values using the same rounding methods (is 1.005 rounded to
1.00 or 1.01, i.e. round-half-even vs. round-half-up?). Values should usually be stored
after rounding to the same units used in the currency. These problems are not
caused by the inaccuracy of the 17th digit of a double precision floating
point being off by one.
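Python's `decimal` module shows the two conventions side by side, and also why a binary double is the wrong tool for the question: the double closest to 1.005 is slightly below it.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

d = Decimal("1.005")
print(d.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN))  # 1.00
print(d.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))    # 1.01
print(round(1.005, 2))  # 1.0, because the double nearest 1.005 is below it
```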

For further information on the kinds of considerations that need to be made
take a look at the design discussions that have been documented for the
beancount project [1].

[1]
[https://beancount.github.io/docs/31_rounding_precision_in_be...](https://beancount.github.io/docs/31_rounding_precision_in_beancount.html)

~~~
mamcx
> I've often used floating point for currency

The question is WHY? Because decimal types are barely second-class citizens
across the stack. It's similar with dates.

Bad defaults are bad, and require stupid amounts of workarounds.

And just because Excel does it?

[https://en.wikipedia.org/wiki/Numeric_precision_in_Microsoft...](https://en.wikipedia.org/wiki/Numeric_precision_in_Microsoft_Excel)

Excel is an amalgamation of surprises that never get fixed. And when you
produce the CORRECT results, the users demand that we give the same wrong
results as Excel!

------
bass_case
Hello imposter syndrome my old friend

~~~
brudgers
Computer science is hard and even Knuth is humbled by it. And he's been
writing several programs a week for about sixty years.

------
bjourne
A follow-up to this classic article could be titled "What every language
designer should know about humans." :) Floats are among the least user-
friendly ways to do decimal arithmetic. They aren't even that fast anymore,
given the resurgence of fixed point arithmetic in HPC.

Imo, in 2020 we should have had mainstream languages in which (2^(1/2))^2 ==
2. It's not rocket science. :) Float truncation is a little like an OS
silently converting a user's flac collection to mp3 because it was running low
on disk space.

~~~
ChrisLomont
> Imo, in 2020 we should have had mainstream languages in which (2^(1/2))^2 ==
> 2

Such things are either incredibly inefficient or mathematically unsound. And
there's plenty of provable impossible computational things involved.

Start at the tablemaker's dilemma, then learn the difficulty of deciding if
some basic expression is actually zero, and so on. There's a massive
literature on doing such computations, and many, many impossibility theorems.

There is no getting around the fact that some seemingly simple problems
require arbitrarily large computational resources to resolve - floating point
picks the very useful path of doing approximations in predictable time over
attempting exact answers in unbounded time and space.

~~~
bjourne
"Such things" are the foundation of symbolic math libraries, deep learning
libraries, most optimizing compilers and Wolfram Alpha. I.e it is not true
that they would be incredibly inefficient or mathematically unsound.

I don't understand your point about the Tablemaker's dilemma because it is a
dilemma only if precision is fixed. With the correct numeric type (rationals)
precision is not fixed.

~~~
ChrisLomont
Let's unpack:

> With the correct numeric type (rationals) precision is not fixed.

Almost no problems can be represented exactly by rationals. Even the sqrt(2)
you started with is not rational.

And rationals are unusable for almost any type of work, because they grow
exponentially (in time and space) for most problems.

For example, if you tried to compute a simple mandelbrot set with rationals,
you'd need more bytes than there are subatomic particles in the universe
(~10^80).

Proof: Mandelbrot takes a complex number c = x + i y, which you want to
represent as a rational, and iterates z <- z^2 + c. Squaring doubles the
number of digits of each of the numerator and denominator, so each iteration
requires ~2 times the storage of the previous iteration. Thus you'll need over
2^300 bytes for 300 iterations, which is > 10^90. A common mandelbrot uses
over 1000 iterations.

So you see rationals are computationally unusable for even simple tasks.

> symbolic math libraries

These can be arbitrarily slow and use arbitrarily large memory for simple
problems - which I stated above when I said "Such things are either incredibly
inefficient or mathematically unsound." For example simple converging infinite
sums that don't have closed forms known to your tools would never evaluate -
the only solution is some fixed, finite representation if you want to use the
value in later computations. The same thing for integrals, root finding, pdes,
and thousands of other problems.

> deep learning libraries

Deep learning libraries have led to a reduction in precision specifically for
the reasons I mentioned: lower memory requirements and higher speed. Hardware
vendors from Intel to NVidia to AMD to ARM have introduced lower precision
math, including the new bfloat16 (and some have a bfloat8) for these reasons.
I think this is the opposite of what you claimed, but supports my claim.

Care to show me where deep learning libraries do perfect symbolic computation
in the process of deep learning? I'm aware people try to train them to do
symbolic computation, but the libraries themselves aren't using the types of
reductions you posted.

> most optimizing compilers

Most compilers would reduce sqrt(2) * sqrt(2) to 2? List them. (Heck - list
one!) It's simply not true (as can be checked trivially for all common C++
compilers on godbolt).

I've seen none, despite following optimizing compiler theory and practice for
decades.

>Wolfram Alpha

Redundant with the symbolic stuff above - and I've used Mathematica since v1.0
in 1987 quite extensively. It most certainly is slower doing anything
symbolically that can be done numerically, and there are lots of problems it
cannot do symbolically without puking, but those same problems run instantly
when approximated with floating point.

There's ample pages listing things symbolic engines get wrong or cannot do,
but that can be done elsewhere.

>I don't understand your point about the Tablemaker's dilemma because it is a
dilemma only if precision is fixed

????

The dilemma is not about fixed precision.

The problem is correctly rounding a function as if it had an _infinite_
expansion first. For some functions this means arbitrarily large computations.
Ahead of time mankind does not know how many digits are needed for many common
uses. So the problem is not about fixed precision - it's about possibly
unbounded computation, which is vastly different.

For example, no one has a bound on the number of digits needed to compute
correctly rounded a^b for any two floating point values a and b.

And there exist computable numbers for which the rounded value can never be
computed with any amount of digits [3].

So it's not about fixed precision. It's far more subtle.

If you're willing to run computations unbounded in time and space, then you
can get more and more precise answers, but at every point you _still_ don't
know if you've escaped the Tablemaker's dilemma. This is the point.

Here's a guy I've followed for decades, one of many researchers on the topic
[1]. Note that his "hard to round cases" in 2013 took 1576 _years_ of computer
time to determine how many digits are needed for double precision values for
some elementary functions, precisely because this is not the trivial problem
you think it is.

Here's a 2016 paper doing some work on GPUs [2]: " For example, the Nesterenko
and Waldschmidt [1996] bound for the exponential in double precision states
that 7,290,678 bits of intermediate precision suffice to provide a correctly
rounded result."

To get 64 bits of accuracy, correctly rounded, you need 1 megabyte and an
incredible amount of computation. My point exactly. And these bounds tend to
increase exponentially, not linearly, soon pushing such issues beyond
currently computable (which is also why there are still research papers being
published on getting results for 64 bit doubles in various ways and for
various elementary functions).

Again you have the choice I presented: you either accept unbounded in time and
space behavior, or you fix time and space to make things usable with
approximations. Floating-point is an incredibly good solution to making
numerics as good as possible under fixed space and time requirements.

[1]
[https://www.vinc17.net/research/index.en.html](https://www.vinc17.net/research/index.en.html)
[2]
[https://dl.acm.org/doi/pdf/10.1145/2935746](https://dl.acm.org/doi/pdf/10.1145/2935746)
[3] [https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma](https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma)

~~~
bjourne
> Almost no problems can be represented exactly by rationals. Even the sqrt(2)
> you started with is not rational.

Almost all numerical problems software developers deal with can be represented
exactly using rationals. Only a small fraction of all problems involve
transcendentals. And those can't be represented exactly by floats either so I
don't know what point you're making?

> For example, if you tried to compute a simple mandelbrot set with rationals,
> you'd need more bytes than there are subatomic particles in the universe
> (~10^80).

Well, yes. You'd run into the same problem if you tried to compute all the
decimals of pi too. Using rationals does not mean that you can't round
results.

> So you see rationals are computationally unusable for even simple tasks.

I see that rationals cannot represent all real numbers.

> These can be arbitrarily slow and use arbitrarily large memory for simple
> problems - which I stated above when I said "Such things are either
> incredibly inefficient or mathematically unsound." For example simple
> converging infinite sums that don't have closed forms known to your tools
> would never evaluate - the only solution is some fixed, finite
> representation if you want to use the value in later computations.

I wrote that in mainstream languages should be able to evaluate (2^(1/2))^2 ==
2. How do you go from there to requiring them to be able to evaluate infinite
sums?!

> Care to show me where deep learning libraries do perfect symbolic
> computation in the process of deep learning?

I didn't claim that.

> Most compilers would reduce sqrt(2) * sqrt(2) to 2? List them. (Heck - list
> one!) It's simply not true (as can be checked trivially for all common C++
> compilers on godbolt).

I didn't claim that. Compilers are stuck with IEEE 754 fp semantics and
therefore won't simplify (2^(1/2))^2. But there is no technical reason why
they couldn't.

To paraphrase your reply: "Rational arithmetic and symbolic computation can't
solve every problem. Therefore they are useless/no better than floating point
arithmetic." From a theoretical perspective that is perhaps true but in
practice both are very useful tools.

~~~
ChrisLomont
>Almost all numerical problems software developers deal with can be
represented exactly using rationals. Only a small fraction of all problems
involve transcendentals.

There are more classes of real numbers than rationals and transcendentals.
Hint: your example of sqrt(2) is neither.
As to doing things _exactly_ with rationals, anytime you take a sin, cos, exp,
log of any non-zero rational number the result is not rational. Any time you
take a root of a non-perfect power rational you don't get a rational number
(you get an algebraic number, which is neither rational nor transcendental).

So all you can do is simple arithmetic: + - * / (actually, that follows from
the definition of the field Q of rational numbers).

And you still run into the problem that rational number arithmetic gets
arbitrarily slow and runs out of memory. Each multiply on average results in
requiring the sum of the storage sizes of the multiplicands, making more than
about 40 multiplications infeasible before you're out of RAM.

So your view of "almost all numerical problems software developers deal with"
must be limited to only those using simple arithmetic and no more than around
40 operations. That's amazingly constraining.

What problems do you call numerical problems that fit your constraints?

Things that cannot be done exactly with rational numbers includes pretty much
all of gaming (3D or otherwise), deep learning (probably every model has an
exp in it), scientific computing, audio and video processing (cos and sin all
over the place), and on and on.

>Using rationals does not mean that you can't round results.

Wait, you just claimed "...can be represented _exactly_ using rationals." Now
you want to round, throwing out the exactness? Which is it?

And congrats - you just re-invented floating-point math which is simply
approximating values using rational numbers in a clever manner to handle
larger ranges than fixed point. But they're still always rational
approximations to values.
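Python's `fractions` module makes that concrete: every finite double is exactly some dyadic rational, just not the one the decimal literal suggests.

```python
from fractions import Fraction

# The double written as 0.1 is exactly this rational, not 1/10:
print(Fraction(0.1))   # 3602879701896397/36028797018963968
print(Fraction(0.1) == Fraction(1, 10))  # False
```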

>I wrote that in mainstream languages should be able to evaluate (2^(1/2))^2
== 2. How do you go from there to requiring them to be able to evaluate
infinite sums?!

Ah, so you mean languages should be able to simply solve this one specific
problem, not all relatively simple symbolic problems? That seems like an even
bigger mess than using floating point, which is well-defined and covers a
large class of problems.

Does your mythical language know (A^(1/B))^B==A for all rationals A and B also
representable in your language? Or are you limited to the value A=B=2 only?

(Hint: if you claim it should hold, you're gonna hit problems again :)

~~~
bjourne
> As to doing things exactly with rationals, anytime you take a sin, cos, exp,
> log of any non-zero rational number the result is not rational.

I fail to see your point. I never claimed that you can do real arithmetic with
rationals.

> So all you can do is simple arithmetic: + - * / (actually, that follows
> from the definition of the field Q of rational numbers).

Same as with floats. Except with floats you don't even get division.

> And you still run into the problem that rational number arithmetic gets
> arbitrarily slow and runs out of memory. Each multiply on average results in
> requiring the sum of the storage sizes of the multiplicands, making more
> than about 40 multiplications infeasible before you're out of RAM.

What?

    
    
        >>> from fractions import Fraction
        >>> Fraction(7, 8)**40
        Fraction(6366805760909027985741435139224001, 1329227995784915872903807060280344576)
    

In other words, no, you don't run out of memory after 40 multiplications...

> Things that cannot be done exactly with rational numbers includes pretty
> much all of gaming (3D or otherwise), deep learning (probably every model
> has an exp in it), scientific computing, audio and video processing (cos and
> sin all over the place), and on and on.

You sure can do all of those things with rational arithmetic. What you can't
do is do them _exactly_ but I never claimed you could.

> > Using rationals does not mean that you can't round results.

> Wait, you just claimed "...can be represented exactly using rationals. Now
> you want to round, throwing out the exactness? Which is it?

I wrote "ALMOST ALL numerical problems software developers DEAL WITH can be
represented exactly using rationals." The key word in that sentence is
"ALMOST."

> And congrats - you just re-invented floating-point math which is simply
> approximating values using rational numbers in a clever manner to handle
> larger ranges than fixed point. But they're still always rational
> approximations to values.

Yes, but I have also solved the freakishly annoying issues detailed in the
article. Which one do you prefer:

    
    
        >>> print(sum(0.1 for _ in range(10)))
        0.9999999999999999
    

or

    
    
        >>> print(sum(Fraction(1, 10) for _ in range(10)))
        1
    

? I know which one _users_ prefer.

> >I wrote that in mainstream languages should be able to evaluate (2^(1/2))^2
> == 2. How do you go from there to requiring them to be able to evaluate
> infinite sums?!

> Ah, so you mean languages should be able to simply solve this one specific
> problem, not all relatively simple symbolic problems? That seems like an
> even bigger mess than using floating point, which is well-defined and covers
> a large class of problems.

I don't think Mathematica or the Python symbolic math packages I've worked
with are particularly messy.

> Does your mythical language know (A^(1/B))^B==A for all rationals A and B
> also representable in your language? Or are you limited to the value A=B=2
> only?

Yes, assuming A and B are positive.

~~~
ChrisLomont
> What? ... > In other words, no, you don't run out of memory after 40
> multiplications...

So you think one case proves there is no set of 40 multiplications that
overflows?

Since we're using a small starting value and only one operation per iteration,
let's try 50 mults:

    
    
        x = Fraction(7, 8)
        for i in range(50):
            x = x * x
        print(x)  # as an exact decimal fraction
    

Tell me how much ram and time this took you to evaluate with exact rational
values. I'll wait :)

(Hint: you'll need approximately 500 terabytes of RAM to store these integers.
I guess I didn't really need all 50 iterations!)
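The growth rate is easy to verify safely at small scale (a sketch with six squarings instead of fifty):

```python
from fractions import Fraction

x = Fraction(7, 8)
for i in range(6):
    x = x * x  # each squaring doubles the digit count
    print(i + 1, "squarings:", len(str(x.denominator)), "denominator digits")
# After 6 squarings the denominator is 8**64 == 2**192: a 58-digit integer.
# Extrapolating to 50 squarings gives 2**(3 * 2**50) bits -- far beyond any RAM.
```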

If you want only 40 mults, then using different starting values and adding a
few additions will cause the same overflow. In fact, all sorts of common
algorithms will overflow in very few operations using rationals. This is why
numerical algorithms textbooks don't even cover it as a possibility - it's a
terrible idea thrown out decades ago for precisely these reasons.

Your exact rational number idea has replaced predictable behavior on
reasonable inputs with highly finicky unstable behavior, and a programmer will
have no idea how to ensure things don't go off the rails for common uses.

In each reply you make fundamental math errors, provably incorrect claims, and
don't stop to learn. I already gave a proof that such simple things will
overflow - you ignored it, then tried to prove the converse with one toy
example. It's not worth showing you more errors when you ignore evidence and
continue disproven claims.

We're done. You're too stubborn to absorb relevant material.

~~~
bjourne
> So you think one case proves there is no set of 40 multiplications that
> overflows?

> Since we're using a small starting value and only one operation per
> iteration, let's try 50 mults:

That is repeated squaring, not repeated multiplication, and you are grasping
at straws. Try the same code using an integer or a float greater than 1. Since
the same thing will happen as with rationals (you get an error), are you going
to tell me those number types are useless too?

> In each reply you make fundamental math errors, provably incorrect claims,
> and don't stop to learn.

Now I realize that you are most likely trolling me. Fine. Have a nice day!
Bye!

