
The Fundamental Axiom of Floating Point Arithmetic - johnbcoughlin
http://www.johnbcoughlin.com/posts/floating-point-axiom/
======
fanf2
This “fundamental axiom” is not true:

 _”This means that for any two floating point numbers, say x and y, any
operation involving them will give a floating point result which is within a
factor of 1+ϵ of the true result.”_

It fails for x-y when x and y are close. Subtraction is the easiest way to get
big floating point errors. The article does mention this towards the end but
you need to wade through a lot of equations before you get told this very
important caveat.

The more tricky way to accumulate errors is to add lots of small numbers to a
big number. Just summing a list of floats can be fraught with peril! Kahan’s
summation algorithm is enlightening:
[https://en.wikipedia.org/wiki/Kahan_summation_algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm)
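
For reference, here's a minimal sketch of Kahan's compensated summation in C
(the function and variable names are mine, not from the linked page):

    #include <stddef.h>
    
    /* Kahan (compensated) summation: keep the rounding error of each
       addition in a separate variable so it isn't silently dropped when
       a small term is added to a large running sum. */
    double kahan_sum(const double *xs, size_t n)
    {
        double sum = 0.0;
        double c = 0.0;              /* compensation for lost low-order bits */
        for (size_t i = 0; i < n; i++) {
            double y = xs[i] - c;    /* apply the correction to the next term */
            double t = sum + y;      /* low-order bits of y are lost here...  */
            c = (t - sum) - y;       /* ...and recovered here (negated)       */
            sum = t;
        }
        return sum;
    }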

~~~
CharlesHorsey
"It fails for x-y when x and y are close"

I don't think that's quite right. Each individual operation is accurate within
epsilon, the trouble with cancellation is that the error from operations prior
to the subtraction is much larger than the result of the subtraction.

For example, if I do 1 + 2^30 - 2^30 in single precision I get zero. But each
individual operation is correct: 1 + 2^30 = 2^30, so the error is 1, which is
less than 2^30 * epsilon. Then 2^30 - 2^30 = 0, which is exactly correct.
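
A small sketch in C, for the sake of concreteness:

    #include <stdio.h>
    
    int main(void)
    {
        float big = 1073741824.0f;   /* 2^30 */
        float a = 1.0f + big;        /* rounds to 2^30: error of 1, well within 2^30 * epsilon */
        float b = a - big;           /* exactly 0: this subtraction is performed exactly */
        printf("%g\n", b);           /* prints 0, not 1 */
        return 0;
    }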

~~~
hpcjoe
This is due to the non-associative nature of FP. (a+b) + c != a + (b+c).

I disagree that the "axiom" as stated is fundamental. My main argument with it
is that I don't see an easy way to go from the axiom to usable theorems about
FP.

Happy to be wrong on this, but I don't see it at this time.

~~~
augustt
I'd say not only is it fundamental, it's pretty much the only tool you have to
start the analysis of an FP algorithm. If you want to analyze a naive sum
algorithm for instance you do that by recursively applying the a ⊕ b = (a +
b)(1 + ε) rule and figure out how the epsilons bubble up - the result being
that the error is linear with the sum length.
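
Sketching that recursion for a three-term sum (my expansion, with each
|ε_i| ≤ ε_machine):

    fl(fl(x1 + x2) + x3) = ((x1 + x2)(1 + ε1) + x3)(1 + ε2)
                         = (x1 + x2 + x3) + (x1 + x2)(ε1 + ε2 + ε1·ε2) + x3·ε2

So the error is bounded by roughly 2·ε_machine·(|x1| + |x2| + |x3|), and
repeating the argument shows that the bound for an n-term naive sum grows
like (n - 1)·ε_machine·Σ|xi|.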

------
rustybolt
I was recently thinking about writing something about fixed point vs floating
point.

The numbers that you can represent with floating point are exponentially
distributed. Floating point is well-behaved when you multiply things, and it
maintains a good relative error. Addition works best when you add similar
numbers. Overflow is almost impossible, but it's easy to get loss of
significance.

The numbers that you can represent with fixed point are uniformly distributed.
Fixed point is well-behaved when you add things, and it maintains a good
absolute error. Loss of significance doesn't happen, but it's easy to get an
overflow.

In both systems it's possible to accumulate error, but under normal
circumstances the approximation error should average out. Fixed point makes it
easier to reason about the error, floating point is a bit harder, but usually
works well in practice (with some exceptions -- for example, if you add a huge
array of small values you will run into trouble due to the loss of
significance).
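
A small C illustration of both failure modes (the magnitudes are chosen
purely for demonstration):

    #include <stdio.h>
    #include <stdint.h>
    
    int main(void)
    {
        /* Floating point: once the running sum reaches 2^24, adding 1.0f no
           longer changes it (loss of significance). True total: 20,000,000. */
        float fsum = 0.0f;
        for (int i = 0; i < 20000000; i++)
            fsum += 1.0f;
        printf("float sum: %.1f\n", fsum);      /* 16777216.0, stuck at 2^24 */
    
        /* Fixed point: store hundredths in a 32-bit int. The sum is exact,
           but a somewhat longer loop would overflow int32_t instead. */
        int32_t cents = 0;
        for (int i = 0; i < 20000000; i++)
            cents += 100;                       /* 1.00 per step */
        printf("fixed sum: %d.%02d\n", cents / 100, cents % 100);  /* 20000000.00 */
        return 0;
    }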

~~~
edflsafoiewq
The big problem with fixed point is reciprocals are really bad. If you have a
fixed point format that goes from 0 to 1000, then 0.0125% of the numbers are
below 1/8, whereas 99.2% of the numbers are above 8.

Obviously, for the reciprocal to be accurate, there needs to be the same
number of representable values above 1 as below 1, like there is for floats.

~~~
rustybolt
Good point, indeed. Fixed point works best for multiplication with numbers
that are approximately 1 in magnitude.

------
tux3
I'm far from a floating point expert, but the fact that the article does
everything in terms of epsilons, without mentioning ULPs a single time
confuses me.

In fact, that fundamental axiom of FP arithmetic is NOT in terms of some
absolute value epsilonMachine. As I understand it, it says that there exists a
relative epsilon value (which depends on your input numbers), and that the
error of any FP operation on those numbers will be bounded by that epsilon.

What the article doesn't explain is that floating point numbers and
operations are guaranteed to be accurate to the nearest half-ULP, not to some
fixed absolute epsilon. And what exactly is the absolute size of an ULP? It
depends!

The difference is so fundamental, I'm not sure you can have a useful mental
model of floats if you don't start by addressing this.

~~~
kd5bjo
Would you mind expanding the `ULP` acronym you're using? If it's so
fundamental to understanding this, I'd like to know what it is.

~~~
tux3
My apologies, ULP stands for "Unit in the Last Place".

The simple explanation is that an ULP is defined as the difference between two
consecutive floating point numbers, which means it isn't a fixed value.

Another "intuitive" explanation, is that if you increment the binary
representation of a float, the difference between the two numbers is 1 ULP.

(I'd explain more, but I have to run to a meeting! More info on Wikipedia:
[https://en.wikipedia.org/wiki/Unit_in_the_last_place](https://en.wikipedia.org/wiki/Unit_in_the_last_place))
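
A concrete way to see the size of a ULP, as a quick C sketch (`nextafterf`
from math.h steps to the adjacent float):

    #include <stdio.h>
    #include <math.h>
    
    int main(void)
    {
        /* The gap between adjacent floats grows with the magnitude of the
           number, so "1 ULP" is a different absolute value at 1.0 than at
           1,000,000.0. */
        float a = 1.0f;
        float b = 1000000.0f;
        printf("ULP at %g is %g\n", a, nextafterf(a, INFINITY) - a);  /* ~1.19e-07 */
        printf("ULP at %g is %g\n", b, nextafterf(b, INFINITY) - b);  /* 0.0625 */
        return 0;
    }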

------
kanobo
The equations are well integrated and match the font of the article very
well. I thought it was some special font + math library, but it turned out to
be just Georgia and MathJax with tasteful design choices.

------
falcrist
As someone who spends most of my time at work writing firmware for embedded
systems, I learned VERY quickly never to use floating point numbers when I
could use fixed point numbers. Better to use an integer number of thousandths
of a thing than try to use floating point numbers and have error. This mindset
is clearly common, because it seems like the vast majority of microcontrollers
have no floating point hardware anyway.
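
For instance, a tiny sketch of that style in C (the types and numbers here
are mine, just for illustration):

    #include <stdint.h>
    
    /* Store a voltage as an integer number of millivolts rather than a float
       number of volts; additions and comparisons are then exact. */
    typedef int32_t millivolts_t;
    
    /* Apply a gain expressed as a ratio (e.g. 3/2 instead of 1.5f), using a
       wider intermediate so the multiply can't overflow int32_t. */
    static millivolts_t apply_gain(millivolts_t v, int32_t num, int32_t den)
    {
        return (millivolts_t)(((int64_t)v * num) / den);
    }
    
    /* Example: 3.300 V with a gain of 1.5 -> apply_gain(3300, 3, 2) == 4950,
       i.e. 4.950 V, with no rounding error at all. */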

At some point I was listening to a talk where the guy called floating point
numbers _"The incredibly misleading numbers that never really work the way
that you want them to."_

That's probably the best description of floating point numbers I've ever
heard.

------
lopmotr
My personal fundamental axiom of floating point numbers is that they're like a
physical measurement of a continuous quantity. Examples:

1 + 2^30 - 2^30 = 0 is like adding a drop of water to the ocean then removing
an ocean's worth of water. You're not going to be able to meter that
accurately enough to leave the one drop behind.

Round-trip conversions from decimal to FP and back result in a slightly
different value. Why do you expect special treatment for a number system in
any particular base? Nature's physical quantities aren't in any base. And why
do you care about the 17th significant figure? You can't measure that with
common scientific equipment.

Cumulative errors. They're like tolerance stacking. Engineers define
dimensions from a common datum instead of in a chain from each other because
their physical measurements accumulate error like floating point does.

They don't allow extreme values. You're never going to need them because you
can't find a 10^400 m or 10^-400 m long object in nature either.

Dollars and cents don't work right. Money isn't a continuous quantity.

------
qppo
> All floating point arithmetic operations are exact up to a relative error of
> [machine epsilon].

What about subnormal (denormal) operands?

~~~
FullyFunctional
Exactly. This axiom doesn't hold for denormals, which probably don't matter
for finance, but really matter for calculations that regress numbers over
many iterations.

OT: while obviously all fixed-precision floating point must compromise on the
representation of real numbers, Posits give you more precision for the same
bits, behave far more sanely (e.g. they don't round to zero or Inf), and
their error bounds are much easier to understand.

EDIT: capitalize Posits

~~~
nestorD
> Posits give you more precision for the same bits

Independent papers doing some validation of Posit computations are starting
to come out and... it is not better, just a different trade-off. Overall it
seems to be worth it at 16-bit precision (where floating point is very
fragile) but highly debatable at 32 and 64 bits.

Note that many papers claiming good properties for Posits also introduce what
they call a Quire, which is... an exact dot product algorithm encapsulated in
an object (something that is also easily implemented with traditional
floating point). Comparing a naive sum in IEEE floating point with an exact
summation in Posits is an apples-to-oranges comparison (the algorithms are
different) when you want to evaluate the precision of a numeric type.
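
For example, a compensated dot product over ordinary IEEE doubles, in the
spirit of Ogita/Rump/Oishi's dot2, gets you most of the way there without any
new number format (this is just a sketch; the names are mine, and it assumes
a correctly rounded `fma`):

    #include <math.h>
    #include <stddef.h>
    
    /* Error-free transformations: s + e == a + b and p + e == a * b exactly. */
    static void two_sum(double a, double b, double *s, double *e)
    {
        *s = a + b;
        double t = *s - a;
        *e = (a - (*s - t)) + (b - t);
    }
    
    static void two_prod(double a, double b, double *p, double *e)
    {
        *p = a * b;
        *e = fma(a, b, -*p);   /* exact rounding error of the product */
    }
    
    /* Dot product with the rounding errors accumulated separately, giving a
       result roughly as accurate as working in twice the precision. */
    double dot2(const double *x, const double *y, size_t n)
    {
        double s = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double p, ep, t, es;
            two_prod(x[i], y[i], &p, &ep);
            two_sum(s, p, &t, &es);
            s = t;
            c += ep + es;
        }
        return s + c;
    }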

------
twelve40
> At Square, we used a long amount_cents, and we got along with our lives

how does this help if you need to divide or calculate sales tax? aren't you
back in the accumulating error trap?

~~~
kd5bjo
Accounting practice generally allows / requires rounding to some quantum
value at each operation, like $0.001. The issue with using floating point for
money isn't that it has errors, it's that those errors don't match the ones
you get when using a paper ledger.

~~~
kazinator
Firstly, if you treat floating-point without the proper care, that will still
be the case if you use an integer for all currency amounts (as a number of
cents) but reach for floating-point for percentage calculations and such.

Secondly, you can implement pencil-and-paper rounding rules using binary
floating point! It's done by serializing to decimal text, working with the
digits (perhaps as integers) and then converting back to floating point.

> _those errors don’t match the ones you get when using a paper ledger._

If you have errors in your paper ledger, you're not much of an accountant.

~~~
kd5bjo
> If you have errors in your paper ledger, you're not much of an accountant.

That's because accounting practice has solidified certain inevitable
inaccuracies as “the way things are done.” Any method that doesn't produce
the same accounting-correct but mathematically-imprecise results is “wrong,”
and those inaccuracies are based on a fixed-point model.

------
dvt
Great article, and a wonderful explanation as to why one should avoid (at all
costs) doing financial calculations using IEEE-754 floating point math.

> ... where each \epsilon is very small, on the other of \epsilon_machine.

Not sure if author is reading replies here, but I think "other" should be
"order."

~~~
cesaref
... except that every financial trading system i've ever seen is full of
floats (and i've seen a lot).

~~~
bigiain
Many years back I worked on a project involved in heavily government regulated
gambling (horse race betting). The rules were every calculation including all
intermediate steps had to be done to at least 1/10,000th of a cent precision,
and rounding was only allowed once at the very final step.

(Which, since my part of that project was a Javascript/jQueryMobile mobile
webapp, I managed to push all that to the backend team saying "I'm doing
calculations using Javascript, I don't even have proper integers to work with
here - you'll need to send me all financial figures as strings, and do all the
calculations at your end according to the rules. I can accurately display
strings, I will not look a judge in the eye and try to claim I can accurately
calculate _anything_ monetary in Javascript.")

------
ema
I wonder if it makes sense to treat floating point values not as single
points on the number line but as ranges. That way, when you subtract two
similar numbers, the result is a range that is wide relative to its distance
from 0, explicitly encoding the larger relative rounding error.

~~~
atq2119
That exists and is called interval arithmetic.
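
A minimal sketch of the idea in C (this widens each endpoint by one ULP as a
crude but safe bound; real libraries use directed rounding instead):

    #include <stdio.h>
    #include <math.h>
    
    /* A value known only to lie somewhere in [lo, hi]. */
    typedef struct { double lo, hi; } interval;
    
    static interval iv_sub(interval a, interval b)
    {
        interval r;
        r.lo = nextafter(a.lo - b.hi, -INFINITY);  /* widen outward to stay safe */
        r.hi = nextafter(a.hi - b.lo,  INFINITY);
        return r;
    }
    
    int main(void)
    {
        /* Two nearby values, each uncertain by about one ULP. */
        interval x = { nextafter(1.000001, 0.0), nextafter(1.000001, 2.0) };
        interval y = { nextafter(1.000000, 0.0), nextafter(1.000000, 2.0) };
        interval d = iv_sub(x, y);
        /* The result is near 1e-6, but its width is a few ULPs of 1.0, so its
           relative uncertainty is millions of times larger than the inputs':
           the cancellation is now explicit. */
        printf("difference lies in [%.17g, %.17g]\n", d.lo, d.hi);
        return 0;
    }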

~~~
ritter2a
There are even more improvements (in some regard) to this, e.g. Affine
Arithmetic[0, 1]. Here, error terms from different operations are accumulated
symbolically so that in the case where they would cancel out (as could be
useful for numerical algorithms), tighter bounds can be obtained. There are
several research groups working on improving such analyses.

\--

[0] -
[https://en.wikipedia.org/wiki/Affine_arithmetic](https://en.wikipedia.org/wiki/Affine_arithmetic)

[1] - [http://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-intervalicas...](http://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-intervalicassp03.pdf)

edit: formatting

------
nikolay
I started teaching my 12-year-old son programming, and when we got to floats,
his reaction to how computers handle simple math with fractions was pretty
astounding. I should've started him on Lisp instead.

------
tgb
I think this is a poor take on floating point numbers. Read the classic
article it references instead [1]. The IEEE guarantee is not that the result
is accurate to a relative epsilon but that it is the _closest_ representable
number to the true result. This is a clear guarantee but requires the reader
to actually know how floating point numbers are stored (which this article
fails to say!). It is also rather subtle and it needs to be explored in more
depth to be of any use.

[1]
[https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf](https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf)

~~~
kd5bjo
> The IEEE guarantee is not that the result is accurate to a relative epsilon
> but instead is the closest representable number to the true result.

The latter implies the former within the magnitude range that allows
normalized representation, which is the vast majority of calculations. As an
introductory article for a technical audience, it does a good job of
describing the abilities and limitations of floating point numbers. Going into
more detail about the representation runs the risk of sacrificing clarity for
precision.

~~~
tgb
I disagree and think that the article manages to fall into an "uncanny valley"
of not-actually-understanding. Why else would it need the "A Cautionary Tale"
section at the end? That's the first time he mentions that the previous
sections only hold for _positive_ numbers and shows that the statements made
before are bad.

And personally I prefer an "exact" guarantee that the result is as close-as-
possible rather than merely "within some relative epsilon". Compare his
(incorrect) “in mathematical terms” description with the actual statement: if
re(x) gives the real number represented by a floating point number x, then:

fl(re(x) + re(y)) = x + y

where addition on the left is of reals and on the right is the floating point
addition operator. This is _simpler_ and more accurate than his inequality!
And it works for sub-normals etc.

~~~
johnbcoughlin
The axiom holds for all numbers and all single operations. The final section
presents a gotcha that arises when reasoning about the amount of _propagated_
error in certain calculations (called catastrophic cancellation).

The "exact" guarantee that you mention is precisely the same as saying that
it's "within some relative epsilon". Using machine epsilon tells you _how_
close "as close as possible" is.

Your proposed improvement to the inequality is true, but is not really useful
for computing the amount of error you can expect from numerical algorithms.
For that, you want some quantification of how close is "as close as possible".
Having the relative error bounded by machine epsilon is an extremely strong
guarantee, as the article shows.

~~~
tgb
The key I think your article is missing though is that it should explain which
numbers are representable exactly as floating points. This is very important
information for other reasons too and _that_ tells you how close "as close as
possible" is. I can tell you that when I started learning about floating point
numbers I knew there was some relative error and believed that operations like
addition and square root were just "close enough" - in fact they're as good as
possible and the only difficulty is which numbers we're allowed to write down!
Enlightenment! And everything else falls out from there.

I do think your edit is a good clarification to make. My point about the "in
mathematical notation" part though is that it assumes a and b are exactly
represented as floating point numbers (else catastrophic cancellation can
happen) though the sentence after the equation implies they are _not_
representable.

------
kazinator
> _The moral of the story is, never use a floating point number to represent
> money._

That is simply not true. Floating-point arithmetic has sufficient precision to
represent multiples of $0.01 with vanishingly small error, even for
astronomical sums of money that your ledger will never see. If it didn't, it
wouldn't be able to "numerically solve differential equations, or linear
systems, or land on the moon." Moreover, programs like Microsoft Excel
wouldn't be able to use floating-point.

You have to make sure that all calculations that produce a currency amount are
actually rounded to the closest $0.01 multiple (using cent-based dollar
accounting as an example).

The main rookie mistake is to leave an amount represented as, say,
$17.4537359..., and then just _show_ it as $17.45 on paper or in the user
interface.

Rounding to the closest $0.01 will thwart cumulative error. So for instance,
suppose we start with 0, and then repeatedly add 0.01. The naive way would be:

    
    
       for (x = 0.0; x < WHATEVER; x += 0.01)
       {
          /* x drifts further from an exact multiple of 0.01 on each pass */
       }
    

this will suffer from cumulative errors. Eventually the errors will add up to
more than half a penny (0.005) and we are toast: N iterations of our loop
will have left x closer to N+1 cents or N-1 cents than to N cents.

However, if we do this:

    
    
       for (x = 0.0; x < WHATEVER; x += 0.01)
       {
          /* snap x back to the closest representable penny amount */
          x = round_to_nearest_penny(x);
       }
    

then we are fine: our code will nicely stride from one good penny
approximation to another. We only start to run aground when x becomes
astronomical, representing a number of pennies that is an integer in the 14 or
15 digit range or something like that. After that we start getting into a
range where consecutive pennies cannot be represented any more at all, let
alone good approximations.

If we are just dealing with pocket change, like a few billion dollars here and
there, we are good.
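
For concreteness, round_to_nearest_penny could be as simple as this sketch
(any variant that rounds correctly to a cent will do):

    #include <math.h>
    
    /* Scale to cents, round to the nearest integer, scale back. nearbyint()
       uses the current rounding mode, round-to-nearest-even by default. */
    double round_to_nearest_penny(double x)
    {
        return nearbyint(x * 100.0) / 100.0;
    }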

So that is to say, we can't even take a column of figures that are
individually the best approximations of number of pennies and simply add them
together; we have to regularly clamp to the nearest penny as we carry on the
summation. It doesn't necessarily have to be after each addition, but we can't
let the error stray too far from the abstract penny we are trying to
represent. Any value we record into a ledger should be crisply clamped to the
floating point system's best approximation of that dollar and cent amount.

Using (binary!) floating-point, we can even implement specific pencil-and-
paper rounding methods for fractional results. This is because we can render
the floating point value into text, out to several more than the required
number of digits of precision, and then apply a textual technique to analyze
the digits.

Using double-precision IEEE floating point, I can easily write a routine that
will round 0.125 to 0.12 but 0.135 to 0.14. I.e. banker's rounding to the
closest even penny.

Now these numbers are actually:

    
    
       Abstract                      Actual
       0.125                         0.125
       0.12                          0.1199999999999999956
       0.135                         0.1350000000000000089
       0.14                          0.1400000000000000133
    

So e.g. my code will actually be taking 0.1350000000000000089 to
0.1400000000000000133. First, I will get the value as text rounded to four
digits after the decimal: "0.1350". Then I process the four digits as a
string, or perhaps make an integer x out of them. I can work out that 1350 %
100 is 50, indicating that the number is "on the fence". Then I get the
hundreds digit, h = (x / 100) % 10, which is 3. Aha, odd, so we have to round
up to the even penny, 4: ((x / 1000) * 1000) + (h + 1) * 100 = 1400, i.e.
0.14.
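
A compact sketch of such a routine (it ignores negative amounts and does only
minimal parsing, just to show the shape of the technique):

    #include <stdio.h>
    #include <stdlib.h>
    
    /* Render the amount with four digits after the decimal point, then decide
       the penny from those digits, rounding a tie ("on the fence") to the
       even penny. Assumes a non-negative amount of ordinary magnitude. */
    double round_to_even_penny(double amount)
    {
        char buf[64];
        snprintf(buf, sizeof buf, "%.4f", amount);    /* e.g. "0.1350" */
    
        char *dot;
        long dollars = strtol(buf, &dot, 10);         /* dot ends up at '.' */
        long frac    = strtol(dot + 1, NULL, 10);     /* e.g. 1350 */
    
        long pennies = frac / 100;                    /* e.g. 13 */
        long rest    = frac % 100;                    /* e.g. 50 */
    
        if (rest > 50 || (rest == 50 && pennies % 2 != 0))
            pennies++;                                /* round up, or up to even */
        if (pennies == 100) { pennies = 0; dollars++; }  /* carry into dollars */
    
        return dollars + pennies / 100.0;
    }
    
    int main(void)
    {
        printf("%.2f\n", round_to_even_penny(0.125));  /* 0.12 */
        printf("%.2f\n", round_to_even_penny(0.135));  /* 0.14 */
        return 0;
    }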

