Hacker News new | past | comments | ask | show | jobs | submit login
Take care, it's Floating Point Arithmetic (thelimbeck.wordpress.com)
35 points by sankha93 on Apr 12, 2013 | hide | past | web | favorite | 49 comments



Hi. Member of the IEEE-754 committee here.

(Standards-conformant) floating-point addition and multiplication absolutely are commutative up to NaN encodings. This follows trivially from the fact that they are basic operations (and hence correctly rounded), combined with the fact that the corresponding operations on real numbers are commutative.

Feel free to ask other questions about floating-point superstitions, and I’ll do my best.


This - I disagree with the entire last paragraph of the original article, although I think he meant to imply that the operations are not commutative for all real numbers x, since applying the same operation in reverse on the result will not return the same number x (unless, of course, the proper precision arithmetic is used).

The rest of the article is great, and really hits at how to deal with mixing floating point arithmetic with really small and really large numbers. Start small so the partial sum remains as accurately as possible, since any exponential corrections will result in shedding some significant digits. Then again, if this is the point at which you start nit-picking about commutativity, you should probably use a higher precision floating point representation, or look towards "infinite" precision numerical libraries.


Accurate rounding of long sums is a surprisingly interesting problem (especially when cancellation is present!). To get correct rounding, there is a tradeoff between how much sorting of the summands you need to do and how much extra precision needs to be carried, and the algorithms can be quite subtle.


I remember there was clever algorithm that substracted before adding, that made it possible to get much more accurate sums.

EDIT: found it http://en.wikipedia.org/wiki/Kahan_summation_algorithm


There's even more cleverness you can do -- there are exact summation algorithms[1], which guarantee that the sum you get is the floating point representation of the real number sum corresponding to the sum of the real numbers corresponding to the summands (that's a mouthful!). However, they have superlinear runtimes and non-constant space (well, assuming you don't incorporate the size of the float into the constant). In Python, you can use math.fsum to use one. However, even these cannot recover already-lost precision, so these need to be used with care as well. For example, if you solve a quadratic and average the two roots to get the quadratic's maximum or minimum (stupid approach, yes this is contrived), you already have some lost precision in solving the quadratic and rounding results that isn't possible to recover. Also, if there is cancellation involved, there is error you can't recover from -- this is called the condition number of the sum (it is (Sum abs(x_i)) / abs(Sum x_i)), and it represents a error bound (of a complex sort) that it's impossible to improve upon.

[1] www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps


Yes, the article was too pessimistic. The cool thing about IEEE fp is that it does have rules, and implementations on the usual platforms follow them very well. (I know, there are exceptions.)


Sounds pretty definitive 8-)

Thanks.



A pretty interesting (if quite long) introductory reading about the subject:

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.ht...


If you don't have time to read the whole thing and just want to learn how to add reasonably, I recommend http://en.wikipedia.org/wiki/Kahan_summation_algorithm


    > Floating point arithmetic is not arithmetic we learnt in school.
This is certainly true ...

   > It's not commutative ...
Hmm. Is it not commutative?


Order of operations matter. Try adding a few small floating point numbers together with a really large floating point number in different orders.


Yes, that's associativity failing, not commutativity.


It's not commutative: -0 + 0 = -0, but 0 + -0 = 0.


But doesn't IEEE-754 consider -0 and 0 equal? From https://en.wikipedia.org/wiki/IEEE_floating_point#Total-orde... :

"The normal comparison operations however treat NaNs as unordered and compare −0 and +0 as equal."


Right. -0 == 0. The existence of a "-0" is a property of the underlying representation. The mathematical properties don't know about the sign bit, per se.


Ah, that's news to me - can you provide a reference? I'm trying to look it up but my Google-fu is failing to find a specific link. Thanks.


That's not the behaviour I'm seeing on my machine. (http://codepad.org/rlLuB5VF) ... but, if you subtract zero from zero the sign depends on the order: http://codepad.org/g9LzZWPd .



So quoting from the discussion:

    Oh, of course: for non-NaNs, correctly implemented IEEE
    arithmetic is "exactly rounded". So addition of non-NaNs
    must be commutative for any given rounding rule, ...


(-0.) + (+0.) and (+0.) + (-0.) both evaluate to +0., as any compiler should readily confirm.


N.B.: Except in round-to-minus-infinity, in which case they evaluate to -0.


Is the compiler calculating that at compile time based on the lexical tokens, or at runtime with native CPU floating point instructions? That absolutely matters.


Take the example that I cited: 1 + 2^-25 + 2^-25 ... (2^25 times) will give you 1.

But if you do: 2^-25 + 2^-25 ... (2^25 times) + 1 it will give you 2.


But that uses associativity, and it's the associativity that fails. I believe that "a+b" is always equal to "b+a".

A more simplified version of what you say is that:

  1 + x + x + x + x != x + x + x + x + 1
when x is suitably small. But to convert one into the other requires both associativity and commutativity:

     (((1 + x) + x) + x) + x

  -> (((x + 1) + x) + x) + x using commutativity
  -> ((x + (1 + x)) + x) + x using associativity
  -> ((x + (x + 1)) + x) + x using commutativity
  -> (((x + x) + 1) + x) + x using associativity
  -> ((x + x) + (1 + x)) + x using associativity
  -> ((x + x) + (x + 1)) + x using commutativity
  -> (((x + x) + x) + 1) + x using associativity
  -> ((x + x) + x) + (1 + x) using associativity
  -> ((x + x) + x) + (x + 1) using commutativity
  -> (((x + x) + x) + x) + 1 using associativity
If you write Y as x+x+x+x, then 1+Y == Y+1.

It's the associativity that fails.


In this example, yes. But floating point commutativity isn't guaranteed for multiple operations. a + b = b + a, but a * b * c is not guaranteed to equal b * a * c due to the rounding effects in the registers, which means commutativity fails for numbers that are at the rounding edges of floating point dynamic range.


(a * b) * c is equal to (b * a) * c. And floating-point arithmetic is commutative, according to Eric Lippert (http://blogs.msdn.com/b/ericlippert/archive/2005/01/20/fun-w... ) and an university professor (http://www.cs.uiuc.edu/class/fa07/cs498mjg/notes/floating-po... ). Also according to IEEE 754, which I can't seem to find online.


That's a useful detail, but can you be a bit more specific, or point me to a paper that describes this effect in more detail? The discussions I've had in the past, and the papers I've read in the past, have left me with the impression that every "odd" effect of this type was an interaction with associativity, and not with commutativity.

I'd like to learn more, but searches are failing to turn up anything specific and detailed. In particular, the usual reference - http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.ht... - doesn't seem to mention commutativity.

Thanks.


Hmm! Well I took a shot at it on python

"Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin"

http://i.imgur.com/2B4T0nn.png

the first 2 outputs are the same, demonstrating that commutative property holds. the next pair of outputs seems to indicate that commutative is failing (but this may be associative in the background). The third pair of outputs reveals explicitly that associative fails. So it's the 2nd pair that is probably causing the confusion in understanding since a lack of parenthesis seems to imply that we aren't looking at some sort of associativity when we really are.

Now, if we go to code bin and run the same test:

http://codepad.org/tBH9PgJE

There is no discrepancy, and that's because of rounding.

Now it is clear to me the confusion, and I hope that helps illustrate some of the issues for others!


I'm pretty sure Steve McConnel had a bit on this (adding the absolute magnitude "small" numbers up before the big ones - sorted) in order to avoid underflow and loss - in the book "Code Complete" back in the early 90s.

Obviously, it's a known old issue, since the IEEE folks point out that their standard produces an error (NaN) to indicate data loss if you add things with too small and too big a magnitude together. Not all floating point is IEEE, some implementations just give you their best estimate, throwing away the little bits.

And, please, DON'T use floating point to work with currency, ever. http://roboprogs.com/devel/2010.02.html


(1) NaN is not an error code. It is a value added to the system to make it closed under all the usual operations (thus eliminating the need for error codes).

(2) Adding values of varying magnitude does not produce NaN results in IEEE-754 arithmetic. The only addition that produces a NaN is +Infinity + -Infinity, which have equal magnitude.


Well, drat. I was under the impression from comments elsewhere in this thread that part of the (a?) standard was to track overflow/underflow.

Go with what I know, then: add the small numbers first (sort the addends in increasing absolute magnitude)


For single precision floats: adding 2^-25 (2^25 times) won't get you to 1 since the fraction is only represented by 23 bits...


Actually, it will. 2^-25 is represented exactly, since it is really represented as 0b 1.00000000000000000000000 * 2^(0b100011001 - 128). The exponent is represented exactly, and the mantissa (the 1.xxxx term) is representable exactly since it is exactly 1.


If you repeatedly add 2^-25 to a single accumulator register, then you'll hit a wall (1-2^-25 is not representable). If you do something less efficient like pairwise addition or Kahan summation, then you can avoid that problem.


You are completely correct, and I was wrong. My sincere apologies for filling the net with misinformation. Thanks for pointing out the correction, and I now better see what you are saying.


I think you mean "it's"


We should stop this limited-precision dyadic nonsense and go back to first principles.

    Natural  := Zero | Succ of Natural

    Integer  := Positive of Natural | Negative of Natural
    ## Negative(0) |=> error in constructor

    Rational := {num:Integer, denom:Integer}
    ## denom <= 0 || gcd(num, denom) != 1 |=> error in const.

    Real := (Rational, () => Real)
    ## Lazy Cauchy Sequence-- user is responsible for convergence.

    Complex := {re:Real, im:Real}

    Quaternion := {re: Real, i:Real, j:Real, k:Real}

    Octonion := ... ## NOW we can use FP because octonions   
                    ## are non-associative anyway!
Example usage (still pseudocode):

    $ realMath.sqrt(Positive(Succ(Succ(Zero)))
    -> (Rational{num = Positive(Succ(Succ(Zero))) 
                 denom = Positive(Succ(Zero))},
        #<fn () => Real@f73d>)  ## function produces (3/2, fn)

    ## Of course, with a proper type system, we'd never let sqrt return a Real! Sorry.
    ## It should be List[Complex], like so. 

    $ realMath.sqrt(Negative(Succ(Zero))
    -> Cons(Complex{re = (Rational{num = Positive(Zero),
                                   denom = Positive(Succ(Zero))},
                          #<fn ...>,
                    im = (Rational{num = Negative(Succ(Zero)),
                                   denom = Positive(Succ(Zero))},
                          #<fn ...>)},
            Cons(Complex{...},  ## The other root...
                 Nil))


As long as you’re prepared to wait while the computer tries to figure out which branch to take on "if (a != 0)”, without knowing if it will ever make a decision =)


The simplest solution to that is not to define equality for the reals.


Sure. But you realize that reals are sometimes equal?

Also, comparison is also undecideable, and that's pretty certainly an operation you want to have at times.


Rather than the Cauchy-sequence approach for reals, you could just use a symbolic representation with a standard set of simplification rules, like most Computer Algebra Systems, with resort to floating point approximation for explicit inexact operations (including approximate comparisons).

For strict equality, if two reals have the same symbolic simplification, they are equal, otherwise, not. For greater than or less than comparisons (on possibly a "tolerant" form of equality), but if you can't resolve the comparison symbolicly using a set of guaranteed-to-terminate operations, you can resort to calculating floating point approximations of the reals and then comparing those. You might even have "exact" comparison operations that throw an exception if they can't be resolved without resorting to floating point approximations, and "tolerant" comparisons that allow approximations.


That's not really necessary. If you need approximate comparison/equality, just compare up to a fixed depth. You'll get an answer that is one of: gt, lt, eq or unknown.

There are some cases where your system provides better results, comparing two equal irrational numbers for example, but I can't think of a use case for that.


> That's not really necessary. If you need approximate comparison/equality, just compare up to a fixed depth. You'll get an answer that is one of: gt, lt, eq or unknown.

I was trying to (provide a user option to) rule out "unknown" as an answer. Though, you can do that with your method, too (with the same kind of risk of error that you'd have with resort to floating point approximation, so in that regard its no worse than my method.)

> There are some cases where your system provides better results, comparing two equal irrational numbers for example, but I can't think of a use case for that.

I suspect that doing symbolic calculations with a standard set of reduction rules may be more efficient over multiple manipulations involving irrational numbers than manipulating them in lazy Cauchy sequence form, at least when you actually need at some point to do comparisons or numerical approximations (and, certainly, for display, either a more traditional symbolic form or a numerical approximation is going to be preferred to any kind of direct representation of a lazy Cauchy sequence.)

But that's just an intuition, and certainly not in an area that I am an expert in.


> Sure. But you realize that reals are sometimes equal?

Of course.

> Also, comparison is also undecideable, and that's pretty certainly an operation you want to have at times.

Why?


Excellent point, but perhaps termination is just a detail. We already know that it's unsolvable. What's the rush?

Next you'll be telling me that different polynomials in P matter, like O(n^27) is too slow or at least requires attention to constants and scale.

This construction also solves the divide-by-zero problem very elegantly. When we promote Natural Zero to Real, we make sure that it's something like:

    (1, λ_:Unit. (1/2, λ_:Unit. (1/3, λ...)))
This makes division by zero a legal operation (pointwise division of sequences) if we extend our tolerance to sequences without the Cauchy property (and why not? It's just bits).


Does there exist a legitimate practical use case for octonions? :)


If you believe that Lie theory is "pratical," then yes. See John Baez's excellent survey: http://math.ucr.edu/home/baez/octonions/


Quaternions are useful for modeling rotations, which become non-commutative in 3+ dimensions. Octonions? They're connected to certain loops (in the abstract algebra sense... they aren't a real group because of non-associativity) for which I have no idea whether there are real-world applications. Group theory gets some play in cryptography, and loops turn out to be useful in survey design.

More screwy are the sedenions (16 basis vectors) which also have zero divisors (i.e. xy = 0 does not imply x = 0 or y = 0).




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: