It actually isn't "surprising" if you understand how the format works. It essentially uses scientific notation but in binary, with a set number of bits for both the mantissa and the exponent as well as a few changes thrown in for better behavior at its limits (like denormalization). This means that it can't directly express numbers which are very easy to write in decimal form, like 0.1, just like we can't express 1/3 as a finite decimal. It's designed to manage this as well as possible with the small number of bits at its disposal, but we still inevitably run into these issues.
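For instance, in Python you can inspect the binary form directly: float.hex() shows the mantissa and exponent, and Fraction shows the exact rational value a double actually stores:

>>> (0.1).hex()
'0x1.999999999999ap-4'
>>> from fractions import Fraction
>>> Fraction(0.1)
Fraction(3602879701896397, 36028797018963968)

(That denominator is 2^55; the stored value is close to, but not exactly, 1/10.)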
Of course, most programmers only have a vague idea of how floating point numbers work. (I'm certainly among them!) It's very easy to run into problems. And even with a better understanding of the format, it's still very difficult to predict exactly what will happen in more complex expressions.
A really cool aside is that there are some relatively new toys we can use to model floating point numbers in interesting ways. In particular, several SMT solvers including Z3[1] now support a "theory of floating point" which lets us exhaustively verify and analyze programs that use floating point numbers. I haven't seen any applications taking advantage of this directly, but I personally find it very exciting and will probably try using it for debugging the next time I have to work with numeric code.
A little while back, there was an article about how you can test floating point functions by enumerating every single 32-bit float. This is a great way of thinking! However, people were right to point out that this does not really scale when you have more than one float input or if you want to talk about doubles. This is why SMT solvers supporting floating point numbers is so exciting: it makes this sort of approach practical even for programs that use lots of doubles. So you can test every single double or every single pair of doubles or more, just by being clever about how you do it.
I haven't tried using the floating point theories, so I have no idea how they scale. However, I suspect they are not significantly worse than normal bitvectors (i.e. signed/unsigned integers), and those scale really well to larger sizes and multiple variables. Assuming the FP support scales even a fraction as well, this should be enough to practically verify fairly non-trivial functions!
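Here's a sketch of what that could look like with Z3's Python bindings (I haven't run this, so treat it as illustrative rather than authoritative):

from z3 import FP, Float32, RNE, fpAdd, fpEQ, fpIsNaN, Not, Solver, sat

x, y, z = FP('x', Float32()), FP('y', Float32()), FP('z', Float32())
rm = RNE()  # round-to-nearest-even, the IEEE default

s = Solver()
# rule out NaNs so "different" means genuinely different values
s.add(Not(fpIsNaN(x)), Not(fpIsNaN(y)), Not(fpIsNaN(z)))
# ask for a counterexample to associativity: (x + y) + z != x + (y + z)
s.add(Not(fpEQ(fpAdd(rm, fpAdd(rm, x, y), z),
               fpAdd(rm, x, fpAdd(rm, y, z)))))
if s.check() == sat:
    print(s.model())  # concrete floats witnessing non-associativity

[1]: http://z3.codeplex.com/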
As the answer on stackoverflow mentions, there is the -ffast-math option to GCC that tells the compiler to treat FP operations as associative. I've done some testing.
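For a concrete sense of what reassociation is allowed to change, here's the classic case in a Python session (Python floats are the same IEEE doubles C works with):

>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False
>>> (0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)
(0.6000000000000001, 0.6)

With -ffast-math, the compiler may freely rewrite one of these orderings into the other.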
LLVM and GCC use the same algorithm for reassociation (and as far as I know, there is still no good literature on the tradeoffs one way or the other), so you will get the same results, modulo some small implementation differences.
The reassociation passes exist mainly to promote redundancy elimination, and so the factorizations they perform are tilted towards making binary operations that look the same.
Factorization and transformation into pow calls is also done, but this is just because it's easy to do :)
Note that the operations it forms may be "less than ideal" from a register pressure perspective, since it does not take this into account at the level reassociation is being performed (it assumes something will later reassociate them some other way if it wishes).
I find it illuminating to look at the exact values of the floating point numbers. This takes away an air of mystery: floating point numbers do in fact have very specific values, which we can inspect exactly.
The operands can't be represented exactly as floating point so they are rounded to the nearest floating-point number before the addition even happens. And once the addition does happen, it is rounded to the nearest representable floating point value.
Now we can look at the same for 0.1 + (0.2 + 0.3):
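Using the Decimal trick from the session below, the exact values are:

>>> Decimal(0.2) + Decimal(0.3)
Decimal('0.5')
>>> Decimal(0.1) + Decimal(0.2 + 0.3)
Decimal('0.6000000000000000055511151231257827021181583404541015625')
>>> Decimal(0.1 + (0.2 + 0.3))
Decimal('0.59999999999999997779553950749686919152736663818359375')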
It's interesting to note that the first addition ends up being exact! This is a bit lucky, because 0.5 is exactly representable in floating point and the initial operands got rounded in opposite directions. It's also interesting that 0.6... only turns into 0.599999... at the point that we round the answer.
You can play around with this stuff conveniently using the Python "Decimal" module.
$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 1000 # To prevent truncation/rounding
>>> Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')
>>> a = Decimal(0.1) + Decimal(0.2)
>>> a
Decimal('0.3000000000000000166533453693773481063544750213623046875')
>>> Decimal(float(a)) # get the value rounded to the nearest fp number
Decimal('0.3000000000000000444089209850062616169452667236328125')
Note that recent Python versions use Gay's correctly-rounded conversions, so .1+(.2+.3) displays as .6, since that's equivalent when converted back to binary floating point. http://bugs.python.org/issue1580
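That is:

>>> 0.1 + (0.2 + 0.3)
0.6
>>> (0.1 + 0.2) + 0.3
0.6000000000000001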
If you want an exact representation, C99 has the %a format specifier for base-16 floating point. It seems that Python 3 doesn't support it but Lua 5.2 does.
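Python doesn't accept %a in its %-style formatting, but float.hex() does expose the same base-16 form:

>>> (0.1 + 0.2).hex()
'0x1.3333333333334p-2'
>>> (0.5).hex()
'0x1.0p-1'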
I'm not sure what you're trying to say. 0.1000000000000000055511151231257827021181583404541015625 is the precise value of float(0.1). It is not "garbage" or an approximation in any way. It is exact.
> Of course, most programmers only have a vague idea of how floating point numbers work.
From my experience, most programmers have a vague idea of how basic integers work, so that's not so surprising. (I'm in the group who only knows a little more than average about floating-point, but for what I do, I don't use FP math much either.)
I've always thought it would be helpful to have floating point values that tracked the upper and lower bounds of the possible error. That would make analysis of the results much easier.
That's called ball arithmetic, and there are many libraries for it, such as Arb (http://fredrikj.net/arb/). It's useful in mathematical calculation, but with floating point calculations you usually have error bounds coming from your algorithm which are "good enough" if the result is going to be the value of something with a 5% tolerance.
It is easy to track the real result of the computation with floating-point intervals: round down when computing the lower bound, and up when computing the upper bound of each interval.
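A crude sketch of the idea in Python: real interval libraries switch the hardware rounding mode, but widening each endpoint outward by one ulp (math.nextafter, Python 3.9+) is a conservative stand-in:

import math

def interval_add(a, b):
    # a and b are (lo, hi) pairs; nudge each endpoint outward so the
    # true sum is guaranteed to lie inside the returned interval.
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return (lo, hi)

x = (0.1, 0.1)  # degenerate interval holding the double nearest 0.1
print(interval_add(interval_add(x, (0.2, 0.2)), (0.3, 0.3)))
# prints a tiny interval bracketing the true sum of the three doubles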
A common and simple way to treat FP mathematically is by bounding the result with the machine epsilon u [1]. For example, given FP numbers x and y:
x <fp_mul> y = (1+e)xy ; abs(e) <= u
The u ideally depends only on the FP format and rounding mode. I say ideally because library functions like sqrt may be less precise than the ideal (which is the result being equal to the exact value rounded according to the rounding mode). Note that the abstraction only holds as long as you don't go outside the range of normal exponents (it breaks when you reach infinity or subnormals).
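A quick sanity check of that model in Python, using Fraction to get the exact product of two doubles (u = 2^-53 for double precision with round-to-nearest):

from fractions import Fraction

u = Fraction(1, 2**53)                   # unit roundoff for doubles
x, y = 0.1, 0.3
exact = Fraction(x) * Fraction(y)        # exact product of the two doubles
e = (Fraction(x * y) - exact) / exact    # the e in x <fp_mul> y = (1+e)xy
assert abs(e) <= u
print(float(e))                          # well inside the bound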
EXAMPLE. Suppose we have an FP value L representing the length of something, and we want to split it into segments no longer than the FP value M. Assume both are integers small enough to be exactly representable in FP. The answer in exact arithmetic is:
N = ceil(L/M).
However, if we evaluate this in FP, we may get a result N' < N for certain edge values, which means one of the segment lengths we output will be greater than M. So the result of the algorithm is incorrect!
We can fix it by multiplying (L/M) by a number sufficiently greater than one before the ceil, at the cost of sometimes deciding to split into one more segment than mathematically necessary. The fixed FP algorithm may then be:
N' = ceil(1.00005 <fp_mul> (L <fp_div> M))
After we apply the machine epsilon formulas we come to the conclusion that:
N' = ceil( (1+e1)(1+e2)1.00005 (L/M) ) ; abs(e1)<=u, abs(e2)<=u
We then show that:
(1+e1)(1+e2)*1.00005 >= 1.
This proves that what the algorithm gives to the ceil function is not less than the exact value L/M, and hence that N' >= N. And ceil() can't make errors by definition, assuming the integer value is representable. QED
The “malicious” rounding that you are attempting to protect against cannot actually occur in IEEE-754 floating-point. You are jumping at shadows, making lots of results less accurate to protect against something illusory.
Why is this the case?
In order for ceil(L/M) to produce an incorrect result, we would need to have the mathematically exact result be of the form N + epsilon, with 0 < epsilon <= ulp(N)/2, and N an exact integer. But, can that ever actually happen? No. If it did, we would have a floating-point number L = M(N+epsilon) exactly, or L = MN + delta, with 0 < delta < ulp(MN)[1], which can never happen by the definition of ulp.
This illustrates a difficulty with naive backwards-error analysis as commonly taught in numerical analysis courses. While it provides reasonable conservative bounds, which is often sufficient for engineering needs, it misses much of the subtlety of floating-point arithmetic, and leads people to take unnecessary steps to guard against “errors” that aren’t really there.
[1] Lemma: For any binary floating-point type, ulp(xy) > x ulp(y)/2.
Proof: First note that rounding of the product xy will only ever make ulp(xy) larger, so we can safely ignore it. Use the formula ulp(x) = 2^(floor(log2(x)) - P), where P is the precision in bits, and expand: ulp(xy) = 2^(floor(log2(xy)) - P) > 2^(log2(xy) - 1 - P) = xy 2^(-P) / 2 >= x ulp(y) / 2, where the first step uses floor(t) > t - 1 and the last uses ulp(y) = 2^(floor(log2(y)) - P) <= y 2^(-P).
I suspect a bug in your proof right at the beginning, where you say "exact result be of the form N + epsilon, with 0 < epsilon <= ulp(N)/2, and N an exact integer". Assuming you're referring to the result of the entire expression including ceil, can we still say that, considering that ceil() is far from being a continuous function?
Anyway, my example was a bit too theoretical. A better version is one where L and M are integers which are not necessarily representable. Here: http://ideone.com/1Xkhjp
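In case the link rots, here's a sketch of the same kind of failure (the values are my own, not necessarily what the link shows; l is odd and above 2^53, so it gets rounded before the division even happens):

import math

l, m = 10**16 + 1, 10      # exact answer: ceil(l/m) = 10**15 + 1
L = float(l)               # l is not representable; it rounds down to 1e16
print(math.ceil(L / m))    # 1000000000000000 -- one segment too few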
Yes, of course it’s possible to exhibit errors if you change the problem (though it should be noted that the source of error here is entirely in the fact that l is not representable). As soon as you allow representation error, of course, your tolerance-based approach goes out the window; what if the representation error in l and m is larger than your tolerance accounts for?
If I wanted a correct bound for your revised problem, I would require that l be an upper bound on the true value (by forcing it to be rounded up in earlier computations, for example) and that m be a lower bound on the true value. Then the simple computation delivers a correct result regardless of the magnitude of the error in l or m.
It is the technique of backwards-error analysis that is naive, not the resulting bound. A “naive” method is not “simple” or “dumb” (or “aggressive” or whatever you think the opposite of “conservative” might be). It is simply a method that doesn’t use some a priori knowledge that’s available for the specific problem at hand. Backwards-error analysis is a naive technique in that it doesn’t benefit from specialized theorems of floating-point that allow one to establish tighter bounds, like the one I proved.
Though, as nice as the examples are, FP rounding problems in practice are rarely due to binary/decimal conversions, which are only relevant when reading and writing decimal encoded numbers. These are normally absent when doing tasks such as moving motors based on sensor readings.
There are no values of a where a * a * a differs from a * (a * a), because floating-point multiplication is commutative.
With arbitrary numbers a, b, and c, it would be a different story for a * b * c. Without further knowledge of a, b, c, there would be no particular reason to do the operations in one order or another, but the result of a * b * c and a * (b * c) could definitely differ by one ULP, probably by two in ordinary cases, by more if underflow occurred (but one of the variables would already have to be very small for this to happen).
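If you want concrete triples, random search tends to find them almost immediately:

import random

while True:
    a, b, c = (random.random() for _ in range(3))
    if (a * b) * c != a * (b * c):
        print(a, b, c)
        break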
But remember, if both the Wikipedia editors and I are wrong, you only need to produce one pair of operands a and b such that a * b does not produce the same result(1) as b * a to be vindicated.
(1) I suggest we consider all NaN values to be “the same result” for the purpose of this challenge.
You're right, of course. I was thinking about associativity, not seeing what was really written. Sorry. I didn't expect commutativity to come up in this context (it's the non-associativity that rules out the "automatic" optimizations some people expect), given that the formula in the same sentence has three elements and concerns a different order of computation. Which doesn't change the fact that I made an error.
Yep, grouping can lead to better real accuracy. But then you lose exact IEEE compatibility (you get slightly different results that will fail a comparison).
Pure cargo-culting. There are circumstances in which it’s perfectly appropriate to compare floating-point data for equality. There are also circumstances in which a comparison with some tolerance is more appropriate. Understand what situation you’re in, and understand the tools that you’re using.
Oh, there must be a Common Lisp predicate that addresses this. prettymuchequalsp?
More seriously, I remember APL\360 SETFUZZ from long before many of you whippersnappers were born, which told the system to disregard so-many bits in an 'equals' evaluation.
FP implementations are generally evil, and errors cascade in irritating fashion. 50% of a numerical analysis class I took once was dedicated to error estimation. Stick to rationals if you can - it'll save your hairline.