
Improving Floating Point Accuracy: A Look at Sums - aSanchezStern
http://alex.uwplse.org/2015/10/16/improving-accuracy-summation.html
======
rjmunro
More important than just increasing accuracy is knowing how much accuracy you
have. With interval arithmetic, you keep an upper and lower bound for every
calculation, and use the appropriate floating point rounding modes when you
manipulate the numbers.

You can then try different methods of calculating your result and measure
which is the least destructive - compiler optimisations could optimise for
interval width rather than runtime.

Some methods might produce a better upper bound, and others a better lower
bound. In those cases you can take the intersection as your best result. See
[https://pypi.python.org/pypi/pyinterval](https://pypi.python.org/pypi/pyinterval)
for an implementation in Python.
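Roughly, the idea looks like this (a toy sketch, not pyinterval itself; it
nudges each bound outward by one ulp with math.nextafter as a stand-in for
switching the hardware rounding mode, so it needs Python 3.9+):

```python
import math

class Interval:
    """Toy interval type: the true value is guaranteed to lie in [lo, hi]."""
    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def _widen(self, lo, hi):
        # Stand-in for directed rounding: push each bound one ulp outward
        # so the interval stays a conservative enclosure of the true value.
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return self._widen(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return self._widen(self.lo - other.hi, self.hi - other.lo)

    def width(self):
        return self.hi - self.lo

a, b, c = Interval(1e8), Interval(1.0), Interval(-1e8)
print(((a + b) + c).width())  # small term added first: much wider interval
print(((a + c) + b).width())  # big terms cancelled first: much narrower
```

Comparing the widths of the two results is exactly the "measure which method
is the least destructive" step.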

~~~
eru
Don't common ways to implement interval arithmetic massively overestimate the
inaccuracy?

~~~
rjmunro
Compared to what?

You can improve accuracy by optimising the way you do your calculations. A
really trivial example: 10000 + .00001 - 10000 is less accurate than 10000 -
10000 + .00001. Without interval arithmetic, you can't easily measure this and
decide which way round is best.
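You can see the effect directly with 32-bit floats, where the loss is
dramatic (with 64-bit doubles the same thing happens, just further down in
the digits):

```python
import numpy as np

a = np.float32(10000.0)
b = np.float32(0.00001)

print(a + b - a)  # 0.0: the small term is swallowed by the large one
print(a - a + b)  # ~1e-05: cancelling the large terms first preserves it
```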

------
acqq
It just uses:

[https://en.wikipedia.org/wiki/Kahan_summation_algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm)

Wikipedia also has a discussion of how some compiler optimizations can break
the algorithm's implementation.
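For reference, the algorithm is short; the compensation step `c = (t - s) - y`
is algebraically zero, which is exactly why an optimizer that assumes
real-number associativity (e.g. under -ffast-math in C/C++) can simplify it
away and break the whole scheme:

```python
def kahan_sum(xs):
    """Compensated summation: carries the rounding error of each
    addition in c and feeds it back into the next addition."""
    s = 0.0
    c = 0.0              # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y  # algebraically zero; numerically the lost part of y
        s = t
    return s

values = [10000.0, 3.14159, 2.71828] * 100000
print(sum(values))        # plain left-to-right summation
print(kahan_sum(values))  # compensated: error stays O(eps) instead of growing with n
```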

~~~
edejong
Indeed. My advice is to use pairwise summation [0], which is easier to
implement, does not suffer from compiler optimizations, and is sufficiently
precise for most practical purposes. I previously considered writing a small
blog post about it, but it would just have been an (incomplete) copy of the
Wikipedia article.

[0]
[https://en.wikipedia.org/wiki/Pairwise_summation](https://en.wikipedia.org/wiki/Pairwise_summation)
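A minimal sketch of the idea (recursive halving with a naive-sum base case;
the block size of 8 here is just for illustration, real implementations use a
larger block so the recursion overhead stays negligible):

```python
def pairwise_sum(xs, block=8):
    """Recursively split the list in half and sum each half,
    falling back to a naive loop for small blocks.
    The rounding error grows like O(log n) instead of O(n)."""
    n = len(xs)
    if n <= block:
        s = 0.0
        for x in xs:
            s += x
        return s
    mid = n // 2
    return pairwise_sum(xs[:mid], block) + pairwise_sum(xs[mid:], block)

print(pairwise_sum([0.1] * 1_000_000))
```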

~~~
madez
The argumentation in the article is weak in places. First, it praises the
fast execution of pairwise summation with a large base case. Second, it
praises pairwise summation as accurate, with O(ε) error, in the base case.
Following this argumentation, making the base case as large as possible would
be optimal; in practice, however, that degenerates into naive summation. So
the argument doesn't hold together.

~~~
bazzargh
It looks like a bad edit to the wiki.

[https://en.wikipedia.org/w/index.php?title=Pairwise_summatio...](https://en.wikipedia.org/w/index.php?title=Pairwise_summation&diff=643560917&oldid=643548127)

^ This user changed the error bounds, incorrectly. The worst case for the
naive sum is pretty obviously epsilon*N (and is described that way on the
Kahan summation page).

------
raymondh
Even higher accuracy is possible by storing all partial sums. Here are
several ways to achieve that:

[http://code.activestate.com/recipes/393090-binary-floating-p...](http://code.activestate.com/recipes/393090-binary-floating-point-summation-accurate-to-full-p/)

Also see Shewchuk's "Adaptive Precision Floating-Point Arithmetic and Fast
Robust Geometric Predicates" at
[http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/r...](http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps)
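As far as I know, that recipe is essentially the approach behind math.fsum in
the standard library: it keeps a list of non-overlapping partial sums, so the
result is exact apart from the final rounding. A quick demonstration:

```python
import math

values = [1.0, 1e100, 1.0, -1e100] * 10000
print(sum(values))        # 0.0: each 1.0 is lost against 1e100
print(math.fsum(values))  # 20000.0: exact
```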

------
SagelyGuru
Using rational numbers is another good way to combat/reduce floating point
errors. For example, you can represent and calculate with 1/3 exactly,
whereas in binary floating point you cannot, no matter how many bits you use.
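In Python, fractions.Fraction shows this directly:

```python
from fractions import Fraction

third = Fraction(1, 3)
print(third * 3 == 1)          # True: 1/3 is exact as a rational
print(Fraction(1/3) == third)  # False: the double closest to 1/3 is not 1/3
```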

~~~
recursive
In a case where it's likely to matter, you probably have a list of many
numbers with diverse denominators. Certainly you can achieve mathematical
precision with rationals (if you can get them in the first place) but the
trade-off is ending up with really huge big-ints as numerator and denominator.

~~~
SagelyGuru
Good point. To reduce this problem you can (and should?) factorise and cancel
out common primes. Even keeping to a pair of 64-bit integers will give you
pretty good accuracy. You can perform all arithmetic directly on the rationals
and represent reals by the nearest rational approximation (after all, a float
is an approximation too in those cases, just a poorer one).

Stern-Brocot trees are a good systematic way of finding the best rational
approximations and also for doing interval rational arithmetic.
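Python's fractions.Fraction already exposes the "best approximation with a
bounded denominator" part of this via limit_denominator (implemented with
continued fractions rather than an explicit Stern-Brocot walk, but as far as I
know it yields the same best approximations):

```python
import math
from fractions import Fraction

pi = Fraction(math.pi)                 # the exact rational value of the double
best = pi.limit_denominator(1000)      # closest fraction with denominator <= 1000
print(best)                            # 355/113, the classic approximation
print(float(best) - math.pi)           # tiny residual error
```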

------
mollmerx
If the list is known in advance, what about a strategy of sorting the items by
absolute value and then summing from smallest to largest?

Would this improve accuracy? If so, by how much?

~~~
dbaupp
That will ensure that, say, 1e20 + 1 + 1 + ... + 1 (with 1e20 ones) is
correctly computed as 2e20 (_edit_: this isn't actually true: it just gets a
bit closer to 2e20, something like 1e20+1e16), but it can still result in
catastrophic cancellation, e.g. [1, 1e20, -1e20] should sum to 1, but will
give 0, despite already being sorted.
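A two-line check of that claim:

```python
xs = [1.0, 1e20, -1e20]   # already sorted by absolute value
print(sum(xs))            # 0.0: the 1.0 is absorbed into 1e20, then cancelled away
```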

Something like the msum of [1] is needed to get a (provably[2]) exact
summation.

[1]: [http://code.activestate.com/recipes/393090-binary-floating-p...](http://code.activestate.com/recipes/393090-binary-floating-point-summation-accurate-to-full-p/)
[2]: [http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/r...](http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps)

------
Devid2014
Using anything other than float or double would decrease performance
significantly. Sorting the array would not only be slow, it would also not
improve precision that much.

Using double to sum floats has comparable precision to Kahan summation. But of
course the same trick is not possible when summing doubles, or when no proper
double precision is available at all, as on some GPUs.

------
__mp
In my case I'm more interested in the positions where floating point accuracy
is lost: I'm currently working on a huge code base where we moved from double
to single precision. If a unit test does not pass in single precision, I want
a tool that tells me where and why.

~~~
acqq
Moving a "huge" code base from double to single precision seems like a very
wrong decision.

The sane approach is to move only what must be moved, and there has to be a
valid reason for each change. Moving the whole "huge" base seems wrong by
default if the subject is numeric. It's not comparable to renaming your
variables or methods or whatever else you can do across a huge code base while
still keeping the results the same.

------
irascible
Would doubling the size of the floats used in the computation achieve a
similar result for the example presented?

~~~
acqq
Absolutely. Even if you are summing floats, if you care about accuracy you
should accumulate them in doubles. I've read a discussion among the Julia
implementers in which they considered Kahan summation but, strangely, didn't
pick the simpler approach of not using floats for the partial sums of floats.
Although it was an old discussion and I don't know the present state.
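A minimal illustration of the difference (the exact amount of drift depends on
the data, but the pattern is typical):

```python
import numpy as np

xs = [np.float32(0.1)] * 1_000_000   # a million single-precision 0.1s

acc32, acc64 = np.float32(0.0), 0.0
for x in xs:
    acc32 = np.float32(acc32 + x)    # partial sums kept in single precision
    acc64 += float(x)                # partial sums kept in double precision

print(acc32)   # noticeably off from the true sum
print(acc64)   # ~100000.0015, the exact sum of the float32 values
```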

~~~
alexSanchStern
Even when summing using doubles, you can lose precision. In general, Kahan
summation allows you to double the intermediate precision of your sums, so if
you're losing precision even with 64-bit doubles, Kahan summation can give you
128 bits of intermediate precision, without going to software floating point
solutions.

~~~
acqq
That's true; however, it's saner, safer and faster to simply use doubles to
sum floats than to use floats with Kahan summation, and surprisingly some
people don't get that.

------
legulere
Probably the easiest way to increase floating point accuracy is to use
decimal floating point. The input data is usually decimal and needs only a
small part of the precision that a decimal float format offers. Later
calculations will consume more of that precision, but they will stay exact
for much longer.

When converting decimal numbers to binary floating point you often use the
full precision of the float format right away, because most decimal numbers
can't be represented exactly in binary.

~~~
dbaupp
Decimal floating point just changes how cancellation/rounding appears, it
doesn't do anything to increase accuracy. E.g. suppose you're using a decimal
float with 1 digit of precision (i.e. it can have 1.0, 1.1, ... 9.9 as the
mantissa), and compute 100 + 1 - 100 (i.e. 1.0e2 + 1.0e0 + -1.0e2): 100 + 1 is
101, which rounds to 100 (1.0e2), and 100 + -100 is of course 0. Complete
catastrophic cancellation.
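You can reproduce the toy example with Python's decimal module by shrinking
the context precision (two significant digits plays the role of the 1.0-9.9
mantissa above):

```python
from decimal import Decimal, getcontext

getcontext().prec = 2                          # two significant decimal digits
result = Decimal(100) + Decimal(1) - Decimal(100)
print(result == 0)                             # True: 101 rounds to 1.0E+2, then cancels to 0
```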

In fact, binary floats use their storage better than decimal floats, so a
binary float of a given size will have slightly higher accuracy than a decimal
one of the same size. For example, there are 90 mantissae to encode in the
example above, requiring a minimum of 7 bits, with a machine epsilon of 0.01.
Using those 7 bits for a binary float gives a smaller epsilon: 1/128 ≈
0.0078.

Of course, decimal floats do have the advantage of matching a common
input/output format, and their rounding artifacts better match the intuition
of us base-10 accustomed humans.

~~~
legulere
> In fact, binary floats use their storage better than decimal floats, so a
> binary float of a given size will have slightly higher accuracy than a
> decimal one of the same size.

Not necessarily. The inefficiency is compensated for by using fewer bits for
the exponent (which therefore offers a smaller range). 64-bit decimal floats
actually have slightly more mantissa precision than their 64-bit binary
counterpart.

> Decimal floating point just changes how cancellation/rounding appears, it
> doesn't do anything to increase accuracy. E.g. suppose you're using a
> decimal float with 1 digit of precision (i.e. it can have 1.0, 1.1, ... 9.9
> as the mantissa), and compute 100 + 1 - 100 (i.e. 1.0e2 + 1.0e0 + -1.0e2):
> 100 + 1 is 101, which rounds to 100 (1.0e2), and 100 + -100 is of course 0.
> Complete catastrophic cancellation.

It also changes how often it occurs. With binary floats it already happens
when you do 0.1 + 1.0 (I don't mean the rounding error of converting 0.1 to
binary).

When you enter 1.0 + 0.1 - 1.0 in Python, for example, you will get
0.10000000000000009 as a result, which is not equal to what you get when you
enter 0.1. With decimal floats this doesn't happen.
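For example, with the default context precision of Python's decimal module:

```python
from decimal import Decimal

print(1.0 + 0.1 - 1.0)                                    # 0.10000000000000009
print(Decimal('1.0') + Decimal('0.1') - Decimal('1.0'))   # 0.1
```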

With every calculation you do, the amount of precision needed can rise. With
decimals converted to binary floats you use up all of the precision right from
the beginning. With decimal floats you can often easily stay within the
precision offered by the data type.

