As far as I know, xsum [1] more or less solves the problem completely: order inv...

lscharen · 2024-07-03T17:53:46

I was not aware of this (2015) work -- very nice!

A couple of pull-quotes from the paper to summarize:

Much work has been done on trying to improve the accuracy of summation. Some methods aim to somewhat improve accuracy at little computational cost, but do not guarantee that the result is the correctly rounded exact sum.

Many methods have been developed that instead compute the exact sum of a set of floating-point values, and then correctly round this exact sum to the closest floating-point value. This obviously would be preferable to any non-exact method, if the exact computation could be done sufficiently quickly

Exact summation methods fall into two classes — those implemented using standard floating point arithmetic operations available in hardware on most current processors, such as the methods of Zhu and Hayes (2010), and those that instead perform the summation with integer arithmetic, using a “superaccumulator”.

I present two new methods for exactly summing a set of floating-point numbers, and then correctly rounding to the nearest floating-point number. ... One method uses a “small” superaccumulator with sixty-seven 64-bit chunks, each with 32-bit overlap with the next chunk, allowing carry propagation to be done infrequently. The small superaccumulator is used alone when summing a small number of terms. For big summations, a “large” superaccumulator is used as well. It consists of 4096 64-bit chunks, one for every possible combination of exponent bits and sign bit, plus counts of when each chunk needs to be transferred to the small superaccumulator.

On modern 64-bit processors, exactly summing a large array using this combination of large and small superaccumulators takes less than twice the time of simple, inexact, ordered summation, with a serial implementation
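To make the "one chunk for every possible combination of exponent bits and sign bit" idea concrete, here is a minimal Python sketch. It is not the paper's implementation: Python's arbitrary-precision integers stand in for the 64-bit chunks and the carry/transfer bookkeeping, the function name `bucketed_exact_sum` is made up for illustration, and inf/NaN inputs are simply rejected. It only shows why bucketing by the top 12 bits of a float64 keeps the sum exact and independent of order.

```python
import struct


def bucketed_exact_sum(values):
    """Exactly sum finite float64 values, then round once at the end.

    Illustrative sketch only: each value is filed under its 12-bit
    (sign, exponent) prefix, loosely mirroring the 4096-slot "large"
    superaccumulator described in the quote above.
    """
    mantissa_sums = [0] * 4096  # sum of the 52-bit mantissa fields per slot
    counts = [0] * 4096         # number of values that landed in each slot

    for x in values:
        bits = struct.unpack("<Q", struct.pack("<d", x))[0]
        index = bits >> 52  # top 12 bits: 1 sign bit + 11 exponent bits
        if index & 0x7FF == 0x7FF:
            raise ValueError("inf/nan are not handled in this sketch")
        mantissa_sums[index] += bits & ((1 << 52) - 1)
        counts[index] += 1

    # Merge all slots into one fixed-point integer scaled by 2**1074,
    # so the smallest subnormal (2**-1074) maps to the integer 1.
    total = 0
    for index in range(4096):
        if counts[index] == 0:
            continue
        exponent = index & 0x7FF
        significand = mantissa_sums[index]
        if exponent == 0:
            shift = 0  # subnormal: value = mantissa * 2**-1074
        else:
            significand += counts[index] << 52  # implicit leading 1 per value
            shift = exponent - 1  # value = significand * 2**(exponent - 1075)
        term = significand << shift
        total += -term if index >> 11 else term

    # int / int true division is correctly rounded in CPython, so this is
    # the only rounding step in the whole computation.
    return total / (1 << 1074)
```

For example, `bucketed_exact_sum([1e100, 1.0, -1e100])` returns `1.0`, whereas the built-in `sum` on the same list returns `0.0` because the `1.0` is lost when it is added to `1e100`. If the exact sum is too large for a float64, the final division raises `OverflowError` rather than returning infinity.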

pacaro · 2024-07-03T20:04:37

Thanks for the summary. I kinda low-key love the idea of converting floats into a fixed-point representation that covers the entire range represented by the float type. I mean, the large accumulator is only 32 KB (4096 × 8 bytes), which is likely to stay in L1 the entire time on modern hardware. And any given float is only going to need two 64-bit words plus 13 bits (12 bits for the offset and 1 for the sign) to be represented in this scheme.
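A rough sketch of that fixed-point view, under some assumptions: the accumulator is modelled here as one plain fixed-point number whose bit 0 is worth 2**-1074, cut into non-overlapping 64-bit words. That layout is a simplification chosen for illustration and differs from the overlapping chunks in the paper. The helper names `to_fixed_point` and `touched_words` are hypothetical.

```python
import struct

WORD_BITS = 64


def to_fixed_point(x):
    """Split a finite float64 into (sign, bit_offset, significand) such that
    abs(x) == significand * 2**(bit_offset - 1074)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        raise ValueError("inf/nan are not handled in this sketch")
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, lowest possible offset.
        return sign, 0, mantissa
    # Normal number: restore the implicit leading 1 (53-bit significand).
    return sign, exponent - 1, mantissa | (1 << 52)


def touched_words(x):
    """Indices of the 64-bit words that x would occupy in the fixed-point
    accumulator (assumed non-overlapping layout)."""
    _, offset, significand = to_fixed_point(x)
    if significand == 0:
        return []
    low = offset // WORD_BITS
    high = (offset + significand.bit_length() - 1) // WORD_BITS
    return list(range(low, high + 1))
```

Since a significand is at most 53 bits wide, it can straddle at most one word boundary, so `touched_words` never returns more than two indices. In this layout `touched_words(1.0)` gives `[15, 16]` and the smallest subnormal `5e-324` touches only word `0`.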