This is due to the fact that several pipelines will do calculations in parallel and depending on which pipeline ends first (which can depend on many factors like current temperature) the sums can happen in a different order, leading to different rounding results.
Parallel systems have great difficulty in defining an order of operations. Floating-point numbers are non-associative, so order matters for bitwise compatibility.
With that being said, you CAN define an order, and it CAN be done in parallel. The parallel-prefix sum is a fully defined order of operations for example, and if you sort all numbers (from smallest to largest), you'll get a more accurate result.
The + operations always happen in the same order, even on a GPU. So you should get the same results every time with floating-point math. But it is non-intuitive for many people to work with non-associative floating-point math.
Note that writing a sequential algorithm with fully defined order is still grossly difficult! IMO, if anyone wants a bit-accurate simulation, use 64-bit integers instead.
If you absolutely must use floats, then define a proper order of all operations, and include sorting the numbers (!!) to force a fully defined order as much as possible. You may need to sort your pool of numbers many, many times per calculation. But that's what you have to do to get a defined ordering.
>>> s = [10**-x for x in range(16)]
[1, 0.1, 0.01, 0.001, 0.0001, 1e-05, 1e-06, 1e-07, 1e-08, 1e-09, 1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15]
Somewhere I've seen a little routine (by Stefan Karpinski of Julia fame) that starts with a fixed pre-determined array of numbers, and you give it a desired target number, and it permutes the numbers of the array such that when you sum them up (naively, from left to right) the result is your desired target number .
EDIT:  https://discourse.julialang.org/t/array-ordering-and-naive-s...
Sorting the numbers is necessary to minimize cancellation error. But absolute minimum cancellation error can only be done sequentially: the parallel version may return different results.
l = list(s)
l = list(reversed(s))