
They could refer to parallel systems, most likely GPUs, where it is common to get slightly different results even for the same operation on the same system.

This is because several pipelines perform calculations in parallel, and depending on which pipeline finishes first (which can depend on many factors, such as the current temperature), the sums can happen in a different order, leading to different rounding results.

> They could refer to parallel systems, most likely GPUs, where it is common to have slightly different results even on the same operation.

Parallel systems have great difficulty defining an order of operations. Floating-point addition is non-associative, so order matters for bitwise compatibility.
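A minimal illustration of that non-associativity (my own example, not from the comment above): the same three values summed with different groupings produce different doubles.

```python
# Floating-point addition is not associative: grouping changes what gets
# rounded at each step, so the final bits can differ.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds first -> 0.6000000000000001
right = a + (b + c)  # 0.2 + 0.3 rounds first -> 0.6

print(left == right)  # False: one ulp apart
```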

With that being said, you CAN define an order, and it CAN be done in parallel. The parallel-prefix sum, for example, has a fully defined order of operations, and if you sort all the numbers (from smallest to largest) first, you'll also get a more accurate result.
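To sketch what a fully defined order looks like: a pairwise (tree) reduction combines the same pairs at every level, so the result depends only on the length of the input, never on which worker finishes first. This is an illustrative sketch, not the commenter's code:

```python
def pairwise_sum(xs):
    """Sum with a fixed binary-tree order: each level adds adjacent pairs.
    The combination pattern depends only on len(xs), not on timing, so
    parallelizing each level still yields bit-identical results."""
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:            # carry an odd element into the next level
            xs.append(0.0)
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0] if xs else 0.0

vals = [10**-x for x in range(16)]
print(pairwise_sum(vals))  # identical bits on every run
```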


The + operations always happen in the same order, even on a GPU, so you should get the same results every time with floating-point math. But many people find it non-intuitive to work with non-associative floating-point math.


Note that writing a sequential algorithm with a fully defined order is still grossly difficult! IMO, if anyone wants a bit-accurate simulation, they should use 64-bit integers instead.
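The integer suggestion works because integer addition is exact and associative, so every summation order gives identical bits. A hypothetical fixed-point sketch (the scale factor is my own arbitrary choice):

```python
import random

SCALE = 10**6  # hypothetical fixed-point scale: work in millionths

values = [0.001 * i for i in range(1000)]
fixed = [round(v * SCALE) for v in values]  # quantize once, up front

total = sum(fixed)          # integer addition: no rounding, no order dependence
random.shuffle(fixed)       # simulate a nondeterministic summation order
print(sum(fixed) == total)  # True: any order gives the same integer
```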

If you absolutely must use floats, then define a proper order for all operations, including sorting the numbers (!!) to force a fully defined order as much as possible. You may need to sort your pool of numbers many, many times per calculation, but that's what you have to do to get a defined ordering.
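A toy sketch of why sorting pins down the order: if each task sorts its numbers before the sequential sum, shuffled inputs still produce identical bits (my own illustration):

```python
import random

values = [10**-x for x in range(16)]
reference = sum(sorted(values))

for _ in range(100):
    random.shuffle(values)                   # simulate nondeterministic arrival
    assert sum(sorted(values)) == reference  # sorting restores one fixed order
print(reference)
```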

I've never heard of that happening on a GPU. Can you give an example?

We have this in our GPU computations, and it is very annoying. The errors are usually too small to matter, but they make comparing binary results / reproducible builds impossible. Here is an example I just made:

    $ python3
    >>> s = [10**-x for x in range(16)]
    >>> s
    [1, 0.1, 0.01, 0.001, 0.0001, 1e-05, 1e-06, 1e-07, 1e-08, 1e-09, 1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15]
    >>> sum(s)
    1.1111111111111112
    >>> sum(reversed(s))
    1.111111111111111
(this uses the CPU for simplicity, but GPUs have exactly the same floating-point math)

Summation of an array is surprisingly non-trivial. Below is a discussion about it with many interesting links (including Kahan summation and pairwise summation), if you want to go down some unexpected rabbit holes.
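Kahan (compensated) summation, mentioned above, keeps a second variable that recovers the low-order bits lost at each addition; a minimal sketch:

```python
def kahan_sum(xs):
    """Compensated summation: `c` carries the rounding error of each step."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # apply the correction from the previous step
        t = total + y        # big + small: low-order bits of y may be lost...
        c = (t - total) - y  # ...and are recovered here
        total = t
    return total

print(kahan_sum([10**-x for x in range(16)]))
```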

Somewhere I've seen a little routine (by Stefan Karpinski of Julia fame) that starts with a fixed pre-determined array of numbers, and you give it a desired target number, and it permutes the numbers of the array such that when you sum them up (naively, from left to right) the result is your desired target number [2].


EDIT: [2] https://discourse.julialang.org/t/array-ordering-and-naive-s...

If you don't care about cancellation error and are only concerned with perfect bitwise / reproducible results, then sort your numbers.

Sorting the numbers also helps minimize cancellation error. But the absolute minimum cancellation error can only be achieved sequentially: a parallel version may return different results.

    >>> sum(sorted(s))
    1.111111111111111
    >>> sum(sorted(reversed(s)))
    1.111111111111111
Both return 1.111111111111111 now. Sorting is the "synchronization" step you need to get all your parallel tasks back into a defined order.

You can get this kind of behavior if you use atomic functions, but you have to explicitly opt in to using those, and it is not limited to GPUs.

And performance will suffer as a consequence.
