It's hard to call it a bug, given that any concurrent float sum or product will give different results when the amount of concurrency changes. Even if you order the per-thread partial results before reducing, the result will still differ if you use a different number of threads to split the problem.
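Because floating-point addition is not associative: ((1 + 2) + 3) + 4 and (1 + 2) + (3 + 4) are different evaluation orders, and in general they round differently. (Small integers like these happen to be exact, but most values are not.)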
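To make that concrete, here is a rough Python sketch. The `chunked_sum` helper and its `workers` parameter are made up for illustration: it only simulates splitting the input across threads by summing contiguous chunks one after another, then reducing the partial sums in a fixed order. The input values are chosen by hand so the rounding difference is easy to see.

```python
import math


def sequential_sum(values):
    """Plain left-to-right float sum (explicit loop, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, workers):
    """Simulate a parallel sum: split into `workers` contiguous chunks,
    sum each chunk separately, then reduce the partials in chunk order.
    (Hypothetical helper; no real threads are involved.)"""
    size = math.ceil(len(values) / workers)
    partials = [
        sequential_sum(values[i:i + size])
        for i in range(0, len(values), size)
    ]
    return sequential_sum(partials)


# Grouping changes the rounding, even for three terms.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)
print(left_to_right == regrouped)  # False

# Same data, same reduction order, different number of chunks.
values = [1e16, 1.0, -1e16, 1.0]
one_worker = chunked_sum(values, 1)
two_workers = chunked_sum(values, 2)
print(one_worker, two_workers)  # 1.0 0.0
```

With one chunk the first 1.0 is lost when added to 1e16, but the second survives after the big values cancel. With two chunks each 1.0 is absorbed inside its own chunk, so the total is 0.0. The partials are reduced in the same order both times; only the split changed.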