Most GPUs are non-deterministic - learned this the hard way in deep learning on pathology data.
The non-determinism is a performance optimization. You can set a flag in PyTorch / CUDA to disable it, which comes at the cost of some speed.
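For reference, a minimal sketch of what "setting the flag" looks like in recent PyTorch versions (the exact knobs vary by version, and the env var only matters for cuBLAS on CUDA >= 10.2):

```python
import os
import torch

# Required for deterministic cuBLAS kernels; must be set before CUDA work starts.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                        # fix RNG seeds
torch.use_deterministic_algorithms(True)    # raise an error on ops with no deterministic impl
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable autotuning, which can vary run to run
```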
Can you explain? How much does it actually affect results in extreme cases? The source of the non-determinism doesn't seem to be the GPU itself so much as the parallelism and dynamic allocation in the frameworks. (It also seems that some parts of PyTorch still raise a runtime error if you request a deterministic version.) Are there other, more performant deterministic DL frameworks?
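To make the mechanism concrete (this is only an illustration, not the actual PyTorch/CUDA internals): floating-point addition is not associative, so a parallel reduction that combines partial sums in a different order on each run gives slightly different results, and those tiny differences can get amplified over many training steps.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)

# Same values, different summation order -> slightly different float32 results.
s1 = x.sum()
s2 = x[torch.randperm(x.numel())].sum()
print(s1.item(), s2.item(), (s1 - s2).item())
```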