> user of NumPy/CuPy to perform the float32 computation
This is just getting tiring.
NumPy and CuPy are perfectly capable of doing float32 computation - their only "fault" is that they coerce data to float64 in this one fairly unimportant function (which you can reimplement to your liking in 3-4 LOC). Hell, PFN's entire deep learning system, Chainer (which, mind you, both predated and inspired PyTorch, and is still quite competitive with it), is built entirely on top of CuPy!
Benchmarks make sense only when the outputs are the same - and in this case, they certainly are not. It's the responsibility of the author to either make sure that the outputs are the same (a real benchmark), or to argue that NumPy/CuPy are wasting time by using float64 in xp.cov (an issue of implementation). He does neither.
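For concreteness, here is a minimal sketch of what such a 3-4 line float32 covariance might look like, written against the shared NumPy/CuPy array API (this is only an illustration, not the article's code; `xp` stands for either numpy or cupy, rows are variables and columns are observations as in np.cov's default layout, and only the unbiased ddof=1 case is handled):

    import numpy as np   # cupy exposes the same array API

    def cov_f32(x, xp=np):
        """Covariance that stays in float32; pass xp=cupy to run it on the GPU."""
        # Rows are variables, columns are observations (np.cov's default layout).
        x = xp.asarray(x, dtype=xp.float32)
        x = x - x.mean(axis=1, keepdims=True)    # centre each variable
        return (x @ x.T) / (x.shape[1] - 1)      # float32 array / Python int stays float32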
> NumPy and CuPy are perfectly capable of doing float32 computation
You're arguing that they're perfectly capable of an implementation that uses float32 - sure, if you change their source code you can make them faster, but that much is obvious. The question is what kind of performance you can expect as a user consuming them as a library, and this benchmark is revealing of exactly that. What is wrong with it?
It's like saying you could change the implementation of Python and make Python code run faster - sure you could, but you don't benchmark hypothetical future versions. The current version of CuPy uses float64 for this function, and as a result it barely runs faster than the CPU version. Case in point. I don't know what else you're trying to disagree about here.
> their only "fault" is that they coerce data to float64 in this one fairly unimportant function
I don't know that; all I know is that when we picked one function and benchmarked it, we found a flaw that makes the performance gains from CPU to GPU extremely underwhelming. How many more would we find if we benchmarked the rest?
> if you change their source code you can make them faster, but that much is obvious. The question is what kind of performance you can expect as a user consuming them as a library, and this benchmark is revealing of exactly that. What is wrong with it?
You don't need to change the NumPy or CuPy source code - just write your own version of the function in a few lines of code using the other primitives that NumPy/CuPy provide (see the sketch below).
Note that this is exactly what is done for the Neanderthal version.
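A quick sanity check of the dtype behaviour under discussion (again an illustrative snippet, not the article's benchmark): np.cov documents that it computes in float64 even for float32 input, while a hand-rolled version built from the same primitives stays in float32 and matches it up to float32 round-off:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 10_000)).astype(np.float32)

    c = a - a.mean(axis=1, keepdims=True)     # centre each variable (row)
    cov32 = (c @ c.T) / (a.shape[1] - 1)      # stays float32 throughout

    print(np.cov(a).dtype, cov32.dtype)                 # float64 float32
    print(np.allclose(np.cov(a), cov32, atol=1e-3))     # True: same result up to float32 round-off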