
CuPy Accelerates NumPy on the GPU? Hold My Cider, Here's Clojure - signa11
https://dragan.rocks/articles/20/Clojure-Numpy-Cupy-CPU-GPU
======
akssri
The benchmark is flawed. CuPy follows NumPy in promoting the data to float64
before computing the covariance.

[https://github.com/cupy/cupy/blob/b19577199d3f75c76e23c75194...](https://github.com/cupy/cupy/blob/b19577199d3f75c76e23c751941cb0976ce6ea01/cupy/statistics/correlation.py#L86)

The following function in PyTorch completes the float32 computation in 27 ms
(1080 Ti). I'm fairly sure that a similar function in Cupy would achieve very
similar results.

    
    
      import torch

      def corr(dat):
          # assumes zero-mean (centered) data; see the discussion below
          cov = (dat @ dat.T).div_(dat.shape[-1] - 1)
          sigma = torch.diag(cov).sqrt_()
          return cov.div_(sigma[None]).div_(sigma[None].T)
    

The heart of the computation, an SGEMM, is in any case provided by cuBLAS, and
the only areas in which PyTorch and CuPy differ are the tricks used to unroll
the loop for the broadcasting & L1 ops. The performance difference between
PyTorch, CuPy, and the others should be negligible IMO.
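For illustration, a rough (untested) CuPy sketch of what I have in mind; like
the PyTorch version, it stays in float32 and assumes zero-mean data:

      import cupy as cp

      def corr(dat):
          # float32 throughout; assumes the rows are already centered
          cov = (dat @ dat.T) / (dat.shape[-1] - 1)
          sigma = cp.sqrt(cp.diag(cov))
          return cov / sigma[None] / sigma[None].T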

Note: np.corrcoef takes about 1.12 s and the above float32 version takes 325 ms
on the CPU.

~~~
jsdavo
Thanks for sharing! Why does CuPy switch to float64 in this case? Is it a bug?

~~~
akssri
NumPy/CuPy explicitly coerce to float64. There is no documentation of why this
is done, but since GEMM is used to compute the covariances (a summation over
data points), it makes sense to increase the precision.

You could get away with using a double only to accumulate the sum, but it's a
pain to write such a mixed-precision (slow) function in C and then wrap it in
Python, esp. for something like this.

(See,
[https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/functi...](https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/function_base.py#L2374))
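The promotion is visible directly in the result dtype:

      import numpy as np

      x = np.random.rand(100, 1000).astype(np.float32)
      print(np.corrcoef(x).dtype)  # float64, despite the float32 input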

------
rrss
> Power users even know that Nvidia provides a drop-in replacement, CuPy, that
> brings the power of GPU acceleration.

> but Nvidia provides CuPy

> It's Nvidia's stuff

Is CuPy actually developed by Nvidia? It looks like the original authors were
at Preferred Networks, and none of the top few contributors on GitHub work at
Nvidia.

------
eslaught
If anyone's looking for NumPy that works on distributed machines with GPUs,
see also Legate, which is produced directly by NVIDIA:

[https://research.nvidia.com/publication/2019-11_Legate-NumPy...](https://research.nvidia.com/publication/2019-11_Legate-NumPy%3A-Accelerated)

[https://developer.nvidia.com/legate-early-access](https://developer.nvidia.com/legate-early-access)

Disclaimer: I work on Legion, the runtime system that Legate sits on top of,
but otherwise don't have anything to do with the Legate project.

------
chenzhekl
CuPy is not developed by NVIDIA, but by Preferred Networks, the company behind
Chainer.

------
dragandj
Uses the free Clojure library Neanderthal:
[https://github.com/uncomplicate](https://github.com/uncomplicate)

Also uses the book Numerical Linear Algebra for Programmers:
[https://aiprobook.com/numerical-linear-algebra-for-programme...](https://aiprobook.com/numerical-linear-algebra-for-programmers)

------
hellofunk
One of the things I never understood about the hype around these Clojure GPU
libraries is that there is a lot of marketing mumbo jumbo about how high-level
it all is. Even the book for these libraries has the words "no C++!" half a
dozen times on its ordering page [0].

However, these are just high-level libraries, and you cannot write the GPU
kernels in Clojure. You still must use C or C++ for that (depending on whether
you are going the OpenCL or CUDA route), and your C/C++ code is embedded in or
called from the Clojure side. This is no different from using another
high-level language like Python; it's focusing on a rather misleading and
irrelevant detail. You can use Clojure or Python (or many other languages) to
call OpenCL or CUDA kernels, or you can just use the libraries that call them
for you. Clojure is not special in this regard.

[0] [https://aiprobook.com/deep-learning-for-programmers/](https://aiprobook.com/deep-learning-for-programmers/)

~~~
akssri
Common Lispers have had the ability to write CUDA kernels on the fly with a
DSL, cl-cuda, for many years now. See

[https://github.com/takagi/cl-cuda](https://github.com/takagi/cl-cuda)

CuPy (a project in which takagi is/was actively involved) also has the ability
to compile kernels in the Python REPL, but the kernel code needs to be written
in C and passed in as a string (PyOpenCL can do the same for OpenCL). In fact,
not too many years ago, the kernels for gradient descent, max pooling, etc.
were all defined entirely in Python in Chainer. Projects like JAX take this to
another level, with the ability to compile a restricted class of Python
functions straight into GPU kernels.
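For instance, a minimal RawKernel sketch (essentially the example from CuPy's
docs; the kernel name and sizes here are illustrative):

      import cupy as cp

      # compile a C kernel from a string, straight from the REPL
      add_kernel = cp.RawKernel(r'''
      extern "C" __global__
      void my_add(const float* x1, const float* x2, float* y) {
          int tid = blockDim.x * blockIdx.x + threadIdx.x;
          y[tid] = x1[tid] + x2[tid];
      }
      ''', 'my_add')

      x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
      x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
      y = cp.zeros((5, 5), dtype=cp.float32)
      add_kernel((5,), (5,), (x1, x2, y))  # grid, block, kernel args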

I can't speak much about Clojure/Neanderthal, but I'd advise people to stop
dissing projects they are not familiar with, least of all with silly
benchmarks like the one above.

~~~
dragandj
Good advice. And yet you've taken it pretty seriously to diss
Clojure/Neanderthal and my blog post, mostly by talking about unrelated stuff
and projects that the post didn't even mention. And while introducing these
themes left and right, you didn't even bother to show some code related to
this topic, just suggestions of great projects by cool people.

Yes, you showed the PyTorch code related to the blog post, which confirms what
the blog post says, but when I pointed out that the code has incorrect
functionality (it misses some calculations), you didn't even bother to correct
it, or to confirm that the code is good and that I'm wrong.

So, it seems that your standard is that it is enough for one side to throw
bits and pieces around and call it a day, while the other has to run around
and prove that their stuff is better than everything that could possibly be
done in every technology.

I choose to stick to the theme. The theme is CuPy, NumPy, Clojure &
Neanderthal. A related theme could be code in another technology. Great:
write about it. But even if every other technology were a million times
better than what I describe in the article, it would not change the fact that
CuPy and NumPy have the issue I've described.

~~~
akssri
> And, yet, you've taken it pretty seriously to diss Clojure/Neanderthal and
> my blog post

I have not; all I've said so far is that your benchmark is flawed.

The fact that the code fragment above assumes zero-mean data (thus using 2
fewer L1 ops) doesn't change a _single thing_ in anything that has been
written; to wit, the timings change to 28.6 ms (GPU) and 333 ms (CPU). Pedantry
is not an argument.

~~~
dragandj
I still don't get how the fact that someone could implement in PyTorch the
same thing I did in Clojure has anything to do with NumPy and CuPy, or with my
benchmark.

BTW, your PyTorch code is still incorrect (or so it seems to me, although I
don't use PyTorch, so I can't try it on the computer). The formula for
correlation requires division by sigma_x * sigma_y (which has dimension n x
n), and you are dividing by (sigma_x)^2 (which has dimension n). So you still
forgot at least one L2 operation that computes all combinations of sigma_x *
sigma_y. A couple of operations here, a couple of operations there, an edge
case here, an edge case there: it adds up. That's why people use NumPy/CuPy
after all...

~~~
akssri
- The function in CuPy takes 29.4 ms, NumPy takes 427 ms. Happy?

- Broadcasting semantics + division take care of the outer-product
normalization; see the sketch below. This is 2 L1 ops in the size of the
matrix & the input (~ xSCAL).

Pedantry is still not an argument.
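To spell out the broadcasting identity (a minimal illustration; the array
values here are made up):

      import torch

      sigma = torch.tensor([1.0, 2.0, 4.0])
      cov = torch.ones(3, 3)

      # dividing by a (1, n) row and then by an (n, 1) column broadcasts
      # to the same result as dividing by the full n x n outer product
      a = cov / sigma[None] / sigma[None].T
      b = cov / torch.outer(sigma, sigma)
      assert torch.allclose(a, b)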

~~~
dragandj
Thank you so much, that's phenomenal news for me! (Since I can make
Neanderthal code go at 23 ms (GPU) and 3XX ms (CPU) when I implement it the
way NumPy/CuPy/PyTorch do, sans the float64 conversion, of course.) You saved
me from having to fiddle with Python (which I don't particularly enjoy).
Thanks again!

Can you please post your implementation of this function here, so I can try
it on my machine and compare it to Neanderthal?

------
bernardv
Great article. I learned a bunch of things I didn’t know about NumPy and
Clojure. As far as I’m concerned, the Dragan does rock.

------
xvilka
Or just use Julia, where the speed comes out of the box.

~~~
bearzoo
Oh? I don't suppose that Julia wraps LAPACK functions for no good reason. What
makes you say this?

~~~
ChrisRackauckas
A lot of the wrappers are largely legacy at this point. BLAS1 and BLAS2
operations are handled in pure Julia through the broadcast system; there's no
performance degradation there, and the fusion of operations actually gives a
lot of performance advantages over simple BLAS calls. For BLAS3, tools like
LoopVectorization let you write a pure BLAS3 in Julia and get good
performance; Gaius.jl is a good example of this, matching OpenBLAS with pure
Julia code. Of course, you still have to write out how to do the blocking, but
that's about it. At this point it's mostly a matter of needing more coverage
before deprecating the BLAS bindings in a Julia 2.0 down the line.

[https://github.com/MasonProtter/Gaius.jl](https://github.com/MasonProtter/Gaius.jl)

~~~
bearzoo
Thanks, that is pretty impressive

~~~
eigenspace
I'm the dev of Gaius.jl. I'm not sure it will ever end up in Julia's Base, but
I think of it as a prototype implementation that might inspire (or possibly
evolve into) a real implementation that could end up being bundled with the
language eventually.

Julia really does have the tools to solve this problem natively, and they're
maturing quite a bit more quickly than I ever would have guessed a year ago.

Last year, if you had asked me, I would have predicted that something like
LoopVectorization.jl would appear maybe by 2022. Instead it's here now, albeit
with a lot of work ahead to make it more robust.

