

Squeezing performance out of CUDA - g-garron
http://www.johnhawthorn.com/2012/03/squeezing-performance-from-cuda/

======
pavanky
Here are a couple of observations

1) The implementation may not be the most efficient for a larger or even
denser matrix. 1% of 52000 is 520. Divided across 32 threads, that is roughly
16 additions per thread. As that number increases, increasing the number of
threads (and eventually using more blocks per row) would be a good idea.

2) He is allocating twice as much shared memory as required. I genuinely
hope that was an artefact from before. If not, that is a killer for
performance: using more shared memory per block reduces the number of
concurrent blocks.

Note: I'm not sure why he is still using CUDA 3.2. CUDA has had CSR
multiplication for a few months now, and it has even gone through a revision
to make it faster.

~~~
jhawthorn
Oh, no! I didn't notice my first link from HN for a few days. Hopefully I can
clarify some of this.

First, this is in no way faster than cuSPARSE or cusp. I originally wrote
this for a school assignment (hence the older CUDA version) and was hoping
to convey what I had learned my first time using CUDA.

The shared memory is sized so that the reduction cannot overflow the buffer
without using conditionals. However, I am using more than needed; it should
be set to 32+16. I don't expect this to affect performance, as the kernel
already reaches 100% theoretical occupancy.

Could you explain the desire for more threads/blocks per row? I can't
immediately think of a reason; having more additions per thread sounds good
so long as all threads are kept busy.

Thanks very much for the read and the reply!

