

How to write a "GPUs are Awesome" paper - ColinWright
http://www.walkingrandomly.com/?p=3736

======
andrewcooke
hmmm. this is a little unfair. i've got around a factor of 30 speedup using a
cheap GPU (some old nvidia card i had in my machine) compared to a fairly
meaty xeon at work.

the important difference is that the kernel (written in opencl) was much more
than just "taking the sine". once the data are on the gpu you don't want to
move them back until you've done as much work as possible.

and there are different "kinds" of memory on the gpu. simply taking the sine
of every value in an array is not using that memory efficiently - you're not
using the equivalent of the gpu's L2 cache (and is the cpu's cache explaining
the increased speed on the cpu?).

so this would be i/o limited - all the test is doing is comparing memory
transfer to the cpu with memory transfer to the gpu, and coming up with
similar numbers.

i guess the fairest thing you can say is that if you're not going to put the
effort into understanding how best to use gpus, leave them alone.

(incidentally, unless you need 64bit fp, or are really limited by pci slots,
it makes much more sense to buy commodity graphics cards rather than the new
top-end fermi made-for-compute cards - for us switching from an "old" graphics
card to a c2050 gave an extra 20% speedup, iirc, which is peanuts compared to
the factor of 30 from moving off the cpu; also, nvidia support for the
expensive cards sucks, so opencl is worth the extra effort for the peace of
mind that you can try something else)

~~~
StavrosK
Not only that, but, Matlab? Seriously? I don't know the internals, but I'm
fairly sure that there's going to be some switching between the CPU and GPU
for Matlab to be able to run code there.

If you want something to run fast on the GPU, you code it in a language that
can run entirely on the GPU. Not Matlab.

~~~
andrewcooke
the code i ported was originally matlab. if you can write the operations
without explicit loops (using matlab's builtins) then it should be able to
translate things fairly well to multiple processors. i don't know much about
matlab's gpu support (when i looked, you needed third party libs - looks like
that has changed?) but again, if it's standard transforms and matrix
operations they should be able to make it fairly efficient.

in our case the matlab code had explicit loops (in a sense i was lucky - the
inner kernel operations were messy enough to need explicit coding, but regular
enough to be handled efficiently on a gpu) and that's what really kills you
(because you're bouncing out of optimised c code into their interpreter). just
moving from matlab to c (on the cpu) also gave us a significant speedup.

what i'm saying, in a slow rambling way, is that calculating a bunch of sines
should be as fast in matlab as in hand-coded c or opencl (on cpu and gpu
respectively). because that _can_ be expressed as a vector operation and
matlab will have invested a lot of time and effort into making that code fast.

~~~
StavrosK
Interesting. I checked the manual after my post above, and they do execute
sines entirely on the GPU; however, you still need to jump in and out for most
code. That does invalidate my point for his example (though it doesn't
invalidate any of the other points).

------
modeless
If you're only doing one mathematical operation on each of the 100 million
numbers and then you need to read them all back to the CPU, of course the GPU
will be slower. However, if you need to do 1,000 or 10,000 mathematical
operations each on 100 million numbers, and then get back a smaller answer of
maybe a million numbers, then the GPU is going to blow the CPU away.
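
This can be sketched as a back-of-envelope cost model. All constants below are
made-up, illustrative figures for transfer bandwidth and arithmetic throughput,
not measurements of any real hardware:

```python
# Toy cost model: when does offloading work to a GPU pay off?
# All constants are illustrative assumptions, not benchmarks.

PCIE_BYTES_PER_S = 5e9   # assumed host<->device transfer rate
GPU_OPS_PER_S = 5e11     # assumed GPU arithmetic throughput
CPU_OPS_PER_S = 1e10     # assumed CPU arithmetic throughput

def gpu_time(n_elements, ops_per_element, bytes_per_element=8):
    """Transfer everything to the device, compute, transfer back."""
    transfer = 2 * n_elements * bytes_per_element / PCIE_BYTES_PER_S
    compute = n_elements * ops_per_element / GPU_OPS_PER_S
    return transfer + compute

def cpu_time(n_elements, ops_per_element):
    return n_elements * ops_per_element / CPU_OPS_PER_S

n = 100_000_000
# One op per element: the transfer dominates and the GPU loses.
print(gpu_time(n, 1) > cpu_time(n, 1))            # True under these assumptions
# 10,000 ops per element: the transfer is amortized and the GPU wins big.
print(gpu_time(n, 10_000) < cpu_time(n, 10_000))  # True under these assumptions
```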

The ultimate answer to this post is a couple of steps down the AMD Fusion
roadmap, where the GPU is a coprocessor like floating point units were back in
the day. The GPU and CPU will be on the same die with unified memory and
caches. When that architecture is fully realized, these problems will go away.

------
kyky
Amusing blog, but an unfair example. Intel did a more extensive analysis of
this issue in "Debunking the 100X GPU vs. CPU myth"
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.2755&rep=rep1&type=pdf]

------
mynegation
All the points touched on in the article are valid. However, calculating the
sine of 100M numbers is a simple but not very representative case for GPGPU
programs. Leaving data in GPU memory between kernel invocations for as long as
possible and avoiding transfers is one of the first optimizations that must be
considered.

~~~
profquail
His points may be valid (that some people cut corners to get the best numbers
for their papers); however, as someone who spends a _lot_ of time writing
GPGPU code[1], I'll point out that MATLAB isn't the best way to benchmark GPU-
based code. I'm not knocking MATLAB here, but simply pointing out that MATLAB
is going to have overhead compared to, say, CUDA. Given that most GPGPU
kernels run pretty quickly (less than a second, and probably under 100ms) even
a little bit of overhead is going to severely affect your results.

[1] See my profile.

------
imurray
Separate to the whole GPU issue: when timing code in Matlab it's best to use a
more careful routine than tic/toc, like timeit.m (by a Mathworks employee):
<http://www.mathworks.com/matlabcentral/fileexchange/18798>

~~~
madiator
Thanks, didn't know about this!

------
neutronicus
His problem is that you need an O(n^2) or higher algorithm to really see the
advantages. Something like the n-body problem will go much, much faster on the
GPU than on the CPU for large n.
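
For reference, the naive O(n^2) force accumulation looks like the sketch below
(plain Python standing in for a GPU kernel; the structure - every body against
every other body, with no data-dependent branching - is what maps so well onto
thousands of GPU threads, where each body's outer-loop iteration becomes one
thread):

```python
import math

def nbody_forces(pos, mass, g=1.0, eps=1e-9):
    """Naive O(n^2) gravitational force accumulation in 2D."""
    n = len(pos)
    forces = []
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps  # softening avoids divide-by-zero
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            fx += g * mass[i] * mass[j] * dx * inv_r3
            fy += g * mass[i] * mass[j] * dy * inv_r3
        forces.append((fx, fy))
    return forces

forces = nbody_forces([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], [1.0, 1.0, 1.0])
# Sanity check via Newton's third law: net force on the system is ~zero.
print(abs(sum(f[0] for f in forces)) < 1e-9)
```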

~~~
drmajormccheese
There have been some GPU implementations of the Barnes-Hut algorithm [an
O(n log n) algorithm for the n-body problem].

~~~
neutronicus
How do they do against CPU implementations of same?

I imagine you've really got to know what you're doing to compete, because that
seems like a pretty branchy algorithm.

------
gavanwoolery
This is a pretty laughable basis for comparison - anyone with more than a
trivial knowledge of GPUs knows that the largest bottleneck is the bandwidth
between system memory and GPU memory, and you are always supposed to design
your program with this in mind. A trivial benchmark like taking the sine of
each number in an array is not only a pointless operation to attempt on the
GPU, but also not a good benchmark of real-world GPU applications. GPUs are
only good at certain things, and tend to rely on producing data on their end
(procedurally), or working with the same set of data over thousands of frames
or instances. If you took the sine of the results recursively, and did this a
million times over, the GPU would blow the CPU out of the water. What the
author is claiming is like saying that one CPU is faster than another because
one reads from memory and the other from the hard disk.

------
kragen
By the way, the term for the ratio "arithmetic computations per memory
access", which a lot of the comments here are referring to, is "arithmetic
intensity". (We used to say "computational intensity" but that term is even
more confusing.) Low arithmetic intensity problems are going to be
bottlenecked on memory transfer time, especially for GPGPU, and I'm surprised
to see that the GPU doesn't come out quite a bit worse in the comparison for
this low-arithmetic-intensity problem. Maybe he should have picked a simpler
operation than sine if he wants to make his CPU look more awesome —
reciprocal, say. Take the reciprocal of 10 million random numbers.
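
A quick worked example of the ratio (the byte counts are exact; counting one
sine as ~20 flops is an assumed cost for a polynomial approximation, and a
reciprocal as 1 flop):

```python
# Arithmetic intensity: useful operations per byte of memory traffic.
def arithmetic_intensity(ops, bytes_moved):
    return ops / bytes_moved

N = 10_000_000
# Sine of N doubles: ~20 flops per element (assumed), 8 bytes in + 8 out.
sine_ai = arithmetic_intensity(20 * N, 16 * N)
# Reciprocal: ~1 flop per element, same traffic - even lower intensity.
recip_ai = arithmetic_intensity(1 * N, 16 * N)
print(sine_ai, recip_ai)  # 1.25 vs 0.0625 ops/byte - both memory-bound
```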

~~~
repsilat
> Low arithmetic intensity problems are going to be bottlenecked on memory
> transfer time, especially for GPGPU

True in absolute terms, but it might still be worthwhile doing that anyway.
You're still offloading computation to the GPU, and the data transfer isn't
going to tie up all of your CPU's cores. While you're sending data across, the
rest of your CPU could be doing other tasks it's better suited for (branchy
code, virtual function calls, recursion...).

------
malbs
Anyone who is doing serious work with GPUs already knows the performance hit
you cop when shuffling data to and from the card.

If your solution relies on moving data back and forth from host memory to
device memory, you probably won't get the speed boosts you were hoping for.

I can create a solution where the CPU beats the pants off the GPU, and
conversely the GPU absolutely smashes the CPU, but for my actual real problems
(Data copied to the card once, many iterations, multiple calculations per
iteration, results copied back to host), the GPU gives one hell of a boost.

------
dougws
I really don't understand the point of this article. The author seems to be
honestly interested in exploring GPU computation, and is clearly reasonably
well-read in the field. The sarcasm is just off-putting, though, and the
article presents very little of actual value--this one trivial computation
happens to be slower; so what?

------
pjscott
The moral of the story is Amdahl's Law:

<http://en.wikipedia.org/wiki/Amdahls_law>
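
In one line, where p is the fraction of the runtime that can be accelerated
and s is the speedup of that fraction:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even an infinitely fast GPU helps little if only half the time is GPU-friendly:
print(round(amdahl_speedup(0.5, 1e9), 2))   # 2.0
# Speed up 90% of the work by 30x (the kind of kernel speedup quoted upthread):
print(round(amdahl_speedup(0.9, 30.0), 2))  # 7.69
```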

------
tomjen3
Try it with a 1024x1024 matrix multiplication and suddenly GPUs are much more
useful.
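
Matrix multiplication fits the pattern discussed upthread: O(n^3) work on
O(n^2) data, so the transfer cost is amortized. A quick calculation, counting
one multiply-add as two flops:

```python
def matmul_intensity(n, bytes_per_elem=8):
    """Flops per byte transferred for an n x n matrix multiply."""
    flops = 2 * n ** 3                      # n^3 multiply-adds
    traffic = 3 * n ** 2 * bytes_per_elem   # two inputs over, one result back
    return flops / traffic

print(matmul_intensity(1024))  # ~85.3 flops per byte moved over the bus
```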

