
An evaluation of throughput computing on CPU and GPU (2010) [pdf] - luu
http://sbel.wisc.edu/Courses/ME964/Literature/LeeDebunkGPU2010.pdf
======
glangdale
We have certainly seen a number of cases where the world's most naive scalar
algorithm is compared with a sophisticated GPU-based algorithm that (nominally)
does the same thing. It's an easy trap to fall into: no one gets a pat on
the head from their supervisor for tuning the system they're meant to be
_beating_.

~~~
dsharlet
This is where the "100x myth" comes from. If you write a naive scalar
algorithm in C, and then write (almost) the same thing in CUDA, you can get a
~100x speedup. The problem is that most of that speedup comes from the naive C
implementation's lack of parallelism and vectorization, which CUDA gives you
"for free".
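
(To make that concrete, here's a hypothetical SAXPY sketch, not from the
paper: the scalar C version runs on one core and only touches the vector
units if the compiler happens to auto-vectorize it, while the nearly
identical CUDA version is implicitly parallel across thousands of threads.)

    // Naive scalar C: one thread, no explicit vectorization.
    void saxpy_cpu(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    // (Almost) the same thing in CUDA: the loop body becomes a kernel,
    // and the hardware runs one instance per element in parallel.
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launch with one thread per element, e.g.:
    // saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);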

The thing is, this is still a real win for CUDA, because most people aren't
going to write highly optimized C. It's just not a hardware win; it's a
compiler win. What most people get out of CUDA is not an advantage from
running on a GPU, but the advantage of a smart compiler/language design. Most
of the same people would get all but 1-5x of the benefit from using something
like ISPC[1], which offers a programming model similar to CUDA's, implemented
on the CPU.

In my experience (from the Tesla/GTX 200 era, though I doubt things have
changed _that_ much now), such a small boost in performance is not worth the
hassle of transferring data to/from the GPU, the (lack of) virtualization, and
driver shenanigans/support issues (at one point, I had to suggest someone buy
a fake monitor dongle to plug into his GPU to be able to use my code...).
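
(A minimal sketch of that transfer overhead, with hypothetical names: before
the kernel does any work, the data has to cross the PCIe bus, and it crosses
back afterwards. For a kernel that's only a few times faster than the CPU
version, these two copies can eat the entire win.)

    #include <cuda_runtime.h>

    // Hypothetical wrapper showing the round trip every GPU offload pays.
    void run_on_gpu(const float *h_in, float *h_out, size_t n) {
        float *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(float));
        // Host -> device copy: pure overhead relative to staying on the CPU.
        cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
        // ... kernel launch would go here ...
        // Device -> host copy: paid again on the way back.
        cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_buf);
    }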

Anyways, it's good to see this thread; people these days think you're crazy
for not using the GPU for big compute workloads.

1\. [https://ispc.github.io/](https://ispc.github.io/)

~~~
hughperkins
Have you actually tried writing highly optimized CUDA? It's _really_ _hard_.
_Nothing_ is "for free". A lot of the caching (not all) has to be hand coded.
You can see how much attention to detail Scott Gray puts into his kernels,
e.g. at
[https://github.com/NervanaSystems/maxas](https://github.com/NervanaSystems/maxas)
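
(For a sense of what hand-coded caching looks like, here's a sketch of the
standard shared-memory tiling idiom for matrix multiply; it assumes n is a
multiple of the tile size. None of this staging happens automatically - the
programmer writes the "cache" and the barriers by hand.)

    #define TILE 16

    // Each block stages a TILE x TILE tile of A and B into on-chip shared
    // memory before using it, with explicit barriers between phases.
    __global__ void matmul_tiled(const float *A, const float *B,
                                 float *C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();  // wait until the whole tile is staged
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();  // don't overwrite the tile while it's in use
        }
        C[row * n + col] = acc;
    }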

~~~
jacquesm
Not that much harder than writing hand-optimized assembly.

There are a few more quirks to keep in mind, and you need to have the memory
layout and access patterns down, otherwise it will not give you the boost you
expect.
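
(The classic example is memory coalescing - a hypothetical sketch: in the
first kernel, consecutive threads touch consecutive addresses, so the
hardware merges each warp's accesses into a few wide transactions; in the
second, the same logical copy scatters the accesses and can run many times
slower.)

    // Coalesced: thread i touches element i.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i touches element i * stride, defeating coalescing.
    __global__ void copy_strided(const float *in, float *out,
                                 int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }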

In a way I really _love_ CUDA for this reason: it allows me to recycle all my
old optimization skills and do something useful with them.

------
hughperkins
This is really old, and Intel-sponsored. So Intel research states that GPUs
are 2.5 times faster, NVIDIA says two orders of magnitude faster; presumably
the truth is somewhere in between. In any case, it depends on the problem you
are trying to solve, but in the machine learning space, with convolutional
neural networks, GPUs have recently become the clearly dominant platform.

~~~
sufiyan
And it is precisely because of this that the paper has to go through peer
review. The authors also specifically state that the optimisations they apply
make the code run at least as fast as, if not faster than, the state of the
art.

------
Athas
While the conclusion of the paper ("compare optimised GPU code to optimised
CPU code") is good, I'm curious whether the difference has increased since
2010. GPUs are architecturally easier to scale up than CPUs, after all.

~~~
pcwalton
CPU SIMD width has increased too, both on x86-64 (AVX2) and on ARM (ARMv8
NEON).

------
nickeleres
tl;dr:

"We show that CPUs and GPUs are much closer in performance (2.5X) than the
previously reported orders of magnitude difference."

------
Mmrnmhrm
I dare Intel to provide an updated version of this paper, and I dare Nvidia to
perform the peer review.

