

GPU Vs CPU Smackdown : The Rise Of Throughput-Oriented Architectures - yarapavan
http://highscalability.com/blog/2010/12/3/gpu-vs-cpu-smackdown-the-rise-of-throughput-oriented-archite.html

======
alecco
What a terrible article. It oversimplifies the situation. There are many
options for parallelism in programming; it's not just CPU vs GPU. They fall
for the Nvidia advertisement kool-aid, implying CPUs only have scalar
performance, and then neglect to mention the very important shortcomings of
GPU cores (e.g. poor caching, no branching, dismal PCIe latency back to the
main program on the CPU). Current CPU cores have big L1/L2/L3 caches,
superscalar execution and SIMD. That's why the cores are so big.

~~~
FooBarWidget
Exactly. So far I've found few good uses for GPGPUs. They're great for all the
things that they're "traditionally" good at:

* Graphics manipulation.

* Physics simulation.

* Video encoding/decoding.

* Anything else involving lots of matrix/vector calculations. Probably lots of scientific uses.

But for general purpose programming, GPGPUs aren't so useful. Suppose you're
writing a web server. It would be very cool if you could leverage the GPU to
make it faster, but I don't think you can. Suppose there were a good way to
write an HTTP parser in OpenCL; by the time you've uploaded the socket data to
the GPU you could already have parsed the header 3 times on the CPU.
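
For what it's worth, here's a rough sketch of how you could measure just the
upload cost with the plain OpenCL 1.x host API; error handling and the actual
parsing kernel are left out, and the 512-byte "header" is just a made-up
stand-in:

    // Sketch: time a blocking copy of a header-sized buffer from host memory
    // to the GPU. Assumes one GPU device and OpenCL 1.x headers; no error
    // checking, no kernel -- this only measures the transfer itself.
    #include <CL/cl.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_platform_id platform; cl_device_id device; cl_int err;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

        std::vector<char> header(512, 'A');   // stand-in for a request header
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, header.size(), NULL, &err);

        auto t0 = std::chrono::high_resolution_clock::now();
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, header.size(),
                             header.data(), 0, NULL, NULL);  // blocking copy to the GPU
        clFinish(queue);
        auto t1 = std::chrono::high_resolution_clock::now();

        std::printf("upload of %zu bytes took %lld us\n", header.size(),
                    (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        clReleaseMemObject(buf);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 0;
    }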

Feel free to correct me if I'm wrong.

~~~
alecco
Pretty much, yeah.

Next year, with Intel's Sandy Bridge and AMD's Fusion, things will change
significantly. With the CPU absorbing the GPU cores, many shortcomings will be
gone:

    
    
      * Low latency GPU-CPU interaction via L3 cache
      * Conditional execution for parallel code
      * Low cost to set up a parallel thread/tasklet
    

And also next year AVX will revive x86's SIMD.

    
    
      * Double sized registers
      * Non-destructive instructions (yay!)
      * No alignment madness
    

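A minimal sketch of what those points buy you in practice (assumes a compiler
with AVX enabled, e.g. gcc -mavx; the non-destructive three-operand form only
shows up in the generated assembly, since intrinsics already look
non-destructive in source):

    // Sketch: an 8-wide float add with AVX intrinsics and unaligned
    // loads/stores. Function and array names are made up for illustration.
    #include <immintrin.h>

    void add_arrays(const float* a, const float* b, float* out, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);               // unaligned load is fine with AVX
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb)); // 8 lanes per instruction
        }
        for (; i < n; ++i)                                    // scalar tail
            out[i] = a[i] + b[i];
    }
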
OpenCL is a bit of a can of worms; I'm not sure it will prevail in the long
run. There are exciting new products for higher level programming like Intel's
Parallel Studio, in particular Cilk and Array Building Blocks.

Hardware is changing, and it's not viable to recompile for every possible
hardware combination. Most educated predictions bet on generated code.

The chaos and misinformation are very annoying, but it is very exciting to live
in an era of such a huge paradigm change in computing. Most algorithms have to
be rewritten with a very different mindset. It's like a gold rush. Good times
:)

~~~
vilya
You've mentioned twice now that GPUs can't do conditional execution; that's
incorrect. GPUs have had branch instructions for quite a while now.

The limitation is that every thread on a core is executing the same
instruction. If some of them take one branch and the rest take another, both
branches have to be executed one after the other while masking out the threads
it doesn't apply to. That reduces the performance you get, but it's certainly
possible.
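
To make the masking concrete, here is a rough CPU-side analogue of what the
hardware does per lane, written with SSE intrinsics (4 lanes standing in for a
32-thread warp; the function name and the 1.0 threshold are invented for
illustration):

    // Sketch of "execute both branches, then select by mask".
    // Per lane: if (x > 1.0f) result = 1.0f; else result = x + x;
    #include <immintrin.h>

    __m128 clamp_or_double(__m128 x) {
        __m128 limit    = _mm_set1_ps(1.0f);
        __m128 mask     = _mm_cmpgt_ps(x, limit);     // per-lane condition
        __m128 if_true  = limit;                      // "taken" branch, computed for all lanes
        __m128 if_false = _mm_add_ps(x, x);           // "not taken" branch, also computed
        return _mm_or_ps(_mm_and_ps(mask, if_true),   // blend the two results by the mask
                         _mm_andnot_ps(mask, if_false));
    }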

It's also worth pointing out that GPUs these days have a number of independent
cores, each of which _can_ execute different instructions simultaneously.

Hope that clears it up for you.

~~~
alecco

      > The limitation is that every thread on a core is
      > executing the same instruction. If some of them take
      > one branch and the rest take another, both branches
      > have to be executed one after the other while masking
      > out the threads it doesn't apply to.
    

In my book, that's far from conditional branching. And I point it out
because even SIMD can have conditional branching (using the CPU's branch
predictor).

    
    
      > That reduces the performance you get
    

That's an understatement. It complicates the implementation significantly.
Parallel-oriented (?) programming is quite hard by itself without all this.

In many non-trivial situations you need to check whether a loop has reached
its limit, or do plain data-structure bounds checks.

Don't get me wrong: I think it's amazing to do GPGPU. What pisses me off is
the overblown marketing and all the noise from people repeating it like canon
when they've clearly never implemented a single non-trivial program in
OpenMP/CUDA/OpenCL/SIMD.

    
    
      > Hope that clears it up for you.
    

That phrasing can read as an implied personal attack, but let's not go down
that road, shall we? (And I'm foreign and could be reading too much
into it.)

~~~
vilya
"And I point it out because even SIMD can have conditional branching (using
the CPU's branch predictor.)"

This doesn't sound correct to me, but perhaps I'm missing something? I can't
see how the branch predictor is relevant here. Can you explain in more detail?

"That's an understatement. It complicates the implementation significantly.
Parallel-oriented (?) programming is quite hard by itself without all this."

How does this complicate it? You don't have to implement it yourself. If
you're coding in CUDA, OpenCL or any of the common shading languages, then you
write if-statements just as you would in C and it does the right thing.
Honestly, the only concern is the performance degradation you get.

"That phrasing can carry an implied personal attack"

There was none intended. I was hoping that would make it sound helpful but I
guess it didn't work. Text can be a tricky medium. Sorry.

------
microarchitect
The article is correct in saying that we have more choices open to us now than
a few years ago. This is partly because of GPUs, and partly because of
technological considerations.

However, the article seems to suggest that there is a dichotomy between modern
CPUs and throughput-oriented architectures. This is not true. For instance,
Sun's Niagara eschews out-of-order execution and branch speculation, gaining
throughput by executing 8 threads concurrently. This type of microarchitecture
is still more flexible than a contemporary GPU, yet gives you considerably
higher throughput for I/O-dominated applications.

In fact, I'll go so far as to say that although future architectures will be
hybrids of low-latency and high-throughput designs, the high-throughput parts
will resemble CPUs more than current GPUs.

------
jakozaur
For me, the article misses the point. GPUs are best at accelerating small,
data- and computation-intensive portions of your program. Most of the code
will still be executed on the CPU; usually only a few functions are the
bottleneck and need to be ported.

CPUs are about task parallelism while GPUs are about data parallelism. On the
CPU there is usually a small number of threads and the programmer controls
them explicitly. On the GPU, threads and switching between them are almost
free, so you can use as many as you want (make the number data-dependent).

So GPUs will excel in applications that need, e.g.:

* float computations (10x+ speedup)

* graphics-specific functions (e.g. texture interpolation, 40x+)

* memory-bound work (continuous access, 2x+)

On the other hand, they are behind CPUs in:

* branching

* small, irregular computations

x86 currently spends quite a lot of silicon on features like caches, branch
prediction and superscalar execution, which speed up existing applications.
GPUs don't have most of those features; they spend more silicon on core logic
instead.

To sum up, a good candidate for GPGPU would be image recognition. A bad one
would be a compiler.

~~~
foobarbazoo
nVidia's Fermi chip has coherent L1 and L2 caches, and that's what Amazon's
GPU Compute cluster nodes use. Branch prediction and superscalar execution are
still missing, but IMO they aren't needed given the other architectural
decisions on the chip.

------
dholowiski
I wonder how long it will be until we see GPUs used for general purposes on
the server, like accelerating Apache or MySQL. Since this hasn't been done
yet, there must be some reason it's not possible?

~~~
FooBarWidget
If you can tell me how to parse HTTP on the GPU faster than on the CPU I would
be very interested. Right now even _uploading_ an HTTP header to the GPU takes
more time than parsing it on the CPU.

MySQL is heavily disk-bound. It is not CPU-bound except for things like
sorting multi-gigabyte datasets, which people rarely use MySQL for. For most
purposes you're better off installing more RAM than offloading computations to
the GPU; otherwise MySQL spends all its time waiting for the hard disk.

~~~
vilya
Encryption and decryption for https could benefit quite a bit, I would think.

~~~
wmf
Except Intel already put in specific instructions for that.

I was surprised to discover that GPUs can accelerate SQLite
<http://pbbakkum.com/db/> and routing <http://shader.kaist.edu/packetshader/>
(although you're probably better off using an NP for routing).

~~~
vilya
Now that IS interesting. Thanks for the links!

