GPU Vs CPU Smackdown : The Rise Of Throughput-Oriented Architectures (highscalability.com)
31 points by yarapavan on Dec 4, 2010 | 18 comments



What a terrible article. It's oversimplifying the situation. There are many options for parallelism in programming; it's not just CPU vs GPU. They fall for the Nvidia advertising kool-aid implying CPUs only have scalar performance, and then neglect to mention the very important shortcomings of GPU cores (e.g. poor caching, no branching, dismal PCIe latency to the main program on the CPU). Current CPU cores have big L1/2/3 caches, superscalar execution and SIMD. That's the reason for their big size.


Exactly. So far I've found few good uses for GPGPUs. They're great for all the things that they're "traditionally" good at:

* Graphics manipulation.

* Physics simulation.

* Video encoding/decoding.

* Anything else involving lots of matrix/vector calculations. Probably lots of scientific uses.

But for general-purpose programming, GPGPUs aren't so useful. Suppose you're writing a web server. It would be very cool if you could leverage the GPU to make it faster, but I don't think you can. Suppose there were a good way to write an HTTP parser in OpenCL; by the time you've uploaded the socket data to the GPU you could already have parsed the header 3 times on the CPU.
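
A back-of-the-envelope way to check, for anyone curious: time the host-to-device copy of a header-sized buffer with CUDA events and compare it against a CPU parse. Illustrative sketch only, not a real benchmark:

  #include <cuda_runtime.h>
  #include <cstdio>
  #include <cstring>

  int main() {
      char header[512];                        // stand-in for an HTTP header
      memset(header, 'a', sizeof(header));

      char *d_header;
      cudaMalloc((void **)&d_header, sizeof(header));

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start);
      cudaMemcpy(d_header, header, sizeof(header), cudaMemcpyHostToDevice);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);

      float ms = 0.0f;
      cudaEventElapsedTime(&ms, start, stop);  // PCIe transfer time alone
      printf("copied %zu bytes to the GPU in %.3f ms\n", sizeof(header), ms);

      cudaFree(d_header);
      return 0;
  }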

Feel free to correct me if I'm wrong.


Pretty much, yeah.

Next year, with Intel's Sandy Bridge and AMD's Fusion, things will change significantly. With the CPU absorbing the GPU cores, many shortcomings will be gone:

  * Low latency GPU-CPU interaction via L3 cache
  * Conditional execution for parallel code
  * Low cost to set up a parallel thread/tasklet
And also next year, AVX will revive x86's SIMD:

  * Double sized registers
  * Non-destructive instructions (yay!)
  * No alignment madness
OpenCL is a bit of a can of worms; I'm not sure it will prevail in the long run. There are exciting new products for higher-level programming, like Intel's Parallel Studio, in particular Cilk and Array Building Blocks.

Hardware is changing, and it's not viable to recompile for every possible combination of hardware. Most educated predictions bet on generated code.

The chaos and misinformation are very annoying, but it is very exciting to live in an era of such a huge paradigm change in computing. Most algorithms have to be rewritten with a very different mindset. It's like a gold rush. Good times :)


You've mentioned twice now that GPUs can't do conditional execution; that's incorrect. GPUs have had branch instructions for quite a while now.

The limitation is that every thread on a core is executing the same instruction. If some of them take one branch and the rest take another, both branches have to be executed one after the other, masking out the threads each one doesn't apply to. That reduces the performance you get, but it's certainly possible.
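
To make that concrete, here's a minimal CUDA kernel sketch (just an illustration of the mechanism, not from any particular codebase):

  __global__ void divergent(const int *in, int *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      // Lanes of a warp that disagree on this condition are serialized:
      // the even path runs with the odd lanes masked off, then vice versa.
      if (in[i] % 2 == 0)
          out[i] = in[i] * 2;
      else
          out[i] = in[i] + 1;
  }

Whenever a warp holds a mix of even and odd elements, both assignments execute back to back with the inactive lanes masked, which is the serialization described above.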

It's also worth pointing out that GPUs these days have a number of independent cores, each of which can execute different instructions simultaneously.

Hope that clears it up for you.


  > The limitation is that every thread on a core is
  > executing the same instruction. If some of them take
  > one branch and the rest take another, both branches
  > have to be executed one after the other while masking
  > out the threads it doesn't apply to.
In my dictionary that's far from conditional branching. And I point it out because even SIMD can have conditional branching (using the CPU's branch predictor.)
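
For example, something along these lines: do the per-lane work with a mask, and let an ordinary movemask-plus-branch (which the branch predictor handles) skip the work when no lane needs it. A quick host-side SSE sketch, illustrative only:

  #include <xmmintrin.h>

  /* Clamp negative lanes to zero, skipping the work when no lane is
     negative. The early-out is a normal scalar branch, so the CPU's
     branch predictor is in play; the per-lane part is masking. */
  static void clamp_negatives(float *v /* 4 floats, 16-byte aligned */) {
      __m128 x    = _mm_load_ps(v);
      __m128 mask = _mm_cmplt_ps(x, _mm_setzero_ps()); /* lanes with x < 0 */
      if (_mm_movemask_ps(mask) == 0)                  /* predictable branch */
          return;
      _mm_store_ps(v, _mm_andnot_ps(mask, x));         /* zero the negative lanes */
  }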

  > That reduces the performance you get
That's an understatement. It complicates the implementation significantly. Parallel-oriented (?) programming is quite hard by itself without all this.

In many non-trivial situations you need to check for reaching loop limits or do plain data-structure bounds checks.

Don't get me wrong: I think it's amazing to do GPGPU. What pisses me off is the overblown marketing and all the noise from people repeating it like canon when they clearly have never implemented a single non-trivial program in OpenMP/CUDA/OpenCL/SIMD.

  > Hope that clears it up for you.
That phrasing can carry an implied personal attack, but let's not fall into personal attacks, shall we? (And I'm foreign and could be reading too much into it.)


"And I point it out because even SIMD can have conditional branching (using the CPU's branch predictor.)"

This doesn't sound correct to me, but perhaps I'm missing something? I can't see how the branch predictor is relevant here. Can you explain in more detail?

"That's an understatement. It complicates the implementation significantly. Parallel-oriented (?) programming is quite hard by itself without all this."

How does this complicate it? You don't have to implement it yourself. If you're coding in CUDA, OpenCL or any of the common shading languages, then you write if-statements just as you would in C and it does the right thing. Honestly, the only concern is the performance degradation you get.

"That phrasing can carry an implied personal attack"

There was none intended. I was hoping that would make it sound helpful but I guess it didn't work. Text can be a tricky medium. Sorry.


> * Low latency GPU-CPU interaction via L3 cache

Intel's GPUs kinda suck, so I'm not sure how much performance you're going to get out of them even with the so-called "low-latency" interconnect. On the AMD side, they have better GPUs, but unfortunately Ontario (and I believe Llano as well) don't even share a northbridge, let alone an L3 cache.


  >> * Low latency GPU-CPU interaction via L3 cache
  > Intel's GPUs kinda suck so I'm not sure how much performance
  > you're going to get out of them even with the so called
  > "low-latency" interconnect
Sure, Intel's GPUs are not as good. But latency and interaction with the CPU cores are the main problem at the moment for parallel programming outside of corner cases like HPC or multimedia. You show animosity toward Intel, but this architecture has not even been released yet.

  > On the AMD side, they have better GPUs, but unfortunately
  > Ontario (and I believe Llano as well) don't even share
  > northbridges, leave alone L3 cache.
Llano is not the real deal for AMD Fusion; that will be Bulldozer and Bobcat. Everybody reports that at least Bulldozer will have L3-level communication.

http://www.anandtech.com/show/3865/amd-bobcat-bulldozer-hot-...

I would question AMD's gamble on OpenCL and its lack of tools compared to Intel, which acquired MIT's Cilk and absorbed other projects. Also, I haven't seen anything close to Intel's SIMD sort algorithm yet (with regard to technical concepts, elegance, and sheer record-breaking performance).


The article is correct in saying that we have more choices open to us now than a few years ago. This is partly because of GPUs, and partly because of technological considerations.

However, the article seems to suggest that there is a dichotomy between modern CPUs and throughput-oriented architectures. This is not true. For instance, the Sun Niagara eschews out-of-order execution and branch speculation to gain throughput by executing 8 threads concurrently. This type of microarchitecture is still more flexible than your contemporary GPU, but gives you considerably higher throughput for I/O-dominated applications.

In fact, I'll go so far as to say that although future architectures will be hybrids of low-latency and high-throughput designs, the high-throughput parts will resemble CPUs more than current GPUs.


For me, the article is missing the point. GPUs are best at accelerating small, data- and computation-intensive portions of your program. Most of the code will be executed on the CPU; usually only a few functions are the bottleneck and need to be ported.

CPUs are about task parallelism, while GPUs are about data parallelism. On the CPU there is usually a small number of threads, and the programmer controls them explicitly. On the GPU, threads and switching between them are almost free, so you can use as many as you want (make the number data-dependent).
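
A tiny CUDA sketch of that style (illustrative names only): one lightweight thread per element, with the grid size following the data:

  __global__ void scale(const float *in, float *out, int n, float k) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = k * in[i];            // one cheap thread per element
  }

  // Host side: the thread count is data-dependent.
  //   int threads = 256;
  //   int blocks  = (n + threads - 1) / threads;
  //   scale<<<blocks, threads>>>(d_in, d_out, n, 2.0f);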

So GPUs will excel in applications that involve, e.g.:

* float computations (10x+ speedup)

* graphics-specific functions (e.g. texture interpolation: 40x+)

* memory-bound work (contiguous access: 2x+)

On the other hand, they are behind the CPU in:

* branching

* small irregular computations

x86 currently spends quite a lot of silicon on features like caches, branch prediction, and superscalar execution, which speed up existing applications. GPUs don't have most of those features; they spend more silicon on core logic instead.

To sum up, a good candidate for GPGPU would be image recognition. A bad one would be a compiler.


nVidia's Fermi chip has coherent L1 and L2 caches, and that is what is used in Amazon's GPU Compute cluster nodes. Branch-prediction and superscalar execution are still missing, but IMO aren't needed due to other architectural decisions on the chip.


I wonder how long it will be until we see GPUs used for general purposes on the server, like accelerating Apache or MySQL. Since this hasn't been done yet, there must be some reason it's not possible?


If you can tell me how to parse HTTP on the GPU faster than on the CPU, I would be very interested. Right now even uploading an HTTP header to the GPU takes more time than parsing it on the CPU.

MySQL is heavily disk-bound. It is not CPU-bound except for things like sorting multi-gigabyte datasets, which people rarely use MySQL for. For most purposes you're better off installing more RAM than offloading computations to the GPU; otherwise MySQL has to spend all its time waiting for the hard disk.


Encryption and decryption for https could benefit quite a bit, I would think.


Except Intel already put in specific instructions for that.

I was surprised to discover that GPUs can accelerate SQLite http://pbbakkum.com/db/ and routing http://shader.kaist.edu/packetshader/ (although you're probably better off using an NP for routing).


Now that IS interesting. Thanks for the links!


Can encryption be parallelized? With block cipher chaining modes like CFB, each block depends on the previous block. You can parallelize encryption if you use ECB but that's known to be insecure.


Individual streams aren't parallelizable, but if you have many streams open at once you could hypothetically batch them up and process them together in parallel, instead of handling each one separately in parallel as you would on a CPU.

Also, there are block modes that are seekable and thus could be parallelized if you had a big enough backlog (CTR mode in particular), but parallelizing individual streams is not likely to reap big enough rewards to justify the complexity.
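
For the curious, a rough CUDA sketch of why CTR parallelizes (the cipher below is a toy placeholder just to keep the sketch self-contained, not a real or secure one): each 16-byte keystream block depends only on the nonce and its own counter, so every thread can produce one block independently.

  // Toy stand-in for a real block cipher (NOT secure).
  __device__ void toyBlockCipher(const unsigned char key[16],
                                 const unsigned char in[16],
                                 unsigned char out[16]) {
      for (int i = 0; i < 16; ++i)
          out[i] = key[i] ^ in[i];
  }

  // CTR mode: ciphertext[b] = plaintext[b] XOR E_k(nonce || counter_b).
  // No block depends on another, so one thread per 16-byte block works.
  __global__ void ctrEncrypt(const unsigned char key[16],
                             const unsigned char nonce[8],
                             const unsigned char *plaintext,
                             unsigned char *ciphertext,
                             unsigned long long numBlocks) {
      unsigned long long b =
          (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
      if (b >= numBlocks) return;

      unsigned char ctr[16], ks[16];
      for (int i = 0; i < 8; ++i)                       // nonce || counter
          ctr[i] = nonce[i];
      unsigned long long c = b;
      for (int i = 15; i >= 8; --i) { ctr[i] = (unsigned char)(c & 0xff); c >>= 8; }

      toyBlockCipher(key, ctr, ks);                     // keystream block b
      for (int i = 0; i < 16; ++i)
          ciphertext[b * 16 + i] = plaintext[b * 16 + i] ^ ks[i];
  }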



