* Graphics manipulation.
* Physics simulation.
* Video encoding/decoding.
* Anything else involving lots of matrix/vector calculations. Probably lots of scientific uses.
But for general-purpose programming, GPGPUs aren't so useful. Suppose you're writing a web server. It would be very cool if you could leverage the GPU to make it faster, but I don't think you can. Suppose there were a good way to write an HTTP parser in OpenCL; by the time you've uploaded the socket data to the GPU, you could already have parsed the header three times on the CPU.
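To put a rough number on that intuition, here's a minimal CUDA timing sketch (my own illustration, not from anything above): it just times a 1 KiB host-to-device copy, roughly the size of an HTTP header. Exact figures vary wildly by hardware and driver, but the copy alone typically costs several microseconds, which is already in the ballpark of parsing a small header a few times on the CPU.

    // Rough sketch: time a header-sized host->device copy with CUDA events.
    // The absolute number depends on the machine; the point is only that
    // the PCIe round trip is not free compared to CPU-side parsing.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t header_bytes = 1024;   // ~typical HTTP header size
        char host_buf[header_bytes] = {0};
        char *dev_buf = nullptr;
        cudaMalloc(&dev_buf, header_bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(dev_buf, host_buf, header_bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("1 KiB host->device copy: %.3f ms\n", ms);

        cudaFree(dev_buf);
        return 0;
    }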
Feel free to correct me if I'm wrong.
Next year, with Intel's Sandy Bridge and AMD's Fusion, things will change significantly. With the CPU absorbing the GPU cores, many shortcomings will be gone:
* Low latency GPU-CPU interaction via L3 cache
* Conditional execution for parallel code
* Low cost to set up a parallel thread/tasklet
* Double sized registers
* Non-destructive instructions (yay!)
* No alignment madness
Hardware is changing, and it's not viable to recompile for every possible combination of hardware. Most educated predictions point toward generated code.
The chaos and misinformation are very annoying, but it is very exciting to live in an era of such a huge paradigm change in computing. Most algorithms have to be rewritten with a very different mindset. It's like a gold rush. Good times :)
The limitation is that every thread on a core is executing the same instruction. If some of them take one branch and the rest take another, both branches have to be executed one after the other while masking out the threads it doesn't apply to. That reduces the performance you get, but it's certainly possible.
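To make the masking concrete, here's a tiny illustrative CUDA kernel (the names are mine, not from any particular codebase). Threads in a warp that take the `if` side and threads that take the `else` side execute one path after the other, each group sitting idle while the other runs:

    // Illustrative only: a data-dependent branch inside a kernel.
    // Threads in the same warp that disagree on the condition execute
    // both paths one after the other, with non-matching threads masked off.
    __global__ void clamp_or_scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (x[i] > 1.0f) {
            x[i] = 1.0f;          // taken by some threads of the warp...
        } else {
            x[i] = x[i] * 0.5f;   // ...the rest wait, then run this path
        }
    }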
It's also worth pointing out that GPUs these days have a number of independent cores, each of which can execute different instructions simultaneously.
Hope that clears it up for you.
> The limitation is that every thread on a core is
> executing the same instruction. If some of them take
> one branch and the rest take another, both branches
> have to be executed one after the other while masking
> out the threads it doesn't apply to.
> That reduces the performance you get
In many non-trivial situations you need to check for loop limits being reached or do plain data-structure bounds checks, and every such check is a branch that can diverge.
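A sketch of the kind of thing I mean (hypothetical CUDA, names are mine): when the loop bound comes from the data, threads in a warp with short inputs sit masked off until the longest-running thread in that warp finishes.

    // Hypothetical example: per-thread loop bounds read from the data.
    // Each thread loops to its own n, so a warp keeps issuing iterations
    // (with finished threads masked off) until its slowest member is done.
    __global__ void scale_varlen(const int *len, float *rows, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int n = len[tid];                 // data-dependent bound

        for (int i = 0; i < n; ++i) {     // divergent trip count within a warp
            rows[tid * stride + i] *= 2.0f;
        }
    }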
Don't get me wrong, I think GPGPU is amazing; what pisses me off is the overblown marketing and all the noise from people repeating it like canon when they have clearly never implemented a single non-trivial program in OpenMP/CUDA/OpenCL/SIMD.
> Hope that clears it up for you.
This doesn't sound correct to me, but perhaps I'm missing something? I can't see how the branch predictor is relevant here. Can you explain in more detail?
"That's an understatement. It complicates the implementation significantly. Parallel-oriented (?) programming is quite hard by itself without all this."
How does this complicate it? You don't have to implement it yourself. If you're coding in CUDA, OpenCL or any of the common shading languages, then you write if-statements just as you would in C and it does the right thing. Honestly, the only concern is the performance degradation you get.
"That phrasing can carry an implied personal attack"
There was none intended. I was hoping that would make it sound helpful but I guess it didn't work. Text can be a tricky medium. Sorry.
Intel's GPUs kinda suck, so I'm not sure how much performance you're going to get out of them even with the so-called "low-latency" interconnect. On the AMD side, they have better GPUs, but unfortunately Ontario (and I believe Llano as well) don't even share northbridges, let alone the L3 cache.
>> * Low latency GPU-CPU interaction via L3 cache
> Intel's GPUs kinda suck, so I'm not sure how much performance
> you're going to get out of them even with the so-called
> "low-latency" interconnect
> On the AMD side, they have better GPUs, but unfortunately
> Ontario (and I believe Llano as well) don't even share
> northbridges, let alone the L3 cache.
I would question AMD's gamble on OpenCL and its lack of tools compared to Intel, which acquired Cilk Arts (the MIT spin-off behind Cilk) and absorbed other projects. Also, I haven't seen anything close to Intel's SIMD sort algorithm yet (in terms of technical concepts, elegance, and sheer record-breaking performance).
However, the article seems to suggest that there is a dichotomy between modern CPUs and throughput-oriented architectures. This is not true. For instance, Sun's Niagara eschews out-of-order execution and branch speculation to gain throughput by executing 8 threads concurrently. This type of microarchitecture is still more flexible than your contemporary GPU, but gives you considerably higher throughput for I/O-dominated applications.
In fact, I'll go so far as to say that although future architectures will be hybrids of low-latency and high-throughput designs, the high-throughput parts will resemble CPUs more than current GPUs.
CPUs are about task parallelism, while GPUs are about data parallelism. On a CPU there is usually a small number of threads and the programmer controls them explicitly. On a GPU, threads and switching between them are almost free, so you can use as many as you want and even make the number data-dependent (see the sketch at the end of this comment).
So GPUs will excel in applications that involve, e.g.:
-float computations (10x+ speedup)
-graphics-specific functions (e.g. texture interpolation, 40x+)
-memory-bound work with contiguous access (2x+)
On the other hand, they are behind the CPU in:
-small, irregular computations
x86 currently spends quite a lot of silicon on features like caches, branch prediction, and superscalar execution, which speed up existing applications. GPUs don't have most of those features; they spend that silicon on compute logic instead.
To sum up, a good candidate for GPGPU would be image recognition. A bad one would be a compiler.
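As a minimal illustration of the "threads are almost free" point above (my own host-side CUDA sketch, nothing authoritative): the grid is simply sized from the data, one lightweight thread per element.

    #include <cuda_runtime.h>

    // One thread per element; the thread count is derived from the data size.
    __global__ void square_all(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = v[i] * v[i];
    }

    // Host-side launch: spawning millions of threads is cheap on the GPU,
    // so the grid is sized to cover however many elements we happen to have.
    void square_on_gpu(float *dev_v, int n)
    {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        square_all<<<blocks, threads>>>(dev_v, n);
        cudaDeviceSynchronize();
    }

On a CPU you would never spawn a million OS threads like this; on a GPU it's the normal way to express the computation.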
MySQL is heavily disk-bound. It is not CPU-bound except for things like sorting multi-gigabyte datasets, which people rarely use MySQL for. For most purposes you're better off installing more RAM than offloading computations to the GPU; otherwise MySQL has to spend all its time waiting for the hard disk.
I was surprised to discover that GPUs can accelerate SQLite http://pbbakkum.com/db/ and routing http://shader.kaist.edu/packetshader/ (although you're probably better off using a network processor (NP) for routing).
Also, there are block cipher modes that are seekable and thus could be parallelized if you had a big enough backlog, CTR mode in particular, but parallelizing individual streams is not likely to reap big enough rewards to justify the complexity.
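To illustrate why CTR is seekable (a sketch only: aes128_encrypt_block is a placeholder for some device-side block cipher, not a real CUDA or library API), keystream block i depends only on the key, the nonce, and i, so each thread can produce its block with no state shared between blocks.

    // Sketch only: CTR-mode keystream generation, one 16-byte block per thread.
    // aes128_encrypt_block is assumed to be implemented elsewhere; the point is
    // that block i's keystream depends solely on (key, nonce, i), so blocks can
    // be computed independently and in any order.
    __device__ void aes128_encrypt_block(const unsigned char *key,
                                         const unsigned char *in,
                                         unsigned char *out);  // assumed

    __global__ void ctr_keystream(const unsigned char *key,
                                  unsigned long long nonce,
                                  unsigned char *keystream,
                                  unsigned long long nblocks)
    {
        unsigned long long i =
            (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nblocks) return;

        unsigned char ctr[16];
        // Big-endian pack: nonce in the high 8 bytes, block index in the low 8.
        for (int b = 0; b < 8; ++b) {
            ctr[b]     = (unsigned char)(nonce >> (56 - 8 * b));
            ctr[8 + b] = (unsigned char)(i     >> (56 - 8 * b));
        }
        aes128_encrypt_block(key, ctr, keystream + 16 * i);
    }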