

Swimming in OpenCL - liscio
http://www.supermegaultragroovy.com/blog/2009/11/12/swimming-in-opencl/

======
manvsmachine
Always great to hear about parallelization becoming more widely adopted. The
only thing that I don't get is the part about the GPU - it takes a _lot_ of
computation to tie up an 8800GT for a full minute. Also, it shouldn't take the
1-3 seconds he described to send a 45 sec .wav file over a PCI Express 2.0 x16
bus (~3-4 GB/s bandwidth IIRC).

I'm not sure what's causing those runtimes, but the fact that it spread over 8
cores that well suggests that it almost qualifies as embarrassingly parallel,
which a GPU really should be great for. This makes me really wonder about the
maturity of Apple's / nVidia's OpenCL implementation.

EDIT: I just ran a few of the OpenCL SDK demos and can confirm that it is 1-2
orders of magnitude slower than the same demo running in CUDA. The bandwidth
for copying memory to / from the device should still be high, though.

My OpenCL Bandwidth Test results:
~/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release$ ./oclBandwidthTest

./oclBandwidthTest Starting...

Running on...

Device GeForce 8400M GT

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1600.9

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1235.1

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6069.7

TEST PASSED

Press <Enter> to Quit...
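To put that host-to-device number in perspective, here's a back-of-the-envelope check (assuming CD-quality audio: 44.1 kHz, 16-bit, stereo) showing that copying a 45 sec .wav at ~1600 MB/s takes milliseconds, not seconds:

```python
# Rough estimate: size of a 45 s stereo WAV vs. measured PCIe bandwidth.
# Assumes 44.1 kHz / 16-bit / 2 channels (CD-quality PCM).

seconds = 45
sample_rate = 44_100      # Hz
channels = 2
bytes_per_sample = 2      # 16-bit PCM

wav_bytes = seconds * sample_rate * channels * bytes_per_sample

# Host-to-device figure from the oclBandwidthTest output above,
# taking MB/s as 1e6 bytes/s.
bandwidth_bytes_per_s = 1600.9e6

transfer_s = wav_bytes / bandwidth_bytes_per_s
print(f"{wav_bytes / 1e6:.1f} MB -> {transfer_s * 1e3:.2f} ms")
# -> 7.9 MB -> 4.96 ms
```

So even on this slower mobile GPU, the copy itself is on the order of 5 ms; seconds of overhead have to be coming from somewhere else.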

~~~
liscio
It's probably the algorithm in question that's to blame, in conjunction with
the slow OpenCL implementation for the 8800GT, as you found.

On my machine (I'm the article's author), even Apple's GPU-tuned version of
Galaxies runs much faster on the Mac Pro's CPUs than the GPU. So, something's
up. I think only the GTX285 for the Mac Pro beats out the CPUs on that test,
but I could be wrong...

The 1-2 seconds of overhead could also come from compiling the OpenCL program
for the GPU, since I compile the .cl kernel on every run of the program.
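One common way to avoid paying that build cost on every run is to cache the compiled binary on disk, keyed by the kernel source and target device. A rough sketch of the pattern in Python follows; `compile_kernel` here is a hypothetical stand-in for the real OpenCL build step (clCreateProgramWithSource / clBuildProgram, then fetching CL_PROGRAM_BINARIES), not an actual API:

```python
import hashlib
import os

def compile_kernel(source: str, device: str) -> bytes:
    """Hypothetical stand-in for the expensive OpenCL build step."""
    return f"binary({device}):{source}".encode()

def load_program(source: str, device: str, cache_dir: str) -> bytes:
    """Compile once per (source, device) pair; reuse the cached binary after."""
    key = hashlib.sha256((device + "\0" + source).encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".clbin")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()  # cache hit: skip the build entirely
    binary = compile_kernel(source, device)  # slow path, first run only
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

The first launch still pays the compile, but subsequent runs load the binary straight from disk, which would amortize the 1-2 second hit across installs rather than runs.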

Furthermore, I wasn't very scientific about the GPU case, because I wasn't
planning to ship a GPU-tuned algorithm. To actually pull this off for a
consumer app is easier said than done.

For instance, I'd prefer not to ship the .cl kernel source in the application,
and would rather provide binary-compiled kernels. Doing this for more than one
flavor of GPU is nontrivial, from what I gather, as I'd have to actually own
the GPUs in question to compile for the different targets (I could only cover
the GeForce 9400M and 8800GT from my own collection of hardware).

That said, I still want to stay open to the idea in the future as I play
around with the algorithm and understand it further.

Thanks for the nudge, though. I really should dig deeper.

