

CUDA and OpenCL run time comparison using Vector Addition - astroguy
http://researchdaily.blogspot.com/2009/12/vector-addition-on-cuda-and-opencl.html

======
wtallis
What exactly were you trying to accomplish with this post? You haven't
particularly enlightened anybody about the relative performance
characteristics of OpenCL and CUDA:

The graph shown illustrates nearly identical scaling behavior between OpenCL
and CUDA, with OpenCL taking consistently about half a second longer, except
at the last data point where it takes a full second longer. In other words,
all we're really seeing here is initialization overhead, and without seeing
the source code we can't know if this is at all a fair test. It may be that
the CUDA runtime is performing more compilation work before the clock starts,
or that it is caching the compiled kernels.

Based solely on the last data point, it looks like there might be a widening
performance gap for larger jobs, but you don't include any larger test cases.
The longest run times listed are 7-second jobs, so clearly it wouldn't have
been that time-consuming to run some larger test cases. You could also easily
use a logarithmic scale on the time axis. (And please fix the misleading
horizontal axis!)

Additionally, you state that your methodology was to run each job 5 times and
take the lowest time. As I mentioned above, there could be caching in the
runtime or operating system that makes this a poor real-world test. (OpenCL
and AFAIK CUDA allow the programmer to keep a compiled form of a kernel in
memory, so if a program has to run a kernel many times, you can ensure that
you aren't needlessly re-compiling your kernels. CUDA may be doing this
automatically.) It would be better to report the average times, or even to
just show all of them on the graph and have the lines go through the average
times.

And to top it off, the test was simply of doing lots of vector addition. A
perfect case of a microbenchmark that doesn't provide useful predictions of
real-world performance. Addition is so simple that this is at best a test of
memory bandwidth, not computational power. It would have been more useful to
test something like an FFT if the intent was to measure efficiency at
computation.

Also, any good benchmark includes hardware and software specs. Were these
tests run on a CPU or a GPU? What driver versions were involved, and in
particular, was either implementation a beta release?

The subjective comments seem like they could be worth discussion, but only
with some elaboration first. One can presume that any real-world performance
differences between CUDA and OpenCL on the same hardware would be due to the
relative immaturity of the OpenCL implementation, or a deliberate choice by
Nvidia to make OpenCL look bad. However, given that neither system is
particularly entrenched, a debate about the design aspects could be very
useful.

It sounds like OpenCL has a better workflow (with the exception of debugging),
but that the actual code is uglier. I'd be very interested to see comparisons
of the kernel code in each language, and a separate comparison of the set-up
code needed to run those kernels. (Although I'm a fan of Andreas Klöckner's
Python bindings, which greatly simplify the set-up code.) My impression of
OpenCL's kernel language has been that it seems well-designed, so your
comment that the kernel code is ugly surprises me. Have I misunderstood CUDA
as a low-level framework offering essentially the same kind of access to the
hardware? I can't tell yet from what I've read from various sources whether
CUDA has an appreciably better design, or whether it is simply benefiting from
the biases of people who learned CUDA first and have been using it longer.

~~~
jakozaur
In CUDA, kernel-to-PTX compilation is performed when the project is built. In
OpenCL it is performed at runtime. That is the "initialization overhead" which
causes the results to differ. You could use a cache to avoid it.

So the performance difference is caused by OpenCL's design: in OpenCL it is
unknown which hardware the kernel will run on, so compilation is postponed to
runtime. I can hardly see any reason why NVIDIA would want to make OpenCL
slower.

There are two major differences between CUDA and OpenCL:

- OpenCL is an industry standard, while CUDA is NVIDIA's platform.

- OpenCL is a regular C library; CUDA, in addition to that, is an extension
  to the C language.

CUDA kernel code will be much nicer: it takes fewer lines and looks much
better. The cost is that it requires a special compiler (nvcc).

Example CUDA code (run a kernel):

  myCudaKernel<<< grid, block, sharedMemorySize >>>(... arguments);

Equivalent in OpenCL:

  clSetKernelArg(...);          // for each kernel argument!
  ...
  clEnqueueNDRangeKernel(...);  // run the kernel

~~~
andrewcooke
the last part above is not clear.

the code referred to as "kernel code" is actually code that runs on the host
to invoke the kernel. opencl uses a library api, while cuda takes a dsl
approach, so opencl is more verbose.

as far as i know, the actual kernel code (which describes what happens on the
gpu) is pretty similar.
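
for what it's worth, a side-by-side of the two device-side kernels for this kind of benchmark (an illustrative sketch with made-up names, not code from the post) shows how close they are:

```
/* CUDA: compiled by nvcc at build time */
__global__ void vecadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

/* OpenCL: compiled by the driver at runtime */
__kernel void vecadd(__global const float *a, __global const float *b,
                     __global float *c, int n) {
    int i = get_global_id(0);
    if (i < n) c[i] = a[i] + b[i];
}
```

only the thread-index lookup and the address-space qualifiers really differ; the verbosity gap is almost entirely on the host side.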

~~~
jakozaur
Agree, you are right.

s/kernel code/kernel related code/g.

------
astroguy
For source code please check this <http://pastebin.com/m73aba293>

