
Introduction to CUDA C - Garbage
http://www.infoq.com/presentations/Introduction-to-CUDA-C
======
exDM69
CUDA (and OpenCL) C is just like C, with the exception that recursion is not
allowed because there is no runtime stack to support it. Instead, there's just
a lot of registers.

However, the main difference is the number of different types of memory available, which makes GPU programming very tricky indeed. There are around six different types of memory visible to the programmer, each with its own distinct access times and restrictions. On the host side (CPU), there's regular memory and page-aligned DMA buffers. On the device (GPU), there's constant, global, local and texture memory. Most of the time it's clear which memory you should use, but occasionally it takes some thinking. In particular, deciding whether something should go in texture memory or global memory can be difficult without trying it first and benchmarking.
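
For a concrete picture, here's a minimal sketch (kernel name and sizes are made up) of where some of these memory spaces appear in CUDA code: global memory from `cudaMalloc`, constant memory filled with `cudaMemcpyToSymbol`, and a page-locked host buffer from `cudaMallocHost` for fast DMA transfers. Shared and texture memory have their own mechanisms (`__shared__` arrays and the texture API) and are omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float scale;  // constant memory: small, cached, read-only from kernels

__global__ void scale_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale * in[i];  // 'in' and 'out' live in global memory
}

int main() {
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));   // device (global) memory
    cudaMalloc(&d_out, n * sizeof(float));

    float h_scale = 0.5f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));  // fill constant memory

    float *h_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));  // page-locked host buffer for DMA
    for (int i = 0; i < n; ++i) h_buf[i] = (float)i;

    cudaMemcpy(d_in, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_buf, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[2] = %f\n", h_buf[2]);  // 2 * 0.5 = 1.0

    cudaFreeHost(h_buf);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```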

~~~
tmurray
whoa: recursion has been supported since Fermi support was released in CUDA
3.0 (there's a stack pointer and a stack frame and everything). what's not
supported until GK110 (Kepler 2) is GPU kernels launching/waiting on GPU
kernels.

generally (and especially in the case of the upcoming GK110 chip), you should
use global memory. GK110 improves this with LDG, which allows you to get some
caching benefits of texture (spatial locality) without having to jump through
the API hoops required to use textures.
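
For reference, the read-only path described here is exposed on sm_35 (GK110) either implicitly, by marking pointers `const __restrict__`, or explicitly via the `__ldg()` intrinsic. A small sketch (the SAXPY-style kernel is made up, just to show both spellings):

```cuda
// Read-only cached loads on compute capability 3.5+ (GK110).
__global__ void saxpy_ldg(int n, float a,
                          const float * __restrict__ x,  // const + __restrict__ lets nvcc emit LDG
                          float       * __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldg(&x[i]) + y[i];  // __ldg(): explicit load through the read-only cache
}
```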

(full disclosure: I run the CUDA driver team at NVIDIA)

~~~
exDM69
> whoa: recursion has been supported since Fermi support was released in CUDA
> 3.0 (there's a stack pointer and a stack frame and everything). what's not
> supported until GK110 (Kepler 2) is GPU kernels launching/waiting on GPU
> kernels.

Oh, cool! This opens doors for applying GPGPU to a whole new class of
algorithms. I clearly must update my GPU knowledge.

Global mem vs. texture mem is always the biggest choice. Sometimes it's also worth considering whether there's a potential win in caching texture/global memory fetch results in local memory. So there are still choices to be made, even though the hardware has become better and easier to program.
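
The caching idea mentioned above, staging global-memory reads in on-chip shared memory so each value is fetched from DRAM once per block rather than once per use, might look roughly like this (a made-up 3-point average, not from the thread):

```cuda
#define BLOCK 256

__global__ void avg3(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2];              // +2 for the halo cells
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    tile[t] = (i < n) ? in[i] : 0.0f;              // each thread stages one element
    if (threadIdx.x == 0)                          // left halo
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // right halo
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                               // wait until the tile is fully loaded

    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}
```

Without the tile, each input element would be read from global memory up to three times; with it, once per block (plus two halo reads).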

~~~
profquail
With Fermi and newer (i.e., Kepler) hardware, there's not as much to be gained
by using texture memory. On previous hardware, the texture cache helped to
speed up some kernels that had non-uniform memory access patterns; the
Fermi/Kepler hardware has a larger on-chip L1/L2 cache which serves the same
purpose and does so without requiring the programmer to write extra code for
working with textures.
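
On Fermi/Kepler that on-chip storage is split between L1 cache and shared memory, and the split is configurable per kernel with `cudaFuncSetCacheConfig`. A hedged sketch (the gather kernel is hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void gather(const float *src, const int *idx, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];  // non-uniform reads; benefits from a larger L1
}

// Host side: since this kernel uses no __shared__ memory, ask the runtime
// to favor L1 cache over shared memory before launching it.
void configure() {
    cudaFuncSetCacheConfig(gather, cudaFuncCachePreferL1);
}
```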

------
pavanky
Just curious: how many here use CUDA / OpenCL on a regular basis? I usually see GPU-related posts come to the top around midnight or in the early hours of the day (off-peak hours, to be precise).

Makes me think that there are enough people working on GPUs out there, just not enough to stand out from the web-dev or other related news that usually gets voted up.

~~~
draven
I don't, but one of my colleagues' jobs is to port existing data analysis algorithms to GPUs. It seems other research centers like ours also have personnel dedicated to this task. Even if I'm more of a Python guy right now, I'm still very interested in this stuff and often upvote these stories.

And I'd be interested in seeing who's using GPUs for science!

~~~
pavanky
Since you said Python and are interested in GPUs (but are not fully immersed in them yet), I want to show you something that can get you started (and also a shameless plug for the product I am the lead developer for):

<http://www.accelereyes.com/arrayfire/python/>

It's a freemium model, so it should cost you next to nothing to try (as long as you have a GPU).

~~~
draven
I stumbled upon it a while ago. The whole package looked very nice but it
seems to be closed source. Being proprietary is fine but some people here tend
to dig into libraries to see what they are doing. I don't know if you plan on
offering some kind of source access.

Also, I see that the free edition tries to connect to an outside server on a high port, and that wouldn't work here. Our network is not really friendly: I cannot even check out a git repo with the git protocol.

I can see the appeal of your product though. Some of my colleagues are trying
to use PyCUDA but have to learn C in the process.

------
gmt2027
I use CUDA extensively for porting legacy linear algebra routines in a
scientific application for modelling neutron scattering spectra from
nanostructures. Depending on the model size, the target platforms are
typically medium-sized GPU clusters or large supercomputers. Most of the newer
supercomputers have large numbers of GPUs available and it is a challenge to
use all resources efficiently.

GPU computing is promising, but given that it is fairly difficult to predict
how much of a speed-up to expect before the actual work is done, I find myself
asking whether some of the expended effort is really worth the trouble. It
seems that there are two classes of applications where this makes sense:

1) Minimising latency/increasing responsiveness for smaller algorithms such as
in a user application or service.

2) Doing large volumes of computation in a high-throughput system.

At the moment, the CUDA platform is well ahead of OpenCL in terms of maturity, features, tools and documentation. However, OpenCL runs on CPUs and on both NVIDIA and AMD GPUs, and work is being done towards targeting FPGAs [1]. Interestingly, Clang also supports compiling OpenCL kernels directly to native code.

In all likelihood these platforms will stay outside the mainstream until better abstraction layers exist to shield programmers from the low-level architectural details without sacrificing performance. Something similar to the directive-based OpenACC standard, able to perform close to hand-optimised code, would go a long way.

[1] <http://www.altera.com/b/opencl.html>

~~~
ylem
Oddly, I have also been using CUDA to port some linear algebra routines for neutron scattering (spin-wave calculations) and find it rather useful, though I think better tooling would help. Overall, the speedups that we've achieved have been worth it, and it would be cheaper to do these calculations on a GPU or two than on a cluster...

Having embarrassingly parallel problems also helps...

------
daenz
Can anyone explain how CUDA compares to OpenCL, both in terms of
flexibility/power and openness?

~~~
pavanky
1) OpenCL library support is where CUDA was two years ago (but development is picking up quickly).

\-- This means that development time will be longer, because you will be writing your own primitive functions, which are available as a myriad of libraries in CUDA.

2) OpenCL can run on CPUs, GPUs and more exotic hardware like FPGAs and Cell processors.

\-- On large clusters, OpenCL's heterogeneous capabilities make it more appealing than CUDA, which will essentially require you to write and maintain two code bases (one in CUDA, the other using pthreads or what have you).

3) The OpenCL spec is designed by committee (the Khronos Group), as opposed to CUDA (developed by NVIDIA).

\-- This means CUDA iterates much more quickly than OpenCL. NVIDIA also has control over the hardware, meaning they can introduce hardware optimizations that may not become standardized in OpenCL.

4) NVIDIA has opened up CUDA a little; implementations of OpenCL remain closed.

\-- NVIDIA has Thrust, which is kind of open source, but their CUBLAS and CUFFT libraries are closed. They recently open-sourced part of their NVCC compiler and hooked it up with LLVM. The OpenCL implementations are vendor-specific and, AFAIK, none of them are open.

\--------------

Sorry if I am not coherent. It's pretty late and I typed this while forcing myself to stay awake :)

~~~
DeepDuh
I like your summary. I'd like to point out, though, that if you want a unified
codebase, you might want to have a look at OpenACC. It's still young and you
need commercial compilers (PGI, HMPP or Cray), but the capabilities are
interesting. I'm doing some tests and comparisons between OpenACC and CUDA
right now.

~~~
pavanky
Using OpenACC would just give you some quick benefits in development time, but
I think it will not be competitive with pure CUDA in actual time taken for the
program to run.

I would be really interested in these results if you can share them with me
(contact@pavanky.com).

~~~
DeepDuh
My tests haven't finished yet, but I can say this much about the performance: HMPP, while not having a very complete implementation yet, has quite a simple code-generation concept (and thus generates low overhead), and its performance is actually very similar to CUDA. In that regard I think OpenACC will do great in reducing GPGPU software design to pure thinking about algorithms and data structures, instead of spending lots of time on the mechanics.

------
capkutay
Here's an additional resource for those who want to learn it. Stanford offers
the class and has the material online.

<http://code.google.com/p/stanford-cs193g-sp2010/>

