
An Even Easier Introduction to CUDA - jonbaer
https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/
======
jupiter90000
Does anyone familiar with the state of GPU programming think OpenCL will
eventually 'win' over CUDA? Although CUDA has more adoption, I don't like the
idea of using it and being locked into a specific vendor. Of course nVidia is
only supporting outdated versions of OpenCL for now. Am I a fool for hoping
OpenCL eventually becomes the standard?

~~~
arcanus
I think that the closed nature of CUDA will be its undoing. I think that a
standard, like C++ AMP or OpenMP 4.5, will be the ultimate winner.

I liked OpenCL, but it seems to be dying.

~~~
programmarchy
Apple seems to have abandoned OpenCL in favor of Metal, which supports your
point about it dying.

I found Metal Compute Shaders to be very nice to work with, though. Was much
easier for me to understand than OpenCL.

~~~
arcanus
I also like metal, but it is not yet performant for high performance
computing, which is more my wheelhouse. I'm also skeptical it will be popular
if it does not get picked up by the GPGPU folks, but time will tell.

------
alkonaut
After completing the basic tutorials I hit a mental wall when I want to adapt
some "real" code for the GPU. The hard part isn't porting from CPU to GPU, but
making the CPU code branch-free and GPU-friendly before the actual port.
Something that is fairly straightforward in normal CPU code, such as a tree
traversal, becomes a nightmare of sparse execution masks and inefficient lone
threads executing.

~~~
paulmd
OK so basic background here: CUDA processing usually looks like some
dimensional array of data (1D, 2D, 3D, etc). Then you have a series of "warps"
which tessellate their way through your data space, processing a chunk of
elements at a time. The warps can be organized into larger "blocks" so that the
warps in a block can share data. Many blocks make up a "grid", which is more or
less synonymous with "the processing elements of a kernel". A kernel is a GPU
program.
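
To make that concrete, here is a minimal sketch of how that hierarchy shows up in code (the kernel name and values are just illustrative; managed memory is used for brevity, as in the article being discussed):

    // Minimal sketch: one thread per element of a 1D array.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor) {
        // global index = this block's offset + this thread's position in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // the grid may overshoot n, so guard
            data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMallocManaged(&d, n * sizeof(float));   // managed memory for brevity
        for (int i = 0; i < n; ++i) d[i] = 1.0f;

        int threadsPerBlock = 256;                                 // one block = 8 warps of 32
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n (the grid)
        scale<<<blocks, threadsPerBlock>>>(d, n, 2.0f);            // launch the kernel
        cudaDeviceSynchronize();

        printf("%f\n", d[0]);   // 2.0
        cudaFree(d);
        return 0;
    }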

Blocks can't communicate with each other, since they may be on different SMX
processor engines (SIMD units). Kernels can't communicate either, according to
the spec - CUDA doesn't guarantee the order of kernel scheduling - but
communication is possible via undefined behavior with spinlocks.

Generally speaking - larger problem sizes should be better for you. GPUs suck
at small individual tasks, starting and stopping the kernels [from the CPU] is
expensive. They are good when they are doing as big a task as possible
(asymptotically to a limit). Memory size will limit how big a data set you can
work on, which will limit your total speedup. So overall, less memory usage =
better speed.

You run lots and lots of threads at any time. GPUs are designed around the
idea of massive threading and easily run dozens of threads per actual core.
This covers up the massive latency when you need to go off-chip to load from
global memory. You might run 10,000 threads in a program, and most of them
will be sleeping while waiting for their data to load. When all threads in a
warp are in the READY state, the warp is scheduled and will execute.

As you note, GPUs don't work well when the threads are doing different stuff.
For example, any threads that don't follow an "if" statement will just idle -
because all threads in a warp execute in lockstep. They are masked off and
their instructions don't affect their registers. If there are N different
paths through the code, you will run it N times.
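
A toy illustration of that masking behavior (hypothetical kernels; in a case this trivial the compiler may well predicate both versions anyway, but it shows the shape of the problem):

    // Divergent version: within a warp, the two branches run one after the
    // other, with the "wrong" threads masked off each time.
    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] > 0.0f)
            x[i] = sqrtf(x[i]);   // half the warp works here...
        else
            x[i] = 0.0f;          // ...while the other half idles, then they swap
    }

    // Branch-free version: every thread runs the same instructions and the
    // decision is just a select on the result.
    __global__ void uniform(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = x[i];
        x[i] = (v > 0.0f) ? sqrtf(v) : 0.0f;
    }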

Architecture is critical to understand because this is actually bare-metal
programming, like a microcontroller. There are very few niceties here. Memory
is not zeroed between runs (actually not even during a soft PC restart). There
is no virtual memory segmentation. Illegal accesses may not even throw, or
they may trash your OS's viewport, crash the drivers, etc. And if you don't
code around the architecture's limitations, your performance will suck balls.

\-------

In terms of general advice: a lot of times, scanning your data to pre-process
and select "active" areas of the problem is a viable strategy. Streaming data
sequentially across a warp is a pretty efficient operation thanks to warp
coalescing, you have mega amounts of bandwidth, etc.

Think real heavily about your data layout. Structure of arrays is often really
good because it gives you an efficient stride of 1 as much as is possible when
reading/writing. That maximizes your efficiency when coalescing warps. If you
are having every thread fire off its own request with no coalescing - your
IOPS will trash the memory controller's performance.
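
A sketch of the difference, with made-up particle fields and one thread per element:

    // Array-of-structures: thread i reads p[i].x and p[i].vx, so consecutive
    // threads touch addresses 24 bytes apart and the loads coalesce poorly.
    struct ParticleAoS { float x, y, z, vx, vy, vz; };

    __global__ void step_aos(ParticleAoS *p, int n, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x += p[i].vx * dt;
    }

    // Structure-of-arrays: consecutive threads read consecutive floats
    // (stride 1), so each warp's loads collapse into a few wide transactions.
    struct ParticlesSoA { float *x, *y, *z, *vx, *vy, *vz; };

    __global__ void step_soa(ParticlesSoA p, int n, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] += p.vx[i] * dt;
    }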

As an extremely broad stroke, the best general-purpose approach to GPU
programming is to convert your task into a sorting or searching task. GPUs are
really, really good at sorting, and there are many good algorithms out there, so
you don't have to handle the low-level stuff until you get up to a big problem
size (i.e. you are maxing out GPU memory). Pay very close attention to the
Thrust "histogram.cu" example because it demonstrates these techniques.

So, one good approach is to find your active elements first. You can sort the
active elements to the front of the array. Or, you can use something like a
prefix scan/sum or a thrust::copy_if to pull out indexes of "active" elements
efficiently, and then scatter your operations across the indexes. If your
indexes are sequential, then you will get the maximum amount of warp
coalescing that is possible. That may not be much if your "active" elements
are very sparse and widely distributed, but at least you're trying, and you're
ensuring that all your elements are active as much as possible.
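
A sketch of the copy_if approach, assuming a 0/1 "active" flag per element (the setup here is hypothetical):

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <thrust/iterator/counting_iterator.h>

    struct IsActive {
        __host__ __device__ bool operator()(int flag) const { return flag != 0; }
    };

    // Compact the indexes of "active" elements into one dense array, so the
    // real work can then be scattered across those indexes.
    thrust::device_vector<int> active_indexes(const thrust::device_vector<int> &flags) {
        thrust::device_vector<int> idx(flags.size());
        thrust::counting_iterator<int> first(0);
        // stencil form of copy_if: copy index i when flags[i] passes the predicate
        auto end = thrust::copy_if(first, first + (int)flags.size(),
                                   flags.begin(), idx.begin(), IsActive());
        idx.resize(end - idx.begin());
        return idx;
    }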

Obviously, wherever possible you want to avoid redundant operations, just like
on CPUs. Structure your data to avoid redundant sorting, consider whether you
want in-place or stable sorts, etc. But overall, sorting is very efficient on
GPUs: you avoid thread divergence, you align memory access, etc.

Another approach is "dynamic parallelism". So you scan your data, figure out
where "hot spots" are that have a lot of data that needs processing, and you
launch more compute resources there (your kernel can launch additional kernel
instances where needed). Also, in some situations you may be able to do the
above approach of picking out indexes that need processing and doing them all
at once - but you do it into registers or shared RAM. That way you are still
keeping your cores processing instead of idling, but you avoid the round-trip
to global RAM. The downside is you increase pressure on your registers/SRAM,
which are very very limited resources.
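
A dynamic-parallelism sketch, with made-up kernel names and thresholds (this needs relocatable device code, i.e. something like nvcc -rdc=true ... -lcudadevrt, and a GPU that supports device-side launches):

    // Child kernel: the real per-element work (stand-in body).
    __global__ void refine(float *data, int offset, int count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count) data[offset + i] *= 2.0f;
    }

    // Parent kernel: one thread per tile; hot tiles get a child grid sized to
    // their workload, launched from the device.
    __global__ void scan_tiles(float *data, const int *work_per_tile,
                               int tile_size, int num_tiles) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= num_tiles) return;

        int work = work_per_tile[t];
        if (work > 1024) {
            int threads = 256;
            int blocks = (work + threads - 1) / threads;
            refine<<<blocks, threads>>>(data, t * tile_size, work);   // device-side launch
        }
        // cool tiles would be handled inline here (omitted)
    }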

If a thread can't find an element to process in a particular place - there's
actually no problem with having some of your threads continue on to the next
area that the warp was going to process. Assuming a random distribution - _on
average_ most of your elements will be in approximately the same area, so you
still get some coalescing, and there is really no reason to have the rest of
the threads halt/diverge and wait for the active elements.

Another cute dynamic parallelism trick - most of the overhead from
starting/stopping kernels is the overhead of syncing the device up to the CPU.
Put your main loop in a kernel by itself, and have the kernel launch more
processing kernels. Overhead gone, now the GPU is running 100% on its own.
However - if you really do need to talk to the CPU, then you will have to
spinlock and poll, which is undefined behavior. Again, possible but iffy.

I really fucking hate CURAND. It's absolute garbage to use, it eats tons of
global memory, it eats tons of SRAM, it is very not good. Instead, I really
like Random123. Essentially instead of a "stateful" generator like Mersenne
Twister, it's based on encryption algorithms. If you accept the concept that
the output of an encryption algorithm is uncorrelated to a changing input,
then essentially the encryption key becomes your "seed", and encrypting the
value 0 becomes the first output from the RNG, 1 becomes the second, etc.

The advantage of doing this is that you don't waste your precious memory
bandwidth and SRAM on CURAND; instead you spend compute cycles. Paradoxically,
GPUs have absolutely insane bandwidth, but bandwidth is the second most
precious resource. The only thing more important is SRAM, because you get
something like 100 bytes per core (note: not _per thread_, _per core_, for all
threads) for all your registers, cache, and shared variables. Compute cycles
are cheaper than dirt. If you can possibly compute something from data you
already have loaded, that will usually be more efficient than loading it from
global memory.

Use some property of your data (say, an index, or a uid) as your key value for
Random123 and you get essentially infinite RNGs for free. If you need
different results across different runs (stochastic simulations), just add the
actual seed to the uid-key-value. By storing a single counter (the max counter
value any single element has taken) you can maintain the individual states for
every single generator in your set. Not only that, but you can seek to
arbitrary places in your RNG sequence. Let's say you generate some property of
your data randomly. You don't actually need to store that value for each
element - you can just store the counter value you used to generate it; since
you already have the index of the data element you're working on, you can
re-generate it in place wherever you need it. It's _free money_. Wait no, free
global memory, which means you can scale your program up, which means it runs
faster. So basically free money. Even better, you can force it to be cached in
every SRAM bank using the __constant__ keyword.
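
A sketch of how that looks with Random123's Philox generator (the uid/seed mixing and all names here are illustrative, not the one true way to do it):

    #include <Random123/philox.h>   // header-only, usable in device code

    // One uniform draw for element `uid`: the key says which stream, the
    // counter says which draw in that stream. No RNG state in memory at all.
    __device__ float uniform01(unsigned uid, unsigned seed, unsigned draw) {
        typedef r123::Philox4x32 RNG;
        RNG rng;
        RNG::key_type k = {{uid + seed, 0u}};    // per-element stream, shifted by the run's seed
        RNG::ctr_type c = {{draw, 0u, 0u, 0u}};  // position within the stream
        RNG::ctr_type r = rng(c, k);
        return (r[0] >> 8) * (1.0f / 16777216.0f);   // top 24 bits -> [0, 1)
    }

    __global__ void roll(float *out, int n, unsigned seed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = uniform01(i, seed, 0u);   // reproducible, so it can be re-generated on demand
    }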

I have a really idiosyncratic style for CUDA. I typically start with Thrust
(basically the C++ STL for CUDA), writing high-level functional operations.
Then I figure out where I can squish operations together, and move those into
functors (you pass them the index of the elements they're working on plus the
array head pointers, and they do their operations on memory). Functors are
nice because Thrust will auto-balance the grid for you for good occupancy. You
can then start porting stuff into raw __device__ functions, and finally
translate it into a __global__ function that allows warp- and grid-level
collective operations.
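
For example, the "functor over indexes plus raw head pointers" step might look like this (member names are made up):

    #include <thrust/device_vector.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    // A functor that gets the element index plus raw head pointers and does
    // the (fused) work on memory. Thrust picks the launch configuration.
    struct Step {
        float *pos;
        float *vel;
        float dt;
        __host__ __device__ void operator()(int i) const {
            pos[i] += vel[i] * dt;
        }
    };

    void step_all(thrust::device_vector<float> &pos,
                  thrust::device_vector<float> &vel, float dt) {
        Step f = {thrust::raw_pointer_cast(pos.data()),
                  thrust::raw_pointer_cast(vel.data()), dt};
        thrust::counting_iterator<int> first(0);
        thrust::for_each(first, first + (int)pos.size(), f);
    }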

Once you've got the high-level stuff done, you need to tune the low-level
kernel behavior. As much as possible, avoid global atomic operations, since
they kill your performance (you bypass cache and operate directly on global
memory, incurring latency with every call, and CAS updates tend to cause
contention/spinning). Pre-process in your shared RAM as much as possible. CUB
(CUDA UnBound) provides warp-level and block-level collective operations that
are useful - for example, a prefix sum can give you the output targets for
each thread in a warp that has a variable amount of data (0, 1, many) to
output, which replaces a whole bunch of atomic operations, etc.
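
A sketch of that prefix-sum trick using CUB's BlockScan, here as a block-local compaction of positive values (stitching the per-block results into one global array is left out):

    #include <cub/cub.cuh>

    // Block-local compaction: each thread flags whether it has something to
    // emit; an exclusive prefix sum over the flags gives every thread its
    // write slot, so nobody needs an atomic counter.
    template <int BLOCK_THREADS>
    __global__ void compact_positive(const float *in, float *out,
                                     int *block_counts, int n) {
        typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
        __shared__ typename BlockScan::TempStorage temp;

        int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;
        int flag = (i < n && v > 0.0f) ? 1 : 0;

        int offset = 0, kept = 0;
        BlockScan(temp).ExclusiveSum(flag, offset, kept);   // offset = my slot, kept = total in block

        if (flag)
            out[blockIdx.x * BLOCK_THREADS + offset] = v;   // packed within this block's window
        if (threadIdx.x == 0)
            block_counts[blockIdx.x] = kept;                // a later pass stitches blocks together
    }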

However, again a caveat: writing these collective operations can often involve
"sync points", like thread fences. These warp/block/global sync points are
really expensive in terms of processing, since you will have a bunch of cores
idling while they wait for the stragglers. In some cases it's again possible
to avoid an explicit sync operation by clever exploitation of the CUDA
scheduler (as above, with inter-grid communication: it's not really that
smart). But this is obviously very much undefined behavior too.

Texture cache can sometimes also be helpful. Basically it lets you align data
in multiple dimensions rather than just one - so you can have a 3D kernel
reading values, and from the GPU perspective it looks like they're all
aligned, even though you're reading chunks that are hugely off in flat memory
space. But there are some caveats: IIRC you really need to set it up before you
run a kernel (can't do it on the fly), and IIRC it's read-only.

Also, you can cleverly abuse the texture interpolation for free math
sometimes. That's typically the best gains you'll get out of texture memory,
but it comes at the cost of lots of extra latency.

In newer revisions of CUDA you can transparently page stuff from host memory,
and it will kinda try to keep the two memory spaces synced up. This is a
really bad idea; think real carefully before using that feature (basically:
never). Your 300 GB/s memory system is suddenly limited to 16 GB/s over PCIe,
and memory bandwidth is precious. Explicitly manage your device memory,
explicitly say when you want stuff copied and synced, and don't let the
autopilot handle it.
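
In other words, something like this, where the PCIe copies happen exactly twice and everything in between runs at device bandwidth (a sketch, not a complete program):

    #include <cuda_runtime.h>

    void run(const float *host_in, float *host_out, size_t n) {
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, host_in, n * sizeof(float), cudaMemcpyHostToDevice);   // one explicit upload

        // ... launch as many kernels against d as you like; no paging traffic ...

        cudaMemcpy(host_out, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // one explicit download
        cudaFree(d);
    }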

\-------

As for your specific problem of tree searching: this is really bad for GPUs.
As you noticed, naive tree algorithms are pretty much the worst case; they
lead to lots of divergence, which GPUs suck at. As much as possible, you want
to convert things into internal "while" loops that can keep moving across your
dataset if they don't find something in a specific place. Don't recurse, loop.
But generally, the structures which work well for CPUs don't necessarily work
well for GPUs, especially if you insist on doing one operation at a time.
Searching for one element in a tree sucks. Doing range queries or searching a
couple hundred values at once is going to be a lot better.
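
As a sketch of the "loop, don't recurse, and do lots of queries at once" idea, here is one thread per query doing an iterative binary search over a sorted array (which is just an implicit balanced tree):

    // One thread per query; each does an iterative (loop, not recursion)
    // binary search over a sorted array.
    __global__ void batch_search(const int *sorted, int n,
                                 const int *queries, int *found_at, int num_queries) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= num_queries) return;

        int key = queries[q];
        int lo = 0, hi = n;              // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] < key) lo = mid + 1;
            else                   hi = mid;
        }
        found_at[q] = (lo < n && sorted[lo] == key) ? lo : -1;
    }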

I have always been fascinated with the idea of probabilistic data structures
and GPUs. Maybe you don't know for sure where an element is stored, but with
2000 cores you can probably find it even if there's a few dozen places it
might be. That avoids some of the traditional problems of lock contention/etc
on traditional data structures. And when you need to rebalance - GPUs are good
at that sort of thing, since it's more or less sorting.

Also, I feel like GPUs could be an interesting model for Erlang. Lots of
threads idling with low overhead? That's Erlang. Efficient message passing
would be a trick though, and the use-cases would be diametrically opposite.
You would have high latency and efficient numerical processing.

I also think I should be able to implement EpiSimdemics with a similar model
to this one, but that model isn't open source and Keith Bissett, the guy at
Virginia Tech who runs that program, refused to return my calls when I asked
for disease model parameters to validate against. Ah, academia.

\-------

Ton of words here, and it's been years since I touched any of this stuff
(couldn't find a job in my specialty and ended up programming Java - ugh) but
you've inspired me to actually finally put the code for my grad thesis on
github. It might be a worthwhile example of a real-world problem for you. Be
gentle, it's my first time. I haven't touched it in years and there are a few
minor things I know I screwed up (noted in the readme.md).

Repo:
[https://github.com/holvs/PandemicThrust](https://github.com/holvs/PandemicThrust)

Thesis:
[http://scholarworks.wmich.edu/masters_theses/525/](http://scholarworks.wmich.edu/masters_theses/525/)

IEEE conference paper (not very good IMO): [http://ieeexplore.ieee.org.sci-
hub.ac/document/7041000/](http://ieeexplore.ieee.org.sci-
hub.ac/document/7041000/)

\-------

Please see also:

Quick-start docs for the Thrust library, the actual easiest introduction to
CUDA that you will ever find, literally 10 lines for a hello-world program:
[https://thrust.github.io/](https://thrust.github.io/)

Thrust example programs (again, see "histogram.cu"):
[https://github.com/thrust/thrust/tree/master/examples](https://github.com/thrust/thrust/tree/master/examples)

~~~
paulmd
If I format this up nicely as a blog post, I'd like to draw some spatial
diagrams. I'm a compsci programmer, not a math prof.

I need to draw 2D and 3D spaces, like a 3x3x3 cube, or an arbitrary sized
space, with selectable highlighting for each unit-cube in the space.

Can someone please help me with an appropriate tool here? I'm sure there's got
to be some Python module out there or something. I don't even know what term
to look for there.

~~~
programmarchy
Have you tried Blender? It's a 3D modeling tool with a python interface. Might
work nicely for what you want to do.

~~~
paulmd
It's not as simple as I'd prefer for 2D, but that's exactly what I want for 3D.
Thank you.

------
gigatexal
It's really a shame that OpenCL doesn't have the market share that CUDA does
(or rather, kudos to NVIDIA's marketing and foresight in investing so heavily
in the tooling around its hardware...), because the raw compute performance of
AMD hardware is superior to NVIDIA's and often cheaper.

~~~
markdog12
Isn't another big reason that OpenCL is harder to program in?

~~~
slizard
Harder in what sense? There is nothing (or very little) that makes OpenCL
significantly harder by nature!

OpenCL developer tools and libraries are, however, at a disadvantage compared
to NVIDIA's CUDA stack. That's partly due to AMD's rather poor tools (I still
hope that their OSS initiative might change that). Intel's half-assed attitude
towards OpenCL support didn't help either. Most importantly, NVIDIA's
intentional crippling of OpenCL on their hardware (piss-poor dev tools, only
v1.2 support, no extensions that would allow making use of their hardware's
features, etc.) has surely contributed to successfully holding back the
adoption of the OpenCL standard.

I hope the community wakes up sooner rather than later.

------
natch
This is great. But it seems like many almost every great tutorial has a step
zero that is left out. In this case, for me at least, what is missing is:
What's a good guide to choosing or building a CUDA system? Preferably a Linux
non-laptop. Mostly for playing around with something that offers a bit more
power than my day to day (very non-CUDA capable) laptop. Anyone have
suggestions?

I think there might be an EC2 solution, but I'm more interested in buying or
building my own hardware, as crazy as that might be, just to have a relatively
fixed cost (other than electricity) and to skip the overhead of any EC2
learning curve there might be.

~~~
ktta
If you are really interested, and willing to spend the time, you can get
utterly fabulous perf/$.

You can google the specifics, but you can build a powerful and stable system
for about 300-400 dollars (a WHOLE system, including a CUDA-compatible GPU,
not just the GPU).

ALL of the following parts can be purchased from ebay. (The minimums are taken
from actual listings I noted down while writing this post. There might be some
errors - you have been warned, so don't blindly hit purchase if you're not
sure. If anyone has the patience, please correct me.)

{{Stuff}} marks alternatives.

CPU: Xeon - $12 - $50

Motherboard: $35-$60

RAM (24 gigs): $35-$50

Power supply: $40-$60 (don't skimp on this. Buy namebrand. Trust me on this
one.)

Case: $34-$80 (funny how this might cost more than any of the other parts
listed so far. Protip: I've made builds without a case, so this is optional if
you want to save ~$50 and buy slightly better parts.)

GPU: GTX 1050: $110-$120 (brand new!)

{{GTX 970: $160-$180}}

HDD 320GB: $20

{{HDD 1TB: $40, SSD 128GB: $41}}

So adding up all the minimum prices, minus the case and including the GPU,
it's around $250. I don't think you can buy a good phone for that price (a
Nexus 5X goes for around $270).

The above build price minimums are pretty absolute (with links below as
proof), but I'd suggest spending around $250 for everything minus the GPU,
since some parts might bottleneck the GPU's performance if you're handling
lots of data.

A single-processor Xeon that doesn't have problems with Ubuntu Server 16.04.1
(google the processor model number to see if there are any) would be rock
solid. Don't ever listen to ANYONE saying to install Arch Linux, CentOS, etc.
The community + commercial recognition of Ubuntu LTS versions is unparalleled.
(Red Hat/CentOS beats Ubuntu in commercial support, but for regular community
support on Stack Overflow and debugging via Google, Ubuntu's for you.)

Once you get comfortable with your device, get comfortable with Ubuntu
(install xubuntu-desktop if you want to attach a physical keyboard + mouse),
then get the GPU when you think you are almost ready to handle coding for CUDA
and the Linux tools.

Links for verification here:

[http://www.ebay.com/itm/Intel-Xeon-Match-Pair-E5620-Quad-
Cor...](http://www.ebay.com/itm/Intel-Xeon-Match-Pair-E5620-Quad-Core-
CPU-2-40Ghz-12M-5-86-GT-s-
SLBV4-LGA1366-/222355437700?hash=item33c56ab884:g:eJMAAOSw-0xYO7Qe)

[http://www.ebay.com/itm/DELL-01012MT00-000-G-N83VF-Server-
Mo...](http://www.ebay.com/itm/DELL-01012MT00-000-G-N83VF-Server-Motherboard-
Intel-2x-LGA-1366-DDR3-B2159-/232165101037?hash=item360e1e4ded)

[http://www.ebay.com/itm/EVGA-80-PLUS-600W-ATX-12V-EPS-12V-Po...](http://www.ebay.com/itm/EVGA-80-PLUS-600W-ATX-12V-EPS-12V-Power-
Supply-Black-CERTIFIED-
REFURBISHED-/381585246566?hash=item58d840c966:g:OWkAAOSwyjBW7VVL)

[http://www.ebay.com/itm/VIVO-ATX-Mid-Tower-Computer-
Gaming-P...](http://www.ebay.com/itm/VIVO-ATX-Mid-Tower-Computer-Gaming-PC-
Case-Black-4-Fan-Mounts-
USB-3-0-Port-/381324925710?hash=item58c8bc9b0e:g:6ooAAOSwT5tWHw0A)

[http://www.ebay.com/itm/MSI-GeForce-
GTX-1050-DirectX-12-GTX-...](http://www.ebay.com/itm/MSI-GeForce-
GTX-1050-DirectX-12-GTX-1050-2G-OC-2GB-128-Bit-GDDR5-PCI-
Express-3-0-/302169753448?hash=item465ab88768:g:fmgAAOSwux5YVpkl)

[http://www.ebay.com/itm/EVGA-GeForce-
GTX-970-04G-P4-2978-KR-...](http://www.ebay.com/itm/EVGA-GeForce-
GTX-970-04G-P4-2978-KR-4GB-FTW-GAMING-w-
ACX-2-0-/332112724147?hash=item4d537600b3:g:F0EAAOSwo4pYk~hd)

PS: Here's a recent post that's a great read. I suggest checking out the
comments too:

[https://www.reddit.com/r/PleX/comments/5r1zg2/plex_server_bu...](https://www.reddit.com/r/PleX/comments/5r1zg2/plex_server_build_recommendation_350_12core_24/)

~~~
natch
Wow great intro, thanks! I've done builds before but it's been a while so this
is really helpful.

------
verandaguy
I think there's a small error in the first code sample -- where the comment
says:

    
    
        Run kernel on 1M elements on the GPU
    

... The call to `add` isn't a call to a function that'd be thrown onto the
GPU. The `add` function is very much CPU-only based on the definition in that
code sample (at least, at that point in the tutorial!)

~~~
ChristianGeek
The first sample is meant to use the CPU.

~~~
verandaguy
Yes. My point is that the comment treats `add` like it's GPU code.

------
jlebar
You may also enjoy my video about CUDA from last year's CppCon, which is in
some ways higher level, and in others much lower level:
[https://www.youtube.com/watch?v=KHa-
OSrZPGo](https://www.youtube.com/watch?v=KHa-OSrZPGo)

------
trevordev
Anyone know why the CUDA toolkit is 1.2 GB? It seems extremely large for
getting started. In comparison, Vulkan is only 130 MB.

~~~
gcp
Vulkan leverages the compiler in the graphics drivers, like OpenCL. CUDA comes
with a separate compiler that plugs into the system C/C++ compiler (this is
often a pain). CUDA also ships with dozens of premade libraries for common
compute tasks.

------
tchow
Any suggestions on a cheap cloud compute option for playing with CUDA that
won't cost me a fortune as I learn?

I have a MacBook Pro. Is it better to just buy an NVIDIA GPU and throw it in?

~~~
PetahNZ
It's like 60 cents an hour to run it on AWS. Just shut it down when you're not
using it.

------
ape4
Is there a preprocessor in the chain? Because

    
    
        add<<<1, 1>>>(N, x, y);
    

isn't regular C++. Sorry if I missed something.

~~~
haldean
CUDA C++ is technically its own language, mostly implemented via a
preprocessor: nvcc performs some translation and then passes the generated C++
to your compiler of choice. The kernel launch syntax, along with a few
implicit includes and the macros for __device__ and __global__, are (AFAIK)
the only things that really distinguish it from vanilla C++.
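
For illustration, the nvcc-specific pieces in one place (a sketch; everything outside the attributes and the triple-chevron launch is plain C++):

    __device__ float square(float v) { return v * v; }      // GPU-callable helper

    __global__ void square_all(float *x, int n) {            // a kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = square(x[i]);
    }

    // Host side is plain C++ apart from the triple-chevron launch, which nvcc
    // rewrites into runtime launch calls before handing the file to your
    // regular compiler:
    //
    //     square_all<<<(n + 255) / 256, 256>>>(d_x, n);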

------
sanjeetsuhag
I'd love to learn CUDA but even the darn 'Hello World' examples don't compile.

