
Ask HN: What are good projects to understand CUDA and GPU Programming? - skywalker212
I have taken an Introduction to GPU Programming Course in my college and I want to do a project which covers all the details that I have learned. Something in the image processing field would be nice.
======
ThePhysicist
I wrote a CUDA/GPU based 2D electromagnetic simulation a while ago. The code
is open-source and (hopefully) not too complicated:

[https://github.com/adewes/fdtd-ml](https://github.com/adewes/fdtd-ml)

Here's an example video of what the results look like (it shows how EM waves
are reflected by a parabolic mirror):

[https://www.youtube.com/watch?v=ZPSzAaxkg5c](https://www.youtube.com/watch?v=ZPSzAaxkg5c)

I used mostly the PyCUDA documentation and examples as well as the official
CUDA documentation
([https://docs.nvidia.com/cuda/](https://docs.nvidia.com/cuda/)) to learn. I
think what's most important is to first understand what blocks, grids and
threads are and how they work (see e.g. here:
[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)). With that knowledge you can start thinking about how you
can structure your problem to solve it efficiently on the GPU. For the
simulation, I have basically two 2D blocks of memory for each variable of
interest (e.g. electric and magnetic fields in X,Y direction, current density,
material properties) that I transfer to the GPU. There, I use the
discretized differential equation for the electromagnetic field to update the
field values using the values from the first buffer (and the material
properties + currents). I store the result in the second buffer. I then swap
the buffer references (without copying any memory) for the next step of the
simulation. I perform this step e.g. n times and then transfer some of the
buffers back to the main memory to e.g. plot them. That's mostly it! Of course
there are many intricacies and ways to optimize code, but getting a basic
program running inside your GPU isn't that hard actually.
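To make the blocks/grids/threads indexing and the buffer-swap loop concrete, here is a minimal sketch in plain CUDA C. It is not the repo's actual code: the neighbor-average stencil is a placeholder for the real discretized field equations, and the sizes are arbitrary.

```
#include <cuda_runtime.h>

__global__ void step(const float* in, float* out, int w, int h) {
    // blockIdx/blockDim/threadIdx together give each thread its own
    // 2D cell coordinate in the grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;

    int i = y * w + x;
    // Placeholder stencil: average the four neighbors. The real
    // simulation applies the discretized EM field equations here.
    out[i] = 0.25f * (in[i - 1] + in[i + 1] + in[i - w] + in[i + w]);
}

int main() {
    const int w = 512, h = 512, steps = 1000;
    size_t bytes = (size_t)w * h * sizeof(float);
    float *a, *b;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMemset(a, 0, bytes);  // real code uploads initial conditions

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    for (int t = 0; t < steps; ++t) {
        step<<<grid, block>>>(a, b, w, h);
        float* tmp = a; a = b; b = tmp;  // swap references, no memcpy
    }
    // cudaMemcpy a buffer back to the host here to plot it.
    cudaFree(a); cudaFree(b);
}
```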

BTW, here's some really cool research work by Microsoft on GPU-based FDTD
(finite difference time domain) simulations of wind instruments in two
dimensions:

[https://www.youtube.com/watch?v=7Kf-rlUZAaU](https://www.youtube.com/watch?v=7Kf-rlUZAaU)

~~~
peterbecich
I agree about the PyCUDA examples:
[https://wiki.tiker.net/PyCuda/Examples](https://wiki.tiker.net/PyCuda/Examples)

------
jamesg
Since you mentioned image processing in particular, I’d recommend looking into
Halide instead of (or as well as) CUDA. Few reasons:

1\. It allows for easy experimentation with the order in which work is done
(which turns out to be a major factor in performance). IMO, this is one of
the trickier parts of programming (GPU or not), so tools that accelerate
experimentation accelerate learning too.

2\. It allows you to write your algorithm once and emit code to run on OpenGL,
OpenCL, CUDA, Metal, various SIMD flavors, and a bunch more exotic targets.
CUDA effectively limits you to desktop/laptop computers, and at this point I’d
rather bet on needing a mobile version at some point than not.

3\. It eliminates a ton of boilerplate code, so you can get started quickly.

4\. It’s what the pros use. Much of Adobe’s image processing code is in Halide
now, for instance (source: pretty much any presentation extolling the virtues
of Halide). The Halide authors cite a particular algorithm, the Local
Laplacian Filter, where an intern’s one-afternoon Halide implementation beat
a hand-optimized C++ implementation that had taken months to develop. I don’t
know if the specifics of that have been exaggerated, but directionally I
believe it. It was pretty transformational in the codepath I used it for.

I feel like developing an intuition for the “shape” of algorithms that will
perform well before diving into the specifics of low-level tools like CUDA
will serve you well.

[http://halide-lang.org/](http://halide-lang.org/)

~~~
Assossa
Would Halide be a good option for writing a path-tracing engine?

------
hevi_jos
You should ramp up in difficulty levels. GPU programming is infinite and will
always take more of your time than you plan. By ramping up you get positive
feedback fast. 2D image filtering is extremely complex for a beginner.

Start drawing a graph of thousands of numbers with the CPU. Easy enough, but
harder than it looks.

Accelerate the graph with the GPU so it has smooth animation, moving,
zooming... Easy enough, but way harder than it looks.

Take a sound sample, uncompress it, and visualize it in your graph system. Easy?

Take the sample and filter it with CUDA; filtering a 1D sample is way easier
than filtering 2D images (see the sketch below).

Play the filtered sound to "feel it".

Then you can filter 2D images if you want.
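For that CUDA filtering step, a minimal sketch. The 5-tap moving average is just an illustrative filter, and `d_in`/`d_out` stand for device buffers you have already allocated and filled:

```
__global__ void fir5(const float* in, float* out, int n) {
    // One thread per output sample.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 || i >= n - 2) return;  // skip the edges for simplicity
    out[i] = 0.2f * (in[i - 2] + in[i - 1] + in[i] + in[i + 1] + in[i + 2]);
}

// Launch with e.g.: fir5<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```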

I recommend the graphic interface Dear Imgui:
[https://github.com/ocornut/imgui](https://github.com/ocornut/imgui)
[https://github.com/ocornut/imgui/issues/1902](https://github.com/ocornut/imgui/issues/1902)

Way faster development time than any stateful (retained-mode) interface, with
the disadvantage that it continuously redraws the screen, consuming energy.
Well worth it for rapid prototyping.

------
AnthonBerg
An image blur is a good place to start! Read horizontally from many pixels in
parallel, sum them up in parallel as much as you can, normalize, and write the
result back. Repeat for the vertical blur, and here it might be best to rotate
the image by a quarter turn so that vertical becomes horizontal, because
memory access is usually faster that way.
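A minimal sketch of the naive horizontal pass (box weights and the radius are illustrative; one thread per output pixel):

```
__global__ void blurRow(const float* in, float* out, int w, int h) {
    const int R = 2;  // illustrative blur radius
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // Read 2*R+1 horizontal neighbors, clamping at the image border.
    float sum = 0.0f;
    for (int k = -R; k <= R; ++k) {
        int gx = min(max(x + k, 0), w - 1);
        sum += in[y * w + gx];
    }
    out[y * w + x] = sum / (2 * R + 1);
}
```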

~~~
Const-me
Did it a couple of times.

One trick is to write the output pixels transposed. That way both passes are
identical, and they both read the image linearly. The two transposes cancel
each other out.

Another is to use local (shared) memory.

Finally, the right place for the kernel weights is compiled into the code as
immediate values. Everything else is slower.
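A minimal sketch combining all three tricks (not Const-me's actual code; the 1-4-6-4-1 weights and the tile size are illustrative):

```
#define RADIUS 2
#define TILE 256

__global__ void blurRowTransposed(const float* in, float* out, int w, int h) {
    __shared__ float row[TILE + 2 * RADIUS];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y;

    // Trick 2: stage this row segment (plus halo) through shared memory,
    // clamping reads at the image border.
    row[threadIdx.x] = in[y * w + min(max(x - RADIUS, 0), w - 1)];
    if (threadIdx.x < 2 * RADIUS)
        row[threadIdx.x + TILE] =
            in[y * w + min(max(x - RADIUS + TILE, 0), w - 1)];
    __syncthreads();

    if (x >= w) return;

    // Trick 3: weights are immediate values the compiler bakes into
    // the instruction stream (a binomial 5-tap kernel here).
    float v = 0.0625f * row[threadIdx.x]
            + 0.25f   * row[threadIdx.x + 1]
            + 0.375f  * row[threadIdx.x + 2]
            + 0.25f   * row[threadIdx.x + 3]
            + 0.0625f * row[threadIdx.x + 4];

    // Trick 1: write transposed, so the second pass is this same kernel
    // run again with w and h swapped.
    out[x * h + y] = v;
}

// Pass 1: blurRowTransposed<<<dim3((w + TILE - 1) / TILE, h), TILE>>>(a, b, w, h);
// Pass 2: blurRowTransposed<<<dim3((h + TILE - 1) / TILE, w), TILE>>>(b, c, h, w);
```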

------
tehsauce
Learn to write raymarchers on shadertoy.com! It's a wonderful platform for
GPU programming because you'll spend 100% of your time writing GPU code and 0%
on installation, drivers, and build systems.

~~~
achalpandeyy
Do you have any resources for the same? I learnt the basics from the Book of
Shaders and other tutorials but can't seem to find any advanced lessons.

~~~
kaoD
Íñigo Quílez's (Shadertoy's creator) articles should be a good starting point.

[http://iquilezles.org/www/index.htm](http://iquilezles.org/www/index.htm)

He's got a YouTube channel with step-by-step explanations too.

------
dahart
Try to write Floyd-Steinberg dithering on the GPU. ;)

I've learned a lot by trying to port branchy code & recursive algorithms to
the GPU, then trying to figure out why the performance is terrible and how to
fix it. Specifically, I learned a bunch trying to write custom primitive
intersection shaders for OptiX
([https://developer.nvidia.com/optix](https://developer.nvidia.com/optix)).

I like ShaderToy as well, but since it's so easy and hides the abstractions
from you, you have to really look around for the amazing tricks other people
there have used. Also write code that slows down your GPU and then improve it.
Just getting pretty things to render and/or writing shaders that start out at
60fps won't help you learn as fast.

~~~
pjc50
Yeah, an algorithm that makes every pixel dependent on the previous one is
pretty much the worst-case for GPU computation :)

------
kaoD
Conway's Game of Life.

Easily parallelizable, visually attractive, fun to code and use, has a narrow
scope but can be improved with goodies (e.g. color for cell age). If you
implement realtime interactivity, it forces you to bridge the gap between the
GPU and CPU worlds, which is a skill on its own.
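A minimal sketch of one generation (double-buffered, with wrap-around edges; one thread per cell):

```
__global__ void lifeStep(const unsigned char* in, unsigned char* out,
                         int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // Count live neighbors on a toroidal (wrapping) grid.
    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + w) % w;
            int ny = (y + dy + h) % h;
            n += in[ny * w + nx];
        }

    unsigned char alive = in[y * w + x];
    out[y * w + x] = (n == 3) || (alive && n == 2);  // standard B3/S23 rule
}
```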

~~~
llukas
There is a caching CPU implementation (Hashlife) that is _very_ fast and not
trivial to implement on a GPU:
[https://en.wikipedia.org/wiki/Hashlife](https://en.wikipedia.org/wiki/Hashlife)

A naive implementation should be a good problem to start with.

------
Moribund
Super-million particle engine. Draw more than a million squares to the screen
and move them according to some logic.

I've implemented one on almost every piece of hardware I've touched (including
the browser).

It looks amazing, is relatively easy to do, and has some great bits of
learning along the way.
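The per-frame update is tiny; a minimal CUDA sketch, where the bounce-in-a-box logic is just an illustrative rule (rendering would map this buffer into the graphics API, e.g. via CUDA/OpenGL interop):

```
struct Particle { float2 pos, vel; };

__global__ void updateParticles(Particle* p, int n, float dt) {
    // One thread per particle.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    p[i].pos.x += p[i].vel.x * dt;
    p[i].pos.y += p[i].vel.y * dt;

    // Reflect off the walls of the unit box.
    if (p[i].pos.x < 0.0f || p[i].pos.x > 1.0f) p[i].vel.x = -p[i].vel.x;
    if (p[i].pos.y < 0.0f || p[i].pos.y > 1.0f) p[i].vel.y = -p[i].vel.y;
}
```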

Good luck.

------
ArtWomb
Lots of great suggestions here. The real prize may be something like getting
real-time GPU-accelerated image and video filters on mobile devices. As an
alternative to OpenCL / CUDA you can try Vulkan GLSL compute shaders targeting
an Android GPU. See the NDK docs.

Another possibility arises with active development from the JuliaLang GPU
team. As far as I know there is no GPU-accelerated image processing library
for the language yet, but all the popular image format encoders exist.

Intro to JuliaGPU Ecosystem

[https://www.youtube.com/watch?v=6ntJ_al4oXA](https://www.youtube.com/watch?v=6ntJ_al4oXA)

Best of luck ;)

------
nobody271
I learned a little OpenCL a few years ago just because I wanted to see how
GPUs were programmed. I tried several books:

\- Heterogeneous Computing with OpenCL
([http://www.hds.bme.hu/~fhegedus/C++/Heterogeneous_computing_OpenCL.pdf](http://www.hds.bme.hu/~fhegedus/C++/Heterogeneous_computing_OpenCL.pdf))
- informative but not many examples

\- The Art of Multiprocessor Programming
([https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376](https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376))
- I had a hard time getting traction with this book.

\- OpenCL Parallel Programming Development Cookbook
([https://www.amazon.com/dp/B00ESX1AH2/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1](https://www.amazon.com/dp/B00ESX1AH2/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1))
- Not a great reference but it had some easy to follow examples.

\- Actually a few other books you might find when searching for parallel
processing or parallel algorithms which just turn out to be entirely abstract
math books.

People would ask me why I wanted to learn to program on a GPU and I didn't
have an answer. Surely I would find an answer in one of those books. I saved a
few of the projects:

\- Edge detection
([https://mega.nz/#!LJUwmLSa!dRijnB1xVhI9RAC1Xac_xRhT2IsfDG2sJ...](https://mega.nz/#!LJUwmLSa!dRijnB1xVhI9RAC1Xac_xRhT2IsfDG2sJLT3nD4Q1Pc))
- fun!

\- GPU template
([https://mega.nz/#!yAsxATzb!Y4-9zRMCTSYHX1pKxWPQPl8WNDgnWkSAU...](https://mega.nz/#!yAsxATzb!Y4-9zRMCTSYHX1pKxWPQPl8WNDgnWkSAUIbIFqfwekQ))
- write GPU code with JavaScript

I have another one for bitonic sort somewhere (a parallel sort that sadly
isn't even as good as quick sort).

The projects I enjoyed most were image filters (like edge detection). You
could do a project that implements various image filters. If you did that you
would not only get experience writing CUDA but you would learn how a lot of
different filters are done.

~~~
jason_slack
I've been delving into OpenCL to parallelize some algorithmic trading stuff I
have been doing. One thing I did a few days ago was to thread the way I load
data into a MySQL Server. What took over an hour normally is down to under 10
mins.

I'd be interested to know what GPUs you were/are using. I'm using a
Sonnet eGFX Breakaway Box 550 (with a Radeon RX Vega 56 card).

~~~
nobody271
Oh, you mean the bottleneck was on some processing you were doing before
sending it to the database? That's cool.

It was mostly on a low-end laptop with integrated graphics lol. I was more
interested in learning how parallel algorithms work than anything. Now I feel
like an idiot.

~~~
jason_slack
Nah don’t feel that way! We all have different goals or reasons for doing
things. I had a specific reason to learn. You had one too. They were just
different.

Yup my bottleneck was before hitting the database server so I used the GPU to
take care of it.

------
lamchob
Solvers for linear and differential equations are very interesting. Look at
geometric multigrid or asymmetric grid solvers.

Graph processing, sorting and histograms can also be very interesting.

Ray Tracing is also a classic application.

------
hiesenbug
You can create a program that calculates the histogram of an image, or write
a Sobel filter for an image. Sobel is fairly simple, as it's just a small
convolution. Get familiar first with how to manage memory between the CPU and
GPU, and with the different types of memory available on a GPU.
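For the histogram suggestion, a minimal CUDA sketch using shared memory plus atomics (it assumes `hist` was zeroed on the device beforehand):

```
__global__ void histogram256(const unsigned char* img, int n,
                             unsigned int* hist) {
    // Per-block histogram in fast shared memory.
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Each thread bins one pixel.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&local[img[idx]], 1u);
    __syncthreads();

    // Merge the block's partial histogram into the global one.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&hist[i], local[i]);
}
```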

------
VLM
Abstract answer: Find FPGA class notes and lab notes online

Very specific answer: I always wanted to do the FPGA lab where you simulate
the vibrating parts of a percussion instrument like a drum head, live in real
time, with a push button to hit the drum and an audio out. I suppose a CUDA
GPU version would output a .wav file. If you can get Game of Life working,
this is a logical next step, since you're simulating a 2D structure under
load rather than an abstract 2D automaton. I wonder what a drum head looks
like as a topo-map slowed down by a factor of 10K or so; probably interesting
looking, with nodes and reflections all over.

------
grusel
You may want to look into CuPy. It replicates the NumPy API on GPUs. It is
not too complicated, and it's fun to add missing functions. Picking the easier
parts first guides you closer and closer to becoming fluent in GPU
programming.

------
hyperpallium
My experience: just make sure you're already confident implementing it on a
CPU first. GPU coding is hard enough without _actual_ difficulty.

If you want to cover "all the details that [you] have learned", it'll help to
state what you learned. Also, what exercises you've done already - what level
are you at?

1\. A shadertoy-like tool (web, desktop or app) for GPU programming, with
conveniences for image processing. That exercises all the BS boilerplate you
learnt, from a general point of view, and will be useful.

2\. Fluid simulation (but there's a _lot_ of non-GPU maths to understand).

------
gmiller123456
I just started working through "Professional CUDA C Programming" (Ty
McKercher, Max Grossman, and John Cheng) and am finding it very interesting.
It's from 2014, so I'd think there'd be something newer, but it's the best
thing I could find on Safari Books. There's a lot of newer material that
focuses on specific topics (like AI), but this was the best general-purpose
book I could find.

------
AndrewStephens
WebGL is surprisingly easy once you have some boilerplate code to load the
textures and shaders, and it has the advantage that you can shove the result
on a web site to show others.

Here is a very simple project I did last year:

[https://sheep.horse/2017/9/crossfading_photos_with_webgl_-_boston_bridge_proj.html](https://sheep.horse/2017/9/crossfading_photos_with_webgl_-_boston_bridge_proj.html)

------
mailslot
An unconventional suggestion: if you are fluent in C & C++, the TensorFlow
codebase is decent. The kernels are implemented in both CPU & CUDA, so you can
make a side-by-side comparison in code and performance without writing the
boilerplate. You can fork your own branch to implement your own extensions and
play around. TensorFlow isn’t only for deep learning.

------
sbhn
CUDA SHA-256:

[https://github.com/Sean-Bradley/CUDALookupSHA256](https://github.com/Sean-Bradley/CUDALookupSHA256)

CUDA RIPEMD-160:

[https://github.com/Sean-Bradley/CUDALookupRipeMD60](https://github.com/Sean-Bradley/CUDALookupRipeMD60)

------
singularity2001
Don't waste your time on CUDA. Wait until an open architecture replaces this
proprietary clusterf*ck.

~~~
brutus1213
I disagree, partly. CUDA is the de facto standard in DL at the moment, and is
very well designed. It has academic roots and the neat ideas shine through.

Why did I say partly? So many optimization layers exist already (e.g.
BLAS/cuBLAS). One may not really need to get down to the CUDA level.

~~~
skohan
What, if anything, does CUDA do better/differently than OpenCL?

~~~
keldaris
CUDA itself does only one thing better: it lets you write kernels in C++
rather than a fairly restrictive subset of C. Far more importantly, it has a
large ecosystem of tooling (including excellent profiling tools), libraries,
documentation and open source projects. Comparatively speaking, OpenCL is a
barren landscape.
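For a concrete taste of what "kernels in C++" buys you, a tiny sketch: a single templated kernel that works for any element type and operation (device lambdas need nvcc's `--extended-lambda` flag; plain OpenCL C has no equivalent).

```
// One generic kernel covers every element type and operation.
template <typename T, typename Op>
__global__ void transform(const T* in, T* out, int n, Op op) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = op(in[i]);
}

// Usage:
//   transform<<<blocks, threads>>>(d_in, d_out, n,
//       [] __device__ (float x) { return x * x; });
```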

~~~
codehd7
SYCL layer/standard adds "C++ Single-source Heterogeneous Programming" for
OpenCL: [https://www.khronos.org/sycl/](https://www.khronos.org/sycl/)

~~~
keldaris
That it does, and there are also plenty of convenience wrappers of various
kinds for different languages. Unfortunately, none of that addresses the
ecosystem issue, which is the one that really matters.

