Here's an example video of what the results look like (it shows how EM waves are reflected by a parabolic mirror):
I used mostly the PyCUDA documentation and examples as well as the official CUDA documentation (https://docs.nvidia.com/cuda/) to learn. I think what's most important is to first understand what blocks, grids and threads are and how they work (see e.g. here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....). With that knowledge you can start thinking about how to structure your problem so it can be solved efficiently on the GPU. For the simulation, I basically have two 2D blocks of memory for each variable of interest (e.g. electric and magnetic fields in the X,Y directions, current density, material properties) that I transfer to the GPU. There, I use the discretized differential equation for the electromagnetic field to update the field values using the values from the first buffer (plus the material properties and currents), and I store the result in the second buffer. I then swap the buffer references (without copying any memory) for the next step of the simulation. I repeat this step n times, say, and then transfer some of the buffers back to main memory to e.g. plot them. That's mostly it! Of course there are many intricacies and ways to optimize the code, but getting a basic program running on your GPU isn't actually that hard.
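Roughly, one of those update kernels can look like the sketch below. This is plain CUDA C with illustrative field/coefficient names rather than the exact code from my simulation, but it shows the read-from-one-buffer / write-to-the-other pattern:

    // One FDTD update step for the Ez field component (TM-mode sketch).
    // Field and coefficient names are illustrative. Boundary cells are
    // skipped; real boundary conditions are omitted for brevity.
    __global__ void update_ez(const float* ez_in, float* ez_out,
                              const float* hx, const float* hy,
                              const float* jz,                    // current density
                              const float* ca, const float* cb,   // per-cell material coefficients
                              int nx, int ny, float dx, float dy)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < 1 || j < 1 || i >= nx || j >= ny) return;

        int idx = j * nx + i;
        // Discretized curl of H minus the source current, scaled by material coefficients.
        float curl_h = (hy[idx] - hy[idx - 1]) / dx - (hx[idx] - hx[idx - nx]) / dy;
        ez_out[idx] = ca[idx] * ez_in[idx] + cb[idx] * (curl_h - jz[idx]);
    }

On the host side you then just swap the ez_in/ez_out pointers (and likewise for the H-field buffers) after every step instead of copying anything.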
BTW, here's some really cool research work by Microsoft on GPU-based FDTD (finite difference time domain) simulations of wind instruments in two dimensions:
1. It allows for easy experimentation with the order in which work is done (which turns out to be a major factor in performance). IMO, this is one of the trickier parts of programming (GPU or not), so tools that accelerate experimentation also accelerate learning.
2. It allows you to write your algorithm once and emit code to run on OpenGL, OpenCL, CUDA, Metal, various SIMD flavors, and a bunch more exotic targets. CUDA effectively limits you to desktop/laptop computers, and at this point I’d rather bet on needing a mobile version at some point than not.
3. It eliminates a ton of boilerplate code, so you can get started quickly.
4. It’s what the pros use. Much of Adobe’s image processing code is in Halide now, for instance (source: pretty much any presentation extolling the virtues of Halide). The Halide authors cite a particular algorithm, the local Laplacian filter, where an intern’s one-afternoon Halide implementation beat a hand-optimized C++ implementation that had taken months to develop. I don’t know if the specifics of that have been exaggerated, but directionally I believe it. It was pretty transformational in the codepath I used it for.
I feel like developing an intuition for the “shape” of algorithms that will perform well before diving into the specifics of low-level tools like CUDA will serve you well.
Start drawing a graph of thousands of numbers with the CPU. Easy enough, but harder than it looks.
Accelerate the graph with the GPU so it has smooth animation, moving, zooming... Easy enough, but way harder than it looks.
Take a sound sample, uncompress it and visualize it in your graph system. Easy?
Take the sample and filter it with CUDA (there's a rough sketch a few lines below). Filtering a 1D sample is way easier than filtering 2D samples.
Play the filtered sound to "feel it".
Then you can filter 2D images if you want.
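For the 1D filtering step, a naive direct-convolution kernel is already enough to get going. A minimal sketch (no shared or constant memory tricks, illustrative names):

    // Naive 1D FIR filter: each thread computes one output sample.
    __global__ void fir1d(const float* in, float* out,
                          const float* taps, int num_taps, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float acc = 0.0f;
        for (int k = 0; k < num_taps; ++k) {
            int j = i - k;
            if (j >= 0) acc += taps[k] * in[j];   // clamp at the start of the signal
        }
        out[i] = acc;
    }

Launch it with something like fir1d<<<(n + 255) / 256, 256>>>(d_in, d_out, d_taps, num_taps, n); moving the taps into constant memory is a nice follow-up optimization.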
I recommend the graphic interface Dear Imgui:
Way faster development time than any other (stateful) interface, with the disadvantage that it continuously redraws the screen, consuming energy. Well worth it for rapid prototyping.
One trick is to write the output pixels transposed. That way both passes are identical, and they both read the image linearly; the two transposes cancel each other.
Another is to use local (shared) memory.
Finally, the right place for the filter kernel values is compiled into the code, as immediate values. Everything else is slower.
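To make those tricks concrete, here is a rough sketch of one blur pass: the weights are compile-time constants (so they become immediates), the rows are read linearly, and the output is written transposed so the exact same kernel serves as the second pass. The local/shared-memory staging is left out for brevity:

    #define KERNEL_RADIUS 2

    // One pass of a separable blur. Reads rows linearly, writes the result
    // transposed; running the same kernel twice filters both dimensions and
    // the two transposes cancel out.
    __global__ void blur_pass_transposed(const float* in, float* out,
                                         int width, int height)
    {
        // Filter weights as compile-time constants so they compile to immediates.
        const float weights[2 * KERNEL_RADIUS + 1] = {
            0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f
        };

        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float acc = 0.0f;
        #pragma unroll
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
            int xs = min(max(x + k, 0), width - 1);   // clamp at the image border
            acc += weights[k + KERNEL_RADIUS] * in[y * width + xs];
        }
        out[x * height + y] = acc;   // transposed write: (x, y) -> (y, x)
    }

The second launch is the same kernel with width and height swapped, and after both passes the image is back in its original orientation.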
He's got a YouTube channel with step by step explanations too.
I've learned a lot by trying to port branchy code & recursive algorithms to the GPU, then trying to figure out why the performance is terrible and how to fix it. Specifically, I learned a bunch trying to write custom primitive intersection shaders for OptiX (https://developer.nvidia.com/optix).
I like ShaderToy as well, but since it's so easy and hides the abstractions from you, you have to really look around for the amazing tricks other people there have used. Also write code that slows down your GPU and then improve it. Just getting pretty things to render and/or writing shaders that start out at 60fps won't help you learn as fast.
Easily parallelizable, visually attractive, fun to code and use, has a narrow scope but can be improved with goodies (e.g. color for cell age). If you implement real-time interactivity, it forces you to bridge the gap between the GPU and CPU worlds, which is a skill of its own.
A naive implementation should be a good problem to start with (rough sketch below).
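A minimal sketch of what that naive Game of Life step can look like (one thread per cell, wrap-around borders, illustrative names):

    // One Game of Life step: one thread per cell, toroidal (wrap-around) borders.
    __global__ void life_step(const unsigned char* in, unsigned char* out,
                              int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int neighbors = 0;
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int nx = (x + dx + width) % width;
                int ny = (y + dy + height) % height;
                neighbors += in[ny * width + nx];
            }
        }
        unsigned char alive = in[y * width + x];
        out[y * width + x] = (neighbors == 3 || (alive && neighbors == 2)) ? 1 : 0;
    }

Double-buffer it (swap in/out after each step), and for the interactive part you can map the output buffer into a texture your UI draws each frame.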
I've implemented one on almost every hardware I've touched (including the browser).
It looks amazing, relatively easy to do but has some great bits of learning along the way.
Another possibility comes from the active development by the JuliaLang GPU team. As far as I know there is no GPU-accelerated image processing library for the language yet, but all the popular image format encoders exist.
Intro to JuliaGPU Ecosystem
Best of luck ;)
- Heterogeneous Computing with OpenCL (http://www.hds.bme.hu/~fhegedus/C++/Heterogeneous_computing_...) - informative but not many examples
- The Art of Multiprocessor Programming (https://www.amazon.com/Art-Multiprocessor-Programming-Revise...) - I had a hard time getting traction with this book.
- OpenCL Parallel Programming Development Cookbook (https://www.amazon.com/dp/B00ESX1AH2/ref=dp-kindle-redirect?...) - Not a great reference but it had some easy to follow examples.
- Plus a few other books you might find when searching for parallel processing or parallel algorithms, which just turn out to be entirely abstract math books.
People would ask me why I wanted to learn to program on a GPU and I didn't have an answer. Surely I would find an answer in one of those books. I saved a few of the projects:
- Edge detection (https://mega.nz/#!LJUwmLSa!dRijnB1xVhI9RAC1Xac_xRhT2IsfDG2sJ...) - fun!
I have another one for bitonic sort somewhere (a parallel sort that sadly isn't even as good as quick sort).
The projects I enjoyed most were image filters (like edge detection). You could do a project that implements various image filters. If you did that you would not only get experience writing CUDA but you would learn how a lot of different filters are done.
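As a starting point, a Sobel-style edge detector is only a few lines of CUDA. A rough sketch for a single-channel image (gradient magnitude only, illustrative names):

    // Sobel edge detection: each thread computes the gradient magnitude at one pixel.
    __global__ void sobel(const unsigned char* in, unsigned char* out,
                          int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

        // Fetch the 3x3 neighborhood.
        int p[3][3];
        for (int j = -1; j <= 1; ++j)
            for (int i = -1; i <= 1; ++i)
                p[j + 1][i + 1] = in[(y + j) * width + (x + i)];

        int gx = (p[0][2] + 2 * p[1][2] + p[2][2]) - (p[0][0] + 2 * p[1][0] + p[2][0]);
        int gy = (p[2][0] + 2 * p[2][1] + p[2][2]) - (p[0][0] + 2 * p[0][1] + p[0][2]);

        int mag = abs(gx) + abs(gy);               // cheap L1 gradient magnitude
        out[y * width + x] = mag > 255 ? 255 : mag;
    }

Most other filters (blur, sharpen, emboss) have the same structure with different coefficients, which is what makes this such a repeatable project.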
What I'd be interested in knowing is what GPUs you were/are using. I'm using a Sonnet eGFX Breakaway Box 550 (with a Radeon RX Vega 56 card).
It was mostly on a low-end laptop with integrated graphics lol. I was more interested in learning how parallel algorithms work than anything. Now I feel like an idiot.
Yup my bottleneck was before hitting the database server so I used the GPU to take care of it.
Graph processing, sorting and histograms can also be very interesting.
Ray Tracing is also a classic application.
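Histograms in particular are a nice small exercise because they force you to deal with contention between threads. A minimal global-atomics sketch (a per-block shared-memory histogram is the usual next optimization):

    // Naive 256-bin histogram: every thread does one global atomicAdd.
    // bins must be zeroed (e.g. with cudaMemset) before the launch.
    __global__ void histogram256(const unsigned char* data, unsigned int* bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&bins[data[i]], 1u);
    }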
Very specific answer: I always wanted to do the FPGA lab where you simulate the vibrating parts of a percussion instrument like a drum head live in real time with a push button to hit the drum and an audio out. I suppose a CUDA GPU version would output a .wav file. I suppose if you can get Game Of Life working this is a logical next step where you're simulating a 2-D structure under load rather than an abstract 2-D automaton. I wonder what a drum head looks like as a topo-map slowed down by a factor of 10K or so, probably interesting looking with nodes and reflections all over.
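For anyone curious, the drum-head part boils down to the 2-D wave equation on a grid, so the update kernel is barely bigger than a Game of Life step. A rough leapfrog sketch (illustrative names; the border cells stay at zero and act as the fixed rim):

    // One leapfrog step of the 2-D wave equation (a crude drum-head model).
    // u_prev / u_curr / u_next are rotated on the host after each step.
    __global__ void wave_step(const float* u_prev, const float* u_curr, float* u_next,
                              int nx, int ny, float c2)   // c2 = (c * dt / dx)^2
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;   // fixed rim

        int idx = j * nx + i;
        float laplacian = u_curr[idx - 1] + u_curr[idx + 1]
                        + u_curr[idx - nx] + u_curr[idx + nx]
                        - 4.0f * u_curr[idx];
        u_next[idx] = 2.0f * u_curr[idx] - u_prev[idx] + c2 * laplacian;
    }

Hitting the drum is just adding an impulse to a few cells of u_curr, and sampling one cell per step gives you the samples for that .wav output.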
If you want to cover "all the details that [you] have learned", it'll help to state what you learned.
Also, what exercises you've done already - what level are you at?
1. A shadertoy (web, desktop or app) for GPU programming, with conveniences for image processing. That deals with all the BS boilerplate you learnt, from a general point of view, and will be useful.
2. Fluid simulation (but, there's a lot of non-GPU maths to understand).
Here is a very simple project I did last year:
Why did I say partly? So many optimization layers exist already (e.g. BLAS/cuBLAS). One may not really need to get down to the CUDA level.
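As a concrete example: a single-precision matrix multiply through cuBLAS is a couple of calls and will almost certainly beat a hand-rolled kernel. A minimal sketch, assuming the matrices are already on the GPU and the handle comes from cublasCreate:

    #include <cublas_v2.h>

    // C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n),
    // all column-major and already resident on the device.
    void gemm(cublasHandle_t handle, const float* d_A, const float* d_B, float* d_C,
              int m, int n, int k)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, d_A, m,    // lda = m
                            d_B, k,    // ldb = k
                    &beta,  d_C, m);   // ldc = m
    }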