Ask HN: What are good projects to understand CUDA and GPU Programming?
195 points by skywalker212 on Oct 2, 2018 | 37 comments
I have taken an Introduction to GPU Programming course at my college and I want to do a project that covers all the details I have learned. Something in the image processing field would be nice.

 I wrote a CUDA/GPU-based 2D electromagnetic simulation a while ago. The code is open source and (hopefully) not too complicated: https://github.com/adewes/fdtd-ml Here's an example video of what the results look like (it shows how EM waves are reflected by a parabolic mirror): https://www.youtube.com/watch?v=ZPSzAaxkg5c I used mostly the PyCUDA documentation and examples as well as the official CUDA documentation (https://docs.nvidia.com/cuda/) to learn. I think what's most important is to first understand what blocks, grids and threads are and how they work (see e.g. here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....). With that knowledge you can start thinking about how to structure your problem to solve it efficiently on the GPU. For the simulation, I basically have two 2D blocks of memory for each variable of interest (e.g. electric and magnetic fields in the X and Y directions, current density, material properties) that I transfer to the GPU. There, I use the discretized differential equation for the electromagnetic field to update the field values using the values from the first buffer (and the material properties + currents). I store the result in the second buffer. I then swap the buffer references (without copying any memory) for the next step of the simulation. I perform this step n times and then transfer some of the buffers back to main memory, e.g. to plot them. That's mostly it! Of course there are many intricacies and ways to optimize code, but getting a basic program running on your GPU actually isn't that hard. BTW, here's some really cool research work by Microsoft on GPU-based FDTD (finite difference time domain) simulations of wind instruments in two dimensions:
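The two-buffer scheme described in the comment above can be sketched on the CPU as follows. This is a minimal illustration, not the code from the linked repository: a toy 1D diffusion-style stencil stands in for the discretized electromagnetic update, and the names `step` and `simulate` are made up for this example. The point is only the pattern of reading from one buffer, writing to the other, and swapping references.

```python
def step(src, dst, alpha=0.25):
    """Write one update of a toy 1D stencil from src into dst.

    Assumption: this diffusion-style stencil is only a stand-in for the
    real field update. Boundary cells are copied through unchanged.
    """
    n = len(src)
    dst[0], dst[-1] = src[0], src[-1]
    for i in range(1, n - 1):  # on a GPU, each i would be handled by one thread
        dst[i] = src[i] + alpha * (src[i - 1] - 2.0 * src[i] + src[i + 1])


def simulate(field, n_steps):
    """Run n_steps updates with two buffers, swapping references only."""
    front = list(field)          # buffer holding the current state
    back = [0.0] * len(field)    # buffer receiving the next state
    for _ in range(n_steps):
        step(front, back)
        front, back = back, front  # swap references; no data is copied
    return front


print(simulate([0.0, 0.0, 1.0, 0.0, 0.0], n_steps=2))
# prints [0.0, 0.25, 0.375, 0.25, 0.0]
```

Because every output cell depends only on the previous buffer, all cells of one step can be computed in parallel, which is what makes this structure a good fit for a GPU kernel.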
 https://wiki.tiker.net/PyCuda/Examples I agree about the PyCUDA examples
 Since you mentioned image processing in particular, I’d recommend looking into Halide instead of (or as well as) CUDA. A few reasons:
 1. It allows for easy experimentation with the order in which work is done (which turns out to be a major factor in performance). IMO, this is one of the trickier parts of programming (GPU or not), so tools that accelerate experimentation also accelerate learning.
 2. It allows you to write your algorithm once and emit code to run on OpenGL, OpenCL, CUDA, Metal, various SIMD flavors, and a bunch of more exotic targets. CUDA effectively limits you to desktop/laptop computers, and at this point I’d rather bet on needing a mobile version at some point than not.
 3. It eliminates a ton of boilerplate code, so you can get started quickly.
 4. It’s what the pros use. Much of Adobe’s image processing code is in Halide now, for instance (source: pretty much any presentation extolling the virtues of Halide). The Halide authors cite a particular algorithm, the Local Laplacian Filter, where an intern with a Halide implementation beat, in one afternoon, a hand-optimized C++ implementation that had taken months to develop. I don’t know if the specifics of that have been exaggerated, but directionally I believe it. It was pretty transformational in the codepath I used it for.
 I feel like developing an intuition for the “shape” of algorithms that will perform well, before diving into the specifics of low-level tools like CUDA, will serve you well.
 Would Halide be a good option for writing a path-tracing engine?
 You should escalate in difficulty levels. GPU programming is infinite and will always take more of your time than you plan. With escalation you can get positive feedback fast. 2D image filtering is extremely complex for a beginner. Start by drawing a graph of thousands of numbers with the CPU. Easy enough, but harder than it looks. Accelerate the graph with the GPU so it has smooth animation, moving, zooming... Easy enough, but way harder than it looks. Take a sound sample, uncompress it and visualize it in your graph system. Easy? Take the sample and filter it with CUDA. Filtering a 1D sample is way easier than filtering 2D samples. Play the filtered sound to "feel it". Then you can filter 2D images if you want. I recommend the graphical interface library Dear ImGui: https://github.com/ocornut/imgui https://github.com/ocornut/imgui/issues/1902 Development is way faster than with any other (stateful) interface, with the disadvantage that it continuously redraws the screen, consuming energy. Well worth it for rapid prototyping.
 An image blur is a good place to start! Read horizontally from many pixels in parallel, sum them up as parallel as you can, normalize, write back. Repeat for the vertical blur, and here it might be best to rotate the image by a quarter turn so that vertical becomes horizontal, because memory access is usually faster that way.
 Did it a couple of times. One trick is to write the output pixels transposed. This way both passes are identical, and both read the image linearly. The two transposes cancel each other out. Another is to use local memory. Finally, the right place for the kernel values is compiled into the code, as immediate values. Everything else is slower.
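The transposed-write trick from the two comments above can be sketched like this. It is a CPU reference under stated assumptions: a plain box blur with clamped edges stands in for whatever kernel you actually want, the image is a list of rows, and the function names are hypothetical.

```python
def blur_rows_transposed(img, radius):
    """Box-blur every row of img and write the result transposed.

    img is a list of h rows of w values; the output has w rows of h values.
    Edges are clamped, i.e. border pixels are repeated.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * h for _ in range(w)]
    norm = 2 * radius + 1
    for y in range(h):              # on a GPU: one thread per output pixel
        for x in range(w):
            total = 0.0
            for k in range(-radius, radius + 1):
                xx = min(max(x + k, 0), w - 1)
                total += img[y][xx]     # reads walk along one row
            out[x][y] = total / norm    # transposed write
    return out


def box_blur(img, radius):
    """2D box blur as the same pass applied twice.

    The first call blurs horizontally and transposes; the second call blurs
    the transposed image (a vertical blur of the original) and transposes back.
    """
    return blur_rows_transposed(blur_rows_transposed(img, radius), radius)


impulse = [[0.0, 0.0, 0.0],
           [0.0, 9.0, 0.0],
           [0.0, 0.0, 0.0]]
print(box_blur(impulse, 1))
# prints [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

Both passes run identical code and read their input row by row; only the write is strided.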
 Learning to write raymarchers on shadertoy.com! It's a wonderful platform for GPU programming because you'll spend 100% of your time writing GPU code and 0% on installation, drivers, and build systems.
 Do you have any resources for the same? I learnt the basics from the Book of Shaders and other tutorials but can't seem to find any advanced lessons.
 Íñigo Quílez's (Shadertoy's creator) articles should be a good starting point: http://iquilezles.org/www/index.htm He's got a YouTube channel with step-by-step explanations too.
 Try to write Floyd-Steinberg dithering on the GPU. ;) I've learned a lot by trying to port branchy code & recursive algorithms to the GPU, then trying to figure out why the performance is terrible and how to fix it. Specifically, I learned a bunch trying to write custom primitive intersection shaders for OptiX (https://developer.nvidia.com/optix). I like ShaderToy as well, but since it's so easy and hides the abstractions from you, you have to really look around for the amazing tricks other people there have used. Also, write code that slows down your GPU and then improve it. Just getting pretty things to render and/or writing shaders that start out at 60fps won't help you learn as fast.
 Yeah, an algorithm that makes every pixel dependent on the previous one is pretty much the worst-case for GPU computation :)
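To see why Floyd-Steinberg is such an awkward fit, here is a minimal sequential reference, assuming a grayscale image stored as a list of rows with values between 0 and 1 (the function name is made up for this sketch). Each pixel's quantization error is pushed onto neighbours that have not been visited yet, so the value a pixel is thresholded at depends on everything processed before it.

```python
def floyd_steinberg(img):
    """Dither a grayscale image (values in 0..1) to 0.0/1.0; returns a new image."""
    h, w = len(img), len(img[0])
    px = [row[:] for row in img]  # working copy that accumulates diffused error
    for y in range(h):
        for x in range(w):
            old = px[y][x]
            new = 1.0 if old >= 0.5 else 0.0
            px[y][x] = new
            err = old - new
            # Diffuse the error to the not-yet-processed neighbours.
            if x + 1 < w:
                px[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    px[y + 1][x - 1] += err * 3 / 16
                px[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    px[y + 1][x + 1] += err * 1 / 16
    return px


print(floyd_steinberg([[0.4, 0.4]]))
# prints [[0.0, 1.0]]
```

Thresholding each pixel independently would turn both 0.4 values into 0.0; the second pixel only becomes 1.0 because it received error from the first. That serial chain is what a GPU port has to work around.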
 Conway's Game of Life. Easily parallelizable, visually attractive, fun to code and use, has a narrow scope but can be improved with goodies (e.g. color for cell age). If you implement realtime interactivity, it forces you to bridge the gap between the GPU and CPU worlds, which is a skill of its own.
 There is a caching CPU implementation that is very fast and not trivial to implement on a GPU: https://en.wikipedia.org/wiki/Hashlife A naive implementation should be a good problem to start with.
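A naive implementation of the kind suggested above might look like the following CPU reference. The wrap-around (toroidal) boundary is an assumption made for this sketch, and `life_step` is a hypothetical name. Each output cell is computed only from the previous generation, so on a GPU every cell could be one thread.

```python
def life_step(grid):
    """Compute one Game of Life generation on a wrap-around grid of 0/1 cells."""
    h, w = len(grid), len(grid[0])
    new = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):  # each (y, x) is independent of the others
            neighbours = sum(
                grid[(y + dy) % h][(x + dx) % w]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            alive = grid[y][x] == 1
            # Birth on exactly 3 neighbours, survival on 2 or 3.
            new[y][x] = 1 if neighbours == 3 or (alive and neighbours == 2) else 0
    return new


blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in life_step(blinker):
    print(row)
```

Running it on the blinker flips the horizontal bar into a vertical one, and a second step flips it back.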
 Super-million particle engine. Draw more than a million squares to the screen and move them according to some logic. I've implemented one on almost every piece of hardware I've touched (including the browser). It looks amazing and is relatively easy to do, but has some great bits of learning along the way. Good luck.
 Lots of great suggestions here. The real prize may be something like getting real-time GPU-accelerated image and video filters on mobile devices. As an alternative to OpenCL / CUDA you can try Vulkan GLSL compute shaders targeting an Android GPU. See the NDK docs. Another possibility arises from active development by the JuliaLang GPU team. As far as I know there is no GPU-accelerated image processing library for the language yet, but all the popular image format encoders exist. Intro to JuliaGPU Ecosystem: https://www.youtube.com/watch?v=6ntJ_al4oXA Best of luck ;)
 I learned a little OpenCL a few years ago just because I wanted to see how GPUs were programmed. I tried several books:
 - Heterogeneous Computing with OpenCL (http://www.hds.bme.hu/~fhegedus/C++/Heterogeneous_computing_...) - informative but not many examples.
 - The Art of Multiprocessor Programming (https://www.amazon.com/Art-Multiprocessor-Programming-Revise...) - I had a hard time getting traction with this book.
 - OpenCL Parallel Programming Development Cookbook (https://www.amazon.com/dp/B00ESX1AH2/ref=dp-kindle-redirect?...) - not a great reference, but it had some easy-to-follow examples.
 - Actually, a few other books you might find when searching for parallel processing or parallel algorithms just turn out to be entirely abstract math books.
 People would ask me why I wanted to learn to program on a GPU and I didn't have an answer. Surely I would find an answer in one of those books. I saved a few of the projects:
 - Edge detection (https://mega.nz/#!LJUwmLSa!dRijnB1xVhI9RAC1Xac_xRhT2IsfDG2sJ...) - fun!
 - GPU template (https://mega.nz/#!yAsxATzb!Y4-9zRMCTSYHX1pKxWPQPl8WNDgnWkSAU...) - write GPU code with JavaScript.
 I have another one for bitonic sort somewhere (a parallel sort that sadly isn't even as good as quicksort). The projects I enjoyed most were image filters (like edge detection). You could do a project that implements various image filters. If you did that you would not only get experience writing CUDA but you would also learn how a lot of different filters are done.
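For reference, the bitonic sort mentioned above can be sketched as below. This is a generic textbook formulation run sequentially, not the commenter's saved project, and it assumes the input length is a power of two. Within one `(k, j)` pass every compare-and-swap touches a disjoint pair, which is why each pass maps onto one parallel kernel launch.

```python
def bitonic_sort(values):
    """Return a sorted copy of values using a bitonic sorting network.

    Assumption: len(values) is a power of two (or zero).
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared elements
            for i in range(n):   # independent pairs: parallel on a GPU
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
# prints [1, 2, 3, 4, 5, 6, 7, 8]
```

The network always performs the same comparisons regardless of the data, which suits GPUs but costs O(n log² n) comparisons, more than a comparison sort such as quicksort needs on average.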
 I've been delving into OpenCL to parallelize some algorithmic trading stuff I have been doing. One thing I did a few days ago was to thread the way I load data into a MySQL server. What normally took over an hour is down to under 10 minutes. I'd be interested in knowing what GPUs you were/are using. I'm using a Sonnet eGFX Breakaway Box 550 (with a Radeon RX Vega 56 card).
 Oh, you mean the bottleneck was in some processing you were doing before sending it to the database? That's cool. It was mostly on a low-end laptop with integrated graphics lol. I was more interested in learning how parallel algorithms work than anything. Now I feel like an idiot.
 Nah, don’t feel that way! We all have different goals or reasons for doing things. I had a specific reason to learn. You had one too. They were just different. Yup, my bottleneck was before hitting the database server, so I used the GPU to take care of it.
 Solvers for linear and differential equations are very interesting. Look at geometric multigrid or asymmetric grid solvers. Graph processing, sorting and histograms can also be very interesting. Ray tracing is also a classic application.
 You can create a program that calculates the histogram of an image or write a Sobel filter for an image. Sobel is fairly simple as it's just a small 3x3 convolution. Get familiar with how to manage memory between the CPU and GPU first, and with the different types of memory you have available on a GPU.
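Both exercises from the comment above are easy to prototype on the CPU first. The sketch below assumes an image stored as a list of rows of small non-negative integers; the function names are hypothetical, and using |gx| + |gy| as the gradient magnitude is a common simplification rather than the only choice.

```python
def histogram(img, bins=256):
    """Count how often each integer value in 0..bins-1 occurs in img."""
    counts = [0] * bins
    for row in img:
        for v in row:
            counts[v] += 1  # on a GPU this increment would need to be atomic
    return counts


SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel(img):
    """Approximate gradient magnitude |gx| + |gy|; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):   # each interior pixel is independent
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            out[y][x] = abs(gx) + abs(gy)
    return out


edge = [[0, 0, 10, 10] for _ in range(4)]
print(sobel(edge)[1])
# prints [0, 40, 40, 0]
print(histogram([[0, 1], [1, 3]], bins=4))
# prints [1, 2, 0, 1]
```

The two make a useful contrast: Sobel is a pure gather where threads never collide, while the histogram is a scatter where many threads may hit the same bin, which is where atomics and shared memory come in on a real GPU.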
 Abstract answer: Find FPGA class notes and lab notes online. Very specific answer: I always wanted to do the FPGA lab where you simulate the vibrating parts of a percussion instrument like a drum head live in real time, with a push button to hit the drum and an audio out. I suppose a CUDA GPU version would output a .wav file. I suppose if you can get Game of Life working this is a logical next step, where you're simulating a 2-D structure under load rather than an abstract 2-D automaton. I wonder what a drum head looks like as a topo map slowed down by a factor of 10K or so; probably interesting looking, with nodes and reflections all over.
 You may want to look into CuPy. It replicates the NumPy API on GPUs. It is not too complicated but fun to add missing functions. Picking the easier parts first guides you closer and closer to becoming fluent in GPU programming.
 My experience: just make sure you're already confident implementing it on a CPU first. GPU coding is hard enough without actual difficulty. If you want to cover "all the details that [you] have learned", it'll help to state what you learned. Also, what exercises have you done already? What level are you at? 1. A shadertoy (web, desktop or app) for GPU programming, with conveniences for image processing. That deals with all the BS boilerplate you learnt, from a general point of view, and will be useful. 2. Fluid simulation (but there's a lot of non-GPU maths to understand).
 I just started working through "Professional CUDA C Programming" by Ty McKercher, Max Grossman, and John Cheng, and am finding it very interesting. It's from 2014, so I'd think there'd be something newer, but it's the best thing I could find on Safari Books. There's a lot of newer stuff that focuses on specific topics (like AI), but this was the best general-purpose book I could find.
 WebGL is surprisingly easy once you have some boilerplate code to load the textures and shaders, and has the advantage that you can shove the result on a web site to show others. Here is a very simple project I did last year:
 An unconventional suggestion: if you are fluent in C & C++, the TensorFlow codebase is decent. The kernels are implemented for both CPU & CUDA, so you can have a side-by-side comparison in code and performance without writing the boilerplate. You can fork your own branch to implement your own extensions and play around. TensorFlow isn’t only for deep learning.
 Don't waste your time on CUDA. Wait until an open architecture replaces this proprietary clusterf*ck.
 I disagree partly. CUDA is the de facto standard in DL at the moment, and is very well designed. It has academic roots and the neat ideas shine through. Why did I say partly? So many optimization layers exist already (e.g. BLAS/cuBLAS). One may not really need to get down to the CUDA level.
 What, if anything, does CUDA do better/differently than OpenCL?
 CUDA itself does only one thing better - it lets you write kernels in C++ rather than a fairly restrictive subset of C. Far more importantly, it has a large ecosystem of tooling (including excellent profiling tools), libraries, documentation and open source projects out there. Comparatively speaking, OpenCL is a barren landscape.
 SYCL layer/standard adds "C++ Single-source Heterogeneous Programming" for OpenCL: https://www.khronos.org/sycl/
 That it does, and there are also plenty of convenience wrappers of various kinds for different languages. Unfortunately, none of that can address the ecosystem issue, which is the one that really matters.
 CUDA has Cooperative Groups now on Volta and Turing architectures. This allows for synchronization between entire workgroups rather than just locally. So you can pretty much keep your entire job on the GPU even if it involves multiple kernels. Really important for complex jobs where performance is a must.
 Given that AMD's approach has shifted to implementing CUDA (under the name "HIP") and providing tools to automatically find/replace CUDA with HIP, I don't think the CUDA API is going anywhere.
