Prefix sum on portable compute shaders

6d65 · on Nov 17, 2021

Another great post in the series.

I've been trying to bootstrap my own deep learning framework in Rust for a while now. I'm still stuck at implementing the stuff on the CPU.

But I've always wanted something running on GPU as well. I've poked at OpenCL, and webgpu. But Piet-gpu(more probably piet-hal) seems to be the best starting point in the Rust land.

Once I get to the GPU compute part, I'll start with Piet GPU. Not looking forward to debugging GPU compute kernels, but maybe it will be fun.

PS. It's a bit disheartening to see such buggy Vulkan implementations, since one of the main Vulkan selling points was less bugs in the drivers. I'm not sure what I'll see on Linux.

Also, I hoped that between Vulkan and Metal one could cover all the the major desktop OSes with GPU accelerated software. It's sad to see apple dropping the ball in this regard.

There is also the concern of code reuse and code quality. Doing something complex in one pass would mean a huge kernel. It would be interesting to investigate if something like openai's Triton can be implemented in top of Piet-gpu(HAL). A dsl embedded in Rust that could do kernel fusion, and generate one shader when composed together. I'll poke around at this when I get to Piet-gpu.

raphlinus · on Nov 17, 2021

Let's talk if and when you want to build something. I make no promises that piet-gpu-hal is suitable for other workloads (right now I would say it barely meets the needs for piet-gpu), but on the other hand I am interested in the question of what it lacks and what it would take to run, eg, machine learning workloads on top of it.

dahart · on Nov 17, 2021

> there is something slightly taboo about coordination between workgroups in the same dispatch. Many GPU experts I’ve talked with express skepticism that this can work at all. Even so, interest in this type of pattern is picking up, in part because of advanced rendering engines like Nanite, which also uses atomics to coordinate work between workgroups in a live-running dispatch.

I live only in CUDA land, I’m a bit ignorant of the other platforms, so I’m not sure what granularity of workgroups is here. Generally speaking you can’t rely on individual threads executing in any particular order, no matter how many there are, so yes you can’t have one thread look at data from another thread without some kind of synchronization. It took me longer than it should have to really grok this and to believe it deeply, initially I kept wanting to think that thread number 235 million could read from thread 0 safely without synchronizing because surely enough time has passed. Nope, threads can and do come in any order, it’s never safe to assume another thread has finished.

Using atomics to solve this is rarely a good idea, atomics will make things go slowly, and there is often a way to restructure the problem so that you can let threads read data from a previous dispatch, and break your pipeline into more dispatches if necessary.

CUDA at least has a few other ways to share data during a single dispatch (or “launch”) besides atomics. Threads in warps can talk to each other, threads in blocks can share memory. But this all takes careful design and various other synchronization primitives.

> That is sadly not the case for GPU compute code. The most common scenario is dependency on a large, vendor-dependent toolkit such as CUDA (that installer is a 2.4GB download). If you have the right hardware, and the runtime installed properly, then your code can run.

While GPU coding is indeed more onerous than CPU programming in general, I feel like this wasn’t necessarily a fair point - this is the CUDA SDK download being compared to the CPU runtime. Installing the Rust compiler & cargo wasn’t mentioned as a downside, for example. CPU code also requires the right hardware and to have the runtime installed properly, it’s just something most people already have setup. Similarly, compiled CUDA code will run just fine without the SDK, and most people attempting to run a compiled CUDA program will have a compatible driver installed already. For the average compiled app, the run-time is no trickier than CPU, and the CPU has more or less the same kinds of requirements.

raphlinus · on Nov 17, 2021

Workgroup in Vulkan/WebGPU lingo is equivalent to "thread block" in CUDA speak; see [1] for a decoder ring.

> Using atomics to solve this is rarely a good idea, atomics will make things go slowly, and there is often a way to restructure the problem so that you can let threads read data from a previous dispatch, and break your pipeline into more dispatches if necessary.

This depends on the exact workload, but I disagree. A multiple dispatch solution to prefix sum requires reading the input at least twice, while decoupled look-back is single pass. That's a 1.5x difference if you're memory saturated, which is a good assumption here.

The Nanite talk (which I linked) showed a very similar result, for very similar reasons. They have a multi-dispatch approach to their adaptive LOD resolver, and it's about 25% slower than the one that uses atomics to manage the job queue.

Thus, I think we can solidly conclud that atomics are an essential part of the toolkit for GPU compute.

You do make an important distinction between runtime and development environment, and I should fix that, but there's still a point to be made. Most people doing machine learning work need a dev environment (or use Colab), even if they're theoretically just consuming GPU code that other people wrote. And if you do distribute a CUDA binary, it only runs on Nvidia. By contrast, my stuff is a 20-second "cargo build" and you can write your own GPU code with very minimal additional setup.

[1]: https://github.com/googlefonts/compute-shader-101/blob/main/...

dahart · on Nov 17, 2021

> Thus, I think we can solidly conclud that atomics are an essential part of the toolkit for GPU compute.

Complete agreement there! Yes there are absolutely good use cases for atomics, I just think it shouldn’t be summarized as either the best or the only approach. It’s incredibly common for there to be better approaches that avoid atomics.

Important to note that “Multiple-dispatch” can mean many things, and your comment seems to suggest that you’re thinking of serial dispatches in a single stream. If atomics and persistent threads are providing benefits, then it’s also possible that multiple parallel dispatch would also see performance improvements over multiple serial dispatch, because parallel dispatches can fill the exact same gap between dispatches that persistent threads are filling.

> Most people doing machine learning need a dev environment

Correct, but your 20 second cargo build was preceded by an install of the dev environment, right? I can’t ‘cargo build’ in 20 seconds right now, I don’t have the dev environment. On the other hand, I can build and run a CUDA app in 20 seconds. I don’t yet see this point being fair.

raphlinus · on Nov 17, 2021

Vulkan can't reliably do parallel dispatches, certainly not with any kind of scheduling fairness guarantee. CUDA has cooperative groups, which is a huge advantage.

Okay, I see your point about dev environments. It's like cameras, the best dev toolchain is the one you already have installed on your machine. I'll fix this but want to think about the best way to say it. I still believe there's a case to be made that CUDA is a heavyweight dependency.

dahart · on Nov 17, 2021

Thanks for listening Raph! It’s a good post, I’m picking nits. CUDA is a heavyweight dependency, I don’t have any problem with that. It’s just that most dev environments are heavy dependencies to development, so it’s mostly about what we’re comparing CUDA to. The driver is the runtime dependency, and it’s something to consider, but CUDA is pretty good about backward and forward compatibility. It’s true that CUDA code only runs on NV hardware, and I hope some of the good things CUDA has will make it to WebGPU & Vulkan. It’s not super common to build CPU code that only runs on Intel.

chrsig · on Nov 17, 2021

Just as an aside on the topic:

I spent the weekend neck deep in learning how to do an "aggregated auto increment", in which a single thread in a "coalesced group" (all of the threads in the block currently running that instruction) will increment a pointer by the number of threads in the group.

It's a very useful mechanism to implement a parallel queue / bump allocator, or to have an intermediary reduction step, without having to deal with shared memory.

https://developer.nvidia.com/blog/cooperative-groups/ https://developer.nvidia.com/blog/cuda-pro-tip-optimized-fil...

raphlinus · on Nov 17, 2021

Yeah, sometimes atomics perform way better than you expect them to. Check out the linkedlist benchmark in my suite, 12.1 G elements/s on AMD 5700 XT using DX12. That's a respectable fraction of raw memory bandwidth. Carrying over intuition from CPU land, you'd expect it to be very slow.

Looking at the ISA[2] you can get a glimpse of the magic that happens under the hood to make that happen. (Note: this test case is slightly simplified from what's in the repo for pedagogical reasons).

[1]: https://github.com/linebender/piet-gpu/blob/master/tests/sha...

[2]: https://shader-playground.timjones.io/da907f46d8bace9e5db7bd...

CoolGuySteve · on Nov 17, 2021

Unreal Engine's Nanite technology has the same problem. They want to walk a LoD/culling tree but have to resort to the CPU just to schedule new tasks on the GPU.

I wouldn't be surprised if future GPUs eliminate this requirement as it will make all Nanite Unreal Engine games run a little faster.

Kind of like how a couple years after Quake came out, every GPU got fast z-buffers.

https://youtu.be/eviSykqSUUw?t=1616

raphlinus · on Nov 17, 2021

> They want to walk a LoD/culling tree

DAG, my friend, not tree. Seriously, I recommend people watch the talk, it's one of the more impressive demonstrations of how to use GPU compute power I've ever seen, and the results speak for themselves.

volta83 · on Nov 17, 2021

> I live only in CUDA land, I’m a bit ignorant of the other platforms, so I’m not sure what granularity of workgroups is here. Generally speaking you can’t rely on individual threads executing in any particular order, no matter how many there are, so yes you can’t have one thread look at data from another thread without some kind of synchronization https://news.ycombinator.com/item?id=29257343

The only thing Raph relies on is, that if and only if a thread ever starts to run, then it will continue to make progress until it completes.

This is something you can rely on, e.g., on NVIDIA GPUs since Turing/Volta, and allows you to use atomics, barriers, semaphores, etc.

dragontamer · on Nov 17, 2021

> > there is something slightly taboo about coordination between workgroups in the same dispatch. Many GPU experts I’ve talked with express skepticism that this can work at all. Even so, interest in this type of pattern is picking up, in part because of advanced rendering engines like Nanite, which also uses atomics to coordinate work between workgroups in a live-running dispatch.

My opinion on the matter...

1. Because workgroups don't necessarily launch with each other. Workgroup #500 may not execute within Workgroup #1's lifetime, so already a lot of coordination between workgroups is limited. In contrast, if pthread_create started up a new thread, its in the running state (maybe not physically running, but it will get some time to run when the CPU is free).

2. Because kernel launch and kernel exit is the "natural" way to synchronize, and is actually decently efficient (especially device-side kernel launches). Even if you're forced for host-side kernel launches, CUDA-streams make async behavior easy and efficient (OpenCL task graphs are also quite efficient).

3. Because L1 caches in GPU-land are non-cohesive IIRC, so read/write barriers are very expensive (aka: you just invalidate L1 cache and wait for L2 cache). As such, #2 tends to be the more efficient approach in many cases.

-------

Simple, relaxed atomics work out very well though (relaxed atomics don't need any memory barriers and therefore execute in whatever order is most efficient for the GPU). I think the SIMD-queue thing also works out well in practice as a shared data-structure because it can rely upon relaxed atomics.

But the minute you start implementing significant blocking behavior (ex: compare-and-swap loops making lock/unlock spinlocks)... well... things continue to __work__ but its not nearly as efficient as you think. You end up thinking about doing #2 instead in many cases. The penalty on those memory-barriers even for acquire/release semantics is really really high, far higher than you'd expect.

(Ex: If you have a SIMD-queue. You could use atomics to implement a lock/unlock for the queue, and then your workgroups can then operate on the queue at their leisure. However: how many workgroups / blocks to spawn? And when they do coordinate, the stiff-penalties associated with memory-barriers really sucks. In contrast, if you just launch a kernel over the _entire_ queue, you get better control over the amount of resources used, and the individual kernels don't have to worry about coordination as much, especially if you have a separate output queue per workgroup that later merges in in a separate kernel)

But if you have to CAS-loop a non-blocking event across a bunch of stuff, modern GPUs _can_ do it. But I'm not convinced that programmers _should_ do it.

In short: GPUs don't seem to MESI with each other. Patterns in CPU-land around spinlocks / interthread communications are simply not as efficient in GPU-land. In contrast, GPU-land workgroup barriers and/or kernel launch/exists are EXTREMELY efficient (while the CPU-equivalent "pthread_create" and "pthread_join" is extremely inefficient)... but only within GPU-specific structures (barriers are efficient only within a workgroup. Kernel-exit is only effective at synchronizing with your own CUDA grid / OpenCL Task).

masahi · on Dec 1, 2021

TVM, similar to IREE, also has a good support for Vulkan. It compiles a tensor DSL written in Python into SPIR-V. AMD is using it in production for running deep learning models on their APU.

https://github.com/apache/tvm

kvark · on Nov 18, 2021

> Vulkan is defined in terms of the SPIR-V intermediate representation, so presumably can run anything that compiles to that

“Presumably” doesn’t hold here, unfortunately. We are seeing issues in spirv processing by drivers if it doesn’t look like the output of glslc.