1. The single-pass prefix sum the author implements is not legal Vulkan code. The algorithm requires spinning on conditions to be satisfied by concurrently running threads, which is allowed to deadlock in Vulkan and is mentioned explicitly in the "Limited Forward Progress Guarantees" section of the Vulkan Memory Model blog post linked to by the author. The only GPUs on which it is technically safe to run this algorithm are NVIDIA's Volta and newer architectures.
2. The mask argument of the CUDA shuffle intrinsics is there not because it is a more powerful abstraction, but because today's compiler technology is unable to prevent illegal code motion of shuffle intrinsics. There are threads on the LLVM mailing list that discuss this in some detail. (See the sketch after this list.)
3. I am happy to see this topic written about. Portability of compute algorithms, and where APIs differ or fall short, is not well understood, and CUDA tends to dominate the discussions. Regarding subgroup portability - I do wonder whether asking developers to write their code for specific subgroup sizes, and allowing them to request subgroup sizes, is the correct long-term abstraction. In Metal you can get a long way writing code that uses subgroup ('simdgroup' in Metal speak) functionality without actually making any assumptions about the subgroup size. It is a balance - there will always be performance to be had if you hard-code such assumptions, but the more we can close the gap, the less sense it makes to do so.
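To make (2) concrete, here is roughly what the masks look like in a warp-level inclusive prefix sum (a minimal sketch; warp_inclusive_sum is a made-up helper, and I'm assuming a 1D block). The mask makes the set of participating lanes explicit instead of leaving it implicit in the surrounding control flow, which is exactly the information the compiler otherwise can't recover:

    // Sketch: warp-level inclusive prefix sum with explicit participation masks.
    __device__ int warp_inclusive_sum(int v) {
        const unsigned FULL_MASK = 0xffffffffu;        // all 32 lanes participate
        for (int d = 1; d < 32; d <<= 1) {
            int up = __shfl_up_sync(FULL_MASK, v, d);  // value from lane (lane - d)
            if ((threadIdx.x & 31) >= d)               // lanes below d have no source
                v += up;
        }
        return v;
    }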
Thanks, especially the second paragraph. I'll probably update the post based on that (I've made several updates already).
On 1, I'm wondering if there's some confusion, especially as I don't see what difference Volta makes here. I accept that the Vulkan spec makes no guarantees about thread scheduling fairness, and that one could write an adversarial but spec-compliant implementation which schedules the atomic bump of the first workgroup to claim the first partition, then context switches to the second workgroup and attempts to run it to completion, which will deadlock.
But I think it is reasonable to assume, for an extremely wide range of hardware, that when the atomic bump is scheduled, the code to compute the partition aggregate will make forward progress at least to the point where it writes that aggregate, which is all that is required for completion. Note that all of this is about a workgroup depending on the progress of another workgroup. Obviously if invocations within a workgroup were depending on the progress of other invocations within the same workgroup, that would be a problem, but of course there are control barriers for that.
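Concretely, the dependency has roughly this shape (a CUDA-flavored sketch with made-up names; the aggregate computation and the final output pass are elided):

    #define MAX_PARTITIONS 4096                      // assumption for the sketch

    __device__ unsigned partition_counter;           // bumped to claim partitions in launch order
    __device__ volatile int flag[MAX_PARTITIONS];    // 0 = pending, 1 = aggregate published
    __device__ int aggregate[MAX_PARTITIONS];

    __global__ void one_pass_scan_skeleton(const int* in) {
        __shared__ unsigned part;
        if (threadIdx.x == 0)
            part = atomicAdd(&partition_counter, 1u);  // the "atomic bump"
        __syncthreads();

        int agg = 0;  // ... compute this partition's aggregate from `in` ...
        if (threadIdx.x == 0) {
            aggregate[part] = agg;
            __threadfence();           // make the aggregate visible before the flag
            flag[part] = 1;            // publish: all a successor needs from us
            while (part > 0 && flag[part - 1] == 0) {
                // Spinning on another workgroup's progress: this is the loop
                // the spec permits to never terminate.
            }
        }
        // ... add the running prefix and write the output ...
    }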
I don't see how it is any more technically safe to run this on Volta than any other hardware. If there is a document that guarantees stronger forward progress than mandated by the spec, I'm unaware of it.
From my reading of the spec, there is no way to write a one-pass prefix sum that is technically guaranteed deadlock-free. I'd argue that this is a defect of the spec, as there's a gap between useful guarantees that can safely be made by all real hardware and what the spec provides. I'll allow that fixing this defect is not easy. Until then, I think a good strategy is to ignore the defect.
On 3, while I hope to avoid code duplication, so far our track record has been pretty rough. For piet-metal, I tried writing the kernel in a subgroup-size-adaptive way, but had problems with size-64 subgroups, as that requires dealing with 64-bit ballot values, and I never got that to work (I don't have hardware with 64-wide subgroups in front of me, so had to iterate by asking other people to test). Also, for Brian's work on 32x32 boolean matrix transpose, we tried a hybrid approach using both subgroups and thread local storage, and performance was really disappointing. But, armed with this experience, perhaps we can do better going forward.
On 1, you are right that most HW, most of the time, will make the requisite forward progress after the atomic bump. But Vulkan does not guarantee it and most HW doesn't either (you're just getting lucky with a high probability).
"After probing our zoo of GPUs with litmus tests, we determined that all GPUs had the following property: once a work-group begins executing a kernel (i.e. the work-group becomes occupant on a hardware resource), it will continue executing until it reaches the end of the kernel. We call this execution model the occupancy bound execution model, because the number of work-groups for which relative forward progress is guaranteed is bound by the hardware resources available for executing work-groups; i.e. the hardware resources determine how many work-groups can be occupant."
However, I know even that property is not true of all GPUs and thus cannot be assumed by Vulkan spec-compliant code. (I actually doubt it's true of any GPUs pre-Volta - it's just that the progress needed is often a natural side-effect of switching between threads for latency hiding.)
Volta does guarantee this property + more. All starvation-free algorithms are supported by the architecture. See https://devblogs.nvidia.com/inside-volta/. In particular, this line:
"Independent thread scheduling in Volta ensures that even if a thread T0 currently holds the lock for node A, another thread T1 in the same warp can successfully wait for the lock to become available without impeding the progress of thread T0."
Independent thread scheduling is often thought of as providing a deadlock-freedom guarantee between threads in the same warp, but the same guarantees must also extend to coarser execution scopes (e.g. a workgroup waiting on another workgroup).
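The pattern in that quote, sketched in CUDA (hypothetical names; the point is only the shape of the wait):

    __device__ int node_lock = 0;  // 0 = free, 1 = held

    __device__ void locked_update_sketch() {
        while (atomicCAS(&node_lock, 0, 1) != 0) {
            // Another lane may hold the lock. Pre-Volta, lockstep warp
            // execution can starve the holder here; with independent thread
            // scheduling the holder keeps running, so this loop terminates.
        }
        // ... critical section ...
        atomicExch(&node_lock, 0);  // release
    }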
So, as far as Vulkan is concerned, I do not consider this a defect in the spec. I have seen this exact algorithm deployed in production and deadlock despite never having deadlocked at the developer's desk.
On 3 - sounds like another good blog post topic. It'd be interesting to hear about your experience trying (and being unable) to write shaders that are subgroup-agnostic but use subgroup functionality. I should read Brian's blog post on the matrix transpose as well; maybe some of that experience is documented there.
> However, I know even that property is not true of all GPUs and thus cannot be assumed by Vulkan spec compliant code. (I actually doubt it's true of any GPU's pre-Volta - it's just that the progress needed is often a natural side-effect of switching between threads for latency hiding).
To be more precise: pre-Volta GPUs (like Pascal, Kepler, etc.) guarantee that once a thread block starts running on an SM it will run there to completion, i.e., its resources won't be freed until the kernel completes.
Warps of threads within a thread block are not guaranteed to run in any order, and there are no guarantees that, e.g., if warp A spins on a lock held by warp B, warp B will ever run and make progress, and therefore the thread block might never run to completion.
Volta and later architectures guarantee forward progress in this case. That is, if warp A spins on a lock held by warp B, Volta and later guarantee that warp B will make progress at some point, allowing it to release the lock so that warp A can make progress as well.
> Warps of threads within a thread block are not guaranteed to run in any order, and there are no guarantees that, e.g., if warp A spins on a lock held by warp B, warp B will ever run and make progress, and therefore the thread block might never run to completion.
FYI this is wrong: pre-Volta guarantees that all warps will run to completion; what it doesn't guarantee is the same for threads within a warp. Post-Volta guarantees that.
Super, thanks for the info. I should update the blog post with it.
That paper calls for standardizing forward progress guarantees. Because you mention that there are GPUs that don't meet the occupancy bound as defined in the paper, that work might not go smoothly.
It occurs to me that for this particular algorithm there might be a chance to rescue it. Instead of simply spinning, a waiting workgroup might make a small amount of progress recomputing the aggregate for the partition it's waiting on. After a finite number of spins (the partition size divided by this grain of progress), it would have the aggregate for the partition so would be able to move on to the next partition. Thus the correctness concern becomes a performance concern, where a very high probability of the spin yielding early is likely "good enough."
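In rough code (reusing the flag[] / aggregate[] publication scheme from my sketch above; GRAIN is a made-up tuning knob, and the recompute is serialized on one thread for brevity where a real version would do it cooperatively):

    #define GRAIN 64  // elements of the predecessor recomputed per failed poll

    __device__ int wait_or_recompute(const int* in, unsigned pred, int part_size) {
        int fallback = 0;
        for (int done = 0; done < part_size; done += GRAIN) {
            if (flag[pred] != 0)
                return aggregate[pred];    // fast path: predecessor published
            int end = min(done + GRAIN, part_size);
            for (int i = done; i < end; ++i)
                fallback += in[pred * part_size + i];  // recompute one grain
        }
        return fallback;  // bounded: at most part_size/GRAIN polls, then we have it ourselves
    }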
I'll not make any promises about another blog - I worry that subgroup size tuning is too much in the weeds other than for a very specialized audience. But I certainly do hope to blog more about piet-gpu and will see if I can touch on the topic then.
> It occurs to me that for this particular algorithm there might be a chance to rescue it. Instead of simply spinning, a waiting workgroup might make a small amount of progress recomputing the aggregate for the partition it's waiting on. After a finite number of spins (the partition size divided by this grain of progress), it would have the aggregate for the partition so would be able to move on to the next partition. Thus the correctness concern becomes a performance concern, where a very high probability of the spin yielding early is likely "good enough."
Yes, I've seen this workaround perform well in practice, sorry for not mentioning it >.<. It does blow up any asymptotic efficiency guarantees though, which may or may not be acceptable for the application at hand.
> I'll not make any promises about another blog - I worry that subgroup size tuning is too much in the weeds other than for a very specialized audience. But I certainly do hope to blog more about piet-gpu and will see if I can touch on the topic then.
Looking forward to any future piet-gpu posts you may write :-).
Huh, I missed the discussion in https://news.ycombinator.com/item?id=22902274 but while prefix-sum makes for a reasonable microbenchmark, I think it misses my main problem with any of the existing !CUDA attempts: ergonomics.
The Vulkan Compute attitude towards device memory deeply confuses me. That is, for both OpenCL and Vulkan Compute, there is no way to say (AFAICT / at least a couple years ago): "I wish to allocate X amount of memory on this device, do not spill to the host; fail if you cannot satisfy". For most programs, it's probably a fine assumption, just like virtual memory for a regular program. But in a world where you're trying to push a system to the limit, you'd still like to say "No, I'm serious".
Raph: What's your take on the ergonomics of using Vulkan Compute? I'd hoped to see more on that given your opening of:
> Vulkan might be a viable platform, for a sufficiently persistent implementor.
My take is that the ergonomics are pretty bad; there's a significant learning curve. But I'm also hopeful that it's possible to build tools that layer on top of Vulkan - unlike CUDA, which is a highly integrated toolchain and runtime, Vulkan really only targets the lower levels of the stack.
I can't speak to your memory allocation question, as I'm personally quite happy with the default behavior.
> Before then, the mechanism for lock-free programming patterns was the volatile qualifier and various nonstandard barrier intrinsics.
Ehhhhhh…
> The semantics of volatile were never clearly defined…
Well, the problem is that volatile was defined, more or less. It means that you get load and store instructions in machine code that correspond to the appearance of loads and stores in the source code. Importantly, this does not imply anything about the memory model on multiprocessor systems.
So "volatile" was never supposed to be used for threading. What you can use it for is:
- Memory-mapped I/O
- Values that are changed in interrupts
- Values changed on other threads on x86 if you just don’t give a damn
- longjmp/setjmp
- Values changed on other threads on Itanium
What happens when you don’t use “volatile” is that the compiler will do dead store elimination and optimize out loads. For example,
    // Global:
    int x = 0;
    void wait_for_x(void) { while (x == 0) {} }
If `x` is modified by an interrupt, it should be marked `volatile` because otherwise the compiler will helpfully optimize out the test in the loop.
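With the qualifier, the compiler has to re-issue the load on every iteration:

    // Global:
    volatile int x = 0;
    void wait_for_x(void) { while (x == 0) {} }  // the test is no longer optimized out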
What people actually did for lock-free programming in the pre-C++-11 days was to use those non-portable intrinsics or inline assembler. If someone used “volatile”, it was usually out of ignorance.
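For instance, on GCC that looked something like this (a sketch using the non-portable __sync builtins):

    int counter = 0;

    int bump(void) {
        int old = __sync_fetch_and_add(&counter, 1);  // atomic RMW with a full barrier
        __sync_synchronize();                         // explicit full memory barrier
        return old;
    }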
Ok, I want to get this right. The major point I'm trying to make is that without the Vulkan memory model, the way to get atomic loads and stores is to add the "volatile" memory qualifier to the buffer interface block. This has basically the same set of problems as using "volatile" in pre-C++11 code.
There is definitely precedent for using "volatile" in multithreaded code (though, researching this, I think the precedent may be stronger in C than in C++; see [1], particularly the "In C++11, normally never use volatile for threading, only for MMIO" answer). From the perspective of standard C and C++, the argument that volatile is incorrect for multithreading is extremely strong, but in practice I think it was considered quite usable.
I have a feeling we agree on the substance (a memory model is a dramatic improvement over using volatile), but I can probably improve the wording.
ETA: Here's what I consider a great reference for the widespread use of volatile for multithreaded programming. It's from 2001, but by Andrei Alexandrescu, who I think can be considered an authority: https://www.drdobbs.com/cpp/volatile-the-multithreaded-progr...
Many people have written code targeting only MSVC, where volatile is documented to have atomic-like properties, and targeting only x86, where the bus is especially forgiving, albeit at massive costs in extra logic to make it so.
So it's not so much ignorance at work as (pathological) tolerance of MS and x86 lock-in. That said, I know a FreeBSD kernel committer who in 2014 was convinced that "volatile" sufficed for thread synchronization on Gcc and any arch. So there is also plenty of ignorance at work. You could say that MS, Intel, and AMD have benefitted from such ignorance through the large amount of de-facto unportable code it engendered.
If you are using volatile to interact with memory-mapped I/O, that means that there is a _thread of execution_ somewhere also interacting with that memory - otherwise why are you reading/writing to it? - such that if those reads and writes are not synchronized the program has a data-race and the behavior is undefined.
The claim that this is not a data-race because those threads of execution are "outside my process" is incorrect: the C standard doesn't know about processes. Whether one thread runs in a CPU and writes to a memory mapped DMA region, and the other thread is in your network card reading from that same memory region doesn't matter at all.
The right tool here is a `volatile atomic` load / store.
> - Values changed on other threads on x86 if you just don’t give a damn
If by "don't give a damn" you mean "don't care about UB" then this is correct, because doing that is UB, and UB can do anything.
> Values that are changed in interrupts
From the C standard point-of-view, an interrupt handler suspends a thread of execution, and the handler runs in a different thread of execution. Whether these threads are serialized by the OS or not doesn't matter, reading/writing across threads without synchronization is UB. In particular for interrupt handlers, using volatile to modify memory from an interrupt handler doesn't imply that the thread reading that memory will see the modification (e.g. the compiler might have cached that in a register). Using volatile to read the memory isn't sufficient either, because volatile is not a compiler fence, and it doesn't synchronize-with anything, not even volatile. The right tool here is a `volatile atomic` load / store.
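A sketch of what that looks like in C++11 terms (assuming std::atomic<int> is lock-free on the target):

    #include <atomic>

    // volatile: the access really happens (interrupt/MMIO visibility).
    // atomic:   the access has defined inter-thread semantics.
    volatile std::atomic<int> ready{0};

    void handler(void) {                   // runs in the interrupt context
        ready.store(1, std::memory_order_release);
    }

    void wait_for_ready(void) {
        // Neither elided nor reordered past the acquire load.
        while (ready.load(std::memory_order_acquire) == 0) {}
    }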
> - longjmp/setjmp
The C standard requires the usage of volatile here, otherwise the behavior is undefined. There is only one thread of execution at play here, so no synchronization is necessary.
Similar control-flow constructs like throwing and catching exceptions do not require the use of volatile here, so the reason one must use volatile here is because longjmp/setjmp is a poorly designed language feature.
It's a good post, but it only solves a much simpler version of the problem - it only computes the prefix sum of a workgroup's worth of data. It does not carry the sums forward from workgroup to workgroup, which is the entire point of the decoupled look-back approach.
The per-workgroup implementation in that blog is basically similar to mine, though I've put more effort into maxing out the partition size (using as many registers as available).