
Prefix Sum on Vulkan - raphlinus
https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html
======
bronxbomber92
A couple of small comments.

1\. The single pass prefix sum the author implements is not legal Vulkan code.
The algorithm requires spinning on conditions to be satisfied by concurrently
running threads. This is allowed to deadlock in Vulkan and is mentioned
explicitly in the "Limited Forward Progress Guarantees" section of the Vulkan
Memory Model blog post linked to by the author. The only GPUs on which it is
technically safe to run this algorithm are the Volta and newer NVIDIA
architectures.

2\. The mask argument of the CUDA shuffle intrinsics is there not because it
is a more powerful abstraction, but because today's compiler technology is
unable to prevent illegal code motion of shuffle intrinsics. There are threads
on the LLVM mailing list that discuss this in some detail (see the sketch at
the end of this comment for what the mask looks like in practice).

3\. I am happy to see this topic written about. Portability of compute
algorithms, and where APIs differ or fall short, is not well understood, and
CUDA tends to dominate the discussions. Regarding subgroup portability - I do
wonder if asking developers to write their code for specific subgroup sizes
and allowing them to request subgroup sizes is the correct long-term
abstraction. In Metal you can get a long way writing code that uses subgroup
('simdgroup' in Metal speak) functionality without actually making any
assumptions about the subgroup size. It is a balance - there will always be
performance to be had if you hard-code such assumptions, but the more we can
close the gap, the less it makes sense to do so.
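
To make point 2 concrete, here's a rough sketch (mine, not from the post) of a
warp-level inclusive scan in CUDA. The first argument to __shfl_up_sync is the
membership mask I'm referring to; the corresponding Vulkan subgroup operations
take no such argument, membership being implicit:

    // Rough sketch, not code from the post: inclusive prefix sum across one
    // 32-wide warp. 0xffffffffu is the explicit mask naming which lanes
    // participate in the shuffle.
    __device__ int warp_inclusive_scan(int value)
    {
        const unsigned full_mask = 0xffffffffu;  // all 32 lanes take part
        int lane = threadIdx.x & 31;
        for (int delta = 1; delta < 32; delta <<= 1) {
            // Every lane calls the shuffle; only lanes with a valid source add.
            int up = __shfl_up_sync(full_mask, value, delta);
            if (lane >= delta) value += up;
        }
        return value;
    }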

~~~
raphlinus
Thanks, especially the second paragraph. I'll probably update the post based
on that (I've made several updates already).

On 1, I'm wondering if there's some confusion, especially as I don't see what
difference Volta makes here. I accept that the Vulkan spec makes no guarantees
about thread scheduling fairness, and that one could write an adversarial but
spec-compliant implementation which schedules the atomic bump of the first
workgroup to claim the first partition, then context switches to the second
workgroup and attempts to run it to completion, which will deadlock.

But I think it is reasonable to make the assumption, for an extremely wide
range of hardware, that when the atomic bump is scheduled, the code to compute
the partition aggregate will make forward progress at least to the point where
it writes that aggregate, which is all that is required for completion. Note
that all of this is about a _workgroup_ depending on the progress of another
workgroup. Obviously if invocations within a workgroup were depending on the
progress of other invocations within the same workgroup, that would be a
problem, but of course there are control barriers for that.
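
To make the shape of that dependency concrete, here's a rough sketch (in CUDA
for brevity, with thread blocks standing in for workgroups; the names are
invented for illustration, this is not the actual shader from the post):

    // Rough sketch of the cross-workgroup hand-off, not the post's shader.
    // partition_counter, partition_aggregate and partition_flag are invented
    // names.
    __global__ void partition_scan(int *partition_counter,
                                   volatile int *partition_aggregate,
                                   volatile int *partition_flag)
    {
        __shared__ int my_partition;
        if (threadIdx.x == 0) {
            // The "atomic bump": claim the next partition index in launch order.
            my_partition = atomicAdd(partition_counter, 1);
        }
        __syncthreads();

        // ... compute this partition's local aggregate here ...
        int aggregate = 0;  // placeholder for the real reduction

        if (threadIdx.x == 0) {
            // Publish the aggregate, then raise the flag for later partitions.
            partition_aggregate[my_partition] = aggregate;
            __threadfence();
            partition_flag[my_partition] = 1;

            if (my_partition > 0) {
                // Look back: spin until the previous partition has published.
                // This wait is what the Vulkan spec does not promise will end.
                while (partition_flag[my_partition - 1] == 0) { /* spin */ }
                int exclusive_prefix = partition_aggregate[my_partition - 1];
                (void)exclusive_prefix;  // ... combine with local results ...
            }
        }
    }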

I don't see how it is any more technically safe to run this on Volta than any
other hardware. If there is a document that guarantees stronger forward
progress than mandated by the spec, I'm unaware of it.

From my reading of the spec, there is no way to write a one-pass prefix sum
that is technically guaranteed deadlock-free. I'd argue that this is a defect
of the spec, as there's a gap between useful guarantees that can safely be
made by all real hardware and what the spec provides. I'll allow that fixing
this defect is not easy. Until then, I think a good strategy is to ignore the
defect.

On 3, while I hope to avoid code duplication, so far our track record has been
pretty rough. For piet-metal, I tried writing the kernel in a subgroup-size-
adaptive way, but had problems with size-64 subgroups, as that requires
dealing with 64-bit ballot values, and I never got that to work (I don't have
64-bit hardware in front of me, so had to iterate by asking other people to
test). Also, for Brian's work on 32x32 boolean matrix transpose, we tried a
hybrid approach using both subgroups and thread-local storage, and performance
was really disappointing. But, armed with this experience, perhaps we can do
better going forward.

~~~
bronxbomber92
On 1, you are right that _most HW_ _most of the time_ will make the requisite
forward progress after the atomic bump. But Vulkan does not guarantee it and
most HW doesn't either (you're just getting lucky with a high probability).

The property you're describing is discussed in
[https://johnwickerson.github.io/papers/forwardprogress_concu...](https://johnwickerson.github.io/papers/forwardprogress_concur2017.pdf),
page 1:10:

"After probing our zoo of GPUs with litmus tests, we determined that all GPUs
had the following property: once a work-group begins executing a kernel (i.e.
the work-group becomes occupant on a hardware resource), it will continue
executing until it reaches the end of the kernel. We call this execution model
the occupancy bound execution model, because the number of work-groups for
which relative forward progress is guaranteed is bound by the hardware
resources available for executing work-groups; i.e. the hardware resources
determine how many work-groups can be occupant."

However, I know even that property is not true of all GPUs and thus cannot be
assumed by Vulkan spec compliant code. (I actually doubt it's true of any
GPUs pre-Volta - it's just that the progress needed is often a natural side-
effect of switching between threads for latency hiding).

Volta does guarantee this property + more. All starvation-free algorithms are
supported by the architecture. See [https://devblogs.nvidia.com/inside-
volta/](https://devblogs.nvidia.com/inside-volta/). In particular, this line:

"Independent thread scheduling in Volta ensures that even if a thread T0
currently holds the lock for node A, another thread T1 in the same warp can
successfully wait for the lock to become available without impeding the
progress of thread T0."

Independent thread scheduling is often thought of as providing a deadlock-
freedom guarantee between threads in the same warp, but the same guarantees
must also extend to coarser execution scopes (e.g. a workgroup waiting on
another workgroup).

So, as far as Vulkan is concerned, I do not consider this a defect in the
spec. I have seen this exact algorithm deployed in production and deadlock
despite never having deadlocked at the developer's desk.

On 3 - sounds like another good blog post topic. It'd be interesting to hear
about your experience trying to (and being unable to) write shaders that are
subgroup agnostic but use subgroup functionality. I should read Brian's blog
post on the matrix transpose as well; maybe some of that experience is
documented there.

~~~
fluffything
> However, I know even that property is not true of all GPUs and thus cannot
> be assumed by Vulkan spec compliant code. (I actually doubt it's true of any
> GPU's pre-Volta - it's just that the progress needed is often a natural
> side-effect of switching between threads for latency hiding).

To be more precise: pre-Volta GPUs (like Pascal, Kepler, etc.) guarantee that
once a thread block starts running on an SM it will run there to completion,
i.e., its resources won't be freed until the kernel completes.

Warps of threads within a thread block are not guaranteed to run in any order,
and there are no guarantees that, e.g., if warp A spins on a lock held by warp
B, warp B will ever run and make progress, and therefore the thread block
might never run to completion.

Volta and later architectures guarantee forward progress in this case. That
is, if warp A spins on a lock held by warp B, Volta and later guarantee that
warp B will make progress at some point, allowing it to release the lock so
that warp A can make progress as well.

~~~
fluffything
> Warps of threads within a thread block are not guaranteed to run in any
> order, and there are no guarantees that, e.g., if warp A spins on a lock
> held by warp B, warp B will ever run and make progress, and therefore the
> thread block might never run to completion.

FYI this is wrong: pre-Volta guarantees that all warps will run to completion;
what it doesn't guarantee is the same for threads within a warp. Post-Volta
guarantees that.
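
To illustrate the intra-warp case, a minimal sketch (mine, not from the
thread) of every thread in a warp contending on the same spinlock; this is the
pattern the Inside Volta quote above is about:

    // Rough sketch: a spinlock taken by every thread in a warp. Pre-Volta,
    // the warp steps through one divergent path at a time, so the spinning
    // threads can starve the thread that currently owns the lock (livelock).
    // Volta's independent thread scheduling guarantees the owner eventually
    // runs and releases the lock.
    __device__ int lock = 0;

    __global__ void contended_increment(int *counter)
    {
        while (atomicCAS(&lock, 0, 1) != 0) { /* spin until the lock is free */ }
        __threadfence();          // acquire ordering
        *counter += 1;            // critical section
        __threadfence();          // release ordering
        atomicExch(&lock, 0);     // release the lock
    }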

------
boulos
Huh, I missed the discussion in
[https://news.ycombinator.com/item?id=22902274](https://news.ycombinator.com/item?id=22902274)
but while prefix-sum makes for a reasonable microbenchmark, I think it misses
my main problem with any of the existing !CUDA attempts: ergonomics.

The Vulkan Compute attitude towards device memory _deeply_ confuses me. That
is, for both OpenCL and Vulkan Compute, there is no way to say (AFAICT / at
least a couple years ago): "I wish to allocate X amount of memory on this
device, do not spill to the host; fail if you cannot satisfy". For most
programs, it's probably a fine assumption, just like virtual memory for a
regular program. But in a world where you're trying to push a system to the
limit, you'd still like to say "No, I'm serious".

Raph: What's your take on the ergonomics of using Vulkan Compute? I'd hoped to
see more on that given your opening of:

> Vulkan might be a viable platform, for a sufficiently persistent
> implementor.

~~~
raphlinus
My take is that the ergonomics are pretty bad; there's a significant learning
curve. But I'm also hopeful that it's possible to build tools that layer on
top of Vulkan - unlike CUDA, which is a highly integrated toolchain and
runtime, Vulkan really only targets the lower levels of the stack.

I can't speak to your memory allocation question, as I'm personally quite
happy with the default behavior.

~~~
pjmlp
Given that CUDA has a thriving ecosystem of programming languages targeting
it, I fail to see how it is not possible to build tooling on top of it.

------
klodolph
> Before then, the mechanism for lock-free programming patterns was the
> volatile qualifier and various nonstandard barrier intrinsics.

Ehhhhhh…

> The semantics of volatile were never clearly defined…

Well, the problem is that volatile _was_ defined, more or less. It means that
you get load and store instructions in machine code that correspond to the
appearance of loads and stores in the source code. Importantly, this does not
imply _anything_ about the memory model on multiprocessor systems.

So "volatile" was never supposed to be used for threading. What you _can_ use
it for is:

\- Memory-mapped I/O

\- Values that are changed in interrupts

\- Values changed on other threads on x86 if you just don’t give a damn

\- longjmp/setjmp

\- Values changed on other threads on Itanium

What happens when you don’t use “volatile” is that the compiler will do dead
store elimination and optimize out loads. For example,

    
    
    // Global
    int x = 0;
    void wait_for_x(void) { while (x == 0) {} }

If `x` is modified by an interrupt, it should be marked `volatile` because
otherwise the compiler will helpfully optimize out the test in the loop.
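
For contrast, the same loop with the qualifier added (a minimal sketch; on a
multiprocessor this still implies nothing about ordering, as noted above):

    // With volatile, the compiler must re-load x on every iteration, so a
    // store from an interrupt handler is eventually observed.
    volatile int x = 0;
    void wait_for_x(void) { while (x == 0) {} }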

What people actually did for lock-free programming in the pre-C++-11 days was
to use those non-portable intrinsics or inline assembler. If someone used
“volatile”, it was usually out of ignorance.

~~~
raphlinus
Ok, I want to get this right. The major point I'm trying to make is that
without the Vulkan memory model, the way to get atomic loads and stores is to
add the "volatile" memory qualifier to the buffer interface block. This has
basically the same set of problems as using "volatile" in pre-C++11 code.

There is _definitely_ precedent for using "volatile" in multithreaded code
(though, researching this, I think there may be stronger precedents in C than
in C++, see [1], particularly the "In C++11, normally never use volatile for
threading, only for MMIO" answer). I agree a good argument can be made that
it's incorrect. To be clear, the argument that volatile is incorrect for
multithreading is extremely strong from the perspective of standard C and C++,
but I think in practice it was considered quite usable.

I have a feeling we agree on the substance (a memory model is a dramatic
improvement over using volatile), but I can probably improve the wording.

[1]: [https://stackoverflow.com/questions/4557979/when-to-use-vola...](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading)

ETA: Here's what I consider a great reference for the widespread use of
volatile for multithreaded programming. It's from 2001, but by Andrei
Alexandrescu, who I think can be considered an authority:
[https://www.drdobbs.com/cpp/volatile-the-multithreaded-progr...](https://www.drdobbs.com/cpp/volatile-the-multithreaded-programmers-b/184403766)

~~~
raphlinus
Followup: I've changed the language a bit, I hope you agree it's an
improvement.

[https://github.com/raphlinus/raphlinus.github.io/commit/6b92...](https://github.com/raphlinus/raphlinus.github.io/commit/6b92d19492f8052776d84894dda000e5376d1ff1)

------
arianvanp
I previously read this: [https://cachemiss.xyz/blog/parallel-reduce-and-scan-
on-the-G...](https://cachemiss.xyz/blog/parallel-reduce-and-scan-on-the-GPU)

whose implementation seems a lot simpler than what is presented here.

What is the main difference here?

I've only dabbled a bit in Vulkan. (I wrote a pathtracer that I wanted to
speed up using compaction, for which I needed a prefix sum.)

~~~
raphlinus
It's a good blog, but it only solves a much simpler version of the problem -
it only computes the prefix sum of one workgroup's worth of data. It does not
carry the sums forward from workgroup to workgroup, which is the entire point
of the decoupled look-back approach.

The per-workgroup implementation in that blog is basically similar to mine,
though I've put more effort into maxing out the partition size (using as many
registers as are available).
