I write a lot of compute kernels in CUDA, and my litmus test is prefix sum, for two reasons.
First, you can implement it in pure CUDA C++ and max out the memory bandwidth of any NVIDIA or AMD GPU. The CUB library provides a state-of-the-art implementation (using decoupled lookback) against which one can compare new programming languages.
Second, it is one of the most basic parallel algorithm building blocks. Many parallel algorithms use it, and many parallel algorithms are "prefix-sum-like". If I am not able to write prefix sum from scratch efficiently in your programming language / library, I can't use it.
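To make concrete what "from scratch" means here: even the single-block building block of a scan requires explicit use of shared memory and barriers. A deliberately naive Hillis-Steele sketch in CUDA (my illustration; the real thing adds a multi-block pass, e.g. decoupled lookback, to reach full bandwidth):

```cuda
// Naive single-block inclusive scan (Hillis-Steele), for illustration only.
// Launch with one block of up to 1024 threads; assumes n <= blockDim.x.
__global__ void block_inclusive_scan(const int* in, int* out, int n) {
    __shared__ int buf[2][1024];              // double buffer in shared memory
    int tid = threadIdx.x;
    int src = 0, dst = 1;
    buf[src][tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        int v = buf[src][tid];
        if (tid >= offset) v += buf[src][tid - offset];
        buf[dst][tid] = v;
        __syncthreads();
        src ^= 1; dst ^= 1;                   // swap read/write buffers
    }
    if (tid < n) out[tid] = buf[src][tid];
}
```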
Every time someone shows a new programming language for compute, the examples provided are super basic (e.g. `map(...).fold(...)`), but I have yet to see a new programming language that can be used to implement the 2-3 most fundamental parallel algorithms from any parallel algorithms graduate course. For example, Futhark provides a prefix sum intrinsic, that just calls CUB - if you want to implement a prefix-sum-like algorithm, you are out of luck. In WGPU, it appears that prefix-sum will be an intrinsic of WHSL, which sounds like you would be out-of-luck too.
You mentioned WGPU and Vulkan. Do you know how to implement prefix-sum from scratch on these? If so, do you know how the performance compares against CUB?
> For example, Futhark provides a prefix sum intrinsic, that just calls CUB - if you want to implement a prefix-sum-like algorithm, you are out of luck.
It's not as fast as a hand-written scan in CUDA, nor as fast as the built-in scan in Futhark, but you can do it. Futhark isn't a low-level or GPU-specific language, so you can't take direct advantage of hardware-specific features.
Generally, Futhark doesn't compete performance-wise primitive-for-primitive, but can do application-level transformations that are infeasible for libraries, and impractical to do by hand. For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well, while there's no way to turn the CUB scan into a regular segmented scan, unless the original author thought of it. Similarly, if you are 'scan'ing the same array twice in a Futhark program, but with different operators or pre-scan transforms, the compiler will fuse the two operations into one, to reduce the number of accesses.
> If the builtin scan doesn't call CUB, how fast is a prefix sum in Futhark compared to CUDA using CUB?
Last I checked, it's about 50% the performance of CUB, but a sibling post describes a current effort to implement the decoupled lookback algorithm in Futhark's CUDA backend.
> What does "pretty well" mean? 95% of peak device throughput ? 60% ?
It's as fast as Futhark's single-dimensional scan. In exotic cases (when a single segment or "scan instance" fits in a thread block) it can be even faster.
> When a compute device costs $8,000, an efficiency of 90% means I'm throwing $800 out of the window.
Certainly! But people run much less than 90% efficient code on $8000 CPUs all the time, so I think there is definitely room for a high-level programming language that runs on GPUs. It's the usual performance-versus-productivity tradeoff, although Futhark's ambition (and to a large extent, reality) puts more emphasis on the performance side of things than when people usually use that argument.
While Futhark in principle cannot beat hand-written code, I will note that empirically it is not unusual for us to see performance increases when we take published GPU benchmark code and rewrite it straightforwardly in Futhark. Much GPU code out there is not perfectly written at the level of CUB, and while the Futhark compiler doesn't exactly generate highly inspired code, it gets the basics consistently right.
You claim there is a trade-off between usability and performance for GPUs. The reason Rust is so popular is that people used to believe the same thing for CPU programming languages, yet Rust showed that you can have a language with the abstraction power and safety of Scala, Haskell, etc. and the performance of C. This was believed to be impossible; it turns out it wasn't.
In the last NERSC survey, ~75% of the parallel programming models in use at US national HPC centers are MPI and/or OpenMP, ~22% are CUDA C and CUDA C++, and everything else makes up the rest (~3%).
CUDA C++ is the best we have because it lets you write code that extracts all the performance from the device, while also allowing one to expose that code through very high-level APIs (Thrust, CUTLASS, CUB, AmgX, ...), and even RAPIDS, CuPy, etc.
While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), a practical language (that actually gets used widely on HPC) must do so without sacrificing low-levelness.
A GPU programming language for the masses needs to match or improve on the performance of CUDA C, its abstraction capabilities, and its safety.
While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), or how "middle" a middle ground can be (OpenCL, SyCL, etc.), a practical language that actually gets used widely on HPC must be better than what we have without sacrificing anything that we have.
The hardest thing about CUDA is writing code that uses the memory hierarchy well. In my experience, this is the hardest thing to do well portably, and the hardest thing to abstract, which is why pretty much all device-only high level libraries like CUB or CUTLASS expose the memory hierarchy on their APIs, and this is, at least today, required to get good performance.
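For example (a sketch loosely following CUB's documented block-scan usage; identifiers mine): the caller of cub::BlockScan places the library's scratch storage in __shared__ memory themselves, so the memory hierarchy is right there in the API:

```cuda
#include <cub/cub.cuh>

// The caller, not the library, decides where the scratch space lives:
// CUB's block-wide scan takes a TempStorage that you place in __shared__ yourself.
__global__ void scan_block(int* data) {
    using BlockScan = cub::BlockScan<int, 128>;        // 128 threads per block
    __shared__ typename BlockScan::TempStorage temp;   // explicit shared-memory staging
    int i = blockIdx.x * 128 + threadIdx.x;
    int x = data[i];
    BlockScan(temp).InclusiveSum(x, x);                // block-wide inclusive scan
    data[i] = x;
}
```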
I find languages like Sequoia quite interesting, because they explore how to improve on this particular problem, which, right now, it appears any CUDA replacement would need to solve better than CUDA does. Languages like Regent/Legion, HPX, Futhark, Taichi, etc. focus more on just trading off low-level performance for high-level abstractions. This research is definitely useful to explore what other things a CUDA killer could do better, but I don't think any of these would end up taking even a little bit of CUDA's market share. At best, they might take a bit of OpenMP's market share, but that seems unlikely.
For programs running on the CPU, the programmer cannot manually control memory transfers across the memory hierarchy, so adding that ability to your programming language does not solve any problems.
So I'd say most languages don't expose it because there is no need for it.
Many languages for GPUs do expose the memory hierarchy to programmers (e.g. CUDA, OpenCL, even OpenACC).
> What is the core problem that has not yet been solved?
That using the memory hierarchy efficiently and correctly when writing GPU programs in those languages is hard and error prone. It is trivial to write code that performs horribly and/or has data-races and other forms of UB in CUDA, e.g., when dealing with global/shared/local memory.
Sequoia attempted to split a kernel into the algorithms that run at the different levels of the memory hierarchy and the memory owned by the kernel at each level, as well as how this memory is split as you "refine" the kernel, e.g., from kernel (global) -> thread blocks (global, shared, constant) -> warps (global, shared, constant, shared registers) -> threads (global, shared, constant, shared registers, local memory).
For many algorithms (e.g. GEMM, convolutions), how you partition global memory into thread blocks, and which parts of global memory one loads into shared memory and how, has a huge impact on performance.
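As a concrete illustration, the classic tiled pattern: each thread block stages tiles of A and B in __shared__ memory before computing on them. A simplified sketch (my own, assuming the matrix dimension n is a multiple of the tile size and the grid covers the matrix exactly):

```cuda
#define TILE 32

// Simplified tiled matrix multiply: each block copies TILE x TILE tiles of A and B
// from global into shared memory, synchronizes, computes on the fast copies, repeats.
__global__ void gemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```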
Interestingly, a lot of programmers (who program for the CPU) worry about caches a great deal, despite:
- Programmers being unable to control caches, at least directly and reliably.
- Languages (e.g. C/C++) having no direct way of expressing memory constraints.
This suggests to me that even in CPU programming there is something important missing, and I imagine that a suitable explicit representation of the memory hierarchy might be it. A core problem is that it's unclear how to abstract a program so it remains performant over different memory hierarchies.
People who worry about caches on CPUs can't control them directly, so instead they have to, e.g., "block"/"tile" their loops in such a way that the caches are used efficiently, which is hard.
On different CPUs (e.g. with different cache sizes), these loops need to be tiled differently. If you want a single binary to perform well across the board, it just needs to support using different tile sizes depending on the hardware.
However, this is usually not enough. For example, you might have some data in memory (a vector of f16s on an Intel CPU) that you first need to decompress into a vector of f32s before operating on it.
You probably want to decompress just enough to fill the cache, operate, and recompress, to save memory and memory bandwidth. For that you need a "scratchpad" (or __shared__ memory in CUDA), e.g., at the different levels of the cache hierarchy.
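In CUDA terms the pattern looks roughly like this sketch, with __shared__ standing in for the scratchpad and a placeholder "operate" step (the kernel and the operation are made up for illustration):

```cuda
#include <cuda_fp16.h>

// Decompress a block-sized chunk of f16s into a float scratchpad, operate on it
// there, and write the recompressed result back to global memory.
// Launch with 256 threads per block to match the scratchpad size.
__global__ void process_chunks(__half* data, int n) {
    __shared__ float chunk[256];                            // the block's scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[threadIdx.x] = __half2float(data[i]);  // decompress
    __syncthreads();
    if (i < n) chunk[threadIdx.x] *= 2.0f;                  // placeholder "operate" step
    __syncthreads();   // barriers mark the phases; needed once "operate" reads neighbours
    if (i < n) data[i] = __float2half(chunk[threadIdx.x]);  // recompress
}
```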
The compiler needs to know, for the different architectures, what their L1/L2/L3/shared/constant/texture/... memory sizes are, and either pick these sizes for you so that the whole cache is used, or let the user pick them and have the program fail if run on hardware where that choice isn't valid. NVCC (and pretty much every production compiler) can bundle code for different architectures within a single binary and pick the best at run-time.
So if your L3 can be 4, 8, 16, or 32 MB, your compiler can bundle 4 copies of your kernel, query the CPU cache size at initialization, and be done with it.
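On the GPU side, the CUDA analogue of that bundling is to query the device once at startup and dispatch to a kernel variant tuned for its scratchpad size; a rough sketch (the kernel variants here are hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel variants, each tuned for a different shared-memory budget.
__global__ void kernel_small_tiles(float* d, int n) { /* tuned for small scratchpads */ }
__global__ void kernel_large_tiles(float* d, int n) { /* tuned for large scratchpads */ }

void launch_best(float* d, int n) {
    int smem_per_block = 0;
    // Ask the runtime how much shared memory ("scratchpad") a block may use on device 0.
    cudaDeviceGetAttribute(&smem_per_block,
                           cudaDevAttrMaxSharedMemoryPerBlock, /*device=*/0);
    if (smem_per_block >= 64 * 1024)
        kernel_large_tiles<<<256, 256>>>(d, n);
    else
        kernel_small_tiles<<<256, 256>>>(d, n);
}
```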
The way in which you abstract the memory hierarchy is by just partitioning your problem's memory.
If you have an f16 vector in global memory, you might want to partition it into N different chunks, e.g., recursively, until a chunk, if decompressed into f32s, would fit into a faster memory (e.g. the L3). At that point you do the decompression and continue partitioning the f32s down to some threshold (e.g. the L1), at which point you operate.
That is, a kernel has multiple pieces:
- a state initialization (e.g. owns the global memory)
- partitioning: how to partition that into smaller subproblems
- the merge: what gets computed at each level of the partitioning (e.g. how are the results of the partitions merged together to compute the final result)
- the leaf computation: what gets computed at the "finest" level
The programmer doesn't know a priori how many levels of partitioning will be necessary; that depends on the memory hierarchy. But they do know everything else.
For example, for a sum reduction:
- the programmer knows how to perform a leaf computation: by summing a whole leaf partition and putting the result somewhere in the next larger memory level.
- a CUDA or CPU thread can sum N elements in parallel using SIMD, and put the result in inter-warp shared memory or the L1
- the programmer knows how to merge computations: by summing the results of the current level, and putting them in the next level of the hierarchy (inter-warp -> inter-block -> inter-grid, or L1 -> L2 -> L3 -> RAM)
- in a GPU / CPU, depending on the actual level, the compiler can use different operations (e.g. SIMD shuffles, inter warp shuffles with warp-level synchronization, inter block shuffles with block-level synchronization, local sum + atomic memory operation to write the result, etc.)
- the programmer knows how to partition an input vector into N pieces (N CUDA threads, N warps, N blocks, N grids, N cpu threads, N numa domains, etc.)
A good programming model needs to allow the user to express what they know, and abstract away what they cannot know (how big the caches are, how many cache levels, how many threads of execution, etc.)
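To make the sum-reduction example concrete, here is a minimal CUDA sketch of that structure (my own illustration, assuming float data and a grid-stride leaf): the leaf computation in registers, a warp-level merge via shuffles, a block-level merge via __shared__, and a grid-level merge via an atomic:

```cuda
__device__ __forceinline__ float warp_merge(float v) {
    // Merge within a warp using register shuffles (no shared memory needed).
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// *result must be zeroed before launch.
__global__ void sum_reduce(const float* x, float* result, int n) {
    // Leaf: each thread sums a strided slice of its partition, in registers.
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += x[i];

    __shared__ float warp_sums[32];            // one slot per warp in the block
    v = warp_merge(v);                         // merge level 1: warp (registers)
    if (threadIdx.x % 32 == 0) warp_sums[threadIdx.x / 32] = v;
    __syncthreads();
    if (threadIdx.x < 32) {                    // merge level 2: block (__shared__)
        int nwarps = (blockDim.x + 31) / 32;
        v = (threadIdx.x < nwarps) ? warp_sums[threadIdx.x] : 0.0f;
        v = warp_merge(v);
        if (threadIdx.x == 0) atomicAdd(result, v);   // merge level 3: grid (global)
    }
}
```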
Here I'm a bit surprised. For theoretical reasons (I've never gotten my hands dirty with GPU implementations, I'm afraid to admit), I would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection on:
- Number of levels (e.g. how many layers of caches).
- Size of levels.
- Cost of memory access at each level.
- Cost of moving data from a level to the next level up/down.
The last 3 numbers don't have to be absolute, I imagine, but can be relative, e.g.: size(Level3) = 32 * size(Level2). This data would be useful to decide how to partition compute jobs as you describe, and do so in a way that is (somewhat) portable.
There are all manner of subtle issues, e.g. what counts as the cost of memory access and data movement (average, worst case, single byte, DMA ...), and what the compiler and/or runtime should do (if anything) if they are violated. In abstract terms: what is the semantics of the language representation of the memory hierarchy.
Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array, and have DMA-like send/receives to move blocks of data between levels. Is that a good idea?
Equally subtle, and I already alluded to this above, is when to make this information available. Since the processor doesn't change during the computation, I imagine that using a multi-stage meta-programming setup (see e.g. [1] for a rationale) might be the right framework: you have a meta-program specialising the program doing the compute you are interested in. C++ uses templates for program specialisation, but C++'s interface for meta-programming is not easy to use. It's possible to do much better.
As you wrote above, programming "in those languages is hard and error prone", and the purpose of language primitives is to catch errors early. What errors would a compiler / typing system for such a language catch, ideally without impeding performance?
> would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection
That's an interesting thought. I'm not sure I agree, maybe?
The user's job is to express how to partition the problem. To do that properly, they need to know, e.g., "how many bytes can I fit in the next partition per unit of compute", so the language/run-time has to tell them that.
I don't know if knowing the costs of memory access at each level or across levels is useful. It is a reasonable assumption that, at the next level, memory accesses are at least one order of magnitude faster, and that synchronizing across the hierarchy is very expensive and should be minimized. That's a good assumption to write your programs with, and knowing the actual costs does not really help you there.
> Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array
I think CUDA's __shared__ memory is quite good. E.g. per kernel, a way to obtain memory in the level of the hierarchy that the kernel is currently working on. Nobody has extended CUDA to multiple levels because there hasn't been a need, but I expect these programs to be "recursive" in the sense that they recursively partition a problem, and for __shared__ memory on each recursion level to give you memory at a different level of the hierarchy.
To move memory across levels of the hierarchy, you would just use raw pointers, with appropriate reads/writes (e.g. atomic ones). The exact instructions that get emitted would then depend on which level of the hierarchy you are. For that the compiler needs to know something about the architecture it is compiling for, but that seems unavoidable.
Thanks, I actually hadn't read the Terra paper. I'll try to get to it this weekend. I think that using a Lua-like language as the meta-programming layer for your language (e.g. Regent) is an interesting approach, but it is not necessary.
For example, Scheme, Lisp, Haskell, D and Rust have shown that you can do meta-programming quite well in the same language you do normal programming in, without having to learn a completely different "meta-language".
I particularly like Rust procedural macros. They are normal Rust code that just gets compiled first and then executed (with your source code as input) during compilation, with some twists to make that quite fast (e.g. the compiler parses an AST, and the proc macros do AST->AST folds).
If one wants to delay that until run-time, one should just do so (e.g. you just embed your compiler in your app, and when your proc macros run, it's "run-time"). No need to layer multiple languages, but maybe the Terra paper convinces me otherwise.
I didn't mean to push Terra in particular, and I agree that doing meta-programming is easier in the same language you do normal programming in. I just mentioned the Terra paper since it provided a rationale for using meta-programming in high-performance programming (I could have pointed to others making the same argument).
Rust's procedural macros are standard compile-time meta-programming.
> the proc macros do AST->AST folds
Last time I looked (I have not used Rust for a while), procedural macros operate over tokens, not ASTs. Maybe that changed?
That sounds like a great project, one I heartily recommend to people wanting to get their feet wet.
First, the basic principles of how to write an efficient prefix sum are the same as before. Definitely read [1]. Also see [2] for an exploration into how much subgroups help over threadgroup shared memory. In fact, [2] has much of the answer you seek, as it's already written in Vulkan and GLSL (compiling to SPIR-V), though it does miss a direct performance comparison to CUB.
Second, doing this depends on WebGPU exposing threadgroup shared memory and subgroup operations. I'm not sure where the standard and implementations are on this; there are significant challenges. For example, subgroup shuffle is not available in DX12, though subgroup reduce operations such as WavePrefixSum are. So in general an implementation will need to do runtime detection of capabilities and choose the most suitable kernel based on that.
In theory, it should be possible to achieve comparable performance. But it's likely that WebGPU implementations haven't really been tuned for performance yet, while CUB has seen a ton of work. So, as always, the thing to do is implement and measure.
Followup: I did a very rough prototype on my current Vulkan runner; all the good stuff is in the prefix.comp kernel code. It's not doing the fancy stuff, in particular it doesn't parallelize the lookback code, so that ends up taking a bunch of time. But it does give an example of how to use the Vulkan memory model, for which I've seen very little in the way of published material.
If anyone is interested in tuning this and doing a performance comparison, please get in touch. It'd make a good blog post, I think, but already the time I've spent on it is a distraction from the renderer.
None of the NVIDIA compute devices do, right? Do AMD GPUs implement this?
CPUs from Intel, ARM, RISC-V, MIPS, and Power support horizontal SIMD add instructions that perform a tree reduction.
I'm skeptical of GPUs being able to actually implement this in hardware. A CPU puts the whole vector in a single 512-bit register, but in a GPU, even for a single half-warp, the operation cannot take 16x 64-bit registers as operands. The operation would need to be performed on memory, moving the contents into the different registers, doing warp-local shuffles for the reduction, etc. So we would be talking about quite a big macro-op here. I wouldn't see the point of doing this in hardware, when the compiler or driver could just call a library function that does this for you.
But then we are back to the issue that, if you can't write those library functions in pure Vulkan, you are quite limited w.r.t what one can do.
There are many prefix-sum like algorithms, like, e.g., a weighted prefix-sum (prefix_sum(x * w)) [0]. In a GPU, you don't want to do the vector multiply first followed by the prefix-sum, but do it all at once. You can probably work around this in Vulkan by doing the vector multiply in shared memory, and then the prefix sum there (while in CUDA you wouldn't need shared memory for that at all), but that won't really work if the reduction is more complicated than just a sum.
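For reference, the fused version in CUDA at warp level, entirely in registers, looks roughly like this (a toy sketch for a single warp; a real kernel adds block- and grid-level stages on top):

```cuda
// Weighted inclusive prefix sum over one warp: scan of x[i] * w[i], with the
// multiply fused into the load and no shared memory involved.
// Toy example: launch as warp_weighted_scan<<<1, 32>>>(x, w, out).
__global__ void warp_weighted_scan(const float* x, const float* w, float* out) {
    int lane = threadIdx.x & 31;
    float v = x[threadIdx.x] * w[threadIdx.x];        // fused pre-scan transform
    for (int offset = 1; offset < 32; offset <<= 1) {
        float up = __shfl_up_sync(0xffffffffu, v, offset);
        if (lane >= offset) v += up;                  // Kogge-Stone warp scan
    }
    out[threadIdx.x] = v;
}
```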
Right, I haven't looked at PTX or Intel Gen ASM, but on AMD this compiles to a lg(n) series of add instructions each with a warp local shuffle (v_add_nc_u32 with a row_shr: modifier). It's not going to shared memory. I've experimented a bit with hand-rolling scans (using subgroupShuffleUp) and get similar but not identical asm (that intrinsic gives you a wave_ror: modifier). So I think you could write this using even lower-level intrinsics and also get good performance.
I strongly suspect Intel and NVidia are similar here, but haven't experimented.
There's actually a lot more to say here, but it really gets into the weeds. To me, the fundamental question is: can you write Vulkan 1.2 compute shaders that exploit most of the performance potential of GPU hardware in a portable way? I believe the answer is yes, but it requires validation. It is, however, not a good approach if the goal is to squeeze out the last 5% or 10% of performance.
That makes sense. Given that one can write good compute shaders with Vulkan, the next obvious question worth answering to me is whether one should do that.
If I had to write a new foundational library, e.g., something like CUTLASS or cuFFT for device code, I wouldn't start with code that fully uses the device, but I'd like to be able to get there if I want to.
If I need to reach for CUDA to do that, then Vulkan, WebGPU, etc. will need to have great FFI with CUDA C++. Nobody wants to have to "drop up" to a higher-level CUDA C++ API like CUTLASS, and then be forced to "dumb down" that API to a C FFI wrapper to call it.
It would be like having Rust, but without any unsafe code support except for C FFI, C actually not existing, and having to drop "down" to C++ for everything, and then dumbing down the C++ to a C FFI for interoperability. You might as well just stick with C++ and spare yourself the pain.
I hope Vulkan and WebGPU get enough extensions so that foundational libraries can be written in those languages.
@Raph I don't have Zulip, but you have motivated me to play a lot with Vulkan compute over the last few weekends, and I am a bit disappointed.
I think I was expecting to have a similar level of control as with CUDA or OpenCL, but it turns out that wgpu/Vulkan compute is more like "OpenGL compute shaders", which is "compute for a rendering pipeline", something quite different from "CUDA/OpenCL compute", which has nothing to do with rendering.
If all you want is to stick a compute shader in your rendering pipeline, then wgpu/vulkan are great IMO.
If you want to do CFD, linear algebra, or machine learning, then it feels like fighting against a technology that just wasn't meant to be used for that. I managed to write a tiny explicit CFD code with Vulkan compute just fine, but the moment I wanted to implement a slightly more complicated algorithm, I was limited by things like how to properly do linear algebra (e.g. I needed an algebraic multigrid preconditioner and a conjugate gradients implementation, and ended up fighting against how to control what gets put into shared memory and when).
When trying some NN kernels, I got stuck on the same issue. I managed to implement an FFT and convolutions with Vulkan, but I couldn't get the convolutions to achieve the same performance as with CUDA, because with Vulkan I had to use a "pull" algorithm (read from all points in the window, write to one point), and that just performed way worse than the "push" algorithms that are possible with CUDA (maintain a shared memory cache where you write, and as you read one value from the window, write its contribution to multiple locations, then do a full cache write to global memory). In numbers, I went from ~150 Gb/s throughput with Vulkan to ~700 Gb/s with CUDA.
I think maybe Vulkan/SPIRV could benefit from better compilers that take the "pull" algorithm and compile it to a "push" one using shared memory, but you are then at the mercy of the compiler for a 4x perf diff.
I think that if I have a rendering pipeline in which I want to integrate some compute, then wgpu/Vulkan are great, particularly because chances are the compute you need is the per-"vertex/pixel/..." kind of naively-massively-parallel compute. But if I just have a raw compute pipeline, then it feels like using the wrong tool for the job.
Great paper. I believe it's possible to do this efficiently in Vulkan + SPIR-V, and also adapt it to WebGPU given the caveats mentioned above, but it's hard to say for sure without doing it. Clearly that paper uses very sophisticated techniques and would not be trivial to implement - it goes considerably past the [2] I linked.
I agree 100% that prefix sum is an excellent litmus test for this platform.
I'll give one example to fill in more specifics. The fence-free descriptor updates described in section 4.4 of your linked paper depend on 64 bit atomics. These are available using the VK_KHR_shader_atomic_int64 extension, and appear (using vulkan.gpuinfo.org) to be available on NV and AMD hardware but not Intel (and basically not at all on mobile). On Vulkan, you query at runtime for the presence of this extension, and swap in the appropriate kernel based on the result of the query; optimizing for hardware without 64 bit atomics might give you a different approach.
Not having to deal with all this is the reason NVidia owns the space :)
I am currently working on implementing decoupled lookback in Futhark, as my master's thesis. The development is under the branch `onepassscan`. It is still missing some features; for example, it cannot yet do map'o'scan, meaning it cannot fuse the map function into the scan optimization. It is likely that the implementation will only be used for Nvidia, since the cache guarantees are only promised to hold for Nvidia.
The current version of Futhark is using the reduce-then-scan strategy.
I'll note that these "cache guarantees" in the Vulkan world are the Vulkan 1.2 memory model, and are supported by the latest drivers for AMD, Intel, and NVidia [1]. This recent change is one big reason I'm saying Vulkan + SPIR-V is getting ready for real compute workloads, and this wasn't the case even months ago.
Would you be able to list them? I'm asking because I do develop programming languages for parallel hardware. It would be very useful for me to look at your examples and apply them to my PL design ideas.
An alternative way of phrasing my question could be: what primitives do you recommend a language should provide that's currently missing in the languages you evaluate?
> what primitives do you recommend a language should provide that's currently missing in the languages you evaluate?
Hardware primitives and powerful capabilities for defining safe abstractions over those.
Chances are that, e.g., I need to implement my own "primitive-like" operation (e.g. my own sort or scan). If your language provides a "partition" primitive, but not the tools to implement my own, I can't use your language.
When people create languages for GPUs, for some reason they add the C++ STL or similar as primitives, instead of providing users the tools to write such libraries themselves. The consequence is that those languages end up not being practical to use, and nobody uses them.
How to abstract "hardware primitives" in a way that can be instantiated to many GPU architectures without performance penalty and be useful for higher-level programming is not so clear. How would you, to take an example from the CPU world, fruitfully abstract the CPU's memory model? As far as I'm aware that's not a solved problem in April 2020, and write papers on this subject are still appearing in top conferences .
There exists a language that can optimally target AMD's ROCm, NVIDIA's PTX, and multicore x86-64 CPUs: CUDA C++. This language has 3 production quality compilers for these targets (nvcc, hip, and pgi).
Unfortunately, this language is from 2007, and the state of the art has not really been improved since then (except for the evolution of CUDA proper).
It would be cool for people to work on improving the state of the art, providing languages that are simpler, perform better, or are more high-level than CUDA, without sacrificing anything (i.e. true improvements).
There have been some notable experiments in this regard, e.g., Sequoia for the Roadrunner scratchpads had very interesting abstraction capabilities over the cache hierarchy that could have led to a net improvement over CUDA __shared__ memory.
Most newer languages have the right goal of trying to simplify CUDA, but they end up picking trade-offs that only allow them to do so by sacrificing a lot of performance. That's not an interesting proposition for most CUDA developers - the reason they pick up CUDA is performance, and sacrificing ~5% might be acceptable if the productivity gains are there, but a 20-30% perf loss isn't acceptable - too much money involved.
One can simplify CUDA while retaining performance by restricting a language to GPUs and a particular domain, e.g., computer graphics / gfx shaders or even stencil codes or map-reduce, etc.
However, a lot of the widely advertised languages try to (1) target both GPUs and CPUs, compromising on some minimum common denominator of features, (2) support general-purpose compute kernels, and (3) significantly simplify CUDA, often doing so by just removing the explicit memory transfers, which compromises performance. These languages are quite neat, but they aren't really practical, because they aren't really true improvements over CUDA.
Most companies I work with that use CUDA today are using the state of the art (C++17 or 20 extensions), experimental libraries, drivers, etc. So it isn't hard to sell them on a new technology, _if it is better than what they are already using_.
The Regent language (Legion is Regent's runtime) is yet another async task-graph-based PGAS language/run-time, similar to, e.g., HPX, but leaving out what in my opinion made Sequoia interesting (e.g. the memory hierarchy abstraction capabilities).
As far as I understand, the idea in Regent is that the memory hierarchy is used dynamically, since the programmer cannot know much about the memory hierarchy at program-writing time anyway. So dynamic scheduling is used to construct the memory hierarchy and schedule at run-time. The typical use case for Regent is physicists writing e.g. weather / particle physics simulations; they are not aware of the details of L1/L2/L3/memory/network/cluster sizes.
This is probably quite different from your use case.
Those applications do linear algebra, and pretty much any kind of linear algebra on the GPU, including simple vector-vector dot-products, requires a very careful usage of the memory hierarchy.
For doing a dot-product on the GPU, you need to take your vectors in global memory, and:
- split them into chunks that will be processed by thread blocks
- allocate shared memory for storing partial reductions from warps within a thread-block
- decide how many elements a thread operates on, and allocate enough registers within a thread
- do thread-block reductions on shared memory
- communicate thread-block reduction to global memory
- do a final inter-thread block reduction
A sufficiently-smart compiler can take a:
sum = 0.
for xi, yi in zip(x, y): sum += xi * yi
and transform it into an algorithm that does the above (e.g. a call to CUB).
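Hand-written, those steps look roughly like this in CUDA (a sketch of the standard pattern, with the block-level reduction done in shared memory and the final inter-block merge done with an atomic; not tuned):

```cuda
// Dot product following the steps listed above: partition into blocks,
// per-thread partial sums in registers, block-level tree reduction in
// __shared__ memory, then an inter-block merge in global memory via atomicAdd.
// Launch: dot<<<blocks, threads, threads * sizeof(float)>>>(x, y, result, n);
// *result must be zeroed beforehand and threads must be a power of two.
__global__ void dot(const float* x, const float* y, float* result, int n) {
    extern __shared__ float partial[];                 // one slot per thread
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += x[i] * y[i];                            // per-thread work in registers
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // block-level tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);   // inter-block merge
}
```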
But if you are not able to implement the above in the language, it suffices for the user to run into the need to do so once (e.g. your compiler does not optimize their slightly different reduction efficiently), for the value proposition of your language to suffer a huge hit (if I need to learn CUDA anyways, I might just use CUDA from the start; if I don't need performance, I wouldn't be using a super-expensive GPU, etc.).
This is IMO why CUDA is successful and pretty much all other languages are not, maybe with the exception of OpenACC, which has great CUDA interop (so you can start with OpenACC and use CUDA inline for a single kernel if you need to).
The 6 requirements you list for doing a dot-product on the GPU can be phrased in the abstract as a constraint-solving problem where the number of thread blocks, the cost of communication, etc. are parameters.