First, you can implement it in pure CUDA C++ and max out the memory bandwidth of any NVIDIA or AMD GPU. The CUB library provides a state-of-the-art implementation (using decoupled look-back) that one can compare new programming languages against.
Second, it is one of the most basic parallel algorithm building blocks. Many parallel algorithms use it, and many parallel algorithms are "prefix-sum-like". If I am not able to write prefix sum from scratch efficiently in your programming language / library, I can't use it.
Every time someone shows a new programming language for compute, the examples provided are super basic (e.g. `map(...).fold(...)`), but I have yet to see a new programming language that can be used to implement the 2-3 most fundamental parallel algorithms from any parallel algorithms graduate course. For example, Futhark provides a prefix sum intrinsic that just calls CUB - if you want to implement a prefix-sum-like algorithm, you are out of luck. In WGPU, it appears that prefix sum will be an intrinsic of WHSL, which sounds like you would be out of luck there too.
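For concreteness, the classic work-efficient scan that those courses teach can be sketched in a few lines. This is a sequential Python rendering of the Blelloch up-sweep/down-sweep exclusive scan; on a GPU each depth level of each phase would run as one parallel step across a thread block, and the power-of-two restriction is just a simplification for the sketch:

```python
# Work-efficient exclusive prefix sum (Blelloch-style), sketched
# sequentially. On a GPU, each iteration of the inner loops would run
# as one parallel step over a thread block.
def exclusive_scan(xs):
    n = len(xs)
    assert n and (n & (n - 1)) == 0, "sketch assumes power-of-two length"
    a = list(xs)
    # Up-sweep (reduce): build partial sums in a tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = a[i - d]
            a[i - d] = a[i]
            a[i] += t
        d //= 2
    return a

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

The fast single-pass implementations (CUB's decoupled look-back) avoid the second full pass over memory, which is what makes them memcpy-speed; the tree structure above is still the core idea.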
You mentioned WGPU and Vulkan. Do you know how to implement prefix sum from scratch on these? If so, do you know how the performance compares against CUB?
Futhark doesn't call CUB (it also targets OpenCL), and here is an asymptotically work-efficient prefix sum in pure Futhark: https://gist.github.com/athas/611e6b9a76b382ec079f979ec7fb85...
It's not as fast as a hand-written scan in CUDA, nor as fast as the built-in scan in Futhark, but you can do it. Futhark isn't a low-level or GPU-specific language, so you can't take direct advantage of hardware-specific features.
Generally, Futhark doesn't compete performance-wise primitive-for-primitive, but can do application-level transformations that are infeasible for libraries, and impractical to do by hand. For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well, while there's no way to turn the CUB scan into a regular segmented scan, unless the original author thought of it. Similarly, if you are 'scan'ing the same array twice in a Futhark program, but with different operators or pre-scan transforms, the compiler will fuse the two operations into one, to reduce the number of accesses.
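To make the "map a scan over a two-dimensional array" example concrete, here is what a regular segmented scan computes, sketched in plain Python (this shows only the semantics, not Futhark's actual GPU schedule):

```python
from itertools import accumulate

# A regular segmented scan: one independent inclusive scan per row.
# Mapping a scan over a 2D array computes exactly this; the point in
# the comment above is that the compiler can schedule it efficiently,
# which a library scan primitive cannot do for you.
def map_scan(rows):
    return [list(accumulate(row)) for row in rows]

print(map_scan([[1, 2, 3], [4, 5, 6]]))  # [[1, 3, 6], [4, 9, 15]]
```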
If the builtin scan doesn't call CUB, how fast is a prefix sum in Futhark compared to CUDA using CUB?
> For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well,
What does "pretty well" mean? 95% of peak device throughput? 60%?
When a compute device costs $8,000, an efficiency of 90% means I'm throwing $800 out of the window.
Last I checked, it's about 50% the performance of CUB, but a sibling post describes a current effort to implement the decoupled lookback algorithm in Futhark's CUDA backend.
> What does "pretty well" mean? 95% of peak device throughput? 60%?
It's as fast as Futhark's single-dimensional scan. In exotic cases (when a single segment or "scan instance" fits in a thread block) it can be even faster.
> When a compute device costs $8,000, an efficiency of 90% means I'm throwing $800 out of the window.
Certainly! But people run much less than 90% efficient code on $8000 CPUs all the time, so I think there is definitely room for a high-level programming language that runs on GPUs. It's the usual performance-versus-productivity tradeoff, although Futhark's ambition (and to a large extent, reality) puts more emphasis on the performance side of things than when people usually use that argument.
While Futhark cannot, in principle, beat hand-written code, I will note that empirically it is not unusual for us to see performance increases when we take published GPU benchmark code and rewrite it straightforwardly in Futhark. Much GPU code out there is not perfectly written at the level of CUB, and while the Futhark compiler doesn't exactly generate highly inspired code, it gets the basics consistently right.
In the last NERSC survey, 75% of parallel models in use in US national HPC centers are MPI and/or OpenMP, ~22% is CUDA C and CUDA C++, and all others are the rest (~3%).
CUDA C++ is the best we have because it lets you write code that extracts all the performance from the device, while also allowing one to expose that code through very high-level APIs (Thrust, CUTLASS, CUB, AmgX, ...), and even RAPIDS, CuPy, etc.
While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), a practical language (that actually gets used widely on HPC) must do so without sacrificing low-levelness.
A GPU programming language for the masses needs to provide, or improve on, the performance of CUDA C, its abstraction capabilities, and its safety.
While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), or how "middle" a middle ground can be (OpenCL, SyCL, etc.), a practical language that actually gets used widely on HPC must be better than what we have without sacrificing anything that we have.
The hardest thing about CUDA is writing code that uses the memory hierarchy well. In my experience, this is the hardest thing to do well portably, and the hardest thing to abstract, which is why pretty much all device-only high level libraries like CUB or CUTLASS expose the memory hierarchy on their APIs, and this is, at least today, required to get good performance.
I find languages like Sequoia quite interesting, because they explore how to improve on this particular problem, which, right now, it appears any CUDA replacement would need to solve better than CUDA does. Languages like Regent/Legion, HPX, Futhark, Taichi, etc. focus more on trading off low-level performance for high-level abstractions. This research is definitely useful for exploring what other things a CUDA killer could do better, but I don't think any of these will end up taking even a little bit of CUDA's market share. At best, they might take a bit of OpenMP's market share, but even that seems unlikely.
> So I'd say most languages don't expose it because there is no need for it.
Many languages for GPUs do expose the memory hierarchy to programmers (e.g. CUDA, OpenCL, even OpenAcc).
> What is the core problem that has not yet been solved?
That using the memory hierarchy efficiently and correctly when writing GPU programs in those languages is hard and error prone. It is trivial to write code that performs horribly and/or has data-races and other forms of UB in CUDA, e.g., when dealing with global/shared/local memory.
Sequoia attempted to split a kernel into the algorithms that run at the different levels of the memory hierarchy, and the memory the kernel owns at each level, as well as how that memory is split as you "refine" the kernel, e.g., from kernel (global) -> thread blocks (global, shared, constant) -> warps (global, shared, constant, shared registers) -> threads (global, shared, constant, shared registers, local memory).
For many algorithms (e.g. GEMM, convolutions), how you partition global memory into thread blocks, and which parts of global memory one loads into shared memory and how, has a huge impact on performance.
- Programmers being unable to control caches, at least directly, and
- Languages (e.g. C/C++) having no direct way of expressing the memory hierarchy.
This suggests to me that even in CPU programming there is something important missing, and I imagine that a suitable explicit representation of the memory hierarchy might be it. A core problem is that it's unclear how to abstract a program so it remains performant over different memory hierarchies.
On different CPUs (e.g. with different cache sizes), these loops need to be tiled differently. If you want a single binary to perform well across the board, it just needs to support different tile sizes depending on the hardware.
Usually, however, this is not enough. For example, say you have some data in memory (a vector of f16s on an Intel CPU), and to operate on it, you first need to decompress it into a vector of f32s.
You probably want to decompress just enough to fill the cache, operate, and recompress, to save memory and memory bandwidth. For that you need a "scratchpad" (or __shared__ memory in CUDA), e.g., at the different levels of the cache hierarchy.
The compiler needs to know, for the different architectures, what their L1/L2/L3/shared/constant/texture/... memory sizes are, and either pick these sizes for you so the whole cache is used, or let the user pick them, making the program fail if run on hardware where the choice isn't correct. NVCC (like pretty much every production compiler) can bundle code for different architectures within a single binary and pick the best at run time.
So if your L3 can be 4, 8, 16, or 32 MB, your compiler can bundle four copies of your kernel, query the CPU cache size at initialization, and be done with it.
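As a toy illustration of that bundling idea (the variant sizes, the 4-byte element assumption, and the function names here are all made up), the runtime selection could be sketched like this:

```python
# Hypothetical sketch of "bundle several tilings, pick at runtime":
# the same blocked summation is available for a few candidate tile
# sizes, and the variant is chosen from the detected cache size.
TILE_VARIANTS = [4 * 1024, 8 * 1024, 16 * 1024, 32 * 1024]  # bytes

def pick_tile(l3_bytes, elem_size=4):
    # Largest bundled tile whose working set fits the detected cache;
    # fall back to the smallest variant on tiny caches.
    fitting = [t for t in TILE_VARIANTS if t <= l3_bytes]
    return (fitting[-1] if fitting else TILE_VARIANTS[0]) // elem_size

def tiled_sum(xs, l3_bytes):
    tile = pick_tile(l3_bytes)
    total = 0.0
    for start in range(0, len(xs), tile):
        total += sum(xs[start:start + tile])  # one cache-resident block
    return total

print(pick_tile(16 * 1024))  # 4096 elements per tile
```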
The way in which you abstract the memory hierarchy is by just partitioning your problem's memory.
If you have an f16 vector in global memory, you might want to partition it into N different chunks, e.g., recursively, until a chunk, were it decompressed into f32s, would fit into a faster memory (e.g. the L3). At that point you might do the decompression, and continue partitioning the f32s up to some threshold (e.g. the L1), at which point you operate.
That is, a kernel has multiple pieces:
- a state initialization (e.g. owns the global memory)
- partitioning: how to partition that into smaller subproblems
- the merge: what gets computed at each level of the partitioning (e.g. how are the results of the partitions merged together to compute the final result)
- the leaf computation: what gets computed at the "finest" level
The programmer doesn't know a priori how many partitions will be necessary; that depends on the memory hierarchy. But they do know everything else.
For example, for a sum reduction:
- the programmer knows how to perform a leaf computation: by summing a whole leaf partition and putting the result somewhere in the next larger memory level.
- a CUDA or CPU thread can sum N elements in parallel using SIMD, and put the result in inter-warp shared memory or the L1
- in a GPU / CPU, depending on the actual level, the compiler can use different operations (e.g. SIMD shuffles, inter-warp shuffles with warp-level synchronization, inter-block shuffles with block-level synchronization, a local sum plus an atomic memory operation to write the result, etc.)
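The four kernel pieces above (state, partitioning, merge, leaf computation) can be sketched as a recursive reduction. This is a plain Python illustration, with `levels` and `fanout` standing in for the hierarchy depth and split factor that the compiler/runtime, not the programmer, would supply:

```python
# The four kernel pieces, sketched as a recursive reduction. `levels`
# stands in for the depth of the memory hierarchy, which the
# compiler/runtime (not the programmer) would know.
def hierarchical_sum(xs, levels, fanout=4):
    if levels == 0 or len(xs) <= fanout:
        return sum(xs)                      # leaf computation
    chunk = (len(xs) + fanout - 1) // fanout
    partials = [hierarchical_sum(xs[i:i + chunk], levels - 1, fanout)
                for i in range(0, len(xs), chunk)]   # partitioning
    return sum(partials)                    # merge at this level

print(hierarchical_sum(list(range(100)), levels=3))  # 4950
```

The programmer writes the leaf and the merge; how deep the recursion goes, and which memory each level's partials live in, is exactly the part only the compiler/runtime can fill in.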
A good programming model needs to allow the user to express what they know, and abstract away what they cannot know (how big the caches are, how many cache levels, how many threads of execution, etc.)
> abstract away what they cannot know
- Number of levels (e.g. how many layers of caches).
- Size of levels.
- Cost of memory access at each level.
- Cost of moving data from a level to the next level up/down.
The last 3 numbers don't have to be absolute, I imagine, but can be
relative, e.g.: size(Level3) = 32 * size(Level2). This data would be useful to
decide how to partition compute jobs as you describe, and do so in a
way that is (somewhat) portable.
There are all manner of subtle issues, e.g. what counts as the cost of memory access and data movement (average, worst case, single byte, DMA, ...), and what the compiler and/or runtime should do (if anything) if they are violated. In abstract terms: what is the semantics of the language's representation of the memory hierarchy?
Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array, and have DMA-like send/receives to move blocks of data between levels. Is that a good idea?
Equally subtle, and I already alluded to this above, is when to make this information available. Since the processor doesn't change during computation, I imagine that a multi-stage meta-programming setup (see e.g. the Terra paper referenced below for a rationale) might be the right framework: you have a meta-program specialising the program doing the compute you are interested in. C++ uses templates for program specialisation, but C++'s interface for meta-programming is not easy to use. It's possible to do much better.
As you wrote above, programming
"in those languages is hard and error prone", and the purpose of language primitives is to catch errors early.
What errors would a compiler / typing system for such a language catch, ideally without impeding performance?
 Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf
That's an interesting thought. I'm not sure I agree, maybe?
The user's job is to express how to partition the problem. To do that properly, they need to know, e.g., "how many bytes can I fit in the next partition per unit of compute", so the language/run-time has to tell them that.
I don't know if knowing the costs of memory access at each level, or across levels, is useful. It is a reasonable assumption that, at the next level, memory accesses are at least one order of magnitude faster, and that synchronizing across the hierarchy is very expensive and should be minimized. That's a good assumption to write your programs with, and knowing the actual costs doesn't really help you beyond that.
> Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array
I think CUDA's __shared__ memory is quite good. E.g. per kernel, a way to obtain memory in the level of the hierarchy that the kernel is currently working on. Nobody has extended CUDA to multiple levels because there hasn't been a need, but I expect these programs to be "recursive" in the sense that they recursively partition a problem, and for __shared__ memory on each recursion level to give you memory at a different level of the hierarchy.
To move memory across levels of the hierarchy, you would just use raw pointers, with appropriate reads/writes (e.g. atomic ones). The exact instructions that get emitted would then depend on which level of the hierarchy you are. For that the compiler needs to know something about the architecture it is compiling for, but that seems unavoidable.
>  Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf
Thanks, I actually hadn't read the Terra paper. I'll try to get to it this weekend. I think that using a Lua-like language as the meta-programming language for your language (e.g. Regent) is an interesting approach, but it is not necessary.
For example, Scheme, Lisp, Haskell, D and Rust have shown that you can do meta-programming quite well in the same language you do normal programming in, without having to learn a completely different "meta-language".
I particularly like Rust's procedural macros. They are normal run-time Rust code that just gets compiled first and then executed (with your source code as input) during compilation, with some twists to make that quite fast (e.g. the compiler parses an AST, and the proc macros do AST->AST folds).
If one wants to delay that until run-time, one can just do so (e.g. embed your compiler in your app, and when your proc macros run, it's "run-time"). No need to layer multiple languages, but maybe the Terra paper will convince me otherwise.
Rust's procedural macros are standard compile-time meta programming.
> proc macros do AST->AST folds
First, the basic principles of how to write an efficient prefix sum are the same as before. Definitely read . Also see  for an exploration into how much subgroups help over threadgroup shared memory. In fact,  has much of the answer you seek, as it's already written in Vulkan and GLSL (compiling to SPIR-V), though it does miss a direct performance comparison to CUB.
Second, doing this depends on WebGPU exposing threadgroup shared memory and subgroup operations. I'm not sure where the standard and implementations are on this; there are significant challenges. For example, subgroup shuffle is not available in DX12, though subgroup reduce operations such as WavePrefixSum are. So in general an implementation will need to do runtime detection of capabilities and choose the best kernel based on that.
In theory, it should be possible to achieve comparable performance. But it's likely that WebGPU implementations haven't really been tuned for performance yet, while CUB has seen a ton of work. So, as always, the thing to do is implement and measure.
That's slightly outdated. The state of the art is: https://research.nvidia.com/publication/single-pass-parallel...
(EDIT: well the state of the art is CUB, but that paper is a bit more approachable than the CUB docs, which are still very good).
> though it does miss a direct performance comparison to CUB.
For perspective, CUB prefix-sum is as fast as a memcpy on the GPU. It achieves the peak memory bandwidth of the device.
If anyone is interested in tuning this and doing a performance comparison, please get in touch. It'd make a good blog post, I think, but already the time I've spent on it is a distraction from the renderer.
Is subgroupExclusiveAdd a primitive, or is it implemented somewhere? (i've been searching through the git repo without much luck).
This has motivated me to learn a bit more about using Vulkan for compute. I hope I get to it tonight and tomorrow - benefits of the pandemic :D
None of the NVIDIA compute devices do, right? Do AMD GPUs implement this?
CPUs from Intel, ARM, RISC-V, MIPS, and Power support horizontal SIMD add instructions that perform a tree reduction.
I'm skeptical of GPUs being able to actually implement this in hardware. A CPU puts the whole vector in a single 512-bit register, but on a GPU, even for a single half-warp, the operation cannot take 16 64-bit registers as one operand. The operation would need to be performed in memory, moving the contents into different registers, doing warp-local shuffles for the reduction, etc. So we would be talking about quite a big macro-op here. I don't see the point of doing this in hardware, when the compiler or driver could just call a library function that does it for you.
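The warp-local shuffle reduction mentioned above can be sketched sequentially. This is a Python rendering of the CUDA shuffle-down pattern (the 32-lane width is illustrative, and real code would use `__shfl_down_sync` per step):

```python
# Sketch of a shuffle-based tree reduction within a warp: at each step,
# lane i adds the value held by lane i + offset, halving the offset
# until lane 0 holds the total.
def warp_reduce(lanes):
    vals = list(lanes)
    offset = len(vals) // 2          # warp width assumed power of two
    while offset > 0:
        for i in range(offset):
            vals[i] += vals[i + offset]   # "shuffle down" + add
        offset //= 2
    return vals[0]

print(warp_reduce(range(32)))  # 496
```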
But then we are back to the issue that, if you can't write those library functions in pure Vulkan, you are quite limited w.r.t what one can do.
There are many prefix-sum-like algorithms, e.g. a weighted prefix sum (prefix_sum(x * w)). On a GPU, you don't want to do the vector multiply first followed by the prefix sum; you want to do it all at once. You can probably work around this in Vulkan by doing the vector multiply into shared memory, and then the prefix sum there (while in CUDA you wouldn't need shared memory for that at all), but that won't really work if the reduction is more complicated than just a sum.
Convolved prefix sums come up a lot in signal processing with linear recurrence relations, e.g. in IIR filters; see https://dl.acm.org/doi/10.1145/3229631.3229649 and https://dl.acm.org/doi/10.1145/3296957.3173168. The weighted prefix sum example is the simplest possible instance of those.
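A minimal sketch of the fusion being described, in plain Python: the fused weighted scan makes one sweep, instead of materializing `x * w` and scanning it afterwards (on a GPU, the unfused version costs an extra round trip through memory):

```python
from itertools import accumulate

# Fused weighted prefix sum: prefix_sum(x * w) in a single sweep,
# multiplying and accumulating together, versus the two-step version
# that first materializes the elementwise product and then scans it.
def weighted_scan_fused(xs, ws):
    out, acc = [], 0.0
    for x, w in zip(xs, ws):
        acc += x * w          # multiply and accumulate in one pass
        out.append(acc)
    return out

def weighted_scan_two_step(xs, ws):
    return list(accumulate(x * w for x, w in zip(xs, ws)))  # same result

print(weighted_scan_fused([1, 2, 3], [2, 2, 2]))  # [2.0, 6.0, 12.0]
```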
I strongly suspect Intel and NVidia are similar here, but haven't experimented.
If I had to write a new foundational library, e.g., something like CUTLASS or cuFFT for device code, I wouldn't start with code that fully uses a device, but I'd like to be able to do that if I want to.
If I need to reach for CUDA to do that, then Vulkan, WebGPU, etc. will need great FFI with CUDA C++. Nobody wants to have to "drop up" to a higher-level CUDA C++ API like CUTLASS, only to be forced to "drop down" again by wrapping that API in a C FFI.
It would be like having Rust, but with no unsafe-code support except C FFI, C not actually existing, having to drop "down" to C++ for everything, and then dumbing the C++ down to a C FFI for interoperability. You might as well just stick with C++ and spare yourself the pain.
I hope Vulkan and WebGPU get enough extensions so that foundational libraries can be written in those languages.
I wrote some more of my thoughts on the xi Zulip #gpu channel (login required, open to everybody with Github): https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Pre...
I think I was expecting to have a similar level of control as with CUDA or OpenCL, but it turns out that wgpu/Vulkan compute is more like "OpenGL compute shaders", which is "compute for a rendering pipeline", something quite different than "CUDA/OpenCL compute", which has nothing to do with rendering.
If all you want is to stick a compute shader in your rendering pipeline, then wgpu/vulkan are great IMO.
If you want to do CFD, linear algebra, or machine learning, then it feels like fighting against a technology that just wasn't meant to be used for that. I managed to write a tiny explicit CFD code with Vulkan compute just fine, but the moment I wanted to implement a slightly more complicated algorithm, I was limited by things like how to properly do linear algebra (e.g. I needed an algebraic multi-grid preconditioner and a conjugate gradients implementation, and ended up fighting over how to control what gets put into shared memory, and when). When trying some NN kernels, I got stuck on the same issue. I managed to implement an FFT and convolutions with Vulkan, but I couldn't get the convolutions to achieve the same perf as with CUDA, because with Vulkan I had to use a "pull" algorithm (read from all points in the window, write to one point), and that just performed much worse than the "push" algorithm that is possible with CUDA (maintain a shared memory cache that you write to, and as you read one value from the window, write its contribution to multiple locations, then do a full cache write to global memory). In numbers, I went from ~150 GB/s throughput with Vulkan to ~700 GB/s with CUDA.
I think Vulkan/SPIR-V could benefit from better compilers that take the "pull" algorithm and compile it to a "push" one using shared memory, but then you are at the mercy of the compiler for a 4x perf difference.
I think that if I have a rendering pipeline in which I want to integrate some compute, then wgpu/Vulkan are great, particularly because chances are that the compute you need is per "vertex/pixel/..." kind of naive-massively-parallel kind of compute. But if I just have a raw compute pipeline, then it feels like using the wrong tool for the job.
I agree 100% that prefix sum is an excellent litmus test for this platform.
I'll give one example to fill in more specifics. The fence-free descriptor updates described in section 4.4 of your linked paper depend on 64 bit atomics. These are available using the VK_KHR_shader_atomic_int64 extension, and appear (using vulkan.gpuinfo.org) to be available on NV and AMD hardware but not Intel (and basically not at all on mobile). On Vulkan, you query at runtime for the presence of this extension, and swap in the appropriate kernel based on the result of the query; optimizing for hardware without 64 bit atomics might give you a different approach.
Not having to deal with all this is the reason NVidia owns the space :)
The current version of Futhark is using the reduce-then-scan strategy.
I'll note that these "cache guarantees" in the Vulkan world are the Vulkan 1.2 memory model, and are supported by the latest drivers from AMD, Intel, and NVidia. This recent change is one big reason I'm saying Vulkan + SPIR-V is getting ready for real compute workloads; this wasn't the case even months ago.
> 2-3 most fundamental
An alternative way of rendering my question could be: what primitives do you recommend a language should provide that are currently missing in the languages you evaluated?
scan (e.g. prefix sum), merge, partition, reduce (e.g. minimum), tree contraction, sort
> what primitives do you recommend a language should provide that's currently missing in the languages you evaluate?
Hardware primitives and powerful capabilities for defining safe abstractions over those.
Chances are that, e.g., I need to implement my own "primitive-like" operation (e.g. my own sort or scan). If your language provides a "partition" primitive, but not the tools to implement my own, I can't use your language.
When people create languages for GPUs, for some reason they add the C++ STL or similar as primitives, instead of providing users the tools to write such libraries themselves. The consequence is that those languages end up not being practical to use, and nobody uses them.
How to abstract "hardware primitives" in a way that can be instantiated on many GPU architectures without performance penalty, and be useful for higher-level programming, is not so clear. How would you, to take an example from the CPU world, fruitfully abstract the CPU's memory model? As far as I'm aware that's not a solved problem as of April 2020, and papers on this subject are still appearing in top conferences.
Unfortunately, this language is from 2007, and the state of the art has not really improved since then (except for the evolution of CUDA proper).
It would be cool for people to work on improving the state of the art, providing languages that are simpler, perform better, or are more high-level than CUDA, without sacrificing anything (i.e. true improvements).
There have been some notable experiments in this regard; e.g., Sequoia, for Roadrunner's scratchpads, had very interesting abstraction capabilities over the cache hierarchy that could have led to a net improvement over CUDA __shared__ memory.
Most newer languages have the right goal of trying to simplify CUDA, but they end up picking trade-offs that only allow them to do so by sacrificing a lot of performance. That's not an interesting proposition for most CUDA developers - the reason they pick up CUDA is performance, and sacrificing ~5% might be acceptable if the productivity gains are there, but a 20-30% perf loss isn't acceptable - too much money involved.
One can simplify CUDA while retaining performance by restricting a language to GPUs and a particular domain, e.g., computer graphics / gfx shaders or even stencil codes or map-reduce, etc.
However, a lot of the widely advertised languages try to (1) target both GPUs and CPUs, compromising on some minimum common denominator of features, (2) support general-purpose compute kernels, and (3) significantly simplify CUDA, often doing so by just removing the explicit memory transfers, which compromises performance. These languages are quite neat, but they aren't really practical, because they aren't really true improvements over CUDA.
Most companies I work with that use CUDA today are using the state of the art (C++17 or 20 extensions), experimental libraries, drivers, etc. So it isn't hard to sell them into a new technology, _if it is better than what they are already using_.
The Regent language (Legion is Regent's runtime) is yet another async task-graph-based PGAS language/run-time, similar to, e.g., HPX, but leaving out what in my opinion made Sequoia interesting (e.g. the memory hierarchy abstraction capabilities).
This is probably quite different from your use case.
For doing a dot-product on the GPU, you need to take your vectors in global memory, and:
- split them into chunks that will be processed by thread blocks
- allocate shared memory for storing partial reductions from warps within a thread-block
- decide how many elements a thread operates on, and allocate enough registers within a thread
- do thread-block reductions on shared memory
- communicate thread-block reduction to global memory
- do a final inter-thread block reduction
A sufficiently smart compiler can take a:

sum = 0
for x, y in zip(xs, ys): sum += x * y

and generate all of the above.
But if you are not able to implement the above in the language, it suffices for the user to run into the need to do so once (e.g. your compiler does not optimize their slightly different reduction efficiently), for the value proposition of your language to suffer a huge hit (if I need to learn CUDA anyways, I might just use CUDA from the start; if I don't need performance, I wouldn't be using a super-expensive GPU, etc.).
This is IMO why CUDA is successful and pretty much all other languages are not, maybe with the exception of OpenACC, which has great CUDA interop (so you can start with OpenACC and inline CUDA for a single kernel if you need to).
One thing to keep in mind, though: programming directly in Vulkan is too low-level for most users, while OpenCL is a fairly friendly API. We need tools that compile to SPIR-V and deploy on Vulkan, but are high-level enough that you can express your problem rather than (for example) manually placing pipeline barriers between dispatches. The lack of such tools, and having to build some of my own, is one of the reasons for the slowdown in progress on my renderer project.
OpenCL is playing catch up with CUDA, but it requires at least version 2.0 for similar capabilities, while many vendors are still shipping 1.x drivers.
Vulkan, well, it seems to be the same story as OpenCL regarding compute tooling, but let's see how much love it gets from NVidia.
I advise watching the GTC 2020 talks regarding CUDA and Vulkan compute.
So it's not implausible that there's a future world here where DX13 or DX14 use SPIR-V instead of DXIL. After all this makes AMD, Intel, and Nvidia's driver teams happier. Game devs also are happier as then they can ship a single precompiled shader on all platforms - PC & consoles. Microsoft's strong position with DirectX doesn't come from DXIL, but instead from making game developers happy, so SPIR-V's future could easily be very bright.
And then if games are using SPIR-V for compute (and they already use GPU compute for various things), then Nvidia is absolutely going to optimize the ever-living snot out of it. Likely even more than they do for CUDA, as gaming is by far Nvidia's largest source of revenue: https://s22.q4cdn.com/364334381/files/doc_financials/quarter...
Microsoft only open-sourced the DXC compiler, and they approve the changes to it.
To see how much NVidia cares for SPIR-V, check the amount of Vulkan/OpenGL talks at GTC 2020 versus CUDA and related tooling.
Apparently OctaneRender decided to move away from their Vulkan prototype and just adopt OptiX 7 instead.
It kind of bums me out that Apple decided to double down on Metal rather than just going with Vulkan directly, but MoltenVK is probably fine to deal with that. I would have really liked to see all the other platforms finally unite behind a DirectX alternative.
Even the decades-old Cairo is beating Vulkan/GPU-based renderers like Skia in some tasks. Can it be that hardware intended for ML or 3D graphics is not suited at all for 2D rendering (parallelism vs. pipelining)? How will WebGPU solve those problems?
To me this talk raises more questions than it answers...
I agree that the talk raises a lot of questions. Hopefully that's a good thing, as I think they deserve the attention of curious people!
Regarding the mismatch between 3D hardware and 2D rendering, we'll need hard empirical numbers, but the approach I'm taking is basically to write a software renderer (not unlike Cairo or even libart which predates it), just one that's designed for massive parallelism so it can run on the GPU. This will have performance characteristics more similar to Cairo (just faster as so much more throughput is available) than rasterization-based approaches. The latter really fall down when there's soft clipping and lots of blending, as that requires generation of intermediate texture buffers. I do address this in the talk :)
I hope I will have more solid answers for some of these questions before long.
It's interesting to note the parallels (no pun intended) with raytracing, in which everyone said it was impossible to improve over the CPU right up until people discovered how to effectively partition with acceleration structures and now raytracers are doing more and more on GPU.
But even if it is slower, there may be advantages of moving the rendering to the GPU, like leaving the CPU to do other stuff, less power usage overall, more RAM available, etc.
The key word here is "forward rendering". There are no forward-rendering primitives on GPUs besides triangles.
Your doubts are entirely unfounded. Every mobile device uses Skia-style hybrid GPU/CPU rendering entirely due to performance, and power consumption didn't take a hit from it.
There's a question of why this hasn't also happened on desktop, but I think it's just because nobody cares if desktop GUI performance is garbage. It's not touch-driven, so it's less noticeable. It's not front & center on reviewer comparisons like it is/was for iOS vs. Android. And since so much is web-based, the number of commonly used GUI desktop apps has plummeted.
And on platforms where Cairo is relevant there are bigger problems in graphics than rendering speed anyway, like the abysmal state of composition foundations (aka X11 still being so common and Wayland being stuck in permanent not-quite-there).
For example, I know that if I use WebGL and issue a `readPixels` call then that seems to take on the order of 1 to 10 milliseconds no matter what I do. This suggests that if I needed to react to something in under 100 microseconds, I probably shouldn't have the computation of the reaction involve WebGL. But is that true of GPU compute in general, or just an artifact of the abstraction exposed by WebGL?
Integrated graphics are a different story, they often just have a direct connection to DRAM and CPU<->GPU communication is therefore super fast & low latency. In theory, anyway, if the API abstractions even let you share memory between the two.
So that means even for things where GPUs should do really well at it, like summing two arrays together, doing it on the CPU can still be a lot faster if the inputs & outputs are local to the CPU.
I don't know how much of that would be avoidable if you were actually closer to the metal than the raw OpenGL C++ API calls.
One question: in one of the slides, you mention the Vulkan 1.2 / SPIR-V 1.5 Vulkan Memory model. What exactly is this and why do you think this is meaningful? Would you have any links to learn more? When I read the SPIR-V specification, it mentions that this kind of memory model exists, but it does not put any effort into explaining why it's worthwhile or how it works on a high level.
The memory model is needed to do more sophisticated communication between the parallel workgroups. The classic example is to make a queue, even a simple one such as a single-producer single-consumer ring buffer (for communication between a pair of workgroups). The way this works, the producer fills an element in the buffer, then bumps an index. The consumer observes the index, then reads the element from the buffer.
Without a memory model, there's no way to guarantee that this ordering is preserved, and (to map this to hardware), the write of the index bump could flow through the caches, while the write of the element could still be "dirty" in the cache on the producer, so the consumer reads stale data.
What the memory model does is provide the programmer an explicit way to express ordering constraints. So the write of the index bump is with "release" semantics and the read of that by the consumer is with "acquire" semantics, which guarantees that producer writes before the release are visible to consumer reads after the acquire.
This unlocks a whole range of sophisticated concurrent data structures, such as concurrent hash maps, fancier memory allocation (even without a memory model, a simple allocate-only bump allocator is feasible), and so on.
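To make the producer/consumer pattern above concrete, here is a CPU-side sketch in Python (class and names are mine). CPython's GIL happens to provide the ordering for free, so the comments mark where a GPU kernel would need the Vulkan memory model's explicit release/acquire atomics:

```python
import threading

# Minimal single-producer single-consumer ring buffer, mirroring the
# pattern described above. On a GPU, the marked points would need
# explicit release/acquire semantics from the Vulkan memory model.
class SpscRing:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # written only by the consumer
        self.tail = 0  # written only by the producer

    def push(self, item):
        while self.tail - self.head == self.capacity:
            pass  # buffer full, spin
        self.buf[self.tail % self.capacity] = item  # 1. fill the element
        self.tail += 1  # 2. bump the index ("release" store on GPU)

    def pop(self):
        while self.tail == self.head:
            pass  # buffer empty, spin
        # observing the bumped index needs an "acquire" load on GPU,
        # so the element write above is guaranteed visible here
        item = self.buf[self.head % self.capacity]
        self.head += 1
        return item

ring = SpscRing(4)
out = []
consumer = threading.Thread(
    target=lambda: out.extend(ring.pop() for _ in range(100)))
consumer.start()
for i in range(100):
    ring.push(i)
consumer.join()
print(out == list(range(100)))  # True
```

Without the release/acquire pairing, step 2 could become visible before step 1 and the consumer would read stale data, exactly the failure mode described above.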
Thanks for the question and kind words, I hope this illuminates.
btw the Julia link is dead
If you're doing image-like stuff, Halide is good, there's lots of stuff shipping on it.
One other project I'd add is emu, which is built on wgpu. I think that's closest to the future I have in mind, but still in early stages.
Nvidia also sponsored a Udacity course, but I don't have any direct experience with it: https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...
When I was at GTC 2018, this was one of the training sessions I went to, and you can now access it here.
The official NUMBA for CUDA docs are great too. Also, you can avoid a lot of installation/dependencies headaches by using anaconda/miniconda if you're doing this stuff locally.
Numba CPU is pretty good. I'd recommend it, modulo some expectation control:
* In order to actually compile a function, it must be written in a very minimal python subset that feels like C
* Numpy is only partially supported in this subset. Expect to manually reimplement ~1/3 of your simple numpy code and all of your fancy numpy code.
Numba GPU is not good. I'd avoid by default, unless you're looking for a project to contribute to:
* When we first installed, it was trivially broken by the removal of a long-deprecated CUDA API that had happened a year prior. It was an easy fix, we just chopped the reference out of numba, but it was a bad sign that foretold problems to come.
* It has lingering subtle bugs. One time we passed in a float array, it made us declare the float type three times (silly but not bad), and then it ignored the types and silently binary cast to double (bad). We found a "ghost town" thread where others had the issue.
* It has severe structural limitations. To a first order, GPU performance is all about avoiding stalls, which are usually triggered by CPU-GPU copies, and Numba's limited buffer abstractions sometimes make that difficult (you have to destructure your code) or impossible.
* It doesn't play well with nvidia's perf tools. If you're doing GPU, you're after perf, and GPUs have lots of tricky caveats that can accidentally tank your perf, so this is a much bigger deal than jupyter (until recently) missing a debugger.
* If you're wondering if numba supports CUDA API XYZ, the answer is no.
If I had to pick something today to use for arbitrary GPU acceleration in Python I'd almost certainly opt for JAX instead, but I haven't seen the kind of tutorials in JAX that exist for NUMBA CUDA.
One loop that didn't have any invalid reads or writes segfaulted in a numba prange (i.e. OpenMP-style) loop.
Just reimplementing a loop as a few smaller subfunctions called sequentially worked.
No clue where the optimizer got confused on the 20-line loop body before it got split -- it's the same code.
I personally end up relying on TensorFlow for most of my GPU needs at work still, but that training session was incredibly helpful for me to understand what was going on under the hood and helped demystify the CUDA kernels in general.
Looking at the JAX documentation it's _much_ better than the last time I saw it. Their tutorials seem fairly solid in fact. I do want to point out the difference though. You're conceptually operating at a very different point in JAX than in NUMBA.
For example, consider multiplying 2 matrices in JAX on your GPU. That's a simple example with just a few lines of code in the JAX tutorial.
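That JAX version really is just a few lines (a sketch; function and names are mine, assuming jax is installed). The point is that you never write a kernel: `jit` hands the whole function to XLA, which compiles it for whatever backend is available, GPU included:

```python
import jax.numpy as jnp
from jax import jit, random

# Matrix multiply in JAX: jnp mirrors the numpy API, and jit compiles
# the traced function via XLA for the available backend (CPU or GPU).
@jit
def matmul(a, b):
    return a @ b

key = random.PRNGKey(0)
a = random.normal(key, (128, 64))
b = random.normal(key, (64, 32))
c = matmul(a, b)
print(c.shape)  # (128, 32)
```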
On the other hand in the NUMBA tutorial from GTC I mentioned earlier, you have notebook 4, "Writing CUDA Kernels", which teaches you about the programming model used to write computation for GPU's.
I'm sorry I was unclear. My recommendation of NUMBA is not so much in advocating for its use in a project, but more so in using it and its tutorials as an easy way of learning and experimenting with CUDA kernels without jumping into the deep end with C/C++. If you actually want to write some usable CUDA code for a project, keeping in mind JAX is still experimental, I would fully advocate for JAX over NUMBA.
I'm not sure Futhark-generated code is as fast as specialist code but it's definitely a speedup compared to even skilled CPU implementations.
I have looked briefly at Futhark, and the apltail compiler, but I am trying to focus on the APL family, because my time is limited. I am a GPU dilettante who has tried basic CUDA in C. I tried Rust, but Zig is working out better for me: I make more progress with less effort, and using C APIs is effortless. My Rust difficulties may just be my bias with PLs; I find Haskell easier than Rust.
I just read an HPC article today about a survey/questionnaire on the languages used among 57 experts. It's still predominantly C++, but with a lot of pain expressed by the experts/users. I agree SPIR-V sounds promising, and I hope to check it out. Just like DL, I think people don't realize it takes both the knowledge domain and the algorithms, or domain experts and software people. Somebody has to set up the computation based on domain knowledge and software knowledge. This shows itself to me when I wind up running somebody else's program, because I don't have a specific problem of my own I'd like to implement as a learning exercise.
Great talk! I found it very understandable and paced just right!
Though WebGPU is coming, which I believe will mitigate this.
While it is a headache if you want to support compute on legacy devices, I do think that writing regular vertex/fragment shaders for general-purpose GPU computation is an underrated pleasure, as you really need to break out the code-golf toolkit to squeeze out maximum performance.
It's really no different from mapping over an array, except the array is 2D instead of 1D. You supply data in one or more 2D arrays (textures). You then execute a function (fragment shader) which outputs the values for the result, referencing the data in the input arrays (textures). The result is written to another 2D array which just happens to be called a texture instead of a "2D array".
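That mental model can be sketched on the CPU with numpy (names are mine): the "textures" are 2D arrays and the "fragment shader" is a pure function evaluated independently at every output coordinate.

```python
import numpy as np

# Two input "textures": 2D arrays of the same shape.
height, width = 4, 4
tex_a = np.fromfunction(lambda y, x: x + y, (height, width))
tex_b = np.full((height, width), 10.0)

# The "fragment shader": a pure function of the input textures at one
# coordinate. On the GPU this runs once per output pixel, in parallel.
def shader(a, b):
    return a * 0.5 + b

# Drawing a full-screen quad == mapping the shader over every texel;
# the result lands in another "texture" (2D array).
out = shader(tex_a, tex_b)
print(out[0, 0], out[3, 3])  # 10.0 13.0
```

The only real difference on a GPU is that the iteration over coordinates is implicit and massively parallel, and you can't write anywhere except your own output texel.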
Demos like this (https://paveldogreat.github.io/WebGL-Fluid-Simulation/) help show people why many small processors can beat a few large ones.
I really really enjoyed your presentation, thanks for sharing.
I remember considering rewriting some tight loops in OpenCL, but then the maintenance headache of having multiple code paths made the refactor seem not worth it. I'd guess this is/was generally a major speedbump for adoption. I know there is POCL, which will run your kernels on the CPU, but it's not something you can expect to be available on every platform. Maybe if POCL were part of the kernel or bundled with every distro the situation would be different.
I've seen some projects do compute in OpenGL ES 2.0, because that's ubiquitous and will always run (I think it's even a requirement for Android systems).
WebGL does automatically fall back to the CPU in Chrome, via SwiftShader. WebGPU hasn't shipped yet but I think it's likely that a CPU backend with SwiftShader will be available eventually. And SwiftShader implements Vulkan now too, so while it's not automatic for Vulkan you can certainly have a fallback if you want it.
So you spend a lot of time in the talk massaging your sequential CPU task into an appropriate GPU task. It's clearly tricky and involves reworking your algorithms to leverage the parallelism and the memory "features" of the GPU. But through that rewriting, have you actually substantively hurt the performance of the sequential program?
The big-picture question is: is the reverse problem, going from a GPU program to a CPU program, ever problematic? I mean in an algorithmic sense - without looking at micro-optimizations for leveraging CPU SIMD instructions or whatever. Or are you always going to be pretty okay with running your shaders sequentially one after another on your big luxurious CPU memory pool?
And ultimately, is there anything stopping you from compiling SPIR-V for a CPU? Could you not autogenerate in the kernel dispatcher a switch that branches to a precompiled on-CPU kernel if no driver is found? Then you'd finally really get compile-once-run-everywhere GPU code.
I guess since it's not being done then I'm missing something haha :) Maybe you are going to often hit scenarios where you'd say "No, if I'm going to run this on the CPU I need to fundamentally rewrite the algorithm"
If you have a workload that runs really well on GPU, then it can be adapted to run on CPU. There are automated tools for this, but it's not mainstream. To answer one of your questions, spirv-cross has a C++ generator.
There are a bunch of things that are in the space between GPU and traditional scalar CPU. One of the most interesting is Intel's ispc. The Larabee project was also an attempt to build some GPU features into a CPU, and that is evolving into AVX-512. The new mask features are particularly useful for emulating the bitmap of active threads, as this is something that's hard to emulate with traditional SIMD.
I think it would be a very interesting project to build a compiler from SPIR-V compute workloads to optimized multithreaded SIMD, and there are projects exploring that: https://software.intel.com/en-us/articles/spir-v-to-ispc-con...
The main reason this hasn't happened is that when doing optimization it's always better to target the actual hardware than go through layers of translation. If you want to run a neural network on CPU, you're definitely going to get better performance out of matrix multiply code tuned for the specific SIMD implementation than something adapted from GPU. But I think there may still be a niche, especially if it's possible to get "pretty good" results.
For machine learning and imaging workloads in particular, there's probably a lot more juice in having a high level (Halide) or medium level (MLIR) representation, and targeting both GPU and CPU as backends that do optimization specific to their targets.
I'm really interested to see how this space evolves, it feels like there are major opportunities, while the scalar CPU side feels largely mined out to me.
Intel GPU has three hidden strengths:
* CPU <-> GPU communication is cheaper, because they can actually share memory (a separate copy to staging buffers is not needed).
* Latency is potentially lower, although it's not clear to me yet that driver software can take advantage of the potential offered by the hardware. (More empirical measurement is needed)
* Subgroup operations on Intel appear to be wicked-fast, shuffle in particular.
Long story short, I think there is opportunity here that most people aren't exploiting.
Looking at SwiftShader, it looks like you can bundle it with your application? Is that correct? I don't see a way to bundle POCL, though.
If you want to target GPU in some form to increase performance (open source standards that currently fill this niche are OpenCL for compute, OpenGL/Vulkan for graphics) but you want to use the same code for CPU and GPU there is a solution!
There's a new standard called SYCL.
Currently it is based on C++11 and there are implementations that target OpenCL and CUDA.
NVidia isn't that keen on supporting OpenCL beyond 1.2 and is pushing OpenACC for GPGPU outside CUDA on their hardware.
But it's still early days, and I don't think there's a complete solution comparable to either of those yet.
In some cases they may have connectors for a high-speed communication fabric between many GPUs; NVLink is an example of this.
Outside of the consumer and workstation space GPUs really don’t look like anything you or I would recognise as a GPU anymore, and with APIs like CUDA, don’t look like GPUs to software either.
Really all we’re missing is open source drivers. But I wouldn’t hold your breath.