A Taste of GPU Compute [video] (youtube.com)
280 points by raphlinus on April 17, 2020 | hide | past | favorite | 165 comments

This is a talk I gave at Jane Street in late February on GPU compute, especially using graphics APIs such as Vulkan and the upcoming WGPU. Feel free to scan through the slides [1], and to ask me anything.

[1]: https://docs.google.com/presentation/d/1FRH81IW9RffkJjm6ILFZ...

I write a lot of compute kernels in CUDA, and my litmus test is prefix sum, for two reasons.

First, you can implement it in pure CUDA C++ and max out the memory bandwidth of any NVIDIA or AMD GPU. The CUB library provides a state-of-the-art implementation (using decoupled look-back) that one can compare new programming languages against.

Second, it is one of the most basic parallel algorithm building blocks. Many parallel algorithms use it, and many parallel algorithms are "prefix-sum-like". If I am not able to write prefix sum from scratch efficiently in your programming language / library, I can't use it.
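To make the litmus test concrete: a sequential reference and a work-efficient (Blelloch-style) exclusive scan, sketched in Python purely to show the algorithmic structure. On a GPU, each level of the two tree sweeps would be one parallel step.

```python
def scan_sequential(xs):
    """Reference exclusive prefix sum: out[i] = sum(xs[:i])."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def scan_blelloch(xs):
    """Work-efficient exclusive scan (Blelloch): the up-sweep builds a
    reduction tree, the down-sweep distributes partial sums back down.
    Each `while` level would be one parallel step on a GPU; here the
    levels run sequentially.  Assumes len(xs) is a power of two."""
    a, n = list(xs), len(xs)
    d = 1
    while d < n:                      # up-sweep (reduce)
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    a[n - 1] = 0                      # identity at the root
    d = n // 2
    while d >= 1:                     # down-sweep
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a
```

This is only the textbook skeleton; the CUB-class implementations discussed below add tiling, shared memory, and single-pass global composition on top of it.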

Every time someone shows a new programming language for compute, the examples provided are super basic (e.g. `map(...).fold(...)`), but I have yet to see a new programming language that can be used to implement the 2-3 most fundamental parallel algorithms from any graduate parallel algorithms course. For example, Futhark provides a prefix sum intrinsic that just calls CUB; if you want to implement a prefix-sum-like algorithm, you are out of luck. In WGPU, it appears that prefix sum will be an intrinsic of WHSL, which sounds like you would be out of luck there too.

You mentioned WGPU and Vulkan. Do you know how to implement prefix sum from scratch in these? If so, do you know how the performance compares against CUB?

> For example, Futhark provides a prefix sum intrinsic, that just calls CUB - if you want to implement a prefix-sum-like algorithm, you are out of luck.

Futhark doesn't call CUB (it also targets OpenCL), and here is an asymptotically work-efficient prefix sum in pure Futhark: https://gist.github.com/athas/611e6b9a76b382ec079f979ec7fb85...

It's not as fast as a hand-written scan in CUDA, nor as fast as the built-in scan in Futhark, but you can do it. Futhark isn't a low-level or GPU-specific language, so you can't take direct advantage of hardware-specific features.

Generally, Futhark doesn't compete performance-wise primitive-for-primitive, but can do application-level transformations that are infeasible for libraries, and impractical to do by hand. For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well, while there's no way to turn the CUB scan into a regular segmented scan, unless the original author thought of it. Similarly, if you are 'scan'ing the same array twice in a Futhark program, but with different operators or pre-scan transforms, the compiler will fuse the two operations into one, to reduce the number of accesses.
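A semantics-only sketch of what 'map'ping a 'scan' means (Python, illustration only; it says nothing about the flat GPU code a compiler would actually generate):

```python
def scan(xs, op, identity):
    """Exclusive scan with an arbitrary associative operator."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

def map_scan(rows, op, identity):
    """'map'ping a scan over a 2-D array: a regular segmented scan.
    A compiler can turn this nesting into one flat kernel; a fixed
    library scan cannot be reused this way unless its author planned
    for segments up front."""
    return [scan(r, op, identity) for r in rows]
```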

> Futhark doesn't call CUB (it also targets OpenCL),

If the built-in scan doesn't call CUB, how fast is a prefix sum in Futhark compared to CUDA using CUB?

> For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well,

What does "pretty well" mean? 95% of peak device throughput? 60%?

When a compute device costs $8,000, an efficiency of 90% means I'm throwing $800 out the window.

> If the built-in scan doesn't call CUB, how fast is a prefix sum in Futhark compared to CUDA using CUB?

Last I checked, it's about 50% the performance of CUB, but a sibling post describes a current effort to implement the decoupled lookback algorithm in Futhark's CUDA backend.

> What does "pretty well" mean? 95% of peak device throughput? 60%?

It's as fast as Futhark's single-dimensional scan. In exotic cases (when a single segment or "scan instance" fits in a thread block) it can be even faster.

> When a compute device costs $8,000, an efficiency of 90% means I'm throwing $800 out the window.

Certainly! But people run much less than 90% efficient code on $8000 CPUs all the time, so I think there is definitely room for a high-level programming language that runs on GPUs. It's the usual performance-versus-productivity tradeoff, although Futhark's ambition (and to a large extent, reality) puts more emphasis on the performance side of things than when people usually use that argument.

While Futhark cannot, in principle, beat hand-written code, I will note that empirically it is not unusual for us to see performance increases when we take published GPU benchmark code and rewrite it straightforwardly in Futhark. Much GPU code out there is not perfectly written at the level of CUB, and while the Futhark compiler doesn't exactly generate highly inspired code, it gets the basics consistently right.

You claim there is a trade-off between usability and performance for GPUs. The reason Rust is so popular is that people used to believe the same thing for CPU programming languages, yet Rust showed that you can have a language with the abstraction power and safety of Scala, Haskell, etc. and the performance of C. This was believed to be impossible, turns out it wasn't.

In the last NERSC survey, 75% of the parallel models in use at US national HPC centers are MPI and/or OpenMP, ~22% are CUDA C and CUDA C++, and everything else makes up the rest (~3%).

CUDA C++ is the best we have because it lets you write code that extracts all the performance from the device, while also allowing one to expose that code through very high-level APIs (Thrust, CUTLASS, CUB, AmgX, ...), and even RAPIDS, CuPy, etc.

While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), a practical language (that actually gets used widely on HPC) must do so without sacrificing low-levelness.

A GPU programming language for the masses needs to provide or improve on the performance of CUDA C, its abstraction capabilities, and its safety.

While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), or how "middle" a middle ground can be (OpenCL, SyCL, etc.), a practical language that actually gets used widely on HPC must be better than what we have without sacrificing anything that we have.

The hardest thing about CUDA is writing code that uses the memory hierarchy well. In my experience, this is the hardest thing to do well portably, and the hardest thing to abstract, which is why pretty much all device-only high level libraries like CUB or CUTLASS expose the memory hierarchy on their APIs, and this is, at least today, required to get good performance.

I find languages like Sequoia quite interesting, because they explore how to improve on this particular problem, which, right now, it appears any CUDA replacement would need to solve better than CUDA does. Languages like Regent/Legion, HPX, Futhark, Taichi, etc. focus more on trading off low-level performance for high-level abstractions. This research is definitely useful for exploring what else a CUDA killer could do better, but I don't think any of these will end up taking even a little bit of CUDA's market share. At best, they might take a bit of OpenMP's market share, but that seems unlikely.

Why do you think most languages avoid exposing the memory hierarchy? What is the core problem that has not yet been solved?

For programs running on the CPU, the programmer cannot manually control memory transfers across the memory hierarchy, so adding that ability to your programming language does not solve any problems.

So I'd say most languages don't expose it because there is no need for it.

Many languages for GPUs do expose the memory hierarchy to programmers (e.g. CUDA, OpenCL, even OpenAcc).

> What is the core problem that has not yet been solved?

That using the memory hierarchy efficiently and correctly when writing GPU programs in those languages is hard and error-prone. It is trivial to write CUDA code that performs horribly and/or has data races and other forms of UB, e.g., when dealing with global/shared/local memory.

Sequoia attempted to split a kernel into the algorithms that run at the different levels of the memory hierarchy and the memory the kernel owns at each level, as well as how to split that memory as you "refine" the kernel, e.g., from kernel (global) -> thread blocks (global, shared, constant) -> warps (global, shared, constant, shared registers) -> threads (global, shared, constant, shared registers, local memory).

For many algorithms (e.g. GEMM, convolutions), how you partition global memory into thread blocks, and which parts of global memory one loads into shared memory and how, has a huge impact on performance.

Interestingly, a lot of programmers (who program for the CPU) worry about caches a great deal, despite:

- Programmers being unable to control caches, at least directly, and reliably.

- Languages (e.g. C/C++) having no direct way of expressing memory constraints.

This suggests to me that even in CPU programming something important is missing, and I imagine that a suitable explicit representation of the memory hierarchy might be it. A core problem is that it's unclear how to abstract a program so it remains performant over different memory hierarchies.

People who worry about caches on CPUs can't control them directly, so instead they need to, e.g., "block"/"tile" their loops in such a way that the caches are used efficiently, which is hard.

On different CPUs (e.g. with different cache sizes), these loops need to be tiled differently. If you want a single binary to perform well across the board, it just needs to support different tile sizes depending on the hardware.

Usually, however, this is not enough. For example, say you have some data in memory (a vector of f16s on an Intel CPU), and to operate on it you first need to decompress it into a vector of f32s.

You probably want to decompress to fill the cache, operate, recompress, to save memory and memory bandwidth. For that you need a "scratchpad" (or __shared__ memory in CUDA), e.g., for the different levels of the cache hierarchy.
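As a sketch of that decompress-into-a-scratchpad pattern (Python, illustration only; `tile_elems` stands in for a tile size derived from the cache capacity queried at runtime):

```python
def process_tiled(xs, tile_elems):
    """Process a compressed vector in cache-sized tiles: 'decompress' each
    tile into a scratch buffer (standing in for f16 -> f32 into L3 or
    __shared__ memory), operate on it there, and accumulate, so the
    decompressed form never exceeds the fast-memory budget."""
    total = 0.0
    for start in range(0, len(xs), tile_elems):
        scratch = [float(x) for x in xs[start:start + tile_elems]]  # decompress
        total += sum(scratch)                                       # operate
    return total
```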

The compiler needs to know, for the different architectures, what their L1/L2/L3/shared/constant/texture/... memory sizes are, and either fill in these sizes for you so the whole cache is used, or let the user pick them, making the program fail if run on hardware where the choice isn't correct. NVCC (and pretty much every production compiler) can bundle code for different architectures within a single binary, and pick the best at run-time.

So if your L3 can be 4, 8, 16, or 32 MB, your compiler can bundle 4 copies of your kernel, query the CPU cache size at initialization, and be done with it.

The way in which you abstract the memory hierarchy is by just partitioning your problem's memory.

If you have an f16 vector in global memory, you might want to partition it into N different chunks, e.g., recursively, until a chunk, were it decompressed into f32s, would fit into a faster memory (e.g. the L3). At that point you might do the decompression, and continue partitioning the f32s up to some threshold (e.g. the L1), at which you operate.

That is, a kernel has multiple pieces:

- a state initialization (e.g. owns the global memory)

- partitioning: how to partition that into smaller subproblems

- the merge: what gets computed at each level of the partitioning (e.g. how are the results of the partitions merged together to compute the final result)

- the leaf computation: what gets computed at the "finest" level

The programmer doesn't know a priori how many partitions will be necessary; that depends on the memory hierarchy. But they do know everything else.

For example, for a sum reduction:

- the programmer knows how to perform a leaf computation: by summing a whole leaf partition and putting the result somewhere in the next larger memory level.

  - a CUDA or CPU thread can sum N elements in parallel using SIMD, and put the result in inter-warp shared memory or the L1
- the programmer knows how to merge computations: by summing the results of the current level, and putting them in the next level of the hierarchy (inter-warp -> inter-block -> inter-grid, or L1 -> L2 -> L3 -> RAM)

  - in a GPU / CPU, depending on the actual level, the compiler can use different operations (e.g. SIMD shuffles, inter warp shuffles with warp-level synchronization, inter block shuffles with block-level synchronization, local sum + atomic memory operation to write the result, etc.)
- the programmer knows how to partition an input vector into N pieces (N CUDA threads, N warps, N blocks, N grids, N cpu threads, N numa domains, etc.)

A good programming model needs to allow the user to express what they know, and abstract away what they cannot know (how big the caches are, how many cache levels, how many threads of execution, etc.)
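The four pieces above can be sketched as one recursive function (Python, illustration only; `fanouts` is a hypothetical parameter standing in for exactly the hierarchy information the programmer should not have to hard-code):

```python
def hierarchical_reduce(xs, fanouts):
    """Sketch of the decomposition above, for a sum reduction.  `fanouts`
    lists how many partitions each level splits into (blocks per grid,
    warps per block, ...).  The leaf computation sums a chunk; the merge
    sums the per-partition results; the shape of the recursion is the
    same at every level of the memory hierarchy."""
    if not fanouts:
        return sum(xs)                                   # leaf computation
    n = fanouts[0]
    size = max(1, -(-len(xs) // n))                      # ceil division
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]       # partition
    return sum(hierarchical_reduce(p, fanouts[1:]) for p in parts)  # merge
```

A good programming model would supply `fanouts` (and the per-level scratch sizes) from its knowledge of the target hardware, leaving only the leaf, partition, and merge definitions to the programmer.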

Thanks. Amazing overview.

   abstract away what they 
   cannot know 
Here I'm a bit surprised. For theoretical reasons (I've never gotten my hands dirty with GPU implementations, I'm afraid to admit), I would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection on:

- Number of levels (e.g. how many layers of caches).

- Size of levels.

- Cost of memory access at each level.

- Cost of moving data from a level to the next level up/down.

The last 3 numbers don't have to be absolute, I imagine, but can be relative, e.g.: size(Level3) = 32 * size(Level2). This data would be useful to decide how to partition compute jobs as you describe, and to do so in a way that is (somewhat) portable.

There are all manner of subtle issues, e.g. what counts as the cost of memory accesses and data movements (average, worst case, single byte, DMA ...), and what the compiler and/or runtime should do (if anything) if the stated costs are violated. In abstract terms: what is the semantics of the language's representation of the memory hierarchy?

Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels? An obvious choice is to treat each level as an array, and have DMA-like send/receives to move blocks of data between levels. Is that a good idea?
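A toy version of the reflection interface being proposed here (Python; `Level` and `pick_tile` are hypothetical names, just to make the idea concrete):

```python
from dataclasses import dataclass

@dataclass
class Level:
    """Hypothetical descriptor for one level of the memory hierarchy.
    Sizes and costs could equally be expressed relative to the adjacent
    level, as suggested above."""
    name: str
    size_bytes: int    # capacity of this level
    rel_cost: float    # access cost relative to the level above

def pick_tile(levels, elem_bytes, target_level):
    """Choose a tile size (in elements) so that one tile fits in
    `target_level`; a full partitioning pass would recurse over all of
    `levels` the same way."""
    return levels[target_level].size_bytes // elem_bytes
```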

Equally subtle, and I already alluded to this above, is when to make this information available. Since the processor doesn't change during the computation, I imagine that a multi-stage meta-programming setup (see e.g. [1] for a rationale) might be the right framework: you have a meta-program specialising the program that does the compute you are interested in. C++ uses templates for program specialisation, but C++'s interface for meta-programming is not easy to use. It's possible to do much better.

As you wrote above, programming "in those languages is hard and error prone", and the purpose of language primitives is to catch errors early. What errors would a compiler / typing system for such a language catch, ideally without impeding performance?

[1] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf

> would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection

That's an interesting thought. I'm not sure I agree, maybe?

The user's job is to express how to partition the problem. To do that properly, they need to know, e.g., "how many bytes can I fit in the next partition per unit of compute", so the language/run-time has to tell them that.

I don't know if knowing the costs of memory access at each level or across levels is useful. It is a reasonable assumption that, at the next level, memory accesses are at least one order of magnitude faster, and that synchronizing across the hierarchy is very expensive and should be minimized. That's a good assumption to write your programs with, and knowing the actual costs does not really help you there.

> Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array

I think CUDA's __shared__ memory is quite good. E.g. per kernel, a way to obtain memory in the level of the hierarchy that the kernel is currently working on. Nobody has extended CUDA to multiple levels because there hasn't been a need, but I expect these programs to be "recursive" in the sense that they recursively partition a problem, and for __shared__ memory on each recursion level to give you memory at a different level of the hierarchy.

To move memory across levels of the hierarchy, you would just use raw pointers, with appropriate reads/writes (e.g. atomic ones). The exact instructions that get emitted would then depend on which level of the hierarchy you are. For that the compiler needs to know something about the architecture it is compiling for, but that seems unavoidable.

> [1] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf

Thanks, I actually hadn't read the Terra paper. I'll try to get to it this weekend. I think that using a Lua-like language as the meta-programming for your language (e.g. Regent) is an interesting approach, but it is not necessary.

For example, Scheme, Lisp, Haskell, D and Rust have shown that you can do meta-programming quite well in the same language you do normal programming in, without having to learn a completely different "meta-language".

I particularly like Rust procedural macros. They are normal run-time Rust code that just gets compiled first, and then executed (with your source code as input) during compilation, with some twists to make that quite fast (e.g. the compiler parses an AST, and the proc macros do AST->AST folds).

If one wants to delay that until run-time, one can just do so (e.g. you embed your compiler in your app, and when your proc macros run, it's "run-time"). No need to layer multiple languages, but maybe the Terra paper will convince me otherwise.

   Terra paper
I didn't mean to push Terra in particular, and I agree that doing meta-programming is easier in the same language you do normal programming in. I just mentioned the Terra paper since it provided a rationale for using meta-programming in high-performance programming (I could have pointed to others making the same argument).

Rust's procedural macros are standard compile-time meta programming.

   proc macros do AST->AST folds)
Last time I looked (I have not used Rust for a while) procedural macros operate over tokens, not ASTs. Maybe that changed?

It hasn’t changed; we didn’t want to freeze the AST forever, so we went with the token approach. There are libraries to form an AST though.

That sounds like a great project, one I heartily recommend to people wanting to get their feet wet.

First, the basic principles of how to write an efficient prefix sum are the same as before. Definitely read [1]. Also see [2] for an exploration into how much subgroups help over threadgroup shared memory. In fact, [2] has much of the answer you seek, as it's already written in Vulkan and GLSL (compiling to SPIR-V), though it does miss a direct performance comparison to CUB.

Second, doing this depends on WebGPU exposing threadgroup shared memory and subgroup operations. I'm not sure where the standard and implementations are on this; there are significant challenges. For example, subgroup shuffle is not available in DX12, though subgroup reduce operations such as WavePrefixSum are. So in general an implementation will need to do runtime detection of capabilities and choose the most suitable kernel based on that.

In theory, it should be possible to achieve comparable performance. But it's likely that WebGPU implementations haven't really been tuned for performance yet, while CUB has seen a ton of work. So, as always, the thing to do is implement and measure.

[1]: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...

[2]: https://cachemiss.xyz/blog/parallel-reduce-and-scan-on-the-G...

> First, the basic principles of how to write an efficient prefix sum are the same as before. Definitely read [1].

That's slightly outdated. The state of the art is: https://research.nvidia.com/publication/single-pass-parallel...

(EDIT: well the state of the art is CUB, but that paper is a bit more approachable than the CUB docs, which are still very good).

> though it does miss a direct performance comparison to CUB.

For perspective, CUB prefix-sum is as fast as a memcpy on the GPU. It achieves the peak memory bandwidth of the device.
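For readers who don't want to open the paper: a toy, strictly sequential re-enactment of the decoupled look-back idea (Python; real implementations run the tiles concurrently and poll the per-tile flags with atomics, which is where all the difficulty lives):

```python
def single_pass_scan(xs, block):
    """Single-pass exclusive scan with decoupled look-back, caricatured.
    Each tile scans locally, publishes its AGGREGATE, then walks back
    through predecessors, summing aggregates until it finds a published
    INCLUSIVE_PREFIX.  The input is read exactly once, which is why this
    scheme can run at memcpy speed."""
    AGG, PREFIX = 1, 2
    state = []                # per-tile: [flag, aggregate, inclusive_prefix]
    out = []
    for b in range(0, len(xs), block):
        tile = xs[b:b + block]
        state.append([AGG, sum(tile), None])      # publish the aggregate
        exclusive, i = 0, len(state) - 2
        while i >= 0:                             # the look-back walk
            flag, agg, pref = state[i]
            if flag == PREFIX:
                exclusive += pref
                break
            exclusive += agg
            i -= 1
        state[-1] = [PREFIX, sum(tile), exclusive + sum(tile)]
        acc = exclusive
        for x in tile:        # local exclusive scan seeded with the prefix
            out.append(acc)
            acc += x
    return out
```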

Followup: I did a very rough prototype on my current Vulkan runner; all the good stuff is in the prefix.comp kernel code. It's not doing the fancy stuff, in particular it doesn't parallelize the lookback code, so that ends up taking a bunch of time. But it does give an example of how to use the Vulkan memory model, for which I've seen very little in the way of published material.

If anyone is interested in tuning this and doing a performance comparison, please get in touch. It'd make a good blog post, I think, but already the time I've spent on it is a distraction from the renderer.


Thanks, that was super interesting to read.

Is subgroupExclusiveAdd a primitive, or is it implemented somewhere? (i've been searching through the git repo without much luck).

This has motivated me to learn a bit more about using Vulkan for compute. I hope I get to it tonight and tomorrow - benefits of the pandemic :D

subgroupExclusiveAdd is part of the subgroup extension, and expected to be implemented in hardware. Best place to read about that is here:


Which devices actually implement this in hardware?

None of the NVIDIA compute devices do, right? Do AMD GPUs implement this?

CPUs from Intel, ARM, RISCV, MIPS, Power support horizontal SIMD add instructions that perform a tree-reduction.

I'm skeptical of GPUs being able to actually implement this in hardware. A CPU puts the whole vector in a single 512-bit register, but on a GPU, even for a single half-warp, a single instruction cannot take 16 x 64-bit registers as operands. The operation would need to be performed on memory, moving the contents through the different registers, doing warp-local shuffles for the reduction, etc. So we would be talking about a quite big macro-op here. I wouldn't see the point of doing this in hardware, when the compiler or driver could just call a library function that does this for you.

But then we are back to the issue that, if you can't write those library functions in pure Vulkan, you are quite limited w.r.t what one can do.

There are many prefix-sum like algorithms, like, e.g., a weighted prefix-sum (prefix_sum(x * w)) [0]. In a GPU, you don't want to do the vector multiply first followed by the prefix-sum, but do it all at once. You can probably work around this in Vulkan by doing the vector multiply in shared memory, and then the prefix sum there (while in CUDA you wouldn't need shared memory for that at all), but that won't really work if the reduction is more complicated than just a sum.

[0] Convolved prefix sums happen a lot in signal processing with linear-recurrence relations, e.g., in IIR filters, e.g. see [1] https://dl.acm.org/doi/10.1145/3229631.3229649 and [2] https://dl.acm.org/doi/10.1145/3296957.3173168. The weighted prefix sum example is the simplest possible instance of those.
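The simplest instance mentioned above, a fused weighted prefix sum, looks like this when written as one pass (Python, semantics only; the point is that the multiply feeds the scan directly, without materialising x*w and making a second trip through memory):

```python
def weighted_prefix_sum(xs, ws):
    """Fused weighted (inclusive) prefix sum: prefix_sum(x * w) computed
    in a single pass.  This is the pre-scan transform a from-scratch
    kernel can fuse into the scan, and a fixed scan intrinsic cannot."""
    out, acc = [], 0
    for x, w in zip(xs, ws):
        acc += x * w          # the multiply feeds the scan directly
        out.append(acc)
    return out
```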

Right, I haven't looked at PTX or Intel Gen ASM, but on AMD this compiles to a lg(n) series of add instructions each with a warp local shuffle (v_add_nc_u32 with a row_shr: modifier). It's not going to shared memory. I've experimented a bit with hand-rolling scans (using subgroupShuffleUp) and get similar but not identical asm (that intrinsic gives you a wave_ror: modifier). So I think you could write this using even lower-level intrinsics and also get good performance.

I strongly suspect Intel and NVidia are similar here, but haven't experimented.

[1]: http://shader-playground.timjones.io/4d7402168290f8922da5df3...
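The lg(n) add-and-shuffle pattern can be simulated on the CPU to see the data flow (Python, illustration only; lanes below a shuffle's reach receive the identity, mirroring an out-of-range shuffle-up):

```python
def warp_inclusive_scan(lanes):
    """Simulation of an inclusive scan inside one subgroup/warp using
    shuffle-up: at step d, every lane adds the value held by the lane d
    positions below it.  lg(n) steps of add + shuffle, and no shared
    memory is touched."""
    vals = list(lanes)
    d = 1
    while d < len(vals):
        shifted = [vals[i - d] if i >= d else 0 for i in range(len(vals))]
        vals = [v + s for v, s in zip(vals, shifted)]
        d *= 2
    return vals
```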

There's actually a lot more to say here, but it really gets into the weeds. To me, the fundamental question is: can you write Vulkan 1.2 compute shaders that exploit most of the performance potential of GPU hardware in a portable way? I believe the answer is yes, but it requires validation. It is, however, not a good approach if the goal is to squeeze out the last 5% or 10% of performance.

That makes sense. Given that one can write good compute shaders with Vulkan, the next obvious question worth answering to me is whether one should do that.

If I had to write a new foundational library, e.g. something like CUTLASS or cuFFT for device code, I wouldn't start with code that fully uses the device, but I'd like to be able to get there if I want to.

If I need to reach for CUDA to do that, then Vulkan, WebGPU, etc. will need great FFI with CUDA C++. Nobody wants to have to "drop up" to a higher-level CUDA C++ API like CUTLASS, only to be forced to flatten that API down to a C FFI wrapper.

It would be like having Rust, but without any unsafe-code support except C FFI, with C not actually existing, so you have to drop "down" to C++ for everything and then dumb the C++ down to a C FFI for interoperability. You might as well just stick with C++ and spare yourself the pain.

I hope Vulkan and WebGPU get enough extensions so that foundational libraries can be written in those languages.

Thanks for your continued discussion!

I wrote some more of my thoughts on the xi Zulip #gpu channel (login required, open to everybody with Github): https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Pre...

@Raph I don't have Zulip, but you have motivated me to play a lot the last weekends with Vulkan compute, and I am a bit disappointed.

I think I was expecting to have a similar level of control as with CUDA or OpenCL, but it turns out that wgpu/Vulkan compute is more like "OpenGL compute shaders", which is "compute for a rendering pipeline", something quite different than "CUDA/OpenCL compute", which has nothing to do with rendering.

If all you want is to stick a compute shader in your rendering pipeline, then wgpu/vulkan are great IMO.

If you want to do CFD, linear algebra, or machine learning, then it feels like fighting against a technology that just wasn't meant to be used for that. I managed to write a tiny explicit CFD code with Vulkan compute just fine, but the moment I wanted to implement a slightly more complicated algorithm, I was limited by things like how to properly do linear algebra (e.g. I needed an algebraic multigrid preconditioner and a conjugate gradient implementation, and ended up fighting over how to control what gets put into shared memory, and when). When trying some NN kernels, I got stuck on the same issue.

I managed to implement an FFT and convolutions with Vulkan, but I couldn't get the convolutions to match CUDA's performance, because with Vulkan I had to use a "pull" algorithm (read from all points in the window, write to one point), and that just performed much worse than the "push" algorithm that is possible in CUDA (maintain a shared-memory cache that you write to; as you read one value from the window, write its contribution to multiple locations, then do a full cache write to global memory). In numbers, I went from ~150 GB/s throughput with Vulkan to ~700 GB/s with CUDA.

I think maybe Vulkan/SPIRV could benefit from better compilers that take the "pull" algorithm and compile it to a "push" one using shared memory, but you are then at the mercy of the compiler for a 4x perf diff.

I think that if I have a rendering pipeline in which I want to integrate some compute, then wgpu/Vulkan are great, particularly because chances are the compute you need is the per-"vertex/pixel/..." naively-massively-parallel kind. But if I just have a raw compute pipeline, then it feels like using the wrong tool for the job.

Great paper. I believe it's possible to do this efficiently in Vulkan + SPIR-V, and also adapt it to WebGPU given the caveats mentioned above, but it's hard to say for sure without doing it. Clearly that paper uses very sophisticated techniques and would not be trivial to implement - it goes considerably past the [2] I linked.

I agree 100% that prefix sum is an excellent litmus test for this platform.

I'll give one example to fill in more specifics. The fence-free descriptor updates described in section 4.4 of your linked paper depend on 64 bit atomics. These are available using the VK_KHR_shader_atomic_int64 extension, and appear (using vulkan.gpuinfo.org) to be available on NV and AMD hardware but not Intel (and basically not at all on mobile). On Vulkan, you query at runtime for the presence of this extension, and swap in the appropriate kernel based on the result of the query; optimizing for hardware without 64 bit atomics might give you a different approach.

Not having to deal with all this is the reason NVidia owns the space :)

I am currently working on implementing decoupled look-back in Futhark, as my master's thesis. The development is under the branch `onepassscan`. It is still missing some features; e.g. it cannot yet do map-o-scan, meaning it cannot fuse the map function into the scan optimization. It is likely that the implementation will only be used for NVIDIA, since the cache guarantees are only promised to work on NVIDIA.

The current version of Futhark is using the reduce-then-scan strategy.
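For contrast with decoupled look-back, a sequential sketch of the reduce-then-scan strategy (Python, semantics only):

```python
def reduce_then_scan(xs, block):
    """Reduce-then-scan exclusive prefix sum: pass 1 reduces each tile to
    its sum, a scan over the per-tile sums yields each tile's offset, and
    pass 2 re-reads each tile to scan it locally, seeded with its offset.
    The input is read twice, which is why the single-pass decoupled
    look-back scheme can be substantially faster on bandwidth-bound
    sizes."""
    tiles = [xs[i:i + block] for i in range(0, len(xs), block)]
    sums = [sum(t) for t in tiles]              # pass 1: per-tile reduce
    offsets, acc = [], 0
    for s in sums:                              # small scan of tile sums
        offsets.append(acc)
        acc += s
    out = []
    for t, off in zip(tiles, offsets):          # pass 2: local scans
        a = off
        for x in t:
            out.append(a)
            a += x
    return out
```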


I'll note that these "cache guarantees" in the Vulkan world are the Vulkan 1.2 memory model, and are supported by the latest drivers for AMD, Intel, and NVidia [1]. This recent change is one big reason I'm saying Vulkan + SPIR-V is getting ready for real compute workloads, and this wasn't the case even months ago.

[1]: https://vulkan.gpuinfo.org/listdevicescoverage.php?extension...

Sounds like a great project.

   2-3 most fundamental 
   parallel algorithms 
Would you be able to list them? I'm asking because I do develop programming languages for parallel hardware. It would be very useful for me to look at your examples and apply them to my PL design ideas.

An alternative way I might be rendering my question could be: what primitives do you recommend a language should provide that's currently missing in the languages you evaluate?

> Would you be able to list them?

scan (e.g. prefix sum), merge, partition, reduce (e.g. minimum), tree contraction, sort
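As one example of how these interlock: a sketch of how `partition` is classically derived from an exclusive scan of predicate flags (Python, semantics only; this is why a language that only exposes `partition` as an opaque primitive, without a programmable `scan`, is limiting):

```python
def parallel_partition(xs, pred):
    """Stable partition via prefix sums: an exclusive scan of the 0/1
    predicate flags gives each matching element its output slot, so all
    writes are independent -- the structure a GPU needs."""
    flags = [1 if pred(x) else 0 for x in xs]
    slots, acc = [], 0
    for f in flags:            # exclusive scan of flags = output slots
        slots.append(acc)
        acc += f
    kept = [None] * acc
    for x, f, i in zip(xs, flags, slots):
        if f:
            kept[i] = x
    rest = [x for x, f in zip(xs, flags) if not f]
    return kept + rest
```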

> what primitives do you recommend a language should provide that's currently missing in the languages you evaluate?

Hardware primitives and powerful capabilities for defining safe abstractions over those.

Chances are that, e.g., I need to implement my own "primitive-like" operation (e.g. my own sort or scan). If your language provides a "partition" primitive, but not the tools to implement my own, I can't use your language.

When people create languages for GPUs, for some reason they add the C++ STL or similar as primitives, instead of providing users the tools to write such libraries themselves. The consequence is that those languages end up not being practical to use, and nobody uses them.


How to abstract "hardware primitives" in a way that can be instantiated on many GPU architectures without a performance penalty, and still be useful for higher-level programming, is not so clear. How would you, to take an example from the CPU world, fruitfully abstract the CPU's memory model? As far as I'm aware, that's not a solved problem as of April 2020, and papers on this subject are still appearing in top conferences.

There exists a language that can optimally target AMD's ROCm, NVIDIA's PTX, and multicore x86-64 CPUs: CUDA C++. This language has 3 production-quality compilers for these targets (nvcc, hip, and pgi).

Unfortunately, this language is from 2007, and the state of the art has not really been improved since then (except for the evolution of CUDA proper).

It would be cool for people to work on improving the state of the art, providing languages that are simpler, perform better, or are more high-level than CUDA, without sacrificing anything (i.e. true improvements).

There have been some notable experiments in this regard, e.g., Sequoia for Roadrunner scratchpads had very interesting abstraction capabilities over the cache hierarchy that could have led to a net improvement over CUDA __shared__ memory.

Most newer languages have the right goal of trying to simplify CUDA, but they end up picking trade-offs that only allow them to do so by sacrificing a lot of performance. That's not an interesting proposition for most CUDA developers - the reason they pick up CUDA is performance, and sacrificing ~5% might be acceptable if the productivity gains are there, but a 20-30% perf loss isn't acceptable - too much money involved.

One can simplify CUDA while retaining performance by restricting a language to GPUs and a particular domain, e.g., computer graphics / gfx shaders or even stencil codes or map-reduce, etc.

However, a lot of the widely advertised languages try to (1) target both GPUs and CPUs, compromising on some minimum common denominator of features, (2) support general-purpose compute kernels, and (3) significantly simplify CUDA, often doing so by just removing the explicit memory transfers, which compromises performance. These languages are quite neat, but they aren't really practical, because they aren't really true improvements over CUDA.

Most companies I work with that use CUDA today are using the state of the art (C++17 or 20 extensions), experimental libraries, drivers, etc. So it isn't hard to sell them into a new technology, _if it is better than what they are already using_.

Thanks. [1] contains an interesting discussion of Sequoia, and why it was considered a failed experiment, leading to Legion [2], its successor.

[1] https://news.ycombinator.com/item?id=18009581

[2] https://legion.stanford.edu/

Thanks for the link to the HN discussion.

The Regent language (Legion is Regent's runtime) is yet another async task-graph-based PGAS language/run-time, similar to, e.g., HPX, but leaving out what in my opinion made Sequoia interesting (e.g. the memory hierarchy abstraction capabilities).

As far as I understand, the idea in Regent is that the memory hierarchy is used dynamically, since the programmer cannot know much about the memory hierarchy at program-writing time anyway. So dynamic scheduling is used to construct the memory hierarchy and schedule at run-time. The typical use case for Regent is physicists writing e.g. weather or particle physics simulations; they are not aware of the details of L1/L2/L3 cache, memory, network, or cluster sizes.

This is probably quite different from your use case.

Those applications do linear algebra, and pretty much any kind of linear algebra on the GPU, including simple vector-vector dot-products, requires a very careful usage of the memory hierarchy.

For doing a dot-product on the GPU, you need to take your vectors in global memory, and:

- split them into chunks that will be processed by thread blocks

- allocate shared memory for storing partial reductions from warps within a thread-block

- decide how many elements a thread operates on, and allocate enough registers within a thread

- do thread-block reductions on shared memory

- communicate thread-block reduction to global memory

- do a final inter-thread block reduction

A sufficiently-smart compiler can take a:

   sum = 0.
   for a, b in zip(x, y): sum += a * b
and transform it into an algorithm that does the above (e.g. a call to CUB).
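The block/warp strategy in the bullet list above can be mimicked on the CPU. Here is a Python sketch where each chunk stands in for a thread block's shared-memory reduction (the block size and function name are illustrative, not from any real API):

```python
def gpu_style_dot(x, y, block_size=256):
    """CPU sketch of the GPU dot-product strategy described above:
    split the vectors into chunks (one per 'thread block'), reduce
    each chunk locally (standing in for the warp/shared-memory
    reductions), write the partial result out (the store to global
    memory), then do a final reduction over the partials (the
    inter-block step)."""
    partials = []
    for start in range(0, len(x), block_size):
        xs = x[start:start + block_size]
        ys = y[start:start + block_size]
        # Per-block reduction; on a GPU this is itself a parallel tree.
        partials.append(sum(a * b for a, b in zip(xs, ys)))
    # Final inter-block reduction over the partial sums.
    return sum(partials)
```

On a real GPU each of those steps maps to a distinct mechanism (registers, __shared__ memory, global memory), which is exactly the machinery a language must expose.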

But if you are not able to implement the above in the language, it suffices for the user to run into the need to do so once (e.g. your compiler does not optimize their slightly different reduction efficiently), for the value proposition of your language to suffer a huge hit (if I need to learn CUDA anyways, I might just use CUDA from the start; if I don't need performance, I wouldn't be using a super-expensive GPU, etc.).

This is IMO why CUDA is successful, and pretty much all other languages are not, maybe with the exception of OpenACC, which has great CUDA interop (so you can start with OpenACC, and use CUDA inline for a single kernel if you need to).

The 6 requirements you list for doing a dot-product on the GPU can be phrased in the abstract as a constraint-solving problem where the number of thread blocks, the cost of communication, etc. are parameters.

Is Vulkan really becoming viable for compute? It seems like something that's obviously worth exploring given the overlap with more traditional graphics workloads, but I have seen some claims that it's still lagging behind OpenCL wrt. some features, and that it could be good for some things but is not obviously competitive across the board. It's possible that this impression is outdated though.

I think the answer to this is emerging, but I would say yes. Vulkan and SPIR-V are rapidly gaining features, while OpenCL is stagnating. Some of these features are very recent, like pointers and a memory model. Another area I'm expecting Vulkan to pull ahead is support for efficient float16 computation, which is important for ML workloads, and is increasingly well supported in hardware.

One thing to keep in mind though, programming directly in Vulkan is too low-level for most users, while OpenCL is a fairly friendly API. We need tools that can compile to SPIR-V and deploy on Vulkan, but high enough level you can express your problem rather than (for example) manually placing pipeline barriers between dispatches. The lack of such tools, and having to build some of my own, is one of the reasons for the slowdown in progress on my renderer project.

In CUDA I can just use ISO C++17 (with C++20 around the corner) plus CUDA extensions, or any other language that targets PTX, with very nice graphical debugging of what is going on on the GPGPU.

OpenCL is playing catch up with CUDA, but it requires at least version 2.0 for similar capabilities, while many vendors are still shipping 1.x drivers.

Vulkan, well it seems to be the same story as OpenCL regarding compute tooling, but lets see how much love it will get from NVidia.

I advise watching the GTC 2020 talks regarding CUDA and Vulkan compute.

If you are looking for a single-source GPGPU framework, you should look at SYCL, not Vulkan or OpenCL.

CUDA is unfortunately very, very far ahead of Vulkan. "Graphics Compute" is quite different from CUDA at the advanced levels if you've never used it.

Have you looked at SYCL yet?

When I heard the definition of SIMT, it helped me understand the if / do_a / do_b concept. But your subsequent explanation helped me understand what went beyond that. Thank you for the talk.

Do you think other alternatives to Cuda will have the same drawback as OpenCL, that Nvidia won't support it properly and therefore not be a viable option on their hardware which is dominating the market at the moment?

If games use it, Nvidia will support it properly. While Vulkan hasn't "won" in the game space (and might not ever), SPIR-V actually already has support from Microsoft. The HLSL DXC compiler can natively output SPIR-V: https://github.com/Microsoft/DirectXShaderCompiler/wiki/SPIR...

So it's not implausible that there's a future world here where DX13 or DX14 use SPIR-V instead of DXIL. After all this makes AMD, Intel, and Nvidia's driver teams happier. Game devs also are happier as then they can ship a single precompiled shader on all platforms - PC & consoles. Microsoft's strong position with DirectX doesn't come from DXIL, but instead from making game developers happy, so SPIR-V's future could easily be very bright.

And then if games are using SPIR-V for compute (and they already use GPU compute for various things), then Nvidia is absolutely going to optimize the ever-living snot out of it. Likely even more than they do for CUDA, as gaming is by far Nvidia's largest source of revenue: https://s22.q4cdn.com/364334381/files/doc_financials/quarter...

The work to make DXC generate SPIR-V was actually done by Google and Samsung, given game developers' resistance to downgrading their HLSL experience to GLSL.

Microsoft only open sourced the DXC compiler and approved their changes.

To see how much NVidia cares for SPIR-V, check the amount of Vulkan/OpenGL talks at GTC 2020 versus CUDA and related tooling.

Apparently OctaneRender decided to move away from their Vulkan prototype and just adopt Optix 7 instead.

No, precisely because I see the emerging alternatives as being based on Vulkan and SPIR-V, which I believe have momentum far greater than OpenCL. The only question is whether a compute platform based on these standards and open source tools could be anywhere nearly as good as CUDA. I believe we're far from that today, but it is possible and there's a lot of good work in that direction.

I'd really like to see the ecosystem move to Vulkan and SPIR-V as well. It's great to see that I'm not the only one! I think Vulkan has a lot of great momentum behind it, especially since you can see some game shops porting fairly old games to it right now. Ubisoft's Rainbow 6 Siege is one example.

Kind of bums me out that Apple decided to double down on Metal and not just go with Vulkan directly, but MoltenVK is probably fine to deal with that. I would have really liked to see all other platforms effectively unite behind a DirectX alternative finally.

Regarding other platforms, Sony is still on LibGNM (and LibGNMX) and most Switch titles actually make use of NVN or middleware engines.

Note that OpenCL can compile to SPIR-V, so as long as Vulkan is used as the low-level cross-platform API we don't lose it.

What about the work using SPIR-V as an intermediate representation for OpenCL? If anything, it seems like we're moving towards a common foundation for GPU compute, with an expanding and diversifying array of higher-level tooling.

What do you think of optimizers such as XLA, TVM or Glow that try to smooth out underlying hardware differences?

I don't follow this space closely (I believe XLA is related to MLIR, which I did mention), but strongly suspect that trying to run machine learning workloads portably is going to be one of the major drivers for building out tools and infrastructure for GPU compute.

Doesn't that exclude Apple?

That's a complex question, again the long answer is no. Both MoltenVK and the gfx-hal portability work (from Mozilla) allow you to run Vulkan workloads on Metal, thanks in large part to spirv-cross. I also see ~WebGL~ WebGPU on the horizon, and Apple is a very active participant in that process. Code written today to run on Vulkan, using any language that compiles to SPIR-V, will be pretty easy to adapt to the WebGPU future.

I think you meant to say "WebGPU" instead of "WebGL" here :)

Duh. Haven't had my coffee yet. Thanks.

Yes, though the MoltenVK project provides a Vulkan->Metal translation layer.

Well, what are the chances that less vendor lock in is in nVidia's interest?

Is it vendor lock-in if the clients want to be there? It's more like you can be in the club with the good music and hip people, or the club with the average unknown DJ and an empty dance floor.

That makes sense for GPU-compute-in-the-cloud applications where you can pick the hardware and don't care if other hardware doesn't work, but some people want to write programs that use GPGPU that can run on anyone's machine.

Halfway through you were bringing up hopes for a great new way to do 2D rendering on GPUs but then there is no show of actual implementation. Rather anticlimactic!

Even the decades-old Cairo beats Vulkan/GPU-based renderers like Skia in some tasks. Can it be that hardware intended for ML or 3D graphics is not suited at all for 2D rendering (parallelism vs. pipelining)? How will WebGPU solve those problems?

To me this talk raises more questions than it answers...

It's work in progress. I was hoping to get more done on the renderer but various things have been slowing me down. But lately I've been feeling a new burst of energy, so hope to have something more to show before long.

I agree that the talk raises a lot of questions. Hopefully that's a good thing, as I think they deserve the attention of curious people!

Regarding the mismatch between 3D hardware and 2D rendering, we'll need hard empirical numbers, but the approach I'm taking is basically to write a software renderer (not unlike Cairo or even libart which predates it), just one that's designed for massive parallelism so it can run on the GPU. This will have performance characteristics more similar to Cairo (just faster as so much more throughput is available) than rasterization-based approaches. The latter really fall down when there's soft clipping and lots of blending, as that requires generation of intermediate texture buffers. I do address this in the talk :)

I hope I will have more solid answers for some of these questions before long.

I've been experimenting with GPU-based 2D rendering using signed distance functions as drawing primitives for quite a long time now. My experience is that as soon as the object tree exceeds a certain size, the CPU will always beat the GPU. The only way the GPU can win is when you have high-complexity drawing primitives and a low number of objects, which is exactly the opposite of what one usually encounters in the real world. Maybe the partitioning approach you are proposing will help, but only real-world implementations and benchmarks will tell.

I'm using partitioning in Pathfinder and it works quite well. It's not hard to beat the CPU. The forerunner of these approaches is the "Massively Parallel Vector Graphics" paper, if you're interested.

It's interesting to note the parallels (no pun intended) with raytracing, in which everyone said it was impossible to improve over the CPU right up until people discovered how to effectively partition with acceleration structures and now raytracers are doing more and more on GPU.

I recently-ish wrote a GPU based renderer for a GUI toolkit. I haven't put a ton of effort into optimisation, so I wouldn't be surprised if for smaller windows with lots of elements the GPU renderer was slower, but once you get to high-dpi displays with larger windows the CPU approach hits its limits. I don't have hard numbers, but we went from an unusable ~15fps at 8k resolution to hitting our target of 60fps with the GPU renderer.

If you use forward rendering, which means placing a bunch of textures on the screen, you are not really "drawing". We are probably not talking about the same thing.

It's not as complex as, say, drawing SVGs, but I've implemented Bezier curves, stippled lines, arbitrary rounded rectangles (HTML style) and subpixel-antialiased font blitting. I'd say that's far from just "placing" a bunch of textures.

Building a 2D signed distance field renderer that works similarly to Dreams, setting pixels instead of emitting a point cloud, would allow for complex object trees with lots of edits and perform well on the GPU.

Some classes of 2D rendering are faster on the CPU, but not all.

But even if it is slower, there may be advantages of moving the rendering to the GPU, like leaving the CPU to do other stuff, less power usage overall, more RAM available, etc.

So far no pure GPU 2D renderer exists. Skia moves some tasks to the GPU and leaves others to the CPU. This is the worst of both worlds, because you need as much memory as before but have more bandwidth pressure on the PCIe bus, and at the same time you are waking up two devices from sleep states instead of one, which will consume more power. Then you have the additional problem of security domains on GPUs, which complicates things even further. I have serious doubts that GPU-based 2D rendering, the way it is done today, will help performance or reduce power consumption at all.

The key word here is "forward rendering". There exists no forward rendering primitives on GPUs besides triangles.

> I have serious doubts that GPU based 2D rendering in a way that it is done today will help performance or reduce power consumption at all.

Your doubts are entirely unfounded. Every mobile device uses Skia-style hybrid GPU/CPU rendering entirely due to performance, and power consumption didn't take a hit from it.

There's a question of why this hasn't also happened on desktop, but I think it's just because nobody cares if desktop GUI performance is garbage. It's not touch-driven, so it's less noticeable. It's not front & center on reviewer comparisons like it is/was for iOS vs. Android. And since so much is web-based, the number of commonly used GUI desktop apps has plummeted.

And on platforms where Cairo is relevant there's bigger problems in graphics than rendering speed anyway, like the abysmal state of composition foundations (aka, X11 still being so common and wayland being stuck in permanent not-quite-there).

Isn’t Quartz GPU accelerated on OSX?

The state of the art on path rendering, both academic approaches such as Random Access Vector Graphics (Hoppe and Nehab), Massively Parallel Vector Graphics (Ganacim et al) and more practical implementations such as Slug (Lengyel), Pathfinder (Walton) have advanced considerably beyond NV path rendering. I also hope to do considerably better - my approach is most similar to Pathfinder but with more of the expensive CPU operations moved to GPU.

I'd like to humbly suggest that Rasterizer may have extended the state of the art of 2D GPU-accelerated rendering: https://twitter.com/Vector_GL/status/1214507005316128768

What is a reasonable lower limit on latency that I could expect from a GPU pipeline?

For example, I know that if I use WebGL and issue a `readPixels` call then that seems to take on the order of 1 to 10 milliseconds no matter what I do. This suggests that if I needed to react to something in under 100 microseconds, I probably shouldn't have the computation of the reaction involve WebGL. But is that true of GPU compute in general, or just an artifact of the abstraction exposed by WebGL?

You do have to design around the GPU typically being almost like a server. It's very fast, but also "very far" away - both in latency & in bandwidth. PCI-E 3.0 x16 is 'only' 16GB/s. That's fast for I/O, but still in the ballpark of being like I/O & not like RAM. And that's if you even get x16 electrical - x8 for GPUs is pretty common in eg. laptops where the other lanes go to things like thunderbolt 3.

Integrated graphics are a different story, they often just have a direct connection to DRAM and CPU<->GPU communication is therefore super fast & low latency. In theory, anyway, if the API abstractions even let you share memory between the two.

So that means even for things where GPUs should do really well at it, like summing two arrays together, doing it on the CPU can still be a lot faster if the inputs & outputs are local to the CPU.
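A back-of-the-envelope sketch of that point in Python (the bandwidth figures are rough assumptions for a discrete GPU on PCIe 3.0 x16 and a typical desktop CPU, not measurements):

```python
N = 100_000_000      # elements per input array
BYTES = 4            # float32
PCIE_BW = 16e9       # assumed PCIe 3.0 x16 bandwidth, bytes/s
DRAM_BW = 20e9       # assumed (conservative) CPU memory bandwidth, bytes/s

# Summing two arrays elementwise is memory-bound:
# read two inputs, write one output.
traffic = 3 * N * BYTES

# Discrete GPU: all of that traffic has to cross the PCIe bus
# (ignoring the kernel itself, which at best overlaps the transfers).
gpu_s = traffic / PCIE_BW   # ~0.075 s just moving data

# CPU: the same traffic goes to local DRAM instead.
cpu_s = traffic / DRAM_BW   # ~0.06 s for the entire computation
```

Since the data has to move either way, the GPU loses whenever the bus is slower than local memory, regardless of how fast its ALUs are. This is also why the integrated-GPU case mentioned above is different: there the "bus" is DRAM itself.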

There's a bunch of intractable latency, but there are patterns that can help maintain pipelining, if you don't need the data immediately: https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API/W...

Right, but if I go closer to the metal how much of the latency is really truly intractable? 100 microseconds? 1 millisecond? 10 milliseconds?

Can't speak for WebGL APIs, but in OpenGL (and other C++ graphics/GPGPU platforms), it is certainly possible (and necessary) to write fully interactive real-time apps that rely on a few rounds of communication between GPU and CPU code every frame. At 60 fps, you have 16 ms per frame. As long as you aren't copying huge arrays back and forth, I'd say you can budget in the 1 ms order-of-magnitude per data transfer.

I don't know how much of that would be avoidable if you were actually closer to the metal than the raw OpenGL C++ API calls.

With CUDA or OpenCL, you can launch a kernel and get results back in something like 10 microseconds (for trivial 'results' like setting a flag or something). It varies by hardware, platform, etc., but that's the ballpark.

Thanks for the video and slides, super interesting!

One question: in one of the slides, you mention the Vulkan 1.2 / SPIR-V 1.5 Vulkan Memory model. What exactly is this and why do you think this is meaningful? Would you have any links to learn more? When I read the SPIR-V specification, it mentions that this kind of memory model exists, but it does not put any effort into explaining why it's worthwhile or how it works on a high level.

That could be the subject of a whole other talk :)

The memory model is needed to do more sophisticated communication between the parallel workgroups. The classic example is to make a queue, even a simple one such as a simple single-producer single-consumer ring buffer (for communication between a pair of workgroups). The way this works, the producer fills an element in the buffer, then bumps an index. The consumer observes the index, then reads the element from the buffer.

Without a memory model, there's no way to guarantee that this ordering is preserved, and (to map this to hardware), the write of the index bump could flow through the caches, while the write of the element could still be "dirty" in the cache on the producer, so the consumer reads stale data.

What the memory model does is provide the programmer an explicit way to express ordering constraints. So the write of the index bump is with "release" semantics and the read of that by the consumer is with "acquire" semantics, which guarantees that producer writes before the release are visible to consumer reads after the acquire.

This unlocks a whole range of sophisticated concurrent data structures, such as concurrent hash maps, fancier memory allocation (even without a memory model, a simple allocate-only bump allocator is feasible), and so on.
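The single-producer/single-consumer ring buffer described above can be sketched on the CPU in Python. Python's GIL supplies the ordering for free here; the comments mark exactly where a GPU kernel would need the explicit release/acquire operations the memory model provides. (An illustrative CPU analogue, not GPU code.)

```python
import threading
import time

SIZE = 8
buf = [None] * SIZE
write_idx = 0   # bumped only by the producer
read_idx = 0    # bumped only by the consumer

def producer(items):
    global write_idx
    for item in items:
        while write_idx - read_idx == SIZE:   # buffer full, spin
            time.sleep(0)
        buf[write_idx % SIZE] = item          # 1. fill the element
        # On a GPU, this index bump must be a *release* store, so the
        # element write above is guaranteed visible before the new index.
        write_idx += 1                        # 2. bump the index

def consumer(n, out):
    global read_idx
    for _ in range(n):
        while read_idx == write_idx:          # buffer empty, spin
            time.sleep(0)
        # On a GPU, this index read needs *acquire* semantics, so the
        # element read below sees the producer's write, not stale cache.
        out.append(buf[read_idx % SIZE])
        read_idx += 1

items = list(range(100))
out = []
p = threading.Thread(target=producer, args=(items,))
c = threading.Thread(target=consumer, args=(len(items), out))
p.start(); c.start(); p.join(); c.join()
```

Without the release/acquire pairing, the failure mode is exactly the one described: the index bump becomes visible while the element itself is still dirty in the producer's cache.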

Thanks for the question and kind words, I hope this illuminates.

Thanks, the answer with some examples was exactly what I was hoping to get answered! Thank you for taking the time to respond.

Perhaps in a few years when we'll start to see things like CXL providing cache-coherency between main memory and GPU memory this will become important?

Really good talk, thanks a lot.

Any opinion on ROCm?

ROCm provides a good OpenCL 2.x implementation targeting AMD GCN. The compiler/optimizer is clang-based. It is open-source. One great thing is that ROCm OpenCL supports GCN assembly in the OpenCL source, which is handy in some corner-cases that are not properly handled by the optimizer.

Doesn't support many platforms

of the links you put on the last page (or other gpu languages you might not have included) what's the most mature way to write decent gpu code right now (even if not necessarily optimal). i'm right this moment working on a project where i'm trying to speed up a blob detection algorithm by porting it to gpu. i'm using pytorch because the primitives fit naturally but if that weren't the case i think i would be kind of at a loss for how to do it (i do not know, as of yet, how to write cuda kernels).

btw the Julia link is dead

It depends on what you're trying to do. If you want to get stuff done now then just get an Nvidia card and write CUDA. The tools and knowledge community are vastly ahead of other approaches.

If you're doing image-like stuff, Halide is good, there's lots of stuff shipping on it.

One other project I'd add is emu, which is built on wgpu. I think that's closest to the future I have in mind, but still in early stages.


okay in that case can you recommend a good tut for writing cuda code?

The official docs are pretty good. This is probably a good starting point: https://devblogs.nvidia.com/even-easier-introduction-cuda/

Nvidia also sponsored a Udacity course, but I don't have any direct experience with it: https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...

If you're interested in getting started writing GPU code and you're currently dealing with PyTorch a good place to start might be using NUMBA for CUDA programming. It lets you do in Python what you'd normally do in C/C++. It's great, because setup is minimal and it has some good tutorial materials. Plus, you're mostly emulating what you'd end up doing in C/C++, so conceptually the code is very similar.

When I was at GTC 2018 this is one of the training sessions I went to and you can now access that here[1]. The official NUMBA for CUDA docs are great too[2]. Also, you can avoid a lot of installation/dependencies headaches by using anaconda/miniconda if you're doing this stuff locally.

[1]: https://github.com/ContinuumIO/gtc2018-numba


I have a love/hate relationship with numba. On one hand, the minimal setup is excellent. On the other hand, it has so many limitations and rough patches that occasionally we have been forced to abandon the numba code and reimplement in C -- by which time we've spent more time fighting numba than the C reimplementation takes. This has happened twice out of 5 major recent attempts to use it.


Numba CPU is pretty good. I'd recommend it, modulo some expectation control:

* In order to actually compile a function, it must be written in a very minimal python subset that feels like C

* Numpy is only partially supported in this subset. Expect to manually reimplement ~1/3 of your simple numpy code and all of your fancy numpy code.


Numba GPU is not good. I'd avoid by default, unless you're looking for a project to contribute to:

* When we first installed, it was trivially broken by the removal of a long-deprecated CUDA API that had happened a year prior. It was an easy fix, we just chopped the reference out of numba, but it was a bad sign that foretold problems to come.

* It has lingering subtle bugs. One time we passed in a float array, it made us declare the float type three times (silly but not bad), and then it ignored the types and silently binary cast to double (bad). We found a "ghost town" thread where others had the issue.

* It has severe structural limitations. To a first order, GPU performance is all about avoiding stalls, which are usually triggered by CPU-GPU copies, and Numba's limited buffer abstractions sometimes make that difficult (you have to destructure your code) or impossible.

* It doesn't play well with nvidia's perf tools. If you're doing GPU, you're after perf, and GPUs have lots of tricky caveats that can accidentally tank your perf, so this is a much bigger deal than jupyter (until recently) missing a debugger.

* If you're wondering if numba supports CUDA API XYZ, the answer is no.

Yes, I'd agree with all of your points really. You can see in my reply below that I was mostly advocating for it's use in learning the concepts behind GPU programming, so you can skip delving into C/C++ if you're uncomfortable with that.

If I had to pick something today to use for arbitrary GPU acceleration in Python I'd almost certainly opt for JAX instead, but I haven't seen the kind of tutorials in JAX that exist for NUMBA CUDA.

I also found edge cases in parallel code.

One loop that didn't have any invalid reads or writes segfaulted in a numba prange (e.g. OpenMP-style) loop.

Just reimplementing a loop as a few smaller subfunctions called sequentially worked.

No clue where the optimizer got confused on the 20 line loop body before it got split -- it's the same code.

that's interesting. i didn't realize numba did that sort of thing (i thought it was just a jit). i'll take a look. thanks!

Absolutely, happy to help!

I personally end up relying on TensorFlow for most of my GPU needs at work still, but that training session was incredibly helpful for me to understand what was going on under the hood and helped demystify the CUDA kernels in general.

what do you think of something like https://github.com/google/jax that i guess claims to be a jack of all trades, in the sense that it'll transform programs (through XLA???) to various architectures?

JAX is absolutely what I would recommend if you were trying to implement something usable in the long term and wanted to go beyond just learning about CUDA kernels!

Looking at the JAX documentation it's _much_ better than the last time I saw it. Their tutorials seem fairly solid in fact. I do want to point out the difference though. You're conceptually operating at a very different point in JAX than in NUMBA.

For example, consider multiplying 2 matrices in JAX on your GPU. That's a simple example with just a few lines of code in the JAX tutorial[1].

On the other hand in the NUMBA tutorial from GTC I mentioned earlier, you have notebook 4, "Writing CUDA Kernels"[2], which teaches you about the programming model used to write computation for GPU's.

I'm sorry I was unclear. My recommendation of NUMBA is not so much in advocating for its use in a project, but more so in using it and its tutorials as an easy way of learning and experimenting with CUDA kernels without jumping into the deep end with C/C++. If you actually want to write some usable CUDA code for a project, keeping in mind JAX is still experimental, I would fully advocate for JAX over NUMBA.

[1] https://jax.readthedocs.io/en/latest/notebooks/quickstart.ht...

[2] https://github.com/ContinuumIO/gtc2018-numba/blob/master/4%2...

At this years GTC NVidia is promoting cuPy as well.

That link should probably have been https://juliacomputing.com/industries/gpus.html, but that's rather old content. A better overview is https://juliagpu.org/, and you can find a demo of Julia's CUDA kernel programming capabilities here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

I've changed the link, thanks!

What were you talking about with the socialist revolution?

He was giving a talk to people from Jane Street [1], who make money on the stock market using OCaml (mentioned in the same bit).

[1] https://en.wikipedia.org/wiki/Jane_Street_Capital

Right but he basically asserted to everyone that a socialist revolution is coming to wipe them out. Is there something I don’t know? Should I be getting ready for this?

Huh? It was a joke. Explaining the joke will kill it, but here goes: He was saying that their OCaml skills would still be useful even if they couldn't make money in finance because capitalism fell over. The audience got the joke and laughed. It's nothing like what you're saying.

What I saw was this guy flatly asserting that a socialist revolution is coming, I thought at first he was talking about Bernie Sanders, and where there should be a punchline there is no punchline. Then everyone gives an awkward and half-cocked chuckle as one does when presented with something so bizarre.

My experience with Futhark (and to some extent accelerate) is that they're not terribly hard to program in applicable cases.

I'm not sure Futhark-generated code is as fast as specialist code but it's definitely a speedup compared to even skilled CPU implementations.

FWIW, if you're interested Nvidia-centric GPGPU stuff, Nvidia's tech conference went online-only this year and is free for all: https://www.nvidia.com/en-us/gtc/

I loved his commentary on Co-dfns for APL at 54:38 in the video. I program in J[0], and dabble in Dyalog APL. Aaron Hsu's work on the GPU compiler (he's the implementor of Co-dfns) is nothing short of mind-blowing[1]. His arguments for using APL vs. more mainstream languages are cogent and persuasive. The paradigm shift from C-like or even Haskell (ML) type languages is too big a gap for most mainstream programmers to accept. My opinion is that aside from Python's clean, indented syntax, the real reason it took off for DL/ML was Numpy and Pandas. Pandas was heavily influenced by the array languages per its creator Wes McKinney. Besides, I just like playing with J.

I have looked briefly at Futhark, and the apltail compiler[2], but I am trying to focus on the APL family, because my time is limited. I am a GPU dilettante who has tried basic CUDA in C. I tried Rust, but Zig [3] is working out better for me, given that I make more progress with less effort and using C APIs is effortless; my Rust difficulties may just be my bias with PLs. I find Haskell easier than Rust.

I just read an HPC article today about a survey/questionnaire on the languages used among 57 experts. It's still predominantly C++, but with a lot of pain expressed by the experts/users. I agree SPIR-V sounds promising, and I hope to check it out. Just like DL, I think people don't realize it takes both: domain knowledge and algorithms, or experts and software. Somebody has to set up the computation based on domain knowledge and software knowledge. This shows itself to me when I wind up running somebody else's program, because I don't have a specific problem of my own that I'd like to implement as a learning exercise.

Great talk! I found it very understandable and paced just right!

[0] https://jsoftware.com

[1] https://scholarworks.iu.edu/dspace/handle/2022/24749

[2] https://futhark-lang.org/blog/2016-06-20-futhark-as-an-apl-c...

[3] https://ziglang.org

Hey eggy, I didn't know you were interested in Zig as well. Drop a line - contact's in my about profile.

I'm not well versed in this area at all -- but something I find fascinating is what I have heard of the early days of GPU compute. Before CUDA and the like, you would apparently take the data you wanted to compute on, structure it as an image/texture, and use the graphics API to ham-fist your way to the result you wanted. No idea if that's true, but pretty neat if it is.

This is actually still true in a lot of cases where proper compute shaders are not available - notably WebGL.

See: https://threejs.org/examples/webgl_gpgpu_birds.html and: https://github.com/yomboprime/GPGPU-threejs-demos

Though WebGPU is coming which I believe will mitigate this.

That works well for certain kinds of computations, basically where each thread can compute a "pixel" independently. Modern GPU compute has threadgroup local storage, subgroups, atomic operations, etc. These unlock huge categories of workloads that would not fit nicely in the rasterization pipeline.
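To make the contrast concrete, here is a CPU-side sketch (hypothetical, in Python/NumPy, not real shader code) of the kind of workgroup-local prefix sum that threadgroup shared memory enables. Each "workgroup" cooperatively builds a running sum over a shared tile, with all lanes stepping in lockstep; a pure pixel-per-thread rasterization model has no way to express this cooperation.

```python
import numpy as np

def workgroup_inclusive_scan(data, group_size=8):
    """Simulate a per-workgroup inclusive prefix sum (Hillis-Steele style).

    On a real GPU each workgroup would run this in threadgroup shared
    memory with a barrier between doubling steps; here the barrier is
    implicit because numpy updates all "lanes" at once.
    """
    out = np.array(data)
    for start in range(0, len(out), group_size):
        group = out[start:start + group_size]   # the "shared memory" tile
        offset = 1
        while offset < len(group):
            shifted = np.concatenate(
                [np.zeros(offset, dtype=group.dtype), group[:-offset]])
            group += shifted    # every lane adds its neighbor's partial sum
            offset *= 2         # a barrier would go here on a real GPU
    return out

xs = np.arange(1, 9)                    # one workgroup of 8 elements
print(workgroup_inclusive_scan(xs))     # inclusive scan: 1 3 6 10 15 21 28 36
```

Note that elements in different workgroups never interact, which is why full-array scans (like the decoupled-lookback approach mentioned elsewhere in this thread) need an extra cross-workgroup pass.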

As the other comment mentioned, this is still true if you want to support legacy graphics APIs. A non-negligible fraction (~20%) of Android phones cannot execute compute pipelines, as they only support OpenGL ES 3.0 - compute shaders were introduced in GLES 3.1.

While it is a headache if you want to support compute on legacy devices, I do think that writing regular vertex/fragment shaders for general-purpose GPU computation is an underrated pleasure, as you really need to break out the code-golf toolkit to squeeze out maximum performance.

To add to this - it's not just legacy/Android devices. Since Apple dropped support for OpenGL in favor of Metal, cross-platform OpenGL compute shaders are now impossible, since iOS will never get OpenGL ES 3.1. This has caused headaches for me, as I wrote a cross-platform game and am now pinned to GLES 3.0 forever.

I wouldn't call it ham-fisting.

It's really no different than mapping over an array, except the array is 2D instead of 1D. You supply data in one or more 2D arrays (textures). You then execute a function (fragment shader) which outputs the values for the result, referencing the data in the input arrays (textures). The result is written to another 2D array, which just happens to be called a texture instead of a "2D array".
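That mapping view can be written down directly (a plain-Python sketch, no GPU involved; the function and names are illustrative): the "fragment shader" is just a function from coordinates to an output value, applied independently at every cell of the result array, with read-only access to the input "textures".

```python
import numpy as np

def run_fragment_shader(shader, textures, shape):
    """Emulate GPGPU-via-rasterization: evaluate `shader` once per
    output cell. The shader may read the input textures anywhere,
    but can only write its own output cell - just like a real
    fragment shader in the rasterization pipeline."""
    out = np.empty(shape)
    for y in range(shape[0]):
        for x in range(shape[1]):
            out[y, x] = shader(x, y, textures)
    return out

# Two input "textures" and a shader that blends them per-cell.
a = np.arange(12.0).reshape(3, 4)
b = np.ones((3, 4))
result = run_fragment_shader(
    lambda x, y, tex: tex[0][y, x] + 2 * tex[1][y, x],
    textures=(a, b), shape=(3, 4))

assert np.array_equal(result, a + 2 * b)
```

The write-only-your-own-cell restriction is exactly why this style works well for "pixel-independent" workloads and poorly for anything needing cooperation between threads.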

Definitely true :)

Tensorflow.js (tf.Tensor) can be used as an easy to use interface to GPU compute (similar API to numpy) which works on all gpus not just Nvidia because it's backed by webgl https://www.tensorflow.org/js/guide/platform_environment

Great technical talk, but an hour-long video on one of the more visually appealing techs and not a single demo :(

Demos like this (https://paveldogreat.github.io/WebGL-Fluid-Simulation/) help show people why many small processors can beat a few large ones.

Noted, and I'll do my best to raise up demos for future presentations.

Sorry, the other replies to this are completely correct that you should know your audience and flashy demos might also be a turn-off. This was a fantastic technical demo, I simply work in graphics so am biased towards that.

As a counter-point to the above commenter, I'm someone who's in an adjacent industry (HFT), and to me personally spending time on pretty pictures is a turn-off. So I guess the takeaway is: know your audience.

I really really enjoyed your presentation, thanks for sharing.

I thought you did a really nice job with the slides. The diagrams made it easier to follow along.

Long video, so I haven't watched it all yet, but I have a bit of a naive question: do Vulkan and WebGL/WebGPU "fall back" to running on the CPU if "appropriate" hardware isn't available? Or are developers required to maintain separate code paths based on hardware?

I remember considering rewriting some tight loops in OpenCL, but then the maintenance headache of having multiple code paths made the refactor seem not worth it. I'd guess this is/was generally a major speedbump for adoption. I know there is POCL, which will run your kernels on the CPU, but it's not something you can expect to be available on every platform. Maybe if POCL were part of the kernel or bundled with every distro, the situation would be different.

I've seen some projects do compute in OpenGL ES 2.0, because that's ubiquitous and will always run (I think it's even a requirement for Android systems).

This is a good question. No, Vulkan and WebGPU will not fall back, they require hardware and driver support. What I think you're looking for is a higher level layer that will target GPU compute if it's available, otherwise CPU (or possibly other resources). Projects in this space include Halide and MLIR.

> Vulkan and WebGPU will not fall back

WebGL does automatically fall back to the CPU in Chrome, via SwiftShader[1]. WebGPU hasn't shipped yet but I think it's likely that a CPU backend with SwiftShader will be available eventually. And SwiftShader implements Vulkan now too, so while it's not automatic for Vulkan you can certainly have a fallback if you want it.

[1] https://github.com/google/swiftshader

I have a more theoretical followup then :)

So you spend a lot of time in the talk massaging your sequential CPU task into an appropriately GPU-shaped task. It's clearly tricky and involves reworking your algorithms to leverage the parallelism and the memory "features" of the GPU. But through that rewriting, have you actually substantively hurt the performance of the sequential program?

The big-picture question is: is the reverse problem, going from a GPU program to a CPU program, ever problematic? I mean in an algorithmic sense - without looking at micro-optimizations for leveraging CPU SIMD instructions or whatever. Or are you always going to be pretty okay with running your shaders sequentially one after another on your big luxurious CPU memory pool?

And ultimately, is there anything stopping you from compiling SPIR-V for a CPU? Could you not autogenerate in the kernel dispatcher a switch that branches to a precompiled on-CPU kernel if no driver is found? Then you'd finally really get compile-once-run-everywhere GPU code.

I guess since it's not being done then I'm missing something haha :) Maybe you are going to often hit scenarios where you'd say "No, if I'm going to run this on the CPU I need to fundamentally rewrite the algorithm"
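The dispatcher switch described in the question might look something like this (a purely hypothetical Python sketch; the kernels and the availability probe are stand-ins, not a real API):

```python
def make_dispatcher(gpu_kernel, cpu_kernel, gpu_available):
    """Pick a kernel once at startup: the GPU build if a driver is
    found, otherwise a CPU fallback compiled from the same source."""
    return gpu_kernel if gpu_available() else cpu_kernel

# Hypothetical kernels: the same doubling computation in both backends.
cpu_double = lambda xs: [2 * x for x in xs]
gpu_double = cpu_double  # stand-in; a real GPU path would go through a driver

run = make_dispatcher(gpu_double, cpu_double, gpu_available=lambda: False)
print(run([1, 2, 3]))    # falls back to the CPU kernel: [2, 4, 6]
```

The hard part, as the replies below the question note, isn't this switch - it's making the CPU-side kernel fast enough to be worth shipping.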

These are indeed interesting questions, thanks for asking them.

If you have a workload that runs really well on GPU, then it can be adapted to run on CPU. There are automated tools for this, but it's not mainstream. To answer one of your questions, spirv-cross has a C++ generator.

There are a bunch of things in the space between GPU and traditional scalar CPU. One of the most interesting is Intel's ispc. The Larrabee project was also an attempt to build some GPU features into a CPU, and that work is evolving into AVX-512. The new mask features are particularly useful for emulating the bitmap of active threads, as this is something that's hard to emulate with traditional SIMD.
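The "bitmap of active threads" idea can be emulated in a few lines (a hypothetical Python/NumPy sketch, not ispc or AVX-512 code): divergent control flow becomes a boolean mask, both sides of the branch are evaluated for every lane, and the mask selects which result each lane commits - the operation that mask registers make cheap in hardware.

```python
import numpy as np

def masked_branch(xs):
    """Evaluate `x if x >= 0 else -x` across all lanes without branching.

    A scalar CPU would branch per element; a GPU (or a masked SIMD
    unit) runs both sides for every lane and uses an execution mask
    to pick which result each lane keeps.
    """
    active = xs >= 0        # the "active thread" bitmap for the then-side
    then_side = xs          # all lanes execute the then-branch...
    else_side = -xs         # ...and all lanes execute the else-branch
    return np.where(active, then_side, else_side)

xs = np.array([-3, 1, -4, 1, -5])
print(masked_branch(xs))    # [3 1 4 1 5]
```

The cost model follows directly: heavily divergent code pays for both branches in every lane, which is why branchy workloads map poorly onto this execution style.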

I think it would be a very interesting project to build a compiler from SPIR-V compute workloads to optimized multithreaded SIMD, and there are projects exploring that: https://software.intel.com/en-us/articles/spir-v-to-ispc-con...

The main reason this hasn't happened is that when doing optimization it's always better to target the actual hardware than go through layers of translation. If you want to run a neural network on CPU, you're definitely going to get better performance out of matrix multiply code tuned for the specific SIMD implementation than something adapted from GPU. But I think there may still be a niche, especially if it's possible to get "pretty good" results.

For machine learning and imaging workloads in particular, there's probably a lot more juice in having a high level (Halide) or medium level (MLIR) representation, and targeting both GPU and CPU as backends that do optimization specific to their targets.

I'm really interested to see how this space evolves, it feels like there are major opportunities, while the scalar CPU side feels largely mined out to me.

Here is the question, though: would you consider Intel HD a native part of the CPU, or "some GPU"? I am still quite perplexed why everyone targets CUDA and no one tries to use the ubiquitous Intel HD for AI computations. I mean, Intel CPUs are everywhere (I know, I know, Ryzens are changing the situation, but still..), and almost always they have GPUs onboard.

They definitely qualify as GPU. I think the main reason people aren't using them is that tools are primitive. Also, historically they've been pretty anemic in horsepower, though are getting better. Still not competitive with high end discrete graphics cards, but absolutely getting there on the low end.

Intel GPU has three hidden strengths:

* CPU <-> GPU communication is cheaper, because they can actually share memory (a separate copy to staging buffers is not needed).

* Latency is potentially lower, although it's not clear to me yet that driver software can take advantage of the potential offered by the hardware. (More empirical measurement is needed)

* Subgroup operations on Intel appear to be wicked-fast, shuffle in particular.

Long story short, I think there is opportunity here that most people aren't exploiting.

It seems to me that if you use Apple Metal, then you are definitely (and automatically) exploiting Intel GPUs. So lots of people are actually exploiting it!

Metal targets whatever GPU is in the system, not necessarily Intel. In most cases there’s a discrete GPU in there by AMD or Nvidia.

Well, that's my point. As a Metal programmer, you don't have to distinguish (except for being aware of shared memory on integrated GPUs). The talk made it sound like mobile and desktop, not to mention Intel GPUs, are totally different worlds. From a certain point of view, and for many algorithms, they aren't - especially when the algorithm can be viewed as a functional program, as championed in the talk. In particular, most of the time there is an integrated GPU in there, because there are many more mobiles out there than desktops.

Note that both DX11 and DX12 do have the WARP CPU rasterizer to fall back on. WARP is quite capable; it can't compete with native hardware but is actually reasonably fast.

Vulkan now has a capable CPU implementation as well, with SwiftShader: https://github.com/google/swiftshader

Right, and OpenCL has POCL. The original question wasn't so much if the technology exists but will it be ubiquitous and guaranteed to be present on whatever platform you run on.

Looking at SwiftShader - it looks like you can bundle it with your application? Is that correct? I don't see a way to bundle POCL though

Yes, you bundle SwiftShader with your application.


If you want to target GPU in some form to increase performance (open source standards that currently fill this niche are OpenCL for compute, OpenGL/Vulkan for graphics) but you want to use the same code for CPU and GPU there is a solution!

There's a new standard called SYCL.

Currently it is based on C++11, and there are implementations that target OpenCL and CUDA.

Except that one needs to rely on tooling from Codeplay or Intel.

Nvidia isn't that keen on supporting OpenCL beyond 1.2, and is pushing OpenACC for GPGPU outside CUDA on their hardware.

OpenCL can target the CPU

Indeed. I think it's fair to say that OpenCL is excellent for targeting a very wide range of devices, with the exception of most high performance GPUs.

Question: what is the state of GPU compute w.r.t. Rust? Are the libraries on par with accelerate or (not exactly comparable) Futhark?

There's a ton of low and mid level infrastructure for GPU being written in Rust. That includes the gfx-hal work, the wgpu implementation, and more experimental languages such as emu.

But it's still early days, and I don't think there's a complete solution comparable to either of those yet.

Expect a lot of rough patches

I wonder at what point we finally decide that the "G" in GPU doesn't make sense, and instead start building compute modules that don't have 8 graphics connectors, don't mess with graphics APIs, and most importantly have open-source drivers.

Most “GPU”s found in datacentres don’t have any graphics connectors on the back of them. They do all their communication via PCIe.

In some cases they may have connectors for a high-speed communication fabric between many GPUs. NVLink is an example of this [0].

Outside of the consumer and workstation space GPUs really don’t look like anything you or I would recognise as a GPU anymore, and with APIs like CUDA, don’t look like GPUs to software either.

Really all we’re missing is open source drivers. But I wouldn’t hold your breath.

[0] https://www.nvidia.com/en-us/data-center/nvlink/

Except for that last part, you’re actually describing a return to the way things were decades ago, when accelerator cards of various types were common.

When someone finds a way to use a GPU to accelerate all the code in massive React webapps, that will be the day GPUs turn from a specialists-only device into something mainstream...

Are GPUs specialist devices now?

Programming them is a specialist task... I'd guess that <10% of programmers have ever written CUDA or something equivalent.

You can use something like Futhark and link it with your program.

Is there a meaningful speed difference between creating a 2D drawing engine using "GPU compute" vs going through normal libraries like DirectX or OpenGL? I'm fairly ignorant about what is exposed in the compute libraries vs traditional drawing libraries.

I think the main issue is where the output from the GPU shader goes. In OpenGL, for example, the output is the screen - that's the only output. But for compute, you want the output to come back into the CPU world for storing or manipulating, etc. So compute shaders provide an API for shuffling memory back and forth between CPU and GPU, and give you access to general-purpose output from the shader beyond just screen pixels.

Just a question: Has anyone figured out serverless ml predictions yet?

Great talk! Any thoughts on Intel's oneAPI?

Thanks. First I've heard of it, in fact, so I don't have any thoughts on it. It's interesting though, and would definitely be worth adding to a list of attempts to build portable layers.

I'm looking at targeting it from Julia, and the lower-level (Level Zero) API seems rather nice, resembling the CUDA driver API but building on SPIR-V. It's also nice how the API is decoupled from the implementation, so let's hope more vendors implement it (apparently Codeplay is working on a CUDA-based implementation).

Any YouTube video that has disabled comments is most likely a sign of unwillingness to hear criticism. Fearing I might waste my time, I don't take that chance.
