
A Taste of GPU Compute [video] - raphlinus
https://www.youtube.com/watch?v=eqkAaplKBc4
======
raphlinus
This is a talk I gave at Jane Street late February on GPU compute, especially
using graphics APIs such as Vulkan and the upcoming WGPU. Feel free to scan
through the slides [1], and to ask me anything.

[1]:
[https://docs.google.com/presentation/d/1FRH81IW9RffkJjm6ILFZ...](https://docs.google.com/presentation/d/1FRH81IW9RffkJjm6ILFZ7raCgFAUPXYYFXfiyKmhkx8/edit?usp=sharing)

~~~
fluffything
I write a lot of compute kernels in CUDA, and my litmus test is prefix sum,
for two reasons.

First, you can implement it in pure CUDA C++, and max out the memory bandwidth
of any Nvidia or AMD GPU. The CUB library provides a state-of-the-art
implementation (using decoupled lookback) that new programming languages can
be compared against.
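
For concreteness, that CUB baseline amounts to a couple of host-side calls. A
minimal sketch (the function and buffer names here are made up for
illustration):

    // Device-wide inclusive prefix sum using CUB's single-pass scan.
    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    void inclusive_sum(const int* d_in, int* d_out, int num_items) {
        // First call with a null workspace only queries the temp-storage size.
        void*  d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                      d_in, d_out, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Second call actually runs the scan.
        cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                      d_in, d_out, num_items);
        cudaFree(d_temp_storage);
    }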

Second, it is one of the most basic parallel algorithm building blocks. Many
parallel algorithms use it, and many parallel algorithms are "prefix-sum-
like". If I am not able to write prefix sum from scratch efficiently in your
programming language / library, I can't use it.

Every time someone shows a new programming language for compute, the examples
provided are super basic (e.g. `map(...).fold(...)`), but I have yet to see a
new programming language that can be used to implement the 2-3 most
fundamental parallel algorithms from any parallel algorithms graduate course.
For example, Futhark provides a prefix-sum intrinsic that just calls CUB; if
you want to implement a prefix-sum-like algorithm, you are out of luck. In
WGPU, it appears that prefix sum will be an intrinsic of WHSL, which sounds
like you would be out of luck there too.

You mentioned WGPU and Vulkan. Do you know how to implement prefix-sum from
scratch on these? If so, do you know how the performance compares against CUB?

~~~
borune
I am currently working on implementing decoupled lookback in Futhark as my
master's thesis. The development is under the branch `onepassscan`. It is
still missing some features; for instance, it cannot yet do map'o'scan,
meaning fuse the map function into the scan optimization. It is likely that
the implementation will only be used for Nvidia, since the cache guarantees
are only promised to hold on Nvidia.

The current version of Futhark is using the reduce-then-scan strategy.
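
For readers who haven't seen the technique: here is a rough sketch of the
decoupled-lookback idea, written in CUDA since that is where the reference
implementation (CUB) lives. This is illustrative only - production code packs
the status flag and value into a single word that is read and written
atomically, runs the lookback with a whole warp rather than one thread, and is
much more careful about memory ordering than this volatile/__threadfence()
simplification.

    #include <cuda_runtime.h>

    constexpr int TILE = 256;             // one tile is processed by one thread block

    enum TileStatus { TILE_INVALID = 0,   // nothing published yet
                      TILE_AGGREGATE = 1, // local sum of this tile is available
                      TILE_PREFIX = 2 };  // inclusive prefix of this tile is available

    struct TileState { int status; int aggregate; int inclusive_prefix; };

    // Single-pass inclusive scan: each tile publishes its aggregate, then looks
    // back over predecessors until it finds one whose full prefix is known.
    __global__ void scan_decoupled_lookback(const int* in, int* out, int n,
                                            volatile TileState* states,
                                            int* tile_counter) {
        __shared__ int tile;              // dynamically assigned tile index
        __shared__ int vals[TILE];
        __shared__ int exclusive;         // sum of everything before this tile

        // Hand out tiles in launch order so a tile never waits on one that has
        // not started running yet (this keeps the lookback from deadlocking).
        if (threadIdx.x == 0) tile = atomicAdd(tile_counter, 1);
        __syncthreads();

        int gid = tile * TILE + threadIdx.x;
        vals[threadIdx.x] = (gid < n) ? in[gid] : 0;
        __syncthreads();

        // Plain Hillis-Steele scan within the tile (short, not work-efficient).
        for (int offset = 1; offset < TILE; offset <<= 1) {
            int v = (threadIdx.x >= offset) ? vals[threadIdx.x - offset] : 0;
            __syncthreads();
            vals[threadIdx.x] += v;
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            // 1. Publish this tile's aggregate so later tiles can make progress.
            states[tile].aggregate = vals[TILE - 1];
            __threadfence();
            states[tile].status = TILE_AGGREGATE;

            // 2. Look back, summing aggregates until a full prefix is found.
            int running = 0;
            for (int pred = tile - 1; pred >= 0; --pred) {
                int s;
                do { s = states[pred].status; } while (s == TILE_INVALID);  // spin-wait
                __threadfence();
                if (s == TILE_PREFIX) { running += states[pred].inclusive_prefix; break; }
                running += states[pred].aggregate;
            }
            exclusive = running;

            // 3. Publish our own inclusive prefix so later tiles can stop at us.
            states[tile].inclusive_prefix = running + vals[TILE - 1];
            __threadfence();
            states[tile].status = TILE_PREFIX;
        }
        __syncthreads();

        if (gid < n) out[gid] = exclusive + vals[threadIdx.x];
    }

This would be launched with one 256-thread block per tile, with `states` and
`tile_counter` zero-initialized beforehand. The point of the scheme is that
each element is read from and written to global memory exactly once, which is
why this formulation can run at memory-bandwidth speed.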

~~~
raphlinus
Awesome!

I'll note that these "cache guarantees" in the Vulkan world are the Vulkan 1.2
memory model, and are supported by the latest drivers for AMD, Intel, and
Nvidia [1]. This recent change is one big reason I'm saying Vulkan + SPIR-V is
getting ready for real compute workloads; this wasn't the case even a few
months ago.

[1]:
[https://vulkan.gpuinfo.org/listdevicescoverage.php?extension...](https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_vulkan_memory_model&platform=windows)

------
vmchale
My experience with Futhark (and to some extent accelerate) is that they're not
terribly hard to program in applicable cases.

I'm not sure Futhark-generated code is as fast as specialist code but it's
definitely a speedup compared to even skilled CPU implementations.

------
01100011
FWIW, if you're interested in Nvidia-centric GPGPU stuff, Nvidia's tech
conference went online-only this year and is free for all:
[https://www.nvidia.com/en-us/gtc/](https://www.nvidia.com/en-us/gtc/)

------
eggy
I loved his commentary on Co-dfns for APL at 54:38 in the video. I program in
J [0] and dabble in Dyalog APL. Aaron Hsu's work on the GPU compiler (he's the
implementor of Co-dfns) is nothing short of mind-blowing [1]. His arguments for
using APL vs. more mainstream languages are cogent and persuasive. The
paradigm shift from C-like or even Haskell (ML) type languages is too big a
gap for most mainstream programmers to accept. My opinion is that, aside from
Python's clean, indented syntax, the real reason it took off for DL/ML was
NumPy and Pandas. Pandas was heavily influenced by the array languages, per
its creator Wes McKinney. Besides, I just like playing with J.

I have looked briefly at Futhark and the apltail compiler [2], but I am trying
to focus on the APL family, because my time is limited. I am a GPU dilettante
who has tried basic CUDA in C. I tried Rust, but Zig [3] is working out better
for me: I make more progress with less effort, and using C APIs is effortless.
My Rust difficulties may just be my bias with PLs; I find Haskell easier than
Rust.

I just read an HPC article today about a survey/questionnaire on the
languages used among 57 experts. It's still predominantly C++, but with a lot
of pain expressed by the experts/users. I agree SPIR-V sounds promising, and I
hope to check it out. Just like DL, I think people don't realize it needs to
be both domain knowledge and algorithms, or experts and software: somebody has
to set up the computation based on domain knowledge and software knowledge.
This shows itself to me when I wind up running somebody else's program,
because I don't have a specific problem of my own that I'd like to implement
as a learning and knowledge exercise.

Great talk! I found it very understandable and paced just right!

[0] [https://jsoftware.com](https://jsoftware.com)

[1]
[https://scholarworks.iu.edu/dspace/handle/2022/24749](https://scholarworks.iu.edu/dspace/handle/2022/24749)

[2] [https://futhark-lang.org/blog/2016-06-20-futhark-as-an-
apl-c...](https://futhark-lang.org/blog/2016-06-20-futhark-as-an-apl-compiler-
target.html)

[3] [https://ziglang.org](https://ziglang.org)

~~~
dnautics
hey eggy I didn't know you were interested in zig as well, drop a line,
contact's in my about profile.

------
streb-lo
I'm not well versed in this area at all -- but something I find fascinating is
what I have heard of the early days of GPU compute. Before CUDA and the like,
you would apparently take the data you wanted to compute on, structure it as
an image/texture, and use the graphics API to ham-fist your way to the result
you wanted. No idea if that's true, but pretty neat if it is.

~~~
Qasaur
As the other comment mentioned, this is still true if you want to support
legacy graphics APIs. A non-negligible share (~20%) of Android phones cannot
execute compute pipelines, as they only support OpenGL ES 3.0; compute shaders
were introduced in GLES 3.1.

While it is a headache if you want to support compute on legacy devices, I do
think that writing regular vertex/fragment shaders for general-purpose GPU
computation is an underrated pleasure, as you really need to break out the
code-golf toolkit to squeeze out maximum performance.

~~~
gh123man
To add to this - it's not just legacy/Android devices. Since Apple dropped
support for OpenGL in favor of Metal, cross-platform OpenGL compute shaders
are now impossible, since iOS will never get OpenGL ES 3.1. This has caused
headaches for me, as I wrote a cross-platform game and am now pinned to GLES
3.0 forever.

------
jarrell_mark
TensorFlow.js (tf.Tensor) can be used as an easy-to-use interface to GPU
compute (with an API similar to NumPy) that works on all GPUs, not just
Nvidia's, because it's backed by WebGL:
[https://www.tensorflow.org/js/guide/platform_environment](https://www.tensorflow.org/js/guide/platform_environment)

------
moritonal
Great technical talk, but an hour-long video on one of the more visually
appealing techs and not a single demo :(

Demos like this ([https://paveldogreat.github.io/WebGL-Fluid-
Simulation/](https://paveldogreat.github.io/WebGL-Fluid-Simulation/)) help
show people why many small processors can beat a few large ones.

~~~
raphlinus
Noted, and I'll do my best to raise up demos for future presentations.

~~~
moritonal
Sorry, the other replies to this are completely correct that you should know
your audience, and flashy demos might also be a turn-off. This was a fantastic
technical talk; I simply work in graphics, so I am biased towards that.

------
geokon
Long video, so I haven't watched it all yet, but I have a bit of a naive
question: do Vulkan and WebGL/WebGPU "fall back" to running on the CPU if
"appropriate" hardware isn't available? Or are developers required to maintain
separate code paths based on hardware?

I remember considering rewriting some tight loops in OpenCL, but then the
maintenance headache of having multiple code paths made the refactor seem not
worth it. I'd guess this is/was generally a major speedbump for adoption. I
know there is POCL, which will run your kernels on the CPU, but it's not
something you can expect to be available on every platform. Maybe if POCL were
part of the kernel or bundled with every distro, the situation would be
different.

I've seen some projects do compute in OpenGL ES 2.0 because that's ubiquitous
and will always run (I think it's even a requirement for Android systems).

~~~
raphlinus
This is a good question. No, Vulkan and WebGPU will not fall back; they
require hardware and driver support. What I think you're looking for is a
higher level layer that will target GPU compute if it's available, otherwise
CPU (or possibly other resources). Projects in this space include Halide and
MLIR.

~~~
geokon
I have a more theoretical followup then :)

So you spend a lot of time in the talk massaging your sequential CPU task into
an appropriate GPU task. It's clearly tricky and involves reworking your
algorithms to leverage the parallelism and the memory "features" of the GPU.
But through rewriting, have you actually substantively hurt the performance of
the sequential program?

The big-picture question is: is the reverse problem of going from a GPU
program to a CPU program ever problematic? I mean in an algorithmic sense -
without looking at micro-optimizations for leveraging CPU SIMD instructions or
whatever. Or are you always going to be pretty okay with running your shaders
sequentially one after another on your big luxurious CPU memory pool?

And ultimately, is there anything stopping you from compiling SPIR-V for a
CPU? Could you not autogenerate in the kernel dispatcher just a switch that'll
branch and run a precompiled on-CPU kernel if no driver is found? Then you'd
finally really get compile-once-run-everywhere GPU code.

I guess since it's not being done then I'm missing something haha :) Maybe you
are going to often hit scenarios where you'd say "No, if I'm going to run this
on the CPU I need to fundamentally rewrite the algorithm"

~~~
raphlinus
These are indeed interesting questions, thanks for asking them.

If you have a workload that runs really well on GPU, then it can be adapted to
run on CPU. There are automated tools for this, but it's not mainstream. To
answer one of your questions, spirv-cross has a C++ generator.

There are a bunch of things that are in the space between GPU and traditional
scalar CPU. One of the most interesting is Intel's ispc. The Larrabee project
was also an attempt to build some GPU features into a CPU, and that is
evolving into AVX-512. The new mask features are particularly useful for
emulating the bitmap of active threads, as this is something that's hard to
emulate with traditional SIMD.

I think it would be a very interesting project to build a compiler from SPIR-V
compute workloads to optimized multithreaded SIMD, and there are projects
exploring that: [https://software.intel.com/en-us/articles/spir-v-to-ispc-
con...](https://software.intel.com/en-us/articles/spir-v-to-ispc-convert-gpu-
compute-to-the-cpu)

The main reason this hasn't happened is that when doing optimization it's
always better to target the actual hardware than go through layers of
translation. If you want to run a neural network on CPU, you're definitely
going to get better performance out of matrix multiply code tuned for the
specific SIMD implementation than something adapted from GPU. But I think
there may still be a niche, especially if it's possible to get "pretty good"
results.

For machine learning and imaging workloads in particular, there's probably a
lot more juice in having a high level (Halide) or medium level (MLIR)
representation, and targeting both GPU and CPU as backends that do
optimization specific to their targets.

I'm really interested to see how this space evolves; it feels like there are
major opportunities, while the scalar CPU side feels largely mined out to me.

~~~
SomeoneFromCA
Here is the question, though: would you consider Intel HD a native part of the
CPU or "some GPU"? I am still quite perplexed why everyone targets CUDA and no
one tries to use the ubiquitous Intel HD for AI computations. I mean, Intel
CPUs are everywhere (I know, I know, Ryzens are changing the situation, but
still...), and they almost always have GPUs onboard.

~~~
raphlinus
They definitely qualify as GPU. I think the main reason people aren't using
them is that tools are primitive. Also, historically they've been pretty
anemic in horsepower, though they are getting better. Still not competitive with
high end discrete graphics cards, but absolutely getting there on the low end.

Intel GPU has three hidden strengths:

* CPU <-> GPU communication is cheaper, because they can actually share memory (a separate copy to staging buffers is not needed).

* Latency is potentially lower, although it's not clear to me yet that driver software can take advantage of the potential offered by the hardware. (More empirical measurement is needed)

* Subgroup operations on Intel appear to be wicked-fast, shuffle in particular.

Long story short, I think there is opportunity here that most people aren't
exploiting.
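
For readers who haven't met them: "subgroup operations" are the Vulkan spelling
of what CUDA calls warp intrinsics, and "shuffle" is the cross-lane data
exchange among them. A minimal sketch of the kind of shuffle-based scan this
matters for, in CUDA terms since that is this thread's running example (the
GLSL subgroup version is structurally the same, with subgroupShuffleUp in
place of __shfl_up_sync):

    // Warp-level inclusive scan built from shuffles (illustrative sketch).
    __device__ int warp_inclusive_scan(int x) {
        const unsigned full_mask = 0xffffffffu;
        const int lane = threadIdx.x & 31;                 // lane index within the warp
        for (int offset = 1; offset < 32; offset <<= 1) {
            int y = __shfl_up_sync(full_mask, x, offset);  // value from lane - offset
            if (lane >= offset) x += y;                    // low lanes keep their value
        }
        return x;
    }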

~~~
auggierose
It seems to me that if you use Apple Metal, then you are definitely (and
automatically) exploiting Intel GPUs. So lots of people are actually
exploiting it!

~~~
hellofunk
Metal targets whatever GPU is in the system, not necessarily Intel. In most
cases there's a discrete GPU in there from AMD or Nvidia.

~~~
auggierose
Well, that's my point. As a Metal programmer, you don't have to distinguish
(except for being aware of shared memory on integrated GPUs). The talk made it
sound like it is a totally different world between mobile / desktop, not to
mention Intel GPUs. From a certain point of view, and for many algorithms, it
isn't. Especially when the algorithm can be viewed as a functional program, as
championed in the talk. In particular, most of the time there is an integrated
one in there, because there are many more mobiles out there than desktops.

------
vmchale
Question: what is the state of GPU compute w.r.t. Rust? Are the libraries on
par with accelerate or (not exactly comparable) Futhark?

~~~
raphlinus
There's a ton of low and mid level infrastructure for GPU being written in
Rust. That includes the gfx-hal work, the wgpu implementation, and more
experimental languages such as emu.

But it's still early days, and I don't think there's a complete solution
comparable to either of those yet.

------
amelius
I wonder at what point we finally decide that the "G" in GPU doesn't make sense,
and instead we start building compute modules that don't have 8 graphics
connectors, and don't mess with graphics apis, and most importantly have open
source drivers.

~~~
avianlyric
Most “GPU”s found in datacentres don’t have any graphics connectors on the
back of them. They do all their communication via PCIe.

In some cases they may have connectors for a high-speed communication fabric
between many GPUs. NVLink is an example of this [0].

Outside of the consumer and workstation space GPUs really don’t look like
anything you or I would recognise as a GPU anymore, and with APIs like CUDA,
don’t look like GPUs to software either.

Really all we’re missing is open source drivers. But I wouldn’t hold your
breath.

[0] [https://www.nvidia.com/en-us/data-
center/nvlink/](https://www.nvidia.com/en-us/data-center/nvlink/)

------
londons_explore
When someone finds a way to use a GPU to accelerate all the code in massive
React webapps, that will be the day GPUs turn from a specialists-only device
into a mainstream one...

~~~
SQueeeeeL
Are GPUs specialist devices now?

~~~
londons_explore
Programming them is a specialist task... I'd guess that <10% of programmers
have ever written CUDA or something equivalent.

~~~
vmchale
You can use something like Futhark and link it with your program.

------
jklinger410
[https://www.google.com/search?q=how+to+pronounce+arbiter&oq=...](https://www.google.com/search?q=how+to+pronounce+arbiter&oq=how+to+pronounce+arbiter&aqs=chrome..69i57j0l4.2632j0j1&sourceid=chrome&ie=UTF-8)

------
seanalltogether
Is there a meaningful speed difference between creating a 2D drawing engine
using "GPU compute" vs going through normal libraries like DirectX or OpenGL?
I'm fairly ignorant about what is exposed in the compute libraries vs
traditional drawing libraries.

~~~
hellofunk
I think the main issue is where the output from the GPU shader is going.
OpenGL for example, the output is the screen. That’s the only output. But for
compute, you want the output to come back into the CPU world for storing or
manipulating or etc. So compute shaders provide that API for shuffling memory
back-and-forth between CPU and GPU and having access to general purpose output
from the shader beyond just screen pixels.
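
To make that concrete in CUDA terms (the thread's running example; the names
below are made up, and the Vulkan/GL-compute equivalent is a storage buffer
that is mapped or copied back after the dispatch), the round trip looks roughly
like this:

    #include <cuda_runtime.h>
    #include <vector>

    __global__ void square(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];
    }

    // The "general-purpose output" of GPU compute: results land in an ordinary
    // buffer that is copied back to the CPU instead of ending up as pixels.
    std::vector<float> square_on_gpu(const std::vector<float>& host_in) {
        int n = static_cast<int>(host_in.size());
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        square<<<(n + 255) / 256, 256>>>(d, n);

        std::vector<float> host_out(n);
        cudaMemcpy(host_out.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
        return host_out;
    }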

------
ptrenko
Just a question: has anyone figured out serverless ML predictions yet?

------
maleadt
Great talk! Any thoughts on Intel's oneAPI?

~~~
raphlinus
Thanks. First I've heard of it, in fact, so I don't have any thoughts on it.
It's interesting though, and would definitely be worth adding to a list of
attempts to build portable layers.

~~~
maleadt
I'm looking at targeting it from Julia, and the lower-level (Level Zero) API
seems rather nice, resembling the CUDA driver API but building on SPIR-V. It's
also nice how the API is decoupled from the implementation, so let's hope more
vendors implement it (apparently Codeplay is working on a CUDA-based
implementation).

------
thomas232233
Any youtube video that has disabled comments is most likely a sign of
unwilling to hear criticism and fearing i might waste my time i don't take
that chance.

