
RLSL: a Rust to SPIR-V Compiler - Impossible
https://docs.google.com/presentation/d/1_cB-sxUusYVoCYdXnqwAg2u3-lrqBfgrUj205ytxYaw/edit?usp=drivesdk
======
calebwin
I've been working on a project with a similar goal to RLSL, called Emu. It's
a procedural macro that automatically offloads portions of code to run on a
GPU. Here's a quick example...

    
    
      #[gpu_use]
      fn main() {
          let mut a = vec![0.2; 1000];
          let mut b = vec![0.3; 1000];
          let mut c = vec![0.0; 1000];
    
          gpu_do!(load(a));
          gpu_do!(load(b));
          gpu_do!(load(c));
          gpu_do!(launch());
          for i in 0..1000 {
              c[i] = a[i] + b[i];
          }
          gpu_do!(read(c));
          
          println!("{:?}", c);
      }
    

Emu is currently open-sourced as a 100% stable Rust library and while it only
supports a tiny subset of Rust, that subset is well-defined with useful
compile-time errors and a comprehensive test suite.

So if you're looking for a way to contribute to single-source GPGPU for Rust,
please consider helping expand Emu's supported subset of Rust. The repository
is at
[https://www.github.com/calebwin/emu](https://www.github.com/calebwin/emu)

I will say that since Emu works at the AST/syntax level, RLSL is of great
interest to me because it works instead at the MIR level which allows it to
more easily support a large subset of Rust.

~~~
minraws
- So RLSL can work with Emu?

- Would it mean that most general Rust code could be made to work on the GPU?
Or do you want Emu to work at the MIR level?

- Do you plan to actually try to do it?

Emu seems like a really cool project either way. :)

~~~
calebwin
- Maybe there is a component of RLSL that could be useful. I have to think
more about what I want that component to be.

- I want Emu to support general Rust code but still use stable Rust and
provide really nice compile-time errors. Maybe Emu could do AST-level checking
to (1) ensure that only the legal, transpilable-to-SPIR-V subset is used, (2)
infer the kernel parameters, (3) infer the global and local work sizes, and
then do MIR-level compilation to OpenCL or SPIR-V?

- At the moment, I want to focus on AST-level compilation because I think
many applications (AI, ML, simulations, etc.) can still technically be
implemented without a huge subset of Rust.

~~~
minraws
I was planning to write a tiny SVM in Rust just as a plaything, so I would
probably use Emu to see if I can speed it up....

Does Emu have a getting-started guide other than the docs?

~~~
calebwin
[https://docs.rs/em](https://docs.rs/em) contains not only documentation but
also a comprehensive explanation of how to use Emu effectively.

I would recommend looking through it first. Of course, if you have questions
feel free to ask -
[https://gitter.im/talk-about-emu/thoughts](https://gitter.im/talk-about-emu/thoughts)

------
mitchmindtree
Woah excellent to see Embark partnering with the RLSL project!

As someone who does a lot of creative-coding contracts with lots of video,
graphics and real-time requirements, RLSL has long been one of the Rust
projects that excites me the most. The idea of writing graphics and compute
shaders in Rust, a modern language with a decent type system, standard package
manager, module system, etc., is very exciting. It makes a lot of sense that
Embark sees the potential in this for game dev too.

The ability to share Rust code between the CPU and GPU alone will be so
useful. The number of times I've had to port some pre-existing Rust function
to GLSL or vice versa is silly.

Obviously the Rust that compiles to SPIR-V will be a subset of the Rust that
compiles to LLVM or WASM, but this still opens up so many doors for improved
ergonomics when interacting with the GPU from a Rust program.

I've long dreamed of an API that allows me to casually send a closure of
almost arbitrary Rust code (with the necessary 'lifetime + Send + Sync +
SpirvCompatible compile-time checks) off to the GPU for execution and get back
a GPU Future of the result. It looks like this may well become possible in the
future :)
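A minimal sketch of what such an API's surface might look like. All of the
names here (`spawn_gpu`, `GpuFuture`, the blanket-implemented `SpirvCompatible`
marker) are hypothetical, and the "GPU" is just an eager CPU stand-in:

```rust
// Hypothetical marker trait: a real compiler would prove the closure body
// stays within the SPIR-V-compatible subset of Rust.
trait SpirvCompatible {}
impl<T> SpirvCompatible for T {} // blanket impl, fine for a CPU stand-in

// Hypothetical future type returned by a GPU dispatch.
struct GpuFuture<T>(T);

impl<T> GpuFuture<T> {
    /// Block until the "GPU" work is done and return the result.
    fn wait(self) -> T {
        self.0
    }
}

// The dream API: send a 'static + Send closure "to the GPU" and get a
// future back. This stand-in simply runs the closure immediately.
fn spawn_gpu<T, F>(f: F) -> GpuFuture<T>
where
    F: FnOnce() -> T + Send + SpirvCompatible + 'static,
{
    GpuFuture(f())
}
```

Usage would then read something like
`let sum = spawn_gpu(|| (0..1000).map(|i| i as f32).sum::<f32>()).wait();`.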

~~~
littlestymaar
> Woah excellent to see Embark partnering with the RLSL project!

IIRC, actually they hired the guy behind RLSL a few months ago.

~~~
person_of_color
I thought it was the self-driving truck company.

------
AndrewGaspar
This is a great write-up. I think Rust could be a very nice language for
shader-like applications, which are already very functional and don't involve
a lot of shared, mutable state across threads.

In HPC, we're very much interested in GPU compute programming rather than
shader programming. In CUDA codes, you're typically doing transformations
from input buffers directly into output buffers in your CUDA kernels. This
should immediately raise red flags for a Rust developer - you've got shared,
mutable state across threads!

Consider this simple CUDA-ish Rust code with threads independently executing
over 0..input.len() (ignore the bounds bugs at i = 0 and i = input.len() - 1;
note that `in` is a reserved keyword in Rust, so the parameter is named
`input` here):

    
    
      fn stencil(i: usize, input: &[f32], out: &mut [f32]) {
          out[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0;
      }
    

(The `i` is a conceit around computing the index from thread/block IDs, but
the input and output arrays are pretty similar to the style CUDA promotes.)

It's obvious to me, the programmer, that I don't have any aliasing issue -
each thread only mutates a single index of the output array. However, Rust is
not smart enough to see this. If it allowed the definition of the kernel as
is, you could easily write multi-threaded code that has shared mutable access
to individual memory locations, violating Rust's memory model. OK, so you
force the kernel to look more like this, then:

    
    
      // `input` is the slice [i-1, i+1]; `in` is a reserved keyword in Rust
      fn stencil(input: &[f32], out: &mut f32) {
          *out = (input[0] + input[1] + input[2]) / 3.0;
      }
    

And you'd enforce Rust's invariants at the kernel launch site, computing the
valid slices at some higher level in the library in some "unsafe" code. But
this only solves the simple case where you have some array mapping to another
array, where the index relationship is obvious and it's easily provable that
there are no aliasing issues. Start layering in things like unique indirect
indexing, or perhaps non-unique indexing with atomic reductions, and it
becomes difficult to phrase your correct program in safe(!) Rust in a way
that is compatible with the borrow checker, at least without building a bunch
of abstractions to express each of your parallel patterns. Having to build a
bunch of bespoke abstractions may not be scalable for the kinds of developers
building big scientific codes.
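For what it's worth, the simple case can be phrased in safe Rust today by
handing each thread a disjoint `&mut` chunk of the output via `chunks_mut`,
so the compiler can see there is no aliasing. A sketch (`stencil_chunked` is
an illustrative name; `input` is assumed to carry a one-element halo on each
side):

```rust
use std::thread;

// `input` has a one-element halo on each side: input.len() == out.len() + 2.
// `chunks_mut` hands each spawned thread a disjoint &mut slice of `out`,
// so the borrow checker can verify there is no shared mutable state.
fn stencil_chunked(input: &[f32], out: &mut [f32], num_threads: usize) {
    assert_eq!(input.len(), out.len() + 2);
    let chunk = (out.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        for (t, out_chunk) in out.chunks_mut(chunk).enumerate() {
            let offset = t * chunk;
            s.spawn(move || {
                for (j, o) in out_chunk.iter_mut().enumerate() {
                    let i = offset + j;
                    *o = (input[i] + input[i + 1] + input[i + 2]) / 3.0;
                }
            });
        }
    });
}
```

This is exactly the kind of bespoke abstraction the paragraph above worries
about, though: it works for a 1D halo stencil and nothing else.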

Anyway, I'm curious if the folks at "Embark" have spent any time thinking
about the issue of shared, mutable state in GPU programming with Rust. It
seems like a deal breaker from where I stand.

~~~
gmueckl
Your point is especially important because most GPGPU performance gains come
from really clever use of shared memory. That is the very model of shared
state with concurrent access.

Usually, your shader or kernel is required to synchronize threads that need to
see each other's changes. I'm sure that there are at least some kernels out
there that skip synchronization and still work because there are enough other
instructions between write and read accesses that cover the issue up.

~~~
dragontamer
> shared memory.

I've done enough Hacker News discussions to note that the lay-reader won't
understand what this means, and this probably needs more elaboration.

"Shared Memory" is a special memory area inside GPUs where thread blocks
(NVidia) or workgroups (AMD)... a group of 32x to 1024x SIMD threads... can
perform inter-thread communication in an outrageously fast way.

Shared memory is extremely small: roughly 64kB in size. Optimizing shared
memory access involves resolving bank-conflicts, and lots of very-low level
thought. At a minimum, you need to consider how you fill shared memory (the
memcpy in) as well as how to get the final data out (the memcpy out).

---------

In many cases, synchronization to-and-from shared memory only requires a
__threadfence() instruction. Maybe only a __syncwarp() instruction in some
cases.

A lot of thought goes into the "ordering" of memory accesses, to make shared-
memory as fast as possible. See here for further details:
[https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39....](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html)

See "39.2.3 Avoiding Bank Conflicts" for how shared memory is optimized in
practice. You're not only thinking about the shared-state, but also the
average number of other threads hitting any particular bank. On a say...
32-bank system, you'll want all 32-banks to be utilized as much as possible.
If all 1024 threads are accessing bank#0, you'll be 1/32th the speed (and
banks #1 through #31 are all wasted).

A "bank" is basically an implicit "RAID0"-style arrangement of shared memory.
The precise number of banks is either 16 or 32, depending on the
architecture. But regardless, if all GPU threads access memory location #0,
they all hit bank #0, which is slow. Instead, you want GPU threads to "spread
out" over the banks (maybe GPU thread #0 accesses bank 0 / memory location
#0, GPU thread #1 accesses bank 1 / memory location #1, GPU thread #2
accesses bank 2 / memory location #2, etc.)
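The bank arithmetic above can be sketched as a tiny CPU-side simulation,
assuming 32 banks of 4-byte words (as noted, real hardware has 16 or 32
banks depending on the architecture):

```rust
// Assumed bank count; recent NVidia GPUs use 32 banks of 4-byte words.
const NUM_BANKS: usize = 32;

// Which bank a given word of shared memory lives in, under the usual
// interleaved mapping: consecutive words go to consecutive banks.
fn bank_of(word_index: usize) -> usize {
    word_index % NUM_BANKS
}

// Column access in a row-major 2D array: thread t reads word t * row_width.
// Returns the bank each of the 32 threads in a warp would hit.
fn column_access_banks(row_width: usize) -> Vec<usize> {
    (0..NUM_BANKS).map(|t| bank_of(t * row_width)).collect()
}
```

With a row width of 32, every thread lands on bank #0 and the 32 accesses
serialize; padding each row to 33 words makes thread t hit bank t, spreading
the warp across all 32 banks. That padding trick is the classic fix from the
GPU Gems chapter linked above.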

~~~
AndrewGaspar
> I've done enough Hacker News discussions to note that the lay-reader won't
> understand what this means, and this probably needs more elaboration.

This is why I didn't mention it, but I agree that shared memory and typical
uses present a particularly challenging problem for Rust.

------
mshockwave
I understand Rust's advantage in lots of use cases, but can someone elaborate
on the benefits of using Rust over other languages for shaders?

~~~
gpm
I'd also be curious to see some of the disadvantages?

~~~
maikklein_dev
The standard library is a bit awkward in places with its pointer usage. This
is why I have a custom one.

From a language point of view, it fits really well. I am also constrained by
syntax, as I want rlsl to be a strict subset of Rust.

Closures capture by reference, which means you always have to capture
explicitly with `move` (there are no pointers in structs in SPIR-V). Although
I am fairly confident that this can be fixed.
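A small illustration of the default-capture issue in plain Rust (nothing
rlsl-specific; `capture_demo` is just an assumed example name): without
`move`, the closure's environment holds a reference to `v`, which is exactly
the pointer-in-a-struct shape SPIR-V can't express; with `move`, the data is
copied into the closure itself.

```rust
fn capture_demo() -> (f32, f32) {
    let v = [1.0f32, 2.0, 3.0];

    // Default capture: the closure's environment struct contains a
    // &[f32; 3], i.e. a pointer inside a struct.
    let by_ref = |i: usize| v[i] * 2.0;

    // `move` capture: `v` is Copy, so the array itself is copied into the
    // closure's environment; no pointer is needed.
    let by_value = move |i: usize| v[i] * 2.0;

    (by_ref(1), by_value(1))
}
```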

Also, MIR optimization passes are pretty weak right now, and rlsl would
require an optimizer in the future, like SPIR-V <-> LLVM. I am not sure how
good spirv-opt is right now.

------
kvark
Exciting talk! And I'm happy to know that one of the leading shader devs from
Rust community has successfully "embarked" :)

Technology-wise, the compiler goes from MIR to SPIR-V. This is specific to
Rust and different from the other direction Khronos has been exploring:
[https://github.com/KhronosGroup/LLVM-SPIRV-Backend](https://github.com/KhronosGroup/LLVM-SPIRV-Backend).
It's a bit sad that we can't all have nice things.

~~~
the_mitsuhiko
MIR seems the better choice in rust land in general. I wonder if cranelift
could be an alternative approach here in the future.

~~~
maikklein_dev
I didn't choose cranelift because there was no real benefit to it. If the
focus had been on optimizations, I might have used it. The IR seems
friendlier compared to MIR, although I think there is a rewrite coming to
make MIR more suitable for optimizations as well :).

And I do quite a lot of transformations at the MIR level, so I am looking
forward to a more optimization-friendly IR.

------
rough-sea
Just to be clear, SPIR-V has no relation to RISC-V?

~~~
slavik81
No relation. SPIR-V is the fifth iteration of the Standard Portable
Intermediate Representation, originally based on the LLVM IR.

It's an intermediate representation between the high level shader programming
language and the GPU's native machine code. It's expected that the GPU will
compile the SPIR-V to its own internal instruction set, rather than executing
it natively.

RISC-V, on the other hand, is a native instruction set for CPUs.

~~~
gmueckl
SPIR-V really is an intermediate representation of a program. There's no way
that this can be executed without any further translation. But it stops driver
developers from having to write and ship complex compiler front ends that
deviate from language specs in subtle ways that the shader writers need to
test for and work around.

~~~
phire
Yep, now shader writers only need to test for and work around backend bugs in
drivers.

Also it makes gamedevs slightly happier that they don't have to ship the plain
text source code of their shaders with games.

It doesn't provide any real security against RE, as SPIR-V is pretty close to
being tokenized/SSAed GLSL and decompilers are trivial to implement. But if
it makes them happier, so be it.

~~~
erichocean
> *SPIR-V is pretty close to being tokenized/SSAed GLSL*

SPIR-V is in SSA form; GLSL most definitely is not.

------
kbumsik
Call me dumb, but I'm always confused by SPIR-V. Does it have any
relationship to Vulkan or OpenGL?

~~~
pdpi
Graphics programming is fundamentally a form of client/server programming.

Your (usually C++) application is the client, shaders are the server, and
OpenGL/Vulkan/Direct3D are basically your client libraries for talking to the
server.

Your OpenGL driver has a compiler in it that JIT-compiles GLSL into shader
programs your GPU can run. In this world, SPIR-V is sort of like JVM/WASM
bytecode: a compilation target for shader languages that's a bit nicer for
drivers to work with.

~~~
kbumsik
> Graphics programming is fundamentally a form of client/server programming.

Thanks, in that way it is a lot easier to understand. So do you mean GLSL
would be compiled into SPIR-V?

~~~
pdpi
Yup!
[https://vulkan.lunarg.com/doc/view/1.0.39.1/linux/spirv_tool...](https://vulkan.lunarg.com/doc/view/1.0.39.1/linux/spirv_toolchain.html)

Microsoft actually has an HLSL to SPIR-V compiler too.
[https://github.com/microsoft/DirectXShaderCompiler/blob/mast...](https://github.com/microsoft/DirectXShaderCompiler/blob/master/docs/SPIR-V.rst)

------
gravypod
This is amazing and I hope to see things like this extended. There are so
many gotchas with GLSL that it's hard to get going, and something with a
really strict compiler would make it much less of a burden on an engineer. It
would be really cool to see whole-program and profile-guided optimization
across GPU inputs, shader stages, and GPU outputs.

------
waiseristy
This is a great candidate for the Vulkano rust library:
[https://github.com/vulkano-rs/vulkano](https://github.com/vulkano-rs/vulkano)

It's currently using shaderc to compile GLSL -> SPIR-V, but it's clunky and
takes forever to compile.

Would be great to have a more Rust-centric way to send SPIR-V over to the GPU.

------
Avi-D-coder
Is the video of the talk posted anywhere?

