Hacker News new | past | comments | ask | show | jobs | submit login
Rust-CUDA: Fast GPU code fully in Rust (github.com/rdambrosio016)
269 points by amadeusine 7 days ago | hide | past | favorite | 85 comments





> "Extremely fast"

When people make claims like this, it would be good if they put the benchmarks on the first page. E.g, how does it compare with https://github.com/gfx-rs/wgpu which lets you target Vulkan, Metal, DX, GL or WASM+WebGPU with rust?


I hope this doesn't come off as handwaving, but you're kinda comparing apples to oranges here. CUDA has always been in a class of it's own when it comes to GPU compute, for better and worse, so I think the people out there who want to use this will pretty quickly get an idea of who it's for. Benchmarks would be nice, but I don't really think they'd be germane when comparing a proprietary compute system with a generic cross-platform GPU binding.

What? So you can just claim something is 'fast' with no evidence, but then if it is slower than proprietary things, who cares? That's ridiculous - if you claim something is 'extremely fast' you should back that up with benchmarks.

If it's not relative to anything, than the word "fast" doesn't have much meaning.

"Fast", to me, from a software development perspective can still be meaningful, in the sense of knowing what techniques, patterns, paths, ... enable performant execution, and providing easy and straightforward paths for the user along those. Which, ultimately, leads to high performance in most ways the user will apply the provided framework (in a more general sense of the word). Hope that makes sense.

And it must be OK to claim "fast" as a goal, from the early stages of a project, even before it may be possible to create any meaningful benchmarks. As long as it's discernable for the intended audience the precise stage of development or maturity the project is currently at. Which, I believe, the project in question is communicating just fine ("still in early development").


Would have it made any difference had the parent mentioned OpenCL?

wgpu is a library for running wgsl on GPUs, not Rust.

It turns out that when wgpu is combined with rust-gpu, it can run rust on gpus too

https://github.com/EmbarkStudios/rust-gpu/tree/main/examples...

(on the "builder" directory it builds shaders with the spirv-builder crate)


https://github.com/embarkstudios/rust-gpu would be the closest equivalent, AFAICT.

I wrote some OpenCL code recently, wrapped in a rust program using opencl3. My CL code was pretty slow early on, it's a math heavy problem implemented initially with a bunch of for loops.

I largely "optimized" it by smattering the code with #pragma unroll, which was exceptionally effective for the problem at hand, given that I had tons of statically defined loops that could be easily unrolled.

I know rust has a tendency to aggressively inline and flatten everything, but I'm curious about things like this, where it can be so important in this domain to ensure things are unrolled, as the loop conditionals can otherwise introduce pipeline/concurrency issues.


When they say "extremely fast GPU code" I think they mean relative to Rust's normal home: standard CPUs. So I don't think the claim needs any specific support.

To me, it is clear that "fast GPU code" means GPU code that is faster than other GPU code.

Like a "fast sports car" would obviously have to be faster than other sports cars. You couldn't (truthfully) make that claim if was faster than regular cars but slower than other sports cars.

"Accelerate Rust code by using the GPU" would be a more honest description.


I agree it could be read both ways. I'm just saying what I think their intent is.

https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...

* Missing Atomics -- Gamebreaker IMO. Atomics are absolutely essential when you are dealing with 10,000+ threads on a regular basis. You'll inevitably come across a shared data-structure that requires write-access from each thread, and some coordination mechanism is needed for that. Atomics are one important fit.

Ironic, a few days ago, I argued for the use of Fork-join parallelism in most cases (aka: Kernel launch / synchronized kernel exits). Now I find myself arguing the opposite now that we have a topic here with missing atomics. Like... atomics need to be used very, very rarely, but those rare uses are incredibly important.

* Warp Vote / Match / Reduce / Shuffle missing (Very useful tools for highly-optimized code, but you can write slower code that does the same thing through __shared__ memory just fine)

------

Wait, does this support __shared__ memory at all? Raw access to memory is not really amenable to Rust's programming style, but its absolutely necessary for high-performance GPU programming.

If this is missing __shared__ memory concepts, then the entire idea of "efficient GPU code" is dead IMO. GPU threads can only communicate quickly over __shared__ memory within an OpenCL Workgroup / CUDA Block (A Workgroup or Block is roughly a grouping of 1024 "threads" or SIMD-lanes)

All other forms of communication are incredibly slow. Atomics are maybe the next fastest form of communication, but only across __shared__ memory. Relaxed Atomics to global memory are reasonably performant but once you have either Seq-cst or Acquire/Release semantics (aka: the right memory barriers in the right place), things slow down dramatically in GPU-land.

The big issue is that __shared__ memory is only 64kB in size, its locked down to workgroups / blocks. In NVidia GPUs, the __shared__ memory "eats" into your L1 cache as well (In fact: __shared__ memory can be thought of as programmer-managed cache. The caching heuristics just aren't good enough for high-performance GPU programmers. They want to manually manage that high-speed memory for maximum performance).


As i mentioned, it is an early project, just making the simplest kernel compile was very difficult. Atomics and shared memory are great, but both are very difficult. Atomics need "proper" atomics (i.e. special instructions on sm_70+ and emulated on <sm_70), and shared mem needs some weird codegen support. I will get to both of them. Nevertheless, noalias does cause significant performance speedups in memory bound kernels, see this blogpost: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-poin...

So please do not be surprised that an early project does not contain every single feature of cuda, something thats been around for decades


No problem. I understand its a work in progress.

I'd push most strongly for CUDA __shared__ support first along with thread-barriers (CUDA's __syncthreads()), followed by __shared__ atomics. Finally, global atomics + associated memory-barrier stuffs (Ex: seq-cst atomic, acq-release atomic would work but maybe be a bit difficult. Might be easier to support the older-style memory barrier instead?)

--------

EDIT: Alternatively, maybe making a Thrust-like library for Rust is a better step 1? High performance GPU-code is very peculiar and strange. I can't imagine there's a big market for it. It seems like most people writing GPU "code" these days are just Python programmers punching in a Tensorflow, rather than actually trying to write a high-performance GPU thingy.


Hi! Engineer who writes many high-performance GPU things. We don't get nearly as much attention since part of the job is to write language bindings so that not everyone on the team has to look at CUDA kernels directly. There's a pretty big market for it, but it's still hard/expensive to both write performant code and train people to be able to do so. A lot of it reads roughly similar to CPU math library implementations (programming in "assembler" via compiler intrinsics). I'd say any way this can be made easier without losing a ton of performance will just make more problems more affordable to solve, since the time of senior performance/optimization engineers is so limited.

I'd argue that this project will be successful if you can write at least a few primitives of Thrust-like library in it: it'll give reasonably-sized problems to tackle, a production implementation to compare against for both testing and performance benchmarking, and demonstrate that the project has the basic functionality (optimized primitives like parallel sums/reductions are building blocks for a lot of useful things).


There is one issue i dont know how to solve, its generic kernels. It's a bit impossible to do without doing some seriously weird handling of it in rustc. I can monomorphize ahead of time but monomorphization from the CPU that talks from the CPU rustc to GPU rustc is just... it seems a bit impossible to do. Which is why a thrust-like library could only be made for specific types like f32, f64, etc.

> EDIT: Alternatively, maybe making a Thrust-like library for Rust is a better step 1? High performance GPU-code is very peculiar and strange. I can't imagine there's a big market for it. It seems like most people writing GPU "code" these days are just Python programmers punching in a Tensorflow, rather than actually trying to write a high-performance GPU thingy.

What NVIDIA is pushing these days in the HPC space is, in addition to CUDA, so-called "standards-based parallelization" (stdpar), meaning the NVIDIA HPC SDK compilers can offload C++17 parallel algorithms and Fortran DO CONCURRENT + certain array intrinsics to GPU's. And similarly in the python world there's Legate numpy and CuNumeric. Now of course these are quite limited and can only offload certain relatively simple algorithms, but for many real-world computational problems this is all they need, and if you can get 90% of the performance of a full CUDA implementation for 1% of the effort it's a pretty attractive proposition.

You can think of it a bit like a flowchart:

1. Can your problem be solved with a well-known algorithm that already has an optimized GPU implementation (cuBLAS, cuFFT, cuTensor, etc. etc.)? Use the existing library implementation!

2. Can your problem be expressed in a relatively straightforward data-parallel fashion? Use the standards-based parallelization with C++, Fortran, Numpy.

3. Can your problem be expressed with directive-based parallelization like OpenACC or OpenMP? Use those.

4. If neither of the above work, sure, drop down to CUDA.

None of this means that CUDA is going away, NVIDIA continues to invest heavily into the CUDA ecosystem. And CUDA is the foundation upon which the first three options in the list are built upon. Think of it more like bringing new programmers into the GPGPU fold, people who before didn't bother with utilizing a GPU at all.

Here's a recentish slide deck about it: https://www.alcf.anl.gov/sites/default/files/2021-03/NVHPC-S...

Now, getting back to Rust, yes I think stdpar-like support would be nice, but it depends on robust underlying support for handling the GPU. Something that this Rust-CUDA effort could provide.

Disclaimer: I work for NVIDIA, though not in this particular space. All opinions my own.


Looks like they just haven't gotten around to it. Rust has fine language support for atomics and shared memory.

You're lucky I've had this discussion before with other Rust programmers. But I forgot about this issue...

CUDA __shared__ memory is a 64kB block of RAM that is located __INSIDE THE CORE__ and is incredibly fast, but has very peculiar semantics. (Since the memory never "leaves" the core, its "stuck" only on a small set of 1024 "threads". Difficult to manage when you write 30,000+ thread programs but the speed benefits are insane and well worth the trouble)

Rust "shared" memory is thread-to-thread communications that simply doesn't exist in the GPU-world.

-------

Maybe it'd be more appropriate if I used OpenCL terminology instead of CUDA terminology here, because Rust programmers have an unfortunate name conflict. In OpenCL, this 64kB on-core buffer is called __local memory. Does that help?

Whenever I said "__shared__", replace that in your mind with __local instead. Because CUDA __shared__ is NOTHING like Rust-Shared.


The 1024 threads of a warp/block/whatever or just the current threads or what?

> The 1024 threads of a warp/block/whatever

In CUDA terminology: the block can access __shared__ memory together. Different blocks are locked out of seeing other block's __shared__ memory.

A block can be up to 32-warps working together (aka 1024 threads), or it could be as small as 1-warp (aka 32-threads). A warp could be 1-thread in some cases.

--

Finally, a grid in CUDA is synchronized across kernel calls. You are 100% certain that all threads of the grid are not running before the kernel_launch<<<x, y, z, stream>>(foo bar)... and you can be 100% certain that all threads of the grid are done after the cudaStreamSynchronize(stream) call.

---------

Warps: Largely about very low-level details such as branch divergence. Warps take if/else/for loops together in practice, so you need to think about warps when you think about optimal utilization. In the past 5 years or so, warp-level programming has become more popular, but warp-level stuff is pretty rare and only should be reserved once other, easier, optimizations have taken place.

Blocks: Coordinated across __shared__ memory. Because blocks are controlled by the programmer, its easier to think about than warps... but you need to be careful and put the right OpenCL barriers() or CUDA __syncthreads() in the right places.

Grid: Coordinated across cudaStreamSynchronize / CUDA-streams.

Ad-hoc coordination can be done with atomics to global memory, but this is very slow if memory barriers are involved. (Relaxed atomics are pretty fast though)


Thanks!

That sounds a little complicated to deal with, but I see no reason why either the Rust atomic types or a new type supplied by the rust-cuda library couldn't handle that just fine.

I just want to make sure that you realize that Rust does have regular atomics (and that's how other shared memory abstractions are generally implemented underneath).


EDIT: Okay, I'm switching to OpenCL terminology. I think this __shared__ thing is getting confusing for ya.

----------

The semantics of __local memory variables are very peculiar, and are prone to memory ordering issues unless coordinated with OpenCL's barrier() function calls.

That means the compiler needs to be careful about optimizations and orderings. The compiler's understanding of allowable variable orders must match the programmer's understanding.

__local variables can be of any type. In CUDA, you can make new structs or even C++ classes be in __shared__ memory. In OpenCL, you can make *arbitrary* structs be __local. Its entirely freeform use of memory, albeit tiny and 64kB in size.

---------

The proper implementation of __local semantics will require compiler support. This isn't something you can just tack on with a type-system. The implications of __local reverberate not only through the type system, but also with optimization and the very understanding of how code gets compiled fundamentally (synchronization and memory orderings).

--------

> I just want to make sure that you realize that Rust does have regular atomics (and that's how other shared memory abstractions are generally implemented underneath).

And __local memory is nothing like you've ever seen before, unless maybe you're a realtime programmer who has those weird manually-managed L1 CPUs.

Because that's what it is: __local memory is a manually managed high-speed RAM. While typical CPU code relies upon the hardware to manage L1 cache for you, GPU programmers regularly manage that region of memory *manually*.

Its not easy, but its the only way to reach the highest-levels of performance.


No, I do understand this.

Some sort of type which exposes __shared__ semantics will need to be exposed in the rust-cuda crate. Clearly this isn't just type-level stuff, but the way rust exposes features like that is usually through a new type.

Cuda-rust will probably add (probably does already) some sort of inline assembly the same way that the spirv-rust toolchain does, and through that, a "Shared" type will be able to be implemented.


> Cuda-rust will probably add (probably does already) some sort of inline assembly Already done and used a lot in cuda_std, it uses normal asm!.

My idea for shared memory is that it will always be unsafe and the only thing i can do is: - expose raw low level "give me a pointer to 70 bytes of shared mem" function because this needs codegen support.

- Then expose a higher level "give me an array of 50 u16s in shared memory"

- Make as much of it as possible usable behind an abstraction, like block reduce, stuff like that.

But i want to make it clear that shared memory will always be unsafe, shared memory semantics are literally impossible to statically prove. I dont think this is the end of the world, high performance GPU programming will always require people to really know what they're doing


If memory is exposed in CUDA more like allocators than pre-allocated static memory, look into using the alloc crate with custom allocators that hand out global or shared memory.

No. This is a fundamental misunderstanding of __shared__ memory and how kernels work.

__shared__ memory is a specification like .bss in ELF files. Its something that must be known at compile-time, well-before runtime. The GPU kernel, when launched, will reserve some __shared__ memory for itself.

----

Different kernels may use different chunks of that 64kB for themselves. For example, if Kernel Foo uses 30kB of __shared__ memory and is currently running, and Kernel Bar only uses 20kB of __shared__ memory, the GPU will allow Foo + Bar to run simultaneously.

The fundamental "unit of compute" is the OpenCL workgroup / CUDA block. The GPU will run many, many kernels (even different kernels) _ON THE SAME CORE_, as long as there's enough registers, and __shared__ memory available for them.

-----

"Thread local" variables compile into registers in practice. GPUs have ~256 registers per core, so if Kernel Foo uses 40 registers, and Kernel Bar uses 70, the GPU SMs (Symmetric Multiprocessors, what you'd call a "core" in the CPU world) could run 4 copies of Foo + 1 copy of Bar.

This is because Foo#1 will eventually run into a memory-latency issue (VRAM read/writes have well over 100-cycles of latency, maybe 300+ cycles on older GPUs). Instead of waiting for this memory operation to complete, the GPU will switch to another workgroup (like Foo#2, Foo#3, or Bar#1) to ensure that the GPU-cores stay utilized.

__shared__ memory works kinda like these registers, they're divy'd out at runtime by treating the OpenCL workgroups / CUDA blocks as a unit.

------

So __shared__ memory has to be preallocated by this model. Its an important per-unit resource that is tracked by the GPU at the lowest level, so that multiple kernels could be run concurrently on the same cores (GPUs are like SMT or Hyperthreading: capable of running 8+ kernels per core as long as you have enough registers / ___shared__ memory to launch all kernels)


Gotcha. I assumed that was the case, but I was seeing some other stuff that implied otherwise elsewhere in this thread.

Yeah, in that case, it'd probably have to be a transformation on top of static variables or something like that.


In the Lisp world... it is ambiguous if functions or macros are called at compile time or runtime.

I'm not sure if this is kosher in the world of Rust, but...

    static RAY_STACK: [Shared<Ray>; 2000] = [Shared::new(Ray::default()); 2000];
This could still work, if Shared::new(...) were a compile-time function. Or a language-extension that looked like a compile-time function.

EDIT: You wouldn't be allowed to have Shared::new inside of a loop or a recursive function though. But as long as you had assurances that any such Shared::new instance ran exactly once throughout the code, it might work?

Or maybe that's too ugly. "static" probably captures the idea better


The best way to do it is probably the way rust-gpu does it: https://github.com/EmbarkStudios/rust-gpu/blob/main/docs/src...

The entry point of the kernel would supply any objects that have special properties.


fwiw, and I think you know this, shared memory does not have to by preallocated. dynamic shared memory allows you to allocate at kernel launch time.

Yeah, I'm not 100% sure what to say in English though, lol.

There's compile-time and runtime. But there's also kernel-launch time? Dynamic shared memory is done before kernel-launch, possibly during cpu-runtime but before gpu-runtime.

-------

Things get crazier when you see OpenCL paradigms like... #define constants during CPU-runtime, invoke the OpenCL compiler (under the assumption that the compiler will now optimize the constant into the code directly), and then kernel-launch.


Because programmers expect arbitrary types to become __shared__ (ex: if you are writing a Raytracer, you probably want your rays to be stored into __shared__ memory).

So the programmer would write:

    struct Vec3{
        float x;
        float y;
        float z;
    }

    struct Ray{
        Vec3 origin;
        Vec3 direction;
        int bouncesRemaining;
    };
And then maybe in one function...

    __local struct Ray raystack[2000];
    // As rays bounce, they get new origins / directions, 
    // and may spawn new rays as needed.

    // If the raystack overflows, transfer the rays to global memory.
And then maybe my OpenCL kernel operates over these rays in parallel, tracking wherever they go.

I can't guarantee that it'd be implemented like this, but I could see it working like this in Rust.

  struct Ray {
      origin: Vec3,
      dir: Vec3,
      remaining_bounces: u32,
  }

and then using it like this:

  static RAY_STACK: [Shared<Ray>; 2000] = [Shared::new(Ray::default()); 2000];
or

  static RAY_STACK: Shared<[Ray; 2000]> = Shared::default();
I guess it could also be done like this:

  #[shared]
  static RAY_STACK: [Ray; 2000] = ...;
but that's not really the rust way.

This seems more likely than my proposals, but it does mean you won't be able to use a particular part of the `__shared__` region for more than one thing. (Mine are broken by function calls, so yours is still better.)

That's easy enough in Rust.

  struct Vec3(f32, f32, f32);
  struct Ray {
      origin: Vec3,
      direction: Vec3,
      bounces_remaining: i32
  }
and then something like (made-up API):

  let shared = rust_cuda::shared();
  let mut raystack = shared.alloc_zeroed::<[Ray; 2000]>();
or maybe:

  let mut raystack: Shared<[Ray; 2000]> = [Default::default(); 2000].into();

That is basically what im going to do. It will break down approximately like this:

  pub fn get_shared_mem_ptr<const Bytes: usize>() -> \*mut u8 {
    __nvvm_get_shared_mem_ptr(Bytes)
  }
For the raw version, the codegen internally intercepts the call to the intrinsic and declares an extern global in the shared addrspace, which is what libnvvm wants you to do, basically like

  __shared__ int foo[5];
Dynamic shared mem is a bit more weird because if you query the ptr for the dynamic smem it yields the same ptr every time.

You might want alignment there as well.

Both alignment and unalignment actually.

GPUs are weird. Prime numbers to 'unalign' data so that you minimize bank conflicts is a common optimization trick.

GPUs don't have one memory load/store unit. They are incredibly parallel and have like 32 load/store units that try to operate in parallel.

If all your data is aligned, then bank#0 gets more requests than bank#31. (Thread#0 accesses memory 800. Thread#1 accesses memory 832. Threas#2 accesses 864... Woops you just hammered one bank and now 31 of your memory banks are sitting around doing nothing, while bank#0 is doing all the work sequentially)

Unalignment means more read/writes are sent to bank#31, and fewer to bank#0, better balancing the load across your parallel load/store units.


Here be dragons, and this person tames them. This is insane, actually. I’m guessing it would be cool if gpu automagically scrambled memory so you didn't have to manually unalign it?

When you're doing "uint32_t array[thread_idx.x]" sorts of things, you'll notice that your threads are all lined up with the array. So you're in perfect bank-alignment.

With "array[thread_idx.x]" kind of access, Thread#0 accesses array[0], Thread#1 accesses array[1]... etc. etc.

array[0] might map to memory location #0x8001200, which will probably be bank#0. array[1] might map to #0x8001204, which would be bank#1. Etc. etc. (I forget exactly how many bytes per bank, but... you get the gist).

At the end of the day, all your array[] accesses from Thread#0 through Thread#1023 of your workgroup/block will be perfectly balanced and perfectly spread out between all banks.

--------

So really, the "lesson" is to just organize your data in arrays as much as possible. GPUs are really, really good at simple array reads/writes.

That's not always possible of course. You should only "shuffle" the banks if you know for certain that one bank is going to be hit more than the other banks.

--------

It really comes down to the size of the object you made an array out of. If you have a large object for some reason, maybe array[0] and array[1], array[2], etc. etc. will all map to bank#0.


This is done in rust-gpu[1] by the `spirv(workgroup)` attribute on the kernel function signature.

[1]: https://github.com/EmbarkStudios/rust-gpu/blob/46c9ea0c9c7b7...


> Atomics are absolutely essential

Most problems don't need atomics to solve

> If this is missing __shared__ memory concepts, then the entire idea of "efficient GPU code" is dead IMO

Sure, shared memory is great, but not always needed and communicating through global memory can be fast if you stay inside the L1/L2 cache and hide latency.


> Most problems don't need atomics to solve

How do you handle a global, concurrent memory write and/or read ?? (across many different blocks, maybe even across different grids).

For example: lets say you have a global hash table and 30,000 CUDA-threads are running. How do you insert data into the hash table safely?

> Sure, shared memory is great, but not always needed and communicating through global memory can be fast if you stay inside the L1/L2 cache and hide latency.

Scan operations (sum, min, max, AND, OR to name a few) through the parallel prefix pattern (https://en.wikipedia.org/wiki/Prefix_sum). How do you plan to do it, if not through __shared__ memory?

This is a fundamental operation in almost all GPU code I've ever seen. Just pushing / popping to a stack will require a prefix-sum to determine the size across the workgroup.

If you can't do prefix-sum, you won't be able to do effective load-balancing on a GPU. This is something that'd normally take a dozen clock ticks, but if you do it over L2 you're looking at hundreds of clock ticks instead.

------

Sorting networks are also probably best implemented in __shared__ memory... with maybe warp-level intrinsics beating them. (But warp-level programming is much much harder and I prefer to avoid it).


As I said, these features are often not needed. You can implement e.g. a neural network library without needing atomic operations.

> How do you plan to do it, if not through __shared__ memory?

Can't you use __shared__ memory the same way you use workgroup barriers and global memory? Might be slower, but good caching should make it comparable, which should be the case of prefix sum (you read right after writing, so should get good cache hit probability).


> As I said, these features are often not needed.

Global parallel hash table is probably the fastest way to implement collision detection. Its pretty fundamental to manipulation of 3d space, be it graphics, physics or other such simulations.

Which of course, run _GREAT_ on GPUs.

--------------

I've written "Inner Join" on a GPU for fun. Yes, the SQL operator. Its pretty fast. Databases probably can run on GPUs and parallelize easily. But any database would need globally consistent reads/writes. Sure, GPUs have less RAM than a CPU, but GPU-RAM is way faster so that might actually be a net benefit if your data is between 200MB and 4GB in size.

Use your imagination. Anywhere you'd use an atomic on a CPU is where you might use an atomic on a GPU.


Global hash table on GPUs sounds cursed and perverted. What you’re meant to do with them is run the same computation on all pixels in parallel independently! :P

shared memory and L1 are the same on modern GPUs, but shared memory gives you control of what goes on there. I agree with dragontamer that it's absolutely essentially to have atomics and shared.

How is it "absolutely essential" if you just want to sum 2 large arrays for example?

I'm talking about realistic kernels, which aren't that simple. sure, we can also say that multiplying two arrays doesn't need it. but everything from gemm to FFT does.

How are atomics absolutely essential for gemm?

>> Ironic, a few days ago, I argued for the use of Fork-join parallelism in most cases (aka: Kernel launch / synchronized kernel exits). Now I find myself arguing the opposite now that we have a topic here with missing atomics. Like... atomics need to be used very, very rarely, but those rare uses are incredibly important.

Did you read my post? Or did you just start counter-arguing me without seeing my full statement?

Most people dipping down into a GPU for parallelism will probably run across a globally-consistent read/write across some data-structure. Especially because things like gemm already have high-performance libraries and there's no damn point writing yet another gemm (unless you're some kind of super-performance expert, the standard libraries are probably way faster than what you can do)

EDIT: If you are going to rely upon global kernel synchronization, chances are your code would work with CUDA Thrust (aka: GPU-accelerated data-structures) rather than dipping down to CUDA directly.


we are talking about shared memory

That makes less sense, since there is literally no algorithm that can't be implemented without shared memory.

I'm not sure you were reading this thread. obviously anything can be written without shared memory, but they will be much, much slower, and using a GPU becomes less appealing. the entire purpose of the article and project is that it's fast, but it can't be anywhere near as fast as most cuda apps until it supports shared memory (not worth arguing about atomics).

If you were reading this thread, you know I responded to your assertion that shared memory is "absolutely essentialy to have" (sic). It was your words, literally. I wasn't arguing that shared memory has no advantages.

Since this doesn't target the majority of GPUs, nor the majority of GPUs used in enterprise compute, why continue to target CUDA? If you target SPIR-V, then this can be used with all the existing modern APIs (Vulkan, OpenCL, OpenGL; Microsoft has a SPIR-V to DXIL translator, and third parties have Vulkan/SPIR-V-on-legacy-API support as well, and there is also at least one project that converts LLVM IR to SPIR-V)

... but if you target NVVM alone, you're stuck on the minority of machines that can run CUDA. Even Nvidia has admitted that SPIR-V is the future, since they are both a major contributor to Khronos specs, but also have a top notch SPIR-V compiler that produces better code for Nvidia GPUs than their NVVM compiler.


The writing is better than I might have produced as a first year college student, but this needs copy editing. I might suggest that the word “extremely” should be removed not only from the entire repository but all of the user’s repositories. Fast might be removed as well. Nobody is trying to generate slow GPU code. The salient feature is that one can write the same code as one might write in CUDA with the advantages of Rust’s type system—-which is indeed useful! However, there’s no speed to be gained by using this relative to CUDA.

> there’s no speed to be gained by using this relative to CUDA.

That is not totally true, there are two main things that can make kernels generated by this codegen faster:

- noalias, which is the LLVM equivalent of __restrict__, CUDA can take massive advantage of noalias by using readonly cache if the pointer does not alias. If you don't believe me just take a look at nvidia's blog post: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-poin...

- CUDA builds all the .cu files and links the PTX of them together, this means no LTO happens by default. I do something different where i actually lazily-load every module using dependency graphs, which gives you the benefit of LTO by default. Its not perfect because right now it leaves behind a lot of dead code, but i know how to fix it.


But there is relative to plain rust. Besides, even if nobody is trying to write slow GPU code, it's a very easy thing to get subtly wrong resulting in 10's of % speed loss.

> I might suggest that the word “extremely” should be removed not only from the entire repository but all of the user’s repositories.

You've reminded me of this classic, often (mis?)attributed to Mark Twain: [1]

> “Substitute 'damn' every time you're inclined to write 'very;' your editor will delete it and the writing will be just as it should be.”

[1] https://quoteinvestigator.com/2012/08/29/substitute-damn/


> I might suggest...

Best way to suggest is with a pull request.


Not if you're critiquing the way some writes English..

Depends how it’s worded.

If someone with copy editing experience made constructive suggestions for the readme and other documentation, it may well be appreciated.

A PR also gives the maintainers a way to discuss wording if there are points of disagreement - far more so than a post on HN that they may or may not even see.


Would be really nice to have an actual cross platform GPGPU library. It's really holding every kind of progress back to have only vendor lock-in.

Maybe WebCPU will be capable of compute to the extent that CUDA isn't necessary. https://github.com/UpsettingBoy/gpgpu-rs


> n ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust.

Well at least it does not say extremely safe.


The title says fast, but no benches, but the README puts more emphasis on that it’s a more usable solution than LLVM-PTX + Rust.

I mean, what’s the point of “fast”, if LLVM-PTX is clunky with rust in the first place?


How would this compare with Accel (which is also built on the cuda ecosystem)?

Accel uses the LLVM PTX backend, which is unusable for serious projects and doesn't work on windows, i have more about it here https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...

>The LLVM PTX backend is still very much WIP and often doesn't have things and/or breaks.

What is missing in your opinion aside from working debug info? I've worked on experimental CUDA support in GNAT (the Ada frontend of GCC), using LLVM's ptx backend to target GPUs and it was mostly working.


How does the compare with writing GPU code in Julia?

Could this maybe support AMD's HIP as well?

Not for the near future, HIP does not seem to be language-agnostic and the codegen is for NVVM IR, not whatever AMD uses. It might be possible to target amdgpu with llvm because all the gpu-unfriendly things are gone in my codegen. So maybe in the future? im not sure

very cool!

NVIDIA should hire the guy, then hire whoever he says he wants on the team and let them rip. That's what I'd do if I were in their shoes. Viable paths off C/C++ are badly needed, and currently the only real viable path with an ecosystem and community is Rust.

NVidia decided for Ada against Rust for their automotive firmware, their cards are designed based on C++'s memory model, and they have a big ISO C++ presence.

NVIDIA has so much cash these days that they don't have to do just one thing. They can do _all_ the things, at the same time. This realistically would only take a 5-7 person team, including the manager. That's $2M/yr tops, all in, which is peanuts for NVIDIA. Much less if the author is not based in the US.

With what ROI? This isn't charity.

Enable more people to build more GPU-accelerated software, which in turn drives sales of their GPUs. A cuda-specific library like this also strengthens their monopoly on GPU compute.

It's hard to put in numbers because it depends on the software that will be built. A single marginally popular software adding or improving GPU-acceleration could easily increase GPU sales by $2M/year


They already made it when CUDA 3.0 introduced PTX instead of being C focused like most Khronos stuff.

NVidia is not to blame when others fail to execute and aren't able to provide the same polyglot level and IDE integrations as NVidia.

They don't have any issues trying to keep up with demand, on the contrary, there aren't enough of them to sell.


Precisely. Their lead over AMD is insurmountable only because of CUDA. They should feed the goose that lays such awesome golden eggs, and feed it well.

Yeah do it before he gets the crazy idea of supporting AMD gpus...



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: