Kompute – Vulkan Alternative to CUDA (github.com/komputeproject)
204 points by coffeeaddict1 9 months ago | 41 comments



Vulkan has some advantages over OpenCL. You gain lower-level control over memory allocation and resource synchronization. ROCm has an infamous synchronization pessimization that doesn't exist in Vulkan. You can even explicitly allocate Vulkan resources at specific memory addresses, which makes Vulkan easy to use on embedded devices.

But some of the caveats for compute applications are currently:

- No bfloat16 in shaders

- No shader work graphs (GPU-driven shader control flow)

- No inline PTX (inline GCN/RDNA/GEN is available)

These may or may not be important to you. Vulkan recently gained the ability to seamlessly dispatch CUDA kernels if you need these features in some places, but there are currently no equivalent Vulkan extensions for HIP.


The biggest advantage of Vulkan is that there are games relying on Vulkan, which means GPU drivers have to play nice with it. My impression is that OpenCL is borderline unsupported. In fact, there are efforts (e.g. clvk) to reimplement OpenCL on top of Vulkan because the drivers are just that shit.


> GPU drivers have to play nice with it.

I would imagine Nvidia doesn't want its CUDA moat uprooted, so their Vulkan support might get gimped in some ways (ones that don't affect games). At least, that is what the cynic in me would predict.


They don't have much to worry about: Vulkan SPIR-V doesn't cover everything PTX does, nor does it have the ecosystem of compilers with PTX backends.


Nvidia has actually been putting a fair amount of work into their OpenCL drivers in the past few years and now has a pretty decent implementation. I have a feeling that OpenCL is more widely used in the embedded space, and that kicked them into gear.

AMD has also, very recently, been putting a bit more work into it, though their drivers are still the worst, and still way worse than their pre-ROCm drivers.


I work in VFX, and I see OpenCL used quite a bit in DCCs such as SideFX Houdini (which uses it extensively) and Foundry Nuke (which uses it a bit too). Most of the GPUs are NVIDIA, but there's some AMD too, and OpenCL's ability to fall back to the Intel host CPU (with vector instructions) when running on the renderfarm is absolutely critical, as farm nodes generally tend not to have GPUs (most of ours don't even have X installed, so no hardware-accelerated OpenGL or OpenCL is possible at all!).

Also one of my favourite non-VFX projects: 'Gollygang Ready' uses it to accelerate reaction-diffusion simulations.

I (perhaps naively) thought that OpenCL would still be a thing in a post-Vulkan world. SideFX is working on a new Vulkan-based viewport. I guess they solve different problems, so they can still coexist?


You don't need X to run OpenCL, only a driver. Pretty sure the same holds for OpenGL (headless contexts via EGL).


> No inline PTX (inline GCN/RDNA/GEN is available)

If I understand it correctly, that means the compute code is hardcoded to a specific GPU's assembly, and to make it work with another card or a newer one, you have to recompile.

Like... why? What is the problem with using SPIR-V, PTX, or plain LLVM IR?

If we lived in a monoculture (e.g. x86-64 for desktop apps), it would make sense, but there is a plethora of options, and one GPU generation is not assembly-compatible with the next.


"Inline PTX" is just PTX, and you have already listed PTX among the intermediate representations, which can be independent of hardware, because hardware-specific code will be generated when the program is loaded.

Of course, using PTX does not necessarily guarantee backward compatibility, because you may use some PTX features that are supported only on newer NVIDIA GPUs. Nevertheless, inline PTX should continue to work on future GPUs.

Perhaps by "hardcoded" you were referring to only a part of your quotation, i.e. to "(inline GCN/RDNA/GEN is available)".

In that case I agree with you, but even so, there are enough cases where it is impossible, at least with current compilers, to obtain the maximum performance the hardware allows without using inline assembly, whether for GPUs or for CPUs. Therefore it is good for a high-level programming language to permit inline assembly, even if this facility should not be abused.
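For illustration, inline PTX in CUDA C++ looks roughly like this (a minimal sketch; mad.lo.s32 is just an arbitrary example instruction):

```cpp
// Sketch: inline PTX inside a CUDA C++ kernel. PTX is a virtual ISA,
// so this keeps working on future GPUs; the driver JIT-compiles it
// to the actual hardware ISA (SASS) at load time.
__global__ void madKernel(int* out, const int* a, const int* b, const int* c) {
    int i = threadIdx.x;
    int r;
    // mad.lo.s32: r = low 32 bits of (a*b) + c, written directly in PTX.
    asm("mad.lo.s32 %0, %1, %2, %3;"
        : "=r"(r)
        : "r"(a[i]), "r"(b[i]), "r"(c[i]));
    out[i] = r;
}
```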


Inline assembly in shaders exists for basically the same reasons it exists in C++. Hardware gains new features much faster than Khronos standardizes an API for them in SPIR-V, so inline assembly lets you use them ASAP. You might also want to hand-optimize with assembly if AMD's or Intel's SPIR-V compilers aren't doing what you want.

Vulkan does make it easy to check the feature availability of your device and dispatch different shaders accordingly.
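A minimal sketch of such a check, assuming a Vulkan 1.2+ physical device handle is already in hand:

```cpp
#include <vulkan/vulkan.h>

// Query an optional feature so the app can pick which shader variant to dispatch.
bool hasBufferDeviceAddress(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceVulkan12Features features12{};
    features12.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_FEATURES;

    VkPhysicalDeviceFeatures2 features2{};
    features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    features2.pNext = &features12;  // chain the Vulkan 1.2 feature struct

    vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);
    return features12.bufferDeviceAddress == VK_TRUE;
}
```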

That said, I've actually never had a reason to use inline assembly in shaders, personally.


Those are totally reasonable caveats to deal with currently. Thanks!


Shader work graphs are on the way https://gpuopen.com/gpu-work-graphs-in-vulkan/


This is _not_ an alternative to CUDA, nor to OpenCL. It is a high-level, opinionated API [1] that covers a (rather small) part of the API of each of those.

It might, _in principle_, be developed - with much more work than has gone into it so far - into such an alternative; but I am actually not sure of that, since I have poor command of Vulkan. I got suspicious as someone who maintains C++ API wrappers for CUDA myself [2] and knows that just doing that is a lot more code and a lot more work.

[1] - I assume it is opinionated to cater to CNN simulation for large language models, and basically not much more.

[2] - https://github.com/eyalroz/cuda-api-wrappers/


This looks great - I've been looking for a sustainable, cross-platform, cross-vendor GPU compute solution, and the alternatives are not really great. CUDA is Nvidia-only, Metal is Apple-only, etc. OpenCL has been the closest match, but it seems to be on the way out.

Does anyone have real world experience using Vulkan compute shaders versus, say, OpenCL? Does Kompute make things as straightforward as it seems?


> OpenCL has been the closest match but it seems like it's on the way out.

SYCL is the unofficial successor to OpenCL, in that SYCL implementations, like OpenCL, are based on SPIR-V compute 'kernels'. (Note that these are not directly compatible with the SPIR-V compute 'shaders' found in Vulkan, so implementing OpenCL or SYCL on top of Vulkan's compute facilities comes with some challenges.)
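For flavor, here is a minimal SYCL 2020 kernel - plain C++ that a SYCL toolchain lowers to a SPIR-V compute kernel (a sketch only, not tied to any particular implementation):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;  // selects a default device (GPU if available)
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f);
    {
        sycl::buffer bufA(a);
        sycl::buffer bufB(b);
        q.submit([&](sycl::handler& h) {
            sycl::accessor accA(bufA, h, sycl::read_write);
            sycl::accessor accB(bufB, h, sycl::read_only);
            // The lambda body is the compute kernel.
            h.parallel_for(sycl::range<1>(1024),
                           [=](sycl::id<1> i) { accA[i] += accB[i]; });
        });
    }  // buffer destruction synchronizes results back into the vectors
}
```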


I've done some compute shaders with Vulkan (and a lot of graphics shaders). It's quite nice and simple once you're past the initial hurdle of setting it up (or have used a helper library to do it for you). For compute shaders in particular, you should enable buffer device addresses, shader subgroups, and timeline semaphores (these need to be explicitly enabled at init).
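A hedged sketch of what "explicitly enabled at init" means - chaining the Vulkan 1.2 feature struct into device creation (queue setup and extension lists elided):

```cpp
#include <vulkan/vulkan.h>

// Sketch: opt in to buffer device address and timeline semaphores when
// creating the device. (Subgroup operations are core since Vulkan 1.1
// and need no flag, though you may still want to query their properties.)
VkDevice createComputeDevice(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceVulkan12Features features12{};
    features12.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_FEATURES;
    features12.bufferDeviceAddress = VK_TRUE;
    features12.timelineSemaphore = VK_TRUE;

    VkDeviceCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    createInfo.pNext = &features12;  // features are requested via the pNext chain
    // ... queue create infos and enabled extensions elided ...

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &createInfo, nullptr, &device);
    return device;
}
```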

The downside of compute shaders, regardless of graphics API (GL, DX, VK), is that they have an unspecified time limit, after which the OS will kill your process. There are ways to disable this in your OS/GPU configuration, but there isn't a portable way to do it programmatically from your code.

Another issue is that if you use the same GPU for compute (or heavy graphics tasks) as for your display output, your desktop responsiveness may go down. I had an issue where my desktop got really sluggish while I was drawing some graphics that took 350 ms per frame (laptop integrated graphics).


Check out SYCL


Alternatives can only become one if they support the same set of C, C++, Fortran, and PTX compiler backends, with a similar level of IDE integration, graphical GPGPU debugging, and frameworks.

Until then they are wannabe alternatives, for a subset of use cases, with lesser tooling.

It always feels like those proposing CUDA alternatives don't understand what they are trying to replace, and that is already the first error.


Are you proposing that they should build stuff like frameworks and IDE integration all by themselves, or are you saying they should magically make a community appear to build them?

The product comes before the community (unless you have insane marketing money)


I am saying that without them, it isn't a real alternative.

Alternatives are supposed to cover all use cases, otherwise they aren't alternatives.

Not even AMD and Intel are able to make it happen, so it remains to be seen how much small communities are able to achieve.


> Alternatives are supposed to cover all use cases

I can't think of a single tech product where feature parity was a driver for growth. If all you have is parity, then all you compete on is lower price.

Usually, some advantage (better/safer/faster/easier) makes a difference for a few important use cases (good examples were early, feature-incomplete NoSQL databases that excelled at one use case that existing SQL servers served poorly). That advantage hasn't emerged yet, so no community has developed.

We'll see if it ever does...


Until RDBMSes added support for JSON, and NoSQL people discovered why data consistency and query languages matter.


> Alternatives are supposed to cover all use cases, otherwise they aren't alternatives.

A bicycle is an alternative to a car; however, it doesn't cover all the same use cases.


With Kompute being the bicycle.


A key component of CUDA is that the kernels are written in C/C++ and not some shader language you would only be familiar with if you were into graphics.
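For example, a CUDA kernel is ordinary C++ with an execution-space annotation - a minimal sketch:

```cpp
// Minimal CUDA C++ kernel: plain C++ with __global__ marking a GPU entry point.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host-side launch (the <<<blocks, threads>>> syntax is the main extension):
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```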


A key component of CUDA is that kernels can be written in C, C++, Fortran, and any language with a PTX compiler backend.

The fact that many think CUDA is C/C++ is already the first error in trying to replace it.


CUDA C++ is not exactly standard either.


It probably extends the language less than g++ or msvc do out of the box.


Nowadays one can write pretty standard C++20 (minus modules), alongside CUDA frameworks that hide the non-standard stuff in their internals.


I mean, it's technically C/C++, but the runtime is so different it might as well be a completely different ecosystem. Also, there's enough tooling popping up around the concept of avoiding this ecosystem that it's pretty clear people want more than that.


Anyone have a comparison to something like WGSL's compute shader mode on top of wgpu? I've never seriously written in either.


They are very similar, but WGSL is its own language. Vulkan can run SPIR-V shaders written in GLSL, HLSL, or Slang. Other experimental languages exist too (like rust-gpu).

Wgpu is a bit behind on GPU features. For example, it added support for shader subgroups (aka warps or waves) in 2024, whereas this feature was available in Vulkan 1.1, released in 2018, and in Direct3D 12 shader model 6.0 (same timeframe). Wgpu still does not support buffer device address ("GPU pointers"), which I consider quite a game changer.
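For context, buffer device address hands you a raw 64-bit GPU address that can be passed into shaders. A minimal host-side sketch (assuming the feature was enabled at device creation and the buffer was created with the matching usage flag):

```cpp
#include <vulkan/vulkan.h>

// Sketch: fetch the "GPU pointer" of a buffer created with
// VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT.
VkDeviceAddress gpuPointerOf(VkDevice device, VkBuffer buffer) {
    VkBufferDeviceAddressInfo info{};
    info.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
    info.buffer = buffer;
    // The returned address can go into a push constant and be
    // dereferenced in shaders, e.g. via GL_EXT_buffer_reference.
    return vkGetBufferDeviceAddress(device, &info);
}
```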

Many popular tools, e.g. RenderDoc, don't have wgpu support.

If you are not targeting the web platform, wgpu isn't really bringing anything to the table.


PyTorch already has Vulkan support -- and Kompute does not support PyTorch yet. That's going to slow adoption of this project.


A quick googling shows that PyTorch had an experimental, Android-only Vulkan backend around version 1.7. It doesn't appear to still exist in current versions.


Kompute author here - thank you very much for sharing our work!

If you are interested to learn more, do join the community through our discord here: https://discord.gg/MaH5Jv5zwv

For some background, this project started after seeing various renowned machine learning frameworks like PyTorch and TensorFlow integrating Vulkan as a backend. The Vulkan SDK offers a great low-level interface that enables highly specialized optimizations - but this comes at the cost of highly verbose code: 800-2000 lines of it before you can even begin writing application code. This has resulted in each of these projects having to implement the same baseline to abstract away the non-compute-related features of the Vulkan SDK.

This large amount of non-standardised boilerplate can result in limited knowledge transfer, a higher chance of unique framework implementation bugs being introduced, etc. We are aiming to address this with Kompute. As of today, we are part of the Linux Foundation, and slowly contributing to the cross-vendor GPGPU revolution.

Some of the key features / highlights of Kompute:

* C++ SDK with Flexible Python Package

* BYOV: Bring-your-own-Vulkan design to play nice with existing Vulkan applications

* Asynchronous & parallel processing support through GPU family queues

* Explicit relationships for GPU and host memory ownership and memory management: https://kompute.cc/overview/memory-management.html

* Robust codebase with 90% unit test code coverage: https://kompute.cc/codecov/

* Mobile enabled via Android NDK across several architectures
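To give a feel for the API, a rough sketch of the basic flow - create tensors, record ops, eval (shader compilation elided; treat names as approximate, they may vary between Kompute versions):

```cpp
#include <kompute/Kompute.hpp>
#include <memory>
#include <vector>

int main() {
    kp::Manager mgr;  // sets up Vulkan instance, device and queues for you

    // Host data mirrored into GPU tensors.
    auto tensorA = mgr.tensor({2.0f, 4.0f, 6.0f});
    auto tensorB = mgr.tensor({1.0f, 2.0f, 3.0f});
    std::vector<std::shared_ptr<kp::Tensor>> params = {tensorA, tensorB};

    // `spirv` would hold a compiled compute shader (e.g. a *= b), elided here.
    std::vector<uint32_t> spirv = {};
    auto algo = mgr.algorithm(params, spirv);

    mgr.sequence()
        ->record<kp::OpTensorSyncDevice>(params)  // host -> device
        ->record<kp::OpAlgoDispatch>(algo)        // dispatch the kernel
        ->record<kp::OpTensorSyncLocal>(params)   // device -> host
        ->eval();                                 // submit and wait
}
```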

Relevant blog posts:

Machine Learning: https://towardsdatascience.com/machine-learning-and-data-pro...

Mobile development: https://towardsdatascience.com/gpu-accelerated-machine-learn...

Game development (we need to update to Godot4): https://towardsdatascience.com/supercharging-game-developmen...


Can't we make a chip that only does one thing: multiply and add a lot of 32x32 matrices in parallel? I think that would be enough for all AI needs and easy to program.


You're sorta describing TPUs, especially early ones: https://cloud.google.com/blog/products/ai-machine-learning/a...

That first generation also included the ability to apply a handful of fixed activation functions, but really that's about it. The array is bigger than 32x32, also.


Believe it or not, to be able to do that at high throughput you need half of a GPU anyway - things like a cache hierarchy.


Great, eliminating half of the transistors is a lot.


You'll need large economies of scale to do better than GPUs, and you'll still have all the same problems of memory IO, packaging, and heat removal. Transistors that aren't in use cost almost nothing.

This is a big part of why RISC didn't win, and why today the largest server chips in use in datacenters are still mostly compatible with an 8-bit part from the early 1980s.


All you really need from these in Transformers-dominated 2024 are GEMM and GEMV, plus fused RMS norm and some elementwise primitives to apply RoPE and residuals. And all of that must be brain-dead easy to install and access, and it should be cross-platform. And yet no such thing exists, as far as I can tell.
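For reference, the RMS norm mentioned here has very simple semantics. A scalar C++ sketch of what such a primitive computes (illustrative only; the `eps` term and weight vector follow the usual convention and aren't from the comment above):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference semantics of RMS norm: y[i] = x[i] / rms(x) * w[i],
// where rms(x) = sqrt(mean(x^2) + eps). A fused GPU kernel would do
// this in a single pass instead of materializing intermediates.
std::vector<float> rmsNorm(const std::vector<float>& x,
                           const std::vector<float>& w,
                           float eps = 1e-5f) {
    float meanSq = 0.0f;
    for (float v : x) meanSq += v * v;
    meanSq /= static_cast<float>(x.size());
    const float scale = 1.0f / std::sqrt(meanSq + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale * w[i];
    return y;
}
```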



