This blog post is great, and we need these kinds of tools for sure, but we also need high-level expressibility that doesn't require writing kernels in C. I know there are other projects that have taken up that cause, but it would be great to see NVIDIA double down on something like Copperhead.
I flatter myself to think that some ideas from that project have lived on in Tensorflow and PyTorch, etc. But the project itself wrapped up when I decided to focus on DL, and it would take a lot of work to bring it back to life.
I think I found it here: https://github.com/bryancatanzaro/copperhead
But I'm not sure what the state is. Looks dead (last commit 8 years ago). Probably just a proof of concept. But why hasn't this been continued?
Btw, for compiling on-the-fly from a string, I made something similar for our RETURNN project. Example for LSTM: https://github.com/rwth-i6/returnn/blob/a5eaa4ab1bfd5f157628...
It's set up so that it compiles automatically into an op for Theano or TensorFlow (PyTorch could easily be added as well), and for both CPU and CUDA/GPU.
Disclaimer: I work on the project they used to do the distributed execution, but otherwise have no connection with Legate.
Edit: And this library was developed by a team managed by one of the original Copperhead developers, in case you're wondering.
>but we also need high level expressibility that doesn't require writing kernels in C
The above are possible because C is actually just a frontend to PTX.
Fundamentally, you are never going to be able to write CUDA kernels without thinking about the CUDA architecture, any more than you'll ever be able to write async code without thinking about concurrency.
I worked for a client that had this wonderful Python DSL that compiled to Verilog and VHDL. It was much easier to use than writing the stuff the old way. Much more composable too, not to mention the tooling.
They created that by forking an open source project dating back to Python 2.5 that I could never find again.
Imagine if that stuff would still be alive today. You could have a market for paid pypi.org instances providing you with pip installable IP components you can compose and customize easily.
But in this market, sharing is not really a virtue.
Then you have CudaPy: https://github.com/oulgen/CudaPy
But in my opinion, the most future-proof solutions are higher-level frameworks like NumPy, JAX, and TensorFlow. TensorFlow and JAX can JIT-compile Python functions for the GPU (tf.function and jax.jit, respectively).
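To make that concrete, here is a minimal JAX sketch (illustrative only, the function is made up): a plain Python function is JIT-compiled by XLA and runs on the GPU when one is available.

import jax
import jax.numpy as jnp

@jax.jit
def saxpy(a, x, y):
    # Traced once, compiled by XLA, then dispatched to the GPU if one is present.
    return a * x + y

x = jnp.arange(1024, dtype=jnp.float32)
y = jnp.ones(1024, dtype=jnp.float32)
print(saxpy(2.0, x, y)[:4])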
I don't think it's possible to achieve something like this in Python because of how it's interpreted (but it sounds a bit like what another comment mentioned, where the Python was compiled to C).
Here is a similar kernel written in Python with Numba: https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%2...
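For reference, the general pattern from that notebook looks roughly like this (a sketch assuming the numba package and a CUDA-capable GPU; the kernel is illustrative, not the one from the link):

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)  # absolute index of this thread across the whole grid
    if i < x.size:
        out[i] = x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
out = np.empty_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # Numba moves the arrays to and from the device
print(out[:4])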
In any case, we did a complete rewrite in CUDA proper in less time than we spent banging our heads against that last numba-CUDA issue.
Under every language bridge there are trolls and numba-CUDA had some mean ones. Hopefully things have gotten better but I'm definitely still inside the "once bitten twice shy" period.
"First-class" is a steep claim. Does it support the nvidia perf tools? Those are very important for taking a kernel from (in my experience) ~20% theoretical perf to ~90% theoretical perf.
Source line association when using PC sampling is currently broken due to a bug in the NVIDIA drivers though (segfaulting when parsing the PTX debug info emitted by LLVM), but I'm told that may be fixed in the next driver.
That sounds very promising, but these tools are usually magnificent screenshot fodder yet they are conspicuously absent from the screenshots so I still have suspicions. Maybe I'll give it a try tonight and report back.
That's not to say that CUDA Python isn't kinda cool, but it's not a magic bullet to finally understanding GPU programming if you've been struggling.
I'm self-taught and have been using web languages and some Python before learning Rust.
I hope that NVIDIA can dedicate some resources to creating high-quality bindings to the C API for Rust, even if in the next 1-2 years.
Perhaps being able to use a systems language that's been easy for me coming from TypeScript and Kotlin could inspire me to take baby steps with CUDA, without worrying about understanding C.
I like the CUDA.jl package, and once I make time to learn Julia, I would love to try that out. From this article about the Python library, I'm still left knowing very little about "how can I parallelise this function".
So something like CUDA Rust would be nice to have.
By the way, D already supports CUDA.
I really wasn't a fan of the 'parallel' loop construct foreseen for Ada 2020: in addition to having a 'bad' syntax ('how' instead of 'what'), it wasn't really well integrated with the 'control your tasking precisely' mentality that Ada provides. Something a bit more platform-specific but still somehow portable, if designed well, would fit the Ada spirit far better.
I've found that there are really good and beginner-friendly Blender tutorials. Both free and paid ones.
Perhaps my position might change in future, but for now, I'd probably rather make the GPU accessible to those open-source distributed grids that train chess engines or compute deep-space related thingies :)
I foolishly built an options pricing engine on top of PyTorch, thinking "oooh, it's a fast array library that supports CUDA transparently", only to find out that array indexing is 100x slower than NumPy.
import time, torch
from itertools import product

N = 100
ten = torch.randn(N, N, N)
arr = ten.numpy()  # NumPy view of the same data, for comparison

start = time.time()
for i, j, k in product(range(N), range(N), range(N)):
    x = ten[i, j, k]  # per-element indexing on the torch tensor is the slow part
end = time.time()
print(f"indexed {N**3} elements in {end - start:.2f} s")
I'm indexing one array with another, e.g. x = y[z], to pull out all the values I want at once into another array for processing, and not using a Python loop for the expensive part.
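In other words, something along these lines (a toy sketch, array names purely illustrative):

import torch

y = torch.randn(1_000_000)
z = torch.randint(0, y.numel(), (10_000,))  # indices of the values I actually want
x = y[z]  # one vectorized gather, no per-element Python loop
print(x.shape)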
I'd love to hear more about this! Do you have any posts or write-ups on this?
You can define any compute shader you like in Python, annotate it with the data types, and it compiles to SPIR-V and runs under macOS, Linux, and Windows.
And, knowing Nvidia, the documentation will probably be terrible and anything but beginner-friendly, too.
I mean, if you want to improve upon the CUDA ecosystem why not start with the low-hanging fruit first?
They have lots of functions to name, and I usually find they did a good job given the low level of the language and its proximity to the hardware and low-level programming.
I don't know about Python, but CUDA on C++ has error handling!
As for the command line: think of it as programming for a specific architecture with its associated API.
You can write all the boilerplate in Python and just the kernel in C (which you can put in a string and compile automatically from your Python script). So far the workflow is much smoother than with nvcc (and creating some DLL bindings for the C program).
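A minimal sketch of that workflow using CuPy's RawKernel (an assumption on my part; the same idea works with PyCUDA or the NVRTC bindings), with the kernel living in a Python string and compiled on the fly:

import cupy as cp

saxpy_src = r'''
extern "C" __global__
void saxpy(float a, const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
'''
saxpy = cp.RawKernel(saxpy_src, 'saxpy')  # NVRTC compiles the string on first use

n = 1 << 20
x = cp.arange(n, dtype=cp.float32)
y = cp.ones(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))
print(out[:4])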