About 8 years ago an NVIDIA developer released a tool called Copperhead that let you write CUDA kernels in straight Python, which were then compiled to C, with no "C-in-a-string" like what's shown here. I always thought it was so elegant and had great potential, and I introduced a lot of people in my circle to it, but then it seems NVIDIA buried it.
This blog post is great, and we need these kinds of tools for sure, but we also need high-level expressiveness that doesn't require writing kernels in C. I know there are other projects that have taken up that cause, but it would be great to see NVIDIA double down on something like Copperhead.
I’m so glad you liked it. It was a labor of love and I was super proud of it. Means a lot to me that you would still remember it.
I flatter myself to think that some ideas from that project have lived on in Tensorflow and PyTorch, etc. But the project itself wrapped up when I decided to focus on DL, and it would take a lot of work to bring it back to life.
This is built so that it compiles automatically into an op for Theano or TensorFlow (PyTorch could easily be added as well), and for both CPU and CUDA/GPU.
I don't know specifics about Copperhead in particular, but Bryan Catanzaro (creator of Copperhead) is now the VP of Applied Deep Learning Research at Nvidia. He gave a talk at GTC this year, which is how I heard about all of this in the first place.
As it turns out, NVIDIA just open sourced a product called Legate which handles not just GPUs but distributed execution as well. Right now it supports NumPy and Pandas, but perhaps they'll add others in the future. Just thought this might be up your alley since it works at a higher level than the glorified CUDA in the article.
Fundamentally, you are never going to be able to write CUDA kernels without thinking about the CUDA architecture, any more than you'll ever be able to write async code without thinking about concurrency.
In the IP world, there are some hidden gems that disappear without a trace one day.
I worked for a client that had this wonderful Python DSL that compiled to Verilog and VHDL. It was much easier to use than writing the stuff the old way. Much more composable too, not to mention the tooling.
They created that by forking an open source project dating back to Python 2.5 that I could never find again.
Imagine if that stuff were still alive today. You could have a market for paid pypi.org instances providing you with pip-installable IP components you can compose and customize easily.
But in this market, sharing is not really a virtue.
But in my opinion, the most future-proof solutions are higher-level frameworks like NumPy, JAX and TensorFlow. TensorFlow and JAX can JIT compile Python functions for the GPU (tf.function, jax.jit).
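For concreteness, here's a minimal sketch of that with jax.jit (the tf.function version is analogous); it assumes a JAX install with GPU support, and the function itself is just a made-up example:

# Sketch: JIT-compiling ordinary array code with JAX; on a CUDA build of
# jaxlib this runs on the GPU without writing any kernel code.
import jax
import jax.numpy as jnp

@jax.jit
def scaled_softplus(x, a):
    # Plain array math; XLA traces, fuses and compiles it for the backend.
    return a * jnp.log1p(jnp.exp(x))

x = jnp.linspace(-3.0, 3.0, 1_000_000)
y = scaled_softplus(x, 0.5)   # first call triggers compilation
print(y.shape, float(y[0]))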
Totally agree, Copperhead looks much easier to use. Perhaps one of the reasons they went and rebuilt from scratch is that Copperhead relies on Thrust and a couple of other dependencies?
I don't think it's possible to achieve something like this in Python because of how it's interpreted (but it sounds a bit like what another comment mentioned, where the Python was compiled to C).
I think the contrast is probably less about the language and more about the scope and objective of the projects. The blog is describing low-level interfaces in Python; probably more comparable is the old CUDAdrv.jl package (now merged into CUDA.jl): https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/examples/...
I gave numba CUDA a spin in late 2018 and was severely disappointed. It didn't work out of the box, I had to tweak the source to remove a reference to an API that had been removed from CUDA more than a year prior (and deprecated long ago). Then I ran into a bug when converting a float array to a double array -- I had to declare the types three different times and it still did a naive byte-copy rather than a conversion. Thanks to a background in numerics, the symptoms were obvious, but yikes. The problem that finally did us in was an inability to get buffers to correctly pass between kernels without a CPU copy, which was absolutely critical for our perf. I think this was supported in theory but just didn't work.
In any case, we did a complete rewrite in CUDA proper in less time than we spent banging our heads against that last numba-CUDA issue.
Under every language bridge there are trolls and numba-CUDA had some mean ones. Hopefully things have gotten better but I'm definitely still inside the "once bitten twice shy" period.
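For anyone curious, this is roughly the workflow we were after: keeping buffers on the device across kernel launches. A sketch against the current numba.cuda API, not a claim that it behaves any better today:

# Sketch: two numba CUDA kernels sharing one device buffer, no host copy
# in between.
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, a):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= a

@cuda.jit
def offset(x, b):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += b

n = 1 << 20
dev = cuda.to_device(np.random.rand(n).astype(np.float32))  # one H->D copy
threads = 256
blocks = (n + threads - 1) // threads
scale[blocks, threads](dev, 2.0)    # both kernels operate on the same
offset[blocks, threads](dev, 1.0)   # device array, no D->H->D round trip
result = dev.copy_to_host()         # single D->H copy at the end
print(result[:4])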
> Julia has first-class support for GPU programming
"First-class" is a steep claim. Does it support the nvidia perf tools? Those are very important for taking a kernel from (in my experience) ~20% theoretical perf to ~90% theoretical perf.
Source line association when using PC sampling is currently broken, though, due to a bug in the NVIDIA drivers (they segfault when parsing the PTX debug info emitted by LLVM), but I'm told that may be fixed in the next driver.
> CUDAnative.jl also [...] generates the necessary line number information for the NVIDIA Visual Profiler to work as expected
That sounds very promising, but these tools are usually magnificent screenshot fodder yet they are conspicuously absent from the screenshots so I still have suspicions. Maybe I'll give it a try tonight and report back.
I mentioned this in my response to the other comment, but straight compilation is exactly what Numba does for CUDA support: just like Julia, Numba uses LLVM as a middle end (and LLVM has a PTX backend).
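To make that concrete: recent Numba versions will even hand you the PTX that LLVM's NVPTX backend produces. A sketch, assuming the numba.cuda.compile_ptx entry point I remember (the exact signature may differ between versions):

# Sketch: compiling a plain Python function straight to PTX with Numba.
from numba import cuda, float32

def axpy(r, a, x, y):
    i = cuda.grid(1)
    if i < r.size:
        r[i] = a * x[i] + y[i]

ptx, _ = cuda.compile_ptx(axpy, (float32[:], float32, float32[:], float32[:]))
print(ptx[:300])   # PTX text emitted via LLVM's NVPTX backend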
As someone who has dabbled in CUDA with some success, I'm going to be a little contrarian here. To me, the difficulty with GPU programming isn't the fact that CUDA uses C-syntax versus something more readable like Python. GPU programming is fundamentally difficult, and the minor gains from using a familiar language syntax are dwarfed by the need to understand blocks, memory alignment, thread hierarchy, etc. And I don't just say this. I live it. Even though I primarily program in C#, I don't use Hybridizer when I need GPU acceleration. I go straight to CUDA and marshal everything to/from C#.
That's not to say that CUDA Python isn't kinda cool, but it's not a magic bullet to finally understanding GPU programming if you've been struggling.
I have an RTX 2070 that's under-utilised, partly because I'm finding it surprisingly hard to understand C, C++, and CUDA by extension.
I'm self-taught, and had been using web languages and some Python before learning Rust.
I hope that NVIDIA can dedicate some resources to creating high-quality Rust bindings to the C API, even if it takes the next 1-2 years.
Perhaps being able to use a systems language that's been easy for me coming from TypeScript and Kotlin could inspire me to take baby steps with CUDA, without worrying about understanding C.
I like the CUDA.jl package, and once I make time to learn Julia, I would love to try that out. From this article about the Python library, I'm still left knowing very little about "how can I parallelise this function".
A nice thing about the proper ALGOL-lineage systems programming languages (of which C carries only a basic influence) is that you can write nice high-level code and only deal with pointers and raw pointer stuff when actually needed; think Ada, Modula-2, or Object Pascal kinds of languages.
So something like CUDA Rust would be nice to have.
Yes, I want the integration to be tighter. In fact I'd really like to be able to target Ada kernels at CUDA, ISPC, MLIR, SPIR-V... and also have access to deep APIs for each platform. Now that GNAT LLVM is getting stable(r), there are a lot of opportunities opening up in the Ada/SPARK world. KLEE would also be fun there.
I really wasn't a fan of the 'parallel' loop construct foreseen in Ada 2020: in addition to having a 'bad' syntax ('how' instead of 'what'), it wasn't really well integrated with the 'control your tasking precisely' mentality that Ada provides. Something a bit more platform-specific but still somehow portable, if designed well, would fit the Ada spirit far better.
And I thought the selling point of AdaCore to NVIDIA was more SPARK for firmware & embedded than 'classic' Ada. It might have gone further since, but it was already a huge jump for such a big tech company, one I can only applaud, when you see how much firmware hacking is just memory unsafety and UB exploits...
No, the high electricity cost in my country + the noise pollution in the house + how much I generally earn from the machine + my views on burning the world speculatively, discourage me from mining crypto.
Perhaps my position might change in future, but for now, I'd probably rather make the GPU accessible to those open-source distributed grids that train chess engines or compute deep-space related thingies :)
Sure it is. With a chess AI you're driving forward progress in neural networks, machine learning, and technology in general, which pushes humanity forward in ways too numerous to count (even if you don't buy into AI hype). With mining crypto, you're hashing a bunch of things against a bunch of other things to make some imaginary things that people only assign value to because other people assign value to them, in a tautology, ad infinitum. It's a scourge on the environment and a waste of great minds, to the extent that those minds are devising new crypto, not to the extent that they're buying a mining rig to mine crypto.
Tools to make GPU development easier are sorely needed.
I foolishly built an options pricing engine on top of PyTorch, thinking "oooh, it's a fast array library that supports CUDA transparently", only to find out that array indexing is 100x slower than in NumPy.
You might be interested in Legate [1]. It offers the NumPy interface as a drop-in replacement and supports GPUs as well as distributed machines. And you can see their performance results for yourself; they're not far off from hand-tuned MPI.
Interesting find about the indexing. I just had the opposite experience, swapped from numpy to torch in a project and got 2000x speedup on some indexing and basic maths wrapped in autodiff. And I haven't moved it onto cuda yet.
Here's an example that illustrates the phenomenon. If memory serves me right, index latency is superlinear in dimension count.
import time
from itertools import product

import torch

N = 100
ten = torch.randn(N, N, N)
arr = ten.numpy()

def indexTimer(val):
    # Time N**3 scalar element accesses, one index at a time.
    start = time.time()
    for i, j, k in product(range(N), range(N), range(N)):
        x = val[i, j, k]
    end = time.time()
    print('{:.2f}'.format(end - start))

indexTimer(ten)   # torch tensor: every val[i, j, k] builds a 0-d tensor
indexTimer(arr)   # numpy array: plain scalar access, far cheaper
Ah, I'm guessing it's the loop that kills you. Arrays/tensors are supposed to be used as a whole, I imagine e.g. x=val**2 would be much faster.
I'm indexing one array with another e.g. x=y[z] to pull out all the values I want at once into another array for processing. And not using a python loop for the expensive part.
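Concretely, something like this (a sketch with made-up sizes); only the Python-level per-element loop is the killer:

# Sketch: gather many values with one fancy-indexing call instead of a
# per-element Python loop. The same shape of code works in NumPy.
import torch

y = torch.randn(1_000_000)
z = torch.randint(0, y.numel(), (100_000,))  # indices of the values we want
x = y[z]                                     # single vectorized gather
out = (x ** 2).sum()                         # cheap math on the gathered slice
print(out.item())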
You can define any compute shader you like in Python, annotate it with the data types, and it compiles to SPIR-V and runs under macOS, Linux and Windows.
Am I the only one who thinks that the API looks terrible? Lots of cryptic and hard-to-remember names (`cuMemcpyDtoHAsync`), no proper error and log handling, manual building of command line arguments (`opts = [b"--fmad=false", b"--gpu-architecture=compute_75"]` – wat), …
And, knowing Nvidia, the documentation will probably be terrible and anything but beginner-friendly, too.
I mean, if you want to improve upon the CUDA ecosystem why not start with the low-hanging fruit first?
While those names are vestiges of history, I'd be curious: how would you name cuMemcpyDtoHAsync?
They have lots of functions [1] to name, and I usually find they did a good job given how low-level the language is and its proximity to the hardware and low-level programming.
I don't know about Python, but CUDA in C++ has error handling!
As for the command line: think of it as programming for a specific architecture with its associated API.
You can write all the boilerplate in Python and just the kernel in C (which you can pass as a string and compile automatically in your Python script). So far the workflow is much smoother than with nvcc (and creating some DLL bindings for the C program).
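If it helps anyone picture it: the same kernel-in-a-string workflow looks roughly like this with PyCUDA (the article's bindings are lower level, but the idea is the same). A sketch assuming PyCUDA is installed, with a made-up scale kernel:

# Sketch: Python boilerplate around a CUDA C kernel kept in a string.
import numpy as np
import pycuda.autoinit              # creates a context on the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
__global__ void scale(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}
""")
scale = mod.get_function("scale")

n = 1 << 20
x = np.random.randn(n).astype(np.float32)
y = np.empty_like(x)
scale(drv.Out(y), drv.In(x), np.float32(2.0), np.int32(n),
      block=(256, 1, 1), grid=((n + 255) // 256, 1))
print(np.allclose(y, 2.0 * x))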
Is this something that could be built on top of the MLIR Python bindings [1]? I don't know enough about any of these projects, but I worry that NVIDIA will be doing something proprietary and against an open ecosystem. It seems like you could build the entire Python toolchain on top of Python with the MLIR bindings and support NumPy acceleration that way. Just curious if this is the right thinking?
I thought for sure that someone would have posted a link to that xkcd comic by now. I only dabble with higher-level APIs, so I can't judge this on the merits. If NVIDIA really continues to back this, and follows through on wrapping other libraries like cuDNN, it could be a whole new level of vendor lock in as people start writing code that targets CUDA Python. I think the real test will be whether one of the big projects like PyTorch or TensorFlow gets on board.