Hacker News
Unifying the CUDA Python Ecosystem (nvidia.com)
147 points by pjmlp on April 16, 2021 | 58 comments

About 8 years ago an NVIDIA developer released a tool called Copperhead that let you write CUDA kernels in straight Python, which were then compiled to C, with no "C-in-a-string" as shown here. I always thought it was so elegant and had great potential, and I introduced a lot of people in my circle to it, but then it seems NVIDIA buried it.

This blog post is great, and we need these kinds of tools for sure, but we also need high-level expressibility that doesn't require writing kernels in C. I know there are other projects that have taken up that cause, but it would be great to see NVIDIA double down on something like Copperhead.
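
As a rough illustration of the idea of generating kernel source from Python instead of hand-writing C in a string, here is a toy translator built on the standard `ast` module. It is not how Copperhead worked; `expr_to_c` and `elementwise_kernel` are hypothetical names, only `+ - * /` on float arrays and numeric constants are handled, and the emitted CUDA C is printed, not compiled.

```python
import ast

# Map Python AST operator nodes to C operator tokens (toy subset).
_C_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def expr_to_c(node):
    """Translate a tiny subset of Python expressions to a C expression."""
    if isinstance(node, ast.BinOp) and type(node.op) in _C_OPS:
        left, right = expr_to_c(node.left), expr_to_c(node.right)
        return f"({left} {_C_OPS[type(node.op)]} {right})"
    if isinstance(node, ast.Name):
        # Assumption: every name is a float array indexed once per thread.
        return f"{node.id}[i]"
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return f"{float(node.value)!r}f"
    raise NotImplementedError(ast.dump(node))


def elementwise_kernel(name, args, expression):
    """Emit CUDA C source computing out[i] = expression over float arrays."""
    body = expr_to_c(ast.parse(expression, mode="eval").body)
    params = ", ".join(f"const float *{a}" for a in args)
    return (
        f'extern "C" __global__ void {name}({params}, float *out, size_t n) {{\n'
        "    size_t i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"    if (i < n) out[i] = {body};\n"
        "}\n"
    )


source = elementwise_kernel("axpy", ["x", "y"], "2 * x + y")
print(source)
```

A real system would also need type inference, scalar arguments, and reductions; the sketch only shows that the Python expression, not a C string, can be the source of truth.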

I’m so glad you liked it. It was a labor of love and I was super proud of it. Means a lot to me that you would still remember it.

I flatter myself to think that some ideas from that project have lived on in Tensorflow and PyTorch, etc. But the project itself wrapped up when I decided to focus on DL, and it would take a lot of work to bring it back to life.

I am not a CUDA dev, but it feels so good that you, the authors, are around :) Cool!

Thank you for your work.

Can you guys link to the blog post?

Oh that sounds interesting. Do you know what happened to it?

I think I found it here: https://github.com/bryancatanzaro/copperhead

But I'm not sure what the state is. Looks dead (last commit 8 years ago). Probably just a proof of concept. But why hasn't this been continued?

Blog post and example: https://developer.nvidia.com/blog/copperhead-data-parallel-p... https://github.com/bryancatanzaro/copperhead/blob/master/sam...

Btw, for compiling on-the-fly from a string, I made something similar for our RETURNN project. Example for LSTM: https://github.com/rwth-i6/returnn/blob/a5eaa4ab1bfd5f157628...

This is made in a way that it compiles automatically into an op for Theano or TensorFlow (PyTorch could easily be added as well) and for both CPU and CUDA/GPU.

I don't know specifics about Copperhead in particular, but Bryan Catanzaro (creator of Copperhead) is now the VP of Applied Deep Learning Research at Nvidia. He gave a talk at GTC this year, which is how I heard about all of this in the first place.

Source: https://www.linkedin.com/in/bryancatanzaro/

As it turns out, NVIDIA just open sourced a product called Legate which targets not just GPUs but distributed machines as well. Right now it supports NumPy and Pandas but perhaps they'll add others in the future. Just thought this might be up your alley since it works at a higher level than the glorified CUDA in the article.


Disclaimer: I work on the project they used to do the distributed execution, but otherwise have no connection with Legate.

Edit: And this library was developed by a team managed by one of the original Copperhead developers, in case you're wondering.

that project might be abandoned but this strategy is used in nvidia and nvidia adjacent projects (through llvm):



>but we also need high level expressibility that doesn't require writing kernels in C

the above are possible because C is actually just a frontend to PTX


fundamentally you are never going to be able to write cuda kernels without thinking about cuda architecture, any more than you'll ever be able to write async code without thinking about concurrency.

In the IP world, there are some hidden gems that disappear with no trace one day.

I worked for a client that had this wonderful Python DSL that compiled to Verilog and VHDL. It was much easier to use than writing the stuff the old way. Much more composable too, not to mention tooling.

They created that by forking an open source project dating back to Python 2.5 that I could never find again.

Imagine if that stuff would still be alive today. You could have a market for paid pypi.org instances providing you with pip installable IP components you can compose and customize easily.

But in this market, sharing is not really a virtue.

Sounds like nmigen might be a good open source successor to the project that you describe: https://github.com/nmigen/nmigen There are lots of open-source (n)migen components available through the litex framework: https://github.com/enjoy-digital/litex

Looks awesome.

Closest thing to mind is Numba's cuda JIT compilation : https://numba.pydata.org/numba-doc/latest/cuda/index.html

Then you have Cudapy : https://github.com/oulgen/CudaPy

But in my opinion, the most future proof solutions are higher level frameworks like Numpy, Jax and Tensorflow. TensorFlow and Jax can JIT compile Python functions to GPU (tf.function).

Totally agree, Copperhead looks much easier to use. Perhaps one of the reasons they went and rebuilt from scratch is because Copperhead relies on Thrust and a couple other dependencies?

There is arrayfire which does something like that and has python bindings

Just for contrast, it's interesting to look at an example of writing a similar kernel in Julia:


I don't think it's possible to achieve something like this in python because of how it's interpreted (but it sounds a bit like what another comment mentioned where the python was compiled to C)

I think the contrast is probably less about the language, and more about the scope and objective of the projects. the blog is describing low-level interfaces in python - probably more comparable is the old CUDAdrv.jl package (now merged into CUDA.jl): https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/examples/...

here is writing a similar kernel in python with numba: https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%2...

I gave numba CUDA a spin in late 2018 and was severely disappointed. It didn't work out of the box, I had to tweak the source to remove a reference to an API that had been removed from CUDA more than a year prior (and deprecated long ago). Then I ran into a bug when converting a float array to a double array -- I had to declare the types three different times and it still did a naive byte-copy rather than a conversion. Thanks to a background in numerics, the symptoms were obvious, but yikes. The problem that finally did us in was an inability to get buffers to correctly pass between kernels without a CPU copy, which was absolutely critical for our perf. I think this was supported in theory but just didn't work.

In any case, we did a complete rewrite in CUDA proper in less time than we spent banging our heads against that last numba-CUDA issue.

Under every language bridge there are trolls and numba-CUDA had some mean ones. Hopefully things have gotten better but I'm definitely still inside the "once bitten twice shy" period.
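
The float-to-double symptom described above is easy to reproduce without a GPU. The following is a minimal sketch using only Python's standard `array` module, not Numba; it contrasts an element-wise conversion with a raw reinterpretation of the same bytes, and assumes the platform's C `float` is 4 bytes and `double` is 8 bytes.

```python
from array import array

# Four single-precision values that are exactly representable in float32.
floats = array("f", [1.0, 2.0, 0.5, -3.25])

# Correct conversion: each element is widened to double precision.
converted = array("d", list(floats))

# Naive byte-copy: the same 16 bytes reinterpreted as 8-byte doubles.
reinterpreted = array("d")
reinterpreted.frombytes(floats.tobytes())

print("converted:    ", list(converted))
print("reinterpreted:", list(reinterpreted))
```

The reinterpreted array has half as many elements, and its values are unrelated to the inputs, which is the kind of pattern that makes this class of bug recognizable.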

Same here. I switched over to CuPy from numba.cuda

> Julia has first-class support for GPU programming

"First-class" is a steep claim. Does it support the nvidia perf tools? Those are very important for taking a kernel from (in my experience) ~20% theoretical perf to ~90% theoretical perf.

Yeah, see this section of the documentation: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/. CUDA.jl also supports NVTX, wraps CUPTI, etc. The full extent of the APIs and tools is available.

Source line association when using PC sampling is currently broken due to a bug in the NVIDIA drivers though (segfaulting when parsing the PTX debug info emitted by LLVM), but I'm told that may be fixed in the next driver.

Nice! I set a reminder to check back in a month.

> CUDAnative.jl also [...] generates the necessary line number information for the NVIDIA Visual Profiler to work as expected

That sounds very promising, but these tools are usually magnificent screenshot fodder yet they are conspicuously absent from the screenshots so I still have suspicions. Maybe I'll give it a try tonight and report back.

Here's a screenshot: https://julialang.org/assets/blog/nvvp.png. Or a recent PR where you can see NVTX ranges from Julia: https://github.com/JuliaGPU/CUDA.jl/pull/760

Thanks! Now I believe! :)

JAX and TensorFlow functions both would convert some Python code to equivalent XLA code or a TF graph.

i mentioned this in the response to the other comment but straight compilation is exactly what numba does for CUDA support because, just like Julia, numba uses llvm as a middleend (and llvm has a ptx backend).

As someone who has dabbled in CUDA with some success, I'm going to be a little contrarian here. To me, the difficulty with GPU programming isn't the fact that CUDA uses C-syntax versus something more readable like Python. GPU programming is fundamentally difficult, and the minor gains from using a familiar language syntax are dwarfed by the need to understand blocks, memory alignment, thread hierarchy, etc. And I don't just say this. I live it. Even though I primarily program in C#, I don't use Hybridizer when I need GPU acceleration. I go straight to CUDA and marshal everything to/from C#.

That's not to say that CUDA Python isn't kinda cool, but it's not a magic bullet to finally understanding GPU programming if you've been struggling.
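
To make the point about blocks and thread hierarchy concrete, here is a pure-Python emulation of how a 1-D CUDA launch maps threads to array elements. It is a sketch only: `launch` and `saxpy` are hypothetical names, the loops run sequentially on the CPU, and nothing here touches a GPU. What carries over to real kernels is the index arithmetic and the bounds guard.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Run `kernel` once per (block, thread) pair, sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy(block_idx, block_dim, thread_idx, a, x, y, out):
    # Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in CUDA C.
    i = block_idx * block_dim + thread_idx
    # Bounds guard: the grid is rounded up, so trailing threads have no element.
    if i < len(x):
        out[i] = a * x[i] + y[i]


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered

launch(saxpy, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

With 10 elements and 4 threads per block, the launch needs 3 blocks and the last 2 threads do nothing; forgetting the guard is an out-of-bounds write on real hardware, whatever the host language.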

I have an RTX 2070 that's under-utilised, partly because I'm finding it surprisingly hard to understand C and C++, and CUDA by extension.

I'm self-taught, and have been using web languages and some python, before learning Rust. I hope that NVIDIA can dedicate some resources to creating high-quality bindings to the C API for Rust, even if in the next 1-2 years.

Perhaps being able to use a systems language that's been easy for me coming from TypeScript and Kotlin, could inspire me to take baby steps with CUDA, without worrying about understanding C.

I like the CUDA.jl package, and once I make time to learn Julia, I would love to try that out. From this article about the Python library, I'm still left knowing very little about "how can I parallelise this function".

A nice thing about the proper ALGOL-lineage systems programming languages (from which C took only basic influence) is that you can write nice high-level code and only deal with pointers and raw pointer stuff when actually needed; think Ada, Modula-2, Object Pascal kinds of languages.

So something like CUDA Rust would be nice to have.

By the way, D already supports CUDA,


CUDA Ada would be so, so nice. Especially with non-aliasing guarantees from SPARK...

Yes, I want the integration to be tighter. In fact I'd really like to be able to target Ada kernels to cuda, ispc, mlir, spirv... And also have access to deep APIs for each platform. Now that gnatllvm is getting stable(r), there's a lot of opportunities opening in the Ada/SPARK world. KLEE would also be fun there.

I really wasn't a fan of the 'parallel' loop construct foreseen in Ada2020: in addition to having a 'bad' syntax ('how' instead of 'what'), it wasn't really well integrated in the 'control your tasking precisely' mentality that Ada provided. Having something a bit more platform-specific but still somehow portable if designed well would fit the Ada spirit far closer.

And I thought the selling point of AdaCore to NVIDIA was more SPARK for firmware & embedded than 'classic' Ada. It might have gone further since but it was already a huge jump for such a big tech company, one I can only applaud, when you see how much firmware hacking is just memory unsafety and UB exploits...

Name it CRUST

+1 Would love to see official support for CUDA for Rust.

> I have a RTX 2070 that's under-utilised

I've found that there are really good and beginner-friendly Blender tutorials. Both free and paid ones.

If you are looking to maximize use of that card, you can make about $5 a day mining crypto with the 2070.

No, the high electricity cost in my country + the noise pollution in the house + how much I generally earn from the machine + my views on burning the world speculatively, discourage me from mining crypto.

Perhaps my position might change in future, but for now, I'd probably rather make the GPU accessible to those open-source distributed grids that train chess engines or compute deep-space related thingies :)

I am not convinced that training AI to win at chess is any more moral than mining crypto. And the blockchain is about as open-source as you can get.

Sure it is. With a chess AI you’re driving forward progress in neural networks, machine learning, and technology in general which pushes forward humanity in ways that are too numerous to count (even if you don’t buy into AI hype). With mining crypto, you’re hashing a bunch of things against a bunch of other things to make some imaginary things that people only assign value to because other people assign value to, in a tautology, ad infinitum. It’s a scourge on the environment and a waste of great minds, to the extent that those minds are devising new crypto, not to the extent that they’re buying a mining rig to mine crypto.

Tools to make GPU development easier are sorely needed.

I foolishly built an options pricing engine on top of PyTorch, thinking "oooh, it's a fast array library that supports CUDA transparently". Only to find out that array indexing is 100x slower than numpy.

You might be interested in Legate [1]. It supports the NumPy interface as a drop-in replacement, supports GPUs and also distributed machines. And you can see for yourself their performance results; they're not far off from hand-tuned MPI.

[1]: https://github.com/nv-legate/legate.numpy

Disclaimer: I work on the library Legate uses for distributed computing, but otherwise have no connection.

Interesting find about the indexing. I just had the opposite experience, swapped from numpy to torch in a project and got 2000x speedup on some indexing and basic maths wrapped in autodiff. And I haven't moved it onto cuda yet.

Here's an example that illustrates the phenomenon. If memory serves me right, index latency is superlinear in dimension count.

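   import time, torch
   from itertools import product

   N = 100

   ten = torch.randn(N,N,N)
   arr = ten.numpy()  # NumPy view of the same data

   def indexTimer(val):
       # Time N**3 scalar lookups, one Python-level index call each.
       start = time.perf_counter()
       for i,j,k in product(range(N), range(N), range(N)):
           x = val[i, j, k]
       end = time.perf_counter()
       return end - start

   print("torch:", indexTimer(ten))
   print("numpy:", indexTimer(arr))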


Ah, I'm guessing it's the loop that kills you. Arrays/tensors are supposed to be used as a whole; I imagine that e.g. x=val**2 would be much faster.

I'm indexing one array with another e.g. x=y[z] to pull out all the values I want at once into another array for processing. And not using a python loop for the expensive part.

>>> built an options pricing engine on top of PyTorch

I'd love to hear more about this! Do you have any posts or write-ups on this?

Somewhat related, I’ve tried running compute shaders using wgpu-py:


You can define any compute shader you like in Python, and annotate it with the data types, and it compiles to SPIR-V and runs under macOS, Linux and Windows.

Am I the only one who thinks that the API looks terrible? Lots of cryptic and hard-to-remember names (`cuMemcpyDtoHAsync`), no proper error and log handling, manual building of command line arguments (`opts = [b"--fmad=false", b"--gpu-architecture=compute_75"]` – wat), …

And, knowing Nvidia, the documentation will probably be terrible and anything but beginner-friendly, too.

I mean, if you want to improve upon the CUDA ecosystem why not start with the low-hanging fruit first?
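
The complaint about hand-built option lists and unchecked status codes can be illustrated with a thin wrapper. The following is only a sketch: `build_nvrtc_opts`, `check`, `DriverError`, and the stand-in `Status` enum are hypothetical names, not part of CUDA Python. It assumes that options are byte strings of the form `--name=value`, as in the snippet quoted above, and that each wrapped call returns a tuple whose first element is a status code.

```python
from enum import IntEnum


class Status(IntEnum):
    """Stand-in for a driver status enum; the real one is not used here."""
    SUCCESS = 0
    INVALID_VALUE = 1


class DriverError(RuntimeError):
    """Raised when a wrapped call reports a non-success status."""


def build_nvrtc_opts(**options):
    """Turn keyword arguments into byte-string compiler options."""
    opts = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            value = "true" if value else "false"
        opts.append(f"{flag}={value}".encode("ascii"))
    return opts


def check(result):
    """Unpack a (status, *values) tuple, raising on failure."""
    status, *values = result
    if status != Status.SUCCESS:
        raise DriverError(f"call failed with {status!r}")
    # Return a bare value for single results, a tuple otherwise.
    return values[0] if len(values) == 1 else tuple(values)


opts = build_nvrtc_opts(fmad=False, gpu_architecture="compute_75")
print(opts)

value = check((Status.SUCCESS, 42))  # simulated successful call
print(value)
```

A wrapper like this keeps the low-level names available while letting callers write keyword arguments and rely on exceptions.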

While those names are vestiges of the history, I would be curious how you would name cuMemcpyDtoHAsync?

They have lots of functions [1] to name, and I usually find they did a good job given the low level of the language and its proximity to the hardware.

I do not know about Python, but CUDA in C++ has error handling!

As for the command line: think of it as if you were programming for a specific architecture with its associated API.

[1] https://docs.nvidia.com/cuda/cuda-runtime-api/index.html

Actually I like pyCuda. https://documen.tician.de/pycuda/tutorial.html

You can write all the boilerplate in Python and just the kernel in C (which you can pass as a string and compile automatically in your Python script). So far the workflow is much smoother than with nvcc (and creating some DLL bindings for the C program).

Note that you can write CUDA in many languages such as Java, Kotlin, Python, Ruby, JS, R with https://github.com/NVIDIA/grcuda

Is this something that could be built on top of MLIR Python Bindings [1]? I don't know enough about any of these projects, but I worry that NVIDIA will be doing something proprietary and against an open ecosystem. It seems like you could build the entire python toolchain on top of python with MLIR bindings and support numpy acceleration that way. Just curious if this is the right thinking?

[1] https://mlir.llvm.org/docs/Bindings/Python/

I thought for sure that someone would have posted a link to that xkcd comic by now. I only dabble with higher-level APIs, so I can't judge this on the merits. If NVIDIA really continues to back this, and follows through on wrapping other libraries like cuDNN, it could be a whole new level of vendor lock in as people start writing code that targets CUDA Python. I think the real test will be whether one of the big projects like PyTorch or TensorFlow gets on board.

I may be missing something - but what exactly does it do? And how is it different than dask_cuda and cupy?

There are a lot of mays, shoulds, and coulds in there.
