I would recommend considering nanobind, the follow-up to pybind11 by the same author (Wenzel Jakob), and moving as much performance-critical code as possible to C or C++. https://github.com/wjakob/nanobind
If you really care about the performance of code called from Python, consider something like NVIDIA Warp (preview). Warp JIT-compiles your code and runs it on CUDA or the CPU. Although Warp targets physics simulation, geometry processing, and procedural animation, it can be used for other tasks as well. https://github.com/NVIDIA/warp
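To make that concrete, a Warp kernel is ordinary decorated Python; a minimal sketch (the kernel, sizes, and scale factor here are made up for illustration):

    import warp as wp

    wp.init()

    @wp.kernel
    def scale(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
        i = wp.tid()  # index of this thread/lane
        y[i] = a * x[i]

    n = 1024
    x = wp.array([float(i) for i in range(n)], dtype=float, device="cpu")
    y = wp.zeros(n, dtype=float, device="cpu")

    # compiled (and cached) on first launch; device="cuda" runs it on the GPU
    wp.launch(scale, dim=n, inputs=[x, y, 2.0], device="cpu")
    print(y.numpy()[:4])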
>I would recommend considering nanobind, the follow-up to pybind11 by the same author (Wenzel Jakob), and moving as much performance-critical code as possible to C or C++
Why would you recommend that? It's all way more effort than just writing Cython, especially in a Jupyter Notebook. And Cython code can be just as fast as C/C++ code unless you're doing something really fancy. It's a bunch of work for no benefit.
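For what it's worth, the low-effort notebook path being described looks roughly like this (a minimal sketch; the function and typings are made up for illustration):

    %load_ext cython

    %%cython
    # typed, sequential loop: Cython compiles this cell to C
    cpdef double running_total(double[:] x):
        cdef Py_ssize_t i
        cdef double total = 0.0
        for i in range(x.shape[0]):
            total += x[i]
        return total

The two magics go in separate cells; the compiled function is then callable from plain Python at C speed.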
>Warp JIT-compiles your code and runs it on CUDA or the CPU
If someone's writing Cython, it's probably because they found something that couldn't be done efficiently in NumPy because it was sequential and not easily vectorisable. Such code is going to get zero benefit from CUDA or from running on the GPU.
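A hypothetical example of the kind of loop meant here: an exponential moving average, where each output depends on the previous one, so there is no straightforward NumPy vectorisation and no data parallelism for a GPU to exploit:

    import numpy as np

    def ema(x, alpha):
        # out[i] depends on out[i - 1], so this loop is inherently
        # sequential: it can't be vectorised or usefully parallelised
        out = np.empty_like(x)
        out[0] = x[0]
        for i in range(1, len(x)):
            out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
        return out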
In general, JIT-compiled code is not going to be as fast as code compiled with an ahead-of-time compiler like the C compiler that Cython uses. Moreover, if you use a JIT, it makes your code a pain in the ass to embed in a C/C++ application, unlike Cython code.
> Why would you recommend that? [..] It's a bunch of work for no benefit.
nanobind/pybind11 (co-)author here. The space of Python bindings is extremely diverse and on the whole probably looks very different from your use case. nanobind/pybind11 target the 'really fancy' case you mention, specifically for codebases that are "at home" in C++ but want natural Pythonic bindings. There is near-zero overlap with Cython.
Yes, I assumed everyone who cares about performance (or who writes large programs) is also a C++ or CUDA programmer. Don't tell me that is not the case :-)
Warp generates C/C++ code that can be trivially used in a pure C++ or CUDA project without issues. So it is not strictly a JIT, since it only calls the regular ahead-of-time compiler (gcc, LLVM, or nvcc) when the code changes (using hashes to check for changes), so performance is good. Also, random non-vectorizable branchy code will run fine on the CPU with Warp, but you do lose many of the benefits.
Agreed, if you have badly performing spaghetti Python code, none of these tools is going to help. Then I would rather rewrite it all in C/C++ than fiddle with Cython.
If the object you wanna bind fits the mold of "an algorithm with inputs and outputs, and some helper methods", I've got automatic binding of a limited set of C++ features working in https://github.com/celtera/avendish ; for now I've been using pybind11, but I guess everything I need is supported by nanobind, so maybe I'll do the port...
Google JAX is another option, JIT-compiling and vectorizing code for TPU, GPU, or CPU. https://github.com/google/jax
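A minimal sketch of what that looks like (the function and data are made up for illustration):

    import jax
    import jax.numpy as jnp

    @jax.jit
    def normalize(x):
        # traced and compiled by XLA on the first call, cached afterwards;
        # the same code runs on CPU, GPU, or TPU
        return (x - x.mean()) / x.std()

    x = jnp.arange(8.0)
    print(normalize(x))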