Although I have to concede that the automatic grid size computation in cuda-api-wrappers is nice.
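(For the curious: automatic launch-size computation is typically built on CUDA's occupancy API - I'm assuming that's roughly what the wrappers use underneath. A minimal sketch with the raw runtime call, plain CUDA rather than the wrapper library's own interface:)

```cpp
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { y[i] = a * x[i] + y[i]; }
}

// Let the runtime suggest a block size maximizing occupancy,
// then derive the grid size from the problem size.
void launch_saxpy(float a, const float* x, float* y, int n) {
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, saxpy, 0, 0);
    int grid_size = (n + block_size - 1) / block_size;
    saxpy<<<grid_size, block_size>>>(a, x, y, n);
}
```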
A few marketing tips for your README:
* Put a code example directly at the top. You want to present your library's selling points to the reader as fast as possible. For reference, look at the CuPy README https://github.com/cupy/cupy?tab=readme-ov-file#cupy--numpy-... which immediately shows the reader what it is good for. Your README starts with lots of text, but nobody reads text anymore these days. A link to examples comes almost at the end, and even then the examples are deeply nested.
* The first links in the README should link to your own library, for example to documentation or examples. You do not want to lead the reader away from your GitHub page.
* Add syntax highlighting by putting "cpp" after the opening triple backticks, e.g.:
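For example, a fence opened with "```cpp" renders with C++ highlighting (any snippet will do; this one is just to show the effect):

```cpp
#include <vector>

// Highlighted as C++ rather than rendered as plain text
std::vector<float> host_buffer(1024, 0.0f);
```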
CuPy is a useful, and fairly large, library which does a lot of things. In your example, you use it to create buffers, fill them with random values, and perform elementwise arithmetic on them. NumPy does that, which is why CuPy does that. My library only wraps CUDA functionality, and mostly "does nothing" [1] - so you have to "do everything yourself", except that it's easy(ish) to do so. It definitely never does anything behind-the-scenes or behind-your-back.
This difference between the libraries makes your program more terse; however, you lose control over where your buffers live, from where they're accessible, and when and how they get copied around. You can't even tell - from looking at the program source - whether the buffers will be "managed memory", accessed and copied page-by-page, or whether a copy will be made from system memory to device-global memory.
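To make that distinction concrete, here is a minimal sketch - in plain CUDA runtime calls, not either library's interface - of the two possibilities which a CuPy program's source does not let you distinguish between:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

int main() {
    constexpr std::size_t n = 1 << 20;
    constexpr std::size_t num_bytes = n * sizeof(float);

    // Possibility 1: managed ("unified") memory - one pointer, valid on
    // host and device; pages migrate on demand, behind your back.
    float* managed = nullptr;
    cudaMallocManaged(&managed, num_bytes);
    std::memset(managed, 0, num_bytes); // touched on host; device access migrates pages

    // Possibility 2: explicit device-global memory - you decide exactly
    // when the host-to-device copy happens.
    float* host = new float[n]();
    float* device = nullptr;
    cudaMalloc(&device, num_bytes);
    cudaMemcpy(device, host, num_bytes, cudaMemcpyHostToDevice);

    cudaFree(device);
    delete[] host;
    cudaFree(managed);
}
```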
So, in my book, CuPy does not make it as easy to access and control CUDA. But - it is easier for a user who "needs NumPy for GPUs", and does not care about the nitty-gritty, to write their program and get things done. Your program demonstrates both of these points.
I should mention that I wrote my library in the hope that others will use it to build higher-level-abstraction libraries and apps. One could use it, for instance, to create a "cuCpp" library that would be very NumPy-like, but for C++ - a parallel of NumCpp [2].
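A hypothetical sketch of what the surface of such a "cuCpp" might look like, so that user code could read `auto c = a + b;` - all names here (gpu_array, add_kernel) are invented for illustration, and I've used raw runtime calls rather than my wrappers to keep it short:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

__global__ void add_kernel(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = a[i] + b[i]; }
}

// A NumPy-like owning array type; elementwise operators launch kernels.
class gpu_array {
public:
    explicit gpu_array(std::size_t n) : n_(n) { cudaMalloc(&data_, n * sizeof(float)); }
    ~gpu_array() { cudaFree(data_); }
    gpu_array(const gpu_array&) = delete;
    gpu_array& operator=(const gpu_array&) = delete;
    gpu_array(gpu_array&& other) noexcept : data_(other.data_), n_(other.n_) {
        other.data_ = nullptr;
    }

    friend gpu_array operator+(const gpu_array& a, const gpu_array& b) {
        gpu_array result(a.n_);
        constexpr unsigned block = 256;
        auto grid = static_cast<unsigned>((a.n_ + block - 1) / block);
        add_kernel<<<grid, block>>>(a.data_, b.data_, result.data_, a.n_);
        return result;
    }

private:
    float* data_ = nullptr;
    std::size_t n_;
};
```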
Thanks for the tips regarding the README; I'll fix it up.
----
[1] : cuda-api-wrappers does offer a couple of utility classes, like a poor man's span for pre-C++17, and a span+unique_ptr combo - which goes beyond wrapping CUDA's APIs, but still doesn't quite "do" things.

[2] : https://github.com/dpilger26/NumCpp
> It definitely never does anything behind-the-scenes or behind-your-back.
Actually, that's a bit of a lie: the library does manage the primary device context reference counts (which is quite annoying to do well without leaking resources) and the context stack behind the scenes. So, let's say it does as little as possible behind the scenes... :-(
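For reference, this is the driver-API bookkeeping being papered over - raw CUDA driver calls, shown here just to illustrate the retain/release pairing one must not get wrong:

```cpp
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Every retain bumps the primary context's reference count...
    CUcontext primary_ctx;
    cuDevicePrimaryCtxRetain(&primary_ctx, dev);

    // ...and must be matched by exactly one release, on every code
    // path, or the context (and its resources) leaks.
    cuCtxPushCurrent(primary_ctx);   // the context stack, also managed manually
    // ... do work ...
    CUcontext popped;
    cuCtxPopCurrent(&popped);

    cuDevicePrimaryCtxRelease(dev);
}
```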
In Python? Perhaps. Generally? No, it isn't. Try: https://github.com/eyalroz/cuda-api-wrappers/
Full power of the CUDA APIs, including all runtime compilation options, etc.
(Yes, I wrote that...)
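To make "runtime compilation" concrete: underneath, that means driving NVRTC. A minimal sketch with the raw NVRTC calls (plain NVRTC here, not the wrapper library's own interface; the option strings are just examples):

```cpp
#include <nvrtc.h>
#include <string>

// Compile CUDA C++ source, given as a string, to PTX at runtime.
std::string compile_to_ptx(const char* source) {
    nvrtcProgram program;
    nvrtcCreateProgram(&program, source, "kernel.cu", 0, nullptr, nullptr);

    // Any of NVRTC's compilation options can be passed here.
    const char* options[] = { "--std=c++17", "--use_fast_math" };
    nvrtcCompileProgram(program, 2, options);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(program, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(program, &ptx[0]);

    nvrtcDestroyProgram(&program);
    return ptx;
}
```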