
For any compiler, "supporting" a certain CPU or GPU only means that it can generate correct translated code with that CPU or GPU as the execution target.

It does not mean that the compiler can generate code with optimal performance, when achieving that requires certain instructions that have no direct equivalent in a high-level language.

No compiler that supports the Intel/AMD x86 ISA knows how to use all the instructions available in that ISA.




Sure, but I'm not sure that's what the parent poster was saying (i.e., that nvcc generates poor-quality PTX for newer devices).

It's been a while since I looked at CUDA, but it used to be that NVIDIA was continually extending cuDNN to add support for kernels needed by SOTA models, and I assume those kernels were all hand-optimized.

I'm curious what kind of models people are writing where not only is there no optimized cuDNN support, but solutions like Triton or torch.compile, and even hand-optimized CUDA C kernels, are too slow. Are hand-written PTX kernels really that common?


Yes. Take a look at, say, CUTLASS: you'll see that they use PTX instructions because there are no intrinsics, much less automatic compiler lowering, for the accelerators they target.
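
For context, the usual way to reach such instructions from CUDA C++ is inline PTX assembly. A minimal sketch of the mechanism (the instruction here is plain add.u32, chosen only as a stand-in for one that lacks an intrinsic; CUTLASS's actual uses target tensor-core and async-copy instructions):

    // Hypothetical illustration: wrapping a PTX instruction that the
    // compiler exposes no intrinsic for. Compile with nvcc as device code.
    __device__ unsigned emit_ptx_add(unsigned a, unsigned b) {
        unsigned r;
        // "=r" binds r as a 32-bit register output; "r" binds the inputs.
        asm volatile("add.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
        return r;
    }

The point is that once an accelerator feature exists only at the PTX (or SASS) level, this asm escape hatch is the only portable way to use it from CUDA C++ until NVIDIA ships intrinsics or compiler lowering for it.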


Yes, but that's an NVIDIA project, so it would be expected to be hand-optimized, same as their cuDNN kernels.

I'm more curious about what types of models people in research or industry are developing where NVIDIA-provided support like this is not enough, and they are writing their own PTX kernels.



