
For any compiler, "supporting" a certain CPU or GPU only means that it can generate correct translated code with that CPU or GPU as the execution target.

It does not mean that the compiler can generate code with optimal performance, when achieving that requires certain instructions that have no direct equivalent in a high-level language.

No compiler that supports the Intel/AMD x86 ISA knows how to use all the instructions available in that ISA.




Sure, but I'm not sure that's what the parent poster was saying (i.e., that nvcc generates poor-quality PTX for newer devices).

It's been a while since I looked at CUDA, but it used to be that NVIDIA was continually extending cuDNN to add support for kernels needed by SOTA models, and I assume those kernels were all hand-optimized.

I'm curious what kind of models people are writing where not only is there no optimized cuDNN support, but solutions like Triton or torch.compile, and even hand-optimized CUDA C kernels, are too slow. Are hand-written PTX kernels really that common?


Yes. Take a look at, say, CUTLASS: you'll see that they use PTX instructions because there are no intrinsics, much less automatic compiler lowering, for the accelerators they target.
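
For context, the usual way to reach such instructions from CUDA C++ is inline PTX assembly. A minimal sketch of the mechanism (the instruction here is plain add.u32, chosen only as a stand-in for one that lacks an intrinsic; CUTLASS's actual uses target tensor-core and async-copy instructions):

    // Hypothetical illustration: wrapping a PTX instruction that the
    // compiler exposes no intrinsic for. Compile with nvcc as device code.
    __device__ unsigned emit_ptx_add(unsigned a, unsigned b) {
        unsigned r;
        // "=r" binds r as a 32-bit register output; "r" binds the inputs.
        asm volatile("add.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
        return r;
    }

The point is that once an accelerator feature exists only at the PTX (or SASS) level, this asm escape hatch is the only portable way to use it from CUDA C++ until NVIDIA ships intrinsics or compiler lowering for it.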


Yes, but that's an NVIDIA project, so it would be expected to be hand-optimized, same as their cuDNN kernels.

I'm more curious about what types of models people in research or industry are developing where NVIDIA-provided support like this is not enough, and they are writing their own PTX kernels.



