I work in computer vision, and pretty much all recent SOTA models have shipped with custom CUDA code. So I'd argue the opposite: in my field, not being able to run custom CUDA kernels will effectively exclude you from research.
It's "optional" in the sense that things still calculate correctly on CPU without it, but at a 1000x performance penalty. Or you could skip it if you had 64GB of GPU RAM, which you cannot buy (yet).
So if you actually want to work with this on GPUs that are commercially available, you need it.
Ok, so in this example the CUDA code is only used when you're short on GPU memory. That means if Apple makes a chip with enough memory, CUDA compatibility won't be necessary, right?
Are there any examples where custom CUDA code implements some op that can't be expressed in PyTorch/TF/JAX/etc.? That would provide better support for your claim that the M1 needs to be able to run CUDA.
Note the CUDA kernels in the original repo were added in August 2017. They may well have been necessary at the time, but if you need to do something like that today, you're probably an outlier. Modern DL libraries have a pretty vast assortment of ops. There have been a few cases in the last couple of years where I thought I'd have to write a custom CUDA op (e.g. np.unpackbits), but every time I found a way to implement it with native PyTorch ops that was fast enough for my purposes.
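For concreteness, something along these lines is what I mean by "native ops" for the unpackbits case. This is just a rough MSB-first sketch of the idea, not the exact code I used, but it runs on the GPU with no custom kernel:

```python
import torch

def unpackbits(x: torch.Tensor) -> torch.Tensor:
    """Rough equivalent of np.unpackbits (MSB-first) in native PyTorch.

    x: uint8 tensor of any shape; returns a uint8 tensor with an extra
    trailing dimension of 8 holding the bits of each byte.
    """
    assert x.dtype == torch.uint8
    # One mask per bit position, most significant bit first.
    masks = torch.tensor([128, 64, 32, 16, 8, 4, 2, 1],
                         dtype=torch.uint8, device=x.device)
    # Broadcast (..., 1) against (8,) and test each bit.
    return (x.unsqueeze(-1) & masks).ne(0).to(torch.uint8)

# Usage: unpack a packed binary mask on whatever device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
packed = torch.tensor([0b10110001, 0b00001111], dtype=torch.uint8, device=device)
print(unpackbits(packed))
# tensor([[1, 0, 1, 1, 0, 0, 0, 1],
#         [0, 0, 0, 0, 1, 1, 1, 1]], ...)
```

It's all elementwise broadcasting, so it vectorizes fine across the whole tensor, which is why it ended up fast enough without dropping to CUDA.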
If you're doing DL/CV research, can you give an example from your own work where you really need to run custom CUDA code today?