Hi HN!
I was inspired by Andrej Karpathy's llm.c (https://github.com/karpathy/llm.c), and wrote a full diffusion model training loop in CUDA. I learnt a lot about CUDA from Simon Boehm's Matmul blog (https://siboehm.com/articles/22/CUDA-MMM).
There is still a lot of room for optimization: the model currently runs at about 45% of the speed of PyTorch with torch.compile.
I'd welcome any thoughts, especially CUDA tips for optimizing convolutions.