Hi HN!
I was inspired by Andrej Karpathy's llm.c (https://github.com/karpathy/llm.c), and wrote a full diffusion model training loop in CUDA. I learnt a lot about CUDA from Simon Boehm's Matmul blog (https://siboehm.com/articles/22/CUDA-MMM).
There is still a lot of room for optimization: the model currently runs at about 45% of the speed of PyTorch with torch.compile.
I'd welcome any thoughts, especially CUDA tips for optimizing convolutions.