As I write this (after the updates to the evaluation code), https://pub.sakana.ai/ai-cuda-engineer/kernel/2/23/optimize-... is at the top of their list of speedups, with a claimed 128x speedup on a fused 3D convolution + groupnorm + mean.

The generated implementation doesn’t do a convolution.

The 2nd kernel on the leaderboard also appears to be incorrect: it has a bunch of dead code that computes a convolution, never uses the result, and then writes tanhf(1.0f) * scaling_factor to every output.
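To make the failure mode concrete, here is a rough sketch in plain Python (not their evaluation code) of a sanity check that would flag this pattern: run the kernel on several different random inputs and see whether the output ever changes. All names here (`suspicious_kernel`, `honest_kernel`, `output_ignores_input`) are hypothetical stand-ins I made up for illustration, and the "kernels" are toy list functions rather than real CUDA.

```python
import math
import random


def suspicious_kernel(x, scaling_factor=2.0):
    """Hypothetical stand-in for the pattern described above."""
    _unused = sum(v * 0.5 for v in x)  # "dead code": computed, then discarded
    # Every output is the same constant, regardless of the input values.
    return [math.tanh(1.0) * scaling_factor for _ in x]


def honest_kernel(x, scaling_factor=2.0):
    """Hypothetical reference whose output actually depends on the input."""
    return [math.tanh(v) * scaling_factor for v in x]


def output_ignores_input(kernel, n=64, trials=4, seed=0, tol=1e-6):
    """Return True if the kernel gives the same output for different random inputs."""
    rng = random.Random(seed)
    baseline = kernel([rng.uniform(-1.0, 1.0) for _ in range(n)])
    for _ in range(trials):
        out = kernel([rng.uniform(-1.0, 1.0) for _ in range(n)])
        if any(abs(a - b) > tol for a, b in zip(baseline, out)):
            return False  # output changed with the input, as expected
    return True  # output never changed: a red flag


if __name__ == "__main__":
    print("suspicious ignores input:", output_ignores_input(suspicious_kernel))
    print("honest ignores input:", output_ignores_input(honest_kernel))
```

This only catches kernels whose output is independent of the input; it says nothing about whether an input-dependent kernel is actually correct, so it would complement a comparison against a reference implementation rather than replace it.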