> Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)
> On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0).
This is exactly the difference I've seen among ML engineers in my work: most can only work in Python (more or less as framework users), while a few exceptional ones can go all the way down to CUDA to squeeze out the performance.
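To give a flavor of what "going down to CUDA" means here, below is a minimal sketch of a hand-written kernel in the spirit of what llm.c does for every layer. This is my own toy code, not the actual llm.c source; all names are mine. It implements the GELU forward pass (tanh approximation), one thread per element:

    // Illustrative sketch only, not the actual llm.c kernel:
    // GELU forward (tanh approximation), one thread per element.
    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void gelu_forward(float* out, const float* inp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = inp[i];
            float cube = 0.044715f * x * x * x;
            // 0.7978845608f ~= sqrt(2/pi)
            out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
        }
    }

    int main() {
        int n = 1 << 20;
        size_t bytes = (size_t)n * sizeof(float);
        float* h = (float*)malloc(bytes);
        for (int i = 0; i < n; i++) h[i] = (float)i / n - 0.5f;
        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
        int block = 256;
        gelu_forward<<<(n + block - 1) / block, block>>>(d_out, d_in, n);
        cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
        printf("gelu(%f) = %f\n", -0.5f, h[0]);  // sanity check
        cudaFree(d_in); cudaFree(d_out); free(h);
        return 0;
    }

The framework user writes `torch.nn.GELU()` and stops there; the kind of engineer Karpathy's project showcases writes (and then fuses, tiles, and profiles) kernels like this by hand, which is how a 2,000-line C/CUDA file ends up matching PyTorch.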
https://twitter.com/karpathy/status/1781387674978533427