Full forward pass of GPT-2 in one file of pure CUDA (github.com/karpathy)
63 points by tosh 7 months ago | 4 comments



An update from Karpathy:

> Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)

> On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32, with no flash attention yet, and slightly stale PyTorch (2.1.0).

https://twitter.com/karpathy/status/1781387674978533427
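For a flavor of what "one file of pure CUDA" entails, here is a minimal sketch of a LayerNorm forward kernel in the style such code tends to take. This is an illustration written for this comment, not code taken from llm.c; the kernel name, parallelization (one thread per token), and epsilon value are assumptions.

    #include <cuda_runtime.h>

    // Illustrative sketch: each thread normalizes one token's C-dimensional
    // activation vector, then applies the learned scale (weight) and shift (bias).
    __global__ void layernorm_forward(float* out, const float* inp,
                                      const float* weight, const float* bias,
                                      int N, int C) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one token per thread
        if (idx >= N) return;
        const float* x = inp + idx * C;

        // mean over the channel dimension
        float mean = 0.0f;
        for (int i = 0; i < C; i++) mean += x[i];
        mean /= C;

        // variance over the channel dimension
        float var = 0.0f;
        for (int i = 0; i < C; i++) {
            float d = x[i] - mean;
            var += d * d;
        }
        var /= C;
        float rstd = rsqrtf(var + 1e-5f);  // epsilon chosen for illustration

        // normalize, scale, shift
        float* o = out + idx * C;
        for (int i = 0; i < C; i++) {
            o[i] = (x[i] - mean) * rstd * weight[i] + bias[i];
        }
    }

A real implementation would typically use warp- or block-level reductions and coalesced memory access rather than a per-thread loop, which is exactly the kind of tuning that closes the gap with PyTorch.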


I feel guilty hoping such a great researcher keeps making educational content, but damn it, Andrej puts out incredible material.


I would love for someone to run HIPIFY[0] on this and prove that it works (or doesn't). I'd do it myself, but I am unable to at this time.

[0] https://github.com/ROCm-Developer-Tools/HIPIFY
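For context, HIPIFY is largely a mechanical source-to-source translation of CUDA runtime calls into their HIP equivalents. A sketch of the kind of code it rewrites is below; the renamings noted in the comments (cudaMalloc -> hipMalloc, etc.) are standard HIPIFY mappings, but whether llm.c converts cleanly end to end is exactly the open question.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative only: these are the kinds of CUDA runtime calls HIPIFY
    // rewrites mechanically; triple-chevron kernel launches are kept as-is.
    int main() {
        float* d_buf = nullptr;
        cudaMalloc(&d_buf, 1024 * sizeof(float));   // becomes hipMalloc
        cudaMemset(d_buf, 0, 1024 * sizeof(float)); // becomes hipMemset
        cudaDeviceSynchronize();                    // becomes hipDeviceSynchronize
        cudaFree(d_buf);                            // becomes hipFree
        printf("ok\n");
        return 0;
    }

The harder part is usually library dependencies (cuBLAS, cuDNN) and any inline PTX or warp-level intrinsics, which don't always have one-to-one ROCm equivalents.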


This is exactly the difference I've seen among ML engineers at work: most can only work in Python (more or less as framework users), while some amazing ones can go all the way down to CUDA to squeeze out the performance.



