Much faster than stable PyTorch 2.3 as well (46% on A100, as per the tweet), and faster still compared to PyTorch 2.2, which was the stable version a couple of weeks ago. llm.c also pulls further ahead when the comparison is run on an H100 instead of an A100, or on multiple GPUs instead of a single one.
Yeah, I'm sure that's what anyone trying to build some kind of AI startup who has managed to acquire a small handful of A100s, or better yet H100s, thinks too: "Those cards sure were expensive, but ethically, I'd rather the software run slower, giving me imaginary future options, than get the most out of the hardware I just bought."
Tinfoil hat time. The recent gpt2-chatbot that everyone thought was a new OpenAI product: could it be related?
“You start with the gpt2.c pure CPU implementation, and see how fast you can make it by the end of the course on GPU, with kernels only and no dependencies.”
Remarkably similar nomenclature. I'd give it a 1% chance this is related. I did play with that chatbot, and it was smarter than GPT-4, whatever it was.