I see they have experimental AMD backends that don’t use ROCm. Is ROCm so bad that they wrote their own, or was there some other justification for this?
Tinygrad targets consumer hardware (to be precise, only the Radeon 7900 XTX and nothing else[1]), while ROCm does not actually provide good support for such hardware. For example, the latest release of the hipBLASLt library (6.1.1) has deep integration with PyTorch[2], yet works only on AMD Instinct hardware. And even for professional hardware, the support period is ridiculous: the AMD Instinct MI100 (2020) is no longer supported. After only 4 years, tens of thousands of dollars' worth of hardware is headed for the trash, yay!
And to be more precise, they still use some core libraries from the ROCm stack[3]; they just don't use all those fancy, multi-gigabyte[4], hardware-limited rocBLAS/hipBLASLt/rocWMMA/rocRAND/etc. libraries.
I think the matmul issue is symptomatic of a much deeper issue.
It would be nice to see less whining and blaming of AMD (PyTorch and llm.c actually work on the 7900 XTX, and blow tinygrad out of the water in terms of perf!), and more just getting stuff to work.
The idea with the experimental backends is that they talk directly to the kernel drivers. [3] is for the HSA backend and is not needed for the AMD backend.
Seems to be an issue on their side. E.g., for a step of GPT2 training on a 7900 XTX [1]: tinygrad is ~440ms, PyTorch 2.4.0.dev20240513 is ~97ms, Karpathy's llm.c with ROCm is ~79ms, and llm.c with custom kernels is ~58ms.
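To put those numbers side by side, here is a quick sketch in Python that turns the quoted step times into relative slowdowns. The dictionary labels and the helper name `slowdown_vs` are my own, not anything from tinygrad or llm.c, and the inputs are just the approximate figures quoted above, not new measurements.

```python
# Approximate GPT2 training step times in milliseconds on a 7900 XTX,
# copied from the figures quoted in this comment (not re-measured).
step_ms = {
    "tinygrad": 440.0,
    "PyTorch 2.4.0.dev20240513": 97.0,
    "llm.c with ROCm": 79.0,
    "llm.c with custom kernels": 58.0,
}


def slowdown_vs(baseline, times):
    """Return each entry's step time divided by the baseline's step time.

    `slowdown_vs` is a hypothetical helper written for this illustration.
    A value of 2.0 means "takes twice as long per step as the baseline".
    """
    base = times[baseline]
    return {name: ms / base for name, ms in times.items()}


# Compare everything against the fastest quoted configuration.
ratios = slowdown_vs("llm.c with custom kernels", step_ms)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the step time of llm.c with custom kernels")
```

On these figures tinygrad takes roughly 7.6x as long per step as llm.c with custom kernels, and roughly 4.5x as long as PyTorch.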
That issue seems a month old, while the 58ms number looks 1 day old.
I have seen a lot of work go into improving performance over the last month (it's in the release announcement as well), but of course I still don't think it can compete with that number. Still, a new comparison would be cool.
Has he? In the ones I'm aware of, he was complaining about the low quality of the generic kernel drivers. AMD software had a tendency to crash when doing anything outside of standard video games (which was my experience too, but I've caved and bought Nvidia since then; the average driver quality of Nvidia on Linux seems to be much lower, but the kernel doesn't go down, which is nice. I get a lot of OOM errors in situations where, on ROCm, the kernel froze and required a full system restart).
But this is interesting and probably strong evidence that the CUDA API isn't the moat people thought it was. CUDA multiplies matrices, and that is close to a commodity operation. The moat actually seems to be Nvidia's higher generic software engineering standards, the difficulty of writing job scheduling/memory management infrastructure, and possibly the fact that closed firmware is the norm.
Yes, he has. I have seen multiple episodes on his YouTube[1] where he absolutely grills the whole company. He also gave them a deadline to open-source the drivers, or he would stop trying to make AMD stuff work.
Sorry for no direct link, but he has so many videos, and they are so long, that it is hard to find the exact spot.
Arguably the Nvidia AI moat is PyTorch and the heavily optimized libraries behind it. The CUDA language and toolchain helped get that effort off the ground, no doubt, but PyTorch is written and optimized for CUDA first. All other backends work best with semantics similar to CUDA's and have to match CUDA semantics to keep their users happy.