I see they have experimental AMD backends that don’t use ROCm. Is ROCm that bad ...

Lockal · on May 29, 2024

Tinygrad targets consumer hardware (to be precise, only the Radeon 7900 XTX and nothing else[1]), while ROCm does not actually provide good support for such hardware. For example, the latest release of the hipBLASLt library (6.1.1) has deep integration with PyTorch[2], yet works only on AMD Instinct hardware. And even for the professional hardware out there, the support period is ridiculous: the AMD Instinct MI100 (2020) is not supported. After only 4 years, tens of thousands of dollars' worth of hardware is going in the trash, yay!

And to be more precise, they still use some core libraries from the ROCm stack[3]; they just don't use all these fancy multi-gigabyte[4], hardware-limited rocBLAS/hipBLASLt/rocWMMA/rocRAND/etc. libraries.

[1] https://tinygrad.org/#tinybox

[2] https://github.com/pytorch/pytorch/issues/119081

[3] https://github.com/tinygrad/tinygrad/blob/v0.9.0/tinygrad/ru...

[4] https://repo.radeon.com/rocm/yum/6.1.1/main/

anthonix1 · on May 29, 2024

Even without hipBLASLt, PyTorch is still ~4x faster than tinygrad on a 7900 XTX for GPT2, and works fine. Any idea why?

Lockal · on May 29, 2024

Maybe due to the 2x slower matmul (a well-known bottleneck), as one of multiple factors? https://github.com/tinygrad/tinygrad/blob/v0.9.0/extra/gemm/...

There are multiple bounties just for it in https://docs.google.com/spreadsheets/d/1WKHbT-7KOgjEawq5h5Ic...

anthonix1 · on May 29, 2024

I think the matmul issue is symptomatic of a much deeper issue.

It would be nice to see less whining and blaming AMD (PyTorch and llm.c actually work on the 7900 XTX, and blow tinygrad out of the water in terms of perf!), and more just getting stuff to work.

wozeparrot · on May 29, 2024

The idea with the experimental backends is that they will talk directly with the kernel drivers. [3] is for the HSA backend, but is not needed for the AMD backend.

anthonix1 · on May 29, 2024

Seems to be an issue on their side. E.g., for a step of GPT2 training on a 7900 XTX [1]: tinygrad is ~440ms, PyTorch 2.4.0.dev20240513 is ~97ms, Karpathy's llm.c with ROCm is ~79ms, and llm.c with custom kernels is ~58ms

[1] https://github.com/anthonix/llm.c

[2] https://github.com/tinygrad/tinygrad/issues/4301
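
To put the step times quoted above on a common scale, here is a minimal Python sketch that divides each one by the fastest. The timings are the approximate ("~") figures from the comment; the helper name `slowdown_vs` and the dictionary layout are purely illustrative and not taken from any of the projects mentioned.

```python
# Approximate GPT2 training step times in milliseconds on a 7900 XTX,
# copied from the comment above (all of them were quoted with "~").
step_ms = {
    "tinygrad": 440,
    "PyTorch 2.4.0.dev20240513": 97,
    "llm.c with ROCm": 79,
    "llm.c with custom kernels": 58,
}


def slowdown_vs(baseline, timings):
    """Return each step time divided by the baseline's step time.

    Hypothetical helper for illustration only.
    """
    base = timings[baseline]
    return {name: ms / base for name, ms in timings.items()}


ratios = slowdown_vs("llm.c with custom kernels", step_ms)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the fastest step time")
```

With these inputs, tinygrad comes out at roughly 7.6x the fastest step time and roughly 4.5x the PyTorch step time, which lines up with the "~4x" figure mentioned earlier in the thread.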

xiphias2 · on May 29, 2024

That issue seems a month old, while the 58ms number looks 1 day old.

I have seen a lot of work go into improving performance over the last month (it's in the release announcement as well), but of course I still don't think it can compete with that number. Still, a new comparison would be cool.

anthonix1 · on May 29, 2024

Ran tinygrad again about a week ago, no change.

And still no comment on the issue; I will re-run if there is any comment.

xiphias2 · on May 29, 2024

Thanks for the answer

blihp · on May 28, 2024

The author of the library has done numerous YouTube rants on how bad he thinks the AMD compute drivers are.

roenxi · on May 29, 2024

Has he? In the ones I'm aware of, he was complaining about the low quality of the generic kernel drivers. AMD software had a tendency to crash when doing anything outside of standard video games (which was my experience too, but I've caved and bought Nvidia since then; the average driver quality of Nvidia on Linux seems to be much lower, but the kernel doesn't go down, which is nice. I got a lot of OOM errors in cases where, on ROCm, the kernel froze and required a full system restart).

But this is interesting and probably strong evidence that the CUDA API isn't the moat people thought it was. CUDA multiplies matrices, and that is close to a commodity operation. The moat actually seems to be Nvidia's higher generic software engineering standards, the difficulty of writing job scheduling/memory management infrastructure, and possibly the fact that closed firmware is the norm.

casperb · on May 29, 2024

Yes, he has. I have seen multiple episodes on his YouTube channel[1] where he absolutely grills the whole company. He also gave them a deadline to open-source the drivers, or he would stop trying to make AMD stuff work.

Sorry for no direct link, but he has so many and very long videos that it is hard to find the exact spot.

[1] https://www.youtube.com/@geohotarchive

alexbaden · on May 29, 2024

Arguably, the Nvidia AI moat is PyTorch and the heavily optimized libraries behind it. The CUDA language and toolchain helped get that effort off the ground, no doubt, but PyTorch is written and optimized for CUDA first. All other backends work best with semantics similar to CUDA's and have to match CUDA semantics to keep their users happy.

JMiao · on May 28, 2024

What does he get wrong?

KaoruAoiShiho · on May 29, 2024

He's not wrong.