There are two points I personally wanted to make through this project:
1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient enough to use in LLM serving;
2) ML compilation (MLC) techniques, through their underlying TVM Unity software stack, are the best fit for performance optimizations that generalize across hardware and deliver time-to-market value quickly.
So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.
Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space, largely because of the amazing job Nvidia pulled off with cuDNN and its convolution kernels.
LLMs, OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.
> Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are?
Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.
Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.
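As a minimal illustration of that retargeting point (plain TVM target handling, not MLC LLM's internal pipeline), the same compilation flow can be pointed at different backends simply by swapping the target name:

```python
import tvm

# TVM resolves each target name to a different backend code generator,
# so one set of optimizations can be reused across vendors.
for name in ["cuda", "rocm", "vulkan", "metal"]:
    target = tvm.target.Target(name)
    print(name, "->", target.kind.name)
```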
Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.
Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)
This is amazing to hear, Steven! (Sorry I locked myself out of Discord a couple of days ago...) I'm sure there are a bunch of features missing, like the biased sampling you mentioned, and we're more than happy to merge PRs if you'd like to contribute :)
Thank you for this work. I will be staying on nvidia for now, but applaud any progress towards much needed credible competition in the consumer/enthusiast AI hardware space.
One question: given your experience, when would you predict near parity in software stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It doesn't need to be like AMD/Intel in the CPU market, where a consumer has no doubts about software compatibility, but let's say like the gaming GPU market, where a game having problems on one GPU architecture is a newsworthy exception that gets quickly corrected.
ROCm has improved a lot over the past few months, and ROCm 5.6 now seems to work out of the box by just following this tutorial: https://rocm.docs.amd.com/en/latest/deploy/linux/installer/i.... TVM Unity, the underlying compiler MLC LLM uses, also seems to work out of the box on ROCm 5.6 - per Bohan Hou, who set up the environment.
Depends on what support means to you, really. The docs use "support" to mean things AMD has tested and expects to work, modulo errata.
If you're building the stack from source or found it in a Linux repo, there are decent odds it'll work for you. It's more likely to work on gfx9 or gfx10 than on older cards; I think that's roughly the last five years.
If you use the official distribution, some parts are compiled to GPU-specific machine code, and if your GPU isn't one of those, you can't use that library. I think there's a reluctance to compile the libs for GPUs that aren't in the internal CI, in case they don't work.
As an anecdote, I do most development on unsupported hardware, unsupported distro and unsupported kernel, with the upstream driver, using whatever was on llvm main that morning. That mostly works despite positioning myself as most likely to run into bugs.
Are there any docker images containing this? I'd like to avoid getting into dependency hell with other software on my system, as happens all too often with new technologies.
Generally speaking, I expect Vulkan to be slower than ROCm, given that it's designed as a generic gaming API across GPU vendors. So the takeaway is: whenever ROCm is available and usable, use ROCm. The same goes for CUDA vs Vulkan.
This is coming! Others at OctoML and in the TVM community and I are actively working on multi-GPU support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:
The first target will be LLMs on multiple NVIDIA GPUs, but as with all of the MLC LLM effort, the approach will generalize to other hardware, including AMD's wonderful hardware.
True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.
Regarding quantization, we want to develop a code path that absorbs any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to the quantization format; we just haven't exposed such abstractions yet.
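To sketch what such an abstraction might look like (a purely hypothetical toy example, not MLC LLM's actual API): each format plugs in its own dequantize routine, and the rest of the pipeline stays format-agnostic.

```python
from typing import Callable, Dict
import numpy as np

# Hypothetical sketch only: a registry where each quantization format registers
# its own dequantize routine, keeping the rest of the pipeline format-agnostic.
Dequantizer = Callable[[np.ndarray, np.ndarray], np.ndarray]
DEQUANTIZERS: Dict[str, Dequantizer] = {}

def register_format(name: str):
    def wrap(fn: Dequantizer) -> Dequantizer:
        DEQUANTIZERS[name] = fn
        return fn
    return wrap

@register_format("q4_symmetric")  # illustrative name, not a real GGML/GPTQ identifier
def dequant_q4(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Two 4-bit values per byte, zero-point at 8, one scale per row.
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    vals = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    return vals.astype(np.float32) * scale[:, None]

def dequantize(fmt: str, packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return DEQUANTIZERS[fmt](packed, scale)
```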
On CPU offloading, imagine you are writing PyTorch: it should be as simple as a one-liner, `some_tensor.cpu()`, to bring something down to host memory, and `some_tensor.cuda()` to get it back to CUDA. It seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.
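For reference, the PyTorch idiom mentioned above looks like this (the same `cuda` device name is also used by ROCm builds of PyTorch):

```python
import torch

x = torch.randn(1024, 1024, device="cuda")  # tensor lives in GPU memory
x = x.cpu()    # offload to host memory, freeing VRAM
# ... run something else on the GPU ...
x = x.cuda()   # bring it back when needed
```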
Thanks! Just curious: why is there no "team" or "about us" page? Sharing credit is nice, but it's also a little unsettling when blog posts don't name contributors.
Good work though. And you have an active community on GitHub, congratulations.
It really depends on how good ROCm support for WSL2 is. Our team doesn't have a Windows machine, so we couldn't verify it ourselves, but if you get ROCm set up properly on WSL2, MLC LLM should work out of the box.
Oh great. The AMD RX 580 was released in April 2017. AMD had already dropped ROCm support for it by 2021. They only supported the card for about four years. Four years. It's so lame it's bordering on fraudulent, even if not legally fraud. Keep this in mind when reading this news. The support won't last long, especially if you don't buy at launch. Then you'll be stuck in the dependency hell of trying to use old drivers/stack.
Speaking from having run tens of thousands of both 580s and 480s: they weren't just rebranded. Maybe on paper they seemed similar, but they didn't run similarly.
That just makes the case for supporting it even stronger, because then they'd be supporting multiple years' worth of hardware with just the effort for one uarch.
Nothing like having a lottery of 2-6 years of support for your hardware to make your customers confident they are getting value out of the products they are buying.
The manufacturer can smugly proclaim they offered six years of support for a product that was on the shelf four years into the driver's lifecycle.
TBH I'm not sure what AMD's plan is for ROCm support on consumer devices, but I don't really think AMD is being fraudulent or anything.
Both ROCm and Vulkan are supported in MLC LLM, as mentioned in our blog post. We are aware that ROCm doesn't cover all consumer hardware, and in that case Vulkan is a nice backup!
If you click the "Radeon" tab here[1], dated 27 Jul, AMD claim ROCm support on a wide range of consumer cards, with HIP SDK support on RX 6800 and up, under Windows. The Linux situation seems less clear.
We haven't done any comparison between them yet, but we generally believe Vulkan, as a more generic cross-vendor API, should be slower than ROCm. The same goes for CUDA vs Vulkan.
There is also Vulkan support, which should be more universal (also covered in the post); for example, the post shows running an LLM on a Steam Deck APU.
I am currently using my RX 580 8GB for running large language models on my home computer, using llama.cpp's OpenCL (CLBlast) offloading of layers. I can fit up to 13-billion-parameter Llamas (1 or 2) if they're quantized at 4 bits. It's not super fast, but at least my AI IRC bots aren't eating into my CPU time anymore.
But my attempts to get direct ROCm support were thwarted by AMD.
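For anyone curious, the layer-offloading setup above looks roughly like this through the llama-cpp-python bindings (model filename and layer count are illustrative; you need a CLBlast-enabled build for the GPU path):

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers get offloaded to the GPU;
# whatever doesn't fit stays on the CPU.
llm = Llama(model_path="llama-2-13b.Q4_0.gguf", n_gpu_layers=32, n_ctx=2048)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```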
You can say the same thing about a 24GB consumer card. Going from being able to run 13B llamas to 33B doesn't really help you in a commercial sense. This holds true, generally, for other LLM foundation models as well. To do commercial work you're going to need more RAM than consumer cards have. You need at least two cards if you're going to run the 70B, and even then the 70B (and similar) aren't useful commercially. Except in the sense of gathering money from investors who don't know better.
Is 70B not commercially useful because of the model storage requirements, or total inference performance, or additional memory per inferencing session, or what?
Is the output better such that it's desirable, or is this just a case of "too much performance hit for a marginal gain"?
I was wrong. Sorry. Food trucks do accept cash most places.
Now it's your turn, Mr. "You're not going to find RX 580s with enough VRAM for AI. Typically 4-8GB." This is completely false. Rather than acknowledging that, you then tried to move the goalposts (much like I did in that past thread, saying, "Oh, but maybe it's just my region where they don't."). It looks like we both behave a bit silly when trying to save face when we're wrong.
The parent article is entirely about running and benchmarking 4-bit quantized Llama2-7B/13B. This is the "super limited stuff as a hobbyist that barely works", and I've run them at entirely usable speeds on the AMD RX 580. You're either wrong, or you didn't actually read the article and have been arguing correctly (from your ignorant perspective) about something random.
Ignorance is not an insult. It just became obvious that you were talking about a different concept (commercial use with big models) than what the article itself and everyone else were talking about (7B/13B models). So I generously assumed you just hadn't read it (ignorance). I guess now that you've ignored that and doubled down, I can assume you were/are just arguing in bad faith.
The lack of a place where you can rent top-of-the-line AMD hardware by the hour is the biggest weakness. Nobody is going to buy and run an MI210/MI250 at home.
Having a community of people using your product has zero commercial value?
Do you even know how brand recognition works?
The number of people swearing off AMD because of bad drivers ten years ago easily cost them a billion dollars. More than the cost of developing a good driver.
Agreed, AMD needs to get their high end cards into more schools. Short of that, they need a place where people can rent them by the hour (and give discounts to schools).
I've been waiting for someone to tell me what I can profitably do with the 120,000+ 470/480/580's that I have sitting around doing nothing. It sounds like you have an idea?
Crack passwords. Also, there was this craze a few years ago where people were using their GPUs to crunch numbers and earn money; I forget what it was called... something with mines.
If the best you can do is 'password cracking' as your answer, you're obviously not very well versed in things. Plus, you don't need ROCm to crack passwords.
I mean, I'm using ROCm for VFX rendering. But regardless, I'm not sure that cards as old as your 470s can really be competitive enough in power usage to be very profitable.
You could sell them or give them away to hobbyists, but that could eat into the lucrative “shoehorning old crypto mining into unrelated conversations” business
I did not know that you meant "hobby profitable" (more brain cells, bigger ideas), not "business profitable" (money), when you asked people how to use your old mining hardware profitably.
You're not going to find customers for high-end cards when their entry-level experience is this poor. I ran Stable Diffusion on my CPU instead, even if it took ten minutes per picture.
It's not the 4090 that the 7900 XTX should be compared to; it's the 3090 Ti. They can be had for just about the same price when you consider second-hand cards.
It has been a while since I saw anyone send a UserBenchmark link. UserBenchmark is not a site with a good reputation; you can find many reasons online why this is, but I guess the site moderators just ignore it by pretending it's "Advanced Marketing". Meanwhile, even AMD's CPU competitor has banned the site from their subreddit. One of my favorite explanations is https://youtu.be/RQSBj2LKkWg, but there are of course more recent ones.
Really? You are referring to UserBenchmark's assessment of AMD?
Look at literally any of their assessments of AMD products and they'll be dragging them through the mud while pointing to worse-performing Intel products to flex how much better they are because they aren't AMD.
Somewhat as a general question, not just for AMD/Nvidia: At what point does RAM stop being the bottleneck? By which I mean, given current chips, how much RAM could you theoretically bolt on to these cards before the limiting factor in performance is the GPU instead of RAM? And does that change when the task is training vs. deployment/prompting?
What do you mean? Are you talking about capacity? Or bandwidth from RAM?
I'm in the HPC space, and pretty much everything I do on the GPU is bound by how quickly I can get data in and out of DRAM.
The point at which data motion to/from DRAM is not the bottleneck is when you do enough work per byte of data moved. How much work is that? On today's server GPUs it's in the region of 50-100 double-precision floating point operations per byte. You can work out an exact number by taking the theoretical maximum floating point operations per unit time you can execute and dividing by DRAM throughput (data moved per unit time).
O(50-100) double-precision flops per byte is a _lot_ of work. We're talking BLAS-3 type operations. Anything level 2 or lower, or sparse operations, is typically bandwidth-bound.
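A back-of-the-envelope version of that calculation (the peak-FLOP and bandwidth numbers below are illustrative placeholders, not any specific card's specs):

```python
# Roofline-style crossover point: FLOPs per byte of DRAM traffic needed
# before compute, rather than memory bandwidth, becomes the limit.
peak_fp64_flops = 80e12   # illustrative peak FP64 throughput, FLOP/s
dram_bandwidth  = 1.6e12  # illustrative DRAM bandwidth, bytes/s

crossover = peak_fp64_flops / dram_bandwidth
print(f"~{crossover:.0f} FP64 FLOPs per byte moved to stop being bandwidth-bound")
# Anything doing less work per byte (BLAS-1/2, sparse ops) stays memory-bound.
```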
The problem with a lot of machine learning algorithms is that you do hundreds or even thousands of operations per value, but you do it in specialized orderings on incredibly large numbers of values, forcing you to swap partitions in and out of RAM. So, while you may do thousands of ops per value, you may only be able to do tens of ops per value per write to RAM.
The more RAM you have on device, the fewer swaps you need to do (none at all if it's big enough), and the less those operations get amortized, bringing you closer to theoretical max throughput.
Matrix multiples are such that going from fitting 75% of your values to fitting 100% of your values can mean an order of magnitude speedup.
Disclaimer: I have no idea how machine learning algorithms work.
I work with problems so huge they do not fit on a single device. Multiple devices each own a small piece of the global problem. They solve a local problem and they must communicate over a network in order to solve the global problem. This is almost universally true.
You would know more than I in this field, and I expect it really is better to swap a partition than it is to use a network.
There are certain methods in HPC applications that are almost universally avoided because of how terribly they scale to a large distributed memory system. Matrix multiplies are one of them. Outside of a handful of ab initio computational chemistry algorithms (which are incredibly important), basically the only reason someone does a large dense matrix multiply on a supercomputer is that they're running a benchmark, not solving a real science problem.
Folks more knowledgeable than me here feel free to jump in.
You guys are talking past each other but really talking about the same thing - arithmetic intensity. You're talking about FEA or some other grid solver/discretized PDE/DFT type thing, where the matmuls are small because the mesh is highly refined and you've assumed the potentials/fields/effects are hyper-local. But that's not an accident or dumb luck - the problems in scientific HPC are modeled using these kinds of potentials post hoc, i.e., so that they can be distributed across so many cores.
What I'm saying is, it's not like a global solver (i.e., taking into account all-to-all interactions) wouldn't be more accurate, right? It's just an insane proposition because, surprise surprise, that would require an enormous matmul during the update, which you can't do efficiently, even on a GPU, for the same reason the ML folks can't: the arithmetic intensity isn't high enough, so you incur I/O costs (memory or network, same thing at this scale).
> There are certain methods in HPC applications that are almost universally avoided because of how terribly they scale to a large distributed memory system. Matrix multiplies are one of them.
Neural networks, which are the basis for nearly all modern AI, are implemented as a mixture of sparse and dense matrix multiplies, depending on the neural architecture.
Thanks - that's exactly the sort of framework for thinking about this that I was looking for. I think in the realm of LLMs there are other components, but this is a piece of it.
The ideal RAM capacity is determined by the biggest available model you want to run. So... ~48GB for Llama 70B? Maybe more with batching and very long contexts.
RAM bandwidth is basically always going to be a bottleneck. Microsoft's proposal to get around this is to just keep everything in SRAM and pipe chips together.
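A rough way to sanity-check that capacity figure (weights only; KV cache and activations come on top, so treat these as lower bounds):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    # Bytes needed just for the weights, ignoring KV cache and activations.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{weight_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB (plus cache/overhead, so ~48 GB is plausible)
```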
Didn't he just find/report a bug in the multi-GPU case? Then have a meltdown after an AMD engineer sent him a fix a couple of days later? I think you are overstating his contributions; in fact, it's not clear to me he made any.
He shows the git patch fix delivered by AMD engineers. The root cause is not explained, and the change itself is about 2-3 SLoC. His frustration, what you call a meltdown, is not about the interaction with AMD but rather about how troublesome it was to get AMD's demo scripts to run on AMD's chips.
Feeling frustration about something not working is a normal part of software development. Reporting a bug, or choosing not to use that software, is normal too. Making a public video trashing the work of hundreds, using the language he did, is a meltdown.
I've worked on public software before, and it's left me with an extremely low opinion of this kind of behaviour, and the kind of people that show it.
Do you work at AMD and therefore have some kind of insight that we don't? Based on my own experience it's much more likely the call was just a marketing exercise, completely orthogonal to actual development.
Gfx drivers are complicated, long-term projects. They're not something that gets fixed with a phone call.
He found a bug, and made a bunch of noise about it. No need to make it more than it was.
Off topic, but this feels like a good place to ask: can WebGPU give you decent performance on non-CUDA hardware and help accomplish these kinds of aims? (Geohot, I think, is aiming to avoid a single chipmaker's monopoly on AI, which he sees as a bad thing /paraphrase)
Memory bandwidth should be the only bottleneck for unbatched decoder-only autoregressive inference. The listed numbers imply you're only saturating about 500 GB/s, ignoring KV cache traffic. That implies there's still quite a bit of inference performance left to squeeze out of the 960 GB/s spec.
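The arithmetic behind that ~500 GB/s estimate looks roughly like this (the model size and token rate below are illustrative, chosen only to match the rough magnitude): in single-batch decoding, each generated token has to stream essentially all of the weights from DRAM once.

```python
# Effective DRAM bandwidth implied by a measured decode speed, ignoring KV cache.
weight_bytes   = 3.6e9  # e.g. a 7B model at roughly 4 bits/weight (illustrative)
tokens_per_sec = 140    # measured single-batch decode rate (illustrative)

effective_bw = weight_bytes * tokens_per_sec
print(f"~{effective_bw / 1e9:.0f} GB/s of the 960 GB/s spec is actually being used")
```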
Nice writeup. A minor nitpick though: why is the price of the 3090 Ti stated as almost 2k USD? I guess the card might be hard to get new and this might indeed be the price somewhere, but since it's a previous generation, in my opinion it would make sense to use the price of a second-hand one (probably under 1k USD, depending on place/luck).
BTW, that's exactly the dilemma I'm pondering now when thinking about a new PC build to play with some machine learning / LLMs. In my personal situation, the 7900 XTX and 3090 Ti are about the same price/TDP/profile. Almost the only difference then is CUDA vs ROCm, and the fact that one is new and might therefore come with a basic warranty.
It wasn't clear to me from the blog post: if I have an AMD GPU and an AMD CPU and everything with unified memory, can I use system memory for inferencing? That's a difference of "just" 16GB, or a combined 72GB available to me, and would be a big deal.
Just out of curiosity: how would you go about linking the 3 cards? Does AMD have anything like NVLink? (Honest question, I really don't know, and I suspect I might have missed a bit somewhere in the idea.)
Through PCIe, using GPU P2P with a supported AMD chipset (EPYC processor) so they can talk to each other, or by going through your CPU and RAM. I think the latter is what people using NVIDIA RTX 4090s do, because P2P is disabled and NVLink is unsupported.
I really wonder what kind of Kool-Aid they serve at AMD. This feels very similar to AMD shilling "multi-core" CPUs back in the mid-2000s, when anyone seriously using multi-core CPUs didn't care about cost and was just buying Intel, while AMD continued to fumble with mediocre products.