Making AMD GPUs competitive for LLM inference (mlc.ai)
354 points by djoldman on Aug 9, 2023 | 132 comments



One of the authors here. Glad it’s on HackerNews!

There are two points I personally wanted to make through this project:

1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient enough to use in LLM serving; 2) ML compilation (MLC) techniques, through their underlying TVM Unity software stack, are the best fit for performance optimizations that generalize across hardware and for quickly delivering time-to-market value.

So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.


The numbers look amazing.

Can you comment on how difficult it was to achieve this, and what the relative advantages b/w cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space largely because of the amazing job Nvidia pulled off with cuDNN and its conv kernels.

LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.


> Can you comment on how difficult it was to achieve this, and what the relative advantages b/w cards?

Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.

Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.


Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.

Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)


This is amazing to hear Steven! (Sorry I locked myself out of discord a couple of days ago...) I'm sure there's a bunch of features missing, like the biased sampling you mentioned, and I'm more than happy to merge PRs if you'd like to contribute :)


Thank you for this work. I will be staying on nvidia for now, but applaud any progress towards much needed credible competition in the consumer/enthusiast AI hardware space.

One question: given your experience, when would you predict near parity in software stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It does not need to be like AMD/Intel in the CPU market, where a consumer has no doubts about software compatibility, but let's say like the gaming GPU market, where a game having problems on one GPU architecture is a newsworthy exception that is quickly corrected.


Honestly at a loss why this got downvoted.


Did the ROCm 5.6 toolchain work for you out of the box? If not, what sort of hacking / hand holding did it need?

I don't know whether there's an LLM inference benchmark in the CI suite; if not, perhaps something like this should be included in it.


ROCm has improved a lot over the past few months, and now ROCm 5.6 seems to work out of the box by just following this tutorial: https://rocm.docs.amd.com/en/latest/deploy/linux/installer/i.... TVM Unity, the underlying compiler MLC LLM uses, seems to work out of the box on ROCm 5.6 too - from Bohan Hou, who set up the environment.


Awesome. I'm going to paste that into the rocm dev channel. Actual positive feedback on HN, novel and delightful. Thank you for the blog post too!


https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and https://community.amd.com/t5/rocm/new-rocm-5-6-release-bring... suggest that Linux support is really limited at this point. Is this information inaccurate?


Depends on what "support" means to you, really. The docs use "support" to mean things AMD tested and expects to work, modulo errata.

If you're building the stack from source or found it in a Linux repo, decent odds it'll work for you. More likely to work on gfx9 or gfx10 than the older cards. I think that's roughly the last five years.

If you use the official distribution, some parts are compiled to gpu-specific machine code and if your gpu isn't one of those, you can't use that library. I think there's a reluctance to compile the libs for GPUs that aren't in the internal CI in case they don't work.

As an anecdote, I do most development on unsupported hardware, unsupported distro and unsupported kernel, with the upstream driver, using whatever was on llvm main that morning. That mostly works despite positioning myself as most likely to run into bugs.


I'm still on rocm 5.4, been working great on my 6750XT for the past few months (Arch).


Are there any docker images containing this? I'd like to avoid getting into dependency hell with other software on my system, as happens all too often with new technologies.


There are, thankfully, quite a few; I've mostly used rocm/rocm-terminal and rocm/rocm-dev.

https://hub.docker.com/u/rocm


Yes, it works out of the box, and the blog contains a prebuilt Python package that you can try out.


Have you tested Vulkan API on the 7900 XTX? Was it faster or slower than ROCm?


Generally speaking, I expect Vulkan to be slower than ROCm, given it's designed for generic gaming across GPU vendors. So the takeaway is: whenever ROCm is available and usable, we should use ROCm. And it's the same for CUDA vs Vulkan.


What slows it down? Shouldn't Vulkan expose compute queues of the GPUs as well?


I don't have any expectations, but there're reasons for Vulkan to be faster.

It's a mature technology used by millions of people every day.

Unlike GPGPU compute, for videogames performance directly affects usability.

For these reasons, the software on all levels of the stack might be more optimized.


Can I use two at the same time? Two 7900 XTXs would be the price of one 4090 but with much higher performance (260 tok/sec).


This is coming! Myself and others at OctoML and in the TVM community are actively working on multi-gpu support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:

- Support in TVM’s graph IR (Relax): https://github.com/apache/tvm/pull/15447

- Support in TVM’s loop IR (TensorIR): https://github.com/apache/tvm/pull/14862

- Distributed dialect of TVM’s graph IR for multi-node (GSPMD-type): https://github.com/apache/tvm/pull/15289

The first target will be LLMs on multiple NVIDIA GPUs, but as with all of the MLC-LLM effort, the approach will generalize to other hardware, including AMD's wonderful hardware.


This is exciting, but it is still very apparent that more time is needed.


<3


When you say best performance on nvidia, do you mean against any other method of running this model on an nvidia card?


I can confirm this, mlc is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an IGP) like Llama.cpp has yet (unless it's new and I missed it).


True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.

Regarding quantization, we wanted to develop a code path that absorbs any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to quantization format, but we just haven't exposed such abstractions yet.

On CPU offloading: imagine you are writing PyTorch - it should be as simple as a one-liner, `some_tensor.cpu()` to bring something down to host memory and `some_tensor.cuda()` to get it back to CUDA. Seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.
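A minimal sketch of the PyTorch idiom being referenced (plain PyTorch, not MLC LLM; the tensor and its shape are purely illustrative):

    import torch

    # A weight tensor resident on the GPU (assumes a CUDA or ROCm build of PyTorch).
    weights = torch.randn(4096, 4096, device="cuda")

    # Offload to host memory while it is not needed...
    weights = weights.cpu()

    # ...and move it back to the GPU right before use.
    weights = weights.cuda()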


Yeah, we tried out popular solutions that support inference of 4-bit quantized models, like exllama and llama.cpp, among others.


Thanks! Just curious why there is no "team" or "about us" page? It's nice sharing credit, but it also is a little unsettling when blog posts do not name contributors.

Good work though. And you have an active community on GitHub, congratulations.


Well, I'm very much into true open source, and my belief is that any contributor is automatically part of the team :)


I know plenty of open-source projects that list and thank every individual contributor. The website could do that too!


That's a great idea! We should dig around and see if there's any plugin to use


Is this similar to the MosaicML AMD MI250 vs Nvidia A100 results, but with consumer-grade hardware? https://www.mosaicml.com/blog/amd-mi250

might be interesting to team up


Does it work with WSL2?


Really depends on how good ROCm support for WSL2 is. Our team doesn't have a Windows machine, so we couldn't verify ourselves, but if you get ROCm set up properly on WSL2, MLC LLM should work out of the box.


You can also try out the Vulkan backend, which we know should work on Windows, although speed might be slower than ROCm.


FWIW I did get the CUDA backend running via WSL2


> AMD GPUs using ROCm

Oh great. The AMD RX 580 was released in April 2018. AMD had already dropped ROCm support for it by 2021. They only supported the card for 3 years. 3 years. It's so lame it's bordering on fraudulent, even if not legally fraud. Keep this in mind when reading this news. The support won't last long, especially if you don't buy at launch. Then you'll be stuck in the dependency hell that is trying to use old drivers/stack.


> AMD RX 580 was released in April 2018

It was actually Apr 18, 2017 -- https://en.wikipedia.org/wiki/Radeon_500_series


And the 580 was just a rebranded 480, which was released in June 2016.


Speaking from having run tens of thousands of both 580 and 480, they weren't just rebranded. Maybe on paper they seemed similar, but they didn't run similar.


I’m curious, where would you run so many gpus? Quality control?


7 datacenters in the US


This just makes the case for supporting it even further, because then they'd be supporting multiple years' worth of hardware with just the effort for one uarch.


Nothing like having a lottery of 2-6 years of support for your hardware to make your customers confident they are getting value out of the products they are buying.

The manufacturer can smugly proclaim they offered six years of support for a product that was on the shelf four years into the driver's lifecycle.


tbh I'm not sure what AMD's plan is for ROCm support on consumer devices, but I don't really think AMD is being fraudulent or anything.

Both ROCm and Vulkan are supported in MLC LLM, as mentioned in our blog post. We are aware that ROCm is not sufficient to cover consumer hardware, and in that case Vulkan is a nice backup!


If you click the "Radeon" tab here[1], dated 27 Jul, AMD claim ROCm support on a wide range of consumer cards, with HIP SDK support on RX 6800 and up, under Windows. The Linux situation seems less clear.

1: https://rocm.docs.amd.com/en/latest/release/windows_support....


Given AMD's track record, the 6900 will be dropped next year or in early 2025.


How does the performance with Vulkan compare to the ROCm performance on the same hardware?


We haven't done any comparison between them yet, but generally we believe Vulkan, as a more generic cross-vendor API, should be slower than ROCm. Same for CUDA vs Vulkan.


There is also Vulkan support, which should be more universal (also included in the post); for example, the post also shows running an LLM on a Steam Deck APU.


You're not going to find rx580's with enough vram for AI. Typically 4-8gb.


I am currently using my RX 580 8GB for running large language models on my home computer using llama.cpp opencl (clBLAST) offloading of layers. I can fit up to 13 billion parameter llamas (1 or 2) if they're quantized at 4 bits. It's not super fast but at least my AI IRC bots aren't eating into my CPU time anymore.

But my attempts to get direct ROCm support were thwarted by AMD.


Great for home use, zero commercial value. Can't expect AMD to invest time/money into ROCm for that.


You can say the same thing about a 24GB consumer card. Going from being able to run 13B llamas to 33B doesn't really help you in a commercial sense. This holds true, generally, for other LLM foundational models as well. To do commercial work you're going to need more RAM than consumer cards have. You need at least two if you're going to run the 70B, and even then the 70B (and similar) aren't useful commercially - except in the "gathering money from investors who don't know better" sense.


is 70B not commercially useful because of the model storage requirements, or total inference performance, or additional memory per session that's inferencing, or what?

is the output better such that it's desirable, or is this just a case of "too much performance hit for a marginal gain"?


No one is arguing any of that. You're the one that brought up the 580 specifically.

By the way, still waiting for you to take me up on your 'bet'.


I was wrong. Sorry. Food trucks do accept cash most places.

Now it's your turn, Mr. "You're not going to find rx580's with enough vram for AI. Typically 4-8gb." This is completely false. Rather than acknowledging that, you then tried to move the goalposts (much like I did in that past thread, saying, "Oh, but maybe it's just my region where they don't"). It looks like we both behave a bit silly when trying to save face when we're wrong.


> This is completely false.

It isn't completely false. You're doing super limited stuff as a hobbyist that barely works.


The parent article is entirely about running and benchmarking 4 bit quantized Llama2-7B/13B. This is the "super limited stuff as a hobbyist that barely works" and I've run them at entirely usable speeds on the AMD RX 580. You're either wrong or you didn't actually read the article and have been arguing correctly (from your ignorant perspective) about something random.


"entirely usable" is not the same as "roi efficient"

> from your ignorant perspective

no need for the ad hominem.


Ignorance is not an insult. It just became obvious that you were talking about a different concept (commercial use with big models) than the article itself and everyone else were talking about (7B/13B models). So I generously assumed you just hadn't read it (ignorance). I guess now that you've ignored that and doubled down I can assume you were/are just arguing in bad faith.


Home use is how you get employees that push your products at work. The lack of focus on home use is AMD's biggest ML weakness.


The lack of a place where you can hourly rent top of the line hardware from AMD is the biggest weakness. Nobody is going to buy and run a MI210/MI250 at home.


Having a community of people using your product has zero commercial value?

Do you even know how brand recognition works?

The amount of people swearing off of AMD because of bad drivers ten years ago easily cost them a billion dollars. More than the cost of developing a good driver.


> Having a community of people using your product has zero commercial value?

That is not what I'm saying. I'm saying that if I buy up a bunch of rx580 cards, nobody is going to rent them from me.

Now, if I offered a bunch of MI250's on an hourly rate, you can absolutely bet people will rent them all.


I mean, in AI specifically you need your stuff to be usable by a small lab of profs/grad students, otherwise it will never get adoption.

usually at least some of the compute resources are "prosumer" workstations using commercial cards.


Agreed, AMD needs to get their high end cards into more schools. Short of that, they need a place where people can rent them by the hour (and give discounts to schools).


Do you have instructions for this? Got a Sapphire 580 and keen to use it for more than doing the Windows UI.


ROCm is not just for AI.


I've been waiting for someone to tell me what I can profitably do with the 120,000+ 470/480/580's that I have sitting around doing nothing. It sounds like you have an idea?


Crack passwords. Also, there was this craze a few years ago where people were using their GPUs to crunch numbers and earn money, I forgot what it was called ... something with mines.


Right... any legal ideas you have?


Cracking passwords is legal if you obtained the hashes legally as part of your pentest contract. So is shitcoin mining.

But you seem dead set that there are no uses for ROCm so I'll leave you there.


If the best you can do is 'password cracking' as your answer, you're obviously not very well versed in things. Plus, you don't need ROCm to crack passwords.


Good luck trying to make enough money to pay for power let alone capex


I mean, I'm using ROCm for VFX rendering. But regardless, I'm not sure that cards as old as your 470s can really be competitive enough power-usage-wise to make them very profitable.


Correct, not profitable.


But just because you have old GPUs, doesn't imply there is a problem with ROCm. You'd have the same problem of economics with old Nvidia GPUs.


ROCm doesn't support old GPUs.

That said, people are finding hacks...

https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...


You could sell them or give them away to hobbyists, but that could eat into the lucrative “shoehorning old crypto mining into unrelated conversations” business


I was talking about hobbyists. Who said anything about businesses?

CUDA also works on consumer NVidia cards, not just business ones.


I did not know that you meant “hobby profitable” (more brain cells, bigger Ideas) not “business profitable” (money) when you asked people how to use your old mining hardware profitably.


You're not going to find a customer for high-end cards when their entry-level experience is this poor. I ran Stable Diffusion on my CPU instead, even if it took ten minutes per picture.


RustiCL to the rescue! Supposedly it's already faster than ROCm


Being 90% of a 4090 makes the 7900 XTX very attractive from a cost-per-compute perspective, as it's about 65% of the price, and power is significantly lower too.


Isn't it closer to 80%? Where is it 90% of a 4090?


You're right, my mistake. Still, cost and power per compute favour AMD.


The article specifically says 80%. My guess is they're mixing up the numbers for the 4090 and 3090 Ti.


It's not the 4090 that the 7900 XTX should be compared to; it's the 3090 Ti. They can be had for just about the same price when you consider 2nd-hand cards.


[flagged]


It has been a while since I saw anyone send a UserBenchmark link. UserBenchmark is not a site with a good reputation; you can find many reasons online why this is, but I guess the site's moderators just ignore it by pretending it's "Advanced Marketing". Meanwhile, even AMD's CPU competitor has banned the site from their subreddit. One of my favorite explanations is https://youtu.be/RQSBj2LKkWg but there are of course more recent ones.


Really? You are referring to UserBenchmark's assessment of AMD?

Look at literally any of their assessments of AMD products and they'll be dragging them through the mud while pointing to worse-performing Intel products to flex how much better they are because they aren't AMD.


There's a reason this site is banned from most hardware forums.


Somewhat as a general question, not just for AMD/Nvidia: At what point does RAM stop being the bottleneck? By which I mean, given current chips, how much RAM could you theoretically bolt on to these cards before the limiting factor in performance is the GPU instead of RAM? And does that change when the task is training vs. deployment/prompting?


What do you mean? Are you talking about capacity? Or bandwidth from RAM?

I'm in the HPC space, and pretty much everything I do on the GPU is bound by how quickly I can get data in and out of DRAM.

The point at which data motion to/from DRAM is not the bottleneck is when you do enough work per byte of data moved. How much work is that? On today's server GPUs it's in the region of 50-100 double precision floating point operations per byte. You can work out an exact number by taking the theoretical maximum floating point operations per unit time you can execute and dividing by DRAM throughput (data moved per unit time).

O(50-100) double precision flops per byte is a _lot_ of work. We're talking BLAS-3 type operations. Anything level 2 or lower, as well as sparse operations, is typically bandwidth bound.
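To make that arithmetic concrete, a quick sketch; the two spec numbers are placeholders rather than any particular card:

    # Ridge point: flops you must perform per byte moved from DRAM before a
    # kernel stops being bandwidth bound. Both inputs are placeholder specs.
    peak_flops_per_s = 60e12    # assumed peak double-precision flop rate
    dram_bytes_per_s = 1.0e12   # assumed DRAM throughput, bytes per second

    ridge_point = peak_flops_per_s / dram_bytes_per_s
    print(f"~{ridge_point:.0f} flops per byte")    # ~60 with these placeholders

Plug in the spec sheet of a specific GPU to get its exact figure.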


The problem with a lot of machine learning algorithms is that you do hundreds or even thousands of operations per-value, but you do it in specialized orderings on incredibly large numbers of values, forcing you to swap partitions in and out of RAM. So, while you may do thousands of ops per-value, you may only be able to do tens of ops per-value per-write to RAM.

The more RAM you have on device, the fewer swaps you need to do (none at all if it's big enough), and the less those operations get amortized, bringing you closer to theoretical max throughput.

Matrix multiplies are such that going from fitting 75% of your values to fitting 100% of your values can mean an order-of-magnitude speedup.
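As a rough illustration of why fitting everything on the device matters, a sketch of the transfer volume for a blocked matmul under a naive streaming schedule; all sizes here are made up:

    # Data moved for C = A @ B with n x n fp16 matrices.
    n, dtype_bytes = 16384, 2
    matrix_bytes = n * n * dtype_bytes

    # If A, B and C all fit on the device, each matrix is transferred once.
    resident = 3 * matrix_bytes

    # If only a tile fits and blocks are streamed naively from host memory,
    # each row-block of A is re-sent once per column-block of B (and vice versa).
    tiles_per_dim = 4                      # assumed partitioning
    streamed = (2 * tiles_per_dim + 1) * matrix_bytes

    print(f"resident: {resident / 2**30:.1f} GiB moved, "
          f"streamed: {streamed / 2**30:.1f} GiB moved")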


Disclaimer: I have no idea how machine learning algorithms work.

I work with problems so huge they do not fit on a single device. Multiple devices each own a small piece of the global problem. They solve a local problem and they must communicate over a network in order to solve the global problem. This is almost universally true.

You would know more than I in this field, and I expect it really is better to swap a partition than it is to use a network.

There are certain methods in HPC applications that are almost universally avoided because of how terribly they scale to a large distributed-memory system. Matrix multiplies are one of them. Outside of a handful of ab initio computational chemistry algorithms (which are incredibly important), basically the only reason someone does a large dense matrix multiply on a supercomputer is that they're running a benchmark, not solving a real science problem.

Folks more knowledgeable than me here feel free to jump in.


You guys are talking past each other but really talking about the same thing - arithmetic intensity. You're talking about FEA or some other grid solver/discretized PDE/DFT type thing where the matmuls are small because the mesh is highly refined and you've assumed the potentials/fields/effects are hyper-local. But that's not an accident or dumb luck - the problems in scientific HPC are modeled using these kinds of potentials post hoc, i.e. so that they can be distributed across so many cores.

What I'm saying is, it's not like a global solver (i.e. taking into account all-to-all interactions) wouldn't be more accurate, right? It's just an insane proposition because, surprise surprise, that would require an enormous matmul during the update, which you can't do efficiently, even on a GPU, for the same reason the ML folks can't: the arithmetic intensity isn't high enough, and so you incur I/O costs (memory or network, same thing at this scale).


> There are certain methods in HPC applications that are almost universally avoided because of how terribly they scale to a large distributed memory system. Matrix multiplies are one of them.

Neural networks, which are the basis for nearly all modern AI, are implemented as a mixture of sparse and dense matrix multiplies, depending on the neural architecture.


Thanks— that’s exactly the sort of framework for thinking about this that I was looking for. I think in the realm of LLMs there are other components but this is a piece of it


The ideal RAM capacity is determined by the biggest available model you want to run. So... ~48 GB for Llama 70B? Maybe more with batching and very long contexts (rough math below).

RAM bandwidth is basically always going to be a bottleneck. Microsoft's proposal to get around this is to just keep everything in SRAM and pipe chips together.
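For a rough sense of where a figure like ~48 GB comes from, a weights-only sketch; the bit-width and overhead factor are assumptions:

    # Weights-only estimate for a quantized 70B model; KV cache, activations
    # and runtime overhead come on top, which is why "maybe more" applies.
    params = 70e9
    bits_per_weight = 4.5    # assumed: ~4-bit quantization plus scales/zero-points
    weight_gib = params * bits_per_weight / 8 / 2**30
    print(f"~{weight_gib:.0f} GiB just for the weights")    # ~37 GiB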


Is Geohot (George Hotz) part of this project or is it a parallel undertaking?


It's parallel. Notably, Geohot's focus has been on improving ROCm directly while this project focuses more on improving the downstream tooling.


What contributions did he make to ROCm?


In your words, "He found a bug, and made a bunch of noise about it."


He is doing full stack, and is selling tinybox (a PC with 6 AMD GPUs in it).

As he went through the hard work of debugging the driver and getting AMD to care about it, I was expecting to see him in the acknowledgements.


Didn't he just find/report a bug in the multi-gpu case? Then have a melt-down after an AMD engineer sent him a fix a couple of days later? I think you are overstating his contributions, in fact it's not clear to me he made any.


He shows the git patch fix delivered by AMD engineers. The root cause is not explained, and the change itself is about 2-3 SLoC. His frustration, what you call a meltdown, is not about the interaction with AMD but rather about how troublesome it was to get AMD's demo scripts to run on AMD's chips.


Feeling frustration about something not working is a normal part of software development. Reporting a bug, or choosing not to use that software are normal too. Making a public video trashing the work of 100s, using the language he did, is a meltdown.

I've worked on public software before, and it's left me with an extremely low opinion of this kind of behaviour, and the kind of people that show it.


He literally had a phone call with the CEO of AMD and got them to organize people to fix the problems. He has been whipping AMD into fixing their drivers.


Do you work at AMD and therefore have some kind of insight that we don't? Based on my own experience it's much more likely the call was just a marketing exercise, completely orthogonal to actual development.

Gfx drivers are complicated, long term projects. They're not something that get fixed with a phone call.

He found a bug, and made a bunch of noise about it. No need to make it more than it was.


Off topic, but this feels like a good place to ask: can WebGPU give you decent performance on non-CUDA hardware and help accomplish these kinds of aims? (Geohot, I think, is aiming to avoid a single-chipmaker monopoly on AI, which he sees as a bad thing /paraphrase)


As of today, performance in WebGPU isn't as competitive yet, but there is really quite a lot of low-hanging fruit for WebGPU to pick up.


no. different project https://tinygrad.org/


He was the one who pushed AMD to improve their drivers. So yeah, without him this wouldn't be possible.


Has there been a driver release from AMD since geohot had a call with their CEO? Otherwise I don’t see how you can attribute this entirely to him.


Memory bandwidth for unbatched decoder-only autoregressive inference should be the only bottleneck. The listed numbers imply you're only saturating about 500 GB/sec, ignoring KV cache traffic. That implies there's still quite a bit of inference performance left to squeeze out of the 960 GB/s spec.
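For anyone who wants to reproduce that back-of-envelope estimate, a sketch; the token rate and weight size below are illustrative assumptions, not the exact figures from the post:

    # Unbatched decode reads roughly the whole weight set once per token, so
    # achieved bandwidth ~= tokens/s * bytes of weights touched per token.
    tokens_per_s = 130       # assumed decode throughput
    weight_gb = 3.9          # assumed ~4-bit Llama-2 7B weights, in GB

    achieved_gb_s = tokens_per_s * weight_gb
    utilization = achieved_gb_s / 960      # 7900 XTX spec-sheet bandwidth, GB/s
    print(f"~{achieved_gb_s:.0f} GB/s achieved, ~{utilization:.0%} of peak")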


Nice writeup. A minor nitpick though - why is the price of the 3090 Ti stated as almost 2k USD? I guess the card might be hard to get new and this might indeed be the price somewhere, but since it's a previous generation, in my opinion it would make sense to use the price of a 2nd-hand card (probably under 1k USD, depending on place/luck).

BTW - that's exactly the dilemma I'm pondering now when thinking about a new PC build to play with some machine learning / LLMs. In my personal situation, the 7900 XTX and 3090 Ti are about the same price / TDP / profile. Almost the only difference is then CUDA vs ROCm and the fact that one is new and might therefore give me some basic warranty.


Why is the 4090 so slow in your tests? It regularly demolishes my 3090 in my own tests (2x speedup in most tasks).


LLM decoding is dominated by memory bandwidth, and the 3090 Ti and 4090 happen to have identical theoretical memory bandwidth.


Can the GPUs that are built into CPUs be used for AI?

They’re not fast but they do share memory with the CPU.

The most interesting at the moment is the AMD Ryzen™ 9 7940HS processor, 8 cores/16 threads.


They have lots of ALUs so in theory yes.


It wasn't clear to me from the blog post: if I have an AMD GPU and an AMD CPU and everything with unified memory, can I use system memory for inferencing? That's a difference of "just" 16 GB, or a combined 72 GB available to me, and would be a big deal.


Do I understand correctly that it only applies to dedicated cards, not these GPUs bundled with Ryzens?


What about using 3 8GB RX 6600 in place of 1 24GB RX 7900?


Just out of curiosity - how would you go about linking the 3 cards? Does AMD have anything like NVLink? (Honest question, I really don't know and suspect I might have missed a bit somewhere in the idea.)


Through PCIe, using GPU P2P with a supported AMD chipset (Epyc processor) so they can talk to each other, or going through your CPU and RAM. I think the latter is what people using Nvidia RTX 4090s do, because P2P is disabled and NVLink is unsupported.


Nvidia dropped NVLink from their consumer cards, so for consumer Nvidia and AMD cards it would work the same - through PCIe.


Hm, this is interesting, is this enough to go long AMD?


nah


I really wonder what kind of Kool-Aid they serve at AMD. This feels very similar to AMD shilling "multi-core" CPUs back in the mid-2000s, when anyone seriously using multi-core CPUs didn't care about cost and was just buying Intel while AMD continued to fumble mediocre products.


Talking s** about AMD CPUs in the mid-2000s is a gutsy move.



