AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x (nscale.com)
66 points by latchkey 9 months ago | 100 comments



I don't usually like to do this, because clearly someone wrote this post (or at the very least they cared enough to have an LLM help draft it :P). Maybe it's just me and my interests not being aligned with the content. But I just found myself kind of disappointed by it?

Like, the post is about how you can do GEMM tuning on AMD GPUs, a subject which is inherently super interesting–there's a lot of nuance to writing optimal kernels, and some of this is expressed in the article, too. Combine that with an architecture that isn't Nvidia? It's an excellent setup for something that would be interesting to read.

Which makes the actual conclusion all the more disappointing, IMO. There's nothing about what the actual optimizations are. It's just "oh yeah we tuned our GEMMs and now LLaMA is faster". Like, I get that nobody actually cares about GEMM and they just want tokens to come out of their GPU. But still, that's like writing a blog post about how you can speed up your game with SIMD and then posting some charts of how Cyberpunk 2077 gives you 2x the frame rate now. Ok, but how? I just feel like the interesting part is missing.


I believe the reason they offer no details about how they tuned the kernels is that the tuning is done by a tool provided by AMD. See here:

https://rocm.docs.amd.com/projects/rocBLAS/en/develop/how-to...
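
To give a rough sense of why per-shape tuning matters at all, here is an illustrative PyTorch sketch - not what the AMD tool does internally. It times the same fp16 GEMM at a decode-like shape and a prefill-like shape; the 8192 dimension is loosely modeled on Llama-2 70B's hidden size, and it assumes a working ROCm or CUDA PyTorch build, where the "cuda" device maps to whatever GPU is present.

```python
# Illustrative only: time the same fp16 GEMM at a decode-like shape (m=1)
# and a prefill-like shape (m=256). The optimal kernel/tile configuration
# differs per shape, which is what per-shape GEMM tuning exploits.
import torch

def time_gemm(m, k, n, iters=50):
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    b = torch.randn(k, n, dtype=torch.float16, device="cuda")
    a @ b  # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per GEMM

for m in (1, 256):  # decode vs prefill
    ms = time_gemm(m, 8192, 8192)  # 8192 ~ Llama-2 70B hidden size
    tflops = 2 * m * 8192 * 8192 / (ms * 1e-3) / 1e12
    print(f"m={m}: {ms:.3f} ms/GEMM, {tflops:.1f} TFLOP/s")
```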


This article builds upon the work that C&C just released an article on.

https://chipsandcheese.com/2024/06/25/testing-amds-giant-mi3...

Elio, the person who did the testing, worked together with C&C on my system. He then took it to the next level with this GEMM work.

The docker container is linked in the article now so that anyone can reproduce the results.


I get the impression Nvidia thinks they have a moat with CUDA, but the current AI boom is mostly built upon Python libraries that are platform agnostic, or at least can be. With enough support for AMD in PyTorch etc., the decision to buy AMD or Nvidia will come down purely to specs.


Nvidia should be very afraid of AMD.

AMD's mastery of chiplet designs is in stark contrast to Nvidia's reliance on extremely large and complex dies and its pretty lackluster forays into multi-die in the past.

If Nvidia doesn't come up with ways to combat the inherent cost and yield advantages of chiplets, they could quite easily get blown away on $/perf.

GPGPU didn't have enough money behind it to justify fully exploiting all hardware vs just going with whatever was easiest (Nvidia+CUDA) and eating the cost, but that has changed substantially.

If you can buy AMD at a 10%+ $/perf advantage when you are talking about laying down $100M+ for accelerators on even a moderate-sized cluster, you can be damn sure the software will get tuned to take advantage of it.


What are you talking about?! Nvidia has been using chiplets and silicon interconnects in its data center ML accelerators way before AMD got in the game.

It's the same TSMC tech that both Nvidia and AMD use; it's not something proprietary or exclusive to one or the other, since they're both fabless.

On the contrary, AMD has to be afraid, since they have no answer to Nvidia's 800GB/s data center interconnects they got from their acquisition of Mellanox, which enables Nvidia to scale.


AMD purchased Xilinx not long ago, and as a result has access to their high speed interconnects, including at least one capable of supporting 800G networking:

https://www.xilinx.com/products/technology/high-speed-serial...


I'm as close as you get to an old school Infiniband fanboi.

Don't think that Infiniband is somehow an Nvidia moat. If anything, because they ate up the last somewhat independent vendor (Mellanox), it's likely it will be replaced by some variant of what is now being called Ultra Ethernet, which for the most part incorporates ideas from Infiniband into the Ethernet PHY layer along with bolting into libfabric at the userland level. The people with the brains behind it include the Omni-Path folks (descended from QLogic, and SilverStorm before that), who have extensive experience with Infiniband and are now free from the shackles of Intel.

However this has absolutely nothing to do with chiplet designs or multi-die architectures (which Nvidia has done before with their dual GPU stuff from the GTX 900 series days etc).

The fact of the matter is that Nvidia has no experience whatsoever with chiplets or advanced packaging. AMD, on the other hand, has absolutely tons, from chiplets to HBM to 3D V-Cache, all of which are ultimately advanced packaging techniques that build on a shared background of expertise.

Nvidia needs to rapidly catch up in this area or their future is a lot less certain.

This is putting aside what is happening with Tenstorrent/Etched/Groq, etc which threaten to eat up the vast majority of the non-training workloads associated with the current crop of AI/LLM architectures.


Nvidia seems familiar with HBM. It’s on a lot of their products.


What? Nvidia HAS used HBM in several products, just only on the DC side and not on the consumer side like AMD.


Why do people make these kinds of comments that dramatically oversimplify incredibly complex issues? Is it some kind of fun "I'm in the in-crowd" type feeling?

> mostly built upon Python libraries that are platform agnostic, or at least can be. With enough support for AMD in Pytorch etc

You can't imagine how much work, and how many years of it, is hiding in that "can be". Thus NVIDIA has two of the deepest moats you can have: time and money (you would probably have to outspend them by 10x for 10 years to catch up).

Source: go look at how the AMD backend is implemented in PyTorch vs how the CUDA backend is implemented.

Edit: people must really not understand that this code (ie building this stuff) is highly non-trivial. It's not like building a competing JS frontend framework. You not only need good engineers, you need lots of them, because the surface area is very big - hardware platforms, especially AMD's, aren't at all "abstract", so really to do it right you need a team per box in the product matrix.


I don't see it. Large customers are choosing AMD over nVidia.

NVidia has a massive headstart, but there are diminishing returns to their software work. If AMD performance per dollar is 10% better at the hardware level, and their software performance is 90% of that of Nvidia, they are ahead.


Absolutely none of this comment is based on fact or reason.

> I don't see it. Large customers are choosing AMD over nVidia.

Please show me term sheets of contracts signed with "large customers". Please include net, gross, and recurring revenue.

> diminishing returns to their software work

Makes zero sense: software penetration follows power laws (network effects etc) not exponential decay laws.

> If AMD performance per dollar is 10% better at the hardware level, and their software performance is 90% of that of nvidia they are ahead.

Again: you cannot fathom how enormous that "if" is, so I highly recommend you actually go look at the impl to get a sense.


> Makes zero sense: software penetration follows power laws (network effects etc) not exponential decay laws.

Software libraries aren't social networks, though. According to this argument we should all still be using C and not seeing new languages take hold. There is almost certainly a point where a library stops being able to add new value by being extended and that competitors can catch up to relatively easily no matter what the user demographics look like.

There are definitely social aspects to software development so there might be some power laws to find; but building a competing ecosystem gets done regularly enough and works well (C# v. Java for example). Much easier to do than building a competing social network.


> Software libraries aren't social networks, though.

They literally are - I follow people on GH in order to check out what they're doing and what repos they're starring/using.

> Much easier to do than building a competing social network.

We're talking about the ecosystem around the base stack remember? So no it's not easier when the bulk of the work needs to come from OSS contributors.

People have the weirdest wishful thinking about this because they don't actually work in this area (and/or don't have stake in the game). If you're this confident please show me your portfolio allocation for NVDA and AMD. Otherwise hint: everyone working on this does not so casually dismiss NVIDIA's lead.


GitHub is a social network. Libraries aren't. CUDA is not even on GitHub though as far as I recall.

> We're talking about the ecosystem around the base stack remember? So no it's not easier when the bulk of the work needs to come from OSS contributors.

The ecosystem exists as an abstraction around a base stack though. Much like saying the existence of Linux solidifies x86 dominance - that isn't what happens in practice, the Linux distros just pick up support for whatever hardware API are floating around. At some point the base can't add more value by adding more API features, and then the middleware starts supporting multiple bases depending on hardware economics.

The OSS people have no incentive to lock themselves in to buying NVidia cards. If AMD cards tended to work for compute problems they'd probably much rather be using AMD. AMD has better support for anything OSS that isn't compute related.


I'll leave it at this: no one that works in this area agrees with you or the lowbrow/peanut gallery "NVIDIA has no moat" takes. If that doesn't give you pause then I guess you should just start a company and prove us all wrong.


> no one that works in this area agrees with you or the lowbrow/peanut gallery "NVIDIA has no moat" takes.

Can you source that? That seems like a hard claim to source.

Everyone in the area agrees that they'd rather be using Nvidia hardware. I'd rather be using Nvidia hardware, indeed I switched to a Nvidia card. But that is not because I see Nvidia as having a long term defensible position, but because AMD's compute drivers on consumer cards appear to suck. It was implementation details and AMD's weaknesses, not Nvidia's strengths. Important details, but nothing that can be realistically called a moat and certainly not anything to do with CUDA itself. The issues were far more foundational.

Honestly it'd be interesting to have some real expert takes on what the problem with AMD is, because the major one I'm aware of was geohot's and he didn't get much further than I did. Blockers came up much deeper in the stack than CUDA. ROCm would have been good enough except that it was running on AMD kernel drivers.


You have no facts either, only cliché lines for reasons. What do software penetration or power laws have to do with anything?

I don't know what relevance term sheets should have here, but AMD success at the HPC top end is pretty obvious if you look:

https://top500.org/lists/top500/2024/06/

AMD and nVidia are each powering about 1.5 exaflops of compute among the top 10 supercomputers. Machines 1 and 5 are AMD; nVidia is number 3 and 6-10.

They are still massively behind nVidia in the market overall: they are projecting a tenth of nVidia's datacenter revenues. But that's still $4 billion in sales.


Hint: hpc (national labs) are not "large customers".


Frontier has 38k MI250. Google and Amazon apparently bought 50k H100 in 2023. (Behind Meta and Microsoft at 150k)

I believe 38k definitely qualifies as large.

Maybe you misread me. I didn't mean to write that this was the default choice for large customers. Just that there are large customers for whom it makes sense to go AMD.

The thing is, from where I am working and planning, Nvidia's CUDA advantage is not a thriving community around it that would be hard to replicate. The community aspects are much more prominent higher up the stack; if you support PyTorch and TensorFlow you have a ton of community. Nvidia absolutely rocks at having high-performance proprietary libraries for every niche use case.

That's going to take time, focus, and investment, but the hand performance tuning of these libraries might only buy you so much over the PyTorch version. Unless you are doing LLMs and pushing hard against the limits of what's possible, that last bit of tuning can wait.

Nvidia had a call with us not long ago. I genuinely don't see the moat. If we manage to launch a product in our space it will run on anything PyTorch runs. There is no advantage to marrying ourselves to CUDA.


> Frontier has 38k MI250

I know - my group has several 100k-hour allocations on Frontier (had - I graduated in April). The difference between Frontier and Aurora and wherever buying GPUs and FAANG buying them is that the labs don't refresh every 2 years. That's why I don't consider them "large customers". They're not even full-ride customers - you can be sure Frontier got a very nice deal, much nicer than FAANG would, because of the top500 prestige.

> The thing is, from where I am working and planning, nvidias cuda advantage is not a thriving community around it that would be hard to replicate. The community aspects are much more prominent higher up the stack, if you support pytorch and tensorflow you have a ton of community.

First sentence and the others are in direct contradiction. And literally it was the first thing I pointed out: every armchair quarterback thinks supporting PyTorch and TensorFlow is so easy, but you only need to compare how AMD is currently supported to know that it's not the case.


Frontier will run for more than 2 years before being replaced, but there are many more national lab type organizations than FAANG companies. If you decide to classify anything below FAANG scale as "not large", well...

And anyway, my point was merely that there are large customers going AMD.

As for the supposed contradiction: Cuda is much more than PyTorch support. There is a real breadth of proprietary C++ libraries. When I (and many others here I guess) say cuda, we mean that breadth of effort. Not just the cuda backend of pytorch. You don't have to match that breadth of effort to be competitive for very many AI applications being built higher up the stack, you only have to have a good enough ROCm backend for pytorch, and from everything I am hearing, it's getting there (if you are on Instinct, and not consumer hardware). But I don't have first hand experience.

What I have first-hand experience with is what NVIDIA tried to pitch us. They want to be the AI "platform", rather than just a hardware vendor. It all sounded like they know that their software advantage is brittle in places, hence the "platform" strategy. But they couldn't really articulate what that should mean.

I think the moat metaphor is ill-suited. This is not a situation where, if the moat is breached in some places, the castle falls. NV will have a software advantage in some domains for a long time to come. But at the same time, we are seeing that AMDs hardware can be competitive and better in domains where the software advantage matters comparatively less.


> but there are many more national lab type organizations than FAANG companies

Is your claim really that national labs outspend FAANG for compute? Like do you understand what you're saying? You're off by probably an order of magnitude if not two.

> You don't have to match that breadth of effort to be competitive for very many AI applications being built higher up the stack, you only have to have a good enough ROCm backend for pytorch, and from everything I am hearing, it's getting there (if you are on Instinct, and not consumer hardware). But I don't have first hand experience.

I don't want to say too much more because I work somewhere that is very close to this story/melodrama, but everything you're hearing is aspirational hype, and what I wish for everyone talking about this is much less gossip and much more first-hand reality.


I have said nothing of the sort anywhere in this thread. All I said is that AMD is viable for some large customers. You chose to continue to misinterpret that after several clarifications.

Then you throw out vague statements, and claim that you "don't want to say much more", to complain in the next sentence that it's all gossip.

You have literally offered no argument, other than saying "trust me, I know, I am an insider"...


> You have literally offered no argument

I absolutely did but seemingly no one understands the force of it (because no one actually cares about details). My argument was very very simple: go look at the hip backend in pytorch and the cuda backend. To anyone that understands what's what, that alone speaks volumes. Not my fault if you don't understand though.


Again, your argument is ad hominem ("you don't understand") and not naming an actual problem but declaring that it's obvious to anyone who wants to look.

The AMD and the CUDA backend in pytorch are actually derived from the same code base. AMD compiles the CUDA code to its HIP interface (which is essentially a reimplementation of CUDA), and that's that. It's all upstreamed and so changes to the CUDA backend are tested against the AMD build.

If there is a problem with that strategy, then it's up to you to demonstrate it because the prima facie evidence is that it works: I can go to Azure and buy an AMD Instinct VM and run a Hugging Face model on that right now, with marginally more effort than it takes to run it on an nVidia VM.
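
For what it's worth, the "marginally more effort" is largely because a ROCm build of PyTorch exposes the AMD GPU through the familiar torch.cuda interface, so the usual Hugging Face code path is unchanged. A minimal sketch, assuming a working ROCm install; the model name is just an example:

```python
# On a ROCm build of PyTorch the "cuda" device string maps to the AMD GPU,
# so the usual Hugging Face flow is identical to the Nvidia one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print(torch.cuda.is_available())      # True on a working ROCm install
print(torch.version.hip)              # set on ROCm builds, None on CUDA builds
print(torch.cuda.get_device_name(0))  # reports the Instinct GPU

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).to("cuda")                          # same device string as on Nvidia

inputs = tok("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0]))
```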


> Again, your argument is ad hominem ("you don't understand")

It's not ad hominem - it was my argument in literally my first response, without a single word of shade. You then go and say "you have no argument" and I respond with "well if you can't understand the argument then sure, I have no argument". Pretty simple. I read this on reddit at some point: you're not the victim here, you're just starting a fight and then losing that fight.

> If there is a problem with that strategy, then it's up to you to demonstrate it because the prima facie evidence is that it works

You think that yoking yourself to your direct competitor's runtime API is a sound product strategy? Really? I'm sure you'll come back with "intel x86_64 blah blah blah". The fact is it's not a sound strategy, and maybe it would be wise for AMD to build out a real HIP backend (and maybe they already are).


Nowhere did you say that you think trying to build an interface that is as close as possible to cuda is the problem. You still haven't said why it's a problem.

You only said "look at the implementation" and said that if that's not obviously problematic to me, I just don't understand. If you had explained why you find it problematic we might have had a conversation on the viability of that strategy and the need for other complementary strategies.

Seriously, the only reason I am still here is because I hope you can understand why your replies were really not appropriate for facilitating interesting conversation.



Plus all the flexibility of CUDA is not needed for LLMs anyways


Isn't that what the big *nix vendors said about GNU/Linux - the software stack is too sophisticated to replace - in the early 2000s?

To this day I prefer to work on my Fuel than any other machine. But it's still dead.


I mean, you’re right that I don’t know the details. But the money is there. Consumers want to pay AMD for cheaper cards. AMD wants that AI money. I think they’ll start chipping away at Nvidia’s lead.

Again, I’m no expert on CPUs but I think you could have argued something similar about AMD vs Intel 12 years ago. Now it’s clear AMD’s tech is superior.


Running inference with LLMs is not very complex. You don't need CUDA for that at all. You don't even need those complex libraries for inference. The market for inference will grow much more as AI is used in more and more applications by more and more people. NVIDIA has a moat but it's around the smaller castle.


You need CUDA for flash attention. Other than that it's just matrix multiplication, embedding, and point-wise activation.


Only for the reference implementation - flash attention is just based on optimizing memory movement between HBM and on-chip caches. The ThunderKittens[0] library has what is effectively flash attention v3, and they are working on supporting AMD arch.

[0]: https://hazyresearch.stanford.edu/blog/2024-05-12-tk
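
To see that the "flash" part is a memory-traffic optimization behind a generic interface rather than anything CUDA-specific, here is a rough PyTorch sketch: the naive version materializes the full score matrix in HBM, while F.scaled_dot_product_attention dispatches to a fused kernel when the backend provides one. Shapes are illustrative.

```python
# Naive attention materializes the full (seq x seq) score matrix in HBM;
# the fused scaled_dot_product_attention keeps tiles in on-chip memory and
# dispatches to a flash-style kernel when one is available for the backend.
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

def naive_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # full 4096x4096 per head
    return torch.softmax(scores, dim=-1) @ v

out_naive = naive_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)  # fused path if available
print(torch.allclose(out_naive, out_fused, atol=1e-2))
```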


There is also the MI300A APU (not the MI300X that the article is about). Do I have any idea about the technical details? No. But IMO CUDA's value prop might actually be memory management, and APUs seem to be a strategic stab right into the heart of that by making memory management less of an issue. If the GPU and CPU start sharing the same memory, a lot of the value proposition of CUDA goes away.

Haven't seen that land meaningfully so far, but it is an interesting risk to Nvidia.


What makes you think that CUDA's value prop is memory management? I have seen no indication that implementing the memory allocation system is the hard part of any GPGPU system, and HIP is mostly source-compatible with CUDA, so it's not like the user has to do anything different memory management-wise to run on AMD.


I owned an AMD card. Everything high-level I wanted to do I could implement on the CPU fine, but I couldn't implement it on the GPU because it was too hard to operate on data in VRAM without crashing (been a while, I forget if it was the transfer or the operation that failed), and to get back good debug info because I needed to move data from VRAM to normal RAM. I was trying to learn at the time so efficiency wasn't really a factor, I just wanted things to work. They never did.

CPU was easy, GPU was hard. The major difference I could spot, though, was all the time I was spending batting data around between different buffers; the APIs weren't really limiting me.


Getting good performance out of the GPU in general is regularly a memory problem. CUDA doesn't really help beyond documenting what is expected of you with regards to data layout and warp coalescing.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#maxim...


If you have a similar document for how to do all that with AMD cards and could link it to me back in 2014 that'd be hugely appreciated. Because I made no headway at all on the subject.

The library writers are obviously a lot better at all this than I am, but the experience of owning an AMD card was that they literally didn't and my experience left me believing that the reasons were more specific than "no CUDA". My experience was everything compute related caused crashes.


CUDA doesn't help directly, but Nvidia will provide you with libraries to do what you want efficiently on their hardware so you don't have to.


OK, but isn't the memory management stuff part of the API? :)

And with HIP, the user-facing memory model is virtually identical to CUDA to my knowledge. Converting a CUDA program to HIP, assuming no unsupported features or platform details are used, is basically just "do s/cu/hip/g".

Also, HIP wasn't around in 2014. To program an AMD GPU back then, you were probably using OpenCL, which is nicely platform independent but also arguably somewhat archaic.


> and to get back good debug info because I needed to move data from VRAM to normal RAM. I was trying to learn at the time so efficiency wasn't really a factor

I mean you're literally using the calvinball programming language - it's just a metafunctor in the language of macros, what's the big deal?

literally my (sanitized) code from 10 years ago

```cuda
__global__ void kernel_myKernel(int num_items, myTask_t * task_item,
#if VALIDATION == 1
                                myDebug1_t * output_debug1_arr, myDebug2_t * output_debug2_arr,
#endif
                                myOutput_t * output_arr, int current_iteration)
```

those myDebug_t pointers are pointers to host memory, so transferring the data back to host memory is transparently managed by CUDA or Thrust, and you just have an #if or #ifdef block in the code that dumps intermediate state to the pointer if validation is enabled in the build. And then you can run whatever test suite on the actual intermediate data that's happening inside your kernel, so you can test your invariants/assertions. Test-suite lite edition/just the part you need - but you're testing actual kernel invocations, not just theoretical ones. And you can go to the level of dumping intermediate state of prefix_sums/reductions (or generalized "intermediate work items") every warp/grid iteration if you want - it's debug mode.

but granted this relies on the ability to have data dynamically piped between device and host memory spaces... which was a novel bit of syntactic sugar they added to the CUDA toolkit back in like 2012 (it was big around the Kepler era and it was a big feature on the Jetson TK1 too) so AMD might well have not had it. Or they might have said they had it but it just broke horribly if you used it.

but this is literally like three lines of code, if you had just used NVIDIA instead of AMD, because the feature was there when you needed it. Technically can be accomplished by just allocating extra VRAM for the buffer and copying it back afterwards, but granted, more work there.

What's the price of getting your work done? What's your time worth? That's always been the only moat of relevance that CUDA has had... and it's always been worth the expense.


CUDA and ROCm both work under PyTorch. If ROCm does not work well, PyTorch does not work well.

Nvidia has multiple advantages.

1. Same software written in PyTorch works across all relatively modern Nvidia chips. With AMD and ROCm that's not the case.

2. M̶I̶3̶0̶0̶x̶ h̶a̶s̶ n̶o̶ F̶P̶8̶ s̶u̶p̶p̶o̶r̶t̶.

3. Comparisons against H100 are always like this:

  8x AMD MI300X (192GB, 750W) GPU  
  8x H100 SXM5 (80GB, 700W) GPU
The fair comparison would be against

  8x H100 NVL (188GB, <800W) GPU 
And that they never do.

4. H100 is a 21-month-old architecture; MI300X is 7 months old. Nvidia has moved to a new-architecture-every-year pace. AMD is a generation behind and must step up the pace. B100 comes out this year.

AMD is getting closer, but don't expect them to catch Nvidia in no time.


It has fp8 support. Not sure whether fp8 on MI300x is supported by vLLM yet.

Also, many of these comparisons use vLLM for both setups, but for Nvidia you can and should use TensorRT-LLM which tends to do quite a bit better than vLLM at high loads.


Elio, the person who did the testing, confirmed with me that he has fp8 working.


I haven't ever seen a server with 8x H100 NVL 188GB. The H100 NVL has 94GB of VRAM but they sell them in pairs connected with NVLink, so I guess they sometimes market them as 188GB, but in fact it's two cards and a server usually has 4 pairs.


> MI300X is 7 months.

Less than that, we paid for ours in January and received it in March. The first batch had problems and we had to send them back, which took another 3 weeks. So, let's consider the start date closer to April.

~3 months.


I thought H100 NVL had 96GB


Almost everyone is using stuff like tensorrt which is far from platform agnostic. But if AMD can build an API-compatible library that performs well it should indeed be easy enough to switch.


vLLM is widely used (and is what is used in this benchmark), and it is far easier to set up and maintain than TensorRT.

vLLM is not quite as performant but it's pretty close in production environments.


Is it really that close in production? The benchmarks I've seen show it's almost 3x slower compared to the SOTA. https://bentoml.com/blog/benchmarking-llm-inference-backends


You say benchmarks (plural) but I see just one here, and this benchmark does not add up in my experience. But let's take L3 8b in fp16 (the first graph), which shows the performance to be comparable on throughput (vLLM faster on TTFT across the board).

On the 4bit L3 70b test, they are using AWQ. GPTQ with Marlin kernels (which now gets repacked on the fly in vLLM) is much faster in our tests, and has much better optimizations amongst vLLM contributors.

So yeah, this is sample size of n=1, as are my own tests. But my experiences have been much closer to the L3 8b fp16 charts, so I'm going to presume the bigger delta comes down to that.

EDIT: here is one of our own recent internal tests where we observed Marlin performing 25-30% better than AWQ in vLLM - https://miro.medium.com/v2/resize:fit:1400/format:webp/1*F9i...


That's cool, but a 25-30% improvement wouldn't make up for the 300% advantage LMDeploy and TRT have over vLLM on the 70b benchmark. Where are your results for vLLM vs LMDeploy and TRT?


I'm not suggesting it's down to that. I'm suggesting that anyone running AWQ on vLLM is probably not running it optimally somewhere else.

I don't have a chart handy, but in our tests, TRT was ca. 10% faster (but it's much more difficult to set up, you have to convert models, etc). LMDeploy we tested months ago, maybe it's improved now, but they were making fairly wild claims that we didn't observe.

Basically, you shouldn't publish a benchmark without providing all of the details. How long did test run? How were requests staggered? What were the input/output token sizes?

A lot of these 256 in / 256 out benchmarks are useless. If your system prompt is 4k tokens, prefill becomes a serious issue. How performant is your continuous batching? Can you do chunked prefill? Are you running prompt prefix caching (and if so, how performant is that)? It's a lot more complex than simply generating some tokens at various batch sizes.


Yeah, that makes sense. Here's another benchmark that shows trt being about 2-3x faster, 2x on fp16 and 3x on int4. https://blog.premai.io/prem-benchmarks/ Though to be fair it's 512 tokens and not 4k tokens. Would love to see more public testing done for all the different variations.


Yep, agreed on the more testing. I don't actually have any allegiance to vLLM. We've just been happy with it.

A few things in the linked article raise my eyebrows. Notably, they claim to achieve 206 t/s at bs=1 on a single A100 80GB (2 TB/s bandwidth) in fp16. That model is 14.5GB in size, and achieving that best case would require aggregate memory bandwidth of almost 3 TB/s.

The BentoML benchmark you linked to also ran on an A100 80GB and achieved ~650 t/s at bs=10, so roughly 65 t/s per stream. Granted this is for an 8B param model, not 7B, and it will of course be faster at bs=1, but absolutely not by this margin. In this same benchmark, they achieved roughly 45 t/s per stream at bs=50, so you can see the sort of scaling we're dealing with. At bs=1 you should achieve relatively similar per-stream throughput to bs=10.

Baseten did a benchmark with TensorRT-LLM and an A100 80GB on Mistral 7B and at bs=1 they achieved 75 t/s and at bs=8 they achieved 66 t/s per stream (https://www.baseten.co/blog/unlocking-the-full-power-of-nvid...)

I strongly suspect there is something very wrong with the Premai benchmark (never heard of them before), or there has been some huge inference breakthrough that I am unaware of. The BentoML benchmarks look pretty good, and I suspect there is some vLLM performance left on the table, however not enough to close the gap. In our testing, TensorRT-LLM was definitely faster, but not enough to warrant all of the other headaches.


If you are buying hardware for billions of dollars, you can afford to spend a fair bit of engineering effort to make your code run on less expensive hardware.


AMD has sold out their MI300x and is trying to get more allocation from TSMC. NVDA's moat is their relationship with TSMC.


Eh? I just ordered 128 of them a few days ago.


Maybe it changed since then but Lisa Su said in the last earnings:

“We are tight on supply, so there’s no questions in the near term that if we had more supply, we have demand for that product, and we're going to continue to work on those elements as we go through the year,” Su said, adding that she’s “very pleased with how the ramp is going.”


Tight vs. sold out


in fairness, you also are a high-profile cloud provider with a pre-existing relationship - if small. ;)


Are you having trouble with your allocation?


To be fair, 128 are maybe $2.5 million in revenue for AMD. If they predict $4b, then we're talking about 200-400k MI300Xs. For every $1b more revenue, AMD needs capacity from TSMC in the area of 50-100k units. To be frank, your 128 are a rounding error in that.

Nvidia shipped almost 4 million DC GPUs in 2023, to give you an idea of the sizes needed for it to count on the balance sheet.


MI300x only really started being used in April... a whopping 3 months ago.

We are a brand new startup and tiny. Not pretending to be anything else. But we are also early and one of the first to buy and receive these cards.

We have excellent backing and will grow with demand. The more people wake up to an alternative that is actually available and, as we are proving, viable, the more demand we will have. The need for compute is endless.

Dell is a partner of ours and sees the value in our proposition. That is huge. They are hungry to grow their AI business the same way AMD is.

Give it time. The frankness is appreciated, but obvious. Comparisons against nvidia today are laughable. This is a long tail game. Talk to me in a few years.

This is not a football game with a winner and loser. Our goal is to not just be AMD, but every chip that our customers want.


This is what everyone who hasn't spent months fighting AMD drivers thinks ... until they spend months fighting AMD drivers.

Geohot's streams show you why that's a bad idea in real time.


AMD's ability to not put out good software for 20 years now is probably one of the most impressive feats in all of computing history.


The most shocking thing AMD's done in the last 20 years is not fucking up Ryzen after the first release. I have no idea how they managed that given their corporate DNA.


Anecdotally mine is unstable.


Now, I don't have any MI300X, so I can't make any definite claims here. I am hoping someone else can replicate the results shown here or at the least educate me on how this is possible. Good part is the docker container and associated steps are made public - which is pretty cool!

Going by the video, the first thing that gave me pause was that a single MI300X is pulling off Groq-like performance, i.e. 314 tokens/second for a batch size = 1 (bs=1) with a prompt of 256 tokens and generation of 256 tokens. [1]

The Llama-2 70B is 128.48GB with FP16 (you can see this in the video). The entire model fits well within the 192GB HBM memory of the MI300X - which is awesome! However, for an autoregressive transformer model, during generation, the entire model weights are processed to generate each single next token. These models are "next token predictors", so to speak, and you need the previous token to generate the next token. Therefore, the 128.48GB of model weights need to be read from HBM by the compute cores of the MI300X per generated token. Note, I am not talking about the prefill - which only needs a single forward pass to generate the first output token. Every subsequent output token is auto-regressive.

The video shows a single prompt (bs=1) with a 256-token prompt and 256 tokens generated within 1.63 seconds. There is no tensor parallelism involved, or batching or anything else. This is a bs=1 case with a single card, and hence you can reason about the math fairly easily.

This shouldn't fundamentally be possible within the specs of the MI300X card. The card has a peak HBM memory bandwidth of 5.3 TB/s. You'll notice that to cycle through the weights (assuming FP16) 256 times, you'd need a minimum of 6 seconds, even at perfect ideal conditions. Napkin math: (256 * 128.48e9) / (5.3e12)
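
Putting that napkin math into runnable form (assuming the fp16 weights are streamed from HBM once per decode step, and ignoring prefill and KV-cache traffic):

```python
# Every decode step has to stream the full fp16 weights from HBM once, so
# peak bandwidth puts a floor on total generation time. Ignores prefill,
# KV-cache traffic, and any caching effects.
weights_bytes = 128.48e9   # Llama-2 70B in fp16
hbm_bandwidth = 5.3e12     # MI300X peak, bytes/s
output_tokens = 256

min_seconds = output_tokens * weights_bytes / hbm_bandwidth
print(f"lower bound: {min_seconds:.1f} s")   # ~6.2 s, vs. the 1.63 s reported
```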

[1] https://wow.groq.com/groq-sets-new-large-language-model-perf...


Elio clarified this to me, he said...

256 tokens in 1 prompt/batch

Not 256 batches


Yes, a single sequence with 256 prompt tokens and 256 output tokens. This is a batch size = 1. No one is saying anything about 256 batches.

The first step in understanding this is to notice that the model (llama2) generates 1 output token at a time. This is because the llama2 70B is an autoregressive decoder-only transformer.

Fundamentally, to generate a single output token you need to process the entire model weights. At each forward pass you generate 1 token.

OK, now to generate 256 output tokens - you need 256 sequential forward passes. At each forward pass, the entire model is read from the gpu VRAM.

Even at ideal memory bandwidth (5.3 TB/s) that (256 forward passes of a 128.48GB model) should take 6s.

The reported number of 1.63s should not be possible.

I'd strongly recommend checking for correctness - that the generated output is coherent. Try sending actual prompts to the "gemm tuned" model and observing the generated responses and latencies. With "benchmark_throughput.py" you only get a final number and there is no check of whether the output is valid or not.
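
A minimal sketch of that kind of sanity check, using vLLM's offline API - the model path and sampling settings here are placeholders, not necessarily what the article used:

```python
# Send a real prompt through the tuned model with vLLM's offline API and
# inspect the text, rather than trusting benchmark_throughput.py's single
# aggregate number.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Explain GEMM tuning in one paragraph."], params)
for o in outputs:
    print(repr(o.outputs[0].text))  # incoherent text here would mean the speedup is bogus
```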


I'm not sure which benchmark you mean here, but I'll just comment that the Chips and Cheese article (which Elio apparently worked on?) looks like it used a batch size of 128 or so. Chips and Cheese don't mention the batch size used, though, so it's hard to be 100% sure.


Can someone knowledgeable please put these numbers in context with a comparison to an H100? For example they say 1053 tokens per second throughput with batch size 4 on llama2 70b. Is that good?


Where do you see 1053 t/s for L2 70b?

From their benchmark table, they achieve 1.8s response time at bs=4 for 256 output tokens, which would be 569 t/s throughput.

At bs=1 you are memory bandwidth bound. So as a simple ceiling we divide the memory bandwidth by the model size (70b is roughly 140gb at fp16). That would yield roughly 24 t/s for H100 and 39 t/s for MI300x.
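
Putting rough numbers on that ceiling (assuming ~140GB of fp16 weights, and peak HBM bandwidth of roughly 3.35 TB/s for H100 SXM and 5.3 TB/s for MI300X):

```python
# bs=1 decode is bandwidth bound: per-stream tok/s can't exceed
# memory bandwidth divided by model size.
model_bytes = 140e9   # ~70B params in fp16
for name, bw in [("H100 SXM", 3.35e12), ("MI300X", 5.3e12)]:
    print(f"{name}: ~{bw / model_bytes:.0f} tok/s ceiling at bs=1")
# -> roughly 24 and 38 tok/s, in line with the figures above
```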

I'm not really sure how GEMM optimizations alone would yield these sort of improvements at fp16. With quantization, where dequant overheads start to become compute bound, or at very high batch sizes, it makes sense and there has been a lot of work with various kernels in this area. But I'm a bit confused at fp16 and bs=1.

Maybe someone smarter can chime in?

EDIT: Ok I watched their video, which shows 39.75 t/s in the unoptimized (which checks out given the above rough maths) and 313.95 t/s for the optimized. I'm not really sure what's going on here.


Right. On that track, I want to confirm something. Maybe I am doing my math wrong or don't understand how transformers work.

There is a video about the bs=1 case, i.e. a single prompt with input 256 tokens and output 256 tokens with a Llama-2 70B model, on a single MI300X (no tensor parallelism). The optimized result is giving 314 t/s and completes the entire request in 1.63s.

Now, the Llama-2 70B is a an autoregressive decoded-only model. So, that means the entire model weights are processed for every generated token.

The model weights are 128.48 GB (also shown in the video, and can be confirmed from HuggingFace). The card has 192GB of HBM, and the model entirely fits on the card. The HBM memory bandwidth for this card is 5.3 TB/s.

Even assuming the prompt 256 tokens took no time, to generate 256 tokens, we'd need roughly 6s even at ideal memory bandwidth. (256 * 128.48e9) / (5.3e12).

What's going on here?


I found 1053 t/s in cell G18 in the throughput tab of the full results sheet they linked to: https://docs.google.com/spreadsheets/u/0/d/1hHd9riWh-hQJnNOz...


Hmm the post now links to this doc where I got my figures from: https://docs.google.com/spreadsheets/d/1hHd9riWh-hQJnNOzEI_Z...

This sheet only lists the response latency, but it does look more recent than the one you listed.

In any event, the sheet you link to shows generation speed and throughput (generation speed is cell E18 which shows 527 t/s). G18 is perhaps prefill + gen but even that doesn't add up.


I noticed column G is always 2x column E. So I think they are counting both input and output tokens in the throughput for G but not for E.

There's only one Google doc and you are linking to the same one I mentioned and the same one in the article... Not sure why you only see latency - there's definitely a throughput tab too.


It's not though, it's close, but it's not.

E18 is 527.75. 2 * 527.75 = 1055.5.

G18 is 1053.16.

So yeah, I have no idea.


https://chipsandcheese.com/2024/06/25/testing-amds-giant-mi3...

Take that article and increase it by the gemm gains.

At the end of the day, MI300x is outperforming an H100.

More importantly, you need 2x H100 to do 70B and only 1x MI300x.


I can't square the numbers in the two articles at all. Other sources seem closer to their optimised results than their unoptimised results:

  Llama3-70b throughput:
    Chips and Cheese: 4800 tok/sec (?!)
    This article: 849 tok/sec (fastest optimised)

  Mixtral 7B latency:
    Chips and Cheese: 1.17s
    This article: 2.04s (fastest unoptimised) / 0.81s (fastest optimised)

  And then finally comparing Mixtral 8x7b to the Reddit article using batch size 1...
    Reddit: 157 tok/sec
    This article: 165 tok/sec (optimised) or 141 tok/sec (unoptimised)


For the Mixtral 7B latency, I think the explanation is just that Chips and Cheese used 128-token requests, while in the article it is 256-token requests.

I still can't explain throughput differences though


Going by the results from the article/video, a single MI300X is even outperforming a Groq system [1]

The video shows that the optimized run with Llama-2 70B gives 314 tokens/s for a bs=1 with 256 prompt + 256 generation. The Groq system is also a bs=1 apparatus and gets you around 300 tokens/s. Wild!

[1] https://wow.groq.com/groq-sets-new-large-language-model-perf...


Someone else on reddit [0] noticed this as well.

Groq does not talk about how many cards they need to get those results. Someone replied to me with this comment [1] a while ago...

[0] https://www.reddit.com/r/AMD_MI300/comments/1dqhrbn/comment/...

[1] https://news.ycombinator.com/item?id=39966620


Yeah, the rumour(?) is that a Groq system required to produce 300+ t/s on a Llama-2 70B (bs=1) needs 576 chips (9 racks) [1]

So, that's like $10M+ for serving bs=1 Llama-2 70B vs whatever a single MI300X costs?

[1] https://twitter.com/swyx/status/1759759125314146699


The exact cost of an MI300x is closely guarded by AMD. I buy them and do not know how much they are. That said, a whole chassis of 8x is far, far, far less than $10M.


You should make a whole post about this! Like how a single MI300X outperforms groq at bs=1.

300 tokens/s with bs=1 for a llama-2 70B on a single card is no joke.


This is why I sponsored doing the chipsandcheese tests on my hardware. That instigated Elio to up the game even further.

All open source by the way.


Thank you for sponsoring this. There's so little buzz about this hardware despite the fact it's clearly amazing for AI use cases. I don't understand why. Maybe this is why Nvidia is the most valuable company in the world - nobody can be bothered to try a competitor.


The numbers in this article don't make sense. They aren't consistent with the hardware (they seem to show the weights being loaded faster than MI300x peak memory bandwidth) and they aren't self-consistent (70B models running only 2x slower than 7B models).

I'm not sure what's going on here but something is not right.


Yeah, I've asked questions along those lines as well. Something sketchy is going on.

See here - https://news.ycombinator.com/item?id=40833109


> Let’s delve into the notable advancements achieved through GEMM tuning of LLMs such as Llama

Smells like it was written by GPT


I agree. This article is clearly written with a generative language model. A few other telltale signs:

1. Repeating the same thing multiple times with slight variation:

* "allowing developers to fine-tune their applications and unlock the full potential of their underlying hardware, ultimately maximising vLLM performance." (fine-tune, unlock potential, maximize performance are all roughly the same thing)

* "AI and machine learning models" (AI and machine learning models are the same thing in the context of this article)

* "utilise multiple threads or cores" (Why differentiate between threads and cores?)

* "tailored to enhance computational efficiency and overall throughput" (efficiency and throughput are highly related)

* "a series of graphs and data visualisations" (all the data visualizations in this article are graphs)

* "more computational effort and time" (same thing)

* "significantly enhanced the performance and efficiency" (same thing)

* "ensuring efficient processing and superior performance for complex and demanding AI workloads" (same things)

2. Explaining what "rocBLAS" stands for multiple times.

3. Other ChatGPTisms:

* "offering a comprehensive view of [...]"

* "Let’s delve into the notable advancements achieved through [...]"

* "ensures quicker processing times, which is crucial for [...]"

* "effectively mitigated these impacts, maintaining [...]"

* "elucidate the impact of"

* "significantly enhanced"

* "These results underscore the critical role of [...]"

* "Key Aspects", "Key Observations", "Key findings"

So why is this bad? Because it undermines the trust in the article. We do not know whether the claims are actually true or whether they were just made up by ChatGPT.


What if that person is not a native English speaker and wrote something up and then threw it into ChatGPT (or a local chatbot running on 1 MI300x :p) just because he felt that his relatively limited vocabulary would not be enough to express everything?

That person (yeah :p) might just be trying to create as much awareness as possible.

You might get annoyed by the usage of LLMs; some might not. I get annoyed by people still trying to undermine the testing done while everything is clearly extremely transparent - even the docker image is shared.

That said, the article is about the results. If you'd like to "delve" a bit deeper into those results, let me know, I'd be happy to go over some of the data visualisations ;-)


If you want to talk about the results then there are quite a few comments (from me!) asking about those ;-)

Snark aside I do want to thank you and others for running these tests. I just wish I could make sense of the results, which seem too good to be true?


Thanks for articulating something I've noticed happening all over the internet and even in YouTube video scripts.

Are claudisms different from gptisms?

Why can't these authors tell ChatGPT to write with different prose and avoid "delve", "crucial", etc?

WizardLM writes these same things in all its answers too.


> Are claudisms different from gptisms?

Sorry, no idea. I rarely use Claude.

> Why can't these authors tell ChatGPT to write with a different prose and avoid "delve", "crucial", etc?

The authors could do this, but that would contradict the reason for using ChatGPT, which is to do less work.

> Wizardlm writes these same things in all its answers too

WizardLM has inherited that from its instruction tuning dataset, which has been generated with ChatGPT: https://openreview.net/pdf?id=CfXh93NDgH



