Hi, I'm the author. Thanks for sharing, was great to wake up to my blog post on the front page! Would love to hear any feedback or if I missed anything.
You can probably get this to run slightly faster using CUDA graphs, which should reduce the amount of communication between the GPU and CPU. I notice that llama.cpp uses them, but glancing at your blog post, it appears that you do not.
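The capture flow itself is small. Here is a minimal sketch, assuming a single stream and a fixed per-token launch sequence (the names and helper functions are illustrative, not from your code):

    /* Minimal CUDA graph sketch: capture the per-token launch sequence once,
     * then replay it with one cudaGraphLaunch() per generated token. */
    #include <cuda_runtime.h>

    static cudaGraph_t graph;
    static cudaGraphExec_t graph_exec;
    static cudaStream_t stream;

    static void build_token_graph(void)
    {
        cudaStreamCreate(&stream);
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        /* ... enqueue the whole per-token forward pass into `stream` here ... */
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);
    }

    static void run_one_token(void)
    {
        cudaGraphLaunch(graph_exec, stream);  /* one submission instead of many */
        cudaStreamSynchronize(stream);
    }

The main caveat is that the captured launches must not depend on host-side work between kernels, which is exactly the CPU/GPU chatter the graph is meant to remove.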
You might want to consider switching to bfloat16. The model weights are published as bfloat16, and LLMs are more sensitive to the exponent bits than the mantissa bits, so using fp16 over bfloat16 should slightly hurt the model’s intelligence. Using bfloat16 would require you to use the tensor cores instead of the CUDA cores. Interestingly, llama.cpp performs very poorly when asked to use bfloat16, so performance comparisons of a bfloat16 implementation would be best done against llama.cpp using fp16, since there is no reason for fp16 and bfloat16 performance to differ other than a lack of optimization.
Using the tensor cores should not make your forward pass any faster despite the additional computational power, since the GEMV operations your forward pass uses should be memory bandwidth bound. However, if you ever decide to try doing multiple queries in parallel, you will see a performance boost from being able to use that additional compute via GEMM operations.
If you write a prompt processing version of the forward pass, you can use GEMM operations to process all prompt tokens simultaneously even without doing multiple queries in parallel. This would give you fast prompt processing speeds and open another front on which you can try to outdo llama.cpp.
I tried this (and backprop) in GPGPU compute shaders once. One issue that really tripped me up is matrix multiplication along dimensions that are not powers of 2. It will take me a while to learn to read CUDA code -- is this just an easier thing to do in CUDA, or did you figure out a good trick for it?
I know a trick, although it relies on the closed source cuBLAS library. The matrix multiplication in non-batched forward passes is really matrix-vector multiplication, but you can shoehorn that into matrix-matrix multiplication by setting the right dimension to 1. cuBLAS uses column major ordering to match Fortran instead of the row major ordering of C/C++, but you can compensate by swapping transa and transb, m and n, a and b, lda and ldb; and, if you are using cublasGemmEx() to use tensor cores, atype and btype. This relies on the fact that if A * B = C, then B’ * A’ = C’ for matrices. You can experiment with this on a CPU using a BLAS library like OpenBLAS or the Intel MKL before trying to do this on a GPU.
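As a concrete sketch of the swapped-argument call, here is how I would phrase a row-major GEMV y = W * x through cublasGemmEx() with bf16 inputs and fp32 accumulation (the names and sizes are illustrative, not from the post):

    #include <cublas_v2.h>
    #include <cuda_bf16.h>

    /* y (rows) = W (rows x cols, row-major) * x (cols).
     * Column-major cuBLAS sees the row-major W as W', so we compute
     * y' (1 x rows) = x' (1 x cols) * W' (cols x rows) instead. */
    void gemv_via_gemm(cublasHandle_t handle,
                       const __nv_bfloat16 *W, const __nv_bfloat16 *x,
                       __nv_bfloat16 *y, int rows, int cols)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     1, rows, cols,            /* m, n, k after the swap */
                     &alpha,
                     x, CUDA_R_16BF, 1,        /* a, atype, lda */
                     W, CUDA_R_16BF, cols,     /* b, btype, ldb */
                     &beta,
                     y, CUDA_R_16BF, 1,        /* c, ctype, ldc */
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }

The prompt-processing GEMM mentioned earlier is the same call with m raised from 1 to the number of prompt tokens, with the activation layout and transpose flags adjusted to match.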
That said, to do matrix-vector multiplication (GEMV), you just need to compute dot products of all of the rows of the matrix with the vector to get your output vector. You could probably just handle each dot product in power of 2 chunks on the GPU to compute partial sums. When you get to the last chunk that is not a power of 2, just have the threads that go past the end of the dot product contribute 0 to the partial sums. Then you would have the threads handle the horizontal sum of the partial sums to finish the computation of an entry in the output vector.
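In CUDA, the simplest form of that idea looks something like this (a rough fp32 sketch with made-up names, one thread block per output row):

    /* y (rows) = W (rows x cols, row-major) * x (cols).
     * Each thread strides over one row; columns past the end are simply never
     * read, which is the "contribute 0" part, and a shared-memory reduction
     * finishes the dot product. Assumes blockDim.x == 256, gridDim.x == rows. */
    __global__ void gemv_rows(const float *W, const float *x, float *y,
                              int rows, int cols)
    {
        __shared__ float partial[256];
        int row = blockIdx.x;
        if (row >= rows)
            return;
        float sum = 0.0f;
        for (int col = threadIdx.x; col < cols; col += blockDim.x)
            sum += W[(size_t)row * cols + col] * x[col];
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            y[row] = partial[0];
    }

The same structure maps onto a compute shader: one workgroup per row, a strided loop over the columns, and a workgroup-shared reduction at the end.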
As for matrix-matrix multiplication (GEMM), that is much harder to do in a performant manner. If you do it by treating the multiplication as a series of matrix-vector multiplications, you will be bandwidth limited rather than compute limited. You would need to implement tiling to get good performance, but there are many tricks needed and it is very hard to outperform cuBLAS on Nvidia hardware. Here is one person’s attempts to do it:
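To give a flavor of the starting point before all of those tricks, here is a bare-bones shared-memory tiling sketch (fp32, row-major, illustrative names); real kernels layer register blocking, vectorized loads, double buffering and more on top of this:

    #define TILE 16

    /* C (M x N) = A (M x K) * B (K x N), all row-major. Each block computes
     * one TILE x TILE tile of C by marching TILE-wide slabs of A and B
     * through shared memory. */
    __global__ void gemm_tiled(const float *A, const float *B, float *C,
                               int M, int N, int K)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < (K + TILE - 1) / TILE; t++) {
            int a_col = t * TILE + threadIdx.x;
            int b_row = t * TILE + threadIdx.y;
            /* Out-of-range elements load as 0, so odd sizes still work. */
            As[threadIdx.y][threadIdx.x] =
                (row < M && a_col < K) ? A[(size_t)row * K + a_col] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (b_row < K && col < N) ? B[(size_t)b_row * N + col] : 0.0f;
            __syncthreads();
            for (int i = 0; i < TILE; i++)
                acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N)
            C[(size_t)row * N + col] = acc;
    }

The boundary checks are also how non-power-of-2 (and non-multiple-of-tile) sizes are handled: out-of-range elements are loaded as zero, so they contribute nothing to the accumulator.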
Wow, great answer. That deserves more upvotes than it's probably going to get. So what that means for compute shaders is that I'll need to partition the vectors being multiplied into power-of-2-sized partitions and repeatedly call the kernels with offsets appropriate to the partition size, and then have the overrun contribute zero as you said.
Obviously not as efficient as doing it all GPU side, but still faster than doing it on the CPU. Thank you!
Technically, the device side code is in CUDA C, which is really C++, but no one has written a C compiler that can compile a lightly extended version of ANSI C to PTX, so the device code must be written in C-style CUDA C, CUDA Fortran, or hand-written PTX assembly. I went with what was easiest, which was C-style CUDA C.
To the author, if you're lurking here, I have a tangential question: how long did it take you to write this article, from the first line of code to the last line of this post?
As someone who works in GPGPU space, I can imagine myself writing an article of this sort. But the huge uncertainty around time needed has deterred me so far.
I'd say this was a bit over 3 weeks of full time coding, learning, and writing this blog post spread out over 1.5+ months. Someone with GPGPU/CUDA experience would've been able to do it faster. But I'd echo the other commenters here - just try writing some content and see if it works for you!
Great post! I've been looking to get into the guts of large scale model training (I'm half-way between the design and application layer of LLMs, mostly in python, sometimes a bit of c++) and this will be a great reference to have.
PS. appreciate it if anyone can recommend more material like this
I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.
Programming these is a nightmare as you need to have several in flight concurrently for peak performance.
Perhaps you don't need the extra flops as you end up bandwidth bound?
Regardless, the good thing about the code in the blog is that it'll probably work pretty well for other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.
Good point yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations are helpful for training or for throughput-optimised batched inference but not really for a batch size of one.
I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance, but can only work for certain operations like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding - makes sense that they don't help when it's all matrix-vector.
Does anyone know of something similar in Python? I want to share with my team something similar to this that goes into (almost) everything (at least conceptually) needed to efficiently serve an LLM.
It doesn’t actually need to be performant, mind you (it’s in Python); I just need something “conceptually complete” while being more “tutorial style” and concise than the vLLM codebase.
I notice that the CUDA example uses C++. I had not planned to publish my own work in this area yet (as I am still working on it), but if anyone wants fast llama inference using CUDA in C, here are some files:
./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"
I am getting around 48 to 49 tokens per second with bf16 on my 3090 Ti from this. Oddly, llama.cpp only gets around 1 token per second with bf16, but it gets around 51 to 52 tokens per second with fp16. I suspect the performance gap would close if I use CUDA graphs.
The code is ugly in comparison to the original and I am not finished optimizing it, which is why I have not yet pushed it to a GitHub branch. I still have a few ways to reduce the amount of communication between the GPU and CPU per token before using CUDA graphs: using the batched API for computing K and V, prefilling an array of pointers for the batched API across all layer iterations (so that all CPU-to-GPU pointer copies for the batched API are aggregated), and a few opportunities for kernel fusion, especially in rmsnorm once I figure out how to calculate the sum of squares quickly without cuBLAS. I also have yet to try using Nvidia’s profiler.
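For the rmsnorm case specifically, the sum of squares is just another block reduction, so the obvious approach is something like this rough fp32 sketch (assumed names, one block per vector, blockDim.x == 256), though I have not yet checked whether it is fast enough compared to cuBLAS:

    /* Fused rmsnorm: block-reduce the sum of squares, then normalize and
     * apply the per-channel weight in the same kernel. */
    __global__ void rmsnorm(float *out, const float *x, const float *weight,
                            int dim, float eps)
    {
        __shared__ float partial[256];
        float ss = 0.0f;
        for (int i = threadIdx.x; i < dim; i += blockDim.x)
            ss += x[i] * x[i];                 /* per-thread partial sum of squares */
        partial[threadIdx.x] = ss;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        float scale = rsqrtf(partial[0] / dim + eps);   /* 1 / rms */
        for (int i = threadIdx.x; i < dim; i += blockDim.x)
            out[i] = weight[i] * (x[i] * scale);        /* fused normalize + scale */
    }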
Technically, CUDA C is really extended C++, but I am using a C-like subset of C++ for the device code. For the host code, I have written it in ANSI C. There are a few artifacts (the __host__ and __device__ keywords) from an earlier version where I had tried using CUDA C before I realized it was really extended C++ and not extended C, but those are trivial to remove (and will be removed before I push it to a branch). It should be possible to compile the device code with Clang instead of Nvidia’s compiler to use a fully open source toolchain, although I have yet to try it.
I have another fork here that uses the MKL to make the prompt processing portion run faster:
What are the prerequisites for this kind of thing? I've written ANNs back in college and understood backpropagation and gradient descent at some point, but I don't know most of the terms mentioned in the architectural overview. How big of an investment is this?