Fast LLM Inference From Scratch (using CUDA) (andrewkchan.dev)
344 points by homarp 3 months ago | 57 comments



Hi, I'm the author. Thanks for sharing, was great to wake up to my blog post on the front page! Would love to hear any feedback or if I missed anything.


If you ever want to do this in pure C, I show how here:

https://news.ycombinator.com/item?id=42432802

Technically, the device-side code is in CUDA C, which is really C++. No one has written a C compiler that can compile a lightly extended version of ANSI C to PTX, so the device code there must be C-style CUDA C, CUDA Fortran, or hand-written PTX assembly. I went with what was easiest, which was C-style CUDA C.


You can probably get this to run slightly faster using CUDA graphs, which should reduce the amount of communication between the GPU and CPU. I notice that llama.cpp uses them, but glancing at your blog post, it appears that you do not.

You might want to consider switching to bfloat16. The model weights are published as bfloat16, and LLMs are more sensitive to the exponent bits than the mantissa bits, so using fp16 over bfloat16 should slightly hurt the model’s intelligence. Using bfloat16 would require you to use the tensor cores instead of the CUDA cores. Interestingly, llama.cpp performs very poorly when asked to use bfloat16, so performance comparisons of a bfloat16 version would be best done against llama.cpp using fp16, since there is no reason for fp16 and bfloat16 performance to differ other than a lack of optimization.

Using the tensor cores should not make your forward pass any faster despite the additional computational power, since the GEMV operations your forward pass uses should be memory-bandwidth bound. If you ever decide to try doing multiple queries in parallel, though, you will see a performance boost from being able to use that additional compute via GEMM operations.

If you write a prompt processing version of the forward pass, you can use GEMM operations to process all prompt tokens simultaneously even without doing multiple queries in parallel. This would give you fast prompt processing speeds and open another front on which you can try to outdo llama.cpp.
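To illustrate the shape change, here is a toy CPU sketch (naive loops purely for illustration, with made-up row-major layouts; a real implementation would call a tuned GEMM): decode re-reads all of W for every single token, while prefill stacks the T prompt activations so each weight is read once and reused T times.

    /* Decode: y = W x, one token at a time. W is d x n and is streamed from
       memory in full for every token -- memory-bandwidth bound. */
    void gemv(const float *W, const float *x, float *y, int d, int n) {
        for (int i = 0; i < d; i++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++) sum += W[i * n + j] * x[j];
            y[i] = sum;
        }
    }

    /* Prefill: Y = W X, where X packs all T prompt activations (n x T).
       Each row of W is reused across T columns, so the same weight traffic
       now produces T outputs -- which is what lets GEMM become compute bound. */
    void gemm_prefill(const float *W, const float *X, float *Y, int d, int n, int T) {
        for (int i = 0; i < d; i++)
            for (int t = 0; t < T; t++) {
                float sum = 0.0f;
                for (int j = 0; j < n; j++) sum += W[i * n + j] * X[j * T + t];
                Y[i * T + t] = sum;
            }
    }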


Interesting! CUDA graphs indeed look very promising (https://developer.nvidia.com/blog/cuda-graphs/). I'd expected that kernel dispatch is not a problem because it is done asynchronously by the host and profiles suggested that the device thread is rarely idle, but from the NVIDIA blog I see the effective time per kernel goes from 3.8usec to 3.4usec. I guess there is still dispatch overhead that occurs on the device?

Re: bfloat16. Yes, this is a good suggestion, and I'm aware that bfloat16 is better suited for ML applications for the reason you mentioned (the range being more important than the precision) while having the same footprint as float16. TBH, the main reason I haven't done this is the lack of native support on my CPU (EPYC 7702P). That makes it a bit more annoying to maintain format parity between the CPU and GPU backends, which is nice to have because it allows for easy testing; although if I wanted to preserve that, I could just write non-performant, test-only CPU code for bfloat16.

Prompt pre-filling optimizations using GEMMs are great and I plan to explore this when I have free time. I talk about this in the "What's next" section briefly. Optimizing GEMM is a rabbit hole, as you point out in some of the other blog posts you shared. Will be a good next challenge!


FWIW CUDA graphs are only really useful when you have a lot of kernels you want to launch. Otherwise they seem to perform similarly to just launching all the kernels in a loop on the host.


llama.cpp saw a 10% performance improvement from using CUDA graphs, so inference does benefit from it:

https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-in...


Yeah, I mean, I am sure there are workloads that it helps, but there are also a lot where you'd think it would help and the driver will actually just fail to cooperate.


I have not explored CUDA graphs yet in my own code, but llama.cpp saw a 10% performance improvement from it, so it certainly does something:

https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-in...

As for why this is the case, anything I say would be pure speculation, but my current guess is that the standard API is actually synchronous between the GPU and CPU, and is only asynchronous on the CPU between userland and the kernel. This allows CUDA graphs to improve performance by turning all of the synchronous calls between the GPU and CPU into a single synchronous call. Again, this is purely a guess, but if it is right, it would explain how CUDA graphs are able to improve the performance of an asynchronous API.
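For what it is worth, here is a minimal sketch of how the capture API is typically used (my own illustration with a dummy kernel, not anyone's actual inference code): the per-token kernel sequence is recorded into a graph once, instantiated, and then replayed with a single launch per token.

    #include <cuda_runtime.h>

    __global__ void dummy_step(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;   // stand-in for a forward-pass kernel
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMalloc(&x, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Record the per-token kernel sequence once...
        cudaGraph_t graph;
        cudaGraphExec_t graphExec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int layer = 0; layer < 32; layer++)    // stand-in for per-layer kernels
            dummy_step<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graphExec, graph, 0); // CUDA 12 signature; CUDA 11 takes (..., NULL, NULL, 0)

        // ...then replay it with a single launch per token.
        for (int token = 0; token < 100; token++)
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaFree(x);
        return 0;
    }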

As for doing BF16 on CPUs that do not support BF16, you can emulate it in F32 using a widening instruction and a bitshift to load BF16 into F32 SIMD registers. Load the BF16 values into a 128-bit SIMD register, then use _mm256_cvtepu16_epi32() to widen the 16-bit values into 32-bit values in a 256-bit register and do a left bitshift by 16 on them. This trick disregards subnormal numbers, but BF16 does not support subnormal numbers, so we would only need to worry about this when converting from BF16 to F32. F32 does support subnormal numbers, but on amd64 processors, I believe you can set bit 15 (flush to zero) in the MXCSR register so that subnormal results are flushed to zero. Then a shuffle and permute could be used to convert F32 numbers to BF16 numbers like this:

https://godbolt.org/z/dWPaW61Wf

Unfortunately, I do not have an example of using shuffle and permute to go in the opposite direction, so that will be an exercise for the reader. I will likely implement this in the CPU version of my llama3.c fork to make a BF16 version after I figure out how to write performant GEMM kernels, since I would want to have the fast prompt processing with BF16.
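In code, the widening trick described above (BF16 to F32) looks roughly like this. A minimal AVX2 sketch; the BF16 values are assumed to be stored as raw uint16_t bit patterns, and the helper name is just for illustration:

    #include <immintrin.h>
    #include <stdint.h>

    /* Optionally flush subnormal F32 results to zero (bit 15 of MXCSR):
       _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); */

    /* Convert 8 BF16 values (raw bit patterns) to 8 F32 values. Works because
       BF16 is just the top 16 bits of an F32. */
    static inline __m256 bf16x8_to_f32x8(const uint16_t *src) {
        __m128i halves  = _mm_loadu_si128((const __m128i *)src); // 8 x 16-bit
        __m256i widened = _mm256_cvtepu16_epi32(halves);         // zero-extend to 8 x 32-bit
        __m256i shifted = _mm256_slli_epi32(widened, 16);        // move bits into F32 position
        return _mm256_castsi256_ps(shifted);                     // reinterpret as float
    }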

As for GEMM, you can skip having to optimize it by using cuBLAS. Implementing your own GEMM kernels is then only needed if you want to do kernel fusion, for example to compute the sum of squares for rmsnorm as part of the previous GEMM operation, which would reduce memory bandwidth usage slightly.

You could look at how I generalized the forward() function into the precompute_input_logits() function to calculate all prompt processing via GEMM rather than GEMV:

https://github.com/ryao/llama3.c/blob/master/run.c

It was fairly straightforward, although the code is somewhat messy (as it is from a WIP personal research project). My precompute_input_logits() function also needs to be changed to stop assuming the starting position is 0: hypothetically speaking, if this were made into a library and a chat frontend were put in front of it, end users could modify the chat history in the middle, and rather than recompute everything from scratch, we would only need to recompute from the first changed token. However, that is a future improvement to explore.

You might wonder why I am using the batched GEMM function in forward() when everything is on the CPU in that version. All I should need is GEMV there, and my own handwritten AVX2 matmul() implementing GEMV is actually faster than the Intel MKL GEMV. However, shoehorning the GEMV into the Intel MKL batched GEMM makes it even faster. I believe it is reducing memory bandwidth requirements in the forward() pass by having matrix multiplication operations that have locality run in parallel. For computing K and V, the xb vector is shared, so if multiple cores read the same locations in xb, then only 1 read needs to go to system memory. Similarly, in the attention calculations, the matrices’ rows are interleaved, so the CPU is able to read rows for multiple cores simultaneously, and what would have been 32 separate passes over the weight matrices become 2 passes on my Ryzen 7 5800X. At least, this is my best guess as to why the batched API is faster than calling a GEMV routine sequentially 32 times there. Note that using more cores on my Ryzen 7 5800X does not increase memory bandwidth, since a single Zen 3 core in the Ryzen 7 5800X is able to use the CPU’s full memory bandwidth (unlike cores in most other processors).


Oh right, I forgot that BF16 allocates the same number of sign and exponent bits as F32, so we can convert from BF16 to F32 by just padding zeros to the fraction bits. I'd guess conversion in the opposite direction would require rounding rather than truncation if we want to preserve as much intelligence as possible, so something like (https://hhhhhojeihsu.github.io/tensorflow_1.8_woboq/tensorfl...).
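In scalar form, that rounding conversion is roughly the following (a sketch only; NaN handling is omitted, which the linked TensorFlow code takes care of):

    #include <stdint.h>
    #include <string.h>

    /* Round an F32 to BF16 (returned as a raw 16-bit pattern) using
       round-to-nearest-even on the truncated mantissa bits. */
    static inline uint16_t f32_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        uint32_t rounding_bias = 0x7FFF + ((bits >> 16) & 1); /* tie-break toward even */
        return (uint16_t)((bits + rounding_bias) >> 16);
    }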

In any case, a constraint of my exercise was to compare a hand-written inference implementation to existing inference engines (calm/llama.cpp) which both use FP16 for weight and KV cache quantization by default rather than BF16, so a BF16 backend will be an interesting addition but not so relevant to the blog post.

Yes, we can totally use cuBLAS, and it's likely what I'd go with in a production implementation for any kernels I wasn't planning on fusing, but as mentioned in the post, that defeats the point of the exercise and is usually less fun than doing things from scratch. That said, I do plan on exploring for myself at some point what kind of gain I could get from switching existing hand-rolled ops to cuBLAS/cuDNN (and then trying to match that).

Thanks for linking your prefill code with GEMMs! Actually, your code looks pretty readable and well-annotated. It'll be cool digging in as your CPU backend seems better developed than mine. Cool observation about how using batched-GEMM instead of GEMV speeds things up in the single-batch forward() function. I am curious about your intuition behind the reduction in memory bandwidth ("if multiple cores read the same locations in xb..."). Is this about how xb might be kept in the L3 cache which is shared between cores in Zen 3? I wonder if you could validate this idea by adapting your handwritten matmul() function to operate on 2 matrices and outputs at the same time.


+1, rounding f32 to bf16 is helpful. For the other direction, the approach we take in Highway/gemma.cpp is to load a full vector of bf16, then shift/AND to isolate the odd/even elements and convert to float. These can execute two per cycle, whereas promoting 16->32 bit is often just one per cycle (though a different port than FMA).
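A rough AVX2 sketch of that shift/AND decode (my own illustration of the described approach, not the actual Highway code): one 256-bit load of 16 packed BF16 values produces two float vectors, one holding the even-indexed lanes and one the odd-indexed lanes.

    #include <immintrin.h>
    #include <stdint.h>

    /* Decode 16 packed BF16 values into two float vectors. The output order is
       interleaved (even lanes, odd lanes), which is fine for reductions such as
       dot products as long as both operands are handled consistently. */
    static inline void bf16x16_to_f32_even_odd(const uint16_t *src,
                                               __m256 *even, __m256 *odd) {
        __m256i v = _mm256_loadu_si256((const __m256i *)src);
        /* Even lanes sit in the low 16 bits of each 32-bit element: shift them up. */
        *even = _mm256_castsi256_ps(_mm256_slli_epi32(v, 16));
        /* Odd lanes already sit in the high 16 bits: just mask off the low bits. */
        *odd = _mm256_castsi256_ps(
            _mm256_and_si256(v, _mm256_set1_epi32((int)0xFFFF0000)));
    }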


I plan to write my own GEMM implementations so I can do kernel fusion. I am using GEMM implementations from Nvidia and Intel until I have my own performant substitutes. So far, my initial attempts at writing my own GEMM implementations have not been performance competitive. Intel and Nvidia use tiling algorithms. I know (unintentionally) from profiling that Intel is doing a memcpy trick that is presumably similar to the one described here:

https://youtu.be/vPhG9KSxguM

It is also presumably similar to the packing technique mentioned here:

https://salykova.github.io/matmul-cpu

As for the annotations, some might be outdated. Much like the GPU version, I had published the CPU version at the request of another OSS developer and before I had reached the point where I would make a cleanup pass to address any stale/obsolete/missing comments. I made some effort to have good comments for my own benefit during the current stage of development, but I was not disciplined enough to make a cleanup pass unnecessary. That is partly why I claimed it is messy. I am happy to hear that they look good even before I have done a cleanup pass.

As for xb being in L3, that is part of it. I suspect that Intel has multiple threads doing “tiles” from both GEMM (GEMV in GEMM) operations in parallel, and the reads and prefetches on xb are being coalesced by the L3 cache. I also had the idea of adapting my handwritten matmul to operate on 2 outputs at the same time, and even tried it, but it did not work. At the time, I thought this was due to register spilling that occurred because I ran out of ymm registers on Zen 3. Presumably, this experiment could be attempted on Zen 4 using the additional ymm registers that AVX-512 gives to AVX2.

However, two further thoughts occurred to me while writing this response. The first is that this experiment could be done on Zen 3 by doing 6 rows at a time per matrix instead of 8. The second is that this would throw a larger number of parallel linear accesses at the core’s memory prefetcher than what I suspect Intel does, so it might fail to work even if the underlying idea about how Intel is getting better performance is accurate. Honestly, I had been surprised to see Zen 3 handle 9 linear accesses per loop iteration, as I had felt that doing 9 (8 rows + xb) was already asking too much, but to my happy surprise, it was not.

Right now, some GPU-based experiments are at the top of my mental to-do list. I will likely circle back to the CPU side to try to understand how to get this performance from my own GEMM/GEMV kernels at some point after doing them.


> we would only need to worry about this when converting from BF16 to F32

This should be:

> we would only need to worry about this when converting from F32 to BF16

My apologies for the typo.


I tried this (and backprop) in GPGPU compute shaders once. One issue that really tripped me up is matrix multiplication along dimensions that are not powers of 2. It will take me a while to learn to read CUDA code -- is this just an easier thing to do in CUDA, or did you figure out a good trick for it?


I know a trick, although it relies on the closed source cuBLAS library. The matrix multiplication in non-batched forward passes is really matrix-vector multiplication, but you can shoehorn that into matrix-matrix multiplication by setting the right dimension to 1. cuBLAS uses column-major ordering to match Fortran instead of the row-major ordering of C/C++, but you can compensate by swapping transa and transb, m and n, a and b, lda and ldb; and if you are using cublasGemmEx() to use tensor cores, atype and btype. This relies on the matrix identity that if A * B = C, then B’ * A’ = C’. You can experiment with this on a CPU using a BLAS library like OpenBLAS or the Intel MKL before trying to do this on a GPU.
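Concretely, the swap trick looks something like this for y = W * x with a row-major W of shape d x n (a sketch; the function name and error handling are omitted for brevity):

    #include <cublas_v2.h>

    /* Compute y = W * x by asking the column-major API for x' * W' = y',
       i.e. swapping the operand order and the m/n/lda/ldb arguments, with the
       "vector" side given a dimension of 1. */
    void gemv_via_gemm(cublasHandle_t handle,
                       const float *d_W,   /* device ptr, d x n, row-major */
                       const float *d_x,   /* device ptr, length n */
                       float *d_y,         /* device ptr, length d */
                       int d, int n) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,
                    1, d, n,          /* m=1 (the vector), n=d, k=n after the swap */
                    &alpha,
                    d_x, 1,           /* "A" = x', leading dimension 1 */
                    d_W, n,           /* "B" = W' (row-major W seen as column-major W') */
                    &beta,
                    d_y, 1);          /* "C" = y', which lands as plain contiguous y */
    }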

That said, to do matrix-vector multiplication (GEMV), you just need to compute the dot product of each row of the matrix with the vector to get your output vector. You could probably just handle each dot product in power-of-2 chunks on the GPU to compute partial sums. When you get to the last chunk that is not a power of 2, just have the threads that go past the end of the dot product contribute 0 to the partial sums. Then you would have the threads do a horizontal sum of the partial sums to finish the computation of an entry in the output vector.

As for matrix-matrix multiplication (GEMM), that is much harder to do in a performant manner. If you do it by treating the multiplication as a series of matrix-vector multiplications, you will be bandwidth limited rather than compute limited. You would need to implement tiling to get good performance, but there are many tricks needed and it is very hard to outperform cuBLAS on Nvidia hardware. Here is one person’s attempt to do it:

https://siboehm.com/articles/22/CUDA-MMM

Here is a fairly cryptic description of how to get close to cuBLAS that mentions a few key techniques:

https://accu.org/journals/overload/32/181/schuetze/

Here is another guy’s attempt that actually managed to outperform cuBLAS on the H100:

https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...


Wow, great answer. That deserves more upvotes than it's probably going to get. So what that means for compute shaders is I will need to partition the vectors being multiplied into power of 2 sized partitions and repeatedly call the kernels, but with offsets appropriate to the partition size, and then have the overrun contribute zero as you said.

Obviously not as efficient as doing it all GPU-side, but still faster than doing it on the CPU. Thank you!


You don’t need to repeatedly call the kernels. You just need to loop inside the kernels to have each thread process every nth multiplication according to its thread id, adding it to the thread’s partial sum, and have a branch in the loop skip the addition (or add 0) when you exceed the range. Afterward, you would do a horizontal sum of the partial sums to get the value for the corresponding field in the output vector, and you would have a “grid” do this for all members of the output vector. The original poster’s matmul kernel does this. See the example under “we can perform a warp sum reduction to combine results from all threads”:

https://andrewkchan.dev/posts/yalm.html#section-3.2

It does “if (i >= d) return;“ to handle non-multiples in one dimension and skips the addition in its dot product, matmul_row(), to handle non-multiples in the other dimension.

In CUDA, you have threads that are organized into 32-thread warps, which are organized into blocks of warps, which are organized into grids. Threads within a block can share memory, but blocks within a grid cannot. The hardware will effectively loop over blocks in the grid until it finishes. The dimension for the output vector does not require sharing memory, so we can set that to the grid size. The hardware might actually round the grid up, so “if (i >= d) return;” is a check for the case where we are executing past the end of the output vector, in case the hardware schedules past the end of the vector.

As for the other dimension, we only have a limited number of threads (in this example 32, which is a single warp), so we need each to do partial sums of the multiplications in the dot product. We essentially break the dot product into thread count (32 here) chunks. Each thread will know which multiplication in the thread count chunk to compute based on its thread id, and we keep adding the thread count (32 here) to that to get the next one until we have finished. The for loop’s loop condition protects us from going past the end of the dot product dimension.

Then we have thread-count-many partial sums, so we need to do a horizontal sum across the threads in a block to get the final sum. This part is a bit magical and is done via a warp sum reduction. Since the threads share memory, you basically have them start summing their neighbors’ values into their own. Basically, all threads find pairs and one of each pair does the sum, cutting the number of threads doing sums in half. Then the ones that did the sums pair up again. This again cuts the number in half. This continues until only 1 thread remains. Then only the remaining thread is allowed to write to “out[i] = rowSum” due to “if (threadIdx.x == 0)”. The warp itself is somewhat special since it can do a really fast horizontal sum via shuffles that cannot be done when using thread blocks that utilize multiple warps. That is why the author chose to have the block size be exactly 1 warp rather than a multiple of warps.
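Put together, the pattern being described looks roughly like this (a compact sketch written from scratch for illustration, mirroring the structure of the blog’s kernel rather than reproducing its exact code):

    /* One warp (32 threads) per block; one block per output row.
       Launch as matvec_warp_per_row<<<d, 32>>>(W, x, out, d, n). */
    __global__ void matvec_warp_per_row(const float *W,   // d x n, row-major
                                        const float *x,   // length n
                                        float *out,       // length d
                                        int d, int n) {
        int i = blockIdx.x;           // which output row this block handles
        if (i >= d) return;           // grid may be rounded up past d

        // Vertical phase: each thread accumulates every 32nd element of the
        // dot product; the loop bound handles n that is not a multiple of 32.
        float partial = 0.0f;
        for (int j = threadIdx.x; j < n; j += warpSize)
            partial += W[i * n + j] * x[j];

        // Horizontal phase: warp sum reduction via shuffles.
        for (int offset = warpSize / 2; offset > 0; offset >>= 1)
            partial += __shfl_down_sync(0xFFFFFFFF, partial, offset);

        if (threadIdx.x == 0)
            out[i] = partial;         // lane 0 holds the full row sum
    }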

It helps to think about the threads in the block as a spreadsheet. The columns are the threads and the rows represent instructions. Every thread executes the same instruction, and when a branch says not to do something for only some threads, the ones excluded do noops until execution reaches a point where they all do the same thing. An if else similarly has noops in threads not executing true, followed by noops in the threads not executing false. This is called branch divergence. There are diagrams showing a spreadsheet like view of this, such as the following, which shows a cutdown example:

https://image1.slideserve.com/2427671/branch-divergence-l.jp...

I have mentioned a horizontal sum multiple times. That just means summing across the columns inside rows of the “spreadsheet”. It is called horizontal since rows are horizontal. Coincidentally, the partial sum calculation is called a vertical sum since the columns are vertical.

Now we need to consider the grid, which is yet another dimension and is the dimension along which we populate the output vector. Basically, you can view the grid as doing multiple spreadsheets in parallel, but not all at once. One comical way to imagine it is to imagine having printed the same spreadsheet hundreds of times and then feeding them into a paper shredder head first. A normal paper shredder can only do a few at a time, and similarly, the GPU can only do so many at a time too. You can imagine the processing proceeding along the rows in all of the spreadsheets being shredded at the same time.


Damn, wow. Good stuff. First of all, I assumed looping within kernels was "always bad" from a performance standpoint (not that I'd done any benchmarks on looping vs repeated kernel invocation). I also assumed it was "always bad" to include conditionals.

That certainly opens up a lot of possibilities on algorithms. It also explains why people think I'm crazy when I say I don't know how to multiply matrices that are not in power of 2 dimensions haha.

Thank you, not sure why I was limiting myself like that. Not too many people to chat to about low level GPGPU stuff!


It is an area where I am still learning, but as far as I know, the badness of things is relative. CPU-GPU communication is the major thing to avoid. In my CUDA llama3.c fork, making changes to reduce CPU-GPU communication generally increases performance. That means combining kernels and reducing cudaMemcpy and cudaMalloc calls. CUDA graphs likely could give further reductions, but I have yet to run out of ideas for more normal methods for getting them.

To give a concrete example of CPU-GPU communication in a tight loop being bad (much like what you were planning to do), when I initially moved softmax to the GPU, I had been invoking the kernel in a host side loop (since attention calls softmax multiple times per pass), and it slowed down inference by a few tokens per second versus the CPU softmax. I eliminated the host side loop by using the grid dimension so that 1 CUDA call did all of the loop iterations implicitly and inference became a few tokens per second faster than the CPU softmax.

As for branches, avoiding them on the GPU where possible can be beneficial, as long as you are not doing something more expensive like CPU-GPU communication in their place. For example, in my softmax kernel, rather than do a branch to calculate the maximum value inside my loop, I use fmaxf(). This is better since there is no branch, and fmaxf is incredibly cheap since it is a single PTX instruction (which presumably maps to a single SASS instruction).

GPGPU grew out of work on programmable shaders for GPUs, and a big part of that was branching. As long as you are minimizing branch divergence and only using necessary branches, you are fine as far as branching on a GPU is concerned.


Loops and conditions within kernels are much better than the alternative, since synchronizing with the CPU is quite expensive. If you unroll appropriately and hide latencies they are very effective.


Sometimes, the compiler will unroll for you. I have noticed that nvcc will unroll this:

    #define warpSize 32
    float warpMax = partial_max;
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        float otherValue = __shfl_down_sync(0xFFFFFFFF, warpMax, offset);
        warpMax = fmaxf(warpMax, otherValue);
    }
Into this:

        bar.sync        0;
        shl.b32         %r40, %r2, 2;
        add.s32         %r8, %r39, %r40;
        ld.shared.f32   %f24, [%r8];
        mov.u32         %r43, 0;
        mov.b32         %r44, %f24;
        shfl.sync.down.b32      %r48|%p13, %r44, %r23, %r22, %r24;
        mov.b32         %f25, %r48;
        max.ftz.f32     %f26, %f24, %f25;
        mov.b32         %r49, %f26;
        shfl.sync.down.b32      %r51|%p14, %r49, %r27, %r22, %r24;
        mov.b32         %f27, %r51;
        max.ftz.f32     %f28, %f26, %f27;
        mov.b32         %r52, %f28;
        shfl.sync.down.b32      %r54|%p15, %r52, %r30, %r22, %r24;
        mov.b32         %f29, %r54;
        max.ftz.f32     %f30, %f28, %f29;
        mov.b32         %r55, %f30;
        shfl.sync.down.b32      %r56|%p16, %r55, %r21, %r22, %r24;
        mov.b32         %f31, %r56;
        max.ftz.f32     %f32, %f30, %f31;
        mov.b32         %r57, %f32;
        shfl.sync.down.b32      %r59|%p17, %r57, %r35, %r22, %r24;
        mov.b32         %f33, %r59;
        max.ftz.f32     %f34, %f32, %f33;
        mov.b32         %r60, %f34;
This avoids loop branching, which is unnecessary in this case because we know in advance when the loop terminates. The use of ftz instructions is because I compiled using `nvcc -ftz true -ptx -arch=sm_86 rung.cu -o rung.ptx`.

I am not sure if unrolling loops makes sense on a GPU when we don't know when the loop will terminate in advance. If you can suggest any references explaining when that is helpful, I would be happy to read them. :)


Even if you don't know the number of iterations, it can be helpful to "partially unroll" the loop, for example going by 8 or 16 elements at a time and doing a single check for all elements to make sure you're not doing more work than you were asked to do. Not only does this amortize checks for the loop body, it can also enable optimizations like automatic vectorization.
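A sketch of what that looks like in a kernel (illustrative only; the factor of 4 and the names are arbitrary):

    /* Each thread owns 4 consecutive elements, so a single bounds check covers
       the whole group and the adjacent loads/stores may be fused into wider
       accesses by the compiler. Launch with ceil(n/4) total threads. */
    __global__ void scale_partial_unroll(float *x, int n, float a) {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        if (base + 3 < n) {
            x[base]     *= a;
            x[base + 1] *= a;
            x[base + 2] *= a;
            x[base + 3] *= a;
        } else {
            for (int i = base; i < n; i++)   // tail: the last partial group
                x[i] *= a;
        }
    }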


Automatic vectorization is not relevant to CUDA device code because the threading model is implicitly vector based. Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).

To be fair, there are actually 14 SIMD instructions intended for use by video operations, but I would be surprised if any compiler implemented optimization passes to use them since most code cannot use them:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

Reducing loop overhead does make sense, although I suspect this would be best informed by profiling. That said, I found the following explanation of why unrolling on a GPU helps:

https://www.researchgate.net/post/CUDA-Is_it_worth_it_to_unw...

It mentions what had me wondering if loop unrolling made sense on GPUs, which is:

> Branch prediction does not even exists on GPUs. The GPU thread scheduler will just switch execution to a different warp until the outcome of the branch has been resolved.

It then goes on to mention a few other benefits to loop unrolling on a GPU, including reducing loop overhead.


As saagarjha mentions, vectorization of loads and stores is important for memory bandwidth and can be done automatically after unrolling a loop. Another important compiler optimization which requires and is applied after loop unrolling is pre-fetching: that is, for a loop whose iterations are independent and each perform loads and then some computation depending on the loaded value, we can re-arrange the loads to be grouped before the computations. The thread can use ILP to continue issuing loads while previous ones are still in-flight as long as it still has registers to store the results. Without unrolling, the computations are stalled waiting for the load instructions to return data, while with unrolling, we are able to overlap loads and make much better use of memory bandwidth.

I describe a situation in my blog post where automatic unrolling and pre-fetching was no longer being applied after changing a kernel to use FP16, and how I re-applied the optimizations manually to regain performance: https://andrewkchan.dev/posts/yalm.html#section-3.5

Here's an NVIDIA blog post which discusses pre-fetching and unrolling more generally: https://developer.nvidia.com/blog/boosting-application-perfo...


Thanks!


BTW, this part is not entirely true:

> Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).

It is true that at one point instructions were executed in SIMT lockstep in warps, which are equal-size groups of threads (each mapping to one CUDA core) that subdivide blocks and are the fundamental unit of execution on the hardware.

However, since Volta (2017), the execution model is allowed to move threads in a warp forward in any order, even in the absence of conditional code. Although from what I have seen, for now it appears that threads still move forward in SIMT lockstep and only diverge into active-inactive subsets at branches. That said, there is no guarantee on when the subsets may re-converge, and this is also only behavior which is done for efficiency by the hardware (https://stackoverflow.com/a/58122848/4151721) rather than in order to comply with any published programming model, e.g. it's implementation-specific behavior that could change at any time.

This is why the NVIDIA documentation (https://developer.nvidia.com/blog/using-cuda-warp-level-prim...) says to use __syncwarp() for operations with intra-warp dependencies and to not assume lockstep execution.
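To make the pitfall concrete, here is a small sketch (my own illustration, not from the linked post) of an intra-warp shared-memory reduction that is only correct because of the explicit __syncwarp() calls; omitting them was a common pre-Volta pattern that relied on lockstep execution.

    /* Assumes a single 32-thread block (one warp), as in the kernels above.
       Launch as warp_reduce_shmem<<<1, 32>>>(d_in, d_out). */
    __global__ void warp_reduce_shmem(const float *in, float *out) {
        __shared__ float buf[32];
        int lane = threadIdx.x;          // 0..31
        buf[lane] = in[lane];
        __syncwarp();                    // make all writes visible before reading
        for (int offset = 16; offset > 0; offset /= 2) {
            if (lane < offset)
                buf[lane] += buf[lane + offset];
            __syncwarp();                // required: lockstep is not guaranteed
        }
        if (lane == 0)
            *out = buf[0];               // buf[0] now holds the sum of all 32 inputs
    }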


I've seen divergence across high-latency instructions like memory loads.


Ok, where do you nerds hang out and why am I not there?? I'm loving this discussion, y'all seem to be a rather rare breed of dev though. Where is the community for whatever this sort of dev is called? i.e., Clojure has the Clojurians Slack, the Clojurians Zulip, we have an annual Clojure conference and a few spin-offs. Where do you guys hang out??

This stuff is really awesome and I would love to dig in more!


I have not found a specific hangout to have such discussions. They just organically happened here.


Sure it is. Both loads and stores can be vectorized (and should, otherwise you’re leaving memory bandwidth on the table).


That is on a CPU. A GPU works differently, such that threads on a GPU implicitly vectorize loads and stores as part of their warp/block. My question had concerned GPUs, where you cannot vectorize instructions by loop unrolling since the instructions are already vector instructions.


I think you have a mistaken understanding of how GPUs work? There is some "vectorization" across threads in the form of coalescing but what I am talking about is literally a vectorized load/store, the same you would see on a CPU. Like, you can do a ld/ld.64/ld.128 to specify the width of the memory operation. If your loops load individual elements and it is possible to load them together then the compiler can fuse them together.
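For reference, the usual way to get those wide accesses from CUDA C is to go through a vector type such as float4, which compiles to a single 128-bit ld.global/st.global (a small sketch; it assumes 16-byte-aligned pointers and n divisible by 4):

    /* Copy using 128-bit loads/stores: each thread moves 4 floats at once.
       cudaMalloc allocations satisfy the alignment requirement. */
    __global__ void copy_float4(const float *__restrict__ src,
                                float *__restrict__ dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n / 4) {
            float4 v = reinterpret_cast<const float4 *>(src)[i];  // ld.global.v4.f32
            reinterpret_cast<float4 *>(dst)[i] = v;               // st.global.v4.f32
        }
    }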


That makes more sense. When you said automatic vectorization, I was thinking about SIMD calculations. Nvidia does support doing 128-bit loads and stores:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...


Off topic, but would you consider adding an RSS feed to your blog?


Thank you for writing this!


Excellent, amazing article.

To the author, if you're lurking here, I have a tangential question- how long did it take you to write this article? From first line of code to the last line of this post?

As someone who works in GPGPU space, I can imagine myself writing an article of this sort. But the huge uncertainty around time needed has deterred me so far.


Why not give it a shot and see how it goes? You should be able to have some idea if you want to proceed within an hour or two.


I'd say this was a bit over 3 weeks of full time coding, learning, and writing this blog post spread out over 1.5+ months. Someone with GPGPU/CUDA experience would've been able to do it faster. But I'd echo the other commenters here - just try writing some content and see if it works for you!



I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.

Programming these is a nightmare as you need to have several in flight concurrently for peak performance.

Perhaps you don't need the extra flops as you end up bandwidth bound?

Regardless, the good thing about the code in the blog is that it'll probably work pretty well for other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.


For latency-bound inference (i.e. one request) you don't need tensor-cores since all your operations are just matrix vector multiplications.


Good point yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations are helpful for training or for throughput-optimised batched inference but not really for a batch size of one.


I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance, but can only work for certain operations like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding - makes sense that they don't help when it's all matrix-vector.


I wonder how the perf in tokens/second compares to my version of Mistral: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...

BTW, see that section of the readme about quantization: https://github.com/Const-me/Cgml/tree/master?tab=readme-ov-f...


Great post! I've been looking to get into the guts of large scale model training (I'm half-way between the design and application layer of LLMs, mostly in python, sometimes a bit of c++) and this will be a great reference to have.

PS: I'd appreciate it if anyone can recommend more material like this.


This is great thank you!

Does anyone know of something similar in Python? I want to share with my team something similar to this that goes into (almost) everything (at least conceptually) needed to efficiently serve an LLM.

It doesn’t actually need to be performant, mind you (it’s in Python); I just need something “conceptually complete” while being more “tutorial style” and concise than the vLLM codebase.


Using Jax, you should be able to get good performance out of the box


What are the prerequisites for this kind of thing? I've written ANNs back in college and understood backpropagation and gradient descent at some point, but I don't know most of the terms mentioned in the architectural overview. How big of an investment is this?


Isn’t __shfl_down not recommended these days because of warp synchronization issues?


Oops, you're right and it's a difference between my blog post and source code. It should be __shfl_down_sync as seen [here](https://github.com/andrewkchan/yalm/blob/8c908f23f5d8cc3f14c...)


Great article. I think what you should cover next is collective matmuls and sharding.


I notice that the CUDA example uses C++. I had not planned to publish my own work in this area yet (as I am still working on it), but if anyone wants fast llama inference using CUDA in C, here are some files:

https://bpa.st/CA6A

https://bpa.st/WOSA

It is a fork of:

https://github.com/jameswdelancey/llama3.c

I am compiling and running it this way:

    nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx

    gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c

    ./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"

I am getting around 48 to 49 tokens per second with bf16 on my 3090 Ti from this. Oddly, llama.cpp only gets around 1 token per second with bf16, but it gets around 51 to 52 tokens per second with fp16. I suspect the performance gap would close if I use CUDA graphs.

The code is ugly in comparison to the original and I am not finished optimizing it, which is why I have not yet pushed it to a GitHub branch. I still have a few ways to reduce the amount of communication between the GPU and CPU per token, before using CUDA graphs. In particular, using the batched API for computing K and V, prefilling an array of pointers for the batched API across all layer iterations (such that all CPU to GPU pointer copies for the batched API are aggregated) and a few opportunities for kernel fusion, especially in rmsnorm after I figure out how to calculate the sum of squares quickly without cublas. I also have yet to try using Nvidia’s profiler.

Technically, CUDA C is really extended C++, but I am using a C-like subset of C++ for the device code. For the host code, I have written it in ANSI C. There are a few artifacts (the __host__ and __device__ keywords) from an earlier version where I had tried using CUDA C before I realized it was really extended C++ and not extended C, but those are trivial to remove (and will be removed before I push it to a branch). It should be possible to compile the device code with Clang instead of Nvidia’s compiler to use a fully open source toolchain, although I have yet to try it.

I have another fork here that uses the MKL to make the prompt processing portion run faster:

https://github.com/ryao/llama3.c

When I am happy with the CUDA version, I will either push it to a branch or to its own set of files in master.


Thank you for sharing! Would've loved to peek at your CUDA code but looks like it's gone now. Looking forward to seeing the full version when it's finished.


Ask and you shall receive:

https://github.com/ryao/llama3.c/commit/dc27f2b1c6884c72bd4b...

It is still a WIP, but the CUDA kernels are unlikely to see much improvement for a while, as I had gone over them with a fine-toothed comb for performance in the past few days. The C code is more likely where I will be making improvements. For example, I need to switch to a structure to stop abusing an array of void pointers for the cudaMemcpy() aggregation trick for the cublasGemmBatchedEx() calls.

The current version is performing either on par to or slightly better than llama.cpp's F16 codepath on my 3090 Ti, and I have not even implemented CUDA graphs yet. Both get around 53 tokens per second, although on really good runs, I have seen my code do 54 tokens per second and I have not seen that from llama.cpp so far.


Very cool! I have been bedridden and unable to do much work the last few days but it will be cool to dig into this.


I hope you feel better soon. If you want to get in touch in the future after we have both stopped checking this thread, you can find my email on one of my commits on GitHub.


Thanks, will do!


super cool post!!



