Initial CUDA Performance Lessons

elashri · 2024-10-11T13:52:12.000000Z

I like this writeup as it summarizes my journey with optimizing some cuda code I wrote for an LHC experiment trigger. But there are few comments on some details.

There are 65536 registers per SM not thread block and while you can indirectly control that by making your block takes all the SM but this presents its own problems.

NVIDIA hardware limits the threads max number to 1024 (2048) and shared memory to 48 KB (64 KB) per SM. So if you consume all of that in one thread block or near the maximum then you are using one thread block per SM. You don't usually want to do that because it will lower your occupancy. Additionaly , If the kernel you’re running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM could leave some compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks could cause underutilization of the SM’s cores, leading to poor performance. Finally, If each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you’re solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.

dahart · 2024-10-11T14:59:57.000000Z

Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, then the compiler might be forced to spill registers to VRAM, that that will just slow everything down even though the occupancy goes up.

Another thing to maybe mention, re: “if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel”… CUDA tends to want to have at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it’s quite important that you have like, for example, 240 blocks running in “parallel”, so you can benefit from latency hiding. This is where a lot of additional performance comes from, doing work on one block while another is momentarily stuck.

jhj · 2024-10-11T14:21:37.000000Z

Aiming for higher occupancy is not always a desired solution, what frequently matters more is avoiding global memory latencies by retaining more data in registers and/or shared memory. This was first noted in 2010 and is still true today:

https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...

I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs is largely because of parallelism). This is the reason why GPUs have massive register files, because unlike modern multi-core CPUs, we omit latency reducing hardware (e.g., speculative execution, large caches, that out-of-order execution stuff/register renaming etc) and in order to fill pipelines we need to have many instructions outstanding, which means that the operands for those pending arguments need to remain around for a lot longer, hence the massive register file.

elashri · 2024-10-11T14:36:28.000000Z

I agree that optimizing for lower occupancy can yield significant performance gains in specific cases, especially when memory latencies are the primary bottleneck. Leveraging ILP and storing more data in registers can indeed help reduce the need for higher occupancy and lead to more efficient kernels. The examples in the GTC2010 talks highlighted that quite well. However, I would argue that occupancy still plays an important role, especially for scalability and general-purpose optimization. Over-relying on low occupancy and fewer threads, while beneficial in certain contexts, has its limits.

The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted, which drastically reduces performance. This becomes more pronounced as problem sizes scale up (the talk examples avoids that problem). Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources. Focusing too much on minimizing thread counts can lead to underutilization of the SM’s parallel execution units. An standard example will be inference engines.

Also, while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels), designing code that depends on such strategies as a general practice can result in less adaptable and robust solutions for a wide variety of applications.

I believe there is a balance to strike here. low occupancy can work for specific cases, higher occupancy often provides better scalability and overall performance for more general use cases. But you have to test for that while you are optimizing your code. There will not be a general rule of thump to follow here.

amelius · 2024-10-11T14:15:50.000000Z

In the 90s we had segmented memory programming with near and far pointers, and you had to be very careful about when you used what type of pointer and how you'd organize your memory accesses. Then we got processors like the 286 that finally relieved us from this constrained way of programming.

I can't help but feel that with CUDA we're having new constraints (32 threads in a warp, what?), which are begging to be unified at some point.

dahart · 2024-10-11T14:51:07.000000Z

While reading I thought you were going to suggest unified memory between RAM and VRAM, since that’s somewhat analogous, though that does exist with various caveats depending on how it’s setup & used.

SIMD/SIMT probably isn’t ever going away, and vector computers have been around since before segmented memory; the 32 threads in a CUDA warp is the source of its performance superpower, and the reason we can even fit all the transistors for 20k simultaneous adds & multiplies, among other things, on the die. This is conceptually different from your analogy too, the segmented memory was a constraint designed to get around pointer size limits, but 32 threads/warp isn’t getting us around any limits, it’s just a design that provides high performance if you can organize your threads to all do the same thing at the same time.

krapht · 2024-10-11T14:24:20.000000Z

I'll believe it when autovectorization is actually useful in day to day high performance coding work.

It's just a hard problem. You can code ignorantly with high level libraries but you're leaving 2x to 10x performance on the table.

talldayo · 2024-10-11T15:01:17.000000Z

You can blame ARM for the popularity of CUDA. At least x86 had a few passable vector ISA ops like SSE and AVX - the ARM spec only supports the piss-slow NEON in it's stead. Since you're not going to unify vectors and mobile hardware anytime soon, the majority of people are overjoyed to pay for CUDA hardware where GPGPU compute is taken seriously.

There were also attempts like OpenCL, that the industry rejected early-on because they thought they'd never need a CUDA alternative. Nvidia's success is mostly built on the ignorance of their competition - if Nvidia was allowed to buy ARM then they could guarantee the two specs never overlap.

dboreham · 2024-10-11T14:21:22.000000Z

miki123211 · 2024-10-11T15:18:17.000000Z

What are some actually good resources to learn this stuff?