Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, then the compiler might be forced to spill registers to VRAM, and that will just slow everything down even though the occupancy goes up.
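
For illustration, here is roughly the knob involved, as just a sketch (the kernel name and body are placeholders): __launch_bounds__ asks the compiler to target a given number of blocks per SM, and compiling with nvcc -Xptxas -v prints the resulting register and spill counts.

  // Placeholder kernel, only to show where the qualifier goes.
  // __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
  // raising the second argument pushes the compiler toward fewer registers
  // per thread, and past some point that means spilling to local memory.
  __global__ void __launch_bounds__(256, 2)
  scale_kernel(const float* in, float* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[i] * 2.0f;  // trivial stand-in for register-heavy work
  }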

Another thing to maybe mention, re: “if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel”… CUDA tends to want at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or a sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it’s quite important to have, say, 240 blocks running in “parallel” so you can benefit from latency hiding. This is where a lot of additional performance comes from: doing work on one block while another is momentarily stuck.
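
A sketch of how you might size the grid with that in mind; the kernel is a placeholder, and cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks of a given kernel can be resident per SM:

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale_kernel(const float* in, float* out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i] * 2.0f;  // placeholder work
  }

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);

      int blockSize = 256, blocksPerSM = 0;
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(
          &blocksPerSM, scale_kernel, blockSize, /*dynamicSmemBytes*/ 0);

      // Launch enough blocks that every SM has several resident, so the
      // scheduler can switch to another one whenever a block stalls.
      int gridSize = prop.multiProcessorCount * blocksPerSM;
      printf("%d SMs x %d resident blocks/SM -> %d blocks\n",
             prop.multiProcessorCount, blocksPerSM, gridSize);
      return 0;
  }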




Is this really true in general? I'd expect it to be true for highly homogeneous blocks, but I'd also expect kernels where the warps are "desynced" in memory operations to do just fine without having 3-4 blocks per SM.


Oh I think so, but I’m certainly not the most expert of CUDA users there is. ;) Still, you will often see CUDA try to allocate local and smem space for at least 3 blocks per SM when you configure a kernel. That can’t possibly always be true, but it is for kernels that use modest amounts of smem, lmem, and registers. In general I’d say desynced mem ops are harder to make performant than highly homogeneous workloads, since they’re more likely to be uncoalesced and to miss the cache.

Think about it this way: a kernel can stall for many, many reasons (which Nsight Compute can show you), especially memory IO, but even for compute-bound work the math pipes can fill, the instruction cache can miss, some instructions have higher latency than others, etc. Even a cache-hit load can take dozens of cycles to actually fill. Because stalls are everywhere, these machines are specifically designed to juggle multiple blocks and always look for ways to make forward progress on something without sitting idle; that is how they get higher throughput and hide latency.
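
Relatedly, you can ask the runtime what the compiler allocated for a given kernel, since registers per thread, smem per block, and lmem per thread are what limit how many blocks an SM can keep resident. A minimal sketch (placeholder kernel again):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale_kernel(const float* in, float* out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i] + 1.0f;  // placeholder work
  }

  int main() {
      cudaFuncAttributes attr;
      cudaFuncGetAttributes(&attr, scale_kernel);
      printf("regs/thread: %d  smem/block: %zu  lmem/thread: %zu\n",
             attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);
      return 0;
  }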


Well, yes, but "desynced" warps don't use shared memory - because writes to it require some synchronization for other warps to be able to read the information.


Why would that be true? Certainly there are algorithms (or portions of them) in which warps can just read whichever values exist in shared mem at the time, no need to sync. And I think we were mostly talking about global memory?


I don’t think it’s possible to use shared memory without syncing, and I don’t think there are any algorithms for that. I think shared memory generally doesn’t have values that exist before the warps in a block get there. If you want to use it, you usually (always?) have to write to smem in the same kernel that reads from it, and use synchronization primitives to ensure correct ordering.
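
A toy kernel showing the pattern I mean (made-up names; assumes the grid covers exactly gridDim.x * 256 elements):

  __global__ void reverse_within_block(const int* in, int* out) {
      __shared__ int tile[256];                     // assumes blockDim.x == 256
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = in[i];                    // 1. each thread writes one slot
      __syncthreads();                              // 2. without this, step 3 can read stale smem
      out[i] = tile[blockDim.x - 1 - threadIdx.x];  // 3. read a slot another warp wrote
  }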

There might be such a thing as cooperative kernels that communicate through smem, but you’d definitely need syncs for that. I don’t know if pre-populating smem is a thing that exists, but if it does then you’ll need kernel level or device level sync, and furthermore you’d be limited to 1 thread per CUDA core. I’m not sure either of those things actually exist, I’m just hedging, but if so they sound complicated and rare. Anyway, the point is that I think if we’re talking about shared memory, it’s safe to assume there must be some synchronizing.

I also assumed by “desynced” you meant threads doing scattered random-access memory reads, since the alternative offered was homogeneous workloads. That’s why I assumed memory perf might be low or limiting due to low cache hit rates and/or poor coalescing. In the case of shared memory, even if you have syncs, random-access reads might lead to heavy bank conflicts (toy example below). If what you actually meant is a workload with a very ordered access pattern that simply doesn’t need any synchronization, then there’s no problem and perf can be quite good. In any case, it’s a good idea to minimize memory access and strive to be compute bound instead of memory bound; memory tends to be the bottleneck most of the time. I’ve only seen truly optimized and compute-bound kernels a small handful of times.
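
To make the bank-conflict point concrete, a toy one-warp kernel (made-up name; assumes blockDim.x == 32):

  __global__ void bank_demo(float* out) {
      __shared__ float smem[32 * 32];
      int t = threadIdx.x;

      // Conflict-free: consecutive threads touch consecutive 4-byte words,
      // which land in 32 different banks.
      smem[t] = (float)t;

      // 32-way conflict: a stride of 32 words puts every thread in the warp
      // on the same bank, so the accesses serialize.
      smem[t * 32] = (float)t;

      __syncthreads();
      out[t] = smem[t] + smem[t * 32];  // read back so the stores aren't optimized away
  }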


There is no guarantee of the order in which operations take effect. E.g., warp 1 writes to some shared memory address; warp 2 reads from that address. How can you guarantee the write happens before the read?



