Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, then the compiler might be forced to spill registers to VRAM, and that will just slow everything down even though the occupancy goes up.
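
For illustration, here is roughly the knob involved, as just a sketch (the kernel name and body are placeholders): __launch_bounds__ asks the compiler to target a given number of blocks per SM, and compiling with nvcc -Xptxas -v prints the resulting register and spill counts.

  // Placeholder kernel, only to show where the qualifier goes.
  // __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
  // raising the second argument pushes the compiler toward fewer registers
  // per thread, and past some point that means spilling to local memory.
  __global__ void __launch_bounds__(256, 2)
  scale_kernel(const float* in, float* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[i] * 2.0f;  // trivial stand-in for register-heavy work
  }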

Another thing to maybe mention, re: “if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel”… CUDA tends to want at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or a sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it’s quite important to have, say, 240 blocks running in “parallel” so you can benefit from latency hiding. This is where a lot of additional performance comes from: doing work on one block while another is momentarily stuck.
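
A sketch of how you might size the grid with that in mind; the kernel is a placeholder, and cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks of a given kernel can be resident per SM:

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale_kernel(const float* in, float* out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i] * 2.0f;  // placeholder work
  }

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);

      int blockSize = 256, blocksPerSM = 0;
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(
          &blocksPerSM, scale_kernel, blockSize, /*dynamicSmemBytes*/ 0);

      // Launch enough blocks that every SM has several resident, so the
      // scheduler can switch to another one whenever a block stalls.
      int gridSize = prop.multiProcessorCount * blocksPerSM;
      printf("%d SMs x %d resident blocks/SM -> %d blocks\n",
             prop.multiProcessorCount, blocksPerSM, gridSize);
      return 0;
  }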




Is this really true in general? I'd expect it to be true for highly homogeneous blocks, but I'd also expect kernels where the warps are "desynced" in memory operations to do just fine without having 3-4 blocks per SM.


Oh I think so, but I’m certainly not the most expert of CUDA users there is. ;) Still, you will often see CUDA try to allocate local and smem space for at least 3 blocks per SM when you configure a kernel. That can’t possibly always be true, but it is for kernels that use modest amounts of smem, lmem, and registers. In general I’d say desynced mem ops are harder to make performant than highly homogeneous workloads, since they’re more likely to be uncoalesced and to miss the cache.

Think about it this way: a kernel can stall for many, many reasons (which Nsight Compute can show you), especially memory IO, but even for compute-bound work the math pipes can fill, the instruction cache can miss, some instructions have higher latency than others, etc. Even a cache-hit load can take dozens of cycles to actually fill. Because stalls are everywhere, these machines are specifically designed to juggle multiple blocks and always look for ways to make forward progress on something without sitting idle; that is how they get higher throughput and hide latency.
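
Relatedly, you can ask the runtime what the compiler allocated for a given kernel, since registers per thread, smem per block, and lmem per thread are what limit how many blocks an SM can keep resident. A minimal sketch (placeholder kernel again):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale_kernel(const float* in, float* out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i] + 1.0f;  // placeholder work
  }

  int main() {
      cudaFuncAttributes attr;
      cudaFuncGetAttributes(&attr, scale_kernel);
      printf("regs/thread: %d  smem/block: %zu  lmem/thread: %zu\n",
             attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);
      return 0;
  }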


Well, yes, but "desynced" warps don't use shared memory - because writes to it require some synchronization for other warps to be able to read the information.


Why would that be true? Certainly there are algorithms (or portions of them) in which warps can just read whichever values exist in shared mem at the time, no need to sync. And I think we were mostly talking about global memory?


I don’t think it’s possible to use shared memory without syncing, and I don’t think there are any algorithms for that. I think shared memory generally doesn’t have values that exist before the warps in a block get there. If you want to use it, you usually (always?) have to write to smem in the same kernel that reads from it, and use synchronization primitives to ensure correct ordering.
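
A toy kernel showing the pattern I mean (made-up names; assumes the grid covers exactly gridDim.x * 256 elements):

  __global__ void reverse_within_block(const int* in, int* out) {
      __shared__ int tile[256];                     // assumes blockDim.x == 256
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = in[i];                    // 1. each thread writes one slot
      __syncthreads();                              // 2. without this, step 3 can read stale smem
      out[i] = tile[blockDim.x - 1 - threadIdx.x];  // 3. read a slot another warp wrote
  }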

There might be such a thing as cooperative kernels that communicate through smem, but you’d definitely need syncs for that. I don’t know if pre-populating smem is a thing that exists, but if it does then you’ll need kernel level or device level sync, and furthermore you’d be limited to 1 thread per CUDA core. I’m not sure either of those things actually exist, I’m just hedging, but if so they sound complicated and rare. Anyway, the point is that I think if we’re talking about shared memory, it’s safe to assume there must be some synchronizing.

I also assumed by “desynced” you meant threads doing scattered random-access memory reads, since the alternative offered was homogeneous workloads. That’s why I assumed memory perf might be low or limiting due to low cache hit rates and/or poor coalescing. In the case of shared memory, even if you have syncs, random-access reads might lead to heavy bank conflicts (toy example below). If what you actually meant is a workload with a very ordered access pattern that simply doesn’t need any synchronization, then there’s no problem and perf can be quite good. In any case, it’s a good idea to minimize memory access and strive to be compute bound instead of memory bound; memory tends to be the bottleneck most of the time. I’ve only seen truly optimized and compute-bound kernels a small handful of times.
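
To make the bank-conflict point concrete, a toy one-warp kernel (made-up name; assumes blockDim.x == 32):

  __global__ void bank_demo(float* out) {
      __shared__ float smem[32 * 32];
      int t = threadIdx.x;

      // Conflict-free: consecutive threads touch consecutive 4-byte words,
      // which land in 32 different banks.
      smem[t] = (float)t;

      // 32-way conflict: a stride of 32 words puts every thread in the warp
      // on the same bank, so the accesses serialize.
      smem[t * 32] = (float)t;

      __syncthreads();
      out[t] = smem[t] + smem[t * 32];  // read back so the stores aren't optimized away
  }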


There is no guarantee of the order in which operations take effect. E.g., warp 1 writes to some shared memory address; warp 2 reads from that address. How can you guarantee the write happens before the read?



