Hacker News new | past | comments | ask | show | jobs | submit login

I agree that optimizing for lower occupancy can yield significant performance gains in specific cases, especially when memory latencies are the primary bottleneck. Leveraging ILP and storing more data in registers can indeed help reduce the need for higher occupancy and lead to more efficient kernels. The examples in the GTC2010 talks highlighted that quite well. However, I would argue that occupancy still plays an important role, especially for scalability and general-purpose optimization. Over-relying on low occupancy and fewer threads, while beneficial in certain contexts, has its limits.

The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted, which drastically reduces performance. This becomes more pronounced as problem sizes scale up (the talk examples avoids that problem). Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources. Focusing too much on minimizing thread counts can lead to underutilization of the SM’s parallel execution units. An standard example will be inference engines.

Also, while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels), designing code that depends on such strategies as a general practice can result in less adaptable and robust solutions for a wide variety of applications.

I believe there is a balance to strike here. low occupancy can work for specific cases, higher occupancy often provides better scalability and overall performance for more general use cases. But you have to test for that while you are optimizing your code. There will not be a general rule of thump to follow here.






Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: