
Understanding Latency Hiding on GPUs (2016) [pdf] - luu
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
======
majke
I'm jealous. Dan is reading my twitter and re-posting stuff here :)

[https://twitter.com/majek04/status/1267388797278400512](https://twitter.com/majek04/status/1267388797278400512)

You should definitely follow me on twitter :)

Anyway, the question is about optimally using GPGPU resources. You want to
keep both the ALU units and the memory units busy, and it's not obvious what
the ratio of memory to ALU operations should be. This paper addresses it
_very_ nicely.

> Substituting the hardware parameters listed in Table 4.1 into the solutions
> above suggests that hiding arithmetic latency requires 24 warps per SM and
> hiding memory latency requires 30 warps per SM if on the Maxwell GPU – which
> are similar numbers. This is despite the dramatic difference in latencies – 6
> cycles in one case and 368 cycles in another – and in contrast with the
> prevailing common wisdom.
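
For intuition, both figures fall out of Little's law (needed concurrency =
latency × throughput). The 4 warp-instructions-per-cycle figure below is my
assumption for a Maxwell SM's FP32 units (128 cores / 32 lanes per warp), not
a number taken from the quote:

    needed warps ≈ latency × sustainable throughput per SM              (Little's law)
    arithmetic:    6 cycles × 4 warp-instructions/cycle                = 24 warps
    memory:      368 cycles × bandwidth-limited request rate (<1/cycle) ≈ 30 warps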

The paper doesn't answer whether there is a disadvantage to having too many
warps, but considering that the max warps/block/SM is 32, we can say that most
often you need 32 warps / 1024 threads per SM for optimal performance.
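
If you want to check what a launch configuration actually gives you rather
than assuming the 32-warp maximum, the CUDA occupancy API can report it. A
minimal sketch (kernel and block size are placeholders, not anything from the
paper):

    #include <cstdio>

    __global__ void my_kernel(float *out) {        // placeholder kernel
        out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
    }

    int main() {
        int block_size = 256;                      // 8 warps per block
        int blocks_per_sm = 0;
        // How many of these blocks can be resident on one SM at once?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, my_kernel, block_size, 0 /* dynamic shared mem */);
        printf("resident warps per SM: %d\n", blocks_per_sm * block_size / 32);
        return 0;
    }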

~~~
rrss
Note the results you quoted are for a test case in which all instructions are
back-to-back dependent, which is good for testing the hardware and developing
performance models, but not representative of well-tuned GPU codes.
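
For reference, that kind of test looks roughly like the sketch below (mine,
not code from the paper): every FMA depends on the previous one, so a single
warp has no instruction-level parallelism of its own and latency can only be
hidden by other warps.

    __global__ void dependent_chain(float *out, float a, float b) {
        float x = threadIdx.x;
        #pragma unroll
        for (int i = 0; i < 256; ++i)
            x = x * a + b;                 // back-to-back dependent FMAs, no ILP
        out[blockIdx.x * blockDim.x + threadIdx.x] = x;
    }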

> doesn't answer if there is a disadvantage in having too many warps,

There are disadvantages to having too many warps - the first is that the more
warps per block you have, the fewer registers per thread you can use and the
lower your achievable instruction-level parallelism.
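
To put rough numbers on that (assuming Maxwell's 64K 32-bit registers per SM,
which is my addition, not something stated in this thread):

    65536 regs/SM ÷ (64 warps × 32 threads) =  32 regs/thread   (fully occupied SM)
    65536 regs/SM ÷ (16 warps × 32 threads) = 128 regs/thread   (low occupancy, more room for ILP)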

> but considering that the max warps/block/sm is 32, we can say that you need
> 32 warps / 1024 threads per SM for optimal performance most often

I don't think any well-tuned GPU kernels use 32 warps/block, most are at 4 to
8 warps/block.
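
As an illustration of what that looks like in practice (my sketch, not taken
from the links below): 128 threads per block is 4 warps, and each thread keeps
several independent partial sums in registers so the scheduler always has a
ready instruction even at low occupancy.

    __global__ void sum_ilp(const float *in, float *out, int n) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int nthr = blockDim.x * gridDim.x;
        float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
        // grid-stride loop with four independent accumulators (4-way ILP)
        for (int i = tid; i + 3 * nthr < n; i += 4 * nthr) {
            a0 += in[i];
            a1 += in[i + nthr];
            a2 += in[i + 2 * nthr];
            a3 += in[i + 3 * nthr];
        }
        out[tid] = a0 + a1 + a2 + a3;   // per-thread partial sum; tail elements ignored in this sketch
    }
    // launched e.g. as sum_ilp<<<num_blocks, 128>>>(in, out, n);  // 4 warps/block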

See also:

[https://github.com/NervanaSystems/maxas/wiki/SGEMM](https://github.com/NervanaSystems/maxas/wiki/SGEMM)

[https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf](https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf)

~~~
majke
This is great. Thanks

------
arnon
One common pitfall of many cheap blog posts describing why GPUs are "fast" is
focusing on the number of CUDA cores.

The author here does a great job of highlighting that it's not necessarily the
number of cores that makes the big difference, but rather the memory
bandwidth.

He also brings memory bandwidth into his own performance models, which
supposedly makes them much more accurate and helps developers understand the
underlying GPU technology a lot better than just "we run 5000 cores on a
matrix of data".
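
A rough back-of-the-envelope example of why that matters (my numbers, for a
Maxwell-era GTX 980 - roughly 4.6 TFLOP/s of FP32 and ~224 GB/s of memory
bandwidth - not necessarily the card used in the paper):

    4.6e12 FLOP/s ÷ 224e9 B/s ≈ 20 FLOPs per byte moved
    a kernel doing 1 FLOP per 4-byte load is therefore ~80x away from the
    compute limit - the memory system, not the core count, is what bounds it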

~~~
amelius
GPU architecture newbie here. Why isn't the memory interwoven with the cores,
so you have faster local access and more streamlike data flow like in a
systolic array?

~~~
NotCamelCase
Why aren't there e.g. 16 GB of L1D$ inside our CPUs? It'd be blazingly fast!
:)

Although the memory access patterns of typical GPU workloads are very
different from those of CPU workloads, there _are_ GPU-local, hierarchical
caches similar to a CPU's, serving different purposes and exclusive to the
shader cores/compute units, and they can even be partitioned dynamically. The
struggle on the GPU side is keeping all the cores busy while servicing memory
requests efficiently. Sharing memory across a small group of threads running
in lockstep, for example, is a great way of doing that.
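
For a concrete picture of that last point, here's a minimal sketch (mine, with
illustrative names) of the usual pattern: a block stages a tile in on-chip
shared memory, synchronizes, and then each thread reads its neighbours from
the tile instead of going back to global memory. Assumes 256 threads per
block.

    __global__ void blur3(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];              // block tile plus one-element halo on each side
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int l = threadIdx.x + 1;
        tile[l] = (g < n) ? in[g] : 0.f;             // coalesced load into on-chip memory
        if (threadIdx.x == 0)
            tile[0] = (g > 0) ? in[g - 1] : 0.f;
        if (threadIdx.x == blockDim.x - 1)
            tile[blockDim.x + 1] = (g + 1 < n) ? in[g + 1] : 0.f;
        __syncthreads();                             // whole block now sees the tile
        if (g < n)
            out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;   // neighbours come from shared memory
    }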

~~~
amelius
> Why aren't there e.g. 16 GB of L1D$ inside our CPUs? It'd be blazingly fast!
> :)

I didn't say the memory necessarily has to be moved into the GPU. The other
way around would also be a possibility.

~~~
NotCamelCase
Yep, that's not what I intended to hint at, either. What I mean is, it boils
down to the familiar problem of diminishing returns from putting huge caches
inside GPUs.

------
person_of_color
The author is now at NVIDIA. I guess we'll never find out more on how GPUs
work!

