You should definitely follow me on twitter :)
Anyway, the question is about using GPGPU resources optimally. You want to keep both the ALUs and the memory units busy, and it's not obvious what ratio of memory to ALU operations gets you there. This paper addresses it _very_ nicely.
> Substituting the hardware parameters listed in Table 4.1 into the solutions above suggests that hiding arithmetic latency requires 24 warps per SM and hiding memory latency requires 30 warps per SM if on the Maxwell GPU – which are similar numbers. This is despite the dramatic difference in latencies – 6 cycles in one case and 368 cycles in another – and in contrast with the prevailing common wisdom.
The paper doesn't answer whether there is a disadvantage to having too many warps, but considering that the max warps/block/SM is 32, we can say that you need 32 warps / 1024 threads per SM for optimal performance most of the time.
> doesn't answer whether there is a disadvantage to having too many warps,
There are disadvantages to having too many warps - the first is that the more warps per block you have, the fewer registers per thread you can use, and the lower your achievable instruction-level parallelism.
> but considering that the max warps/block/SM is 32, we can say that you need 32 warps / 1024 threads per SM for optimal performance most of the time
I don't think any well-tuned GPU kernels use 32 warps/block; most are at 4 to 8 warps/block.
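For what it's worth, a minimal CUDA sketch of that kind of launch configuration (the kernel, the sizes, and the `__launch_bounds__` hint are my own illustration, not something from the paper):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel. __launch_bounds__(256) promises the compiler that
// this kernel never runs with more than 8 warps (256 threads) per block,
// so it is free to assign more registers per thread.
__global__ void __launch_bounds__(256)
saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // 8 warps per block instead of the 32-warp / 1024-thread maximum:
    // a 1024-thread block caps the kernel at 64 registers per thread on a
    // 64K-register SM, while a 256-thread block allows up to 255.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

That's the register tradeoff mentioned above in concrete terms: the smaller the block, the more freely the compiler can spend registers per thread.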
The author here does a great job of highlighting how it's not necessarily the number of cores that makes the big difference, but rather the memory bandwidth.
He also brings memory bandwidth into his own models, which supposedly makes them much more accurate and helps developers understand the underlying GPU technology a lot better than just "we run 5000 cores on a matrix of data".
I believe what OP is talking about, in the context of ML models, is that reading 1 byte from the GPU's own memory is _much_ slower (higher latency) than the CPU reading 1 byte from system RAM.
This is an intentional choice, and it speaks to the core design of what each system is solving for. CPUs accept lower throughput in exchange for low latency and fast single-threaded execution; GPUs accept high latency in exchange for "total thread" execution speed.
CPUs solve for maximum "single thread" performance (for lack of a better term). If you have an operation that reads and mutates one byte of RAM, and you stack many of those instructions into a long sequence, the CPU is very fast at executing it. Most programs we write do this: process the steps of a single thread of an application.
GPUs optimize for concurrency, and they do that by running many "threads". Each thread is often memory bound, but because they run in parallel, when one "blocks" on a memory read, another thread just pops into its place until it too has to block.
GPUs are constantly swapping which "threads" are actively running, and that lets them hide the latency better. And because of that design, you can trade for higher bandwidth and get more instructions through.
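A toy CUDA sketch of that swapping behaviour (the kernel and the launch sizes here are purely illustrative):

```cuda
#include <cuda_runtime.h>

// Memory-bound copy kernel. Each thread issues a global load and then
// stalls for hundreds of cycles waiting on DRAM; the SM's warp schedulers
// simply issue instructions from other resident warps in the meantime,
// so the memory pipeline stays busy even though every individual thread
// spends most of its time "blocked".
__global__ void copyKernel(const float* __restrict__ in,
                           float* __restrict__ out, int n) {
    // Grid-stride loop: a fixed number of threads covers any n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = in[i];
    }
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Launch far more warps than there are execution units; the surplus
    // exists mostly to hide memory latency.
    copyKernel<<<1024, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```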
For image data (what GPUs were originally designed for), you're manipulating big two-dimensional arrays of pixel data where that concurrency matters. CPUs have instructions like SSE/AVX that sit somewhere in between these days, but GPUs have the advantage of being able to target only one domain instead of both.
That’s my understanding of it, at least. I was a game dev in a former life :)
The memory interface between the GPU's command-streaming/processing front ends and system memory is very efficient, employing prefetching, etc.
Also, an important thing to remember is that the vast majority of GPUs (and GPU end users) out there are iGPUs that share memory with the host and can be programmed to take advantage of this fact.
All processors are memory-bandwidth starved. CPUs don't benefit from wide, high-latency access to large main memory as much as GPUs do, due to the nature of SISD vs. SIMD. Naturally, GPUs put more focus on a fatter pipe between the processor and main memory (GDDR and 500 GB/s links). CPUs operate on less data at a time, so you can crank up the speed of computation if you crank the memory speed and keep latency low. This is why so much of a CPU die is dedicated to cache (I think L1 is often single-cycle, which results in absurd throughput numbers).
I think the comparison will be clearer when I put these two side-by-side:
* NVIDIA Tesla V100. Maximum memory bandwidth: 900 GB/s to its HBM2 memory.
* Intel Xeon Platinum 8180. Maximum memory bandwidth: 119.21 GiB/s (about 128 GB/s) to six-channel DDR4-2666.
CPU memory to GPU memory is slow. The other direction is even slower.
But GPU memory to GPU processing cores is insanely fast if you stream large amounts of homogeneous data.
So the only memory that handles small, interleaved accesses well is the very small caches local to the compute cores.
(And I'm not even getting into power and thermals involved)
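As a rough back-of-the-envelope sketch (assuming a PCIe 3.0 x16 link at roughly 16 GB/s and the ~900 GB/s HBM2 figure mentioned elsewhere in the thread; the timings in the comments are estimates, not measurements):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = size_t(1) << 30;   // 1 GiB
    float* host = static_cast<float*>(malloc(bytes));
    float* dev;
    cudaMalloc(&dev, bytes);

    // Host -> device crosses the PCIe link (~16 GB/s peak on gen3 x16),
    // so moving this 1 GiB costs on the order of 60-70 ms...
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    // ...whereas a kernel streaming the same 1 GiB out of ~900 GB/s HBM2
    // needs only a millisecond or two. The transfer is the part you want
    // to do rarely and amortize over as much on-device work as possible.
    // someKernel<<<grid, block>>>(dev, ...);   // hypothetical kernel

    cudaFree(dev);
    free(host);
    return 0;
}
```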
The latest generation of high-end GPUs use HBM2(E?) memory, which is a much wider and faster pipe than the DDR4 used for "normal" CPU main memory.
As for systolic arrays, to some extent the matrix-matrix units in Google TPUs and NVIDIA Tensor Cores are systolic arrays. I suspect we'll see designs go further down that path in the future.
Although the memory access patterns of typical GPU workloads are very different from those of CPU workloads, GPUs do have local, hierarchical caches similar to a CPU's, serving different purposes and exclusive to the shader cores/compute units, and they can even be partitioned dynamically. The struggle on the GPU side is keeping all the cores busy while servicing memory requests efficiently. Sharing memory across a small group of threads that run in lockstep, for example, is a great way of doing that.
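A minimal sketch of that last idea, assuming CUDA: threads in a block stage data in on-chip `__shared__` memory and cooperate on it instead of each going back to DRAM (the kernel name and sizes are mine, not from the comment above).

```cuda
#include <cuda_runtime.h>

// Threads in a block cooperate through fast on-chip shared memory:
// each thread loads one element from global memory, then the block
// reduces the tile locally and writes a single partial sum back out.
__global__ void blockSum(const float* in, float* blockSums, int n) {
    __shared__ float tile[256];              // one slot per thread in the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (blockDim.x assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    float *in, *partial;
    cudaMalloc(&in,      n * sizeof(float));
    cudaMalloc(&partial, blocks * sizeof(float));
    blockSum<<<blocks, 256>>>(in, partial, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(partial);
    return 0;
}
```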
I didn't say the memory necessarily has to be moved into the GPU. The other way around would also be a possibility.