Understanding Latency Hiding on GPUs (2016) [pdf] (berkeley.edu)



I'm jealous. Dan is reading my twitter and re-posting stuff here :)

https://twitter.com/majek04/status/1267388797278400512

You should definitely follow me on twitter :)

Anyway, the question is about optimally using GPGPU resources. You want to keep both the ALU units and the memory units busy, and it's not obvious what the ratio of memory to ALU operations should be. This paper addresses that _very_ nicely.

> Substituting the hardware parameters listed in Table 4.1 into the solutions above suggests that hiding arithmetic latency requires 24 warps per SM and hiding memory latency requires 30 warps per SM if on the Maxwell GPU – which are similar numbers. This is despite the dramatic difference in latencies –6 cycles in one case and 368 cycles in another – and in contrast with the prevailing common wisdom.

The paper doesn't answer if there is a disadvantage in having too many warps, but considering that the max warps/block/sm is 32, we can say that you need 32 warps / 1024 threads per SM for optimal performance most often.
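To make the arithmetic concrete, here's a back-of-the-envelope sketch of the Little's-law style estimate behind those numbers (warps needed ≈ latency × throughput). The 4 warp-instructions/cycle issue rate is my assumption for a Maxwell SM, not a figure from the paper:

    // warps_needed ≈ latency (cycles) × throughput (warp-instructions/cycle)
    #include <stdio.h>

    int main(void) {
        double latency_cycles  = 6.0;   // arithmetic latency quoted above
        double warps_per_cycle = 4.0;   // assumed peak issue rate per Maxwell SM (128 cores / 32 lanes)

        printf("warps needed to hide arithmetic latency: %.0f\n",
               latency_cycles * warps_per_cycle);   // ~24, matching the paper

        // The memory case works the same way, but with the bandwidth-limited
        // throughput per SM in place of the issue rate; the paper lands on ~30.
        return 0;
    }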


Note the results you quoted are for a test case in which all instructions are back-to-back dependent, which is good for testing the hardware and developing performance models, but not representative of well-tuned GPU codes.

> doesn't answer if there is a disadvantage in having too many warps,

There are disadvantages to having too many warps - for one, the more warps per block you have, the fewer registers per thread you can use and the lower your achievable instruction-level parallelism.

> but considering that the max warps/block/sm is 32, we can say that you need 32 warps / 1024 threads per SM for optimal performance most often

I don't think any well-tuned GPU kernels use 32 warps/block, most are at 4 to 8 warps/block.
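For what it's worth, the runtime will tell you how many blocks (and hence warps) of a given size actually fit on an SM once register and shared-memory usage are taken into account. A minimal sketch, using a hypothetical kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *out, const float *in) {   // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
    }

    int main() {
        int blockSize = 256;   // 8 warps per block
        int numBlocks = 0;

        // How many blocks of this size can be resident on one SM, given the
        // kernel's register and shared-memory footprint?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, blockSize, 0);

        printf("resident blocks/SM: %d, resident warps/SM: %d\n",
               numBlocks, numBlocks * blockSize / 32);
        return 0;
    }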

See also:

https://github.com/NervanaSystems/maxas/wiki/SGEMM

https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pd...


This is great. Thanks


One common pitfall with many cheap blog posts describing why GPUs are "fast" is the focus on the number of CUDA cores.

The author here does a great job of highlighting how it's not necessarily the number of cores that makes the big difference, but rather the memory bandwidth.

He also brings memory bandwidth into his own models, which supposedly makes them much more accurate and helps developers understand the underlying GPU technology a lot better than just "we run 5000 cores on a matrix of data".


When you say the memory bandwidth makes a big difference, you mean GPUs have more memory bandwidth than CPUs? My (rudimentary) understanding was that CPU/GPU memory bandwidth is often a bottleneck - is that not correct? Or are you referring instead to the aggregate memory bandwidth between GPU cores and local GPU memory?


You’re right that PCI-E bandwidth is the bottleneck for CPU-to-GPU communication. Game devs often have to think in terms of the number of draw calls being sent (especially before DX12/Vulkan). You can easily saturate that channel.

I believe what OP is talking about, in the context of ML models, is that reading 1 byte from the GPU’s own memory has _much_ higher latency than the CPU reading from system RAM.

This is an intentional choice, and it speaks to the core design of what each system is solving for. CPUs aim for low latency to get fast single-threaded execution speed. GPUs accept high latency in exchange for “total thread” execution speed.

CPUs solve for maximum “single thread” performance (for lack of a better term). If you have an operation that reads and mutates one byte of RAM, and you stack many of those instructions into a long sequence, the CPU is very fast at executing that. Most programs we write do this: processing the steps of a single thread of an application.

GPUs optimize for concurrency, and they do that by running many “threads”. Each thread is often memory bound, but because they run in parallel, when one “blocks” on reading memory, another thread just pops into its place until it has to block.

GPUs are constantly swapping which “threads” are actively running, and so are able to hide the latency better. And because of that design, you can trade for higher bandwidth and get more instructions executed overall.
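A toy CUDA sketch of that idea (my own example, nothing from the paper): each warp stalls at the first use of its loaded value, and the SM simply issues from whichever other resident warp is ready.

    // Memory-bound kernel: one load, a little math, one store per thread.
    __global__ void scale(float *out, const float *in, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = in[i];     // issue the global load
            out[i] = a * x;      // the warp stalls here until the load returns;
                                 // meanwhile the SM issues from other resident warps
        }
    }

    // Launch with far more threads than there are cores, so every SM has
    // plenty of warps to switch between:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;
    //   scale<<<blocks, threads>>>(d_out, d_in, 2.0f, n);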

For image data (what GPUs were originally designed for), you’re manipulating big two-dimensional arrays of pixel data, where concurrency is important. CPUs have instructions like SSE/AVX that sit somewhere in between these days, but GPUs have the advantage of being able to target only one domain instead of both.

That’s my understanding of it, at least. I was a game dev in a former life :)


The number of draw calls is usually not a problem because it saturates memory, though. It's a problem because of the CPU overhead it incurs and the (possibly redundant) command-processing work on the GPU's front end, which causes bubbles in the whole pipeline. Where DX12/Vulkan-esque APIs help immensely is mostly the first part: the CPU overhead.

The memory interface between the GPU's command-streaming/processing front end and system memory is very efficient, employing prefetching, etc.


You can also easily see this in game benchmarks, eg in https://www.gamersnexus.net/guides/2488-pci-e-3-x8-vs-x16-pe... "From a quick look, there is a little below a 1% [game FPS] difference in PCI-e 3.0 x16 and PCI-e 3.0 x8 slots".

Also, an important thing to remember is that the vast majority of GPUs (and GPU end users) out there are iGPUs, which share memory with the host and can also be programmed to take advantage of this fact.


Ah, that makes sense. Thank you for clarifying!


PCIe 3.0 x16 is a 16 GB/s link, which ain’t bad. By comparison, CPU dual channel DDR4-2400 main memory is 38.4 GB/s.

All processors are memory bandwidth starved. CPUs don’t benefit from wide, high-latency access to large main memory as much as GPUs do, due to the nature of SISD vs. SIMD. Naturally, GPUs put more focus on a fatter pipe between the processor and main memory (GDDR and 500 GB/s pipes). CPUs operate on fewer data, so you can crank the speed of computation up if you crank the memory speed and keep latency low. This is why so much of a CPU die is dedicated to cache (I think L1 is often single cycle, which results in absurd throughput numbers).


L1 is typically 3-4 cycles in modern processors (latency) or 2-4 accesses per cycle (throughput).


What is the difference between bandwidth and throughput for memory?


In the context of memory, none. However, bandwidth is usually an analog term: the electromagnetic spectrum that is within 3 dB of the minimum insertion loss.


Also, memory access patterns on the GPU are more like batch access - you set up the parameters for a large transfer and then stream it at high bandwidth (textures, VBOs, FBOs, etc.)
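In CUDA terms that shows up as coalescing: when adjacent threads in a warp touch adjacent addresses, the hardware batches their accesses into a few wide transactions. A rough sketch of the good vs. bad pattern (my own example):

    // Coalesced: consecutive threads read consecutive floats, so a warp's
    // loads are serviced by a handful of wide memory transactions.
    __global__ void copy_coalesced(float *dst, const float *src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided: consecutive threads hit addresses far apart, so the same work
    // needs many more transactions and wastes most of each memory burst.
    __global__ void copy_strided(float *dst, const float *src, int n, int stride) {
        long i = (long)(blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) dst[i] = src[i];
    }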


Yes. Memory-intensive operations like decompression and sorting benefit greatly from running on hardware with high memory bandwidth.

I think the comparison will be clearer when I put these two side-by-side:

* NVIDIA Tesla V100. Maximum memory bandwidth: 900GB/s to the HBM2 memory

* Intel Xeon Platinum 8180. Maximum memory bandwidth: 119.21 GiB/s to six-channel DDR4-2666.
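If you want to reproduce numbers in that ballpark yourself, a crude device-to-device copy benchmark gets you close to a card's spec-sheet figure (my own sketch, error handling omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 256 << 20;             // 256M floats = 1 GiB
        const size_t bytes = n * sizeof(float);
        float *a, *b;
        cudaMalloc(&a, bytes);
        cudaMalloc(&b, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(b, a, bytes, cudaMemcpyDeviceToDevice);   // 1 GiB read + 1 GiB write
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("effective bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));
        return 0;
    }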


CPU memory to CPU cores is medium speed. That's why the CPU cores have caches, to make it faster.

CPU memory to GPU memory is slow. The other direction is even slower.

But GPU memory to GPU processing cores is insanely fast if you stream large amounts of homogeneous data.


PCIe 4.0 x16 bandwidth is around 32 GB/s. Most games or applications don't saturate PCIe 3.0 x16 bandwidth. I think you are conflating latency, bandwidth, and "speed".


GPU architecture newbie here. Why isn't the memory interwoven with the cores, so you have faster local access and more streamlike data flow like in a systolic array?


Because you want to use COTS chips for memory. The cost of interleaving memory and compute is huge, and it only gets worse when you consider that such memory is "locked" to a particular product.

So the only interleaved memory is very small caches local to the compute cores.

(And I'm not even getting into power and thermals involved)


I'm not an expert in this area, but my understanding is that DRAM and logic processes are very different from each other. If you want both DRAM and logic on the same chip you'll end up with a compromise which isn't that good for either. Which, aside from the programmability issue, is why processor-in-memory (PIM) approaches haven't caught on (yet).

The latest generation of high-end GPUs use HBM2(E?) memory, which is a very wide and fast pipe compared to the DDR4 used for "normal" CPU main memory.

As for systolic arrays, to some extent the matrix-matrix units in Google TPUs and NVIDIA Tensor Cores are systolic arrays. I suspect we'll see designs go further down that path in the future.


There is threadgroup shared memory (also known as local data share), provided at 64 kB per block of execution units (this block is called a "subslice" on Intel, a "streaming multiprocessor" on Nvidia, and a "compute unit" on AMD). Though the exact ratios vary widely, this memory has roughly an order of magnitude higher bandwidth and lower latency than access to global memory, and when used skillfully it really can enable systolic-like architectures.
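A minimal CUDA sketch of using it (my own example): stage a tile from global memory into shared memory once, then let the block reuse it at much lower latency.

    #define TILE 256

    // Each block stages TILE floats into on-chip shared memory, then the
    // interior threads reuse their neighbours' values without touching
    // global memory again.
    __global__ void blur1d(float *out, const float *in, int n) {
        __shared__ float tile[TILE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x;

        if (i < n) tile[t] = in[i];
        __syncthreads();              // the whole tile is populated past this point

        if (t > 0 && t < TILE - 1 && i + 1 < n)
            out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }

    // Launched with blockDim.x == TILE, e.g.
    //   blur1d<<<(n + TILE - 1) / TILE, TILE>>>(d_out, d_in, n);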


If you are interested in a PIM (processing-in-memory) solution, then you should take a look at something like UPMEM instead of GPUs. It's expensive to manufacture and the processors can only access local memory (64 MB of RAM), but in exchange you get 128 DPUs (CPU cores) per DIMM, and each DIMM has an internal memory bandwidth of 128 GB/s. With a really high-end multi-socket CPU system you might see 200 GB/s for the entire system.


Why aren't there e.g. 16 GB of L1D$ inside our CPUs? It'd be blazingly fast! :)

Although the memory access patterns of typical GPU workloads are very different from those of CPU workloads, GPUs do have local, hierarchical caches similar to a CPU's, serving different purposes and exclusive to the shader cores / compute units, and these can even be partitioned dynamically. The struggle on the GPU side is keeping all the cores busy while servicing memory requests efficiently. Sharing memory across a small group of threads running in lockstep, for example, is a great way of doing that.


> Why aren't there e.g. 16 GB of L1D$ inside our CPUs? It'd be blazingly fast! :)

I didn't say the memory necessarily has to be moved into the GPU. The other way around would also be a possibility.


Yep, that's not what I intended to hint at, either. What I mean is, it boils down to the familiar problem of diminishing returns to have huge caches inside GPUs.


Memory bandwidth keeps the cores from being bottlenecked, but saying it is the reason GPUs are fast is simplistic.


The author is now at NVIDIA. I guess we'll never find out more on how GPUs work!



