Hacker News

When you say the memory bandwidth makes a big difference, you mean GPUs have more memory bandwidth than CPUs? My (rudimentary) understanding was that CPU/GPU memory bandwidth is often a bottleneck - is that not correct? Or are you referring instead to the aggregate memory bandwidth between GPU cores and local GPU memory?


You’re right that PCI-E bandwidth is the bottleneck for CPU-to-GPU communication. Game devs often have to think in terms of the number of draw calls being sent (especially before DX12/Vulkan). You can easily saturate that channel.

I believe what OP is talking about, in the context of ML models, is that a single read from the GPU’s own memory has _much_ higher latency than a CPU read from system RAM.

This is an intentional choice and it speaks to the core design of what each system is solving for. CPUs prioritize low latency to get fast single-threaded execution. GPUs accept high latency in exchange for higher “total thread” throughput.

CPUs solve for maximum “single thread” performance (for lack of a better term). If you have an operation that reads and mutates one byte of RAM, and you stack many of those instructions into a long sequence, the CPU is very fast at executing it. Most programs we write do this: processing the steps of a single thread of an application.

GPUs optimize for concurrency, and they do that by running many “threads”. Each thread is often memory bound, but because they run in parallel, when one blocks on reading memory, another thread pops into its place until it, too, has to block.

GPUs are constantly swapping which “threads” are active, which hides the latency. Because of that design, the hardware can trade latency for higher bandwidth and get more total work done.
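A back-of-envelope way to see this trade-off is Little’s law: the amount of data you need “in flight” equals bandwidth times latency. A rough Python sketch, using illustrative (not measured) bandwidth and latency numbers:

```python
# Little's law for memory systems: bytes that must be outstanding ("in flight")
# to sustain a target bandwidth at a given access latency.
def bytes_in_flight(bandwidth_gb_s, latency_ns):
    # GB/s * ns = (1e9 B/s) * (1e-9 s) = bytes; the scale factors cancel.
    return bandwidth_gb_s * latency_ns

# A GPU chasing ~900 GB/s at ~400 ns DRAM latency needs ~360 KB in flight,
# which is why it keeps thousands of threads resident to cover the stalls.
print(bytes_in_flight(900, 400))  # -> 360000

# A single CPU core chasing ~20 GB/s at ~80 ns needs only ~1.6 KB in flight,
# which a handful of outstanding cache-line misses can cover.
print(bytes_in_flight(20, 80))    # -> 1600
```

The asymmetry is the whole story: the GPU needs orders of magnitude more outstanding work, so it is built to juggle that many threads at once.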

For image data (what GPUs were originally designed for), you’re manipulating big two-dimensional arrays of pixel data, where concurrency is important. CPUs have SIMD instructions like SSE/AVX that sit somewhere in between these days, but GPUs have the advantage of targeting only one domain instead of both.
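A toy example of why pixel work parallelizes so well: each output pixel depends only on the matching input pixel, with no cross-pixel dependencies. The `brighten` function below is hypothetical, just to show the shape of the workload:

```python
# Toy per-pixel operation: brighten a row of grayscale pixels.
# Every iteration is independent -- no pixel reads another pixel's result --
# so the loop could run entirely in parallel across SIMD lanes or GPU threads.
def brighten(pixels, amount=50, cap=255):
    return [min(p + amount, cap) for p in pixels]

row = [10, 20, 30, 250]
print(brighten(row))  # -> [60, 70, 80, 255]
```

A CPU’s AVX unit would process a handful of these pixels per instruction; a GPU would assign each pixel (or small tile) to its own thread.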

That’s my understanding of it, at least. I was a game dev in a former life :)


Draw-call count is usually a problem not because it saturates memory, though. It is often because of the CPU overhead it incurs and the (possibly redundant) command-processing work on the GPU’s front end causing bubbles in the whole pipeline. Where DX12/Vulkan-esque APIs help immensely is mostly the first part: CPU overhead.

The memory interface between the GPU’s command-streaming/processing front end and system memory is very efficient, employing prefetching, etc.


You can also easily see this in game benchmarks, eg in https://www.gamersnexus.net/guides/2488-pci-e-3-x8-vs-x16-pe... "From a quick look, there is a little below a 1% [game FPS] difference in PCI-e 3.0 x16 and PCI-e 3.0 x8 slots".

Also an important thing to remember is that the vast majority of GPUs (and GPU end-users) out there are iGPUs that share memory with the host and can also be programmed to take advantage of this fact.


Ah, that makes sense. Thank you for clarifying!


PCIe 3.0 x16 is a ~16 GB/s link (per direction), which ain’t bad. By comparison, CPU dual-channel DDR4-2400 main memory is 38.4 GB/s.
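The 38.4 GB/s figure falls straight out of the DDR arithmetic: each channel is 64 bits (8 bytes) wide and performs one transfer per MT/s of its rating. A quick sketch:

```python
# Peak DDR bandwidth = channels * transfer rate (MT/s) * bus width (bytes).
def ddr_peak_gb_s(channels, mt_per_s, bus_bytes=8):
    return channels * mt_per_s * bus_bytes / 1000  # MB/s -> GB/s (decimal)

print(ddr_peak_gb_s(2, 2400))  # dual-channel DDR4-2400 -> 38.4
```

This is the theoretical peak; sustained bandwidth in practice is lower due to refresh, bank conflicts, and access patterns.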

All processors are memory bandwidth starved. CPUs don’t benefit from wide, high-latency access to large main memory as much as GPUs do, due to the nature of SISD vs. SIMD. Naturally, GPUs put more focus on a fatter pipe between the processor and main memory (GDDR and 500 GB/s pipes). CPUs operate on fewer data at a time, so you can crank the speed of computation up if you crank the memory speed and keep latency low. This is why so much of CPU dies are dedicated to cache (I think L1 is often single cycle, which results in absurd throughput numbers).


L1 is typically 3-4 cycles in modern processors (latency) or 2-4 accesses per cycle (throughput).


What is the difference between bandwidth and throughput for memory?


In the context of memory, none. However, bandwidth is originally an analog term: the range of frequencies within 3 dB of the minimum insertion loss.


Also, memory access patterns on GPUs are more like batch access: you set up parameters for a large transfer, then stream it with high bandwidth (textures, VBOs, FBOs, etc.).


Yes. Memory-intensive operations like decompression and sorting benefit greatly from running on hardware with high memory bandwidth.

I think the comparison will be clearer when I put these two side-by-side:

* NVIDIA Tesla V100: maximum memory bandwidth 900 GB/s to HBM2 memory.

* Intel Xeon Platinum 8180: maximum memory bandwidth 119.21 GiB/s to DDR4-2666 across six channels.
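Applying the same channels × rate × width arithmetic shows where the Xeon number comes from and how lopsided the comparison is (the V100 figure is the HBM2 datasheet peak; the small gap versus the quoted 119.21 GiB/s likely comes from using the exact 2666.67 MT/s rate):

```python
# Six channels of DDR4-2666, 8 bytes per transfer.
xeon_gb_s = 6 * 2666 * 8 / 1000        # ~128 GB/s (decimal)
xeon_gib_s = xeon_gb_s * 1e9 / 2**30   # ~119.2 GiB/s, close to the quoted figure
v100_gb_s = 900                        # HBM2 datasheet peak

# The V100 has roughly 7x the memory bandwidth of the six-channel Xeon.
print(round(xeon_gb_s, 1), round(xeon_gib_s, 1), round(v100_gb_s / xeon_gb_s, 1))
```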


CPU memory to CPU cores is medium speed. That's why the CPU cores have caches, to make it faster.

CPU memory to GPU memory is slow. The other direction is even slower.

But GPU memory to GPU processing cores is insanely fast if you stream large amounts of homogeneous data.


PCIe 4.0 x16 bandwidth is around 32 GB/s. Most games or applications don’t saturate even PCIe 3.0 x16 bandwidth. I think you are conflating latency, bandwidth, and “speed”.



