
You’re right that PCI-E bandwidth is the bottleneck for CPU-to-GPU communication. Game devs often have to think in terms of the number of draw calls being sent (especially before DX12/Vulkan), and you can easily saturate that channel.

I believe what OP is talking about, in the context of ML models, is that reading one byte on the GPU from the GPU’s own memory has _much_ higher latency than the CPU reading one byte from system RAM.

This is an intentional choice, and it speaks to the core design of what each system is solving for. CPUs are built for low latency, which buys fast single-threaded execution; GPUs tolerate high latency in exchange for total throughput across all their threads.

CPUs solve for maximum “single thread” performance (for lack of a better term). If you have an operation that reads and mutates one byte of RAM, and you stack many of those instructions into a long sequence, the CPU is very fast at executing it. Most programs we write do this: processing the steps of a single thread of an application.

GPUs optimize for concurrency, and they do that by running many “threads”. Each thread is often memory bound, but because they run in parallel, when one blocks on a memory read, another thread just pops into its place until it, too, has to block.

GPUs are constantly swapping which “threads” are active, so they hide the latency much better. And because of that design, you can trade latency for higher bandwidth and get more total work done.
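If it helps, here’s a minimal CUDA sketch of that idea (the kernel, sizes, and launch configuration are made up for illustration): a memory-bound kernel launched with far more threads than the hardware can execute at once, so whenever one warp stalls on a global-memory read, the scheduler has plenty of others ready to run.

    // saxpy.cu -- memory-bound kernel: almost no math per element, so the
    // speed you see is mostly the GPU hiding memory latency with parallelism.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        // Grid-stride loop: many more logical threads than execution slots.
        // When one warp waits on a memory read, another runs in its place.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x) {
            y[i] = a * x[i] + y[i];
        }
    }

    int main() {
        const int n = 1 << 24;  // ~16M floats, far bigger than any cache
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        // Oversubscribe on purpose: thousands of warps in flight is exactly
        // what lets the GPU hide its high per-access latency.
        saxpy<<<1024, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }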

For image data (what GPUs were originally designed for), that’s exactly what you want: you’re manipulating big two-dimensional arrays of pixels, where concurrency matters most. CPUs have instructions like SSE/AVX that sit somewhere in between these days, but GPUs have the advantage of being able to target only one domain instead of both.

That’s my understanding of it, at least. I was a game dev in a former life :)




The number of draw calls is usually not a problem because it saturates memory, though. It’s usually because of the CPU overhead each call incurs and the (possibly redundant) command-processing work on the GPU’s front end causing bubbles in the whole pipeline. Where DX12/Vulkan-esque APIs help immensely is mostly the first part: CPU overhead.

The memory interface between the GPU’s command-streaming/processing front end and system memory is very efficient, employing prefetching, etc.


You can also easily see this in game benchmarks, e.g. https://www.gamersnexus.net/guides/2488-pci-e-3-x8-vs-x16-pe... : "From a quick look, there is a little below a 1% [game FPS] difference in PCI-e 3.0 x16 and PCI-e 3.0 x8 slots".

Also important to remember: the vast majority of GPUs (and GPU end users) out there are iGPUs, which share memory with the host and can be programmed to take advantage of that fact.


Ah, that makes sense. Thank you for clarifying!


PCIe 3.0 x16 is a 16 GB/s link, which ain’t bad. By comparison, CPU dual channel DDR4-2400 main memory is 38.4 GB/s.
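Working through those numbers (the PCIe figure assumes the usual 128b/130b encoding):

    PCIe 3.0 x16:       16 lanes x 8 GT/s x (128/130) / 8 bits per byte  ≈ 15.75 GB/s
    DDR4-2400, 2 chan:  2 channels x 2400 MT/s x 8 bytes per transfer    = 38.4 GB/s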

All processors are memory bandwidth starved. CPUs don’t benefit from wide, high-latency access to large main memory as much as GPUs do, due to the nature of SISD vs. SIMD. Naturally, GPUs put more focus on a fatter pipe between the processor and main memory (GDDR and 500 GB/s pipes). CPUs operate on less data at a time, so you can crank the speed of computation up if you crank the memory speed and keep latency low. This is why so much of a CPU die is dedicated to cache (I think L1 is often single cycle, which results in absurd throughput numbers).


L1 is typically 3-4 cycles in modern processors (latency), with 2-4 accesses per cycle (throughput).
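To put a rough number on that: assuming (hypothetically) a ~4 GHz core that can sustain two 32-byte loads per cycle,

    2 loads/cycle x 32 B x 4 GHz ≈ 256 GB/s of L1 load bandwidth, per core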


What is the difference between bandwidth and throughput for memory?


In the context of memory, none. However, bandwidth is usually an analog term: the range of frequencies that is within 3 dB of the minimum insertion loss.
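In that analog sense, the 3 dB (half-power) bandwidth is just

    BW = f_high - f_low, where the insertion loss stays within 3 dB of its minimum for f_low <= f <= f_high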


Also, memory access patterns on GPUs are more like batch access: you set up the parameters for a large transfer, then stream it at high bandwidth (textures, VBOs, FBOs, etc.).
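A minimal CUDA sketch of that batch pattern (sizes and setup are made up for illustration): many tiny host-to-device copies each pay a fixed per-call cost, while one large transfer streams at close to link bandwidth.

    // transfer.cu -- compare many small PCIe copies against one big one.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    int main() {
        const size_t total = 16 << 20;   // 16 MB payload...
        const size_t chunk = 4 << 10;    // ...moved as 4 KB pieces
        std::vector<char> host(total, 0);
        char *dev;
        cudaMalloc(&dev, total);

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0);
        // Many small copies: each cudaMemcpy pays driver + setup overhead.
        for (size_t off = 0; off < total; off += chunk)
            cudaMemcpy(dev + off, host.data() + off, chunk,
                       cudaMemcpyHostToDevice);
        cudaEventRecord(t1);

        // One batched copy: set it up once, then stream the whole block.
        cudaMemcpy(dev, host.data(), total, cudaMemcpyHostToDevice);
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float msSmall = 0.f, msBig = 0.f;
        cudaEventElapsedTime(&msSmall, t0, t1);
        cudaEventElapsedTime(&msBig, t1, t2);
        printf("4096 x 4 KB copies: %.2f ms, one 16 MB copy: %.2f ms\n",
               msSmall, msBig);

        cudaFree(dev);
        return 0;
    }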



