
Very good analysis, and a correct conclusion that memory bandwidth is the bottleneck (at least for matrix fused multiply-add intensive workloads like feedforward NNs and convnets). We have done experiments on the 1080 Ti (484 GB/s), and for 32-bit FP training (convnets on TensorFlow) it is close in performance to the P100 (717 GB/s).
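
A rough roofline-style sketch of why such workloads end up bandwidth-bound. The peak numbers below are the published 1080 Ti specs; the GEMM shapes are hypothetical examples, not measurements from the experiments above:

    # Rough roofline-style estimate: is a kernel FLOP-bound or bandwidth-bound?
    # Peak numbers are published GTX 1080 Ti specs; GEMM sizes are illustrative.

    PEAK_FLOPS = 11.3e12      # ~11.3 TFLOP/s FP32 (GTX 1080 Ti)
    PEAK_BW = 484e9           # 484 GB/s on-card memory bandwidth

    def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=4):
        """FLOPs per byte moved for C = A @ B, assuming each matrix is touched once."""
        flops = 2 * m * n * k
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
        return flops / bytes_moved

    # FLOPs the GPU can issue per byte it can fetch from memory
    machine_balance = PEAK_FLOPS / PEAK_BW

    for size in (64, 256, 1024, 4096):
        ai = gemm_arithmetic_intensity(size, size, size)
        bound = "compute-bound" if ai > machine_balance else "bandwidth-bound"
        print(f"{size}x{size} GEMM: {ai:.1f} FLOP/byte vs balance {machine_balance:.1f} -> {bound}")

Small matrices (and most element-wise layers) fall well below the machine balance point, which is why the memory system, not the ALUs, sets the pace.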

The other point to add is that the SIMD execution model of GPUs is what gives them efficient batched (coalesced) reads from GPU memory for each operation.
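
A back-of-the-envelope illustration of that, assuming 32-lane warps and 128-byte memory transactions (typical NVIDIA figures): lockstep lanes reading contiguous data get far better effective bandwidth than lanes reading scattered addresses.

    # Coalesced vs. scattered reads for one 32-lane SIMD group.
    # Assumes 32 lanes per warp and 128-byte memory transactions (typical NVIDIA figures).

    LANES = 32
    ELEM_BYTES = 4            # one float32 per lane
    TRANSACTION_BYTES = 128

    useful_bytes = LANES * ELEM_BYTES                 # 128 bytes actually needed

    # Contiguous (coalesced): all 32 floats fit in a single 128-byte transaction.
    coalesced_transactions = 1
    # Scattered: each lane touches a different 128-byte line, so 32 transactions.
    scattered_transactions = LANES

    print("coalesced efficiency:", useful_bytes / (coalesced_transactions * TRANSACTION_BYTES))   # 1.0
    print("scattered efficiency:", useful_bytes / (scattered_transactions * TRANSACTION_BYTES))   # ~0.03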




Thanks.

I can't say I'm an expert yet. But the more I read about highly optimized code on any platform, the more I realize that 90% of the problem is dealing with memory.

Virtually every optimization guide or highly-optimized-code tutorial spends an enormous amount of time discussing memory problems. It seems like memory bandwidth is the single thing HPC coders think about most.


It's worth noting that this GPU RAM advantage is usually coupled with a PCIe bus disadvantage, which means that you need to be able to hold a complete working set of data in the GPU long enough to really benefit from the extra bandwidth and horsepower.

If you don't have enough computations-per-byte to perform on the GPU, your total job time starts to be dominated by the time it takes to stage data in and out of the GPU, without being able to keep the GPU cores busy. Even if the CPU is 5-10x slower in terms of issue rates and RAM bandwidth, it can keep calculating steadily with a higher duty cycle since system RAM can be much larger.
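
A toy estimate of that trade-off, with assumed (not measured) bandwidth figures, showing roughly how many computations-per-byte you need before compute time catches up with PCIe staging time:

    # When does PCIe staging dominate total job time? All figures are illustrative assumptions.

    PCIE_BW = 12e9          # ~12 GB/s effective for a PCIe 3.0 x16 transfer
    GPU_BW = 484e9          # on-card memory bandwidth (1080 Ti)
    GPU_FLOPS = 11.3e12     # peak FP32

    def staging_vs_compute(data_bytes, flops_per_byte):
        """Time to copy data over PCIe vs. time to compute on it (no overlap assumed)."""
        transfer = data_bytes / PCIE_BW
        compute = max(data_bytes * flops_per_byte / GPU_FLOPS,   # FLOP-limited
                      data_bytes / GPU_BW)                       # bandwidth-limited
        return transfer, compute

    for fpb in (1, 10, 100, 1000):                 # computations per byte staged in
        transfer, compute = staging_vs_compute(1e9, fpb)   # 1 GB working set
        print(f"{fpb:>5} FLOP/byte: transfer {transfer*1e3:.1f} ms, compute {compute*1e3:.2f} ms")

With these assumed numbers, a 1 GB working set takes ~80 ms just to cross the bus, so the GPU cores sit mostly idle until the work per byte gets into the hundreds of FLOPs.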

However, the CPU also benefits from locality, so you should still prefer to structure your work into block-decomposed work units if possible. A decomposition which allows you to work through a large problem as a series of sub-problems sized for a modest GPU RAM area will also let the sub-problem rise higher in the CPU cache hierarchy to get more effective throughput. That said, if the decomposition adds too much sequential overhead for marshalling or final reduction of results, it may not help versus a monolithic algorithm with reasonably good vectorization/streaming access to the full data.
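
A minimal sketch of that block-decomposed structure, with illustrative names and sizes; the same shape applies whether the "fast memory" is GPU RAM or a CPU cache level:

    # Minimal sketch of block decomposition: process a large problem as a series of
    # sub-problems sized to stay resident in fast memory. Sizes are illustrative.

    import numpy as np

    BLOCK_ELEMS = 1 << 20   # assume each block fits comfortably in fast memory

    def blocked_sum_of_squares(x):
        """Walk through x one block at a time, keeping a small running reduction."""
        total = 0.0
        for start in range(0, x.size, BLOCK_ELEMS):
            block = x[start:start + BLOCK_ELEMS]   # sub-problem with good locality
            total += float(np.dot(block, block))   # per-block work; cheap final reduction
        return total

    if __name__ == "__main__":
        data = np.random.rand(10_000_000).astype(np.float32)
        print(blocked_sum_of_squares(data))

Here the final reduction is a trivial scalar add per block, so the decomposition costs almost nothing; when the per-block merge step is expensive, the monolithic streaming version can win.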




