
> Copying Data from Host to Device

Surprised there's no mention of async copies here. If you want to get the most out of the GPU, you don't want it sitting idle while data is copied between the host and the GPU. Many frameworks provide a mechanism to schedule async copies, which can execute alongside asynchronously submitted work.
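For anyone who hasn't seen this in practice, here is a minimal sketch of the idea in CUDA, assuming pinned host memory and one stream per chunk. The `scale` kernel, the chunk count, and the sizes are made up for illustration and are not from the post. Whether copies and kernels actually overlap depends on the hardware (e.g. the number of copy engines), so treat this as a starting point and check it in a profiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiply every element by a constant.
__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int kChunks = 4;             // illustrative values
    const int kChunkElems = 1 << 20;
    const size_t chunkBytes = kChunkElems * sizeof(float);

    // Pinned (page-locked) host memory is needed for copies to be
    // truly asynchronous with respect to the host.
    float* h = nullptr;
    CHECK(cudaMallocHost((void**)&h, kChunks * chunkBytes));
    float* d = nullptr;
    CHECK(cudaMalloc((void**)&d, kChunks * chunkBytes));
    for (int i = 0; i < kChunks * kChunkElems; ++i) h[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamCreate(&streams[c]));

    // Each stream runs copy-in, kernel, copy-out in order; work in
    // different streams is free to overlap if the device allows it.
    for (int c = 0; c < kChunks; ++c) {
        float* hp = h + (size_t)c * kChunkElems;
        float* dp = d + (size_t)c * kChunkElems;
        CHECK(cudaMemcpyAsync(dp, hp, chunkBytes,
                              cudaMemcpyHostToDevice, streams[c]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, streams[c]>>>(
            dp, kChunkElems, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hp, dp, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    // Every element started at 1.0f and was doubled exactly once.
    int bad = 0;
    for (int i = 0; i < kChunks * kChunkElems; ++i) bad += (h[i] != 2.0f);
    printf("%s\n", bad == 0 ? "ok" : "mismatch");

    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamDestroy(streams[c]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return bad == 0 ? 0 : 1;
}
```

The point of the chunking is that the copy for chunk N+1 can be in flight while the kernel for chunk N runs, instead of the GPU waiting on one big blocking copy.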

The post is sort of GPU 101 but there's a whole world of tricks and techniques beyond that once you start doing real-world GPU programming where you want to squeeze as much out of the expensive GPU as possible. Profiling tools help a lot here because, like much of optimizing now, there are hidden cliffs and non-linearities all over that you have to be aware of.




Since you likely use 64-bit (double) floats, not every GPU would help much, especially compared to a beefy CPU.

But if you use a GPU with a large number of FP64 units, it may speed things up a lot. These are generally not gaming GPUs, but if you have a 4060 sitting around anyway, it has about 300 GFLOPS FP64 performance, likely more than your CPU. Modern CPUs are mighty in this regard though, able to issue many FP64 operations per clock per core.


Did you reply to the wrong comment?



