Hacker News

The biggest thing is using lower precision

> this implementation gives 25% speedup over Nvidia's Pytorch implementation in full precision and 2.5-3x speedup when using TensorCore

Tensor Cores are lower-precision units: they multiply matrices of FP16 (or other reduced-precision) values, typically accumulating in FP32.
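Lower precision helps twice over: it halves the bytes moved per element, and Tensor Cores consume it natively. The trade-off is more aggressive rounding. A small stdlib-only sketch of FP16 round-off (Python's `struct` supports the IEEE 754 binary16 format via the `'e'` code; the values below follow from its 10-bit mantissa):

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

# With a 10-bit mantissa (plus the implicit leading bit), only integers
# up to 2048 are exactly representable in fp16; 2049 rounds away.
print(to_fp16(2049.0))  # 2048.0

# Non-dyadic fractions lose precision immediately.
print(to_fp16(0.1))     # 0.0999755859375
```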

Other than that, the speedups presumably come from better-written CUDA.

If you're asking what the bottlenecks are in general for these kinds of kernels: they are pretty much always memory bound.

See pages 3-5 for the kinds of optimizations that need to be done: https://people.csail.mit.edu/jrk/jrkthesis.pdf
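To see why elementwise kernels end up memory bound, compare their arithmetic intensity (FLOPs per byte of DRAM traffic) against the machine balance of a GPU. A back-of-the-envelope sketch; the bandwidth and peak-FLOP figures below are illustrative, not tied to any particular card:

```python
def arithmetic_intensity(flops_per_elem, bytes_per_elem):
    """FLOPs performed per byte moved to/from DRAM."""
    return flops_per_elem / bytes_per_elem

# y = a*x + b in fp32: 2 FLOPs per element; reads x (4 B), writes y (4 B).
ai = arithmetic_intensity(flops_per_elem=2, bytes_per_elem=8)

# Illustrative GPU: ~900 GB/s DRAM bandwidth, ~15 TFLOP/s fp32 peak.
# A kernel needs at least peak_flops / bandwidth FLOPs per byte to be
# compute bound; anything below that is limited by memory traffic.
machine_balance = 15e12 / 900e9

print(ai)                        # 0.25 FLOP/byte
print(round(machine_balance, 1)) # 16.7 FLOP/byte -> firmly memory bound
```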



I agree with you; it is a combination of both. I would add a few extra points: PyTorch and TensorFlow essentially use the same CUDA/cuDNN libraries, but those libraries are developed to cater to a wide range of corner cases. That makes them robust, but sometimes slower, because of their memory access patterns, algorithm-selection heuristics, and some extra operations. Also, we can get rid of many memory read/write operations by fusing kernels together.
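The fusion point can be sketched in plain Python standing in for GPU kernels (names and the relu(x + b) example are illustrative): unfused, each "kernel" makes its own pass over memory and the intermediate array is written out and read back; fused, there is one pass and no temporary.

```python
def unfused(x, b):
    tmp = [xi + b for xi in x]           # kernel 1: writes tmp to "memory"
    return [max(t, 0.0) for t in tmp]    # kernel 2: reads tmp back in

def fused(x, b):
    # One kernel, one traversal: the intermediate never touches memory.
    return [max(xi + b, 0.0) for xi in x]

x = [-2.0, -0.5, 1.0, 3.0]
print(fused(x, 1.0))  # [0.0, 0.5, 2.0, 4.0]
assert fused(x, 1.0) == unfused(x, 1.0)
```

Same result either way; for a memory-bound op, halving the traversals roughly halves the runtime.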



