Hacker News

The biggest thing is using lower precision

> this implementation gives 25% speedup over Nvidia's Pytorch implementation in full precision and 2.5-3x speedup when using TensorCore

Tensor Cores are lower-precision units: they multiply matrices of FP16 (or other reduced-precision) values, typically accumulating in FP32.
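Lower precision helps twice over: it halves the bytes moved per element, and Tensor Cores consume it natively. The trade-off is more aggressive rounding. A small stdlib-only sketch of FP16 round-off (Python's `struct` supports the IEEE 754 binary16 format via the `'e'` code; the values below follow from its 10-bit mantissa):

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

# With a 10-bit mantissa (plus the implicit leading bit), only integers
# up to 2048 are exactly representable in fp16; 2049 rounds away.
print(to_fp16(2049.0))  # 2048.0

# Non-dyadic fractions lose precision immediately.
print(to_fp16(0.1))     # 0.0999755859375
```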

Other than that, the speedups presumably come from better-written CUDA.

If you're asking what the bottlenecks are in general for these kinds of kernels: they are pretty much always memory bound.

See pages 3-5 for the kinds of optimizations that need to be done: https://people.csail.mit.edu/jrk/jrkthesis.pdf
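To see why elementwise kernels end up memory bound, compare their arithmetic intensity (FLOPs per byte of DRAM traffic) against the machine balance of a GPU. A back-of-the-envelope sketch; the bandwidth and peak-FLOP figures below are illustrative, not tied to any particular card:

```python
def arithmetic_intensity(flops_per_elem, bytes_per_elem):
    """FLOPs performed per byte moved to/from DRAM."""
    return flops_per_elem / bytes_per_elem

# y = a*x + b in fp32: 2 FLOPs per element; reads x (4 B), writes y (4 B).
ai = arithmetic_intensity(flops_per_elem=2, bytes_per_elem=8)

# Illustrative GPU: ~900 GB/s DRAM bandwidth, ~15 TFLOP/s fp32 peak.
# A kernel needs at least peak_flops / bandwidth FLOPs per byte to be
# compute bound; anything below that is limited by memory traffic.
machine_balance = 15e12 / 900e9

print(ai)                        # 0.25 FLOP/byte
print(round(machine_balance, 1)) # 16.7 FLOP/byte -> firmly memory bound
```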



I agree with you; it is a combination of both. I would add a few extra points: PyTorch and TensorFlow essentially use the same CUDA/cuDNN libraries, but those libraries are developed to cater to a wide range of corner cases. That makes them robust, but sometimes slower, because of their memory access patterns, algorithm-selection heuristics, and some extra operations. Also, we can get rid of many memory read/write operations by fusing kernels together.
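The fusion point can be sketched in plain Python standing in for GPU kernels (names and the relu(x + b) example are illustrative): unfused, each "kernel" makes its own pass over memory and the intermediate array is written out and read back; fused, there is one pass and no temporary.

```python
def unfused(x, b):
    tmp = [xi + b for xi in x]           # kernel 1: writes tmp to "memory"
    return [max(t, 0.0) for t in tmp]    # kernel 2: reads tmp back in

def fused(x, b):
    # One kernel, one traversal: the intermediate never touches memory.
    return [max(xi + b, 0.0) for xi in x]

x = [-2.0, -0.5, 1.0, 3.0]
print(fused(x, 1.0))  # [0.0, 0.5, 2.0, 4.0]
assert fused(x, 1.0) == unfused(x, 1.0)
```

Same result either way; for a memory-bound op, halving the traversals roughly halves the runtime.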



