Going from 64-bit to 16-bit floats, for rank-2 tensor work you can do 4x as many operations; for rank-3 tensor work it's 8x, assuming memory bandwidth is the bottleneck.
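Back-of-envelope, in Python (the 900 GB/s bandwidth figure is a made-up placeholder, not any real chip's spec):

    # Sketch of the bandwidth-bound claim above: smaller elements mean
    # more values streamed per second over the same memory link.
    BYTES = {"fp64": 8, "fp16": 2}

    def elements_per_second(bandwidth_gb_s, dtype):
        # Values streamed per second when memory traffic is the bottleneck.
        return bandwidth_gb_s * 1e9 / BYTES[dtype]

    bw = 900  # hypothetical GB/s of accelerator memory bandwidth
    for dtype in ("fp64", "fp16"):
        print(f"{dtype}: {elements_per_second(bw, dtype):.2e} values/s")

    # fp16 streams 8/2 = 4x as many values as fp64 over the same link;
    # how much extra *compute* that buys depends on how often each
    # loaded value gets reused inside the contraction.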



Does that mean it's 64x as fast for 16-bit floating point vs. 64-bit for a rank-3 tensor?


In principle, yes, assuming 1) memory bandwidth is the bottleneck and 2) you can keep the tensor values in cache or registers.

I think GPUs are still vector (i.e. rank-1) processing engines, so they should only scale by 4x... But assuming Google architected the TPU correctly, it should be 16x as fast (I believe the architecture is effectively that of a rank-2 tensor engine).
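If I'm reading that model right, the predicted speedup is (size ratio) ** rank; a tiny sketch of that model, which is my paraphrase of this subthread's reasoning and not anything Google has published:

    # If the datapath is built around rank-k operands, shrinking each
    # element by a factor r lets r times as many values flow along each
    # of the k dimensions, for an r**k predicted speedup.
    def predicted_speedup(bits_from, bits_to, rank):
        r = bits_from / bits_to  # 64-bit -> 16-bit gives r = 4
        return r ** rank

    for rank, label in [(1, "vector engine (GPU)"),
                        (2, "rank-2 engine (TPU, per the guess above)"),
                        (3, "hypothetical rank-3 engine")]:
        print(f"{label}: {predicted_speedup(64, 16, rank):.0f}x")
    # vector engine (GPU): 4x
    # rank-2 engine (TPU, per the guess above): 16x
    # hypothetical rank-3 engine: 64x

That reproduces the numbers in this subthread: 4x for a vector engine, 16x for a rank-2 engine, and the 64x the parent asked about for a rank-3 engine.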



