Hacker News

For latency-bound inference (i.e. a single request) you don't need tensor cores, since all your operations are just matrix-vector multiplications.
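To make the distinction concrete, here is a minimal numpy sketch (dimensions are illustrative, not from any particular model): with a batch of requests the linear layer is a matrix-matrix multiply (GEMM), which is what tensor cores accelerate, while at batch size one the same layer collapses to a matrix-vector multiply (GEMV).

```python
import numpy as np

# Hypothetical transformer layer dimensions (illustrative only).
d_model, d_ff = 4096, 11008

W_up = np.random.randn(d_ff, d_model).astype(np.float32)

# Batched inference: activations for 32 tokens form a matrix, so the
# layer is a matrix-matrix multiply (GEMM) -- the tensor-core case.
X_batch = np.random.randn(32, d_model).astype(np.float32)
Y_batch = X_batch @ W_up.T          # shape (32, d_ff)

# Single-request decode: one token's activation is a vector, so the
# same layer is just a matrix-vector multiply (GEMV).
x = np.random.randn(d_model).astype(np.float32)
y = W_up @ x                        # shape (d_ff,)
```

The weights `W_up` are identical in both cases; only the shape of the activations changes, and that shape is what decides whether tensor-core GEMM hardware has anything to chew on.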

Good point, yes. That explains why he's getting performance similar to the leading frameworks: those tensor operations help for training or for throughput-optimised batched inference, but not really at a batch size of one.

I actually didn't know that. I'm in the space as a hobbyist, and I had a vague understanding that tensor cores are essential for reaching peak performance but only apply to certain operations, like dense matrix-matrix multiplication. It was on my list to investigate whether they could further speed up single-batch decoding - makes sense that they don't help when it's all matrix-vector.
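A back-of-envelope way to see why tensor cores don't help here: a GEMV reads every weight once and does roughly two FLOPs per weight, so its arithmetic intensity is ~1 FLOP per byte at fp16, far below the compute-to-bandwidth ratio of any modern GPU. The kernel is memory-bound, and extra math throughput goes unused. A sketch under assumed, illustrative dimensions:

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for one
# fp16 linear layer at varying batch sizes. Dimensions are illustrative.
d_in, d_out, bytes_per_weight = 4096, 4096, 2

def intensity(batch):
    flops = 2 * batch * d_in * d_out        # one multiply-accumulate = 2 FLOPs
    # Weight traffic dominates; activation traffic is negligible here.
    bytes_moved = d_in * d_out * bytes_per_weight
    return flops / bytes_moved

# Batch 1 (GEMV): ~1 FLOP/byte -> memory-bound, tensor cores idle.
# Batch 256 (GEMM): weights are reused across the batch, intensity
# scales with batch size and the kernel can become compute-bound.
print(intensity(1))     # 1.0
print(intensity(256))   # 256.0
```

This is the usual roofline argument: below the hardware's FLOPs-per-byte balance point, bandwidth, not math units, sets the decode speed.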


