Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Correct. The main bottleneck with LLM inference is, and have always been, memory bandwidth.

TPS = active weights in GB / your memory bandwidth.

That’s it for decode. That’s all.

 help



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: