
> I don't quite understand. Most of LLM inference is MatMul and Softmax.

Most of those operations are memory-bound: slow because memory access takes longer than the computation itself. The problem is that all of the model's weights, billions of them, have to be pulled from memory into the chip's SRAM once for every token generated, and that memory traffic, not the math, is what makes it slow.
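A rough back-of-the-envelope sketch of why that dominates, assuming an illustrative 7B-parameter model in fp16, ~1 TB/s of memory bandwidth, and ~300 TFLOP/s of compute (all numbers are assumptions, not measurements):

    # Why per-token generation is memory-bound (illustrative numbers only).
    params = 7e9                 # assumed 7B-parameter model
    bytes_per_param = 2          # fp16 weights
    weight_bytes = params * bytes_per_param

    mem_bandwidth = 1e12         # assumed ~1 TB/s memory bandwidth
    flops_per_token = 2 * params # ~2 FLOPs per parameter per token (matmuls)
    peak_flops = 300e12          # assumed ~300 TFLOP/s of fp16 compute

    time_memory = weight_bytes / mem_bandwidth    # time to read every weight once
    time_compute = flops_per_token / peak_flops   # time to do the matmuls

    print(f"memory:  {time_memory * 1e3:.2f} ms per token")   # ~14 ms
    print(f"compute: {time_compute * 1e3:.3f} ms per token")  # ~0.05 ms

Under those assumptions, streaming the weights takes hundreds of times longer than the matrix math itself, so per-token latency is set by memory bandwidth, not by FLOPs.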



