> I don't quite understand. Most of LLM inference is MatMul and Softmax.
Most operations are memory bound: they are slow because moving data takes longer than the computation itself. During decoding, every generated token requires streaming all of the model's weights — billions of parameters — from HBM into the GPU's on-chip SRAM and compute units at least once. That memory traffic, not the MatMuls or Softmax, is what makes inference slow.
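A back-of-envelope sketch of why this bounds throughput. The model size and bandwidth below are hypothetical illustrative numbers (a 7B-parameter model in fp16 on a GPU with roughly 1 TB/s of HBM bandwidth), not any specific card:

```python
# Memory-bandwidth ceiling for single-stream token generation.
# Assumed numbers for illustration only:
params = 7e9               # 7B parameters
bytes_per_weight = 2       # fp16
hbm_bandwidth = 1e12       # ~1 TB/s, bytes per second

# Each generated token must read every weight from HBM at least once.
bytes_per_token = params * bytes_per_weight
min_time_per_token = bytes_per_token / hbm_bandwidth   # seconds
max_tokens_per_sec = 1 / min_time_per_token

print(f"Upper bound: ~{max_tokens_per_sec:.0f} tokens/s")
```

The compute itself (a couple of FLOPs per weight) finishes far faster than the weights can be delivered, which is why batching and weight quantization — fewer bytes moved per token — raise throughput so effectively.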