LLMs can't batch token generation for a single user. It's sequential: each token depends on the ones before it. In fact, that's part of the paper's point: "dumb" (static) batching leaves the GPU underutilized because responses aren't all the same length, so the batch ends up generating one token at a time near the end while it waits for the longest response to finish.
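A toy sketch of that effect (the response lengths here are made up, not from the paper): with static batching, the whole batch occupies the GPU until its longest response finishes, so shorter responses leave idle slots at every remaining decode step.

```python
# Hypothetical per-request response lengths in one static batch.
response_lengths = [12, 80, 25, 300]

steps = max(response_lengths)          # decode steps the batch occupies
useful = sum(response_lengths)         # token-slots doing real work
total = steps * len(response_lengths)  # token-slots the GPU spends

print(f"utilization: {useful / total:.0%}")  # ~35% here; the tail runs 1-wide
```

Once the three shorter responses finish, the remaining 220 steps generate exactly one token each, which is the "one token at a time at the end" problem. Continuous batching schedulers avoid this by swapping new requests into the freed slots instead of waiting for the batch to drain.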