Ah, I see. This isn't necessarily virtualizing the static weights but the variable-sized, data-dependent key-value caches. These caches are built up as you go through the sequence of tokens. Makes sense.
How doesn't paging worsen speed performance though? If you are making more trips to the memory, then are you really just saving vram?
Also, I see that vLLM, which implements PagedAttention, also uses better scheduling? Wouldn't the speed improvements be coming from that instead? Don't put an expected short input and output in the same batch as a big input and big output?
What are the results of using only the sequence-length-based scheduling, without virtualization?
> How doesn't paging worsen speed performance though?
It does worsen the performance of the attention kernel compared with kernels that take keys and values in a contiguous memory layout.
> Wouldn't the speed improvements be coming from that instead? Don't put an expected short input and output in the same batch as a big input and big output?
Actually, it puts everything in the same batch. The reason for its high throughput is that each sequence is removed from the batch as soon as it finishes, and new sequences can be added to the batch on the fly if there is enough space in the KV cache. This is called continuous batching (https://www.anyscale.com/blog/continuous-batching-llm-infere...).
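To make the throughput argument concrete, here is a toy sketch of the two batching policies. It is not vLLM's scheduler: the function names are made up for illustration, every sequence is assumed to produce at least one token, and the "is there enough space in the KV cache" check is simplified to a fixed number of batch slots.

```python
from collections import deque

def continuous_batching_steps(output_lens, max_batch):
    """Count decode steps when finished sequences leave the batch immediately.

    output_lens: tokens each request will generate (unknown to a real scheduler,
                 used here only to decide when a sequence finishes).
    max_batch:   hypothetical slot limit standing in for KV cache capacity.
    """
    waiting = deque(output_lens)
    running = []  # remaining tokens for each sequence currently in the batch
    steps = 0
    while waiting or running:
        # Admit new sequences as soon as a slot is free.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every running sequence.
        running = [r - 1 for r in running]
        # Finished sequences are dropped right away.
        running = [r for r in running if r > 0]
        steps += 1
    return steps

def static_batching_steps(output_lens, max_batch):
    """Count decode steps when a batch only ends once its longest sequence ends."""
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps

lens = [2, 8, 2, 2]
print(continuous_batching_steps(lens, max_batch=2))  # 8
print(static_batching_steps(lens, max_batch=2))      # 10
```

In this toy case the short requests finish while the long one is still running, so the slots they free up are reused instead of sitting idle.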
Paged attention and the "virtualized" KV cache play an important role in an efficient implementation of continuous batching. Text generation in an LLM is a dynamic process, and it's not possible to predict how long the output will be when scheduling incoming requests. Therefore a dynamic approach is needed for KV cache allocation, even though it hurts the performance of attention.
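A rough sketch of what dynamic KV cache allocation could look like: memory is split into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks. The class and method names below are hypothetical and only track bookkeeping, not actual tensors; the real implementation differs in many details.

```python
class PagedKVCache:
    """Toy block allocator: blocks are handed out one at a time as sequences grow."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when all blocks held so far are full.
        if n == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")       # 3 tokens need 2 blocks of size 2
print(len(cache.free_blocks))     # 2
cache.free("a")
print(len(cache.free_blocks))     # 4
```

Because a sequence never holds more than one partially filled block, nothing has to be reserved up front for an output length that isn't known yet. The cost is that the attention kernel has to go through the block table instead of reading one contiguous buffer.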