
I wonder how the performance fares when the context size is increased. Intuitively it should improve, but some quantized models I've tested showed noticeably worse performance.



Your KV cache size is linear in the context size, which might put you tight on memory. There is also the increased cost of recalculating the KV cache for the context window when the window has to move, but that is close to being solved with streaming LLMs.
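
A rough back-of-the-envelope sketch of that linear growth, in Python. The dimensions below (32 layers, 32 heads, head dim 128, fp16) are just illustrative Llama-2-7B-like assumptions, not numbers from the article:

    # Hypothetical KV-cache memory estimate: one K and one V vector
    # per token, per layer, per head -> memory is linear in context length.
    def kv_cache_bytes(context_len, n_layers=32, n_heads=32,
                       head_dim=128, bytes_per_elem=2, batch=1):
        # 2x for keys and values; bytes_per_elem=2 assumes fp16
        return 2 * n_layers * n_heads * head_dim * context_len * bytes_per_elem * batch

    for ctx in (2_048, 8_192, 32_768):
        print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
    #  2048 tokens ->  1.0 GiB
    #  8192 tokens ->  4.0 GiB
    # 32768 tokens -> 16.0 GiB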


BERT style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference. A KV cache is only needed for efficient inference with encoder-decoder and decoder-only (aka GPT) models.
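
To make the distinction concrete, here is a minimal sketch assuming the Hugging Face transformers library; the model names are just illustrative. The encoder-only model embeds the whole input in a single forward pass, so there is nothing to cache, while the decoder-only model generates token by token and reuses past keys/values:

    import torch
    from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

    # Encoder-only (BERT-style) embedding: one forward pass, no KV cache needed.
    enc_tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    inputs = enc_tok("an example sentence to embed", return_tensors="pt")
    with torch.no_grad():
        embedding = encoder(**inputs).last_hidden_state.mean(dim=1)  # mean pooling

    # Decoder-only (GPT-style) generation: KV cache reused across decoding steps.
    dec_tok = AutoTokenizer.from_pretrained("gpt2")
    decoder = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = dec_tok("The quick brown fox", return_tensors="pt")
    out = decoder.generate(**prompt, max_new_tokens=20, use_cache=True)
    print(dec_tok.decode(out[0]))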



