
I wonder how the performance fares when the context size is increased. Intuitively it should improve, but some quantized models I've tested showed noticeably worse performance.



Your KV cache size is linear in the context size, which might put you tight on memory. There is also the increased cost of recalculating the KV cache for the context window when the window has to move, but that is close to being solved with streaming LLMs.
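
A rough back-of-the-envelope sketch of that linear growth, in Python. The dimensions below (32 layers, 32 heads, head dim 128, fp16) are just illustrative Llama-2-7B-like assumptions, not numbers from the article:

    # Hypothetical KV-cache memory estimate: one K and one V vector
    # per token, per layer, per head -> memory is linear in context length.
    def kv_cache_bytes(context_len, n_layers=32, n_heads=32,
                       head_dim=128, bytes_per_elem=2, batch=1):
        # 2x for keys and values; bytes_per_elem=2 assumes fp16
        return 2 * n_layers * n_heads * head_dim * context_len * bytes_per_elem * batch

    for ctx in (2_048, 8_192, 32_768):
        print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
    #  2048 tokens ->  1.0 GiB
    #  8192 tokens ->  4.0 GiB
    # 32768 tokens -> 16.0 GiB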


BERT style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference. A KV cache is only needed for efficient inference with encoder-decoder and decoder-only (aka GPT) models.
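
To make the distinction concrete, here is a minimal sketch assuming the Hugging Face transformers library; the model names are just illustrative. The encoder-only model embeds the whole input in a single forward pass, so there is nothing to cache, while the decoder-only model generates token by token and reuses past keys/values:

    import torch
    from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

    # Encoder-only (BERT-style) embedding: one forward pass, no KV cache needed.
    enc_tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    inputs = enc_tok("an example sentence to embed", return_tensors="pt")
    with torch.no_grad():
        embedding = encoder(**inputs).last_hidden_state.mean(dim=1)  # mean pooling

    # Decoder-only (GPT-style) generation: KV cache reused across decoding steps.
    dec_tok = AutoTokenizer.from_pretrained("gpt2")
    decoder = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = dec_tok("The quick brown fox", return_tensors="pt")
    out = decoder.generate(**prompt, max_new_tokens=20, use_cache=True)
    print(dec_tok.decode(out[0]))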



