Hacker News
StreamingLLM: Efficient streaming technique enables infinite sequence lengths (arxiv.org)
118 points by TheJCDenton 7 months ago | 12 comments



The demo [3] seems very promising.

"Their method cleverly exploits the LLMs' tendency to use initial tokens as "attention sinks" to anchor the distribution of attention scores. By caching initial tokens alongside recent ones, StreamingLLM restored perplexity and achieved up to 22x faster decoding than prior techniques." [1]

"We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more." [2]

"we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment." [2]

"StreamingLLM achieves an impressive speedup, reaching up to 22.2× per token. Despite its reduced latency, StreamingLLM sustains a memory footprint consistent with the re-computation baseline." [2]

[1] https://notes.aimodels.fyi/llm-infinite-context-window-strea...

[2] https://arxiv.org/pdf/2309.17453.pdf

[3] https://github.com/mit-han-lab/streaming-llm


I see a lot of people misunderstanding what this is about. What this allows is incrementally updating the attention cache. It does not allow the model to see beyond its usual attention window. As they explain in the README, only the tokens that fit into the usual window are considered -- so if you ask a question about a long book, it will only consider the last pages.
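
Roughly, the eviction policy amounts to something like this (a minimal sketch in plain Python; sink_size and window_size are illustrative names, and the actual repo operates on per-layer key/value tensors rather than a flat list):

    # Sketch of StreamingLLM-style KV cache eviction (illustrative only).
    # Keep the first `sink_size` tokens (the "attention sinks") plus the
    # most recent tokens; everything in between is dropped.
    def evict(kv_cache, sink_size=4, window_size=2048):
        if len(kv_cache) <= window_size:
            return kv_cache
        sinks = kv_cache[:sink_size]
        recent = kv_cache[-(window_size - sink_size):]
        return sinks + recent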

But it can still be useful. Imagine this use case, where you have a chat conversation between Assistant and User. Assume that the inputs to get the next assistant response are just the past conversation turns (cut off to fit the context window).

So for turn 1 the input is:

   User: (user turn 1)
For turn 2 the input is:

   User: (user turn 1)
   Assistant: (assistant turn 1)
   User: (user turn 2)
Etc.

Now, what this allows you to do is reuse the attention computed from the previous turns (since the prefix is the same).
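
In code, that reuse boils down to keeping the past key/values around and only feeding the new suffix each turn. A rough sketch with Hugging Face transformers (the model name is just an example, and this ignores tokenizer boundary effects at the seams):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # example model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    past = None  # cached key/values for everything processed so far

    def extend(new_text):
        """Append new_text to the running conversation, reusing the cached prefix."""
        global past
        ids = tok(new_text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        return out.logits

    extend("User: (user turn 1)\nAssistant:")
    extend(" (assistant turn 1)\nUser: (user turn 2)\nAssistant:")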

In practice, people often have a system prompt before the conversation history, which (as far as I can tell) makes this technique not applicable: the input prefix will change as soon as the conversation history is long enough that we need to start dropping the oldest turns, otherwise the system prompt would get ignored.

In that case, what you could do is cache at least the system prompt. This is also possible with https://github.com/OpenNMT/CTranslate2/blob/2203ad5c8baf878a...


It looks like a longformer architecture, but with dedicated "attention sink" tokens at the beginning that provide storage space. Is that right?



Figure 3 shows that Falcon and Pythia are much less susceptible to the lack of an attention sink than Llama or MPT… it seems they work almost as well with just naïve windowed attention.


As far as I understand from their FAQ on GitHub, it's more about reducing latency for a more instant response.

It would be interesting to see an application of this where you can have a more fluid conversation, with the ability to interrupt each other mid-sentence. I suppose this would require retraining or finetuning on transcribed natural vocal conversations between two people. It would probably also require a different structure than the current chat-based methods.


Yeah "by conveniently dropping token out of attention we can have infinite tokens" not exactly a breakthrough


The nuance is that their technique allows you to slide the context window along without introducing a jagged discontinuity in generation. At least that’s my feel for it.


Yes, but is that really why people want the context window to be "infinite"? In my experience, the desire for bigger context is the ability to, say, dump a 1000-page book into an LLM and ask questions about any part of it, not just the last chapter.


I think the title is misleading. This is an MIT PR piece.


Why do you think it's only that? I've seen a lot of dismissive comments on AI articles that were shown to misunderstand or underappreciate the article.


Why do you claim I "think it's only that"?

The other part of the paper, handling the sliding KV cache, compares favourably with prefix caching, sure, but we moved away from prefix caching for serving a while ago: paged attention (which should really have been called paged KV cache, but oh well) offers a lot of interesting improvements in that area, including very good support for parallel decoding.
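
For reference, the paged approach keeps the key/value cache in fixed-size blocks addressed through a per-sequence block table, roughly like this (a toy sketch, not vLLM's or any library's actual implementation):

    # Toy sketch of a paged KV cache (illustrative only): fixed-size blocks
    # allocated on demand from a shared pool, addressed via a block table.
    # Sequences that share a prefix can point their tables at the same blocks.
    BLOCK_SIZE = 16                  # tokens per block

    pool = {}                        # physical block id -> list of (key, value)
    next_block_id = 0
    block_table = []                 # this sequence: logical index -> physical id

    def append_kv(kv):
        global next_block_id
        if not block_table or len(pool[block_table[-1]]) == BLOCK_SIZE:
            pool[next_block_id] = []             # allocate a new block
            block_table.append(next_block_id)
            next_block_id += 1
        pool[block_table[-1]].append(kv)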

And I do not care enough to compare the streaming cache with the paged attention cache directly, first because it's work they should have done and not me, and second because silently dropping tokens confuses and frustrates users enough that it puts me off investigating further.



