
Tokens are generally consumed sequentially. The model does not care what token 3001 is when it’s creating the embedding representation of tokens 1-3000. You can reuse the cached state of those first 3000 tokens rather than reprocessing them each time.
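Rough sketch of what I mean, using Hugging Face transformers with gpt2 as a stand-in (purely illustrative; real inference servers do this more carefully under the name prefix/prompt caching): run the shared prefix once, keep its key/value cache, and pay only for the new suffix tokens on each request.

    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prefix = "You are a helpful assistant. Answer briefly.\n\n"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids

    with torch.no_grad():
        # Run the shared prefix once and keep its attention key/value cache.
        prefix_cache = model(prefix_ids, use_cache=True).past_key_values

    def next_token_after(user_text):
        user_ids = tokenizer(user_text, return_tensors="pt").input_ids
        with torch.no_grad():
            # Feed only the new tokens; the prefix comes from a copy of the
            # cache (copied defensively, since the cache object may be
            # extended in place when reused).
            out = model(user_ids,
                        past_key_values=copy.deepcopy(prefix_cache),
                        use_cache=True)
        return tokenizer.decode([int(out.logits[0, -1].argmax())])

    print(next_token_after("What is the capital of France?"))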



I think you are confused about how transformers work. You are conflating static embeddings with the dynamic, contextual outputs calculated by transformers. While the initial embeddings (the representations BEFORE processing through the transformer layers) are relatively static, the output embeddings from each layer of the transformer are highly dynamic and context-dependent. They incorporate information from the entire input sequence via the attention mechanisms.

So when transformers process an input sequence, they don't just look at each token in isolation but consider the ENTIRE sequence's context through complex inter-token relationships. This makes the technique of caching outputs non-trivial and context-specific. The sequential processing does not imply independence of tokens but underscores the integral role of context and sequence in generating accurate and coherent outputs.

This should help you understand better how transformers work: https://bbycroft.net/llm


I’m not saying they look at each token in isolation. I’m saying they look at the current token and the previous tokens. It is not accurate that these things consider the entire sequence; they (generally) consider the entire sequence up until that token. Which means if you have a large prefix prompt that comes before everything, the embedding of that prefix prompt is always the same.

You want to reuse the embedding representation of the prefix prompt before the new user tokens are concatenated; otherwise you are recalculating that embedding over and over and over. God forbid it’s not a small prefix but a huge document.
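To make that concrete, here is a toy check in plain PyTorch (a single causal self-attention layer with random weights, not anyone's real model): the outputs at the prefix positions come out identical whether or not extra tokens are appended afterwards, because the causal mask stops earlier positions from attending to later ones.

    import torch

    torch.manual_seed(0)
    d = 16
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))

    def causal_attention(x):
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / d ** 0.5
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

    prefix = torch.randn(5, d)   # 5 "prefix prompt" token vectors
    suffix = torch.randn(3, d)   # 3 new "user" token vectors

    out_prefix_only = causal_attention(prefix)
    out_full = causal_attention(torch.cat([prefix, suffix]))

    # The first 5 rows match: appending tokens did not change the prefix's
    # representations, so they can be cached and reused.
    print(torch.allclose(out_prefix_only, out_full[:5], atol=1e-6))  # True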


No, you are mistaken. You are still confusing the initial embeddings, which are basically computationally free compared to the rest, with the contextual outputs of the transformer layers. I explained this in detail in my previous post. The ENTIRE sequence is processed by the transformer to generate the first predicted token. Please check the link I provided and click on 'continue' to see a visual representation of how vectors are created from embeddings and how the calculations are made. This should clarify what I mean.


The entire sequence is processed by the transformer to make the first predicted token. But that processing of the entire sequence consists of processing :n, and previously :n-1, and previously :n-2, etc., cumulatively.

If tokens 1:3000 are the same, you are going to be doing the same work processing them over and over, and then adapting the results to the user tokens at the end.
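Same toy single-layer setup as before, to show what I mean by cumulative: process the tokens one at a time, appending each token's key/value to a cache instead of recomputing them, and the result matches the full recomputation.

    import torch

    torch.manual_seed(0)
    d = 16
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))
    tokens = torch.randn(8, d)   # 8 token vectors

    # Full recomputation with a causal mask.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = (q @ k.T / d ** 0.5).masked_fill(
        torch.triu(torch.ones(8, 8), diagonal=1).bool(), float("-inf"))
    full = torch.softmax(scores, dim=-1) @ v

    # Incremental processing with a key/value cache: token n only pays for itself.
    k_cache, v_cache, outs = [], [], []
    for x in tokens:
        k_cache.append(x @ wk)
        v_cache.append(x @ wv)
        ks, vs = torch.stack(k_cache), torch.stack(v_cache)
        attn = torch.softmax((x @ wq) @ ks.T / d ** 0.5, dim=-1)
        outs.append(attn @ vs)

    print(torch.allclose(full, torch.stack(outs), atol=1e-5))  # True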


No, sorry, what you wrote is wrong. Transformers process sequences token by token, but they don't do it cumulatively in the sense of linear models or RNNs. Instead, they process all tokens simultaneously when calculating attention. The link I provided shows this visually.

Every time a token is processed in a transformer, the model computes its attention relative to ALL other tokens in the sequence. This is a key difference from sequential, step-by-step processing where previous states are incrementally built upon (your :n, :n-1, etc.). The attention mechanism in transformers recalculates the relationships for EACH token with ALL other tokens EVERY time any part of the input sequence changes.

When new tokens are added to the sequence (for example, user input after a system prompt), the attention relationships for ALL PREVIOUS tokens can change. This is because the context provided by the new tokens can alter the relevance and interpretation of the earlier tokens. As such, the attention scores, and subsequently the output embeddings for all tokens, are recalculated to integrate this new information. So as you can see, you are not doing the same work over and over.

I hope this helps you better understand how transformers work.



