
Normally the inputs are padded out to the context length [1] and so the cost to embed 1 token or N tokens is the same. The output is produced token-by-token and so the amount of GPU time increases with the number of output tokens.

[1] I'm not sure if these huge context lengths are achieved the same way (i.e. a single input vector of length N), but given that the cost is constant for input, I would assume the resource usage is too.
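
A toy sketch of what I mean by padding (the `CONTEXT_LEN` and `PAD_ID` values here are made up, and how any given provider actually does this is an assumption on my part):

    import numpy as np

    CONTEXT_LEN = 2048  # hypothetical fixed context length
    PAD_ID = 0          # hypothetical padding token id

    def pad_to_context(token_ids):
        # Pad a single request out to the fixed context length and build a
        # mask so attention ignores the padding. Because the padded shape is
        # always (CONTEXT_LEN,), the forward pass does the same amount of
        # work whether the prompt is 1 token or 2048.
        ids = np.full(CONTEXT_LEN, PAD_ID, dtype=np.int64)
        mask = np.zeros(CONTEXT_LEN, dtype=bool)
        n = min(len(token_ids), CONTEXT_LEN)
        ids[:n] = token_ids[:n]
        mask[:n] = True
        return ids, mask

    ids, mask = pad_to_context([17, 42, 99])
    print(ids.shape)  # (2048,) regardless of prompt length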



This doesn't match my mental model (or implemented model, in the case of GPT-2) of how self-attention works (you need to calculate the residual stream for each individual token, attending to all the tokens before it). Have a link?
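
For reference, a toy version of the causal self-attention I have in mind (single head, numpy, untrained identity projections, just to show the point): the score matrix is (T, T) for T real tokens, so the work grows with the number of tokens actually present rather than being constant.

    import numpy as np

    def causal_self_attention(x):
        # x has shape (T, d): one residual-stream vector per token.
        # Each position t attends only to positions <= t.
        T, d = x.shape
        q, k, v = x, x, x  # toy projections; real models use learned weights
        scores = q @ k.T / np.sqrt(d)                      # (T, T)
        causal = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(causal, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
        return weights @ v                                 # (T, d)

    out = causal_self_attention(np.random.randn(5, 8))  # 5 tokens, 8-dim
    print(out.shape)  # (5, 8)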


I work on infrastructure for serving large language models, but I don't have any background in ML, so my perspective comes from treating these models as a black box (plus conversations with the people who do the ML stuff). In practice, at least on the latency side, with a fixed context length N, embedding any number of tokens from 0 to N takes the same amount of time. Perhaps it's a difference between the conceptual model and the actual implementation on the GPU?

edit - This occurred to me after the fact, but I wonder if the difference is that the use case I work with processes many different embedding requests computed in one batch, so it has to process `min(longest embedding, N)` tokens and any individual request in theory sees no difference. This would also be the case for Anthropic, however.
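
Something like this toy sketch of what I mean by batching (the real scheduler is much more involved; the shapes and `PAD_ID` here are assumptions):

    import numpy as np

    CONTEXT_LEN = 2048  # hypothetical service limit
    PAD_ID = 0

    def pad_batch(requests):
        # Pad a batch of token-id lists to the longest request in the batch,
        # capped at the context length. Every request is processed in one
        # forward pass of shape (batch, padded_len), so a 3-token request
        # batched with a 500-token request pays for 500 positions either way.
        padded_len = min(max(len(r) for r in requests), CONTEXT_LEN)
        batch = np.full((len(requests), padded_len), PAD_ID, dtype=np.int64)
        for i, r in enumerate(requests):
            n = min(len(r), padded_len)
            batch[i, :n] = r[:n]
        return batch

    batch = pad_batch([[1, 2, 3], list(range(500))])
    print(batch.shape)  # (2, 500)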


Ah, you're thinking about embeddings, which are basically the encoder stack of a traditional transformer architecture. Modern GPT-like models (including Claude), however, drop the encoder and use decoder-only architectures.

I could imagine encoders padding up to the context length, because causal masking doesn't apply and the self-attention has learned to look across the whole context window.
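
As a rough illustration of the masking difference I mean (toy numpy, not any particular model's implementation): an encoder lets every real token attend to every other token, while a decoder restricts each token to the positions at or before it.

    import numpy as np

    T = 4  # real tokens in a toy sequence

    # Encoder-style (bidirectional): every token can attend to every token.
    encoder_mask = np.ones((T, T), dtype=bool)

    # Decoder-style (causal): token t can only attend to positions <= t.
    decoder_mask = np.tril(np.ones((T, T), dtype=bool))

    print(encoder_mask.astype(int))  # all ones
    print(decoder_mask.astype(int))  # lower-triangular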


Decoder-only architecture? What is this? That doesn't sound like a transformer at all. Are you saying GPT-4 uses a totally different algorithm?


Nope, a decoder-only transformer is a variant of the original architecture proposed by Google [1]. All variants of GPT that we know about (1 through 3) roughly use this same architecture, which takes only the decoder stack from the original Google paper and drops the encoder [2].

[1] Original Google Paper - https://arxiv.org/abs/1706.03762

[2] Original GPT Paper - https://s3-us-west-2.amazonaws.com/openai-assets/research-co...
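
For a concrete (toy) picture of what "decoder-only" means in code, here's a single block with masked self-attention and an MLP, random untrained weights, and crucially no cross-attention to an encoder. I'm using the pre-norm layout and a single head for simplicity; the details vary between GPT versions, so treat this as a sketch rather than any model's actual implementation.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def causal_attention(x, w_q, w_k, w_v, w_o):
        T, d = x.shape
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(d)
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -1e9)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return (w @ v) @ w_o

    def decoder_block(x, params):
        # Masked self-attention + MLP, each with a residual connection.
        # There is no cross-attention, because there is no encoder output
        # to attend to; that's the "decoder-only" change.
        x = x + causal_attention(layer_norm(x), *params["attn"])
        h = layer_norm(x) @ params["mlp_in"]
        x = x + np.maximum(h, 0.0) @ params["mlp_out"]  # ReLU MLP
        return x

    d = 16
    rng = np.random.default_rng(0)
    params = {
        "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(4)],
        "mlp_in": rng.normal(size=(d, 4 * d)) * 0.1,
        "mlp_out": rng.normal(size=(4 * d, d)) * 0.1,
    }
    print(decoder_block(rng.normal(size=(5, d)), params).shape)  # (5, 16)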


How can it work without an encoder?


Everyone serious batches short prompts together, so the cost is roughly proportional to the number of tokens.
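
E.g. something like length bucketing, which is just one plausible scheme; I'm not claiming any particular provider does exactly this. Grouping prompts of similar length means each batch is padded only up to its bucket's boundary rather than the full context length, so total work tracks the total number of real tokens fairly closely.

    from collections import defaultdict

    def bucket_by_length(prompts, bucket_size=128):
        # Group prompts into buckets of similar length so padding waste
        # stays small within each batch.
        buckets = defaultdict(list)
        for p in prompts:
            boundary = ((len(p) // bucket_size) + 1) * bucket_size
            buckets[boundary].append(p)
        return buckets

    buckets = bucket_by_length([[0] * 10, [0] * 90, [0] * 700])
    print({k: len(v) for k, v in sorted(buckets.items())})  # {128: 2, 768: 1}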



