
Normally the inputs are padded out to the context length [1] and so the cost to embed 1 token or N tokens is the same. The output is produced token-by-token and so the amount of GPU time increases with the number of output tokens.

[1] I'm not sure if these huge context lengths are achieved the same way (i.e. a single input vector of length N), but given that the cost is constant for input, I would assume the resource usage is too.
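
A toy sketch of what I mean by padding (the `CONTEXT_LEN` and `PAD_ID` values here are made up, and how any given provider actually does this is an assumption on my part):

    import numpy as np

    CONTEXT_LEN = 2048  # hypothetical fixed context length
    PAD_ID = 0          # hypothetical padding token id

    def pad_to_context(token_ids):
        # Pad a single request out to the fixed context length and build a
        # mask so attention ignores the padding. Because the padded shape is
        # always (CONTEXT_LEN,), the forward pass does the same amount of
        # work whether the prompt is 1 token or 2048.
        ids = np.full(CONTEXT_LEN, PAD_ID, dtype=np.int64)
        mask = np.zeros(CONTEXT_LEN, dtype=bool)
        n = min(len(token_ids), CONTEXT_LEN)
        ids[:n] = token_ids[:n]
        mask[:n] = True
        return ids, mask

    ids, mask = pad_to_context([17, 42, 99])
    print(ids.shape)  # (2048,) regardless of prompt length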



This doesn't match my mental model (or implemented model, in the case of GPT-2) of how self-attention works (you need to calculate the residual stream for each individual token, attending to all the tokens before it). Have a link?
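
For reference, a toy version of the causal self-attention I have in mind (single head, numpy, untrained identity projections, just to show the point): the score matrix is (T, T) for T real tokens, so the work grows with the number of tokens actually present rather than being constant.

    import numpy as np

    def causal_self_attention(x):
        # x has shape (T, d): one residual-stream vector per token.
        # Each position t attends only to positions <= t.
        T, d = x.shape
        q, k, v = x, x, x  # toy projections; real models use learned weights
        scores = q @ k.T / np.sqrt(d)                      # (T, T)
        causal = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(causal, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
        return weights @ v                                 # (T, d)

    out = causal_self_attention(np.random.randn(5, 8))  # 5 tokens, 8-dim
    print(out.shape)  # (5, 8)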


I work on infrastructure for serving large language models, but I don't have any background in ML, so my perspective comes from treating these models as a black box (plus conversations with the people who do the ML stuff). In practice, at least on the latency side, with a fixed context length N, embedding any number of tokens from 0 to N takes the same amount of time. Perhaps it's a difference between the conceptual model and the actual implementation on the GPU?

edit - This occurred to me after the fact, but I wonder if the difference is that the use case I work with processes many different embedding requests computed in one batch, so it has to process `min(longest embedding, N)` tokens and any individual request in theory sees no difference. This would also be the case for Anthropic, however.
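
Something like this toy sketch of what I mean by batching (the real scheduler is much more involved; the shapes and `PAD_ID` here are assumptions):

    import numpy as np

    CONTEXT_LEN = 2048  # hypothetical service limit
    PAD_ID = 0

    def pad_batch(requests):
        # Pad a batch of token-id lists to the longest request in the batch,
        # capped at the context length. Every request is processed in one
        # forward pass of shape (batch, padded_len), so a 3-token request
        # batched with a 500-token request pays for 500 positions either way.
        padded_len = min(max(len(r) for r in requests), CONTEXT_LEN)
        batch = np.full((len(requests), padded_len), PAD_ID, dtype=np.int64)
        for i, r in enumerate(requests):
            n = min(len(r), padded_len)
            batch[i, :n] = r[:n]
        return batch

    batch = pad_batch([[1, 2, 3], list(range(500))])
    print(batch.shape)  # (2, 500)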


Ah, you're thinking about embeddings, which are basically the encoder stack of a traditional transformer architecture. Modern GPT-like models (including Claude), however, drop the encoder and use decoder-only architectures.

I could imagine encoders padding up to the context length, because causal masking doesn't apply and the self-attention has learned to look across the whole context window.
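
As a rough illustration of the masking difference I mean (toy numpy, not any particular model's implementation): an encoder lets every real token attend to every other token, while a decoder restricts each token to the positions at or before it.

    import numpy as np

    T = 4  # real tokens in a toy sequence

    # Encoder-style (bidirectional): every token can attend to every token.
    encoder_mask = np.ones((T, T), dtype=bool)

    # Decoder-style (causal): token t can only attend to positions <= t.
    decoder_mask = np.tril(np.ones((T, T), dtype=bool))

    print(encoder_mask.astype(int))  # all ones
    print(decoder_mask.astype(int))  # lower-triangular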


Decoder-only architecture? What is this? That doesn't sound like a transformer at all. Are you saying GPT-4 uses a totally different algorithm?


Nope, a decoder-only transformer is a variant of the original architecture proposed by Google [1]. All variants of GPT that we know about (1 through 3) roughly use this same architecture, which takes only the decoder stack from the original Google paper and drops the encoder [2].

[1] Original Google Paper - https://arxiv.org/abs/1706.03762

[2] Original GPT Paper - https://s3-us-west-2.amazonaws.com/openai-assets/research-co...
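
For a concrete (toy) picture of what "decoder-only" means in code, here's a single block with masked self-attention and an MLP, random untrained weights, and crucially no cross-attention to an encoder. I'm using the pre-norm layout and a single head for simplicity; the details vary between GPT versions, so treat this as a sketch rather than any model's actual implementation.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def causal_attention(x, w_q, w_k, w_v, w_o):
        T, d = x.shape
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(d)
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -1e9)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return (w @ v) @ w_o

    def decoder_block(x, params):
        # Masked self-attention + MLP, each with a residual connection.
        # There is no cross-attention, because there is no encoder output
        # to attend to; that's the "decoder-only" change.
        x = x + causal_attention(layer_norm(x), *params["attn"])
        h = layer_norm(x) @ params["mlp_in"]
        x = x + np.maximum(h, 0.0) @ params["mlp_out"]  # ReLU MLP
        return x

    d = 16
    rng = np.random.default_rng(0)
    params = {
        "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(4)],
        "mlp_in": rng.normal(size=(d, 4 * d)) * 0.1,
        "mlp_out": rng.normal(size=(4 * d, d)) * 0.1,
    }
    print(decoder_block(rng.normal(size=(5, d)), params).shape)  # (5, 16)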


How can it work without an encoder?


Everyone serious batches short prompts together, so the cost is roughly proportional to the number of tokens.
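
E.g. something like length bucketing, which is just one plausible scheme; I'm not claiming any particular provider does exactly this. Grouping prompts of similar length means each batch is padded only up to its bucket's boundary rather than the full context length, so total work tracks the total number of real tokens fairly closely.

    from collections import defaultdict

    def bucket_by_length(prompts, bucket_size=128):
        # Group prompts into buckets of similar length so padding waste
        # stays small within each batch.
        buckets = defaultdict(list)
        for p in prompts:
            boundary = ((len(p) // bucket_size) + 1) * bucket_size
            buckets[boundary].append(p)
        return buckets

    buckets = bucket_by_length([[0] * 10, [0] * 90, [0] * 700])
    print({k: len(v) for k, v in sorted(buckets.items())})  # {128: 2, 768: 1}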



