
I'm confused here. What's an "output context"? My assumption was that the context window is shared across input and (prior) output. You put everything into the model at once, and at the end the first unused context position gives you a vector you can decode into a single output token. Multi-token output then means you run inference, decode, sample, append, and repeat until you sample an end token. Is this just a limit of OpenAI's APIs, or something I'm forgetting?
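
Roughly, my mental model is something like this (just a sketch with Hugging Face transformers, greedy sampling, no KV cache, model name only an example):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer("The context window is", return_tensors="pt").input_ids

    with torch.no_grad():
        for _ in range(50):
            # One forward pass over everything so far: prompt + prior output.
            logits = model(input_ids).logits
            next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:   # stop on end token
                break

    print(tokenizer.decode(input_ids[0]))

(Real serving code keeps a KV cache instead of re-running the whole prefix each step, but the loop is the same idea.)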



Long story short: you are technically correct, but in practice things are a little different. There are two factors to consider here:

1. Model Capability

You are right that mechanically, input and output tokens in a standard decoder-only Transformer are "the same". A 32K context should mean you can have 1 input token and 32K output tokens (you actually get 1 bonus token), or 32K input tokens and 1 output token.
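
To make that arithmetic concrete (just a sketch mirroring the accounting above):

    def max_output_tokens(context_window: int, n_input_tokens: int) -> int:
        # Input and output share one window; the final position still predicts
        # one more token, hence the "+ 1 bonus token" above.
        return context_window - n_input_tokens + 1

    print(max_output_tokens(32_768, 1))       # nearly the whole window free for output
    print(max_output_tokens(32_768, 32_768))  # exactly 1 output token left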

However, if you feed an LM "too much" of its own output (read: let the output length get too long), it empirically starts to go off the rails. The phrase "too much" is doing some work here: it's a balance of (1) LLM labs having training data that covers that many output tokens in a single example and (2) LLM labs having empirical tests that give them confidence the model is unlikely to go off the rails within some output limit. (Note: this isn't pretraining but the instruction tuning/RLHF after, so you don't just get such examples for free.)

In short, labs will often train a model targeting an output context length, and put out an offering based on that.

2. Infrastructure

While mathematically, having the model read external input and having it read its own output are the same operation, the infrastructure is wildly different. This is one of the first things you learn when deploying these models: you basically have a different stack for "encoding" and "decoding" (using those terms loosely; this is, after all, still a decoder-only model). This means you need to set max lengths for encoding and decoding separately.
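
To illustrate what "separate max lengths" looks like in practice (purely hypothetical names, not any particular framework's API):

    from dataclasses import dataclass

    @dataclass
    class ServingLimits:                 # hypothetical config, for illustration only
        max_input_tokens: int = 31_000   # "encoding"/prefill cap
        max_output_tokens: int = 1_000   # "decoding"/generation cap

    def validate_request(n_prompt_tokens: int, requested_output: int,
                         limits: ServingLimits) -> None:
        # The two limits are enforced independently, even though the model
        # underneath is a single decoder-only Transformer.
        if n_prompt_tokens > limits.max_input_tokens:
            raise ValueError("prompt exceeds the input limit")
        if requested_output > limits.max_output_tokens:
            raise ValueError("requested completion exceeds the output limit")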

So, after a long time of optimizing both the implementation and the length hyperparameters (or just winging it), the lab will decide "we have a good implementation for up to 31K input and 1K output" and go from there. If they want to change that, there's a bunch of infrastructure work involved. And because of the economics of batching, you want many inputs to have lengths as close to each other as possible, so you want to offer fewer configurations (some of this bucketing may happen behind the scenes, hidden from the user). Anyway, this is why it may become uneconomical to offer a model at a given length configuration (input or output) after some time.
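
The bucketing idea, as a toy sketch (not taken from any particular serving framework):

    from collections import defaultdict

    def bucket_by_length(tokenized_prompts, bucket_size=512):
        # Group prompts into coarse length buckets so each batch pads to
        # roughly the same length and wastes less compute on padding.
        buckets = defaultdict(list)
        for tokens in tokenized_prompts:
            padded_len = ((len(tokens) + bucket_size - 1) // bucket_size) * bucket_size
            buckets[padded_len].append(tokens)
        return buckets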


No. Newer models have a 128K-token input window, but only 4096 output tokens.
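
Concretely, with the OpenAI Python client the output cap is set per request via max_tokens, separately from the (much larger) context window. A minimal sketch (model name is just an example of a 128K-context model):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-4-turbo",   # example model name
        messages=[{"role": "user", "content": "Summarize this long document: ..."}],
        max_tokens=4096,       # caps the completion only, not the prompt
    )
    print(response.choices[0].message.content)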


It's the number of tokens the model can output in one pass. There are subtle differences between running it multiple times to get a bigger output and running it once to get a bigger output. These are things that only really show up when you integrate these models into production code.
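
If you do stitch multiple passes together, it looks roughly like this (a sketch; the seams between continuations are exactly where those subtle differences show up):

    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user", "content": "Write a very long report on ..."}]
    chunks = []

    while True:
        response = client.chat.completions.create(
            model="gpt-4-turbo",   # example model name
            messages=messages,
            max_tokens=4096,
        )
        choice = response.choices[0]
        chunks.append(choice.message.content)
        if choice.finish_reason != "length":   # finished naturally, not cut off at the cap
            break
        # Feed the partial output back as context and ask the model to keep going.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})

    full_output = "".join(chunks)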


This isn't unique to OpenAI models. A lot of the open-source ones have similar limitations.



