
I don't understand how that parallel prediction can work...

Let's say I give it as input the sentence:

I . . . . . . . . happily.

The second word to be predicted depends on the first word.




Give the model the tokens "happily" and "I", and add to each input token its respective position embedding and the position embedding for the token to be predicted. You can do this in parallel for all tokens to be predicted. The model has been trained so it can predict tokens in any position.
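
A minimal sketch of what that input assembly might look like, assuming toy embedding tables and a hypothetical any-order model (the sizes, token ids, and `any_order_lm` are all illustrative, not from a specific paper):

    import torch
    import torch.nn as nn

    # Illustrative sizes; nothing here comes from a specific paper.
    vocab_size, d_model, max_len = 50_000, 512, 128

    tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
    pos_emb = nn.Embedding(max_len, d_model)     # position embeddings

    # Known tokens: "I" at position 0 and "happily" at position 9 (toy ids).
    known_ids = torch.tensor([11, 42])
    known_pos = torch.tensor([0, 9])

    # The eight blank positions we want filled in, all at once.
    target_pos = torch.arange(1, 9)

    # Each known token gets its own position embedding...
    ctx = tok_emb(known_ids) + pos_emb(known_pos)           # (2, d_model)
    # ...and each parallel query also adds the embedding of the position to predict.
    ctx = ctx.unsqueeze(0).expand(len(target_pos), -1, -1)  # (8, 2, d_model)
    inputs = ctx + pos_emb(target_pos).unsqueeze(1)         # broadcast to (8, 2, d_model)

    # A model trained to predict the token at the queried position could now
    # score all eight blanks in one batched forward pass, e.g.
    # logits = any_order_lm(inputs)   # hypothetical model, shape (8, vocab_size)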


Yes, but is there any guarantee that the complete sentence makes sense?


That is indeed an issue. Their sampling method rejects impossible combinations.


That guarantee didn't exist with regular GPT LLMs, did it? It just came about as an emergent property of throwing more and more compute, training data, and training time at the problem.


I think it’s effectively built in to the design. The model outputs a probability distribution for the first unknown token [0]. Then some code outside the model chooses a token and runs the model again with that token provided to the model. So the second output token’s probability distribution is automatically conditioned on the first output token, etc.

Sometimes people will attempt to parallelize this by using a faster model to guess a few tokens and then evaluating them as a batch with the main model to determine whether the guesses were good (speculative decoding).

[0] Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.
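
A minimal sketch of that decoding loop, assuming `model` is any callable that maps a (1, seq_len) tensor of token ids to (1, seq_len, vocab_size) logits; the names and shapes here are illustrative, not a real library API:

    import torch

    def decode(model, prompt_ids, max_new_tokens=20, temperature=1.0):
        # Toy autoregressive sampling loop.
        ids = prompt_ids
        for _ in range(max_new_tokens):
            logits = model(ids.unsqueeze(0))[0, -1]          # logits for the next position only
            probs = (logits / temperature).softmax(dim=-1)   # conditioned on every token chosen so far
            next_id = torch.multinomial(probs, num_samples=1)
            ids = torch.cat([ids, next_id])                  # feed the choice back in for the next step
        return ids

Because each sampled token is appended to the input before the next forward pass, every later distribution is automatically conditioned on it.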


> I think it’s effectively built in to the design.

It isn't. There is no guarantee that successive tokens will be comprehensible.

> Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.

The logits determine the probability distribution (well, technically you apply softmax to get it). Temperature is a parameter for how you sample those logits in a non-greedy fashion.


> Temperature is a parameter for how you sample those logits in a non-greedy fashion.

I think temperature is better understood as a pre-softmax pass over logits. You'd divide logits by the temp, and then their softmax becomes more/less peaky.

    probs = (logits / temp).softmax(dim=-1)
Sampling is a whole different thing.
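
A quick numeric illustration of the peakiness effect, using made-up logits (PyTorch only because the snippet above is written that way):

    import torch

    logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # made-up logits

    for temp in (0.5, 1.0, 2.0):
        probs = (logits / temp).softmax(dim=-1)
        print(temp, probs)
    # temp=0.5 concentrates almost all the mass on the largest logit;
    # temp=2.0 spreads it out and flattens the distribution.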


Sure, my comment about softmax was simply about the probability distribution. But temperature is still part of sampling. If you’re greedy decoding, temperature doesn’t matter.
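
A toy check of that last point, again with made-up logits: greedy decoding takes the argmax, and dividing by a positive temperature is monotonic, so it never changes which entry is largest.

    import torch

    logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # made-up logits

    for temp in (0.1, 1.0, 10.0):
        assert (logits / temp).argmax() == logits.argmax()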


No, but it makes more conceptual sense, given that the model can consider what was said before it.


Isn't this bag of words all over again? Except with positional hints?



