
I think it’s effectively built into the design. The model outputs a probability distribution for the first unknown token [0]. Then some code outside the model chooses a token and runs the model again with that token appended to the input. So the second output token’s probability distribution is automatically conditioned on the first output token, and so on.
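
Roughly, the loop looks like this (a minimal sketch, not any particular library's API; model, tokenizer, prompt, temperature, and max_new_tokens are placeholders):

    import torch

    tokens = tokenizer.encode(prompt)                    # prompt token ids (illustrative API)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]    # logits for the next unknown token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_token)                        # next step conditions on this choice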

Sometimes people will attempt to parallelize this by using a faster model to guess a few tokens ahead and then evaluating them as a batch with the main model to determine whether the guesses were good.
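
A rough sketch of that trick (speculative decoding) in its simplest greedy form, continuing with the tokens list from the loop above; draft_model and main_model are placeholders that return logits for every position:

    k = 4                                         # how many tokens the fast model guesses
    guesses, ctx = [], list(tokens)
    for _ in range(k):
        nxt = int(draft_model(ctx).argmax())      # fast model's best guess for the next token
        guesses.append(nxt)
        ctx.append(nxt)
    logits = main_model(tokens + guesses)         # one pass of the main model scores all guesses
    for i, g in enumerate(guesses):
        if int(logits[len(tokens) + i - 1].argmax()) != g:
            break                                 # first disagreement: discard the rest
        tokens.append(g)

The published versions also keep the main model's own prediction at the first mismatch, so a verification pass never yields fewer tokens than ordinary decoding; this sketch skips that.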

[0] Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.




> I think it’s effectively built into the design.

It isn't. There is no guarantee that successive tokens will be comprehensible.

> Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.

The logits are the probability distribution (well technically, you would apply softmax). Temperature is a parameter for how you sample those logits in a non-greedy fashion.
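
Concretely, something like this (PyTorch-flavoured sketch; logits and temp are placeholders):

    # non-greedy: sample from the temperature-scaled distribution
    tok = torch.multinomial(torch.softmax(logits / temp, dim=-1), num_samples=1)
    # greedy: just take the argmax, temp plays no role
    tok = logits.argmax()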


> Temperature is a parameter for how you sample those logits in a non-greedy fashion.

I think temperature is better understood as a pre-softmax scaling of the logits. You'd divide the logits by the temp, and then their softmax becomes more/less peaky.

    probs = (logits / temp).softmax(dim=-1)
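
For example (PyTorch, numbers purely illustrative):

    logits = torch.tensor([2.0, 1.0, 0.0])
    (logits / 0.5).softmax(dim=-1)   # ~[0.87, 0.12, 0.02]  (peakier)
    (logits / 2.0).softmax(dim=-1)   # ~[0.51, 0.31, 0.19]  (flatter)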
Sampling is a whole different thing.


Sure, my comment about softmax was simply about the probability distribution. But temperature is still part of sampling. If you’re greedy decoding, temperature doesn’t matter.
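
Easy to check: dividing the logits by any positive temperature can't change which one is largest, so argmax picks the same token either way (PyTorch, illustrative):

    import torch
    logits = torch.tensor([2.0, 1.0, 0.0])
    torch.equal((logits / 0.5).argmax(), logits.argmax())   # True for any temp > 0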



