A common theme in papers like these is that the model chooses word predictions greedily, instead of “thinking” and gaining confidence in its next word prediction.
This raises the question: why don’t people force the model to generate more tokens until it has very high confidence in its next word prediction?
Of course they do. Beam search is a thing. The reason it's not used as much as you might expect is cost. Do a greedy search and you run through the model x times, where x is the number of tokens generated. Expand the top-k candidates at every step without pruning, and the number of runs through the model gets astronomical quickly.
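To make the cost comparison concrete, here's a toy sketch (my own illustration, not from any real LM): a rule-based "model" over a four-token vocabulary, with greedy decoding and a pruned beam search side by side, counting forward passes. All names and the probability tables are made up for the example.

```python
import math

# Toy "model": deterministic next-token probabilities over a tiny vocabulary.
# This is a hypothetical stand-in for a real LM forward pass.
VOCAB = ["a", "b", "c", "<eos>"]

def toy_logprobs(prefix):
    """Return next-token log-probs given a prefix (toy, rule-based)."""
    if prefix and prefix[-1] == "a":
        probs = [0.1, 0.7, 0.1, 0.1]   # "b" likely after "a"
    elif prefix and prefix[-1] == "b":
        probs = [0.1, 0.1, 0.1, 0.7]   # "<eos>" likely after "b"
    else:
        probs = [0.4, 0.3, 0.2, 0.1]   # otherwise favor "a"
    return [math.log(p) for p in probs]

calls = {"n": 0}  # count forward passes

def model(prefix):
    calls["n"] += 1
    return toy_logprobs(prefix)

def greedy(max_len=5):
    seq = []
    for _ in range(max_len):
        lp = model(seq)                       # one forward pass per token
        tok = VOCAB[lp.index(max(lp))]        # pick the argmax, no lookahead
        seq.append(tok)
        if tok == "<eos>":
            break
    return seq

def beam_search(width=3, max_len=5):
    beams = [([], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beam carries over
                continue
            lp = model(seq)  # one forward pass per live beam per step
            for tok, l in zip(VOCAB, lp):
                candidates.append((seq + [tok], score + l))
        beams = sorted(candidates, key=lambda c: -c[1])[:width]  # prune
        if all(s and s[-1] == "<eos>" for s, _ in beams):
            break
    return beams[0][0]

calls["n"] = 0
g = greedy()
greedy_calls = calls["n"]

calls["n"] = 0
b = beam_search(width=3)
beam_calls = calls["n"]

print(greedy_calls, beam_calls)  # beam search needs several times more passes
print(g, b)
```

Even in this tiny example beam search needs a few times as many forward passes as greedy decoding, and a width-k beam costs roughly k times as much per step; it does, however, find a higher-probability sequence here that greedy misses.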
I'm wondering if you're describing beam search? IIRC, last time I brought that up here, someone explained that as models have gotten better it just didn't really make a difference.
I wasn’t thinking of something like beam search; that seems kind of unnatural to me. I can imagine that the human brain is doing something like GPT, but I can’t imagine it’s doing something like a beam search.
I was more thinking a model that writes to a piece of scratch paper to gain confidence. But it doesn’t have to actually output the scratch paper that it uses, it’s totally hidden from the user.
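A minimal sketch of what that hidden scratch paper could look like (my own illustration: the prompt format, the `<scratch>` tags, and `fake_generate` are all hypothetical, with a canned string standing in for a real LM call):

```python
import re

# Hypothetical setup: the model is asked to reason inside <scratch>...</scratch>
# before answering, and we strip that section so the user never sees it.
PROMPT_TEMPLATE = (
    "Think step by step inside <scratch>...</scratch>, "
    "then give the final answer after 'Answer:'.\n\nQ: {question}\n"
)

def fake_generate(prompt):
    # Canned output simulating a model that uses the scratch pad.
    return (
        "<scratch>17 * 3 = 51, plus 4 is 55.</scratch>\n"
        "Answer: 55"
    )

def answer(question):
    raw = fake_generate(PROMPT_TEMPLATE.format(question=question))
    # Strip the hidden reasoning before returning to the user.
    visible = re.sub(r"<scratch>.*?</scratch>\s*", "", raw, flags=re.DOTALL)
    return visible.strip()

print(answer("What is 17 * 3 + 4?"))  # the scratch work never reaches the user
```

The model still spends tokens (and compute) on the scratch work; it's only hidden from the user, not free.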
You could take this a step further, and have something like a “two-brained” model, where the original model falls back on a secondary model if it’s not confident in its response. This resembles a “fast” and “slow” brain.
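The fast/slow routing could look something like this sketch (entirely hypothetical: both "brains" are toy lookup tables, the confidence values are invented, and top-token probability is only a crude confidence proxy):

```python
# Hypothetical "two-brained" routing: a cheap model answers first, and we
# escalate to an expensive model only when its confidence is low.
FAST = {
    "capital of France?": ("Paris", 0.95),
    "obscure trivia?": ("guess", 0.12),   # fast model is unsure here
}

SLOW = {
    "obscure trivia?": ("careful answer", 0.80),  # made-up for illustration
}

def two_brained(question, threshold=0.5):
    answer, confidence = FAST.get(question, ("?", 0.0))
    if confidence >= threshold:
        return answer, "fast"       # the fast brain is confident enough
    # Fall back to the slower, more capable brain.
    answer, _ = SLOW.get(question, (answer, confidence))
    return answer, "slow"

print(two_brained("capital of France?"))   # handled by the fast brain
print(two_brained("obscure trivia?"))      # escalated to the slow brain
```

The interesting design question is the confidence signal: raw token probabilities are known to be poorly calibrated, so the threshold alone may route badly.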
I think the scratch paper idea has been explored to some extent, but I’m not sure if people think it’s a dead end.
Isn’t that what the softmax layer is doing? The token with the highest probability among all the tokens in the model’s vocabulary is chosen as the next token!
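For concreteness, this is all greedy decoding amounts to at the softmax layer (toy logits over a made-up four-word vocabulary; not from any real model):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "the", "ran"]
logits = [1.2, 0.4, 2.1, -0.3]   # raw scores from the final layer

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy: argmax, no lookahead
print(next_token)  # largest logit wins; softmax doesn't change the ordering
```

Note the softmax only turns logits into a probability distribution; picking the argmax (greedy), sampling, or beam search is a separate decoding choice layered on top of it.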