Hacker News | mp187's comments

Divergent mode is much easier for me. I just unfocus my eyes (the same muscle that blurs them).


I can easily diverge, but I then can't control focus. Do you see the stereograms in focus?


This can be practiced! I was super into stereograms years back and have decent control now (not intentionally; it just happened naturally). I can diverge slightly to see smaller/far-away images or diverge maximally for the bigger/close-up images.


Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was in the renormalization part.


At small input sizes, yes, the MLP dominates compute. At large input sizes, attention matters more.
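
A back-of-the-envelope sketch of that crossover, assuming a standard layer with a feed-forward width of 4x the model width and counting 2 FLOPs per multiply-add. The function name and the example sizes are illustrative, not taken from any particular model:

```python
def layer_flops(n_tokens: int, d_model: int, ff_mult: int = 4) -> dict:
    """Rough forward-pass FLOPs for one transformer layer.

    Assumes dense attention and counts 2 FLOPs per multiply-add.
    """
    n, d = n_tokens, d_model
    # MLP: two matmuls, d -> ff_mult*d and ff_mult*d -> d.
    mlp = 2 * 2 * n * d * (ff_mult * d)
    # Attention projections: Q, K, V and the output projection.
    attn_proj = 2 * 4 * n * d * d
    # Attention mixing: QK^T scores plus the weighted sum over V.
    attn_mix = 2 * 2 * n * n * d
    return {"mlp": mlp, "attn_proj": attn_proj, "attn_mix": attn_mix}


if __name__ == "__main__":
    for n in (512, 4096, 32768):
        f = layer_flops(n, d_model=4096)
        print(n, "attn_mix / mlp =", f["attn_mix"] / f["mlp"])
```

Under these assumptions the MLP costs 16·n·d² and the quadratic attention term costs 4·n²·d, so the quadratic term only overtakes the MLP once n exceeds roughly 4·d.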


A common theme in papers like these is that the model chooses word predictions greedily, instead of “thinking” and gaining confidence in its next word prediction.

This raises the question: why don’t people force the model to generate more tokens until it has very high confidence in its next word prediction?

I can imagine several ways of doing this.


Of course they do. Beam search is a thing. The reason it's not used as much as you might expect is cost. With greedy search you run through the model x times, where x is the number of tokens generated. Branch on the top k candidates at every step and the number of runs through the model gets astronomical quickly.
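
To make the cost comparison concrete, here is a minimal sketch that counts forward passes for greedy decoding and for beam search. `ToyModel` and its hard-coded probabilities are hypothetical stand-ins for a real language model:

```python
import math


class ToyModel:
    """Hypothetical stand-in for a language model; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def next_probs(self, prefix):
        self.calls += 1
        if not prefix:
            return {"a": 0.5, "b": 0.4, "c": 0.1}
        if prefix[-1] == "b":
            return {"a": 0.9, "b": 0.05, "c": 0.05}
        return {"a": 0.4, "b": 0.3, "c": 0.3}


def greedy(model, steps):
    """Pick the single most likely token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        probs = model.next_probs(seq)
        tok = max(probs, key=probs.get)
        seq, logp = seq + (tok,), logp + math.log(probs[tok])
    return seq, logp


def beam_search(model, steps, width):
    """Keep the `width` best partial sequences after every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for tok, p in model.next_probs(seq).items():
                candidates.append((seq + (tok,), logp + math.log(p)))
        # Prune back down to the best `width` candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]
```

In this sketch a pruned beam of width k costs roughly k times as many forward passes as greedy decoding. It is the unpruned version, which follows every top-k branch at every step, that grows as k^x.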


I'm wondering if you're describing beam search? IIRC, the last time I brought that up here, someone explained that as models have gotten better, it just didn't really make a difference.


I wasn’t thinking of something like beam search; that seems kind of unnatural to me. I can imagine that the human brain is doing something like GPT, but I can’t imagine it’s doing something like a beam search.

I was thinking more of a model that writes to a piece of scratch paper to gain confidence. It doesn’t have to actually output the scratch paper it uses; that can stay totally hidden from the user.

You could take this a step further, and have something like a “two-brained” model, where the original model falls back on a secondary model if it’s not confident in its response. This resembles a “fast” and “slow” brain.
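
A minimal sketch of that fallback, assuming the top token's probability is used as the confidence signal. That choice, the `threshold` value, and the toy models are all assumptions made for illustration:

```python
from typing import Callable, Dict, Sequence, Tuple

Probs = Dict[str, float]
Model = Callable[[Sequence[str]], Probs]


def next_token_with_fallback(
    prefix: Sequence[str], fast: Model, slow: Model, threshold: float = 0.7
) -> Tuple[str, str]:
    """Return (token, which_model). Defer to `slow` when `fast` is unsure."""
    probs = fast(prefix)
    token = max(probs, key=probs.get)
    if probs[token] >= threshold:
        return token, "fast"
    # The fast model is not confident enough: ask the slow model instead.
    probs = slow(prefix)
    return max(probs, key=probs.get), "slow"


def toy_fast(prefix):
    # Hypothetical fast model: confident only on an empty prefix.
    return {"the": 0.9, "a": 0.1} if not prefix else {"cat": 0.5, "dog": 0.5}


def toy_slow(prefix):
    # Hypothetical slow model: always has a clear preference.
    return {"cat": 0.2, "dog": 0.8}
```

Calling `next_token_with_fallback((), toy_fast, toy_slow)` stays on the fast path, while a prefix of `("the",)` drops below the threshold and falls through to the slow model.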

I think the scratch paper idea has been explored to some extent, but I’m not sure if people think it’s a dead end.


Isn’t that what the softmax layer is doing? The token with highest probability among all the available tokens in the model dictionary is chosen as the next token!


No. The softmax layer produces a distribution. What you do with that is up to you. There are numerous ways to choose from that distribution.
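
A short sketch of a few of those choices, using only the standard library. The function names and example logits are illustrative and not taken from any particular framework:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Always take the most likely token id."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalize."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of ids whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1, -1.0])
    rng = random.Random(0)
    print("greedy:", greedy(probs))
    print("top-k sample:", sample(top_k_filter(probs, 2), rng))
    print("top-p sample:", sample(top_p_filter(probs, 0.8), rng))
```

All of these start from the same softmax output. Greedy decoding is just the special case that always takes the argmax.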


I haven’t read this paper, but what you described is commonly done (look up top-k or top-p sampling and beam search as examples).

