mp187's comments

mp187 · 2024-05-08T09:42:01

Divergent mode is much easier for me. I just unfocus my eyes (the same muscle that blurs them).

afandian · 2024-05-08T11:38:26

I can easily diverge, but I then can't control focus. Do you see the stereograms in focus?

dameyawn · 2024-05-08T21:32:18

This can be practiced! I was super into stereograms years back and have decent control now (not something I trained intentionally; it just happened naturally). I can diverge slightly to see smaller/far-away images, or maximally for the bigger/close-up images.

mp187 · 2024-05-01T08:57:51

Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was in the renormalization part.

brrrrrm · 2024-05-01T13:10:54

At small input sizes, yes, the MLP dominates compute. At large input sizes attention matters more, since its cost grows quadratically with sequence length.
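
A rough back-of-the-envelope sketch of that crossover, assuming a standard layer with a 4x MLP expansion and counting a multiply-add as 2 FLOPs. The `layer_flops` helper and the `d=4096` width are made up for illustration, and softmax, layer norm and embeddings are ignored:

```python
def layer_flops(n: int, d: int, mlp_ratio: int = 4) -> dict:
    """Approximate forward FLOPs for one transformer layer.

    n: sequence length, d: model width. Assumes a plain MLP with hidden
    size mlp_ratio * d and full (non-sparse) self-attention.
    """
    mlp = 2 * (2 * n * d * mlp_ratio * d)   # up-projection + down-projection
    attn_proj = 4 * (2 * n * d * d)         # Q, K, V and output projections
    attn_scores = 2 * (2 * n * n * d)       # QK^T plus the weighted sum over V
    return {"mlp": mlp, "attn_proj": attn_proj, "attn_scores": attn_scores}


# Ratio of the quadratic attention term to the MLP term works out to n / (4 * d).
for n in (512, 4096, 32768):
    f = layer_flops(n, d=4096)
    print(n, f["attn_scores"] / f["mlp"])
```

Under these assumptions the quadratic attention term only overtakes the MLP once the sequence length passes roughly 4x the model width.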

mp187 · 2024-02-17T20:45:25

A common theme in papers like these is that the model chooses word predictions greedily, instead of “thinking” and gaining confidence in its next word prediction.

This raises the question: why don’t people force the model to generate more tokens until it has very high confidence in its next word prediction?

I can imagine several ways of doing this.

danielmarkbruce · 2024-02-17T21:17:43

Of course they do. Beam search is a thing. The reason it's not used as much as you might expect is cost. Do a greedy search and you run through the model x times, where x is the number of tokens generated. Branch on the top k at every step and the number of runs through the model gets astronomical quickly.
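
To put rough numbers on that, here is a small sketch that counts forward passes for x generated tokens. The function names are hypothetical, it assumes one pass per live prefix per step, and it ignores batching and KV caching:

```python
def greedy_passes(x: int) -> int:
    """Greedy decoding: one forward pass per generated token."""
    return x


def beam_passes(x: int, width: int) -> int:
    """Beam search: one pass for the initial prefix, then one per kept beam per step."""
    if x == 0:
        return 0
    return 1 + (x - 1) * width


def unpruned_tree_passes(x: int, k: int) -> int:
    """Branch on the top k at every step and never prune.

    Every prefix of length 0 .. x-1 needs its own pass: 1 + k + k^2 + ... + k^(x-1).
    """
    return sum(k ** i for i in range(x))


print(greedy_passes(20), beam_passes(20, 5), unpruned_tree_passes(20, 5))
```

With pruning to a fixed beam width the cost stays linear in x (just multiplied by the width); it is the unpruned tree that blows up exponentially.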

IanCal · 2024-02-17T21:16:56

I'm wondering if you're describing beam search? IIRC, the last time I brought that up here someone explained that as models have gotten better it just didn't really make a difference.

mp187 · 2024-02-17T21:32:16

I wasn’t thinking of something like beam search; that seems kind of unnatural to me. I can imagine that the human brain is doing something like GPT, but I can’t imagine it’s doing something like a beam search.

I was thinking more of a model that writes to a piece of scratch paper to gain confidence. It doesn’t have to actually output the scratch paper that it uses; that can stay totally hidden from the user.

You could take this a step further, and have something like a “two-brained” model, where the original model falls back on a secondary model if it’s not confident in its response. This resembles a “fast” and “slow” brain.

I think the scratch paper idea has been explored to some extent, but I’m not sure if people think it’s a dead end.
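
To make the “two-brained” fallback concrete, here is a toy sketch. Everything in it is an assumption for illustration: `fast` and `slow` are stand-in callables that return next-token probabilities, confidence is just the top probability, and the 0.9 threshold is arbitrary:

```python
from typing import Callable, Sequence, Tuple

# Hypothetical model interface: context token ids -> next-token probabilities.
Model = Callable[[Sequence[int]], Sequence[float]]


def next_token(context: Sequence[int], fast: Model, slow: Model,
               threshold: float = 0.9) -> Tuple[int, str]:
    """Use the fast model unless its top probability is below the threshold."""
    probs = fast(context)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return best, "fast"
    # Not confident enough: fall back to the slower model's top choice.
    probs = slow(context)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, "slow"
```

Top probability is a crude confidence measure, so this only shows the control flow, not a claim that it would work well.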

reqo · 2024-02-17T20:48:16

Isn’t that what the softmax layer is doing? The token with the highest probability among all the tokens in the model’s vocabulary is chosen as the next token!

danielmarkbruce · 2024-02-18T00:33:20

No. The softmax layer produces a distribution. What you do with that is up to you. There are numerous ways to choose from that distribution.
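
A minimal pure-Python sketch of a few of those choices (greedy, plain sampling, top-k and top-p filtering). The helper names are made up here and this is not any particular library's API:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution (temperature must be > 0)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Pick the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def sample(probs, rng=random):
    """Draw a token index at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def top_k(probs, k):
    """Keep the k most likely tokens and renormalize; zero out the rest."""
    keep = set(sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]
```

For example, `sample(top_k(softmax(logits), 40))` samples from the 40 most likely tokens, while `greedy(softmax(logits))` is the "always take the highest probability" behavior from the parent comment.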

p1esk · 2024-02-17T20:57:35

I haven’t read this paper, but what you described is commonly done (look up top-k or top-p sampling and beam search as examples).