
I just found this paper I read a while ago. Doesn't this answer the question?

The Impact of Reasoning Step Length on Large Language Models - https://arxiv.org/abs/2401.04925

>They discovered that appending dummy tokens (ignored during both training and inference) improves performance somehow. Don’t confuse their guess as to why this might be happening with actual understanding.

More tokens means more compute time for the model to utilize; that is completely true.

Their guess is that the model can use the extra compute to make better predictions, even when no extra information accompanies this extra "thinking time".
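
Here's a minimal sketch of that setup, assuming a Hugging Face causal LM (gpt2 is just a placeholder here, and a toy model like this won't reproduce the paper's results). The filler string adds no information, only extra positions for the model to compute over:

    # Sketch of the filler-token idea: same question, padded with
    # uninformative tokens before the answer cue. Placeholder model;
    # this illustrates the setup only, not the paper's results.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    question = "If I have 3 apples and buy 2 more, how many apples do I have?"
    filler = " ." * 50  # carries no information, just more tokens to process

    for prompt in (f"Q: {question}\nA:", f"Q: {question}\n{filler}\nA:"):
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        print(tok.decode(out[0][ids.shape[1]:]))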




Yes, more tokens means doing more compute; that much is true. The question is whether this extra compute helps or hurts. As far as I know, that question is yet to be answered. I tend to make my GPT-4 questions quite verbose, hoping it helps.

This is completely orthogonal to CoT, which is simply a better prompt - it probably causes some sort of better pattern matching (again very poorly understood).


>The question is whether this extra compute helps or hurts.

I've linked two papers now that show very clearly that the extra compute helps. I honestly don't understand what else you're looking for.

>This is completely orthogonal to CoT, which is simply a better prompt - it probably causes some sort of better pattern matching (again very poorly understood).

That paper specifically dives into the effect of the length of the CoT prompt. It makes little sense to say "oh, it's just the better prompt" when CoT prompts with more tokens perform better than shorter ones, even when the shorter ones contain the same information. There is also a clear correlation between task difficulty and length.
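
To make "same information, more tokens" concrete, here's a toy pair of CoT exemplars (mine, not the paper's) with identical content, where the longer one just restates the question and spells the arithmetic out. The token counts via tiktoken only show the length difference:

    # Toy illustration of "same information, more tokens" in a CoT exemplar.
    # cl100k_base is an arbitrary choice of token counter here.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    short_cot = (
        "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many now?\n"
        "A: 5 + 2*3 = 11. The answer is 11."
    )
    long_cot = (
        "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many now?\n"
        "A: Let's restate the question: we start with 5 balls and add 2 cans "
        "of 3 balls each. 2 cans of 3 balls is 2*3 = 6 balls. Adding those "
        "to the 5 we started with gives 5 + 6 = 11. The answer is 11."
    )

    print(len(enc.encode(short_cot)), len(enc.encode(long_cot)))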


Yes, the CoT paper does provide some evidence that a more verbose prompt works better. Thank you for pointing me to it.

Though I still don’t quite understand what is going on in the dummy tokens paper - what is “computation width” and why would it provide any benefit?


So "compute" includes just having more data ... that can also be "ignored"/ "skipped" for whatever reasons (e.g. weights), ok.



