It is possible to understand the mechanism once you drop the anthropomorphisms.
Each token output by an LLM involves one pass through the next-word predictor neural network. Each pass is a fixed amount of computation. Complexity theory hints to us that the problems which are "hard" for an LLM will need more compute than the ones which are "easy". Thus, the only mechanism through which an LLM can compute more and solve its "hard" problems is by outputting more tokens.
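In code, decoding looks roughly like the sketch below (the `model` here is a hypothetical next-token predictor returning logits of shape (batch, seq, vocab); real decoders add sampling, KV caching and stop conditions, but the one-pass-per-token structure is the same):

```python
import torch

def generate(model, input_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Greedy autoregressive decoding: each emitted token costs one forward pass."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                # one pass through the network
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # append it and go again
    return input_ids
```

The only lever for "thinking harder" about a prompt is making this loop run longer.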
You incentivise it to this end by human-grading its outputs ("RLHF") to prefer those where it spends time calculating before "locking in" to the answer. For example, you would prefer the output
Ok let's begin... statement1 => statement2 ... Thus, the answer is 5
over
The answer is 5. This is because....
since in the first one, it has spent more compute before giving the answer. You don't attempt to steer the extra computation in any particular direction. You simply reinforce the preferred answers and hope that somewhere in that extra computation lies something useful.
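As a rough illustration (the prompt and completions below are invented, not from any real preference dataset), the graded data looks something like this:

```python
# A made-up preference pair of the kind described above: the grader prefers the
# completion that works through the steps before committing to an answer.
preference_pair = {
    "prompt":   "What is 2 + 3?",
    "chosen":   "Ok let's begin... 2 + 3 = 5. Thus, the answer is 5",  # computes first
    "rejected": "The answer is 5. This is because 2 + 3 = 5.",         # locks in immediately
}

# A reward model r is then fit so that r(prompt, chosen) > r(prompt, rejected),
# e.g. via a pairwise loss like -log(sigmoid(r_chosen - r_rejected)). Only the
# preference is supervised; the intermediate steps themselves are never checked.
```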
It turned out that such hope was well-placed. The DeepSeek R1-Zero training experiment showed that if you apply this very generic form of learning (reinforcement learning) without _any_ examples, the model automatically starts outputting more and more tokens, i.e. "computing more". DeepSeekMath was also a model trained directly with RL. Notably, the only signal given was whether the answer was right or not; no attention was paid to anything else, and even the position of the answer in the sequence, which we cared about above, was ignored. This meant the LLM could be graded automatically, with no human in the loop (you are just checking answer == expected_answer). This is also why math problems were used.
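A minimal sketch of such an automatic grader, assuming the final answer can be pulled out of the completion with a simple pattern (the extraction regex here is a placeholder, not DeepSeek's actual logic):

```python
import re

def answer_reward(completion: str, expected_answer: str) -> float:
    """Rule-based reward: 1.0 if the stated final answer matches, else 0.0."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion, re.IGNORECASE)
    if match is None:
        return 0.0                                 # no parsable final answer at all
    return 1.0 if match.group(1) == expected_answer else 0.0

print(answer_reward("Ok let's begin... Thus, the answer is 5", "5"))  # 1.0
print(answer_reward("The answer is 7. This is because...", "5"))      # 0.0
```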
All this is to say, we get the most insight into what benefit "reasoning" adds by examining what happened when it was applied without training the model on any examples. DeepSeek R1 actually uses a few examples first and then does the RL process on top of that, so we won't look at it here.
Reading the DeepSeekMath paper [1], we see that the authors posit the following:
> As shown in Figure 7, RL enhances Maj@K’s performance but not Pass@K. These findings indicate that RL enhances the model’s overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
For context, Maj@K means you sample K outputs, take a majority vote over their final answers, and mark the problem correct only if that voted answer is correct. Pass@K means you mark it correct if at least one of the K sampled outputs is correct.
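Concretely, under these definitions (the sampled answer strings below are made up for illustration):

```python
from collections import Counter

def pass_at_k(sampled_answers: list[str], expected: str) -> bool:
    # Correct if at least one of the K sampled final answers is right.
    return any(a == expected for a in sampled_answers)

def maj_at_k(sampled_answers: list[str], expected: str) -> bool:
    # Correct only if the majority-vote (most frequent) answer is right.
    voted_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return voted_answer == expected

k_samples = ["5", "5", "7", "5"]     # K = 4 final answers sampled from the model
print(pass_at_k(k_samples, "5"))     # True: at least one sample is correct
print(maj_at_k(k_samples, "5"))      # True: "5" wins the vote
```

Maj@K going up while Pass@K stays flat is exactly the "boosting the correct response from TopK" effect the quote describes.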
> I’m wondering whether this seemingly underwhelming bump on 4o magnifies when/if reasoning is added.

So to answer your question: if you add an RL-based reasoning process to the model, it will improve simply because it does more computation, and some portion of that extra computation, so far measured only empirically, helps it reach more accurate answers on math problems. Outside of that, it's purely subjective. If you ask me, I prefer Claude Sonnet over any reasoning LLM for all coding/SWE tasks.