OAI revealed on Twitter that there is no "system" at inference time; it is just a model.
Did they maybe expand to a tree during training to learn more robust reasoning? Maybe. But it still comes down to a regular transformer model at inference time.
> In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting – ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions.
>[...]
>We generate thoughts, in parallel, following all tokens in the text (think). The model produces a mixture of its next-token predictions with and without a thought (talk). We apply REINFORCE, as in STaR, to increase the likelihood of thoughts that help the model predict future text while discarding thoughts that make the future text less likely (learn).
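To make the think/talk/learn loop in that quote concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the function names (`mix_predictions`, `reinforce_rewards`, `kept_thoughts`), the fixed mixing weight, the mean-over-thoughts baseline, and the made-up logits are all assumptions for illustration. It only shows the shape of the idea: blend the prediction made with a thought into the one made without it, then reward thoughts that make the true next token more likely.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_predictions(p_without, p_with, weight):
    """'Talk': interpolate the next-token distributions without and with a thought.

    `weight` is a hypothetical fixed mixing weight; a real model would learn it.
    """
    return [(1.0 - weight) * a + weight * b for a, b in zip(p_without, p_with)]


def reinforce_rewards(logp_future):
    """'Learn': score each sampled thought by how much it improves the
    log-likelihood of the true future token, relative to the mean over all
    sampled thoughts (an assumed baseline)."""
    baseline = sum(logp_future) / len(logp_future)
    return [lp - baseline for lp in logp_future]


def kept_thoughts(rewards):
    """Indices of thoughts whose reward is positive; the rest are discarded."""
    return [i for i, r in enumerate(rewards) if r > 0]


if __name__ == "__main__":
    true_token = 0  # index of the token that actually comes next (toy data)
    p_without = softmax([1.0, 1.0, 1.0])  # prediction with no thought

    # 'Think': pretend three sampled thoughts each led to different logits.
    logits_after_thought = [[3.0, 1.0, 1.0], [1.0, 1.0, 1.0], [0.0, 2.0, 1.0]]

    logp_future = []
    for logits in logits_after_thought:
        mixed = mix_predictions(p_without, softmax(logits), weight=0.5)
        logp_future.append(math.log(mixed[true_token]))

    rewards = reinforce_rewards(logp_future)
    print("rewards:", rewards)
    print("thoughts reinforced:", kept_thoughts(rewards))
```

Note that everything here happens at training time. At inference the result is still a single model emitting tokens, which is the point being argued in this thread.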
I don't think you can claim you know what's happening internally when OpenAI processes a request. They are a competitive company and will lie for competitive reasons. Most people think Q-Star is doing multiple inferences to accomplish a single task, and that's what all the evidence suggests. Whatever Sam Altman says means absolutely nothing, but I don't think he's claimed they use only a single inference either.
I recommend getting on Twitter to closely follow the leading individuals in the field of AI, and also watching the leading YouTube channels dedicated to AI research.
So far it's been unanimous. Everyone I've heard talk about it believes Strawberry is mainly just CoT. I'm not saying they didn't fine-tune a model too, I'm just saying I agree with most people that clever CoT is where most of the leap in capability seems to have come from.
I haven't even added Strawberry support to my app yet, and so haven't checked what its context length is, but you're right that additional context length is a scaling factor that's totally independent of whether CoT is used or not.
I'm just saying whatever they did in their [new] model, I think they also added CoT on top of it, as the outer layer of the onion so to speak.
> I wouldn't call o1 a "system". It's a model, but unlike previous models, it's trained to generate a very long chain of thought before returning a final answer
That answer seems to conflict with "in the future we'd like to give users more control over the thinking time".
I've gotten mini to think harder by asking it to, but it didn't produce a better answer. Though now I've run out of usage limits for both of them, so I can't try any more…
Not in a way that it is effectively used. In real life, all of the papers using CoT compare against a weak baseline, and the benefits level off extremely quickly.
Nobody except for recent DeepMind research has shown test-time scaling like o1.