The model's performance is driven by chain of thought, but they will not provide chain-of-thought responses to the user, for various reasons including competitive advantage.
After the release of GPT4 it became very common to fine-tune non-OpenAI models on GPT4 output. I’d say OpenAI is rightly concerned that fine-tuning on chain of thought responses from this model would allow for quicker reproduction of their results. This forces everyone else to reproduce it the hard way. It’s sad news for open weight models but an understandable decision.
The open source/weights models so far have proved that OpenAI doesn't have some special magic sauce. I'm confident we'll soon have a model from Meta or others that's close to this level of reasoning. [Also consider that some of their top researchers have departed]
On a cursory look, the chain of thought appears to be a long series of chains of thought, each building on the previous step, with a bit of backtracking whenever a negative result occurs, sort of like solving a maze.
I suspect that the largest limiting factor for a competing model will be the dataset. Unless they somehow used GPT4 to generate it, this is an extremely novel dataset to have to build.
I’d say it depends. If the model iterates 100x, I’d just say give me the output.
Same with problem solving in my brain: sure, sometimes it helps to think out loud. But taking a break and letting my unconscious do the work is helpful as well. For complex problems that’s actually nice.
I think eventually we won’t care, as long as it works or we can easily debug it.
Given the significant number of chain-of-thought tokens being generated, it also feels a bit odd to hide them from a cost-fairness perspective. How do we know they aren't inflating the count for profit?
No, it's the theory that charging for unaccountable usage invites fraud, a theory that has been repeatedly proven true whenever unaccountable bases for charges have been deployed.
Yeah, if they are charging for some specific resource like tokens then it better be accurate. But ultimately utility-like pricing is a mistake IMO. I think they should try to align their pricing with the customer value they're creating.
Not sure why you didn’t bother to check their pricing page (1) before dismissing my point. They are charging significantly more for both input (3x) and output (4x) tokens when using o1.
It’s really unclear to me what you understood by “cost fairness”.
I’m saying if you charge me per brick laid, but you can’t show me how many bricks were laid, nor can I calculate how many should have been laid - how do I trust your invoice?
Note: The reason I say all this is because OpenAI is simultaneously flailing for funding, while being inherently unprofitable as it continues to boil the ocean searching for strawberries.
It'd be helpful if they exposed a summary of the chain-of-thought response instead. That way they'd not be leaking the actual tokens, but you'd still be able to understand the outline of the process. And, hopefully, understand where it went wrong.
AFAIK, they are the least open of the major AI labs. Meta is open-weights and partly open-source. Google DeepMind is mostly closed-weights, but has released a few open models like Gemma. Anthropic's models are fully closed, but they've released their system prompts and safety evals, and have published a fair bit of research (https://www.anthropic.com/research). Anthropic also hasn't "released" anything without making it available to customers (unlike Sora or GPT-4o realtime). All of these groups also have free-usage tiers.
Am I right that this CoT is not actual reasoning in the same way that a human would reason, but rather just a series of queries to the model that still return results based on probabilities of tokens?
It could just be programmed to follow up by querying itself with a prompt like "Come up with arguments that refute what you just wrote; if they seem compelling, try a different line of reasoning, otherwise continue with what you were doing." Different such self-administered prompts along the way could guide it through what seems like reasoning, but would really be just a facsimile thereof.
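To make that concrete, here's a rough sketch of what such a self-critique loop might look like, assuming a hypothetical query_model wrapper around whatever completion API you use; this is just an illustration of the idea, not how o1 actually works:

    # Sketch of a self-administered critique loop; `query_model` is a hypothetical
    # stand-in for any chat-completion call, not OpenAI's actual o1 mechanism.
    def query_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client of choice")

    CRITIQUE = ("Come up with arguments that refute what you just wrote; "
                "if they seem compelling, reply REVISE, otherwise reply CONTINUE.")

    def solve_with_self_critique(question: str, max_rounds: int = 5) -> str:
        transcript = [f"Problem: {question}", "Draft an initial line of reasoning."]
        answer = query_model("\n".join(transcript))
        for _ in range(max_rounds):
            transcript.append(answer)
            verdict = query_model("\n".join(transcript + [CRITIQUE]))
            transcript.append(verdict)
            if "CONTINUE" in verdict:   # model judged its own reasoning sound
                break
            # otherwise backtrack and try a different approach
            answer = query_model("\n".join(transcript + ["Try a different line of reasoning."]))
        return answer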
> I'd say OpenAI is rightly concerned that fine-tuning on chain of thought responses from this model would allow for quicker reproduction of their results.
I tested the cipher example, and it got it right. But the "thinking logs" I see in the app look like a summary of the actual chain-of-thought messages, which are not visible.
The o1 models might use multiple methods to come up with an idea, and only one of them might be correct; that's what they show in ChatGPT. So it just summarises the CoT and does not include the whole reasoning behind it.
I don't understand how they square that with their pretense of being a non-profit that wants to benefit all of humanity. Do they not believe that competition is good for humanity?
You can see an example of the chain of thought in the post; it's quite extensive. Presumably they don't want to release it so that it can stay raw and unfiltered and they can better monitor for cases of manipulation or deviation from training. What GP is also referring to is explicitly stated in the post: they also aren't releasing the CoT for competitive reasons, so that competitors like Anthropic presumably can't use the CoT to train their own frontier models.
> Presumably they don't want to release it so that it can stay raw and unfiltered and they can better monitor for cases of manipulation or deviation from training.
My take was:
1. A genuine, un-RLHF'd "chain of thought" might contain things that shouldn't be told to the user. E.g., it might at some point think to itself, "One way to make an explosive would be to mix $X and $Y" or "It seems like they might be able to poison the person".
2. They want the "Chain of Thought" as much as possible to reflect the actual reasoning that the model is using, in part so that they can understand what the model is actually thinking. They fear that if they RLHF the chain of thought, the model will self-censor in a way which undermines their ability to see what it's really thinking.
3. So, they RLHF only the final output, not the CoT, letting the CoT be as frank within itself as any human, and post-filter the CoT for the user (rough sketch below).
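A rough sketch of what point 3 might look like on the serving side, with entirely made-up helper names (violates_policy, summarize); this is my reading of the post, not OpenAI's actual pipeline:

    # The raw CoT stays internal for monitoring; the user sees only the RLHF'd
    # answer plus a post-filtered summary of the reasoning. All names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class ModelTurn:
        raw_cot: str   # unfiltered chain of thought, kept internal
        answer: str    # RLHF'd final output, safe to show as-is

    def violates_policy(text: str) -> bool:
        # stand-in for a real moderation/policy classifier
        return "explosive" in text.lower()

    def summarize(text: str) -> str:
        # stand-in for a model-written summary of the reasoning
        first_line = text.strip().splitlines()[0]
        return f"(summary) {first_line} ..."

    def render_for_user(turn: ModelTurn) -> str:
        shown = "(reasoning withheld)" if violates_policy(turn.raw_cot) else summarize(turn.raw_cot)
        return f"{shown}\n\n{turn.answer}"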
This is a direct quote from the article:
> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users
I think they mean that you won’t be able to see the “thinking”/“reasoning” part of the model’s output, even though you pay for it. If you could see it, you might be able to infer how these models reason and replicate it as a competitor.