Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning" but I'd imagine they RL'd on olympiad math questions? If not I really hope someone from OpenAI says that bc it would be pretty astounding.
They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).
Their hardware isn't fine-tuned to it, though; it's the same general-intelligence hardware that all other humans use.
So it's a big difference whether you take a general intelligence system and make it do well at math, or create a specialized system that is only good at math and can't be used to get good in other areas.
From my vague remembrance of doing data science years ago, it's very hard not to leak the validation set.
Basically how you do RL is that you make a set of training examples of input-output pairs, and set aside a smaller validation set, which you never train on, to check if your model's doing well.
What you do is you tweak the architecture and the training set until it does well on the validation set. By doing so, you inadvertently leak info about the validation set into the model. Perhaps you choose an architecture which does well on the validation set. Perhaps you add more training examples that resemble the ones being validated.
Even without the explicit intent to cheat, it's very hard to avoid this contamination: if you chose a different validation set, you'd end up with a different model.
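To make that concrete, here's a minimal sketch of the selection loop I'm describing (toy code with made-up names; the only point is where the validation set enters the process):

    import random

    random.seed(0)

    def train(hyperparam, train_set):
        # stand-in for real training; the "model" here is just its hyperparameter
        return hyperparam

    def evaluate(model, dataset):
        # stand-in for a real metric
        return random.random() + 0.01 * model

    train_set = ["..."]        # input-output pairs we train on
    validation_set = ["..."]   # held out, never trained on

    best_model, best_score = None, float("-inf")
    for hyperparam in [1, 2, 3, 4, 5]:   # architectures / data mixes we try
        model = train(hyperparam, train_set)
        # the validation set steers which model we keep, so information about it
        # leaks into the final choice even though we never train on it
        score = evaluate(model, validation_set)
        if score > best_score:
            best_model, best_score = model, score

With a different validation split you would likely have kept a different model, which is exactly the contamination I mean.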
Yeah, looking at the GP ... say a sequence of things that are true and plausible, then add your strong, unsupported claim at the end. I remember that approach from when I studied persuasion techniques...
No, and they're lying on the most important claim: that this is not a model specialized to IMO problems.
From the thread:
> just to be clear: the IMO gold LLM is an experimental research model.
The thread tried to muddy the narrative by saying the methodology can generalize, but no one is claiming the actual model is a generalized model.
There'd be a massively different conversation needed if a generalized model that could become the next iteration of ChatGPT had achieved this level of performance.
Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
We can only go off their word, unfortunately, and they say no formal math was used, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this, because geometry in Lean is not that well developed, so unless they also built their own system it would be hard to do formally (though their P2 proof is a coordinate bash, i.e. computation by algebra instead of geometric construction, so it's hard to tell).
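Purely speculating, but "evaluated by a verifier model instead of a formal system" would look roughly like best-of-N sampling with a learned grader; every name below is made up, this is just to show the shape of it:

    def generate_candidates(problem, prover_model, n=64):
        # stand-in for sampling n natural-language proof attempts from the prover
        return [prover_model(problem, seed=i) for i in range(n)]

    def verifier_score(problem, proof, verifier_model):
        # stand-in for a second model grading the proof; unlike Lean, this gives
        # a confidence score, not a machine-checked guarantee
        return verifier_model(problem, proof)

    def solve(problem, prover_model, verifier_model, n=64):
        candidates = generate_candidates(problem, prover_model, n)
        return max(candidates,
                   key=lambda p: verifier_score(problem, p, verifier_model))

The difference from an AlphaProof-style pipeline is that nothing here ever gets formally checked, so a convincing-but-wrong step could in principle slip through.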
In general I agree with you, but I see the point of requiring proof for statements made by them, instead of accepting them at face value. Given previous experience, and considering that they benefit if these statements are believed, shouldn't the burden of proof be on those making them rather than on those questioning them?
Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. I would assume that in that case they indeed had the purpose of passing these tests in mind when creating them. Or was it created for something different, and they discovered completely by chance that it could be used for the challenge?
You don't need specialized tooling like Lean if you have enough training data with statements written in natural language, I suppose. But the use of AlphaProof/AlphaGeometry-style learning is almost certain. And I'm sure they spent a lot of compute to produce the solutions; $10k is not a problem for them.
The bigger question is - why should everyone be excited by this? If they don't plan to share anything related to this AI model back to humanity.
I actually think this “cheating” is fine. In fact it’s preferable. I don’t need an AI that can act as a really expensive calculator or solver; we’ve already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency. I see no value in avoiding the optimal or near-optimal techniques we’ve devised; the harder reasoning tasks are choosing tools, instrumenting them properly, interpreting their results, and iterating, and that is the missing piece in automated reasoning. A NN that approximates those tools at great cost is a parlor trick: interesting, but not useful or practical. Even if they have some agent system here, it doesn’t make it any less of an achievement that a machine can zero-shot do as well as top humans on incredibly difficult reasoning problems posed in natural language.
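If it helps, the loop I mean is roughly this (all tool names invented for illustration):

    # toy sketch of tool-using "agency": the model's job is to pick the tool,
    # set up the call, and interpret the result, not to approximate the tool itself
    TOOLS = {
        "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # toy only
        "constant":   lambda name: {"pi": 3.141592653589793}[name],
    }

    def run_step(plan):
        # plan = (tool_name, argument), as chosen by the reasoning model
        tool_name, arg = plan
        result = TOOLS[tool_name](arg)
        return result   # fed back into the model's context for the next step

    print(run_step(("calculator", "2**10 + 24")))   # 1048

The interesting part is everything around this loop: deciding which tool fits the problem, noticing when the result is nonsense, and iterating.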
Surely you jest. The cheating would be the same cheating as any other situation: someone inside the IMO slipping the questions and answers to people outside, and that then being used to compete. Fine - but why? If this were discovered it would be disastrous for everyone involved, and for what? A noteworthy HN link? The downside would be international scandal and careers destroyed. The upside is imperceptible.
Finally, even if you trained the model on the answers, the weight shift in such an enormous model would be inconsequential; you would need to prime the context or boost the weights. All this seems like absurd lengths to go to just to cheat on this one thing, rather than focusing your energies on actually improving model performance. The payout for OpenAI isn’t a gold medal at the IMO, it’s having a model that can get a gold medal at the IMO and then selling it. But it has to actually be capable of doing what’s on the tin, otherwise their customers will easily and rapidly discover this.
Sorry, I like tin foil as much as anyone else, but this doesn’t seem credibly likely given the incentive structure.
Yet that level of cheating happens all the time because it's very unlikely to be discovered. Sometimes it's just done by people lower down to advance their own careers, since they don't have as much to lose. But cheating does happen, and it's not that unlikely, especially when salaries are this high.
Why "almost certain"? The link you provided has this to say:
> 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
> 8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model.
And from Sam Altman:
> we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models.
The wording you quoted is very tricky: the method used to create the model is generalizable, but the model is not a general-use model.
If I have a post-training method that allows a model to excel at a narrow task, it's still a generalizable method if there's a wide range of narrow tasks that it works on.
Since this looks like a geometric proof, I wonder whether the AI operates only on logical/mathematical statements or whether it actually somehow 'visualizes' the proof the way a human would while solving it.
No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field). As the original parent said, pretty much only ppl who had the training in high school can. Like number theorists without training might be able to do some number theory IMO questions but this level is basically impossible without specialized training (with maybe a few exceptions of very strong mathematicians)
> No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field)
I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.
You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side; we are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at a speed of multiple words per second.
I mean, back when I was practicing these problems I would sometimes try them on and off for a week and would be able to do some 3s and 6s (I can usually do 1 and 4 somewhat consistently, and usually none of the others). As a working mathematician today, I would almost certainly not be able to get gold-medal performance in a week, but for a given problem I guess I would have at least a ~50% chance of solving it in a week? Though I haven't tried in a while. I suspect the professionals here do worse at these competition questions than you think. Certainly these problems are "easy" compared to many of the questions we think about, but expertise drastically shifts the speed/difficulty of questions we can solve within our own domains, if that makes sense.
Addendum: actually, I am not sure the probability of solving one of these in a week is much better than in 6 hours, because they are kind of random questions. But I agree with some parts of your post, tbf.
> It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds
Really? My expectation would have been the opposite, that time was a constraint for the AIs. OpenAI's highest end public reasoning models are slow, and there's only so much that you can do by parallelization.
Understanding how they dealt with time actually seems like the most important thing to put these results into context, and they said nothing about it. Like, I'd hope they gave the same total time allocation for a whole problem set as the human competitors. But how did they split that time? Did they work on multiple problems in parallel?
I sense we may just have different experiences of our colleagues' skill sets, as I can think of 5 people I could send some questions to and I know they would do them just fine. In fact we have often done similar problems on a free afternoon, and I often do the same on flights as a way to pass the time and improve my focus (my issue isn't my talent for or understanding of maths, it's my ability to concentrate). I don't disagree that some level of training is needed, but these questions aren't unique, nor impossible, especially as said training does exist and LLMs can access those examples. LLMs also have brute force, which is a significant help with this type of problem. One particular point: of all the STEM topics to try to focus on, math is probably the best documented, alongside CS.
I mean, you can get better at these problems with practice. But if you haven't solved many before and can do them after an afternoon of thought, I would be very impressed. Not that I don't believe you, it's just that in my experience people like this are very rare. (Also I assume they have to have some degree of familiarity with common tricks, otherwise they would have to derive basic number theory from scratch, etc., and that seems a bit much for me to believe.)
I think honestly it's probably different experiences and skill sets. I find these sorts of things doable, bar dumb mistakes, yet there will be other things I'll get stressed about and not be able to do for ages (some lab skills, no matter how many times I do them, and some physical equation derivations that I regularly muck up). I sometimes assume that what comes easy for me comes easy for all, and what I struggle with everyone struggles with, and that's probably not always the case. Likewise I did similar tasks as a teen in school, and I assume that's the case for many of the academically bright, so to speak, but perhaps it isn't; that probably helped me learn some tricks I may not have otherwise. But as you say, I do feel that you can learn the tricks and learn how to do them, even at an older age (academically speaking), if you have the time, the patience, and the right guide.
Here you go: you did this type of problem as a kid/teenager, so 1) you likely have a talent for it, and 2) you have some training.
I did participate in math/informatics olympiads as a teenager and even taught them a little, and from my experience, some people just _like_ this sort of problem naturally; it tickles their minds, and given time these people develop to insane levels at it.
'Normal people', in my experience, even in math departments, don't like that type of problem and would not fare well with it.
I am a professor in a math department (I teach statistics but there is a good complement of actual math PhDs) and there are only about 10% who care about these types of problems and definitely less than half who could get gold on an IMO test even if they didn’t care.
They are all outstanding mathematicians, but the IMO type questions are not something that mathematicians can universally solve without preparation.
There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.
My second degree is in mathematics. Not only can I probably not do these but they likely aren’t useful to my work so I don’t actually care.
I’m not sure an LLM could replace the mathematical side of my work (modelling), mostly because it’s applied: people don’t know what they are asking for, what is possible, or how to do it, and all the problems turn out to be quite simple really.
100% agree about this too (also a professional mathematician). To mathematicians who have not been trained on such problems, these will typically look very hard, especially the more recent olympiad problems (as opposed to problems from, e.g., 30 years ago). Basically these problems have become more about mastering a very impressive list of techniques than they were at the inception (and participants prepare more and more for them). On the other hand, research mathematics has become more and more technical, but the techniques are very different, so the correlation between olympiads and research is probably smaller than it once was.
Yeah, no - quite a chunk of IMO problems are planar and 3d geometry, and you don't really do that at university level (exception: specializing in high school maths didactics)
I see this distinction a lot, but what is the fundamental difference between competition "math" and professional/research math? If people actually knew then they (young students, and their parents) could decide for themselves if they wanted to engage in either kind of study.
I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.
I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.
I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.
The trouble is, getting an IMO gold medal is much easier (by frequency) than being the #1 Go player in the world, which was achieved by AI 10 years ago. I'm not sure it's enough to just gesture at the task; drilling down into precisely how it was achieved feels important.
(Not to take away from the result, which I'm really impressed by!)
The "AI" that won Go was Monte Carlo tree search on a neural net "memory" of the outcome of millions of previous games; this is a LLM solving open ended problems. The tasks are hardly even comparable.
IMO questions are to math as leetcode questions are to software engineering. Not necessarily easier or harder but they test ability on different axes. There’s definitely some overlap with undergrad level proof style questions but I disagree that being a working mathematician would necessarily mean you can solve these type of questions quickly. I did a PhD in pure math (and undergrad obv) and I know I’d have to spend time revising and then practicing to even begin answering most IMO questions.
My experience is that replicating papers is actually nontrivial. For example, someone announced they had replicated GPT-2 some time back, but when evals were run it turned out to be the equivalent of a much smaller model.
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.