Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning" but I'd imagine they RL'd on olympiad math questions? If not I really hope someone from OpenAI says that bc it would be pretty astounding.
They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).
Their hardware isn't fine-tuned to it, though; it's the same general-intelligence hardware that all other humans use.
So it's a big difference whether you take a general intelligence system and make it do well at math, or create a specialized system that is only good at math and can't be used to get good in other areas.
From my vague remembrance of doing data science years ago, it's very hard not to leak the validation set.
Basically how you do RL is that you make a set of training examples of input-output pairs, and set aside a smaller validation set, which you never train on, to check if your model's doing well.
What you do is you tweak the architecture and the training set until it does well on the validation set. By doing so, you inadvertently leak info about the validation set into the model. Perhaps you choose an architecture which does well on the validation set. Perhaps you add more training examples that resemble the ones being validated.
Even without the explicit intent to cheat, it's very hard to avoid this contamination: if you chose a different validation set, you'd end up with a different model.
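To make that concrete, here's a minimal sketch of the selection loop I'm describing (toy code with made-up names; the only point is where the validation set enters the process):

    import random

    random.seed(0)

    def train(hyperparam, train_set):
        # stand-in for real training; the "model" here is just its hyperparameter
        return hyperparam

    def evaluate(model, dataset):
        # stand-in for a real metric
        return random.random() + 0.01 * model

    train_set = ["..."]        # input-output pairs we train on
    validation_set = ["..."]   # held out, never trained on

    best_model, best_score = None, float("-inf")
    for hyperparam in [1, 2, 3, 4, 5]:   # architectures / data mixes we try
        model = train(hyperparam, train_set)
        # the validation set steers which model we keep, so information about it
        # leaks into the final choice even though we never train on it
        score = evaluate(model, validation_set)
        if score > best_score:
            best_model, best_score = model, score

With a different validation split you would likely have kept a different model, which is exactly the contamination I mean.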
Yeah, looking at the GP ... say a sequence of things that are true and plausible, then add your strong, unsupported claim at the end. I remember that approach from when I studied persuasion techniques...
No, and they're lying on the most important claim: that this is not a model specialized to IMO problems.
From the thread:
> just to be clear: the IMO gold LLM is an experimental research model.
The thread tried to muddy the narrative by saying the methodology can generalize, but no one is claiming the actual model is a generalized model.
There'd be a massively different conversation needed if a generalized model that could become the next iteration of ChatGPT had achieved this level of performance.
Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
We can only go off their word, unfortunately, and they say no formal math was used, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this, because geometry in Lean is not that well developed, so unless they also built their own system it would be hard to do formally (though their P2 proof is a coordinate bash, i.e. computation by algebra instead of geometric construction, so it's hard to tell).
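Purely speculating, but "evaluated by a verifier model instead of a formal system" would look roughly like best-of-N sampling with a learned grader; every name below is made up, this is just to show the shape of it:

    def generate_candidates(problem, prover_model, n=64):
        # stand-in for sampling n natural-language proof attempts from the prover
        return [prover_model(problem, seed=i) for i in range(n)]

    def verifier_score(problem, proof, verifier_model):
        # stand-in for a second model grading the proof; unlike Lean, this gives
        # a confidence score, not a machine-checked guarantee
        return verifier_model(problem, proof)

    def solve(problem, prover_model, verifier_model, n=64):
        candidates = generate_candidates(problem, prover_model, n)
        return max(candidates,
                   key=lambda p: verifier_score(problem, p, verifier_model))

The difference from an AlphaProof-style pipeline is that nothing here ever gets formally checked, so a convincing-but-wrong step could in principle slip through.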
In general I agree with you, but I see the point of requiring proof for statements made by them, instead of accepting them at face value. Given previous experience, and considering that they benefit if these statements are believed, shouldn't the burden of proof be on those making them rather than on those questioning them?
Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. I would assume that in that case they indeed had the purpose of passing these tests in mind when creating them. Or was it created for something different, and they discovered completely by chance that it could be used for the challenge?
You don't need specialized tooling like Lean if you have enough training data with statements written in natural language, I suppose. But the use of AlphaProof/AlphaGeometry-style learning is almost certain. And I'm sure they spent a lot of compute to produce the solutions; $10k is not a problem for them.
The bigger question is - why should everyone be excited by this? If they don't plan to share anything related to this AI model back to humanity.
I actually think this “cheating” is fine. In fact it’s preferable. I don’t need an AI that can act as a really expensive calculator or solver; we’ve already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency. I see no value in avoiding the optimal or near-optimal techniques we’ve devised; the harder reasoning tasks are choosing tools, instrumenting them properly, interpreting their results, and iterating, and that is the missing piece in automated reasoning. A NN that approximates those tools at great cost is a parlor trick: interesting, but not useful or practical. Even if they have some agent system here, it doesn’t make it any less of an achievement that a machine can zero-shot do as well as top humans on incredibly difficult reasoning problems posed in natural language.
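If it helps, the loop I mean is roughly this (all tool names invented for illustration):

    # toy sketch of tool-using "agency": the model's job is to pick the tool,
    # set up the call, and interpret the result, not to approximate the tool itself
    TOOLS = {
        "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # toy only
        "constant":   lambda name: {"pi": 3.141592653589793}[name],
    }

    def run_step(plan):
        # plan = (tool_name, argument), as chosen by the reasoning model
        tool_name, arg = plan
        result = TOOLS[tool_name](arg)
        return result   # fed back into the model's context for the next step

    print(run_step(("calculator", "2**10 + 24")))   # 1048

The interesting part is everything around this loop: deciding which tool fits the problem, noticing when the result is nonsense, and iterating.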
Surely you jest. The cheating would be the same cheating as any other situation: someone inside the IMO slipping the questions and answers to people outside, and that then being used to compete. Fine - but why? If this were discovered it would be disastrous for everyone involved, and for what? A noteworthy HN link? The downside would be international scandal and careers destroyed. The upside is imperceptible.
Finally, even if you trained the model on the answers, the weight shift in such an enormous model would be inconsequential; you would need to prime the context or boost the weights. All this seems like absurd lengths to go to just to cheat on this one thing, rather than focusing your energies on actually improving model performance. The payout for OpenAI isn’t a gold medal at the IMO, it’s having a model that can get a gold medal at the IMO and then selling it. But it has to actually be capable of doing what’s on the tin, otherwise their customers will easily and rapidly discover this.
Sorry, I like tin foil as much as anyone else, but this doesn’t seem credibly likely given the incentive structure.
Yet that level of cheating happens all the time because it's very unlikely to be discovered. Sometimes it's just done by people lower down to advance their own careers, since they don't have as much to lose. But cheating does happen, and it's not that unlikely, especially when salaries are this high.
Why "almost certain"? The link you provided has this to say:
> 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
> 8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model.
And from Sam Altman:
> we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models.
The wording you quoted is very tricky: the method used to create the model is generalizable, but the model is not a general-use model.
If I have a post-training method that allows a model to excel at a narrow task, it's still a generalizable method if there's a wide range of narrow tasks that it works on.
Since this looks like a geometric proof, I wonder whether the AI operates only on logical/mathematical statements or whether it actually somehow 'visualizes' the proof the way a human would while solving it.
No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field). As the original parent said, pretty much only ppl who had the training in high school can. Like number theorists without training might be able to do some number theory IMO questions but this level is basically impossible without specialized training (with maybe a few exceptions of very strong mathematicians)
> No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field)
I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.
You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side; we are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at a speed of multiple words per second.
I mean, back when I was practicing these problems I would sometimes try them on and off for a week and would be able to do some 3s and 6s (I can usually do 1 and 4 somewhat consistently, and usually none of the others). As a working mathematician today, I would almost certainly not be able to get gold-medal performance in a week, but for a given problem I guess I would have at least a ~50% chance of solving it in a week? Though I haven't tried in a while. I suspect the professionals here do worse at these competition questions than you think. Certainly these problems are "easy" compared to many of the questions we think about, but expertise drastically shifts the speed/difficulty of questions we can solve within our own domains, if that makes sense.
Addendum: actually, I am not sure the probability of solving one of these in a week is much better than in 6 hours, because they are kind of random questions. But I agree with some parts of your post, tbf.
> It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds
Really? My expectation would have been the opposite, that time was a constraint for the AIs. OpenAI's highest end public reasoning models are slow, and there's only so much that you can do by parallelization.
Understanding how they dealt with time actually seems like the most important thing to put these results into context, and they said nothing about it. Like, I'd hope they gave the same total time allocation for a whole problem set as the human competitors. But how did they split that time? Did they work on multiple problems in parallel?
I sense we may just have different experiences of our colleagues' skill sets, as I can think of 5 people I could send some questions to and I know they would do them just fine. In fact we have often done similar problems on a free afternoon, and I often do the same on flights as a way to pass the time and improve my focus (my issue isn't my talent for or understanding of maths, it's my ability to concentrate). I don't disagree that some level of training is needed, but these questions aren't unique, nor impossible, especially as said training does exist and LLMs can access those examples. LLMs also have brute force, which is a significant help with this type of problem. One particular point: of all the STEM topics to try to focus on, math is probably the best documented, alongside CS.
I mean, you can get better at these problems with practice. But if you haven't solved many before and can do them after an afternoon of thought, I would be very impressed. Not that I don't believe you, it's just that in my experience people like this are very rare. (Also I assume they have to have some degree of familiarity with common tricks, otherwise they would have to derive basic number theory from scratch, etc., and that seems a bit much for me to believe.)
I think honestly it's probably different experiences and skill sets. I find these sorts of things doable, bar dumb mistakes, yet there will be other things I'll get stressed about and not be able to do for ages (some lab skills, no matter how many times I do them, and some physical equation derivations that I regularly muck up). I sometimes assume that what comes easy for me comes easy for all, and what I struggle with everyone struggles with, and that's probably not always the case. Likewise I did similar tasks as a teen in school, and I assume that's the case for many of the academically bright, so to speak, but perhaps it isn't; that probably helped me learn some tricks I may not have otherwise. But as you say, I do feel that you can learn the tricks and learn how to do them, even at an older age (academically speaking), if you have the time, the patience, and the right guide.
Here you go: you did this type of problem as a kid/teenager, so 1) you likely have a talent for it, and 2) you have some training.
I did participate in math/informatics olympiads as a teenager and even taught them a little, and from my experience, some people just _like_ this sort of problem naturally; it tickles their minds, and given time these people develop to insane levels at it.
'Normal people', in my experience, even in math departments, don't like that type of problem and would not fare well with it.
I am a professor in a math department (I teach statistics but there is a good complement of actual math PhDs) and there are only about 10% who care about these types of problems and definitely less than half who could get gold on an IMO test even if they didn’t care.
They are all outstanding mathematicians, but the IMO type questions are not something that mathematicians can universally solve without preparation.
There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.
My second degree is in mathematics. Not only can I probably not do these but they likely aren’t useful to my work so I don’t actually care.
I’m not sure an LLM could replace the mathematical side of my work (modelling), mostly because it’s applied: people don’t know what they are asking for, what is possible, or how to do it, and all the problems turn out to be quite simple really.
100% agree about this too (also a professional mathematician). To mathematicians who have not been trained on such problems, these will typically look very hard, especially the more recent olympiad problems (as opposed to problems from, e.g., 30 years ago). Basically these problems have become more about mastering a very impressive list of techniques than they were at the inception (and participants prepare more and more for them). On the other hand, research mathematics has become more and more technical, but the techniques are very different, so the correlation between olympiads and research is probably smaller than it once was.
Yeah, no - quite a chunk of IMO problems are planar and 3d geometry, and you don't really do that at university level (exception: specializing in high school maths didactics)
I see this distinction a lot, but what is the fundamental difference between competition "math" and professional/research math? If people actually knew then they (young students, and their parents) could decide for themselves if they wanted to engage in either kind of study.
I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.
I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.
I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.
The trouble is, getting an IMO gold medal is much easier (by frequency) than being the #1 Go player in the world, which was achieved by AI 10 years ago. I'm not sure it's enough to just gesture at the task; drilling down into precisely how it was achieved feels important.
(Not to take away from the result, which I'm really impressed by!)
The "AI" that won Go was Monte Carlo tree search on a neural net "memory" of the outcome of millions of previous games; this is a LLM solving open ended problems. The tasks are hardly even comparable.
IMO questions are to math as leetcode questions are to software engineering. Not necessarily easier or harder but they test ability on different axes. There’s definitely some overlap with undergrad level proof style questions but I disagree that being a working mathematician would necessarily mean you can solve these type of questions quickly. I did a PhD in pure math (and undergrad obv) and I know I’d have to spend time revising and then practicing to even begin answering most IMO questions.
My experience is that replicating papers is actually nontrivial. For example, someone announced they had replicated GPT-2 some time back, but when evals were run it turned out to be the equivalent of a much smaller model.
Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.