O3 has demonstrated that OpenAI needs 1,000,000% more inference-time compute to score 50% higher on benchmarks. If O3-High costs about $350k an hour to operate, that would mean making O4 score 50% higher would cost $3.5B (!!!) an hour. That's a scaling wall.
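Spelled out, taking those figures at face value:

    # back-of-the-envelope: 1,000,000% more = 10,000x the compute
    hourly_cost = 350_000 * 10_000
    print(hourly_cost)  # 3,500,000,000 -> $3.5B an hour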
I’m convinced they’re getting good at gaming the benchmarks, since 4 has deteriorated via ChatGPT. In fact, I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems. GPT-4 is an amazing tool, but the true capabilities are being hidden from the public and/or intentionally neutered.
> I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems
Just chiming in to say you're not alone. This has been my experience as well. The o# line of models just doesn't do well at coding, regardless of what the benchmarks say.
All the benchmarks provide substantial scaffolding and specification details, and that's if they're zero-shot at all, which they often are not. In reality, nobody wants to spend that much time providing so many details or examples just to get the AI to write the correct function, when with the same time and effort you could have written it yourself.
Also, those benchmarks often run the model K times on the same question, and if any one of the runs is correct, they say it passed. That could mean that if you ran the model 8 times, it might come up with the right answer only once, and now you have to waste your time checking whether each answer is right or not.
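Roughly, this is the standard pass@k metric (as I understand it, the unbiased estimator popularized by the HumanEval/Codex paper); a small sketch of why a 1-in-8 success rate can still look like a "pass":

    from math import comb

    # pass@k: n samples are drawn per problem, c of them pass the tests;
    # the metric is the chance that at least one of k random samples is correct.
    def pass_at_k(n, c, k):
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=8, c=1, k=8))  # 1.0   -- counted as solved under pass@8
    print(pass_at_k(n=8, c=1, k=1))  # 0.125 -- what a single try actually gets you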
I want to ask: "Write a function to count unique numbers in a list" and get the correct answer the first time.
What you need to ask:
"""
Write a Python function that takes a list of integers as input and returns
the count of numbers that appear exactly once in the list.
The function should:
- Accept a single parameter: a list of integers
- Count elements that appear exactly once
- Return an integer representing the count
- Handle empty lists and return 0
- Handle lists with duplicates correctly
Please provide a complete implementation.
"""
And then run it 8 times, and if you're lucky it'll get it right at least once.
Edit: I'm not even aware of a benchmark that is Pass@1, zero-shot, and without detailed prompting (natural prompting). If anyone knows of one, let me know.
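For scale, a minimal answer to that prompt is only a few lines (assuming "unique" means "appears exactly once", as the spec above spells out):

    from collections import Counter

    def count_unique(nums):
        # count elements that appear exactly once; empty list -> 0
        return sum(1 for c in Counter(nums).values() if c == 1)

    print(count_unique([]))            # 0
    print(count_unique([1, 2, 2, 3]))  # 2 (only 1 and 3 appear exactly once)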
I used to run a lot of Monte Carlo simulations, where the error is proportional to the inverse square root of the number of samples. There was a huge advantage to running for an hour vs a few minutes, but you hit diminishing returns depressingly quickly. It would not surprise me at all if LLMs end up having similar scaling properties.
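A toy version of the same effect (estimating pi, not my actual simulations; the exact numbers vary run to run but the trend doesn't):

    import math, random

    def estimate_pi(n):
        inside = sum(random.random()**2 + random.random()**2 <= 1.0 for _ in range(n))
        return 4.0 * inside / n

    for n in [1_000, 100_000, 10_000_000]:
        print(n, abs(estimate_pi(n) - math.pi))
    # 100x more samples only buys roughly 10x less error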
Yeah, any situation where you need O(n^2) runtime to obtain n bits of output (or bits of accuracy, in the Monte Carlo case) is pure pain. At every point, it's still within your means to double the amount of output (by running it 3x longer than you have so far), but it gradually becomes more and more painful, instead of there being a single point where you can call it off.
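Concretely, under the time ~ n^2 assumption:

    # doubling n always costs 3x whatever has been spent so far
    t = lambda n: n**2
    for n in [10, 20, 40, 80]:
        print(n, t(n), t(2 * n) - t(n))  # extra cost to double from here = 3 * t(n)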
Even assuming that past rates of inference cost scaling hold up, we would only expect a 2 OoM decrease after about a year or so.
And 1% of $3.5B is still about $35M an hour, which is a very large number.
Not really. o3-low compute still stomps the benchmarks and isn't anywhere near that expensive, and o3-mini seems better than o1 while being cheaper.
Combine that with the fact that LLM inference costs have dropped by orders of magnitude over the last few years, and harping on the inference costs of a new release seems a bit silly.
Not necessarily. And this is the problem with ARC that people seem to forget.
- It's just a suite of visual puzzles. It's not like, say, GSM8K, where proficiency in it gives some indication of math proficiency in general.
- It's specifically a suite of puzzles that LLMs have shown particular difficulty with.
Basically, how much compute it takes to handle a task in this benchmark does not correlate with how much it will take LLMs to handle the tasks that people actually want to use LLMs for.
If the benchmark is not representative of normal usage*, then the benchmark and the plot being shown are not useful from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high in ARC-AGI would be highly misleading. It also makes the "representative" point moot for this discussion (i.e. saying o3 stomps benchmarks while admitting the benchmarks aren't representative).
*I don't think that is the case, as you can at least draw relative conclusions (i.e. o3 vs the o1 series: o3-low is 4x to 20x the cost for ~3x the perf). Even if it is pure marketing, they expect people to draw conclusions from the perf/cost plot from ARC.
PS: I know there are more benchmarks like SWE-Bench and Frontier Math, but this is the only one showing data about o3-low/high costs, unless you count the CodeForces plot that includes o3-mini (that one does look interesting, though right now it's vaporware) but doesn't separate between compute-scale modes.
>If the benchmark is not representative of normal usage*, then the benchmark and the plot being shown are not useful from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high in ARC-AGI would be highly misleading.
ARC is a very hyped benchmark in the industry, so sharing the results is something any company would do, whether or not it has any direct bearing on normal usage.
>Even if it is pure marketing, they expect people to draw conclusions from the perf/cost plot from ARC.
Again, people care about ARC; they don't care about doing the things ARC questions ask. That it is uneconomical to pay the price to use o3 for ARC does not mean it would be uneconomical to do so for the tasks people actually want to use LLMs for. What does 3x the performance in, say, coding mean? You really think companies/users wouldn't put up with the increased price for that? You think they have Mturkers to turn to like they do with ARC?
ARC is literally the quintessential 'easy for humans, hard for AI' benchmark. Even if you discard the 'difficulty to price won't scale the same' argument, it makes no sense to use it for an economics comparison.
In summary: "stomps benchmarks" means nothing for anyone trying to make decisions based on that announcement (yet they show cost/perf info). It seems hypey.
If you are talking about the ARC benchmark, then o3-low doesn't look that special if you take into account that there are plenty of fine-tuned models that, with much smaller resources, achieved 40-50% results on the private set (not the semi-private set, as with o3-low).
- I'm not just talking about ARC. On Frontier Math, we have 2 scores, one with pass@1 and another with a consensus vote over 64 samples. Both scores are much better than the previous SotA.
- Also, apparently ARC wasn't a special fine-tune; rather, some of the ARC training set was included in the pre-training corpus.
>That result is not verifiable, not reproducible, and it's unknown if it was leaked or how it was measured. It's kinda hype science.
It will be verifiable when the model is released. OpenAI hasn't released any benchmark scores that were later shown to be falsified, so unless you have an actual reason to believe they're outright lying, it's not something to take seriously.
Frontier Math is a private benchmark. Of its highest tier of difficulty, Terence Tao says:
“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Unless you have a reason to believe the answers were leaked, then again, I'm not interested in baseless speculation.
>It's private for outsiders, but it was developed in "collaboration" with OAI, and GPT was tested on it in the past, so they have it in logs somewhere.
They probably have logs of the questions, but that's not enough. Frontier Math isn't something that can be fully solved without gathering top experts in multiple disciplines. Even Tao says he only knows who to ask for the most difficult set.
Basically, what you're suggesting, at least with this benchmark in particular, is far more difficult than you're implying.
>If you think this entire conversation is pointless, then why do you continue?
There's no point arguing about how efficient the models are (the original point) if you won't even accept the results of the benchmarks. Why am I continuing? For now, it's only polite to clarify.
> Frontier Math isn't something that can be fully solved without gathering top experts
Tao's quote above referred to the hardest 20% of the problems; they have 3 levels of difficulty, and presumably the first level is much easier. Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.
> There's no point arguing
Lol, let me ask again: why are you arguing, then? Yes, I have strong and (imo) reasonable doubt that those results are valid.