Hacker News

It is still not economical: on ARC it's at least $20 per task vs ~$3 for a human (an average MTurker) at the same performance.



Not necessarily. And this is the problem with ARC that people seem to forget.

- It's just a suite of visual puzzles. It's not like, say, GSM8K, where proficiency gives some indication of math proficiency in general.

- It's specifically a suite of puzzles that LLMs have shown particular difficulty with.

Basically, how much compute it takes to handle a task in this benchmark does not correlate with how much compute LLMs will need for the tasks people actually want to use them for.


If the benchmark is not representative of normal usage*, then the benchmark and the plot being shown are not useful at all from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high on ARC-AGI would be highly misleading. The "representative" point also undercuts the whole discussion (i.e. claiming o3 stomps benchmarks while conceding the benchmarks aren't representative).

*I don't think that is the case, since you can at least draw relative conclusions (i.e. comparing the o3 and o1 series: o3-low is 4x to 20x the cost for ~3x the performance). Even if it is pure marketing, they expect people to draw conclusions from the perf/cost plot from ARC.
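The comparison above can be made concrete with some back-of-the-envelope arithmetic. The figures are the rough ones quoted in this thread (~$20/task for o3 on ARC vs ~$3/task for an MTurker at similar performance), not official pricing, and the shared accuracy value is an illustrative assumption:

```python
# Back-of-the-envelope cost comparison using the rough figures
# quoted in this thread (not official pricing).

def cost_per_correct(cost_per_task: float, accuracy: float) -> float:
    """Effective cost of one correct answer at a given accuracy."""
    return cost_per_task / accuracy

ASSUMED_ACCURACY = 0.75  # illustrative only; both solvers assumed equal here

o3_low = cost_per_correct(20.0, ASSUMED_ACCURACY)  # ~$20/task quoted for o3
human = cost_per_correct(3.0, ASSUMED_ACCURACY)    # ~$3/task for an MTurker

print(f"o3-low: ${o3_low:.2f} per correct task")
print(f"human:  ${human:.2f} per correct task")
print(f"ratio:  {o3_low / human:.1f}x")  # ~6.7x at equal accuracy
```

At equal accuracy the accuracy term cancels, so the model is ~6.7x the human cost per correct answer; the ratio only moves if one side's accuracy differs.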

PS: I know there are more benchmarks, like SWE-Bench and Frontier Math, but this is the only one showing data about o3-low/high costs, unless you count the CodeForces plot that includes o3-mini (that one does look interesting, though right now it's vaporware) but does not separate the compute scale modes.


>If the benchmark is not representative of normal usage*, then the benchmark and the plot being shown are not useful at all from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high on ARC-AGI would be highly misleading.

ARC is a very hyped benchmark in the industry, so publicizing the results is something any company would do, whether or not it directly reflects normal usage.

>Even if it is pure marketing, they expect people to draw conclusions from the perf/cost plot from ARC.

Again, people care about ARC; they don't care about doing the things ARC questions ask. That it is uneconomical to pay the price to use o3 on ARC does not mean it would be uneconomical for the tasks people actually want to use LLMs for. What does 3x the performance in, say, coding mean? You really think companies/users wouldn't put up with the increased price for that? You think they have MTurkers to turn to like they do with ARC?

ARC is literally the quintessential 'easy for humans, hard for AI' benchmark. Even if you discard the 'difficulty and price won't scale the same way' argument, it makes no sense to use it for an economics comparison.


In summary: "stomps benchmarks" means nothing for anyone trying to make decisions based on that announcement (yet they show cost/perf info). It seems hypey.



