I fed all the comments from the "OpenAI O3 breakthrough" HN post to o1 pro mode
1 point by popthetopnow 4 months ago | 1 comment
Below is a shortened summary (under 4000 characters) capturing the key points from the Hacker News thread:

What is ARC-AGI?

- A puzzle-based benchmark (“easy for humans, hard for AI”) using visual/grid tasks akin to Raven’s Matrices (the task format is sketched below).
- Proposed by François Chollet et al. to measure genuine generalization, not just memorization.
- Consists of public, semi-private, and private (hidden) test sets.
- Historically, large language models (LLMs) struggled on it, which was cited as evidence that they lacked “true” reasoning.
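The public ARC tasks are distributed as JSON objects containing a few demonstration input/output grid pairs plus one or more test inputs, with cell values 0–9 standing for colors. A minimal toy task in that shape, with an invented “swap the columns” rule (the grids and the rule are made up for illustration, not taken from the real dataset):

    # Toy task in the public ARC JSON shape: "train" demonstrations and a "test"
    # input whose output the solver must predict. Cell values 0-9 are colors.
    toy_task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        ],
        "test": [
            {"input": [[3, 0], [0, 3]]},   # expected answer: [[0, 3], [3, 0]]
        ],
    }

    def solve(grid):
        # Hypothetical solver for this toy task: the rule is just "swap the columns".
        return [list(reversed(row)) for row in grid]

    # A solver is judged on whether its rule reproduces every demonstration pair
    # and then yields the correct grid for the held-out test input.
    assert all(solve(p["input"]) == p["output"] for p in toy_task["train"])
    print(solve(toy_task["test"][0]["input"]))   # [[0, 3], [3, 0]]

The point of the benchmark is that every task uses a different rule, so a solver must infer it from two or three examples rather than recall it from training data.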

What did “o3” accomplish?

- OpenAI’s “o3” scored 87.5–91.5% on the ARC-AGI public/semi-private sets, surpassing the ~85% threshold for the ARC Prize.
- Human benchmarks range from ~64–76% for average Mechanical Turk workers to >95% for STEM graduates.
- Two modes: “o3-low” is cheaper (~$17–$20 per puzzle) and reaches ~75–82% accuracy; “o3-high” uses 172× more compute (~$1k–$3k+ per puzzle) and scores 87–91.5% (see the cost check below).
- OpenAI may have spent hundreds of thousands of dollars running the 400 test puzzles in high-compute mode.
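A quick back-of-the-envelope check, using only the figures quoted in the thread and assuming cost scales linearly with compute (these are estimates of retail API pricing, not OpenAI’s actual internal spend):

    # Figures quoted in the thread (assumed, not official):
    low_cost_per_puzzle = (17, 20)    # "$17-$20 per puzzle" for o3-low
    compute_multiplier = 172          # o3-high reportedly used 172x the compute
    num_puzzles = 400                 # evaluation set size cited in the thread

    # Implied per-puzzle cost for o3-high, if cost scales linearly with compute.
    high_cost_per_puzzle = tuple(c * compute_multiplier for c in low_cost_per_puzzle)
    print(high_cost_per_puzzle)                                   # (2924, 3440)

    # Implied totals over 400 puzzles.
    print(tuple(c * num_puzzles for c in low_cost_per_puzzle))    # (6800, 8000)
    print(tuple(c * num_puzzles for c in high_cost_per_puzzle))   # (1169600, 1376000)

So the quoted per-puzzle numbers are internally consistent (172 × $17–$20 ≈ $2.9k–$3.4k), and at retail rates a full 400-puzzle high-compute run would imply over a million dollars; the “hundreds of thousands” phrasing presumably reflects uncertainty about how many puzzles actually ran in high-compute mode and what OpenAI’s internal costs are.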

Reactions

- Impressive but expensive: thousands of dollars per puzzle is cost-prohibitive (human solvers might cost ~$5/puzzle).
- Not “AGI”: even ARC’s creators say passing ARC doesn’t prove AGI; Chollet notes “o3” still fails some trivial puzzles, and a new “ARC-AGI-2” could slash its score to ~30%.
- Likely uses massive search at inference: possibly a “tree-of-thought” approach (many sub-reasoning steps, a verifier, then a final answer), similar to AlphaZero’s search; a generic sketch follows below.
- Other benchmarks: “o3” jumped from ~48% to over 70% on SWE-bench (code fixing) and from single digits to ~25% on FrontierMath (Olympiad-level problems).
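OpenAI has not published how o3 actually searches, so the following is only a generic sketch of the “propose many reasoning branches, score them with a verifier, keep the best” loop the commenters are speculating about. Both propose_step and score are hypothetical stand-ins for model calls, not real API functions:

    import heapq
    from typing import Callable, List, Tuple

    def tree_of_thought_search(
        root: str,
        propose_step: Callable[[str], List[str]],  # hypothetical: extend a partial chain of thought
        score: Callable[[str], float],             # hypothetical verifier: higher = more promising
        beam_width: int = 4,
        max_depth: int = 5,
    ) -> str:
        # Beam search over chains of thought: at each depth, expand every kept
        # chain, then prune back to the beam_width highest-scoring candidates.
        beam: List[Tuple[float, str]] = [(score(root), root)]
        for _ in range(max_depth):
            candidates = [
                (score(chain + "\n" + step), chain + "\n" + step)
                for _, chain in beam
                for step in propose_step(chain)
            ]
            if not candidates:
                break
            beam = heapq.nlargest(beam_width, candidates)
        return max(beam)[1]  # best-scoring chain found

The cost story falls out directly: each level of the search costs roughly beam_width × branching-factor model calls, so accuracy is being bought with inference-time compute rather than a bigger model, which fits the 172× figure above.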

Employment & Future

- Fears of rapid AI displacement of programmers and knowledge workers.
- Others see a parallel to the personal-computer revolution, with humans still in the loop.
- Real-world tasks need maintainable solutions and large contexts, not just puzzle-solving.
- Current compute costs (~$1k–$3k per puzzle) are impractical, but costs have declined steadily, suggesting future viability.

Goalpost-Shifting

- Benchmarks lose perceived importance once models exceed them (as happened with chess and Go).
- The ARC authors never claimed it was a definitive AGI test, just a measure of “generalization.”

Inference-time vs. Training-time

- Models used to rely on big training sets plus single forward passes; now we see iterative chain-of-thought plus large-scale search at inference (a minimal sketch follows below).
- Costs could decrease with more efficient “internal” search methods or session-based learning that amortizes the expense.
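At its crudest, “search at inference” just means sampling many independent chains of thought and majority-voting the final answers (often called self-consistency). A minimal sketch, assuming a hypothetical sample_answer() that makes one model call and returns its final answer:

    from collections import Counter
    from typing import Callable, Hashable

    def best_of_n(sample_answer: Callable[[str], Hashable], prompt: str, n: int = 64):
        # Sample n independent answers and return the most common one plus its
        # vote share. Inference cost (and dollar cost) scales linearly with n.
        votes = Counter(sample_answer(prompt) for _ in range(n))
        answer, count = votes.most_common(1)[0]
        return answer, count / n

The thread’s point is that the knob being turned is n (and search depth), i.e. inference-time compute, and that smarter “internal” search could buy the same accuracy with far fewer samples, which is where the hoped-for cost reductions would come from.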

Main Takeaways

- OpenAI’s “o3” demonstrates near-human (or better) performance on ARC-AGI by combining an LLM with a heavy search strategy.
- The approach is extremely costly, but it shows that, given enough compute and methodical search, LLMs can outperform average humans on these puzzles.
- Passing ARC doesn’t confirm AGI; “o3” still struggles on some simple puzzles, and a new version of the benchmark aims to challenge it further.
- Long term, decreasing costs and continuing progress could enable practical AI-driven automation in coding, advanced math, research, and more.




Thanks for boiling a gallon of water for nothing



