
GPT-4.5 is a bit better at coding than GPT-4o but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:

- GPT-4.5 on SWE-bench Verified: 38.0%

- GPT-4o on SWE-bench Verified: 30.7%

- OpenAI o3-mini on SWE-bench Verified: 61.0%

BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, scoring around 62-70% [1]. That means I'll stick with Claude 3.7 for the time being for my open-source alternative to Claude-code: https://github.com/drivecore/mycoder

[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...

Does the benchmark reflect your opinion on 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard the standalone model works fine, but I haven't had a chance to try it yet.

Personal anecdote: Claude Code is the best LLM DevX I've had.

I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.

If Claude 3.7 achieves 70.3%, that's quite impressive - not far from the 71.7% claimed by o3, at (presumably) much, much lower cost.


It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.

To put that in context: Claude 3.5 Sonnet (new), a model we've had for months now and which by all accounts was cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% on SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.

[0] https://arxiv.org/pdf/2502.12115


> BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, scoring around 62-70% [1]. That means I'll stick with Claude 3.7 for the time being for my open-source alternative to Claude-code

That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
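
For a rough sense of scale, a back-of-the-envelope sketch (the per-million-token prices below are the published API rates at the time of writing; the session token counts are made-up assumptions):

    # Hypothetical heavy coding day: ~2M input tokens (context gets
    # resent on every request) and ~200k output tokens.
    PRICES = {  # (input $/M tokens, output $/M tokens), published rates
        "claude-3.7-sonnet": (3.00, 15.00),
        "o3-mini": (1.10, 4.40),
    }

    def session_cost(model, input_tokens, output_tokens):
        in_rate, out_rate = PRICES[model]
        return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

    for model in PRICES:
        print(f"{model}: ${session_cost(model, 2_000_000, 200_000):.2f}")
    # claude-3.7-sonnet: $9.00 vs o3-mini: $3.08 -- roughly a 3x gap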


> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...

I use Claude 3.7 via the Cursor editor's built-in support, which caps the monthly expense at $20. There is probably a limit in Claude for these queries, but I haven't run into it yet, and I am a heavy user.


Agentic coders (e.g. aider, Claude-code, mycoder, codebuff) use a lot more tokens, but they write whole features for you and debug your code.
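
To see why: each agent step typically resends the whole growing conversation, so input tokens grow roughly quadratically with the number of steps. A minimal sketch with made-up per-step sizes:

    # Assumed sizes: 10k-token repo context, ~2k tokens appended per
    # tool call/result. Each step resends the full history so far.
    base_context = 10_000
    tokens_per_step = 2_000

    total_input = 0
    history = base_context
    for _ in range(30):          # a 30-step agentic session
        total_input += history   # entire history resent this step
        history += tokens_per_step

    print(f"{total_input:,} input tokens")  # 1,170,000 over 30 steps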

If OpenAI offers a more expensive model (4.5) and a cheaper model (o3-mini) and both are worse, it starts to be a fair comparison.


