GPT-4.5 is a bit better at coding than GPT-4o, but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:
- GPT-4.5 on SWE-bench Verified: 38.0%
- GPT-4o on SWE-bench Verified: 30.7%
- OpenAI o3-mini on SWE-bench Verified: 61.0%
BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, scoring around 62-70% [1]. So I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code: https://github.com/drivecore/mycoder
Does the benchmark reflect your opinion of 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard using the standalone model works fine, but I haven't had a chance to try it yet.
I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.
If Claude 3.7 achieves 70.3%, that's quite impressive: it's not far from the 71.7% claimed by o3, at (presumably) much, much lower cost.
To put that in context, Claude 3.5 Sonnet (new), a model we have had for months now and which from all accounts seems to have been cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% in SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.
>BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code
That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
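To make the cost gap concrete, here is a minimal sketch of the arithmetic. The per-million-token prices below are illustrative assumptions (roughly the list prices at the time of this thread), not authoritative figures; check each provider's pricing page before relying on them:

```python
# Rough API cost comparison for a coding session.
# NOTE: prices are illustrative assumptions (USD per million tokens),
# not authoritative - verify against each provider's pricing page.
PRICES = {
    "claude-3.7-sonnet": {"input": 3.00, "output": 15.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a session with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical heavy coding session: 2M input tokens, 500k output tokens.
claude = session_cost("claude-3.7-sonnet", 2_000_000, 500_000)
o3 = session_cost("o3-mini", 2_000_000, 500_000)
print(f"Claude 3.7: ${claude:.2f}, o3-mini: ${o3:.2f}")
# -> Claude 3.7: $13.50, o3-mini: $4.40
```

Under these assumed prices, Claude 3.7 Sonnet comes out roughly 3x more expensive per session, which is easy to feel on a personal project even if it is negligible on an employer's bill.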
> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...
I use it via the Cursor editor's built-in support for Claude 3.7, which caps the monthly expense at $20. There's probably a usage limit for these queries, but I haven't run into it yet, and I'm a heavy user.
[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...