GPT-4.5 is a bit better at coding than GPT-4o, but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:
- GPT-4.5 on SWE-bench Verified: 38.0%
- GPT-4o on SWE-bench Verified: 30.7%
- OpenAI o3-mini on SWE-bench Verified: 61.0%
BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, scoring around 62-70% [1]. So I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code: https://github.com/drivecore/mycoder
Does the benchmark reflect your opinion of 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard using the standalone model works fine, but I haven't had a chance to try it yet.
I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.
If Claude 3.7 achieves 70.3%, that's quite impressive: it's not far from the 71.7% claimed by o3, at (presumably) much, much lower cost.
To put that in context, Claude 3.5 Sonnet (new), a model we have had for months now and which from all accounts seems to have been cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% in SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.
>BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code
That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
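To make the cost gap concrete, here is a minimal sketch of the arithmetic. The per-million-token prices below are illustrative assumptions (roughly the list prices at the time of this thread), not authoritative figures; check each provider's pricing page before relying on them:

```python
# Rough API cost comparison for a coding session.
# NOTE: prices are illustrative assumptions (USD per million tokens),
# not authoritative - verify against each provider's pricing page.
PRICES = {
    "claude-3.7-sonnet": {"input": 3.00, "output": 15.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a session with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical heavy coding session: 2M input tokens, 500k output tokens.
claude = session_cost("claude-3.7-sonnet", 2_000_000, 500_000)
o3 = session_cost("o3-mini", 2_000_000, 500_000)
print(f"Claude 3.7: ${claude:.2f}, o3-mini: ${o3:.2f}")
# -> Claude 3.7: $13.50, o3-mini: $4.40
```

Under these assumed prices, Claude 3.7 Sonnet comes out roughly 3x more expensive per session, which is easy to feel on a personal project even if it is negligible on an employer's bill.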
> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...
I use it via the Cursor editor's built-in support for Claude 3.7, which caps the monthly expense at $20. There's probably a usage limit for these queries, but I haven't run into it yet, and I'm a heavy user.
[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...