bhouston's comments

Altman's claim and NVIDIA's consumer launch supply problems may be related - OpenAI may be eating up the GPU supply...

OpenAI is not purchasing consumer 5090s... :)

Although you are correct, Nvidia is limited on total output. They can't produce 50XXs fast enough, and it's naive to think that isn't at least partially due to the wild amount of AI GPUs they are producing.

No, but the supply constraints are part of what is driving the insane prices. Every chip they allocate to consumer-grade instead of commercial-grade products is potential income lost.

They do have coding benchmarks; I summarized them here: https://news.ycombinator.com/item?id=43197955

A bit better at coding than ChatGPT 4o but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:

- ChatGPT 4.5 on SWE-bench Verified: 38.0%

- ChatGPT 4o on SWE-bench Verified: 30.7%

- OpenAI o3-mini on SWE-bench Verified: 61.0%

BTW Anthropic's Claude 3.7 is better than o3-mini at coding, at around 62-70% on the same benchmark [1]. This means I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code: https://github.com/drivecore/mycoder

[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...


Does the benchmark reflect your opinion on 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard using the standalone model works fine, didn't get a chance to try it yet though.

Personal anecdote: Claude Code is the best LLM devx I've had.

I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.

If Claude 3.7 achieves 70.3%, that's quite impressive: it's not far from the 71.7% claimed by o3, at (presumably) much, much lower cost.


I doubt o3's costs will be lower for that performance. They juice their benchmark results by letting it spend $100k in thinking tokens.

>BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code

That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.


> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...

I use it via the Cursor editor's built-in support for Claude 3.7. That caps the monthly expense at $20. There is probably a limit in Claude for these queries, but I haven't run into it yet. And I am a heavy user.


Agentic coders (e.g. aider, Claude-code, mycoder, codebuff, etc.) use a lot more tokens, but they write whole features for you and debug your code.
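To make the token usage concrete: these tools are essentially a loop around the model's tool-use API. The model asks to run tools (read files, apply edits, run tests), the harness executes them and feeds the output back, and the loop repeats until the model stops requesting tools, so a single feature can burn through many model calls. Below is a minimal sketch of that loop using the Anthropic TypeScript SDK; the run_command tool, the model id, and the overall structure are illustrative assumptions, not how any of the tools above is actually implemented.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { execSync } from "node:child_process";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// A single illustrative tool: run a shell command and return its output.
const tools: Anthropic.Tool[] = [
  {
    name: "run_command",
    description: "Run a shell command in the project and return its output.",
    input_schema: {
      type: "object",
      properties: { command: { type: "string" } },
      required: ["command"],
    },
  },
];

function runCommand(command: string): string {
  try {
    return execSync(command, { encoding: "utf8", timeout: 60_000 });
  } catch (err) {
    return String(err); // feed failures back so the model can debug them
  }
}

async function agentLoop(task: string) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: task }];

  while (true) {
    const response = await client.messages.create({
      model: "claude-3-7-sonnet-latest", // assumed model id
      max_tokens: 4096,
      tools,
      messages,
    });

    // When the model stops requesting tools, the task is done (or it gave up).
    if (response.stop_reason !== "tool_use") return response;

    messages.push({ role: "assistant", content: response.content });

    // Execute every requested tool call and send the results back.
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        const { command } = block.input as { command: string };
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: runCommand(command),
        });
      }
    }
    messages.push({ role: "user", content: results });
  }
}
```

A real agentic coder layers more on top (file-editing tools, context management, cost tracking), but a loop like this is where the extra tokens go: every tool result gets appended to the conversation and re-sent as input on the next call.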

If OpenAI offers a more expensive model (4.5) and a cheaper model (o3-mini) and both are worse, it starts to be a fair comparison.

It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.

To put that in context, Claude 3.5 Sonnet (new), a model we have had for months now and which from all accounts seems to have been cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% in SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.

[0] https://arxiv.org/pdf/2502.12115


This guy says he is Aleph Null: http://richard-parkins.free.nf/

I thought Aleph Null was countably infinite?

There is always room for more tools. How many databases exist? Front-end frameworks? Languages? Backend frameworks? Analytics packages?

Thinking that in this space there is only one solution and all others are outright failures or not worth doing is odd; that isn't normally how it works. There are usually multiple niches and success/revenue strategies.

I strongly think this is the future of software development. And thus there will be many winners here.


It actually works with bun, pnpm, yarn, etc - any standard Node package manager.

I use pnpm personally and that is evident in the repo setup itself, but npm is sort of the standard, so I put that in the docs rather than mentioning a long list of alternatives.


GP is likely referring to this concern, "The Great npm Garbage Patch":

https://news.ycombinator.com/item?id=41178258


On average I've been spending $25 a day on Claude credits since this was up and fully running. That is cheaper than hiring another developer in just about any country, and it greatly boosts my productivity.

If you use threads / chains of messages in any form, I strongly encourage you to check out prompt caching. The cost savings are crazy ($0.05 per 1M cache-read tokens instead of $3 per 1M input tokens).
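For anyone wanting to wire this up: with the Anthropic API, caching means marking the large, stable prefix of the prompt (system prompt, project context) with cache_control so later calls in the same chain read it from the cache instead of paying the full input price. Here is a minimal sketch with the TypeScript SDK; the model id and the placeholder context string are assumptions for illustration.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Hypothetical placeholder for the big, unchanging prefix (system prompt,
// repo map, coding guidelines) that every request in the chain shares.
const projectContext = "...long project context...";

const response = await client.messages.create({
  model: "claude-3-7-sonnet-latest", // assumed model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: projectContext,
      // Mark the prefix as cacheable; subsequent calls that reuse the same
      // prefix are billed at the cache-read rate instead of the input rate.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Add a unit test for the parser." }],
});

// usage reports cache_creation_input_tokens and cache_read_input_tokens,
// which is how you can verify the savings per call.
console.log(response.usage);
```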

Okay, I have token caching and token costing implemented in a PR. It will go live tomorrow. Thanks for the suggestion!

Isn't it much cheaper to just use Copilot with GPT-4o?

I find Claude is significantly better at coding than OpenAI's tech, especially in agentic tool-using workflows.

I will have a look! Thx!

Aider is Python. Claude Code is closed source; this is open source and TypeScript.

I will investigate Aider. I wrote this tool from idea to now in just four weeks without referencing existing tools, so now I need to do that.

Yeah, you're going to find out you should have just used Aider, I'm afraid...

Except that the future of LLM-assisted programming means I can also make my own implementation of Aider pretty easily. So there's going to be an explosion of software that does basically the same thing but is private or just not widely shared: not because I don't want to share, but because starting and supporting an open source project is a PITA, and I just want to build this one little cool thing and be done with it.

I will be launching a version of this on GitHub as an app to help open source developers. So open source is also going to get a boost.

Aider is Python. That is annoying for me, as I like to modify things. This is TypeScript.

Wait, are you saying that, because you know TypeScript but not Python, you can't make modifications to software intended to develop for you using AI?

Auto-coders, which is what I call this tech, are great but they screw up complex tasks, so you need to be able to step in when they are screwing one up. I view it as a team of junior devs.

This will probably change at some point, but for now they require supervision and corrections.

If you do not actually know what you are doing, these things can create a mess. But that is just the next challenge to overcome and I suspect we'll get there relatively soon.

