Hacker News | beering's comments

o1 is several generations old and was released in 2024. Is this some quite old research that took a long time to get published?

Yes, the preprint of the same paper (https://arxiv.org/abs/2412.10849) was first written in December 2024.

It's also important to note that it beat doctors in diagnosing in a way doctors do not diagnose.

It's hard to draw any conclusion from this study precisely because of this. Since 2024, AI has gone from handling a few minutes of coding work to handling a few weeks of it autonomously. That's like going from an intern to staff engineer level.

Medical research moves. Very. Slowly.

That's a good thing

The medical equivalent to "move fast and break things" would be "move fast and kill people"


People die either way. People die because treatments sit in the lab for years instead of in clinics.

Consider the FLASH oncology treatment. In 1995 Dr. Favaudon figured out how to make radiation therapy much, much safer. He spent 14 years before sharing it in 2009. In 2014 it was actually published. [1]

Know anybody who died of cancer in those DECADES?

[1] https://institut-curie.org/news/dr-vincent-favaudon-receives...


I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?

Sounds incredibly simple to me. One-shot.

So nothing like real-world coding, where you’d be able to run and test the script before submitting?

One shot just means the user doesn’t have to iterate on it via the agent. The agent does whatever it needs to deliver the best outcome, including running and iterating on its own until it’s happy with the result. This could be a short or long process, depending on the task.

You won’t be taken seriously if you push OpenSCAD - it’s simply not a tool that professionals will adopt, because it doesn’t do B-reps. I think the recent progress of FreeCAD and its spin-off libraries cadquery and build123d will be better to push.

Yeah, I'd love for OpenSCAD to support a kernel which supports NURBS and could output a (nice) STEP file.

There is an OpenSCAD Workbench in FreeCAD --- does it work so as to get a workable representation?


That’s based on Anthropic’s retail price right? Not a fair comparison, like saying that Netflix must be losing money because every movie rental is $4 and a Netflix subscriber can watch 20 movies in a month.

image->html is a pretty involved task though. That’s basically a frontend dev’s job. $40 wouldn’t cover an hour of their time.

I don’t think anyone has to sell inference below cost. If Anthropic is GPU-constrained, then it makes sense for them to charge API users much more and push subscribers towards extra billing, because that’s the only knob they can turn. OpenAI has much more capacity based on news reports.

Yes, codex has had this ability for a while.

> Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

> If no, how does a cybersec firm train its employees?

In general, no, humans can’t be sure they are only helping with defensive and not offensive work unless they have more context. IRL, a security engineer would know who they’re working for. If they’re advising Apple, then they’d feel pretty confident that Apple is not turning around and hacking people.


If the task is ill-defined, then it's a bit unfair to make it sound like the problem is that an LLM can't be configured to do something, if a human would have an equally hard time with the same task. The statement "it's impossible to configure the weights to..." should really be something more broad like "it's impossible to...".

I have no comment about whether it's impossible to determine the intentions of a person asking for assistance through a textual conversation with that person.


I agree the homepage is a weak sell, but an independent operator in Europe IS value. If it doesn’t really make a difference otherwise, why not choose a home server that is governed by and supports your home region of Europe? (obviously there are other things that would make a difference, but you gotta start somewhere)


Isn’t it basically the same thing? You type what you want into the input box and it does what you ask for.


Claude code can be configured with custom /slash commands and other details that don't necessarily transfer over to codex. /remote-control in cc is really great for walking away from my computer and continuing from my phone, for instance.


I guess I'm asking if their CLI tool is the same or if it functions differently. I've never used anything besides CC so I wouldn't know if it's basically the same thing.

