Hacker News | greenfish6's comments

i mean yea


yes, so the thought experiment here is, how far can agents (owned by different entities) get in being productive without a human stepping in


How much can agents accomplish by coding with each other with 0 humans in the loop?


something i really like from trying it out over the last 10 minutes is that the main agent will continue talking to you while other agents are working, so you don't have to queue a message


yea, but i feel like we are over the hill on benchmaxxing; many times a model has beaten Anthropic on a specific bench, but the 'feel' is that it is still not as good at coding


When Anthropic beats benchmarks it's somehow earned; when OpenAI games them, it's somehow about not feeling good at coding.


I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.


Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.


'feel' is no more accurate

not saying there's a better way but both suck


Speak for yourself. I've been insanely productive with Codex 5.2.

With the right scaffolding these models are able to perform serious work at high quality levels.


He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck


..huh?


The variety of tasks they can do and will be asked to do is too wide and dissimilar, so it will be very hard to have a transversal measurement. At most we will have area-specific consensus that model X or Y is better. It's like saying one person is the best coder at everything; that person does not exist.


Yea, we're going to need benchmarks that incorporate a series of development steps for a particular language and measure how good each model is at each step.

Like: can the model take your plan and ask the right questions where there appear to be holes?

How wide an understanding of architecture and system design around your language does it have?

How does it choose between algorithms available in the language or in common libraries?

How often does it hallucinate features/libraries that aren't there?

How does it perform as context gets larger?

And that's for one particular language.
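The hallucinated-libraries criterion above is one of the few that's easy to check mechanically. A minimal sketch (the function name and module list are made up for illustration): check whether each top-level module a model imports actually resolves in the current environment.

```python
import importlib.util

def imports_exist(module_names):
    """Report whether each top-level module actually resolves locally.

    A crude proxy for the 'hallucinated libraries' check: if a model
    emits `import foo` and foo can't be found, flag it.
    """
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}

# One real stdlib module, one made-up name
report = imports_exist(["json", "definitely_not_a_real_lib"])
```

This only catches missing modules, not hallucinated functions inside real ones, but it's the kind of deterministic sub-check a per-language benchmark could build on.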


The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it is a lot more informative than a simple benchmark, because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.


At the end of the day, "feel" is what people rely on to pick which tool they use.

Is "feel" unscientific and flawed? Sure, maybe, why not.

But at the end of the day I'm going to trust what I see with my own two eyes over a number in a table.

Benchmarks are sometimes useful too. But we are in prime Goodhart's Law territory.


yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities


I don’t think this is even remotely true in practice.

Honestly, I have no idea what benchmarks are benchmarking. I don't write JavaScript or do anything remotely webdev-related.

The idea that all models have very close performance across all domains is a moderately insane take.

At any given moment the best model for my actual projects and my actual work varies.

Quite honestly, Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 was released, no one was particularly excited. It was better, with some slightly larger numbers, but whatever. It took about a month before everyone realized "holy shit, this is a step function improvement in usefulness". Being +15% better on SWE-bench didn't mean a damn thing.


Your feeling is not my feeling; Codex is unambiguously the smarter model for me


i would have to imagine the gastown design isn't optimal though? why 8, and why do there need to be multiple hops of agent communication before two arbitrary agents can talk to each other, as opposed to a single shared filespace?


I've been using Gas Town a decent bit since it was released. I'd agree with you that its design is sub-optimal, but I believe that's more due to the way the actual agents/harnesses have been designed than a matter of optimal software design. The problem you often run into is that agents will sometimes hang, thinking they need human input for the problem they're on, or thinking they're at a natural stopping point. If you're trying to do fully orchestrated agentic coding where you don't look at the code at all (putting aside whether that's good or not for a second), then this is sub-optimal behavior, and so these extra roles have been designed to 'keep the machine going', as it were.

Oftentimes, if I'm only working on a single project or focus, I'm not using most of those roles at all, and it's as you describe: one agent divvying out tasks to other agents and compiling reports about them. But because my velocity with this type of coding is now based on how fast I can tell that agent what I want, I'm often working on 3 or 4 projects simultaneously, and Gas Town provides the perfect orchestration framework for doing this.


the problem with gastown is that it tries to use agents for supervision when it should be possible to use much simpler, deterministic approaches to supervision, which would also be a lot more token efficient


I strongly believe we will need both agentic and deterministic approaches. Agentic to catch edge cases & the like, deterministic as those problems (along with the simpler ones early on) are continually turned into hard coded solutions to the maximum extent possible.

Ideally you could eventually remove the agentic supervisor. But for some cases you would want to keep it around, or at least a smaller model which suffices.
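To make the deterministic half of that concrete, here's a hypothetical sketch of the simplest possible non-agentic supervisor: instead of a supervisor agent spending tokens to notice that a worker has stalled, poll each worker's log file mtime and flag the quiet ones. All names here (`stalled_agents`, `STALL_SECONDS`) are illustrative, not part of Gas Town or any real harness.

```python
import os
import time

STALL_SECONDS = 120  # assumed threshold; tune per workload

def stalled_agents(agent_logs, now=None):
    """Return the names of agents whose log file hasn't been
    modified in the last STALL_SECONDS seconds.

    agent_logs: mapping of agent name -> path to its log file.
    A caller could nudge or restart whatever this returns.
    """
    now = now if now is not None else time.time()
    return [name for name, path in agent_logs.items()
            if now - os.path.getmtime(path) > STALL_SECONDS]
```

A cron-style loop around this catches the common "agent is waiting for human input" hang for free, leaving the agentic supervisor for the genuinely ambiguous cases.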


yegge's article does come off as complicated design for the sake of complication


Excited to try this out. I've seen a lot of working systems on my own computer that share files to talk between different Claude Code agents and I think this could work similarly to that.

(i thought gas town was satire? people in comments here seem to be saying that gas town also had multi-agent file sharing for work tracking)
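The file-sharing pattern those systems use can be as simple as agents appending JSON lines to a shared inbox file that the others poll. A minimal sketch (paths and field names are made up; this is not any particular harness's protocol):

```python
import json
import os

def send(inbox_path, sender, text):
    """Append one JSON-lines message to the shared inbox file."""
    with open(inbox_path, "a") as f:
        f.write(json.dumps({"from": sender, "text": text}) + "\n")

def read_all(inbox_path):
    """Read every message another agent has dropped in the inbox."""
    if not os.path.exists(inbox_path):
        return []
    with open(inbox_path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only files sidestep most coordination problems for two or three agents; past that you'd want per-agent inboxes or file locking.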


When i went it was blank? Lol i think someone made claude delete the site?


It's Claude. It does what it wants. Sometimes it renders sometimes it doesn't. Unit tests all pass though.


I use Willow AI, which I think is pretty good


The first example has a miscalculation; if you invest 1k and the EV is 900, then your choice has negative ROI, not positive.


He's calculating EV above cost. If you look at the calculation, the first term is -1000, to account for the initial investment. So the final value tells you that you got back the initial money plus 900 more.


However, the article is technically inconsistent in its framing.


The calculation that arrives at 900 has already subtracted the 1000 from the start.


it's correct. the EV is 900 after accounting for the 90% probability of -$1000. that's what the first term in the sum is for.
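To spell out the arithmetic this subthread is debating: with a $1,000 stake, a 90% chance of losing it, and a gross payout on a win (the $19,000 below is an assumed figure, chosen only so the numbers reproduce the thread's EV of 900), both framings people describe give the same answer.

```python
stake = 1_000
p_lose, p_win = 0.90, 0.10
gross_payout = 19_000  # assumed; chosen so EV above cost comes out to 900

# Framing 1 (as in "the first term is -1000"): subtract the stake up
# front, then add the expected gross return.
ev_upfront = -stake + p_win * gross_payout

# Framing 2 (as in "the 90% probability of -$1000"): weight the lost
# stake and the net win by their probabilities.
ev_branches = p_lose * (-stake) + p_win * (gross_payout - stake)

# Both framings arrive at the same EV above cost (~900), so the two
# descriptions in this thread are algebraically identical.
```

So the 900 is net of the initial 1,000 either way; the disagreement is only about where the subtraction happens.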

