something i really like from trying it out over the last 10 minutes is that the main agent will continue talking to you while other agents are working, so you don't have to queue a message
yea but i feel like we are over the hill on benchmaxxing, many times a model has beaten anthropic on a specific bench, but the 'feel' is that it is still not as good at coding
I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.
Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.
The variety of tasks they can do and will be asked to do is too wide and dissimilar for a single cross-cutting measurement. At most we will have area-specific consensus that model X or Y is better. It is like saying one person is the best coder at everything; that does not exist.
The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it feels a lot more informative than a simple benchmark because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.
I don’t think this is even remotely true in practice.
I honestly have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.
The idea that all models have very close performance across all domains is a moderately insane take.
At any given moment the best model for my actual projects and my actual work varies.
Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better with some slightly larger numbers but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.
i would have to imagine the gastown design isn't optimal though? why 8, and why do there need to be multiple hops of agent communication before two arbitrary agents can talk to each other, as opposed to a single shared filespace?
I've been using Gas Town a decent bit since it was released. I'd agree with you that its design is sub-optimal, but I believe that's more due to the way the actual agents/harnesses have been designed as opposed to optimal software design. The problem you often run into is that agents will sometimes hang thinking they need human input for a problem they are on, or they think they're at a natural stopping point. If you're trying to do fully orchestrated agentic coding where you don't look at the code at all (putting aside whether that's good or not for a second) then this is sub-optimal behavior, and so these extra roles have been designed to 'keep the machine going' as it were.
Oftentimes if I'm only working on a single project or focus, then I'm not using most of those roles at all and it's as you describe, one agent divvying out tasks to other agents and compiling reports about them. But because my velocity with this type of coding is now based on how fast I can tell that agent what I want, I'm often working on 3 or 4 projects simultaneously, and Gas Town provides the perfect orchestration framework for doing this.
the problem with gastown is it tries to use agents for supervision when it should be possible to use much simpler, deterministic approaches to supervision, which would also be a lot more token efficient
I strongly believe we will need both agentic and deterministic approaches. Agentic to catch edge cases & the like, deterministic as those problems (along with the simpler ones early on) are continually turned into hard coded solutions to the maximum extent possible.
Ideally you could eventually remove the agentic supervisor. But for some cases you would want to keep it around, or at least a smaller model which suffices.
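One way to picture the split described above is a cheap rule-based check that runs first and only hands off to an agent when no rule applies. This is a minimal sketch, not how Gas Town or any real harness works: `AgentStatus`, `supervise`, the action names, and the 300-second timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

STALL_TIMEOUT_S = 300  # assumed threshold; not taken from any real tool


@dataclass
class AgentStatus:
    """Hypothetical snapshot of one worker agent."""
    name: str
    seconds_since_output: float
    waiting_for_input: bool
    exit_code: Optional[int]  # None while the agent is still running


def supervise(status: AgentStatus) -> str:
    """Pick an action with fixed rules; escalate only when none match."""
    if status.exit_code == 0:
        return "collect_result"
    if status.waiting_for_input:
        # The "agent thinks it needs a human" hang: nudge it, no LLM call.
        return "send_continue"
    if status.exit_code is None:
        if status.seconds_since_output > STALL_TIMEOUT_S:
            return "restart"
        return "wait"
    # Non-zero exit with no matching rule: this is the edge case
    # where an agentic (or smaller-model) supervisor would step in.
    return "escalate_to_agent"
```

Each escalation that keeps recurring is a candidate for a new hard-coded rule, which is how the agentic share would shrink over time.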
Excited to try this out. I've seen a lot of working systems on my own computer that share files to talk between different Claude Code agents and I think this could work similarly to that.
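For anyone who hasn't seen the file-sharing approach, here is a rough sketch of the idea: each agent gets an inbox directory in a shared folder, and messages are small JSON files. The layout and the `send` / `receive` helpers are assumptions for illustration, not what Claude Code or Gas Town actually does.

```python
import json
import time
from pathlib import Path


def send(mailbox: Path, sender: str, recipient: str, body: str) -> Path:
    """Drop one message file into the recipient's inbox directory."""
    inbox = mailbox / recipient
    inbox.mkdir(parents=True, exist_ok=True)
    # Timestamp prefix keeps files roughly in arrival order when sorted.
    path = inbox / f"{time.time_ns()}-{sender}.json"
    path.write_text(json.dumps({"from": sender, "body": body}))
    return path


def receive(mailbox: Path, recipient: str) -> list:
    """Read and delete every pending message for the recipient."""
    inbox = mailbox / recipient
    if not inbox.is_dir():
        return []
    messages = []
    for path in sorted(inbox.glob("*.json")):
        messages.append(json.loads(path.read_text()))
        path.unlink()  # consume the message so it is not read twice
    return messages
```

Any agent can reach any other in one hop this way, at the cost of having to poll the directory.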
(i thought gas town was satire? people in comments here seem to be saying that gas town also had multi-agent file sharing for work tracking)
He's calculating EV above cost. If you look at the calculation, the first term is -1000 to account for the initial investment. So the final value is telling you that you got back the initial money plus 900 more.
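To make the "EV above cost" reading concrete, here is a small sketch. Only the -1000 first term and the 900 result come from the calculation being discussed; the 50% chance and the 3800 payoff are made-up numbers chosen so the arithmetic lands on 900.

```python
def net_expected_value(cost, outcomes):
    """EV above cost: subtract the stake, then add each payoff weighted by its probability."""
    return -cost + sum(p * payoff for p, payoff in outcomes)


# Hypothetical bet: 50% chance of a 3800 payout, 50% chance of nothing.
outcomes = [(0.5, 3800.0), (0.5, 0.0)]
net = net_expected_value(1000.0, outcomes)
print(net)  # 900.0
```

A result of 0 would mean you expect to break even, so 900 means the initial 1000 back plus 900 on top.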