Imo there's a huge blind spot forming between 6 and 8: when talking to people and reading posts by various agent evangelists, few seem to be focusing on building "high quality" changes vs maximising throughput of low-quality work items.
My (boring b2b/b2e) org has scripts that wrap a small handful of agent calls to handle/automate our workflow. These have been incredibly valuable.
We still 'yolo' into PRs, use agents to improve code quality, and do initial checks via gating. We're trying to get docs working through the same approach. We see huge value in automating and lightweight orchestration of agents, but other parts of the whole system are the bottleneck, so there's no real point in running more than a couple of agents concurrently - Claude could already build a low-quality version of our entire backlog in a week.
Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?
I have a code quality analysis tool that I use to "un-slopify" AI code. It doesn't handle algorithms and code semantics, which are still the programmer's domain, but it does a pretty good job of forcing agents to DRY out code, separate concerns, group code more intelligently, and generally write decoupled quasi-functional code. It works quite well with the Ralph loop to deeply restructure codebases.
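For what it's worth, the loop itself is mechanically simple. A minimal sketch (the analyzer and agent calls are placeholders for whatever tooling you actually have, not real APIs):

```python
# Illustrative sketch only: run a structural analysis pass over the agent's
# output and feed the findings back until the analyzer is satisfied.
# run_quality_analysis and ask_agent_to_fix are placeholders, not real APIs.
from dataclasses import dataclass

@dataclass
class Finding:
    path: str
    kind: str     # e.g. "duplication", "mixed-concerns", "tight-coupling"
    detail: str

def run_quality_analysis(repo_path: str) -> list[Finding]:
    raise NotImplementedError  # wire this to your analyzer

def ask_agent_to_fix(repo_path: str, findings: list[Finding]) -> None:
    raise NotImplementedError  # prompt the coding agent with the findings

def unslopify(repo_path: str, max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        findings = run_quality_analysis(repo_path)
        if not findings:
            return True       # analyzer is happy
        ask_agent_to_fix(repo_path, findings)
    return False              # still flagged after max_rounds; hand it to a human
```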
> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?
Yes, I am, although not really in public yet. I use the pi harness, which is really easy to extend. I’m basically driving a deterministic state machine for each code ticket, which starts with refining a short ticket into a full problem description by interviewing me one question at a time, then converts that into a detailed plan with individual steps.

Then it implements each step one by one using TDD, and each bit gets reviewed by an agent in a fresh context. So first tests are written, and they’re reviewed to ensure they completely cover the initial problem, and any problems are addressed. That goes round a loop till the review agent is happy, then it moves to implementation. Same thing: implementation is written, loop until the tests pass, then review and fix until the reviewer is happy. Each sub-task gets its own commit.

Then when all the tasks are done, there’s an overall review that I look at. Then if everyone is happy the commits get squashed and we move to manual testing. The agent comes up with a full list of manual tests to cover the change, sets up the test scenarios, and tells me where to debug in the code while working through each test case so I understand what’s been implemented.

So this is semi-automated - I’m heavily involved at the initial refine stage, then I check the plan. The various implementation and review loops are mostly hands off, then I check the final review and do the manual testing, obviously.
This is definitely much slower than something like Gas Town, but all the components are individually simple, the driver is a deterministic program, not an agent, and I end up carefully reviewing everything. The final code quality is very good. I generally have 2-4 changes like this ongoing at any one time in tmux sessions, and I just switch between them. At some point I might make a single dashboard with summaries of where the process is up to on each, and whether it needs my input, but right now I like the semi manual process.
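To make the shape of the driver concrete, here's a rough sketch (every helper is a placeholder for a call into the harness, not my actual code; the point is just that the driver is a plain program, not another agent):

```python
# Rough sketch of the deterministic driver idea. Every helper is a placeholder
# for a call into whatever agent harness you use.
def refine_ticket(ticket: str) -> str: raise NotImplementedError       # interview the human, one question at a time
def write_plan(spec: str) -> list[str]: raise NotImplementedError      # break the spec into ordered small steps
def run_agent(task: str, *ctx: str) -> str: raise NotImplementedError  # run an agent in a fresh context
def review_passes(kind: str, *ctx: str) -> bool: raise NotImplementedError
def run_tests() -> bool: raise NotImplementedError
def commit(message: str) -> None: raise NotImplementedError

def drive(ticket: str) -> None:
    spec = refine_ticket(ticket)
    plan = write_plan(spec)
    for step in plan:
        # tests first: loop until the review agent agrees they cover the problem
        while not review_passes("test_review", spec, step, run_agent("write_tests", spec, step)):
            pass
        # then implementation: loop until the tests pass and the reviewer is happy
        while True:
            run_agent("implement", spec, step)
            if run_tests() and review_passes("code_review", spec, step):
                break
        commit(step)  # one commit per sub-task
    print(run_agent("overall_review", spec, *plan))  # human reviews, squashes, then manual tests
```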
> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?
That’s what I’ve been focused on the last few weeks with my own agent orchestrator. The actual orchestration bit was the easy part; the key is to make it self-improving via “workflow reviewer” agents that can create new reviewers specializing in catching a specific set of antipatterns, like swallowing errors. Unfortunately I've found that what counts as acceptable code quality is very dependent on project, organization, and even module (tests vs internal utilities vs production services), so prompt instructions like "don't swallow errors or use unwrap" make one part of the code better while another gets worse, creating a conflict for the LLM.
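One purely illustrative way to defuse that conflict is to scope the rules by path instead of putting them all in one global prompt; everything below (globs, rule text) is made up:

```python
# Purely illustrative: scope reviewer rules by path glob so "no unwrap"
# applies to production services but not tests. Globs and rule strings
# here are invented examples, not a real configuration.
from fnmatch import fnmatch

REVIEWER_RULES = {
    "src/services/*": ["don't swallow errors", "no unwrap/expect", "use structured logging"],
    "src/internal/*": ["don't swallow errors"],                 # looser for internal utilities
    "tests/*":        ["unwrap/expect is fine", "prefer readable assertions over cleverness"],
}

def rules_for(path: str) -> list[str]:
    for pattern, rules in REVIEWER_RULES.items():
        if fnmatch(path, pattern):
            return rules
    return []  # no specialized reviewer applies to this path

def reviewer_prompt(changed_files: list[str]) -> str:
    lines = [f"- {f}: {rule}" for f in changed_files for rule in rules_for(f)]
    return "Review the diff against these module-specific rules:\n" + "\n".join(lines)
```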
The problem is that model eval was already the hardest part of using LLMs, and evaluating agents is even harder, if not practically impossible. The toy benchmarks the AI companies have been using are laughably inadequate.
So far the best I’ve got is “reimplement MINPACK from scratch using their test suite” which can take days and has to be manually evaluated.
I've been playing with Brad Ross's AISP [1] to get better-quality LLM outputs at strategic stages of our basic design / plan / implementation workflows.
A concrete example of this is our Adviser Skill experiment [2]. In most AI workflows, a "reviewer" agent just dumps markdown feedback. Our Adviser doesn't just "talk"; it outputs an AISP 5.1 document (a kind of "Assembly Language for AI Cognition").
This document forces the agent to define:
- Strict Type Definitions for the issues identified (e.g., distinguishing between a gap, an edge case, or a missing requirement).
- EARS Rules (Easy Approach to Requirements Syntax) that determine the verdict. For example, a rule might state: "If any issue has a severity of ⊘ (critical), then the workflow MUST halt."
- Formal Evidence: Every "approve" or "reject" verdict must include a confidence score (δ) and a grounding proof (π) that explains why the change matches the original specification.
By treating the agent's output as a proof-carrying protocol rather than just text, we can chain multiple specialized agents (Architect, Strategist, Auditor) who "triangulate" on the codebase. They must reach a formal consensus where the variance between their scores is low.
This shifts the agent's goal from "Finish the task at all costs" to "Prove that this change is safe and correct." It turns out that iterating on the verification logic is much more effective for building reliable systems than just increasing the number of agents running concurrently.
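To give a feel for the shape, here's an illustrative paraphrase in code (not the actual AISP 5.1 format): a proof-carrying verdict plus the low-variance consensus check.

```python
# Illustrative only -- not the actual AISP 5.1 format, just the shape of a
# proof-carrying verdict and the low-variance consensus check described above.
from dataclasses import dataclass
from statistics import pvariance

@dataclass
class Verdict:
    agent: str            # "architect" | "strategist" | "auditor"
    decision: str         # "approve" | "reject"
    confidence: float     # the delta score, in [0, 1]
    grounding: str        # the pi proof: why the change matches the original spec
    critical_issues: int  # count of issues tagged with the critical severity marker

def consensus(verdicts: list[Verdict], max_variance: float = 0.02) -> str:
    # EARS-style hard rule: any critical issue halts the workflow outright
    if any(v.critical_issues for v in verdicts):
        return "halt"
    # otherwise require unanimous approval *and* low variance between scores
    decisions = {v.decision for v in verdicts}
    scores = [v.confidence for v in verdicts]
    if decisions == {"approve"} and pvariance(scores) <= max_variance:
        return "approve"
    return "escalate"
```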
My org has built internal tooling that approximates this. It's incredibly valuable from a manual test perspective, though we haven't managed to get the agent part working well; app startup times (10+ min) make iterating hard.
Do you have customers who have faced/solved this problem? If so, how did they do it? It seems like a killer for this approach.
Our foundational design value was compute instance startup speed. We've made some design decisions and evaluated several "neocloud" providers with this goal in mind.
Currently, the time from launching an agent to that agent being able to run tests in our Rails docker-compose environment (and to the live app preview running) is about 30 seconds. If that agent finishes its work and goes to sleep, and then hours later you come back to send a message, it'll wake up in about the same time.
(And, of course, you can launch many agents at once -- they're all going to be ready at roughly the same time.)
This is really nice and a very original take. It feels good on mobile / other touch devices.
I'd love to see it feel a bit more polished on desktop (maybe I'll give that a shot if I find a bit of spare time!). I could see a few simple things, like adding up/down arrows to the picked item and wiring them into up/down arrow key presses, going a long way to making it work really well there too.
Genuinely, thank you for sharing this, it's something different and interesting.
Following this logic, why write anything at all? Shakespeare's sonnets are arrangements of existing words that were possible before he wrote them. Every mathematical proof, novel, piece of journalism is simply a configuration of symbols that existed in the space of all possible configurations. The fact that something could be generated doesn't negate its value when it is generated for a specific purpose, context, and audience.
Invented might be a bit strong, but his work is certainly the first written record of the word. "Dress" existed as a verb already, as did the generic reversing "un-", but before Shakespeare there is no evidence that they were used this way. Prior to that, other words/phrases, which probably still exist in use today, were used instead. Perhaps "disrobe", though the OED lists the first reference to that as only a decade before The Taming of the Shrew (the first written use of "undress") was published, so there are presumably other options that were in common use before both.
It is definitely valid to say he popularised the use of the word, which may have been in informal use in small pockets for some time before.
Following that logic, we should publish all unique random orderings of words. I think there is a book about a library like that, but it is a great read and is not a regression to the mean of ideas.
Writing worth reading as a non-child surprises, challenges, teaches, and inspires. LLM writing tends towards the least surprising, worn-out tropes that challenge only the patience and attention of the reader. The eager learner, however, will tolerate that, so I suppose I'll give them teaching. They are great at children's stories, where the goal is to rehearse and introduce tropes and moral lessons with archetypes, effectively teaching the listener the language of story.
FWIW I am not particularly a critic of AI and am engaged in AI-related projects. I am quite sure that the breakthrough with transformer architecture will lead to the third industrial revolution, for better or for worse.
But there are some things we shouldn’t be using LLMs for.
RAG could largely be replaced with tool use against a search engine. You could keep some of the approach around indexing/embeddings/semantic search, but it just becomes another tool call to a separate system.
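A minimal sketch of what that looks like, with the LLM client and the index behind placeholder functions rather than any particular vendor's API:

```python
# Minimal sketch of "retrieval as a tool call": expose search as a tool the
# model can invoke when it decides it needs context, rather than stuffing
# retrieved chunks into every prompt. call_llm and search_index are
# placeholders, not any particular vendor's API.
import json

SEARCH_TOOL = {
    "name": "search",
    "description": "Search the internal index and return the top snippets.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def search_index(query: str) -> list[str]:
    raise NotImplementedError  # your existing keyword/semantic index sits behind this

def call_llm(messages: list[dict], tools: list[dict]) -> dict:
    raise NotImplementedError  # whatever chat client you already use

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        reply = call_llm(messages, tools=[SEARCH_TOOL])
        if reply.get("tool_call"):  # the model decided it needs context
            query = json.loads(reply["tool_call"]["arguments"])["query"]
            messages.append({"role": "tool", "content": "\n".join(search_index(query))})
        else:
            return reply["content"]
```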
How would you feel about becoming an expert in something that is so in flux and might disappear? That might help give you your answer.
That said, there's a lot of comparatively low hanging fruit in LLM adjacent areas atm.
> How would you feel about becoming an expert in something that is so in flux and might disappear?
Isn't that true for almost every subject within computers though, except more generalized concepts like design/architecture, problem solving, and more abstract skills? Say you learn whatever "compile-to-JS" language is popular (probably TypeScript today), or Kubernetes: there is always a risk it'll fade in popularity until not many people use it.
I'm not saying it's a problem - and I say that as someone who favors a language people constantly claim is "dying" or "disappearing" (Clojure) - but rather that this isn't exclusive to the LLM/ML space; it just seems to happen slightly faster in that ecosystem.
So instead, embrace change: go with what feels right and learn whatever seems interesting to you. Some things stick around, others don't (like CoffeeScript), but hopefully you'll learn something even if it doesn't stick around.
I expect it will wind up like search engines, where you either submit URLs for indexing/inclusion or wait for a crawl to pick your information up.
Until the tech catches up, it will have a stifling effect on progress toward, and adoption of, new things (which imo is pretty common with new/immature tech, e.g. how culture has more generally kind of stagnated since the early 2000s).
In a research context, it provides pointers and keywords for further investigation. In a report-writing context, it provides textual content.
Neither of these, nor the thousand other uses, is worthless. It's when you expect working and complete work product that it's (subjectively, maybe) worthless, but frankly aiming for that with current-gen technology is a fool's errand.