Good idea, and an improvement, but you still have that fundamental issue: you don't really know what code has been written. You don't know the refactors are right, in alignment with existing patterns etc.
I guess to reach this point you have already decided you don't care what the code looks like.
Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?
Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.
One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
It sounds like you know this, but what happened is that you didn't do 4 weeks of work over 2 days. You got started on 4 weeks of work over 2 days, and now you have to finish all 4 weeks' worth of work, which might take an indeterminate amount of time.
If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.
You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.
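To make the backpressure idea concrete, here's a minimal sketch of a gate that refuses new agent work while the review queue is full. The class and method names are hypothetical (not from any tool), and the cap of 3 is just the "say 3 PRs" number from above.

```python
from dataclasses import dataclass, field

MAX_IN_REVIEW = 3  # illustrative cap: "say 3 PRs in review at any given time"


@dataclass
class ReviewQueue:
    """Tracks PRs awaiting human review (hypothetical helper, not a real tool)."""

    limit: int = MAX_IN_REVIEW
    in_review: set = field(default_factory=set)

    def can_start_new_work(self) -> bool:
        # Backpressure: only generate more code when a review slot is free.
        return len(self.in_review) < self.limit

    def open_pr(self, pr_id: str) -> bool:
        # Returns False when the queue is full, so the caller can wait
        # instead of spending tokens on code nobody can review yet.
        if not self.can_start_new_work():
            return False
        self.in_review.add(pr_id)
        return True

    def finish_review(self, pr_id: str) -> None:
        # Merged or rejected, the slot is freed either way.
        self.in_review.discard(pr_id)
```

The point is that the agent loop asks `can_start_new_work()` before each new task, so generation speed is tied to review speed rather than running ahead of it.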
Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?
Ideally you're still better off: you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.
If you haven't reviewed the code yet, how can you say it did 4 weeks of work in 2 days? You haven't verified the correctness, and besides reviewing the code is part of the work.
That's what I was getting at. With the review and potential rework time, we could be looking at over the original 4-week estimate. So then what's the point in using long-running unsupervised agents if it ends up taking longer than doing it in small chunks?
> Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.
Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.
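Roughly, the loop I'm describing looks like the sketch below. `check_file` and `fix_issue` are hypothetical stand-ins for whatever agent calls you use (they're passed in, not real APIs), and `max_rounds` is an assumption I've added so the loop can't run forever.

```python
from typing import Callable

Issue = tuple[str, str]  # (file path, description of the mismatch)


def review_until_clean(
    files: list[str],
    checklist: list[str],
    check_file: Callable[[str, list[str]], list[str]],
    fix_issue: Callable[[str, str], None],
    max_rounds: int = 5,
) -> list[Issue]:
    """Check every file against the checklist, fix what is reported, repeat.

    Returns the issues still reported after the final re-check; an empty
    list means the last pass came back clean.
    """

    def scan() -> list[Issue]:
        # One agent pass: file by file, against the same checklist.
        return [(path, problem) for path in files
                for problem in check_file(path, checklist)]

    issues = scan()
    rounds = 0
    while issues and rounds < max_rounds:
        for path, problem in issues:
            fix_issue(path, problem)
        issues = scan()  # re-check, since fixes can introduce new problems
        rounds += 1
    return issues
```

Whatever is left in the returned list after the round budget runs out is what I'd look at by hand.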
The proper solution is to treat the agent-generated code like assembly, i.e. don't review it. Agents are the compiler for your inputs (prompts, context, etc). If you care about code quality you should have people writing it with AI help, not the other way around.
Code review is a skill, as is reading code. You're going to quickly learn to master it.
> It's like 20k of line changes over 30-40 commits.
You run it in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.
> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, inline documentation (cross-referenced) and digestible chunks.
But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.
Oh I didn't mean literally how do I review code. I meant, if an agent can write a lot of code to achieve a large task that seemingly works (from manual testing), what's the point if we haven't really solved code review? There's still that bottleneck no matter how fast you can get working code down.
Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).
IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.
> Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?
Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.
The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.
You’re not alone. I went from being a mediocre security engineer to a full time reviewer of LLM code reviews last week. I just read reports and report on incomplete code all day. Sometimes things get humorously worse from review to review. I take breaks by typing out the PoCs the LLMs spell out for me…
So you have become a reviewer instead of a programmer? Is that so? Honest question. And if so, what is the advantage of looking at code for 12 hours instead of coding for 12?
Build features faster. Granted, this exposes the difference between people who like to finish projects and people who like to get paid a lot of money for typing on a keyboard.
Why does understanding computer science principles and software architecture and instructing a person or an ai on how to fix them require typing every line yourself?
Yeah, honestly that's what I am struggling with too, and I don't have a good solution. However, I do think we are going to see more of this, so it will be interesting to see how we are going to handle it.
I think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. Started building a Claude skill for this (https://github.com/opslane/verify)
I wouldn't have picked this article as AI until I got an agent to do some writing for me and read a bunch of it to figure out if I can stand behind it. Now I see the tells everywhere: "It's not this. It's that." is particularly common, and I can't unsee it. (FWIW I rewrote most of the writing it generated, but it did help me figure out my structure and narrative.)
The problem I think with AI generated posts is that you feel like you can't trust the content once it's AI. It could be partly hallucinated, or misrepresented.
Yeah, but "it's not X. It's Y" is a common idiom that LLMs picked up from people. That's the point I was making. And it's starting to feel like every post has at least one comment claiming that it was AI generated.
This is exactly right IMO. I have never worked for a company where the bottleneck was "we've run out of things to do". That said, plenty of companies run out of actual software engineering work when their product isn't competitive. But it usually isn't competitive because they haven't been able to move fast enough.
A) How old the product is: Twitter during its first 5 years probably had more work to do compared to Twitter after 15 years. I suspect that is why they were able to get rid of so many developers.
B) The industry: many b2c / ecommerce businesses are straightforward and don't have an endless need for new features. This is different from more deep tech companies.
There’s a third one, and it’s non-tech companies or companies for whom software is not a core product. They only make in-house tooling, ERP extensions, etc. Similar to your Twitter example, once the ERP or whatever is “done” there’s not much more work to do outside of updating for tax & legal changes, or if the business launches new products, opens a new location, etc.
I’ve built several of such tools where I work. We don’t even have a dev team, it’s just IT Ops, and all of what I’ve built is effectively “done” software unless the business changes.
I suspect there’s a lot of that out there in the world.
> how people who really try to learn with these tools work
This setup is potentially effective sure, but you're not learning in the sense that GP meant.
For GP: Personally I've reached the conclusion that it's better for my career to use agents effectively and operate at this new level of abstraction, with final code review by me and then my team as normal.
> This setup is potentially effective sure, but you're not learning in the sense that GP meant.
Then GP didn't mean anything useful. I've learned how to build those setups. I learn to build by orchestrating groups of agents, and I get to spend far more of my time focusing on architecture, rather than minutiae that are increasingly irrelevant.
Honestly, comments are just half the problem. At least half the articles I read from HN are vibe written, and I only spot it after reading a few paragraphs. It's leaving a bad taste, and it's sad because HN was guaranteed to have plenty of things worth reading and it's deteriorating.
I'm using kimi-k2-instruct as the primary model and building out tool calls that use gpt-oss-120b to allow it to opt-in to reasoning capabilities.
Using Vultr for the VPS hosting, as well as their inference product, which AFAIK is by far the cheapest option for hosting models of this class ($10/mo for 50M tokens, and $0.20/M tokens after that). They also offer Vector Storage as part of their inference subscription, which makes it very convenient to get inference + durable memory & RAG w/ a single API key.
Their inference product is currently in beta, so not sure whether the price will stay this low for the long haul.
You can definitely get gpt-oss-120b for much less than $0.20/M on openrouter (cheapest is currently 3.9c/M in 14c/M out). Kimi K2 is an order of magnitude larger and more expensive though.
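For a back-of-the-envelope comparison, here's a small sketch using only the prices quoted in this thread. Assumptions on my part: the Vultr plan counts input and output tokens the same (the thread doesn't say), the function names are made up, and this only covers gpt-oss-120b since no Kimi K2 price is given here.

```python
def vultr_monthly_cost(total_tokens_m: float) -> float:
    """Cost in USD for a month, with tokens given in millions.

    Quoted pricing: $10/mo covers the first 50M tokens,
    then $0.20 per additional 1M tokens.
    """
    extra = max(0.0, total_tokens_m - 50.0)
    return 10.0 + 0.20 * extra


def openrouter_monthly_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """Cost in USD at the quoted cheapest gpt-oss-120b rates.

    Quoted pricing: 3.9c per 1M input tokens, 14c per 1M output tokens.
    """
    return 0.039 * input_tokens_m + 0.14 * output_tokens_m
```

For a made-up workload of 80M input + 20M output tokens a month, that works out to $20.00 on the Vultr plan versus about $5.92 at the OpenRouter rates, though the Vultr figure also buys access to K2 and the vector storage.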
What other models do they offer? The web page is very light on details.
K2 is the only one of the 5 that supports tool calling. In my testing, it seems like all five support RAG, but K2 loses knowledge of its registered tools when you access it through the RAG endpoint, forcing you to pick one capability or the other (I have a ticket open for this).
Also, the R1-distill models are annoying to use because reasoning tokens are included in the output wrapped in <think> tags instead of being parsed into the "reasoning_content" field on responses. Also also, gpt-oss-120b has a "reasoning" field instead of "reasoning_content" like the R1 models.
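One way to paper over those three shapes is a small normalizer like the sketch below. It assumes an OpenAI-style message dict with a "content" key (that shape is my assumption, not something documented here); the field names "reasoning_content" and "reasoning" and the <think> tags are the ones mentioned above, and `split_reasoning` is a made-up helper name.

```python
import re

# Matches <think>...</think> blocks, including ones spanning multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from one chat message dict.

    Handles three cases: a "reasoning_content" field, a "reasoning" field,
    or reasoning left inline in the content inside <think> tags.
    """
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    if not reasoning:
        # Fall back to pulling inline <think> blocks out of the content.
        blocks = THINK_RE.findall(content)
        reasoning = "\n".join(block.strip() for block in blocks)
    # Always strip any <think> blocks from the user-visible answer.
    answer = THINK_RE.sub("", content).strip()
    return reasoning, answer
```

With this in front of the client, the rest of the code only ever sees a (reasoning, answer) pair regardless of which model produced the response.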