This “short leash” seems like more of a crutch to me, and a sign of not giving the AI enough detail on the problem to begin with, or not reviewing and iterating on its output.
Hand-holding great models like Fable through implementation is a waste of time, and a waste of Fable. You can have increasingly nuanced discussions with stronger models, and they write a lot better code than they used to. The process of discussing designs and their implementations, questioning things that look weird to you, and actually reading the AI’s responses also helps to find better solutions.
For example, one time I wanted to write a greedy solver for a problem, and in my discussion with Opus on the idea it suggested using an existing MILP library to solve the problem exactly. I’d never even heard of MILP, but my final implementation ended up being better and simpler than what I’d have done alone.
You say you can have increasingly nuanced discussions with stronger models.
What I say is, when I asked Claude why he applied a certain change I didn't understand, and boy, it was a small change, he said he "reasoned from first principles" based on the code paths. But it didn't work, and when I asked, "Okay, describe the steps of your reasoning from first principles," it literally answered that it had just made it up.
So, nuanced discussions with models, I don't buy it.
You can never ask why a model did a certain thing, or what it was "thinking" when it said something - just like you can't ask a human which neurons were firing when they had a certain thought. The information just isn't available at that level.
You absolutely can have deep nuanced discussions with LLMs however, you just need to better understand their strengths and weaknesses.
The human will quite convincingly be able to construct a post-hoc reasoning on an action that may or may not be related at all to what was actually going through their head or the actual instinctual reasons that led to a decision.
Humans can accurately retell what their consciousness was doing, but they have no clue why their unconsciousness responded as it did.
LLM is just that unconsciousness part that humans have to post hoc explain like that, and lacks the conscious part that we humans actually can inspect in ourselves.
If the AI had some introspection part where it actually tracks its reasoning maybe it would be closer to conscious humans. Its too expensive to do that everywhere ofc, not even us humans tracks everything like that, just a tiny bit, but tracking that tiny bit is enough for so much error correction to happen.
"Humans can accurately retell what their consciousness was doing" is often not true, because of complex mechanisms. The feeling of shame alone can make it very hard for someone to accurately describe how the arrived at the wrong conclusion.
Plus it's an open question if this is even a thing. Does consciousness consist of constructing actions beforehand, or of construction justifications afterward?
Frankly, my opinion is that DNA is incredible at choose the most energy efficient/cheap option, and the cheaper option is definitely justifications afterward.
I feel strengthened by psychological experiments where people are shown fake events involving them, where they then "explain their (nonexistent) reasoning at the time".
Arguments for the idea that the human consciousness/soul is something that is emergent keep getting shouted down though. Even though if you take the extreme opposite: it's obviously wrong. Nobody has ever cut open a human skull (or anything else) and found a soul. So somehow it's constructed from very non-conscious components we don't understand, it's not "actually there" in a real sense.
Sufficiently constrained post-hoc justifications are indistinguishable from explanations. Consciousness tries to make things up, it learns that people notice this, it then begins trying to construct justifications that won't be predictably called out as false. Eventually it learns how its unconscious operates, and how to interrogate it, and its post-hoc justifications, at least in the common cases, become reliable.
>Consciousness tries to make things up, it learns that people notice this, it then begins trying to construct justifications that won't be predictably called out as false.
There's a logical "skip" between that and
>Eventually it learns how its unconscious operates, and how to interrogate it, and its post-hoc justifications, at least in the common cases, become reliable.
The brain constructs a narrative that won't be called out as false, one that provides social capital, makes one feel good about oneself, is consistent with all your other justifications, etc. It's only an assumption that this process would naturally converge on Truth, and considering it's massively-multiplayer chaos where brains coordinate their stories in complex ways, my assumption is that this would converge on *stability*, not truth.
Yep. It converges on truth unless there's a strong reward for lies because truth is easy. It's a neural network. It just reads off/probes the internal state because that's the cheapest way to model the unconscious. The justification won't necessarily be true, mind, in terms of the labels it puts, but it should mostly be true structurally- behaviorally predictive in the ordinary domain.
(Even if you are incentivized to lie and flatter yourself, it is still helpful to have access to the true signal internally, because that way you can know how to structure your lie to best avoid detection.)
>Eventually it learns how its unconscious operates
I mean, no we don't, both in a personal way and in a global scientific understanding.
What you're saying happens is a set of socially consistent and acceptable responses based upon general human knowledge at the time. The common cases aren't exactly reliable, it's that they are repeatable in the sense they cover what we expect, and tend to explode when the world is less predictable.
This is why the scientific method changed the world, because we started writing shit down, comparing notes, and striving for repeatability.
I think a better way of putting this is that humans think they can accurately re-tell what their consciousness was doing. Whether they actually can, or even if consciousness exists at all as a thing outside the perception of consciousness is a philosophical question currently beyond answering.
I wonder if monte carlo tree search could play a role in reasoning. I'm searching and it seems to come up in arxiv papers, so the idea is not dead. I'll look more into this after writing this comment..
Isn’t that part of what the think blocks are for?
Yea, don’t inject them back into the context, but do log them for review of that train of thought… no?
You don't get access to the thinking traces. Might work with local models tho, but the current <thinking/> meta isn't particularly suited for this either, as it's a big blob of rambling surfaced by RL, with the "only" objective being that the thinking blob somehow leads to a better final answer. Something more detailed, using templates akin to oAI's harmony could work, provided there's also a step that teaches the models to reflect on the various thinking channels, and maybe surface bits and pieces to include in "skills" or "learnings".
That's true, but it does mean that the LLM itself actually does have access to those thinking traces and could therefore, at least in principle, answer what it was thinking. They're probably not trained to do that, though.
It depends. Up until recently the models were trained only to "think" on the last user message. So you'd send the message1, got back reply1 w/ think1 but you'd make the next iteration m1 - r1 - m2, and would get back reply2 w/ think2. You would not add the thinking1. That's how the models were trained, and that's how you were supposed to construct the conversation.
Now recently some things have changed, and you can add the thinking part (you get that encrypted from the closed API labs). But the model needs to have been trained for this to work. And doing it this way you'll burn through tokens faster, as the thinking parts are usually rather long.
You certainly can ask it what it was thinking, the problem is just that it's more likely to make up a plausible sounding fabrication than to say "I don't know" or "my reasoning is hidden for business reasons" (frontier models hide a lot of their chain of thought). Which is the fundamental problem with LLMs though, if the data doesn't exist or it's sparse they make things up.
Choosing plausible sounding fabrication over an admission of ignorance is not an uncommon modality among the human beings I interact with, so I'm not surprised this pattern is found in models trained on human interactions.
You can have a nuanced discussion with an LLM. But LLMs also have failure modes where they start making up justifications. The two are not mutually exclusive.
> You can never ask why a model did a certain thing
Of course you can! It might be following outdated docs or read something in legacy code and tried to follow that pattern and it'll tell you as much if you ask it in a way that actually gets you the reason instead of it thinking it needs to immediately fix the mistake.
Even asking a human being why they did a certain thing is questionable. The research on choice blindness seems like a pretty definitive debunking of post-hoc rationalization:
I'm not sure what point you're trying to make. In science and engineering, being able to provide justification is a core skill. The comparison we should be making is against the human practitioners who are trained in their fields. There will always be a distribution of ability. Saying that there's evidence that people are capable of providing post-hoc rationalization doesn't say anything about the ability of experts to produce well thought out responses (in their respective fields) that don't immediately fall apart under scrutiny.
Structured thinking and deliberation are indeed important, but you can also make LLMs do structured "thinking" if you work hard enough, and generate quite plausible reasoned arguments with valid real-world results, and you can get them to write down their working as they go. But as research has shown, it's not "true" thinking, just pattern matching at a higher level, and eventually runs out of steam.[0]
But you only have to drill down a couple more layers and you are back in the void again; do you have any proof that your own thinking, no matter how structured and accurate, is anything other than pattern-matching at a sufficiently much higher level at which you are incapable of seeing it as such?
I think we will be finding some very interesting things out soon using the combination of LLMs and theorem provers, as demonstrated by Terence Tao's recent work.[1]
A cheetah is not a motorbike is not an aircraft is not a rocket.
"Nuanced discussion" doesn't necessarily mean the sort one would have with a human. Statistical apologies are never going to be meaningful. One could edit nonsense into the context window and the model would attempt to rationalize it. The models are smart but you need to use them in a way that makes sense for what they are.
"Nuanced discussions" is more about describing a design to a model, asking the model to critique your design and ask you for clarifications, and then you providing those clarifications and the model "getting it" and proceeding to additional levels of detail before implementation. In particular the models being able to highlight concerns you have not yet thought about is a pretty good sign of this. Fable is noticeably better at this compared to Opus.
I was not talking about models making mistakes. Mistakes, and then models making up justifications for those mistakes, is a failure mode of any LLM, and Fable is no different in that regard. Newer models might make less mistakes, or at least make less egregious mistakes, but they still make mistakes.
Maybe I’m missing something, but he talks about charm and tasks (repos on his GitHub). Charm being his harness, and tasks being one of his skills.
Idk, maybe I’m mistaken from reading the article…
> Fable is better than most staff engineers at my FAANG.
While this wouldn’t entirely surprise me, my experience is just not that. Using Claude and fable, it regularly (poorly) recreates features that exist inside our codebase. Sure, I could give way more initial context but at a certain point I’ve given so much context that I would have been faster writing the code myself, or I could have literally handed it to even a fresh graduate to write.
We can point out mistakes that feel rather grating without assuming intent behind them.
I agree that their use of "he" is likely because they're not a native speaker, especially because they're arguing against the capabilities of LLMs.
That doesn't make it inherently wrong to point out the mistake when it's so intertwined with the deeper discussion here, especially given the fact that some (hopefully few) people do build relationships with LLMs.
I’d be more willing to engage with your argument in good faith without inflammatory language like this. Try and meet people where they are and these conversations become easier.
If you have invested significantly in the planning phase and there is momentum in the architecture and conventions that already exist in the project, the implementation phase might not need as much oversight as is suggested here.
> You can discover that your initial idea was dumb and a better one exists
The planning and architecture phase is usually where I make these types of discovery at a high level.
> Your agent might go “off the rails” and start doing something you don’t want it to do
Candidly these orthogonal, inadvertent edits aren't as bad as they once were and for impactful changes there should be at least some test coverage, even if that test coverage is just "freezing" what was implemented.
As you mentioned the final review discussion is a good chance to verify beyond what review or adversarial review agents find.
I think the obvious solution here is to beef up the test side of the app, much more than when writing code by hand. Tests represent project knowledge in executable format. The LLM does not need to be careful to remember every detail of the tests. You don't need to vet every small interaction, it automates review work as well.
Even better if the project was built from the start to be easier to test and observe. But my golden rule remains - no code without tests, expand test suite all the time.
I am a bit confused which part you disagree with specifically. Reading AI responses and reviewing code seems to be what you propose as well.
Your example with MLIP is something that would not be prevented by this approach, during the planing phase, it would surface.
I guess the devil is in the details and the way you prompt it for starting the task matters.
But IMO you absolutely need to check the output, need to engage with what the model is doing, need to probe why something is built the way the model tries to build it.
I disagree with keeping an eye on the model as it is working, approving every command, and denying and stopping the model when you think it has gone wrong. It is not that it is actively harmful to do this, but rather that it is a waste of time and you can avoid the need for it through better design discussions and review.
Micro-managing and keeping the AI on a "short leash" also lends itself better to telling models to do smaller units of work at a time instead of discussing broader design concerns. That is why I think someone doing this would miss the MILP solution, because they might never discuss the overall design with the model but rather just tell it what to implement next.
I personally am somewhere between you and the author. I don't check _all_ the intermediary steps, but I do try to understand what it's doing [1] and follow the process. Mostly I let it do the changes itself without supervision at each step but when a coherent "chunk" of work is done, I go through it really thoroughly. In almost 90% of the cases after a chunk is done some adjustments are needed.
I find broad architectural design to be _better_ if you follow along in the process because you better understand the direction it's going earlier and you can shift the high level direction much earlier. Even if you check its steps, you can ask it for its take on high-level architectural aspects along the way, no problem. I think personal touch matters a lot though, because I naturally ask it and try to get the big picture image.
[1] I actually find it really instructive what tooling it uses to tackle a problem, I got to become a much better console user because of it
The article feels like micromanaging AI. If you think about it like a junior employee, micromanaging them will mean they end up doing the work you want and do it your way. But they won't bring any of their ideas to the table, which in the long run could be beneficial to everyone on the team.
Sounds like you wrote very poor quality (edit: or trivial) code, or you’re exaggerating a bit for effect.
I too forget the details of most of the code I write, but the most important 10-20% of the code that I write encodes my mental model of the problem I’m trying to solve. Sometimes it’s a class representation of a digital or physical entity. Sometimes it’s a job with tasks that map to subproblems. Those abstractions almost immediately launch me into the mindset of my former self, even years (or a decade!) after the fact.
AI-generated code does not tend to create those kinds of abstractions in my experience. It will likely, with encouragement, solve the problem you’re asking it to - but it won’t magically cause you to understand how to solve the problem. You must take the initiative to understand it yourself. You are the camel that the AI has taken to water, and it can’t force you to drink it.
So the way I write code is that, my understanding is local. Okay, we need a function that does this (high level). It'll call these functions to do that. And then I just continue until there's nothing left to write and the thing works (after a few rounds of debugging).
I understand each piece and what it talks to. But I can't hold them all in my mind at once, because there's too many pieces. (I think chunking helps here, but it seems to require a certain level of fluency with the entire codebase that I'm not sure it's feasible to hit with anything past a certain line count. I am working on this new memory software though...)
The transformer on the other hand, just loads it into context (they can do about 10K LoC these days without performance degradation), cross references everything against everything (that's how the transformer works! That's why they're so expensive) and just tells me what talks to what, what the full chain is, and also btw you have 3 bugs you didn't notice because they involve how distant parts of the chains interact, you're welcome!
I've been looking for ways to build up that mental model. The Feynman technique seemed like a good place to start. I did it on a section of my codebase. It took half an hour of poking around to connect all the pieces. The transformer was able to do it instantly.
I'm not sure if there was added value to me poking around manually or if those 30 minutes would have been better spent just memorizing what it told me.
(After verification of course! To clarify, I don't think they're infallible, but their perception is broader than ours due to how they're structured, and I'm learning to utilize that more effectively.)
Also, in the absence of that costly verification, the model my Feynman technique produced turned out to be wrong (though it sounded correct!). So I'm leaning in the direction of, the way to actually verify your mental model is to make a modification to the codebase. Make reality push back!
--
On some projects/subprojects I do build an explicit mental model beforehand, and then I do generally remember it pretty well, at least for a while. Others take a more iterative approach to the design. (I'm on the 5th damn iteration of my netcode right now.)
So there's two distinct issues here, the model building process and the human forgetting curve.
That’s fair I guess. I’m pretty consistently surprised by the wide variety of tasks that everyone under the “programming” moniker tackles. I consider myself a programmer by trade, even though I’m not a SWE. Your first paragraph couldn’t really be further from my personal experience. I haven’t thought in terms of functions in years, mostly “jobs”, “tasks”, “workflows”, “data flows”, “modeling”, “labeling”, etc.
Some people really do have jobs that I wouldn’t be surprised that LLMs will nearly completely automate away. And those people will be forced to move “up the stack” in terms of abstraction… but that’s already where I’m at. And LLMs are helpful, but I don’t feel threatened by them at all. If they take my job, I think computers will be declared obsolete. No more keyboards and mice.
> As for how to rebuild it, I haven't figured that part out yet.
Just do some work with the code. If I go back and try to add a feature or fix some bugs on code that I have not worked with for a long time I find it much quicker to build up a mental model of it than code which I have never worked on previously.
Yeah. Ebbinghaus found this in his work on memory in the 19th century. Even after something has been forgotten, re-learning it goes more quickly, as a function of how many times it has been learned already.
I'm developing a new memory system that functions as an L1 cache for the human mind, taking the opposite approach of Anki and showing you things you want "top of mind" as often as possible. (As opposed to as rarely as possible, which is the standard approach in the memory space these days!)
If you were working as a manager on a large project, how would you build a model? Something where your position requires you to have an overview of the project but not necessarily to actually write or review much code.
I am not able to find it now, but there was an amazing story recently from the 60s or 70s where an engineer was in exactly this position. His team was building a new, complex, ambitious operating system, but it was late and over budget and didn’t work. It nearly wrecked the company. He talks about hitting rock bottom and asking himself what went wrong, and one of the fathers of computing (can’t remember which) shouts from the hallway in passing, “that’s easy, you didn’t understand what your people were doing.” So the guy turned it around by implementing a new rule: he had to understand every line of code his team wrote. They started over with the company’s existing OS in use by customers and implementing a few of the most requested features. Much less ambitious, but it actually shipped. Gradually they achieved all their goals by upgrading the existing system.
The “I must understand every line” constraint didn’t sound like a power trip that succeeded because the guy was such a brilliant code reviewer. I think it was a blunt instrument that enforced simplicity.
I guess what I’m saying is, I reject the premise of having technical oversight without writing or reading much code.
As with anything. Either you can go full-speed without much understanding and hit a wall when you need to understand stuff or you can go a manageable speed and actually understand the codebase.
I don't think we can do both. The difference is that it's optional now depending on the project and the audience.
I never said you wouldn't have to read code. I was asking a question to get answers about how people would achieve having an oversight if it was humans writing code that they were managing rather than agents.
As for your suggestion, understanding every line might have worked in the 70s but even pre-agentic modern coding it's not possible for any large project with dependencies even if you are directly contributing code yourself, so I'm not sure how useful your idea is.
There's definitely cases where you should have that aim - writing a low level maths or graphics library, for example. But most people are not doing that.
Perhaps my "reject the premise" comment was a bit too provocative. I didn't intend to start an argument. I wanted to share a story of a person who was put in the position you described (working as a manager on a large project, required to have an overview of the project but not necessarily to actually write or review much code), failed miserably, changed the rules of engagement (partly by reviewing code), and subsequently succeeded. So my _personal_ answer to your question of how to build a mental model in that scenario would be to do something in the same vein as that story. Not necessarily _exactly_ what that guy did, but I think the principles still apply today. Nothing really changes.
I've been experimenting with the Feynman technique on codebases. However the issue I run into is that, you need hard feedback to verify your hypotheses.
I was satisfied with my own explanation of how something worked but it turned out to be wrong.
LLMs help here (the transformer is good at seeing the big picture, at least on smallish codebases), but the best thing I found so far is just modding.
Actually making a change to the code is the best way to get hard feedback about your model.
I'm not so sure. I think you can, you just need to intentionally drill into what you don't understand and it's exhausting. What I do agree with though is that I can't seem to build the ability to build it myself the same way as I would if I wrote it.
For example, I know my mental model works because I know what change I should do in order to get an effect and when I do the change, I get what I expect. But if I were to build myself something similar, I could not build it because the approach is somewhat out of my reach, I know it sounds weird, but it's hard to explain.
That's why I like to build a complete feature and the infrastructure myself first, so the AI will have a picture of how the code should look and where it should live.
Or I use the short-leash method and I will instruct the AI build infrastructure first, without even talking about features yet.
yep. got a laptop i dont care about that claude can play with in wsl.
its the fun of funemployment.
starting work again is gonna be an interesting change though. its currently straightforward letting it run, then giving a broad critique and setting up new introspection/closed loop feedback for an hour over a beer, then letting it run wild again after
>>You never use “YOLO” mode (aka “dangerously skip permissions”)
Do you mean this?
I'm curious how are people using Claude in any way other than bypass-permissions. I've tried for so long to maintain a curated list of things Claude can use, but inevitably I would always come back only to find it stuck because it decided to pipe an output of one tool into another and that's not explicitly allowed so it stopped even though it was just greping or whatever. I found it infuriating. In bypass-permissions it "just works" but then again I only use it to analyze existing code and suggest new changes(and even if it breaks something that's what source control is for?)
It does do this to frustrate you, save 30 tokens, and then waste a few thousand more when it didn't get all the context it needed by grep'ping. You have to be involved in the process though. It frequently wants to do things that are so incorrect, that even if it would be more convenient to just totally ignore it, it would be insane to actually ignore it. Do you trust it to not accidentally rm -rf the .git/ right after it helpfully force pushes to remote? I don't. Even if I don't expect it to do that, why would I ALLOW it to be able to?
We use perforce and Claude can't push anything to our perforce server. The worst thing it could possibly do is delete my local workspace, but that's not exactly a huge problem, would just have to sync again.
I did it by making a huge database of allowlisted bash and having hooks check each one against the list. It makes a recursively parsed tree so it can handle gnarly blocks of bash. And then it outputs to the agent what failed and tells it to break it up next time. Then, in agent instructions, I impress on it strongly to use composable bash tools rather than trying to write python/ruby/perl scripts.
It was a bit of work, admittedly, but it's picked up a few users and I learned a lot from designing the research process and parsing the syntax trees.
I actually want to be alerted about everything that's not auto-approved, though. With safe commands auto-approved, it's much less noisy. I think it's important to read your code, as it develops, not just at the end, and understand what agents are doing.
I’ve found unexpected success in using ephemeral NixOS VMs for local development… once you authenticate your agent you can let it run wild without worrying about permissions.
I got halfway thru learning about containers before I realized, I just don't want it to blow up my files. That was a very solved problem in the 1970s! So I just made a Linux user called agent.
It's not YOLO, but auto mode in Claude Code does reduce the amount you have to approve significantly. And frankly, without it, progress is constantly interrupted by permission requests. It's all I use. Don't even really switch into Plan mode manually anymore.
What questions? When I go into auto mode, it doesn't come back until it accidentally/intentionally tries to slip the guardrails, or completes the task. My prompt will generally include information on what it's allowed to do to accomplish the task, where to test, etc. Simple, but effective.
Build your own MCP of allowed tools. Cargo. Ripgrep. File read and write, including directory listing and find. some git commands. Then block everything else.
My problem with that is it makes the shittiest bash scripts to do basic things like search for a file and it gets them wrong for minutes at a time. It’s depressing. But yeah, that’s the other option. Just don’t watch.
One problem I have with "how to do X with AI" is that every situation is different. For example, I'm bumping Symfony projects from 3.1 to 8.1. There's a clear path here
- Follow the written up migration guides PER major version
- test all routes, authorised, etc. You can even hand-curate these tests. some might return 200, some might return 302
- Maybe optionally start with writing a safety net so you do not need to do these test manually, have e.g. a PHPStan baseline, etc.
You're done when the routes are e2e functionally working as intended. You could even use snapshot testing here.
I do not need to look at the AI here. I can review the code at the end, but I do not need to manually approve stuff here, hence safety features are off.
> One problem I have with "how to do X with AI" is that every situation is different
It's less of a "problem" and more a "How to approach content on the internet". Everyone is writing things from one perspective (usually) while there is a wide-range of perspectives out there, and what works in one situation doesn't work in another. "software engineering" as a whole is basically figuring out what goes where, and when, then trying to ignore the rest.
Then lots of company blog posts wants to lead you to believe there are silver bullets, solutions that apply for every scenario and case out there, which usually isn't true.
So again, less of a "problem" and more of a "Some things work in some situation", like we've been dealing with forever in software engineering. It's not right, it's not wrong, just applied practically different in different situations, perfectly fine and normal.
AI is a junior to mid-level engineer. If you treat it as such, you get the best of both vibe coding and rigorous engineering without all this paranoia.
Since the very beginning I've ran Claude from an isolated VM on yolo mode. This is just like giving an engineer their own laptop. Claude works on a feature up to a PR worthy point. I review the diff, just like I would with another engineer, and massage it to get it in the right shape and move on.
Inexperienced engineers make the same mistakes described I've even seen rm -rf albeit not from root! I would have lost my mind micromanaging someone with all permissions denied.
I strongly agree with this take — and that’s partly why the article posted here leaves me scratching my head. PRs are already the gate, right? I don’t care what an agent does or doesn’t do within the confines of its workspace assuming their contributions are gated via a git repository and they don’t require exotic access to a production environment to do their development.
I’m also with you on the junior / mid-level engineer framing (a “brilliant” junior engineer perhaps, one who graduated from at the top of their class from the best CS program in the country) with a big caveat: AI is like a junior engineer who doesn’t know how to learn.
It’s like you’re working with the guy from Memento. Every day your LLM reports to work and they’ve learned nothing from your work so far. Every day is the first day!
Now like the Memento guy you can help them to scatter their workspace with sticky notes and reminders everywhere. With some effort you can start to approximate that thing called “learning” which is LITERALLY the most important trait of every single software developer on a team.
But I confess it’s a struggle for me and the available tooling isn’t there yet. The best I’ve done looks closer to the “second brain” people use tools like Obsidian for. Sadly I don’t think a second brain is a substitute for a first brain. And to be 100% honest any engineer who exhibited the same inability to learn and grow as an AI agent would be sacked after their first month on the job at any company I’ve ever worked at.
I’m actually reasonably optimistic that either the main AI providers or someone else will improve on this in the coming years. It certainly feels like a decent memory paired with a well architected thinking system that’s better at contextually injecting memories (I find LLMs today don’t know what they don’t know unless you force them to put metaphorical sticky notes all over the place) as well as capturing real learnings without supervision shouldn’t be an impossible task requiring novel technical structures.
Anyhow I’d love to be wrong about some of the above and I’m always reading articles like this one hoping that someone has solved these problems already and that I’m just slow on the uptake. But as of today, I’m only modestly better at architecting such agents than I was when I started.
Yep, this is my experience too. I think of it more as a very, very smart and fast intern -- you can tell it’s going places, and in many ways is already way better than you, but it still needs an experienced hand to steer it.
My rule of thumb is, any special processes you put in place for AIs are either sensible for humans as well, or they’re not worthwhile. Good CLIs, auto-summarization of long command outputs, Markdown docs and workflows -- those are all useful for people too!
To guard against mistakes and abuse, you use sandboxing and scoped permissions, not micromanagement.
One thing I’d like to figure out is a good pair-programming workflow for AI agents. You can tell a high-level model to go and do something, and that works; you can use a low-level model as an IDE assistant, and that works; but they’re separate workflows. What would be really useful is a way to kind of hand the keyboard back and forth with the high-end model and build something together. But safely, not in full-on YOLO mode on my own machine. This is one specific area where humans and LLMs differ -- it’s so much faster than me that I can’t just grab the keyboard back from it if it goes off the rails.
This is not true anymore and you aren't helping yourself by deluding yourself about it.
It's something, nobody quite knows what, but it's NOT a junior or mid level engineer, it's a nuclear powered staff engineer living in a cardboard box who lacks domain context and wakes up with no memories ever 5 hours.
And who can't code its way out of a wet paper bag on hard problems. It's more productive for the day-to-day BS, which is convenient because it creates more day-to-day BS you need to handle, but that isn't the reason I hire a staff engineer.
i'm sorry but you're wrong and the only person you're hurting with your delusions is yourself. it doesn't change reality to pretend the world isn't changing under your feet.
i'm not going to argue about this but for your own career etc i truly hope you evaluate your epistemics.
Sure, it's changing, and I use AI a ton. The second I ask it (where "it" is a smattering of all the SOTA models and harnesses) to do something as simple as design a server capable of doing <moderately simple task> when any concurrent data structures are involved and the single-server load is in the 100k QPS range, even with extremely thorough plans of how concurrency needs to be managed, it doesn't matter how little code is actually needed or how easy it would be for my juniors to bang out the problem, especially with a little AI boost, AI just can't keep up by itself yet. It can sometimes spit out something close, but only with major correctness issues.
I'm not trying to be argumentative; You posed an idea, and it looked wrong in an important way, so I added my observations. I'd love if you could share the model/harness/workflow you use that makes you so confident in this tooling, because I don't want to be left behind.
LLMs are still next token predictors, just because you can give it more vague instructions and it still finds the right steps to follow, it doesn't mean it's intelligent. It means you're speaking the same language as the harness they trained your model on.
And that has a limit. If you are stuck at PoC level or simple apps, you have no idea how limited the current models still are. There you really need to break tasks down, not just trust a token predictor to list steps that sound good. There has to be a human in the loop somewhere, because by the time you start skipping permissions, best case you get the jackpot, more likely is you get a suboptimal solution and token waste and what's genuinely still terrifying when the model ignores instructions and does some stupid nonsense, ruining your day. It really is as sharp as a CNC machine. It's not not useful, but could be dangerous, so maybe don't try to carve wood with a monster machine, or park your Ferrari in that crammed neighbourhood if you don't know how to parallel park.
"Next token prediction" is an interface, not an algorithm. A process that "predicts next tokens" can be arbitrarily complex or simple, and arbitrarily capable or incapable of performing a given task.
Saying that an LLM can or can't do something because it's a "token predictor" is a category error. The interface isn't a hard limit.
I'm not sure if it's has any real bearing on real-world performance, but technically next token prediction makes it an online algorithm and they can be provably worse than (good) offline algorithms.
For something like "a hard limit" to hold, LLMs must be restricted to only reproducing existing text. This is utterly false even for base models - their basin seems to be "permutations loosely inspired by existing text".
I'm not sure how you're defining "intelligent", but I'd like to know how it is able to exclude a language model, while still including humans, without simply defining it with an axiom that predefines LLMs as lacking intelligence.
Intelligence is the complete opposite of an LLM. Usually the more you needed to memorize to do something the less intelligent you were considered.
It was also not considered to be a different route to the same thing, but more like fraud.
Also conceptually I could just write the weights on paper and do the billion multiplications on paper without any computer, does that mean I am the paper or the numbers or what??
> Intelligence is the complete opposite of an LLM. Usually the more you needed to memorize to do something the less intelligent you were considered.
Contrary to popular belief, training a LLM is not just about memorization (overfitting). There is some memorization happening, but well-trained LLMs also generalize.
Intelligent humans are capable of following diverse and intricate analogies and draw lessons from seemingly unrelated events. Try asking an LLM to summarize an article and use an imprecise way to state your view. Ask it to push back. You will be drawn into so many pedantic arguments that burn through your tokens within a few messages, you'd wonder if there's someone deliberately taking over the keyboard on their side and spending your token limit. This would never happen with an intelligent human being unless they have nothing better to do and want to troll. This is a speech pattern that LLMs are trained on, it's not a show of intelligence. This also applies to LLMs claiming consciousness: The internet is full of people writing about sentience, talking to "superior aliens" in blog posts, forum threads etc. It's the speech pattern that's copied, not actual thoughts and feelings because LLMs perceive, suffer, have aims or dreams...
Agentic systems use LLMs, and they are absolutely able to follow diverse and intricate analogies. I use them frequently to hunt down notoriously difficult to find memory leaks, in codebases too large for a human to read in a single sitting. They are able to not only follow those intricate paths, they're able to discover solutions and apply those solutions. I use these systems quite a bit, and it's nothing like you've described.
An LLM has a fixed number of ways it can express itself. we can give it an array of 14 billion options but it still has to chose one to output. Humans have no such limitation.
An LLM does not persist in consciousness from one token to the next. Each generation, happening hundreds of times a second, will be initialized, generate an output, and terminate. Humans are not stateless like an LLM.
You're conflating a singular model with a much larger system, but I want to address some of your points anyway.
> An LLM has a fixed number of ways it can express itself
While deterministic, there is not a fixed number of ways it can express itself, given that we can use settings like temperature to inject randomness into the output.
> An LLM does not persist in consciousness from one token to the next
While a model alone does not update itself to persist some form of history, there are a number of ways to overcome this, e.g. episodic memory, fine-tuning, and other self-improvement systems exist, which can indeed carry forward what you've called "consciousness".
> Humans are not stateless like an LLM.
A single LLM might be stateless, but an agentic system that relies on LLMs is very often not.
> While deterministic, there is not a fixed number of ways it can express itself, given that we can use settings like temperature to inject randomness into the output.
You're missing the point, which is that no matter the process involved. The LLM can only ever output one of the tokens in its token vector. It can't invent a new symbol or character. It can't leave and go build a church. It has to output a little piece of data for you.
You're moving the goalpost. If the definition of intelligence is based on ability to "go build a church", then we've ruled out the vast majority of the animal kingdom from being labeled "intelligent". If you cannot be consistent in your definition of "intelligence", then you cannot have a reliable litmus test for it.
I wasn't trying to make a reliable litmus test for it.
Either way, if you consider animals, LLMs are even more poorly positioned. They can do exactly none of the things my cat can do. An LLM can string together words, but if my cat is intelligent, it's clear that stringing together words is not synonymous with intelligence, since my cat can't do that.
Animals do in fact "string words together", e.g. parrots. You're also misidentifying what "language" is. Language in this context is not just the ability to string word together. Consider a musician, when they learn to play an instrument, they are learning the language of that instrument. Notes are tokens, ensembles are sentences and paragraphs. I'm afraid you're experiencing conformational bias, because every piece of evidence presented to you has been dismissed with things like "stringing together words is not synonymous with intelligence, since my cat can't do that".
Chinese whispers, simulacra... I don't have the energy to argue after being name called, but you get the point. Yes LLMs are useful in building automatic telling machines, but ask it to do anything more substantial and all you are doing is burning tokens at the altar of Anthropic and hope. That just doesn't fly in regulated industries.
It's impossible for someone to doubt their own sentience. The literal act of doubting is enough to dissipate all doubt. Solipsism is essentially the one certainty that every mind out there has.
Doubting the sentience of machines and even other humans is perfectly fine though. Only empathy allows people to make the leap and assume other humans have souls.
> It's impossible for someone to doubt their own sentience. The literal act of doubting is enough to dissipate all doubt.
i never found this convincing. just because you can loop does not mean you are sentient/conscious. what would it look like if you didn't exist and there was just a system that interrogated neural inputs and produced neural outputs in a loop? if anything, LLM's as an existence proof made this more likely to be the actual case.
"Realize" is too strong a word. You're the only one who can verify that you're the soul who's staring out at the world through your eyes. For all you know, everyone else could be just biological automatons, golems.
Any leap beyond that is based on empathy. You have a soul, and you are human, therefore other humans could have souls too. It's a spiritual belief. Answers to questions that cannot be answered.
That's the standard Piagetian understanding of child development, yes. Humans do not start out with theory of mind, and are thus inherently solipsistic, but in most cases an understanding that there are other conscious beings with their own thoughts, goals and feelings develops between the ages of 2 and 7.
Developing theory of mind is one of the key milestones in child development.
I mean, conversationally, of course we work a little more like that (I tend to think in whole sentence blocks before I say them but I suppose they assemble themselves largely word-by-word, or word-by-word with a bit of editing).
But right now I am trying to design something -— a physical mechanism with a particular enclosure — that I cannot clearly describe (this makes it hard to research). I designed a previous version without even knowing the words that do, in fact, describe that.
I have a theory about it, animated in my mind, that I can only test by making it.
If I want you to know about it, I can either show you it or work out words to describe it, which will be inadequate to describing it.
The idea for it came from seeing things nobody has ever put into words for me.
"Next-word sayer" doesn't describe any of this process, does it?
while the how is different, the what has many parallels. E.g. both the brain and LLMs appear to learn distributions of representations, they both develop a hierarchy of those representations, both have early layers that process simple features, with later ones processing more abstract concepts, both predict missing information...
The post I responded to stated that the commenter was just a next-word-sayer, but that's wrong. The similarities you draw aren't really relevant to my reply.
no disrespect intended, however I think my response is relevant, because the broader topic here is whether LLMs and the human mind share similar functions. They both do in fact have a lot of overlapping features, and a fundamental one is predicting next-thing, be that a word, image, or otherwise.
It's not relevant. However, if you want to talk about a broader point, that's ok.
> LLMs appear to learn distributions of representations, they both develop a hierarchy of those representations, both have early layers that process simple features, with later ones processing more abstract concepts, both predict missing information.
This type of superficial comparison isn't very meaningful, it's trivial to liken anything to a human biology in this manner.
A plane and a bird both use wings to produce lift, it doesn't then follow that a bird and a plane are meaningfully similar.
> A plane and a bird both use wings to produce lift, it doesn't then follow that a bird and a plane are meaningfully similar.
The use of Bernoulli's principle to achieve lift is a fundamental and meaningfully similar function of both airplane and bird wings. That functional similarity is well known.
> This type of superficial comparison isn't very meaningful
The comparisons I provided are fundamental to both the human mind and LLMs.. that's pretty darn relevant.. and whether you find that trivial or not is a matter of opinion.
Even if you could understand human cognition to the level required to say, confidently, that it’s done one word at a time, it’s likely not! Natural language is not a prerequisite for human intelligence, as evidenced by the fact that we went from primates to commenting on HN.
Natural language is, however, a prerequisite for the existence of LLMs. It’s more similar to methods for storing and retrieving information, like the printing press or a database, than it is to a sentient being.
That’s not to say that LLMs can’t do crazy things, because they already have. Our language can encode a whole lot of information, and it’s incredible that we’ve found a way to distill that so effectively.
Even if you could understand human cognition to the level required to say, confidently, that it’s done one word at a time, it’s likely not!
I think they’re not talking about cognition, but about output: regardless of what may be happening inside your brain, ultimately one word at a time comes out of your mouth, right? And you can’t then unsay it.
When you put it in those terms, LLMs are in exactly the same boat.
Interesting thought but I assume a lot of samples in the training corpus are examples of translation between languages and the same text in different languages.
Calling LLMs 'next token predictors' is completely reductive and disingenuous; it's true that technically that is what they're doing, but so are you! What people generally mean by this though is that they're just 'predicting the next token of their training [i.e. the internet]'. If you were talking about the raw models, this would actually be true; but the models are post trained, so even this description isn't true at all anymore! Saying they aren't 'intelligent' is both not useful and (imo) wrong. Who cares if it matches your definition of 'intelligent'; it still gets impressive stuff done, much more impressive stuff than you seem to be implying.
> I think we're mving towards humans no longer needing to understand a codebase, and letting AI drive it.
The AI companies are incentivized to push this kind of reckless slopmaxxing - the end result is that your business is totally dependent on them and your product's value entirely sourced from them. And a lot of people are buying it, but I think it's a silly fad.
> I think we're mving towards humans no longer needing to understand a codebase, and letting AI drive it.
I can see this being true for non-critical software like entertainment, media, and so on.
Definitely not true for systems where security stakes are high. Like banking, aviation, defense, etc.. AI will surely contribute but not independent of human engineering understanding.
In all those fields you mentioned, they have a lot of strict compliance measures and it is highly unlikely that AI will just be able to take over. Ironically almost all of aviation code is actually machine-generated using things like Simulink
> Except that said AI can now themselves use your software and find and fix bugs themselves, not to mention drive new features.
Anyone with sufficiently good taste in how to program effectively and architect will disagree with you on this. The short leash method is how you ensure good results when you're functioning outside of the training data. If you're even a modestly above average programmer this is afaik the only way to ensure fast, quality development with LLMs.
> This again feels outdated. I think we're mving towards humans no longer needing to understand a codebase, and letting AI drive it.
I think you are perhaps unaware of a world of programming where AI is still woefully inept. I have observed very consistently in all languages with manual memory management frequent issues with handling it. Trust me, it's not as simple as sticking it in a loop with Valgrind.
> This happens but far less often than it used to, and the case for full autonomous agents is getting stronger, not weaker.
This is that I do not see. My journey, just couple weeks ago, Claude Code + Opus 4.8. The task was not too complicated, 4 new API endpoint plus events streamed from client by websocket.
1. Multiply iterations on API definitions, refine request/response models, database schema, whole flow. A lot of corrections, removing contradictions, manual changes in document. Opus went of rails all the time. 500+ lines final document
2. API Integration tests. Once again, back and forth. AI was unable to create tests directly from document, so 2 iterations: Create placeholders with Given-When-Than comments, review an correct by hand. Second iteration was to implement tests. A lot of mistakes corrected after review.
3. Implementation. CC got api document, working tests ( modifications blocked by hook ), 6+ "best practices" skills ( most promptly ignored ), "rubber duck" and "code simplifier" agents, pre cooked scipts to run tests, linter, and check for compilation errors. Plan + execution + review, multiply corrections on the way. Feature implemented, all tests passed.
4. Code review. At average, found one issue per 20 lines of code. Not count code style, things like: Use in memory semaphore in kubernetes service (deployment described in CLAUDE.md ), 8 database calls to update the same record during a single request. One column at a time! Read-modify-save without transaction. Mistakes in business logic, failure recovery, authorization.
The result: almost one workweek, $100+ in tokens, and one thought: did it worth the effort ?
P.S. I have a team of 2 developers. Just got PR to review from one of them. 80% slop.
Same thing I'm seeing, all the "AI practitioners" at my company with their advanced workflows are just shipping mountains of slop, and end up either putting the actual work on the reviewers, or the poor soul that's on call when an incident occurs.
I feel like people that have built crazy AI workflows have developed a false sense of confidence that their guardrails are helping them ship clean/correct code with little review when it isn't the case at all. In reality, the models and harnesses are at a point where there's very little difference as long as your prompts are somewhat reasonable, and the quality of the code ultimately comes down to the level of care and effort the implementor puts into it.
I don't think the first people that are going to be replaced by AI are going to be the people who don't use it extensively. The first that will be replaced are going to be those that are using AI mindlessly, because at that point, what are you besides a very expensive human LLM interface? To be clear, I'm not "anti-AI", I use AI quite extensively (in a way that's similar to what's described in the article), I just think that it's being pushed in a completely unsustainable way and the industry is in a collective psychosis over it's capabilities.
> The first that will be replaced are going to be those that are using AI mindlessly, because at that point, what are you besides a very expensive human LLM interface?
I think this archetype has a good chance of surviving. Not because of merit, but because they will be the only ones able and willing to work on projects taken over by AI slop.
I'm very much aligned with everything else you said.
> I think we're mving towards humans no longer needing to understand a codebase, and letting AI drive it.
Hard disagree. Even the best frontier models generate output that's not what I asked for. Sometimes I realize that I get lazy in my prompting and the lack of specificity winds up showing up in the output. Just the other day, a coworker built a huge feature using frontier models and it slipped an IDOR in.
I just don't see a world in which we completely cede control of the codebase to AI because it's still my ass on the line if I ship something that completely borks production. If I'm not reading code regularly, then I lose the ability to read code, and if I lose that ability, then I'm no longer a developer.
> Sometimes I realize that I get lazy in my prompting and the lack of specificity winds up showing up in the output.
I wouldn't blame your "lazy" prompting. Specification is just really hard. This is why we stopped doing waterfall software development. I think the current-day obsession with one-shotting software forgets why we had to stop trying to figure everything out up front.
I can't help but feel that this reads more as a reflection that you don't want to stop being a developer than it does that thing's aren't moving in the direction that the GP said it is.
Maybe, it seems like a bad idea for so many reasons though. Take away tactile code review, insert a layer of prompts and tooling between developers and the codebase, and you've created the conditions to let all kinds of nefarious things happen in a codebase. A disgruntled employee updates agent prompts instructing the code review bot to ignore data exfiltration vulnerabilities (because if we aren't reviewing code, we're probably not reviewing prompts either), ships a backdoor, and you better hope that your network monitoring catches it.
If you are just shipping code blindly without reviewing anything then that's your fault. My company heavily uses AI (I'd say 90% of code is written with AI assistance) but we never ship anything that hasn't been reviewed by a human.
This is how we use it for code reviews:
- a skill tells the agent to automatically run a subset of tests and linting before each commit
- another skill tells it to review the entire changeset before creating a PR, this review has more extensive rules that can't easily be put into code (e.g. linter rules) based on PR comments humans have written. It also sometimes catches things that were missed from the original prompt/task.
- when the PR is created we run a few AI tools to do automated code and security reviews. CI runs at the same time.
- the agent waits for these to complete, and verifies and fixes any issues if they are valid
- after all that it's passed back to the author to review
- once they are happy it's passed to a teammate to review
So we are not handing off reviews to AI, we are using it to do much more extensive reviews, and automatically fix stupid stuff the AI or human might have done. So by the time you are asked to review a PR, it should be pretty much ready to go, you can focus on what it's actually changing instead of looking for slop.
> If you are just shipping code blindly without reviewing anything then that's your fault.
Did you miss or already forget the context of "humans no longer needing to understand a codebase, and letting AI drive it"? You're not doing that, either. You cannot "review" something you don't understand. You can "try it out" maybe.
The thread I responded to is about no longer needing to read code at all, not AI-assisted code-review. I definitely use AI-assisted code review. OP is arguing that one day we won't need to read code at all, which I disagree with.
> This again feels outdated. I think we're mving towards humans no longer needing to understand a codebase, and letting AI drive it.
Seems so, but that doesn't mean it's a good or correct direction. As of today, none of the existing models can meaningfully handle mid-size tasks on five services with 10k+ LOC each, plus infra (I'm really not interested in greenfield projects done over the weekend that were never touched by actual users). It doesn't make them useless, but it significantly reduces the scope of trustworthy operations models can handle (unless you don't care about outcomes).
The moment your spec, plan, and results of related codebase exploration go beyond 100k tokens (roughly 50% of available context), quality degradation becomes real. Threads/subagents can help, and you can argue that code reviews mitigate some issues, but that's transitioning from reliable automation to gambling without human oversight. Say you want to mitigate the risks of failures (correctly listed by others) - how would you do that if you don't understand your codebase? In my practice, the answer is: you start to learn what your agents created, discover shit they created, and steer them toward better, desired outcomes.
Whatever we're moving toward, I currently can't let any SOTA model + harness operate on more than ~10k changed SLOC at once, and even then only with very careful prompting I thoroughly understand, only on the simplest of problems, and only if I pause it at key points to correct some sort of nonsense thinking and put in a significant cleanup pass and am still willing to tolerate some bullshit. Tooling is impressive for sure, but it's not magic.
I tried a similar approach before, but it didn't work for me. I didn't get a lot of speedup if any from it. IMO, to get productivity you need some kind of YOLO mode (in a sandbox).
IMO, the goal should be to outsource as much work to the model, as possible, while minimizing effort required to understand and review what is did. For example: ask the model to find out why a bug happens, figure out proof of concept for thing X, incrementally optimize something, do a well specified refactoring with some guide, and similar things.
IMO, what people say about creating loops is a very similar thing. You maximize the work done by the model, while minimizing the amount you need to do to control it.
Last year it was, “AI is just a stochastic parrot.”
This year it’s, “AI can write the code, but a human still has to review it!” (Using AI, of course.)
Give it another year and the narrative will be: “Only AI is capable of reviewing code, and only AI can review the AI’s review. Humans just need to read the AI’s final opinion so they still have meaningful oversight.”
The goalposts keep moving. The certainty never does.
Why shouldn't the goalposts move? That it was possible to beat or tie a chess master, if you had enough computational power, was basically the content of a theorem of Zermelo over a hundred years ago. It differs not a whit from tic-tac-toe. Even Eliza was practically passing the Turing test, which seems comically silly now. There's just an incredible amount of computational power so all sorts of things are possible that were formerly unimaginable - like training LLMs on the whole corpus of extant human discourse.
The regress ends somewhere, because (barring some pretty sharp changes to the way the law works basically everywhere) ultimately someone has to certify the outcomes as acceptable. This might be in the form of the market (though AI-adjacent stuff seems extremely prone to prolonged market failures), this might be regulatory in nature. This might be the executive management of the companies involved.
Personally I think that if you cranked the capability up high enough the first person you'd run into who absolutely demanded more than vibes and didn't care about your singularity thesis would be the representative of a reinsurance firm: mostly to do serious stuff without bending the law, you need insurance, and I am unaware of anyone writing serious policies (certainly not ones that make any economic sense) that underwrite the risk of AI autonomy outcomes financially.
When Swiss Re writes a policy that Anthropic Cinematic Universe or whatever iteration we're on won't fuck it up?
Now maybe we're talking. Until then you ask three practitioners and get nine answers, no one knows what they're talking about unless they're doing a really good job keeping it quiet (and that's probably what you'd do!).
This really helps I was organically doing something very similar to this but it wasn’t the “conventional wisdom” this makes me want to try and integrate AI again.
To me a lot of the anti-short leash sentiment is reflective of the low accountability SWE have always had for their output. Software devs seem to strongly reject the concept that it isnt ok to ship defective products and fix later. It will be interesting to see if it persists as incidents start to occur due to fully automated code.
Maybe I'm too optimistic, but given appropriate skills and references (not just for writing but also reviewing) and intelligent use of subagents for isolated reviews and checks, you can lengthen the leash a bit.
But you still need to properly review plans and PRs to keep a good mental model of the codebase. This effectively limits the number of tasks being done in parallel to maybe 2-3. Though you'll be mentally exhausted and probably start to make mistakes or take shortcuts in reviews yourself.
This reminds me of the workflow I had a year ago. Miss Aider so much. Are there any good open source agents right now? Might be a good time to try one soon as Fable switches to token-based billing, which Code is designed to maximize.
You can only check your email so many times, and you can only work on so many problems at once. You also generally have to be mindful of token consumption. I think it also leads to burnout to work on so much at once. I've been working this way for like a year.
I’m not sure I understand. Babysitting models is not a multiplier IMO. If you have done 1000s of turns your harness should get sharper and less likely to go off the rails.
Also I find that on greenfield, babysitting is a must, but once you have established your house style of patterns, abstractions, and baselines, you can let any of them roam free cause they will look for examples before going forward.
I agree with the sentiment though that if you let a swarm design and code your whole codebase, you will be lost in how it fits together. More feature bloat than code bloat though from my experience
Seems like a common-sense approach. I appreciate the emphasis on understanding, humans will eventually be held accountable, blaming Claude for an outage is not going to get Claude fired.
It's really an extension of the abstraction debate.
X86 was designed for performance. The language really hates humans compared to machine languages that came before. I thought it was a truly stupid idea at the time but had to change my mind eventually.
Then we glue on many layers of abstraction and made everything as convenient for the programmer as possible. Performance became unimportant!
It imho begs to question why we are even using x86 or risk or even FORTH if performance doesn't matter. Make something luxurious that doesn't need to be compiled? Perhaps plug and play coprocessors named after libraries.
But if we aren't going to look at the code anymore we might as well write the application in English and give the LLM some cache. Go full prayer driven development.
I thought it was going to be even shorter leash - code autocomplete with smaller local models. That raises the level of interactivity and leads to better code knowledge.
I'm curious whether Opus4.8 or similar can attain Mythos level through good system prompting and steering? You would expect this to work if it's true that the strength of Mythos is its unwillingness to quit before it gets a desired outcome
As a Mythos user (I’m part of Project Glasswing), I would say that abliterated models [1][2] produce similar, if not identical, results. While good prompting and steering won’t give Claude Opus 4.8 the same capabilities as Mythos (preview 1), using abliterated models (if you have the computational power to run the larger ones) will get you close to the same goals as people who have access to Mythos (preview 1) [3].
[3] I specifically refer to “preview 1” because the newer versions (Fable 5 / Mythos 5) don’t appear to offer the same level of freedom as the very first version that I was able to use through Project Glasswing. This is one of the reasons why I continue running our massive security scans with “preview 1”, or at least I was running them until June 30, when the program’s policy changed.
I think that Anthropic is gaslighting us with their new model releases. Specifically, I think they have some good base model and are just fine-tuning it until they achieve desired outcome, or the desired outcome is achieved accidentally as part of fine-tuning. My theory is based on the fact that as a long-term (if you can call it that way) Claude user I keep noticing the same patterns it outputs. It's not trivial but certainly possible to see when something has been written by Claude because it has a different style than GPT.
However they have quite good harness in their backend which is the actual model.
Here I thought this was about Fable the video game, then I remembered Anthropics model got named Fable. It's going to be painful to google one of my favorite game series, just like googling "Rust server" does not give you Rust programming results, but Rust the video game results. I wish google would have fixed this problem long ago, it seems like something trivial for them to fix.
I mean, the key is to stop trying to one-shot everything: The main problem I found with LLM code is more that they always try to take the shortest path to the solution possible, so a lot of time Codex would write code that meets the requirements of the prompt but misses something that cause it to not work in the non-ideal scenario.
The solution for that is pretty easy too, it's just iteration: you describe the exact problem you have with the code and why it is not running correctly and ask them to provide a narrow fix that addresses the bug. It's not that complicated.
I'm convinced that even if/when ASI is achieved we will still have mediocre engineers writing blog posts about how they have uncovered the secrets to using these tools "effectively".
This is probably slower than writing the code yourself. Doesn't make sense to me. Using an agent without YOLO mode is not wort it.
The way I rather do it is tightly control the output by skills written yourself, prompts, plans, etc. and have the closest possible outcome you would write yourself.
Not really if it takes you 15 minutes to write a 50 line function but it takes the AI 90 seconds then you already are at a 10x speedup just for this task.
This (non-yolo mode AI coding) is actually how we used to code in the old days (2023).
I have found a different model should be used to do the review - like if Claude did the code, Codex should review. Models reviewing their own code is a recipe for disaster.
I <3 how everyone and their brother feels qualified to write advice to hundreds? thousands? of other developers about AI ... based on a couple months of experience as a personal user.
I mean, it's like writing a book about how to use React or Django or some other major software ... after you used it for one project for a month!
Authors: I know this is the Internet, and I know bloggers blog about whatever pops into their head ... but if you are going to act like an authority, how about you learn more than the average reader before you start telling them authoritatively what to do?
People are doing what they've always done with any other new technology, and sharing what, personally, works for them. People can take or leave the advice.
Right but there's a marked difference between a "I just tried this new tech and here's what I think" vs. "I've used this tech for a few months and now I'm going to speak like I know everything about it".
I have no beef with people writing about new tech, but I do have beef with claiming that "____ is the correct way to do it" ... based on nothing except "I feel proud of the last three months I spent with Claude".
It's an open problem of clearly large value how to get reliably useful and trustworthy outcomes from AI systems in many domains, software is maybe the signal example of that. If one had solved it resoundingly and scaleably, one could in fact "get rich quick".
It is unsurprising that a lot of people claim to know how to get rich quick.
I believe it is possible to solve this problem, and I have my own horses in the race which I won't threadjack to promote here, but it's the central problem of our profession at the moment. We've all seen the truly discontinuous outcomes and we've all seen allegedly national security dangerous models (which at one time was GPT-3) faceplant with it's shoelaces tied together. I wanted to see if Fable was really all that and I left it overnight on some fairly straightforward C++ (code DSv4 Flash works on with moderate supervision) and it's pretty roast worthy, I gave it a chance to redeem itself this morning and it's ticked up a bit (I still think it's roughly Opus 4.8 with a Project Zero fine tune and DRO trained off the constant gratuitous yield tic which is pretty clearly an intentional gimp).
I give all such claims 30 seconds of my time because someone is going to actually be right one of these days.
There are a lot of people with a long career in the old way of doing things are feeling incredibly threatened and defensive and desperate to virtue signal about AI.
This post seems like some decent advice mixed in with a lot of overconfidence and unverifiable claims.
“expert developers whose skills have reached the point where they outclass any and all “frontier AI models” in their area of expertise”
Are any developers saying they outclass any and all frontier models? I’d say at best it’s mixed at this point. The best developers still do certain things better, but not even close to all things.
“The problem is that even code written and/or reviewed by Fable 5, will stink”
FTA: Contrary to marketing statements made by certain CEOs, these models are not able to think beyond their training data.
The sheer cognitive dissonance needed to say something like that at a time when AI is delivering novel math proofs is... well, not actually impressive. Mostly, it's just sad.
Some part of him must know such a statement is not true, or more properly, that it's meaningless. But he says it anyway, because he thinks it makes an impression of insight and erudition on the listener.
If you think what it does is brilliant, you're not ready (to use AI.)
At some point in one's journey to engineering enlightenment, one recognizes how rarely "brilliance" is actually called for, and indeed how counterproductive such self-judged "brilliance" often turns out to be in the long run.
Clearly the author is still striving to reach this particular stage.
From these comments I find it funny how mad some people get when someone finds success working without the latest “state of the art” AI slop methods. Some people really seem to have vested interest in pushing the AI coding supremacy. Probably people from Anthropic in here.
Good luck with that. I used to be an OCD freak about code before LLMs, but AI coding has largely freed me of that limitation. I've become very comfortable giving AI a long leash, but only after being meticulous about curating the context.
These days I spend most of the day in discussions and planning, producing documentation, agonizing over architectural decisions, edge cases, and naming conventions. Once that's all settled I'll hand off implementation work to run overnight. In the morning, I'll review and fix, but I'm usually pleasantly surprised with the results.
One pitfall is long leash without a curated context, which is more like "slot machine" coding. Usually not effective, and may have addictive effects since it does occasionally work.
To spice things up lately, I've been encouraging the model to produce its own "capstone" -- a feature it decides to build on its own, however it wishes, with the tools at its disposal. So far it's been conservative, creating useful tools for development rather than customer facing features, but I'm curious to dial up the temperature to see what it might come up with.
Better method start to realizing that everything that every program do is data transformations and or movement
Then you ask llm to subdivide data in a tree along the domain model, classifing streaming vs storing nodes
Then for each node you discuss with the ai for the best data structure
Then you ask for an interface that fully encapsulate the structure and every mutation only allows to go from a valid state to a valid state and bidding else is allowed to touch the state
And that's mostly it just connect all the interfaces until input goes to monitor or to storage or to api or wherever the destination is
Efficient != effective, and the author outlines as much. Regardless, while you're technically correct, it's kinda like saying the Fantasy Land Specification[1] (aka the "Algebraic JavaScript Specification") is pure. The problem is that purely functional fantasy lands rarely exist outside of fairytales. In other words, life is a lot like JavaScript and never that simple.
Hand-holding great models like Fable through implementation is a waste of time, and a waste of Fable. You can have increasingly nuanced discussions with stronger models, and they write a lot better code than they used to. The process of discussing designs and their implementations, questioning things that look weird to you, and actually reading the AI’s responses also helps to find better solutions.
For example, one time I wanted to write a greedy solver for a problem, and in my discussion with Opus on the idea it suggested using an existing MILP library to solve the problem exactly. I’d never even heard of MILP, but my final implementation ended up being better and simpler than what I’d have done alone.
reply