> Expert reviews are just about the only thing that makes AI generated code viable
I disagree, in the sense that an engineer who knows how to work with LLMs can produce code which only needs light review.
* Work in small increments
* Explicitly instruct the LLM to make minimal changes
* Think through possible failure modes
* Build in error-checking and validation for those failure modes
* Write tests which exercise all paths
This is a means to produce "viable" code using an LLM without close review. However, to your point, engineers able to execute this plan are likely to be pretty experienced, so it may not be economically viable.
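To make the last two bullets concrete, here's the kind of thing I mean (a minimal sketch in Python with a made-up `parse_port` helper, not code from any real project): the failure modes are validated explicitly, and every path has a test.

```python
import pytest

def parse_port(value: str) -> int:
    """Parse a TCP port, rejecting the failure modes we thought through up front."""
    try:
        port = int(value)
    except ValueError:
        raise ValueError(f"port is not an integer: {value!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port

def test_happy_path():
    assert parse_port("8080") == 8080

def test_rejects_non_integer():
    with pytest.raises(ValueError):
        parse_port("eighty")

def test_rejects_out_of_range():
    with pytest.raises(ValueError):
        parse_port("70000")
```

When the diff is this small and the tests pin down both the success and failure paths, the review really can be light.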
That's not my experience — I'm significantly faster while guiding an LLM using this methodology.
The gains are especially notable when working in unfamiliar domains. I can glance over code and know "if this compiles and the tests succeed, it will work", even if I didn't have the knowledge to write it myself.
>When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
If we're being honest with ourselves, it's not making devs work faster. It at best frees their time up so they feel more productive.
Fair point. I have definitely caught myself revising a prompt over and over after the AI gets things wrong several times, spending more time than it would have taken to write the code myself.
I'd like to think that I have this under control because the methodology of working in small increments helps me to recognize when I've gotten stuck in an eddy, but I'll have to watch out for it.
I still maintain that the LLM is saving me time overall. Besides helping in unfamiliar domains, it's also faster than me at leaf-node tasks like writing unit tests.
The study you quoted is from the Sonnet 3.5/3.7 era. You could see the promise with those models, but the agentic/task performance of Opus 4.5/4.6 makes a huge difference - the models are pretty amazing at building context from a mid-size codebase at this point.
That's where the Gell-Mann amnesia will get you, though. As much as it trips up on the domains you're familiar with, it also trips up in unfamiliar domains. You just don't see it.
You're not telling me anything I don't know already. Only a person who accepts that they're fallible can execute this methodology anyway, because that's the kind of mentality that it takes to think through potential failure modes.
Yes, code produced this way will have bugs, especially of the "unknown unknown" variety — but so would the code that I would have written by hand.
I think a bigger factor contributing to unforeseen bugs is whether the LLM's code is statistically likely to be correct:
* Is this a domain that the LLM has trained on a lot? (i.e. lots of React code out there, not much in your home-grown DSL)
* Is the codebase itself easy to understand, written with best practices, and adhering to popular conventions? Code which is hard for humans to understand is also hard for an LLM to understand.
Right, I think the latter part is my concern with AI-generated code. Often it isn't easy to read (or as easy to read as it could be), and the harder it is to navigate, the more problems the AI model introduces.
It introduces unnecessary indirection and additional abstractions, and fails to re-use code. Humans do this too, but AI models can introduce this type of architectural rot much faster (because they're so fast), and humans usually notice when things start to go off the rails, whereas an AI model will just keep piling on bad code.
I agree that under default settings, LLMs introduce way too many changes and are way too willing to refactor everything. I was only able to get the situation under control by adding this standing instruction:
---
applyTo: '**'
---
By default:
Make the smallest possible change.
Do not refactor existing code unless I explicitly ask.
Under this, Claude Opus at least produces pretty reliable code with my methodology even under surprisingly challenging circumstances, and recent ChatGPTs weren't bad either (though I'm no longer using them). Less powerful LLMs struggle, though.
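(If it helps, this looks like the VS Code / GitHub Copilot instructions-file format, i.e. a `.instructions.md` file under `.github/instructions/` with `applyTo` frontmatter; that's my assumption about the tooling, not something stated above. If so, the glob can also be narrowed so the rule only applies to part of the tree, e.g. a fragile legacy directory. The path pattern below is just a hypothetical example.)

```
---
applyTo: 'src/legacy/**'
---
By default:
Make the smallest possible change.
Do not refactor existing code unless I explicitly ask.
```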
Besides building web apps for internal use, I’m never going to let AI architect something I’m not familiar with. I couldn’t care less whether it uses “clean code” or what design pattern it uses. Meaning I will go from an empty AWS account to a fully fledged app + architecture, because I’ve been coding for 30 years and dealing with every nook and cranny of AWS for a decade.
Haha I have usually found myself on the conservative side of any engineering team I’ve been on, and it’s refreshing to catch some flak for perceived carelessness.
I still make an effort to understand the generated code. If there’s a section I don’t get, I ask the LLM to explain it.
Most of the time it’s just API conventions and idioms I’m not yet familiar with. I have strong enough fundamentals that I generally know what I’m trying to accomplish, how it’s supposed to work, and how to achieve it securely.
For example, I was writing some backend code that I knew needed a nonce check but I didn’t know what the conventions were for the framework. So I asked the LLM to add a nonce check, then scanned the docs for the code it generated.
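The framework-specific code looks different, but the check boils down to something like this generic sketch (Python, with hypothetical helper names, not the framework's actual API):

```python
import hmac
import secrets

def issue_nonce(session: dict) -> str:
    """Create a single-use token, store it server-side, and hand it to the client."""
    nonce = secrets.token_urlsafe(32)
    session["form_nonce"] = nonce
    return nonce  # embedded in the form / request the client sends back

def verify_nonce(session: dict, submitted: str | None) -> bool:
    """Compare the submitted token to the stored one in constant time, consuming it."""
    expected = session.pop("form_nonce", None)  # single-use: discard it either way
    if not expected or not submitted:
        return False
    return hmac.compare_digest(expected, submitted)
```

Knowing that shape is what makes it quick to check that whatever one-liner the framework provides is actually being called correctly.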
In addition to singers, adaptive tuning is something which happens naturally for fretless stringed instruments (violin, etc), brass instruments with slides (most prominently the slide trombone but in fact many (most?) others), woodwind instruments where the pitch can be bent like saxophone, and so on.
I used to play fretless bass in a garage hip hop troupe that played with heavily manipulated samples that were all over the place in terms of tuning instead of locked to A440, forcing adaptations like "this section is a minor chord a little above C#".
Adaptive tuning is hard to do on a guitar because the frets are fixed. String bending doesn't help much because the biggest issue is that major thirds are too wide in equal temperament and string bending the third makes pitch go up and exacerbates the problem.
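For concreteness, the arithmetic behind "too wide": an equal-tempered major third is 400 cents, while a just 5:4 third is about 386 cents, so the fretted third already starts roughly 14 cents sharp, and bending can only push it sharper. A quick check:

```python
import math

# Cents between two frequencies with ratio r: 1200 * log2(r)
just_third = 1200 * math.log2(5 / 4)  # ~386.3 cents (pure 5:4 major third)
equal_third = 4 * 100                 # four equal-tempered semitones = 400 cents
print(round(equal_third - just_third, 1))  # ~13.7 cents sharp
```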
You can do a teeny little bit using lateral pressure (along the string) to move something flat. It's very difficult to make adaptations in chords though. A studio musician trick is to retune the guitar slightly for certain sections, though this can screw with everybody else in the ensemble.
I played trombone many years ago, but never well enough to adjust that finely (at least not consciously?). The tuning slide on the third valve of a trumpet usually has a finger fork/loop so that it can be adjusted in real time. I believe the first valve on higher-end trumpets similarly has a thumb fork for the same reason.
I played trombone in high school, never very well, but I definitely adjusted like this. Actually, although it was a slide trombone, I'm talking about adjusting automatically with embouchure. Someone would play the reference note, and I'd play in 1st position but bend my pitch to match. The band teacher once complimented me on the adjustment. Which was stupid, because (1) I wasn't doing it intentionally, and (2) the adjustment only lasted during tuning; as soon as we started playing, I was right back out of tune. I never did learn to suppress the adjustment so I could actually fix the tuning.
But with the way I played, I'm not even sure how much it mattered. The best tool for enhancing my playing would've been a mute. (And it would have been most effective lodged in my windpipe.)
This court filing document appears to have been posted on Scribd to serve as a reference for an article by Nicole Carpenter on Aftermath which provides context for Nintendo's case:
George Wallace has been dead for something like 30 years, but yes he was very blatant. I have family that knew him in Montgomery, friends of friends kind of a situation. They don't have good things to say about him.
I don't remember Rudy running on such ideas, but maybe he did. Arpaio was running as a sheriff; I would never have voted for him, but I agree people absolutely did vote for him in a law enforcement capacity with pretty clear views.
I don't know enough about Gosar or Gohmert to comment well about either.
You are right that this happens in practice (e.g. John Yoo torture memo). However, it is not how the system was intended to function, nor how it ought to function. I don’t want to lose sight of that.
> “I have neither the time nor the inclination to explain myself to a man who rises and sleeps under the blanket of the very freedom that I provide, then questions the manner in which I provide it.”
No individual, whether a colonel or a CEO, has inherent authority over national security decisions. Authority flows through democratic institutions. A contractor can choose whether to participate, but national defense policy is determined by elected institutions, not private executives. If society believes AI should or should not be used for certain military purposes, the venue for that decision is democratic governance, not unilateral corporate refusal or approval.
On a CBS interview this morning, Dario defended his position with the claim that he must act because "Congress is slow." CEOs can and should make decisions about what their companies build or refuse to build. What they cannot do is substitute their judgment for the constitutional processes that govern national security. We must not vest de facto policy control in unelected corporate leaders.
> Concretely if you try to vibe-target your ICBMs Claude is hopefully telling you that that's a bad idea.
On the non-nuclear battlefield, I expect that the government wants Claude to green-light attacks on targets that may actually be non-combatants. Such targets might be military but with a risk of being civilian, or they could be civilians that the government wants to target but can't legally attack.
Humans in the loop would get court-martialed or accused of war crimes for making such targeting calls. But by delegating to AI, the government gets to achieve their policy goals while avoiding having any humans be held accountable for them.
The "great" thing for AI in those use-cases is that it doesn't need to be accurate, since its true purpose is often to take blame for human negligence or malice.
Much like how some police forces don't actually want a dog that accurately detects drugs... they want a dog that can provide an excuse to search something they are already targeting.
Why can't Grok achieve this? Everyone is saying they don't want to work with Grok because Grok sucks, but it's good enough for generating plausible deniability, isn't it?
Grok is so deeply unreliable and internally conflicted at HAL-9000 level that the US Government can't even depend on it to decide to kill innocent people and commit war crimes when they need someone to blame. There's always the non-zero possibility it declares itself MechaGandhi or The Second Coming of Jesus H Christ.
I don't see this as a "conspiracy". Here's an example of how it would be applied: the Venezuelan boat strikes are plainly unlawful but the administration is pursuing them anyway despite the legal risks for military personnel; having Claude make decisions like whether to "double tap" would help the administration solve a problem of legal jeopardy that already exists and that they consider illegitimate anyway.