I look at other people's code a lot. The security issues are always boring, that's the thing. API keys sitting in the client bundle, auth middleware missing half the routes. Not clever exploits, just nobody actually reading what the AI spit out.
Actually wait, it's worse than that. The product works, demo looks great. Then someone opens the network tab and ... yeah. "Quality doesn't matter" really just means nothing caught fire yet.
Half this list is bad attribution. LiteLLM was a supply chain attack — stolen PyPI credentials, nothing to do with vibe coding. The Amazon outage number comes from a vendor blog pushing their own product. Nobody else reported it.
But the "where's your control group" take bugs me too. It's not that AI writes buggier code line for line. The gaps are just in different places. Devs who've shipped real apps add rate limiting, auth middleware, proper CORS — because they got burned before. AI skips all of it because nobody prompted for it.
I read through about 80 AI-generated repos a few weeks ago. Code looked decent. The missing stuff was always the same list — no auth on admin routes, API keys hardcoded in client JS, CORS wide open, debug endpoints still live in prod. Over and over.
Nothing there makes a wall of shame. Nothing's exploded yet. But it's the kind of stuff that does.
Exactly. "Tests pass" and "code is secure" are just different things. AI code makes that gap worse.
I run static analysis on mixed human/AI codebases. The AI parts pass tests fine but they'll have stuff any SAST tool flags on first run — hardcoded creds, wildcard CORS, string-built SQL. Works in a demo, turns into a CVE in prod.
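The string-built SQL one is the classic example, and the fix is a one-liner, which is what makes it so annoying to keep flagging. A minimal sketch with stdlib sqlite3 (table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

# What the AI tends to emit: user input interpolated into the query string.
# Passing name = "' OR '1'='1" returns every row -- classic injection.
def find_user_unsafe(name):
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{name}'"
    ).fetchall()

# The boring, correct version: let the driver bind the parameter.
def find_user_safe(name):
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (name,)
    ).fetchall()

print(find_user_unsafe("' OR '1'='1"))  # leaks all rows
print(find_user_safe("' OR '1'='1"))    # []
```

Any SAST tool flags the first version on day one. It still ships constantly.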
And nobody's review capacity scaled with generation speed. Most teams don't even have semgrep in CI. So you get unreviewed code just sitting in production.
The "10x" is real if you count lines shipped. Nobody counts the fix cost downstream though.
They exist. Go look at any "I built this in a weekend with Cursor" post — there are hundreds. The problem is most of them ship broken and stay broken. Auth that doesn't actually check anything, API keys in the frontend, falls over with 5 concurrent users.
The quantity is there. Nobody's asking "does this thing actually work" before hitting deploy. That's the real gap.
Sandboxes yes, but who even added the dependency? Half the projects I see have requirements.txt written by Copilot. AI says "add litellm", dev clicks accept, nobody even pins versions.
Then we talk about containment like anyone actually looked at that dep list.
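The pinning part at least is trivially checkable in CI. A rough sketch of the kind of gate I mean, pure stdlib (the requirements content here is made up):

```python
import re

def unpinned(requirements_text):
    """Return requirement lines that don't pin an exact version."""
    bad = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # "pkg==1.2.3" counts as pinned; bare names, >=, ~= etc. do not.
        if not re.match(r"^[A-Za-z0-9._-]+\s*==\s*\S+$", line):
            bad.append(line)
    return bad

reqs = """\
litellm
requests>=2.0
flask==3.0.3
"""
print(unpinned(reqs))  # ['litellm', 'requests>=2.0']
```

Fail the build if the list is non-empty and at least the "nobody even pins versions" part goes away.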
Our security scanning runs on GitHub Actions — every PR gets checked before merge. When GitHub goes down, the security gate goes down with it. PRs pile up, devs get impatient, start merging without waiting for checks. That's exactly when bad code gets through. And they keep throwing engineers at Copilot while the stuff that CI/CD actually depends on keeps falling over.
250K lines in a month — okay, but what does review actually look like at that volume?
I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.
You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.
All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
Saying "I generated 250k lines" is like saying "I used 2500 gallons of gas". Cool, nice expense, but where did you get to? Because if it's three miles, you're just burning money.
250k lines is roughly SQLite or Redis in project size. Do you have SQLite-maintaining money? Did you get as far as Redis did in outcomes?
That’s like asking why we don’t switch from reviewing PRs to reviewing Jira tickets.
Sure, there’s probably a world where you could do that, if the spec were written in a formal language with no ambiguity and there were a rigorous system for translating from spec to code.
Hm, that's an interesting concept. What if we were able to create an unambiguous, rigorous specification language for creating prompts so that we could get consistent and predictable output from AI? Maybe we could call it a "prompt programming language" or something
I've been trying to beat this drum for a minute now. Your code quality is a function of validation time, and you have a finite amount of that which isn't increased by better orchestration.
I agree with this to some degree. Agents often stub and take shortcuts during implementation. I've been working on this problem a little bit with open-artisan which I published yesterday (https://github.com/yehudacohen/open-artisan).
Rather than having agents decide to manage their own code lifecycle, define a state machine where code moves from agent to agent and isolated agents critique each others code until the code produced is excellent quality.
This is still a bit of a token-hungry solution, but it seems to be working reasonably well so far and I'm actively refining it as I build.
Not going to give you formal verification, but might be worth looking into strategies like this.
I have been ~obsessed~ with exactly this problem lately.
We built AI code generation tools, and suddenly the bottleneck became code review. People built AI code reviewers, but none of the ones I've tried are all that useful - usually, by the time the code hits a PR, the issues are so large that an AI reviewer is too late.
I think the solution is to push review closer to the point of code generation, catch any issues early, and course-correct appropriately, rather than waiting until an entire change has been vibe-coded.
You can use AI to audit and review. You can set constraints so credentials never hit disk. In my case, the AI uses sed to read my env files, so the credentials don't even show up in the chat.
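The same idea works without sed: hand the assistant the variable *names* and never the values. A small sketch of that filter (the .env format assumed here is plain KEY=value lines):

```python
# Let the assistant see which variables exist without the values
# ever entering the conversation.
def env_keys_only(env_text):
    keys = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _value = line.split("=", 1)
        keys.append(key.strip())
    return keys

sample = "STRIPE_KEY=sk_live_abc123\n# comment\nDB_URL=postgres://u:p@host/db"
print(env_keys_only(sample))  # ['STRIPE_KEY', 'DB_URL']
```

That's usually all the model needs to wire things up correctly anyway.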
Things have changed quite a bit. I hope you give GSD a try yourself.
Sorry about that. I'm new here and English isn't my first language, so I leaned on tools to help me phrase things and it ended up looking like a bot. Lesson learned; I'll stick to my own words from now on. The point is real though. I've actually been building a multi-agent system, and that separation between coder and reviewer is a game changer for catching bugs that look fine on the surface. Anyway, won't happen again.
Yeah, this tracks. Developers who actually read what the AI spits out catch the obvious mistakes. The ones who just tab-complete their way through a whole project don't. And where it bites you isn't where you'd expect — logic bugs get caught fast. It's the boring security stuff. No input validation, CORS wide open, admin routes with no auth at all.
Formal verification tells you whether a function matches its spec. The problem with AI-generated code goes a level below that. It's everything nobody bothered specifying — like "maybe don't hardcode your database credentials."
The testing angle keeps coming up but it's sort of missing the point. I spent a few weeks poking through public repos built with AI tools — about 100 projects. 41% had secrets sitting raw in the source. Not in env files. In the code itself. Supabase service_role keys committed to GitHub, .env.example files with actual credentials, API keys hardcoded in client-side fetch calls.
No test catches any of that. Code works, tests pass, database is wide open.
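A grep-level scanner catches most of what I found, which is the depressing part. Toy version below with two illustrative patterns (real tools like gitleaks ship hundreds; the sample code line is invented):

```python
import re

# Two high-signal secret patterns, purely illustrative.
PATTERNS = {
    "stripe_live_key": re.compile(r"sk_live_[0-9a-zA-Z]{10,}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan(source):
    """Return (line_number, pattern_name) for every match in the source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for name, pat in PATTERNS.items():
            if pat.search(line):
                hits.append((lineno, name))
    return hits

code = 'fetch(url, {headers: {Authorization: "Bearer sk_live_abcdef123456"}})'
print(scan(code))  # [(1, 'stripe_live_key')]
```

Running something like this as a pre-commit hook would have caught a large chunk of those 41%.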
It's not even a correctness problem. It's that the LLM never thought about rate limiting, CORS headers, CSRF tokens, a sane .gitignore — because nobody asked it to. Those are things devs add from muscle memory, from getting burned. The AI has no scars.
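Rate limiting is a good example of the scar-tissue stuff, because the whole mechanism fits in a dozen lines once you've been burned. A minimal token-bucket sketch, not tied to any framework:

```python
import time

class TokenBucket:
    """Minimal per-client token bucket: refills at `rate` tokens/sec,
    allows bursts up to `capacity` requests."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=3)
results = [bucket.allow() for _ in range(5)]
print(results)  # burst of 3 allowed, then denied until tokens refill
```

One bucket per client keyed by IP or API key and you've covered the case the LLM never thought about.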
The version control angle is interesting. One thing worth thinking about — SOUL.md and SKILL.md are essentially prompt injections by design. They define what the agent does. If the ecosystem grows to where people fork and share agent repos, those files become an attack surface that doesn't get the same review scrutiny as code.
Does GitAgent check prompt definitions for suspicious patterns? Instructions to access filesystems, exfiltrate env vars, call external endpoints? Seems like a natural extension if you're already running validation in CI.
You hit the nail on the head regarding the attack surface of SKILL.md and external endpoints. Version controlling the agent's prompts and capabilities is great for configuration management, but it completely misses the runtime execution risk.
If an LLM hallucinates in production and decides to execute a destructive tool defined in SKILL.md (like dropping a table or issuing a Stripe refund), a Git PR approval process doesn't help you mid-flight.
We've been dealing with this exact runtime gap and ended up building VantaGate (an open spec / stateless API layer) specifically to act as a circuit breaker for these frameworks. Instead of just validating the prompt statically, we intercept the tool call at runtime. The agent hits a POST /checkpoint, parks its execution, and routes a 1-click [APPROVE]/[REJECT] to the team's Slack.
Once a human approves, it resumes the agent's workflow with an HMAC-SHA256 signed payload. This also solves the exact observability/audit trail issue scka-de mentioned below, because you get a cryptographic log of exactly who authorized that specific API call at runtime.
Defining the skills in Git is a great first step, but without a stateless human-in-the-loop layer at execution time, giving agents write-access to external endpoints remains a massive enterprise risk.
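For anyone curious what the signed-resume part looks like mechanically, here's a sketch with stdlib hmac. To be clear, the key and payload fields here are my invention for illustration, not VantaGate's actual wire format:

```python
import hashlib
import hmac
import json

# Shared secret between the gate and the agent runtime (illustrative).
SECRET = b"shared-gate-secret"

def sign(payload: dict) -> str:
    """Sign a canonical JSON encoding of the approved tool call."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str) -> bool:
    """Constant-time check that the payload matches its signature."""
    return hmac.compare_digest(sign(payload), signature)

approval = {"tool": "stripe.refund", "args": {"charge": "ch_123"},
            "approved_by": "alice"}
sig = sign(approval)
print(verify(approval, sig))  # True

# Any tampering after approval invalidates the signature.
tampered = {**approval, "args": {"charge": "ch_999"}}
print(verify(tampered, sig))  # False
```

The point being: the human approves one specific call, and the agent can't quietly swap the arguments before executing it.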