The deny list section hit home. I keep seeing agents use unlink instead of rm, or spawn a python subprocess to delete files. Every new rule just taught the agent a new workaround.
Ended up flipping the model — instead of blocking bad actions, require proof of safety before any action runs. No proof, no action. Much harder to route around.
Nothing super fancy.
For me “proof” just means the agent has to make its intent explicit in a way I can check before running it.
For example:
1) If it wants to delete a file, it has to output the exact path it thinks it’s deleting. I normalize it and make sure it’s inside the project root. If not, I block it.
2) If it proposes a big change, I require a diff first instead of letting it execute directly.
3) After code changes, I run tests or at least a lint/type check before accepting it.
So it’s less about formal proofs and more about forcing the agent to surface assumptions in a structured way, then verifying those assumptions mechanically.
Still hacky, but it reduced the “creative workaround” behavior a lot.
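The step-1 path check can be sketched in a few lines. This is a minimal illustration assuming a Python wrapper around the agent; `PROJECT_ROOT` and `is_safe_to_delete` are hypothetical names, not the commenter's actual code:

```python
from pathlib import Path

# Hypothetical project root; a real wrapper would read this from config.
PROJECT_ROOT = Path("/home/me/project").resolve()

def is_safe_to_delete(claimed_path: str) -> bool:
    """Allow a delete only if the normalized path stays inside the project root."""
    resolved = Path(claimed_path).resolve()  # collapses ".." and normalizes
    try:
        resolved.relative_to(PROJECT_ROOT)   # raises ValueError if outside the root
        return True
    except ValueError:
        return False

print(is_safe_to_delete("/home/me/project/tmp/cache.txt"))   # True
print(is_safe_to_delete("/home/me/project/../secrets.env"))  # False
```

The point is that the agent must state the exact path as data, and the wrapper, not the model, does the normalization and comparison.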
Is this a policy snippet you add to your CLAUDE.md? Do you still maintain a deny list?
I recently added a snippet asking Claude not to try to bypass the deny list. I haven't had an incident since, but I'm still nervous... Claude once bypassed the deny list and nuked an important untracked directory, which caused me lots of trouble.
Autonoma is an open-source, local-first autonomous code remediation engine. It analyzes code at the AST level and uses a local LLM (currently Qwen 2.5-Coder) to automatically detect and fix a bounded set of high-impact issues such as hardcoded secrets, insecure password handling, SQL injection patterns, and common linting problems.
This is a pilot edition: single-repository, on-prem only, no governance layer, no audit logs, no RBAC, and no enterprise guarantees. The goal is to explore what practical, bounded autonomy looks like for code remediation — not to claim production or enterprise readiness.
Everything runs locally, the code is fully inspectable, and fixes are intentionally constrained to deterministic categories.
I’m especially interested in feedback around safety, determinism, failure modes, and where this approach breaks down.
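For a concrete sense of what a "deterministic category" of fix can look like, here is an illustrative AST-level detector for hardcoded secrets in Python. This is a sketch of the general technique, not Autonoma's actual implementation, and the suspicious-name list is an assumption:

```python
import ast

# Assumed name list for illustration; a real tool would also use entropy
# checks and broader patterns.
SUSPICIOUS_NAMES = {"password", "secret", "api_key", "token"}

def find_hardcoded_secrets(source):
    """Return (line, name) for string literals assigned to suspicious names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (isinstance(target, ast.Name)
                        and target.id.lower() in SUSPICIOUS_NAMES
                        and isinstance(node.value, ast.Constant)
                        and isinstance(node.value.value, str)):
                    findings.append((node.lineno, target.id))
    return findings

print(find_hardcoded_secrets('password = "hunter2"\nx = 1\n'))  # [(1, 'password')]
```

Because the check operates on parsed structure rather than text, it is deterministic: the same input always yields the same findings, which is what makes the category bounded.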
Hard agree. As LLMs drive the cost of writing code toward zero, the volume of code we produce is going to explode. But the cost of complexity doesn't go down—it actually might go up because we're generating code faster than we can mentally model it.
SRE becomes the most critical layer because it's the only discipline focused on 'does this actually run reliably?' rather than 'did we ship the feature?'. We're moving from a world of 'crafting logic' to 'managing logic flows'.
I dunno, I don't think in practice SRE or DevOps are really different from the people we used to call sysadmins (former sysadmin myself). I think the future of mediocre companies is SREs chasing LLM fires, but a competitive business would have a much better strategy for building systems. Humans are still by far the most efficient and generalized reasoners, and putting an energy-intensive, brittle AI model in charge of most implementation is setting yourself up to fail.
But how much of current day software complexity is inherent in the problem space vs just bad design and too many (human) chefs in the kitchen? I'm guessing most of it is the latter category.
We might get more software but with less complexity overall, assuming LLMs become good enough.
I agree that there's a lot of complexity today due to the process in which we write code (people, lack of understanding the problem space, etc.) vs the problem itself.
Would we say that we as humans have captured the "best" way to reduce complexity and write great code? Maybe there are patterns and guidelines, but no hard and fast rules. Until we have a better understanding there, LLMs may not arrive at those levels either. Most of that knowledge is gleaned by sticking with a system: dealing with past choices and making changes and tweaks to the code, complexity, and solution over time. Maybe the right "memory" or compaction could help LLMs get better over time, but we're just scratching the surface there today.
LLMs output code only as good as their training data. They can reason about the code they're prompted with and offer ideas, but they're inherently bound by the data and concepts they've trained on. And unfortunately, it's likely average code rather than highly respected code that floods the training data, at least for now.
Ideally I'd love to see better code written and complexity driven down by _whatever_ writes the code. But there will always be verification required when the writer is probabilistic.
SREs usually don't know the first thing about whether particular logic within the product is working according to a particular set of business requirements. That's just not their role.
Any SRE who does that is really filling a QA role. It's not part of the SRE job title, which is more about deployments/monitoring/availability/performance, than about specific functional requirements.
In a well-run org, the software engineers (along with QA if you have them) are responsible for validation of requirements.
Well-run ops requires knowing the business. It's not enough to know "this RPC is failing 100%"; you also need to know what the impact on the customer is, and how important it is to the business.
Mature SRE teams get involved with the development of systems before they've even launched, to ensure that they have reliability and supportability baked in from the start, rather than shoddily retrofitted.
I see it less as SRE and more about defensive backend architecture. When you are dealing with non-deterministic outputs, you can't just monitor for uptime, you have to architect for containment. I've been relying heavily on LangGraph and Celery to manage state, basically treating the LLM as a fuzzy component that needs a rigid wrapper. It feels like we are building state machines where the transitions are probabilistic, so the infrastructure (Redis, queues) has to be much more robust than the code generating the content.
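The "rigid wrapper around a fuzzy component" pattern can be sketched as a guarded state transition: the model call is probabilistic, but the transition only commits when deterministic validation passes. Everything here (the helper names, the JSON contract, the allowed actions) is hypothetical, and a real system would back this with queues and persisted state as described:

```python
import json

MAX_RETRIES = 3

def call_llm(prompt):
    # Stand-in for a real model call; in this sketch it returns valid JSON,
    # but a real model sometimes won't, which is what the guard is for.
    return '{"action": "summarize", "target": "report.txt"}'

def guarded_transition(prompt):
    """Only advance the state machine when the LLM output satisfies the contract."""
    for _ in range(MAX_RETRIES):
        raw = call_llm(prompt)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the probabilistic step failed; retry instead of propagating
        if payload.get("action") in {"summarize", "classify"}:
            return payload  # deterministic contract satisfied; commit the transition
    raise RuntimeError("LLM output never satisfied the contract")

print(guarded_transition("Summarize the report")["action"])  # summarize
```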
The point about 'no such thing as a free lunch with processes' is something I wish more junior EMs understood.
I've seen so many teams treat process as a pure 'fix', ignoring that it's always a trade-off: you are explicitly trading velocity for consistency. Sometimes that trade is worth it (e.g., payments), but often for internal tools, you're just paying a tax for consistency you don't actually need.
This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.
The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.
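A minimal sketch of such a loop, assuming a hypothetical `generate_patch` stand-in for the model call; the deterministic gate here is Python's own compiler, but a real setup would run the full build and test suite:

```python
def generate_patch(task, feedback=None):
    # Hypothetical stand-in for an LLM call; a real version would prompt a
    # model with the task plus any compiler feedback from the last round.
    return "def add(a, b):\n    return a + b\n"

def verified_generation(task, max_rounds=3):
    """Generate code, but only accept it once a deterministic check passes."""
    feedback = None
    for _ in range(max_rounds):
        code = generate_patch(task, feedback)
        try:
            compile(code, "<patch>", "exec")  # the environment, not the prompter, corrects
            return code
        except SyntaxError as err:
            feedback = str(err)  # feed the error back for the next attempt
    raise RuntimeError("model never produced compilable code")

print("def add" in verified_generation("add two numbers"))  # True
```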
Yes, and better still, the AI will fix its mistakes if it has access to verification tools directly. You can also have it write and execute tests, and then on failure, decide if the code it wrote or the tests it wrote are wrong, and while there is a chance of confirmation bias, it often works well enough.
> decide if the code it wrote or the tests it wrote are wrong
Personally I think it's too early for this. Either you need to strictly control the code, or you need to strictly control the tests; if you let AI do both, it'll take shortcuts, and misunderstandings will propagate and solidify much more easily.
Personally I chose to tightly control the tests, as most tests LLMs tend to create are utter shit, and it's very obvious. You can prompt against this, but eventually they find a hole in your reasoning and figure out a way of making the tests pass without actually exercising the code they should.
I haven’t found that to be the case in practice. There's a limit on how large the code can be for this to work, and it still can't reliably subdivide problems on its own (yet?), but give it a small enough module and it can write both the code and the tests for it.
You should never let the LLM look at the code when writing tests, so you need to have it figure out the interface ahead of time. Ideally you wouldn't let it look at the tests when writing code either, but it needs them to tell which one was wrong. I haven't been able to add an investigator into my workflow yet, so I'm just letting the code writer run and evaluate test correctness (adding an investigator to do this instead would avoid the confirmation bias, or what you call it finding a loophole).
> I haven’t found that to be the case in practice.
Do you have any public test code you could share? Or even create some; it should be fast.
I'm asking because I hear this constantly from people, and since most people don't hold their test code to as high a standard as the rest of their code, it tends to be a half-truth; when you actually take a look at the tests, they're as messy and incorrect as you (I?) think.
I'd love to be proven wrong though, because writing good tests is hard, so currently I'm doing that part myself and not letting LLMs come up with the tests on their own.
I'm doing all my work at Google, so it's not like I can share it easily. Also, since GeminiCLI doesn't support sub-agents yet, I've had to get creative with how I implement my pipelines. The biggest challenge I've found ATM is controlling conversation context so you can control what the AI is looking at when it does things (e.g. not looking at code when writing tests!). I hope I can release what I'm doing eventually, although it isn't a key piece of AI tech (just a way to orchestrate the pipeline so the AI gets different context for different pipeline steps; it might be obsolete once we get better support for orchestrating dev work in GeminiCLI or other dev-oriented AI front ends).
The tests can definitely be incorrect, and are often incorrect. You have to tell the AI to consider that the tests might be wrong, not the implementation, and it will generally take a closer look at things. They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.
> They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.
Yeah, those for me are all not "good tests"; you don't want them in your codebase if you're aiming for a long-term project. Every single test has to make sense and be needed to confirm something, and should give clear signals when it fails; otherwise you end up locking your entire codebase to things, because knowing which tests are actually needed becomes a mess.
Writing the tests yourself and letting the AI write the implementation leaves you with code you know the behavior of, and you can confidently say what works and what doesn't. When the AI ends up writing the tests, you often don't actually know what works; often you don't learn anything useful even by scanning the test titles. How is one supposed to guarantee any sort of quality like that?
If it clarifies anything, I have my workflow (each step is a separate prompt without preserved conversation context):
1) Create a test plan for N tests from the description. This step doesn't provide specific data or logic for the tests; it just loosely plans out N tests that don't overlap too much.
2) Create an interface from the description.
3) Create an implementation strategy from the description.
4.N) Create the N tests, one at a time, from the test plan + interface (making sure each test compiles). Each test is created in its own prompt without conversation context.
5) Create code using the interface + implementation strategy + general knowledge, using the N tests to validate it. If test I fails and the AI decides it's the test's fault, feed that back to step 4.I.
If anything changes in the description, the test plan is fixed, the tests are fixed, and that just propagates up to the code. You don't look at the tests unless you reach a situation where the AI can't fix the code or the tests (and you really need to help out).
This isn't really your quality pass; it's a crap-filter pass (the code should work in the sense that a programmer wrote something they think works, but you can't really call it "tested" yet). Maybe you thought I was claiming this is all the testing you'll need? No, you still need real tests as well as these small ones...
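The workflow can be sketched as a pipeline where every step runs in its own fresh prompt; `run_llm` here is a hypothetical stand-in for whatever front end you drive (the commenter's actual orchestration is not public):

```python
def run_llm(prompt):
    # Hypothetical stand-in: each call is a fresh prompt with no shared
    # conversation context, mirroring the isolated steps above.
    return f"<output for: {prompt.splitlines()[0]}>"

def pipeline(description, n_tests=3):
    test_plan = run_llm(f"Plan {n_tests} non-overlapping tests for: {description}")
    interface = run_llm(f"Design an interface for: {description}")
    strategy = run_llm(f"Outline an implementation strategy for: {description}")
    # Step 4.N: each test is generated in its own context, seeing only the
    # plan and the interface, never the implementation.
    tests = [
        run_llm(f"Write test {i} of the plan:\n{test_plan}\nInterface:\n{interface}")
        for i in range(1, n_tests + 1)
    ]
    # Step 5: the code writer sees interface + strategy + tests; on a failing
    # test it decides whether the code or that one test is at fault.
    code = run_llm(f"Implement:\n{interface}\n{strategy}\nValidate against {n_tests} tests")
    return {"test_plan": test_plan, "tests": tests, "code": code}

result = pipeline("parse a CSV file")
print(len(result["tests"]))  # 3
```

The design choice worth noting is that context isolation is enforced by the orchestrator, not by prompting the model to ignore things.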
Testing is fun, but getting all the scaffolding in place to get to the fun part and do any testing suuuuucks. So let the LLM write the annoying parts (mocks. so many mocks.) while you do the fun part
I imagine you would use something that errs on the side of safety - e.g. insist on total functional programming and use something like Idris' totality checker.
I've been using Codex and have never had a compile-time error by the time it finishes. Maybe add to your agent instructions to run the TS compiler, lint, and format before finishing, and only stop when everything passes.
> This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.
Often, if not usually, that means the method should exist.
I've been working on this problem specifically in the context of autonomous coding agents, and you hit the nail on the head with 'implicit context'.
The biggest issue isn't just that documentation gets outdated; it's that the 'mental model' of the system only exists accurately in a few engineers' heads at any given moment. When they leave or rotate, that model degrades.
We found the only way to really fight this is to make the system self-documenting in a semantic way—not just auto-generated docs, but maintaining a live graph of dependencies and logic that can be queried. If the 'map' of the territory isn't generated from the territory automatically, it will always drift. Manual updates are a losing battle.
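One small-scale illustration of "generating the map from the territory": derive a module dependency graph directly from import statements, so the map cannot drift from the code. This is a stdlib-only sketch, not the commenter's system, and a real semantic graph would go well beyond imports:

```python
import ast
from collections import defaultdict

def dependency_graph(sources):
    """Map each module name to the set of modules it imports.

    sources: dict of {module_name: source_code_string}.
    """
    graph = defaultdict(set)
    for module, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                graph[module].update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                graph[module].add(node.module)
    return dict(graph)

graph = dependency_graph({"app": "import db\nfrom utils import helper\n"})
print(sorted(graph["app"]))  # ['db', 'utils']
```

Because the graph is recomputed from the source itself, it can never be more stale than the last run, which is exactly the property manual documentation lacks.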
I built Autonoma because I was tired of Copilot suggesting code that didn't compile.
Autonoma is a local daemon that acts as an "L5 Autonomous Engineer". It doesn't just autocomplete; it autonomously fixes bugs, security vulnerabilities, and linter errors in the background.
Key features:
- Air-Gapped: Runs 100% locally (Docker). No code leaves your machine.
- Self-Correcting: It validates its own fixes against your compiler/linter.
- Deterministic: Uses Tree-Sitter for AST analysis to prevent syntax hallucinations.
Would love feedback on the install process. The "Enterprise" tier is just for support—the core engine is fully open for the community.
The 'scaling with intelligence' argument often ignores the 'scaling of verification costs'.
As models get smarter and cheaper, the cost of generating complex code or artifacts approaches zero. However, the cost of verifying that output (especially in high-stakes enterprise environments) remains non-zero.
We might see a value shift where the premium isn't on the 'Intelligence' that generates the work, but on the 'Reliability' systems that validate it. In a world of infinite cheap tokens, trust becomes the scarce asset.
The biggest challenge I've seen with these 'infinite narration' layers on top of simulation games is context window management. As a RimWorld colony grows, the state data (pawns, inventory, health, history) explodes.
Are you summarizing the history logs to fit it into context, or are you just feeding a snapshot of the current frame state (JSON) to the LLM? I've found that 'narrative coherence' usually suffers if you don't keep a rolling buffer of recent major events.
Current snapshot + semantic search related to the "recent event + messages since last trigger" in the database of memories. I prune unnecessary events (built chair, planted rice, etc.) as time goes on, and have a decay system that basically decreases the weight of old misc events. I also don't keep the messages in the save file directly; instead they go in a separate file in the mod config, so they don't bloat the main save XML. Can it be polished more? Absolutely, but I had to publish this at some point after all. That being said, I am planning to focus on environmental awareness and better context as time goes on.
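The decay scheme described can be sketched with a simple half-life weighting; the function names, the half-life value, and the importance scores here are all assumptions for illustration, not the mod's actual code:

```python
import math

def memory_weight(importance, age_days, half_life_days=30.0):
    """Exponentially decay a memory's retrieval weight; importance scales it."""
    decay = math.exp(-age_days * math.log(2) / half_life_days)
    return importance * decay

def top_memories(memories, k=3):
    """memories: list of (text, importance, age_days); return the k heaviest."""
    return sorted(memories, key=lambda m: -memory_weight(m[1], m[2]))[:k]

events = [
    ("raid survived", 0.9, 10),   # important and recent: stays heavy
    ("built chair", 0.1, 1),      # misc event: light even when fresh
    ("colonist died", 1.0, 60),   # very important but old: decayed, still retained
]
print([e[0] for e in top_memories(events, 2)])  # ['raid survived', 'colonist died']
```

With a decay like this, misc events effectively prune themselves out of retrieval long before they are deleted from storage.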
This is a fascinating pattern—treating 'agent skills' as composable dependencies rather than monolithic prompts.
I'm curious about the execution model: How are you handling the security implications of an agent pulling and executing arbitrary skill code? Is there an inherent sandboxing layer for these skills, or do they inherit the full privileges of the host agent?
In my experience with agentic tools, managing the 'permissions scope' of 3rd party capabilities is the hardest part of moving from demo to production.
> Ended up flipping the model — instead of blocking bad actions, require proof of safety before any action runs. No proof, no action. Much harder to route around.
Curious if you've tried anything similar.