It's probably insufficient to know that you wrote it, because code has bugs that LLMs and attackers are motivated to find. Code like this has a higher trust requirement than most.
And of course, that trust only applies to you; no one else should trust your code absent other proof.
No, for example a tool call that hits an API. The LLM doesn't have access to the API keys; the tool does. For example, an API call that fetches some data remotely and returns it to the LLM. You don't need a sandbox for that, and it's faster and more efficient to keep it out of the sandbox.
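Roughly, the shape is something like this (a minimal sketch, not our actual code; the endpoint, env var, and helper name are made up for illustration). The point is that the credential is injected by the tool runtime and only the fetched data ever reaches the model:

    // fetchMetrics is a tool handler: the agent loop invokes it when the
    // LLM emits a tool call. The API key stays in this process's environment;
    // the LLM only ever sees the response body.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
        "time"
    )

    func fetchMetrics(endpoint string) (string, error) {
        req, err := http.NewRequest("GET", endpoint, nil)
        if err != nil {
            return "", err
        }
        // Credential is injected here, server-side; it is never part of the
        // prompt, the tool arguments, or anything running inside a sandbox.
        req.Header.Set("Authorization", "Bearer "+os.Getenv("METRICS_API_KEY"))

        client := &http.Client{Timeout: 10 * time.Second}
        resp, err := client.Do(req)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return "", err
        }
        return string(body), nil // this string is what gets returned to the LLM
    }

    func main() {
        out, err := fetchMetrics("https://example.com/api/metrics")
        if err != nil {
            fmt.Println("tool error:", err)
            return
        }
        fmt.Println(out)
    }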
We don't host 3rd-party agents (I don't know if that's what you implied). We built an agent that monitors CI pipelines, test failures, and performance, and auto-opens PRs to address the issues we find. We host our agent loop on a backend (it's in Go), and we call into the sandbox when we run operations involving the user's code.
Yes, it's also because the agent described in the post is doing operations on the user's code (fixing CI pipelines, rerunning tests, fixing them, etc.). So another big reason to use the sandbox is to run things like bash on user code. You don't want credentials or anything trusted inside that sandbox, including the LLM API key.
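To make that concrete, a call into the sandbox can be as narrow as this (sketch only; the sandbox endpoint and request shape are placeholders, not our real API). Note what the backend ships: a command and a working directory, never tokens or keys:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "time"
    )

    // ExecRequest is what the backend sends to the sandbox service.
    // Note what is absent: no tokens, no API keys, no env secrets.
    type ExecRequest struct {
        Cmd     string `json:"cmd"`     // e.g. "npm test"
        Workdir string `json:"workdir"` // path of the user's checked-out repo inside the sandbox
    }

    type ExecResult struct {
        Stdout   string `json:"stdout"`
        Stderr   string `json:"stderr"`
        ExitCode int    `json:"exit_code"`
    }

    func runInSandbox(sandboxURL string, req ExecRequest) (*ExecResult, error) {
        payload, err := json.Marshal(req)
        if err != nil {
            return nil, err
        }
        client := &http.Client{Timeout: 5 * time.Minute}
        resp, err := client.Post(sandboxURL+"/exec", "application/json", bytes.NewReader(payload))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var result ExecResult
        if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
            return nil, err
        }
        return &result, nil
    }

    func main() {
        res, err := runInSandbox("http://sandbox.internal:8080", ExecRequest{
            Cmd:     "npm test",
            Workdir: "/workspace/repo",
        })
        if err != nil {
            fmt.Println("sandbox error:", err)
            return
        }
        fmt.Printf("exit=%d\nstdout:\n%s\n", res.ExitCode, res.Stdout)
    }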
We considered wrapping Claude Code when we started building Mendral (the agent in the article). We ended up building our own agent. It's a lot more work, because you have to keep following the right patterns as the models evolve (sub-agents, proper token caching, redoing basic tools like read, write, edit, bash, etc.). But it paid off over time when you're building an agent focused on a specific task (not a general coding agent).
The main driver for writing our own agent was to keep the agent loop out of the sandbox (the loop runs on our backend; we call the sandbox only when needed). We wrote another post about that (it's the latest post on the blog).
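Structurally, the split looks roughly like this (a sketch with stubs standing in for the real handlers, not the actual Mendral code): anything touching the user's code gets forwarded to the sandbox, everything else stays on the backend where the credentials live:

    package main

    import "fmt"

    // ToolCall is what the model emits; the backend decides where each one runs.
    type ToolCall struct {
        Name string
        Args map[string]string
    }

    // dispatch routes tool calls: anything that touches the user's code
    // (read/write/edit/bash) goes to the sandbox; everything else runs
    // on the backend alongside the credentials.
    func dispatch(call ToolCall) (string, error) {
        switch call.Name {
        case "read", "write", "edit", "bash":
            return sandboxExec(call) // runs inside the isolated sandbox
        case "fetch_ci_logs", "query_db":
            return backendTool(call) // runs on the backend, keys never leave it
        default:
            return "", fmt.Errorf("unknown tool %q", call.Name)
        }
    }

    // Stubs standing in for the real implementations.
    func sandboxExec(call ToolCall) (string, error) { return "sandbox: " + call.Name, nil }
    func backendTool(call ToolCall) (string, error) { return "backend: " + call.Name, nil }

    func main() {
        out, _ := dispatch(ToolCall{Name: "bash", Args: map[string]string{"cmd": "go test ./..."}})
        fmt.Println(out)
    }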
However, I am curious how you would implement the triager pattern using only Claude Code as the harness.
The basic idea is to run a 24/7 SWE agent loop on local hardware to maximize cost effectiveness. The agents just keep running and refining development tasks in a project board. When the human is in the loop, they do just enough to complete the tasks, with semi-frequent human review. Whenever the human is absent or human feedback becomes the bottleneck, they instead start autonomously debating, delegating to cloud LLMs, etc. to try to clear bottlenecks. So essentially the system tries to do useful things to fully utilize the hardware, which is a specific optimization for local models (you'd never need to do this with cloud models).
The local models would also be queryable on demand (which overrides the 24/7 tasks in terms of priority) as cheap inference. The idea is that in user-queried interactive tasks, the main Claude agent primarily gets summaries from other agents and makes decisions based on them, saving a ton of tokens compared to giving it direct access to the codebase. These small-model calls would preferentially route to my local model to save costs, but overflow to a cloud provider if demand is momentarily too high.
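The overflow part is simple enough to sketch (the capacity number and the stub inference calls below are all placeholders): try the local box while it has free slots, spill to a cloud provider otherwise:

    package main

    import "fmt"

    // router sends small-model calls to the local box when it has free slots,
    // otherwise it overflows to a cloud provider.
    type router struct {
        localSlots chan struct{} // buffered: one token per concurrent local request
    }

    func newRouter(localCapacity int) *router {
        return &router{localSlots: make(chan struct{}, localCapacity)}
    }

    func (r *router) complete(prompt string) string {
        select {
        case r.localSlots <- struct{}{}: // a local slot is free
            defer func() { <-r.localSlots }()
            return callLocal(prompt)
        default: // local box saturated: overflow to the cloud
            return callCloud(prompt)
        }
    }

    // Stubs standing in for the actual inference calls.
    func callLocal(prompt string) string { return "local: " + prompt }
    func callCloud(prompt string) string { return "cloud: " + prompt }

    func main() {
        r := newRouter(2)
        fmt.Println(r.complete("summarize this diff"))
    }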
IMO RAG is mostly dead. The game changer with newer models like Opus is the reasoning. So instead of pushing all the context up front (RAG style), it's better to give strong primitives (e.g. bash, SQL) and let the agent figure it out.
That's what Claude Code does now, and it's the same principle we applied for Mendral.
That said, you're right that some smaller models can outperform Haiku, and we're thinking about supporting OSS models at some point. But it doesn't change the core design principles IMO.
It's more accurate to say that RAG is alive and well and has just been folded into the agent's responsibility; it's one more tool the agent can call instead of the user doing the retrieval manually.
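In practice that just means retrieval sits behind a tool definition like bash or SQL (toy sketch below; the in-memory "index" stands in for a real vector or keyword index), and the model decides when to pull context rather than having it pushed up front:

    package main

    import (
        "fmt"
        "strings"
    )

    // searchDocs is a retrieval tool the agent can choose to call, exactly like
    // bash or SQL. The index here is a toy in-memory list standing in for a
    // real vector or keyword index.
    var index = []string{
        "deploy.md: how the CI pipeline promotes builds to staging",
        "flaky-tests.md: known flaky e2e suites and their owners",
    }

    func searchDocs(query string) []string {
        var hits []string
        for _, doc := range index {
            if strings.Contains(strings.ToLower(doc), strings.ToLower(query)) {
                hits = append(hits, doc)
            }
        }
        return hits
    }

    func main() {
        // The agent loop would surface this as a tool result; the model only
        // sees retrieved snippets when it asked for them.
        fmt.Println(searchDocs("flaky"))
    }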
We're dealing with CI logs produced by a variety of frameworks, languages, etc. And the tough ones to look into are e2e tests, with output coming from the infrastructure as well.
I wish a re.match() would be enough, but we often don't even know what to match in the first place.
We started adding deterministic matching on the patterns the agent sees the most, so we don't have to go through the whole thing every time (for example, a flake on PostHog can occur 100+ times in a day; you don't need to reinvestigate it each time). But for new errors, it's tricky.
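That deterministic layer is basically a list of known signatures checked before any model call (the patterns and verdicts below are invented examples, not our real ones; anything unmatched still falls through to the agent):

    package main

    import (
        "fmt"
        "regexp"
    )

    // knownFailure maps a log signature we've already investigated to a verdict,
    // so recurring flakes are classified without another agent run.
    type knownFailure struct {
        pattern *regexp.Regexp
        verdict string
    }

    var knownFailures = []knownFailure{
        {regexp.MustCompile(`(?i)connection refused .*posthog`), "known flake: PostHog capture, safe to retry"},
        {regexp.MustCompile(`ETIMEDOUT.*registry\.npmjs\.org`), "known flake: npm registry timeout, safe to retry"},
    }

    // classify returns a cached verdict if the log matches a known signature;
    // otherwise it reports that the failure needs a full agent investigation.
    func classify(log string) (string, bool) {
        for _, kf := range knownFailures {
            if kf.pattern.MatchString(log) {
                return kf.verdict, true
            }
        }
        return "", false
    }

    func main() {
        verdict, cached := classify("Error: connection refused while flushing events to PostHog")
        fmt.Println(verdict, cached)
    }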
It's the same as an escalation. Something we omitted from the post is that we often use Sonnet to write SQL queries.
We wrote another post that was on HN some time ago that goes into the details of SQL queries (linked at the top of this article). Sonnet is perfect for this.
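For a sense of scale, that escalation is basically a single Messages API call (sketch against the public Anthropic HTTP API; the schema in the prompt and the model id are placeholders, and the key lives on the backend, never in the sandbox):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    type message struct {
        Role    string `json:"role"`
        Content string `json:"content"`
    }

    type request struct {
        Model     string    `json:"model"`
        MaxTokens int       `json:"max_tokens"`
        Messages  []message `json:"messages"`
    }

    type response struct {
        Content []struct {
            Text string `json:"text"`
        } `json:"content"`
    }

    // writeSQL asks Sonnet to translate a question about CI runs into a SQL query.
    func writeSQL(question string) (string, error) {
        prompt := "Schema: ci_runs(id, repo, status, duration_ms, finished_at).\n" +
            "Write a single SQL query answering: " + question
        body, err := json.Marshal(request{
            Model:     "claude-sonnet-4-5", // placeholder: substitute the current Sonnet model id
            MaxTokens: 512,
            Messages:  []message{{Role: "user", Content: prompt}},
        })
        if err != nil {
            return "", err
        }
        req, err := http.NewRequest("POST", "https://api.anthropic.com/v1/messages", bytes.NewReader(body))
        if err != nil {
            return "", err
        }
        req.Header.Set("x-api-key", os.Getenv("ANTHROPIC_API_KEY")) // backend-only credential
        req.Header.Set("anthropic-version", "2023-06-01")
        req.Header.Set("content-type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        var out response
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            return "", err
        }
        if len(out.Content) == 0 {
            return "", fmt.Errorf("empty response")
        }
        return out.Content[0].Text, nil
    }

    func main() {
        sql, err := writeSQL("which repos had the slowest CI runs this week?")
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Println(sql)
    }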