This makes sense for OpenAI; my experience with Promptfoo has been great for testing model outputs. But I keep wondering who's looking at the other side: the actual agent code. And what happens to teams using Promptfoo with other models like Gemini or Claude, if it ends up locked in with OpenAI and the OS?
Like, an eval will tell you the model gave a bad answer. It won't tell you that your agent passes that answer straight into a shell command, or that a loop has no exit condition and burns through your API budget overnight.
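To make the shell-command failure mode concrete, here's a minimal sketch of the kind of guard an eval won't check for but the agent harness should have. The allowlist and function names are made up for illustration:

```python
import shlex
import subprocess

# Hypothetical agent step: the model proposed a command string.
# Never hand it to a shell directly; allowlist the binary and pass
# arguments as a list so nothing gets shell-interpreted.
ALLOWED_BINARIES = {"ls", "grep", "cat", "echo"}

def run_model_command(cmd_string: str) -> str:
    parts = shlex.split(cmd_string)
    if not parts or parts[0] not in ALLOWED_BINARIES:
        raise ValueError(f"binary not allowed: {parts[0] if parts else '<empty>'}")
    # shell=False (the default with a list) means "; rm -rf /" is just
    # an argument, not a second command.
    result = subprocess.run(parts, capture_output=True, text=True, timeout=10)
    return result.stdout
```

The point being: this is a property of the agent code, not the model's answer, so no amount of output evals will surface its absence.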
We've been working on this: static analysis that reads agent code and maps out what can go wrong before you deploy. We found issues in ~80% of the repos we scanned.
I'd also recommend checking out https://inkog.io. It looks at similar patterns, runs directly in the browser with results in ~60s, builds an agent topology, and checks for human-in-the-loop gaps.
Interesting that NIST is pushing for machine-readable behavioral declarations for agents. Basically an SBOM equivalent — agents declaring what tools they can access and what they can't do.
RFI responses due March 9, concept papers April 2. Moves fast for a federal initiative.
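For anyone wondering what an "SBOM equivalent" for agent behavior might look like: here's a rough sketch. To be clear, the field names and structure below are entirely made up; NIST hasn't published a schema, this is just the shape of the idea:

```python
import json

# Hypothetical agent behavior declaration (invented schema, not NIST's).
# The idea: the agent declares its tool surface up front, machine-readably,
# so a host or auditor can check it before and during execution.
declaration = {
    "agent": "invoice-triage-bot",
    "tools_allowed": ["email.read", "crm.lookup"],
    "tools_denied": ["shell.exec", "payments.send"],
    "human_in_the_loop": {"required_for": ["payments.send"]},
    "max_autonomy_minutes": 30,
}

def is_tool_allowed(decl: dict, tool: str) -> bool:
    # Deny wins over allow, same spirit as most policy languages.
    return tool in decl["tools_allowed"] and tool not in decl["tools_denied"]

print(json.dumps(declaration, indent=2))
```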
Honestly yeah – static catches structural stuff (missing exit conditions). But the trickier loops are when the model keeps deciding to retry. Like "let me try one more search" forever. That's prompt behavior; you need runtime traces for those.
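Even for the prompt-driven case, you can at least bound the damage at runtime. Something like this (names are illustrative, not from any framework):

```python
# Minimal runtime guard for model-driven retries: the model may keep
# asking for "one more search", but the harness enforces a hard budget
# on call count and spend, regardless of what the model decides.
class RetryBudget:
    def __init__(self, max_calls: int = 5, max_cost_usd: float = 1.0):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost = 0.0

    def charge(self, cost_usd: float) -> None:
        self.calls += 1
        self.cost += cost_usd
        if self.calls > self.max_calls or self.cost > self.max_cost_usd:
            raise RuntimeError("retry budget exhausted; stopping agent loop")
```

Doesn't tell you *why* the model keeps retrying (you still want the traces for that), but it turns "burned the budget overnight" into "stopped after five calls".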
the point about this being an os problem not an ai problem resonates. letting untrusted agents drive your browser smells like a problem to me.
in practice we've had better luck running agents in lightweight sandboxes with explicit capability handles. curious if anyone's tried capability-based systems like seL4 for hosting agents, feels like mainstream oses have a long way to go here.
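to give a flavor of what i mean by "explicit capability handles", here's a toy sketch (userspace python, nothing like the guarantees seL4 gives you, and the class name is made up): the agent never gets ambient filesystem access, only an object scoped to one directory.

```python
from pathlib import Path

# Toy capability handle: holds authority over one directory.
# The agent is handed this object instead of open()/os access,
# so "what can it touch" is visible in the code that constructs it.
class DirCapability:
    def __init__(self, root: str):
        self.root = Path(root).resolve()

    def read(self, relpath: str) -> str:
        target = (self.root / relpath).resolve()
        # Reject path traversal: target must sit under the root.
        if self.root not in target.parents and target != self.root:
            raise PermissionError(f"{relpath} escapes the capability root")
        return target.read_text()
```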
nice work. the idea of breaking agents into short-lived executors with explicit inputs/outputs makes a lot of sense - most failures i've seen come from agents staying alive too long and leaking assumptions across steps.
curious how you're handling context lifetimes when agents call other agents. do you drop context between calls or is there a way to bound it? that's been the trickiest part for us.
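for what it's worth, the crude middle ground we landed on is neither "drop everything" nor "forward everything": bound what crosses the call. rough sketch, names invented:

```python
# Bound context across agent-to-agent calls: the child gets only the
# last few entries of the parent's context, never the full history.
# This caps leakage of stale assumptions without losing all continuity.
def call_subagent(subagent, task, parent_context, max_carry=3):
    child_context = parent_context[-max_carry:]
    return subagent(task, child_context)
```

picking `max_carry` is the annoying part — too low and the child re-derives things, too high and you're back to leaking assumptions across steps.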
would be great to get your feedback: https://github.com/inkog-io/inkog