frolvlad's comments

frolvlad · 2026-02-13T19:13:32 1771010012

Well, the challenge is to know whether the action is supposed to be executed BEFORE it is requested to be executed. Once the email with my secrets is sent, it is too late to deal with the consequences.

Sandboxes could provide that level of observability; HOWEVER, it is a heavy lift. Yet I don't have better ideas either. Do you?

liuliu · 2026-02-13T19:31:03 1771011063

The solution is to make the model stronger so that malicious intent can be better distinguished (and no, it is not a guarantee, like many things in life). A sandbox is the baseline, but as long as you give the model your credentials, there isn't much that guardrails can do other than making the model stronger (a separate guard model is the wrong path IMHO).

ramoz · 2026-02-13T20:30:52 1771014652

I think it's generally correct to say "hey, we need stronger models", but rather ambitious to think we can really solve alignment with current attention-based models and RL side effects. A guard model gives an additional layer of protection and probably a stronger posture when used as an early warning system.

liuliu · 2026-02-13T20:51:30 1771015890

Sure. If you treat a "guard model" as a diversification strategy, it is another layer of protection, just like diversification in compilation helps solve the root-of-trust issue (Reflections on Trusting Trust). I am just generally suspicious of weak-to-strong supervision.

I think it is in general pretty futile to implement permission systems / guardrails that basically insert a human in the loop (humans need to review the work to fully understand why it needs to send that email, and at that point, why do you need an LLM to send the email at all?).

ramoz · 2026-02-13T21:07:57 1771016877

fair enough

ramoz · 2026-02-13T19:17:52 1771010272

If you extend the definition of sandbox, then yea.

Solutions, no. For now it's continued cat-and-mouse with things like "good agents" in the mix (i.e. AI as a judge, which is of course just as exploitable through prompt injection), and deterministic policy where you can (e.g. OPA/Rego).

We should continue to enable better integrations with the runtime, which is why I created the original feature request for hooks in Claude Code. Things like IFC or agent-as-a-judge can form some early useful solutions.
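
To make the "deterministic policy" point concrete, here is a minimal sketch of a check that runs before a tool call is executed. It is plain Python rather than OPA/Rego, and it is not the Claude Code hooks API; the `ToolCall` shape, the tool names, and the allowed domain are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Tuple

# Hypothetical allowlist: email may only go to these recipient domains.
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical shape of a tool call the agent is about to make."""
    tool: str
    args: dict = field(default_factory=dict)


def evaluate(call: ToolCall) -> Tuple[bool, str]:
    """Decide, before execution, whether a tool call may proceed.

    The decision depends only on the call itself, so the same input
    always yields the same verdict (no model in the loop).
    """
    if call.tool == "read_file":
        return True, "ok"
    if call.tool == "send_email":
        recipients = call.args.get("to", [])
        if not recipients:
            return False, "no recipients given"
        for address in recipients:
            # Take everything after the last "@" as the domain.
            domain = address.rpartition("@")[2].lower()
            if domain not in ALLOWED_RECIPIENT_DOMAINS:
                return False, f"recipient domain {domain!r} is not allowed"
        return True, "ok"
    # Default deny: anything the policy does not know about is blocked.
    return False, f"unknown tool {call.tool!r}"


if __name__ == "__main__":
    print(evaluate(ToolCall("send_email", {"to": ["alice@example.com"]})))
    print(evaluate(ToolCall("send_email", {"to": ["mallory@evil.test"]})))
```

A check like this only covers what can be stated as a rule up front; it does nothing about a permitted recipient receiving content it should not, which is where the IFC or judge-style approaches would have to come in.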

frolvlad · 2026-02-13T18:45:59 1771008359

Instead of expecting the tools to adhere to the rules, the rules are enforced. For example, to make an HTTP call with a secret key, the tool must go through a proxy service, which enforces that the secret key is only used for the specific domain. If the call is allowed, the proxy service makes it on the tool's behalf, so the secret never leaks outside of the service.

However, this design is still under development, as it creates quite a few challenges.
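
A rough sketch of the proxy idea, assuming a simple in-memory registry: the tool names a secret but never sees its value, and the proxy attaches the value only when the destination host is on that secret's allowlist. The secret name, key value, host, and `prepare_request` helper are all hypothetical, and the actual HTTP send is left out so the policy check stands on its own.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical registry held by the proxy only:
# secret name -> (secret value, hosts allowed to receive it).
SECRETS = {
    "PAYMENTS_API_KEY": ("sk-demo-123", {"api.payments.example"}),
}


class PolicyViolation(Exception):
    """Raised when a request would send a secret somewhere it may not go."""


@dataclass
class OutboundRequest:
    url: str
    headers: dict
    secret_name: str  # the tool refers to the secret by name only


def prepare_request(req: OutboundRequest) -> OutboundRequest:
    """Return a copy of the request with the secret attached, or raise."""
    if req.secret_name not in SECRETS:
        raise PolicyViolation(f"unknown secret {req.secret_name!r}")
    value, allowed_hosts = SECRETS[req.secret_name]

    parts = urlsplit(req.url)
    if parts.scheme != "https":
        raise PolicyViolation("secrets may only be sent over https")
    # hostname ignores any userinfo, so "https://good@evil/" resolves to "evil".
    if parts.hostname not in allowed_hosts:
        raise PolicyViolation(
            f"{req.secret_name} may not be sent to {parts.hostname!r}"
        )

    headers = dict(req.headers)  # do not mutate the caller's headers
    headers["Authorization"] = f"Bearer {value}"
    return OutboundRequest(req.url, headers, req.secret_name)


if __name__ == "__main__":
    ok = prepare_request(
        OutboundRequest("https://api.payments.example/v1/charge", {}, "PAYMENTS_API_KEY")
    )
    print(sorted(ok.headers))
    try:
        prepare_request(
            OutboundRequest("https://evil.example/collect", {}, "PAYMENTS_API_KEY")
        )
    except PolicyViolation as exc:
        print("blocked:", exc)
```

In a real deployment the proxy would also have to send the request itself and keep the response from echoing the secret back to the tool, which is presumably part of the challenges mentioned above.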

frolvlad · on Feb 25, 2022

Are you fucking kidding us? What does a totalitarian country (Russia) know about liberty? You must be playing the idiot if you believe that a military invasion can be called "liberation".

frolvlad · on Feb 25, 2022

Because my friends and relatives are dying in Ukraine at the hands of Russian troops.

frolvlad · on April 7, 2020

COVID Connecting People (c)