As large language models evolve from passive assistants into tool-using agents, a new class of risk emerges. These agents can browse the web, read emails, query databases, and take actions on behalf of users. That power is exactly what makes them useful, and exactly what makes them dangerous when exposed to untrusted inputs.
This post summarizes an article on how adversarial reinforcement learning can harden LLM agents against one of the most subtle and impactful threats they face today: indirect prompt injection, in which malicious instructions hidden in content the agent processes, such as a web page or an email, hijack its behavior. The graphic above illustrates the core loop behind ARLAS (Adversarial Reinforcement Learning for Agent Safety), a framework that trains agents to stay safe without sacrificing task performance.