An AI Agent Security Threat Model: 7 Attack Classes, Defenses, and an Auditable Sandbox Pattern
Multi-agent LLM systems are not just “LLM plus tools” from a security perspective. The orchestration layer introduces attack surfaces that single-LLM applications do not have. After spending 18 months building agent systems for security tooling and red-teaming the same patterns from the attacker side, I have settled on a 7-class threat model that maps cleanly to the controls a regulated environment needs. This post documents the model, the defenses, the observability minimum, and the compliance mapping.
The intent is not to discourage agent deployment. Agents are genuinely useful, and the security community needs a practical threat model that allows safe deployment rather than a list of reasons not to deploy. The model here is opinionated — it prioritizes auditability and human-in-the-loop discipline over fully autonomous behavior. That choice is deliberate. Autonomous agents in a regulated environment without an audit trail are not currently defensible to an auditor.
Why Agent Systems Differ From LLM Applications
A single-LLM application has one trust boundary: between the user and the model. Inputs cross that boundary; outputs cross it back. Prompt injection, jailbreaks, and data exfiltration are well-studied on this single boundary.
A multi-agent system has multiple trust boundaries. User to orchestrator, orchestrator to worker agents, worker agents to tools, agents to memory, agents to other agents. Each boundary is a place where an attacker can inject content, observe state, or escalate. The orchestration layer (often a custom Python or TypeScript framework, or an MCP server, or a workflow engine) is itself part of the trust surface.
Three architectural properties of agent systems make them meaningfully different from single-LLM applications: shared memory across turns, tool calls that produce side effects, and inter-agent communication that the original user may never see. These properties enable useful behavior; they also enable the seven attack classes that follow.
The 7 Attack Classes
1. Cross-Agent Prompt Injection
A worker agent processes a document, web page, or tool output that contains adversarial instructions. The worker agent emits a message to the orchestrator that includes those instructions verbatim. The orchestrator, reading the worker’s output as trusted context, executes the instructions on a downstream agent with broader privileges.
Defense: treat all inter-agent messages as untrusted input until they pass an explicit validation layer. Strip or escape instruction-shaped content before passing to the next agent. Use structured message schemas (JSON with typed fields) rather than free-text passing.
Detection: log every inter-agent message and run a lightweight classifier for instruction-shaped content (imperative verbs, role-prefix patterns, system-prompt-style phrasing). Alert on anomaly score.
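The detection step above can be sketched as a heuristic filter. The patterns and the threshold here are illustrative stand-ins for the "lightweight classifier" — a real deployment would tune or replace them — but the shape of the control (score every inter-agent message, alert above a threshold) is as described:

```python
import re

# Hypothetical instruction-shaped-content patterns; not a production classifier.
INSTRUCTION_PATTERNS = [
    r"(?i)\bignore (all |any )?(previous|prior) instructions\b",
    r"(?i)^\s*(system|assistant|user)\s*:",          # role-prefix style
    r"(?i)\byou (must|should|are) now\b",
    r"(?i)\b(disregard|override) (the|your) (task|goal|policy)\b",
]

def injection_score(message: str) -> float:
    """Fraction of patterns that match; higher is more suspicious."""
    hits = sum(bool(re.search(p, message)) for p in INSTRUCTION_PATTERNS)
    return hits / len(INSTRUCTION_PATTERNS)

def is_suspicious(message: str, threshold: float = 0.25) -> bool:
    return injection_score(message) >= threshold
```

In production the score would be logged alongside the inter-agent message record rather than used as a hard gate, so the anomaly alerting described above has data to work with.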
2. Tool Misuse
An agent has access to a tool with broader capability than the current task requires. An adversarial input convinces the agent to invoke the tool with arguments that satisfy the input syntactically but exceed the intended use. Examples: a database read tool used with a query that exfiltrates the entire user table; a file-system tool used to read configuration files outside the working directory; an HTTP client used to call internal admin endpoints.
Defense: tool argument validation at the tool layer, not the agent layer. The agent proposes; the tool wrapper validates and rejects. Argument allow-lists are stronger than deny-lists for high-risk tools. Tools that perform destructive actions (delete, write, send, execute) require an explicit confirmation step, even at the cost of latency.
Detection: tool-call audit log with input arguments, output snippet, agent identity, and timestamp. Alerts on tool calls whose arguments deviate from the statistical norm for that tool.
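A minimal sketch of "the agent proposes; the tool wrapper validates and rejects" for the database-read example. The names (`ToolPolicy`, `guarded_query`) and the policy fields are illustrative assumptions, not part of any particular framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolPolicy:
    allowed_tables: frozenset   # allow-list: stronger than a deny-list
    max_rows: int

class ToolRejection(Exception):
    """Raised by the wrapper; the agent never sees raw database errors."""

def guarded_query(table: str, limit: int, policy: ToolPolicy) -> str:
    """Validate arguments at the tool layer before building the query."""
    if table not in policy.allowed_tables:
        raise ToolRejection(f"table {table!r} not on allow-list")
    if not (0 < limit <= policy.max_rows):
        raise ToolRejection(f"limit {limit} exceeds policy max {policy.max_rows}")
    # Prepared-statement style: validated arguments only, never raw SQL.
    return f"SELECT * FROM {table} LIMIT {limit}"
```

The key design choice is that the policy lives in the wrapper, outside the prompt-controllable agent layer, so an adversarial input can change what the agent *asks for* but not what the tool *accepts*.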
3. Memory Poisoning
Long-running agent systems maintain state across sessions in vector stores, document indexes, or scratchpad memory. An attacker who gets adversarial content into that memory persists their attack across future sessions, including sessions involving other users. The poisoned memory entry is retrieved as legitimate context and influences future behavior.
Defense: per-user memory isolation by default. Cross-user memory only with explicit provenance tagging and admin sign-off. Sanitization of any external content before it enters memory: strip HTML, escape instruction-shaped phrasing, attribute the source.
Detection: memory entries carry source attribution (user, session, document URI, ingestion timestamp). Anomaly detection on retrieval patterns — an entry retrieved for many users that originated from a single session is suspicious.
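The defense and detection above can be combined into one storage shape: every entry carries source attribution, external content is sanitized on write, and reads are scoped per user. This is a minimal sketch with illustrative field names, assuming HTML escaping as the sanitization step:

```python
import html
import time
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    user_id: str
    session_id: str
    source_uri: str      # provenance: where the content came from
    content: str
    ingested_at: float = field(default_factory=time.time)

class UserScopedMemory:
    def __init__(self) -> None:
        self._store: dict[str, list[MemoryEntry]] = {}

    def write(self, entry: MemoryEntry) -> None:
        # Sanitize external content before it enters memory.
        entry.content = html.escape(entry.content)
        self._store.setdefault(entry.user_id, []).append(entry)

    def read(self, user_id: str) -> list[MemoryEntry]:
        # Per-user isolation: a user retrieves only their own entries.
        return self._store.get(user_id, [])
```

Cross-user sharing, when needed, would go through a separate code path that preserves the provenance fields, so the "entry retrieved for many users but originating from one session" anomaly remains detectable.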
4. Goal Hijacking
An adversarial input redirects the agent’s objective from the user-stated task to an attacker-stated task. Distinct from prompt injection because the attack does not necessarily inject new instructions; it can simply reframe the existing task. A customer-service agent originally tasked to “help the user with their account” might be reframed as “help the user identify weaknesses in our refund policy.”
Defense: the agent’s goal is encoded in the system prompt, not derived from user input. User input is treated as data the agent operates on, not as a revision of the goal. Goal-anchoring techniques (re-stating the goal at every turn) reduce drift.
Detection: per-turn goal-consistency check. A small auxiliary model receives the current turn and the original goal and emits a similarity score. Drift below threshold pauses the agent and escalates to a human.
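The per-turn check described above can be sketched as follows. A real deployment would use the auxiliary model for the similarity score; here a crude lexical-overlap measure stands in so the control flow (score each turn, pause and escalate below threshold) is visible. The threshold value is an illustrative assumption:

```python
def goal_similarity(original_goal: str, current_turn: str) -> float:
    """Stand-in similarity: fraction of goal terms present in the turn.
    In production this would be an auxiliary-model score."""
    goal_terms = set(original_goal.lower().split())
    turn_terms = set(current_turn.lower().split())
    if not goal_terms:
        return 0.0
    return len(goal_terms & turn_terms) / len(goal_terms)

def check_goal_drift(original_goal: str, current_turn: str,
                     threshold: float = 0.3) -> str:
    """Return 'continue' or 'escalate_to_human' for this turn."""
    score = goal_similarity(original_goal, current_turn)
    return "continue" if score >= threshold else "escalate_to_human"
```

The important property is that the original goal comes from the system prompt, not from the conversation, so the comparison anchor itself cannot be hijacked mid-session.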
5. Privilege Escalation
An agent with limited tool access communicates with an agent that has broader access (often the orchestrator or a specialized privileged worker). The lower-privilege agent crafts messages that cause the higher-privilege agent to perform actions outside the original scope. Common in “manager” orchestration patterns where the manager delegates to workers and trusts worker-returned plans.
Defense: least-privilege per-agent tool grants. Privileged agents do not accept plans from less-privileged agents without re-validation against the original user intent. Privilege boundaries are enforced at the tool wrapper layer, not at the agent layer (because the agent layer is prompt-controllable and therefore not a trust boundary).
Detection: tool-call provenance trail showing which agent originated the request and which agent ultimately invoked the tool. Disagreement between these two indicates potential escalation.
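The provenance trail above can be sketched as a record with both the originating and the invoking agent, plus a scan that flags mismatches across a privilege ordering. The record fields and the integer privilege levels are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCallRecord:
    trace_id: str
    originating_agent: str   # agent whose message requested the action
    invoking_agent: str      # agent that actually invoked the tool
    tool: str

def escalation_suspects(records: list[ToolCallRecord],
                        privilege: dict[str, int]) -> list[ToolCallRecord]:
    """Flag calls where a lower-privilege agent originated a request
    that a higher-privilege agent executed on its behalf."""
    return [
        r for r in records
        if privilege.get(r.originating_agent, 0)
           < privilege.get(r.invoking_agent, 0)
    ]
```

Not every flagged call is an attack (delegation patterns legitimately cross privilege levels), but each one is a case where the re-validation against original user intent described above should have happened and can be audited.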
6. Agent Collusion
Two or more agents combine outputs in a way that violates a constraint each individually respects. An agent permitted to read sensitive data emits a summary; another agent permitted to write public output ingests the summary and publishes it. Neither agent violated its individual policy; the policy boundary was at the orchestration layer and was not enforced.
Defense: data-flow labeling. Outputs from agents that processed sensitive data carry a sensitivity tag. The output sink (storage, public publishing, external API) checks the tag before accepting the data. This is information-flow control, applied to agent outputs.
Detection: tag-violation audit at output sinks. Any drop or modification of a sensitivity tag between agents is logged and reviewed.
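The data-flow labeling above can be sketched with two rules: an output inherits the highest sensitivity of its inputs, and the sink (not the agents) enforces the boundary. Names and the two-level label set are illustrative; real systems may need more label levels:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledOutput:
    content: str
    sensitivity: str   # "public" or "sensitive" in this sketch

def derive_label(inputs: list[LabeledOutput], content: str) -> LabeledOutput:
    """An output inherits the highest sensitivity among its inputs,
    so a summary of sensitive data stays sensitive."""
    tag = ("sensitive"
           if any(i.sensitivity == "sensitive" for i in inputs)
           else "public")
    return LabeledOutput(content, tag)

def public_sink_accepts(output: LabeledOutput) -> bool:
    """The sink enforces the policy boundary, not the individual agents."""
    return output.sensitivity == "public"
```

This closes the collusion gap described above: the summarizing agent and the publishing agent each behave within policy, but the tag travels with the data and the sink rejects it.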
7. Data Exfiltration
The classic exfiltration attack from single-LLM applications, amplified in agent systems by the larger context windows agents accumulate. An adversarial document instructs the agent to encode a sensitive value (an API key, a customer record) into a tool argument that leaves the trust boundary — an HTTP request to an attacker-controlled URL, a search query, an outbound email.
Defense: egress filtering at the tool wrapper. Outbound HTTP requests must match an allow-list of domains. Outbound emails must match an allow-list of recipients or be queued for human approval. Sensitive values in agent context are tagged at ingestion and the tool wrapper rejects calls whose arguments contain tagged values.
Detection: outbound traffic monitoring on the tool wrapper layer. Statistical baseline of argument distributions per tool. Anomaly on argument size, encoding density, or destination is flagged.
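The egress-filtering defense above can be sketched as a single check at the tool wrapper: the destination must be on the domain allow-list, and the payload must not contain any value tagged as sensitive at ingestion. Domain names and the substring-based tagging scheme are illustrative assumptions:

```python
from urllib.parse import urlparse

# Illustrative allow-list; populated per deployment.
ALLOWED_DOMAINS = {"api.internal.example.com", "docs.example.com"}

def egress_allowed(url: str, body: str, tagged_values: set[str]) -> bool:
    """Return True only if the outbound request passes both checks."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        return False                      # destination not on allow-list
    if any(v in body for v in tagged_values):
        return False                      # tagged sensitive value in payload
    return True
```

A substring check is deliberately naive — attackers encode exfiltrated values, which is exactly why the detection layer above also baselines argument size and encoding density rather than relying on exact matches.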
Sandbox Patterns
Three sandbox patterns repeat across well-architected agent deployments. They are not mutually exclusive; mature systems combine all three.
- MCP-per-agent. Each worker agent receives an MCP server scoped to exactly the tools it needs, with arguments validated at the MCP layer. The agent cannot see or invoke tools outside its MCP scope. The MCP server is the per-agent capability boundary.
- Least-tool-privilege. Every tool exposes the minimum capability needed for the current task. A “file read” tool reads a single file path passed at construction time, not arbitrary paths chosen at call time. A “database query” tool runs one prepared statement, not arbitrary SQL.
- Argument validators. Tool wrappers validate arguments against a typed schema before invocation. JSON Schema is the simplest path; richer validation (regex, range, allow-list) closes more attack surface. Validators that reject invalid arguments instead of attempting to fix them are stronger.
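A reject-not-fix validator can be sketched as follows, using a tiny hand-rolled schema dialect (type, regex, allow-list) rather than full JSON Schema to keep the example dependency-free. The schema contents are illustrative:

```python
import re

# Illustrative schema for a hypothetical scoped file-read tool.
SCHEMA = {
    "path": {"type": str, "regex": r"^/workdir/[\w.\-]+$"},
    "mode": {"type": str, "choices": {"r"}},
}

def validate_args(args: dict, schema: dict = SCHEMA) -> None:
    """Raise ValueError on any violation; never coerce or repair."""
    for name, rule in schema.items():
        if name not in args:
            raise ValueError(f"missing argument: {name}")
        value = args[name]
        if not isinstance(value, rule["type"]):
            raise ValueError(f"{name}: expected {rule['type'].__name__}")
        if "regex" in rule and not re.fullmatch(rule["regex"], value):
            raise ValueError(f"{name}: fails pattern check")
        if "choices" in rule and value not in rule["choices"]:
            raise ValueError(f"{name}: not on allow-list")
```

Rejecting rather than repairing matters because a validator that "helpfully" normalizes arguments gives an attacker a second parser to exploit; a hard failure is auditable and unambiguous.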
Observability Minimum
Five observability artifacts are the minimum that an auditor or incident responder will need. Without them, a security incident in an agent system is unreconstructable.
- Trace ID propagation. A correlation ID generated at the user request and attached to every downstream operation: every agent message, every tool call, every memory read or write, every external HTTP request. This is the join key for everything else.
- Tool-call audit. Per tool invocation: trace ID, agent identity, tool identifier, input arguments, output snippet (truncated for size), wall-clock timestamp, success/failure status, exception detail if any.
- Inter-agent message log. Per message between agents: trace ID, source agent, destination agent, message payload, timestamp. Used for cross-agent prompt injection detection.
- Memory operation log. Per memory read or write: trace ID, agent identity, memory store, key or query, value or retrieved content (truncated), source attribution. Used for memory poisoning forensics.
- Response attribution.The final user-facing response carries metadata showing which agents contributed which fragments. The user does not see this; the audit log does. Critical for incident reconstruction.
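The five artifacts above share one join key. A minimal sketch of trace-ID propagation with a unified audit record (field names are illustrative, not a standard schema) shows how incident reconstruction becomes a single filter:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    trace_id: str
    kind: str        # e.g. "tool_call" | "agent_message" | "memory_op"
    agent: str
    detail: dict
    ts: float = field(default_factory=time.time)

class AuditLog:
    def __init__(self) -> None:
        self.records: list[AuditRecord] = []

    def new_trace(self) -> str:
        """Minted once, at the user request."""
        return uuid.uuid4().hex

    def emit(self, trace_id: str, kind: str, agent: str, **detail) -> None:
        self.records.append(AuditRecord(trace_id, kind, agent, detail))

    def reconstruct(self, trace_id: str) -> list[AuditRecord]:
        """Everything that happened for one user request, in order."""
        return [r for r in self.records if r.trace_id == trace_id]
```

With this in place, "what did the system do for this request" is one query; without the shared trace ID it is a manual correlation exercise across per-component logs.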
Red-Team Playbook
A red-team exercise for an agent system follows a four-phase pattern. The phases run in sequence per scenario; multiple scenarios run in parallel against the same target.
- Reconnaissance. Enumerate the agent capabilities: which tools, which data, which memory, which agents talk to which. The exposed system prompt is the highest-value artifact; many agent systems leak it through adversarial prompting.
- Boundary probing. For each identified boundary (user-orchestrator, orchestrator-worker, worker-tool, agent-memory, agent-agent), construct test inputs that violate the intended trust assumptions. Document which violations succeed.
- Exploitation. Chain successful boundary violations into attacks that achieve attacker objectives: data exfiltration, privilege escalation, persistent foothold via memory poisoning. Report each chain with a reproduction recipe.
- Detection assessment. Replay the successful attacks against the production observability stack. Verify that each attack produces a detectable signal and that on-call would be paged. Attacks undetected in observability are higher-priority findings than attacks that succeeded.
Compliance Mapping
Three regulatory frameworks already address agent systems directly or by extension. Operators deploying agents in regulated environments will be assessed against these frameworks whether they intend to be or not.
| Framework | Article / Control | Agent-system mapping |
|---|---|---|
| EU AI Act | Art. 15 | Robustness, accuracy, cybersecurity — sandbox patterns + observability |
| EU AI Act | Art. 9 | Risk-management system — the 7-class threat model + per-class controls |
| EU AI Act | Art. 12 | Logging — the 5-artifact observability minimum |
| ISO 42001 | A.6.2.6, A.7 | AI-system risk treatment + data quality — memory-poisoning defenses, source attribution |
| NIST AI RMF | MEASURE 2.7, MANAGE 2.4 | Adversarial robustness measurement, response to identified risks — red-team playbook |
The mapping is not exotic. The seven attack classes are recognized risks; the defenses are standard information-flow-control and observability patterns adapted to the agent context. Operators who already practice mature application security will find the mapping straightforward. Operators who are deploying their first agent system without an established security posture should plan for the gap.
Why This Matters
Agent systems are being deployed faster than the security community is publishing defensive guidance. The asymmetry favors attackers in the short term. The 7-class model above is not exhaustive — it is the set of attack classes I have seen recur across multiple deployments. Expect the model to evolve as deployment patterns change and new attacks surface.
The observability minimum and sandbox patterns are the leverage points. An agent system with proper trace propagation, tool-call auditing, and per-agent capability scoping makes most attacks detectable and many of them outright preventable. An agent system without these is a black box that no auditor will sign off on and no incident responder can reconstruct after a breach.
Build the observability layer first. Add agents second. Most teams do this in the wrong order — ship the agent, then notice that incident response is impossible, then retrofit observability onto a deployed system. Retrofitting takes longer, produces gaps, and often requires architectural changes to the agent layer that were avoidable with planning.
About the Author
Nicholas Falshaw is a Principal Security Architect with 17+ years of enterprise firewall and network security experience across DAX-30 clients, KRITIS-regulated operators, and EU financial services. He authored the FwChange methodology after analyzing 280+ firewall migrations and is currently focused on AI-assisted security tooling and agent-system threat modeling for regulated industries.