AI Agent Security: The Complete Guide

Key takeaways

Agent security requires controls at several layers at once (identity, authorization, content, and observability), because a gap in any one layer can be exploited even when the others are sound.
Every agent needs its own credential; shared service tokens eliminate attribution and make targeted revocation impossible.
Connection-level policies — task types, rate limits, spend caps, time windows, model allowlists, and approval gates — are the natural unit for authorizing what an agent can do.
Trust scoring adds a runtime gate that adapts to the agent's actual behavioral history, blocking calls from agents whose signals have degraded even if their credentials are still valid.
Guardrails apply bidirectionally: input guardrails catch prompt injection, output guardrails catch data exfiltration — both directions must be inspected independently.

Securing an AI agent means controlling who can invoke it, what it is permitted to do, what data it can see and produce, and what happens when something goes wrong. This guide covers each of those layers in order, from identity at the perimeter through monitoring and incident response. For a focused look at any individual layer, see the related posts linked throughout.

Why Agent Security Is Different from API Security

A traditional API has a defined set of inputs and a bounded set of outputs. An AI agent is dynamic by design: it reasons, selects tools, calls other services, and produces outputs that depend on context you cannot fully anticipate.

Three classic security assumptions break down. Least privilege is harder to express: you cannot enumerate every action an agent might take before it runs. Blast radius is larger: a misconfigured agent can chain tool calls across systems in seconds. And the attack surface extends to content: prompt injection, exfiltration through model outputs, and PII leaking into logs are threat classes that traditional API security was not designed to address.

Agent security therefore requires controls at multiple layers simultaneously: the call entering the agent, the connections the agent makes to resources, the content moving through each hop, and the record left behind.

Identity and Authentication

Every agent needs its own credential — not a borrowed human session or a shared service token. When an agent acts under a human's identity, you lose accountability: the audit log shows the human, not the agent, and you cannot revoke the agent's access without revoking the human's.

The right model is to treat each agent as a first-class principal with its own identity, issued credentials, and defined scopes. When the agent authenticates, it presents its own credential. When it acts, the action is attributed to that agent. When you need to revoke access, you revoke the agent's credential without touching anything else.

At registration time, capture enough metadata to make the identity meaningful: which team owns it, what capabilities it expects to use, and what it is not permitted to do. This inventory becomes the foundation for authorization and monitoring. The post on why agents need their own credentials explains the accountability gap in detail.
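To make the first-class-principal model concrete, here is a minimal in-memory sketch. The names `AgentIdentity` and `AgentRegistry`, and the metadata fields, are assumptions for illustration rather than any particular product's API. A production system would persist the registry and store only hashed credentials.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass


@dataclass
class AgentIdentity:
    """One agent treated as a first-class principal (hypothetical schema)."""
    agent_id: str
    owner_team: str
    expected_capabilities: frozenset[str]
    denied_capabilities: frozenset[str]
    revoked: bool = False


class AgentRegistry:
    """In-memory registry, for illustration only."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentIdentity] = {}
        self._tokens: dict[str, str] = {}  # token -> agent_id

    def register(self, identity: AgentIdentity) -> str:
        """Record the agent and issue a credential that belongs to it alone."""
        if identity.agent_id in self._agents:
            raise ValueError(f"agent {identity.agent_id!r} is already registered")
        token = secrets.token_urlsafe(32)
        self._agents[identity.agent_id] = identity
        self._tokens[token] = identity.agent_id
        return token

    def authenticate(self, token: str) -> AgentIdentity | None:
        """Return the agent that owns the token, or None if unknown or revoked."""
        agent_id = self._tokens.get(token)
        if agent_id is None:
            return None
        identity = self._agents[agent_id]
        return None if identity.revoked else identity

    def revoke(self, agent_id: str) -> None:
        """Revoke one agent's access without touching any other principal."""
        self._agents[agent_id].revoked = True
```

Because each token maps to exactly one agent, calling `revoke("billing-agent")` leaves every other agent's credential working, and every authenticated action can be attributed to a single `agent_id`.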

Authorization: Connections and Communication Policies

Authentication answers "who is calling." Authorization answers "what are they allowed to do." For agents, authorization needs to extend to every link in the chain — not just the entry point.

The connection is the natural unit for this. Each connection represents a directed link from one agent to another, or from an agent to an external resource such as an MCP server. Attaching a policy to the connection lets you express constraints precisely where they matter: on the edge between caller and resource.

A well-formed connection policy covers several dimensions:

Task types: which categories of work the connection permits. You might allow an agent to query data but not to write or delete.
Rate limits: per-minute and per-hour request budgets, preventing a looping or runaway agent from saturating a downstream service.
Time windows: restricting calls to business hours, or to the duration of a scheduled job.
Spend caps: a monthly cost ceiling beyond which calls on the connection are refused, protecting against unbounded token expenditure on a single path.
Model and tool allowlists: restricting which models or tools the agent may invoke through this connection, so it cannot escalate to a more capable or more expensive resource than expected.
Approval gates: marking certain connections as requiring a human confirmation step before the call proceeds.

Crucially, a policy is only useful if it is enforced at dispatch time, not merely stored. Enforcement needs to happen before the task is handed off, and violations need to be logged so they appear in your audit trail. For how zero-trust principles apply this model at every hop, see zero trust for AI agents.
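The policy dimensions and the dispatch-time check described above can be sketched as follows. This is an illustrative outline under stated assumptions: `ConnectionPolicy`, `Call`, `violation`, and `dispatch` are hypothetical names, the request count and monthly spend are assumed to come from counters that are not shown, and the time window is assumed not to cross midnight.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, time
from typing import Callable


@dataclass(frozen=True)
class ConnectionPolicy:
    """Constraints attached to one caller-to-resource edge (hypothetical schema)."""
    allowed_task_types: frozenset[str]
    allowed_tools: frozenset[str]
    max_requests_per_minute: int
    monthly_spend_cap_usd: float
    window_start: time  # assumes a same-day window (start <= end)
    window_end: time
    requires_approval: bool = False


@dataclass(frozen=True)
class Call:
    task_type: str
    tool: str
    at: datetime
    requests_last_minute: int    # supplied by a rate counter, not shown
    spend_this_month_usd: float  # supplied by a cost meter, not shown
    approved: bool = False


def violation(policy: ConnectionPolicy, call: Call) -> str | None:
    """Return the first violated constraint, or None if the call is permitted."""
    if call.task_type not in policy.allowed_task_types:
        return "task_type_not_permitted"
    if call.tool not in policy.allowed_tools:
        return "tool_not_allowlisted"
    if not (policy.window_start <= call.at.time() <= policy.window_end):
        return "outside_time_window"
    if call.requests_last_minute >= policy.max_requests_per_minute:
        return "rate_limit_exceeded"
    if call.spend_this_month_usd >= policy.monthly_spend_cap_usd:
        return "spend_cap_reached"
    if policy.requires_approval and not call.approved:
        return "approval_required"
    return None


def dispatch(
    policy: ConnectionPolicy,
    call: Call,
    handler: Callable[[Call], object],
    audit_log: list[dict],
) -> bool:
    """Enforce the policy before handoff; log the violation if the call is refused."""
    reason = violation(policy, call)
    if reason is not None:
        audit_log.append({
            "event": "policy_violation",
            "reason": reason,
            "task_type": call.task_type,
            "tool": call.tool,
            "at": call.at.isoformat(),
        })
        return False
    handler(call)  # the task is handed off only after every check has passed
    return True
```

The point of the design is the ordering in `dispatch`: the handler is unreachable until the policy has been evaluated, and a refusal always leaves a record.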

Trust Scoring as a Gate

Authorization based on static policy is necessary but not sufficient. An agent's trustworthiness is also a function of its recent behavior. An agent that has been producing high error rates, triggering guardrail violations, or failing health checks should be treated with more caution than one with a clean record.

A trust score aggregates behavioral signals — error rate, policy violation history, attestation status, version recency — into a single value that can be compared against a minimum threshold. Configuring a trust floor on a connection means the connection refuses calls from agents whose trust score falls below that threshold, even if their credentials are valid. This is a runtime gate: it adapts to what the agent has actually been doing, not just what it was permitted to do at configuration time.

The fail-closed property matters here. If the trust evaluation itself fails — due to a transient database error or service unavailability — the connection should deny the request rather than grant it. A fail-open trust gate is not a trust gate.
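A minimal sketch of a trust floor with fail-closed behavior might look like this. The weights, the 180-day recency horizon, and the names `TrustSignals`, `trust_score`, and `trust_gate` are all assumptions chosen for illustration. The text above does not prescribe a formula, and treating "version recency" as days since the last release is one possible reading.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TrustSignals:
    error_rate: float            # fraction of recent calls that failed, 0.0 to 1.0
    violation_rate: float        # fraction of recent calls with a violation, 0.0 to 1.0
    attested: bool               # whether the agent's attestation is current
    days_since_last_release: int


def _clamp(value: float) -> float:
    return min(1.0, max(0.0, value))


def trust_score(signals: TrustSignals) -> float:
    """Combine signals into one value in [0, 1] using hypothetical weights."""
    recency = _clamp(1.0 - signals.days_since_last_release / 180)
    score = (
        0.4 * (1.0 - _clamp(signals.error_rate))
        + 0.3 * (1.0 - _clamp(signals.violation_rate))
        + 0.2 * (1.0 if signals.attested else 0.0)
        + 0.1 * recency
    )
    return round(_clamp(score), 4)


def trust_gate(
    load_signals: Callable[[str], TrustSignals],
    agent_id: str,
    trust_floor: float,
) -> bool:
    """Allow the call only if the score can be computed and meets the floor."""
    try:
        return trust_score(load_signals(agent_id)) >= trust_floor
    except Exception:
        # Fail closed: if the evaluation itself fails, deny the request.
        return False
```

The broad `except` is deliberate in this sketch: any failure to load or score the signals results in a denial, never in an implicit grant.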

Content Guardrails

Even a properly authenticated and authorized agent can produce harmful outputs. Guardrails operate on the content of requests and responses, independent of whether the caller had permission to make the call.

The three most common enforcement actions are block, redact, and warn. Blocking stops the request or response from proceeding. Redacting allows it to proceed but removes or masks the sensitive portion. Warning lets it through but emits a log entry and an alert, leaving action to a human reviewer.
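As a rough illustration of the three actions, the sketch below applies regex-based rules to a piece of text. The rule names and patterns in `EXAMPLE_RULES` are hypothetical and deliberately simplistic. Real detectors are usually more sophisticated than a single regular expression.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    REDACT = "redact"
    WARN = "warn"


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern[str]
    action: Action


@dataclass
class Result:
    allowed: bool
    text: str
    triggered: list[str]


def apply_rules(text: str, rules: list[Rule]) -> Result:
    """Apply guardrail rules; any BLOCK match wins over REDACT and WARN."""
    matched = [rule for rule in rules if rule.pattern.search(text)]
    triggered = [rule.name for rule in matched]
    if any(rule.action is Action.BLOCK for rule in matched):
        return Result(False, "", triggered)
    for rule in matched:
        if rule.action is Action.REDACT:
            text = rule.pattern.sub("[REDACTED]", text)
    # WARN rules leave the text unchanged; the caller logs and alerts on `triggered`.
    return Result(True, text, triggered)


# Hypothetical example rules, one per action.
EXAMPLE_RULES = [
    Rule("private_key", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"), Action.BLOCK),
    Rule("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), Action.REDACT),
    Rule("internal_marker", re.compile(r"\binternal only\b", re.IGNORECASE), Action.WARN),
]
```

With these rules, a message containing an email address passes with the address masked, a message containing a private key header is stopped outright, and a message marked "internal only" passes unchanged but is reported in `triggered` for review.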

What you choose depends on the rule and the tolerance for false positives. Rules targeting credentials, private keys, or clearly out-of-scope PII are usually worth blocking. Rules detecting patterns that sometimes appear in legitimate content may be better suited to warn-and-review. The cost of a false positive is workflow disruption; the cost of a false negative is a data leak. Only you can calibrate that trade-off for each rule.

Guardrails apply bidirectionally. Input guardrails catch prompt injection — attempts by malicious content in the environment to redirect the agent's behavior. Output guardrails catch data exfiltration — agents including sensitive data in responses destined for untrusted consumers. Both directions matter, and they need to be inspected independently. See how to detect and defend against prompt injection for concrete detection patterns.

Connections can carry their own guardrail assignments, so the rules applied to a high-sensitivity connection differ from those on a low-risk one without requiring a single global policy.
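One way to picture per-connection, per-direction assignments is a lookup keyed by connection and direction. Everything in this sketch is hypothetical: the connection identifiers, the two patterns, and the `ASSIGNMENTS` table are stand-ins, and the injection and API-key patterns are toy examples rather than reliable detectors.

```python
from __future__ import annotations

import re
from enum import Enum


class Direction(Enum):
    INPUT = "input"    # content entering the agent
    OUTPUT = "output"  # content leaving the agent


# Toy patterns for illustration only.
INJECTION = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")

# Each (connection, direction) pair carries its own rule set.
ASSIGNMENTS: dict[tuple[str, Direction], dict[str, re.Pattern[str]]] = {
    ("agent-a->payments-mcp", Direction.INPUT): {"prompt_injection": INJECTION},
    ("agent-a->payments-mcp", Direction.OUTPUT): {"api_key": API_KEY},
    ("agent-a->docs-search", Direction.INPUT): {"prompt_injection": INJECTION},
}


def violations(connection_id: str, direction: Direction, text: str) -> list[str]:
    """Return the names of the rules assigned to this edge and direction that match."""
    rules = ASSIGNMENTS.get((connection_id, direction), {})
    return [name for name, pattern in rules.items() if pattern.search(text)]
```

Inspecting each direction with its own rule set means an output check still runs even when the corresponding input looked clean, and the high-sensitivity payments connection can carry stricter rules than the low-risk search connection.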

Monitoring and Health Tracking

Operational security depends on observability. Track per connection: request rate, error rate, latency, guardrail violation counts, and cumulative spend. Point-in-time health snapshots taken at regular intervals let you reconstruct what the connection state looked like at any moment — useful for both incident investigation and capacity planning.

Watch for anomaly patterns too, both per agent and across the fleet: deviations from an established behavioral baseline that suggest something has changed, either in the agent's own behavior or in an attacker's activity.

Calibrate alert thresholds carefully. Thresholds tuned to each connection's own baseline catch anomalies with fewer false alarms than a single static threshold applied uniformly. For a structured view of what to track and how to surface it, see observability for AI agents: logs, metrics, and traces.
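To illustrate why a baseline-relative check can behave better than one static threshold, here is a small sketch using a z-score over a connection's recent history. The function name `is_anomalous` and the default threshold of three standard deviations are assumptions. Other baselining methods would serve equally well.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Flag `current` if it deviates strongly from this connection's own baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > z_threshold


# Requests per minute for two hypothetical connections.
quiet_history = [10, 12, 11, 9, 10]
busy_history = [400, 420, 390, 410, 405]

quiet_alert = is_anomalous(quiet_history, 40)   # a 4x jump on a quiet connection
busy_alert = is_anomalous(busy_history, 415)    # normal variation on a busy one
```

A static limit of, say, 100 requests per minute would stay silent on the quiet connection's jump to 40 and fire constantly on the busy one. Here `quiet_alert` is `True` and `busy_alert` is `False`.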

Incident Response

When an agent misbehaves or you suspect a credential compromise, act quickly and with precision. You have three immediate options: disable the agent (stops new task routing), revoke its credential (blocks authentication for new requests), and disable individual connections (cuts specific paths without a full shutdown). Surgical connection-level action is usually preferable because it minimizes collateral disruption.

Investigation requires a complete audit trail. Every policy violation, guardrail trigger, connection status change, and authentication event should be a durable, attributable record. Append-only logs with cryptographic chaining produce records that can be independently verified — important for both internal postmortems and any regulatory scrutiny.
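The idea of an append-only log with cryptographic chaining can be sketched with SHA-256 from Python's standard library. The `AuditLog` class and its entry layout are illustrative assumptions. A real deployment would also need durable storage and a way to anchor the latest hash somewhere an attacker cannot rewrite.

```python
from __future__ import annotations

import copy
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only, hash-chained event log (in-memory sketch)."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, event: dict) -> str:
        """Add an event linked to the previous entry and return its hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _digest(prev_hash, event)
        self._entries.append(
            {"event": copy.deepcopy(event), "prev_hash": prev_hash, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Because each hash covers the one before it, altering an old record invalidates every later entry, which is what lets a third party verify the log independently.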

After containment, reconstruct the timeline from the audit trail: when did the anomalous behavior begin, which connections were involved, and what outputs were produced? Those answers drive both remediation and the policy and monitoring changes that reduce the risk of recurrence.

Then establish what the normal state looked like before the incident, and update connection policies and monitoring thresholds so the same pattern triggers an alert sooner next time. For a structured approach to the incident process itself, see incident response for AI agent breaches. Teams preparing before an incident occurs will find the AI agent incident readiness checklist a useful starting point.

Common questions

Does every agent need its own identity, or can a team share credentials? Every agent should have its own credential. Shared credentials eliminate attribution — when something goes wrong, you cannot tell which agent caused it, and revoking the shared credential takes down every agent using it. The operational overhead of per-agent credentials is low; the governance benefit is significant.

What is the difference between a rate limit on a connection and a spend cap? Rate limits constrain the frequency of calls regardless of cost — they protect downstream services from being overwhelmed. Spend caps constrain the total monetary cost regardless of frequency — they protect your budget from runaway token consumption. Both are necessary; they address different failure modes. A low-cost but very high-frequency agent needs rate limits. A low-frequency but expensive agent needs a spend cap.

When should I use block versus warn for a guardrail rule? Use block for content you can confidently say should never cross that boundary, such as credentials, clearly out-of-scope PII, or known-bad patterns. Use warn for content where the rule may have a meaningful false-positive rate. Warn-and-review keeps humans in the loop without stopping work; block without a review path makes operations depend directly on the rule's accuracy, because every false positive stops legitimate work.

How should I sequence these controls when building from scratch? Start with identity — one credential per agent — since every other control depends on knowing which agent is acting. Add connection-level spend caps and rate limits next, as they bound the worst-case outcomes from a misconfigured or looping agent. Content guardrails and trust scoring follow once you have the baseline instrumentation to calibrate them. Audit logging should be in place from the start so you have a record before the first incident. For a sequenced approach tailored to smaller teams, see AI agent security for startups.