Guardrails


Guardrails are content-level controls that define what can be said or requested in messages between entities. They operate on the semantic content of interactions, not just authentication headers or API keys.

Why guardrails?

Authentication answers "who is calling?" but not "what are they asking for?" In AI infrastructure, the content of every interaction matters:

An agent might be authorized to access a database tool but should not be allowed to request deletion of records
An MCP server might return data that should be filtered before reaching certain callers
An application should not receive raw error messages or internal system information

Guardrails close this gap by inspecting and filtering the content of interactions.

Configuring guardrails

Guardrails are configured per connection, per direction. For each connection, you can set:

Client-side guardrails

Controls on the client entity in the connection:

Outgoing requests: What the client is allowed to send
Incoming responses: What the client is allowed to receive

Server-side guardrails

Controls on the server entity in the connection:

Incoming requests: What the server will accept
Outgoing responses: What the server is allowed to return
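
The four slots above can be pictured as one record per connection. The following is a minimal sketch, not the product's configuration schema: the class name, field names, guardrail names, and the evaluation order (client side before server side on requests, server side before client side on responses) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionGuardrails:
    """Hypothetical per-connection record with one slot per side and direction."""
    client_outgoing_requests: list[str] = field(default_factory=list)
    client_incoming_responses: list[str] = field(default_factory=list)
    server_incoming_requests: list[str] = field(default_factory=list)
    server_outgoing_responses: list[str] = field(default_factory=list)

    def request_path(self) -> list[str]:
        # Assumed order: a request leaves the client, then arrives at the server.
        return self.client_outgoing_requests + self.server_incoming_requests

    def response_path(self) -> list[str]:
        # Assumed order: a response leaves the server, then arrives at the client.
        return self.server_outgoing_responses + self.client_incoming_responses


# Guardrail names here are placeholders, not built-in presets.
conn = ConnectionGuardrails(
    client_outgoing_requests=["block-prohibited-topics"],
    server_incoming_requests=["read-only-tools"],
    server_outgoing_responses=["redact-pii"],
)
```

Keeping the four slots separate is what lets the same entity carry different rules on different connections.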

Example configurations

Restrict tool access

Allow an agent to use read-only tools on an MCP server but block write operations:

Server-side incoming guardrail: Only allow requests to list_* and get_* tools. Block create_*, update_*, and delete_* tools.
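
As a rough illustration of the rule above, the check can be expressed with shell-style wildcards. This is a sketch under assumptions: the function name is hypothetical, and denying tools that match neither list is a design choice made here, not documented behavior.

```python
from fnmatch import fnmatchcase

ALLOWED_TOOL_PATTERNS = ("list_*", "get_*")
BLOCKED_TOOL_PATTERNS = ("create_*", "update_*", "delete_*")


def is_tool_allowed(tool_name: str) -> bool:
    """Return True only for tools that match an allowed pattern."""
    # An explicit block always wins.
    if any(fnmatchcase(tool_name, p) for p in BLOCKED_TOOL_PATTERNS):
        return False
    # Assumption: anything not explicitly allowed is denied.
    return any(fnmatchcase(tool_name, p) for p in ALLOWED_TOOL_PATTERNS)
```

With these patterns, get_user and list_orders pass, delete_user is blocked, and an unlisted tool such as export_all is denied by default.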

Filter sensitive data

Prevent an MCP server from returning personally identifiable information to a public-facing application:

Server-side outgoing guardrail: Redact email addresses, phone numbers, and social security numbers from responses.
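
To make the redaction action concrete, here is a simplified sketch using regular expressions. The patterns are illustrative assumptions that only cover common US-style formats; a purpose-built PII detector will recognize many more variants than these three expressions.

```python
import re

# Illustrative patterns only; real PII detection is broader than this.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each detected span with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```

The response still reaches the caller, with only the sensitive spans replaced.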

Content restrictions

Prevent an agent from making requests that include prohibited content:

Client-side outgoing guardrail: Block requests containing financial advice, medical diagnoses, or legal recommendations.

The intent layer: catching the right call on the wrong data

The controls above are a fast pattern layer: they inspect the content of a message and block or redact known-bad patterns — prompt injection, PII, credentials, prohibited tool names. Pattern matching is essential, but it only sees the text of a call, not what the call is trying to do.

The intent layer sits on top. It evaluates the action a call represents against declarative, organization-tunable rules, so it can catch calls that are technically valid and individually clean but semantically wrong — the right API call against the wrong data. Its rules target violation classes that pattern matching misses:

Scope escalation — an entity reaching for permissions or resources beyond its assigned scope.
Bulk data exfiltration — a single well-formed call that returns far more data than the task needs.
Tool out of scope — invoking a tool that has no legitimate role in the current task.

The intent layer is rule-based: you declare the conditions, and each evaluation resolves to a block verdict (enforced, fail-closed — the call does not proceed) or a flag verdict (recorded for observation without stopping the task). Flag first to measure, then promote to block once the rule is trusted.
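
The block and flag modes can be sketched as follows. Everything here is hypothetical: the Call fields, the rule names, and the 1000-row threshold are stand-ins chosen for illustration, and the real rule language is declarative rather than Python lambdas.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Call:
    entity_scope: set[str]  # resources the caller is assigned (assumed field)
    resource: str           # resource the call targets
    requested_rows: int     # how much data the call asks for


@dataclass
class IntentRule:
    name: str
    violates: Callable[[Call], bool]
    mode: str  # "block" or "flag"


def evaluate(call: Call, rules: list[IntentRule]) -> tuple[bool, list[str]]:
    """Return (allowed, names of the rules that fired)."""
    allowed = True
    fired = []
    for rule in rules:
        if rule.violates(call):
            fired.append(rule.name)
            if rule.mode == "block":
                allowed = False  # a block verdict stops the call
    return allowed, fired


RULES = [
    IntentRule("scope_escalation", lambda c: c.resource not in c.entity_scope, "block"),
    IntentRule("bulk_exfiltration", lambda c: c.requested_rows > 1000, "flag"),
]
```

In this sketch, promoting bulk_exfiltration from flag to block is a one-word change to the rule, which mirrors the flag-first workflow described above.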

Because intent rules evaluate every hop, they extend to agent-to-agent chains. Each inter-agent call carries a chain identifier tying every hop back to the request that started it, so a verdict is attributable to the exact point in the chain — visible hop by hop in the chain graph view. For the concepts behind this, see Content Guardrails for AI Agents and Agent-to-Agent (A2A) Communication, Governed.
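
One way to picture hop-level attribution is a shared identifier stamped on every hop. This is a sketch under assumptions: the Hop and Chain types, their fields, and the use of a random hex identifier are illustrative and do not describe the actual wire format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hop:
    chain_id: str  # same value on every hop of one originating request
    index: int     # position of this hop in the chain
    caller: str
    callee: str
    verdict: str = "allow"


@dataclass
class Chain:
    chain_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    hops: list[Hop] = field(default_factory=list)

    def record(self, caller: str, callee: str, verdict: str = "allow") -> Hop:
        hop = Hop(self.chain_id, len(self.hops), caller, callee, verdict)
        self.hops.append(hop)
        return hop

    def first_non_allow(self) -> Optional[Hop]:
        # The first hop whose verdict was a flag or a block, if any.
        return next((h for h in self.hops if h.verdict != "allow"), None)


chain = Chain()
chain.record("app", "planner")
chain.record("planner", "researcher", verdict="flag")
```

Because each hop carries the chain identifier and its own index, a verdict can be traced to one specific inter-agent call rather than to the chain as a whole.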

Guardrail engines

Under the configuration model above, three evaluation engines do the work, and a guardrail can combine them:

Rule and pattern matching. Deterministic checks — keyword and regex rules — evaluated in-line with negligible latency. Patterns are guarded against pathological regex (ReDoS), so a badly written rule cannot stall the evaluation path. Best for tool-name restrictions, known-bad strings, and structural checks.
PII detection and redaction. Purpose-built detection of personal data categories (emails, phone numbers, identifiers) with redaction as the action — the content proceeds with the sensitive spans removed, rather than the whole message being blocked.
LLM-based evaluation. A model-judged check for semantic categories that patterns cannot express — topical restrictions, intent classification, nuanced content policies. Higher latency and cost than pattern rules, so it is best reserved for the flows whose risk justifies it.
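
A combined guardrail might run the cheap deterministic engines first and call the model only for content that survives them. The sketch below assumes that ordering; the function name, the two patterns, and the llm_judge callable (a stand-in that returns True when content is acceptable) are hypothetical.

```python
import re
from typing import Callable

# Illustrative patterns, not a complete rule set.
BLOCKED = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def run_guardrail(text: str, llm_judge: Callable[[str], bool]) -> tuple[str, str]:
    """Return (verdict, content); verdict is "block", "redact", or "allow"."""
    # 1. Pattern rule: cheap, so it runs first and can short-circuit.
    if BLOCKED.search(text):
        return "block", ""
    # 2. PII redaction: the content proceeds with sensitive spans replaced.
    redacted = EMAIL.sub("[REDACTED_EMAIL]", text)
    # 3. Model-judged check: the costly step, reached only when needed,
    #    and given the already-redacted content.
    if not llm_judge(redacted):
        return "block", ""
    return ("redact" if redacted != text else "allow"), redacted
```

Ordering the engines this way keeps the model's latency and cost off every message that a pattern rule already rejects.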

Templates and presets cover the common configurations, so most teams start from a preset and tune rather than authoring from scratch. Every evaluation is recorded: content logs capture what was checked and what verdict resulted, and per-guardrail statistics and health views show hit rates over time — which is how you distinguish a guardrail that is working from one that is merely configured.
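
The difference between working and merely configured can be computed from an evaluation log. This sketch assumes a log of dictionaries with guardrail and verdict keys, which is an invented shape for illustration and not the product's export format.

```python
from collections import Counter


def summarize(log: list[dict], configured: list[str]) -> dict[str, dict]:
    """Per-guardrail evaluation counts and hit rates from a log of verdicts."""
    evaluations = Counter(entry["guardrail"] for entry in log)
    hits = Counter(entry["guardrail"] for entry in log if entry["verdict"] != "allow")
    summary = {}
    for name in configured:
        total = evaluations[name]
        summary[name] = {
            "evaluations": total,
            # None marks a guardrail that was configured but never evaluated.
            "hit_rate": hits[name] / total if total else None,
        }
    return summary
```

A guardrail that appears in the configured list with zero evaluations is the case worth investigating first.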

Guardrails vs policies

Guardrails and policies serve different purposes:

	Guardrails	Policies
What	Content of communication	Mechanics of communication
Examples	Block PII, restrict tool access	Rate limits, geo-restrictions
Scope	Semantic analysis	Operational parameters

Both work together. A request that satisfies policy controls (within rate limits, from an approved region) is still subject to guardrail evaluation and can be blocked there, for example because it requests prohibited data.
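
The two layers can be read as successive gates on the same request. In this sketch the function name, the request shape, and the choice to run policy checks before the guardrail check are assumptions made for illustration.

```python
def admit(
    request: dict,
    rate_limit_remaining: int,
    approved_regions: set[str],
    prohibited_fields: set[str],
) -> tuple[bool, str]:
    """Apply policy checks (mechanics), then a guardrail check (content)."""
    # Policy: operational parameters of the communication.
    if rate_limit_remaining <= 0:
        return False, "policy: rate limit exceeded"
    if request["region"] not in approved_regions:
        return False, "policy: region not approved"
    # Guardrail: what the request is actually asking for.
    if prohibited_fields & set(request["fields"]):
        return False, "guardrail: prohibited data requested"
    return True, "ok"
```

A request from an approved region and within its rate limit is still rejected here if it asks for a prohibited field.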

Common questions

Which guardrail engine should I start with? Start with pattern rules and PII redaction: they are deterministic, add negligible latency, and cover the highest-frequency risks (known-bad content, sensitive data in responses). Add LLM-based evaluation selectively, on the flows where semantic judgment is worth its latency and cost. The statistics views show where pattern rules are missing things.

Do guardrails block by default, or just log? Guardrails are enforced in-line — a block verdict stops the content from proceeding, and redaction modifies it before delivery. For intent-layer rules specifically, you choose per rule between block (enforced, fail-closed) and flag (recorded without stopping the task); the recommended pattern is flag first to measure, then promote to block once the rule is trusted.

Why configure guardrails per direction rather than per entity? Because the risks differ by direction: what a client may send (injection, prohibited requests) and what it may receive (PII, internal details) are different rule sets. Per-direction configuration on the connection expresses both precisely, and lets the same entity carry different rules on different relationships.

How do I know a guardrail is actually working? Check its evaluation record: content logs show every check and verdict, and per-guardrail statistics show hit rates over time. A guardrail with zero evaluations is most likely not attached to the flow you think it is. That absence is the most common misconfiguration, and the statistics views make it visible.