Content Guardrails for AI Agents

Content guardrails are policy rules that sit on the task path for your AI agents, scanning both the input that arrives and the output that leaves before either crosses a trust boundary. When a guardrail triggers, the platform takes a configured action — blocking the request, redacting the sensitive fragment, issuing a warning for review, or escalating to a human — rather than simply logging the fact after the damage is done. Guardrails apply at the level of each organization and can be scoped to a specific agent or connection, so your security and compliance posture is enforced automatically at every execution, not just when someone remembers to check.

Why Content Inspection Matters for Agentic Systems

Traditional application security focuses on who can make a request. Agentic systems add a second question: what content is moving through the system, and is it safe to let it proceed? An agent that can call tools, write to external systems, or pass results to downstream agents creates a content pipeline that is hard to audit manually. Sensitive data, such as personal identifiers, credentials, and confidential business information, can appear in prompts fed to agents or in the responses they produce. Without automated inspection at the boundary, the only control is trusting that neither the user nor the model made a mistake.

Guardrails address this by treating content as a first-class subject of policy. A rule that prevents PII from appearing in an agent's output is enforced on every task completion, not just when an operator happens to review a log. A rule that blocks requests containing credentials prevents those values from reaching the model in the first place. The coverage is systematic rather than sampled. For context on how this fits the broader data exfiltration risk surface, see Data Exfiltration Risks in Agentic AI.

For a broader look at how guardrails fit alongside other runtime controls, see guardrails vs policies: understanding AI infrastructure controls.

Types of Guardrails

Praesidia supports three evaluation types, each suited to a different inspection task.

Rule-based guardrails use deterministic pattern matching (regular expressions and keyword lists) to detect known-bad content. This is the right choice for structured data patterns: credit card numbers, API key formats, national identifiers, specific prohibited phrases. Rule-based evaluation is fast and deterministic: content that matches an enumerated pattern is always flagged. The trade-off is that it catches only what you have enumerated, so obfuscated or reformatted variants can slip through.
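
As a rough illustration of rule-based evaluation, the sketch below scans text with Python's re module. The pattern names, the regular expressions, and the scan function are hypothetical and chosen for illustration; they are not Praesidia's built-in rules or its API. The Luhn checksum is an extra filter added in this sketch to cut false positives on digit runs that only look like card numbers.

```python
import re

# Hypothetical patterns for illustration only; real rules need tuning and testing.
PATTERNS = {
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    # Widely documented AWS access key ID shape: "AKIA" plus 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Example of a keyword-style rule.
    "prohibited_phrase": re.compile(r"\binternal use only\b", re.IGNORECASE),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def scan(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) for every finding in `text`."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(0)
            if name == "credit_card" and not luhn_valid(value):
                continue    # digit run that is not a plausible card number
            findings.append((name, value))
    return findings
```

Calling scan on a prompt that contains a test card number and an example access key returns one finding per pattern, while a 16-digit order number that fails the checksum returns nothing.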

ML-based guardrails run content through a moderation model that can classify tone, detect hate speech, or identify content categories that a regex cannot capture. You can point a guardrail at your own moderation endpoint or use a platform-provided model, depending on the sensitivity of the content passing through and your data residency requirements.

LLM-based guardrails use a language model to reason about content in context. This is appropriate for nuanced policy checks — detecting that a response is providing advice outside the agent's intended scope, identifying subtle prompt injection attempts, or evaluating whether output meets brand standards that are easier to describe in natural language than to encode as rules. LLM evaluation adds latency and external dependency, so it is best reserved for checks where context matters and the higher fidelity is worth the cost.

Scoping and Priority

A guardrail can apply across all agents in an organization or to a specific agent. Organization-wide guardrails establish a baseline that no agent can bypass; per-agent guardrails add checks appropriate for that agent's domain. Connection-level guardrails extend this further, allowing you to add checks specific to a particular integration — for example, a guardrail that applies only when an agent is calling a specific external tool.
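
One way to picture the three scopes is as optional filters on a guardrail record. The sketch below is a minimal model under that assumption; the Guardrail class, its fields, and the applicable function are hypothetical names, not the platform's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Guardrail:
    name: str
    agent_id: Optional[str] = None        # None: not limited to one agent
    connection_id: Optional[str] = None   # None: not limited to one connection


def applicable(guardrails, agent_id, connection_id=None):
    """Return the guardrails that apply to a task run by `agent_id`,
    optionally flowing through `connection_id`."""
    selected = []
    for g in guardrails:
        if g.agent_id is not None and g.agent_id != agent_id:
            continue    # scoped to a different agent
        if g.connection_id is not None and g.connection_id != connection_id:
            continue    # scoped to a connection this task does not use
        selected.append(g)
    return selected


# Hypothetical rule set: one organization-wide, two per-agent, one per-connection.
RULES = [
    Guardrail("org-pii"),
    Guardrail("billing-scope", agent_id="billing"),
    Guardrail("crm-strict", agent_id="billing", connection_id="crm"),
    Guardrail("support-tone", agent_id="support"),
]
```

In this model the organization-wide rule has no scope fields set, so it is selected for every task, which mirrors the baseline behavior described above.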

When multiple guardrails apply to the same task, they run in priority order. When a block rule triggers, evaluation stops and lower-priority rules are skipped; warn-only rules that trigger accumulate without stopping the request. You set priority explicitly when creating or editing a guardrail, giving your team control over the evaluation sequence rather than leaving it to insertion order.
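
The ordering behavior can be sketched in a few lines. This is a minimal illustration, not the platform's evaluator: the Rule class and evaluate function are hypothetical, and treating a larger priority number as "runs earlier" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    priority: int                    # assumption: a larger value runs earlier
    action: str                      # "block" or "warn" in this sketch
    matches: Callable[[str], bool]   # hypothetical stand-in for real evaluation


def evaluate(rules, content):
    """Run rules in priority order and stop at the first triggered block rule."""
    warnings = []
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if not rule.matches(content):
            continue
        if rule.action == "block":
            # Lower-priority rules are never evaluated once a block triggers.
            return {"allowed": False, "blocked_by": rule.name, "warnings": warnings}
        warnings.append(rule.name)   # warn-only rules accumulate
    return {"allowed": True, "blocked_by": None, "warnings": warnings}


# Hypothetical rules used to exercise the ordering.
RULES = [
    Rule("mentions-salary", 10, "warn", lambda t: "salary" in t),
    Rule("has-password", 50, "block", lambda t: "password=" in t),
    Rule("mentions-invoice", 90, "warn", lambda t: "invoice" in t),
]
```

With these rules, content that matches all three is blocked by has-password after mentions-invoice has already been recorded as a warning, and mentions-salary is never evaluated.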

Enforcement Actions

Each guardrail specifies the action to take when it triggers. The available actions reflect a range of severity and reversibility.

Block stops the task entirely. For input validation, this means the task is rejected before it reaches the agent. For output validation, it means the agent's response is suppressed and the task is marked failed. Block is the appropriate action for hard policy violations — a credential in a prompt, PII leaving the system unredacted, or a request that falls outside the agent's permitted scope.

Redact allows the task to proceed but removes or masks the sensitive fragment. For output guardrails, this means the response reaches the caller with the sensitive content replaced by a placeholder. Redact is appropriate for content that should not be transmitted verbatim but whose presence does not indicate a breach on its own — a response that incidentally includes a partial identifier, for example.

Warn records the trigger and continues without modifying the content. This is suitable for monitoring: you want visibility into how often a pattern appears without blocking legitimate traffic. Warning-only guardrails are also useful during the rollout of a new rule, letting you observe trigger rates before committing to a block action.

Escalate routes the task to a human review queue rather than resolving it automatically. This is the right action for content that is ambiguous — where the rule triggered but the correct response depends on context that the system cannot evaluate. The task is held until a reviewer approves or rejects it. For more on designing this decision tree, see designing guardrails: block, redact, or warn?.
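
The four actions can be summarized as a small dispatch on the trigger result. The sketch below assumes a single regex-based guardrail; the Outcome class, the status strings, the placeholder text, and the enforce function are all hypothetical and only mirror the behavior described above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern standing in for whatever the guardrail detects.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


@dataclass
class Outcome:
    triggered: bool
    task_status: str         # "completed", "failed", or "held_for_review"
    content: Optional[str]   # what is passed onward, if anything


def enforce(action: str, content: str, pattern: re.Pattern = EMAIL) -> Outcome:
    """Apply one guardrail's configured action to a piece of content."""
    if pattern.search(content) is None:
        return Outcome(False, "completed", content)
    if action == "block":
        return Outcome(True, "failed", None)             # nothing is passed on
    if action == "redact":
        masked = pattern.sub("[REDACTED]", content)      # mask only the fragment
        return Outcome(True, "completed", masked)
    if action == "warn":
        return Outcome(True, "completed", content)       # recorded, unmodified
    if action == "escalate":
        return Outcome(True, "held_for_review", None)    # waits for a reviewer
    raise ValueError(f"unknown action: {action!r}")
```

Keeping the action separate from the detection logic, as in this sketch, is what makes it cheap to start a rule on warn and later switch it to block without touching the pattern.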

Fail Mode: Closed vs Open

Each guardrail has a fail mode that controls what happens if the evaluation itself encounters an error — a provider timeout, a model returning an unexpected response, or an internal evaluation fault. The default is fail-closed: if the guardrail cannot complete evaluation, the task is blocked. This is the conservative choice. A guardrail that cannot be evaluated is not a guardrail; the failure itself is treated as a reason to stop.

Fail-open is available for guardrails where blocking on evaluation error would cause unacceptable disruption. Activating fail-open for a given rule is an explicit decision that trades safety margin for availability. For guardrails protecting against serious violations, such as credentials or regulated personal data, keep the default of fail-closed.
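
The difference between the two modes comes down to what happens in the error branch. The following is a minimal sketch, assuming a guardrail whose action is block; the function name, the "closed" and "open" strings, and the result dictionary are hypothetical.

```python
def evaluate_with_fail_mode(check, content, fail_mode="closed"):
    """Run check(content) -> bool (True means the guardrail triggered).

    If the check itself raises, fail-closed blocks the task and
    fail-open lets it through. Assumes the guardrail's action is block.
    """
    if fail_mode not in ("closed", "open"):
        raise ValueError(f"unknown fail mode: {fail_mode!r}")
    try:
        triggered = check(content)
    except Exception as exc:    # e.g. provider timeout or malformed response
        return {
            "allowed": fail_mode == "open",
            "triggered": False,
            "error": repr(exc),
        }
    return {"allowed": not triggered, "triggered": triggered, "error": None}


def flaky_check(_content):
    """Hypothetical evaluator that always times out."""
    raise TimeoutError("moderation endpoint timed out")
```

With flaky_check, the default call returns allowed as False, and the same call with fail_mode="open" returns allowed as True. In both cases the error is kept in the result so the failure stays visible.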

Templates and Presets

Starting from scratch with guardrail configuration takes time, and the cost of getting it wrong falls on every task that runs before you notice. Praesidia ships with a library of templates for common use cases: PII detection, prompt injection protection, credential detection, toxicity filtering, off-topic response blocking. You can apply a single template or apply a full industry preset that activates a curated set of guardrails appropriate for your sector.

Templates are a starting point, not a final state. Once applied, each guardrail is editable: you can tune the severity, adjust the action, override the fail mode, and add patterns specific to your organization. The template origin is tracked so you can see which guardrails started from a standard definition and which were built from scratch.
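
To make the template-then-customize flow concrete, here is a small sketch that copies a template, applies overrides, and records where the guardrail came from. The template contents, the field names, and the from_template function are hypothetical; the platform's actual template schema is not described in this article.

```python
from copy import deepcopy

# Hypothetical template library with a single entry.
TEMPLATES = {
    "credential-detection": {
        "type": "rule",
        "action": "block",
        "fail_mode": "closed",
        "patterns": [r"\bAKIA[0-9A-Z]{16}\b"],
    },
}


def from_template(template_id, extra_patterns=(), **overrides):
    """Create an editable guardrail from a template without mutating it."""
    guardrail = deepcopy(TEMPLATES[template_id])
    guardrail.update(overrides)                        # e.g. action="warn"
    guardrail["patterns"] = guardrail["patterns"] + list(extra_patterns)
    guardrail["template_origin"] = template_id         # track where it started
    return guardrail


# Start in warn mode and add an organization-specific key format (made up here).
custom = from_template(
    "credential-detection",
    extra_patterns=[r"\bACME-[0-9a-f]{32}\b"],
    action="warn",
)
```

The copy keeps the template itself unchanged, so later guardrails created from the same template still start from the standard definition.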

For a deeper look at how PII-specific rules work, see PII detection and redaction in AI pipelines.

Audit Logs and Stats

Every guardrail evaluation produces a log record: the result, the scope (input or output), the action taken, a content sample for triggered events, and the processing time. These records are paginated and available from the guardrails audit log view.

Aggregate stats show trigger counts over time by guardrail, making it straightforward to identify which rules are firing most frequently and whether trigger rates are stable or trending upward. An unexpected spike in triggers for a specific guardrail is often the first signal that a new class of problematic input has appeared, that an agent prompt has changed in a way that produces off-policy output, or that a particular integration is behaving differently than expected. For a comparison of guardrails against evaluation and monitoring as distinct layers, see Guardrails vs Evals vs Monitoring.
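
If you export audit records for your own analysis, trigger counts and a simple spike check could be computed along these lines. This is a sketch under stated assumptions: the record fields (guardrail, timestamp, triggered), the ISO 8601 timestamp format, and the threshold of three times the recent daily mean are all hypothetical choices, not the platform's export format or its alerting logic.

```python
from collections import Counter


def daily_trigger_counts(records):
    """Count triggered evaluations per (guardrail, day).

    Each record is assumed to be a dict with "guardrail", "triggered",
    and an ISO 8601 "timestamp" such as "2024-05-03T09:00:00Z".
    """
    counts = Counter()
    for rec in records:
        if rec["triggered"]:
            day = rec["timestamp"][:10]   # "YYYY-MM-DD" prefix
            counts[(rec["guardrail"], day)] += 1
    return counts


def spiking(counts, guardrail, day, history_days, factor=3.0):
    """Return True if `day` exceeds `factor` times the mean of `history_days`."""
    baseline = [counts.get((guardrail, d), 0) for d in history_days]
    mean = sum(baseline) / len(baseline) if baseline else 0.0
    # Floor the mean at 1 so a near-zero baseline does not flag tiny counts.
    return counts.get((guardrail, day), 0) > factor * max(mean, 1.0)
```

A fixed multiplier is a crude detector. It is shown only to illustrate the idea of comparing today's trigger count with a recent baseline.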

Common Questions

Can I test a guardrail without running it on live traffic? Yes. The platform provides a way to test a guardrail rule against sample content before enabling it on production traffic. This is useful when creating a new rule: you can verify it fires on the inputs you intend and does not fire on inputs you want to permit.
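
The same idea can be tried offline before you configure anything. The sketch below checks a candidate regular expression against two sample lists; the dry_run function and the sample strings are hypothetical and are not the platform's test feature.

```python
import re


def dry_run(pattern, should_trigger, should_pass):
    """Check a candidate rule against labeled sample content.

    Returns the samples the rule missed and the samples it flagged by mistake.
    """
    rx = re.compile(pattern)
    missed = [s for s in should_trigger if not rx.search(s)]
    false_alarms = [s for s in should_pass if rx.search(s)]
    return {"missed": missed, "false_alarms": false_alarms}


# Hypothetical samples for a US Social Security number style rule.
MUST_TRIGGER = ["SSN 123-45-6789"]
MUST_PASS = ["Order 123-456-789"]

strict = dry_run(r"\b\d{3}-\d{2}-\d{4}\b", MUST_TRIGGER, MUST_PASS)
loose = dry_run(r"\d{3}-\d{2,3}-\d{3,4}", MUST_TRIGGER, MUST_PASS)
```

Here the strict pattern produces no misses and no false alarms, while the loose pattern also flags the order number, which is the kind of problem you want to find before the rule reaches production traffic.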

What happens when an agent's output is blocked — does the caller see an error? When a guardrail blocks an agent's output, the task is marked failed and the caller receives an error response indicating that the content did not pass validation. The specific content that triggered the guardrail is not included in the error message returned to the caller; that detail is available only in the guardrail audit log, accessible to operators with the appropriate permission.

How do I apply guardrails to a specific connection rather than all of an agent's traffic? When configuring a connection between an agent and a resource, you can associate additional guardrail rules with that connection. Those guardrails run alongside any agent-level and organization-level guardrails for tasks that flow through that specific connection. This lets you apply stricter content rules to a sensitive integration without affecting the agent's other traffic. See governed connections between agents and resources for the connection configuration reference.

How do content guardrails relate to prompt injection defenses? Guardrails handle the content dimension — what is in a message — while prompt injection defenses focus on detecting manipulative instruction patterns in retrieved content. The two work together: input guardrails can flag injection signatures, while output guardrails catch data that an injection attempt may have caused the agent to surface. See how to detect and defend against prompt injection for a deeper treatment.

Do guardrails add significant latency to every agent task? Rule-based guardrails add minimal overhead. ML-based evaluation adds a network call to a moderation endpoint. LLM-based evaluation adds the most latency, typically comparable to a short model inference. For latency-sensitive paths, use rule-based checks at the baseline and reserve ML or LLM evaluation for high-risk connections where the fidelity justifies the cost.

Praesidia's guardrail system gives your team a systematic way to enforce content policy on every agent interaction, not as a post-hoc audit step but as an active control on the task path. The result is a clear boundary around what your agents are permitted to receive and produce, defined by your policy and enforced automatically at the point of execution.