Designing Guardrails: Block, Redact, or Warn?

When you add a guardrail to an AI agent, the central question is not what to detect — it is what to do when detection fires. Block the request outright, silently redact the sensitive fragment, or let it pass with a warning logged? Each choice has a different cost profile, and picking the wrong one creates problems that only become visible once you are in production with real workloads.

The enforcement action spectrum

Most guardrail systems offer a range of responses when a rule triggers. At the strictest end, a block action halts the operation entirely — the agent task is rejected before it executes, the output is suppressed, and the caller receives an error. At the most permissive end, a warn action lets the content through but records the event so operators can review it later.

In between sit redact and replace: the content is modified in transit so that the sensitive portion is removed or substituted before the agent sees it (on input) or before the response reaches its destination (on output). A retry action surfaces a signal to the orchestrator to re-run with modified context, which is useful when an LLM-based check suggests a rephrasing would resolve the issue. An escalate action routes the request to a human approval queue rather than completing it automatically.

Each option has a clear use case:

Block — reject; the operation does not proceed. Use for high-confidence, high-severity violations where passing is never acceptable.
Warn — allow; flag in the audit log; no interruption to the caller. Use for new rules under observation or low-severity findings.
Redact — remove the offending span; pass the cleaned content onward. Use for PII and credentials where the workflow can still run without the sensitive fragment.
Replace — substitute the span with a placeholder such as [REDACTED] or [PII]. Similar to redact but preserves the structure of the content.
Retry — re-attempt generation with modified context. Use when the violation is correctable without human involvement.
Escalate — pause and require human approval before continuing. Use for irreversible actions where automated judgment is insufficient.
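
To make the difference between redact and replace concrete, here is a minimal sketch. The Action enum simply names the six options above; SSN_PATTERN, redact, and replace are hypothetical names introduced for illustration, and the single regex stands in for a real detector.

```python
import re
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    WARN = "warn"
    REDACT = "redact"
    REPLACE = "replace"
    RETRY = "retry"
    ESCALATE = "escalate"


# Hypothetical, deliberately simple pattern for US-style SSNs.
# A production detector needs far more care than one regex.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Remove the offending span entirely."""
    return SSN_PATTERN.sub("", text)


def replace(text: str, placeholder: str = "[REDACTED]") -> str:
    """Substitute the span with a placeholder, preserving structure."""
    return SSN_PATTERN.sub(placeholder, text)


sample = "SSN: 123-45-6789 on file"
print(repr(redact(sample)))   # 'SSN:  on file'
print(repr(replace(sample)))  # 'SSN: [REDACTED] on file'
```

Redaction leaves a gap where the span was, while replacement keeps a visible marker. That marker can matter when a downstream parser or reader needs to know something was removed.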

Matching action to risk

The right action depends on two independent dimensions: the reversibility of the downstream action, and the confidence of the detection signal.

When the downstream action is irreversible — sending an email, writing to a database, calling an external API — the cost of a false negative is high. A block or escalate is appropriate even at the expense of occasional false positives, because the alternative is data that cannot be recalled. When the action is transient and internal, a warn or redact may be sufficient.

When the detection signal has high confidence — a rule-based pattern match for a credit card number or a private key, for instance — you can afford to act aggressively because the false-positive rate is predictable and low. When confidence is lower — an ML model returning a borderline score on a subjective category — a warn action preserves throughput while surfacing the finding for review.

A practical heuristic:

Confidence	Irreversible downstream action	Reversible downstream action
High	Block	Redact
Medium	Redact or Escalate	Warn
Low	Warn	Warn
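
The table can be encoded directly as a lookup, which keeps the policy in one reviewable place. This is a sketch only: HEURISTIC and candidate_actions are hypothetical names, and the three confidence labels are assumed to be assigned upstream by the detector.

```python
from typing import Dict, Tuple

# Keys are (confidence, downstream_is_irreversible); values are the candidate
# actions from the heuristic table. Medium/irreversible lists two options
# because the table leaves that choice to the operator.
HEURISTIC: Dict[Tuple[str, bool], Tuple[str, ...]] = {
    ("high", True): ("block",),
    ("high", False): ("redact",),
    ("medium", True): ("redact", "escalate"),
    ("medium", False): ("warn",),
    ("low", True): ("warn",),
    ("low", False): ("warn",),
}


def candidate_actions(confidence: str, irreversible: bool) -> Tuple[str, ...]:
    """Look up the starting action(s) for a detection (hypothetical helper)."""
    try:
        return HEURISTIC[(confidence.lower(), irreversible)]
    except KeyError:
        raise ValueError(f"unknown confidence level: {confidence!r}") from None


print(candidate_actions("high", True))    # ('block',)
print(candidate_actions("medium", True))  # ('redact', 'escalate')
```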

The content category also shapes the choice. PII and credentials warrant redact or block because the data itself is the hazard; see how to keep PII out of agent prompts and logs for the full redaction workflow. Brand and accuracy violations more often warrant warn or escalate, since those judgments depend on context that a human is better placed to assess. Prompt injection attempts warrant block because the intent is adversarial and passing the content onward amplifies the attack; the threat model is detailed in prompt injection: threats and defenses.

The false-positive cost

Every guardrail designer eventually confronts the tension between security and usability. A guardrail that fires too often on legitimate content trains your users to route around it — seeking alternate channels, asking admins to override, or simply disabling the guardrail. At that point you bear the cost of evaluation with none of the protection.

Managing false positives requires discipline at three levels.

Staged enforcement. Start new rules in warn-only mode unless the rule detects something high-confidence and categorically unacceptable, such as a private key pattern. Collect real-traffic data to understand the false-positive rate before promoting to block or redact. This is especially important for ML and LLM-based detectors where threshold tuning matters more than for deterministic rule patterns.
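
One way to make the promotion decision explicit is to compute the false-positive rate from warn events that a human has reviewed. The sketch below is illustrative: ReviewedEvent and ready_to_promote are hypothetical names, and the 2% ceiling and 200-sample minimum are arbitrary assumptions to tune for your own risk tolerance.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ReviewedEvent:
    rule: str
    true_positive: bool  # set by a human reviewer after the warn fired


def false_positive_rate(events: Sequence[ReviewedEvent]) -> float:
    if not events:
        raise ValueError("no reviewed events to measure")
    false_positives = sum(1 for e in events if not e.true_positive)
    return false_positives / len(events)


def ready_to_promote(events: Sequence[ReviewedEvent],
                     max_fp_rate: float = 0.02,   # assumed threshold
                     min_samples: int = 200) -> bool:  # assumed sample floor
    """Promote from warn only with enough data and a low enough FP rate."""
    if len(events) < min_samples:
        return False
    return false_positive_rate(events) <= max_fp_rate
```

Requiring a minimum sample size keeps a rule from being promoted on the strength of a handful of clean events.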

Per-scope differentiation. Input guardrails (what the agent receives) and output guardrails (what the agent produces) have different risk profiles. A user prompt that contains what looks like a phone number may be benign context; an agent response that echoes that number verbatim in a user-facing message is a different concern. Many teams apply redact-on-output with warn-only-on-input for PII, reserving blocks for high-confidence exfiltration signals.
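
A per-scope policy can be as small as a nested mapping. In this sketch the rule names pii and exfiltration and the helper action_for are hypothetical; the actions mirror the pattern just described.

```python
from typing import Dict

# Hypothetical rule names. PII warns on input and redacts on output;
# blocks are reserved for the high-confidence exfiltration signal.
SCOPED_ACTIONS: Dict[str, Dict[str, str]] = {
    "pii": {"input": "warn", "output": "redact"},
    "exfiltration": {"input": "block", "output": "block"},
}


def action_for(rule: str, scope: str) -> str:
    """Return the configured action for a rule on one surface."""
    if scope not in ("input", "output"):
        raise ValueError(f"scope must be 'input' or 'output', got {scope!r}")
    return SCOPED_ACTIONS[rule][scope]


print(action_for("pii", "input"))   # warn
print(action_for("pii", "output"))  # redact
```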

Severity tiers. Tagging guardrails with severity — critical, high, medium, low — lets you apply stricter default actions to high-severity rules and more permissive defaults to lower-severity ones, without managing each rule individually. A critical rule starts at block; a low-severity rule starts at warn; both can be promoted or demoted as evidence accumulates.
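
Severity defaults with a per-rule override might look like the following sketch. Only the critical and low defaults come from the text above; the high and medium defaults, and the names SEVERITY_DEFAULTS and Rule, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_DEFAULTS = {
    "critical": "block",  # stated above
    "high": "block",      # assumption
    "medium": "warn",     # assumption
    "low": "warn",        # stated above
}


@dataclass
class Rule:
    name: str
    severity: str
    override: Optional[str] = None  # set when evidence justifies a change

    def effective_action(self) -> str:
        """Use the explicit override if present, else the tier default."""
        return self.override or SEVERITY_DEFAULTS[self.severity]


tone = Rule("brand-tone", "low")
print(tone.effective_action())  # warn

key_leak = Rule("private-key", "critical")
print(key_leak.effective_action())  # block

# Demote a noisy critical rule while it is being retuned.
key_leak.override = "warn"
print(key_leak.effective_action())  # warn
```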

Fail-open versus fail-closed

Whether a guardrail fails open or closed is one of the most consequential design decisions in a guardrail system. It is frequently treated as a technical detail when it is actually a security policy choice.

A fail-closed guardrail blocks the operation when evaluation itself fails — if the ML provider is unreachable, if the LLM call times out, if the evaluation engine hits an unexpected condition. Nothing passes.

A fail-open guardrail allows the operation through when evaluation fails. The content is treated as if it passed the check.

Fail-closed is the safer default for security guardrails. If an attacker can induce an evaluation failure — by overwhelming a provider, causing a timeout, or triggering an edge case — fail-open means the guardrail has been effectively disabled. Fail-closed prevents this class of bypass entirely.

Fail-open is appropriate for non-security guardrails where availability outweighs the content risk. A brand-tone checker that is temporarily down should probably not block every agent response. The key is to make the choice deliberately, per guardrail, rather than as a global system default.

The failure mode to avoid is silent fail-open: a system that behaves as if it is enforcing guardrails but skips evaluation on errors without any observable indication. If you cannot detect when evaluation is failing, you cannot respond to the gap — and from the outside the system looks healthy while being unprotected.
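
The sketch below shows one way to make the failure mode a per-guardrail parameter while keeping evaluation failures observable. Decision, evaluate, and unreachable_provider are hypothetical names, and the check callable is assumed to return True when the content violates the rule.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("guardrails")


@dataclass(frozen=True)
class Decision:
    action: str              # "allow" or "block"
    evaluation_failed: bool  # True when the check itself errored


def evaluate(check: Callable[[str], bool], content: str,
             fail_open: bool = False) -> Decision:
    """Run one check; fail closed unless the caller opts into fail-open."""
    try:
        violated = check(content)
    except Exception:
        # Always record the failure so a fail-open path is never silent.
        logger.exception("guardrail evaluation failed (fail_open=%s)", fail_open)
        return Decision("allow" if fail_open else "block", evaluation_failed=True)
    return Decision("block" if violated else "allow", evaluation_failed=False)


def unreachable_provider(_content: str) -> bool:
    raise TimeoutError("detector timed out")  # simulated outage


print(evaluate(unreachable_provider, "hello"))
# Decision(action='block', evaluation_failed=True)
print(evaluate(unreachable_provider, "hello", fail_open=True))
# Decision(action='allow', evaluation_failed=True)
```

Returning evaluation_failed alongside the action lets callers and dashboards distinguish "passed the check" from "the check never ran", which is the signal a silent fail-open system lacks.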

Ordering, priority, and layering

When multiple guardrails apply to the same content, evaluation order matters. A block from a high-priority security guardrail should short-circuit evaluation before lower-priority brand checks run, avoiding unnecessary latency and cost. For how MCP tool-level scoping interacts with guardrail ordering, see Scoping MCP Tool Permissions: Least Privilege for Tools.

When two guardrails would apply different actions to overlapping content, the strictest action should take precedence — if one rule says redact and another says block, block wins. Make this explicit policy rather than an accident of ordering.
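
Writing the precedence down as code makes it explicit policy. In this sketch, only "block beats redact" comes from the text; the rest of the STRICTNESS ordering is an assumption to adapt, and strictest is a hypothetical helper.

```python
from typing import Iterable

# Assumed ordering from least to most strict. Only "block beats redact"
# is stated in the text; the other positions are a policy choice.
STRICTNESS = ("warn", "redact", "escalate", "block")


def strictest(actions: Iterable[str]) -> str:
    """Return the strictest action, or 'allow' when no rule triggered."""
    triggered = list(actions)
    if not triggered:
        return "allow"
    # STRICTNESS.index raises ValueError for an action not in the ordering.
    return max(triggered, key=STRICTNESS.index)


print(strictest(["redact", "block"]))  # block
print(strictest(["warn", "redact"]))   # redact
print(strictest([]))                   # allow
```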

Layering by scope reduces noise further: applying org-wide guardrails first, then connection-specific or agent-specific ones, keeps the evaluation tree predictable and makes it straightforward to trace which rule caused a trigger in the audit log.
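
Ordering, short-circuiting, and layering can be combined in a single evaluation loop. The sketch below is one possible arrangement, not a prescribed design: the Guardrail fields, the LAYER_ORDER values, and the convention that a lower priority number runs earlier are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed layer names, evaluated org-wide first.
LAYER_ORDER = {"org": 0, "connection": 1, "agent": 2}


@dataclass
class Guardrail:
    name: str
    layer: str
    priority: int  # lower number runs earlier within a layer (assumption)
    check: Callable[[str], Optional[str]]  # returns an action name or None


def run_guardrails(guardrails: List[Guardrail],
                   content: str) -> List[Tuple[str, Optional[str]]]:
    """Evaluate in layer/priority order and stop at the first block."""
    trace: List[Tuple[str, Optional[str]]] = []
    ordered = sorted(guardrails, key=lambda g: (LAYER_ORDER[g.layer], g.priority))
    for guardrail in ordered:
        action = guardrail.check(content)
        trace.append((guardrail.name, action))
        if action == "block":
            break  # skip remaining, lower-priority checks
    return trace


brand = Guardrail("brand-tone", "agent", 5, lambda c: "warn")
injection = Guardrail(
    "prompt-injection", "org", 0,
    lambda c: "block" if "ignore previous instructions" in c else None,
)

print(run_guardrails([brand, injection], "ignore previous instructions"))
# [('prompt-injection', 'block')]
print(run_guardrails([brand, injection], "hello"))
# [('prompt-injection', None), ('brand-tone', 'warn')]
```

The returned trace doubles as the record of which rules ran and which one caused the trigger.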

Auditing what actually happened

The value of a guardrail system is only partly in prevention. The evaluation log — what content was inspected, which rule triggered, what action was taken, and the confidence score — is what lets you tune rules, investigate incidents, and demonstrate compliance. Without it, you cannot know whether your configuration is working as intended.

An effective guardrail log captures the result, the scope (input or output), the action taken, and the trigger reason. It should be append-only and queryable by agent, rule, and time range — so you can answer questions like "how often did this PII rule fire over the last 30 days?"
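
As an illustration of such a log, the sketch below uses an in-memory SQLite table with triggers that reject updates and deletes. The schema, the guardrail_log table name, and the helpers log_event and fires_in_last_days are assumptions made for this example, not a description of any particular product's storage.

```python
import sqlite3
import time
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE guardrail_log (
        ts     REAL NOT NULL,  -- Unix seconds
        agent  TEXT NOT NULL,
        rule   TEXT NOT NULL,
        scope  TEXT NOT NULL CHECK (scope IN ('input', 'output')),
        result TEXT NOT NULL,  -- e.g. 'triggered' or 'passed'
        action TEXT NOT NULL,
        reason TEXT NOT NULL
    )
""")
# Enforce append-only behaviour at the database level.
for op in ("UPDATE", "DELETE"):
    conn.execute(f"""
        CREATE TRIGGER no_{op.lower()} BEFORE {op} ON guardrail_log
        BEGIN SELECT RAISE(ABORT, 'guardrail_log is append-only'); END
    """)


def log_event(agent: str, rule: str, scope: str, result: str,
              action: str, reason: str, ts: Optional[float] = None) -> None:
    conn.execute(
        "INSERT INTO guardrail_log VALUES (?, ?, ?, ?, ?, ?, ?)",
        (time.time() if ts is None else ts, agent, rule, scope,
         result, action, reason),
    )


def fires_in_last_days(rule: str, days: int = 30,
                       now: Optional[float] = None) -> int:
    """Count how often a rule triggered within the trailing window."""
    current = time.time() if now is None else now
    cutoff = current - days * 86400
    row = conn.execute(
        "SELECT COUNT(*) FROM guardrail_log "
        "WHERE rule = ? AND result = 'triggered' AND ts >= ?",
        (rule, cutoff),
    ).fetchone()
    return row[0]


log_event("support-agent", "pii-phone", "output", "triggered",
          "redact", "phone number echoed in response")
print(fires_in_last_days("pii-phone"))  # 1
```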

Praesidia is designed around this model. Guardrail evaluation runs before a task is executed, so a block actually prevents the task from executing rather than merely logging that it should have been stopped. Every evaluation writes to an append-only log with result, scope, action, and trigger reason. The default fail mode is closed, with a per-guardrail override available for cases where availability is the primary concern. For a broader comparison of controls, see guardrails vs evals vs monitoring and content guardrails for AI agents. For guidance on applying these principles to a specific high-risk scenario, see data exfiltration risks in agentic AI.

Common questions

Should I start with block or warn for a new guardrail rule?

Start with warn unless the rule detects something high-confidence and categorically unacceptable, such as a hardcoded credential or a private key pattern. Warn gives you real-traffic data to assess false-positive rates before you commit to blocking legitimate operations. Running in warn-only mode for a week or two before promoting a security rule to block is a reasonable default practice.

What should happen when a guardrail evaluation errors out?

Default to closed: treat an evaluation error the same as a triggered block, and surface the failure in your audit log. The alternative — silently passing content when evaluation fails — creates a bypass path for any attacker who can reliably induce evaluation failures. Reserve fail-open for non-security guardrails where checker downtime is more disruptive than the content risk it is guarding against.

Can the same rule apply different actions on input versus output?

Yes, and in many cases it should. Input guardrails intercept what the agent receives; output guardrails intercept what the agent produces. PII in a user prompt may be legitimate context the agent needs to function; PII echoed verbatim in an agent response may be a compliance violation. Scoping allows you to redact on output while only warning on input, matching the enforcement action to the actual risk direction rather than applying a single policy to both surfaces.