Personally identifiable information leaks into AI agent flows in ways that are easy to overlook: a user message carries an email address, a retrieved document contains a social security number, a tool response returns a phone number embedded in JSON. The safeguard is a content inspection layer that detects PII before it reaches the model, redacts it in flight, and prevents it from persisting in logs. This post explains where the exposure points are, what effective detection looks like, and how to design a redaction approach that holds up under audit.
Why Prompts Are a PII Risk Surface
When an AI agent receives a task, it typically assembles a prompt from several sources: user input, retrieved context from a knowledge base or database, tool call results, and prior conversation history. Any of these can carry PII without the application developer explicitly intending it.
User input is the obvious case — people volunteer names, addresses, and account numbers when describing a problem. Retrieved context is subtler. A RAG system pulling documents from an HR database or a CRM may return records that contain PII as incidental fields alongside the information the agent actually needs. Tool call responses are the least-considered surface: when an agent calls a calendar API, a ticketing system, or a payment processor, the response payload often contains personal details about third parties who have no idea their data is in an LLM prompt.
Each of these paths represents a point where PII can flow into the model context, into the model response, and ultimately into whatever logging or audit infrastructure captures the interaction. Without an interception layer, all three destinations receive the data whether or not they need it. For a broader taxonomy of what these flows expose, see data exfiltration risks in agentic AI.
What Counts as PII in This Context
Regulatory definitions of PII differ by jurisdiction. GDPR uses "personal data" broadly to cover any information that can identify a natural person, directly or in combination. HIPAA regulates a narrower category, protected health information (PHI): individually identifiable health information handled by covered entities and their business associates. CCPA and similar US state laws add their own categories. The types that most commonly appear uninvited in agent flows include:
- Names, email addresses, phone numbers
- National identity numbers, passport numbers, tax IDs
- Dates of birth and ages when combined with other identifiers
- Financial account numbers and payment card details
- Health conditions, diagnoses, and treatment references
- IP addresses and device identifiers when used to track individuals
- Location data precise enough to reveal a home or work address
The challenge is that agents often need some of this information to perform their task. A support agent answering a billing question needs a customer's account number. A scheduling agent needs names and contact details. The goal of redaction is not to strip all PII blindly but to ensure that only the minimum necessary data reaches the model and that surplus data does not persist beyond the interaction.
Detection Approaches
Effective PII detection before an LLM call uses one or more of three approaches, and production systems typically layer all three.
Pattern matching covers well-structured identifiers: email addresses, phone numbers, credit card numbers (validated by Luhn check), and social security numbers. Regex-based rules are fast and, for strictly formatted identifiers, produce few false positives, especially when paired with a validation step such as the Luhn check. That makes them the right first pass.
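As a rough illustration of this first pass, the sketch below pairs a few regular expressions with a Luhn checksum. The patterns and the names `find_structured_pii` and `luhn_valid` are assumptions made for this example; they are deliberately simplified (US-style SSN format, no phone numbers) and are not a complete rule set.

```python
import re

# Illustrative patterns only; real rule sets need locale-specific variants.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_structured_pii(text: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for well-structured identifiers."""
    findings = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    findings += [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    # Keep card candidates only when the checksum passes, to cut false positives.
    findings += [
        ("CARD", m.group()) for m in CARD_RE.finditer(text) if luhn_valid(m.group())
    ]
    return findings
```

The checksum step is what stops an arbitrary 16-digit order number from being reported as a payment card.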
Statistical and ML classifiers handle less-structured text — a paragraph mentioning a person's name and employer in context. A named-entity recognition (NER) model trained on PII categories can identify names, organisations, locations, and dates in free text. These classifiers add latency but catch what patterns miss.
LLM-assisted detection is a third layer for edge cases where accuracy matters more than speed. An LLM judge can reason about whether a piece of text constitutes PII in context — for instance, whether "the patient in room 4" is identifying when combined with other prompt content. This option is typically reserved for high-sensitivity data flows.
Each approach has a false-negative rate. Combining them in a pipeline reduces the overall miss rate, and calibrating thresholds is an ongoing process: too aggressive and benign content is flagged; too loose and PII slips through.
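One way to picture the layering and the threshold trade-off is a pipeline that runs several detectors and keeps only findings above a confidence cut-off. In this sketch the `Finding` type, `run_pipeline`, and both detectors are hypothetical; `toy_name_detector` is a crude regex stand-in for a real NER model and exists only to show how a lower-confidence layer interacts with the threshold.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    category: str
    start: int
    end: int
    score: float  # detector confidence in [0, 1]


Detector = Callable[[str], list[Finding]]

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def email_detector(text: str) -> list[Finding]:
    """Pattern layer: structured identifiers get full confidence."""
    return [Finding("EMAIL", m.start(), m.end(), 1.0) for m in EMAIL_RE.finditer(text)]


def toy_name_detector(text: str) -> list[Finding]:
    """Stand-in for an NER model: flags capitalised word pairs at lower confidence."""
    return [Finding("NAME", m.start(), m.end(), 0.6) for m in NAME_RE.finditer(text)]


def run_pipeline(text: str, detectors: list[Detector], threshold: float) -> list[Finding]:
    """Run every detector and keep findings at or above the threshold."""
    findings: list[Finding] = []
    for detect in detectors:
        findings.extend(f for f in detect(text) if f.score >= threshold)
    return sorted(findings, key=lambda f: f.start)
```

With a threshold of 0.5 the input "contact Jane Doe at jane@example.com" yields both a name and an email finding; raising the threshold to 0.8 drops the name, which is the "too loose" failure mode in miniature.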
Redaction Actions and When to Use Each
Detecting PII is only useful if the system then does something with the finding. The main action classes are:
| Action | What happens | When to use |
|---|---|---|
| Redact | Replace the PII with a placeholder ([EMAIL], [SSN]) | Default for most categories; task can still proceed without the actual value |
| Block | Reject the task outright | When the PII presence indicates the request should not proceed at all (e.g., a scraping attempt) |
| Replace | Substitute a synthetic value that preserves format | When downstream processing requires a value of the right type but not the real one |
| Warn | Log and alert without altering content | When the operator wants visibility but the task is permitted to proceed |
| Escalate | Route the task for human review | When automated handling is insufficient — medical data, legal documents |
Redact-with-placeholder is the most common choice. The placeholder indicates what was removed without transmitting the actual value. If the agent genuinely needs the real value, it should retrieve it through a controlled, audited path — such as a parameterised lookup — rather than having the raw value travel through the prompt layer.
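A minimal sketch of how a per-category policy might drive these actions is shown below. The `Action` enum, `POLICY` mapping, and `apply_policy` function are illustrative names, only three of the five actions are modelled, and the patterns are simplified.

```python
import re
from enum import Enum
from typing import Optional


class Action(Enum):
    REDACT = "redact"
    BLOCK = "block"
    WARN = "warn"


PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical default policy: redact everything that is detected.
POLICY = {"EMAIL": Action.REDACT, "SSN": Action.REDACT}


def apply_policy(
    text: str, policy: Optional[dict[str, Action]] = None
) -> tuple[Optional[str], list[tuple[str, Action]]]:
    """Return (output_text, events); output_text is None when the task is blocked."""
    policy = POLICY if policy is None else policy
    events: list[tuple[str, Action]] = []
    for category, pattern in PATTERNS.items():
        if not pattern.search(text):
            continue
        action = policy.get(category, Action.REDACT)
        events.append((category, action))
        if action is Action.BLOCK:
            return None, events
        if action is Action.REDACT:
            # The placeholder names the category without carrying the value.
            text = pattern.sub(f"[{category}]", text)
        # Action.WARN records the event and leaves the text unchanged.
    return text, events
```

Calling `apply_policy("Reach me at jane@example.com, SSN 123-45-6789")` returns the text with `[EMAIL]` and `[SSN]` placeholders, plus a list of events that records categories and actions but not the matched values.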
Logs Are a Separate Problem
Removing PII from prompts is necessary but not sufficient. Logs are where PII tends to accumulate invisibly over time.
Audit logs that capture task inputs and outputs will contain whatever was in those payloads before redaction fires — unless the logging point is downstream of the redaction step. Where logging occurs in the pipeline matters as much as what gets logged.
Evaluation logs retained for guardrail debugging can also accumulate PII if they capture raw content. Any such record needs careful scoping: how much of the original content is kept, for how long, and who can read it.
The principle is that PII should be redacted at the earliest detectable point, and every log or store written after that point should receive only the redacted form. Any store that captures pre-redaction content needs a short, defined retention window and appropriate access controls.
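One way to make sure a log sink only ever receives the redacted form is to scrub records before the handler writes them. The sketch below uses the standard library `logging.Filter` hook; the `RedactingFilter` class and its single email pattern are assumptions for illustration, and a real deployment would reuse the full detection pipeline.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Scrub email addresses from a record before any handler formats it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, so values passed as
        # logging args are scrubbed too, then store the cleaned text.
        message = record.getMessage()
        record.msg = EMAIL_RE.sub("[EMAIL]", message)
        record.args = ()
        return True  # keep the record; only its content changes


def attach_redaction(handler: logging.Handler) -> logging.Handler:
    """Ensure everything written by this handler passes through redaction."""
    handler.addFilter(RedactingFilter())
    return handler
```

Attaching the filter to the handler rather than to individual call sites means that a developer who logs a raw payload by accident still produces a redacted log line.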
Applying Controls to Both Input and Output
A common implementation gap is inspecting only the input to the agent while ignoring the output. Model responses can and do contain PII, either because the model reproduced PII from its context or because it synthesised identifying information in its answer.
Effective coverage runs inspection on both sides of the agent execution: the assembled prompt before it reaches the model, and the model response before it is returned to the caller or stored. This two-pass approach catches PII that enters through retrieved context and PII the model adds, ensuring that response logs retained for quality review contain only the cleaned form.
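The two-pass idea can be sketched as a thin wrapper around the model call. Here `guarded_call` and `redact` are hypothetical names, and `model` is any callable that takes a prompt and returns text, standing in for a real client.

```python
import re
from typing import Callable

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Simplified inspection step: replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Inspect the assembled prompt, call the model, then inspect the response."""
    clean_prompt = redact(prompt)    # pass 1: before the model sees it
    response = model(clean_prompt)
    return redact(response)          # pass 2: before it is returned or stored
```

Because the second pass runs on whatever the model returns, it also covers values the model produced on its own rather than copied from the prompt.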
Making Redaction Auditable
For regulatory purposes, a redaction control is only as valuable as your ability to demonstrate that it operated correctly. Auditors asking about GDPR Article 25 (data protection by design) or HIPAA technical safeguards want to see evidence of the control, not just a description of it.
The evidence takes the form of logs that record, for each guardrail evaluation: when it ran, what category was detected, and what action was taken. These records should not themselves contain the original PII — a log entry that describes the detection result and action is useful audit evidence; a log entry that repeats the actual value is a liability.
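A record along these lines might look like the sketch below. The `GuardrailEvent` fields and helper names are assumptions for this example, not a prescribed schema; the point is that the matched values are counted and then discarded rather than serialised.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GuardrailEvent:
    evaluated_at: str  # ISO 8601 timestamp in UTC
    category: str      # e.g. "EMAIL"
    action: str        # e.g. "redact"
    match_count: int   # how many values were found, never the values themselves


def make_event(category: str, action: str, matches: list[str]) -> GuardrailEvent:
    """Build an audit record from detection results without copying the matches."""
    return GuardrailEvent(
        evaluated_at=datetime.now(timezone.utc).isoformat(),
        category=category,
        action=action,
        match_count=len(matches),
    )


def to_log_line(event: GuardrailEvent) -> str:
    """Serialise the event as a single JSON line for an append-only audit log."""
    return json.dumps(asdict(event), sort_keys=True)
```

The resulting line tells an auditor that an email address was redacted at a given time without giving anyone who reads the log the address itself.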
Retention periods for these evaluation logs should match your data governance policy. A log that proves a guardrail fired at a point in time does not need to be kept forever; most frameworks allow destruction once the relevant retention window closes, provided the destruction itself is recorded.
Praesidia's guardrails capability is designed with this audit trail in mind: every evaluation produces a log entry with result, action taken, and trigger reason, without requiring the system to persist the content that triggered it.
Common Questions
How do I handle PII that the agent legitimately needs to complete its task?
Pass it through a controlled reference pattern rather than inline in the prompt. The agent receives a token or reference ID in the prompt, and retrieves the sensitive value through a privileged, audited lookup at the moment it is actually needed. This keeps the prompt clean while preserving the agent's ability to act on real data. It also means the sensitive value is retrieved under access controls that can be revoked, rather than being reproduced wherever the prompt text travels.
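The reference pattern can be sketched with a small in-memory vault. `SensitiveValueVault` and its methods are hypothetical and stand in for whatever secrets store or database lookup a real system would use; the caller names are likewise invented for the example.

```python
import secrets


class SensitiveValueVault:
    """Toy in-memory store that swaps sensitive values for opaque reference tokens."""

    def __init__(self, allowed_callers: set[str]):
        self._values: dict[str, str] = {}
        self._allowed_callers = set(allowed_callers)
        self.access_log: list[tuple[str, str]] = []  # (token, caller), never the value

    def tokenize(self, value: str) -> str:
        """Store the value and return a reference that is safe to place in a prompt."""
        token = f"ref_{secrets.token_hex(8)}"
        self._values[token] = value
        return token

    def resolve(self, token: str, caller: str) -> str:
        """Return the real value to an authorised caller and record the access."""
        if caller not in self._allowed_callers:
            raise PermissionError(f"{caller} may not resolve sensitive references")
        value = self._values[token]  # raises KeyError once the token is revoked
        self.access_log.append((token, caller))
        return value

    def revoke(self, token: str) -> None:
        """Remove the value so existing copies of the token stop resolving."""
        self._values.pop(token, None)
```

A prompt built with `vault.tokenize("ACCT-0042")` contains only the `ref_...` token; the tool that actually needs the account number calls `resolve` at the moment of use, and revoking the token invalidates every copy of the prompt text at once.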
Can I use LLM-based detection without adding unacceptable latency to every request?
Yes, with routing. Pattern matching and ML classifiers handle the majority of cases quickly. LLM-based detection can be reserved for content that scores above a threshold on the fast pass — content that looks like it might contain sensitive free-text PII — while straightforward requests skip it. You can also run LLM detection asynchronously for monitoring purposes without blocking the response path, accepting that the control is retrospective rather than preventive for that tier of content.
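The routing decision can be sketched as follows. The keyword heuristic in `fast_suspicion_score` is a toy stand-in for a real fast pass, the 0.25 threshold is arbitrary, and `llm_judge` is any callable supplied by the caller; all of these are assumptions for illustration.

```python
from typing import Callable

# Hypothetical hint words; a real fast pass would use pattern and classifier scores.
SENSITIVE_HINTS = ("patient", "diagnosis", "account", "passport")


def fast_suspicion_score(text: str) -> float:
    """Toy heuristic: fraction of hint words that appear in the text."""
    lowered = text.lower()
    hits = sum(1 for hint in SENSITIVE_HINTS if hint in lowered)
    return hits / len(SENSITIVE_HINTS)


def inspect(
    text: str, llm_judge: Callable[[str], bool], threshold: float = 0.25
) -> tuple[str, bool]:
    """Return (tier_used, flagged); the LLM judge runs only above the threshold."""
    if fast_suspicion_score(text) < threshold:
        return "fast", False  # straightforward content skips the slow tier
    return "llm", llm_judge(text)
```

Only requests that trip the fast heuristic pay the latency cost of the judge call, so the threshold directly controls how much traffic reaches the slow tier.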
What happens if redaction removes context the model needs?
Some tasks will fail gracefully — the model produces a response that acknowledges missing information. Others will degrade silently, producing a response that is coherent but wrong because a key value was removed. Evaluating this trade-off is a calibration exercise: run your guardrails in warn-only mode first to measure how often redaction would have fired and on what content, before switching to enforce mode. This gives you a realistic picture of impact before you commit to a threshold.
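A warn-only rollout could be sketched like this. The `Guardrail` class and its `enforce` flag are hypothetical; the idea is that the same detection code runs in both modes, and only the enforce mode alters content.

```python
import re
from collections import Counter

PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


class Guardrail:
    """Counts detections in every mode; rewrites text only when enforcing."""

    def __init__(self, enforce: bool = False):
        self.enforce = enforce
        self.would_fire: Counter = Counter()  # category -> number of detections

    def process(self, text: str) -> str:
        for category, pattern in PATTERNS.items():
            hits = len(pattern.findall(text))
            if not hits:
                continue
            self.would_fire[category] += hits
            if self.enforce:
                text = pattern.sub(f"[{category}]", text)
        return text
```

Running with `enforce=False` against real traffic for a while fills `would_fire` with per-category counts, which gives a concrete basis for deciding which categories are safe to switch to enforcement first.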
The practical starting point is an inventory of where personal data enters your agent flows. Map each input source — user messages, retrieved documents, tool responses, conversation history — and classify what PII categories each is likely to carry. That inventory drives where to apply inspection, at what sensitivity, and which action suits each data category.
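Such an inventory can start as a small data structure kept alongside the agent configuration. The entries below are invented examples, and `SourceProfile`, `INVENTORY`, and `sources_carrying` are hypothetical names; the real contents depend on your own systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    source: str
    likely_categories: tuple[str, ...]
    sensitivity: str      # "low", "medium", or "high"
    default_action: str   # one of the action classes, e.g. "redact"


# Example entries only; replace with the results of your own mapping exercise.
INVENTORY = [
    SourceProfile("user_messages", ("NAME", "EMAIL", "ACCOUNT_NUMBER"), "medium", "redact"),
    SourceProfile("retrieved_documents", ("NAME", "SSN", "HEALTH"), "high", "escalate"),
    SourceProfile("tool_responses", ("EMAIL", "PHONE"), "medium", "redact"),
    SourceProfile("conversation_history", ("NAME", "EMAIL"), "low", "warn"),
]


def sources_carrying(category: str) -> list[str]:
    """List the input sources where a given PII category is expected to appear."""
    return [p.source for p in INVENTORY if category in p.likely_categories]
```

A query such as `sources_carrying("EMAIL")` then shows at a glance which inspection points need an email rule.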
For teams evaluating a governance layer, the Praesidia docs cover how content guardrails fit into the broader control plane alongside identity, budget enforcement, and audit logging. For a full look at guardrail design options, see "Designing guardrails: block, redact, or warn?" Organisations with GDPR erasure obligations should also review "GDPR for AI systems: data subject rights and erasure".