Cost Control for LLM Applications

LLM costs in agentic applications accumulate across token usage, retrieval, tool calls, retries, and parallel fan-out — most of which standard billing dashboards do not break down by agent or run. Controlling those costs requires attribution at every hop and enforcement wired into the dispatch path before work enters the queue, not alerts reviewed after the invoice arrives.

The core challenge

LLM costs in production are predictable in aggregate but dangerous at the individual-agent level. A well-designed workflow has a bounded token budget per run. A misconfigured prompt, an unexpected recursive loop, or a single high-context retrieval step can turn a sub-cent task into a multi-dollar one — and at scale, that asymmetry matters. The guardrails that keep costs under control are not billing features bolted on at the end; they are enforcement points wired into the dispatch path of every request.

Where costs actually hide

The obvious cost driver is input and output tokens sent to a model API. But in practice, LLM application costs spread across several layers that teams frequently undercount:

Retrieval and context assembly. Embedding-based retrieval and long context windows are often priced per token, not per query. Overly broad retrieval strategies that fetch large document chunks inflate context length on every call.

Tool calls and sub-agent invocations. Costs compound across agent-to-agent flows. If one agent delegates to a second, which delegates to a third, each hop has its own model cost, and the total grows multiplicatively once any hop delegates to several sub-agents. Without per-hop attribution, the bill arrives with no signal about which branch was the expensive one.

Retry logic. Error handling that retries on timeouts or rate-limit failures without a retry cap or backoff can silently double or triple the token spend for a single logical request.

Parallel fan-out. Orchestration patterns that parallelize across many agents are efficient in wall-clock time but can multiply cost by the fan-out factor. This is often intentional — the issue is when it happens unintentionally because a loop condition is wrong.

Streaming and long-running tasks. Tasks that stream outputs or maintain long conversational context accumulate tokens gradually. The final cost is not visible until the task completes, which makes real-time budget enforcement harder than point-in-time checks.
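Of the layers above, retry cost is the most direct to bound in code. The sketch below is one possible shape, not a prescribed implementation: it caps both the number of attempts and the total tokens a single logical request may burn, and backs off exponentially between attempts. The names `BilledTimeout`, `RetryBudgetExceeded`, and `call_with_retry_budget` are hypothetical, as is the assumption that a timed-out call can report the tokens it consumed.

```python
import time
from typing import Callable, Tuple


class BilledTimeout(TimeoutError):
    """Hypothetical error: a call timed out but still consumed tokens."""

    def __init__(self, tokens_used: int):
        super().__init__("model call timed out")
        self.tokens_used = tokens_used


class RetryBudgetExceeded(Exception):
    """Raised when another attempt would break the attempt or token cap."""


def call_with_retry_budget(
    call: Callable[[], Tuple[str, int]],
    max_attempts: int = 3,
    max_tokens: int = 4000,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Return (text, total_tokens), counting tokens burned by failed attempts."""
    tokens_spent = 0
    for attempt in range(max_attempts):
        try:
            text, used = call()
            return text, tokens_spent + used
        except BilledTimeout as exc:
            tokens_spent += exc.tokens_used
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or tokens_spent >= max_tokens:
                raise RetryBudgetExceeded(
                    f"gave up after {attempt + 1} attempts, {tokens_spent} tokens"
                ) from exc
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RetryBudgetExceeded("max_attempts must be at least 1")
```

The returned token total includes failed attempts, so the figure attributed to the request reflects what was actually consumed rather than only the successful call.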

For a detailed breakdown of how these patterns appear in practice, see budgets and quotas: preventing runaway agent costs.

Attributing costs to the right entity

Attribution is a prerequisite for governance. You cannot enforce a budget on a team if you do not know which team a given invocation belongs to. The attribution hierarchy typically runs: organization → team → agent → workflow → individual run. Each layer needs to be instrumented.

The practical challenge is that attribution identifiers need to be attached to requests before they leave the control plane, not inferred after the fact from logs. When a request passes through multiple hops — user to workflow to agent to sub-agent — the originating identity must be propagated with each hop so that costs can be rolled up to the correct level at any granularity. For how this attribution is surfaced visually, see Visualizing AI Usage and Cost.

This is where a central control plane earns its keep. Every call that passes through the platform carries its lineage, so the cost reported against a workflow run reflects every token consumed across every agent in that run, not just the entry-point call.
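A minimal sketch of lineage propagation, assuming costs are recorded as integer micro-dollars to avoid floating-point drift. The `Lineage` and `CostEvent` types and the `roll_up` helper are hypothetical names for illustration; the point is that each hop copies the originating identifiers rather than creating fresh ones, so the same events can be summed at any level of the hierarchy.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Lineage:
    org: str
    team: str
    workflow: str
    run_id: str
    agent: Optional[str] = None

    def hop(self, agent: str) -> "Lineage":
        """Lineage for a delegated call: origin fields are preserved."""
        return replace(self, agent=agent)


@dataclass(frozen=True)
class CostEvent:
    lineage: Lineage
    micro_usd: int  # integer micro-dollars


def roll_up(events: List[CostEvent], level: str) -> Dict[str, int]:
    """Sum cost by any lineage field: 'org', 'team', 'agent', 'workflow', 'run_id'."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        totals[getattr(event.lineage, level)] += event.micro_usd
    return dict(totals)


root = Lineage(org="acme", team="search", workflow="wf-1", run_id="run-42")
events = [
    CostEvent(root.hop("planner"), 20_000),
    CostEvent(root.hop("retriever"), 50_000),
    CostEvent(root.hop("planner"), 10_000),
]
by_run = roll_up(events, "run_id")    # every hop counts toward the run
by_agent = roll_up(events, "agent")   # the same events, split per agent
```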

Budget policies: the enforcement primitive

A budget policy is a statement of the form: "entity X may spend at most Y over period Z, and if it crosses threshold T, take action A." The useful design space is in the dimensions of each variable:

Entity scope determines what the budget guards. Useful scopes include: the whole organization (a hard ceiling on total monthly spend), a specific agent (guard a high-cost specialist), a workflow (cap a single orchestration graph), or a team (delegate budget responsibility to team leads).

Period determines when the counter resets: daily, weekly, monthly, or a fixed total that does not reset. Daily limits are useful for catching runaway loops quickly. Monthly limits match billing cycles. Fixed totals suit pilot programs with a defined envelope.

Threshold actions determine what happens as spend approaches and crosses the limit. The useful range is: alert (notify operators but keep running), throttle (slow dispatch rate), pause (suspend queued runs, resume when budget is raised), and block (hard stop on new work). A well-designed policy typically chains these — alert at 80%, throttle at 95%, block at 100% — rather than jumping straight to a hard stop.
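The three dimensions above map naturally onto a small data structure. The following is an illustrative sketch only: `BudgetPolicy`, `Action`, and the field names are assumptions, and the default chain mirrors the 80% / 95% / 100% example. Thresholds are stored as integer percentages so the comparison stays in integer arithmetic.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Tuple


class Action(IntEnum):
    # Ordered by severity so max() picks the strictest triggered action.
    ALLOW = 0
    ALERT = 1
    THROTTLE = 2
    PAUSE = 3
    BLOCK = 4


def default_chain() -> List[Tuple[int, Action]]:
    return [(80, Action.ALERT), (95, Action.THROTTLE), (100, Action.BLOCK)]


@dataclass
class BudgetPolicy:
    scope: str        # "org", "team", "agent", or "workflow"
    entity_id: str
    period: str       # "daily", "weekly", "monthly", or "total"
    limit_cents: int
    thresholds: List[Tuple[int, Action]] = field(default_factory=default_chain)

    def action_for(self, spent_cents: int) -> Action:
        """Strictest action whose percentage threshold has been reached."""
        triggered = [
            action
            for pct, action in self.thresholds
            if spent_cents * 100 >= pct * self.limit_cents
        ]
        return max(triggered, default=Action.ALLOW)


policy = BudgetPolicy("agent", "summarizer", "daily", limit_cents=10_000)
```

With this shape, `policy.action_for(9_500)` resolves to `Action.THROTTLE` and `policy.action_for(10_000)` to `Action.BLOCK`.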

Reservation accounting is the mechanism that makes enforcement reliable. Rather than checking current spend against the limit at dispatch time, the system reserves the estimated cost of the task before it runs and only releases or commits that reservation on completion. This prevents the race condition where two tasks both see spend below the limit and both proceed, together exceeding it.
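The reserve-then-commit pattern can be sketched as follows. This is a single-process illustration using a lock; a real deployment would need the same atomicity from its datastore. `ReservationLedger` and `BudgetExceeded` are hypothetical names.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a reservation would push spend past the limit."""


class ReservationLedger:
    def __init__(self, limit_cents: int):
        self.limit_cents = limit_cents
        self.committed_cents = 0   # actual cost of finished tasks
        self.reserved_cents = 0    # estimated cost of in-flight tasks
        self._lock = threading.Lock()

    def reserve(self, estimate_cents: int) -> None:
        """Atomically check and reserve, so two tasks cannot both slip under."""
        with self._lock:
            projected = self.committed_cents + self.reserved_cents + estimate_cents
            if projected > self.limit_cents:
                raise BudgetExceeded(f"{projected} > {self.limit_cents}")
            self.reserved_cents += estimate_cents

    def commit(self, estimate_cents: int, actual_cents: int) -> None:
        """Task finished: swap the estimate for the real cost."""
        with self._lock:
            self.reserved_cents -= estimate_cents
            self.committed_cents += actual_cents

    def release(self, estimate_cents: int) -> None:
        """Task was cancelled before spending: drop the reservation."""
        with self._lock:
            self.reserved_cents -= estimate_cents
```

Because the check and the increment happen under one lock, the second of two concurrent 60-cent tasks against a 100-cent limit is rejected instead of both passing a stale read.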

The gap between alerting and enforcement

Many teams instrument cost monitoring without enforcement, relying on alerts to trigger manual intervention. This works until it doesn't — the alert fires at 2 AM, no one is watching, and the agent runs to completion. Alerting is necessary but insufficient. The enforcement action needs to be automatic and wired into the dispatch path, not dependent on a human seeing a notification in time.

Enforcement also needs to handle the case where a budget is raised mid-period. If a pause or block threshold was crossed and the operator increases the budget limit, the system should automatically resume paused runs without requiring manual re-triggering. The policy state machine and the task queue need to be coupled.
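One way to couple the policy state and the queue is to let the limit change itself drive the resume, as in this simplified sketch. `PausableQueue` is a hypothetical name, and spend is tracked as estimates only to keep the example short.

```python
from collections import deque
from typing import Deque, List, Tuple


class PausableQueue:
    def __init__(self, limit_cents: int):
        self.limit_cents = limit_cents
        self.spent_cents = 0
        self.paused: Deque[Tuple[str, int]] = deque()
        self.dispatched: List[str] = []

    def submit(self, run_id: str, estimate_cents: int) -> str:
        over = self.spent_cents + estimate_cents > self.limit_cents
        # Queue behind earlier paused runs to keep first-in, first-out order.
        if self.paused or over:
            self.paused.append((run_id, estimate_cents))
            return "paused"
        self._dispatch(run_id, estimate_cents)
        return "dispatched"

    def raise_limit(self, new_limit_cents: int) -> List[str]:
        """Raising the budget resumes paused runs without manual re-triggering."""
        self.limit_cents = new_limit_cents
        resumed = []
        while self.paused and (
            self.spent_cents + self.paused[0][1] <= self.limit_cents
        ):
            run_id, estimate = self.paused.popleft()
            self._dispatch(run_id, estimate)
            resumed.append(run_id)
        return resumed

    def _dispatch(self, run_id: str, estimate_cents: int) -> None:
        self.spent_cents += estimate_cents
        self.dispatched.append(run_id)
```

The resume loop stops at the first paused run that still does not fit, so a partial budget increase releases only as much work as the new limit covers.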

Similarly, enforcement needs to handle the period boundary correctly. A policy with a monthly budget should reset its counter at the start of each period, not accumulate indefinitely. Period resets should be predictable and auditable — operators should be able to see when a period started, what was spent, and when the next reset occurs.
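Predictable resets are easier to audit when the period window is computed from the clock rather than stored as mutable state. The sketch below assumes UTC boundaries and weeks starting on Monday; both are illustrative choices, and `period_window` is a hypothetical helper. A fixed-total budget has no window and is deliberately not handled here.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def period_window(now: datetime, period: str) -> Tuple[datetime, datetime]:
    """Return (period_start, next_reset) in UTC for the period containing `now`.

    `now` is expected to be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if period == "daily":
        return day, day + timedelta(days=1)
    if period == "weekly":
        start = day - timedelta(days=day.weekday())  # back to Monday
        return start, start + timedelta(days=7)
    if period == "monthly":
        start = day.replace(day=1)
        # Jump safely into the next month, then snap to its first day.
        next_start = (start + timedelta(days=32)).replace(day=1)
        return start, next_start
    raise ValueError(f"no reset window for period {period!r}")


now = datetime(2024, 2, 29, 2, 0, tzinfo=timezone.utc)
month_start, month_reset = period_window(now, "monthly")
```

Spend for the current period is then the sum of committed records with timestamps at or after `month_start`, which gives operators the start, the total, and the next reset from the same calculation.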

Multi-scope policies and precedence

In practice, a single request may be covered by multiple overlapping policies: the organization-level cap, the agent-level cap, and the workflow-level cap simultaneously. The correct behavior is to evaluate all applicable policies and enforce the most restrictive one that has been triggered. If the workflow cap has not been crossed but the agent cap has, the request is blocked regardless.

This requires the enforcement layer to evaluate policies in a defined order at dispatch time, not just check a single limit. It also means that raising one budget does not automatically unblock work if another policy is still in a blocked state — the operator needs visibility into which policy is the binding constraint. See budgets vs rate limits: controlling agent consumption for how these two mechanisms interact.
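A sketch of multi-scope evaluation, assuming each policy already knows its currently triggered action. `PolicyState`, `SEVERITY`, and `binding_policy` are hypothetical names. The function returns both the action to enforce and the policy responsible, which is the visibility an operator needs before raising a budget.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SEVERITY = {"allow": 0, "alert": 1, "throttle": 2, "pause": 3, "block": 4}


@dataclass
class PolicyState:
    name: str
    scope: str       # "org", "team", "agent", or "workflow"
    entity_id: str
    action: str      # action currently triggered for this policy


def binding_policy(
    policies: List[PolicyState], request: Dict[str, str]
) -> Tuple[str, Optional[PolicyState]]:
    """Evaluate every policy covering the request; the strictest one governs."""
    applicable = [p for p in policies if request.get(p.scope) == p.entity_id]
    worst = max(applicable, key=lambda p: SEVERITY[p.action], default=None)
    if worst is None or worst.action == "allow":
        return "allow", None
    return worst.action, worst


policies = [
    PolicyState("org-monthly", "org", "acme", "allow"),
    PolicyState("agent-daily", "agent", "summarizer", "block"),
    PolicyState("wf-cap", "workflow", "wf-1", "alert"),
    PolicyState("other-team", "team", "billing", "block"),
]
request = {"org": "acme", "team": "search", "agent": "summarizer", "workflow": "wf-1"}
action, binding = binding_policy(policies, request)
```

Here the request is blocked by `agent-daily` even though the workflow cap has only reached its alert threshold; the blocked policy for the unrelated `billing` team does not apply.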

How Praesidia approaches budget enforcement

Praesidia's budget policy system is designed to make enforcement a first-class operation rather than a reporting afterthought. Policies are defined per organization with configurable scope, period, thresholds, and actions. Every applicable policy is evaluated at the point a task is admitted — before work begins — so that cost reservations are in place and any triggered threshold is acted on before a task can proceed.

The reservation model keeps committed and reserved spend separated, which gives accurate in-flight visibility — you can see what is already committed against a budget and what is reserved by tasks currently running, not just what has been spent. Policies that cross a pause threshold automatically suspend affected workflow runs; raising the budget limit clears the armed state and resumes those runs.

Budget status is surfaced in the monitoring dashboard so operators can see current spend, remaining budget, and armed policies without digging into logs. For teams that need programmatic access, the budget policy API supports full CRUD plus period resets and summary views, so budget management can be scripted into deployment pipelines or capacity planning workflows.

Common questions

Does a budget policy stop a task mid-run, or only at dispatch? Enforcement fires at dispatch, before the task enters the queue. A task that has already started runs to completion (or until its own timeout). This means the hard stop is not instantaneous, but it prevents the more damaging case of unlimited new work being accepted after a limit is crossed. Reservation accounting ensures that in-flight cost is counted against the budget, so any overshoot past the limit is bounded by how far the currently running tasks exceed their estimated cost.

What happens if the cost estimation is wrong? Estimates are based on task metadata available at dispatch time: model, expected input size, historical average for similar tasks. They will not be exact. A reconciliation layer periodically re-syncs actual spend from committed records, which corrects drift between estimated reservations and real costs. Policies calibrated for tight limits should account for estimation variance by setting the hard-block threshold below the absolute maximum — for example, blocking at 90% of the stated limit rather than 100%.
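As a rough illustration of the estimate, the safety margin, and the drift that reconciliation corrects, consider the sketch below. All three helpers are hypothetical, the prices are made-up placeholders expressed in cents per million tokens, and the estimator assumes a non-empty history of output lengths for similar tasks.

```python
from typing import List


def estimate_cents(
    input_tokens: int,
    past_output_tokens: List[int],
    in_price: int,
    out_price: int,
) -> int:
    """Estimate a task's cost in whole cents, rounding up.

    Output length is unknown at dispatch, so the historical average
    (rounded up) stands in for it.
    """
    avg_output = -(-sum(past_output_tokens) // len(past_output_tokens))
    raw = input_tokens * in_price + avg_output * out_price
    return -(-raw // 1_000_000)  # ceiling division


def block_threshold_cents(limit_cents: int, margin_pct: int = 10) -> int:
    """Hard-block point set below the limit to absorb estimation error."""
    return limit_cents * (100 - margin_pct) // 100


def drift_cents(estimates: List[int], actuals: List[int]) -> int:
    """Positive drift means tasks cost more than was reserved for them."""
    return sum(actuals) - sum(estimates)


estimate = estimate_cents(2000, [400, 600, 500], in_price=300, out_price=1500)
threshold = block_threshold_cents(10_000)
```

Rounding estimates up biases reservations toward over-counting, which is the safer direction when the limit is tight.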

Can I set different budgets for different teams within one organization? Yes. Team-scoped policies target a specific team identifier, and multiple policies can coexist on the same organization with different scopes and periods. An agent belonging to one team is evaluated against that team's policy, the agent-level policy (if one exists), and the organization-level policy simultaneously. The tightest triggered threshold governs.

How do credits relate to budget policies? Budget policies govern the rate and ceiling of spend within a period; credits are the prepaid balance that funds that spend. The two work together: the credit balance is the hard backstop (agents cannot spend what is not there), while a budget policy adds structured per-entity and per-period limits on top. For more on the credit ledger model, see credits and cost monitoring for agent spend.
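In the simplest reading, the amount an entity can still spend is whichever is smaller: the remaining credit balance or the remaining budget for the period. The helper below is a hypothetical one-line sketch of that rule, not a description of any particular ledger.

```python
def spendable_cents(credit_balance: int, budget_limit: int, budget_spent: int) -> int:
    """Remaining headroom: capped by both the credits and the period budget."""
    budget_remaining = budget_limit - budget_spent
    return max(0, min(credit_balance, budget_remaining))


credit_bound = spendable_cents(credit_balance=500, budget_limit=1000, budget_spent=300)
budget_bound = spendable_cents(credit_balance=5000, budget_limit=1000, budget_spent=300)
```

In the first call the credits are the binding constraint; in the second, the budget policy is.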

Cost control in LLM applications is most effective when it operates in the dispatch path rather than the reporting path. Attribution, reservation accounting, and automated enforcement actions are the three primitives that close the gap between knowing what was spent and preventing overspend in the first place.