How to Rate-Limit AI Agents

Rate-limiting AI agents means enforcing a ceiling on how many requests a given agent can initiate within a time window. The standard techniques from web API rate-limiting apply here, but agents break several assumptions those techniques rely on: they can saturate a limit in seconds, they operate across multiple orchestration layers simultaneously, and they have no human feedback loop to slow them down. Effective rate-limiting for agents requires thinking through the unit of measurement, the window shape, the scope of enforcement, and what happens when a limit is reached.

Why standard rate-limiting falls short for agents

Web API rate limits were designed for human-paced clients: a browser, a mobile app, an integration that occasionally polls. Even aggressive automated callers tend to operate at tens to hundreds of requests per minute. An AI agent running an autonomous workflow has no such natural pacing. It can exhaust a per-minute limit in seconds and then retry in a tight loop, amplifying the problem.

There is also a structural difference in who owns the limit. With a conventional API, the limit is set by the provider and applies to the caller. With agents, you are often the operator: you set the limit on your own agents, for your own downstream systems and LLM providers. The goal is not to protect a vendor's infrastructure; it is to protect your organization from runaway costs, cascading failures, and accidental abuse of third-party services. See the threat model for runaway agent spend for the failure modes in detail.

Finally, agents frequently operate in multi-hop chains: one orchestrating agent dispatches to several worker agents, each of which may call tools and LLMs. A single high-level task can translate into dozens of downstream calls. A limit applied only at the outermost layer does not prevent a burst at an inner layer.

Choosing the right unit of measurement

Before setting any numbers, decide what you are measuring. The choices are not mutually exclusive, and a mature approach uses more than one.

Request count is the most familiar measure. You count the number of calls an agent makes to a given target — an LLM API, an external service, or an internal endpoint — within a window. It is easy to audit and easy to enforce with a simple counter.

Token count is more precise for LLM calls. Two requests can differ by an order of magnitude in resource consumption depending on context and response length. If your primary concern is LLM spend, limiting by tokens per minute is more accurate than limiting by request count alone.

Spend bridges token counting and billing by expressing limits directly in currency terms. Spend caps behave differently from rate limits: they are cumulative over a longer window (usually a month) rather than a rolling short window. Combining a per-minute token limit with a monthly spend cap gives you both burst protection and total-cost protection. See budgets vs rate limits: controlling agent consumption for a deeper comparison.
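As a rough sketch of how the two controls can sit side by side, the Python below checks a per-minute token limit and a cumulative spend cap in one place. The class name `UsageGuard`, the flat `usd_per_1k_tokens` price, and the fixed one-minute window are assumptions made for illustration. Real pricing varies by model, and a production version would also reset the spend counter at each billing period.

```python
from dataclasses import dataclass


@dataclass
class UsageGuard:
    """Hypothetical guard: per-minute token limit plus a cumulative spend cap."""
    tokens_per_minute: int
    monthly_cap_usd: float
    usd_per_1k_tokens: float      # assumed flat price, for illustration only
    _minute: int = -1             # index of the current one-minute window
    _tokens_this_minute: int = 0
    _spend_this_month: float = 0.0

    def allow(self, tokens: int, now_s: float) -> bool:
        minute = int(now_s // 60)
        if minute != self._minute:
            # A new minute has started, so the short-window counter resets.
            self._minute = minute
            self._tokens_this_minute = 0
        cost = tokens / 1000 * self.usd_per_1k_tokens
        if self._tokens_this_minute + tokens > self.tokens_per_minute:
            return False          # burst protection
        if self._spend_this_month + cost > self.monthly_cap_usd:
            return False          # total-cost protection
        self._tokens_this_minute += tokens
        self._spend_this_month += cost
        return True
```

The spend counter deliberately survives the minute rollover: an agent that paces itself under the token limit can still be stopped once its cumulative cost reaches the cap.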

Task complexity is harder to measure in advance but sometimes useful as a secondary signal. An agent submitting tasks that each fan out to many tool calls is more impactful than one making single-step calls, even if the raw request count looks similar.

Choosing a window shape

Two window shapes dominate rate-limiting implementations, and the token bucket is a common third option.

A fixed window resets the counter at a clock boundary — the top of each minute, for example. It is simple to implement and reason about, but it creates a boundary vulnerability: an agent can fire close to the max at 11:59 and again right at 12:00, producing a burst of nearly double the nominal limit in a short period.
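The boundary problem is easy to reproduce. The following is a minimal sketch, not a production limiter: `FixedWindowLimiter` is a hypothetical name, and time is passed in as plain seconds so the example is deterministic.

```python
class FixedWindowLimiter:
    """Illustrative fixed-window counter; resets at each window boundary."""

    def __init__(self, limit: int, window_s: int = 60):
        self.limit = limit
        self.window_s = window_s
        self.window_id = None
        self.count = 0

    def allow(self, now_s: float) -> bool:
        window_id = int(now_s // self.window_s)
        if window_id != self.window_id:
            # Crossing a clock boundary discards the previous count entirely.
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowLimiter(limit=100)
# 100 requests in the last second of one minute, 100 in the first second of the next.
late = sum(limiter.allow(59.0 + i / 100) for i in range(100))
early = sum(limiter.allow(60.0 + i / 100) for i in range(100))
burst_in_two_seconds = late + early  # 200, twice the nominal per-minute limit
```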

A sliding window tracks the timestamp of each request within the last N seconds and counts those that fall within the window at query time. It removes the boundary burst but requires storing per-request timestamps (or approximating with a two-bucket sliding-window estimate). For agents that can saturate limits very quickly, the smoother enforcement of a sliding window is usually worth the added implementation complexity.
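A sliding-window log can be sketched in a few lines. This version stores one timestamp per accepted request, which is the simple but memory-hungry variant rather than the two-bucket approximation. The class name and the explicit `now_s` argument are assumptions for illustration.

```python
from collections import deque


class SlidingWindowLimiter:
    """Illustrative sliding-window log: one stored timestamp per accepted request."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.timestamps = deque()

    def allow(self, now_s: float) -> bool:
        cutoff = now_s - self.window_s
        # Drop timestamps that have aged out of the window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now_s)
        return True


limiter = SlidingWindowLimiter(limit=100)
# Same traffic shape as the 11:59 / 12:00 scenario: two bursts either side of a boundary.
late = sum(limiter.allow(59.0 + i / 100) for i in range(100))
early = sum(limiter.allow(60.0 + i / 100) for i in range(100))
```

Here the second burst is rejected in full, because the first 100 requests are still inside the trailing 60 seconds when it arrives.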

A token bucket is a third option that allows short bursts up to a bucket capacity, then enforces a steady refill rate. This models real-world traffic well: it lets an agent complete a reasonable burst of related work quickly, then enforces pacing over the longer term. The tradeoff is that the capacity setting requires tuning — too large a bucket and you get a burst that is indistinguishable from no limit at all.
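A token bucket is compact enough to show in full. The sketch below assumes a caller-supplied clock in seconds and uses hypothetical names (`TokenBucket`, `refill_per_s`). The optional `cost` argument is one way to charge heavier requests more than one token.

```python
class TokenBucket:
    """Illustrative token bucket: bursts up to `capacity`, then a steady refill."""

    def __init__(self, capacity: float, refill_per_s: float, now_s: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity        # start full so an initial burst is allowed
        self.last_s = now_s

    def allow(self, now_s: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_s = now_s
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

With `capacity=10` and `refill_per_s=2`, an agent can fire ten requests at once and is then held to two per second. Raising `capacity` widens the permitted burst without changing the long-run rate.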

Scoping limits to the right granularity

A single global rate limit per organization is almost never the right scope. Agents have different roles, different criticality, and different trust levels. A policy that limits every agent equally either over-restricts high-priority agents or under-restricts low-priority ones.

More useful scoping levels include:

Scope	What it protects
Per agent	Prevents any single agent from monopolizing capacity
Per connection	Limits the throughput on a specific agent-to-agent or agent-to-MCP link
Per team	Ensures departments share capacity fairly within an org
Per task type	Lets lightweight tasks run at high frequency while throttling expensive ones more tightly
Per target service	Keeps agent activity within the limits your downstream vendors impose

Connection-level limits are especially valuable. When an agent connects to another agent or to an MCP server, that connection represents an explicit trust relationship with a negotiated communication policy. Rate limits embedded in the connection policy apply at the link level, so an agent that is allowed to call Service A frequently can be held to tighter limits on Service B — without any shared global counter that would interfere with other agents' access to either service.
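One way to picture link-level scoping is a limit table keyed by the (agent, target) pair rather than by agent or by organization. The sketch below is illustrative only: the agent and service names, the per-minute numbers, and the deny-by-default choice are all assumptions, and it uses a simple fixed one-minute counter to stay short.

```python
from collections import defaultdict

# Hypothetical per-connection policy: (agent_id, target) -> max requests per minute.
CONNECTION_LIMITS = {
    ("research-agent", "service-a"): 120,
    ("research-agent", "service-b"): 10,
}
DEFAULT_LIMIT = 0  # assumption: a link with no explicit policy is denied

# (agent_id, target, minute index) -> requests accepted in that minute.
# Old minutes are never evicted in this sketch; a real limiter would expire them.
_counts = defaultdict(int)


def allow(agent_id: str, target: str, now_s: float) -> bool:
    limit = CONNECTION_LIMITS.get((agent_id, target), DEFAULT_LIMIT)
    key = (agent_id, target, int(now_s // 60))
    if _counts[key] >= limit:
        return False
    _counts[key] += 1
    return True
```

Because each link has its own counter, exhausting the tighter limit on `service-b` leaves the same agent's allowance on `service-a` untouched.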

What to do when a limit is hit

The most important decision is whether to fail closed, queue, or backpressure.

Fail closed returns an error to the agent immediately when the limit is exceeded. This is the safest default: it stops the burst instantly and surfaces a clear signal. The downside is that the agent must handle the error gracefully rather than enter a tight retry loop. Well-implemented agent frameworks respect Retry-After headers and back off exponentially; not all do.
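On the calling side, handling a fail-closed rejection might look like the sketch below. `RateLimited` and `call_with_backoff` are hypothetical names, and the exception stands in for whatever error your enforcement layer returns. The delay honors the server's retry hint and otherwise uses capped exponential backoff with jitter.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when the limiter rejects a call."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"rate limited, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


def call_with_backoff(call, max_attempts=5, base_s=0.5, cap_s=30.0, sleep=time.sleep):
    """Retry `call` on RateLimited, waiting longer after each rejection."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error instead of looping forever
            backoff = min(cap_s, base_s * 2 ** attempt)
            # Never retry sooner than the limiter asked; jitter spreads out retries.
            delay = max(exc.retry_after_s, random.uniform(0, backoff))
            sleep(delay)
```

The bounded attempt count matters as much as the delay: it is what prevents the tight retry loop described above.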

Queuing holds the excess requests and processes them as capacity allows. This works well for non-time-sensitive background tasks. It requires a durable queue and a bounded queue depth; an unbounded queue just converts a rate limit into a delay, which can be worse if the agent keeps submitting work faster than the queue drains, because the backlog and its latency then grow without bound.
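The bounded-depth half of that requirement can be sketched with Python's standard `queue` module. This example is in-memory only, so it is not durable; `BoundedDispatchQueue` and its methods are hypothetical names chosen for illustration.

```python
import queue


class BoundedDispatchQueue:
    """Illustrative in-memory queue that rejects work once it is full."""

    def __init__(self, max_depth: int):
        self._q = queue.Queue(maxsize=max_depth)

    def submit(self, task) -> bool:
        try:
            self._q.put_nowait(task)
            return True
        except queue.Full:
            return False  # shed load rather than let the backlog grow unbounded

    def drain(self, capacity: int) -> list:
        """Take up to `capacity` tasks, in submission order."""
        taken = []
        for _ in range(capacity):
            try:
                taken.append(self._q.get_nowait())
            except queue.Empty:
                break
        return taken
```

A rejected `submit` gives the agent the same clear signal as a fail-closed error, so the two strategies compose: queue up to a fixed depth, then fail closed.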

Backpressure is the most elegant pattern when the agent framework supports it: the rate limiter signals the caller to slow its submission rate rather than accumulating a queue or dropping requests. This requires coordination between the enforcing layer and the agent runtime, which is worth designing for when you control both sides.
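One simple way to express a backpressure signal is for the limiter to return a wait time instead of a yes or no. The sketch below assumes the agent runtime cooperates by sleeping for the returned delay; `Pacer` and `reserve` are hypothetical names, and this is only one of several possible designs.

```python
class Pacer:
    """Illustrative pacer: tells the caller how long to wait before submitting."""

    def __init__(self, rate_per_s: float, now_s: float = 0.0):
        self.interval_s = 1.0 / rate_per_s
        self.next_free_s = now_s

    def reserve(self, now_s: float) -> float:
        """Reserve the next slot and return the delay (seconds) until it starts."""
        start = max(now_s, self.next_free_s)
        self.next_free_s = start + self.interval_s
        return start - now_s
```

Nothing is dropped and nothing accumulates in a queue on the limiter's side. Each caller is handed a progressively later slot, which slows the submission rate at the source.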

Active time windows and temporal scoping

Agents running on behalf of people — processing support queues, drafting content, triggering workflows — often should not operate outside business hours or outside the task's authorized time window. A rate limit tells you how fast an agent can move; a time window tells you when it is allowed to move at all.

Combining both gives you temporal isolation: an agent might be allowed up to 200 requests per hour, but only between 09:00 and 18:00 UTC, Monday through Friday. Outside that window, dispatches are rejected regardless of the rate counter. This pattern limits blast radius from runaway or hijacked agents that might otherwise run overnight before anyone notices.
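The time-window half of that check is small. The sketch below hard-codes the example's hours and weekdays as module constants, which is an assumption for illustration, and it rejects naive datetimes so that a local-time value is never silently treated as UTC.

```python
from datetime import datetime, time, timezone

ACTIVE_START = time(9, 0)          # 09:00 UTC, inclusive
ACTIVE_END = time(18, 0)           # 18:00 UTC, exclusive
ACTIVE_WEEKDAYS = {0, 1, 2, 3, 4}  # Monday=0 through Friday=4


def within_active_window(now: datetime) -> bool:
    """Return True if `now` falls inside the authorized window."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    return (
        now_utc.weekday() in ACTIVE_WEEKDAYS
        and ACTIVE_START <= now_utc.time() < ACTIVE_END
    )
```

A dispatcher would evaluate this before consulting the rate counter, so a request outside the window is rejected even when the agent has rate allowance left.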

Common questions

How do I set the right rate limit numbers?

Start by profiling actual agent behavior under normal operating conditions before any limits are in place. Measure peak burst rates, average sustained rates, and the distribution of downstream calls per high-level task. Set the sustained limit comfortably above normal peak with a burst allowance that covers legitimate spikes — then tighten from there based on what your downstream systems can absorb and what your spend targets require. Rate limits are almost always wrong on the first pass; build in observability from the start so you can see when limits are hit and whether they are too tight or too loose.

Should rate limits apply before or after trust evaluation?

After. Trust evaluation determines whether an agent is allowed to act at all. Rate limiting determines how often an allowed agent may act. Applying rate limits before trust evaluation means spending enforcement resources on agents you would reject anyway. The practical ordering is to authenticate first, then evaluate all applicable policies, rate limits included, before dispatch; rate limiting only makes sense after you have confirmed the caller is who it claims to be.
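The ordering can be written as a short admission function. The three callables are placeholders for whatever authentication, trust, and rate-limiting components you run, and the string results are hypothetical. The point of the sketch is only that the rate limiter is not consulted for callers that were already rejected.

```python
def admit(request, authenticate, is_trusted, rate_limiter_allows) -> str:
    """Illustrative admission order: identity, then trust, then rate."""
    if not authenticate(request):
        return "unauthenticated"
    if not is_trusted(request):
        return "denied"
    # Only authenticated, trusted callers consume rate-limit capacity.
    if not rate_limiter_allows(request):
        return "rate_limited"
    return "dispatched"
```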

How does rate-limiting interact with spend budgets?

They control different things and should be used together. A rate limit caps request frequency within a short window — it gives you burst protection and prevents runaway loops. A spend budget caps cumulative cost over a longer window — it gives you financial protection and monthly ceiling enforcement. An agent can stay within its per-minute rate limit while steadily draining a monthly budget; a budget cap stops that scenario where a rate limit cannot. Praesidia's connection policy model lets you set both simultaneously on a single connection, so neither control is applied in isolation. See budget policies: hard spend caps for AI agents for how the two interact at the policy-evaluation layer.

Putting it together

Rate-limiting agents well requires treating the limit not as a single global dial but as a policy composed from several dimensions: the unit of measurement, the window shape, the enforcement scope, the response to violations, and the time window in which the agent is authorized to act. Getting any one of these wrong either leaves gaps or creates unnecessary friction.

The most durable approach is to enforce limits at the point where the agent communicates with its downstream targets, at the connection or link level rather than globally, so that limits can be tuned to the specific relationship between an agent and a service. Combine that with token-bucket or sliding-window counting, a mix of per-request rate limits and longer-window spend caps, and clear fail-closed semantics with exponential backoff, and you have controls that protect both your infrastructure and your downstream vendors without unnecessarily throttling legitimate work. Praesidia surfaces these controls as part of the connection policy attached to every agent-to-agent and agent-to-MCP link, so they are always scoped, auditable, and adjustable without touching your agent code.