Health Probes and Readiness for AI Infrastructure

Health probes are how your orchestrator knows whether a process is alive, able to serve traffic, and connected to the systems it depends on. For AI services, the stakes are higher than for a typical web app: a wedged agent runner, a stale model connection, or a disconnected queue can corrupt task state in ways that are expensive to diagnose. Getting the health surface right from the start is an investment that compounds, and it fits into the broader picture of observability for AI agents that every production deployment needs.

The difference between liveness and readiness

These two probe types serve different purposes and must never be conflated.

A liveness probe answers one question: is this process still functional? If the event loop is blocked for an extended period, if the process has deadlocked, or if memory pressure has put it into an unresponsive state, a failing liveness probe tells the orchestrator to restart the pod. The restart itself is the recovery action. You should fail liveness only when there is no alternative to a restart — not for transient upstream failures.

A readiness probe answers a different question: should this instance receive traffic right now? A process can be alive but not ready. During startup, the service may not have completed its database connection pool warm-up or its cache preload. During a degraded period — say, when the primary database is temporarily unreachable — you may want to pull the instance from rotation without restarting it. Readiness gates traffic; liveness gates restarts.

Mixing them up creates operational problems. If readiness returns 503 every time a dependent service is slow, Kubernetes will pull your pods from rotation during any upstream blip. If liveness runs that same dependency check, Kubernetes will also restart processes that are perfectly healthy, compounding the incident.
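As a minimal sketch of that separation, the two probes can read the same in-process state but apply different rules to it. The ProbeState flags and both function names are hypothetical, chosen only for illustration; a real service would wire them to its own watchdog and dependency checks.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Hypothetical in-process flags that the rest of the service keeps updated."""
    wedged: bool = False             # set by a watchdog when only a restart can help
    startup_complete: bool = False   # set once pools and caches are warmed
    dependencies_ok: bool = True     # set by database, cache, and queue checks


def liveness_status(state: ProbeState) -> int:
    # Only an unrecoverable condition fails liveness; the restart is the fix.
    return 503 if state.wedged else 200


def readiness_status(state: ProbeState) -> int:
    # Anything that means "do not send traffic right now" fails readiness.
    if not state.startup_complete or not state.dependencies_ok:
        return 503
    return 200
```

With a dependency down, readiness_status returns 503 while liveness_status still returns 200, so the instance leaves rotation without being restarted.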

What to check on each probe type

For a minimal but honest liveness probe, checking event loop lag is often more reliable than application-level assertions. A process that is busy but functional has low lag; a process that has stalled or is thrashing has high and rising lag. A practical rule is to fail the liveness probe only when measured lag has stayed above a generous ceiling (say, five seconds) for a sustained window. This prevents false positives from momentary GC pauses while still catching genuine wedges.
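One way to measure this, sketched here with Python's asyncio, is to schedule a short sleep and compare how long it really took against how long it should have taken. The LoopLagMonitor class, its parameter names, and the 30-second window are assumptions for illustration; only the five-second ceiling comes from the text above.

```python
import asyncio
import time
from typing import Optional


class LoopLagMonitor:
    """Tracks event loop lag and reports unhealthy only after sustained lag."""

    def __init__(self, ceiling_s: float = 5.0, sustained_s: float = 30.0,
                 interval_s: float = 1.0) -> None:
        self.ceiling_s = ceiling_s
        self.sustained_s = sustained_s
        self.interval_s = interval_s
        self._over_since: Optional[float] = None  # when lag first crossed the ceiling

    def record(self, lag_s: float, now: float) -> None:
        if lag_s > self.ceiling_s:
            if self._over_since is None:
                self._over_since = now
        else:
            # A single sample under the ceiling resets the window.
            self._over_since = None

    def is_live(self, now: float) -> bool:
        if self._over_since is None:
            return True
        return (now - self._over_since) < self.sustained_s

    async def run(self) -> None:
        # Background task: how late does a sleep of interval_s wake up?
        while True:
            start = time.monotonic()
            await asyncio.sleep(self.interval_s)
            now = time.monotonic()
            self.record(max(0.0, now - start - self.interval_s), now)
```

The liveness handler would call is_live(time.monotonic()). If the loop is blocked so completely that the handler itself cannot answer, the probe's own timeout on the orchestrator side covers that case.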

For a readiness probe, you want to confirm that the services essential for request processing are reachable. At minimum:

Database connectivity. A lightweight SELECT 1 is sufficient. You are not testing query performance; you are testing that the connection pool can reach the primary.
Cache connectivity. A round-trip health check confirms the cache layer is reachable. This matters for AI services whose queue state lives in the cache layer: without a cache connection, queued tasks will not be dispatched.
Queue health. If your AI workload runs through a job queue, the queue subsystem's reachability is a readiness concern. A service that accepts HTTP traffic but cannot enqueue tasks will accept work it cannot deliver.

The readiness response should be deliberately minimal: a status field and an HTTP status code. Avoid exposing the internal breakdown in the public readiness endpoint — that information belongs in a protected diagnostic surface.
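A readiness handler along these lines might look like the following sketch. It assumes each dependency check is an async callable that raises on failure; the readiness function, the Check alias, and the one-second timeout are illustrative choices, not a prescribed interface.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, Tuple

# A check is any zero-argument coroutine function that raises on failure,
# for example one that runs SELECT 1 or pings the cache.
Check = Callable[[], Awaitable[None]]


async def _passes(check: Check, timeout_s: float) -> bool:
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return True
    except Exception:
        # Timeouts and connection errors both count as "not ready".
        return False


async def readiness(checks: Dict[str, Check],
                    timeout_s: float = 1.0) -> Tuple[int, str]:
    # Run all checks concurrently so one slow dependency does not delay the rest.
    results = await asyncio.gather(*(_passes(c, timeout_s) for c in checks.values()))
    if all(results):
        return 200, json.dumps({"status": "ok"})
    # Deliberately no per-dependency detail in the public response.
    return 503, json.dumps({"status": "unavailable"})
```

The per-check timeout keeps the handler fast even when a dependency hangs, and the body says nothing about which dependency failed.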

Dependency checks: deep probes for upstream services

Beyond your own infrastructure, AI services depend on external upstream providers: model API endpoints, payment processors, email delivery, object storage. These are worth probing, but with important caveats.

Deep upstream probes should run on a background schedule, not inline on every probe request. Calling an external HTTP endpoint synchronously inside a probe handler adds latency and couples your health status to third-party availability, which load balancers will misread as a fault in your own instance. Instead, run a background job on a leader-elected schedule, cache the results, and serve the probe endpoint from that cache. If the cache is cold (for instance, immediately after a restart, before the first background run), the endpoint should signal unavailability rather than return an empty or optimistic response.
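The cache-backed pattern can be sketched as below. UpstreamProbeCache and its method names are hypothetical, each probe is assumed to be a plain callable returning True or False, and scheduling and leader election are left out: whatever scheduler you use would call refresh().

```python
import time
from typing import Callable, Dict, Optional, Tuple


class UpstreamProbeCache:
    """Holds the most recent background probe results for upstream services."""

    def __init__(self, probes: Dict[str, Callable[[], bool]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probes = probes
        self._clock = clock
        self._results: Optional[Dict[str, bool]] = None  # None means cold cache
        self._checked_at = 0.0

    def refresh(self) -> None:
        # Called by the background scheduler, never by the probe handler.
        results = {}
        for name, probe in self._probes.items():
            try:
                results[name] = bool(probe())
            except Exception:
                results[name] = False
        self._results = results
        self._checked_at = self._clock()

    def handle_probe(self) -> Tuple[int, dict]:
        # Serves only cached data, so the handler never waits on a third party.
        if self._results is None:
            return 503, {"status": "unknown", "reason": "no probe run completed yet"}
        age = self._clock() - self._checked_at
        return 200, {"age_seconds": round(age, 1), "upstreams": dict(self._results)}
```

Including the age of the results in the response lets a caller judge for itself how fresh the picture is.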

A second subtlety: distinguish between a dependency that is unreachable and one that is reachable but responding with errors. A 503 from an upstream at least proves the network path works, which a DNS timeout does not, but neither means the integration is fully healthy. Your probe logic needs to encode what "degraded" means per dependency and roll those signals up to a meaningful overall status: operational, degraded, or outage.
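One possible encoding is sketched below. The classification thresholds and the idea of a "critical" set are assumptions made for the example; the right policy differs per dependency, and only the three status names come from the text.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Status(str, Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    OUTAGE = "outage"


_SEVERITY = {Status.OPERATIONAL: 0, Status.DEGRADED: 1, Status.OUTAGE: 2}


def classify(http_status: Optional[int]) -> Status:
    """Example policy: None means no response at all (DNS failure, timeout)."""
    if http_status is None:
        return Status.OUTAGE        # unreachable
    if 200 <= http_status < 400:
        return Status.OPERATIONAL
    return Status.DEGRADED          # reachable, but answering with errors


def roll_up(statuses: Dict[str, Status], critical: Set[str]) -> Status:
    """Worst status wins, except that a non-critical outage only degrades."""
    worst = Status.OPERATIONAL
    for name, status in statuses.items():
        if status is Status.OUTAGE and name not in critical:
            status = Status.DEGRADED
        if _SEVERITY[status] > _SEVERITY[worst]:
            worst = status
    return worst
```

Under this policy an email provider outage leaves the overall status at degraded, while a database outage reports outage.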

The aggregate status surface

For operators, you want a richer status surface than a single boolean. A useful health status response includes:

Individual component status for each dependency (database, cache, queues, upstreams)
Response time for the health check itself as a rough latency signal
An overall rolled-up status that can be consumed by dashboards and alerting

This status surface is distinct from the readiness probe. Readiness is for the orchestrator; the richer status is for operators. Guard the detailed diagnostic endpoint behind admin authentication so it is not scraped by external actors, but make it easily accessible to your team via the API or an admin console.
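A rough sketch of such a response follows. The build_status_report function and the field names are illustrative, each check is assumed to return one of the three status strings, and the admin authentication that should guard the endpoint is omitted.

```python
import time
from typing import Callable, Dict

_SEVERITY = ["operational", "degraded", "outage"]  # least to most severe


def build_status_report(checks: Dict[str, Callable[[], str]]) -> dict:
    """Runs each component check and returns a per-component breakdown."""
    started = time.perf_counter()
    components = {}
    for name, check in checks.items():
        try:
            status = check()
            if status not in _SEVERITY:
                status = "outage"  # an unrecognized answer is treated as a failure
        except Exception:
            status = "outage"
        components[name] = status
    # The overall status is the most severe component status.
    overall = max(components.values(), key=_SEVERITY.index, default="operational")
    return {
        "status": overall,
        "response_ms": round((time.perf_counter() - started) * 1000, 1),
        "components": components,
    }
```

The handler that serves this report would sit behind the same admin authentication as the rest of the diagnostic surface.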

For AI control planes in particular, the diagnostic surface is valuable during incidents. When an agent task is stalling, knowing at a glance that the model API is degraded or that the queue layer is intermittent cuts the time-to-diagnosis dramatically compared to having to instrument everything manually. This pairs well with Prometheus metrics and observability as a complementary signal layer for ongoing monitoring.

Health probes for worker processes

Many AI services separate the HTTP layer from the background worker layer. The web process handles API requests; a separate worker process runs the queue consumers that execute agent tasks. These are distinct operating units with distinct health concerns, and each needs its own health surface.

A worker process cannot use the same HTTP health server as the web process, and by default it may not run an HTTP server at all. The right pattern is a lightweight standalone HTTP server running inside the worker process specifically for health probing, bound to a different port. The orchestrator probes it independently. A worker that has crashed or deadlocked should fail its own liveness check and be restarted without requiring the web tier to detect it.
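A minimal version of that server, using only the Python standard library, might look like this. The Heartbeat class, the /livez path, port 9090, and the 60-second silence limit are all assumptions for the sketch; the worker's consumer loop is expected to call beat() regularly.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


class Heartbeat:
    """The worker loop calls beat(); the probe fails if beats stop arriving."""

    def __init__(self, max_silence_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        return self._clock() - self._last_beat <= self._max_silence_s


def start_worker_health_server(heartbeat: Heartbeat, host: str = "0.0.0.0",
                               port: int = 9090) -> ThreadingHTTPServer:
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/livez":
                code = 200 if heartbeat.is_alive() else 503
            else:
                code = 404
            self.send_response(code)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args) -> None:
            pass  # keep probe traffic out of the worker's logs

    server = ThreadingHTTPServer((host, port), Handler)
    # A daemon thread serves probes without keeping the process alive on exit.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the server runs on its own thread, a consumer loop that deadlocks stops beating and the probe starts returning 503. Long-running agent tasks need to beat while they work, or the silence limit has to exceed the longest expected task.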

Structuring your admin health page

Beyond probes for automated infrastructure, human operators need a health dashboard. A good admin health page polls the status surface on a short interval (every 30 seconds is a reasonable cadence) and surfaces the component breakdown in a readable format. It should refresh only while the tab is visible; polling from hidden tabs adds load that no one is there to benefit from.

The health page serves a different audience than the probe endpoints. It is for the engineer investigating an incident, not for the load balancer deciding traffic routing. Include enough detail to distinguish between "database is down" and "model API is degraded" — these require very different responses.

Praesidia exposes exactly this surface: separate orchestrator-facing probes for traffic routing and process restart decisions, plus a protected diagnostic endpoint with per-dependency breakdown available only to administrators. The admin console's System Health view polls that diagnostic surface automatically, giving operators a live picture of database, cache, and queue state without requiring them to call the API directly.

Common questions

Should I fail my readiness probe when an upstream model API is unavailable?

Generally, no — not unless your service can do nothing without that specific upstream for every request. If the model API is down but your service can still handle administrative API calls, configuration reads, or other operations, failing readiness will unnecessarily pull you from the load balancer rotation. A better pattern is to mark the status as degraded in your diagnostic surface and return appropriate errors from the endpoints that specifically require the model API, while keeping readiness healthy for the rest. Reserve failing readiness for dependencies that make the entire service unable to serve any meaningful traffic.

How do I prevent the dependency probe from showing stale data immediately after a restart?

The cleanest approach is to return 503 from the dependency probe endpoint while the cache is cold, rather than returning an empty or optimistic response. The caller, whether an operator or a monitoring tool, then has to retry after the first background probe run completes. Document the expected cold-start window in your runbook. The window lasts until that first run finishes, which for most background probe cadences is under two minutes after startup.

What should the readiness probe actually return in its response body?

Keep it minimal. Return only a status field and let the HTTP status code carry the signal. The orchestrator looks only at the response code, so exposing the full dependency breakdown in the readiness endpoint adds nothing and can leak infrastructure topology to anyone who can reach the endpoint. Put the breakdown in the authenticated diagnostic endpoint instead.

How should health probes interact with alerting and on-call workflows?

Probe failures should feed into your alerting pipeline, but with appropriate deduplication and delay. A single probe failure that self-resolves in seconds should not page anyone. Set a sustained failure window — typically three to five consecutive failures over one to two minutes — before triggering an alert. The rich diagnostic endpoint is what the on-call responder opens first; it should be the starting point of your runbook, not a final destination reached after manual investigation.
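The sustained-failure rule can be sketched as a small state machine that sits between probe results and the pager. AlertGate and its return values are hypothetical names; most alerting systems offer an equivalent setting, so treat this as an illustration of the logic rather than something to deploy.

```python
from typing import Optional


class AlertGate:
    """Fires once after N consecutive probe failures, resolves on recovery."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._consecutive = 0
        self._firing = False

    def observe(self, probe_ok: bool) -> Optional[str]:
        """Returns "fire", "resolve", or None when nothing should be sent."""
        if probe_ok:
            self._consecutive = 0
            if self._firing:
                self._firing = False
                return "resolve"
            return None
        self._consecutive += 1
        if self._consecutive >= self.threshold and not self._firing:
            self._firing = True  # deduplicate: one page per incident
            return "fire"
        return None
```

A single failure followed by a success produces no event at all, and a long outage produces exactly one "fire" followed by one "resolve".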

For more on how Praesidia structures its observability surface, see the operations dashboard for your AI estate or explore how service level objectives for AI services build on a reliable health probe foundation.