Tackle LLM Rate Limits and Outages with an AI Gateway
An AI gateway absorbs LLM rate limits and outages through automatic failover, load balancing, and semantic caching, keeping production AI applications online when individual providers throttle or fail.
LLM rate limits and outages are no longer edge cases for production AI teams. Every major provider, from OpenAI to Anthropic to Google Vertex, throttles requests through tiered RPM and TPM ceilings, and every major provider has logged multi-hour incidents in the last twelve months. When a primary provider returns a 429 or a 503 in the middle of a customer-facing workflow, the application breaks unless the infrastructure underneath can absorb the failure. An AI gateway sits between the application and the providers, turning rate limits and outages from a runtime crisis into a routing decision. This post explains the failure modes, the architectural patterns that handle them, and how Bifrost implements those patterns at production scale with 11 microseconds of overhead per request.
Understanding the LLM Rate Limit and Outage Problem
Rate limits and outages share a single application-level symptom (the request fails) but have different root causes. Resolving them requires understanding how each provider enforces capacity and where the infrastructure breaks under load.
LLM rate limits are enforced along several dimensions simultaneously (the sketch after this list shows how to inspect them at runtime):
- Requests per minute (RPM): a hard cap on API calls in a rolling 60-second window, regardless of token count.
- Input and output tokens per minute (ITPM, OTPM, or combined TPM): the total token volume the provider will process per minute. Anthropic measures input and output tokens separately, while OpenAI's rate limit documentation treats RPM, TPM, RPD, and TPD as independent constraints; whichever ceiling is hit first triggers the throttle.
- Concurrency limits: caps on simultaneous in-flight requests, applied separately from per-minute limits.
- Daily token and request caps (TPD, RPD): hard ceilings that reset on a 24-hour window, common on lower tiers.
- Tier-based scaling: limits expand as cumulative spend grows, but the gap between tiers can be substantial. Anthropic documents acceleration limits that trigger when usage spikes sharply, on top of the standard tier limits.
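These ceilings are observable at runtime. OpenAI, for example, reports per-dimension headroom in x-ratelimit-* response headers, which a client can inspect after each call. A minimal Python sketch follows; the header names are OpenAI's documented convention, and other providers use different names:

```python
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "ping"}],
    },
)

# Whichever dimension reaches zero first is the one that triggers the 429.
for header in (
    "x-ratelimit-remaining-requests",
    "x-ratelimit-remaining-tokens",
    "x-ratelimit-reset-requests",
    "x-ratelimit-reset-tokens",
):
    print(header, resp.headers.get(header))
```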
Outages compound the problem. StatusGator has tracked 294 OpenAI outages since the start of 2025, and incident hubs continue to log monthly degradation events across every major provider. The December 2024 OpenAI outage took ChatGPT, the API, and Sora offline globally for several hours, attributed in incident coverage to an upstream networking issue. Application teams that depend on a single provider have no infrastructure-level recourse when this happens.
The result is a class of production failures that application code cannot solve cleanly:
- 429 responses during traffic spikes that exceed RPM or TPM
- 5xx errors during regional or global provider degradation
- Model-specific unavailability when a particular model is offline for maintenance
- Authentication or billing-related throttling that activates without warning
- Latency degradation that does not return errors but stalls user-facing workflows
Approaches to Handling LLM Rate Limits and Outages
Three architectural patterns have emerged for keeping AI applications online when providers throttle or fail. Each comes with trade-offs around complexity, cost, and operational overhead.
The first approach is application-level retry and backoff. Engineers add exponential backoff with jitter inside the application code, retrying 429 and 5xx responses until the request succeeds. This works for transient errors that resolve in seconds, but it does nothing during extended outages and consumes additional quota with every retry attempt. It also pushes reliability logic into every service that touches an LLM, which fragments the implementation and creates inconsistent behavior across teams.
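A minimal sketch of the pattern (the helper is illustrative and written against a requests-style response object; real implementations also need timeout handling and a retry budget shared across callers):

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry request_fn on retryable statuses with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        response = request_fn()
        if response.status_code not in RETRYABLE or attempt == max_retries:
            return response  # success, non-retryable error, or retries exhausted
        # Full jitter: sleep a random amount up to the capped exponential delay.
        time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Usage: response = call_with_backoff(lambda: requests.post(url, json=payload))
```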
The second approach is multi-provider fallback in code. Engineers write try-except blocks that switch from a primary provider to a backup when the primary fails. This handles outages but multiplies the maintenance burden: every SDK has different error shapes, different authentication patterns, and different request and response formats. Keeping fallback chains current across half a dozen services becomes a full-time problem.
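A sketch of the pattern using the official OpenAI and Anthropic Python SDKs (error handling is simplified; production code would also catch connection errors and log which branch served the request):

```python
import anthropic
import openai

openai_client = openai.OpenAI()           # reads OPENAI_API_KEY
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def chat(prompt: str) -> str:
    try:
        resp = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    except (openai.RateLimitError, openai.APIStatusError):
        # Second SDK, second auth scheme, second error taxonomy, second
        # response shape: this divergence is the maintenance burden.
        resp = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
```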
The third approach is gateway-level routing. An AI gateway centralizes failover, load balancing, retry logic, and caching at the infrastructure layer, so application code stays simple and provider changes do not require redeploys. This is the pattern that solves both rate limits and outages without leaking complexity into application code, and it scales linearly as the number of providers, models, and consumers grows.
How Bifrost Solves LLM Rate Limits and Outages
Bifrost is an open-source AI gateway built specifically to handle the failure modes above. It unifies access to 20+ LLM providers through a single OpenAI-compatible API, with built-in failover, load balancing, semantic caching, and governance. In sustained 5,000 RPS benchmarks, Bifrost adds only 11 microseconds of overhead per request, so the reliability layer does not introduce latency that an application would otherwise feel.
Bifrost addresses LLM rate limits and outages through five complementary mechanisms:
- Automatic failover: Bifrost's automatic fallbacks detect 429, 500, 502, 503, and 504 responses and route the request to the next provider in the fallback chain. The client sees a successful response with an extra_fields.provider value indicating which provider actually handled it.
- Provider-level retries: Before triggering a fallback, Bifrost retries the same provider on retryable status codes, configured at the provider level. This absorbs transient blips without the cost of switching providers.
- Load balancing across keys and providers: Bifrost distributes traffic across multiple API keys and providers using weighted strategies. Spreading load across keys keeps any single key well below the per-key rate limit ceiling.
- Semantic caching: Bifrost's semantic caching returns cached responses for queries that are semantically similar to prior requests, not just identical ones (a conceptual sketch follows this list). Cache hits do not count against provider rate limits at all.
- Governance and observability: Virtual keys enforce per-consumer rate limits and budgets inside the gateway, and built-in observability surfaces fallback trigger rates, success rate by provider position, and latency per provider. Engineers can see exactly which provider is serving traffic at any moment.
Together, these mechanisms turn provider failures into routing decisions that the application never has to handle directly.
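To make the semantic caching mechanism concrete: the idea is an embedding-similarity lookup rather than an exact-match key. This is a conceptual sketch of that lookup, not Bifrost's implementation; the threshold and the cache structure are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(cache, query_embedding, threshold=0.95):
    """cache: list of (prompt_embedding, cached_response) pairs."""
    if not cache:
        return None
    best_embedding, best_response = max(
        cache, key=lambda entry: cosine(entry[0], query_embedding)
    )
    if cosine(best_embedding, query_embedding) >= threshold:
        return best_response  # hit: no provider call, no rate limit spend
    return None               # miss: forward the request to the provider
```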
Implementing Failover and Load Balancing with Bifrost
Configuring Bifrost to handle LLM rate limits and outages requires no application code changes beyond the base URL. Bifrost is a drop-in replacement for the OpenAI, Anthropic, AWS Bedrock, Google GenAI, LiteLLM, and LangChain SDKs.
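In practice, the migration is a one-line change. Here is a sketch using the official OpenAI Python SDK, assuming the local Bifrost endpoint shown in the curl example below (whether the gateway requires an API key depends on your Bifrost configuration):

```python
from openai import OpenAI

# Same SDK, same call sites; only the base URL changes.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # route through Bifrost
    api_key="sk-placeholder",             # provider keys live in the gateway config
)

resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this transcript"}],
)
print(resp.choices[0].message.content)
```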
A request with a fallback chain looks like this:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Summarize this transcript"}],
    "fallbacks": [
      "anthropic/claude-3-5-sonnet-20241022",
      "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
    ]
  }'
Bifrost attempts OpenAI first. If OpenAI returns a 429 (rate limit) or any retryable error after configured retries are exhausted, the request routes to Anthropic. If Anthropic also fails, AWS Bedrock takes the request. The first successful response is returned to the client in OpenAI-compatible format, with the actual provider exposed in extra_fields.provider.
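To confirm which provider served a given call, read the field from the response body. A sketch using plain requests, assuming the response shape described above:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": "Summarize this transcript"}],
        "fallbacks": ["anthropic/claude-3-5-sonnet-20241022"],
    },
)
body = resp.json()
# Differs from the model prefix whenever a fallback actually fired.
print(body["extra_fields"]["provider"])
```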
When fallback chains do not need to be specified per request, Bifrost can derive them from virtual key configuration instead. A virtual key with multiple provider_configs builds the fallback list automatically, ordered by weight:
{
"provider_configs": [
{"provider": "openai", "weight": 0.8},
{"provider": "anthropic", "weight": 0.2}
]
}
Requests routed through this virtual key automatically fall back from OpenAI to Anthropic without any payload change. Each fallback attempt is treated as a fresh request: semantic cache lookups run again, plugins execute again, and governance rules apply again. This gives consistent behavior regardless of which provider ultimately serves the request.
Layered on top of this are governance controls through virtual keys. Platform teams scope rate limits per consumer at the gateway, so a single misbehaving service cannot consume the entire organization's provider quota. Granular rate limits protect upstream providers from cascading throttle events, even before failover comes into play. For deeper enterprise patterns, the Bifrost governance resource covers virtual key hierarchies, budget enforcement, and per-team access control.
Real-World Benefits of Gateway-Level Reliability
Routing AI traffic through a gateway built for failover delivers measurable outcomes that application-level retry logic cannot match.
- Higher effective uptime: Multi-provider failover targets 99.99% availability for AI applications by treating any single provider's status page as one input among many.
- Reduced rate limit incidents: Load balancing across keys and providers smooths traffic so that individual quotas are not breached during spikes.
- Lower cost per call under stress: Semantic caching absorbs repeat traffic during high-load periods, and cached responses cost nothing in tokens.
- Faster incident response: Fallback rate, success rate by provider position, and latency per provider are first-class metrics in the gateway, exported via Prometheus and OTLP.
- No application code changes during provider issues: When OpenAI is degraded, platform teams adjust the fallback configuration once and every downstream service inherits the new routing behavior immediately.
For teams evaluating gateway options, the LLM Gateway Buyer's Guide details the capability matrix that distinguishes production-grade gateways from simple proxies, including failover depth, governance granularity, and observability coverage.
Start Building with Bifrost
LLM rate limits and outages will not get less frequent as AI workloads scale. Provider quotas tighten, model launches create capacity crunches, and infrastructure incidents are operational certainties rather than anomalies. The teams that stay online are the ones that move reliability out of application code and into a gateway that handles failover, load balancing, caching, and governance as first-class infrastructure. Bifrost does this at production scale with the performance profile of a thin proxy and the feature set of a full enterprise control plane.
To see how Bifrost can absorb LLM rate limits and outages for your AI infrastructure, book a demo with the Bifrost team.