Retries, Fallbacks, and Circuit Breakers in LLM Apps: A Production Guide

TL;DR: Building production LLM applications requires robust failure handling strategies. Retries automatically recover from transient errors like rate limits and network timeouts. Fallbacks seamlessly switch between providers when the primary option fails. Circuit breakers prevent cascading failures by temporarily blocking requests to unresponsive services. Together, these patterns form the foundation of reliable AI systems at scale. Bifrost implements all three as first-class gateway features, handling complexity so your application code stays clean.

Why Reliability Matters in LLM Applications

LLM applications face unique reliability challenges. Provider APIs experience outages, rate limiting, and variable latency. A single dependency failure can cascade through your entire system, degrading user experience or bringing your application offline completely.

In production systems handling thousands of requests per second, the difference between proper error handling and none at all determines whether you maintain 99.9% uptime or face frequent outages. AI reliability requires defensive programming at the infrastructure layer.

According to AWS's architectural guidance, distributed systems must account for network unreliability, latency variance, and partial failures. These challenges intensify when working with external LLM providers where you control neither the infrastructure nor the service guarantees.

Understanding Retries

What Are Retries?

Retries automatically re-attempt failed requests after transient errors. Not every failure indicates a permanent problem. Network blips, temporary overload, and brief service disruptions often resolve within milliseconds or seconds.

When to Retry

Certain HTTP status codes signal transient failures worth retrying:

  • 429 (Too Many Requests): Provider rate limiting that typically resolves after a backoff period
  • 500 (Internal Server Error): Temporary server issues
  • 502 (Bad Gateway): Proxy or load balancer problems
  • 503 (Service Unavailable): Temporary capacity or maintenance issues
  • 504 (Gateway Timeout): The upstream service did not respond within the gateway's timeout

Non-retryable errors include:

  • 400 (Bad Request): Malformed request syntax
  • 401 (Unauthorized): Invalid authentication
  • 403 (Forbidden): Insufficient permissions
  • 404 (Not Found): Invalid endpoint or resource
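
As a concrete illustration, the classification above can be expressed as a small helper. This is a minimal sketch in Python; the function name and the exact code sets simply mirror the lists above and are not tied to any particular library.

# Retryable vs. non-retryable HTTP status codes, per the lists above.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404}

def is_retryable(status_code: int) -> bool:
    """Return True if a failed request with this status code is worth retrying."""
    return status_code in RETRYABLE_STATUS_CODES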

How Bifrost Handles Retries

Bifrost implements intelligent retry logic at the gateway level. When a provider returns a retryable status code (500, 502, 503, 504, 429), Bifrost automatically retries the same provider before attempting fallbacks.

Key features:

  • Configurable retry count per provider
  • Exponential backoff with jitter
  • Respects rate limit headers when available
  • No application code changes required

This approach keeps retry complexity out of your application logic. Your code makes a single request; Bifrost handles the retry strategy internally.
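
If you were to implement the same strategy in application code rather than at the gateway, it might look roughly like the sketch below: a retry loop with exponential backoff and full jitter that honors a Retry-After header when present. The use of the requests library, the retry budget, and the delay constants are illustrative assumptions, not a description of Bifrost's internals.

import random
import time

import requests  # assumed HTTP client for illustration

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def post_with_retries(url: str, payload: dict, max_retries: int = 3,
                      base_delay: float = 0.5, max_delay: float = 8.0) -> requests.Response:
    """POST with exponential backoff and full jitter on retryable status codes."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.post(url, json=payload, timeout=30)
        except requests.RequestException:
            response = None  # network error: treat as retryable

        if response is not None and response.status_code < 400:
            return response
        if response is not None and response.status_code not in RETRYABLE_STATUS_CODES:
            response.raise_for_status()  # non-retryable: surface immediately
        if attempt == max_retries:
            break

        # Respect the provider's Retry-After header when present; otherwise
        # back off exponentially with full jitter. (Retry-After may also be an
        # HTTP date; this sketch assumes a value in seconds.)
        retry_after = response.headers.get("Retry-After") if response is not None else None
        if retry_after is not None:
            delay = float(retry_after)
        else:
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
        time.sleep(delay)

    raise RuntimeError(f"Request to {url} failed after {max_retries} retries")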

Implementing Fallbacks

What Are Fallbacks?

Fallbacks provide alternative execution paths when primary options fail. In LLM applications, this typically means switching between providers or models to maintain availability.

Provider Fallback Chains

Define ordered lists of providers to try sequentially:

Primary: OpenAI GPT-4
Fallback 1: Anthropic Claude
Fallback 2: Google Gemini
Fallback 3: Azure OpenAI

Each provider in the chain is attempted in turn until one succeeds or all options are exhausted.
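
A simplified, gateway-agnostic version of this behavior could look like the following sketch. The call_provider callables and provider names are placeholders for whatever client code you already have.

from typing import Callable, List

def complete_with_fallbacks(prompt: str,
                            providers: List[Callable[[str], str]]) -> str:
    """Try each provider in order; return the first successful completion."""
    errors = []
    for call_provider in providers:
        try:
            return call_provider(prompt)
        except Exception as exc:  # in practice, catch provider-specific error types
            errors.append(exc)
    raise RuntimeError(f"All {len(providers)} providers failed: {errors}")

# Ordered preference mirroring the chain above (placeholder callables):
# providers = [call_openai_gpt4, call_anthropic_claude, call_google_gemini, call_azure_openai]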

Model-Level Fallbacks

Fallback strategies can target specific models within the same provider:

Primary: GPT-4 Turbo
Fallback 1: GPT-4
Fallback 2: GPT-3.5 Turbo

This approach works well when you need specific capabilities but can degrade to less capable models during outages.
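
Model-level degradation follows the same pattern, just iterating over model names against a single provider. The client interface below is a placeholder, not a specific SDK.

def complete_with_model_fallbacks(client, prompt: str, models: list[str]) -> str:
    """Try progressively less capable models on the same provider."""
    for model in models:  # e.g. ["gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"]
        try:
            return client.complete(model=model, prompt=prompt)  # placeholder client API
        except Exception:
            continue
    raise RuntimeError("All candidate models failed")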

How Bifrost Manages Fallbacks

Bifrost implements automatic failover between providers with zero application downtime. The gateway maintains ordered fallback lists and attempts each provider sequentially after primary failures.

Critical implementation details:

  • Fallbacks trigger after retry exhaustion: If a provider fails after all retries, Bifrost moves to the next fallback
  • Plugin re-execution: Each fallback attempt runs through the complete plugin chain (caching, governance, logging)
  • Failure isolation: A failure response from Provider A does not affect fallback attempts on Provider B
  • Selective blocking: Plugins can prevent fallbacks via the AllowFallbacks field when a fallback cannot help (e.g., authentication failures, malformed requests)

Unlike simple round-robin load balancing, Bifrost's fallback system maintains ordered preferences while ensuring every request attempts all available options before failing.
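
The selective-blocking idea can be illustrated generically: classify the failure before moving down the chain, and skip fallbacks for errors no other provider can fix. The ProviderError type and the helper below are hypothetical, not Bifrost's plugin API.

class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status code."""
    def __init__(self, status_code: int, message: str = ""):
        super().__init__(message)
        self.status_code = status_code

NON_RECOVERABLE_STATUS_CODES = {400, 401, 403}  # no other provider will fix these

def should_attempt_fallback(error: ProviderError) -> bool:
    """Client-side errors (bad request, bad credentials) won't be solved by switching providers."""
    return error.status_code not in NON_RECOVERABLE_STATUS_CODES

def complete_with_selective_fallbacks(prompt, providers):
    """Like a plain fallback chain, but stop early when fallbacks cannot help."""
    last_error = None
    for call_provider in providers:
        try:
            return call_provider(prompt)
        except ProviderError as exc:
            if not should_attempt_fallback(exc):
                raise  # surface immediately; more attempts would only waste quota
            last_error = exc
    raise RuntimeError(f"All providers failed; last error: {last_error}")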

Circuit Breakers for Resilience

What Is a Circuit Breaker?

Circuit breakers prevent applications from repeatedly calling failing services. The pattern originates from electrical systems where breakers trip to prevent damage from overcurrent.

In distributed systems, circuit breakers monitor service health and block requests when failure rates exceed thresholds. This prevents resource exhaustion and gives failing services time to recover.

Circuit Breakers vs Retries

These patterns serve different purposes:

Pattern            Purpose                                   When It Activates
Retries            Recover from transient failures           An individual request fails
Circuit Breaker    Prevent cascading failures                Failure rate exceeds a threshold
Fallbacks          Maintain availability via alternatives    All retries are exhausted

Microsoft Azure's guidance emphasizes that circuit breakers complement rather than replace retry patterns. Retries handle temporary failures. Circuit breakers protect system stability during prolonged outages.

Implementing Circuit Breakers

While Bifrost doesn't currently expose circuit breakers as a configurable feature, the combination of retries and fallbacks achieves similar protective effects. When a provider consistently fails, fallback mechanisms route traffic away from the unhealthy service.

For teams requiring explicit circuit breaker implementations, libraries like Resilience4j (Java), Polly (C#), and PyBreaker (Python) integrate well with LLM gateway architectures.
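
For example, wrapping a provider call with PyBreaker might look roughly like this. The fail_max and reset_timeout values are arbitrary, and call_openai is a placeholder for your actual request code.

import pybreaker

# Open the circuit after 5 consecutive failures; allow a test request after 30 seconds.
openai_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

@openai_breaker
def call_openai(prompt: str) -> str:
    ...  # placeholder: issue the real provider request here

def safe_call(prompt: str) -> str:
    try:
        return call_openai(prompt)
    except pybreaker.CircuitBreakerError:
        # Circuit is open: fail fast, or route to a fallback provider instead.
        raise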

Circuit breaker configuration typically includes:

  • Failure threshold: Number or percentage of failures before opening
  • Timeout period: How long circuit stays open before testing recovery
  • Success threshold: Number of successful tests required to close circuit
  • Rolling window: Time period for calculating failure rates
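
To make those knobs concrete, here is a minimal, single-threaded circuit breaker sketch implementing a failure threshold, an open-state timeout, and a success threshold for closing; a rolling failure-rate window is omitted for brevity.

import time

class CircuitOpenError(RuntimeError):
    """Raised when a request is rejected because the circuit is open."""

class SimpleCircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, open_timeout: float = 30.0,
                 success_threshold: int = 2):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.open_timeout = open_timeout            # seconds to stay open before testing recovery
        self.success_threshold = success_threshold  # successful tests required to close
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.open_timeout:
                raise CircuitOpenError("failing fast: circuit is open")
            self.state = "HALF_OPEN"  # timeout elapsed: let a test request through
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()
            self.failures = 0

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
        self.failures = 0  # any success resets the consecutive-failure count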

How These Patterns Work Together

Combining retries, fallbacks, and circuit breakers creates layered resilience:

  1. Request arrives at Bifrost gateway
  2. Retry logic handles transient failures on primary provider (429, 500, 502, 503, 504)
  3. Fallback chain activates once all retries are exhausted
  4. Circuit breakers (if implemented) prevent continued attempts to consistently failing services
  5. Observability tracks which pattern activated and why

[Flow Diagram: Complete Failure Handling Flow]

Request Received
      │
      ▼
Primary Provider
      │
      ├─ Success → Return Response
      │
      └─ Failure (Retryable Error)
            │
            ▼
      Retry with Backoff
            │
            ├─ Success → Return Response
            │
            └─ All Retries Failed
                  │
                  ▼
          Fallback Provider 1
                  │
                  ├─ Success → Return Response
                  │
                  └─ Failure
                        │
                        ▼
                  Fallback Provider 2
                        │
                        ├─ Success → Return Response
                        │
                        └─ All Fallbacks Failed
                              │
                              ▼
                      Return Error

This layered approach ensures maximum availability while preventing resource waste on hopeless operations.
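
Putting the pieces together, the flow above can be approximated in application code roughly as follows, reusing the earlier sketches (SimpleCircuitBreaker and CircuitOpenError). Everything here, including the breaker registry and provider tuples, is illustrative rather than a description of how Bifrost composes these steps internally.

import time

def resilient_completion(prompt: str, providers, breakers, max_retries: int = 3) -> str:
    """Fallback chain in which each provider gets its own retries and circuit breaker."""
    for name, call_provider in providers:         # ordered preference, e.g. [("openai", fn), ...]
        breaker = breakers[name]                  # e.g. one SimpleCircuitBreaker per provider
        for attempt in range(max_retries + 1):
            try:
                return breaker.call(call_provider, prompt)
            except CircuitOpenError:
                break                             # circuit open: move straight to the next provider
            except Exception:
                if attempt == max_retries:
                    break                         # retries exhausted: move down the chain
                time.sleep(0.5 * (2 ** attempt))  # exponential backoff (jitter omitted for brevity)
    raise RuntimeError("All providers failed after retries and fallbacks")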

Monitoring and Observability

Production resilience requires comprehensive monitoring. Track metrics across all three patterns:

Retry Metrics

  • Retry attempts per request
  • Retry success rate by provider
  • Time spent retrying vs. total request latency

Fallback Metrics

  • Fallback activation rate
  • Which providers in chain handle most requests
  • Quality differences between primary and fallback responses

Circuit Breaker Metrics (if implemented)

  • Circuit state transitions
  • Time in each state
  • Success rate of test requests while the circuit is half-open
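
If you instrument these metrics yourself, for example with the Python prometheus_client library, the declarations might look along these lines; the metric names and labels are illustrative choices, not a standard.

from prometheus_client import Counter, Histogram

RETRY_ATTEMPTS = Counter(
    "llm_retry_attempts_total", "Retry attempts", ["provider", "status_code"])
FALLBACK_ACTIVATIONS = Counter(
    "llm_fallback_activations_total", "Requests served by a fallback provider", ["provider"])
CIRCUIT_STATE_CHANGES = Counter(
    "llm_circuit_state_transitions_total", "Circuit breaker state transitions", ["provider", "to_state"])
REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency including retries", ["provider"])

# Example usage at the call sites:
# RETRY_ATTEMPTS.labels(provider="openai", status_code="429").inc()
# with REQUEST_LATENCY.labels(provider="openai").time():
#     response = post_with_retries(url, payload)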

Bifrost provides first-class observability with native Prometheus metrics, distributed tracing, and comprehensive logging. This visibility enables you to understand exactly which resilience patterns activated and why.

For deeper insights into LLM application behavior, Maxim's observability platform tracks production logs, runs periodic quality checks, and provides real-time alerts on quality degradation.

Building reliable LLM applications requires thinking defensively about failure at every layer. Retries, fallbacks, and circuit breakers work together to maintain availability while preventing resource waste. Bifrost handles this complexity at the gateway level, letting you focus on building great AI experiences rather than managing infrastructure resilience.

Ready to eliminate LLM reliability headaches? Try Bifrost or book a demo to see how Maxim's complete platform accelerates AI development with built-in quality and observability.