Your Primary LLM Provider Failed? Enable Automatic Fallback with Bifrost

When building applications that depend on external LLM APIs, provider failures are inevitable. Network timeouts, rate limits, model unavailability, and service outages all cause request failures that break application functionality. This post explains how to implement automatic fallback mechanisms using Bifrost to maintain service availability when primary providers fail.

Failure Scenarios in LLM Applications

LLM API calls can fail for several reasons:

  • Rate limiting: HTTP 429 responses when quota limits are exceeded
  • Network errors: Timeouts, DNS failures, connection refused
  • Provider outages: HTTP 500/502/503/504 server errors
  • Model unavailability: Specific models offline for maintenance
  • Authentication issues: Invalid API keys or expired tokens
  • Regional restrictions: Geo-blocking or service unavailability

Applications typically handle these failures by showing error messages to users or failing silently. Both approaches degrade user experience and system reliability.
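
For contrast, here is a minimal sketch of the hand-rolled, single-provider error handling this post replaces (illustrative only; the error policy shown is generic, not Bifrost-specific):

import os
import requests

def call_openai(payload):
    # Naive approach: every failure mode is handled by hand, and there
    # is no second provider to fall back to when this one fails.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=payload,
        timeout=30,
    )
    if resp.status_code == 429:
        raise RuntimeError("rate limited: show an error? retry? give up?")
    if resp.status_code >= 500:
        raise RuntimeError("provider outage: no fallback available")
    resp.raise_for_status()
    return resp.json()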

Automatic Fallback Architecture

Bifrost implements automatic failover by maintaining an ordered list of provider configurations. When a request fails, the system attempts each fallback provider sequentially until one succeeds or all providers are exhausted.

Fallback Processing Flow

  1. Primary Request: Execute the request against the primary provider
  2. Retry Logic: If the request fails with a retryable status code (500, 502, 503, 504, 429), retry the same provider
  3. Fallback Execution: After all retries are exhausted, attempt the next provider in the fallback list
  4. Plugin Re-execution: Run all configured plugins for each fallback attempt
  5. Response Handling: Return the successful response, or the original error if all providers fail (sketched below)
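
In pseudocode, the whole flow looks roughly like this. This is a simplified Python sketch of the behavior described above, not Bifrost's actual Go internals; run_plugins and call_provider are hypothetical stand-ins:

import time

# Status codes treated as retryable on the same provider (see below).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def execute(request, providers, max_retries=2):
    first_error = None
    for provider in providers:  # primary first, then fallbacks in order
        # Plugins (caching, governance, logging) re-run for every attempt.
        plugin_error = run_plugins(request, provider)  # hypothetical hook
        if plugin_error is not None:
            if not plugin_error.allow_fallbacks:
                return plugin_error  # plugin vetoed all fallbacks
            first_error = first_error or plugin_error
            continue
        error = None
        for attempt in range(max_retries + 1):
            response, error = call_provider(provider, request)  # hypothetical
            if error is None:
                return response  # first success wins
            if error.status_code not in RETRYABLE_STATUS:
                break  # not retryable here; move on to the next provider
            time.sleep(2 ** attempt)  # simple backoff between retries
        first_error = first_error or error
    return first_error  # all providers exhausted: the original error is returned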

Request Configuration

Fallbacks are configured by adding a fallbacks array to the request payload:

# Chat completion with multiple fallbacks
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "fallbacks": [
      "anthropic/claude-3-5-sonnet-20241022",
      "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'

The system attempts providers in this order:

  1. openai/gpt-4o-mini (primary)
  2. anthropic/claude-3-5-sonnet-20241022 (first fallback)
  3. bedrock/anthropic.claude-3-sonnet-20240229-v1:0 (second fallback)
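
The same request from Python, mirroring the curl payload above (requests sets the Content-Type: application/json header automatically):

import requests

payload = {
    "model": "openai/gpt-4o-mini",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "fallbacks": [
        "anthropic/claude-3-5-sonnet-20241022",
        "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    ],
    "max_tokens": 1000,
    "temperature": 0.7,
}

response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])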

Response Format

Responses maintain standard OpenAI API compatibility regardless of which provider handled the request:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing is like having a super-powered calculator..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 150,
    "total_tokens": 162
  },
  "extra_fields": {
    "provider": "anthropic"
  }
}

The extra_fields.provider field indicates which provider actually processed the request, enabling monitoring and analytics. Latency metrics, when available, are reported in milliseconds.

Failure Classification

The system uses two separate mechanisms for handling failures: retries and fallbacks.

Retries (Provider-Level)

Retries are configured at each provider level and occur before attempting fallbacks. The system allows retries for these specific status codes:

  • HTTP 500: Internal Server Error
  • HTTP 502: Bad Gateway
  • HTTP 503: Service Unavailable
  • HTTP 504: Gateway Timeout
  • HTTP 429: Too Many Requests

When a request fails with a retryable status code, the system will retry the same provider multiple times before moving to fallbacks.
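
As a quick reference, retry eligibility reduces to a simple predicate (illustrative; actual retry counts and backoff are part of Bifrost's per-provider configuration):

def is_retryable(status_code: int) -> bool:
    # 5xx server errors and 429 rate limits are retried on the same
    # provider; anything else moves straight to the fallback list.
    return status_code in {429, 500, 502, 503, 504}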

Fallbacks (Cross-Provider)

Fallbacks are attempted after all retries for a provider have been exhausted. Unlike retries, fallbacks are allowed for any failure type because response status from provider A should not impact fallback attempts on provider B.

However, plugins can prevent fallback execution using the AllowFallbacks field on BifrostError. For example, an authentication plugin can block all fallbacks and return the error immediately if there's a fundamental auth issue that would affect all providers.

Plugin Execution Behavior

Each fallback attempt is treated as a new request, triggering complete plugin re-execution:

  • Semantic caching: Cache lookups run against each provider's cache
  • Governance rules: Rate limits and content policies apply per provider
  • Logging: Each attempt generates separate log entries
  • Monitoring: Metrics track attempts per provider

This ensures consistent behavior regardless of which provider handles the final request.

Plugin Fallback Control

As noted above, plugins can prevent fallback execution by setting the AllowFallbacks field on BifrostError, giving fine-grained control over when fallbacks are attempted:

class CustomPlugin:
    def process_request(self, request, context):
        # Example: Block fallbacks for content policy violations
        if self.detect_content_policy_violation(request):
            return BifrostError(
                message="Content policy violation detected",
                allow_fallbacks=False  # This prevents all fallback attempts
            )
        return None  # Allow normal processing and fallbacks

When a plugin sets AllowFallbacks=False, the system immediately returns the original error without attempting any fallbacks, even if they are configured.

For more details on plugin fallback control, see the Bifrost documentation.

Implementation Examples

Basic Configuration

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4",
    "fallbacks": ["anthropic/claude-3-sonnet"],
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Multi-tier Fallbacks

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4",
    "fallbacks": [
      "anthropic/claude-3-sonnet",
      "google/gemini-pro",
      "bedrock/anthropic.claude-v2"
    ],
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Cost-optimized Fallbacks

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4",
    "fallbacks": [
      "openai/gpt-3.5-turbo",
      "anthropic/claude-instant"
    ],
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Monitoring and Observability

Tracking Provider Usage

Monitor which provider handled each request using the extra_fields.provider field in responses:

import requests

response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
data = response.json()
actual_provider = data["extra_fields"]["provider"]  # provider that actually answered
print(f"Request handled by: {actual_provider}")

Fallback Metrics

Key metrics to track (a client-side tallying sketch follows the list):

  • Fallback trigger rate per provider
  • Success rate by provider position
  • Average latency per provider (when available)
  • Cost per provider
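
A minimal client-side sketch for the first of these, tallying extra_fields.provider across responses (a hypothetical helper; in practice Bifrost's logging and monitoring plugins track attempts per provider):

from collections import Counter

import requests

def provider_usage(payloads):
    # Count which provider ultimately served each request; a growing
    # share of fallback providers signals trouble with the primary.
    counts = Counter()
    for payload in payloads:
        resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
        counts[resp.json()["extra_fields"]["provider"]] += 1
    return counts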

Limitations

  • Increased latency when fallbacks are triggered
  • Higher complexity in request tracing and debugging
  • Potential cost increases from using multiple providers
  • Model response variations between providers may affect application behavior

Conclusion

Automatic fallbacks provide a systematic approach to handling LLM provider failures without requiring application code changes. By configuring multiple providers and letting the system handle failover logic, applications can maintain availability during provider outages, rate limiting, and other failure scenarios.

The key is proper failure classification, comprehensive monitoring, and thoughtful provider selection to balance reliability, cost, and performance requirements.