Your Primary LLM Provider Failed? Enable Automatic Fallback with Bifrost
When building applications that depend on external LLM APIs, provider failures are inevitable. Network timeouts, rate limits, model unavailability, and service outages all cause request failures that break application functionality. This post explains how to implement automatic fallback mechanisms using Bifrost to maintain service availability when primary providers fail.
Failure Scenarios in LLM Applications
LLM API calls can fail for several reasons:
- Rate limiting: HTTP 429 responses when quota limits are exceeded
- Network errors: Timeouts, DNS failures, connection refused
- Provider outages: HTTP 500/502/503/504 server errors
- Model unavailability: Specific models offline for maintenance
- Authentication issues: Invalid API keys or expired tokens
- Regional restrictions: Geo-blocking or service unavailability
Applications typically handle these failures by showing error messages to users or failing silently. Both approaches degrade user experience and system reliability.
Automatic Fallback Architecture
Bifrost implements automatic failover by maintaining an ordered list of provider configurations. When a request fails, the system attempts each fallback provider sequentially until one succeeds or all providers are exhausted.
Fallback Processing Flow
1. Primary request: execute the request against the primary provider
2. Retry logic: if the request fails with a retryable status code (500, 502, 503, 504, 429), retry the same provider
3. Fallback execution: after all retries are exhausted, attempt the next provider in the fallback list
4. Plugin re-execution: run all configured plugins for each fallback attempt
5. Response handling: return the successful response, or the original error if all providers fail
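Bifrost runs this loop server-side whenever a request carries a fallbacks array. The Python sketch below is only a client-side illustration of the same sequencing (the function name, retry count, and URL default are invented for the example, not Bifrost internals): retries stay within one provider, while fallbacks move down the ordered list.

import requests

RETRYABLE = {429, 500, 502, 503, 504}  # status codes worth retrying on the same provider

def complete_with_fallbacks(payload, providers, retries=2,
                            url="http://localhost:8080/v1/chat/completions"):
    """Illustrative failover loop: try each provider in order, retrying
    transient errors before moving on to the next fallback."""
    last_response = None
    for model in providers:                          # primary first, then fallbacks
        for _ in range(retries + 1):
            last_response = requests.post(url, json={**payload, "model": model})
            if last_response.ok:
                return last_response.json()          # first success wins
            if last_response.status_code not in RETRYABLE:
                break                                # non-retryable: skip straight to the next provider
    raise RuntimeError(f"All providers failed (last status {last_response.status_code})")

# Example usage:
# complete_with_fallbacks({"messages": [{"role": "user", "content": "Hello"}]},
#                         ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet-20241022"])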
Request Configuration
Fallbacks are configured by adding a fallbacks array to the request payload:
# Chat completion with multiple fallbacks
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Explain quantum computing in simple terms"
}
],
"fallbacks": [
"anthropic/claude-3-5-sonnet-20241022",
"bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
],
"max_tokens": 1000,
"temperature": 0.7
}'
The system attempts providers in this order:
1. openai/gpt-4o-mini (primary)
2. anthropic/claude-3-5-sonnet-20241022 (first fallback)
3. bedrock/anthropic.claude-3-sonnet-20240229-v1:0 (second fallback)
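Because the endpoint is OpenAI-compatible, the same request can also be sent with the official openai Python SDK by pointing base_url at the gateway and passing the fallbacks array through extra_body. This is a sketch on the assumption that the gateway accepts the extra body field exactly as in the curl example above; the api_key value is a placeholder to adjust to your deployment.

from openai import OpenAI

# Point the official OpenAI client at the Bifrost gateway
client = OpenAI(base_url="http://localhost:8080/v1", api_key="placeholder")

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    max_tokens=1000,
    temperature=0.7,
    # extra_body merges additional fields into the request JSON,
    # which is how the fallbacks array reaches the gateway
    extra_body={
        "fallbacks": [
            "anthropic/claude-3-5-sonnet-20241022",
            "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        ]
    },
)
print(response.choices[0].message.content)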
Response Format
Responses maintain standard OpenAI API compatibility regardless of which provider handled the request:
{
"id": "chatcmpl-123",
"object": "chat.completion",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Quantum computing is like having a super-powered calculator..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 150,
"total_tokens": 162
},
"extra_fields": {
"provider": "anthropic"
}
}
The extra_fields.provider field indicates which provider actually processed the request, enabling monitoring and analytics. Latency metrics, when available, are reported in milliseconds.
Failure Classification
The system uses two separate mechanisms for handling failures: retries and fallbacks.
Retries (Provider-Level)
Retries are configured at each provider level and occur before attempting fallbacks. The system allows retries for these specific status codes:
- HTTP 500: Internal Server Error
- HTTP 502: Bad Gateway
- HTTP 503: Service Unavailable
- HTTP 504: Gateway Timeout
- HTTP 429: Too Many Requests
When a request fails with a retryable status code, the system will retry the same provider multiple times before moving to fallbacks.
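The retry count and behavior are configured per provider in Bifrost; the helper below is only an illustrative sketch of the classification, with exponential backoff added as one common choice (not necessarily what Bifrost does between attempts). The helper name and defaults are invented for the example.

import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def call_with_retries(send_request, max_retries=3, base_delay=0.5):
    """Retry the same provider on transient errors; return non-retryable
    failures immediately so the caller can move on to a fallback."""
    response = send_request()
    for attempt in range(max_retries):
        if response.ok or response.status_code not in RETRYABLE_STATUS:
            break
        time.sleep(base_delay * (2 ** attempt))   # back off before the next attempt
        response = send_request()
    return response

# e.g. call_with_retries(lambda: requests.post(url, json=payload))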
Fallbacks (Cross-Provider)
Fallbacks are attempted after all retries for a provider have been exhausted. Unlike retries, fallbacks are allowed for any failure type: a failure status from provider A says nothing about whether provider B can serve the request.
However, plugins can prevent fallback execution using the AllowFallbacks field on BifrostError. For example, an authentication plugin can block all fallbacks and return the error immediately if there's a fundamental auth issue that would affect all providers.
Plugin Execution Behavior
Each fallback attempt is treated as a new request, triggering complete plugin re-execution:
- Semantic caching: Cache lookups run against each provider's cache
- Governance rules: Rate limits and content policies apply per provider
- Logging: Each attempt generates separate log entries
- Monitoring: Metrics track attempts per provider
This ensures consistent behavior regardless of which provider handles the final request.
Plugin Fallback Control
Plugins can prevent fallback execution by setting the AllowFallbacks field on BifrostError. This provides fine-grained control over when fallbacks should be attempted:
class CustomPlugin:
def process_request(self, request, context):
# Example: Block fallbacks for content policy violations
if self.detect_content_policy_violation(request):
return BifrostError(
message="Content policy violation detected",
allow_fallbacks=False # This prevents all fallback attempts
)
return None # Allow normal processing and fallbacks
When a plugin sets AllowFallbacks=False, the system immediately returns the original error without attempting any fallbacks, even if they are configured.
For more details on plugin fallback control, see the Bifrost documentation.
Implementation Examples
Basic Configuration
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4",
"fallbacks": ["anthropic/claude-3-sonnet"],
"messages": [{"role": "user", "content": "Hello"}]
}'
Multi-tier Fallbacks
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4",
"fallbacks": [
"anthropic/claude-3-sonnet",
"google/gemini-pro",
"bedrock/anthropic.claude-v2"
],
"messages": [{"role": "user", "content": "Hello"}]
}'
Cost-optimized Fallbacks
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4",
"fallbacks": [
"openai/gpt-3.5-turbo",
"anthropic/claude-instant"
],
"messages": [{"role": "user", "content": "Hello"}]
}'
Monitoring and Observability
Tracking Provider Usage
Monitor which providers handle requests using the provider field in responses:
import requests

response = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
data = response.json()
# extra_fields.provider names the provider that actually served the request
actual_provider = data["extra_fields"]["provider"]
print(f"Request handled by: {actual_provider}")
Fallback Metrics
Key metrics to track (see the sketch after this list):
- Fallback trigger rate per provider
- Success rate by provider position
- Average latency per provider (when available)
- Cost per provider
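The fallback trigger rate and per-provider success rate can be approximated client-side from the response metadata alone, since extra_fields.provider names the provider that ultimately answered. Below is a minimal tallying sketch; the helper and counter names are illustrative, not part of Bifrost.

from collections import Counter

import requests

provider_counts = Counter()

def traced_completion(payload, url="http://localhost:8080/v1/chat/completions"):
    """Send a request through the gateway and record which provider served it."""
    data = requests.post(url, json=payload).json()
    provider_counts[data["extra_fields"]["provider"]] += 1
    return data

# After a batch of requests, any tally outside the primary provider
# indicates that fallbacks were triggered, e.g.:
# Counter({'openai': 180, 'anthropic': 20})  -> roughly a 10% fallback trigger rate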
Limitations
- Increased latency when fallbacks are triggered
- Higher complexity in request tracing and debugging
- Potential cost increases from using multiple providers
- Model response variations between providers may affect application behavior
Conclusion
Automatic fallbacks provide a systematic approach to handling LLM provider failures without requiring application code changes. By configuring multiple providers and letting the system handle failover logic, applications can maintain availability during provider outages, rate limiting, and other failure scenarios.
The key is proper failure classification, comprehensive monitoring, and thoughtful provider selection to balance reliability, cost, and performance requirements.