Building Better AI Applications with Bifrost: A Complete Technical Guide for AI Engineers
Introduction
Bifrost by Maxim AI is an open-source, high-performance LLM gateway built in Go that helps AI engineers build production-ready AI applications. This technical guide walks you through Bifrost's architecture, features, and practical implementation patterns to help you leverage its full capabilities.
What is Bifrost?
Bifrost is an HTTP API gateway that provides a unified interface for multiple AI providers (OpenAI, Anthropic, AWS Bedrock, Google Gemini, and more) while adding enterprise-grade features like governance, observability, caching, and failover capabilities—all with negligible impact on request latency.
Getting Started: Installation in 30 Seconds
NPX Binary (Quickest Method)
# Install and run locally
npx -y @maximhq/bifrost
Docker
# Pull and run
docker pull maximhq/bifrost
docker run -p 8080:8080 maximhq/bifrost
# With data persistence
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost
Once running, access the web UI at http://localhost:8080 to configure providers and view request logs.
Core Features for AI Engineers
1. Drop-in SDK Replacement
Bifrost acts as a transparent proxy for popular AI SDKs, requiring only a single-line change to your existing code:
OpenAI SDK Integration:
# Before: Direct to OpenAI
client = openai.OpenAI(
api_key="your-openai-key"
)
# After: Through Bifrost
client = openai.OpenAI(
base_url="http://localhost:8080/openai",
api_key="dummy-key" # Keys managed by Bifrost
)
Anthropic SDK Integration:
# Before: Direct to Anthropic
client = anthropic.Anthropic(
api_key="your-anthropic-key"
)
# After: Through Bifrost
client = anthropic.Anthropic(
base_url="http://localhost:8080/anthropic",
api_key="dummy-key"
)
This integration pattern instantly unlocks advanced features like automatic failovers, load balancing, semantic caching, and governance—without modifying application logic.
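As a quick illustration, here is a minimal sketch of a request once the client points at Bifrost; the call itself is the standard OpenAI SDK chat completion API, and the model name is only an example.
# Minimal sketch: application code is unchanged after pointing the SDK at Bifrost
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/openai",  # route traffic through Bifrost
    api_key="dummy-key"                       # provider keys are managed by Bifrost
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what an LLM gateway does."}],
)
print(response.choices[0].message.content)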
2. Intelligent Load Balancing
Bifrost provides sophisticated key management with weighted distribution and automatic failover:
Configure Multiple Keys with Weights:
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "openai",
"keys": [
{
"name": "openai-key-1",
"value": "env.OPENAI_API_KEY_1",
"models": ["gpt-4o", "gpt-4o-mini"],
"weight": 0.7
},
{
"name": "openai-key-2",
"value": "env.OPENAI_API_KEY_2",
"models": [],
"weight": 0.3
}
]
}'
How It Works:
- Keys with higher weights receive proportionally more traffic (70% vs 30% in this example; see the sketch after this list)
- An empty models array means the key supports all models for that provider
- Automatic failover to the next available key if one fails
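To build intuition for the weighted distribution, here is a conceptual Python sketch of weighted key selection (not Bifrost's actual Go implementation): with weights 0.7 and 0.3, roughly 70% of requests land on the first key.
# Conceptual sketch of weighted key selection; Bifrost's internal logic may differ
import random
from collections import Counter

keys = [
    {"name": "openai-key-1", "weight": 0.7},
    {"name": "openai-key-2", "weight": 0.3},
]

def pick_key(keys):
    # random.choices honors relative weights, so key-1 is chosen ~70% of the time
    return random.choices(keys, weights=[k["weight"] for k in keys], k=1)[0]

# Simulate 10,000 requests to see the split approach 70/30
counts = Counter(pick_key(keys)["name"] for _ in range(10_000))
print(counts)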
3. Enterprise Governance with Virtual Keys
Virtual Keys provide granular access control, budget management, and rate limiting:
Create Virtual Key with Budget and Rate Limits:
curl -X POST http://localhost:8080/api/governance/virtual-keys \
-H "Content-Type: application/json" \
-d '{
"name": "Engineering Team API",
"provider_configs": [
{
"provider": "openai",
"weight": 0.5,
"allowed_models": ["gpt-4o-mini"],
"budget": {
"max_limit": 500.00,
"reset_duration": "1M"
},
"rate_limit": {
"token_max_limit": 1000000,
"token_reset_duration": "1h",
"request_max_limit": 1000,
"request_reset_duration": "1h"
}
},
{
"provider": "anthropic",
"weight": 0.5,
"allowed_models": ["claude-3-sonnet-20240229"]
}
],
"budget": {
"max_limit": 1000.00,
"reset_duration": "1M"
},
"is_active": true
}'
Using Virtual Keys in Requests:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-vk: <VIRTUAL_KEY>" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello!"}]
}'
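If you are calling Bifrost through the OpenAI SDK rather than curl, the same x-bf-vk header can be attached as a default header. The sketch below assumes the drop-in setup shown earlier and uses a placeholder virtual key value.
# Sketch: attaching a virtual key to every SDK request via a default header
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/openai",
    api_key="dummy-key",                          # real provider keys live in Bifrost
    default_headers={"x-bf-vk": "<VIRTUAL_KEY>"}  # placeholder virtual key
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)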
Key Capabilities:
- Hierarchical budgets: Customer → Team → Virtual Key → Provider Config
- Rate limiting: Token and request-based throttling
- Model/provider restrictions: Enforce which models teams can access
- Cost tracking: Real-time monitoring of spending per team/customer
4. Provider-Level Routing and Failover
Route requests intelligently across providers with automatic failover:
Weighted Multi-Provider Configuration:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"allowed_models": ["gpt-4o", "gpt-4o-mini"],
"weight": 0.2
},
{
"provider": "azure",
"allowed_models": ["gpt-4o"],
"weight": 0.8
}
]
}'
Load Balancing Behavior:
- For gpt-4o: 80% Azure, 20% OpenAI (both providers support it)
- For gpt-4o-mini: 100% OpenAI (the only configured provider that supports it)
- An automatic fallback chain is created if the primary provider fails
Bypass Load Balancing (Target Specific Provider):
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-vk: vk-prod-main" \
-d '{"model": "openai/gpt-4o", "messages": [...]}'
5. Semantic Caching for Cost Reduction
Reduce API costs and latency with intelligent semantic caching using vector similarity:
Configure Semantic Cache:
{
"plugins": [
{
"enabled": true,
"name": "semantic_cache",
"config": {
"provider": "openai",
"embedding_model": "text-embedding-3-small",
"ttl": "5m",
"threshold": 0.8,
"conversation_history_threshold": 3,
"cache_by_model": true,
"cache_by_provider": true
}
}
]
}
Trigger Cache in Requests:
# Python SDK example
ctx = {"x-bf-cache-key": "session-123"}
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What is Python?"}],
extra_headers=ctx
)
Key Features:
- Dual-layer caching: Exact hash matching + semantic similarity search
- Configurable threshold: Match similar queries (default 0.8 similarity)
- Dynamic TTL: Override cache duration per request
- Cost tracking: Cache hits include metadata in response.extra_fields.cache_debug (see the sketch after this list)
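The snippet below sketches how you might inspect that metadata with a raw HTTP call; the exact shape of extra_fields.cache_debug may vary by version, so treat the field access as illustrative.
# Sketch: inspecting semantic-cache metadata on the raw response
import requests

body = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is Python?"}],
}
headers = {"Content-Type": "application/json", "x-bf-cache-key": "session-123"}

resp = requests.post("http://localhost:8080/v1/chat/completions", headers=headers, json=body).json()

# Cache hits are reported under extra_fields.cache_debug
cache_debug = resp.get("extra_fields", {}).get("cache_debug")
print("cache hit" if cache_debug else "cache miss", cache_debug)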
6. Comprehensive Observability
Track every AI request with built-in observability or integrate with external platforms:
Built-in Observability (SQLite/PostgreSQL):
{
"client": {
"enable_logging": true,
"disable_content_logging": false
},
"logs_store": {
"enabled": true,
"type": "sqlite",
"config": {
"path": "./logs.db"
}
}
}
Query Logs via API:
curl -G 'http://localhost:8080/api/logs' \
-d 'providers=openai,anthropic' \
-d 'models=gpt-4o-mini' \
-d 'status=success' \
-d 'start_time=2024-01-15T00:00:00Z' \
-d 'limit=100'
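For programmatic analysis, the same endpoint can be queried from Python; the response envelope and field names depend on the Bifrost version, so the sketch below only fetches and prints the filtered page.
# Sketch: pulling filtered logs from the Bifrost logs API for offline analysis
import requests

params = {
    "providers": "openai,anthropic",
    "models": "gpt-4o-mini",
    "status": "success",
    "start_time": "2024-01-15T00:00:00Z",
    "limit": 100,
}
resp = requests.get("http://localhost:8080/api/logs", params=params)
resp.raise_for_status()

data = resp.json()  # exact envelope and field names depend on the Bifrost version
print(data)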
Captured Data:
- Complete request/response content
- Token usage and costs
- Latency metrics
- Provider and model information
- Error details with status codes
OpenTelemetry Integration:
{
"plugins": [
{
"enabled": true,
"name": "otel",
"config": {
"service_name": "bifrost",
"collector_url": "<http://localhost:4318>",
"trace_type": "genai_extension",
"protocol": "http",
"headers": {
"Authorization": "env.OTEL_API_KEY"
}
}
}
]
}
7. Prometheus Metrics and Telemetry
Monitor performance with comprehensive Prometheus metrics:
Key Metrics Available:
- bifrost_upstream_requests_total: Total requests to providers
- bifrost_success_requests_total: Successful requests
- bifrost_error_requests_total: Failed requests with reason labels
- bifrost_upstream_latency_seconds: Provider latency histogram
- bifrost_input_tokens_total / bifrost_output_tokens_total: Token usage
- bifrost_cost_total: Real-time cost tracking in USD
- bifrost_cache_hits_total: Cache performance by type
Configure Custom Labels:
{
"client": {
"prometheus_labels": ["team", "environment", "organization", "project"]
}
}
Dynamic Label Injection:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "x-bf-prom-team: engineering" \
-H "x-bf-prom-environment: production" \
-d '{...}'
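From the SDK, the same per-request labels can be supplied as extra headers; a sketch assuming the drop-in OpenAI client configured earlier:
# Sketch: injecting Prometheus labels per request through x-bf-prom-* headers
import openai

client = openai.OpenAI(base_url="http://localhost:8080/openai", api_key="dummy-key")

# x-bf-prom-* header values become label values on Bifrost's Prometheus metrics
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "x-bf-prom-team": "engineering",
        "x-bf-prom-environment": "production",
    },
)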
Example Prometheus Queries:
# Success rate by provider
rate(bifrost_success_requests_total[5m]) /
rate(bifrost_upstream_requests_total[5m]) * 100
# Daily cost estimate
sum by (provider) (increase(bifrost_cost_total[1d]))
# Cache hit rate
rate(bifrost_cache_hits_total[5m]) /
rate(bifrost_upstream_requests_total[5m]) * 100
8. MCP (Model Context Protocol) Tool Integration
Enable AI models to interact with external tools and systems:
Configure MCP Client:
{
"mcp": {
"clients": [
{
"name": "filesystem",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/files"],
"env": {}
}
]
}
}
MCP Tool Filtering per Virtual Key:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
-H "Content-Type: application/json" \
-d '{
"mcp_configs": [
{
"mcp_client_name": "filesystem",
"tools_to_execute": ["read_file", "list_directory"]
}
]
}'
9. LiteLLM Compatibility Mode
Convert text completion requests to chat format automatically for models that only support chat APIs:
Enable LiteLLM Compatibility:
{
"client_config": {
"enable_litellm_fallbacks": true
}
}
How It Works:
- Checks if model supports text completion natively
- If not supported, converts text prompt to chat message format
- Calls chat completion endpoint internally
- Transforms response back to text completion format
- Returns content in choices[0].text instead of choices[0].message.content (see the example below)
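From the client's perspective, a legacy text-completion call comes back with the content in choices[0].text even though a chat model served it. A sketch assuming the OpenAI SDK pointed at Bifrost as before:
# Sketch: a legacy text-completion request served through the compatibility path
import openai

client = openai.OpenAI(base_url="http://localhost:8080/openai", api_key="dummy-key")

response = client.completions.create(
    model="gpt-4o-mini",                 # chat-style model; Bifrost converts the request
    prompt="Write a haiku about gateways.",
    max_tokens=64,
)
print(response.choices[0].text)          # content returned in text-completion shape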
Advanced Features
Custom Plugins
Bifrost supports custom plugins for extending functionality:
Mocker Plugin (Testing):
plugin, err := mocker.NewMockerPlugin(mocker.MockerConfig{
Enabled: true,
Rules: []mocker.MockRule{
{
Name: "openai-mock",
Probability: 1.0,
Conditions: mocker.Conditions{
Providers: []string{"openai"},
},
Responses: []mocker.Response{
{
Type: mocker.ResponseTypeSuccess,
Content: &mocker.SuccessResponse{
Message: "Mock response for testing",
},
},
},
},
},
})
JSON Parser Plugin (Streaming):
jsonPlugin := jsonparser.NewJsonParserPlugin(jsonparser.PluginConfig{
Usage: jsonparser.AllRequests,
CleanupInterval: 2 * time.Minute,
MaxAge: 10 * time.Minute,
})
Fixes partial JSON chunks in streaming responses by adding missing closing characters to make them valid JSON.
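To make the idea concrete, here is a conceptual Python sketch of closing a truncated JSON chunk; Bifrost's actual Go plugin is more robust, so treat this as an illustration of the technique only.
# Conceptual sketch: append missing closing characters to a truncated JSON chunk
import json

def close_partial_json(chunk: str) -> str:
    stack = []          # closing characters we still owe
    in_string = False
    escaped = False
    for ch in chunk:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    closing = '"' if in_string else ""   # close an unterminated string first
    return chunk + closing + "".join(reversed(stack))

partial = '{"answer": {"items": ["a", "b"'
print(json.loads(close_partial_json(partial)))  # {'answer': {'items': ['a', 'b']}}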
Production Best Practices
1. Configuration Management
Use Environment Variables for Secrets:
{
"providers": {
"openai": {
"keys": [{
"value": "env.OPENAI_API_KEY"
}]
}
}
}
2. Enable Governance for Production
{
"client": {
"enable_governance": true,
"enforce_governance_header": true
}
}
3. Configure Observability
PostgreSQL for Production Logs:
{
"logs_store": {
"enabled": true,
"type": "postgres",
"config": {
"host": "localhost",
"port": "5432",
"user": "bifrost",
"password": "postgres",
"db_name": "bifrost",
"ssl_mode": "disable"
}
}
}
4. Set Up Prometheus Monitoring
scrape_configs:
- job_name: "bifrost-gateway"
static_configs:
- targets: ["bifrost-instance-1:8080"]
scrape_interval: 30s
metrics_path: /metrics
5. Production Alerting
- alert: BifrostHighErrorRate
expr: sum(rate(bifrost_error_requests_total[5m])) / sum(rate(bifrost_upstream_requests_total[5m])) > 0.05
for: 2m
labels:
severity: warning
Performance Characteristics
- Latency Overhead: < 0.1ms for request processing
- Async Operations: All logging and metrics collection happen asynchronously
- Connection Pooling: Efficient HTTP/2 connection reuse
- Memory Management: Automatic cleanup with configurable intervals
- Streaming Support: Full streaming capability with proper chunk ordering
Conclusion
Bifrost provides AI engineers with a production-ready gateway that handles the complexity of multi-provider AI applications. Key benefits include:
- Zero-code integration with existing SDKs
- Enterprise governance with virtual keys and budgets
- Intelligent routing with automatic failover
- Cost optimization through semantic caching
- Complete observability with multiple integration options
- Production-ready with high performance and reliability
Get started with Bifrost today:
- GitHub: https://github.com/maximhq/bifrost
- Documentation: https://docs.getbifrost.ai
- Enterprise: https://getmaxim.ai/bifrost/enterprise