Building Better AI Applications with Bifrost: A Complete Technical Guide for AI Engineers
Introduction
Bifrost by Maxim AI is an open-source, high-performance LLM gateway built in Go that helps AI engineers build production-ready AI applications. This technical guide walks you through Bifrost's architecture, features, and practical implementation patterns to help you leverage its full capabilities.
What is Bifrost?
Bifrost is an HTTP API gateway that provides a unified interface for multiple AI providers (OpenAI, Anthropic, AWS Bedrock, Google Gemini, and more) while adding enterprise-grade features like governance, observability, caching, and failover capabilities—all with negligible impact on request latency.
Getting Started: Installation in 30 Seconds
NPX Binary (Quickest Method)
# Install and run locally
npx -y @maximhq/bifrost
Docker
# Pull and run
docker pull maximhq/bifrost
docker run -p 8080:8080 maximhq/bifrost
# With data persistence
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost
Once running, access the web UI at http://localhost:8080 to configure providers and view request logs.
Core Features for AI Engineers
1. Drop-in SDK Replacement
Bifrost acts as a transparent proxy for popular AI SDKs, requiring only a single-line change to your existing code:
OpenAI SDK Integration:
# Before: Direct to OpenAI
client = openai.OpenAI(
api_key="your-openai-key"
)
# After: Through Bifrost
client = openai.OpenAI(
base_url="http://localhost:8080/openai",
api_key="dummy-key" # Keys managed by Bifrost
)
Anthropic SDK Integration:
# Before: Direct to Anthropic
client = anthropic.Anthropic(
api_key="your-anthropic-key"
)
# After: Through Bifrost
client = anthropic.Anthropic(
base_url="http://localhost:8080/anthropic",
api_key="dummy-key"
)
This integration pattern instantly unlocks advanced features like automatic failovers, load balancing, semantic caching, and governance—without modifying application logic.
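As a quick illustration, here is a minimal sketch of a request once the client points at Bifrost; the call itself is the standard OpenAI SDK chat completion API, and the model name is only an example.
# Minimal sketch: application code is unchanged after pointing the SDK at Bifrost
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/openai",  # route traffic through Bifrost
    api_key="dummy-key"                       # provider keys are managed by Bifrost
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what an LLM gateway does."}],
)
print(response.choices[0].message.content)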
2. Intelligent Load Balancing
Bifrost provides sophisticated key management with weighted distribution and automatic failover:
Configure Multiple Keys with Weights:
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "openai",
"keys": [
{
"name": "openai-key-1",
"value": "env.OPENAI_API_KEY_1",
"models": ["gpt-4o", "gpt-4o-mini"],
"weight": 0.7
},
{
"name": "openai-key-2",
"value": "env.OPENAI_API_KEY_2",
"models": [],
"weight": 0.3
}
]
}'
How It Works:
- Keys with higher weights receive proportionally more traffic (70% vs 30% in this example; see the sketch after this list)
- An empty models array means the key supports all models for that provider
- Automatic failover to the next available key if one fails
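To build intuition for the weighted distribution, here is a conceptual Python sketch of weighted key selection (not Bifrost's actual Go implementation): with weights 0.7 and 0.3, roughly 70% of requests land on the first key.
# Conceptual sketch of weighted key selection; Bifrost's internal logic may differ
import random
from collections import Counter

keys = [
    {"name": "openai-key-1", "weight": 0.7},
    {"name": "openai-key-2", "weight": 0.3},
]

def pick_key(keys):
    # random.choices honors relative weights, so key-1 is chosen ~70% of the time
    return random.choices(keys, weights=[k["weight"] for k in keys], k=1)[0]

# Simulate 10,000 requests to see the split approach 70/30
counts = Counter(pick_key(keys)["name"] for _ in range(10_000))
print(counts)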
3. Enterprise Governance with Virtual Keys
Virtual Keys provide granular access control, budget management, and rate limiting:
Create Virtual Key with Budget and Rate Limits:
curl -X POST http://localhost:8080/api/governance/virtual-keys \
-H "Content-Type: application/json" \
-d '{
"name": "Engineering Team API",
"provider_configs": [
{
"provider": "openai",
"weight": 0.5,
"allowed_models": ["gpt-4o-mini"],
"budget": {
"max_limit": 500.00,
"reset_duration": "1M"
},
"rate_limit": {
"token_max_limit": 1000000,
"token_reset_duration": "1h",
"request_max_limit": 1000,
"request_reset_duration": "1h"
}
},
{
"provider": "anthropic",
"weight": 0.5,
"allowed_models": ["claude-3-sonnet-20240229"]
}
],
"budget": {
"max_limit": 1000.00,
"reset_duration": "1M"
},
"is_active": true
}'
Using Virtual Keys in Requests:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-vk: <VIRTUAL_KEY>" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello!"}]
}'
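If you are calling Bifrost through the OpenAI SDK rather than curl, the same x-bf-vk header can be attached as a default header. The sketch below assumes the drop-in setup shown earlier and uses a placeholder virtual key value.
# Sketch: attaching a virtual key to every SDK request via a default header
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/openai",
    api_key="dummy-key",                          # real provider keys live in Bifrost
    default_headers={"x-bf-vk": "<VIRTUAL_KEY>"}  # placeholder virtual key
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)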
Key Capabilities:
- Hierarchical budgets: Customer → Team → Virtual Key → Provider Config
- Rate limiting: Token and request-based throttling
- Model/provider restrictions: Enforce which models teams can access
- Cost tracking: Real-time monitoring of spending per team/customer
4. Provider-Level Routing and Failover
Route requests intelligently across providers with automatic failover:
Weighted Multi-Provider Configuration:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"allowed_models": ["gpt-4o", "gpt-4o-mini"],
"weight": 0.2
},
{
"provider": "azure",
"allowed_models": ["gpt-4o"],
"weight": 0.8
}
]
}'
Load Balancing Behavior:
- For gpt-4o: 80% Azure, 20% OpenAI (both providers support it)
- For gpt-4o-mini: 100% OpenAI (the only configured provider that supports it)
- An automatic fallback chain is created if the primary provider fails
Bypass Load Balancing (Target Specific Provider):
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-vk: vk-prod-main" \
-d '{"model": "openai/gpt-4o", "messages": [...]}'
5. Semantic Caching for Cost Reduction
Reduce API costs and latency with intelligent semantic caching using vector similarity:
Configure Semantic Cache:
{
"plugins": [
{
"enabled": true,
"name": "semantic_cache",
"config": {
"provider": "openai",
"embedding_model": "text-embedding-3-small",
"ttl": "5m",
"threshold": 0.8,
"conversation_history_threshold": 3,
"cache_by_model": true,
"cache_by_provider": true
}
}
]
}
Trigger Cache in Requests:
# Python SDK example
ctx = {"x-bf-cache-key": "session-123"}
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What is Python?"}],
extra_headers=ctx
)
Key Features:
- Dual-layer caching: Exact hash matching + semantic similarity search
- Configurable threshold: Match similar queries (default 0.8 similarity)
- Dynamic TTL: Override cache duration per request
- Cost tracking: Cache hits include metadata in response.extra_fields.cache_debug (see the sketch after this list)
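The snippet below sketches how you might inspect that metadata with a raw HTTP call; the exact shape of extra_fields.cache_debug may vary by version, so treat the field access as illustrative.
# Sketch: inspecting semantic-cache metadata on the raw response
import requests

body = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is Python?"}],
}
headers = {"Content-Type": "application/json", "x-bf-cache-key": "session-123"}

resp = requests.post("http://localhost:8080/v1/chat/completions", headers=headers, json=body).json()

# Cache hits are reported under extra_fields.cache_debug
cache_debug = resp.get("extra_fields", {}).get("cache_debug")
print("cache hit" if cache_debug else "cache miss", cache_debug)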
6. Comprehensive Observability
Track every AI request with built-in observability or integrate with external platforms:
Built-in Observability (SQLite/PostgreSQL):
{
"client": {
"enable_logging": true,
"disable_content_logging": false
},
"logs_store": {
"enabled": true,
"type": "sqlite",
"config": {
"path": "./logs.db"
}
}
}
Query Logs via API:
curl -G 'http://localhost:8080/api/logs' \
-d 'providers=openai,anthropic' \
-d 'models=gpt-4o-mini' \
-d 'status=success' \
-d 'start_time=2024-01-15T00:00:00Z' \
-d 'limit=100'
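For programmatic analysis, the same endpoint can be queried from Python; the response envelope and field names depend on the Bifrost version, so the sketch below only fetches and prints the filtered page.
# Sketch: pulling filtered logs from the Bifrost logs API for offline analysis
import requests

params = {
    "providers": "openai,anthropic",
    "models": "gpt-4o-mini",
    "status": "success",
    "start_time": "2024-01-15T00:00:00Z",
    "limit": 100,
}
resp = requests.get("http://localhost:8080/api/logs", params=params)
resp.raise_for_status()

data = resp.json()  # exact envelope and field names depend on the Bifrost version
print(data)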
Captured Data:
- Complete request/response content
- Token usage and costs
- Latency metrics
- Provider and model information
- Error details with status codes
OpenTelemetry Integration:
{
"plugins": [
{
"enabled": true,
"name": "otel",
"config": {
"service_name": "bifrost",
"collector_url": "<http://localhost:4318>",
"trace_type": "genai_extension",
"protocol": "http",
"headers": {
"Authorization": "env.OTEL_API_KEY"
}
}
}
]
}
7. Prometheus Metrics and Telemetry
Monitor performance with comprehensive Prometheus metrics:
Key Metrics Available:
- bifrost_upstream_requests_total: Total requests to providers
- bifrost_success_requests_total: Successful requests
- bifrost_error_requests_total: Failed requests with reason labels
- bifrost_upstream_latency_seconds: Provider latency histogram
- bifrost_input_tokens_total / bifrost_output_tokens_total: Token usage
- bifrost_cost_total: Real-time cost tracking in USD
- bifrost_cache_hits_total: Cache performance by type
Configure Custom Labels:
{
"client": {
"prometheus_labels": ["team", "environment", "organization", "project"]
}
}
Dynamic Label Injection:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "x-bf-prom-team: engineering" \
-H "x-bf-prom-environment: production" \
-d '{...}'
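From the SDK, the same per-request labels can be supplied as extra headers; a sketch assuming the drop-in OpenAI client configured earlier:
# Sketch: injecting Prometheus labels per request through x-bf-prom-* headers
import openai

client = openai.OpenAI(base_url="http://localhost:8080/openai", api_key="dummy-key")

# x-bf-prom-* header values become label values on Bifrost's Prometheus metrics
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "x-bf-prom-team": "engineering",
        "x-bf-prom-environment": "production",
    },
)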
Example Prometheus Queries:
# Success rate by provider
rate(bifrost_success_requests_total[5m]) /
rate(bifrost_upstream_requests_total[5m]) * 100
# Daily cost estimate
sum by (provider) (increase(bifrost_cost_total[1d]))
# Cache hit rate
rate(bifrost_cache_hits_total[5m]) /
rate(bifrost_upstream_requests_total[5m]) * 100
8. MCP (Model Context Protocol) Tool Integration
Enable AI models to interact with external tools and systems:
Configure MCP Client:
{
"mcp": {
"clients": [
{
"name": "filesystem",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/files"],
"env": {}
}
]
}
}
MCP Tool Filtering per Virtual Key:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
-H "Content-Type: application/json" \
-d '{
"mcp_configs": [
{
"mcp_client_name": "filesystem",
"tools_to_execute": ["read_file", "list_directory"]
}
]
}'
9. LiteLLM Compatibility Mode
Convert text completion requests to chat format automatically for models that only support chat APIs:
Enable LiteLLM Compatibility:
{
"client_config": {
"enable_litellm_fallbacks": true
}
}
How It Works:
- Checks if model supports text completion natively
- If not supported, converts text prompt to chat message format
- Calls chat completion endpoint internally
- Transforms response back to text completion format
- Returns content in choices[0].text instead of choices[0].message.content (see the example below)
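From the client's perspective, a legacy text-completion call comes back with the content in choices[0].text even though a chat model served it. A sketch assuming the OpenAI SDK pointed at Bifrost as before:
# Sketch: a legacy text-completion request served through the compatibility path
import openai

client = openai.OpenAI(base_url="http://localhost:8080/openai", api_key="dummy-key")

response = client.completions.create(
    model="gpt-4o-mini",                 # chat-style model; Bifrost converts the request
    prompt="Write a haiku about gateways.",
    max_tokens=64,
)
print(response.choices[0].text)          # content returned in text-completion shape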
Advanced Features
Custom Plugins
Bifrost supports custom plugins for extending functionality:
Mocker Plugin (Testing):
plugin, err := mocker.NewMockerPlugin(mocker.MockerConfig{
Enabled: true,
Rules: []mocker.MockRule{
{
Name: "openai-mock",
Probability: 1.0,
Conditions: mocker.Conditions{
Providers: []string{"openai"},
},
Responses: []mocker.Response{
{
Type: mocker.ResponseTypeSuccess,
Content: &mocker.SuccessResponse{
Message: "Mock response for testing",
},
},
},
},
},
})
JSON Parser Plugin (Streaming):
jsonPlugin := jsonparser.NewJsonParserPlugin(jsonparser.PluginConfig{
Usage: jsonparser.AllRequests,
CleanupInterval: 2 * time.Minute,
MaxAge: 10 * time.Minute,
})
Fixes partial JSON chunks in streaming responses by adding missing closing characters to make them valid JSON.
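To make the idea concrete, here is a conceptual Python sketch of closing a truncated JSON chunk; Bifrost's actual Go plugin is more robust, so treat this as an illustration of the technique only.
# Conceptual sketch: append missing closing characters to a truncated JSON chunk
import json

def close_partial_json(chunk: str) -> str:
    stack = []          # closing characters we still owe
    in_string = False
    escaped = False
    for ch in chunk:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    closing = '"' if in_string else ""   # close an unterminated string first
    return chunk + closing + "".join(reversed(stack))

partial = '{"answer": {"items": ["a", "b"'
print(json.loads(close_partial_json(partial)))  # {'answer': {'items': ['a', 'b']}}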
Production Best Practices
1. Configuration Management
Use Environment Variables for Secrets:
{
"providers": {
"openai": {
"keys": [{
"value": "env.OPENAI_API_KEY"
}]
}
}
}
2. Enable Governance for Production
{
"client": {
"enable_governance": true,
"enforce_governance_header": true
}
}
3. Configure Observability
PostgreSQL for Production Logs:
{
"logs_store": {
"enabled": true,
"type": "postgres",
"config": {
"host": "localhost",
"port": "5432",
"user": "bifrost",
"password": "postgres",
"db_name": "bifrost",
"ssl_mode": "disable"
}
}
}
4. Set Up Prometheus Monitoring
scrape_configs:
- job_name: "bifrost-gateway"
static_configs:
- targets: ["bifrost-instance-1:8080"]
scrape_interval: 30s
metrics_path: /metrics
5. Production Alerting
- alert: BifrostHighErrorRate
expr: sum(rate(bifrost_error_requests_total[5m])) / sum(rate(bifrost_upstream_requests_total[5m])) > 0.05
for: 2m
labels:
severity: warning
Performance Characteristics
- Latency Overhead: < 0.1ms for request processing
- Async Operations: All logging and metrics collection happen asynchronously
- Connection Pooling: Efficient HTTP/2 connection reuse
- Memory Management: Automatic cleanup with configurable intervals
- Streaming Support: Full streaming capability with proper chunk ordering
Conclusion
Bifrost provides AI engineers with a production-ready gateway that handles the complexity of multi-provider AI applications. Key benefits include:
- Zero-code integration with existing SDKs
- Enterprise governance with virtual keys and budgets
- Intelligent routing with automatic failover
- Cost optimization through semantic caching
- Complete observability with multiple integration options
- Production-ready with high performance and reliability
Get started with Bifrost today:
- GitHub: https://github.com/maximhq/bifrost
- Documentation: https://docs.getbifrost.ai
- Enterprise: https://getmaxim.ai/bifrost/enterprise