From LiteLLM to a Gateway with Native Semantic Caching: A Migration Guide
A step-by-step migration guide from LiteLLM to Bifrost, the AI gateway with native semantic caching, dual-layer hit matching, and zero application code changes.
Teams running LiteLLM in production often hit the same wall: response caching works in isolation, but wiring semantic caching into the proxy introduces Redis module dependencies, embedding configuration drift, and operational overhead that the proxy was supposed to absorb in the first place. This migration guide walks through moving from LiteLLM to Bifrost, the open-source AI gateway built by Maxim AI with native semantic caching, dual-layer hit matching, and drop-in SDK compatibility. The goal is to keep application code unchanged while replacing the caching and routing layer with infrastructure that performs better at scale.
Why Teams Are Migrating Away from LiteLLM
LiteLLM popularized the unified LLM API pattern, but production teams increasingly push against its architectural limits. The Python-based proxy adds measurable latency under load, and configuring semantic caching requires a specific Redis deployment (Redis Stack with the RediSearch module) that many cloud-managed Redis offerings do not ship by default. A recurring reported error, "redis semantic caching requires redis-py client >= 4.2.0 and the redisearch module to be loaded on redis-server," surfaces in Windows and managed Redis environments where the search module is not installed.
Beyond the semantic cache setup friction, LiteLLM's proxy architecture concentrates failure surface in a single Python process. LiteLLM's own high-availability documentation notes that production deployments handling 1000+ requests per second can experience PostgreSQL connection exhaustion and database deadlocks when running 10+ instances simultaneously updating the same user, team, or key records. Teams that scale past a single instance begin writing custom Redis transaction buffers, running side-car caches, and monitoring deadlock patterns, all work that should be handled by the gateway itself.
The migration rationale usually comes down to three factors:
- Performance overhead: Python proxies cannot match the throughput of a native Go gateway at the same infrastructure cost.
- Caching setup complexity: Redis Stack with RediSearch is a specific deployment requirement, not a commodity.
- Single-process blast radius: Scaling the proxy horizontally introduces coordination problems that require additional infrastructure.
What Bifrost Changes About the Caching Layer
Bifrost is built in Go and designed to operate as middleware between application code and 15+ LLM providers, with caching, routing, and governance as first-class capabilities rather than bolt-on plugins. The semantic caching plugin ships with dual-layer matching: exact hash deduplication for identical requests and vector similarity search for semantically similar ones. Both layers can be active at the same time, or direct hash mode can run standalone when embedding calls are undesirable.
The key operational differences compared to a LiteLLM deployment:
- Choice of vector store: Bifrost's semantic cache works with Weaviate, Redis/Valkey-compatible endpoints, Qdrant, and Pinecone. Teams are not locked into Redis Stack.
- Direct hash mode: A fully embedding-free exact-match mode is available by setting dimension: 1 with no embedding provider. This gives LiteLLM-style exact caching without any vector compute, and Redis/Valkey is the recommended backend.
- Per-request cache control: Cache behavior (type, TTL, threshold, no-store) is controlled per request via HTTP headers like x-bf-cache-key, x-bf-cache-ttl, and x-bf-cache-threshold, which keeps application-layer logic flexible without requiring redeploys.
- Cache debug metadata: Every cached response includes cache_debug fields with cache_hit, hit_type, similarity, and cache_id, making it straightforward to audit hit rates and clear specific entries.
Bifrost also publishes independent performance benchmarks showing approximately 11 microseconds of gateway overhead at 5,000 requests per second, which keeps caching decisions out of the latency budget for production workloads.
Pre-Migration Checklist
Before starting the migration, inventory the LiteLLM features currently in use. The migration path differs depending on whether you rely on LiteLLM's SDK, its proxy server, or both. Bifrost supports three compatibility modes that cover most LiteLLM deployments:
- Drop-in SDK replacement: Keep using the OpenAI, Anthropic, or LiteLLM SDK. Change only the base URL to point to Bifrost. See the drop-in replacement docs for the one-line change per SDK.
- LiteLLM SDK passthrough: Bifrost provides a dedicated LiteLLM SDK integration so existing litellm.completion() calls continue to work unchanged (see the sketch below).
- LiteLLM compatibility mode: For text completion requests against chat-only models, Bifrost's LiteLLM compatibility layer performs text-to-chat conversion transparently, preserving the choices[0].text response shape that legacy code expects.
Confirm which of these applies to your codebase. In most cases, migration touches configuration only, not application code.
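For the SDK passthrough path, the change typically amounts to pointing the existing LiteLLM call at the gateway. The following is a minimal sketch, assuming Bifrost is running locally on port 8080 and using litellm's standard api_base and api_key parameters; confirm the exact wiring against Bifrost's LiteLLM integration docs.

import litellm

# Existing litellm.completion() call, now routed through the Bifrost gateway.
# The api_base value is an assumption for a local Bifrost instance; Bifrost
# handles provider authentication, so the api_key can be a placeholder.
response = litellm.completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    api_base="http://localhost:8080/v1",
    api_key="any-value",
)
print(response.choices[0].message.content)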
Step-by-Step Migration from LiteLLM to Bifrost
Step 1: Deploy Bifrost Alongside LiteLLM
Run Bifrost in parallel with the existing LiteLLM proxy. This enables side-by-side traffic comparison without cutover risk.
# Run Bifrost with zero configuration
npx -y @maximhq/bifrost
# Or with Docker
docker run -p 8080:8080 maximhq/bifrost
Bifrost starts with a web UI at http://localhost:8080 for provider configuration. Add the same provider API keys that LiteLLM is using. Full setup details are covered in the gateway setup guide.
Step 2: Configure a Vector Store for Semantic Caching
Bifrost's semantic cache requires a vector store. For teams migrating from LiteLLM with an existing Redis deployment, Redis or Valkey is often the simplest path. The recommended Redis configuration:
{
"vector_store": {
"enabled": true,
"type": "redis",
"config": {
"addr": "localhost:6379"
}
}
}
For teams that want to avoid embedding API calls entirely (replicating LiteLLM's exact-match caching behavior), direct hash mode is the equivalent:
{
"plugins": [
{
"enabled": true,
"name": "semantic_cache",
"config": {
"dimension": 1,
"ttl": "5m",
"cleanup_on_shutdown": true,
"cache_by_model": true,
"cache_by_provider": true
}
}
]
}
Omitting the provider field triggers direct-only mode with zero embedding overhead. This is the lowest-latency, lowest-cost configuration.
Step 3: Enable Semantic Caching with Embeddings
For teams that want true semantic matching (cache hits on paraphrased queries, not just identical strings), configure the plugin with an embedding provider. The configuration mirrors LiteLLM's redis-semantic cache type but runs natively in the gateway:
{
"plugins": [
{
"enabled": true,
"name": "semantic_cache",
"config": {
"provider": "openai",
"embedding_model": "text-embedding-3-small",
"dimension": 1536,
"ttl": "5m",
"threshold": 0.8,
"conversation_history_threshold": 3,
"cache_by_model": true,
"cache_by_provider": true
}
}
]
}
The threshold parameter (0.8 by default) controls similarity strictness. This is directly analogous to LiteLLM's similarity_threshold, but Bifrost additionally supports per-request threshold overrides through the x-bf-cache-threshold header, which is useful for isolating high-precision endpoints (such as billing or compliance queries) from general-purpose ones.
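As an illustration of a per-request override, the sketch below sends a stricter threshold for a high-precision query using the Python requests library. The endpoint path mirrors the curl example in Step 4, and the header names are the ones documented above; the specific values are assumptions for a local Bifrost instance.

import requests

# Hedged sketch: a stricter similarity threshold for a precision-sensitive
# (e.g., billing) request. x-bf-cache-key is required for caching at all
# (see Step 4); x-bf-cache-threshold overrides the configured 0.8 default.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={
        "x-bf-cache-key": "billing-session-42",
        "x-bf-cache-threshold": "0.95",
    },
    json={
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": "What was my last invoice total?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])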
Step 4: Add the Cache Key on the Application Side
Bifrost's semantic cache only activates when a cache key is supplied in the request. This is a deliberate design choice that prevents unintended caching of sensitive or session-specific content. Update the application to include the header:
curl -H "x-bf-cache-key: session-123" \\
-H "Content-Type: application/json" \\
-d '{"model": "openai/gpt-4o-mini", "messages": [...]}' \\
<http://localhost:8080/v1/chat/completions>
For SDK-based applications, set the header via the existing request options. This is the only application-side change typically required.
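With the OpenAI Python SDK, for example, the header can be attached through the standard extra_headers request option. A minimal sketch, assuming Bifrost is already configured as the base URL (as shown in the next step):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any-value")

# extra_headers is a built-in OpenAI SDK request option; the cache key
# value here is illustrative.
response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the refund policy?"}],
    extra_headers={"x-bf-cache-key": "session-123"},
)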
Step 5: Point Application Traffic at Bifrost
Change the base URL in the SDK configuration from the LiteLLM proxy endpoint to the Bifrost endpoint. For teams using the OpenAI SDK, this is a single line:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # Bifrost gateway
    api_key="any-value"                   # Bifrost handles provider auth
)
Run a percentage of production traffic through Bifrost first to validate cache hit rates, latency, and error handling before a full cutover. Bifrost's built-in telemetry exposes Prometheus metrics and OpenTelemetry traces, so existing Grafana or Datadog dashboards can ingest Bifrost data directly.
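One simple way to split traffic without extra infrastructure is a client-side canary. The sketch below is a hypothetical example that routes a configurable fraction of requests to Bifrost and the rest to the existing LiteLLM proxy; the LiteLLM URL, port, and API key handling are assumptions for a default local deployment, and the model name must be valid on both gateways.

import random
from openai import OpenAI

BIFROST_URL = "http://localhost:8080/v1"  # Bifrost gateway
LITELLM_URL = "http://localhost:4000"     # existing LiteLLM proxy (assumed default port)
CANARY_FRACTION = 0.10                    # start by sending 10% of traffic to Bifrost

def get_client() -> OpenAI:
    # Pick the gateway per request so both systems see comparable traffic.
    # Note: the LiteLLM proxy may require its own master key instead of a placeholder.
    base_url = BIFROST_URL if random.random() < CANARY_FRACTION else LITELLM_URL
    return OpenAI(base_url=base_url, api_key="any-value")

response = get_client().chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)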
Step 6: Validate Cache Hits and Tune Thresholds
Every cached response from Bifrost returns extra_fields.cache_debug metadata with hit type, similarity score, and cache ID. Use this to audit the migration:
{
"extra_fields": {
"cache_debug": {
"cache_hit": true,
"hit_type": "semantic",
"similarity": 0.95,
"threshold": 0.8
}
}
}
Monitor hit rates per endpoint and adjust the threshold and conversation_history_threshold settings accordingly. A threshold between 0.85 and 0.92 is typical for customer-facing applications where precision matters more than recall.
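A small audit script can aggregate these fields across a sample of representative prompts. The sketch below uses the requests library against a local Bifrost endpoint and the cache_debug shape shown above; the exact hit_type string reported for hash hits is an assumption here.

import requests
from collections import Counter

BIFROST_URL = "http://localhost:8080/v1/chat/completions"

def cache_outcome(prompt: str, cache_key: str) -> str:
    # Returns "miss" or the hit_type reported by Bifrost (e.g., "semantic").
    body = {
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = requests.post(BIFROST_URL, headers={"x-bf-cache-key": cache_key}, json=body).json()
    debug = resp.get("extra_fields", {}).get("cache_debug", {})
    if not debug.get("cache_hit"):
        return "miss"
    return debug.get("hit_type", "unknown")

# The second prompt is a paraphrase of the first and should land as a
# semantic hit once the first response has been cached.
outcomes = Counter(
    cache_outcome(p, "audit-run")
    for p in ["What is our refund policy?", "Tell me about the refund policy."]
)
print(outcomes)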
What Else You Get Beyond the Cache
Migration to Bifrost unlocks capabilities beyond caching that were either missing or required separate tools in a LiteLLM deployment. Automatic failover across providers activates when a primary provider returns errors or times out, with zero application code changes. Virtual keys provide per-team budgets, rate limits, and access control as a first-class governance layer. And the MCP gateway centralizes tool connections, OAuth, and authorization for agentic workloads, which is the next phase of infrastructure that LiteLLM does not natively address.
Teams specifically evaluating alternatives can reference Bifrost as a drop-in LiteLLM alternative for a full feature comparison, and the LLM Gateway Buyer's Guide covers evaluation criteria beyond caching. External industry data tracks the broader context as well: LLM API spending roughly doubled from $3.5 billion to $8.4 billion between late 2024 and mid-2025, with 72% of organizations planning further AI budget increases in 2026, which makes gateway-level cost optimization (caching, routing, governance) a strategic investment rather than a micro-optimization.
Common Migration Pitfalls to Avoid
A few patterns come up repeatedly when teams migrate gateways:
- Forgetting the cache key header: Bifrost intentionally does not cache requests without x-bf-cache-key. If hit rates are unexpectedly zero, check that the header is being forwarded by the application.
- Mismatched cache dimensions: If the embedding model or dimension changes after initial deployment, existing entries in the vector store become unreadable. Either set cleanup_on_shutdown: true or use a new vector_store_namespace when changing dimensions.
- Overly loose thresholds: Setting the similarity threshold too low (below 0.75) produces false-positive cache hits on semantically related but materially different queries. Start at 0.8 and tighten based on observed traffic.
- Mixing direct and semantic modes: The x-bf-cache-type header only has an effect when the plugin is initialized in dual-layer mode (with an embedding provider). In direct-only mode, all requests use hash matching regardless of the header.
Migrate to Bifrost Today
Migration from LiteLLM to Bifrost is primarily a configuration change. Application code stays put, the cache layer becomes dual-layer and vector-store-flexible, and the gateway adds failover, governance, and MCP infrastructure that LiteLLM does not provide. The migration from LiteLLM guide covers additional edge cases, and the Bifrost GitHub repo has a one-command install path for a local evaluation.
To see how Bifrost's native semantic caching and LiteLLM compatibility apply to your production workload, book a demo with the Bifrost team.