Semantic Caching and Dynamic Routing: Cutting Token Consumption and AI Spend
Bifrost implements semantic caching and dynamic routing as two complementary gateway-level mechanisms that reduce LLM costs without changing application code. This guide covers how both mechanisms work and how to apply them to production AI workloads.

LLM API costs at scale break down into two compounding problems: unnecessary token spend from repeated queries and verbose tool calls, and inefficient routing that sends every request to the most expensive frontier model regardless of workload requirements. Both are solvable at the gateway layer, without modifying application code or changing how clients call the API. Bifrost, the open-source AI gateway built in Go by Maxim AI, addresses both through built-in semantic caching and flexible dynamic routing that operate transparently across all connected providers.

Why LLM Costs Escalate at Scale

Per-token pricing compounds quickly once AI workloads move beyond prototype scale. Several patterns drive cost growth:

  • Repeated queries without caching: Support bots, FAQ agents, and document analysis pipelines receive semantically identical questions from different users. Without caching, each variation triggers a full LLM inference call, even when the answer is identical to one already computed.
  • Batch and background workloads on frontier models: Classification, summarization, and extraction jobs routed to GPT-4o or Claude Opus pay frontier pricing for tasks that a smaller, cheaper model can often handle at comparable quality.
  • Per-token pricing across multiple providers: Teams using two or three providers simultaneously multiply cost exposure. A single provider outage that causes retry loops across expensive fallback models can spike costs during an incident window.
  • MCP agentic workloads with high per-tool-call overhead: Classic MCP deployments with 8-10 servers and 150+ tools send the full tool catalog to the model on every turn. At that catalog size, the token cost of the tool definitions alone can exceed the cost of the actual task.

All four patterns are addressable at the AI gateway layer without changing client code. The LLM Gateway Buyer's Guide covers the full cost-reduction capability matrix for teams evaluating gateway options.

How Semantic Caching Reduces Token Spend

Semantic caching is a response caching mechanism that matches incoming requests against stored responses using vector similarity, not exact string matching. Responses are stored in a vector database and retrieved when a new request is semantically equivalent to a previous one, even if the wording differs. A user asking "What is your refund policy?" and another asking "How do I get a refund?" may receive the same cached answer if the similarity between the two requests exceeds the configured threshold. In that case, no LLM inference is triggered for the second request.

This approach beats exact-match caching for LLM workloads because users rephrase questions, reorder words, and vary phrasing while expressing the same intent. Exact-match caches miss these variations entirely and force unnecessary re-inference. Semantic caching with a configurable similarity threshold and TTL extends cache hits to those rephrased variants.

Workloads that benefit most from semantic caching:

  • Customer support bots: High question volume with limited answer diversity. Caching common intents covers a significant share of total requests.
  • FAQ and knowledge base agents: Queries against structured company knowledge repeat with high frequency, especially for onboarding and product documentation.
  • Document analysis pipelines: The same document or document type gets submitted by multiple users asking equivalent analytical questions.
  • Code review agents: Similar patterns (unused imports, naming conventions, security antipatterns) appear across thousands of review requests, making cache hit rates high.

Three parameters govern semantic cache behavior:

  • Similarity threshold: The minimum cosine similarity score required to serve a cached response. Lower values increase hit rates at the cost of accuracy; higher values ensure responses are close matches.
  • TTL: How long a cached entry remains valid. Short TTLs are appropriate for time-sensitive data; longer TTLs suit stable reference content.
  • Cache key: The scope used to segment cache entries (global, per-virtual-key, per-model, or custom).
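How the three parameters interact can be sketched with a toy in-memory cache. This is an illustration only, not Bifrost's implementation: the class and method names are hypothetical, the embeddings are passed in by the caller instead of coming from an embedding provider, and a plain list scan stands in for a vector store.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    embedding: list[float]  # vector for the original request
    response: str           # stored LLM response
    expires_at: float       # absolute expiry time derived from the TTL


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical toy cache; a real deployment would query a vector store."""

    def __init__(self, threshold: float, ttl_seconds: float):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        # Entries are segmented by cache key (for example a virtual key or model).
        self._entries: dict[str, list[CacheEntry]] = {}

    def put(self, cache_key, embedding, response, now=None):
        now = time.time() if now is None else now
        entry = CacheEntry(embedding, response, now + self.ttl_seconds)
        self._entries.setdefault(cache_key, []).append(entry)

    def get(self, cache_key, embedding, now=None):
        """Return the best unexpired response at or above the threshold, else None."""
        now = time.time() if now is None else now
        best_score, best_response = -math.inf, None
        for entry in self._entries.get(cache_key, []):
            if entry.expires_at <= now:
                continue  # TTL elapsed: treat as a miss
            score = cosine_similarity(embedding, entry.embedding)
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, entry.response
        return best_response
```

Lowering `threshold` makes `get` return more hits for loosely related vectors, shortening `ttl_seconds` makes entries drop out sooner, and the `cache_key` argument keeps one scope's entries from being served to another.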

How Dynamic Routing Reduces AI Spend

Dynamic routing directs each LLM request to the model and provider most appropriate for that workload's cost and quality requirements. The routing decision happens at the gateway layer, before the request reaches any provider.

Cost-appropriate routing follows workload characteristics:

  • Batch and background jobs (classification, tagging, extraction, summarization at scale) route to smaller, cheaper models like GPT-4o Mini, Haiku, or Mistral Nemo. These models handle structured extraction tasks at a fraction of frontier pricing.
  • Interactive requests requiring reasoning, code generation, or complex multi-step outputs route to frontier models where quality justifies the cost.
  • Off-peak traffic routes to lower-cost providers or reserved capacity, reducing spend during periods with looser latency requirements.

Weighted routing distributes load across multiple API keys or providers by traffic share, so a team with both OpenAI and Anthropic access can allocate 70% of traffic to the cheaper provider and 30% to the higher-quality one without any change to application code.
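The 70/30 split above amounts to a weighted random choice per request. A minimal sketch, assuming hypothetical provider labels and a function name invented for illustration (this is not Bifrost's configuration format or API):

```python
import random


def pick_provider(weights: dict[str, float], rng: random.Random) -> str:
    """Choose one provider with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical weights: 70% of traffic to the cheaper provider, 30% to the other.
WEIGHTS = {"cheaper-provider": 0.7, "premium-provider": 0.3}
```

Because each request is drawn independently, the observed split converges on the configured weights over many requests; it is not exact for any small window of traffic.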

Routing rules use CEL (Common Expression Language) expressions to make routing decisions at runtime based on request context: model requested, virtual key in use, request type, time of day, or custom headers. This means routing logic lives in the gateway configuration, not scattered across application services.

How Bifrost Implements Semantic Caching

Bifrost's semantic caching uses a vector store backend for both direct (exact-match) and semantic (similarity) cache lookups. The two modes complement each other: direct matching serves deterministic cache hits with sub-millisecond latency; semantic matching catches rephrased queries that would miss exact-match lookup.

The vector store backend supports Redis/Valkey, Weaviate, Qdrant, and Pinecone. Direct-only mode requires no embedding provider: Bifrost normalizes and hashes the request for deterministic lookup. Semantic mode requires an embedding-capable provider to vectorize incoming requests for similarity search.
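The idea behind direct-only lookup, normalizing the request and hashing it into a deterministic key, can be sketched as follows. The normalization steps shown (collapsing whitespace, sorting JSON keys) and the function name are assumptions for illustration; the source does not specify which normalization Bifrost applies.

```python
import hashlib
import json


def direct_cache_key(messages: list[dict], scope: str = "") -> str:
    """Build a deterministic key from normalized request content.

    Assumed normalization: collapse runs of whitespace in each message and
    serialize with sorted keys so equivalent requests hash identically.
    """
    normalized = {
        "scope": scope,  # optional segment such as a virtual key
        "messages": [
            {"role": m["role"], "content": " ".join(m["content"].split())}
            for m in messages
        ],
    }
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests that differ only in incidental whitespace produce the same key, while any change in wording produces a different one, which is why rephrased queries need the semantic mode.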

Importantly, the cache operates across providers. A response cached from an OpenAI-routed request is served for an Anthropic-routed request with the same semantic query, because the cache key and vector embedding are derived from the request content, not the provider. This cross-provider cache coherence maximizes hit rates for teams using multiple providers.

Cache writes are asynchronous: Bifrost returns the provider response immediately and writes the cache entry in the background, so the first request for any query never blocks on cache write latency. Cache entries persist across gateway restarts, keeping the cache warm through maintenance windows.

Semantic caching is configured at the virtual key level, allowing fine-grained control: a customer support use case might cache aggressively with a low similarity threshold, while a code generation use case disables semantic caching entirely to prevent incorrect code from being served as a cached response.

How Bifrost Implements Dynamic Routing

Bifrost's routing rules provide expression-based routing that evaluates at request time and overrides governance-level provider selection. Rules are organized in a scope hierarchy: virtual key scope takes highest priority, followed by team, customer, and global scope. The first matching rule wins.

Within routing rules, CEL expressions can route by:

  • Model requested: If the client asks for gpt-4o on a batch job, route the request to claude-haiku instead
  • Virtual key: Route traffic from specific virtual keys (specific teams or applications) to designated providers
  • Request type: Batch inference, embeddings, and image generation can route to different providers based on cost profiles
  • Time of day: Route off-peak traffic to lower-cost providers

Provider routing handles weighted distribution across providers and API keys. Weights are assigned per provider, and Bifrost distributes traffic proportionally. This is how teams implement cost tiering: allocate more weight to the cheaper provider, less to the premium one, without touching application code.

Automatic fallback routes around provider rate limits and outages to cost-efficient alternatives. When the primary provider returns 429 or 5xx errors, Bifrost activates the next provider in the fallback chain without client retries. This prevents both downtime and the spike in costs caused by retry storms against an unavailable provider.
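A fallback chain of this kind can be sketched as a loop that advances to the next provider only on 429 or 5xx responses. The exception type, function names, and provider callables below are hypothetical; the sketch shows the control flow, not Bifrost's internals.

```python
from typing import Callable


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Rate limits and server-side failures justify moving to the next provider.
    return status == 429 or 500 <= status <= 599


def call_with_fallback(chain: list[tuple[str, Callable[[str], str]]], prompt: str):
    """Try each provider in order; return (provider_name, response)."""
    last_error = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if not is_retryable(err.status):
                raise  # client errors such as 400 would fail everywhere
            last_error = err
    if last_error is None:
        raise ValueError("fallback chain is empty")
    raise last_error  # every provider in the chain failed
```

Each provider is attempted at most once per request, so a failing primary does not accumulate repeated billable retries before the request moves on.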

Adaptive load balancing extends this with real-time provider health monitoring and proactive rerouting before errors occur, using provider health signals to shift traffic away from degrading endpoints.

Code Mode: Token Reduction for MCP Agentic Workloads

For teams using Bifrost as an MCP gateway, Code Mode is a separate and significant cost-reduction mechanism distinct from semantic caching and dynamic routing.

Classic MCP deployments with 10+ servers and 150+ tools include the full tool catalog in the model's context on every request turn. At scale, tool definition tokens dwarf task execution tokens. Code Mode solves this by exposing only four generic meta-tools to the model. The model writes Starlark code (Starlark is a Python dialect) to orchestrate the full tool set in a sandbox, rather than being presented with hundreds of tool definitions directly.

In benchmarked deployments with 16 MCP servers and 508 tools, Code Mode reduced input token usage by 92.8% and estimated cost by 92.2%, while maintaining a 100% task pass rate. Execution latency also dropped by approximately 40% because the model completes tasks in fewer turns.

The MCP Gateway blog post covers the complete benchmark methodology and per-round results for teams evaluating the token cost impact at different MCP footprint sizes.

Combining Semantic Caching and Dynamic Routing

Semantic caching and dynamic routing address different parts of the LLM cost equation and compound when used together.

Semantic caching reduces the total number of requests that reach LLM providers. A cache hit serves the response from the vector store without any provider call, eliminating both the inference cost and the latency of the full provider round-trip. For workloads with high question repetition, caching can reduce provider request volume significantly.

Dynamic routing ensures that the requests that do reach providers are directed to the most cost-appropriate model for the workload. A batch classification job that would have been sent to GPT-4o at $10 per million output tokens instead routes to GPT-4o Mini at a fraction of that cost.

Together, the two mechanisms create a cost reduction stack:

  1. Cache hits skip provider calls entirely
  2. Cache misses route to cost-appropriate providers based on workload type
  3. Fallback chains prevent expensive retry storms during outages
  4. Code Mode (for MCP workloads) reduces input token volume per request
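How the first two layers compound can be estimated with simple arithmetic. The function below is a rough planning sketch, not a Bifrost feature: it considers output tokens only, and every input value in the example (request volume, hit rate, routing share, and the cheaper model's price) is a hypothetical placeholder. Only the $10 per million output tokens figure comes from the text above.

```python
def estimated_output_cost(
    total_requests: int,
    avg_output_tokens: float,
    cache_hit_rate: float,       # fraction of requests served from cache
    cheap_share: float,          # fraction of cache misses routed to the cheaper model
    frontier_price_per_m: float, # USD per million output tokens
    cheap_price_per_m: float,    # USD per million output tokens
) -> float:
    """Estimate output-token spend after caching and routing are applied."""
    provider_requests = total_requests * (1.0 - cache_hit_rate)
    blended_price = (
        cheap_share * cheap_price_per_m
        + (1.0 - cheap_share) * frontier_price_per_m
    )
    million_tokens = provider_requests * avg_output_tokens / 1_000_000
    return million_tokens * blended_price


# Baseline: every request goes to the frontier model, nothing is cached.
baseline = estimated_output_cost(1_000_000, 500, 0.0, 0.0, 10.0, 1.0)
# Hypothetical scenario: 30% cache hits, 60% of misses routed to a $1/M model.
combined = estimated_output_cost(1_000_000, 500, 0.3, 0.6, 10.0, 1.0)
```

With these placeholder inputs the estimate falls from about $5,000 to about $1,610. The savings multiply because caching shrinks the number of billable requests while routing lowers the average price of the requests that remain.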

The governance resource page covers how virtual keys tie cost controls together: each virtual key can carry its own caching configuration, routing rules, model access permissions, and spend limits, providing per-team or per-application cost governance within a single gateway deployment.

Observability to Track Cost Reduction

Measuring the impact of caching and routing requires visibility into request-level cost data. Bifrost provides real-time observability covering:

  • Per-virtual-key spend tracking: Token consumption and estimated cost per virtual key, enabling per-team and per-application cost attribution.
  • Cache hit rate metrics: Direct and semantic cache hit rates, so teams can measure the fraction of requests being served from cache versus providers.
  • Provider cost breakdown: Cost distribution across providers, models, and time periods.
  • Prometheus metrics: Native Prometheus scraping for integration with existing Grafana dashboards.
  • OpenTelemetry export: OTLP traces and metrics compatible with New Relic, Honeycomb, and any OTLP-compatible backend.
  • Datadog connector: Native integration for APM traces, LLM Observability dashboards, and metric forwarding to Datadog.

With per-virtual-key attribution and cache hit metrics visible in real time, teams can identify which workloads are driving cost growth, verify that routing rules are directing traffic to the intended providers, and measure the cost reduction from caching configuration changes.

Start Reducing LLM Costs with Bifrost

Semantic caching and dynamic routing together address the two primary sources of LLM cost at scale: repeated inference on semantically equivalent queries, and over-provisioned model routing. Both operate at the gateway layer in Bifrost without changes to application code. For MCP agentic workloads, Code Mode adds a third cost reduction layer through input token compression. The Bifrost benchmarks provide verified overhead and cost reduction numbers for planning purposes. To see how Bifrost fits your AI infrastructure and cost reduction goals, book a demo with the Bifrost team.