How to Save Token Costs for Your AI Applications with Bifrost: A Complete Guide
Token costs are the silent budget killer for AI applications in production. Every LLM API call burns through tokens for input context, tool definitions, repeated queries, and suboptimal routing, and the bill compounds fast as you scale. Most teams discover this only after their monthly invoice arrives from OpenAI, Anthropic, or AWS Bedrock.
Bifrost is a high-performance, open-source AI gateway built in Go that gives teams a full toolkit for reducing token spend without sacrificing output quality. From intelligent caching and cost-aware routing to agent-level token optimization, Bifrost addresses every major cost vector at the infrastructure layer so your application code stays clean.
This guide walks through every cost-saving mechanism Bifrost provides and how to implement each one.
1. Semantic Caching: Stop Paying Twice for Similar Queries
The most direct way to cut token costs is to never make the same (or a similar) API call twice. Bifrost's semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs.
This is fundamentally different from simple hash-based caching. If a user asks "What is our refund policy?" and another user later asks "How do returns work?", traditional caching misses entirely. Bifrost's semantic cache recognizes these as similar requests and serves the cached response, saving you an entire LLM round trip.
- Dual-layer caching: Bifrost runs exact hash matching first (fastest path), then falls back to semantic similarity search using embeddings with a configurable similarity threshold (default: 0.8)
- Sub-millisecond cache retrieval compared to multi-second API calls, delivering both cost and latency savings simultaneously
- Model and provider isolation: Caching is separated per model and provider combination, so cached GPT-4o responses are never served for Claude requests
- Full streaming support: Cached responses maintain proper chunk ordering for streaming use cases
- Per-request control: Override TTL, similarity threshold, and cache type on individual requests via headers like `x-bf-cache-ttl` and `x-bf-cache-threshold`
- Direct hash mode: For teams that only need exact-match deduplication without an embedding provider, Bifrost supports a direct hash mode with zero embedding overhead
- Conversation-aware thresholds: The `ConversationHistoryThreshold` setting automatically skips caching for long conversations where semantic false positives are more likely, keeping cache quality high without manual tuning
Semantic caching is especially effective for customer support bots, FAQ-heavy applications, internal knowledge assistants, and any system where users frequently ask variations of the same questions. Configure it through the Bifrost web UI or config.json in minutes.
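To make the dual-layer lookup concrete, here is a minimal sketch of the pattern in Python: exact hash matching first, then a semantic fallback gated by a similarity threshold (0.8, matching the default above). The class, the toy embeddings, and the eviction-free storage are all illustrative assumptions, not Bifrost's implementation.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


class DualLayerCache:
    """Exact-hash lookup first (fast path), semantic fallback second."""

    def __init__(self, embed, threshold=0.8):
        self.embed = embed          # callable: text -> embedding vector
        self.threshold = threshold
        self.exact = {}             # sha256(prompt) -> response
        self.semantic = []          # (vector, response) pairs

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.semantic.append((self.embed(prompt), response))

    def get(self, prompt):
        hit = self.exact.get(self._key(prompt))   # layer 1: exact hash
        if hit is not None:
            return hit
        vec = self.embed(prompt)                  # layer 2: similarity scan
        best, best_score = None, 0.0
        for cached_vec, response in self.semantic:
            score = cosine(vec, cached_vec)
            if score >= self.threshold and score > best_score:
                best, best_score = response, score
        return best
```

With a real embedding provider, "What is our refund policy?" and "How do returns work?" land close enough in vector space that the second query is served from cache.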
2. MCP Code Mode: Cut Agent Token Usage by 50%
If your AI application uses tool calling or agentic workflows, MCP Code Mode is likely the single biggest cost lever available to you. The problem it solves is fundamental to how the Model Context Protocol works at scale.
When you connect multiple MCP servers (say 8 to 10 servers with 150+ tools total), every single request includes all tool definitions in the context window. The LLM spends most of its token budget reading tool catalogs instead of doing actual work. This is pure waste on every turn.
Code Mode replaces all 150+ tool definitions with just four generic meta-tools: listToolFiles, readToolFile, getToolDocs, and executeToolCode. The LLM uses these to discover tools on demand and then writes Python code (executed in a sandboxed Starlark interpreter) to orchestrate everything in one step. Intermediate results are processed inside the sandbox rather than flowing back through the model.
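The shape of this pattern can be sketched in a few lines of Python. Everything here is illustrative, not Bifrost's actual API: the tool registry is a toy, and plain `exec` with stripped builtins stands in for the real sandboxed Starlark interpreter. The point is that the model sees only four small meta-tools, and intermediate results stay inside `execute_tool_code`.

```python
# Hypothetical two-tool registry standing in for a 150-tool catalog.
TOOL_REGISTRY = {
    "search_orders": lambda customer_id: [{"id": 1, "total": 40},
                                          {"id": 2, "total": 60}],
    "sum_totals": lambda orders: sum(o["total"] for o in orders),
}


def list_tool_files():
    return sorted(TOOL_REGISTRY)              # discovery on demand


def read_tool_file(name):
    return f"def {name}(...): ..."            # stub source, fetched only if asked


def get_tool_docs(name):
    return f"{name}: callable tool"           # docs, fetched only if asked


def execute_tool_code(source):
    """Run model-written code in a restricted namespace; only the final
    `result` travels back through the model's context window."""
    env = {"tools": TOOL_REGISTRY, "result": None}
    exec(source, {"__builtins__": {}}, env)   # stand-in for a real sandbox
    return env["result"]
```

A single `execute_tool_code` call can chain `search_orders` into `sum_totals` and return one number, instead of two tool-call round trips carrying the full order list through the model.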
The numbers from Bifrost's documentation tell the story:
- Classic MCP with 5 servers (100 tools): 6 LLM turns, approximately 600+ tokens in tool definitions alone per workflow, all intermediate results traveling through the model
- Code Mode with the same 5 servers: 3 to 4 LLM turns, approximately 50 tokens in tool definitions, intermediate results processed in sandbox
- Result: roughly 50% cost reduction and 3 to 4x fewer LLM round trips
In a real-world e-commerce scenario with 10 MCP servers and 150 tools, Code Mode dropped the average cost per task from $3.20-$4.00 down to $1.20-$1.80 while cutting latency from 18-25 seconds down to 8-12 seconds.
You can also mix both modes: enable Code Mode for "heavy" servers (web search, documents, databases) and keep small utilities as direct tools. Enable it per MCP client through the Bifrost dashboard or config for any server with 3+ tools.
3. Intelligent Routing Rules: Send Requests to the Cheapest Viable Provider
Not every request needs GPT-4o or Claude Opus. Bifrost's routing rules use CEL (Common Expression Language) expressions to evaluate conditions at runtime and dynamically route requests to the most cost-effective provider based on context.
- Budget-based failover: Automatically route to a cheaper provider when your primary provider's budget usage exceeds a threshold. For example, the expression `budget_used > 85` can trigger a switch from OpenAI to Groq, preventing budget overruns while keeping the application running
- Request type routing: Route embedding requests to cheaper providers while keeping chat completions on premium models. The expression `request_type == "embedding" && budget_used > 50` can redirect embeddings to a lower-cost endpoint
- Tier-based routing: Use headers to route premium users to high-quality models and standard users to cost-efficient alternatives, optimizing spend per user segment
- Probabilistic A/B splits: Split traffic across providers by weight for cost optimization. A rule like 70% OpenAI and 30% Groq lets you test cheaper alternatives on a portion of traffic without committing fully
- Scope hierarchy: Rules can be applied globally, per customer, per team, or per virtual key, with first-match-wins evaluation and priority ordering
- Fallback chains: When a routing rule's primary target fails, Bifrost automatically tries fallback providers in order rather than returning an error, so you never pay for a failed request and then pay again for the retry on the application side
Routing rules execute before governance provider selection and can override it, giving you granular control over where every token dollar goes. Combined with Bifrost's automatic fallbacks, your application stays resilient while always targeting the most cost-effective path.
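The first-match-wins evaluation described above can be sketched as follows. This is a hedged Python approximation: real Bifrost rules are CEL expressions evaluated by the gateway, so the rule names, the context fields beyond `budget_used` and `request_type`, and the plain-lambda predicates here are illustrative stand-ins.

```python
# Rules are checked in priority order; the first matching rule wins.
RULES = [
    {"name": "embeddings-offload",
     "when": lambda ctx: ctx["request_type"] == "embedding"
                         and ctx["budget_used"] > 50,
     "target": "low-cost-embeddings"},
    {"name": "budget-failover",
     "when": lambda ctx: ctx["budget_used"] > 85,
     "target": "groq"},
]


def route(ctx, default="openai"):
    """Return the provider for a request context, first match wins."""
    for rule in RULES:
        if rule["when"](ctx):
            return rule["target"]
    return default
```

A chat request at 90% budget falls through the embedding rule and hits the failover rule, landing on the cheaper provider; a request under both thresholds keeps the default.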
4. Virtual Keys and Budget Controls: Set Hard Spending Limits
Unchecked token usage across teams is one of the fastest ways to blow through an AI budget. Bifrost's virtual keys and governance framework provide hierarchical cost control at the gateway level.
- Per-team and per-customer budgets: Set spending limits at every level of your organization so no single team or project can monopolize your AI budget
- Rate limiting: Control requests per minute/hour at the virtual key, team, and customer levels to prevent runaway costs from automated pipelines or misconfigured agents
- Real-time capacity metrics: Budget consumption and token usage percentages are available in routing rules (e.g., `budget_used`, `tokens_used`), enabling automated cost reactions like switching providers when a budget threshold is hit
- Virtual key scoping for MCP tools: Control which MCP tools are available per virtual key using tool filtering, preventing teams from accessing expensive tool chains they do not need
These controls are included in Bifrost's open-source tier, not locked behind an enterprise license.
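The hierarchical ceiling works roughly like this sketch: a request must fit under the virtual-key, team, and customer budgets simultaneously, or it is rejected before any tokens are spent. The class and function names are illustrative assumptions, not Bifrost's governance API.

```python
class Budget:
    """A single spending ceiling in US dollars."""

    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def can_spend(self, cost):
        return self.spent + cost <= self.limit

    def spend(self, cost):
        self.spent += cost


def charge(chain, cost_usd):
    """chain = [virtual_key_budget, team_budget, customer_budget].
    Charge all levels atomically, or reject the request outright."""
    if all(b.can_spend(cost_usd) for b in chain):
        for b in chain:
            b.spend(cost_usd)
        return True
    return False  # hard ceiling: the request never reaches a provider
```

Because every level is checked before any is charged, one runaway virtual key exhausts only its own limit, never the whole team's.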
5. Observability: Find and Fix Token Waste
You cannot optimize what you cannot measure. Bifrost's built-in observability tracks every AI request in real time, including token usage, cost per request, provider performance, and cache hit rates.
- Native Prometheus metrics for monitoring token consumption patterns across models, providers, and teams
- OpenTelemetry integration for distributed tracing with Grafana, New Relic, Honeycomb, and more, capturing attributes like `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`, and `gen_ai.usage.cost` on every span
- Cache debug metadata on every response shows whether it was a cache hit (semantic or direct), the similarity score, and the cache entry ID, letting you fine-tune your caching strategy based on real data
- Routing rule logs show exactly which rules matched and which provider was selected, so you can verify your cost-optimization routing is working as intended
This data lets you identify which prompts, teams, or features are consuming the most tokens and target your optimization efforts where they will have the biggest impact.
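As a small worked example of that analysis, the sketch below aggregates per-team token usage and cache hit rate from a request log. The log shape is a toy; the field names merely echo the kinds of attributes the metrics above export.

```python
from collections import defaultdict

# Toy request log; fields are illustrative stand-ins for exported metrics.
REQUESTS = [
    {"team": "support", "prompt_tokens": 900, "completion_tokens": 150, "cache_hit": True},
    {"team": "support", "prompt_tokens": 1200, "completion_tokens": 300, "cache_hit": False},
    {"team": "search", "prompt_tokens": 200, "completion_tokens": 50, "cache_hit": False},
]


def usage_by_team(requests):
    """Total tokens (prompt + completion) consumed per team."""
    totals = defaultdict(int)
    for r in requests:
        totals[r["team"]] += r["prompt_tokens"] + r["completion_tokens"]
    return dict(totals)


def cache_hit_rate(requests):
    """Fraction of requests served from cache."""
    hits = sum(1 for r in requests if r["cache_hit"])
    return hits / len(requests)
```

Ranking teams by total tokens immediately surfaces where prompt trimming, caching, or cheaper routing would pay off most.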
Putting It All Together
The most effective approach is to layer these strategies together:
- Start with semantic caching for immediate savings on repeated and similar queries. This is the lowest-effort, highest-impact change you can make
- Enable Code Mode on any MCP-heavy agent workflows to cut tool-calling token waste by roughly 50%
- Set up routing rules to automatically direct traffic to the cheapest viable provider based on request type, user tier, and real-time budget consumption
- Configure budget controls with virtual keys to set hard spending ceilings per team and per project
- Use observability data to continuously identify remaining waste and refine your strategy
The cumulative effect of these strategies can reduce total token spend by 40% to 60% for typical production applications. Every feature described above is available out of the box with a single command:
npx -y @maximhq/bifrost
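To see how the layers compound toward that 40-60% range, here is a simplified back-of-the-envelope model. The rates are illustrative assumptions (30% cache hit rate, half of traffic agentic with the 50% Code Mode cut, 10% from cheaper routing), and treating the layers as independent multiplicative discounts is itself a simplification.

```python
def layered_savings(baseline_usd, cache_hit_rate, agent_fraction,
                    agent_cut, routing_cut):
    """Apply each savings layer as an independent multiplicative discount."""
    after_cache = baseline_usd * (1 - cache_hit_rate)
    after_agents = after_cache * (1 - agent_fraction * agent_cut)
    return after_agents * (1 - routing_cut)


# Illustrative: $10,000/month baseline, 30% cache hits, 50% of traffic
# agentic with a 50% Code Mode cut, 10% further savings from routing.
cost = layered_savings(10_000, 0.30, 0.50, 0.50, 0.10)  # -> 4725.0
```

Under these assumptions the monthly bill drops from $10,000 to $4,725, a 52.75% reduction, squarely inside the 40-60% band.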
Ready to see how much your team could save? Book a Bifrost demo for a walkthrough tailored to your architecture.