How to Save Token Costs for Your AI Applications with Bifrost: A Complete Guide
Token costs are the silent budget killer for AI applications in production. Every LLM API call burns through tokens for input context, tool definitions, repeated queries, and suboptimal routing, and the bill compounds fast as you scale. Most teams discover this only after their monthly invoice arrives from OpenAI, Anthropic, or AWS Bedrock.
Bifrost is a high-performance, open-source AI gateway built in Go that gives teams a full toolkit for reducing token spend without sacrificing output quality. From intelligent caching and cost-aware routing to agent-level token optimization, Bifrost addresses every major cost vector at the infrastructure layer so your application code stays clean.
This guide walks through every cost-saving mechanism Bifrost provides and how to implement each one.
1. Semantic Caching: Stop Paying Twice for Similar Queries
The most direct way to cut token costs is to never make the same (or a similar) API call twice. Bifrost's semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs.
This is fundamentally different from simple hash-based caching. If a user asks "What is our refund policy?" and another user later asks "How do returns work?", traditional caching misses entirely. Bifrost's semantic cache recognizes these as similar requests and serves the cached response, saving you an entire LLM round trip.
- Dual-layer caching: Bifrost runs exact hash matching first (fastest path), then falls back to semantic similarity search using embeddings with a configurable similarity threshold (default: 0.8)
- Sub-millisecond cache retrieval compared to multi-second API calls, delivering both cost and latency savings simultaneously
- Model and provider isolation: Caching is separated per model and provider combination, so cached GPT-4o responses are never served for Claude requests
- Full streaming support: Cached responses maintain proper chunk ordering for streaming use cases
- Per-request control: Override TTL, similarity threshold, and cache type on individual requests via headers like `x-bf-cache-ttl` and `x-bf-cache-threshold`
- Direct hash mode: For teams that only need exact-match deduplication without an embedding provider, Bifrost supports a direct hash mode with zero embedding overhead
- Conversation-aware thresholds: The `ConversationHistoryThreshold` setting automatically skips caching for long conversations where semantic false positives are more likely, keeping cache quality high without manual tuning
Semantic caching is especially effective for customer support bots, FAQ-heavy applications, internal knowledge assistants, and any system where users frequently ask variations of the same questions. Configure it through the Bifrost web UI or config.json in minutes.
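To make the dual-layer lookup concrete, here is a minimal sketch of the pattern in Python: exact hash matching first, then a semantic fallback gated by a similarity threshold (0.8, matching the default above). The class, the toy embeddings, and the eviction-free storage are all illustrative assumptions, not Bifrost's implementation.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


class DualLayerCache:
    """Exact-hash lookup first (fast path), semantic fallback second."""

    def __init__(self, embed, threshold=0.8):
        self.embed = embed          # callable: text -> embedding vector
        self.threshold = threshold
        self.exact = {}             # sha256(prompt) -> response
        self.semantic = []          # (vector, response) pairs

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.semantic.append((self.embed(prompt), response))

    def get(self, prompt):
        hit = self.exact.get(self._key(prompt))   # layer 1: exact hash
        if hit is not None:
            return hit
        vec = self.embed(prompt)                  # layer 2: similarity scan
        best, best_score = None, 0.0
        for cached_vec, response in self.semantic:
            score = cosine(vec, cached_vec)
            if score >= self.threshold and score > best_score:
                best, best_score = response, score
        return best
```

With a real embedding provider, "What is our refund policy?" and "How do returns work?" land close enough in vector space that the second query is served from cache.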
2. MCP Code Mode: Cut Agent Token Usage by 50%
If your AI application uses tool calling or agentic workflows, MCP Code Mode is likely the single biggest cost lever available to you. The problem it solves is fundamental to how the Model Context Protocol works at scale.
When you connect multiple MCP servers (say 8 to 10 servers with 150+ tools total), every single request includes all tool definitions in the context window. The LLM spends most of its token budget reading tool catalogs instead of doing actual work. This is pure waste on every turn.
Code Mode replaces all 150+ tool definitions with just four generic meta-tools: listToolFiles, readToolFile, getToolDocs, and executeToolCode. The LLM uses these to discover tools on demand and then writes Python code (executed in a sandboxed Starlark interpreter) to orchestrate everything in one step. Intermediate results are processed inside the sandbox rather than flowing back through the model.
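The shape of this pattern can be sketched in a few lines of Python. Everything here is illustrative, not Bifrost's actual API: the tool registry is a toy, and plain `exec` with stripped builtins stands in for the real sandboxed Starlark interpreter. The point is that the model sees only four small meta-tools, and intermediate results stay inside `execute_tool_code`.

```python
# Hypothetical two-tool registry standing in for a 150-tool catalog.
TOOL_REGISTRY = {
    "search_orders": lambda customer_id: [{"id": 1, "total": 40},
                                          {"id": 2, "total": 60}],
    "sum_totals": lambda orders: sum(o["total"] for o in orders),
}


def list_tool_files():
    return sorted(TOOL_REGISTRY)              # discovery on demand


def read_tool_file(name):
    return f"def {name}(...): ..."            # stub source, fetched only if asked


def get_tool_docs(name):
    return f"{name}: callable tool"           # docs, fetched only if asked


def execute_tool_code(source):
    """Run model-written code in a restricted namespace; only the final
    `result` travels back through the model's context window."""
    env = {"tools": TOOL_REGISTRY, "result": None}
    exec(source, {"__builtins__": {}}, env)   # stand-in for a real sandbox
    return env["result"]
```

A single `execute_tool_code` call can chain `search_orders` into `sum_totals` and return one number, instead of two tool-call round trips carrying the full order list through the model.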
The numbers from Bifrost's documentation tell the story:
- Classic MCP with 5 servers (100 tools): 6 LLM turns, approximately 600+ tokens in tool definitions alone per workflow, all intermediate results traveling through the model
- Code Mode with the same 5 servers: 3 to 4 LLM turns, approximately 50 tokens in tool definitions, intermediate results processed in sandbox
- Result: roughly 50% cost reduction and 3 to 4x fewer LLM round trips
In a real-world e-commerce scenario with 10 MCP servers and 150 tools, Code Mode dropped the average cost per task from $3.20-$4.00 down to $1.20-$1.80 while cutting latency from 18-25 seconds down to 8-12 seconds.
You can also mix both modes: enable Code Mode for "heavy" servers (web search, documents, databases) and keep small utilities as direct tools. Enable it per MCP client through the Bifrost dashboard or config for any server with 3+ tools.
3. Intelligent Routing Rules: Send Requests to the Cheapest Viable Provider
Not every request needs GPT-4o or Claude Opus. Bifrost's routing rules use CEL (Common Expression Language) expressions to evaluate conditions at runtime and dynamically route requests to the most cost-effective provider based on context.
- Budget-based failover: Automatically route to a cheaper provider when your primary provider's budget usage exceeds a threshold. For example, the expression `budget_used > 85` can trigger a switch from OpenAI to Groq, preventing budget overruns while keeping the application running
- Request type routing: Route embedding requests to cheaper providers while keeping chat completions on premium models. The expression `request_type == "embedding" && budget_used > 50` can redirect embeddings to a lower-cost endpoint
- Tier-based routing: Use headers to route premium users to high-quality models and standard users to cost-efficient alternatives, optimizing spend per user segment
- Probabilistic A/B splits: Split traffic across providers by weight for cost optimization. A rule like 70% OpenAI and 30% Groq lets you test cheaper alternatives on a portion of traffic without committing fully
- Scope hierarchy: Rules can be applied globally, per customer, per team, or per virtual key, with first-match-wins evaluation and priority ordering
- Fallback chains: When a routing rule's primary target fails, Bifrost automatically tries fallback providers in order rather than returning an error, so you never pay for a failed request and then pay again for the retry on the application side
Routing rules execute before governance provider selection and can override it, giving you granular control over where every token dollar goes. Combined with Bifrost's automatic fallbacks, your application stays resilient while always targeting the most cost-effective path.
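The first-match-wins evaluation described above can be sketched as follows. This is a hedged Python approximation: real Bifrost rules are CEL expressions evaluated by the gateway, so the rule names, the context fields beyond `budget_used` and `request_type`, and the plain-lambda predicates here are illustrative stand-ins.

```python
# Rules are checked in priority order; the first matching rule wins.
RULES = [
    {"name": "embeddings-offload",
     "when": lambda ctx: ctx["request_type"] == "embedding"
                         and ctx["budget_used"] > 50,
     "target": "low-cost-embeddings"},
    {"name": "budget-failover",
     "when": lambda ctx: ctx["budget_used"] > 85,
     "target": "groq"},
]


def route(ctx, default="openai"):
    """Return the provider for a request context, first match wins."""
    for rule in RULES:
        if rule["when"](ctx):
            return rule["target"]
    return default
```

A chat request at 90% budget falls through the embedding rule and hits the failover rule, landing on the cheaper provider; a request under both thresholds keeps the default.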
4. Virtual Keys and Budget Controls: Set Hard Spending Limits
Unchecked token usage across teams is one of the fastest ways to blow through an AI budget. Bifrost's virtual keys and governance framework provide hierarchical cost control at the gateway level.
- Per-team and per-customer budgets: Set spending limits at every level of your organization so no single team or project can monopolize your AI budget
- Rate limiting: Control requests per minute/hour at the virtual key, team, and customer levels to prevent runaway costs from automated pipelines or misconfigured agents
- Real-time capacity metrics: Budget consumption and token usage percentages are available in routing rules (e.g., `budget_used`, `tokens_used`), enabling automated cost reactions like switching providers when a budget threshold is hit
- Virtual key scoping for MCP tools: Control which MCP tools are available per virtual key using tool filtering, preventing teams from accessing expensive tool chains they do not need
These controls are included in Bifrost's open-source tier, not locked behind an enterprise license.
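The hierarchical ceiling works roughly like this sketch: a request must fit under the virtual-key, team, and customer budgets simultaneously, or it is rejected before any tokens are spent. The class and function names are illustrative assumptions, not Bifrost's governance API.

```python
class Budget:
    """A single spending ceiling in US dollars."""

    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def can_spend(self, cost):
        return self.spent + cost <= self.limit

    def spend(self, cost):
        self.spent += cost


def charge(chain, cost_usd):
    """chain = [virtual_key_budget, team_budget, customer_budget].
    Charge all levels atomically, or reject the request outright."""
    if all(b.can_spend(cost_usd) for b in chain):
        for b in chain:
            b.spend(cost_usd)
        return True
    return False  # hard ceiling: the request never reaches a provider
```

Because every level is checked before any is charged, one runaway virtual key exhausts only its own limit, never the whole team's.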
5. Observability: Find and Fix Token Waste
You cannot optimize what you cannot measure. Bifrost's built-in observability tracks every AI request in real time, including token usage, cost per request, provider performance, and cache hit rates.
- Native Prometheus metrics for monitoring token consumption patterns across models, providers, and teams
- OpenTelemetry integration for distributed tracing with Grafana, New Relic, Honeycomb, and more, capturing attributes like `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`, and `gen_ai.usage.cost` on every span
- Cache debug metadata on every response shows whether it was a cache hit (semantic or direct), the similarity score, and the cache entry ID, letting you fine-tune your caching strategy based on real data
- Routing rule logs show exactly which rules matched and which provider was selected, so you can verify your cost-optimization routing is working as intended
This data lets you identify which prompts, teams, or features are consuming the most tokens and target your optimization efforts where they will have the biggest impact.
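As a small worked example of that analysis, the sketch below aggregates per-team token usage and cache hit rate from a request log. The log shape is a toy; the field names merely echo the kinds of attributes the metrics above export.

```python
from collections import defaultdict

# Toy request log; fields are illustrative stand-ins for exported metrics.
REQUESTS = [
    {"team": "support", "prompt_tokens": 900, "completion_tokens": 150, "cache_hit": True},
    {"team": "support", "prompt_tokens": 1200, "completion_tokens": 300, "cache_hit": False},
    {"team": "search", "prompt_tokens": 200, "completion_tokens": 50, "cache_hit": False},
]


def usage_by_team(requests):
    """Total tokens (prompt + completion) consumed per team."""
    totals = defaultdict(int)
    for r in requests:
        totals[r["team"]] += r["prompt_tokens"] + r["completion_tokens"]
    return dict(totals)


def cache_hit_rate(requests):
    """Fraction of requests served from cache."""
    hits = sum(1 for r in requests if r["cache_hit"])
    return hits / len(requests)
```

Ranking teams by total tokens immediately surfaces where prompt trimming, caching, or cheaper routing would pay off most.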
Putting It All Together
The most effective approach is to layer these strategies together:
- Start with semantic caching for immediate savings on repeated and similar queries. This is the lowest-effort, highest-impact change you can make
- Enable Code Mode on any MCP-heavy agent workflows to cut tool-calling token waste by roughly 50%
- Set up routing rules to automatically direct traffic to the cheapest viable provider based on request type, user tier, and real-time budget consumption
- Configure budget controls with virtual keys to set hard spending ceilings per team and per project
- Use observability data to continuously identify remaining waste and refine your strategy
The cumulative effect of these strategies can reduce total token spend by 40% to 60% for typical production applications. Every feature described above is available out of the box with a single command:
npx -y @maximhq/bifrost
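To see how the layers compound toward that 40-60% range, here is a simplified back-of-the-envelope model. The rates are illustrative assumptions (30% cache hit rate, half of traffic agentic with the 50% Code Mode cut, 10% from cheaper routing), and treating the layers as independent multiplicative discounts is itself a simplification.

```python
def layered_savings(baseline_usd, cache_hit_rate, agent_fraction,
                    agent_cut, routing_cut):
    """Apply each savings layer as an independent multiplicative discount."""
    after_cache = baseline_usd * (1 - cache_hit_rate)
    after_agents = after_cache * (1 - agent_fraction * agent_cut)
    return after_agents * (1 - routing_cut)


# Illustrative: $10,000/month baseline, 30% cache hits, 50% of traffic
# agentic with a 50% Code Mode cut, 10% further savings from routing.
cost = layered_savings(10_000, 0.30, 0.50, 0.50, 0.10)  # -> 4725.0
```

Under these assumptions the monthly bill drops from $10,000 to $4,725, a 52.75% reduction, squarely inside the 40-60% band.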
Ready to see how much your team could save? Book a Bifrost demo for a walkthrough tailored to your architecture.