Token Optimization for MCP Tool Calls: 5 Techniques That Actually Work

Token costs in multi-server MCP deployments can become one of the fastest-growing line items in an AI budget. Bifrost, the open-source AI gateway built in Go by Maxim AI, provides a set of production-tested techniques for bringing them down; in its Code Mode benchmarks, input token consumption fell by up to 92.8% with no drop in task completion rate. This post covers five of those techniques, starting with Code Mode and working through tool filtering, semantic caching, provider routing, and budget-enforced governance.

Token usage in agentic AI applications follows a predictable pattern: small in development, expensive in production. The culprit is usually tool calling. Every MCP-enabled request loads the full schema for every connected tool into the model's context window before a single line of useful work is done. With 5 servers and 20 tools each, that is 100 tool definitions per turn. Multiply that across hundreds of agent runs and the cost compounds rapidly.
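To make the compounding concrete, here is a back-of-the-envelope sketch. The server and tool counts come from the example above; the tokens-per-schema figure, turns per run, and number of runs are assumptions chosen only for illustration, and real schemas vary widely in size.

```python
# Rough estimate of schema overhead in classic MCP.
# Only SERVERS and TOOLS_PER_SERVER come from the text; everything else is assumed.

def schema_tokens_per_turn(servers: int, tools_per_server: int, tokens_per_schema: int) -> int:
    """Tokens spent on tool definitions before the model does any useful work."""
    return servers * tools_per_server * tokens_per_schema

SERVERS = 5              # from the example above
TOOLS_PER_SERVER = 20    # from the example above
TOKENS_PER_SCHEMA = 150  # assumption: average size of one tool definition
TURNS_PER_RUN = 8        # assumption: model turns in one agent run
RUNS = 500               # assumption: "hundreds of agent runs"

per_turn = schema_tokens_per_turn(SERVERS, TOOLS_PER_SERVER, TOKENS_PER_SCHEMA)
total = per_turn * TURNS_PER_RUN * RUNS

print(f"tool definitions per turn: {SERVERS * TOOLS_PER_SERVER}")
print(f"schema tokens per turn:    {per_turn:,}")
print(f"schema tokens, all runs:   {total:,}")
```

Under these assumed numbers, schema injection alone accounts for tens of millions of input tokens, none of which contribute to the task itself.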

The five techniques below address this at different layers: how tools are presented to the model, which tools are accessible per request, how responses are reused, which provider handles each request, and how spending is bounded before it escalates.


Technique 1: Code Mode for Tool Schema Compression

The largest single token cost in most MCP deployments is not inference. It is the repeated injection of tool definitions on every turn. Classic MCP works by exposing all connected tools directly to the model, which means every request carries the full schema catalog regardless of how many tools are actually needed.

Bifrost addresses this with Code Mode, an execution model that replaces the standard tool-loading approach entirely.

Code Mode exposes four generic meta-tools instead of the full tool catalog:

  • listToolFiles: discover available MCP servers
  • readToolFile: load compact Python stub signatures on demand
  • getToolDocs: retrieve detailed documentation for a specific tool when needed
  • executeToolCode: run Python (Starlark) in a sandboxed interpreter with full tool bindings

The model writes a short orchestration script rather than calling tools one by one. All intermediate results stay inside the sandbox. Only the compact final output returns to the model's context.
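A minimal sketch of what such an orchestration script might look like. The `search_tickets` function is a hypothetical stand-in for a tool binding the sandbox would expose; real binding names and return shapes depend on the connected MCP servers. The point is only that the bulky intermediate data never leaves the script.

```python
# Hypothetical stand-in for a tool binding; not a real Bifrost or MCP server API.
def search_tickets(team, status):
    # Each ticket carries a large body, as real tool results often do.
    return [
        {"id": 101, "team": team, "status": status, "priority": "high", "body": "x" * 2000},
        {"id": 102, "team": team, "status": status, "priority": "low", "body": "x" * 2000},
        {"id": 103, "team": team, "status": status, "priority": "high", "body": "x" * 2000},
    ]

def orchestrate():
    # Intermediate results stay in local variables inside the sandbox.
    tickets = search_tickets(team="platform", status="open")
    high = [t for t in tickets if t["priority"] == "high"]
    # Only this compact summary would return to the model's context.
    return {"open": len(tickets), "high_priority_ids": [t["id"] for t in high]}

result = orchestrate()
print(result)
```

In classic tool calling, the full ticket payload would pass through the model's context before any filtering could happen. Here the filtering runs next to the data.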

The impact at scale is significant. Across benchmarks with increasing MCP footprint:

MCP footprint            Input tokens (classic MCP)   Input tokens (Code Mode)   Cost change
96 tools / 6 servers     19.9M                        8.3M                       -55.7%
251 tools / 11 servers   35.7M                        5.5M                       -83.4%
508 tools / 16 servers   75.1M                        5.4M                       -92.2%

At 508 tools across 16 servers, Code Mode reduced average input tokens per query from 1.15M to 83K while maintaining a 65/65 (100%) task pass rate. The savings grow with tool count because classic MCP cost scales with the size of the connected tool registry, while Code Mode cost is bounded by what the model actually reads.
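The input-token reduction at each footprint can be recomputed directly from the benchmark totals. The sketch below uses only the token figures reported in the table; the reported cost change is a separate measurement and is not recomputed here.

```python
# Benchmark totals from the table, in millions of input tokens:
# (footprint, classic MCP, Code Mode)
ROWS = [
    ("96 tools / 6 servers", 19.9, 8.3),
    ("251 tools / 11 servers", 35.7, 5.5),
    ("508 tools / 16 servers", 75.1, 5.4),
]

def reduction_pct(classic: float, code_mode: float) -> float:
    """Percentage of input tokens saved relative to classic MCP."""
    return (classic - code_mode) / classic * 100

reductions = {label: round(reduction_pct(c, m), 1) for label, c, m in ROWS}

for label, pct in reductions.items():
    print(f"{label:<24} input tokens reduced by {pct}%")
```

The classic MCP column grows with every server added, while the Code Mode column stays roughly flat past the first row, which is why the percentage saved keeps climbing.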

Enable Code Mode per client in config.json:

{
  "mcp": {
    "client_configs": [
      {
        "name": "filesystem",
        "connection_type": "stdio",
        "stdio_config": {
          "command": "npx",
          "args": ["-y", "@anthropic/mcp-filesystem"]
        },
        "tools_to_execute": ["*"],
        "is_code_mode_client": true
      }
    ]
  }
}

The best practice: enable Code Mode once a deployment connects 3 or more MCP servers, and for any "heavy" server handling web search, document access, or databases. Smaller single-purpose servers can remain in classic MCP mode and be mixed into the same deployment.

The full architecture and benchmark data are covered in the Bifrost MCP Gateway blog post on token costs at scale, with additional context on the MCP gateway resource page.


Technique 2: Tool Filtering to Limit Context Exposure

Even without Code Mode, a significant portion of tool-related token waste comes from loading tools that a given consumer or workflow will never use. If a customer-facing agent has read access to internal admin tools, every request pays for those schema tokens regardless.

MCP tool filtering in Bifrost restricts tool visibility at the virtual key level. Each virtual key can be configured with a strict allow-list of MCP clients and individual tools. A request authenticated under that virtual key only receives the tools on its allow-list, so tool schemas for other servers are never loaded into context.

This produces two compounding benefits. Token costs decrease because fewer schemas are injected per turn. Tool selection accuracy also improves: models are more reliable when choosing among 10 relevant tools than when choosing among 100, many of which are irrelevant to the request.

Practical configuration approach:

  • Create a dedicated virtual key for each agent role or customer tier
  • Define an explicit tool allow-list for each key restricted to the tools that role requires
  • Leave the full tool set available only for admin or debug virtual keys used internally
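The allow-list idea can be sketched in a few lines. This is a conceptual illustration only, not Bifrost's implementation or configuration format: the `Tool` and `VirtualKey` classes, the tool names, and the per-schema token counts are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Tool:
    client: str          # MCP client (server) the tool belongs to
    name: str
    schema_tokens: int   # assumed size of the tool definition

@dataclass
class VirtualKey:
    name: str
    # client name -> allowed tool names; "*" allows every tool on that client
    allowed: dict[str, set[str]] = field(default_factory=dict)

    def permits(self, tool: Tool) -> bool:
        names = self.allowed.get(tool.client)
        return names is not None and ("*" in names or tool.name in names)

def visible_tools(key: VirtualKey, registry: list[Tool]) -> list[Tool]:
    """Only tools on the key's allow-list are ever placed in context."""
    return [t for t in registry if key.permits(t)]

REGISTRY = [
    Tool("crm", "lookup_customer", 180),
    Tool("crm", "delete_customer", 200),
    Tool("billing", "get_invoice", 160),
    Tool("admin", "rotate_keys", 220),
]

support = VirtualKey("support-agent", {"crm": {"lookup_customer"}, "billing": {"*"}})
admin = VirtualKey("internal-admin", {"crm": {"*"}, "billing": {"*"}, "admin": {"*"}})

support_tools = visible_tools(support, REGISTRY)
support_tokens = sum(t.schema_tokens for t in support_tools)
admin_tokens = sum(t.schema_tokens for t in visible_tools(admin, REGISTRY))
print([t.name for t in support_tools], support_tokens, admin_tokens)
```

In this toy registry the support key pays for two schemas instead of four, and the destructive tools are not merely unaffordable but invisible to the model.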

Virtual keys are the primary governance entity in Bifrost. The same key that restricts tool access also enforces budget limits and rate limits, so tool filtering and cost governance are configured in one place rather than two separate systems.


Technique 3: Semantic Caching for Repeated Queries

Token optimization is not only about reducing what goes into the context window per turn. It also means avoiding redundant calls altogether. In most production deployments, a meaningful percentage of incoming requests are semantically equivalent to requests the system has already answered.

Semantic caching in Bifrost stores responses based on meaning rather than exact string matching. A query like "what are the open tickets assigned to the platform team" and a rephrased version of the same question will hit the same cache entry if their semantic similarity exceeds the configured threshold.

Cache hits return in approximately 5ms versus 2,000ms or more for a full provider round-trip. For MCP-enabled workflows, this means the entire tool-calling sequence for a cached query is bypassed: no schema loading, no tool execution, no inference tokens consumed.
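The lookup logic behind a semantic cache can be sketched as follows. This is a toy illustration, not Bifrost's implementation: it uses word counts as a stand-in for a real embedding model, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str):
        """Return the cached response of the most similar query, or None on a miss."""
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

cache = SemanticCache(threshold=0.8)  # assumed threshold
cache.put("what are the open tickets assigned to the platform team", "3 open tickets")

hit = cache.get("which open tickets are assigned to the platform team")
miss = cache.get("how do I reset my password")
print(hit, miss)
```

The threshold is the key tuning knob: set it too low and unrelated questions receive stale answers, set it too high and rephrasings stop matching.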

The categories of workloads where semantic caching delivers the strongest returns:

  • Customer support agents: a high proportion of questions cluster around the same topics (billing, shipping, account status), and while the exact phrasing varies, the underlying query is often identical
  • Internal knowledge assistants: employees rephrase the same policy or documentation questions repeatedly
  • FAQ-style applications: structured question sets with bounded topic ranges produce high cache hit rates
  • Repeated analytical queries: reporting agents that answer variations of the same metric questions across multiple users or sessions

Semantic caching is configured through the Bifrost web UI or via config.json and requires no changes to application code.


Technique 4: Cost-Aware Model Routing

Not every request in an MCP workflow requires the same model. Agentic pipelines typically involve a mix of operations: task orchestration, tool selection, summarization, and final response generation. Running all of these steps against a frontier model when smaller, cheaper models can handle the lighter ones is a common source of avoidable cost.

Routing rules in Bifrost direct traffic to specific models, providers, and API keys based on configurable criteria. Combined with virtual key configuration, routing can enforce that:

  • High-complexity reasoning steps use a capable frontier model
  • Summarization and classification steps route to a faster, lower-cost model
  • Non-production environments (staging, testing, CI) route to the cheapest viable option

Routing decisions can also incorporate real-time provider health. Automatic fallbacks reroute traffic when a primary provider returns errors or exceeds latency thresholds, avoiding retries that would compound token costs and add latency.
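A rule-based router with a fallback chain might look like the sketch below. The model names, the step categories, and the shape of the health check are hypothetical placeholders; Bifrost's actual routing rules are configured in the gateway, not written in application code.

```python
# Ordered preference per workflow step; first healthy model wins.
# All model names are placeholders.
ROUTES = {
    "reasoning": ["frontier-model", "frontier-model-backup"],
    "summarization": ["small-model", "frontier-model"],
    "classification": ["small-model", "frontier-model"],
}
NON_PRODUCTION = {"staging", "testing", "ci"}
CHEAPEST_CHAIN = ["small-model"]

def pick_model(step: str, environment: str, healthy: set[str]) -> str:
    """Choose a model by step type and environment, skipping unhealthy providers."""
    chain = CHEAPEST_CHAIN if environment in NON_PRODUCTION else ROUTES[step]
    for model in chain:
        if model in healthy:
            return model
    raise RuntimeError(f"no healthy model available for step {step!r}")

all_up = {"frontier-model", "frontier-model-backup", "small-model"}
print(pick_model("reasoning", "production", all_up))
print(pick_model("summarization", "production", all_up))
print(pick_model("reasoning", "ci", all_up))
print(pick_model("summarization", "production", {"frontier-model"}))
```

Falling back to the next model in the chain on the first failure avoids resending the same prompt to a degraded provider, which is where retry costs accumulate.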

For teams running MCP-heavy agentic workflows across multiple providers, cost-aware routing and failover are configured together at the virtual key level. The same configuration that restricts tool access also determines which provider handles the request and what fallback chain applies if that provider is unavailable.


Technique 5: Budget Controls and Rate Limits to Cap Runaway Spend

The four techniques above reduce the tokens consumed, or the price paid for them, on each request. Budget controls ensure that aggregate consumption across all requests stays within defined limits, even when individual requests are already optimized.

Bifrost enforces spending limits hierarchically through its governance system:

  • Virtual key level: hard budget caps per consumer, per project, or per agent role
  • Team level: aggregate budget ceilings across a team or department
  • Customer level: per-customer spending limits for multi-tenant deployments

These limits are enforced at request time, not reported post-hoc. When a virtual key reaches its budget ceiling, requests under that key stop rather than continuing to accumulate costs. Rate limits add a second control layer that bounds request volume independent of token count.
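Request-time enforcement across the three levels can be sketched like this. It is a conceptual illustration with hypothetical class names and limits, and it assumes the cost of a request can be estimated up front; it does not reflect how Bifrost tracks spend internally.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit_cents: int
    spent_cents: int = 0

    def can_afford(self, cost_cents: int) -> bool:
        return self.spent_cents + cost_cents <= self.limit_cents

class BudgetExceeded(Exception):
    """Raised before the request is sent, so a blocked request costs nothing."""

def charge(cost_cents: int, levels: dict[str, Budget]) -> None:
    # Check every level first; a request must fit within all of them.
    for name, budget in levels.items():
        if not budget.can_afford(cost_cents):
            raise BudgetExceeded(name)
    # Only record spend once every level has approved the request.
    for budget in levels.values():
        budget.spent_cents += cost_cents

levels = {
    "virtual_key": Budget(limit_cents=100),   # assumed limits
    "team": Budget(limit_cents=500),
    "customer": Budget(limit_cents=1000),
}

charge(40, levels)
charge(40, levels)
try:
    charge(40, levels)
    blocked_by = None
except BudgetExceeded as exc:
    blocked_by = str(exc)

print(blocked_by, levels["virtual_key"].spent_cents)
```

Checking all levels before recording any spend keeps the counters consistent: a request rejected at one level leaves the other levels untouched.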

For MCP-specific workloads, the governance layer also integrates with tool filtering. A single virtual key configuration controls which tools are accessible, what the token budget is, and how many requests per minute are permitted. Governance is applied consistently whether the request goes through classic MCP, Code Mode, or any other execution path.

This matters most for agentic deployments where a single errant agent run can trigger a chain of tool calls that consumes an order of magnitude more tokens than expected. Budget caps provide a circuit breaker that limits the blast radius of any individual runaway workflow.


Combining the Techniques

These five techniques are not mutually exclusive. In a production MCP deployment, they work at different layers and can be applied simultaneously:

  • Code Mode reduces per-turn schema token overhead (most impactful for 3+ servers)
  • Tool filtering limits context exposure to only what each consumer needs
  • Semantic caching eliminates token spend on repeated or near-duplicate queries
  • Model routing directs each step in a workflow to the cheapest model capable of handling it
  • Budget controls enforce aggregate spending limits and prevent runaway agent costs

The Bifrost MCP gateway resources page covers the full architecture for deploying these capabilities together. Teams running large-scale MCP deployments with 10 or more connected servers and hundreds of daily agent runs typically see the largest absolute cost reductions from Code Mode combined with tool filtering. Teams with more bounded tool sets but high query repetition see strong returns from semantic caching alongside virtual key governance.

All five features are available in the open-source Bifrost gateway. Code Mode requires version v1.4.0-prerelease1 or above. It is also available in Bifrost Enterprise for production deployments requiring clustering, RBAC, and compliance controls.


What to Prioritize

For teams starting token optimization for the first time, the highest-impact sequence is:

  1. Enable semantic caching first. It requires no structural changes and returns immediate cost savings on any application with query repetition.
  2. Enable Code Mode for MCP-heavy agent workflows. It is the single largest token lever for workloads connecting 3 or more servers.
  3. Configure tool filtering via virtual keys to scope tool access to what each consumer actually needs.
  4. Add cost-aware routing rules to direct lighter operations to lower-cost models.
  5. Set budget caps and rate limits to bound aggregate spend and prevent runaway workloads from erasing the gains made by the first four steps.

To see how these capabilities apply to your MCP deployment, book a demo with the Bifrost team.