MCP Cost Optimization: Cut Agent Token Spend by Over 90%

Reduce MCP token costs by up to 92% using Code Mode, scoped tool governance, and semantic caching, without giving up capability or accuracy.

As Model Context Protocol adoption moves into production, teams are running into a cost problem that wasn't obvious during prototyping. Connect five MCP servers, run a few hundred agent queries a day, and the token bill starts climbing in ways that aren't easy to explain or control. This guide covers the structural reasons MCP costs escalate and the mechanisms Bifrost provides to bring them under control.

Why MCP Token Costs Escalate

The default MCP execution model was designed for connectivity, not cost efficiency. Every time an agent makes a request, the MCP client injects the full definitions of every connected tool into the model's context window before the prompt even begins.

Five servers with 30 tools each means 150 tool definitions loaded on every single request. Each definition carries the tool name, description, parameter schema, and return type in verbose JSON format. At scale, this "startup cost" alone can consume tens of thousands of tokens per query before any actual work is done.
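To make the overhead concrete, here is a rough sketch of a single MCP tool definition and a back-of-envelope token estimate. The tool name, fields, and the ~4-characters-per-token ratio are illustrative assumptions, not Bifrost's actual wire format or a precise tokenizer:

```python
import json

# Hypothetical MCP tool definition; the name and schema are illustrative.
tool_definition = {
    "name": "drive_search_files",
    "description": "Search files in Google Drive by name, owner, or content.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search expression"},
            "max_results": {"type": "integer", "description": "Cap on returned files"},
        },
        "required": ["query"],
    },
}

def estimate_tokens(obj) -> int:
    # Crude estimate: roughly 4 characters per token for JSON-heavy text.
    return len(json.dumps(obj)) // 4

per_tool = estimate_tokens(tool_definition)
per_request = per_tool * 150  # 5 servers x 30 tools, injected on every request
print(per_tool, per_request)
```

Even a modest definition like this one multiplies quickly: 150 of them arrive in context before the agent has read the user's question.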

The problem compounds as you add more MCP servers. Each new integration makes the per-request overhead larger, not just in proportion to the new tools but across every request in every agent loop. Intermediate tool results flow back through the model too, so multi-step workflows accumulate tokens at every turn. Anthropic's engineering team documented this pattern, showing a Google Drive to Salesforce workflow consuming over 150,000 tokens in context under the classic approach.

The standard advice to "trim your tool list" is not a real solution. Reducing connected tools means giving up capability to control cost. These should not be the only two options.

The Compounding MCP Cost Problem

MCP costs do not scale linearly. They compound. Here is why.

In a classic MCP setup, the full tool list appears in context on every turn of the agent loop. A 10-turn workflow does not pay for 10 requests times the cost of one tool list. It pays for 10 requests times the cost of the full tool list plus the accumulated intermediate results from every prior turn. The more servers you connect and the longer the agent runs, the worse the ratio becomes.
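The compounding effect above can be modeled with simple arithmetic. The token figures below are illustrative assumptions, not benchmark numbers:

```python
# Back-of-envelope model of the compounding cost of a classic MCP agent loop.
TOOL_LIST_TOKENS = 30_000        # full tool definitions, re-injected every turn
RESULT_TOKENS_PER_TURN = 2_000   # intermediate tool results fed back each turn

def classic_loop_tokens(turns: int) -> int:
    total = 0
    for turn in range(turns):
        # Every turn re-pays the tool list plus all results accumulated so far.
        total += TOOL_LIST_TOKENS + turn * RESULT_TOKENS_PER_TURN
    return total

print(classic_loop_tokens(1))   # 30,000
print(classic_loop_tokens(10))  # 390,000 -- well over 10x the single-turn cost
```

The 10-turn run costs 390,000 tokens, not the 300,000 that naive linear scaling would predict, and the gap widens with every additional turn or server.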

Research on this pattern has described it as the MCP Tax: the overhead imposed by loading verbose JSON schemas for all connected tools upfront, even when the agent will only use a small subset. In workloads with large tool counts, this tax can account for the majority of total token spend before any meaningful computation begins.

Code Mode: A Fundamentally Different Execution Model

Bifrost's Code Mode addresses this at the architecture level rather than through tool trimming. Instead of injecting every tool definition into context, Code Mode exposes connected MCP servers as a virtual filesystem of lightweight Python stub files. The model discovers tools on demand, reads only the signatures it needs, writes a short orchestration script, and Bifrost executes it in a sandboxed interpreter.

The model works through four meta-tools:

| Meta-tool | What it does |
| --- | --- |
| listToolFiles | Discover which servers and tools are available |
| readToolFile | Load Python function signatures for a specific server or tool |
| getToolDocs | Fetch detailed documentation for a specific tool |
| executeToolCode | Run the orchestration script against live tool bindings |

Intermediate results stay inside the execution environment and never re-enter the model's context. The model receives only the final output. This means token usage is bounded by what the model actually reads, not by how many tools are connected.
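The kind of script the model hands to executeToolCode might look like the sketch below. The tool functions here are stand-in stubs with hypothetical names; in Bifrost the sandbox binds such calls to live MCP tools, and only the final result returns to the model:

```python
# Stand-in stubs for two MCP tools (illustrative names, not real bindings).
def drive_search_files(query):
    return [{"id": "f1", "name": "Q3 accounts.csv"}]

def salesforce_update_record(record_id, fields):
    return {"id": record_id, "status": "updated"}

# Orchestration: intermediate results stay inside the script's scope.
files = drive_search_files("Q3 accounts")
updates = [
    salesforce_update_record("001xx", {"source_file": f["name"]})
    for f in files
]

# Only this summary re-enters the model's context.
result = {"files_found": len(files), "records_updated": len(updates)}
print(result)
```

The full file listing and every Salesforce response exist only inside the sandbox; the model pays tokens for one small summary dict instead of every intermediate payload.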

Bifrost ran three rounds of controlled benchmarks with Code Mode on and off, scaling the number of connected tools between rounds:

| Configuration | Input Tokens (OFF) | Input Tokens (ON) | Cost Reduction |
| --- | --- | --- | --- |
| 96 tools, 6 servers | 19.9M | 8.3M | −55.7% |
| 251 tools, 11 servers | 35.7M | 5.5M | −83.4% |
| 508 tools, 16 servers | 75.1M | 5.4M | −92.2% |

Accuracy held at a 100% pass rate in all three rounds, so the savings are not a tradeoff against quality. They compound as tool count grows because the classic approach's overhead scales with every additional server while Code Mode's does not. At roughly 500 connected tools, Code Mode cuts per-query token usage by a factor of about 14 (75.1M vs 5.4M input tokens).

The execution sandbox runs Starlark, a deterministic Python-like language with no imports, no file I/O, and no network access. This makes script execution fast, predictable, and safe to run inside automated agent loops.

You can review the full benchmark methodology and results in Bifrost's detailed Code Mode benchmark report.

Scoped Access Reduces Token Exposure by Design

Token optimization and access control are not separate problems. How many tool definitions end up in context is partly a governance question: if every agent can see every tool from every server, every request carries the full overhead.

Bifrost's virtual keys scope tool access at the individual consumer level. Each virtual key specifies exactly which tools it is permitted to call. A key provisioned for a customer-facing agent cannot reach internal admin tooling, and the model never receives definitions for tools outside the key's scope. There is no prompt-level workaround because excluded tools are simply not present in context.

MCP Tool Groups extend this to organizational scale. A tool group is a named collection of tools from one or more MCP servers. You define the group once and attach it to any combination of virtual keys, teams, customers, or users. Bifrost resolves the correct tool set at request time with no database queries; everything is indexed in memory and synced across cluster nodes automatically.

The combination of virtual keys and tool groups means the model context is constrained by policy, not just by Code Mode mechanics. Agents only see what they are authorized to use, which directly reduces token overhead for every request.
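A simplified model of that resolution logic is sketched below. The data shapes and names are hypothetical and do not mirror Bifrost's actual configuration schema; the point is that scope is resolved per key, so out-of-scope tools never reach the model:

```python
# Illustrative tool groups spanning multiple MCP servers.
TOOL_GROUPS = {
    "support-tools": ["zendesk.search_tickets", "kb.lookup_article"],
    "admin-tools": ["billing.refund", "users.delete_account"],
}

# Each virtual key is attached to one or more groups.
VIRTUAL_KEYS = {
    "vk-support-agent": {"groups": ["support-tools"]},
    "vk-internal-ops": {"groups": ["support-tools", "admin-tools"]},
}

def resolve_tools(virtual_key: str) -> set:
    # Only tools inside the key's groups are ever placed in model context.
    groups = VIRTUAL_KEYS[virtual_key]["groups"]
    return {tool for g in groups for tool in TOOL_GROUPS[g]}

print(resolve_tools("vk-support-agent"))  # admin tools are simply absent
```

Because exclusion happens before context assembly, there is nothing for a prompt injection to "ask for": the definition was never serialized into the request.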

Semantic Caching for Repeated Queries

Beyond tool-level optimization, Bifrost's semantic caching reduces costs for LLM requests where similar queries recur. Rather than processing each request independently, Bifrost checks whether a semantically equivalent query has been answered recently and returns the cached result. This benefits teams running agents on repetitive workflows: customer support pipelines, data extraction loops, and scheduled monitoring agents all see meaningful reduction in both cost and latency.

Semantic caching and Code Mode operate at different layers, so their savings stack: caching eliminates redundant model calls entirely, while Code Mode reduces the token cost of each model call that does occur.
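The mechanics of a semantic cache can be sketched in a few lines. Production systems use learned embeddings and a vector index; this toy version hashes word counts so the example stays self-contained, and the threshold is an arbitrary assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.75):
        self.entries = []  # list of (vector, cached answer)
        self.threshold = threshold

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer  # cache hit: no model call needed
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("what is our refund policy", "30 days, no questions asked")
print(cache.get("what is the refund policy"))  # near-duplicate phrasing hits
```

A rephrased query lands on the cached answer without touching the model, which is exactly the repetitive-workflow case (support pipelines, scheduled monitors) where the savings accumulate.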

Per-Tool Cost Visibility

Controlling costs requires measuring them accurately. MCP agent runs have two cost components: LLM tokens and external tool invocations. Many tools call paid APIs, such as search providers, enrichment services, or code execution environments, and each invocation carries a direct cost.

Bifrost tracks cost at the tool level using a pricing configuration you define per MCP client. These per-tool costs appear in logs alongside LLM token costs, giving a complete picture of what each agent run actually consumed. The observability layer rolls this up into spend over time, broken down by tool, virtual key, and MCP server.
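The rollup logic amounts to summing two cost streams per run. The pricing values and field names below are assumptions for illustration, not Bifrost's actual pricing configuration:

```python
# Hypothetical per-tool pricing, in USD per invocation.
TOOL_PRICING = {
    "search.web_search": 0.005,
    "sandbox.run_code": 0.002,
}
LLM_PRICE_PER_1K_INPUT = 0.003  # example model input rate

def run_cost(input_tokens: int, tool_calls: list) -> float:
    # Combine LLM token cost with per-invocation tool costs.
    llm = input_tokens / 1000 * LLM_PRICE_PER_1K_INPUT
    tools = sum(TOOL_PRICING.get(t, 0.0) for t in tool_calls)
    return round(llm + tools, 6)

# One agent run: 12k input tokens, two search calls, one sandbox execution.
print(run_cost(12_000, ["search.web_search", "search.web_search", "sandbox.run_code"]))
```

Attributing both streams to the same run is what makes per-tool, per-key, and per-server spend breakdowns possible downstream.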

Every tool execution is a first-class log entry: tool name, server, arguments, result, latency, the virtual key that triggered it, and the parent LLM request that initiated the agent loop. You can trace any agent run in full detail or filter by virtual key to audit a specific team or customer's usage. Bifrost's enterprise audit logs capture immutable records suitable for SOC 2, GDPR, and HIPAA compliance workflows.

Getting Started with Bifrost MCP Gateway

Bifrost's MCP gateway exposes all connected MCP servers through a single /mcp endpoint. Setting up Code Mode, access control, and cost tracking takes a few steps:

  1. Add an MCP client in the Bifrost dashboard, specifying the connection type (HTTP, SSE, or STDIO) and any required auth headers. Bifrost discovers tools automatically and syncs them on a configured interval.
  2. Enable Code Mode in the client settings. No schema changes or redeployment required. Token usage drops immediately on the next request.
  3. Set auto-execution rules for tools you want the agent to run autonomously. Read-only meta-tools (listToolFiles, readToolFile, getToolDocs) are always auto-executable. executeToolCode becomes auto-executable only when every tool the generated script calls is on the allowlist.
  4. Create virtual keys and assign tool scopes for each consumer. Use MCP Tool Groups to manage access across teams, customers, or users without configuring each key individually.
  5. Connect your agent by pointing it to Bifrost's /mcp endpoint. Claude Code, Cursor, and any other MCP-compatible client can connect through a single URL. Add new MCP servers to Bifrost and they appear automatically, with no client-side configuration changes needed. See the Claude Code integration guide for a step-by-step walkthrough.
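The client-side configuration from step 5 typically reduces to a single server entry. The sketch below uses key names common to MCP clients and a placeholder URL; your client's exact schema and your Bifrost endpoint will differ:

```python
# Hypothetical MCP client configuration pointing an agent at Bifrost's
# single /mcp endpoint. URL and header values are placeholders.
mcp_config = {
    "mcpServers": {
        "bifrost": {
            "type": "http",
            "url": "https://bifrost.internal.example.com/mcp",
            "headers": {"Authorization": "Bearer <virtual-key>"},
        }
    }
}

# The agent sees every server Bifrost exposes through this one entry;
# adding servers in Bifrost requires no client-side changes.
print(list(mcp_config["mcpServers"]))
```

Because the virtual key travels in the auth header, the same endpoint serves every team while each consumer resolves to its own tool scope.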

The LLM Gateway Buyer's Guide covers the full capability matrix for evaluating MCP gateway infrastructure, including governance, performance, and enterprise deployment requirements.

Reduce MCP Costs Without Reducing Capability

MCP cost optimization does not have to mean doing less. The token overhead that makes MCP expensive in naive deployments is a solvable infrastructure problem. Code Mode reduces per-query token usage by up to 92% as tool count grows, virtual keys limit tool exposure by policy, semantic caching eliminates redundant model calls, and per-tool cost tracking provides the visibility needed to identify where spend is actually going.

Bifrost handles all of this through a single gateway, with the same platform managing your LLM provider routing, fallbacks, load balancing, and spend controls alongside your MCP infrastructure.

To see how Bifrost can reduce your MCP token costs in production, book a demo with the Bifrost team.