Reduce MCP Token Costs With Multiple MCP Tools by Up to 92%
Learn how to reduce MCP token costs by up to 92% using tool filtering, Code Mode, and gateway-level governance, without sacrificing the agent capabilities your workflows depend on.
MCP token costs are one of the fastest-growing line items in AI infrastructure budgets. The reason is structural: every time an agent makes a request, the full catalog of tool definitions from every connected MCP server loads into the context window. With three servers and 30 tools per server, that is 90 tool definitions injected before the model reads a single character of the user's prompt. At 200 tokens per definition, you are paying for 18,000 tokens of overhead on every call. Scale to ten servers with mixed tool sets and the number climbs past 100,000 tokens per request.
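The overhead arithmetic above can be sketched directly, using the same figures from the example (3 servers, 30 tools each, roughly 200 tokens per definition):

```typescript
// Back-of-envelope context overhead from MCP tool definitions.
// Figures mirror the example above: 3 servers x 30 tools, ~200 tokens each.
const servers = 3;
const toolsPerServer = 30;
const tokensPerDefinition = 200;

const definitions = servers * toolsPerServer;             // 90 tool definitions
const overheadTokens = definitions * tokensPerDefinition; // 18,000 tokens per request

console.log(`${definitions} definitions -> ${overheadTokens} tokens of overhead per call`);
```

Every one of those tokens is billed before the model reads the user's prompt, on every single request.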
The instinctive response is to connect fewer servers or strip out less-used tools. That trades cost for capability, which is the wrong tradeoff. Bifrost, the open-source AI gateway by Maxim AI, addresses the problem at the infrastructure layer through tool filtering, Code Mode, and centralized governance, so teams can reduce MCP token costs by 50 to 92 percent while keeping every tool available to the agents that need them.
Why MCP Token Costs Scale So Aggressively
The core problem is how MCP clients handle tool discovery by default. Most implementations load the full schema for every registered tool, on every request, regardless of whether those tools are relevant to the current task.
Each tool definition includes a name, a natural language description, and a parameter schema. For well-documented tools, that schema is detailed and verbose. A single large MCP server can consume 10,000 to 17,000 tokens of context per request just for tool descriptions. Combine several large servers and it becomes easy to spend 30,000 or more tokens on tool metadata in every request.
For a 93-tool GitHub MCP server, the full tool catalog injection is 55,000 tokens before the agent does anything. Connect three services and you have burned 143,000 of a 200,000-token window, roughly 72 percent, before the agent begins.
There are two distinct sources of token waste in MCP workflows:
- Schema bloat: Tool definitions injected into every request, whether the agent uses those tools or not
- Response bloat: Raw API payloads returned from tool calls, full of fields the agent does not need to reason over
Solving only one while ignoring the other leaves significant cost on the table. An effective approach to reducing MCP token costs addresses both.
Tool Filtering: Expose Only What Each Consumer Needs
The most direct way to reduce MCP token costs is to control which tools each agent or consumer actually sees. If a coding agent only needs filesystem and code execution tools, there is no reason to inject the full catalog of CRM, analytics, and communication tools into its context window on every request.
Without filtering, the model evaluates every tool on every request. With pre-filtering, the model evaluates only the relevant tools for the task. In benchmarks, tokens dropped from 23,000 to 450 per request, response times fell from 3.4 seconds to under 400 milliseconds, and accuracy jumped from 42 to 85 percent.
Bifrost's virtual key-based tool filtering implements this at the gateway layer. Each virtual key is scoped to a specific set of MCP servers and tools. When an agent authenticates with a virtual key, it only sees the tools that key is permitted to access. The full catalog remains connected to the gateway, but no individual agent receives more tool definitions than its access scope requires.
This produces two benefits beyond token savings: it enforces least-privilege access across the MCP infrastructure, and it reduces the cognitive load on the model. Fewer irrelevant tools in context means fewer chances for the model to call the wrong tool or spend reasoning tokens evaluating options it will never select.
The Bifrost governance layer handles virtual key creation, scoping, and management through a single configuration surface, so changes propagate centrally without requiring per-client updates across every agent deployment.
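The scoping idea behind virtual key-based filtering can be sketched in a few lines. The types and names below are illustrative, not Bifrost's actual API: the point is that the gateway keeps the full catalog registered while each consumer sees only its permitted slice.

```typescript
// Illustrative sketch of virtual-key tool scoping (not Bifrost's real API).
interface ToolDef {
  server: string;
  name: string;
  schema: string; // full JSON Schema, the expensive part
}

// Hypothetical scope attached to a virtual key.
interface VirtualKeyScope {
  allowedServers: Set<string>;
  allowedTools: Set<string>;
}

// Return only the definitions this key may access; the full catalog
// stays connected at the gateway, but never enters this agent's context.
function visibleTools(catalog: ToolDef[], scope: VirtualKeyScope): ToolDef[] {
  return catalog.filter(
    t => scope.allowedServers.has(t.server) && scope.allowedTools.has(t.name)
  );
}

const catalog: ToolDef[] = [
  { server: "github", name: "create_issue", schema: "{...}" },
  { server: "crm", name: "lookup_contact", schema: "{...}" },
  { server: "fs", name: "read_file", schema: "{...}" },
];

// A coding agent's key: filesystem and repo tools only.
const codingAgentKey: VirtualKeyScope = {
  allowedServers: new Set(["github", "fs"]),
  allowedTools: new Set(["create_issue", "read_file"]),
};

const visible = visibleTools(catalog, codingAgentKey); // CRM tool never enters context
```

The filtering runs once at the gateway, so every request from that consumer pays only for the schemas inside its scope.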
Code Mode: Replacing Schema Injection with TypeScript Declarations
Tool filtering controls which schemas enter context. Code Mode changes the format of those schemas entirely, producing dramatically smaller representations that preserve full tool functionality.
In standard MCP mode, the model receives a full JSON Schema for each tool: typed parameters, descriptions, enum values, required flags, nested objects. Readable, but verbose. In Bifrost's Code Mode, the gateway converts tool definitions into compact TypeScript function declarations instead. The model gets the same semantic information about how to call each tool, expressed in a format that is significantly more token-efficient.
Classic MCP dumps 100 or more tool definitions into every LLM call. Bifrost's Code Mode generates TypeScript declarations instead, cutting token usage by 50 percent or more and latency by 40 to 50 percent. If you are running 3 or more MCP servers, this is the single biggest cost lever available.
The mechanics work as follows: instead of injecting full JSON Schemas, the gateway presents the agent with a set of typed function signatures. The agent uses those signatures to write code that calls the tools it needs. Intermediate results stay in the execution environment and only the final output returns to the model context. This means that for multi-step workflows, intermediate data never consumes tokens at all.
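To make the size difference concrete, here is an illustrative conversion of a JSON Schema-style tool definition into a compact TypeScript declaration. This is a sketch of the idea, not Bifrost's actual generator; the tool and field names are hypothetical.

```typescript
// Minimal JSON Schema-style tool definition (simplified for illustration).
interface JsonSchemaTool {
  name: string;
  description: string;
  parameters: {
    properties: Record<string, { type: string }>;
    required?: string[];
  };
}

// Render the same semantic information as a one-line TypeScript signature.
function toDeclaration(tool: JsonSchemaTool): string {
  const params = Object.entries(tool.parameters.properties)
    .map(([key, prop]) => {
      const optional = tool.parameters.required?.includes(key) ? "" : "?";
      return `${key}${optional}: ${prop.type}`;
    })
    .join(", ");
  return `declare function ${tool.name}(${params}): Promise<unknown>; // ${tool.description}`;
}

const lookupUser: JsonSchemaTool = {
  name: "lookup_user",
  description: "Fetch a user record by id",
  parameters: {
    properties: { id: { type: "string" }, verbose: { type: "boolean" } },
    required: ["id"],
  },
};

const decl = toDeclaration(lookupUser);
// -> declare function lookup_user(id: string, verbose?: boolean): Promise<unknown>; // Fetch a user record by id
```

A verbose multi-line JSON Schema collapses into a single declaration line, which is where the 50-percent-plus schema savings come from.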
When agents use code execution with MCP, intermediate results stay in the execution environment by default. The agent only sees what you explicitly log or return, meaning data you do not need to share with the model can flow through the workflow without ever entering the model's context.
Code Mode is particularly effective in high-volume workflows where the same tools are invoked repeatedly, such as coding agents working across file systems, repositories, and build tools, or data agents querying and transforming records across multiple systems.
Agent Mode: Autonomous Execution Without Round-Trip Overhead
A separate but complementary approach to reducing MCP token costs is to minimize the number of model inference calls required to complete a workflow.
Standard agentic workflows operate in a loop: the model decides on a tool call, the gateway executes it, the result returns to the model, the model decides on the next action. Each loop iteration is a full inference call, with the full token payload. For a ten-step workflow, you pay for ten inference calls.
Bifrost's Agent Mode moves sequential decision-making to the gateway layer. You configure which tools are pre-approved for auto-execution via tools_to_auto_execute and set a max_depth to prevent runaway loops. The gateway then handles the iterative execution autonomously, without routing each intermediate step back through the model for approval.
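A configuration for this might look like the sketch below. Only the `tools_to_auto_execute` and `max_depth` names come from the description above; the surrounding shape and the tool names are illustrative, so consult the Bifrost documentation for the exact format.

```typescript
// Hypothetical Agent Mode configuration sketch (shape is illustrative).
const agentModeConfig = {
  // Tools the gateway may invoke autonomously, without a model round-trip.
  tools_to_auto_execute: ["read_file", "run_tests"],
  // Hard cap on gateway-side loop iterations to prevent runaway execution.
  max_depth: 8,
};
```

With a cap like this, a ten-step workflow that previously cost ten inference calls can complete with far fewer, since pre-approved steps never route back through the model.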
This is distinct from Code Mode. Agent Mode is for workflows where you want the LLM to act autonomously within boundaries. Code Mode is for when you want to reduce token costs on tool-heavy operations. Both are available in Bifrost and can be used independently or together depending on the workflow structure.
For long-running agentic tasks with predictable tool sequences, Agent Mode reduces both the number of inference calls and the total tokens consumed across a session.
Response Filtering: Cutting Output Token Costs
Schema bloat drives input token costs. Response bloat drives output token costs. Both contribute to the total MCP token bill.
When an agent calls a tool that returns structured data, the full API payload enters the context window by default. A user lookup might return 40 fields when the agent needs 3. A record list might return 500 rows when the agent needs to know whether any records match a condition.
MCP response filtering reduces the token cost of tool outputs by requesting only the fields the agent needs, rather than returning full API payloads. At the gateway layer, Bifrost supports response shaping so that tool outputs are trimmed to relevant fields before they re-enter context. This is configurable per tool and per virtual key, allowing precise control over what each consumer receives.
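The shaping step itself is simple to picture. The sketch below shows the general idea of field-scoped trimming; the function and field names are hypothetical, not Bifrost's configuration surface.

```typescript
// Illustrative field-scoped response shaping: trim a raw tool payload
// down to the fields a consumer actually needs before it re-enters context.
type Payload = Record<string, unknown>;

function shapeResponse(payload: Payload, fields: string[]): Payload {
  const shaped: Payload = {};
  for (const field of fields) {
    if (field in payload) shaped[field] = payload[field];
  }
  return shaped;
}

// A user lookup that returns dozens of fields, reduced to the 3 the agent reasons over.
const raw: Payload = {
  id: "u_123",
  email: "dev@example.com",
  plan: "pro",
  createdAt: "2024-01-15",
  // ...imagine 36 more fields of CRM metadata here
};

const shaped = shapeResponse(raw, ["id", "email", "plan"]);
// Only these 3 fields are paid for as context tokens.
```

The same pattern applies to list responses: returning a count or a boolean match instead of 500 raw rows is often the difference between tens of tokens and tens of thousands.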
Combined with Code Mode, response filtering addresses both sides of the token equation: compressed schemas reduce what goes in, and field-scoped responses reduce what comes back.
Centralized Governance Without Per-Agent Configuration
One underappreciated cost of managing MCP token optimization at the client level is the operational overhead. If each agent or team manages its own tool scoping, schema settings, and response shaping, configuration drift becomes inevitable. One agent upgrade removes a filter. A new deployment restores the full catalog. Optimization breaks silently in production.
Bifrost consolidates all of this at the MCP gateway layer. Tool filtering, Code Mode, Agent Mode, and response shaping are configured once at the gateway and applied to every consumer that connects through it. Changes propagate immediately without touching individual agent codebases.
The gateway also captures observability data on every tool call, showing which tools consume the most tokens, which consumers are generating the highest costs, and where optimization opportunities remain. This telemetry is available through Bifrost's observability integrations, including Datadog, OpenTelemetry, and Prometheus connectors.
Audit logs provide a full record of tool invocations by consumer, useful both for cost attribution and for compliance in regulated environments. Platform owners can identify which teams or workflows generate disproportionate token costs and apply targeted optimizations without a full infrastructure change.
What to Expect at Scale
The token savings from these approaches compound as the number of connected MCP servers grows.
With tool filtering alone, teams running 5 or more MCP servers with differentiated consumer access patterns typically see a 40 to 60 percent reduction in schema tokens. The exact figure depends on how much overlap exists between what different consumers need.
With Code Mode added, the TypeScript declaration format produces another 30 to 50 percent reduction on top of filtering, reaching aggregate savings of 50 percent or more in most production configurations. In deployments with high tool counts and multi-step workflows, the combined reduction in schema tokens, intermediate result tokens, and inference call count produces savings in the 90 percent range.
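Because each layer removes a fraction of what the previous layer left behind, the savings compound multiplicatively. Under that assumption, a 50 percent cut from filtering followed by a 40 percent cut from Code Mode leaves only 30 percent of the original schema tokens:

```typescript
// Compounded savings: each optimization removes a fraction of what remains,
// so reductions multiply rather than add.
function aggregateReduction(reductions: number[]): number {
  const remaining = reductions.reduce((acc, r) => acc * (1 - r), 1);
  return 1 - remaining;
}

// 50% from tool filtering, then 40% from Code Mode on what is left.
const combined = aggregateReduction([0.5, 0.4]); // -> 0.7, i.e. a 70% aggregate cut
```

Add Agent Mode's reduction in inference call count and response filtering's cut to output tokens, and the 90-percent-range figures for large deployments follow the same multiplicative logic.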
The Bifrost MCP gateway blog documents a deployment achieving 92 percent lower token costs at scale using this combination of approaches, without reducing the number of tools available to agents.
Performance overhead from the gateway itself is minimal: Bifrost adds 11 microseconds per request at 5,000 RPS, a negligible figure compared to the latency improvements produced by shorter context windows and fewer inference round-trips.
Getting Started with Bifrost MCP Token Optimization
Bifrost is open source and available via npm or Docker:

```shell
# npm
npx -y @maximhq/bifrost-cli

# Docker
docker run -p 8080:8080 maximhq/bifrost
```
From there, connect your MCP servers, configure virtual keys with tool-level scoping, and enable Code Mode for high-volume consumers. The MCP gateway resource page includes a setup guide and configuration reference.
Teams evaluating their overall AI infrastructure options can also review the LLM Gateway Buyer's Guide for a detailed capability comparison across gateway options.
Reducing MCP token costs does not require compromising on agent capability. With the right gateway layer in place, token efficiency and tool breadth are not in tension. To see how Bifrost can bring both to your infrastructure, book a demo with the Bifrost team.