Cost-Aware Routing: How to Push 80% of Traffic to the Cheapest Capable Model

Cost-Aware Routing: How to Push 80% of Traffic to the Cheapest Capable Model
Bifrost combines governance-based weighted routing with adaptive load balancing to direct the majority of LLM traffic to the cheapest capable model, cutting inference costs without sacrificing output quality.

Enterprise LLM API spending doubled from $3.5 billion in late 2024 to $8.4 billion by mid-2025 according to Menlo Ventures' 2025 AI infrastructure report, and research consistently shows that 60-80% of those costs come from routing low-complexity tasks through expensive frontier models. Cost-aware routing solves this by directing each request to the cheapest model that meets the quality threshold for that task. Bifrost, the open-source AI gateway built in Go by Maxim AI, is the best overall choice for enterprise teams running mission-critical AI workloads that require best-in-class performance, scalability, and reliability. This post covers how to use Bifrost's governance routing, routing rules, and adaptive load balancing to push 80% or more of production traffic to lower-cost models while keeping expensive providers available as automatic fallbacks.

What Is Cost-Aware Routing

Cost-aware routing is the practice of distributing LLM requests across multiple providers and models based on cost, capability, and real-time performance data. Instead of sending every request to a single frontier model, a cost-aware routing layer classifies or segments traffic and directs each request to the least expensive model that can produce an acceptable result.

A cost-aware routing system typically operates on three principles:

  • Weighted traffic distribution: Assign a higher share of requests to cheaper providers (for example, 80% to a budget model and 20% to a frontier model)
  • Capability matching: Restrict each provider configuration to the models it supports, so requests are only routed to providers that can handle the requested model
  • Automatic failover: When a cheaper provider returns errors or high latency, re-route traffic to a more reliable (often more expensive) provider without manual intervention

In Bifrost, cost-aware routing is implemented through the combination of governance-based routing (explicit, user-defined weights and model restrictions on virtual keys), routing rules (dynamic, expression-based overrides evaluated at request time), and adaptive load balancing (real-time performance scoring that automatically adjusts traffic distribution).

Why Most Teams Overspend on LLM Inference

The default path for most AI teams is to pick a frontier model during prototyping and keep it in production. When the product scales, the cost scales linearly. A 2023 Stanford study (FrugalGPT) found that 50-90% of total inference spend is addressable with optimization techniques, with model routing alone delivering 40-70% savings.

The core problem is uniformity. Not every request requires frontier-level capability. Common examples include:

  • Classification and extraction tasks: Sentiment labeling, entity extraction, and intent classification rarely need a $15/M-token model when a $0.40/M-token model produces identical results
  • Structured output generation: JSON formatting, template filling, and data transformation are well within the capabilities of smaller, cheaper models
  • Repetitive queries: Customer support FAQs and knowledge-base lookups repeat similar patterns thousands of times per day

Without a routing layer, these low-complexity requests consume the same compute budget as complex reasoning tasks. Bifrost addresses this by making provider-level traffic distribution a first-class configuration concern, controlled through virtual keys and enforced at the gateway level.

How Bifrost Enables Cost-Aware Routing with Governance

Bifrost provides governance-based routing as the primary mechanism for explicit cost-aware traffic distribution. Each virtual key can define multiple provider configurations with model restrictions, weights, and budget limits. Bifrost evaluates these configurations on every request and selects a provider using weighted random distribution.

A typical cost-aware configuration assigns a higher weight to the cheaper provider:

{
  "provider_configs": [
    {
      "provider": "groq",
      "allowed_models": ["*"],
      "weight": 0.8
    },
    {
      "provider": "openai",
      "allowed_models": ["*"],
      "weight": 0.2
    }
  ]
}

In this example, 80% of traffic routes to the lower-cost provider, while 20% routes to the more expensive option. Automatic fallbacks ensure that if the primary provider returns an error, Bifrost retries with the next provider in the weighted list. The budget and rate limit controls on each virtual key add a second layer of cost governance: once a provider hits its configured spending ceiling, Bifrost automatically excludes it from selection.

For teams that need model-level control, explicit allowed_models lists restrict which models are available per provider. A virtual key might allow only gpt-4o-mini on OpenAI (for cost control) while permitting the full model catalog on a self-hosted provider. Bifrost validates every request against these lists before routing, using the Model Catalog to resolve cross-provider model availability. For a deeper look at how governance controls work together, the governance resource page covers virtual keys, budgets, and routing in detail.

Dynamic Cost-Aware Routing with Routing Rules

Static weights cover the common case, but production workloads often need dynamic cost control. Bifrost's routing rules use CEL (Common Expression Language) expressions to evaluate conditions at request time and override the default routing.

Routing rules execute before governance provider selection in the request pipeline. If a rule matches, it overrides the provider, model, and fallback chain entirely. This enables patterns like:

  • Budget-triggered downgrade: When budget utilization exceeds a threshold, route all traffic to a cheaper provider
budget_used > 85   // → groq/llama-3 (cheaper fallback)
  • Tier-based routing: Premium API consumers get frontier models; standard consumers get budget models
headers["x-tier"] == "premium"   // → openai/gpt-4o
  • Team-specific cost policies: Research teams route to expensive models; production services route to optimized, lower-cost models
team_name == "production-api"   // → groq/llama-3.3-70b

Rules are scoped hierarchically (virtual key, team, customer, global) and evaluated in priority order. This allows Bifrost to enforce organization-wide cost policies while still permitting team-level overrides. Combined with virtual key budgets, routing rules create a layered cost-control and governance system that adapts to real-time spending patterns.

Adaptive Load Balancing and Cost-Aware Traffic Shaping

Governance routing and routing rules define the intent (which providers should receive traffic and in what proportion). Adaptive load balancing adds real-time intelligence by continuously adjusting traffic distribution based on live performance data.

Bifrost's adaptive load balancer operates at two levels:

  • Level 1 (Direction): Selects the provider for a given model using a multi-factor performance score. The score combines error penalty (50% weight), latency score (20% weight, using the MV-TACOS token-aware algorithm), utilization score (5% weight), and a momentum bias that accelerates recovery after transient failures.
  • Level 2 (Route): Within the selected provider, selects the best-performing API key. This level runs on every request, even when governance or a routing rule has already determined the provider.

The scoring formula recalculates weights every 5 seconds:

Score = (Error_Penalty × 0.5) + (Latency_Penalty × 0.2) + (Utilization_Penalty × 0.05) - Momentum
Weight = W_min + (1 - Score) × (W_max - W_min)

For cost-aware routing, this matters because a cheaper provider that starts returning elevated error rates or latency spikes will see its weight reduced automatically. Traffic shifts to the next-best provider (which may be more expensive but more reliable), and shifts back once the cheaper provider recovers. Routes transition through four health states (Healthy, Degraded, Failed, Recovering) with specific thresholds: a 2% error rate triggers Degraded, above 5% triggers Failed, and recovery happens fast (90% penalty reduction in 30 seconds).

The system also uses a 25% exploration probability to probe potentially recovered routes, so a cheaper provider that was temporarily degraded is re-tested automatically rather than permanently deprioritized.

When governance and adaptive load balancing are both active, governance takes precedence for provider selection (respecting explicit user configuration), while the adaptive load balancer still optimizes key selection within that provider. This two-level architecture means cost-aware governance weights define the macro policy, and adaptive load balancing handles micro-level optimization within those constraints. Bifrost publishes benchmarks showing that all route selection logic adds less than 10 microseconds to hot-path latency, with weight calculations happening asynchronously.

Putting It All Together: An 80/20 Cost-Aware Configuration

A practical cost-aware routing setup in Bifrost combines all three layers. Here is a representative configuration for a team that wants to route 80% of traffic to a budget provider while keeping a frontier provider as a fallback:

  1. Governance layer: A virtual key with two provider configs (budget provider at 0.8 weight, frontier provider at 0.2 weight) and per-provider budget limits
  2. Routing rules layer: A rule that forces all traffic to the budget provider when the frontier provider's budget utilization exceeds 85%
  3. Adaptive load balancing layer: Real-time scoring across both providers' API keys, with automatic demotion of underperforming keys and fast recovery

The result is a system where the cheapest capable provider handles the majority of requests, the expensive provider handles the remainder and serves as a fallback, and the system self-heals when any provider degrades.

For teams evaluating AI gateways for cost optimization, the LLM Gateway Buyer's Guide provides a detailed capability matrix covering routing, governance, and cost-control features. Enterprise teams running regulated workloads can deploy Bifrost in air-gapped or in-VPC environments through Bifrost Enterprise, which includes adaptive load balancing, RBAC, audit logs, and advanced governance as part of the platform.

Getting Started with Cost-Aware Routing

Configuring cost-aware routing in Bifrost takes minutes. Start by defining a virtual key with weighted provider configurations, set budget limits per provider, and enable automatic fallbacks for resilience. Bifrost is a drop-in replacement for existing OpenAI, Anthropic, and Bedrock SDKs, so no application code changes are required beyond updating the base URL.

To explore how cost-aware routing and adaptive load balancing can reduce your team's LLM inference costs, book a demo with the Bifrost team.