Smart LLM Routing: Picking the Optimal Model per Request

Smart LLM routing in production cuts cost and latency by matching each request to the right model. Learn how Bifrost routes at the gateway layer.

Every production AI application eventually hits the same wall: the team picked one default model, costs keep climbing, and latency on simple requests is indistinguishable from latency on the hardest ones. Smart LLM routing solves this by evaluating each request against signals like complexity, budget, tier, and capacity, then sending it to the model that delivers acceptable quality at the lowest cost and latency. Bifrost, the open-source AI gateway by Maxim AI, moves that routing decision out of application code and into the gateway layer where it belongs, so engineering teams can change model policy without redeploying services.

This post covers why per-request routing matters, the signals that should drive it, and how Bifrost implements smart LLM routing in production with expression-based rules, weighted targets, and automatic fallbacks.

Why Per-Request Model Selection Matters

Using a frontier model for every request is one of the most common sources of avoidable spend in production AI. Academic research on cost- and latency-constrained routing shows wide gaps in both dimensions across model tiers on the same hardware, and demonstrates that constrained optimization can maintain quality while keeping each request within its cost and latency budget.

Three forces make per-request routing a first-class production concern:

  • Cost variance across providers: LLM API pricing spans two orders of magnitude. A single routing decision can move a request from a premium tier to a budget tier without a noticeable quality drop for simple tasks.
  • Latency sensitivity at scale: Nielsen Norman Group research on response time thresholds identifies 1 second as the limit for uninterrupted user flow. Time-to-first-token above 300 milliseconds compounds across multi-step agent workflows and degrades user experience quickly.
  • Capacity and reliability pressure: Provider rate limits and regional outages force traffic redistribution in real time. A static single-model deployment has no way to respond when one provider degrades.

Teams that treat model selection as a runtime decision, rather than a hardcoded default, consistently report 40 to 70 percent reductions in blended inference cost while maintaining output quality.

Signals That Should Drive LLM Routing Decisions

Smart LLM routing depends on evaluating the right signals at request time. A routing policy that only inspects the model name misses most optimization opportunities; a policy that inspects too many signals becomes unmaintainable. The production-proven middle ground is a compact set of dimensions that covers the common cases.

The most useful signals for production routing include:

  • Request type: Chat completions, embeddings, image generation, moderation, and transcription have different cost profiles and different acceptable provider sets.
  • Model requested: The caller's declared preference, which the gateway can honor, override, or normalize to a canonical model name.
  • Headers and parameters: Runtime metadata like tier, region, environment, priority, or application version.
  • Organizational context: The virtual key, team, or customer making the request. Different tenants often need different model access rules and budgets.
  • Capacity metrics: Current budget usage, token rate limit consumption, and request rate as percentages. These let the gateway shift load before limits are breached rather than after.
  • Provider health: Real-time signals on which providers are available, which are rate-limited, and which are degraded.

Any routing layer that cannot evaluate these signals in a single expression forces teams to hardcode equivalents into application logic, where changing them requires a deploy.
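
To make this concrete, here is a rough sketch of a single rule that combines several of these signals. It reuses the rule fields and CEL variables that appear in the Bifrost examples later in this post (request_type, headers, budget_used); the specific request-type value, header name, threshold, and model choices are illustrative assumptions rather than a prescribed configuration.

{
  "name": "Premium Chat Under Budget",
  "cel_expression": "request_type == \"chat\" && headers[\"x-tier\"] == \"premium\" && budget_used < 80",
  "targets": [
    { "provider": "openai", "model": "gpt-4o", "weight": 1 }
  ],
  "fallbacks": ["anthropic/claude-sonnet-4"],
  "scope": "global",
  "priority": 5
}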

How Bifrost Enables Smart LLM Routing at the Gateway Layer

Bifrost implements smart LLM routing as a first-class primitive through expression-based routing rules that evaluate CEL (Common Expression Language) conditions on each request. Rules execute before governance provider selection and can override it, giving teams precise control over where every request goes.

The routing pipeline in Bifrost runs in a fixed order on every request:

  1. Routing rules evaluate first: CEL expressions are checked against request context, headers, parameters, capacity metrics, and organizational hierarchy. First match wins.
  2. Governance selects provider if no rule matches: Static provider weights from virtual key governance apply.
  3. Load balancing picks the key: Performance-based key selection runs within the chosen provider, so optimization happens even on rule-driven paths.
  4. Fallbacks activate on failure: If the selected provider fails, Bifrost cascades through configured fallback providers.

Rules are organized by scope with clear precedence: virtual key, team, customer, and global. A virtual key rule wins over a team rule, a team rule over a customer rule, and a customer rule over a global rule; when no rule matches at any scope, the request falls back to Bifrost's default provider routing. This hierarchy lets platform teams set organization-wide policy while giving individual teams room for their own preferences.
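
A minimal sketch of that precedence with two rules is shown below. It follows the field layout of the team-scoped example later in this post; the "virtual_key" scope string, the scope_id, and the request-type value are assumptions for illustration.

{
  "name": "Org Default: Budget Chat",
  "cel_expression": "request_type == \"chat\"",
  "targets": [
    { "provider": "groq", "model": "llama-3.1-70b", "weight": 1 }
  ],
  "scope": "global",
  "priority": 10
}

{
  "name": "Enterprise Key Override",
  "cel_expression": "request_type == \"chat\"",
  "targets": [
    { "provider": "openai", "model": "gpt-4o", "weight": 1 }
  ],
  "scope": "virtual_key",
  "scope_id": "vk-enterprise-search",
  "priority": 10
}

A chat request arriving on that virtual key matches both rules, but the virtual key scope takes precedence, so it routes to gpt-4o; every other caller falls through to the global rule.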

Routing Overhead in Production

Bifrost adds only 11 microseconds of overhead per request at 5,000 requests per second in sustained tests, so the routing logic does not add measurable latency to production workloads. Bifrost's published performance benchmarks document the full methodology.

CEL-Based Routing Rules in Practice

The CEL expression layer is the core of smart LLM routing in Bifrost. Expressions are compiled once, cached as bytecode, and evaluated in microseconds on subsequent requests. The language is restrictive enough to prevent runaway logic and expressive enough to cover the routing patterns teams actually need.

A few concrete examples illustrate the range:

Budget-aware fallback routes to a cheaper provider when the primary budget is nearly exhausted:

{
  "name": "Budget Exhaustion Fallback",
  "cel_expression": "budget_used > 85",
  "targets": [
    { "provider": "groq", "model": "llama-3.1-70b", "weight": 1 }
  ],
  "fallbacks": ["openai/gpt-4o-mini"],
  "scope": "global",
  "priority": 10
}

Tier-based routing sends premium customer traffic to a faster provider with a split between two keys:

{
  "name": "Premium Tier Fast Track",
  "cel_expression": "headers[\\"x-tier\\"] == \\"premium\\"",
  "targets": [
    { "provider": "openai", "model": "gpt-4o", "weight": 0.7 },
    { "provider": "azure",  "model": "gpt-4o", "weight": 0.3 }
  ],
  "fallbacks": ["bedrock/claude-3-opus"],
  "scope": "global",
  "priority": 5
}

Request-type routing keeps expensive chat completions on a premium model while pushing embeddings to a budget provider:

{
  "name": "Cheap Embeddings",
  "cel_expression": "request_type == \\"embedding\\"",
  "targets": [
    { "provider": "openai", "model": "text-embedding-3-small", "weight": 1 }
  ],
  "scope": "global",
  "priority": 20
}

A/B testing splits traffic probabilistically for a canary rollout of a newer model:

{
  "name": "Canary New Model",
  "cel_expression": "true",
  "targets": [
    { "provider": "anthropic", "model": "claude-sonnet-4", "weight": 0.9 },
    { "provider": "anthropic", "model": "claude-opus-4.7", "weight": 0.1 }
  ],
  "scope": "team",
  "scope_id": "team-ml-research",
  "priority": 15
}

Weights within a rule must sum to 1, and each request matching the rule is assigned to a target in proportion to its weight. This is the same primitive used for gradual rollouts, cost-quality splits, and controlled experimentation.

Rule chaining composes multiple decisions. A chain rule can normalize a model alias to a canonical name, then hand off to a second rule that picks the provider for that canonical model. This keeps individual rules simple and auditable.
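
As a rough sketch of the normalization half of that pattern, the rule below maps a caller-facing alias to a canonical provider and model. It assumes the requested model is exposed to CEL as a model variable, which is an assumption here rather than confirmed Bifrost schema, and it omits the chaining hand-off itself.

{
  "name": "Alias: fast-chat",
  "cel_expression": "model == \"fast-chat\"",
  "targets": [
    { "provider": "groq", "model": "llama-3.1-8b-instant", "weight": 1 }
  ],
  "scope": "global",
  "priority": 1
}

Callers request the alias, the gateway decides which concrete model backs it, and a subsequent rule can then apply provider policy for that canonical model.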

Combining Routing with Failover and Load Balancing

Routing rules do not operate in isolation. They are the first layer of a broader reliability stack that also includes automatic retries and fallbacks, adaptive load balancing, and semantic caching. Each layer handles a distinct failure mode:

  • Routing rules determine which provider and model a request should target based on context.
  • Load balancing picks the best API key within the selected provider based on real-time health and performance.
  • Fallbacks cascade to alternate providers if the primary fails, without changing application code.
  • Semantic caching short-circuits the entire routing path when a semantically similar response already exists.

A request can therefore hit the semantic cache, miss, flow through a routing rule that picks a tier-appropriate provider, get a healthy key from adaptive load balancing, and still fall back to a secondary provider if the primary returns an error. All four mechanisms are configured declaratively and observable through built-in telemetry.
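
For reference, a single hypothetical rule can express the rule-level portion of that stack: weighted targets for the routing decision plus an ordered fallback list for the cascade, with key-level load balancing and semantic caching configured separately. The header name and model choices below are placeholders.

{
  "name": "Production Reliability Tier",
  "cel_expression": "headers[\"x-env\"] == \"production\"",
  "targets": [
    { "provider": "openai", "model": "gpt-4o", "weight": 0.8 },
    { "provider": "azure",  "model": "gpt-4o", "weight": 0.2 }
  ],
  "fallbacks": ["anthropic/claude-sonnet-4", "bedrock/claude-3-sonnet"],
  "scope": "global",
  "priority": 1
}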

Real-World Benefits of Smart LLM Routing

Teams that move routing into the Bifrost layer report consistent gains across the metrics that matter in production:

  • Cost reduction on mixed workloads: Routing simple classifications, moderation, and embeddings to budget-tier models while keeping complex reasoning on premium models typically produces 30 to 50 percent reductions in blended monthly inference cost.
  • Latency improvements on high-volume paths: Routing hot paths to fast models, with premium models reserved for the requests that require them, reduces p95 latency on common operations.
  • Zero-downtime provider events: Budget-aware and capacity-aware rules shift traffic before rate limits are hit, removing the 429-error class of incidents.
  • Policy changes without deploys: Updating a routing rule is an API call, not a release. Model cutovers, canary rollouts, and cost ceilings can be adjusted in minutes.
  • Tenant-level differentiation: Virtual key and team-scoped rules let one gateway serve premium customers on premium models and self-serve users on budget models, without splitting infrastructure.

For teams evaluating how smart LLM routing fits a broader gateway decision, the LLM Gateway Buyer's Guide covers the full capability matrix including governance, observability, and deployment models.

Getting Started with Smart LLM Routing in Bifrost

Smart LLM routing in production is the difference between paying for a frontier model on every request and paying for the right model on every request. Bifrost makes per-request model selection a gateway-level concern, evaluated in microseconds, configured in declarative rules, and composable with failover, load balancing, and caching. Teams running Claude Code, Codex CLI, agentic workflows, or high-volume inference pipelines get cost discipline and reliability without touching application code.

To see smart LLM routing running against your actual traffic patterns, book a demo with the Bifrost team and walk through a tier-based, budget-aware, or capacity-aware routing configuration matched to your production workload.