Best LLM Gateways for Monitoring Claude Code Token Spend

Claude Code has become a core part of how engineering teams write, review, and ship code. Anthropic reports that the average cost is approximately $6 per developer per day on API pricing, with daily costs remaining below $12 for 90% of users. At the team level, that translates to roughly $100 to $200 per developer per month with Sonnet, though heavy usage with Opus or multi-agent workflows can push costs significantly higher.
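
As a quick sanity check, the monthly range follows from the daily average; the 21-working-days figure below is an assumption for illustration, not a number from Anthropic's report:

```shell
# Back-of-the-envelope: reported $6/developer/day at ~21 working days/month.
daily_cost_usd=6
workdays_per_month=21
echo $((daily_cost_usd * workdays_per_month))   # → 126, inside the $100-$200 range
```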

The challenge is that Claude Code's built-in cost visibility is limited. The /cost command shows session-level totals, and CLI tools like ccusage parse local JSONL files for historical analysis. But neither approach gives engineering leaders centralized, real-time visibility across an entire team: who is consuming what, which models are driving spend, and whether budget thresholds are being breached before the invoice arrives.

An LLM gateway solves this by sitting between Claude Code and the upstream provider, intercepting every request and response to capture token counts, costs, latency, and model metadata in real time. This guide evaluates the best LLM gateways for monitoring Claude Code token spend, with a focus on the depth of observability, cost tracking granularity, and governance controls each platform provides.

Why Claude Code Needs Gateway-Level Monitoring

Claude Code's native monitoring has three structural gaps that make gateway-level observability essential for teams:

  • No centralized team visibility: Claude Code stores session logs locally on each developer's machine in ~/.claude/projects/. There is no built-in mechanism for aggregating usage across developers, projects, or environments into a single dashboard. Teams using Pro or Max plans cannot track per-developer costs through the Anthropic Console at all.
  • No real-time alerting: The /cost command and ccusage both operate retrospectively. By the time a developer checks their spend, the tokens have already been consumed. There is no way to set proactive budget thresholds or receive alerts when spending approaches a limit.
  • No per-team or per-project cost attribution: Organizations running Claude Code across multiple teams need to understand which team, project, or use case is driving spend. Claude Code does not support tagging or segmenting requests with organizational metadata.

An LLM gateway addresses all three gaps by capturing every API call centrally, tagging requests with custom dimensions, and providing both real-time dashboards and alerting infrastructure.

1. Bifrost

Bifrost is a high-performance, open source AI gateway built in Go that provides the most comprehensive monitoring stack for Claude Code token spend. It connects to Claude Code through a 100% Anthropic-compatible API endpoint at /anthropic; setup requires only two environment variables:

export ANTHROPIC_API_KEY=your-bifrost-virtual-key
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic

This setup routes all Claude Code traffic through Bifrost with zero changes to how developers use the tool.

Token and Cost Metrics

Bifrost's Prometheus-based telemetry captures dedicated counters for every dimension of token spend:

  • bifrost_input_tokens_total and bifrost_output_tokens_total track input and output tokens separately, broken down by provider, model, virtual key ID, virtual key name, and selected API key
  • bifrost_cost_total tracks cost in USD for every upstream request, enabling PromQL queries like sum by (provider) (increase(bifrost_cost_total[1d])) for daily cost estimates and sum by (provider, model) (rate(bifrost_cost_total[5m])) / sum by (provider, model) (rate(bifrost_upstream_requests_total[5m])) for cost-per-request analysis
  • bifrost_stream_first_token_latency_seconds and bifrost_stream_inter_token_latency_seconds capture streaming performance characteristics specific to Claude Code's interactive workflow
  • bifrost_cache_hits_total tracks semantic and direct cache hits, showing exactly how much spend is being avoided through semantic caching

All metrics are collected asynchronously with zero impact on request latency.
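
To make the counter shapes concrete, here is an illustrative aggregation over text-format samples of bifrost_cost_total. The label names match the counters described above, but the sample values are invented; in practice you would run the equivalent PromQL against Prometheus rather than parse scrape output by hand:

```shell
# Sum cost by provider from Prometheus text-format samples (values invented).
cat <<'EOF' > /tmp/bifrost_metrics.txt
bifrost_cost_total{provider="anthropic",model="claude-sonnet-4"} 12.50
bifrost_cost_total{provider="anthropic",model="claude-opus-4"} 31.20
bifrost_cost_total{provider="openai",model="gpt-4o"} 4.10
EOF
awk -F'[}{ ]' '/^bifrost_cost_total/ {
  split($2, kv, ",")                          # kv[1] = provider="..."
  sub(/provider="/, "", kv[1]); sub(/"/, "", kv[1])
  cost[kv[1]] += $NF                          # accumulate counter value per provider
} END { for (p in cost) printf "%s %.2f\n", p, cost[p] }' /tmp/bifrost_metrics.txt \
  | sort > /tmp/cost_by_provider.txt
cat /tmp/cost_by_provider.txt                 # anthropic 43.70 / openai 4.10
```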

Real-Time Request Logging

Beyond aggregated Prometheus metrics, Bifrost's built-in observability captures every individual request with full metadata: input messages, model parameters, provider context, token usage, cost, and latency. The logging plugin operates asynchronously and adds less than 0.1ms overhead.

Logs are accessible through three interfaces:

  • Web UI at http://localhost:8080/logs with real-time streaming, advanced filtering by provider, model, status, token range, cost range, and content search
  • REST API with filtering parameters including min_cost, max_cost, min_tokens, max_tokens, and time range filters
  • WebSocket for live monitoring integrations

The API response includes aggregate stats: total requests, success rate, average latency, total tokens, and total cost.
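
As a sketch of what those aggregate stats amount to, the following computes the same figures locally from invented per-request records (columns: status, total tokens, cost in USD); the gateway itself does this server-side:

```shell
# Aggregate stats from sample per-request log records. All values are invented.
cat <<'EOF' > /tmp/claude_code_requests.txt
success 1200 0.018
success 3400 0.051
error   0    0.000
success 900  0.014
EOF
awk '{ n++; tokens += $2; cost += $3; if ($1 == "success") ok++ }
     END { printf "requests=%d success_rate=%.0f%% tokens=%d cost_usd=%.3f\n",
            n, 100 * ok / n, tokens, cost }' /tmp/claude_code_requests.txt \
  > /tmp/request_stats.txt
cat /tmp/request_stats.txt
```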

Custom Dimensions and Team Attribution

Bifrost solves the team attribution gap through two mechanisms. First, Virtual Keys allow organizations to issue distinct keys per developer, team, or project. All Prometheus metrics include virtual_key_id and virtual_key_name labels, enabling cost breakdowns by team without any developer-side configuration.

Second, dynamic Prometheus label injection via x-bf-prom-* headers allows attaching arbitrary metadata (team, environment, project, organization) to specific requests. Custom labels configured at the gateway level (such as team, environment, organization, project) appear on every metric for filtering and aggregation.
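
For example, since Claude Code can attach extra request headers through its ANTHROPIC_CUSTOM_HEADERS setting, team metadata can ride along on every request without touching application code. The header names below follow the x-bf-prom-* convention; the label values are placeholders:

```shell
# Tag every Claude Code request with team/project labels. Claude Code sends
# headers listed in ANTHROPIC_CUSTOM_HEADERS ("Name: Value", one per line);
# Bifrost then exposes them as Prometheus labels. Values are placeholders.
export ANTHROPIC_CUSTOM_HEADERS=$'x-bf-prom-team: platform\nx-bf-prom-project: checkout'
```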

Budget Controls and Alerting

Bifrost provides hierarchical budget management with spending limits at the virtual key, team, and customer levels. When a budget limit is reached, Bifrost automatically rejects requests before they incur additional cost. Rate limits on both tokens and requests add a second layer of protection.

For alerting, Bifrost's documented production alerting examples include ready-to-use Prometheus alert rules for high cost thresholds (e.g., daily spend exceeding $100 per provider) and high error rates.
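
A minimal rule in that spirit might look like the following; the rule name, $100 threshold, and 15-minute hold are illustrative examples, not values from Bifrost's documentation:

```yaml
# Illustrative Prometheus alerting rule: fire when any provider's spend over
# the trailing day exceeds $100.
groups:
  - name: bifrost-cost-alerts
    rules:
      - alert: HighDailyLLMSpend
        expr: sum by (provider) (increase(bifrost_cost_total[1d])) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Daily spend for {{ $labels.provider }} exceeded $100"
```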

Integration with Observability Platforms

For teams with existing monitoring infrastructure, Bifrost supports OpenTelemetry for distributed tracing with platforms like Grafana, Datadog, New Relic, and Honeycomb. A native Datadog connector (enterprise tier) provides APM traces, LLM Observability integration, and DogStatsD metrics.

Bifrost is open source under Apache 2.0 with 11 microseconds of overhead at 5,000 RPS. Book a demo to evaluate enterprise monitoring capabilities.

2. LiteLLM

LiteLLM is a Python-based proxy that supports routing to 100+ providers including Anthropic. It provides spend tracking through a PostgreSQL-backed virtual key system that logs token usage and cost per request.

LiteLLM's monitoring capabilities include per-key spend tracking, model-level cost breakdowns, and integration with observability platforms like Langfuse and OpenTelemetry. Budget limits can be set per virtual key with automatic request blocking when thresholds are exceeded.

The main limitations for Claude Code monitoring are the Python runtime's higher baseline latency compared to compiled gateways, and the reliance on an external PostgreSQL database for spend tracking. Prometheus metrics are available but require additional configuration. Real-time streaming metrics (time to first token, inter-token latency) are not natively instrumented at the same granularity. Teams running high-throughput Claude Code deployments with dozens of concurrent developers may encounter performance constraints under sustained load.

3. Cloudflare AI Gateway

Cloudflare AI Gateway provides a managed proxy with built-in analytics including request counts, token consumption, and cost tracking across providers. The dashboard shows usage patterns over time and supports filtering by provider and model.

For Claude Code monitoring, the setup involves pointing ANTHROPIC_BASE_URL to a Cloudflare gateway endpoint. Caching (exact-match only, no semantic caching) can reduce redundant API calls, and the analytics dashboard provides cost visibility without self-hosted infrastructure.
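
Concretely, the environment mirrors the Bifrost setup above; ACCOUNT_ID and GATEWAY_ID below are placeholders for your own Cloudflare values, and the /anthropic suffix follows Cloudflare's provider-scoped endpoint pattern:

```shell
# Route Claude Code through Cloudflare AI Gateway instead of a local proxy.
# ACCOUNT_ID and GATEWAY_ID are placeholders.
export ANTHROPIC_BASE_URL="https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/anthropic"
export ANTHROPIC_API_KEY="sk-ant-..."   # your real Anthropic key passes through unchanged
```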

The key constraints are the absence of per-developer or per-team cost attribution (no virtual key system), no self-hosted deployment option, and limited custom metric dimensions. Log retention is capped at 100,000 logs on the free tier. Budget enforcement and automatic request blocking when spending thresholds are exceeded are not available. Teams that need hierarchical governance, custom Prometheus labels, or on-premise deployment will find the monitoring depth insufficient.

4. Anthropic Console

For teams using Claude Code with API keys (not Pro or Max subscriptions), Anthropic's own Console provides native cost tracking through the Usage and Cost API. This includes token consumption broken down by model, workspace, and service tier, with cost reports in USD grouped by workspace or description.

The Console works well as a baseline monitoring layer, especially since Claude Code automatically creates a dedicated workspace for centralized tracking. The API supports time-bucketed usage reports (1 minute, 1 hour, or 1 day intervals) and filtering by API key, workspace, and model.
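
A hedged sketch of pulling a day-bucketed usage report from that API follows; the endpoint path and parameters reflect Anthropic's Admin API documentation at the time of writing, an organization Admin API key is required, and the start date is an example:

```shell
# Query the Usage API for daily token usage grouped by model.
# Requires an Anthropic Admin API key in ANTHROPIC_ADMIN_KEY.
USAGE_URL="https://api.anthropic.com/v1/organizations/usage_report/messages"
curl -sg "$USAGE_URL?starting_at=2025-06-01T00:00:00Z&bucket_width=1d&group_by[]=model" \
  -H "x-api-key: $ANTHROPIC_ADMIN_KEY" \
  -H "anthropic-version: 2023-06-01"
```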

However, the Console only tracks traffic sent directly to Anthropic's API. If your team routes through an LLM gateway for fallbacks, load balancing, or multi-provider access, the Console loses visibility into those requests. It also does not support custom metric dimensions, Prometheus integration, or real-time alerting. For teams on Pro or Max subscriptions, the Console does not provide cost data at all since billing is subscription-based.

Choosing the Right Monitoring Approach

The right solution depends on your team's scale and existing infrastructure. For individual developers on Pro or Max plans, Claude Code's built-in /cost command combined with CLI tools like ccusage provides adequate session-level visibility. For teams using API pricing, the Anthropic Console adds organizational cost tracking.

For engineering organizations that need centralized real-time monitoring, per-team cost attribution, budget enforcement, and integration with existing Prometheus or Grafana infrastructure, Bifrost provides the most complete monitoring stack. Its combination of dedicated token and cost Prometheus counters, hierarchical Virtual Key governance, dynamic label injection for custom dimensions, and sub-0.1ms logging overhead makes it purpose-built for monitoring Claude Code at team scale.

Book a Bifrost demo to explore how it can provide full visibility into your organization's Claude Code token spend.