5 Tools for Rate Limiting LLM APIs at Scale
Rate limiting LLM APIs is a two-sided problem. Provider-imposed ceilings on requests per minute (RPM) and tokens per minute (TPM) put a hard cap on throughput; internal quotas across teams, agents, and customers create a second layer of enforcement. Solving only one side leaves production behavior broken at the other. Bifrost, the open-source AI gateway built in Go by Maxim AI, is the best overall choice for enterprise teams that need token-aware hierarchical rate limits, multi-key pooling, and automatic failover with 11µs of overhead at 5,000 RPS. This post compares five tools on the dimensions that matter most for LLM API rate limiting at scale.
What Makes LLM Rate Limiting Different
Standard API rate limiting counts requests in a time window. LLM traffic breaks that model because two requests to the same endpoint can differ by orders of magnitude in tokens, compute time, and cost. A 30-token classifier call and a 40,000-token document analysis each count as one request against an RPM limit, but they are not equal in any resource dimension.
Effective rate limiting for LLM APIs requires tracking at least three dimensions simultaneously:
- Request rate (RPM): the number of calls per time window
- Token rate (TPM): the volume of prompt and completion tokens per window
- Spend (budget): cumulative cost against a dollar limit per period
Provider-imposed rate limits enforce all three at the account level. Internal quotas should enforce the same dimensions at the consumer level: per-team, per-application, per-customer, and per-provider. Without both layers, a single runaway agent or noisy tenant can exhaust provider capacity for every other consumer on the platform.
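As a rough illustration of tracking the three dimensions together, here is a minimal fixed-window sketch. The names `Limits` and `MultiDimLimiter`, the token estimates, and the per-request cost figures are all hypothetical; this is not any listed tool's implementation, and budget-period resets are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    rpm: int           # max requests per window
    tpm: int           # max tokens per window
    budget_usd: float  # max cumulative spend


class MultiDimLimiter:
    """Fixed-window limiter that checks requests, tokens, and spend together."""

    def __init__(self, limits: Limits, window_s: float = 60.0):
        self.limits = limits
        self.window_s = window_s
        self.window_start = 0.0
        self.requests = 0
        self.tokens = 0
        self.spend_usd = 0.0  # never reset here; a real budget would reset per period

    def allow(self, now: float, est_tokens: int, est_cost_usd: float) -> bool:
        # Start a fresh request/token window once the current one has elapsed.
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        # Reject if any single dimension would be exceeded.
        if self.requests + 1 > self.limits.rpm:
            return False
        if self.tokens + est_tokens > self.limits.tpm:
            return False
        if self.spend_usd + est_cost_usd > self.limits.budget_usd:
            return False
        # Admit the request and record it against every dimension.
        self.requests += 1
        self.tokens += est_tokens
        self.spend_usd += est_cost_usd
        return True
```

With a 50,000 TPM limit, a 30-token call and one 40,000-token call both fit in the same window, but a second 40,000-token call is rejected even though the request count is nowhere near its limit.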
Evaluation Criteria
Each tool below is assessed against the same five criteria:
- Token-aware rate limiting: does it track TPM alongside RPM?
- Hierarchical per-consumer quotas: can rate limits be scoped to individual teams, customers, or virtual keys?
- Multi-key pooling: can it distribute traffic across multiple provider API keys to multiply effective throughput?
- Automatic failover on 429s: does it reroute traffic away from exhausted providers without application changes?
- Deployment model: self-hosted, managed, or both?
1. Bifrost
Bifrost is an open-source AI gateway written in Go that handles both sides of the rate-limit problem at the infrastructure layer. It adds only 11µs of overhead per request at 5,000 RPS, which means the governance layer does not become the bottleneck under rate-limit storms.
Token-aware rate limiting operates at two levels simultaneously: the virtual key level and the provider config level. Both support independent request limits and token limits with configurable reset windows (1m, 1h, 1d, 1M). A request must pass both limit types at all applicable levels before reaching a provider.
Hierarchical per-consumer quotas work through virtual keys, which map to a governance hierarchy of Provider Config → Virtual Key → Team → Customer. Each level carries independent budgets and rate limits. All applicable levels are checked on every request, and costs are deducted from every level simultaneously when a request completes.
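The check-every-level, deduct-from-every-level behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not Bifrost's code: `Node`, `admit`, and `settle` are hypothetical names, only dollar budgets are modeled, and the provider config level is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One level in the hierarchy, e.g. a virtual key, team, or customer."""
    name: str
    budget_usd: float
    spent_usd: float = 0.0
    parent: Optional["Node"] = None

    def chain(self):
        # Walk from this node up to the root of the hierarchy.
        node = self
        while node is not None:
            yield node
            node = node.parent


def admit(leaf: Node, est_cost_usd: float) -> bool:
    # A request passes only if every level still has room for it.
    return all(n.spent_usd + est_cost_usd <= n.budget_usd for n in leaf.chain())


def settle(leaf: Node, actual_cost_usd: float) -> None:
    # On completion, the cost is deducted from every level in the chain.
    for n in leaf.chain():
        n.spent_usd += actual_cost_usd
```

A virtual key with plenty of its own budget is still rejected when its team or customer is out of budget, which is the property that stops one consumer from draining a shared allocation.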
Multi-key pooling is built into provider configuration. Each virtual key can be restricted to a specific subset of provider API keys, and Bifrost load-balances across those keys using weighted distribution. When one key exhausts its rate limit, traffic continues through remaining keys in the pool without any request failing.
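Weighted distribution across a key pool, with exhausted keys skipped, might look roughly like this. `ApiKey` and `pick_key` are hypothetical names, the per-key counter stands in for a real windowed limit, and the selection logic is an assumption for illustration rather than Bifrost's actual algorithm.

```python
import random
from dataclasses import dataclass


@dataclass
class ApiKey:
    key_id: str
    weight: float
    rpm_limit: int
    used: int = 0  # requests sent in the current window (reset logic omitted)

    def exhausted(self) -> bool:
        return self.used >= self.rpm_limit


def pick_key(pool, rng):
    """Pick a key by weight among those that still have headroom."""
    available = [k for k in pool if not k.exhausted()]
    if not available:
        return None  # the whole pool is exhausted
    key = rng.choices(available, weights=[k.weight for k in available], k=1)[0]
    key.used += 1
    return key
```

Passing in a `random.Random` instance keeps the selection reproducible in tests. Effective throughput for the pool is the sum of the per-key limits, since requests only fail once every key is exhausted.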
Automatic failover builds on weighted provider routing. When a provider exhausts its token or request limit, it is excluded from routing automatically and traffic redistributes to the remaining available providers within the virtual key. No application retry logic is required.
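One way to picture the redistribution: drop the exhausted providers and renormalize the remaining weights. The function name `routing_shares` and the provider names in the usage note are hypothetical, and this is only a sketch of the idea, not the gateway's routing code.

```python
def routing_shares(weights: dict, exhausted: set) -> dict:
    """Return each live provider's share of traffic after excluding exhausted ones."""
    live = {name: w for name, w in weights.items() if name not in exhausted}
    total = sum(live.values())
    if total == 0:
        return {}  # nothing left to route to
    return {name: w / total for name, w in live.items()}
```

For example, with weights of 6, 3, and 1 for `provider_a`, `provider_b`, and `provider_c`, marking `provider_a` as exhausted leaves `provider_b` and `provider_c` splitting traffic three to one.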
For enterprises, Bifrost Enterprise adds real-time state synchronization across nodes via a RAFT-based clustering protocol, adaptive load balancing with provider health monitoring, RBAC with SSO integration, and immutable audit logs for SOC 2, HIPAA, GDPR, and ISO 27001.
Best for: Enterprises running mission-critical AI workloads that require high performance, scalability, and reliability. Bifrost serves as a centralized AI gateway to route, govern, and secure all AI traffic across models and environments with ultra-low latency, unifying LLM gateway, MCP gateway, and Agents gateway capabilities in a single platform. Designed for regulated industries and strict enterprise requirements, it supports air-gapped deployments, VPC isolation, and on-prem infrastructure, and provides full control over data, access, and execution along with security, policy enforcement, and governance capabilities.
2. Kong AI Gateway
Kong AI Gateway extends Kong Gateway's existing plugin architecture with an ai-rate-limiting-advanced plugin that applies token-aware rate limiting to LLM traffic. It calculates cost per request using token data returned by the LLM provider, and allows rate limits to be set per consumer or per time window against that cost.
Kong supports multiple rate-limiting strategies: local (per-node counters), cluster (shared counters across nodes), and Redis (external counter storage for high-availability deployments). When a Redis connection fails, the plugin falls back to local rate limiting and re-syncs counters when the connection is restored.
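The fall-back-and-re-sync pattern can be sketched generically as below. This is not Kong's plugin code: `FakeSharedStore` is an in-memory stand-in for an external store such as Redis, `FallbackLimiter` is a hypothetical name, and window resets are omitted. Note that this simple version counts rejected requests too, which a production limiter may want to avoid.

```python
class FakeSharedStore:
    """In-memory stand-in for an external counter store such as Redis."""

    def __init__(self):
        self.count = 0
        self.up = True

    def incr(self, amount: int) -> int:
        if not self.up:
            raise ConnectionError("shared store unreachable")
        self.count += amount
        return self.count


class FallbackLimiter:
    def __init__(self, store, limit: int):
        self.store = store
        self.limit = limit
        self.local_count = 0  # last total known to this node
        self.pending = 0      # usage not yet pushed to the shared store

    def allow(self, cost: int) -> bool:
        try:
            # Push this request plus anything accumulated during an outage.
            total = self.store.incr(cost + self.pending)
            self.pending = 0
            self.local_count = total
        except ConnectionError:
            # Degrade to a per-node count and remember what to re-sync later.
            self.pending += cost
            self.local_count += cost
            total = self.local_count
        return total <= self.limit
```

During an outage each node only sees its own traffic, so the effective cluster-wide limit is looser until the shared counters are reachable again.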
Token rate limiting in Kong is configured at the route or service level rather than through a consumer-keyed virtual key system. Per-tenant quota enforcement requires separate consumer entities and plugin configurations, which adds operational overhead as the number of distinct consumers grows. The AI Rate Limiting Advanced plugin is part of Kong's enterprise and Konnect offering rather than the open-source distribution.
Kong's strength is its breadth as a general-purpose API gateway. Teams that already run Kong across their API infrastructure can extend rate-limit governance to LLM traffic through a single control plane, without introducing a separate gateway. The trade-off is that LLM-specific capabilities are layered onto a request-counting architecture that was not originally designed for token economics.
Best for: Teams already running Kong Gateway across their API infrastructure who want to extend rate-limit governance to LLM traffic without deploying a separate tool. Token-aware depth requires the enterprise tier and additional plugin configuration.
3. Cloudflare AI Gateway
Cloudflare AI Gateway integrates AI routing into Cloudflare's global edge network, enforcing rate limits close to the end user rather than at a centralized origin. The managed product covers request-level rate limiting, response caching, retry logic, and model fallbacks. In 2026, Cloudflare added spend limits as a first-class feature: dollar-based budgets scoped to model, provider, or custom attributes like user and team, with fixed or rolling reset windows.
Edge-based rate limiting reduces round trips to origin infrastructure and inherits Cloudflare's existing DDoS and bot protections. For teams already operating on the Cloudflare stack, the integration path is low-friction: changing one line of code points existing SDK calls through the gateway URL.
The depth trade-offs are meaningful for production AI programs at scale. Cloudflare AI Gateway is managed-only, which can be a constraint for regulated workloads that require private deployment or data residency. Hierarchical virtual key governance, and per-team token budget enforcement at the depth that dedicated AI gateways provide, are not native to the platform. Teams building multi-tenant AI products that need per-customer TPM isolation and automatic provider failover will typically need to pair Cloudflare AI Gateway with additional infrastructure.
Best for: Teams already on the Cloudflare edge stack who need lightweight rate limiting and spend controls with minimal operational overhead. Works well as an observability and caching layer; teams with complex per-tenant quota requirements or regulated deployment constraints will need to supplement it.
4. Apache APISIX
Apache APISIX is an open-source, cloud-native API gateway that supports LLM traffic through a set of AI plugins maintained in its core project. The ai-rate-limiting plugin enforces token-based rate limiting for LLM requests, tracking tokens consumed against configurable per-window limits. Strategies can be scoped by Route, Service, Consumer, Consumer Group, or custom dimensions.
APISIX supports both single-machine and cluster-level rate limiting to accommodate different deployment scales, and fixed or sliding time windows for flexible traffic control. The ai-proxy-multi plugin enables load balancing across multiple LLM instances, and the two plugins are designed to work together so that when one instance's rate limit is exhausted, traffic can continue to a non-rate-limited instance.
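To make the fixed-versus-sliding distinction concrete, here is a minimal sliding-window token limiter. It is a generic sketch with a hypothetical class name, not APISIX's plugin logic. A fixed window would admit a full quota just before a boundary and another just after it; the sliding version below does not.

```python
from collections import deque


class SlidingWindowTokenLimiter:
    def __init__(self, max_tokens: int, window_s: float = 60.0):
        self.max_tokens = max_tokens
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens) for admitted requests
        self.total = 0         # tokens currently inside the window

    def allow(self, now: float, tokens: int) -> bool:
        # Drop admitted requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_s:
            _, old_tokens = self.events.popleft()
            self.total -= old_tokens
        if self.total + tokens > self.max_tokens:
            return False
        self.events.append((now, tokens))
        self.total += tokens
        return True
```

The cost of this accuracy is memory: the limiter keeps one entry per admitted request in the window, whereas a fixed window needs only a single counter.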
All AI plugins in APISIX are fully open-source, which gives teams full configuration access without a commercial license. The trade-off against a dedicated LLM gateway is that hierarchical consumer governance (the Customer → Team → Virtual Key budget cascade that platform teams need for multi-tenant products) requires more manual assembly through APISIX's consumer and route architecture. For teams already standardized on APISIX for their API layer, extending it to AI traffic avoids introducing another gateway to the stack.
Best for: Infrastructure teams standardized on Apache APISIX that want to add token-aware LLM rate limiting and multi-instance load balancing within their existing gateway deployment, using open-source plugins.
5. LiteLLM
LiteLLM is an open-source Python proxy that provides a unified OpenAI-compatible interface across 100+ LLM providers. Its proxy server includes rate limiting, budget management, and per-user or per-team quota enforcement through a PostgreSQL-backed configuration store.
LiteLLM's rate limiting supports RPM and TPM limits per user, team, or key, and budget limits in dollars over configurable periods. The configuration model is accessible and widely documented, which makes it a common starting point for teams that need basic multi-provider governance before investing in more specialized infrastructure.
The primary production constraint is the Python runtime. The gateway adds measurable latency overhead at scale, and the Python GIL becomes a real throughput constraint at high concurrency. At 500 RPS on identical hardware, published benchmark comparisons show P99 latency differences that compound under rate-limit storms, when the gateway needs to make many fast routing decisions in rapid succession. For moderate-traffic workloads where Python overhead stays within acceptable bounds, LiteLLM remains a pragmatic option with a large community and broad provider support.
Best for: Python-native teams at moderate traffic volumes who need a quick path to multi-provider rate limiting and budget controls. Teams scaling to high-concurrency production workloads should evaluate whether the Python runtime overhead aligns with their latency requirements.
Comparison Summary
| Tool | Token-Aware Limits | Hierarchical Quotas | Multi-Key Pooling | Auto Failover on 429 | Self-Hosted |
|---|---|---|---|---|---|
| Bifrost | ✅ RPM + TPM | ✅ VK → Team → Customer | ✅ Built-in | ✅ Built-in | ✅ |
| Kong AI Gateway | ✅ (Enterprise) | ⚠️ Manual config | ⚠️ Via plugin | ⚠️ Via config | ✅ |
| Cloudflare AI Gateway | ✅ Spend limits | ⚠️ Limited depth | ❌ | ⚠️ Via fallback rules | ❌ Managed only |
| Apache APISIX | ✅ TPM per route | ⚠️ Manual config | ✅ Via ai-proxy-multi | ⚠️ Via config | ✅ |
| LiteLLM | ✅ RPM + TPM | ✅ User/team/key | ✅ Key rotation | ✅ Retry logic | ✅ |
Choosing the Right Tool
The right choice depends on where rate-limit enforcement needs to live and how much of the surrounding governance problem the tool needs to solve.
For teams building multi-tenant AI products, the governing factor is whether the tool can enforce hierarchical per-consumer quotas natively. Bifrost's virtual key governance model handles this end-to-end: a single API credential maps to a policy that enforces RPM, TPM, and dollar budgets at the virtual key, team, and customer level simultaneously, with automatic provider failover and multi-key pooling built in. The Bifrost governance resource page documents how this model scales from open-source to enterprise RBAC, SSO, and compliance frameworks.
For teams already invested in Kong or APISIX, extending those gateways to LLM traffic avoids an additional dependency. The trade-off is that LLM-specific governance primitives require more plugin configuration and, in Kong's case, an enterprise license for full token-aware depth.
For teams starting from Python and building at moderate scale, LiteLLM provides the fastest path to rate limiting across multiple providers. The decision point is concurrency: at high traffic, the Python runtime becomes the constraint.
The Cloudflare AI Gateway fits teams already on the Cloudflare edge who want basic rate limiting and caching with minimal setup, particularly where the managed deployment model and edge enforcement are more important than deep per-tenant isolation.
The common thread across all five options is that rate limiting for LLM APIs works best when it lives at the infrastructure layer, not inside application code. Provider API keys stay hidden, retry logic stays out of services, and every rate-limit decision is observable and auditable from a single point. Teams evaluating AI gateways can review the LLM Gateway Buyer's Guide for a detailed capability comparison across deployment, governance, and performance dimensions.
To see how the open-source Bifrost gateway handles rate limiting across your specific providers, team structure, and traffic volume, book a demo with the Bifrost team.