Bifrost Cluster Mode: High Availability for Enterprise AI Deployments

Bifrost cluster mode runs multiple replicas with gossip-based state sync, automatic failover, and zero-downtime deployments for enterprise high availability.

Cluster mode runs multiple gateway replicas as a single coordinated unit, sharing rate limits, budget counters, and governance state across every node so a fleet of instances behaves consistently under load. Bifrost, the open-source AI gateway written in Go by Maxim AI, targets enterprises running mission-critical AI workloads that demand high performance, scalability, and reliability. Cluster mode adds peer-to-peer high availability, automatic failover, and zero-downtime deployments on top of that foundation. This guide covers what Bifrost cluster mode is, why enterprises deploy it, how it works, and how to deploy it on Kubernetes with Helm.

What Is Cluster Mode in Bifrost?

Cluster mode in Bifrost is a deployment configuration where multiple gateway replicas form a peer-to-peer cluster and synchronize state in real time. When cluster mode is disabled, which is the default, each replica runs independently and shares state only through the database. With Bifrost cluster mode enabled, replicas share rate limits, budget counters, and governance data directly across nodes.

Cluster mode is designed for production scale. It uses gossip-based state synchronization so every node converges on the same view of cluster membership, usage counters, and configuration within seconds. It requires PostgreSQL as the storage backend, since SQLite is single-node only. It is also part of the enterprise feature set: the open-source image accepts cluster configuration values, but clustered state replication runs only in Bifrost Enterprise.

Why Enterprises Deploy Bifrost in Cluster Mode

Enterprises deploy Bifrost in cluster mode to remove single points of failure, keep rate limits and budgets accurate across many replicas, and update the gateway without taking AI traffic offline. A single-instance gateway is a single point of failure: if it stops responding, every downstream application that depends on it loses access to its models. Clustering distributes that risk across multiple equal nodes with automatic failover.

The benefits map directly to enterprise operational requirements:

No single point of failure: Traffic redistributes automatically when a node fails, so the gateway stays available during hardware faults, pod evictions, and zone disruptions.
Accurate distributed rate limiting and budgets: Per-minute rate limits and budget counters are synchronized across the cluster, so a limit of 1,000 requests per minute holds across the whole fleet rather than being enforced separately on each replica.
Consistent governance across replicas: Virtual keys, routing rules, providers, and access policies replicate to every node, so a configuration change applies uniformly without per-pod drift.
Capacity for traffic spikes: Horizontal scaling distributes load across nodes, and the cluster absorbs spikes by adding replicas rather than degrading a single instance.
Zero-downtime maintenance: Rolling updates replace nodes one at a time while the remaining nodes continue serving requests.
Region-aware operation: Nodes can be tagged with a region label for latency-aware routing and region-scoped coordination across geographically distributed deployments.

Teams planning capacity and redundancy can review the patterns on the enterprise deployment resource page, which covers cluster sizing and high-availability topologies.

How Bifrost Cluster Mode Works

Bifrost cluster mode uses a peer-to-peer network in which every node is an equal participant. There is no primary node and no external coordinator in the request path. Each node discovers its peers automatically, tracks the liveness of every other node, and receives state updates as they happen.

The cluster uses two transports with separate responsibilities. Membership and node-liveness signals run over a memberlist gossip layer on port 10101 (TCP and UDP), based on the SWIM protocol for scalable, eventually consistent group membership. Everything else, including governance usage counters, configuration sync, routing rules, virtual keys, RBAC, and MCP tools, travels over a dedicated gRPC channel on port 10102 (TCP). Splitting the two keeps membership churn isolated from the higher-volume application message stream, and each transport can be tuned and observed independently.
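On Kubernetes, the port split described above could be expressed as a NetworkPolicy that lets Bifrost pods reach each other on both transports. This is a minimal sketch, not a manifest from the Bifrost chart: the policy name is hypothetical, and the namespace and pod label are assumed to match the discovery settings used in the Helm values in this guide.

```yaml
# Illustrative only: allow peer-to-peer cluster traffic between Bifrost pods.
# The name, namespace, and label below are assumptions; adjust to your release.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bifrost-cluster-peers        # hypothetical name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: bifrost
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: bifrost
      ports:
        - protocol: TCP              # memberlist gossip
          port: 10101
        - protocol: UDP              # memberlist gossip
          port: 10101
        - protocol: TCP              # gRPC application messages
          port: 10102
```

NetworkPolicies are allow-lists, so once a policy like this selects the Bifrost pods, any ingress it does not name is denied. Client traffic to the gateway's HTTP listener therefore needs its own rule, and clusters that also restrict egress need a matching egress rule for the same ports.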

Bifrost replicates more than 30 entity types across the clustering layer, spanning the model catalog, providers, governance counters, routing rules, RBAC, MCP tools, pricing, and prompt deployments. Each message carries a unique ID and a timestamp, and receivers run a deduplicator so a node ignores re-broadcasts of a message it has already processed. All nodes converge to the same state within seconds under an eventual-consistency model.

Leader election runs automatically when clustering is enabled, with one cluster-wide leader and one leader per region. Election is deterministic: the lexicographically first healthy member wins, and the cluster re-evaluates membership every 30 seconds so leadership transfers automatically when nodes join, leave, or fail. The leader coordinates singleton tasks such as upstream pricing fetches, then broadcasts the result so other nodes stay in sync without each making the same call.

Deploying Bifrost in Cluster Mode on Kubernetes

The most common way to deploy Bifrost in cluster mode is with the Helm chart on Kubernetes, using Kubernetes-native pod discovery. The chart deploys the default mesh clustering model, where nodes reach each other directly over gossip (10101) and gRPC (10102). A minimum of three nodes is recommended for fault tolerance: three nodes tolerate one failure, five tolerate two, and seven or more tolerate three or more.
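The node counts above line up with the usual majority rule, under which a cluster of n nodes stays healthy as long as more than half of its members are up. That reading is an inference from the figures quoted here, not a formula taken from the Bifrost documentation:

```latex
\[
  f(n) = \left\lfloor \frac{n - 1}{2} \right\rfloor,
  \qquad f(3) = 1, \quad f(5) = 2, \quad f(7) = 3
\]
```

Under this rule an even node count adds no extra tolerance (f(4) = f(3) = 1), which is one reason odd cluster sizes are the common choice.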

A basic cluster configuration enables cluster mode, points at an external PostgreSQL instance, and turns on Kubernetes discovery so new pods join the cluster automatically as they scale:

# cluster-values.yaml
replicaCount: 3

storage:
  mode: postgres

postgresql:
  external:
    enabled: true
    host: "your-postgres-host.example.com"
    port: 5432
    user: bifrost
    database: bifrost
    sslMode: require
    existingSecret: "postgres-credentials"
    passwordKey: "password"

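bifrost:
  encryptionKeySecret:
    name: "bifrost-encryption"
    key: "encryption-key"
  cluster:
    enabled: true              # cluster mode is disabled by default
    region: "us-east-1"        # region label for region-aware routing and per-region leader election
    discovery:
      enabled: true
      type: kubernetes         # find peer pods through the Kubernetes API
      k8sNamespace: "default"
      k8sLabelSelector: "app.kubernetes.io/name=bifrost"  # label selector used to find peer pods
    gossip:
      port: 10101              # memberlist gossip, TCP and UDP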

Kubernetes discovery requires a service account with permission to list pods, so the gossip layer can find peer pods by label selector. After creating the PostgreSQL and encryption secrets and applying the pod-discovery RBAC role, install the release with helm install bifrost bifrost/bifrost -f cluster-values.yaml. Both ports 10101 (TCP and UDP) and 10102 (TCP) must be reachable between pods, so NetworkPolicies and security groups need to allow that traffic.
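A pod-discovery role of the kind described above might look like the following. This is a sketch using standard Kubernetes RBAC objects: the role name and the service account name `bifrost` are hypothetical, and the Helm chart may already create or bind its own service account, in which case only the names need to line up.

```yaml
# Illustrative only: grant the Bifrost service account permission to list pods
# in its own namespace. Object names are assumptions, not chart defaults.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: bifrost-pod-discovery        # hypothetical name
  namespace: default
rules:
  - apiGroups: [""]                  # core API group, where pods live
    resources: ["pods"]
    verbs: ["list"]                  # add "get" or "watch" only if discovery turns out to need them
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: bifrost-pod-discovery
  namespace: default
subjects:
  - kind: ServiceAccount
    name: bifrost                    # assumed service account used by the Bifrost pods
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: bifrost-pod-discovery
```

A namespaced Role is enough here because discovery is scoped to a single namespace through k8sNamespace; a ClusterRole would only be needed if peers were spread across namespaces.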

Service discovery options

Bifrost supports six service discovery methods, so the cluster can form on any infrastructure:

Kubernetes: Pod discovery by label selector through the Kubernetes API, the recommended method for cloud-native deployments.
DNS: Headless service or A-record resolution, which works well with StatefulSets.
Consul and etcd: Service-registry-based discovery for environments already running those systems.
UDP broadcast and mDNS: Local-network discovery for on-premise clusters and development.

For serverless platforms without peer-to-peer networking, such as Google Cloud Run, Bifrost offers broker mode, where each node makes a single outbound connection to a central relay instead of forming a direct mesh. The same model applies in air-gapped and private environments documented under in-VPC deployments.

Production Best Practices for High-Availability Deployments

A production Bifrost cluster combines a multi-node baseline with scheduling, autoscaling, and graceful-shutdown settings that protect availability during routine operations. The goal is a fleet that survives node failures, scales with demand, and rolls out new versions without dropping in-flight requests.

Run three or more nodes and spread them across hosts: Use pod anti-affinity so replicas land on different nodes, which keeps the cluster available when a single host fails.
Protect availability during maintenance: Set a PodDisruptionBudget with minAvailable: 2 so voluntary disruptions, such as node drains and cluster autoscaling, never take the cluster below a safe replica count.
Drain in-flight streams on shutdown: Configure a termination grace period and a preStop hook so streaming responses finish before a pod terminates, rather than being cut off mid-response.
Scale on real demand: Enable horizontal autoscaling with a conservative scale-down policy so the cluster grows for traffic spikes without aggressively removing pods that are still serving requests.
Pair clustering with provider-level resilience: Combine cluster high availability with automatic fallbacks and adaptive load balancing so the gateway routes around provider outages as well as node failures.
Monitor the cluster: Export Prometheus metrics and use the built-in cluster topology view and diagnostics to confirm that every node is reachable and acknowledging messages.
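Two of the practices above, the disruption budget and conservative autoscaling, map onto standard Kubernetes objects. The sketch below assumes the release runs as a Deployment named `bifrost` with the pod label used earlier in this guide. The replica bounds, CPU target, and scale-down window are illustrative values, not Bifrost recommendations.

```yaml
# Illustrative only: keep at least two pods up during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bifrost                      # hypothetical name
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: bifrost
---
# Illustrative only: scale up on CPU, scale down slowly.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bifrost                      # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment                 # assumption: adjust if the chart uses another workload kind
    name: bifrost
  minReplicas: 3                     # matches the three-node minimum
  maxReplicas: 10                    # example ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # example target
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait five minutes before acting on lower demand
      policies:
        - type: Pods
          value: 1                   # remove at most one pod per minute
          periodSeconds: 60
```

Removing one pod at a time after a stabilization window gives in-flight and streaming requests on the departing pod time to finish, which complements the termination grace period and preStop hook described above.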

Mixed-version rollouts are supported through per-peer capability negotiation, so newer and older nodes run side by side during a rolling upgrade without losing quorum. This means a Bifrost cluster can be upgraded one pod at a time using standard Kubernetes rolling-update strategies.

Bifrost Cluster Mode FAQ

Does Bifrost cluster mode require a specific database?

Yes. Cluster mode requires PostgreSQL as the storage backend, because every replica needs access to the same persistent store. SQLite is single-node only and cannot back a multi-replica cluster.

How many nodes should a Bifrost cluster have?

Three nodes is the recommended minimum and tolerates one node failure. Five nodes tolerate two failures, and seven or more tolerate three or more, so node count scales with the fault tolerance an enterprise deployment needs.

Can Bifrost run in cluster mode on serverless platforms?

Yes, using broker mode. On platforms without peer-to-peer networking, such as Google Cloud Run, each node makes a single outbound connection to a central broker that relays messages, instead of forming a direct gossip and gRPC mesh.

Does cluster mode support zero-downtime upgrades?

Yes. Nodes negotiate capabilities per peer, so a cluster can run mixed versions during a rolling upgrade. Replacing one pod at a time keeps the gateway serving requests throughout the update.

Getting Started with Bifrost Cluster Mode

Bifrost cluster mode gives enterprises a high-availability AI gateway with gossip-based state synchronization, automatic failover, distributed rate limiting, and zero-downtime deployments across a peer-to-peer cluster. It is the deployment model for teams that need the gateway to stay available through node failures, traffic spikes, and routine maintenance. For sizing guidance and reference topologies, the enterprise deployment guide and the broader Bifrost resources hub walk through production patterns in detail.

To see how Bifrost cluster mode fits your high-availability and compliance requirements, book a demo with the Bifrost team.