vLLM Provider on Bifrost

vLLM is an OpenAI-compatible provider for self-hosted inference. Bifrost delegates to its shared OpenAI provider implementation, which covers chat, text completions, embeddings, rerank, speech-to-text (STT), and streaming.

vLLM provider summary

vLLM serves models through an OpenAI-compatible API. Bifrost routes requests to your vLLM instance, with optional API key authentication and a configurable base URL.

Key characteristics:

  • OpenAI compatibility: chat, text completions, embeddings, rerank, transcription (STT), and streaming
  • Self-hosted: typically http://localhost:8000 or your own server
  • Optional authentication: the API key is often omitted for local instances
  • Responses API: supported via chat completion fallback

Property | Details
--- | ---
Description | OpenAI-compatible self-hosted inference engine.
Provider route on Bifrost | vllm/<model>
Typical endpoint | http://localhost:8000
Supported endpoints | /v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/rerank, /v1/models, /v1/audio/transcriptions

Supported operations

Image Generation, Speech (TTS), Files, and Batch return UnsupportedOperationError. See Supported operations in Bifrost docs.

Operation | Non-streaming | Streaming | Upstream endpoint
--- | --- | --- | ---
Chat Completions | Yes | Yes | /v1/chat/completions
Responses API | Yes | Yes | /v1/chat/completions
Text Completions | Yes | Yes | /v1/completions
Embeddings | Yes | - | /v1/embeddings
Rerank | Yes | - | /v1/rerank (fallback: /rerank)
List Models | Yes | - | /v1/models
Transcriptions (STT) | Yes | Yes | /v1/audio/transcriptions
Image Generation | No | No | -
Speech (TTS) | No | No | -
Files | No | No | -
Batch | No | No | -

Authentication

API key is optional. For local vLLM instances, the key is often left empty. When set, Bifrost sends Authorization: Bearer <key>. See Authentication in Bifrost docs.
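
The optional-key behavior described above can be sketched in a few lines. This is a minimal illustration, not Bifrost's actual code: the helper name `build_headers` is hypothetical, and the only behavior taken from the text is that `Authorization: Bearer <key>` is sent when a key is set and omitted otherwise.

```python
from typing import Optional


def build_headers(api_key: Optional[str] = None) -> dict[str, str]:
    """Build request headers; add Authorization only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # None or an empty string means "no authentication"
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Local instance without a key, then an instance that requires one.
print(build_headers())
print(build_headers("my-vllm-key"))
```

Treating an empty string the same as a missing key matches the common case of a local vLLM server started without authentication.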

Configuration

Default base URL is http://localhost:8000. Override via provider network_config.base_url. Model names depend on what is loaded on your vLLM server (e.g. meta-llama/Llama-3.2-1B-Instruct, BAAI/bge-m3 for embeddings).

# Send a request through Bifrost (here at http://localhost:8080); Bifrost forwards it to vLLM (default: http://localhost:8000)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Gateway provider config: set base_url for remote vLLM
# "network_config": { "base_url": "http://vllm-endpoint:8000" }
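
Because vLLM model names usually contain a slash themselves, it can help to see how a `vllm/<model>` route separates into a provider and a model ID. The sketch below assumes the provider is everything before the first slash; the function name `split_route` is hypothetical and this is not taken from Bifrost's source.

```python
def split_route(route: str) -> tuple[str, str]:
    """Split 'provider/model' on the first slash only.

    The model part may contain further slashes, as in
    'meta-llama/Llama-3.2-1B-Instruct'.
    """
    provider, sep, model = route.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {route!r}")
    return provider, model


print(split_route("vllm/meta-llama/Llama-3.2-1B-Instruct"))
print(split_route("vllm/BAAI/bge-m3"))
```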

Getting started

  1. Run a vLLM server (Docker or pip). Example: docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct
  2. Verify: curl http://localhost:8000/v1/models
  3. Route through Bifrost with the vllm/ prefix (e.g. vllm/meta-llama/Llama-3.2-1B-Instruct).

See Getting started in Bifrost docs.
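
Steps 2 and 3 can be connected with a small helper: take the JSON body returned by vLLM's /v1/models and derive the route names to use through Bifrost. The sketch assumes the standard OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`); the function name `bifrost_routes` and the sample body are illustrative, and no network call is made.

```python
import json


def bifrost_routes(models_response: str, provider: str = "vllm") -> list[str]:
    """Turn a /v1/models JSON body into '<provider>/<model id>' route names."""
    payload = json.loads(models_response)
    return [f"{provider}/{model['id']}" for model in payload.get("data", [])]


# Sample body in the OpenAI list format (hand-written for illustration).
sample = json.dumps({
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.2-1B-Instruct", "object": "model"}],
})
print(bifrost_routes(sample))
```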

API reference

OpenAI-compatible endpoints routed to your vLLM instance via Bifrost.

1) Chat Completions

Primary request path at /v1/chat/completions. vLLM supports standard OpenAI chat parameters. Message types, tools, and streaming follow OpenAI behavior. See Chat Completions in Bifrost docs and OpenAI Chat Completions.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
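
With `"stream": true`, the response arrives as server-sent events in the OpenAI streaming format: each event is a `data: <json>` line and the stream ends with `data: [DONE]`. The following sketch reassembles the text from such lines. The function name `iter_stream_text` and the sample lines are illustrative; in practice the lines would come from an HTTP client's streaming response.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events for illustration.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```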

2) Responses API

Bifrost converts Responses API requests to Chat Completions and back. Upstream routes to /v1/chat/completions. See Responses API in Bifrost docs.

BifrostResponsesRequest
  → ToChatRequest()
  → ChatCompletion
  → ToBifrostResponsesResponse()

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
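
To make the request-side conversion concrete, here is a rough sketch of what a Responses-to-chat translation could look like for the simplest inputs. It is not Bifrost's `ToChatRequest()` implementation: the function name `responses_to_chat` is hypothetical, only string input and plain role/content items are handled, and the mapping of `max_output_tokens` to `max_tokens` is an assumption.

```python
from typing import Any


def responses_to_chat(req: dict[str, Any]) -> dict[str, Any]:
    """Translate a minimal Responses-style request into a chat completion request."""
    raw_input = req["input"]
    if isinstance(raw_input, str):
        # A bare string becomes a single user message.
        messages = [{"role": "user", "content": raw_input}]
    else:
        # Otherwise assume a list of {"role", "content"} items and copy them over.
        messages = [{"role": m["role"], "content": m["content"]} for m in raw_input]

    chat: dict[str, Any] = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        # Assumed mapping: the Responses token cap becomes the chat max_tokens.
        chat["max_tokens"] = req["max_output_tokens"]
    return chat


print(responses_to_chat({
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024,
}))
```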

3) Text Completions

Legacy format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.

Parameter | Mapping
--- | ---
prompt | Sent as-is
max_tokens | max_tokens
temperature | temperature
top_p | top_p
stop | stop sequences

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

4) Embeddings

Text embeddings at /v1/embeddings — no streaming. Use model IDs exposed by your vLLM server (e.g. BAAI/bge-m3). See Embeddings in Bifrost docs.

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/BAAI/bge-m3",
    "input": "Hello world"
  }'

5) List Models

GET /v1/models — lists models loaded on your vLLM instance. Available models depend on server configuration. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models

6) Rerank

Reranking for pooling/cross-encoder models. Bifrost sends to /v1/rerank and falls back to /rerank when required. Your vLLM server must be started with a rerank-capable model. See Rerank in Bifrost docs.

curl -X POST http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      {"text": "Machine learning is a subset of AI."},
      {"text": "Python is a programming language."},
      {"text": "Deep learning uses neural networks."}
    ],
    "params": {
      "return_documents": true
    }
  }'
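
The fallback from /v1/rerank to /rerank can be pictured as trying an ordered list of paths. The sketch below is one possible shape of that logic, not Bifrost's implementation: the text does not say what triggers the fallback, so treating HTTP 404 or 405 as "endpoint not available" is an assumption, and `rerank_with_fallback`, `Sender`, and `fake_send` are hypothetical names. The transport is injected so the example runs without a server.

```python
from typing import Any, Callable

# send(path, payload) -> (status_code, body); injected so the sketch runs offline.
Sender = Callable[[str, dict[str, Any]], tuple[int, dict[str, Any]]]

RERANK_PATHS = ("/v1/rerank", "/rerank")


def rerank_with_fallback(send: Sender, payload: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Try each rerank path in order; move on only when the path looks missing."""
    last_status = None
    for path in RERANK_PATHS:
        status, body = send(path, payload)
        if status in (404, 405):  # assumed signal that this path is not served
            last_status = status
            continue
        return path, body
    raise RuntimeError(f"no rerank endpoint available (last status {last_status})")


def fake_send(path: str, payload: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Stand-in for a deployment that only serves the unversioned path."""
    if path == "/v1/rerank":
        return 404, {"detail": "Not Found"}
    return 200, {"results": []}  # made-up body, for illustration only


used_path, _ = rerank_with_fallback(
    fake_send,
    {"query": "What is machine learning?", "documents": [{"text": "Machine learning is a subset of AI."}]},
)
print(used_path)
```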

Implementation caveats

Caveat | Impact | Severity
--- | --- | ---
Default base URL is localhost | Default is http://localhost:8000; set network_config.base_url for remote or custom ports | Low
Error responses with HTTP 200 | vLLM may return HTTP 200 with an error payload instead of 4xx/5xx; Bifrost normalizes these for clients | Low
Rerank endpoint fallback | Bifrost tries /v1/rerank then /rerank depending on vLLM deployment | Low
Unsupported multimodal ops | Image generation, TTS, files, and batch return UnsupportedOperationError | Medium
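
For the HTTP 200 caveat, the idea of normalization is to inspect the body even when the status looks successful. The sketch below shows one way a client or gateway might do that. Bifrost's actual logic is not documented here, and both payload shapes checked (a top-level `"object": "error"` body and a nested `"error"` object) are assumptions about what an upstream might send; `normalize_status` is a hypothetical name.

```python
from typing import Any


def normalize_status(status: int, body: Any) -> int:
    """Return an error status when a 200 response actually carries an error payload."""
    if status != 200 or not isinstance(body, dict):
        return status

    # Assumed shape 1: {"error": {"message": ..., "code": ...}}
    err = body.get("error") if isinstance(body.get("error"), dict) else None
    # Assumed shape 2: {"object": "error", "message": ..., "code": ...}
    if err is None and body.get("object") == "error":
        err = body
    if err is None:
        return status  # a genuine success

    code = err.get("code")
    # Use the embedded code when it is a valid HTTP error status, else fall back to 500.
    return code if isinstance(code, int) and 400 <= code <= 599 else 500


print(normalize_status(200, {"object": "error", "message": "model not found", "code": 404}))
print(normalize_status(200, {"choices": []}))
```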
