vLLM on Bifrost: OpenAI-Compatible Inference, Reranking, and Streaming

vLLM provider summary

vLLM serves models through an OpenAI-compatible API. Bifrost routes requests to your vLLM instance, with optional API key authentication and a configurable base URL.

Key characteristics:

OpenAI compatibility: chat, text completions, embeddings, rerank, transcriptions, and streaming
Self-hosted: typically http://localhost:8000 or your own server
Optional authentication: the API key is often omitted for local instances
Responses API: supported via chat completion fallback

Property	Details
Description	OpenAI-compatible self-hosted inference engine.
Provider route on Bifrost	vllm/<model>
Typical endpoint	http://localhost:8000
Supported endpoints	/v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/rerank, /v1/models, /v1/audio/transcriptions

Supported operations

Image Generation, Speech (TTS), Files, and Batch return UnsupportedOperationError. See Supported operations in Bifrost docs.

Operation	Non-streaming	Streaming	Upstream endpoint
Chat Completions	Yes	Yes	/v1/chat/completions
Responses API	Yes	Yes	/v1/chat/completions
Text Completions	Yes	Yes	/v1/completions
Embeddings	Yes	—	/v1/embeddings
Rerank	Yes	—	/v1/rerank (fallback: /rerank)
List Models	Yes	—	/v1/models
Transcriptions (STT)	Yes	Yes	/v1/audio/transcriptions
Image Generation	No	No	-
Speech (TTS)	No	No	-
Files	No	No	-
Batch	No	No	-

Authentication

The API key is optional. For local vLLM instances it is often left empty. When a key is set, Bifrost sends it upstream as Authorization: Bearer <key>. See Authentication in Bifrost docs.
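The optional-key rule can be sketched in a few lines of Python. This only illustrates the behavior described above and is not Bifrost's actual code; build_upstream_headers is a hypothetical helper name.

```python
def build_upstream_headers(api_key=None):
    """Return headers for a request to vLLM; add a bearer token only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # None or an empty string means "no authentication"
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

With an empty key the Authorization header is simply absent, which matches a local vLLM server started without an API key.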

Configuration

The default base URL is http://localhost:8000. Override it with network_config.base_url in the provider configuration. Model names depend on what is loaded on your vLLM server (e.g. meta-llama/Llama-3.2-1B-Instruct for chat, BAAI/bge-m3 for embeddings).

# Send the request to the Bifrost gateway (http://localhost:8080 in these examples);
# Bifrost forwards it to local or remote vLLM (default: http://localhost:8000)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Gateway provider config: set base_url for remote vLLM
# "network_config": { "base_url": "http://vllm-endpoint:8000" }
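For clients that do not shell out to curl, the same chat request can be built with the Python standard library. This is a minimal sketch: it assumes the gateway listens on http://localhost:8080 as in the curl example, and build_chat_request and send are illustrative names, not part of Bifrost.

```python
import json
import urllib.request

BIFROST_URL = "http://localhost:8080"  # assumption: gateway address from the curl example


def build_chat_request(model, user_text, base_url=BIFROST_URL):
    """Build (but do not send) a chat completion request for the Bifrost gateway."""
    payload = {
        "model": f"vllm/{model}",  # the vllm/ prefix selects the vLLM provider
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON body (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Calling send(build_chat_request("meta-llama/Llama-3.2-1B-Instruct", "Hello")) mirrors the curl command above. Whether vLLM is local or remote is decided by the provider's base_url, not by the client.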

Getting started

1. Run a vLLM server (Docker or pip). Example: docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct
2. Verify: curl http://localhost:8000/v1/models
3. Route through Bifrost with the vllm/ prefix (e.g. vllm/meta-llama/Llama-3.2-1B-Instruct).

See Getting started in Bifrost docs.
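The verify and route steps can be combined in a small script: read the model IDs that vLLM reports and turn them into Bifrost model strings. The sketch below assumes the standard OpenAI list shape ({"data": [{"id": ...}]}) for /v1/models; fetch_models and bifrost_routes are hypothetical helper names.

```python
import json
import urllib.request


def fetch_models(base_url="http://localhost:8000"):
    """GET /v1/models from the vLLM server and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


def bifrost_routes(models_response, provider="vllm"):
    """Turn an OpenAI-style model list into '<provider>/<model id>' strings."""
    return [f"{provider}/{m['id']}" for m in models_response.get("data", [])]
```

Model IDs often contain a slash themselves (meta-llama/...), so the resulting route has more than one slash; only the leading vllm/ segment names the provider.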

API reference

OpenAI-compatible endpoints routed to your vLLM instance via Bifrost.

1) Chat Completions

Primary request path at /v1/chat/completions. vLLM supports standard OpenAI chat parameters. Message types, tools, and streaming follow OpenAI behavior. See Chat Completions in Bifrost docs and OpenAI Chat Completions.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
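With "stream": true the response arrives as server-sent events in the OpenAI format: each event is a line starting with data: followed by a JSON chunk, and the stream ends with data: [DONE]. A minimal parser for such lines might look like this; iter_content_deltas is an illustrative name, and the code assumes the usual choices[].delta.content chunk layout.

```python
import json


def iter_content_deltas(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

The function accepts any iterable of str or bytes lines, so it can be fed directly from an HTTP response object that is iterated line by line.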

2) Responses API

Bifrost converts Responses API requests to Chat Completions and back. Upstream routes to /v1/chat/completions. See Responses API in Bifrost docs.

BifrostResponsesRequest
  → ToChatRequest()
  → ChatCompletion
  → ToBifrostResponsesResponse()

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
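The request half of this conversion can be illustrated with a simplified Python sketch. It is not Bifrost's ToChatRequest(); the field mapping (input to messages, max_output_tokens to max_tokens) is an assumption based on the public OpenAI request shapes, and the real conversion handles many more fields.

```python
def responses_to_chat_request(req):
    """Convert a minimal Responses-style request dict into a Chat Completions request dict."""
    user_input = req["input"]
    if isinstance(user_input, str):
        # A bare string becomes a single user message.
        messages = [{"role": "user", "content": user_input}]
    else:
        # Assumption: a list input already holds role/content messages.
        messages = list(user_input)
    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        chat["max_tokens"] = req["max_output_tokens"]  # assumed target field name
    for key in ("temperature", "top_p", "stream"):
        if key in req:
            chat[key] = req[key]  # parameters shared by both request formats
    return chat
```

Applied to the request above, this yields a chat request with one user message ("Hello") and max_tokens set to 1024, which is what the upstream /v1/chat/completions endpoint expects.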

3) Text Completions

Legacy format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.

Parameter	Mapping
prompt	Sent as-is
max_tokens	max_tokens
temperature	temperature
top_p	top_p
stop	stop (stop sequences)

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

4) Embeddings

Text embeddings at /v1/embeddings — no streaming. Use model IDs exposed by your vLLM server (e.g. BAAI/bge-m3). See Embeddings in Bifrost docs.

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/BAAI/bge-m3",
    "input": "Hello world"
  }'

5) List Models

GET /v1/models — lists models loaded on your vLLM instance. Available models depend on server configuration. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models

6) Rerank

Reranking for pooling/cross-encoder models. Bifrost sends the request to /v1/rerank first and falls back to /rerank if the vLLM deployment requires it. Your vLLM server must be started with a rerank-capable model (e.g. BAAI/bge-reranker-v2-m3). See Rerank in Bifrost docs.

curl -X POST http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      {"text": "Machine learning is a subset of AI."},
      {"text": "Python is a programming language."},
      {"text": "Deep learning uses neural networks."}
    ],
    "params": {
      "return_documents": true
    }
  }'
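The endpoint fallback can be pictured as a loop over candidate paths. The sketch below is an illustration, not Bifrost's implementation: treating HTTP 404 or 405 as the signal to try the next path is an assumption (the docs only say the fallback happens when required), and http_post_json and rerank_with_fallback are hypothetical names.

```python
import json
import urllib.error
import urllib.request

RERANK_PATHS = ("/v1/rerank", "/rerank")  # order taken from the description above


def http_post_json(url, payload):
    """POST JSON and return (status_code, parsed_body_or_None)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, None


def rerank_with_fallback(base_url, payload, post=http_post_json):
    """Try each rerank path in order; move on only when the path looks missing."""
    last_status = None
    for path in RERANK_PATHS:
        status, body = post(base_url.rstrip("/") + path, payload)
        if status == 200:
            return path, body
        last_status = status
        if status not in (404, 405):  # assumption: only these mean "path not served"
            break  # a real error; retrying another path would hide it
    raise RuntimeError(f"rerank failed with HTTP {last_status}")
```

The post argument is injectable so the fallback logic can be exercised without a running server.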

Implementation caveats

Caveat	Impact	Severity
Default base URL is localhost	Default is http://localhost:8000; set network_config.base_url for remote or custom ports	Low
Error responses with HTTP 200	vLLM may return HTTP 200 with an error payload instead of 4xx/5xx; Bifrost normalizes these for clients	Low
Rerank endpoint fallback	Bifrost tries /v1/rerank then /rerank depending on vLLM deployment	Low
Unsupported operations	Image generation, TTS, files, and batch return UnsupportedOperationError	Medium
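The HTTP 200 caveat matters mostly for clients that call vLLM directly instead of going through Bifrost. A defensive check could look like the following sketch. The two error shapes it recognizes ({"error": ...} and {"object": "error", "message": ...}) and the choice of 500 as the substituted status are assumptions for illustration; the source does not specify how Bifrost normalizes these responses.

```python
def extract_error(body):
    """Return an error message if the JSON body looks like an error payload, else None."""
    if not isinstance(body, dict):
        return None
    error = body.get("error")
    if isinstance(error, dict):
        return str(error.get("message", error))
    if error:
        return str(error)
    if body.get("object") == "error":  # assumed alternative error shape
        return str(body.get("message", "unknown error"))
    return None


def normalize(status, body):
    """Map (status, body) to (effective_status, error_message_or_None)."""
    message = extract_error(body)
    if message is None:
        return status, None
    # Assumption: a 2xx status that carries an error payload is reported as 500.
    return (status if status >= 400 else 500), message
```

A successful body passes through with its original status, while an error payload keeps its own status when that is already 4xx/5xx.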