vLLM provider summary
vLLM serves models with an OpenAI-compatible API. Bifrost routes requests to your vLLM instance, with optional API key authentication and a configurable base URL.
Key characteristics:
- OpenAI compatibility: chat, text completions, embeddings, rerank, and streaming
- Self-hosted: typically http://localhost:8000 or your own server
- Optional authentication: API key often omitted for local instances
- Responses API: supported via chat completion fallback
| Property | Details |
|---|---|
| Description | OpenAI-compatible self-hosted inference engine. |
| Provider route on Bifrost | vllm/<model> |
| Typical endpoint | http://localhost:8000 |
| Supported endpoints | /v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/rerank, /v1/models, /v1/audio/transcriptions |
Supported operations
Image Generation, Speech (TTS), Files, and Batch return UnsupportedOperationError. See Supported operations in Bifrost docs.
| Operation | Non-streaming | Streaming | Upstream endpoint |
|---|---|---|---|
| Chat Completions | Yes | Yes | /v1/chat/completions |
| Responses API | Yes | Yes | /v1/chat/completions |
| Text Completions | Yes | Yes | /v1/completions |
| Embeddings | Yes | — | /v1/embeddings |
| Rerank | Yes | — | /v1/rerank (fallback: /rerank) |
| List Models | Yes | — | /v1/models |
| Transcriptions (STT) | Yes | Yes | /v1/audio/transcriptions |
| Image Generation | No | No | - |
| Speech (TTS) | No | No | - |
| Files | No | No | - |
| Batch | No | No | - |
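The support matrix above can be checked in client code before a request is sent. The following is a minimal sketch in Python: the operation keys, `SUPPORT`, `check_supported`, and the Python `UnsupportedOperationError` class are hypothetical stand-ins for illustration, not Bifrost's own types. Table cells marked "—" for streaming are treated here as "not available".

```python
class UnsupportedOperationError(Exception):
    """Hypothetical stand-in for the error Bifrost returns."""


# operation -> (non_streaming, streaming), copied from the table above
SUPPORT = {
    "chat_completions": (True, True),
    "responses": (True, True),
    "text_completions": (True, True),
    "embeddings": (True, False),
    "rerank": (True, False),
    "list_models": (True, False),
    "transcriptions": (True, True),
    "image_generation": (False, False),
    "speech": (False, False),
    "files": (False, False),
    "batch": (False, False),
}


def check_supported(operation: str, stream: bool = False) -> None:
    """Raise UnsupportedOperationError if the table marks the mode unavailable."""
    non_streaming, streaming = SUPPORT[operation]
    if not (streaming if stream else non_streaming):
        mode = "streaming" if stream else "non-streaming"
        raise UnsupportedOperationError(f"{operation} ({mode}) is not supported on vLLM")


check_supported("chat_completions", stream=True)  # passes silently
```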
Authentication
API key is optional. For local vLLM instances, the key is often left empty. When set, Bifrost sends Authorization: Bearer <key>. See Authentication in Bifrost docs.
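The optional-key rule can be sketched as follows. `build_headers` is a hypothetical helper written for illustration, not part of Bifrost; it adds the `Authorization: Bearer <key>` header only when a key is set.

```python
from typing import Optional


def build_headers(api_key: Optional[str] = None) -> dict:
    """Return request headers, adding a Bearer token only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # None or empty string: send no Authorization header
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(build_headers())                 # local instance, no key
print(build_headers("my-secret-key"))  # key configured
```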
Configuration
Default base URL is http://localhost:8000. Override via provider network_config.base_url. Model names depend on what is loaded on your vLLM server (e.g. meta-llama/Llama-3.2-1B-Instruct, BAAI/bge-m3 for embeddings).
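As a rough illustration of the override rule, the sketch below resolves the upstream URL from a provider config dictionary. Only `network_config.base_url` and the default come from this page; `resolve_base_url` and the trailing-slash handling are assumptions made for the example.

```python
DEFAULT_BASE_URL = "http://localhost:8000"  # default stated above


def resolve_base_url(provider_config: dict) -> str:
    """Use network_config.base_url when present, else the default."""
    network = provider_config.get("network_config") or {}
    base_url = network.get("base_url") or DEFAULT_BASE_URL
    # Assumption: strip a trailing slash so paths like /v1/models join cleanly.
    return base_url.rstrip("/")


print(resolve_base_url({}))
print(resolve_base_url({"network_config": {"base_url": "http://vllm-endpoint:8000"}}))
```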
# Send a request through the Bifrost gateway (here on localhost:8080); Bifrost forwards it to vLLM (default: http://localhost:8000)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Gateway provider config: set base_url for remote vLLM
# "network_config": { "base_url": "http://vllm-endpoint:8000" }
Getting started
- Run a vLLM server (Docker or pip). Example: docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct
- Verify: curl http://localhost:8000/v1/models
- Route through Bifrost with the vllm/ prefix (e.g. vllm/meta-llama/Llama-3.2-1B-Instruct).
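The `vllm/` prefix convention can be sketched in a few lines. Both helper names are hypothetical; the point is that only the first path segment names the provider, so model IDs that themselves contain a slash stay intact.

```python
PROVIDER_PREFIX = "vllm/"


def to_bifrost_model(vllm_model_id: str) -> str:
    """Prefix a vLLM model ID so Bifrost routes it to the vLLM provider."""
    return PROVIDER_PREFIX + vllm_model_id


def split_bifrost_model(model: str) -> tuple:
    """Split 'provider/model-id' on the first slash only."""
    provider, _, model_id = model.partition("/")
    return provider, model_id


print(to_bifrost_model("meta-llama/Llama-3.2-1B-Instruct"))
print(split_bifrost_model("vllm/meta-llama/Llama-3.2-1B-Instruct"))
```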
API reference
OpenAI-compatible endpoints routed to your vLLM instance via Bifrost.
1) Chat Completions
Primary request path at /v1/chat/completions. vLLM supports standard OpenAI chat parameters. Message types, tools, and streaming follow OpenAI behavior. See Chat Completions in Bifrost docs and OpenAI Chat Completions.
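For streaming, a client has to reassemble the text from incremental chunks. The sketch below assumes the stream follows the OpenAI server-sent events convention (`data: {...}` lines ending with `data: [DONE]`); the function names and sample lines are made up for illustration and no network call is made.

```python
import json
from typing import Iterable, Iterator


def iter_stream_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed JSON chunks from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


def collect_text(chunks: Iterable[dict]) -> str:
    """Concatenate the delta.content pieces of every choice."""
    parts = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_text(iter_stream_chunks(sample)))  # Hello
```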
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
2) Responses API
Bifrost converts Responses API requests to Chat Completions requests and converts the results back. The upstream request goes to /v1/chat/completions. See Responses API in Bifrost docs.
BifrostResponsesRequest → ToChatRequest() → ChatCompletion → ToBifrostResponsesResponse()
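To make the request half of that pipeline concrete, here is a rough sketch of what such a conversion could look like. It is not Bifrost's `ToChatRequest()`: the function name, the handling of a string `input` as a single user message, and the mapping of `max_output_tokens` to `max_tokens` are all assumptions made for illustration.

```python
def responses_to_chat_request(req: dict) -> dict:
    """Illustrative Responses-style -> Chat Completions-style request mapping."""
    raw_input = req["input"]
    if isinstance(raw_input, str):
        # Assumption: a plain string becomes one user message.
        messages = [{"role": "user", "content": raw_input}]
    else:
        # Assumption: the input is already a list of role/content messages.
        messages = list(raw_input)

    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        # Assumption: the output token limit maps onto max_tokens.
        chat["max_tokens"] = req["max_output_tokens"]
    return chat


print(responses_to_chat_request({
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024,
}))
```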
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
"input": "Hello",
"max_output_tokens": 1024
}'
3) Text Completions
Legacy format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.
| Parameter | Mapping |
|---|---|
| prompt | Sent as-is |
| max_tokens | max_tokens |
| temperature | temperature |
| top_p | top_p |
| stop | stop sequences |
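Since the parameters in the table pass through under the same names, a request body can be assembled by including only the values that are set. `build_completion_payload` is a hypothetical helper for illustration.

```python
import json


def build_completion_payload(model, prompt, max_tokens=None,
                             temperature=None, top_p=None, stop=None) -> dict:
    """Build a /v1/completions body, omitting parameters left as None."""
    payload = {"model": model, "prompt": prompt}
    optional = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


body = build_completion_payload(
    "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "Hello, my name is",
    max_tokens=50,
    stop=["\n"],
)
print(json.dumps(body, indent=2))
```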
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
"prompt": "Hello, my name is",
"max_tokens": 50
}'
4) Embeddings
Text embeddings at /v1/embeddings — no streaming. Use model IDs exposed by your vLLM server (e.g. BAAI/bge-m3). See Embeddings in Bifrost docs.
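On the client side, the vectors have to be pulled out of the response. The sketch below assumes the response follows the OpenAI embeddings shape (a `data` list whose items carry `index` and `embedding`); the helper name and the sample values are invented for illustration.

```python
def extract_embeddings(response: dict) -> list:
    """Return embedding vectors ordered by their index field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
}
print(extract_embeddings(sample_response))  # [[0.1, 0.2], [0.3, 0.4]]
```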
curl -X POST http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vllm/BAAI/bge-m3",
"input": "Hello world"
}'
5) List Models
GET /v1/models — lists models loaded on your vLLM instance. Available models depend on server configuration. See List Models in Bifrost docs.
curl http://localhost:8080/v1/models
6) Rerank
Reranking for pooling/cross-encoder models. Bifrost sends to /v1/rerank and falls back to /rerank when required. Your vLLM server must be started with a rerank-capable model. See Rerank in Bifrost docs.
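The fallback can be pictured as trying the two paths in order. This is only a sketch: the page does not say what triggers the fallback, so treating HTTP 404 as the trigger is an assumption, and `rerank_with_fallback` plus the injected `post` callable are hypothetical names used to keep the example free of network calls.

```python
from typing import Callable, Tuple

RERANK_PATHS = ("/v1/rerank", "/rerank")  # order described above


def rerank_with_fallback(post: Callable[[str, dict], Tuple[int, dict]],
                         base_url: str, payload: dict) -> Tuple[int, dict]:
    """Try each rerank path in order; move on only when the path is missing."""
    status, body = 0, {}
    for path in RERANK_PATHS:
        status, body = post(base_url.rstrip("/") + path, payload)
        if status != 404:  # assumption: 404 means this path is not served
            break
    return status, body


def fake_post(url: str, payload: dict) -> Tuple[int, dict]:
    """Stand-in transport: only the unversioned path exists."""
    if url.endswith("/v1/rerank"):
        return 404, {}
    return 200, {"results": []}


print(rerank_with_fallback(fake_post, "http://localhost:8000", {"query": "q"}))
```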
curl -X POST http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "vllm/BAAI/bge-reranker-v2-m3",
"query": "What is machine learning?",
"documents": [
{"text": "Machine learning is a subset of AI."},
{"text": "Python is a programming language."},
{"text": "Deep learning uses neural networks."}
],
"params": {
"return_documents": true
}
}'
Implementation caveats
| Caveat | Impact | Severity |
|---|---|---|
| Default base URL is localhost | Default is http://localhost:8000; set network_config.base_url for remote or custom ports | Low |
| Error responses with HTTP 200 | vLLM may return HTTP 200 with an error payload instead of 4xx/5xx; Bifrost normalizes these for clients | Low |
| Rerank endpoint fallback | Bifrost tries /v1/rerank then /rerank depending on vLLM deployment | Low |
| Unsupported operations | Image generation, TTS, files, and batch return UnsupportedOperationError | Medium |
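The HTTP 200 caveat is the easiest one to miss when talking to vLLM directly. The sketch below shows one way a client could detect it; it is not Bifrost's normalization logic. It assumes the error payload is a JSON object with a top-level `error` key, optionally holding a numeric `code`, and the function name and fallback status are hypothetical.

```python
def normalize_status(status: int, body: dict, fallback_status: int = 500) -> int:
    """Map an HTTP 200 that carries an error payload to an error status."""
    error = body.get("error") if isinstance(body, dict) else None
    if status == 200 and error:
        code = error.get("code") if isinstance(error, dict) else None
        # Assumption: reuse the embedded code when it looks like an HTTP error.
        if isinstance(code, int) and 400 <= code <= 599:
            return code
        return fallback_status
    return status


print(normalize_status(200, {"error": {"message": "bad request", "code": 400}}))  # 400
print(normalize_status(200, {"choices": []}))                                     # 200
```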
Authoritative references
- Bifrost vLLM provider reference: docs.getbifrost.ai/providers/supported-providers/vllm
- vLLM documentation: docs.vllm.ai
- Bifrost provider support overview: docs.getbifrost.ai/providers/supported-providers/overview