SGLang on Bifrost: OpenAI-Compatible Inference and API Reference

SGLang provider summary

SGLang serves models through an OpenAI-compatible API. Bifrost routes SGL requests through its OpenAI provider layer, which supplies streaming (SSE), tool calling, and embeddings, and filters out parameters that SGL does not accept.

Key features:

OpenAI API compatibility — identical request/response format
Full streaming support with usage tracking
Tool calling — function definitions and execution
Text embeddings for vector generation
Parameter filtering — unsupported OpenAI fields removed automatically

Property	Details
Description	OpenAI-compatible local/remote inference engine.
Provider route on Bifrost	sgl/<model>
Typical endpoint	http://localhost:8000
Supported endpoints	/v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/models

Supported operations

Bifrost delegates SGLang to the OpenAI provider implementation. Chat, Responses API, and Text Completions support streaming; Embeddings and List Models do not. Image Generation, Speech, Transcriptions, Files, and Batch return UnsupportedOperationError. SGL is typically self-hosted, so configure BaseURL to point at your instance (e.g. http://localhost:8000). See Supported operations in Bifrost docs.

Operation	Non-streaming	Streaming	Upstream endpoint
Chat Completions	Yes	Yes	/v1/chat/completions
Responses API	Yes	Yes	/v1/chat/completions
Text Completions	Yes	Yes	/v1/completions
Embeddings	Yes	-	/v1/embeddings
List Models	Yes	-	/v1/models
Image Generation	No	No	-
Speech (TTS)	No	No	-
Transcriptions (STT)	No	No	-
Files	No	No	-
Batch	No	No	-
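
The table above can be read as a lookup from operation to upstream path. The following is a minimal Python sketch of that lookup, not Bifrost's actual implementation: the operation keys, the `resolve_upstream` helper, and the Python `UnsupportedOperationError` class are hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple

# (upstream path, streaming supported), copied from the table above.
OPERATIONS: Dict[str, Tuple[str, bool]] = {
    "chat_completions": ("/v1/chat/completions", True),
    "responses": ("/v1/chat/completions", True),  # served via the chat fallback
    "text_completions": ("/v1/completions", True),
    "embeddings": ("/v1/embeddings", False),
    "list_models": ("/v1/models", False),
}


class UnsupportedOperationError(Exception):
    """Hypothetical stand-in for the error Bifrost returns."""


def resolve_upstream(operation: str, stream: bool = False) -> str:
    """Return the upstream path for an operation, or raise if unsupported."""
    entry = OPERATIONS.get(operation)
    if entry is None:
        raise UnsupportedOperationError(f"{operation} is not supported for sgl")
    path, can_stream = entry
    if stream and not can_stream:
        raise UnsupportedOperationError(f"{operation} does not support streaming")
    return path
```

Anything missing from the mapping (speech, transcriptions, files, batch, image generation) raises, which mirrors the "No" rows in the table.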

BaseURL configuration

SGL requires BaseURL pointing at your SGLang server. Requests fail without it (validated in NewSGLProvider). Use http://localhost:8000 for local deployments or https://sgl.example.com for remote instances.

# Example: route chat through Bifrost to a local SGL server
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
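
To make the routing concrete, here is a small Python sketch of how an `sgl/<model>` string and a configured BaseURL could combine into an upstream URL. It is an illustration under assumptions: splitting on the first slash and the helper names `split_model` and `upstream_url` are not taken from Bifrost, and the validation only approximates the check the text attributes to NewSGLProvider.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_model(model: str) -> Tuple[str, str]:
    """Split 'sgl/<model>' into (provider, upstream model name).

    Assumption: only the first slash separates the provider prefix, so
    model names that contain slashes themselves are preserved.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    return provider, name


def upstream_url(base_url: str, path: str) -> str:
    """Join BaseURL and an upstream path, rejecting empty or relative values."""
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("BaseURL must be an absolute http(s) URL")
    return base_url.rstrip("/") + path
```

For the curl request above, `split_model` yields the provider `sgl` and the model name `meta-llama/Llama-3.1-8B-Instruct`, and an empty BaseURL is rejected before any request is built.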

API reference

OpenAI-compatible endpoints routed to your SGL instance via Bifrost.

1) Chat Completions

Primary request path. Maps to upstream /v1/chat/completions. SGL supports the standard OpenAI chat completion parameters, except for the filtered parameters listed below. For full parameter reference, see OpenAI Chat Completions and SGL Chat Completions in Bifrost docs.

Filtered parameters

Removed for SGL compatibility:

Parameter	Reason	Notes
prompt_cache_key	Not supported	Removed for SGL compatibility
verbosity	OpenAI-specific	Removed for SGL compatibility
store	Not supported	Removed for SGL compatibility
service_tier	OpenAI-specific	Removed for SGL compatibility
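
The effect of this filtering can be sketched in a few lines of Python. This is an illustration of the behavior in the table, not the body of Bifrost's filter; the name `filter_params` is hypothetical.

```python
from typing import Any, Dict

# Keys taken from the filtered-parameters table above.
FILTERED_PARAMS = frozenset({"prompt_cache_key", "verbosity", "store", "service_tier"})


def filter_params(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request body without the filtered keys."""
    return {key: value for key, value in body.items() if key not in FILTERED_PARAMS}
```

A request that sets `store` or `service_tier` still succeeds; those keys simply never reach the SGL server.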

SGL supports standard OpenAI message types, tools, responses, and streaming formats. Cache control directives are stripped from messages during JSON marshaling.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
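
With "stream": true the response arrives as server-sent events. The sketch below shows one way a client might pull the text out of those events, assuming the usual OpenAI chunk layout (`data: {json}` lines, `choices[].delta.content`, and a final `data: [DONE]` sentinel). The helper name `iter_content` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # A usage-only chunk may carry an empty choices list.
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Feeding it the decoded lines of the response body and joining the results reconstructs the full assistant message.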

2) Responses API

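Served by falling back to Chat Completions with format conversion: Bifrost converts the Responses request into a chat request, sends it upstream to /v1/chat/completions on your SGL instance, and converts the reply back. Parameter support is the same as for Chat Completions. See Responses API in Bifrost docs.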

ResponsesRequest → ChatRequest → Response conversion

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
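
The request half of that conversion might look roughly like the Python sketch below. The field mapping is an assumption made for illustration (a string `input` becomes a single user message, `max_output_tokens` becomes `max_completion_tokens`); the source does not spell out Bifrost's actual mapping, and `responses_to_chat` is a hypothetical name.

```python
from typing import Any, Dict


def responses_to_chat(request: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a minimal Responses-style request into a chat request."""
    raw_input = request["input"]
    if isinstance(raw_input, str):
        messages = [{"role": "user", "content": raw_input}]
    else:
        # Assumption: list input already holds role/content message items.
        messages = list(raw_input)

    chat: Dict[str, Any] = {"model": request["model"], "messages": messages}
    if "max_output_tokens" in request:
        # Assumed target field name; not confirmed by the source.
        chat["max_completion_tokens"] = request["max_output_tokens"]
    return chat
```

Applied to the curl body above, this produces a chat request with one user message containing "Hello" and a 1024-token limit.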

3) Text Completions

Legacy text completion format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.

Parameter	Handling
prompt	Direct pass-through
max_tokens	Passed as max_tokens
temperature	Direct pass-through
top_p	Direct pass-through
frequency_penalty	Supported
presence_penalty	Supported

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

4) Embeddings

Text embeddings at /v1/embeddings — no streaming. Response returns embedding vectors with usage information. See Embeddings in Bifrost docs.

Parameter	Notes
input	Text or array of texts
model	Embedding model name
encoding_format	"float" or "base64"
dimensions	Model-specific dimension count

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/your-embedding-model",
    "input": "Hello world"
  }'
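
When encoding_format is "base64", each embedding comes back as a string rather than a list of numbers. The sketch below decodes such a string, assuming the OpenAI convention of packed little-endian float32 values; whether a given SGL deployment emits exactly that layout is an assumption, and `decode_embedding` is a hypothetical helper name.

```python
import base64
import struct
from typing import List


def decode_embedding(encoded: str) -> List[float]:
    """Decode a base64 embedding into floats (little-endian float32 assumed)."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))
```

The decoded list has the same length as the vector returned with encoding_format "float", so downstream similarity code does not need to change.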

5) List Models

GET /v1/models — lists available models from your SGL server with capabilities. No request parameters required. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models

Unsupported features

These operations are not offered by the upstream SGL API. Bifrost returns UnsupportedOperationError. See Unsupported features in Bifrost docs.

Feature	Reason
Speech/TTS	Not offered by SGL API
Transcription/STT	Not offered by SGL API
Batch operations	Not offered by SGL API
File management	Not offered by SGL API
Image generation	Not offered by SGL API

Implementation caveats

Caveat	Impact	Severity
BaseURL configuration required	Requests fail without explicit BaseURL (validated in NewSGLProvider)	High
Cache control stripped	Cache control directives removed from messages; prompt caching does not work	Medium
Parameter filtering	prompt_cache_key, verbosity, store, service_tier removed via filterOpenAISpecificParameters	Low
User field size limit	User identifiers longer than 64 characters are silently dropped (SanitizeUserField)	Low
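
The last two caveats are easy to miss because nothing fails visibly. The Python sketch below approximates both behaviors as described in the table; it is not Bifrost's code. The names `sanitize_user` and `strip_cache_control` are hypothetical, and the assumption that `cache_control` appears as a key on messages or on content parts is ours.

```python
from typing import Any, Dict, List, Optional

MAX_USER_LENGTH = 64  # limit stated in the caveats table


def sanitize_user(user: Optional[str]) -> Optional[str]:
    """Drop the user identifier entirely when it exceeds the limit."""
    if user is not None and len(user) > MAX_USER_LENGTH:
        return None
    return user


def strip_cache_control(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the messages without cache_control keys."""
    cleaned = []
    for message in messages:
        copy = {k: v for k, v in message.items() if k != "cache_control"}
        content = copy.get("content")
        if isinstance(content, list):
            # Multi-part content: strip the key from each part as well.
            copy["content"] = [
                {k: v for k, v in part.items() if k != "cache_control"}
                if isinstance(part, dict) else part
                for part in content
            ]
        cleaned.append(copy)
    return cleaned
```

A client that depends on per-user attribution can run the same length check locally and shorten or hash long identifiers before sending them.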