[ Provider Guide ]

SGLang Provider on Bifrost

SGL (SGLang) is an OpenAI-compatible local or remote inference engine for high-throughput model serving. Bifrost delegates all SGL operations to the OpenAI provider implementation with parameter filtering for compatibility.

SGLang provider summary

SGLang serves models with an OpenAI-compatible API. Bifrost routes SGL requests through the OpenAI provider layer, which provides streaming (SSE), tool calling, and embeddings, and filters out request parameters that SGL does not accept.

Key features:

  • OpenAI API compatibility — identical request/response format
  • Full streaming support with usage tracking
  • Tool calling — function definitions and execution
  • Text embeddings for vector generation
  • Parameter filtering — unsupported OpenAI fields removed automatically

| Property | Details |
| --- | --- |
| Description | OpenAI-compatible local/remote inference engine. |
| Provider route on Bifrost | sgl/<model> |
| Typical endpoint | http://localhost:8000 |
| Supported endpoints | /v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/models |
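
The sgl/<model> route can be read as a provider prefix followed by the model name. The sketch below assumes the prefix ends at the first slash, so model names that contain slashes (such as meta-llama/Llama-3.1-8B-Instruct) survive intact. split_route is a hypothetical helper for illustration, not a Bifrost function.

```python
def split_route(route: str) -> tuple[str, str]:
    """Split '<provider>/<model>' at the first slash only."""
    provider, sep, model = route.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {route!r}")
    return provider, model


provider, model = split_route("sgl/meta-llama/Llama-3.1-8B-Instruct")
print(provider)  # sgl
print(model)     # meta-llama/Llama-3.1-8B-Instruct
```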

Supported operations

Bifrost delegates SGLang to the OpenAI provider implementation. Chat Completions, the Responses API, and Text Completions support streaming; Embeddings and List Models do not. Image Generation, Speech, Transcriptions, Files, and Batch return UnsupportedOperationError. SGL is typically self-hosted, so set BaseURL to your instance (e.g. http://localhost:8000). See Supported operations in Bifrost docs.

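| Operation | Non-streaming | Streaming | Upstream endpoint |
| --- | --- | --- | --- |
| Chat Completions | Yes | Yes | /v1/chat/completions |
| Responses API | Yes | Yes | /v1/chat/completions |
| Text Completions | Yes | Yes | /v1/completions |
| Embeddings | Yes | No | /v1/embeddings |
| List Models | Yes | No | /v1/models |
| Image Generation | No | No | - |
| Speech (TTS) | No | No | - |
| Transcriptions (STT) | No | No | - |
| Files | No | No | - |
| Batch | No | No | - |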

BaseURL configuration

SGL requires a BaseURL that points at your SGLang server. Requests fail without it (validated in NewSGLProvider). Use http://localhost:8000 for a local deployment or a URL such as https://sgl.example.com for a remote instance.

# Example: route chat through Bifrost to a local SGL server
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
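
The same request can be assembled from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; BIFROST_URL and build_chat_request are illustrative names, and the address assumes Bifrost is listening on localhost:8080 as in the curl example. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

BIFROST_URL = "http://localhost:8080"  # assumption: Bifrost address from the curl example


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request addressed to Bifrost."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BIFROST_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("sgl/meta-llama/Llama-3.1-8B-Instruct", "Hello")
print(req.get_method(), req.full_url)
```

To send it, pass req to urllib.request.urlopen once Bifrost is running and the SGL BaseURL is configured.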

API reference

OpenAI-compatible endpoints routed to your SGL instance via Bifrost.

1) Chat Completions

Primary request path. Maps to upstream /v1/chat/completions. SGL accepts the standard OpenAI chat completion parameters, except for the filtered parameters listed below. For the full parameter reference, see OpenAI Chat Completions and SGL Chat Completions in Bifrost docs.

Filtered parameters

Removed for SGL compatibility:

| Parameter | Reason | Notes |
| --- | --- | --- |
| prompt_cache_key | Not supported | Removed for SGL compatibility |
| verbosity | Anthropic-specific | Removed for SGL compatibility |
| store | Not supported | Removed for SGL compatibility |
| service_tier | OpenAI-specific | Removed for SGL compatibility |

SGL supports standard OpenAI message types, tools, responses, and streaming formats. Cache control directives are stripped from messages during JSON marshaling.
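
The effect of the filter on a request body can be pictured as follows. This is only an illustration of the behavior described above, not Bifrost's implementation; filter_for_sgl and FILTERED_PARAMS are hypothetical names, and the sketch covers the four top-level fields only, not the cache control stripping inside messages.

```python
# Parameters the guide lists as removed before forwarding to SGL.
FILTERED_PARAMS = {"prompt_cache_key", "verbosity", "store", "service_tier"}


def filter_for_sgl(request: dict) -> dict:
    """Return a copy of the request without the filtered top-level fields."""
    return {k: v for k, v in request.items() if k not in FILTERED_PARAMS}


request = {
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "store": True,
    "service_tier": "auto",
}
print(sorted(filter_for_sgl(request)))  # ['messages', 'model']
```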

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
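
With "stream": true the response arrives as server-sent events. The sketch below assumes the usual OpenAI-style framing, where each event is a line of the form data: <json chunk> and the stream ends with data: [DONE]; whether every chunk from your SGL server matches this shape is an assumption. iter_text_deltas is a hypothetical helper, and the sample lines are hand-written rather than captured from a server.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines, not captured from a real server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: {"choices": [], "usage": {"total_tokens": 9}}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample)))  # Hello
```

The chunk with an empty choices list stands in for a usage-only chunk; the parser ignores it, but a client that tracks usage could read it there.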

2) Responses API

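Bifrost serves the Responses API for SGL by falling back to Chat Completions with format conversion: the request is converted to a chat request, sent upstream to /v1/chat/completions on your SGL instance, and the result is converted back into a Responses-format response. Parameter support is the same as for Chat Completions. See Responses API in Bifrost docs.

Conversion flow: ResponsesRequest → ChatRequest → Response
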
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
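
The request half of that conversion might look roughly like the sketch below. It is a simplified illustration, not Bifrost's converter: responses_to_chat is a hypothetical helper, it handles only the fields used in the curl example, and the mapping of max_output_tokens to max_tokens is an assumption about the upstream field name.

```python
def responses_to_chat(req: dict) -> dict:
    """Convert a simple Responses-style request into a chat-style request."""
    raw = req["input"]
    # A string input becomes a single user message; a list is assumed to
    # already hold role/content items.
    messages = [{"role": "user", "content": raw}] if isinstance(raw, str) else list(raw)
    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        # Assumption: the output-token cap maps to max_tokens upstream.
        chat["max_tokens"] = req["max_output_tokens"]
    return chat


chat = responses_to_chat(
    {"model": "sgl/meta-llama/Llama-3.1-8B-Instruct", "input": "Hello", "max_output_tokens": 1024}
)
print(chat["messages"])    # [{'role': 'user', 'content': 'Hello'}]
print(chat["max_tokens"])  # 1024
```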

3) Text Completions

Legacy text completion format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.

| Parameter | Mapping | Notes |
| --- | --- | --- |
| prompt | Direct pass-through | |
| max_tokens | max_tokens | |
| temperature | Direct pass-through | |
| top_p | Direct pass-through | |
| frequency_penalty | Supported | |
| presence_penalty | Supported | |

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

4) Embeddings

Text embeddings at /v1/embeddings — no streaming. Response returns embedding vectors with usage information. See Embeddings in Bifrost docs.

| Parameter | Notes |
| --- | --- |
| input | Text or array of texts |
| model | Embedding model name |
| encoding_format | "float" or "base64" |
| dimensions | Model-specific dimension count |

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sgl/your-embedding-model",
    "input": "Hello world"
  }'
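
When encoding_format is "base64", each embedding comes back as a string instead of a list of numbers. The sketch below assumes the OpenAI convention of packed 32-bit floats in little-endian order; whether your SGLang version returns base64 in exactly this layout is an assumption to verify against your server. decode_embedding is a hypothetical helper.

```python
import base64
import struct


def decode_embedding(b64: str) -> list[float]:
    """Decode a base64 embedding into floats (little-endian float32 assumed)."""
    raw = base64.b64decode(b64)
    if len(raw) % 4:
        raise ValueError("byte length is not a multiple of 4")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Round-trip with made-up values that float32 represents exactly.
encoded = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
print(decode_embedding(encoded))  # [0.5, -1.0, 2.0]
```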

5) List Models

GET /v1/models — lists available models from your SGL server with capabilities. No request parameters required. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models

Unsupported features

These operations are not offered by the upstream SGL API. Bifrost returns UnsupportedOperationError. See Unsupported features in Bifrost docs.

| Feature | Reason |
| --- | --- |
| Speech/TTS | Not offered by SGL API |
| Transcription/STT | Not offered by SGL API |
| Batch operations | Not offered by SGL API |
| File management | Not offered by SGL API |
| Image generation | Not offered by SGL API |

Implementation caveats

| Caveat | Impact | Severity |
| --- | --- | --- |
| BaseURL configuration required | Requests fail without explicit BaseURL (validated in NewSGLProvider) | High |
| Cache control stripped | Cache control directives removed from messages; prompt caching does not work | Medium |
| Parameter filtering | prompt_cache_key, verbosity, store, service_tier removed via filterOpenAISpecificParameters | Low |
| User field size limit | User identifiers longer than 64 characters are silently dropped (SanitizeUserField) | Low |
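
The user field caveat is easy to miss because nothing fails. The sketch below illustrates the described behavior only; it is not the SanitizeUserField source, and sanitize_user and MAX_USER_LEN are hypothetical names. A client can apply the same check before sending if it relies on the identifier reaching the upstream server.

```python
from typing import Optional

MAX_USER_LEN = 64  # limit stated in the caveats table


def sanitize_user(user: Optional[str]) -> Optional[str]:
    """Return the user identifier, or None when it exceeds the limit."""
    if user is not None and len(user) > MAX_USER_LEN:
        return None  # dropped without raising an error
    return user


print(sanitize_user("team-a/user-42"))  # team-a/user-42
print(sanitize_user("x" * 65))          # None
```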
