[ Provider Guide ]

Ollama Provider on Bifrost

Ollama provides self-hosted LLM inference through an OpenAI-compatible API. Bifrost routes requests to Ollama with support for streaming, embeddings, and tool calling in local deployments.

Ollama provider summary

Ollama is a self-hosted inference engine whose API follows the OpenAI request/response format. Bifrost routes to it with streaming, embeddings, and tool calling support. An Ollama instance typically runs locally at http://localhost:11434.

Common Ollama models used in Bifrost routes:

  • ollama/llama3.1:latest (128K context)
  • ollama/mistral:latest (32K context)
  • ollama/neural-chat:latest (8K context)

| Property | Details |
| --- | --- |
| Description | Self-hosted LLM inference engine with OpenAI-compatible API. |
| Provider route on Bifrost | ollama/<model> |
| Typical endpoint | http://localhost:11434 |
| Supported endpoints | /v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/models |
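
The ollama/<model> route is a provider prefix followed by the Ollama model tag. A minimal sketch of building and splitting such route strings is shown below; the helper names are hypothetical and are not part of Bifrost.

```python
PROVIDER_PREFIX = "ollama"  # provider segment of the Bifrost route


def to_bifrost_route(ollama_model: str) -> str:
    """Turn an Ollama model tag such as 'llama3.1:latest' into 'ollama/llama3.1:latest'."""
    if not ollama_model:
        raise ValueError("model name must not be empty")
    return f"{PROVIDER_PREFIX}/{ollama_model}"


def split_bifrost_route(route: str) -> tuple[str, str]:
    """Split 'ollama/llama3.1:latest' into ('ollama', 'llama3.1:latest')."""
    # partition splits on the first '/', so tags containing ':' stay intact.
    provider, sep, model = route.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {route!r}")
    return provider, model
```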

Supported operations

Ollama supports five operations through Bifrost. Chat completions, the Responses API, and text completions support streaming. Embeddings accept a custom output dimensionality.

| Operation | Non-streaming | Streaming | Upstream endpoint |
| --- | --- | --- | --- |
| Chat Completions | Yes | Yes | /v1/chat/completions |
| Responses API | Yes | Yes | /v1/chat/completions |
| Text Completions | Yes | Yes | /v1/completions |
| Embeddings | Yes | No | /v1/embeddings |
| List Models | Yes | No | /v1/models |
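
The operation table can be carried around as plain data, which makes it easy to reject a streaming request for an operation that does not support it. This is only an illustrative sketch: the dictionary keys and the check_stream_allowed helper are hypothetical names, and the values are copied from the table above.

```python
from typing import NamedTuple


class Operation(NamedTuple):
    upstream_endpoint: str  # path called on the Ollama instance
    streaming: bool         # whether a streaming variant is available


# Values transcribed from the supported-operations table.
OPERATIONS = {
    "chat_completions": Operation("/v1/chat/completions", True),
    "responses": Operation("/v1/chat/completions", True),
    "text_completions": Operation("/v1/completions", True),
    "embeddings": Operation("/v1/embeddings", False),
    "list_models": Operation("/v1/models", False),
}


def check_stream_allowed(operation: str, stream: bool) -> None:
    """Raise ValueError if streaming is requested for a non-streaming operation."""
    op = OPERATIONS[operation]
    if stream and not op.streaming:
        raise ValueError(f"{operation} does not support streaming")
```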

Parameter handling

Ollama accepts the same request format as OpenAI, including streaming responses. Embeddings accept a custom output size via the dimensions parameter. Tool calling and function definitions are supported.

Custom embeddings:

  • Customize embedding dimensionality per request
  • Specify dimensions parameter for custom output size

Tool calling:

  • Full support for function definitions
  • Tool choice parameter supported
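
As a rough illustration of the tool calling support described above, the sketch below builds a chat request body in the OpenAI tools format. The get_weather function and its schema are made up for the example, and whether a given Ollama model actually emits tool calls depends on the model.

```python
import json


def build_tool_call_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat request that offers one function tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


# The resulting dict can be serialized and sent as the POST body.
EXAMPLE_BODY = json.dumps(
    build_tool_call_request("ollama/llama3.1:latest", "Weather in Paris?")
)
```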

Supported Ollama parameters

Quick reference of OpenAI-compatible parameters accepted when routing through Bifrost to Ollama.

[
  "stream",
  "temperature",
  "top_p",
  "top_k",
  "max_tokens",
  "stop"
]
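
One way to use this quick reference on the client side is to separate the listed parameters from anything else before sending a request. The sketch below is an assumption about how a caller might do that; it does not describe how Bifrost itself treats unlisted parameters, and split_params is a hypothetical helper.

```python
# Parameter names copied from the quick-reference list above.
LISTED_PARAMS = {"stream", "temperature", "top_p", "top_k", "max_tokens", "stop"}


def split_params(params: dict) -> tuple[dict, dict]:
    """Return (listed, unlisted) parameter dicts, preserving values."""
    listed = {k: v for k, v in params.items() if k in LISTED_PARAMS}
    unlisted = {k: v for k, v in params.items() if k not in LISTED_PARAMS}
    return listed, unlisted
```

For example, split_params({"temperature": 0.2, "logit_bias": {}}) puts temperature in the first dict and logit_bias in the second, so the caller can log or drop the latter.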

Popular models

Common Ollama models for local inference. Pull with ollama pull <model> before routing through Bifrost with the ollama/ prefix. See Popular models in Bifrost docs.

| Model | Size | Context | Speed | Bifrost route |
| --- | --- | --- | --- | --- |
| llama3.1:latest | Varies | 128K | Fast | ollama/llama3.1:latest |
| mistral:latest | 7B | 32K | Very Fast | ollama/mistral:latest |
| neural-chat:latest | 7B | 8K | Very Fast | ollama/neural-chat:latest |
| orca-mini:latest | 3B | 3K | Very Fast | ollama/orca-mini:latest |
| openchat:latest | 7B | 8K | Very Fast | ollama/openchat:latest |

Context windows vary by model (for example, Llama 3.1 70B supports up to 128K tokens). Set stream: true for a more responsive experience with larger models. Ollama uses GPU acceleration when available.
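
With stream: true, an OpenAI-compatible endpoint returns server-sent events whose data: lines carry JSON chunks, ending with data: [DONE]. The sketch below assumes Bifrost relays chunks in that standard shape and shows how a client could reassemble the text; the sample lines are invented for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Invented sample of what a short stream might look like.
SAMPLE_LINES = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
```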

API reference by operation

OpenAI-compatible endpoints for self-hosted Ollama instances.

1) Chat Completions

Primary request path. Maps to upstream /v1/chat/completions. Fully compatible with OpenAI request format.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
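
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This assumes Bifrost is listening on http://localhost:8080 as in the curl example; build_chat_request and send are illustrative helper names.

```python
import json
import urllib.request

BIFROST_URL = "http://localhost:8080"  # gateway address taken from the curl example


def build_chat_request(model: str, content: str) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": content}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BIFROST_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Calling send(build_chat_request("ollama/llama3.1:latest", "Hello")) requires a running Bifrost gateway with an Ollama provider configured.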

2) Responses API

The Responses API is converted internally to Chat Completions. The upstream call goes to /v1/chat/completions on your Ollama instance, with the same parameter support as Chat Completions. See Responses API in Bifrost docs.

Conversion flow: ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
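
To make the ResponsesRequest → ChatRequest step concrete, here is a simplified sketch of what such a conversion might look like. The field mapping (input to messages, max_output_tokens to max_tokens) is an assumption for illustration and is not taken from Bifrost's implementation, which may handle more fields and cases.

```python
def responses_to_chat(request: dict) -> dict:
    """Convert a Responses-style request dict into a chat-completions-style dict."""
    raw_input = request["input"]
    if isinstance(raw_input, str):
        # Assumed mapping: a bare string becomes a single user message.
        messages = [{"role": "user", "content": raw_input}]
    else:
        # Assumed mapping: a list of {role, content} items is used as-is.
        messages = list(raw_input)

    chat = {"model": request["model"], "messages": messages}
    if "max_output_tokens" in request:
        chat["max_tokens"] = request["max_output_tokens"]
    # Sampling and streaming options share names in both formats.
    for key in ("temperature", "top_p", "stream"):
        if key in request:
            chat[key] = request[key]
    return chat
```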

3) Text Completions

Legacy text completion format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.

| Parameter | Mapping | Notes |
| --- | --- | --- |
| prompt | Direct pass-through | |
| max_tokens | max_tokens | |
| temperature | Direct pass-through | |
| top_p | Direct pass-through | |
| stop | Stop sequences | |

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/mistral:latest",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

4) Embeddings

Text embeddings at /v1/embeddings — no streaming. Response returns embedding vectors with token usage. See Embeddings in Bifrost docs.

| Parameter | Notes |
| --- | --- |
| input | Text or array of texts |
| model | Embedding model name |
| encoding_format | "float" or "base64" |
| dimensions | Custom output dimensions (optional) |

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/mxbai-embed-large:latest",
    "input": "Hello world",
    "dimensions": 1024
  }'
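
The request and response handling for embeddings can be sketched as two small helpers. This assumes the response follows the OpenAI embeddings shape (a data list of items with index and embedding fields) and that encoding_format is left at "float"; the helper names are hypothetical.

```python
from typing import Optional, Union


def build_embeddings_payload(
    model: str,
    texts: Union[str, list[str]],
    dimensions: Optional[int] = None,
) -> dict:
    """Build the JSON body for /v1/embeddings, adding dimensions only when set."""
    payload: dict = {"model": model, "input": texts}
    if dimensions is not None:
        if dimensions <= 0:
            raise ValueError("dimensions must be a positive integer")
        payload["dimensions"] = dimensions
    return payload


def extract_vectors(response: dict) -> list[list[float]]:
    """Return embedding vectors ordered by their index in the response."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```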

5) List Models

GET /v1/models — lists models currently available in your Ollama instance with capabilities and context information. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models
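
A client will often want only the Ollama entries from the model list. The sketch below assumes the response uses the OpenAI list shape (a data array of objects with an id field) and that ids carry the ollama/ prefix; both are assumptions, so check an actual response from your gateway before relying on them.

```python
def ollama_model_ids(models_response: dict) -> list[str]:
    """Return the sorted ids of models whose id starts with 'ollama/'."""
    return sorted(
        model["id"]
        for model in models_response.get("data", [])
        if model.get("id", "").startswith("ollama/")
    )


# Invented response body, for illustration only.
SAMPLE_MODELS_RESPONSE = {
    "object": "list",
    "data": [
        {"id": "ollama/mistral:latest", "object": "model"},
        {"id": "ollama/llama3.1:latest", "object": "model"},
        {"id": "openai/gpt-4o", "object": "model"},
    ],
}
```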

Implementation caveats

| Caveat | Impact | Severity |
| --- | --- | --- |
| BaseURL configuration required | Must configure Ollama endpoint (typically localhost:11434) | High |
| No image/audio/video support | Image generation, TTS, STT, video not available | Medium |
| Local deployment only | Ollama is self-hosted and requires local setup | High |
| Identical OpenAI format | Requests and responses match OpenAI format exactly | Low |
| Custom embedding dimensions | Embeddings support dimension customization parameter | Low |
