Ollama on Bifrost: Self-Hosted LLM Inference, OpenAI Compatibility

Ollama provider summary

Ollama is a self-hosted inference engine that uses the same request/response format as OpenAI. Bifrost routes requests to Ollama with full support for streaming, embeddings, and tool calling. Ollama typically runs locally at http://localhost:11434.

Common Ollama models used in Bifrost routes:

ollama/llama3.1:latest (128K context)
ollama/mistral:latest (32K context)
ollama/neural-chat:latest (8K context)

Property	Details
Description	Self-hosted LLM inference engine with OpenAI-compatible API.
Provider route on Bifrost	ollama/<model>
Typical endpoint	http://localhost:11434
Supported endpoints	/v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/models
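
As a small illustration of the ollama/<model> route format, the sketch below splits a route into its provider and model parts. The helper name split_route is hypothetical and is not part of Bifrost; it only shows how the prefix and the model tag relate.

```python
def split_route(route: str) -> tuple[str, str]:
    """Split a route such as 'ollama/llama3.1:latest' into (provider, model)."""
    # Split on the first slash only, so the model part is kept whole.
    provider, sep, model = route.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {route!r}")
    return provider, model


print(split_route("ollama/llama3.1:latest"))
```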

Supported operations

Ollama supports five major operations. Chat Completions, the Responses API, and Text Completions all support streaming; Embeddings and List Models are non-streaming. Embedding requests can set a custom output dimensionality.

Operation	Non-streaming	Streaming	Upstream endpoint
Chat Completions	Yes	Yes	/v1/chat/completions
Responses API	Yes	Yes	/v1/chat/completions
Text Completions	Yes	Yes	/v1/completions
Embeddings	Yes	No	/v1/embeddings
List Models	Yes	No	/v1/models

Parameter handling

Ollama accepts the same request format as OpenAI, including streaming responses. Embeddings accept a custom output size through the dimensions parameter. Tool calling and function definitions are fully supported.

Custom embeddings:

Customize embedding dimensionality per request
Specify dimensions parameter for custom output size

Tool calling:

Full support for function definitions
Tool choice parameter supported
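
A tool-calling request might look like the sketch below, which builds an OpenAI-style chat payload with a tools array and a tool_choice value. The get_weather tool and its schema are hypothetical and exist only for illustration; whether a given model actually emits tool calls depends on the model you have pulled.

```python
import json


def build_tool_call_payload(model: str, question: str) -> dict:
    """Build an OpenAI-style chat payload that offers one function tool."""
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not provided by Bifrost or Ollama
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


payload = build_tool_call_payload("ollama/llama3.1:latest", "What is the weather in Oslo?")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST to /v1/chat/completions.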

Supported Ollama parameters

Quick reference of OpenAI-compatible parameters accepted when routing through Bifrost to Ollama.

[
  "stream",
  "temperature",
  "top_p",
  "top_k",
  "max_tokens",
  "stop"
]
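
The list above can be used as a quick client-side check before sending a request. The sketch below separates a parameter dictionary into names that appear in the quick reference and names that do not. The helper split_params is hypothetical, and a name outside the list is not necessarily rejected (tools and dimensions, for example, are documented in other sections).

```python
# Names copied from the quick-reference list above.
QUICK_REFERENCE_PARAMS = {"stream", "temperature", "top_p", "top_k", "max_tokens", "stop"}


def split_params(params: dict) -> tuple[dict, dict]:
    """Return (listed, unlisted) parameters relative to the quick reference."""
    listed = {k: v for k, v in params.items() if k in QUICK_REFERENCE_PARAMS}
    unlisted = {k: v for k, v in params.items() if k not in QUICK_REFERENCE_PARAMS}
    return listed, unlisted


listed, unlisted = split_params({"temperature": 0.2, "top_k": 40, "logit_bias": {}})
print("listed:", listed)
print("unlisted:", unlisted)
```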

Popular models

Common Ollama models for local inference. Pull with ollama pull <model> before routing through Bifrost with the ollama/ prefix. See Popular models in Bifrost docs.

Model	Size	Context	Speed	Bifrost route
llama3.1:latest	Varies	128K	Fast	ollama/llama3.1:latest
mistral:latest	7B	32K	Very Fast	ollama/mistral:latest
neural-chat:latest	7B	8K	Very Fast	ollama/neural-chat:latest
orca-mini:latest	3B	3K	Very Fast	ollama/orca-mini:latest
openchat:latest	7B	8K	Very Fast	ollama/openchat:latest

Context windows vary by model (for example, Llama 3.1 70B supports up to 128K tokens). Use stream: true for better UX with larger models, since tokens can be displayed as they are generated. Ollama uses GPU acceleration when available.
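
When stream: true is set, an OpenAI-compatible endpoint normally replies with server-sent events, one data: line per chunk, ending with data: [DONE]. The sketch below assumes Bifrost relays that format unchanged and shows how the text deltas could be reassembled. The sample lines are made up for illustration and are not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style 'data: {...}' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hypothetical stream lines in the assumed format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```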

API reference by operation

OpenAI-compatible endpoints for self-hosted Ollama instances.

1) Chat Completions

Primary request path. Maps to upstream /v1/chat/completions. Fully compatible with OpenAI request format.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
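
The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the gateway listens on http://localhost:8080, as in the curl example; the helper names build_chat_request and send are illustrative.

```python
import json
import urllib.request

BIFROST_URL = "http://localhost:8080"  # gateway address taken from the curl example


def build_chat_request(model: str, prompt: str, base_url: str = BIFROST_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


req = build_chat_request("ollama/llama3.1:latest", "Hello")
print(req.get_method(), req.full_url)
```

With the gateway and Ollama running, send(req) should return an OpenAI-style completion object, in which the reply text is normally found under choices[0].message.content.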

2) Responses API

Bifrost converts Responses API requests to Chat Completions internally and sends them upstream to /v1/chat/completions on your Ollama instance. Parameter support is the same as for Chat Completions. See Responses API in Bifrost docs.

ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
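
To make the first step of the conversion pipeline concrete, the sketch below turns a Responses-style request into a Chat Completions-style request. The field mapping (input to messages, max_output_tokens to max_tokens) is an assumption made for illustration and is not taken from Bifrost's source; the function name responses_to_chat is hypothetical.

```python
def responses_to_chat(request: dict) -> dict:
    """Illustrative conversion of a Responses request into a chat request."""
    user_input = request["input"]
    if isinstance(user_input, str):
        # A bare string becomes a single user message.
        messages = [{"role": "user", "content": user_input}]
    else:
        # Assumed: a list input already holds role/content messages.
        messages = list(user_input)

    chat = {"model": request["model"], "messages": messages}
    if "max_output_tokens" in request:
        chat["max_tokens"] = request["max_output_tokens"]  # assumed mapping
    for key in ("temperature", "top_p", "stream"):
        if key in request:
            chat[key] = request[key]  # copied through unchanged
    return chat


chat_request = responses_to_chat(
    {"model": "ollama/llama3.1:latest", "input": "Hello", "max_output_tokens": 1024}
)
print(chat_request)
```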

3) Text Completions

Legacy text completion format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.

Parameter	Mapping	Notes
prompt	Direct pass-through
max_tokens	max_tokens
temperature	Direct pass-through
top_p	Direct pass-through
stop	Stop sequences

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/mistral:latest",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

4) Embeddings

Text embeddings at /v1/embeddings. Streaming is not supported. The response returns embedding vectors with token usage. See Embeddings in Bifrost docs.

Parameter	Notes
input	Text or array of texts
model	Embedding model name
encoding_format	"float" or "base64"
dimensions	Custom output dimensions (Optional)

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/mxbai-embed-large:latest",
    "input": "Hello world",
    "dimensions": 1024
  }'
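
The sketch below builds an embeddings payload with an optional dimensions value and reads the vectors back from an OpenAI-style response. The sample response is hand-written for illustration (real vectors are much longer), and the helper names are hypothetical.

```python
from typing import Optional


def build_embedding_payload(model: str, texts: list[str], dimensions: Optional[int] = None) -> dict:
    """Build the JSON body for /v1/embeddings; dimensions is optional."""
    payload = {"model": model, "input": texts}
    if dimensions is not None:
        payload["dimensions"] = dimensions
    return payload


def extract_vectors(response: dict) -> list[list[float]]:
    """Return embedding vectors ordered by their 'index' field."""
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


payload = build_embedding_payload(
    "ollama/mxbai-embed-large:latest", ["Hello world", "Goodbye"], dimensions=4
)

# Hypothetical response in the OpenAI embeddings shape, items deliberately out of order.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.5, 0.6, 0.7, 0.8]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3, 0.4]},
    ],
}
vectors = extract_vectors(sample_response)
print(len(vectors), "vectors of length", len(vectors[0]))
```

Checking the returned vector length against the requested dimensions is a cheap way to confirm that the chosen embedding model honored the parameter.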

5) List Models

GET /v1/models — lists models currently available in your Ollama instance with capabilities and context information. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models
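
A list response in the OpenAI shape carries one object per model under data. The sketch below filters such a response down to model IDs for one provider. The sample response, including the ollama/ prefix on each ID and the openai/ entry, is an assumption for illustration, so check the actual output of your gateway before relying on it.

```python
def model_ids(response: dict, provider: str = "ollama") -> list[str]:
    """Return model IDs from a list response that start with '<provider>/'."""
    prefix = provider + "/"
    return [
        model["id"]
        for model in response.get("data", [])
        if model.get("id", "").startswith(prefix)
    ]


# Hypothetical response body; IDs and fields are assumed, not captured output.
sample_models = {
    "object": "list",
    "data": [
        {"id": "ollama/llama3.1:latest", "object": "model"},
        {"id": "ollama/mistral:latest", "object": "model"},
        {"id": "openai/gpt-4o", "object": "model"},
    ],
}
print(model_ids(sample_models))
```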

Implementation caveats

Caveat	Impact	Severity
BaseURL configuration required	Must configure Ollama endpoint (typically localhost:11434)	High
No image/audio/video support	Image generation, TTS, STT, video not available	Medium
Local deployment only	Ollama is self-hosted and requires local setup	High
Identical OpenAI format	Requests and responses match OpenAI format exactly	Low
Custom embedding dimensions	Embeddings support dimension customization parameter	Low