Ollama provider summary
Ollama is a self-hosted inference engine that uses the same request/response format as OpenAI. Bifrost routes to Ollama with streaming, embeddings, and tool calling support. Ollama typically runs locally at http://localhost:11434.
Common Ollama models used in Bifrost routes:
- ollama/llama3.1:latest (128K context)
- ollama/mistral:latest (32K context)
- ollama/neural-chat:latest (8K context)
| Property | Details |
|---|---|
| Description | Self-hosted LLM inference engine with OpenAI-compatible API. |
| Provider route on Bifrost | ollama/<model> |
| Typical endpoint | http://localhost:11434 |
| Supported endpoints | /v1/chat/completions, /v1/responses, /v1/completions, /v1/embeddings, /v1/models |
Supported operations
Bifrost supports five operations on Ollama. Chat Completions, the Responses API, and Text Completions support streaming; Embeddings and List Models are non-streaming. Embeddings accept a custom output dimensionality.
| Operation | Non-streaming | Streaming | Upstream endpoint |
|---|---|---|---|
| Chat Completions | Yes | Yes | /v1/chat/completions |
| Responses API | Yes | Yes | /v1/chat/completions |
| Text Completions | Yes | Yes | /v1/completions |
| Embeddings | Yes | No | /v1/embeddings |
| List Models | Yes | No | /v1/models |
Parameter handling
Ollama accepts the same request format as OpenAI, including streaming responses. Embeddings accept a custom output size via the dimensions parameter. Tool calling and function definitions are supported.
Custom embeddings:
- Customize embedding dimensionality per request
- Specify dimensions parameter for custom output size
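A minimal sketch of a request body that sets dimensions, assuming Bifrost listens on http://localhost:8080 and the embedding model is ollama/mxbai-embed-large:latest (both taken from the curl examples in this document). The helper names build_embedding_request and post_json are hypothetical, and whether a given model honors a requested size depends on the model.

```python
import json
import urllib.request

# Assumption: Bifrost gateway address, as used by the curl examples in this document.
BIFROST_URL = "http://localhost:8080"


def build_embedding_request(texts, model="ollama/mxbai-embed-large:latest", dimensions=None):
    """Build a /v1/embeddings body; `dimensions` is only sent when it is set."""
    body = {"model": model, "input": texts}
    if dimensions is not None:
        if dimensions <= 0:
            raise ValueError("dimensions must be a positive integer")
        body["dimensions"] = dimensions
    return body


def post_json(path, body, base_url=BIFROST_URL):
    """POST a JSON body to the gateway and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running gateway, post_json("/v1/embeddings", build_embedding_request("Hello world", dimensions=1024)) would send the request; omitting dimensions leaves the output size to the model.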
Tool calling:
- Full support for function definitions
- Tool choice parameter supported
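The sketch below shows one way a tool-calling request might be built and its result read back, assuming the OpenAI chat format for tools, tool_choice, and tool_calls. The get_weather tool and both helper names are hypothetical and exist only for illustration.

```python
import json


def build_tool_request(prompt, model="ollama/llama3.1:latest"):
    """Build a chat body that offers one function tool to the model."""
    get_weather = {  # hypothetical tool definition, not part of Bifrost or Ollama
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [get_weather],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-format chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string in the OpenAI format.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The body from build_tool_request would be posted to /v1/chat/completions; extract_tool_calls returns an empty list when the model answers in plain text instead of calling a tool.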
Supported Ollama parameters
Quick reference of OpenAI-compatible parameters accepted when routing through Bifrost to Ollama.
[ "stream", "temperature", "top_p", "top_k", "max_tokens", "stop" ]
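As an illustration, a client could restrict its request options to the names in this list. The sketch below is a client-side convenience only: the helper build_chat_request is hypothetical, and the strict check is a design choice of this example, since the list does not say how Bifrost treats other parameters.

```python
# Parameter names copied from the quick-reference list above.
SUPPORTED_PARAMS = ("stream", "temperature", "top_p", "top_k", "max_tokens", "stop")


def build_chat_request(prompt, model="ollama/llama3.1:latest", **options):
    """Build a chat body, refusing options that are not in the quick reference."""
    unknown = sorted(set(options) - set(SUPPORTED_PARAMS))
    if unknown:
        raise ValueError(f"not in the quick-reference list: {unknown}")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update(options)
    return body
```

For example, build_chat_request("Hello", temperature=0.2, max_tokens=64) produces a body that can be sent to /v1/chat/completions.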
Popular models
Common Ollama models for local inference. Pull with ollama pull <model> before routing through Bifrost with the ollama/ prefix. See Popular models in Bifrost docs.
| Model | Size | Context | Speed | Bifrost route |
|---|---|---|---|---|
| llama3.1:latest | Varies | 128K | Fast | ollama/llama3.1:latest |
| mistral:latest | 7B | 32K | Very Fast | ollama/mistral:latest |
| neural-chat:latest | 7B | 8K | Very Fast | ollama/neural-chat:latest |
| orca-mini:latest | 3B | 3K | Very Fast | ollama/orca-mini:latest |
| openchat:latest | 7B | 8K | Very Fast | ollama/openchat:latest |
Context windows vary by model (for example, Llama 3.1 70B supports up to 128K tokens). Set stream: true so that tokens are shown as they are generated, which improves perceived latency with larger models. Ollama uses GPU acceleration when available.
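A minimal sketch of consuming a streamed chat response, assuming Bifrost listens on http://localhost:8080 and emits OpenAI-style server-sent events (lines of the form data: {...}, ending with data: [DONE]). The helper names iter_content_deltas and stream_chat are hypothetical.

```python
import json
import urllib.request

# Assumption: Bifrost gateway address, as used by the curl examples in this document.
BIFROST_URL = "http://localhost:8080"


def iter_content_deltas(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt, model="ollama/llama3.1:latest", base_url=BIFROST_URL):
    """Send a streaming chat request and yield text as it arrives."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Iterating an HTTP response yields it line by line.
        yield from iter_content_deltas(response)
```

With a running gateway, for piece in stream_chat("Hello"): print(piece, end="", flush=True) would print the reply incrementally.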
API reference by operation
OpenAI-compatible endpoints for self-hosted Ollama instances.
1) Chat Completions
Primary request path. Maps to upstream /v1/chat/completions. Fully compatible with OpenAI request format.
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/llama3.1:latest",
"messages": [{"role": "user", "content": "Hello"}]
}'
2) Responses API
The Responses API is converted internally to Chat Completions. Upstream routes to /v1/chat/completions on your Ollama instance. Same parameter support as Chat Completions. See Responses API in Bifrost docs.
ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
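To make the first arrow of this pipeline concrete, the sketch below maps a Responses-style body to a Chat Completions body. It is an illustration only, not Bifrost's actual conversion code: the function name responses_to_chat and the field mapping (input to messages, max_output_tokens to max_tokens) are assumptions.

```python
def responses_to_chat(request):
    """Illustrative mapping of a Responses-style body to a Chat Completions body."""
    raw_input = request["input"]
    if isinstance(raw_input, str):
        # A bare string is treated as a single user message.
        messages = [{"role": "user", "content": raw_input}]
    else:
        # Assumed shape: a list of {"role": ..., "content": ...} items.
        messages = [{"role": item["role"], "content": item["content"]} for item in raw_input]

    chat = {"model": request["model"], "messages": messages}
    if "max_output_tokens" in request:
        chat["max_tokens"] = request["max_output_tokens"]
    # Parameters assumed to carry over under the same name.
    for key in ("temperature", "top_p", "stream"):
        if key in request:
            chat[key] = request[key]
    return chat
```

The reverse step (ChatCompletion to ResponsesResponse) is omitted here because the response field mapping is not described in this section.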
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/llama3.1:latest",
"input": "Hello",
"max_output_tokens": 1024
}'
3) Text Completions
Legacy text completion format at /v1/completions. Supports streaming. See Text Completions in Bifrost docs.
| Parameter | Mapping | Notes |
|---|---|---|
| prompt | Direct pass-through | |
| max_tokens | max_tokens | |
| temperature | Direct pass-through | |
| top_p | Direct pass-through | |
| stop | Stop sequences | |
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/mistral:latest",
"prompt": "Hello, my name is",
"max_tokens": 50
}'
4) Embeddings
Text embeddings at /v1/embeddings — no streaming. Response returns embedding vectors with token usage. See Embeddings in Bifrost docs.
| Parameter | Notes |
|---|---|
| input | Text or array of texts |
| model | Embedding model name |
| encoding_format | "float" or "base64" |
| dimensions | Custom output dimensions (Optional) |
curl -X POST http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/mxbai-embed-large:latest",
"input": "Hello world",
"dimensions": 1024
}'
5) List Models
GET /v1/models — lists models currently available in your Ollama instance with capabilities and context information. See List Models in Bifrost docs.
curl http://localhost:8080/v1/models
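The same call can be sketched in Python, assuming the response follows the OpenAI list format (an object with a data array whose entries carry an id). The helper names model_ids and list_models are hypothetical; any extra capability or context fields are left untouched because their exact names are not given here.

```python
import json
import urllib.request

# Assumption: Bifrost gateway address, as used by the curl examples in this document.
BIFROST_URL = "http://localhost:8080"


def model_ids(listing):
    """Extract model identifiers from an OpenAI-format model listing."""
    return [entry["id"] for entry in listing.get("data", [])]


def list_models(base_url=BIFROST_URL):
    """Fetch /v1/models from the gateway and return the model identifiers."""
    with urllib.request.urlopen(base_url + "/v1/models") as response:
        return model_ids(json.load(response))
```

Calling list_models() against a running gateway returns the identifiers to use in the model field of later requests.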
Implementation caveats
| Caveat | Impact | Severity |
|---|---|---|
| BaseURL configuration required | Must configure Ollama endpoint (typically localhost:11434) | High |
| No image/audio/video support | Image generation, TTS, STT, video not available | Medium |
| Local deployment only | Ollama is self-hosted and requires local setup | High |
| Identical OpenAI format | Requests and responses match OpenAI format exactly | Low |
| Custom embedding dimensions | Embeddings support dimension customization parameter | Low |
Authoritative references
- Bifrost Ollama provider reference: docs.getbifrost.ai/providers/supported-providers/ollama
- Ollama official site: ollama.com
- Bifrost provider support overview: docs.getbifrost.ai/providers/supported-providers/overview