Hugging Face provider summary
Hugging Face integrates 20+ inference backends via Bifrost. Models use the composite format huggingface/{inference_provider}/{model_id} (for example huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct). Per-backend capabilities are listed in Supported inference providers below.
Common Hugging Face models used in Bifrost routes:
huggingface/cerebras/llama-3.1-70b-instruct
huggingface/groq/llama-3.2-90b-vision-instruct
huggingface/fireworks/mixtral-8x7b-instruct
| Property | Details |
|---|---|
| Description | Multi-backend provider supporting chat, embeddings, TTS, STT, and image operations. |
| Provider route on Bifrost | huggingface/{backend}/{model} |
| Inference providers | 20+ backends (hf-inference, fal-ai, cerebras, groq, fireworks, nebius, sambanova, and more) — see Supported inference providers |
| Supported endpoints | /v1/chat/completions, /v1/embeddings, /v1/audio/*, /v1/images/* |
Supported inference providers
Bifrost routes Hugging Face requests to 20+ inference backends. Capabilities vary by provider; model routes use huggingface/{inference_provider}/{model_id}. All chat-supported backends also support the Responses API via Bifrost's internal conversion. See Supported inference providers in Bifrost docs. For the latest upstream capabilities, see Hugging Face Inference Providers documentation.
| Provider | Chat | Embedding | Speech (TTS) | Transcription | Image gen | Image gen (stream) | Image edit | Image edit (stream) |
|---|---|---|---|---|---|---|---|---|
| hf-inference | Yes | Yes | No | Yes | Yes | No | No | No |
| cerebras | Yes | No | No | No | No | No | No | No |
| cohere | Yes | No | No | No | No | No | No | No |
| fal-ai | No | No | Yes | Yes | Yes | Yes | Yes | Yes |
| featherless-ai | Yes | No | No | No | No | No | No | No |
| fireworks | Yes | No | No | No | No | No | No | No |
| groq | Yes | No | No | No | No | No | No | No |
| hyperbolic | Yes | No | No | No | No | No | No | No |
| nebius | Yes | Yes | No | No | Yes | No | No | No |
| novita | Yes | No | No | No | No | No | No | No |
| nscale | Yes | No | No | No | No | No | No | No |
| ovhcloud-ai-endpoints | Yes | No | No | No | No | No | No | No |
| public-ai | Yes | No | No | No | No | No | No | No |
| replicate | No | No | Yes | Yes | No | No | No | No |
| sambanova | Yes | Yes | No | No | No | No | No | No |
| scaleway | Yes | Yes | No | No | No | No | No | No |
| together | Yes | No | No | No | Yes | No | No | No |
| z-ai | Yes | No | No | No | No | No | No | No |
Yes indicates a capability supported by that inference provider upstream. Provider capabilities may change over time.
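The table can double as a pre-flight check before a request is sent. The following is a minimal sketch, assuming a hand-copied subset of the rows above; `CAPABILITIES` and `supports` are hypothetical names for illustration and are not part of Bifrost.

```python
# Hypothetical capability lookup mirroring a few rows of the table above.
# The sets are copied by hand and must be kept in sync with the table.
CAPABILITIES = {
    "hf-inference": {"chat", "embedding", "transcription", "image_gen"},
    "cerebras": {"chat"},
    "fal-ai": {
        "speech", "transcription", "image_gen",
        "image_gen_stream", "image_edit", "image_edit_stream",
    },
    "nebius": {"chat", "embedding", "image_gen"},
    "replicate": {"speech", "transcription"},
}


def supports(provider: str, operation: str) -> bool:
    """Return True if the table row for `provider` lists `operation` as Yes."""
    return operation in CAPABILITIES.get(provider, set())


print(supports("fal-ai", "image_edit"))   # True
print(supports("cerebras", "embedding"))  # False
```

Unknown providers fall through to an empty set, so the helper returns False rather than raising.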
Parameter handling
Bifrost converts OpenAI-format parameters to each backend's native format. Audio input encoding and image field mapping differ by provider, as listed below.
Audio format handling:
- hf-inference: expects raw audio bytes
- fal-ai: expects base64 Data URIs
Image generation:
- Streaming support available via fal-ai backend only
- Automatic input field mapping by model type
Supported Hugging Face parameters
Quick reference of parameters accepted when routing through Bifrost to Hugging Face backends.
[ "stream", "temperature", "top_p", "top_k", "max_tokens", "stop", "response_format", "tools", "tool_choice" ]
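A client can use this list to see which request fields fall outside the quick reference. This is a sketch only: the source does not say whether Bifrost drops, forwards, or rejects unlisted parameters, and `split_params` is a hypothetical helper.

```python
# Parameter names copied from the quick-reference list above.
SUPPORTED_PARAMS = {
    "stream", "temperature", "top_p", "top_k", "max_tokens",
    "stop", "response_format", "tools", "tool_choice",
}


def split_params(params: dict) -> tuple[dict, dict]:
    """Split params into (listed, unlisted) according to the quick reference."""
    listed = {k: v for k, v in params.items() if k in SUPPORTED_PARAMS}
    unlisted = {k: v for k, v in params.items() if k not in SUPPORTED_PARAMS}
    return listed, unlisted


listed, unlisted = split_params({"temperature": 0.2, "top_k": 40, "logit_bias": {}})
print(sorted(listed))    # ['temperature', 'top_k']
print(sorted(unlisted))  # ['logit_bias']
```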
Popular models via Hugging Face
Use the provider prefix huggingface/{backend}/ in Bifrost model routes. Specify the backend and model ID for proper routing.
| Model | Model ID | Bifrost route | Typical usage |
|---|---|---|---|
| Llama 3.1 70B | meta-llama/Llama-3.1-70B-Instruct | huggingface/cerebras/llama-3.1-70b-instruct | High performance |
| Llama 3.2 90B Vision | meta-llama/Llama-3.2-90B-Vision-Instruct | huggingface/groq/llama-3.2-90b-vision-instruct | Multimodal |
| Mixtral 8x7B | mistralai/Mixtral-8x7B-Instruct-v0.1 | huggingface/fireworks/mixtral-8x7b-instruct | Mixture of experts |
API reference
Bifrost routes Hugging Face requests by inference provider using huggingface/{inference_provider}/{model_id}. Hugging Face enforces a 2 MB request body limit across chat, embeddings, speech, and transcription. This section follows the Bifrost Hugging Face provider docs.
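Because of the 2 MB limit, it can help to measure a JSON body before sending it. A minimal sketch, assuming "2 MB" means 2 × 1024 × 1024 bytes (the source does not say whether the limit is binary or decimal); `body_within_limit` is a hypothetical helper.

```python
import json

# Assumption: the 2 MB limit is interpreted as 2 MiB.
MAX_BODY_BYTES = 2 * 1024 * 1024


def body_within_limit(payload: dict) -> bool:
    """Serialize the payload as JSON and compare its UTF-8 size to the limit."""
    body = json.dumps(payload).encode("utf-8")
    return len(body) <= MAX_BODY_BYTES


print(body_within_limit({"model": "huggingface/groq/example", "messages": []}))
```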
Model aliases and identification
Unlike single-string model IDs, Hugging Face routes use a composite key. See Model aliases in Bifrost docs.
huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct
// inference_provider → backend (hf-inference, fal-ai, cerebras, …)
// model_id → Hugging Face Hub model ID
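Since the Hub model ID itself contains slashes, only the first two separators delimit the composite key. The sketch below shows one way a client could split a route; `HFRoute` and `parse_route` are hypothetical names, not Bifrost functions.

```python
from typing import NamedTuple


class HFRoute(NamedTuple):
    inference_provider: str
    model_id: str


def parse_route(model: str) -> HFRoute:
    """Split huggingface/{inference_provider}/{model_id} into its parts."""
    prefix, _, rest = model.partition("/")
    if prefix != "huggingface":
        raise ValueError(f"not a Hugging Face route: {model!r}")
    # Split only once more: the model ID keeps its own slashes.
    provider, _, model_id = rest.partition("/")
    if not provider or not model_id:
        raise ValueError(f"expected huggingface/<provider>/<model_id>: {model!r}")
    return HFRoute(provider, model_id)


route = parse_route("huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct")
print(route.inference_provider)  # hf-inference
print(route.model_id)            # meta-llama/Meta-Llama-3-8B-Instruct
```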
1) Chat Completions
OpenAI-compatible chat at /v1/chat/completions. Bifrost converts requests per backend via chat.go; dynamic model aliasing maps Hub IDs to provider-specific model names. On HTTP 404, the provider model cache is invalidated and the request is retried.
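The invalidate-and-retry behavior on HTTP 404 can be pictured roughly as follows. This is an illustrative sketch and not Bifrost's code: the cache shape, the `resolve` and `send` callables, and the single retry are all assumptions.

```python
from typing import Callable, Dict, Tuple


def send_with_alias_retry(
    hub_id: str,
    cache: Dict[str, str],
    resolve: Callable[[str], str],
    send: Callable[[str], Tuple[int, str]],
) -> Tuple[int, str]:
    """Send using the cached provider model name; on 404, refresh and retry once.

    cache   maps Hub model IDs to provider-specific names (hypothetical shape).
    resolve looks up a fresh provider-specific name for a Hub ID.
    send    performs the request and returns (status_code, body).
    """
    if hub_id not in cache:
        cache[hub_id] = resolve(hub_id)
    status, body = send(cache[hub_id])
    if status == 404:
        # The cached alias may be stale: invalidate, re-resolve, retry once.
        cache.pop(hub_id, None)
        cache[hub_id] = resolve(hub_id)
        status, body = send(cache[hub_id])
    return status, body
```

Passing `resolve` and `send` as callables keeps the sketch independent of any HTTP library.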
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
2) Responses API
All chat-supported inference providers also support /v1/responses via Bifrost's internal conversion to chat completions.
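To make the conversion concrete, here is a rough sketch of how a Responses-style body could be turned into a chat completions body. The field mapping (`input` to a single user message, `max_output_tokens` to `max_tokens`) is an assumption for illustration; the source only states that a conversion happens, not how.

```python
def responses_to_chat(req: dict) -> dict:
    """Convert a minimal Responses-style request into a chat-style request.

    Assumed mapping, not confirmed by the source:
      input (str)        -> one user message
      input (list)       -> used as the messages list as-is
      max_output_tokens  -> max_tokens
    """
    raw = req["input"]
    if isinstance(raw, str):
        messages = [{"role": "user", "content": raw}]
    else:
        messages = list(raw)
    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        chat["max_tokens"] = req["max_output_tokens"]
    return chat


print(responses_to_chat({
    "model": "huggingface/groq/meta-llama/Llama-3.1-70B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024,
}))
```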
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "huggingface/groq/meta-llama/Llama-3.1-70B-Instruct",
"input": "Hello",
"max_output_tokens": 1024
}'
3) Embeddings
/v1/embeddings — supported on hf-inference, nebius, sambanova, scaleway. See Embedding requests in Bifrost docs.
- Most providers: JSON field input
- hf-inference: JSON field inputs (plural)
- Bifrost populates both fields in embedding.go for cross-provider compatibility
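The dual-field approach described above can be sketched as below. This illustrates the idea only and is not the contents of embedding.go; `build_embedding_body` is a hypothetical helper.

```python
from typing import List, Union


def build_embedding_body(text: Union[str, List[str]]) -> dict:
    """Populate both `input` and `inputs` so either naming convention is satisfied."""
    return {
        "input": text,   # field name read by most providers
        "inputs": text,  # plural field name read by hf-inference
    }


print(build_embedding_body("Hello world"))
# {'input': 'Hello world', 'inputs': 'Hello world'}
```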
curl -X POST http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "huggingface/hf-inference/sentence-transformers/all-MiniLM-L6-v2",
"input": "Hello world"
}'
4) Speech (Text-to-Speech)
/v1/audio/speech — fal-ai and replicate backends. TTS uses a dedicated JSON body (no pipeline_tag in the request). See Speech in Bifrost docs.
// HuggingFaceSpeechRequest (simplified)
{
"text": "Hello world",
"provider": "fal-ai",
"model": "…",
"parameters": { }
}
5) Transcriptions (ASR)
/v1/audio/transcriptions — request format depends on the inference provider. See Transcription in Bifrost docs.
| Provider | Request format | Notes |
|---|---|---|
| hf-inference | Raw audio bytes in body | Content-Type: audio MIME; max 2 MB; URL: /hf-inference/models/{model} |
| fal-ai | JSON with audio_url | Base64 data URI; MP3 only (WAV rejected) |
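The two request formats in the table can be contrasted in a small sketch. It assumes the MP3 MIME type `audio/mpeg`, reads the 2 MB limit as 2 MiB, and uses a hypothetical return shape (`headers` plus `body` or `json`); none of these names come from Bifrost.

```python
import base64

# Assumption: the 2 MB limit is interpreted as 2 MiB.
MAX_AUDIO_BYTES = 2 * 1024 * 1024


def build_transcription_request(provider: str, audio: bytes, mime: str) -> dict:
    """Build a provider-specific transcription request description."""
    if provider == "hf-inference":
        # Raw bytes in the body, with the audio MIME type as Content-Type.
        if len(audio) > MAX_AUDIO_BYTES:
            raise ValueError("audio exceeds the 2 MB limit")
        return {"headers": {"Content-Type": mime}, "body": audio}
    if provider == "fal-ai":
        # JSON body carrying a base64 data URI; MP3 only per the table.
        if mime != "audio/mpeg":
            raise ValueError("fal-ai accepts MP3 only")
        encoded = base64.b64encode(audio).decode("ascii")
        return {
            "headers": {"Content-Type": "application/json"},
            "json": {"audio_url": f"data:{mime};base64,{encoded}"},
        }
    raise ValueError(f"no transcription format sketched for {provider!r}")


req = build_transcription_request("fal-ai", b"abc", "audio/mpeg")
print(req["json"]["audio_url"])  # data:audio/mpeg;base64,YWJj
```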
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F file=@audio.mp3 \
-F model=huggingface/hf-inference/openai/whisper-large-v3
6) Image Generation
/v1/images/generations — routes by model string. Only fal-ai supports streaming. See Image Generation in Bifrost docs.
| Provider | Non-streaming | Streaming | Notes |
|---|---|---|---|
| hf-inference | Yes | No | Prompt-only JSON; returns raw image bytes |
| fal-ai | Yes | Yes | Full parameters; SSE streaming |
| nebius | Yes | No | Nebius format with width/height and LoRAs |
| together | Yes | No | OpenAI-compatible format |
fal-ai parameter mappings
| OpenAI / Bifrost | fal-ai | Notes |
|---|---|---|
| n | num_images | |
| size | image_size {width, height} | From WxH string |
| output_format | output_format | jpg normalized to jpeg |
| response_format: b64_json | sync_mode: true | Auto-set |
| moderation: low | enable_safety_checker: false | Auto-set |
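The mapping table translates fairly directly into code. The sketch below covers only the rows listed above; `to_fal_ai` is a hypothetical helper, and any field not in the table (including `prompt`) is left out.

```python
def to_fal_ai(params: dict) -> dict:
    """Map OpenAI-style image parameters to the fal-ai names from the table."""
    out = {}
    if "n" in params:
        out["num_images"] = params["n"]
    if "size" in params:
        # "1024x768" -> {"width": 1024, "height": 768}
        width, height = params["size"].lower().split("x")
        out["image_size"] = {"width": int(width), "height": int(height)}
    if "output_format" in params:
        fmt = params["output_format"]
        out["output_format"] = "jpeg" if fmt == "jpg" else fmt
    if params.get("response_format") == "b64_json":
        out["sync_mode"] = True
    if params.get("moderation") == "low":
        out["enable_safety_checker"] = False
    return out


print(to_fal_ai({"n": 2, "size": "1024x1024", "output_format": "jpg"}))
# {'num_images': 2, 'image_size': {'width': 1024, 'height': 1024}, 'output_format': 'jpeg'}
```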
curl -X POST http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "huggingface/fal-ai/fal-ai/flux/dev",
"prompt": "A futuristic cityscape at sunset",
"size": "1024x1024",
"n": 2,
"response_format": "url"
}'
7) Image Edit
Only fal-ai supports image edit; other providers return UnsupportedOperationError. Multipart images are converted to base64 data URLs. Image variation is not supported. See Image Edit in Bifrost docs.
| Parameter | Required | Notes |
|---|---|---|
| model | Yes | Must be huggingface/fal-ai/{model_id} |
| prompt | Yes | Edit description |
| image[] | Yes | Image file(s); converted to base64 data URLs |
| n | No | Maps to num_images (1–10) |
| size | No | WxH → image_size object |
| output_format | No | png, webp, jpeg |
| seed, num_inference_steps, guidance_scale | No | Via extra_params |
Multi-image models use image_urls; single-image models use image_url. Override with extra_params.use_image_urls. Streaming: SSE with image_edit.partial_image / image_edit.completed.
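The choice between `image_urls` and `image_url` can be sketched as follows. How Bifrost decides that a model is multi-image is not described here, so the `multi_image_model` flag is a stand-in; `to_data_url`, `image_fields`, and the default `image/png` MIME type are likewise hypothetical.

```python
import base64
from typing import List, Optional


def to_data_url(image: bytes, mime: str = "image/png") -> str:
    """Encode image bytes as a base64 data URL."""
    return f"data:{mime};base64,{base64.b64encode(image).decode('ascii')}"


def image_fields(
    images: List[bytes],
    multi_image_model: bool,
    use_image_urls: Optional[bool] = None,
) -> dict:
    """Pick image_urls or image_url; an explicit use_image_urls overrides the flag."""
    if not images:
        raise ValueError("at least one image is required")
    urls = [to_data_url(img) for img in images]
    plural = multi_image_model if use_image_urls is None else use_image_urls
    if plural:
        return {"image_urls": urls}
    return {"image_url": urls[0]}


print(image_fields([b"abc"], multi_image_model=False))
# {'image_url': 'data:image/png;base64,YWJj'}
```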
8) List Models
GET /v1/models — parallel Hub queries per inference provider, filtered by pipeline_tag (chat, feature-extraction, text-to-speech, etc.). Returns IDs as huggingface/{provider}/{model_id}. See List Models in Bifrost docs.
curl http://localhost:8080/v1/models
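Because the returned IDs carry the provider in their second segment, a client can group them without extra metadata. A minimal sketch, assuming the caller has already extracted a plain list of ID strings from the response; `group_by_provider` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(model_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Group huggingface/{provider}/{model_id} strings by provider."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for full_id in model_ids:
        parts = full_id.split("/", 2)  # keep slashes inside the model ID
        if len(parts) == 3 and parts[0] == "huggingface":
            grouped[parts[1]].append(parts[2])
    return dict(grouped)


print(group_by_provider([
    "huggingface/hf-inference/openai/whisper-large-v3",
    "huggingface/fal-ai/fal-ai/flux/dev",
    "huggingface/hf-inference/sentence-transformers/all-MiniLM-L6-v2",
]))
```

IDs that do not match the three-part pattern are skipped silently in this sketch.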
Implementation caveats
| Caveat | Impact | Severity |
|---|---|---|
| Backend-specific audio format | hf-inference uses raw bytes; fal-ai uses base64 Data URIs | Medium |
| Image streaming limitation | Image generation streaming only supported via fal-ai | Medium |
| Image edit exclusive to fal-ai | Image editing only available through fal-ai backend | Medium |
| Model format requirement | Must use huggingface/{backend}/{model} format | Low |
| Backend availability | Not all backends support all operations | Medium |
Authoritative references
- Bifrost Hugging Face provider reference: docs.getbifrost.ai/providers/supported-providers/huggingface
- Hugging Face Inference API: huggingface.co/inference-api
- Bifrost provider support overview: docs.getbifrost.ai/providers/supported-providers/overview