Hugging Face on Bifrost: Multi-Backend LLM, Audio, and Image Operations

Hugging Face provider summary

Hugging Face integrates 20+ inference backends via Bifrost. Models use the composite format huggingface/{inference_provider}/{model_id} (for example huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct). Per-backend capabilities are listed in Supported inference providers below.

Common Hugging Face models used in Bifrost routes:

huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct
huggingface/groq/meta-llama/Llama-3.2-90B-Vision-Instruct
huggingface/fireworks/mistralai/Mixtral-8x7B-Instruct-v0.1

Property	Details
Description	Multi-backend provider supporting chat, embeddings, TTS, STT, and image operations.
Provider route on Bifrost	huggingface/{backend}/{model}
Inference providers	20+ backends (hf-inference, fal-ai, cerebras, groq, fireworks, nebius, sambanova, and more) — see Supported inference providers
Supported endpoints	/v1/chat/completions, /v1/responses, /v1/embeddings, /v1/audio/speech, /v1/audio/transcriptions, /v1/images/ (generation and edit), /v1/models

Supported inference providers

Bifrost routes Hugging Face requests to 20+ inference backends. Capabilities vary by provider; model routes use huggingface/{inference_provider}/{model_id}. All chat-supported backends also support the Responses API via Bifrost's internal conversion. See Supported inference providers in Bifrost docs. For the latest upstream capabilities, see Hugging Face Inference Providers documentation.

Provider	Chat	Embedding	Speech (TTS)	Transcription	Image gen	Image gen (stream)	Image edit	Image edit (stream)
hf-inference	Yes	Yes	No	Yes	Yes	No	No	No
cerebras	Yes	No	No	No	No	No	No	No
cohere	Yes	No	No	No	No	No	No	No
fal-ai	No	No	Yes	Yes	Yes	Yes	Yes	Yes
featherless-ai	Yes	No	No	No	No	No	No	No
fireworks	Yes	No	No	No	No	No	No	No
groq	Yes	No	No	No	No	No	No	No
hyperbolic	Yes	No	No	No	No	No	No	No
nebius	Yes	Yes	No	No	Yes	No	No	No
novita	Yes	No	No	No	No	No	No	No
nscale	Yes	No	No	No	No	No	No	No
ovhcloud-ai-endpoints	Yes	No	No	No	No	No	No	No
public-ai	Yes	No	No	No	No	No	No	No
replicate	No	No	Yes	Yes	No	No	No	No
sambanova	Yes	Yes	No	No	No	No	No	No
scaleway	Yes	Yes	No	No	No	No	No	No
together	Yes	No	No	No	Yes	No	No	No
z-ai	Yes	No	No	No	No	No	No	No

Yes indicates a capability supported by that inference provider upstream. Provider capabilities may change over time.
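
The table above can be turned into a small lookup for checking a route before sending a request. The following is a minimal sketch, not part of Bifrost: the names CAPABILITIES, supports, and providers_for are hypothetical, and the data is a snapshot transcribed from the table, so it may drift as upstream providers change.

```python
# Snapshot of the capability table above. Operation keys are hypothetical
# labels chosen for this sketch, not identifiers used by Bifrost.
CHAT_ONLY = frozenset({"chat"})

CAPABILITIES = {
    "hf-inference": {"chat", "embedding", "transcription", "image_gen"},
    "cerebras": CHAT_ONLY,
    "cohere": CHAT_ONLY,
    "fal-ai": {"speech", "transcription", "image_gen", "image_gen_stream",
               "image_edit", "image_edit_stream"},
    "featherless-ai": CHAT_ONLY,
    "fireworks": CHAT_ONLY,
    "groq": CHAT_ONLY,
    "hyperbolic": CHAT_ONLY,
    "nebius": {"chat", "embedding", "image_gen"},
    "novita": CHAT_ONLY,
    "nscale": CHAT_ONLY,
    "ovhcloud-ai-endpoints": CHAT_ONLY,
    "public-ai": CHAT_ONLY,
    "replicate": {"speech", "transcription"},
    "sambanova": {"chat", "embedding"},
    "scaleway": {"chat", "embedding"},
    "together": {"chat", "image_gen"},
    "z-ai": CHAT_ONLY,
}


def supports(provider: str, operation: str) -> bool:
    """Return True if the table lists `operation` for `provider`."""
    return operation in CAPABILITIES.get(provider, set())


def providers_for(operation: str) -> list[str]:
    """Return the providers that list `operation`, sorted by name."""
    return sorted(p for p, ops in CAPABILITIES.items() if operation in ops)
```

For example, providers_for("embedding") returns the four embedding-capable backends, and supports("groq", "embedding") is False.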

Parameter handling

Bifrost converts parameters from OpenAI format to each backend's own format. Audio payload encoding and image field mapping differ by backend, as summarized below.

Audio format handling:

hf-inference: expects raw audio bytes
fal-ai: expects base64 Data URIs

Image generation:

Streaming is available only via the fal-ai backend
Input fields are mapped automatically by model type

Supported Hugging Face parameters

Quick reference of parameters accepted when routing through Bifrost to Hugging Face backends.

[
  "stream",
  "temperature",
  "top_p",
  "top_k",
  "max_tokens",
  "stop",
  "response_format",
  "tools",
  "tool_choice"
]
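
A client can use the list above to separate supported parameters from everything else before sending a chat request. This is a minimal sketch under one assumption: the source does not say whether Bifrost drops or rejects parameters outside this list, so the hypothetical build_chat_payload helper simply reports them.

```python
# Parameter names copied from the quick-reference list above.
SUPPORTED_PARAMS = {
    "stream", "temperature", "top_p", "top_k", "max_tokens",
    "stop", "response_format", "tools", "tool_choice",
}


def build_chat_payload(model: str, messages: list, **params):
    """Return (payload, unknown) where `unknown` lists parameter names
    that are not in the quick-reference list (hypothetical helper)."""
    unknown = sorted(set(params) - SUPPORTED_PARAMS)
    payload = {"model": model, "messages": messages}
    payload.update({k: v for k, v in params.items() if k in SUPPORTED_PARAMS})
    return payload, unknown
```

Calling build_chat_payload(..., temperature=0.2, logit_bias={}) keeps temperature in the payload and returns ["logit_bias"] as the unknown list.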

Popular models via Hugging Face

Use the provider prefix huggingface/{backend}/ in Bifrost model routes. Specify the backend and model ID for proper routing.

Model	Model ID	Bifrost route	Typical usage
Llama 3.1 70B	meta-llama/Llama-3.1-70B-Instruct	huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct	High performance
Llama 3.2 90B Vision	meta-llama/Llama-3.2-90B-Vision-Instruct	huggingface/groq/meta-llama/Llama-3.2-90B-Vision-Instruct	Multimodal
Mixtral 8x7B	mistralai/Mixtral-8x7B-Instruct-v0.1	huggingface/fireworks/mistralai/Mixtral-8x7B-Instruct-v0.1	Mixture of experts

API reference

Bifrost routes Hugging Face requests by inference provider using huggingface/{inference_provider}/{model_id}. Hugging Face enforces a 2 MB request body limit across chat, embeddings, speech, and transcription. This section follows the Bifrost Hugging Face provider docs.
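
Because of the 2 MB body limit, it can help to check the serialized size of a request on the client before sending it. The sketch below assumes "2 MB" means 2 × 1024 × 1024 bytes, which the source does not specify, and the name body_size_ok is hypothetical.

```python
import json

# Assumption: "2 MB" is read as 2 MiB. Use 2_000_000 if the limit is decimal.
MAX_BODY_BYTES = 2 * 1024 * 1024


def body_size_ok(payload: dict) -> bool:
    """Return True if the JSON-encoded payload fits within the assumed limit."""
    encoded = json.dumps(payload).encode("utf-8")
    return len(encoded) <= MAX_BODY_BYTES
```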

Model aliases and identification

Unlike single-string model IDs, Hugging Face routes use a composite key. See Model aliases in Bifrost docs.

huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct
// inference_provider → backend (hf-inference, fal-ai, cerebras, …)
// model_id → Hugging Face Hub model ID
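
Since Hub model IDs themselves contain slashes, only the first two segments of a route are structural. A minimal sketch of splitting the composite key, using the hypothetical names Route and parse_route:

```python
from typing import NamedTuple


class Route(NamedTuple):
    inference_provider: str
    model_id: str


def parse_route(route: str) -> Route:
    """Split huggingface/{inference_provider}/{model_id} into its parts.

    The model_id keeps any further slashes, e.g. "meta-llama/...".
    """
    prefix, _, rest = route.partition("/")
    if prefix != "huggingface":
        raise ValueError(f"not a Hugging Face route: {route!r}")
    provider, _, model_id = rest.partition("/")
    if not provider or not model_id:
        raise ValueError(f"expected huggingface/{{provider}}/{{model_id}}: {route!r}")
    return Route(provider, model_id)
```

For the example above, parse_route returns the provider hf-inference and the model ID meta-llama/Meta-Llama-3-8B-Instruct.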

1) Chat Completions

OpenAI-compatible chat at /v1/chat/completions. Bifrost converts requests per backend via chat.go; dynamic model aliasing maps Hub IDs to provider-specific model names. On HTTP 404, the provider model cache is invalidated and the request is retried.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
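
The 404 handling described above can be pictured as "resolve, send, and on 404 forget the cached name and try again". The following sketch is illustrative only and is not Bifrost's chat.go: ModelCache, call_with_retry, the resolver callback, and the single-retry policy are all assumptions made for the example.

```python
class ModelCache:
    """Caches provider-specific model names keyed by (provider, Hub ID)."""

    def __init__(self, resolver):
        self._resolver = resolver  # callable(provider, model_id) -> name
        self._names = {}

    def get(self, provider, model_id):
        key = (provider, model_id)
        if key not in self._names:
            self._names[key] = self._resolver(provider, model_id)
        return self._names[key]

    def invalidate(self, provider, model_id):
        self._names.pop((provider, model_id), None)


def call_with_retry(cache, send, provider, model_id):
    """Call send(name) -> (status, body); on 404, invalidate and retry once."""
    status, body = send(cache.get(provider, model_id))
    if status == 404:
        cache.invalidate(provider, model_id)
        status, body = send(cache.get(provider, model_id))
    return status, body
```

Here send stands in for the HTTP call to the backend, so the retry logic can be exercised without a network.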

2) Responses API

All chat-supported inference providers also support /v1/responses via Bifrost's internal conversion to chat completions.

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/groq/meta-llama/Llama-3.1-70B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
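
To make the internal conversion concrete, the sketch below shows one plausible mapping from a Responses request to a chat completions request. The source does not document the actual field mapping, so treating input as a user message and max_output_tokens as max_tokens is an assumption, and responses_to_chat is a hypothetical name.

```python
def responses_to_chat(req: dict) -> dict:
    """Convert a Responses-style request to a chat-completions-style one.

    Assumed mapping (not confirmed by the source):
      input (str)        -> a single user message
      input (list)       -> used as the messages list
      max_output_tokens  -> max_tokens
    """
    raw = req["input"]
    if isinstance(raw, str):
        messages = [{"role": "user", "content": raw}]
    else:
        messages = list(raw)
    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        chat["max_tokens"] = req["max_output_tokens"]
    return chat
```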

3) Embeddings

/v1/embeddings — supported on hf-inference, nebius, sambanova, scaleway. See Embedding requests in Bifrost docs.

Most providers: JSON field input
hf-inference: JSON field inputs (plural)
Bifrost populates both fields in embedding.go for cross-provider compatibility

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/hf-inference/sentence-transformers/all-MiniLM-L6-v2",
    "input": "Hello world"
  }'
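
The "populate both fields" approach noted above can be sketched as follows. This illustrates the idea only and is not the code in embedding.go; build_embedding_body is a hypothetical name, and passing the value through unchanged is an assumption.

```python
def build_embedding_body(text_or_texts) -> dict:
    """Set both `input` (most providers) and `inputs` (hf-inference)
    so the same body works regardless of which field a backend reads."""
    return {"input": text_or_texts, "inputs": text_or_texts}
```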

4) Speech (Text-to-Speech)

/v1/audio/speech — fal-ai and replicate backends. TTS uses a dedicated JSON body (no pipeline_tag in the request). See Speech in Bifrost docs.

// HuggingFaceSpeechRequest (simplified)
{
  "text": "Hello world",
  "provider": "fal-ai",
  "model": "…",
  "parameters": { }
}

5) Transcriptions (ASR)

/v1/audio/transcriptions — request format depends on the inference provider. See Transcription in Bifrost docs.

Provider	Request format	Notes
hf-inference	Raw audio bytes in body	Content-Type: audio MIME; max 2 MB; URL: /hf-inference/models/{model}
fal-ai	JSON with audio_url	Base64 data URI; MP3 only (WAV rejected)

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=huggingface/hf-inference/openai/whisper-large-v3
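
The two request formats in the table above differ only in how the audio is packaged. A minimal sketch of both, assuming MP3 input with the audio/mpeg MIME type; encode_audio is a hypothetical helper and does not reproduce Bifrost's internals.

```python
import base64


def encode_audio(provider: str, audio: bytes, mime: str = "audio/mpeg"):
    """Return (body, content_type) in the shape each backend expects."""
    if provider == "hf-inference":
        # Raw bytes in the request body, with the audio MIME as Content-Type.
        return audio, mime
    if provider == "fal-ai":
        # JSON body carrying the audio as a base64 data URI in audio_url.
        b64 = base64.b64encode(audio).decode("ascii")
        return {"audio_url": f"data:{mime};base64,{b64}"}, "application/json"
    raise ValueError(f"no transcription format known for {provider!r}")
```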

6) Image Generation

/v1/images/generations — routes by model string. Only fal-ai supports streaming. See Image Generation in Bifrost docs.

Provider	Non-streaming	Streaming	Notes
hf-inference	Yes	No	Prompt-only JSON; returns raw image bytes
fal-ai	Yes	Yes	Full parameters; SSE streaming
nebius	Yes	No	Nebius format with width/height and LoRAs
together	Yes	No	OpenAI-compatible format

fal-ai parameter mappings

OpenAI / Bifrost	fal-ai	Notes
n	num_images
size	image_size {width, height}	From WxH string
output_format	output_format	jpg normalized to jpeg
response_format: b64_json	sync_mode: true	Auto-set
moderation: low	enable_safety_checker: false	Auto-set

curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/fal-ai/fal-ai/flux/dev",
    "prompt": "A futuristic cityscape at sunset",
    "size": "1024x1024",
    "n": 2,
    "response_format": "url"
  }'
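
The fal-ai mapping table can be expressed as a small translation function. This is a sketch based only on the rows listed above; to_fal_ai is a hypothetical name, and passing prompt through unchanged is an assumption.

```python
def to_fal_ai(req: dict) -> dict:
    """Translate OpenAI-style image parameters to fal-ai fields."""
    out = {"prompt": req["prompt"]}  # assumption: prompt passes through as-is
    if "n" in req:
        out["num_images"] = req["n"]
    if "size" in req:
        # "WxH" string -> {"width": W, "height": H}
        width, height = req["size"].lower().split("x")
        out["image_size"] = {"width": int(width), "height": int(height)}
    if "output_format" in req:
        fmt = req["output_format"]
        out["output_format"] = "jpeg" if fmt == "jpg" else fmt
    if req.get("response_format") == "b64_json":
        out["sync_mode"] = True
    if req.get("moderation") == "low":
        out["enable_safety_checker"] = False
    return out
```

Applied to the request above, this yields num_images 2 and an image_size of 1024 by 1024, with no sync_mode because response_format is url.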

7) Image Edit

Only fal-ai supports image edit; other providers return UnsupportedOperationError. Multipart images are converted to base64 data URLs. Image variation is not supported. See Image Edit in Bifrost docs.

Parameter	Required	Notes
model	Yes	Must be huggingface/fal-ai/{model_id}
prompt	Yes	Edit description
image[]	Yes	Image file(s); converted to base64 data URLs
n	No	Maps to num_images (1–10)
size	No	WxH → image_size object
output_format	No	png, webp, jpeg
seed, num_inference_steps, guidance_scale	No	Via extra_params

Multi-image models use image_urls; single-image models use image_url. Override with extra_params.use_image_urls. Streaming: SSE with image_edit.partial_image / image_edit.completed.
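
The choice between image_url and image_urls can be sketched as below. How Bifrost decides that a model is multi-image is not described here, so the multi_image_model flag is supplied by the caller; to_data_url and image_fields are hypothetical names, and using the first image for single-image models is an assumption.

```python
import base64


def to_data_url(image: bytes, mime: str = "image/png") -> str:
    """Encode image bytes as a base64 data URL."""
    return f"data:{mime};base64,{base64.b64encode(image).decode('ascii')}"


def image_fields(images: list[bytes], multi_image_model: bool,
                 use_image_urls=None) -> dict:
    """Pick image_urls or image_url; use_image_urls overrides the default."""
    if not images:
        raise ValueError("at least one image is required")
    urls = [to_data_url(img) for img in images]
    plural = multi_image_model if use_image_urls is None else use_image_urls
    if plural:
        return {"image_urls": urls}
    return {"image_url": urls[0]}  # assumption: first image only
```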

8) List Models

GET /v1/models — parallel Hub queries per inference provider, filtered by pipeline_tag (chat, feature-extraction, text-to-speech, etc.). Returns IDs as huggingface/{provider}/{model_id}. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models

Implementation caveats

Caveat	Impact	Severity
Backend-specific audio format	hf-inference uses raw bytes; fal-ai uses base64 Data URIs	Medium
Image streaming limitation	Image generation streaming only supported via fal-ai	Medium
Image edit exclusive to fal-ai	Image editing only available through fal-ai backend	Medium
Model format requirement	Must use huggingface/{backend}/{model} format	Low
Backend availability	Not all backends support all operations	Medium