Hugging Face Provider on Bifrost

Hugging Face provides access to multiple inference backends, including Cerebras, Groq, Fireworks, and custom servers. Bifrost routes each request to the backend named in the model string and supports chat, embeddings, audio (TTS/STT), and image operations.

Hugging Face provider summary

Through Bifrost, the Hugging Face provider reaches 20+ inference backends. Models use the composite format huggingface/{inference_provider}/{model_id}, where model_id is the Hugging Face Hub ID (for example huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct). Per-backend capabilities are listed in Supported inference providers below.

Common Hugging Face models used in Bifrost routes:

  • huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct
  • huggingface/groq/meta-llama/Llama-3.2-90B-Vision-Instruct
  • huggingface/fireworks/mistralai/Mixtral-8x7B-Instruct-v0.1

| Property | Details |
|---|---|
| Description | Multi-backend provider supporting chat, embeddings, TTS, STT, and image operations. |
| Provider route on Bifrost | huggingface/{backend}/{model} |
| Inference providers | 20+ backends (hf-inference, fal-ai, cerebras, groq, fireworks, nebius, sambanova, and more); see Supported inference providers |
| Supported endpoints | /v1/chat/completions, /v1/embeddings, /v1/audio/*, /v1/images/* |

Supported inference providers

Bifrost routes Hugging Face requests to 20+ inference backends. Capabilities vary by provider; model routes use huggingface/{inference_provider}/{model_id}. All chat-supported backends also support the Responses API via Bifrost's internal conversion. See Supported inference providers in Bifrost docs. For the latest upstream capabilities, see Hugging Face Inference Providers documentation.

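| Provider | Chat | Embedding | Speech (TTS) | Transcription | Image gen | Image gen (stream) | Image edit | Image edit (stream) |
|---|---|---|---|---|---|---|---|---|
| hf-inference | Yes | Yes | No | Yes | Yes | No | No | No |
| cerebras | Yes | No | No | No | No | No | No | No |
| cohere | Yes | No | No | No | No | No | No | No |
| fal-ai | No | No | Yes | Yes | Yes | Yes | Yes | Yes |
| featherless-ai | Yes | No | No | No | No | No | No | No |
| fireworks | Yes | No | No | No | No | No | No | No |
| groq | Yes | No | No | No | No | No | No | No |
| hyperbolic | Yes | No | No | No | No | No | No | No |
| nebius | Yes | Yes | No | No | Yes | No | No | No |
| novita | Yes | No | No | No | No | No | No | No |
| nscale | Yes | No | No | No | No | No | No | No |
| ovhcloud-ai-endpoints | Yes | No | No | No | No | No | No | No |
| public-ai | Yes | No | No | No | No | No | No | No |
| replicate | No | No | Yes | Yes | No | No | No | No |
| sambanova | Yes | Yes | No | No | No | No | No | No |
| scaleway | Yes | Yes | No | No | No | No | No | No |
| together | Yes | No | No | No | Yes | No | No | No |
| z-ai | Yes | No | No | No | No | No | No | No |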

Yes indicates a capability supported by that inference provider upstream. Provider capabilities may change over time.
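
A capability matrix like this one is easy to check on the client before a request is sent. The following is a minimal sketch that transcribes a few rows of the table into a lookup; the names CAPABILITIES, supports, and providers_for are hypothetical and not part of Bifrost.

```python
from typing import Dict, List, Set

# A subset of the capability table, transcribed by hand (illustrative only;
# upstream capabilities may change over time).
CAPABILITIES: Dict[str, Set[str]] = {
    "hf-inference": {"chat", "embedding", "transcription", "image_generation"},
    "cerebras": {"chat"},
    "fal-ai": {
        "speech", "transcription", "image_generation",
        "image_generation_stream", "image_edit", "image_edit_stream",
    },
    "nebius": {"chat", "embedding", "image_generation"},
    "replicate": {"speech", "transcription"},
    "together": {"chat", "image_generation"},
}


def supports(provider: str, capability: str) -> bool:
    """Return True if the table lists the capability for this provider."""
    return capability in CAPABILITIES.get(provider, set())


def providers_for(capability: str) -> List[str]:
    """Return the providers (from the subset above) that list a capability."""
    return sorted(p for p, caps in CAPABILITIES.items() if capability in caps)
```

For example, supports("fal-ai", "chat") is False, which matches the fal-ai row of the table.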

Parameter handling

Bifrost converts OpenAI-format parameters into each backend's own format. Audio inputs are encoded differently per backend, and image generation and editing map fields per provider, as summarized below.

Audio format handling:

  • hf-inference: expects raw audio bytes
  • fal-ai: expects base64 Data URIs

Image generation:

  • Streaming support available via fal-ai backend only
  • Automatic input field mapping by model type

Supported Hugging Face parameters

Quick reference of parameters accepted when routing through Bifrost to Hugging Face backends.

[
  "stream",
  "temperature",
  "top_p",
  "top_k",
  "max_tokens",
  "stop",
  "response_format",
  "tools",
  "tool_choice"
]
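
The list above does not say what happens to parameters outside it, so a client may want to check a request before sending it. This is a hedged sketch of such a client-side check; split_params is a hypothetical helper, and treating model and messages as always-allowed fields is an assumption.

```python
from typing import Dict, List, Tuple

# Parameters from the quick-reference list above.
SUPPORTED_PARAMS = {
    "stream", "temperature", "top_p", "top_k", "max_tokens",
    "stop", "response_format", "tools", "tool_choice",
}
# Assumption: these request fields are always passed through.
REQUIRED_FIELDS = {"model", "messages"}


def split_params(request: Dict) -> Tuple[Dict, List[str]]:
    """Split a chat request into (kept fields, sorted names of dropped fields)."""
    allowed = SUPPORTED_PARAMS | REQUIRED_FIELDS
    kept = {k: v for k, v in request.items() if k in allowed}
    dropped = sorted(k for k in request if k not in allowed)
    return kept, dropped
```

Logging the dropped names rather than silently discarding them makes it easier to spot a parameter that a backend would not have received.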

Popular models via Hugging Face

Use the provider prefix huggingface/{backend}/ in Bifrost model routes. Specify the backend and model ID for proper routing.

| Model | Model ID | Bifrost route | Typical usage |
|---|---|---|---|
| Llama 3.1 70B | meta-llama/Llama-3.1-70B-Instruct | huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct | High performance |
| Llama 3.2 90B Vision | meta-llama/Llama-3.2-90B-Vision-Instruct | huggingface/groq/meta-llama/Llama-3.2-90B-Vision-Instruct | Multimodal |
| Mixtral 8x7B | mistralai/Mixtral-8x7B-Instruct-v0.1 | huggingface/fireworks/mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixture of experts |

API reference

Bifrost routes Hugging Face requests by inference provider using huggingface/{inference_provider}/{model_id}. Hugging Face enforces a 2 MB request body limit across chat, embeddings, speech, and transcription. See the Bifrost Hugging Face provider docs for details.
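
Because of the 2 MB body limit, it can help to measure a payload before sending it. The sketch below is one way to do that on the client; it assumes "2 MB" means 2 * 1024 * 1024 bytes and that the JSON you serialize is close in size to what reaches Hugging Face, neither of which the source confirms.

```python
import json
from typing import Dict

# Assumption: the documented "2 MB" limit is interpreted as 2 MiB.
MAX_BODY_BYTES = 2 * 1024 * 1024


def body_size(payload: Dict) -> int:
    """Size in bytes of the payload when serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_body_limit(payload: Dict) -> bool:
    """True if the serialized payload is within the assumed limit."""
    return body_size(payload) <= MAX_BODY_BYTES
```

Base64-encoded audio is roughly a third larger than the raw bytes, so the encoded form is the one to measure.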

Model aliases and identification

Unlike single-string model IDs, Hugging Face routes use a composite key. See Model aliases in Bifrost docs.

huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct
// inference_provider → backend (hf-inference, fal-ai, cerebras, …)
// model_id → Hugging Face Hub model ID
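
Since the Hub model ID itself contains slashes, the composite key has to be split only at the first two separators. A minimal sketch of that parsing follows; HFRoute and parse_route are hypothetical names, not Bifrost functions.

```python
from typing import NamedTuple


class HFRoute(NamedTuple):
    inference_provider: str  # backend, e.g. hf-inference, fal-ai, cerebras
    model_id: str            # Hugging Face Hub model ID, may contain "/"


def parse_route(route: str) -> HFRoute:
    """Split huggingface/{inference_provider}/{model_id} into its parts."""
    prefix, sep, rest = route.partition("/")
    if prefix != "huggingface" or not sep:
        raise ValueError(f"not a Hugging Face route: {route!r}")
    provider, _, model_id = rest.partition("/")
    if not provider or not model_id:
        raise ValueError(f"expected huggingface/{{provider}}/{{model_id}}: {route!r}")
    return HFRoute(provider, model_id)
```

Everything after the provider segment is kept intact as the model ID, so IDs with more than one slash still parse.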

1) Chat Completions

OpenAI-compatible chat at /v1/chat/completions. Bifrost converts requests per backend via chat.go; dynamic model aliasing maps Hub IDs to provider-specific model names. On HTTP 404, the provider model cache is invalidated and the request is retried.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/cerebras/meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
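
The 404 handling described for chat completions follows a common invalidate-and-retry pattern. The sketch below illustrates that pattern only; it is not Bifrost's chat.go, and the NotFound exception, the cache dictionary, and the resolve and send callables are hypothetical stand-ins.

```python
from typing import Callable, Dict


class NotFound(Exception):
    """Stands in for an HTTP 404 response from the inference provider."""


def call_with_alias_retry(
    hub_id: str,
    cache: Dict[str, str],
    resolve: Callable[[str], str],
    send: Callable[[str], str],
) -> str:
    """Send a request using a cached Hub ID -> provider model name mapping.

    On a 404 the cached mapping is dropped and the request is retried once
    with a freshly resolved name; a second 404 is re-raised.
    """
    for attempt in range(2):
        if hub_id not in cache:
            cache[hub_id] = resolve(hub_id)
        try:
            return send(cache[hub_id])
        except NotFound:
            cache.pop(hub_id, None)  # invalidate the stale alias
            if attempt == 1:
                raise
    raise AssertionError("unreachable")
```

Limiting the retry to a single attempt keeps a model that is really missing from looping forever.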

2) Responses API

All chat-supported inference providers also support /v1/responses via Bifrost's internal conversion to chat completions.

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/groq/meta-llama/Llama-3.1-70B-Instruct",
    "input": "Hello",
    "max_output_tokens": 1024
  }'
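
To give a feel for what converting a Responses request to chat completions involves, here is a simplified sketch. The mapping of input to a single user message and of max_output_tokens to max_tokens is an assumption made for illustration; Bifrost's internal conversion is not documented here and likely handles more fields.

```python
from typing import Dict


def responses_to_chat(req: Dict) -> Dict:
    """Convert a minimal Responses-style request into a chat-style request."""
    raw = req["input"]
    if isinstance(raw, str):
        # A plain string becomes one user message.
        messages = [{"role": "user", "content": raw}]
    else:
        # Assumption: a list input already holds role/content items.
        messages = list(raw)
    chat = {"model": req["model"], "messages": messages}
    if "max_output_tokens" in req:
        chat["max_tokens"] = req["max_output_tokens"]
    return chat
```

Applied to the curl body above, this yields a chat request with one user message and max_tokens set to 1024.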

3) Embeddings

/v1/embeddings — supported on hf-inference, nebius, sambanova, scaleway. See Embedding requests in Bifrost docs.

  • Most providers: JSON field input
  • hf-inference: JSON field inputs (plural)
  • Bifrost populates both fields in embedding.go for cross-provider compatibility

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/hf-inference/sentence-transformers/all-MiniLM-L6-v2",
    "input": "Hello world"
  }'
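
Populating both field names, as described above, can be pictured with a small sketch. This is not the code in embedding.go; build_embedding_body is a hypothetical helper, and any other fields the upstream body carries are left out.

```python
from typing import Dict, List, Union

Texts = Union[str, List[str]]


def build_embedding_body(texts: Texts) -> Dict[str, Texts]:
    """Return a body that carries the text under both field names.

    "input" is what most providers read; "inputs" is what hf-inference reads.
    """
    return {"input": texts, "inputs": texts}
```

A backend that reads only one of the two names finds its text either way, at the cost of sending the text twice.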

4) Speech (Text-to-Speech)

/v1/audio/speech — fal-ai and replicate backends. TTS uses a dedicated JSON body (no pipeline_tag in the request). See Speech in Bifrost docs.

// HuggingFaceSpeechRequest (simplified)
{
  "text": "Hello world",
  "provider": "fal-ai",
  "model": "…",
  "parameters": { }
}

5) Transcriptions (ASR)

/v1/audio/transcriptions — request format depends on the inference provider. See Transcription in Bifrost docs.

| Provider | Request format | Notes |
|---|---|---|
| hf-inference | Raw audio bytes in body | Content-Type: audio MIME; max 2 MB; URL: /hf-inference/models/{model} |
| fal-ai | JSON with audio_url | Base64 data URI; MP3 only (WAV rejected) |

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=huggingface/hf-inference/openai/whisper-large-v3
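
The two request formats in the table differ enough that a sketch may help. The function below builds the headers and body for each documented provider; build_transcription_request is a hypothetical name, the return shape is chosen for illustration, and the MP3 check relies on the standard audio/mpeg MIME type.

```python
import base64
import json
from typing import Dict


def build_transcription_request(provider: str, audio: bytes, mime_type: str) -> Dict:
    """Return {"headers": ..., "body": ...} in the format the provider expects."""
    if provider == "hf-inference":
        # Raw audio bytes in the body, with the audio MIME type as Content-Type.
        return {"headers": {"Content-Type": mime_type}, "body": audio}
    if provider == "fal-ai":
        if mime_type != "audio/mpeg":
            raise ValueError("fal-ai transcription accepts MP3 only")
        encoded = base64.b64encode(audio).decode("ascii")
        data_uri = f"data:{mime_type};base64,{encoded}"
        body = json.dumps({"audio_url": data_uri}).encode("utf-8")
        return {"headers": {"Content-Type": "application/json"}, "body": body}
    raise ValueError(f"request format for {provider!r} is not documented here")
```

When calling Bifrost itself you still upload the file as multipart form data, as in the curl example; the conversion above is the kind of step that happens behind the gateway.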

6) Image Generation

/v1/images/generations — routes by model string. Only fal-ai supports streaming. See Image Generation in Bifrost docs.

| Provider | Non-streaming | Streaming | Notes |
|---|---|---|---|
| hf-inference | Yes | No | Prompt-only JSON; returns raw image bytes |
| fal-ai | Yes | Yes | Full parameters; SSE streaming |
| nebius | Yes | No | Nebius format with width/height and LoRAs |
| together | Yes | No | OpenAI-compatible format |

fal-ai parameter mappings

| OpenAI / Bifrost | fal-ai | Notes |
|---|---|---|
| n | num_images | |
| size | image_size {width, height} | From WxH string |
| output_format | output_format | jpg normalized to jpeg |
| response_format: b64_json | sync_mode: true | Auto-set |
| moderation: low | enable_safety_checker: false | Auto-set |

curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "huggingface/fal-ai/fal-ai/flux/dev",
    "prompt": "A futuristic cityscape at sunset",
    "size": "1024x1024",
    "n": 2,
    "response_format": "url"
  }'
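
The fal-ai parameter mappings in the table can be written out as a small translation function. This is a sketch of the documented mappings only, not Bifrost's implementation; to_fal_params is a hypothetical name, and passing prompt through unchanged is an assumption.

```python
from typing import Dict


def to_fal_params(req: Dict) -> Dict:
    """Translate OpenAI-style image parameters into fal-ai style fields."""
    out: Dict = {"prompt": req["prompt"]}  # assumption: prompt passes through
    if "n" in req:
        out["num_images"] = req["n"]
    if "size" in req:
        # "1024x768" -> {"width": 1024, "height": 768}
        width, height = req["size"].lower().split("x")
        out["image_size"] = {"width": int(width), "height": int(height)}
    if "output_format" in req:
        fmt = req["output_format"]
        out["output_format"] = "jpeg" if fmt == "jpg" else fmt
    if req.get("response_format") == "b64_json":
        out["sync_mode"] = True
    if req.get("moderation") == "low":
        out["enable_safety_checker"] = False
    return out
```

For the curl request above, the result carries num_images 2 and a 1024 by 1024 image_size, with no sync_mode because response_format is url.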

7) Image Edit

Only fal-ai supports image edit; other providers return UnsupportedOperationError. Multipart images are converted to base64 data URLs. Image variation is not supported. See Image Edit in Bifrost docs.

| Parameter | Required | Notes |
|---|---|---|
| model | Yes | Must be huggingface/fal-ai/{model_id} |
| prompt | Yes | Edit description |
| image[] | Yes | Image file(s); converted to base64 data URLs |
| n | No | Maps to num_images (1–10) |
| size | No | WxH → image_size object |
| output_format | No | png, webp, jpeg |
| seed, num_inference_steps, guidance_scale | No | Via extra_params |

Multi-image models use image_urls; single-image models use image_url. Override with extra_params.use_image_urls. Streaming: SSE with image_edit.partial_image / image_edit.completed.
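
The choice between image_url and image_urls, together with the data URL conversion, can be sketched as follows. How Bifrost decides that a model is multi-image is not described here, so the sketch takes that as a caller-supplied flag; to_data_url and image_fields are hypothetical names, and using only the first image for single-image models is an assumption.

```python
import base64
from typing import Dict, List, Optional


def to_data_url(image: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    return f"data:{mime_type};base64,{base64.b64encode(image).decode('ascii')}"


def image_fields(
    data_urls: List[str],
    multi_image_model: bool,
    use_image_urls: Optional[bool] = None,
) -> Dict:
    """Pick the field name for the images, honoring an explicit override.

    use_image_urls mirrors extra_params.use_image_urls: None means "no
    override", so the model type decides.
    """
    plural = multi_image_model if use_image_urls is None else use_image_urls
    if plural:
        return {"image_urls": list(data_urls)}
    return {"image_url": data_urls[0]}  # assumption: first image only
```

The override exists for models whose type is guessed wrongly: setting use_image_urls forces the plural or singular field regardless of the model.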

8) List Models

GET /v1/models — parallel Hub queries per inference provider, filtered by pipeline_tag (chat, feature-extraction, text-to-speech, etc.). Returns IDs as huggingface/{provider}/{model_id}. See List Models in Bifrost docs.

curl http://localhost:8080/v1/models
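
The fan-out behind List Models can be pictured as one query per inference provider, run in parallel and merged. The sketch below shows that shape with a thread pool; list_models is a hypothetical name, and fetch stands in for a Hub query that returns the model IDs for one provider (any pipeline_tag filtering would happen inside it).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def list_models(providers: List[str], fetch: Callable[[str], List[str]]) -> List[str]:
    """Query every provider in parallel and return prefixed model IDs."""
    if not providers:
        return []
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        # map() yields results in the same order as the providers list.
        results = list(pool.map(fetch, providers))
    ids: List[str] = []
    for provider, model_ids in zip(providers, results):
        ids.extend(f"huggingface/{provider}/{m}" for m in model_ids)
    return ids
```

Because the queries run concurrently, total latency is close to that of the slowest provider rather than the sum of all of them.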

Implementation caveats

| Caveat | Impact | Severity |
|---|---|---|
| Backend-specific audio format | hf-inference uses raw bytes; fal-ai uses base64 Data URIs | Medium |
| Image streaming limitation | Image generation streaming only supported via fal-ai | Medium |
| Image edit exclusive to fal-ai | Image editing only available through fal-ai backend | Medium |
| Model format requirement | Must use huggingface/{backend}/{model} format | Low |
| Backend availability | Not all backends support all operations | Medium |

