Semantic Caching for LLMs: Cut Cost and Latency at Scale
Semantic caching for LLMs reduces API cost and latency by serving cached responses to similar queries. Learn how Bifrost makes it production-ready.
LLM API bills grow faster than traffic for almost every team that ships a chatbot, a RAG application, or an agent. Users rarely ask the exact same question twice, but they ask semantically identical questions constantly: "what is your refund policy," "how do I get a refund," "can I return this order." Exact-match caches miss all of it. Semantic caching for LLMs closes that gap by matching on meaning instead of literal text, and Bifrost, the open-source AI gateway built by Maxim AI, ships it as a first-class feature in the gateway layer so every application behind it benefits without a code change.
This post covers how semantic caching works, where it fits in an LLM stack, the failure modes teams hit when they roll their own, and how Bifrost's dual-layer cache handles them in production.
What Is Semantic Caching for LLMs
Semantic caching for LLMs is a caching technique that stores LLM responses against vector embeddings of the input prompt, and serves a cached response when a new prompt is semantically similar to a stored one above a configured similarity threshold. Unlike exact-match caching, it captures rephrased questions, paraphrases, and typos.
The pattern has four components:
- Embedding model: converts each prompt into a dense vector
- Vector store: indexes embeddings for fast nearest-neighbor search
- Similarity threshold: decides what counts as a cache hit (commonly 0.8 to 0.95 cosine similarity)
- Response store: holds the cached completion and metadata (TTL, model, provider)
On each request, the gateway embeds the prompt, searches the vector store, and returns the cached response if the best match exceeds the threshold. If not, the request goes to the upstream model and the response is stored for future hits.
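As a rough sketch of that flow (the function names, embedding callable, and vector-store client here are hypothetical illustrations, not Bifrost internals):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.9  # commonly tuned between 0.8 and 0.95 cosine similarity

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_completion(prompt, embed, vector_store, call_llm):
    """Serve a cached response for semantically similar prompts, else call the model."""
    query_vec = embed(prompt)                  # 1. embed the incoming prompt
    match = vector_store.nearest(query_vec)    # 2. nearest-neighbor search over stored prompts
    if match and cosine_similarity(query_vec, match.vector) >= SIMILARITY_THRESHOLD:
        return match.response                  # 3. cache hit: skip the upstream LLM entirely
    response = call_llm(prompt)                # 4. cache miss: pay for the model call
    vector_store.insert(query_vec, response)   # 5. store for future near-duplicate queries
    return response
```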
Why Semantic Caching Matters for LLM Cost and Latency
Production LLM workloads have a heavy long tail of near-duplicate queries. Independent research published on arXiv reports that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates between 61.6% and 68.8%, and positive hit rates exceeding 97%. VentureBeat documented a team that saw cache hit rates rise to 67% and LLM API costs drop 73% after switching from text-based to semantic caching.
The cost math is straightforward. A cached response avoids the input and output tokens entirely, so savings scale with hit rate. The latency math is better: a vector lookup in a warm store returns in single-digit milliseconds, while a fresh call to a frontier model typically takes 1 to 5 seconds end-to-end. For user-facing applications, that gap is the difference between a snappy UI and a visible wait.
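To make the scaling concrete, here is a back-of-the-envelope calculation; the per-call cost and latency figures are illustrative assumptions, not measurements:

```python
# Illustrative figures only; substitute your own model pricing and traffic profile.
cost_per_call = 0.012     # dollars per uncached request (input + output tokens)
llm_latency = 2.5         # seconds for a fresh frontier-model call
cache_latency = 0.005     # seconds for a warm vector lookup
hit_rate = 0.6            # fraction of requests answered from the semantic cache

blended_cost = (1 - hit_rate) * cost_per_call            # embedding cost ignored here
blended_latency = hit_rate * cache_latency + (1 - hit_rate) * llm_latency
print(f"${blended_cost:.4f}/request vs ${cost_per_call:.4f} uncached")        # $0.0048 vs $0.0120
print(f"{blended_latency:.2f}s mean latency vs {llm_latency:.2f}s uncached")  # 1.00s vs 2.50s
```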
Semantic caching is complementary to provider-side prefix caching. Anthropic's prompt caching reduces costs by up to 90% and latency by up to 85% for long prompts, but it only matches on prefix tokens that are byte-identical. Semantic caching handles the case where the user's question itself varies.
Where semantic caching pays off most
- Customer support bots and FAQ assistants
- Internal knowledge-base search
- Documentation Q&A
- Repetitive classification, extraction, and moderation prompts
- Agent sub-steps that reuse planning or routing prompts
- Any workload where a meaningful fraction of queries are paraphrases of prior queries
How Exact-Match Caching Falls Short
Traditional caches like Redis or Memcached hash the input string and match on byte equality. In LLM workloads this misses almost everything that matters: two prompts with the same intent but different wording produce different hashes and different cache keys, so the gateway dispatches a new LLM call for each variation. Research cited in industry analyses suggests that around 31% of LLM queries are semantically similar to earlier requests, redundancy that exact-match caching cannot recover.
The failure mode is predictable. A support bot receives "how do I cancel my subscription," "how to cancel a subscription," and "cancel subscription" within minutes. Each one is a cache miss. Each one pays full input and output cost. The model generates substantively identical answers three times. This is the gap semantic caching closes.
How Bifrost Handles Semantic Caching for LLMs at the Gateway Layer
Bifrost implements semantic caching as a plugin inside the gateway, which means every provider, model, and SDK behind Bifrost inherits it without application changes. Because Bifrost is a drop-in replacement for the OpenAI, Anthropic, and other major SDKs, enabling semantic caching requires changing the base URL and flipping the plugin on.
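In practice that looks something like the snippet below; the base URL and key are deployment-specific assumptions, not fixed values:

```python
from openai import OpenAI

# Existing OpenAI SDK code, pointed at the Bifrost gateway instead of api.openai.com.
# Replace the base URL and key with your own gateway address and virtual key.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-bifrost-virtual-key",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How do I get a refund?"}],
)
print(resp.choices[0].message.content)
```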
The cache is dual-layer by default:
- Direct hash matching: deterministic cache ID derived from the normalized input, parameters, and stream flag. Exact-match requests return instantly without any embedding call.
- Semantic similarity matching: if the direct lookup misses, the prompt is embedded and compared against stored vectors using cosine similarity against a configurable threshold.
This order matters. Direct matches cost nothing extra, so Bifrost tries them first. Semantic matches require an embedding API call, which costs a fraction of a full LLM call but is not free.
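A simplified sketch of that ordering follows; the hashing and lookup helpers are illustrative, not Bifrost's actual implementation:

```python
import hashlib
import json

def direct_cache_id(messages, params, stream):
    """Deterministic key over the normalized input, parameters, and stream flag."""
    payload = json.dumps(
        {"messages": messages, "params": params, "stream": stream},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def lookup(messages, params, stream, kv_store, embed, vector_store, threshold=0.9):
    # Layer 1: direct hash lookup, no embedding call and effectively free.
    hit = kv_store.get(direct_cache_id(messages, params, stream))
    if hit is not None:
        return hit, "direct"
    # Layer 2: semantic lookup, which pays for one embedding call.
    match = vector_store.nearest(embed(messages))
    if match and match.similarity >= threshold:
        return match.response, "semantic"
    return None, "miss"
```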
Supported vector stores
Bifrost ships adapters for four production vector databases:
- Weaviate: production-ready with gRPC support
- Redis or Valkey: high-performance in-memory vector storage using RediSearch-compatible APIs
- Qdrant: Rust-based vector search with advanced filtering
- Pinecone: managed serverless vector database
Teams that do not want to run any embedding provider can use direct hash mode with dimension: 1 and no provider configured. In that mode Bifrost uses only exact-match deduplication, which is still valuable for retry storms, streaming replays, and identical prompts from multiple users. Redis and Valkey are the recommended backends for direct-only mode.
Per-request control
Semantic caching is opt-in per request through a cache key header. Applications that need different TTLs or thresholds for different flows can override them without restarting the gateway:
- x-bf-cache-key: activates caching for the request and scopes it to a session or tenant
- x-bf-cache-ttl: per-request TTL override (supports durations like 30s, 5m, 24h)
- x-bf-cache-threshold: per-request similarity threshold (for example, 0.9 for stricter matching)
- x-bf-cache-type: force direct or semantic matching only
- x-bf-cache-no-store: read from cache but do not store the new response
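For example, passing these headers through the OpenAI SDK's extra header support (the header names come from the list above; the base URL, key, and cache-key value are assumptions about your deployment):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-bifrost-virtual-key")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your refund policy?"}],
    extra_headers={
        "x-bf-cache-key": "support-bot:tenant-42",  # enable caching, scoped to this tenant
        "x-bf-cache-ttl": "24h",                    # FAQ answers can safely live for a day
        "x-bf-cache-threshold": "0.9",              # stricter matching for customer-facing replies
    },
)
```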
Cached responses include a cache_debug object in the response metadata with the hit type (direct or semantic), cache ID, similarity score, and the embedding model and provider used. This makes it straightforward to instrument cache hit rate and quality in production without parsing logs.
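A minimal way to read that metadata, assuming cache_debug is surfaced as a top-level field of the JSON response body (the exact location and the field names inside it may differ by Bifrost version):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={
        "Authorization": "Bearer your-bifrost-virtual-key",
        "x-bf-cache-key": "support-bot:tenant-42",
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "How do I get a refund?"}],
    },
)
debug = resp.json().get("cache_debug", {})
print(debug)  # hit type, cache ID, similarity score, embedding model and provider
```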
Conversation-aware behavior
Long conversations are a known failure mode for semantic caches. Once a chat has several turns, the prompt is dominated by history, and two unrelated conversations can look similar in vector space. Bifrost guards against this with a conversation_history_threshold setting that skips caching entirely when a conversation exceeds a configured message count (default: 3). System prompts can optionally be excluded from cache key generation for applications that rotate system prompts without changing response semantics.
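The guard itself amounts to something like this behavioral sketch (not Bifrost's code):

```python
def should_attempt_cache(messages, history_threshold=3, exclude_system=True):
    """Skip caching once conversation history dominates the prompt."""
    relevant = [m for m in messages if not (exclude_system and m["role"] == "system")]
    return len(relevant) <= history_threshold
```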
Key Considerations for Semantic Caching in Production
Rolling out semantic caching for LLMs involves three decisions that materially affect cost, latency, and answer quality.
Choosing a similarity threshold
The similarity threshold is the single most consequential knob. Too loose and the cache returns wrong answers for queries that sound similar but mean different things. Too strict and hit rates collapse.
- 0.95 and above: strict matching, high precision, lower hit rate. Use for code generation, structured extraction, and anything where small prompt differences change the correct answer.
- 0.85 to 0.95: balanced. Sensible default for support chatbots, Q&A assistants, and RAG front-ends.
- Below 0.85: aggressive matching, highest hit rate, risk of semantic drift. Use only for low-stakes exploration or internal tools.
Tune the threshold against a real query log, not synthetic data. Measure both hit rate and human-rated answer quality at each threshold and pick the knee of the curve.
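A simple way to run that tuning offline, assuming you have a log of (query, reference answer) pairs and some judge of whether a cached answer is acceptable (all names here are illustrative):

```python
def sweep_thresholds(query_log, embed, vector_store, judge,
                     thresholds=(0.80, 0.85, 0.90, 0.95)):
    """Replay real queries and report hit rate and hit quality per threshold."""
    results = {}
    for t in thresholds:
        hits, good = 0, 0
        for query, reference in query_log:
            match = vector_store.nearest(embed(query))
            if match and match.similarity >= t:
                hits += 1
                good += judge(match.response, reference)  # 1 if the cached answer is acceptable
        results[t] = {
            "hit_rate": hits / len(query_log),
            "hit_precision": good / hits if hits else None,
        }
    return results
```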
Choosing an embedding model
Embedding quality is the ceiling on semantic cache quality. A weak embedding model will conflate unrelated queries or miss obvious paraphrases. Bifrost defaults to OpenAI's text-embedding-3-small at 1536 dimensions, which is a strong general-purpose choice. Teams with domain-specific vocabularies (medical, legal, finance) often see material gains from fine-tuned or domain-adapted embedding models.
TTL and invalidation
Cached responses go stale when the underlying data changes. For FAQ answers and documentation Q&A, TTLs of hours to days are reasonable. For anything that depends on current state (inventory, pricing, account data), TTLs should be short or caching should be scoped per session only. Bifrost supports both global and per-request TTLs, and offers explicit cache clear endpoints for event-driven invalidation.
Combining Semantic Caching With the Rest of the Gateway
Semantic caching rarely works in isolation. In production LLM stacks, it sits alongside routing, governance, and observability, which is why Bifrost bundles them in one gateway layer.
- Multi-provider routing: Bifrost routes across 20+ LLM providers through a single OpenAI-compatible API, so cached responses remain valid even when underlying providers fail over.
- Governance and budgets: virtual keys let platform teams enforce per-team budgets, rate limits, and access policies on top of cached traffic. Cache hits do not burn budget, which naturally extends per-team allocations.
- Observability: native Prometheus metrics and OpenTelemetry traces expose cache hit rate, embedding latency, and LLM latency on the same dashboard, so teams can tune thresholds with real data. Independent performance benchmarks show Bifrost itself adds 11µs of overhead at 5,000 RPS, so the cache layer does not introduce meaningful overhead on cache misses.
- Drop-in SDK support: because Bifrost is compatible with the OpenAI, Anthropic, LangChain, and LiteLLM SDKs, applications gain semantic caching without rewrites. Teams currently on LiteLLM can review Bifrost as a LiteLLM alternative for a full feature comparison.
Get Started With Semantic Caching for LLMs on Bifrost
Semantic caching for LLMs is one of the highest-ROI optimizations available for production AI applications, and it belongs in the gateway rather than inside every application that calls a model. Bifrost ships it as an open-source, dual-layer cache with four supported vector stores, per-request overrides, and conversation-aware guards, all behind the same OpenAI-compatible API that routes requests to 20+ providers.
To see semantic caching, routing, and governance working together on your traffic, book a demo with the Bifrost team.