Overview

Semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs. This dramatically reduces API costs and latency for repeated or similar queries.

Key Benefits:
  • Cost Reduction: Avoid expensive LLM API calls for similar requests
  • Improved Performance: Sub-millisecond cache retrieval vs multi-second API calls
  • Intelligent Matching: Semantic similarity beyond exact text matching
  • Streaming Support: Full streaming response caching with proper chunk ordering

Core Features

  • Dual-Layer Caching: Exact hash matching + semantic similarity search (customizable threshold)
  • Vector-Powered Intelligence: Uses embeddings to find semantically similar requests
  • Dynamic Configuration: Per-request TTL and threshold overrides via headers/context
  • Model/Provider Isolation: Separate caching per model and provider combination
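The dual-layer lookup and model/provider isolation described above can be sketched as follows. This is a toy illustration, not the plugin's actual types or API: the exact layer hashes provider, model, and prompt together (so the same prompt under a different model misses), and only on a miss falls through to a semantic search, stubbed out here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// exactKey derives the exact-match cache key. Hashing provider and model
// together with the prompt isolates entries per model/provider combination.
func exactKey(provider, model, prompt string) string {
	h := sha256.Sum256([]byte(provider + "\x00" + model + "\x00" + prompt))
	return hex.EncodeToString(h[:])
}

// lookup tries the exact-hash layer first, then falls back to semantic search.
func lookup(cache map[string]string, provider, model, prompt string,
	semantic func(prompt string) (string, bool)) (string, bool) {
	if resp, ok := cache[exactKey(provider, model, prompt)]; ok {
		return resp, true // layer 1: exact hash hit
	}
	return semantic(prompt) // layer 2: vector similarity search
}

func main() {
	cache := map[string]string{
		exactKey("openai", "gpt-4o", "How do I reset my password?"): "Go to Settings...",
	}
	noSemantic := func(string) (string, bool) { return "", false }

	_, hit := lookup(cache, "openai", "gpt-4o", "How do I reset my password?", noSemantic)
	fmt.Println("exact hit:", hit)

	// Same prompt, different model: isolated cache key, so no exact hit.
	_, hit = lookup(cache, "openai", "gpt-4o-mini", "How do I reset my password?", noSemantic)
	fmt.Println("exact hit:", hit)
}
```

The null-byte separator prevents key collisions between, say, provider "a" + model "bc" and provider "ab" + model "c".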

Vector Store Setup

import (
    "context"
    "log"

    "github.com/maximhq/bifrost/framework/vectorstore"
)

// Configure vector store
vectorConfig := &vectorstore.Config{
    Enabled: true,
    Type:    vectorstore.VectorStoreTypeWeaviate,
    Config: vectorstore.WeaviateConfig{
        Scheme:    "http",
        Host:      "localhost:8080",
    },
}

// Create vector store
store, err := vectorstore.NewVectorStore(context.Background(), vectorConfig, logger)
if err != nil {
    log.Fatal("Failed to create vector store:", err)
}

Semantic Cache Configuration

import (
    "context"
    "log"
    "time"

    bifrost "github.com/maximhq/bifrost/core"
    "github.com/maximhq/bifrost/core/schemas"
    "github.com/maximhq/bifrost/plugins/semanticcache"
)

// Configure semantic cache plugin
cacheConfig := semanticcache.Config{
    CacheKey:          "request-cache-key",       // Required: context key that triggers caching for a request
    CacheTTLKey:       "request-cache-ttl",       // Optional: context key for a per-request TTL override
    CacheThresholdKey: "request-cache-threshold", // Optional: context key for a per-request threshold override

    // Embedding model configuration (Required)
    Provider:       schemas.OpenAI,
    Keys:           []schemas.Key{{Value: "sk-..."}},
    EmbeddingModel: "text-embedding-3-small",

    // Cache behavior
    TTL:       5 * time.Minute, // Time to live for cached responses (default: 5 minutes)
    Threshold: 0.8,             // Similarity threshold for semantic lookup (default: 0.8)

    // Advanced options
    CacheByModel:        bifrost.Ptr(true),  // Include model in cache key (default: true)
    CacheByProvider:     bifrost.Ptr(true),  // Include provider in cache key (default: true)
    ExcludeSystemPrompt: bifrost.Ptr(false), // Exclude system messages from cache key (default: false)
}

// Create plugin
plugin, err := semanticcache.Init(context.Background(), cacheConfig, logger, store)
if err != nil {
    log.Fatal("Failed to create semantic cache plugin:", err)
}

// Add to Bifrost config
bifrostConfig := schemas.BifrostConfig{
    Plugins: []schemas.Plugin{plugin},
    // ... other config
}

Cache Triggering

A cache key is mandatory: semantic caching activates only when a cache key is provided; requests without one bypass caching entirely. Set the cache key in the request context:
// This request WILL be cached
ctx = context.WithValue(ctx, semanticcache.ContextKey("request-cache-key"), "session-123")
response, err := client.ChatCompletionRequest(ctx, request)

// This request will NOT be cached (no context value)
response, err := client.ChatCompletionRequest(context.Background(), request)

Per-Request Overrides

Override the default TTL and similarity threshold per request by setting values in the request context under the keys you configured in the plugin config:
// Go SDK: Custom TTL and threshold
ctx = context.WithValue(ctx, semanticcache.ContextKey("request-cache-key"), "session-123")
ctx = context.WithValue(ctx, semanticcache.ContextKey("request-cache-ttl"), 30*time.Second)
ctx = context.WithValue(ctx, semanticcache.ContextKey("request-cache-threshold"), 0.9)

Cache Management

Cache Metadata Location

When a response is served from the semantic cache, cache metadata is automatically added to the response.

Location: response.ExtraFields.CacheDebug (as a JSON object)

Fields:
  • CacheHit (boolean): true when response served from cache
  • CacheHitType (string): "semantic" for similarity match, "direct" for hash match
  • CacheID (string): Unique cache entry ID for management operations
Semantic Cache Only:
  • CacheThreshold (number): Similarity threshold used for the match
  • CacheSimilarity (number): Similarity score for the match
Example HTTP Response:
{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "cache_hit_type": "semantic",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "cache_threshold": 0.8,
      "cache_similarity": 0.95
    }
  }
}
These variables allow you to detect cached responses and get the cache entry ID needed for clearing specific entries.

Clear Specific Cache Entry

Use the cache entry ID (the CacheID from cache_debug) to clear specific entries:
// Clear specific entry by request ID
err := plugin.ClearCacheForRequestID("550e8400-e29b-41d4-a716-446655440000")

// Clear all entries for a cache key  
err := plugin.ClearCacheForKey("support-session-456")

Use Cases

Customer Support

Cache responses for common questions like “How do I reset my password?” or “What are your business hours?” - semantically similar variations get instant responses.

Content Generation

Cache blog post outlines, product descriptions, or marketing copy for similar topics, reducing costs for content teams.

Data Analysis

Cache responses for similar analytical queries about datasets, dashboards, or reports, speeding up business intelligence workflows.

Vector Store Requirement: Semantic caching requires a configured vector store (currently Weaviate only). Without a vector store, the plugin will not function.