Semantic caching.
25 April 2026
Semantic caching embeds an incoming request, looks up the nearest prior request above a similarity threshold, and returns that prior response instead of calling the model. When it works, it cuts spend 20–40% on the affected endpoint, and hits return in the milliseconds a lookup takes rather than the seconds a model call takes (misses pay a small embedding-plus-lookup tax). When it doesn't, it returns wrong answers — and you get told about it on Twitter.
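The whole loop — embed the request, find the nearest cached entry, serve it if similarity clears the threshold — fits in a few lines. A minimal in-memory sketch: the character-frequency `embed` is a toy stand-in for a real embedding model, and the linear scan stands in for a vector store.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
    # a 128-dim character-frequency vector. Production code calls an embeddings API.
    vec = [0.0] * 128
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs; a vector store in production

    def store(self, text, response):
        self.entries.append((embed(text), response))

    def lookup(self, text):
        vec = embed(text)
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:      # linear scan; ANN index in production
            sim = cosine(vec, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        if best_sim >= self.threshold:
            return best_resp                # cache hit: skip the model call
        return None                         # miss: call the model, then store()

cache = SemanticCache(threshold=0.95)
cache.store("How do I reset my password?", "Go to Settings > Security > Reset.")
print(cache.lookup("How can I reset my password?"))  # near-duplicate -> hit
print(cache.lookup("What is the refund policy?"))    # unrelated -> miss (None)
```

Even the toy embedding shows the behavior that matters: a paraphrase clears the threshold, an unrelated question doesn't.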
Where it works
- RAG retrieval where the corpus is stable. Documentation Q&A, internal knowledge bases, product help.
- Classification and tagging. Sentiment, intent, content moderation — the inputs cluster naturally.
- High-volume customer support flows with a long tail of duplicated questions.
- Boilerplate generation — legal clauses, marketing variants, templated emails.
Where it breaks
- Personalized output. "Summarize my last 3 emails" must never hit a cache from another user. Scope keys to user/session.
- Time-sensitive answers. Stock prices, news, schedules. Use TTLs aggressively or skip the cache entirely.
- Long-context, low-repetition workloads like fresh document analysis. Hit rate stays under 5% — not worth the embedding cost.
- Anything where "approximately right" is wrong. Code generation, financial calculations, medical.
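The first two failure modes have mechanical mitigations: namespace every entry by a scope key (user or session id) and attach a TTL. A sketch under stated assumptions — a plain dict stands in for the vector store, and the injectable `now` is just for determinism; in production the scope key becomes a metadata filter on the vector search and the TTL an eviction policy:

```python
import time

class ScopedTTLCache:
    # Entries are namespaced by a scope key and expire after a TTL, so one
    # user's personalized answers never serve another user, and time-sensitive
    # answers age out instead of going stale.
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.entries = {}  # (scope, query) -> (response, stored_at)

    def store(self, scope, query, response, now=None):
        now = time.time() if now is None else now
        self.entries[(scope, query)] = (response, now)

    def lookup(self, scope, query, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get((scope, query))
        if hit is None:
            return None
        response, stored_at = hit
        if now - stored_at > self.ttl:       # expired: treat as a miss
            del self.entries[(scope, query)]
            return None
        return response

cache = ScopedTTLCache(ttl_seconds=60)
cache.store("user-a", "summarize my last 3 emails", "summary for user-a", now=0)
print(cache.lookup("user-a", "summarize my last 3 emails", now=10))   # hit
print(cache.lookup("user-b", "summarize my last 3 emails", now=10))   # None: wrong scope
print(cache.lookup("user-a", "summarize my last 3 emails", now=120))  # None: expired
```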
Picking a similarity threshold
This is the lever that destroys quality if you set it wrong. Defaults we use as a starting point (cosine similarity on `text-embedding-3-small`):
| Workload | Cosine threshold |
| --- | --- |
| RAG retrieval | 0.95 |
| Classification | 0.92 |
| Customer support | 0.97 |
| Boilerplate | 0.90 |
Tune by sampling 200 cache hits per endpoint and judging each (input, served-from-cache response) pair with an LLM-as-judge plus a human spot-check. If precision drops below 95%, raise the threshold.
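That tuning step can be wired up as a small function. `judge` is whatever LLM-as-judge pipeline you already run (any callable works); the toy judge below, the sampled pairs, and the 0.01 step size are all illustrative assumptions:

```python
def tune_threshold(cache_hits, judge, threshold, target_precision=0.95, step=0.01):
    # cache_hits: list of (input, served_response) pairs sampled from production.
    # judge: callable returning True if the served response is correct for the
    # input; in practice an LLM-as-judge backed by human spot-checks.
    correct = sum(1 for pair in cache_hits if judge(*pair))
    precision = correct / len(cache_hits)
    if precision < target_precision:
        return min(threshold + step, 1.0), precision  # tighten the cache
    return threshold, precision

# Toy judge: a hit is "correct" only if the served answer matches the topic.
hits = [("reset password", "auth-help"), ("reset password", "auth-help"),
        ("billing issue", "auth-help"), ("billing issue", "billing-help")]
judge = lambda q, r: ("reset" in q) == (r == "auth-help")
print(tune_threshold(hits, judge, threshold=0.97))
```

Run this per endpoint on a schedule, not once: traffic drift moves the precision of a fixed threshold over time.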
Implementation cost
Embedding tokens are not free. At `$0.02 / 1M tokens` for `text-embedding-3-small`, the embedding cost is negligible vs. a frontier-model call — but if you embed every request and your hit rate is 3%, you're paying for embeddings without the savings to justify them. Measure hit rate before scaling.
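The break-even arithmetic is worth writing down. This sketch uses the embedding price above; the model-call cost and the amortized per-request infrastructure cost (vector-store hosting, ops) are assumed placeholder numbers, not quotes — infrastructure, not embedding tokens, is usually the binding constraint at low hit rates:

```python
def breakeven_hit_rate(avg_input_tokens, model_cost_per_call,
                       embed_price_per_mtok=0.02, infra_cost_per_request=0.0005):
    # Minimum hit rate at which expected savings cover per-request cache costs.
    # infra_cost_per_request is an assumed amortized figure; measure your own.
    embed_cost = avg_input_tokens / 1_000_000 * embed_price_per_mtok
    return (embed_cost + infra_cost_per_request) / model_cost_per_call

# 500-token requests against a $0.01/call model:
print(breakeven_hit_rate(avg_input_tokens=500, model_cost_per_call=0.01))
```

With these assumed numbers the break-even sits around a 5% hit rate, which is why a measured 3% is a signal to turn the cache off, not to tune it.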
Stack we typically use
- Embedding model: `text-embedding-3-small` (default), Voyage `voyage-3` or Cohere `embed-v3` for domain-specific corpora.
- Vector store: pgvector for <5M entries, Qdrant or Pinecone above that.
- Gateway: LiteLLM with custom cache backend, or Helicone's built-in semantic cache.