Semantic caching.
25 April 2026
Semantic caching embeds an incoming request, looks up the nearest prior request above a similarity threshold, and returns that prior response instead of calling the model. When it works, it cuts spend 20–40% on the affected endpoint, and hits return in the milliseconds a lookup takes rather than the seconds a model call takes (misses pay a small embedding-plus-lookup tax). When it doesn't, it returns wrong answers — and you get told about it on Twitter.
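The whole loop — embed the request, find the nearest cached entry, serve it if similarity clears the threshold — fits in a few lines. A minimal in-memory sketch: the character-frequency `embed` is a toy stand-in for a real embedding model, and the linear scan stands in for a vector store.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
    # a 128-dim character-frequency vector. Production code calls an embeddings API.
    vec = [0.0] * 128
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs; a vector store in production

    def store(self, text, response):
        self.entries.append((embed(text), response))

    def lookup(self, text):
        vec = embed(text)
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:      # linear scan; ANN index in production
            sim = cosine(vec, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        if best_sim >= self.threshold:
            return best_resp                # cache hit: skip the model call
        return None                         # miss: call the model, then store()

cache = SemanticCache(threshold=0.95)
cache.store("How do I reset my password?", "Go to Settings > Security > Reset.")
print(cache.lookup("How can I reset my password?"))  # near-duplicate -> hit
print(cache.lookup("What is the refund policy?"))    # unrelated -> miss (None)
```

Even the toy embedding shows the behavior that matters: a paraphrase clears the threshold, an unrelated question doesn't.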
Where it works
- RAG retrieval where the corpus is stable. Documentation Q&A, internal knowledge bases, product help.
- Classification and tagging. Sentiment, intent, content moderation — the inputs cluster naturally.
- High-volume customer support flows with a long tail of duplicated questions.
- Boilerplate generation — legal clauses, marketing variants, templated emails.
Where it breaks
- Personalized output. "Summarize my last 3 emails" must never hit a cache from another user. Scope keys to user/session.
- Time-sensitive answers. Stock prices, news, schedules. Use TTLs aggressively or skip the cache entirely.
- Long-context, low-repetition workloads like fresh document analysis. Hit rate stays under 5% — not worth the embedding cost.
- Anything where "approximately right" is wrong. Code generation, financial calculations, medical.
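The first two failure modes have mechanical mitigations: namespace every entry by a scope key (user or session id) and attach a TTL. A sketch under stated assumptions — a plain dict stands in for the vector store, and the injectable `now` is just for determinism; in production the scope key becomes a metadata filter on the vector search and the TTL an eviction policy:

```python
import time

class ScopedTTLCache:
    # Entries are namespaced by a scope key and expire after a TTL, so one
    # user's personalized answers never serve another user, and time-sensitive
    # answers age out instead of going stale.
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.entries = {}  # (scope, query) -> (response, stored_at)

    def store(self, scope, query, response, now=None):
        now = time.time() if now is None else now
        self.entries[(scope, query)] = (response, now)

    def lookup(self, scope, query, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get((scope, query))
        if hit is None:
            return None
        response, stored_at = hit
        if now - stored_at > self.ttl:       # expired: treat as a miss
            del self.entries[(scope, query)]
            return None
        return response

cache = ScopedTTLCache(ttl_seconds=60)
cache.store("user-a", "summarize my last 3 emails", "summary for user-a", now=0)
print(cache.lookup("user-a", "summarize my last 3 emails", now=10))   # hit
print(cache.lookup("user-b", "summarize my last 3 emails", now=10))   # None: wrong scope
print(cache.lookup("user-a", "summarize my last 3 emails", now=120))  # None: expired
```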
Picking a similarity threshold
This is the lever that destroys quality if you set it wrong. Defaults we use as a starting point (cosine similarity on `text-embedding-3-small`):
| Workload | Cosine threshold |
| --- | --- |
| RAG retrieval | 0.95 |
| Classification | 0.92 |
| Customer support | 0.97 |
| Boilerplate | 0.90 |
Tune by sampling 200 cache hits per endpoint and judging each (input, served-from-cache response) pair with an LLM-as-judge plus a human spot-check. If precision drops below 95%, raise the threshold.
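That tuning step can be wired up as a small function. `judge` is whatever LLM-as-judge pipeline you already run (any callable works); the toy judge below, the sampled pairs, and the 0.01 step size are all illustrative assumptions:

```python
def tune_threshold(cache_hits, judge, threshold, target_precision=0.95, step=0.01):
    # cache_hits: list of (input, served_response) pairs sampled from production.
    # judge: callable returning True if the served response is correct for the
    # input; in practice an LLM-as-judge backed by human spot-checks.
    correct = sum(1 for pair in cache_hits if judge(*pair))
    precision = correct / len(cache_hits)
    if precision < target_precision:
        return min(threshold + step, 1.0), precision  # tighten the cache
    return threshold, precision

# Toy judge: a hit is "correct" only if the served answer matches the topic.
hits = [("reset password", "auth-help"), ("reset password", "auth-help"),
        ("billing issue", "auth-help"), ("billing issue", "billing-help")]
judge = lambda q, r: ("reset" in q) == (r == "auth-help")
print(tune_threshold(hits, judge, threshold=0.97))
```

Run this per endpoint on a schedule, not once: traffic drift moves the precision of a fixed threshold over time.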
Implementation cost
Embedding tokens are not free. At `$0.02 / 1M tokens` for `text-embedding-3-small`, the embedding cost is negligible vs. a frontier-model call — but if you embed every request and your hit rate is 3%, you're paying for embeddings without the savings to justify them. Measure hit rate before scaling.
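The break-even arithmetic is worth writing down. This sketch uses the embedding price above; the model-call cost and the amortized per-request infrastructure cost (vector-store hosting, ops) are assumed placeholder numbers, not quotes — infrastructure, not embedding tokens, is usually the binding constraint at low hit rates:

```python
def breakeven_hit_rate(avg_input_tokens, model_cost_per_call,
                       embed_price_per_mtok=0.02, infra_cost_per_request=0.0005):
    # Minimum hit rate at which expected savings cover per-request cache costs.
    # infra_cost_per_request is an assumed amortized figure; measure your own.
    embed_cost = avg_input_tokens / 1_000_000 * embed_price_per_mtok
    return (embed_cost + infra_cost_per_request) / model_cost_per_call

# 500-token requests against a $0.01/call model:
print(breakeven_hit_rate(avg_input_tokens=500, model_cost_per_call=0.01))
```

With these assumed numbers the break-even sits around a 5% hit rate, which is why a measured 3% is a signal to turn the cache off, not to tune it.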
Stack we typically use
- Embedding model: `text-embedding-3-small` (default), Voyage `voyage-3` or Cohere `embed-v3` for domain-specific corpora.
- Vector store: pgvector for <5M entries, Qdrant or Pinecone above that.
- Gateway: LiteLLM with custom cache backend, or Helicone's built-in semantic cache.