RESEARCH · TECHNIQUE

Model routing.

25 April 2026

Most production traffic doesn't need the frontier model. Routing easy queries to a smaller, cheaper model and hard queries to the frontier model is the single highest-leverage optimization in the playbook: typically a 30–50% cost reduction with no quality regression, when you do it right. Here's how we do it.

The two routing patterns that actually work

1. Classifier-up-front

A small classifier (often a fine-tuned `gpt-5-mini`, `claude-haiku`, or a local distilled BERT) reads the incoming request and predicts difficulty. Easy → cheap model. Hard → frontier model. Edge cases → frontier model.

Works well when your traffic has visibly distinct difficulty buckets (e.g. customer support: simple FAQ vs. multi-turn troubleshooting). Doesn't work when difficulty is uncorrelated with surface features.
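As a sketch, classifier-up-front is a thin dispatch layer in front of the model call. The classifier below is a stand-in keyed on surface features; in a real deployment it would be a call to the fine-tuned classifier, and both model names are placeholders, not from the post:

```python
from dataclasses import dataclass

# Placeholder model identifiers -- substitute whatever your stack uses.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

@dataclass
class RoutingDecision:
    model: str
    reason: str

def classify_difficulty(query: str) -> str:
    """Stand-in for the fine-tuned classifier: returns 'easy', 'hard',
    or 'unsure'. This stub keys off surface features, which is exactly
    what fails when difficulty is uncorrelated with them."""
    if len(query) > 500 or "\n" in query:
        return "hard"
    if query.rstrip().endswith("?") and len(query.split()) < 20:
        return "easy"
    return "unsure"

def route(query: str) -> RoutingDecision:
    label = classify_difficulty(query)
    # Edge cases ('unsure') deliberately go to the frontier model:
    # mis-routing a hard query costs more than overpaying on an easy one.
    if label == "easy":
        return RoutingDecision(CHEAP_MODEL, "classified easy")
    return RoutingDecision(FRONTIER_MODEL, f"classified {label}")
```

The asymmetric default (everything uncertain goes up) is what keeps quality flat while the classifier is still immature.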

2. Cascade with confidence

The cheap model attempts the answer first. If its self-reported confidence (or a verifier check) falls below a threshold, escalate to the frontier model. For the ~10–20% of traffic that escalates, you pay for both calls; the frontier-model cost lands on only that fraction of overall traffic.

Works well when the cheap model is right most of the time and confidence is well-calibrated. Don't trust raw `logprobs` as confidence on aligned chat models — train or distill a verifier.
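The cascade reduces to a small control-flow wrapper. Model calls are stubbed as callables here, and the 0.8 threshold is illustrative, not a number from the post:

```python
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against your eval set

def cascade(query: str,
            cheap: Callable[[str], Tuple[str, float]],
            frontier: Callable[[str], str]) -> Tuple[str, bool]:
    """Try the cheap model first; escalate when confidence is low.

    Returns (answer, escalated). `cheap` returns (answer, confidence),
    where confidence should come from a trained verifier, not raw
    logprobs on an aligned chat model."""
    answer, confidence = cheap(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, False
    # Below threshold: the cheap attempt is sunk cost; the frontier
    # call is paid on this fraction of traffic only.
    return frontier(query), True
```

The threshold is the main tuning knob: raising it trades escalation cost for quality, so it should be set against the same eval set used to qualify the cheap model.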

Patterns that don't work

Quality measurement

Every router ships with a 7-day A/B against the production baseline. Quality SLOs are agreed per endpoint:

If any SLO degrades beyond its agreed threshold, the router auto-rolls back. There's no shipping a router without a rollback path.
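A minimal version of that rollback gate, assuming each A/B arm produces per-request eval scores in [0, 1]; the function name and the 2-point degradation budget are illustrative, not from the post:

```python
from typing import Sequence

def should_roll_back(baseline_scores: Sequence[float],
                     routed_scores: Sequence[float],
                     max_degradation: float = 0.02) -> bool:
    """Compare mean eval score of the routed arm against the production
    baseline arm; trip the rollback if the drop exceeds the agreed
    per-endpoint budget. A real gate would also check sample size and
    statistical significance before tripping."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    routed = sum(routed_scores) / len(routed_scores)
    return (baseline - routed) > max_degradation
```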

Stack

Order of operations

  1. Identify the top 3 spend endpoints.
  2. For each: build a 200-row eval set with frontier-model answers as reference.
  3. Test 2–3 candidate cheaper models against the eval set. Keep the ones that pass quality.
  4. Build the classifier or cascade on the candidates that passed.
  5. Ship at 1% traffic. Then 10%. Then 50%. Then 100%. A/B at every step.
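Step 3 can be expressed as a simple pass/fail gate over the eval set. The `judge` callable and the 90% pass bar below are assumptions for illustration; in practice the judge is usually an LLM or rubric scorer comparing the candidate's answer against the frontier reference:

```python
from typing import Callable, Sequence

def passes_eval(candidate_answers: Sequence[str],
                reference_answers: Sequence[str],
                judge: Callable[[str, str], int],
                min_pass_rate: float = 0.9) -> bool:
    """Gate a candidate cheap model on the eval set.

    `judge(candidate, reference)` returns 1 if the candidate answer is
    acceptable against the frontier reference, else 0. The candidate
    survives to step 4 only if its pass rate clears the bar."""
    scores = [judge(c, r)
              for c, r in zip(candidate_answers, reference_answers)]
    return sum(scores) / len(scores) >= min_pass_rate
```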
