Model Routing LLM Cost Playbook

Desk note

Routing works when it is measurable and explainable. If you cannot explain why the router picked a model, you will not be able to debug spend spikes or quality regressions.

What model routing is

Model routing is choosing which model to call for each step of a workflow instead of sending everything to the same default. The goal is to match model capability to task risk: cheap models handle low-risk steps, and expensive models are reserved for hard or high-stakes steps.

Routing input: task type, risk level, budget, and past failure modes.
Routing output: chosen model, fallback plan, and a traceable reason.
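
As a minimal sketch, the input and output above could be modeled as two small records. The field names, the "low"/"high" risk scale, and the model names `cheap-model` and `strong-model` are placeholders assumed for illustration, not part of any real API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoutingInput:
    task_type: str                  # e.g. "classification" (hypothetical label)
    risk_level: str                 # assumed scale: "low" or "high"
    budget_usd: float               # remaining budget for this task
    past_failure_modes: List[str] = field(default_factory=list)


@dataclass
class RoutingDecision:
    model: str                      # chosen model (placeholder name)
    fallback_models: List[str]      # ordered fallback plan
    reason: str                     # traceable, human-readable reason


def route(inp: RoutingInput) -> RoutingDecision:
    """Toy rule set; it ignores budget to keep the sketch short."""
    if inp.risk_level == "high":
        return RoutingDecision("strong-model", [], "risk_level=high")
    if inp.past_failure_modes:
        reason = "past_failures=" + ",".join(inp.past_failure_modes)
        return RoutingDecision("strong-model", [], reason)
    return RoutingDecision("cheap-model", ["strong-model"], "default_cheap_tier")
```

Every decision carries its reason, so a spend spike can be traced back to the rule that fired.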

Where routing saves money

Routing saves money when many steps do not require premium reasoning: classification, extraction, summarization, rewriting, formatting, and simple lookups. The savings get bigger when agents are involved, because agents produce many small calls that compound.

High leverage: agent tool loops and multi-step workflows.
Lower leverage: one-shot tasks where model choice is already obvious.

Policy patterns that work

Start with policies you can explain. Then add complexity only when you can measure the improvement.

Tiered routing: cheap default, expensive on failure or uncertainty.
Budget routing: cap spend per task; degrade gracefully when budget is hit.
Risk routing: reserve strong models for customer-visible or high-stakes steps.
Context routing: if the context is too large for the default model, prefer a model with a larger context window or compress the context first.
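
The four patterns can be combined into one ordered rule list. The sketch below is one possible arrangement, not a recommended ordering; the model names and the 16,000-token limit are assumptions made up for the example.

```python
from typing import Tuple

# Hypothetical model names and limit, for illustration only.
CHEAP, STRONG, LONG_CONTEXT = "cheap-model", "strong-model", "long-context-model"
CHEAP_CONTEXT_LIMIT = 16_000  # tokens; assumed, not a real model limit


def choose_model(
    customer_visible: bool,
    context_tokens: int,
    spent_usd: float,
    budget_usd: float,
    previous_attempt_failed: bool,
) -> Tuple[str, str]:
    """Return (model, reason). Rules are checked in a fixed order."""
    # Budget routing: once the cap is hit, degrade to the cheap tier.
    if spent_usd >= budget_usd:
        return CHEAP, "budget_cap_reached"
    # Context routing: large contexts go to a model that can hold them.
    if context_tokens > CHEAP_CONTEXT_LIMIT:
        return LONG_CONTEXT, "context_exceeds_cheap_limit"
    # Risk routing: customer-visible steps get the strong model.
    if customer_visible:
        return STRONG, "customer_visible_step"
    # Tiered routing: escalate only after the cheap tier has failed.
    if previous_attempt_failed:
        return STRONG, "escalated_after_failure"
    return CHEAP, "default_cheap_tier"
```

Because the first matching rule wins, the order is itself a policy decision: here the budget cap overrides risk, which may not suit every workflow.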

Fallback and stop conditions

Most routing failures are not model failures. They are control failures: retries without caps, tools without guards, and fallback chains that keep escalating cost. Treat stop conditions as part of the routing policy.

Cap retries per step and per task.
Stop when acceptance cannot be reached without human input.
Prevent recursive summarization and repeated context reloads.
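
One way to make the first two stop conditions concrete is a runner that enforces both caps and hands off to a human when they are exhausted. This is a sketch under assumed caps; `run_task`, `NeedsHuman`, and the cap values are hypothetical names and numbers.

```python
from typing import Callable, List, Optional

MAX_RETRIES_PER_STEP = 2  # assumed cap on retries for a single step
MAX_CALLS_PER_TASK = 6    # assumed cap on model calls across the whole task


class NeedsHuman(Exception):
    """Raised when acceptance cannot be reached within the caps."""


def run_task(steps: List[Callable[[int], Optional[str]]]) -> List[str]:
    """Run steps in order. Each step receives the attempt number and returns
    an accepted output, or None if the output failed its check."""
    calls_used = 0
    outputs: List[str] = []
    for index, step in enumerate(steps):
        result = None
        for attempt in range(1 + MAX_RETRIES_PER_STEP):
            if calls_used >= MAX_CALLS_PER_TASK:
                raise NeedsHuman(f"task call cap hit at step {index}")
            calls_used += 1
            result = step(attempt)
            if result is not None:
                break
        if result is None:
            raise NeedsHuman(f"step {index} failed after {attempt + 1} attempts")
        outputs.append(result)
    return outputs
```

Raising instead of escalating again keeps a failing task from silently climbing the fallback chain.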

Evaluation and acceptance signals

Routing needs an objective signal. Use acceptance state and lightweight evals so you can tell whether savings are real or whether they shifted cost into human review.

Track accepted vs. edited vs. rejected outcomes by route.
Add a small evaluation set for each workflow before changing policies.
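
Tracking outcomes by route can start as a simple tally. The sketch below assumes each record is a `(route, outcome)` pair with outcome labels matching the three states above; the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

OUTCOMES = ("accepted", "edited", "rejected")


def outcome_rates_by_route(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Return, for each route, the share of outcomes in each state."""
    counts = defaultdict(Counter)
    for route, outcome in records:
        counts[route][outcome] += 1
    rates = {}
    for route, counter in counts.items():
        total = sum(counter.values())
        rates[route] = {o: counter[o] / total for o in OUTCOMES}
    return rates
```

Comparing these rates before and after a policy change shows whether a cheaper route is quietly producing more edits or rejections.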

Observability: what to log

To debug tokenmaxxing and routing decisions, you need the trace, not the invoice. Log the route and the reason alongside the spend.

Per call: workflow, owner, prompt version, route, model, tokens, cost, retries, and latency.
Decision reason: policy rule hit, uncertainty score, or failure mode that triggered escalation.
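
A log record carrying these fields might look like the following sketch. The field names and units are assumptions; any structured logger or tracing tool could stand in for the JSON line.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RouteLogRecord:
    workflow: str
    owner: str
    prompt_version: str
    route: str            # e.g. "cheap" or "strong" (hypothetical tier names)
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    retries: int
    latency_ms: int
    decision_reason: str  # rule hit, uncertainty score, or failure mode


def to_log_line(record: RouteLogRecord) -> str:
    """Serialize one call as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the reason in the same record as the cost means a spend spike can be grouped by the rule that caused it.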

Frequently asked questions

What is the simplest routing policy?

A two-tier policy: default to a cheaper model, then escalate to a stronger model only when the output fails a check (format validation, a factuality guardrail, a test run, or reviewer approval).
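
A minimal sketch of that two-tier policy, using a format check as the escalation trigger. The model calls are passed in as plain functions, so no provider API is assumed; `two_tier` and `is_valid_json` are hypothetical names.

```python
import json
from typing import Callable, Tuple


def is_valid_json(text: str) -> bool:
    """Format check: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def two_tier(
    prompt: str,
    call_cheap: Callable[[str], str],
    call_strong: Callable[[str], str],
    passes_check: Callable[[str], bool] = is_valid_json,
) -> Tuple[str, str]:
    """Return (output, reason). Escalate once, only on a failed check."""
    draft = call_cheap(prompt)
    if passes_check(draft):
        return draft, "cheap_passed_check"
    return call_strong(prompt), "escalated_after_failed_check"
```

The strong model is called at most once per request, so the worst-case cost is one cheap call plus one strong call.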

How do you know routing is working?

Cost per accepted outcome goes down while acceptance rate and review burden stay stable. If savings come with more edits, more retries, or more escalations, you have shifted cost rather than reduced it.
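
Cost per accepted outcome can be computed directly from call logs. The sketch assumes each log entry is a dict with `cost_usd` and `outcome` keys; those key names are illustrative.

```python
from typing import Dict, Iterable, Optional


def cost_per_accepted_outcome(calls: Iterable[Dict]) -> Optional[float]:
    """Total spend, including retries and rejected outputs, divided by
    the number of accepted outcomes. Returns None if nothing was accepted."""
    calls = list(calls)
    total_cost = sum(c["cost_usd"] for c in calls)
    accepted = sum(1 for c in calls if c["outcome"] == "accepted")
    if accepted == 0:
        return None
    return total_cost / accepted
```

The numerator includes failed and retried calls on purpose: a route that looks cheap per call can still be expensive per accepted result.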

Do you need a router product to do routing?

Not at first. Many teams start with application-level rules (task category -> model) and add a router or gateway once they need centralized policy management, observability, and provider abstraction.
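
Application-level rules can be as small as a lookup table. In this sketch the categories and model names are placeholders, and sending unknown categories to the strong tier is an assumed design choice, not a rule from this guide.

```python
# Hypothetical task categories and model names.
MODEL_BY_CATEGORY = {
    "classification": "cheap-model",
    "extraction": "cheap-model",
    "summarization": "cheap-model",
    "code_review": "strong-model",
}

# Assumed fail-safe: unknown categories go to the strong tier.
DEFAULT_MODEL = "strong-model"


def model_for(category: str) -> str:
    """Map a task category to a model, falling back to the default."""
    return MODEL_BY_CATEGORY.get(category, DEFAULT_MODEL)
```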

Practical next step

Pick one workflow, define two routing tiers (cheap vs. expensive), and track cost per accepted outcome before tuning anything else.

Operator checklist

Define outcome states for each workflow (accepted, edited, rejected, escalated) and the bar an output must clear to count as accepted.
Tag every call with workflow, owner, route, model, retries, and cost.
Start with simple tiers (cheap vs. expensive) before dynamic policies.
Log router decisions so cost and quality changes are explainable.

Watchouts

Fallback loops can erase savings if retries are not capped.
A cheap model can be expensive if it causes repair loops or review debt.
Routing without evals turns quality regressions into anecdote fights.

Current feed records connected to this guide

The problem with AI model routing

Introducing Claude Sonnet 5

Meituan open-sources LongCat-2.0 — the 1.6T model that topped OpenRouter as Owl Alpha

Tools that make the guide operational

LangGraph

LiteLLM

Langfuse
