Guide

How to Reduce Wasted LLM Tokens

A field guide to reducing bloated prompts, irrelevant context, repeated requests, malformed outputs, and runaway agent loops.

Updated 2026-05-12
cost-control / token-consumption / model-routing

Desk note

Token reduction is only a win when the quality of accepted output holds. The target is not smaller prompts for their own sake; it is less repeated, irrelevant, or repair-heavy work.

Reduce context before ambition

Most waste starts with context discipline. Teams send whole files, long histories, and irrelevant documents because it feels safer than retrieval or task decomposition. The result is expensive calls that are harder to inspect.

  • Split tasks before sending giant context windows.
  • Use retrieval to send targeted chunks rather than every document.
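
As a rough illustration of sending targeted chunks instead of every document, the sketch below ranks chunks by keyword overlap with the query and keeps only what fits a token budget. The function names, the overlap scoring, and the four-characters-per-token estimate are all assumptions made for this example; a production system would more likely use embeddings and the provider's real tokenizer.

```python
import re


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def tokenize(text: str) -> set[str]:
    # Lowercased alphanumeric terms; deliberately simple for the sketch.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_chunks(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Return the chunks that share terms with the query and fit the budget."""
    query_terms = tokenize(query)
    scored = []
    for index, chunk in enumerate(chunks):
        overlap = len(query_terms & tokenize(chunk))
        if overlap > 0:  # chunks with no shared terms are never sent
            scored.append((overlap, index, chunk))

    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    selected: list[str] = []
    used = 0
    for _, _, chunk in scored:
        cost = estimate_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected
```

The budget is the useful part: it forces an explicit decision about how much context a call deserves, and it makes the resulting prompt small enough to inspect by hand.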

Route simple work down

Not every step needs the strongest model. Classification, extraction, formatting, low-risk planning, and validation are common candidates for cheaper routes once evals prove the quality bar holds.

  • Route by task risk, not by habit.
  • Keep a fallback path when confidence is low.
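
One way to picture routing by task risk with a fallback is the sketch below. The route names, the `call_model` callable, the confidence score, and the 0.8 threshold are hypothetical placeholders for this example, not any provider's API; the list of low-risk task types comes from the paragraph above and should be confirmed by your own evals.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical route names; substitute the models your evals have cleared.
CHEAP_ROUTE = "small-model"
STRONG_ROUTE = "large-model"

# Task types that are common candidates for the cheaper route.
LOW_RISK_TASKS = {"classification", "extraction", "formatting", "validation"}


@dataclass
class Result:
    text: str
    confidence: float  # 0.0 to 1.0, however your system estimates it
    route: str


def route_for(task_type: str) -> str:
    # Route by task risk: anything not explicitly low-risk goes to the strong model.
    return CHEAP_ROUTE if task_type in LOW_RISK_TASKS else STRONG_ROUTE


def run_with_fallback(
    task_type: str,
    prompt: str,
    call_model: Callable[[str, str], tuple[str, float]],
    min_confidence: float = 0.8,
) -> Result:
    """Try the route chosen by risk; escalate once if the cheap route is unsure."""
    route = route_for(task_type)
    text, confidence = call_model(route, prompt)
    if route == CHEAP_ROUTE and confidence < min_confidence:
        # Fallback path: pay for the strong model only when needed.
        route = STRONG_ROUTE
        text, confidence = call_model(route, prompt)
    return Result(text=text, confidence=confidence, route=route)
```

Defaulting unknown task types to the strong route keeps the failure mode conservative: a task moves down only after it has been named and evaluated.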

Stop paying for repeated work

Semantic caching, prompt normalization, deterministic pre-processing, and saved intermediate results can prevent teams from generating the same expensive answer again and again.

  • Start with the most repeated expensive calls.
  • Cache only where freshness and permissions are understood.
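
The sketch below combines prompt normalization with a small cache that respects freshness and permissions. It is an exact-match cache over normalized prompts, not a semantic cache, and the class name, the `scope` argument, and the TTL handling are assumptions for illustration. Lowercasing is also an assumption that only suits prompts where case carries no meaning.

```python
import hashlib
import time
from typing import Callable


def normalize_prompt(prompt: str) -> str:
    # Collapse whitespace and case so trivially different prompts share a key.
    # Assumption: case does not change the meaning of these prompts.
    return " ".join(prompt.lower().split())


class PromptCache:
    """Exact-match cache keyed by permission scope plus normalized prompt."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def _key(self, scope: str, prompt: str) -> str:
        # The scope is part of the key so answers never cross permission boundaries.
        raw = f"{scope}\n{normalize_prompt(prompt)}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_compute(self, scope: str, prompt: str, compute: Callable[[str], str]) -> str:
        key = self._key(scope, prompt)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # fresh hit: no model call
        answer = compute(prompt)  # miss or stale entry: pay for one call
        self._entries[key] = (now, answer)
        return answer
```

Passing the model call in as `compute` keeps the cache independent of any provider, and the injectable `clock` makes expiry easy to test.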

Constrain agents

Agents need explicit budgets: step limits, stop conditions, retry caps, tool budgets, and escalation rules. Otherwise a vague task can become a long trace that looks busy while it burns through model calls.

  • Require a stopping reason on each trace.
  • Alert on retry loops and long-running tasks.
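
A minimal sketch of explicit agent budgets follows, assuming a hypothetical `step_fn` that reports each step's outcome as `"done"`, `"continue"`, `"tool"`, or `"retry"`. The names and default limits are illustrative only; the point is that every run ends with a recorded stopping reason.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_steps: int = 10
    max_retries: int = 2
    max_tool_calls: int = 5


@dataclass
class Trace:
    steps: int = 0
    retries: int = 0
    tool_calls: int = 0
    stop_reason: str = ""


def run_agent(step_fn: Callable[[Trace], str], budget: Budget) -> Trace:
    """Run step_fn until it finishes or a budget is exhausted.

    Any outcome other than 'done', 'retry', or 'tool' is treated as 'continue'.
    """
    trace = Trace()
    while True:
        if trace.steps >= budget.max_steps:
            trace.stop_reason = "step_limit"
            break
        trace.steps += 1
        outcome = step_fn(trace)
        if outcome == "done":
            trace.stop_reason = "completed"
            break
        if outcome == "retry":
            trace.retries += 1
            if trace.retries > budget.max_retries:
                trace.stop_reason = "retry_cap"
                break
        elif outcome == "tool":
            trace.tool_calls += 1
            if trace.tool_calls > budget.max_tool_calls:
                trace.stop_reason = "tool_budget"
                break
    return trace


def needs_escalation(trace: Trace) -> bool:
    # Escalation rule for the sketch: anything that did not complete goes to a human.
    return trace.stop_reason != "completed"
```

Because `stop_reason` is always set before the loop exits, alerting on retry loops or long-running tasks becomes a filter over traces rather than a log-reading exercise.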

Source trail

Current feed records connected to this guide

news

5 Best Model Routing Platforms for AI Agent Systems

Augment Code rounds up model routing options for agent systems - tools that decide which model to call per step to balance quality, latency, and cost.

tokenmaxxing / agents / token-consumption

guide

Multi-Agent Cost Compounding: Why 3 Agents Cost 10x

Augment Code breaks down why adding agents can explode costs: orchestration overhead, context handoffs, retries, and verification loops often dominate raw model pricing.

tokenmaxxing / agents / token-consumption

news

Anthropic tightens limits on Claude subscriptions - Axios

Axios reports Anthropic is tightening what paid Claude subscribers can do, shifting heavy third-party agent usage behind a separate credit meter.

tokenmaxxing / coding-agents / agents

Project layer

Tools that make the guide operational

#12 Direct
Caching

GPTCache

zilliztech/GPTCache

A semantic cache for LLM applications, with integrations for LangChain and LlamaIndex-style workflows.

8K / 583 / MIT
semantic-cache / cost-control / latency

#1 Direct
Routing

LiteLLM

BerriAI/litellm

An OpenAI-compatible gateway and SDK for calling many model providers with budgets, logging, load balancing, guardrails, and cost tracking.

47.8K / 8.2K / Source-available
gateway / cost-tracking / routing

#2 Direct
Observability

Langfuse

langfuse/langfuse

Open-source LLM engineering platform for observability, traces, metrics, evals, prompt management, datasets, and playground workflows.

27.6K / 2.8K / Source-available
traces / evals / costs