Stop Overpaying for LLM APIs. Here Is the Single Fix People Are Not Talking About

Your LLM bill is higher than it should be. Not because you use the API too much, but because you pay for the same computation over and over. Every time your system prompt repeats, every time the same tool definitions load, every time conversation history gets re-processed — you burn money on work already done.

The fix exists. It is called prompt caching: the provider stores the attention states (KV cache) of a prompt prefix server-side, so the model skips re-processing identical instructions. Cached input tokens cost 90-99% less than uncached ones, depending on the provider, and latency on long prompts can drop by 80-85%.

How Prompt Caching Works

The technology relies on exact prefix matching. The API caches content starting from the very first token (token 0) up to a designated breakpoint. If even one character—such as a timestamp or a reordered tool—differs in that prefix, the cache is invalidated for everything following it.

Transformer models process all prompt tokens in parallel during the prefill phase, generating a Key-Value (KV) cache for each attention layer. Building this cache is expensive: when a prompt fills a 128K context window, prefill can account for most of the compute in the request. Repeatedly processing the same system prompt, tool definitions, or conversation history wastes GPU cycles. Prompt caching solves this by persisting the KV state across requests, skipping redundant computation.

Cache hits require byte-exact prefix alignment. Common cache-busting mistakes:

Timestamps or dates in system prompts
Reordered tool definitions
Random request or session IDs embedded in the prompt
Differing whitespace or trailing newlines
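One way to catch these mistakes before they reach production is to compare the prompts of two consecutive requests and see how much of the prefix they share. The sketch below is illustrative only: `build_prompt_bad`, `build_prompt_good`, `shared_prefix_len`, and the Acme system prompt are hypothetical names, and it compares characters rather than tokens, which only approximates what a provider's cache does.

```python
from datetime import datetime, timezone

# Hypothetical stable instructions; in practice this would be much longer.
SYSTEM_PROMPT = "You are a helpful support agent for Acme Corp.\n"


def build_prompt_bad(question: str, now: datetime) -> str:
    # Timestamp at the front: every request starts with different text.
    return f"Current time: {now.isoformat()}\n{SYSTEM_PROMPT}User: {question}"


def build_prompt_good(question: str, now: datetime) -> str:
    # Stable text first, volatile values (time, question) last.
    return f"{SYSTEM_PROMPT}Current time: {now.isoformat()}\nUser: {question}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two strings have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


# Two requests sent seven seconds apart.
t1 = datetime(2026, 6, 20, 12, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 6, 20, 12, 0, 7, tzinfo=timezone.utc)

bad = shared_prefix_len(build_prompt_bad("Hi", t1), build_prompt_bad("Hi", t2))
good = shared_prefix_len(build_prompt_good("Hi", t1), build_prompt_good("Hi", t2))
print(f"shared prefix, timestamp first: {bad} chars")
print(f"shared prefix, timestamp last:  {good} chars")
```

With the timestamp first, the shared prefix ends inside the timestamp and none of the system prompt is reusable. With the timestamp last, the whole system prompt sits inside the shared prefix.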

Tools like Claude Code avoid these by using system message updates (not prompt edits), tool stubs that stay at fixed positions, and appending new instructions as user messages instead of rebuilding prefixes.

How DeepSeek Is Able to Offer Such Low API Costs

DeepSeek achieves aggressive savings (up to 99% cheaper cached input) through smart caching.

Their cache is small enough to store on regular disk drives instead of expensive GPU memory. This means they can keep cached data for hours or days at almost no cost, and pass those savings to users.

Caching is fully automatic. No headers to set, no TTL knobs to turn, no breakpoints to configure. Every request gets cached from the first token. The cache lives until the prefix stops being used.

The numbers speak for themselves.

| Model | Cache Miss | Cache Hit | Output |
|-------|-----------|-----------|--------|
| DeepSeek V4 Flash | $0.14/MTok | $0.0028/MTok | $0.28/MTok |
| DeepSeek V4 Pro | $0.435/MTok | $0.003625/MTok | $0.87/MTok |

A cache hit on Flash costs $0.0028 per million tokens, 98% cheaper than a miss at $0.14. On Pro, a hit costs $0.003625 against $0.435 for a miss, about 99% cheaper.
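Real workloads land somewhere between all hits and all misses, so the number that matters is the blended input price. Here is a minimal sketch using the prices from the table above. The function name `blended_input_price` is made up for illustration, and the hit rate is assumed to be the fraction of input tokens served from cache.

```python
def blended_input_price(miss_price: float, hit_price: float, hit_rate: float) -> float:
    """Average input price per million tokens for a cache hit rate in [0, 1]."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# USD per million input tokens: (cache miss, cache hit), from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.0028),
    "DeepSeek V4 Pro": (0.435, 0.003625),
}

for model, (miss, hit) in PRICES.items():
    for rate in (0.0, 0.5, 0.9):
        price = blended_input_price(miss, hit, rate)
        print(f"{model}: {rate:.0%} hits -> ${price:.4f}/MTok")
```

The blended price falls linearly with the hit rate, so every extra point of hit rate is worth the same amount. That is why a stable prefix matters more than any other tuning.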

Pricing as of June 20, 2026 — api-docs.deepseek.com/quick_start/pricing

Anthropic: Explicit Control for Maximum Hit Rates

Anthropic takes the opposite approach — explicit opt-in with cache_control in the Messages API. You set breakpoints at stable prefix boundaries:

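import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,  # long, unchanging instructions
            # Cache breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Your question here"}
    ]
)

# A non-zero value means the prefix was read from cache on this call.
print(response.usage.cache_read_input_tokens)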

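Two TTL tiers exist, and both use the ephemeral cache type. The default 5-minute TTL (writes cost 1.25x the base input price) works for interactive sessions. The optional 1-hour TTL (writes cost 2x) suits truly static prompts that are reused across longer gaps. Both give a 90% discount on cache reads.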
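Choosing between the two tiers comes down to one extra field on the breakpoint. The sketch below assumes the Messages API accepts a `ttl` value of `"5m"` or `"1h"` inside `cache_control`. The helper `cached_system_block` and the sample prompt are hypothetical, and some SDK or API versions may require a beta header for the 1-hour option, so check the current docs before relying on it.

```python
from typing import Literal


def cached_system_block(text: str, ttl: Literal["5m", "1h"] = "5m") -> dict:
    """Build a system content block that ends with a cache breakpoint."""
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        # The 5-minute TTL is the default, so only the longer one is spelled out.
        cache_control["ttl"] = "1h"
    return {"type": "text", "text": text, "cache_control": cache_control}


# Pass the list as the `system` argument of client.messages.create(...).
system = [cached_system_block("You are a contract-review assistant.", ttl="1h")]
print(system)
```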

The cost difference with and without caching is massive.

| Model | Base Input | Cache Write (5m) | Cache Write (1h) | Cache Hit | Output |
|-------|-----------|-----------------|-----------------|-----------|--------|
| Claude Opus 4.8 | $5/MTok | $6.25/MTok | $10/MTok | $0.50/MTok | $25/MTok |
| Claude Sonnet 4.6 | $3/MTok | $3.75/MTok | $6/MTok | $0.30/MTok | $15/MTok |

Pricing as of June 20, 2026 — claude.ai/pricing

Opus 4.8 drops from $5 to $0.50 per million input tokens on cache hits. Sonnet 4.6 drops from $3 to $0.30. If your workflow reuses the same system prompt and tool definitions across many calls, caching turns $5 into $0.50 for the bulk of your input.
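Cache writes cost more than plain input, so it is worth checking how many calls it takes before caching comes out ahead. The following sketch works that out from the multipliers in the table (1.25x or 2x to write, 0.1x to read). It assumes the first call writes the cache and every later call arrives before the TTL expires and hits it. The function names are illustrative.

```python
def cost_without_cache(calls: int, base_price: float) -> float:
    """Cost of sending the same prefix `calls` times at the base input price."""
    return calls * base_price


def cost_with_cache(calls: int, base_price: float, write_mult: float,
                    read_mult: float = 0.1) -> float:
    """First call pays the cache-write price; every later call pays the read price."""
    if calls <= 0:
        return 0.0
    return base_price * write_mult + (calls - 1) * base_price * read_mult


def break_even_calls(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    if read_mult >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as base input")
    calls = 1
    while cost_with_cache(calls, 1.0, write_mult, read_mult) >= cost_without_cache(calls, 1.0):
        calls += 1
    return calls


print("5-minute TTL pays off from call", break_even_calls(1.25))
print("1-hour TTL pays off from call", break_even_calls(2.0))

# Claude Sonnet 4.6 at $3/MTok base input, per million prefix tokens, 100 calls.
print("no cache:  ", cost_without_cache(100, 3.0))
print("with cache:", round(cost_with_cache(100, 3.0, 1.25), 2))
```

Under these assumptions the 5-minute tier is already cheaper on the second call and the 1-hour tier on the third, so the write premium only hurts for prefixes that are sent once or twice and never again.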

To keep cache hit rates high, structure prompts from most stable to most volatile: global tools and system prompt first (cache breakpoint here), then project context, session context, and conversation history last. For branching tasks, append instructions as new user messages to existing cached history instead of making fresh uncached calls — this preserves the cached prefix.
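The ordering above can be sketched as a small request builder. This is one possible layout, not a prescribed one: `build_request`, the sample tool, and the sample texts are hypothetical, and the returned dict is meant to be passed to `client.messages.create(...)` together with `model` and `max_tokens`.

```python
def build_request(tools: list, global_prompt: str, project_context: str,
                  history: list, new_user_message: str) -> dict:
    """Assemble request fields ordered from most stable to most volatile."""
    return {
        # Tool definitions come first; keep their order fixed between calls.
        "tools": tools,
        "system": [
            # Breakpoint 1: global instructions shared by every project.
            {"type": "text", "text": global_prompt,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 2: project context, stable within one project.
            {"type": "text", "text": project_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # History is only ever appended to, so earlier turns stay in the prefix.
        "messages": [*history, {"role": "user", "content": new_user_message}],
    }


tools = [{
    "name": "search_docs",  # hypothetical tool
    "description": "Search the project documentation.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}]
history = [
    {"role": "user", "content": "Summarize the README."},
    {"role": "assistant", "content": "It describes a CLI for log parsing."},
]

first = build_request(tools, "You are a coding assistant.", "Project: logparse",
                      history, "Now list the open TODOs.")
second = build_request(tools, "You are a coding assistant.", "Project: logparse",
                       history, "Now draft release notes.")
print(first["system"] == second["system"], first["messages"][:-1] == second["messages"][:-1])
```

Both branches differ only in their final user message, so everything before it is the same prefix and can be served from cache.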

Caching is not optional

AI inference is expensive, and larger context windows raise the cost of every request that fills them. Every workflow that repeats the same system prompt, tool definitions, documents, or conversation history is burning money on redundant computation. Long-running agents, RAG pipelines, and multi-step workflows make this worse: a single session can re-process the same prefix dozens or hundreds of times.

Always choose a provider that supports caching out of the box. If the provider requires explicit configuration, do it. Read the docs, set the breakpoints, structure your prompts correctly. The 5 minutes it takes to set up caching pays for itself in the first hour of production usage.
