Blog

Claude API Prompt Caching: 90% Cost Reduction

Claude

Add a single cache_control key to the Anthropic API and input token costs drop to 0.10x on hits. How it works, when it pays off, and when it costs you more.

You send the same system prompt to Claude 200 times a day. Every turn, the model re-reads everything. And you pay for all of it. Anthropic's prompt caching is a single key added to your request, no code rewrite required, and input token costs collapse the moment you reuse a given block.

Why you're overpaying today

Say you're running Claude as an assistant that answers questions about your internal documentation. Your system prompt is 4,000 tokens (instructions plus doc excerpts). Every conversation, every turn of that conversation, the API re-reads those 4,000 tokens as input. You're billed for those tokens even if nothing changed.

Over 200 requests a day, over a month, with several users in parallel, the bill adds up fast. The worst part: that context never changes. You're paying for inertia.

Enabling the cache: one key

Take your existing request and add cache_control to the block you want cached. That's it.

python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "User question"}
    ],
)

The system block (here a long prompt plus documentation excerpts) becomes cacheable. On the first call, Anthropic writes the cache and charges you 1.25x. On the second call within 5 minutes with the same block, it's a hit: 0.10x.

You can do the same with tools and user blocks that don't change.

The real gain on hits

Input token rateWithout cacheWith cache (hit)
Multiplier1.00x0.10x
Savingsnone90%
Billed every turnyesno, within the TTL window

Concretely: if your system prompt is 4,000 tokens at 5/milliononOpus4.7,withoutcachethats5/million on Opus 4.7, without cache that's 0.02 per call. With a cache hit, it's 0.002.Over200requestsperday,yougofrom0.002. Over 200 requests per day, you go from 4/day to $0.40/day on that block alone. And that's before counting hits on tools and conversational context.

0.10x
rate on hits
5 min
ephemeral TTL
1024
minimum tokens
1.25x
first hit (write)

The first-hit trap

The trap most integrations miss: writing the cache costs 25% more than the standard rate. So the first call costs 1.25x instead of 1.00x. Only from the second hit within 5 minutes do you start coming out ahead.

First call: cache write

You pay 1.25x. The block is written to Anthropic's cache.

Second call within 5 min: first hit

You pay 0.10x. The cumulative total over 2 calls is 1.35x, which is already cheaper than 2 uncached calls (2.00x - you're already winning).

Third call and beyond: clean hits

At 0.10x each, every additional hit widens the gap. Over 10 calls in 5 minutes, you pay roughly 2.15x instead of 10.00x.

When it pays off, when it doesn't

Use caseCaching usefulWhy
Internal doc assistant with 50+ active usersyesConstant parallel hits, TTL window always active
Support chatbot with long contextyesSame 3,000+ context tokens per conversation
Overnight batch generationyesYou send 1,000 requests back to back
One-off script, 1 call per hournoThe 5-min TTL expires between each call
Short prompts (under 1,024 tokens)noAPI ignores cache_control below the threshold

Measuring hit ratio in production

The API response includes cache_creation_input_tokens and cache_read_input_tokens counters. Log both and compute your hit ratio continuously.

python
usage = response.usage
hits = usage.cache_read_input_tokens
writes = usage.cache_creation_input_tokens
total_in = usage.input_tokens + hits + writes

hit_ratio = hits / total_in if total_in else 0
print(f"Hit ratio: {hit_ratio:.0%}")

Above 30-40% hits, you're clearly winning. Below that, you're approaching break-even, which is a signal to reconsider your traffic pattern or disable caching on certain blocks.

A hit ratio below 10% in production means the cache is costing you more than it saves. Disable it on the affected blocks.

Code-level pitfalls

The block must be byte-for-byte identical

A single character difference in content (a space, the field order in a tool's JSON, etc.) and it's a cache miss. If you dynamically build your system prompt, lock the section order and stabilize whitespace.

The cache is shared per project, not per user

Two requests from the same API project that send the same cacheable block hit the same cache. This is what makes caching massively profitable on multi-user assistants: one cache for everyone.

Ephemeral mode vs 1h

The ephemeral mode has a 5-minute TTL. Anthropic also offers a 1h mode that extends the TTL to one hour at a higher write cost. For most interactive workloads, ephemeral is more than enough.

Frequently asked questions

  • Does caching affect response quality?

    No, there is strictly zero impact on model output. You get the exact same response, faster and cheaper. Anthropic doesn't reprocess cached tokens, but the model sees them as if they were sent normally.

  • How many cache_control blocks can I put in one request?

    Up to 4 per request. You can cache your system prompt, your tools, a doc block in messages, and another block further down, all separately. Anthropic hits them independently.

  • Do cached tokens count against the context window?

    Yes. Cached tokens still occupy the model's context window. Caching doesn't give you more context, just a lower cost on the input tokens you send.

  • Is there a risk the cache gets evicted before 5 minutes?

    In practice, on an active project: no. Anthropic describes ephemeral mode as best-effort, but in use it holds the TTL window reliably. On very sparse traffic, it can happen.

Going further

Prompt caching, Anthropic docs
Official reference: full cache_control syntax, ephemeral and 1h modes, per-SDK language examples, and explanation of the cache_read/cache_creation counters.
docs.anthropic.com
Anthropic pricing, by model
The official price table with cache multipliers (0.10x on hit, 1.25x on ephemeral write) per model. Cross-reference with your actual volume.
anthropic.com

Enable caching on your largest block, measure the hit ratio over 48 hours, and you'll have real numbers on what it actually changes for your workload. If you want a review of your AI stack, let's talk.

Discuss your AI stack with Blokby