You send the same system prompt to Claude 200 times a day. Every turn, the model re-reads everything. And you pay for all of it. Anthropic's prompt caching is a single key added to your request, no code rewrite required, and input token costs collapse the moment you reuse a given block.
Why you're overpaying today
Say you're running Claude as an assistant that answers questions about your internal documentation. Your system prompt is 4,000 tokens (instructions plus doc excerpts). Every conversation, every turn of that conversation, the API re-reads those 4,000 tokens as input. You're billed for those tokens even if nothing changed.
Over 200 requests a day, over a month, with several users in parallel, the bill adds up fast. The worst part: that context never changes. You're paying for inertia.
Enabling the cache: one key
Take your existing request and add cache_control to the block you want cached. That's it.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
system=[
{
"type": "text",
"text": LONG_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
}
],
messages=[
{"role": "user", "content": "User question"}
],
)The system block (here a long prompt plus documentation excerpts) becomes cacheable. On the first call, Anthropic writes the cache and charges you 1.25x. On the second call within 5 minutes with the same block, it's a hit: 0.10x.
You can do the same with tools and user blocks that don't change.
The real gain on hits
| Input token rate | Without cache | With cache (hit) |
|---|---|---|
| Multiplier | 1.00x | 0.10x |
| Savings | none | 90% |
| Billed every turn | yes | no, within the TTL window |
Concretely: if your system prompt is 4,000 tokens at 0.02 per call. With a cache hit, it's 4/day to $0.40/day on that block alone. And that's before counting hits on tools and conversational context.
The first-hit trap
The trap most integrations miss: writing the cache costs 25% more than the standard rate. So the first call costs 1.25x instead of 1.00x. Only from the second hit within 5 minutes do you start coming out ahead.
First call: cache write
You pay 1.25x. The block is written to Anthropic's cache.
Second call within 5 min: first hit
You pay 0.10x. The cumulative total over 2 calls is 1.35x, which is already cheaper than 2 uncached calls (2.00x - you're already winning).
Third call and beyond: clean hits
At 0.10x each, every additional hit widens the gap. Over 10 calls in 5 minutes, you pay roughly 2.15x instead of 10.00x.
When it pays off, when it doesn't
| Use case | Caching useful | Why |
|---|---|---|
| Internal doc assistant with 50+ active users | yes | Constant parallel hits, TTL window always active |
| Support chatbot with long context | yes | Same 3,000+ context tokens per conversation |
| Overnight batch generation | yes | You send 1,000 requests back to back |
| One-off script, 1 call per hour | no | The 5-min TTL expires between each call |
| Short prompts (under 1,024 tokens) | no | API ignores cache_control below the threshold |
Measuring hit ratio in production
The API response includes cache_creation_input_tokens and cache_read_input_tokens counters. Log both and compute your hit ratio continuously.
usage = response.usage
hits = usage.cache_read_input_tokens
writes = usage.cache_creation_input_tokens
total_in = usage.input_tokens + hits + writes
hit_ratio = hits / total_in if total_in else 0
print(f"Hit ratio: {hit_ratio:.0%}")Above 30-40% hits, you're clearly winning. Below that, you're approaching break-even, which is a signal to reconsider your traffic pattern or disable caching on certain blocks.
A hit ratio below 10% in production means the cache is costing you more than it saves. Disable it on the affected blocks.
Code-level pitfalls
Frequently asked questions
Does caching affect response quality?
No, there is strictly zero impact on model output. You get the exact same response, faster and cheaper. Anthropic doesn't reprocess cached tokens, but the model sees them as if they were sent normally.
How many cache_control blocks can I put in one request?
Up to 4 per request. You can cache your system prompt, your tools, a doc block in messages, and another block further down, all separately. Anthropic hits them independently.
Do cached tokens count against the context window?
Yes. Cached tokens still occupy the model's context window. Caching doesn't give you more context, just a lower cost on the input tokens you send.
Is there a risk the cache gets evicted before 5 minutes?
In practice, on an active project: no. Anthropic describes ephemeral mode as best-effort, but in use it holds the TTL window reliably. On very sparse traffic, it can happen.
Going further
Enable caching on your largest block, measure the hit ratio over 48 hours, and you'll have real numbers on what it actually changes for your workload. If you want a review of your AI stack, let's talk.
Discuss your AI stack with Blokby