Updated March 2026
Claude Prompt Caching Pricing
Cache repeated input content and pay only 10% of the standard input token price on subsequent requests. Often the single most effective way to reduce Claude API costs at scale.
- Cache write: +25% vs standard
- Cache read: -90% vs standard
- Break-even: 2 requests
What Is Prompt Caching?
Prompt caching is a feature that lets you store frequently used portions of your API input on Anthropic's servers so they do not need to be re-processed on every request. When you send an API call, a significant chunk of the input is often identical across requests: system prompts, few-shot examples, shared context documents, or conversation history prefixes. Without caching, you pay the full input token price for this repeated content every single time.
With caching enabled, the first request processes and stores the cacheable content at a slight premium (25% more than standard input pricing). Every subsequent request that includes the same cached prefix pays only 10% of the standard input price for those tokens. The cache persists for 5 minutes and resets its TTL with each hit, so steady traffic keeps the cache warm indefinitely.
Under the hood, Anthropic stores the computed key-value attention states for the cached prefix. When a new request arrives with the same prefix, the model skips the forward pass for those tokens entirely, reading the pre-computed states from cache. This is why the cost reduction is so dramatic: the compute work is literally not being done.
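The pricing mechanics above reduce to a simple cost model. A quick sketch using the Sonnet 4 rates from the table below (the 1.25x and 0.10x multipliers are the documented write premium and read discount; the function names are illustrative):

```python
# Cost model for prompt caching, using Sonnet 4 pricing ($3.00/MTok input).
STANDARD_INPUT = 3.00 / 1_000_000    # $ per input token
CACHE_WRITE = STANDARD_INPUT * 1.25  # +25% premium on the cache-creating request
CACHE_READ = STANDARD_INPUT * 0.10   # 10% of standard on every cache hit

def cached_cost(prefix_tokens: int, requests: int) -> float:
    """Total input cost for a cached prefix across n requests (cache stays warm)."""
    return prefix_tokens * (CACHE_WRITE + CACHE_READ * (requests - 1))

def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Total input cost for the same prefix with no caching."""
    return prefix_tokens * STANDARD_INPUT * requests

print(f"${cached_cost(2000, 100):.4f} vs ${uncached_cost(2000, 100):.4f}")
# -> $0.0669 vs $0.6000
```

A 2,000-token prefix over 100 requests costs roughly a ninth as much with caching, which matches the roughly 90% savings the examples below work through.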
Cache Pricing for All Claude Models
All prices are per 1 million tokens. Cache write is a one-time cost per cache creation. Cache read applies to every subsequent request within the TTL window.
| Model | Standard Input | Cache Write (+25%) | Cache Read (-90%) | Output |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $18.75 | $1.50 | $75.00 |
| Claude Sonnet 4 | $3.00 | $3.75 | $0.30 | $15.00 |
| Claude Haiku 3.5 | $0.80 | $1.00 | $0.08 | $4.00 |
Standard Input
The base price for processing input tokens without any caching. This is what you pay for every token in a non-cached request.
Cache Write
A one-time premium of 25% over standard input. Paid only when the cache is created or recreated after TTL expiry. For Sonnet 4, that is $3.75 per MTok instead of $3.00.
Cache Read
The discounted price for tokens served from cache: just 10% of the standard input rate. For Sonnet 4, that is $0.30 per MTok instead of $3.00. This is where the savings happen.
Break-Even Analysis
How quickly does prompt caching pay for itself? The answer: almost immediately. The cache write premium is only 25%, but you save 90% on every subsequent read. After just 2 requests, you are already saving money.
The math behind break-even
Consider a 2,000-token system prompt on Sonnet 4. Without caching, each request costs $0.006 for those tokens (2,000 / 1,000,000 x $3.00). With caching:
- Request 1 (cache write): $0.0075 (2K tokens x $3.75/MTok)
- Request 2+ (cache read): $0.0006 (2K tokens x $0.30/MTok)
- Savings per read: $0.0054 (90% off standard rate)
The extra cost of the cache write is $0.0015 (the 25% premium on the first request). Each cache read saves $0.0054 compared to the standard price. So you break even after the very first cache read, and every subsequent request saves $0.0054 in input costs.
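In general, caching breaks even as soon as the cumulative read savings exceed the one-time write premium. A sketch of the break-even count (the 0.25 premium and 0.90 saving are the documented multipliers; note the result is independent of token count and model price, since both cancel out):

```python
import math

def break_even_requests(write_premium: float = 0.25, read_discount: float = 0.90) -> int:
    """Requests needed (including the cache-writing first request) before
    caching is cheaper than sending everything at the standard rate."""
    # Each read saves read_discount x (standard cost); the write cost an
    # extra write_premium x (standard cost). Reads needed to recoup it:
    reads_needed = math.ceil(write_premium / read_discount)
    return 1 + reads_needed

print(break_even_requests())  # 2: the first cache read already repays the premium
```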
| Model | Cacheable Tokens | Standard Cost/Req | Cached Cost/Req | Break-Even | Savings at 100K/mo |
|---|---|---|---|---|---|
| Opus 4 | 2,000 | $0.030 | $0.003 | 2 requests | $2,700 |
| Sonnet 4 | 2,000 | $0.006 | $0.0006 | 2 requests | $540 |
| Haiku 3.5 | 2,000 | $0.0016 | $0.00016 | 2 requests | $144 |
Savings based on 2,000 cacheable tokens per request, 100,000 requests/month, assuming near-100% cache hit rate with steady traffic.
Real Savings Examples
Customer support chatbot (Sonnet 4)
A chatbot with a 3,000-token system prompt handling 50,000 conversations per month. The system prompt is identical for every conversation and is easily cacheable.
- Without caching: $450.00/mo (3K tokens x 50K requests x $3.00/MTok)
- With caching: $45.00/mo (3K tokens x 50K requests x $0.30/MTok)
- Monthly savings: $405 on system prompt input alone
RAG pipeline with shared context (Opus 4)
A research assistant that includes a 10,000-token knowledge base summary as context for every query. With 20,000 queries per month on Opus 4, the context is highly cacheable.
- Without caching: $3,000.00/mo (10K tokens x 20K requests x $15.00/MTok)
- With caching: $300.00/mo (10K tokens x 20K requests x $1.50/MTok)
- Monthly savings: $2,700 on context input costs
Evaluation pipeline with few-shot examples (Haiku 3.5)
A content moderation system that includes 5,000 tokens of labelled examples in every request. Running 200,000 checks per month on Haiku 3.5 for cost efficiency.
- Without caching: $800.00/mo (5K tokens x 200K requests x $0.80/MTok)
- With caching: $80.00/mo (5K tokens x 200K requests x $0.08/MTok)
- Monthly savings: $720 on few-shot example input
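All three scenarios follow the same arithmetic. A sketch that reproduces them, using the standard input rates from the pricing table (like the examples, it ignores the one-time write premium, which is negligible at this volume):

```python
# Standard input price in $/MTok, from the pricing table above.
PRICES = {"opus-4": 15.00, "sonnet-4": 3.00, "haiku-3.5": 0.80}

def monthly_savings(model: str, cached_tokens: int, requests_per_month: int) -> float:
    """Monthly input savings: cached tokens billed at 10% instead of 100%."""
    standard = PRICES[model] / 1_000_000 * cached_tokens * requests_per_month
    return round(standard * 0.90, 2)  # 90% of the standard cost is saved per read

print(monthly_savings("sonnet-4", 3_000, 50_000))    # 405.0
print(monthly_savings("opus-4", 10_000, 20_000))     # 2700.0
print(monthly_savings("haiku-3.5", 5_000, 200_000))  # 720.0
```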
When to Use Caching (and When Not To)
Use prompt caching when:
- You have a system prompt that repeats across many requests. This is the most common and highest-impact use case.
- Your prompts include few-shot examples. If you send 10+ labelled examples in every request, caching them saves massively.
- You run RAG with a shared context prefix. If your retrieval template and base context are identical across queries, cache them.
- Multi-turn conversations with long history. The conversation history prefix can be cached between turns.
- You make more than a few requests every 5 minutes to keep the cache warm.
Skip caching when:
- Every request has unique input with no shared prefix. If nothing repeats, there is nothing to cache.
- Your cacheable content is below the minimum size threshold (1,024 tokens for Sonnet/Opus, 2,048 for Haiku).
- Traffic is extremely bursty with gaps longer than 5 minutes. The cache will expire between bursts, and you will keep paying the write premium.
- You only make a handful of requests total. The 25% write premium is not worth it if you only send 2-3 requests before stopping.
Cache TTL and Minimum Cacheable Size
Time-to-Live (TTL): 5 minutes
Once created, a cache entry lives for 5 minutes. Every time the cache is read (a cache hit), the TTL resets to 5 minutes. This means that for any application with steady traffic -- even just one request every few minutes -- the cache effectively never expires. The cache only drops when there is a gap of more than 5 minutes with no matching requests.
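Whether your traffic keeps the cache warm comes down to the gaps between requests. A sketch that walks a list of request timestamps and counts how many cache writes you would pay for under the 5-minute sliding TTL (the function name is illustrative):

```python
TTL_SECONDS = 5 * 60

def count_cache_writes(request_times: list[float]) -> int:
    """Cache writes incurred for a sorted list of request timestamps (seconds):
    one initial write, plus one re-write after every gap longer than the TTL,
    since each cache hit resets the 5-minute clock."""
    writes = 1
    for prev, curr in zip(request_times, request_times[1:]):
        if curr - prev > TTL_SECONDS:
            writes += 1
    return writes

# One request every 2 minutes keeps the cache warm: only the first write is paid.
print(count_cache_writes([i * 120 for i in range(10)]))  # 1

# Bursts separated by 20+ minute gaps pay a fresh write per burst.
print(count_cache_writes([0, 30, 60, 1500, 1530, 3000]))  # 3
```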
Minimum cacheable size
| Model | Minimum Tokens | Approx. Words |
|---|---|---|
| Claude Opus 4 | 1,024 | ~768 |
| Claude Sonnet 4 | 1,024 | ~768 |
| Claude Haiku 3.5 | 2,048 | ~1,536 |
If your system prompt is shorter than the minimum, you can still reach the threshold by combining it with few-shot examples, role definitions, or structured output format instructions. Many production prompts naturally exceed these minimums once you include formatting rules and constraints.
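A rough pre-flight check against the minimum can use the same words-to-tokens heuristic as the table (about 0.75 words per token for English prose). This is only an estimate; for an exact count, use a real tokenizer. The function name and default ratio are illustrative:

```python
def meets_minimum(prompt: str, min_tokens: int, words_per_token: float = 0.75) -> bool:
    """Rough check: estimate token count from word count and compare it to the
    model's minimum cacheable size (pass the value from the table above)."""
    estimated_tokens = len(prompt.split()) / words_per_token
    return estimated_tokens >= min_tokens

# 1,600 words is roughly 2,133 tokens, clearing a 2,048-token minimum.
print(meets_minimum("word " * 1600, min_tokens=2048))  # True
print(meets_minimum("word " * 100, min_tokens=2048))   # False
```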
Monitoring cache performance
Every API response includes usage counters that tell you exactly how the cache performed. In the Messages API, the usage object on the response reports:

cache_creation_input_tokens: 2048
cache_read_input_tokens: 2048

The cache_creation_input_tokens count shows tokens that were written to cache (you paid the write premium). The cache_read_input_tokens count shows tokens served from cache (you paid only 10%). Track the ratio over time to measure your effective savings.
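A sketch that accumulates these counters across responses and reports the effective input discount. The counter names follow the Messages API usage object; the aggregation class itself is illustrative:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Accumulates per-response usage counters into an effective discount."""
    uncached: int = 0  # usage.input_tokens (billed at the standard rate)
    written: int = 0   # usage.cache_creation_input_tokens (billed at 1.25x)
    read: int = 0      # usage.cache_read_input_tokens (billed at 0.10x)

    def record(self, input_tokens: int, cache_creation: int, cache_read: int) -> None:
        self.uncached += input_tokens
        self.written += cache_creation
        self.read += cache_read

    def effective_discount(self) -> float:
        """Fraction saved on input vs sending every token at the standard rate."""
        total = self.uncached + self.written + self.read
        billed = self.uncached + 1.25 * self.written + 0.10 * self.read
        return 1 - billed / total

# After each call: stats.record(r.usage.input_tokens,
#                               r.usage.cache_creation_input_tokens,
#                               r.usage.cache_read_input_tokens)
```

With a warm cache and a large prefix, the effective discount climbs toward the 90% read saving as the one-time write is amortised.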
How to Implement Prompt Caching
Add the cache_control parameter to any content block you want cached. Here is a Python example using the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful customer support agent...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "How do I reset my password?"}
    ],
)
```

The first call creates the cache and charges the write premium. All subsequent calls with the same system prompt content within the 5-minute TTL will read from cache at 90% off. Output tokens are never affected by caching -- they are always charged at the standard output rate.
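For the multi-turn case mentioned earlier, the same cache_control parameter can be placed on the last content block of the conversation history, so the entire prefix up to that point is cached between turns. A helper sketch (the function name and normalization logic are illustrative, not part of the SDK):

```python
def with_cached_history(history: list[dict]) -> list[dict]:
    """Return a copy of the messages list with cache_control on the final
    content block, so the whole conversation prefix up to it is cached."""
    marked = [dict(m) for m in history]
    content = marked[-1]["content"]
    if isinstance(content, str):
        # Normalize a plain-string message into a content-block list.
        content = [{"type": "text", "text": content}]
    else:
        content = [dict(block) for block in content]
    content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
    marked[-1] = {**marked[-1], "content": content}
    return marked

# Usage: client.messages.create(model="claude-sonnet-4-20250514",
#                               max_tokens=1024,
#                               messages=with_cached_history(history))
```

Re-marking the latest turn on each request moves the cache boundary forward, so each turn reads the previously cached prefix and writes only the new tokens.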