Updated March 2026
Claude Prompt Caching Pricing
Cache repeated input content and pay only 10% of the standard input token price on subsequent requests. Often the single most effective way to reduce Claude API costs at scale.
- Cache write: +25% vs standard
- Cache read: -90% vs standard
- Break-even: 2 requests
What Is Prompt Caching?
Prompt caching is a feature that lets you store frequently used portions of your API input on Anthropic's servers so they do not need to be re-processed on every request. When you send an API call, a significant chunk of the input is often identical across requests: system prompts, few-shot examples, shared context documents, or conversation history prefixes. Without caching, you pay the full input token price for this repeated content every single time.
With caching enabled, the first request processes and stores the cacheable content at a slight premium (25% more than standard input pricing). Every subsequent request that includes the same cached prefix pays only 10% of the standard input price for those tokens. The cache persists for 5 minutes and resets its TTL with each hit, so steady traffic keeps the cache warm indefinitely.
Under the hood, Anthropic stores the computed key-value attention states for the cached prefix. When a new request arrives with the same prefix, the model skips the forward pass for those tokens entirely, reading the pre-computed states from cache. This is why the cost reduction is so dramatic: the compute work is literally not being done.
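The pricing mechanics above reduce to a simple cost model. A quick sketch using the Sonnet 4 rates from the table below (the 1.25x and 0.10x multipliers are the documented write premium and read discount; the function names are illustrative):

```python
# Cost model for prompt caching, using Sonnet 4 pricing ($3.00/MTok input).
STANDARD_INPUT = 3.00 / 1_000_000    # $ per input token
CACHE_WRITE = STANDARD_INPUT * 1.25  # +25% premium on the cache-creating request
CACHE_READ = STANDARD_INPUT * 0.10   # 10% of standard on every cache hit

def cached_cost(prefix_tokens: int, requests: int) -> float:
    """Total input cost for a cached prefix across n requests (cache stays warm)."""
    return prefix_tokens * (CACHE_WRITE + CACHE_READ * (requests - 1))

def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Total input cost for the same prefix with no caching."""
    return prefix_tokens * STANDARD_INPUT * requests

print(f"${cached_cost(2000, 100):.4f} vs ${uncached_cost(2000, 100):.4f}")
# -> $0.0669 vs $0.6000
```

A 2,000-token prefix over 100 requests costs roughly a ninth as much with caching, which matches the roughly 90% savings the examples below work through.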
Cache Pricing for All Claude Models
All prices are per 1 million tokens. Cache write is a one-time cost per cache creation. Cache read applies to every subsequent request within the TTL window.
| Model | Standard Input | Cache Write (+25%) | Cache Read (-90%) | Output |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $18.75 | $1.50 | $75.00 |
| Claude Sonnet 4 | $3.00 | $3.75 | $0.30 | $15.00 |
| Claude Haiku 3.5 | $0.80 | $1.00 | $0.08 | $4.00 |
Standard Input
The base price for processing input tokens without any caching. This is what you pay for every token in a non-cached request.
Cache Write
A one-time premium of 25% over standard input. Paid only when the cache is created or recreated after TTL expiry. For Sonnet 4, that is $3.75 per MTok instead of $3.00.
Cache Read
The discounted price for tokens served from cache: just 10% of the standard input rate. For Sonnet 4, that is $0.30 per MTok instead of $3.00. This is where the savings happen.
Break-Even Analysis
How quickly does prompt caching pay for itself? The answer: almost immediately. The cache write premium is only 25%, but you save 90% on every subsequent read. After just 2 requests, you are already saving money.
The math behind break-even
Consider a 2,000-token system prompt on Sonnet 4. Without caching, each request costs $0.006 for those tokens (2,000 / 1,000,000 x $3.00). With caching:
- Request 1 (cache write): $0.0075 (2K tokens x $3.75/MTok)
- Request 2+ (cache read): $0.0006 (2K tokens x $0.30/MTok)
- Savings per read: $0.0054 (90% off standard rate)
The extra cost of the cache write is $0.0015 (the 25% premium on the first request). Each cache read saves $0.0054 compared to the standard price. So you break even after the very first cache read, and every subsequent request saves $0.0054 in input costs.
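In general, caching breaks even as soon as the cumulative read savings exceed the one-time write premium. A sketch of the break-even count (the 0.25 premium and 0.90 saving are the documented multipliers; note the result is independent of token count and model price, since both cancel out):

```python
import math

def break_even_requests(write_premium: float = 0.25, read_discount: float = 0.90) -> int:
    """Requests needed (including the cache-writing first request) before
    caching is cheaper than sending everything at the standard rate."""
    # Each read saves read_discount x (standard cost); the write cost an
    # extra write_premium x (standard cost). Reads needed to recoup it:
    reads_needed = math.ceil(write_premium / read_discount)
    return 1 + reads_needed

print(break_even_requests())  # 2: the first cache read already repays the premium
```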
| Model | Cacheable Tokens | Standard Cost/Req | Cached Cost/Req | Break-Even | Savings at 100K/mo |
|---|---|---|---|---|---|
| Opus 4 | 2,000 | $0.030 | $0.003 | 2 requests | $2,700 |
| Sonnet 4 | 2,000 | $0.006 | $0.0006 | 2 requests | $540 |
| Haiku 3.5 | 2,000 | $0.0016 | $0.00016 | 2 requests | $144 |
Savings based on 2,000 cacheable tokens per request, 100,000 requests/month, assuming near-100% cache hit rate with steady traffic.
Real Savings Examples
Customer support chatbot (Sonnet 4)
A chatbot with a 3,000-token system prompt handling 50,000 conversations per month. The system prompt is identical for every conversation and is easily cacheable.
- Without caching: $450.00/mo (3K tokens x 50K requests x $3.00/MTok)
- With caching: $45.00/mo (3K tokens x 50K requests x $0.30/MTok)
- Monthly savings: $405 on system prompt input alone
RAG pipeline with shared context (Opus 4)
A research assistant that includes a 10,000-token knowledge base summary as context for every query. With 20,000 queries per month on Opus 4, the context is highly cacheable.
- Without caching: $3,000.00/mo (10K tokens x 20K requests x $15.00/MTok)
- With caching: $300.00/mo (10K tokens x 20K requests x $1.50/MTok)
- Monthly savings: $2,700 on context input costs
Evaluation pipeline with few-shot examples (Haiku 3.5)
A content moderation system that includes 5,000 tokens of labelled examples in every request. Running 200,000 checks per month on Haiku 3.5 for cost efficiency.
- Without caching: $800.00/mo (5K tokens x 200K requests x $0.80/MTok)
- With caching: $80.00/mo (5K tokens x 200K requests x $0.08/MTok)
- Monthly savings: $720 on few-shot example input
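All three scenarios follow the same arithmetic. A sketch that reproduces them, using the standard input rates from the pricing table (like the examples, it ignores the one-time write premium, which is negligible at this volume):

```python
# Standard input price in $/MTok, from the pricing table above.
PRICES = {"opus-4": 15.00, "sonnet-4": 3.00, "haiku-3.5": 0.80}

def monthly_savings(model: str, cached_tokens: int, requests_per_month: int) -> float:
    """Monthly input savings: cached tokens billed at 10% instead of 100%."""
    standard = PRICES[model] / 1_000_000 * cached_tokens * requests_per_month
    return round(standard * 0.90, 2)  # 90% of the standard cost is saved per read

print(monthly_savings("sonnet-4", 3_000, 50_000))    # 405.0
print(monthly_savings("opus-4", 10_000, 20_000))     # 2700.0
print(monthly_savings("haiku-3.5", 5_000, 200_000))  # 720.0
```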
When to Use Caching (and When Not To)
Use prompt caching when:
- You have a system prompt that repeats across many requests. This is the most common and highest-impact use case.
- Your prompts include few-shot examples. If you send 10+ labelled examples in every request, caching them saves massively.
- You run RAG with a shared context prefix. If your retrieval template and base context are identical across queries, cache them.
- Multi-turn conversations with long history. The conversation history prefix can be cached between turns.
- You make more than a few requests every 5 minutes to keep the cache warm.
Skip caching when:
- Every request has unique input with no shared prefix. If nothing repeats, there is nothing to cache.
- Your cacheable content is below the minimum size threshold (1,024 tokens for Sonnet/Opus, 2,048 for Haiku).
- Traffic is extremely bursty with gaps longer than 5 minutes. The cache will expire between bursts, and you will keep paying the write premium.
- You only make a handful of requests total. The 25% write premium is not worth it if you only send 2-3 requests before stopping.
Cache TTL and Minimum Cacheable Size
Time-to-Live (TTL): 5 minutes
Once created, a cache entry lives for 5 minutes. Every time the cache is read (a cache hit), the TTL resets to 5 minutes. This means that for any application with steady traffic -- even just one request every few minutes -- the cache effectively never expires. The cache only drops when there is a gap of more than 5 minutes with no matching requests.
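Whether your traffic keeps the cache warm comes down to the gaps between requests. A sketch that walks a list of request timestamps and counts how many cache writes you would pay for under the 5-minute sliding TTL (the function name is illustrative):

```python
TTL_SECONDS = 5 * 60

def count_cache_writes(request_times: list[float]) -> int:
    """Cache writes incurred for a sorted list of request timestamps (seconds):
    one initial write, plus one re-write after every gap longer than the TTL,
    since each cache hit resets the 5-minute clock."""
    writes = 1
    for prev, curr in zip(request_times, request_times[1:]):
        if curr - prev > TTL_SECONDS:
            writes += 1
    return writes

# One request every 2 minutes keeps the cache warm: only the first write is paid.
print(count_cache_writes([i * 120 for i in range(10)]))  # 1

# Bursts separated by 20+ minute gaps pay a fresh write per burst.
print(count_cache_writes([0, 30, 60, 1500, 1530, 3000]))  # 3
```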
Minimum cacheable size
| Model | Minimum Tokens | Approx. Words |
|---|---|---|
| Claude Opus 4 | 1,024 | ~768 |
| Claude Sonnet 4 | 1,024 | ~768 |
| Claude Haiku 3.5 | 2,048 | ~1,536 |
If your system prompt is shorter than the minimum, you can still reach the threshold by combining it with few-shot examples, role definitions, or structured output format instructions. Many production prompts naturally exceed these minimums once you include formatting rules and constraints.
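A rough pre-flight check against the minimum can use the same words-to-tokens heuristic as the table (about 0.75 words per token for English prose). This is only an estimate; for an exact count, use a real tokenizer. The function name and default ratio are illustrative:

```python
def meets_minimum(prompt: str, min_tokens: int, words_per_token: float = 0.75) -> bool:
    """Rough check: estimate token count from word count and compare it to the
    model's minimum cacheable size (pass the value from the table above)."""
    estimated_tokens = len(prompt.split()) / words_per_token
    return estimated_tokens >= min_tokens

# 1,600 words is roughly 2,133 tokens, clearing a 2,048-token minimum.
print(meets_minimum("word " * 1600, min_tokens=2048))  # True
print(meets_minimum("word " * 100, min_tokens=2048))   # False
```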
Monitoring cache performance
Every API response includes usage counters that tell you exactly how the cache performed. In the Messages API, the usage object on the response reports:

cache_creation_input_tokens: 2048
cache_read_input_tokens: 2048

The cache_creation_input_tokens count shows tokens that were written to cache (you paid the write premium). The cache_read_input_tokens count shows tokens served from cache (you paid only 10%). Track the ratio over time to measure your effective savings.
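A sketch that accumulates these counters across responses and reports the effective input discount. The counter names follow the Messages API usage object; the aggregation class itself is illustrative:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Accumulates per-response usage counters into an effective discount."""
    uncached: int = 0  # usage.input_tokens (billed at the standard rate)
    written: int = 0   # usage.cache_creation_input_tokens (billed at 1.25x)
    read: int = 0      # usage.cache_read_input_tokens (billed at 0.10x)

    def record(self, input_tokens: int, cache_creation: int, cache_read: int) -> None:
        self.uncached += input_tokens
        self.written += cache_creation
        self.read += cache_read

    def effective_discount(self) -> float:
        """Fraction saved on input vs sending every token at the standard rate."""
        total = self.uncached + self.written + self.read
        billed = self.uncached + 1.25 * self.written + 0.10 * self.read
        return 1 - billed / total

# After each call: stats.record(r.usage.input_tokens,
#                               r.usage.cache_creation_input_tokens,
#                               r.usage.cache_read_input_tokens)
```

With a warm cache and a large prefix, the effective discount climbs toward the 90% read saving as the one-time write is amortised.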
How to Implement Prompt Caching
Add the cache_control parameter to any content block you want cached. Here is a Python example using the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful customer support agent...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "How do I reset my password?"}
    ],
)
```

The first call creates the cache and charges the write premium. All subsequent calls with the same system prompt content within the 5-minute TTL will read from cache at 90% off. Output tokens are never affected by caching -- they are always charged at the standard output rate.
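For the multi-turn case mentioned earlier, the same cache_control parameter can be placed on the last content block of the conversation history, so the entire prefix up to that point is cached between turns. A helper sketch (the function name and normalization logic are illustrative, not part of the SDK):

```python
def with_cached_history(history: list[dict]) -> list[dict]:
    """Return a copy of the messages list with cache_control on the final
    content block, so the whole conversation prefix up to it is cached."""
    marked = [dict(m) for m in history]
    content = marked[-1]["content"]
    if isinstance(content, str):
        # Normalize a plain-string message into a content-block list.
        content = [{"type": "text", "text": content}]
    else:
        content = [dict(block) for block in content]
    content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
    marked[-1] = {**marked[-1], "content": content}
    return marked

# Usage: client.messages.create(model="claude-sonnet-4-20250514",
#                               max_tokens=1024,
#                               messages=with_cached_history(history))
```

Re-marking the latest turn on each request moves the cache boundary forward, so each turn reads the previously cached prefix and writes only the new tokens.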