Prompt Caching Discount Estimator
Calculate exactly how much you save by caching your system prompt with Claude, GPT-4o, or Gemini. See your break-even point and monthly savings in seconds.
Anthropic offers 90% off cache reads · OpenAI offers 50% off · Google offers 75% off. Prices from official provider pages, May 2025.
Estimate Your Prompt Caching Savings
Cache write: $3.75/M · Cache read: $0.30/M
Quick guide: 1,000 words ≈ 1,333 tokens · 10-page PDF ≈ 10,000 tokens · 100-page codebase ≈ 80,000 tokens
Want your AI marketing system set up in a week?
MarketingAI builds and hands over three coordinated, AI-assisted marketing systems — content engine, outbound lead sequence, and email nurture — configured to your business. Done in under a week. You own it permanently.
Get your marketing system →
How to Use This Calculator
1. Select your model
Choose the LLM provider and model you currently use. Caching discounts vary significantly: Anthropic offers 90% off cache reads, OpenAI offers 50%, Google offers 75%.
2. Enter your system prompt length
Input the number of tokens in your system prompt, or use the character estimator (1 token ≈ 4 characters). A 1,000-word prompt is roughly 1,333 tokens; a 10-page PDF is roughly 10,000 tokens.
3. Enter your daily query volume
How many API requests does your application make per day? Higher query volumes produce larger absolute savings from caching.
4. Review your monthly savings
The estimator shows your monthly cost without caching, with caching, and the break-even point — the number of queries at which caching pays for itself.
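Under the hood, the estimate is simple arithmetic. The sketch below is illustrative Python rather than the calculator's actual code; it assumes one cache write per day keeps the prefix warm, applies a flat cache-read discount, and uses a 30-day month.

```python
import math

def caching_savings(prompt_tokens, queries_per_day, input_price_per_m,
                    write_multiplier, read_multiplier, days=30):
    """Rough monthly cost comparison for a cached system prompt.

    write_multiplier and read_multiplier are fractions of the standard
    input rate, e.g. 1.25 and 0.10 for Anthropic, 1.0 and 0.50 for OpenAI.
    """
    price_per_token = input_price_per_m / 1_000_000

    without_caching = prompt_tokens * queries_per_day * days * price_per_token

    # Assumption: one cache write per day keeps the prefix warm and every
    # other request that day is a cache read.
    writes = prompt_tokens * days * write_multiplier * price_per_token
    reads = prompt_tokens * (queries_per_day - 1) * days * read_multiplier * price_per_token
    with_caching = writes + reads

    # Break-even: the number of queries needed to recover the write premium.
    break_even = math.ceil((write_multiplier - read_multiplier) / (1 - read_multiplier))

    return without_caching, with_caching, break_even

# Example: 10,000-token prompt, 1,000 queries/day, Claude Sonnet-style rates
# ($3/M input, 1.25x cache writes, 0.10x cache reads).
full, cached, break_even = caching_savings(10_000, 1_000, 3.00, 1.25, 0.10)
print(f"${full:,.0f}/month uncached, ${cached:,.0f}/month cached, "
      f"break-even after {break_even} queries")
```

For the example inputs this prints roughly $900/month uncached versus about $91/month cached, with break-even reached on the second query.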
Why Prompt Caching Matters for AI Developers
If you're running an LLM-powered application at scale, your system prompt is almost certainly your largest and most repetitive cost driver. Every time your application calls the API — for a user question, a document classification, a support ticket response — you're sending the same thousand-token system prompt over and over again. Each repetition is billed at full input token rates.
Prompt caching solves this by letting you store that static prefix on the provider's infrastructure. Subsequent calls that reuse the same prefix are charged at the cache read rate instead: 90% cheaper for Anthropic, 50% cheaper for OpenAI, 75% cheaper for Google. For applications making thousands of daily calls to the same large system prompt, the monthly savings can be substantial.
Anthropic Claude: Explicit Cache Control
Anthropic's implementation requires you to explicitly mark blocks for caching using a cache_control parameter in the Messages API. Cache writes cost 25% more than standard input rates (to cover the overhead of writing to the cache), but cache reads cost only 10% of the input rate — a 90% discount. The cache has a 5-minute TTL that refreshes each time the cached prefix is read, so consistent traffic keeps it warm indefinitely. For production applications with steady traffic, the write cost is negligible compared to the read savings.
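As a rough sketch with the Anthropic Python SDK (the model string and the LONG_SYSTEM_PROMPT placeholder are illustrative, not prescriptive), marking the system prompt for caching looks like this:

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

LONG_SYSTEM_PROMPT = "..."  # your large, static instructions / reference material

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as a cache breakpoint: written to cache on the
            # first call, read from cache on later calls within the TTL.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise this support ticket: ..."}],
)

# The usage object breaks out cache writes and cache reads separately.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```

The same cache_control marker can also be placed on tool definitions or long document blocks in the message list, not only the system prompt.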
OpenAI GPT-4o: Automatic Caching
OpenAI caches automatically — there's no API parameter to set. Any prompt prefix of 1,024 tokens or more that has appeared in recent requests is automatically cached, and cached tokens are billed at 50% of standard input rates. The lack of an explicit write cost means OpenAI caching is always profitable from the very first repeated request. The trade-off is less control: you can't force a cache write or pin content in the cache, although each response's usage object does report how many prompt tokens were served from cache.
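Because there is nothing to configure, the main thing to get right is structure: keep the static content at the very start of the prompt and keep it byte-for-byte identical across requests so the prefix matches. A minimal sketch with the OpenAI Python SDK (LONG_SYSTEM_PROMPT is a placeholder), including how to read the cached-token count back from the usage object:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

LONG_SYSTEM_PROMPT = "..."  # static instructions, docs, tool descriptions, etc.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # The static prefix goes first and must be identical across calls
        # for the automatic prefix cache to match it.
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "Classify this support ticket: ..."},
    ],
)

usage = response.usage
print(f"{usage.prompt_tokens_details.cached_tokens} of "
      f"{usage.prompt_tokens} prompt tokens were served from cache")
```

On the first request (or after the cache has expired) the cached-token count will be 0; it only becomes non-zero once the same prefix has been seen recently.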
Google Gemini: Context Caching with Storage Fees
Google's approach is more explicit than OpenAI's and adds a storage dimension: you create a cached content resource with a specified TTL, and you pay an hourly storage fee per million cached tokens for as long as it exists. Cache reads are billed at 75% off input rates. For workloads with consistent traffic throughout the day, the storage cost is easily offset by the input savings. For sporadic workloads, evaluate whether storage fees erode the discount.
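A rough sketch with the google-generativeai Python SDK (the model string, TTL, and LONG_SYSTEM_PROMPT are illustrative): the cache is created as its own resource, then reused through a model handle built from it.

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")  # or set GOOGLE_API_KEY

LONG_SYSTEM_PROMPT = "..."  # large, static context (must exceed the caching minimum)

# Create the cached content resource with an explicit TTL. Storage is billed
# per million cached tokens per hour for as long as the cache exists.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    system_instruction=LONG_SYSTEM_PROMPT,
    ttl=datetime.timedelta(hours=1),
)

# Requests made through this handle reuse the cached tokens at the
# discounted read rate.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarise section 3 of the attached spec.")
print(response.usage_metadata.cached_content_token_count)
```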
When to Use Prompt Caching
Caching delivers the highest ROI when your system prompt is long (thousands of tokens), your query volume is high (hundreds or thousands per day), and your prompt content is mostly static. Common use cases include: RAG systems where the same large document context is queried repeatedly, customer support bots with detailed product knowledge bases, code review tools that embed an entire codebase, and multi-step agent workflows that reuse the same tool definitions across many turns.
Caching is less effective for one-off queries, personalised prompts that change per user, or very short system prompts below the provider minimum (typically 1,024 tokens).
Frequently Asked Questions
What is prompt caching in LLM APIs?
Prompt caching is a feature offered by Anthropic (Claude), OpenAI (GPT-4o), and Google (Gemini) that lets you store a large, reused portion of your prompt — typically a system prompt, a document, or a codebase — on the provider's servers. On subsequent requests that reuse the same prefix, the provider reads from cache instead of re-processing the full token count. Cache reads are dramatically cheaper: 90% off for Anthropic, 50% off for OpenAI, and 75% off for Google.
How much does prompt caching actually save?
The savings depend on how large your system prompt is relative to your query volume. A 10,000-token system prompt sent to Claude Sonnet 1,000 times per day works out to 300 million input tokens a month, or about $900/month at the standard $3/M input rate. With caching, the same setup costs roughly $90/month in cache reads — a saving of around $810/month, or 90%. The larger your prompt and the higher your query volume, the bigger the absolute saving. This calculator shows you the exact numbers for your specific setup.
How does Anthropic prompt caching work?
With Anthropic's explicit caching, you mark a section of your prompt with a cache_control parameter (type: 'ephemeral'). The first request that includes this prompt section pays the cache write rate — 125% of the standard input rate. All subsequent requests within the cache TTL (5 minutes by default) pay the cache read rate — just 10% of the standard input rate. The TTL refreshes each time the cached block is accessed, so regular traffic keeps the cache warm. Break-even comes almost immediately: one write at 125% plus one read at 10% costs 135% of the standard rate, versus 200% for two uncached requests, so caching pays for itself by the second request that reuses the prefix.
How does OpenAI prompt caching work?
OpenAI implements automatic prompt caching for supported models (GPT-4o, GPT-4o mini, o1, o3). You don't need to explicitly mark sections for caching. Any prompt prefix of 1,024 or more tokens that has been sent in a recent request is automatically cached. Cached tokens are charged at 50% of the standard input rate. There is no explicit write cost — caching happens transparently. Cached prefixes typically expire after a few minutes of inactivity, though they can persist for up to around an hour during off-peak periods.
How does Google context caching work?
Google's context caching is available for Gemini models. Unlike Anthropic or OpenAI, Google requires you to explicitly create a cached content resource via the API, specifying the content to cache and a TTL. Cached content storage costs $1.00 per million tokens per hour for Gemini 1.5 Pro, or $0.25 per million tokens per hour for Gemini 1.5 Flash. Cache read requests get a 75% discount on input token pricing. Google context caching is most cost-effective when you query the same cached content many times per hour.
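One way to sanity-check the storage fee is to work out how many queries per hour you need before the read discount on the cached prefix outweighs the hourly storage charge. A back-of-envelope sketch: the $1.00/M/hour storage figure is the Gemini 1.5 Pro rate quoted above, the $1.25/M input price is an assumption for illustration, and the one-off cost of creating the cache is ignored.

```python
def min_queries_per_hour(input_price_per_m, storage_per_m_per_hour, read_discount=0.75):
    """Queries per hour needed for cache-read savings to cover the storage fee.

    Both the savings and the fee scale linearly with the number of cached
    tokens, so the token count cancels out of the break-even condition:
        queries/hour * input_price * read_discount > storage_price
    """
    return storage_per_m_per_hour / (input_price_per_m * read_discount)

# Storage at $1.00/M tokens/hour (figure quoted above) and an assumed
# $1.25/M input price: just over one query per hour breaks even.
print(round(min_queries_per_hour(1.25, 1.00), 2))  # ~1.07
```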
What should I put in my cached prompt?
Cache your largest, most static content: (1) System instructions and persona definitions, (2) Long documents, PDFs, or reference material your app processes repeatedly, (3) Code files or codebases for code review/generation tasks, (4) Product catalogues, knowledge bases, or FAQ documents, (5) Tool definitions for function-calling setups. Do NOT cache user-specific data, the conversation history, or anything that changes per request — those must remain in the non-cached portion of the prompt.
What is the minimum prompt size for caching to work?
Each provider has a minimum token requirement before caching applies. Anthropic requires at least 1,024 tokens in a cached block (Claude Sonnet/Opus) or 2,048 tokens (Claude Haiku). OpenAI automatically caches prefixes of 1,024 tokens or more. Google's minimum cached content is 4,096 tokens. For small system prompts under these thresholds, caching won't help — focus on reducing prompt length instead.
Does prompt caching affect response quality?
No. The provider stores the already-processed state of the cached prefix and reuses it, so the model attends to the full cached context exactly as if it had been sent fresh — the cache is a billing (and latency) optimisation, not a shortcut in the model's reasoning. Response quality and accuracy are unaffected by whether the prompt prefix was served from cache or re-processed from scratch; if anything, cache hits tend to reduce time to first token because the prefix doesn't need to be processed again.
Need an AI-assisted marketing system for your business?
MarketingAI builds done-with-you marketing systems for Australian small businesses — content, email, and lead generation, configured to your offer and delivered in under a week. One-time setup, owned by you permanently.
Learn About MarketingAI →
Related Calculators
AI Model Router Savings Calculator
See how much you save routing easy queries to cheaper models.
Marketing ROI Calculator
Calculate the return on your marketing investment.
Ad Spend Calculator
Project clicks, leads, and revenue from paid ads.
Website Speed Impact Calculator
See how page speed affects conversions and revenue.