TECHNICAL DEEP DIVE

Prompt Caching Explained: How to Cut API Costs with Prefix Caching

By Institute of AI PM·12 min read·May 20, 2026

TL;DR

Prompt caching is an API-level feature offered by Anthropic, OpenAI, and Google that lets you reuse the computed KV state of a large prompt prefix across multiple API calls. Instead of paying full input-token cost every time you send the same 10,000-token system prompt, the provider charges a fraction on cache hits. For products with large, stable system prompts — or workflows that repeatedly process the same documents — prompt caching can cut input costs by 60–90% and reduce latency by up to 85%. It is not the same as response caching (storing model outputs) or KV cache (the internal memory mechanism inside a transformer). This article explains how it works, when to use it, and how to structure your prompts to maximize cache hit rate.

What Prompt Caching Actually Is (and Isn't)

When an LLM processes a prompt, it converts every input token into a key-value (KV) representation that gets stored in memory and referenced during generation. This computation — called the prefill phase — is expensive. For a 10,000-token system prompt, you're paying the full prefill cost on every single API call, even if the first 9,800 tokens are identical across all calls.

Prompt caching (also called prefix caching or context caching) solves this by storing the computed KV state of a stable prompt prefix on the provider's infrastructure. On subsequent calls that share that prefix, the provider skips the prefill computation and serves the cached state directly — charging a fraction of the normal input-token cost.

1

Prompt caching

An API feature. You mark a prompt prefix as cacheable; the provider stores its computed state. Subsequent calls that match the prefix get a cache hit and pay reduced rates. Controlled at the API client level.

2

Response caching

Your application stores the model's output for a given input and returns that stored output without hitting the API again. Useful for FAQ-style queries. Completely bypasses the LLM on a hit. Not what this article is about.

3

KV cache (internal)

The in-memory cache inside the transformer that stores key-value pairs during a single inference call. Enables autoregressive generation. You never interact with this directly — it's an implementation detail inside the model server.

The key distinction: response caching reuses outputs. Prompt caching reuses the computation of inputs. You still get a fresh generation on every call — the model is not returning a stale answer. You're just not re-paying to process the same context.

Provider Support and Pricing in 2026

All three major API providers now support some form of prompt caching, though the mechanics and pricing differ meaningfully. Understanding these differences matters for cost modeling.

Anthropic (Claude)

Cache writes charged at 125% of base input price. Cache reads charged at 10% of base input price — a 90% discount. Cache TTL is 5 minutes by default, extendable. You mark cache breakpoints explicitly with cache_control: {type: 'ephemeral'} in the API request. Minimum cacheable prefix is 1,024 tokens.

OpenAI (GPT-4o, o-series)

Prompt caching is automatic — no explicit markup required. Cache hits cost 50% of standard input price. Cache TTL is ~1 hour. Works for prompts longer than 1,024 tokens. You can see cache hit metrics in API response headers.

Google (Gemini)

Called 'context caching.' Available via the Gemini API for prompts exceeding 32,768 tokens. Cache write charged at standard input price plus a storage fee per hour. Cache reads cost roughly 25% of standard input price. TTL is configurable from 1 hour to 1 month.

For most production applications using Anthropic, the economics are compelling: if you're making 1,000 API calls per hour with a 10,000-token system prompt, and 90% of calls are cache hits, you're paying for roughly 1,000 + 1,000,000 × 10% = 101,000 effective input tokens instead of 10,000,000. That's a ~10x reduction on input costs alone.

When Prompt Caching Saves the Most Money

Prompt caching is not universally useful. It shines in specific patterns. If your prompts are short or always unique, you won't see meaningful savings. Here are the five use cases where it delivers the most value.

Large, stable system prompts

When it applies: Your system prompt contains detailed instructions, personas, output formats, or examples — and it stays the same across calls for a given user session or product feature.

Cost saving: High. A 5,000-token system prompt with 90% cache hit rate reduces that portion of input cost by 9x.

Document Q&A and analysis

When it applies: A user uploads a PDF or long document, then asks multiple questions about it. The document sits at the front of the context; only the question changes.

Cost saving: Very high. Document tokens dominate the input cost. Caching the document after the first question makes follow-up questions nearly free.

Multi-turn conversations with long history

When it applies: Your chat product passes the full conversation history on every API call. The prefix (earlier turns) is stable; only the latest user message is new.

Cost saving: Moderate to high, depending on conversation length and turn rate.

Few-shot prompting with many examples

When it applies: You inject 20–50 examples into the prompt for in-context learning or style consistency. These examples are identical across calls.

Cost saving: High. Few-shot examples can easily run 3,000–8,000 tokens. Caching them eliminates repeated prefill cost.

Batch processing of the same corpus

When it applies: You're running the same large document or codebase through multiple analysis passes — summarization, extraction, classification.

Cost saving: Very high for Gemini-style context caching where the TTL can be hours or days.

Master AI Infrastructure Decisions

The AI PM Masterclass covers cost modeling, provider selection, and the technical decisions that separate efficient AI products from expensive ones — taught by a Salesforce Sr. Director PM.

How to Structure Prompts for Maximum Cache Hit Rate

Cache hits require that the beginning of your prompt is byte-for-byte identical to a previously cached prefix. Small variations — even a single character change early in the prompt — invalidate the cache. Prompt structure is therefore a caching concern, not just a quality concern.

Put static content first

System prompt, persona, instructions, examples, and documents should come before dynamic content (user input, current date, session-specific variables). Any variation in early tokens cascades and kills the cache hit.

Never inject dates or timestamps in the static prefix

A single 'Today is May 20, 2026' at line 1 of your system prompt means every call is a cache miss. Move time-sensitive content to the end of the prompt, after the cacheable prefix.

Normalize user inputs before they enter the prompt

Trim whitespace, normalize unicode, and standardize formatting on any content that comes before the cache breakpoint. Invisible character differences break the byte-match.

Use explicit cache breakpoints (Anthropic)

Mark the end of your stable prefix with cache_control. Everything before the breakpoint is cached; everything after is processed fresh. You can have multiple breakpoints for layered caching (system prompt cached separately from a large document).

Monitor cache hit rate in production

Both Anthropic and OpenAI surface cache hit metrics in API responses. Track cache_read_input_tokens vs cache_creation_input_tokens. A hit rate below 60% on a use case that should cache well signals a prompt structure problem.

Cost Modeling for AI PMs: Prompt Caching in the Budget

When building the cost model for an AI feature, prompt caching changes the math in two places: input token cost and latency (which affects compute time if you're self-hosting, and user experience if latency drives churn). Here's how to factor it in.

Estimate your cache-eligible token count

Not all input tokens are cacheable — only the stable prefix. Measure your average system prompt + document + few-shot example length. If that's 8,000 of your 9,000 average input tokens, 89% of your input cost is cacheable.

Estimate realistic hit rate

Cache TTL and traffic patterns determine hit rate. A product with 10 calls per minute and a 5-minute TTL will have most calls hit cache. A product with 1 call per hour against a 5-minute TTL will have ~0% hits. Match TTL to your traffic pattern.

Account for cache write cost

Anthropic charges 125% of standard input price to write to cache. Every cache miss is slightly more expensive than a non-cached call. In your model, expected cost = (cache_write_cost × miss_rate + cache_read_cost × hit_rate) × cacheable_tokens.

Latency reduction has secondary value

Cache hits are 40–85% faster than full prefill. For products where response latency correlates with user satisfaction (chat, real-time tools), faster cache hits can meaningfully improve retention — a benefit that doesn't show up in the cost model but matters for product KPIs.

Benchmark break-even hit rate

For Anthropic: with 90% cache read discount and 25% cache write premium, you need roughly 12% hit rate to break even vs no caching. Above that, you save money. Most production products with stable system prompts clear this easily.

Re-run cost models when prompts change

A prompt refactor that adds an injection point early in the system prompt can tank your cache hit rate to near zero. Treat prompt engineering as an infrastructure decision — changes should trigger a cost model review.

What to Tell Your Engineering Team (and Stakeholders)

As a PM, you likely won't implement prompt caching yourself — but you need to know when to ask for it and how to frame the business case.

When to ask for prompt caching

During roadmap planning: any AI feature with a large, static system prompt (over 2,000 tokens) and more than a few calls per minute should have caching in scope. Flag it at the architecture review stage, not after launch — retrofitting is painful because it requires restructuring prompts.

How to frame it for finance

Prompt caching is not a backend optimization — it's a cost driver that belongs in your unit economics model. Frame it as: "For every $1 of AI API spend, we can reduce input token cost by X with caching, extending our runway by Y weeks at current growth rate."

When caching won't help

If prompts are short (<1,024 tokens), highly dynamic (each call is unique), or your product has very low call volume (cache expires before reuse), caching adds implementation complexity with minimal return. Pick your battles.

The PM's prompt caching checklist

  • ✓ Identify all features with system prompts over 2,000 tokens
  • ✓ Confirm prompt structure places stable content before dynamic content
  • ✓ Verify no time-varying injections appear in the cacheable prefix
  • ✓ Add cache hit rate to your AI cost dashboard
  • ✓ Set a cache hit rate target (aim for 70%+ for high-volume features)
  • ✓ Include caching savings in your unit economics model for AI investor updates

Build AI Products That Stay Profitable at Scale

The AI PM Masterclass covers cost modeling, API economics, and technical decisions that separate products that scale from ones that don't — taught live by a former Apple Group PM.