KV Cache Explained: Why Long Conversations Get Expensive Fast
TL;DR
The KV cache is the GPU memory that stores attention Key and Value tensors for every token already in context, so the model doesn't have to recompute them for every new token. Without it, generating token 1,000 would cost roughly 1,000x more than token 1. With it, decode is cheap but memory-bound. Inference splits into two regimes: prefill (compute-bound, processes the prompt in parallel) and decode (memory-bound, generates one token at a time). Anthropic, OpenAI, and Google all expose the KV cache as a billable feature through prompt caching, with discounts of roughly 50-90% on cached input tokens and TTLs from a few minutes to about an hour. PMs who design products around the cache (stable system prompts, append-only conversations, prefix-shared retrieval) can cut inference costs 4-10x without sacrificing quality.
What the KV Cache Actually Is
In a transformer, each layer computes Key (K) and Value (V) tensors for every token in the context. Under causal attention these depend only on the token and the tokens before it, never on anything that comes later, so once a token has been "seen," its K and V are fixed. They can be cached and reused for every subsequent decoding step. That's the KV cache.
Why caching is correct
Decoder-only LLMs use causal attention: token N can attend to tokens 1..N (itself and everything before it), but never to later tokens. So K and V for older tokens are immutable; recomputing them produces the same result. Caching is purely an optimization, not a quality tradeoff.
What gets cached
Per token, per layer, per head: the K and V vectors. For Llama 3 70B (80 layers, 8 KV heads, 128 dim, FP16): ~320 KB per token. A 100K-token context: ~32 GB of cache. This is why long-context inference is GPU-memory-bound.
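The arithmetic is just 2 (K and V) × layers × KV heads × head dim × bytes per value. A minimal sketch, using the Llama 3 70B shape from the paragraph above and assuming FP16 (2-byte) storage:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # 2x because both K and V are stored for every layer and every KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{per_token / 1024:.0f} KB per token")                 # ~320 KB
print(f"{per_token * 100_000 / 1e9:.1f} GB for 100K tokens")  # ~32 GB
```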
What is NOT cached
The Query (Q) for the current token has to be computed every step (it depends on the new token), and the feed-forward layers run fresh every step. Most decode time goes to streaming the cache and model weights out of HBM; the new Q computation and the FFN are small by comparison.
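A toy NumPy sketch of one decode step makes the split concrete. This is not a real model (random weights, one head, no FFN, no batching); the point is which tensors are recomputed each step and which are read back from the cache:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []          # grows by one entry per token seen

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """x_new: hidden state of the newest token, shape (d,)."""
    q = x_new @ W_q                        # Q: recomputed fresh every step
    k_cache.append(x_new @ W_k)            # K/V for this token: computed once...
    v_cache.append(x_new @ W_v)            # ...then reused on all later steps
    K, V = np.stack(k_cache), np.stack(v_cache)   # read the whole cache back
    scores = K @ q / np.sqrt(d)            # attend over everything seen so far
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # attention output for the new token

for _ in range(5):                         # later steps reuse all cached K/V
    out = decode_step(rng.standard_normal(d))
```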
Cache lifetime is per-request, by default
Without explicit prompt caching, the KV cache exists for the duration of one streamed generation. As soon as the response finishes, it's gone. The next request rebuilds it from scratch — even if the prompt is identical.
Prompt caching extends lifetime across requests
Anthropic, OpenAI, Google, and DeepSeek all expose APIs to keep cache entries alive across requests, from roughly 5 minutes to an hour or more, indexed by prompt prefix hash. Cached tokens cost 10% (Anthropic) to 50% (OpenAI standard tier) of normal input pricing.
Prefill vs Decode: Two Different Cost Curves
Inference has two phases with completely different performance characteristics. Confusing them is the most common cause of wrong cost estimates in PM specs.
Prefill: process the prompt
All input tokens are processed in parallel through every layer. Compute-bound — the GPU is doing dense matrix multiplications at near-peak FLOPs. Time grows roughly linearly with input length up to a saturation point. Builds the initial KV cache.
Decode: generate output tokens
Tokens are produced one at a time. Each step reads the entire KV cache from HBM, computes Q for one new token, and emits one output token. Memory-bandwidth-bound — bottleneck is moving the KV cache, not the math itself. ~30-150 tokens/second on modern GPUs.
Why output tokens cost 3-5x more than input
Prefill batches efficiently across many requests. Decode batches poorly because every request sits at a different cache size, so per token it uses the GPU far less efficiently. Provider pricing reflects this; see the GPT-4o, Claude, and Gemini API pricing tables.
Implications for streaming UX
Time-to-first-token (TTFT) is bounded by prefill latency, which grows with prompt length. Tokens-per-second after that is bounded by decode bandwidth. If your UX needs <200ms TTFT, your prompt has to be short, or you need prompt caching so prefill only runs on the uncached tail.
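A back-of-envelope model of both regimes, under loud assumptions: the standard ~2 FLOPs per parameter per token prefill estimate, a dense 70B model in FP16, H100-class peak compute and bandwidth, a single GPU, no batching, and the ~320 KB/token cache figure from earlier. Real deployments shard and batch, so treat the shape of the numbers as the takeaway, not the absolute values:

```python
def estimate(prompt_tokens: int,
             params_b: float = 70,        # model size, billions of parameters
             peak_tflops: float = 1000,   # assumed FP16 peak compute (H100-class)
             hbm_tb_s: float = 3.35,      # assumed HBM bandwidth, TB/s
             kv_bytes_per_token: int = 320 * 1024) -> tuple[float, float]:
    weight_bytes = params_b * 1e9 * 2                    # FP16 weights
    # Prefill: ~2 FLOPs per parameter per input token, compute-bound.
    ttft_s = 2 * params_b * 1e9 * prompt_tokens / (peak_tflops * 1e12)
    # Decode: every step re-reads the weights plus the entire KV cache from HBM.
    bytes_per_step = weight_bytes + prompt_tokens * kv_bytes_per_token
    tok_per_s = hbm_tb_s * 1e12 / bytes_per_step
    return ttft_s, tok_per_s

for n in (1_000, 10_000, 100_000):
    ttft, tps = estimate(n)
    print(f"{n:>7} prompt tokens: TTFT ~{ttft:.2f}s, decode ~{tps:.0f} tok/s")
```

TTFT scales with prompt length; decode speed barely moves until the cache itself becomes a large share of the memory traffic. Prompt caching attacks the first number, not the second.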
Prompt Caching: Anthropic vs OpenAI vs Google
Every major provider now exposes prompt caching, but the pricing models and ergonomics differ enough to matter for architecture decisions. As of mid-2026:
Anthropic Claude (explicit cache_control)
How it works: You mark up to 4 cache breakpoints in the prompt with cache_control. Cached input is 10% of base price (90% discount). Cache write is 1.25x base (you pay 25% premium to populate). Default TTL 5 minutes; 1-hour tier available at 2x base for write.
PM Implication: Best for retrieval-heavy systems with stable context blocks (system prompts + tool definitions + RAG document sets). The 90% discount means amortizing a 50K-token context over 10 turns is essentially free.
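A minimal sketch of the explicit breakpoint with the Python SDK. The model name and the stable-block contents are placeholders, and the usage field names should be checked against the current docs; note also that blocks below the provider's minimum cacheable length (on the order of 1K tokens) won't be cached:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

# Stable block: system instructions, tool docs, RAG context. Contents illustrative.
STABLE_CONTEXT = "You are Acme's support agent.\n\n<policies>refunds, shipping, SLAs</policies>"

response = client.messages.create(
    model="claude-sonnet-4-5",            # substitute the model you actually use
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            "cache_control": {"type": "ephemeral"},   # cache breakpoint ends here
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],
)

# Writes are billed at ~1.25x base on the first call; reads at ~0.1x on later calls.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```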
OpenAI (automatic prefix caching)
How it works: Caching is automatic for prompts >1024 tokens with shared prefixes. No API changes required. Cached tokens cost 50% of base price on standard tier, ~25% on Realtime API. TTL is provider-managed (typically 5-10 minutes).
PM Implication: Cheaper to adopt (zero code changes) but smaller discount and less control. Stable prefix matters: prepend everything that doesn't change to the start of the prompt.
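Since the caching itself is automatic, the integration work is just keeping the stable content at the front and watching the cached-token count the API reports back. A sketch with the Python SDK; the model name and prompt contents are placeholders, and the usage field name (prompt_tokens_details.cached_tokens) should be verified against your API version:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

STABLE_PREFIX = "System instructions, tool definitions, shared product context."  # never changes

resp = client.chat.completions.create(
    model="gpt-4o",                       # substitute the model you actually use
    messages=[
        {"role": "system", "content": STABLE_PREFIX},               # shared prefix first
        {"role": "user", "content": "What's our refund policy?"},   # dynamic part last
    ],
)

# With >1024 identical prefix tokens inside the TTL, the reuse shows up here.
print(resp.usage.prompt_tokens_details.cached_tokens)
```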
Google Gemini (context caching)
How it works: Explicit API: createCachedContent uploads a context, returns a handle, you reference it on each call. Storage cost (per token, per hour) PLUS reduced compute cost on use. TTL configurable up to a day.
PM Implication: Best when one large context is reused thousands of times (e.g., embed a 1M-token codebase, query it for hours). Worse than Anthropic/OpenAI for ad-hoc shared prefixes because of the storage fee.
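The decision usually comes down to whether the hourly storage fee is smaller than what the reuse discount saves. A back-of-envelope check, with every price left as a parameter so you plug in the current rate card rather than numbers asserted here:

```python
def caching_saves_money(context_tokens: int,
                        calls_per_hour: float,
                        input_price_per_mtok: float,       # normal input price per 1M tokens
                        cached_price_per_mtok: float,      # discounted price when served from cache
                        storage_price_per_mtok_hour: float) -> bool:
    mtok = context_tokens / 1e6
    without_cache = calls_per_hour * mtok * input_price_per_mtok
    with_cache = (calls_per_hour * mtok * cached_price_per_mtok
                  + mtok * storage_price_per_mtok_hour)    # flat hourly storage fee
    return with_cache < without_cache
```

A 1M-token codebase queried hundreds of times an hour clears the storage fee easily; a context touched a handful of times probably doesn't.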
Cut Inference Costs With the Right Architecture
The AI PM Masterclass walks through real cost-optimization decisions — including the KV cache patterns top teams use to ship cheaper products. Live, taught by a Salesforce Sr. Director PM.
Designing Products Around the KV Cache
Cache hit rate is one of the highest-leverage metrics in production AI systems. A 5-minute design review can move a feature from 10% cache hit to 90% — and turn a money-losing product into a profitable one.
Stabilize the prompt prefix
Order: system prompt → tool definitions → static context → user message. Anything dynamic at the front kills the cache for everything after it. Replace timestamps with placeholders, normalize casing, freeze tool order; a sketch of this assembly follows below. One PM team cut its Claude bill 7x just by reordering the prompt.
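One way to enforce that ordering in code: build every request from a frozen prefix plus a dynamic tail. The names and contents below are illustrative, not tied to any particular SDK:

```python
SYSTEM_PROMPT = "You are the support assistant for Acme."
TOOL_DEFINITIONS = "search_orders(query); issue_refund(order_id)"
STATIC_CONTEXT = "Refund policy, shipping policy, escalation rules."

# Frozen prefix: identical bytes on every request, so it always hits the cache.
STATIC_PREFIX = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "system", "content": TOOL_DEFINITIONS},   # frozen order, frozen casing
    {"role": "system", "content": STATIC_CONTEXT},
]

def build_messages(user_message: str) -> list[dict]:
    # Anything request-specific (user text, timestamps, per-request retrieval)
    # goes after the shared prefix, never inside it.
    return STATIC_PREFIX + [{"role": "user", "content": user_message}]
```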
Append-only conversation history
Multi-turn chat naturally caches well: each turn appends, so all previous turns are cached. Don't edit/redact prior turns mid-conversation — that invalidates everything after the edit. Summarize at boundaries, not in the middle.
Prefix-share retrieval results
Naive RAG re-injects different chunks per query, which destroys cache. Better: retrieve once per session, hold the document set stable, vary only the user query. Even better: cluster users by likely-relevant docs, share prefixes across the cluster.
Watch your cache hit rate metric
Anthropic and OpenAI both report cached-token counts in the API response's usage object. Log them; a minimal hook follows below. If a feature has <50% cache hit rate, the prompt design is wrong; >85% means the design is working. Treat it like any other product metric: set a target, alert on regressions.
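A minimal metric hook, assuming the usage field names current SDKs return (Anthropic's cache_read_input_tokens / cache_creation_input_tokens, OpenAI's prompt_tokens_details.cached_tokens); verify them against your API version:

```python
def log_cache_hit_rate(provider: str, usage) -> float:
    """Fraction of input tokens served from cache for one request."""
    if provider == "anthropic":
        cached = usage.cache_read_input_tokens
        total = usage.input_tokens + usage.cache_creation_input_tokens + cached
    else:  # "openai"
        cached = usage.prompt_tokens_details.cached_tokens
        total = usage.prompt_tokens
    rate = cached / total if total else 0.0
    print(f"cache_hit_rate={rate:.2%}")   # swap for your metrics/alerting pipeline
    return rate
```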
When the KV Cache Is the Wrong Optimization
Cache thinking is powerful but it's not always the right lever. Three cases where you should architect around it differently.
Highly personalized prompts
If every user gets a unique system prompt (their data, their preferences, their history), cache hit rate is structurally low. Better lever: smaller model, shorter prompts, or fine-tuning a model on common patterns instead of stuffing them into context.
Latency-critical real-time apps
Cached or not, decode latency is fundamental. If you need <100ms total response time, KV cache won't save you — you need a smaller model, speculative decoding, or output streaming with optimistic UI.
Compliance-sensitive contexts
Cache lives on shared infrastructure. Some regulated environments (healthcare, finance, EU GDPR-strict deployments) require dedicated inference or no cross-request caching. Check provider data-residency and isolation guarantees before designing around cache.
Embedding/classification workloads
Single-step inference (one input, one output, no follow-up) doesn't benefit from a persistent cache. The whole framing applies to autoregressive generation. For classifiers and embedding models, look at batching and quantization instead.