KV Cache Explained: Why Long Conversations Get Expensive Fast
TL;DR
The KV cache is the GPU memory that stores attention Key and Value tensors for every token already in context, so the model doesn't have to recompute them for every new token. Without it, generating token 1,000 would cost roughly 1,000x more than token 1. With it, decode is cheap but memory-bound. Inference splits into two regimes: prefill (compute-bound, processes the prompt in parallel) and decode (memory-bound, generates one token at a time). Anthropic, OpenAI, and Google all expose the KV cache as a billable feature through prompt caching, with discounts of roughly 50-90% on cached input tokens and TTLs from a few minutes to about an hour. PMs who design products around the cache (stable system prompts, append-only conversations, prefix-shared retrieval) can cut inference costs 4-10x without sacrificing quality.
What the KV Cache Actually Is
In a transformer, each layer computes Key (K) and Value (V) tensors for every token in the context. Under causal attention these depend only on the token and the tokens before it, never on anything that comes later, so once a token has been "seen," its K and V are fixed. They can be cached and reused for every subsequent decoding step. That's the KV cache.
Why caching is correct
Decoder-only LLMs use causal attention: token N can attend to tokens 1..N (itself and everything before it), but never to later tokens. So K and V for older tokens are immutable; recomputing them produces the same result. Caching is purely an optimization, not a quality tradeoff.
What gets cached
Per token, per layer, per head: the K and V vectors. For Llama 3 70B (80 layers, 8 KV heads, 128 dim, FP16): ~320 KB per token. A 100K-token context: ~32 GB of cache. This is why long-context inference is GPU-memory-bound.
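The arithmetic is just 2 (K and V) × layers × KV heads × head dim × bytes per value. A minimal sketch, using the Llama 3 70B shape from the paragraph above and assuming FP16 (2-byte) storage:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # 2x because both K and V are stored for every layer and every KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{per_token / 1024:.0f} KB per token")                 # ~320 KB
print(f"{per_token * 100_000 / 1e9:.1f} GB for 100K tokens")  # ~32 GB
```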
What is NOT cached
The Query (Q) for the current token has to be computed every step (it depends on the new token), and the feed-forward layers run fresh every step. Most decode time goes to streaming the cache and model weights out of HBM; the new Q computation and the FFN are small by comparison.
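A toy NumPy sketch of one decode step makes the split concrete. This is not a real model (random weights, one head, no FFN, no batching); the point is which tensors are recomputed each step and which are read back from the cache:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []          # grows by one entry per token seen

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """x_new: hidden state of the newest token, shape (d,)."""
    q = x_new @ W_q                        # Q: recomputed fresh every step
    k_cache.append(x_new @ W_k)            # K/V for this token: computed once...
    v_cache.append(x_new @ W_v)            # ...then reused on all later steps
    K, V = np.stack(k_cache), np.stack(v_cache)   # read the whole cache back
    scores = K @ q / np.sqrt(d)            # attend over everything seen so far
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # attention output for the new token

for _ in range(5):                         # later steps reuse all cached K/V
    out = decode_step(rng.standard_normal(d))
```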
Cache lifetime is per-request, by default
Without explicit prompt caching, the KV cache exists for the duration of one streamed generation. As soon as the response finishes, it's gone. The next request rebuilds it from scratch — even if the prompt is identical.
Prompt caching extends lifetime across requests
Anthropic, OpenAI, Google, and DeepSeek all expose APIs to keep cache entries alive across requests, from roughly 5 minutes to an hour or more, indexed by prompt prefix hash. Cached tokens cost 10% (Anthropic) to 50% (OpenAI standard tier) of normal input pricing.
Prefill vs Decode: Two Different Cost Curves
Inference has two phases with completely different performance characteristics. Confusing them is the most common cause of wrong cost estimates in PM specs.
Prefill: process the prompt
All input tokens are processed in parallel through every layer. Compute-bound — the GPU is doing dense matrix multiplications at near-peak FLOPs. Time grows roughly linearly with input length up to a saturation point. Builds the initial KV cache.
Decode: generate output tokens
Tokens are produced one at a time. Each step reads the entire KV cache from HBM, computes Q for one new token, and emits one output token. Memory-bandwidth-bound — bottleneck is moving the KV cache, not the math itself. ~30-150 tokens/second on modern GPUs.
Why output tokens cost 3-5x more than input
Prefill batches efficiently across many requests. Decode batches poorly because every request sits at a different cache size, so per token it uses the GPU far less efficiently. Provider pricing reflects this; see the GPT-4o, Claude, and Gemini API pricing tables.
Implications for streaming UX
Time-to-first-token (TTFT) is bounded by prefill latency, which grows with prompt length. Tokens-per-second after that is bounded by decode bandwidth. If your UX needs <200ms TTFT, your prompt has to be short, or you need prompt caching so prefill only runs on the uncached tail.
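A back-of-envelope model of both regimes, under loud assumptions: the standard ~2 FLOPs per parameter per token prefill estimate, a dense 70B model in FP16, H100-class peak compute and bandwidth, a single GPU, no batching, and the ~320 KB/token cache figure from earlier. Real deployments shard and batch, so treat the shape of the numbers as the takeaway, not the absolute values:

```python
def estimate(prompt_tokens: int,
             params_b: float = 70,        # model size, billions of parameters
             peak_tflops: float = 1000,   # assumed FP16 peak compute (H100-class)
             hbm_tb_s: float = 3.35,      # assumed HBM bandwidth, TB/s
             kv_bytes_per_token: int = 320 * 1024) -> tuple[float, float]:
    weight_bytes = params_b * 1e9 * 2                    # FP16 weights
    # Prefill: ~2 FLOPs per parameter per input token, compute-bound.
    ttft_s = 2 * params_b * 1e9 * prompt_tokens / (peak_tflops * 1e12)
    # Decode: every step re-reads the weights plus the entire KV cache from HBM.
    bytes_per_step = weight_bytes + prompt_tokens * kv_bytes_per_token
    tok_per_s = hbm_tb_s * 1e12 / bytes_per_step
    return ttft_s, tok_per_s

for n in (1_000, 10_000, 100_000):
    ttft, tps = estimate(n)
    print(f"{n:>7} prompt tokens: TTFT ~{ttft:.2f}s, decode ~{tps:.0f} tok/s")
```

TTFT scales with prompt length; decode speed barely moves until the cache itself becomes a large share of the memory traffic. Prompt caching attacks the first number, not the second.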
Prompt Caching: Anthropic vs OpenAI vs Google
Every major provider now exposes prompt caching, but the pricing models and ergonomics differ enough to matter for architecture decisions. As of mid-2026:
Anthropic Claude (explicit cache_control)
How it works: You mark up to 4 cache breakpoints in the prompt with cache_control. Cached input is 10% of base price (90% discount). Cache write is 1.25x base (you pay 25% premium to populate). Default TTL 5 minutes; 1-hour tier available at 2x base for write.
PM Implication: Best for retrieval-heavy systems with stable context blocks (system prompts + tool definitions + RAG document sets). The 90% discount means amortizing a 50K-token context over 10 turns is essentially free.
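A minimal sketch of the explicit breakpoint with the Python SDK. The model name and the stable-block contents are placeholders, and the usage field names should be checked against the current docs; note also that blocks below the provider's minimum cacheable length (on the order of 1K tokens) won't be cached:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

# Stable block: system instructions, tool docs, RAG context. Contents illustrative.
STABLE_CONTEXT = "You are Acme's support agent.\n\n<policies>refunds, shipping, SLAs</policies>"

response = client.messages.create(
    model="claude-sonnet-4-5",            # substitute the model you actually use
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            "cache_control": {"type": "ephemeral"},   # cache breakpoint ends here
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],
)

# Writes are billed at ~1.25x base on the first call; reads at ~0.1x on later calls.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```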
OpenAI (automatic prefix caching)
How it works: Caching is automatic for prompts >1024 tokens with shared prefixes. No API changes required. Cached tokens cost 50% of base price on standard tier, ~25% on Realtime API. TTL is provider-managed (typically 5-10 minutes).
PM Implication: Cheaper to adopt (zero code changes) but smaller discount and less control. Stable prefix matters: prepend everything that doesn't change to the start of the prompt.
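Since the caching itself is automatic, the integration work is just keeping the stable content at the front and watching the cached-token count the API reports back. A sketch with the Python SDK; the model name and prompt contents are placeholders, and the usage field name (prompt_tokens_details.cached_tokens) should be verified against your API version:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

STABLE_PREFIX = "System instructions, tool definitions, shared product context."  # never changes

resp = client.chat.completions.create(
    model="gpt-4o",                       # substitute the model you actually use
    messages=[
        {"role": "system", "content": STABLE_PREFIX},               # shared prefix first
        {"role": "user", "content": "What's our refund policy?"},   # dynamic part last
    ],
)

# With >1024 identical prefix tokens inside the TTL, the reuse shows up here.
print(resp.usage.prompt_tokens_details.cached_tokens)
```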
Google Gemini (context caching)
How it works: Explicit API: createCachedContent uploads a context, returns a handle, you reference it on each call. Storage cost (per token, per hour) PLUS reduced compute cost on use. TTL configurable up to a day.
PM Implication: Best when one large context is reused thousands of times (e.g., embed a 1M-token codebase, query it for hours). Worse than Anthropic/OpenAI for ad-hoc shared prefixes because of the storage fee.
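The decision usually comes down to whether the hourly storage fee is smaller than what the reuse discount saves. A back-of-envelope check, with every price left as a parameter so you plug in the current rate card rather than numbers asserted here:

```python
def caching_saves_money(context_tokens: int,
                        calls_per_hour: float,
                        input_price_per_mtok: float,       # normal input price per 1M tokens
                        cached_price_per_mtok: float,      # discounted price when served from cache
                        storage_price_per_mtok_hour: float) -> bool:
    mtok = context_tokens / 1e6
    without_cache = calls_per_hour * mtok * input_price_per_mtok
    with_cache = (calls_per_hour * mtok * cached_price_per_mtok
                  + mtok * storage_price_per_mtok_hour)    # flat hourly storage fee
    return with_cache < without_cache
```

A 1M-token codebase queried hundreds of times an hour clears the storage fee easily; a context touched a handful of times probably doesn't.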
Cut Inference Costs With the Right Architecture
The AI PM Masterclass walks through real cost-optimization decisions — including the KV cache patterns top teams use to ship cheaper products. Live, taught by a Salesforce Sr. Director PM.
Designing Products Around the KV Cache
Cache hit rate is one of the highest-leverage metrics in production AI systems. A 5-minute design review can move a feature from 10% cache hit to 90% — and turn a money-losing product into a profitable one.
Stabilize the prompt prefix
Order: system prompt → tool definitions → static context → user message. Anything dynamic at the front kills the cache for everything after it. Replace timestamps with placeholders, normalize casing, freeze tool order; a sketch of this assembly follows below. One PM team cut its Claude bill 7x just by reordering the prompt.
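One way to enforce that ordering in code: build every request from a frozen prefix plus a dynamic tail. The names and contents below are illustrative, not tied to any particular SDK:

```python
SYSTEM_PROMPT = "You are the support assistant for Acme."
TOOL_DEFINITIONS = "search_orders(query); issue_refund(order_id)"
STATIC_CONTEXT = "Refund policy, shipping policy, escalation rules."

# Frozen prefix: identical bytes on every request, so it always hits the cache.
STATIC_PREFIX = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "system", "content": TOOL_DEFINITIONS},   # frozen order, frozen casing
    {"role": "system", "content": STATIC_CONTEXT},
]

def build_messages(user_message: str) -> list[dict]:
    # Anything request-specific (user text, timestamps, per-request retrieval)
    # goes after the shared prefix, never inside it.
    return STATIC_PREFIX + [{"role": "user", "content": user_message}]
```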
Append-only conversation history
Multi-turn chat naturally caches well: each turn appends, so all previous turns are cached. Don't edit/redact prior turns mid-conversation — that invalidates everything after the edit. Summarize at boundaries, not in the middle.
Prefix-share retrieval results
Naive RAG re-injects different chunks per query, which destroys cache. Better: retrieve once per session, hold the document set stable, vary only the user query. Even better: cluster users by likely-relevant docs, share prefixes across the cluster.
Watch your cache hit rate metric
Anthropic and OpenAI both report cached-token counts in the API response's usage object. Log them; a minimal hook follows below. If a feature has <50% cache hit rate, the prompt design is wrong; >85% means the design is working. Treat it like any other product metric: set a target, alert on regressions.
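A minimal metric hook, assuming the usage field names current SDKs return (Anthropic's cache_read_input_tokens / cache_creation_input_tokens, OpenAI's prompt_tokens_details.cached_tokens); verify them against your API version:

```python
def log_cache_hit_rate(provider: str, usage) -> float:
    """Fraction of input tokens served from cache for one request."""
    if provider == "anthropic":
        cached = usage.cache_read_input_tokens
        total = usage.input_tokens + usage.cache_creation_input_tokens + cached
    else:  # "openai"
        cached = usage.prompt_tokens_details.cached_tokens
        total = usage.prompt_tokens
    rate = cached / total if total else 0.0
    print(f"cache_hit_rate={rate:.2%}")   # swap for your metrics/alerting pipeline
    return rate
```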
When the KV Cache Is the Wrong Optimization
Cache thinking is powerful but it's not always the right lever. Three cases where you should architect around it differently.
Highly personalized prompts
If every user gets a unique system prompt (their data, their preferences, their history), cache hit rate is structurally low. Better lever: smaller model, shorter prompts, or fine-tuning a model on common patterns instead of stuffing them into context.
Latency-critical real-time apps
Cached or not, decode latency is fundamental. If you need <100ms total response time, KV cache won't save you — you need a smaller model, speculative decoding, or output streaming with optimistic UI.
Compliance-sensitive contexts
Cache lives on shared infrastructure. Some regulated environments (healthcare, finance, EU GDPR-strict deployments) require dedicated inference or no cross-request caching. Check provider data-residency and isolation guarantees before designing around cache.
Embedding/classification workloads
Single-step inference (one input, one output, no follow-up) doesn't benefit from a persistent cache. The whole framing applies to autoregressive generation. For classifiers and embedding models, look at batching and quantization instead.