LLM Prefill vs. Decode: The Two Inference Phases Every AI PM Must Understand

The Two Phases: What Actually Happens When a User Hits Send

When your product calls an LLM API, the model does not process your prompt and generate a response as a single uniform operation. There are two distinct phases, each with a different computational profile:

Phase 1: Prefill (Prompt Processing)

The model ingests every token in your prompt — the system prompt, conversation history, retrieved documents, and user message — all at once, in parallel. It computes the attention scores for every token pair and builds an internal representation of the entire context. This phase ends when the model has processed all input and is ready to generate the first output token. The output of prefill is a set of key-value (KV) vectors stored in memory that the decode phase will reuse.

This determines your time-to-first-token (TTFT).

Phase 2: Decode (Token Generation)

The model generates output one token at a time. At each step, it looks at all previously generated tokens and the KV cache from prefill to predict the next token. This process repeats — sample a token, append it to the context, predict the next — until the model generates an end-of-sequence token or hits a max token limit. Each decode step accesses the full KV cache, which grows with each generated token.

This determines your tokens-per-second (TPS), which is the inverse of inter-token latency (the time between consecutive output tokens).

These two phases run serially within a single API call: prefill first, then decode. But they have almost nothing else in common. They differ in hardware constraints, scaling behavior, and cost drivers. An AI PM who treats them as the same thing will get the product spec wrong.
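To make the split concrete, here is a minimal sketch of how the two metrics could be derived from one streamed response. It assumes you can record a timestamp when the request is sent and one for each output token as it arrives; the function and field names are illustrative, not part of any provider SDK.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float             # request sent -> first output token (prefill side)
    tokens_per_second: float  # decode rate after the first token
    total_s: float            # request sent -> last output token


def stream_metrics(request_sent_s: float, token_times_s: Sequence[float]) -> StreamMetrics:
    """Split one streamed response into a prefill metric and a decode metric."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    first, last = token_times_s[0], token_times_s[-1]
    decode_tokens = len(token_times_s) - 1  # tokens produced after the first one
    decode_time = last - first
    tps = decode_tokens / decode_time if decode_time > 0 else 0.0
    return StreamMetrics(first - request_sent_s, tps, last - request_sent_s)


# Hypothetical trace: first token at 0.5 s, then one token every 0.25 s.
m = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Reporting the two numbers separately, rather than only the total, is what lets you tell a prefill problem from a decode problem.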

Prefill Is Compute-Bound: What That Means for You

Prefill is compute-bound, meaning its performance is limited by the GPU's raw compute throughput (measured in FLOPS). Processing a long prompt requires enormous matrix multiplications. Most of that work grows linearly with prompt length, but the attention mechanism computes relationships between every token pair, so its share grows quadratically and dominates at very long contexts. The good news: GPUs are very good at parallel computation, so prefill is fast per token when batched efficiently.

Prefill cost scales with input tokens

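Every additional input token adds compute cost to prefill, and the growth is faster than linear. A 128K-token context costs more than 4x the prefill compute of a 32K context: the attention portion alone grows roughly 16x because it is quadratic, while the rest of the model grows 4x. This is one reason some long-context APIs price very long prompts at a premium.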
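As a rough illustration of how prefill compute grows between a 32K and a 128K prompt, the sketch below adds a per-token term (linear in prompt length) to an attention term (quadratic in prompt length). The formula is a common back-of-envelope approximation, and the parameter count, layer count, and hidden size are assumptions chosen for illustration, not any provider's real configuration.

```python
def prefill_flops(prompt_tokens: int, params: float, layers: int, hidden: int) -> float:
    """Back-of-envelope forward-pass FLOPs for processing a prompt.

    linear term:    ~2 FLOPs per parameter per token (weight matmuls)
    attention term: ~4 * layers * hidden * tokens^2 (QK^T and attention-times-V),
                    ignoring the savings from causal masking
    """
    linear = 2.0 * params * prompt_tokens
    attention = 4.0 * layers * hidden * prompt_tokens ** 2
    return linear + attention


# Hypothetical model shape, for illustration only.
PARAMS, LAYERS, HIDDEN = 70e9, 80, 8192

short = prefill_flops(32_000, PARAMS, LAYERS, HIDDEN)
long = prefill_flops(128_000, PARAMS, LAYERS, HIDDEN)
ratio = long / short
print(f"128K prompt costs about {ratio:.1f}x the prefill compute of a 32K prompt")
```

With these assumptions the ratio lands at around 8x, between the 4x of purely linear scaling and the 16x of purely quadratic scaling. The longer the prompt, the closer it drifts toward the quadratic end.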

System prompt length is a cost lever you control

Your system prompt is processed in full on every single request unless it is served from a prompt cache. A 2,000-token system prompt instead of a 500-token one means roughly 4x the prefill compute for that portion on every call. Prompt engineering has a direct P&L impact, so optimize your system prompt aggressively.

Prompt caching eliminates repeated prefill cost

Major providers offer prompt caching that stores the KV state of a repeated prompt prefix, such as your system prompt. Subsequent calls with the same prefix skip prefilling that segment entirely. Cached input tokens are typically billed at a steep discount (roughly 50-90% off, depending on the provider), and cache hit rates of 60-80% are achievable on products with consistent system prompts.
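A minimal sketch of the expected input cost per request under prefix caching. The price, discount, and hit rate in the example are hypothetical placeholders; the model also ignores the cache-write surcharge that some providers apply, so treat it as a first approximation.

```python
def effective_input_cost(
    prefix_tokens: int,
    dynamic_tokens: int,
    price_per_mtok: float,
    cached_discount: float,
    hit_rate: float,
) -> float:
    """Expected input cost in dollars for one request with prefix caching."""
    full_price = price_per_mtok / 1_000_000
    cached_price = full_price * (1.0 - cached_discount)
    # The prefix is billed at the cached rate on a hit and the full rate on a miss.
    prefix_cost = prefix_tokens * (hit_rate * cached_price + (1.0 - hit_rate) * full_price)
    # The per-request part of the prompt always pays full price.
    dynamic_cost = dynamic_tokens * full_price
    return prefix_cost + dynamic_cost


# Hypothetical: 2,000-token system prompt, 500 dynamic tokens, $2.50 per million
# input tokens, 80% discount on cached tokens, 70% cache hit rate.
no_cache = effective_input_cost(2000, 500, 2.50, 0.8, 0.0)
with_cache = effective_input_cost(2000, 500, 2.50, 0.8, 0.7)
print(f"without caching: ${no_cache:.5f}  with caching: ${with_cache:.5f}")
```

The saving depends on both the discount and the hit rate, which is why a prefix that changes often erodes most of the benefit.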

TTFT tells you how long before the user sees anything

Long system prompts and large retrieval payloads directly increase TTFT. If your RAG pipeline stuffs 10,000 tokens of retrieved context into the prompt, your TTFT will be noticeably worse than a product with a tighter context. Users notice delays over 500ms before the first token appears.

Decode Is Memory-Bound: The KV Cache Connection

Decode is memory-bound, meaning its performance is limited by GPU memory bandwidth (how fast the GPU can move data between memory and its compute units) rather than by raw compute. At each decode step, the GPU reads the model weights and the full KV cache (all prompt tokens plus every token generated so far) from GPU memory to produce a single token. The bottleneck is moving that data, not computing on it.

Tokens per second is relatively constant — until it isn't

For a given model on a given hardware configuration, decode TPS is fairly stable across different response lengths. A model that generates 40 tokens/second will generate at roughly that rate whether the response is 200 tokens or 2,000 tokens. However, TPS degrades as the KV cache grows very large — reading a 100K-token KV cache each decode step is slower than reading a 10K-token one.
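To see why a larger KV cache slows each decode step, here is a rough sizing sketch. The layer count, KV head count, head dimension, and memory bandwidth below are assumptions for illustration, and the time estimate is only a lower bound from reading the cache once per step.

```python
def kv_cache_bytes(
    tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def min_read_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to stream num_bytes from GPU memory once."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model shape (fp16 values) and a hypothetical 3 TB/s of bandwidth.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BANDWIDTH = 3e12

for context in (10_000, 100_000):
    size = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM)
    ms = min_read_time_s(size, BANDWIDTH) * 1000
    print(f"{context:>7} tokens: {size / 1e9:.1f} GB of KV cache, at least {ms:.1f} ms per step")
```

The cache grows linearly with context length, so the per-step read cost of a 100K-token context is about ten times that of a 10K-token one, on top of the fixed cost of reading the model weights.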

Output token pricing reflects this cost structure

This is a large part of why most providers price output tokens higher than input tokens. Each output token requires its own full decode step, including a read of the entire KV cache, whereas input tokens are processed together in parallel during prefill. On GPT-4o as of mid-2026, output tokens cost 4x input tokens. That ratio reflects, at least in part, the higher serving cost of memory-bound decode compared with compute-bound prefill.

Longer outputs hurt throughput more than length suggests

At the API level, a 1,000-token response is billed like ten 100-token responses, but its effect on serving capacity goes beyond the bill: it also reduces the number of concurrent users a server can support. Long outputs hold GPU memory longer, reducing parallelism across users. This is why providers may throttle long-generation workloads differently than short ones.

Streaming makes decode latency invisible

Streaming doesn't make decode faster — it makes the wait invisible. When you stream tokens as they're generated, the user perceives sub-second response starts even for responses that take 10+ seconds to fully generate. Streaming is the single highest-impact UX decision for any AI product with interactive output. If you're not streaming, you're manufacturing perceived latency.

How the Two Phases Show Up in Your Product's Latency Profile

The prefill-decode split explains a latency pattern that confuses PMs who haven't seen it before: a product can have excellent perceived responsiveness (low TTFT) but high total latency for long responses — or vice versa. These are separate things to optimize.

Metric	Driven by	How to optimize	User impact
Time to First Token (TTFT)	Prefill phase: input token count	Shorter system prompts, prompt caching, smaller retrieval payloads	Perceived responsiveness: how quickly does something appear?
Tokens Per Second (TPS)	Decode phase: memory reads of model weights and KV cache	Smaller model, speculative decoding, flash attention (streaming hides the wait but does not raise TPS)	Reading speed: can the user keep up, or are they waiting for the stream?
Total Latency	Both phases combined	TTFT optimizations plus TPS optimizations; also shorter outputs	Task completion time: how long until the full answer is available?
Cost Per Request	Input tokens (prefill) + output tokens (decode)	Prompt optimization, prompt caching, output length limits, model tiering	Indirect: affects pricing model and margins, not perceived experience

The spec trap: don't specify “end-to-end latency <2 seconds” for an AI feature

This is the most common latency spec mistake. "End-to-end under 2 seconds" conflates TTFT and total latency. A 500-token streaming response at 40 TPS takes 12.5 seconds to fully render — but the user sees the first word in 300ms and reads comfortably the whole way. That product feels fast. Specify TTFT and TPS targets separately, not total latency. Then pick the right metric for each surface: conversational interfaces are TTFT-sensitive; document generation is TPS-sensitive.
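The arithmetic behind that example is simple enough to sketch. The function names are illustrative; the numbers are the ones used above (300ms TTFT, 500 output tokens, 40 TPS).

```python
def total_latency_s(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Time until the full response has been generated."""
    return ttft_s + output_tokens / tps


def perceived_wait_s(ttft_s: float, output_tokens: int, tps: float, streaming: bool) -> float:
    """Time until the user sees anything at all."""
    # With streaming the user sees the first token at TTFT;
    # without it, nothing appears until the whole response is done.
    return ttft_s if streaming else total_latency_s(ttft_s, output_tokens, tps)


total = total_latency_s(0.3, 500, 40.0)
print(f"total latency: {total:.1f} s")
print(f"wait before first text, streaming: {perceived_wait_s(0.3, 500, 40.0, True):.1f} s")
print(f"wait before first text, no streaming: {perceived_wait_s(0.3, 500, 40.0, False):.1f} s")
```

A single "under 2 seconds" target fails this response on total latency even though the streamed experience starts in well under a second.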

Disaggregated Inference: The Architecture Behind Modern AI APIs

As of 2026, the most performance-efficient AI serving systems (reportedly including those powering OpenAI, Anthropic, and Google APIs) use disaggregated inference: they run prefill and decode on separate GPU clusters, each optimized for its phase.

Prefill clusters are configured for compute throughput. Decode clusters are configured for memory bandwidth and capacity, so that KV-cache reads stay fast at scale. The two can run on the same GPU model (for example, H100s with HBM3 memory and NVLink) with different batching and parallelism settings, or on different GPU types. The clusters communicate over high-speed interconnects: one server ingests your prompt and hands the KV state to another server that generates the response.

Why you should care as a PM

Disaggregated inference is one reason TTFT has been able to improve in 2025-2026 without decode TPS degrading: providers can scale each cluster independently based on demand patterns. For your product, the practical consequence is that TTFT and TPS have different bottlenecks and different fixes. Troubleshoot them separately.

Why long prompts still spike TTFT in production

Even with disaggregated prefill clusters, very long prompts (100K+ tokens) cause measurable TTFT spikes because the prefill compute is genuinely massive. If some of your users ask questions with huge document attachments, your p99 TTFT can be an order of magnitude higher than your median. Design your product around this: show progress indicators, process large documents asynchronously, and surface the 'thinking' state explicitly.

Prompt caching lives in the prefill cluster

When you enable prompt caching, the prefill cluster stores the KV state of your cached prefix. Subsequent calls skip re-computing that prefix and pull the cached KV state directly. This is why prompt caching slashes TTFT for your constant system prompt content — the prefill cluster skips it entirely. Cache misses still pay full prefill compute cost.
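The hit-or-miss behavior can be pictured with a toy model. This is not how any provider implements caching internally; it is a sketch, under the assumption that the cache is keyed on an exact token prefix, of why a hit leaves only the new tokens to prefill and a miss pays for everything.

```python
import hashlib
from typing import Dict, Sequence


class PrefixCache:
    """Toy stand-in for a serving-side cache keyed on an exact token prefix."""

    def __init__(self) -> None:
        self._store: Dict[str, int] = {}  # prefix hash -> number of cached tokens

    @staticmethod
    def _key(tokens: Sequence[int]) -> str:
        return hashlib.sha256(",".join(map(str, tokens)).encode("utf-8")).hexdigest()

    def tokens_to_prefill(self, prompt: Sequence[int], prefix_len: int) -> int:
        """Return how many prompt tokens still need prefill compute."""
        key = self._key(prompt[:prefix_len])
        if key in self._store:
            return len(prompt) - prefix_len  # hit: only the new suffix is computed
        self._store[key] = prefix_len        # miss: compute everything, then cache
        return len(prompt)


cache = PrefixCache()
system = list(range(100))                    # hypothetical 100-token system prompt
first = cache.tokens_to_prefill(system + [1, 2, 3], prefix_len=100)
second = cache.tokens_to_prefill(system + [7, 8], prefix_len=100)
changed = cache.tokens_to_prefill([999] + system[1:] + [7, 8], prefix_len=100)
print(first, second, changed)
```

The third call changes a single token at the start of the prefix and loses the cache entirely, which is the behavior to keep in mind when editing a system prompt.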

6 Product Decisions Shaped by the Prefill-Decode Split

Here are the decisions that should be different once you understand the two phases:

Latency SLA design

Write separate TTFT and TPS targets. For conversational UI: TTFT < 500ms, TPS > 30 tokens/second. For document generation: TTFT < 2s, TPS > 50 tokens/second. Never write a single end-to-end latency number.
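One way to keep the two targets separate in practice is to encode them per surface and check them independently. The sketch below uses the targets from the paragraph above; the surface names and the helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTarget:
    max_ttft_s: float  # first token must arrive faster than this
    min_tps: float     # decode must run faster than this


# Targets taken from the text; surface names are illustrative.
TARGETS = {
    "conversational": LatencyTarget(max_ttft_s=0.5, min_tps=30.0),
    "document_generation": LatencyTarget(max_ttft_s=2.0, min_tps=50.0),
}


def meets_target(surface: str, ttft_s: float, tps: float) -> bool:
    """Check a measured (TTFT, TPS) pair against the target for one surface."""
    target = TARGETS[surface]
    return ttft_s < target.max_ttft_s and tps > target.min_tps


print(meets_target("conversational", ttft_s=0.3, tps=40.0))       # fast start, fast enough stream
print(meets_target("document_generation", ttft_s=0.3, tps=40.0))  # same numbers, stream too slow
```

The same measurement passes on one surface and fails on the other, which is the point of writing the targets per surface.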

System prompt length budget

Set a token budget for your system prompt and defend it in sprint reviews. Every additional 1,000 tokens costs money and slows TTFT on every request. When engineers want to add extensive instructions, quantify the per-request cost impact.
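A quick way to quantify that impact in a sprint review is sketched below. The request volume and input price are hypothetical placeholders, and the calculation assumes the extra tokens are billed at the full uncached rate.

```python
def monthly_prompt_cost(extra_tokens: int, requests_per_month: int, price_per_mtok: float) -> float:
    """Dollars per month spent on extra system-prompt tokens at the uncached input rate."""
    return extra_tokens * requests_per_month * price_per_mtok / 1_000_000


# Hypothetical: 1,000 extra tokens, 1M requests per month, $2.50 per million input tokens.
cost = monthly_prompt_cost(1_000, 1_000_000, 2.50)
print(f"${cost:,.0f} per month")
```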

Retrieval payload size

RAG systems that stuff 20K tokens of retrieved context into the prompt face at least 4x the prefill cost, and slower TTFT, compared with systems that use 5K tokens. Reranking retrieved chunks to cut payload size directly improves TTFT and reduces cost, and is usually worth the extra reranking round-trip.
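A minimal sketch of enforcing a retrieval token budget after reranking. It assumes each chunk already carries a relevance score and a token count; the greedy selection rule is one simple option among several, not a recommendation of a specific reranker.

```python
from typing import List, Tuple

Chunk = Tuple[float, int, str]  # (relevance score, token count, text)


def select_chunks(chunks: List[Chunk], token_budget: int) -> List[str]:
    """Greedily keep the highest-scoring chunks that still fit in the budget."""
    selected: List[str] = []
    used = 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected


# Hypothetical reranker output: 11,000 tokens retrieved, 5,000-token budget.
retrieved = [
    (0.9, 3000, "chunk A"),
    (0.8, 2500, "chunk B"),
    (0.5, 1500, "chunk C"),
    (0.4, 4000, "chunk D"),
]
kept = select_chunks(retrieved, token_budget=5000)
print(kept)
```

Chunk B is skipped because it would overflow the budget, while the smaller chunk C still fits.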

Prompt caching adoption

If your system prompt is consistent across users (or consistent per user), enable prompt caching. It is one of the highest-ROI infrastructure decisions available. On Anthropic and OpenAI APIs as of mid-2026, cached input tokens cost 50-90% less than uncached. The one engineering cost is ensuring your system prompt prefix is always identical byte-for-byte.
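The byte-for-byte requirement is mostly about prompt layout: static content goes first, per-request content goes last. The sketch below compares two layouts by fingerprinting the start of each prompt. The system prompt and helper names are hypothetical, and it compares characters as a stand-in for the tokenized prefix that providers actually match on.

```python
import hashlib

# Hypothetical system prompt for illustration.
SYSTEM_PROMPT = "You are a support assistant for ExampleCo. Answer concisely."


def build_prompt_stable(user_message: str, timestamp: str) -> str:
    """Static text first, per-request text last: the prefix never changes."""
    return f"{SYSTEM_PROMPT}\n\nCurrent time: {timestamp}\nUser: {user_message}"


def build_prompt_unstable(user_message: str, timestamp: str) -> str:
    """Per-request text first: every request starts with different bytes."""
    return f"Current time: {timestamp}\n\n{SYSTEM_PROMPT}\nUser: {user_message}"


def prefix_fingerprint(prompt: str, prefix_chars: int = len(SYSTEM_PROMPT)) -> str:
    """Hash the start of the prompt so prefix drift is easy to catch in tests."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


a = prefix_fingerprint(build_prompt_stable("Where is my order?", "2026-01-01T09:00"))
b = prefix_fingerprint(build_prompt_stable("Cancel my plan.", "2026-01-01T09:05"))
c = prefix_fingerprint(build_prompt_unstable("Where is my order?", "2026-01-01T09:00"))
d = prefix_fingerprint(build_prompt_unstable("Cancel my plan.", "2026-01-01T09:05"))
print(a == b)  # stable layout: identical prefix across requests
print(c == d)  # unstable layout: the timestamp breaks the prefix
```

A check like this in CI catches the common regression where someone adds a timestamp or user ID to the top of the prompt.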

Output length controls

Set max_tokens limits appropriate to each surface. Conversational responses rarely need more than 400 tokens. Document drafts might need 2,000. Leaving max_tokens unlimited on a chat surface means runaway output costs and a perceived slow product. Always set it.
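A small sketch of making the limit explicit per surface rather than leaving it to each call site. The surface names and the request dictionary are hypothetical; only max_tokens comes from the text, and the exact parameter name varies by provider.

```python
# Limits from the text; surface names are illustrative.
MAX_TOKENS_BY_SURFACE = {
    "chat": 400,
    "document_draft": 2000,
}
DEFAULT_MAX_TOKENS = 400  # conservative fallback for surfaces nobody configured


def max_tokens_for(surface: str) -> int:
    """Look up the output cap for a surface, falling back to the default."""
    return MAX_TOKENS_BY_SURFACE.get(surface, DEFAULT_MAX_TOKENS)


def request_params(surface: str, prompt: str) -> dict:
    """Build request parameters with an output cap always set."""
    return {"prompt": prompt, "max_tokens": max_tokens_for(surface), "stream": True}


print(request_params("chat", "Summarize my last invoice."))
```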

Cost modeling for product pricing

If you're building pricing for your product, model input and output tokens separately. A product where users submit 1K-token prompts and receive 2K-token responses has very different unit economics than one with 2K-token prompts and 1K-token responses, even though the total token count is the same.
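To see how much the input/output mix matters, here is a minimal cost sketch comparing two request shapes with the same 3,000-token total: 1K in and 2K out versus 2K in and 1K out. The prices are hypothetical placeholders with output priced at 4x input.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one request, pricing input and output tokens separately."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Hypothetical prices: $2.50 per million input tokens, $10.00 per million output tokens.
output_heavy = request_cost(1_000, 2_000, 2.50, 10.00)
input_heavy = request_cost(2_000, 1_000, 2.50, 10.00)
print(f"1K in / 2K out: ${output_heavy:.4f}")
print(f"2K in / 1K out: ${input_heavy:.4f}")
```

Under these assumptions the output-heavy shape costs 50% more per request at an identical total token count.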
