LLM Context Window: What Every AI Product Manager Needs to Know
TL;DR
The context window is the LLM's working memory — everything it can "see" at once when generating a response. Every product decision involving AI is shaped by it: conversation length, document ingestion, cost, latency, and accuracy. Understanding how tokens work, why larger windows come with trade-offs, and how to manage context efficiently is one of the most underrated skills for AI PMs.
Tokens, Context, and What the Model Actually Sees
When you send a message to an LLM, everything the model processes is tokenized and placed into the context window. The model generates each response token by attending to all previous tokens in the context. There is no persistent memory between sessions — only what's in the current context window.
What is a token?
A token is roughly 3/4 of a word in English; 'tokenization', for example, splits into ['token', 'ization']. Numbers, code, and non-English text often tokenize less efficiently: a single character may be its own token. Rule of thumb: 1,000 tokens ≈ 750 words.
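To see this concretely, OpenAI's open-source tiktoken library exposes the same tokenizers its models use. A minimal sketch (exact splits vary by tokenizer and model):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # inspect the actual splits
```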
What goes into the context window?
System prompt + conversation history + any retrieved documents (in RAG) + the current user message + any tool call results. All of it counts toward your token budget.
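A rough sketch of how those pieces consume the budget, using the ~4-characters-per-token heuristic; all strings below are made-up examples:

```python
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic: ~4 characters per token

system_prompt = "You are a concise billing support assistant."
history = ["User: My invoice looks wrong.",
           "Assistant: Can you share the invoice ID?"]
retrieved_docs = ["Billing policy: refunds are processed within 5 business days."]
current_message = "User: It's INV-1042."

# Every one of these components counts against the same token budget.
parts = [system_prompt, *history, *retrieved_docs, current_message]
print(f"~{sum(approx_tokens(p) for p in parts)} input tokens consumed "
      "before the model generates anything")
```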
Input tokens vs output tokens
Both cost money, but at different rates. Output tokens typically cost 3–5x more than input tokens because generation is computationally more expensive than processing input. Design prompts to minimize output length when possible.
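A back-of-envelope cost model makes the asymmetry concrete. The prices below are illustrative placeholders, not any provider's actual rates:

```python
INPUT_PRICE_PER_M = 3.00    # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # $ per 1M output tokens (assumed, 5x input)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# A 4,000-token prompt with a 500-token answer: output is 1/8 of the
# tokens but over a third of the cost at these rates.
print(f"${request_cost(4_000, 500):.4f} per request")
```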
Context window vs model knowledge
The model has two sources of information: its trained weights (what it learned during pre-training) and the current context. RAG uses the context to supply facts the weights don't reliably contain.
How Context Size Affects Product Behavior
Short context (≤ 8K tokens)
Use cases: Simple Q&A, single-document summaries, focused chatbots
Trade-offs: Fast, cheap, high accuracy. But conversations truncate quickly and you can't process long documents in one pass.
Medium context (8K–32K tokens)
Use cases: Multi-turn customer support, code review, report drafting
Trade-offs: Good balance of cost and capability. Covers most real-world use cases. Watch for degraded recall on facts from early in the context.
Long context (128K–1M tokens)
Use cases: Entire codebase analysis, book-length document Q&A, long legal contract review
Trade-offs: Expensive, slower, and accuracy degrades non-linearly. Models struggle with information in the middle of very long contexts. Often RAG is more accurate and cheaper than brute-force long context.
The Lost in the Middle Problem
Research has consistently found that LLMs attend most strongly to information at the very beginning and very end of the context window (the "lost in the middle" effect documented by Liu et al., 2023). Information buried in the middle of a long context is disproportionately ignored, even if it's the most relevant part.
Product implication for RAG
When inserting retrieved chunks into context, don't bury the most relevant chunk in the middle of 20 other chunks. Put the highest-scored chunks first (or last) in the context window.
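One way to act on this, sketched below; it assumes your retriever hands back (score, text) pairs, which may differ from your actual stack:

```python
# Place the strongest chunks at the edges of the context, where
# attention is strongest, and let weaker chunks fill the middle.
def pack_for_position(chunks):
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # best first, second best last

chunks = [(0.91, "chunk A"), (0.40, "chunk D"), (0.85, "chunk B"), (0.55, "chunk C")]
print([text for _, text in pack_for_position(chunks)])
# -> ['chunk A', 'chunk C', 'chunk D', 'chunk B']
```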
System prompt placement
Critical instructions get the strongest attention at the very start of the system prompt. Instructions buried 500 tokens deep into a long system prompt may be deprioritized. Keep system prompts concise and lead with the most important constraints.
Conversation history management
As conversations grow long, early turns receive less attention. For customer support bots, consider summarizing old turns rather than keeping them verbatim. A compressed summary placed at the start of the context is often more effective than the full verbatim history.
Testing for context position effects
When evaluating your system, test with relevant information at different positions in context. A system that works with short context may fail surprisingly when the context grows — not because of length, but because of position.
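A minimal sketch of such a test: plant a "needle" fact at varying depths in filler text and measure recall at each depth. Here ask_model is a placeholder for your actual LLM call:

```python
FILLER = "Unrelated policy paragraph with routine details. " * 200
NEEDLE = "The override code for account INV-9 is 7341."

def build_context(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_context(depth) + "\nWhat is the override code for INV-9?"
    # answer = ask_model(prompt)  # hypothetical call; record accuracy per depth
    print(f"depth={depth:.2f}: context ready ({len(prompt)} chars)")
```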
Context Management Strategies for AI PMs
Rolling window / sliding context
Keep only the last N turns of conversation history. Drop older turns when the window fills. Simple and effective for chat apps where old context is usually irrelevant.
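A minimal sketch using Python's collections.deque, which evicts the oldest turns automatically:

```python
from collections import deque

MAX_TURNS = 6
history = deque(maxlen=MAX_TURNS)  # oldest turns fall off when full

for i in range(10):
    history.append(f"turn {i}")

print(list(history))  # only turns 4..9 survive
```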
Summarization-based compression
When conversation history exceeds a threshold, use a cheap model call to summarize the conversation so far. Insert the summary at the top of the new context. Preserves substance while drastically reducing token count.
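A sketch of the pattern; cheap_summarize stands in for a call to a small, inexpensive model, and the threshold is an assumption to tune for your product:

```python
TOKEN_THRESHOLD = 6_000  # assumed compression trigger

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-chars-per-token heuristic

def compress_history(history: list[str], cheap_summarize) -> list[str]:
    if sum(approx_tokens(t) for t in history) <= TOKEN_THRESHOLD:
        return history
    old, recent = history[:-4], history[-4:]   # keep the last 4 turns verbatim
    summary = cheap_summarize("\n".join(old))  # one cheap model call
    return [f"Summary of earlier conversation: {summary}", *recent]
```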
Hierarchical retrieval (RAG over long docs)
Instead of shoving entire documents into context, use RAG to retrieve only the relevant chunks. More accurate than brute-force long context and significantly cheaper. Use long context as a fallback when chunk retrieval fails.
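A sketch of retrieval with a long-context fallback; the retriever interface and confidence threshold are assumptions, not any specific library's API:

```python
MIN_SCORE = 0.5  # assumed retrieval-confidence threshold

def build_doc_context(retriever, query: str, full_document: str) -> str:
    hits = retriever.search(query, k=5)  # assumed shape: [(score, chunk_text), ...]
    if hits and hits[0][0] >= MIN_SCORE:
        return "\n\n".join(text for _, text in hits)
    # Retrieval confidence too low: fall back to brute-force long context.
    return full_document
```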
Structured context packing
Order context elements by importance, not chronology. System prompt → most relevant retrieved context → recent conversation → current query. This positions high-priority information where the model attends most strongly.
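A sketch of that ordering; the assembly function and its inputs are illustrative:

```python
def pack_context(system_prompt, retrieved, recent_turns, user_query):
    """Assemble the prompt by importance, not chronology."""
    return "\n\n".join([
        system_prompt,                # position 0: strongest attention
        "\n\n".join(retrieved),       # highest-scored evidence next
        "\n".join(recent_turns[-6:]), # only the recent conversation
        user_query,                   # end of context: also well-attended
    ])
```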
KV cache optimization
LLM serving stacks can reuse the computed attention state (the KV cache) for a context prefix that repeats across requests instead of recomputing it. A static system prompt that doesn't change between requests can be cached this way, reducing cost and latency significantly. Anthropic calls this 'prompt caching.'
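With Anthropic's API, for instance, you mark the static prefix with a cache_control block. This sketch follows the API shape as documented at the time of writing; the model name is illustrative, cached prefixes must exceed a model-specific minimum length, and you should check the current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_SYSTEM_PROMPT = "You are a support assistant. <long static instructions here>"

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix across requests
    }],
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.content[0].text)
```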
Cost, Latency & Context: The PM Trade-off Triangle
Cost scales linearly with input tokens
Doubling your context window doubles your input cost. For a chatbot handling 10,000 conversations per day at 4,000 input tokens each, that's 40M input tokens daily; at, say, $3 per million input tokens, that's $120 per day, and moving to 8K tokens per conversation takes it to $240. Measure, don't guess.
Latency grows with context length
Time-to-first-token (TTFT) increases with context size. A 128K token context can have TTFT of 3–10 seconds on many providers. For latency-sensitive products, smaller context is faster even if the model theoretically supports large context.
Prompt caching changes the math
Providers like Anthropic and OpenAI offer significant discounts (50–90%) on cached input tokens. If your system prompt is large and static, prompt caching can make large contexts economically viable.
Model selection matters
Smaller, faster models (GPT-4o mini, Claude Haiku) are far cheaper per token than frontier models. For high-volume features, use a smaller model with optimized context instead of throwing a frontier model at every request.
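A sketch of that routing decision; the task categories, token threshold, and model names are all illustrative:

```python
def pick_model(task: str, input_tokens: int) -> str:
    """Route high-volume, low-complexity requests to a cheaper model."""
    simple_tasks = {"classification", "extraction", "short_answer"}
    if task in simple_tasks and input_tokens < 4_000:
        return "small-fast-model"  # GPT-4o mini / Claude Haiku tier
    return "frontier-model"        # reserve for long, complex requests
```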