LLM Context Window: What Every AI Product Manager Needs to Know
TL;DR
The context window is the LLM's working memory — everything it can "see" at once when generating a response. Every product decision involving AI is shaped by it: conversation length, document ingestion, cost, latency, and accuracy. Understanding how tokens work, why larger windows come with trade-offs, and how to manage context efficiently is one of the most underrated skills for AI PMs.
Tokens, Context, and What the Model Actually Sees
When you send a message to an LLM, everything the model processes is tokenized and placed into the context window. The model generates each response token by attending to all previous tokens in the context. There is no persistent memory between sessions — only what's in the current context window.
What is a token?
A token is roughly 3/4 of a word in English; 'tokenization', for example, splits into ['token', 'ization']. Numbers, code, and non-English text often tokenize less efficiently: a single character may be its own token. Rule of thumb: 1,000 tokens ≈ 750 words.
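To see this concretely, OpenAI's open-source tiktoken library exposes the same tokenizers its models use. A minimal sketch (exact splits vary by tokenizer and model):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # inspect the actual splits
```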
What goes into the context window?
System prompt + conversation history + any retrieved documents (in RAG) + the current user message + any tool call results. All of it counts toward your token budget.
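A rough sketch of how those pieces consume the budget, using the ~4-characters-per-token heuristic; all strings below are made-up examples:

```python
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic: ~4 characters per token

system_prompt = "You are a concise billing support assistant."
history = ["User: My invoice looks wrong.",
           "Assistant: Can you share the invoice ID?"]
retrieved_docs = ["Billing policy: refunds are processed within 5 business days."]
current_message = "User: It's INV-1042."

# Every one of these components counts against the same token budget.
parts = [system_prompt, *history, *retrieved_docs, current_message]
print(f"~{sum(approx_tokens(p) for p in parts)} input tokens consumed "
      "before the model generates anything")
```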
Input tokens vs output tokens
Both cost money, but at different rates. Output tokens typically cost 3–5x more than input tokens because generation is computationally more expensive than processing input. Design prompts to minimize output length when possible.
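A back-of-envelope cost model makes the asymmetry concrete. The prices below are illustrative placeholders, not any provider's actual rates:

```python
INPUT_PRICE_PER_M = 3.00    # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # $ per 1M output tokens (assumed, 5x input)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# A 4,000-token prompt with a 500-token answer: output is 1/8 of the
# tokens but over a third of the cost at these rates.
print(f"${request_cost(4_000, 500):.4f} per request")
```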
Context window vs model knowledge
The model has two sources of information: its trained weights (what it learned during pre-training) and the current context. RAG uses the context to supply facts the weights don't reliably contain.
How Context Size Affects Product Behavior
Short context (≤ 8K tokens)
Use cases: Simple Q&A, single-document summaries, focused chatbots
Trade-offs: Fast, cheap, high accuracy. But conversations truncate quickly and you can't process long documents in one pass.
Medium context (8K–32K tokens)
Use cases: Multi-turn customer support, code review, report drafting
Trade-offs: Good balance of cost and capability. Covers most real-world use cases. Watch for degraded recall on facts from early in the context.
Long context (128K–1M tokens)
Use cases: Entire codebase analysis, book-length document Q&A, long legal contract review
Trade-offs: Expensive, slower, and accuracy degrades non-linearly. Models struggle with information in the middle of very long contexts. Often RAG is more accurate and cheaper than brute-force long context.
The Lost in the Middle Problem
Research has consistently found that LLMs attend most strongly to information at the very beginning and very end of the context window (the "lost in the middle" effect documented by Liu et al., 2023). Information buried in the middle of a long context is disproportionately ignored, even if it's the most relevant part.
Product implication for RAG
When inserting retrieved chunks into context, don't bury the most relevant chunk in the middle of 20 other chunks. Put the highest-scored chunks first (or last) in the context window.
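One way to act on this, sketched below; it assumes your retriever hands back (score, text) pairs, which may differ from your actual stack:

```python
# Place the strongest chunks at the edges of the context, where
# attention is strongest, and let weaker chunks fill the middle.
def pack_for_position(chunks):
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # best first, second best last

chunks = [(0.91, "chunk A"), (0.40, "chunk D"), (0.85, "chunk B"), (0.55, "chunk C")]
print([text for _, text in pack_for_position(chunks)])
# -> ['chunk A', 'chunk C', 'chunk D', 'chunk B']
```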
System prompt placement
Critical instructions get the strongest attention at the very start of the system prompt. Instructions buried 500 tokens deep into a long system prompt may be deprioritized. Keep system prompts concise and lead with the most important constraints.
Conversation history management
As conversations grow long, early turns receive less attention. For customer support bots, consider summarizing old turns rather than keeping them verbatim. A compressed summary placed at the start of the context is often more effective than the full verbatim history.
Testing for context position effects
When evaluating your system, test with relevant information at different positions in context. A system that works with short context may fail surprisingly when the context grows — not because of length, but because of position.
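A minimal sketch of such a test: plant a "needle" fact at varying depths in filler text and measure recall at each depth. Here ask_model is a placeholder for your actual LLM call:

```python
FILLER = "Unrelated policy paragraph with routine details. " * 200
NEEDLE = "The override code for account INV-9 is 7341."

def build_context(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_context(depth) + "\nWhat is the override code for INV-9?"
    # answer = ask_model(prompt)  # hypothetical call; record accuracy per depth
    print(f"depth={depth:.2f}: context ready ({len(prompt)} chars)")
```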
Context Management Strategies for AI PMs
Rolling window / sliding context
Keep only the last N turns of conversation history. Drop older turns when the window fills. Simple and effective for chat apps where old context is usually irrelevant.
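A minimal sketch using Python's collections.deque, which evicts the oldest turns automatically:

```python
from collections import deque

MAX_TURNS = 6
history = deque(maxlen=MAX_TURNS)  # oldest turns fall off when full

for i in range(10):
    history.append(f"turn {i}")

print(list(history))  # only turns 4..9 survive
```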
Summarization-based compression
When conversation history exceeds a threshold, use a cheap model call to summarize the conversation so far. Insert the summary at the top of the new context. Preserves substance while drastically reducing token count.
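A sketch of the pattern; cheap_summarize stands in for a call to a small, inexpensive model, and the threshold is an assumption to tune for your product:

```python
TOKEN_THRESHOLD = 6_000  # assumed compression trigger

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-chars-per-token heuristic

def compress_history(history: list[str], cheap_summarize) -> list[str]:
    if sum(approx_tokens(t) for t in history) <= TOKEN_THRESHOLD:
        return history
    old, recent = history[:-4], history[-4:]   # keep the last 4 turns verbatim
    summary = cheap_summarize("\n".join(old))  # one cheap model call
    return [f"Summary of earlier conversation: {summary}", *recent]
```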
Hierarchical retrieval (RAG over long docs)
Instead of shoving entire documents into context, use RAG to retrieve only the relevant chunks. More accurate than brute-force long context and significantly cheaper. Use long context as a fallback when chunk retrieval fails.
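A sketch of retrieval with a long-context fallback; the retriever interface and confidence threshold are assumptions, not any specific library's API:

```python
MIN_SCORE = 0.5  # assumed retrieval-confidence threshold

def build_doc_context(retriever, query: str, full_document: str) -> str:
    hits = retriever.search(query, k=5)  # assumed shape: [(score, chunk_text), ...]
    if hits and hits[0][0] >= MIN_SCORE:
        return "\n\n".join(text for _, text in hits)
    # Retrieval confidence too low: fall back to brute-force long context.
    return full_document
```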
Structured context packing
Order context elements by importance, not chronology. System prompt → most relevant retrieved context → recent conversation → current query. This positions high-priority information where the model attends most strongly.
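A sketch of that ordering; the assembly function and its inputs are illustrative:

```python
def pack_context(system_prompt, retrieved, recent_turns, user_query):
    """Assemble the prompt by importance, not chronology."""
    return "\n\n".join([
        system_prompt,                # position 0: strongest attention
        "\n\n".join(retrieved),       # highest-scored evidence next
        "\n".join(recent_turns[-6:]), # only the recent conversation
        user_query,                   # end of context: also well-attended
    ])
```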
KV cache optimization
LLM serving stacks can reuse the computed attention state (the KV cache) for a context prefix that repeats across requests instead of recomputing it. A static system prompt that doesn't change between requests can be cached this way, reducing cost and latency significantly. Anthropic calls this 'prompt caching.'
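With Anthropic's API, for instance, you mark the static prefix with a cache_control block. This sketch follows the API shape as documented at the time of writing; the model name is illustrative, cached prefixes must exceed a model-specific minimum length, and you should check the current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_SYSTEM_PROMPT = "You are a support assistant. <long static instructions here>"

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix across requests
    }],
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.content[0].text)
```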
Cost, Latency & Context: The PM Trade-off Triangle
Cost scales linearly with input tokens
Doubling your context window doubles your input cost. For a chatbot handling 10,000 conversations per day at 4,000 input tokens each, that's 40M input tokens daily; at, say, $3 per million input tokens, that's $120 per day, and moving to 8K tokens per conversation takes it to $240. Measure, don't guess.
Latency grows with context length
Time-to-first-token (TTFT) increases with context size. A 128K token context can have TTFT of 3–10 seconds on many providers. For latency-sensitive products, smaller context is faster even if the model theoretically supports large context.
Prompt caching changes the math
Providers like Anthropic and OpenAI offer significant discounts (50–90%) on cached input tokens. If your system prompt is large and static, prompt caching can make large contexts economically viable.
Model selection matters
Smaller, faster models (GPT-4o mini, Claude Haiku) are far cheaper per token than frontier models. For high-volume features, use a smaller model with optimized context instead of throwing a frontier model at every request.
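A sketch of that routing decision; the task categories, token threshold, and model names are all illustrative:

```python
def pick_model(task: str, input_tokens: int) -> str:
    """Route high-volume, low-complexity requests to a cheaper model."""
    simple_tasks = {"classification", "extraction", "short_answer"}
    if task in simple_tasks and input_tokens < 4_000:
        return "small-fast-model"  # GPT-4o mini / Claude Haiku tier
    return "frontier-model"        # reserve for long, complex requests
```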