Long-Context Models: A Product Manager's Guide to 1M+ Token Windows
TL;DR
Context windows have exploded from 4K tokens in 2022 to 10M+ tokens in 2026. But advertised capacity and effective capacity are not the same thing — models degrade significantly past their sweet spot. This guide tells you which use cases justify long context, where the cost and latency pain points are, how to pick the right model, and how to architect products that actually benefit from the new scale.
The Context Window Race: From 4K to 10M Tokens
In early 2023, the standard context window was 4,096 tokens — about 3,000 words. GPT-4 Turbo pushed it to 128K. Claude 3 hit 200K. Gemini 1.5 Pro launched at 1 million. By May 2026, Gemini 3.1 Pro advertises 2 million tokens, Meta's Llama 4 Scout reaches 10 million, and Subquadratic raised $29M in May 2026 targeting 12 million tokens with a sub-quadratic attention architecture.
For product managers, the question is not which model has the biggest number but what more context actually changes about your product. The answer depends heavily on your use case — and on understanding the gap between a model's nominal context window and its effective recall capability.
4K–32K tokens (2022–2023)
Single documents, short conversations. Standard chat and summarization. Limited to roughly 25 pages of text.
128K–200K tokens (2024)
Full books, large codebases, multi-document analysis. The GPT-4 Turbo and Claude 3 era. Most enterprise use cases fit here.
1M tokens (2025)
Entire codebases, hour-long meeting transcripts, multi-book research synthesis. Gemini 1.5 Pro opened this tier commercially.
2M–10M tokens (2026)
Multiple product datasets, dense video frame sequences, company-wide document corpora. Gemini 3.1 Pro and Llama 4 Scout operate here.
What 1M+ Tokens Actually Unlocks for Product Teams
Longer context windows do not just let you process more text — they enable qualitatively different product architectures. Some use cases were simply impossible below 1M tokens. These are the ones worth building around.
Whole-codebase analysis
Processing an entire GitHub repo in a single call to explain architecture, find security vulnerabilities, or generate a migration plan. Cursor and GitHub Copilot use this for large-scale code review.
Multi-document legal and compliance review
Submitting an entire contract portfolio or regulatory filing set simultaneously, rather than chunking into RAG. Better for queries that require cross-document reasoning.
Full meeting and call history
Loading months of transcripts for a customer account, not just the last call. Enables relationship intelligence that RAG misses because it cannot surface implicit patterns across conversations.
Real-time video understanding
Processing dense frame sequences from video feeds — security footage, manufacturing QA, sports analytics — as multi-modal long-context inputs.
Eliminating RAG for bounded corpora
For datasets under roughly 500K tokens, long context can replace a retrieval pipeline entirely: no chunking, no embeddings, no retrieval errors, just load and ask (see the sketch after this list).
Multi-agent context sharing
Agents passing full task history to each other without summarization. Reduces information loss that accumulates when agents compress context between handoffs.
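Below is a minimal sketch of the "load and ask" pattern from the bounded-corpora item above, assuming an OpenAI-compatible chat client and a tiktoken-style tokenizer for counting. The model name, file glob, token budget, and question are placeholders to swap for your own provider and corpus.

```python
# Sketch: replace a RAG pipeline with a single long-context call when the
# whole corpus fits under a token budget. Assumes the `openai` and `tiktoken`
# packages; the model name and budget are illustrative, not recommendations.
from pathlib import Path
import tiktoken
from openai import OpenAI

TOKEN_BUDGET = 500_000  # rough ceiling from the guide; tune per model
enc = tiktoken.get_encoding("cl100k_base")

def load_corpus(folder: str) -> str:
    """Concatenate every file in the corpus with clear separators."""
    parts = []
    for path in sorted(Path(folder).glob("**/*.md")):
        parts.append(f"\n--- FILE: {path} ---\n{path.read_text()}")
    return "".join(parts)

corpus = load_corpus("docs/")
n_tokens = len(enc.encode(corpus))
if n_tokens > TOKEN_BUDGET:
    raise ValueError(f"Corpus is {n_tokens} tokens; consider RAG instead of one call.")

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; any long-context chat model
    messages=[
        {"role": "system", "content": "Answer strictly from the documents provided."},
        {"role": "user", "content": f"{corpus}\n\nQuestion: Which contracts auto-renew in Q3?"},
    ],
)
print(response.choices[0].message.content)
```

The same pattern covers whole-codebase analysis: swap the glob for source files and the question for an architecture or migration prompt.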
Context Rot: Advertised vs. Effective Capacity
The most important thing to know about long-context models in 2026: the number a provider advertises is not the number you should architect around. Chroma Research coined the term "context rot" to describe the performance degradation that happens as models approach their nominal context limits.
Gemini 3.1 Pro advertises 2M tokens but scores 26.3% on MRCR v2 — a benchmark that tests whether models can actually retrieve and reason about information spread across long contexts. The "1 million token lie" is real: most models can accept a large context but can only reliably use a fraction of it for complex reasoning tasks.
Lost in the Middle
What happens: Information at the start and end of a long context gets higher attention weights than information buried in the middle. A key fact on page 47 of a 100-page document is often ignored.
PM action: Put critical instructions at the top. For retrieval tasks, structure documents so key facts appear near boundaries. Or use RAG to surface the right chunks before the long-context call.
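A small sketch of that placement advice: critical instructions sit at the top of the assembled prompt and the question is restated at the end, so the bulk material lands in the middle where recall is weakest. Restating the question at the end is a common companion tactic rather than something this guide prescribes; the helper and tags are purely illustrative.

```python
# Sketch: assemble a long-context prompt so critical material sits at the
# boundaries, where attention and recall are strongest. Plain string handling.
def build_prompt(instructions: str, documents: list[str], question: str) -> str:
    doc_block = "\n\n".join(
        f"<document index={i}>\n{doc}\n</document>" for i, doc in enumerate(documents)
    )
    return (
        f"INSTRUCTIONS (read first):\n{instructions}\n\n"  # top of context
        f"{doc_block}\n\n"                                  # middle: bulk material
        f"QUESTION (restated):\n{question}\n"               # bottom of context
    )
```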
Recall Drops Nonlinearly
What happens: Performance does not degrade uniformly as context grows. Many models hit a cliff around 60–70% of their stated context limit. Beyond that, recall accuracy falls sharply.
PM action: Test your actual use case with your actual context length. Do not assume a 1M-token model is reliable at 900K tokens — measure MRCR-style recall on your corpus.
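A minimal needle-in-a-haystack style harness for that advice, assuming an OpenAI-compatible client. The model name, filler text, word counts, and pass criterion are assumptions to adapt to your own corpus and eval suite; words are used as a rough proxy for tokens.

```python
# Sketch: probe recall at increasing context lengths by burying a known fact
# ("needle") at several depths and checking whether the model returns it.
# Each probe is a full-priced long-context call, so run sparingly.
from openai import OpenAI

client = OpenAI()
NEEDLE = "The maintenance window for cluster-7 is 03:00-04:00 UTC on Sundays."
QUESTION = "When is the maintenance window for cluster-7?"
FILLER = "Routine status update with no operational changes to report. "  # padding

def run_probe(context_words: int, depth: float, model: str = "gpt-4.1") -> bool:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of a filler context."""
    words = (FILLER * (context_words // len(FILLER.split()) + 1)).split()[:context_words]
    words.insert(int(depth * len(words)), NEEDLE)
    haystack = " ".join(words)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{haystack}\n\n{QUESTION}"}],
    )
    return "03:00" in resp.choices[0].message.content

for n_words in (50_000, 200_000, 600_000):  # scale toward your target context length
    hits = sum(run_probe(n_words, d) for d in (0.1, 0.5, 0.9))
    print(f"{n_words:>8} words: {hits}/3 depths recalled")
```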
Cost Scales Quadratically in Standard Architectures
What happens: Standard attention is O(n^2) — doubling context quadruples the compute. Some newer architectures (flash attention, linear attention variants) reduce this, but not all providers disclose which they use.
PM action: For inputs above 200K tokens, ask your provider about their attention implementation and benchmark latency at your target context length before committing to an architecture.
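A back-of-the-envelope illustration of the O(n^2) point: relative attention compute at a few context lengths against an assumed 128K baseline. Real prefill time also depends on hardware, batching, and whether the provider uses flash or linear attention variants, which is exactly why the question to your provider matters.

```python
# Sketch: relative self-attention cost under standard O(n^2) scaling.
# Baseline and lengths are illustrative assumptions.
BASELINE = 128_000  # tokens

for tokens in (128_000, 256_000, 512_000, 1_000_000, 2_000_000):
    relative_compute = (tokens / BASELINE) ** 2
    print(f"{tokens:>9} tokens -> ~{relative_compute:6.1f}x the attention compute of {BASELINE:,} tokens")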
Cost, Latency, and Infrastructure Trade-offs
Long context is expensive. At 2M tokens per call, even with aggressive caching, the cost-per-query is orders of magnitude higher than standard chat. Here is how to think about the trade-offs before you commit to an architecture.
Pricing structure
Most providers charge separately for input and output tokens. At $0.10 per million input tokens, a single 1M-token call costs $0.10 in input alone; at 100 calls/day that is roughly $300/month in input costs before output, infra, and caching. Run the unit economics before you ship.
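The same arithmetic in a form you can drop into a cost review. The price and volumes mirror the example above and are placeholders for your provider's actual rate card; output tokens, caching, and infra are deliberately left out.

```python
# Sketch: monthly input-token cost for a long-context feature.
INPUT_PRICE_PER_MTOK = 0.10    # USD per million input tokens (assumed)
TOKENS_PER_CALL = 1_000_000
CALLS_PER_DAY = 100

cost_per_call = TOKENS_PER_CALL / 1_000_000 * INPUT_PRICE_PER_MTOK
monthly_input_cost = cost_per_call * CALLS_PER_DAY * 30
print(f"${cost_per_call:.2f} per call, ${monthly_input_cost:,.0f}/month in input tokens alone")
```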
Prompt caching
The biggest lever for long-context cost reduction. Providers including Anthropic, Google, and OpenAI support prompt caching: if you reuse the same large context across multiple queries, only the first call pays the full input price, and subsequent calls read the cached prefix at a discounted rate. Cache hit rates above 80% can cut input costs roughly fivefold.
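A sketch of what this looks like in practice, using Anthropic's cache_control content blocks as one concrete shape (Google and OpenAI expose equivalent but differently structured caching). The field names follow Anthropic's published prompt-caching API; the model id, file, and questions are placeholders.

```python
# Sketch: mark the large, reused context as cacheable so repeat queries pay
# the discounted cached-read rate instead of full input price.
import anthropic

client = anthropic.Anthropic()
large_corpus = open("contract_portfolio.txt").read()  # reused across many queries

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=1024,
        system=[
            {"type": "text", "text": "You are a contract analyst."},
            {
                "type": "text",
                "text": large_corpus,
                "cache_control": {"type": "ephemeral"},  # cache this large prefix
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

# The first call writes the cache; later calls within the cache TTL hit it.
print(ask("Which agreements contain a change-of-control clause?"))
print(ask("List all agreements with auto-renewal terms."))
```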
Time to first token (TTFT)
Prefilling a 500K-token context takes measurably longer than a 4K context. For synchronous user-facing products, measure TTFT at your target context length. For async batch jobs, latency matters less than throughput.
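A quick way to measure TTFT at your target context length, assuming an OpenAI-compatible streaming API; the model name and context file are placeholders, and the first streamed chunk is treated as the end of prefill since that is what a user actually feels.

```python
# Sketch: time-to-first-token at a realistic context size via streaming.
import time
from openai import OpenAI

client = OpenAI()
long_context = open("target_context.txt").read()  # build this at your real size

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; use your production model
    stream=True,
    messages=[
        {"role": "system", "content": "Summarize the key decisions."},
        {"role": "user", "content": long_context},
    ],
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"Time to first token: {ttft:.2f}s")
        break
```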
Memory vs. retrieval trade-off
Long context is not always better than RAG. For dynamic or frequently updated corpora, RAG retrieves fresh data on every query. Long context is better for stable datasets where cross-document reasoning matters more than freshness.
Picking the Right Long-Context Model for Your Use Case
The market in May 2026 offers meaningful differentiation between models on recall accuracy, cost, latency, and multimodal support. Here is the framework AI PMs use to choose.
For maximum reliable recall: Claude 3.5 at 200K
Anthropic's models score consistently high on needle-in-haystack benchmarks at their stated limits. If recall accuracy matters more than scale, do not overshoot to 2M; use the 200K window you can rely on.
For multimodal long context: Gemini 3.1 Pro at 2M
Gemini's native multimodality at 2M tokens is unmatched in May 2026. Best for video, audio, and mixed-media document analysis where you need to reason across modalities at scale.
For open-source and cost control: Llama 4 Scout at 10M
Meta's Llama 4 Scout offers 10M tokens and can be self-hosted, eliminating per-token cost at scale. Quality degrades faster than it does in frontier models, but fine-tuning on your corpus can close the gap.
For experimental architectures: Subquadratic at 12M
Subquadratic launched with $29M in May 2026 specifically to solve the quadratic attention cost problem. Worth watching for 2H 2026 — not production-ready today, but the architecture could redefine the cost curve.
The PM Decision Rule
Start at the smallest context that solves your problem reliably. Use RAG for dynamic or large-scale corpora. Move to long context only when cross-document reasoning accuracy clearly beats retrieval — and measure that with your own eval suite, not benchmark scores.