TECHNICAL DEEP DIVE

Gemini 3.1 Ultra for Product Managers: What the 2M-Token Context Window Changes

By Institute of AI PM · 14 min read · Jun 13, 2026

TL;DR

Google released Gemini 3.1 Ultra in April 2026 with a 2-million token context window — the largest of any publicly available model. It processes text, images, audio, and video through a single unified attention mechanism, and can write and execute Python natively in a sandboxed environment mid-conversation. For AI PMs, this changes three decisions: when to use RAG vs. long-context, how to design multimodal pipelines, and when Gemini 3.1 Ultra outcompetes Claude or GPT-4o for your specific use case. This article covers the specs, the product implications, and a concrete decision matrix.


What Gemini 3.1 Ultra Is (and Is Not)

Google released Gemini 3.1 Ultra in April 2026. It's the flagship of the Gemini 3.x family — sitting above Gemini 3.0 Pro and Flash variants in both capability and cost. The headline number is 2 million input tokens, which translates to roughly 1.5 million words of English text, two hours of video at standard sampling, or 22 hours of audio in a single context window.

Context window: 2 million input tokens; 64K output tokens per response

Modalities: Text, images, audio, video; a single unified attention mechanism across all four

Code execution: Native sandboxed Python; writes code, runs it, reads the output, and revises, all within one inference call

Benchmarks: Beats GPT-5 on SWE-bench code benchmarks; matches Claude on long-document reasoning; Google reports better coherence at 1.5M+ token positions than competing models

Availability: Vertex AI, Gemini API; powers Google AI Overviews and Workspace AI
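
The headline capacity figures can be turned into a rough sizing check. The sketch below derives per-unit rates purely from the numbers quoted above (1.5 million words, two hours of video, or 22 hours of audio per 2M tokens); these are implied averages, not published tokenizer specifications, and the function names are illustrative.

```python
# Rates implied by the article's headline figures (assumptions, not official specs).
WINDOW_TOKENS = 2_000_000
WORDS_PER_TOKEN = 1_500_000 / WINDOW_TOKENS            # 0.75 words per token
VIDEO_TOKENS_PER_SECOND = WINDOW_TOKENS / (2 * 3600)   # about 278 tokens per second
AUDIO_TOKENS_PER_SECOND = WINDOW_TOKENS / (22 * 3600)  # about 25 tokens per second


def estimate_tokens(words: int = 0,
                    video_seconds: float = 0.0,
                    audio_seconds: float = 0.0) -> int:
    """Rough token estimate for a mixed text / video / audio input."""
    text_tokens = words / WORDS_PER_TOKEN
    video_tokens = video_seconds * VIDEO_TOKENS_PER_SECOND
    audio_tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    return round(text_tokens + video_tokens + audio_tokens)


def fits_in_window(tokens: int, window: int = WINDOW_TOKENS) -> bool:
    """True when the estimated input fits the context window."""
    return tokens <= window


if __name__ == "__main__":
    # One hour of video plus 500,000 words of supporting documents.
    mixed = estimate_tokens(words=500_000, video_seconds=3600)
    print(mixed, fits_in_window(mixed))
```

Mixed inputs add up quickly: by these rates, one hour of video already uses half the window before any text is included.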

What it is not: a drop-in replacement for every use case. The 2M context window carries real latency at full utilization — first-token latency on a maxed context is measurably higher than on Claude Sonnet or GPT-4o mini. For high-volume, cost-sensitive tasks — classification, extraction, short-form generation — Gemini Flash is still the right default.

The 2M Token Context Window: What Actually Changes

After every context-window announcement, the real question is: does the model actually use the window effectively? Google claims better coherence across the full window, specifically less degradation deep into the context, where competing models lose recall (a failure mode related to the "lost in the middle" problem). Independent evaluations published in May 2026 found Gemini 3.1 Ultra retrieving facts from the 1.8M token position with 87% accuracy, vs. 71% for GPT-4 at comparable positions in its shorter context. Because the two figures come from different context lengths, treat the gap as directional rather than a like-for-like benchmark. It matters most if your use case requires reading to the end of a long document.

RAG is now optional for mid-scale corpora

If your knowledge base fits under ~1,500 documents (at the 2M limit that works out to about 1,300 tokens, or roughly 1,000 words, per document) and changes infrequently, you can stuff it directly into context. No vector database, no chunking decisions, no retrieval pipeline. Simpler architecture, fewer failure modes.

RAG still wins at enterprise document scale

When your corpus is tens of thousands of documents, grows daily, or needs freshness guarantees, RAG remains superior. Long-context is not a universal replacement — it is an alternative for the specific case where your corpus fits the window.
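
The two rules above (fits the window and rarely changes, versus large, fast-growing, or freshness-sensitive) can be sketched as a simple check. The thresholds, the `prompt_reserve` headroom, and all names here are illustrative assumptions to tune against your own corpus, not vendor guidance.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    document_count: int
    avg_tokens_per_document: int
    updates_per_week: float
    needs_freshness_guarantee: bool = False


def recommend_retrieval(corpus: Corpus,
                        window_tokens: int = 2_000_000,
                        prompt_reserve: int = 100_000,
                        max_updates_per_week: float = 1.0) -> str:
    """Return 'long_context' or 'rag' for a corpus.

    prompt_reserve and max_updates_per_week are hypothetical knobs:
    headroom for instructions and the question, and the update rate
    above which re-stuffing the context stops being practical.
    """
    total_tokens = corpus.document_count * corpus.avg_tokens_per_document
    if total_tokens > window_tokens - prompt_reserve:
        return "rag"  # the corpus does not fit the window
    if corpus.needs_freshness_guarantee:
        return "rag"  # freshness guarantees need an updatable index
    if corpus.updates_per_week > max_updates_per_week:
        return "rag"  # the corpus changes too often to re-stuff
    return "long_context"


if __name__ == "__main__":
    handbook = Corpus(document_count=1_500, avg_tokens_per_document=1_200,
                      updates_per_week=0.5)
    print(recommend_retrieval(handbook))
```

The check is deliberately conservative: any single disqualifier sends the corpus to RAG, since long-context only pays off when every condition holds.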

Full codebase reasoning is practical now

A 2M token window holds the full source code of most production applications. Gemini 3.1 Ultra can review, refactor, and reason about an entire codebase in one pass, enabling PM use cases like comprehensive architecture review without a chunking strategy.

Run the cost model before defaulting to full-context

At 2M tokens per call, budget burns fast. Model pricing scales with tokens consumed. Before locking into long-context mode, quantify: how often does your actual use case need 2M tokens vs. 128K? Most production workloads do not.
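
A minimal cost sketch makes that question concrete. It assumes flat per-token input pricing and takes the price as a parameter, because no actual price is given here; real pricing may use tiers, caching discounts, or separate output rates, and the function names are illustrative.

```python
def monthly_input_cost(tokens_per_call: float,
                       calls_per_day: float,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Input-token spend per month under flat per-token pricing (an assumption)."""
    return tokens_per_call / 1_000_000 * usd_per_million_tokens * calls_per_day * days


def blended_monthly_cost(share_full_context: float,
                         calls_per_day: float,
                         usd_per_million_tokens: float,
                         full_tokens: int = 2_000_000,
                         typical_tokens: int = 128_000,
                         days: int = 30) -> float:
    """Monthly cost when only a share of calls needs the full window."""
    if not 0.0 <= share_full_context <= 1.0:
        raise ValueError("share_full_context must be between 0 and 1")
    full_calls = calls_per_day * share_full_context
    typical_calls = calls_per_day * (1.0 - share_full_context)
    return (monthly_input_cost(full_tokens, full_calls, usd_per_million_tokens, days)
            + monthly_input_cost(typical_tokens, typical_calls, usd_per_million_tokens, days))


if __name__ == "__main__":
    price = 1.0  # placeholder USD per million input tokens, not a real price
    for share in (0.0, 0.1, 1.0):
        print(share, round(blended_monthly_cost(share, 100, price), 2))
```

Whatever the real price turns out to be, a 2M-token call costs 15.625 times as much in input tokens as a 128K-token call, so the share of calls that truly need the full window dominates the bill.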

Native Code Execution: What the Sandbox Changes for Product Teams

Gemini 3.1 Ultra can write Python, run it in a sandboxed environment, observe the result, and revise — all within the same inference call. The model does not return code for your system to execute separately; it runs, reads the output, and incorporates it into the response. This is qualitatively different from function calling or tool use, where the model issues a call and waits for your infrastructure to respond.

What code execution enables

What it means: Data analysis on uploaded CSVs. Mathematical verification of its own reasoning. Dynamic chart generation. Real-time hypothesis testing. Any task where the model benefits from running a computation and reading the result before responding.

PM Implication: You can build analytics assistant products where the model doesn't just narrate analysis — it performs it. 'Find the revenue anomaly in this dataset' can now mean the model runs statistical tests and shows its work, not just pattern-matches on training data.
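
To make "runs statistical tests and shows its work" concrete, here is the kind of short, self-contained script a model might write and run in the sandbox for a revenue-anomaly question. It is an illustrative sketch only: the data is made up, and the technique (modified z-scores based on the median absolute deviation, with the commonly used 3.5 cutoff) is one reasonable choice, not a description of what Gemini actually generates.

```python
import statistics


def find_anomalies(values, threshold=3.5):
    """Flag outliers with the modified z-score (median absolute deviation).

    Returns (index, value, score) tuples for points whose absolute
    score exceeds the threshold. The 3.5 cutoff is a common convention.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    anomalies = []
    for index, value in enumerate(values):
        score = 0.6745 * (value - median) / mad
        if abs(score) > threshold:
            anomalies.append((index, value, round(score, 2)))
    return anomalies


if __name__ == "__main__":
    # Hypothetical daily revenue in thousands of dollars.
    daily_revenue = [102, 98, 101, 97, 103, 99, 180, 100]
    for index, value, score in find_anomalies(daily_revenue):
        print(f"day {index}: revenue {value} (modified z-score {score})")
```

A median-based score is used because a single large spike inflates the mean and standard deviation enough to hide itself in small samples.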

What code execution does not enable

What it means: Persistent state, network access, external API calls, or file system writes. The sandbox is ephemeral — each execution starts clean. Think of it as a stateless calculator, not a server.

PM Implication: Don't design for the model to 'install libraries' or 'save output files.' The sandbox resets between calls. For anything requiring persistence or external IO, function calling with your infrastructure is still the right architecture.
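
The split between the two architectures can be written down as a checklist. This sketch assumes the four limits listed above are the complete set of sandbox restrictions; the class and function names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskNeeds:
    persistent_state: bool = False
    network_access: bool = False
    external_api_calls: bool = False
    file_system_writes: bool = False


def choose_execution_path(needs: TaskNeeds):
    """Return the suggested path and the requirements that forced it."""
    blockers = [name for name, required in asdict(needs).items() if required]
    if blockers:
        # Anything the ephemeral sandbox cannot do goes through your own infrastructure.
        return "function_calling", blockers
    return "native_sandbox", blockers


if __name__ == "__main__":
    print(choose_execution_path(TaskNeeds()))
    print(choose_execution_path(TaskNeeds(network_access=True, file_system_writes=True)))
```

Returning the blockers alongside the decision keeps the reasoning visible in design reviews, which helps when a requirement is later dropped.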

Competitive context

What it means: OpenAI's Advanced Data Analysis has offered similar sandboxed execution for ChatGPT users since 2023, and OpenAI also exposes a code interpreter tool through its Assistants API. Gemini 3.1 Ultra offers sandboxed execution natively in the API, so it is available programmatically to builders rather than only inside a consumer product.

PM Implication: If your team has been shipping a custom code execution wrapper on top of the OpenAI API, Gemini 3.1 Ultra's native sandbox is worth benchmarking on your actual workload before your next sprint planning.


Unified Multimodal Attention: One Model for Text, Images, Audio, and Video

Most "multimodal" models process each modality separately and merge the results — encoding an image separately from text, then reconciling. Gemini 3.1 Ultra uses a single attention mechanism across all four modalities simultaneously. The model reasons about relationships between modalities directly, not by passing summaries between specialized sub-models.

Why unified attention matters

When a user shares a deposition video alongside a transcript and 30 supporting documents, a multi-model pipeline processes each separately and reconciles. Gemini 3.1 Ultra attends to all simultaneously — catching discrepancies between spoken testimony and documentary evidence in one pass.

Product use case: legal and compliance discovery

Legal discovery workflows that ingest video testimony, audio transcripts, and documentary exhibits in a single call. Ask 'identify moments where this testimony contradicts exhibit D' and the model reasons across all modalities at once.

Product use case: media intelligence

Brand monitoring that analyzes social video, audio mentions, and text comments in one pass — replacing multi-stage pipelines with separate transcription, vision, and NLP models that each add latency and failure surface.

The real constraint: video token cost

Video at standard sampling consumes tokens fast — two hours fills the 2M window. For longer video analysis, you need chunking, which reintroduces the orchestration complexity unified attention was supposed to eliminate. Know your input lengths before finalizing the architecture.
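
A quick planning sketch shows how many calls a longer video would need. The token rate is derived from the article's own figure (two hours fills 2M tokens); the reserve for prompt and supporting documents and the overlap between chunks are assumptions to adjust, and the function name is illustrative.

```python
WINDOW_TOKENS = 2_000_000
# Implied by "two hours fills the 2M window"; not an official sampling rate.
VIDEO_TOKENS_PER_SECOND = WINDOW_TOKENS / (2 * 3600)


def plan_video_chunks(duration_seconds: float,
                      reserve_tokens: int = 200_000,
                      overlap_seconds: float = 30.0):
    """Split a video into (start, end) second ranges that each fit one call.

    reserve_tokens leaves room for the prompt and any documents;
    overlap_seconds repeats a little footage so events on a chunk
    boundary are seen whole. Both defaults are illustrative.
    """
    chunk_seconds = (WINDOW_TOKENS - reserve_tokens) / VIDEO_TOKENS_PER_SECOND
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than a chunk")
    chunks = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        chunks.append((start, end))
        if end >= duration_seconds:
            break
        start = end - overlap_seconds  # step back to create the overlap
    return chunks


if __name__ == "__main__":
    for start, end in plan_video_chunks(5 * 3600):
        print(f"{start / 60:.1f} min to {end / 60:.1f} min")
```

With these defaults a five-hour recording needs three calls, and cross-chunk questions (for example, a contradiction between hour one and hour four) have to be reconciled outside the model.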

When to Use Gemini 3.1 Ultra vs. Claude, GPT-4o, and Others

Model selection should be use-case driven. Here is a direct decision matrix based on what Gemini 3.1 Ultra does measurably well, and where alternatives still lead.

Choose Gemini 3.1 Ultra when

→ Your context consistently exceeds 200K tokens
→ You need native multimodal reasoning across video, audio, and text together
→ Code execution without a separate tool call layer is core to your product
→ You are building on Google Cloud or Workspace
→ Best-in-class code generation benchmarks are a product requirement

Choose Claude when

→ Instruction-following precision and nuanced writing quality are top requirements
→ Your use case involves complex multi-turn reasoning with subtle constraints
→ You are building on AWS Bedrock or an Anthropic-hosted environment
→ Safety alignment and refusal reliability are critical product properties
→ You need Constitutional AI-style behavior for consumer-facing products

Choose GPT-4o when

→ Your team is deep in the OpenAI ecosystem and migration cost is a factor
→ You are using Assistants API v2 with file search and code interpreter
→ Real-time voice interaction is a core product feature
→ Broadest third-party tool and plugin ecosystem coverage matters
→ Your users are already on ChatGPT and native feel is important

Choose Flash / Haiku / Mini variants when

→ Volume is high and cost is the primary constraint
→ Latency under 1 second is a hard product requirement
→ Tasks are simple: classification, extraction, short-form formatting
→ You are serving edge or mobile clients where per-request latency and cost dominate (these variants are still hosted models, not on-device ones)
→ You are running evals at scale and need cheap, fast scoring
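
The matrix above can be encoded as a small rule set that produces a shortlist to benchmark. This sketch covers only a subset of the criteria, the field names are hypothetical, and the output is a starting point for evaluation on your own workload, not a final selection.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    typical_context_tokens: int = 0
    cross_modal_reasoning: bool = False   # video, audio, and text together
    native_code_execution: bool = False
    on_google_cloud: bool = False
    instruction_precision: bool = False
    on_aws_bedrock: bool = False
    realtime_voice: bool = False
    openai_ecosystem: bool = False
    high_volume_cost_bound: bool = False
    subsecond_latency: bool = False
    simple_tasks: bool = False


def shortlist(req: Requirements) -> list[str]:
    """Return model families worth benchmarking, cheapest tier first."""
    picks = []
    if req.high_volume_cost_bound or req.subsecond_latency or req.simple_tasks:
        picks.append("small-tier (Flash / Haiku / Mini)")
    if (req.typical_context_tokens > 200_000 or req.cross_modal_reasoning
            or req.native_code_execution or req.on_google_cloud):
        picks.append("Gemini 3.1 Ultra")
    if req.instruction_precision or req.on_aws_bedrock:
        picks.append("Claude")
    if req.realtime_voice or req.openai_ecosystem:
        picks.append("GPT-4o")
    # No strong signal either way: benchmark all three flagships.
    return picks or ["Gemini 3.1 Ultra", "Claude", "GPT-4o"]


if __name__ == "__main__":
    print(shortlist(Requirements(typical_context_tokens=500_000,
                                 instruction_precision=True)))
```

When more than one family appears, the criteria conflict and the decision comes down to measured results rather than the matrix.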

Five Product Decisions Gemini 3.1 Ultra Changes

RAG architecture choice

If your knowledge base is under roughly 1,500 documents and changes infrequently, re-evaluate whether you need a vector database at all. Long-context stuffing may simplify your stack, though full-window calls carry their own latency and token cost. Compare both approaches on your actual document count and update frequency before defaulting to RAG.

Multimodal pipeline design

If you are stitching together Whisper, GPT-4V, and document models in a multi-stage pipeline, prototype the same output with a single Gemini 3.1 Ultra call first. Three systems vs. one is a maintenance and latency delta worth measuring before you lock in the architecture.

Analytics assistant products

Native code execution enables a category of data analysis products that previously required LangChain or custom tool layers. If your product helps users understand their own data, this capability lowers your build complexity significantly — prototype before adding infrastructure.

Vendor lock-in posture

Gemini 3.1 Ultra is deeply integrated into Google Cloud: Vertex AI, BigQuery, Workspace. If you are already on GCP, this is a natural gravity pull. If you are multi-cloud or OpenAI-native, the migration cost still needs to clear the capability bar — quantify the delta before switching.

Model evaluation cadence

The frontier model landscape moves fast: Kimi K2.6 launched June 12, Claude Opus 4.8 shipped earlier this month, and Gemini 3.1 Ultra arrived in April. Build a quarterly model benchmarking review into your roadmap process, not because you should switch constantly, but because you should know when a new model changes the math on your current choice.
