TECHNICAL DEEP DIVE

Gemini 3.5 Flash for Product Managers: Benchmarks, Pricing, and When to Use It

By Institute of AI PM · 13 min read · Jul 5, 2026

TL;DR

Released May 19, 2026, Gemini 3.5 Flash offers a 1M-token context window at $1.50 input / $9 output per million tokens, roughly 3x cheaper than GPT-5.6 Sol, while beating Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. It generates tokens about 4x faster than comparable frontier models. Google's stated vision: 3.5 Pro orchestrates, 3.5 Flash executes as a swarm of sub-agents. For AI PMs, that changes how you design multi-agent systems and size your inference budget.

Where Flash Fits in Google's Model Lineup

Google's model family has two axes: tier (Flash, Pro, Ultra) and generation (3.1, 3.5). The confusing part for PMs: Gemini 3.5 Flash now outperforms Gemini 3.1 Pro on several benchmarks. The tier label signals price and speed positioning within a generation, not a fixed capability ceiling across generations. Each new generation trains a smarter Flash, not just a cheaper Pro.

Gemini 3.1 Ultra

Previous flagship

~$10 / $40 per 1M tokens

Highest reasoning ceiling in the 3.1 generation. Still leads on academic reasoning (HLE: 44.4% vs Flash's 40.2%) and long-context retrieval.

Gemini 3.1 Pro

Previous workhorse

~$3.50 / $10.50 per 1M tokens

Outclassed by 3.5 Flash on coding, tool use, and agentic benchmarks. Retains a long-context retrieval edge (MRCR v2: 84.9% vs 77.3%).

Gemini 3.5 Flash

Current efficiency flagship

$1.50 / $9 per 1M tokens

Released May 19, 2026. 4x faster than other frontier models. Beats 3.1 Pro on coding and agentic benchmarks. Google's sub-agent workhorse.

Gemini 3.5 Pro

Upcoming orchestrator

TBD — expected ~$7-12

Google describes it as the planner: 3.5 Pro reasons and delegates, Flash executes. Not publicly available as of July 2026.

The PM takeaway: evaluate models against your actual task distribution, not tier labels. Treating 3.1 Pro as the ceiling for what Flash can do will cause you to over-provision your inference stack on tasks where Flash already wins.

Benchmarks That Inform Product Decisions

Academic benchmarks (MMLU, HLE) measure reasoning on knowledge-heavy questions. Agentic benchmarks measure multi-step coding and tool use, which is closer to what runs in production. Flash performs differently on the two groups, and the split shows which workloads it suits.

Terminal-Bench 2.1 (coding agents)

Product relevance: High. Flash score: 76.2%.

Autonomous code execution, debugging, and environment interaction. Flash beats 3.1 Pro here, the decisive signal for coding agent use cases.

MCP Atlas (tool use)

Product relevance: High. Flash score: 83.6%.

Multi-step tool-use chains via the Model Context Protocol. A direct proxy for performance in tool-heavy agentic pipelines.

CharXiv Reasoning (multimodal)

Product relevance: Medium. Flash score: 84.2%.

Chart and figure comprehension. Relevant if your product processes PDFs, financial charts, dashboards, or slide decks.

Humanity's Last Exam

Product relevance: Low for most. Flash score: 40.2%.

Pure academic reasoning ceiling. Gemini 3.1 Ultra leads at 44.4%. If your use case requires expert-level scientific or legal reasoning, test 3.1 Ultra.

MRCR v2 (long-context retrieval)

Product relevance: Medium. Flash score: 77.3%.

Multi-document retrieval across 128K tokens. 3.1 Pro leads at 84.9%. For precise retrieval from massive documents, run your own eval.

Inference speed

Product relevance: High. Flash result: 4x faster than frontier peers.

Measured in tokens per second. For real-time user-facing features or latency-constrained agentic loops, this is often the decisive factor.

Multimodal Capabilities: What Flash Can Actually Process

Gemini 3.5 Flash is natively multimodal — text, images, video, audio, and PDFs in a single API call. The 1M-token context window (roughly 750,000 words) is one of the largest available at this price point. Here is what each input modality unlocks for product teams:

Text and code

What it handles: Standard LLM generation, code synthesis, structured output, function calling, analysis, and reasoning across the full 1M context.

Example uses: Writing agents, coding agents, document Q&A, summarization pipelines, structured data extraction.

Images

What it handles: Visual understanding, OCR, chart interpretation, UI screenshot analysis, product image descriptions, and visual comparison.

Example uses: Visual QA, catalog enrichment, screenshot-to-spec tools, design feedback agents.

Video (native)

What it handles: Frame-level and temporal reasoning across full video files — no manual frame extraction. Understands sequences of events over time.

Example uses: User session analysis, video meeting summarization, tutorial generation from screen recordings.

Audio

What it handles: Speech transcription, speaker diarization, tone analysis, and audio event detection in a single API call.

Example uses: Call center agents, sales call analysis, interview summarization, voice-driven workflows.

PDFs and documents

What it handles: Native parsing of PDFs including embedded figures, tables, and formatted text. No external preprocessing or OCR pipeline required.

Example uses: Contract analysis, compliance review, research paper Q&A, financial report summarization.
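The 1M-token window and its rough 750,000-word equivalent imply about 0.75 words per token. A minimal sketch of a pre-flight size check built on that ratio follows. The function names and the 8,000-token output reserve are illustrative assumptions, and the ratio is only a heuristic for English prose: code, tables, and non-text inputs tokenize differently, so a production check should rely on the provider's own token counter.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # context size stated in the article
WORDS_PER_TOKEN = 0.75             # implied by "1M tokens is roughly 750,000 words"


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count (a heuristic, not a tokenizer)."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, reserved_output_tokens: int = 8_000) -> bool:
    """True if the estimated input plus a reserve for the reply fits the window.

    reserved_output_tokens is a hypothetical safety margin, not a documented limit.
    """
    return estimate_tokens(word_count) + reserved_output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    for words in (120_000, 600_000, 750_000):
        print(words, estimate_tokens(words), fits_in_context(words))
```

Reserving room for the response matters because a document that fills the whole window on its own leaves no budget for the model's answer.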

When to Use Flash vs a Stronger Model: A Decision Framework

Google's stated architecture for the 3.5 family: 3.5 Pro orchestrates, 3.5 Flash executes. One planner model directing a swarm of fast workers. Until 3.5 Pro is available, Flash competes head-to-head with GPT-5.6 Sol and Claude Sonnet 5 on production workloads. Here is when to route to Flash and when a stronger model is the right call:

High-volume agentic sub-tasks

Use Flash

When your orchestrator spawns dozens of sub-agent calls per session, Flash's 4x speed advantage and 3x cost savings compound. Coding loops, web search chains, and multi-step tool use are the clearest wins.

User-facing, latency-critical features

Use Flash

Real-time chat, autocomplete, and streaming responses where users see tokens as they arrive. Flash's speed makes the product feel faster even when output quality is equivalent.

Multimodal pipelines at scale

Use Flash

Processing thousands of images, audio files, or PDFs daily at near-Pro quality. Unit economics work where they wouldn't at Pro-tier pricing.

Precise retrieval from massive documents

Use 3.1 Ultra or Claude Sonnet 5

Flash's MRCR v2 retrieval score (77.3%) lags 3.1 Pro (84.9%). For use cases requiring exact citation from 500K+ token documents, run your own eval before switching.

Expert-level scientific or legal reasoning

Use 3.1 Ultra

Flash scores 40.2% on HLE vs 44.4% for 3.1 Ultra. Not a large gap, but for medical, legal, or research-grade outputs where accuracy is high-stakes, the ceiling matters.

Orchestrator in a multi-agent system

Wait for 3.5 Pro or use Claude Sonnet 5

Planning, task decomposition, and quality-gating of sub-agent outputs benefit from the highest reasoning ceiling. Use Flash for execution, not planning.
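The six cases above can be expressed as a small routing function. This is a sketch under stated assumptions, not a recommended production router: the `Workload` fields, the model label strings (plain labels, not real API model IDs), the choice to check quality-ceiling cases before speed and cost cases, and the default to Flash for anything unlisted are all illustrative choices made here.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration; these are not real API model IDs.
FLASH = "gemini-3.5-flash"
ULTRA = "gemini-3.1-ultra"
ORCHESTRATOR = "claude-sonnet-5"  # the article suggests 3.5 Pro once it ships


@dataclass(frozen=True)
class Workload:
    high_volume_subtask: bool = False
    latency_critical: bool = False
    multimodal_at_scale: bool = False
    exact_long_doc_retrieval: bool = False
    expert_reasoning: bool = False
    is_orchestrator: bool = False


def route(w: Workload) -> str:
    """Pick a model label for a workload.

    Assumption: cases that need a higher quality ceiling win over
    speed and cost considerations when both apply.
    """
    if w.is_orchestrator:
        return ORCHESTRATOR
    if w.expert_reasoning or w.exact_long_doc_retrieval:
        return ULTRA
    # High-volume sub-tasks, latency-critical features, and multimodal
    # pipelines all route to Flash. Defaulting other workloads to Flash
    # is an assumption of this sketch.
    return FLASH


if __name__ == "__main__":
    print(route(Workload(high_volume_subtask=True)))
    print(route(Workload(expert_reasoning=True, latency_critical=True)))
    print(route(Workload(is_orchestrator=True)))
```

Keeping the rules in one function makes the routing policy easy to review and to revisit when 3.5 Pro becomes available or a shadow eval changes the picture.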

Build Economics: Cost and Latency Numbers

Inference cost determines whether your AI feature has positive unit economics at scale. At 10 million tokens per day — a moderate production load — the gap between Flash and a higher-tier model is the difference between a self-funding feature and one that requires a cost-reduction sprint every quarter.

Standard API cost

$1.50 input / $9.00 output per 1M tokens

Well below GPT-5.6 Sol's $5/$30 while delivering near-Pro quality on coding and agentic tasks. The relevant comparison is capability-per-dollar, not sticker price.

Batch API cost

$0.75 input / $4.50 output per 1M tokens

50% discount for 24-hour batch processing. Built for async workflows: nightly enrichment jobs, bulk document analysis, offline eval pipelines.

Cached input cost

$0.15 per 1M tokens

Prefix caching for repeated system prompts. At 1,000 daily calls with a 10K-token system prompt, caching reduces that prompt's cost by roughly 90%.

vs GPT-5.6 Sol ($5/$30 per 1M)

3.3x cheaper on both input and output

At 10M input tokens/day, Flash costs $15 vs $50 for Sol, a saving of $35 daily or $12,775/year on input tokens alone, before output token savings.
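The arithmetic behind these figures is simple enough to keep in a small script and rerun with your own volumes. The sketch below uses only the prices quoted above; the `Price` type and function names are illustrative, and the caching calculation assumes every call hits the cache and ignores any cache-write or storage fees, which the article does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Prices as quoted in the article.
FLASH = Price(1.50, 9.00)
FLASH_BATCH = Price(0.75, 4.50)
SOL = Price(5.00, 30.00)
FLASH_CACHED_INPUT_PER_M = 0.15


def cost(price: Price, input_tokens: int, output_tokens: int = 0) -> float:
    """USD cost for a given number of input and output tokens."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


# Flash vs Sol at 10M input tokens per day (output tokens excluded).
daily_saving = cost(SOL, 10_000_000) - cost(FLASH, 10_000_000)
annual_saving = daily_saving * 365

# Prefix caching: 1,000 daily calls sharing a 10K-token system prompt.
prompt_tokens_per_day = 1_000 * 10_000
uncached = prompt_tokens_per_day * FLASH.input_per_m / 1_000_000
cached = prompt_tokens_per_day * FLASH_CACHED_INPUT_PER_M / 1_000_000
cache_reduction = 1 - cached / uncached

if __name__ == "__main__":
    print(f"Daily saving vs Sol:  ${daily_saving:,.2f}")
    print(f"Annual saving vs Sol: ${annual_saving:,.2f}")
    print(f"System prompt cost:   ${uncached:.2f} -> ${cached:.2f} "
          f"({cache_reduction:.0%} lower)")
```

Passing a realistic `output_tokens` value to `cost` is worth doing before drawing conclusions, since output is priced six times higher than input for both Flash and Sol.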

Before switching: run a 200-task shadow eval

Run Flash and your current model side by side on 200-500 real production inputs. Measure: output acceptance rate (does your eval system approve outputs at the same rate?), task completion rate (do users reach their goal?), and error recovery rate (how often does Flash fail in ways that are unrecoverable?). Cost savings are real only if quality holds at your specific task distribution, not just on Google's benchmarks.
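One way to tabulate the three measures is sketched below. The `EvalRecord` fields and the `summarize` helper are hypothetical names for illustration, and the error recovery rate is computed here as recovered failures divided by all failures, which is one reasonable reading of the metric rather than a standard definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    model: str
    accepted: bool        # the eval system approved the output
    task_completed: bool  # the user or task reached its goal
    failed: bool          # the run hit an error at some point
    recovered: bool       # the error was recovered from (only meaningful if failed)


def summarize(records: list[EvalRecord], model: str) -> dict[str, Optional[float]]:
    """Acceptance, completion, and error recovery rates for one model."""
    rows = [r for r in records if r.model == model]
    if not rows:
        raise ValueError(f"no records for model {model!r}")
    failures = [r for r in rows if r.failed]
    return {
        "n": float(len(rows)),
        "acceptance_rate": sum(r.accepted for r in rows) / len(rows),
        "task_completion_rate": sum(r.task_completed for r in rows) / len(rows),
        # None when the model never failed, so "no failures" is not
        # confused with "recovered from every failure".
        "error_recovery_rate": (
            sum(r.recovered for r in failures) / len(failures) if failures else None
        ),
    }


if __name__ == "__main__":
    sample = [
        EvalRecord("flash", True, True, False, False),
        EvalRecord("flash", True, False, True, True),
        EvalRecord("current", True, True, False, False),
        EvalRecord("current", False, False, True, False),
    ]
    for name in ("flash", "current"):
        print(name, summarize(sample, name))
```

With only 200 to 500 inputs, small differences between the two models can be noise, so treat close results as a reason to extend the eval rather than as a verdict.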
