Test-Time Compute Explained for Product Managers

The Old Scaling Law and Why It Hit a Wall

From 2018 to 2023, the dominant recipe for better AI was straightforward: train a larger model on more data using more compute. OpenAI's 2020 scaling laws paper quantified the relationship mathematically — double the parameters or the data and model performance improves predictably. GPT-2 to GPT-3 to GPT-4 followed this curve. The industry poured billions into pretraining clusters, and the results kept coming.

By 2024, diminishing returns started showing up. The datasets that matter had mostly been ingested. Training clusters were already costing hundreds of millions per run. And on benchmarks that required genuine multi-step reasoning — math olympiad problems, complex code, scientific question answering — even the largest models plateaued in ways that more training data wouldn't fix.

Why pretraining scaling hits limits

Training a model is a one-shot compression of the internet into weights. The model learns statistical patterns but doesn't get to think carefully about any individual problem. For routine tasks, this is fine. For hard problems requiring multi-step reasoning, a compressed statistical guess is not the same as working through a problem step by step.

The human analogy

A person answering a hard math problem in 0.3 seconds (like an LLM generating a token) will perform far worse than the same person given 10 minutes to work through it. Human experts don't perform hard tasks faster than novices — they think more carefully. Test-time compute gives LLMs the equivalent of that thinking time.

What test-time compute changes

Instead of a model that instantly produces a next token based on learned weights, a test-time compute model generates intermediate reasoning steps, evaluates multiple candidate reasoning paths, and selects the best answer. The same model weights can produce dramatically better results on hard problems when given more inference compute.

How Test-Time Compute Actually Works

The core idea is that the model generates a chain of thought — an extended internal monologue of reasoning steps — before producing a final answer. This is not new: chain-of-thought prompting has been around since 2022. What's new in models like o3 and DeepSeek-R1 is that the model was specifically trained to use this reasoning trace optimally, and the reasoning trace is much longer and more structured than simple chain-of-thought prompting produces.

Process Reward Models (PRMs)

A separate model trained to score the quality of each intermediate reasoning step, not just the final answer. During inference, the main model uses the PRM to evaluate whether its current reasoning path is productive and backtrack if not. PRMs are what separate genuine multi-step reasoning from plausible-sounding chains of thought that arrive at wrong answers.

Outcome Reward Models (ORMs)

A model trained to score the correctness of final answers. Used to evaluate which of multiple candidate solutions is most likely correct. Simpler than PRMs but effective when combined with best-of-N sampling: generate N answers, score each with the ORM, return the highest-scoring one.

Best-of-N Sampling

The simplest test-time compute strategy: run the model N times on the same problem and pick the best answer using a verifier or reward model. Effective for problems with verifiable answers (math, code). N=64 on a smaller model can match a much larger model on math benchmarks at lower overall cost.

Tree Search Over Reasoning Traces

More sophisticated: the model builds a tree of reasoning paths, expanding the most promising branches (like Monte Carlo Tree Search in game-playing AI). This is computationally expensive but allows the model to explore reasoning paths that a linear chain of thought would miss. o3's performance on hard math comes partly from this mechanism.

OpenAI's o-series models generate reasoning tokens that are hidden from the user — you see only the final answer. DeepSeek-R1 makes the reasoning trace visible (in a <think> block), which is useful for debugging and building trust. Gemini 2.5 Thinking exposes summarized reasoning. These are product design choices about what to show users, not fundamental differences in the underlying mechanism.

What Problems Benefit From Test-Time Compute

The gains from test-time compute are not uniform. They are concentrated in tasks that require multi-step reasoning where intermediate errors compound — and where there is a verifiable or evaluable correct answer. The empirical pattern from research is consistent: reasoning models dramatically outperform standard models on hard math, competitive programming, scientific reasoning, and complex multi-step planning. They show modest or no gains on tasks that require knowledge retrieval, creative generation, summarization, or conversation.

High gains from test-time compute

Tasks: Multi-step mathematical reasoning. Competitive programming (HumanEval Hard, LiveCodeBench). Scientific hypothesis generation and evaluation. Complex legal or contract analysis. Multi-hop question answering over long documents. Agentic tasks requiring tool use planning and error recovery.

Why: These tasks have a structure where reasoning step quality determines final answer quality, and where more careful thinking genuinely produces better outcomes.

Low or no gains from test-time compute

Tasks: Factual question answering on well-represented knowledge. Summarization. Creative writing. Translation. Conversational response generation. Simple classification tasks.

Why: These tasks don't benefit from extended reasoning chains because the bottleneck is knowledge retrieval or stylistic judgment, not multi-step deduction. Using a reasoning model for these is paying 5x for no quality improvement.

Where gains are task-difficulty dependent

Tasks: Code generation: simple functions see minimal gains, complex algorithms see large gains. Instruction following: simple instructions see no gains, multi-constraint complex instructions see meaningful gains. Document analysis: short single-document analysis sees little gain, cross-document reasoning sees substantial gains.

Why: The rule of thumb: if a very good human expert would benefit from 10 minutes of careful thought vs. giving an instant answer, the task will benefit from test-time compute.

Apply Technical Depth to Product Decisions

The AI PM Masterclass covers model selection, cost architecture, and capability evaluation — taught live by a Salesforce Sr. Director PM who's shipped AI products at scale.

Cost and Latency: The PM Trade-off Framework

Test-time compute is not free. Reasoning tokens — the internal chain-of-thought tokens the model generates before answering — are billed at the same rate as output tokens, and reasoning models can generate thousands of them per query. A task that takes 200 tokens on GPT-4o might take 3,000 reasoning tokens plus 200 output tokens on o3. At o3's pricing this can be a 15x cost increase for the same task.

Reasoning token volume

Varies 500—8,000+ tokens depending on problem difficulty and model configuration. OpenAI's API exposes a thinking_effort parameter (low/medium/high) that controls the reasoning budget. Anthropic's extended thinking has a configurable thinking token budget. Higher budgets improve quality on hard problems but increase cost and latency proportionally.

Latency profile

Reasoning models are fundamentally slower: time-to-first-token is measured in seconds rather than milliseconds because the model must complete significant reasoning before generating the answer. Streaming helps with perceived latency but doesn't reduce total time. For real-time user-facing features, reasoning models are often too slow unless the use case tolerates multi-second waits.

Quality gain curve

The quality gain from test-time compute follows a curve: beyond a certain reasoning budget, additional thinking tokens produce diminishing returns. For most tasks, medium thinking budget captures 80%+ of the quality gain at 40% of the cost of maximum budget. Tune the thinking budget empirically on your specific task distribution.

Routing economics

The economically optimal architecture routes tasks to the appropriate model by difficulty: a fast/cheap model handles the 70% of tasks that don't need reasoning, a reasoning model handles the 30% that do. A naive all-reasoning-model architecture overpays by 3-5x. The routing logic itself is a product engineering problem.

Product Architecture Decisions for Test-Time Compute

Task difficulty classification

Build a classifier or heuristic that routes incoming tasks by estimated complexity. Simple proxies: task length, presence of mathematical operators, multi-step instruction markers. More sophisticated: a fast small model that estimates task difficulty and routes accordingly. Start with heuristics; add ML routing when you have labeled difficulty data.

Evaluation infrastructure for reasoning tasks

Standard model evals (BLEU, human preference ratings) are insufficient for reasoning tasks. You need evals with verifiable correct answers: math problems with known solutions, coding tasks with test suites, logic puzzles with provable answers. If your task distribution lacks verifiable answers, reasoning model gains are hard to measure.

Thinking trace visibility

Decide whether to expose the reasoning trace to users. Visible traces (like DeepSeek-R1 style) build trust and enable debugging but add noise for non-technical users. Hidden traces (o3 style) are cleaner but make errors harder to diagnose. A middle ground: show reasoning traces in developer/debug mode, hide them in production.

Async vs. synchronous handling

For tasks where reasoning latency (5-30 seconds) is acceptable, consider async handling: accept the request, run the reasoning model in the background, notify when complete. This is better UX than a blocking 20-second spinner for features where immediate response is not required (document analysis, code review, research tasks).

What This Means for AI Product Strategy

Test-time compute changes the model selection landscape in important ways. It's no longer a simple choice between a big expensive model and a small cheap one. The question is: which task-model-budget combination gives you the best quality-per-dollar, and is your evaluation infrastructure good enough to measure it?

Smaller models close the gap on hard tasks

Best-of-N with a small model and a good verifier can match a large model on tasks with verifiable answers at lower total cost. This is a real competitive shift: a startup can access near-frontier quality on specific hard tasks without paying frontier model pricing, using open-source models like DeepSeek-R1 combined with task-specific verifiers.

Hard problems become newly tractable

Tasks that were previously not worth building AI features for — because quality was too inconsistent — may cross the quality threshold with reasoning models. Systematically re-evaluate your declined use cases from 2023-2024. Multi-step document analysis, complex plan generation, and multi-constraint optimization are candidates.

Evaluation becomes a core competency

The more your product relies on reasoning models for hard tasks, the more critical your evaluation infrastructure becomes. You can't route intelligently, tune thinking budgets, or measure quality improvements without task-specific evaluators. Teams that build strong eval infrastructure capture the test-time compute advantage; teams that don't are flying blind.

Model routing is a product feature, not just infrastructure

In products that handle diverse task types, intelligent model routing is a quality differentiator. A feature that routes hard queries to a reasoning model and simple queries to a fast model feels qualitatively better than one that applies the same model uniformly — and costs less. Routing logic should be explicitly designed and owned by the PM, not defaulted to by engineering.