Neural Scaling Laws for Product Managers: How to Plan AI Roadmaps Around Predictable Capability Growth

By Institute of AI PM · 13 min read · May 24, 2026

TL;DR

AI model performance doesn't improve randomly. It follows mathematical scaling laws: double the compute, and loss drops by a predictable amount. The 2022 Chinchilla paper from DeepMind overturned the prior assumption that more parameters always win, showing that data and parameters should scale together for compute-optimal training. In 2026, frontier labs use scaling laws to plan capability roadmaps years ahead. For AI PMs, understanding scaling laws changes how you time product bets, set stakeholder expectations about when AI "gets good enough," evaluate model provider claims, and decide when to build now vs. wait for the next generation. This guide explains what scaling laws are, what the Chinchilla era means, and how to apply these insights to your product strategy.

What Scaling Laws Actually Say

Neural scaling laws describe how model performance improves as a function of three variables: the number of model parameters (N), the number of training tokens (D), and the amount of training compute (C, measured in FLOPs). The key finding from the 2020 Kaplan et al. paper at OpenAI: loss decreases as a smooth power law across seven orders of magnitude of compute. The improvement is predictable and continuous — not random jumps.

Parameters (N)

Larger models have more capacity to learn patterns. Scaling from 1B to 100B parameters reliably improves loss on language tasks. But parameters alone aren't the whole story — they need to be matched with sufficient training data to reach potential.

PM implication: Model size is a rough proxy for capability ceiling. A 7B-parameter model has limited capacity, so at some point additional training alone stops buying meaningful improvement.

Training Tokens (D)

More training data means the model sees more of the distribution it needs to model. The Chinchilla paper showed prior models were massively undertraining — they had too many parameters for how little data they were trained on.

PM implication: In 2026, frontier-class models train on 15T+ tokens (Meta reported more than 15T for Llama 3; token counts for GPT-4-class closed models are not public). Data quantity is now the binding constraint for many labs, not compute.

Compute (C ≈ 6ND)

Total training compute is approximately 6 × parameters × tokens. This is the unified lens for comparing training runs. A fixed compute budget can be spent on a big model trained briefly, or a smaller model trained longer — scaling laws tell you the optimal split.
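As a rough illustration of the C ≈ 6ND rule, the sketch below plugs in the parameter and token counts quoted in this article for GPT-3, Gopher, and Chinchilla. The function name `training_flops` is made up for this example, and the results are back-of-envelope estimates, not figures reported by the labs.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs using C ~= 6 * N * D."""
    return 6.0 * params * tokens


# (parameters, training tokens) as quoted in this article.
runs = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in runs.items():
    flops = training_flops(n, d)
    # Tokens per parameter shows how data-heavy each recipe was.
    print(f"{name:<10} C ~= {flops:.2e} FLOPs, tokens per parameter ~= {d / n:.1f}")
```

On these inputs Gopher and Chinchilla land in the same ballpark of compute (roughly 5e23 to 6e23 FLOPs), while their tokens-per-parameter ratios differ by more than an order of magnitude.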

PM implication: When a provider says they used 10x more compute for their next model, you can estimate roughly what performance improvement to expect. It won't be 10x better — improvement is sublinear.
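To make "sublinear" concrete, here is a minimal sketch that assumes loss follows a pure power law in compute, L(C) ∝ C^(-alpha). The exponent `alpha = 0.05` is an assumption, chosen to be close to the compute exponent reported by Kaplan et al.; real fits differ by model family and dataset, and this ignores any irreducible loss floor.

```python
def loss_ratio(compute_multiplier: float, alpha: float = 0.05) -> float:
    """Return new_loss / old_loss under the assumed power law L(C) ~ C ** -alpha."""
    return compute_multiplier ** -alpha


for mult in (2, 10, 100):
    ratio = loss_ratio(mult)
    # (1 - ratio) is the fractional drop in loss for this compute multiplier.
    print(f"{mult:>4}x compute -> loss falls by about {(1 - ratio) * 100:.1f}%")
```

Under this assumption, 10x compute trims pre-training loss by roughly 11%. How that translates into benchmark scores or product quality is a separate question, which is why provider claims deserve task-level evaluation.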

The practical upshot: frontier AI capabilities are not magic. They are the result of scale applied systematically. Labs with bigger compute budgets can predict, with reasonable accuracy, what their next model will be capable of — before they train it.

The Chinchilla Finding: Why Bigger Isn't Always Better

The 2022 paper "Training Compute-Optimal Large Language Models" (Hoffmann et al., DeepMind), known as the Chinchilla paper after its compute-optimal model, overturned the prevailing assumption about how to spend a training compute budget.

Before Chinchilla: the field assumed that given a fixed compute budget, you should put most of it into building the largest possible model rather than into more training data. GPT-3 (175B parameters, ~300B tokens) and Gopher (280B parameters, ~300B tokens) followed this logic.

Chinchilla's finding: for compute-optimal training, you should scale model parameters and training tokens equally. For every doubling of model size, you should also double the training data. By that standard, Gopher (280B parameters, 300B tokens) was dramatically undertrained. Chinchilla (70B parameters, 1.4T tokens) was trained with 4x fewer parameters but more than 4x the data at roughly the same compute budget, and it outperformed Gopher on almost every benchmark.
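The equal-scaling rule can be turned into a quick budget split. The sketch below assumes C ≈ 6ND together with a fixed ratio of about 20 training tokens per parameter, which is the ratio implied by Chinchilla's own figures (1.4T tokens for 70B parameters). Both the ratio and the helper name `compute_optimal_split` are assumptions for illustration, not the paper's fitted formula.

```python
import math


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens), assuming C ~= 6*N*D and D = r*N."""
    # Substituting D = r*N into C = 6*N*D gives C = 6*r*N**2, so N = sqrt(C / (6*r)).
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Chinchilla's approximate budget, derived from the figures quoted above.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"params ~= {n / 1e9:.0f}B, tokens ~= {d / 1e12:.1f}T")
```

Feeding in a 10x larger budget raises both parameters and tokens by about 3.2x (the square root of 10), which is what "scale them together" means in practice.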

What changed immediately

After Chinchilla, major labs recalibrated their training recipes. The Chinchilla-optimal ratio works out to roughly 20 training tokens per parameter (1.4T tokens for a 70B model). Llama, Mistral, and other efficient models adopted data-heavy recipes in that spirit, and in some cases trained well past that ratio to make smaller models cheaper to serve. Smaller, more data-efficient models began outperforming their larger predecessors.

The data constraint emerges

Chinchilla-optimal training at frontier scale requires more data than the internet contains in high-quality form. Labs are now data-constrained, not compute-constrained. This is why synthetic data generation, data curation, and data quality have become first-tier research priorities.

Parameter efficiency becomes the metric

The new benchmark is intelligence per parameter, not raw intelligence. Gemma 4 (released May 2026) explicitly positions itself on this axis: competitive capability at dramatically lower parameter counts. This shifts the competitive landscape for open-source models.

Post-training gains scale separately

Chinchilla scaling laws apply to pre-training loss. Post-training improvements — RLHF, instruction tuning, tool use, chain-of-thought — can significantly improve real-world performance beyond what the pre-training loss predicts. Claude 3.5 vs. 3.0 is a post-training story, not just a scaling story.

Emergent Abilities: The Non-Linear Surprises

Scaling laws predict smooth, continuous improvement in aggregate loss. But they don't fully capture emergent abilities: capabilities that appear abruptly at specific scale thresholds and are essentially absent below them. Wei et al. (2022) documented these in a landmark paper, and a companion list compiled by Wei counts 137 of them.

Emergent abilities are the reason AI PMs can't just extrapolate linearly from today's model to next year's model. Some capabilities won't exist at GPT-3 quality and then suddenly work reliably at GPT-4 quality. Planning around this distinction matters.

Multi-step arithmetic reasoning

Scale threshold: Appears reliably around 100B+ parameter scale

PM implication: Below this threshold, prompting models to do multi-step math produces frequent errors. Above it, chain-of-thought prompting becomes a reliable product primitive. This is one reason AI coding assistants and analytical tools improved so dramatically between 2023 and 2024.

In-context learning (few-shot)

Scale threshold: Appears meaningfully above ~10B parameters; strong above 100B

PM implication: Small models show almost no in-context learning. Large models learn a new task from 3-5 examples without any fine-tuning. This is what makes few-shot prompting a real product capability, not just a research demo.

Instruction following (complex, multi-constraint)

Scale threshold: Requires SFT + RLHF on top of a large enough base model

PM implication: The reason Claude and GPT-4 follow complex instructions reliably while smaller models don't is both scale and post-training. The base capability emerges with scale; SFT and RLHF harness and direct it.

Tool use and function calling

Scale threshold: Reliably deployable only in models above ~70B parameters (or heavily fine-tuned smaller models)

PM implication: Agentic workflows require tool use. The reliability bar — low error rate, correct argument formatting, graceful failure — only became achievable at scale. Planning agent products for small models should come with careful eval gating.
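One way to picture "eval gating" for tool use is a launch check that only passes when a candidate model clears a reliability bar on a fixed evaluation set. Everything in the sketch below is hypothetical: the `ToolCallResult` fields, the 95% default threshold, and the sample data are placeholders to be replaced with your own eval harness and targets.

```python
from dataclasses import dataclass


@dataclass
class ToolCallResult:
    valid_arguments: bool  # arguments parsed and matched the tool schema
    correct_tool: bool     # the model picked the expected tool


def passes_gate(results: list[ToolCallResult], min_pass_rate: float = 0.95) -> bool:
    """Return True when the share of fully correct tool calls meets the bar."""
    if not results:
        return False  # no evidence means no launch
    passed = sum(r.valid_arguments and r.correct_tool for r in results)
    return passed / len(results) >= min_pass_rate


# Hypothetical eval run: 97 clean calls and 3 with malformed arguments.
run = [ToolCallResult(True, True)] * 97 + [ToolCallResult(False, True)] * 3
print(passes_gate(run))                      # 97% clears the default 95% bar
print(passes_gate(run, min_pass_rate=0.99))  # but not a stricter 99% bar
```

The useful design choice is that the threshold lives in one place, so the same gate can be rerun unchanged whenever a smaller or newer model becomes a candidate.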

How to Apply Scaling Laws to Your AI Product Roadmap

Scaling laws give you something rare in product planning: a principled basis for capability forecasting. Here's how to translate the research into product decisions.

Time your feature bets around capability thresholds

If a feature requires reliable multi-step reasoning or complex instruction following, identify the capability threshold it needs and align your launch with models that have crossed it. Building a complex code review agent on a model one generation too small means shipping a product that undermines trust before the capability matures.

Example: A legal document summarization feature that requires maintaining cross-reference accuracy over 50+ pages is plausible on Claude 3.5 class models but inconsistent on GPT-4o mini class models. Launching on the wrong model costs you trust.

Set realistic stakeholder expectations for capability timelines

Scaling laws tell you that the next generation of frontier models will be meaningfully better, but improvements are sublinear relative to compute increases. A 10x compute investment might produce something like a 2-3x improvement on some benchmarks, not 10x, and the gain varies widely by task. Stakeholders who expect linear capability improvement will be disappointed.

Example: When your CEO asks "can we do this with next year's model?", scaling laws let you answer with more than speculation. If the task sits just beyond a capability threshold that current models are close to crossing, the answer is plausibly yes in 12-18 months, though the exact timing of emergent abilities is hard to call. If it requires 100x better performance, the answer is probably no in that window.

Evaluate model provider claims using scaling fundamentals

When a model provider claims their new model is dramatically better, scaling laws give you a sanity check framework. Improvements beyond what compute scaling predicts require architectural innovation, better post-training, or proprietary data advantages — all of which are real but harder to sustain.

Example: If a new model claims 4x better performance on your benchmark with only 2x the parameters, ask: is this a pre-training win or a post-training win? Post-training gains are real but often don't generalize to out-of-distribution tasks.

Use cost curves to plan margin improvement

Compute cost per unit of capability has been dropping approximately 10x every two years historically. The same benchmark performance that cost $1M to train in 2022 costs $10K in 2026. Build your pricing and margin models around this trajectory — capabilities that are expensive now become cheap enough to offer in lower tiers within 18-24 months.

Example: A feature requiring GPT-4-class quality in 2023 cost roughly $0.06 per 1K output tokens. Comparable capability delivered by a self-hosted Llama 3.1 70B in 2026 costs under $0.001 per 1K tokens. Margin trajectory, not current margin, determines long-run product economics.
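The cost trajectory above is simple enough to put in a planning spreadsheet or a few lines of code. This sketch assumes the article's historical rate (about 10x cheaper every two years) continues as a smooth exponential, which is a planning assumption and not a guarantee. The function names and the $0.01 target price are hypothetical.

```python
import math


def projected_cost(cost_now: float, years: float, drop_per_two_years: float = 10.0) -> float:
    """Cost after `years`, assuming a constant N-fold decline every two years."""
    return cost_now / drop_per_two_years ** (years / 2.0)


def years_until(cost_now: float, target: float, drop_per_two_years: float = 10.0) -> float:
    """Years until cost falls to `target` under the same assumption."""
    if target >= cost_now:
        return 0.0  # already at or below the target
    return 2.0 * math.log(cost_now / target) / math.log(drop_per_two_years)


# The article's example: $1M in 2022, projected four years forward to 2026.
print(f"${projected_cost(1_000_000, 4):,.0f}")
# Hypothetical: how long until $0.06 per 1K tokens drops to a $0.01 target?
print(f"{years_until(0.06, 0.01):.1f} years")
```

Running both functions with a slower decline (say 5x per two years) is a cheap way to turn a single point estimate into a range for margin planning.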

What Scaling Laws Don't Tell You

Scaling laws are powerful, but they model a specific thing: pre-training loss on token prediction. A lot of what makes AI products succeed or fail sits outside that variable. Don't let scaling law fluency become false certainty about where AI is headed.

Data quality doesn't scale linearly with data quantity

Scaling laws assume IID (independent and identically distributed) data. Real training datasets have repetition, noise, and quality variance. Throwing more low-quality data at a model doesn't follow the clean power-law curves. Data curation — what you train on — matters as much as how much.

Post-training can outperform raw scale

RLHF, instruction tuning, and DPO can move model behavior dramatically without changing scale. Claude 3.7 vs. 3.5 is partly an alignment and post-training story. Scaling laws predict pre-training capability floors, not the ceiling you can reach with sophisticated post-training.

Emergent abilities are hard to predict in advance

We identify emergent abilities in retrospect. No scaling law predicted that ~100B parameters would unlock chain-of-thought reasoning before it was observed empirically. Your product roadmap may need to account for capability surprises — in both directions.

Test-time compute changes the equation

Inference-time scaling (chain-of-thought, extended reasoning, o1/o3 style models) shows that you can trade inference compute for quality without retraining. This creates a new axis of capability improvement that doesn't fit the standard N/D/C framework.
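A toy model helps show why spending more inference compute can buy quality. The sketch below uses one simple form of inference-time scaling: sample several answers and take a majority vote. It assumes a binary right-or-wrong task and statistically independent samples, which real model outputs only approximate, and it is not a description of how o1/o3-style models work internally.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    p is the single-sample accuracy; n must be odd so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    need = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Hypothetical single-sample accuracy of 70%.
for n in (1, 5, 25):
    print(f"{n:>2} samples -> accuracy ~= {majority_vote_accuracy(0.7, n):.3f}")
```

Accuracy rises with the number of samples while the model itself is unchanged, and inference cost rises roughly in proportion to `n`. That trade-off is the product decision: which requests are worth the extra tokens.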

Architecture innovations break the curve

Mixture of Experts, selective state spaces (Mamba), and other architectural changes can deliver capability improvements that exceed what parameter count alone predicts. Scaling laws hold within an architecture family but reset when the architecture changes.

Benchmark saturation obscures real-world progress

When models approach 90%+ on standard benchmarks like MMLU, the remaining gains look small in percent terms but are enormous in practical impact. Scaling laws applied to saturated benchmarks underestimate real-world capability growth at the frontier.
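A quick way to see past saturated scores is to compare error rates instead of accuracy. The sketch below is plain arithmetic; the accuracy values are illustrative, not measured results for any particular model.

```python
def error_reduction(acc_before: float, acc_after: float) -> float:
    """Factor by which the error rate shrinks between two accuracy scores."""
    if acc_after >= 1.0:
        raise ValueError("error rate is zero; the reduction factor is undefined")
    return (1.0 - acc_before) / (1.0 - acc_after)


# Illustrative scores: small gains in accuracy, large drops in errors.
print(f"90% -> 95%: {error_reduction(0.90, 0.95):.1f}x fewer errors")
print(f"95% -> 99%: {error_reduction(0.95, 0.99):.1f}x fewer errors")
```

For a product where every error means a support ticket or a lost user, the error-rate view is usually the one that matters.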

The Right Mental Model

Scaling laws give you a baseline forecast for capability growth. Think of them like macroeconomic growth projections — useful for directional planning, accurate enough to make better decisions than intuition alone, but wrong in specific timing and magnitude. Use them to set ranges, not point estimates. Build in optionality: design your architecture to upgrade models without reengineering the product when the next capability threshold is crossed.
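"Design your architecture to upgrade models" can be as simple as routing every feature through one indirection layer. The sketch below is a minimal illustration of that idea. `ModelRouter`, the feature name, and the stand-in lambda "models" are all hypothetical; a real system would wrap a provider SDK or a self-hosted endpoint behind the same callable shape.

```python
from typing import Callable, Dict

# A "model" here is any callable that maps a prompt to a completion.
ModelFn = Callable[[str], str]


class ModelRouter:
    """Keeps product code independent of which model serves each feature."""

    def __init__(self) -> None:
        self._models: Dict[str, ModelFn] = {}
        self._feature_to_model: Dict[str, str] = {}

    def register(self, name: str, fn: ModelFn) -> None:
        self._models[name] = fn

    def assign(self, feature: str, model_name: str) -> None:
        if model_name not in self._models:
            raise KeyError(f"unknown model: {model_name}")
        self._feature_to_model[feature] = model_name

    def run(self, feature: str, prompt: str) -> str:
        return self._models[self._feature_to_model[feature]](prompt)


# Stand-in functions that only tag the prompt, so the example runs offline.
router = ModelRouter()
router.register("current-gen", lambda p: f"[current-gen] {p}")
router.register("next-gen", lambda p: f"[next-gen] {p}")

router.assign("summarize", "current-gen")
print(router.run("summarize", "Summarize this contract."))
router.assign("summarize", "next-gen")  # the upgrade is a one-line change
print(router.run("summarize", "Summarize this contract."))
```

Pairing a layer like this with a fixed evaluation set lets you re-test each feature whenever a new model ships and switch only where it clears the bar.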

The 2026 Scaling Landscape: What's Actually Happening

The current frontier in 2026 reflects several scaling dynamics converging simultaneously. Understanding the state of play helps you calibrate your planning assumptions.

Data is now the binding constraint at frontier scale

Frontier labs have more GPU compute than high-quality training data to fill it with. Synthetic data generation (using models to generate training data), web crawl quality filters, and proprietary dataset partnerships are the new frontier of capability advantage — not raw compute.

Inference-time scaling is the new training-time scaling

Reasoning models such as o1, o3, and Gemini's thinking variants show that you can trade inference compute (spending more tokens thinking) for dramatically better performance on hard reasoning tasks. This creates a capability axis that doesn't follow traditional training scaling laws and rewards different product architectures.

Open-source models are tracking frontier within 12-18 months

Llama 3.1 405B, Gemma 4, Mistral Large — open-source models are now within one model generation of closed frontier models for most tasks. Scaling laws apply to open-source too, and the gap is closing predictably. Products built on open-source today get a roadmap toward frontier quality.

Capability cost is deflationary at 10x per two years

The compute cost for a given capability level has dropped approximately 10x every 24 months historically. GPT-3 level performance now runs on a $50/month server. Plan your product economics around the assumption that today's expensive capability is tomorrow's commodity tier.
