TECHNICAL DEEP DIVE

Llama 4 for Product Managers: Scout, Maverick, and Behemoth Explained

By Institute of AI PM·14 min read·Jun 2, 2026

TL;DR

Meta's Llama 4 family — Scout (10M-token context), Maverick (1M context, 128 experts), and the still-unreleased Behemoth — is the most capable open-weight model family available as of mid-2026. For AI PMs, it means genuine parity with proprietary APIs on many tasks, plus full data control and no per-token billing. Scout dominates long-context retrieval. Maverick competes with GPT-4o on generation quality. The 700M MAU licensing threshold and Behemoth's continued delay are the two constraints that change the calculus for consumer products and the highest-stakes use cases.

The Llama 4 Family: What Actually Shipped

Meta released the Llama 4 family in April 2025 — the first generation of natively multimodal open-weight models from Meta. Unlike Llama 3, which was text-only at launch, Llama 4 processes images and text together from day one. The family has four members: two publicly available, one previewed but not yet released, and one closed-weight model that signals a strategic shift from Meta.

Llama 4 Scout

109B total / 17B active / 16 experts10M tokens context

Available

The long-context retrieval specialist. Scout's 10M-token context window is the largest on any open-weight model available today. Purpose-built for document analysis, code repository ingestion, and multi-document synthesis. The 17B active parameter count means it runs efficiently on a single H100 node despite the massive total parameter count.

Llama 4 Maverick

~400B total / 17B active / 128 experts1M tokens context

Available

The generation quality model. 128 experts give Maverick broader capability coverage — it matches GPT-4o on most generation benchmarks while remaining fully open-weight. Active parameter count stays at 17B per token (same as Scout), so inference cost is comparable, but routing overhead is higher.

Llama 4 Behemoth

~2T total / 288B active / 16 expertsTBD context

Not released

The teacher model. Previewed in April 2025 as the model that trained Maverick and Scout via knowledge distillation. As of June 2026, it has not shipped publicly — no formal cancellation, no release date. Do not build product plans around Behemoth availability.

Meta Muse Spark

Closed weightsTBD context

API-only (Apr 2026)

Meta's first closed-weight frontier model, built by Meta Superintelligence Labs under Alexandr Wang. Native multimodal reasoning, visual chain-of-thought, and multi-agent tool use. Signals that Meta now runs open-weight and closed-weight strategies in parallel — a strategic pivot from the Llama 1-3 era.

Scout vs. Maverick: Choosing the Right Model

The Scout/Maverick decision is the most common Llama 4 choice AI PMs face. Both models have the same active parameter count (17B) and comparable inference cost, but they're optimized for different workloads. Getting this wrong wastes compute or constrains quality on the tasks that matter most to your product.

Pick Scout for long-context retrieval

Processing a 500-page legal contract, ingesting a 200-file codebase, or answering questions over a 90-day email thread. Scout's 10M context window handles these without chunking or retrieval architecture. It's less capable than Maverick on open-ended generation but dominant on long-document tasks.

Pick Maverick for generation quality

Producing polished summaries, writing code, reasoning across domains, or handling diverse user requests. Maverick's 128-expert MoE gives it broader capability coverage. For most product use cases where context fits under 1M tokens, Maverick is the correct default.

Don't assume Scout is cheaper

Scout and Maverick have the same 17B active parameter count — inference cost per token is roughly equivalent. Scout's advantage is context capacity, not cost. Very long Scout queries cost more in absolute terms because you're processing up to 10M tokens per request.

Scout's 10M window isn't magic on arbitrary long inputs

Scout was trained specifically for long-context retrieval. It excels at needle-in-haystack retrieval and multi-document Q&A. It's not automatically better at long-form generation than Maverick. The right mental model: use Scout for retrieval, Maverick for synthesis.

Quick decision rule

Context length is the primary bottleneck (>100K tokens)? Use Scout. Generation quality or task diversity is the bottleneck? Use Maverick. Need frontier quality above both? Evaluate Muse Spark or a proprietary frontier API.

The Open-Weight Advantage: What It Actually Gives Product Teams

“Open-weight” gets used loosely. Here's what it specifically means for product decisions — and what the limitations are.

1

Full data control

Prompts, inputs, and outputs never leave your infrastructure. For healthcare, legal, finance, and enterprise products handling confidential data, this eliminates the data-processing agreement friction that blocks enterprise deals. This is Llama 4's biggest competitive advantage over GPT-4o and Claude for regulated verticals.

2

No per-token billing at inference

You pay for compute (GPU time), not tokens. At high volume — tens of millions of requests per day — owning your inference stack is typically 5–10x cheaper than API pricing. At low volume, proprietary APIs are usually cheaper because there's no idle GPU cost. The crossover is roughly 50M tokens per day on modern infrastructure.

3

Fine-tuning on proprietary data

You can fine-tune Llama 4 on proprietary data without sending it to a third party. Fine-tuned Maverick outperforms base GPT-4o on most narrow domain benchmarks with 10K–100K training examples. LoRA fine-tuning on Maverick runs on a single A100 in 2026.

4

Pinned, reproducible behavior

Proprietary API models change behavior with each update — sometimes dramatically. Open weights are pinned: the model you deploy in January is the model running in December. For regulated products and high-stakes workflows, behavioral stability is not a nice-to-have.

5

Ecosystem and portability

Llama models have the largest open-weight ecosystem: Ollama, vLLM, Together AI, Fireworks AI, and Groq all support Llama 4. You can run locally, on managed inference, or on your own cluster — and switch between providers without changing model versions or rewriting integrations.

Turn Model Knowledge Into Product Decisions

The AI PM Masterclass covers how to evaluate, select, and build on foundation models — including open-weight models like Llama 4. Taught live by a Salesforce Sr. Director PM.

Licensing Realities: The 700M MAU Threshold

Llama 4 is not MIT-licensed. The Llama 4 Community License permits commercial use with one hard constraint: companies that had more than 700 million monthly active users at the time of Llama 4's April 2025 release need a separate license from Meta. Below 700M MAU, commercial use is permitted without additional approval. Here's how this plays out across common product scenarios.

B2B SaaS (any size)

The vast majority of enterprise and SMB SaaS products are well under the threshold. Standard Community License covers commercial use with no additional approval needed.

Permitted

Consumer apps under 700M MAU

Most consumer apps, even very successful ones, are below this threshold. Fine to build on Llama 4 commercially.

Permitted

Consumer apps at Google/Apple/Meta scale

Companies that had over 700M MAU in April 2025 need a separate commercial agreement with Meta. The standard Community License does not cover this tier.

License required

Distributing fine-tuned derivatives

You can fine-tune and distribute fine-tuned Llama 4 models. Distributions must include the Llama 4 model name in the derivative's name and comply with Meta's acceptable use policy.

Permitted with terms

Running Llama 4 as a commercial API

Hosting Llama 4 inference for paying customers is allowed under the Community License, provided you're under the 700M MAU threshold.

Permitted

When to Use Llama 4 vs. Proprietary APIs

The build vs. buy question for foundation models has a cleaner answer in 2026 than in 2023. Llama 4 is genuinely frontier-quality on most benchmarks. Here's when the calculus tilts toward open-weight vs. proprietary, and the common mistakes teams make in each direction.

Use Llama 4: Data sensitivity is high

Healthcare records, legal documents, financial data, HR data. Any vertical where sending data to a third-party API creates compliance risk. Running Llama 4 on your own infrastructure eliminates this risk entirely and simplifies your enterprise security review.

Use Llama 4: Volume justifies infrastructure

At 50M+ tokens per day, self-hosted Llama 4 on spot GPU instances is typically 5–10x cheaper than API pricing. The infrastructure investment pays off within months at enterprise scale. Below that volume, APIs are usually cheaper inclusive of engineering overhead.

Use Llama 4: Model stability matters

Regulated products, high-stakes workflows, or anywhere behavioral drift is unacceptable. Open weights give you pinned, reproducible behavior — no surprise performance changes from upstream provider updates.

Use proprietary APIs: Frontier capability required

Muse Spark, GPT-5.5, and Claude Opus 4 still lead on the hardest multi-step reasoning tasks. If your product depends on that capability ceiling, proprietary APIs are the right call until open-weight models close the gap.

Use proprietary APIs: Latency is critical

For real-time sub-500ms use cases, managed APIs have optimized inference infrastructure most teams can't match. Unless you're at significant scale with a dedicated MLOps team, managed APIs win on tail latency.

Use proprietary APIs: Speed to market

No GPU provisioning, no inference optimization, no MLOps overhead. For MVPs, prototypes, and early-stage products, API-first is the right default. Switch to self-hosted Llama 4 when volume justifies the investment.

The practical answer for most teams in 2026: start with proprietary APIs to validate your product, then migrate specific high-volume or data-sensitive workloads to self-hosted Llama 4 as the business case becomes clear. Treating this as an either/or decision is a mistake — the best AI stacks in 2026 use both.

Build on the Right Foundation Model

The AI PM Masterclass covers foundation model evaluation and selection — including open-weight vs. proprietary tradeoffs. Taught live by a former Apple Group PM and Salesforce Sr. Director PM.