Mixture of Experts (MoE) Explained for Product Managers
TL;DR
Mixture of Experts (MoE) is the architecture technique behind Mixtral, DeepSeek-V3, and (reportedly) GPT-4: models with hundreds of billions of total parameters that run at the cost of much smaller models. Each token is routed to a small subset of "expert" sub-networks instead of through the full model. The result: the capacity of a giant model at the inference cost of a midsize one. PMs who understand MoE can reason about the model selection, latency, and infrastructure choices that increasingly define the field.
The Core Idea
A standard transformer activates every parameter for every token. An MoE transformer replaces its feed-forward layers with many parallel "expert" subnetworks, but routes each token through only a few of them (Mixtral, for example, activates 2 of 8). Total parameters: huge. Active parameters per token: small. The model has the capacity to specialize across many domains while paying inference cost proportional to the active subset.
Total parameters
The full model size — what shows up in marketing. Mixtral 8x22B has 141B total parameters; DeepSeek-V3 has 671B.
Active parameters
What actually runs per token. Mixtral 8x22B activates ~39B; DeepSeek-V3 activates ~37B. This is what determines speed and cost.
Router (gating network)
A small network that decides which experts each token goes to. Trained jointly with the rest of the model.
Experts
Independent feed-forward subnetworks. Each ends up specializing — some on code, some on language, some on math — without explicit instruction.
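The routing mechanics above can be sketched in a few lines of illustrative Python. This is a toy, not a real model: the expert and router weights are random, each "expert" is a single elementwise layer standing in for a feed-forward block, and the dimensions are tiny. What it shows is the essential loop: score experts, keep the top k, softmax the kept scores into mixing weights, and run only the chosen experts.

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 8   # parallel "expert" subnetworks (Mixtral 8x7B has 8)
TOP_K = 2         # experts activated per token (Mixtral activates 2)
DIM = 4           # toy hidden size

# Toy stand-ins for trained weights: each expert is one elementwise layer,
# and the router is one linear layer scoring each expert for a token.
expert_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]

def expert_forward(e, x):
    # In a real model this is a full feed-forward block.
    return [w * xi for w, xi in zip(expert_weights[e], x)]

def moe_layer(x):
    # 1. Router scores every expert for this token.
    scores = [sum(w * xi for w, xi in zip(router_weights[e], x))
              for e in range(NUM_EXPERTS)]
    # 2. Keep only the top-k experts.
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    # 3. Softmax over the kept scores gives mixing weights.
    exps = [math.exp(scores[e]) for e in top]
    gates = [v / sum(exps) for v in exps]
    # 4. Weighted sum of the chosen experts' outputs.
    #    The other NUM_EXPERTS - TOP_K experts never run: that is the saving.
    out = [0.0] * DIM
    for g, e in zip(gates, top):
        y = expert_forward(e, x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

output, chosen = moe_layer([0.5, -1.0, 0.25, 2.0])
print(chosen)  # which 2 of the 8 experts this token activated
```

Different input tokens land on different `chosen` experts, which is exactly why total and active parameter counts diverge.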
Why MoE Changes the Economics
Pre-MoE, scaling capacity meant scaling everything: more parameters, more compute per token, more cost. MoE decouples capacity from inference cost. You can ship a model with the smarts of GPT-4 without paying GPT-4 prices for every token. That changes pricing, latency budgets, and competitive dynamics across the industry.
Cheaper frontier-quality inference
MoE models like DeepSeek-V3 and Mixtral deliver near-frontier quality at a fraction of the per-token cost of dense frontier models.
More capability for the same dollars
If your budget supports a 70B dense model, an MoE with 200B+ total parameters can run at similar speed with better capability, because cost tracks active parameters, not total.
Better latency per quality unit
Active parameters drive latency. MoE's smaller active count means faster generation than equivalent-quality dense models.
Compounding open-source advantage
Open MoE models (Mixtral, DeepSeek) are closing the gap with closed frontier models faster than dense scaling did.
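The decoupling above is easy to see with back-of-envelope arithmetic. A common rule of thumb is roughly 2 FLOPs per active parameter per generated token, so per-token compute for DeepSeek-V3 (671B total, 37B active) lands at about half that of a 70B dense model, despite nearly 10x the total capacity. The 2-FLOPs constant is a simplification, but it cancels out of the comparison anyway:

```python
# Back-of-envelope: inference compute scales with ACTIVE parameters.
# Rule of thumb (an approximation): ~2 FLOPs per active parameter per token.
def flops_per_token(active_params_billions):
    return 2 * active_params_billions * 1e9

dense_70b = flops_per_token(70)    # dense model: all 70B parameters run
deepseek_v3 = flops_per_token(37)  # MoE: only 37B of 671B total run

ratio = deepseek_v3 / dense_70b
print(f"DeepSeek-V3 per-token compute vs. 70B dense: {ratio:.0%}")  # ~53%
```

Memory is the caveat: all 671B parameters still have to be loaded (see the tradeoffs below), so the compute saving does not translate into an equally smaller GPU footprint.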
The Tradeoffs MoE Doesn't Solve
MoE isn't free. The wins come with real complexity. Memory, batching, and routing all become harder, and self-hosting MoE is meaningfully more involved than self-hosting a dense model of comparable active size.
Memory pressure
Total parameters still need to fit in GPU memory. A 671B MoE model needs serious hardware even though only 37B activate per token.
Batching complexity
Different tokens route to different experts. Dynamic load balancing across experts is harder than uniform batching.
Training instability
Routing collapse — where most traffic ends up at a few experts — is a real failure mode. Auxiliary loss functions help, but tuning is delicate.
Latency variance
Some tokens go to crowded experts, others to sparse ones. Without careful balancing, latency variance is higher than in dense models.
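The auxiliary losses mentioned under "Training instability" typically follow the pattern popularized by the Switch Transformer: penalize the product of each expert's traffic share and its mean router probability, which is minimized when routing is uniform. A toy sketch (the traffic numbers here are illustrative, not from any real training run):

```python
# Sketch of a Switch-Transformer-style load-balancing auxiliary loss.
# f[i]: fraction of tokens actually routed to expert i
# p[i]: mean router probability assigned to expert i
# The loss equals 1.0 for perfectly uniform routing and grows as
# routing collapses onto a few experts.
def load_balance_loss(f, p):
    n = len(f)
    return n * sum(fi * pi for fi, pi in zip(f, p))

uniform = [0.25, 0.25, 0.25, 0.25]          # 4 experts, balanced traffic
collapsed = [0.85, 0.05, 0.05, 0.05]        # one expert hoards the tokens

print(load_balance_loss(uniform, uniform))      # 1.0 -> no penalty
print(load_balance_loss(collapsed, collapsed))  # ~2.92 -> collapse penalized
```

Adding a small multiple of this term to the training loss nudges the router toward spreading tokens out; the "delicate tuning" in the text is largely about picking that multiplier, since too much balancing pressure hurts quality.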
Lead Architecture Conversations With Confidence
The AI PM Masterclass walks through MoE, attention variants, quantization, and other techniques that shape AI product economics — taught at PM-level depth.
Notable MoE Models You Should Recognize
Mixtral 8x7B / 8x22B
Mistral's open-weights MoE family. Among the first widely usable open MoEs. ~13B / ~39B active parameters.
DeepSeek-V3
671B total, 37B active. Open weights, frontier-quality reasoning at low inference cost. Major influence on industry pricing in 2025.
GPT-4 (rumored MoE)
Widely believed to be an MoE based on cost and behavior signals. If true, it demonstrates the production viability of the approach at frontier scale.
Gemini 1.5 (rumored MoE)
Google's long-context family is widely believed to be MoE. Long context + MoE economics is a particularly powerful combination.
What This Means for Your AI Product
Don't pay for capacity you don't use
MoE models have made "frontier quality at midsize cost" achievable. If your vendor's pricing doesn't reflect MoE economics, consider alternatives.
Open MoE is a real option
Self-hosting a strong MoE used to be hopeless. Mixtral and DeepSeek changed that. The math now works for many enterprise use cases.
Eval becomes more important, not less
MoE behavior varies by token routing. You need broad eval coverage to catch quirks that dense models would smooth over.
Architecture vocabulary matters in vendor negotiations
Vendors who know you understand MoE economics are likelier to offer competitive pricing. Vocabulary is leverage.