Mixture of Experts (MoE) Explained for Product Managers
TL;DR
Mixture of Experts (MoE) is the architecture technique behind Mixtral, DeepSeek-V3, and (reportedly) GPT-4: models with hundreds of billions of total parameters that run at the cost of much smaller models. Each token is routed to a small subset of "expert" sub-networks instead of through the full model. The result: the capacity of a giant model at the inference cost of a midsize one. PMs who understand MoE can reason about the model selection, latency, and infrastructure choices that increasingly define the field.
The Core Idea
A standard transformer activates every parameter for every token. An MoE transformer replaces its feed-forward layers with many parallel "expert" subnetworks, but routes each token through only a few of them (Mixtral, for example, activates 2 of 8). Total parameters: huge. Active parameters per token: small. The model has the capacity to specialize across many domains while paying inference cost proportional to the active subset.
Total parameters
The full model size — what shows up in marketing. Mixtral 8x22B has 141B total parameters; DeepSeek-V3 has 671B.
Active parameters
What actually runs per token. Mixtral 8x22B activates ~39B; DeepSeek-V3 activates ~37B. This is what determines speed and cost.
Router (gating network)
A small network that decides which experts each token goes to. Trained jointly with the rest of the model.
Experts
Independent feed-forward subnetworks. Each ends up specializing — some on code, some on language, some on math — without explicit instruction.
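The routing mechanics above can be sketched in a few lines of illustrative Python. This is a toy, not a real model: the expert and router weights are random, each "expert" is a single elementwise layer standing in for a feed-forward block, and the dimensions are tiny. What it shows is the essential loop: score experts, keep the top k, softmax the kept scores into mixing weights, and run only the chosen experts.

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 8   # parallel "expert" subnetworks (Mixtral 8x7B has 8)
TOP_K = 2         # experts activated per token (Mixtral activates 2)
DIM = 4           # toy hidden size

# Toy stand-ins for trained weights: each expert is one elementwise layer,
# and the router is one linear layer scoring each expert for a token.
expert_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]

def expert_forward(e, x):
    # In a real model this is a full feed-forward block.
    return [w * xi for w, xi in zip(expert_weights[e], x)]

def moe_layer(x):
    # 1. Router scores every expert for this token.
    scores = [sum(w * xi for w, xi in zip(router_weights[e], x))
              for e in range(NUM_EXPERTS)]
    # 2. Keep only the top-k experts.
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    # 3. Softmax over the kept scores gives mixing weights.
    exps = [math.exp(scores[e]) for e in top]
    gates = [v / sum(exps) for v in exps]
    # 4. Weighted sum of the chosen experts' outputs.
    #    The other NUM_EXPERTS - TOP_K experts never run: that is the saving.
    out = [0.0] * DIM
    for g, e in zip(gates, top):
        y = expert_forward(e, x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

output, chosen = moe_layer([0.5, -1.0, 0.25, 2.0])
print(chosen)  # which 2 of the 8 experts this token activated
```

Different input tokens land on different `chosen` experts, which is exactly why total and active parameter counts diverge.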
Why MoE Changes the Economics
Pre-MoE, scaling capacity meant scaling everything: more parameters, more compute per token, more cost. MoE decouples capacity from inference cost. You can ship a model with the smarts of GPT-4 without paying GPT-4 prices for every token. That changes pricing, latency budgets, and competitive dynamics across the industry.
Cheaper frontier-quality inference
MoE models like DeepSeek-V3 and Mixtral deliver near-frontier quality at a fraction of the per-token cost of dense frontier models.
More capability for the same dollars
If your budget supports a 70B dense model, an MoE with 200B+ total parameters can run at similar speed with better capability, because cost tracks active parameters, not total.
Better latency per quality unit
Active parameters drive latency. MoE's smaller active count means faster generation than equivalent-quality dense models.
Compounding open-source advantage
Open MoE models (Mixtral, DeepSeek) are closing the gap with closed frontier models faster than dense scaling did.
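The decoupling above is easy to see with back-of-envelope arithmetic. A common rule of thumb is roughly 2 FLOPs per active parameter per generated token, so per-token compute for DeepSeek-V3 (671B total, 37B active) lands at about half that of a 70B dense model, despite nearly 10x the total capacity. The 2-FLOPs constant is a simplification, but it cancels out of the comparison anyway:

```python
# Back-of-envelope: inference compute scales with ACTIVE parameters.
# Rule of thumb (an approximation): ~2 FLOPs per active parameter per token.
def flops_per_token(active_params_billions):
    return 2 * active_params_billions * 1e9

dense_70b = flops_per_token(70)    # dense model: all 70B parameters run
deepseek_v3 = flops_per_token(37)  # MoE: only 37B of 671B total run

ratio = deepseek_v3 / dense_70b
print(f"DeepSeek-V3 per-token compute vs. 70B dense: {ratio:.0%}")  # ~53%
```

Memory is the caveat: all 671B parameters still have to be loaded (see the tradeoffs below), so the compute saving does not translate into an equally smaller GPU footprint.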
The Tradeoffs MoE Doesn't Solve
MoE isn't free. The wins come with real complexity. Memory, batching, and routing all become harder, and self-hosting MoE is meaningfully more involved than self-hosting a dense model of comparable active size.
Memory pressure
Total parameters still need to fit in GPU memory. A 671B MoE model needs serious hardware even though only 37B activate per token.
Batching complexity
Different tokens route to different experts. Dynamic load balancing across experts is harder than uniform batching.
Training instability
Routing collapse — where most traffic ends up at a few experts — is a real failure mode. Auxiliary loss functions help, but tuning is delicate.
Latency variance
Some tokens go to crowded experts, others to sparse ones. Without careful balancing, latency variance is higher than in dense models.
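The auxiliary losses mentioned under "Training instability" typically follow the pattern popularized by the Switch Transformer: penalize the product of each expert's traffic share and its mean router probability, which is minimized when routing is uniform. A toy sketch (the traffic numbers here are illustrative, not from any real training run):

```python
# Sketch of a Switch-Transformer-style load-balancing auxiliary loss.
# f[i]: fraction of tokens actually routed to expert i
# p[i]: mean router probability assigned to expert i
# The loss equals 1.0 for perfectly uniform routing and grows as
# routing collapses onto a few experts.
def load_balance_loss(f, p):
    n = len(f)
    return n * sum(fi * pi for fi, pi in zip(f, p))

uniform = [0.25, 0.25, 0.25, 0.25]          # 4 experts, balanced traffic
collapsed = [0.85, 0.05, 0.05, 0.05]        # one expert hoards the tokens

print(load_balance_loss(uniform, uniform))      # 1.0 -> no penalty
print(load_balance_loss(collapsed, collapsed))  # ~2.92 -> collapse penalized
```

Adding a small multiple of this term to the training loss nudges the router toward spreading tokens out; the "delicate tuning" in the text is largely about picking that multiplier, since too much balancing pressure hurts quality.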
Lead Architecture Conversations With Confidence
The AI PM Masterclass walks through MoE, attention variants, quantization, and other techniques that shape AI product economics — taught at PM-level depth.
Notable MoE Models You Should Recognize
Mixtral 8x7B / 8x22B
Mistral's open-weights MoE family. Among the first widely usable open MoEs. ~13B / ~39B active parameters.
DeepSeek-V3
671B total, 37B active. Open weights, frontier-quality reasoning at low inference cost. Major influence on industry pricing in 2025.
GPT-4 (rumored MoE)
Widely believed to be an MoE based on cost and behavior signals. If true, it demonstrates the production viability of the approach at frontier scale.
Gemini 1.5 (rumored MoE)
Google's long-context family is widely believed to be MoE. Long context + MoE economics is a particularly powerful combination.
What This Means for Your AI Product
Don't pay for capacity you don't use
MoE models have made "frontier quality at midsize cost" achievable. If your vendor's pricing doesn't reflect MoE economics, consider alternatives.
Open MoE is a real option
Self-hosting a strong MoE used to be hopeless. Mixtral and DeepSeek changed that. The math now works for many enterprise use cases.
Eval becomes more important, not less
MoE behavior varies by token routing. You need broad eval coverage to catch quirks that dense models would smooth over.
Architecture vocabulary matters in vendor negotiations
Vendors who know you understand MoE economics are likelier to offer competitive pricing. Vocabulary is leverage.