AI Quantization vs. Distillation: Which Path to Cheaper Inference?
TL;DR
Quantization compresses an existing model by storing weights in lower precision — quick, cheap, modest quality loss. Distillation trains a smaller new model to mimic a bigger one — slower, more expensive, higher ceiling. Most production AI products end up using both, but in a specific order. This guide compares them on cost, quality preservation, and operational fit, and shows when each is the right move.
The Core Difference in One Sentence
Quantization keeps the same model and stores its weights in less precise numbers (FP32 → INT8 or INT4). Distillation creates a new, smaller model trained to behave like the larger one. The first is a runtime trick; the second is a training investment. The two solve different problems and combine well.
Quantization
Lower-precision weights (8-bit, 4-bit). Same architecture, smaller memory footprint, faster compute. Quality drop typically 1-5%.
Distillation
Train a smaller model to imitate a larger one. Different architecture, much smaller, much faster. Quality drop varies widely (5-30%) depending on domain.
When to combine
Distill first to get the smaller architecture, then quantize the result. Stacks the savings; best practice in production AI today.
Time investment
Quantization: hours to days. Distillation: weeks to months including data preparation and eval. Choose accordingly.
Quantization in Detail
Quantization swaps high-precision weights (typically FP16 at inference time) for low-precision representations. Relative to FP16, INT8 roughly halves memory and bandwidth; INT4 cuts them to about a quarter. Modern quantization methods (GPTQ, AWQ) and formats (GGUF) preserve quality remarkably well — usually within 1-3% of full precision on standard benchmarks.
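As a concrete illustration, here is a minimal sketch of naive post-training symmetric INT8 quantization in Python with NumPy. The per-tensor scaling and the toy weight matrix are illustrative assumptions; production methods like GPTQ and AWQ are considerably more sophisticated.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Naive symmetric per-tensor INT8 quantization (illustrative only)."""
    scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # recover approximate weights

# Toy weight matrix standing in for one layer of a model.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

# vs FP32 shown here the saving is 4x; vs an FP16 baseline it would be 2x.
print(f"FP32 size: {w.nbytes / 2**20:.0f} MiB, INT8 size: {q.nbytes / 2**20:.0f} MiB")
print(f"Mean absolute error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```

Real methods quantize per-channel or per-group and use calibration data to choose scales, which is where most of the quality preservation comes from.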
Post-training quantization (PTQ)
Quantize after training is complete. Cheapest, fastest. May lose more quality than alternatives. Most common in practice.
Quantization-aware training (QAT)
Account for quantization during training. Higher quality at low precision, but requires retraining the model.
Activation-aware (AWQ)
Use activation statistics to identify the most important weight channels and shield them from quantization error. Often near-lossless at INT4.
Mixed precision
Critical layers stay high precision; others quantize aggressively. Common in production deployments; a minimal sketch of the idea follows this list.
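To make the mixed-precision idea concrete, here is a hedged sketch: channels whose activations are largest (a stand-in for the importance signal AWQ estimates from calibration data) stay in FP16, and everything else is quantized to INT8. The 1% keep fraction and the importance heuristic are illustrative assumptions, not AWQ's actual algorithm.

```python
import numpy as np

def mixed_precision_quantize(w, act_scale, keep_frac=0.01):
    """Keep the most activation-salient output channels in FP16,
    quantize the rest to INT8 (illustrative heuristic, not AWQ itself)."""
    n_keep = max(1, int(keep_frac * w.shape[0]))
    salient = np.argsort(act_scale)[-n_keep:]  # channels with largest activations
    mask = np.zeros(w.shape[0], dtype=bool)
    mask[salient] = True

    kept = w[mask].astype(np.float16)          # protected channels
    rest = w[~mask]
    scale = np.abs(rest).max() / 127.0
    quant = np.clip(np.round(rest / scale), -127, 127).astype(np.int8)
    return kept, quant, scale, mask

# Toy layer: activation statistics would come from calibration data in practice.
w = np.random.randn(1024, 1024).astype(np.float32)
act_scale = np.abs(np.random.randn(1024))
kept, quant, scale, mask = mixed_precision_quantize(w, act_scale)
print(f"{mask.sum()} channels kept in FP16, {int((~mask).sum())} quantized to INT8")
```

AWQ itself avoids mixed storage by rescaling salient channels before uniform quantization, but the protect-the-important-weights intuition is the same.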
Distillation in Detail
Distillation trains a smaller "student" model on the outputs of a larger "teacher." Instead of just hard labels, the student learns from the teacher's full output distribution — capturing nuance that labels alone can't convey. Done well, a 7B distilled model can match a 70B teacher on the specific tasks it was distilled for.
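The "full output distribution" point is easiest to see in the standard distillation loss: the student is trained against the teacher's temperature-softened probabilities, not just the argmax label. A minimal PyTorch sketch, assuming both models expose logits over the same vocabulary; the temperature and blend weight are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with hard-label cross-entropy.
    T and alpha are illustrative defaults; both are tuned in practice."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to compensate for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 8 examples, 32k-token vocabulary.
student = torch.randn(8, 32000, requires_grad=True)
teacher = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```

The temperature T softens both distributions so the teacher's near-miss preferences, not just its top pick, carry gradient signal to the student.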
Task-specific distillation
Distill for one narrow task: code completion, classification, summarization. Small model + narrow scope = excellent quality at tiny cost.
General distillation
Distill broad capability into a smaller architecture. Harder, more compute, more data — but produces a versatile small model.
Synthetic data distillation
Generate training data from the teacher itself. Removes labeling cost; the dominant approach in 2025-2026. See the sketch after this list.
Behavior cloning vs. logit matching
Behavior cloning teaches the student to reproduce the teacher's outputs; logit matching teaches it to reproduce the teacher's distributions (as in the loss sketch above). The latter generally transfers more signal, but requires access to teacher logits.
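Here is a hedged sketch of the synthetic-data loop: prompt the teacher, record its responses, and accumulate a training file for the student. The `call_teacher` function is a placeholder for whatever API or local inference call you actually use; the prompt list and JSONL format are illustrative assumptions.

```python
import json

def call_teacher(prompt: str) -> str:
    """Placeholder: wire this to your teacher model's API or local runtime."""
    raise NotImplementedError

def build_distillation_set(prompts, out_path="distill_train.jsonl"):
    """Generate (prompt, teacher_response) pairs as student training data."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = call_teacher(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

# In practice the prompt set is sampled from production traffic or
# synthesized to cover the target task's long tail.
```

Training on these pairs is behavior cloning; if the teacher also exposes per-token logits, logging them alongside the text upgrades the same dataset to logit matching.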
Make Compression Decisions With Confidence
The AI PM Masterclass walks through compression techniques with cost models and case studies — so you can lead these conversations with engineers, not follow them.
Decision Framework: Which to Reach For First
Reach for quantization first when
You have a working model and need lower cost or latency right now. Quantization is days, not months. The wins are immediate.
Reach for distillation when
Your model is too big to deploy at all (latency, memory, edge constraints) or your domain is narrow enough that a much smaller specialized model is realistic.
Combine them when
You're shipping at scale and need every percentage point. Distill to get a smaller architecture, quantize to get a smaller representation. Production-grade.
Skip both when
Your traffic is small enough that frontier API costs are fine. Optimization eats engineering time; only invest when the volume justifies it.
What PMs Should Watch For
Quality drift on edge cases
Compressed models often hold up on average but break on rare inputs. Eval coverage on long-tail scenarios is mandatory before shipping.
Hidden cost of distillation training
"Just distill it" isn't a small project. Budget months and meaningful compute. Plan for failed iterations.
Over-quantization for the deployment
INT4 might work on benchmarks and fail on your specific traffic. Always re-eval on production-representative data.
Not capturing teacher signal upstream
If you might distill later, log teacher logits now. Replaying calls is expensive; capturing signal upstream is essentially free.