Model Distillation: Making AI Models Smaller Without Losing Quality
TL;DR
Model distillation trains a smaller "student" model to replicate the behavior of a larger "teacher" model. A well-distilled 7B parameter model can achieve 85-95% of a 70B model's quality on specific tasks at 5-10x lower inference cost and 3-5x lower latency. Distillation makes economic sense when you have a validated use case running at high volume on an expensive model. For AI PMs, distillation is the bridge between "this works with GPT-4" and "this is profitable at scale." This guide covers when distillation makes sense, how the process works, and how to evaluate whether your distilled model is good enough to ship.
What Model Distillation Actually Is
Distillation is a model compression technique where a large, capable model (the "teacher") generates training data that a smaller model (the "student") learns from. The key insight: a student model trained on the teacher's outputs learns more effectively than the same model trained on the original human-labeled data alone. The teacher's outputs contain "soft labels" — probability distributions over all possible outputs — that encode nuanced information about relationships between classes that hard labels don't capture.
The teacher-student framework
The teacher model is your best-performing model for the task — often a frontier model like GPT-4o or Claude. You run your production inputs through the teacher and collect its outputs. These input-output pairs become the training dataset for the student model. The student is a smaller architecture (Llama 8B, Mistral 7B, Phi-3) that is fine-tuned on this dataset. The student learns to mimic the teacher's behavior on your specific task distribution.
Why soft labels matter
When a teacher model classifies sentiment, it doesn't just output 'positive' — it outputs a probability distribution like [positive: 0.82, neutral: 0.15, negative: 0.03]. These soft probabilities contain information that hard labels ('positive') throw away. The student learns that this particular input is 'mostly positive but somewhat neutral' — which produces a more nuanced and calibrated student model. This is why distillation typically outperforms simple fine-tuning on hard labels alone.
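To make the difference concrete, here is a minimal sketch — assuming PyTorch and a toy three-class sentiment task, with purely illustrative numbers — of how a student's loss looks against a hard label versus the teacher's soft distribution:

```python
import torch
import torch.nn.functional as F

# Toy example: one input, three sentiment classes [positive, neutral, negative].
# The teacher's soft labels encode "mostly positive, somewhat neutral".
teacher_probs = torch.tensor([[0.82, 0.15, 0.03]])
hard_label = torch.tensor([0])  # the hard label keeps only "positive"

# Student's raw scores (logits) for the same input -- arbitrary values for illustration.
student_logits = torch.tensor([[1.2, 0.9, -0.4]])

# Hard-label loss: standard cross-entropy against the single correct class.
hard_loss = F.cross_entropy(student_logits, hard_label)

# Soft-label (distillation) loss: KL divergence between the student's distribution
# and the teacher's, so the student also learns the "somewhat neutral" signal.
soft_loss = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                     reduction="batchmean")

print(f"hard-label loss: {hard_loss:.3f}, soft-label loss: {soft_loss:.3f}")
```

The soft-label loss penalizes the student for missing the 'somewhat neutral' probability mass, not just for getting the top class wrong.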
Task-specific vs. general distillation
General distillation tries to compress all of a model's capabilities into a smaller model — this is extremely hard and produces significant quality loss across the board. Task-specific distillation focuses on one task or a narrow set of related tasks. Task-specific distillation is dramatically more effective: a 7B model distilled for customer support classification can match a 70B model's accuracy on that specific task while being unusable for creative writing.
Distillation is not the same as fine-tuning
Fine-tuning adapts a pre-trained model to a new task using labeled data. Distillation trains a smaller model to replicate a larger model's behavior using the larger model's outputs as training signal. The difference matters: distillation can work with unlabeled data (you only need inputs — the teacher generates the labels), and the teacher's soft probabilities provide richer training signal than human-generated hard labels. In practice, many teams combine both: distill from a teacher on production data, then fine-tune on a smaller set of human-verified examples.
The distillation quality ceiling
A distilled student effectively cannot exceed the teacher's quality. If your teacher model achieves 94% accuracy on your eval set, the student's ceiling is 94% — and in practice it will land lower, typically at 85-95% of the teacher's performance depending on task complexity and the student's capacity. Distillation works best when the task is well-defined, the input distribution is consistent, and the teacher's behavior is reliable. Where the teacher itself is unreliable or inconsistent, distillation amplifies those problems.
When Distillation Makes Product Sense (and When It Doesn't)
Distillation requires meaningful investment — data collection, compute for training, evaluation infrastructure, and ongoing maintenance. Here are the conditions that make it worth pursuing and the situations where other optimization strategies are better.
High volume, proven use case
Distillation pays off at scale. If you're making 100K+ inference requests per day on a specific task using a frontier model, the cost savings from switching to a distilled model can be substantial. A task running on GPT-4o at $5/1M input tokens that switches to a self-hosted 7B distilled model might reduce per-request cost by 90%. But if you're making 1,000 requests per day, the infrastructure and maintenance cost of a self-hosted model exceeds the API savings.
Example: A customer support triage system processing 500K tickets per day on Claude Opus could cost $50K+/month. A distilled Llama 8B model handling the same classification at 93% of the quality might cost $3K/month in compute — saving $47K monthly with a 2-3 month payback on the distillation investment.
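The payback math is simple enough to sanity-check on a napkin. A rough sketch using assumed figures consistent with the example above — the one-time distillation investment is a placeholder, not a quote:

```python
# Back-of-the-envelope payback calculation for the triage example above.
# All figures are illustrative assumptions.
teacher_monthly_cost = 50_000      # frontier API spend per month (assumed)
student_monthly_cost = 3_000       # self-hosted 7B compute per month (assumed)
distillation_investment = 120_000  # one-time data, training, and eng cost (assumed)

monthly_savings = teacher_monthly_cost - student_monthly_cost
payback_months = distillation_investment / monthly_savings
print(f"Monthly savings: ${monthly_savings:,}  Payback: {payback_months:.1f} months")
# -> Monthly savings: $47,000  Payback: 2.6 months
```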
Latency-critical features
If your product requires sub-second response times that frontier API models cannot consistently deliver, distillation to a small, self-hosted model is one of the few paths to achieving it. A 7B model running on a single A10 GPU can generate 100+ tokens per second, enabling real-time autocomplete, inline suggestions, and interactive features that are impossible at frontier model latencies.
Example: A code autocomplete feature needs to respond in under 200ms to feel responsive. GPT-4o's time-to-first-token is typically 300-600ms. A distilled 3B model running locally can achieve 50ms TTFT — making the feature viable.
Data privacy and compliance requirements
Some industries (healthcare, finance, government) cannot send data to third-party API providers. Distillation lets you create a model that runs entirely within your infrastructure, eliminating data residency and third-party access concerns. The distillation process itself requires sending production-like data to the teacher — but you can use synthetic or anonymized data for this step, then deploy the student model on-premise with real data.
Example: A hospital system that cannot send patient data to OpenAI's API can distill a clinical note summarization model using anonymized notes, then deploy the student model within their HIPAA-compliant infrastructure to process real patient records.
When NOT to distill
Don't distill if your use case is still evolving (you'll need to re-distill every time you change the task), if your task requires general-purpose reasoning across many domains (distillation narrows capability), if your volume is too low to justify the infrastructure cost, or if the teacher model itself isn't performing well enough yet. Fix your prompt engineering and model selection first. Distillation optimizes the deployment of a working solution — it doesn't fix a solution that doesn't work.
Example: A startup exploring product-market fit with 5K daily requests should optimize prompts and use cheaper API models (GPT-4o mini, Claude Haiku) before investing in distillation. The use case may change significantly before reaching the volume where distillation makes economic sense.
The Distillation Process Step by Step
Distillation is a structured process with clear stages. Each stage has specific inputs, outputs, and decision points where the PM needs to be involved.
Step 1: Collect production data
Gather 10K-100K representative input examples from your production traffic. These should cover the full distribution of inputs your model encounters: common cases, edge cases, and different user segments. The quality and diversity of this dataset directly determine the quality of your distilled model. Remove PII and sensitive data before sending anything to the teacher model.
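A sketch of what this collection step can look like in practice — the file names, field names, PII patterns, and per-category cap below are all hypothetical placeholders for your own logging pipeline:

```python
import json
import random
import re

# Hypothetical sketch: sample production inputs and scrub obvious PII before
# sending them to the teacher. Field names and patterns are illustrative only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

with open("production_logs.jsonl") as f:          # assumed export of raw traffic
    records = [json.loads(line) for line in f]

# Stratify by a category field so rare input types are not drowned out.
by_category = {}
for r in records:
    by_category.setdefault(r.get("category", "unknown"), []).append(r)

sample = []
for category, items in by_category.items():
    k = min(len(items), 2_000)                    # cap per category (assumed)
    sample.extend(random.sample(items, k))

with open("distillation_inputs.jsonl", "w") as f:
    for r in sample:
        f.write(json.dumps({"input": scrub(r["text"])}) + "\n")
```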
Step 2: Generate teacher outputs
Run all collected inputs through your teacher model (the large model you want to replace). Store the full outputs and, where accessible, the teacher's probability distributions over the vocabulary (the raw logits). If you use an API provider, you typically get only the generated text, not logits — this still works for distillation but produces slightly lower-quality students. Budget for API costs: running 100K examples through GPT-4o costs roughly $500-2,000 depending on input/output length.
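A minimal sketch of collecting teacher outputs with the OpenAI Python SDK — the model name, system prompt, and file paths are assumptions, and a production pipeline would add batching, retries, and rate-limit handling:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "Classify the support ticket as billing, technical, or account."  # assumed task

with open("distillation_inputs.jsonl") as fin, open("teacher_outputs.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        response = client.chat.completions.create(
            model="gpt-4o",                       # the teacher you intend to replace
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": example["input"]},
            ],
            temperature=0,                        # deterministic teacher labels
        )
        example["output"] = response.choices[0].message.content
        fout.write(json.dumps(example) + "\n")
```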
Step 3: Select and prepare the student model
Choose a student architecture based on your latency and infrastructure requirements. Popular choices: Llama 3.1 8B (good balance of quality and speed), Mistral 7B (strong at instruction following), Phi-3 3.8B (when maximum speed is required). Start with a pre-trained base model, not a blank architecture. The student benefits from its pre-training — distillation adapts its existing knowledge to your specific task.
Step 4: Train the student
Fine-tune the student model on the teacher's input-output pairs. If you have access to logits, use KL divergence loss (the student learns to match the teacher's probability distribution). If you only have text outputs, use standard supervised fine-tuning with the teacher's outputs as ground truth. Train for 2-5 epochs, monitoring for overfitting. Compute cost: training an 8B model on 50K examples typically requires 2-8 hours on a single A100 GPU.
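For the logits case, the standard recipe is a temperature-scaled KL divergence between the teacher's and student's distributions. A minimal PyTorch sketch, assuming you can run both models and access their logits:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Inside the training loop (teacher runs without gradients):
#   with torch.no_grad():
#       teacher_logits = teacher(input_ids).logits
#   student_logits = student(input_ids).logits
#   loss = distillation_loss(student_logits, teacher_logits)
#   loss.backward(); optimizer.step()
```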
Step 5: Evaluate against quality floor
Run your evaluation suite on both the teacher and student model. Compare accuracy, latency, and output quality across your test set. Pay special attention to edge cases and tail distribution inputs. If the student meets your quality floor (defined before starting the project), proceed to production testing. If not, iterate: add more training data, try a larger student model, or adjust the training hyperparameters.
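A bare-bones sketch of the comparison — the file formats, exact-match scoring, and the 90% floor are assumptions to adapt to your own eval harness:

```python
import json

QUALITY_FLOOR = 0.90  # agreed before the project started (assumed value)

def accuracy(predictions_path: str, labels_path: str) -> float:
    """Exact-match accuracy of model outputs against labeled eval examples."""
    with open(predictions_path) as p, open(labels_path) as l:
        preds = [json.loads(x)["output"] for x in p]
        labels = [json.loads(x)["label"] for x in l]
    return sum(pr == la for pr, la in zip(preds, labels)) / len(labels)

teacher_acc = accuracy("teacher_eval_outputs.jsonl", "eval_labels.jsonl")
student_acc = accuracy("student_eval_outputs.jsonl", "eval_labels.jsonl")

print(f"teacher: {teacher_acc:.1%}  student: {student_acc:.1%}  "
      f"retention: {student_acc / teacher_acc:.1%}")
print("ship" if student_acc >= QUALITY_FLOOR else "iterate: more data or a larger student")
```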
Step 6: Deploy and monitor
Deploy the student model alongside your teacher model. Route a small percentage of traffic (5-10%) to the student and compare production metrics: task completion rate, user feedback, override rate, and latency. Monitor for distribution shift: if your production inputs change over time, the student's quality may degrade faster than the teacher's because it has less general capability to fall back on. Plan for periodic re-distillation.
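One simple way to implement the traffic split is deterministic bucketing by user, so each user sees a consistent model during the test. A sketch, with the hashing scheme and 10% share as assumptions:

```python
import hashlib

STUDENT_TRAFFIC_SHARE = 0.10  # start small; ramp up as production metrics hold (assumed)

def route_to_student(user_id: str) -> bool:
    """Deterministically bucket users so the same user always hits the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < STUDENT_TRAFFIC_SHARE * 100

model = "student-7b" if route_to_student("user-42") else "teacher-frontier"
```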
Master AI Model Optimization in the Masterclass
Distillation, quantization, fine-tuning, and model deployment decisions are core to the AI PM Masterclass curriculum. Taught by a Salesforce Sr. Director PM.
Evaluating Distillation Quality — What to Measure
A distilled model that looks good on aggregate metrics can still fail on specific subsets of your traffic. Here is how to evaluate distillation quality thoroughly.
Measure quality on your task distribution, not benchmarks
General benchmarks (MMLU, HumanEval) test broad capabilities that your distilled model intentionally sacrificed. Evaluate on a held-out test set drawn from your production data. If your model handles customer support, test it on customer support queries — not on math or coding benchmarks. Your eval set should include at least 500 examples, stratified across the input categories your product handles.
Track quality degradation on edge cases separately
Distilled models degrade most on inputs that are rare in the training data: unusual phrasings, multi-language queries, adversarial inputs, and domain-specific jargon. Create an 'edge case' eval set of 100-200 examples specifically targeting these difficult inputs. If the student model drops below your quality floor on edge cases even while performing well on average, you need more edge case training data or a larger student model.
Compare calibration, not just accuracy
A well-calibrated model says 'I'm 80% confident' when it's correct 80% of the time. Distilled models can become poorly calibrated — overconfident in their wrong answers. Test confidence calibration: when the student expresses high confidence, is it actually correct proportionally? Poor calibration is particularly dangerous in products where the model's confidence score drives downstream decisions (like auto-approval thresholds).
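A common way to quantify this is expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's confidence to its observed accuracy. A minimal NumPy sketch, assuming your eval run produces per-example confidences and correctness flags:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average gap between stated confidence and observed accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: a model that claims ~90% confidence but is right only ~60% of the time
# is overconfident, and the ECE reflects that gap.
print(expected_calibration_error([0.9, 0.92, 0.88, 0.95, 0.91], [1, 0, 1, 0, 1]))
```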
Monitor production quality continuously after deployment
Distilled models are more sensitive to distribution shift than large general-purpose models. If your user base changes, your product adds new features, or external factors change the input distribution, the distilled model's quality may degrade before your teacher model's would. Set up automated quality monitoring: sample production outputs, run them through your eval pipeline, and alert when quality drops below your floor. Plan to re-distill quarterly or when quality alerts trigger.
Distillation vs. Quantization vs. Pruning — Choosing the Right Optimization
Distillation is one of several model optimization techniques. Each has different trade-offs, and they can be combined. Understanding when to use which — or when to combine them — is a critical decision for AI PMs scaling production systems.
Distillation: best for task-specific cost reduction
Distillation creates a new, smaller model trained to replicate a larger model's behavior on your specific task. It requires training data, compute for training, and ongoing maintenance. The payoff is large: 5-10x cost reduction at 85-95% quality retention. Use distillation when you have a proven, high-volume use case and need the largest possible cost-to-quality improvement. Distillation is the most effective optimization when the task is well-defined and stable.
Quantization: best for quick deployment optimization
Quantization reduces the numerical precision of model weights (e.g., from 16-bit to 4-bit) without changing the model architecture. It requires no training data, takes minutes to apply, and typically reduces memory and cost by 50-75% with minimal quality loss. Use quantization when you need a fast optimization with no data requirements. Quantization is complementary to distillation — you can quantize a distilled model for an additional 2x cost reduction.
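For illustration, loading an existing checkpoint in 4-bit with the Hugging Face transformers and bitsandbytes libraries looks roughly like this — the checkpoint name is a placeholder (it could be your distilled student), and a CUDA GPU plus access to the model are assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Minimal post-training quantization sketch: load an existing checkpoint with
# 4-bit weights via bitsandbytes. The model name is an assumed placeholder.
model_name = "meta-llama/Llama-3.1-8B-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
# The 4-bit model uses roughly a quarter of the fp16 memory footprint,
# usually with only a small quality drop on task-specific evals.
```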
Pruning: best for structured model compression
Pruning removes weights or entire layers from a model that contribute least to output quality. Structured pruning (removing entire attention heads or layers) can reduce model size by 20-40% with 1-3% quality loss on targeted tasks. Pruning is less commonly used in LLM production than distillation or quantization because the tooling is less mature, but it is effective for on-device deployment where model size constraints are absolute.
When to combine techniques
The most cost-efficient production models combine multiple optimization techniques. A common pipeline: distill from a 70B teacher to a 7B student (10x size reduction), then quantize the student to INT8 (2x size reduction), yielding a model that is 20x smaller and 15x cheaper than the original while retaining 80-90% of its task-specific quality. Each technique targets a different dimension of efficiency, so their benefits stack.
Decision framework for AI PMs
Start with quantization — it is fast, requires no data, and has minimal risk. If the quantized model meets your quality floor, ship it. If you need more optimization, evaluate distillation feasibility: do you have sufficient training data and volume to justify the investment? Pruning is a specialist technique — only pursue it if quantization and distillation together don't meet your size or speed requirements. Always define your quality floor before optimizing, and always A/B test optimized models against the baseline in production.
Scale AI Products Profitably in the AI PM Masterclass
Distillation, model optimization, and cost-performance trade-offs are core modules in the AI PM Masterclass. Learn to make the technical decisions that turn AI prototypes into profitable products. Taught by a Salesforce Sr. Director PM.