Model Distillation: Making AI Models Smaller Without Losing Quality
TL;DR
Model distillation trains a smaller "student" model to replicate the behavior of a larger "teacher" model. A well-distilled 7B parameter model can achieve 85-95% of a 70B model's quality on specific tasks at 5-10x lower inference cost and 3-5x lower latency. Distillation makes economic sense when you have a validated use case running at high volume on an expensive model. For AI PMs, distillation is the bridge between "this works with GPT-4" and "this is profitable at scale." This guide covers when distillation makes sense, how the process works, and how to evaluate whether your distilled model is good enough to ship.
What Model Distillation Actually Is
Distillation is a model compression technique where a large, capable model (the "teacher") generates training data that a smaller model (the "student") learns from. The key insight: a student model trained on the teacher's outputs learns more effectively than the same model trained on the original human-labeled data alone. The teacher's outputs contain "soft labels" — probability distributions over all possible outputs — that encode nuanced information about relationships between classes that hard labels don't capture.
The teacher-student framework
The teacher model is your best-performing model for the task — often a frontier model like GPT-4o or Claude. You run your production inputs through the teacher and collect its outputs. These input-output pairs become the training dataset for the student model. The student is a smaller architecture (Llama 8B, Mistral 7B, Phi-3) that is fine-tuned on this dataset. The student learns to mimic the teacher's behavior on your specific task distribution.
Why soft labels matter
When a teacher model classifies sentiment, it doesn't just output 'positive' — it outputs a probability distribution like [positive: 0.82, neutral: 0.15, negative: 0.03]. These soft probabilities contain information that hard labels ('positive') throw away. The student learns that this particular input is 'mostly positive but somewhat neutral' — which produces a more nuanced and calibrated student model. This is why distillation typically outperforms simple fine-tuning on hard labels alone.
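To make the difference concrete, here is a minimal sketch — assuming PyTorch and a toy three-class sentiment task, with purely illustrative numbers — of how a student's loss looks against a hard label versus the teacher's soft distribution:

```python
import torch
import torch.nn.functional as F

# Toy example: one input, three sentiment classes [positive, neutral, negative].
# The teacher's soft labels encode "mostly positive, somewhat neutral".
teacher_probs = torch.tensor([[0.82, 0.15, 0.03]])
hard_label = torch.tensor([0])  # the hard label keeps only "positive"

# Student's raw scores (logits) for the same input -- arbitrary values for illustration.
student_logits = torch.tensor([[1.2, 0.9, -0.4]])

# Hard-label loss: standard cross-entropy against the single correct class.
hard_loss = F.cross_entropy(student_logits, hard_label)

# Soft-label (distillation) loss: KL divergence between the student's distribution
# and the teacher's, so the student also learns the "somewhat neutral" signal.
soft_loss = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                     reduction="batchmean")

print(f"hard-label loss: {hard_loss:.3f}, soft-label loss: {soft_loss:.3f}")
```

The soft-label loss penalizes the student for missing the 'somewhat neutral' probability mass, not just for getting the top class wrong.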
Task-specific vs. general distillation
General distillation tries to compress all of a model's capabilities into a smaller model — this is extremely hard and produces significant quality loss across the board. Task-specific distillation focuses on one task or a narrow set of related tasks. Task-specific distillation is dramatically more effective: a 7B model distilled for customer support classification can match a 70B model's accuracy on that specific task while being unusable for creative writing.
Distillation is not the same as fine-tuning
Fine-tuning adapts a pre-trained model to a new task using labeled data. Distillation trains a smaller model to replicate a larger model's behavior using the larger model's outputs as training signal. The difference matters: distillation can work with unlabeled data (you only need inputs — the teacher generates the labels), and the teacher's soft probabilities provide richer training signal than human-generated hard labels. In practice, many teams combine both: distill from a teacher on production data, then fine-tune on a smaller set of human-verified examples.
The distillation quality ceiling
A distilled student effectively cannot exceed the teacher's quality. If your teacher model achieves 94% accuracy on your eval set, the student's ceiling is 94% — and in practice it will land lower, typically at 85-95% of the teacher's performance depending on task complexity and the student's capacity. Distillation works best when the task is well-defined, the input distribution is consistent, and the teacher's behavior is reliable. Where the teacher itself is unreliable or inconsistent, distillation amplifies those problems.
When Distillation Makes Product Sense (and When It Doesn't)
Distillation requires meaningful investment — data collection, compute for training, evaluation infrastructure, and ongoing maintenance. Here are the conditions that make it worth pursuing and the situations where other optimization strategies are better.
High volume, proven use case
Distillation pays off at scale. If you're making 100K+ inference requests per day on a specific task using a frontier model, the cost savings from switching to a distilled model can be substantial. A task running on GPT-4o at $5/1M input tokens that switches to a self-hosted 7B distilled model might reduce per-request cost by 90%. But if you're making 1,000 requests per day, the infrastructure and maintenance cost of a self-hosted model exceeds the API savings.
Example: A customer support triage system processing 500K tickets per day on Claude Opus could cost $50K+/month. A distilled Llama 8B model handling the same classification at 93% of the quality might cost $3K/month in compute — saving $47K monthly with a 2-3 month payback on the distillation investment.
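The payback math is simple enough to sanity-check on a napkin. A rough sketch using assumed figures consistent with the example above — the one-time distillation investment is a placeholder, not a quote:

```python
# Back-of-the-envelope payback calculation for the triage example above.
# All figures are illustrative assumptions.
teacher_monthly_cost = 50_000      # frontier API spend per month (assumed)
student_monthly_cost = 3_000       # self-hosted 7B compute per month (assumed)
distillation_investment = 120_000  # one-time data, training, and eng cost (assumed)

monthly_savings = teacher_monthly_cost - student_monthly_cost
payback_months = distillation_investment / monthly_savings
print(f"Monthly savings: ${monthly_savings:,}  Payback: {payback_months:.1f} months")
# -> Monthly savings: $47,000  Payback: 2.6 months
```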
Latency-critical features
If your product requires sub-second response times that frontier API models cannot consistently deliver, distillation to a small, self-hosted model is one of the few paths to achieving it. A 7B model running on a single A10 GPU can generate 100+ tokens per second, enabling real-time autocomplete, inline suggestions, and interactive features that are impossible at frontier model latencies.
Example: A code autocomplete feature needs to respond in under 200ms to feel responsive. GPT-4o's time-to-first-token is typically 300-600ms. A distilled 3B model running locally can achieve 50ms TTFT — making the feature viable.
Data privacy and compliance requirements
Some industries (healthcare, finance, government) cannot send data to third-party API providers. Distillation lets you create a model that runs entirely within your infrastructure, eliminating data residency and third-party access concerns. The distillation process itself requires sending production-like data to the teacher — but you can use synthetic or anonymized data for this step, then deploy the student model on-premise with real data.
Example: A hospital system that cannot send patient data to OpenAI's API can distill a clinical note summarization model using anonymized notes, then deploy the student model within their HIPAA-compliant infrastructure to process real patient records.
When NOT to distill
Don't distill if your use case is still evolving (you'll need to re-distill every time you change the task), if your task requires general-purpose reasoning across many domains (distillation narrows capability), if your volume is too low to justify the infrastructure cost, or if the teacher model itself isn't performing well enough yet. Fix your prompt engineering and model selection first. Distillation optimizes the deployment of a working solution — it doesn't fix a solution that doesn't work.
Example: A startup exploring product-market fit with 5K daily requests should optimize prompts and use cheaper API models (GPT-4o mini, Claude Haiku) before investing in distillation. The use case may change significantly before reaching the volume where distillation makes economic sense.
The Distillation Process Step by Step
Distillation is a structured process with clear stages. Each stage has specific inputs, outputs, and decision points where the PM needs to be involved.
Step 1: Collect production data
Gather 10K-100K representative input examples from your production traffic. These should cover the full distribution of inputs your model encounters: common cases, edge cases, and different user segments. The quality and diversity of this dataset directly determine the quality of your distilled model. Remove PII and sensitive data before sending anything to the teacher model.
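A sketch of what this collection step can look like in practice — the file names, field names, PII patterns, and per-category cap below are all hypothetical placeholders for your own logging pipeline:

```python
import json
import random
import re

# Hypothetical sketch: sample production inputs and scrub obvious PII before
# sending them to the teacher. Field names and patterns are illustrative only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

with open("production_logs.jsonl") as f:          # assumed export of raw traffic
    records = [json.loads(line) for line in f]

# Stratify by a category field so rare input types are not drowned out.
by_category = {}
for r in records:
    by_category.setdefault(r.get("category", "unknown"), []).append(r)

sample = []
for category, items in by_category.items():
    k = min(len(items), 2_000)                    # cap per category (assumed)
    sample.extend(random.sample(items, k))

with open("distillation_inputs.jsonl", "w") as f:
    for r in sample:
        f.write(json.dumps({"input": scrub(r["text"])}) + "\n")
```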
Step 2: Generate teacher outputs
Run all collected inputs through your teacher model (the large model you want to replace). Store the full outputs and, where accessible, the teacher's probability distributions over the vocabulary (the raw logits). If you use an API provider, you typically get only the generated text, not logits — this still works for distillation but produces slightly lower-quality students. Budget for API costs: running 100K examples through GPT-4o costs roughly $500-2,000 depending on input/output length.
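A minimal sketch of collecting teacher outputs with the OpenAI Python SDK — the model name, system prompt, and file paths are assumptions, and a production pipeline would add batching, retries, and rate-limit handling:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "Classify the support ticket as billing, technical, or account."  # assumed task

with open("distillation_inputs.jsonl") as fin, open("teacher_outputs.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        response = client.chat.completions.create(
            model="gpt-4o",                       # the teacher you intend to replace
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": example["input"]},
            ],
            temperature=0,                        # deterministic teacher labels
        )
        example["output"] = response.choices[0].message.content
        fout.write(json.dumps(example) + "\n")
```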
Step 3: Select and prepare the student model
Choose a student architecture based on your latency and infrastructure requirements. Popular choices: Llama 3.1 8B (good balance of quality and speed), Mistral 7B (strong at instruction following), Phi-3 3.8B (when maximum speed is required). Start with a pre-trained base model, not a blank architecture. The student benefits from its pre-training — distillation adapts its existing knowledge to your specific task.
Step 4: Train the student
Fine-tune the student model on the teacher's input-output pairs. If you have access to logits, use KL divergence loss (the student learns to match the teacher's probability distribution). If you only have text outputs, use standard supervised fine-tuning with the teacher's outputs as ground truth. Train for 2-5 epochs, monitoring for overfitting. Compute cost: training an 8B model on 50K examples typically requires 2-8 hours on a single A100 GPU.
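For the logits case, the standard recipe is a temperature-scaled KL divergence between the teacher's and student's distributions. A minimal PyTorch sketch, assuming you can run both models and access their logits:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Inside the training loop (teacher runs without gradients):
#   with torch.no_grad():
#       teacher_logits = teacher(input_ids).logits
#   student_logits = student(input_ids).logits
#   loss = distillation_loss(student_logits, teacher_logits)
#   loss.backward(); optimizer.step()
```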
Step 5: Evaluate against quality floor
Run your evaluation suite on both the teacher and student model. Compare accuracy, latency, and output quality across your test set. Pay special attention to edge cases and tail distribution inputs. If the student meets your quality floor (defined before starting the project), proceed to production testing. If not, iterate: add more training data, try a larger student model, or adjust the training hyperparameters.
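A bare-bones sketch of the comparison — the file formats, exact-match scoring, and the 90% floor are assumptions to adapt to your own eval harness:

```python
import json

QUALITY_FLOOR = 0.90  # agreed before the project started (assumed value)

def accuracy(predictions_path: str, labels_path: str) -> float:
    """Exact-match accuracy of model outputs against labeled eval examples."""
    with open(predictions_path) as p, open(labels_path) as l:
        preds = [json.loads(x)["output"] for x in p]
        labels = [json.loads(x)["label"] for x in l]
    return sum(pr == la for pr, la in zip(preds, labels)) / len(labels)

teacher_acc = accuracy("teacher_eval_outputs.jsonl", "eval_labels.jsonl")
student_acc = accuracy("student_eval_outputs.jsonl", "eval_labels.jsonl")

print(f"teacher: {teacher_acc:.1%}  student: {student_acc:.1%}  "
      f"retention: {student_acc / teacher_acc:.1%}")
print("ship" if student_acc >= QUALITY_FLOOR else "iterate: more data or a larger student")
```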
Step 6: Deploy and monitor
Deploy the student model alongside your teacher model. Route a small percentage of traffic (5-10%) to the student and compare production metrics: task completion rate, user feedback, override rate, and latency. Monitor for distribution shift: if your production inputs change over time, the student's quality may degrade faster than the teacher's because it has less general capability to fall back on. Plan for periodic re-distillation.
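One simple way to implement the traffic split is deterministic bucketing by user, so each user sees a consistent model during the test. A sketch, with the hashing scheme and 10% share as assumptions:

```python
import hashlib

STUDENT_TRAFFIC_SHARE = 0.10  # start small; ramp up as production metrics hold (assumed)

def route_to_student(user_id: str) -> bool:
    """Deterministically bucket users so the same user always hits the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < STUDENT_TRAFFIC_SHARE * 100

model = "student-7b" if route_to_student("user-42") else "teacher-frontier"
```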
Master AI Model Optimization in the Masterclass
Distillation, quantization, fine-tuning, and model deployment decisions are core to the AI PM Masterclass curriculum. Taught by a Salesforce Sr. Director PM.
Evaluating Distillation Quality — What to Measure
A distilled model that looks good on aggregate metrics can still fail on specific subsets of your traffic. Here is how to evaluate distillation quality thoroughly.
Measure quality on your task distribution, not benchmarks
General benchmarks (MMLU, HumanEval) test broad capabilities that your distilled model intentionally sacrificed. Evaluate on a held-out test set drawn from your production data. If your model handles customer support, test it on customer support queries — not on math or coding benchmarks. Your eval set should include at least 500 examples, stratified across the input categories your product handles.
Track quality degradation on edge cases separately
Distilled models degrade most on inputs that are rare in the training data: unusual phrasings, multi-language queries, adversarial inputs, and domain-specific jargon. Create an 'edge case' eval set of 100-200 examples specifically targeting these difficult inputs. If the student model drops below your quality floor on edge cases even while performing well on average, you need more edge case training data or a larger student model.
Compare calibration, not just accuracy
A well-calibrated model says 'I'm 80% confident' when it's correct 80% of the time. Distilled models can become poorly calibrated — overconfident in their wrong answers. Test confidence calibration: when the student expresses high confidence, is it actually correct proportionally? Poor calibration is particularly dangerous in products where the model's confidence score drives downstream decisions (like auto-approval thresholds).
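A common way to quantify this is expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's confidence to its observed accuracy. A minimal NumPy sketch, assuming your eval run produces per-example confidences and correctness flags:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average gap between stated confidence and observed accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: a model that claims ~90% confidence but is right only ~60% of the time
# is overconfident, and the ECE reflects that gap.
print(expected_calibration_error([0.9, 0.92, 0.88, 0.95, 0.91], [1, 0, 1, 0, 1]))
```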
Monitor production quality continuously after deployment
Distilled models are more sensitive to distribution shift than large general-purpose models. If your user base changes, your product adds new features, or external factors change the input distribution, the distilled model's quality may degrade before your teacher model's would. Set up automated quality monitoring: sample production outputs, run them through your eval pipeline, and alert when quality drops below your floor. Plan to re-distill quarterly or when quality alerts trigger.
Distillation vs. Quantization vs. Pruning — Choosing the Right Optimization
Distillation is one of several model optimization techniques. Each has different trade-offs, and they can be combined. Understanding when to use which — or when to combine them — is a critical decision for AI PMs scaling production systems.
Distillation: best for task-specific cost reduction
Distillation creates a new, smaller model trained to replicate a larger model's behavior on your specific task. It requires training data, compute for training, and ongoing maintenance. The payoff is large: 5-10x cost reduction at 85-95% quality retention. Use distillation when you have a proven, high-volume use case and need the largest possible cost-to-quality improvement. Distillation is the most effective optimization when the task is well-defined and stable.
Quantization: best for quick deployment optimization
Quantization reduces the numerical precision of model weights (e.g., from 16-bit to 4-bit) without changing the model architecture. It requires no training data, takes minutes to apply, and typically reduces memory and cost by 50-75% with minimal quality loss. Use quantization when you need a fast optimization with no data requirements. Quantization is complementary to distillation — you can quantize a distilled model for an additional 2x cost reduction.
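For illustration, loading an existing checkpoint in 4-bit with the Hugging Face transformers and bitsandbytes libraries looks roughly like this — the checkpoint name is a placeholder (it could be your distilled student), and a CUDA GPU plus access to the model are assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Minimal post-training quantization sketch: load an existing checkpoint with
# 4-bit weights via bitsandbytes. The model name is an assumed placeholder.
model_name = "meta-llama/Llama-3.1-8B-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
# The 4-bit model uses roughly a quarter of the fp16 memory footprint,
# usually with only a small quality drop on task-specific evals.
```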
Pruning: best for structured model compression
Pruning removes weights or entire layers from a model that contribute least to output quality. Structured pruning (removing entire attention heads or layers) can reduce model size by 20-40% with 1-3% quality loss on targeted tasks. Pruning is less commonly used in LLM production than distillation or quantization because the tooling is less mature, but it is effective for on-device deployment where model size constraints are absolute.
When to combine techniques
The most cost-efficient production models combine multiple optimization techniques. A common pipeline: distill from a 70B teacher to a 7B student (10x size reduction), then quantize the student to INT8 (2x size reduction), yielding a model that is 20x smaller and 15x cheaper than the original while retaining 80-90% of its task-specific quality. Each technique targets a different dimension of efficiency, so their benefits stack.
Decision framework for AI PMs
Start with quantization — it is fast, requires no data, and has minimal risk. If the quantized model meets your quality floor, ship it. If you need more optimization, evaluate distillation feasibility: do you have sufficient training data and volume to justify the investment? Pruning is a specialist technique — only pursue it if quantization and distillation together don't meet your size or speed requirements. Always define your quality floor before optimizing, and always A/B test optimized models against the baseline in production.
Scale AI Products Profitably in the AI PM Masterclass
Distillation, model optimization, and cost-performance trade-offs are core modules in the AI PM Masterclass. Learn to make the technical decisions that turn AI prototypes into profitable products. Taught by a Salesforce Sr. Director PM.