AI Quantization vs. Distillation: Which Path to Cheaper Inference?
TL;DR
Quantization compresses an existing model by storing weights in lower precision — quick, cheap, modest quality loss. Distillation trains a smaller new model to mimic a bigger one — slower, more expensive, higher ceiling. Most production AI products end up using both, but in a specific order. This guide compares them on cost, quality preservation, and operational fit, and shows when each is the right move.
The Core Difference in One Sentence
Quantization keeps the same model and stores its weights in less precise numbers (FP32 → INT8 or INT4). Distillation creates a new, smaller model trained to behave like the larger one. The first is a runtime trick; the second is a training investment. The two solve different problems and combine well.
Quantization
Lower-precision weights (8-bit, 4-bit). Same architecture, smaller memory footprint, faster compute. Quality drop typically 1-5%.
Distillation
Train a smaller model to imitate a larger one. Different architecture, much smaller, much faster. Quality drop varies widely (5-30%) depending on domain.
When to combine
Distill first to get the smaller architecture, then quantize the result. Stacks the savings; best practice in production AI today.
Time investment
Quantization: hours to days. Distillation: weeks to months including data preparation and eval. Choose accordingly.
Quantization in Detail
Quantization swaps high-precision weights (typically FP16 at inference time) for low-precision representations. Relative to FP16, INT8 roughly halves memory and bandwidth; INT4 cuts them to about a quarter. Modern quantization methods (GPTQ, AWQ) and formats (GGUF) preserve quality remarkably well — usually within 1-3% of full precision on standard benchmarks.
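As a concrete illustration, here is a minimal sketch of naive post-training symmetric INT8 quantization in Python with NumPy. The per-tensor scaling and the toy weight matrix are illustrative assumptions; production methods like GPTQ and AWQ are considerably more sophisticated.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Naive symmetric per-tensor INT8 quantization (illustrative only)."""
    scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # recover approximate weights

# Toy weight matrix standing in for one layer of a model.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

# vs FP32 shown here the saving is 4x; vs an FP16 baseline it would be 2x.
print(f"FP32 size: {w.nbytes / 2**20:.0f} MiB, INT8 size: {q.nbytes / 2**20:.0f} MiB")
print(f"Mean absolute error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```

Real methods quantize per-channel or per-group and use calibration data to choose scales, which is where most of the quality preservation comes from.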
Post-training quantization (PTQ)
Quantize after training is complete. Cheapest, fastest. May lose more quality than alternatives. Most common in practice.
Quantization-aware training (QAT)
Account for quantization during training. Higher quality at low precision, but requires retraining the model.
Activation-aware (AWQ)
Use activation statistics to identify the most important weight channels and shield them from quantization error. Often near-lossless at INT4.
Mixed precision
Critical layers stay high precision; others quantize aggressively. Common in production deployments; a minimal sketch of the idea follows this list.
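To make the mixed-precision idea concrete, here is a hedged sketch: channels whose activations are largest (a stand-in for the importance signal AWQ estimates from calibration data) stay in FP16, and everything else is quantized to INT8. The 1% keep fraction and the importance heuristic are illustrative assumptions, not AWQ's actual algorithm.

```python
import numpy as np

def mixed_precision_quantize(w, act_scale, keep_frac=0.01):
    """Keep the most activation-salient output channels in FP16,
    quantize the rest to INT8 (illustrative heuristic, not AWQ itself)."""
    n_keep = max(1, int(keep_frac * w.shape[0]))
    salient = np.argsort(act_scale)[-n_keep:]  # channels with largest activations
    mask = np.zeros(w.shape[0], dtype=bool)
    mask[salient] = True

    kept = w[mask].astype(np.float16)          # protected channels
    rest = w[~mask]
    scale = np.abs(rest).max() / 127.0
    quant = np.clip(np.round(rest / scale), -127, 127).astype(np.int8)
    return kept, quant, scale, mask

# Toy layer: activation statistics would come from calibration data in practice.
w = np.random.randn(1024, 1024).astype(np.float32)
act_scale = np.abs(np.random.randn(1024))
kept, quant, scale, mask = mixed_precision_quantize(w, act_scale)
print(f"{mask.sum()} channels kept in FP16, {int((~mask).sum())} quantized to INT8")
```

AWQ itself avoids mixed storage by rescaling salient channels before uniform quantization, but the protect-the-important-weights intuition is the same.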
Distillation in Detail
Distillation trains a smaller "student" model on the outputs of a larger "teacher." Instead of just hard labels, the student learns from the teacher's full output distribution — capturing nuance that labels alone can't convey. Done well, a 7B distilled model can match a 70B teacher on the specific tasks it was distilled for.
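The "full output distribution" point is easiest to see in the standard distillation loss: the student is trained against the teacher's temperature-softened probabilities, not just the argmax label. A minimal PyTorch sketch, assuming both models expose logits over the same vocabulary; the temperature and blend weight are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with hard-label cross-entropy.
    T and alpha are illustrative defaults; both are tuned in practice."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to compensate for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 8 examples, 32k-token vocabulary.
student = torch.randn(8, 32000, requires_grad=True)
teacher = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```

The temperature T softens both distributions so the teacher's near-miss preferences, not just its top pick, carry gradient signal to the student.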
Task-specific distillation
Distill for one narrow task: code completion, classification, summarization. Small model + narrow scope = excellent quality at tiny cost.
General distillation
Distill broad capability into a smaller architecture. Harder, more compute, more data — but produces a versatile small model.
Synthetic data distillation
Generate training data from the teacher itself. Removes labeling cost; the dominant approach in 2025-2026. See the sketch after this list.
Behavior cloning vs. logit matching
Behavior cloning teaches the student to reproduce the teacher's outputs; logit matching teaches it to reproduce the teacher's distributions (as in the loss sketch above). The latter generally transfers more signal, but requires access to teacher logits.
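Here is a hedged sketch of the synthetic-data loop: prompt the teacher, record its responses, and accumulate a training file for the student. The `call_teacher` function is a placeholder for whatever API or local inference call you actually use; the prompt list and JSONL format are illustrative assumptions.

```python
import json

def call_teacher(prompt: str) -> str:
    """Placeholder: wire this to your teacher model's API or local runtime."""
    raise NotImplementedError

def build_distillation_set(prompts, out_path="distill_train.jsonl"):
    """Generate (prompt, teacher_response) pairs as student training data."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = call_teacher(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

# In practice the prompt set is sampled from production traffic or
# synthesized to cover the target task's long tail.
```

Training on these pairs is behavior cloning; if the teacher also exposes per-token logits, logging them alongside the text upgrades the same dataset to logit matching.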
Make Compression Decisions With Confidence
The AI PM Masterclass walks through compression techniques with cost models and case studies — so you can lead these conversations with engineers, not follow them.
Decision Framework: Which to Reach For First
Reach for quantization first when
You have a working model and need lower cost or latency right now. Quantization is days, not months. The wins are immediate.
Reach for distillation when
Your model is too big to deploy at all (latency, memory, edge constraints) or your domain is narrow enough that a much smaller specialized model is realistic.
Combine them when
You're shipping at scale and need every percentage point. Distill to get a smaller architecture, quantize to get a smaller representation. Production-grade.
Skip both when
Your traffic is small enough that frontier API costs are fine. Optimization eats engineering time; only invest when the volume justifies it.
What PMs Should Watch For
Quality drift on edge cases
Compressed models often hold up on average but break on rare inputs. Eval coverage on long-tail scenarios is mandatory before shipping.
Hidden cost of distillation training
"Just distill it" isn't a small project. Budget months and meaningful compute. Plan for failed iterations.
Over-quantization for the deployment
INT4 might work on benchmarks and fail on your specific traffic. Always re-eval on production-representative data.
Not capturing teacher signal upstream
If you might distill later, log teacher logits now. Replaying calls is expensive; capturing signal upstream is essentially free.