TECHNICAL DEEP DIVE

Speculative Decoding for Product Managers: How LLMs Get Faster

By Institute of AI PM · 12 min read · May 6, 2026

TL;DR

Speculative decoding is the inference trick that quietly powers most modern LLM serving: a small "draft" model proposes several tokens at once, and the big "target" model verifies them in parallel. Done right, output speed jumps 2-3x with no change in quality. PMs who understand it can demand it from vendors, design products around its constraints, and reason intelligently about latency tradeoffs.

The One-Sentence Explanation

An LLM normally generates one token at a time. Speculative decoding uses a fast small model to guess the next several tokens, then has the slow big model verify them all at once. When guesses are right, the big model produces several tokens in the time it would have taken to produce one. When guesses are wrong, you fall back to normal generation.

The draft model

A small, fast model (often a distilled version of the target). It proposes 4-8 candidate tokens per step, and its per-step cost is small compared to the target's.

The target model

Your real frontier model. It verifies all the draft tokens in a single forward pass; because that pass costs roughly the same as generating one token, this is where the speedup comes from.

Verification step

Compares the draft model's probabilities against the target model's for each proposed token. Accepted tokens are kept; at the first rejection, a corrected token is sampled from the target's adjusted distribution and the remaining draft tokens are discarded.

Why it's lossless

The output distribution is provably identical to the target model alone. Quality is preserved by construction — not by approximation.
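
To make the propose-verify-accept loop concrete, here is a minimal sketch in Python. The draft_probs and target_probs functions are toy stand-ins for the two models (hypothetical, for illustration only); the acceptance rule shown, keep a drafted token with probability min(1, p_target / p_draft) and otherwise resample from the renormalized difference of the two distributions, is the standard one that makes the method lossless.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size

# Hypothetical stand-ins for the two models (illustration only): each returns
# a probability distribution over the toy vocabulary given the tokens so far.
# In a real system these would be forward passes of the draft / target LLMs.
def draft_probs(tokens):
    logits = np.ones(VOCAB)
    logits[len(tokens) % VOCAB] += 2.0
    return logits / logits.sum()

def target_probs(tokens):
    logits = np.ones(VOCAB)
    logits[len(tokens) % VOCAB] += 3.0
    return logits / logits.sum()

def speculate_step(tokens, k=5, rng=np.random.default_rng()):
    """One speculative-decoding step: draft k tokens, verify, return the accepted ones."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, q, ctx = [], [], list(tokens)
    for _ in range(k):
        dist = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=dist))
        proposed.append(tok)
        q.append(dist)
        ctx.append(tok)

    # 2. Target model scores every drafted position. In a real system this is
    #    ONE batched forward pass, which is where the speedup comes from.
    p = [target_probs(list(tokens) + proposed[:i]) for i in range(k + 1)]

    # 3. Verify left to right with the accept / resample rule.
    accepted = []
    for i, tok in enumerate(proposed):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)  # target agrees often enough: keep it
            continue
        # First rejection: resample from max(0, p - q), renormalized. This
        # correction is what keeps the output distribution identical to the target's.
        residual = np.maximum(p[i] - q[i], 0.0)
        accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
        return accepted
    # 4. All k drafts accepted: take one bonus token from the target's next distribution.
    accepted.append(int(rng.choice(VOCAB, p=p[k])))
    return accepted
```

Each call returns between 1 and k+1 tokens; looping it until an end-of-sequence token appears yields samples from exactly the same distribution as running the target model alone.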

Why It Works (the Intuition)

Most language is predictable. After "The capital of France is" the next several tokens (" Paris.") are nearly certain — even a small model gets them right. The expensive part of generation is not the arithmetic but loading the model's weights for every forward pass, so verifying 5 tokens in one pass costs roughly the same as generating 1. When the draft's guesses are correct, you collect those tokens essentially for free.

1. Acceptance rate matters

A draft that's right 70% of the time delivers ~3x speedup; right 90% of the time, ~5x; right only 30% of the time can actually slow you down (the arithmetic is sketched after this list).

2. Domain matters

Code generation has very high acceptance (syntax is predictable). Creative writing has lower acceptance. Routing speculative decoding by domain helps.

3. Draft model choice matters

Too small = bad guesses. Too big = expensive draft step. Most production systems use a 1-7B draft for a 70-400B target.

4. Hardware matters

Speedup comes from parallel verification on the same GPU, where decoding is memory-bandwidth-bound and the extra compute is nearly free; CPU inference sees less benefit.
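
Here is a back-of-the-envelope sketch of the arithmetic behind the acceptance-rate numbers in point 1, under the simplifying assumption that each of gamma drafted tokens is accepted independently with probability alpha:

```python
# Expected tokens emitted per target-model forward pass, assuming each of the
# gamma drafted tokens is accepted independently with probability alpha
# (a simplification: real acceptances are correlated and vary by position).
def expected_tokens_per_step(alpha: float, gamma: int = 5) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.3, 0.7, 0.9):
    print(f"acceptance {alpha:.0%}: ~{expected_tokens_per_step(alpha):.1f} tokens per target pass")
# acceptance 30%: ~1.4 tokens per target pass
# acceptance 70%: ~2.9 tokens per target pass
# acceptance 90%: ~4.7 tokens per target pass
```

This ignores the draft model's own cost and per-step overheads, which is why a 30% acceptance rate can end up slower than plain decoding in practice even though the formula still gives a value above 1.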

Why PMs Should Care

Latency is product. A chatbot that responds in 500ms feels alive; the same chatbot at 2 seconds feels broken. Speculative decoding is one of the few techniques that improves perceived speed without sacrificing quality, and it's often the single largest UX lever you have once the model is chosen.

Better perceived UX

Streaming feels twice as fast. Tokens-per-second more than doubles for typical workloads, though time-to-first-token is largely unchanged (see the limitations below).

Higher throughput per GPU

Self-hosted teams see 2-3x more requests per GPU-hour. Cost economics shift meaningfully at scale.

Vendor-managed, free

Most major API providers already use it under the hood. Your job is to make sure it's on, not to implement it.

Question to ask vendors

"Do you use speculative decoding for our model? What's the acceptance rate on our domain?" If the rep can't answer, escalate.

Master AI Inference Tradeoffs in the Masterclass

The AI PM Masterclass demystifies inference techniques like speculative decoding so you can lead architecture conversations with confidence.

Variants Worth Knowing

Medusa decoding

Adds multiple lightweight prediction heads to the target model itself, eliminating the separate draft model. Simpler ops, modest speedup vs. classical speculative decoding.

Lookahead decoding

Uses Jacobi iteration to predict multiple tokens without a draft model. Works on any LLM with no extra training.

EAGLE / EAGLE-2

A more sophisticated draft architecture that reuses the target model's hidden features. Higher acceptance rate and larger speedup, at the cost of more complex ops.

Self-speculative decoding

Uses earlier layers of the same model as the draft, avoiding the need for a separate model. Less common, but practical for very large models.

Limitations and Honest Tradeoffs

Doesn't help time-to-first-token (much)

The first token still requires a full forward pass. Speculative decoding accelerates subsequent tokens. For very short outputs, savings are smaller.

Memory pressure

Running a draft model alongside the target consumes GPU memory. Some setups can't fit both, eliminating the option.

Variable latency

Average latency drops, but variance can rise. Some requests get 5x speedup, others get 1.2x. Worst-case may be unchanged.

Custom samplers can break it

If you need exotic sampling (constrained decoding with complex grammars), speculative decoding may need adaptation or be unavailable.

Reason About Latency Like a Senior PM

The Masterclass walks through inference techniques, eval design, and architecture decisions — the technical fluency that gets AI PMs promoted.