LLM Temperature and Sampling Explained: Top-K, Top-P, and Why They Matter
TL;DR
Temperature scales the model's probability distribution: low temperature (0-0.3) makes output deterministic and predictable, high temperature (0.8-1.5) makes it diverse and creative. Top-k and top-p (nucleus sampling) cap the candidate pool — they prevent the model from picking absurdly low-probability tokens. Min-p is a newer alternative that's more robust to temperature changes. Repetition penalty fights loops. Setting temperature=0 doesn't actually guarantee determinism on most APIs (GPU non-associativity + batching variance), but it gets close. PMs who treat sampling as a product knob — not a leftover technical detail — ship more reliable, more creative, or more cost-efficient features depending on what the surface needs.
What Temperature Actually Does
After the model produces logits (raw scores) for each candidate next token, the sampler divides them by temperature T before softmax. T < 1 sharpens the distribution (the top token dominates more); T > 1 flattens it (lower-probability tokens become viable); T = 0 skips sampling entirely and picks the argmax (greedy decoding).
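A minimal sketch of the math with toy logits (the numbers are illustrative, nothing here is provider-specific):

```python
import numpy as np

def temperature_probs(logits: np.ndarray, T: float) -> np.ndarray:
    """Turn raw logits into next-token probabilities at temperature T."""
    if T == 0:                        # greedy: all mass on the argmax token
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / T               # T < 1 sharpens, T > 1 flattens
    scaled -= scaled.max()            # numerical stability for exp
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([4.0, 3.0, 1.0])    # toy scores for 3 candidate tokens
for T in (0.2, 0.7, 1.5):
    print(T, np.round(temperature_probs(logits, T), 3))
```

At T=0.2 the top token takes ~99% of the mass; at T=1.5 the weakest token climbs to ~8% and becomes a real contender.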
Temperature 0 (greedy)
Picks the single highest-probability token every step. Output is deterministic in theory — but most production APIs don't guarantee bit-exact reproducibility because of batch-size-dependent floating-point order. Expect ~95-99% identical outputs across runs, not 100%.
Temperature 0.2-0.4 (focused)
Slight randomness, but the high-probability token still wins almost always. Useful when you want consistency with light variation — customer support drafting, structured extraction, code suggestions.
Temperature 0.7-0.8 (balanced)
The de facto default for chat products (OpenAI and Anthropic both default to 1.0, but most app code overrides to ~0.7). Lower-probability tokens get real weight. Output varies meaningfully between runs but stays coherent.
Temperature 1.2-1.5 (creative)
Output becomes notably more diverse. Useful for brainstorming, marketing copy, fiction, ideation. Quality degrades — incoherence, factual drift, and grammatical weirdness spike.
Temperature > 2 (broken)
The distribution is so flat that the model picks absurdly improbable tokens. Output becomes word salad. There's no use case for this in production unless paired with aggressive top-k/top-p capping.
Top-K, Top-P, Min-P: Capping the Candidate Pool
Temperature alone can pull the model into the long tail. Sampling cutoffs prevent that by truncating the candidate set before sampling. They're cheap quality safeguards — and you almost always want at least one of them on.
Top-k (fixed pool size)
Keep only the k highest-probability tokens; sample from those. Common values: 40-100. Simple but rigid — k=50 is overkill when only 3 tokens are plausible, and not enough when 200 are. Mostly used in older OSS pipelines now.
Top-p / nucleus sampling
Keep the smallest set of tokens whose cumulative probability reaches p. Common values: 0.9-0.95. Adapts to the distribution — picks 3 tokens when the model is confident, 100 when it's uncertain. Default cutoff in most production APIs.
Min-p (newer)
Keep tokens with probability ≥ min_p × (probability of top token). Common values: 0.05-0.1. More robust across temperature settings than top-p — high temperature doesn't blow up the candidate set the way it does with nucleus sampling. Increasingly common in OSS.
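Simplified sketches of all three cutoffs, operating on an already-softmaxed probability vector (real implementations filter logits before softmax and handle batch dimensions, but the logic is the same):

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int = 50) -> np.ndarray:
    """Keep the k most probable tokens; zero out the rest, renormalize."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Keep tokens with probability >= min_p times the top token's."""
    out = np.where(probs >= min_p * probs.max(), probs, 0.0)
    return out / out.sum()
```

Note how top-p and min-p shrink or grow the pool with the model's confidence, while top-k keeps a fixed-size pool regardless.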
Combining cutoffs
Top-p + temperature is the standard combo. OpenAI recommends adjusting one or the other, not both. Anthropic exposes both but the docs caution that compound effects are unpredictable. Pick one knob to tune per use case.
Sampling Recipes by Product Use Case
Stop treating sampling as a default. Pick parameters by what the surface actually needs. These are battle-tested starting points; tune from here based on offline eval.
Structured extraction, classification, JSON output
Recipe: Temperature 0, top-p 1.0. You want maximum determinism. Pair with constrained decoding (function calling, JSON mode, grammar-constrained sampling) to enforce schema. Re-roll on validation failure (sketched below) rather than raising temperature to let creativity in.
PM Implication: Eval flakiness here is almost always a temperature problem, not a model problem. Lock to 0 and your eval pass rate becomes signal, not noise.
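A sketch of the re-roll pattern from the recipe above; call_model is a hypothetical wrapper around whichever provider API you use:

```python
import json

def extract_json(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """Request structured output deterministically; re-roll on invalid JSON.
    call_model is a hypothetical (prompt, **params) -> str provider wrapper."""
    for _ in range(max_attempts):
        text = call_model(prompt, temperature=0, top_p=1.0)
        try:
            return json.loads(text)   # schema validation would go here too
        except json.JSONDecodeError:
            continue                  # re-roll; do not raise temperature
    raise ValueError("No valid JSON after retries")
```

One caveat: at temperature 0 a bare re-roll can return the same bad output, so in practice you pair this with JSON mode or feed the validation error back into the prompt before retrying.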
Customer support, factual Q&A, RAG answers
Recipe: Temperature 0.1-0.3, top-p 0.9. Slight variation is fine for natural-sounding answers, but you want grounded, consistent output. Higher temperature actively hurts factuality and increases hallucination rate.
PM Implication: Most teams default to 0.7 for chat and don't reconsider for support workflows. Drop to 0.2 and watch hallucination rate fall 10-30%.
Code generation
Recipe: Temperature 0 for autocomplete and editing tasks. Temperature 0.2-0.4 for from-scratch generation when you want some creativity in approach. Top-p 0.95. Repetition penalty off (code repeats a lot legitimately).
PM Implication: Cursor, GitHub Copilot, and Codex all default near 0 for completions. Higher creativity sounds good in demos, fails in production.
Creative writing, brainstorming, marketing copy
Recipe: Temperature 0.8-1.2, top-p 0.95, optionally min-p 0.05 to keep diversity bounded. Run multiple samples (n=3-5) and let the user pick — or rerank with a quality model.
PM Implication: Single-shot creative output at temperature 1.0 is mid. Sampled-and-ranked creative output at temperature 1.0 is good. The product pattern beats the parameter tuning.
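A sketch of that sample-and-rank pattern; call_model and score are hypothetical stand-ins for your provider call and quality reranker:

```python
def best_of_n(call_model, score, prompt: str, n: int = 4) -> str:
    """Draw n diverse samples at creative settings, return the best one.
    call_model and score are hypothetical (provider call, quality model)."""
    candidates = [
        call_model(prompt, temperature=1.0, top_p=0.95)
        for _ in range(n)
    ]
    return max(candidates, key=score)  # or surface all n and let users pick
```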
Repetition Penalty, Frequency Penalty, Presence Penalty
Older models (GPT-3, early Llama) frequently looped: "The cat is on the mat. The cat is on the mat. The cat is on the mat." Modern models loop less but it still happens, especially on long generations or unusual prompts. Three penalties exist; they do different things.
Frequency penalty (OpenAI: -2.0 to 2.0)
Subtracts a value proportional to how many times a token has already appeared. Stronger penalty on tokens used many times. Default 0. Bump to 0.3-0.7 for long-form generation prone to looping.
Presence penalty (OpenAI: -2.0 to 2.0)
Subtracts a flat value if the token has appeared even once. Encourages topic shifts. Useful for brainstorming and exploration where you want the model to bring up new ideas.
Repetition penalty (OSS stacks like llama.cpp, vLLM, HF transformers: typically 1.0-1.3)
Multiplicative: divides positive logits (and multiplies negative ones) for tokens that have already appeared. >1.0 discourages repetition; values above 1.2 commonly degrade output (the model starts avoiding common words like "the"). Use sparingly.
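A simplified sketch of how the three penalties adjust next-token logits, following the common open-source implementations (HF transformers style for the multiplicative one):

```python
import numpy as np

def apply_penalties(logits: np.ndarray, counts: np.ndarray,
                    freq_pen: float = 0.0, pres_pen: float = 0.0,
                    rep_pen: float = 1.0) -> np.ndarray:
    """Adjust logits given how often each token appeared in the text so far."""
    out = logits.astype(float).copy()
    out -= freq_pen * counts                  # frequency: scales with count
    out -= pres_pen * (counts > 0)            # presence: flat, once seen
    seen = counts > 0                         # repetition: multiplicative
    out[seen] = np.where(out[seen] > 0,
                         out[seen] / rep_pen,
                         out[seen] * rep_pen)
    return out
```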
When to skip them entirely
Modern models (GPT-4o, Claude 3.7+, Gemini 2+) loop rarely. Keep penalties at 0 for most tasks. Code, data, and lists legitimately repeat — penalties hurt those tasks. Turn them on only if you observe loops in eval.
Determinism, Seeds, and the Reproducibility Lie
PMs and execs often ask "can we make this deterministic?" The honest answer is "close, but not exactly." Understand why before you promise a customer.
Why temperature=0 isn't enough
GPU floating-point math is non-associative. Different batch sizes change the order of summation. Order changes can flip a near-tie between two tokens — producing different outputs at temperature 0. Most providers do not guarantee bit-exact reproducibility.
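You can see the underlying issue without a GPU; summation order alone changes the result:

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0: the 1.0 is rounded away before a is added
```

Scale that up to thousands of parallel reductions whose order depends on batch shape, and a near-tie between two tokens can flip.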
Seed parameters help, partially
OpenAI exposes a seed parameter and a system_fingerprint to flag when infrastructure changed. Same seed + same fingerprint = high reproducibility (still not bit-exact). Anthropic does not currently expose a seed.
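A minimal example with the OpenAI Python client (the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",                    # illustrative model name
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
    temperature=0,
    seed=42,                           # best-effort, not bit-exact
)
print(resp.system_fingerprint)         # changes when infrastructure changes
```

Log the fingerprint alongside outputs; if it changes between runs, reproducibility comparisons are off the table.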
What this means for evals
Run evals at temperature 0 AND average over multiple runs (3-5); a fixed seed buys little at temperature 0, since the run-to-run noise comes from batching. Single-run evals at temperature 0 are still noisy. Tracking variance is more useful than trying to eliminate it.
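A sketch of that pattern; run_eval_suite is a hypothetical callable that executes your eval set once and returns a pass rate in [0, 1]:

```python
from statistics import mean, pstdev

def eval_with_variance(run_eval_suite, n_runs: int = 5) -> tuple[float, float]:
    """Run the same eval suite several times; report mean pass rate and spread.
    run_eval_suite is a hypothetical callable returning a pass rate in [0, 1]."""
    rates = [run_eval_suite() for _ in range(n_runs)]
    return mean(rates), pstdev(rates)  # report the spread, don't hide it
```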
Customer-facing implications
Don't promise "the same input always produces the same output." Promise "outputs are stable and reproducible to within X% of cases." If a regulatory or audit requirement needs true determinism, you need either deterministic post-processing on top, or a non-LLM rule layer.