LEARNING AI PM

AI PM Glossary 2026: 100+ Terms Every AI Product Manager Should Know

By Institute of AI PM · 15 min read · May 10, 2026

TL;DR

This is a working AI PM vocabulary, not a textbook. Each term is a single Q&A you can quote in an interview, a PRD, or a Slack thread. We've organized 100+ terms into six buckets: Foundations, LLMs, Training, Inference, Evaluation, and Deployment. Skim the buckets you're weakest on; come back to the rest weekly.

Foundations

Start here. If you can't define these seven terms in one sentence each, you'll lose credibility with engineers in the first 30 seconds of a design review.

What is a token?

A token is the smallest unit an LLM processes. Roughly 0.75 words per token in English. 'unbelievable' might tokenize as ['un', 'believ', 'able']. Token count drives latency, cost, and context-window usage.
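A quick way to internalize the heuristic is a back-of-envelope estimator. This sketch uses the ~0.75 words/token rule of thumb, not a real tokenizer, and the per-million-token price is a placeholder that varies by model and provider:

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from the ~0.75 words/token heuristic for English."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

def estimate_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million

prompt = "Summarize the quarterly revenue report in three bullet points."
tokens = estimate_tokens(prompt)
print(tokens)  # 9 words / 0.75 ≈ 12 tokens
```

For real budgeting, count tokens with the model's actual tokenizer; the heuristic is only for napkin math.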

What is a parameter?

A parameter is a learned weight inside the neural network. GPT-4 is widely estimated at over 1T parameters (unconfirmed); Llama 3 8B has 8 billion. More parameters generally mean more capability and higher inference cost.
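Parameter count translates directly into serving memory, which is why it matters for deployment conversations. A rough sketch assuming FP16 weights at 2 bytes each (real deployments add KV cache and activation overhead on top):

```python
def model_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough weight-memory estimate: parameter count x bytes per parameter.
    FP16 = 2 bytes; INT8 quantization = 1 byte."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(model_memory_gb(8))                      # 16.0 GB just for Llama 3 8B weights in FP16
print(model_memory_gb(8, bytes_per_param=1))   # 8.0 GB after INT8 quantization
```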

What is a context window?

The maximum number of tokens (input + output) a model can process in one call. Claude Sonnet 4 supports 200K; Gemini 2.5 supports 1M+. Quadratic attention cost makes long contexts expensive.
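In practice this becomes a budgeting check before every call: input tokens plus reserved output must fit. A minimal sketch, assuming a 200K window:

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = 200_000) -> bool:
    """Input + reserved output must fit inside the model's context window."""
    return prompt_tokens + max_output_tokens <= context_window

# Leaving headroom for a 4K-token response inside a 200K window:
print(fits_in_context(150_000, 4_096))   # True
print(fits_in_context(198_000, 4_096))   # False -- truncate or summarize first
```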

What is an embedding?

A high-dimensional vector (typically 768–4096 dims) that represents the semantic meaning of text, images, or audio. Similar items end up close in vector space. Embeddings power search, RAG, and clustering.
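"Close in vector space" usually means cosine similarity. A self-contained sketch with toy 3-dim vectors standing in for real embedding-model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors -- real embeddings are 768-4096 dims from an embedding model:
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```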

What is a foundation model?

A large model pre-trained on broad data (e.g., GPT-4, Claude, Llama) intended to be adapted to many downstream tasks. Foundation models are the substrate; your product is the application layer.

What is multimodality?

The ability of a model to process multiple input types — text, images, audio, video — in the same context. GPT-4o and Gemini are natively multimodal across text, image, and audio; Claude 3.5 Sonnet handles text and images.

What is AGI?

Artificial General Intelligence — a system that matches or exceeds human performance across most economically valuable tasks. No agreed-upon definition or measurable threshold. Avoid using the term in PRDs.

LLM Architecture

These come up every time you debate model choice. You don't need to derive backprop — you need to know which words signal which tradeoffs.

What is a transformer?

The neural network architecture introduced in the 2017 'Attention Is All You Need' paper. Every modern LLM (GPT, Claude, Gemini, Llama) is a transformer. Its core innovation is self-attention.

What is attention?

The mechanism that lets each token weigh how much every other token matters. Computed via Query, Key, and Value vectors. Scales quadratically with sequence length, which is why long context is expensive.
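The Q/K/V mechanics fit in a few lines. A toy scaled dot-product attention for a single query — real models compute this for every token against every other token, which is where the quadratic cost comes from:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query position:
    score_i = (q . k_i) / sqrt(d), weights = softmax(scores),
    output = weighted sum of value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0]]    # the first key matches the query
vs = [[10.0, 0.0], [0.0, 10.0]]
out = attention(q, ks, vs)
print(out[0] > out[1])  # True: output is pulled toward the matching key's value
```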

What is a decoder-only model?

A transformer optimized for next-token generation. GPT, Claude, Llama are decoder-only. Encoder-decoder models (T5, BART) are different and rarely used in modern chat products.

What is Mixture of Experts (MoE)?

An architecture where each token is routed through only a subset of expert sub-networks. Mixtral uses MoE openly; GPT-4 is widely reported to. The payoff: huge total parameter counts with lower per-token compute cost.

What is in-context learning?

The ability to learn a new task from examples in the prompt, without weight updates. Few-shot prompting is the practical form. Emerges only at scale.

What is a reasoning model?

A model trained or scaffolded to produce extended chain-of-thought before answering. OpenAI o1/o3, Claude with extended thinking, DeepSeek R1. Higher accuracy on math, code, planning — at higher latency and cost.

What is chain-of-thought?

A prompting technique (or training signal) where the model produces intermediate reasoning steps before its final answer. Improves multi-step task accuracy substantially.

Training

Most PMs will never train a model from scratch. But every PM working with vendors, fine-tuning, or evals will see these terms in proposals — and need to push back on misuse.

What is pre-training?

The first and most expensive training stage: the model learns to predict the next token over trillions of tokens of internet, books, and code. This is where most knowledge is stored.

What is supervised fine-tuning (SFT)?

Adapting a pre-trained model on curated input-output pairs. Used to teach instruction-following and domain behavior. Typically tens of thousands of examples.

What is RLHF?

Reinforcement Learning from Human Feedback. Humans rank model outputs; a reward model is trained on those rankings; the LLM is fine-tuned against the reward model. Source of 'helpful, harmless, honest' alignment.

What is DPO?

Direct Preference Optimization — a simpler alternative to RLHF that fine-tunes directly on preference pairs without a separate reward model. Cheaper, increasingly common in 2026.

What is LoRA?

Low-Rank Adaptation. A parameter-efficient fine-tuning method that trains small adapter matrices instead of all weights. 100–1000x cheaper than full fine-tuning, with minimal quality loss for most use cases.
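The savings are easy to see in raw parameter counts. A sketch for a single weight matrix, with 4096x4096 dimensions and rank 8 as illustrative choices:

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Full fine-tuning trains the whole weight matrix W: d_out x d_in params."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two small matrices A (rank x d_in) and B (d_out x rank);
    their product B @ A is added to the frozen W at inference."""
    return rank * d_in + d_out * rank

# One 4096x4096 attention projection, LoRA rank 8:
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(full // lora)  # 256 -- ~256x fewer trainable params for this layer
```

Repeated across every adapted layer, this is where the 100–1000x cost reduction comes from.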

What is distillation?

Training a smaller 'student' model to mimic the outputs of a larger 'teacher' model. Used to ship cheaper, faster models that retain most of the capability.

What is constitutional AI?

Anthropic's training approach where the model critiques and revises its own outputs against a written 'constitution' of principles. Reduces reliance on human labelers for harmful behavior.

Drill These Terms in the AI PM Masterclass

The masterclass tests vocabulary live. You explain each concept back to a Salesforce Sr. Director PM until you can do it cleanly. That's the difference between recognizing a term and owning it.

Inference

Where the money is spent. PMs who ignore inference economics ship products that lose money on every request.

What is inference?

Running a trained model to produce outputs. The lifetime cost of an LLM product is dominated by inference, not training. Latency, throughput, and cost per token are the inference metrics PMs track.

What is TTFT?

Time To First Token — how long after the request the first token streams back. Drives perceived responsiveness. Target sub-500ms for chat UX.

What is TPS?

Tokens Per Second — how fast the model streams output once started. Below ~30 TPS feels slow for chat; above ~80 TPS feels instant.
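TTFT and TPS combine into the number users actually feel: total response time. A quick model, with the 300-token answer and 400ms TTFT as illustrative values:

```python
def response_time_seconds(ttft_ms: float, output_tokens: int, tps: float) -> float:
    """Total wait = time to first token + streaming time for the output."""
    return ttft_ms / 1000 + output_tokens / tps

# A 300-token answer at different streaming speeds, TTFT fixed at 400ms:
print(response_time_seconds(400, 300, 30))   # 10.4 s -- feels slow
print(response_time_seconds(400, 300, 80))   # ~4.2 s -- acceptable
```

Streaming masks much of this: users start reading at TTFT, not at completion, which is why TTFT usually matters more than total time.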

What is quantization?

Reducing the numerical precision of model weights (e.g., FP16 → INT8 → INT4). Cuts memory and latency 2–4x with small quality loss. Standard for on-device and cost-sensitive deployments.
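The core idea fits in a few lines. A toy per-tensor INT8 round trip — real quantizers work per-channel or per-group with calibration data, but the mechanics are the same:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to int8 range [-127, 127] with a per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

weights = [0.02, -0.51, 0.33, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each value is recovered to within one quantization step -- the 'small quality loss':
print(all(abs(w - r) <= scale for w, r in zip(weights, restored)))  # True
```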

What is speculative decoding?

An inference trick where a small 'draft' model proposes tokens and a large model verifies them in parallel. 2–3x speedup on the same hardware with no quality loss.

What is prefill vs decode?

Two phases of inference. Prefill processes the input prompt (compute-bound, parallel). Decode generates output tokens one at a time (memory-bound, sequential). They have different bottlenecks.

What is KV caching?

Storing keys and values from prior tokens so each new token doesn't recompute attention over the full history. Without it, generation would be O(n^2) per token.
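A simplified op-count model shows why the cache matters. This counts attention dot-products only, under the naive assumption that without a cache every generation step recomputes attention for every position:

```python
def decode_ops(n_generated: int, cached: bool) -> int:
    """Attention dot-products needed to generate n tokens one at a time.

    With a KV cache, step t only computes the new token's attention over t
    stored keys (linear per step). Without it, step t recomputes attention
    for all t positions over all t positions (quadratic per step).
    """
    total = 0
    for t in range(1, n_generated + 1):
        total += t if cached else t * t
    return total

print(decode_ops(1000, cached=True))    # 500,500 dot-products
print(decode_ops(1000, cached=False))   # 333,833,500 -- hundreds of times more
```

The cost of the cache is GPU memory: those stored keys and values are why long-context serving is memory-bound.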

Evaluation

If you can't measure quality, you can't ship. PMs who own evals own roadmap leverage.

What is an eval?

A test suite that measures model output quality on representative inputs. Evals are the regression tests of LLM products. Without them, you cannot ship updates safely.
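A minimal eval harness is just a loop and a pass rate. A sketch with a stand-in model function and hypothetical test cases — real suites call your LLM endpoint and use exact-match, regex, or LLM-judge checks:

```python
def run_eval(model_fn, test_cases: list[dict]) -> float:
    """Score a model function against a fixed suite; returns the pass rate.
    Each case pairs an 'input' with a 'check' predicate over the output."""
    passed = sum(1 for case in test_cases if case["check"](model_fn(case["input"])))
    return passed / len(test_cases)

# A stand-in 'model' for illustration (real evals call your actual endpoint):
def toy_model(prompt: str) -> str:
    return "Paris" if "capital of France" in prompt else "I don't know"

suite = [
    {"input": "What is the capital of France?", "check": lambda out: "Paris" in out},
    {"input": "What is the capital of Mars?", "check": lambda out: "don't know" in out},
]
print(run_eval(toy_model, suite))  # 1.0 -- rerun on every model or prompt change
```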

What is MMLU?

Massive Multitask Language Understanding — a 57-subject benchmark of multiple-choice knowledge questions. Frontier models score 85–90%+. Mostly saturated in 2026.

What is HumanEval?

A code-generation benchmark of 164 Python problems. Measures pass@1 — whether the first generated solution passes unit tests. Frontier models exceed 90%.

What is GPQA?

Graduate-Level Google-Proof Q&A — hard science questions written by domain experts. Resistant to web lookup. A more meaningful 2026 reasoning benchmark.

What is LLM-as-judge?

Using one LLM to grade outputs of another. Cheap and scalable but biased toward verbosity, position, and self-preference. Validate against human labels before trusting it.

What is a golden dataset?

A curated set of inputs with known-correct outputs used as a fixed regression suite. The single highest-leverage artifact you can build for a serious LLM product.

What is hallucination?

When a model outputs confident, plausible-sounding text that is factually wrong. Cannot be eliminated, only reduced via RAG, fine-tuning, and verification layers.

Deployment & Production

The terms that show up in launch reviews, security questionnaires, and incident postmortems.

What is RAG?

Retrieval-Augmented Generation. Pull relevant chunks from a vector store (or other retriever) into the prompt before the model answers. The default architecture for grounded enterprise LLM apps.
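The whole pipeline is retrieve, then prompt. A toy sketch with 2-dim vectors standing in for real embeddings and a hypothetical prompt template:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, chunks, k=2):
    """Rank stored chunks by embedding similarity; return the top-k texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Ground the model: instruct it to answer only from retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy vector store -- real systems embed the query with the same model as the chunks:
store = [
    {"text": "Refunds are processed within 5 days.", "vec": [0.9, 0.1]},
    {"text": "Our office is in Berlin.", "vec": [0.1, 0.9]},
]
top = retrieve([0.95, 0.05], store, k=1)
print(top)  # ['Refunds are processed within 5 days.']
```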

What is a vector database?

A datastore optimized for similarity search over embeddings. Pinecone, Weaviate, pgvector, Turbopuffer. Powers RAG and semantic search.

What is an agent?

An LLM that can take actions via tool calls in a loop, with state and a goal. Customer-support bots that look up orders are agents. Multi-step research assistants are agents.

What is tool use?

A model invoking external functions — web search, database queries, code execution — through structured outputs. Enables agents and grounds responses in fresh data.
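The shape looks like this: a tool definition in the JSON-schema style most providers use, and a structured call coming back. Field names vary by API — this is illustrative, not any vendor's exact spec:

```python
import json

# Hypothetical tool definition (illustrative field names):
get_order_tool = {
    "name": "get_order_status",
    "description": "Look up an order's shipping status by order ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

# Instead of prose, the model emits a structured call; your code executes it
# and feeds the result back into the conversation:
model_output = '{"tool": "get_order_status", "arguments": {"order_id": "A-1042"}}'
call = json.loads(model_output)
print(call["arguments"]["order_id"])  # A-1042
```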

What is MCP?

Model Context Protocol — Anthropic's open standard for connecting LLMs to tools and data sources. The USB-C of LLM tooling. Becoming standard across providers in 2026.

What is prompt injection?

An attack where malicious input causes the model to ignore its original instructions. Indirect prompt injection (poisoned web pages, emails) is the harder variant. Treat all model-reachable text as untrusted.

What is a guardrail?

A pre- or post-processing layer that filters inputs or outputs for policy violations, PII, or off-topic content. Distinct from model alignment. Use both.
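A toy post-filter shows the shape. One regex stands in here for what real guardrails do with classifiers and policy engines; the PII pattern is illustrative only:

```python
import re

def guardrail_check(text: str) -> tuple[bool, str]:
    """Toy output filter: block responses containing email-like PII.
    Returns (allowed, text-or-replacement)."""
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text):
        return False, "[blocked: output contained PII]"
    return True, text

ok, out = guardrail_check("Contact the customer at jane@example.com")
print(ok)   # False -- the raw model output never reaches the user
ok, out = guardrail_check("Your refund was processed.")
print(ok)   # True
```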

Memorize It Once. Use It Forever.

The AI PM Masterclass turns this glossary into reflexes — the kind hiring managers at FAANG and frontier labs actually test for.