TECHNICAL DEEP DIVE

The Attention Mechanism Deep Dive: Q, K, V Explained for PMs

By Institute of AI PM · 14 min read · May 10, 2026

TL;DR

Attention is the operation where every token computes a weighted average of every other token, using learned Query, Key, and Value vectors. The math is simple: softmax(QKᵀ/√d)V. The cost is the problem: compute and memory grow as O(n²) with sequence length, which is why 1M-token context windows are an engineering achievement, not a free upgrade. Multi-head attention runs 8-128 attention operations in parallel to capture different relationship types. Flash attention rewrote the GPU kernel to be IO-aware and made long contexts economically viable. Attention sinks explain why models cling to the first few tokens of a prompt. PMs who understand this stop being surprised by latency curves, context-window pricing, and "lost in the middle" failures.

Query, Key, Value: The Three Vectors That Matter

For every token in a sequence, the model computes three vectors by multiplying the token's embedding against three learned matrices: WQ, WK, WV. The intuition is a key-value lookup, but soft and learned end-to-end.

1. Query (Q): "What am I looking for?"

Each token broadcasts a query vector that asks what kind of information it needs from other tokens. The token "it" in "The dog chased the cat because it was hungry" emits a query that points toward animate subjects.

2. Key (K): "What do I contain?"

Each token also emits a key vector — an advertisement of what kind of information it carries. "dog" advertises animate, subject, predator. "cat" advertises animate, object, prey.

3. Value (V): "What do I contribute?"

If selected, the token contributes its value vector. Q and K determine WHO to attend to. V determines WHAT actually flows. The split lets the model decouple addressing from content.

4. Attention scores: dot(Q, K) / √d

Compute Q·K for every token pair and divide by √d (where d is the head dimension, ~64-128). High score = high relevance. The √d scaling keeps dot products from growing with head dimension and saturating the softmax, where gradients vanish: a small detail, huge stability impact.

5. Softmax + weighted sum

Softmax over the scores turns them into a probability distribution. Multiply each token's V by its probability and sum. The result is the new representation for each token, now informed by every other token in context.
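
Putting the five steps together: a minimal NumPy sketch of single-head attention. The weight matrices, dimensions, and random inputs are illustrative placeholders, not values from any real model, and the causal mask that decoder-only models add is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """X: (n_tokens, d_model) embeddings -> (n_tokens, d_head) new representations."""
    Q = X @ W_q                               # "what am I looking for?"
    K = X @ W_k                               # "what do I contain?"
    V = X @ W_v                               # "what do I contribute?"
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)        # (n, n) pairwise relevance, scaled by √d
    weights = softmax(scores)                 # each row is a probability distribution
    return weights @ V                        # weighted sum of values

# Toy usage: 6 tokens, 32-dim embeddings, one 8-dim head
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 32))
W_q, W_k, W_v = (rng.normal(size=(32, 8)) * 0.1 for _ in range(3))
print(single_head_attention(X, W_q, W_k, W_v).shape)  # (6, 8)
```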

Why Attention Is O(n²): The Quadratic Tax

Attention computes pairwise interactions. For n tokens, that's n×n score computations, n×n softmax entries, n×n value mixing. Memory follows the same curve. This is not a software bug; it's the price of every-token-attends-to-every-token expressiveness.

1K context: trivial

1M attention scores per layer. Fits in cache, sub-millisecond per layer on an H100. Cost is dominated by feed-forward layers, not attention.

32K context: noticeable

1B attention scores per layer. Attention starts to dominate compute, and latency scales visibly with input length: time to first token can be 5-10x longer than for a 1K-token input.

128K context: expensive

16B scores per layer × ~96 layers = real GPU time. Why providers charge per input token: long contexts genuinely cost more compute, not just storage. Doubling context quadruples attention cost.

1M context: an engineering project

1T attention scores per layer is infeasible without sparse attention, ring attention, or other tricks. Gemini 1.5 Pro and recent Claude models use proprietary techniques to make long context viable. Recall accuracy still degrades — see attention sinks below.
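
The tier numbers above are just n² arithmetic. A back-of-the-envelope sketch (assuming one dense attention matrix per layer in fp16, ignoring heads, batching, and the feed-forward layers):

```python
# Score count and memory if the full n x n attention matrix were materialized.
# Real serving stacks use flash attention and never store it, so treat these
# figures as the upper bound that kernel work is designed to avoid.
for n in (1_024, 32_768, 131_072, 1_048_576):
    scores = n * n
    gib = scores * 2 / 2**30          # fp16 = 2 bytes per score
    print(f"{n:>9} tokens -> {scores:.1e} scores/layer, {gib:,.1f} GiB if materialized")
```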

PM implication: when a vendor advertises a 1M-token context window, ask three questions. What's the per-token price? What's the time-to-first-token at 1M? And what's the published needle-in-a-haystack recall at depth?

Multi-Head Attention: Why Models Run 64 Attentions in Parallel

A single attention operation can only encode one type of relationship at a time. Multi-head attention splits the embedding into N parallel "heads" (typically 8-128), each with its own WQ, WK, WV, and concatenates the results.
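
A minimal sketch of the split-and-concatenate pattern. Head count, dimensions, and weights are illustrative; production implementations fuse these steps into optimized kernels.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). Each W_* is (d_model, d_model); W_o mixes the heads back together."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(W):
        # Project, then reshape to (n_heads, n, d_head) so each head gets its own slice
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)       # (n_heads, n, n)
    per_head = softmax(scores) @ V                             # (n_heads, n, d_head)
    concat = per_head.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return concat @ W_o

# Toy usage: 6 tokens, 64-dim model, 8 heads of 8 dims each
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 64))
W_q, W_k, W_v, W_o = (rng.normal(size=(64, 64)) * 0.05 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=8).shape)  # (6, 64)
```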

Different heads learn different relationships

What it is: Anthropic's mechanistic interpretability work on Claude has identified specific heads that handle syntactic agreement, coreference resolution, induction (copy patterns), positional tracking, and named entity disambiguation. They're not labeled — they emerge from training.

PM Implication: When a model fails on a specific kind of reasoning (e.g., pronoun resolution), it's often a small number of heads doing the work. Fine-tuning can sharpen those heads; quantization can damage them.

Grouped-Query Attention (GQA)

What it is: Llama 3, Mistral, and Claude Haiku use GQA: many query heads share a smaller number of key/value heads. Cuts KV cache memory ~4-8x with negligible quality loss.

PM Implication: Why "cheap" models can offer long context at lower price points — the architecture is intentionally trading a tiny bit of expressiveness for big memory savings.
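
A hedged sketch of the grouping idea: project only a few K/V heads and let each query head reuse its group's K/V. The 8-query/2-KV split and all dimensions here are illustrative; real models pick their own ratios.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(X, W_q, W_k, W_v, n_q_heads, n_kv_heads):
    """Fewer K/V heads than query heads: the KV cache shrinks by n_q_heads / n_kv_heads."""
    n, d_model = X.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads                       # query heads sharing one K/V head
    Q = (X @ W_q).reshape(n, n_q_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_kv_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_kv_heads, d_head).transpose(1, 0, 2)
    # Only the small K and V above would live in the KV cache; this repeat
    # happens at compute time so every query head sees its group's K/V.
    K, V = np.repeat(K, group, axis=0), np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V
    return out.transpose(1, 0, 2).reshape(n, d_model)

# Toy usage: 8 query heads share 2 K/V heads, so the KV cache is 4x smaller
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 64))
W_q = rng.normal(size=(64, 64)) * 0.05
W_k = rng.normal(size=(64, 16)) * 0.05    # 2 KV heads x 8 dims
W_v = rng.normal(size=(64, 16)) * 0.05
print(grouped_query_attention(X, W_q, W_k, W_v, n_q_heads=8, n_kv_heads=2).shape)  # (6, 64)
```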

Multi-head Latent Attention (MLA)

What it is: DeepSeek-V2 and V3 introduced MLA: compress K and V into a smaller latent vector per token, then expand on demand. Cuts KV-cache memory ~10x. One reason DeepSeek can serve long-context inference at competitive prices.

PM Implication: Architecture innovation moves the cost curve. Track which providers ship attention-architecture changes, not just bigger models. Cost-per-token drops happen at these inflection points.
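
The compression idea, reduced to its simplest form. This is an illustrative sketch of the latent-KV concept, not DeepSeek's exact MLA (which also handles position embeddings and other details); all dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_model, n_heads, d_head, d_latent = 6, 64, 8, 8, 16

X = rng.normal(size=(n, d_model))
W_down = rng.normal(size=(d_model, d_latent)) * 0.05            # compress: this output is cached
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.05   # expand to K on demand
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.05   # expand to V on demand

latent = X @ W_down      # (n, d_latent): the only per-token state the KV cache keeps
K = latent @ W_up_k      # reconstructed at attention time
V = latent @ W_up_v

standard_floats = n * n_heads * d_head * 2   # full K + V per token
latent_floats = n * d_latent                 # latent per token
print(f"cached floats: standard={standard_floats}, latent={latent_floats}, "
      f"ratio={standard_floats / latent_floats:.0f}x")   # 8x with these toy dims
```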

Build Real Technical Fluency

The AI PM Masterclass goes deep enough that you can ask hard questions of your ML team — without pretending to be one of them. Taught live by a Salesforce Sr. Director PM.

Flash Attention: The Kernel That Made Long Context Viable

Until 2022, the bottleneck on attention wasn't FLOPs — it was memory bandwidth. The naive implementation materialized the full n×n attention matrix in GPU HBM, which thrashed memory and left compute idle. Tri Dao's Flash Attention paper rewrote the kernel to be IO-aware: tile the computation in SRAM, never materialize the full matrix, recompute on backward pass.
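
What "IO-aware tiling" means in the simplest terms: stream over blocks of K and V, keep running softmax statistics, and never build the full n×n matrix. The NumPy sketch below shows the online-softmax idea only; the real kernels do this in GPU SRAM with causal masking and a backward-pass recomputation trick, none of which is shown here.

```python
import numpy as np

def streaming_attention(Q, K, V, block=128):
    """Attention computed block-by-block over K/V with a running softmax.
    The full (n, n) score matrix is never materialized."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    running_max = np.full(n, -np.inf)     # per-row max score seen so far
    running_sum = np.zeros(n)             # per-row softmax denominator so far
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)            # only an (n, block) slice
        new_max = np.maximum(running_max, scores.max(axis=1))
        rescale = np.exp(running_max - new_max)   # fix up what was accumulated so far
        p = np.exp(scores - new_max[:, None])
        running_sum = running_sum * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ Vb
        running_max = new_max
    return out / running_sum[:, None]

# Sanity check against the naive implementation on a small random example
rng = np.random.default_rng(4)
Q, K, V = rng.normal(size=(3, 512, 64))
scores = Q @ K.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(np.allclose(streaming_attention(Q, K, V), weights @ V))  # True
```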

2-4x speedup, identical math

Flash Attention v1 → v2 → v3 cut training and inference time without changing model quality. Every modern training pipeline uses it. If your provider doesn't, they're paying 2-4x more for compute and passing it on.

Memory cost dropped from O(n²) to O(n)

The full attention matrix never lives in HBM. This is what unlocked 32K, 128K, and beyond. Without flash attention, GPT-4 Turbo's 128K window would not be economically viable.

Hardware-specific tuning

Flash Attention 3 specifically targets H100's asynchrony (warp-specialization, FP8). Provider performance now depends on which kernels they've adopted — and on whether they're running A100s, H100s, B200s, or TPUs.

Why this matters in vendor reviews

When two providers quote different latency for the same model, the difference is often kernel-level — flash attention version, KV cache layout, batching strategy. Ask the question; serious providers can answer it.

Attention Sinks and the "Lost in the Middle" Problem

Two empirical findings shape how prompts should be structured. Both come from how attention probability distributes across long contexts.

Attention sinks

Xiao et al. 2023 showed that LLMs allocate disproportionate attention to the very first tokens of the context, even when those tokens are semantically irrelevant. The softmax has to sum to 1, and early tokens become "sinks" for excess attention. Evict those tokens from the KV cache and generation quality collapses.

Lost in the middle

Liu et al. 2023 showed that retrieval accuracy follows a U-curve across context position: high at the start, high at the end, ~30 points lower in the middle. Holds across GPT-4, Claude, and open models. PM implication: put critical instructions and key facts at the top or the bottom of long prompts.

Why this happens

Causal attention masks (decoder-only models can only attend backwards) plus position encoding biases concentrate attention budget on positional anchors. It's an architectural artifact — newer position encodings (RoPE, ALiBi, YaRN) help but don't fully fix it.

How to design around it

Repeat critical constraints at top and bottom of system prompts. For RAG, keep the most relevant chunks closer to the query. For long agent traces, summarize stale context to compress middle drift. Don't assume the model will "remember" mid-context content.
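
One way to turn those rules into code. The function and prompt layout below are illustrative, not a standard API; the point is simply that critical instructions bracket the context and the best retrieval hits sit nearest the question.

```python
def build_long_prompt(instructions: str, chunks: list[str], question: str) -> str:
    """Assemble a long RAG prompt so nothing critical lands mid-context.
    `chunks` is assumed to arrive sorted most-relevant-first from the retriever."""
    ordered = list(reversed(chunks))   # most relevant chunk ends up closest to the question
    return "\n\n".join([
        instructions,                  # top of context: a known high-attention position
        "Reference material:",
        *ordered,
        instructions,                  # repeated at the bottom, right before the question
        f"Question: {question}",
    ])

prompt = build_long_prompt(
    instructions="Answer only from the reference material and cite the chunk you used.",
    chunks=["[chunk A: most relevant]", "[chunk B]", "[chunk C: least relevant]"],
    question="What changed in the Q3 contract terms?",
)
print(prompt)
```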

Translate Architecture Into Product Strategy

The AI PM Masterclass connects every architectural detail to a product decision — context window, latency, vendor selection, prompt structure. Stop reading papers in isolation.