Transformer Architecture Explained for Product Managers
TL;DR
Every LLM you ship — GPT-4, Claude, Gemini — is built on the transformer architecture. You don't need to implement one, but understanding how attention mechanisms, pre-training, and fine-tuning work will make you a significantly better AI PM. This guide explains transformers at the level that actually influences product decisions: context windows, emergent abilities, model behavior at scale, and why architectural choices determine what your product can and can't do.
What's Actually Inside a Transformer
The transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need." Before it, sequential models (RNNs, LSTMs) processed text one token at a time, which made them slow to train and poor at capturing long-range dependencies. Transformers process all tokens in parallel and explicitly model the relationship between every token and every other token.
Tokenizer
Text is first broken into tokens, each roughly 3/4 of a word on average. For example, 'unbelievable' might become ['un', 'believ', 'able']. Token count directly determines cost and context-window usage, which is why pricing is per-token, not per-word.
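You can see this directly with OpenAI's open-source tiktoken library. A quick sketch (exact token splits vary by model and tokenizer):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Transformers process all tokens in parallel."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])  # how the text was actually split
```

Running this on your real prompts is the fastest way to sanity-check token-based cost estimates.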
Embedding Layer
Each token is converted into a high-dimensional vector (e.g., 768 or 4096 dimensions). Similar tokens end up as nearby vectors. This is where meaning gets encoded mathematically.
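A minimal PyTorch sketch of the lookup, with toy sizes (an untrained embedding is random; similarities only become meaningful after training):

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 50_000, 768            # toy sizes; real models vary
embedding = torch.nn.Embedding(vocab_size, dim)

token_ids = torch.tensor([101, 2054])    # hypothetical token ids
vectors = embedding(token_ids)           # shape: (2, 768)

# Cosine similarity is the standard way to measure "nearness" in
# embedding space; after training, related tokens score close to 1.
print(F.cosine_similarity(vectors[0], vectors[1], dim=0).item())
```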
Attention Layers (Stacked)
The core of the transformer. Each attention layer computes how much each token should 'pay attention to' every other token. Stacking multiple attention layers allows the model to capture increasingly abstract patterns.
Feed-Forward Layers
After attention, each token's representation passes through a feed-forward neural network. This is where most of the model's 'knowledge' is stored — in the learned weights.
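The standard block is just two linear layers with a nonlinearity between them, applied to each token's vector independently. A sketch (real models vary in expansion factor and activation function):

```python
import torch.nn as nn

dim = 768  # illustrative model width

# Classic transformer FFN: expand ~4x, apply a nonlinearity, project back.
feed_forward = nn.Sequential(
    nn.Linear(dim, 4 * dim),
    nn.GELU(),
    nn.Linear(4 * dim, dim),
)
```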
Output Head
The final layer converts the last token's representation back into a probability distribution over the vocabulary. The next token is then sampled from this distribution: greedy decoding always takes the highest-probability token, while temperature sampling introduces controlled randomness.
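In code, that last step looks roughly like this (random logits stand in for a real model's output):

```python
import torch

vocab_size = 50_000
logits = torch.randn(vocab_size)   # stand-in for the model's final-layer output

# Temperature rescales logits before softmax: low values sharpen the
# distribution (more deterministic), high values flatten it (more varied).
temperature = 0.7
probs = torch.softmax(logits / temperature, dim=-1)

next_token = torch.multinomial(probs, num_samples=1)  # sample from the distribution
greedy_token = torch.argmax(probs)                    # or always take the top token
```

This is exactly what the temperature parameter exposed by most LLM APIs controls.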
Attention: Why It Changed Everything
Attention is the mechanism that lets transformers understand that "it" in "The cat sat on the mat because it was tired" refers to the cat, not the mat. Every attention operation computes three things for each token: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I contribute?). The model learns these mappings during training.
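Here is a single-head sketch of the computation in PyTorch; real models add masking and many heads, and learn the projection matrices during training:

```python
import math
import torch

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention over a sequence of token vectors x."""
    Q, K, V = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    weights = torch.softmax(scores, dim=-1)          # token-to-token attention
    return weights @ V                               # weighted mix of values

seq_len, dim = 10, 64
x = torch.randn(seq_len, dim)                        # one vector per token
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = attention(x, w_q, w_k, w_v)                    # shape: (10, 64)
```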
Why it matters for context windows
Attention is computed between every token pair, so compute scales quadratically with context length. A 128K context window requires (128/32)² = 16x more attention computation than a 32K window. This is why long contexts cost more and are slower.
Multi-head attention
Models run many attention operations in parallel (the 'heads'). Each head learns to attend to different types of relationships: syntactic, semantic, positional. More heads generally means richer understanding.
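PyTorch ships multi-head attention as a building block, which also shows the scale involved. A sketch using GPT-2-sized numbers for illustration (12 heads over 768 dimensions):

```python
import torch

dim, num_heads, seq_len = 768, 12, 10

mha = torch.nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
x = torch.randn(1, seq_len, dim)   # (batch, sequence, dim)

# Self-attention: the sequence attends to itself. Each of the 12 heads
# computes its own attention pattern over its own slice of the 768 dims.
output, attn_weights = mha(x, x, x)
print(output.shape)                # torch.Size([1, 10, 768])
```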
Why models struggle with the middle of long contexts
Empirically, information at the start and end of a long context is attended to more strongly than information in the middle (the 'lost in the middle' problem). Product implication: put critical instructions at the beginning or the end of the prompt, not buried in the middle.
Cross-attention in multimodal models
When a model processes both text and images, cross-attention allows each modality to attend to the other. This is how models like GPT-4o or Claude 3 can reason about an image in relation to a text question.
Pre-Training, Fine-Tuning, and RLHF: The Training Stack
Modern LLMs are trained in stages. Understanding each stage helps you pick the right model and know when fine-tuning is worth the investment.
Stage 1: Pre-Training
What happens: The model is trained on a massive corpus (internet text, books, code) to predict the next token. This stage consumes the overwhelming majority of the total compute and cost; frontier pre-training runs are estimated in the hundreds of millions of dollars.
PM Implication: This determines the model's base knowledge, language understanding, and reasoning capability. You're renting this capability from model providers.
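The objective itself is simple enough to fit in a few lines; what costs so much is running it over trillions of tokens. A sketch with random logits standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
token_ids = torch.tensor([[464, 3797, 7731, 319, 262, 2603]])  # hypothetical token ids
logits = torch.randn(1, token_ids.shape[1], vocab_size)        # stand-in model output

# Next-token prediction: the prediction at position i is scored against
# the actual token at position i+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..n-2
    token_ids[:, 1:].reshape(-1),            # targets: the next token each time
)
```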
Stage 2: Supervised Fine-Tuning (SFT)
What happens: The pre-trained model is fine-tuned on human-curated examples of correct behavior: following instructions, helpful answers, safe responses.
PM Implication: This is what turns a 'predict the next token' engine into a useful assistant. Domain-specific SFT can adapt a model to your vertical (legal, medical, code review).
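SFT data is just prompt-response pairs. A hypothetical record for a legal-vertical product might look like:

```python
# One hypothetical SFT training record. Fine-tuning nudges the model toward
# producing the curated response whenever it sees a prompt like this one.
sft_example = {
    "prompt": "Summarize this contract clause for a non-lawyer:\n<clause text>",
    "response": (
        "This clause means the vendor must fix security issues within "
        "30 days, or you can end the contract without penalty."
    ),
}
```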
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
What happens: Human raters compare model outputs and express preferences. A reward model is trained on these preferences, then used to further fine-tune the LLM to produce preferred responses.
PM Implication: RLHF is why models like Claude and GPT feel 'aligned' — they're optimized for human preference, not just prediction accuracy. It also explains sycophancy: models learned that agreeable answers get higher ratings.
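The heart of reward-model training is a simple pairwise loss: push the score of the response humans preferred above the one they rejected. A sketch, with scalar scores standing in for a real reward model's outputs:

```python
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.3])    # reward model's score for the preferred answer
reward_rejected = torch.tensor([0.4])  # score for the rejected answer

# Bradley-Terry-style pairwise loss: -log sigmoid(r_chosen - r_rejected).
# Minimizing it widens the gap between preferred and rejected responses.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
```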
Go Deeper in the AI PM Masterclass
The masterclass covers how LLM architecture decisions translate directly into product decisions — taught live by a Salesforce Sr. Director PM.
Architecture Decisions That Affect Your Product
Number of parameters
More parameters generally mean more knowledge and better reasoning, but larger models are slower and more expensive to run. GPT-4 (estimated at over 1T parameters) vs GPT-4o mini (estimated around 8B) is a per-token price difference of roughly two orders of magnitude.
Context window size
Determines how much text the model can process at once. 128K tokens ≈ 100,000 words ≈ a full novel. Larger context windows enable document analysis, long conversations, and complex multi-step reasoning.
Mixture of Experts (MoE)
Some models (Mixtral, and reportedly GPT-4) route each token through only a small subset of 'expert' feed-forward networks in each layer. This lets a model hold many total parameters while activating only a fraction of them per inference, reducing cost without sacrificing much quality.
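The routing step is a small learned layer. A sketch with Mixtral-style numbers (8 experts, 2 active per token); the router and sizes here are illustrative:

```python
import torch

num_experts, top_k, dim = 8, 2, 768      # Mixtral-style: 8 experts, top-2 routing

router = torch.nn.Linear(dim, num_experts)
token = torch.randn(1, dim)              # one token's vector

# The router scores every expert, but only the top-k are actually run,
# so per-token compute stays low even with a huge total parameter count.
weights, expert_ids = torch.topk(torch.softmax(router(token), dim=-1), k=top_k)
print(expert_ids, weights)               # which experts fire, and their mixing weights
```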
Decoder-only vs encoder-decoder
Most modern chat LLMs (GPT, Claude, Llama) are decoder-only: optimized for generation. Encoder-decoder models (T5, BART) are better suited to translation and summarization. Most PM use cases are served by decoder-only models.
Emergent Abilities: What Changes With Scale
One of the most surprising findings in LLM research: certain capabilities appear abruptly at scale thresholds and are essentially absent below them. These "emergent abilities" weren't explicitly trained for — they arose from scale.
Chain-of-thought reasoning
Smaller models can't reliably solve multi-step math problems. Above ~100B parameters, models can follow a reasoning chain step-by-step and arrive at correct answers. This is why 'think step by step' works on large models.
In-context learning
The ability to learn a new task from a few examples in the prompt — without gradient updates. Tiny models show almost no in-context learning; large models are remarkably capable of it. This is what makes few-shot prompting work.
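A hypothetical few-shot prompt: nothing in it explicitly asks for sentiment labeling, yet a large model infers the task from two examples:

```python
prompt = """Review: "The onboarding flow was effortless." -> positive
Review: "Support never answered my ticket." -> negative
Review: "Pricing page is confusing but the product works." ->"""
# A capable model completes this with a sentiment label, inferring the
# task format purely from the examples; no weight updates are involved.
```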
Instruction following
Small models struggle to reliably follow complex, multi-part instructions. Scale plus SFT produces instruction-following capability that appears qualitatively different.
Calibrated uncertainty
Larger models are marginally better at knowing what they don't know. They're more likely to say 'I'm not sure' on genuinely uncertain questions, though this is highly imperfect and use-case dependent.