Technical Deep Dive

How LLMs Work: A Product Manager's Guide to Large Language Model Architecture

By Institute of AI PM · 14 min read · Mar 22, 2026

TL;DR

Large language models predict the next token in a sequence using transformer architecture — attention mechanisms that weigh which parts of the input matter most for each prediction. PMs don't need to understand the math, but understanding the architecture helps you make better decisions about model selection, prompt design, context window management, and cost optimization.

Why PMs Need to Understand LLM Architecture

You don't need to train models. But you need to answer questions like: Why does our AI feature sometimes give different answers to the same question? Why does adding more context sometimes make outputs worse? Why is this model so expensive at scale? Why does the model struggle with our specific domain? The answers are all rooted in how LLMs work under the hood.

Understanding the architecture at a conceptual level transforms you from a PM who says "let's use AI" into a PM who says "we should use a model with a 128K context window because our average document is 40K tokens, and we need retrieval augmentation for anything longer." That specificity is what separates good AI PMs from great ones.

Master LLM architecture hands-on. The AI PM Masterclass has you build products using real LLM APIs — you'll make model selection, prompt engineering, and RAG decisions yourself.

Tokens: The Building Blocks

LLMs don't read words — they read tokens. A token is roughly 3–4 characters of English text. "Product management" is 2–3 tokens depending on the model. A typical page of text is about 500 tokens.

  • Input cost: tokens you send to the model (prompts, context, documents)
  • Output cost: tokens the model generates (responses, completions)
  • Context window: max tokens the model can process in one request (input + output)

A feature that sends a 10,000-token document to the model for summarization costs roughly 10x more than summarizing a 1,000-token email. Understanding tokens is understanding your AI feature's unit economics.
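The arithmetic is simple enough to sketch. Below is a minimal cost estimator using a rough 4-characters-per-token heuristic; the per-1K-token prices are placeholder assumptions, not any provider's real pricing:

```python
# Back-of-envelope unit economics for an LLM feature.
# NOTE: prices below are ASSUMED placeholders, not real provider pricing.

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters of English per token."""
    return max(1, len(text) // 4)

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request, billed separately for input and output tokens."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Summarising a 10,000-token document vs. a 1,000-token email,
# assuming $0.01 per 1K input tokens and $0.03 per 1K output tokens.
doc_cost = request_cost(10_000, 500, 0.01, 0.03)
email_cost = request_cost(1_000, 200, 0.01, 0.03)
```

With these assumed prices the document run costs roughly 7x the email once output tokens are counted; the 10x figure above compares input tokens alone.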

The Transformer: How Attention Works

The transformer is the architecture behind every major LLM — GPT, Claude, Gemini, Llama. The key innovation is the attention mechanism, which lets the model figure out which parts of the input are most relevant to each part of the output.
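The mechanism itself is compact. Here is a toy sketch of scaled dot-product attention, the core transformer computation softmax(QKᵀ/√d)·V, with random vectors standing in for learned token representations:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    # Each row of `weights` says how much that query token attends
    # to every key token; rows sum to 1.
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

# Toy example: 3 tokens represented as 4-dimensional vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
```

Each output token is a weighted mix of all input tokens, which is exactly how the model decides "which parts of the input matter most" for each prediction.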

The pattern-matching insight

LLMs are pattern matchers at massive scale. They excel when the pattern exists in their training data. They struggle when the pattern is novel, highly specialised, or contradicts what they learned. This is why your AI feature works great for common tasks and fails on edge cases.

Pre-training vs. Fine-tuning vs. Prompt Engineering

Pre-training

You buy this

The model learns language from massive datasets. Takes months and millions of dollars. You access this through a model API (GPT-4, Claude, Gemini).

Fine-tuning

Optional investment

Trains a pre-trained model further on your data. Costs thousands to tens of thousands of dollars. The PM decision: does improvement over the base model justify the investment?

Prompt engineering

Start here

Craft input to get better output — no additional training. Free (beyond API costs) and immediate. This is where most AI PM work happens in practice.
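Mechanically, prompt engineering is just string construction: same model, better input. A sketch of a hypothetical few-shot prompt for a support-ticket classifier, with invented examples and labels for illustration:

```python
# A hypothetical few-shot classification prompt. The labels and example
# tickets are invented for illustration.

def build_prompt(ticket: str) -> str:
    examples = [
        ("App crashes when I upload a photo", "bug"),
        ("How do I export my data?", "how-to"),
        ("Please add dark mode", "feature-request"),
    ]
    shots = "\n".join(f"Ticket: {t}\nLabel: {l}" for t, l in examples)
    return (
        "Classify each support ticket as bug, how-to, or feature-request.\n\n"
        f"{shots}\n\nTicket: {ticket}\nLabel:"
    )

prompt = build_prompt("The export button does nothing when clicked")
```

The few-shot examples give the model a pattern to match, which is often the cheapest quality lever available before any fine-tuning.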

RLHF

Model providers do this

How models like ChatGPT are trained to be helpful and safe. Human raters evaluate outputs. Explains why the model sometimes refuses requests or hedges its answers.

Context Windows and Why They Matter

The context window is the maximum text the model can process in one request — input plus output combined. This is one of the most important architectural constraints for product design.

| Model tier | Context window | Approx. pages |
| --- | --- | --- |
| Older models | 4K tokens | ~6 pages |
| Mid-tier models | 32K tokens | ~50 pages |
| Latest models | 128K–200K tokens | ~200–300 pages |

Lost in the middle

Models tend to pay less attention to information in the middle of very long contexts. Your feature might have 100K tokens of context but effectively ignore information that's not near the beginning or end.
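This constraint is easy to encode as a routing decision. A sketch that chooses between stuffing a whole document into the prompt and falling back to retrieval; the 128K window, 500-token prompt overhead, and 2K output budget are illustrative assumptions:

```python
# Decide between "stuff the whole document" and retrieval, given a context
# window. Token counts use the rough 4-chars-per-token heuristic; the window
# and budgets below are illustrative assumptions, not any model's real limits.

def fits_in_context(doc_chars: int, context_window: int = 128_000,
                    prompt_tokens: int = 500, output_budget: int = 2_000) -> bool:
    """True if prompt + document + reserved output fit in one request."""
    doc_tokens = doc_chars // 4
    return prompt_tokens + doc_tokens + output_budget <= context_window

# A 40K-token document (the average from the intro) fits comfortably;
# a 200K-token document would need retrieval augmentation.
strategy = "stuff" if fits_in_context(40_000 * 4) else "retrieve"
```

Note that fitting is necessary but not sufficient: the "lost in the middle" effect means retrieval can still beat stuffing even when the document technically fits.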

Temperature and Sampling: Why Outputs Vary

Temperature controls randomness in the model's output. At temperature 0, the model always picks the most likely next token: consistent, but it can feel robotic. (In practice, even temperature-0 outputs can vary slightly between runs because of nondeterminism in how models are served.) At higher temperatures, it samples from a probability distribution, introducing variety.

| Temperature | Typical use | Trade-off |
| --- | --- | --- |
| 0–0.2 | Factual retrieval, data extraction | Consistency over creativity |
| 0.3–0.5 | Most product features | Balances reliability with naturalness |
| 0.7–1.0 | Creative writing, brainstorming | Variety over predictability |
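The effect is easy to see directly: dividing the model's raw scores (logits) by the temperature before the softmax sharpens or flattens the resulting distribution. A toy sketch with three invented candidate-token scores:

```python
import numpy as np

def sample_probs(logits, temperature):
    """Softmax over logits scaled by temperature; lower T sharpens the distribution."""
    z = np.array(logits, dtype=float) / max(temperature, 1e-6)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.5]           # invented raw scores for three candidate tokens
cold = sample_probs(logits, 0.1)   # near-deterministic: mass piles onto the top token
warm = sample_probs(logits, 1.0)   # more spread: other tokens get real probability
```

At T = 0.1 the top token gets essentially all the probability mass; at T = 1.0 the runner-up tokens stay in play, which is where output variety comes from.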

Model Selection: The PM's Framework

Capability vs. cost

Larger models (GPT-4, Claude Opus) cost 10–30x more per token than smaller models. Test the cheapest viable model first.

Latency vs. quality

Larger models are slower. Users often prefer a fast good answer over a slow perfect answer. Match speed to UX needs.

Context window vs. cost

Larger context windows cost more and are slower. If your feature only needs 2K tokens of context, don't pay for a 128K model.

Specialisation vs. generality

Some models are better at specific tasks — coding, math, analysis. A specialised model may outperform a general one at lower cost.
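The capability-vs-cost trade-off is worth modelling before committing. A back-of-envelope sketch comparing two hypothetical model tiers at a given traffic level; every price and volume here is a placeholder, not real pricing:

```python
# Monthly cost comparison for two HYPOTHETICAL model tiers.
# All prices and traffic numbers are placeholders for illustration.

def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float,
                 days: int = 30) -> float:
    """Projected monthly spend given per-request token usage and per-1K prices."""
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    return requests_per_day * days * per_request

# Same traffic profile, two assumed price points (large vs. small model).
large = monthly_cost(50_000, 2_000, 300, 0.010, 0.030)
small = monthly_cost(50_000, 2_000, 300, 0.0005, 0.0015)
```

Under these assumptions the large model costs 20x more per month, which is why "test the cheapest viable model first" is usually the right default.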

Hallucination: Why Models Make Things Up

Hallucination is when the model generates plausible-sounding information that's factually wrong. This isn't a bug — it's a fundamental property of how LLMs work. The model predicts likely text, and sometimes likely text isn't true text.

Design for hallucination as a certainty, not a possibility

  • Ground the model in specific data through RAG — it retrieves facts rather than generating them
  • Add confidence indicators so users can assess reliability
  • Implement verification steps for critical information
  • Use LLMs for tasks where exact accuracy isn't required (summarising, drafting, brainstorming)
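The first mitigation above, grounding through RAG, is mostly prompt construction. A sketch of a hypothetical grounded prompt that cites retrieved snippets and instructs the model to refuse rather than invent:

```python
# A hypothetical grounded-prompt template for RAG. The instruction wording
# and citation format are assumptions, not a specific provider's convention.

def grounded_prompt(question: str, snippets: list[str]) -> str:
    """Build a prompt that confines the model to retrieved sources."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using ONLY the sources below. Cite sources like [1]. "
        'If the answer is not in the sources, say "I don\'t know."\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

p = grounded_prompt(
    "What is our refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
```

Grounding doesn't eliminate hallucination, but it shifts the model from generating facts to restating retrieved ones, which is a much safer failure mode.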

Embeddings and Vector Search

Embeddings are numerical representations of text that capture semantic meaning. "I love this product" and "This product is great" have different words but similar embeddings because they mean similar things.

This matters because embeddings power search, recommendation, and retrieval features. When your AI feature needs to find relevant documents for RAG, it converts the user's query into an embedding and finds documents with similar embeddings. This semantic search is far more powerful than keyword matching.
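"Similar embeddings" has a concrete meaning: vectors that point in nearly the same direction, usually measured by cosine similarity. A toy sketch with invented 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy embeddings standing in for real model output.
love_it   = [0.9, 0.1, 0.0]   # "I love this product"
its_great = [0.8, 0.2, 0.1]   # "This product is great"
invoice   = [0.0, 0.1, 0.9]   # "Please find the invoice attached"

same_meaning = cosine_similarity(love_it, its_great)   # high: near-parallel vectors
different    = cosine_similarity(love_it, invoice)     # low: near-orthogonal vectors
```

Semantic search in a RAG pipeline is this comparison run at scale: embed the query, then rank stored document embeddings by similarity.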

Popular vector databases for storing and searching embeddings include Pinecone, Weaviate, Chroma, and pgvector.

Apply These Concepts in the AI PM Masterclass

You'll build products using real LLM APIs and make the model selection, prompt engineering, and RAG decisions described in this guide — live, with a Salesforce Sr. Director PM.
