How LLMs Work: A Product Manager's Guide to Large Language Model Architecture
TL;DR
Large language models predict the next token in a sequence using transformer architecture — attention mechanisms that weigh which parts of the input matter most for each prediction. PMs don't need to understand the math, but understanding the architecture helps you make better decisions about model selection, prompt design, context window management, and cost optimization.
Why PMs Need to Understand LLM Architecture
You don't need to train models. But you need to answer questions like: Why does our AI feature sometimes give different answers to the same question? Why does adding more context sometimes make outputs worse? Why is this model so expensive at scale? Why does the model struggle with our specific domain? The answers are all rooted in how LLMs work under the hood.
Understanding the architecture at a conceptual level transforms you from a PM who says "let's use AI" into a PM who says "we should use a model with a 128K context window because our average document is 40K tokens, and we need retrieval augmentation for anything longer." That specificity is what separates good AI PMs from great ones.
Master LLM architecture hands-on. The AI PM Masterclass has you build products using real LLM APIs — you'll make model selection, prompt engineering, and RAG decisions yourself.
Tokens: The Building Blocks
LLMs don't read words — they read tokens. A token is roughly 3–4 characters of English text. "Product management" is 2–3 tokens depending on the model. A typical page of text is about 500 tokens.
- **Input cost**: tokens you send to the model (prompts, context, documents)
- **Output cost**: tokens the model generates (responses, completions)
- **Context window**: the maximum tokens the model can process in one request (input + output)
A feature that sends a 10,000-token document to the model for summarization costs roughly 10x more than summarizing a 1,000-token email. Understanding tokens is understanding your AI feature's unit economics.
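That unit economics can be sketched in a few lines. The per-1K prices below are illustrative placeholders, not any provider's actual rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float = 0.01,
                  output_price_per_1k: float = 0.03) -> float:
    """Estimate per-request cost in dollars. Prices are hypothetical
    placeholders; check your provider's pricing page for real rates."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# A 10,000-token document summarised into a 500-token response:
doc_cost = estimate_cost(10_000, 500)    # 0.115
# A 1,000-token email summarised into a 200-token response:
email_cost = estimate_cost(1_000, 200)   # 0.016
```

Note that output tokens are typically priced higher than input tokens, so verbose responses cost disproportionately more.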
The Transformer: How Attention Works
The transformer is the architecture behind every major LLM — GPT, Claude, Gemini, Llama. The key innovation is the attention mechanism, which lets the model figure out which parts of the input are most relevant to each part of the output.
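For the curious, the core computation is compact enough to sketch. This is a minimal NumPy version of scaled dot-product attention with toy random vectors — real models run this across many layers and many "heads" in parallel:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each output position is a weighted
    mix of the values V, where the weights say how much each input
    token 'matters' for that position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens, each a 4-dimensional vector
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
# Each row of w is one token's "attention budget" spread over the input.
```

The rows of the weight matrix are exactly the "which parts of the input matter most" signal described above.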
The pattern-matching insight
LLMs are pattern matchers at massive scale. They excel when the pattern exists in their training data. They struggle when the pattern is novel, highly specialised, or contradicts what they learned. This is why your AI feature works great for common tasks and fails on edge cases.
Pre-training vs. Fine-tuning vs. Prompt Engineering
Pre-training
You buy this. The model learns language from massive datasets. Training takes months and millions of dollars. You access it through a model API (GPT-4, Claude, Gemini).
Fine-tuning
Optional investment. Trains a pre-trained model further on your data. Costs thousands to tens of thousands of dollars. The PM decision: does the improvement over the base model justify the investment?
Prompt engineering
Start here. Craft the input to get better output — no additional training. Free (beyond API costs) and immediate. This is where most AI PM work happens in practice.
RLHF
Model providers do this. RLHF (reinforcement learning from human feedback) is how models like ChatGPT are trained to be helpful and safe. Human raters evaluate outputs. It explains why the model sometimes refuses requests or hedges its answers.
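Since prompt engineering is where most of the work happens, here is what it looks like mechanically. This is a hypothetical few-shot prompt builder — the structure (instructions, worked examples, then the real input) is the common pattern, but the exact format is up to you:

```python
def build_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: instructions + worked examples + the
    actual input. No training involved -- only the input text changes."""
    lines = [f"Task: {task}", ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_prompt(
    task="Classify the sentiment of the message as positive or negative.",
    examples=[("I love this product", "positive"),
              ("This keeps crashing", "negative")],
    query="Setup was painless and fast",
)
```

The examples steer the model's pattern matching toward your task — often enough to avoid fine-tuning entirely.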
Context Windows and Why They Matter
The context window is the maximum text the model can process in one request — input plus output combined. This is one of the most important architectural constraints for product design.
| Model tier | Context window | Approx. pages |
|---|---|---|
| Older models | 4K tokens | ~6 pages |
| Mid-tier models | 32K tokens | ~50 pages |
| Latest models | 128K–200K tokens | ~200–300 pages |
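Because input and output share one budget, a simple fit check belongs in any feature that stuffs documents into prompts. A minimal sketch (the token counts are illustrative):

```python
def fits_in_context(document_tokens: int, prompt_tokens: int,
                    max_output_tokens: int, context_window: int) -> bool:
    """Input AND output share the window: the request fails (or the
    reply is truncated) if the total exceeds it."""
    return prompt_tokens + document_tokens + max_output_tokens <= context_window

# A 40K-token document, a 500-token prompt, and room for a
# 2,000-token answer needs more than a 32K window:
fits_in_context(40_000, 500, 2_000, 32_000)   # False
fits_in_context(40_000, 500, 2_000, 128_000)  # True
```

When the check fails, your options are the ones this guide covers: chunking, summarising, or retrieval augmentation.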
Lost in the middle
Models tend to pay less attention to information in the middle of very long contexts. Your feature might have 100K tokens of context but effectively ignore information that's not near the beginning or end.
Temperature and Sampling: Why Outputs Vary
Temperature controls randomness in the model's output. At temperature 0, the model always picks the most likely next token — deterministic and consistent but can feel robotic. At higher temperatures, it samples from a probability distribution, introducing variety.
| Temperature | Typical use | Trade-off |
|---|---|---|
| 0 – 0.2 | Factual retrieval, data extraction | Consistency over creativity |
| 0.3 – 0.5 | Most product features | Balances reliability with naturalness |
| 0.7 – 1.0 | Creative writing, brainstorming | Variety over predictability |
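The mechanism itself is simple. A minimal sketch with toy scores — real models apply this over a vocabulary of tens of thousands of tokens:

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Pick the next token id. Low temperature sharpens the probability
    distribution toward the top token; high temperature flattens it,
    letting less likely tokens through."""
    if temperature == 0:                  # greedy: always the argmax
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax over scaled scores
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(42)
logits = [2.0, 1.0, 0.2]        # toy scores for 3 candidate tokens
sample_token(logits, 0, rng)    # always 0 (the top-scoring token)
# At temperature 1.0, the other tokens win some of the time.
```

This is why "same question, different answers" is expected behaviour at non-zero temperature, not a defect.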
Model Selection: The PM's Framework
Capability vs. cost
Larger models (GPT-4, Claude Opus) cost 10–30x more per token than smaller models. Test the cheapest viable model first.
Latency vs. quality
Larger models are slower. Users often prefer a fast, good answer over a slow, perfect one. Match speed to UX needs.
Context window vs. cost
Larger context windows cost more and are slower. If your feature only needs 2K tokens of context, don't pay for a 128K model.
Specialisation vs. generality
Some models are better at specific tasks — coding, math, analysis. A specialised model may outperform a general one at lower cost.
Hallucination: Why Models Make Things Up
Hallucination is when the model generates plausible-sounding information that's factually wrong. This isn't a bug — it's a fundamental property of how LLMs work. The model predicts likely text, and sometimes likely text isn't true text.
Design for hallucination as a certainty, not a possibility
- Ground the model in specific data through RAG — it retrieves facts rather than generating them
- Add confidence indicators so users can assess reliability
- Implement verification steps for critical information
- Use LLMs for tasks where exact accuracy isn't required (summarising, drafting, brainstorming)
Embeddings and Vector Search
Embeddings are numerical representations of text that capture semantic meaning. "I love this product" and "This product is great" have different words but similar embeddings because they mean similar things.
This matters because embeddings power search, recommendation, and retrieval features. When your AI feature needs to find relevant documents for RAG, it converts the user's query into an embedding and finds documents with similar embeddings. This semantic search is far more powerful than keyword matching.
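The similarity measure behind semantic search is usually cosine similarity. A sketch with made-up 3-dimensional vectors — real embeddings come from an embedding API and have hundreds to thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: near 1.0 means they point
    in almost the same direction, i.e. similar meaning."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings (not output from any real model):
love_it   = [0.90, 0.80, 0.10]   # "I love this product"
its_great = [0.85, 0.75, 0.15]   # "This product is great"
refund    = [0.10, 0.20, 0.90]   # "How do I get a refund?"

cosine_similarity(love_it, its_great)  # high: same meaning, different words
cosine_similarity(love_it, refund)     # low: unrelated meaning
```

A RAG pipeline runs this comparison (at scale, inside a vector database) between the query embedding and every stored document embedding, returning the closest matches.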
Popular vector databases include Pinecone, Weaviate, Chroma, and pgvector.
Apply These Concepts in the AI PM Masterclass
You'll build products using real LLM APIs and make the model selection, prompt engineering, and RAG decisions described in this guide — live, with a Salesforce Sr. Director PM.