TECHNICAL DEEP DIVE

Compound AI Systems: Building Reliable AI Products from Multiple Components

By Institute of AI PM · 14 min read · May 16, 2026

TL;DR

The most reliable AI products in production aren't single-model calls — they're compound systems: pipelines that chain retrieval, routing, generation, and validation into coordinated architectures. 60% of enterprise LLM applications now use retrieval-augmented generation and 30% use multi-step chains. This guide explains how compound AI systems are structured, the four canonical design patterns, and when to use compound architecture versus a single model call.

Why Single-Model Calls Break at Scale

A compound AI system is a modular architecture that combines multiple AI and non-AI components — retrievers, validators, routers, LLMs, specialized models — to solve tasks that a single model call cannot handle reliably or efficiently. The concept was formalized in the UC Berkeley BAIR blog post "The Shift from Models to Compound AI Systems" (2024), which observed that the biggest AI quality gains came from system design, not model upgrades.

This is not an academic distinction. When your product sends a single prompt to an LLM and returns the output directly, you inherit all of the model's failure modes: hallucination on knowledge gaps, context-window truncation, inconsistent formatting, no factual grounding. Adding system structure suppresses those failure modes at the product layer, without waiting for a new model release.

The hallucination problem

A bare LLM call confidently fills knowledge gaps with plausible-sounding fabrication. A retrieval layer grounds the model in facts it must cite rather than invent. Hallucination rates can drop by 40-60% in well-designed RAG systems.

The precision-recall tradeoff

A single model call trades precision against recall mainly through sampling temperature. Compound systems can run a high-recall retrieval step followed by a high-precision generation step — the two constraints no longer live in the same call.

The context window ceiling

A 200K-token document won't fit in most models, and even when it does, recall degrades in the middle. Chunking and retrieval solve this architecturally instead of waiting for longer context windows.

The cost-quality tradeoff

Not every query requires frontier-model quality. A router can direct simple queries to a cheap model and escalate complex ones to Claude Opus or GPT-4 — cutting cost by 60-80% on mixed production workloads.

The Five Functional Layers of a Compound AI System

Production compound AI systems distribute responsibility across five functional layers. Not every system needs all five — a simple RAG pipeline might only use retrieval and generation — but understanding all five helps you identify which layers your product is missing and where failures are most likely to originate.

1. Retrieval

Fetches relevant context from an external source — vector database, keyword search, SQL query, or API call. Grounds the model in facts it wasn't trained on. Retrieval quality sets the ceiling for generation quality.

Example: a legal research assistant queries a case-law vector database with the user's question before calling the LLM.
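A minimal sketch of the retrieval interface, using a toy bag-of-words "embedding" and cosine similarity in place of a real embedding model and vector database (the corpus strings and function names are illustrative assumptions, not a real API):

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding' -- stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus documents most similar to the query."""
    qv = bow_vector(query)
    ranked = sorted(corpus, key=lambda doc: cosine(qv, bow_vector(doc)), reverse=True)
    return ranked[:k]

corpus = [
    "Smith v. Jones established the duty of care standard.",
    "Quarterly revenue grew 12% year over year.",
    "Duty of care applies to professional negligence claims.",
]
print(retrieve("what is the duty of care standard", corpus, k=2))
```

In production the scoring would come from an embedding model and an approximate-nearest-neighbor index, but the contract is the same: query in, ranked top-k documents out, which then feed the generation prompt.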

2. Routing

Classifies the incoming request and directs it to the right downstream component. A router might separate queries by topic, required capability (code vs. text), or complexity (cheap model vs. frontier).

Example: a customer support system routes billing questions to a structured SQL query, product questions to RAG, and escalation requests to a human queue.
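The dispatch structure can be sketched with a rules-based router; a production router would use a trained classifier or a small LLM call, and the handler names and keyword sets here are hypothetical stand-ins:

```python
# Minimal rules-based router. Real systems would replace the keyword
# matching with a classifier, but the dispatch shape is the same.
BILLING_TERMS = {"invoice", "refund", "charge", "billing"}
ESCALATION_TERMS = {"human", "agent", "complaint"}

def handle_billing(q: str) -> str: return f"SQL lookup for: {q}"
def handle_product(q: str) -> str: return f"RAG pipeline for: {q}"
def handle_escalation(q: str) -> str: return f"Queued for human: {q}"

def route(query: str) -> str:
    words = set(query.lower().split())
    if words & ESCALATION_TERMS:      # escalation checked first: highest stakes
        return handle_escalation(query)
    if words & BILLING_TERMS:
        return handle_billing(query)
    return handle_product(query)      # default: unstructured product question

print(route("Why was my card charge doubled?"))
```

Note the ordering: escalation is checked before billing so a complaint about a charge still reaches a human.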

3. Generation

The LLM call itself. In a compound system, the prompt is constructed programmatically from retrieved context, user input, and system instructions. Prompt construction is an engineering problem, not a craft.

Example: the generation prompt includes retrieved docs, conversation history, formatting instructions, and a citation requirement — assembled by code, not written by hand.
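That assembly step can be sketched as a plain function — the section names and citation format below are illustrative assumptions, but the point is that the prompt is built by code from typed inputs:

```python
def build_prompt(system: str, docs: list[str], history: list[str], question: str) -> str:
    """Assemble the generation prompt programmatically from its parts."""
    context = "\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(docs))
    convo = "\n".join(history)
    return (
        f"{system}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Conversation so far:\n{convo}\n\n"
        f"Question: {question}\n"
        "Answer using only the retrieved context. Cite sources as [doc N]."
    )

prompt = build_prompt(
    system="You are a careful legal research assistant.",
    docs=["Smith v. Jones established the duty of care standard."],
    history=["User: I'm researching negligence."],
    question="What case defined duty of care?",
)
print(prompt)
```

Because the prompt is a function of its inputs, it can be versioned, diffed, and unit-tested like any other engineering artifact.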

4. Validation

Checks the model's output before it reaches the user. Can be rule-based (does the output match a JSON schema?), model-based (did a second model detect hallucination?), or hybrid (schema check then safety classifier).

Example: a medical information system runs a safety classifier on every generation before display. Outputs that trigger the classifier show a standard disclaimer instead.
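A rule-based gate can be sketched in a few lines. This example assumes the generation layer is instructed to emit JSON with `answer` and `citations` keys (an illustrative contract, not a standard); a model-based safety classifier would chain after this check:

```python
import json

REQUIRED_KEYS = {"answer", "citations"}

def validate(raw_output: str) -> tuple[bool, str]:
    """Rule-based gate run before output reaches the user.
    Returns (passed, reason)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not REQUIRED_KEYS <= data.keys():
        return False, f"missing keys: {REQUIRED_KEYS - data.keys()}"
    if not data["citations"]:
        return False, "no citations -- possible fabrication"
    return True, "ok"

print(validate('{"answer": "Smith v. Jones", "citations": ["doc 1"]}'))
```

Cheap deterministic checks like this run first so the expensive model-based checks only see outputs that are at least structurally sound.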

5. Memory

Persists context across turns or sessions — user preferences, prior conversation summaries, entity state. Without memory, every conversation starts from scratch. With it, the product compounds value over time.

Example: an AI sales assistant stores each account's prior conversations, decisions, and open questions so the next session starts with full context, not a blank slate.
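The load-at-start, save-at-end loop can be sketched with a file-backed store — a deliberately minimal stand-in for the database and LLM-written summaries a production system would use (the field names are illustrative):

```python
import json
import pathlib

class SessionMemory:
    """Minimal file-backed memory: load account context at session start,
    store an updated summary at session end."""

    def __init__(self, path: str = "memory.json"):
        self.path = pathlib.Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def load(self, account: str) -> dict:
        # Unknown accounts get an empty context, not an error.
        return self.store.get(account, {"summary": "", "open_questions": []})

    def save(self, account: str, summary: str, open_questions: list[str]) -> None:
        self.store[account] = {"summary": summary, "open_questions": open_questions}
        self.path.write_text(json.dumps(self.store))

mem = SessionMemory("/tmp/demo_memory.json")
mem.save("acme", "Discussed pricing tiers.", ["Needs security review"])
print(mem.load("acme")["summary"])
```

The interface is the durable part: `load` at session start feeds the generation prompt; `save` at session end persists what the next session needs.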

Four Canonical Design Patterns

Most production compound AI systems are variations of four design patterns. Recognizing the pattern lets you identify what your product needs without designing from scratch every time.

Pattern 1: Sequential Chain

Each step feeds its output into the next: Step A → Step B → Step C. Simple, predictable, debuggable. Best for linear transformations where each step adds distinct value.

When to use: document summarization pipelines, multi-step code generation, structured extraction followed by formatting, citation compilation.

Risk: errors compound. A bad retrieval step produces a bad generation. Add per-layer validation for production workloads.
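The pattern, including a per-layer validation step between retrieval and generation, can be sketched with stub steps (the step bodies are placeholders for real retrieval and model calls):

```python
def step_retrieve(query: str) -> dict:
    # Stub for a real retrieval call; returns state for downstream steps.
    return {"query": query, "docs": ["Duty of care: Smith v. Jones."]}

def step_validate(state: dict) -> dict:
    # Per-layer validation: fail fast here instead of generating from nothing.
    if not state["docs"]:
        raise ValueError("retrieval returned nothing; abort before generation")
    return state

def step_generate(state: dict) -> dict:
    # Stub for the LLM call.
    state["draft"] = f"Per '{state['docs'][0]}', answering: {state['query']}"
    return state

def run_chain(query, steps):
    """Each step receives the previous step's output, so an error surfaces
    at the failing step rather than as a bad final answer."""
    state = query
    for step in steps:
        state = step(state)
    return state

result = run_chain("what is duty of care?", [step_retrieve, step_validate, step_generate])
print(result["draft"])
```

The validation step between retrieval and generation is what keeps errors from compounding silently down the chain.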

Pattern 2: Router + Specialist

A classifier at entry categorizes the input and dispatches it to a specialized downstream handler. Each specialist is optimized for its narrow task rather than trying to be a generalist.

When to use: customer support with multiple topic types, mixed structured and unstructured queries, cost optimization (small model for simple; large for complex).

Risk: router accuracy is the system's weakest link. Misclassified inputs get the wrong specialist. Monitor router precision and recall separately from end-to-end quality.

Pattern 3: Parallel Ensemble

Multiple components run simultaneously on the same input; outputs are aggregated or voted on. More expensive but higher quality and more robust to any single component's failure mode.

When to use: high-stakes outputs (medical, legal, financial), redundancy for reliability, multi-perspective synthesis where disagreement is informative.

Risk: 2-3x latency and cost versus a single call. Only justified where quality and reliability guarantees outweigh the cost.
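The run-in-parallel-then-aggregate shape can be sketched with stub generators standing in for independent model calls — the aggregation logic (majority vote plus an agreement ratio) is the point, not the models:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stubs for three independent model calls with one dissenting answer.
def model_a(q: str) -> str: return "duty of care"
def model_b(q: str) -> str: return "duty of care"
def model_c(q: str) -> str: return "strict liability"

def ensemble(query, models):
    """Run all ensemble members in parallel and majority-vote the answers."""
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda m: m(query), models))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)   # answer plus agreement ratio

answer, agreement = ensemble("negligence standard?", [model_a, model_b, model_c])
print(answer, agreement)
```

The agreement ratio is the useful by-product: low agreement is exactly the "disagreement is informative" signal, and can trigger escalation or a human review instead of a confident wrong answer.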

Pattern 4: Generate-then-Validate

The system generates a draft, a separate model or rule set critiques it, and optionally regenerates with the critique as additional context. Expensive but dramatically reduces hallucination and factual error rates.

When to use: any output acted on without human review — code execution, financial reports, medical instructions, legal summaries.

Risk: validator models can be sycophantic — agreeable validators will pass bad outputs. Use task-specific validators, not general-purpose LLMs asked to 'check this.'
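The generate-critique-regenerate loop can be sketched with stubs; a real system would make LLM calls, appending the critique to the retry prompt, and the citation rule here is an illustrative stand-in for a task-specific validator:

```python
def generate(question: str, critique: str = "") -> str:
    # Stub generator: a real system would call an LLM, feeding the
    # critique back into the prompt on retries.
    if "must cite" in critique:
        return "Smith v. Jones [doc 1] established duty of care."
    return "Smith v. Jones established duty of care."

def critic(draft: str) -> str:
    """Task-specific validator: checks a concrete rule (citations present)
    rather than asking a general-purpose LLM to 'check this'."""
    return "" if "[doc" in draft else "must cite sources as [doc N]"

def generate_validated(question: str, max_retries: int = 2) -> str:
    critique = ""
    for _ in range(max_retries + 1):
        draft = generate(question, critique)
        critique = critic(draft)
        if not critique:
            return draft
    raise RuntimeError("validation failed after retries")

print(generate_validated("what case defined duty of care?"))
```

Note that the critic checks a mechanical rule rather than rendering an opinion — that is what keeps it from being the sycophantic validator the risk above warns about.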

Build AI Products That Survive Contact With Reality

The AI PM Masterclass covers compound AI system design, eval frameworks, and how to spec multi-component pipelines for engineering — taught live by a Salesforce Sr. Director PM.

When to Add a Layer vs. Upgrade the Model

Compound systems add latency, cost, and debugging complexity. They're the right choice when the problem demands it, not by default. Use this decision framework before adding a layer:

Evals show hallucination on domain knowledge

Add a retrieval layer. Single-model calls can't cite what the model wasn't trained on. Retrieval narrows the knowledge gap.

100% of queries route to a frontier model

Add a classifier to route simple queries to a cheaper model. Most mixed workloads can cut cost 50-70% with a two-tier router without quality regression on routine requests.

Output errors trigger downstream actions (emails, code runs, transactions)

Add validation before output reaches the user or any action is taken. The cost of one bad action in production exceeds the cost of a validation layer.

Users start every session from scratch

Add a memory layer. Retrieve user context at session start; summarize and store at session end. This is the lowest-effort compound layer with the highest user experience payoff.

Latency is the primary constraint

Audit each compound step for its contribution to total latency. One step typically accounts for 80%+ of total time. Optimize that step rather than the whole pipeline.
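A per-step latency audit can be sketched with a timing decorator; the sleeps stand in for real layer latency, with generation as the assumed dominant step:

```python
import time

def timed(step):
    """Wrap a pipeline step and record its wall-clock time."""
    def wrapper(x):
        t0 = time.perf_counter()
        out = step(x)
        wrapper.elapsed = time.perf_counter() - t0
        return out
    wrapper.elapsed = 0.0
    wrapper.__name__ = step.__name__
    return wrapper

# Stub steps; sleeps simulate layer latency.
@timed
def retrieve(x): time.sleep(0.01); return x
@timed
def generate(x): time.sleep(0.08); return x   # simulated dominant step
@timed
def validate(x): time.sleep(0.01); return x

x = "query"
for step in (retrieve, generate, validate):
    x = step(x)

total = retrieve.elapsed + generate.elapsed + validate.elapsed
for step in (retrieve, generate, validate):
    print(f"{step.__name__}: {step.elapsed / total:.0%} of total")
```

An audit like this usually confirms the 80% claim quickly — and tells you whether to attack the dominant step with a faster model, streaming, or caching rather than micro-optimizing the cheap layers.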

PM Responsibilities in Compound System Design

Most engineering effort in compound AI sits in system design, not model selection. That means PMs have more leverage here than in model-picking decisions. This is where product judgment matters most.

Define the failure modes that matter

Before engineering builds, specify which failures are acceptable (low-confidence outputs show a caveat) and which are unacceptable (fabricated citations in a legal product). Validation layer design flows directly from these requirements.

Set latency budgets per layer

A compound system with 5 layers can easily hit 10+ seconds total latency. Work backward from user tolerance — typically under 3s for interactive, under 10s for non-interactive — and allocate budgets per layer before engineering starts.

Define retrieval scope and freshness SLA

The retrieval layer is only as good as its index. Decide what data is indexed, how often it updates, and what staleness is tolerable. A legal RAG system running on 6-month-old case law is a liability, not a product.

Own the eval pipeline per layer

Don't wait for end-to-end evals. Define metrics for retrieval (precision@k, recall@k), routing (classification accuracy), and generation (factual accuracy, format compliance). Layer-specific evals find bugs far faster than black-box testing.
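The retrieval metrics named above are simple enough to compute directly; the document IDs and ground-truth labels below are illustrative:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9"]   # ranked retriever output for one query
relevant = {"d1", "d3", "d5"}          # labeled ground truth for that query

print(precision_at_k(retrieved, relevant, 3))   # 2 of the top 3 are relevant
print(recall_at_k(retrieved, relevant, 3))      # 2 of the 3 relevant docs found
```

Averaged over a labeled query set, these two numbers localize failures: low recall@k points at the index or embedding, while low precision@k with decent recall points at ranking.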

Design AI Systems That Ship and Stay Reliable

The AI PM Masterclass teaches compound AI system design as a core product skill — not just model selection. Learn to spec pipelines that engineering can build and that survive production.