In-Context Learning Explained for Product Managers

What In-Context Learning Actually Is

Classical machine learning has a clear learning loop: show the model labeled examples, compute a loss, run gradient descent, update the weights. Repeat thousands of times. Learning is baked into the parameters. When inference runs, those parameters are fixed.

In-context learning breaks this assumption. You show the model a handful of input-output pairs inside the prompt, and the model adapts its behavior for your task without any gradient descent and without changing a single weight. The same frozen model produces different outputs depending on what examples appear in its context window.

This came to prominence with the 2020 GPT-3 paper (Brown et al.). The smaller models in that study showed much weaker few-shot gains; the effect was strongest at the largest scale tested, 175 billion parameters. Researchers have described it as an emergent capability: something that was not explicitly trained for but arose from scale. The working explanation is that very large models internalize so much structure about language and tasks during pre-training that reading a few examples is enough to activate the right behavior. The model recognizes what task is being described and responds accordingly.

Mechanistically, this happens through the attention mechanism during the forward pass. As the model processes each token in your prompt, attention heads are reading the examples you provided and building up an implicit representation of the task. By the time the model reaches the end of your input and starts generating output, it has already done the equivalent of light task inference from pattern recognition across your examples.

The key distinction for PMs

ICL does not change what the model knows. It changes what behavior the model expresses. You are not teaching it new facts. You are showing it what game you are playing. This means ICL works well for format adaptation, style adaptation, and task definition, but poorly for injecting new knowledge the model was not exposed to during pre-training.

Few-Shot, One-Shot, Zero-Shot: The ICL Spectrum

ICL covers a range of example counts, each with a different cost and benefit profile. Understanding the spectrum lets you pick the right point for your product without over-engineering.

Zero-Shot: 0 examples

You describe the task in natural language and the model figures it out from pre-training alone. Works well for common tasks but breaks on niche formats.

Best for: General tasks, fast prototyping, tasks the model already knows

One-Shot: 1 example

A single demonstration anchors the format and style. Often enough to flip a stubborn model from wrong output structure to correct one.

Best for: Establishing output format, showing tone, unlocking a specific schema

Few-Shot: 3 to 20 examples

The practical sweet spot for most products. Enough signal to define the task reliably without exploding your token budget.

Best for: Classification, extraction, rewriting tasks, domain-specific labeling

Many-Shot: 50 to 1,000+ examples

Newly practical with 1M context windows. Can match fine-tuning quality on some tasks without any training infrastructure.

Best for: Complex classification schemas, specialized jargon, substituting for fine-tuning at low volume

The conventional advice was that more examples always help, up to a point. That is still roughly true, but with two important caveats. First, quality dominates quantity. Five well-chosen, consistently formatted examples outperform twenty messy ones. Second, many-shot ICL is now a real option. With context windows reaching 1 million tokens on models like Gemini 1.5 Pro (Claude 3 launched with a 200,000-token window), you can fit hundreds of labeled examples into a single prompt. Research published in 2024 showed many-shot ICL closing much of the gap with fine-tuning on certain classification tasks, without any training infrastructure.

Practical starting point

Start with 3 to 5 examples covering the most common cases. Add examples only when you observe specific failure modes. Each new example should address a concrete failure, not just add variety. Stop when you hit diminishing returns or when token cost per call becomes a line item.
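As a rough illustration of that starting point, the sketch below assembles a three-example classification prompt. The `Example` class, the `build_few_shot_prompt` helper, the ticket texts, and the label names are all hypothetical; the only point is that every example goes through one fixed template so the format stays consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str


def build_few_shot_prompt(instruction: str, examples: list[Example], query: str) -> str:
    """Render every example with one fixed template, then append the open query."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts.append(f"Input: {ex.text}")
        parts.append(f"Label: {ex.label}")
        parts.append("")
    # End on an open "Label:" line so the model completes the pattern.
    parts.append(f"Input: {query}")
    parts.append("Label:")
    return "\n".join(parts)


# Hypothetical support-ticket examples covering the most common cases.
examples = [
    Example("I was charged twice this month.", "billing"),
    Example("The app crashes when I open settings.", "bug"),
    Example("Can you add dark mode?", "feature_request"),
]
prompt = build_few_shot_prompt(
    "Classify each support ticket as billing, bug, or feature_request.",
    examples,
    "My invoice shows the wrong amount.",
)
print(prompt)
```

When a specific failure shows up in testing, adding one more `Example` that targets it is a one-line change, which keeps the "one example per concrete failure" discipline easy to follow.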

ICL vs Fine-Tuning vs RAG: The PM Decision Framework

These three approaches are often presented as competing alternatives, but they address different problems. The decision is not which is best in general. The decision is which is right for your specific constraint set today.

In-Context Learning: Start here

Use when: You need fast iteration, have fewer than a few hundred examples, or are adapting format and style rather than baking in deep domain knowledge.

Watch for: Adds latency and cost per call. Breaks if production inputs drift from your example distribution.

Fine-Tuning: Graduate here

Use when: You have 1,000+ high-quality labeled examples, a stable task definition, and high inference volume that makes per-call token cost a real line item.

Watch for: Training cost and time upfront. Model snapshots go stale as base models improve. Regression risk on model upgrades.

RAG: Use for facts

Use when: Your knowledge base changes frequently, you need citations, or the factual gap is too large to bridge with examples alone.

Watch for: Retrieval infrastructure. Chunking quality determines answer quality. Latency from retrieval round-trip.

The most common mistake product teams make is jumping to fine-tuning before exhausting ICL. Fine-tuning feels like a real investment and therefore like a signal of seriousness. But it introduces training infrastructure, model snapshots that go stale, and regression risk on every base model upgrade. ICL with thoughtful examples solves the problem faster and keeps you flexible while the task definition is still evolving.

The second most common mistake is conflating factual grounding with task adaptation. RAG is for knowledge. ICL is for behavior. If you need the model to know something specific (a policy document, a product catalog, last week's news), RAG is the right tool. If you need the model to respond in a specific way to a class of inputs, ICL is the right tool. Many products need both.

Why ICL Works and Why It Sometimes Fails

Researchers have proposed several explanations for what ICL is actually doing mechanistically. None is complete, but they give useful intuitions for product decisions.

The most supported view is that ICL does at least three things simultaneously. First, it establishes the label space and format: by seeing your output examples, the model knows what kinds of answers are valid and how to structure them. Second, it resolves task ambiguity: a task description in natural language can mean many things, and examples pin down the intended interpretation. Third, it activates latent task knowledge: if the task is something the model encountered during pre-training in some form, examples trigger that knowledge to surface.

This explains both why ICL works well and where it breaks down. It works well when the task resembles something in the pre-training distribution and when your examples consistently demonstrate the right format. It breaks down in the following specific ways:

Inconsistent examples hurt more than wrong labels

Research from Min et al. (2022) and subsequent work suggests that inconsistent formatting across examples degrades ICL performance more than using incorrect ground-truth labels. The model is learning the task structure from pattern recognition. Inconsistent structure destroys that signal. If your examples vary in output format, punctuation, or field ordering, fix that before anything else.
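One way to catch formatting drift before it reaches a prompt is a small lint over the example outputs. The sketch below is a deliberately coarse heuristic, not a standard tool: the hypothetical `output_shape` helper collapses words and numbers into placeholders, and `find_format_outliers` flags any output whose remaining punctuation and field layout differs from the majority.

```python
import re
from collections import Counter


def output_shape(output: str) -> str:
    """Reduce an output string to a coarse 'shape' so formats can be compared."""
    shape = re.sub(r"[A-Za-z]+", "w", output.strip())  # any word -> w
    shape = re.sub(r"\d+", "d", shape)                 # any number -> d
    return shape


def find_format_outliers(outputs: list[str]) -> list[int]:
    """Return indices of outputs whose shape differs from the most common shape."""
    shapes = [output_shape(o) for o in outputs]
    most_common_shape, _ = Counter(shapes).most_common(1)[0]
    return [i for i, s in enumerate(shapes) if s != most_common_shape]


# Hypothetical example outputs; the third uses different separators.
example_outputs = [
    "category: billing; priority: 2",
    "category: bug; priority: 1",
    "Category = feature, Priority = 3",
    "category: account; priority: 2",
]
print(find_format_outliers(example_outputs))  # [2]
```

This check ignores capitalization and the wording of labels, so it only catches structural drift such as changed separators or missing fields.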

Clean eval examples vs. messy production inputs

Teams often author beautiful examples by hand and test them against clean prompts. Production inputs are messier: typos, mixed languages, partial context, malformed data. ICL performance on your curated eval set can be 15 to 20 points higher than on real traffic. Stress-test with actual production samples before shipping.

Example order matters more than you expect

The position of examples in the context window affects which examples the model attends to most. Later examples carry more weight in some models. This means shuffling examples can produce different outputs on the same input. If your product needs stable, reproducible outputs, fix the example order and test for sensitivity.
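A sensitivity test can be as simple as shuffling the examples several times and counting how many distinct outputs come back for the same input. In the sketch below, `call_model` is a placeholder for whatever function wraps your real model call, and `toy_recency_model` is a made-up stand-in that just echoes the last example's label so the script runs without an API.

```python
import random
from collections import Counter
from typing import Callable


def order_sensitivity(
    examples: list[tuple[str, str]],
    query: str,
    call_model: Callable[[list[tuple[str, str]], str], str],
    trials: int = 20,
    seed: int = 0,
) -> Counter:
    """Shuffle the examples `trials` times and count the distinct outputs."""
    rng = random.Random(seed)  # seeded so the check itself is reproducible
    outputs = Counter()
    for _ in range(trials):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        outputs[call_model(shuffled, query)] += 1
    return outputs


def toy_recency_model(examples, query):
    """Stand-in for a real model call: echoes the last example's label."""
    return examples[-1][1]


examples = [
    ("great product", "positive"),
    ("broke in a day", "negative"),
    ("does the job", "neutral"),
]
counts = order_sensitivity(examples, "it is fine I guess", toy_recency_model)
print(counts)  # more than one key means the output depends on example order
```

If a real model returns a single answer across all shuffles, order is probably not a concern for that input; repeat the check over a sample of inputs rather than a single one.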

Task ambiguity that examples do not resolve

ICL resolves ambiguity by showing the model what you mean. If your examples do not cover the ambiguous cases, the model guesses. For products handling edge cases (unusual queries, out-of-domain inputs, low-confidence classifications), add examples that explicitly show how to handle uncertainty rather than forcing a confident output.

A useful mental model: ICL is pattern matching against your examples, not reasoning about your intent. Every failure mode follows from taking that model seriously. If you show inconsistent patterns, the match is unreliable. If production inputs do not match your example patterns, the match fails. If your examples do not cover ambiguous cases, the model matches on incomplete signal.

Product Design Implications of ICL

ICL is not just a prompting technique. It reframes several product decisions that PMs are typically not trained to think about. Here are the six most important ones.

Prompt versioning is not optional

Your system prompt and its examples are a product artifact that evolves. If you are not versioning prompts the same way you version code, you cannot reproduce a bug, roll back a regression, or A/B test a change. Treat prompt changes as deployments: branch, test, review, merge.
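A lightweight way to make prompt versions traceable is to derive an ID from the prompt content itself and log it with every model call. The sketch below assumes the prompt is held as a system prompt string plus a list of example dictionaries; the `prompt_version` helper and the model name are hypothetical.

```python
import hashlib
import json


def prompt_version(system_prompt: str, examples: list[dict], model: str) -> str:
    """Derive a short, stable ID from everything that shapes model behavior."""
    payload = json.dumps(
        {"system_prompt": system_prompt, "examples": examples, "model": model},
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1 = prompt_version(
    "Classify the ticket.",
    [{"input": "charged twice", "output": "billing"}],
    "example-model-2025-01",  # hypothetical model name
)
v2 = prompt_version(
    "Classify the ticket.",
    [
        {"input": "charged twice", "output": "billing"},
        {"input": "app crashes", "output": "bug"},  # one example added
    ],
    "example-model-2025-01",
)
print(v1, v2, v1 != v2)
```

Storing the prompt files in version control gives you the branch, review, and rollback workflow; the content hash complements that by letting a log line or bug report identify exactly which prompt produced an output.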

Example curation is a product investment

The quality of your few-shot examples determines output quality more than model choice in many cases. Budget time for labeling, edge case coverage, and ongoing refresh as production inputs shift. Who owns this work? It is neither pure engineering nor pure product. Assign it explicitly.

ICL and model upgrades interact

When OpenAI or Anthropic releases a new model, your ICL prompts may regress. A prompt tuned for GPT-4 may behave differently on GPT-4o or Claude 3.5 Sonnet. Run regression tests on your example suite before cutting over. Teams that skip this step discover the regression in production.
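A regression check before cutover can be sketched as a pass-rate comparison on a fixed evaluation set. Everything here is illustrative: `current_model` and `candidate_model` are stubs standing in for real model calls, the exact-match scoring is the simplest possible metric, and the `max_drop` tolerance is an arbitrary assumption.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input text, expected output)


def pass_rate(call_model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the model output exactly matches the expected output."""
    passed = sum(1 for text, expected in cases if call_model(text).strip() == expected)
    return passed / len(cases)


def safe_to_cut_over(current, candidate, cases: list[EvalCase], max_drop: float = 0.02) -> bool:
    """Allow the upgrade only if the candidate's pass rate drops by at most `max_drop`."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - max_drop


def current_model(text: str) -> str:
    """Stub for the model in production today."""
    return "billing" if "charge" in text else "bug"


def candidate_model(text: str) -> str:
    """Stub for the new model: same decisions, but it capitalizes one label."""
    return "Billing" if "charge" in text else "bug"


cases = [
    ("charged twice", "billing"),
    ("app crashes", "bug"),
    ("double charge on card", "billing"),
]
print(pass_rate(current_model, cases), pass_rate(candidate_model, cases))
print(safe_to_cut_over(current_model, candidate_model, cases))  # False
```

The stubbed regression is a formatting change rather than a wrong decision, which is a common way prompts tuned for one model can break downstream parsing on another.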

Token cost math is real at scale

At 1,000 tokens per example and 10 examples, you are adding 10,000 tokens to every call. At $3 per million tokens (a mid-range 2026 price) and 100,000 calls per day, that is $3,000 per day in example overhead alone. Many-shot ICL with 200 examples could cost $60,000 per day before any model output tokens. Run the math before choosing example count.
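The arithmetic above fits in a few lines, which makes it easy to rerun with your own numbers. The function name is hypothetical, and the sketch counts only the input tokens spent on examples at a flat per-token price.

```python
def daily_example_overhead_usd(
    tokens_per_example: int,
    examples: int,
    calls_per_day: int,
    usd_per_million_tokens: float,
) -> float:
    """Daily cost of the example tokens alone, excluding the query and output tokens."""
    tokens_per_call = tokens_per_example * examples
    tokens_per_day = tokens_per_call * calls_per_day
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


print(daily_example_overhead_usd(1_000, 10, 100_000, 3.0))   # 3000.0
print(daily_example_overhead_usd(1_000, 200, 100_000, 3.0))  # 60000.0
```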

ICL is testable before engineering builds anything

This is the underused advantage. A PM can open a playground, write a system prompt with examples, and validate whether the task is even solvable before a single line of integration code is written. ICL lets you de-risk the core AI assumption in hours rather than sprints.

Many-shot as a fine-tuning alternative

For low to medium volume products (under roughly 50,000 calls per day at current prices), many-shot ICL in a 1M context window can match fine-tuning quality without training infrastructure. The crossover point depends on your token pricing, call volume, and how often the task definition changes. Calculate both options before committing to fine-tuning.

The token cost math at scale

Tokens per example: ~1,000

Examples in prompt: 10

Calls per day: 100,000

Daily cost at $3/M tokens: $3,000

Example overhead only, before any output tokens. Run your own numbers before committing to a high example count.