AI Use Case Hypothesis Template: Validate AI Ideas Before You Build

By Institute of AI PM·12 min read·Jul 3, 2026

TL;DR

Most AI features fail not because the model was wrong but because the hypothesis was incomplete. PMs who write vague use case descriptions like "add AI to improve customer experience" ship features that nobody uses and that the team cannot iterate on. This template forces you to state the hypothesis in four parts: the user job, the AI mechanism, the measurable outcome, and the validation test. Use it before any AI feature goes into the build queue. This article includes the full template, worked examples for B2B and B2C, and a pre-build checklist.

Why AI Features Fail Without a Written Hypothesis

AI feature failures fall into two categories. The first is model failure: the AI produces bad outputs, hallucinates, or wrongly refuses to respond. This is the failure mode everyone worries about and spends the most engineering time on.

The second is hypothesis failure: the feature works as designed, but the design was wrong. Users do not adopt it. The metric it was supposed to move does not move. The "AI" surface does not meaningfully change user behavior. This failure is more common, more expensive (it gets past engineering and into production), and much harder to diagnose after the fact.

Vague job statement

"AI to help users with their documents" — What specifically are users trying to do? Search for information? Draft new content? Extract key terms? Without a specific job, the feature scope expands infinitely and success cannot be measured.

Mechanism assumption

"AI will make this faster" — Faster is not a mechanism. How specifically will AI change what users currently do? Replace a manual step? Generate a starting point? Classify incoming items? The mechanism is the hypothesis you are testing.

Unmeasurable outcome

"Users will be more productive" — Productivity is not measurable without a specific metric and baseline. If you cannot measure whether the hypothesis is true before building, you cannot learn from the experiment after shipping.

No validation plan

Building the full feature to see if users like it is the most expensive and slowest possible test. Most AI hypotheses can be tested with a prototype, a concierge MVP, or a manual simulation in days, not a full build in weeks.

The hypothesis template below forces clarity on each of these dimensions before a single line of code is written. It takes 30 to 60 minutes to fill out and saves weeks of misdirected effort.

The Four-Part Hypothesis Structure

A complete AI use case hypothesis has four components. Each component is independently important, but the four together form a testable, falsifiable claim that engineering can scope and the team can evaluate after shipping.

Part 1: User Job

What specific task is the user trying to accomplish?

A good job statement names a single, specific user action, not a category of actions. It can be observed — you can watch a user succeed or fail at this job. It is not a problem statement or a pain point.

Quality check: Can you watch a user do this job right now without AI? If yes, you have a real job. If not, you may be describing an aspiration rather than a behavior.

Part 2: AI Mechanism

How specifically does AI change what the user does?

A good mechanism statement describes the specific AI action: classifies, generates, retrieves, ranks, summarizes, extracts, predicts. It also names the step in the user workflow where AI intervenes and what the user previously did manually at that step.

Quality check: If someone asked "what does the AI actually do here," your mechanism statement should answer that question in one sentence without jargon.

Part 3: Measurable Outcome

What metric changes, by how much, and from what baseline?

A good outcome statement names a specific metric (time on task, completion rate, error rate, day-7 retention, feature-level NPS), a direction (increases, decreases), a magnitude (from X% to Y%, or by Z%), and a timeframe for measurement.

Quality check: Can you look at this metric right now and tell where it stands today? If not, you need to either find an existing metric or acknowledge that baseline measurement is part of the MVP.

Part 4: Validation Test

What is the fastest, cheapest way to test this hypothesis?

A good validation test is the smallest possible experiment that would give you enough signal to invest more or kill the hypothesis. It names the sample size, the duration, and the decision threshold: "If X happens within Y days with Z users, we proceed."

Quality check: Is this test smaller than building the full feature? If the answer is no, the hypothesis is not specific enough to test without building. Break it down further.
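
The decision rule in Part 4 can be written down mechanically. The sketch below is one possible encoding, not part of the original template: the names `DecisionRule` and `decide`, the three-way result, and the assumption that the metric is "higher is better" are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRule:
    """Hypothetical encoding of 'If X happens within Y days with Z users, we proceed.'"""
    threshold: float  # X: value the metric must reach
    max_days: int     # Y: test duration in days
    min_users: int    # Z: minimum sample size


def decide(rule: DecisionRule, observed: float, days_elapsed: int, users: int) -> str:
    """Return 'proceed', 'stop', or 'keep running' for a higher-is-better metric."""
    enough_users = users >= rule.min_users
    if enough_users and observed >= rule.threshold and days_elapsed <= rule.max_days:
        return "proceed"
    if days_elapsed >= rule.max_days:
        # Time is up and the threshold was not met with a large enough sample.
        return "stop"
    return "keep running"


# Illustrative values only.
rule = DecisionRule(threshold=0.28, max_days=21, min_users=1000)
print(decide(rule, observed=0.29, days_elapsed=21, users=1000))  # proceed
```

Writing the rule as data before the test starts makes it harder to move the threshold after results come in.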

The Template

Copy this template into your team's document system. Fill out one template per AI use case being evaluated. Do not combine multiple use cases into a single template — the hypothesis must be specific to a single user job and mechanism.

# AI Use Case Hypothesis

Feature name: ___________________________

Author: _________________________________

Date: __________________________________

## 1. User Job

User segment: ___________________________

Specific job statement:

"When [context], the user needs to [specific action] so that [immediate outcome for the user]."

How do users do this job today (without AI)?

___________________________________________

How long does it take? What breaks or frustrates?

___________________________________________

## 2. AI Mechanism

AI action (circle one): classify / generate / retrieve / rank / summarize / extract / predict / other: _____

Mechanism statement:

"AI will [specific action] at [step in workflow] so the user no longer needs to [current manual step]."

What input does the AI receive?

___________________________________________

What output does the AI produce?

___________________________________________

What does the user do with that output?

___________________________________________

## 3. Measurable Outcome

Primary metric: _________________________

Current baseline: _______________________

Target: ________________________________

Measurement window: ____________________

Secondary metrics (2 max):

1. _____________________________________

2. _____________________________________

Counter-metrics (what should NOT get worse):

___________________________________________

## 4. Validation Test

Test type (circle): prototype / concierge / A/B / fake door / wizard of oz / shadow mode

Minimum sample size: ___________________

Test duration: _________________________

Decision threshold:

"If [metric] reaches [threshold] within [timeframe], we proceed to full build."

Kill condition:

"If [condition] by [date], we stop and reassess."

## Approvals

PM sign-off: ___________________________

Engineering feasibility confirmed: Y / N

Data availability confirmed: Y / N

Legal/compliance flag needed: Y / N
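
Teams that track hypotheses in a tool rather than a document could mirror the template above as structured data. The following is a minimal sketch under that assumption; the class name `UseCaseHypothesis`, the field names, and the `missing_fields` helper are hypothetical and cover only a subset of the template's blanks.

```python
from dataclasses import dataclass, field, fields


@dataclass
class UseCaseHypothesis:
    # 1. User Job
    user_segment: str = ""
    job_statement: str = ""
    # 2. AI Mechanism
    ai_action: str = ""
    mechanism_statement: str = ""
    # 3. Measurable Outcome
    primary_metric: str = ""
    baseline: str = ""
    target: str = ""
    measurement_window: str = ""
    counter_metrics: list[str] = field(default_factory=list)
    # 4. Validation Test
    test_type: str = ""
    min_sample_size: int = 0
    test_duration: str = ""
    decision_threshold: str = ""
    kill_condition: str = ""

    def missing_fields(self) -> list[str]:
        """Names of fields still blank (empty string, empty list, or zero)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = UseCaseHypothesis(
    user_segment="Legal operations manager",
    primary_metric="time per contract review",
)
print(draft.missing_fields())
```

A non-empty result from `missing_fields()` is a quick signal that the hypothesis is not ready for review.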

Worked Examples

These two examples show how the template applies to common AI product scenarios. Read them in full — the value is in the specificity at each step, not just the structure.

B2B Example: AI-powered contract clause extraction

User Job

Segment: Legal operations manager at a 500-person company. Job: "When reviewing an inbound vendor contract, I need to identify all clauses that deviate from our standard terms so that I can flag them for legal review without reading the full contract." Today: read every page manually, average 45 minutes per contract, 12 contracts per week.

AI Mechanism

AI extracts and classifies clauses against a library of 40 standard-terms patterns at the point when the user uploads a contract PDF. Output: structured list of deviating clauses with the specific text and the standard term it conflicts with. User reviews the flagged list and decides which to escalate.

Measurable Outcome

Primary: time per contract review, baseline 45 min, target 15 min or less. Secondary: clause miss rate (deviating clauses the legal team flags after AI review that the AI missed). Counter-metric: user adoption rate must stay above 80%.
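
To see what this target is worth per user, the arithmetic can be worked through with the numbers stated in the example. This is a rough sketch that assumes every one of the 12 weekly contracts hits the 15-minute target; the variable names are illustrative.

```python
contracts_per_week = 12
baseline_min = 45  # minutes per contract today
target_min = 15    # target minutes per contract

baseline_hours = contracts_per_week * baseline_min / 60  # 9.0 hours per week
target_hours = contracts_per_week * target_min / 60      # 3.0 hours per week
hours_saved = baseline_hours - target_hours
relative_reduction = (baseline_min - target_min) / baseline_min

print(f"{hours_saved:.1f} hours saved per week ({relative_reduction:.0%} reduction)")
# prints "6.0 hours saved per week (67% reduction)"
```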

Validation Test

Concierge test: the team manually extracts clauses from 20 contracts uploaded by 3 legal ops managers over 2 weeks, then measures time saved and accuracy. Decision threshold: if 2 of 3 users report >50% time reduction with <5% miss rate, proceed to automated build.
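
The decision threshold above can be checked with a few lines of code once the concierge test ends. This is a minimal sketch that assumes both conditions are evaluated per user; `UserResult`, `passes`, and `proceed_to_build` are hypothetical names, and the results are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserResult:
    name: str
    time_reduction: float  # fraction of review time saved, e.g. 0.60 = 60%
    miss_rate: float       # fraction of deviating clauses that were missed


def passes(r: UserResult) -> bool:
    # Thresholds from the example: >50% time reduction with <5% miss rate.
    return r.time_reduction > 0.50 and r.miss_rate < 0.05


def proceed_to_build(results: list[UserResult], required: int = 2) -> bool:
    return sum(passes(r) for r in results) >= required


# Made-up results for illustration only, not data from a real test.
results = [
    UserResult("manager_a", time_reduction=0.62, miss_rate=0.03),
    UserResult("manager_b", time_reduction=0.55, miss_rate=0.08),
    UserResult("manager_c", time_reduction=0.70, miss_rate=0.02),
]
print(proceed_to_build(results))  # True: two of three users clear both bars
```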

B2C Example: AI onboarding path recommendation

User Job

Segment: New user in first session on a project management SaaS tool. Job: "When I first open the product, I need to understand which features are relevant to my specific role so that I can get to my first successful project without reading documentation." Today: generic onboarding tour, same for all users, 60% skip it entirely.

AI Mechanism

AI classifies the user into one of 6 role personas based on answers to 3 onboarding questions, then generates a personalized onboarding sequence that surfaces the 4 features most relevant to that persona first. The user sees a different feature order and tooltip copy than the default path.

Measurable Outcome

Primary: activation rate (user completes their first project within 7 days), baseline 22%, target 30%. Secondary: onboarding completion rate (currently 40%). Counter-metric: time to complete onboarding must not increase.

Validation Test

A/B test: 500 new users per arm (AI-personalized vs. current generic). Run for 3 weeks to capture the full day-7 activation window. Decision threshold: if the AI arm shows an activation rate of 28% or higher and the difference from the control arm is statistically significant (p < 0.05), proceed to full rollout.
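
One common way to check the significance condition is a two-proportion z-test; the example does not say which test the team would use, so treat this as one reasonable choice. The sketch uses only the standard library, and the counts are hypothetical: a control arm sitting exactly at the 22% baseline and an AI arm exactly at the 28% threshold.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms share one activation rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts: 110/500 activated in control, 140/500 in the AI arm.
z, p = two_proportion_z_test(successes_a=110, n_a=500, successes_b=140, n_b=500)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these counts z is about 2.19 and p is a little under 0.03, so a lift from 22% to 28% would clear the p < 0.05 bar at 500 users per arm. A smaller observed lift might not, which is a reason to fix the sample size before the test starts.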

Pre-Build Hypothesis Validation Checklist

Before a hypothesis moves from the template into the engineering backlog, run it through this checklist. Any "no" is a signal to sharpen the hypothesis or run a smaller validation test first.

[User Job] User job is observable: you can watch a user succeed or fail at it
[User Job] User job is specific to one segment and one context, not all users in all contexts
[User Job] In the last 60 days, you have spoken to at least 3 users who confirmed this job is a pain point
[AI Mechanism] The AI mechanism names a specific action (generate, classify, retrieve), not an aspiration
[AI Mechanism] You can describe the input the AI receives and the output the user sees in one sentence each
[AI Mechanism] Engineering has confirmed the mechanism is feasible with your current data and model access
[Measurable Outcome] The primary metric has a current baseline you can look up today
[Measurable Outcome] The target is specific enough to know in advance whether you succeeded or failed
[Measurable Outcome] You have named at least one counter-metric to watch for unintended regressions
[Validation Test] The validation test is smaller and faster than building the full feature
[Validation Test] The decision threshold is written down before the test starts, not after you see the results
[Validation Test] The kill condition is also written down, so you know when to stop and reassess
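
The "any no blocks the hypothesis" rule is simple enough to automate if the checklist lives in a tracker. The sketch below is illustrative: the function names are hypothetical, the item labels are shortened, and the answers are placeholders rather than real review results.

```python
from collections import defaultdict

# (part, item, answer) triples; answers are placeholders for illustration.
checklist = [
    ("User Job", "Job is observable", True),
    ("AI Mechanism", "Engineering confirmed feasibility", False),
    ("Measurable Outcome", "Primary metric has a baseline", True),
    ("Validation Test", "Kill condition is written down", False),
]


def open_items(items):
    """Group every 'no' answer by hypothesis part."""
    blocked = defaultdict(list)
    for part, item, answer in items:
        if not answer:
            blocked[part].append(item)
    return dict(blocked)


def ready_for_backlog(items) -> bool:
    # Any "no" means the hypothesis needs sharpening or a smaller test first.
    return all(answer for _, _, answer in items)


print(ready_for_backlog(checklist))  # False
print(open_items(checklist))
```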

Common Hypothesis Mistakes and How to Fix Them

The hypothesis describes a feature, not a user behavior change

Signal: The job statement says "user has access to AI assistant" instead of "user drafts meeting summaries."

Fix: Rewrite the job statement to describe what the user does differently, not what the product does differently. The product is the mechanism, not the job.

The outcome is a model quality metric, not a user metric

Signal: "AI achieves 92% accuracy on the classification task" — this is an engineering success criterion, not a product hypothesis.

Fix: Model quality is a prerequisite, not an outcome. Add a downstream user metric: time saved, task completion rate, or retention. If model quality does not translate to user behavior change, the hypothesis is incomplete.

The validation test is the full build

Signal: "We will ship the feature and see if users adopt it."

Fix: Ask what the fastest test is that would give you 70% of the signal. A concierge (manual) version of the AI, a prototype with simulated outputs, or a fake door test to measure intent all work for most AI hypotheses.

The hypothesis covers multiple user jobs at once

Signal: The job statement has "and" or "or" in it: "user can draft replies and categorize incoming tickets."

Fix: Split into two separate hypotheses. Multi-job hypotheses cannot be evaluated cleanly because you cannot attribute success or failure to a specific mechanism.

The decision threshold is set after seeing preliminary results

Signal: Team says "let's see how it looks after two weeks" instead of defining the threshold before launching.

Fix: Write the decision threshold before the test starts, on the hypothesis document itself. Changing the threshold after seeing results is the most common source of confirmation bias in AI product work.
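
Some of the signals above are mechanical enough to flag automatically before a human review. The following is a crude heuristic sketch, not a substitute for reading the hypothesis: `lint_hypothesis`, the keyword list, and the regular expressions are assumptions chosen for illustration, and they will produce false positives (a single job can legitimately contain "and").

```python
import re

# Terms that suggest a model quality metric rather than a user metric (assumed list).
MODEL_METRIC_TERMS = ("accuracy", "precision", "recall", "f1")


def lint_hypothesis(job_statement: str, primary_metric: str, validation_test: str) -> list[str]:
    warnings = []
    if re.search(r"\b(and|or)\b", job_statement, flags=re.IGNORECASE):
        warnings.append("job statement may cover multiple jobs; consider splitting")
    if any(term in primary_metric.lower() for term in MODEL_METRIC_TERMS):
        warnings.append("primary metric looks like a model quality metric; add a user metric")
    if re.search(r"\bship\b.*\bsee\b", validation_test, flags=re.IGNORECASE):
        warnings.append("validation test may be the full build; look for a smaller test")
    return warnings


for warning in lint_hypothesis(
    job_statement="user can draft replies and categorize incoming tickets",
    primary_metric="AI achieves 92% accuracy on the classification task",
    validation_test="We will ship the feature and see if users adopt it",
):
    print(warning)
```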
