AI Product Scoping: How to Define the Right Problem Before You Build

Why Scoping Breaks Differently for AI Products

In traditional product development, scoping is primarily a prioritization exercise: given a list of user problems and business goals, which subset do we build first? The technology is mostly predictable. If you spec a search filter, you know roughly what a search filter can and cannot do.

AI changes the constraint. The technology is not predictable. A model that works beautifully in the demo may fail on 30% of real user inputs. A feature that appears to solve the right problem may solve it in a way users do not trust. "What the model does" and "what users need from the model" are two different boundaries, and misaligning them is the most common cause of AI feature failure.

Capability overestimation

PMs scope to what they saw in the demo, not what the model does on production inputs. Demo inputs are cherry-picked for model strengths. Production inputs are unpredictable. Scope to production inputs, which means testing the model on real data before you finalize scope.

Underspecified success

"The AI summarizes the contract" is not a scope. A scope defines what correct output looks like, what the acceptable error rate is, and what happens to summaries outside the acceptance threshold. Without this, the team builds to an undefined bar and ships something nobody can evaluate.

Missing the failure mode

Traditional product scoping considers the happy path and a few error states. AI products have a third state: plausible-but-wrong output. The model produces something that looks correct but is not. This is worse than an error because the user may not catch it. Scoping must include what happens when the output is confidently wrong.

Scoping without a human fallback

AI is probabilistic. Some percentage of inputs will always fall outside the model's competence. Scoping without specifying how those cases are handled produces products that silently fail a minority of users with no recovery path. The scope must define the fallback before it defines the feature.

The Five Questions Before You Write a Spec

These five questions should be answered sequentially. Answering them out of order produces scope that looks complete on paper but fails in execution. Each answer constrains the next. If there is even one you cannot answer, you are not ready to write a PRD.

1. What is the exact task the model must perform?

Not the feature — the model task. A feature is 'generate a meeting summary.' The model task is 'extract action items, key decisions, and next steps from a 45-minute meeting transcript in bullet form, attributing each item to the person who said it.' The model task is a testable specification. The feature is not.

Scoping test: Can you evaluate whether the model succeeded or failed on a given input without asking for an opinion? If not, the task is underspecified.
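
One way to check that a task is testable is to try writing the check as code. The sketch below is a minimal, hypothetical example for the meeting-summary task: it assumes the model returns a list of items with `kind`, `text`, and `owner` fields (these field names and the sample data are made up for illustration). It only verifies structure. Whether the attribution is actually correct still requires comparing against a reference answer.

```python
# Hypothetical output schema for the meeting-summary task; not from any real API.
ALLOWED_KINDS = {"action item", "key decision", "next step"}


def is_well_formed(items):
    """True if every bullet has an allowed kind, non-empty text, and an owner."""
    return bool(items) and all(
        item.get("kind") in ALLOWED_KINDS
        and str(item.get("text", "")).strip()
        and str(item.get("owner", "")).strip()
        for item in items
    )


# Made-up outputs: the second one is missing its attribution.
good = [{"kind": "action item", "text": "Send revised quote", "owner": "Priya"}]
bad = [{"kind": "action item", "text": "Send revised quote", "owner": ""}]
print(is_well_formed(good), is_well_formed(bad))  # True False
```

If you cannot write even a structural check like this, the task definition is probably still a feature description.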

2. What does the input distribution look like?

Get 50 real inputs from your target users or a representative sample. Not hand-crafted examples. Not marketing transcripts. Real user data. Run the model on them and look at the output distribution. This tells you the real scope, not the hoped-for scope.

Scoping test: What percentage of inputs produce output within the acceptance threshold? If you do not know, you are guessing at scope.

3. What is the acceptable failure rate?

Some tasks tolerate 20% model error if the stakes are low (a first draft the user edits). Others tolerate no more than 1% model error because the stakes are high (a medical triage recommendation). Define this number before you build, not after. It determines your architecture, your human-in-the-loop design, and your cost model.

Scoping test: If the model fails on 15% of inputs, does the product still deliver value? If yes, is there a human fallback for the 15%? If no, you need a different architecture or a different scope.
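
The scoping test above is a small decision rule, and writing it down makes the order of the questions explicit. This is a minimal sketch, not a standard method: the function name, the parameter names, and the verdict strings are all assumptions made for illustration. The 20%, 1%, and 15% figures are the ones used in the text.

```python
def scoping_verdict(observed_failure_rate, acceptable_failure_rate, has_human_fallback):
    """Apply the scoping test: value first, then fallback (illustrative labels)."""
    if not 0.0 <= observed_failure_rate <= 1.0:
        raise ValueError("observed_failure_rate must be between 0 and 1")
    # The product does not deliver value at this failure rate.
    if observed_failure_rate > acceptable_failure_rate:
        return "rescope: change the architecture or narrow the scope"
    # The product delivers value, but the failing inputs have nowhere to go.
    if not has_human_fallback:
        return "gap: define a human fallback for the failing inputs"
    return "ok: proceed to the spec"


# First draft the user edits: tolerates 20%, the model fails on 15%.
print(scoping_verdict(0.15, 0.20, has_human_fallback=True))
# Medical triage recommendation: tolerates 1%, the model fails on 15%.
print(scoping_verdict(0.15, 0.01, has_human_fallback=True))
```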

4. What happens when the model is wrong?

Design the failure state before the success state. Here, wrong means confidently wrong output that passes through to the user. The design questions are: does the product surface uncertainty? Does it route to a human? Does it show the reasoning so the user can catch the error? Build the failure handling into scope, not into post-launch firefighting.

Scoping test: Pick your three most common failure types from the input distribution test. Walk through the user experience for each. If any produces a bad user outcome with no recovery, that is a scoping gap.

5. What does the human fallback look like?

Define who handles out-of-scope cases, what triggers the handoff, and what the human needs to do the job well. This is not an edge case design — it is a core part of the scope. Products without a defined human fallback either fail silently or create incidents when edge cases hit production.

Scoping test: If the model were unavailable for 24 hours, what is the manual process? Document it. That process is your fallback, and your product must support it.

Capability Boundary Testing: How to Know What the Model Actually Does

Capability boundary testing is the work you do between "we have a problem the model might solve" and "we have scoped the feature." It is a structured experiment, not a demo. Its goal is to find the edges of the model's competence on your specific task before committing to a build.

Step 1: Build a test set from real inputs

Collect 50-100 real inputs from the exact population your feature will serve. If you are building a contract summarizer for a law firm, get 50 real contracts from that firm. Split them into easy (well-formatted, standard terms), medium (edge cases, unusual clauses), and hard (complex, ambiguous, novel). This three-tier structure reveals where the model starts failing.

Step 2: Define a binary quality rubric

Before you run the model, define what 'correct' means in a way a non-expert could evaluate. For a contract summarizer: 'Did the summary include all parties, all key dates, and all material obligations? Yes or No.' Binary rubrics reveal the true pass rate without the subjective noise of 1-5 scoring.
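
A binary rubric is easy to record and hard to argue with. The sketch below assumes a reviewer answers the three yes/no questions from the contract example for each summary. The class name, field names, and the four sample scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractSummaryScore:
    """One reviewer's yes/no answers for a single summary."""
    all_parties: bool
    all_key_dates: bool
    all_material_obligations: bool

    def passed(self) -> bool:
        # Binary rubric: a single "No" fails the whole output.
        return self.all_parties and self.all_key_dates and self.all_material_obligations


# Made-up reviewer scores for four summaries.
scores = [
    ContractSummaryScore(True, True, True),
    ContractSummaryScore(True, False, True),   # missed a key date
    ContractSummaryScore(True, True, True),
    ContractSummaryScore(False, True, True),   # missed a party
]
pass_rate = sum(s.passed() for s in scores) / len(scores)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 50%
```

Keeping the three answers separate, rather than storing only pass or fail, also tells you which question the model fails most often.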

Step 3: Run and score

Run all 50-100 inputs through the model with your intended prompting approach. Score each output against the rubric. Calculate pass rates by tier (easy, medium, hard). The overall pass rate is not the useful number. The hard-tier pass rate is. That is what you will see in production when real users push the boundaries.
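
The per-tier arithmetic is simple, but it is worth seeing how an overall number can hide the hard tier. A minimal sketch with made-up results for 50 inputs (the function name and the data are illustrative assumptions):

```python
from collections import defaultdict


def pass_rates_by_tier(results):
    """results: iterable of (tier, passed) pairs. Returns {tier: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    return {tier: passes[tier] / totals[tier] for tier in totals}


# Made-up scored results: 20 easy, 20 medium, 10 hard.
results = (
    [("easy", True)] * 19 + [("easy", False)] * 1
    + [("medium", True)] * 16 + [("medium", False)] * 4
    + [("hard", True)] * 6 + [("hard", False)] * 4
)
rates = pass_rates_by_tier(results)
overall = sum(passed for _, passed in results) / len(results)

print(f"overall: {overall:.0%}")      # overall: 82%
print(f"hard: {rates['hard']:.0%}")   # hard: 60%
```

In this made-up set, an 82% overall pass rate sits on top of a 60% hard-tier pass rate.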

Step 4: Read the failure patterns

Categorize failed outputs. Common categories: omitted information, hallucinated information, wrong attribution, formatting failures, misunderstood task. The failure category distribution tells you whether the problem is fixable with prompt engineering (formatting failures usually are) or structural (hallucination on low-frequency inputs rarely is without RAG or fine-tuning).
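
Once each failed output has a category label, the distribution is a simple tally. The sketch below is illustrative: the labels are assigned by a reviewer, the sample data is made up, and the `LIKELY_FIX` mapping encodes only the two rules of thumb stated above. Every other category is marked as needing investigation rather than guessed.

```python
from collections import Counter

# Only the two rules of thumb from the text; everything else is left open.
LIKELY_FIX = {
    "formatting failure": "prompt engineering",
    "hallucinated information": "structural (RAG or fine-tuning)",
}


def failure_breakdown(labels):
    """Return (category, count, share of failures, likely fix), most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [
        (category, n, n / total, LIKELY_FIX.get(category, "needs investigation"))
        for category, n in counts.most_common()
    ]


# Made-up reviewer labels for ten failed outputs.
labels = (
    ["omitted information"] * 4
    + ["hallucinated information"] * 3
    + ["formatting failure"] * 2
    + ["wrong attribution"] * 1
)
for category, n, share, fix in failure_breakdown(labels):
    print(f"{category:26} {n:2d} {share:5.0%}  {fix}")
```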

The output of capability boundary testing is not a go/no-go decision. It is a scoping decision. If the hard tier passes at 60% and that is below what you can accept, the decision is: scope the feature to exclude hard-tier inputs from the automated flow, route them to the human fallback, and communicate that scope to engineering and design before any UI work begins.
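
Turning tier pass rates into a scoping decision can be sketched as a split between the automated flow and the human fallback. The function name, the 80% threshold, and the pass rates are assumptions for illustration. The sketch also assumes you have some way to recognize which tier an incoming input belongs to, which the test set labels alone do not give you in production.

```python
def route_tiers(pass_rates, acceptance_threshold):
    """Split tiers into the automated flow and the human fallback."""
    automated = [t for t, r in pass_rates.items() if r >= acceptance_threshold]
    fallback = [t for t, r in pass_rates.items() if r < acceptance_threshold]
    return {"automated": automated, "human_fallback": fallback}


# Made-up pass rates; the threshold is a business decision, not a default.
plan = route_tiers(
    {"easy": 0.95, "medium": 0.80, "hard": 0.60},
    acceptance_threshold=0.80,
)
print(plan)  # {'automated': ['easy', 'medium'], 'human_fallback': ['hard']}
```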

The AI Scoping Document Format

A scoping document for an AI feature is distinct from a PRD. The PRD comes after scoping. The scoping document is the input to the PRD. It captures the decisions made during capability testing and the constraints that define what the PRD can and cannot propose. Here is the structure that works in practice.

Problem statement (3 sentences max)

Who has this problem, what they are doing today instead, and what a better outcome looks like. Not the solution, not the feature. The problem, grounded in user research.

Model task definition

The exact task the model must perform, stated as a testable specification. Include input format, expected output format, and what counts as a correct output. This becomes the eval rubric.

Input distribution summary

What the 50-100 real inputs looked like. Easy/medium/hard tier distribution. Any input types that should be explicitly out of scope based on the capability boundary test.

Pass rates by tier

Easy: X%, Medium: X%, Hard: X%. Document the model and prompt used. This anchors the quality expectation for engineering and becomes the baseline for future improvement.

Failure mode inventory

The three to five most common failure categories with example outputs. For each: severity (does this harm the user?), frequency (what percentage of inputs?), and mitigation (fixable via prompt engineering, requires RAG, requires fine-tuning, or out of scope).

Acceptance threshold

The minimum pass rate required for launch. Below this threshold, the feature does not ship in the automated path. This is a business and UX decision, not a technical one, and it belongs in the scoping document.

Human fallback design

Who handles the inputs that fall below the acceptance threshold. What triggers the handoff. What information they receive to do the job. What the target handling time is.
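
If your team keeps scoping documents in a structured form, the seven sections above map onto a small record type. This is one possible shape, not a standard format: every class and field name is an assumption for illustration. The `missing_sections` helper simply lists what is still empty, so a PRD does not start from an incomplete scope.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureMode:
    category: str
    severity: str       # does this harm the user?
    frequency: float    # share of inputs affected, 0 to 1
    mitigation: str     # prompt engineering, RAG, fine-tuning, or out of scope
    example_output: str = ""


@dataclass
class HumanFallback:
    owner: str
    handoff_trigger: str
    context_provided: str
    target_handling_time: str


@dataclass
class ScopingDocument:
    problem_statement: str = ""
    model_task: str = ""
    input_distribution: dict = field(default_factory=dict)  # tier -> input count
    pass_rates: dict = field(default_factory=dict)          # tier -> pass rate
    model_and_prompt: str = ""
    failure_modes: list = field(default_factory=list)       # FailureMode items
    acceptance_threshold: Optional[float] = None
    human_fallback: Optional[HumanFallback] = None

    def missing_sections(self):
        """Names of sections that are still empty or unset."""
        return [
            name
            for name, value in vars(self).items()
            if value is None or (isinstance(value, (str, dict, list)) and not value)
        ]


draft = ScopingDocument(problem_statement="...", acceptance_threshold=0.85)
print(draft.missing_sections())
```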

The Five Scoping Failures That Kill AI Products

These are not hypothetical. They are the scoping decisions that cause launched AI features to fail, get pulled, or sit unused. Study them before you finalize any scope.

Scoping to the demo, not the data

The demo uses clean, well-formatted, typical inputs. Production sees everything. Before finalizing scope, test the model on the ugliest, most ambiguous inputs you can find. Those inputs will reach production, and the failures they cause will define your early user reviews.

No numeric threshold

"The model should be accurate" is not a scope. "The model must achieve 85% correct output on the test set before launch" is a scope. Without a number, there is no agreed standard for done, and teams ship at wildly different quality levels than stakeholders expect.

Designing for the average case only

The average user input is not the problem. The 10% of unusual inputs is the problem. Users remember the failures, not the successes. Explicitly design for what happens when inputs are unusual, ambiguous, or adversarial. This belongs in scope, not in post-launch bug reports.

Shipping without a confidence signal

If the model is right 80% of the time, users need a way to know which 20% to verify. Outputs with no confidence signal train users to either trust everything (bad) or verify everything (defeats the purpose). Scoping must include how uncertainty is communicated, even if it is just a visual indicator.
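
A confidence signal can start as a simple mapping from a score to a user-facing indicator. The sketch below assumes you already have some per-output confidence score between 0 and 1. How you get that score is outside this article, and the function name, thresholds, and labels are hypothetical. Check any thresholds against your scored test set before relying on them.

```python
def confidence_indicator(score, review_below=0.8, handoff_below=0.5):
    """Map a hypothetical 0-1 confidence score to a user-facing indicator."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < handoff_below:
        return "route to human"
    if score < review_below:
        return "verify before use"
    return "no flag"


for s in (0.9, 0.7, 0.3):
    print(s, confidence_indicator(s))
```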

Building the automation before the human workflow

PMs often scope the automated AI feature before understanding how the work gets done manually today. This is backwards. Map the manual process first. Understand which steps are the bottleneck. Scope the AI to automate the highest-value, most automatable step. Not the whole workflow.