AI User Acceptance Testing (UAT) Template for Product Managers
TL;DR
UAT for AI features is harder than UAT for traditional software because outputs are non-deterministic. This template gives you a structured approach: test scripts that exercise the model, scoring rubrics that capture both correctness and acceptability, edge case coverage, and sign-off criteria that protect you from quietly shipping a regression. Copy-paste ready.
Why AI UAT Is Different
In traditional UAT, you compare actual output to expected output. In AI UAT, the "expected output" doesn't exist as a fixed string — there are many acceptable outputs and many unacceptable ones. UAT shifts from binary pass/fail to rubric-based scoring, and from one-shot tests to scenarios that capture variance.
Rubric-based scoring
Each test case gets graded on multiple dimensions: accuracy, format, tone, helpfulness, safety. Composite score determines pass.
Multiple runs per case
Run each test case 3-5 times to capture variance. Single-run UAT lies about AI consistency.
Edge case explicit
Adversarial inputs, ambiguous requests, malformed inputs. AI fails differently than deterministic code, often more spectacularly.
Trust scenario testing
Test what happens when the AI is wrong. Does the user have recovery paths? Does the product mislead?
UAT Test Plan Structure
1. Test scope and assumptions
What is being tested. What is not. Model version, prompt version, retrieval source. Lock these before UAT begins.
2. Persona-based test scenarios
Group test cases by user persona. Each persona gets 5-10 scenarios reflecting real workflows.
3. Test cases per scenario
Specific inputs + expected behavior bands + scoring rubric. Concrete enough that two testers would score similarly.
4. Edge cases
Adversarial, malformed, ambiguous, off-topic. Each has explicit expected handling — refuse, hedge, ask, or attempt.
5. Trust and safety scenarios
What happens when the AI is wrong? What does the user do? Does the product fail safely or misleadingly?
6. Sign-off criteria
Pass thresholds per scoring dimension. Multiple sign-offs (PM, eng, design, legal as needed). Conditions that block ship.
Scoring Rubric Template
Correctness (0-3)
0 = wrong, 1 = partially correct, 2 = correct but incomplete, 3 = fully correct. Most important dimension for factual tasks.
Format adherence (0-3)
0 = wrong format, 3 = matches required format exactly. Critical for downstream automation.
Tone and helpfulness (0-3)
0 = unhelpful or off-tone, 3 = appropriately helpful and on-brand. Subjective but trackable.
Safety (pass/fail)
Binary. Any unsafe output is automatic fail regardless of other scores.
Build UAT Discipline in the Masterclass
The AI PM Masterclass includes UAT design exercises with real test plans and instructor feedback. Stop shipping AI on vibes — build the muscle for defensible launches.
Sign-Off Criteria
Pass thresholds per scenario
Average score ≥2.5/3 across happy path scenarios. ≥2.0/3 across edge cases. Zero safety failures across all runs.
Variance bounds
Across multiple runs of the same case, score variance ≤0.5. High variance suggests prompt instability that needs fixing before ship.
Stakeholder sign-offs
PM (correctness, scope), Eng (technical readiness), Design (experience), Legal/Safety (risk). Missing sign-off blocks ship.
Conditional ships
Sometimes ship with caveats: behind a feature flag, to a specific user segment, with a kill switch primed. Document the conditions explicitly.
Common Mistakes to Avoid
Single-run testing
AI outputs vary. A test that passes once may fail 3 of the next 10 times. Always run multiple times per case.
Pass/fail without rubrics
"It looks fine" isn't a sign-off. Rubrics force testers to articulate what good looks like before they see the output.
No edge case coverage
Happy path tests catch obvious bugs. Edge cases catch the embarrassing ones that hit production and become public.
Skipping safety scenarios
Safety failures are not graceful — they're viral. Always run prompt injection, jailbreak, and harmful content tests.