AI PM Templates

AI Experiment Brief Template: Run Better A/B Tests

A comprehensive template for planning and documenting AI feature experiments with proper hypothesis frameworks, statistical rigor, and rollout guardrails.

By Institute of AI PM · 11 min read · Dec 14, 2025

AI features require more rigorous experimentation than traditional product changes. Model behavior varies across user segments, latency impacts user experience differently, and statistical significance is harder to achieve with probabilistic outputs. This template helps you plan experiments that account for these complexities.

Why AI Experiments Are Different

Probabilistic Outputs

Same input can produce different outputs, requiring larger sample sizes and longer test durations.

Latency Sensitivity

AI inference time affects UX. A "better" model that's slower may underperform a faster, simpler one.

Segment Variance

AI performance varies dramatically across user segments, languages, and content types.

Cost Implications

Experiments with AI features directly impact infrastructure costs during the test period.

The Complete Template

Copy and fill out this template for every AI experiment you run:

╔══════════════════════════════════════════════════════════════════╗
║              AI EXPERIMENT BRIEF                                 ║
╠══════════════════════════════════════════════════════════════════╣

EXPERIMENT OVERVIEW
───────────────────────────────────────────────────────────────────
Experiment Name:    [Clear, descriptive name]
Experiment ID:      [EXP-AI-YYYY-###]
Feature Area:       [e.g., Search, Recommendations, Chat]
Owner:              [PM Name]
Engineering Lead:   [Eng Name]
Data Scientist:     [DS Name]
Start Date:         [Target start]
Duration:           [Minimum runtime in days/weeks]

HYPOTHESIS
───────────────────────────────────────────────────────────────────
We believe that:
  [Specific change or intervention]

For:
  [Target user segment]

Will result in:
  [Expected outcome with direction]

Because:
  [Rationale based on data, research, or theory]

We will know this is true when:
  [Primary success metric] changes by [X%] with [confidence level]

VARIANTS
───────────────────────────────────────────────────────────────────
┌─────────┬────────────────────────────────────────────────────────┐
│ Variant │ Description                                            │
├─────────┼────────────────────────────────────────────────────────┤
│ Control │ Current production behavior                            │
│         │ Model: [current model name/version]                    │
│         │ Prompt: [current prompt version]                       │
│         │ Config: [relevant parameters]                          │
├─────────┼────────────────────────────────────────────────────────┤
│ Test A  │ [Describe the change]                                  │
│         │ Model: [new model name/version]                        │
│         │ Prompt: [new prompt version]                           │
│         │ Config: [changed parameters]                           │
├─────────┼────────────────────────────────────────────────────────┤
│ Test B  │ [If testing multiple variants]                         │
│ (opt)   │ Model: [...]                                           │
└─────────┴────────────────────────────────────────────────────────┘

PRIMARY METRICS
───────────────────────────────────────────────────────────────────
Metric Name:        [e.g., Task Completion Rate]
Current Baseline:   [X%]
Minimum Detectable Effect (MDE): [Y%]
Expected Direction: [Increase/Decrease]
Measurement Method: [How it's calculated]

SECONDARY METRICS
───────────────────────────────────────────────────────────────────
┌─────────────────────────┬──────────┬─────────────────────────────┐
│ Metric                  │ Baseline │ Expected Impact             │
├─────────────────────────┼──────────┼─────────────────────────────┤
│ [Engagement metric]     │ [X]      │ [Direction + magnitude]     │
│ [Quality metric]        │ [X]      │ [Direction + magnitude]     │
│ [Efficiency metric]     │ [X]      │ [Direction + magnitude]     │
│ [User satisfaction]     │ [X]      │ [Direction + magnitude]     │
└─────────────────────────┴──────────┴─────────────────────────────┘

GUARDRAIL METRICS (Must Not Regress)
───────────────────────────────────────────────────────────────────
┌─────────────────────────┬──────────┬─────────────────────────────┐
│ Metric                  │ Baseline │ Max Acceptable Regression   │
├─────────────────────────┼──────────┼─────────────────────────────┤
│ Error Rate              │ [X%]     │ No more than [+Y%]          │
│ P95 Latency             │ [Xms]    │ No more than [+Yms]         │
│ Cost per Request        │ [$X]     │ No more than [+$Y]          │
│ Safety Incidents        │ [X]      │ No increase                 │
│ [Business metric]       │ [X]      │ No more than [-Y%]          │
└─────────────────────────┴──────────┴─────────────────────────────┘

STATISTICAL REQUIREMENTS
───────────────────────────────────────────────────────────────────
Sample Size per Variant:     [Calculated number]
Confidence Level:            [Usually 95%]
Statistical Power:           [Usually 80%]
Test Type:                   [One-tail / Two-tail]
Multiple Comparison Correction: [Bonferroni / None]
Minimum Runtime:             [X days, accounting for weekly cycles]

Sample Size Calculation:
  Baseline conversion: [X%]
  MDE: [Y%]
  Traffic available: [Z users/day]
  Estimated duration: [Calculated days]

TRAFFIC ALLOCATION
───────────────────────────────────────────────────────────────────
Total Eligible Traffic:      [X% of all users]

Allocation:
  Control: [X%]
  Test A:  [Y%]
  Test B:  [Z%] (if applicable)

Targeting Criteria:
  - [Geographic restrictions]
  - [Platform restrictions]
  - [User segment restrictions]
  - [Feature flag dependencies]

Exclusions:
  - [Users in other experiments]
  - [New users < X days]
  - [Enterprise accounts]
  - [Other exclusions]

AI-SPECIFIC CONSIDERATIONS
───────────────────────────────────────────────────────────────────
Model Details:
  Provider:           [OpenAI / Anthropic / Internal / etc.]
  Model Version:      [Specific version, not "latest"]
  Temperature:        [X.X]
  Max Tokens:         [X]
  Other Parameters:   [List any relevant params]

Prompt Version Control:
  Control Prompt ID:  [prompt-v1.2.3]
  Test Prompt ID:     [prompt-v1.3.0]
  Prompt Diff:        [Link to diff or summary of changes]

Caching Strategy:
  [ ] Caching disabled for experiment
  [ ] Cache key includes variant
  [ ] Semantic caching considerations

Cost Projections:
  Control cost/request:   [$X.XXXX]
  Test cost/request:      [$X.XXXX]
  Daily experiment cost:  [$X]
  Total experiment cost:  [$X]

ROLLOUT PLAN
───────────────────────────────────────────────────────────────────
Phase 1 - Smoke Test (Day 1-2):
  Traffic: 1%
  Success Criteria: No errors, latency within bounds
  Kill Criteria: Error rate > [X%], P99 > [Xms]

Phase 2 - Initial Ramp (Day 3-7):
  Traffic: 10%
  Success Criteria: Guardrails holding
  Kill Criteria: Any guardrail breach

Phase 3 - Full Experiment (Day 8+):
  Traffic: [Target %]
  Success Criteria: Statistical significance
  Kill Criteria: Sustained guardrail breach

KILL SWITCH CRITERIA
───────────────────────────────────────────────────────────────────
Automatically stop experiment if:
  [ ] Error rate exceeds [X%] for [Y minutes]
  [ ] P95 latency exceeds [Xms] for [Y minutes]
  [ ] Safety incident detected
  [ ] Cost exceeds [X%] of budget
  [ ] [Custom business metric] breaches threshold

Manual review triggered if:
  [ ] Negative user feedback spike
  [ ] Anomaly in segment performance
  [ ] Unexpected model behavior reported

ANALYSIS PLAN
───────────────────────────────────────────────────────────────────
Primary Analysis:
  Date: [X days after full ramp]
  Method: [Frequentist / Bayesian]
  Analyst: [Name]

Segment Analysis:
  - By user tenure (new vs returning)
  - By platform (web vs mobile)
  - By geography
  - By usage intensity
  - By content type / category

Qualitative Analysis:
  - Sample review of AI outputs (N=[X])
  - User feedback themes
  - Edge case identification

DECISION FRAMEWORK
───────────────────────────────────────────────────────────────────
Ship if:
  [ ] Primary metric improves by ≥[MDE] with ≥95% confidence
  [ ] No guardrail breaches
  [ ] Segment analysis shows no significant negative segments
  [ ] Cost impact acceptable

Iterate if:
  [ ] Directionally positive but not significant
  [ ] Mixed segment results
  [ ] Unexpected secondary metric movements

Kill if:
  [ ] Primary metric negative or flat
  [ ] Guardrail breaches
  [ ] Cost impact unacceptable
  [ ] Qualitative concerns

STAKEHOLDER COMMUNICATION
───────────────────────────────────────────────────────────────────
Pre-Launch:
  [ ] Engineering review complete
  [ ] Data science review complete
  [ ] Legal/compliance review (if needed)
  [ ] Stakeholder alignment meeting

During Experiment:
  [ ] Daily monitoring owner: [Name]
  [ ] Weekly update to: [Stakeholders]
  [ ] Escalation path: [Process]

Post-Experiment:
  [ ] Results readout: [Date]
  [ ] Attendees: [List]
  [ ] Documentation: [Where results will be stored]

APPENDIX
───────────────────────────────────────────────────────────────────
Related Experiments:    [Links to related past experiments]
Technical Spec:         [Link to eng spec]
Prompt Documentation:   [Link to prompt docs]
Dashboard Link:         [Link to experiment dashboard]
Data Pipeline:          [Link to relevant data documentation]

╚══════════════════════════════════════════════════════════════════╝
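
To fill in the STATISTICAL REQUIREMENTS and Sample Size Calculation fields, a standard two-proportion power calculation is usually enough. Below is a minimal Python sketch using the normal-approximation formula; the 30% baseline, 15% relative MDE, and 20,000 eligible users/day are placeholder numbers to swap out, not recommendations.

from statistics import NormalDist
import math

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for 95% confidence, two-tailed
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numer = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
             + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(numer ** 2 / (p2 - p1) ** 2)

# Placeholder inputs -- replace with your experiment's baseline, MDE, and traffic.
n_per_variant = sample_size_per_variant(baseline=0.30, mde_relative=0.15)
eligible_users_per_day = 20_000
num_variants = 2                     # Control + Test A

days_to_reach_sample = math.ceil(n_per_variant * num_variants / eligible_users_per_day)
min_runtime_days = max(days_to_reach_sample, 14)   # at least two full weekly cycles

print(f"Sample size per variant: {n_per_variant:,}")
print(f"Minimum runtime: {min_runtime_days} days")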

Writing Strong AI Hypotheses

A good hypothesis is specific, measurable, and falsifiable. Here's the framework:

The 4-Part Hypothesis Structure

1. We believe that

[Specific, implementable change] - Be precise about what you're changing

2. For

[Target segment] - Who will be affected? Be specific about the population

3. Will result in

[Measurable outcome] - Include direction and magnitude if possible

4. Because

[Rationale] - Why do you expect this? What's the mechanism?

Example Hypotheses

Weak Hypothesis

"Using GPT-4 instead of GPT-3.5 will improve our AI assistant."

Problem: No specific metric, no target segment, no expected magnitude, no rationale.

Strong Hypothesis

"We believe that upgrading from GPT-3.5 to GPT-4 for complex multi-step queries (3+ follow-ups) for power users (10+ sessions/week) will increase task completion rate by 15% because GPT-4's improved context retention reduces the need for users to repeat information."

Strength: Specific change, defined segment, measurable outcome, clear mechanism.

AI Experiment Metrics Guide

Choose metrics from each category to get a complete picture:

Quality Metrics

  • Task completion rate
  • Answer accuracy (human eval)
  • Relevance score
  • Hallucination rate
  • User satisfaction (thumbs up/down)

Engagement Metrics

  • Feature adoption rate
  • Sessions per user
  • Queries per session
  • Return usage (D1, D7, D30)
  • Time spent with feature

Efficiency Metrics

  • Time to completion
  • Interactions to success
  • Edit/regeneration rate
  • Fallback to manual rate
  • Support ticket reduction

Guardrail Metrics

  • Error rate
  • P95/P99 latency
  • Cost per request
  • Safety/toxicity flags
  • Timeout rate
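
To make the guardrail category concrete, here is a minimal Python sketch that computes error rate, tail latency, and cost per request from raw request logs. The record fields (latency_ms, error, cost_usd) are hypothetical stand-ins for whatever your logging pipeline actually emits.

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already-sorted list (pct in [0, 1])."""
    idx = min(len(sorted_values) - 1, round(pct * (len(sorted_values) - 1)))
    return sorted_values[idx]

def guardrail_summary(requests):
    """Summarize guardrail metrics from per-request log records."""
    latencies = sorted(r["latency_ms"] for r in requests)
    total = len(requests)
    return {
        "error_rate": sum(r["error"] for r in requests) / total,
        "p95_latency_ms": percentile(latencies, 0.95),
        "p99_latency_ms": percentile(latencies, 0.99),
        "cost_per_request_usd": sum(r["cost_usd"] for r in requests) / total,
    }

# Tiny fake log sample; in practice these records come from your logging pipeline.
logs = [
    {"latency_ms": 820.0,  "error": False, "cost_usd": 0.0031},
    {"latency_ms": 1460.0, "error": False, "cost_usd": 0.0042},
    {"latency_ms": 2900.0, "error": True,  "cost_usd": 0.0008},
]
print(guardrail_summary(logs))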

Common AI Experiment Mistakes

1. Ending Too Early

AI outputs have high variance. What looks significant on day 3 often regresses to the mean by day 14. Run for at least 2 full weekly cycles.

2. Ignoring Segment Differences

Overall positive results can hide that you're hurting a key segment. Always break down by user type, content type, and platform.
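
A minimal sketch of that breakdown, assuming results land in a pandas DataFrame with hypothetical variant, segment, and task_completed columns:

import pandas as pd

# Hypothetical experiment results: one row per user or session.
df = pd.DataFrame({
    "variant":        ["control", "test_a", "control", "test_a", "test_a", "control"],
    "segment":        ["web", "web", "mobile", "mobile", "web", "mobile"],
    "task_completed": [1, 1, 0, 0, 1, 1],
})

# Completion rate and sample size per variant within each segment --
# an overall win can hide a regression in one of these cells.
breakdown = (
    df.groupby(["segment", "variant"])["task_completed"]
      .agg(completion_rate="mean", n="count")
      .reset_index()
)
print(breakdown)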

3. Not Pinning Model Versions

Using "gpt-4" instead of "gpt-4-0613" means your experiment can change mid-flight when the provider updates. Always pin versions.

4. Forgetting About Cost

A 10% quality improvement that triples your inference costs may not be worth it. Always track cost as a guardrail metric.
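
A quick back-of-the-envelope check helps here; the per-1K-token prices below are placeholders, not real provider pricing.

# Placeholder per-1K-token prices -- substitute your provider's actual rates.
PRICE_PER_1K = {"control_model": {"input": 0.0005, "output": 0.0015},
                "test_model":    {"input": 0.0050, "output": 0.0150}}

def cost_per_request(model_key, input_tokens, output_tokens):
    p = PRICE_PER_1K[model_key]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

control_cost = cost_per_request("control_model", input_tokens=800, output_tokens=300)
test_cost = cost_per_request("test_model", input_tokens=800, output_tokens=300)

# Tie this back to the guardrail table: agree up front on the maximum acceptable
# cost multiple for a given quality lift, and flag any breach in the readout.
print(f"control ${control_cost:.4f}/req, test ${test_cost:.4f}/req, "
      f"{test_cost / control_cost:.1f}x")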

5. No Qualitative Review

Metrics can miss important quality issues. Always sample and review actual AI outputs during and after the experiment.
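
A small sketch for drawing a balanced review sample, assuming each logged response carries a hypothetical variant field:

import random

def review_sample(responses, per_variant=50, seed=42):
    """Draw an equal-sized random sample of outputs per variant for human review."""
    random.seed(seed)
    by_variant = {}
    for r in responses:
        by_variant.setdefault(r["variant"], []).append(r)
    return {
        variant: random.sample(items, k=min(per_variant, len(items)))
        for variant, items in by_variant.items()
    }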

Pre-Launch Checklist

Planning

  • [ ] Hypothesis documented and reviewed
  • [ ] Sample size calculated
  • [ ] Primary metric defined
  • [ ] Guardrail thresholds set
  • [ ] Kill criteria documented

Technical

  • [ ] Model version pinned
  • [ ] Prompt version controlled
  • [ ] Logging verified
  • [ ] Dashboard created
  • [ ] Alerts configured

Rollout

  • [ ] Traffic allocation configured
  • [ ] Exclusions applied
  • [ ] Ramp schedule documented
  • [ ] Rollback plan tested
  • [ ] On-call schedule set

Stakeholders

  • [ ] Eng review complete
  • [ ] DS review complete
  • [ ] Stakeholders informed
  • [ ] Results readout scheduled
  • [ ] Documentation location set
