AI features require more rigorous experimentation than traditional product changes. Model behavior varies across user segments, inference latency shapes the user experience in ways traditional features rarely do, and statistical significance is harder to reach when outputs are probabilistic. This template helps you plan experiments that account for these complexities.
Why AI Experiments Are Different
Probabilistic Outputs
The same input can produce different outputs, which calls for larger sample sizes and longer test durations.
Latency Sensitivity
AI inference time affects UX. A "better" model that's slower may underperform a faster, simpler one.
Segment Variance
AI performance varies dramatically across user segments, languages, and content types.
Cost Implications
Experiments with AI features directly impact infrastructure costs during the test period.
The Complete Template
Copy and fill out this template for every AI experiment you run:
╔══════════════════════════════════════════════════════════════════╗
║                        AI EXPERIMENT BRIEF                         ║
╠══════════════════════════════════════════════════════════════════╣

EXPERIMENT OVERVIEW
───────────────────────────────────────────────────────────────────
Experiment Name: [Clear, descriptive name]
Experiment ID: [EXP-AI-YYYY-###]
Feature Area: [e.g., Search, Recommendations, Chat]
Owner: [PM Name]
Engineering Lead: [Eng Name]
Data Scientist: [DS Name]
Start Date: [Target start]
Duration: [Minimum runtime in days/weeks]

HYPOTHESIS
───────────────────────────────────────────────────────────────────
We believe that: [Specific change or intervention]
For: [Target user segment]
Will result in: [Expected outcome with direction]
Because: [Rationale based on data, research, or theory]

We will know this is true when:
[Primary success metric] changes by [X%] with [confidence level]

VARIANTS
───────────────────────────────────────────────────────────────────
┌─────────┬────────────────────────────────────────────────────────┐
│ Variant │ Description                                            │
├─────────┼────────────────────────────────────────────────────────┤
│ Control │ Current production behavior                            │
│         │ Model: [current model name/version]                    │
│         │ Prompt: [current prompt version]                       │
│         │ Config: [relevant parameters]                          │
├─────────┼────────────────────────────────────────────────────────┤
│ Test A  │ [Describe the change]                                  │
│         │ Model: [new model name/version]                        │
│         │ Prompt: [new prompt version]                           │
│         │ Config: [changed parameters]                           │
├─────────┼────────────────────────────────────────────────────────┤
│ Test B  │ [If testing multiple variants]                         │
│ (opt)   │ Model: [...]                                           │
└─────────┴────────────────────────────────────────────────────────┘

PRIMARY METRICS
───────────────────────────────────────────────────────────────────
Metric Name: [e.g., Task Completion Rate]
Current Baseline: [X%]
Minimum Detectable Effect (MDE): [Y%]
Expected Direction: [Increase/Decrease]
Measurement Method: [How it's calculated]

SECONDARY METRICS
───────────────────────────────────────────────────────────────────
┌─────────────────────────┬──────────┬─────────────────────────────┐
│ Metric                  │ Baseline │ Expected Impact             │
├─────────────────────────┼──────────┼─────────────────────────────┤
│ [Engagement metric]     │ [X]      │ [Direction + magnitude]     │
│ [Quality metric]        │ [X]      │ [Direction + magnitude]     │
│ [Efficiency metric]     │ [X]      │ [Direction + magnitude]     │
│ [User satisfaction]     │ [X]      │ [Direction + magnitude]     │
└─────────────────────────┴──────────┴─────────────────────────────┘

GUARDRAIL METRICS (Must Not Regress)
───────────────────────────────────────────────────────────────────
┌─────────────────────────┬──────────┬─────────────────────────────┐
│ Metric                  │ Baseline │ Max Acceptable Regression   │
├─────────────────────────┼──────────┼─────────────────────────────┤
│ Error Rate              │ [X%]     │ No more than [+Y%]          │
│ P95 Latency             │ [Xms]    │ No more than [+Yms]         │
│ Cost per Request        │ [$X]     │ No more than [+$Y]          │
│ Safety Incidents        │ [X]      │ No increase                 │
│ [Business metric]       │ [X]      │ No more than [-Y%]          │
└─────────────────────────┴──────────┴─────────────────────────────┘

STATISTICAL REQUIREMENTS
───────────────────────────────────────────────────────────────────
Sample Size per Variant: [Calculated number]
Confidence Level: [Usually 95%]
Statistical Power: [Usually 80%]
Test Type: [One-tail / Two-tail]
Multiple Comparison Correction: [Bonferroni / None]
Minimum Runtime: [X days, accounting for weekly cycles]

Sample Size Calculation:
  Baseline conversion: [X%]
  MDE: [Y%]
  Traffic available: [Z users/day]
  Estimated duration: [Calculated days]

TRAFFIC ALLOCATION
───────────────────────────────────────────────────────────────────
Total Eligible Traffic: [X% of all users]

Allocation:
  Control: [X%]
  Test A: [Y%]
  Test B: [Z%] (if applicable)

Targeting Criteria:
- [Geographic restrictions]
- [Platform restrictions]
- [User segment restrictions]
- [Feature flag dependencies]

Exclusions:
- [Users in other experiments]
- [New users < X days]
- [Enterprise accounts]
- [Other exclusions]

AI-SPECIFIC CONSIDERATIONS
───────────────────────────────────────────────────────────────────
Model Details:
  Provider: [OpenAI / Anthropic / Internal / etc.]
  Model Version: [Specific version, not "latest"]
  Temperature: [X.X]
  Max Tokens: [X]
  Other Parameters: [List any relevant params]

Prompt Version Control:
  Control Prompt ID: [prompt-v1.2.3]
  Test Prompt ID: [prompt-v1.3.0]
  Prompt Diff: [Link to diff or summary of changes]

Caching Strategy:
  [ ] Caching disabled for experiment
  [ ] Cache key includes variant
  [ ] Semantic caching considerations

Cost Projections:
  Control cost/request: [$X.XXXX]
  Test cost/request: [$X.XXXX]
  Daily experiment cost: [$X]
  Total experiment cost: [$X]

ROLLOUT PLAN
───────────────────────────────────────────────────────────────────
Phase 1 - Smoke Test (Day 1-2):
  Traffic: 1%
  Success Criteria: No errors, latency within bounds
  Kill Criteria: Error rate > [X%], P99 > [Xms]

Phase 2 - Initial Ramp (Day 3-7):
  Traffic: 10%
  Success Criteria: Guardrails holding
  Kill Criteria: Any guardrail breach

Phase 3 - Full Experiment (Day 8+):
  Traffic: [Target %]
  Success Criteria: Statistical significance
  Kill Criteria: Sustained guardrail breach

KILL SWITCH CRITERIA
───────────────────────────────────────────────────────────────────
Automatically stop experiment if:
  [ ] Error rate exceeds [X%] for [Y minutes]
  [ ] P95 latency exceeds [Xms] for [Y minutes]
  [ ] Safety incident detected
  [ ] Cost exceeds [X%] of budget
  [ ] [Custom business metric] breaches threshold

Manual review triggered if:
  [ ] Negative user feedback spike
  [ ] Anomaly in segment performance
  [ ] Unexpected model behavior reported

ANALYSIS PLAN
───────────────────────────────────────────────────────────────────
Primary Analysis:
  Date: [X days after full ramp]
  Method: [Frequentist / Bayesian]
  Analyst: [Name]

Segment Analysis:
- By user tenure (new vs returning)
- By platform (web vs mobile)
- By geography
- By usage intensity
- By content type / category

Qualitative Analysis:
- Sample review of AI outputs (N=[X])
- User feedback themes
- Edge case identification

DECISION FRAMEWORK
───────────────────────────────────────────────────────────────────
Ship if:
  [ ] Primary metric improves by ≥[MDE] with ≥95% confidence
  [ ] No guardrail breaches
  [ ] Segment analysis shows no significant negative segments
  [ ] Cost impact acceptable

Iterate if:
  [ ] Directionally positive but not significant
  [ ] Mixed segment results
  [ ] Unexpected secondary metric movements

Kill if:
  [ ] Primary metric negative or flat
  [ ] Guardrail breaches
  [ ] Cost impact unacceptable
  [ ] Qualitative concerns

STAKEHOLDER COMMUNICATION
───────────────────────────────────────────────────────────────────
Pre-Launch:
  [ ] Engineering review complete
  [ ] Data science review complete
  [ ] Legal/compliance review (if needed)
  [ ] Stakeholder alignment meeting

During Experiment:
  [ ] Daily monitoring owner: [Name]
  [ ] Weekly update to: [Stakeholders]
  [ ] Escalation path: [Process]

Post-Experiment:
  [ ] Results readout: [Date]
  [ ] Attendees: [List]
  [ ] Documentation: [Where results will be stored]

APPENDIX
───────────────────────────────────────────────────────────────────
Related Experiments: [Links to related past experiments]
Technical Spec: [Link to eng spec]
Prompt Documentation: [Link to prompt docs]
Dashboard Link: [Link to experiment dashboard]
Data Pipeline: [Link to relevant data documentation]

╚══════════════════════════════════════════════════════════════════╝
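The Statistical Requirements section asks for a sample size and minimum runtime, and these can be estimated before launch. Below is a minimal sketch for a proportion metric such as task completion rate, using the standard two-proportion power approximation; the baseline, MDE, and traffic figures are placeholders, not values from this template.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per variant for a two-tailed two-proportion test.

    baseline: control rate, e.g. 0.30 for 30%
    mde_rel:  relative minimum detectable effect, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Placeholder inputs -- replace with your own baseline, MDE, and traffic.
n_per_variant = sample_size_per_variant(baseline=0.30, mde_rel=0.10)
daily_traffic_per_variant = 5_000
days_for_sample = ceil(n_per_variant / daily_traffic_per_variant)

# Run at least two full weekly cycles even if the sample fills up sooner.
min_runtime_days = max(days_for_sample, 14)
print(n_per_variant, min_runtime_days)
```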
Writing Strong AI Hypotheses
A good hypothesis is specific, measurable, and falsifiable. Here's the framework:
The 4-Part Hypothesis Structure
[Specific, implementable change] - Be precise about what you're changing
[Target segment] - Who will be affected? Be specific about the population
[Measurable outcome] - Include direction and magnitude if possible
[Rationale] - Why do you expect this? What's the mechanism?
Example Hypotheses
Weak Hypothesis
"Using GPT-4 instead of GPT-3.5 will improve our AI assistant."
Problem: No specific metric, no target segment, no expected magnitude, no rationale.
Strong Hypothesis
"We believe that upgrading from GPT-3.5 to GPT-4 for complex multi-step queries (3+ follow-ups) for power users (10+ sessions/week) will increase task completion rate by 15% because GPT-4's improved context retention reduces the need for users to repeat information."
Strength: Specific change, defined segment, measurable outcome, clear mechanism.
AI Experiment Metrics Guide
Choose at least one metric from each category to get a complete picture; a sketch of computing a few of them from raw request logs follows the lists below:
Quality Metrics
- Task completion rate
- Answer accuracy (human eval)
- Relevance score
- Hallucination rate
- User satisfaction (thumbs up/down)
Engagement Metrics
- Feature adoption rate
- Sessions per user
- Queries per session
- Return usage (D1, D7, D30)
- Time spent with feature
Efficiency Metrics
- Time to completion
- Interactions to success
- Edit/regeneration rate
- Fallback to manual rate
- Support ticket reduction
Guardrail Metrics
- Error rate
- P95/P99 latency
- Cost per request
- Safety/toxicity flags
- Timeout rate
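As referenced above, here is a minimal sketch of computing a few guardrail metrics from raw request logs. The record fields (latency_ms, error, cost_usd) are assumed names, not a fixed schema.

```python
from statistics import quantiles

# Hypothetical per-request log records for one variant.
requests = [
    {"latency_ms": 820,  "error": False, "cost_usd": 0.0042},
    {"latency_ms": 1310, "error": False, "cost_usd": 0.0051},
    {"latency_ms": 2400, "error": True,  "cost_usd": 0.0007},
    # ... one record per AI request served to this variant
]

error_rate = sum(r["error"] for r in requests) / len(requests)
p95_latency = quantiles([r["latency_ms"] for r in requests], n=20)[-1]  # 95th percentile
cost_per_request = sum(r["cost_usd"] for r in requests) / len(requests)

print(f"error rate {error_rate:.2%}, p95 {p95_latency:.0f} ms, ${cost_per_request:.4f}/request")
```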
Common AI Experiment Mistakes
1. Ending Too Early
AI outputs have high variance. What looks significant on day 3 often regresses to the mean by day 14. Run for at least two full weekly cycles.
2. Ignoring Segment Differences
Overall positive results can hide that you're hurting a key segment. Always break down by user type, content type, and platform.
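A minimal sketch of that breakdown with pandas, assuming per-user results are logged with a variant label and a segment column (the column names here are assumptions about your logging):

```python
import pandas as pd

# Hypothetical per-user experiment results.
df = pd.DataFrame({
    "variant":   ["control", "test", "control", "test", "control", "test"],
    "platform":  ["web", "web", "mobile", "mobile", "web", "mobile"],
    "completed": [1, 1, 0, 0, 1, 1],
})

# Task completion rate per segment and variant, then the per-segment lift.
rates = df.groupby(["platform", "variant"])["completed"].mean().unstack("variant")
rates["lift"] = rates["test"] - rates["control"]
print(rates)  # an overall win can still hide a negative "lift" in one row
```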
3. Not Pinning Model Versions
Using "gpt-4" instead of "gpt-4-0613" means your experiment can change mid-flight when the provider updates the underlying model. Always pin versions.
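One way to enforce this is to keep the full variant configuration, including the pinned model snapshot and prompt version, in version-controlled config. A sketch, with illustrative snapshot names and the prompt IDs from the template above:

```python
# Pin the exact model snapshot and prompt version so the experiment
# cannot change underneath you mid-flight. Snapshot names are illustrative.
VARIANTS = {
    "control": {
        "model": "gpt-3.5-turbo-0613",   # pinned snapshot, not "gpt-3.5-turbo"
        "prompt_id": "prompt-v1.2.3",
        "temperature": 0.2,
        "max_tokens": 512,
    },
    "test_a": {
        "model": "gpt-4-0613",           # pinned snapshot, not "gpt-4"
        "prompt_id": "prompt-v1.3.0",
        "temperature": 0.2,
        "max_tokens": 512,
    },
}
```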
4. Forgetting About Cost
A 10% quality improvement that triples your inference costs may not be worth it. Always track cost as a guardrail metric.
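A rough back-of-the-envelope comparison helps before launch. The sketch below estimates per-request cost from token counts; all prices, token counts, and traffic figures are placeholders, not real provider rates.

```python
# Rough cost-per-request comparison from token counts (all numbers are placeholders).
def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    return input_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k

control_cost = request_cost(800, 200, price_in_per_1k=0.0005, price_out_per_1k=0.0015)
test_cost = request_cost(800, 200, price_in_per_1k=0.01, price_out_per_1k=0.03)

cost_multiplier = test_cost / control_cost
daily_requests = 50_000                                   # placeholder traffic
added_daily_cost = (test_cost - control_cost) * daily_requests
print(f"{cost_multiplier:.1f}x cost per request, +${added_daily_cost:,.0f}/day")
```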
5. No Qualitative Review
Metrics can miss important quality issues. Always sample and review actual AI outputs during and after the experiment.
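A minimal sketch of pulling a reproducible random sample of logged outputs for human review; the record shape is hypothetical.

```python
import random

# Hypothetical logged outputs -- in practice, pull these from your experiment logs.
logged_outputs = [
    {"variant": "test_a", "query": "…", "response": "…"},
    # ... one record per AI response
]

random.seed(42)  # fixed seed so the review set is reproducible
review_sample = random.sample(logged_outputs, k=min(100, len(logged_outputs)))
# Route the sample to a review queue or spreadsheet with a simple grading rubric.
```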
Pre-Launch Checklist
Planning
- [ ] Hypothesis documented and reviewed
- [ ] Sample size calculated
- [ ] Primary metric defined
- [ ] Guardrail thresholds set
- [ ] Kill criteria documented
Technical
- [ ] Model version pinned
- [ ] Prompt version controlled
- [ ] Logging verified
- [ ] Dashboard created
- [ ] Alerts configured
Rollout
- [ ] Traffic allocation configured
- [ ] Exclusions applied
- [ ] Ramp schedule documented
- [ ] Rollback plan tested
- [ ] On-call schedule set
Stakeholders
- [ ] Eng review complete
- [ ] DS review complete
- [ ] Stakeholders informed
- [ ] Results readout scheduled
- [ ] Documentation location set
Related Resources
AI Feature PRD Template
Document your AI feature requirements before running experiments.
AI Model Evaluation Template
Systematically compare models before choosing what to experiment with.
AI Product Metrics That Matter
Deep dive into choosing the right metrics for your AI products.
Stakeholder Communication for AI
Learn how to communicate experiment results to different audiences.