AI PRODUCT MANAGEMENT

A/B Testing AI Features: How to Run Experiments on Non-Deterministic Systems

By Institute of AI PM · 14 min read · Apr 17, 2026

TL;DR

A/B testing AI features requires a different setup than testing a button color. The non-determinism of LLMs, the delayed feedback loops of AI outcomes, the sample size implications of complex tasks, and the challenge of attribution all create traps that standard experimentation frameworks miss. This guide covers how to design AI experiments that produce valid conclusions — and how to avoid the most common mistakes that cause AI teams to ship regressions they thought were improvements.

Why A/B Testing AI Features Is Different

Non-determinism

The same user seeing the same AI feature twice will get different outputs. This means the 'experience' of control and treatment groups is not stable across the experiment. Variance in AI output quality inflates noise in your results.

Delayed outcome signals

The downstream effect of an AI feature may not manifest for days or weeks. A recommendation made today affects the user's decision next week. Standard experiment durations designed for immediate click-through fail to capture this.

Attribution across sessions

AI features that create async value (a draft written today that's sent tomorrow) are hard to attribute to a specific experiment exposure. Standard session-based attribution misses the causal chain.

Novelty effect contamination

Users engage more with any new feature in the first 2 weeks regardless of quality. Short experiments on AI features confuse novelty engagement with genuine value. Plan for a minimum 4-week duration for most AI experiments.

Model drift during experiment

If the underlying model gets updated (or the prompt changes) during your experiment, your control and treatment groups are no longer testing what you thought. Lock model versions before starting experiments.

Setting Up a Valid AI Experiment

1. Define one primary metric before starting

Pick one primary metric the experiment is powered for. Secondary metrics can inform but don't drive the decision. Teams that look at 15 metrics post-hoc and pick the one that moved are doing p-hacking.

2. User-level randomization, not session-level

Randomize at the user level so each user consistently receives control or treatment. Session-level randomization means the same user can be in both groups, which introduces carryover contamination.
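
A minimal sketch of what user-level assignment can look like in practice, assuming a hashed user ID as the randomization unit (the experiment name and split below are placeholders):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing user_id together with the experiment name means the same user
    always lands in the same group for this experiment, while different
    experiments get independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# The same user gets the same assignment on every session.
assert assign_variant("user-123", "ai-draft-v2") == assign_variant("user-123", "ai-draft-v2")
```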

3. Lock the model and prompt

Freeze the model version and prompt configuration for both groups at the start of the experiment. Any changes mid-experiment invalidate the comparison.
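
One way to enforce this is to pin everything in a version-controlled config that the serving path reads, instead of resolving a "latest" alias at request time. The model snapshot, prompt versions, and dates below are illustrative:

```python
# Frozen experiment configuration, checked into version control before launch.
# Field names and values are illustrative; the point is that nothing here
# resolves to "latest" or changes mid-experiment.
EXPERIMENT_CONFIG = {
    "experiment": "ai-draft-v2",
    "control": {
        "model": "gpt-4o-2024-08-06",        # pinned model snapshot, not a floating alias
        "prompt_version": "draft_prompt_v3",
        "temperature": 0.2,
    },
    "treatment": {
        "model": "gpt-4o-2024-08-06",
        "prompt_version": "draft_prompt_v4",  # the only thing that differs between arms
        "temperature": 0.2,
    },
    "start_date": "2026-04-20",
    "end_date": "2026-05-18",                 # 4 weeks
}
```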

4. Pre-register your hypothesis

Write down your hypothesis, primary metric, sample size, and duration before starting. This prevents post-hoc rationalization and selective reporting of favorable results.
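
Pre-registration doesn't require special tooling; even a checked-in record like the sketch below (all values illustrative) makes selective reporting much harder:

```python
# Pre-registration record written and committed before the experiment starts.
# Values are illustrative; the point is that hypothesis, metric, sample size,
# and duration are fixed up front rather than chosen after seeing the data.
PREREGISTRATION = {
    "hypothesis": "AI-drafted replies increase task completion rate by >= 2pp",
    "primary_metric": "task_completion_rate",
    "guardrails": ["core_retention", "support_tickets_per_1k_users"],
    "minimum_detectable_effect": 0.02,
    "alpha": 0.05,
    "power": 0.80,
    "required_users_per_arm": 19_000,   # output of the power calculation
    "duration_weeks": 4,
    "registered_on": "2026-04-17",
}
```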

5. Run a pre-experiment A/A test

Before running your experiment, run a test where both groups get the control experience. If your A/A test shows a 'significant' result, your randomization or measurement is broken.
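
A minimal A/A sanity check, assuming a conversion-style metric and illustrative counts, using scipy's chi-square test of independence:

```python
from scipy.stats import chi2_contingency

# Both "arms" received the control experience; any significant difference
# points at broken randomization or logging, not a real effect.
# Counts are illustrative.
aa_results = {
    "arm_a": {"converted": 412, "exposed": 10_050},
    "arm_b": {"converted": 398, "exposed": 9_978},
}

table = [
    [aa_results["arm_a"]["converted"],
     aa_results["arm_a"]["exposed"] - aa_results["arm_a"]["converted"]],
    [aa_results["arm_b"]["converted"],
     aa_results["arm_b"]["exposed"] - aa_results["arm_b"]["converted"]],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"A/A p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Warning: A/A test looks 'significant' -- check randomization and measurement.")
```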

Choosing the Right Metrics for AI A/B Tests

Primary metrics (power your experiment for these)

  • Task completion rate
  • Downstream business outcome (conversion, retention, revenue per user)
  • Time-to-task-completion

These must be directly attributable to the AI feature and measurable within your experiment window.

Guardrail metrics (stop the experiment if these degrade)

  • Overall product engagement rate
  • Core user retention
  • Support ticket volume related to the feature

Even if your primary metric improves, a guardrail metric regression is a red flag. Don't ship without investigating.
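
One simple way to operationalize this is an automated guardrail check that flags any metric degrading beyond a pre-agreed tolerance. The sketch below assumes metrics where lower is worse; metric names and thresholds are illustrative:

```python
def check_guardrails(control: dict, treatment: dict, tolerances: dict) -> list:
    """Return the guardrail metrics that degraded beyond their tolerance.

    `control` / `treatment` map metric name -> observed value; `tolerances`
    maps metric name -> maximum acceptable relative drop (0.02 = 2%).
    Metrics where higher is worse (e.g. support tickets) would need the
    opposite comparison.
    """
    breaches = []
    for metric, tol in tolerances.items():
        relative_change = (treatment[metric] - control[metric]) / control[metric]
        if relative_change < -tol:
            breaches.append((metric, round(relative_change, 4)))
    return breaches

breaches = check_guardrails(
    control={"engagement_rate": 0.41, "d28_retention": 0.63},
    treatment={"engagement_rate": 0.40, "d28_retention": 0.60},
    tolerances={"engagement_rate": 0.02, "d28_retention": 0.02},
)
if breaches:
    print("Guardrail regression -- investigate before shipping:", breaches)
```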

Quality proxy metrics (diagnostic but not primary)

  • Act-on rate (how often users accept or use the AI output)
  • Edit rate (how heavily users modify the output before using it)
  • Explicit feedback ratings
  • AI output quality scores from LLM-as-judge

These diagnose why an experiment succeeded or failed, but shouldn't be your primary decision metric — they're too gameable.

The Sample Size and Duration Problem

AI tasks have high variance

LLM output quality varies significantly across queries. High output variance means you need more samples than you would for a deterministic feature change to achieve the same statistical power.
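
The effect shows up directly in the standard normal-approximation sample size formula, n ≈ 2 * (z_alpha/2 + z_power)^2 * sigma^2 / delta^2: tripling the outcome standard deviation multiplies the required sample size by roughly nine. The numbers below are illustrative:

```python
from scipy.stats import norm

def required_n_per_arm(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm for a two-sample comparison of means.

    Standard normal-approximation formula:
        n ~= 2 * (z_{1 - alpha/2} + z_power)^2 * sigma^2 / mde^2
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2) + 1

# Same minimum detectable effect, different outcome variance (illustrative values):
print(required_n_per_arm(sigma=0.10, mde=0.02))  # low-variance deterministic change: ~400 per arm
print(required_n_per_arm(sigma=0.30, mde=0.02))  # noisy AI-quality-driven outcome: ~3,500 per arm
```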

Minimum viable duration: 4 weeks

AI features need at least 4 weeks to separate novelty effect from genuine value signals. Most product experiments are 2 weeks — this is insufficient for AI feature evaluation.

Segment before power calculation

Power your experiment on the user segment that actually uses the feature, not your entire user base. Diluting with non-exposed users massively understates effect sizes and overstates required sample sizes.
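
A quick illustration of the dilution arithmetic, with assumed numbers:

```python
# Illustrative dilution arithmetic: only 20% of users ever touch the AI feature.
exposure_rate = 0.20
effect_among_exposed = 0.05      # +5pp task completion among users who actually use it

# Measured across the whole user base, the same effect looks 5x smaller...
diluted_effect = effect_among_exposed * exposure_rate   # 0.01, i.e. +1pp

# ...and detecting it requires roughly (1 / exposure_rate)^2 = 25x the sample size,
# since required n scales with 1 / effect^2.
print(diluted_effect)
```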

Sequential testing for AI experiments

For experiments with delayed outcomes, sequential testing (continuous monitoring with appropriate stopping rules) is more practical than fixed-horizon testing. Use Bayesian methods if your platform supports them.
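
A minimal Beta-Binomial sketch of that kind of monitoring for a rate metric, assuming flat priors and illustrative counts:

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_treatment_beats_control(conv_c: int, n_c: int, conv_t: int, n_t: int,
                                 samples: int = 100_000) -> float:
    """Monte Carlo estimate of P(treatment rate > control rate) under Beta(1, 1) priors."""
    control_posterior = rng.beta(1 + conv_c, 1 + n_c - conv_c, samples)
    treatment_posterior = rng.beta(1 + conv_t, 1 + n_t - conv_t, samples)
    return float((treatment_posterior > control_posterior).mean())

# Re-checked as data accumulates; ship only if the probability clears a pre-agreed
# threshold (e.g. 0.95) AND the minimum 4-week duration has elapsed.
p = prob_treatment_beats_control(conv_c=812, n_c=20_000, conv_t=905, n_t=20_100)
print(f"P(treatment > control) = {p:.3f}")
```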

Interpreting Results When AI Changes Behavior Over Time

1. Your experiment showed a significant positive result but production results are flat

Novelty effect. Your experiment window captured the initial excitement spike but didn't track long enough to see the post-novelty baseline. Run a holdback analysis 30 days post-launch.

2. The AI version performed better in experiment but worse after full rollout

Sampling bias in your experiment. The users who first encountered the feature (or the specific user segment you tested) may behave differently from the full user population.

3. Results are inconsistent between weeks within the same experiment

Check for model drift, prompt changes, or external events. Week-over-week inconsistency that's larger than your confidence interval is a signal that something changed mid-experiment.

4. The experiment was significant but the effect size is tiny

Statistical significance ≠ practical significance. Calculate the business impact of the effect size at your actual user volume. A 0.2% improvement in task completion for 100 users is not a meaningful result.
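
A back-of-envelope check, using the numbers above plus an assumed task volume:

```python
# Practical-significance check (task volume is an assumed, illustrative number).
monthly_active_users = 100          # users exposed to the feature per month
tasks_per_user_per_month = 8
lift_in_completion_rate = 0.002     # the "significant" +0.2% result

extra_completed_tasks = monthly_active_users * tasks_per_user_per_month * lift_in_completion_rate
print(extra_completed_tasks)        # ~1.6 extra completed tasks per month: not worth the added complexity
```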

Make Data-Driven AI Product Decisions in the Masterclass

Experimentation, evaluation, and AI product measurement are core curriculum — taught live by a Salesforce Sr. Director PM.