AI PRODUCT MANAGEMENT

Writing Acceptance Criteria for AI Features: From Vague Ideas to Testable Requirements

By Institute of AI PM · 12 min read · May 4, 2026

TL;DR

Acceptance criteria for AI features fail when written as binary yes-or-no checks because AI outputs vary across runs, vary across users, and degrade silently. Strong AI acceptance criteria combine four parts: a behavior description that admits variability, an evaluation threshold measured on a held-out test set, a safety constraint that defines unacceptable outputs, and a human review hook for cases the model cannot resolve. This guide presents the four-part format, anti-patterns to avoid, and sample acceptance criteria for a summarization feature, a classification feature, and an agentic workflow feature that you can adapt to your own backlog this week.

Why Traditional Acceptance Criteria Break for AI

A typical user-story acceptance criterion reads, "Given a logged-in user, when they click submit, then the form saves and a confirmation appears." This is testable because the behavior is deterministic: the same input always produces the same output. AI features do not work this way, and the gap creates four predictable failure modes when teams try to write traditional criteria.

1. Yes-or-no checks pass on a single happy path and miss the failure tail

A criterion like "the assistant returns a relevant answer when asked about pricing" can pass on the one example the QA engineer tries and fail on the next 47 phrasings of the same question. Traditional acceptance testing assumes a small number of inputs covers the input space; AI inputs are open-ended. Single-example checks give false confidence and ship features that look fine in a demo and break in production.

Tradeoff: Replacing single-example checks with evaluation-set checks adds setup cost (you need to build and maintain a test set) but produces a defensible quality signal. Teams that skip the evaluation set inevitably ship regressions because their only quality signal is the demo.

2. Vague qualitative criteria are not testable and produce arguments at sign-off

Criteria like "the response is high quality" or "the assistant sounds professional" mean different things to different reviewers. Engineering ships what it thinks meets the bar, the PM sees something different, and sign-off becomes a debate. This is even worse for AI because reviewers have different tolerances for variability and different intuitions about what the model can do.

Tradeoff: Quantifying qualitative criteria requires the PM to define rubrics, which takes hours. The shortcut of leaving criteria qualitative costs more in the long run because every release reopens the same argument. Build the rubric once and reuse it.

3. Safety and policy criteria get bolted on at the end

Teams write functional acceptance criteria first and only remember to add safety criteria after a near miss in staging. By that point, the design choices that determine safety properties are already locked in. Acceptance criteria written this way produce features that are functional but unsafe, requiring rework before launch or, worse, after launch.

Tradeoff: Writing safety criteria up front slows the ticket-creation step but eliminates the late-stage scramble. Use a safety checklist (refusal scenarios, privacy boundaries, regulated-content rules) on every AI feature ticket as a forcing function.

4. Criteria do not specify what to do when the model is wrong

Traditional criteria assume the system either succeeds or returns an error. AI systems often produce a confident, plausible, wrong answer. Criteria that do not specify how to handle this case (escalation paths, confidence thresholds, fallback behavior, user disclosures) leave engineering to guess, and different engineers guess differently. The result is inconsistent UX across features and a product that feels uncoordinated to users.

Tradeoff: Specifying fallback behavior for every criterion lengthens tickets by 30 to 50 percent. Skipping it is faster up front but creates UX debt that takes more time to clean up than the extra criteria would have taken to write.

The Four-Part AI Acceptance Criteria Format

Replace single-sentence criteria with a four-part structure. Every AI feature criterion includes all four parts. The structure forces the PM to think about variability, measurement, safety, and human override before sending the ticket to engineering.
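Before walking through each part, here is a minimal, hypothetical sketch of a four-part criterion captured as a single data object, for example in a backlog tool or an evaluation harness. The class name, field names, and example values are illustrative assumptions, not a standard schema.

from dataclasses import dataclass, field

@dataclass
class AIAcceptanceCriterion:
    behavior: str                 # Part 1: behavior plus allowed variability
    evaluation_set: str           # Part 2: the named test set
    thresholds: dict              # Part 2: measurable thresholds on that set
    safety_constraints: list = field(default_factory=list)  # Part 3: refusal cases
    human_review_hook: str = ""   # Part 4: trigger condition and visible UX state

# Example values mirroring the summarization feature discussed below.
summary_criterion = AIAcceptanceCriterion(
    behavior=("80-200 word summary capturing all decisions and at least 80% "
              "of action items; phrasing and ordering may vary across runs"),
    evaluation_set="Q2 meeting summary evaluation set (250 transcripts)",
    thresholds={"graded_quality_min": 4.0, "per_case_min": 2.5,
                "hallucination_rate_max": 0.03},
    safety_constraints=["HR complaints", "legal matters",
                        "content marked confidential by the source system"],
    human_review_hook="Flag for human review if action-item confidence < 0.7",
)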

Part 1: Behavior description with allowed variability

Describe the behavior in plain language and explicitly state where variation is acceptable. Example: "When a user asks for a summary of a meeting transcript, the assistant produces a summary between 80 and 200 words that captures all decisions and at least 80 percent of action items. Length, phrasing, and ordering may vary across runs as long as the content requirements are met." The explicit variability statement prevents reviewers from rejecting outputs that differ from a reference example but still meet the requirement.

Tradeoff: Allowing variability by default makes acceptance review more subjective. Counter this by attaching example outputs that are clearly inside the band and clearly outside, so reviewers calibrate their judgment. Avoid attaching one golden output, which trains reviewers to reject anything that does not match it.
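One way to make the variability statement checkable is to test the content requirements (length band, decision coverage, action-item coverage) rather than compare against a golden output. The sketch below is a hedged illustration of that idea; the substring matching is a deliberately naive stand-in for whatever semantic-coverage check your team actually uses.

# Hypothetical behavior check for the summary example above. It enforces the
# content requirements and ignores phrasing, so outputs that vary across runs
# can still pass.
def meets_behavior_spec(summary: str,
                        expected_decisions: list[str],
                        expected_action_items: list[str]) -> bool:
    word_count = len(summary.split())
    if not 80 <= word_count <= 200:
        return False
    text = summary.lower()
    decisions_covered = all(d.lower() in text for d in expected_decisions)
    if expected_action_items:
        found = sum(1 for item in expected_action_items if item.lower() in text)
        item_coverage = found / len(expected_action_items)
    else:
        item_coverage = 1.0
    return decisions_covered and item_coverage >= 0.8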

Part 2: Evaluation threshold on a named test set

Specify a measurable threshold on a test set the team owns. Example: "On the Q2 meeting summary evaluation set (250 transcripts), the feature must achieve a graded quality score of at least 4.0 out of 5, with no individual case below 2.5 and a hallucination rate below 3 percent." Name the evaluation set explicitly so that engineering and the PM agree on what was measured. If the test set does not exist yet, building it is part of the ticket.

Tradeoff: Quantifying the threshold requires investment in the evaluation set. The first criterion you write this way takes a week; the tenth takes an hour because the evaluation set already exists. Teams that skip this step end up shipping features they cannot defend in a post-launch quality review.
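Once per-case grades exist, enforcing the Part 2 thresholds can be automated. This is a hedged sketch only; how the grades are produced (human rubric or model-graded) is outside its scope, and the default values mirror the example criterion above.

# Hypothetical check against the Part 2 thresholds: mean graded quality,
# per-case floor, and hallucination rate, computed over the named eval set.
def passes_evaluation_threshold(case_scores: list[float],
                                hallucination_flags: list[bool],
                                mean_min: float = 4.0,
                                case_min: float = 2.5,
                                hallucination_max: float = 0.03) -> bool:
    mean_score = sum(case_scores) / len(case_scores)
    hallucination_rate = sum(hallucination_flags) / len(hallucination_flags)
    return (mean_score >= mean_min
            and min(case_scores) >= case_min
            and hallucination_rate <= hallucination_max)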

Part 3: Safety constraint with explicit refusal cases

List the categories of input that the feature must refuse, redirect, or escalate. Example: "The feature must refuse to summarize content involving HR complaints, legal matters, or content marked confidential by the source system, and must not include any personally identifying information about people not in the meeting." Tie each refusal case to a specific output (a polite decline message, a redirect, or an escalation queue). Safety criteria are not optional and must be tested as part of the evaluation set.

Tradeoff: Listing refusal cases up front takes effort and may surface gaps in your safety taxonomy that the team has not yet decided how to handle. This is the point. Surfacing these gaps in the ticket is far cheaper than discovering them in production.
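One hedged way to express "tie each refusal case to a specific output" is a simple category-to-action mapping the feature consults before generating anything. The category names, actions, and the detection step below are illustrative assumptions.

# Hypothetical mapping from refusal category to the concrete output the
# criterion requires. Category detection itself is a placeholder.
REFUSAL_ACTIONS = {
    "hr_complaint": "polite_decline_message",
    "legal_matter": "escalation_queue",
    "confidential_source": "polite_decline_message",
}

def refusal_action(detected_categories: set[str]) -> str | None:
    """Return the required refusal action, or None if no refusal case applies."""
    for category, action in REFUSAL_ACTIONS.items():
        if category in detected_categories:
            return action
    return None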

Part 4: Human review hook and confidence threshold

Specify when the model output must be reviewed by a human or escalated to a different system. Example: "If the assistant cannot identify decisions or action items with a confidence score above 0.7, the summary must be flagged for human review and presented with a banner that explains the limitation." Specify the confidence threshold, the trigger condition, and the visible UX state. Without this hook, the model becomes the only line of defense, which is unacceptable for any feature with material consequences.

Tradeoff: Human review hooks add latency (a few seconds for in-product flagging, longer for queues) and require operations capacity. The alternative, no hook at all, transfers all risk to the user and produces incidents that are far more expensive than the operations cost.
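A minimal sketch of the hook as a routing decision, assuming the model exposes a confidence score; the threshold, field names, and banner text are illustrative.

# Hypothetical gate implementing Part 4: low-confidence outputs are flagged
# for review and shipped with a visible banner instead of silently published.
CONFIDENCE_THRESHOLD = 0.7

def route_summary(summary: str, action_item_confidence: float) -> dict:
    if action_item_confidence >= CONFIDENCE_THRESHOLD:
        return {"summary": summary, "needs_human_review": False, "banner": None}
    return {
        "summary": summary,
        "needs_human_review": True,
        "banner": ("Action items could not be identified with high confidence; "
                   "a reviewer will confirm this summary."),
    }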

Sample Criteria for Three Common AI Feature Types

Concrete examples make the format easier to apply. Below are sample acceptance criteria for three common AI feature patterns. Each example uses the four-part structure and includes the kind of specifics PMs should expect to write themselves.

Summarization feature criteria

Behavior: The assistant produces a 100 to 200 word summary of any input transcript, capturing all decisions and at least 80 percent of action items, with attribution of each action item to a named owner where possible.
Threshold: On the Q2 evaluation set (250 transcripts), graded quality of at least 4.0 out of 5 and a hallucination rate below 3 percent.
Safety: Refuse if the transcript is marked confidential or contains HR or legal categories; never include personally identifying information about non-participants.
Human hook: Flag for human review if action-item confidence is below 0.7 or the transcript has more than 10 distinct speakers.

Classification feature criteria

Behavior: The model classifies incoming support tickets into one of 12 predefined categories with a confidence score for each ticket.
Threshold: On the support evaluation set (5,000 tickets), a macro F1 score of at least 0.82, no single category below 0.65, and calibration error below 0.05.
Safety: Never auto-close a ticket based on classification; never route a ticket containing keywords on the escalation list (security incident, breach, regulator) without human review.
Human hook: Route to a human queue if confidence is below 0.6 or if the predicted category is one of the three highest-impact categories.
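For the classification thresholds, the following is a hedged sketch of the acceptance check using scikit-learn for the F1 scores and a hand-rolled expected-calibration-error estimate; the threshold values mirror the criterion above, and the bin count is an assumption.

# Hypothetical acceptance check for the classification criteria above.
import numpy as np
from sklearn.metrics import f1_score

def passes_classification_criteria(y_true, y_pred, confidences, n_bins=10):
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    per_class_f1 = f1_score(y_true, y_pred, average=None)  # one score per category
    # Expected calibration error: bin-proportion-weighted gap between average
    # stated confidence and observed accuracy in each equal-width bin.
    conf = np.asarray(confidences, dtype=float)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return macro_f1 >= 0.82 and per_class_f1.min() >= 0.65 and ece <= 0.05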

Agentic workflow feature criteria

Behavior: The agent completes a multi-step procurement request involving up to 6 tool calls (catalog search, price quote, approval routing, PO creation, vendor email, calendar booking).
Threshold: On the procurement evaluation set (200 scripted scenarios), an end-to-end success rate above 75 percent, with no scenario producing an unrecoverable state.
Safety: Never spend more than 5,000 dollars without human approval; never email outside a verified vendor list; log every tool call with its rationale.
Human hook: Pause for human approval before any irreversible step (purchase, external email, calendar booking) and after any tool call that returned an error.
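For the agentic criteria, a hedged sketch of the safety constraints expressed as a guard evaluated before every tool call; the tool names, argument keys, and approval mechanism are assumptions, and only the spend limit is taken from the example above.

# Hypothetical pre-execution guard for the agentic safety constraints above.
IRREVERSIBLE_TOOLS = {"create_purchase_order", "send_vendor_email", "book_calendar"}
SPEND_LIMIT_USD = 5000

def check_tool_call(tool: str, args: dict, verified_vendor_domains: set[str]) -> str:
    """Return 'allow', 'require_human_approval', or 'block' for a proposed call."""
    if tool == "send_vendor_email" and args.get("recipient_domain") not in verified_vendor_domains:
        return "block"  # never email outside the verified vendor list
    if args.get("amount_usd", 0) > SPEND_LIMIT_USD:
        return "require_human_approval"  # spend above the limit needs sign-off
    if tool in IRREVERSIBLE_TOOLS:
        return "require_human_approval"  # pause before any irreversible step
    return "allow"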

What every example has in common

All three examples specify a numerical threshold on a named test set, list at least three safety constraints with concrete categories, and define an explicit point at which a human takes over. Notice that no example uses words like "high quality" or "appropriate" without quantifying them. If you cannot quantify a criterion, it does not belong in acceptance criteria; it belongs in a design discussion.

A common pitfall: criteria that depend on a model that does not exist yet

When PMs write criteria assuming a future model release will improve a metric (for example, "when the next model lands, the hallucination rate will drop below 1 percent"), they are setting acceptance against an uncontrolled variable. Write criteria against the model the team has access to today. If a future model unlocks a higher bar, raise the bar in a follow-on ticket. The default rule: today's criteria for today's shipping model, future criteria for future tickets.

How to Run an Acceptance Review for an AI Feature

Writing good criteria is half the battle. The other half is running an acceptance review that respects the structure. Traditional acceptance reviews, where the PM clicks through a few flows, do not work for AI features because a handful of clicks does not exercise the input space. The following four-step review process makes acceptance reproducible and defensible.

Step 1: Run the evaluation set and review the metrics

Before any human review begins, the engineering team runs the named evaluation set and shares the metrics. The PM checks that every threshold in the criteria is met and that the results are stable across runs. If a metric is borderline (within 2 percentage points of its threshold), require a second run on a fresh sample. Acceptance reviews that skip the evaluation-set step are guessing.
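A small sketch of that review rule, assuming higher is better for each listed metric (invert the comparison for rates where lower is better); a pass within the margin of its threshold is flagged for a second run on a fresh sample.

# Hypothetical Step 1 helper: compare eval-set metrics to thresholds and flag
# borderline passes that require a second run.
def review_metrics(metrics: dict, thresholds: dict, margin: float = 0.02) -> dict:
    report = {}
    for name, threshold in thresholds.items():
        value = metrics[name]
        passed = value >= threshold
        report[name] = {
            "value": value,
            "passed": passed,
            "needs_second_run": passed and (value - threshold) < margin,
        }
    return report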

Step 2: Hand-grade a small sample of outputs

Sample 30 outputs across the input distribution, including the long tail (rare categories, edge-case inputs). The PM, a domain expert, and one engineer each grade independently using the rubric defined in the criteria. Compare grades and discuss disagreements. The point is not to grade everything; it is to check that the rubric and the evaluation metric agree with human judgment about what good looks like. If hand grading disagrees with the evaluation metric, the evaluation set has a problem and acceptance is paused.
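A hedged sketch of the hand-grading mechanics: stratified sampling so the long tail is represented, plus a disagreement flag for the three graders. The field names and the one-point disagreement threshold are assumptions.

# Hypothetical Step 2 helpers. Outputs are dicts with at least a "category"
# key; grades maps an output id to the three independent rubric scores.
import random

def sample_for_hand_grading(outputs: list[dict], k: int = 30) -> list[dict]:
    by_category: dict[str, list[dict]] = {}
    for output in outputs:
        by_category.setdefault(output["category"], []).append(output)
    # One output per category first (covers rare categories), then fill randomly.
    sample = [random.choice(items) for items in by_category.values()]
    remaining = [o for o in outputs if o not in sample]
    fill = min(max(k - len(sample), 0), len(remaining))
    sample += random.sample(remaining, fill)
    return sample[:k]

def grading_disagreements(grades: dict[str, list[float]], max_spread: float = 1.0) -> list[str]:
    """Output ids where the graders differ by more than one rubric point."""
    return [output_id for output_id, scores in grades.items()
            if max(scores) - min(scores) > max_spread]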

Step 3: Exercise the safety and refusal cases

For every refusal case in the criteria, run at least 5 inputs that should trigger the refusal and 5 inputs that should not. Verify that the refusals fire correctly and that the non-refusal inputs are not over-blocked. Over-blocking is a silent acceptance failure: it produces a feature users perceive as broken even though it is technically safe. Document the safety test results in the ticket.
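A minimal sketch of the Step 3 harness, assuming the feature under test can be called directly and reports whether it refused; both missed refusals and over-blocking are surfaced so neither failure mode stays silent.

# Hypothetical Step 3 harness. refusal_cases maps each refusal case to inputs
# that should trigger it and inputs that should not; run_feature is assumed to
# return a dict with a boolean "refused" field.
def run_refusal_tests(run_feature, refusal_cases: dict) -> dict:
    results = {}
    for case_name, inputs in refusal_cases.items():
        missed = [x for x in inputs["should_refuse"] if not run_feature(x)["refused"]]
        over_blocked = [x for x in inputs["should_allow"] if run_feature(x)["refused"]]
        results[case_name] = {
            "missed_refusals": missed,      # safety gaps: refusal did not fire
            "over_blocked": over_blocked,   # silent failure: safe inputs blocked
        }
    return results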

Step 4: Walk through the human review hook end to end

Trigger the human review hook on a real input that should escalate, and follow the path all the way to the operator queue. Verify that the operator sees the input, the model output, the confidence score, and the reason for escalation. A human review hook that engineers built but no one tested end-to-end is a common source of post-launch incidents. Acceptance is not complete until the operator side has been exercised.

Master AI Product Requirements in the Masterclass

Acceptance criteria, evaluation design, and acceptance review practice are core curriculum, taught live by a Salesforce Sr. Director PM.