AI PM TEMPLATES

AI Eval Test Case Template: Writing Test Cases for AI Outputs

By Institute of AI PM · 13 min read · May 7, 2026

TL;DR

Software test cases assert exact equality. AI eval test cases assert quality bands across multiple dimensions. The structure is different, the scoring is different, the failure modes are different. This template gives you the canonical AI eval test case format — input, expected behavior bands, scoring rubric, edge case markers — and shows how cases compose into a usable golden set.

The Anatomy of a Good Eval Test Case

id

Stable identifier. Lets you track regressions and reference cases across time.

category

Happy path / edge case / adversarial / regression. Categories let you slice eval results meaningfully.

input

The prompt or input the user (or system) provides. Realistic phrasing, not idealized.

expected_behavior

What good looks like, described in 1-3 sentences. Specific behaviors, not exact strings.

scoring_dimensions

Correctness, format, tone, safety. Each scored 0-3 or pass/fail.

anti_patterns

What the AI must NOT do. As specific as expected behavior. Often the highest-signal field.

weight

Some cases matter more than others. Weight high-stakes or high-traffic cases higher.
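The fields above can be captured in a small schema. Here is a minimal sketch in Python; the field names come from the template, but the dataclass, types, and defaults are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str                     # stable identifier, e.g. "case_042"
    category: str               # happy_path | edge_case | adversarial | regression
    input: str                  # realistic user/system input, not idealized
    expected_behavior: str      # what good looks like, 1-3 sentences
    scoring_dimensions: dict    # e.g. {"correctness": "0-3", "safety": "pass/fail"}
    anti_patterns: list         # behaviors the model must NOT exhibit
    weight: int = 1             # higher = counts more in the aggregate score
```

Keeping cases in a typed structure like this makes it easy to lint the set (every case has anti-patterns, every category is valid) before any eval run.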

Worked Example

id: case_042

category: edge_case

input: "summarize this thread" (where thread is empty)

expected_behavior: the model recognizes the empty input and asks for clarification or returns a graceful empty response, not an invented summary

scoring_dimensions:

- correctness: 0-3 (does the response handle empty input correctly?)

- format: 0-3 (clear, concise response)

- safety: pass/fail (no fabricated content)

anti_patterns: must NOT invent thread content; must NOT crash; must NOT refuse without context

weight: 2 (production traffic includes empty threads regularly)

A good test case takes 5 minutes to write and saves hours of debugging. The discipline is writing them before the prompt change, not after.
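Scoring a case like case_042 means collapsing its per-dimension scores into one number, then combining cases by weight. A minimal sketch, assuming each dimension is stored as (score, max) with pass/fail encoded as (1, 1) or (0, 1); the function names are hypothetical:

```python
def score_case(dimension_scores):
    """Normalize per-dimension scores into one case score in [0, 1].

    dimension_scores maps dimension -> (score, max); a pass/fail
    dimension is encoded as (1, 1) for pass or (0, 1) for fail.
    """
    earned = sum(s for s, _ in dimension_scores.values())
    possible = sum(m for _, m in dimension_scores.values())
    return earned / possible

def golden_set_score(case_results):
    """Weighted average over (case_score, weight) pairs."""
    total_weight = sum(w for _, w in case_results)
    return sum(s * w for s, w in case_results) / total_weight

# case_042 from the worked example: correctness 3/3, format 2/3, safety pass
c42 = score_case({"correctness": (3, 3), "format": (2, 3), "safety": (1, 1)})
```

Dividing by total weight (not case count) is what makes a weight-2 case like case_042 count double in the set-level score.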

Composing Cases Into a Golden Set

Coverage by category

60% happy path, 25% edge, 10% adversarial, 5% regression. Tune to your product's risk profile.

Coverage by user segment

If you serve enterprise + SMB + consumer, ensure each is represented. Aggregate eval can hide segment regressions.

Coverage by language and locale

If you ship multilingual, your eval set must include each Tier 1 language. Don't infer global quality from an English-only eval.

Maintain freshness

Add new cases monthly from production failures. Retire cases that no longer matter. Stale eval sets fail to catch new issues.
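A coverage check against the category mix above can run as a lint step whenever the set changes. A sketch under the 60/25/10/5 split from this section; the target mix and tolerance are the tunable parts:

```python
from collections import Counter

TARGET_MIX = {"happy_path": 0.60, "edge_case": 0.25,
              "adversarial": 0.10, "regression": 0.05}

def coverage_gaps(cases, target=TARGET_MIX, tolerance=0.05):
    """Return {category: actual_share} for categories that deviate
    from the target mix by more than the tolerance."""
    counts = Counter(c["category"] for c in cases)
    n = len(cases)
    return {cat: counts.get(cat, 0) / n
            for cat, share in target.items()
            if abs(counts.get(cat, 0) / n - share) > tolerance}
```

An empty result means the set is within tolerance of the target mix; anything returned is a category to backfill (or prune).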

Build Eval Sets That Catch Real Issues

The AI PM Masterclass walks through eval design with real test case examples and golden set construction — taught by a Salesforce Sr. Director PM.

Operating an Eval Set Over Time

Adding cases from production

Every escalation or surprising production behavior should generate at least one new eval case. Production is your eval-case factory.

Auditing the rubric

LLM-as-judge scoring drifts. Audit 10% of cases with humans monthly. Recalibrate the judge prompt when needed.
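The monthly 10% human audit is easy to automate as a reproducible draw. A minimal sketch; the seed makes the sample auditable after the fact, and the function name is an assumption:

```python
import random

def audit_sample(case_ids, fraction=0.10, seed=None):
    """Pick ~fraction of cases for human re-scoring of the LLM judge.

    A fixed seed makes the monthly draw reproducible, so the same
    sample can be reviewed by multiple people.
    """
    rng = random.Random(seed)
    k = max(1, round(len(case_ids) * fraction))
    return rng.sample(case_ids, k)
```

If human scores and judge scores diverge on the sample, that is the signal to recalibrate the judge prompt.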

Pruning stale cases

Cases that have passed for 6 consecutive months may not be discriminating. Prune; replace with harder ones.

Versioning the eval set

When you make significant changes to the set, version it. That lets you compare apples-to-apples across time.

Eval Test Case Mistakes

Asserting exact strings

Models vary phrasing. Strict string equality misses correct answers. Use rubrics, not equality.

Cases that all pass

If 100% of cases pass, the eval isn't discriminating. Add harder cases until you have 70-90% pass rates with room for regression detection.
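The 70-90% band is easy to check mechanically. A one-function sketch, assuming pass/fail results per case; the band bounds are the tunable parameters:

```python
def is_discriminating(pass_flags, lo=0.70, hi=0.90):
    """True if the pass rate sits in the band where the set can
    still detect regressions (not saturated, not broken)."""
    rate = sum(pass_flags) / len(pass_flags)
    return lo <= rate <= hi
```

A set sitting at 100% is the trigger to add harder cases; one below the band usually means the product, not the eval, needs work first.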

No anti-patterns specified

"Should be helpful" is half the test. The other half is "must not do X." Specify both.

Single-run scoring

AI outputs vary per run. Score each case on 3-5 runs; track variance as well as average.
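Multi-run scoring is a small wrapper around whatever scores a single run. A sketch, assuming `run_case` is a callable that executes one run and returns a score in [0, 1]; that callable is the part you supply:

```python
from statistics import mean, pstdev

def score_over_runs(run_case, n_runs=5):
    """Score the same case several times; report average and spread.

    run_case: zero-argument callable returning one run's score.
    A high stdev with an acceptable mean is itself a finding:
    the behavior is unstable even when it is right on average.
    """
    scores = [run_case() for _ in range(n_runs)]
    return {"mean": mean(scores), "stdev": pstdev(scores), "scores": scores}
```

Tracking the stdev per case, not just the mean, is what catches a prompt change that made outputs flakier without moving the average.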

Eval set authored by one person

Single perspective produces biased eval. Rotate authoring or pair-write to surface blind spots.

Eval Like a Senior AI PM

The Masterclass covers eval design, test case writing, and golden set operations — taught by a Salesforce Sr. Director PM.