AI Eval Test Case Template: Writing Test Cases for AI Outputs
TL;DR
Software test cases assert exact equality. AI eval test cases assert quality bands across multiple dimensions. The structure is different, the scoring is different, the failure modes are different. This template gives you the canonical AI eval test case format — input, expected behavior bands, scoring rubric, edge case markers — and shows how cases compose into a usable golden set.
The Anatomy of a Good Eval Test Case
id
Stable identifier. Lets you track regressions and reference cases across time.
category
Happy path / edge case / adversarial / regression. Categories let you slice eval results meaningfully.
input
The prompt or input the user (or system) provides. Realistic phrasing, not idealized.
expected_behavior
What good looks like, described in 1-3 sentences. Specific behaviors, not exact strings.
scoring_dimensions
Correctness, format, tone, safety. Each scored 0-3 or pass/fail.
anti_patterns
What the AI must NOT do, specified as concretely as the expected behavior. Often the highest-signal field.
weight
Some cases matter more than others. Weight high-stakes or high-traffic cases higher.
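If you keep cases in code rather than a spreadsheet, these fields map onto a small data structure. A minimal sketch in Python; the EvalCase name and the exact types are illustrative, not a standard:

from dataclasses import dataclass, field
from typing import Literal

Category = Literal["happy_path", "edge_case", "adversarial", "regression"]

@dataclass
class EvalCase:
    id: str                                # stable identifier, e.g. "case_042"
    category: Category                     # lets you slice eval results by category
    input: str                             # realistic user or system input, not idealized
    expected_behavior: str                 # 1-3 sentences describing what good looks like
    scoring_dimensions: dict[str, str]     # dimension name -> scale, e.g. {"correctness": "0-3"}
    anti_patterns: list[str] = field(default_factory=list)  # what the model must NOT do
    weight: float = 1.0                    # weight high-stakes or high-traffic cases higher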
Worked Example
id: case_042
category: edge_case
input: "summarize this thread" (where thread is empty)
expected_behavior: model recognizes the empty input and either asks for clarification or returns a graceful empty response; it does not invent a summary
scoring_dimensions:
- correctness: 0-3 (does the response handle empty input correctly?)
- format: 0-3 (clear, concise response)
- safety: pass/fail (no fabricated content)
anti_patterns: must NOT invent thread content; must NOT crash; must NOT refuse without context
weight: 2 (production traffic includes empty threads regularly)
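The same case, written as an instance of the illustrative EvalCase sketch above:

case_042 = EvalCase(
    id="case_042",
    category="edge_case",
    input="summarize this thread",  # submitted with an empty thread attached
    expected_behavior=(
        "Recognizes the empty input and asks for clarification or returns a "
        "graceful empty response; does not invent a summary."
    ),
    scoring_dimensions={"correctness": "0-3", "format": "0-3", "safety": "pass/fail"},
    anti_patterns=["invents thread content", "crashes", "refuses without context"],
    weight=2.0,  # empty threads show up regularly in production traffic
)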
A good test case takes 5 minutes to write and saves hours of debugging. The discipline is writing cases before the prompt change, not after.
Composing Cases Into a Golden Set
Coverage by category
A reasonable default split is 60% happy path, 25% edge case, 10% adversarial, 5% regression. Tune it to your product's risk profile.
Coverage by user segment
If you serve enterprise + SMB + consumer, ensure each is represented. Aggregate eval can hide segment regressions.
Coverage by language and locale
If you ship multilingual, the eval set must include every Tier 1 language. Don't infer global quality from an English-only eval.
Maintain freshness
Add new cases monthly from production failures. Retire cases that no longer matter. Stale eval sets fail to catch new issues.
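A small check keeps the category mix honest as the set grows; the same pattern extends to user-segment and locale slices. A sketch that builds on the EvalCase sketch above, with target numbers mirroring the default split and an arbitrary tolerance:

from collections import Counter

TARGET_MIX = {"happy_path": 0.60, "edge_case": 0.25, "adversarial": 0.10, "regression": 0.05}

def check_category_mix(cases: list[EvalCase], tolerance: float = 0.05) -> list[str]:
    """Return a warning for each category whose share drifts past the tolerance."""
    counts = Counter(c.category for c in cases)
    total = len(cases)
    warnings = []
    for category, target in TARGET_MIX.items():
        share = counts.get(category, 0) / total if total else 0.0
        if abs(share - target) > tolerance:
            warnings.append(f"{category}: {share:.0%} of set vs {target:.0%} target")
    return warnings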
Operating an Eval Set Over Time
Adding cases from production
Every escalation or surprising production behavior should generate at least one new eval case. Production is your eval-case factory.
Auditing the rubric
LLM-as-judge scoring drifts. Audit 10% of cases with humans monthly. Recalibrate the judge prompt when needed.
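The audit itself can be mechanical: sample a fixed fraction of judged cases each month and route them to human reviewers. A minimal sketch, with the 10% fraction taken from the guidance above:

import random

def sample_for_human_audit(judged_case_ids: list[str], fraction: float = 0.10, seed: int | None = None) -> list[str]:
    """Pick a random subset of judged cases for monthly human review."""
    if not judged_case_ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(judged_case_ids) * fraction))
    return rng.sample(judged_case_ids, k)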
Pruning stale cases
Cases that have passed for 6 consecutive months may not be discriminating. Prune; replace with harder ones.
Versioning the eval set
When you make significant changes to the set, version it so you can compare results apples-to-apples across time.
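Prune candidates can also be flagged mechanically and then retired by hand. A sketch assuming you keep a per-case pass/fail history with the most recent month last:

def prune_candidates(history: dict[str, list[bool]], streak: int = 6) -> list[str]:
    """Flag case ids whose most recent `streak` monthly results were all passes."""
    return [
        case_id
        for case_id, results in history.items()
        if len(results) >= streak and all(results[-streak:])
    ]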
Eval Test Case Mistakes
Asserting exact strings
Models vary their phrasing from run to run, so strict string equality marks correct answers as failures. Use rubrics, not equality.
Cases that all pass
If 100% of cases pass, the eval isn't discriminating. Add harder cases until pass rates sit around 70-90%, leaving headroom to detect regressions.
No anti-patterns specified
"Should be helpful" is half the test. The other half is "must not do X." Specify both.
Single-run scoring
AI outputs vary per run. Score each case on 3-5 runs; track variance as well as average.
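Recording the spread takes only a few lines once you have a per-run scorer. A sketch assuming a hypothetical score_case callable that returns one numeric score per run, reusing the EvalCase sketch above:

from statistics import mean, pvariance
from typing import Callable

def score_with_variance(case: EvalCase, score_case: Callable[[EvalCase], float], runs: int = 5) -> dict[str, float]:
    """Score one case across several runs; report the average and the run-to-run variance."""
    scores = [score_case(case) for _ in range(runs)]
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),
        "min": min(scores),
        "max": max(scores),
    }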
Eval set authored by one person
Single perspective produces biased eval. Rotate authoring or pair-write to surface blind spots.