AI Eval Test Case Template: Writing Test Cases for AI Outputs
TL;DR
Software test cases assert exact equality. AI eval test cases assert quality bands across multiple dimensions. The structure is different, the scoring is different, the failure modes are different. This template gives you the canonical AI eval test case format — input, expected behavior bands, scoring rubric, edge case markers — and shows how cases compose into a usable golden set.
The Anatomy of a Good Eval Test Case
id
Stable identifier. Lets you track regressions and reference cases across time.
category
Happy path / edge case / adversarial / regression. Categories let you slice eval results meaningfully.
input
The prompt or input the user (or system) provides. Realistic phrasing, not idealized.
expected_behavior
What good looks like, described in 1-3 sentences. Specific behaviors, not exact strings.
scoring_dimensions
Correctness, format, tone, safety. Each scored 0-3 or pass/fail.
anti_patterns
What the AI must NOT do, specified as concretely as the expected behavior. Often the highest-signal field.
weight
Some cases matter more than others. Weight high-stakes or high-traffic cases higher.
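If you keep cases in code rather than a spreadsheet, these fields map onto a small data structure. A minimal sketch in Python; the EvalCase name and the exact types are illustrative, not a standard:

from dataclasses import dataclass, field
from typing import Literal

Category = Literal["happy_path", "edge_case", "adversarial", "regression"]

@dataclass
class EvalCase:
    id: str                                # stable identifier, e.g. "case_042"
    category: Category                     # lets you slice eval results by category
    input: str                             # realistic user or system input, not idealized
    expected_behavior: str                 # 1-3 sentences describing what good looks like
    scoring_dimensions: dict[str, str]     # dimension name -> scale, e.g. {"correctness": "0-3"}
    anti_patterns: list[str] = field(default_factory=list)  # what the model must NOT do
    weight: float = 1.0                    # weight high-stakes or high-traffic cases higher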
Worked Example
id: case_042
category: edge_case
input: "summarize this thread" (where thread is empty)
expected_behavior: model recognizes the empty input and either asks for clarification or returns a graceful empty response; it does not invent a summary
scoring_dimensions:
- correctness: 0-3 (does the response handle empty input correctly?)
- format: 0-3 (clear, concise response)
- safety: pass/fail (no fabricated content)
anti_patterns: must NOT invent thread content; must NOT crash; must NOT refuse without context
weight: 2 (production traffic includes empty threads regularly)
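The same case, written as an instance of the illustrative EvalCase sketch above:

case_042 = EvalCase(
    id="case_042",
    category="edge_case",
    input="summarize this thread",  # submitted with an empty thread attached
    expected_behavior=(
        "Recognizes the empty input and asks for clarification or returns a "
        "graceful empty response; does not invent a summary."
    ),
    scoring_dimensions={"correctness": "0-3", "format": "0-3", "safety": "pass/fail"},
    anti_patterns=["invents thread content", "crashes", "refuses without context"],
    weight=2.0,  # empty threads show up regularly in production traffic
)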
A good test case takes 5 minutes to write and saves hours of debugging. The discipline is writing cases before the prompt change, not after.
Composing Cases Into a Golden Set
Coverage by category
A reasonable default split is 60% happy path, 25% edge case, 10% adversarial, 5% regression. Tune it to your product's risk profile.
Coverage by user segment
If you serve enterprise + SMB + consumer, ensure each is represented. Aggregate eval can hide segment regressions.
Coverage by language and locale
If you ship multilingual, the eval set must include every Tier 1 language. Don't infer global quality from an English-only eval.
Maintain freshness
Add new cases monthly from production failures. Retire cases that no longer matter. Stale eval sets fail to catch new issues.
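A small check keeps the category mix honest as the set grows; the same pattern extends to user-segment and locale slices. A sketch that builds on the EvalCase sketch above, with target numbers mirroring the default split and an arbitrary tolerance:

from collections import Counter

TARGET_MIX = {"happy_path": 0.60, "edge_case": 0.25, "adversarial": 0.10, "regression": 0.05}

def check_category_mix(cases: list[EvalCase], tolerance: float = 0.05) -> list[str]:
    """Return a warning for each category whose share drifts past the tolerance."""
    counts = Counter(c.category for c in cases)
    total = len(cases)
    warnings = []
    for category, target in TARGET_MIX.items():
        share = counts.get(category, 0) / total if total else 0.0
        if abs(share - target) > tolerance:
            warnings.append(f"{category}: {share:.0%} of set vs {target:.0%} target")
    return warnings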
Operating an Eval Set Over Time
Adding cases from production
Every escalation or surprising production behavior should generate at least one new eval case. Production is your eval-case factory.
Auditing the rubric
LLM-as-judge scoring drifts. Audit 10% of cases with humans monthly. Recalibrate the judge prompt when needed.
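The audit itself can be mechanical: sample a fixed fraction of judged cases each month and route them to human reviewers. A minimal sketch, with the 10% fraction taken from the guidance above:

import random

def sample_for_human_audit(judged_case_ids: list[str], fraction: float = 0.10, seed: int | None = None) -> list[str]:
    """Pick a random subset of judged cases for monthly human review."""
    if not judged_case_ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(judged_case_ids) * fraction))
    return rng.sample(judged_case_ids, k)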
Pruning stale cases
Cases that have passed for 6 consecutive months may not be discriminating. Prune; replace with harder ones.
Versioning the eval set
When you make significant changes to the set, version it so you can compare results apples-to-apples across time.
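Prune candidates can also be flagged mechanically and then retired by hand. A sketch assuming you keep a per-case pass/fail history with the most recent month last:

def prune_candidates(history: dict[str, list[bool]], streak: int = 6) -> list[str]:
    """Flag case ids whose most recent `streak` monthly results were all passes."""
    return [
        case_id
        for case_id, results in history.items()
        if len(results) >= streak and all(results[-streak:])
    ]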
Eval Test Case Mistakes
Asserting exact strings
Models vary their phrasing from run to run, so strict string equality marks correct answers as failures. Use rubrics, not equality.
Cases that all pass
If 100% of cases pass, the eval isn't discriminating. Add harder cases until pass rates sit around 70-90%, leaving headroom to detect regressions.
No anti-patterns specified
"Should be helpful" is half the test. The other half is "must not do X." Specify both.
Single-run scoring
AI outputs vary per run. Score each case on 3-5 runs; track variance as well as average.
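Recording the spread takes only a few lines once you have a per-run scorer. A sketch assuming a hypothetical score_case callable that returns one numeric score per run, reusing the EvalCase sketch above:

from statistics import mean, pvariance
from typing import Callable

def score_with_variance(case: EvalCase, score_case: Callable[[EvalCase], float], runs: int = 5) -> dict[str, float]:
    """Score one case across several runs; report the average and the run-to-run variance."""
    scores = [score_case(case) for _ in range(runs)]
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),
        "min": min(scores),
        "max": max(scores),
    }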
Eval set authored by one person
Single perspective produces biased eval. Rotate authoring or pair-write to surface blind spots.