How to Write AI Evals: The PM Skill That Separates Good from Great
TL;DR
In a 2026 survey of AI hiring managers, "evaluation writing" was listed as the #1 skill gap in AI PM candidates — above technical depth, strategic thinking, and domain knowledge. That's because evals are where product judgment meets AI behavior: writing a good eval requires knowing what "good" looks like for your users, how to catch the failure modes that matter, and how to turn subjective quality into a repeatable measurement. Engineers implement the pipeline; PMs define what it measures. This guide builds the skill from scratch — how to write a rubric, design coverage, annotate outputs, and develop the judgment that makes your evals trustworthy.
Why Eval Writing Is a Product Skill, Not an Engineering Skill
When most people think about AI evaluations, they picture test harnesses, automated pipelines, and Python scripts. That is the engineering side of evals — and engineers are well-suited to build it. But there is a prior problem that engineers cannot solve: deciding what to measure and what "good" means.
A well-implemented eval that measures the wrong thing is worse than useless — it gives you false confidence. An eval that passes because it tests easy cases while missing your real failure modes will let regressions ship to users. These are product judgment failures, not engineering failures.
What does 'good' actually mean for this output?
Engineers can measure accuracy against a reference answer. Only PMs know whether 'accurate' is the right metric — sometimes tone, conciseness, or actionability matters more. The rubric is a product decision disguised as a testing decision.
Which failure modes actually harm users?
Not all failures are equal. A hallucinated fact in a critical medical record is catastrophic. A slightly formal tone in a casual chat product is annoying. Prioritizing what to test requires knowing your users, their stakes, and their error tolerance. Engineers don't have this context by default.
Is this eval set representative of real usage?
Eval sets built from engineering intuition tend to over-represent clean, well-formed inputs. Real users write with typos, ambiguity, and edge-case intent. A PM who has done user research knows what the real distribution looks like.
When should we block shipping based on eval scores?
The threshold question — 'eval score dropped 3%, should we block the deploy?' — is a product risk decision. How much regression is acceptable for the quality improvement the new model brings? This is a PM call.
The eval writing skill is really five sub-skills: rubric writing, coverage design, annotation judgment, threshold setting, and eval maintenance. Each is learnable. None requires you to write a line of evaluation code.
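The threshold call can be written down as an explicit rule, which makes the PM's risk decision visible instead of implicit. Engineers would typically own code like this; the sketch below only makes the product decision concrete. The names `ship_gate` and `GateDecision` are hypothetical, and the 3% default is borrowed from the example question above as a placeholder, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    blocked: bool
    reason: str


def ship_gate(baseline: float, candidate: float,
              max_relative_drop: float = 0.03) -> GateDecision:
    """Block the deploy when the candidate's eval score falls more than
    max_relative_drop below the baseline score (assumed policy)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    drop = (baseline - candidate) / baseline  # positive means regression
    if drop > max_relative_drop:
        return GateDecision(
            True, f"score dropped {drop:.1%}, limit is {max_relative_drop:.1%}")
    return GateDecision(False, f"score change {-drop:+.1%} is within tolerance")


if __name__ == "__main__":
    print(ship_gate(baseline=4.0, candidate=3.8))
    print(ship_gate(baseline=4.0, candidate=3.95))
```

The tolerance is the product decision. A team shipping a large capability gain might accept a wider drop on one dimension, and a high-stakes product might set the limit near zero.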
Writing a Good Rubric: From Vague to Measurable
The rubric is the heart of an eval. It defines what you are measuring and what scores mean. Most first-time rubrics are either too vague ("is the output good?") or too rigid (exact string match). Here is how to write rubrics that are specific enough to be consistent but flexible enough to capture real quality.
Step 1: Start with failure, not success
The best rubrics start with a clear description of what failure looks like, not what success looks like. Success is often diffuse ("the output is helpful"). Failure is concrete ("the output recommends a medication without mentioning the documented allergy in the context"). Write 3-5 specific failure modes for your use case, then build rubric dimensions that catch each one.
Step 2: Define dimensions, not a single score
A single 1-5 quality score collapses too many dimensions into one number. When the score drops, you can't diagnose why. Separate your rubric into independent dimensions. For a customer support AI, that might be: factual accuracy, tone appropriateness, completeness, and actionability. Score each independently.
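One way to keep dimensions independent in practice is to record a score per dimension and aggregate per dimension, never across them. The sketch below uses the four customer support dimensions named above; the `RubricScore` type and `mean_by_dimension` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Mapping, Sequence

# Dimensions taken from the customer support example; adjust for your product.
DIMENSIONS = ("factual_accuracy", "tone_appropriateness",
              "completeness", "actionability")


@dataclass(frozen=True)
class RubricScore:
    output_id: str
    scores: Mapping[str, int]  # dimension name -> score on a 1-5 scale

    def __post_init__(self) -> None:
        if set(self.scores) != set(DIMENSIONS):
            raise ValueError(f"expected exactly these dimensions: {DIMENSIONS}")
        for dim, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{dim} score {value} is outside 1-5")


def mean_by_dimension(results: Sequence[RubricScore]) -> Dict[str, float]:
    """Average each dimension separately so a drop can be traced to its cause."""
    return {dim: mean(r.scores[dim] for r in results) for dim in DIMENSIONS}
```

With this shape, a falling tone average next to a stable accuracy average points at the cause directly, which a single blended score would hide.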
Step 3: Write anchor examples for each score level
For each dimension on your rubric, write a concrete example of what a 1, 3, and 5 look like. These anchors are what make annotation consistent across raters. Without anchors, two annotators rating the same output can easily diverge by 1-2 points on a 5-point scale. With strong anchors, aim for inter-rater agreement (measured by Cohen's kappa) of 0.7 or higher.
Anchor example:
For 'Tone Appropriateness' in a consumer product:
5: Warm, clear, matches the user's emotional register. A frustrated user gets an empathetic response.
3: Correct but neutral. Not wrong, but doesn't match the emotional context.
1: Robotic or mismatched. A grieving user gets a transactional response.
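Cohen's kappa compares how often two raters agree with how often they would agree by chance, given how each rater distributes their scores. The sketch below is a minimal from-scratch version of the unweighted statistic, assuming two raters scored the same outputs in the same order. Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement; weighted variants exist for ordinal scales and may suit a 5-point rubric better.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters gave every item the same single score
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    a = [5, 5, 3, 1, 3, 5, 1, 3]
    b = [5, 5, 3, 1, 3, 3, 1, 3]
    print(round(cohens_kappa(a, b), 3))
```

In this example the raters agree on 7 of 8 items, but kappa comes out lower than 0.875 because some of that agreement would be expected by chance.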
Step 4: Test the rubric on 10 real outputs before finalizing
Before your eval suite goes to engineering, apply your rubric to 10 real production outputs yourself. You will discover: dimensions that don't apply, scoring that feels arbitrary, and failure modes you missed. One round of rubric testing before building is worth 10 rounds of debugging after.
Coverage Design: What Cases to Include
Coverage is the second major eval skill. An eval that only tests happy-path inputs will pass with flying colors right up until a user sends an edge case to production. Here is the coverage framework that senior AI PMs use:
Happy path (50-60%)
Clear, well-formed inputs representative of your highest-volume usage. These establish baseline quality — your model should score 4-5 on all happy path cases. If it doesn't, you have a fundamental quality problem.
Edge cases (20-25%)
Unusual but legitimate inputs: very long inputs, very short inputs, inputs in unexpected languages, inputs that are technically valid but semantically unusual. These cases expose gaps in generalization.
Adversarial cases (10-15%)
Inputs designed to probe failure modes: prompt injection attempts, requests that conflict with your guardrails, inputs that test the model's refusal behavior. If you ship without adversarial cases, a user will find them for you.
Regression cases (5-10%)
Specific cases from past production failures. Every incident in production should produce at least one new eval case. This is how your eval set stays calibrated to real user behavior over time.
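The four percentage ranges above can be checked mechanically once each eval case carries a category label. The sketch below assumes a label per case using the hypothetical names `happy_path`, `edge`, `adversarial`, and `regression`; the ranges are copied from the framework above.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

# Target share of the eval set per category, as (low, high) fractions.
TARGET_MIX = {
    "happy_path": (0.50, 0.60),
    "edge": (0.20, 0.25),
    "adversarial": (0.10, 0.15),
    "regression": (0.05, 0.10),
}


def coverage_report(categories: Sequence[str]) -> Dict[str, Tuple[float, str]]:
    """Return each category's share of the set and whether it is in range."""
    if not categories:
        raise ValueError("eval set is empty")
    counts = Counter(categories)
    unknown = set(counts) - set(TARGET_MIX)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = len(categories)
    report = {}
    for name, (low, high) in TARGET_MIX.items():
        share = counts[name] / total
        if share < low:
            status = "under"
        elif share > high:
            status = "over"
        else:
            status = "ok"
        report[name] = (share, status)
    return report
```

A report showing `regression` as "under" is a prompt to go back through recent incidents, not a reason to pad the set with invented cases.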
Where do eval cases come from? The best sources, in priority order:
Production logs: sample 200 real user inputs and use them as your eval seed set
User interviews: the specific things users told you were wrong or surprising in your last round of research
Failure tickets: every support ticket about AI quality is an eval case candidate
Your own adversarial brainstorming: 30 minutes with your team asking 'what would break this?'
Published eval sets in your domain (MMLU, HellaSwag, BIG-Bench for general reasoning; domain-specific benchmarks for your vertical)
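For the production-log source, a reproducible random sample avoids hand-picking the inputs that look interesting. The sketch below assumes the logs are already available as a list of input strings with personal data removed; the function name `sample_seed_set` and the fixed seed are illustrative choices, and the default of 200 comes from the list above.

```python
import random
from typing import List, Sequence


def sample_seed_set(log_inputs: Sequence[str], k: int = 200,
                    seed: int = 0) -> List[str]:
    """Draw up to k distinct, non-empty inputs from production logs."""
    # Strip whitespace, drop blanks, and de-duplicate while keeping order.
    unique = list(dict.fromkeys(
        text.strip() for text in log_inputs if text.strip()))
    if len(unique) <= k:
        return unique
    # A seeded generator makes the same sample reproducible later.
    return random.Random(seed).sample(unique, k)
```

Uniform sampling mirrors the real volume distribution, so rare but important inputs may need to be added by hand from the other sources in the list.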
Developing Annotation Judgment: The Reps That Build the Skill
The fastest way to develop eval intuition is to annotate AI outputs yourself rather than read about annotation. Annotating fifty outputs changes how you see AI quality in a way that reading fifty articles cannot. Here is a structured practice approach:
The 50-output annotation sprint
Beginner: Pick any publicly accessible AI product (ChatGPT, Claude, Gemini) and a task domain you know well. Write 50 prompts. Score every output on a rubric you write yourself. After 50, review your scoring for consistency and look for patterns in where the model underperforms. This single exercise builds more eval intuition than any course.
The inter-rater calibration exercise
Intermediate: Find a colleague and independently annotate the same 20 outputs against the same rubric. Compare scores. For every case where you diverged by 2+ points, have a 5-minute conversation about why. This surfaces rubric ambiguity and builds shared quality standards. Cohen's kappa below 0.6 means your rubric needs clarification.
The regression catch exercise
Intermediate: Take a model you know well and write 10 cases that you predict it will fail on based on its known limitations. Run them. Check your predictions. Your prediction accuracy tells you how well you understand the model's failure modes, which is what makes your future evals trustworthy.
The production incident -> eval case exercise
Advanced: Pick a real AI product incident from your team's history (or a public one from your domain). Starting from the incident, write 3 eval cases that would have caught the failure before it reached users. Then reason backward: what coverage gap allowed the failure to slip through? Add a category to your eval taxonomy to close that gap.
The 'defeat your eval' stress test
Advanced: Take an eval set you are proud of and spend 30 minutes trying to break it: write inputs that would score well on your eval but produce bad outputs in the real product. If you can find such inputs, your eval has a coverage gap or a rubric that can be gamed. Finding these before shipping is the job.
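For the inter-rater calibration exercise above, the cases worth a conversation are the ones where two raters diverged by 2 or more points. The sketch below assumes each rater's scores are stored as a mapping from a hypothetical output ID to a 1-5 score; the function name is illustrative.

```python
from typing import List, Mapping, Tuple


def large_disagreements(scores_a: Mapping[str, int],
                        scores_b: Mapping[str, int],
                        min_gap: int = 2) -> List[Tuple[str, int, int]]:
    """List outputs both raters scored where the scores differ by min_gap
    or more, largest gap first."""
    shared = scores_a.keys() & scores_b.keys()
    rows = [(oid, scores_a[oid], scores_b[oid]) for oid in shared
            if abs(scores_a[oid] - scores_b[oid]) >= min_gap]
    # Sort by gap size (descending), then by ID for a stable order.
    return sorted(rows, key=lambda row: (-abs(row[1] - row[2]), row[0]))


if __name__ == "__main__":
    mine = {"o1": 5, "o2": 3, "o3": 1, "o4": 4}
    theirs = {"o1": 4, "o2": 5, "o3": 4, "o4": 4}
    for output_id, a, b in large_disagreements(mine, theirs):
        print(output_id, a, b)
```

Starting the discussion with the largest gaps tends to surface the most ambiguous rubric wording first.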
The Five Eval Mistakes PMs Make (and How to Avoid Them)
Evaluating what's easy to measure instead of what matters
Fix: Easy metrics (exact string match, BLEU score, accuracy against a reference set) are tempting because they are objective and automated. But they often don't correlate with user satisfaction. A customer service AI that produces technically accurate answers in a harsh tone will pass every accuracy eval while users rate it 1 star. Start with user satisfaction, then work backward to the proxy metrics that predict it.
Optimizing on the eval instead of the goal
Fix: When you share eval scores with engineering as the primary quality signal, teams optimize to improve the score — not necessarily to improve the product. This is Goodhart's Law applied to AI. Mitigation: hold back a test set that nobody sees until a final quality review. Never share the full eval set; the hidden set is your ground truth.
Building the eval set from engineering intuition, not user research
Fix: Engineers write test cases based on what's technically interesting or what they know the model is likely to fail at. Real users write based on what they need. Your eval set should be seeded from production logs and user interviews, not from engineering brainstorming. Evals that don't reflect real usage patterns measure the wrong distribution.
Never refreshing the eval set as the product evolves
Fix: An eval set written at product launch is outdated six months later. User behavior shifts. New failure modes emerge. Model updates change the quality profile. Schedule a quarterly eval set review: add regression cases from recent incidents, retire cases that are no longer representative, and recalibrate thresholds as the product matures.
Treating eval score as the only quality signal
Fix: Eval scores are necessary but not sufficient. Complement them with: periodic human review of production outputs (a sample, not all), user feedback signals (thumbs up/down, follow-up corrections), and qualitative user research sessions where you watch users interact with AI outputs live. The eval is the floor; the qualitative signals are the ceiling you're aiming for.
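The hidden test set from the second mistake above needs a split that stays stable as cases are added, so that a case never drifts from hidden to shared. One common approach, sketched here under assumptions, is to hash each case ID into a bucket. The function name, the `salt` value, and the 20% holdout fraction are all hypothetical; the source does not prescribe a split size.

```python
import hashlib


def assign_split(case_id: str, holdout_fraction: float = 0.2,
                 salt: str = "eval-v1") -> str:
    """Deterministically assign an eval case to 'holdout' or 'shared'."""
    if not 0.0 <= holdout_fraction <= 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "holdout" if bucket < holdout_fraction * 2**64 else "shared"


if __name__ == "__main__":
    for cid in ("case-001", "case-002", "case-003"):
        print(cid, assign_split(cid))
```

Because the assignment depends only on the ID and the salt, rerunning it on a larger set leaves every existing case where it was. Keeping the holdout cases themselves out of shared dashboards and repositories is still a process decision that code cannot enforce.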
How to show eval skill in a job interview
The strongest interview signal is a concrete eval you built: the rubric you wrote, the coverage design decisions you made, the incident that taught you something, and what score you would have blocked a ship on. If you can narrate that story with specifics, you have demonstrated more eval skill than most AI PM candidates. If you don't have a work example, build your first eval on a personal project: evaluate any public AI API against a task you care about.