AI Evaluation Design Guide: Building Evals That Actually Measure What Matters

Why Most AI Evals Fail (and What They Fail to Catch)

The most common AI evaluation failure is measuring something you can measure rather than something that matters. Teams report BLEU scores on text summarization features while users quietly churn because the summaries miss critical information. Teams run GPT-4 on standardized benchmarks while their product handles a specialized domain where the benchmark does not apply. Teams define success as "output quality" without specifying quality from whose perspective and for what purpose.

The second most common failure is over-fitting to evals. Once a team uses an eval set to make product decisions, that eval set becomes a target. Prompts get optimized for the eval rather than for real user value. Model selection is tuned to the eval distribution rather than the actual query distribution. This is Goodhart's Law applied to AI product development: when a measure becomes a target, it ceases to be a good measure.

Benchmark proxy problem

Using public benchmarks (MMLU, HumanEval, HellaSwag) as the primary quality signal. These benchmarks measure general capability, not performance on your specific use case. A model that scores 5 percent better on MMLU may perform worse on your actual user queries.

The golden path illusion

Evaluating only on ideal, well-formed inputs. Real user queries are messy, ambiguous, multilingual, and sometimes intentionally adversarial. An eval set built from internal examples misses the distribution of actual user behavior and produces optimistic quality estimates.

Measuring output quality, not outcome value

Grading whether the AI output sounds correct rather than whether it produces the intended user outcome. A legally accurate but incomprehensible summary of a contract scores well on quality rubrics but fails users. Outcome-based evaluation requires knowing what users do after they see the AI output.

No regression testing

Running evals once at launch and then not again. AI quality regresses constantly: when the model provider updates the underlying model, when prompts are changed, when retrieval databases are updated, when context window handling changes. Evals are only valuable if they run continuously.

The Four Types of Evals Every AI Product Needs

There is no single eval type that catches all failure modes. Mature AI product teams use a layered system, each layer catching what the others miss. The goal is not to run all four types for every feature, but to know which layers your most important features are missing and close those gaps.

Automated heuristic evals

Speed: milliseconds. Cost: near zero.

Rules-based checks that run on every model output. Examples: output length within specified range, required fields present in structured output, no profanity, response language matches query language, no company names that are not in the approved list. These catch regressions in format, structure, and safety constraints. They are fast enough to run in production and cheap enough to run on every query. Start here.

Limitation: Heuristics cannot measure quality. They can tell you the output has the right length; they cannot tell you whether it is correct, helpful, or relevant.
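
The rules-based checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the model returns a JSON object with hypothetical `summary` and `language` fields, and the length limits and blocked-term list are made up for the example.

```python
import json

# Hypothetical limits and word list, chosen only for illustration.
MIN_CHARS, MAX_CHARS = 20, 600
REQUIRED_FIELDS = ("summary", "language")
BLOCKED_TERMS = {"damn", "hell"}


def heuristic_failures(raw_output: str, query_language: str) -> list[str]:
    """Return the names of all failed checks; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["not_an_object"]

    failures = []
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        failures.append("missing_fields:" + ",".join(missing))

    summary = str(data.get("summary", ""))
    if not MIN_CHARS <= len(summary) <= MAX_CHARS:
        failures.append("length_out_of_range")

    words = {word.strip(".,!?").lower() for word in summary.split()}
    if words & BLOCKED_TERMS:
        failures.append("blocked_term")

    if data.get("language") != query_language:
        failures.append("language_mismatch")
    return failures
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easier to see which rule regressed when a prompt or model changes.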

LLM-as-judge evals

Speed: seconds. Cost: moderate (one additional LLM call per eval).

A separate, often larger LLM grades each output on a rubric you define. Example: 'Grade the following summary on a 1 to 5 scale for accuracy (does it contain information not present in the source?), relevance (does it include the most important points?), and clarity (could a non-expert understand it?).' Well-designed judge prompts achieve 70 to 85 percent agreement with human raters on structured tasks.

Limitation: LLM judges inherit the biases of the judging model. They prefer longer outputs, verbose language, and confident tone regardless of accuracy. Always validate your judge against human labels on a sample before trusting it.
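
Validating a judge against human labels comes down to scoring the same sample twice and comparing. The text does not say how agreement is computed, so the sketch below assumes two simple definitions: exact match, and match within one point on the 1 to 5 scale. The scores are invented for illustration.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int], tolerance: int = 0) -> float:
    """Fraction of examples where judge and human scores differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("need at least one labeled example")
    matches = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


# Made-up scores on a 1 to 5 scale for ten outputs.
judge = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
human = [5, 4, 3, 3, 4, 2, 4, 5, 1, 4]

exact = agreement_rate(judge, human)                # 7 of 10 scores are identical
close = agreement_rate(judge, human, tolerance=1)   # 9 of 10 are within one point
```

In this made-up sample the judge never scores below the human, which is the kind of one-directional disagreement worth inspecting by hand before trusting the judge.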

Human evaluation

Speed: hours to days. Cost: high (requires human annotation time).

Real human raters evaluate a sample of outputs against a defined rubric. This is the ground truth layer. Use human evals to validate that your LLM judge is calibrated, to evaluate qualitative dimensions that resist automation (tone, creativity, domain expertise), and to generate the gold-standard labels your automated evals are calibrated against.

Limitation: Human evals are too slow and expensive to run continuously. Use them to calibrate and validate your automated systems, not as the primary measurement layer at scale.

A/B experiment evals

Speed: days to weeks. Cost: requires real user traffic.

The only eval type that measures actual user outcome rather than output quality. Split real traffic between model version A and model version B and measure downstream user behavior: task completion rate, session length, follow-up queries (a proxy for confusion), and conversion. This is the ultimate arbiter of whether an AI change actually helps users.

Limitation: A/B experiments require sufficient traffic to reach statistical significance and may not detect quality regressions quickly enough to prevent user harm. Do not use A/B as your only eval layer.
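
To make "statistical significance" concrete, here is a sketch of one common way to compare task completion rates between two variants: a pooled two-proportion z-test. The choice of test is an assumption (the text does not prescribe one), and the traffic numbers in the usage note are invented.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in completion rates, B minus A."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("test is undefined when every user succeeds or every user fails")
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With 400 of 1,000 completions on version A and 450 of 1,000 on version B, `two_proportion_z_test(400, 1000, 450, 1000)` gives z of about 2.26 and a p-value of about 0.024. The same 5-point lift with 100 users per arm would not come close to significance, which is why low-traffic features lean more heavily on the other eval layers.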

Designing Your Eval Set: Size, Coverage, and Construction

The eval set is the foundation of your entire evaluation system. A bad eval set produces misleading signals regardless of how sophisticated your evaluation pipeline is. Most teams underinvest here because it is unglamorous work. It is also the highest-leverage investment in AI quality measurement.

How many examples do you need

For most AI features, an eval set of 100 to 300 examples is sufficient to detect a 5 to 10 percent quality change with reasonable statistical confidence. For safety-critical features, use 500 to 1,000 to detect smaller regressions. Bigger is not always better: a 2,000-example eval set that is poorly sampled is less useful than a 200-example set that is well-sampled.
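
A rough way to sanity-check these sizes is the margin of error on a pass rate. The sketch below assumes each example is graded pass or fail, a baseline pass rate of 80 percent, and the normal approximation to the binomial; all three are simplifying assumptions, not part of the guidance above.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n examples."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


# Margin of error at an assumed 80 percent pass rate for several eval set sizes.
for n in (100, 200, 300, 1000):
    print(n, round(margin_of_error(0.8, n), 3))
```

Under these assumptions the margin is roughly 8 points at 100 examples, 5.5 points at 200, 4.5 points at 300, and 2.5 points at 1,000, which is broadly in line with the ranges suggested above. Halving the margin takes about four times as many examples, so returns diminish quickly.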

Sampling strategy

Sample from real user queries, not from your imagination. Pull queries from your production logs, stratified by query type, user segment, and edge case frequency. If you do not yet have production traffic, generate synthetic queries from personas and use cases, but validate them against real user research before relying on them.
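
Stratified sampling from logs can be sketched as follows. The record shape (a dict with a `query_type` key) and the equal per-stratum quota are assumptions for illustration; a real log schema will differ.

```python
import random
from collections import defaultdict


def stratified_sample(logs: list[dict], key: str, per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_stratum` records from each distinct value of `key`, reproducibly."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in logs:
        strata[record[key]].append(record)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        # Take the whole stratum when it is smaller than the quota.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

An equal quota per stratum deliberately over-represents rare query types compared with their share of traffic, so that edge cases are not drowned out by the most common queries. If you need scores that reflect overall traffic, reweight by stratum frequency when aggregating.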

Coverage requirements

Your eval set must cover: the happy path (well-formed queries in your primary use case), the edge cases (ambiguous queries, unusual inputs, adversarial inputs), the safety cases (queries that should trigger refusal or escalation), and your highest-value user segments (the users whose satisfaction matters most to business outcomes).

Gold labels

For each example, define what a correct answer looks like. This is harder than it sounds. For open-ended generation, 'correct' is multidimensional. Define a rubric with 3 to 5 dimensions and a 1 to 5 scale for each. Have two or more humans rate each example independently and measure inter-rater agreement. Low agreement means your rubric is ambiguous and your evals are not reproducible.
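
One widely used measure of inter-rater agreement is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The text does not name a specific statistic, so treat this choice as an assumption; the sketch handles two raters and treats scores as unordered categories.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same examples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of examples")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa([5, 5, 4, 4, 3, 3, 2, 2], [5, 5, 4, 3, 3, 3, 2, 1])` has raw agreement of 0.75 but a kappa of about 0.68 once chance agreement is removed. Because this version ignores the ordering of the 1 to 5 scale, a one-point disagreement counts the same as a four-point one; a weighted variant may suit ordinal rubrics better.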

Eval set maintenance

Your eval set is a living artifact. Rotate in new examples from production as user behavior evolves. Flag examples where your model and human judges consistently disagree as candidates for rubric refinement. Review the eval set quarterly and update it before each major model or feature change.

Hold-out eval set

Keep a portion of your eval set (20 to 30 percent) completely hidden and unused in any development or optimization process. This hold-out set is your unbiased final measure of quality. Never use it for prompt optimization or model selection. Use it only for go/no-go decisions at major release gates.
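
One way to keep the hold-out split stable as the eval set grows is to assign each example by hashing its ID rather than by random shuffling. This is a sketch under the assumption that every example has a stable string ID; the 25 percent default simply sits inside the 20 to 30 percent range above.

```python
import hashlib


def is_holdout(example_id: str, holdout_fraction: float = 0.25) -> bool:
    """Assign an example to the hold-out split based on a stable hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < holdout_fraction
```

Because the assignment depends only on the ID, adding new examples never moves existing ones between the development and hold-out splits, which would otherwise leak hold-out examples into prompt optimization.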

Automating Your Evals: CI/CD for AI Quality

Manual evals are a one-time snapshot. Automated evals are a continuous signal. The goal is to build an evaluation pipeline that runs automatically whenever anything in your AI system changes: a new model version, an updated prompt, a change to the retrieval index, a new dataset used for fine-tuning.

Trigger: define what changes trigger an eval run

At minimum: any change to the system prompt, any model version update, any change to retrieval configuration. Optionally: weekly scheduled runs for drift detection, any change to the eval set itself (to verify the change is intentional, not a regression).

Pipeline: run heuristic evals first, LLM judge second

Structure your pipeline so cheap, fast evals run first. If heuristic evals fail (e.g., format violations, safety violations), stop the pipeline and surface the failure immediately without incurring LLM judge costs. Only run LLM judges on examples that pass heuristic checks.
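
The staging described above can be sketched as a function that takes the cheap check and the expensive judge as parameters. Both callables are placeholders: `heuristic_check` returns a list of problems for an output, and `judge` stands in for whatever LLM call you use. This version stops the whole run on any heuristic failure, which is one reading of the guidance.

```python
from typing import Callable


def run_staged_eval(
    outputs: list[str],
    heuristic_check: Callable[[str], list[str]],
    judge: Callable[[str], float],
) -> dict:
    """Run cheap checks on every output; call the judge only if all of them pass."""
    failures = {}
    for index, output in enumerate(outputs):
        problems = heuristic_check(output)
        if problems:
            failures[index] = problems
    if failures:
        # Stop before the expensive stage and report what broke.
        return {"status": "failed_heuristics", "failures": failures, "judge_scores": None}
    return {"status": "judged", "failures": {}, "judge_scores": [judge(o) for o in outputs]}
```

Passing the judge in as a callable also makes the pipeline easy to test with a stub, so the staging logic can be verified without spending on LLM calls.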

Comparison: score against a baseline, not an absolute threshold

Compare each eval run against the current production model's score on the same eval set. A change that increases average LLM judge score from 3.8 to 4.1 is meaningful. A change that drops from 3.8 to 3.5 is a regression, even if 3.5 still sounds like a good score in absolute terms. Relative comparison is more sensitive and more reliable than absolute thresholds.

Reporting: surface failures to the right people immediately

Eval failures should route to the team responsible for the change that triggered the regression. Aggregate scores should be visible in a dashboard accessible to the PM and engineering lead. Define separate alert thresholds for 'investigate' (small regression) and 'block deployment' (large regression or safety failure).
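
Baseline comparison and the two alert levels can be combined into one small gate. The thresholds below are hypothetical placeholders on a 1 to 5 judge scale; appropriate values depend on how noisy your judge is from run to run.

```python
from statistics import mean

# Hypothetical thresholds on the drop in mean judge score (1 to 5 scale).
INVESTIGATE_DROP = 0.1
BLOCK_DROP = 0.25


def compare_to_baseline(baseline: list[float], candidate: list[float], safety_failures: int = 0) -> str:
    """Classify a candidate run as 'pass', 'investigate', or 'block' relative to the baseline."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must score the same eval set")
    if safety_failures > 0:
        return "block"  # any safety failure blocks, regardless of quality scores
    drop = mean(baseline) - mean(candidate)
    if drop >= BLOCK_DROP:
        return "block"
    if drop >= INVESTIGATE_DROP:
        return "investigate"
    return "pass"
```

With these placeholder thresholds, a candidate that averages 3.5 against a baseline of 3.8 is blocked, while one that averages 4.1 passes. The safety check runs first so that a quality improvement can never mask a safety failure.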

Human review queue: keep humans in the loop on close calls

Automated evals should route ambiguous cases to a human review queue. An output that scores near the middle of the LLM judge rubric (for example, 2.5 to 3.5 on a 1 to 5 scale, neither clearly good nor clearly bad) should go to a human. Design the review queue so reviewers see the query, the output, and the rubric score, and can override with a label and a reason.
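
A minimal sketch of the routing rule and the review record follows. The band of 2.5 to 3.5 and the `ReviewItem` field names are assumptions for illustration; the record simply carries what the reviewer needs to see plus room for an override.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    query: str
    output: str
    judge_score: float
    human_label: Optional[int] = None  # reviewer override, if any
    reason: Optional[str] = None       # why the reviewer overrode the judge


def needs_human_review(judge_score: float, low: float = 2.5, high: float = 3.5) -> bool:
    """True when the judge score falls in the ambiguous middle band."""
    return low <= judge_score <= high


def build_review_queue(scored: list[tuple[str, str, float]]) -> list[ReviewItem]:
    """Collect (query, output, score) triples whose score is too close to call."""
    return [ReviewItem(q, o, s) for q, o, s in scored if needs_human_review(s)]
```

Storing the override reason alongside the label gives you a ready source of examples for refining the rubric or the judge prompt later.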

Reading Eval Results: How to Interpret Quality Tradeoffs

Eval scores rarely tell a clean story. A model change that improves accuracy by 8 percent may also increase latency by 15 percent. A prompt change that improves helpfulness on one query type may reduce quality on another. Reading eval results requires reasoning about tradeoffs, not just identifying the highest number.

Disaggregate by query type

Average scores hide the most important information. A change that moves average quality from 3.8 to 4.0 while dropping quality on your highest-stakes query type from 4.5 to 3.5 is a regression, not an improvement. Always break eval results down by the query categories you defined in your eval set.
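
The effect is easy to reproduce with a small grouping function. The query types and scores below are invented to show an overall average rising while one category falls sharply.

```python
from collections import defaultdict
from statistics import mean


def mean_by_query_type(results: list[tuple[str, float]]) -> dict[str, float]:
    """Group (query_type, score) pairs and return the mean score per type."""
    grouped = defaultdict(list)
    for query_type, score in results:
        grouped[query_type].append(score)
    return {query_type: mean(scores) for query_type, scores in grouped.items()}


# Invented runs: eight FAQ queries and two high-stakes contract queries.
baseline = [("faq", 3.0)] * 8 + [("contract", 5.0)] * 2
candidate = [("faq", 4.0)] * 8 + [("contract", 3.0)] * 2

overall_before = mean(score for _, score in baseline)    # 3.4
overall_after = mean(score for _, score in candidate)    # 3.8
by_type_after = mean_by_query_type(candidate)            # contract drops from 5.0 to 3.0
```

Here the overall mean improves from 3.4 to 3.8 even though the contract category loses two full points, which only the per-type breakdown reveals.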

Track the score distribution, not just the mean

A model with a mean score of 3.8 and narrow distribution (most outputs between 3.5 and 4.1) is preferable to one with a mean of 4.0 but wide distribution (many outputs between 2.0 and 5.0). Variance in AI output quality is a product problem. High variance means unpredictable user experience.
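
A per-run summary that reports spread alongside the mean makes this comparison routine. The sketch below uses the population standard deviation and the share of outputs under a hypothetical "poor" threshold of 3.0; both choices are assumptions, and the two score lists are invented.

```python
from statistics import mean, pstdev


def score_summary(scores: list[float], poor_threshold: float = 3.0) -> dict[str, float]:
    """Summarize a run by its mean, spread, and share of poor outputs."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "share_poor": sum(s < poor_threshold for s in scores) / len(scores),
    }


narrow = score_summary([3.5, 3.7, 3.8, 3.9, 4.1])  # lower mean, tight spread
wide = score_summary([2.0, 3.0, 5.0, 5.0, 5.0])    # higher mean, wide spread
```

The second run has the higher mean (4.0 against 3.8), but one in five of its outputs falls below the threshold while none of the first run's do. The share of poor outputs is often the easier number to reason about, because it maps directly to how many users get a bad experience.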

Measure what users do after seeing the output

The ultimate validation of eval quality is correlation with downstream user behavior. Periodically compare your eval scores against A/B experiment results on the same changes. If your eval score improvement does not correlate with improvement in task completion rate or retention, your eval is measuring the wrong thing.
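
One simple way to check this is to compute the Pearson correlation between eval score changes and the corresponding A/B outcome changes across past launches. The choice of Pearson correlation is an assumption (a rank correlation may be more robust with few data points), and the numbers in the usage note are invented.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)
```

For five hypothetical launches with eval score changes of `[0.3, 0.1, -0.2, 0.4, 0.0]` and task completion changes of `[0.02, 0.01, -0.01, 0.03, 0.00]`, the correlation is above 0.9, suggesting the eval tracks the outcome. A value near zero across many launches would be a signal to revisit the rubric. With only a handful of launches, treat any coefficient as weak evidence.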

Set non-negotiable guardrail metrics

Define a set of metrics that cannot regress, regardless of improvements in other metrics. Typical guardrails are safety violation rate, format compliance rate, and p95 latency. Any change that crosses a guardrail threshold is blocked, even if quality scores improve. Guardrails prevent optimizing the feature into something harmful or unusable.

Eval Maturity Model: Where Are You and What to Build Next

Most AI product teams sit at Level 1 or 2. Knowing where you are tells you what to prioritize next. Do not skip levels: teams that jump to automated pipelines without a validated eval set build automation on a broken foundation.

Level 1: Vibe checks

Quality is assessed by the engineer who built the feature and a PM who clicks through a demo. No defined rubric, no systematic coverage, no comparison to baseline. This is every team's starting point and the most dangerous place to stay.

Next step: Build a 50 to 100 example eval set from real use cases with defined rubric dimensions. Even a simple eval set is better than none.

Level 2: Manual golden set

A defined eval set with human-annotated labels. Quality is assessed by running the model on the eval set and comparing outputs to gold labels by hand. This is better, but it is still manual, slow, and run infrequently.

Next step: Build an LLM judge that reproduces your human labels with 70 percent or higher agreement. This is your automated eval layer.

Level 3: Automated eval pipeline

Heuristic evals and LLM judge evals run automatically on every relevant code or configuration change. Results are compared to baseline and surfaced in a dashboard. Regressions block deployment. This is where most high-performing AI product teams operate.

Next step: Integrate eval results with A/B experiment data to validate correlation between eval scores and user outcomes. Rotate in new examples from production logs quarterly.

Level 4: Continuous production evaluation

A sample of real production traffic is evaluated continuously. Eval scores are tracked as a live metric alongside latency, error rate, and cost. Quality regressions in production trigger alerts within hours, not weeks. The eval set is maintained with a mix of synthetic and real examples and refreshed automatically.

Next step: At this level, invest in custom evaluation dimensions specific to your domain and user needs. Commodity eval infrastructure is solved; competitive advantage comes from evaluation depth.