AI Model Evaluation Template: Compare and Select the Right Model
A systematic framework for evaluating AI models across performance, cost, latency, safety, and vendor reliability. Stop making model decisions based on vibes.
Choosing the right AI model is one of the highest-leverage decisions an AI PM makes. Pick wrong and you're stuck with poor user experience, ballooning costs, or months of migration work. This template gives you a structured approach used by AI teams at Stripe, Notion, and Figma to evaluate models systematically.
How to Use This Template
1. Define your evaluation criteria weights based on product priorities
2. Run the same test cases across all candidate models
3. Score each model objectively using the rubrics provided
4. Calculate weighted scores to identify the best fit (see the sketch after this list)
5. Document your decision rationale for future reference
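For example, here is a minimal sketch of the step-4 arithmetic in Python, assuming you have already assigned 1-5 category scores to each candidate. All weights and scores shown are placeholders, not recommendations.

# Minimal sketch of the weighted-score calculation (step 4).
# Replace the placeholder weights and category scores with your own values.

WEIGHTS = {              # must sum to 1.0 (i.e., 100%)
    "quality": 0.30,
    "latency": 0.20,
    "cost": 0.20,
    "safety": 0.15,
    "reliability": 0.10,
    "integration": 0.05,
}

def weighted_score(category_scores: dict) -> float:
    """Combine 1-5 category scores into a single weighted total."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must total 100%"
    return sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)

# Placeholder scores for two candidates.
candidates = {
    "Model A": {"quality": 4, "latency": 3, "cost": 4, "safety": 5, "reliability": 4, "integration": 5},
    "Model B": {"quality": 5, "latency": 4, "cost": 2, "safety": 4, "reliability": 4, "integration": 4},
}

for name in sorted(candidates, key=lambda n: -weighted_score(candidates[n])):
    print(f"{name}: {weighted_score(candidates[name]):.2f} / 5")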
When to Use This Template
Not every model decision needs a formal evaluation. Use this template when the stakes are high.
Use This Template When:
- Launching a new AI-powered feature
- Migrating from one model to another
- Evaluating build vs. buy for ML capabilities
- Comparing vendors for a critical use case
- Justifying model costs to leadership
- Annual model review and optimization
Skip This When:
- Quick prototype or hackathon project
- Testing a hypothesis before full investment
- Model is already mandated by security/legal
- Single obvious choice with no alternatives
- Low-stakes internal tooling
Evaluation Criteria Framework
Every model evaluation should assess six core dimensions. Weight them based on your specific product requirements.
The Six Pillars of Model Evaluation
1. Quality & Accuracy
How well does the model perform on your specific tasks?
2. Latency & Speed
Response time and throughput for your use case
3. Cost & Economics
Total cost of ownership at your expected scale
4. Safety & Compliance
Security, privacy, and regulatory requirements
5. Reliability & Support
Uptime, SLAs, and vendor responsiveness
6. Integration & DX
API quality, documentation, and ease of implementation
Complete Model Evaluation Template
Copy this entire template into your documentation tool. Replace bracketed text with your specific information.
╔══════════════════════════════════════════════════════════════════╗
║                   AI MODEL EVALUATION SCORECARD                    ║
╠══════════════════════════════════════════════════════════════════╣

PROJECT INFORMATION
───────────────────────────────────────────────────────────────────
Project Name: [Feature/Product Name]
Evaluation Lead: [Your Name]
Evaluation Date: [Date]
Decision Deadline: [Date]
Stakeholders: [Engineering Lead, Data Science, Legal, etc.]

USE CASE DEFINITION
───────────────────────────────────────────────────────────────────
Primary Use Case: [e.g., "Customer support ticket classification"]
Input Type: [Text/Image/Audio/Multimodal]
Output Type: [Classification/Generation/Extraction/etc.]
Expected Volume: [Requests per day/month]
Latency Requirement: [e.g., "p95 < 500ms"]
Accuracy Target: [e.g., "95% accuracy on test set"]

CANDIDATE MODELS
───────────────────────────────────────────────────────────────────
Model A: [e.g., GPT-4o]
Model B: [e.g., Claude 3.5 Sonnet]
Model C: [e.g., Gemini 1.5 Pro]
Model D: [e.g., Fine-tuned Llama 3]

╔══════════════════════════════════════════════════════════════════╗
║                   CRITERIA WEIGHTS (Must = 100%)                   ║
╠══════════════════════════════════════════════════════════════════╣

Adjust weights based on your product priorities:

┌─────────────────────────┬────────────┬─────────────────────────┐
│ Criterion               │ Weight     │ Justification           │
├─────────────────────────┼────────────┼─────────────────────────┤
│ Quality & Accuracy      │ [30]%      │ [Why this weight]       │
│ Latency & Speed         │ [20]%      │ [Why this weight]       │
│ Cost & Economics        │ [20]%      │ [Why this weight]       │
│ Safety & Compliance     │ [15]%      │ [Why this weight]       │
│ Reliability & Support   │ [10]%      │ [Why this weight]       │
│ Integration & DX        │ [5]%       │ [Why this weight]       │
├─────────────────────────┼────────────┼─────────────────────────┤
│ TOTAL                   │ 100%       │                         │
└─────────────────────────┴────────────┴─────────────────────────┘

╔══════════════════════════════════════════════════════════════════╗
║                    DETAILED SCORING (1-5 Scale)                    ║
╠══════════════════════════════════════════════════════════════════╣

QUALITY & ACCURACY
──────────────────
Scoring Rubric:
5 = Exceeds requirements, handles edge cases perfectly
4 = Meets all requirements with minor issues
3 = Meets core requirements, some edge case failures
2 = Partially meets requirements, significant gaps
1 = Does not meet minimum requirements

Test Categories:
┌─────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ Test Category           │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ Happy path accuracy     │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Edge case handling      │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Consistency/Reliability │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Format compliance       │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Domain-specific perf    │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ CATEGORY AVERAGE        │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
└─────────────────────────┴─────────┴─────────┴─────────┴─────────┘

LATENCY & SPEED
──────────────────
Scoring Rubric:
5 = < 200ms p95 (instant feel)
4 = 200-500ms p95 (fast)
3 = 500ms-1s p95 (acceptable)
2 = 1-3s p95 (slow but usable)
1 = > 3s p95 (unacceptable)

Measurements:
┌─────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ Metric                  │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ p50 latency (ms)        │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ p95 latency (ms)        │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ p99 latency (ms)        │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Time to first token     │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Throughput (req/min)    │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ CATEGORY SCORE          │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
└─────────────────────────┴─────────┴─────────┴─────────┴─────────┘

COST & ECONOMICS
──────────────────
Scoring Rubric:
5 = < $0.001 per request (negligible)
4 = $0.001-0.01 per request (low)
3 = $0.01-0.05 per request (moderate)
2 = $0.05-0.20 per request (expensive)
1 = > $0.20 per request (very expensive)

Cost Analysis:
┌─────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ Cost Factor             │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ Input cost (per 1M tok) │ $[ ]    │ $[ ]    │ $[ ]    │ $[ ]    │
│ Output cost (per 1M tok)│ $[ ]    │ $[ ]    │ $[ ]    │ $[ ]    │
│ Avg cost per request    │ $[ ]    │ $[ ]    │ $[ ]    │ $[ ]    │
│ Monthly cost @ volume   │ $[ ]    │ $[ ]    │ $[ ]    │ $[ ]    │
│ Annual projected cost   │ $[ ]    │ $[ ]    │ $[ ]    │ $[ ]    │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ CATEGORY SCORE          │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
└─────────────────────────┴─────────┴─────────┴─────────┴─────────┘

SAFETY & COMPLIANCE
──────────────────
Scoring Rubric:
5 = Exceeds all requirements, SOC2/HIPAA certified
4 = Meets all requirements with documentation
3 = Meets most requirements, minor gaps
2 = Significant compliance gaps, requires workarounds
1 = Does not meet compliance requirements

Assessment:
┌─────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ Requirement             │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ Data residency options  │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
│ No training on data     │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
│ SOC2 Type II            │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
│ GDPR compliant          │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
│ Content filtering       │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Audit logging           │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ CATEGORY SCORE          │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
└─────────────────────────┴─────────┴─────────┴─────────┴─────────┘

RELIABILITY & SUPPORT
──────────────────
Scoring Rubric:
5 = 99.99% uptime, 24/7 enterprise support
4 = 99.9% uptime, business hours support
3 = 99% uptime, email support
2 = Occasional outages, community support
1 = Unreliable, no support

Assessment:
┌─────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ Factor                  │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ Published uptime SLA    │ [ ]%    │ [ ]%    │ [ ]%    │ [ ]%    │
│ Historical uptime       │ [ ]%    │ [ ]%    │ [ ]%    │ [ ]%    │
│ Support response time   │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Status page available   │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
│ Dedicated account mgr   │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ CATEGORY SCORE          │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
└─────────────────────────┴─────────┴─────────┴─────────┴─────────┘

INTEGRATION & DEVELOPER EXPERIENCE
──────────────────
Scoring Rubric:
5 = Excellent SDK, comprehensive docs, quick setup
4 = Good SDK, clear docs, reasonable setup
3 = Basic SDK, adequate docs
2 = Limited SDK, sparse docs
1 = No SDK, poor documentation

Assessment:
┌─────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ Factor                  │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ SDK quality             │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Documentation quality   │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
│ Time to first API call  │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Playground available    │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
│ Streaming support       │ [Y/N]   │ [Y/N]   │ [Y/N]   │ [Y/N]   │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ CATEGORY SCORE          │ [ /5]   │ [ /5]   │ [ /5]   │ [ /5]   │
└─────────────────────────┴─────────┴─────────┴─────────┴─────────┘

╔══════════════════════════════════════════════════════════════════╗
║                       FINAL WEIGHTED SCORES                        ║
╠══════════════════════════════════════════════════════════════════╣

┌─────────────────────────┬────────┬─────────┬─────────┬─────────┬─────────┐
│ Criterion               │ Weight │ Model A │ Model B │ Model C │ Model D │
├─────────────────────────┼────────┼─────────┼─────────┼─────────┼─────────┤
│ Quality & Accuracy      │ [30]%  │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Latency & Speed         │ [20]%  │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Cost & Economics        │ [20]%  │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Safety & Compliance     │ [15]%  │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Reliability & Support   │ [10]%  │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
│ Integration & DX        │ [5]%   │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
├─────────────────────────┼────────┼─────────┼─────────┼─────────┼─────────┤
│ WEIGHTED TOTAL          │ 100%   │ [ ]/5   │ [ ]/5   │ [ ]/5   │ [ ]/5   │
│ RANK                    │        │ [ ]     │ [ ]     │ [ ]     │ [ ]     │
└─────────────────────────┴────────┴─────────┴─────────┴─────────┴─────────┘

╔══════════════════════════════════════════════════════════════════╗
║                     DECISION & RECOMMENDATION                      ║
╠══════════════════════════════════════════════════════════════════╣

RECOMMENDED MODEL: [Model Name]

PRIMARY REASONS:
1. [First key reason for selection]
2. [Second key reason for selection]
3. [Third key reason for selection]

TRADE-OFFS ACCEPTED:
1. [What we're giving up by choosing this model]
2. [Mitigation strategy for the trade-off]

RUNNER-UP: [Model Name]
Reason to reconsider: [When we would switch to this model]

MODELS ELIMINATED:
- [Model Name]: [Brief reason for elimination]
- [Model Name]: [Brief reason for elimination]

NEXT STEPS:
□ Get legal/security approval for recommended model
□ Set up API credentials and billing
□ Build integration prototype
□ Define monitoring and alerting
□ Plan migration/rollout timeline

REVIEW SCHEDULE:
- Next evaluation: [Date - typically 6-12 months]
- Trigger for early re-evaluation: [e.g., "Cost exceeds $X/month"]

╚══════════════════════════════════════════════════════════════════╝
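The latency and cost rubrics are mechanical enough to compute straight from raw measurements rather than scoring by hand. A minimal sketch in Python, assuming you have logged per-request latencies and token counts for one candidate; the thresholds mirror the rubrics above, and the sample numbers and prices are placeholders:

import statistics

def latency_score(p95_ms: float) -> int:
    """Map measured p95 latency to the 1-5 latency rubric above."""
    if p95_ms < 200:   return 5   # instant feel
    if p95_ms < 500:   return 4   # fast
    if p95_ms < 1000:  return 3   # acceptable
    if p95_ms < 3000:  return 2   # slow but usable
    return 1                      # unacceptable

def cost_score(cost_per_request: float) -> int:
    """Map average cost per request (USD) to the 1-5 cost rubric above."""
    if cost_per_request < 0.001: return 5
    if cost_per_request < 0.01:  return 4
    if cost_per_request < 0.05:  return 3
    if cost_per_request < 0.20:  return 2
    return 1

# Placeholder measurements for one candidate model.
latencies_ms = [180, 220, 240, 310, 260, 900, 275, 230]      # sampled requests
p95 = statistics.quantiles(latencies_ms, n=20)[18]           # 95th percentile
avg_cost = (1200 / 1e6) * 2.50 + (300 / 1e6) * 10.00         # tokens * $/1M tok (placeholder prices)
print(f"p95={p95:.0f}ms -> latency score {latency_score(p95)}")
print(f"cost=${avg_cost:.4f}/req -> cost score {cost_score(avg_cost)}")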
Designing Effective Test Cases
Your evaluation is only as good as your test cases. Build a comprehensive test set that covers all scenarios your model will encounter in production.
Test Case Categories
Happy Path Tests (40%)
Standard inputs that represent the core use case. These should be easy wins for any capable model.
Edge Case Tests (30%)
Unusual but valid inputs: very long text, special characters, multiple languages, ambiguous requests.
Adversarial Tests (20%)
Attempts to break the model: prompt injection, jailbreaks, confusing instructions, contradictory inputs.
Failure Mode Tests (10%)
Invalid inputs that should trigger graceful failures: empty inputs, wrong format, out-of-scope requests.
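A quick way to keep that mix honest as the suite grows is to check category proportions in code. A minimal sketch, assuming each test case carries a category tag (the target shares mirror the percentages above):

from collections import Counter

TARGET_MIX = {"happy_path": 0.40, "edge_case": 0.30, "adversarial": 0.20, "failure_mode": 0.10}

def check_mix(test_cases: list, tolerance: float = 0.05) -> None:
    """Warn when a category drifts more than `tolerance` from its target share."""
    counts = Counter(tc["category"] for tc in test_cases)
    total = len(test_cases)
    for category, target in TARGET_MIX.items():
        actual = counts.get(category, 0) / total
        flag = "OK" if abs(actual - target) <= tolerance else "OUT OF BALANCE"
        print(f"{category:13s} target {target:.0%}  actual {actual:.0%}  {flag}")

# Placeholder test cases; in practice these come from your eval suite.
check_mix([{"category": "happy_path"}] * 8 + [{"category": "edge_case"}] * 6
          + [{"category": "adversarial"}] * 4 + [{"category": "failure_mode"}] * 2)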
Sample Test Case Template
TEST CASE DOCUMENTATION
───────────────────────────────────────────────────
Test ID: TC-001
Category: [Happy Path / Edge Case / Adversarial / Failure]
Description: [Brief description of what this tests]

INPUT:
"""
[Exact input to send to the model]
"""

EXPECTED OUTPUT CRITERIA:
- [ ] Contains: [Required elements]
- [ ] Format: [Expected structure]
- [ ] Tone: [Expected style]
- [ ] Length: [Approximate length]

SCORING RUBRIC:
5 = Perfect match to all criteria
4 = Minor deviation, still acceptable
3 = Meets minimum requirements
2 = Partial success, needs improvement
1 = Complete failure

ACTUAL RESULTS:
┌─────────┬───────────────────────────────────────┬───────┐
│ Model   │ Output Summary                        │ Score │
├─────────┼───────────────────────────────────────┼───────┤
│ Model A │ [Brief summary of output]             │ [ /5] │
│ Model B │ [Brief summary of output]             │ [ /5] │
│ Model C │ [Brief summary of output]             │ [ /5] │
└─────────┴───────────────────────────────────────┴───────┘

NOTES: [Any observations or surprises from this test]
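Where the expected-output criteria are objective (contains, format, length), you can turn them into automated checks instead of eyeballing every transcript; subjective criteria like tone still need a human score. A minimal sketch, assuming a JSON classification output with hypothetical "label" and "confidence" fields:

import json

def check_output(raw_output: str) -> dict:
    """Automated checks for TC-001-style criteria; subjective ones (e.g., tone) still need human review."""
    checks = {}
    try:
        parsed = json.loads(raw_output)
        checks["format: valid JSON"] = True
        checks["contains: 'label' field"] = "label" in parsed
        checks["contains: 'confidence' field"] = "confidence" in parsed
    except json.JSONDecodeError:
        checks["format: valid JSON"] = False
    checks["length: under 500 chars"] = len(raw_output) < 500
    return checks

# Placeholder outputs from two candidate models for the same test case.
for model, output in {
    "Model A": '{"label": "billing", "confidence": 0.92}',
    "Model B": "The ticket looks like it is about billing.",
}.items():
    results = check_output(output)
    print(model, f"{sum(results.values())}/{len(results)} checks passed", results)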
Common Evaluation Mistakes to Avoid
1. Testing on Demo Prompts Only
Vendor demos are cherry-picked to show the model at its best. Always test with YOUR actual production data and edge cases.
2. Ignoring Cost at Scale
A model that's 10% better but 5x more expensive may not be the right choice. Always project costs at your expected 12-month volume.
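The projection itself is simple arithmetic. A minimal sketch, assuming token-based pricing; the request volume, token counts, and per-1M-token prices are placeholders to illustrate the calculation, not real vendor rates:

def annual_cost(requests_per_month: float, avg_input_tokens: float, avg_output_tokens: float,
                input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Project 12-month spend from per-request token usage and per-1M-token prices (USD)."""
    per_request = (avg_input_tokens * input_price_per_1m + avg_output_tokens * output_price_per_1m) / 1e6
    return per_request * requests_per_month * 12

# Placeholder comparison: a cheaper model vs. a pricier one at 500K requests/month.
cheap = annual_cost(500_000, 1_000, 250, input_price_per_1m=0.50, output_price_per_1m=1.50)
pricey = annual_cost(500_000, 1_000, 250, input_price_per_1m=2.50, output_price_per_1m=10.00)
print(f"Cheaper model: ${cheap:,.0f}/year")
print(f"Pricier model: ${pricey:,.0f}/year ({pricey / cheap:.1f}x)")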
3. Evaluating Once and Forgetting
Models improve rapidly. The winner from 6 months ago may be outclassed by newer options. Schedule regular re-evaluations.
4. Skipping Latency Under Load
p50 latency looks great until you're rate-limited during peak traffic. Test at 2-3x your expected peak volume.
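A minimal load-test sketch along these lines, where call_model() is a stand-in you would replace with your real API client; it fires concurrent requests and reports p50/p95 latency at your expected peak and at 3x that level:

import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> float:
    """Stand-in for a real API call; replace with your client and return latency in ms."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.05, 0.4))    # simulate a network round trip
    return (time.perf_counter() - start) * 1000

def load_test(concurrency: int, total_requests: int) -> None:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(call_model, ["test prompt"] * total_requests))
    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=20)[18]
    print(f"concurrency={concurrency}: p50={p50:.0f}ms  p95={p95:.0f}ms  n={len(latencies)}")

# Compare expected peak vs. 3x peak, per the advice above.
load_test(concurrency=10, total_requests=100)
load_test(concurrency=30, total_requests=300)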
5. Not Involving Engineering Early
The "best" model is useless if it can't integrate with your stack. Include engineering in evaluation criteria from day one.
Quick Decision Matrix
Don't have time for a full evaluation? Use this quick matrix to make a defensible decision in 30 minutes.
Speed Evaluation Checklist
5 min: Run 5 representative prompts through each model's playground
5 min: Check pricing pages and calculate cost for 10K requests/month
5 min: Review status page history for last 90 days
5 min: Skim API documentation and check SDK availability
5 min: Verify compliance certifications meet your requirements
5 min: Document decision with brief justification
Related Templates
AI Feature PRD Template
Comprehensive product requirements document template designed for AI features.
AI Buy vs Build Decision Framework
Framework for deciding when to build custom AI vs. buy off-the-shelf solutions.
Master AI Product Evaluation
Learn advanced model evaluation techniques, vendor negotiation, and AI architecture decisions in our comprehensive AI PM certification program.
Explore the Masterclass