AI Vendor Evaluation Scorecard Template (with Weights)
TL;DR
Vendor evaluations done on gut feel reliably pick the wrong vendor. This scorecard has 8 dimensions, each with a weight, sub-criteria, and a 1–5 scoring rubric. Run two evaluators in parallel, compare scores, and discuss only the gaps. The total weighted score is the recommendation — not the demo flash.
The 8 Dimensions and Default Weights
Total weights sum to 100. Adjust based on use case — a regulated-industry deployment shifts more weight to data privacy and exit; a consumer-grade prototype shifts more weight to latency and cost.
1. Model Quality
Eval pass rate on YOUR test set, hallucination rate, instruction-following. Not vendor benchmarks.
2. Latency
p50/p95/p99 under your real traffic profile. Test at peak hours.
3. Cost & Unit Economics
Total cost at 1x, 3x, and 10x volume. Include retries, eval traffic, and overage.
4. Data Privacy & Security
Certifications, training opt-out, data residency, BYOK, SSO/SCIM. Non-negotiable for enterprise.
5. Support & SLA
Uptime SLA, P1 response, named TAM, 24/7 coverage.
6. Roadmap & Velocity
Are they shipping on the cadence you need? Talk to current customers about the gap between roadmap and reality.
7. Integration & DX
SDK quality, docs, time to first call, webhook + streaming support.
8. Exit & Portability
Data export, prompt portability, contract terms. The hardest to renegotiate after signing.
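To make the weights concrete, here is a minimal sketch of the default scorecard as data, in Python. The dimension names and weights come from the list above; the key names themselves are illustrative.

```python
# Default scorecard weights (percent). Dimension key names are illustrative.
DEFAULT_WEIGHTS = {
    "model_quality": 20,
    "latency": 12,
    "cost": 15,
    "data_privacy": 18,
    "support_sla": 10,
    "roadmap": 8,
    "integration_dx": 8,
    "exit_portability": 9,
}

# The weights must always sum to 100, even after use-case adjustments.
assert sum(DEFAULT_WEIGHTS.values()) == 100
```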
The 1–5 Scoring Rubric
Use a five-point rubric, not a ten-point. The middle of a ten-point scale is meaningless mush. Five forces commitment.
5 — Best in class
Demonstrably better than any alternative on this dimension. Reference customer can confirm. Worth paying a premium for.
4 — Strong
Meets the bar with margin. No concerns. Roughly comparable to leaders.
3 — Adequate
Meets the minimum bar. No surprises but no advantages either. The default score when a vendor's answer is incomplete.
2 — Concerning
Below the bar in a way that requires a workaround. Document the workaround cost in the scorecard notes.
1 — Disqualifying
On non-negotiable dimensions (security, exit), a 1 ends the evaluation. On others, it materially lowers the weighted total.
Worked Example: Three Vendors, Same Scorecard
Below is an example scorecard comparing three common vendor archetypes: Vendor A (frontier API), Vendor B (vertical AI startup), and Vendor C (incumbent SaaS with bolt-on AI).
| Dimension | Weight | A | B | C |
|---|---|---|---|---|
| Model Quality | 20% | 5 | 4 | 3 |
| Latency | 12% | 4 | 4 | 3 |
| Cost | 15% | 3 | 4 | 2 |
| Data Privacy | 18% | 4 | 5 | 5 |
| Support & SLA | 10% | 3 | 4 | 5 |
| Roadmap | 8% | 5 | 4 | 2 |
| Integration & DX | 8% | 5 | 4 | 3 |
| Exit & Portability | 9% | 4 | 3 | 4 |
| Weighted Total (out of 5) | 100% | 4.11 | 4.09 | 3.42 |
A and B land within 0.02 of each other, effectively a tie. The recommendation in this case is not 'pick A.' It is: rerun the highest-weight dimensions where A and B differ (model quality, privacy, cost) with deeper diligence. The scorecard surfaces where the decision actually lives.
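For transparency, here is a minimal sketch of the arithmetic behind the table: each vendor's 1–5 scores (copied from the rows above) multiplied by the weights and divided by the weight total. The function and key names are illustrative.

```python
DEFAULT_WEIGHTS = {
    "model_quality": 20, "latency": 12, "cost": 15, "data_privacy": 18,
    "support_sla": 10, "roadmap": 8, "integration_dx": 8, "exit_portability": 9,
}

# 1-5 scores per vendor, copied from the worked-example table.
SCORES = {
    "A": {"model_quality": 5, "latency": 4, "cost": 3, "data_privacy": 4,
          "support_sla": 3, "roadmap": 5, "integration_dx": 5, "exit_portability": 4},
    "B": {"model_quality": 4, "latency": 4, "cost": 4, "data_privacy": 5,
          "support_sla": 4, "roadmap": 4, "integration_dx": 4, "exit_portability": 3},
    "C": {"model_quality": 3, "latency": 3, "cost": 2, "data_privacy": 5,
          "support_sla": 5, "roadmap": 2, "integration_dx": 3, "exit_portability": 4},
}

def weighted_total(scores: dict[str, int], weights: dict[str, int]) -> float:
    """Weighted average on the 1-5 scale: sum(score * weight) / sum(weights)."""
    return sum(scores[dim] * w for dim, w in weights.items()) / sum(weights.values())

for vendor, scores in SCORES.items():
    print(vendor, round(weighted_total(scores, DEFAULT_WEIGHTS), 2))
# A 4.11  B 4.09  C 3.42
```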
Two-Evaluator Protocol
Single-evaluator scorecards are biased. Use two evaluators in parallel and reconcile.
Step 1: Assign two evaluators
Pair a PM with an engineer or with a procurement lead. Different lenses surface different gaps.
Step 2: Score independently
No conferring. Both score the same scorecard with notes. Submit before reading each other's scores.
Step 3: Compare and reconcile
Discuss only the dimensions where scores differ by more than 1 point. Rehashing scores you already roughly agree on wastes time.
Step 4: Resolve via evidence, not opinion
If scores differ, ask: what evidence would change my mind? Often the answer is 'a customer reference call' or 'a load test.' Run that, then re-score.
Step 5: Recommendation
Final scorecard signed by both evaluators. The procurement decision references this document, not the demo deck.
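A minimal sketch of the reconciliation step in code, assuming each evaluator's scores live in a simple dimension-to-score dict; the names and example scores here are hypothetical.

```python
def reconciliation_agenda(scores_a: dict[str, int], scores_b: dict[str, int],
                          gap: int = 1) -> list[str]:
    """Return only the dimensions where the two evaluators differ by more than `gap`."""
    return [dim for dim in scores_a if abs(scores_a[dim] - scores_b[dim]) > gap]

# Hypothetical scores from a PM and an engineer evaluating the same vendor.
pm_scores  = {"model_quality": 5, "cost": 3, "data_privacy": 4, "exit_portability": 4}
eng_scores = {"model_quality": 5, "cost": 5, "data_privacy": 4, "exit_portability": 2}

print(reconciliation_agenda(pm_scores, eng_scores))
# ['cost', 'exit_portability'] -- the only two dimensions worth a reconciliation meeting
```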
Disqualification Floors
Some dimensions have a hard floor. A score below the floor disqualifies the vendor regardless of the weighted total.
Data Privacy < 3 → disqualified
No SOC 2 Type II, training on customer data by default, or no data residency option is a procurement no-go in most enterprises.
Exit & Portability < 2 → disqualified
No data export or no termination clause is a multi-year hostage situation. Not negotiable.
Latency p99 above your customer-facing SLA → disqualified
If the vendor cannot hit your SLA at peak, no other dimension matters.
Company viability concerns (<9 months runway) → escalate, do not auto-disqualify
Sometimes the right vendor is fragile. If you proceed, negotiate source-code escrow and a transition allowance up front.
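A minimal sketch of how the hard floors sit on top of the weighted total, assuming the same dimension keys as the earlier sketches. The latency and viability checks depend on external evidence (load tests, financials) rather than a 1–5 score, so latency is passed in as a plain boolean here.

```python
# Hard floors from this section: a score below the floor disqualifies the vendor
# regardless of the weighted total. Dimension key names are illustrative.
FLOORS = {
    "data_privacy": 3,      # Data Privacy < 3 -> disqualified
    "exit_portability": 2,  # Exit & Portability < 2 -> disqualified
}

def floor_violations(scores: dict[str, int], meets_latency_sla: bool) -> list[str]:
    """Return the reasons a vendor is disqualified, or an empty list if none."""
    reasons = [dim for dim, floor in FLOORS.items() if scores.get(dim, 0) < floor]
    if not meets_latency_sla:
        reasons.append("p99 latency above customer-facing SLA")
    return reasons

vendor = {"data_privacy": 5, "exit_portability": 1}
print(floor_violations(vendor, meets_latency_sla=True))
# ['exit_portability'] -- disqualified no matter how strong the rest of the scorecard is
```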
How to Adjust Weights for Your Use Case
Each profile below lists deltas from the default weights; after applying them, trim the untouched dimensions so the total still sums to 100.
Regulated industry (health, finance, legal)
Privacy 25% (+7), Exit 14% (+5), Cost 10% (-5), Roadmap 5% (-3). Compliance dimensions dominate.
Consumer / high-volume / cost-sensitive
Cost 25% (+10), Latency 18% (+6), Roadmap 5% (-3), Exit 5% (-4), Privacy 12% (-6). Unit economics dominate.
Startup / prototype / time-to-market
Integration & DX 18% (+10), Roadmap 12% (+4), Cost 18% (+3), Privacy 10% (-8). Speed of integration dominates.
Mission-critical / agent-based / long-running workflows
Model Quality 28% (+8), Support & SLA 15% (+5), Latency 18% (+6), Cost 10% (-5), Roadmap 5% (-3). Reliability dominates.
Typical mid-market SaaS
Use the default 20/12/15/18/10/8/8/9 split. Do not over-tune unless you have a clear reason.
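A sketch of the profiles as overrides on the default weights, assuming dimensions not mentioned in a profile keep their defaults. Because the listed deltas do not cancel out on their own, one illustrative way to trim back to 100 is a proportional rescale, shown below; the profile names and helper function are hypothetical.

```python
DEFAULT_WEIGHTS = {
    "model_quality": 20, "latency": 12, "cost": 15, "data_privacy": 18,
    "support_sla": 10, "roadmap": 8, "integration_dx": 8, "exit_portability": 9,
}

# Per-use-case overrides from this section; untouched dimensions keep their defaults.
PROFILES = {
    "regulated": {"data_privacy": 25, "exit_portability": 14, "cost": 10, "roadmap": 5},
    "consumer": {"cost": 25, "latency": 18, "roadmap": 5, "exit_portability": 5, "data_privacy": 12},
    "startup": {"integration_dx": 18, "roadmap": 12, "cost": 18, "data_privacy": 10},
    "mission_critical": {"model_quality": 28, "support_sla": 15, "latency": 18, "cost": 10, "roadmap": 5},
}

def build_weights(profile: str) -> dict[str, float]:
    """Apply a profile's overrides, then rescale proportionally so the total is 100 again."""
    weights = {**DEFAULT_WEIGHTS, **PROFILES[profile]}
    total = sum(weights.values())
    return {dim: w * 100 / total for dim, w in weights.items()}

weights = build_weights("regulated")
print({dim: round(w, 1) for dim, w in weights.items()})  # sums to ~100 after rounding
```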