AI Vendor Evaluation Scorecard Template (with Weights)
TL;DR
Vendor evaluations done on gut feel reliably pick the wrong vendor. This scorecard has 8 dimensions, each with a weight, sub-criteria, and a 1–5 scoring rubric. Run two evaluators in parallel, compare scores, and discuss only the gaps. The total weighted score is the recommendation — not the demo flash.
The 8 Dimensions and Default Weights
Total weights sum to 100. Adjust based on use case — a regulated-industry deployment shifts more weight to data privacy and exit; a consumer-grade prototype shifts more weight to latency and cost.
1. Model Quality
Eval pass rate on YOUR test set, hallucination rate, instruction-following. Not vendor benchmarks.
2. Latency
p50/p95/p99 under your real traffic profile. Test at peak hours.
3. Cost & Unit Economics
Total cost at 1x, 3x, and 10x volume. Include retries, eval traffic, and overage.
4. Data Privacy & Security
Certifications, training opt-out, data residency, BYOK, SSO/SCIM. Non-negotiable for enterprise.
5. Support & SLA
Uptime SLA, P1 response, named TAM, 24/7 coverage.
6. Roadmap & Velocity
Are they shipping on the cadence you need? Talk to current customers about the gap between roadmap and reality.
7. Integration & DX
SDK quality, docs, time to first call, webhook + streaming support.
8. Exit & Portability
Data export, prompt portability, contract terms. The hardest to renegotiate after signing.
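To make the weights concrete, here is a minimal sketch of the default scorecard as data, in Python. The dimension names and weights come from the list above; the key names themselves are illustrative.

```python
# Default scorecard weights (percent). Dimension key names are illustrative.
DEFAULT_WEIGHTS = {
    "model_quality": 20,
    "latency": 12,
    "cost": 15,
    "data_privacy": 18,
    "support_sla": 10,
    "roadmap": 8,
    "integration_dx": 8,
    "exit_portability": 9,
}

# The weights must always sum to 100, even after use-case adjustments.
assert sum(DEFAULT_WEIGHTS.values()) == 100
```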
The 1–5 Scoring Rubric
Use a five-point rubric, not a ten-point. The middle of a ten-point scale is meaningless mush. Five forces commitment.
5 — Best in class
Demonstrably better than any alternative on this dimension. Reference customer can confirm. Worth paying a premium for.
4 — Strong
Meets the bar with margin. No concerns. Roughly comparable to leaders.
3 — Adequate
Meets the minimum bar. No surprises but no advantages either. The default score when a vendor's answer is incomplete.
2 — Concerning
Below the bar in a way that requires a workaround. Document the workaround cost in the scorecard notes.
1 — Disqualifying
On non-negotiable dimensions (security, exit), a 1 ends the evaluation. On others, it materially lowers the weighted total.
Worked Example: Three Vendors, Same Scorecard
Below is an example scorecard comparing three common vendor archetypes: Vendor A (frontier API), Vendor B (vertical AI startup), and Vendor C (incumbent SaaS with bolt-on AI).
| Dimension | Weight | A | B | C |
|---|---|---|---|---|
| Model Quality | 20% | 5 | 4 | 3 |
| Latency | 12% | 4 | 4 | 3 |
| Cost | 15% | 3 | 4 | 2 |
| Data Privacy | 18% | 4 | 5 | 5 |
| Support & SLA | 10% | 3 | 4 | 5 |
| Roadmap | 8% | 5 | 4 | 2 |
| Integration & DX | 8% | 5 | 4 | 3 |
| Exit & Portability | 9% | 4 | 3 | 4 |
| Weighted Total (out of 5) | 100% | 4.11 | 4.09 | 3.42 |
A and B land within 0.02 of each other, effectively a tie. The recommendation in this case is not 'pick A.' It is: rerun the highest-weight dimensions where A and B differ (model quality, privacy, cost) with deeper diligence. The scorecard surfaces where the decision actually lives.
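For transparency, here is a minimal sketch of the arithmetic behind the table: each vendor's 1–5 scores (copied from the rows above) multiplied by the weights and divided by the weight total. The function and key names are illustrative.

```python
DEFAULT_WEIGHTS = {
    "model_quality": 20, "latency": 12, "cost": 15, "data_privacy": 18,
    "support_sla": 10, "roadmap": 8, "integration_dx": 8, "exit_portability": 9,
}

# 1-5 scores per vendor, copied from the worked-example table.
SCORES = {
    "A": {"model_quality": 5, "latency": 4, "cost": 3, "data_privacy": 4,
          "support_sla": 3, "roadmap": 5, "integration_dx": 5, "exit_portability": 4},
    "B": {"model_quality": 4, "latency": 4, "cost": 4, "data_privacy": 5,
          "support_sla": 4, "roadmap": 4, "integration_dx": 4, "exit_portability": 3},
    "C": {"model_quality": 3, "latency": 3, "cost": 2, "data_privacy": 5,
          "support_sla": 5, "roadmap": 2, "integration_dx": 3, "exit_portability": 4},
}

def weighted_total(scores: dict[str, int], weights: dict[str, int]) -> float:
    """Weighted average on the 1-5 scale: sum(score * weight) / sum(weights)."""
    return sum(scores[dim] * w for dim, w in weights.items()) / sum(weights.values())

for vendor, scores in SCORES.items():
    print(vendor, round(weighted_total(scores, DEFAULT_WEIGHTS), 2))
# A 4.11  B 4.09  C 3.42
```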
Two-Evaluator Protocol
Single-evaluator scorecards are biased. Use two evaluators in parallel and reconcile.
Step 1: Assign two evaluators
Pair a PM with an engineer or with a procurement lead. Different lenses surface different gaps.
Step 2: Score independently
No conferring. Both score the same scorecard with notes. Submit before reading each other's scores.
Step 3: Compare and reconcile
Discuss only the dimensions where scores differ by more than 1 point. Rehashing scores you already roughly agree on wastes time.
Step 4: Resolve via evidence, not opinion
If scores differ, ask: what evidence would change my mind? Often the answer is 'a customer reference call' or 'a load test.' Run that, then re-score.
Step 5: Recommendation
Final scorecard signed by both evaluators. The procurement decision references this document, not the demo deck.
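A minimal sketch of the reconciliation step in code, assuming each evaluator's scores live in a simple dimension-to-score dict; the names and example scores here are hypothetical.

```python
def reconciliation_agenda(scores_a: dict[str, int], scores_b: dict[str, int],
                          gap: int = 1) -> list[str]:
    """Return only the dimensions where the two evaluators differ by more than `gap`."""
    return [dim for dim in scores_a if abs(scores_a[dim] - scores_b[dim]) > gap]

# Hypothetical scores from a PM and an engineer evaluating the same vendor.
pm_scores  = {"model_quality": 5, "cost": 3, "data_privacy": 4, "exit_portability": 4}
eng_scores = {"model_quality": 5, "cost": 5, "data_privacy": 4, "exit_portability": 2}

print(reconciliation_agenda(pm_scores, eng_scores))
# ['cost', 'exit_portability'] -- the only two dimensions worth a reconciliation meeting
```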
Disqualification Floors
Some dimensions have a hard floor. A score below the floor disqualifies the vendor regardless of the weighted total.
Data Privacy < 3 → disqualified
No SOC 2 Type II, training on customer data by default, or no data residency option is a procurement no-go in most enterprises.
Exit & Portability < 2 → disqualified
No data export or no termination clause is a multi-year hostage situation. Not negotiable.
Latency p99 above your customer-facing SLA → disqualified
If the vendor cannot hit your SLA at peak, no other dimension matters.
Company viability concerns (<9 months runway) → escalate, do not auto-disqualify
Sometimes the right vendor is fragile. If you proceed, negotiate source-code escrow and a transition allowance up front.
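A minimal sketch of how the hard floors sit on top of the weighted total, assuming the same dimension keys as the earlier sketches. The latency and viability checks depend on external evidence (load tests, financials) rather than a 1–5 score, so latency is passed in as a plain boolean here.

```python
# Hard floors from this section: a score below the floor disqualifies the vendor
# regardless of the weighted total. Dimension key names are illustrative.
FLOORS = {
    "data_privacy": 3,      # Data Privacy < 3 -> disqualified
    "exit_portability": 2,  # Exit & Portability < 2 -> disqualified
}

def floor_violations(scores: dict[str, int], meets_latency_sla: bool) -> list[str]:
    """Return the reasons a vendor is disqualified, or an empty list if none."""
    reasons = [dim for dim, floor in FLOORS.items() if scores.get(dim, 0) < floor]
    if not meets_latency_sla:
        reasons.append("p99 latency above customer-facing SLA")
    return reasons

vendor = {"data_privacy": 5, "exit_portability": 1}
print(floor_violations(vendor, meets_latency_sla=True))
# ['exit_portability'] -- disqualified no matter how strong the rest of the scorecard is
```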
How to Adjust Weights for Your Use Case
Each profile below lists deltas from the default weights; after applying them, trim the untouched dimensions so the total still sums to 100.
Regulated industry (health, finance, legal)
Privacy 25% (+7), Exit 14% (+5), Cost 10% (-5), Roadmap 5% (-3). Compliance dimensions dominate.
Consumer / high-volume / cost-sensitive
Cost 25% (+10), Latency 18% (+6), Roadmap 5% (-3), Exit 5% (-4), Privacy 12% (-6). Unit economics dominate.
Startup / prototype / time-to-market
Integration & DX 18% (+10), Roadmap 12% (+4), Cost 18% (+3), Privacy 10% (-8). Speed of integration dominates.
Mission-critical / agent-based / long-running workflows
Model Quality 28% (+8), Support & SLA 15% (+5), Latency 18% (+6), Cost 10% (-5), Roadmap 5% (-3). Reliability dominates.
Typical mid-market SaaS
Use the default 20/12/15/18/10/8/8/9 split. Do not over-tune unless you have a clear reason.
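A sketch of the profiles as overrides on the default weights, assuming dimensions not mentioned in a profile keep their defaults. Because the listed deltas do not cancel out on their own, one illustrative way to trim back to 100 is a proportional rescale, shown below; the profile names and helper function are hypothetical.

```python
DEFAULT_WEIGHTS = {
    "model_quality": 20, "latency": 12, "cost": 15, "data_privacy": 18,
    "support_sla": 10, "roadmap": 8, "integration_dx": 8, "exit_portability": 9,
}

# Per-use-case overrides from this section; untouched dimensions keep their defaults.
PROFILES = {
    "regulated": {"data_privacy": 25, "exit_portability": 14, "cost": 10, "roadmap": 5},
    "consumer": {"cost": 25, "latency": 18, "roadmap": 5, "exit_portability": 5, "data_privacy": 12},
    "startup": {"integration_dx": 18, "roadmap": 12, "cost": 18, "data_privacy": 10},
    "mission_critical": {"model_quality": 28, "support_sla": 15, "latency": 18, "cost": 10, "roadmap": 5},
}

def build_weights(profile: str) -> dict[str, float]:
    """Apply a profile's overrides, then rescale proportionally so the total is 100 again."""
    weights = {**DEFAULT_WEIGHTS, **PROFILES[profile]}
    total = sum(weights.values())
    return {dim: w * 100 / total for dim, w in weights.items()}

weights = build_weights("regulated")
print({dim: round(w, 1) for dim, w in weights.items()})  # sums to ~100 after rounding
```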