AI Product North Star Metrics: How to Choose the Right One
TL;DR
Accuracy is not a north star. Neither is engagement, by itself. The right north star for an AI product captures the user benefit when the AI works and the cost to the user when it doesn't. This guide walks through the four north star archetypes for AI products, how to pick yours, and the supporting metric stack that prevents Goodhart's law from eating you alive.
Why "Accuracy" Is the Wrong North Star
Accuracy measures the model. A north star measures the product. They overlap less than you'd hope. A model that's 95% accurate but slow, expensive, or hard to act on may produce a worse product than a 90% accurate one that's instant and integrated. The north star you pick should incorporate the model's contribution — not equate to it.
Task completion north stars
"User completed the task using AI assistance." Used in copilots, search, support automation. Closest to user value.
Time-saved north stars
"Hours of user time saved per week." Strong for workflow products. Easy to translate into ROI for buyers.
Volume-handled north stars
"% of tickets/calls/queries successfully resolved without human escalation." Right for operational AI products.
Trust-bounded north stars
"Successful AI-assisted outcomes weighted by user confidence." The right north star when stakes are high (legal, medical).
Picking Yours: A Three-Question Test
The right north star answers three questions in the affirmative. If any of the three fails, you've picked the wrong metric and your team will optimize the wrong thing.
1. Does this metric capture real user value?
If the metric goes up but users aren't happier, the metric is broken. Tie north stars to outcomes users would name.
2. Does this metric reflect AI's actual contribution?
If the metric would move with or without AI, it's not an AI product north star — it's a general product north star.
3. Does this metric resist gaming?
Could the team hit the number by degrading the experience? If yes, you have a Goodhart problem. Pair the metric with guardrail metrics.
The Supporting Metric Stack
A north star without a supporting stack is fragile. The stack triangulates: model-level metrics tell you whether the engine is healthy; user-level metrics tell you whether the experience is working; business-level metrics tell you whether it pays off.
Model layer metrics
Accuracy, hallucination rate, latency, cost-per-task. The engine. Necessary but not sufficient.
Experience layer metrics
Acceptance rate, edit rate, escalation rate, retry rate. Tells you what users do with the model output.
Outcome layer metrics
Task completion rate, time-to-task-completion, retention, NPS. The user's perceived benefit.
Business layer metrics
Conversion, ARR, expansion. Whether the product makes money. Usually correlated with — but not equal to — outcomes.
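The triangulation described above can be sketched as a weekly health check across the four layers. This is a minimal illustration, not a prescribed implementation: every metric name and threshold below is an assumption chosen for the example.

```python
# Hypothetical weekly metric snapshot spanning the four layers.
# All field names and thresholds are illustrative, not prescriptive.
snapshot = {
    "model": {"accuracy": 0.93, "hallucination_rate": 0.02, "p95_latency_s": 1.8},
    "experience": {"acceptance_rate": 0.41, "edit_rate": 0.22, "escalation_rate": 0.07},
    "outcome": {"task_completion_rate": 0.68, "weekly_retention": 0.54},
    "business": {"paid_conversion": 0.06},
}

def triangulate(snapshot: dict) -> list[str]:
    """Flag layers that disagree with a healthy-looking north star."""
    warnings = []
    if snapshot["model"]["hallucination_rate"] > 0.05:
        warnings.append("model: hallucination rate above floor")
    if snapshot["experience"]["escalation_rate"] > 0.10:
        warnings.append("experience: users escalating away from the AI")
    if snapshot["outcome"]["weekly_retention"] < 0.40:
        warnings.append("outcome: users not coming back")
    return warnings

print(triangulate(snapshot))  # → [] (all layers agree, so the north star can be trusted)
```

The point of the sketch is the shape, not the numbers: a north star read in isolation is a single number, while a stack read together is a consistency check.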
Pick Your North Star in the Masterclass
The AI PM Masterclass walks through metric design with real case studies from production AI products — taught by a Salesforce Sr. Director PM.
North Star Examples By Product Type
AI coding assistant
Lines of accepted suggestions per active developer per week. Captures both quality (accept rate) and value (volume). GitHub Copilot popularized the framing.
AI search/research product
Successful queries per session, weighted by confidence in citations. Perplexity-style products track this carefully.
AI customer support
Successful auto-resolved tickets per week — "successful" defined by no human re-open within 7 days. Captures both volume and quality.
AI agent / workflow
Tasks completed end-to-end without human intervention, per user per week. Strict bar; the right one.
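The support north star above is concrete enough to compute. Here is a minimal sketch of the "no human re-open within 7 days" definition; the ticket schema is an assumption made up for this example.

```python
from datetime import datetime, timedelta

# Illustrative ticket records; the fields are assumptions, not a real schema.
tickets = [
    {"id": 1, "auto_resolved_at": datetime(2024, 5, 1), "reopened_at": None},
    {"id": 2, "auto_resolved_at": datetime(2024, 5, 2), "reopened_at": datetime(2024, 5, 4)},
    {"id": 3, "auto_resolved_at": datetime(2024, 5, 3), "reopened_at": datetime(2024, 5, 20)},
]

def successful_auto_resolutions(tickets: list[dict], window_days: int = 7) -> int:
    """A resolution counts only if no human re-open occurs within the window."""
    window = timedelta(days=window_days)
    return sum(
        1 for t in tickets
        if t["reopened_at"] is None
        or t["reopened_at"] - t["auto_resolved_at"] > window
    )

print(successful_auto_resolutions(tickets))  # → 2 (ticket 2 was re-opened within 7 days)
```

Note the design choice the window forces: the metric lags by a week, which is the price of counting quality instead of raw deflection volume.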
Guardrails: How to Prevent Goodhart
Pair the north star with quality floors
"Acceptance rate up 10%" means nothing if hallucination rate also went up. Quality floors stop the team from gaming the number.
Track inverse metrics explicitly
Escalation rate, retry rate, abandonment. If your north star is up but inverse metrics are also up, you're winning short-term and losing long-term.
Use multiple lenses
Cohort retention, satisfaction, support load. If three lenses agree, you're probably winning. If they diverge, investigate.
Re-examine the metric quarterly
Products evolve; metrics should too. The north star that's right today may be the wrong one in six months.