AI Product North Star Metrics: How to Choose the Right One
TL;DR
Accuracy is not a north star. Neither is engagement, by itself. The right north star for an AI product captures the user benefit when the AI works and the cost to the user when it doesn't. This guide walks through the four north star archetypes for AI products, how to pick yours, and the supporting metric stack that prevents Goodhart's law from eating you alive.
Why "Accuracy" Is the Wrong North Star
Accuracy measures the model. A north star measures the product. They overlap less than you'd hope. A model that's 95% accurate but slow, expensive, or hard to act on may produce a worse product than a 90% accurate one that's instant and integrated. The north star you pick should incorporate the model's contribution — not equate to it.
Task completion north stars
"User completed the task using AI assistance." Used in copilots, search, support automation. Closest to user value.
Time-saved north stars
"Hours of user time saved per week." Strong for workflow products. Easy to translate into ROI for buyers.
Volume-handled north stars
"% of tickets/calls/queries successfully resolved without human escalation." Right for operational AI products.
Trust-bounded north stars
"Successful AI-assisted outcomes weighted by user confidence." The right north star when stakes are high (legal, medical).
Picking Yours: A Three-Question Test
The right north star answers three questions in the affirmative. If any of the three fails, you've picked the wrong metric and your team will optimize the wrong thing.
1. Does this metric capture real user value?
If the metric goes up but users aren't happier, the metric is broken. Tie north stars to outcomes users would name.
2. Does this metric reflect AI's actual contribution?
If the metric would move with or without AI, it's not an AI product north star — it's a general product north star.
3. Does this metric resist gaming?
Could the team hit the number by degrading the experience? If yes, you have a Goodhart problem. Pair the metric with guardrail metrics.
The Supporting Metric Stack
A north star without a supporting stack is fragile. The stack triangulates: model-level metrics tell you whether the engine is healthy; user-level metrics tell you whether the experience is working; business-level metrics tell you whether it pays off.
Model layer metrics
Accuracy, hallucination rate, latency, cost-per-task. The engine. Necessary but not sufficient.
Experience layer metrics
Acceptance rate, edit rate, escalation rate, retry rate. Tells you what users do with the model output.
Outcome layer metrics
Task completion rate, time-to-task-completion, retention, NPS. The user's perceived benefit.
Business layer metrics
Conversion, ARR, expansion. Whether the product makes money. Usually correlated with — but not equal to — outcomes.
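The triangulation described above can be sketched as a weekly health check across the four layers. This is a minimal illustration, not a prescribed implementation: every metric name and threshold below is an assumption chosen for the example.

```python
# Hypothetical weekly metric snapshot spanning the four layers.
# All field names and thresholds are illustrative, not prescriptive.
snapshot = {
    "model": {"accuracy": 0.93, "hallucination_rate": 0.02, "p95_latency_s": 1.8},
    "experience": {"acceptance_rate": 0.41, "edit_rate": 0.22, "escalation_rate": 0.07},
    "outcome": {"task_completion_rate": 0.68, "weekly_retention": 0.54},
    "business": {"paid_conversion": 0.06},
}

def triangulate(snapshot: dict) -> list[str]:
    """Flag layers that disagree with a healthy-looking north star."""
    warnings = []
    if snapshot["model"]["hallucination_rate"] > 0.05:
        warnings.append("model: hallucination rate above floor")
    if snapshot["experience"]["escalation_rate"] > 0.10:
        warnings.append("experience: users escalating away from the AI")
    if snapshot["outcome"]["weekly_retention"] < 0.40:
        warnings.append("outcome: users not coming back")
    return warnings

print(triangulate(snapshot))  # → [] (all layers agree, so the north star can be trusted)
```

The point of the sketch is the shape, not the numbers: a north star read in isolation is a single number, while a stack read together is a consistency check.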
Pick Your North Star in the Masterclass
The AI PM Masterclass walks through metric design with real case studies from production AI products — taught by a Salesforce Sr. Director PM.
North Star Examples By Product Type
AI coding assistant
Lines of accepted suggestions per active developer per week. Captures both quality (accept rate) and value (volume). GitHub Copilot popularized the framing.
AI search/research product
Successful queries per session, weighted by confidence in citations. Perplexity-style products track this carefully.
AI customer support
Successful auto-resolved tickets per week — "successful" defined by no human re-open within 7 days. Captures both volume and quality.
AI agent / workflow
Tasks completed end-to-end without human intervention, per user per week. Strict bar; the right one.
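The support north star above is concrete enough to compute. Here is a minimal sketch of the "no human re-open within 7 days" definition; the ticket schema is an assumption made up for this example.

```python
from datetime import datetime, timedelta

# Illustrative ticket records; the fields are assumptions, not a real schema.
tickets = [
    {"id": 1, "auto_resolved_at": datetime(2024, 5, 1), "reopened_at": None},
    {"id": 2, "auto_resolved_at": datetime(2024, 5, 2), "reopened_at": datetime(2024, 5, 4)},
    {"id": 3, "auto_resolved_at": datetime(2024, 5, 3), "reopened_at": datetime(2024, 5, 20)},
]

def successful_auto_resolutions(tickets: list[dict], window_days: int = 7) -> int:
    """A resolution counts only if no human re-open occurs within the window."""
    window = timedelta(days=window_days)
    return sum(
        1 for t in tickets
        if t["reopened_at"] is None
        or t["reopened_at"] - t["auto_resolved_at"] > window
    )

print(successful_auto_resolutions(tickets))  # → 2 (ticket 2 was re-opened within 7 days)
```

Note the design choice the window forces: the metric lags by a week, which is the price of counting quality instead of raw deflection volume.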
Guardrails: How to Prevent Goodhart
Pair the north star with quality floors
"Acceptance rate up 10%" means nothing if hallucination rate also went up. Quality floors stop the team from gaming the number.
Track inverse metrics explicitly
Escalation rate, retry rate, abandonment. If your north star is up but inverse metrics are also up, you're winning short-term and losing long-term.
Use multiple lenses
Cohort retention, satisfaction, support load. If three lenses agree, you're probably winning. If they diverge, investigate.
Re-examine the metric quarterly
Products evolve; metrics should too. The north star that's right today may be the wrong one in six months.