AI Feature Prioritization Framework: How to Decide What to Build Next
TL;DR
RICE and ICE break for AI features because the ‘Impact’ estimate assumes the feature works as designed — and for AI features, whether it works is itself uncertain until you build a prototype. This article gives you a four-dimension prioritization framework specific to AI work: quality confidence (how sure are we the model can do this), evaluation readability (can we tell when it is right), cost/margin impact (does this destroy gross margin), and reversibility (can we kill it cleanly if it flops). A worked example scores three real feature candidates, and we close with how to kill AI features that do not earn their cost.
Why Traditional RICE and ICE Break for AI
RICE (Reach × Impact × Confidence / Effort) and its cousin ICE assume the feature does what you said it would do. For deterministic features that is mostly fine — if you ship a CSV export, it exports a CSV. For AI features, the feature behavior itself is uncertain. The model might be 95% accurate on your test set and 60% accurate in the wild. ‘Impact’ is no longer a single number; it is a distribution.
Break 1 — Impact is conditional on quality
An AI summarization feature is worth a lot if it summarizes correctly 90% of the time. It is worth negative value if it hallucinates 20% of the time and users churn. Standard RICE prices these the same.
Break 2 — Confidence is doing two jobs
In classic RICE, Confidence is about your estimate of demand. For AI features it has to also represent your estimate of feasibility — can the model even do this? Stacking those two into one number hides the real risk.
Break 3 — Effort is unstable
Building an AI feature is rarely just engineering effort. It is also eval set creation, prompt iteration, fine-tuning runs, RLHF cycles, and red-teaming. Estimating engineer-weeks misses 50%+ of the real cost.
Break 4 — Marginal cost is non-zero
Most non-AI features have near-zero marginal cost per use. AI features carry a token cost on every call. A feature that ships at 50% gross margin but turns margin-negative if usage grows 10x is a different decision than a $0-marginal-cost feature.
The fix is not to abandon RICE — it is to add four AI-specific dimensions on top. For broader roadmap context, see AI product roadmap strategy.
The Four AI-Specific Scoring Dimensions
Score each candidate feature on these four dimensions from 1-5, then combine them with your existing reach and impact estimates. The result is a priority order that often differs meaningfully from RICE alone — and the framework forces explicit conversations the team would otherwise avoid.
Q — Quality Confidence (1-5)
What it measures: Honest estimate of how reliably the model can perform this task. 5 = we have run experiments and frontier models hit >90% accuracy. 1 = it is a research problem with no demonstrated path.
Signal: If you cannot produce a 10-example test set the model passes today, score 1-2. Do not move forward without a prototype eval.
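The 10-example gate is easy to operationalize. A minimal sketch in Python — `run_model` is a hypothetical stand-in for a real model call (stubbed here so the sketch runs), and the pass-rate-to-Q mapping is our assumption, not a standard:

```python
# Minimal sketch of the 10-example feasibility gate. `run_model` is a
# hypothetical placeholder for a frontier-model call; the mapping from
# pass rate to a provisional Q score is an assumption for illustration.

def run_model(task_input: str) -> str:
    """Placeholder for a real model call (stubbed for illustration)."""
    return task_input.upper()

def provisional_q(examples: list[tuple[str, str]]) -> int:
    """Map pass rate on a small test set to a provisional Q score (1-5)."""
    passed = sum(run_model(inp) == want for inp, want in examples)
    rate = passed / len(examples)
    if rate > 0.9:
        return 5          # model clears the bar today
    if rate >= 0.7:
        return 3          # promising; run a proper spike
    return 2              # cannot pass the test set: score 1-2, no go

tests = [(f"case {i}", f"CASE {i}") for i in range(10)]
print(provisional_q(tests))  # stub passes all 10 → 5
```

Ten examples is deliberately small — the point is a cheap go/no-go signal before the prioritization meeting, not a production eval.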
E — Evaluation Readability (1-5)
What it measures: Can you tell when the output is right? 5 = clean ground truth, automated eval. 1 = subjective quality only judgeable by domain experts at $200/hr.
Signal: Code generation scores high (does the code compile and pass tests?); open-ended writing assistance scores low. The lower the score, the longer the path to ship.
C — Cost / Margin Impact (1-5)
What it measures: 5 = no per-call cost or cost is trivial relative to ARPU. 1 = unbounded inference cost that scales with usage and could break gross margin at adoption.
Signal: Calculate cost per active user per month at expected usage. If it exceeds 20% of ARPU, score 1-2. Both Microsoft and Anthropic have reportedly seen Copilot-style features at flat pricing go margin-negative for power users.
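The back-of-envelope calculation takes two minutes. A sketch with illustrative (hypothetical) numbers — the thresholds below the 20%-of-ARPU line are our assumed banding, not from any standard:

```python
# Back-of-envelope C score. Usage, per-call cost, and ARPU are
# hypothetical; the sub-20% score bands are an assumed mapping.

def cost_score(monthly_cost_per_user: float, arpu: float) -> int:
    """Map inference cost as a share of ARPU to the C dimension (1-5)."""
    share = monthly_cost_per_user / arpu
    if share < 0.01:
        return 5
    if share < 0.05:
        return 4
    if share < 0.10:
        return 3
    if share < 0.20:
        return 2
    return 1  # >20% of ARPU: margin risk, score 1-2

calls_per_user = 200           # hypothetical monthly usage per active user
cost_per_call = 0.004          # hypothetical blended token cost, USD
monthly_cost = calls_per_user * cost_per_call   # $0.80/user/month
print(cost_score(monthly_cost, arpu=30.0))      # 0.80 / 30 ≈ 2.7% → 4
```

Run it again at 10x the expected usage — if the score drops to 1-2, you have the Break 4 problem from above.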
R — Reversibility (1-5)
What it measures: 5 = we can ship behind a flag, A/B test, and turn it off in minutes with zero customer impact. 1 = it is in the core flow, in the contract, and customers will protest loudly if removed.
Signal: Features sold as deliverables in enterprise contracts are 1s. Features in side panels behind opt-in toggles are 4-5s. Ship reversible features first when uncertainty is high.
The composite formula we recommend: AI-Score = (Reach × Impact × Q × E) / (Effort × (6 − R)) × cost multiplier. The cost multiplier is 1.0 if margin is fine, down to 0.3 if the feature would destroy unit economics. The (6 − R) inverts reversibility so low-reversibility features pay an effort tax.
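In code the composite is a one-liner. A minimal sketch — how you map the C score onto the 1.0-to-0.3 cost multiplier is left to you, so `cost_mult` is passed in directly:

```python
# The composite AI-Score from the formula above. `cost_mult` is the
# cost multiplier: 1.0 when margin is fine, down to 0.3 when the
# feature would destroy unit economics.

def ai_score(reach: float, impact: float, q: int, e: int,
             effort: float, r: int, cost_mult: float = 1.0) -> float:
    """(Reach × Impact × Q × E) / (Effort × (6 − R)) × cost multiplier."""
    return (reach * impact * q * e) / (effort * (6 - r)) * cost_mult

# Sanity check with Feature A's numbers from the worked example below:
print(ai_score(reach=5, impact=2, q=5, e=4, effort=1, r=5))  # 200.0
```

Note how (6 − R) behaves: a fully reversible feature (R=5) divides by 1 and pays no tax, while a contractual, hard-to-remove feature (R=1) divides by 5.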
A Worked Example: Three Features Scored
Imagine a B2B project management tool deciding between three AI features for Q3. Standard RICE would rank them differently from the AI-adjusted score.
Feature A — AI auto-titles for new tasks
RICE only: Reach: 5 (every user creates tasks). Impact: 2 (saves a few seconds). Confidence: 4. Effort: 1. RICE = 40.
AI-adjusted: Q = 5 (LLMs are excellent at this). E = 4 (does the title fit the task?). C = 4 (low token use). R = 5 (kill instantly with a flag). AI-Score: very high. Ship first.
Verdict: Ship Q3 week 1. Low risk, low cost, immediate adoption signal.
Feature B — AI project status report generator
RICE only: Reach: 3 (PMs only). Impact: 5 (replaces 2 hours/week of writing). Confidence: 3. Effort: 3. RICE = 15.
AI-adjusted: Q = 3 (quality varies by project complexity). E = 2 (hard to evaluate without PMs reviewing). C = 3 (long-context generation, moderate cost). R = 3 (medium reversibility — PMs come to depend on it). AI-Score: medium. Ship behind eval gate.
Verdict: Pilot with 10 customers in Q3, expand in Q4 only if eval scores stay above 4/5.
Feature C — AI agent that auto-resolves dependency conflicts
RICE only: Reach: 4. Impact: 5. Confidence: 4. Effort: 4. RICE = 20.
AI-adjusted: Q = 2 (multi-step agent reasoning is the cutting edge, not solved). E = 1 (correctness is subjective and depends on org context). C = 1 (agent loops can run 20-50 model calls per resolution). R = 2 (if the agent silently makes wrong calls, trust is hard to rebuild). AI-Score: low. Hold.
Verdict: Defer to 2027. The capability is not stable enough, the evals are not legible enough, and the cost is unbounded. Reconsider when frontier model agent benchmarks cross a threshold.
Standard RICE would have ranked Feature A > C > B. The AI-adjusted framework ranks A > B > C, with C deferred entirely. The difference is in the model and cost realities that RICE alone obscures.
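To make the comparison concrete, here is the scoring run as a self-contained sketch. The cost multipliers (1.0 for A and B, 0.3 for C's unbounded agent loops) are our assumed mapping from the C scores:

```python
# Score the three worked-example features with the composite formula.
def ai_score(reach, impact, q, e, effort, r, cost_mult=1.0):
    return (reach * impact * q * e) / (effort * (6 - r)) * cost_mult

scores = {
    "A: auto-titles":      ai_score(5, 2, 5, 4, effort=1, r=5),                 # 200.0
    "B: status reports":   ai_score(3, 5, 3, 2, effort=3, r=3),                 # 10.0
    "C: dependency agent": ai_score(4, 5, 2, 1, effort=4, r=2, cost_mult=0.3),  # 0.75
}
print(sorted(scores, key=scores.get, reverse=True))  # A > B > C
```

The gap is not subtle: Feature A outscores Feature C by more than two orders of magnitude once quality risk, eval cost, margin, and reversibility are priced in.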
Apply the Framework to Your Backlog
The AI PM Masterclass includes a live prioritization workshop using your actual backlog — taught by a Salesforce Sr. Director PM and former Apple Group PM.
Handling the “We Do Not Know What Is Possible” Problem
The hardest part of AI feature prioritization is that capability moves under you. A feature that scores Q=2 today might score Q=5 in nine months because of a model release. Three tactics keep you from being either too conservative or too optimistic.
Tactic 1 — The 2-day spike
For any feature with Q≤3, allocate a 2-day prototyping spike before the prioritization meeting. Build a stripped-down version against current frontier models and 10 representative inputs. If the spike passes, Q goes up. If it fails, you have a real data point, not a guess. This single change cuts wasted roadmap arguments by ~50% in our experience.
Tactic 2 — Capability watch list
Maintain a list of deferred features tagged with the capability they need (long-context retrieval, multi-step planning, vision OCR, etc.). When a model release ships that materially advances that capability, the watch list is automatically re-scored. Several teams have a Slack channel that pings on new model releases with the deferred features that should be re-evaluated.
Tactic 3 — Pre-mortems on the killable features
Before shipping any feature with R≤2 (low reversibility), run a 30-minute pre-mortem: ‘It is six months from now and we are removing this feature. Why?’ If the team can write more than three credible answers, do not ship it as a non-reversible feature. Either find a way to make it reversible or move it down the list.
See the AI MVP guide for more on how to keep the spike-to-ship cycle tight.
Killing AI Features That Do Not Earn Their Cost
The prioritization framework is incomplete without a sunset framework. AI features have ongoing inference cost, ongoing eval cost, and ongoing risk surface. A feature that ships and then plateaus at low usage is often worse than not shipping it — it consumes attention, margin, and trust budget.
Usage / Cost ratio
If a feature has fewer than 20 active uses per dollar of monthly inference cost, you are probably underpricing or shipping a feature nobody wants. Below this threshold, kill or repackage.
Eval pass rate
If your golden eval set drops below 80% pass rate for two consecutive weeks and you cannot get it back, the feature is either using a degraded model or has hit a quality ceiling. Either way, action required — do not let it rot in production.
Adoption curve shape
AI features that flatten under 15% of eligible users within 60 days rarely recover. By 90 days, you should either be killing the feature or re-architecting the entry point. Side-panel features are particularly prone to this.
Support cost per use
If AI feature usage is driving support tickets (hallucinations, confusing outputs, wrong answers) at a rate that consumes more than 5% of CS capacity, the feature is net-negative on margin even if revenue holds.
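The four sunset rules are simple enough to run as an explicit check in the monthly review. A sketch — thresholds come from the rules above, but the metric field names are hypothetical and should match whatever your analytics actually emit:

```python
# The four sunset rules as explicit checks. Thresholds are from the
# article; the input field names are hypothetical placeholders.

def sunset_flags(f: dict) -> list[str]:
    """Return the sunset rules a shipped AI feature currently violates."""
    flags = []
    if f["uses_per_dollar"] < 20:
        flags.append("usage/cost: <20 active uses per $ of inference")
    if f["eval_pass_rate"] < 0.80 and f["weeks_below_80"] >= 2:
        flags.append("eval: below 80% pass rate for 2+ weeks")
    if f["adoption"] < 0.15 and f["days_live"] >= 60:
        flags.append("adoption: flat under 15% of eligible users")
    if f["support_share"] > 0.05:
        flags.append("support: >5% of CS capacity consumed")
    return flags

report_gen = {"uses_per_dollar": 12, "eval_pass_rate": 0.83,
              "weeks_below_80": 0, "adoption": 0.22, "days_live": 90,
              "support_share": 0.02}
print(sunset_flags(report_gen))  # only the usage/cost rule trips
```

Any non-empty result forces a named decision — kill, repackage, or fix — rather than letting the feature rot in production.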
The single most undervalued PM skill in 2026 is killing AI features cleanly. Teams that ship 8 features and kill 3 outperform teams that ship 8 and quietly drag the underperformers for a year. For cost-side discipline, see AI cost optimization.
A Prioritization Cadence That Works
The framework is a tool. The cadence is what makes it operational. Run this rhythm:
Weekly — Spike day
One engineering pair runs the 2-day spike on the highest-priority Q≤3 feature. Outputs: pass/fail with eval examples, updated Q score.
Biweekly — Re-score the top 10
PM re-scores the top 10 backlog items against the four AI dimensions. Items that drop more than 30% in score get a kill/hold decision. Items that climb get fast-tracked.
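The 30% drop rule is worth encoding so it fires mechanically rather than by feel. A trivial sketch, with illustrative scores:

```python
# Biweekly re-score rule: a >30% drop since the last cycle triggers
# an explicit kill/hold decision. Scores below are illustrative.

def needs_kill_hold_review(prev_score: float, new_score: float) -> bool:
    """True if the AI-Score dropped more than 30% since the last cycle."""
    return new_score < 0.7 * prev_score

print(needs_kill_hold_review(200.0, 120.0))  # True: a 40% drop
print(needs_kill_hold_review(200.0, 150.0))  # False: a 25% drop
```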
Monthly — Cost and adoption review
Pull the cost/usage and eval data for every shipped AI feature. Apply the four sunset rules. Make the kill decisions explicit — do not leave them ambient.
Quarterly — Capability re-eval
Walk the watch list. For each deferred feature, re-run the spike. If frontier model capability has crossed the threshold, promote it back into active prioritization.
The PM teams that run this cadence ship 2-3x more validated AI features per year than teams running quarterly planning with annual roadmaps. The framework is the engine; the cadence is the throttle.