AI Product Pre-Mortem Template: Kill Your Feature Before It Ships Broken
TL;DR
A pre-mortem is the 60-minute meeting you run before shipping an AI feature where you ask: "It is six months from now and this feature failed completely. What happened?" Unlike a launch checklist (which confirms you did the right things) or a post-mortem (which explains what already went wrong), a pre-mortem surfaces the risks your team is too optimistic or too polite to raise in normal planning. AI features need their own version because the failure taxonomy is different: hallucination edge cases, silent accuracy drift, user trust collapse, and latency SLA failures don't show up on standard project risk registers. This guide gives you the complete template, the AI-specific failure modes, and the workshop format that actually produces useful output.
What a Pre-Mortem Is and Why AI Features Need One
The pre-mortem technique was formalized by psychologist Gary Klein in the 1990s and popularized by Daniel Kahneman in Thinking, Fast and Slow. The mechanic is simple: before a project begins, you tell the team to assume it has already failed catastrophically, then ask them to write down every reason they can think of for why. The technique works because it exploits "prospective hindsight": imagining an event has already occurred makes it easier to generate specific, credible causes instead of vague optimism.
Standard pre-mortems work well for software launches. But AI features fail in ways that standard software does not, and the failure modes are systematically underestimated by product teams without AI experience. The three differences that matter most:
AI failures are probabilistic, not deterministic
A CRUD endpoint either works or it doesn't. An AI feature might work 94% of the time and produce incorrect, offensive, or confidently wrong output the other 6%. That 6% is what kills user trust, and standard launch checklists have no line item for a probabilistic failure rate.
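To see why a small failure share matters so much, here is a minimal sketch of the arithmetic. It assumes each interaction fails independently at the 6% rate used above, which is a simplification (real failures tend to cluster by input type), and the function name is illustrative.

```python
def prob_at_least_one_failure(failure_rate: float, interactions: int) -> float:
    """Chance a user sees one or more bad outputs across `interactions` uses,
    assuming each use fails independently with probability `failure_rate`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if interactions < 0:
        raise ValueError("interactions must be non-negative")
    return 1.0 - (1.0 - failure_rate) ** interactions


# How exposure grows with usage at a 6% per-interaction failure rate.
for n in (1, 10, 20, 50):
    share = prob_at_least_one_failure(0.06, n)
    print(f"{n:>2} uses: {share:.0%} chance of at least one bad output")
```

Under the independence assumption, a user who tries the feature ten times has roughly a 46% chance of hitting at least one bad output, which is why the per-interaction rate understates the trust risk.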
AI failures are often invisible at launch
Model drift, distribution shift, and silent accuracy degradation often don't produce visible errors. The feature appears to work while producing increasingly low-quality output. By the time the data shows the problem, significant user trust damage has occurred.
AI failures interact with user expectations in unpredictable ways
Users calibrate trust based on early experiences. An AI feature that is wrong in a surprising way early in the user journey can produce abandonment rates that don't recover even after the underlying issue is fixed. The failure is not just technical; it is a trust debt that compounds.
The AI Feature Failure Taxonomy
Before running your pre-mortem workshop, give participants this taxonomy. It primes them to think beyond the obvious failure modes ("the API goes down") toward the ones that actually kill AI features.
Hallucination failures
The model confidently asserts false facts, cites sources that don't exist, or generates plausible-sounding but incorrect data. Particularly dangerous in legal, medical, financial, and customer-facing contexts where users assume accuracy.
Evaluation gap failures
Your eval set didn't represent real user input distribution. The model scored 92% on your benchmarks and 61% on production inputs because edge cases, slang, multilingual queries, or domain-specific jargon weren't in the eval set.
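One way to surface this gap before launch is to compare the eval set against a labeled sample of real inputs, segment by segment. The sketch below assumes you can tag each record with a segment label; the segment names and sample data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """records: iterable of (segment, is_correct) pairs. Returns {segment: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, is_correct in records:
        totals[segment] += 1
        correct[segment] += int(is_correct)
    return {seg: correct[seg] / totals[seg] for seg in totals}


def missing_segments(eval_records, production_records):
    """Segments that appear in production but never in the eval set."""
    eval_segments = {seg for seg, _ in eval_records}
    prod_segments = {seg for seg, _ in production_records}
    return sorted(prod_segments - eval_segments)


# Hypothetical labeled samples: (segment, was the output correct?)
eval_records = [
    ("plain_english", True), ("plain_english", True),
    ("jargon", True), ("jargon", False),
]
production_records = [
    ("plain_english", True), ("slang", False),
    ("multilingual", False), ("jargon", True),
]

print(accuracy_by_segment(eval_records))
print(missing_segments(eval_records, production_records))
```

Any segment returned by `missing_segments` is an input class the benchmark score says nothing about.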
Latency SLA failures
The feature performs fine in development and staging. Under production load with real context window sizes, it exceeds your latency budget. Users perceive the feature as broken even though technically it returns correct output.
Cost overrun failures
Token usage is 3x higher in production than estimated. A long-context feature that seemed cheap in testing is consuming 40% of your API budget. You can't roll back without breaking the feature, and you can't afford to keep running it.
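A back-of-envelope projection makes this risk concrete during the workshop. The sketch below is a simple linear cost model; the request volume, token counts, and per-1K-token prices are placeholder assumptions, not any provider's real pricing.

```python
def monthly_cost_usd(requests_per_day, input_tokens, output_tokens,
                     usd_per_1k_input, usd_per_1k_output, days=30):
    """Linear cost model: tokens per request times price, times request volume."""
    per_request = ((input_tokens / 1000) * usd_per_1k_input
                   + (output_tokens / 1000) * usd_per_1k_output)
    return requests_per_day * days * per_request


# Hypothetical prices and volumes, for illustration only.
estimate = monthly_cost_usd(10_000, 2_000, 500, 0.001, 0.002)
# Same traffic, but production prompts and outputs are 3x longer than in testing.
actual = monthly_cost_usd(10_000, 6_000, 1_500, 0.001, 0.002)

print(f"estimated: ${estimate:,.0f}/month")
print(f"actual:    ${actual:,.0f}/month ({actual / estimate:.1f}x the estimate)")
```

Rerunning the projection with token counts sampled from staging traffic, rather than from hand-written test prompts, is one way to check the estimate before launch.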
Distribution shift failures
The model was trained or evaluated on data that no longer represents your users. New user cohorts, seasonal behavior, or product changes create inputs the model has not seen. Accuracy degrades silently over 60-90 days.
User trust failures
The feature is technically functional but users don't trust it. Either the model's tone is off, its errors are embarrassing rather than benign, or it confidently contradicts something the user knows to be true. Adoption plateaus at 15% despite functional correctness.
Prompt injection failures
Malicious inputs manipulate the model into bypassing safety guardrails, leaking system prompts, or taking unintended actions. Especially relevant for any feature that processes user-generated content or external documents.
Model dependency failures
The upstream model provider updates the model without notice, changes pricing, deprecates the version you rely on, or has a multi-hour outage. You have no fallback and no SLA protection.
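A common mitigation to raise in the workshop is a fallback chain. The sketch below is provider-agnostic: `primary_model` and `fallback_model` are hypothetical stand-ins for whatever client calls your stack uses, not a real SDK.

```python
from typing import Callable, Sequence


def call_with_fallback(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider in order and return the first successful response."""
    errors = []
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:  # broad on purpose in a sketch; narrow this in real code
            name = getattr(provider, "__name__", repr(provider))
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real model clients.
def primary_model(prompt: str) -> str:
    raise TimeoutError("upstream outage")


def fallback_model(prompt: str) -> str:
    return f"[fallback] summary of: {prompt}"


print(call_with_fallback("call notes", [primary_model, fallback_model]))
```

A fallback only covers outages. Unannounced model updates and deprecations still need version pinning where the provider offers it, plus scheduled eval runs.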
Running the Pre-Mortem Workshop
The workshop takes 60 minutes. Run it after the feature design is finalized but before development begins. A short rerun immediately before launch can serve as a final review, but not as the first pass: run the exercise too early and the risks are vague; run it too late and the results don't influence anything.
Preparation (before the meeting)
15 minutes of PM time
- Send all participants the one-paragraph feature brief: what the AI does, who uses it, and what success looks like.
- Distribute the failure taxonomy above. Ask people to read it before the meeting.
- Confirm attendance: PM, engineering lead, ML engineer, designer, and one representative from customer success or sales who has heard real user objections.
Phase 1: Individual brainstorm (first 15 minutes)
15 minutes in the meeting
- Facilitator (the PM) says: 'It is now six months from today. This AI feature failed completely. Users stopped using it, we pulled it back, and there is a post-mortem written about why it failed. Please spend the next 10 minutes writing down every reason you can think of that contributed to its failure. Be specific. No idea is too pessimistic.'
- Everyone writes independently. No discussion during this phase.
- Important: the PM also writes their own list. Don't just facilitate.
Phase 2: Round-robin sharing (next 20 minutes)
20 minutes in the meeting
- Each person shares their top 3 failure modes, one at a time in rotation. No discussion or debate while sharing.
- Facilitator captures everything on a shared board.
- After all items are shared, run a dot vote: everyone gets 5 dots to put on the failure modes they consider most likely and most damaging. The vote stack-ranks the list; it does not eliminate anything from it.
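If the shared board can't tally votes for you, the count is easy to script. This is a minimal sketch: the ballot structure, participant names, and failure-mode labels are hypothetical, and ties are broken alphabetically only to keep the order stable.

```python
from collections import Counter

DOTS_PER_PERSON = 5  # from the workshop format above


def tally_dot_votes(ballots):
    """ballots: {participant: [failure_mode, ...]}, one list entry per dot placed.
    Returns (failure_mode, votes) pairs ranked by votes, highest first."""
    counts = Counter()
    for participant, dots in ballots.items():
        if len(dots) > DOTS_PER_PERSON:
            raise ValueError(f"{participant} used {len(dots)} dots; limit is {DOTS_PER_PERSON}")
        counts.update(dots)
    # Sort by votes descending, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ballots from three participants.
ballots = {
    "pm": ["speaker attribution", "speaker attribution", "latency", "cost overrun", "prompt injection"],
    "ml_engineer": ["speaker attribution", "eval gap", "eval gap", "latency", "prompt injection"],
    "cs_rep": ["trust collapse", "trust collapse", "speaker attribution", "latency", "eval gap"],
}

for failure_mode, votes in tally_dot_votes(ballots):
    print(f"{votes} dots: {failure_mode}")
```

Every failure mode stays in the ranked output, which matches the intent of ranking without elimination.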
Phase 3: Top risk deep dives (last 20 minutes)
20 minutes in the meeting
- Take the top 3-5 risks by vote. For each, spend 3-4 minutes answering: 'If this happens, what is the earliest possible signal we would see? What is our mitigation or response plan?'
- Assign each risk an owner who will monitor that specific failure mode.
- Any risk without a mitigation path or monitoring plan becomes an explicit go/no-go criterion. If the team can't identify how they would detect this failure, they should delay launch until they can.
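The go/no-go rule above can be sketched as a simple check over a risk register. The `Risk` fields are an assumed structure for illustration, and the two entries are paraphrased from this article's meeting-summarizer example.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    early_signal: str = ""  # how the team would first detect the failure
    mitigation: str = ""    # what the team would do about it
    owner: str = ""


def launch_blockers(risks):
    """Risks with no detection signal or no mitigation become go/no-go criteria."""
    return [r for r in risks if not r.early_signal.strip() or not r.mitigation.strip()]


def unowned(risks):
    """Risks nobody has agreed to monitor."""
    return [r for r in risks if not r.owner.strip()]


risks = [
    Risk(
        description="Speaker attribution errors on large calls",
        early_signal="AE edit rate above 40% on calls with 5+ attendees",
        mitigation="Disable summarization for 5+ person calls",
        owner="ML engineer",
    ),
    Risk(description="Prompt injection via external meeting links"),
]

for risk in launch_blockers(risks):
    print(f"BLOCKS LAUNCH: {risk.description}")
```

Keeping the register in a structured form, even a spreadsheet, makes it harder for an unmitigated risk to slip through unnoticed.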
Ship AI Features That Survive Production
The AI PM Masterclass covers evaluation design, risk management, and the operational frameworks that keep AI features working after launch. Taught live by a Salesforce Sr. Director PM.
The Template: Fill-In Format
Use this template as a shared document before and during the workshop. Copy it into Notion, Confluence, or a Google Doc. The filled-in version becomes an artifact that lives alongside your PRD.
1. Feature brief
What the AI feature does, who uses it, and the success metric we are trying to move.
Example: 'The AI meeting summarizer generates a structured action-item list from recorded calls within 30 seconds of call end. Target: 25% reduction in time-to-action-item for AEs. Success metric: 80%+ of AEs using the summary without editing within 60 days of launch.'
2. Assumed failure statement
Write a short statement of the failure you are imagining. Make it vivid and specific.
Example: 'It is June 2027. The AI meeting summarizer was pulled from production in February. AE adoption peaked at 23%, below the 80% target. Three enterprise customers threatened churn after the feature attributed incorrect action items to executives during board-level calls. Engineering spent 6 weeks on a patch that improved accuracy but didn't recover user trust.'
3. Failure modes identified (from the workshop)
List all failure modes raised, then mark each: L = likely, D = damaging, LD = both.
Example entry: '[LD] Speaker diarization fails on calls with 5+ participants, causing action items to be attributed to the wrong person.' Include every failure mode raised, even the ones that seem unlikely. Likelihood times damage is the prioritization axis, so LD items go to the top of the list.
4. Top 3-5 risks (from the dot vote)
For each top risk: Risk description / Earliest warning signal / Mitigation plan / Owner
Example: Risk: Speaker attribution errors on large calls. Early signal: AE edit rate above 40% on calls with 5+ attendees. Mitigation: Add participant count as a feature flag threshold. Disable summarization for 5+ person calls until diarization is improved. Owner: [ML engineer name].
5. Go/no-go risks (unmitigated at launch)
List any risks from the top 5 that have no mitigation plan. These are explicit go/no-go conditions: launch is blocked until a mitigation exists.
Example: 'No solution identified for prompt injection via external meeting links. Feature will not launch until indirect prompt injection is addressed in the transcript processing pipeline.'
6. Red flag monitoring plan
List the 3-5 metrics you will track in the first 30 days post-launch as early warning signals.
Example metrics: AE edit rate on summaries (target <30%), user-reported errors via thumbs-down (target <2%), summary generation latency (target p95 <45s), 30-day active usage rate among targeted AEs (target >50%).
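A monitoring plan is easier to enforce when the targets live in code or config rather than only in a document. The sketch below encodes the example targets above; the metric key names and the observed values are hypothetical.

```python
import operator

# Targets copied from the example metrics above; the key names are illustrative.
TARGETS = {
    "edit_rate": (operator.lt, 0.30),
    "thumbs_down_rate": (operator.lt, 0.02),
    "p95_latency_s": (operator.lt, 45.0),
    "active_usage_30d": (operator.gt, 0.50),
}


def red_flags(observed):
    """Return the metrics that miss their target or are not being reported at all."""
    flags = []
    for name, (compare, target) in TARGETS.items():
        value = observed.get(name)
        if value is None or not compare(value, target):
            flags.append(name)
    return flags


# Hypothetical week-2 readings; active usage is not instrumented yet.
observed = {"edit_rate": 0.41, "thumbs_down_rate": 0.01, "p95_latency_s": 38.0}
print(red_flags(observed))
```

Treating a missing metric as a red flag is a deliberate choice here: an unmeasured risk is an unmonitored one.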
Red Flag Indicators to Build Into Your Monitoring Plan
The pre-mortem is only valuable if the risk monitoring actually happens post-launch. These are the leading indicators that AI features are heading toward the failure modes identified in the taxonomy above.
User edit rate above 35%
Signals: Hallucination or low-quality output
Track the rate at which users edit, correct, or override AI-generated output. Above 35% signals the model is not meeting quality expectations. Above 50% signals users are treating the AI as a draft generator rather than a decision tool.
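The two bands above translate directly into a check. This is a minimal sketch that assumes you log how many outputs were shown and how many were edited; the function names and signal labels are illustrative.

```python
def edit_rate(outputs_shown: int, outputs_edited: int) -> float:
    """Share of AI outputs that users edited, corrected, or overrode."""
    if outputs_shown <= 0:
        raise ValueError("outputs_shown must be positive")
    if not 0 <= outputs_edited <= outputs_shown:
        raise ValueError("outputs_edited must be between 0 and outputs_shown")
    return outputs_edited / outputs_shown


def edit_rate_signal(rate: float) -> str:
    """Map an edit rate onto the 35% and 50% bands described above."""
    if rate > 0.50:
        return "users treat output as a draft"
    if rate > 0.35:
        return "quality below expectations"
    return "within tolerance"


rate = edit_rate(outputs_shown=400, outputs_edited=168)
print(f"{rate:.0%}: {edit_rate_signal(rate)}")
```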
Explicit negative feedback above 3%
Signals: Trust collapse or off-tone output
Even a simple thumbs-down mechanism catches user trust failures early. Above 3% explicit negative feedback is a strong signal that specific output types are failing. Segment by user type, input category, and time of day to find the pattern.
P95 latency above 2x the P50
Signals: Latency SLA failure under real load
A large gap between median and 95th percentile latency signals that a specific input class (long context, complex prompts, high-load periods) is creating tail latency issues. Users who hit the slow path will have a qualitatively different experience.
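The ratio can be computed from raw request latencies with the standard library. This is a sketch: the sample data is made up, and the inclusive interpolation method is one reasonable choice among several for estimating percentiles.

```python
import statistics


def tail_latency_ratio(latencies_ms):
    """Return (p50, p95, p95 / p50) for a sample of request latencies."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    if p50 <= 0:
        raise ValueError("median latency must be positive")
    return p50, p95, p95 / p50


# Hypothetical sample: most requests are fast, long-context requests are slow.
sample = [800] * 90 + [2600] * 11
p50, p95, ratio = tail_latency_ratio(sample)
flag = "RED FLAG" if ratio > 2 else "ok"
print(f"p50={p50:.0f}ms p95={p95:.0f}ms ratio={ratio:.2f} ({flag})")
```

Once the ratio crosses 2x, segmenting the slow requests by context length or time of day is a natural next step for finding the input class responsible.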
Feature activation rate declining after day 14
Signals: User trust or habit failure
If users activated the feature in week 1 but aren't using it in week 3, the feature failed to establish a habit. Often caused by early bad outputs that created distrust before the feature had time to prove value. Track weekly active users as a primary health metric, not just launch-day adoption.
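Week-1 to week-3 retention can be computed from a simple usage log. The sketch below assumes usage is recorded as day offsets since launch per user; the data shape and user IDs are hypothetical.

```python
def week_retention(usage_days_by_user, first_week=(0, 6), later_week=(14, 20)):
    """usage_days_by_user: {user_id: set of day offsets since launch with usage}.
    Returns the share of week-1 users who are also active in week 3."""
    def active(days, window):
        low, high = window
        return any(low <= day <= high for day in days)

    week1_users = {u for u, days in usage_days_by_user.items() if active(days, first_week)}
    if not week1_users:
        return 0.0
    retained = {u for u in week1_users if active(usage_days_by_user[u], later_week)}
    return len(retained) / len(week1_users)


# Hypothetical usage log: user "d" joined late and is not part of the week-1 cohort.
usage = {
    "a": {0, 3, 15},
    "b": {1, 2},
    "c": {5, 16, 18},
    "d": {15},
}
print(f"week-1 to week-3 retention: {week_retention(usage):.0%}")
```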
Eval accuracy declining month over month
Signals: Distribution shift or model update impact
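Run your eval set against the production model on a schedule (weekly or monthly). A declining score on the same inputs points to an upstream model update or a change in your prompt or pipeline. A fixed eval set cannot reveal distribution shift on its own, so refresh it periodically with sampled production inputs to close coverage gaps. Don't wait for user complaints to detect this.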
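A scheduled eval job only helps if something reads the trend. This is a minimal sketch of a decline detector over a history of eval scores, oldest first; the alert threshold of two consecutive drops and the sample scores are assumptions.

```python
def declining_streak(scores, min_drop=0.0):
    """Count how many of the most recent consecutive runs fell by more than min_drop."""
    streak = 0
    # Walk backwards through (previous, current) pairs, newest pair first.
    for prev, curr in zip(reversed(scores[:-1]), reversed(scores[1:])):
        if prev - curr > min_drop:
            streak += 1
        else:
            break
    return streak


def should_alert(scores, runs=2):
    """Alert once the score has dropped on `runs` consecutive eval runs."""
    return declining_streak(scores) >= runs


# Hypothetical monthly eval accuracy on the same fixed eval set, oldest first.
history = [0.92, 0.93, 0.91, 0.89, 0.86]
print(declining_streak(history), should_alert(history))
```

Setting `min_drop` above zero filters out run-to-run noise, which matters if the model's outputs are not deterministic.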
Mistakes That Make Pre-Mortems Useless
Pre-mortems are widely recommended and rarely executed well. These are the four failure modes for the pre-mortem itself.
Mistake: Running it the week before launch
Fix: Run the pre-mortem once the design is finalized but before significant engineering investment is locked in. The week before launch, engineering is done, commitments are made, and nobody wants to hear about risks, so a last-minute pre-mortem produces a risk list that nobody acts on.
Mistake: Only inviting optimists
Fix: If your pre-mortem includes only the PM and engineering leads who have been working on the feature for three months, you will get anchored on the risks you already know. Include someone from customer success who hears complaints, a skeptic from a different team, and ideally someone who has seen an AI feature fail in production.
Mistake: Producing a risk list without owners
Fix: A risk list with no names attached is a document that will not be read. Every top-5 risk must have a named owner responsible for monitoring it and escalating if the early warning signal fires. If nobody wants to own a risk, that is itself a signal.
Mistake: Not revisiting it at 30 and 60 days
Fix: Before the pre-mortem meeting ends, schedule two post-launch reviews. At 30 days: check each red flag indicator against the monitoring plan. At 60 days: compare the full pre-mortem output against actual production data. Closing the loop is what teaches the team to take the exercise seriously.