AI Incident Response Plan Template for Product Managers
TL;DR
Public AI failures are the new operational risk. The teams that handle them well have an incident response plan in place before the incident — roles, severity tiers, communication scripts, decision trees. This template gives you all of it as copy-paste content. Adapt it to your team in an afternoon; don't wait until your first 2 AM page.
Why AI Incidents Need Their Own Plan
A standard SaaS incident plan covers outages, data breaches, and bugs. AI incidents add new failure modes: hallucinations going viral, biased outputs surfacing, prompt injection attacks, model regressions silently degrading quality. The response motions are different — you can't restart a hallucinated answer the way you restart a crashed service.
AI-specific incident types
Quality regression, harmful output going public, prompt injection, vendor outage, data leakage via model output, bias incident.
Time-to-detect challenge
AI quality drift can take days to detect, and severity grows quietly in the meantime. Treat detection latency as a tracked metric in its own right.
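One way to track detection latency is to record two timestamps per incident and report the gap between them. The sketch below is a minimal illustration; the function name and the sample dates are hypothetical, and estimating when a regression actually began is assumed to happen elsewhere (for example, from eval history).

```python
from datetime import datetime, timedelta


def detection_latency(regression_started: datetime, detected_at: datetime) -> timedelta:
    """Return the gap between when quality began degrading and when the team noticed."""
    if detected_at < regression_started:
        raise ValueError("detected_at precedes regression_started")
    return detected_at - regression_started


# Hypothetical incident: drift began Friday morning, noticed Monday afternoon.
latency = detection_latency(
    regression_started=datetime(2024, 3, 1, 9, 0),
    detected_at=datetime(2024, 3, 4, 15, 30),
)
print(latency)  # 3 days, 6:30:00
```

Logging this value for every incident makes it possible to see whether monitoring is getting faster or slower over time.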
Time-to-mitigate challenge
You can't patch a bad answer that's already public. Containment is about preventing more bad answers, not reversing the one that escaped.
Communication challenge
Users want to know whether to trust the product. Silence destroys trust faster than imperfect transparency.
Severity Tier Definitions
SEV-1 — Critical
AI is causing user harm, public reputational risk, regulatory exposure, or data leakage. War room within 15 minutes. Exec on call.
SEV-2 — High
AI quality regression affecting many users; incorrect outputs on important workflows. Containment within 1 hour. Eng + PM + comms.
SEV-3 — Moderate
Quality regression on a specific surface or user segment. Containment within 4 hours. PM-led response.
SEV-4 — Low
Single-user reports, minor format issues, edge case failures. Standard ticket triage. No special response motion.
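The four tiers above can be encoded as data so that paging tools and dashboards share one definition. The following is a sketch under assumptions: the boolean signal names and the `classify` function are hypothetical, and a real triage call will involve judgment that a checklist cannot capture. The deadlines and responders come from the tier definitions above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Sev(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class TierPolicy:
    deadline_minutes: Optional[int]  # None means standard ticket triage
    deadline_kind: str
    responders: Tuple[str, ...]


# Values taken from the tier definitions in this template.
POLICIES = {
    Sev.SEV1: TierPolicy(15, "war room", ("exec on call",)),
    Sev.SEV2: TierPolicy(60, "containment", ("eng", "PM", "comms")),
    Sev.SEV3: TierPolicy(240, "containment", ("PM",)),
    Sev.SEV4: TierPolicy(None, "ticket triage", ()),
}


def classify(
    user_harm: bool = False,
    reputational_risk: bool = False,
    regulatory_exposure: bool = False,
    data_leakage: bool = False,
    many_users_affected: bool = False,
    important_workflow: bool = False,
    single_surface_or_segment: bool = False,
) -> Sev:
    """Map hypothetical triage signals to a tier, checking the most severe first."""
    if user_harm or reputational_risk or regulatory_exposure or data_leakage:
        return Sev.SEV1
    if many_users_affected or important_workflow:
        return Sev.SEV2
    if single_surface_or_segment:
        return Sev.SEV3
    return Sev.SEV4


tier = classify(many_users_affected=True)
print(tier.name, POLICIES[tier].deadline_minutes)  # SEV2 60
```

Checking the most severe conditions first means an incident that matches several tiers always lands in the highest one.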
Roles in an AI Incident
Incident Commander
Owns the response. Often a senior PM or eng lead. Makes containment calls; coordinates across functions; drives toward resolution.
Technical Lead
Owns mitigation. Disables features, rolls back prompts/models, deploys fixes. Reports status every 15 minutes during SEV-1/2.
Communications Lead
Owns messaging. Drafts customer comms, internal updates, public statements. Holds the pen on what gets said when.
Subject Matter Experts
ML engineer, safety expert, legal/comms specialist as needed. Pulled in by Incident Commander based on incident type.
First 60 Minutes — The Decision Tree
Minute 0-5: Confirm and classify
Reproduce the issue. Estimate impact (number of users, severity). Set the SEV tier. Page the Incident Commander.
Minute 5-15: Contain
Disable the feature, route around the failure, or roll back. The goal is stopping new bad outputs — not fixing the root cause yet.
Minute 15-30: Diagnose
Form a root-cause hypothesis. Pull eval data, recent changes, and model version diffs. Don't deploy a fix yet.
Minute 30-60: Communicate
Internal update to the team, exec, support. External communication if user-facing impact. Set expectations on resolution.
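The timeline above can be expressed as a small lookup that tells a responder which phase the clock says they should be in. This is a sketch only: the function name is hypothetical, and treating each window as half-open (minute 5 belongs to "Contain", not "Confirm and classify") is an assumption, since the template's ranges share their endpoints.

```python
# (start_minute, end_minute, phase) taken from the first-60-minutes timeline.
PHASES = [
    (0, 5, "Confirm and classify"),
    (5, 15, "Contain"),
    (15, 30, "Diagnose"),
    (30, 60, "Communicate"),
]


def current_phase(minutes_elapsed: float) -> str:
    """Return the phase whose window contains minutes_elapsed (start inclusive, end exclusive)."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    for start, end, name in PHASES:
        if start <= minutes_elapsed < end:
            return name
    return "Beyond the first hour"


print(current_phase(12))  # Contain
```

A bot in the incident channel could call this on a timer to nudge the team when a phase boundary passes.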
Communication Scripts
Initial public statement
"We're aware of an issue affecting [feature]. We've disabled [behavior] while we investigate. We'll update within [timeframe]." Honest, brief, time-bound.
Internal status update
Every 15 minutes during SEV-1/2: current state, mitigation status, next milestone, blocker. Predictable cadence reduces anxiety.
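To keep the four fields and the cadence consistent, the status update can be modeled as a small record. The class below is a hypothetical sketch; the field names mirror the list above, and the one-line rendering format is an assumption rather than a prescribed layout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=15)  # cadence for SEV-1/2 from the template


@dataclass
class StatusUpdate:
    sent_at: datetime
    current_state: str
    mitigation_status: str
    next_milestone: str
    blocker: str = "none"

    def render(self) -> str:
        """Format the update as a single line for a chat channel."""
        return (
            f"[{self.sent_at:%H:%M}] State: {self.current_state} | "
            f"Mitigation: {self.mitigation_status} | "
            f"Next: {self.next_milestone} | Blocker: {self.blocker}"
        )

    def next_due(self) -> datetime:
        """Return when the following update should be sent."""
        return self.sent_at + UPDATE_INTERVAL


update = StatusUpdate(
    sent_at=datetime(2024, 1, 1, 2, 0),
    current_state="feature disabled",
    mitigation_status="prompt rollback in progress",
    next_milestone="confirm rollback on eval set",
)
print(update.render())
print(update.next_due())
```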
Resolution announcement
"The issue is resolved as of [time]. It affected [count] users. Root cause: [brief]. A postmortem will be published within [timeframe]."
Postmortem publication
Within 5 business days. Public for SEV-1; internal for others. Blameless format. Concrete preventive actions with owners.
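The postmortem deadline and audience rule can be computed rather than counted by hand. The sketch below makes several assumptions: business days are Monday through Friday, holidays are ignored, counting starts the day after resolution, and both function names are hypothetical.

```python
from datetime import date, timedelta


def postmortem_due(resolved_on: date, business_days: int = 5) -> date:
    """Return the date that falls business_days weekdays after resolved_on."""
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return due


def postmortem_audience(sev: int) -> str:
    """Public for SEV-1, internal for every other tier."""
    return "public" if sev == 1 else "internal"


# Resolved on Friday 2024-03-01: the deadline is the following Friday.
print(postmortem_due(date(2024, 3, 1)))  # 2024-03-08
print(postmortem_audience(1))  # public
```

Teams with regional holidays would need to extend the weekday check with their own calendar.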