AI systems fail differently than traditional software. Model drift, adversarial inputs, edge cases in training data, and emergent behaviors require specialized documentation. This template helps you capture the unique aspects of AI incidents while following blameless postmortem best practices.
Why AI Postmortems Are Different
AI-Specific Failure Modes
Model-Related
- Model drift over time
- Training data gaps
- Adversarial inputs
- Confidence calibration issues
System-Related
- Prompt injection attacks
- Context window limits
- Rate limiting failures
- Fallback logic gaps
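Several of these failure modes, drift in particular, are invisible unless you track output quality over time. A minimal sketch of a rolling drift monitor (the class name, window size, and threshold are illustrative assumptions, not from any specific tool): it compares a recent window of quality scores against an older baseline window and flags a sustained drop.

```python
from collections import deque

class DriftMonitor:
    """Tracks a rolling quality score (e.g., an eval pass rate per request)
    and flags drift when the recent window drops well below the baseline
    window. Window size and threshold are illustrative; tune per workload."""

    def __init__(self, window: int = 100, drop_threshold: float = 0.10):
        self.baseline = deque(maxlen=window)  # older scores
        self.recent = deque(maxlen=window)    # newest scores
        self.drop_threshold = drop_threshold

    def record(self, score: float) -> None:
        # When the recent window is full, age its oldest score into baseline.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent.popleft())
        self.recent.append(score)

    def drifted(self) -> bool:
        if not self.baseline or not self.recent:
            return False  # not enough history yet
        base = sum(self.baseline) / len(self.baseline)
        now = sum(self.recent) / len(self.recent)
        return (base - now) > self.drop_threshold
```

Feeding this monitor a per-request score (an automated eval, a thumbs-up rate, a regex check) gives you the "Time to Detect" number in section 2 something to be small against.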
The Complete Postmortem Template
Copy and paste this template, then replace the bracketed placeholders with your specific details.
╔══════════════════════════════════════════════════════════════╗
║                    AI INCIDENT POSTMORTEM                    ║
╠══════════════════════════════════════════════════════════════╣
║ Incident ID: [INC-YYYY-MM-###]                               ║
║ Date:        [YYYY-MM-DD]                                    ║
║ Severity:    [S1/S2/S3/S4]                                   ║
║ Status:      [Draft/Review/Final]                            ║
╚══════════════════════════════════════════════════════════════╝

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. EXECUTIVE SUMMARY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

One-Line Summary: [What happened in one sentence]

Duration: [Start time] to [End time] ([X] hours/minutes)

User Impact:
• Users Affected: [Number or percentage]
• Functionality Impacted: [Feature/capability]
• Business Impact: [Revenue, reputation, SLA breach]

Root Cause (One Sentence): [Brief description of why this happened]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2. INCIDENT TIMELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HH:MM UTC] - [Event description]
[HH:MM UTC] - [Event description]
[HH:MM UTC] - DETECTION: [How incident was discovered]
[HH:MM UTC] - RESPONSE: [First response action]
[HH:MM UTC] - MITIGATION: [Temporary fix applied]
[HH:MM UTC] - RESOLUTION: [Permanent fix applied]
[HH:MM UTC] - VERIFICATION: [Confirmed resolved]

Time to Detect (TTD): [X] minutes
Time to Mitigate (TTM): [X] minutes
Time to Resolve (TTR): [X] hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
3. AI/ML SPECIFICS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model Information:
• Model Name/Version: [e.g., gpt-4-0125, claude-3-opus]
• Deployment Date: [When this model version was deployed]
• Provider: [OpenAI/Anthropic/Internal/Other]

Failure Category (check all that apply):
[ ] Model Quality Degradation
[ ] Hallucination/Factual Error
[ ] Harmful/Toxic Output
[ ] Prompt Injection/Jailbreak
[ ] Data Leakage/Privacy
[ ] Bias/Fairness Issue
[ ] Latency/Timeout
[ ] Cost Overrun
[ ] Rate Limiting
[ ] Context Window Exceeded
[ ] Embedding/Retrieval Failure
[ ] Other: [specify]

Input That Triggered Failure:
[Sanitized example of problematic input - remove PII]

Actual Output:
[What the model actually produced - sanitize if needed]

Expected Output:
[What the model should have produced]

Confidence Scores (if applicable):
• Model Confidence: [X%]
• Threshold Setting: [X%]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4. ROOT CAUSE ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

5 Whys Analysis:
1. Why did the incident occur? → [Answer]
2. Why did [Answer 1] happen? → [Answer]
3. Why did [Answer 2] happen? → [Answer]
4. Why did [Answer 3] happen? → [Answer]
5. Why did [Answer 4] happen? → [ROOT CAUSE]

Contributing Factors:
• [Factor 1]
• [Factor 2]
• [Factor 3]

Was this predictable? [ ] Yes [ ] No
If yes, what signals did we miss? [Description]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
5. DETECTION & RESPONSE ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

How was the incident detected?
[ ] Automated Monitoring/Alerting
[ ] User Report
[ ] Internal Testing
[ ] Third Party Report
[ ] Social Media
[ ] Other: [specify]

Detection Gap Analysis:
• Should we have detected this sooner? [Yes/No]
• What monitoring was missing? [Description]
• What threshold should have alerted? [Description]

Response Effectiveness:
• Was the runbook followed? [Yes/No/Partial]
• Was the right team engaged? [Yes/No]
• Were escalation paths clear? [Yes/No]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
6. IMMEDIATE ACTIONS TAKEN
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mitigation Actions:
1. [Action taken to stop the bleeding]
2. [Temporary fix implemented]
3. [Communication sent to users/stakeholders]

Rollback Actions (if applicable):
• Model Rollback: [Previous version deployed]
• Feature Flag: [Feature disabled]
• Traffic Shift: [% of traffic redirected]

User Communication:
• Status Page Updated: [Yes/No]
• User Notification Sent: [Yes/No]
• Support Team Briefed: [Yes/No]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
7. PREVENTION MEASURES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Short-Term (This Sprint):
┌──────────────┬────────┬──────────┬──────────┐
│ Action Item  │ Owner  │ Due Date │ Status   │
├──────────────┼────────┼──────────┼──────────┤
│ [Action 1]   │ [Name] │ [Date]   │ [Status] │
│ [Action 2]   │ [Name] │ [Date]   │ [Status] │
│ [Action 3]   │ [Name] │ [Date]   │ [Status] │
└──────────────┴────────┴──────────┴──────────┘

Medium-Term (This Quarter):
┌──────────────┬────────┬──────────┬──────────┐
│ Action Item  │ Owner  │ Due Date │ Status   │
├──────────────┼────────┼──────────┼──────────┤
│ [Action 1]   │ [Name] │ [Date]   │ [Status] │
│ [Action 2]   │ [Name] │ [Date]   │ [Status] │
└──────────────┴────────┴──────────┴──────────┘

Long-Term (Roadmap):
• [Systemic improvement 1]
• [Systemic improvement 2]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
8. LESSONS LEARNED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What Went Well:
• [Positive aspect of response]
• [Positive aspect of response]

What Went Poorly:
• [Area for improvement]
• [Area for improvement]

Where We Got Lucky:
• [Factor that reduced impact]
• [Factor that could have been worse]

Key Takeaways:
1. [Learning that applies broadly]
2. [Learning that applies broadly]
3. [Learning that applies broadly]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
9. APPENDIX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Participants in Postmortem:
• [Name, Role]
• [Name, Role]
• [Name, Role]

Related Incidents:
• [Link to similar past incident]
• [Link to related incident]

Supporting Evidence:
• Dashboard Link: [URL]
• Log Query: [Query string]
• Slack Thread: [URL]
• Support Tickets: [Ticket IDs]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SIGN-OFF
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Postmortem Author: [Name]
Reviewed By: [Name, Date]
Approved By: [Name, Date]
Next Review Date: [Date to check action items]
Section-by-Section Guidance
Executive Summary
Write this section last, but put it first: executives often read only this part.
- Keep the one-liner under 15 words
- Quantify user impact with real numbers
- Be specific about business impact (revenue, SLA, trust)
AI/ML Specifics
This section is what makes AI postmortems unique. Capture model-specific details.
- Always record exact model version and deployment date
- Sanitize inputs/outputs to remove PII before documenting
- Record confidence scores and the threshold in effect; calibration issues are common
- Include prompt/system instructions if relevant
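Sanitizing inputs and outputs before they go into the document can be partially automated. A minimal regex-based redactor sketch; the patterns below are illustrative and deliberately narrow (emails and US-style phone numbers only), and a production workflow should use a dedicated PII-detection tool rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only: emails and simple US-style phone numbers.
# Real PII scrubbing (names, addresses, IDs) needs a dedicated library.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]

def sanitize(text: str) -> str:
    """Replace obvious PII with placeholder tokens before the text
    is pasted into a postmortem document."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running every "Input That Triggered Failure" and "Actual Output" excerpt through a pass like this makes it harder to leak PII a second time, in the postmortem itself.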
Root Cause Analysis
Use 5 Whys to dig deeper than surface-level causes. AI failures often have systemic roots.
- Don't stop at “the model made a mistake”
- Ask: Why was this input possible? Why wasn't it caught?
- Consider: training data, evaluation gaps, monitoring blind spots
Prevention Measures
Every action item must have an owner and due date. Vague commitments don't get done.
- Prioritize actions that prevent the class of incident, not just this one
- Include monitoring improvements, not just code fixes
- Schedule a follow-up to verify actions were completed
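The owner-and-due-date rule is easy to enforce mechanically before a postmortem is marked Final. A small validation sketch (the field names and the list of vague phrasings are illustrative assumptions, not part of the template):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # a named person, not a team alias
    due: date

def validate(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    if item.due < date.today():
        problems.append("due date is in the past")
    # Reject action items that are just exhortations (illustrative list).
    vague = {"be more careful", "pay more attention"}
    if item.description.strip().lower() in vague:
        problems.append("action is not concrete")
    return problems
```

A check like this could run in whatever tool tracks the action-item tables in section 7, so vague or ownerless items never make it into the final document.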
AI Incident Severity Levels
Critical
Harmful outputs affecting users, data breach, complete feature outage, significant revenue impact. Requires immediate executive notification.
High
Degraded quality affecting many users, incorrect outputs in critical flows, SLA breach. Requires same-day response.
Medium
Quality issues affecting a subset of users, edge-case failures, increased latency. Addressed within the normal sprint cycle.
Low
Minor quality issues, rare edge cases, cosmetic output problems. Logged for pattern analysis, fixed opportunistically.
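These levels can be assigned mechanically at triage time. A sketch of one possible mapping; the attribute names and the percentage cutoffs are illustrative assumptions, not a standard:

```python
def classify_severity(harmful_output: bool, data_breach: bool,
                      full_outage: bool, users_affected_pct: float,
                      critical_flow: bool) -> str:
    """Map incident attributes to a severity level. The cutoffs
    (20% / 1% of users) are illustrative, not standard thresholds."""
    if harmful_output or data_breach or full_outage:
        return "Critical"
    if critical_flow or users_affected_pct >= 20:
        return "High"
    if users_affected_pct >= 1:
        return "Medium"
    return "Low"
```

Encoding the rubric once avoids severity debates during an incident, when every minute of disagreement adds to time-to-mitigate.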
Running a Blameless Postmortem
Do
- Focus on systems and processes, not individuals
- Ask “What made this possible?” not “Who caused this?”
- Assume everyone acted with best intentions and available information
- Create psychological safety for honest discussion
- Celebrate when people speak up about near-misses
Don't
- Name individuals as the cause of failure
- Use language like “should have known” or “failed to”
- Create action items that are just “be more careful”
- Skip the postmortem because it was “just an AI issue”
- Treat postmortems as punishment or performance reviews
Common AI Incident Patterns
Model Quality Incidents
- Provider model update changed behavior
- Input distribution shifted from training data
- Edge case not covered in evaluation
- Prompt changes had unintended effects
Safety Incidents
- Jailbreak or prompt injection succeeded
- Harmful content bypassed filters
- PII leaked in model output
- Bias surfaced in production
Operational Incidents
- Rate limits exceeded unexpectedly
- Cost spike from increased usage
- Latency degradation from provider
- Fallback logic failed to trigger
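"Fallback logic failed to trigger" usually means the failure took a form the fallback never checked for, e.g., a slow or empty success instead of an exception. A sketch of a fallback wrapper that treats those as failures too (`primary` and `fallback` are hypothetical stand-ins for model-client calls; note it measures elapsed time after the call returns, so real timeout enforcement still needs the client's own timeout setting or async cancellation):

```python
import time

def call_with_fallback(primary, fallback, timeout_s: float = 5.0):
    """Call `primary`; fall back on exception, slowness, or empty output.
    `primary` and `fallback` are zero-arg callables returning a string
    (hypothetical stand-ins for model clients)."""
    start = time.monotonic()
    try:
        result = primary()
    except Exception:
        return fallback()
    elapsed = time.monotonic() - start
    # Slow or empty responses are failures too -- a common gap in
    # fallback logic that only catches exceptions.
    if elapsed > timeout_s or not result.strip():
        return fallback()
    return result
```

When a postmortem lands in this category, the useful question is: which of these three failure shapes (exception, slow success, empty success) did the original fallback miss?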
Integration Incidents
- RAG retrieved irrelevant context
- Tool/function calling errors
- Context window exceeded silently
- Caching served stale responses
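"Context window exceeded silently" is often a truncation bug: the prompt is clipped somewhere upstream and the model never sees part of it. A guard sketch that fails loudly instead; the ~4-characters-per-token estimate is a rough heuristic, not a real tokenizer, so production code should use the provider's tokenizer (e.g., tiktoken for OpenAI models):

```python
class ContextWindowError(Exception):
    """Raised instead of silently truncating the prompt."""

def check_context(prompt: str, max_tokens: int) -> int:
    """Estimate the prompt's token count and raise if it won't fit.
    Uses a crude ~4 characters/token heuristic; swap in the provider's
    real tokenizer before relying on this in production."""
    estimated = len(prompt) // 4 + 1
    if estimated > max_tokens:
        raise ContextWindowError(
            f"~{estimated} tokens exceeds the {max_tokens}-token window")
    return estimated
```

An exception here turns a silent quality incident into an explicit, alertable error, which moves detection from "user report" to "automated monitoring" in section 5 of the template.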