AI systems fail differently than traditional software. Model drift, adversarial inputs, edge cases in training data, and emergent behaviors require specialized documentation. This template helps you capture the unique aspects of AI incidents while following blameless postmortem best practices.
Why AI Postmortems Are Different
AI-Specific Failure Modes
Model-Related
- Model drift over time
- Training data gaps
- Adversarial inputs
- Confidence calibration issues
System-Related
- Prompt injection attacks
- Context window limits
- Rate limiting failures
- Fallback logic gaps
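Several of these failure modes, drift in particular, are invisible unless you track output quality over time. A minimal sketch of a rolling drift monitor (the class name, window size, and threshold are illustrative assumptions, not from any specific tool): it compares a recent window of quality scores against an older baseline window and flags a sustained drop.

```python
from collections import deque

class DriftMonitor:
    """Tracks a rolling quality score (e.g., an eval pass rate per request)
    and flags drift when the recent window drops well below the baseline
    window. Window size and threshold are illustrative; tune per workload."""

    def __init__(self, window: int = 100, drop_threshold: float = 0.10):
        self.baseline = deque(maxlen=window)  # older scores
        self.recent = deque(maxlen=window)    # newest scores
        self.drop_threshold = drop_threshold

    def record(self, score: float) -> None:
        # When the recent window is full, age its oldest score into baseline.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent.popleft())
        self.recent.append(score)

    def drifted(self) -> bool:
        if not self.baseline or not self.recent:
            return False  # not enough history yet
        base = sum(self.baseline) / len(self.baseline)
        now = sum(self.recent) / len(self.recent)
        return (base - now) > self.drop_threshold
```

Feeding this monitor a per-request score (an automated eval, a thumbs-up rate, a regex check) gives you the "Time to Detect" number in section 2 something to be small against.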
The Complete Postmortem Template
Copy and paste this template, then replace the bracketed placeholders with your specific details.
╔══════════════════════════════════════════════════════════════╗
║                    AI INCIDENT POSTMORTEM                    ║
╠══════════════════════════════════════════════════════════════╣
║ Incident ID: [INC-YYYY-MM-###]                               ║
║ Date:        [YYYY-MM-DD]                                    ║
║ Severity:    [S1/S2/S3/S4]                                   ║
║ Status:      [Draft/Review/Final]                            ║
╚══════════════════════════════════════════════════════════════╝

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. EXECUTIVE SUMMARY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

One-Line Summary: [What happened in one sentence]

Duration: [Start time] to [End time] ([X] hours/minutes)

User Impact:
• Users Affected: [Number or percentage]
• Functionality Impacted: [Feature/capability]
• Business Impact: [Revenue, reputation, SLA breach]

Root Cause (One Sentence): [Brief description of why this happened]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2. INCIDENT TIMELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HH:MM UTC] - [Event description]
[HH:MM UTC] - [Event description]
[HH:MM UTC] - DETECTION: [How incident was discovered]
[HH:MM UTC] - RESPONSE: [First response action]
[HH:MM UTC] - MITIGATION: [Temporary fix applied]
[HH:MM UTC] - RESOLUTION: [Permanent fix applied]
[HH:MM UTC] - VERIFICATION: [Confirmed resolved]

Time to Detect (TTD): [X] minutes
Time to Mitigate (TTM): [X] minutes
Time to Resolve (TTR): [X] hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
3. AI/ML SPECIFICS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model Information:
• Model Name/Version: [e.g., gpt-4-0125, claude-3-opus]
• Deployment Date: [When this model version was deployed]
• Provider: [OpenAI/Anthropic/Internal/Other]

Failure Category (check all that apply):
[ ] Model Quality Degradation
[ ] Hallucination/Factual Error
[ ] Harmful/Toxic Output
[ ] Prompt Injection/Jailbreak
[ ] Data Leakage/Privacy
[ ] Bias/Fairness Issue
[ ] Latency/Timeout
[ ] Cost Overrun
[ ] Rate Limiting
[ ] Context Window Exceeded
[ ] Embedding/Retrieval Failure
[ ] Other: [specify]

Input That Triggered Failure:
[Sanitized example of problematic input - remove PII]

Actual Output:
[What the model actually produced - sanitize if needed]

Expected Output:
[What the model should have produced]

Confidence Scores (if applicable):
• Model Confidence: [X%]
• Threshold Setting: [X%]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4. ROOT CAUSE ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

5 Whys Analysis:
1. Why did the incident occur? → [Answer]
2. Why did [Answer 1] happen? → [Answer]
3. Why did [Answer 2] happen? → [Answer]
4. Why did [Answer 3] happen? → [Answer]
5. Why did [Answer 4] happen? → [ROOT CAUSE]

Contributing Factors:
• [Factor 1]
• [Factor 2]
• [Factor 3]

Was this predictable? [ ] Yes [ ] No
If yes, what signals did we miss? [Description]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
5. DETECTION & RESPONSE ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

How was the incident detected?
[ ] Automated Monitoring/Alerting
[ ] User Report
[ ] Internal Testing
[ ] Third Party Report
[ ] Social Media
[ ] Other: [specify]

Detection Gap Analysis:
• Should we have detected this sooner? [Yes/No]
• What monitoring was missing? [Description]
• What threshold should have alerted? [Description]

Response Effectiveness:
• Was the runbook followed? [Yes/No/Partial]
• Was the right team engaged? [Yes/No]
• Were escalation paths clear? [Yes/No]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
6. IMMEDIATE ACTIONS TAKEN
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mitigation Actions:
1. [Action taken to stop the bleeding]
2. [Temporary fix implemented]
3. [Communication sent to users/stakeholders]

Rollback Actions (if applicable):
• Model Rollback: [Previous version deployed]
• Feature Flag: [Feature disabled]
• Traffic Shift: [% of traffic redirected]

User Communication:
• Status Page Updated: [Yes/No]
• User Notification Sent: [Yes/No]
• Support Team Briefed: [Yes/No]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
7. PREVENTION MEASURES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Short-Term (This Sprint):
┌──────────────┬────────┬──────────┬──────────┐
│ Action Item  │ Owner  │ Due Date │ Status   │
├──────────────┼────────┼──────────┼──────────┤
│ [Action 1]   │ [Name] │ [Date]   │ [Status] │
│ [Action 2]   │ [Name] │ [Date]   │ [Status] │
│ [Action 3]   │ [Name] │ [Date]   │ [Status] │
└──────────────┴────────┴──────────┴──────────┘

Medium-Term (This Quarter):
┌──────────────┬────────┬──────────┬──────────┐
│ Action Item  │ Owner  │ Due Date │ Status   │
├──────────────┼────────┼──────────┼──────────┤
│ [Action 1]   │ [Name] │ [Date]   │ [Status] │
│ [Action 2]   │ [Name] │ [Date]   │ [Status] │
└──────────────┴────────┴──────────┴──────────┘

Long-Term (Roadmap):
• [Systemic improvement 1]
• [Systemic improvement 2]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
8. LESSONS LEARNED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What Went Well:
• [Positive aspect of response]
• [Positive aspect of response]

What Went Poorly:
• [Area for improvement]
• [Area for improvement]

Where We Got Lucky:
• [Factor that reduced impact]
• [Factor that could have been worse]

Key Takeaways:
1. [Learning that applies broadly]
2. [Learning that applies broadly]
3. [Learning that applies broadly]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
9. APPENDIX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Participants in Postmortem:
• [Name, Role]
• [Name, Role]
• [Name, Role]

Related Incidents:
• [Link to similar past incident]
• [Link to related incident]

Supporting Evidence:
• Dashboard Link: [URL]
• Log Query: [Query string]
• Slack Thread: [URL]
• Support Tickets: [Ticket IDs]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SIGN-OFF
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Postmortem Author: [Name]
Reviewed By: [Name, Date]
Approved By: [Name, Date]
Next Review Date: [Date to check action items]
Section-by-Section Guidance
Executive Summary
Write this section last, but put it first: executives often read only this part.
- Keep the one-liner under 15 words
- Quantify user impact with real numbers
- Be specific about business impact (revenue, SLA, trust)
AI/ML Specifics
This section is what makes AI postmortems unique. Capture model-specific details.
- Always record exact model version and deployment date
- Sanitize inputs/outputs to remove PII before documenting
- Record confidence scores and the threshold in effect; calibration issues are common
- Include prompt/system instructions if relevant
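Sanitizing inputs and outputs before they go into the document can be partially automated. A minimal regex-based redactor sketch; the patterns below are illustrative and deliberately narrow (emails and US-style phone numbers only), and a production workflow should use a dedicated PII-detection tool rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only: emails and simple US-style phone numbers.
# Real PII scrubbing (names, addresses, IDs) needs a dedicated library.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]

def sanitize(text: str) -> str:
    """Replace obvious PII with placeholder tokens before the text
    is pasted into a postmortem document."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running every "Input That Triggered Failure" and "Actual Output" excerpt through a pass like this makes it harder to leak PII a second time, in the postmortem itself.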
Root Cause Analysis
Use 5 Whys to dig deeper than surface-level causes. AI failures often have systemic roots.
- Don't stop at “the model made a mistake”
- Ask: Why was this input possible? Why wasn't it caught?
- Consider: training data, evaluation gaps, monitoring blind spots
Prevention Measures
Every action item must have an owner and due date. Vague commitments don't get done.
- Prioritize actions that prevent the class of incident, not just this one
- Include monitoring improvements, not just code fixes
- Schedule a follow-up to verify actions were completed
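The owner-and-due-date rule is easy to enforce mechanically before a postmortem is marked Final. A small validation sketch (the field names and the list of vague phrasings are illustrative assumptions, not part of the template):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # a named person, not a team alias
    due: date

def validate(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    if item.due < date.today():
        problems.append("due date is in the past")
    # Reject action items that are just exhortations (illustrative list).
    vague = {"be more careful", "pay more attention"}
    if item.description.strip().lower() in vague:
        problems.append("action is not concrete")
    return problems
```

A check like this could run in whatever tool tracks the action-item tables in section 7, so vague or ownerless items never make it into the final document.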
AI Incident Severity Levels
Critical
Harmful outputs affecting users, data breach, complete feature outage, significant revenue impact. Requires immediate executive notification.
High
Degraded quality affecting many users, incorrect outputs in critical flows, SLA breach. Requires same-day response.
Medium
Quality issues affecting a subset of users, edge-case failures, increased latency. Addressed within the normal sprint cycle.
Low
Minor quality issues, rare edge cases, cosmetic output problems. Logged for pattern analysis, fixed opportunistically.
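These levels can be assigned mechanically at triage time. A sketch of one possible mapping; the attribute names and the percentage cutoffs are illustrative assumptions, not a standard:

```python
def classify_severity(harmful_output: bool, data_breach: bool,
                      full_outage: bool, users_affected_pct: float,
                      critical_flow: bool) -> str:
    """Map incident attributes to a severity level. The cutoffs
    (20% / 1% of users) are illustrative, not standard thresholds."""
    if harmful_output or data_breach or full_outage:
        return "Critical"
    if critical_flow or users_affected_pct >= 20:
        return "High"
    if users_affected_pct >= 1:
        return "Medium"
    return "Low"
```

Encoding the rubric once avoids severity debates during an incident, when every minute of disagreement adds to time-to-mitigate.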
Running a Blameless Postmortem
Do
- Focus on systems and processes, not individuals
- Ask “What made this possible?” not “Who caused this?”
- Assume everyone acted with best intentions and available information
- Create psychological safety for honest discussion
- Celebrate when people speak up about near-misses
Don't
- Name individuals as the cause of failure
- Use language like “should have known” or “failed to”
- Create action items that are just “be more careful”
- Skip the postmortem because it was “just an AI issue”
- Treat postmortems as punishment or performance reviews
Common AI Incident Patterns
Model Quality Incidents
- Provider model update changed behavior
- Input distribution shifted from training data
- Edge case not covered in evaluation
- Prompt changes had unintended effects
Safety Incidents
- Jailbreak or prompt injection succeeded
- Harmful content bypassed filters
- PII leaked in model output
- Bias surfaced in production
Operational Incidents
- Rate limits exceeded unexpectedly
- Cost spike from increased usage
- Latency degradation from provider
- Fallback logic failed to trigger
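"Fallback logic failed to trigger" usually means the failure took a form the fallback never checked for, e.g., a slow or empty success instead of an exception. A sketch of a fallback wrapper that treats those as failures too (`primary` and `fallback` are hypothetical stand-ins for model-client calls; note it measures elapsed time after the call returns, so real timeout enforcement still needs the client's own timeout setting or async cancellation):

```python
import time

def call_with_fallback(primary, fallback, timeout_s: float = 5.0):
    """Call `primary`; fall back on exception, slowness, or empty output.
    `primary` and `fallback` are zero-arg callables returning a string
    (hypothetical stand-ins for model clients)."""
    start = time.monotonic()
    try:
        result = primary()
    except Exception:
        return fallback()
    elapsed = time.monotonic() - start
    # Slow or empty responses are failures too -- a common gap in
    # fallback logic that only catches exceptions.
    if elapsed > timeout_s or not result.strip():
        return fallback()
    return result
```

When a postmortem lands in this category, the useful question is: which of these three failure shapes (exception, slow success, empty success) did the original fallback miss?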
Integration Incidents
- RAG retrieved irrelevant context
- Tool/function calling errors
- Context window exceeded silently
- Caching served stale responses
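"Context window exceeded silently" is often a truncation bug: the prompt is clipped somewhere upstream and the model never sees part of it. A guard sketch that fails loudly instead; the ~4-characters-per-token estimate is a rough heuristic, not a real tokenizer, so production code should use the provider's tokenizer (e.g., tiktoken for OpenAI models):

```python
class ContextWindowError(Exception):
    """Raised instead of silently truncating the prompt."""

def check_context(prompt: str, max_tokens: int) -> int:
    """Estimate the prompt's token count and raise if it won't fit.
    Uses a crude ~4 characters/token heuristic; swap in the provider's
    real tokenizer before relying on this in production."""
    estimated = len(prompt) // 4 + 1
    if estimated > max_tokens:
        raise ContextWindowError(
            f"~{estimated} tokens exceeds the {max_tokens}-token window")
    return estimated
```

An exception here turns a silent quality incident into an explicit, alertable error, which moves detection from "user report" to "automated monitoring" in section 5 of the template.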