# Post-Mortem

> **Project:** {{PROJECT_NAME}}
> **Version:** {{VERSION}}
> **Date:** {{DATE}}
> **Author:** {{AUTHOR}}
> **Status:** Draft | In Review | Approved
> **Reviewers:** {{REVIEWERS}}

## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1     | {{DATE}} | {{AUTHOR}} | Initial draft |

---

## Blameless Culture Statement

<!-- GUIDANCE: This preamble sets the tone. Post-mortems are about systems, not blame. -->

> This post-mortem is conducted in a **blameless spirit**. Our goal is to understand how and why the incident occurred — not to assign fault to individuals. People make the best decisions they can with the information and tools available at the time. When things go wrong, we look for systemic improvements that make the right action easier and the wrong action harder for everyone.

---

## 1. Incident Reference & Metadata

| Field | Value |
|-------|-------|
| **Incident ID** | INC-{{YYYY}}-{{SEQ}} |
| **Severity** | P{{SEVERITY}} |
| **Incident Report** | [INC-{{YYYY}}-{{SEQ}}](./incident-report.md) |
| **Post-Mortem Facilitator** | {{FACILITATOR}} |
| **Post-Mortem Date** | {{PM_DATE}} |
| **Attendees** | {{ATTENDEES}} |
| **Status** | Draft / In Review / Approved |

---

## 2. Executive Summary

<!-- GUIDANCE: 2-3 sentences. Suitable for sharing with leadership and non-technical stakeholders. -->

{{EXECUTIVE_SUMMARY}}

> Example: "A database index was dropped during a migration on {{DATE}}, causing query performance to degrade by 50× under load. This resulted in a 1h 23min degraded service period affecting {{USERS}} users. We have restored the index, added migration validation tooling, and created safeguards to prevent similar incidents."

---

## 3. Impact Summary

<!-- GUIDANCE: Quantify the impact. Numbers make the case for investing in prevention. -->

| Metric | Value |
|--------|-------|
| **Total duration** | {{DURATION}} (detected at {{DETECTED}}, resolved at {{RESOLVED}}) |
| **Users affected** | {{USER_COUNT}} ({{USER_PERCENT}}% of user base) |
| **Requests affected** | {{REQUEST_COUNT}} ({{REQUEST_PERCENT}}% error rate during incident) |
| **Estimated revenue impact** | ${{REVENUE}} |
| **SLA breach** | {{SLA_BREACH}} <!-- Yes / No / Partial --> |
| **SLA credits owed** | ${{CREDITS}} |

---

## 4. Detailed Timeline

<!-- GUIDANCE: A precise timeline helps identify systemic delays (slow detection, slow response, slow fix). Use it to quantify MTTD and MTTR. -->

```mermaid
timeline
    title Incident Timeline
    {{TIME_1}} : {{EVENT_1}}
    {{TIME_2}} : {{EVENT_2}}
    {{TIME_3}} : {{EVENT_3}}
    {{TIME_4}} : {{EVENT_4}}
    {{TIME_5}} : {{EVENT_5}}
```

<!-- GUIDANCE: If the Mermaid timeline does not render in your tool, use the table below. -->

| Time | Event | MTTD/MTTR Marker |
|------|-------|-----------------|
| {{T1}} | {{EVENT}} <!-- Incident start (user impact begins) --> | ← Incident start |
| {{T2}} | {{EVENT}} <!-- Alert fired --> | |
| {{T3}} | {{EVENT}} <!-- On-call acknowledged --> | ← Detection (MTTD = T3 - T1) |
| {{T4}} | {{EVENT}} <!-- War room opened, investigation started --> | |
| {{T5}} | {{EVENT}} <!-- Root cause identified --> | |
| {{T6}} | {{EVENT}} <!-- Fix applied --> | |
| {{T7}} | {{EVENT}} <!-- Service restored to normal --> | |
| {{T8}} | {{EVENT}} <!-- Incident declared resolved --> | ← Resolved (MTTR = T8 - T1) |

**MTTD (Mean Time to Detect):** {{MTTD}} minutes
**MTTR (Mean Time to Resolve):** {{MTTR}} minutes
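
The two figures above follow directly from the timeline markers (T3 - T1 and T8 - T1). As a minimal sketch, assuming the timeline timestamps are recorded in ISO 8601 format, the arithmetic could look like this. The helper name `minutes_between` and the sample timestamps are hypothetical and only illustrate the calculation.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> float:
    """Return the elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


# Hypothetical sample values; replace with T1, T3, and T8 from the table.
t1_impact_start = "2024-01-15T03:02:00+00:00"
t3_acknowledged = "2024-01-15T03:09:00+00:00"
t8_resolved = "2024-01-15T04:25:00+00:00"

mttd = minutes_between(t1_impact_start, t3_acknowledged)  # T3 - T1
mttr = minutes_between(t1_impact_start, t8_resolved)      # T8 - T1
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

Keeping the time zone offset in every timestamp avoids off-by-hours errors when responders are spread across regions.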

---

## 5. Root Cause Analysis

### 5.1 5 Whys Analysis

| Why # | Question | Answer |
|-------|----------|--------|
| Why 1 | Why did users experience {{SYMPTOM}}? | {{WHY_1}} |
| Why 2 | Why did {{WHY_1}} happen? | {{WHY_2}} |
| Why 3 | Why did {{WHY_2}} happen? | {{WHY_3}} |
| Why 4 | Why did {{WHY_3}} happen? | {{WHY_4}} |
| Why 5 | Why did {{WHY_4}} happen? | {{WHY_5}} |

**Root cause:** {{ROOT_CAUSE}}

### 5.2 Contributing Factors

<!-- GUIDANCE: The root cause rarely acts alone. List all contributing factors. These often become action items. -->

| Factor | Type | Action Required |
|--------|------|----------------|
| {{FACTOR_1}} | Technical / Process / Human | Yes / No |
| {{FACTOR_2}} | Technical / Process / Human | Yes / No |
| {{FACTOR_3}} | Technical / Process / Human | Yes / No |

### 5.3 Trigger Event

**The specific trigger for this incident:** {{TRIGGER}}
<!-- What changed right before the incident? Deployment, config change, traffic spike, external service, data issue? -->

---

## 6. What Went Well

<!-- GUIDANCE: Celebrate what worked. This reinforces the team's strengths and identifies practices to replicate. -->

1. **{{CATEGORY_1}}:** {{DESCRIPTION}}
   <!-- e.g., "Detection: Automated alerting detected the issue within 3 minutes — well within our 5-minute SLA" -->
2. **{{CATEGORY_2}}:** {{DESCRIPTION}}
3. **{{CATEGORY_3}}:** {{DESCRIPTION}}

---

## 7. What Went Wrong

<!-- GUIDANCE: Honest assessment without blame. Focus on systems, processes, tooling — not people. -->

1. **{{CATEGORY_1}}:** {{DESCRIPTION}}
   <!-- e.g., "Prevention: No automated test caught the migration error before it reached production" -->
2. **{{CATEGORY_2}}:** {{DESCRIPTION}}
3. **{{CATEGORY_3}}:** {{DESCRIPTION}}

---

## 8. Where We Got Lucky

<!-- GUIDANCE: This section reveals hidden risks — things that could have been worse but weren't due to chance. -->

1. {{LUCKY_1}} <!-- e.g., "The incident occurred at 3am PST (11:00 UTC), when low traffic reduced user impact by ~70% vs peak hours" -->
2. {{LUCKY_2}} <!-- e.g., "A senior engineer happened to be online and immediately recognized the pattern" -->
3. {{LUCKY_3}} <!-- e.g., "The data corruption only affected non-critical records — had it hit user accounts, impact would have been severe" -->

---

## 9. Action Items

<!-- GUIDANCE: Divide into short-term (this sprint), long-term (next quarter), and process changes. All must have owners and due dates. -->

### Short-Term Fixes (This Sprint)

| # | Action | Owner | Due | Priority | Ticket |
|---|--------|-------|-----|----------|--------|
| 1 | {{SHORT_TERM_1}} <!-- e.g., "Add migration validation script to CI" --> | {{OWNER}} | {{DATE}} | Critical | {{TICKET}} |
| 2 | {{SHORT_TERM_2}} | {{OWNER}} | {{DATE}} | High | {{TICKET}} |
| 3 | {{SHORT_TERM_3}} | {{OWNER}} | {{DATE}} | Medium | {{TICKET}} |
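
For illustration only: a validation step like the example action in row 1 could compare the set of indexes before and after a migration runs. The sketch below assumes SQLite through Python's standard `sqlite3` module purely so that it is self-contained. The function names, the `orders` table, and the `idx_orders_user` index are hypothetical, and a real check would query your own database's catalog instead.

```python
import sqlite3


def index_names(conn: sqlite3.Connection) -> set:
    """Return the names of all user-defined indexes in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return {row[0] for row in rows}


def dropped_indexes(conn: sqlite3.Connection, migration_sql: str) -> set:
    """Apply a migration and return any indexes that existed before but not after."""
    before = index_names(conn)
    conn.executescript(migration_sql)
    after = index_names(conn)
    return before - after


# Hypothetical schema and migration, run against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);"
    "CREATE INDEX idx_orders_user ON orders (user_id);"
)
missing = dropped_indexes(conn, "DROP INDEX idx_orders_user;")
if missing:
    print(f"Migration dropped indexes: {sorted(missing)}")
```

In CI, a non-empty result could fail the build unless the migration explicitly declares the drop as intended.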

### Long-Term Improvements (Next Quarter)

| # | Action | Owner | Due | Priority | Ticket |
|---|--------|-------|-----|----------|--------|
| 1 | {{LONG_TERM_1}} <!-- e.g., "Implement performance regression tests in CI" --> | {{OWNER}} | {{DATE}} | High | {{TICKET}} |
| 2 | {{LONG_TERM_2}} | {{OWNER}} | {{DATE}} | Medium | {{TICKET}} |

### Process Changes

| # | Change | Owner | Implementation Date |
|---|--------|-------|---------------------|
| 1 | {{PROCESS_1}} <!-- e.g., "All migration PRs require DBA review" --> | {{OWNER}} | {{DATE}} |
| 2 | {{PROCESS_2}} | {{OWNER}} | {{DATE}} |

---

## 10. Follow-Up Tracking

<!-- GUIDANCE: Schedule a follow-up to verify action items were completed and are effective. -->

**Follow-up review date:** {{FOLLOWUP_DATE}} (4 weeks after incident)
**Follow-up owner:** {{FOLLOWUP_OWNER}}

| Action Item | Expected Completion | Verified Complete | Effective |
|-------------|--------------------|--------------------|-----------|
| {{ACTION_1}} | {{DATE}} | Yes / No | Yes / No / TBD |
| {{ACTION_2}} | {{DATE}} | | |

---

## 11. Recurrence Prevention

<!-- GUIDANCE: Summarize the net change in the system's resilience. What specifically will prevent this class of incident from recurring? -->

**Before this incident:** {{BEFORE_STATE}}
<!-- e.g., "No automated checks for missing indexes after migration, no performance tests in CI" -->

**After implementing action items:** {{AFTER_STATE}}
<!-- e.g., "CI validates all migrations against schema integrity, performance regression tests block deployment if P95 regresses > 20%" -->

**Confidence in prevention:** {{CONFIDENCE}} / 10
**Residual risk:** {{RESIDUAL_RISK}}
<!-- e.g., "Other types of migration errors could still escape — broader migration testing is a longer-term project" -->
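
When the after-state includes a latency gate such as the "P95 regresses > 20%" example, it helps to state exactly how the gate is evaluated. The following is a minimal sketch under stated assumptions: latency samples are plain lists of milliseconds, the 95th percentile comes from Python's `statistics.quantiles`, and the function names and the 20% default are hypothetical rather than part of any existing tooling.

```python
import statistics
from collections.abc import Sequence


def p95(samples: Sequence[float]) -> float:
    """Return the 95th percentile of the samples (needs at least two values)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def p95_regressed(
    baseline: Sequence[float],
    candidate: Sequence[float],
    threshold: float = 0.20,
) -> bool:
    """Return True if the candidate P95 exceeds the baseline P95 by more than threshold."""
    return p95(candidate) > p95(baseline) * (1 + threshold)


# Hypothetical latency samples in milliseconds.
baseline_ms = [100.0] * 100
candidate_ms = [100.0] * 90 + [200.0] * 10

if p95_regressed(baseline_ms, candidate_ms):
    print("P95 regression exceeds threshold; block deployment")
```

A relative threshold keeps the gate meaningful across endpoints with very different baseline latencies, though noisy test environments may need a larger sample or a wider margin.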

---

## 12. Review & Sign-Off

<!-- GUIDANCE: The post-mortem must be reviewed and approved before the incident is closed. -->

**Post-mortem presented at:** {{MEETING}} on {{MEETING_DATE}}
**Meeting recording:** {{RECORDING_LINK}}
**Meeting notes:** {{NOTES_LINK}}

---

## Related Documents

- [Incident Report INC-{{YYYY}}-{{SEQ}}](./incident-report.md)
- [Operational Runbook](./operational-runbook.md)
- [Disaster Recovery Plan](../INFRASTRUCTURE/disaster-recovery-plan.md)

---

## Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Reviewer | | | |
| Approver | | | |