# Incident Report

> **Project:** {{PROJECT_NAME}}
> **Version:** {{VERSION}}
> **Date:** {{DATE}}
> **Author:** {{AUTHOR}}
> **Status:** Draft | In Review | Approved
> **Reviewers:** {{REVIEWERS}}

## Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1     | {{DATE}} | {{AUTHOR}} | Initial draft |

---

## 1. Incident Metadata

<!-- GUIDANCE: Fill these fields immediately when an incident is declared. Update as the incident evolves. -->

| Field | Value |
|-------|-------|
| **Incident ID** | INC-{{YYYY}}-{{SEQ}} |
| **Severity** | P{{SEVERITY}} <!-- P1 (Critical) / P2 (High) / P3 (Medium) / P4 (Low) --> |
| **Status** | {{STATUS}} <!-- Open / Monitoring / Resolved --> |
| **Incident Commander** | {{IC}} |
| **Technical Lead** | {{TECH_LEAD}} |
| **Communications Lead** | {{COMMS_LEAD}} |
| **Declared at** | {{START_TIME}} {{TIMEZONE}} |
| **Resolved at** | {{END_TIME}} {{TIMEZONE}} |
| **Total duration** | {{DURATION}} |
| **Affected service(s)** | {{SERVICES}} |
| **Environment** | Production / Staging |

---

## 2. Executive Summary

<!-- GUIDANCE: Write 1-2 sentences that a non-technical person can understand. What happened, how long did it last, what was the impact. -->

{{EXECUTIVE_SUMMARY}}

> Example: "On {{DATE}}, a database connection pool exhaustion caused the {{SERVICE}} API to return 503 errors for approximately 47 minutes, affecting {{AFFECTED_COUNT}} users and resulting in an estimated {{REVENUE_IMPACT}} in lost transactions. The root cause was a code change in the v{{VERSION}} deployment that introduced N+1 queries under high load."

---

## 3. Detection

<!-- GUIDANCE: How was the incident discovered? Was alerting effective? Did a user report it? -->

- **Detected by:** {{DETECTION_METHOD}} <!-- Automated alert / User complaint / Internal discovery / External monitoring -->
- **Detected at:** {{DETECTION_TIME}}
- **Lag from start to detection:** {{DETECTION_LAG}} minutes
- **Detecting system:** {{DETECTING_SYSTEM}} <!-- PagerDuty alert "HighErrorRate" / Customer email / Slack message from user -->

**Alerting effectiveness:**
- [ ] Alert fired within the expected window (< {{ALERT_SLA}} minutes)
- [ ] Alert delivered to on-call without delay
- [ ] Alert contained sufficient context to begin investigation

**Improvements to detection identified:**
- {{DETECTION_IMPROVEMENT_1}} <!-- e.g., "Alert threshold was too high — lower from 5% to 2% error rate" -->

---

## 4. Detailed Timeline

<!-- GUIDANCE: Minute-by-minute timeline for P1/P2. Capture every significant action, decision, and discovery. -->

> **Timezone:** All times in {{TIMEZONE}}

| Time | Event | Actor | Notes |
|------|-------|-------|-------|
| {{TIME}} | {{EVENT_1}} <!-- e.g., "Deployment of v2.4.1 completed" --> | {{ACTOR}} | |
| {{TIME}} | {{EVENT_2}} <!-- e.g., "Error rate alert fired (threshold: 5%)" --> | System | Alert ID: {{ALERT_ID}} |
| {{TIME}} | {{EVENT_3}} <!-- e.g., "On-call engineer acknowledged alert" --> | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_4}} <!-- e.g., "Incident declared P2, war room opened" --> | {{IC}} | |
| {{TIME}} | {{EVENT_5}} <!-- e.g., "Root cause identified: DB connection pool exhausted" --> | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_6}} <!-- e.g., "Mitigation applied: increased connection pool size" --> | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_7}} <!-- e.g., "Error rate returned to < 1%" --> | System | |
| {{TIME}} | {{EVENT_8}} <!-- e.g., "Incident declared resolved" --> | {{IC}} | |

<!-- TODO: Add all timeline entries. Be as specific as possible with times and actions. -->

---

## 5. Impact Assessment

### Users Affected

| Metric | Value |
|--------|-------|
| Total users affected | {{USER_COUNT}} |
| % of total user base | {{USER_PERCENT}}% |
| Geography affected | {{GEOGRAPHY}} <!-- Global / EU / Specific region --> |
| User tier affected | {{USER_TIER}} <!-- All users / Enterprise only / Free tier only --> |

### Services Affected

| Service | Impact Type | Severity | Duration |
|---------|-------------|----------|---------|
| {{SERVICE_1}} | {{IMPACT_TYPE}} <!-- Degraded / Unavailable / Error rate elevated --> | {{SEV}} | {{DURATION}} |
| {{SERVICE_2}} | {{IMPACT_TYPE}} | {{SEV}} | {{DURATION}} |

### Data Impact

<!-- GUIDANCE: Be honest about data impact. "No data lost" must be verified, not assumed. -->

| Type | Assessment |
|------|------------|
| Data loss | {{DATA_LOSS}} <!-- "None confirmed" / "Estimated X records lost" --> |
| Data corruption | {{DATA_CORRUPTION}} <!-- "None" / "X records affected" --> |
| Data exposure | {{DATA_EXPOSURE}} <!-- "None" / "Describe scope if any" --> |
| Verification method | {{VERIFICATION}} <!-- How was no-data-loss confirmed? --> |

### Financial Impact

| Category | Amount | Notes |
|----------|--------|-------|
| Lost transactions | ${{AMOUNT}} | {{TRANSACTION_COUNT}} failed transactions |
| SLA credits | ${{AMOUNT}} | Per SLA contract |
| Operational cost | ${{AMOUNT}} | Engineering hours to resolve |
| **Total estimated** | **${{TOTAL}}** | |

### SLA Breach Assessment

| SLA Metric | Target | Actual | Breach |
|------------|--------|--------|--------|
| Uptime | {{UPTIME_SLA}}% | {{ACTUAL_UPTIME}}% | {{BREACH}} <!-- Yes / No --> |
| Response time (P99) | < {{P99_SLA}}ms | {{P99_ACTUAL}}ms | {{BREACH}} |
| MTTR | < {{MTTR_SLA}} | {{MTTR_ACTUAL}} | {{BREACH}} |

---

## 6. Root Cause Analysis

### 5 Whys

<!-- GUIDANCE: Ask "why" five times to find the systemic root cause, not just the surface-level trigger. -->

| Why # | Question | Answer |
|-------|----------|--------|
| Why 1 | Why did users see errors? | {{ANSWER_1}} <!-- e.g., "API was returning 503 Service Unavailable" --> |
| Why 2 | Why was the API returning 503? | {{ANSWER_2}} <!-- e.g., "Database connection pool was exhausted" --> |
| Why 3 | Why was the connection pool exhausted? | {{ANSWER_3}} <!-- e.g., "New feature introduced N+1 queries, increasing connections 10x" --> |
| Why 4 | Why was the N+1 query introduced? | {{ANSWER_4}} <!-- e.g., "Code review did not catch the ORM lazy loading issue" --> |
| Why 5 | Why did code review miss it? | {{ANSWER_5}} <!-- e.g., "No query count assertion in tests, no performance test in CI" --> |

**Root cause:** {{ROOT_CAUSE}}

### Contributing Factors

<!-- GUIDANCE: Multiple factors usually contribute. Identify all of them to prevent recurrence. -->

1. {{FACTOR_1}} <!-- e.g., "No automated performance regression test in CI pipeline" -->
2. {{FACTOR_2}} <!-- e.g., "Database connection pool too small for peak load" -->
3. {{FACTOR_3}} <!-- e.g., "Monitoring alert threshold too high (5% vs ideal 2%)" -->

### Trigger Event

**What triggered this specific incident now:** {{TRIGGER}}
<!-- e.g., "v2.4.1 deployment at 14:32 introduced the faulty code path" -->

---

## 7. Resolution Steps

<!-- GUIDANCE: Document the exact steps taken to resolve the incident. This serves as a playbook for similar future incidents. -->

| Step | Time | Action | Result |
|------|------|--------|--------|
| 1 | {{TIME}} | {{ACTION_1}} | {{RESULT_1}} |
| 2 | {{TIME}} | {{ACTION_2}} | {{RESULT_2}} |
| 3 | {{TIME}} | {{ACTION_3}} | {{RESULT_3}} |

**Resolution commands (for runbook):**
```bash
# {{RESOLUTION_DESCRIPTION}}
{{RESOLUTION_COMMAND}}
```

---

## 8. What Went Well

<!-- GUIDANCE: Acknowledge what worked. This reinforces good practices and team morale. -->

1. {{WENT_WELL_1}} <!-- e.g., "Alert fired within 2 minutes of user impact starting" -->
2. {{WENT_WELL_2}} <!-- e.g., "On-call team assembled quickly (< 5 min)" -->
3. {{WENT_WELL_3}} <!-- e.g., "Rollback process was well-documented and executed smoothly" -->

---

## 9. What Went Wrong

<!-- GUIDANCE: Be honest. This is blameless — focus on systems and processes, not individuals. -->

1. {{WENT_WRONG_1}} <!-- e.g., "No performance test caught the N+1 query before production" -->
2. {{WENT_WRONG_2}} <!-- e.g., "Took 12 minutes to identify root cause — unclear tooling" -->
3. {{WENT_WRONG_3}} <!-- e.g., "Customer support was not briefed, escalated to engineering unnecessarily" -->

---

## 10. Action Items

<!-- GUIDANCE: Every action item needs an owner and due date. Without both, it won't get done. -->

| # | Action | Owner | Due Date | Priority | Status |
|---|--------|-------|----------|----------|--------|
| 1 | {{ACTION_1}} <!-- e.g., "Add query count assertions to integration tests" --> | {{OWNER}} | {{DUE}} | High | Open |
| 2 | {{ACTION_2}} <!-- e.g., "Add performance test for checkout endpoint" --> | {{OWNER}} | {{DUE}} | High | Open |
| 3 | {{ACTION_3}} <!-- e.g., "Lower error rate alert threshold from 5% to 2%" --> | {{OWNER}} | {{DUE}} | Medium | Open |
| 4 | {{ACTION_4}} <!-- e.g., "Increase connection pool size to handle 2x peak load" --> | {{OWNER}} | {{DUE}} | High | Open |
| 5 | {{ACTION_5}} <!-- e.g., "Create customer support escalation guide" --> | {{OWNER}} | {{DUE}} | Low | Open |

---

## 11. Lessons Learned

<!-- GUIDANCE: What does the team know now that they didn't before? What should change? -->

1. {{LESSON_1}} <!-- e.g., "N+1 queries are hard to detect in code review alone — tooling needed" -->
2. {{LESSON_2}}
3. {{LESSON_3}}

---

## 12. Related Incidents

<!-- GUIDANCE: Link to related or similar past incidents to identify patterns. -->

| Incident ID | Date | Similarity | Resolved |
|-------------|------|------------|---------|
| INC-{{ID}} | {{DATE}} | {{DESCRIPTION}} | Yes / No |

---

## 13. Communication Log

<!-- GUIDANCE: Record all external communications sent during the incident for accountability and pattern analysis. -->

| Time | Channel | Message Summary | Audience | Sent By |
|------|---------|-----------------|----------|---------|
| {{TIME}} | Status page | "Investigating reports of elevated errors" | All users | {{SENDER}} |
| {{TIME}} | Status page | "Identified root cause, applying fix" | All users | {{SENDER}} |
| {{TIME}} | Status page | "Incident resolved, all systems normal" | All users | {{SENDER}} |
| {{TIME}} | Email | Customer notification for SLA breach | Affected customers | {{SENDER}} |

---

## Related Documents

- [Post-Mortem](./post-mortem.md)
- [Operational Runbook](./operational-runbook.md)
- [SLA Report](./sla-report.md)

---

## Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Reviewer | | | |
| Approver | | | |