Incident Report
Incident Report
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
| Version |
Date |
Author |
Changes |
| 0.1 |
{{DATE}} |
{{AUTHOR}} |
Initial draft |
| Field |
Value |
| Incident ID |
INC-{{YYYY}}-{{SEQ}} |
| Severity |
P{{SEVERITY}} |
| Status |
{{STATUS}} |
| Incident Commander |
{{IC}} |
| Technical Lead |
{{TECH_LEAD}} |
| Communications Lead |
{{COMMS_LEAD}} |
| Declared at |
{{START_TIME}} {{TIMEZONE}} |
| Resolved at |
{{END_TIME}} {{TIMEZONE}} |
| Total duration |
{{DURATION}} |
| Affected service(s) |
{{SERVICES}} |
| Environment |
Production / Staging |
2. Executive Summary
{{EXECUTIVE_SUMMARY}}
Example: "On {{DATE}}, a database connection pool exhaustion caused the {{SERVICE}} API to return 503 errors for approximately 47 minutes, affecting {{AFFECTED_COUNT}} users and resulting in an estimated {{REVENUE_IMPACT}} in lost transactions. The root cause was a code change in the v{{VERSION}} deployment that introduced N+1 queries under high load."
3. Detection
Detected by: {{DETECTION_METHOD}}
Detected at: {{DETECTION_TIME}}
Lag from start to detection: {{DETECTION_LAG}} minutes
Detecting system: {{DETECTING_SYSTEM}}
Alerting effectiveness:
Improvements to detection identified:
- {{DETECTION_IMPROVEMENT_1}}
4. Detailed Timeline
Timezone: All times in {{TIMEZONE}}
| Time |
Event |
Actor |
Notes |
| {{TIME}} |
{{EVENT_1}} |
{{ACTOR}} |
|
| {{TIME}} |
{{EVENT_2}} |
System |
Alert ID: {{ALERT_ID}} |
| {{TIME}} |
{{EVENT_3}} |
{{ENGINEER}} |
|
| {{TIME}} |
{{EVENT_4}} |
{{IC}} |
|
| {{TIME}} |
{{EVENT_5}} |
{{ENGINEER}} |
|
| {{TIME}} |
{{EVENT_6}} |
{{ENGINEER}} |
|
| {{TIME}} |
{{EVENT_7}} |
System |
|
| {{TIME}} |
{{EVENT_8}} |
{{IC}} |
|
5. Impact Assessment
Users Affected
| Metric |
Value |
| Total users affected |
{{USER_COUNT}} |
| % of total user base |
{{USER_PERCENT}}% |
| Geography affected |
{{GEOGRAPHY}} |
| User tier affected |
{{USER_TIER}} |
Services Affected
| Service |
Impact Type |
Severity |
Duration |
| {{SERVICE_1}} |
{{IMPACT_TYPE}} |
{{SEV}} |
{{DURATION}} |
| {{SERVICE_2}} |
{{IMPACT_TYPE}} |
{{SEV}} |
{{DURATION}} |
Data Impact
| Type |
Assessment |
| Data loss |
{{DATA_LOSS}} |
| Data corruption |
{{DATA_CORRUPTION}} |
| Data exposure |
{{DATA_EXPOSURE}} |
| Verification method |
{{VERIFICATION}} |
Financial Impact
| Category |
Amount |
Notes |
| Lost transactions |
${{AMOUNT}} |
{{TRANSACTION_COUNT}} failed transactions |
| SLA credits |
${{AMOUNT}} |
Per SLA contract |
| Operational cost |
${{AMOUNT}} |
Engineering hours to resolve |
| Total estimated |
${{TOTAL}} |
|
SLA Breach Assessment
| SLA Metric |
Target |
Actual |
Breach |
| Uptime |
{{UPTIME_SLA}}% |
{{ACTUAL_UPTIME}}% |
{{BREACH}} |
| Response time (P99) |
< {{P99_SLA}}ms |
{{P99_ACTUAL}}ms |
{{BREACH}} |
| MTTR |
< {{MTTR_SLA}} |
{{MTTR_ACTUAL}} |
{{BREACH}} |
6. Root Cause Analysis
5 Whys
| Why # |
Question |
Answer |
| Why 1 |
Why did users see errors? |
{{ANSWER_1}} |
| Why 2 |
Why was the API returning 503? |
{{ANSWER_2}} |
| Why 3 |
Why was the connection pool exhausted? |
{{ANSWER_3}} |
| Why 4 |
Why was the N+1 query introduced? |
{{ANSWER_4}} |
| Why 5 |
Why did code review miss it? |
{{ANSWER_5}} |
Root cause: {{ROOT_CAUSE}}
Contributing Factors
- {{FACTOR_1}}
- {{FACTOR_2}}
- {{FACTOR_3}}
Trigger Event
What triggered this specific incident now: {{TRIGGER}}
7. Resolution Steps
| Step |
Time |
Action |
Result |
| 1 |
{{TIME}} |
{{ACTION_1}} |
{{RESULT_1}} |
| 2 |
{{TIME}} |
{{ACTION_2}} |
{{RESULT_2}} |
| 3 |
{{TIME}} |
{{ACTION_3}} |
{{RESULT_3}} |
Resolution commands (for runbook):
# {{RESOLUTION_DESCRIPTION}}
{{RESOLUTION_COMMAND}}
8. What Went Well
- {{WENT_WELL_1}}
- {{WENT_WELL_2}}
- {{WENT_WELL_3}}
9. What Went Wrong
- {{WENT_WRONG_1}}
- {{WENT_WRONG_2}}
- {{WENT_WRONG_3}}
10. Action Items
| # |
Action |
Owner |
Due Date |
Priority |
Status |
| 1 |
{{ACTION_1}} |
{{OWNER}} |
{{DUE}} |
High |
Open |
| 2 |
{{ACTION_2}} |
{{OWNER}} |
{{DUE}} |
High |
Open |
| 3 |
{{ACTION_3}} |
{{OWNER}} |
{{DUE}} |
Medium |
Open |
| 4 |
{{ACTION_4}} |
{{OWNER}} |
{{DUE}} |
High |
Open |
| 5 |
{{ACTION_5}} |
{{OWNER}} |
{{DUE}} |
Low |
Open |
11. Lessons Learned
- {{LESSON_1}}
- {{LESSON_2}}
- {{LESSON_3}}
| Incident ID |
Date |
Similarity |
Resolved |
| INC-{{ID}} |
{{DATE}} |
{{DESCRIPTION}} |
Yes / No |
13. Communication Log
| Time |
Channel |
Message Summary |
Audience |
Sent By |
| {{TIME}} |
Status page |
"Investigating reports of elevated errors" |
All users |
{{SENDER}} |
| {{TIME}} |
Status page |
"Identified root cause, applying fix" |
All users |
{{SENDER}} |
| {{TIME}} |
Status page |
"Incident resolved, all systems normal" |
All users |
{{SENDER}} |
| {{TIME}} |
Email |
Customer notification for SLA breach |
Affected customers |
{{SENDER}} |
Approval
| Role |
Name |
Date |
Signature |
| Author |
|
|
|
| Reviewer |
|
|
|
| Approver |
|
|
|
No comments to display
No comments to display