Incident Report
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Closed
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | | | |
1. Incident Overview
This document is filled with a realistic example incident based on Drop's architecture. Use it as a template for future real incidents.
| Field | Value |
|---|---|
| Incident ID | INC- |
| Severity | |
| Status | |
| Incident Commander | |
| Communications | {{COMMS_LEAD}} |
| Declared at | {{START_TIME}} {{TIMEZONE}} |
| Resolved at | {{END_TIME}} {{TIMEZONE}} |
| Total duration | {{DURATION}} |
| Affected service(s) | {{SERVICES}} |
| Environment | Production / Staging |
2. Executive Summary
{{EXECUTIVE_SUMMARY}}
Example: "On {{DATE}}, a database connection pool exhaustion caused the {{SERVICE}} API to return 503 errors for approximately 47 minutes, affecting {{AFFECTED_COUNT}} users and resulting in an estimated {{REVENUE_IMPACT}} in lost transactions. The root cause was a code change in the v{{VERSION}} deployment that introduced N+1 queries under high load."
3. Detection
Detected by: {{DETECTION_METHOD}}
Detected at: {{DETECTION_TIME}}
Lag from start to detection: {{DETECTION_LAG}} minutes
Detecting system: {{DETECTING_SYSTEM}}
Alerting effectiveness:
- Alert fired within the expected window (< {{ALERT_SLA}} minutes)
- Alert delivered to on-call without delay
- Alert contained sufficient context to begin investigation
Improvements to detection identified:
- {{DETECTION_IMPROVEMENT_1}}
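The detection lag and the alert-window check in this section can be derived from the two timestamps instead of being entered by hand. The following is a minimal sketch, assuming ISO 8601 timestamps; the function names and the sample times are hypothetical and not taken from a real incident.

```python
from datetime import datetime


def detection_lag_minutes(started_at: str, detected_at: str) -> float:
    """Return the minutes between incident start and detection (ISO 8601 inputs)."""
    start = datetime.fromisoformat(started_at)
    detected = datetime.fromisoformat(detected_at)
    if detected < start:
        raise ValueError("detection time precedes incident start")
    return (detected - start).total_seconds() / 60


def within_alert_sla(lag_minutes: float, alert_sla_minutes: float) -> bool:
    """True when the alert fired inside the expected window."""
    return lag_minutes < alert_sla_minutes


# Hypothetical sample values for illustration only.
lag = detection_lag_minutes("2026-01-01T10:00:00+00:00", "2026-01-01T10:04:30+00:00")
print(lag, within_alert_sla(lag, alert_sla_minutes=5))
```

Keeping the lag as a computed value avoids the timeline table and the detection section drifting apart when one of them is edited later.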
4. Detailed Timeline
Timezone: All times in {{TIMEZONE}}
| Time | Event | Actor | Notes |
|---|---|---|---|
| {{TIME}} | {{EVENT_1}} | {{ACTOR}} | |
| {{TIME}} | {{EVENT_2}} | System | Alert ID: {{ALERT_ID}} |
| {{TIME}} | {{EVENT_3}} | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_4}} | {{IC}} | |
| {{TIME}} | {{EVENT_5}} | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_6}} | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_7}} | System | |
| {{TIME}} | {{EVENT_8}} | {{IC}} | |
5. Impact Summary
Users Affected
| Metric | Value |
|---|---|
Customer-facing behavior: Users saw error messages on all screens. /api/health returned HTTP 503 with "status":"down".
2. Timeline
Services Affected
| Severity | Duration | | |
|---|---|---|---|
| {{SEV}} | {{DURATION}} | | |
3. Root Cause Analysis
What Happened
A burst of concurrent BankID login attempts (triggered by a marketing email sent at 10:28 UTC) created 45+ simultaneous database connections. The Drop application uses application-level connection pooling with a default max of ~85 connections (the db.t4g.micro limit). Each /api/auth/bankid/callback request opens a connection for session creation and user upsert, so the simultaneous spike exhausted the pool.
When the pool was exhausted, new requests failed with "connection refused", and the health check's DB query (SELECT 1) also failed, triggering a 503 response.
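The failure mode described above can be reproduced in miniature. This is a toy sketch, not Drop's actual code: `BoundedPool` and `health_check` are hypothetical names, the cap of 3 is chosen only for illustration, and the `"status": "ok"` body for the healthy case is an assumption (the report only documents the `"down"` response).

```python
import threading
from typing import Dict, Tuple


class BoundedPool:
    """Toy stand-in for a DB connection pool with a hard connection cap."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    def acquire(self) -> bool:
        # Fail fast instead of blocking when every connection is in use.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def health_check(pool: BoundedPool) -> Tuple[int, Dict[str, str]]:
    """Mimic /api/health: it needs one connection to run its SELECT 1 probe."""
    if not pool.acquire():
        return 503, {"status": "down"}
    try:
        return 200, {"status": "ok"}  # a real probe would run SELECT 1 here
    finally:
        pool.release()


pool = BoundedPool(max_connections=3)  # tiny cap, for illustration only
before = health_check(pool)
held = [pool.acquire() for _ in range(3)]  # the burst holds every connection
during = health_check(pool)
for _ in held:
    pool.release()
after = health_check(pool)
print(before[0], during[0], after[0])
```

The sketch shows why the health check failed together with user traffic: the probe competes for the same connections as the login callbacks, so it reports the service as down as soon as the pool has no free slot.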
Contributing Factors
- No explicit connection pool configured: Drop uses the pg driver without PgBouncer or RDS Proxy. The application-level pool was implicitly bounded by the OS connection limits.
- No connection pool metrics: CloudWatch didn't have an alert on DatabaseConnections, so the issue wasn't detected until the health check failed.
- No circuit breaker: The application did not gracefully degrade (e.g., serve cached data) when DB was unavailable.
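The missing circuit breaker named in the factors above could look roughly like the sketch below. It is an illustration under stated assumptions, not the application's implementation: the class name, the thresholds, and the `failing_db` / `cached` helpers are all hypothetical, and a fake clock is injected so the example runs without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cooldown period and use a fallback."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cooldown elapsed: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
            return False
        return True

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result


now = [0.0]  # fake clock so the demo is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
db_calls = []


def failing_db() -> str:
    db_calls.append("db")
    raise ConnectionError("pool exhausted")


def cached() -> str:
    return "cached"


results = [breaker.call(failing_db, cached) for _ in range(4)]
now[0] = 10.0  # cooldown has passed, the next call is tried against the DB again
recovered = breaker.call(lambda: "fresh", cached)
print(results, len(db_calls), recovered)
```

After two consecutive failures the breaker stops sending requests to the database, which both serves degraded responses to users and avoids adding load to an already exhausted pool.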
Why App Runner restart fixed it
Restarting App Runner recreated the application process and reset all connection pools. With the burst of login requests completed (10 seconds of traffic), new requests arrived at a normal rate and the pool was sufficient.
4. Resolution
Immediate fix: App Runner restart (28 minutes to resolution from detection).
Permanent fixes required:
- Add PgBouncer connection pooler (or RDS Proxy) to limit per-application connections
- Add CloudWatch alarm on DatabaseConnections > 70 for db.t4g.micro
- Implement connection pool health check with metrics
- Add rate limiting on BankID login initiation for burst protection (a per-IP limit of 10/min already exists, but there is no limit across all IPs)
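The last fix above asks for a limit that applies across all IPs, not per IP. One possible shape is a sliding-window limiter shared by every caller. This is a sketch under assumptions: the class name, the limit of 3, and the 60-second window are hypothetical, and a multi-instance deployment would need shared storage rather than in-process memory.

```python
import time
from collections import deque
from typing import Callable, Deque


class GlobalRateLimiter:
    """Sliding-window limiter shared by all callers, regardless of source IP."""

    def __init__(self, max_requests: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._timestamps: Deque[float] = deque()

    def allow(self) -> bool:
        now = self._clock()
        # Drop requests that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_s:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True


t = [0.0]  # fake clock so the demo does not sleep
limiter = GlobalRateLimiter(max_requests=3, window_s=60.0, clock=lambda: t[0])
burst = [limiter.allow() for _ in range(5)]  # five logins in the same instant
t[0] = 60.0
later = limiter.allow()  # the window has moved on, requests are accepted again
print(burst, later)
```

Requests rejected by the limiter never reach the database, so a burst of logins is shed at the edge instead of consuming connections.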
5. Detection Quality
Data Impact
| Field | Value |
|---|---|
| Data loss | {{DATA_LOSS}} |
| Data corruption | {{DATA_CORRUPTION}} |
| Data exposure | {{DATA_EXPOSURE}} |
| Verification method | {{VERIFICATION}} |
Financial Impact
| Category | Amount | Notes |
|---|---|---|
| | ${{AMOUNT}} | Engineering hours to |
SLA Breach Assessment
| SLA Metric | Target | Actual | Breach |
|---|---|---|---|
| Uptime | {{UPTIME_SLA}}% | {{ACTUAL_UPTIME}}% | {{BREACH}} |
| | | | {{BREACH}} |
| MTTR | < {{MTTR_SLA}} | {{MTTR_ACTUAL}} | {{BREACH}} |
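The Actual and Breach columns above follow from simple arithmetic. The sketch below is a worked example, assuming a 30-day measurement period and reusing the 47-minute outage from the executive summary example; the 99.9% and 99.5% targets are hypothetical stand-ins for {{UPTIME_SLA}}.

```python
def uptime_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Uptime over the period as a percentage."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def is_breach(actual_percent: float, target_percent: float) -> bool:
    """An SLA is breached when actual uptime falls below the target."""
    return actual_percent < target_percent


actual = uptime_percent(47)  # 47 minutes of downtime in a 30-day period
print(f"{actual:.3f}%", is_breach(actual, 99.9), is_breach(actual, 99.5))
```

For reference, a 99.9% target over 30 days allows 43.2 minutes of downtime, so a 47-minute outage exceeds that budget while staying well inside a 99.5% target.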
6. Root Cause Analysis
5 Whys
| Why # | Question | Answer |
|---|---|---|
| Why 1 | Why did users see errors? | {{ANSWER_1}} |
| Why 2 | Why was the API returning 503? | {{ANSWER_2}} |
| Why 3 | Why was the connection pool exhausted? | {{ANSWER_3}} |
| Why 4 | Why was the N+1 query introduced? | {{ANSWER_4}} |
| Why 5 | Why did code review miss it? | {{ANSWER_5}} |
Root cause: {{ROOT_CAUSE}}
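The example questions above refer to an N+1 query. For readers unfamiliar with the pattern, the sketch below contrasts it with a single joined query, using an in-memory SQLite database. The `users` and `sessions` tables and the `run` helper are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER, token TEXT);
INSERT INTO users VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO sessions VALUES (1, 1, 't1'), (2, 2, 't2'), (3, 3, 't3');
""")

query_count = 0


def run(sql: str, params: tuple = ()) -> sqlite3.Cursor:
    """Execute a query and count it, so both approaches can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params)


def tokens_n_plus_one() -> list:
    # One query for the list, then one more query per row: N+1 round trips.
    users = run("SELECT id, name FROM users ORDER BY id").fetchall()
    return [
        (name, run("SELECT token FROM sessions WHERE user_id = ?", (uid,)).fetchone()[0])
        for uid, name in users
    ]


def tokens_joined() -> list:
    # The same result in a single round trip.
    return run(
        "SELECT u.name, s.token FROM users u "
        "JOIN sessions s ON s.user_id = u.id ORDER BY u.id"
    ).fetchall()


slow = tokens_n_plus_one()
slow_queries = query_count
query_count = 0
fast = tokens_joined()
fast_queries = query_count
print(slow == fast, slow_queries, fast_queries)
```

Each extra round trip holds a connection a little longer, which is how an N+1 pattern can turn ordinary load into pool exhaustion.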
Contributing Factors
- {{FACTOR_1}}
- {{FACTOR_2}}
- {{FACTOR_3}}
Trigger Event
What triggered this specific incident now: {{TRIGGER}}
7. Resolution Steps
| Step | Time | Action | Result |
|---|---|---|---|
| 1 | {{TIME}} | {{ACTION_1}} | {{RESULT_1}} |
| 2 | {{TIME}} | {{ACTION_2}} | {{RESULT_2}} |
| 3 | {{TIME}} | {{ACTION_3}} | {{RESULT_3}} |
Resolution commands (for runbook):
# {{RESOLUTION_DESCRIPTION}}
{{RESOLUTION_COMMAND}}
8. What Went Well
- {{WENT_WELL_1}}
- {{WENT_WELL_2}}
- {{WENT_WELL_3}}
9. What Went Wrong
- {{WENT_WRONG_1}}
- {{WENT_WRONG_2}}
- {{WENT_WRONG_3}}
10. Action Items
| # | Action | Owner | Due Date | Priority | Status |
|---|---|---|---|---|---|
| 1 | | | | | Open |
| 2 | | | | | Open |
| 3 | | | | | Open |
| 4 | | | | | Open |
| 5 | | | | | Open |
11. Lessons Learned
- {{LESSON_1}}
- {{LESSON_2}}
- {{LESSON_3}}
12. Related Incidents
| Incident ID | Date | Similarity | Resolved |
|---|---|---|---|
| INC-{{ID}} | {{DATE}} | {{DESCRIPTION}} | Yes / No |
13. Communication Log
| Time | Channel | Message Summary | Audience | Sent By |
|---|---|---|---|---|
| {{TIME}} | Status page | "Investigating reports of elevated errors" | All users | {{SENDER}} |
| {{TIME}} | Status page | "Identified root cause, applying fix" | All users | {{SENDER}} |
| {{TIME}} | Status page | "Incident resolved, all systems normal" | All users | {{SENDER}} |
| {{TIME}} | | Customer notification for | Affected | {{SENDER}} |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |