Incident Report
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Closed
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | | | |
1. Incident Overview
This document is filled with a realistic example incident based on Drop's architecture. Use it as a template for future real incidents.
| Field | Value |
|---|---|
| Incident ID | INC- |
| Severity | |
| Status | |
| Start Time | 2026-02-20 10:30 UTC |
| End Time | 2026-02-20 10:58 UTC |
| Duration | 28 minutes |
| Services Affected | Drop production (all users) |
| Root Cause | RDS PostgreSQL connection pool exhaustion after spike in concurrent logins |
| Incident Commander | |
1.1. Impact Summary
| Metric | Value |
|---|---|
| Compliance impact | None — no data loss, audit logs intact |
Customer-facing behavior: Users saw error messages on all screens. `/api/health` returned HTTP 503 with `"status":"down"`.
2. Timeline
| Time (UTC) | Event |
|---|---|
| | #drop-ops receives critical alert: "Drop Health Check is DOWN" |
| 10:30:45 | |
| 10:31:00 | Alem checks App Runner status; service shows `RUNNING` |
| 10:31:30 | Alem checks `/api/health`; response: `{"status":"down","checks":{` |
| 10:32:00 | Alem checks CloudWatch logs and sees repeated connection refused errors to RDS |
| 10:32:30 | Hypothesis: RDS connection issue. Alem checks RDS status; shows `available` |
| 10:33:00 | Alem queries RDS directly; connection succeeds from psql |
| 10:34:00 | New hypothesis: connection pool exhaustion in application |
| 10:35:00 | Alem triggers App Runner restart (`aws apprunner start-deployment`) |
| 10:38:00 | App Runner deployment completes; service `RUNNING` |
| 10:38:30 | Health check passes: `{"status":"ok"}` |
| 10:38:45 | BetterStack sends recovery alert: "Drop Health Check is UP (downtime: 8 min)" |
| 10:39:00 | Alem monitors for 15 minutes to confirm stability |
| 10:54:00 | Confirms stable; error spike cleared |
| 10:58:00 | Incident closed |
3. Root Cause Analysis
What Happened
A burst of concurrent BankID login attempts (triggered by a marketing email sent at 10:28 UTC) created 45+ simultaneous database connections. The Drop application uses application-level connection pooling with a default max of ~85 connections (db.t4g.micro limit). Each /api/auth/bankid/callback request opens a connection for session creation + user upsert — the simultaneous spike exhausted the pool.
When the pool was exhausted, new requests failed with connection refused and the health check's DB query (SELECT 1) also failed, triggering a 503 response.
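The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not Drop's actual code: `BoundedPool`, `health_check`, `simulate_burst`, and the pool size of 5 are all hypothetical names and values chosen for illustration. It shows how requests that hold connections during a burst leave none free for the health check's own query.

```python
from dataclasses import dataclass


class PoolExhausted(Exception):
    """Stands in for the 'connection refused' error seen in the logs."""


@dataclass
class BoundedPool:
    """Toy connection pool; max_size is an illustrative limit, not Drop's real setting."""
    max_size: int
    in_use: int = 0

    def acquire(self) -> None:
        if self.in_use >= self.max_size:
            raise PoolExhausted("no free connections")
        self.in_use += 1

    def release(self) -> None:
        self.in_use = max(0, self.in_use - 1)


def health_check(pool: BoundedPool) -> tuple[int, dict]:
    """Mimics a health endpoint that needs one connection to run SELECT 1."""
    try:
        pool.acquire()
    except PoolExhausted:
        return 503, {"status": "down"}
    pool.release()
    return 200, {"status": "ok"}


def simulate_burst(pool: BoundedPool, concurrent_logins: int) -> int:
    """Each login holds a connection for the whole burst; returns the number rejected."""
    rejected = 0
    for _ in range(concurrent_logins):
        try:
            pool.acquire()
        except PoolExhausted:
            rejected += 1
    return rejected


pool = BoundedPool(max_size=5)
before = health_check(pool)                       # pool is idle
rejected = simulate_burst(pool, concurrent_logins=8)
during = health_check(pool)                       # pool is fully held by logins
print(before, rejected, during)
```

In this model the health check is healthy before the burst and reports down during it, even though the database itself is reachable, which matches the observation that psql could connect while the application could not.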
Contributing Factors
- No explicit connection pool configured: Drop uses the `pg` driver without PgBouncer or RDS Proxy. The application-level pool was implicitly bounded by the OS connection limits.
- No connection pool metrics: CloudWatch didn't have an alert on `DatabaseConnections`, so the issue wasn't detected until the health check failed.
- No circuit breaker: The application did not gracefully degrade (e.g., serve cached data) when the DB was unavailable.
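To make the circuit breaker factor concrete, here is a minimal sketch of the pattern. It is an illustration under stated assumptions, not a description of Drop's code: the `CircuitBreaker` class, its thresholds, and the `query_db` and `cached` helpers are hypothetical. After a configurable number of consecutive failures, the breaker stops calling the database and serves the fallback until a cool-down elapses.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.is_open():
            return fallback()          # skip the database entirely
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
calls = []


def query_db() -> str:
    calls.append("db")
    raise ConnectionError("connection refused")


def cached() -> str:
    return "cached profile"


results = [breaker.call(query_db, cached) for _ in range(4)]
now[0] = 10.0  # cool-down has passed
recovered = breaker.call(lambda: "fresh profile", cached)
print(results, len(calls), recovered)
```

Only the first two calls reach the failing database; the rest are answered from the fallback, which also relieves pressure on an exhausted pool.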
Why App Runner restart fixed it
Restarting App Runner recreated the application process and reset all connection pools. With the burst of login requests completed (10 seconds of traffic), new requests arrived at a normal rate and the pool was sufficient.
4. Resolution
Immediate fix: App Runner restart. The health check passed at 10:38:30 UTC, and the incident was closed at 10:58 UTC, 28 minutes after it started.
Permanent fixes required:
- Add PgBouncer connection pooler (or RDS Proxy) to limit per-application connections
- Add CloudWatch alarm on `DatabaseConnections > 70` for db.t4g.micro
- Implement connection pool health check with metrics
- Add rate limiting on BankID login initiation for burst protection (a limit of 10/min per IP already exists, but there is no limit across all IPs)
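The last fix, a limit across all IPs in addition to the existing per-IP limit, could look roughly like the following. This is a sketch under assumptions: `LoginRateLimiter` is a hypothetical class, the global limit of 40 per minute is an arbitrary illustrative value (only the 10/min per-IP figure comes from this report), and a fixed window is used for simplicity where a production limiter might prefer a sliding window or token bucket.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict


class LoginRateLimiter:
    """Fixed-window limiter with a per-IP cap and a global cap (both hypothetical values)."""

    def __init__(self, per_ip_limit: int = 10, global_limit: int = 40,
                 window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.per_ip_limit = per_ip_limit
        self.global_limit = global_limit
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.per_ip: DefaultDict[str, int] = defaultdict(int)
        self.total = 0

    def _roll_window(self) -> None:
        # Start a fresh window once the current one has expired.
        now = self.clock()
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.per_ip.clear()
            self.total = 0

    def allow(self, ip: str) -> bool:
        self._roll_window()
        if self.per_ip[ip] >= self.per_ip_limit or self.total >= self.global_limit:
            return False
        self.per_ip[ip] += 1
        self.total += 1
        return True


# Demo: 45 different IPs each start one login within the same minute.
now = [0.0]
limiter = LoginRateLimiter(per_ip_limit=10, global_limit=40, window_s=60.0,
                           clock=lambda: now[0])
allowed = sum(limiter.allow(f"10.0.0.{i}") for i in range(45))
now[0] = 60.0  # next window
allowed_after_window = limiter.allow("10.0.0.1")
print(allowed, allowed_after_window)
```

A per-IP limit alone does not help in this scenario, because each user arrives from a different address; the global cap is what bounds the number of simultaneous login callbacks.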
5. Detection Quality
Financial Impact
| Notes | ||
|---|---|---|
| 5 |
6. Action Items
| # | Action | Owner | Priority | | |
|---|---|---|---|---|---|
| 1 | | | | | |
| 2 | `DatabaseConnections > 70` → Slack alert | | | | |
| 3 | | | | | |
| 4 | | | | | |
| 5 | | | | | |
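Action item 2 could be implemented along the following lines. This is a sketch, not the configuration actually deployed: the instance identifier `drop-prod`, the SNS topic ARN, and the assumption that an SNS topic relays to Slack are all hypothetical. The parameter names are those accepted by boto3's CloudWatch `put_metric_alarm`; the builder is kept separate from the AWS call so it can be checked without credentials.

```python
from typing import Any, Dict


def build_connection_alarm(db_instance_id: str, sns_topic_arn: str,
                           threshold: int = 70) -> Dict[str, Any]:
    """Build put_metric_alarm parameters for the RDS DatabaseConnections metric."""
    return {
        "AlarmName": f"{db_instance_id}-database-connections-high",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                      # seconds per datapoint
        "EvaluationPeriods": 1,            # alarm on the first breaching datapoint
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],   # assumed to forward to Slack
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Send the alarm definition to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical identifiers, for illustration only.
params = build_connection_alarm(
    "drop-prod",
    "arn:aws:sns:eu-north-1:123456789012:drop-ops-alerts",
)
print(params["AlarmName"], params["Threshold"])
```

Using the `Maximum` statistic over a one-minute period is a design choice aimed at catching short bursts like the one in this incident; an average over a longer period could smooth a 10-second spike below the threshold.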
7. Lessons Learned
- Connection pool exhaustion is a real risk for synchronous monoliths under burst load. Add PgBouncer before launch.
- The CloudWatch `DatabaseConnections` metric needs an alarm. This should have alerted before the health check failed.
- App Runner restart is fast and reliable. 5 minutes is an acceptable RTO for this class of issue.
- BetterStack detection was excellent: 30 seconds from failure to alert is within our target.
- Marketing campaigns need infrastructure coordination. A burst from a marketing email should trigger a pre-deployment review of connection capacity.
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | |||
| Approver | Alem Bašić | | |