Incident Report

Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Incident Metadata

Field	Value
Incident ID	INC-{{YYYY}}-{{SEQ}}
Severity	P{{SEVERITY}}
Status	{{STATUS}}
Incident Commander	{{IC}}
Technical Lead	{{TECH_LEAD}}
Communications Lead	{{COMMS_LEAD}}
Declared at	{{START_TIME}} {{TIMEZONE}}
Resolved at	{{END_TIME}} {{TIMEZONE}}
Total duration	{{DURATION}}
Affected service(s)	{{SERVICES}}
Environment	Production / Staging

2. Executive Summary

Example: "On {{DATE}}, a database connection pool exhaustion caused the {{SERVICE}} API to return 503 errors for approximately 47 minutes, affecting {{AFFECTED_COUNT}} users and resulting in an estimated {{REVENUE_IMPACT}} in lost transactions. The root cause was a code change in the v{{VERSION}} deployment that introduced N+1 queries under high load."

3. Detection

Detected by: {{DETECTION_METHOD}}
Detected at: {{DETECTION_TIME}}
Lag from start to detection: {{DETECTION_LAG}} minutes
Detecting system: {{DETECTING_SYSTEM}}

Alerting effectiveness:

Alert fired within the expected window (< {{ALERT_SLA}} minutes)

Alert delivered to on-call without delay

Alert contained sufficient context to begin investigation

Improvements to detection identified:

{{DETECTION_IMPROVEMENT_1}}
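
When filling in the detection lag and the first alerting-effectiveness check, the arithmetic can be sketched as below. This is a minimal illustration only: the function names, the timestamps, and the 5-minute alert window are assumptions made for the example, not values from any real incident or monitoring tool.

```python
from datetime import datetime, timezone


def detection_lag_minutes(started_at: datetime, detected_at: datetime) -> float:
    """Return the minutes elapsed between incident start and first detection."""
    if detected_at < started_at:
        raise ValueError("detected_at must not precede started_at")
    return (detected_at - started_at).total_seconds() / 60


def alert_within_window(lag_minutes: float, alert_sla_minutes: float) -> bool:
    """True if the alert fired strictly inside the expected window."""
    return lag_minutes < alert_sla_minutes


# Hypothetical timestamps, both in UTC so the subtraction is unambiguous.
start = datetime(2030, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
detected = datetime(2030, 1, 1, 10, 4, 30, tzinfo=timezone.utc)

lag = detection_lag_minutes(start, detected)
print(lag)                             # 4.5
print(alert_within_window(lag, 5.0))   # True (assumed 5-minute window)
```

Using timezone-aware values for both timestamps avoids off-by-hours errors when the start time and the alert time come from systems that report in different zones.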

4. Detailed Timeline

Timezone: All times in {{TIMEZONE}}

Time	Event	Actor	Notes
{{TIME}}	{{EVENT_1}}	{{ACTOR}}
{{TIME}}	{{EVENT_2}}	System	Alert ID: {{ALERT_ID}}
{{TIME}}	{{EVENT_3}}	{{ENGINEER}}
{{TIME}}	{{EVENT_4}}	{{IC}}
{{TIME}}	{{EVENT_5}}	{{ENGINEER}}
{{TIME}}	{{EVENT_6}}	{{ENGINEER}}
{{TIME}}	{{EVENT_7}}	System
{{TIME}}	{{EVENT_8}}	{{IC}}

5. Impact Assessment

Users Affected

Metric	Value
Total users affected	{{USER_COUNT}}
% of total user base	{{USER_PERCENT}}%
Geography affected	{{GEOGRAPHY}}
User tier affected	{{USER_TIER}}

Affected Services

Service	Impact Type	Severity	Duration
{{SERVICE_1}}	{{IMPACT_TYPE}}	{{SEV}}	{{DURATION}}
{{SERVICE_2}}	{{IMPACT_TYPE}}	{{SEV}}	{{DURATION}}

Data Impact

Type	Assessment
Data loss	{{DATA_LOSS}}
Data corruption	{{DATA_CORRUPTION}}
Data exposure	{{DATA_EXPOSURE}}
Verification method	{{VERIFICATION}}

Financial Impact

Category	Amount	Notes
Lost transactions	${{AMOUNT}}	{{TRANSACTION_COUNT}} failed transactions
SLA credits	${{AMOUNT}}	Per SLA contract
Operational cost	${{AMOUNT}}	Engineering hours to resolve
Total estimated	${{TOTAL}}	

SLA Breach Assessment

SLA Metric	Target	Actual	Breach
Uptime	{{UPTIME_SLA}}%	{{ACTUAL_UPTIME}}%	{{BREACH}}
Response time (P99)	< {{P99_SLA}}ms	{{P99_ACTUAL}}ms	{{BREACH}}
MTTR	< {{MTTR_SLA}}	{{MTTR_ACTUAL}}	{{BREACH}}
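
The Uptime row of the table above can be filled in with a simple calculation, sketched here. The 30-day measurement period and the 99.9% target are assumptions chosen for illustration; the 47 minutes of downtime reuses the figure from the example executive summary. Substitute the period and target defined in the actual SLA contract.

```python
def uptime_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Return uptime over the measurement period as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must lie within the period")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def is_breach(actual_percent: float, target_percent: float) -> bool:
    """An uptime SLA is breached when actual uptime falls below the target."""
    return actual_percent < target_percent


# Assumed inputs: 30-day period, 47 minutes of downtime, 99.9% target.
period = 30 * 24 * 60          # 43200 minutes
actual = uptime_percent(period, 47)

print(round(actual, 3))        # 99.891
print(is_breach(actual, 99.9)) # True
```

For scale, a 99.9% target over a 30-day period allows about 43 minutes of downtime, so a single 47-minute outage would exceed that budget on its own.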

6. Root Cause Analysis

5 Whys

Why #	Question	Answer
Why 1	Why did users see errors?	{{ANSWER_1}}
Why 2	Why was the API returning 503?	{{ANSWER_2}}
Why 3	Why was the connection pool exhausted?	{{ANSWER_3}}
Why 4	Why was the N+1 query introduced?	{{ANSWER_4}}
Why 5	Why did code review miss it?	{{ANSWER_5}}

Root cause: {{ROOT_CAUSE}}

Contributing Factors

{{FACTOR_1}}

{{FACTOR_2}}

{{FACTOR_3}}

Trigger Event

What triggered this specific incident now: {{TRIGGER}}

7. Resolution Steps

Step	Time	Action	Result
1	{{TIME}}	{{ACTION_1}}	{{RESULT_1}}
2	{{TIME}}	{{ACTION_2}}	{{RESULT_2}}
3	{{TIME}}	{{ACTION_3}}	{{RESULT_3}}

Resolution commands (for runbook):

# {{RESOLUTION_DESCRIPTION}}
{{RESOLUTION_COMMAND}}

8. What Went Well

{{WENT_WELL_1}}

{{WENT_WELL_2}}

{{WENT_WELL_3}}

9. What Went Wrong

{{WENT_WRONG_1}}

{{WENT_WRONG_2}}

{{WENT_WRONG_3}}

10. Action Items

#	Action	Owner	Due Date	Priority	Status
1	{{ACTION_1}}	{{OWNER}}	{{DUE}}	High	Open
2	{{ACTION_2}}	{{OWNER}}	{{DUE}}	High	Open
3	{{ACTION_3}}	{{OWNER}}	{{DUE}}	Medium	Open
4	{{ACTION_4}}	{{OWNER}}	{{DUE}}	High	Open
5	{{ACTION_5}}	{{OWNER}}	{{DUE}}	Low	Open

11. Lessons Learned

{{LESSON_1}}

{{LESSON_2}}

{{LESSON_3}}

12. Related Incidents

Incident ID	Date	Similarity	Resolved
INC-{{ID}}	{{DATE}}	{{DESCRIPTION}}	Yes / No

13. Communication Log

Time	Channel	Message Summary	Audience	Sent By
{{TIME}}	Status page	"Investigating reports of elevated errors"	All users	{{SENDER}}
{{TIME}}	Status page	"Identified root cause, applying fix"	All users	{{SENDER}}
{{TIME}}	Status page	"Incident resolved, all systems normal"	All users	{{SENDER}}
{{TIME}}	Email	Customer notification for SLA breach	Affected customers	{{SENDER}}

Related Documents

Approval

Role	Name	Date
Author		
Reviewer		
Approver		
