Incident Report

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Closed
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Example incident report (simulated pre-launch scenario)

1. Incident Overview

This document is filled with a realistic example incident based on Drop's architecture. Use it as a template for future real incidents.

Field	Value
Incident ID	INC-2026-001
Severity	P1 - Critical (production down)
Status	Closed / Resolved
Start Time	2026-02-20 10:30 UTC
End Time	2026-02-20 10:58 UTC
Duration	28 minutes
Services Affected	Drop production (all users)
Root Cause	RDS PostgreSQL connection pool exhaustion after spike in concurrent logins
Incident Commander	Alem Bašić
Reported By	BetterStack (automated `Drop Health Check` monitor)

Impact Summary

Metric	Value
Users affected	All active users (100%, full outage)
Transactions blocked	Estimated 3–5 remittances + 2 QR payments
Revenue impact	Approx. NOK 6,000–10,000 in blocked transactions
SLA impact	28 minutes downtime → monthly uptime: 99.94%
Compliance impact	None (no data loss, audit logs intact)
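
The monthly uptime figure follows from simple arithmetic. A minimal sketch, assuming a 30-day month (43,200 minutes) as the measurement window; the function name is illustrative and not part of any Drop tooling:

```python
def monthly_uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Return uptime as a percentage of the month, given total downtime in minutes."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# 28 minutes of downtime in an assumed 30-day month
print(f"{monthly_uptime_percent(28):.2f}%")  # prints 99.94%
```

Measured against a 28-day month such as February 2026, the same downtime works out to about 99.93%.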

Customer-facing behavior: Users saw error messages on all screens. /api/health returned HTTP 503 with "status":"down".
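
The health endpoint's behavior can be sketched as follows. This is a hypothetical illustration, not Drop's actual handler: the function name and the idea of passing in per-dependency check results are assumptions. Only the JSON shape is taken from the response observed during the incident.

```python
import json
from typing import Dict, Tuple


def build_health_response(check_results: Dict[str, bool]) -> Tuple[int, str]:
    """Map per-dependency check results to an HTTP status code and JSON body.

    check_results maps a dependency name (e.g. "db") to True if its probe
    (e.g. SELECT 1) succeeded.
    """
    checks = {
        name: {"status": "ok" if passed else "fail"}
        for name, passed in check_results.items()
    }
    all_ok = all(check_results.values())
    body = {"status": "ok" if all_ok else "down", "checks": checks}
    # 503 tells the uptime monitor that the service cannot serve traffic.
    return (200 if all_ok else 503), json.dumps(body, separators=(",", ":"))


status_code, body = build_health_response({"db": False})
print(status_code, body)
# prints: 503 {"status":"down","checks":{"db":{"status":"fail"}}}
```

Because a single failing dependency turns the whole response into a 503, the external monitor sees a database problem as a full outage, which matches what users experienced here.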

2. Timeline

Time (UTC)	Event
10:30:00	`Drop Health Check` monitor on BetterStack detects HTTP 503
10:30:30	Slack `#drop-ops` receives critical alert: "Drop Health Check is DOWN"
10:30:45	Alem acknowledges alert
10:31:00	Alem checks App Runner status: service shows `RUNNING`
10:31:30	Alem checks `/api/health`, response: `{"status":"down","checks":{"db":{"status":"fail"}}}`
10:32:00	Alem checks CloudWatch logs — sees repeated connection refused errors to RDS
10:32:30	Hypothesis: RDS connection issue. Alem checks RDS status — shows `available`
10:33:00	Alem queries RDS directly — connection succeeds from psql
10:34:00	New hypothesis: connection pool exhaustion in application
10:35:00	Alem triggers App Runner restart (`aws apprunner start-deployment`)
10:38:00	App Runner deployment completes — service `RUNNING`
10:38:30	Health check passes: `{"status":"ok"}`
10:38:45	BetterStack sends recovery alert: "Drop Health Check is UP (downtime: 8 min)"
10:39:00	Alem monitors for 15 minutes to confirm stability
10:54:00	Confirms stable — error spike cleared
10:58:00	Incident closed
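
The timeline contains two durations that are easy to conflate: time until service was restored and time until the incident was closed. A small sketch that derives both from the timestamps above (the helper name is illustrative):

```python
from datetime import datetime

FMT = "%H:%M:%S"


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


detected = "10:30:00"
print(minutes_between(detected, "10:30:45"))  # time to acknowledge: 0.75
print(minutes_between(detected, "10:38:30"))  # time to restore (health check passes): 8.5
print(minutes_between(detected, "10:58:00"))  # time to close: 28.0
```

The 8-minute figure in the BetterStack recovery alert corresponds to the second value; the 28-minute incident duration includes the stability watch.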

3. Root Cause Analysis

What Happened

A burst of concurrent BankID login attempts (triggered by a marketing email sent at 10:28 UTC) created 45+ simultaneous database connections. The Drop application uses application-level connection pooling with a default max of ~85 connections (db.t4g.micro limit). Each /api/auth/bankid/callback request opens a connection for session creation + user upsert — the simultaneous spike exhausted the pool.

When the pool was exhausted, new requests failed with connection refused and the health check's DB query (SELECT 1) also failed, triggering a 503 response.
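
The failure mode can be illustrated with a toy model. This is a sketch under stated assumptions, not Drop's code: `BoundedPool` is a hypothetical class and the pool size of 20 is an arbitrary illustrative value, not the real limit. The point is only that once every slot is held, further requests, including the health check probe, fail immediately.

```python
class PoolExhausted(Exception):
    """Raised when no connection slot is free."""


class BoundedPool:
    """Toy connection pool: tracks slot usage only, opens no real connections."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.in_use = 0

    def acquire(self) -> None:
        if self.in_use >= self.max_size:
            raise PoolExhausted("no free connection slots")
        self.in_use += 1

    def release(self) -> None:
        if self.in_use > 0:
            self.in_use -= 1


def simulate_burst(pool: BoundedPool, concurrent_requests: int) -> int:
    """Acquire a slot for every request before any is released; return failures."""
    rejected = 0
    for _ in range(concurrent_requests):
        try:
            pool.acquire()
        except PoolExhausted:
            rejected += 1
    return rejected


pool = BoundedPool(max_size=20)   # illustrative size, an assumption
print(simulate_burst(pool, 45))   # 45 simultaneous login callbacks, prints 25
```

In this model a health check arriving during the burst would call `acquire()` on a full pool and fail in the same way as a user request.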

Contributing Factors

No explicit connection pool configured: Drop uses the pg driver without PgBouncer or RDS Proxy. The application-level pool was implicitly bounded by the OS connection limits.

No connection pool metrics: CloudWatch didn't have an alert on DatabaseConnections — the issue wasn't detected until the health check failed.

No circuit breaker: The application did not gracefully degrade (e.g., serve cached data) when DB was unavailable.
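
A circuit breaker, as mentioned in the last factor, could look roughly like the sketch below. It is a generic illustration of the pattern, not something that exists in Drop: the class name, the thresholds, and the injectable clock are all assumptions. The half-open state is simplified so that all traffic is let through again once the cooldown has elapsed.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency for a cooldown period."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return False while the breaker is open (dependency assumed down)."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: let traffic probe the dependency again.
            # The failure count is kept, so one more failure re-opens it.
            self.opened_at = None
            return True
        return False

    def record_success(self) -> None:
        """Dependency answered: close the breaker and reset the count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Dependency failed: open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A request handler would check `allow_request()` before touching the database and, when it returns False, serve a degraded response such as cached data instead of waiting on a connection.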

Why App Runner restart fixed it

Restarting App Runner recreated the application process and reset all connection pools. With the burst of login requests completed (10 seconds of traffic), new requests arrived at a normal rate and the pool was sufficient.

4. Resolution

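Immediate fix: App Runner restart. Service was restored about 8 minutes after detection (health check passing at 10:38:30); the incident was closed 28 minutes after detection, including a 15-minute stability watch.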

Permanent fixes required:

Add PgBouncer connection pooler (or RDS Proxy) to limit per-application connections

Add CloudWatch alarm on DatabaseConnections > 70 for db.t4g.micro

Implement connection pool health check with metrics

Add a global rate limit on BankID login initiation for burst protection (a per-IP limit of 10/min already exists, but there is no limit across all IPs)
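
The gap in the last item, a per-IP limit without a global one, can be sketched as two sliding-window counters checked together. This is an illustrative sketch: the class name, the global limit of 30 per minute, and the in-memory storage are assumptions. Only the per-IP limit of 10/min comes from this report.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class LoginRateLimiter:
    """Sliding-window limiter with a per-IP limit and a global limit."""

    def __init__(
        self,
        per_ip_limit: int = 10,
        global_limit: int = 30,  # hypothetical value
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.per_ip_limit = per_ip_limit
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._per_ip: Dict[str, Deque[float]] = defaultdict(deque)
        self._global: Deque[float] = deque()

    @staticmethod
    def _evict(events: Deque[float], cutoff: float) -> None:
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()

    def allow(self, ip: str) -> bool:
        """Return True and record the attempt if both limits permit it."""
        now = self.clock()
        cutoff = now - self.window_seconds
        ip_events = self._per_ip[ip]
        self._evict(ip_events, cutoff)
        self._evict(self._global, cutoff)
        if len(ip_events) >= self.per_ip_limit or len(self._global) >= self.global_limit:
            return False
        ip_events.append(now)
        self._global.append(now)
        return True
```

With these numbers, a burst of 45 logins from 45 different IPs inside one minute would pass the per-IP check every time, but the global check would reject the last 15. Counters held in process memory only cover a single application instance; with more than one instance they would need shared storage.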

5. Detection Quality

Aspect	Score	Notes
Time to detection	Good	BetterStack detected in < 30 seconds
Alert received	Good	Slack alert within 30 seconds
Time to diagnose	Fair	4 minutes; hypothesis took 2 iterations
Time to resolve	Good	5 minutes (App Runner restart)
Post-restore verification	Good	Monitored for 15 minutes before closing

6. Action Items

#	Action	Owner	Priority	Due Date
1	Configure PgBouncer or RDS Proxy for connection pooling	Alem / Platform	P1	Before v1.0 launch
2	Add CloudWatch alarm: `DatabaseConnections > 70` → Slack alert	Alem	P1	Within 1 week
3	Add connection pool metrics to health check endpoint	Platform	P2	Before v1.0 launch
4	Document burst rate limiting strategy for marketing campaigns	Alem	P2	Within 2 weeks
5	Test DR runbook scenario for RDS connection failures	Platform	P3	Q2 2026
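
Action item 2 could be expressed as alarm parameters along these lines. A sketch only: the instance identifier, SNS topic ARN, alarm name, statistic, period, and evaluation settings are placeholders and assumptions. The metric name and threshold come from this report. The dictionary uses the parameter names of boto3's `put_metric_alarm`, and the actual call is left commented out so the snippet runs without AWS credentials. Delivery to Slack is assumed to go through an SNS topic.

```python
from typing import Any, Dict


def database_connections_alarm(
    db_instance_id: str, sns_topic_arn: str, threshold: int = 70
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{db_instance_id}-database-connections-high",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",      # assumption: alarm on the peak per period
        "Period": 60,                # seconds, assumption
        "EvaluationPeriods": 1,      # assumption: fire on the first breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# "drop-prod-db" and the ARN below are placeholders, not real resources.
params = database_connections_alarm(
    "drop-prod-db", "arn:aws:sns:REGION:ACCOUNT_ID:TOPIC_NAME"
)
print(params["MetricName"], params["ComparisonOperator"], params["Threshold"])

# To create the alarm (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```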

7. Lessons Learned

Connection pool exhaustion is a real risk for synchronous monoliths under burst load. Add PgBouncer before launch.

CloudWatch DatabaseConnections metric needs an alarm. This should have alerted before the health check failed.

App Runner restart is fast and reliable. 5 minutes is an acceptable RTO for this class of issue.

BetterStack detection was excellent — 30 seconds from failure to alert is within our target.

Marketing campaigns need infrastructure coordination. A burst from a marketing email should trigger a pre-deployment review of connection capacity.

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
