Post-Mortem
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 |
Blameless Culture Statement
This post-mortem is conducted in a blameless spirit. Our goal is to understand how and why the incident occurred, not to assign fault to individuals. People make the best decisions they can with the information and tools available at the time. When things go wrong, we look for systemic improvements that make the right action easier and the wrong action harder for everyone.
Post-Mortem Overview
This document is a filled, realistic example post-mortem based on Drop's architecture. It documents the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.
1. Incident Reference & Metadata
| Field | Value |
|---|---|
| Incident ID | INC- |
| Severity | |
| Incident Report | INC-{{YYYY}}-{{SEQ}} |
| Post-Mortem Facilitator | {{FACILITATOR}} |
| Post-Mortem Date | |
2. Executive Summary
At 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.
Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.
Template example for future incidents: "A database index was dropped during a migration on {{DATE}}, causing query performance to degrade by 50× under load. This resulted in a 1h 23min degraded service period affecting {{USERS}} users. We have restored the index, added migration validation tooling, and created safeguards to prevent similar incidents."
3. Impact Summary
{{REQUEST_PERCENT}}% of requests affected
4. Detailed Timeline
    timeline
        title Incident Timeline
        {{TIME_1}} : {{EVENT_1}}
        {{TIME_2}} : {{EVENT_2}}
        {{TIME_3}} : {{EVENT_3}}
        {{TIME_4}} : {{EVENT_4}}
        {{TIME_5}} : {{EVENT_5}}
| Time (UTC) | Event | Phase |
|---|---|---|
| … | /api/health returns {"status":"down","checks":{"db":{"status":"fail"}}} | Diagnosis |
| 10:32:00 | CloudWatch logs show repeated connection refused to RDS | Diagnosis |
| 10:33:00 | Direct psql connection to RDS succeeds, which rules out an RDS-level failure | Diagnosis |
| 10:34:00 | Hypothesis: application-level connection pool exhaustion | Diagnosis |
| 10:35:00 | Alem triggers App Runner restart via aws apprunner start-deployment | Mitigation |
| 10:38:00 | App Runner deployment completes | Mitigation |
| 10:38:30 | Health check returns {"status":"ok"} | Recovery |
| 10:38:45 | BetterStack recovery alert: "Drop Health Check is UP" | Recovery |
| 10:39:00 | Begin 15-minute stability monitoring window | Post-recovery |
| 10:54:00 | Confirmed stable; error spike cleared | Closed |
| 10:58:00 | Incident formally closed | Closed |
Total duration: 28 minutes (10:30 to 10:58 UTC)
MTTD (Mean Time to Detect): < 30 seconds
Time to diagnose root cause: ~4 minutes
Time to apply fix: ~5 minutes (App Runner restart)
MTTR (Mean Time to Resolve): {{MTTR}} minutes
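The durations above are simple differences between timeline timestamps. As a minimal sketch, the helper below derives them from clock times; the function names are hypothetical, and the incident start of 10:30:00 is an assumption taken from the "10:30 UTC" figure in the summary.

```typescript
// Convert an "HH:MM:SS" clock time into seconds since midnight.
function toSeconds(clock: string): number {
  const [h = 0, m = 0, s = 0] = clock.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Elapsed seconds between two clock times on the same day.
function elapsedSeconds(start: string, end: string): number {
  return toSeconds(end) - toSeconds(start);
}

// Format a number of seconds as "Xm Ys".
function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds}s`;
}

const incidentStart = "10:30:00"; // assumed from the summary's "10:30 UTC"
const recoveryAlert = "10:38:45"; // BetterStack recovery alert
const formallyClosed = "10:58:00"; // incident formally closed

const timeToRecovery = formatDuration(elapsedSeconds(incidentStart, recoveryAlert)); // "8m 45s"
const totalDuration = formatDuration(elapsedSeconds(incidentStart, formallyClosed)); // "28m 0s"
```

Keeping the timestamps as data makes it easy to recompute the figures if the timeline is corrected later.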
5. Root Cause Analysis
5.1 Five Whys Analysis
| Why # | Question | Answer |
|---|---|---|
| Why 1 | Why did Drop return HTTP 503? | The /api/health endpoint's DB check (SELECT 1) failed. |
| Why 2 | Why did the DB check fail? | The application could not acquire a database connection because the pool was exhausted. |
| Why 3 | Why was the pool exhausted? | ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window. |
| Why 4 | Why were 45 concurrent logins able to exhaust the pool? | No explicit connection pool limit was configured in the pg driver. The pool was bounded only by OS-level limits (~85 connections for db.t4g.micro), and there was no queue or timeout, so new requests failed immediately when the limit was hit. |
| Why 5 | Why did no alert fire before the health check failed? | No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing. |
Root cause: The application lacked explicit connection pool limits and DB-level metrics alerting.
5.2 Contributing Factors
| Factor | Description | Severity |
|---|---|---|
| No explicit pool config | pg used without max, idleTimeoutMillis, or connectionTimeoutMillis | High |
| No DB connection metrics | No CloudWatch alarm on DatabaseConnections > 70 | High |
| No graceful degradation | Application returned 503 when DB was unavailable, even for non-DB routes | Medium |
| No rate limiting across all IPs | Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously | Medium |
| No pre-campaign infra review | Marketing email campaign launched without coordinating with infrastructure | Medium |
| No connection pool health metric | Health check did not report pool utilization | Low |
3.3 What Worked Well
- BetterStack detection was excellent: < 30 seconds from failure to alert.
- Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
- App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
- RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
- No data loss: Audit logs intact, no transactions corrupted.
4. Impact Analysis
| Dimension | Impact |
|---|---|
| Users affected | 100% (full service outage) |
| Transactions blocked | ~3–5 remittances + ~2 QR payments |
| Revenue impact | Approx. NOK 6,000–10,000 |
| Compliance | None: no data loss, audit logs intact throughout |
| Regulatory | No notification required (< 4h, no PII exposure) |
| Reputation | Users saw error screens; limited blast radius pre-public-launch |
| SLA | 28 min downtime → monthly uptime 99.94% |
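The SLA figure can be checked with a short calculation. The sketch below assumes a 30-day month as the measurement window; the source does not state which window was used, and the function name is hypothetical.

```typescript
// Monthly uptime as a percentage, given downtime in minutes.
// daysInMonth is an assumption: the source does not say which window the SLA uses.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (1 - downtimeMinutes / totalMinutes) * 100;
}

const uptime30 = monthlyUptimePercent(28, 30).toFixed(2); // "99.94"
const uptime28 = monthlyUptimePercent(28, 28).toFixed(2); // "99.93"
```

With a 30-day window, 28 minutes of downtime rounds to 99.94%; a 28-day window would give 99.93%, so the window should be stated alongside the figure.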
5. Corrective Actions
5.1 Immediate (before next marketing campaign)
| # | Action | Owner | Status |
|---|---|---|---|
| 1 | Configure explicit pg pool: max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000 | Platform | Pending |
| 2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack #drop-ops | Alem | Pending |
| 3 | Add global rate limit on /api/auth/bankid/initiate (e.g., 100/min across all IPs) | Platform | Pending |
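Action 3 asks for a limit that counts requests across all IPs rather than per IP. One possible shape is a fixed-window counter, sketched below. The class name and the injected clock are hypothetical, and a real deployment with several App Runner instances would need shared state (for example in the database or a cache) rather than in-process memory.

```typescript
// Fixed-window limiter shared by all callers (not keyed by IP).
// Hypothetical sketch: in-memory only, so it limits a single process.
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = Number.NEGATIVE_INFINITY;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed. nowMs is passed in so the logic is testable.
  allow(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      // A new window begins with this request.
      this.windowStart = nowMs;
      this.count = 0;
    }
    if (this.count >= this.limit) {
      return false;
    }
    this.count += 1;
    return true;
  }
}

// 150 requests arriving at the same instant against a 100/min limit.
const limiter = new GlobalRateLimiter(100, 60_000);
let accepted = 0;
for (let i = 0; i < 150; i++) {
  if (limiter.allow(1_000)) accepted += 1;
}
// accepted is 100; the remaining 50 would receive a retry-later response.
```

A fixed window is the simplest option; a sliding window or token bucket smooths bursts at window boundaries at the cost of more state.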
5.2 Before v1.0 Launch
| # | Action | Owner | Priority |
|---|---|---|---|
| 4 | Add PgBouncer or RDS Proxy to externalize connection pooling | Platform | P1 |
| 5 | Report pool utilization in /api/health response (poolSize, idleCount, waitingCount) | Platform | P2 |
| 6 | Implement graceful degradation for non-DB routes when DB is unavailable | Platform | P2 |
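For action 5, the health response could carry the pool counters next to the existing DB check. The sketch below is an assumption about the payload shape: node-postgres pools expose totalCount, idleCount and waitingCount, so a PoolStats interface mirrors those, while buildHealth and the utilization field are hypothetical additions.

```typescript
// Subset of counters that a node-postgres Pool exposes.
interface PoolStats {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

interface HealthPayload {
  status: "ok" | "down";
  checks: {
    db: {
      status: "ok" | "fail";
      poolSize: number;
      idleCount: number;
      waitingCount: number;
      utilization: number; // hypothetical: busy connections divided by the configured max
    };
  };
}

// Build the /api/health body from the DB check result and the pool counters.
function buildHealth(dbReachable: boolean, pool: PoolStats, maxConnections: number): HealthPayload {
  const busy = pool.totalCount - pool.idleCount;
  return {
    status: dbReachable ? "ok" : "down",
    checks: {
      db: {
        status: dbReachable ? "ok" : "fail",
        poolSize: pool.totalCount,
        idleCount: pool.idleCount,
        waitingCount: pool.waitingCount,
        utilization: maxConnections > 0 ? busy / maxConnections : 0,
      },
    },
  };
}

const sample = buildHealth(true, { totalCount: 8, idleCount: 2, waitingCount: 0 }, 10);
// sample.checks.db.utilization is 0.6
```

A non-zero waitingCount is the earliest sign that requests are queuing for connections, so it is a natural field to alert on.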
5.3 Trigger Event
The specific trigger for this incident: a burst of ~45 concurrent BankID login callbacks within a 10-second window, following a marketing email campaign.
6. What Went Well
- {{CATEGORY_1}}: {{DESCRIPTION}}
- {{CATEGORY_2}}: {{DESCRIPTION}}
- {{CATEGORY_3}}: {{DESCRIPTION}}
7. What Went Wrong
- {{CATEGORY_1}}: {{DESCRIPTION}}
- {{CATEGORY_2}}: {{DESCRIPTION}}
- {{CATEGORY_3}}: {{DESCRIPTION}}
8. Where We Got Lucky
- {{LUCKY_1}}
- {{LUCKY_2}}
- {{LUCKY_3}}
9. Action Items
Process (ongoing)
| # | Action | Owner | Due |
|---|---|---|---|
| 7 | Create "Marketing → Infra" coordination checklist; must be completed before any campaign | Alem | Within 2 weeks |
| 8 | Add DB connection metrics to weekly monitoring review | Alem | Ongoing |
| 9 | Test App Runner restart as a documented runbook step | Platform | Q2 2026 |
6. Systemic Improvements
6.1 Connection Pooling Fix
Current state: Implicit pool, no limits, no timeout.
Target state:
    // src/drop-app/src/lib/db.ts (example)
    import { Pool } from "pg";

    const pool = new Pool({
      connectionString: process.env.DATABASE_URL,
      max: 10, // Hard cap: never exceed the RDS t4g.micro limit
      idleTimeoutMillis: 30000, // Release idle connections after 30s
      connectionTimeoutMillis: 2000, // Fail fast if the pool is exhausted
    });
When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.
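To see how max: 10 and connectionTimeoutMillis: 2000 would behave under a burst like the one in this incident, the simplified model below may help. It is a sketch, not a simulation of the pg driver: it assumes all requests arrive at once, each holds a connection for a fixed time, and waiters are served in waves. The hold times are hypothetical.

```typescript
interface BurstResult {
  served: number;
  timedOut: number;
  peakConnections: number;
}

// Simplified model: all requests arrive at t=0 and are served in waves of `max`.
// A request in wave k waits k * holdMs; it times out if that wait reaches timeoutMs.
function simulateBurst(requests: number, max: number, holdMs: number, timeoutMs: number): BurstResult {
  let served = 0;
  let timedOut = 0;
  for (let i = 0; i < requests; i++) {
    const wave = Math.floor(i / max);
    const waitMs = wave * holdMs;
    if (waitMs < timeoutMs) {
      served += 1;
    } else {
      timedOut += 1;
    }
  }
  return { served, timedOut, peakConnections: Math.min(requests, max) };
}

// 45 concurrent logins, assuming each session upsert holds a connection for 50 ms.
const fastQueries = simulateBurst(45, 10, 50, 2000);
// fastQueries: { served: 45, timedOut: 0, peakConnections: 10 }

// Same burst if each request held its connection for a full second.
const slowQueries = simulateBurst(45, 10, 1000, 2000);
// slowQueries: { served: 20, timedOut: 25, peakConnections: 10 }
```

In both cases the database never sees more than 10 connections from this process, which is the property the explicit cap is meant to provide; the timeout only decides how long excess requests wait before failing.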
6.2 CloudWatch Alarm
    aws cloudwatch put-metric-alarm \
      --alarm-name "drop-db-connections-high" \
      --alarm-description "RDS DatabaseConnections > 70" \
      --metric-name DatabaseConnections \
      --namespace AWS/RDS \
      --dimensions Name=DBInstanceIdentifier,Value=drop-db \
      --period 60 \
      --evaluation-periods 2 \
      --threshold 70 \
      --comparison-operator GreaterThanThreshold \
      --statistic Average \
      --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
      --region eu-west-1
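With --period 60, --evaluation-periods 2 and GreaterThanThreshold, the alarm is intended to fire once the per-minute average has exceeded 70 for two consecutive periods. The sketch below models only that rule; it is a simplification that ignores CloudWatch's missing-data handling and datapoints-to-alarm option, and the function name is hypothetical.

```typescript
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// Simplified model: ALARM when every one of the most recent
// `evaluationPeriods` averages is strictly greater than the threshold.
function evaluateAlarm(periodAverages: number[], threshold: number, evaluationPeriods: number): AlarmState {
  if (periodAverages.length < evaluationPeriods) {
    return "INSUFFICIENT_DATA";
  }
  const recent = periodAverages.slice(-evaluationPeriods);
  return recent.every((avg) => avg > threshold) ? "ALARM" : "OK";
}

const sustained = evaluateAlarm([40, 72, 85], 70, 2); // "ALARM"
const singleSpike = evaluateAlarm([40, 85, 60], 70, 2); // "OK"
const atThreshold = evaluateAlarm([70, 70], 70, 2); // "OK" (comparison is strict)
```

Two evaluation periods mean a burst shorter than about two minutes may not trigger the alarm, which is worth weighing against the 10-second burst seen in this incident.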
6.3 Marketing Campaign Checklist (Pre-Launch)
Before any marketing campaign that targets > 100 recipients:
- Notify infrastructure (Alem) at least 24h before send
- Check current DatabaseConnections baseline in CloudWatch
- Verify pool configuration is explicit
- Consider sending campaign in batches (< 100/hour) to spread load
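The batching item can be turned into a concrete send plan. The sketch below splits a recipient count into hourly batches under the "< 100/hour" guideline; the function name is hypothetical, and the 500-recipient figure is the campaign size mentioned under Lessons Learned.

```typescript
// Split a campaign into hourly batches of at most maxPerBatch recipients.
function planBatches(recipients: number, maxPerBatch: number): number[] {
  if (maxPerBatch <= 0) {
    throw new Error("maxPerBatch must be positive");
  }
  const batches: number[] = [];
  let remaining = recipients;
  while (remaining > 0) {
    const size = Math.min(remaining, maxPerBatch);
    batches.push(size);
    remaining -= size;
  }
  return batches;
}

// "< 100/hour" means at most 99 recipients per hourly batch.
const plan = planBatches(500, 99);
// plan: [99, 99, 99, 99, 99, 5], i.e. six hourly sends
```

Spreading 500 recipients over six hours keeps each wave of logins well below the ~45 concurrent callbacks that exhausted the pool.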
7. Lessons Learned
- Explicit is always better than implicit for resource limits. Never rely on OS defaults for connection pool configuration in production.
- Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A CloudWatch alarm on DatabaseConnections would have caught this 2 minutes earlier.
- Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
- App Runner restart is a fast, reliable mitigation. An RTO of < 5 minutes is acceptable for this class of issue. Document it as a first-response step in the runbook.
- BetterStack + Slack alerting works. The < 30 second detection time met our target. No changes needed here.
- DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.
8. Action Item Tracking
| # | Action | Owner | Due | Status |
|---|---|---|---|---|
| 1 | Configure explicit pg pool limits | Platform | Before next campaign | Pending |
| 2 | CloudWatch alarm on DatabaseConnections > 70 | Alem | Within 1 week | Pending |
| 3 | Global rate limit on BankID initiate | Platform | Before next campaign | Pending |
| 4 | PgBouncer / RDS Proxy | Platform | Before v1.0 | Pending |
| 5 | Pool utilization in /api/health | Platform | Before v1.0 | Pending |
| 6 | Graceful degradation for non-DB routes | Platform | Before v1.0 | Pending |
| 7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks | Pending |
| 8 | DB connection metrics in weekly review | Alem | Ongoing | Ongoing |
| 9 | App Runner restart in DR runbook | Platform | Q2 2026 | Pending |
Process Changes
| # | Change | Owner | Implementation Date |
|---|---|---|---|
| 1 | {{PROCESS_1}} | {{OWNER}} | {{DATE}} |
| 2 | {{PROCESS_2}} | {{OWNER}} | {{DATE}} |
10. Follow-Up Tracking
Follow-up review date: {{FOLLOWUP_DATE}} (4 weeks after incident)
Follow-up owner: {{FOLLOWUP_OWNER}}
| Action Item | Expected Completion | Verified Complete | Effective |
|---|---|---|---|
| {{ACTION_1}} | {{DATE}} | Yes / No | Yes / No / TBD |
| {{ACTION_2}} | {{DATE}} | Yes / No | Yes / No / TBD |
11. Recurrence Prevention
Before this incident: {{BEFORE_STATE}}
After implementing action items: {{AFTER_STATE}}
Confidence in prevention: {{CONFIDENCE}} / 10
Residual risk: {{RESIDUAL_RISK}}
12. Review & Sign-Off
Post-mortem presented at: {{MEETING}} on {{MEETING_DATE}}
Meeting recording: {{RECORDING_LINK}}
Meeting notes: {{NOTES_LINK}}
Related Documents
- Incident Report INC-2026-001
- Operational Runbook
- Disaster Recovery Plan
- Monitoring & Observability
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Facilitator / Reviewer | Alem Bašić | | |
| Approver | Alem Bašić | | |