Post-Mortem
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 |
Overview
This document is filled in with a realistic example post-mortem based on Drop's architecture. It documents the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.
Blameless Culture Statement
This post-mortem is conducted in a blameless spirit. Our goal is to understand how and why the incident occurred, not to assign fault to individuals. People make the best decisions they can with the information and tools available at the time. When things go wrong, we look for systemic improvements that make the right action easier and the wrong action harder for everyone.
Incident Reference & Metadata
| Field | Value |
|---|---|
| Incident ID | INC-2026-001 |
| Severity | P1 |
| Post-Mortem Date | |
| Participants | Alem Bašić (CEO), Platform Architect (AI) |
1. Executive Summary
At 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.
Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.
2. Timeline
| Time (UTC) | Event | Phase |
|---|---|---|
| | | Pre-incident |
| | | Detection |
| | #drop-ops | Detection |
| | | Response |
| | RUNNING | Diagnosis |
| | /api/health → {"status":"down","checks":{ | Diagnosis |
| | connection refused to RDS | Diagnosis |
| | | Diagnosis |
| | aws apprunner start-deployment | Mitigation |
| | | Mitigation |
| | Health check returns { | Recovery |
| 10:39:00 | Begin | Post-recovery |
| 10:54:00 | Confirmed stable; error spike cleared | Closed |
| 10:58:00 | Incident formally closed | Closed |
Total duration: 28 minutes (10:30 to 10:58 UTC)
Time to detect: < 30 seconds
Time to diagnose root cause: ~4 minutes
Time to apply fix: ~5 minutes (App Runner restart)
3. Root Cause Analysis
3.1 The Five Whys
Why did HTTP requests fail?
→ The /api/health endpoint's DB check (SELECT 1) failed.
Why did the DB check fail?
→ The application could not acquire a database connection; the pool was exhausted.
Why was the pool exhausted?
→ ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived within a 10-second window.
Why were 45 concurrent logins able to exhaust the pool?
→ No explicit connection pool limit was configured in the application.
Why did no alert fire before the service failed?
→ No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.
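3.2 Contributing Factors
| Factor | Description | Severity |
|---|---|---|
| No explicit pg max, idleTimeoutMillis, or connectionTimeoutMillis | Implicit pool with no limits and no timeout | |
| No CloudWatch alarm on DatabaseConnections > 70 | The only production alert path was the BetterStack health check | |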
| No rate limiting across all IPs | Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously | Medium |
| No pre-campaign infra review | Marketing email campaign launched without coordinating with infrastructure | Medium |
| No connection pool health metric | Health check did not report pool utilization | Low |
3.3 What Worked Well
- BetterStack detection was excellent: < 30 seconds from failure to alert.
- Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
- App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
- RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
- No data loss: Audit logs intact, no transactions corrupted.
4. Impact Analysis
| Dimension | Impact |
|---|---|
| Users affected | 100% — full service outage |
| Transactions blocked | ~3–5 remittances + ~2 QR payments |
| Revenue impact | Approx. NOK 6,000–10,000 |
| Compliance | None — no data loss, audit logs intact throughout |
| Regulatory | No notification required (< 4h, no PII exposure) |
| Reputation | Users saw error screens — limited blast radius pre-public-launch |
| SLA | 28 min downtime → monthly uptime 99.94% |
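The SLA figure in the table above can be reproduced with simple arithmetic. The sketch below assumes a 30-day month (43,200 minutes), which this report does not state explicitly; the function name is illustrative only.

```typescript
// Monthly uptime as a percentage, given total downtime in minutes.
// Assumption: the month length is passed in by the caller (30 days here).
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (1 - downtimeMinutes / totalMinutes) * 100;
}

const uptime = monthlyUptimePercent(28, 30);
console.log(uptime.toFixed(2)); // "99.94"
```

With a 28-day month the same downtime gives a slightly lower figure, so the month length used for SLA reporting is worth pinning down.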
5. Corrective Actions
5.1 Immediate (before next marketing campaign)
| # | Action | Owner | Status |
|---|---|---|---|
| 1 | Configure explicit pg pool: max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000 | Platform | Pending |
| 2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack #drop-ops | Alem | Pending |
| 3 | Add global rate limit on /api/auth/bankid/initiate | Platform | Pending |
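Action 3 calls for a limit that applies across all callers, since the per-IP limit did not stop a burst spread over many IPs. A minimal sketch of one way to do this is a sliding-window counter shared by every request. The class name, the 20-per-10-seconds limit, and the in-memory storage are all assumptions for illustration; with more than one application instance, the counter would need to live in a shared store.

```typescript
// Sliding-window limiter shared by all callers, regardless of source IP.
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMillis: number;
  private admitted: number[] = []; // timestamps of admitted requests

  constructor(limit: number, windowMillis: number) {
    this.limit = limit;
    this.windowMillis = windowMillis;
  }

  // Returns true if a request arriving at `nowMillis` may proceed.
  tryAcquire(nowMillis: number): boolean {
    const cutoff = nowMillis - this.windowMillis;
    // Forget admissions that have fallen out of the window.
    this.admitted = this.admitted.filter((t) => t > cutoff);
    if (this.admitted.length >= this.limit) {
      return false; // the route handler would answer with HTTP 429
    }
    this.admitted.push(nowMillis);
    return true;
  }
}

// Hypothetical numbers: allow 20 login initiations per 10-second window.
const limiter = new GlobalRateLimiter(20, 10_000);

// 45 logins from 45 different IPs arriving within the same second.
let accepted = 0;
for (let i = 0; i < 45; i++) {
  if (limiter.tryAcquire(1_000 + i)) accepted++;
}
console.log(accepted); // 20
```

Passing the clock in as an argument keeps the limiter deterministic and easy to test; a real handler would pass `Date.now()`.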
5.2 Before v1.0 Launch
| # | Action | Owner | Priority |
|---|---|---|---|
| 4 | Add PgBouncer or RDS Proxy to externalize connection pooling | Platform | P1 |
| 5 | Report pool utilization in /api/health response (poolSize, idleCount, waitingCount) | Platform | P2 |
| 6 | Implement graceful degradation for non-DB routes when DB is unavailable | Platform | P2 |
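Action 5 could be sketched as a small pure function that turns pool counters into a health payload. The `PoolStats` shape mirrors the `totalCount`, `idleCount`, and `waitingCount` counters that a node-postgres `Pool` exposes; the response layout under `checks.db`, the `degraded` status, and the function name are assumptions, not the current `/api/health` contract.

```typescript
// Counters as exposed by a pg Pool instance.
interface PoolStats {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Hypothetical response shape for /api/health.
interface HealthResponse {
  status: "ok" | "degraded" | "down";
  checks: {
    db: { reachable: boolean; poolSize: number; idleCount: number; waitingCount: number };
  };
}

function buildHealthResponse(dbReachable: boolean, stats: PoolStats, poolMax: number): HealthResponse {
  // "degraded" (an assumed extra state): the DB answers, but the pool is
  // full and callers are already queueing for a connection.
  const saturated =
    stats.totalCount >= poolMax && stats.idleCount === 0 && stats.waitingCount > 0;
  const status = !dbReachable ? "down" : saturated ? "degraded" : "ok";
  return {
    status,
    checks: {
      db: {
        reachable: dbReachable,
        poolSize: stats.totalCount,
        idleCount: stats.idleCount,
        waitingCount: stats.waitingCount,
      },
    },
  };
}

const health = buildHealthResponse(true, { totalCount: 10, idleCount: 0, waitingCount: 35 }, 10);
console.log(health.status); // "degraded"
```

Reporting saturation before the DB check itself fails would turn the health endpoint into a leading indicator for this class of incident.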
5.3 Process (ongoing)
| # | Action | Owner | Due |
|---|---|---|---|
| 7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks |
| 8 | DB connection metrics in weekly review | Alem | Ongoing |
| 9 | App Runner restart in DR runbook | Platform | Q2 2026 |
6. Systemic Improvements
6.1 Connection Pooling Fix
Current state: Implicit pool, no limits, no timeout.
Target state:
// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10, // Hard cap: never exceed RDS t4g.micro limit
  idleTimeoutMillis: 30000, // Release idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if pool is exhausted
});
When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.
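To see what the target settings would do under a burst like the one in this incident, the following sketch models a bounded pool with a fail-fast timeout. It is a simplified, deterministic model rather than the pg implementation: all requests arrive at t = 0, each holds a connection for a fixed `holdMillis` (800 ms is a made-up figure), and waiters are served in arrival order.

```typescript
interface PoolConfig {
  max: number;
  connectionTimeoutMillis: number;
}

interface BurstResult {
  served: number;
  timedOut: number;
}

// Simulates `requests` callers arriving at t = 0 against a bounded pool.
function simulateBurst(requests: number, holdMillis: number, config: PoolConfig): BurstResult {
  // freeAt[i] is the time at which connection i next becomes available.
  const freeAt: number[] = Array.from({ length: config.max }, () => 0);
  let served = 0;
  let timedOut = 0;

  for (let i = 0; i < requests; i++) {
    // Pick the connection that frees up first.
    let slot = 0;
    for (let j = 1; j < freeAt.length; j++) {
      if (freeAt[j] < freeAt[slot]) slot = j;
    }
    const waitMillis = freeAt[slot]; // arrival time is 0, so wait equals free time
    if (waitMillis >= config.connectionTimeoutMillis) {
      timedOut++; // fails fast instead of queueing indefinitely
      continue;
    }
    freeAt[slot] = waitMillis + holdMillis;
    served++;
  }
  return { served, timedOut };
}

const result = simulateBurst(45, 800, { max: 10, connectionTimeoutMillis: 2000 });
console.log(result); // { served: 30, timedOut: 15 }
```

Under these assumptions, 15 of the 45 logins receive a quick error they can retry, while the connection count never exceeds the cap.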
6.2 CloudWatch Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "drop-db-connections-high" \
--alarm-description "RDS DatabaseConnections > 70" \
--metric-name DatabaseConnections \
--namespace AWS/RDS \
--dimensions Name=DBInstanceIdentifier,Value=drop-db \
--period 60 \
--evaluation-periods 2 \
--threshold 70 \
--comparison-operator GreaterThanThreshold \
--statistic Average \
--alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
--region eu-west-1
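With `--period 60` and `--evaluation-periods 2`, the alarm above needs two consecutive one-minute averages over the threshold before it fires. The sketch below models only that rule; it ignores CloudWatch's missing-data handling, and the function name and sample values are invented for illustration.

```typescript
// Returns true when the most recent `evaluationPeriods` datapoints all
// exceed the threshold (GreaterThanThreshold semantics).
function alarmWouldFire(datapoints: number[], threshold: number, evaluationPeriods: number): boolean {
  if (datapoints.length < evaluationPeriods) {
    return false; // not enough data yet
  }
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold);
}

// One-minute averages of DatabaseConnections (made-up sample values).
console.log(alarmWouldFire([12, 15, 74, 81], 70, 2)); // true
console.log(alarmWouldFire([12, 74, 15, 81], 70, 2)); // false
```

A consequence of this rule is that a spike shorter than two full periods would not trigger the alarm, which may matter for bursts as brief as the one in this incident.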
6.3 Marketing Campaign Checklist (Pre-Launch)
Before any marketing campaign that targets > 100 recipients:
- Notify infrastructure (Alem) at least 24h before send
- Check current DatabaseConnections baseline in CloudWatch
- Verify pool configuration is explicit
- Consider sending campaign in batches (< 100/hour) to spread load
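The batching item in the checklist can be planned ahead of time. This is a minimal sketch assuming one batch per hour and a batch size of 99 to satisfy the "< 100/hour" guideline; the function name is hypothetical, and the 500-recipient figure is the campaign size cited in this report.

```typescript
// Splits a recipient count into hourly batch sizes of at most `batchSize`.
function planBatches(recipients: number, batchSize: number): number[] {
  if (batchSize <= 0) {
    throw new Error("batchSize must be positive");
  }
  const batches: number[] = [];
  let remaining = recipients;
  while (remaining > 0) {
    const size = Math.min(batchSize, remaining);
    batches.push(size);
    remaining -= size;
  }
  return batches;
}

const plan = planBatches(500, 99);
console.log(plan); // [ 99, 99, 99, 99, 99, 5 ]
console.log(plan.length); // 6 hourly sends
```

Spreading a 500-recipient campaign this way would take six hours, which is one of the trade-offs to weigh against the burst risk.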
7. Lessons Learned
- Explicit is always better than implicit for resource limits. Never rely on implicit library defaults for connection pool configuration in production.
- Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A DatabaseConnections CloudWatch alarm would have caught this 2 minutes earlier.
- Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
- App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.
- BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.
- DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.
8. Action Item Tracking
| # | Action | Owner | Due | Status |
|---|---|---|---|---|
| 1 | Explicit pg pool configuration | Platform | Before next campaign | Pending |
| 2 | CloudWatch alarm on DatabaseConnections > 70 | Alem | Before next campaign | Pending |
| 3 | Global rate limit on /api/auth/bankid/initiate | Platform | Before next campaign | Pending |
| 4 | PgBouncer / RDS Proxy | Platform | Before v1.0 | Pending |
| 5 | Pool utilization in /api/health | Platform | Before v1.0 | Pending |
| 6 | Graceful degradation for non-DB routes | Platform | Before v1.0 | Pending |
| 7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks | Pending |
| 8 | DB connection metrics in weekly review | Alem | Ongoing | Ongoing |
| 9 | App Runner restart in DR runbook | Platform | Q2 2026 | Pending |
Related Documents
- Incident Report INC-2026-001
- Operational Runbook
- Disaster Recovery Plan
- Monitoring & Observability
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | Alem Bašić | | |
| Approver | Alem Bašić | | |