Post-Mortem
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Example post-mortem (simulated pre-launch scenario — mirrors INC-2026-001) |
Post-Mortem Overview
This document contains a realistic example post-mortem based on Drop's architecture. It covers the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but adds deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.
| Field | Value |
|---|---|
| Incident ID | INC-2026-001 |
| Severity | P1 — Critical |
| Post-Mortem Date | 2026-02-21 |
| Facilitator | Alem Bašić |
| Incident Commander | Alem Bašić |
| Participants | Alem Bašić (CEO), Platform Architect (AI) |
1. Executive Summary
On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.
Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.
2. Timeline
| Time (UTC) | Event | Phase |
|---|---|---|
| 10:28:00 | Marketing email delivered to ~500 recipients | Pre-incident |
| 10:30:00 | BetterStack detects HTTP 503 on Drop Health Check | Detection |
| 10:30:30 | Slack #drop-ops alert fires | Detection |
| 10:30:45 | Alem acknowledges alert | Response |
| 10:31:00 | Alem checks App Runner → status RUNNING | Diagnosis |
| 10:31:30 | Alem checks /api/health → {"status":"down","checks":{"db":{"status":"fail"}}} | Diagnosis |
| 10:32:00 | CloudWatch logs show repeated connection refused to RDS | Diagnosis |
| 10:33:00 | Direct psql connection to RDS succeeds — rules out RDS-level failure | Diagnosis |
| 10:34:00 | Hypothesis: application-level connection pool exhaustion | Diagnosis |
| 10:35:00 | Alem triggers App Runner restart via aws apprunner start-deployment | Mitigation |
| 10:38:00 | App Runner deployment completes | Mitigation |
| 10:38:30 | Health check returns {"status":"ok"} | Recovery |
| 10:38:45 | BetterStack recovery alert: "Drop Health Check is UP" | Recovery |
| 10:39:00 | Begin 15-minute stability monitoring window | Post-recovery |
| 10:54:00 | Confirmed stable — error spike cleared | Closed |
| 10:58:00 | Incident formally closed | Closed |
- Total duration: 28 minutes (10:30 to 10:58 UTC)
- Time to detect: < 30 seconds
- Time to diagnose root cause: ~4 minutes
- Time to apply fix: ~5 minutes (App Runner restart)
3. Root Cause Analysis
3.1 The Five Whys
Why did Drop return HTTP 503?
→ The /api/health endpoint's DB check (SELECT 1) failed.
Why did the DB check fail?
→ The application could not acquire a database connection because the pool was exhausted.
Why was the pool exhausted?
→ ~45 concurrent BankID login callbacks (each requiring a DB connection for the session upsert) arrived within a 10-second window.
Why were 45 concurrent logins able to exhaust the pool?
→ No explicit connection pool limit was configured in the pg driver. Connections were bounded only by the database instance's own connection limit (~85 connections for db.t4g.micro), and there was no queue or timeout, so new requests failed immediately once the limit was hit.
Why did no alert fire before the health check failed?
→ No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.
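To make the third and fourth "why" concrete, the sketch below models a burst of requests hitting a bounded pool with a wait timeout. It is a deliberately simplified model, not Drop's code: it assumes every request arrives at the same instant and holds its connection for the same duration. The function name `simulateBurst` and the 50 ms hold time are illustrative assumptions.

```typescript
interface BurstResult {
  served: number;          // requests that obtained a connection
  timedOut: number;        // requests that gave up waiting
  peakConnections: number; // most connections open at once
  maxWaitMs: number;       // longest wait among served requests
}

// Simplified model: all requests arrive at t=0, each holds a connection
// for holdMs, and waiters are served FIFO in "waves" of poolMax.
function simulateBurst(
  requests: number,
  poolMax: number,
  holdMs: number,
  timeoutMs: number,
): BurstResult {
  let served = 0;
  let timedOut = 0;
  let maxWaitMs = 0;
  for (let i = 0; i < requests; i++) {
    // Request i waits for all earlier waves to finish.
    const waitMs = Math.floor(i / poolMax) * holdMs;
    if (waitMs > timeoutMs) {
      timedOut++;
    } else {
      served++;
      maxWaitMs = Math.max(maxWaitMs, waitMs);
    }
  }
  return {
    served,
    timedOut,
    peakConnections: Math.min(requests, poolMax),
    maxWaitMs,
  };
}

// 45 logins, no cap: every request opens its own connection at once.
console.log(simulateBurst(45, Infinity, 50, 2000).peakConnections); // 45
// 45 logins, capped at 10 with a 2 s wait timeout (assumed 50 ms hold time).
console.log(simulateBurst(45, 10, 50, 2000));
```

Under these assumptions the capped pool never opens more than 10 connections and the last wave waits about 200 ms, whereas the uncapped case pushes all 45 connections onto the database at the same moment.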
3.2 Contributing Factors
| Factor | Description | Severity |
|---|---|---|
| No explicit pool config | pg used without max, idleTimeoutMillis, or connectionTimeoutMillis | High |
| No DB connection metrics | No CloudWatch alarm on DatabaseConnections > 70 | High |
| No graceful degradation | Application returned 503 when DB was unavailable, even for non-DB routes | Medium |
| No rate limiting across all IPs | Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously | Medium |
| No pre-campaign infra review | Marketing email campaign launched without coordinating with infrastructure | Medium |
| No connection pool health metric | Health check did not report pool utilization | Low |
3.3 What Worked Well
- BetterStack detection was excellent: < 30 seconds from failure to alert.
- Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
- App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
- RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
- No data loss: Audit logs intact, no transactions corrupted.
4. Impact Analysis
| Dimension | Impact |
|---|---|
| Users affected | 100% — full service outage |
| Transactions blocked | ~3–5 remittances + ~2 QR payments |
| Revenue impact | Approx. NOK 6,000–10,000 |
| Compliance | None — no data loss, audit logs intact throughout |
| Regulatory | No notification required (< 4h, no PII exposure) |
| Reputation | Users saw error screens — limited blast radius pre-public-launch |
| SLA | 28 min downtime → monthly uptime 99.94% |
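The SLA figure in the table can be reproduced with a short calculation. The sketch below assumes that "monthly uptime" means the share of minutes in the month without downtime. The month length is also an assumption, since the table does not say which was used.

```typescript
// Uptime percentage for a month with the given number of days.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

console.log(monthlyUptimePercent(28, 30).toFixed(2)); // "99.94" (30-day month)
console.log(monthlyUptimePercent(28, 28).toFixed(2)); // "99.93" (28-day February)
```

The 99.94% in the table matches a 30-day month. For a 28-day February the same downtime works out to roughly 99.93%.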
5. Corrective Actions
5.1 Immediate (before next marketing campaign)
| # | Action | Owner | Status |
|---|---|---|---|
| 1 | Configure explicit pg pool: max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000 | Platform | Pending |
| 2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack #drop-ops | Alem | Pending |
| 3 | Add global rate limit on /api/auth/bankid/initiate (e.g., 100/min across all IPs) | Platform | Pending |
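Action 3 could take many forms. The sketch below shows one simple option, a fixed-window counter shared across all callers. It is an in-memory illustration only, not the planned implementation: the class name `GlobalRateLimiter` is hypothetical, and a deployment running more than one instance would need a shared store.

```typescript
// Fixed-window limiter: at most `limit` requests per `windowMs`,
// counted across all clients rather than per IP.
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed. The caller passes the clock
  // value (e.g. Date.now()), which keeps the class easy to test.
  tryAcquire(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStart = nowMs;
      this.count = 0;
    }
    if (this.count >= this.limit) {
      return false;
    }
    this.count++;
    return true;
  }
}

// Example: 100 requests per minute across all IPs.
const bankIdLimiter = new GlobalRateLimiter(100, 60_000);
console.log(bankIdLimiter.tryAcquire(Date.now())); // true for the first request
```

A fixed window still allows a short spike at a window boundary. A sliding window or token bucket would smooth that out at the cost of slightly more state.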
5.2 Before v1.0 Launch
| # | Action | Owner | Priority |
|---|---|---|---|
| 4 | Add PgBouncer or RDS Proxy to externalize connection pooling | Platform | P1 |
| 5 | Report pool utilization in /api/health response (poolSize, idleCount, waitingCount) | Platform | P2 |
| 6 | Implement graceful degradation for non-DB routes when DB is unavailable | Platform | P2 |
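As a rough illustration of action 5, the sketch below builds a health payload from pool counters. node-postgres exposes `totalCount`, `idleCount`, and `waitingCount` on a `Pool` instance, and the example maps `totalCount` to the `poolSize` field named above. The `utilization` field, the `"ok"` value for a passing DB check, and the function name `buildHealthResponse` are assumptions, not part of the current endpoint.

```typescript
// Counters as exposed by a node-postgres Pool instance.
interface PoolStats {
  totalCount: number;   // clients currently open (idle + checked out)
  idleCount: number;    // open clients not checked out
  waitingCount: number; // callers queued waiting for a client
}

interface HealthResponse {
  status: "ok" | "down";
  checks: {
    db: {
      status: "ok" | "fail";
      poolSize: number;
      idleCount: number;
      waitingCount: number;
      utilization: number; // assumed extra field: in-use connections / max
    };
  };
}

function buildHealthResponse(
  dbReachable: boolean,
  stats: PoolStats,
  poolMax: number,
): HealthResponse {
  const inUse = stats.totalCount - stats.idleCount;
  return {
    status: dbReachable ? "ok" : "down",
    checks: {
      db: {
        status: dbReachable ? "ok" : "fail",
        poolSize: stats.totalCount,
        idleCount: stats.idleCount,
        waitingCount: stats.waitingCount,
        utilization: poolMax > 0 ? inUse / poolMax : 0,
      },
    },
  };
}

// Example: 10 open clients, 2 idle, 3 callers waiting, max of 10.
console.log(
  JSON.stringify(buildHealthResponse(true, { totalCount: 10, idleCount: 2, waitingCount: 3 }, 10)),
);
```

A non-zero `waitingCount` alongside high utilization is the early signal that was missing during the incident.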
5.3 Process (ongoing)
| # | Action | Owner | Due |
|---|---|---|---|
| 7 | Create "Marketing → Infra" coordination checklist — must be completed before any campaign | Alem | Within 2 weeks |
| 8 | Add DB connection metrics to weekly monitoring review | Alem | Ongoing |
| 9 | Test App Runner restart as a documented runbook step | Platform | Q2 2026 |
6. Systemic Improvements
6.1 Connection Pooling Fix
Current state: Implicit pool, no limits, no timeout.
Target state:
// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10, // Hard cap: never exceed the RDS t4g.micro limit
  idleTimeoutMillis: 30000, // Release idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if pool is exhausted
});
When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.
6.2 CloudWatch Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "drop-db-connections-high" \
--alarm-description "RDS DatabaseConnections > 70" \
--metric-name DatabaseConnections \
--namespace AWS/RDS \
--dimensions Name=DBInstanceIdentifier,Value=drop-db \
--period 60 \
--evaluation-periods 2 \
--threshold 70 \
--comparison-operator GreaterThanThreshold \
--statistic Average \
--alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
--region eu-west-1
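The interaction of `--period 60`, `--evaluation-periods 2`, and `--threshold 70` can be sketched as a small model. This is a simplification that ignores CloudWatch's missing-data handling. It assumes one averaged datapoint per 60-second period and treats the alarm as firing only when the most recent two datapoints both exceed the threshold. The function name `isInAlarm` is illustrative.

```typescript
// Simplified alarm model: ALARM when each of the last `evaluationPeriods`
// datapoints is strictly greater than the threshold.
function isInAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (evaluationPeriods < 1) {
    throw new Error("evaluationPeriods must be at least 1");
  }
  if (datapoints.length < evaluationPeriods) {
    return false; // not enough data yet
  }
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold);
}

console.log(isInAlarm([40, 75, 80], 70, 2)); // true: two consecutive periods above 70
console.log(isInAlarm([40, 85, 60], 70, 2)); // false: only one period above 70
```

Under this model the alarm responds to sustained elevation over roughly two minutes. A spike confined to a single period would not trigger it, which is worth keeping in mind when tuning the period and evaluation count.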
6.3 Marketing Campaign Checklist (Pre-Launch)
Before any marketing campaign that targets > 100 recipients:
- Notify infrastructure (Alem) at least 24h before send
- Check current DatabaseConnections baseline in CloudWatch
- Verify pool configuration is explicit
- Consider sending campaign in batches (< 100/hour) to spread load
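The batching suggestion in the last item can be worked through with a small helper. The sketch below assumes one batch per hour and a cap of 99 recipients per batch, to stay under the "< 100/hour" guideline. The function name `planBatches` is illustrative.

```typescript
// Split a recipient count into batches of at most maxPerBatch.
function planBatches(recipients: number, maxPerBatch: number): number[] {
  if (maxPerBatch < 1) {
    throw new Error("maxPerBatch must be at least 1");
  }
  const batches: number[] = [];
  let remaining = recipients;
  while (remaining > 0) {
    const size = Math.min(remaining, maxPerBatch);
    batches.push(size);
    remaining -= size;
  }
  return batches;
}

const plan = planBatches(500, 99);
console.log(plan);        // [99, 99, 99, 99, 99, 5]
console.log(plan.length); // 6 batches, i.e. the send spans about six hourly slots
```

For a campaign the size of the one in this incident (~500 recipients), that spreads the send across six hourly batches instead of a single burst.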
7. Lessons Learned
- Explicit is always better than implicit for resource limits. Never rely on driver or platform defaults for connection pool configuration in production.
- Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A DatabaseConnections CloudWatch alarm would have caught this 2 minutes earlier.
- Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
- App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.
- BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.
- DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.
8. Action Item Tracking
| # | Action | Owner | Due | Status |
|---|---|---|---|---|
| 1 | Configure explicit pg pool limits | Platform | Before next campaign | Pending |
| 2 | CloudWatch alarm on DatabaseConnections > 70 | Alem | Within 1 week | Pending |
| 3 | Global rate limit on BankID initiate | Platform | Before next campaign | Pending |
| 4 | PgBouncer / RDS Proxy | Platform | Before v1.0 | Pending |
| 5 | Pool utilization in /api/health | Platform | Before v1.0 | Pending |
| 6 | Graceful degradation for non-DB routes | Platform | Before v1.0 | Pending |
| 7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks | Pending |
| 8 | DB connection metrics in weekly review | Alem | Ongoing | Ongoing |
| 9 | App Runner restart in DR runbook | Platform | Q2 2026 | Pending |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Facilitator | Alem Bašić | | |
| Approver | Alem Bašić | | |