Post-Mortem

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Example post-mortem (simulated pre-launch scenario — mirrors INC-2026-001)

Post-Mortem Overview

This document is a realistic example post-mortem based on Drop's architecture. It covers the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but goes deeper into root cause analysis and systemic improvements. Use it as a template for future real incidents.

Field	Value
Incident ID	INC-2026-001
Severity	P1 — Critical
Post-Mortem Date	2026-02-21
Facilitator	Alem Bašić
Incident Commander	Alem Bašić
Participants	Alem Bašić (CEO), Platform Architect (AI)

1. Executive Summary

On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.

Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.

2. Timeline

Time (UTC)	Event	Phase
10:28:00	Marketing email delivered to ~500 recipients	Pre-incident
10:30:00	BetterStack detects HTTP 503 on Drop Health Check	Detection
10:30:30	Slack `#drop-ops` alert fires	Detection
10:30:45	Alem acknowledges alert	Response
10:31:00	Alem checks App Runner → status `RUNNING`	Diagnosis
10:31:30	Alem checks `/api/health` → `{"status":"down","checks":{"db":{"status":"fail"}}}`	Diagnosis
10:32:00	CloudWatch logs show repeated `connection refused` to RDS	Diagnosis
10:33:00	Direct psql connection to RDS succeeds — rules out RDS-level failure	Diagnosis
10:34:00	Hypothesis: application-level connection pool exhaustion	Diagnosis
10:35:00	Alem triggers App Runner restart via `aws apprunner start-deployment`	Mitigation
10:38:00	App Runner deployment completes	Mitigation
10:38:30	Health check returns `{"status":"ok"}`	Recovery
10:38:45	BetterStack recovery alert: "Drop Health Check is UP"	Recovery
10:39:00	Begin 15-minute stability monitoring window	Post-recovery
10:54:00	Confirmed stable — error spike cleared	Closed
10:58:00	Incident formally closed	Closed

Total duration: 28 minutes (10:30 to 10:58 UTC)
Time to detect: < 30 seconds
Time to diagnose root cause: ~4 minutes
Time to apply fix: ~5 minutes (App Runner restart)
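The summary figures can be re-derived from the timeline table. The sketch below is only an illustration of that arithmetic: the timestamps are copied from the table, while the object and function names are made up for this example. It reads "time to apply fix" as the time from detection to triggering the restart, which is an assumption about how that figure was measured.

```typescript
// Timestamps copied from the timeline table (all UTC, same day).
const timeline = {
  detected: "10:30:00",
  hypothesis: "10:34:00",
  restartTriggered: "10:35:00",
  healthOk: "10:38:30",
  closed: "10:58:00",
} as const;

// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Elapsed minutes between two timeline entries.
function minutesBetween(from: string, to: string): number {
  return (toSeconds(to) - toSeconds(from)) / 60;
}

const totalDuration = minutesBetween(timeline.detected, timeline.closed);
const timeToDiagnose = minutesBetween(timeline.detected, timeline.hypothesis);
const timeToMitigate = minutesBetween(timeline.detected, timeline.restartTriggered);
const timeToHealthy = minutesBetween(timeline.detected, timeline.healthOk);

console.log({ totalDuration, timeToDiagnose, timeToMitigate, timeToHealthy });
```

Note that the health check was green again 8.5 minutes after detection. The 28-minute total includes the stability monitoring window and formal closure.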

3. Root Cause Analysis

3.1 The Five Whys

Why did Drop return HTTP 503? → The /api/health endpoint's DB check (SELECT 1) failed.

Why did the DB check fail? → The application could not acquire a database connection — pool was exhausted.

Why was the pool exhausted? → ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window.

Why were 45 concurrent logins able to exhaust the pool? → No explicit connection pool limit was configured in the pg driver. The only effective ceiling was the database-side connection limit (~85 connections for db.t4g.micro), not anything set by the application, and there was no queue or timeout, so new requests failed immediately once the limit was hit.

Why did no alert fire before the health check failed? → No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.

3.2 Contributing Factors

Factor	Description	Severity
No explicit pool config	`pg` used without `max`, `idleTimeoutMillis`, or `connectionTimeoutMillis`	High
No DB connection metrics	No CloudWatch alarm on `DatabaseConnections > 70`	High
No graceful degradation	Application returned 503 when DB was unavailable, even for non-DB routes	Medium
No rate limiting across all IPs	Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously	Medium
No pre-campaign infra review	Marketing email campaign launched without coordinating with infrastructure	Medium
No connection pool health metric	Health check did not report pool utilization	Low
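The "no graceful degradation" factor in the table above can be pictured as a per-route decision: only routes that actually need the database should fail when it is unavailable. The following is a minimal sketch of that idea, not Drop's routing code. `/api/health` is taken from this document; the `/about` route and the `needsDb` flags are hypothetical.

```typescript
// Hypothetical route classification; the real route list is not given in this document.
type RouteInfo = { path: string; needsDb: boolean };

const routes: RouteInfo[] = [
  { path: "/api/health", needsDb: true }, // runs the DB check
  { path: "/about", needsDb: false },     // assumed example of a non-DB route
];

// Decide the HTTP status for a request, given whether the DB is currently reachable.
function statusFor(path: string, dbAvailable: boolean): number {
  const route = routes.find((r) => r.path === path);
  if (!route) return 404;
  if (route.needsDb && !dbAvailable) return 503; // fail only what depends on the DB
  return 200;
}

console.log(statusFor("/about", false), statusFor("/api/health", false));
```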

3.3 What Worked Well

BetterStack detection was excellent: < 30 seconds from failure to alert.
Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
No data loss: Audit logs intact, no transactions corrupted.

4. Impact Analysis

Dimension	Impact
Users affected	100% — full service outage
Transactions blocked	~3–5 remittances + ~2 QR payments
Revenue impact	Approx. NOK 6,000–10,000
Compliance	None — no data loss, audit logs intact throughout
Regulatory	No notification required (< 4h, no PII exposure)
Reputation	Users saw error screens — limited blast radius pre-public-launch
SLA	28 min downtime → monthly uptime 99.94%
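The uptime figure in the SLA row follows from simple arithmetic. As a rough check, the sketch below computes monthly uptime from downtime minutes. The function name is illustrative, and the choice of month length is an assumption: the 99.94% figure matches a 30-day month, whereas a 28-day month (February 2026) would round to 99.93%.

```typescript
// Monthly uptime as a percentage, given downtime and the number of days in the month.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

const thirtyDayMonth = monthlyUptimePercent(28, 30); // 43,200 minutes in the month
const february2026 = monthlyUptimePercent(28, 28);   // 40,320 minutes in the month

console.log(thirtyDayMonth.toFixed(2), february2026.toFixed(2));
```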

5. Corrective Actions

5.1 Immediate (before next marketing campaign)

#	Action	Owner	Status
1	Configure explicit pg pool: `max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000`	Platform	Pending
2	Add CloudWatch alarm: `DatabaseConnections > 70` → Slack `#drop-ops`	Alem	Pending
3	Add global rate limit on `/api/auth/bankid/initiate` (e.g., 100/min across all IPs)	Platform	Pending
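Action 3 could be implemented in several ways. One simple option is a fixed-window counter shared by all callers, sketched below. This is an illustration under assumptions, not the planned implementation: the class name is invented, the state is in-memory (so it only holds for a single instance), and the 100/min figure is the example value from the table.

```typescript
// Fixed-window limiter shared by all clients, regardless of IP.
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;
  private windowStart: number;
  private count = 0;

  // `now` is injectable so the limiter can be exercised with a fake clock.
  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
    this.windowStart = now();
  }

  // Returns true if the request may proceed, false if it should get HTTP 429.
  tryAcquire(): boolean {
    const t = this.now();
    if (t - this.windowStart >= this.windowMs) {
      this.windowStart = t; // start a new window
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

// Example: at most 100 BankID initiations per minute across all IPs.
const bankIdLimiter = new GlobalRateLimiter(100, 60_000);
console.log(bankIdLimiter.tryAcquire());
```

A fixed window allows up to twice the limit across a window boundary. If that matters, a sliding window or token bucket would be a tighter choice. With more than one App Runner instance, the counter would need shared storage.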

5.2 Before v1.0 Launch

#	Action	Owner	Priority
4	Add PgBouncer or RDS Proxy to externalize connection pooling	Platform	P1
5	Report pool utilization in `/api/health` response (`poolSize`, `idleCount`, `waitingCount`)	Platform	P2
6	Implement graceful degradation for non-DB routes when DB is unavailable	Platform	P2
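For action 5, node-postgres pools expose `totalCount`, `idleCount`, and `waitingCount`, which map naturally onto the fields named above. The sketch below shows one possible response shape. Mapping `totalCount` to `poolSize`, the `"ok"` value for a healthy DB check, and the function name are assumptions. The interface is a structural stand-in so the example runs without the `pg` package.

```typescript
// Structural subset of a pg Pool's counters.
interface PoolCounters {
  totalCount: number;   // connections currently open (idle + in use)
  idleCount: number;    // open connections not checked out
  waitingCount: number; // callers queued for a connection
}

interface HealthResponse {
  status: "ok" | "down";
  checks: {
    db: { status: "ok" | "fail"; poolSize: number; idleCount: number; waitingCount: number };
  };
}

// Build the /api/health body from the DB check result and the pool counters.
function buildHealthResponse(pool: PoolCounters, dbOk: boolean): HealthResponse {
  return {
    status: dbOk ? "ok" : "down",
    checks: {
      db: {
        status: dbOk ? "ok" : "fail",
        poolSize: pool.totalCount,
        idleCount: pool.idleCount,
        waitingCount: pool.waitingCount,
      },
    },
  };
}

// Hypothetical snapshot of a saturated pool during a login burst.
console.log(JSON.stringify(buildHealthResponse({ totalCount: 10, idleCount: 0, waitingCount: 35 }, false)));
```

A non-zero `waitingCount` with `idleCount` at zero is the early signal that was missing during this incident.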

5.3 Process (ongoing)

#	Action	Owner	Due
7	Create "Marketing → Infra" coordination checklist — must be completed before any campaign	Alem	Within 2 weeks
8	Add DB connection metrics to weekly monitoring review	Alem	Ongoing
9	Test App Runner restart as a documented runbook step	Platform	Q2 2026

6. Systemic Improvements

6.1 Connection Pooling Fix

Current state: Implicit pool, no limits, no timeout.

Target state:

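// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

// Explicit limits so the application, not the database, decides when to stop opening connections.
export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // Hard cap, well below the RDS db.t4g.micro limit (~85)
  idleTimeoutMillis: 30000,      // Close connections that have been idle for 30s
  connectionTimeoutMillis: 2000, // Fail fast if no connection becomes available within 2s
});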

When PgBouncer is added, set `max` higher in the app and let PgBouncer enforce the RDS limit.
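To see why a capped pool with a queue and a timeout behaves better under a burst, consider the deliberately simplified model below. It is not how `pg` is implemented: it assumes all requests arrive at once, each holds a connection for a fixed time, and a queued request gives up after the timeout. The hold times used are hypothetical, since the duration of a session upsert is not stated in this document.

```typescript
interface BurstResult {
  served: number;
  timedOut: number;
}

// Simplified model of a burst hitting a pool with `max` connections.
// Requests are served in waves of `max`; wave n starts after n * holdMs.
function simulateBurst(requests: number, max: number, holdMs: number, timeoutMs: number): BurstResult {
  let served = 0;
  for (let i = 0; i < requests; i++) {
    const waitMs = Math.floor(i / max) * holdMs; // time spent queued before a connection frees up
    if (waitMs < timeoutMs) served++;
  }
  return { served, timedOut: requests - served };
}

// 45 logins, max 10, 2s timeout: fast queries drain the queue, slow ones time out.
console.log(simulateBurst(45, 10, 50, 2000));
console.log(simulateBurst(45, 10, 1000, 2000));
```

With short queries the whole burst is absorbed by queuing. With slow queries the excess requests fail after 2 seconds instead of piling up, which keeps the failure bounded and visible.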

6.2 CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "drop-db-connections-high" \
  --alarm-description "RDS DatabaseConnections > 70" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
  --region eu-west-1
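With `--period 60` and `--evaluation-periods 2`, the alarm needs two consecutive one-minute averages above 70 before it fires. The sketch below models only that rule, to make the timing visible. It ignores missing-data handling and other CloudWatch details, and the sample values are hypothetical.

```typescript
// True when the most recent `evaluationPeriods` period averages all exceed the threshold.
function alarmWouldFire(periodAverages: number[], threshold: number, evaluationPeriods: number): boolean {
  if (periodAverages.length < evaluationPeriods) return false;
  return periodAverages.slice(-evaluationPeriods).every((v) => v > threshold);
}

// Hypothetical one-minute averages of DatabaseConnections.
console.log(alarmWouldFire([20, 85], 70, 2)); // one high period is not enough
console.log(alarmWouldFire([75, 85], 70, 2)); // two consecutive high periods
```

A spike that exhausts the pool within a single minute can therefore pass before this alarm triggers. A lower threshold or `--evaluation-periods 1` would trade more noise for earlier warning.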

6.3 Marketing Campaign Checklist (Pre-Launch)

Before any marketing campaign that targets > 100 recipients:

Notify infrastructure (Alem) at least 24h before send
Check current DatabaseConnections baseline in CloudWatch
Verify pool configuration is explicit
Consider sending campaign in batches (< 100/hour) to spread load
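As a rough illustration of the batching item above, the sketch below splits a recipient list into hourly batches. The function name is invented for this example, and the cap of 99 per hour is one reading of "< 100/hour".

```typescript
// Split `recipients` into hourly batch sizes of at most `maxPerHour`.
function planBatches(recipients: number, maxPerHour: number): number[] {
  if (maxPerHour <= 0) throw new RangeError("maxPerHour must be positive");
  const batches: number[] = [];
  let remaining = recipients;
  while (remaining > 0) {
    const size = Math.min(remaining, maxPerHour);
    batches.push(size);
    remaining -= size;
  }
  return batches;
}

// The ~500-recipient campaign from this incident, capped at 99 sends per hour.
const campaignPlan = planBatches(500, 99);
console.log(campaignPlan, `${campaignPlan.length} hours`);
```

Under that cap, the same campaign would have been spread over six hourly batches rather than arriving in a single burst.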

7. Lessons Learned

Explicit is always better than implicit for resource limits. Never rely on driver or platform defaults for connection pool configuration in production.
Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A CloudWatch alarm on DatabaseConnections could have caught this up to 2 minutes earlier.
Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.
BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.
DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.

8. Action Item Tracking

#	Action	Owner	Due	Status
1	Configure explicit pg pool limits	Platform	Before next campaign	Pending
2	CloudWatch alarm on DatabaseConnections > 70	Alem	Within 1 week	Pending
3	Global rate limit on BankID initiate	Platform	Before next campaign	Pending
4	PgBouncer / RDS Proxy	Platform	Before v1.0	Pending
5	Pool utilization in /api/health	Platform	Before v1.0	Pending
6	Graceful degradation for non-DB routes	Platform	Before v1.0	Pending
7	Marketing → Infra coordination checklist	Alem	Within 2 weeks	Pending
8	DB connection metrics in weekly review	Alem	Ongoing	Ongoing
9	App Runner restart in DR runbook	Platform	Q2 2026	Pending

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Facilitator	Alem Bašić
Approver	Alem Bašić
