Post-Mortem

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Example post-mortem (simulated pre-launch scenario; mirrors INC-2026-001) |

Post-Mortem Overview

This document contains a realistic example post-mortem based on Drop's architecture. It covers the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.

| Field | Value |
|---|---|
| Incident ID | INC-2026-001 |
| Severity | P1 (Critical) |
| Post-Mortem Date | 2026-02-21 |
| Facilitator | Alem Bašić |
| Incident Commander | Alem Bašić |
| Participants | Alem Bašić (CEO), Platform Architect (AI) |

1. Executive Summary

On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.

Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.


2. Timeline

| Time (UTC) | Event | Phase |
|---|---|---|
| 10:28:00 | Marketing email delivered to ~500 recipients | Pre-incident |
| 10:30:00 | BetterStack detects HTTP 503 on Drop Health Check | Detection |
| 10:30:30 | Slack #drop-ops alert fires | Detection |
| 10:30:45 | Alem acknowledges alert | Response |
| 10:31:00 | Alem checks App Runner → status RUNNING | Diagnosis |
| 10:31:30 | Alem checks /api/health → {"status":"down","checks":{"db":{"status":"fail"}}} | Diagnosis |
| 10:32:00 | CloudWatch logs show repeated connection refused to RDS | Diagnosis |
| 10:33:00 | Direct psql connection to RDS succeeds, which rules out RDS-level failure | Diagnosis |
| 10:34:00 | Hypothesis: application-level connection pool exhaustion | Diagnosis |
| 10:35:00 | Alem triggers App Runner restart via aws apprunner start-deployment | Mitigation |
| 10:38:00 | App Runner deployment completes | Mitigation |
| 10:38:30 | Health check returns {"status":"ok"} | Recovery |
| 10:38:45 | BetterStack recovery alert: "Drop Health Check is UP" | Recovery |
| 10:39:00 | Begin 15-minute stability monitoring window | Post-recovery |
| 10:54:00 | Confirmed stable, error spike cleared | Closed |
| 10:58:00 | Incident formally closed | Closed |

  • Total duration: 28 minutes (10:30 to 10:58 UTC)
  • Time to detect: < 30 seconds
  • Time to diagnose root cause: ~4 minutes
  • Time to apply fix: ~5 minutes (App Runner restart)
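The summary figures can be re-derived from the timeline timestamps. The following is a minimal sketch of that arithmetic, assuming same-day UTC timestamps in HH:MM:SS form. The helper names `toSeconds` and `elapsedMinutes` are illustrative and not part of Drop's codebase.

```typescript
// Convert an "HH:MM:SS" timestamp (same day, UTC) to seconds since midnight.
function toSeconds(hms: string): number {
  const parts = hms.split(":").map(Number);
  if (parts.length !== 3 || parts.some((p) => Number.isNaN(p))) {
    throw new Error(`Invalid timestamp: ${hms}`);
  }
  const [h, m, s] = parts;
  return h * 3600 + m * 60 + s;
}

// Elapsed minutes between two timeline entries.
function elapsedMinutes(from: string, to: string): number {
  return (toSeconds(to) - toSeconds(from)) / 60;
}

// Timestamps copied from the timeline table.
const incidentMinutes = elapsedMinutes("10:30:00", "10:58:00");  // detection to formal close
const unhealthyMinutes = elapsedMinutes("10:30:00", "10:38:30"); // detection to first healthy check
const diagnosisMinutes = elapsedMinutes("10:30:00", "10:34:00"); // detection to hypothesis
const restartMinutes = elapsedMinutes("10:35:00", "10:38:00");   // restart triggered to deployment complete

console.log({ incidentMinutes, unhealthyMinutes, diagnosisMinutes, restartMinutes });
```

On this reading, the 28-minute total spans detection to formal close, while the health check itself was failing for roughly 8.5 minutes of that window. The restart took about 3 minutes, so the ~5 minute fix figure presumably includes the decision time before it was triggered.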


3. Root Cause Analysis

3.1 The Five Whys

Why did Drop return HTTP 503? → The /api/health endpoint's DB check (SELECT 1) failed.

Why did the DB check fail? → The application could not acquire a database connection — pool was exhausted.

Why was the pool exhausted? → ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window.

Why were 45 concurrent logins able to exhaust the pool? → No explicit connection pool limit was configured in the pg driver. The application's connections were bounded only by the database server's connection limit (~85 connections for db.t4g.micro, set by the RDS max_connections parameter rather than by the OS), and there was no queue or acquisition timeout, so new requests failed immediately once the limit was hit.

Why did no alert fire before the health check failed? → No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.
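To make the fourth "why" concrete, the sketch below models a burst hitting a bounded pool, with and without a wait budget. It is a simplified discrete model, not Drop's actual code. The 200 ms hold time per login, the pool size of 10, and the assumption that all 45 requests arrive at the same instant are illustrative. `maxWaitMs` is a parameter of this model only; it is not the same setting as node-postgres's `connectionTimeoutMillis`, where 0 means no timeout.

```typescript
interface BurstResult {
  served: number;
  rejected: number;
  longestWaitMs: number;
}

// Simulate FIFO requests competing for `poolMax` connections.
// Each served request holds a connection for `holdMs`.
// A request is rejected if it would wait longer than `maxWaitMs`.
function simulateBurst(
  arrivalsMs: number[],
  poolMax: number,
  holdMs: number,
  maxWaitMs: number
): BurstResult {
  if (poolMax < 1) throw new Error("poolMax must be at least 1");
  // freeAt[i] is the time at which connection i next becomes available.
  const freeAt: number[] = new Array<number>(poolMax).fill(0);
  const result: BurstResult = { served: 0, rejected: 0, longestWaitMs: 0 };

  for (const arrival of [...arrivalsMs].sort((a, b) => a - b)) {
    // Pick the connection that frees up earliest.
    let idx = 0;
    for (let i = 1; i < freeAt.length; i++) {
      if (freeAt[i] < freeAt[idx]) idx = i;
    }
    const start = Math.max(arrival, freeAt[idx]);
    const wait = start - arrival;
    if (wait <= maxWaitMs) {
      result.served += 1;
      result.longestWaitMs = Math.max(result.longestWaitMs, wait);
      freeAt[idx] = start + holdMs;
    } else {
      result.rejected += 1; // gave up before a connection became free
    }
  }
  return result;
}

// 45 logins arriving at the same instant (assumed), pool of 10, 200 ms per login (assumed).
const burst: number[] = new Array<number>(45).fill(0);
const failFast = simulateBurst(burst, 10, 200, 0);   // no waiting allowed
const queued = simulateBurst(burst, 10, 200, 2000);  // wait up to 2 s for a connection

console.log(failFast); // { served: 10, rejected: 35, longestWaitMs: 0 }
console.log(queued);   // { served: 45, rejected: 0, longestWaitMs: 800 }
```

Under these assumed numbers, a short wait budget absorbs the whole burst, whereas failing immediately rejects most of it. The real hold time per login would need to be measured before relying on any specific setting.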

3.2 Contributing Factors

| Factor | Description | Severity |
|---|---|---|
| No explicit pool config | pg used without max, idleTimeoutMillis, or connectionTimeoutMillis | High |
| No DB connection metrics | No CloudWatch alarm on DatabaseConnections > 70 | High |
| No graceful degradation | Application returned 503 when DB was unavailable, even for non-DB routes | Medium |
| No rate limiting across all IPs | Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously | Medium |
| No pre-campaign infra review | Marketing email campaign launched without coordinating with infrastructure | Medium |
| No connection pool health metric | Health check did not report pool utilization | Low |
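The "No graceful degradation" factor can be illustrated with a small routing decision: only routes that need the database should fail when it is unavailable. This is a sketch under assumptions. The `RoutePolicy` shape, the `/static/terms` path, and the 30-second retry hint are hypothetical, and how Drop actually classifies its routes is not stated in this document.

```typescript
interface RoutePolicy {
  path: string;
  requiresDb: boolean;
}

interface Decision {
  proceed: boolean;
  status: number;
  retryAfterSeconds?: number;
}

// Decide whether a request may proceed given the current DB availability.
function decide(route: RoutePolicy, dbAvailable: boolean): Decision {
  if (dbAvailable || !route.requiresDb) {
    return { proceed: true, status: 200 };
  }
  // DB-dependent route while the DB is unavailable: fail with a retry hint.
  return { proceed: false, status: 503, retryAfterSeconds: 30 };
}

// /api/health depends on the DB check (per this post-mortem); /static/terms is a hypothetical non-DB route.
const healthRoute: RoutePolicy = { path: "/api/health", requiresDb: true };
const staticRoute: RoutePolicy = { path: "/static/terms", requiresDb: false };

console.log(decide(healthRoute, false)); // { proceed: false, status: 503, retryAfterSeconds: 30 }
console.log(decide(staticRoute, false)); // { proceed: true, status: 200 }
```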

3.3 What Worked Well

  • BetterStack detection was excellent: < 30 seconds from failure to alert.
  • Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
  • App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
  • RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
  • No data loss: Audit logs intact, no transactions corrupted.

4. Impact Analysis

| Dimension | Impact |
|---|---|
| Users affected | 100% (full service outage) |
| Transactions blocked | ~3–5 remittances + ~2 QR payments |
| Revenue impact | Approx. NOK 6,000–10,000 |
| Compliance | None: no data loss, audit logs intact throughout |
| Regulatory | No notification required (< 4h, no PII exposure) |
| Reputation | Users saw error screens; limited blast radius pre-public-launch |
| SLA | 28 min downtime → monthly uptime 99.94% |
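The SLA figure is simple arithmetic on the downtime and the length of the month. A minimal sketch follows; the function name `monthlyUptimePercent` is illustrative. The 99.94% in the table corresponds to a 30-day month, while the same downtime over a 28-day month (such as February 2026) rounds to 99.93%.

```typescript
// Uptime percentage for a month, given total downtime in minutes.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (1 - downtimeMinutes / totalMinutes) * 100;
}

const uptime30 = monthlyUptimePercent(28, 30).toFixed(2);
const uptime28 = monthlyUptimePercent(28, 28).toFixed(2);

console.log(uptime30); // "99.94"
console.log(uptime28); // "99.93"
```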

5. Corrective Actions

5.1 Immediate (before next marketing campaign)

| # | Action | Owner | Status |
|---|---|---|---|
| 1 | Configure explicit pg pool: max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000 | Platform | Pending |
| 2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack #drop-ops | Alem | Pending |
| 3 | Add global rate limit on /api/auth/bankid/initiate (e.g., 100/min across all IPs) | Platform | Pending |
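Action 3 can be sketched as a fixed-window counter shared by all callers of the endpoint. This is one possible approach, not a description of Drop's implementation, and the class name `GlobalRateLimiter` is hypothetical. The sketch keeps its counter in process memory, so with more than one application instance it would limit each instance separately; a truly global limit would need a shared store.

```typescript
// Fixed-window rate limiter: at most `limit` requests per `windowMs`, regardless of client IP.
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStartMs = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed. `nowMs` is passed in so the logic is easy to test.
  allow(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStartMs = nowMs;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

// 100 requests per minute across all IPs, as suggested in action 3.
const limiter = new GlobalRateLimiter(100, 60000);
let allowed = 0;
for (let i = 0; i < 120; i++) {
  if (limiter.allow(1000 + i)) allowed += 1;
}
console.log(allowed); // 100
```

In a request handler, `limiter.allow(Date.now())` would be checked before starting the BankID flow, with a 429 response when it returns false.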

5.2 Before v1.0 Launch

| # | Action | Owner | Priority |
|---|---|---|---|
| 4 | Add PgBouncer or RDS Proxy to externalize connection pooling | Platform | P1 |
| 5 | Report pool utilization in /api/health response (poolSize, idleCount, waitingCount) | Platform | P2 |
| 6 | Implement graceful degradation for non-DB routes when DB is unavailable | Platform | P2 |
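For action 5, the health response could be extended along the following lines. The sketch assumes the pool object exposes `totalCount`, `idleCount`, and `waitingCount`, as a node-postgres `Pool` does, and maps `totalCount` to the `poolSize` field named in the action. The `"pass"` value and the function name `buildHealthBody` are assumptions; the source only shows `"fail"` and the top-level `"ok"` and `"down"` statuses.

```typescript
// Structural subset of the counters a node-postgres Pool exposes.
interface PoolStats {
  totalCount: number;   // connections currently open
  idleCount: number;    // open connections not in use
  waitingCount: number; // callers queued for a connection
}

interface HealthBody {
  status: "ok" | "down";
  checks: {
    db: {
      status: "pass" | "fail";
      poolSize: number;
      idleCount: number;
      waitingCount: number;
    };
  };
}

// Build the /api/health payload from the DB check result and the pool counters.
function buildHealthBody(pool: PoolStats, dbReachable: boolean): HealthBody {
  return {
    status: dbReachable ? "ok" : "down",
    checks: {
      db: {
        status: dbReachable ? "pass" : "fail",
        poolSize: pool.totalCount,
        idleCount: pool.idleCount,
        waitingCount: pool.waitingCount,
      },
    },
  };
}

// Example: a saturated pool with callers waiting (illustrative numbers).
const saturated = buildHealthBody({ totalCount: 10, idleCount: 0, waitingCount: 7 }, false);
console.log(JSON.stringify(saturated));
```

A rising `waitingCount` with `idleCount` at zero would be the early signal that was missing during INC-2026-001.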

5.3 Process (ongoing)

| # | Action | Owner | Due |
|---|---|---|---|
| 7 | Create "Marketing → Infra" coordination checklist (must be completed before any campaign) | Alem | Within 2 weeks |
| 8 | Add DB connection metrics to weekly monitoring review | Alem | Ongoing |
| 9 | Test App Runner restart as a documented runbook step | Platform | Q2 2026 |

6. Systemic Improvements

6.1 Connection Pooling Fix

Current state: Implicit pool, no limits, no timeout.

Target state:

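// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

// Shared pool for the whole process; create it once and reuse it.
export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // Hard cap per app instance; keeps usage well below the RDS t4g.micro limit
  idleTimeoutMillis: 30000,      // Close connections that have been idle for 30s
  connectionTimeoutMillis: 2000, // Fail fast: give up after 2s if no connection is available
});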

When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.

6.2 CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "drop-db-connections-high" \
  --alarm-description "RDS DatabaseConnections > 70" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
  --region eu-west-1
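With `--period 60`, `--evaluation-periods 2`, and the `Average` statistic, the alarm above fires only after two consecutive one-minute averages exceed 70. The sketch below models that rule in simplified form. It ignores CloudWatch's missing-data handling and the optional datapoints-to-alarm setting, and the function name `evaluateAlarm` is illustrative.

```typescript
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// Simplified evaluation: ALARM only if each of the last `evaluationPeriods`
// per-period averages is strictly greater than the threshold.
function evaluateAlarm(
  periodAverages: number[],
  threshold: number,
  evaluationPeriods: number
): AlarmState {
  if (periodAverages.length < evaluationPeriods) return "INSUFFICIENT_DATA";
  const recent = periodAverages.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold) ? "ALARM" : "OK";
}

// Per-minute averages of DatabaseConnections (illustrative values).
console.log(evaluateAlarm([40, 72, 85], 70, 2)); // "ALARM"
console.log(evaluateAlarm([40, 85], 70, 2));     // "OK"
console.log(evaluateAlarm([85], 70, 2));         // "INSUFFICIENT_DATA"
```

One consequence is that a spike shorter than two minutes would not trip this alarm. If faster warning is needed, a single evaluation period or a lower threshold may be worth considering.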

6.3 Marketing Campaign Checklist (Pre-Launch)

Before any marketing campaign that targets > 100 recipients:

  • Notify infrastructure (Alem) at least 24h before send
  • Check current DatabaseConnections baseline in CloudWatch
  • Verify pool configuration is explicit
  • Consider sending campaign in batches (< 100/hour) to spread load
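The batching item in the checklist above can be planned with a few lines of arithmetic. This is a minimal sketch, assuming one batch is sent per hour and each batch stays under 100 recipients; the function name `planBatches` is illustrative.

```typescript
// Split a recipient count into batches of at most `maxPerBatch`.
function planBatches(recipients: number, maxPerBatch: number): number[] {
  if (!Number.isInteger(recipients) || recipients < 0) {
    throw new Error("recipients must be a non-negative integer");
  }
  if (!Number.isInteger(maxPerBatch) || maxPerBatch < 1) {
    throw new Error("maxPerBatch must be a positive integer");
  }
  const batches: number[] = [];
  let remaining = recipients;
  while (remaining > 0) {
    const size = Math.min(maxPerBatch, remaining);
    batches.push(size);
    remaining -= size;
  }
  return batches;
}

// ~500 recipients (as in this incident), fewer than 100 per hourly batch.
const plan = planBatches(500, 99);
console.log(plan);        // [99, 99, 99, 99, 99, 5]
console.log(plan.length); // 6
```

At one batch per hour, the last batch would go out five hours after the first.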

7. Lessons Learned

  1. Explicit is always better than implicit for resource limits. Never rely on driver or server defaults for connection pool configuration in production.
  2. Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A CloudWatch alarm on DatabaseConnections could have warned of rising connection counts before the health check failed.
  3. Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
  4. App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.
  5. BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.
  6. DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.

8. Action Item Tracking

| # | Action | Owner | Due | Status |
|---|---|---|---|---|
| 1 | Configure explicit pg pool limits | Platform | Before next campaign | Pending |
| 2 | CloudWatch alarm on DatabaseConnections > 70 | Alem | Within 1 week | Pending |
| 3 | Global rate limit on BankID initiate | Platform | Before next campaign | Pending |
| 4 | PgBouncer / RDS Proxy | Platform | Before v1.0 | Pending |
| 5 | Pool utilization in /api/health | Platform | Before v1.0 | Pending |
| 6 | Graceful degradation for non-DB routes | Platform | Before v1.0 | Pending |
| 7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks | Pending |
| 8 | DB connection metrics in weekly review | Alem | Ongoing | Ongoing |
| 9 | App Runner restart in DR runbook | Platform | Q2 2026 | Pending |


Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Facilitator | Alem Bašić | | |
| Approver | Alem Bašić | | |