Incident Report

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Closed
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Example incident report (simulated pre-launch scenario)

Incident Overview

This document contains a realistic example incident based on Drop's architecture. Use it as a template when reporting real incidents.

Field	Value
Incident ID	INC-2026-001
Severity	P1 — Critical (production down)
Status	Closed / Resolved
Start Time	2026-02-20 10:30 UTC
End Time	2026-02-20 10:58 UTC
Duration	28 minutes (detection to incident close; health check recovered at 10:38:30, about 8 minutes after detection)
Services Affected	Drop production (all users)
Root Cause	RDS PostgreSQL connection pool exhaustion after spike in concurrent logins
Incident Commander	Alem Bašić
Reported By	BetterStack (automated — `Drop Health Check` monitor)

1. Impact Summary

Metric	Value
Users affected	All active users (100% — full outage)
Transactions blocked	Estimated 3–5 remittances + 2 QR payments
Revenue impact	Approx. NOK 6,000–10,000 in blocked transactions
SLA impact	28 minutes downtime → monthly uptime: 99.94%
Compliance impact	None — no data loss, audit logs intact

Customer-facing behavior: Users saw error messages on all screens. /api/health returned HTTP 503 with "status":"down".
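
The SLA impact figure in the table above can be checked with a short calculation. This is a minimal sketch, not part of Drop's tooling: `monthlyUptimePercent` is a hypothetical helper, and the length of the measurement window is an assumption because the report does not state it.

```typescript
// Monthly uptime percentage for a given amount of downtime.
// daysInMonth is an assumption: the report does not say which window the SLA uses.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (1 - downtimeMinutes / totalMinutes) * 100;
}

// 28-minute incident window over an assumed 30-day month.
console.log(monthlyUptimePercent(28, 30).toFixed(2)); // prints 99.94

// Same window over a 28-day month such as February 2026.
console.log(monthlyUptimePercent(28, 28).toFixed(2)); // prints 99.93
```

The reported 99.94% matches a 30-day window with the full 28-minute incident counted as downtime. A 28-day calendar month would give 99.93%.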

2. Timeline

Time (UTC)	Event
10:30:00	Drop Health Check monitor on BetterStack detects HTTP 503
10:30:30	Slack `#drop-ops` receives critical alert: "Drop Health Check is DOWN"
10:30:45	Alem acknowledges alert
10:31:00	Alem checks App Runner status — service shows `RUNNING`
10:31:30	Alem checks `/api/health` — response: `{"status":"down","checks":{"db":{"status":"fail"}}}`
10:32:00	Alem checks CloudWatch logs — sees repeated connection refused errors to RDS
10:32:30	Hypothesis: RDS connection issue. Alem checks RDS status — shows `available`
10:33:00	Alem queries RDS directly — connection succeeds from psql
10:34:00	New hypothesis: connection pool exhaustion in application
10:35:00	Alem triggers App Runner restart (`aws apprunner start-deployment`)
10:38:00	App Runner deployment completes — service `RUNNING`
10:38:30	Health check passes: `{"status":"ok"}`
10:38:45	BetterStack sends recovery alert: "Drop Health Check is UP (downtime: 8 min)"
10:39:00	Alem monitors for 15 minutes to confirm stability
10:54:00	Confirms stable — error spike cleared
10:58:00	Incident closed
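
The response intervals implied by this timeline can be derived directly from its timestamps. The sketch below is illustrative only; the helper names are hypothetical, and the choice of which events mark "diagnosed" and "restored" is an assumption (the pool-exhaustion hypothesis at 10:34:00 and the passing health check at 10:38:30).

```typescript
// Convert an "HH:MM:SS" timestamp (same day, UTC) to seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Elapsed minutes between two same-day timestamps.
function minutesBetween(from: string, to: string): number {
  return (toSeconds(to) - toSeconds(from)) / 60;
}

// Timestamps copied from the timeline table.
const detected = "10:30:00";
const acknowledged = "10:30:45";
const poolHypothesis = "10:34:00";
const restartTriggered = "10:35:00";
const healthCheckPassed = "10:38:30";
const closed = "10:58:00";

const intervals = {
  timeToAcknowledge: minutesBetween(detected, acknowledged),            // 0.75
  timeToDiagnose: minutesBetween(detected, poolHypothesis),             // 4
  restartDuration: minutesBetween(restartTriggered, healthCheckPassed), // 3.5
  timeToRestore: minutesBetween(detected, healthCheckPassed),           // 8.5
  timeToClose: minutesBetween(detected, closed),                        // 28
};

console.log(intervals);
```

Separating time to restore (8.5 minutes) from time to close (28 minutes) keeps the stability watch from being counted as outage time.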

3. Root Cause Analysis

What Happened

A burst of concurrent BankID login attempts (triggered by a marketing email sent at 10:28 UTC) created 45+ simultaneous database connections. The Drop application uses application-level connection pooling with a default max of ~85 connections (db.t4g.micro limit). Each /api/auth/bankid/callback request opens a connection for session creation + user upsert — the simultaneous spike exhausted the pool.
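
To illustrate how a hard connection ceiling turns a simultaneous burst into failed requests, here is a minimal simulation. `BoundedPool` and `simulateBurst` are hypothetical names, not Drop's code. The limit of 85 and the 45 simultaneous requests come from this report; the number of connections each callback holds is not stated, so `connectionsPerRequest` is an assumption to vary.

```typescript
// A pool with a hard ceiling: acquisitions beyond max fail immediately.
class BoundedPool {
  private inUse = 0;

  constructor(private readonly max: number) {}

  tryAcquire(): boolean {
    if (this.inUse >= this.max) return false;
    this.inUse += 1;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }

  activeCount(): number {
    return this.inUse;
  }
}

// Simulate requests that all arrive before any of them finishes.
// Returns the number of requests that could not get all their connections.
function simulateBurst(pool: BoundedPool, requests: number, connectionsPerRequest: number): number {
  let failed = 0;
  for (let r = 0; r < requests; r++) {
    let acquired = 0;
    while (acquired < connectionsPerRequest && pool.tryAcquire()) acquired += 1;
    if (acquired < connectionsPerRequest) {
      // The request fails and hands back whatever it managed to take.
      for (let i = 0; i < acquired; i++) pool.release();
      failed += 1;
    }
    // Successful requests keep their connections for the rest of the burst.
  }
  return failed;
}

console.log(simulateBurst(new BoundedPool(85), 45, 1)); // prints 0
console.log(simulateBurst(new BoundedPool(85), 45, 2)); // prints 3
```

With one connection per request, the 45 callbacks fit under the limit. With two (for example, one for session creation and one for the user upsert), the last three requests fail. The report does not say which case applied, and connections held by other traffic would lower the headroom further.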

When the pool was exhausted, new requests failed with connection refused and the health check's DB query (SELECT 1) also failed, triggering a 503 response.
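
The health check behavior described above can be sketched as a function that maps the result of the database ping to an HTTP response. This is an illustration, not Drop's handler: `runHealthCheck` and `pingDb` are hypothetical names, the ping is synchronous for brevity (a real `SELECT 1` would be asynchronous), and HTTP 200 for the healthy case is an assumption. The response bodies match the ones quoted in the timeline.

```typescript
type HealthBody =
  | { status: "ok" }
  | { status: "down"; checks: { db: { status: "fail" } } };

interface HealthResponse {
  httpStatus: 200 | 503;
  body: HealthBody;
}

// pingDb stands in for running SELECT 1; it throws when no connection is available.
function runHealthCheck(pingDb: () => void): HealthResponse {
  try {
    pingDb();
    return { httpStatus: 200, body: { status: "ok" } };
  } catch {
    return {
      httpStatus: 503,
      body: { status: "down", checks: { db: { status: "fail" } } },
    };
  }
}

// Pool exhausted: the ping throws, so the endpoint reports 503.
const during = runHealthCheck(() => {
  throw new Error("connection refused");
});
console.log(during.httpStatus, JSON.stringify(during.body));

// After the restart: the ping succeeds.
const after = runHealthCheck(() => {});
console.log(after.httpStatus, JSON.stringify(after.body));
```

Because the health check shares the same database path as user requests, it fails at the same moment they do, which is why the monitor caught the outage quickly but could not warn ahead of it.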

Contributing Factors

No explicit connection pool configured: Drop uses the pg driver without PgBouncer or RDS Proxy. The number of connections was bounded only implicitly, by the ~85-connection limit of db.t4g.micro, rather than by a deliberately sized pool.
No connection pool metrics: CloudWatch didn't have an alert on DatabaseConnections — the issue wasn't detected until the health check failed.
No circuit breaker: The application did not gracefully degrade (e.g., serve cached data) when DB was unavailable.
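
As an illustration of the missing circuit breaker, the sketch below fails fast to a fallback (for example, cached data) once the database has failed repeatedly, then retries after a cool-down. `CircuitBreaker`, its thresholds, and the synchronous `operation` callback are assumptions made for brevity; a production version would wrap asynchronous queries.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number,
    private readonly resetAfterMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Run operation unless the breaker is open; use fallback on failure or when open.
  call<T>(operation: () => T, fallback: () => T): T {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) return fallback();
      this.state = "half-open"; // cool-down elapsed: allow one trial call
    }
    try {
      const result = operation();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

A caller might wrap a read as `breaker.call(() => loadFromDb(), () => loadFromCache())`, where both functions are placeholders. While the breaker is open, the database is not contacted at all, which also stops failing requests from piling more load onto an exhausted pool.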

Why App Runner restart fixed it

Restarting App Runner recreated the application process and reset all connection pools. Because the burst of login requests (about 10 seconds of traffic) had already passed, new requests arrived at a normal rate and the pool was sufficient.

4. Resolution

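Immediate fix: App Runner restart. The health check passed again at 10:38:30 UTC, about 8 minutes after detection; the incident was closed 28 minutes after detection, following a 15-minute stability watch.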

Permanent fixes required:

Add PgBouncer connection pooler (or RDS Proxy) to limit per-application connections
Add CloudWatch alarm on DatabaseConnections > 70 for db.t4g.micro
Implement connection pool health check with metrics
Add a global rate limit on BankID login initiation for burst protection (a per-IP limit of 10/min already exists, but nothing caps the total rate across all IPs)
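
The last fix above, a global cap alongside the existing per-IP limit, could look roughly like the following fixed-window sketch. The class names and the global limit value are hypothetical; only the 10/min per-IP figure comes from this report. It keeps state in memory, so it assumes a single application instance.

```typescript
// Counts events in fixed windows and rejects once the limit is reached.
class FixedWindowCounter {
  private windowStart = 0;
  private count = 0;

  constructor(private readonly limit: number, private readonly windowMs: number) {}

  tryConsume(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs; // start a new window
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

// Applies a per-IP limit and a global limit to login initiations.
class LoginRateLimiter {
  private readonly perIp = new Map<string, FixedWindowCounter>();
  private readonly global: FixedWindowCounter;

  constructor(
    private readonly perIpLimit: number,
    globalLimit: number,
    private readonly windowMs: number = 60000,
  ) {
    this.global = new FixedWindowCounter(globalLimit, windowMs);
  }

  allow(ip: string, nowMs: number): boolean {
    let counter = this.perIp.get(ip);
    if (!counter) {
      counter = new FixedWindowCounter(this.perIpLimit, this.windowMs);
      this.perIp.set(ip, counter);
    }
    // Per-IP check first, so one noisy client cannot use up the global budget.
    // A request rejected by the global limit still counts against its IP.
    if (!counter.tryConsume(nowMs)) return false;
    return this.global.tryConsume(nowMs);
  }
}

// 45 logins from 45 different IPs in the same instant, global limit of 30/min (hypothetical).
const limiter = new LoginRateLimiter(10, 30);
let allowed = 0;
for (let i = 0; i < 45; i++) {
  if (limiter.allow(`10.0.0.${i}`, 0)) allowed += 1;
}
console.log(allowed); // prints 30
```

A per-IP limit alone lets this burst through, because every client stays under 10/min. The global value should be sized against the connection budget, and the per-IP map would need eviction in a long-running process.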

5. Detection Quality

Aspect	Score	Notes
Time to detection	Good	BetterStack detected in < 30 seconds
Alert received	Good	Slack alert within 30 seconds
Time to diagnose	Fair	4 minutes — hypothesis took 2 iterations
Time to resolve	Good	5 minutes (App Runner restart)
Post-restore verification	Good	Monitored for 15 minutes before closing

6. Action Items

#	Action	Owner	Priority	Due Date
1	Configure PgBouncer or RDS Proxy for connection pooling	Alem / Platform	P1	Before v1.0 launch
2	Add CloudWatch alarm: `DatabaseConnections > 70` → Slack alert	Alem	P1	Within 1 week
3	Add connection pool metrics to health check endpoint	Platform	P2	Before v1.0 launch
4	Document burst rate limiting strategy for marketing campaigns	Alem	P2	Within 2 weeks
5	Test DR runbook scenario for RDS connection failures	Platform	P3	Q2 2026
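
Action item 2 could be expressed as the parameter object below, which follows the field names of the CloudWatch `PutMetricAlarm` API. This is a sketch under assumptions: the instance identifier, the SNS topic ARN, the one-minute period, and the routing from SNS to Slack are all placeholders, and sending the request (for example with the AWS SDK) is not shown. Only the metric name and the threshold of 70 come from this report.

```typescript
// Subset of the PutMetricAlarm input fields used here.
interface MetricAlarmParams {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: "Average" | "Maximum";
  Period: number; // seconds
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator: "GreaterThanThreshold";
  AlarmActions: string[];
}

// Build the alarm definition for DatabaseConnections > threshold.
function databaseConnectionsAlarm(
  dbInstanceId: string,
  snsTopicArn: string,
  threshold: number = 70,
): MetricAlarmParams {
  return {
    AlarmName: `${dbInstanceId}-database-connections-high`,
    Namespace: "AWS/RDS",
    MetricName: "DatabaseConnections",
    Dimensions: [{ Name: "DBInstanceIdentifier", Value: dbInstanceId }],
    Statistic: "Maximum",
    Period: 60, // assumed: evaluate once per minute
    EvaluationPeriods: 1,
    Threshold: threshold,
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: [snsTopicArn], // assumed: this topic forwards to Slack
  };
}

// Placeholder identifiers, not real resources.
const alarm = databaseConnectionsAlarm(
  "drop-prod-db",
  "arn:aws:sns:eu-north-1:123456789012:drop-ops-alerts",
);
console.log(JSON.stringify(alarm, null, 2));
```

A threshold of 70 against a limit of roughly 85 leaves some headroom, but a burst lasting about 10 seconds may fall between one-minute datapoints, so this alarm complements rather than replaces application-level pool metrics.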

7. Lessons Learned

Connection pool exhaustion is a real risk for synchronous monoliths under burst load. Add PgBouncer before launch.
CloudWatch DatabaseConnections metric needs an alarm. This should have alerted before the health check failed.
App Runner restart is fast and reliable. 5 minutes is an acceptable RTO for this class of issue.
BetterStack detection was excellent — 30 seconds from failure to alert is within our target.
Marketing campaigns need infrastructure coordination. Any campaign likely to cause a login burst, such as a marketing email, should trigger a review of connection capacity before it goes out.

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
