Post-Mortem

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Example post-mortem (simulated pre-launch scenario, mirrors INC-2026-001)

Post-Mortem Overview

This document is filled in with a realistic example post-mortem based on Drop's architecture. It documents the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.

Incident Reference & Metadata

Field	Value
Incident ID	INC-2026-001
Severity	P1 (Critical)
Post-Mortem Date	2026-02-21
Facilitator	Alem Bašić
Incident Commander	Alem Bašić
Participants	Alem Bašić (CEO), Platform Architect (AI)

1. Executive Summary

On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.

Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.

2. Timeline

Time (UTC)	Event	Phase
10:28:00	Marketing email delivered to ~500 recipients	Pre-incident
10:30:00	BetterStack detects HTTP 503 on Drop Health Check	Detection
10:30:30	Slack `#drop-ops` alert fires	Detection
10:30:45	Alem acknowledges alert	Response
10:31:00	Alem checks App Runner → status `RUNNING`	Diagnosis
10:31:30	Alem checks `/api/health` → `{"status":"down","checks":{"db":{"status":"fail"}}}`	Diagnosis
10:32:00	CloudWatch logs show repeated `connection refused` to RDS	Diagnosis
10:33:00	Direct psql connection to RDS succeeds, ruling out an RDS-level failure	Diagnosis
10:34:00	Hypothesis: application-level connection pool exhaustion	Diagnosis
10:35:00	Alem triggers App Runner restart via `aws apprunner start-deployment`	Mitigation
10:38:00	App Runner deployment completes	Mitigation
10:38:30	Health check returns `{"status":"ok"}`	Recovery
10:38:45	BetterStack recovery alert: "Drop Health Check is UP"	Recovery
10:39:00	Begin 15-minute stability monitoring window	Post-recovery
10:54:00	Confirmed stable — error spike cleared	Closed
10:58:00	Incident formally closed	Closed

Total duration: 28 minutes (10:30 to 10:58 UTC)
Time to detect: < 30 seconds
Time to diagnose root cause: ~4 minutes
Time to apply fix: ~5 minutes (App Runner restart)
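The durations can be re-derived from the timeline timestamps. The following is a minimal TypeScript sketch of that arithmetic, assuming the timestamps from the timeline table; the helper name `secondsBetween` and the choice of which events bound each interval are illustrative assumptions, not part of the incident tooling.

```typescript
// Timestamps copied from the timeline table (all UTC, 2026-02-20).
const timeline = {
  failureDetected: "2026-02-20T10:30:00Z",  // BetterStack detects HTTP 503
  alertFired: "2026-02-20T10:30:30Z",       // Slack #drop-ops alert fires
  hypothesis: "2026-02-20T10:34:00Z",       // pool exhaustion hypothesis formed
  restartTriggered: "2026-02-20T10:35:00Z", // App Runner restart triggered
  restartComplete: "2026-02-20T10:38:00Z",  // App Runner deployment completes
  closed: "2026-02-20T10:58:00Z",           // incident formally closed
};

// Hypothetical helper: whole seconds between two ISO-8601 timestamps.
function secondsBetween(startIso: string, endIso: string): number {
  return Math.round((Date.parse(endIso) - Date.parse(startIso)) / 1000);
}

const totalMinutes = secondsBetween(timeline.failureDetected, timeline.closed) / 60;             // 28
const detectSeconds = secondsBetween(timeline.failureDetected, timeline.alertFired);             // 30
const diagnoseMinutes = secondsBetween(timeline.failureDetected, timeline.hypothesis) / 60;      // 4
const restartMinutes = secondsBetween(timeline.restartTriggered, timeline.restartComplete) / 60; // 3
```

Under these assumptions the restart itself took 3 minutes; the "~5 minutes" figure matches the interval from first detection to the restart being triggered.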

3. Root Cause Analysis

3.1 The Five Whys

Why did Drop return HTTP 503? → The `/api/health` endpoint's DB check (`SELECT 1`) failed.

Why did the DB check fail? → The application could not acquire a database connection because the pool was exhausted.

Why was the pool exhausted? → ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window.

Why were 45 concurrent logins able to exhaust the pool? → No explicit connection pool limit was configured in the `pg` driver. The pool was bounded only by OS-level limits (~85 connections for db.t4g.micro), and there was no queue or timeout, so new requests failed immediately when the limit was hit.

Why did no alert fire before the health check failed? → No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.
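To make the difference between "fail immediately" and "queue with a timeout" concrete, here is a small TypeScript sketch of a burst hitting a bounded pool. It is a simplified model and not the `pg` implementation: it assumes all requests arrive at the same instant and each holds a connection for the same length of time. The function name `simulateBurst` and the hold times (50 ms and 600 ms) are hypothetical values chosen for illustration, not measurements from the incident.

```typescript
interface BurstResult {
  served: number;
  timedOut: number;
}

// Simplified model (an assumption): all requests arrive at t = 0, each holds one
// connection for holdMs, and a queued request gives up after waiting timeoutMs.
function simulateBurst(requests: number, max: number, holdMs: number, timeoutMs: number): BurstResult {
  let served = 0;
  let timedOut = 0;
  for (let i = 0; i < requests; i++) {
    // Requests are served in waves of `max`; request i waits for the earlier waves to finish.
    const waitMs = Math.floor(i / max) * holdMs;
    if (waitMs <= timeoutMs) {
      served++;
    } else {
      timedOut++;
    }
  }
  return { served, timedOut };
}

// 45 simultaneous logins against a pool capped at 10 with a 2000 ms acquire timeout.
const fastQueries = simulateBurst(45, 10, 50, 2000);  // { served: 45, timedOut: 0 }
const slowQueries = simulateBurst(45, 10, 600, 2000); // { served: 40, timedOut: 5 }
```

In this model a bounded pool with a queue absorbs the whole burst when queries are short, and sheds only the tail of the burst when they are slow, instead of failing every request once the limit is reached.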

3.2 Contributing Factors

Factor	Description	Severity
No explicit pool config	`pg` used without `max`, `idleTimeoutMillis`, or `connectionTimeoutMillis`	High
No DB connection metrics	No CloudWatch alarm on `DatabaseConnections > 70`	High
No graceful degradation	Application returned 503 when DB was unavailable, even for non-DB routes	Medium
No rate limiting across all IPs	Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously	Medium
No pre-campaign infra review	Marketing email campaign launched without coordinating with infrastructure	Medium
No connection pool health metric	Health check did not report pool utilization	Low

3.3 What Worked Well

BetterStack detection was excellent: < 30 seconds from failure to alert.

Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.

App Runner restart is fast and reliable: Recovery completed in < 5 minutes.

RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.

No data loss: Audit logs intact, no transactions corrupted.

4. Impact Analysis

Dimension	Impact
Users affected	100% — full service outage
Transactions blocked	~3–5 remittances + ~2 QR payments
Revenue impact	Approx. NOK 6,000–10,000
Compliance	None — no data loss, audit logs intact throughout
Regulatory	No notification required (< 4h, no PII exposure)
Reputation	Users saw error screens — limited blast radius pre-public-launch
SLA	28 min downtime → monthly uptime 99.94%
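The monthly uptime figure can be checked with a short calculation. This TypeScript sketch assumes a 30-day month, which the source does not state; the function name `monthlyUptimePercent` is hypothetical.

```typescript
// Monthly uptime percentage for a given downtime, rounded to two decimals.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  const uptime = 100 * (1 - downtimeMinutes / totalMinutes);
  return Math.round(uptime * 100) / 100;
}

const uptime30 = monthlyUptimePercent(28, 30); // 99.94 (30-day month, assumed)
const uptime28 = monthlyUptimePercent(28, 28); // 99.93 (28-day month, for comparison)
```

The 99.94% value in the table is consistent with a 30-day month; a 28-day month gives a slightly lower figure.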

5. Corrective Actions

5.1 Immediate (before next marketing campaign)

#	Action	Owner	Status
1	Configure explicit pg pool: `max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000`	Platform	Pending
2	Add CloudWatch alarm: `DatabaseConnections > 70` → Slack `#drop-ops`	Alem	Pending
3	Add global rate limit on `/api/auth/bankid/initiate` (e.g., 100/min across all IPs)	Platform	Pending
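Action 3 differs from the existing per-IP limit in that a single counter is shared by all callers. A minimal sketch of such a limiter in TypeScript, assuming a fixed one-minute window; the class name `GlobalRateLimiter` and the fixed-window strategy are illustrative choices, not the project's implementation.

```typescript
// Fixed-window counter shared by all callers (deliberately not keyed by IP).
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. nowMs is injectable so the logic can be tested.
  tryAcquire(nowMs: number = Date.now()): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      // A new window begins: reset the shared counter.
      this.windowStart = nowMs;
      this.count = 0;
    }
    if (this.count >= this.limit) {
      return false;
    }
    this.count++;
    return true;
  }
}

// 100 requests per minute across all IPs, as suggested in action 3.
const limiter = new GlobalRateLimiter(100, 60_000);
let allowed = 0;
for (let i = 0; i < 120; i++) {
  if (limiter.tryAcquire(1_000)) allowed++; // 120 requests in the same instant
}
// allowed === 100; the remaining 20 would be rejected (for example with HTTP 429)
const allowedAfterWindow = limiter.tryAcquire(61_000); // true: a new window has started
```

This counter lives in one process. If the service runs more than one instance, the count would need to live in a shared store for the limit to be truly global.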

5.2 Before v1.0 Launch

#	Action	Owner	Priority
4	Add PgBouncer or RDS Proxy to externalize connection pooling	Platform	P1
5	Report pool utilization in `/api/health` response (`poolSize`, `idleCount`, `waitingCount`)	Platform	P2
6	Implement graceful degradation for non-DB routes when DB is unavailable	Platform	P2
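For action 5, one possible shape of the extended health payload is sketched below in TypeScript. The `totalCount`, `idleCount`, and `waitingCount` counters are the ones a `pg` pool exposes; the function name `buildHealthResponse`, the placement of the fields under `checks.db`, and the `"ok"` value for a passing DB check are assumptions for illustration.

```typescript
// Subset of the counters exposed by a pg Pool instance.
interface PoolStats {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

interface HealthResponse {
  status: "ok" | "down";
  checks: {
    db: {
      status: "ok" | "fail";
      poolSize: number;
      idleCount: number;
      waitingCount: number;
    };
  };
}

// Hypothetical helper: combines the DB check result with pool utilization.
function buildHealthResponse(dbReachable: boolean, pool: PoolStats): HealthResponse {
  return {
    status: dbReachable ? "ok" : "down",
    checks: {
      db: {
        status: dbReachable ? "ok" : "fail",
        poolSize: pool.totalCount,
        idleCount: pool.idleCount,
        waitingCount: pool.waitingCount,
      },
    },
  };
}

// Example: a saturated pool of 10 with 35 callers queued.
const saturated = buildHealthResponse(false, { totalCount: 10, idleCount: 0, waitingCount: 35 });
```

A rising `waitingCount` with `idleCount` at zero would indicate pool pressure before the DB check itself starts failing.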

5.3 Process (ongoing)

#	Action	Owner	Due
7	Create "Marketing → Infra" coordination checklist, to be completed before any campaign	Alem	Within 2 weeks
8	Add DB connection metrics to weekly monitoring review	Alem	Ongoing
9	Test App Runner restart as a documented runbook step	Platform	Q2 2026

6. Systemic Improvements

6.1 Connection Pooling Fix

Current state: Implicit pool, no limits, no timeout.

Target state:

// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // Hard cap: never exceed RDS t4g.micro limit
  idleTimeoutMillis: 30000,      // Release idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if pool is exhausted
});

When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.

6.2 CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "drop-db-connections-high" \
  --alarm-description "RDS DatabaseConnections > 70" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
  --region eu-west-1
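The alarm parameters above (`--period 60`, `--evaluation-periods 2`, `--threshold 70`, `GreaterThanThreshold`) mean that two consecutive one-minute averages must exceed 70 before the alarm fires. The TypeScript sketch below models that rule in simplified form; the function name `alarmWouldFire` and the sample datapoints are hypothetical and not taken from the incident.

```typescript
// Simplified model: the alarm fires when the most recent `evaluationPeriods`
// datapoints are all strictly greater than the threshold.
function alarmWouldFire(datapoints: number[], threshold: number, evaluationPeriods: number): boolean {
  if (datapoints.length < evaluationPeriods) {
    return false;
  }
  return datapoints.slice(-evaluationPeriods).every((value) => value > threshold);
}

// Hypothetical one-minute averages of DatabaseConnections.
const sustained = alarmWouldFire([12, 15, 78, 85], 70, 2); // true: last two periods are above 70
const singleSpike = alarmWouldFire([12, 15, 40, 85], 70, 2); // false: only one period is above 70
const atThreshold = alarmWouldFire([70, 70], 70, 2); // false: the comparison is strictly greater than
```

With a 60-second period and two evaluation periods, a short spike that clears within a minute would not trigger the alarm, so the settings may be worth revisiting if faster detection is the goal.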

6.3 Marketing Campaign Checklist (Pre-Launch)

Before any marketing campaign that targets > 100 recipients:

Notify infrastructure (Alem) at least 24h before send

Check current DatabaseConnections baseline in CloudWatch

Verify pool configuration is explicit

Consider sending campaign in batches (< 100/hour) to spread load
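The batching item in the checklist can be sketched as follows in TypeScript, assuming a list of about 500 recipients as in this incident. The function name `toBatches` and the placeholder addresses are hypothetical.

```typescript
// Split a recipient list into batches of at most batchSize entries.
function toBatches<T>(recipients: T[], batchSize: number): T[][] {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }
  const batches: T[][] = [];
  for (let i = 0; i < recipients.length; i += batchSize) {
    batches.push(recipients.slice(i, i + batchSize));
  }
  return batches;
}

// Placeholder addresses for illustration only.
const recipients = Array.from({ length: 500 }, (_, i) => `user${i}@example.com`);
const batches = toBatches(recipients, 99); // stays under 100 recipients per hourly batch
// batches.length === 6, so sending one batch per hour spreads the campaign over about six hours
```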

7. Lessons Learned

Explicit is always better than implicit for resource limits. Never rely on OS defaults for connection pool configuration in production.

Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A DatabaseConnections CloudWatch alarm would have caught this 2 minutes earlier.

Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.

App Runner restart is a fast, reliable mitigation. An RTO under 5 minutes is acceptable for this class of issue. Document it as a first-response step in the runbook.

BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.

DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.

8. Action Item Tracking

#	Action	Owner	Due	Status
1	Configure explicit pg pool limits	Platform	Before next campaign	Pending
2	CloudWatch alarm on DatabaseConnections > 70	Alem	Within 1 week	Pending
3	Global rate limit on BankID initiate	Platform	Before next campaign	Pending
4	PgBouncer / RDS Proxy	Platform	Before v1.0	Pending
5	Pool utilization in /api/health	Platform	Before v1.0	Pending
6	Graceful degradation for non-DB routes	Platform	Before v1.0	Pending
7	Marketing → Infra coordination checklist	Alem	Within 2 weeks	Pending
8	DB connection metrics in weekly review	Alem	Ongoing	Ongoing
9	App Runner restart in DR runbook	Platform	Q2 2026	Pending

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Facilitator	Alem Bašić
Approver	Alem Bašić
