Post-Mortem

Project: ~~Drop~~{{PROJECT_NAME}} Version: ~~0.1.0~~{{VERSION}} Date: ~~2026-02-23~~{{DATE}} Author: ~~Platform Architect (AI)~~{{AUTHOR}} Status: Draft | In Review | Approved Reviewers: ~~Alem Bašić (CEO)~~{{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	~~2026-02-23~~{{DATE}}	~~Platform Architect (AI)~~{{AUTHOR}}	~~Example~~Initial ~~post-mortem (simulated pre-launch scenario — mirrors INC-2026-001)~~draft

~~Post-Mortem Overview~~Blameless Culture Statement

~~This document is a filled realistic example post-mortem based on Drop's architecture. It documents the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use as a template for future real incidents.~~

This post-mortem is conducted in a blameless spirit. Our goal is to understand how and why the incident occurred, not to assign fault to individuals. People make the best decisions they can with the information and tools available at the time. When things go wrong, we look for systemic improvements that make the right action easier and the wrong action harder for everyone.

1. Incident Reference & Metadata

Field	Value
Incident ID	INC-~~2026-001~~{{YYYY}}-{{SEQ}}
Severity	~~P1 — Critical~~P{{SEVERITY}}
Incident Report	INC-{{YYYY}}-{{SEQ}}
Post-Mortem Facilitator	{{FACILITATOR}}
Post-Mortem Date	~~2026-02-21~~{{PM_DATE}}
~~Facilitator~~Attendees	~~Alem Bašić~~{{ATTENDEES}}
~~Incident Commander~~Status	~~Alem Bašić~~Draft / In Review / Final
~~Participants~~	~~Alem Bašić (CEO), Platform Architect (AI)~~

~~1.~~2. Executive Summary

~~On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.~~

{{EXECUTIVE_SUMMARY}}

Example: "A database index was dropped during a migration on {{DATE}}, causing query performance to degrade by 50× under load. This resulted in a 1h 23min degraded service period affecting {{USERS}} users. We have restored the index, added migration validation tooling, and created safeguards to prevent similar incidents."

~~Bottom line:~~ The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.

~~2. Timeline~~3. Impact Summary

~~Time (UTC)~~	~~Event~~	~~Phase~~
~~10:28:00~~	~~Marketing email delivered to ~500 recipients~~	~~Pre-incident~~
~~10:30:00~~	~~BetterStack detects HTTP 503 on Drop Health Check~~	~~Detection~~
~~10:30:30~~	~~Slack `#drop-ops` alert fires~~	~~Detection~~
~~10:30:45~~	~~Alem acknowledges alert~~	~~Response~~
~~10:31:00~~	~~Alem checks App Runner → status `RUNNING`~~	~~Diagnosis~~
~~10:31:30~~	~~Alem checks `/api/health` → `{"status":"down","checks":{"db":{"status":"fail"}}}`~~	~~Diagnosis~~

Metric	Value
Total duration	{{DURATION}} (detected at {{DETECTED}}, resolved {{RESOLVED}})
Users affected	{{USER_COUNT}} ({{USER_PERCENT}}% of user base)
Requests affected	{{REQUEST_COUNT}} ({{REQUEST_PERCENT}}% error rate during incident)
Estimated revenue impact	${{REVENUE}}
SLA breach	{{SLA_BREACH}}
SLA credits owed	${{CREDITS}}

4. Detailed Timeline

timeline
    title Incident Timeline
    {{TIME_1}} : {{EVENT_1}}
    {{TIME_2}} : {{EVENT_2}}
    {{TIME_3}} : {{EVENT_3}}
    {{TIME_4}} : {{EVENT_4}}
    {{TIME_5}} : {{EVENT_5}}

Time	Event	MTTD/MTTR Marker
{{T1}}	{{EVENT}}	← Incident start
~~10:32:00~~{{T2}}	~~CloudWatch logs show repeated `connection refused` to RDS~~{{EVENT}}	~~Diagnosis~~
~~10:33:00~~{{T3}}	~~Direct psql connection to RDS succeeds — rules out RDS-level failure~~{{EVENT}}	~~Diagnosis~~← Detection (MTTD = T3 - T1)
~~10:34:00~~{{T4}}	~~Hypothesis: application-level connection pool exhaustion~~{{EVENT}}	~~Diagnosis~~
~~10:35:00~~{{T5}}	~~Alem triggers App Runner restart via `aws apprunner start-deployment`~~{{EVENT}}	~~Mitigation~~
~~10:38:00~~{{T6}}	~~App Runner deployment completes~~{{EVENT}}	~~Mitigation~~
~~10:38:30~~{{T7}}	~~Health check returns `{"status":"ok"}`~~{{EVENT}}	~~Recovery~~
~~10:38:45~~{{T8}}	~~BetterStack recovery alert: "Drop Health Check is UP"~~{{EVENT}}	~~Recovery~~← Resolved (MTTR = T8 - T1)
~~10:39:00~~	~~Begin 15-minute stability monitoring window~~	~~Post-recovery~~
~~10:54:00~~	~~Confirmed stable — error spike cleared~~	~~Closed~~
~~10:58:00~~	~~Incident formally closed~~	~~Closed~~



~~Total duration: 28 minutes (10:30 — 10:58 UTC)~~
~~Time to detect: < 30 seconds~~
~~Time to diagnose root cause: ~4 minutes~~
~~Time to apply fix: ~5 minutes (App Runner restart)~~
MTTD (Mean Time to Detect): {{MTTD}} minutes
MTTR (Mean Time to Resolve): {{MTTR}} minutes
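The timeline markers define MTTD as T3 - T1 and MTTR as T8 - T1. A minimal sketch of that arithmetic follows; the `IncidentMarkers` shape, the function names, and the timestamps are hypothetical and only illustrate how the two placeholders could be filled in.

```typescript
// Hypothetical marker set; replace the values with the real T1, T3 and T8.
interface IncidentMarkers {
  start: Date;    // T1: incident start
  detected: Date; // T3: detection
  resolved: Date; // T8: resolution
}

// Elapsed time between two instants, in minutes.
function minutesBetween(from: Date, to: Date): number {
  return (to.getTime() - from.getTime()) / 60000;
}

function computeMetrics(m: IncidentMarkers): { mttd: number; mttr: number } {
  return {
    mttd: minutesBetween(m.start, m.detected), // MTTD = T3 - T1
    mttr: minutesBetween(m.start, m.resolved), // MTTR = T8 - T1
  };
}

// Illustrative timestamps only, not taken from a real incident.
const metrics = computeMetrics({
  start: new Date("2026-01-01T10:00:00Z"),
  detected: new Date("2026-01-01T10:03:00Z"),
  resolved: new Date("2026-01-01T10:45:00Z"),
});
console.log(metrics); // { mttd: 3, mttr: 45 }
```

Both values are measured from the same incident start, so MTTR includes the detection time.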

~~3.~~5. Root Cause Analysis
~~3.1 The Five Whys~~5.1 5 Whys Analysis

~~Why did Drop return HTTP 503?~~
~~→ The `/api/health` endpoint's DB check (`SELECT 1`) failed.~~

~~Why did the DB check fail?~~
~~→ The application could not acquire a database connection — pool was exhausted.~~

~~Why was the pool exhausted?~~
~~→ ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window.~~

~~Why were 45 concurrent logins able to exhaust the pool?~~
~~→ No explicit connection pool limit was configured in the pg driver. The pool was bounded only by OS-level limits (~85 connections for db.t4g.micro), and there was no queue/timeout — new requests failed immediately when the limit was hit.~~

~~Why did no alert fire before the health check failed?~~
~~→ No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.~~

Why #	Question	Answer
Why 1	Why did users experience {{SYMPTOM}}?	{{WHY_1}}
Why 2	Why did {{WHY_1_ANSWER}} happen?	{{WHY_2}}
Why 3	Why did {{WHY_2_ANSWER}} happen?	{{WHY_3}}
Why 4	Why did {{WHY_3_ANSWER}} happen?	{{WHY_4}}
Why 5	Why did {{WHY_4_ANSWER}} happen?	{{WHY_5}}

Root cause: {{ROOT_CAUSE}}
~~3.2~~5.2 Contributing Factors

~~Factor~~	~~Description~~	~~Severity~~
~~No explicit pool config~~	~~`pg` used without `max`, `idleTimeoutMillis`, or `connectionTimeoutMillis`~~	~~High~~
~~No DB connection metrics~~	~~No CloudWatch alarm on `DatabaseConnections > 70`~~	~~High~~
~~No graceful degradation~~	~~Application returned 503 when DB was unavailable, even for non-DB routes~~	~~Medium~~
~~No rate limiting across all IPs~~	~~Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously~~	~~Medium~~
~~No pre-campaign infra review~~	~~Marketing email campaign launched without coordinating with infrastructure~~	~~Medium~~
~~No connection pool health metric~~	~~Health check did not report pool utilization~~	~~Low~~

Factor	Type	Action Required
{{FACTOR_1}}	Technical / Process / Human	Yes / No
{{FACTOR_2}}	Technical / Process / Human	Yes / No
{{FACTOR_3}}	Technical / Process / Human	Yes / No

3.3 What Worked Well


BetterStack detection was excellent: < 30 seconds from failure to alert.

Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.

App Runner restart is fast and reliable: Recovery completed in < 5 minutes.

RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.

No data loss: Audit logs intact, no transactions corrupted.



4. Impact Analysis

Dimension	Impact
Users affected	100% — full service outage
Transactions blocked	~3–5 remittances + ~2 QR payments
Revenue impact	Approx. NOK 6,000–10,000
Compliance	None — no data loss, audit logs intact throughout
Regulatory	No notification required (< 4h, no PII exposure)
Reputation	Users saw error screens — limited blast radius pre-public-launch
SLA	28 min downtime → monthly uptime 99.94%
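The SLA row converts 28 minutes of downtime into a monthly uptime figure. A small sketch of that arithmetic, assuming a 30-day month (the function name is hypothetical); under that assumption the result rounds to the 99.94% quoted above, while a 28-day month would give roughly 99.93%.

```typescript
// Monthly uptime as a percentage, given downtime in minutes and the month length in days.
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60; // 43,200 minutes for a 30-day month
  return (1 - downtimeMinutes / totalMinutes) * 100;
}

// Assumption: 30-day month. 28 / 43200 is about 0.065% downtime.
const uptime30 = monthlyUptimePercent(28, 30);
console.log(uptime30.toFixed(2)); // 99.94

// For comparison, the same downtime over a 28-day month.
const uptime28 = monthlyUptimePercent(28, 28);
console.log(uptime28.toFixed(2)); // 99.93
```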


5. Corrective Actions

5.1 Immediate (before next marketing campaign)

#	Action	Owner	Status
1	Configure explicit pg pool: `max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000`	Platform	Pending
2	Add CloudWatch alarm: `DatabaseConnections > 70` → Slack `#drop-ops`	Alem	Pending
3	Add global rate limit on `/api/auth/bankid/initiate` (e.g., 100/min across all IPs)	Platform	Pending
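Action 3 calls for a limit that applies across all IPs rather than per IP. One way this could look is a fixed-window counter shared by every caller; the sketch below is an assumption about the approach, not the project's implementation, and `GlobalRateLimiter` and `tryAcquire` are hypothetical names. The clock is passed in so the behaviour is easy to test.

```typescript
// Fixed-window limiter shared by all callers (deliberately not keyed by IP).
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected
  // (for example with HTTP 429).
  tryAcquire(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs; // start a new window
      this.count = 0;
    }
    if (this.count >= this.limit) {
      return false;
    }
    this.count += 1;
    return true;
  }
}

// 100 requests per minute across all clients, as in the example figure above.
const limiter = new GlobalRateLimiter(100, 60000);
const allowed = limiter.tryAcquire(Date.now());
console.log(allowed); // true for the first request in a window
```

A single in-process counter only covers one instance; with several instances the counter would need to live in shared storage, which is outside this sketch.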

5.2 Before v1.0 Launch

#	Action	Owner	Priority
4	Add PgBouncer or RDS Proxy to externalize connection pooling	Platform	P1
5	Report pool utilization in `/api/health` response (`poolSize`, `idleCount`, `waitingCount`)	Platform	P2
6	Implement graceful degradation for non-DB routes when DB is unavailable	Platform	P2
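Action 5 asks the health response to report `poolSize`, `idleCount`, and `waitingCount`. The sketch below shows one possible shape for that payload. It assumes the pool exposes `totalCount`, `idleCount`, and `waitingCount` counters (node-postgres pools do); the `poolHealth` function, the `utilization` field, and the 80% threshold are hypothetical.

```typescript
// Structural subset of the pool counters this sketch relies on.
interface PoolCounters {
  totalCount: number;   // connections currently open
  idleCount: number;    // open connections not in use
  waitingCount: number; // requests queued for a connection
}

interface PoolHealth {
  poolSize: number;
  idleCount: number;
  waitingCount: number;
  utilization: number; // busy connections divided by max, between 0 and 1
  status: "ok" | "degraded";
}

function poolHealth(pool: PoolCounters, max: number): PoolHealth {
  const busy = pool.totalCount - pool.idleCount;
  const utilization = max > 0 ? busy / max : 0;
  // Hypothetical rule: degraded when requests are queued or the pool is at least 80% busy.
  const status: PoolHealth["status"] =
    pool.waitingCount > 0 || utilization >= 0.8 ? "degraded" : "ok";
  return {
    poolSize: pool.totalCount,
    idleCount: pool.idleCount,
    waitingCount: pool.waitingCount,
    utilization,
    status,
  };
}

// Example with made-up counters and the max of 10 from action 1.
console.log(poolHealth({ totalCount: 4, idleCount: 2, waitingCount: 0 }, 10).status); // ok
```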



5.3 ~~Process (ongoing)~~Trigger Event

The specific trigger for this incident: {{TRIGGER}}



6. What Went Well



{{CATEGORY_1}}: {{DESCRIPTION}}


{{CATEGORY_2}}: {{DESCRIPTION}}

{{CATEGORY_3}}: {{DESCRIPTION}}



7. What Went Wrong



{{CATEGORY_1}}: {{DESCRIPTION}}


{{CATEGORY_2}}: {{DESCRIPTION}}

{{CATEGORY_3}}: {{DESCRIPTION}}



8. Where We Got Lucky



{{LUCKY_1}} 

{{LUCKY_2}} 

{{LUCKY_3}} 



9. Action Items


Short-Term Fixes (This Sprint)

#	Action	Owner	Due	Priority	Ticket
~~7~~1	~~Create "Marketing → Infra" coordination checklist — must be completed before any campaign~~{{SHORT_TERM_1}}	~~Alem~~{{OWNER}}	~~Within 2 weeks~~{{DATE}}	Critical	{{TICKET}}
~~8~~2	~~Add DB connection metrics to weekly monitoring review~~{{SHORT_TERM_2}}	~~Alem~~{{OWNER}}	~~Ongoing~~{{DATE}}	High	{{TICKET}}
~~9~~3	~~Test App Runner restart as a documented runbook step~~{{SHORT_TERM_3}}	~~Platform~~{{OWNER}}	~~Q2 2026~~{{DATE}}	Medium	{{TICKET}}



~~6. Systemic Improvements~~
Long-Term Improvements (Next Quarter)

6.1 Connection Pooling Fix

Current state: Implicit pool, no limits, no timeout.

Target state:

// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                    // Hard cap — never exceed RDS t4g.micro limit
  idleTimeoutMillis: 30000,   // Release idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if pool is exhausted
});

When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.

6.2 CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "drop-db-connections-high" \
  --alarm-description "RDS DatabaseConnections > 70" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
  --region eu-west-1
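The alarm above uses a 60-second period, two evaluation periods, the Average statistic, and a strict greater-than comparison against 70. The sketch below is a simplified model of that rule, handy for reasoning about which connection counts would trigger it. It is not CloudWatch's full evaluation logic (missing-data handling and "datapoints to alarm" are ignored), and `alarmFires` is a hypothetical name.

```typescript
// Simplified model: the alarm fires when each of the most recent
// `evaluationPeriods` per-period averages is strictly above the threshold.
function alarmFires(
  periodAverages: number[],
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (evaluationPeriods < 1 || periodAverages.length < evaluationPeriods) {
    return false; // not enough data to evaluate
  }
  const recent = periodAverages.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold);
}

// Made-up per-minute averages of DatabaseConnections.
console.log(alarmFires([40, 72, 75], 70, 2)); // true: the last two minutes are both above 70
console.log(alarmFires([40, 72, 69], 70, 2)); // false: the latest minute dropped below 70
```

With these settings a spike has to persist for two consecutive one-minute periods, so a burst shorter than that may not raise the alarm.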

6.3 Marketing Campaign Checklist (Pre-Launch)

Before any marketing campaign that targets > 100 recipients:


 Notify infrastructure (Alem) at least 24h before send

 Check current DatabaseConnections baseline in CloudWatch

 Verify pool configuration is explicit

 Consider sending campaign in batches (< 100/hour) to spread load
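The last checklist item suggests sending in batches of fewer than 100 per hour. A rough sketch of how a recipient list could be split under that cap; `planBatches` is a hypothetical helper, and the 500-recipient figure matches the campaign size mentioned in this document.

```typescript
// Splits a recipient count into hourly batches that stay strictly below the cap.
function planBatches(recipients: number, maxPerHour: number): number[] {
  const batchSize = maxPerHour - 1; // "< 100/hour" means at most 99 per batch
  if (batchSize < 1) {
    throw new Error("maxPerHour must be at least 2");
  }
  const batches: number[] = [];
  for (let remaining = recipients; remaining > 0; remaining -= batchSize) {
    batches.push(Math.min(batchSize, remaining));
  }
  return batches;
}

// A 500-recipient campaign needs six hourly batches under a cap of 100 per hour.
const plan = planBatches(500, 100);
console.log(plan); // [ 99, 99, 99, 99, 99, 5 ]
```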



7. Lessons Learned


Explicit is always better than implicit for resource limits. Never rely on OS defaults for connection pool configuration in production.

Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. DatabaseConnections CloudWatch alarm would have caught this 2 minutes earlier.

Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.

App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.

BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.

DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.



8. Action Item Tracking









#	Action	Owner	Due	~~Status~~Priority	Ticket
1	~~Configure explicit pg pool limits~~{{LONG_TERM_1}}	~~Platform~~{{OWNER}}	~~Before next campaign~~{{DATE}}	~~Pending~~High	{{TICKET}}
2	~~CloudWatch alarm on DatabaseConnections > 70~~{{LONG_TERM_2}}	~~Alem~~{{OWNER}}	~~Within 1 week~~{{DATE}}	~~Pending~~Medium	{{TICKET}}

Process Changes

#	Change	Owner	Implementation Date
1	{{PROCESS_1}}	{{OWNER}}	{{DATE}}
~~3~~2	~~Global rate limit on BankID initiate~~{{PROCESS_2}}	~~Platform~~{{OWNER}}	~~Before next campaign~~{{DATE}}	~~Pending~~



10. Follow-Up Tracking


Follow-up review date: {{FOLLOWUP_DATE}} (4 weeks after incident)
Follow-up owner: {{FOLLOWUP_OWNER}}


















































Action Item	Expected Completion	Verified Complete	Effective
{{ACTION_1}}	{{DATE}}	Yes / No	Yes / No / TBD
{{ACTION_2}}	{{DATE}}
~~4~~	~~PgBouncer / RDS Proxy~~	~~Platform~~	~~Before v1.0~~	~~Pending~~
~~5~~	~~Pool utilization in /api/health~~	~~Platform~~	~~Before v1.0~~	~~Pending~~
~~6~~	~~Graceful degradation for non-DB routes~~	~~Platform~~	~~Before v1.0~~	~~Pending~~
~~7~~	~~Marketing → Infra coordination checklist~~	~~Alem~~	~~Within 2 weeks~~	~~Pending~~
~~8~~	~~DB connection metrics in weekly review~~	~~Alem~~	~~Ongoing~~	~~Ongoing~~
~~9~~	~~App Runner restart in DR runbook~~	~~Platform~~	~~Q2 2026~~	~~Pending~~




11. Recurrence Prevention


Before this incident: {{BEFORE_STATE}}


After implementing action items: {{AFTER_STATE}}


Confidence in prevention: {{CONFIDENCE}} / 10
Residual risk: {{RESIDUAL_RISK}}



12. Review & Sign-Off


Post-mortem presented at: {{MEETING}} on {{MEETING_DATE}}
Meeting recording: {{RECORDING_LINK}}
Meeting notes: {{NOTES_LINK}}

Related Documents

Incident Report ~~INC-2026-001~~{{ID}}
Operational Runbook
Disaster Recovery Plan

Monitoring & Observability


Approval

Role	Name	Date	~~Signature~~
Author	~~Platform Architect (AI)~~	~~2026-02-23~~
~~Facilitator~~Reviewer	~~Alem Bašić~~
Approver	~~Alem Bašić~~
