Post-Mortem

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)

Document History

Version | Date | Author | Changes
0.1 | 2026-02-23 | Platform Architect (AI) | Initial draft: example post-mortem (simulated pre-launch scenario, mirrors INC-2026-001)

Post-Mortem Overview

This document is filled in with a realistic example post-mortem based on Drop's architecture. It documents the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.

Blameless Culture Statement

This post-mortem is conducted in a blameless spirit. Our goal is to understand how and why the incident occurred, not to assign fault to individuals. People make the best decisions they can with the information and tools available at the time. When things go wrong, we look for systemic improvements that make the right action easier and the wrong action harder for everyone.


Incident Reference & Metadata

Field | Value
Incident ID | INC-2026-001
Severity | P1 (Critical)
Post-Mortem Date | 2026-02-21
Facilitator | Alem Bašić
Incident Commander | Alem Bašić
Participants | Alem Bašić (CEO), Platform Architect (AI)

1. Executive Summary

On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.

Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.


2. Timeline

Time (UTC) | Event | Phase
10:28:00 | Marketing email delivered to ~500 recipients | Pre-incident
10:30:00 | BetterStack detects HTTP 503 on Drop Health Check | Detection
10:30:30 | Slack #drop-ops alert fires | Detection
10:30:45 | Alem acknowledges alert | Response
10:31:00 | Alem checks App Runner → status RUNNING | Diagnosis
10:31:30 | Alem checks /api/health, which returns {"status":"down","checks":{"db":{"status":"fail"}}} | Diagnosis
10:32:00 | CloudWatch logs show repeated connection refused to RDS | Diagnosis
10:33:00 | Direct psql connection to RDS succeeds, ruling out an RDS-level failure | Diagnosis
10:34:00 | Hypothesis: application-level connection pool exhaustion | Diagnosis
10:35:00 | Alem triggers App Runner restart via aws apprunner start-deployment | Mitigation
10:38:00 | App Runner deployment completes | Mitigation
10:38:30 | Health check returns {"status":"ok"} | Recovery
10:38:45 | BetterStack recovery alert: "Drop Health Check is UP" | Recovery
10:39:00 | Begin 15-minute stability monitoring window | Post-recovery
10:54:00 | Confirmed stable, error spike cleared | Closed
10:58:00 | Incident formally closed | Closed

Total duration: 28 minutes (10:30 to 10:58 UTC)
Time to detect: < 30 seconds
Time to diagnose root cause: ~4 minutes
Time to apply fix: ~5 minutes (App Runner restart)
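The interval figures for this incident can be recomputed from the timeline timestamps. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that 10:30:00 (the first observed 503) marks the incident start, 10:30:30 the alert, 10:38:30 the recovery, and 10:58:00 the formal close.

```typescript
// Parse an "HH:MM:SS" clock time into seconds since midnight.
function toSeconds(clock: string): number {
  const [h, m, s] = clock.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Format a duration in seconds as "<minutes>m <seconds>s".
function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds}s`;
}

// Timestamps taken from the timeline; which event counts as the "start"
// or "recovery" marker is an assumption made for this sketch.
const incidentStart = "10:30:00";
const alertFired = "10:30:30";
const serviceRecovered = "10:38:30";
const incidentClosed = "10:58:00";

const timeToDetect = toSeconds(alertFired) - toSeconds(incidentStart);
const timeToRecover = toSeconds(serviceRecovered) - toSeconds(incidentStart);
const totalDuration = toSeconds(incidentClosed) - toSeconds(incidentStart);

console.log(`Detect:  ${formatDuration(timeToDetect)}`);
console.log(`Recover: ${formatDuration(timeToRecover)}`);
console.log(`Total:   ${formatDuration(totalDuration)}`);
```

Under these assumptions the total matches the 28 minutes reported above; the detection and recovery figures shift if a different event is chosen as the start marker.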


3. Root Cause Analysis

3.1 The Five Whys

Why 1: Why did Drop return HTTP 503? → The /api/health DB check (SELECT 1) failed.

Why 2: Why did the endpoint's DB check fail? → The application could not acquire a database connection because the pool was exhausted.

Why 3: Why was the pool exhausted? → ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window.

Why 4: Why were 45 concurrent logins able to exhaust the pool? → No explicit connection pool limit was configured in the pg driver. The pool was bounded only by OS-level limits (~85 connections for db.t4g.micro), and there was no queue/timeout, so new requests failed immediately when the limit was hit.

Why 5: Why did no alert fire before the health check failed? → No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.

3.2 Contributing Factors

Factor | Description | Severity
No explicit pool config | pg used without max, idleTimeoutMillis, or connectionTimeoutMillis | High
No DB connection metrics | No CloudWatch alarm on DatabaseConnections > 70 | High
No graceful degradation | Application returned 503 when DB was unavailable, even for non-DB routes | Medium
No rate limiting across all IPs | Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously | Medium
No pre-campaign infra review | Marketing email campaign launched without coordinating with infrastructure | Medium
No connection pool health metric | Health check did not report pool utilization | Low

3.3 What Worked Well

  • BetterStack detection was excellent: < 30 seconds from failure to alert.
  • Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
  • App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
  • RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
  • No data loss: Audit logs intact, no transactions corrupted.

4. Impact Analysis

Dimension | Impact
Users affected | 100% (full service outage)
Transactions blocked | ~3–5 remittances + ~2 QR payments
Revenue impact | Approx. NOK 6,000–10,000
Compliance | None: no data loss, audit logs intact throughout
Regulatory | No notification required (< 4h, no PII exposure)
Reputation | Users saw error screens; limited blast radius pre-public-launch
SLA | 28 min downtime → monthly uptime 99.94%
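The monthly uptime figure in the SLA row can be checked with a short calculation. This is a sketch that assumes a 30-day month as the denominator; the function name is hypothetical.

```typescript
// Monthly uptime as a percentage, given downtime in minutes.
// Assumption: the month length is passed in explicitly (30 days here).
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return 100 * (1 - downtimeMinutes / totalMinutes);
}

const uptime30 = monthlyUptimePercent(28, 30).toFixed(2);
const uptime28 = monthlyUptimePercent(28, 28).toFixed(2);

console.log(`30-day month: ${uptime30}%`); // 99.94%
console.log(`28-day month: ${uptime28}%`); // 99.93%
```

With a 28-day February as the denominator the same downtime rounds to 99.93%, so the SLA row is worth annotating with the month length used.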

5. Corrective Actions

5.1 Immediate (before next marketing campaign)

# | Action | Owner | Status
1 | Configure explicit pg pool: max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000 | Platform | Pending
2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack #drop-ops | Alem | Pending
3 | Add global rate limit on /api/auth/bankid/initiate (e.g., 100/min across all IPs) | Platform | Pending
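Action 3 could take the form of a fixed-window counter shared by all callers of the BankID initiate route. The class below is a minimal in-process sketch, not Drop's implementation: the class name and the fixed-window strategy are assumptions, and a deployment with more than one App Runner instance would need a shared store rather than process memory.

```typescript
// Fixed-window limiter that counts every request, regardless of client IP.
class GlobalRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart: number = Number.NEGATIVE_INFINITY;
  private count: number = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is admitted. `now` is injectable for tests.
  tryAcquire(now: number = Date.now()): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // A new window begins with the first request after the old one expires.
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.limit) {
      return false;
    }
    this.count += 1;
    return true;
  }
}

// Example: 100 requests per minute across all IPs, as suggested in action 3.
const bankIdInitiateLimiter = new GlobalRateLimiter(100, 60_000);
console.log(bankIdInitiateLimiter.tryAcquire()); // first request is admitted
```

A route handler would call tryAcquire() before starting the BankID flow and return HTTP 429 when it yields false. A fixed window allows short bursts at window boundaries; a sliding window or token bucket would smooth those at the cost of more state.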

5.2 Before v1.0 Launch

# | Action | Owner | Priority
4 | Add PgBouncer or RDS Proxy to externalize connection pooling | Platform | P1
5 | Report pool utilization in /api/health response (poolSize, idleCount, waitingCount) | Platform | P2
6 | Implement graceful degradation for non-DB routes when DB is unavailable | Platform | P2
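For action 5, the health payload could be derived from the counters that a node-postgres Pool exposes (totalCount, idleCount, waitingCount). The sketch below works against a plain interface so it runs without a database. The PoolCheck shape, the saturated flag, and the sample numbers are assumptions for illustration.

```typescript
// Subset of the counters available on a node-postgres Pool instance.
interface PoolStats {
  totalCount: number;   // connections currently open (idle + in use)
  idleCount: number;    // open connections not checked out
  waitingCount: number; // callers queued waiting for a connection
}

// Hypothetical shape for the pool section of the /api/health response.
interface PoolCheck {
  poolSize: number;
  idleCount: number;
  waitingCount: number;
  utilization: number; // fraction of the configured max currently in use
  saturated: boolean;  // true when callers are queuing or the cap is reached
}

function buildPoolCheck(stats: PoolStats, maxConnections: number): PoolCheck {
  const inUse = stats.totalCount - stats.idleCount;
  const utilization = maxConnections > 0 ? inUse / maxConnections : 0;
  return {
    poolSize: stats.totalCount,
    idleCount: stats.idleCount,
    waitingCount: stats.waitingCount,
    utilization,
    saturated: stats.waitingCount > 0 || utilization >= 1,
  };
}

// Illustrative numbers only: a full pool of 10 with callers queued behind it.
const underBurst = buildPoolCheck({ totalCount: 10, idleCount: 0, waitingCount: 35 }, 10);
console.log(JSON.stringify(underBurst));
```

In the application, the stats argument would be the pool object itself, since it already carries those three properties.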

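5.3 Process (ongoing)

# | Action | Owner | Due
7 | Create "Marketing → Infra" coordination checklist (must be completed before any campaign) | Alem | Within 2 weeks
8 | Add DB connection metrics to weekly monitoring review | Alem | Ongoing
9 | Test App Runner restart as a documented runbook step | Platform | Q2 2026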


6. What Went Well

  1. {{CATEGORY_1}}: {{DESCRIPTION}}
  2. {{CATEGORY_2}}: {{DESCRIPTION}}
  3. {{CATEGORY_3}}: {{DESCRIPTION}}

7. What Went Wrong

  1. {{CATEGORY_1}}: {{DESCRIPTION}}
  2. {{CATEGORY_2}}: {{DESCRIPTION}}
  3. {{CATEGORY_3}}: {{DESCRIPTION}}

8. Where We Got Lucky

  1. {{LUCKY_1}}
  2. {{LUCKY_2}}
  3. {{LUCKY_3}}

9. Action Items

Short-Term Fixes (This Sprint)

# | Action | Owner | Due | Priority | Ticket
1 | {{SHORT_TERM_1}} | {{OWNER}} | {{DATE}} | Critical | {{TICKET}}
2 | {{SHORT_TERM_2}} | {{OWNER}} | {{DATE}} | High | {{TICKET}}
3 | {{SHORT_TERM_3}} | {{OWNER}} | {{DATE}} | Medium | {{TICKET}}

6. Systemic Improvements

6.1 Connection Pooling Fix

Current state: Implicit pool, no limits, no timeout.

Target state:

// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // Hard cap: never exceed RDS t4g.micro limit
  idleTimeoutMillis: 30000,      // Release idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if pool is exhausted
});

When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.

6.2 CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "drop-db-connections-high" \
  --alarm-description "RDS DatabaseConnections > 70" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
  --region eu-west-1
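With --period 60 and --evaluation-periods 2, the alarm above is expected to fire only after two consecutive one-minute averages exceed 70. The function below is a simplified model of that rule for reasoning about thresholds; it is not CloudWatch's implementation, it ignores missing-data handling, and its name is hypothetical.

```typescript
// Simplified model: alarm when the most recent `evaluationPeriods`
// datapoints are all strictly greater than the threshold.
function breachesAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (evaluationPeriods < 1 || datapoints.length < evaluationPeriods) {
    return false; // not enough data to evaluate
  }
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold);
}

// Illustrative per-minute averages of DatabaseConnections.
console.log(breachesAlarm([40, 72, 75], 70, 2)); // true: last two both exceed 70
console.log(breachesAlarm([72, 60, 75], 70, 2)); // false: the breach is not consecutive
```

Because two periods must breach, a burst shorter than about two minutes may never trip this alarm, which is worth weighing against the 10-second burst window described in the Five Whys.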

6.3 Marketing Campaign Checklist (Pre-Launch)

Before any marketing campaign that targets > 100 recipients:

  •  Notify infrastructure (Alem) at least 24h before send
  •  Check current DatabaseConnections baseline in CloudWatch
  •  Verify pool configuration is explicit
  •  Consider sending campaign in batches (< 100/hour) to spread load
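The batching item in the checklist can be planned ahead of a send. The sketch below splits a recipient count into hourly batches that stay under the 100-per-hour guideline. The function name and the one-batch-per-hour schedule are assumptions.

```typescript
interface Batch {
  hourOffset: number; // hours after the first send
  size: number;       // recipients in this batch
}

// Split a campaign into hourly batches of at most `maxPerHour` recipients.
function planBatches(totalRecipients: number, maxPerHour: number): Batch[] {
  if (maxPerHour < 1) {
    throw new Error("maxPerHour must be at least 1");
  }
  const batches: Batch[] = [];
  let remaining = totalRecipients;
  for (let hour = 0; remaining > 0; hour++) {
    const size = Math.min(maxPerHour, remaining);
    batches.push({ hourOffset: hour, size });
    remaining -= size;
  }
  return batches;
}

// The campaign in this incident reached ~500 recipients; "< 100/hour" means at most 99.
const plan = planBatches(500, 99);
console.log(`${plan.length} batches over ${plan.length} hours`);
```

For 500 recipients this yields six batches, so the full send takes roughly six hours instead of arriving as a single burst.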

7. Lessons Learned

  1. Explicit is always better than implicit for resource limits. Never rely on OS defaults for connection pool configuration in production.
  2. Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. DatabaseConnections CloudWatch alarm would have caught this 2 minutes earlier.
  3. Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
  4. App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.
  5. BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.
  6. DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.

8. Action Item Tracking

# | Action | Owner | Due | Status
1 | Configure explicit pg pool limits | Platform | Before next campaign | Pending
2 | CloudWatch alarm on DatabaseConnections > 70 | Alem | Within 1 week | Pending
3 | Global rate limit on BankID initiate | Platform | Before next campaign | Pending
4 | PgBouncer / RDS Proxy | Platform | Before v1.0 | Pending
5 | Pool utilization in /api/health | Platform | Before v1.0 | Pending
6 | Graceful degradation for non-DB routes | Platform | Before v1.0 | Pending
7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks | Pending
8 | DB connection metrics in weekly review | Alem | Ongoing | Ongoing
9 | App Runner restart in DR runbook | Platform | Q2 2026 | Pending

Process Changes

# | Change | Owner | Implementation Date
1 | {{PROCESS_1}} | {{OWNER}} | {{DATE}}
2 | {{PROCESS_2}} | {{OWNER}} | {{DATE}}

10. Follow-Up Tracking

Follow-up review date: {{FOLLOWUP_DATE}} (4 weeks after incident)
Follow-up owner: {{FOLLOWUP_OWNER}}

Action Item | Expected Completion | Verified Complete | Effective
{{ACTION_1}} | {{DATE}} | Yes / No | Yes / No / TBD
{{ACTION_2}} | {{DATE}} | |

11. Recurrence Prevention

Before this incident: {{BEFORE_STATE}}

After implementing action items: {{AFTER_STATE}}

Confidence in prevention: {{CONFIDENCE}} / 10
Residual risk: {{RESIDUAL_RISK}}


12. Review & Sign-Off

Post-mortem presented at: {{MEETING}} on {{MEETING_DATE}}
Meeting recording: {{RECORDING_LINK}}
Meeting notes: {{NOTES_LINK}}



Approval

Role | Name | Date | Signature
Author | Platform Architect (AI) | 2026-02-23 |
Facilitator / Reviewer | Alem Bašić | |
Approver | Alem Bašić | |