
Post-Mortem

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Example post-mortem (simulated pre-launch scenario, mirrors INC-2026-001) |

Post-Mortem Overview

This document contains a realistic example post-mortem based on Drop's architecture. It documents the same incident as the Incident Report (INC-2026-001: RDS connection pool exhaustion) but provides deeper root cause analysis and systemic improvements. Use it as a template for future real incidents.


Incident Reference & Metadata

| Field | Value |
|---|---|
| Incident ID | INC-2026-001 |
| Severity | P1 (Critical) |
| Post-Mortem Date | 2026-02-21 |
| Facilitator | Alem Bašić |
| Incident Commander | Alem Bašić |
| Participants | Alem Bašić (CEO), Platform Architect (AI) |

1. Executive Summary

On 2026-02-20 at 10:30 UTC, Drop experienced a 28-minute P1 outage affecting 100% of production users. The root cause was RDS PostgreSQL connection pool exhaustion triggered by a burst of concurrent BankID authentication attempts following a marketing email campaign. The immediate fix was an App Runner service restart. This post-mortem documents the systemic improvements required to prevent recurrence.

Bottom line: The application lacked explicit connection pool limits and DB-level metrics alerting. A burst of ~45 concurrent logins exhausted the default connection pool, causing all subsequent DB-dependent requests (including the health check) to fail.


2. Timeline

| Time (UTC) | Event | Phase |
|---|---|---|
| 10:28:00 | Marketing email delivered to ~500 recipients | Pre-incident |
| 10:30:00 | BetterStack detects HTTP 503 on Drop Health Check | Detection |
| 10:30:30 | Slack #drop-ops alert fires | Detection |
| 10:30:45 | Alem acknowledges alert | Response |
| 10:31:00 | Alem checks App Runner → status RUNNING | Diagnosis |
| 10:31:30 | Alem checks /api/health → {"status":"down","checks":{"db":{"status":"fail"}}} | Diagnosis |
| 10:32:00 | CloudWatch logs show repeated connection refused to RDS | Diagnosis |
| 10:33:00 | Direct psql connection to RDS succeeds, ruling out an RDS-level failure | Diagnosis |
| 10:34:00 | Hypothesis: application-level connection pool exhaustion | Diagnosis |
| 10:35:00 | Alem triggers App Runner restart via aws apprunner start-deployment | Mitigation |
| 10:38:00 | App Runner deployment completes | Mitigation |
| 10:38:30 | Health check returns {"status":"ok"} | Recovery |
| 10:38:45 | BetterStack recovery alert: "Drop Health Check is UP" | Recovery |
| 10:39:00 | Begin 15-minute stability monitoring window | Post-recovery |
| 10:54:00 | Confirmed stable, error spike cleared | Closed |
| 10:58:00 | Incident formally closed | Closed |

Total duration: 28 minutes (10:30 to 10:58 UTC)
Time to detect: < 30 seconds
Time to diagnose root cause: ~4 minutes
Time to apply fix: ~5 minutes (App Runner restart)
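The intervals above can be derived directly from the timeline timestamps. The following is a minimal sketch; the helper names `toSeconds` and `elapsed` are illustrative and not part of Drop's codebase, and it assumes all timestamps fall on the same UTC day.

```typescript
// Convert an HH:MM:SS timestamp (same UTC day) to seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Elapsed seconds between two timeline entries.
function elapsed(from: string, to: string): number {
  return toSeconds(to) - toSeconds(from);
}

// Timestamps taken from the timeline table.
const timeToAlert = elapsed("10:30:00", "10:30:30");       // failure detected -> Slack alert
const timeToDiagnose = elapsed("10:30:00", "10:34:00");    // detection -> pool exhaustion hypothesis
const restartToRecovery = elapsed("10:35:00", "10:38:45"); // restart triggered -> recovery alert
const totalDuration = elapsed("10:30:00", "10:58:00");     // detection -> incident formally closed

console.log({ timeToAlert, timeToDiagnose, restartToRecovery, totalDuration });
```

Expressed in minutes, these come to 0.5, 4, 3.75, and 28, which is consistent with the rounded figures reported above.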


3. Root Cause Analysis

3.1 The Five Whys

Why did Drop return HTTP 503? → The /api/health endpoint's DB check (SELECT 1) failed.

Why did the DB check fail? → The application could not acquire a database connection because the pool was exhausted.

Why was the pool exhausted? → ~45 concurrent BankID login callbacks (each requiring a DB connection for session upsert) arrived simultaneously within a 10-second window.

Why were 45 concurrent logins able to exhaust the pool? → No explicit connection pool limit was configured in the pg driver. The pool was bounded only by OS-level limits (~85 connections for db.t4g.micro), and there was no queue/timeout, so new requests failed immediately when the limit was hit.

Why did no alert fire before the health check failed? → No CloudWatch alarm was configured on DatabaseConnections. The only production alert path was the BetterStack health check, which by then was already failing.
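The failure chain in the Five Whys can be illustrated with a small in-memory model. This is a sketch and not Drop's actual code: `createFailFastPool`, the limit of 3, and the burst size of 5 are hypothetical values chosen to keep the example short. It shows how a pool with no wait queue rejects the health check's connection request once a login burst holds every connection.

```typescript
// Hypothetical model of a connection pool with a hard limit and no wait queue.
interface FailFastPool {
  acquire(): boolean; // true if a connection was handed out
  release(): void;    // return one connection to the pool
  inUse(): number;    // connections currently held
}

function createFailFastPool(limit: number): FailFastPool {
  let inUse = 0;
  return {
    acquire(): boolean {
      if (inUse >= limit) return false; // exhausted: fail immediately, no queueing
      inUse += 1;
      return true;
    },
    release(): void {
      if (inUse > 0) inUse -= 1;
    },
    inUse: () => inUse,
  };
}

const pool = createFailFastPool(3);

// A burst of 5 login callbacks arrives; only the first 3 get a connection.
const burst = [1, 2, 3, 4, 5].map(() => pool.acquire());

// The health check also needs a connection and is rejected like any other request.
const healthDuringBurst = pool.acquire();

// Once one login finishes and releases its connection, the health check succeeds.
pool.release();
const healthAfterRelease = pool.acquire();

console.log({ burst, healthDuringBurst, healthAfterRelease });
```

The model leaves out timeouts and queueing on purpose, since their absence is what the fourth "why" identifies as the gap.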

3.2 Contributing Factors

| Factor | Description | Severity |
|---|---|---|
| No explicit pool config | pg used without max, idleTimeoutMillis, or connectionTimeoutMillis | High |
| No DB connection metrics | No CloudWatch alarm on DatabaseConnections > 70 | High |
| No graceful degradation | Application returned 503 when DB was unavailable, even for non-DB routes | Medium |
| No rate limiting across all IPs | Per-IP rate limit (10/min) did not prevent burst across many IPs simultaneously | Medium |
| No pre-campaign infra review | Marketing email campaign launched without coordinating with infrastructure | Medium |
| No connection pool health metric | Health check did not report pool utilization | Low |

3.3 What Worked Well

  • BetterStack detection was excellent: < 30 seconds from failure to alert.
  • Slack alert delivery was immediate: Alert to #drop-ops within 30 seconds.
  • App Runner restart is fast and reliable: Recovery completed in < 5 minutes.
  • RDS was not the problem: Direct psql connection succeeded, quickly ruling out infrastructure failure.
  • No data loss: Audit logs intact, no transactions corrupted.

4. Impact Analysis

| Dimension | Impact |
|---|---|
| Users affected | 100% (full service outage) |
| Transactions blocked | ~3–5 remittances + ~2 QR payments |
| Revenue impact | Approx. NOK 6,000–10,000 |
| Compliance | None: no data loss, audit logs intact throughout |
| Regulatory | No notification required (< 4h, no PII exposure) |
| Reputation | Users saw error screens; limited blast radius pre-public-launch |
| SLA | 28 min downtime → monthly uptime 99.94% |
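The SLA figure can be checked with a short calculation. This sketch assumes a 30-day month, which is the assumption that reproduces the 99.94% in the table; the function name `monthlyUptimePercent` is illustrative.

```typescript
// Monthly uptime as a percentage, given total downtime in minutes.
// daysInMonth defaults to 30 (an assumption; the report does not state the month length used).
function monthlyUptimePercent(downtimeMinutes: number, daysInMonth: number = 30): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (1 - downtimeMinutes / totalMinutes) * 100;
}

const uptime = monthlyUptimePercent(28); // 28 minutes of downtime from this incident
const uptimeRounded = uptime.toFixed(2);

console.log(`${uptimeRounded}%`);
```

With a 28-day month the same downtime rounds to 99.93%, so the month length should be stated wherever this number is used for SLA reporting.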

5. Corrective Actions

5.1 Immediate (before next marketing campaign)

| # | Action | Owner | Status |
|---|---|---|---|
| 1 | Configure explicit pg pool: max: 10, idleTimeoutMillis: 30000, connectionTimeoutMillis: 2000 | Platform | Pending |
| 2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack #drop-ops | Alem | Pending |
| 3 | Add global rate limit on /api/auth/bankid/initiate (e.g., 100/min across all IPs) | Platform | Pending |
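Action 3 could take the form of a single counter shared by all callers rather than one keyed per IP. The following is a minimal fixed-window sketch under stated assumptions: `createGlobalRateLimiter` is a hypothetical helper, the counter lives in process memory, and the caller passes in the current time so the behavior is easy to test.

```typescript
interface RateLimiter {
  // Returns true if the request is admitted, false if it should be rejected (e.g. HTTP 429).
  tryAcquire(nowMs: number): boolean;
}

// Fixed-window limiter: at most `limit` admissions per `windowMs`, across all clients.
function createGlobalRateLimiter(limit: number, windowMs: number): RateLimiter {
  let windowStart = Number.NEGATIVE_INFINITY; // forces a fresh window on the first call
  let count = 0;
  return {
    tryAcquire(nowMs: number): boolean {
      if (nowMs - windowStart >= windowMs) {
        windowStart = nowMs; // start a new window
        count = 0;
      }
      if (count >= limit) return false;
      count += 1;
      return true;
    },
  };
}

// 100 requests per minute across all IPs, as suggested in action 3.
const limiter = createGlobalRateLimiter(100, 60_000);

// Simulate 150 login initiations arriving at the same instant.
let admitted = 0;
for (let i = 0; i < 150; i++) {
  if (limiter.tryAcquire(1_000)) admitted += 1;
}

// One minute later a new window opens and requests are admitted again.
const admittedNextWindow = limiter.tryAcquire(61_000);

console.log({ admitted, admittedNextWindow });
```

A fixed window can admit up to twice the limit around a window boundary, and an in-memory counter only covers one process. If the service runs several instances, a shared store or an edge-level limit would be needed; that choice is outside what this report specifies.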

5.2 Before v1.0 Launch

| # | Action | Owner | Priority |
|---|---|---|---|
| 4 | Add PgBouncer or RDS Proxy to externalize connection pooling | Platform | P1 |
| 5 | Report pool utilization in /api/health response (poolSize, idleCount, waitingCount) | Platform | P2 |
| 6 | Implement graceful degradation for non-DB routes when DB is unavailable | Platform | P2 |
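For action 5, a health payload that includes pool utilization might look like the sketch below. It assumes the pool exposes `totalCount`, `idleCount`, and `waitingCount` counters, as a node-postgres `Pool` does. The payload shape extends the `status` / `checks.db.status` structure seen in the timeline; the value "ok" for a healthy DB check and the `buildHealthPayload` helper are assumptions.

```typescript
// Subset of the counters exposed by a node-postgres Pool.
interface PoolStats {
  totalCount: number;   // connections currently open (idle + in use)
  idleCount: number;    // open connections not checked out
  waitingCount: number; // callers queued waiting for a connection
}

interface HealthPayload {
  status: "ok" | "down";
  checks: {
    db: {
      status: "ok" | "fail";
      poolSize: number;
      idleCount: number;
      waitingCount: number;
    };
  };
}

// Build the /api/health body from the DB check result and the pool counters.
function buildHealthPayload(pool: PoolStats, dbReachable: boolean): HealthPayload {
  return {
    status: dbReachable ? "ok" : "down",
    checks: {
      db: {
        status: dbReachable ? "ok" : "fail",
        poolSize: pool.totalCount,
        idleCount: pool.idleCount,
        waitingCount: pool.waitingCount,
      },
    },
  };
}

// Example: a saturated pool with queued callers and a failing DB check.
const saturated = buildHealthPayload({ totalCount: 10, idleCount: 0, waitingCount: 7 }, false);
console.log(JSON.stringify(saturated));
```

A non-zero `waitingCount` with `idleCount` at zero is an early sign of saturation that an external monitor could alert on before the DB check itself fails.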

5.3 Process (ongoing)

| # | Action | Owner | Due |
|---|---|---|---|
| 7 | Create "Marketing → Infra" coordination checklist, to be completed before any campaign | Alem | Within 2 weeks |
| 8 | Add DB connection metrics to weekly monitoring review | Alem | Ongoing |
| 9 | Test App Runner restart as a documented runbook step | Platform | Q2 2026 |

6. Systemic Improvements

6.1 Connection Pooling Fix

Current state: Implicit pool, no limits, no timeout.

Target state:

// src/drop-app/src/lib/db.ts (example)
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // Hard cap, well below the RDS db.t4g.micro connection limit
  idleTimeoutMillis: 30000,      // Release idle connections after 30s
  connectionTimeoutMillis: 2000, // Fail fast if no connection is available within 2s
});

When PgBouncer is added, set max higher in the app and let PgBouncer enforce the RDS limit.

6.2 CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "drop-db-connections-high" \
  --alarm-description "RDS DatabaseConnections > 70" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --statistic Average \
  --alarm-actions arn:aws:sns:eu-west-1:324480209768:drop-ops-alerts \
  --region eu-west-1

6.3 Marketing Campaign Checklist (Pre-Launch)

Before any marketing campaign that targets > 100 recipients:

  •  Notify infrastructure (Alem) at least 24h before send
  •  Check current DatabaseConnections baseline in CloudWatch
  •  Verify pool configuration is explicit
  •  Consider sending campaign in batches (< 100/hour) to spread load
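The batching item in the checklist can be planned with simple arithmetic. This sketch assumes one batch per hour and a cap of 99 recipients per batch (to stay under 100/hour); `planBatches` is a hypothetical helper, and the 500-recipient figure comes from the incident timeline.

```typescript
interface Batch {
  sendAtHourOffset: number; // hours after the first batch is sent
  size: number;             // recipients in this batch
}

// Split a campaign into hourly batches of at most maxPerBatch recipients.
function planBatches(totalRecipients: number, maxPerBatch: number): Batch[] {
  if (maxPerBatch <= 0) throw new Error("maxPerBatch must be positive");
  const batches: Batch[] = [];
  let remaining = totalRecipients;
  for (let hour = 0; remaining > 0; hour++) {
    const size = Math.min(maxPerBatch, remaining);
    batches.push({ sendAtHourOffset: hour, size });
    remaining -= size;
  }
  return batches;
}

// The campaign that triggered this incident: ~500 recipients.
const plan = planBatches(500, 99);
console.log(plan);
```

For 500 recipients this yields six batches (five of 99 and one of 5), spreading the send over roughly five hours instead of a single burst.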

7. Lessons Learned

  1. Explicit is always better than implicit for resource limits. Never rely on OS defaults for connection pool configuration in production.
  2. Metrics must lead alerts, not lag them. The health check failure was a lagging indicator. A CloudWatch alarm on DatabaseConnections would have caught this 2 minutes earlier.
  3. Marketing and infrastructure must coordinate. A 500-recipient email burst is a load event that needs infrastructure awareness.
  4. App Runner restart is a fast, reliable mitigation. < 5 minutes RTO is acceptable for this class of issue. Document it as a first-response step in the runbook.
  5. BetterStack + Slack alerting works. < 30 second detection time met our target. No changes needed here.
  6. DB-level connection pooling (PgBouncer) is required for burst tolerance at scale. This should be resolved before public launch.

8. Action Item Tracking

| # | Action | Owner | Due | Status |
|---|---|---|---|---|
| 1 | Configure explicit pg pool limits | Platform | Before next campaign | Pending |
| 2 | CloudWatch alarm on DatabaseConnections > 70 | Alem | Within 1 week | Pending |
| 3 | Global rate limit on BankID initiate | Platform | Before next campaign | Pending |
| 4 | PgBouncer / RDS Proxy | Platform | Before v1.0 | Pending |
| 5 | Pool utilization in /api/health | Platform | Before v1.0 | Pending |
| 6 | Graceful degradation for non-DB routes | Platform | Before v1.0 | Pending |
| 7 | Marketing → Infra coordination checklist | Alem | Within 2 weeks | Pending |
| 8 | DB connection metrics in weekly review | Alem | Ongoing | Ongoing |
| 9 | App Runner restart in DR runbook | Platform | Q2 2026 | Pending |

Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Facilitator | Alem Bašić | | |
| Approver | Alem Bašić | | |