Incident Report

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Closed
Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 0.1 | 2026-02-23 | Platform Architect (AI) | Example incident report (simulated pre-launch scenario) |

1. Incident Overview

This document is filled with a realistic example incident based on Drop's architecture. Use it as a template for future real incidents.

| Field | Value |
| --- | --- |
| Incident ID | INC-2026-001 |
| Severity | P1: Critical (production down) |
| Status | Closed |
| Start Time | 2026-02-20 10:30 UTC |
| End Time | 2026-02-20 10:58 UTC |
| Duration | 28 minutes |
| Services Affected | Drop production (all users) |
| Root Cause | RDS PostgreSQL connection pool exhaustion after spike in concurrent logins |
| Incident Commander | Alem Bašić |
| Reported By | BetterStack (automated Drop Health Check monitor) |
| Technical Lead | {{TECH_LEAD}} |
| Communications Lead | {{COMMS_LEAD}} |
| Declared at | {{START_TIME}} {{TIMEZONE}} |
| Resolved at | {{END_TIME}} {{TIMEZONE}} |
| Total duration | {{DURATION}} |
| Affected service(s) | {{SERVICES}} |
| Environment | Production / Staging |

2. Executive Summary

{{EXECUTIVE_SUMMARY}}

Example: "On {{DATE}}, a database connection pool exhaustion caused the {{SERVICE}} API to return 503 errors for approximately 47 minutes, affecting {{AFFECTED_COUNT}} users and resulting in an estimated {{REVENUE_IMPACT}} in lost transactions. The root cause was a code change in the v{{VERSION}} deployment that introduced N+1 queries under high load."


3. Detection

Detected by: {{DETECTION_METHOD}}
Detected at: {{DETECTION_TIME}}
Lag from start to detection: {{DETECTION_LAG}} minutes
Detecting system: {{DETECTING_SYSTEM}}

Alerting effectiveness:

  •  Alert fired within the expected window (< {{ALERT_SLA}} minutes)
  •  Alert delivered to on-call without delay
  •  Alert contained sufficient context to begin investigation

Improvements to detection identified:

  • {{DETECTION_IMPROVEMENT_1}}

4. Detailed Timeline

Timezone: All times in {{TIMEZONE}}

| Time | Event | Actor | Notes |
| --- | --- | --- | --- |
| {{TIME}} | {{EVENT_1}} | {{ACTOR}} | |
| {{TIME}} | {{EVENT_2}} | System | Alert ID: {{ALERT_ID}} |
| {{TIME}} | {{EVENT_3}} | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_4}} | {{IC}} | |
| {{TIME}} | {{EVENT_5}} | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_6}} | {{ENGINEER}} | |
| {{TIME}} | {{EVENT_7}} | System | |
| {{TIME}} | {{EVENT_8}} | {{IC}} | |

5. Impact Summary

Users Affected

| Metric | Value |
| --- | --- |
| Users affected | All active users (100%, full outage) |
| Transactions blocked | Estimated 3–5 remittances + 2 QR payments |
| Revenue impact | Approx. NOK 6,000–10,000 in blocked transactions |
| SLA impact | 28 minutes downtime → monthly uptime: 99.94% |
| Compliance impact | None: no data loss, audit logs intact |
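The monthly uptime figure in the table can be reproduced with a short calculation. This is a minimal sketch that assumes a 30-day month (43,200 minutes); the function name and the 30-day default are illustrative and do not come from the report.

```python
def monthly_uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Return uptime for one month as a percentage, given total downtime."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


# 28 minutes of downtime in an assumed 30-day month (43,200 minutes).
print(f"{monthly_uptime_percent(28):.2f}%")  # prints 99.94%
```

The result depends on the month length chosen, so an SLA report should state which measurement window it uses.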

Customer-facing behavior: Users saw error messages on all screens. /api/health returned HTTP 503 with "status":"down".


2. Timeline

| Time (UTC) | Event |
| --- | --- |
| 10:30:00 | Drop Health Check monitor on BetterStack detects HTTP 503 |
| 10:30:30 | Slack #drop-ops receives critical alert: "Drop Health Check is DOWN" |
| 10:30:45 | Alem acknowledges alert |
| 10:31:00 | Alem checks App Runner status: service shows RUNNING |
| 10:31:30 | Alem checks /api/health, response: {"status":"down","checks":{"db":{"status":"fail"}}} |
| 10:32:00 | Alem checks CloudWatch logs and sees repeated connection refused errors to RDS |
| 10:32:30 | Hypothesis: RDS connection issue. Alem checks RDS status: shows available |
| 10:33:00 | Alem queries RDS directly: connection succeeds from psql |
| 10:34:00 | New hypothesis: connection pool exhaustion in application |
| 10:35:00 | Alem triggers App Runner restart (aws apprunner start-deployment) |
| 10:38:00 | App Runner deployment completes: service RUNNING |
| 10:38:30 | Health check passes: {"status":"ok"} |
| 10:38:45 | BetterStack sends recovery alert: "Drop Health Check is UP (downtime: 8 min)" |
| 10:39:00 | Alem monitors for 15 minutes to confirm stability |
| 10:54:00 | Confirms stable: error spike cleared |
| 10:58:00 | Incident closed |
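The intervals between key timeline events can be derived directly from the timestamps. The sketch below copies five timestamps from the timeline; the helper name `minutes_between` is illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


# Timestamps copied from the timeline (all UTC, same day).
detected = "10:30:00"
alerted = "10:30:30"
restart_triggered = "10:35:00"
health_ok = "10:38:30"
closed = "10:58:00"

print(minutes_between(detected, alerted))            # 0.5
print(minutes_between(detected, restart_triggered))  # 5.0
print(minutes_between(detected, health_ok))          # 8.5
print(minutes_between(detected, closed))             # 28.0
```

The 8.5 minutes from detection to a passing health check is consistent with the "downtime: 8 min" reported in the BetterStack recovery alert, while the 28-minute incident duration also includes the stability watch.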

3. Root Cause Analysis

What Happened

A burst of concurrent BankID login attempts (triggered by a marketing email sent at 10:28 UTC) created 45+ simultaneous database connections. The Drop application uses application-level connection pooling with a default max of ~85 connections (db.t4g.micro limit). Each /api/auth/bankid/callback request opens a connection for session creation + user upsert — the simultaneous spike exhausted the pool.

When the pool was exhausted, new requests failed with connection refused and the health check's DB query (SELECT 1) also failed, triggering a 503 response.
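The failure mode can be illustrated with a toy model: once every connection is held, even the health check's trivial query cannot get one. This is a simplified sketch, not Drop's code. `BoundedPool`, `PoolExhausted`, and the pool size of 5 are hypothetical and chosen only for illustration.

```python
class PoolExhausted(Exception):
    """Raised when no connection is free."""


class BoundedPool:
    """Toy stand-in for a database connection pool with a hard cap."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.in_use = 0

    def acquire(self) -> None:
        if self.in_use >= self.max_size:
            raise PoolExhausted(f"all {self.max_size} connections in use")
        self.in_use += 1

    def release(self) -> None:
        if self.in_use > 0:
            self.in_use -= 1


def health_check(pool: BoundedPool) -> dict:
    """Mimic a health endpoint that needs one connection for SELECT 1."""
    try:
        pool.acquire()
    except PoolExhausted:
        return {"status": "down", "checks": {"db": {"status": "fail"}}}
    pool.release()
    return {"status": "ok"}


pool = BoundedPool(max_size=5)       # illustrative cap, not Drop's real limit
for _ in range(5):                   # a burst holds every connection at once
    pool.acquire()
print(health_check(pool)["status"])  # down
pool.release()                       # one request finishes
print(health_check(pool)["status"])  # ok
```

Exposing a value such as `in_use` from the health endpoint is one way a pool could report saturation before it fails outright.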

Contributing Factors

  1. No explicit connection pool configured: Drop uses the pg driver without PgBouncer or RDS Proxy. The application-level pool was implicitly bounded by the OS connection limits.
  2. No connection pool metrics: CloudWatch didn't have an alert on DatabaseConnections — the issue wasn't detected until the health check failed.
  3. No circuit breaker: The application did not gracefully degrade (e.g., serve cached data) when DB was unavailable.
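For the third factor, a circuit breaker can be sketched as follows. This is a minimal, single-threaded illustration under assumed thresholds. The class, its parameters, and the cached fallback are hypothetical and not part of Drop's codebase.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the database entirely until the cool-down elapses.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def query_db() -> str:
    raise ConnectionError("connection refused")  # simulated outage


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    print(breaker.call(query_db, lambda: "cached"))  # cached (each time)
print(breaker.opened_at is not None)  # True
```

After the breaker opens, the third call never reaches the database, which relieves pressure on an exhausted pool while users still receive a degraded response.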

Why App Runner restart fixed it

Restarting App Runner recreated the application process and reset all connection pools. With the burst of login requests completed (10 seconds of traffic), new requests arrived at a normal rate and the pool was sufficient.


4. Resolution

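Immediate fix: App Runner restart. Per the timeline, the health check passed again at 10:38:30 UTC, about 8 minutes after detection. The incident was closed at 10:58 UTC, 28 minutes after detection, following a stability watch.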

Permanent fixes required:

  1. Add PgBouncer connection pooler (or RDS Proxy) to limit per-application connections
  2. Add CloudWatch alarm on DatabaseConnections > 70 for db.t4g.micro
  3. Implement connection pool health check with metrics
  4. Add rate limiting on BankID login initiation for burst protection (a limit of 10/min per IP already exists, but there is no global limit across all IPs)
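The global limit in item 4 could take the form of a token bucket shared by all callers. The sketch below is an in-process illustration only. The class name, the rate of 5 per second, and the burst size of 20 are assumptions, and a deployment with several application instances would need a shared store rather than local state.

```python
import time
from typing import Callable


class GlobalRateLimiter:
    """Token bucket shared by all callers, regardless of client IP."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = GlobalRateLimiter(rate_per_sec=5, burst=20)  # illustrative numbers
admitted = sum(limiter.allow() for _ in range(45))     # 45 near-simultaneous logins
print(admitted)
```

With these numbers, roughly the first 20 requests of a simultaneous burst are admitted and the rest are rejected until tokens refill, so the database sees a bounded number of concurrent logins.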

5. Detection Quality

| Aspect | Score | Notes |
| --- | --- | --- |
| Time to detection | Good | BetterStack detected in < 30 seconds |
| Alert received | Good | Slack alert within 30 seconds |
| Time to diagnose | Fair | 4 minutes; hypothesis took 2 iterations |
| Time to resolve | Good | 5 minutes (App Runner restart) |
| Post-restore verification | Good | Monitored for 15 minutes before closing |

Data Impact

| Type | Assessment |
| --- | --- |
| Data loss | {{DATA_LOSS}} |
| Data corruption | {{DATA_CORRUPTION}} |
| Data exposure | {{DATA_EXPOSURE}} |
| Verification method | {{VERIFICATION}} |

Financial Impact

| Category | Amount | Notes |
| --- | --- | --- |
| Lost transactions | ${{AMOUNT}} | {{TRANSACTION_COUNT}} failed transactions |
| SLA credits | ${{AMOUNT}} | Per SLA contract |
| Operational cost | ${{AMOUNT}} | Engineering hours to resolve |
| Total estimated | ${{TOTAL}} | |

SLA Breach Assessment

| SLA Metric | Target | Actual | Breach |
| --- | --- | --- | --- |
| Uptime | {{UPTIME_SLA}}% | {{ACTUAL_UPTIME}}% | {{BREACH}} |
| Response time (P99) | < {{P99_SLA}}ms | {{P99_ACTUAL}}ms | {{BREACH}} |
| MTTR | < {{MTTR_SLA}} | {{MTTR_ACTUAL}} | {{BREACH}} |

6. Root Cause Analysis

5 Whys

| Why # | Question | Answer |
| --- | --- | --- |
| Why 1 | Why did users see errors? | {{ANSWER_1}} |
| Why 2 | Why was the API returning 503? | {{ANSWER_2}} |
| Why 3 | Why was the connection pool exhausted? | {{ANSWER_3}} |
| Why 4 | Why was the N+1 query introduced? | {{ANSWER_4}} |
| Why 5 | Why did code review miss it? | {{ANSWER_5}} |

Root cause: {{ROOT_CAUSE}}

Contributing Factors

  1. {{FACTOR_1}}
  2. {{FACTOR_2}}
  3. {{FACTOR_3}}

Trigger Event

What triggered this specific incident now: {{TRIGGER}}


7. Resolution Steps

| Step | Time | Action | Result |
| --- | --- | --- | --- |
| 1 | {{TIME}} | {{ACTION_1}} | {{RESULT_1}} |
| 2 | {{TIME}} | {{ACTION_2}} | {{RESULT_2}} |
| 3 | {{TIME}} | {{ACTION_3}} | {{RESULT_3}} |

Resolution commands (for runbook):

# {{RESOLUTION_DESCRIPTION}}
{{RESOLUTION_COMMAND}}

8. What Went Well

  1. {{WENT_WELL_1}}
  2. {{WENT_WELL_2}}
  3. {{WENT_WELL_3}}

9. What Went Wrong

  1. {{WENT_WRONG_1}}
  2. {{WENT_WRONG_2}}
  3. {{WENT_WRONG_3}}

10. Action Items

| # | Action | Owner | Priority | Due Date | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Configure PgBouncer or RDS Proxy for connection pooling | Alem / Platform | P1 | Before v1.0 launch | Open |
| 2 | Add CloudWatch alarm: DatabaseConnections > 70 → Slack alert | Alem | P1 | Within 1 week | Open |
| 3 | Add connection pool metrics to health check endpoint | Platform | P2 | Before v1.0 launch | Open |
| 4 | Document burst rate limiting strategy for marketing campaigns | Alem | P2 | Within 2 weeks | Open |
| 5 | Test DR runbook scenario for RDS connection failures | Platform | P3 | Q2 2026 | Open |
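Action item 2 could be expressed as parameters for the CloudWatch `put_metric_alarm` API. The sketch below only builds the parameter dictionary. The instance identifier, the SNS topic ARN, the choice of the `Maximum` statistic, and the 60-second period are assumptions, and forwarding the SNS notification to Slack would need a separate integration that is not shown.

```python
def database_connections_alarm(db_instance_id: str, sns_topic_arn: str,
                               threshold: int = 70) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{db_instance_id}-database-connections-high",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = database_connections_alarm(
    "drop-prod-db",  # hypothetical instance identifier
    "arn:aws:sns:eu-north-1:123456789012:drop-ops-alerts",  # hypothetical topic
)
# With boto3 installed and credentials configured, this would be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["MetricName"], params["ComparisonOperator"], params["Threshold"])
```

Keeping the parameters in a plain function makes the threshold reviewable and testable without AWS access.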

11. Lessons Learned

  • Connection pool exhaustion is a real risk for synchronous monoliths under burst load. Add PgBouncer before launch.
  • CloudWatch DatabaseConnections metric needs an alarm. This should have alerted before the health check failed.
  • App Runner restart is fast and reliable. 5 minutes is an acceptable RTO for this class of issue.
  • BetterStack detection was excellent: 30 seconds from failure to alert is within our target.
  • Marketing campaigns need infrastructure coordination. A burst from a marketing email should trigger a pre-deployment review of connection capacity.

| Incident ID | Date | Similarity | Resolved |
| --- | --- | --- | --- |
| INC-{{ID}} | {{DATE}} | {{DESCRIPTION}} | Yes / No |

13. Communication Log

| Time | Channel | Message Summary | Audience | Sent By |
| --- | --- | --- | --- | --- |
| {{TIME}} | Status page | "Investigating reports of elevated errors" | All users | {{SENDER}} |
| {{TIME}} | Status page | "Identified root cause, applying fix" | All users | {{SENDER}} |
| {{TIME}} | Status page | "Incident resolved, all systems normal" | All users | {{SENDER}} |
| {{TIME}} | Email | Customer notification for SLA breach | Affected customers | {{SENDER}} |


Approval

| Role | Name | Date | Signature |
| --- | --- | --- | --- |
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |