Disaster Recovery Plan

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)

Document History

Version | Date | Author | Changes
0.1 | 2026-02-23 | Platform Architect (AI) | Compiled from DR-RUNBOOK.md + infrastructure analysis

1. Business Continuity Overview

This plan documents the procedures to recover Drop services following a disaster event (infrastructure failure, data corruption, security breach, or total region outage). Drop is a PSD2 pass-through payment application: it never holds customer funds, so there is no risk of customer money being lost due to a Drop infrastructure failure. The primary recovery concerns are service availability and data integrity.

Plan Owner: Alem Bašić (CEO), [email protected], +47 40 47 42 51
Plan Reviewer: John (AI Director), Slack #drop-alerts
Last Tested: TBD (initial DR drill not yet conducted)
Next Scheduled Test: Q2 2026 (App Runner restart + RDS snapshot restore)

Disaster types covered:

  • App Runner service failure or crash
  • RDS database failure or data corruption
  • Security incident (unauthorized access, credential compromise)
  • Full region outage (eu-west-1)
  • Catastrophic application failure (bad deployment)

2. RPO / RTO Targets Per Service Tier

Tier | Description | RPO | RTO | Drop Services
Tier 1 (Critical) | Core user-facing services | 5 minutes (PITR) | 30 minutes | Auth (BankID), transactions (remittance + QR), health endpoint
Tier 2 (Important) | Supporting features | 24 hours (snapshot) | < 4 hours | Merchant dashboard, notifications, transaction history
Tier 3 (Standard) | Background / admin | < 24 hours | < 24 hours | Audit logs, AML alerts, complaint records

3. Service Tier Classification

Service | Tier | Justification
BankID authentication | 1 | Users cannot transact without login
Remittance API (/api/transactions/remittance) | 1 | Core revenue feature
QR payment API (/api/transactions/qr-payment) | 1 | Core revenue feature
Bank account read (AISP) | 1 | Required for payment initiation
Health endpoint (/api/health) | 1 | Monitoring dependency
Transaction history | 2 | UX degraded, no blocking issue
Merchant dashboard | 2 | Merchant ops impacted
Notifications | 2 | UX degraded
Audit log | 3 | Compliance retained, not real-time
AML alerts | 3 | Reviewed periodically, not real-time

4. Infrastructure Overview

Production

  • Service: AWS App Runner
  • Region: eu-west-1 (Ireland)
  • Service ARN: arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec
  • Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com
  • ECR Repository: 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web

Database (Production)

  • RDS Instance: drop-db
  • Endpoint: drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432
  • DB Name: dropapp
  • DB User: dropuser
  • Backup Strategy: Automated daily snapshots, 7-day retention
  • Backup Window: 23:24–23:54 UTC daily
  • PITR: Enabled (5-minute granularity)

Staging

  • Platform: Fly.io, region arn (Stockholm)
  • App Name: drop-staging
  • Database: SQLite ephemeral volume — no automated backup

5. Backup Strategy

Production RDS PostgreSQL

  • Automated Snapshots: Daily at 23:24 UTC
  • Retention Period: 7 days
  • Point-in-Time Recovery (PITR): Enabled — any point within last 7 days, 5-minute granularity
  • Manual Snapshots: Created before every major deployment or migration
  • Snapshot verification: Run quarterly

ECR Docker Images

  • All pushed images retained in ECR repository
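  • Rollback capability: Redeploy any previous image tag by pointing the App Runner service at that tag and deploying it (aws apprunner start-deployment on its own redeploys the currently configured image)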
  • Lifecycle policy: Delete untagged images after 7 days, keep last 10 tagged releases

Staging (Fly.io)

  • No automated backup — ephemeral SQLite storage
  • Manual backup procedure:
    flyctl ssh console -a drop-staging
    sqlite3 /app/data/drop.db ".backup /app/data/backup-$(date +%Y%m%d).db"
    

6. Recovery Procedures

Scenario 1: App Runner Service Down

Symptoms:

  • BetterStack alert: Drop Health Check is DOWN
  • Slack #drop-ops: critical alert
  • App Runner service status not RUNNING

Investigation:

# Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow --since 10m --region eu-west-1

# Check deployment history
aws apprunner list-operations \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

Recovery Option A: Restart (preferred)

aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

RTO: 5–10 minutes | RPO: 0 (no data loss)

Recovery Option B: Rollback to previous image

# List recent ECR images
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:]'

# Update App Runner image tag via console or update the deployment workflow
# Then trigger new deployment

RTO: 15–20 minutes | RPO: 0 (no data loss)
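The rollback step above leaves the image-tag change to the console. As a hedged sketch, the same change could probably be scripted with `aws apprunner update-service`. The helper below only prints the command for review rather than running it. The function name `rollback_command` and the example tag are illustrative, and it is an assumption that replacing the image identifier this way preserves the service's existing image configuration (port, environment variables) and ECR access role, so verify that before relying on it.

```shell
#!/bin/sh
# Sketch only: prints the rollback command for review instead of executing it.
# SERVICE_ARN and ECR_REPO are the values listed in this plan.
SERVICE_ARN="arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec"
ECR_REPO="324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web"

# rollback_command TAG (hypothetical helper)
# Prints an `aws apprunner update-service` call that points the service at TAG.
rollback_command() {
  tag="$1"
  if [ -z "$tag" ]; then
    echo "usage: rollback_command <image-tag>" >&2
    return 1
  fi
  # JSON form of --source-configuration avoids shorthand quoting issues.
  source_config=$(printf '{"ImageRepository":{"ImageIdentifier":"%s:%s","ImageRepositoryType":"ECR"}}' \
    "$ECR_REPO" "$tag")
  printf "aws apprunner update-service --service-arn %s --source-configuration '%s' --region eu-west-1\n" \
    "$SERVICE_ARN" "$source_config"
}
```

Calling `rollback_command <tag>` with a tag taken from the `aws ecr describe-images` output prints the full command, which can be reviewed and then pasted into a shell.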


Scenario 2: RDS Database Failure

Symptoms:

  • /api/health returns {"status":"down"} (HTTP 503)
  • BetterStack + Slack alerts fire
  • App Runner logs show connection timeout to RDS
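The first symptom above can be checked from a terminal before opening the AWS console. The sketch below classifies a health response from its HTTP code and body. Only the `{"status":"down"}` / HTTP 503 pairing comes from this plan. Treating any 2xx as healthy, and the function name `classify_health`, are assumptions made for illustration.

```shell
#!/bin/sh
# classify_health HTTP_CODE BODY (hypothetical helper)
# Prints one of: healthy, down, unavailable, unknown.
classify_health() {
  code="$1"
  body="$2"
  case "$code" in
    2??) echo "healthy" ;;                      # assumption: any 2xx means healthy
    503)
      case "$body" in
        *'"status":"down"'*) echo "down" ;;     # the documented RDS-failure response
        *) echo "unavailable" ;;                # 503 without the documented body
      esac ;;
    *) echo "unknown" ;;
  esac
}

# Example use against the production URL listed in this plan:
#   code=$(curl -s -o /tmp/health.json -w '%{http_code}' \
#     https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health)
#   classify_health "$code" "$(cat /tmp/health.json)"
```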

Investigation:

# Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBInstances[0].DBInstanceStatus'

# Check available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'

# Check events
aws rds describe-events \
  --source-identifier drop-db \
  --source-type db-instance \
  --region eu-west-1 --duration 60

Recovery Option A: Restore from automated snapshot

LATEST=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-restored \
  --db-snapshot-identifier $LATEST \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

aws rds wait db-instance-available --db-instance-identifier drop-db-restored --region eu-west-1

# Update DATABASE_URL in App Runner environment with new endpoint
NEW_EP=$(aws rds describe-db-instances --db-instance-identifier drop-db-restored \
  --query 'DBInstances[0].Endpoint.Address' --output text --region eu-west-1)

RTO: 30 minutes | RPO: 24 hours (last snapshot)
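The snapshot restore above ends by capturing the new endpoint in `NEW_EP` but does not show the resulting connection string. As a minimal sketch, assuming `DATABASE_URL` uses the common `postgresql://user:password@host:port/dbname` form, the value could be assembled as below. The function name `build_database_url` is hypothetical, and the password is passed in by the caller. A password containing characters such as `@` or `/` would need URL-encoding, which this sketch does not do.

```shell
#!/bin/sh
# build_database_url HOST PASSWORD (hypothetical helper)
# Uses the DB user, port and DB name documented in this plan.
build_database_url() {
  host="$1"
  password="$2"
  if [ -z "$host" ] || [ -z "$password" ]; then
    echo "usage: build_database_url <host> <password>" >&2
    return 1
  fi
  printf 'postgresql://%s:%s@%s:%s/%s\n' "dropuser" "$password" "$host" "5432" "dropapp"
}

# Example: build_database_url "$NEW_EP" "$DB_PASSWORD"
# (DB_PASSWORD is an assumed variable holding the current database password.)
```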

Recovery Option B: Point-in-Time Recovery (PITR)

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier drop-db \
  --target-db-instance-identifier drop-db-pitr \
  --restore-time $(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ') \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

RTO: 30 minutes | RPO: 5 minutes (PITR granularity)
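The `--restore-time` value in the PITR command above relies on `date -d '1 hour ago'`, which is GNU syntax and typically fails on macOS/BSD. The sketch below tries the GNU form first and falls back to the BSD `-v` form. The function name `restore_time` is illustrative. If the goal is simply the most recent restorable point, `--use-latest-restorable-time` can be passed instead of `--restore-time`.

```shell
#!/bin/sh
# restore_time [HOURS_AGO] (hypothetical helper)
# Prints a UTC timestamp HOURS_AGO hours in the past (default 1),
# formatted for `aws rds restore-db-instance-to-point-in-time --restore-time`.
restore_time() {
  hours="${1:-1}"
  # GNU date (Linux)
  if date -u -d "${hours} hour ago" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null; then
    return 0
  fi
  # BSD date (macOS)
  date -u -v-"${hours}"H '+%Y-%m-%dT%H:%M:%SZ'
}

# Example: --restore-time "$(restore_time 1)"
```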


Scenario 3: Data Corruption

Symptoms:

  • Application reports data inconsistencies
  • User-reported missing or incorrect transactions
  • Audit log shows unexpected DELETE/UPDATE operations

Investigation:

# Check for soft-deleted users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"

# Check recent suspicious audit log entries
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM audit_log WHERE action IN ('DELETE','UPDATE') ORDER BY timestamp DESC LIMIT 50;"

Recovery: Selective restore from clean snapshot (see Scenario 2 recovery steps) + merge affected tables.

RTO: 1–2 hours (selective) | RPO: Depends on snapshot age


Scenario 4: Full Region Outage (eu-west-1)

Current State: No automated cross-region failover. Manual failover to eu-north-1 (Stockholm) required.

Investigation:

  • Check AWS Health Dashboard: https://health.aws.amazon.com/health/status
  • Verify RDS snapshot accessibility from eu-west-1

Manual Failover to eu-north-1:

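# 1. Copy latest RDS snapshot to eu-north-1
LATEST=$(aws rds describe-db-snapshots --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

# Compute the target name once so the copy and the restore use the same identifier
# (calling date twice could yield different names if the copy runs past midnight)
TARGET_SNAPSHOT="drop-db-failover-$(date +%Y%m%d)"

# If the snapshot is encrypted, also pass --source-region eu-west-1 and --kms-key-id <key in eu-north-1>
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST" \
  --target-db-snapshot-identifier "$TARGET_SNAPSHOT" \
  --region eu-north-1

# Wait until the copied snapshot is available before restoring from it
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$TARGET_SNAPSHOT" \
  --region eu-north-1

# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-failover \
  --db-snapshot-identifier "$TARGET_SNAPSHOT" \
  --db-instance-class db.t4g.micro \
  --region eu-north-1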

# 3. Create ECR repository in eu-north-1 and push latest image
# 4. Create App Runner service in eu-north-1
# 5. Update DNS when getdrop.no is active

RTO: 2–4 hours (manual) | RPO: Up to 24 hours (last snapshot)


Scenario 5: Security Incident

Symptoms:

  • Suspicious audit log entries
  • Unauthorized access attempts
  • AML alerts triggered for unusual activity
  • Sumsub KYC bypass attempt

Investigation:

# Check audit log for recent suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"

# Check AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"

# Check CloudTrail for AWS API activity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
  --region eu-west-1 --max-results 50

Containment:

# 1. Revoke compromised sessions immediately
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"

# 2. Disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"

# 3. Rotate database credentials
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password <new-password> \
  --apply-immediately --region eu-west-1

# 4. Take forensic snapshot
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

# 5. Rotate JWT_SECRET (invalidates all sessions)
# Generate: openssl rand -base64 48
# Update in AWS Secrets Manager + redeploy App Runner
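Step 5 above can be sketched as a small helper that produces the new secret. `openssl rand -base64 48` is the command given in this plan. The `/dev/urandom` fallback, the function name `generate_jwt_secret`, and the secret name in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# generate_jwt_secret (hypothetical helper)
# Prints 48 random bytes as a single-line base64 string (64 characters).
generate_jwt_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 48 | tr -d '\n'
  else
    # Fallback when openssl is not installed (assumption, not from the runbook)
    head -c 48 /dev/urandom | base64 | tr -d '\n'
  fi
}

# Example use ("drop/jwt-secret" is a placeholder secret name, not taken from this plan):
#   aws secretsmanager put-secret-value \
#     --secret-id drop/jwt-secret \
#     --secret-string "$(generate_jwt_secret)" \
#     --region eu-west-1
# Then redeploy App Runner so the service picks up the new value.
```

Because rotating `JWT_SECRET` invalidates all sessions, it is worth taking the forensic snapshot first, as the step order above already does.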

Post-containment:

  1. Analyze audit logs — identify scope of breach
  2. File STR (Suspicious Transaction Report) if financial crime suspected
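  3. Notify Datatilsynet (the Norwegian Data Protection Authority) within 72 hours if user PII is compromised (GDPR Art. 33); where PSD2 major-incident reporting applies, also notify Finanstilsynet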
  4. User communication if required by GDPR Art. 34

RTO: Immediate containment (session revocation) / 24–48 hours full investigation


7. RTO / RPO Summary

Scenario | RTO | RPO
App Runner restart | 5–10 minutes | 0 (no data loss)
App Runner rollback | 15–20 minutes | 0 (no data loss)
RDS snapshot restore | 30 minutes | 24 hours (daily snapshot)
RDS PITR restore | 30 minutes | 5 minutes
Full region failover (eu-west-1) | 2–4 hours | 24 hours
Security incident containment | Immediate (session revocation) | 0 (logs preserved)

8. Contacts

Role | Name | Contact
Primary Incident Owner | Alem Bašić (CEO) | [email protected] / +47 40 47 42 51
AI Operations | John (AI Director) | Slack #drop-alerts
AWS Support | AWS | Premium support via AWS Console
Fly.io Support (staging) | Fly.io | [email protected]
Sumsub Support | Sumsub | [email protected]
BankID Support | Vipps MobilePay (BankID operator) | Per contract

9. Runbook Maintenance

Review Schedule

  • Quarterly review: Verify all ARNs, endpoints, and commands are still valid
  • After any incident: Update with lessons learned
  • Before major releases: Verify backup and rollback procedures work

Test Schedule

  • Q2 2026: Full DR drill (App Runner restart + RDS snapshot restore to temp instance)

  • Quarterly: App Runner rollback test
  • Monthly: Verify automated RDS snapshot creation
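The monthly snapshot check in the schedule above lends itself to a small script. As a rough sketch, the helper below decides whether the newest automated snapshot is recent enough, given its creation time and the current time as epoch seconds. The function name `snapshot_is_fresh` and the 26-hour default (one daily cycle plus margin) are assumptions. Converting `SnapshotCreateTime` from `aws rds describe-db-snapshots` into epoch seconds is left to the caller.

```shell
#!/bin/sh
# snapshot_is_fresh SNAPSHOT_EPOCH NOW_EPOCH [MAX_AGE_HOURS] (hypothetical helper)
# Returns 0 if the snapshot is no older than MAX_AGE_HOURS (default 26), else 1.
snapshot_is_fresh() {
  snapshot_epoch="$1"
  now_epoch="$2"
  max_age_hours="${3:-26}"
  age_seconds=$((now_epoch - snapshot_epoch))
  # A negative age means the inputs are inconsistent, so treat it as not fresh.
  [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -le $((max_age_hours * 3600)) ]
}

# Example (GNU date; CREATED holds the SnapshotCreateTime string):
#   snapshot_is_fresh "$(date -u -d "$CREATED" +%s)" "$(date -u +%s)" \
#     || echo "WARNING: latest automated snapshot is stale"
```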

Change Log

Date | Change | Author
2026-02-23 | Initial version from DR runbook + infra analysis | Platform Architect (AI)


Approval

Role | Name | Date | Signature
Author | Platform Architect (AI) | 2026-02-23 |
Reviewer | | |
Approver | Alem Bašić | |