Disaster Recovery Plan
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 |
1. Business Continuity Overview
This plan documents the procedures to recover Drop services following a disaster event (infrastructure failure, data corruption, security breach, or total region outage). Drop is a PSD2 pass-through payment application: it never holds customer funds, so a Drop infrastructure failure cannot cause customer money to be lost. The primary recovery concerns are service availability and data integrity.
Plan Owner: Alem Bašić (CEO), [email protected], +47 40 47 42 51
Plan Reviewer: John (AI Director), Slack #drop-alerts
Last Tested: TBD (initial DR drill not yet conducted)
Next Scheduled Test: Q2 2026 (App Runner restart + RDS snapshot restore)
Disaster types covered:
- App Runner service failure or crash
- RDS database failure or data corruption
- Security incident (unauthorized access, data breach)
- Full region outage (eu-west-1)
- Catastrophic application failure (bad deployment)
2. RPO / RTO Targets Per Service Tier
| Tier | Description | RPO | RTO | ||
|---|---|---|---|---|---|
| Tier 1 — Critical | Core user-facing |
||||
| Tier 2 — Important | Supporting |
1 hour | Notifications, |
||
| Tier 3 — Standard | < 24 hours | < 24 hours |
3. Service Tier Classification
| Service | Tier | Rationale | |
|---|---|---|---|
| Tier 1 | Core |
||
|
Tier 1 | Authentication | |
|
Supporting | ||
| Admin |
|||
|
Tier 1 | All services depend on it | |
| Tier 2 | User |
4. Backup Strategy
4.1 Database Backups
| Database | Backup Type | Frequency | Retention | Location | Verified |
|---|---|---|---|---|---|
| {{DB_PRIMARY}} | Automated snapshot | Daily | 30 days | {{BACKUP_LOCATION}} | Monthly |
| 7 |
{{BACKUP_LOCATION}} | Monthly | |||
| — | Rebuilt from primary | — |
Automated backup tool: {{BACKUP_TOOL}}
Backup encryption: AES-256, key managed in {{KMS_TOOL}}
Cross-region copy: {{CROSS_REGION}}
4.2 File / Object Storage Backups
| Storage | Backup Method | Frequency | Retention | DR Copy |
|---|---|---|---|---|
| {{S3_BUCKET}} | S3 versioning + replication | Continuous | {{RETENTION}} | {{DR_BUCKET}} |
| 30 |
Cross-region |
4.3 Configuration Backups
| Config | Backup Method | Location | Frequency |
|---|---|---|---|
| IaC (Terraform) | Git repository | {{GIT_REPO}} | On change |
| Application config | Git repository | {{GIT_REPO}} | On change |
| Secrets | Secrets manager replication | {{SECRETS_BACKUP}} | Real-time |
| Weekly | |||
| TLS certificates | Secrets manager | {{CERTS_BACKUP}} | On renewal |
4.4 Backup Testing Schedule
| Backup Type | Test Frequency | Last Test | Result | Tester |
|---|---|---|---|---|
| Database full restore | Monthly | {{DATE}} | {{RESULT}} | {{TESTER}} |
| Point-in-time restore | Quarterly | {{DATE}} | {{RESULT}} | {{TESTER}} |
| Object storage restore | Quarterly | {{DATE}} | {{RESULT}} | {{TESTER}} |
| Full DR failover drill | Bi-annually | {{DATE}} | {{RESULT}} | {{TESTER}} |
4. Infrastructure Overview
Production
- Service: AWS App Runner
- Region: eu-west-1 (Ireland)
- Service ARN: arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec
- Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com
- ECR Repository: 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web
Database (Production)
- RDS Instance: drop-db
- Endpoint: drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432
- DB Name: dropapp
- DB User: dropuser
- Backup Strategy: Automated daily snapshots, 7-day retention
- Backup Window: 23:24–23:54 UTC daily
- PITR: Enabled (5-minute granularity)
Staging
- Platform: Fly.io, region arn (Stockholm)
- App Name: drop-staging
- Database: SQLite on an ephemeral volume, no automated backup
5. Backup Strategy
Production RDS PostgreSQL
- Automated Snapshots: Daily at 23:24 UTC
- Retention Period: 7 days
- Point-in-Time Recovery (PITR): Enabled, any point within the last 7 days, 5-minute granularity
- Manual Snapshots: Created before every major deployment or migration
- Snapshot verification: Run quarterly
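The check that automated snapshots are still being created can be scripted. The following is a minimal sketch, assuming GNU date (Linux); the helper name `snapshot_is_fresh` and the 26-hour threshold are illustrative assumptions, not part of the existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when a snapshot timestamp is no older
# than the allowed age in hours, fails (exit 1) when it is older, and returns 2
# when the timestamp cannot be parsed.
# Assumes GNU date, which parses ISO 8601 timestamps passed to -d.
snapshot_is_fresh() {
  local created="$1" max_age_hours="$2"
  local created_epoch now_epoch
  created_epoch=$(date -u -d "$created" +%s) || return 2
  now_epoch=$(date -u +%s)
  [ $(( now_epoch - created_epoch )) -le $(( max_age_hours * 3600 )) ]
}

# Example wiring (needs AWS credentials, so it is left commented out):
# created=$(aws rds describe-db-snapshots \
#   --db-instance-identifier drop-db --region eu-west-1 \
#   --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].SnapshotCreateTime' \
#   --output text)
# snapshot_is_fresh "$created" 26 || echo "ALERT: newest automated snapshot is older than 26h"
```

A threshold slightly above 24 hours leaves room for the daily backup window to drift without raising a false alarm.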
ECR Docker Images
- All pushed images retained in ECR repository
- Rollback capability: Redeploy any previous image tag via aws apprunner start-deployment
- Lifecycle policy: Delete untagged images after 7 days, keep last 10 tagged releases
Staging (Fly.io)
- No automated backup (ephemeral SQLite storage)
- Manual backup procedure:
flyctl ssh console -a drop-staging
sqlite3 /app/data/drop.db ".backup /app/data/backup-$(date +%Y%m%d).db"
6. Recovery Procedures
Scenario 1: App Runner Service Down
Symptoms:
- BetterStack alert: Drop Health Check is DOWN
- Slack #drop-ops: critical alert
- App Runner service status not RUNNING
Investigation:
# Check service status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
# View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow --since 10m --region eu-west-1
# Check deployment history
aws apprunner list-operations \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
Recovery Option A: Restart (preferred)
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
RTO: 5–10 minutes | RPO: 0 (no data loss)
Recovery Option B: Rollback to previous image
# List recent ECR images
aws ecr describe-images \
--repository-name drop-web \
--region eu-west-1 \
--query 'sort_by(imageDetails,&imagePushedAt)[-5:]'
# Update App Runner image tag via console or update the deployment workflow
# Then trigger new deployment
RTO: 15–20 minutes | RPO: 0 (no data loss)
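The image rollback can also be done from the CLI instead of the console. The sketch below builds the `--source-configuration` JSON for `aws apprunner update-service`; the helper name `image_source_config` and the `PREVIOUS_TAG` variable are assumptions for illustration. It omits `ImageConfiguration` (port, environment variables) and `AuthenticationConfiguration`, so confirm whether the service needs those included before relying on it.

```shell
#!/usr/bin/env bash
# ECR repository taken from the Infrastructure Overview.
ECR_REPO="324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web"

# Hypothetical helper: prints the source configuration JSON that points
# App Runner at a specific image tag in the ECR repository.
image_source_config() {
  local tag="$1"
  printf '{"ImageRepository":{"ImageIdentifier":"%s:%s","ImageRepositoryType":"ECR"}}\n' \
    "$ECR_REPO" "$tag"
}

# Example rollback (commented out; PREVIOUS_TAG is a placeholder variable):
# aws apprunner update-service \
#   --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
#   --source-configuration "$(image_source_config "$PREVIOUS_TAG")" \
#   --region eu-west-1
```

Printing the JSON first makes it easy to review the exact image identifier before the update is applied.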
Scenario 2: RDS Database Failure
Symptoms:
- /api/health returns {"status":"down"} (HTTP 503)
- BetterStack + Slack alerts fire
- App Runner logs show connection timeout to RDS
Investigation:
# Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBInstances[0].DBInstanceStatus'
# Check available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'
# Check events
aws rds describe-events \
--source-identifier drop-db \
--source-type db-instance \
--region eu-west-1 --duration 60
Recovery Option A: Restore from automated snapshot
LATEST=$(aws rds describe-db-snapshots \
--db-instance-identifier drop-db --region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-restored \
--db-snapshot-identifier $LATEST \
--db-instance-class db.t4g.micro \
--region eu-west-1
aws rds wait db-instance-available --db-instance-identifier drop-db-restored --region eu-west-1
# Look up the restored instance's endpoint, then update DATABASE_URL
# in the App Runner environment to point at it
NEW_EP=$(aws rds describe-db-instances --db-instance-identifier drop-db-restored \
--query 'DBInstances[0].Endpoint.Address' --output text --region eu-west-1)
echo "Restored endpoint: $NEW_EP"
RTO: 30 minutes | RPO: 24 hours (last snapshot)
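Once the new endpoint is known, the connection string has to be rebuilt. A minimal sketch follows, assuming DATABASE_URL uses the standard `postgresql://user:password@host:port/dbname` form; the runbook does not state the exact format, and the helper name `build_database_url` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds a PostgreSQL connection URL for the restored
# instance, using the DB user, port and DB name from the Infrastructure Overview.
# Assumption: the password contains no characters that need percent-encoding.
build_database_url() {
  local host="$1" password="$2"
  printf 'postgresql://%s:%s@%s:%s/%s\n' "dropuser" "$password" "$host" "5432" "dropapp"
}

# Example (commented out; NEW_EP comes from the describe-db-instances call,
# DB_PASSWORD is a placeholder for the value held in the secrets store):
# build_database_url "$NEW_EP" "$DB_PASSWORD"
```

A restored instance does not automatically inherit the original's security groups, so confirm App Runner can reach the new endpoint before switching traffic.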
Recovery Option B: Point-in-Time Recovery (PITR)
# Note: 'date -d' is GNU date syntax (Linux); on macOS/BSD use: date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ'
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier drop-db \
--target-db-instance-identifier drop-db-pitr \
--restore-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
--db-instance-class db.t4g.micro \
--region eu-west-1
RTO: 30 minutes | RPO: 5 minutes (PITR granularity)
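The restore timestamp depends on which `date` implementation is installed. The sketch below tries GNU syntax first and falls back to BSD/macOS syntax; the helper name `restore_time_utc` is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a UTC timestamp N minutes in the past in the
# format expected by --restore-time (YYYY-MM-DDTHH:MM:SSZ).
restore_time_utc() {
  local minutes_ago="$1"
  # GNU date (Linux) understands relative expressions with -d.
  if date -u -d "${minutes_ago} minutes ago" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null; then
    return 0
  fi
  # BSD/macOS date uses -v to adjust the current time.
  date -u -v "-${minutes_ago}M" '+%Y-%m-%dT%H:%M:%SZ'
}

# Example (commented out):
# aws rds restore-db-instance-to-point-in-time \
#   --source-db-instance-identifier drop-db \
#   --target-db-instance-identifier drop-db-pitr \
#   --restore-time "$(restore_time_utc 60)" \
#   --db-instance-class db.t4g.micro \
#   --region eu-west-1
```

When the goal is simply the most recent recoverable state, the `--use-latest-restorable-time` flag can replace `--restore-time` altogether.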
Scenario 3: Data Corruption
Symptoms:
- Application reports data inconsistencies
- User-reported missing or incorrect transactions
- Audit log shows unexpected DELETE/UPDATE operations
Investigation:
# Check for soft-deleted users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"
# Check recent suspicious audit log entries
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT * FROM audit_log WHERE action IN ('DELETE','UPDATE') ORDER BY timestamp DESC LIMIT 50;"
Recovery: Selective restore from clean snapshot (see Scenario 2 recovery steps) + merge affected tables.
RTO: 1–2 hours (selective) | RPO: Depends on snapshot age
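For the selective restore, one possible approach is to dump only the affected tables from the restored instance and review the data before merging it into production. The sketch below prints the `pg_dump` command without running it; the helper name `selective_dump_cmd` is hypothetical, and the table names are taken from the queries in this scenario.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not execute) a pg_dump command that
# extracts the data of selected tables from a restored instance.
selective_dump_cmd() {
  local host="$1"; shift
  local cmd="pg_dump -h $host -U dropuser -d dropapp --data-only"
  local table
  for table in "$@"; do
    cmd="$cmd --table=$table"
  done
  printf '%s\n' "$cmd"
}

# Example (commented out; RESTORED_EP is the endpoint of the restored instance):
# selective_dump_cmd "$RESTORED_EP" users audit_log
```

Loading the dump into a scratch schema and diffing it against production first is safer than restoring over live tables.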
Scenario 4: Full Region Outage (eu-west-1)
Current State: No automated cross-region failover. Manual failover to eu-north-1 (Stockholm) required.
Investigation:
- Check AWS Health Dashboard: https://health.aws.amazon.com/health/status
- Verify RDS snapshot accessibility from eu-west-1
Manual Failover to eu-north-1:
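# 1. Copy latest RDS snapshot to eu-north-1
LATEST=$(aws rds describe-db-snapshots --db-instance-identifier drop-db --region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
# Compute the target name once so steps 1 and 2 agree even across midnight
FAILOVER_SNAPSHOT=drop-db-failover-$(date +%Y%m%d)
# If the snapshot is encrypted, also pass --source-region eu-west-1 and
# --kms-key-id with a KMS key that exists in eu-north-1
aws rds copy-db-snapshot \
--source-db-snapshot-identifier "arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST" \
--target-db-snapshot-identifier "$FAILOVER_SNAPSHOT" \
--region eu-north-1
# Wait for the copy to finish; a restore fails while the snapshot is still copying
aws rds wait db-snapshot-available \
--db-snapshot-identifier "$FAILOVER_SNAPSHOT" \
--region eu-north-1
# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-failover \
--db-snapshot-identifier "$FAILOVER_SNAPSHOT" \
--db-instance-class db.t4g.micro \
--region eu-north-1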
# 3. Create ECR repository in eu-north-1 and push latest image
# 4. Create App Runner service in eu-north-1
# 5. Update DNS when getdrop.no is active
RTO: 2–4 hours (manual) | RPO: Up to 24 hours (last snapshot)
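Step 3 of the manual failover can be sketched as follows. The helper `ecr_uri_in_region` is a hypothetical name, and `TAG` is a placeholder for the release being redeployed. If eu-west-1 is fully unavailable, the image cannot be pulled from its registry, so this only works with a locally cached image or one replicated in advance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrites an ECR image URI so it points at the same
# repository path in a different region.
ecr_uri_in_region() {
  local uri="$1" region="$2"
  printf '%s\n' "$uri" \
    | sed -E "s/\.ecr\.[a-z0-9-]+\.amazonaws\.com/.ecr.${region}.amazonaws.com/"
}

# Example (commented out; requires Docker and AWS credentials):
# SRC="324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web:$TAG"
# DST=$(ecr_uri_in_region "$SRC" eu-north-1)
# aws ecr create-repository --repository-name drop-web --region eu-north-1
# aws ecr get-login-password --region eu-north-1 \
#   | docker login --username AWS --password-stdin "${DST%%/*}"
# docker tag "$SRC" "$DST" && docker push "$DST"
```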
Scenario 5: Security Incident
Symptoms:
- Suspicious audit log entries
- Unauthorized access attempts
- AML alerts triggered for unusual activity
- Sumsub KYC bypass attempt
Investigation:
# Check audit log for recent suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"
# Check AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"
# Check CloudTrail for AWS API activity
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
--region eu-west-1 --max-results 50
Containment:
# 1. Revoke compromised sessions immediately
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"
# 2. Disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"
# 3. Rotate database credentials
aws rds modify-db-instance \
--db-instance-identifier drop-db \
--master-user-password <new-password> \
--apply-immediately --region eu-west-1
# 4. Take forensic snapshot
aws rds create-db-snapshot \
--db-instance-identifier drop-db \
--db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
--region eu-west-1
# 5. Rotate JWT_SECRET (invalidates all sessions)
# Generate: openssl rand -base64 48
# Update in AWS Secrets Manager + redeploy App Runner
Post-containment:
- Analyze audit logs to identify the scope of the breach
- File STR (Suspicious Transaction Report) if financial crime suspected
- Notify Finanstilsynet if user PII compromised (GDPR requirement, 72-hour window)
- User communication if required by GDPR Art. 34
RTO: Immediate containment (session revocation) / 24–48 hours full investigation
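Containment step 5 (JWT_SECRET rotation) can be sketched as below. The helper name `generate_jwt_secret` and the `JWT_SECRET_ID` variable are assumptions; the runbook does not name the secret in AWS Secrets Manager.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints 48 random bytes, base64-encoded (64 characters).
# Uses openssl as in the containment steps, falling back to /dev/urandom
# when openssl is not installed.
generate_jwt_secret() {
  openssl rand -base64 48 2>/dev/null || head -c 48 /dev/urandom | base64
}

# Example rotation (commented out; JWT_SECRET_ID is a placeholder):
# aws secretsmanager put-secret-value \
#   --secret-id "$JWT_SECRET_ID" \
#   --secret-string "$(generate_jwt_secret)" \
#   --region eu-west-1
# aws apprunner start-deployment \
#   --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
#   --region eu-west-1
```

Passing the value through command substitution keeps the new secret out of shell history and terminal scrollback.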
7. RTO / RPO Summary
| Failover Time | |||
|---|---|---|---|
| 60-120 seconds | |||
| < 30 seconds | |||
| < 60 seconds | |||
| < 30 seconds |
Monitoring automatic failover:
- Alert fires: MultiAZFailover CloudWatch event or equivalent
- On-call notified immediately
- No manual action required, but on-call must confirm recovery
5.2 Manual Failover Steps
Prerequisite: Automatic failover has NOT occurred or has failed.
Database Manual Failover (Tier 1)
- Confirm primary is unavailable: ping {{DB_PRIMARY_HOST}} (should time out)
- Connect to standby: psql {{STANDBY_HOST}}
- Promote standby to primary: SELECT pg_promote();
- Update DNS record: db.{{INTERNAL_DOMAIN}} → {{STANDBY_HOST}}
- DNS TTL: Ensure TTL was set to 60s pre-incident (if not, wait {{DNS_TTL}} seconds)
- Verify applications are reconnecting: check application logs for successful DB connections
- Page on-call to verify all services healthy
Regional Failover (Catastrophic)
- Declare DR event (approval from {{DR_AUTHORITY}})
- Confirm primary region {{PRIMARY_REGION}} is unreachable
- Activate standby in {{DR_REGION}}: terraform apply -var-file=envs/dr.tfvars
- Restore database from latest cross-region snapshot
- Update Route 53 / DNS to point to {{DR_REGION}} endpoints
- Run smoke tests: bash scripts/smoke-tests.sh {{DR_REGION}}
- Notify stakeholders (see Communication Plan)
- Monitor enhanced metrics for {{MONITOR_PERIOD}}h
6. Recovery Procedures Per Service
Tier 1 Services
| Service | Recovery Procedure | Recovery Script | Est. Time |
|---|---|---|---|
| {{SERVICE_1}} | 1. Restore from snapshot 2. Verify config 3. Run smoke tests | scripts/restore-{{SERVICE_1}}.sh | {{TIME}}min |
| Authentication | 1. Deploy from last known good image 2. Verify JWT keys 3. Test login flow | scripts/restore-auth.sh | {{TIME}}min |
Tier 2 Services
Tier 3 Services
7. DR Drill Schedule & Scenarios
| Drill Type | Frequency | Participants | Last Executed | Next Scheduled |
|---|---|---|---|---|
| Tabletop exercise | Quarterly | On-call team + engineering lead | {{DATE}} | {{DATE}} |
| Database failover test | Quarterly | DevOps + one developer | {{DATE}} | {{DATE}} |
| Full |
{{DATE}} | {{DATE}} | ||
| {{DATE}} | {{DATE}} |
Drill Scenarios to Cover:
- Database primary failure (automatic failover test)
- Accidental data deletion (point-in-time restore)
- Single AZ outage (multi-AZ failover)
- Full region failure (cross-region DR)
- Ransomware/data corruption (restore from offline backup)
- CDN outage (origin fallback)
- Secret store unavailable (cached credentials)
8. Communication Plan During DR Event
Internal Communications
| Audience | Channel | Frequency | Owner |
|---|---|---|---|
| Engineering team | Slack #incidents + war room call | Real-time | Incident commander |
| Engineering management | Direct message | At declaration + hourly | Incident commander |
| Product/Business leadership | Email + Slack | At declaration + hourly | Incident commander |
| Customer support | Dedicated Slack channel | At declaration + 30 min | Support lead |
External Communications
| Audience | Channel | Trigger | Message |
|---|---|---|---|
| Customers | Status page ({{STATUS_PAGE}}) | Within 15 min of confirmed incident | "We are investigating an issue" |
| Customers | Status page update | Every 30 min | Progress update |
| Customers | Email | If impact > {{EMAIL_THRESHOLD}}h | Direct notification |
| SLA customers | Direct contact | Per SLA contract | As contractually required |
Communication templates: See go-live-runbook.md communication section
9. War Room Setup
War Room: {{WAR_ROOM_LINK}}
Bridge Line: {{BRIDGE_NUMBER}}
Document: Live incident doc created at: {{INCIDENT_DOC_TEMPLATE}}
Roles during DR event:
| Role | Backup | ||
|---|---|---|---|
| {{IC_BACKUP}} | |||
| {{TECH_BACKUP}} | |||
| {{COMMS_BACKUP}} | |||
9. Runbook Maintenance
Review Schedule
- Quarterly review: Verify all ARNs, endpoints, commands still valid
- After any incident: Update with lessons learned
- Before major releases: Verify backup and rollback work
Test Schedule
- Q2 2026: Full DR drill (App Runner restart + RDS snapshot restore to temp instance)
- Quarterly: App Runner rollback test
- Monthly: Verify automated RDS snapshot creation
Change Log
10. Post-Recovery Verification Checklist
- All Tier 1 services healthy (health checks passing)
- Error rate back to baseline (< {{ERROR_BASELINE}}%)
- P99 latency back to baseline (< {{P99_BASELINE}}ms)
- Database connections stable
- Replication lag < {{REPLICATION_LAG}}s (if applicable)
- Backup jobs resumed and completed successfully
- Monitoring and alerting functional
- No data loss confirmed (or data loss quantified and documented)
- All Tier 2 services healthy
- Stakeholders notified of recovery
- Status page updated to "Resolved"
- Incident timeline documented
- Post-mortem scheduled (within {{POSTMORTEM_SLA}}h)
11. DR Test Results Log
| Date | RTO Achieved | RPO Achieved | Issues Found | Resolved By | ||
|---|---|---|---|---|---|---|
| {{RTO}} | {{RPO}} | {{ISSUES}} | {{RESOLVED}} |
Related Documents
- Deployment Architecture
- Monitoring & Observability
- Operational Runbook
- DR Runbook Source
- Incident Report
- Post-Mortem
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | |||
| Reviewer | |||
| Approver |