Disaster Recovery Plan
- Project: Drop
- Version: 0.1.0
- Date: 2026-02-23
- Author: Platform Architect (AI)
- Status: In Review
- Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Compiled from DR-RUNBOOK.md + infrastructure analysis |
1. Business Continuity Overview
This plan documents procedures to recover Drop services following a disaster event (infrastructure failure, data corruption, security breach, or total region outage). Drop is a PSD2 pass-through payment application — it never holds customer funds, so there is no risk of customer money being lost due to Drop infrastructure failure. The primary recovery concern is service availability and data integrity.
Plan Owner: Alem Bašić (CEO), [email protected], +47 40 47 42 51
Plan Reviewer: John (AI Director), Slack #drop-alerts
Last Tested: TBD — initial DR drill not yet conducted
Next Scheduled Test: Q2 2026 (App Runner restart + RDS snapshot restore)
Disaster types covered:
- App Runner service failure or crash
- RDS database failure or data corruption
- Security incident (unauthorized access, credential compromise)
- Full region outage (eu-west-1)
- Catastrophic application failure (bad deployment)
2. RPO / RTO Targets Per Service Tier
| Tier | Description | RPO | RTO | Drop Services |
|---|---|---|---|---|
| Tier 1 — Critical | Core user-facing services | 5 minutes (PITR) | 30 minutes | Auth (BankID), transactions (remittance + QR), health endpoint |
| Tier 2 — Important | Supporting features | 24 hours (snapshot) | 1 hour | Merchant dashboard, notifications, transaction history |
| Tier 3 — Standard | Background / admin | 24 hours | 24 hours | Audit logs, AML alerts, complaint records |
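During a drill, the tier targets above can be checked mechanically. The following is a minimal sketch only: the function names `rpo_limit_seconds` and `rpo_met` are hypothetical and not part of the existing runbook, and it assumes the caller already knows the age (in seconds) of the most recent recoverable point. The thresholds are taken from the table.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the existing runbook): compare the age of
# the most recent recoverable point against the RPO target of a service tier.

# rpo_limit_seconds TIER -> prints the RPO target in seconds
rpo_limit_seconds() {
  case "$1" in
    1) echo 300 ;;       # Tier 1: 5 minutes (PITR)
    2|3) echo 86400 ;;   # Tier 2 and Tier 3: 24 hours
    *) echo "unknown tier: $1" >&2; return 2 ;;
  esac
}

# rpo_met TIER AGE_SECONDS -> exit 0 if the age is within the tier's RPO
rpo_met() {
  local limit
  limit=$(rpo_limit_seconds "$1") || return 2
  [ "$2" -le "$limit" ]
}

# Example: rpo_met 1 240 && echo "Tier 1 RPO met"
```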
3. Service Tier Classification
| Service | Tier | Justification |
|---|---|---|
| BankID authentication | 1 | Users cannot transact without login |
| Remittance API (`/api/transactions/remittance`) | 1 | Core revenue feature |
| QR payment API (`/api/transactions/qr-payment`) | 1 | Core revenue feature |
| Bank account read (AISP) | 1 | Required for payment initiation |
| Health endpoint (`/api/health`) | 1 | Monitoring dependency |
| Transaction history | 2 | UX degraded, no blocking issue |
| Merchant dashboard | 2 | Merchant ops impacted |
| Notifications | 2 | UX degraded |
| Audit log | 3 | Compliance — retained, not real-time |
| AML alerts | 3 | Reviewed periodically, not real-time |
4. Infrastructure Overview
Production
- Service: AWS App Runner
- Region: eu-west-1 (Ireland)
- Service ARN: `arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec`
- Service URL: `https://9ef3szvvsb.eu-west-1.awsapprunner.com`
- ECR Repository: `324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web`
Database (Production)
- RDS Instance: `drop-db`
- Endpoint: `drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432`
- DB Name: `dropapp`
- DB User: `dropuser`
- Backup Strategy: Automated daily snapshots, 7-day retention
- Backup Window: 23:24–23:54 UTC daily
- PITR: Enabled (latest restorable time is typically within the last 5 minutes)
Staging
- Platform: Fly.io, region `arn` (Stockholm)
- App Name: `drop-staging`
- Database: SQLite on an ephemeral volume, no automated backup
5. Backup Strategy
Production RDS PostgreSQL
- Automated Snapshots: Daily at 23:24 UTC
- Retention Period: 7 days
- Point-in-Time Recovery (PITR): Enabled; restore to any point within the last 7 days, with the latest restorable time typically within the last 5 minutes
- Manual Snapshots: Created before every major deployment or migration
- Snapshot verification: Run quarterly
ECR Docker Images
- All pushed images retained in ECR repository
- Rollback capability: Point the service at any previous image tag, then redeploy via `aws apprunner start-deployment`
- Lifecycle policy: Delete untagged images after 7 days, keep last 10 tagged releases
Staging (Fly.io)
- No automated backup — ephemeral SQLite storage
- Manual backup procedure:
flyctl ssh console -a drop-staging
sqlite3 /app/data/drop.db ".backup /app/data/backup-$(date +%Y%m%d).db"
6. Recovery Procedures
Scenario 1: App Runner Service Down
Symptoms:
- BetterStack alert: `Drop Health Check is DOWN`
- Slack `#drop-ops`: critical alert
- App Runner service status not `RUNNING`
Investigation:
# Check service status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
# View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow --since 10m --region eu-west-1
# Check deployment history
aws apprunner list-operations \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
Recovery Option A: Restart (preferred)
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
RTO: 5–10 minutes | RPO: 0 (no data loss)
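After triggering the restart, it helps to confirm that the service actually returns to `RUNNING` within the stated RTO. The sketch below is a hedged illustration rather than an established part of this runbook: the function name `wait_for_running` and the default of 40 attempts at 15-second intervals (roughly the 10-minute upper bound) are assumptions. It relies only on `aws apprunner describe-service` and its `Service.Status` field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll App Runner until the service reports RUNNING.
SERVICE_ARN="arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec"

# wait_for_running [MAX_ATTEMPTS] [SLEEP_SECONDS]
# Exit 0 once the status is RUNNING, exit 1 if the attempts run out.
wait_for_running() {
  local max_attempts="${1:-40}" delay="${2:-15}" status attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    status=$(aws apprunner describe-service \
      --service-arn "$SERVICE_ARN" \
      --region eu-west-1 \
      --query 'Service.Status' --output text)
    echo "attempt $attempt: $status"
    [ "$status" = "RUNNING" ] && return 0
    sleep "$delay"
  done
  return 1
}

# Example: wait_for_running || echo "Service not RUNNING, consider Option B"
```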
Recovery Option B: Rollback to previous image
# List recent ECR images
aws ecr describe-images \
--repository-name drop-web \
--region eu-west-1 \
--query 'sort_by(imageDetails,&imagePushedAt)[-5:]'
# Update App Runner image tag via console or update the deployment workflow
# Then trigger new deployment
RTO: 15–20 minutes | RPO: 0 (no data loss)
Scenario 2: RDS Database Failure
Symptoms:
- `/api/health` returns `{"status":"down"}` (HTTP 503)
- BetterStack + Slack alerts fire
- App Runner logs show connection timeout to RDS
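The first symptom can be confirmed by hand before digging into RDS. This is a minimal sketch under stated assumptions: the health URL is formed by appending `/api/health` to the service URL listed in the infrastructure overview, a healthy service is assumed to answer with HTTP 200 (the plan only documents the 503 case), and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: classify the health endpoint's HTTP response.
# Assumption: service URL + /api/health is the monitored endpoint.
HEALTH_URL="https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health"

# health_code -> prints the HTTP status code (curl prints 000 if the
# connection itself fails)
health_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$HEALTH_URL"
}

# health_state -> prints "up" for 200 (assumed), "down" for 503,
# and "unknown (<code>)" for anything else
health_state() {
  local code
  code=$(health_code) || true
  case "$code" in
    200) echo "up" ;;
    503) echo "down" ;;
    *) echo "unknown ($code)" ;;
  esac
}

# Example: [ "$(health_state)" = "down" ] && echo "Proceed with RDS investigation"
```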
Investigation:
# Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBInstances[0].DBInstanceStatus'
# Check available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'
# Check events
aws rds describe-events \
--source-identifier drop-db \
--source-type db-instance \
--region eu-west-1 --duration 60
Recovery Option A: Restore from automated snapshot
LATEST=$(aws rds describe-db-snapshots \
--db-instance-identifier drop-db --region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-restored \
--db-snapshot-identifier $LATEST \
--db-instance-class db.t4g.micro \
--region eu-west-1
aws rds wait db-instance-available --db-instance-identifier drop-db-restored --region eu-west-1
# Look up the endpoint of the restored instance
NEW_EP=$(aws rds describe-db-instances --db-instance-identifier drop-db-restored \
--query 'DBInstances[0].Endpoint.Address' --output text --region eu-west-1)
echo "Restored endpoint: $NEW_EP"
# Update DATABASE_URL in the App Runner environment to use $NEW_EP
RTO: 30 minutes | RPO: 24 hours (last snapshot)
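Pointing the application at the restored instance means changing only the host part of `DATABASE_URL`. The sketch below assumes, without confirmation from this plan, that `DATABASE_URL` is a standard PostgreSQL URI of the form `postgresql://user:password@host:port/dbname` and that the password contains no `@`. The function name `swap_db_host` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace only the host in a PostgreSQL connection URI,
# keeping user, password, port and database name unchanged.
# Assumption: the URI has the form postgresql://user:password@host:port/dbname
# and the password does not contain '@'.

# swap_db_host URL NEW_HOST -> prints the URL with the host replaced
swap_db_host() {
  local url="$1" new_host="$2"
  [ -n "$url" ] && [ -n "$new_host" ] || { echo "usage: swap_db_host URL NEW_HOST" >&2; return 1; }
  # Match from the first '@' up to (not including) the next ':' or '/'
  printf '%s\n' "$url" | sed -E "s#@[^:/]+#@${new_host}#"
}

# Example (NEW_EP as obtained from describe-db-instances):
#   swap_db_host "$DATABASE_URL" "$NEW_EP"
```

Avoid echoing the resulting URL into shared logs, since it still contains the database password.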
Recovery Option B: Point-in-Time Recovery (PITR)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier drop-db \
--target-db-instance-identifier drop-db-pitr \
--restore-time $(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ') \
--db-instance-class db.t4g.micro \
--region eu-west-1
RTO: 30 minutes | RPO: 5 minutes (PITR granularity)
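The `--restore-time` value must be a UTC timestamp inside the retention window. The sketch below shows one way to build and sanity-check it before calling the restore. It assumes GNU `date` (as the command above already does) and the 7-day retention stated in this plan; the function names are hypothetical, and the actual earliest and latest restorable times reported by RDS remain authoritative.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for choosing a PITR target. Requires GNU date.

# pitr_timestamp EPOCH_SECONDS -> UTC timestamp in the format used for --restore-time
pitr_timestamp() {
  date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}

# pitr_in_window NOW_EPOCH TARGET_EPOCH
# Exit 0 if the target is not in the future and lies within the
# 7-day retention window assumed from this plan.
pitr_in_window() {
  local now="$1" target="$2" retention=$((7 * 24 * 3600))
  [ "$target" -le "$now" ] && [ $((now - target)) -le "$retention" ]
}

# Example: restore to one hour ago
#   now=$(date +%s); target=$((now - 3600))
#   pitr_in_window "$now" "$target" && echo "--restore-time $(pitr_timestamp "$target")"
```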
Scenario 3: Data Corruption
Symptoms:
- Application reports data inconsistencies
- User-reported missing or incorrect transactions
- Audit log shows unexpected DELETE/UPDATE operations
Investigation:
# Check for soft-deleted users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"
# Check recent suspicious audit log entries
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT * FROM audit_log WHERE action IN ('DELETE','UPDATE') ORDER BY timestamp DESC LIMIT 50;"
Recovery: Selective restore from clean snapshot (see Scenario 2 recovery steps) + merge affected tables.
RTO: 1–2 hours (selective) | RPO: Depends on snapshot age
Scenario 4: Full Region Outage (eu-west-1)
Current State: No automated cross-region failover. Manual failover to eu-north-1 (Stockholm) required.
Investigation:
- Check AWS Health Dashboard: https://health.aws.amazon.com/health/status
- Verify whether RDS snapshots stored in eu-west-1 are still accessible; the snapshot copy below reads from eu-west-1
Manual Failover to eu-north-1:
# 1. Copy latest RDS snapshot to eu-north-1
LATEST=$(aws rds describe-db-snapshots --db-instance-identifier drop-db --region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
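# Compute the target snapshot name once so the copy and restore steps agree
SNAP_ID=drop-db-failover-$(date +%Y%m%d)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier "arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST" \
--target-db-snapshot-identifier "$SNAP_ID" \
--region eu-north-1
# Wait for the copy to finish before restoring from it
aws rds wait db-snapshot-available --db-snapshot-identifier "$SNAP_ID" --region eu-north-1
# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-failover \
--db-snapshot-identifier "$SNAP_ID" \
--db-instance-class db.t4g.micro \
--region eu-north-1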
# 3. Create ECR repository in eu-north-1 and push latest image
# 4. Create App Runner service in eu-north-1
# 5. Update DNS when getdrop.no is active
RTO: 2–4 hours (manual) | RPO: Up to 24 hours (last snapshot)
Scenario 5: Security Incident
Symptoms:
- Suspicious audit log entries
- Unauthorized access attempts
- AML alerts triggered for unusual activity
- Sumsub KYC bypass attempt
Investigation:
# Check audit log for recent suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"
# Check AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"
# Check CloudTrail for AWS API activity
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
--region eu-west-1 --max-results 50
Containment:
# 1. Revoke compromised sessions immediately
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"
# 2. Disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"
# 3. Rotate database credentials
aws rds modify-db-instance \
--db-instance-identifier drop-db \
--master-user-password '<new-password>' \
--apply-immediately --region eu-west-1
# 4. Take forensic snapshot
aws rds create-db-snapshot \
--db-instance-identifier drop-db \
--db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
--region eu-west-1
# 5. Rotate JWT_SECRET (invalidates all sessions)
# Generate: openssl rand -base64 48
# Update in AWS Secrets Manager + redeploy App Runner
Post-containment:
- Analyze audit logs — identify scope of breach
- File STR (Suspicious Transaction Report) if financial crime suspected
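- Notify Datatilsynet, the Norwegian data protection authority, within 72 hours if user PII is compromised (GDPR Art. 33); notify Finanstilsynet as well where payment-service incident reporting applies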
- User communication if required by GDPR Art. 34
RTO: Immediate containment (session revocation) / 24–48 hours full investigation
7. RTO / RPO Summary
| Scenario | RTO | RPO |
|---|---|---|
| App Runner restart | 5–10 minutes | 0 (no data loss) |
| App Runner rollback | 15–20 minutes | 0 (no data loss) |
| RDS snapshot restore | 30 minutes | 24 hours (daily snapshot) |
| RDS PITR restore | 30 minutes | 5 minutes |
| Full region failover (eu-west-1) | 2–4 hours | 24 hours |
| Security incident containment | Immediate (session revocation) | 0 (logs preserved) |
8. Contacts
| Role | Name | Contact |
|---|---|---|
| Primary Incident Owner | Alem Bašić (CEO) | [email protected] / +47 40 47 42 51 |
| AI Operations | John (AI Director) | Slack #drop-alerts |
| AWS Support | AWS | Premium support via AWS Console |
| Fly.io Support (staging) | Fly.io | [email protected] |
| Sumsub Support | Sumsub | [email protected] |
| BankID Support | Vipps MobilePay (BankID operator) | Per contract |
9. Runbook Maintenance
Review Schedule
- Quarterly review: Verify all ARNs, endpoints, and commands still valid
- After any incident: Update with lessons learned
- Before major releases: Verify backup and rollback procedures work
Test Schedule
- Q2 2026: Full DR drill — App Runner restart + RDS snapshot restore to temp instance
- Quarterly: App Runner rollback test
- Monthly: Verify automated RDS snapshot creation
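The monthly snapshot check can be scripted so that it fails loudly when the newest automated snapshot is stale. This is a hedged sketch, not an existing part of the runbook: the function names are hypothetical, the default threshold of 90000 seconds (24 hours plus one hour of slack) is an assumption, and timestamp parsing relies on GNU `date`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: verify that the newest automated RDS snapshot is recent.

# latest_snapshot_time -> prints the creation time of the newest automated snapshot
latest_snapshot_time() {
  aws rds describe-db-snapshots \
    --db-instance-identifier drop-db --region eu-west-1 \
    --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].SnapshotCreateTime' \
    --output text
}

# snapshot_is_fresh NOW_EPOCH [MAX_AGE_SECONDS]
# Exit 0 if the newest snapshot is younger than MAX_AGE_SECONDS
# (default 90000 = 25 hours, an assumed tolerance), 1 if it is older,
# 2 if the snapshot time could not be read or parsed.
snapshot_is_fresh() {
  local now="$1" max_age="${2:-90000}" created created_epoch
  created=$(latest_snapshot_time) || return 2
  created_epoch=$(date -u -d "$created" +%s 2>/dev/null) || return 2
  [ $((now - created_epoch)) -le "$max_age" ]
}

# Example: snapshot_is_fresh "$(date +%s)" || echo "Automated snapshot is stale or missing"
```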
Change Log
| Date | Change | Author |
|---|---|---|
| 2026-02-23 | Initial version from DR runbook + infra analysis | Platform Architect (AI) |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |