Disaster Recovery Plan

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Compiled from DR-RUNBOOK.md + infrastructure analysis

1. Business Continuity Overview

This plan documents the procedures for recovering Drop services after a disaster event: infrastructure failure, data corruption, a security breach, or a total region outage. Drop is a PSD2 pass-through payment application that never holds customer funds, so a Drop infrastructure failure cannot cause customer money to be lost. The primary recovery concerns are service availability and data integrity.

Plan Owner: Alem Bašić (CEO), [email protected], +47 40 47 42 51
Plan Reviewer: John (AI Director), Slack #drop-alerts
Last Tested: TBD (initial DR drill not yet conducted)
Next Scheduled Test: Q2 2026 (App Runner restart + RDS snapshot restore)

Disaster types covered:

App Runner service failure or crash
RDS database failure or data corruption
Security incident (unauthorized access, credential compromise)
Full region outage (eu-west-1)
Catastrophic application failure (bad deployment)

2. RPO / RTO Targets Per Service Tier

Tier	Description	RPO	RTO	Drop Services
Tier 1 — Critical	Core user-facing services	5 minutes (PITR)	30 minutes	Auth (BankID), transactions (remittance + QR), health endpoint
Tier 2 — Important	Supporting features	24 hours (snapshot)	1 hour	Merchant dashboard, notifications, transaction history
Tier 3 — Standard	Background / admin	24 hours	24 hours	Audit logs, AML alerts, complaint records

3. Service Tier Classification

Service	Tier	Justification
BankID authentication	1	Users cannot transact without login
Remittance API (`/api/transactions/remittance`)	1	Core revenue feature
QR payment API (`/api/transactions/qr-payment`)	1	Core revenue feature
Bank account read (AISP)	1	Required for payment initiation
Health endpoint (`/api/health`)	1	Monitoring dependency
Transaction history	2	UX degraded, no blocking issue
Merchant dashboard	2	Merchant ops impacted
Notifications	2	UX degraded
Audit log	3	Compliance — retained, not real-time
AML alerts	3	Reviewed periodically, not real-time

4. Infrastructure Overview

Production

Service: AWS App Runner
Region: eu-west-1 (Ireland)
Service ARN: arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec
Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com
ECR Repository: 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web

Database (Production)

RDS Instance: drop-db
Endpoint: drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432
DB Name: dropapp
DB User: dropuser
Backup Strategy: Automated daily snapshots, 7-day retention
Backup Window: 23:24–23:54 UTC daily
PITR: Enabled (restore to any second within the retention period; the latest restorable time is typically within the last 5 minutes)

Staging

Platform: Fly.io, region code arn (Stockholm; a Fly.io region name, unrelated to AWS ARNs)
App Name: drop-staging
Database: SQLite ephemeral volume — no automated backup

5. Backup Strategy

Production RDS PostgreSQL

Automated Snapshots: Daily at 23:24 UTC
Retention Period: 7 days
Point-in-Time Recovery (PITR): Enabled. Restores to any second within the last 7 days; the latest restorable time is typically within the last 5 minutes, which gives the 5-minute RPO
Manual Snapshots: Created before every major deployment or migration
Snapshot verification: Run quarterly
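
The settings above can be spot-checked from the command line. The following is a minimal sketch, not an established part of this runbook: it assumes GNU date (Linux) and an AWS CLI profile with read access to RDS, and the helper names pitr_lag_seconds, pitr_status, and check_pitr_lag are hypothetical. It compares the instance's LatestRestorableTime with the 5-minute RPO target.

```shell
#!/usr/bin/env bash
# Sketch: check that point-in-time recovery keeps up with the 5-minute RPO.
# Assumes GNU date. All function names here are illustrative.

# Seconds between the latest restorable time and "now" (both ISO-8601, UTC).
pitr_lag_seconds() {
  local latest="$1"
  local now="$2"
  echo $(( $(date -u -d "$now" +%s) - $(date -u -d "$latest" +%s) ))
}

# Prints OK or LAGGING for a given lag; the threshold defaults to 300 seconds.
pitr_status() {
  local lag="$1"
  local max="${2:-300}"
  if [ "$lag" -le "$max" ]; then echo "OK"; else echo "LAGGING"; fi
}

# Live check against production. Defined only; call it explicitly when needed.
check_pitr_lag() {
  local latest
  latest=$(aws rds describe-db-instances \
    --db-instance-identifier drop-db --region eu-west-1 \
    --query 'DBInstances[0].LatestRestorableTime' --output text)
  pitr_status "$(pitr_lag_seconds "$latest" "$(date -u +%Y-%m-%dT%H:%M:%SZ)")"
}
```

Calling check_pitr_lag prints OK when the latest restorable time is no more than 300 seconds behind the current time, and LAGGING otherwise.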

ECR Docker Images

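Pushed images are retained in the ECR repository, subject to the lifecycle policy below
Rollback capability: Point the App Runner service at a previous image tag with aws apprunner update-service, which starts a new deployment (aws apprunner start-deployment only redeploys the currently configured image)
Lifecycle policy: Delete untagged images after 7 days, keep last 10 tagged releases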
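
For reference, the lifecycle policy described above could be expressed roughly as follows. This is a sketch under assumptions, not the policy actually deployed: the tag prefix "v" for release tags is a guess (ECR rules for tagged images need a tag prefix or pattern), and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: ECR lifecycle policy matching "delete untagged images after 7 days,
# keep last 10 tagged releases". The "v" tag prefix is an assumption.

lifecycle_policy_json() {
  cat <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images after 7 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Keep the last 10 tagged releases",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["v"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF
}

# Applies the policy to the drop-web repository. Defined only; not run automatically.
apply_lifecycle_policy() {
  aws ecr put-lifecycle-policy \
    --repository-name drop-web \
    --lifecycle-policy-text "$(lifecycle_policy_json)" \
    --region eu-west-1
}
```

Because rollback depends on older tags still existing, the count in the second rule also bounds how far back a rollback can go.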

Staging (Fly.io)

No automated backup — ephemeral SQLite storage

Manual backup procedure:

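flyctl ssh console -a drop-staging   # opens a shell on the staging machine; run the next line there
sqlite3 /app/data/drop.db ".backup /app/data/backup-$(date +%Y%m%d).db"
# Then, from the local machine, copy the backup off the ephemeral volume (replace YYYYMMDD):
flyctl ssh sftp get /app/data/backup-YYYYMMDD.db -a drop-staging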

6. Recovery Procedures

Scenario 1: App Runner Service Down

Symptoms:

BetterStack alert: Drop Health Check is DOWN
Slack #drop-ops: critical alert
App Runner service status not RUNNING

Investigation:

# Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow --since 10m --region eu-west-1

# Check deployment history
aws apprunner list-operations \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

Recovery Option A: Restart (preferred)

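# start-deployment redeploys the currently configured image (App Runner has no separate restart call).
# If the service status is PAUSED, use aws apprunner resume-service instead.
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1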

RTO: 5–10 minutes | RPO: 0 (no data loss)
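
After triggering the deployment, recovery can be confirmed by polling the health endpoint. The sketch below is illustrative only: it assumes a healthy service answers /api/health with HTTP 200 (this plan only documents the 503 "down" case), and the function names and the 20 x 30-second polling budget (about 10 minutes, matching the RTO above) are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: poll /api/health until the service recovers.
# Assumption: HTTP 200 means healthy; 503 is the documented "down" response.

# Map an HTTP status code to a verdict.
health_verdict() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "down" ;;
    *)   echo "unknown" ;;
  esac
}

# Poll the endpoint; returns 0 once healthy, 1 if attempts run out.
wait_until_healthy() {
  local url="${1:-https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health}"
  local attempts="${2:-20}"
  local delay="${3:-30}"
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    # curl prints 000 when no HTTP response was received at all
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if [ "$(health_verdict "$code")" = "healthy" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts"
  return 1
}
```

If wait_until_healthy returns non-zero, move on to Recovery Option B.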

Recovery Option B: Rollback to previous image

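# List recent ECR images
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:]'

# Point the service at the chosen tag (<previous-tag> is a placeholder); App Runner
# then starts a new deployment. Afterwards, confirm that the port and environment
# settings are unchanged, since this call passes only the image fields.
aws apprunner update-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --source-configuration '{"ImageRepository":{"ImageIdentifier":"324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web:<previous-tag>","ImageRepositoryType":"ECR"}}' \
  --region eu-west-1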

RTO: 15–20 minutes | RPO: 0 (no data loss)

Scenario 2: RDS Database Failure

Symptoms:

/api/health returns {"status":"down"} (HTTP 503)
BetterStack + Slack alerts fire
App Runner logs show connection timeout to RDS

Investigation:

# Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBInstances[0].DBInstanceStatus'

# Check available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'

# Check events
aws rds describe-events \
  --source-identifier drop-db \
  --source-type db-instance \
  --region eu-west-1 --duration 60

Recovery Option A: Restore from automated snapshot

# Identify the most recent automated snapshot
LATEST=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

# Restore to a new instance. The restored instance gets the default security
# group unless --vpc-security-group-ids is passed, so add the production
# values (and --db-subnet-group-name if needed) so App Runner can reach it.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-restored \
  --db-snapshot-identifier "$LATEST" \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

aws rds wait db-instance-available --db-instance-identifier drop-db-restored --region eu-west-1

# Look up the new endpoint, then update DATABASE_URL in the App Runner
# environment to point at it and redeploy
NEW_EP=$(aws rds describe-db-instances --db-instance-identifier drop-db-restored \
  --query 'DBInstances[0].Endpoint.Address' --output text --region eu-west-1)
echo "New endpoint: $NEW_EP"

RTO: 30 minutes | RPO: 24 hours (last snapshot)
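
The final step, pointing the application at the restored instance, might look like the sketch below. It is hedged throughout: the DATABASE_URL layout (a standard PostgreSQL connection URI), the DB_PASSWORD environment variable, and both function names are assumptions, since this plan does not show how DATABASE_URL is composed or stored. The user, port, and database name come from the Infrastructure Overview.

```shell
#!/usr/bin/env bash
# Sketch: build a connection string for the restored instance and check that
# it accepts connections. URL layout and DB_PASSWORD are assumptions.

# Prints a PostgreSQL connection URI for the given host.
# Passwords containing URL-reserved characters would need percent-encoding.
database_url() {
  local host="$1"
  printf 'postgresql://dropuser:%s@%s:5432/dropapp\n' "${DB_PASSWORD:-}" "$host"
}

# Returns 0 once PostgreSQL on the given host accepts connections.
restored_db_ready() {
  pg_isready -h "$1" -p 5432 -d dropapp -U dropuser -t 10
}

# Example (not run automatically), using NEW_EP from the restore steps:
#   restored_db_ready "$NEW_EP" && database_url "$NEW_EP"
```

Checking readiness before switching DATABASE_URL avoids redeploying App Runner against an instance that is not yet reachable from its network.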

Recovery Option B: Point-in-Time Recovery (PITR)

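# Restore to the latest restorable time (typically within the last 5 minutes).
# To restore to an earlier moment instead, replace --use-latest-restorable-time
# with --restore-time <UTC timestamp>, for example 2026-02-23T10:15:00Z.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier drop-db \
  --target-db-instance-identifier drop-db-pitr \
  --use-latest-restorable-time \
  --db-instance-class db.t4g.micro \
  --region eu-west-1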

RTO: 30 minutes | RPO: 5 minutes (PITR granularity)

Scenario 3: Data Corruption

Symptoms:

Application reports data inconsistencies
User-reported missing or incorrect transactions
Audit log shows unexpected DELETE/UPDATE operations

Investigation:

# Check for soft-deleted users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"

# Check recent suspicious audit log entries
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM audit_log WHERE action IN ('DELETE','UPDATE') ORDER BY timestamp DESC LIMIT 50;"

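Recovery: Restore a clean copy of the database to a temporary instance (see Scenario 2 recovery steps; PITR to a time just before the corruption keeps data loss smallest), then merge the affected tables back into production.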

RTO: 1–2 hours (selective) | RPO: Depends on snapshot age
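
The selective restore can be sketched as follows. This is an illustration under stated assumptions, not a tested procedure: it assumes GNU date, takes the timestamp of the first suspicious audit_log entry as input, and backs off by a 5-minute margin chosen arbitrarily. The function names and the output file name are hypothetical, and the commands are printed for review, not executed.

```shell
#!/usr/bin/env bash
# Sketch: choose a PITR restore time just before the first bad write, then
# print the commands for a selective table restore. Assumes GNU date.

# Prints an ISO-8601 UTC timestamp <margin> seconds before <first_bad>.
restore_time_before() {
  local first_bad="$1"
  local margin="${2:-300}"
  local epoch
  epoch=$(date -u -d "$first_bad" +%s) || return 1
  date -u -d "@$(( epoch - margin ))" +%Y-%m-%dT%H:%M:%SZ
}

# Prints (does not run) the restore and export commands for one table.
selective_restore_plan() {
  local first_bad="$1"
  local table="$2"
  local restore_time
  restore_time=$(restore_time_before "$first_bad") || return 1
  cat <<EOF
aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier drop-db --target-db-instance-identifier drop-db-pitr --restore-time $restore_time --db-instance-class db.t4g.micro --region eu-west-1
pg_dump -h <drop-db-pitr endpoint> -U dropuser -d dropapp --data-only -t $table -f $table-clean.sql
EOF
}
```

For example, selective_restore_plan 2026-02-23T12:00:00Z users prints a restore command with --restore-time 2026-02-23T11:55:00Z and a pg_dump command for the users table. How the exported rows are merged back into production depends on the table and should be reviewed case by case.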

Scenario 4: Full Region Outage (eu-west-1)

Current State: No automated cross-region failover. Manual failover to eu-north-1 (Stockholm) required.

Investigation:

Check AWS Health Dashboard: https://health.aws.amazon.com/health/status
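Verify that the RDS snapshots in eu-west-1 are still accessible (the copy step below reads from eu-west-1 and cannot run if the regional RDS API is unavailable)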

Manual Failover to eu-north-1:

# 1. Copy latest RDS snapshot to eu-north-1
LATEST=$(aws rds describe-db-snapshots --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)
# Compute the target name once so that steps 1 and 2 use the same identifier
TARGET="drop-db-failover-$(date +%Y%m%d)"

# If the snapshot is encrypted, also pass --kms-key-id with a key in eu-north-1
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST" \
  --target-db-snapshot-identifier "$TARGET" \
  --source-region eu-west-1 \
  --region eu-north-1

# Wait for the copy to finish before restoring from it
aws rds wait db-snapshot-available --db-snapshot-identifier "$TARGET" --region eu-north-1

# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-failover \
  --db-snapshot-identifier "$TARGET" \
  --db-instance-class db.t4g.micro \
  --region eu-north-1

# 3. Create ECR repository in eu-north-1 and push latest image
# 4. Create App Runner service in eu-north-1
# 5. Update DNS when getdrop.no is active

RTO: 2–4 hours (manual) | RPO: Up to 24 hours (last snapshot)
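
Step 3 of the failover (making the image available in eu-north-1) is only outlined above. One possible shape is sketched below: the function name is hypothetical, the commands are printed for review and not executed, and the sketch assumes that ECR in eu-west-1 is still reachable for the pull. If it is not, the image would have to come from a local copy or a rebuild.

```shell
#!/usr/bin/env bash
# Sketch: print the commands that copy one image tag from the eu-west-1
# registry to a new repository in eu-north-1. Nothing is executed here.

failover_image_plan() {
  local tag="$1"
  local account="324480209768"
  local repo="drop-web"
  local src_registry="$account.dkr.ecr.eu-west-1.amazonaws.com"
  local dst_registry="$account.dkr.ecr.eu-north-1.amazonaws.com"
  cat <<EOF
aws ecr create-repository --repository-name $repo --region eu-north-1
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin $src_registry
aws ecr get-login-password --region eu-north-1 | docker login --username AWS --password-stdin $dst_registry
docker pull $src_registry/$repo:$tag
docker tag $src_registry/$repo:$tag $dst_registry/$repo:$tag
docker push $dst_registry/$repo:$tag
EOF
}
```

Running failover_image_plan with the tag of the last known-good release prints six commands that can be reviewed and then pasted into a shell.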

Scenario 5: Security Incident

Symptoms:

Suspicious audit log entries
Unauthorized access attempts
AML alerts triggered for unusual activity
Sumsub KYC bypass attempt

Investigation:

# Check audit log for recent suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"

# Check AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"

# Check CloudTrail for AWS API activity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
  --region eu-west-1 --max-results 50

Containment:

# 1. Revoke compromised sessions immediately
#    (if sessions.revoked is a BOOLEAN column, PostgreSQL needs TRUE instead of 1)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"

# 2. Disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"

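# 3. Rotate database credentials (this resets the master user password).
#    Keep the new password out of shell history, and update the application's
#    database credentials afterwards if it connects as the master user.
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password <new-password> \
  --apply-immediately --region eu-west-1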

# 4. Take forensic snapshot
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

# 5. Rotate JWT_SECRET (invalidates all sessions)
# Generate: openssl rand -base64 48
# Update in AWS Secrets Manager + redeploy App Runner
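
Step 5 could be scripted along these lines. This is a sketch, not the established procedure: the Secrets Manager secret name is not given in this plan, so rotate_jwt_secret takes it as an argument, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: rotate JWT_SECRET. The secret id passed to rotate_jwt_secret is an
# assumption; this plan does not record the actual secret name.

# Prints a fresh 48-byte random secret, base64-encoded on a single line.
generate_jwt_secret() {
  openssl rand -base64 48 | tr -d '\n'
}

# Stores a new secret value and redeploys App Runner so it is picked up.
# Defined only; not run automatically.
rotate_jwt_secret() {
  local secret_id="$1"
  aws secretsmanager put-secret-value \
    --secret-id "$secret_id" \
    --secret-string "$(generate_jwt_secret)" \
    --region eu-west-1
  aws apprunner start-deployment \
    --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
    --region eu-west-1
}
```

Passing the generated value straight into put-secret-value means the new secret is never echoed to the terminal or written to shell history.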

Post-containment:

Analyze audit logs — identify scope of breach
File STR (Suspicious Transaction Report) if financial crime suspected
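Notify Datatilsynet (the Norwegian data protection authority) within 72 hours if user PII is compromised (GDPR Art. 33); report major operational or security incidents to Finanstilsynet where PSD2 incident reporting applies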
User communication if required by GDPR Art. 34

RTO: Immediate containment (session revocation) / 24–48 hours full investigation

7. RTO / RPO Summary

Scenario	RTO	RPO
App Runner restart	5–10 minutes	0 (no data loss)
App Runner rollback	15–20 minutes	0 (no data loss)
RDS snapshot restore	30 minutes	24 hours (daily snapshot)
RDS PITR restore	30 minutes	5 minutes
Data corruption (selective restore)	1–2 hours	Depends on snapshot age
Full region failover (eu-west-1)	2–4 hours	24 hours
Security incident containment	Immediate (session revocation)	0 (logs preserved)
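
The summary figures can be checked against the Tier 1 targets from section 2 (RTO 30 minutes, RPO 5 minutes). The sketch below is a simple arithmetic comparison using the upper bound of each range in the table. The function name is hypothetical, and data corruption and security incidents are left out because their figures are not expressed in minutes.

```shell
#!/usr/bin/env bash
# Sketch: flag scenarios whose worst-case RTO or RPO exceeds the Tier 1
# targets. Figures are the upper bounds from the summary table, in minutes.

# scenario|RTO minutes|RPO minutes
scenarios='App Runner restart|10|0
App Runner rollback|20|0
RDS snapshot restore|30|1440
RDS PITR restore|30|5
Full region failover|240|1440'

gaps_against_tier1() {
  local rto_target=30
  local rpo_target=5
  printf '%s\n' "$scenarios" | while IFS='|' read -r name rto rpo; do
    if [ "$rto" -gt "$rto_target" ] || [ "$rpo" -gt "$rpo_target" ]; then
      echo "$name: RTO ${rto}m, RPO ${rpo}m exceeds Tier 1 (RTO ${rto_target}m, RPO ${rpo_target}m)"
    fi
  done
}
```

Calling gaps_against_tier1 lists the snapshot restore and the full region failover, which suggests that Tier 1 services rely on PITR in-region and that a region outage currently falls outside the Tier 1 targets.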

8. Contacts

Role	Name	Contact
Primary Incident Owner	Alem Bašić (CEO)	[email protected] / +47 40 47 42 51
AI Operations	John (AI Director)	Slack `#drop-alerts`
AWS Support	AWS	Premium support via AWS Console
Fly.io Support (staging)	Fly.io	[email protected]
Sumsub Support	Sumsub	[email protected]
BankID Support	Vipps MobilePay (BankID operator)	Per contract

9. Runbook Maintenance

Review Schedule

Quarterly review: Verify all ARNs, endpoints, and commands still valid
After any incident: Update with lessons learned
Before major releases: Verify backup and rollback procedures work

Test Schedule

Q2 2026: Full DR drill — App Runner restart + RDS snapshot restore to temp instance
Quarterly: App Runner rollback test
Monthly: Verify automated RDS snapshot creation

Change Log

Date	Change	Author
2026-02-23	Initial version from DR runbook + infra analysis	Platform Architect (AI)

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
