Disaster Recovery Plan

Project: Drop

Version: 0.1.0

Date: 2026-02-23

Author: Platform Architect (AI)

Status: In Review

Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Compiled from DR-RUNBOOK.md + infrastructure analysis

1. Business Continuity Overview

This plan documents procedures to recover Drop services following a disaster event (infrastructure failure, data corruption, security breach, or total region outage). Drop is a PSD2 pass-through payment application: it never holds customer funds, so a Drop infrastructure failure cannot cause customer money to be lost. The primary recovery concerns are service availability and data integrity.

Plan Owner: Alem Bašić (CEO), [email protected], +47 40 47 42 51

Plan Reviewer: John (AI Director), Slack #drop-alerts

Last Tested: TBD (initial DR drill not yet conducted)

Next Scheduled Test: Q2 2026 (App Runner restart + RDS snapshot restore)

Disaster types covered:

App Runner service failure or crash
RDS database failure or data corruption
Security incident (unauthorized access, credential compromise)
Full region outage (eu-west-1)
Catastrophic application failure (bad deployment)

2. RPO / RTO Targets Per Service Tier

Tier	Description	RPO	RTO	Drop Services
Tier 1 - Critical	Core user-facing services	5 minutes (PITR)	< 30 minutes	Auth (BankID), transactions (remittance + QR), health endpoint
Tier 2 - Important	Supporting features	< 24 hours (snapshot)	1 hour	Merchant dashboard, notifications, transaction history
Tier 3 - Standard	Background / admin	< 24 hours	< 24 hours	Audit logs, AML alerts, complaint records

3. Service Tier Classification

Service	Tier	Justification
BankID authentication	1	Users cannot transact without login
Remittance API (`/api/transactions/remittance`)	1	Core revenue feature
QR payment API (`/api/transactions/qr-payment`)	1	Core revenue feature
Bank account read (AISP)	1	Required for payment initiation
Health endpoint (`/api/health`)	1	Monitoring dependency
Transaction history	2	UX degraded, no blocking issue
Merchant dashboard	2	Merchant ops impacted
Notifications	2	UX degraded
Audit log	3	Compliance — retained, not real-time
AML alerts	3	Reviewed periodically, not real-time

4. Infrastructure Overview

Production

Service: AWS App Runner

Region: eu-west-1 (Ireland)

Service ARN: arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec

Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com

ECR Repository: 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web

Database (Production)

RDS Instance: drop-db

Endpoint: drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432

DB Name: dropapp

DB User: dropuser

Backup Strategy: Automated daily snapshots, 7-day retention

Backup Window: 23:24–23:54 UTC daily

PITR: Enabled (5-minute granularity)

Staging

Platform: Fly.io, region arn (Stockholm)

App Name: drop-staging

Database: SQLite ephemeral volume — no automated backup

5. Backup Strategy

Production RDS PostgreSQL

Automated Snapshots: Daily at 23:24 UTC

Retention Period: 7 days

Point-in-Time Recovery (PITR): Enabled — any point within last 7 days, 5-minute granularity

Manual Snapshots: Created before every major deployment or migration

Snapshot verification: Run quarterly

ECR Docker Images

All pushed images retained in ECR repository

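Rollback capability: Redeploy a previous image tag by pointing the App Runner service at that tag (console or aws apprunner update-service); aws apprunner start-deployment on its own redeploys the tag the service is already configured with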

Lifecycle policy: Delete untagged images after 7 days, keep last 10 tagged releases
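
The lifecycle policy above could be expressed as ECR policy JSON roughly as follows. This is a sketch, not the policy actually deployed: the `v` tag prefix is an assumption about how release images are tagged, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the ECR lifecycle policy described above.
# ASSUMPTION: release tags start with "v"; adjust tagPrefixList to the real scheme.
lifecycle_policy_json() {
  cat <<'JSON'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images 7 days after push",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 7
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Keep only the 10 most recent tagged releases",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["v"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    }
  ]
}
JSON
}

# Apply the policy to the drop-web repository in eu-west-1.
apply_lifecycle_policy() {
  aws ecr put-lifecycle-policy \
    --repository-name drop-web \
    --region eu-west-1 \
    --lifecycle-policy-text "$(lifecycle_policy_json)"
}
```

Running `apply_lifecycle_policy` once is enough; ECR evaluates the rules on its own schedule afterwards.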

Staging (Fly.io)

No automated backup — ephemeral SQLite storage

Manual backup procedure:

flyctl ssh console -a drop-staging
sqlite3 /app/data/drop.db ".backup /app/data/backup-$(date +%Y%m%d).db"

6. Recovery Procedures

Scenario 1: App Runner Service Down

Symptoms:

BetterStack alert: Drop Health Check is DOWN

Slack #drop-ops: critical alert

App Runner service status not RUNNING

Investigation:

# Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow --since 10m --region eu-west-1

# Check deployment history
aws apprunner list-operations \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

Recovery Option A: Restart (preferred)

aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

RTO: 5–10 minutes | RPO: 0 (no data loss)
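
After triggering the restart, it can help to poll until App Runner reports the service as RUNNING instead of watching the console. A minimal sketch, assuming a 15-second poll interval and 40 attempts (about 10 minutes, the upper end of the RTO above); both values and the function name are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: poll App Runner until the service reports RUNNING.
# ASSUMPTION: attempt count and interval are illustrative defaults, not plan values.
SERVICE_ARN="arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec"

# Usage: wait_for_running [attempts] [interval_seconds]
wait_for_running() {
  local attempts=${1:-40} interval=${2:-15} status i=1
  while [ "$i" -le "$attempts" ]; do
    status=$(aws apprunner describe-service \
      --service-arn "$SERVICE_ARN" --region eu-west-1 \
      --query 'Service.Status' --output text)
    echo "attempt $i: $status"
    [ "$status" = "RUNNING" ] && return 0
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

Once `wait_for_running` succeeds, `curl -fsS https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health` confirms that the application itself is answering.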

Recovery Option B: Rollback to previous image

# List recent ECR images
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:]'

# Update App Runner image tag via console or update the deployment workflow
# Then trigger new deployment

RTO: 15–20 minutes | RPO: 0 (no data loss)

Scenario 2: RDS Database Failure

Symptoms:

/api/health returns {"status":"down"} (HTTP 503)

BetterStack + Slack alerts fire

App Runner logs show connection timeout to RDS

Investigation:

# Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBInstances[0].DBInstanceStatus'

# Check available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'

# Check events
aws rds describe-events \
  --source-identifier drop-db \
  --source-type db-instance \
  --region eu-west-1 --duration 60

Recovery Option A: Restore from automated snapshot

LATEST=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-restored \
  --db-snapshot-identifier $LATEST \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

aws rds wait db-instance-available --db-instance-identifier drop-db-restored --region eu-west-1

# Update DATABASE_URL in App Runner environment with new endpoint
NEW_EP=$(aws rds describe-db-instances --db-instance-identifier drop-db-restored \
  --query 'DBInstances[0].Endpoint.Address' --output text --region eu-west-1)

RTO: 30 minutes | RPO: 24 hours (last snapshot)

Recovery Option B: Point-in-Time Recovery (PITR)

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier drop-db \
  --target-db-instance-identifier drop-db-pitr \
  --restore-time $(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ') \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

RTO: 30 minutes | RPO: 5 minutes (PITR granularity)
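
Before starting a PITR restore, the chosen restore time can be checked against the instance's latest restorable time. The sketch below assumes GNU date (as the command above already does); the helper names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: pick a PITR restore time and confirm RDS can still restore to it.
# ASSUMPTION: GNU date is available (the -d flag behaves differently on macOS).

# UTC timestamp N minutes before now, in the format --restore-time expects.
restore_time() {
  date -u -d "$1 minutes ago" '+%Y-%m-%dT%H:%M:%SZ'
}

# Succeeds when the requested time is not later than LatestRestorableTime.
is_restorable() {
  local want=$1 latest
  latest=$(aws rds describe-db-instances \
    --db-instance-identifier drop-db --region eu-west-1 \
    --query 'DBInstances[0].LatestRestorableTime' --output text) || return 2
  [ "$(date -u -d "$want" +%s)" -le "$(date -u -d "$latest" +%s)" ]
}
```

For example, `T=$(restore_time 60); is_restorable "$T"` can be run first, and `"$T"` then passed as `--restore-time` so the check and the restore use the same instant.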

Scenario 3: Data Corruption

Symptoms:

Application reports data inconsistencies

User-reported missing or incorrect transactions

Audit log shows unexpected DELETE/UPDATE operations

Investigation:

# Check for soft-deleted users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"

# Check recent suspicious audit log entries
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM audit_log WHERE action IN ('DELETE','UPDATE') ORDER BY timestamp DESC LIMIT 50;"

Recovery: Selective restore from clean snapshot (see Scenario 2 recovery steps) + merge affected tables.

RTO: 1–2 hours (selective) | RPO: Depends on snapshot age
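
One way to prepare the selective restore is to export only the affected tables from the restored instance and review them before merging. This is a sketch under assumptions: `RESTORED_HOST` is a placeholder for the endpoint of the restored instance, and the merge itself is left manual because it depends on which rows were damaged.

```shell
#!/usr/bin/env bash
# Sketch: export one affected table from a restored instance for manual review.
# ASSUMPTION: RESTORED_HOST holds the endpoint of drop-db-restored or drop-db-pitr.
RESTORED_HOST="${RESTORED_HOST:-restored-endpoint.invalid}"

# Dump the rows of one table (data only) to <table>.clean.sql and print the path.
dump_clean_table() {
  local table=$1
  local out="${table}.clean.sql"
  pg_dump -h "$RESTORED_HOST" -U dropuser -d dropapp \
    --data-only --table="$table" -f "$out" || return 1
  echo "$out"
}
```

Calling `dump_clean_table users` writes `users.clean.sql`, which can be diffed against production before any rows are written back.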

Scenario 4: Full Region Outage (eu-west-1)

Current State: No automated cross-region failover. Manual failover to eu-north-1 (Stockholm) required.

Investigation:

Check AWS Health Dashboard: https://health.aws.amazon.com/health/status

Verify RDS snapshot accessibility from eu-west-1

Manual Failover to eu-north-1:

# 1. Copy latest RDS snapshot to eu-north-1
LATEST=$(aws rds describe-db-snapshots --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST \
  --target-db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
  --region eu-north-1

# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-failover \
  --db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
  --db-instance-class db.t4g.micro \
  --region eu-north-1

# 3. Create ECR repository in eu-north-1 and push latest image
# 4. Create App Runner service in eu-north-1
# 5. Update DNS when getdrop.no is active

RTO: 2–4 hours (manual) | RPO: Up to 24 hours (last snapshot)
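
The snapshot copy in step 1 has to finish before the restore in step 2 can start, and the restored instance has to become available before its endpoint is usable. A sketch of those two waits using the AWS CLI waiters; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: the waits implied between the failover steps above.
REGION="eu-north-1"

# Run between steps 1 and 2: block until the copied snapshot is usable.
wait_for_snapshot() {
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$1" --region "$REGION"
}

# Run after step 2: block until the instance is available, then print its endpoint.
restored_endpoint() {
  aws rds wait db-instance-available \
    --db-instance-identifier "$1" --region "$REGION" || return 1
  aws rds describe-db-instances \
    --db-instance-identifier "$1" --region "$REGION" \
    --query 'DBInstances[0].Endpoint.Address' --output text
}
```

For example, `wait_for_snapshot "drop-db-failover-$(date +%Y%m%d)"` before the restore, then `restored_endpoint drop-db-failover` to obtain the host for the new DATABASE_URL.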

Scenario 5: Security Incident

Symptoms:

Suspicious audit log entries

Unauthorized access attempts

AML alerts triggered for unusual activity

Sumsub KYC bypass attempt

Investigation:

# Check audit log for recent suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"

# Check AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"

# Check CloudTrail for AWS API activity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
  --region eu-west-1 --max-results 50

Containment:

# 1. Revoke compromised sessions immediately
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"

# 2. Disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"

# 3. Rotate database credentials
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password <new-password> \
  --apply-immediately --region eu-west-1

# 4. Take forensic snapshot
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

# 5. Rotate JWT_SECRET (invalidates all sessions)
# Generate: openssl rand -base64 48
# Update in AWS Secrets Manager + redeploy App Runner

Post-containment:

Analyze audit logs — identify scope of breach

File STR (Suspicious Transaction Report) if financial crime suspected

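Notify Datatilsynet if user PII is compromised (GDPR Art. 33, 72-hour window); Finanstilsynet may also need to be notified if the event qualifies as a major operational or security incident under PSD2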

User communication if required by GDPR Art. 34

RTO: Immediate containment (session revocation) / 24–48 hours full investigation
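
Containment step 5 (JWT_SECRET rotation) could be scripted along these lines. This is a sketch: the secret name `drop/jwt-secret` is hypothetical, and it assumes the App Runner service reads the secret from Secrets Manager at deploy time.

```shell
#!/usr/bin/env bash
# Sketch: rotate JWT_SECRET and redeploy so all existing sessions are invalidated.
# ASSUMPTION: the secret name below is hypothetical; use the real secret ID.
JWT_SECRET_ID="drop/jwt-secret"
SERVICE_ARN="arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec"

rotate_jwt_secret() {
  local secret
  # Same generator as in the containment steps above.
  secret=$(openssl rand -base64 48) || return 1
  aws secretsmanager put-secret-value \
    --secret-id "$JWT_SECRET_ID" --secret-string "$secret" \
    --region eu-west-1 >/dev/null || return 1
  # Redeploy so running instances pick up the new value.
  aws apprunner start-deployment \
    --service-arn "$SERVICE_ARN" --region eu-west-1 >/dev/null || return 1
  echo "rotated $JWT_SECRET_ID and triggered redeploy"
}
```

Passing the secret on the command line exposes it briefly in the local process list, so this is best run from a trusted operator machine.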

7. RTO / RPO Summary

Scenario	RTO	RPO
App Runner restart	5–10 minutes	0 (no data loss)
App Runner rollback	15–20 minutes	0 (no data loss)
RDS snapshot restore	30 minutes	24 hours (daily snapshot)
RDS PITR restore	30 minutes	5 minutes
Full region failover (eu-west-1)	2–4 hours	24 hours
Security incident containment	Immediate (session revocation)	0 (logs preserved)

8. Contacts

Role	Name	Contact
Primary Incident Owner	Alem Bašić (CEO)	[email protected] / +47 40 47 42 51
AI Operations	John (AI Director)	Slack `#drop-alerts`
AWS Support	AWS	Premium support via AWS Console
Fly.io Support (staging)	Fly.io	[email protected]
Sumsub Support	Sumsub	[email protected]
BankID Support	Vipps MobilePay (BankID operator)	Per contract

9. Runbook Maintenance

Review Schedule

Test Schedule

Q2 2026: Full DR drill (App Runner restart + RDS snapshot restore to temp instance)

Quarterly: App Runner rollback test

Monthly: Verify automated RDS snapshot creation
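
The monthly snapshot check could be automated with a small script like the one below. It is a sketch: the 26-hour threshold (one daily snapshot plus some slack) is an assumption, and GNU date is required.

```shell
#!/usr/bin/env bash
# Sketch: verify that the newest automated RDS snapshot for drop-db is recent.
# ASSUMPTION: 26 hours (daily snapshot + slack) is an illustrative threshold.
MAX_AGE_HOURS=26

# Whole hours between two ISO-8601 timestamps (GNU date required).
hours_between() {
  local start_s end_s
  start_s=$(date -u -d "$1" +%s) || return 1
  end_s=$(date -u -d "$2" +%s) || return 1
  echo $(( (end_s - start_s) / 3600 ))
}

# Usage: check_latest_snapshot [now_iso]; succeeds if the snapshot is fresh enough.
check_latest_snapshot() {
  local created now age
  created=$(aws rds describe-db-snapshots \
    --db-instance-identifier drop-db --region eu-west-1 \
    --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].SnapshotCreateTime' \
    --output text) || return 2
  now=${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  age=$(hours_between "$created" "$now") || return 2
  echo "latest automated snapshot: $created (${age}h old)"
  [ "$age" -lt "$MAX_AGE_HOURS" ]
}
```

A non-zero exit from `check_latest_snapshot` means either that no automated snapshot was found or that the newest one is older than the threshold.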

Change Log

Date	Change	Author
2026-02-23	Initial version from DR runbook + infra analysis	Platform Architect (AI)

Related Documents

Deployment Architecture

Runbook

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
