DR Runbook

Drop — Disaster Recovery Runbook

Infrastructure Overview

Production Environment

Service: AWS App Runner
Region: eu-west-1 (Ireland)
Service ARN: arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec
Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com
ECR Repository: 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web

Database

RDS Instance: drop-db
Endpoint: drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432
Database Name: dropapp
Username: dropuser
Backup Strategy: Automated snapshots, 7-day retention
Backup Window: 23:24-23:54 UTC daily

Staging Environment

Platform: Fly.io
App Name: drop-staging
Region: arn (Stockholm)
Database: ~~SQLite~~PostgreSQL 16 (~~ephemeral~~RDS, ~~volume,~~eu-north-1, noor ~~automated~~Docker ~~backup)~~in CI)

Domain

Production: getdrop.no (future)
Current: App Runner subdomain

Backup Strategy

RDS PostgreSQL (Production)

Automated Snapshots: Daily at 23:24 UTC
Retention Period: 7 days
Point-in-Time Recovery: Enabled (5-minute granularity)
Manual Snapshots: Created before major changes
Storage: Same region (eu-west-1)

Fly.ioStaging SQLitePostgreSQL (Staging)RDS)

NoAutomated ~~automated backup~~Snapshots: —Daily, ~~ephemeral~~7-day ~~storage~~retention (same config as production)
Backup Method: Manual export via flyctl ssh console and sqlite3 .backup
Recommended: Export before major changes

Recovery Procedures

Scenario 1: App Runner Service Down

Symptoms

Service health checks failing
5xx errors from App Runner URL
CloudWatch alarms triggered

Investigation Steps

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# 2. View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --since 10m \
  --region eu-west-1

# 3. Check deployment history
aws apprunner list-operations \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

Recovery Actions

Option A: Restart Service

# Trigger new deployment (no code change)
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# Monitor deployment status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' \
  --region eu-west-1

Option B: Rollback to Previous Image

# 1. List recent ECR images
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1 \
  --query 'sort_by(imageDetails,& imagePushedAt)[-5:]'

# 2. Update service to use previous image tag
# (Manual step: Update .github/workflows/deploy-aws.yml with previous tag and push)

# 3. Or update directly via App Runner console (rollback to previous deployment)

RTO: 5-10 minutes (restart) / 15-20 minutes (rollback)

Scenario 2: RDS Database Failure

Symptoms

Connection timeouts to drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com
Database errors in App Runner logs
RDS CloudWatch metrics show instance down

Investigation Steps

# 1. Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBInstances[0].DBInstanceStatus'

# 2. Check for automated snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'

# 3. Review recent events
aws rds describe-events \
  --source-identifier drop-db \
  --source-type db-instance \
  --region eu-west-1 \
  --duration 60

Recovery Actions

Option A: Restore from Latest Automated Snapshot

# 1. Identify latest snapshot
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

echo "Latest snapshot: $LATEST_SNAPSHOT"

# 2. Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-restored \
  --db-snapshot-identifier $LATEST_SNAPSHOT \
  --db-instance-class db.t4g.micro \
  --vpc-security-group-ids sg-XXXXX \
  --db-subnet-group-name default \
  --region eu-west-1

# 3. Wait for restore to complete (10-20 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier drop-db-restored \
  --region eu-west-1

# 4. Update DATABASE_URL in App Runner
# (Manual step: Update environment variable via AWS Console or CLI)

# 5. Verify connection
NEW_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier drop-db-restored \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text \
  --region eu-west-1)

echo "New endpoint: $NEW_ENDPOINT"

Option B: Point-in-Time Recovery

# Restore to specific timestamp (e.g., 1 hour ago)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier drop-db \
  --target-db-instance-identifier drop-db-pitr \
  --restore-time $(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ') \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

# Wait for restore
aws rds wait db-instance-available \
  --db-instance-identifier drop-db-pitr \
  --region eu-west-1

RPO: 24 hours (snapshot) / 5 minutes (PITR) RTO: 30 minutes (snapshot) / 30 minutes (PITR)

Scenario 3: Data Corruption

Symptoms

Application reports data inconsistencies
Missing or incorrect records in database
User reports of lost data

Investigation Steps

# 1. Connect to RDS and inspect data
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"

# 2. Check audit_log table for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT * FROM audit_log WHERE action IN ('DELETE', 'UPDATE') ORDER BY timestamp DESC LIMIT 50;"

# 3. Identify time of corruption
# Review application logs and database query logs

Recovery Actions

Option A: Selective Data Restore (if corruption is isolated)

# 1. Create temporary snapshot of current state
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-before-restore-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

# 2. Restore clean snapshot to temporary instance
CLEAN_SNAPSHOT=<snapshot-before-corruption>

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-temp \
  --db-snapshot-identifier $CLEAN_SNAPSHOT \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

# 3. Export affected tables from clean instance
pg_dump -h <temp-endpoint> \
        -U dropuser \
        -d dropapp \
        -t users \
        -t transactions \
        --data-only \
        > clean_data.sql

# 4. Selectively import into production (after verification)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     < clean_data.sql

# 5. Terminate temporary instance
aws rds delete-db-instance \
  --db-instance-identifier drop-db-temp \
  --skip-final-snapshot \
  --region eu-west-1

Option B: Full Database Restore (see Scenario 2)

RTO: 1-2 hours (selective) / 30 minutes (full restore) RPO: Depends on snapshot age

Scenario 4: Full Region Outage (eu-west-1)

Current State

No automated cross-region failover
No replica in secondary region
Manual failover required

Investigation Steps

# 1. Check AWS Service Health Dashboard
# https://health.aws.amazon.com/health/status

# 2. Verify RDS snapshots are accessible
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1

# 3. Check ECR images (may need to copy to secondary region)
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1

Recovery Actions (Manual Failover to eu-north-1)

# 1. Copy latest RDS snapshot to eu-north-1
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST_SNAPSHOT \
  --target-db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
  --region eu-north-1

# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-failover \
  --db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
  --db-instance-class db.t4g.micro \
  --region eu-north-1

# 3. Copy ECR image to eu-north-1
# (Manual: create ECR repo in eu-north-1, retag and push latest image)

# 4. Deploy App Runner in eu-north-1
# (Manual: create new App Runner service via console with failover database endpoint)

# 5. Update DNS (when getdrop.no is active)
# Point getdrop.no to new App Runner URL

RTO: 2-4 hours (manual process) RPO: Last snapshot before outage (24 hours worst case, 5 minutes with PITR if available)

Scenario 5: Security Incident

Symptoms

Suspicious database activity
Unauthorized access attempts
AML alerts triggered
STR report filed

Investigation Steps

# 1. Check audit logs for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"

# 2. Review AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"

# 3. Check AWS CloudTrail for API activity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
  --region eu-west-1 \
  --max-results 50

# 4. Review App Runner access logs
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '24 hours ago' +%s)000 \
  --region eu-west-1

Containment Actions

# 1. Revoke compromised sessions
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"

# 2. Temporarily disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"

# 3. Rotate database credentials
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password <new-password> \
  --apply-immediately \
  --region eu-west-1

# Update DATABASE_URL in App Runner with new password

# 4. Enable enhanced monitoring
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --monitoring-interval 1 \
  --monitoring-role-arn arn:aws:iam::324480209768:role/rds-monitoring-role \
  --region eu-west-1

# 5. Take forensic snapshot
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

Investigation & Remediation

Analyze audit logs — identify scope of breach
File STR reports — if financial crime suspected (via str_reports table)
Notify Finanstilsynet — if user data compromised (GDPR requirement)
Update security policies — patch vulnerabilities
User communication — notify affected users if required by GDPR

RTO: Immediate containment (revoke sessions) / 24-48 hours full investigation

RTO/RPO Targets

Scenario	RTO	RPO
App Runner restart	5-10 minutes	0 (no data loss)
App Runner rollback	15-20 minutes	0 (no data loss)
RDS snapshot restore	30 minutes	24 hours (last snapshot)
RDS PITR restore	30 minutes	5 minutes (PITR granularity)
Full region failover	2-4 hours	24 hours (manual process)
Security incident containment	Immediate	0 (logs preserved)

Contacts

Primary

Alem Bašić (CEO): +47 40 47 42 51
Email: [email protected]

AI Operations

John (AI Director): Slack #drop-alerts channel

External Support

AWS Support: Premium support via AWS Console
Fly.io Support: Email [email protected]

Runbook Maintenance

Review Schedule

Quarterly review — verify all ARNs, endpoints, and procedures
After incidents — update based on lessons learned
Before major releases — verify backup and rollback procedures

Test Schedule

Annually — full DR drill (restore from snapshot to temporary instance)
Quarterly — App Runner restart and rollback tests
Monthly — verify snapshot creation and retention

Change Log

Date	Change	Author
2026-02-18	Initial version created	Builder 3 (AI)

Appendix: Useful Commands

Quick Health Check

# Check App Runner status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' \
  --output text \
  --region eu-west-1

# Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text \
  --region eu-west-1

# Check latest snapshot age
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].SnapshotCreateTime' \
  --output text

Database Connection Test

# Test connection from local machine
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT 1;"

Log Streaming

# Stream App Runner application logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Stream RDS error logs
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1