DR Runbook
Drop — Disaster Recovery Runbook
Infrastructure Overview
Production Environment
- Service: AWS App Runner
- Region: eu-west-1 (Ireland)
- Service ARN:
arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec - Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com
- ECR Repository:
324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web
Database
- RDS Instance: drop-db
- Endpoint:
drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432 - Database Name: dropapp
- Username: dropuser
- Backup Strategy: Automated snapshots, 7-day retention
- Backup Window: 23:24-23:54 UTC daily
Staging Environment
- Platform: Fly.io
- App Name: drop-staging
- Region: arn (Stockholm)
- Database:
SQLitePostgreSQL 16 (ephemeralRDS,volume,eu-north-1,noorautomatedDockerbackup)in CI)
Domain
- Production: getdrop.no (future)
- Current: App Runner subdomain
Backup Strategy
RDS PostgreSQL (Production)
- Automated Snapshots: Daily at 23:24 UTC
- Retention Period: 7 days
- Point-in-Time Recovery: Enabled (5-minute granularity)
- Manual Snapshots: Created before major changes
- Storage: Same region (eu-west-1)
Fly.ioStaging SQLitePostgreSQL (Staging)RDS)
NoAutomatedautomated backupSnapshots:—Daily,ephemeral7-daystorageretention (same config as production)- Backup Method: Manual export via
flyctl ssh consoleandsqlite3 .backup - Recommended: Export before major changes
Recovery Procedures
Scenario 1: App Runner Service Down
Symptoms
- Service health checks failing
- 5xx errors from App Runner URL
- CloudWatch alarms triggered
Investigation Steps
# 1. Check service status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
# 2. View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow \
--since 10m \
--region eu-west-1
# 3. Check deployment history
aws apprunner list-operations \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
Recovery Actions
Option A: Restart Service
# Trigger new deployment (no code change)
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
# Monitor deployment status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' \
--region eu-west-1
Option B: Rollback to Previous Image
# 1. List recent ECR images
aws ecr describe-images \
--repository-name drop-web \
--region eu-west-1 \
--query 'sort_by(imageDetails,& imagePushedAt)[-5:]'
# 2. Update service to use previous image tag
# (Manual step: Update .github/workflows/deploy-aws.yml with previous tag and push)
# 3. Or update directly via App Runner console (rollback to previous deployment)
RTO: 5-10 minutes (restart) / 15-20 minutes (rollback)
Scenario 2: RDS Database Failure
Symptoms
- Connection timeouts to
drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com - Database errors in App Runner logs
- RDS CloudWatch metrics show instance down
Investigation Steps
# 1. Check RDS instance status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBInstances[0].DBInstanceStatus'
# 2. Check for automated snapshots
aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'
# 3. Review recent events
aws rds describe-events \
--source-identifier drop-db \
--source-type db-instance \
--region eu-west-1 \
--duration 60
Recovery Actions
Option A: Restore from Latest Automated Snapshot
# 1. Identify latest snapshot
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
echo "Latest snapshot: $LATEST_SNAPSHOT"
# 2. Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-restored \
--db-snapshot-identifier $LATEST_SNAPSHOT \
--db-instance-class db.t4g.micro \
--vpc-security-group-ids sg-XXXXX \
--db-subnet-group-name default \
--region eu-west-1
# 3. Wait for restore to complete (10-20 minutes)
aws rds wait db-instance-available \
--db-instance-identifier drop-db-restored \
--region eu-west-1
# 4. Update DATABASE_URL in App Runner
# (Manual step: Update environment variable via AWS Console or CLI)
# 5. Verify connection
NEW_ENDPOINT=$(aws rds describe-db-instances \
--db-instance-identifier drop-db-restored \
--query 'DBInstances[0].Endpoint.Address' \
--output text \
--region eu-west-1)
echo "New endpoint: $NEW_ENDPOINT"
Option B: Point-in-Time Recovery
# Restore to specific timestamp (e.g., 1 hour ago)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier drop-db \
--target-db-instance-identifier drop-db-pitr \
--restore-time $(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ') \
--db-instance-class db.t4g.micro \
--region eu-west-1
# Wait for restore
aws rds wait db-instance-available \
--db-instance-identifier drop-db-pitr \
--region eu-west-1
RPO: 24 hours (snapshot) / 5 minutes (PITR) RTO: 30 minutes (snapshot) / 30 minutes (PITR)
Scenario 3: Data Corruption
Symptoms
- Application reports data inconsistencies
- Missing or incorrect records in database
- User reports of lost data
Investigation Steps
# 1. Connect to RDS and inspect data
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"
# 2. Check audit_log table for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "SELECT * FROM audit_log WHERE action IN ('DELETE', 'UPDATE') ORDER BY timestamp DESC LIMIT 50;"
# 3. Identify time of corruption
# Review application logs and database query logs
Recovery Actions
Option A: Selective Data Restore (if corruption is isolated)
# 1. Create temporary snapshot of current state
aws rds create-db-snapshot \
--db-instance-identifier drop-db \
--db-snapshot-identifier drop-db-before-restore-$(date +%Y%m%d-%H%M) \
--region eu-west-1
# 2. Restore clean snapshot to temporary instance
CLEAN_SNAPSHOT=<snapshot-before-corruption>
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-temp \
--db-snapshot-identifier $CLEAN_SNAPSHOT \
--db-instance-class db.t4g.micro \
--region eu-west-1
# 3. Export affected tables from clean instance
pg_dump -h <temp-endpoint> \
-U dropuser \
-d dropapp \
-t users \
-t transactions \
--data-only \
> clean_data.sql
# 4. Selectively import into production (after verification)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
< clean_data.sql
# 5. Terminate temporary instance
aws rds delete-db-instance \
--db-instance-identifier drop-db-temp \
--skip-final-snapshot \
--region eu-west-1
Option B: Full Database Restore (see Scenario 2)
RTO: 1-2 hours (selective) / 30 minutes (full restore) RPO: Depends on snapshot age
Scenario 4: Full Region Outage (eu-west-1)
Current State
- No automated cross-region failover
- No replica in secondary region
- Manual failover required
Investigation Steps
# 1. Check AWS Service Health Dashboard
# https://health.aws.amazon.com/health/status
# 2. Verify RDS snapshots are accessible
aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1
# 3. Check ECR images (may need to copy to secondary region)
aws ecr describe-images \
--repository-name drop-web \
--region eu-west-1
Recovery Actions (Manual Failover to eu-north-1)
# 1. Copy latest RDS snapshot to eu-north-1
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST_SNAPSHOT \
--target-db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
--region eu-north-1
# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier drop-db-failover \
--db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
--db-instance-class db.t4g.micro \
--region eu-north-1
# 3. Copy ECR image to eu-north-1
# (Manual: create ECR repo in eu-north-1, retag and push latest image)
# 4. Deploy App Runner in eu-north-1
# (Manual: create new App Runner service via console with failover database endpoint)
# 5. Update DNS (when getdrop.no is active)
# Point getdrop.no to new App Runner URL
RTO: 2-4 hours (manual process) RPO: Last snapshot before outage (24 hours worst case, 5 minutes with PITR if available)
Scenario 5: Security Incident
Symptoms
- Suspicious database activity
- Unauthorized access attempts
- AML alerts triggered
- STR report filed
Investigation Steps
# 1. Check audit logs for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"
# 2. Review AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"
# 3. Check AWS CloudTrail for API activity
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
--region eu-west-1 \
--max-results 50
# 4. Review App Runner access logs
aws logs filter-log-events \
--log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--start-time $(date -u -d '24 hours ago' +%s)000 \
--region eu-west-1
Containment Actions
# 1. Revoke compromised sessions
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"
# 2. Temporarily disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"
# 3. Rotate database credentials
aws rds modify-db-instance \
--db-instance-identifier drop-db \
--master-user-password <new-password> \
--apply-immediately \
--region eu-west-1
# Update DATABASE_URL in App Runner with new password
# 4. Enable enhanced monitoring
aws rds modify-db-instance \
--db-instance-identifier drop-db \
--monitoring-interval 1 \
--monitoring-role-arn arn:aws:iam::324480209768:role/rds-monitoring-role \
--region eu-west-1
# 5. Take forensic snapshot
aws rds create-db-snapshot \
--db-instance-identifier drop-db \
--db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
--region eu-west-1
Investigation & Remediation
- Analyze audit logs — identify scope of breach
- File STR reports — if financial crime suspected (via
str_reportstable) - Notify Finanstilsynet — if user data compromised (GDPR requirement)
- Update security policies — patch vulnerabilities
- User communication — notify affected users if required by GDPR
RTO: Immediate containment (revoke sessions) / 24-48 hours full investigation
RTO/RPO Targets
| Scenario | RTO | RPO |
|---|---|---|
| App Runner restart | 5-10 minutes | 0 (no data loss) |
| App Runner rollback | 15-20 minutes | 0 (no data loss) |
| RDS snapshot restore | 30 minutes | 24 hours (last snapshot) |
| RDS PITR restore | 30 minutes | 5 minutes (PITR granularity) |
| Full region failover | 2-4 hours | 24 hours (manual process) |
| Security incident containment | Immediate | 0 (logs preserved) |
Contacts
Primary
- Alem Bašić (CEO): +47 40 47 42 51
- Email: [email protected]
AI Operations
- John (AI Director): Slack #drop-alerts channel
External Support
Runbook Maintenance
Review Schedule
- Quarterly review — verify all ARNs, endpoints, and procedures
- After incidents — update based on lessons learned
- Before major releases — verify backup and rollback procedures
Test Schedule
- Annually — full DR drill (restore from snapshot to temporary instance)
- Quarterly — App Runner restart and rollback tests
- Monthly — verify snapshot creation and retention
Change Log
| Date | Change | Author |
|---|---|---|
| 2026-02-18 | Initial version created | Builder 3 (AI) |
Appendix: Useful Commands
Quick Health Check
# Check App Runner status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' \
--output text \
--region eu-west-1
# Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--query 'DBInstances[0].DBInstanceStatus' \
--output text \
--region eu-west-1
# Check latest snapshot age
aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].SnapshotCreateTime' \
--output text
Database Connection Test
# Test connection from local machine
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "SELECT 1;"
Log Streaming
# Stream App Runner application logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow \
--region eu-west-1
# Stream RDS error logs
aws rds download-db-log-file-portion \
--db-instance-identifier drop-db \
--log-file-name error/postgresql.log \
--region eu-west-1