Operational Runbook

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Initial draft covering day-to-day Drop operations

1. Service Overview

This runbook covers day-to-day operations of Drop's production environment. Drop runs on AWS App Runner (eu-west-1) with RDS PostgreSQL.

Primary operations contact: Alem Bašić, [email protected] / +47 40 47 42 51
AI Operations: John (AI Director), Slack #drop-alerts

2. Quick Reference

Production Infrastructure

Component	Identifier
App Runner service	`arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec`
App Runner URL	`https://9ef3szvvsb.eu-west-1.awsapprunner.com`
RDS instance	`drop-db`
RDS endpoint	`drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432`
ECR repository	`324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web`
Staging	`https://drop-staging.fly.dev`
Status page	`https://drop-status.betteruptime.com`
Slack alerts	`#drop-ops` on `alai-talk.slack.com`

Quick Health Check

# Application health (production)
curl -s https://getdrop.no/api/health | jq

# App Runner status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# Live App Runner logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow --region eu-west-1

3. Routine Operations

3.1 Daily Checks

BetterStack: all 3 monitors green (health, landing, US east)

Slack #drop-ops: no unresolved critical alerts from last 24h

App Runner service status: RUNNING

RDS snapshot from last night: exists and < 24h old

# Verify last RDS snapshot
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].{id:DBSnapshotIdentifier,time:SnapshotCreateTime}' \
  --output table
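The "exists and < 24h old" rule from the daily checks can be evaluated mechanically. The following is a minimal sketch, assuming GNU `date` (already used elsewhere in this runbook); the helper name `snapshot_is_fresh` is hypothetical and not part of any existing Drop tooling. It takes the snapshot's `SnapshotCreateTime` as an ISO-8601 string.

```shell
# Hypothetical helper: exit 0 if the given ISO-8601 timestamp is less than
# MAX_AGE_HOURS old (default 24), exit 1 if older, exit 2 if unparsable.
# Assumes GNU date.
snapshot_is_fresh() {
  local created="$1" max_hours="${2:-24}"
  local created_s now_s
  created_s=$(date -u -d "$created" +%s 2>/dev/null) || return 2
  now_s=$(date -u +%s)
  [ $(( now_s - created_s )) -lt $(( max_hours * 3600 )) ]
}

# Example use with the snapshot query from this section (not run here):
#   created=$(aws rds describe-db-snapshots \
#     --db-instance-identifier drop-db --region eu-west-1 \
#     --query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].SnapshotCreateTime' \
#     --output text)
#   snapshot_is_fresh "$created" || echo "WARNING: last automated snapshot is stale or missing"
```

If no automated snapshot exists, the query is expected to print `None`, which fails to parse and makes the helper return 2, so the warning fires in that case as well.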

3.2 Weekly Checks

Review CloudWatch logs for recurring error patterns

Check RDS free storage space (alert if < 2GB)

Review AML alerts table for any open cases

Review pending KYC applicants (stuck in pending status > 24h)

Check ECR — clean up untagged images manually if lifecycle policy hasn't run

# Check RDS storage
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=drop-db \
  --start-time $(date -u -d '1 hour ago' --iso-8601=seconds) \
  --end-time $(date -u --iso-8601=seconds) \
  --period 3600 \
  --statistics Average \
  --region eu-west-1
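The "alert if < 2GB" threshold can be applied to the metric value directly. This is a sketch under stated assumptions: `FreeStorageSpace` is reported in bytes, "2GB" is interpreted here as 2 GiB (2147483648 bytes), and the helper name `storage_is_low` is hypothetical.

```shell
# Hypothetical helper: exit 0 if free storage (in bytes) is below the
# threshold (default 2 GiB), exit 1 if not, exit 2 if no datapoint was given.
storage_is_low() {
  local free_bytes="$1" threshold_bytes="${2:-2147483648}"
  case "$free_bytes" in
    ''|None) return 2 ;;   # empty result or no datapoint in the window
  esac
  # awk handles the floating-point values CloudWatch returns
  awk -v free="$free_bytes" -v limit="$threshold_bytes" \
    'BEGIN { exit !(free + 0 < limit + 0) }'
}

# Example use (not run here): append these options to the command above
#   --query 'Datapoints[0].Average' --output text
# and pass the result in:
#   storage_is_low "$free" && echo "ALERT: RDS free storage below 2 GiB"
```

A return code of 2 means the metric query returned no datapoint; that usually calls for widening the time window rather than raising an alert.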

# Check pending KYC (connect to RDS first via bastion or VPN)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "SELECT id, email, kyc_status, created_at FROM users WHERE kyc_status = 'pending' ORDER BY created_at ASC;"

3.3 Monthly Checks

Review SLA report (uptime, error rate, p99 latency)

Test BetterStack alerts (pause monitor → verify escalation fires → resume)

Verify RDS snapshot restore works (restore to temp instance, verify data, delete)

Review secret rotation schedule — anything due?

Review STR reports table — any pending filings?

4. Deployment Procedure

4.1 Standard Deployment (App Runner)

# 1. Ensure all CI checks pass on main branch
# 2. Build and push new Docker image to ECR
docker build -t drop-app .
docker tag drop-app:latest 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web:$(git rev-parse --short HEAD)
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin 324480209768.dkr.ecr.eu-west-1.amazonaws.com
docker push 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web:$(git rev-parse --short HEAD)

# 3. Create pre-deployment RDS snapshot
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-pre-deploy-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

# 4. Create BetterStack maintenance window (prevents false alerts)
# Go to BetterStack → Maintenance Windows → Create Window (30 min)

# 5. Trigger App Runner deployment
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# 6. Monitor deployment status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1
# Wait for RUNNING

# 7. Verify health
curl -s https://getdrop.no/api/health | jq

# 8. Close BetterStack maintenance window

Typical deployment time: 3–5 minutes
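Step 6 above means re-running the status command until it reports RUNNING. A polling loop can do that; the sketch below is one possible shape, not an existing Drop script. The helper name `wait_for_status` and the timeout and interval values in the example are assumptions. While a deployment is in flight, App Runner reports `OPERATION_IN_PROGRESS`.

```shell
# Hypothetical helper: poll a status command until it prints the wanted value.
# Usage: wait_for_status WANTED TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Exit 0 once COMMAND prints WANTED; exit 1 if TIMEOUT_SECONDS elapse first.
wait_for_status() {
  local wanted="$1" timeout="$2" interval="$3"
  shift 3
  local waited=0 status
  while :; do
    # A failing status command is treated as an unknown status, not a hard error
    status=$("$@" 2>/dev/null) || status="UNKNOWN"
    if [ "$status" = "$wanted" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s; last status: ${status}" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Example use with the step 6 command (not run here):
#   wait_for_status RUNNING 600 15 \
#     aws apprunner describe-service \
#       --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
#       --query 'Service.Status' --output text --region eu-west-1
```

The 600-second timeout in the example is about twice the typical deployment time stated above; adjust it if deployments regularly take longer.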

4.2 Staging Deployment (Fly.io)

# Deploy to Fly.io staging
cd src/drop-app
fly deploy --app drop-staging

# Verify staging health
curl -s https://drop-staging.fly.dev/api/health | jq

4.3 Emergency Rollback

# Identify previous ECR image
aws ecr describe-images --repository-name drop-web --region eu-west-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-2].imageDigest' --output text

# Update App Runner to use previous image tag via console,
# then trigger deployment:
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1
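Because the standard deployment tags images with the short git SHA, a rollback usually needs the previous tag rather than the digest. The sketch below picks it from a list of `PUSHED_AT<TAB>TAG` lines. The helper name `previous_image_tag` is hypothetical, and it assumes the timestamps sort lexically (ISO-8601 with a common UTC offset, as AWS CLI v2 prints them in text output).

```shell
# Hypothetical helper: read "PUSHED_AT<TAB>TAG" lines on stdin and print the
# tag of the second-most-recent tagged image. Exits 1 if fewer than two
# tagged images are present. Untagged images (tag "None" or empty) are skipped.
previous_image_tag() {
  local lines
  lines=$(awk -F'\t' '$2 != "" && $2 != "None"' | sort)
  [ "$(printf '%s\n' "$lines" | grep -c .)" -ge 2 ] || return 1
  printf '%s\n' "$lines" | tail -n 2 | head -n 1 | cut -f2
}

# Example use (not run here):
#   aws ecr describe-images --repository-name drop-web --region eu-west-1 \
#     --query 'imageDetails[].[imagePushedAt,imageTags[0]]' --output text \
#     | previous_image_tag
```

Check the printed tag against the git history before pointing App Runner at it, since an image pushed but never deployed would also appear in this list.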

5. Secret Rotation

5.1 Rotate `JWT_SECRET`

Impact: All active user sessions immediately invalidated. All logged-in users are logged out.

# 1. Generate new secret
NEW_SECRET=$(openssl rand -base64 48)

# 2. Update in AWS Secrets Manager
aws secretsmanager update-secret \
  --secret-id drop/production/jwt-secret \
  --secret-string "$NEW_SECRET" \
  --region eu-west-1

# 3. Update App Runner environment variable (via console or CLI)
# Then trigger new deployment

# 4. Log rotation in audit_log
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "INSERT INTO audit_log (id, action, resource_type, resource_id, details) VALUES (gen_random_uuid(), 'secret_rotated', 'secret', 'JWT_SECRET', '{\"rotated_at\": \"$(date -u --iso-8601=seconds)\"}');"

5.2 Rotate Database Password

# 1. Generate new password
# Hex output avoids '/', '@', and quote characters, which RDS rejects in master passwords
NEW_PASS=$(openssl rand -hex 32)

# 2. Update RDS master password
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password "$NEW_PASS" \
  --apply-immediately \
  --region eu-west-1

# 3. Update DATABASE_URL in Secrets Manager with new password
# 4. Trigger App Runner redeployment to pick up new DATABASE_URL
# 5. Verify health: curl https://getdrop.no/api/health

6. Database Operations

6.1 Connect to Production Database

Note: RDS must be accessible — either via VPN, bastion host, or AWS Systems Manager Session Manager.

psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT 1;"

6.2 User Management Queries

-- Check user KYC status
SELECT id, email, kyc_status, auth_provider, created_at
FROM users WHERE email = '[email protected]';

-- List pending KYC users (> 24h)
SELECT id, email, kyc_status, created_at FROM users
WHERE kyc_status = 'pending'
  AND created_at < NOW() - INTERVAL '24 hours'
ORDER BY created_at ASC;

-- Revoke all sessions for a specific user (emergency)
UPDATE sessions SET revoked = 1
WHERE user_id = 'usr_...' AND revoked = 0;

-- Soft-delete user (GDPR erasure)
UPDATE users SET deleted_at = NOW() WHERE id = 'usr_...';
UPDATE sessions SET revoked = 1 WHERE user_id = 'usr_...';

6.3 Transaction Queries

-- Recent transactions (last 24h)
SELECT id, type, status, send_amount, send_currency, created_at
FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
ORDER BY created_at DESC LIMIT 50;

-- Failed transactions (may need investigation)
SELECT t.*, u.email FROM transactions t
JOIN users u ON t.user_id = u.id
WHERE t.status = 'failed'
  AND t.created_at > NOW() - INTERVAL '7 days'
ORDER BY t.created_at DESC;

-- AML: large transactions (> NOK 50,000)
SELECT * FROM transactions
WHERE send_amount > 50000
  AND created_at > NOW() - INTERVAL '30 days'
ORDER BY send_amount DESC;

6.4 Manual RDS Snapshot

# Create manual snapshot before risky operations
# Compute the identifier once so the wait below targets the same snapshot
SNAPSHOT_ID="drop-db-manual-$(date +%Y%m%d-%H%M)"
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --region eu-west-1

# Wait for snapshot to complete
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --region eu-west-1

7. AML & Compliance Operations

7.1 AML Alert Review

-- View open AML alerts
SELECT a.*, u.email, t.send_amount, t.send_currency
FROM aml_alerts a
JOIN users u ON a.user_id = u.id
LEFT JOIN transactions t ON a.transaction_id = t.id
WHERE a.status = 'open'
ORDER BY a.created_at DESC;

-- Close an AML alert (after review)
UPDATE aml_alerts SET status = 'closed', reviewed_at = NOW(),
  reviewer_notes = 'Reviewed — legitimate transaction'
WHERE id = 'alert_...';

7.2 STR Filing

If financial crime is suspected:

-- File STR
INSERT INTO str_reports (
  id, user_id, transaction_id, report_type, details, filed_at, status
) VALUES (
  gen_random_uuid(), 'usr_...', 'tx_...', 'suspicious_transaction',
  '{"reason": "Unusual pattern", "amount": 50000}',
  NOW(), 'filed'
);

Then contact Finanstilsynet via the official STR filing portal.

7.3 GDPR Requests

Data export request:

-- User data is exported via /api/user/data-export endpoint
-- Check data_access_requests table
SELECT * FROM data_access_requests WHERE user_id = 'usr_...' ORDER BY created_at DESC;

Erasure request:

-- Account deletion (soft delete)
UPDATE users SET deleted_at = NOW() WHERE id = 'usr_...';
UPDATE sessions SET revoked = 1 WHERE user_id = 'usr_...';
-- Note: data retained for 5 years per hvitvaskingsloven

8. Incident Response

8.1 Alert Triage

When a Slack alert fires in #drop-ops:

Alert	First Response	Escalation
Health check DOWN	Run quick health check, check App Runner logs	After 5 min: restart App Runner
Error spike	Check CloudWatch logs for error pattern	After 10 min: escalate
App startup/shutdown	Informational; no action unless unexpected	N/A

8.2 Common Issues

Issue: Health check returns 503 (DB unreachable)

# 1. Check RDS status
aws rds describe-db-instances --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1
# 2. If not 'available', wait for AWS to auto-recover or follow DR Scenario 2
# 3. Check connection string in App Runner environment
# 4. Restart App Runner service

Issue: BankID login failing

# Check App Runner logs for BankID errors
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --filter-pattern "BankID" --region eu-west-1
# Verify BankID environment variables are set
# Check BankID status: https://driftsstatus.vippsmobilepay.com/

Issue: KYC verification stuck in pending

# Check Sumsub dashboard for stuck applicants
# Or query:
psql -c "SELECT id, email, kyc_status FROM users WHERE kyc_status='pending' AND created_at < NOW()-INTERVAL '2 hours';"
# Force-process via Sumsub dashboard or API call

9. Monitoring Verification Commands

# 1. Full health check
curl -s https://getdrop.no/api/health | python3 -m json.tool

# 2. Database latency check
curl -s https://getdrop.no/api/health | jq '.data.checks.db.latencyMs'
# Alert if > 100ms

# 3. Check app version
curl -s https://getdrop.no/api/health | jq '.data.version'

# 4. Check uptime
curl -s https://getdrop.no/api/health | jq '.data.uptime'


Related Documents

Disaster Recovery Plan
Monitoring & Observability
Go-Live Runbook
Source DR Runbook


Approval



Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
