Operational Runbook
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | | | Initial draft |
1. Service Overview
This runbook covers day-to-day operations of Drop's production environment. Drop runs on AWS App Runner (eu-west-1) with RDS PostgreSQL.
Primary operations contact: Alem Bašić, [email protected] / +47 40 47 42 51
AI Operations: John (AI Director), Slack #drop-alerts
Service: {{PROJECT_NAME}}
Purpose: {{SERVICE_PURPOSE}}
Technology stack: {{STACK}}
Architecture reference: Deployment Architecture
Service URLs:
2. Quick Reference
Production Infrastructure
| Health Check | | |
|---|---|---|
| Staging | | |
Key dashboards:
- System overview: {{DASHBOARD_LINK}}
- Service metrics: {{SERVICE_DASHBOARD_LINK}}
- Logs: {{LOG_DASHBOARD_LINK}}
2. Common Operational Tasks
Quick Health Check
# Application health (production)
curl -s https://getdrop.no/api/health | jq
# App Runner status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' --output text --region eu-west-1
# RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1
# Live App Runner logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow --region eu-west-1
2.1 Service Restart Procedure
When to use: Application unresponsive, hanging workers, suspected deadlock
Steps:
Option A - Rolling restart (no downtime):
# AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --force-new-deployment
# Kubernetes
kubectl rollout restart deployment/{{DEPLOYMENT}} -n {{NAMESPACE}}
Option B - Emergency restart (brief downtime; use if rolling restart fails):
# Stop all instances
{{STOP_COMMAND}}
# Wait for drain
sleep 30
# Start fresh
{{START_COMMAND}}
Verify:
# Check all instances healthy
{{HEALTH_CHECK_COMMAND}}
# Check for errors post-restart
{{LOG_CHECK_COMMAND}}
Expected restart time: {{RESTART_TIME}} minutes
Alert expected: Service restart will trigger a deployment alert; acknowledge it in PagerDuty
2.2 Log Retrieval & Analysis
Centralized logs: {{LOG_URL}}
Quick log retrieval:
# Error lines from the last hour
{{LOG_TOOL}} --filter "level=error" --since "1h" --service {{SERVICE}}
# Logs for a specific user
{{LOG_TOOL}} --filter "user_id={{USER_ID}}" --since "24h"
# Logs for a specific request
{{LOG_TOOL}} --filter "request_id={{REQUEST_ID}}"
# Database slow query logs
{{DB_LOG_COMMAND}}
Log format reference: See Monitoring & Observability
2.3 Database Maintenance
Connection count check:
SELECT count(*) AS connections, state FROM pg_stat_activity GROUP BY state;
Kill idle connections:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '5 minutes'
AND pid <> pg_backend_pid();
Running queries (detect long-running):
-- pg_stat_activity has no duration column, so compute it from query_start
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '1 minute'
AND state != 'idle';
Vacuum / analyze (if table bloat suspected):
VACUUM ANALYZE {{TABLE_NAME}};
Check replication lag (run on the replica):
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
2.4 Cache Clearing / Warming
Clear all cache (use with caution; may spike DB load):
{{CACHE_FLUSH_COMMAND}}
Clear specific key pattern:
{{CACHE_DELETE_PATTERN_COMMAND}}
Check cache hit rate:
{{CACHE_STATS_COMMAND}}
Warm cache after clearing:
# Run cache warming script
bash scripts/warm-cache.sh {{ENVIRONMENT}}
# Or trigger warming job
{{WARM_CACHE_JOB_COMMAND}}
Expected DB load spike after cache clear: {{CACHE_CLEAR_IMPACT}} minutes of elevated load
2.5 Certificate Renewal
Automated renewal: Configured via {{CERT_TOOL}} (Let's Encrypt / ACM)
Auto-renewal trigger: 30 days before expiry
Manual renewal (if auto-renewal fails):
# Check expiry
echo | openssl s_client -connect {{DOMAIN}}:443 2>/dev/null | openssl x509 -noout -dates
# Manual renewal
{{CERT_RENEW_COMMAND}}
# Verify
{{CERT_VERIFY_COMMAND}}
Verify renewal alert is working:
- Alert configured: "Certificate expiring in < 30 days" → {{ALERT_CHANNEL}}
- Test certificate: curl -I https://{{DOMAIN}} and check the Strict-Transport-Security header
2.6 Scaling Up / Down
Scale up (increase capacity):
# AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --desired-count {{COUNT}}
# Kubernetes
kubectl scale deployment/{{DEPLOYMENT}} --replicas={{COUNT}} -n {{NAMESPACE}}
Verify scale-out:
# Check instance count
{{INSTANCE_COUNT_COMMAND}}
# Confirm health
{{HEALTH_CHECK_COMMAND}}
Scale down (reduce capacity; use cautiously):
- Do NOT scale below {{MIN_INSTANCES}} instances
- Scale down during off-peak hours only ({{OFF_PEAK_HOURS}})
- Monitor for 10 minutes after scaling down to confirm stability
3. Routine Operations
3.1 Daily Checks
- BetterStack: all 3 monitors green (health, landing, US east)
- Slack #drop-ops: no unresolved critical alerts from last 24h
- App Runner service status: RUNNING
- RDS snapshot from last night: exists and < 24h old
# Verify last RDS snapshot
aws rds describe-db-snapshots \
--db-instance-identifier drop-db \
--region eu-west-1 \
--query 'DBSnapshots[?SnapshotType==`automated`]|sort_by(@,&SnapshotCreateTime)[-1].{id:DBSnapshotIdentifier,time:SnapshotCreateTime}' \
--output table
3.2 Weekly Checks
- Review CloudWatch logs for recurring error patterns
- Check RDS free storage space (alert if < 2GB)
- Review AML alerts table for any open cases
- Review pending KYC applicants (stuck in pending status > 24h)
- Check ECR: clean up untagged images manually if lifecycle policy hasn't run
# Check RDS storage
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name FreeStorageSpace \
--dimensions Name=DBInstanceIdentifier,Value=drop-db \
--start-time $(date -u -d '1 hour ago' --iso-8601=seconds) \
--end-time $(date -u --iso-8601=seconds) \
--period 3600 \
--statistics Average \
--region eu-west-1
# Check pending KYC (connect to RDS via bastion or VPN)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "SELECT id, email, kyc_status, created_at FROM users WHERE kyc_status = 'pending' ORDER BY created_at ASC;"
3.3 Monthly Checks
- Review SLA report (uptime, error rate, p99 latency)
- Test BetterStack alerts (pause monitor → verify escalation fires → resume)
- Verify RDS snapshot restore works (restore to temp instance, verify data, delete)
- Review secret rotation schedule: anything due?
- Review STR reports table: any pending filings?
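The daily snapshot-age check and the weekly free-storage check under Routine Operations both reduce to a threshold comparison. The sketch below shows one way to script those comparisons. The function names `snapshot_is_fresh` and `storage_is_low` are hypothetical, the helpers make no AWS calls, and they assume the caller has already converted the snapshot time to epoch seconds and has the CloudWatch `FreeStorageSpace` value in bytes as a plain decimal number.

```shell
#!/usr/bin/env bash
# Sketch only: pure threshold helpers (hypothetical names), no AWS calls.

# snapshot_is_fresh SNAPSHOT_EPOCH NOW_EPOCH
# Succeeds when the snapshot is less than 24 hours old.
snapshot_is_fresh() {
  local snapshot_epoch="$1" now_epoch="$2"
  local max_age=$((24 * 60 * 60))
  [ $((now_epoch - snapshot_epoch)) -lt "$max_age" ]
}

# storage_is_low FREE_BYTES
# Succeeds when free storage is below 2 GiB (the weekly alert threshold).
# A trailing fractional part such as ".0" is dropped before comparing.
storage_is_low() {
  local free_bytes="${1%.*}"
  local threshold=$((2 * 1024 * 1024 * 1024))
  [ "$free_bytes" -lt "$threshold" ]
}
```

For example, `snapshot_is_fresh "$SNAPSHOT_EPOCH" "$(date +%s)" || echo "snapshot older than 24h"` could follow the describe-db-snapshots call, where `SNAPSHOT_EPOCH` is an assumed variable holding the converted SnapshotCreateTime.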
3. Troubleshooting Playbooks
3.1 High CPU Usage
Symptoms: CPU alert fires, slow responses, possible OOM
- Identify the source:
# Top processes by CPU
{{CPU_TOP_COMMAND}}
- Check for: runaway loops, large queries being processed, missing cache causing recalculation
- Check for recently deployed code: did CPU spike after a deploy? If so, consider rollback
- Check queue depth: a backed-up job queue causes a worker CPU spike
- If single instance: restart that instance ({{RESTART_SINGLE_COMMAND}})
- If all instances: scale up immediately, then investigate root cause
- Escalate if: CPU > {{CPU_ESCALATE}}% for > {{ESCALATE_DURATION}} min after scaling
3.2 Memory Leaks
Symptoms: Slowly increasing memory, eventual OOM kill / restart loop
- Check memory trend in monitoring dashboard: a linear increase over hours indicates a leak
- Identify the leak:
  - Enable heap dump: {{HEAP_DUMP_COMMAND}}
  - Profile with: {{PROFILER}}
- Short-term mitigation: Schedule rolling restarts every {{RESTART_INTERVAL}}h
{{SCHEDULED_RESTART_COMMAND}}
- Create ticket with heap dump attached (requires developer investigation)
- Escalate if: Restart cycle < {{MIN_RESTART_INTERVAL}}h (memory fills too fast)
3.3 Slow Database Queries
Symptoms: High P99 latency, DB CPU spike, timeouts in logs
- Find slow queries:
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 20;
- Check for missing indexes: Look for sequential scans on large tables
- Check for blocking queries:
SELECT blocking.pid AS blocking_pid, blocking.query AS blocking_query,
blocked.pid AS blocked_pid, blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
- Kill blocking query if safe:
SELECT pg_cancel_backend({{PID}});
-- If cancel doesn't work:
SELECT pg_terminate_backend({{PID}});
- Create ticket: developer must optimize the query
3.4 Service Connectivity Issues
Symptoms: Connectivity errors between services, 502/503 errors
- Check health endpoints:
curl -I {{SERVICE_URL}}/health
- Check network security groups / firewall rules: was anything changed recently?
- Check service discovery: is DNS resolving correctly?
nslookup {{SERVICE_INTERNAL_DNS}}
- Check if service is running:
{{SERVICE_STATUS_COMMAND}}
- Check logs for connection errors:
{{CONNECTIVITY_LOG_COMMAND}}
3.5 High Error Rates
Symptoms: Error rate alert, user complaints, 5xx in logs
- Identify error type: {{LOG_ERROR_COMMAND}} (what errors, what services, what endpoints?)
- Check if correlated with: recent deployment, external service outage, traffic spike
- Check external service status pages:
  - {{SERVICE_1}} status: {{STATUS_PAGE_1}}
  - {{SERVICE_2}} status: {{STATUS_PAGE_2}}
- If recent deployment: Consider rollback if errors affecting > {{ROLLBACK_ERROR_THRESHOLD}}% of requests
- If external service down: Check circuit breaker status, enable fallback
- Escalate if: Error rate > {{ESCALATE_ERROR_RATE}}% for > {{ESCALATE_DURATION}} min
3.6 Disk Space Issues
Symptoms: Disk space alert, application errors writing files
- Check disk usage:
df -h
du -sh /var/log/* | sort -rh | head -10
- Quick wins:
# Rotate and compress logs
logrotate -f /etc/logrotate.conf
# Clear old Docker images
docker image prune -a --filter "until=24h"
# Clear /tmp
find /tmp -mtime +7 -delete
- If database disk: Check for table bloat, dead tuples, WAL accumulation
SELECT pg_size_pretty(pg_database_size('{{DB_NAME}}'));
- Escalate if: Disk > {{DISK_ESCALATE}}% and cannot free space quickly
4. Deployment Procedure
4.1 Standard Deployment (App Runner)
# 1. Ensure all CI checks pass on main branch
# 2. Build and push new Docker image to ECR
docker build -t drop-app .
docker tag drop-app:latest 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web:$(git rev-parse --short HEAD)
aws ecr get-login-password --region eu-west-1 | \
docker login --username AWS --password-stdin 324480209768.dkr.ecr.eu-west-1.amazonaws.com
docker push 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web:$(git rev-parse --short HEAD)
# 3. Create pre-deployment RDS snapshot
aws rds create-db-snapshot \
--db-instance-identifier drop-db \
--db-snapshot-identifier drop-db-pre-deploy-$(date +%Y%m%d-%H%M) \
--region eu-west-1
# 4. Create BetterStack maintenance window (prevents false alerts)
# Go to BetterStack → Maintenance Windows → Create Window (30 min)
# 5. Trigger App Runner deployment
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
# 6. Monitor deployment status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' --output text --region eu-west-1
# Wait for RUNNING
# 7. Verify health
curl -s https://getdrop.no/api/health | jq
# 8. Close BetterStack maintenance window
Typical deployment time: 3–5 minutes
4.2 Staging Deployment (Fly.io)
# Deploy to Fly.io staging
cd src/drop-app
fly deploy --app drop-staging
# Verify staging health
curl -s https://drop-staging.fly.dev/api/health | jq
4.3 Rollback
# Identify previous ECR image
aws ecr describe-images --repository-name drop-web --region eu-west-1 \
--query 'sort_by(imageDetails,&imagePushedAt)[-2].imageDigest' --output text
# Update App Runner to use previous image tag via console,
# then trigger deployment:
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
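The "Wait for RUNNING" step of the App Runner deployment, and the same wait after a rollback, can be automated with a small polling loop. The sketch below is one possible shape for it. `wait_for_status` is a hypothetical helper, not part of the AWS CLI, and the attempt count and delay are assumptions to tune against the typical deployment time.

```shell
#!/usr/bin/env bash
# Sketch only: poll a status command until it prints the wanted value.
# wait_for_status WANT ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_status() {
  local want="$1" attempts="$2" delay="$3"
  shift 3
  local i=1 status=""
  while [ "$i" -le "$attempts" ]; do
    # Run the status command; treat a failing command as "no status yet".
    status=$("$@" 2>/dev/null) || status=""
    if [ "$status" = "$want" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for status $want (last: ${status:-none})" >&2
  return 1
}

# Example (assumes SERVICE_ARN is set to the App Runner service ARN):
# wait_for_status RUNNING 40 15 \
#   aws apprunner describe-service --service-arn "$SERVICE_ARN" \
#   --query 'Service.Status' --output text --region eu-west-1
```

Passing the status command as arguments keeps the loop independent of App Runner, so the same helper could poll the RDS instance status as well.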
4. Health Check Endpoints
| Endpoint | Method | Expected Response | What It Checks |
|---|---|---|---|
| {{BASE_URL}}/health | GET | HTTP 200 {"status":"ok"} | Application running |
| {{BASE_URL}}/health/ready | GET | HTTP 200 {"status":"ready"} | App + DB + Cache connected |
| {{BASE_URL}}/health/live | GET | HTTP 200 {"status":"alive"} | App process alive |
| {{BASE_URL}}/health/db | GET | HTTP 200 {"status":"ok","latency_ms":X} | Database reachable |
| {{BASE_URL}}/health/cache | GET | HTTP 200 {"status":"ok"} | Redis reachable |
Health check from load balancer: {{HEALTH_CHECK_PATH}} every {{LB_INTERVAL}}s
Unhealthy threshold: {{UNHEALTHY_COUNT}} consecutive failures
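The unhealthy threshold counts consecutive failures, so a single successful probe resets the count. The sketch below illustrates that rule in isolation. `is_unhealthy` and `probe` are hypothetical helper names, and treating anything other than HTTP 200 as a failure is an assumption based on the expected responses in the table above.

```shell
#!/usr/bin/env bash
# Sketch only: consecutive-failure rule for health probes (hypothetical helpers).

# probe URL
# Prints the HTTP status code of a GET request (000 if the request fails).
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# is_unhealthy THRESHOLD CODE...
# Succeeds if the sequence of status codes contains a run of at least
# THRESHOLD consecutive non-200 results.
is_unhealthy() {
  local threshold="$1"
  shift
  local streak=0 code
  for code in "$@"; do
    if [ "$code" = "200" ]; then
      streak=0
    else
      streak=$((streak + 1))
    fi
    if [ "$streak" -ge "$threshold" ]; then
      return 0
    fi
  done
  return 1
}
```

With a threshold of 3, the sequence 503 503 200 503 is still healthy because the 200 resets the streak, while 200 503 503 503 is not.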
5. Secret Rotation Procedures
5.1 Rotate JWT_SECRET
Impact: All active user sessions immediately invalidated. All logged-in users are logged out.
# 1. Generate new secret
NEW_SECRET=$(openssl rand -base64 48)
# 2. Update in AWS Secrets Manager
aws secretsmanager update-secret \
--secret-id drop/production/jwt-secret \
--secret-string "$NEW_SECRET" \
--region eu-west-1
# 3. Update App Runner environment variable (via console or CLI)
# Then trigger new deployment
# 4. Log rotation in audit_log
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
-c "INSERT INTO audit_log (id, action, resource_type, resource_id, details) VALUES (gen_random_uuid(), 'secret_rotated', 'secret', 'JWT_SECRET', '{\"rotated_at\": \"$(date -u --iso-8601=seconds)\"}');"
5.2 Rotate Database Password
# 1. Generate new password
NEW_PASS=$(openssl rand -base64 32)
# 2. Update RDS master password
aws rds modify-db-instance \
--db-instance-identifier drop-db \
--master-user-password "$NEW_PASS" \
--apply-immediately \
--region eu-west-1
# 3. Update DATABASE_URL in Secrets Manager with new password
# 4. Trigger App Runner redeployment to pick up new DATABASE_URL
# 5. Verify health: curl https://getdrop.no/api/health
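One detail to watch in step 3: `openssl rand -base64` can emit `+`, `/` and `=`, which are reserved characters in a URL, so the new password may need percent-encoding before it is placed in DATABASE_URL. The sketch below shows a minimal encoder. `urlencode` is a hypothetical helper, it assumes ASCII input (which base64 output always is), and whether encoding is required depends on how the application parses DATABASE_URL.

```shell
#!/usr/bin/env bash
# Sketch only: percent-encode a string for use in the password part of a URL.
# urlencode STRING  (hypothetical helper; assumes ASCII input)
urlencode() {
  local s="$1" out="" rest c
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                  # unreserved: keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;          # everything else: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# Example: ENCODED_PASS=$(urlencode "$NEW_PASS")
```

An alternative that avoids encoding altogether is to generate the password with `openssl rand -hex 32`, which only produces hexadecimal characters.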
6. Database Operations
6.1 Connect to Production Database
Note: RDS must be accessible — either via VPN, bastion host, or AWS Systems Manager Session Manager.
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
-U dropuser \
-d dropapp \
-c "SELECT 1;"
6.2 User Management Queries
-- Check user KYC status
SELECT id, email, kyc_status, auth_provider, created_at
FROM users WHERE email = '[email protected]';
-- List pending KYC users (> 24h)
SELECT id, email, kyc_status, created_at FROM users
WHERE kyc_status = 'pending'
AND created_at < NOW() - INTERVAL '24 hours'
ORDER BY created_at ASC;
-- Revoke all sessions for a user (emergency)
UPDATE sessions SET revoked = 1
WHERE user_id = 'usr_...' AND revoked = 0;
-- Soft-delete user (GDPR erasure)
UPDATE users SET deleted_at = NOW() WHERE id = 'usr_...';
UPDATE sessions SET revoked = 1 WHERE user_id = 'usr_...';
6.3 Transaction Queries
-- Recent transactions (last 24h)
SELECT id, type, status, send_amount, send_currency, created_at
FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
ORDER BY created_at DESC LIMIT 50;
-- Failed transactions (may need investigation)
SELECT t.*, u.email FROM transactions t
JOIN users u ON t.user_id = u.id
WHERE t.status = 'failed'
AND t.created_at > NOW() - INTERVAL '7 days'
ORDER BY t.created_at DESC;
-- AML: large transactions (> NOK 50,000)
SELECT * FROM transactions
WHERE send_amount > 50000
AND created_at > NOW() - INTERVAL '30 days'
ORDER BY send_amount DESC;
6.4 Manual RDS Snapshot
# Create manual snapshot before risky operations
# Compute the identifier once so both commands refer to the same snapshot,
# even if the minute changes between them
SNAPSHOT_ID="drop-db-manual-$(date +%Y%m%d-%H%M)"
aws rds create-db-snapshot \
--db-instance-identifier drop-db \
--db-snapshot-identifier "$SNAPSHOT_ID" \
--region eu-west-1
# Wait for snapshot to complete
aws rds wait db-snapshot-completed \
--db-snapshot-identifier "$SNAPSHOT_ID" \
--region eu-west-1
7. AML & Compliance Operations
7.1 AML Alert Review
-- View open AML alerts
SELECT a.*, u.email, t.send_amount, t.send_currency
FROM aml_alerts a
JOIN users u ON a.user_id = u.id
LEFT JOIN transactions t ON a.transaction_id = t.id
WHERE a.status = 'open'
ORDER BY a.created_at DESC;
-- Close an AML alert (after review)
UPDATE aml_alerts SET status = 'closed', reviewed_at = NOW(),
reviewer_notes = 'Reviewed — legitimate transaction'
WHERE id = 'alert_...';
7.2 STR Filing
If financial crime is suspected:
-- File STR
INSERT INTO str_reports (
id, user_id, transaction_id, report_type, details, filed_at, status
) VALUES (
gen_random_uuid(), 'usr_...', 'tx_...', 'suspicious_transaction',
'{"reason": "Unusual pattern", "amount": 50000}',
NOW(), 'filed'
);
Then contact Finanstilsynet via the official STR filing portal.
7.3 GDPR Requests
Data export request:
-- User data is exported via /api/user/data-export endpoint
-- Check data_access_requests table
SELECT * FROM data_access_requests WHERE user_id = 'usr_...' ORDER BY created_at DESC;
Erasure request:
-- Account deletion (soft delete)
UPDATE users SET deleted_at = NOW() WHERE id = 'usr_...';
UPDATE sessions SET revoked = 1 WHERE user_id = 'usr_...';
-- Note: data retained for 5 years per hvitvaskingsloven
8. Incident Response
8.1 Alert Triage
When a Slack alert fires in #drop-ops:
| Alert | Immediate Action | Runbook Section |
|---|---|---|
| HighErrorRate | Check logs, identify error type, assess scope | 3.5 High Error Rates |
| SlowP99 | Check DB slow queries, recent deploys | 3.3 Slow DB Queries |
| ServiceDown | Restart service, check logs | 2.1 Service Restart |
| HighCPU | Scale up, identify source | 3.1 High CPU |
| DiskAlmostFull | Clear logs/tmp, escalate if > 90% | 3.6 Disk Space |
| DBReplicationLag | Check replication, network, disk on replica | DB section |
| CertificateExpiring | Trigger manual renewal | 2.5 Certificate Renewal |
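If alert notifications are post-processed by a script, the triage table can be encoded as a simple lookup so that each alert message carries its runbook section. The sketch below assumes the alert names arrive exactly as written in the table. `runbook_section` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: map an alert name to its runbook section (hypothetical helper).
# runbook_section ALERT_NAME
runbook_section() {
  case "$1" in
    HighErrorRate)       echo "3.5 High Error Rates" ;;
    SlowP99)             echo "3.3 Slow DB Queries" ;;
    ServiceDown)         echo "2.1 Service Restart" ;;
    HighCPU)             echo "3.1 High CPU" ;;
    DiskAlmostFull)      echo "3.6 Disk Space" ;;
    DBReplicationLag)    echo "DB section" ;;
    CertificateExpiring) echo "2.5 Certificate Renewal" ;;
    *)
      # Unknown alerts fail loudly so they are not silently ignored.
      echo "unknown alert: $1" >&2
      return 1
      ;;
  esac
}
```

The non-zero return for unknown names lets a caller fall back to manual triage when a new alert type is added before the table is updated.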
8.2 Common Issues
Issue: Health check returns 503 (DB unreachable)
# 1. Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1
# 2. If not 'available', wait for AWS to auto-recover or follow DR Scenario 2
# 3. Check connection string in App Runner environment
# 4. Restart App Runner service
Issue: BankID login failing
# Check App Runner logs for BankID errors
aws logs filter-log-events \
--log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--filter-pattern "BankID" --region eu-west-1
# Verify BankID environment variables are set
# Check BankID status: https://driftsstatus.vippsmobilepay.com/
Issue: KYC verification stuck in pending
# Check Sumsub dashboard for stuck applicants
# Or query:
psql -c "SELECT id, email, kyc_status FROM users WHERE kyc_status='pending' AND created_at < NOW()-INTERVAL '2 hours';"
# Force-process via Sumsub dashboard or API
6. Escalation Matrix
| Situation | First Contact | Escalation | Ultimate Escalation |
|---|---|---|---|
| Engineering | | | |
| CTO | | | |
| CEO | | | |
| Payment system down | On-call + Payment owner | Stripe/payment provider support | Engineering manager |
Emergency contacts:
| Role | Name | Phone | Slack |
|---|---|---|---|
| On-call (primary) | {{PRIMARY}} | {{PHONE}} | {{SLACK}} |
| On-call (backup) | {{BACKUP}} | {{PHONE}} | {{SLACK}} |
| Tech Lead | {{TECH_LEAD}} | {{PHONE}} | {{SLACK}} |
| Engineering Manager | {{ENG_MGR}} | {{PHONE}} | {{SLACK}} |
7. On-Call Handoff Procedure
Handoff cadence: {{HANDOFF_CADENCE}}
Handoff time: {{HANDOFF_TIME}}
Outgoing on-call must document:
- Any open incidents or ongoing issues
- Any monitoring anomalies (elevated error rates, slow queries not yet resolved)
- Any upcoming events that may affect the system (marketing campaigns, scheduled maintenance)
- Any temporary mitigations in place that need permanent fixes
- Context on any unusual alerts that fired and were noise
Handoff document template: {{HANDOFF_TEMPLATE_LINK}}
8. Maintenance Window Procedure
Maintenance window schedule: {{MAINTENANCE_WINDOW}} (lowest traffic period)
Pre-maintenance:
- Announce in Slack #ops: "Maintenance window {{DATE}} {{TIME}}-{{END_TIME}}"
- Update status page: "Scheduled maintenance" with details
- Notify impacted customers if downtime expected > {{DOWNTIME_NOTIFY_THRESHOLD}} minutes
- Confirm rollback plan is ready
During maintenance:
- Enable maintenance mode (if applicable): {{MAINTENANCE_MODE_CMD}}
- Execute maintenance tasks per the specific runbook for the task
- Run smoke tests after each major step
- Document every action taken with timestamps
Post-maintenance:
- Disable maintenance mode: {{DISABLE_MAINTENANCE_CMD}}
- Run full smoke test suite
- Monitor for 30 minutes
- Update status page: "Maintenance complete, all systems normal"
- Post-maintenance report in #ops Slack channel
9. Monitoring Verification Commands
# 1. Full health check
curl -s https://getdrop.no/api/health | python3 -m json.tool
# 2. Database latency check
curl -s https://getdrop.no/api/health | jq '.data.checks.db.latencyMs'
# Alert if > 100ms
# 3. Check app version
curl -s https://getdrop.no/api/health | jq '.data.version'
# 4. Check uptime
curl -s https://getdrop.no/api/health | jq '.data.uptime'
Related Documents
- Go-Live Runbook
- Incident Report
- Monitoring & Observability
- Disaster Recovery Plan
- Source DR Runbook
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |