Operational Runbook

Project: ~~Drop~~{{PROJECT_NAME}}
Version: ~~0.1.0~~{{VERSION}}
Date: ~~2026-02-23~~{{DATE}}
Author: ~~Platform Architect (AI)~~{{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: ~~Alem Bašić (CEO)~~{{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	~~2026-02-23~~{{DATE}}	~~Platform Architect (AI)~~{{AUTHOR}}	Initial draft ~~covering day-to-day Drop operations~~

1. Service Overview

~~This runbook covers day-to-day operations of Drop's production environment. Drop runs on AWS App Runner (eu-west-1) with RDS PostgreSQL.~~

Service: {{PROJECT_NAME}}
Purpose: {{SERVICE_PURPOSE}}
Technology stack: {{STACK}}
Architecture reference: Deployment Architecture

~~Primary operations contact: Alem Bašić — [email protected] / +47 40 47 42 51~~
~~AI Operations: John (AI Director) — Slack #drop-alerts~~

Service URLs:

2. Quick Reference

Production Infrastructure

Environment	URL	Health Check
Production	`{{PROD_URL}}`	`{{PROD_URL}}/health`
Staging	`{{STG_URL}}`	`{{STG_URL}}/health`

Key dashboards:

System overview: {{DASHBOARD_LINK}}

Service metrics: {{SERVICE_DASHBOARD_LINK}}

Logs: {{LOG_DASHBOARD_LINK}}

2. Common Operational Tasks

2.1 Service Restart Procedure

When to use: Application unresponsive, hanging workers, suspected deadlock

Steps:

Option A — Rolling restart (no downtime):

# AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --force-new-deployment

# Kubernetes
kubectl rollout restart deployment/{{DEPLOYMENT}} -n {{NAMESPACE}}

Option B — Emergency restart (brief downtime, use only if rolling restart fails):

# Stop all instances
{{STOP_COMMAND}}
# Wait for drain
sleep 30
# Start fresh
{{START_COMMAND}}


Verify:

# Check all instances healthy
{{HEALTH_CHECK_COMMAND}}
# Check for errors post-restart
{{LOG_CHECK_COMMAND}}

Expected restart time: {{RESTART_TIME}} minutes
Alert expected: Service restart will trigger deployment alert — acknowledge in PagerDuty
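
The post-restart verification can be scripted as a polling loop. The following is a minimal sketch under stated assumptions, not the project's actual tooling: `wait_for_healthy` is a hypothetical helper, and the check command passed to it stands in for whatever {{HEALTH_CHECK_COMMAND}} resolves to in your environment.

```shell
# Hypothetical helper: re-run a health check command until it succeeds
# or the attempt budget is exhausted.
# Usage: wait_for_healthy "<check command>" [max_attempts] [delay_seconds]
wait_for_healthy() {
  check_cmd=$1
  max_attempts=${2:-30}
  delay=${3:-10}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # The check passes when the command exits with status 0.
    if sh -c "$check_cmd" >/dev/null 2>&1; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $max_attempts attempt(s)" >&2
  return 1
}
```

A typical call might pass a `curl -fsS` request against the service health URL as the check command, with the attempt count and delay sized to {{RESTART_TIME}}. A non-zero exit status signals that the restart should be treated as failed.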

2.2 Log Retrieval & Analysis

Centralized logs: {{LOG_URL}}

Quick log retrieval:

# Last 100 error lines
{{LOG_TOOL}} --filter "level=error" --since "1h" --service {{SERVICE}}

# Logs for a specific user
{{LOG_TOOL}} --filter "user_id={{USER_ID}}" --since "24h"

# Logs for a specific request
{{LOG_TOOL}} --filter "request_id={{REQUEST_ID}}"

# Database slow query logs
{{DB_LOG_COMMAND}}

Log format reference: See Monitoring & Observability
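
Once error lines have been retrieved, a per-service count helps decide where to look first. The sketch below is illustrative only: `summarize_errors` is a hypothetical helper, and it assumes log lines are space-separated `key=value` pairs with `level` and `service` fields, which the filter syntax above suggests but the source does not confirm.

```shell
# Hypothetical helper: count error-level log lines per service.
# Reads log lines on stdin; assumes space-separated key=value fields.
summarize_errors() {
  awk '
    /(^| )level=error( |$)/ {
      svc = "unknown"
      for (i = 1; i <= NF; i++) {
        # "service=" is 8 characters, so the value starts at position 9.
        if ($i ~ /^service=/) svc = substr($i, 9)
      }
      count[svc]++
    }
    END { for (s in count) print count[s], s }
  ' | sort -k1,1nr -k2,2
}
```

Piping the output of the error-line retrieval command into `summarize_errors` prints one line per service, highest count first.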

2.3 Database Maintenance

Connection count check:

SELECT count(*) AS connections, state FROM pg_stat_activity GROUP BY state;

Kill idle connections:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes'
  AND pid <> pg_backend_pid();

Running queries (detect long-running):

-- pg_stat_activity has no duration column; derive it from query_start
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '1 minute'
  AND state != 'idle';

Vacuum / analyze (if table bloat suspected):

VACUUM ANALYZE {{TABLE_NAME}};

Check replication lag:

SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

2.4 Cache Clearing / Warming

Clear all cache (use with caution — may spike DB load):

{{CACHE_FLUSH_COMMAND}}

Clear specific key pattern:

{{CACHE_DELETE_PATTERN_COMMAND}}

Check cache hit rate:

{{CACHE_STATS_COMMAND}}

Warm cache after clearing:

# Run cache warming script
bash scripts/warm-cache.sh {{ENVIRONMENT}}
# Or trigger warming job
{{WARM_CACHE_JOB_COMMAND}}

Expected DB load spike after cache clear: {{CACHE_CLEAR_IMPACT}} minutes of elevated load

2.5 Certificate Renewal

Automated renewal: Configured via {{CERT_TOOL}} (Let's Encrypt / ACM)
Auto-renewal trigger: 30 days before expiry

Manual renewal (if auto-renewal fails):

# Check expiry
echo | openssl s_client -connect {{DOMAIN}}:443 2>/dev/null | openssl x509 -noout -dates

# Manual renewal
{{CERT_RENEW_COMMAND}}

# Verify
{{CERT_VERIFY_COMMAND}}

Verify renewal alert is working:

Alert configured: "Certificate expiring in < 30 days" → {{ALERT_CHANNEL}}
Test certificate: curl -I https://{{DOMAIN}} and check Strict-Transport-Security header
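
To compare a certificate's expiry date against the 30-day renewal trigger, the date can be converted into a day count. This is a sketch rather than a definitive implementation: `days_until_expiry` is a hypothetical helper, and it assumes GNU `date` (the `-d` option), which BSD and macOS `date` do not support in the same form.

```shell
# Hypothetical helper: print whole days until a certificate's notAfter date.
# Takes the date string that `openssl x509 -noout -enddate` prints after
# "notAfter=". Assumes GNU date (the -d option).
days_until_expiry() {
  not_after=$1
  expiry_epoch=$(date -u -d "$not_after" +%s) || return 2
  now_epoch=$(date -u +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
```

The argument can be taken from the `openssl s_client` pipeline above by replacing `-dates` with `-enddate` and cutting the value after the `=` sign. A result below 30 means auto-renewal should already have run.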

2.6 Scaling Up / Down

Scale up (increase capacity):

# AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --desired-count {{COUNT}}

# Kubernetes
kubectl scale deployment/{{DEPLOYMENT}} --replicas={{COUNT}} -n {{NAMESPACE}}

Verify scale-out:

# Check instance count
{{INSTANCE_COUNT_COMMAND}}
# Confirm health
{{HEALTH_CHECK_COMMAND}}

Scale down (reduce capacity — use cautiously):

Do NOT scale below {{MIN_INSTANCES}} instances
Scale down during off-peak hours only ({{OFF_PEAK_HOURS}})
Monitor for 10 minutes after scaling down to confirm stability
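
The minimum-instance rule can be enforced before any scale command runs. The following sketch assumes a hypothetical helper, `validate_scale_target`, which takes the desired count and the value configured for {{MIN_INSTANCES}}. It is one possible guard, not part of the ECS or Kubernetes tooling.

```shell
# Hypothetical guard: accept a target instance count only if it is a
# non-negative integer that does not fall below the configured minimum.
# Prints the accepted count, or an error on stderr with a non-zero status.
validate_scale_target() {
  target=$1
  min_instances=$2
  case $target in
    ''|*[!0-9]*)
      echo "target must be a non-negative integer" >&2
      return 2
      ;;
  esac
  if [ "$target" -lt "$min_instances" ]; then
    echo "refusing to scale to $target: minimum is $min_instances" >&2
    return 1
  fi
  echo "$target"
}
```

Capturing the output in a variable and running the scale command only when the helper succeeds keeps a mistyped count from reaching production.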

3. Troubleshooting Playbooks

3.1 High CPU Usage

Symptoms: CPU alert fires, slow responses, possible OOM

Identify the source:

# Top processes by CPU
{{CPU_TOP_COMMAND}}

Check for: runaway loops, large queries being processed, missing cache causing recalculation

Check for recently deployed code — did CPU spike after a deploy? → Consider rollback

Check queue depth — backed-up job queue causes worker CPU spike

If single instance: restart that instance ({{RESTART_SINGLE_COMMAND}})

If all instances: scale up immediately, then investigate root cause

Escalate if: CPU > {{CPU_ESCALATE}}% for > {{ESCALATE_DURATION}} min after scaling

3.2 Memory Leaks

Symptoms: Slowly increasing memory, eventual OOM kill / restart loop

Check memory trend in monitoring dashboard — linear increase over hours = leak

Identify the leak:
- Enable heap dump: {{HEAP_DUMP_COMMAND}}
- Profile with: {{PROFILER}}

Short-term mitigation: Schedule rolling restarts every {{RESTART_INTERVAL}}h
```
{{SCHEDULED_RESTART_COMMAND}}
```

Create ticket with heap dump attached — requires developer investigation

Escalate if: Restart cycle < {{MIN_RESTART_INTERVAL}}h (memory fills too fast)

3.3 Slow Database Queries

Symptoms: High P99 latency, DB CPU spike, timeouts in logs

Find slow queries:

SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

Check for missing indexes: Look for sequential scans on large tables

Check for blocking queries:

SELECT blocking.pid, blocking.query, blocked.pid, blocked.query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

Kill blocking query if safe:

SELECT pg_cancel_backend({{PID}});
-- If cancel doesn't work:
SELECT pg_terminate_backend({{PID}});

Create ticket — developer must optimize the query

3.4 Service Connectivity Issues

Symptoms: Connectivity errors between services, 502/503 errors

Check health endpoints:
```
curl -I {{SERVICE_URL}}/health
```

Check network security groups / firewall rules — was anything changed recently?

Check service discovery — DNS resolving correctly?
```
nslookup {{SERVICE_INTERNAL_DNS}}
```

Check if service is running:
```
{{SERVICE_STATUS_COMMAND}}
```

Check logs for connection errors:
```
{{CONNECTIVITY_LOG_COMMAND}}
```

3.5 High Error Rates

Symptoms: Error rate alert, user complaints, 5xx in logs

Identify error type: {{LOG_ERROR_COMMAND}} — what errors, what services, what endpoints?

Check if correlated with: recent deployment, external service outage, traffic spike

Check external service status pages:
- {{SERVICE_1}} status: {{STATUS_PAGE_1}}
- {{SERVICE_2}} status: {{STATUS_PAGE_2}}

If recent deployment: Consider rollback if errors affecting > {{ROLLBACK_ERROR_THRESHOLD}}% of requests

If external service down: Check circuit breaker status, enable fallback

Escalate if: Error rate > {{ESCALATE_ERROR_RATE}}% for > {{ESCALATE_DURATION}} min

3.6 Disk Space Issues

Symptoms: Disk space alert, application errors writing files

Check disk usage:

df -h
du -sh /var/log/* | sort -rh | head -10

Quick wins:

# Rotate and compress logs
logrotate -f /etc/logrotate.conf
# Clear old Docker images
docker image prune -a --filter "until=24h"
# Clear /tmp
find /tmp -mtime +7 -delete

If database disk: Check for table bloat, dead tuples, WAL accumulation

SELECT pg_size_pretty(pg_database_size('{{DB_NAME}}'));

Escalate if: Disk > {{DISK_ESCALATE}}% and cannot free space quickly
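
The escalation threshold can be checked mechanically from `df` output. This is a minimal sketch with a hypothetical helper, `filesystems_over_threshold`. It assumes the POSIX `df -P` column layout and mount points without spaces.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; assumes mount points contain no spaces.
filesystems_over_threshold() {
  threshold=$1
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)   # strip the percent sign from the Capacity column
      if (use + 0 >= t + 0) print $6, use "%"
    }
  '
}
```

Running `df -P | filesystems_over_threshold 80` prints each mount point that is at least 80% full, one per line, and prints nothing when all filesystems are below the threshold.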

4. Health Check Endpoints

Endpoint	Method	Expected Response	What It Checks
`{{BASE_URL}}/health`	GET	HTTP 200 `{"status":"ok"}`	Application running
`{{BASE_URL}}/health/ready`	GET	HTTP 200 `{"status":"ready"}`	App + DB + Cache connected
`{{BASE_URL}}/health/live`	GET	HTTP 200 `{"status":"alive"}`	App process alive
`{{BASE_URL}}/health/db`	GET	HTTP 200 `{"status":"ok","latency_ms":X}`	Database reachable
`{{BASE_URL}}/health/cache`	GET	HTTP 200 `{"status":"ok"}`	Redis reachable

Health check from load balancer: {{HEALTH_CHECK_PATH}} every {{LB_INTERVAL}}s
Unhealthy threshold: {{UNHEALTHY_COUNT}} consecutive failures
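
A consecutive-failure threshold means a single success resets the count, so isolated failures do not mark an instance unhealthy. The sketch below models that rule with a hypothetical helper, `is_unhealthy`. Actual load balancer behaviour varies by provider and is not specified in this runbook.

```shell
# Hypothetical helper: decide whether a sequence of probe results
# ("ok" or "fail", oldest first) ends in at least N consecutive failures.
# Usage: is_unhealthy <threshold> <result>...
is_unhealthy() {
  threshold=$1
  shift
  streak=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any success resets the failure streak
    fi
  done
  [ "$streak" -ge "$threshold" ]
}
```

With a threshold of 3, the sequence `ok fail fail fail` is unhealthy, while `fail fail ok fail` is not, because the success in the third position resets the streak.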

5. ~~Secret Rotation~~Alert Response Procedures

5.1 Rotate `JWT_SECRET`

~~Impact:~~ ~~All active user sessions immediately invalidated. All logged-in users are logged out.~~

# 1. Generate new secret
NEW_SECRET=$(openssl rand -base64 48)

# 2. Update in AWS Secrets Manager
aws secretsmanager update-secret \
  --secret-id drop/production/jwt-secret \
  --secret-string "$NEW_SECRET" \
  --region eu-west-1

# 3. Update App Runner environment variable (via console or CLI)
# Then trigger new deployment

# 4. Log rotation in audit_log
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com -U dropuser -d dropapp \
  -c "INSERT INTO audit_log (id, action, resource_type, resource_id, details) VALUES (gen_random_uuid(), 'secret_rotated', 'secret', 'JWT_SECRET', '{\"rotated_at\": \"$(date -u --iso-8601=seconds)\"}');"

5.2 Rotate Database Password

# 1. Generate new password
NEW_PASS=$(openssl rand -base64 32)

# 2. Update RDS master password
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password "$NEW_PASS" \
  --apply-immediately \
  --region eu-west-1

# 3. Update DATABASE_URL in Secrets Manager with new password
# 4. Trigger App Runner redeployment to pick up new DATABASE_URL
# 5. Verify health: curl https://getdrop.no/api/health

6. Database Operations

6.1 Connect to Production Database

~~Note:~~ ~~RDS must be accessible — either via VPN, bastion host, or AWS Systems Manager Session Manager.~~

psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
     -U dropuser \
     -d dropapp \
     -c "SELECT 1;"

6.2 User Management Queries

-- Check user KYC status
SELECT id, email, kyc_status, auth_provider, created_at
FROM users WHERE email = '[email protected]';

-- List pending KYC users (> 24h)
SELECT id, email, kyc_status, created_at FROM users
WHERE kyc_status = 'pending'
  AND created_at < NOW() - INTERVAL '24 hours'
ORDER BY created_at ASC;

-- Revoke all sessions for a user (emergency)
UPDATE sessions SET revoked = 1
WHERE user_id = 'usr_...' AND revoked = 0;

-- Soft-delete user (GDPR erasure)
UPDATE users SET deleted_at = NOW() WHERE id = 'usr_...';
UPDATE sessions SET revoked = 1 WHERE user_id = 'usr_...';

6.3 Transaction Queries

-- Recent transactions (last 24h)
SELECT id, type, status, send_amount, send_currency, created_at
FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
ORDER BY created_at DESC LIMIT 50;

-- Failed transactions (may need investigation)
SELECT t.*, u.email FROM transactions t
JOIN users u ON t.user_id = u.id
WHERE t.status = 'failed'
  AND t.created_at > NOW() - INTERVAL '7 days'
ORDER BY t.created_at DESC;

-- AML: large transactions (> NOK 50,000)
SELECT * FROM transactions
WHERE send_amount > 50000
  AND created_at > NOW() - INTERVAL '30 days'
ORDER BY send_amount DESC;

6.4 Manual RDS Snapshot

# Create manual snapshot before risky operations
# Compute the identifier once so both commands refer to the same snapshot,
# even if the minute changes between them
SNAPSHOT_ID="drop-db-manual-$(date +%Y%m%d-%H%M)"
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --region eu-west-1

# Wait for snapshot to complete
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --region eu-west-1

7. AML & Compliance Operations

7.1 AML Alert Review

-- View open AML alerts
SELECT a.*, u.email, t.send_amount, t.send_currency
FROM aml_alerts a
JOIN users u ON a.user_id = u.id
LEFT JOIN transactions t ON a.transaction_id = t.id
WHERE a.status = 'open'
ORDER BY a.created_at DESC;

-- Close an AML alert (after review)
UPDATE aml_alerts SET status = 'closed', reviewed_at = NOW(),
  reviewer_notes = 'Reviewed — legitimate transaction'
WHERE id = 'alert_...';

7.2 STR Filing

~~If financial crime is suspected:~~

-- File STR
INSERT INTO str_reports (
  id, user_id, transaction_id, report_type, details, filed_at, status
) VALUES (
  gen_random_uuid(), 'usr_...', 'tx_...', 'suspicious_transaction',
  '{"reason": "Unusual pattern", "amount": 50000}',
  NOW(), 'filed'
);

~~Then contact Finanstilsynet via the official STR filing portal.~~

~~Data export request:~~

-- User data is exported via /api/user/data-export endpoint
-- Check data_access_requests table
SELECT * FROM data_access_requests WHERE user_id = 'usr_...' ORDER BY created_at DESC;

~~Erasure request:~~

-- Account deletion (soft delete)
UPDATE users SET deleted_at = NOW() WHERE id = 'usr_...';
UPDATE sessions SET revoked = 1 WHERE user_id = 'usr_...';
-- Note: data retained for 5 years per hvitvaskingsloven

8. Incident Response

8.1 Alert Triage

~~When a Slack alert fires in~~ #drop-ops:

Alert	~~First Response~~Immediate Action	Runbook Section
`HighErrorRate`	Check logs, identify error type, assess scope	3.5 High Error Rates
`SlowP99`	Check DB slow queries, recent deploys	3.3 Slow Database Queries
`ServiceDown`	Restart service, check logs	2.1 Service Restart Procedure
`HighCPU`	Scale up, identify source	3.1 High CPU Usage
`DiskAlmostFull`	Clear logs/tmp, escalate if > 90%	3.6 Disk Space Issues
`DBReplicationLag`	Check replication, network, disk on replica	2.3 Database Maintenance
`CertificateExpiring`	Trigger manual renewal	2.5 Certificate Renewal
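
The alert table can also be encoded as a lookup, for example to print the relevant section in an alert notification. This is an illustrative sketch: `runbook_section_for` is a hypothetical function, and `DBReplicationLag` is mapped to section 2.3 on the assumption that the replication-lag query there is the intended starting point.

```shell
# Hypothetical lookup: map an alert name to the runbook section to open first.
runbook_section_for() {
  case $1 in
    HighErrorRate)       echo "3.5 High Error Rates" ;;
    SlowP99)             echo "3.3 Slow Database Queries" ;;
    ServiceDown)         echo "2.1 Service Restart Procedure" ;;
    HighCPU)             echo "3.1 High CPU Usage" ;;
    DiskAlmostFull)      echo "3.6 Disk Space Issues" ;;
    DBReplicationLag)    echo "2.3 Database Maintenance" ;;
    CertificateExpiring) echo "2.5 Certificate Renewal" ;;
    *)
      echo "no runbook section mapped for alert: $1" >&2
      return 1
      ;;
  esac
}
```

An unknown alert name returns a non-zero status, which makes unmapped alerts visible instead of silently routing them nowhere.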

6. Escalation Matrix

Situation	First Contact	Escalation	Ultimate Escalation
Service down	On-call engineer	Tech lead	Engineering manager
Data loss / corruption	On-call + Tech lead	CTO	CTO
Security incident	Security contact	CISO	CEO
Payment system down	On-call + Payment owner	Stripe/payment provider support	Engineering manager

Emergency contacts:

Role	Name	Phone	Slack
On-call (primary)	{{PRIMARY}}	{{PHONE}}	{{SLACK}}
On-call (backup)	{{BACKUP}}	{{PHONE}}	{{SLACK}}
Tech Lead	{{TECH_LEAD}}	{{PHONE}}	{{SLACK}}
Engineering Manager	{{ENG_MGR}}	{{PHONE}}	{{SLACK}}

8.2 Common Issues

~~Issue: Health check returns 503 (DB unreachable)~~

# 1. Check RDS status
aws rds describe-db-instances --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 2. If not 'available', wait for AWS to auto-recover or follow DR Scenario 2
# 3. Check connection string in App Runner environment
# 4. Restart App Runner service

Issue: BankID login failing

# Check App Runner logs for BankID errors
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --filter-pattern "BankID" --region eu-west-1

# Verify BankID environment variables are set
# Check BankID status: https://driftsstatus.vippsmobilepay.com/

Issue: KYC verification stuck in pending

# Check Sumsub dashboard for stuck applicants
# Or query:
psql -c "SELECT id, email, kyc_status FROM users WHERE kyc_status='pending' AND created_at < NOW()-INTERVAL '2 hours';"
# Force-process via Sumsub dashboard or API


7. On-Call Handoff Procedure

Handoff cadence: {{HANDOFF_CADENCE}}
Handoff time: {{HANDOFF_TIME}}

Outgoing on-call must document:


 Any open incidents or ongoing issues

 Any monitoring anomalies (elevated error rates, slow queries not yet resolved)

 Any upcoming events that may affect the system (marketing campaigns, scheduled maintenance)

 Any temporary mitigations in place that need permanent fixes

 Context on any unusual alerts that fired and were noise


Handoff document template: {{HANDOFF_TEMPLATE_LINK}}


8. Maintenance Window Procedure


Maintenance window schedule: {{MAINTENANCE_WINDOW}} (lowest traffic period)

Pre-maintenance:


Announce in Slack #ops: "Maintenance window {{DATE}} {{TIME}}-{{END_TIME}}"

Update status page: "Scheduled maintenance" with details

Notify impacted customers if downtime expected > {{DOWNTIME_NOTIFY_THRESHOLD}} minutes

Confirm rollback plan is ready


During maintenance:


Enable maintenance mode (if applicable): {{MAINTENANCE_MODE_CMD}}

Execute maintenance tasks per the specific runbook for the task

Run smoke tests after each major step

Document every action taken with timestamps
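
Timestamped documentation is easier to keep up when each action is logged with one command. The following is a minimal sketch: `log_action` and the `MAINTENANCE_LOG` variable are hypothetical names chosen for illustration, not part of any existing tooling referenced in this runbook.

```shell
# Hypothetical helper: append a UTC-timestamped entry to a maintenance log.
# MAINTENANCE_LOG is an assumed variable; it defaults to ./maintenance.log.
log_action() {
  log_file=${MAINTENANCE_LOG:-./maintenance.log}
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$log_file"
}
```

Calling `log_action "enabled maintenance mode"` before or after each step yields a log that can be pasted into the post-maintenance report.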


Post-maintenance:


Disable maintenance mode: {{DISABLE_MAINTENANCE_CMD}}

Run full smoke test suite

Monitor for 30 minutes

Update status page: "Maintenance complete, all systems normal"

Post-maintenance report in #ops Slack channel



Related Documents

Go-Live Runbook
Incident Report
Monitoring & Observability
Disaster Recovery Plan


Approval



Role	Name	Date	Signature
Author	~~Platform Architect (AI)~~	~~2026-02-23~~
Reviewer
Approver	~~Alem Bašić~~

