Operational Runbook

 
 Project: {{PROJECT_NAME}}
 Version: {{VERSION}}
 Date: {{DATE}}
 Author: {{AUTHOR}}
 Status: Draft | In Review | Approved
 Reviewers: {{REVIEWERS}} 
 
 Document History 
 
 
 
 | Version | Date | Author | Changes |
 |---|---|---|---|
 | 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |
 
 
 
 
 1. Service Overview 

 Service: {{PROJECT_NAME}}
 Purpose: {{SERVICE_PURPOSE}}
 Technology stack: {{STACK}} 
 Architecture reference: Deployment Architecture 
 Service URLs: 
 
 
 
 | Environment | URL | Health Check |
 |---|---|---|
 | Production | {{PROD_URL}} | {{PROD_URL}}/health |
 | Staging | {{STG_URL}} | {{STG_URL}}/health |
 
 
 
 Key dashboards: 
 
 System overview: {{DASHBOARD_LINK}} 
 Service metrics: {{SERVICE_DASHBOARD_LINK}} 
 Logs: {{LOG_DASHBOARD_LINK}} 
 
 
 2. Common Operational Tasks 
 2.1 Service Restart Procedure 

 When to use: Application unresponsive, hanging workers, suspected deadlock 
 Steps: 
 Option A — Rolling restart (no downtime): 
 # AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --force-new-deployment

# Kubernetes
kubectl rollout restart deployment/{{DEPLOYMENT}} -n {{NAMESPACE}}
 
 Option B — Emergency restart (brief downtime, use only if rolling restart fails): 
 # Stop all instances
{{STOP_COMMAND}}
# Wait for drain
sleep 30
# Start fresh
{{START_COMMAND}}
 
 Verify: 
 # Check all instances healthy
{{HEALTH_CHECK_COMMAND}}
# Check for errors post-restart
{{LOG_CHECK_COMMAND}}
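
 The verification step usually needs a few retries while new instances come up. A minimal sketch of a retry wrapper follows; `wait_healthy` is a hypothetical helper name, and the attempt count and delay in the example are assumptions to tune against {{RESTART_TIME}}.

```shell
# Sketch: retry a health-check command until it succeeds or attempts run out.
# Usage: wait_healthy <max_attempts> <delay_seconds> <command> [args...]
wait_healthy() {
  wh_max=$1
  wh_delay=$2
  shift 2
  wh_attempt=1
  while [ "$wh_attempt" -le "$wh_max" ]; do
    # Run the supplied check; its output is discarded, only the exit code matters.
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $wh_attempt attempt(s)"
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$wh_attempt" -lt "$wh_max" ]; then
      sleep "$wh_delay"
    fi
    wh_attempt=$((wh_attempt + 1))
  done
  echo "still unhealthy after $wh_max attempt(s)" >&2
  return 1
}

# Example (not executed here): poll every 5 s, up to 24 times (about 2 minutes).
# wait_healthy 24 5 curl -fsS "{{PROD_URL}}/health"
```

 A non-zero exit from the wrapper is a reasonable signal to move from Option A to Option B, or to escalate.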
 
 Expected restart time: {{RESTART_TIME}} minutes
 Alert expected: A service restart triggers a deployment alert; acknowledge it in PagerDuty 
 
 2.2 Log Retrieval & Analysis 

 Centralized logs: {{LOG_URL}} 
 Quick log retrieval: 
 # Error lines from the last hour
{{LOG_TOOL}} --filter "level=error" --since "1h" --service {{SERVICE}}

# Logs for a specific user
{{LOG_TOOL}} --filter "user_id={{USER_ID}}" --since "24h"

# Logs for a specific request
{{LOG_TOOL}} --filter "request_id={{REQUEST_ID}}"

# Database slow query logs
{{DB_LOG_COMMAND}}
 
 Log format reference: See Monitoring & Observability 
 
 2.3 Database Maintenance 
 Connection count check: 
 SELECT count(*) as connections, state FROM pg_stat_activity GROUP BY state;
 
 Kill idle connections: 
 SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
 AND state_change < now() - interval '5 minutes'
 AND pid <> pg_backend_pid();
 
 Running queries (detect long-running): 
 SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '1 minute'
 AND state != 'idle';
 
 Vacuum / analyze (if table bloat suspected): 
 VACUUM ANALYZE {{TABLE_NAME}};
 
 Check replication lag (run on the replica; returns NULL on a primary): 
 SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
 
 
 2.4 Cache Clearing / Warming 
 Clear all cache (use with caution — may spike DB load): 
 {{CACHE_FLUSH_COMMAND}}
 
 Clear specific key pattern: 
 {{CACHE_DELETE_PATTERN_COMMAND}}
 
 Check cache hit rate: 
 {{CACHE_STATS_COMMAND}}
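
 If the cache is Redis (as the health check table suggests), the hit rate can be derived from the `keyspace_hits` and `keyspace_misses` counters in `INFO stats`. The sketch below assumes that output is piped in on stdin; `cache_hit_rate` is a hypothetical helper name.

```shell
# Sketch: compute the cache hit rate (%) from Redis "INFO stats" output on stdin.
cache_hit_rate() {
  # Redis terminates lines with CRLF, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      # Avoid dividing by zero on a freshly started instance.
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}

# Example (not executed here):
# redis-cli INFO stats | cache_hit_rate
```

 These counters are cumulative since the last restart, so compare two readings taken a few minutes apart to judge the current hit rate after a flush.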
 
 Warm cache after clearing: 
 # Run cache warming script
bash scripts/warm-cache.sh {{ENVIRONMENT}}
# Or trigger warming job
{{WARM_CACHE_JOB_COMMAND}}
 
 Expected DB load spike after cache clear: {{CACHE_CLEAR_IMPACT}} minutes of elevated load 
 
 2.5 Certificate Renewal 

 Automated renewal: Configured via {{CERT_TOOL}} (Let's Encrypt / ACM)
 Auto-renewal trigger: 30 days before expiry 
 Manual renewal (if auto-renewal fails): 
 # Check expiry
echo | openssl s_client -connect {{DOMAIN}}:443 2>/dev/null | openssl x509 -noout -dates

# Manual renewal
{{CERT_RENEW_COMMAND}}

# Verify
{{CERT_VERIFY_COMMAND}}
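
 To compare the expiry date against the 30-day renewal trigger, the date printed by openssl can be converted to days remaining. This is a sketch that assumes GNU `date` (for the `-d` option); `days_until` is a hypothetical helper name.

```shell
# Sketch: days remaining until a certificate's notAfter date.
# Assumes GNU date; BSD/macOS date uses different flags for parsing.
days_until() {
  du_expiry=$1                    # e.g. "Jan 11 00:00:00 2030 GMT"
  du_now=${2:-$(date -u +%s)}     # optional epoch override, useful for testing
  du_end=$(date -u -d "$du_expiry" +%s) || return 1
  echo $(( (du_end - du_now) / 86400 ))
}

# Example (not executed here):
# expiry=$(echo | openssl s_client -connect {{DOMAIN}}:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# [ "$(days_until "$expiry")" -lt 30 ] && echo "renewal overdue"
```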
 
 Verify renewal alert is working: 
 
 Alert configured: "Certificate expiring in < 30 days" → {{ALERT_CHANNEL}} 
 Test certificate: curl -vI https://{{DOMAIN}} should complete the TLS handshake without certificate errors; also confirm the Strict-Transport-Security header is present 
 
 
 2.6 Scaling Up / Down 
 Scale up (increase capacity): 
 # AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --desired-count {{COUNT}}

# Kubernetes
kubectl scale deployment/{{DEPLOYMENT}} --replicas={{COUNT}} -n {{NAMESPACE}}
 
 Verify scale-out: 
 # Check instance count
{{INSTANCE_COUNT_COMMAND}}
# Confirm health
{{HEALTH_CHECK_COMMAND}}
 
 Scale down (reduce capacity — use cautiously): 
 
 Do NOT scale below {{MIN_INSTANCES}} instances 
 Scale down during off-peak hours only ({{OFF_PEAK_HOURS}}) 
 Monitor for 10 minutes after scaling down to confirm stability 
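
 The minimum-instance rule above can be enforced before any scale command runs. A minimal sketch, assuming the floor is passed in explicitly; `check_scale_target` is a hypothetical helper name.

```shell
# Sketch: validate a requested instance count against a floor before scaling.
# Prints the count on success; returns non-zero (and prints nothing) otherwise.
check_scale_target() {
  cs_desired=$1
  cs_min=$2
  # Reject empty or non-numeric input.
  case $cs_desired in
    ''|*[!0-9]*) echo "invalid count: $cs_desired" >&2; return 2 ;;
  esac
  if [ "$cs_desired" -lt "$cs_min" ]; then
    echo "refusing: $cs_desired is below the minimum of $cs_min" >&2
    return 1
  fi
  echo "$cs_desired"
}

# Example (not executed here):
# count=$(check_scale_target {{COUNT}} {{MIN_INSTANCES}}) && \
#   kubectl scale deployment/{{DEPLOYMENT}} --replicas="$count" -n {{NAMESPACE}}
```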
 
 
 3. Troubleshooting Playbooks 
 3.1 High CPU Usage 

 Symptoms: CPU alert fires, slow responses, possible OOM 
 
 Identify the source: 
 # Top processes by CPU
{{CPU_TOP_COMMAND}}
 
 
 Check for: runaway loops, large queries being processed, missing cache causing recalculation 
 Check for recently deployed code — did CPU spike after a deploy? → Consider rollback 
 Check queue depth — backed-up job queue causes worker CPU spike 
 If single instance: restart that instance ( {{RESTART_SINGLE_COMMAND}} ) 
 If all instances: scale up immediately, then investigate root cause 
 Escalate if: CPU > {{CPU_ESCALATE}}% for > {{ESCALATE_DURATION}} min after scaling 
 
 
 3.2 Memory Leaks 
 Symptoms: Slowly increasing memory, eventual OOM kill / restart loop 
 
 Check the memory trend in the monitoring dashboard: a steady climb over hours that never plateaus indicates a leak 
 Identify the leak: 
 
 Enable heap dump: {{HEAP_DUMP_COMMAND}} 
 Profile with: {{PROFILER}} 
 
 
 Short-term mitigation: Schedule rolling restarts every {{RESTART_INTERVAL}}h
 {{SCHEDULED_RESTART_COMMAND}}
 
 
 Create ticket with heap dump attached — requires developer investigation 
 Escalate if: Restart cycle < {{MIN_RESTART_INTERVAL}}h (memory fills too fast) 
 
 
 3.3 Slow Database Queries 
 Symptoms: High P99 latency, DB CPU spike, timeouts in logs 
 
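 Find slow queries (requires the pg_stat_statements extension; on PostgreSQL 12 and earlier the columns are mean_time and max_time): 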
 SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
 
 
 Check for missing indexes: Look for sequential scans on large tables 
 Check for blocking queries: 
 SELECT blocking.pid AS blocking_pid, blocking.query AS blocking_query,
       blocked.pid AS blocked_pid, blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
 
 
 Kill blocking query if safe: 
 SELECT pg_cancel_backend({{PID}});
-- If cancel doesn't work:
SELECT pg_terminate_backend({{PID}});
 
 
 Create ticket — developer must optimize the query 
 
 
 3.4 Service Connectivity Issues 
 Symptoms: Connectivity errors between services, 502/503 errors 
 
 Check health endpoints: 
 curl -i {{SERVICE_URL}}/health
 
 
 Check network security groups / firewall rules — was anything changed recently? 
 Check service discovery — DNS resolving correctly?
 nslookup {{SERVICE_INTERNAL_DNS}}
 
 
 Check if service is running: 
 {{SERVICE_STATUS_COMMAND}}
 
 
 Check logs for connection errors: 
 {{CONNECTIVITY_LOG_COMMAND}}
 
 
 
 
 3.5 High Error Rates 
 Symptoms: Error rate alert, user complaints, 5xx in logs 
 
 Identify error type: {{LOG_ERROR_COMMAND}} — what errors, what services, what endpoints? 
 Check if correlated with: recent deployment, external service outage, traffic spike 
 Check external service status pages: 
 
 {{SERVICE_1}} status: {{STATUS_PAGE_1}} 
 {{SERVICE_2}} status: {{STATUS_PAGE_2}} 
 
 
 If recent deployment: Consider rollback if errors affecting > {{ROLLBACK_ERROR_THRESHOLD}}% of requests 
 If external service down: Check circuit breaker status, enable fallback 
 Escalate if: Error rate > {{ESCALATE_ERROR_RATE}}% for > {{ESCALATE_DURATION}} min 
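
 To answer "what endpoints?" in the first step quickly, 5xx responses can be tallied per endpoint. The sketch below assumes, hypothetically, that the log tool can emit one request per line as `METHOD PATH STATUS`; adjust the field numbers to the real log format. `count_5xx_by_endpoint` is a hypothetical helper name.

```shell
# Sketch: count 5xx responses per endpoint, busiest first.
# Assumed input on stdin: one request per line, "METHOD PATH STATUS".
count_5xx_by_endpoint() {
  awk '$3 ~ /^5[0-9][0-9]$/ { n[$2]++ }
       END { for (p in n) print n[p], p }' \
    | sort -k1,1nr -k2,2
}

# Example (not executed here):
# {{LOG_ERROR_COMMAND}} | count_5xx_by_endpoint | head -10
```

 A single endpoint dominating the list points toward a code or dependency problem on that path; an even spread points toward infrastructure.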
 
 
 3.6 Disk Space Issues 
 Symptoms: Disk space alert, application errors writing files 
 
 Check disk usage: 
 df -h
du -sh /var/log/* | sort -rh | head -10
 
 
 Quick wins: 
 # Rotate and compress logs
logrotate -f /etc/logrotate.conf
# Clear old Docker images
docker image prune -a --filter "until=24h"
# Clear /tmp
find /tmp -type f -mtime +7 -delete
 
 
 If database disk: Check for table bloat, dead tuples, WAL accumulation
 SELECT pg_size_pretty(pg_database_size('{{DB_NAME}}'));
 
 
 Escalate if: Disk > {{DISK_ESCALATE}}% and cannot free space quickly 
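
 The disk usage check in the steps above can be narrowed to only the filesystems that matter. A minimal sketch that filters `df -P` output by a usage threshold; `mounts_over_threshold` is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
# Sketch: list mount points at or above a usage threshold.
# Input on stdin: output of `df -P` (POSIX format, one filesystem per line).
mounts_over_threshold() {
  mo_threshold=$1
  awk -v limit="$mo_threshold" 'NR > 1 {
      use = $5                 # Capacity column, e.g. "91%"
      sub(/%$/, "", use)       # strip the trailing percent sign
      if (use + 0 >= limit + 0) print $6, use "%"
  }'
}

# Example (not executed here):
# df -P | mounts_over_threshold 80
```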
 
 
 4. Health Check Endpoints 

 
 
 
 | Endpoint | Method | Expected Response | What It Checks |
 |---|---|---|---|
 | {{BASE_URL}}/health | GET | HTTP 200 {"status":"ok"} | Application running |
 | {{BASE_URL}}/health/ready | GET | HTTP 200 {"status":"ready"} | App + DB + Cache connected |
 | {{BASE_URL}}/health/live | GET | HTTP 200 {"status":"alive"} | App process alive |
 | {{BASE_URL}}/health/db | GET | HTTP 200 {"status":"ok","latency_ms":X} | Database reachable |
 | {{BASE_URL}}/health/cache | GET | HTTP 200 {"status":"ok"} | Redis reachable |
 
 
 
 Health check from load balancer: {{HEALTH_CHECK_PATH}} every {{LB_INTERVAL}}s
 Unhealthy threshold: {{UNHEALTHY_COUNT}} consecutive failures 
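
 The unhealthy-threshold rule can be illustrated with a small sketch. It treats any status other than 200 as a failure and looks only at the most recent run of failures, which is a simplification: real load balancers also apply a healthy threshold before returning a target to service. `is_unhealthy` is a hypothetical helper name.

```shell
# Sketch: apply an "N consecutive failures" rule to a history of status codes.
# Usage: is_unhealthy <threshold> <code> [<code> ...]   (oldest first)
# Returns 0 when the trailing run of non-200 codes reaches the threshold.
is_unhealthy() {
  iu_threshold=$1
  shift
  iu_streak=0
  for iu_code in "$@"; do
    if [ "$iu_code" = "200" ]; then
      iu_streak=0                      # any success resets the streak
    else
      iu_streak=$((iu_streak + 1))
    fi
  done
  [ "$iu_streak" -ge "$iu_threshold" ]
}

# Example (not executed here):
# is_unhealthy {{UNHEALTHY_COUNT}} 200 503 503 503 && echo "target would be removed"
```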
 
 5. Alert Response Procedures 

 
 
 
 | Alert | Immediate Action | Runbook Section |
 |---|---|---|
 | HighErrorRate | Check logs, identify error type, assess scope | 3.5 High Error Rates |
 | SlowP99 | Check DB slow queries, recent deploys | 3.3 Slow Database Queries |
 | ServiceDown | Restart service, check logs | 2.1 Service Restart Procedure |
 | HighCPU | Scale up, identify source | 3.1 High CPU Usage |
 | DiskAlmostFull | Clear logs/tmp, escalate if > 90% | 3.6 Disk Space Issues |
 | DBReplicationLag | Check replication, network, disk on replica | 2.3 Database Maintenance |
 | CertificateExpiring | Trigger manual renewal | 2.5 Certificate Renewal |
 
 
 
 
 6. Escalation Matrix 

 
 
 
 | Situation | First Contact | Escalation | Ultimate Escalation |
 |---|---|---|---|
 | Service down | On-call engineer | Tech lead | Engineering manager |
 | Data loss / corruption | On-call + Tech lead | CTO | CTO |
 | Security incident | Security contact | CISO | CEO |
 | Payment system down | On-call + Payment owner | Stripe/payment provider support | Engineering manager |
 
 
 
 Emergency contacts: 
 
 
 
 | Role | Name | Phone | Slack |
 |---|---|---|---|
 | On-call (primary) | {{PRIMARY}} | {{PHONE}} | {{SLACK}} |
 | On-call (backup) | {{BACKUP}} | {{PHONE}} | {{SLACK}} |
 | Tech Lead | {{TECH_LEAD}} | {{PHONE}} | {{SLACK}} |
 | Engineering Manager | {{ENG_MGR}} | {{PHONE}} | {{SLACK}} |
 
 
 
 
 7. On-Call Handoff Procedure 

 Handoff cadence: {{HANDOFF_CADENCE}} 
 Handoff time: {{HANDOFF_TIME}} 
 Outgoing on-call must document: 
 
 Any open incidents or ongoing issues 
 Any monitoring anomalies (elevated error rates, slow queries not yet resolved) 
 Any upcoming events that may affect the system (marketing campaigns, scheduled maintenance) 
 Any temporary mitigations in place that need permanent fixes 
 Context on any unusual alerts that fired and were noise 
 
 Handoff document template: {{HANDOFF_TEMPLATE_LINK}} 
 
 8. Maintenance Window Procedure 

 Maintenance window schedule: {{MAINTENANCE_WINDOW}} (lowest traffic period) 
 Pre-maintenance: 
 
 Announce in Slack #ops: "Maintenance window {{DATE}} {{TIME}}-{{END_TIME}}" 
 Update status page: "Scheduled maintenance" with details 
 Notify impacted customers if downtime expected > {{DOWNTIME_NOTIFY_THRESHOLD}} minutes 
 Confirm rollback plan is ready 
 
 During maintenance: 
 
 Enable maintenance mode (if applicable): {{MAINTENANCE_MODE_CMD}} 
 Execute maintenance tasks per the specific runbook for the task 
 Run smoke tests after each major step 
 Document every action taken with timestamps 
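
 Recording actions with timestamps is easier with a one-line helper. A minimal sketch; `log_action` and the `MAINT_LOG` variable are hypothetical names, and the default file location is an assumption.

```shell
# Sketch: append a UTC-timestamped entry to a maintenance log.
# MAINT_LOG is a hypothetical variable; it defaults to ./maintenance.log.
log_action() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" \
    >> "${MAINT_LOG:-maintenance.log}"
}

# Example (not executed here):
# log_action "enabled maintenance mode"
# log_action "applied schema migration, smoke tests passed"
```

 The resulting file can be pasted directly into the post-maintenance report.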
 
 Post-maintenance: 
 
 Disable maintenance mode: {{DISABLE_MAINTENANCE_CMD}} 
 Run full smoke test suite 
 Monitor for 30 minutes 
 Update status page: "Maintenance complete, all systems normal" 
 Post-maintenance report in #ops Slack channel 
 
 
 Related Documents 
 
 Go-Live Runbook 
 Incident Report 
 Monitoring & Observability 
 Disaster Recovery Plan 
 
 
 Approval 
 
 
 
 | Role | Name | Date | Signature |
 |---|---|---|---|
 | Author | | | |
 | Reviewer | | | |
 | Approver | | | |