Operational Runbook

Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	{{DATE}}	{{AUTHOR}}	Initial draft

1. Service Overview

Service: {{PROJECT_NAME}}
Purpose: {{SERVICE_PURPOSE}}
Technology stack: {{STACK}}
Architecture reference: Deployment Architecture

Service URLs:

Environment	URL	Health Check
Production	`{{PROD_URL}}`	`{{PROD_URL}}/health`
Staging	`{{STG_URL}}`	`{{STG_URL}}/health`

Key dashboards:

System overview: {{DASHBOARD_LINK}}
Service metrics: {{SERVICE_DASHBOARD_LINK}}
Logs: {{LOG_DASHBOARD_LINK}}

2. Common Operational Tasks

2.1 Service Restart Procedure

When to use: Application unresponsive, hanging workers, suspected deadlock

Steps:

Option A — Rolling restart (no downtime):

# AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --force-new-deployment

# Kubernetes
kubectl rollout restart deployment/{{DEPLOYMENT}} -n {{NAMESPACE}}

Option B — Emergency restart (brief downtime, use only if rolling restart fails):

# Stop all instances
{{STOP_COMMAND}}
# Wait for drain
sleep 30
# Start fresh
{{START_COMMAND}}

Verify:

# Check all instances healthy
{{HEALTH_CHECK_COMMAND}}
# Check for errors post-restart
{{LOG_CHECK_COMMAND}}

Expected restart time: {{RESTART_TIME}} minutes
Alert expected: Service restart will trigger a deployment alert; acknowledge it in PagerDuty
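
The verify step can be automated by polling until the service reports healthy instead of checking once. The following is a minimal sketch, not a definitive implementation: `wait_until` is a hypothetical helper name, and the example call assumes the health endpoint returns a 2xx status once the service is up.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempt(s)" >&2
  return 1
}

# Example (assumption: 30 attempts, 10 s apart, suits the expected restart time):
#   wait_until 30 10 curl -fsS -o /dev/null "{{PROD_URL}}/health"
```

Choose the attempt count and delay so that the total wait covers the expected restart time with some margin.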

2.2 Log Retrieval & Analysis

Centralized logs: {{LOG_URL}}

Quick log retrieval:

# Error-level logs from the last hour
{{LOG_TOOL}} --filter "level=error" --since "1h" --service {{SERVICE}}

# Logs for a specific user
{{LOG_TOOL}} --filter "user_id={{USER_ID}}" --since "24h"

# Logs for a specific request
{{LOG_TOOL}} --filter "request_id={{REQUEST_ID}}"

# Database slow query logs
{{DB_LOG_COMMAND}}

Log format reference: See Monitoring & Observability

2.3 Database Maintenance

Connection count check:

SELECT count(*) as connections, state FROM pg_stat_activity GROUP BY state;

Kill idle connections:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes'
  AND pid <> pg_backend_pid();

Running queries (detect long-running):

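-- pg_stat_activity has no duration column, so compute it from query_start
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '1 minute'
  AND state != 'idle'
ORDER BY duration DESC;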

Vacuum / analyze (if table bloat suspected):

VACUUM ANALYZE {{TABLE_NAME}};

Check replication lag (run on the replica; the query returns NULL on a primary):

SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

2.4 Cache Clearing / Warming

Clear all cache (use with caution — may spike DB load):

{{CACHE_FLUSH_COMMAND}}

Clear specific key pattern:

{{CACHE_DELETE_PATTERN_COMMAND}}

Check cache hit rate:

{{CACHE_STATS_COMMAND}}

Warm cache after clearing:

# Run cache warming script
bash scripts/warm-cache.sh {{ENVIRONMENT}}
# Or trigger warming job
{{WARM_CACHE_JOB_COMMAND}}

Expected DB load spike after cache clear: {{CACHE_CLEAR_IMPACT}} minutes of elevated load

2.5 Certificate Renewal

Automated renewal: Configured via {{CERT_TOOL}} (Let's Encrypt / ACM)
Auto-renewal trigger: 30 days before expiry

Manual renewal (if auto-renewal fails):

# Check expiry
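# -servername sends SNI so hosts serving several certificates return the right one
echo | openssl s_client -servername {{DOMAIN}} -connect {{DOMAIN}}:443 2>/dev/null | openssl x509 -noout -dates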

# Manual renewal
{{CERT_RENEW_COMMAND}}

# Verify
{{CERT_VERIFY_COMMAND}}

Verify renewal alert is working:

Alert configured: "Certificate expiring in < 30 days" → {{ALERT_CHANNEL}}
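Test certificate: run curl -I https://{{DOMAIN}}. curl exits with an error if the certificate is expired or untrusted; the Strict-Transport-Security header in the response only confirms that HSTS is enabled.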
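
To compare a certificate against the 30-day alert threshold, the expiry date printed by openssl can be converted into days remaining. This is a sketch under stated assumptions: `days_until_expiry` is a hypothetical helper, it relies on GNU `date -d`, and it expects the `notAfter=...` line that `openssl x509 -noout -enddate` prints.

```shell
# days_until_expiry NOT_AFTER_LINE [NOW_EPOCH]
# Converts a "notAfter=..." line into whole days remaining.
# Assumes GNU date; NOW_EPOCH defaults to the current time.
days_until_expiry() {
  local not_after="${1#notAfter=}"
  local now="${2:-$(date +%s)}"
  local expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  # Integer division truncates toward zero; the result is zero or negative once expired.
  echo $(( (expiry - now) / 86400 ))
}

# Example against a live host (domain is a placeholder):
#   line=$(echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null \
#          | openssl x509 -noout -enddate)
#   days_until_expiry "$line"
```

A result below 30 while no alert has fired suggests the expiry alert is not working.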

2.6 Scaling Up / Down

Scale up (increase capacity):

# AWS ECS
aws ecs update-service --cluster {{CLUSTER}} --service {{SERVICE}} --desired-count {{COUNT}}

# Kubernetes
kubectl scale deployment/{{DEPLOYMENT}} --replicas={{COUNT}} -n {{NAMESPACE}}

Verify scale-out:

# Check instance count
{{INSTANCE_COUNT_COMMAND}}
# Confirm health
{{HEALTH_CHECK_COMMAND}}

Scale down (reduce capacity — use cautiously):

Do NOT scale below {{MIN_INSTANCES}} instances
Scale down during off-peak hours only ({{OFF_PEAK_HOURS}})
Monitor for 10 minutes after scaling down to confirm stability
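
The minimum-instance rule can be enforced before any scale command runs. The sketch below is illustrative only: `check_scale_target` is a hypothetical guard, and the caller is assumed to pass the value configured for {{MIN_INSTANCES}}.

```shell
# check_scale_target REQUESTED MIN_INSTANCES
# Hypothetical guard: succeeds only when REQUESTED is a whole number
# that is not below MIN_INSTANCES.
check_scale_target() {
  local requested="$1" min="$2" value
  for value in "$requested" "$min"; do
    case "$value" in
      '' | *[!0-9]*)
        echo "error: counts must be non-negative integers" >&2
        return 2
        ;;
    esac
  done
  if [ "$requested" -lt "$min" ]; then
    echo "refusing to scale to ${requested}: minimum is ${min}" >&2
    return 1
  fi
  echo "ok: ${requested} >= ${min}"
}

# Example (variable names are placeholders):
#   check_scale_target "$count" "$min_instances" \
#     && kubectl scale "deployment/$deployment" --replicas="$count" -n "$namespace"
```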

3. Troubleshooting Playbooks

3.1 High CPU Usage

Symptoms: CPU alert fires, slow responses, possible OOM

Identify the source:

# Top processes by CPU
{{CPU_TOP_COMMAND}}

Check for: runaway loops, large queries being processed, missing cache causing recalculation
Check for recently deployed code — did CPU spike after a deploy? → Consider rollback
Check queue depth — backed-up job queue causes worker CPU spike
If single instance: restart that instance ({{RESTART_SINGLE_COMMAND}})
If all instances: scale up immediately, then investigate root cause
Escalate if: CPU > {{CPU_ESCALATE}}% for > {{ESCALATE_DURATION}} min after scaling

3.2 Memory Leaks

Symptoms: Slowly increasing memory, eventual OOM kill / restart loop

Check memory trend in monitoring dashboard — linear increase over hours = leak
Identify the leak:
- Enable heap dump: {{HEAP_DUMP_COMMAND}}
- Profile with: {{PROFILER}}
Short-term mitigation: Schedule rolling restarts every {{RESTART_INTERVAL}}h
```
{{SCHEDULED_RESTART_COMMAND}}
```
Create ticket with heap dump attached — requires developer investigation
Escalate if: Restart cycle < {{MIN_RESTART_INTERVAL}}h (memory fills too fast)
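
A rough estimate of how long an instance can run before hitting its memory limit helps choose the restart interval and decide whether to escalate. The sketch below assumes strictly linear growth between two dashboard samples, which is a simplification; `hours_until_limit` and its arguments are hypothetical.

```shell
# hours_until_limit START_MB END_MB ELAPSED_HOURS LIMIT_MB
# Estimates hours remaining before memory reaches LIMIT_MB,
# assuming linear growth between the two samples.
hours_until_limit() {
  awk -v start="$1" -v end="$2" -v hours="$3" -v limit="$4" 'BEGIN {
    if (hours <= 0) {
      print "error: elapsed hours must be positive" > "/dev/stderr"
      exit 2
    }
    rate = (end - start) / hours            # MB per hour
    if (rate <= 0) { print "no growth"; exit 0 }
    printf "%.1f\n", (limit - end) / rate   # hours left at the current rate
  }'
}

# Example: 1000 MB grew to 1600 MB over 6 hours, with a 4000 MB limit
#   hours_until_limit 1000 1600 6 4000    # prints 24.0
```

If the estimate falls below {{MIN_RESTART_INTERVAL}} hours, scheduled restarts are unlikely to keep up with the leak.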

3.3 Slow Database Queries

Symptoms: High P99 latency, DB CPU spike, timeouts in logs

Find slow queries:

SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

Check for missing indexes: Look for sequential scans on large tables

Check for blocking queries:

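-- Column aliases keep the blocking and blocked sessions distinguishable in the output
SELECT blocking.pid AS blocking_pid, blocking.query AS blocking_query,
       blocked.pid AS blocked_pid, blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));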

Kill blocking query if safe:

SELECT pg_cancel_backend({{PID}});
-- If cancel doesn't work:
SELECT pg_terminate_backend({{PID}});

Create ticket — developer must optimize the query

3.4 Service Connectivity Issues

Symptoms: Connectivity errors between services, 502/503 errors

Check health endpoints:
```
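# Send a GET (curl -I would send HEAD) and print only the HTTP status code
curl -sS -o /dev/null -w '%{http_code}\n' {{SERVICE_URL}}/health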
```
Check network security groups / firewall rules — was anything changed recently?
Check service discovery — DNS resolving correctly?
```
nslookup {{SERVICE_INTERNAL_DNS}}
```
Check if service is running:
```
{{SERVICE_STATUS_COMMAND}}
```
Check logs for connection errors:
```
{{CONNECTIVITY_LOG_COMMAND}}
```

3.5 High Error Rates

Symptoms: Error rate alert, user complaints, 5xx in logs

Identify error type: {{LOG_ERROR_COMMAND}} — what errors, what services, what endpoints?
Check if correlated with: recent deployment, external service outage, traffic spike
Check external service status pages:
- {{SERVICE_1}} status: {{STATUS_PAGE_1}}
- {{SERVICE_2}} status: {{STATUS_PAGE_2}}
If recent deployment: Consider rollback if errors affecting > {{ROLLBACK_ERROR_THRESHOLD}}% of requests
If external service down: Check circuit breaker status, enable fallback
Escalate if: Error rate > {{ESCALATE_ERROR_RATE}}% for > {{ESCALATE_DURATION}} min
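
Both the rollback and escalation thresholds are percentages of requests, so it helps to compute the rate from raw counts. This is a minimal sketch: `error_rate_exceeds` is a hypothetical helper, and the error and total counts are assumed to come from the logging or metrics tooling.

```shell
# error_rate_exceeds ERRORS TOTAL THRESHOLD_PERCENT
# Prints the error rate as a percentage and exits 0 when it is above the threshold.
error_rate_exceeds() {
  awk -v errors="$1" -v total="$2" -v threshold="$3" 'BEGIN {
    if (total <= 0) { print "0.00"; exit 1 }   # no traffic: nothing to compare
    rate = 100 * errors / total
    printf "%.2f\n", rate
    exit ((rate > threshold) ? 0 : 1)
  }'
}

# Example: 50 errors out of 1000 requests against a 2% threshold
#   error_rate_exceeds 50 1000 2    # prints 5.00 and exits 0
```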

3.6 Disk Space Issues

Symptoms: Disk space alert, application errors writing files

Check disk usage:

df -h
du -sh /var/log/* | sort -rh | head -10

Quick wins:

# Rotate and compress logs
logrotate -f /etc/logrotate.conf
# Clear old Docker images
docker image prune -a --filter "until=24h"
# Delete files in /tmp not modified for more than 7 days
find /tmp -type f -mtime +7 -delete

If database disk: Check for table bloat, dead tuples, WAL accumulation
```
SELECT pg_size_pretty(pg_database_size('{{DB_NAME}}'));
```
Escalate if: Disk > {{DISK_ESCALATE}}% and cannot free space quickly
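
To see at a glance which filesystems are over the escalation threshold, the `df` output can be filtered. The sketch below is an assumption-laden example: `mounts_over` is a hypothetical helper, it expects the POSIX format of `df -P` on stdin, and it does not handle mount points containing spaces.

```shell
# mounts_over THRESHOLD_PERCENT
# Reads `df -P` output on stdin and prints "<mount> <use%>" for every
# filesystem whose usage is at or above the threshold.
mounts_over() {
  awk -v threshold="$1" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)                 # strip the trailing percent sign
    if (use + 0 >= threshold + 0) print $6, use "%"
  }'
}

# Example:
#   df -P | mounts_over 85
```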

4. Health Check Endpoints

Endpoint	Method	Expected Response	What It Checks
`{{BASE_URL}}/health`	GET	HTTP 200 `{"status":"ok"}`	Application running
`{{BASE_URL}}/health/ready`	GET	HTTP 200 `{"status":"ready"}`	App + DB + Cache connected
`{{BASE_URL}}/health/live`	GET	HTTP 200 `{"status":"alive"}`	App process alive
`{{BASE_URL}}/health/db`	GET	HTTP 200 `{"status":"ok","latency_ms":X}`	Database reachable
`{{BASE_URL}}/health/cache`	GET	HTTP 200 `{"status":"ok"}`	Redis reachable

Health check from load balancer: {{HEALTH_CHECK_PATH}} every {{LB_INTERVAL}}s
Unhealthy threshold: {{UNHEALTHY_COUNT}} consecutive failures
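
The endpoints in the table above can be swept in one pass during an incident. The following sketch assumes each endpoint returns HTTP 200 when healthy, as the table states; `fetch_status` and `check_endpoints` are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```shell
# fetch_status URL
# Prints the HTTP status code of a GET request (curl reports 000 if the request fails).
fetch_status() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# check_endpoints BASE_URL
# Requests each health path and reports any that do not return 200.
# Returns non-zero if at least one endpoint fails.
check_endpoints() {
  local base="$1" failed=0 path code
  for path in /health /health/ready /health/live /health/db /health/cache; do
    code=$(fetch_status "${base}${path}")
    if [ "$code" = "200" ]; then
      echo "OK   ${path}"
    else
      echo "FAIL ${path} (HTTP ${code})"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   check_endpoints "{{BASE_URL}}"
```

This checks status codes only; it does not validate the JSON bodies listed in the table.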

5. Alert Response Procedures

Alert	Immediate Action	Runbook Section
`HighErrorRate`	Check logs, identify error type, assess scope	3.5 High Error Rates
`SlowP99`	Check DB slow queries, recent deploys	3.3 Slow Database Queries
`ServiceDown`	Restart service, check logs	2.1 Service Restart Procedure
`HighCPU`	Scale up, identify source	3.1 High CPU Usage
`DiskAlmostFull`	Clear logs/tmp, escalate if > 90%	3.6 Disk Space Issues
`DBReplicationLag`	Check replication, network, disk on replica	2.3 Database Maintenance (Check replication lag)
`CertificateExpiring`	Trigger manual renewal	2.5 Certificate Renewal

6. Escalation Matrix

Situation	First Contact	Escalation	Ultimate Escalation
Service down	On-call engineer	Tech lead	Engineering manager
Data loss / corruption	On-call + Tech lead	CTO	CTO
Security incident	Security contact	CISO	CEO
Payment system down	On-call + Payment owner	Stripe/payment provider support	Engineering manager

Emergency contacts:

Role	Name	Phone	Slack
On-call (primary)	{{PRIMARY}}	{{PHONE}}	{{SLACK}}
On-call (backup)	{{BACKUP}}	{{PHONE}}	{{SLACK}}
Tech Lead	{{TECH_LEAD}}	{{PHONE}}	{{SLACK}}
Engineering Manager	{{ENG_MGR}}	{{PHONE}}	{{SLACK}}

7. On-Call Handoff Procedure

Handoff cadence: {{HANDOFF_CADENCE}}
Handoff time: {{HANDOFF_TIME}}

Outgoing on-call must document:

Any open incidents or ongoing issues
Any monitoring anomalies (elevated error rates, slow queries not yet resolved)
Any upcoming events that may affect the system (marketing campaigns, scheduled maintenance)
Any temporary mitigations in place that need permanent fixes
Context on any unusual alerts that fired and were noise

Handoff document template: {{HANDOFF_TEMPLATE_LINK}}

8. Maintenance Window Procedure

Maintenance window schedule: {{MAINTENANCE_WINDOW}} (lowest traffic period)

Pre-maintenance:

Announce in Slack #ops: "Maintenance window {{DATE}} {{TIME}}-{{END_TIME}}"
Update status page: "Scheduled maintenance" with details
Notify impacted customers if downtime expected > {{DOWNTIME_NOTIFY_THRESHOLD}} minutes
Confirm rollback plan is ready

During maintenance:

Enable maintenance mode (if applicable): {{MAINTENANCE_MODE_CMD}}
Execute maintenance tasks per the specific runbook for the task
Run smoke tests after each major step
Document every action taken with timestamps
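
Recording actions with timestamps is easier when it takes a single command. The sketch below is one possible approach, not a prescribed tool: `log_action` is a hypothetical helper and the log file path is an assumption.

```shell
# log_action LOGFILE MESSAGE...
# Appends one line with a UTC ISO 8601 timestamp, the operator, and the message.
log_action() {
  local logfile="$1"
  shift
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$*" >> "$logfile"
}

# Example (file name is a placeholder):
#   log_action maintenance.log "enabled maintenance mode"
#   log_action maintenance.log "smoke tests passed after step 1"
```

The resulting file can be attached to the post-maintenance report.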

Post-maintenance:

Disable maintenance mode: {{DISABLE_MAINTENANCE_CMD}}
Run full smoke test suite
Monitor for 30 minutes
Update status page: "Maintenance complete, all systems normal"
Post-maintenance report in #ops Slack channel

Approval

Role	Name	Date	Signature
Author
Reviewer
Approver
