# Monitoring & Observability

- Project: {{PROJECT_NAME}}
- Version: {{VERSION}}
- Date: {{DATE}}
- Author: {{AUTHOR}}
- Status: Draft | In Review | Approved
- Reviewers: {{REVIEWERS}}

## Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | {{DATE}} | {{AUTHOR}} | Initial draft |

## 1. Observability Strategy

Observability Platform: {{OBS_PLATFORM}}

Strategy: Instrument everything, alert on symptoms (not causes), correlate across pillars

Core Questions We Must Be Able to Answer:

- Is the system up and serving users correctly?
- How fast is it responding?
- What errors are occurring and why?
- Where is the bottleneck?
- What changed before this problem started?

## 2. Three Pillars

### 2.1 Metrics

#### Infrastructure Metrics

| Metric | Source | Alert Threshold | Severity |
|--------|--------|-----------------|----------|
| CPU utilization | Node exporter / CloudWatch | > {{CPU_WARN}}% (warn), > {{CPU_CRIT}}% (critical) | Warning / Critical |
| Memory utilization | Node exporter / CloudWatch | > {{MEM_WARN}}% (warn), > {{MEM_CRIT}}% (critical) | Warning / Critical |
| Disk utilization | Node exporter / CloudWatch | > {{DISK_WARN}}% (warn), > {{DISK_CRIT}}% (critical) | Warning / Critical |
| Network in/out | Node exporter / CloudWatch | > {{NET_LIMIT}} Mbps sustained | Warning |
| Container restarts | Kubernetes / ECS | > {{RESTART_LIMIT}} in 5 min | Critical |
| Node not ready | Kubernetes | Any | Critical |

#### Application Metrics (RED Method)

| Metric | Description | Target | Alert Threshold |
|--------|-------------|--------|-----------------|
| Request rate | Requests per second per service | Baseline ± 20% | 50% deviation |
| Error rate | % requests returning 5xx | < {{ERROR_RATE}}% | > {{ERROR_ALERT}}% |
| P50 latency | Median response time | < {{P50}} ms | > {{P50_ALERT}} ms |
| P95 latency | 95th percentile response time | < {{P95}} ms | > {{P95_ALERT}} ms |
| P99 latency | 99th percentile response time | < {{P99}} ms | > {{P99_ALERT}} ms |

#### Business Metrics

| Metric | Description | Collection Method | Dashboard |
|--------|-------------|-------------------|-----------|
| Active users (DAU/MAU) | Daily/monthly active users | Frontend instrumentation | Business dashboard |
| {{CONVERSION_METRIC}} | {{CONVERSION_DESC}} | Backend event | Business dashboard |
| {{REVENUE_METRIC}} | {{REVENUE_DESC}} | Payment events | Finance dashboard |
| Feature usage | Feature-level engagement | Feature flag SDK | Product dashboard |

#### Custom Metrics Definition

| Metric Name | Type | Labels | Description | Unit |
|-------------|------|--------|-------------|------|
| `{{APP}}_job_queue_depth` | Gauge | `queue_name` | Number of pending jobs | count |
| `{{APP}}_job_processing_duration` | Histogram | `queue_name`, `status` | Job processing time | seconds |
| `{{APP}}_external_api_calls_total` | Counter | `service`, `status` | External API call count | count |
| `{{APP}}_cache_hit_ratio` | Gauge | `cache_type` | Cache hit percentage | ratio |

### 2.2 Logs

#### Log Levels & Usage Guide

| Level | When to Use | Examples |
|-------|-------------|----------|
| ERROR | Unexpected failure requiring attention | Database connection failure, unhandled exception |
| WARN | Unexpected but handled situation | Deprecated API called, retry succeeded |
| INFO | Normal business events | User logged in, order created, job completed |
| DEBUG | Diagnostic detail (dev/staging only) | Function parameters, internal state |
| TRACE | Extremely verbose (local dev only) | SQL queries, HTTP request/response bodies |

Production log level: INFO and above

#### Structured Logging Format

    {
      "timestamp": "2026-01-15T10:30:00.000Z",
      "level": "INFO",
      "service": "{{SERVICE_NAME}}",
      "version": "{{VERSION}}",
      "trace_id": "abc123def456",
      "span_id": "789xyz",
      "user_id": "{{HASHED_OR_OMIT}}",
      "request_id": "req-uuid-here",
      "message": "Order created successfully",
      "order_id": "ord-123",
      "duration_ms": 45
    }

Required fields: `timestamp`, `level`, `service`, `message`, `trace_id`

Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)

#### Log Aggregation Pipeline

    flowchart LR
        APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
        AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
        STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
        STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]

| Stage | Tool | Configuration |
|-------|------|---------------|
| Application logging | {{LOG_LIB}} | Structured JSON to stdout |
| Log agent | {{LOG_AGENT}} | Deployed as sidecar / DaemonSet |
| Transport | {{LOG_TRANSPORT}} | TLS encrypted |
| Storage | {{LOG_STORE}} | Indexed, compressed |
| Query | {{LOG_QUERY}} | Access via dashboard |

#### Log Retention Policy

| Environment | Retention | Storage Tier |
|-------------|-----------|--------------|
| Dev | 7 days | Hot |
| Staging | 30 days | Hot |
| Production | {{PROD_LOG_RETENTION}} days | Hot (30d) → Cold archive |
| Audit logs | 1 year (regulatory) | Hot (90d) → Cold archive |

#### PII in Logs: Masking Strategy

| Data Type | Strategy | Example |
|-----------|----------|---------|
| Email address | Hash + truncate | `user:sha256(email)[:8]` |
| Phone number | Redact | `[PHONE_REDACTED]` |
| IP address | Anonymize last octet | `192.168.1.xxx` |
| Payment data | Never log | Use `[PAYMENT_DATA_OMITTED]` |
| Auth tokens | Never log | Use `[TOKEN_OMITTED]` |
| Names | Omit or pseudonymize | Reference by ID only |

### 2.3 Traces

#### Distributed Tracing Setup

- Tracing Framework: {{TRACE_FRAMEWORK}}
- Backend: {{TRACE_BACKEND}}
- Auto-instrumentation: {{AUTO_INSTRUMENT}}

| Service | Instrumented | Framework | Notes |
|---------|--------------|-----------|-------|
| {{SERVICE_1}} | Yes | OpenTelemetry | HTTP, DB, Redis |
| {{SERVICE_2}} | Yes | OpenTelemetry | HTTP, external calls |

#### Trace Sampling Strategy

| Environment | Strategy | Rate | Notes |
|-------------|----------|------|-------|
| Dev | Always-on | 100% | Full visibility |
| Staging | Always-on | 100% | Full visibility |
| Production | Tail-based | {{SAMPLE_RATE}}% + errors | Error traces always kept |

Tail-based sampling rules:

- Always sample: traces with errors, traces > {{SLOW_THRESHOLD}} ms
- Sample rate: {{SAMPLE_RATE}}% of successful, fast traces
- Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable

#### Span Naming Conventions

| Operation Type | Naming Pattern | Example |
|----------------|----------------|---------|
| HTTP handler | `HTTP {{METHOD}} {{ROUTE}}` | `HTTP POST /api/orders` |
| DB query | `db.{{operation}} {{table}}` | `db.select orders` |
| Cache | `cache.{{operation}} {{key_pattern}}` | `cache.get user:*` |
| Queue | `queue.{{operation}} {{queue_name}}` | `queue.publish order-events` |
| External HTTP | `{{service}} {{METHOD}} {{path}}` | `stripe POST /charges` |

#### Context Propagation

- Standard: W3C TraceContext (`traceparent` header)
- Baggage: W3C Baggage (for `user_id`, `tenant_id` propagation)
- Async: Inject context into message queue headers / job metadata

## 3. Alerting

### 3.1 Alert Rules

| Alert Name | Condition | Duration | Severity | Channel | Runbook |
|------------|-----------|----------|----------|---------|---------|
| HighErrorRate | `error_rate` > {{ERROR_ALERT}}% | 2 min | Critical | PagerDuty | [link] |
| SlowP99 | `p99_latency` > {{P99_ALERT}} ms | 5 min | Warning | Slack #alerts | [link] |
| ServiceDown | `health_check` failing | 1 min | Critical | PagerDuty | [link] |
| HighCPU | `cpu` > {{CPU_CRIT}}% | 10 min | Warning | Slack #alerts | [link] |
| DiskAlmostFull | `disk` > {{DISK_CRIT}}% | 5 min | Critical | PagerDuty | [link] |
| DeploymentFailed | deployment status = failed | Immediate | Critical | Slack #deployments | [link] |
| CertificateExpiringSoon | `cert_expiry` < 30 days | - | Warning | Slack #ops | [link] |
| BackupFailed | backup job = failed | - | Critical | PagerDuty | [link] |
| SLOBudgetBurning | `error_budget` < 10% remaining | - | Critical | PagerDuty | [link] |

### 3.2 Alert Routing & Escalation

    flowchart TD
        ALERT[Alert fires] --> SEVERITY{Severity?}
        SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
        SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
        ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
        ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]

| Severity | Response SLA | Channel | Escalation |
|----------|--------------|---------|------------|
| Critical (P1) | Acknowledge in 5 min, resolve in 1h | PagerDuty + call | Escalate at 5 min |
| High (P2) | Acknowledge in 30 min, resolve in 4h | PagerDuty | Escalate at 30 min |
| Warning (P3) | Review within 1 business day | Slack | Manual |
| Info | No response required | Slack | None |

### 3.3 On-Call Rotation

- Schedule: {{ONCALL_SCHEDULE}}
- Calendar: {{ONCALL_TOOL}}
- Primary rotation: {{ONCALL_MEMBERS}}
- Secondary (escalation): {{ESCALATION_MEMBERS}}
- Minimum rotation size: 3 people (to avoid burnout)

### 3.4 Alert Fatigue Prevention

- Alert review cadence: Monthly; remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate
- Minimum alert duration: 2+ minutes (no single-spike alerts)
- Deduplication window: {{DEDUP_WINDOW}} minutes
- Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}
- Post-mortem requirement: Every Critical alert reviewed after incident

## 4. Dashboards

### 4.1 Dashboard Inventory

| Dashboard | Purpose | Link | Audience |
|-----------|---------|------|----------|
| System Overview | High-level health of all services | {{LINK}} | Everyone |
| {{SERVICE_1}} | Service-level detail | {{LINK}} | Dev team |
| Infrastructure | Host/container metrics | {{LINK}} | DevOps |
| Business Metrics | KPIs and conversions | {{LINK}} | Leadership, PM |
| SLO Tracker | Error budget tracking | {{LINK}} | Engineering lead |
| On-Call | Current incidents, top errors | {{LINK}} | On-call engineer |

### 4.2 Key Dashboard Specs: System Overview

Required panels:

- Service health matrix (all services, green/red/yellow)
- Request rate (all services, last 1h)
- Error rate (all services, last 1h)
- P99 latency (all services, last 1h)
- Active incidents count
- Error budget remaining (all SLOs)
- Last deployment (service, version, time)
- Infrastructure health (CPU, memory, disk; aggregate)

## 5. SLOs / SLIs

### 5.1 SLI Definitions

| SLI | Definition | Measurement Method |
|-----|------------|--------------------|
| Availability | % requests returning non-5xx | `(total_requests - 5xx_requests) / total_requests` |
| Latency | % requests completing within threshold | `histogram_quantile(0.95, ...)` < {{LATENCY_SLI}} ms |
| Error rate | % requests not returning errors | `(total_requests - error_requests) / total_requests` |

### 5.2 SLO Targets

| Service | SLI | Target | Window | Error Budget |
|---------|-----|--------|--------|--------------|
| {{SERVICE}} | Availability | {{AVAIL_TARGET}}% | 30 days | {{BUDGET_MINUTES}} min/month |
| {{SERVICE}} | Latency (P95 < {{P95}} ms) | {{LATENCY_TARGET}}% | 30 days | {{LATENCY_BUDGET_MINUTES}} min/month |

### 5.3 Error Budget Tracking

| Service | Monthly Budget | Burned This Month | Remaining | Burn Rate (24h) |
|---------|----------------|-------------------|-----------|-----------------|
| {{SERVICE}} | {{BUDGET}} min | TBD | TBD | TBD |

Error budget policy:

- Budget > 50% remaining: Move fast, deploy freely
- Budget 10-50% remaining: Slow down, prioritize reliability work
- Budget < 10% remaining: Freeze non-critical deploys, focus on reliability

## 6. Tooling

| Tool | Version | Purpose | Hosted |
|------|---------|---------|--------|
| {{METRICS_TOOL}} | {{VERSION}} | Metrics collection & storage | {{HOSTING}} |
| {{LOG_TOOL}} | {{VERSION}} | Log aggregation | {{HOSTING}} |
| {{TRACE_TOOL}} | {{VERSION}} | Distributed tracing | {{HOSTING}} |
| {{DASHBOARD_TOOL}} | {{VERSION}} | Visualization | {{HOSTING}} |
| {{ALERT_TOOL}} | {{VERSION}} | Alert routing & on-call | {{HOSTING}} |

## Related Documents

- Deployment Architecture
- Disaster Recovery Plan
- Incident Report
- Operational Runbook
- SLA Report

## Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | | | |
| Reviewer | | | |
| Approver | | | |
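Worked example for sections 5.2 and 5.3: the sketch below shows one way the min/month error budget and the three policy tiers could be computed. It assumes the budget is expressed as minutes of full unavailability over the 30-day window, which is a simplification of the request-based availability SLI. The function names `error_budget_minutes` and `budget_policy` and the returned tier labels are hypothetical, not part of this template.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of full unavailability allowed by an availability target.

    Assumption: the budget is time-based (total outage minutes), matching
    the "min/month" unit used in the SLO Targets table.
    """
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return (1 - target_pct / 100) * window_days * MINUTES_PER_DAY


def budget_policy(budget_min: float, burned_min: float) -> str:
    """Map the remaining budget share to the three policy tiers."""
    if budget_min <= 0:
        raise ValueError("budget_min must be positive")
    remaining = max(budget_min - burned_min, 0.0) / budget_min
    if remaining > 0.5:
        return "deploy freely"
    if remaining >= 0.1:
        return "slow down"
    return "freeze non-critical deploys"


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)  # roughly 43.2 minutes per 30 days
    print(round(budget, 1), budget_policy(budget, burned_min=30))
```

With a 99.9% target the budget is about 43.2 minutes per 30 days; after 30 minutes burned, roughly 31% remains, which falls in the 10-50% tier.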
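Worked example for the PII masking table in section 2.2: a minimal sketch of the email and IP strategies, using only the Python standard library. The helper names `mask_email` and `mask_ipv4` are hypothetical, and lowercasing the email before hashing is an assumption made so that the same address always maps to the same token. Note that an unsalted, truncated hash is pseudonymization rather than anonymization.

```python
import hashlib
import ipaddress

PHONE_REDACTED = "[PHONE_REDACTED]"  # fixed placeholder from the masking table


def mask_email(email: str) -> str:
    """Return 'user:' plus the first 8 hex chars of SHA-256(email)."""
    # Assumption: normalize case and whitespace so tokens are stable.
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"user:{digest[:8]}"


def mask_ipv4(ip: str) -> str:
    """Replace the last octet of an IPv4 address with 'xxx'."""
    # IPv4Address raises a ValueError subclass on malformed input.
    addr = ipaddress.IPv4Address(ip)
    octets = str(addr).split(".")
    return ".".join(octets[:3] + ["xxx"])


if __name__ == "__main__":
    print(mask_email("Jane.Doe@example.com"))
    print(mask_ipv4("192.168.1.42"))
```

IPv6 addresses are not handled here; a real deployment would need its own rule for them.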
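Worked example for the tail-based sampling rules in section 2.3: a sketch of the keep/drop decision a collector could make once a trace is complete. This is illustrative logic only, not the configuration or API of any particular tracing backend. `TraceSummary` and `keep_trace` are hypothetical names, and hashing the trace ID for the probabilistic part is an assumption chosen so that the decision is deterministic per trace.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: TraceSummary, slow_threshold_ms: float,
               sample_rate_pct: float) -> bool:
    """Decide whether a completed trace is retained."""
    # Rule 1: always keep traces with errors or above the slow threshold.
    if trace.has_error or trace.duration_ms > slow_threshold_ms:
        return True
    # Rule 2: keep sample_rate_pct percent of the remaining traces.
    # Hash the trace ID into one of 10,000 buckets (0.01% resolution).
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_rate_pct * 100


if __name__ == "__main__":
    fast_ok = TraceSummary("abc123def456", duration_ms=45, has_error=False)
    print(keep_trace(fast_ok, slow_threshold_ms=1000, sample_rate_pct=10))
```

Because the bucket depends only on the trace ID, every collector instance reaches the same decision for the same trace.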