Monitoring & Observability
Monitoring & Observability
Project: {{PROJECT_NAME}}
Version: {{VERSION}}
Date: {{DATE}}
Author: {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: {{REVIEWERS}}
Document History
Version
Date
Author
Changes
0.1
{{DATE}}
{{AUTHOR}}
Initial draft
1. Observability Strategy
Observability Platform: {{OBS_PLATFORM}}
Strategy: Instrument everything, alert on symptoms (not causes), correlate across pillars
Core Questions We Must Be Able to Answer:
Is the system up and serving users correctly?
How fast is it responding?
What errors are occurring and why?
Where is the bottleneck?
What changed before this problem started?
2. Three Pillars
2.1 Metrics
Infrastructure Metrics
Metric
Source
Alert Threshold
Severity
CPU utilization
Node exporter / CloudWatch
> {{CPU_WARN}}% (warn), > {{CPU_CRIT}}% (critical)
Warning / Critical
Memory utilization
Node exporter / CloudWatch
> {{MEM_WARN}}% (warn), > {{MEM_CRIT}}% (critical)
Warning / Critical
Disk utilization
Node exporter / CloudWatch
> {{DISK_WARN}}% (warn), > {{DISK_CRIT}}% (critical)
Warning / Critical
Network in/out
Node exporter / CloudWatch
> {{NET_LIMIT}}Mbps sustained
Warning
Container restarts
Kubernetes / ECS
> {{RESTART_LIMIT}} in 5min
Critical
Node not ready
Kubernetes
Any
Critical
Application Metrics (RED Method)
Metric
Description
Target
Alert Threshold
Request rate
Requests per second per service
Baseline ± 20%
50% deviation
Error rate
% requests returning 5xx
< {{ERROR_RATE}}%
> {{ERROR_ALERT}}%
P50 latency
Median response time
< {{P50}}ms
> {{P50_ALERT}}ms
P95 latency
95th percentile response time
< {{P95}}ms
> {{P95_ALERT}}ms
P99 latency
99th percentile response time
< {{P99}}ms
> {{P99_ALERT}}ms
Business Metrics
Metric
Description
Collection Method
Dashboard
Active users (DAU/MAU)
Daily/monthly active users
Frontend instrumentation
Business dashboard
{{CONVERSION_METRIC}}
{{CONVERSION_DESC}}
Backend event
Business dashboard
{{REVENUE_METRIC}}
{{REVENUE_DESC}}
Payment events
Finance dashboard
Feature usage
Feature-level engagement
Feature flag SDK
Product dashboard
Custom Metrics Definition
Metric Name
Type
Labels
Description
Unit
{{APP}}_job_queue_depth
Gauge
queue_name
Number of pending jobs
count
{{APP}}_job_processing_duration
Histogram
queue_name, status
Job processing time
seconds
{{APP}}_external_api_calls_total
Counter
service, status
External API call count
count
{{APP}}_cache_hit_ratio
Gauge
cache_type
Cache hit percentage
ratio
2.2 Logs
Log Levels & Usage Guide
Level
When to Use
Examples
ERROR
Unexpected failure requiring attention
Database connection failure, unhandled exception
WARN
Unexpected but handled situation
Deprecated API called, retry succeeded
INFO
Normal business events
User logged in, order created, job completed
DEBUG
Diagnostic detail (dev/staging only)
Function parameters, internal state
TRACE
Extremely verbose (local dev only)
SQL queries, HTTP request/response bodies
Production log level: INFO and above
Structured Logging Format
{
"timestamp": "2026-01-15T10:30:00.000Z",
"level": "INFO",
"service": "{{SERVICE_NAME}}",
"version": "{{VERSION}}",
"trace_id": "abc123def456",
"span_id": "789xyz",
"user_id": "{{HASHED_OR_OMIT}}",
"request_id": "req-uuid-here",
"message": "Order created successfully",
"order_id": "ord-123",
"duration_ms": 45
}
Required fields: timestamp , level , service , message , trace_id
Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)
Log Aggregation Pipeline
flowchart LR
APP[Application] -->|stdout/stderr| AGENT[Log Agent
Fluent Bit / Filebeat]
AGENT -->|structured JSON| STORE[Log Store
Loki / Elasticsearch / CloudWatch]
STORE --> QUERY[Query Interface
Grafana / Kibana]
STORE --> ALERT[Alert Engine
AlertManager / PagerDuty]
Stage
Tool
Configuration
Application logging
{{LOG_LIB}}
Structured JSON to stdout
Log agent
{{LOG_AGENT}}
Deployed as sidecar / DaemonSet
Transport
{{LOG_TRANSPORT}}
TLS encrypted
Storage
{{LOG_STORE}}
Indexed, compressed
Query
{{LOG_QUERY}}
Access via dashboard
Log Retention Policy
Environment
Retention
Storage Tier
Dev
7 days
Hot
Staging
30 days
Hot
Production
{{PROD_LOG_RETENTION}} days
Hot (30d) → Cold archive
Audit logs
1 year (regulatory)
Hot (90d) → Cold archive
PII in Logs — Masking Strategy
Data Type
Strategy
Example
Email address
Hash + truncate
user:sha256(email)[:8]
Phone number
Redact
[PHONE_REDACTED]
IP address
Anonymize last octet
192.168.1.xxx
Payment data
Never log
Use [PAYMENT_DATA_OMITTED]
Auth tokens
Never log
Use [TOKEN_OMITTED]
Names
Omit or pseudonymize
Reference by ID only
2.3 Traces
Distributed Tracing Setup
Tracing Framework: {{TRACE_FRAMEWORK}}
Backend: {{TRACE_BACKEND}}
Auto-instrumentation: {{AUTO_INSTRUMENT}}
Service
Instrumented
Framework
Notes
{{SERVICE_1}}
Yes
OpenTelemetry
HTTP, DB, Redis
{{SERVICE_2}}
Yes
OpenTelemetry
HTTP, external calls
Trace Sampling Strategy
Environment
Strategy
Rate
Notes
Dev
Always-on
100%
Full visibility
Staging
Always-on
100%
Full visibility
Production
Tail-based
{{SAMPLE_RATE}}% + errors
Error traces always kept
Tail-based sampling rules:
Always sample: traces with errors, traces > {{SLOW_THRESHOLD}}ms
Sample rate: {{SAMPLE_RATE}}% of successful, fast traces
Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable
Span Naming Conventions
Operation Type
Naming Pattern
Example
HTTP handler
HTTP {{METHOD}} {{ROUTE}}
HTTP POST /api/orders
DB query
db.{{operation}} {{table}}
db.select orders
Cache
cache.{{operation}} {{key_pattern}}
cache.get user:*
Queue
queue.{{operation}} {{queue_name}}
queue.publish order-events
External HTTP
{{service}} {{METHOD}} {{path}}
stripe POST /charges
Context Propagation
Standard: W3C TraceContext ( traceparent header)
Baggage: W3C Baggage (for user_id , tenant_id propagation)
Async: Inject context into message queue headers / job metadata
3. Alerting
3.1 Alert Rules
Alert Name
Condition
Duration
Severity
Channel
Runbook
HighErrorRate
error_rate > {{ERROR_ALERT}}%
2 min
Critical
PagerDuty
[link]
SlowP99
p99_latency > {{P99_ALERT}}ms
5 min
Warning
Slack #alerts
[link]
ServiceDown
health_check failing
1 min
Critical
PagerDuty
[link]
HighCPU
cpu > {{CPU_CRIT}}%
10 min
Warning
Slack #alerts
[link]
DiskAlmostFull
disk > {{DISK_CRIT}}%
5 min
Critical
PagerDuty
[link]
DeploymentFailed
deployment status = failed
Immediate
Critical
Slack #deployments
[link]
CertificateExpiringSoon
cert_expiry < 30 days
—
Warning
Slack #ops
[link]
BackupFailed
backup job = failed
—
Critical
PagerDuty
[link]
SLOBudgetBurning
error_budget < 10% remaining
—
Critical
PagerDuty
[link]
3.2 Alert Routing & Escalation
flowchart TD
ALERT[Alert fires] --> SEVERITY{Severity?}
SEVERITY -->|Critical| ONCALL[On-call engineer
PagerDuty / phone]
SEVERITY -->|Warning| SLACK[Slack #alerts
No immediate response required]
ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]
Severity
Response SLA
Channel
Escalation
Critical (P1)
Acknowledge in 5 min, resolve in 1h
PagerDuty + call
Escalate at 5 min
High (P2)
Acknowledge in 30 min, resolve in 4h
PagerDuty
Escalate at 30 min
Warning (P3)
Review within 1 business day
Slack
Manual
Info
No response required
Slack
None
3.3 On-Call Rotation
Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)
3.4 Alert Fatigue Prevention
Alert review cadence: Monthly — remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate
Minimum alert duration: 2+ minutes (no single-spike alerts)
Deduplication window: {{DEDUP_WINDOW}} minutes
Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}
Post-mortem requirement: Every Critical alert reviewed after incident
4. Dashboards
4.1 Dashboard Inventory
Dashboard
Purpose
Link
Audience
System Overview
High-level health of all services
{{LINK}}
Everyone
{{SERVICE_1}}
Service-level detail
{{LINK}}
Dev team
Infrastructure
Host/container metrics
{{LINK}}
DevOps
Business Metrics
KPIs and conversions
{{LINK}}
Leadership, PM
SLO Tracker
Error budget tracking
{{LINK}}
Engineering lead
On-Call
Current incidents, top errors
{{LINK}}
On-call engineer
4.2 Key Dashboard Specs — System Overview
Required panels:
Service health matrix (all services, green/red/yellow)
Request rate (all services, last 1h)
Error rate (all services, last 1h)
P99 latency (all services, last 1h)
Active incidents count
Error budget remaining (all SLOs)
Last deployment (service, version, time)
Infrastructure health (CPU, memory, disk — aggregate)
5. SLOs / SLIs
5.1 SLI Definitions
SLI
Definition
Measurement Method
Availability
% requests returning non-5xx
(total_requests - 5xx_requests) / total_requests
Latency
% requests completing within threshold
histogram_quantile(0.95, ...) < {{LATENCY_SLI}}ms
Error rate
% requests not returning errors
(total_requests - error_requests) / total_requests
5.2 SLO Targets
Service
SLI
Target
Window
Error Budget
{{SERVICE}}
Availability
{{AVAIL_TARGET}}%
30 days
{{BUDGET_MINUTES}} min/month
{{SERVICE}}
Latency (P95 < {{P95}}ms)
{{LATENCY_TARGET}}%
30 days
{{LATENCY_BUDGET_MINUTES}} min/month
5.3 Error Budget Tracking
Service
Monthly Budget
Burned This Month
Remaining
Burn Rate (24h)
{{SERVICE}}
{{BUDGET}}min
TBD
TBD
TBD
Error budget policy:
Budget > 50% remaining: Move fast, deploy freely
Budget 10-50% remaining: Slow down, prioritize reliability work
Budget < 10% remaining: Freeze non-critical deploys, focus on reliability
6. Tooling
Tool
Version
Purpose
Hosted
{{METRICS_TOOL}}
{{VERSION}}
Metrics collection & storage
{{HOSTING}}
{{LOG_TOOL}}
{{VERSION}}
Log aggregation
{{HOSTING}}
{{TRACE_TOOL}}
{{VERSION}}
Distributed tracing
{{HOSTING}}
{{DASHBOARD_TOOL}}
{{VERSION}}
Visualization
{{HOSTING}}
{{ALERT_TOOL}}
{{VERSION}}
Alert routing & on-call
{{HOSTING}}
Related Documents
Deployment Architecture
Disaster Recovery Plan
Incident Report
Operational Runbook
SLA Report
Approval
Role
Name
Date
Signature
Author
Reviewer
Approver