# Prometheus Best Practices and Pitfalls: USE vs RED

Source: YouTube Learning, Julius Volz (Prometheus co-founder), Swiss Cloud Native Day 2021
Indexed: 2026-06-15 (MC #103620)

## USE vs RED: Decision Framework

### USE Method (Resource-Oriented Systems)

For infrastructure components (CPU, memory, disk, network):

- **Utilization**: percentage of time the resource is busy (0-100%)
- **Saturation**: degree of queuing (wait time, queue length)
- **Errors**: error count or rate

When to use: Cloud Run instances, Azure Container Apps, database connections, worker threads, storage volumes.

### RED Method (Request-Oriented Systems)

For services handling requests:

- **Rate**: requests per second
- **Errors**: failed request count or percentage
- **Duration**: latency (p50, p95, p99)

When to use: REST APIs, BFF layers, RPC services, HTTP endpoints.

## Custom Metrics in Application Code

### Best Practices

- Counter: events that only go up (requests, errors, jobs completed)
- Gauge: values that go up and down (active connections, queue size, temperature)
- Histogram: bucketed observations (latency, request size); automatically exposes `_bucket`, `_sum`, and `_count` series
- Summary: client-side quantiles; prefer a histogram with server-side quantiles computed in PromQL

### Common Pitfalls

- High-cardinality labels (user IDs, UUIDs, timestamps) → cardinality explosion → OOM
- Missing units in metric names (`http_request_duration` instead of `http_request_duration_seconds`)
- Inconsistent naming (mixing snake_case and camelCase)
- Not exposing a `/metrics` endpoint early in service development
- Using Summary instead of Histogram (histogram buckets can be aggregated across instances; precomputed summary quantiles cannot)

## PromQL Essentials

```promql
# Per-second rate of HTTP 5xx responses, averaged over the last 5 minutes
rate(http_requests_total{status=~"5.."}[5m])

# 95th percentile latency (sum bucket rates by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# CPU utilization in percent (USE)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Error rate as a fraction of all requests (RED)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

## How This Applies to ALAI

### Current Infrastructure
- Grafana: https://grafana.alai.no (monitoring hub)
- Bilko APIs/BFF: Java/Spring Boot → RED metrics for `/api/*` endpoints
- LumisCare BFF/services: Kotlin/Ktor → RED metrics for REST + USE metrics for connection pools
- Cloud Run / Azure Container Apps: platform exposes USE metrics (CPU, memory, request queue)

### Recommended Next Steps

- Instrument Bilko/LumisCare services with Micrometer (on Spring Boot with Actuator and the Prometheus registry, metrics are served at `/actuator/prometheus` once that endpoint is exposed)
- Add RED dashboards for all user-facing APIs (Grafana template: https://grafana.com/grafana/dashboards/4701)
- Add USE dashboards for Cloud Run / ACA resource health
- Alert on SLIs: error rate > 1%, p95 latency > 2s, CPU > 80%

### ALAI-Specific Pitfall to Avoid

Do NOT add per-user or per-client labels to core metrics. Use `organization_id` buckets (max ~50) or aggregate at service level. High cardinality = Prometheus death.

## References

- Prometheus docs: https://prometheus.io/docs/practices/naming/
- USE Method: http://www.brendangregg.com/usemethod.html
- RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- Micrometer + Spring Boot: https://micrometer.io/docs/registry/prometheus
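
One possible reading of the "organization_id buckets (max ~50)" advice is to hash each organization ID into a fixed set of label values, so the label can never produce more than about 50 series per metric. The sketch below illustrates that reading only. The function names, the `org_bucket_NN` label format, and the choice of SHA-256 are assumptions for illustration, not something the talk or the ALAI services prescribe.

```python
import hashlib

# Assumed cap, taken from the "max ~50" guidance in the note.
MAX_ORG_BUCKETS = 50


def org_bucket(organization_id: str, buckets: int = MAX_ORG_BUCKETS) -> str:
    """Map an organization ID to one of `buckets` stable label values."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would give different buckets after a restart.
    digest = hashlib.sha256(organization_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % buckets
    return f"org_bucket_{index:02d}"


def label_cardinality(organization_ids, buckets: int = MAX_ORG_BUCKETS) -> int:
    """Count the distinct label values produced for a set of organization IDs."""
    return len({org_bucket(org_id, buckets) for org_id in organization_ids})


if __name__ == "__main__":
    # Hypothetical IDs: 10,000 organizations still yield at most 50 label values.
    ids = [f"org-{i}" for i in range(10_000)]
    print(label_cardinality(ids) <= MAX_ORG_BUCKETS)
```

The trade-off is that a bucket no longer identifies a single organization, so per-organization drill-down has to come from logs or traces rather than from Prometheus labels.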