Prometheus Best Practices — USE vs RED

Prometheus Best Practices and Pitfalls

Source: YouTube Learning — Julius Volz (Prometheus co-founder), Swiss Cloud Native Day 2021
Indexed: 2026-06-15 (MC #103620)

USE vs RED: Decision Framework

USE Method (Resource-Oriented Systems)

For infrastructure components (CPU, memory, disk, network):

Utilization: % busy (0-100%)
Saturation: degree of queuing (wait time, queue length)
Errors: error count/rate

When to use: Cloud Run instances, Azure Container Apps, database connections, worker threads, storage volumes.

RED Method (Request-Oriented Systems)

For services handling requests:

Rate: requests/second
Errors: failed request count or %
Duration: latency (p50, p95, p99)

When to use: REST APIs, BFF layers, RPC services, HTTP endpoints.

Custom Metrics in Application Code

Best Practices

Counter for events that only go up (requests, errors, jobs completed)
Gauge for values that go up/down (active connections, queue size, temperature)
Histogram for bucketed observations (latency, request size) — auto-generates _sum, _count, _bucket
Summary for client-side quantiles (use histogram + server-side quantiles in PromQL instead)

Common Pitfalls

High cardinality labels (user IDs, UUIDs, timestamps) → cardinality explosion → OOM
Missing units in metric names (http_request_duration vs http_request_duration_seconds)
Inconsistent naming (mix of snake_case/camelCase)
Not exposing /metrics endpoint early in service development
Using Summary instead of Histogram (histograms aggregate better)

PromQL Essentials

# Rate of HTTP errors over 5min
rate(http_requests_total{status=~"5.."}[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU utilization (USE)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Error rate (RED)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

How This Applies to ALAI

Current Infrastructure

Grafana: https://grafana.alai.no (monitoring hub)
Bilko APIs/BFF: Java/Spring Boot → RED metrics for /api/* endpoints
LumisCare BFF/services: Kotlin/Ktor → RED metrics for REST + USE metrics for connection pools
Cloud Run / Azure Container Apps: Platform exposes USE metrics (CPU, memory, request queue)

Recommended Next Steps

Instrument Bilko/LumisCare services with Micrometer (auto-exposes Prometheus /actuator/prometheus)
Add RED dashboards for all user-facing APIs (Grafana template: https://grafana.com/grafana/dashboards/4701)
Add USE dashboards for Cloud Run / ACA resource health
Alert on SLIs: Error rate >1%, p95 latency >2s, CPU >80%

ALAI-Specific Pitfall to Avoid

Do NOT add per-user or per-client labels to core metrics. Use organization_id buckets (max ~50) or aggregate at service level. High cardinality = Prometheus death.

References

Prometheus docs: https://prometheus.io/docs/practices/naming/
USE Method: http://www.brendangregg.com/usemethod.html
RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
Micrometer + Spring Boot: https://micrometer.io/docs/registry/prometheus

Environment Setup

Secrets Management

Deployment Checklist

DR Runbook

Deployment Guide

CI/CD Pipeline

Monitoring & Alerting

Production Deployment

BetterStack Setup

Sentry Setup

CloudWatch Logs Setup

DevOps/SRE Stack

WAF Rules

Cloud Deployment Options

Infrastructure Overview

Cloud Audit: Resource Inventory

Cloud Audit: Multi-Cloud Design

Cloud Audit: App Cloud Readiness

Cloud Audit: Validation Report