Monitoring & Observability

Project: Drop Version: 0.1.0 Date: 2026-02-23 Author: Platform Architect (AI) Status: In Review Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Initial draft from source code and infrastructure analysis

1. Observability Strategy

Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.

Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs) Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.

Core Questions We Must Be Able to Answer:

Is Drop up and serving users? (BetterStack monitors /api/health)

Is the database connected and responding? (health endpoint DB query)

Are errors spiking? (Slack alerts.ts error spike detection)

What changed before the problem? (CloudWatch App Runner logs)

When was the last successful deployment? (App Runner deployment history)

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

Metric	Source	Alert Threshold	Severity
App Runner service status	AWS CloudWatch / App Runner API	`RUNNING` → any other state	Critical
RDS instance status	AWS CloudWatch / RDS API	Not `available`	Critical
RDS CPU utilization	AWS CloudWatch `CPUUtilization`	> 80% (warn), > 95% (critical)	Warning / Critical
RDS free storage	AWS CloudWatch `FreeStorageSpace`	< 1GB (warn), < 256MB (critical)	Warning / Critical
RDS database connections	AWS CloudWatch `DatabaseConnections`	> 70 (warn, db.t4g.micro max ~85)	Warning
App Runner concurrent requests	AWS CloudWatch	TBD (no baseline yet)	TBD
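The RDS thresholds in this table can be expressed as a small classifier, which may be useful when the CloudWatch alarms are eventually configured. This is a minimal sketch, not existing Drop code: `RdsReading` and `classifyRds` are hypothetical names, and it assumes 1GB and 256MB mean binary units (CloudWatch reports `FreeStorageSpace` in bytes).

```typescript
// Hypothetical helper (not part of the Drop codebase): classifies RDS
// CloudWatch readings against the warn/critical thresholds in the table.
type Level = "ok" | "warning" | "critical";

interface RdsReading {
  cpuPercent: number;       // CloudWatch CPUUtilization (percent)
  freeStorageBytes: number; // CloudWatch FreeStorageSpace (bytes)
  connections: number;      // CloudWatch DatabaseConnections (count)
}

// Assumption: thresholds use binary units (1GB = 1024^3 bytes).
const MB = 1024 * 1024;
const GB = 1024 * MB;

function classifyRds(r: RdsReading): Record<keyof RdsReading, Level> {
  return {
    cpuPercent:
      r.cpuPercent > 95 ? "critical" : r.cpuPercent > 80 ? "warning" : "ok",
    freeStorageBytes:
      r.freeStorageBytes < 256 * MB
        ? "critical"
        : r.freeStorageBytes < 1 * GB
          ? "warning"
          : "ok",
    // The table defines only a warning level for connections.
    connections: r.connections > 70 ? "warning" : "ok",
  };
}
```

Keeping the thresholds in one function like this would make it easier to keep the document and any future alarm definitions in sync.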

Application Metrics (RED Method)

Metric	Source	Alert Threshold	Severity
Request rate	CloudWatch App Runner	Baseline TBD	Informational
Error rate	`src/lib/alerts.ts` `trackError()`	> 5 errors in 60 seconds → Slack alert	Critical
DB query latency	`/api/health` `dbLatencyMs` field	> 100ms (warn)	Warning
Health endpoint status	BetterStack + `/api/health`	`status` != `"ok"` → 503	Critical
Rate limit hits (429)	App logs	Spike of 429 responses	Warning

Business Metrics (Planned for v1.0)

Metric	Description	Target
Transactions per hour	Successful remittances + QR payments	TBD
Transaction success rate	Completed / total initiated	> 99%
KYC approval rate	Sumsub approvals / attempts	> 80%
BankID login success rate	Successful OIDC callbacks / initiated	> 99%

2.2 Logs

Log Sources

Source	Log Group	Format	Retention
App Runner (application)	`/aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application`	Next.js console output	30 days (CloudWatch default)
App Runner (system)	`/aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service`	App Runner events	30 days
RDS PostgreSQL	RDS error log via CloudWatch	PostgreSQL log format	7 days

Log Access

# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Filter for errors in last hour
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR" \
  --region eu-west-1

# Download RDS error log
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1

Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.

2.3 Traces

Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).

Planned for v1.0: Request ID correlation across middleware and DB queries.
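Neither structured logging nor request ID correlation exists in Drop today, so the following is only an illustration of what the planned v1.0 work might look like. `formatLogLine`, its parameters, and the field names are all assumptions, not a description of the codebase.

```typescript
// Illustrative only: Drop does not implement structured logging yet.
type LogLevel = "debug" | "info" | "warn" | "error";

// Builds one JSON log line that carries a request ID, so that every line
// written while handling the same request can be found with a single filter.
function formatLogLine(
  level: LogLevel,
  message: string,
  requestId: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // Caller-supplied fields are spread first so they cannot overwrite core keys.
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    requestId,
    message,
  });
}
```

With output in this shape, middleware and DB query logs for one request could be matched by filtering on the `requestId` value in CloudWatch.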

3. Health Check System

3.1 Health Endpoint (`GET /api/health`)

The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.

Source: src/drop-app/src/app/api/health/route.ts

Success Response (HTTP 200):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}

Degraded Response (HTTP 200): status: "degraded" — DB returned unexpected result.

Down Response (HTTP 503):

{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
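The mapping from DB probe outcome to overall status and HTTP code described above can be sketched as follows. The actual `route.ts` is not reproduced in this document, so `ProbeResult` and `summarizeHealth` are hypothetical names used only to illustrate the three documented cases.

```typescript
// Hypothetical sketch of the status mapping; not copied from route.ts.
type ProbeResult =
  | { kind: "ok"; latencyMs: number }         // query returned the expected result
  | { kind: "unexpected"; latencyMs: number } // query ran, result was not as expected
  | { kind: "error" };                        // query threw or could not connect

type HealthStatus = "ok" | "degraded" | "down";

function summarizeHealth(probe: ProbeResult): { status: HealthStatus; httpStatus: number } {
  if (probe.kind === "ok") {
    return { status: "ok", httpStatus: 200 };
  }
  if (probe.kind === "unexpected") {
    // Degraded still answers 200, so a status-code-only monitor would not alert.
    return { status: "degraded", httpStatus: 200 };
  }
  return { status: "down", httpStatus: 503 };
}
```

Because the degraded case returns HTTP 200, a monitor only detects it if it also inspects the response body for `"status":"ok"`.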

3.2 Container Health Checks

Platform	Check	Interval	Timeout	Retries
Docker Compose (MVP)	`wget /api/health`	30s	10s	3
Docker Compose (Production)	`wget /api/health`	30s	10s	3
Fly.io (staging)	`GET /api/health`	30s	5s	-
AWS App Runner	`GET /api/health`	30s	5s	3

4. Alerting

4.1 Slack Alerting (Internal)

Source: src/drop-app/src/lib/alerts.ts Channel: #drop-ops on alai-talk.slack.com Webhook: SLACK_WEBHOOK_URL environment variable

Alert Type	Trigger	Severity	Icon
App startup	Application boots	Info	ℹ️
App shutdown	SIGTERM/SIGINT received	Info	ℹ️
Error spike	> 5 errors in 60 seconds	Critical	🚨
Unhandled exception	Process event handler catches error	Critical	🚨
Custom alert	`sendAlert()` called in code	Variable	Variable

Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.

Error spike detection algorithm:

Every HTTP 5xx error calls trackError()

Rolling 1-minute window of error timestamps maintained in memory

When count > 5 in window → sends critical Slack alert

Alert cooldown prevents duplicate alerts within 10 minutes
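The four steps above can be sketched as a small in-memory detector. This is an illustration under stated assumptions rather than the contents of `alerts.ts`: `ErrorSpikeDetector` is a hypothetical name, the current time is passed in to make the behaviour easy to test, and a single cooldown is used where the real module keeps one per alert title.

```typescript
// Hypothetical sketch of rolling-window spike detection with a cooldown.
class ErrorSpikeDetector {
  private timestamps: number[] = [];
  private lastAlertAt: number | null = null;

  constructor(
    private readonly threshold = 5,           // alert when count exceeds this
    private readonly windowMs = 60_000,       // rolling 1-minute window
    private readonly cooldownMs = 10 * 60_000 // 10-minute alert cooldown
  ) {}

  // Records one error at time `now` (ms) and returns true when an alert
  // should be sent.
  track(now: number = Date.now()): boolean {
    this.timestamps.push(now);

    // Drop timestamps that have fallen out of the rolling window.
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);

    if (this.timestamps.length <= this.threshold) return false;

    // Suppress repeat alerts while the cooldown is active.
    if (this.lastAlertAt !== null && now - this.lastAlertAt < this.cooldownMs) {
      return false;
    }
    this.lastAlertAt = now;
    return true;
  }
}
```

Since all state lives in process memory, both the window and the cooldown reset whenever the app restarts, which matches the behaviour noted above.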

Usage in code:

import { sendAlert, trackError } from '@/lib/alerts';

// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });

// Track error (called automatically in middleware)
await trackError();

4.2 BetterStack External Monitoring

Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md) Plan: Free tier (10 monitors, 3-minute check interval)

Monitor	URL	Check	Expected
Drop Health Check	`https://drop.alai.no/api/health`	HTTP GET + keyword	Status 200, body contains `"status":"ok"`
Drop Landing Page	`https://drop.alai.no`	HTTP GET + keyword	Status 200, body contains `Send penger`
Drop Health (US East)	`https://drop.alai.no/api/health`	HTTP GET (from US region)	Status 200, body contains `"status":"ok"`
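Each monitor in the table combines a status code check with a keyword check. The sketch below models that as a plain substring match, which is an assumption about how the keyword check behaves; `MonitorSpec` and `monitorPasses` are hypothetical names.

```typescript
// Hypothetical model of an HTTP GET + keyword monitor.
interface MonitorSpec {
  expectedStatus: number;
  keyword: string;
}

// Passes only when the status code matches and the body contains the keyword.
// Assumption: the keyword check is a case-sensitive substring match.
function monitorPasses(spec: MonitorSpec, httpStatus: number, body: string): boolean {
  return httpStatus === spec.expectedStatus && body.includes(spec.keyword);
}

const healthMonitor: MonitorSpec = { expectedStatus: 200, keyword: '"status":"ok"' };
```

Under that assumption, the keyword `"status":"ok"` only matches compact JSON. If the endpoint ever returned pretty-printed JSON (`"status": "ok"` with a space), the monitor would report the service as down even though it is healthy.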

Status page: https://drop-status.betteruptime.com (public)

Escalation policy (Drop Production Incidents):

Minute 0:   Service down → Slack #drop-ops (immediate)
Minute 5:   Still down  → Email [email protected]
Minute 15:  Still down  → SMS +47 40 47 42 51 (requires paid BetterStack plan)
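The escalation policy can be read as a list of time-based steps. The following sketch encodes it that way; `ESCALATION_POLICY` and `channelsNotified` are hypothetical names, and the `smsEnabled` flag stands in for the paid-plan requirement noted above.

```typescript
// Hypothetical encoding of the three-step escalation policy.
interface EscalationStep {
  afterMinutes: number;
  channel: "slack" | "email" | "sms";
}

const ESCALATION_POLICY: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" },
];

// Returns the channels that have been notified after `minutesDown` minutes.
// SMS is skipped unless the paid plan is active.
function channelsNotified(minutesDown: number, smsEnabled = false): string[] {
  return ESCALATION_POLICY
    .filter((step) => minutesDown >= step.afterMinutes)
    .filter((step) => step.channel !== "sms" || smsEnabled)
    .map((step) => step.channel);
}
```

On the free tier, `channelsNotified(20)` yields only Slack and email, so an outage lasting longer than 15 minutes produces no further escalation unless SMS is enabled.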

SSL expiry warning: 14 days before certificate expiration.

5. Alerting Rules Reference

Condition	Source	Channel	Severity	Action
`/api/health` returns 503	BetterStack	Slack + email	Critical	Investigate DB + App Runner
Error spike (>5 in 60s)	`alerts.ts`	`#drop-ops`	Critical	Check app logs
App Runner service not `RUNNING`	AWS Console / CloudWatch	Manual check	Critical	`aws apprunner start-deployment`
RDS CPU > 80%	CloudWatch (manual setup needed)	TBD	Warning	Investigate query patterns
RDS storage < 1GB	CloudWatch (manual setup needed)	TBD	Warning	Increase storage
SSL certificate expiring	BetterStack	Email	Warning	Renew certificate
App startup/shutdown	`alerts.ts`	`#drop-ops`	Info	No action needed

6. Dashboards

6.1 AWS CloudWatch Dashboard (Planned)

Target dashboard widgets:

App Runner: Request count, 5xx error count, latency

RDS: CPU, connections, free storage, read/write IOPS

Health check latency over time (from /api/health responses)

Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.

6.2 BetterStack Status Page

Public status page: `https://drop-status.betteruptime.com`

Components:

API Health Endpoint (linked to Drop Health Check monitor)

Landing Page (linked to Drop Landing Page monitor)

Global Network (linked to US East monitor)

7. On-Call Procedures

7.1 Escalation Matrix

Time	Action	Who
Alert fires (0 min)	Acknowledge Slack alert, investigate	Alem Bašić
5 min: still down	Email alert auto-sent, try restart	Alem Bašić
15 min: still down	SMS alert (if configured), escalate	Alem Bašić
30 min: unresolved	Follow DR runbook for scenario	Alem Bašić + John (AI)

7.2 First Response Checklist

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --since 10m --region eu-west-1

# 3. Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq

# 5. Restart App Runner if needed
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

8. Monitoring Gaps (Planned for v1.0)

Gap	Impact	Priority
Structured JSON logging	Cannot correlate requests across log lines	High
CloudWatch alarms for RDS	No automated alerting on DB metrics	High
APM / request tracing	Cannot trace slow requests	Medium
Business metrics dashboard	No visibility into transaction volume/success	Medium
Redis-backed error counter	Error counter resets on restart	Low
Audit logging stream	Required for compliance (AML)	High
Per-endpoint error tracking	Cannot isolate problematic routes	Medium

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
