Monitoring & Observability

Project: ~~Drop~~ {{PROJECT_NAME}}
Version: ~~0.1.0~~ {{VERSION}}
Date: ~~2026-02-23~~ {{DATE}}
Author: ~~Platform Architect (AI)~~ {{AUTHOR}}
Status: Draft | In Review | Approved
Reviewers: ~~Alem Bašić (CEO)~~ {{REVIEWERS}}

Document History

Version	Date	Author	Changes
0.1	~~2026-02-23~~{{DATE}}	~~Platform Architect (AI)~~{{AUTHOR}}	Initial draft ~~from source code and infrastructure analysis~~

1. Observability Strategy

Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.

Observability Platform: ~~BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)~~ {{OBS_PLATFORM}}
Strategy: ~~Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.~~ Instrument everything, alert on symptoms (not causes), correlate across pillars.

Core Questions We Must Be Able to Answer:

~~Is Drop up and serving users? (BetterStack monitors /api/health)~~
~~Is the database connected and responding? (health endpoint DB query)~~
~~Are errors spiking? (Slack alerts.ts error spike detection)~~
~~What changed before the problem? (CloudWatch App Runner logs)~~
~~When was the last successful deployment? (App Runner deployment history)~~

Is the system up and serving users correctly?
How fast is it responding?
What errors are occurring and why?
Where is the bottleneck?
What changed before this problem started?

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

Metric	Source	Alert Threshold	Severity
~~App Runner service status~~	~~AWS CloudWatch / App Runner API~~	`RUNNING` ~~→ any other state~~	~~Critical~~
~~RDS instance status~~	~~AWS CloudWatch / RDS API~~	~~Not~~ `available`	~~Critical~~
~~RDS CPU utilization~~	~~AWS CloudWatch~~ `CPUUtilization`	~~> 80% (warn), > 95% (critical)~~	~~Warning / Critical~~
~~RDS free storage~~	~~AWS CloudWatch~~ `FreeStorageSpace`	~~< 1GB (warn), < 256MB (critical)~~	~~Warning / Critical~~
~~RDS database connections~~	~~AWS CloudWatch~~ `DatabaseConnections`	~~> 70 (warn, db.t4g.micro max ~85)~~	~~Warning / Critical~~
~~App Runner concurrent requests~~	~~AWS CloudWatch~~	~~TBD, no baseline yet~~	~~TBD~~
CPU utilization	Node exporter / CloudWatch	> {{CPU_WARN}}% (warn), > {{CPU_CRIT}}% (critical)	Warning / Critical
Memory utilization	Node exporter / CloudWatch	> {{MEM_WARN}}% (warn), > {{MEM_CRIT}}% (critical)	Warning / Critical
Disk utilization	Node exporter / CloudWatch	> {{DISK_WARN}}% (warn), > {{DISK_CRIT}}% (critical)	Warning / Critical
Network in/out	Node exporter / CloudWatch	> {{NET_LIMIT}}Mbps sustained	Warning
Container restarts	Kubernetes / ECS	> {{RESTART_LIMIT}} in 5min	Critical
Node not ready	Kubernetes	Any	Critical

Application Metrics (RED Method)

~~Metric~~	~~Source~~	~~Alert Threshold~~	~~Severity~~
~~Request rate~~	~~CloudWatch App Runner~~	~~Baseline TBD~~	~~Informational~~
~~Error rate~~	`src/lib/alerts.ts` `trackError()`	~~> 5 errors in 60 seconds → Slack alert~~	~~Critical~~
~~DB query latency~~	`/api/health` `dbLatencyMs` ~~field~~	~~> 100ms (warn)~~	~~Warning~~
~~Health endpoint status~~	~~BetterStack +~~ `/api/health`	`status` != `"ok"` ~~→ 503~~	~~Critical~~
~~Rate limit hits (429)~~	~~App logs~~	~~Spike of 429 responses~~	~~Warning~~

Metric	Description	Target	Alert Threshold
Request rate	Requests per second per service	Baseline ± 20%	50% deviation
Error rate	% requests returning 5xx	< {{ERROR_RATE}}%	> {{ERROR_ALERT}}%
P50 latency	Median response time	< {{P50}}ms	> {{P50_ALERT}}ms
P95 latency	95th percentile response time	< {{P95}}ms	> {{P95_ALERT}}ms
P99 latency	99th percentile response time	< {{P99}}ms	> {{P99_ALERT}}ms

Business Metrics (Planned for v1.0)

Metric	Description	Target
~~Transactions per hour~~	~~Successful remittances + QR payments~~	~~TBD~~
~~Transaction success rate~~	~~Completed / total initiated~~	~~> 99%~~
~~KYC approval rate~~	~~Sumsub approvals / attempts~~	~~> 80%~~
~~BankID login success rate~~	~~Successful OIDC callbacks / initiated~~	~~> 99%~~

2.2 Logs

Log Sources

~~Source~~	~~Log Group~~	~~Format~~	~~Retention~~
~~App Runner (application)~~	`/aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application`	~~Next.js console output~~	~~30 days (CloudWatch default)~~
~~App Runner (system)~~	`/aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service`	~~App Runner events~~	~~30 days~~
~~RDS PostgreSQL~~	~~RDS error log via CloudWatch~~	~~PostgreSQL log format~~	~~7 days~~

Log Access

# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Filter for errors in last hour
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR" \
  --region eu-west-1

# Download RDS error log
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1

~~Structured logging status:~~ ~~Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.~~

2.3 Traces

~~Status:~~ ~~Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).~~

~~Planned for v1.0:~~ ~~Request ID correlation across middleware and DB queries.~~

3. Health Check System

3.1 Health Endpoint (`GET /api/health`)

~~The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.~~

~~Source:~~ src/drop-app/src/app/api/health/route.ts

~~Success Response (HTTP 200):~~

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}

~~Degraded Response (HTTP 200):~~ status: "degraded" ~~— DB returned unexpected result.~~

~~Down Response (HTTP 503):~~

{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
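The three response shapes above can be sketched as a single mapping from a database probe result to an HTTP status and body. This is a minimal illustration, not the code in `route.ts`: the `DbProbe` type, the `buildHealthResponse` name, and the idea that the probe is a trivial `SELECT 1` are assumptions, as is reporting the DB check as `pass` in the degraded case.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

interface DbCheck {
  status: "pass" | "fail";
  latencyMs?: number;
  driver?: string;
}

interface HealthBody {
  data: {
    status: HealthStatus;
    checks: { db: DbCheck };
    timestamp: string;
  };
}

// Hypothetical probe result: `rows` holds what a trivial query such as
// `SELECT 1` returned, or null when the query threw.
interface DbProbe {
  rows: number[] | null;
  latencyMs: number;
}

function buildHealthResponse(
  probe: DbProbe,
  now: Date = new Date(),
): { httpStatus: 200 | 503; body: HealthBody } {
  const timestamp = now.toISOString();

  // Query failed: report "down" with HTTP 503 so uptime monitors trip.
  if (probe.rows === null) {
    return {
      httpStatus: 503,
      body: { data: { status: "down", checks: { db: { status: "fail" } }, timestamp } },
    };
  }

  // Query succeeded; an unexpected result is "degraded" but still HTTP 200.
  const expected = probe.rows.length === 1 && probe.rows[0] === 1;
  return {
    httpStatus: 200,
    body: {
      data: {
        status: expected ? "ok" : "degraded",
        checks: { db: { status: "pass", latencyMs: probe.latencyMs, driver: "pg" } },
        timestamp,
      },
    },
  };
}
```

Keeping the mapping a pure function makes the status rules easy to unit test without a database connection.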

3.2 Container Health Checks

~~Platform~~	~~Check~~	~~Interval~~	~~Timeout~~	~~Retries~~
~~Docker Compose (MVP)~~	`wget /api/health`	~~30s~~	~~10s~~	~~3~~
~~Docker Compose (Production)~~	`wget /api/health`	~~30s~~	~~10s~~	~~3~~
~~Fly.io (staging)~~	`GET /api/health`	~~30s~~	~~5s~~	~~—~~
~~AWS App Runner~~	`GET /api/health`	~~30s~~	~~5s~~	~~3~~

Business Metrics

Metric	Description	Collection Method	Dashboard
Active users (DAU/MAU)	Daily/monthly active users	Frontend instrumentation	Business dashboard
{{CONVERSION_METRIC}}	{{CONVERSION_DESC}}	Backend event	Business dashboard
{{REVENUE_METRIC}}	{{REVENUE_DESC}}	Payment events	Finance dashboard
Feature usage	Feature-level engagement	Feature flag SDK	Product dashboard

4. Alerting

4.1 Slack Alerting (Internal)

~~Source:~~ src/drop-app/src/lib/alerts.ts
~~Channel:~~ #drop-ops on alai-talk.slack.com
~~Webhook:~~ SLACK_WEBHOOK_URL ~~environment variable~~

~~Alert Type~~	~~Trigger~~	~~Severity~~
~~App startup~~	~~Application boots~~	~~Info~~	ℹ️
~~App shutdown~~	~~SIGTERM/SIGINT received~~	~~Info~~	ℹ️
~~Error spike~~	~~> 5 errors in 60 seconds~~	~~Critical~~	🚨
~~Unhandled exception~~	~~Process event handler catches error~~	~~Critical~~	🚨
~~Custom alert~~	`sendAlert()` ~~called in code~~	~~Variable~~	~~Variable~~

~~Cooldown:~~ ~~10-minute cooldown per alert title (prevents spam). Resets on app restart.~~

~~Error spike detection algorithm:~~

~~Every HTTP 5xx error calls~~ trackError()

~~Rolling 1-minute window of error timestamps maintained in memory~~

~~When count > 5 in window → sends critical Slack alert~~

~~Alert cooldown prevents duplicate alerts within 10 minutes~~

~~Usage in code:~~

import { sendAlert, trackError } from '@/lib/alerts';

// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });

// Track error (called automatically in middleware)
await trackError();

4.2 BetterStack External Monitoring

~~Status:~~ ~~Ready to configure (setup guide:~~ docs/infrastructure/BETTERSTACK-SETUP.md)
~~Plan:~~ ~~Free tier (10 monitors, 3-minute check interval)~~

~~Monitor~~	~~URL~~	~~Check~~	~~Expected~~
~~Drop Health Check~~	`https://drop.alai.no/api/health`	~~HTTP GET + keyword~~	~~Status 200, body contains~~ `"status":"ok"`
~~Drop Landing Page~~	`https://drop.alai.no`	~~HTTP GET + keyword~~	~~Status 200, body contains~~ `Send penger`
~~Drop Health (US East)~~	`https://drop.alai.no/api/health`	~~HTTP GET (from US region)~~	~~Status 200, body contains~~ `"status":"ok"`

~~Status page:~~ https://drop-status.betteruptime.com ~~(public)~~

~~Escalation policy (~~Drop Production Incidents):

Minute 0:   Service down → Slack #drop-ops (immediate)
Minute 5:   Still down  → Email [email protected]
Minute 15:  Still down  → SMS +47 40 47 42 51 (requires paid BetterStack plan)

~~SSL expiry warning:~~ ~~14 days before certificate expiration.~~

5. Alerting Rules Reference

~~Condition~~	~~Source~~	~~Channel~~	~~Severity~~	~~Action~~
`/api/health` ~~returns 503~~	~~BetterStack~~	~~Slack + email~~	~~Critical~~	~~Investigate DB + App Runner~~
~~Error spike (>5 in 60s)~~	`alerts.ts`	~~Slack~~ `#drop-ops`	~~Critical~~	~~Check app logs~~
~~App Runner service not~~ `RUNNING`	~~AWS Console / CloudWatch~~	~~Manual check~~	~~Critical~~	`aws apprunner start-deployment`
~~RDS CPU > 80%~~	~~CloudWatch (manual setup needed)~~	~~TBD~~	~~Warning~~	~~Investigate query patterns~~
~~RDS storage < 1GB~~	~~CloudWatch (manual setup needed)~~	~~TBD~~	~~Warning~~	~~Increase storage~~
~~SSL certificate expiring~~	~~BetterStack~~	~~Email~~	~~Warning~~	~~Renew certificate~~
~~App startup/shutdown~~	`alerts.ts`	~~Slack~~ `#drop-ops`	~~Info~~	~~No action needed~~

Custom Metrics Definition

Metric Name	Type	Labels	Description	Unit
`{{APP}}_job_queue_depth`	Gauge	`queue_name`	Number of pending jobs	count
`{{APP}}_job_processing_duration`	Histogram	`queue_name, status`	Job processing time	seconds
`{{APP}}_external_api_calls_total`	Counter	`service, status`	External API call count	count
`{{APP}}_cache_hit_ratio`	Gauge	`cache_type`	Cache hit percentage	ratio

6. Dashboards

6.1 AWS CloudWatch Dashboard (Planned)

Target dashboard widgets:

App Runner: Request count, 5xx error count, latency

RDS: CPU, connections, free storage, read/write IOPS

Health check latency over time (from `/api/health` responses)

Setup: TBD. Requires CloudWatch dashboard configuration via AWS Console or Terraform.

6.2 BetterStack Status Page

~~Public status page:~~ https://drop-status.betteruptime.com

~~Components:~~

~~API Health Endpoint~~ ~~(linked to Drop Health Check monitor)~~

~~Landing Page~~ ~~(linked to Drop Landing Page monitor)~~

~~Global Network~~ ~~(linked to US East monitor)~~

7. On-Call Procedures

7.1 Escalation Matrix

~~Time~~	~~Action~~	~~Who~~
~~Alert fires (0 min)~~	~~Acknowledge Slack alert, investigate~~	~~Alem Bašić~~
~~5 min: still down~~	~~Email alert auto-sent, try restart~~	~~Alem Bašić~~
~~15 min: still down~~	~~SMS alert (if configured), escalate~~	~~Alem Bašić~~
~~30 min: unresolved~~	~~Follow DR runbook for scenario~~	~~Alem Bašić + John (AI)~~

7.2 First Response Checklist

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --since 10m --region eu-west-1

# 3. Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq

# 5. Restart App Runner if needed
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

8. Monitoring Gaps (Planned for v1.0)

~~Gap~~	~~Impact~~	~~Priority~~
~~Structured JSON logging~~	~~Cannot correlate requests across log lines~~	~~High~~
~~CloudWatch alarms for RDS~~	~~No automated alerting on DB metrics~~	~~High~~
~~APM / request tracing~~	~~Cannot trace slow requests~~	~~Medium~~
~~Business metrics dashboard~~	~~No visibility into transaction volume/success~~	~~Medium~~
~~Redis-backed error counter~~	~~Error counter resets on restart~~	~~Low~~
~~Audit logging stream~~	~~Required for compliance (AML)~~	~~High~~
~~Per-endpoint error tracking~~	~~Cannot isolate problematic routes~~	~~Medium~~

2.2 Logs

Log Levels & Usage Guide

Level	When to Use	Examples
`ERROR`	Unexpected failure requiring attention	Database connection failure, unhandled exception
`WARN`	Unexpected but handled situation	Deprecated API called, retry succeeded
`INFO`	Normal business events	User logged in, order created, job completed
`DEBUG`	Diagnostic detail (dev/staging only)	Function parameters, internal state
`TRACE`	Extremely verbose (local dev only)	SQL queries, HTTP request/response bodies

Production log level: INFO and above

Structured Logging Format

{
  "timestamp": "2026-01-15T10:30:00.000Z",
  "level": "INFO",
  "service": "{{SERVICE_NAME}}",
  "version": "{{VERSION}}",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "user_id": "{{HASHED_OR_OMIT}}",
  "request_id": "req-uuid-here",
  "message": "Order created successfully",
  "order_id": "ord-123",
  "duration_ms": 45
}

Required fields: `timestamp`, `level`, `service`, `message`, `trace_id`
Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)

Log Aggregation Pipeline

flowchart LR
    APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
    AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
    STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
    STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]

Stage	Tool	Configuration
Application logging	{{LOG_LIB}}	Structured JSON to stdout
Log agent	{{LOG_AGENT}}	Deployed as sidecar / DaemonSet
Transport	{{LOG_TRANSPORT}}	TLS encrypted
Storage	{{LOG_STORE}}	Indexed, compressed
Query	{{LOG_QUERY}}	Access via dashboard

Log Retention Policy

Environment	Retention	Storage Tier
Dev	7 days	Hot
Staging	30 days	Hot
Production	{{PROD_LOG_RETENTION}} days	Hot (30d) → Cold archive
Audit logs	1 year (regulatory)	Hot (90d) → Cold archive

PII in Logs — Masking Strategy

Data Type	Strategy	Example
Email address	Hash + truncate	`user:sha256(email)[:8]`
Phone number	Redact	`[PHONE_REDACTED]`
IP address	Anonymize last octet	`192.168.1.xxx`
Payment data	Never log	Use `[PAYMENT_DATA_OMITTED]`
Auth tokens	Never log	Use `[TOKEN_OMITTED]`
Names	Omit or pseudonymize	Reference by ID only
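The masking strategies in the table can be sketched as small helper functions applied before a value reaches the logger. This is an illustrative sketch only: the function names are hypothetical, lowercasing the email before hashing is an assumption (so the same address always maps to the same token), and the `[IP_REDACTED]` fallback for non-IPv4 input is not part of the table.

```typescript
import { createHash } from "node:crypto";

// Email: SHA-256 hash, truncated to the first 8 hex characters.
function maskEmail(email: string): string {
  const digest = createHash("sha256")
    .update(email.trim().toLowerCase()) // normalisation is an assumption
    .digest("hex");
  return `user:${digest.slice(0, 8)}`;
}

// Phone number: always redacted, regardless of input.
function maskPhone(_phone: string): string {
  return "[PHONE_REDACTED]";
}

// IPv4 address: keep the first three octets, anonymize the last one.
function maskIpv4(ip: string): string {
  const parts = ip.split(".");
  if (parts.length !== 4) {
    return "[IP_REDACTED]"; // hypothetical fallback for non-IPv4 input
  }
  return `${parts[0]}.${parts[1]}.${parts[2]}.xxx`;
}
```

A truncated hash keeps log lines correlatable per user without exposing the address; it is not a substitute for access control on the log store.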

2.3 Traces

Distributed Tracing Setup

Tracing Framework: {{TRACE_FRAMEWORK}}
Backend: {{TRACE_BACKEND}}
Auto-instrumentation: {{AUTO_INSTRUMENT}}

Service	Instrumented	Framework	Notes
{{SERVICE_1}}	Yes	OpenTelemetry	HTTP, DB, Redis
{{SERVICE_2}}	Yes	OpenTelemetry	HTTP, external calls

Trace Sampling Strategy

Environment	Strategy	Rate	Notes
Dev	Always-on	100%	Full visibility
Staging	Always-on	100%	Full visibility
Production	Tail-based	{{SAMPLE_RATE}}% + errors	Error traces always kept

Tail-based sampling rules:

Always sample: traces with errors, traces > {{SLOW_THRESHOLD}}ms

Sample rate: {{SAMPLE_RATE}}% of successful, fast traces

Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable
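The tail-based rules above amount to a decision made once a trace is complete. The sketch below assumes a hypothetical `CompletedTrace` summary and passes the placeholders `{{SLOW_THRESHOLD}}` and `{{SAMPLE_RATE}}` in as configuration; a real collector would apply the same logic in its own policy language.

```typescript
// Hypothetical summary of a finished trace, as seen by the sampling stage.
interface CompletedTrace {
  hasError: boolean;
  durationMs: number;
}

interface SamplingConfig {
  slowThresholdMs: number;   // stands in for {{SLOW_THRESHOLD}}
  sampleRatePercent: number; // stands in for {{SAMPLE_RATE}}
}

function shouldKeepTrace(
  trace: CompletedTrace,
  config: SamplingConfig,
  random: () => number = Math.random, // injectable for deterministic tests
): boolean {
  // Rule 1: traces with errors are always kept.
  if (trace.hasError) return true;
  // Rule 2: slow traces are always kept.
  if (trace.durationMs > config.slowThresholdMs) return true;
  // Rule 3: keep a fixed percentage of successful, fast traces.
  return random() * 100 < config.sampleRatePercent;
}
```

Because the decision needs the whole trace, it has to run after all spans arrive, which is why a head-based fallback rate is still needed when the collector is unavailable.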

Span Naming Conventions

Operation Type	Naming Pattern	Example
HTTP handler	`HTTP {{METHOD}} {{ROUTE}}`	`HTTP POST /api/orders`
DB query	`db.{{operation}} {{table}}`	`db.select orders`
Cache	`cache.{{operation}} {{key_pattern}}`	`cache.get user:*`
Queue	`queue.{{operation}} {{queue_name}}`	`queue.publish order-events`
External HTTP	`{{service}} {{METHOD}} {{path}}`	`stripe POST /charges`

Context Propagation

Standard: W3C TraceContext (`traceparent` header)
Baggage: W3C Baggage (for user_id, tenant_id propagation)
Async: Inject context into message queue headers / job metadata
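As an illustration of the propagation described here, the sketch below parses and formats a version `00` `traceparent` value and copies it into job metadata for asynchronous work. The function names and the `Record<string, string>` metadata shape are assumptions; in practice an OpenTelemetry propagator would normally do this, and other header versions are not handled.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Version 00 layout: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1] ?? "";
  const parentId = match[2] ?? "";
  const flags = match[3] ?? "00";
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

// Async case: carry the context in job metadata so the worker can continue the trace.
function injectIntoJobMetadata(
  metadata: Record<string, string>,
  ctx: TraceContext,
): Record<string, string> {
  return { ...metadata, traceparent: formatTraceparent(ctx) };
}
```

The worker side would call `parseTraceparent` on the stored value and start its span as a child of the returned context.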

3. Alerting

3.1 Alert Rules

Alert Name	Condition	Duration	Severity	Channel	Runbook
`HighErrorRate`	error_rate > {{ERROR_ALERT}}%	2 min	Critical	PagerDuty	[link]
`SlowP99`	p99_latency > {{P99_ALERT}}ms	5 min	Warning	Slack #alerts	[link]
`ServiceDown`	health_check failing	1 min	Critical	PagerDuty	[link]
`HighCPU`	cpu > {{CPU_CRIT}}%	10 min	Warning	Slack #alerts	[link]
`DiskAlmostFull`	disk > {{DISK_CRIT}}%	5 min	Critical	PagerDuty	[link]
`DeploymentFailed`	deployment status = failed	Immediate	Critical	Slack #deployments	[link]
`CertificateExpiringSoon`	cert_expiry < 30 days	—	Warning	Slack #ops	[link]
`BackupFailed`	backup job = failed	—	Critical	PagerDuty	[link]
`SLOBudgetBurning`	error_budget < 10% remaining	—	Critical	PagerDuty	[link]

3.2 Alert Routing & Escalation

flowchart TD
    ALERT[Alert fires] --> SEVERITY{Severity?}
    SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
    SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
    ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
    ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]

Severity	Response SLA	Channel	Escalation
Critical (P1)	Acknowledge in 5 min, resolve in 1h	PagerDuty + call	Escalate at 5 min
High (P2)	Acknowledge in 30 min, resolve in 4h	PagerDuty	Escalate at 30 min
Warning (P3)	Review within 1 business day	Slack	Manual
Info	No response required	Slack	None
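The severity table can be expressed as data plus one escalation check, which keeps the policy reviewable in a single place. This is a sketch under assumptions: the type and function names are hypothetical, and "Escalate at N min" is read as "escalate once N minutes have passed without acknowledgement".

```typescript
type Severity = "critical" | "high" | "warning" | "info";

interface RoutingPolicy {
  channel: string;
  // Minutes after which an unacknowledged alert escalates; null means no automatic escalation.
  escalateAfterMinutes: number | null;
}

// Values copied from the severity table above.
const ROUTING: Record<Severity, RoutingPolicy> = {
  critical: { channel: "PagerDuty + call", escalateAfterMinutes: 5 },
  high: { channel: "PagerDuty", escalateAfterMinutes: 30 },
  warning: { channel: "Slack", escalateAfterMinutes: null },
  info: { channel: "Slack", escalateAfterMinutes: null },
};

function needsEscalation(
  severity: Severity,
  minutesSinceFired: number,
  acknowledged: boolean,
): boolean {
  if (acknowledged) return false;
  const limit = ROUTING[severity].escalateAfterMinutes;
  return limit !== null && minutesSinceFired >= limit;
}
```

For example, `needsEscalation("critical", 6, false)` is true, while an unacknowledged warning never escalates automatically.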

3.3 On-Call Rotation

Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)

3.4 Alert Fatigue Prevention

Alert review cadence: Monthly — remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate

Minimum alert duration: 2+ minutes (no single-spike alerts)

Deduplication window: {{DEDUP_WINDOW}} minutes

Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}

Post-mortem requirement: Every Critical alert reviewed after incident
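The deduplication window above can be illustrated with a small in-memory filter keyed by alert name. The class name and the fingerprint-by-name choice are assumptions, and state held in memory is lost on restart, so a real deployment would rely on the alerting tool's own grouping.

```typescript
class AlertDeduplicator {
  private readonly lastSent = new Map<string, number>();
  private readonly windowMs: number;

  // windowMinutes stands in for {{DEDUP_WINDOW}}.
  constructor(windowMinutes: number) {
    this.windowMs = windowMinutes * 60_000;
  }

  // Returns true when the alert should be delivered, false when it is a duplicate.
  shouldSend(fingerprint: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSent.get(fingerprint);
    if (last !== undefined && nowMs - last < this.windowMs) {
      return false; // same alert already sent inside the window
    }
    this.lastSent.set(fingerprint, nowMs);
    return true;
  }
}
```

Suppressed duplicates do not extend the window here; only a delivered alert resets the timer.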

4. Dashboards

4.1 Dashboard Inventory

Dashboard	Purpose	Link	Audience
System Overview	High-level health of all services	{{LINK}}	Everyone
{{SERVICE_1}}	Service-level detail	{{LINK}}	Dev team
Infrastructure	Host/container metrics	{{LINK}}	DevOps
Business Metrics	KPIs and conversions	{{LINK}}	Leadership, PM
SLO Tracker	Error budget tracking	{{LINK}}	Engineering lead
On-Call	Current incidents, top errors	{{LINK}}	On-call engineer

4.2 Key Dashboard Specs — System Overview

Required panels:

Service health matrix (all services, green/red/yellow)

Request rate (all services, last 1h)

Error rate (all services, last 1h)

P99 latency (all services, last 1h)

Active incidents count

Error budget remaining (all SLOs)

Last deployment (service, version, time)

Infrastructure health (CPU, memory, disk — aggregate)

5. SLOs / SLIs

5.1 SLI Definitions

SLI	Definition	Measurement Method
Availability	% requests returning non-5xx	(total_requests - 5xx_requests) / total_requests
Latency	% requests completing within threshold	histogram_quantile(0.95, ...) < {{LATENCY_SLI}}ms
Error rate	% requests not returning errors	(total_requests - error_requests) / total_requests
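The ratio-style SLIs in the table can be computed directly from request counts. The sketch below is illustrative: the function names are hypothetical, treating a window with zero requests as fully compliant is an assumption, and the latency SLI is computed from raw durations rather than from a histogram quantile.

```typescript
// Availability SLI: share of requests that did not return a 5xx status.
function availabilitySli(totalRequests: number, fiveXxRequests: number): number {
  if (totalRequests === 0) return 1; // assumption: no traffic counts as compliant
  return (totalRequests - fiveXxRequests) / totalRequests;
}

// Latency SLI: share of requests that completed within the threshold.
function latencySli(durationsMs: number[], thresholdMs: number): number {
  if (durationsMs.length === 0) return 1; // same assumption as above
  const withinThreshold = durationsMs.filter((d) => d <= thresholdMs).length;
  return withinThreshold / durationsMs.length;
}
```

For example, 5 failed requests out of 1000 give an availability of 0.995, which is then compared against the SLO target for the window.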

5.2 SLO Targets

Service	SLI	Target	Window	Error Budget
{{SERVICE}}	Availability	{{AVAIL_TARGET}}%	30 days	{{BUDGET_MINUTES}} min/month
{{SERVICE}}	Latency (P95 < {{P95}}ms)	{{LATENCY_TARGET}}%	30 days	{{LATENCY_BUDGET_MINUTES}} min/month

5.3 Error Budget Tracking

Service	Monthly Budget	Burned This Month	Remaining	Burn Rate (24h)
{{SERVICE}}	{{BUDGET}}min	TBD	TBD	TBD

Error budget policy:

Budget > 50% remaining: Move fast, deploy freely

Budget 10-50% remaining: Slow down, prioritize reliability work

Budget < 10% remaining: Freeze non-critical deploys, focus on reliability
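The budget arithmetic and the three-tier policy can be sketched as follows. The function and label names are hypothetical, and the boundary handling (exactly 50% and exactly 10% fall into the middle tier) is one reading of the ranges above.

```typescript
// Allowed "bad" minutes in the window for a given availability target.
// Example: a 99.9% target over 30 days allows about 43.2 minutes.
function errorBudgetMinutes(targetPercent: number, windowDays: number = 30): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - targetPercent / 100);
}

type BudgetPolicy =
  | "deploy-freely"
  | "prioritize-reliability"
  | "freeze-non-critical-deploys";

// remainingFraction is the share of the budget still unspent (0 to 1).
function budgetPolicy(remainingFraction: number): BudgetPolicy {
  if (remainingFraction > 0.5) return "deploy-freely";
  if (remainingFraction >= 0.1) return "prioritize-reliability";
  return "freeze-non-critical-deploys";
}
```

Dividing the minutes burned so far by `errorBudgetMinutes(...)` and subtracting from 1 gives the `remainingFraction` that drives the policy.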

6. Tooling

Tool	Version	Purpose	Hosted
{{METRICS_TOOL}}	{{VERSION}}	Metrics collection & storage	{{HOSTING}}
{{LOG_TOOL}}	{{VERSION}}	Log aggregation	{{HOSTING}}
{{TRACE_TOOL}}	{{VERSION}}	Distributed tracing	{{HOSTING}}
{{DASHBOARD_TOOL}}	{{VERSION}}	Visualization	{{HOSTING}}
{{ALERT_TOOL}}	{{VERSION}}	Alert routing & on-call	{{HOSTING}}

Related Documents

~~Source~~

SLA Report

Approval

Role	Name	Date
Author	~~Platform Architect (AI)~~	~~2026-02-23~~
Reviewer
Approver	~~Alem Bašić~~
