Monitoring & Observability

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Initial draft from source code and infrastructure analysis |

1. Observability Strategy

Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.

Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)
Strategy: Alert on symptoms (service down, error spike), verify via the health check, investigate via CloudWatch logs.

Core Questions We Must Be Able to Answer:

  1. Is Drop up and serving users? (BetterStack monitors /api/health)
  2. How fast is it responding?
  3. What errors are occurring and why?
  4. Is the database connected and responding? (health endpoint DB query)
  5. Are errors spiking? (Slack alerts.ts error spike detection)
  6. What changed before the problem? (CloudWatch App Runner logs)
  7. When was the last successful deployment? (App Runner deployment history)

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| App Runner service status | AWS CloudWatch / App Runner API | RUNNING → any other state | Critical |
| RDS instance status | AWS CloudWatch / RDS API | Not available | Critical |
| RDS CPU utilization | AWS CloudWatch CPUUtilization | > 80% (warn), > 95% (critical) | Warning / Critical |
| RDS free storage | AWS CloudWatch FreeStorageSpace | < 1GB (warn), < 256MB (critical) | Warning / Critical |
| RDS database connections | AWS CloudWatch DatabaseConnections | > 70 (warn, db.t4g.micro max ~85) | Warning / Critical |
| App Runner concurrent requests | AWS CloudWatch | TBD (no baseline yet) | TBD |

Application Metrics (RED Method)

| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| Request rate | CloudWatch App Runner | Baseline TBD | Informational |
| Error rate | src/lib/alerts.ts trackError() | > 5 errors in 60 seconds → Slack alert | Critical |
| DB query latency | /api/health dbLatencyMs field | > 100ms (warn) | Warning |
| Health endpoint status | BetterStack + /api/health | status != "ok" → 503 | Critical |
| Rate limit hits (429) | App logs | Spike of 429 responses | Warning |

Business Metrics (Planned for v1.0)

| Metric | Description | Target |
|---|---|---|
| Transactions per hour | Successful remittances + QR payments | TBD |
| Transaction success rate | Completed / total initiated | > 99% |
| KYC approval rate | Sumsub approvals / attempts | > 80% |
| BankID login success rate | Successful OIDC callbacks / initiated | > 99% |

2.2 Logs

Log Sources

| Source | Log Group | Format | Retention |
|---|---|---|---|
| App Runner (application) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application | Next.js console output | 30 days (CloudWatch default) |
| App Runner (system) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service | App Runner events | 30 days |
| RDS PostgreSQL | RDS error log via CloudWatch | PostgreSQL log format | 7 days |

Log Access

# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Filter for errors in last hour
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR" \
  --region eu-west-1

# Download RDS error log
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1

Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.

2.3 Traces

Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).

Planned for v1.0: Request ID correlation across middleware and DB queries.


3. Health Check System

3.1 Health Endpoint (GET /api/health)

The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.

Source: src/drop-app/src/app/api/health/route.ts

Success Response (HTTP 200):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}

Degraded Response (HTTP 200): status: "degraded" — DB returned unexpected result.

Down Response (HTTP 503):

{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
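
The mapping from the DB probe to the three documented responses can be sketched as a small pure function. This is an illustration only: `ProbeResult`, `HealthOutcome`, and `classifyHealth` are hypothetical names, not taken from route.ts, and the real handler may structure this differently.

```typescript
// Hypothetical sketch: map a DB probe result to the documented health responses.
// None of these names come from route.ts; they are assumptions for illustration.
type ProbeResult =
  | { kind: "ok"; latencyMs: number }         // query returned the expected result
  | { kind: "unexpected"; latencyMs: number } // query ran but returned something else
  | { kind: "error" };                        // query threw or timed out

interface HealthOutcome {
  httpStatus: 200 | 503;
  status: "ok" | "degraded" | "down";
}

function classifyHealth(probe: ProbeResult): HealthOutcome {
  switch (probe.kind) {
    case "ok":
      return { httpStatus: 200, status: "ok" };
    case "unexpected":
      // Documented as HTTP 200 with status "degraded".
      return { httpStatus: 200, status: "degraded" };
    case "error":
      // Documented as HTTP 503 with status "down".
      return { httpStatus: 503, status: "down" };
  }
}
```

Note that a keyword monitor matching "status":"ok" treats the degraded case as a failure even though the HTTP status is 200.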

3.2 Container Health Checks

| Platform | Check | Interval | Timeout | Retries |
|---|---|---|---|---|
| Docker Compose (MVP) | wget /api/health | 30s | 10s | 3 |
| Docker Compose (Production) | wget /api/health | 30s | 10s | 3 |
| Fly.io (staging) | GET /api/health | 30s | 5s | |
| AWS App Runner | GET /api/health | 30s | 5s | 3 |

4. Alerting

4.1 Slack Alerting (Internal)

Source: src/drop-app/src/lib/alerts.ts
Channel: #drop-ops on alai-talk.slack.com
Webhook: SLACK_WEBHOOK_URL environment variable

| Alert | Trigger | Severity | Emoji |
|---|---|---|---|
| App startup | Application boots | Info | ℹ️ |
| App shutdown | SIGTERM/SIGINT received | Info | ℹ️ |
| Error spike | > 5 errors in 60 seconds | Critical | 🚨 |
| Unhandled exception | Process event handler catches error | Critical | 🚨 |
| Custom alert | sendAlert() called in code | Variable | Variable |

Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.

Error spike detection algorithm:

  1. Every HTTP 5xx error calls trackError()
  2. Rolling 1-minute window of error timestamps maintained in memory
  3. When count > 5 in window → sends critical Slack alert
  4. Alert cooldown prevents duplicate alerts within 10 minutes
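
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in alerts.ts: the class name `SpikeDetector`, the `onSpike` callback, and the injectable `now` parameter are hypothetical and exist only to make the logic testable.

```typescript
// Hypothetical sketch of a rolling-window error spike detector.
// Names are illustrative and are not taken from src/lib/alerts.ts.
const WINDOW_MS = 60_000;        // rolling 1-minute window
const THRESHOLD = 5;             // alert when the count exceeds 5
const COOLDOWN_MS = 10 * 60_000; // 10-minute cooldown between alerts

class SpikeDetector {
  private timestamps: number[] = [];
  private lastAlertAt: number | null = null;
  private onSpike: (count: number) => void;

  constructor(onSpike: (count: number) => void) {
    this.onSpike = onSpike;
  }

  // Record one 5xx error at time `now` (ms). Returns true if an alert was sent.
  trackError(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    // Keep only timestamps that are still inside the window.
    this.timestamps = this.timestamps.filter((t) => now - t < WINDOW_MS);

    const overThreshold = this.timestamps.length > THRESHOLD;
    const coolingDown =
      this.lastAlertAt !== null && now - this.lastAlertAt < COOLDOWN_MS;
    if (!overThreshold || coolingDown) return false;

    this.lastAlertAt = now;
    this.onSpike(this.timestamps.length);
    return true;
  }
}
```

Because the timestamps and the cooldown marker live in process memory, both reset on restart, which matches the limitation noted for the cooldown above.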

Usage in code:

import { sendAlert, trackError } from '@/lib/alerts';

// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });

// Track error (called automatically in middleware)
await trackError();

4.2 BetterStack External Monitoring

Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md)
Plan: Free tier (10 monitors, 3-minute check interval)

| Monitor | URL | Check | Expected |
|---|---|---|---|
| Drop Health Check | https://drop.alai.no/api/health | HTTP GET + keyword | Status 200, body contains "status":"ok" |
| Drop Landing Page | https://drop.alai.no | HTTP GET + keyword | Status 200, body contains Send penger |
| Drop Health (US East) | https://drop.alai.no/api/health | HTTP GET (from US region) | Status 200, body contains "status":"ok" |

Status page: https://drop-status.betteruptime.com (public)

Escalation policy (Drop Production Incidents):

Minute 0:   Service down → Slack #drop-ops (immediate)
Minute 5:   Still down  → Email [email protected]
Minute 15:  Still down  → SMS +47 40 47 42 51 (requires paid BetterStack plan)

SSL expiry warning: 14 days before certificate expiration.


5. Alerting Rules Reference

| Condition | Source | Channel | Severity | Action |
|---|---|---|---|---|
| /api/health returns 503 | BetterStack | Slack + email | Critical | Investigate DB + App Runner |
| Error spike (>5 in 60s) | alerts.ts | Slack #drop-ops | Critical | Check app logs |
| App Runner service not RUNNING | AWS Console / CloudWatch | Manual check | Critical | aws apprunner start-deployment |
| RDS CPU > 80% | CloudWatch (manual setup needed) | TBD | Warning | Investigate query patterns |
| RDS storage < 1GB | CloudWatch (manual setup needed) | TBD | Warning | Increase storage |
| SSL certificate expiring | BetterStack | Email | Warning | Renew certificate |
| App startup/shutdown | alerts.ts | Slack #drop-ops | Info | No action needed |

6. Dashboards

6.1 AWS CloudWatch Dashboard (Planned)

Target dashboard widgets:

  • App Runner: Request count, 5xx error count, latency
  • RDS: CPU, connections, free storage, read/write IOPS
  • Health check latency over time (from /api/health responses)

Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.

6.2 BetterStack Status Page

Public status page: https://drop-status.betteruptime.com

Components:

  • API Health Endpoint (linked to Drop Health Check monitor)
  • Landing Page (linked to Drop Landing Page monitor)
  • Global Network (linked to US East monitor)

7. On-Call Procedures

7.1 Escalation Matrix

| Time | Action | Who |
|---|---|---|
| Alert fires (0 min) | Acknowledge Slack alert, investigate | Alem Bašić |
| 5 min: still down | Email alert auto-sent, try restart | Alem Bašić |
| 15 min: still down | SMS alert (if configured), escalate | Alem Bašić |
| 30 min: unresolved | Follow DR runbook for scenario | Alem Bašić + John (AI) |

7.2 First Response Checklist

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --since 10m --region eu-west-1

# 3. Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq

# 5. Restart App Runner if needed
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

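8. Monitoring Gaps (Planned for v1.0)

| Gap | Impact | Priority |
|---|---|---|
| Structured JSON logging | Cannot correlate requests across log lines | High |
| CloudWatch alarms for RDS | No automated alerting on DB metrics | High |
| APM / request tracing | Cannot trace slow requests | Medium |
| Business metrics dashboard | No visibility into transaction volume/success | Medium |
| Redis-backed error counter | Error counter resets on restart | Low |
| Audit logging stream | Required for compliance (AML) | High |
| Per-endpoint error tracking | Cannot isolate problematic routes | Medium |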

Log Aggregation Pipeline

flowchart LR
    APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
    AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
    STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
    STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]

| Stage | Tool | Configuration |
|---|---|---|
| Application logging | {{LOG_LIB}} | Structured JSON to stdout |
| Log agent | {{LOG_AGENT}} | Deployed as sidecar / DaemonSet |
| Transport | {{LOG_TRANSPORT}} | TLS encrypted |
| Storage | {{LOG_STORE}} | Indexed, compressed |
| Query | {{LOG_QUERY}} | Access via dashboard |

Log Retention Policy

| Environment | Retention | Storage Tier |
|---|---|---|
| Dev | 7 days | Hot |
| Staging | 30 days | Hot |
| Production | {{PROD_LOG_RETENTION}} days | Hot (30d) → Cold archive |
| Audit logs | 1 year (regulatory) | Hot (90d) → Cold archive |

PII in Logs — Masking Strategy

| Data Type | Strategy | Example |
|---|---|---|
| Email address | Hash + truncate | user:sha256(email)[:8] |
| Phone number | Redact | [PHONE_REDACTED] |
| IP address | Anonymize last octet | 192.168.1.xxx |
| Payment data | Never log | Use [PAYMENT_DATA_OMITTED] |
| Auth tokens | Never log | Use [TOKEN_OMITTED] |
| Names | Omit or pseudonymize | Reference by ID only |
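
The first three strategies could be sketched as small helpers like the ones below. The function names are hypothetical, the trim/lowercase normalization before hashing is an assumption this document does not specify, and the IPv4-only handling is a simplification.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helpers illustrating the masking strategies; names are assumptions.

// Email: SHA-256 hex digest truncated to 8 characters, prefixed with "user:".
// Normalizing (trim + lowercase) before hashing is an assumption.
function maskEmail(email: string): string {
  const digest = createHash("sha256")
    .update(email.trim().toLowerCase())
    .digest("hex");
  return `user:${digest.slice(0, 8)}`;
}

// Phone: never keep any part of the number.
function redactPhone(_phone: string): string {
  return "[PHONE_REDACTED]";
}

// IPv4: replace the last octet; anything that is not dotted-quad is fully redacted.
function anonymizeIpv4(ip: string): string {
  const parts = ip.split(".");
  if (parts.length !== 4) return "[IP_REDACTED]";
  return `${parts[0]}.${parts[1]}.${parts[2]}.xxx`;
}
```

An 8-character truncated hash is a pseudonym, not anonymization: identical emails map to the same value, which is useful for correlation but still deserves the same access controls as other log data.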

2.3 Traces

Distributed Tracing Setup

Tracing Framework: {{TRACE_FRAMEWORK}}
Backend: {{TRACE_BACKEND}}
Auto-instrumentation: {{AUTO_INSTRUMENT}}

| Service | Instrumented | Framework | Notes |
|---|---|---|---|
| {{SERVICE_1}} | Yes | OpenTelemetry | HTTP, DB, Redis |
| {{SERVICE_2}} | Yes | OpenTelemetry | HTTP, external calls |

Trace Sampling Strategy

| Environment | Strategy | Rate | Notes |
|---|---|---|---|
| Dev | Always-on | 100% | Full visibility |
| Staging | Always-on | 100% | Full visibility |
| Production | Tail-based | {{SAMPLE_RATE}}% + errors | Error traces always kept |

Tail-based sampling rules:

  • Always sample: traces with errors, traces > {{SLOW_THRESHOLD}}ms
  • Sample rate: {{SAMPLE_RATE}}% of successful, fast traces
  • Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable
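
The first two rules above amount to a keep/drop decision made once a trace has completed. A minimal sketch follows, assuming hypothetical names (`TraceSummary`, `SamplingConfig`, `shouldKeepTrace`) and a sample rate expressed as a fraction between 0 and 1. The threshold and rate are left as parameters because this document does not fix their values.

```typescript
// Hypothetical tail-based sampling decision; all names are assumptions.
interface TraceSummary {
  hasError: boolean;  // any span in the trace recorded an error
  durationMs: number; // end-to-end duration of the trace
}

interface SamplingConfig {
  slowThresholdMs: number; // traces slower than this are always kept
  sampleRate: number;      // fraction (0..1) of fast, successful traces to keep
}

function shouldKeepTrace(
  trace: TraceSummary,
  config: SamplingConfig,
  random: () => number = Math.random, // injectable for deterministic tests
): boolean {
  if (trace.hasError) return true;                             // always keep errors
  if (trace.durationMs > config.slowThresholdMs) return true;  // always keep slow traces
  return random() < config.sampleRate;                         // probabilistic for the rest
}
```

The decision needs the whole trace, so it has to run in a collector that buffers spans until the trace finishes. That is why a head-based fallback is listed for the case where such a collector is unavailable.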

Span Naming Conventions

| Operation Type | Naming Pattern | Example |
|---|---|---|
| HTTP handler | HTTP {{METHOD}} {{ROUTE}} | HTTP POST /api/orders |
| DB query | db.{{operation}} {{table}} | db.select orders |
| Cache | cache.{{operation}} {{key_pattern}} | cache.get user:* |
| Queue | queue.{{operation}} {{queue_name}} | queue.publish order-events |
| External HTTP | {{service}} {{METHOD}} {{path}} | stripe POST /charges |

Context Propagation

Standard: W3C TraceContext (traceparent header)
Baggage: W3C Baggage (for user_id, tenant_id propagation)
Async: Inject context into message queue headers / job metadata
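
For reference, a version-00 traceparent value has four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent ID, and 2-character flags. The sketch below parses and formats that layout. The function names are hypothetical, it deliberately accepts only the version-00 shape, and a real service would normally rely on an OpenTelemetry propagator instead of hand-rolled parsing.

```typescript
// Hypothetical helpers for the W3C traceparent header (version 00 layout only).
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars, not all zeros
  parentId: string; // 16 lowercase hex chars, not all zeros
  sampled: boolean; // lowest bit of the trace-flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!m) return null;
  const [, version = "", traceId = "", parentId = "", flags = ""] = m;
  if (version === "ff") return null;                                   // reserved, invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;      // all-zero IDs are invalid
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(traceId: string, parentId: string, sampled: boolean): string {
  return `00-${traceId}-${parentId}-${sampled ? "01" : "00"}`;
}
```

For queue-based work, the same string can be stored in a message header or job metadata field and parsed by the consumer before it starts its own span.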


3. Alerting

3.1 Alert Rules

| Alert Name | Condition | Duration | Severity | Channel | Runbook |
|---|---|---|---|---|---|
| HighErrorRate | error_rate > {{ERROR_ALERT}}% | 2 min | Critical | PagerDuty | [link] |
| SlowP99 | p99_latency > {{P99_ALERT}}ms | 5 min | Warning | Slack #alerts | [link] |
| ServiceDown | health_check failing | 1 min | Critical | PagerDuty | [link] |
| HighCPU | cpu > {{CPU_CRIT}}% | 10 min | Warning | Slack #alerts | [link] |
| DiskAlmostFull | disk > {{DISK_CRIT}}% | 5 min | Critical | PagerDuty | [link] |
| DeploymentFailed | deployment status = failed | Immediate | Critical | Slack #deployments | [link] |
| CertificateExpiringSoon | cert_expiry < 30 days | | Warning | Slack #ops | [link] |
| BackupFailed | backup job = failed | | Critical | PagerDuty | [link] |
| SLOBudgetBurning | error_budget < 10% remaining | | Critical | PagerDuty | [link] |

3.2 Alert Routing & Escalation

flowchart TD
    ALERT[Alert fires] --> SEVERITY{Severity?}
    SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
    SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
    ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
    ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]

| Severity | Response SLA | Channel | Escalation |
|---|---|---|---|
| Critical (P1) | Acknowledge in 5 min, resolve in 1h | PagerDuty + call | Escalate at 5 min |
| High (P2) | Acknowledge in 30 min, resolve in 4h | PagerDuty | Escalate at 30 min |
| Warning (P3) | Review within 1 business day | Slack | Manual |
| Info | No response required | Slack | None |

3.3 On-Call Rotation

Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)

3.4 Alert Fatigue Prevention

  • Alert review cadence: Monthly — remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate
  • Minimum alert duration: 2+ minutes (no single-spike alerts)
  • Deduplication window: {{DEDUP_WINDOW}} minutes
  • Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}
  • Post-mortem requirement: Every Critical alert reviewed after incident

4. Dashboards

4.1 Dashboard Inventory

| Dashboard | Purpose | Link | Audience |
|---|---|---|---|
| System Overview | High-level health of all services | {{LINK}} | Everyone |
| {{SERVICE_1}} | Service-level detail | {{LINK}} | Dev team |
| Infrastructure | Host/container metrics | {{LINK}} | DevOps |
| Business Metrics | KPIs and conversions | {{LINK}} | Leadership, PM |
| SLO Tracker | Error budget tracking | {{LINK}} | Engineering lead |
| On-Call | Current incidents, top errors | {{LINK}} | On-call engineer |

4.2 Key Dashboard Specs — System Overview

Required panels:

  1. Service health matrix (all services, green/red/yellow)
  2. Request rate (all services, last 1h)
  3. Error rate (all services, last 1h)
  4. P99 latency (all services, last 1h)
  5. Active incidents count
  6. Error budget remaining (all SLOs)
  7. Last deployment (service, version, time)
  8. Infrastructure health (CPU, memory, disk — aggregate)

5. SLOs / SLIs

5.1 SLI Definitions

| SLI | Definition | Measurement Method |
|---|---|---|
| Availability | % requests returning non-5xx | (total_requests - 5xx_requests) / total_requests |
| Latency | % requests completing within threshold | histogram_quantile(0.95, ...) < {{LATENCY_SLI}}ms |
| Error rate | % requests not returning errors | (total_requests - error_requests) / total_requests |

5.2 SLO Targets

| Service | SLI | Target | Window | Error Budget |
|---|---|---|---|---|
| {{SERVICE}} | Availability | {{AVAIL_TARGET}}% | 30 days | {{BUDGET_MINUTES}} min/month |
| {{SERVICE}} | Latency (P95 < {{P95}}ms) | {{LATENCY_TARGET}}% | 30 days | {{LATENCY_BUDGET_MINUTES}} min/month |

5.3 Error Budget Tracking

| Service | Monthly Budget | Burned This Month | Remaining | Burn Rate (24h) |
|---|---|---|---|---|
| {{SERVICE}} | {{BUDGET}} min | TBD | TBD | TBD |

Error budget policy:

  • Budget > 50% remaining: Move fast, deploy freely
  • Budget 10-50% remaining: Slow down, prioritize reliability work
  • Budget < 10% remaining: Freeze non-critical deploys, focus on reliability
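
As a worked illustration of how a target turns into a budget and then into a policy tier, the sketch below assumes a time-based availability budget over the SLO window. The function and tier names are hypothetical, and the 99.9% figure in the usage note is an example only, not a target set by this document.

```typescript
// Hypothetical helpers; names and tier labels are illustrative assumptions.

// Minutes of allowed unavailability for a target (in percent) over a window (in days).
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - targetPercent / 100);
}

type BudgetPolicy = "deploy-freely" | "prioritize-reliability" | "freeze-non-critical";

// Maps the remaining budget (in percent) to the three policy tiers listed above.
function budgetPolicy(remainingPercent: number): BudgetPolicy {
  if (remainingPercent > 50) return "deploy-freely";
  if (remainingPercent >= 10) return "prioritize-reliability";
  return "freeze-non-critical";
}
```

For example, a 99.9% target over 30 days leaves 43,200 min × 0.001 = 43.2 minutes of budget. With 20 minutes already burned, about 54% remains, so the first tier applies.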

6. Tooling

| Tool | Version | Purpose | Hosted |
|---|---|---|---|
| {{METRICS_TOOL}} | {{VERSION}} | Metrics collection & storage | {{HOSTING}} |
| {{LOG_TOOL}} | {{VERSION}} | Log aggregation | {{HOSTING}} |
| {{TRACE_TOOL}} | {{VERSION}} | Distributed tracing | {{HOSTING}} |
| {{DASHBOARD_TOOL}} | {{VERSION}} | Visualization | {{HOSTING}} |
| {{ALERT_TOOL}} | {{VERSION}} | Alert routing & on-call | {{HOSTING}} |


Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |