Monitoring & Observability

Project: Drop Version: 0.1.0 Date: 2026-02-23 Author: Platform Architect (AI) Status: Draft | In Review | Approved Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Initial draft from source code and infrastructure analysis |

1. Observability Strategy

Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.

Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs) Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.

Core Questions We Must Be Able to Answer:

  1. Is Drop up and serving users? (BetterStack monitors /api/health)
  2. Is the database connected and responding? (health endpoint DB query)
  3. Are errors spiking? (Slack alerts.ts error spike detection)
  4. What changed before the problem? (CloudWatch App Runner logs)
  5. When was the last successful deployment? (App Runner deployment history)

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| App Runner service status | AWS CloudWatch / App Runner API | RUNNING → any other state | Critical |
| RDS instance status | AWS CloudWatch / RDS API | Not available | Critical |
| RDS CPU utilization | AWS CloudWatch CPUUtilization | > 80% (warn), > 95% (critical) | Warning / Critical |
| RDS free storage | AWS CloudWatch FreeStorageSpace | < 1GB (warn), < 256MB (critical) | Warning / Critical |
| RDS database connections | AWS CloudWatch DatabaseConnections | > 70 (warn, db.t4g.micro max ~85) | Warning |
| App Runner concurrent requests | AWS CloudWatch | TBD, no baseline yet | TBD |

Application Metrics (RED Method)

| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| Request rate | CloudWatch App Runner | Baseline TBD | Informational |
| Error rate | src/lib/alerts.ts trackError() | > 5 errors in 60 seconds → Slack alert | Critical |
| DB query latency | /api/health dbLatencyMs field | > 100ms (warn) | Warning |
| Health endpoint status | BetterStack + /api/health | status != "ok" → 503 | Critical |
| Rate limit hits (429) | App logs | Spike of 429 responses | Warning |

Business Metrics (Planned for v1.0)

| Metric | Description | Target |
|---|---|---|
| Transactions per hour | Successful remittances + QR payments | TBD |
| Transaction success rate | Completed / total initiated | > 99% |
| KYC approval rate | Sumsub approvals / attempts | > 80% |
| BankID login success rate | Successful OIDC callbacks / initiated | > 99% |

2.2 Logs

Log Sources

| Source | Log Group | Format | Retention |
|---|---|---|---|
| App Runner (application) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application | Next.js console output | 30 days (CloudWatch default) |
| App Runner (system) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service | App Runner events | 30 days |
| RDS PostgreSQL | RDS error log via CloudWatch | PostgreSQL log format | 7 days |

Log Access

# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Filter for errors in last hour
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR" \
  --region eu-west-1

# Download RDS error log
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1

Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.
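Since structured logging is only planned, the following is a minimal sketch of what a single JSON log line could look like, not a description of existing code. The function name `formatLogLine` and the `requestId` field are assumptions made for illustration.

```typescript
// Hypothetical sketch of a JSON log line; nothing here exists in the codebase yet.
type LogLevel = "debug" | "info" | "warn" | "error";

function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  // Core fields are spread last so caller-supplied fields cannot overwrite them.
  const entry = { ...fields, timestamp: now.toISOString(), level, message };
  // JSON.stringify without an indent argument emits one object per line.
  return JSON.stringify(entry);
}

// Example call; `requestId` is an illustrative field name.
const exampleLine = formatLogLine("error", "DB query failed", { requestId: "req-123" });
```

Writing one JSON object per line would let CloudWatch filter patterns match on individual fields instead of on the bare string "ERROR".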

2.3 Traces

Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).

Planned for v1.0: Request ID correlation across middleware and DB queries.


3. Health Check System

3.1 Health Endpoint (GET /api/health)

The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.

Source: src/drop-app/src/app/api/health/route.ts

Success Response (HTTP 200):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}

Degraded Response (HTTP 200): status: "degraded" — DB returned unexpected result.

Down Response (HTTP 503):

{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
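The mapping from database probe outcome to response can be sketched as a small pure function. This is an illustration only, not the code in route.ts: the names `DbProbe` and `decideHealth` are hypothetical, and the assumption that the probe is a trivial query expected to return 1 is not confirmed by this document.

```typescript
// Hypothetical sketch; the real logic lives in route.ts and may differ.
type DbProbe =
  | { kind: "result"; value: unknown; latencyMs: number } // query returned a value
  | { kind: "error" }; // query threw or timed out

interface HealthDecision {
  httpStatus: 200 | 503;
  status: "ok" | "degraded" | "down";
}

// Assumption: the probe is a trivial query such as SELECT 1, so a healthy
// database is expected to return the value 1.
function decideHealth(probe: DbProbe): HealthDecision {
  if (probe.kind === "error") {
    // Database unreachable: report "down" with HTTP 503.
    return { httpStatus: 503, status: "down" };
  }
  if (probe.value !== 1) {
    // Database answered, but not with the expected result.
    return { httpStatus: 200, status: "degraded" };
  }
  return { httpStatus: 200, status: "ok" };
}
```

Because the degraded case still returns HTTP 200, a monitor that looks only at the status code would not notice it; a check on the response body is needed to catch it.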

3.2 Container Health Checks

| Platform | Check | Interval | Timeout | Retries |
|---|---|---|---|---|
| Docker Compose (MVP) | wget /api/health | 30s | 10s | 3 |
| Docker Compose (Production) | wget /api/health | 30s | 10s | 3 |
| Fly.io (staging) | GET /api/health | 30s | 5s | |
| AWS App Runner | GET /api/health | 30s | 5s | 3 |

4. Alerting

4.1 Slack Alerting (Internal)

Source: src/drop-app/src/lib/alerts.ts Channel: #drop-ops on alai-talk.slack.com Webhook: SLACK_WEBHOOK_URL environment variable

| Alert Type | Trigger | Severity | Emoji |
|---|---|---|---|
| App startup | Application boots | Info | ℹ️ |
| App shutdown | SIGTERM/SIGINT received | Info | ℹ️ |
| Error spike | > 5 errors in 60 seconds | Critical | 🚨 |
| Unhandled exception | Process event handler catches error | Critical | 🚨 |
| Custom alert | sendAlert() called in code | Variable | Variable |

Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.

Error spike detection algorithm:

  1. Every HTTP 5xx error calls trackError()
  2. Rolling 1-minute window of error timestamps maintained in memory
  3. When count > 5 in window → sends critical Slack alert
  4. Alert cooldown prevents duplicate alerts within 10 minutes
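The four steps above can be sketched as an in-memory rolling window with a per-title cooldown. This is a minimal illustration under stated assumptions, not the contents of alerts.ts: the class name `ErrorSpikeDetector`, the injected `notify` callback, and the alert title are hypothetical, and the clock is passed in so the behavior can be checked without waiting.

```typescript
// Hypothetical sketch of rolling-window spike detection with a cooldown.
type Notify = (title: string, message: string) => void;

class ErrorSpikeDetector {
  private notify: Notify;
  private threshold: number;
  private windowMs: number;
  private cooldownMs: number;
  private timestamps: number[] = [];
  private lastAlertAt = new Map<string, number>();

  constructor(notify: Notify, threshold = 5, windowMs = 60 * 1000, cooldownMs = 10 * 60 * 1000) {
    this.notify = notify;
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  // Record one error and alert when the window holds more than `threshold` errors.
  trackError(nowMs: number = Date.now()): void {
    this.timestamps.push(nowMs);
    // Keep only timestamps that are still inside the rolling window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < this.windowMs);
    if (this.timestamps.length > this.threshold) {
      const seconds = this.windowMs / 1000;
      this.alertOnce("Error spike detected", `${this.timestamps.length} errors in the last ${seconds} seconds`, nowMs);
    }
  }

  // Suppress repeats of the same title until the cooldown has elapsed.
  private alertOnce(title: string, message: string, nowMs: number): void {
    const last = this.lastAlertAt.get(title);
    if (last !== undefined && nowMs - last < this.cooldownMs) return;
    this.lastAlertAt.set(title, nowMs);
    this.notify(title, message);
  }
}
```

Both the window and the cooldown map live in process memory here, which matches the note that the cooldown resets on app restart.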

Usage in code:

import { sendAlert, trackError } from '@/lib/alerts';

// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });

// Track error (called automatically in middleware)
await trackError();

4.2 BetterStack External Monitoring

Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md) Plan: Free tier (10 monitors, 3-minute check interval)

| Monitor | URL | Check | Expected |
|---|---|---|---|
| Drop Health Check | https://drop.alai.no/api/health | HTTP GET + keyword | Status 200, body contains "status":"ok" |
| Drop Landing Page | https://drop.alai.no | HTTP GET + keyword | Status 200, body contains Send penger |
| Drop Health (US East) | https://drop.alai.no/api/health | HTTP GET (from US region) | Status 200, body contains "status":"ok" |
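To illustrate what a status-plus-keyword check verifies, here is a small sketch that models it as a plain substring match. It is not BetterStack's implementation; `monitorPasses` and the two interfaces are hypothetical names.

```typescript
// Hypothetical model of a "status code + keyword" uptime check.
interface MonitorExpectation {
  status: number;
  keyword: string;
}

interface ObservedResponse {
  status: number;
  body: string;
}

// Passes only when the status code matches and the body contains the keyword.
function monitorPasses(expected: MonitorExpectation, observed: ObservedResponse): boolean {
  return observed.status === expected.status && observed.body.indexOf(expected.keyword) !== -1;
}

// Expectation taken from the Drop Health Check row above.
const healthExpectation: MonitorExpectation = { status: 200, keyword: '"status":"ok"' };
```

A substring match of this kind is sensitive to serialization: if the endpoint ever returned pretty-printed JSON with a space after the colon, the keyword would no longer match even though the service is healthy.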

Status page: https://drop-status.betteruptime.com (public)

Escalation policy (Drop Production Incidents):

Minute 0:   Service down → Slack #drop-ops (immediate)
Minute 5:   Still down  → Email [email protected]
Minute 15:  Still down  → SMS +47 40 47 42 51 (requires paid BetterStack plan)
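The timeline above can be modeled as data plus a lookup, which may help when reasoning about who has been notified at a given point in an outage. This is only a sketch: the policy itself is configured in BetterStack, and `ESCALATION_POLICY` and `channelsNotified` are hypothetical names.

```typescript
// Hypothetical model of the escalation timeline; not BetterStack configuration.
interface EscalationStep {
  afterMinutes: number;
  channel: string;
}

const ESCALATION_POLICY: EscalationStep[] = [
  { afterMinutes: 0, channel: "Slack #drop-ops" },
  { afterMinutes: 5, channel: "Email" },
  { afterMinutes: 15, channel: "SMS" }, // requires paid BetterStack plan
];

// Returns every channel that should have fired after `minutesDown` minutes of downtime.
function channelsNotified(minutesDown: number): string[] {
  return ESCALATION_POLICY
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```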

SSL expiry warning: 14 days before certificate expiration.


5. Alerting Rules Reference

| Condition | Source | Channel | Severity | Action |
|---|---|---|---|---|
| /api/health returns 503 | BetterStack | Slack + email | Critical | Investigate DB + App Runner |
| Error spike (>5 in 60s) | alerts.ts | Slack #drop-ops | Critical | Check app logs |
| App Runner service not RUNNING | AWS Console / CloudWatch | Manual check | Critical | aws apprunner start-deployment |
| RDS CPU > 80% | CloudWatch (manual setup needed) | TBD | Warning | Investigate query patterns |
| RDS storage < 1GB | CloudWatch (manual setup needed) | TBD | Warning | Increase storage |
| SSL certificate expiring | BetterStack | Email | Warning | Renew certificate |
| App startup/shutdown | alerts.ts | Slack #drop-ops | Info | No action needed |

6. Dashboards

6.1 AWS CloudWatch Dashboard (Planned)

Target dashboard widgets:

  • App Runner: Request count, 5xx error count, latency
  • RDS: CPU, connections, free storage, read/write IOPS
  • Health check latency over time (from /api/health responses)

Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.

6.2 BetterStack Status Page

Public status page: https://drop-status.betteruptime.com

Components:

  • API Health Endpoint (linked to Drop Health Check monitor)
  • Landing Page (linked to Drop Landing Page monitor)
  • Global Network (linked to US East monitor)

7. On-Call Procedures

7.1 Escalation Matrix

| Time | Action | Who |
|---|---|---|
| Alert fires (0 min) | Acknowledge Slack alert, investigate | Alem Bašić |
| 5 min: still down | Email alert auto-sent, try restart | Alem Bašić |
| 15 min: still down | SMS alert (if configured), escalate | Alem Bašić |
| 30 min: unresolved | Follow DR runbook for scenario | Alem Bašić + John (AI) |

7.2 First Response Checklist

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --since 10m --region eu-west-1

# 3. Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq

# 5. Restart App Runner if needed
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

8. Monitoring Gaps (Planned for v1.0)

| Gap | Impact | Priority |
|---|---|---|
| Structured JSON logging | Cannot correlate requests across log lines | High |
| CloudWatch alarms for RDS | No automated alerting on DB metrics | High |
| APM / request tracing | Cannot trace slow requests | Medium |
| Business metrics dashboard | No visibility into transaction volume/success | Medium |
| Redis-backed error counter | Error counter resets on restart | Low |
| Audit logging stream | Required for compliance (AML) | High |
| Per-endpoint error tracking | Cannot isolate problematic routes | Medium |


Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |