Monitoring & Observability

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Initial draft from source code and infrastructure analysis |

1. Observability Strategy

Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.

Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)

Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.

Core Questions We Must Be Able to Answer:

  1. Is Drop up and serving users? (BetterStack monitors /api/health)
  2. Is the database connected and responding? (health endpoint DB query)
  3. Are errors spiking? (Slack alerts.ts error spike detection)
  4. What changed before the problem? (CloudWatch App Runner logs)
  5. When was the last successful deployment? (App Runner deployment history)

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| App Runner service status | AWS CloudWatch / App Runner API | RUNNING → any other state | Critical |
| RDS instance status | AWS CloudWatch / RDS API | Not available | Critical |
| RDS CPU utilization | AWS CloudWatch CPUUtilization | > 80% (warn), > 95% (critical) | Warning / Critical |
| RDS free storage | AWS CloudWatch FreeStorageSpace | < 1GB (warn), < 256MB (critical) | Warning / Critical |
| RDS database connections | AWS CloudWatch DatabaseConnections | > 70 (warn, db.t4g.micro max ~85) | Warning |
| App Runner concurrent requests | AWS CloudWatch | TBD (no baseline yet) | TBD |
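
The warn/critical thresholds in the table could be evaluated with a small helper like the one below. This is a minimal sketch, not existing Drop code: the names `Threshold`, `classify`, `RDS_CPU_PERCENT`, and `RDS_FREE_STORAGE_MB` are hypothetical, and 1GB is assumed to mean 1024 MB.

```typescript
type Severity = "ok" | "warning" | "critical";

// Hypothetical shape for one row of the thresholds table.
interface Threshold {
  warn: number;
  critical: number;
  // "above": breach when the value rises past the limit (e.g. CPU).
  // "below": breach when the value falls under the limit (e.g. free storage).
  direction: "above" | "below";
}

// Values taken from the table; 1GB is assumed to be 1024 MB.
const RDS_CPU_PERCENT: Threshold = { warn: 80, critical: 95, direction: "above" };
const RDS_FREE_STORAGE_MB: Threshold = { warn: 1024, critical: 256, direction: "below" };

function classify(value: number, t: Threshold): Severity {
  const breached = (limit: number): boolean =>
    t.direction === "above" ? value > limit : value < limit;
  // Check the stricter limit first so a critical value is never reported as a warning.
  if (breached(t.critical)) return "critical";
  if (breached(t.warn)) return "warning";
  return "ok";
}
```

For example, `classify(85, RDS_CPU_PERCENT)` yields `"warning"` and `classify(100, RDS_FREE_STORAGE_MB)` yields `"critical"`. The comparisons are strict, matching the `>` and `<` used in the table.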

Application Metrics (RED Method)

| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| Request rate | CloudWatch App Runner | Baseline TBD | Informational |
| Error rate | src/lib/alerts.ts trackError() | > 5 errors in 60 seconds → Slack alert | Critical |
| DB query latency | /api/health checks.db.latencyMs field | > 100ms (warn) | Warning |
| Health endpoint status | BetterStack + /api/health | status != "ok" or HTTP 503 | Critical |
| Rate limit hits (429) | App logs | Spike of 429 responses | Warning |

Business Metrics (Planned for v1.0)

| Metric | Description | Target |
|---|---|---|
| Transactions per hour | Successful remittances + QR payments | TBD |
| Transaction success rate | Completed / total initiated | > 99% |
| KYC approval rate | Sumsub approvals / attempts | > 80% |
| BankID login success rate | Successful OIDC callbacks / initiated | > 99% |
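
The ratio-style targets above could be checked along these lines. This is only a sketch under assumptions: `successRatePercent` and `meetsTarget` are hypothetical names, and how the counts are collected is not defined in this document.

```typescript
// Hypothetical helper: completed / initiated, expressed as a percentage.
// Returns null when nothing was initiated, since the rate is undefined
// (neither 0% nor 100%) for an empty period.
function successRatePercent(completed: number, initiated: number): number | null {
  if (initiated === 0) return null;
  return (completed * 100) / initiated;
}

// Targets in the table are strict lower bounds ("> 99%", "> 80%").
function meetsTarget(ratePercent: number | null, targetPercent: number): boolean {
  return ratePercent !== null && ratePercent > targetPercent;
}
```

For instance, 995 completed out of 1000 initiated transactions gives 99.5%, which satisfies the `> 99%` target, while 79 KYC approvals out of 100 attempts would miss the `> 80%` target.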

2.2 Logs

Log Sources

| Source | Log Group | Format | Retention |
|---|---|---|---|
| App Runner (application) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application | Next.js console output | 30 days (must be set explicitly; CloudWatch Logs keeps logs indefinitely by default) |
| App Runner (system) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service | App Runner events | 30 days |
| RDS PostgreSQL | RDS error log via CloudWatch | PostgreSQL log format | 7 days |

Log Access

# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Filter for errors in the last hour
# (GNU date syntax; on macOS/BSD use: date -u -v-1H +%s)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR" \
  --region eu-west-1

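# Download RDS error log
# (list available file names first with: aws rds describe-db-log-files --db-instance-identifier drop-db)
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --starting-token 0 \
  --output text \
  --region eu-west-1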

Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.
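
As an illustration of what the planned JSON format might look like, the sketch below emits one JSON object per log line. Nothing here exists in the codebase yet: `formatLogLine`, the field names, and the `requestId` field are all assumptions. Levels are kept uppercase so that the existing `--filter-pattern "ERROR"` search would still match, since CloudWatch Logs filter patterns are case-sensitive.

```typescript
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Hypothetical extra fields; requestId anticipates the planned request-ID correlation.
interface LogFields {
  requestId?: string;
  [key: string]: unknown;
}

// Builds a single-line JSON log entry. Core fields are written last so that
// caller-supplied fields cannot overwrite timestamp, level, or message.
function formatLogLine(
  level: LogLevel,
  message: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    message,
  });
}
```

A call such as `formatLogLine("ERROR", "db timeout", { requestId: "req-1" })` produces one line that can be parsed back with `JSON.parse`, which is what makes per-request filtering possible once every line carries the same ID.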

2.3 Traces

Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).

Planned for v1.0: Request ID correlation across middleware and DB queries.


3. Health Check System

3.1 Health Endpoint (GET /api/health)

The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.

Source: src/drop-app/src/app/api/health/route.ts

Success Response (HTTP 200):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}

Degraded Response (HTTP 200): status is "degraded" when the DB query completes but returns an unexpected result.

Down Response (HTTP 503):

{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
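
The mapping from DB probe outcome to the three responses above can be summarized as follows. This is a sketch of the decision logic only, not the contents of route.ts: `ProbeOutcome`, `summarizeHealth`, and the expected probe value of `1` (as from a `SELECT 1`-style query) are assumptions.

```typescript
// Hypothetical description of what the DB probe did.
type ProbeOutcome =
  | { kind: "returned"; value: unknown; latencyMs: number }
  | { kind: "threw" };

interface HealthSummary {
  httpStatus: 200 | 503;
  status: "ok" | "degraded" | "down";
}

// expected = 1 assumes a "SELECT 1"-style probe; the real query may differ.
function summarizeHealth(outcome: ProbeOutcome, expected: unknown = 1): HealthSummary {
  // Query failed or connection was refused: report down with HTTP 503.
  if (outcome.kind === "threw") return { httpStatus: 503, status: "down" };
  // Query completed but the result is not what was expected: degraded, still HTTP 200.
  if (outcome.value !== expected) return { httpStatus: 200, status: "degraded" };
  return { httpStatus: 200, status: "ok" };
}
```

Because the degraded case still returns HTTP 200, a monitor that checks only the status code would not notice it; a keyword check on `"status":"ok"` would.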

3.2 Container Health Checks

| Platform | Check | Interval | Timeout | Retries |
|---|---|---|---|---|
| Docker Compose (MVP) | wget /api/health | 30s | 10s | 3 |
| Docker Compose (Production) | wget /api/health | 30s | 10s | 3 |
| Fly.io (staging) | GET /api/health | 30s | 5s | |
| AWS App Runner | GET /api/health | 30s | 5s | 3 |

4. Alerting

4.1 Slack Alerting (Internal)

Source: src/drop-app/src/lib/alerts.ts
Channel: #drop-ops on alai-talk.slack.com
Webhook: SLACK_WEBHOOK_URL environment variable

| Alert Type | Trigger | Severity | Emoji |
|---|---|---|---|
| App startup | Application boots | Info | ℹ️ |
| App shutdown | SIGTERM/SIGINT received | Info | ℹ️ |
| Error spike | > 5 errors in 60 seconds | Critical | 🚨 |
| Unhandled exception | Process event handler catches error | Critical | 🚨 |
| Custom alert | sendAlert() called in code | Variable | Variable |

Cooldown: each alert title has a 10-minute cooldown to prevent spam. The cooldown resets on app restart.

Error spike detection algorithm:

  1. Every HTTP 5xx error calls trackError()
  2. Rolling 1-minute window of error timestamps maintained in memory
  3. When count > 5 in window → sends critical Slack alert
  4. Alert cooldown prevents duplicate alerts within 10 minutes
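
The four steps above can be sketched as a rolling-window counter with a per-title cooldown. This is an illustrative reconstruction from the description, not the code in alerts.ts: the class name `ErrorSpikeDetector`, its `track` method, and the default alert title are hypothetical, and time is passed in explicitly so the behaviour is easy to test.

```typescript
const WINDOW_MS = 60_000;          // rolling 1-minute window
const SPIKE_THRESHOLD = 5;         // alert when the count is > 5
const COOLDOWN_MS = 10 * 60_000;   // 10-minute cooldown per alert title

class ErrorSpikeDetector {
  // In-memory state: lost on restart, as noted for the real implementation.
  private timestamps: number[] = [];
  private lastAlertAt = new Map<string, number>();

  // Records one error at time `now` (ms since epoch) and returns true
  // when a spike alert should be sent for `title`.
  track(now: number, title: string = "Error spike"): boolean {
    this.timestamps.push(now);

    // Keep only errors that fall inside the rolling window.
    const cutoff = now - WINDOW_MS;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);

    if (this.timestamps.length <= SPIKE_THRESHOLD) return false;

    // Suppress duplicates while the cooldown for this title is active.
    const last = this.lastAlertAt.get(title);
    if (last !== undefined && now - last < COOLDOWN_MS) return false;

    this.lastAlertAt.set(title, now);
    return true;
  }
}
```

With this sketch, the sixth error inside one minute triggers an alert, further errors in the next ten minutes are counted but not alerted, and a fresh spike after the cooldown alerts again.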

Usage in code:

import { sendAlert, trackError } from '@/lib/alerts';

// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });

// Track error (called automatically in middleware)
await trackError();

4.2 BetterStack External Monitoring

Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md)
Plan: Free tier (10 monitors, 3-minute check interval)

| Monitor | URL | Check | Expected |
|---|---|---|---|
| Drop Health Check | https://drop.alai.no/api/health | HTTP GET + keyword | Status 200, body contains "status":"ok" |
| Drop Landing Page | https://drop.alai.no | HTTP GET + keyword | Status 200, body contains Send penger |
| Drop Health (US East) | https://drop.alai.no/api/health | HTTP GET (from US region) | Status 200, body contains "status":"ok" |
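
The "status 200 plus keyword" expectation in the table amounts to the check sketched below. It assumes the monitor does a literal substring match on the response body; `monitorPasses` is a hypothetical name used only for illustration.

```typescript
// Assumption: the keyword check is a plain, case-sensitive substring match.
function monitorPasses(httpStatus: number, body: string, keyword: string): boolean {
  return httpStatus === 200 && body.includes(keyword);
}

const KEYWORD = '"status":"ok"';

// Compact JSON (the JSON.stringify default) contains the keyword verbatim.
const compactBody = JSON.stringify({ data: { status: "ok" } });

// Pretty-printed JSON inserts a space after the colon, so the same
// keyword no longer matches even though the service is healthy.
const prettyBody = JSON.stringify({ data: { status: "ok" } }, null, 2);

const compactPasses = monitorPasses(200, compactBody, KEYWORD); // true
const prettyPasses = monitorPasses(200, prettyBody, KEYWORD);   // false
```

The practical consequence is that the keyword depends on how the health endpoint serializes its response: if the output were ever pretty-printed, the monitors would report the service as down.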

Status page: https://drop-status.betteruptime.com (public)

Escalation policy (Drop Production Incidents):

Minute 0:   Service down → Slack #drop-ops (immediate)
Minute 5:   Still down  → Email [email protected]
Minute 15:  Still down  → SMS +47 40 47 42 51 (requires paid BetterStack plan)
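
Read as data, the escalation policy is a list of time offsets and channels. The sketch below models it that way; the names `ESCALATION` and `channelsNotified` are hypothetical, and the `smsEnabled` flag stands in for the paid-plan requirement noted above.

```typescript
interface EscalationStep {
  afterMinutes: number;
  channel: "slack" | "email" | "sms";
}

// Offsets taken from the escalation policy above.
const ESCALATION: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" },
];

// Returns the channels that should have been notified once the service
// has been down for `minutesDown` minutes. SMS is skipped unless enabled,
// mirroring the paid-plan requirement.
function channelsNotified(minutesDown: number, smsEnabled: boolean = false): string[] {
  return ESCALATION
    .filter((step) => minutesDown >= step.afterMinutes)
    .filter((step) => step.channel !== "sms" || smsEnabled)
    .map((step) => step.channel);
}
```

On the free tier, `channelsNotified(20)` returns only Slack and email, which makes the gap at minute 15 explicit.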

SSL expiry warning: 14 days before certificate expiration.


5. Alerting Rules Reference

| Condition | Source | Channel | Severity | Action |
|---|---|---|---|---|
| /api/health returns 503 | BetterStack | Slack + email | Critical | Investigate DB + App Runner |
| Error spike (>5 in 60s) | alerts.ts | Slack #drop-ops | Critical | Check app logs |
| App Runner service not RUNNING | AWS Console / CloudWatch | Manual check | Critical | aws apprunner start-deployment |
| RDS CPU > 80% | CloudWatch (manual setup needed) | TBD | Warning | Investigate query patterns |
| RDS storage < 1GB | CloudWatch (manual setup needed) | TBD | Warning | Increase storage |
| SSL certificate expiring | BetterStack | Email | Warning | Renew certificate |
| App startup/shutdown | alerts.ts | Slack #drop-ops | Info | No action needed |

6. Dashboards

6.1 AWS CloudWatch Dashboard (Planned)

Target dashboard widgets:

  • App Runner: Request count, 5xx error count, latency
  • RDS: CPU, connections, free storage, read/write IOPS
  • Health check latency over time (from /api/health responses)

Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.

6.2 BetterStack Status Page

Public status page: https://drop-status.betteruptime.com

Components:

  • API & Health Endpoint (linked to Drop Health Check monitor)
  • Landing Page (linked to Drop Landing Page monitor)
  • Global Network (linked to US East monitor)

7. On-Call Procedures

7.1 Escalation Matrix

| Time | Action | Who |
|---|---|---|
| Alert fires (0 min) | Acknowledge Slack alert, investigate | Alem Bašić |
| 5 min: still down | Email alert auto-sent, try restart | Alem Bašić |
| 15 min: still down | SMS alert (if configured), escalate | Alem Bašić |
| 30 min: unresolved | Follow the DR runbook for the matching scenario | Alem Bašić + John (AI) |

7.2 First Response Checklist

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --since 10m --region eu-west-1

# 3. Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq

# 5. Restart App Runner if needed
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

8. Monitoring Gaps (Planned for v1.0)

| Gap | Impact | Priority |
|---|---|---|
| Structured JSON logging | Cannot correlate requests across log lines | High |
| CloudWatch alarms for RDS | No automated alerting on DB metrics | High |
| APM / request tracing | Cannot trace slow requests | Medium |
| Business metrics dashboard | No visibility into transaction volume/success | Medium |
| Redis-backed error counter | Error counter resets on restart | Low |
| Audit logging stream | Required for compliance (AML) | High |
| Per-endpoint error tracking | Cannot isolate problematic routes | Medium |


Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |