Monitoring & Alerting

Drop Monitoring

Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts


Health Check Endpoint

Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35

What It Checks

  1. Database connectivity -- Executes SELECT 1 as ok against the database
  2. Database latency -- Measures query execution time in milliseconds
  3. Database driver -- Reports whether using pg (PostgreSQL) or sqlite
  4. Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
  5. Application uptime -- Tracks seconds since server start
  6. Application version -- Reads from npm_package_version env var, defaults to 0.1.0

Status Values

  • ok -- All checks pass (HTTP 200)
  • degraded -- DB query returned unexpected result (HTTP 200)
  • down -- DB unreachable (HTTP 503)
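
The mapping from the database check to the three status values can be sketched as below. This is a minimal illustration under assumptions, not the code in route.ts: the names `DbProbe` and `deriveStatus` are hypothetical, and it assumes the driver returns the number 1 for `SELECT 1 as ok`.

```typescript
// Hypothetical sketch: maps a database probe outcome to the documented
// status value and HTTP code. Names are illustrative, not taken from route.ts.
type HealthStatus = "ok" | "degraded" | "down";

// Outcome of running `SELECT 1 as ok`: either a returned value or no connection.
type DbProbe =
  | { reachable: true; value: unknown }
  | { reachable: false };

function deriveStatus(probe: DbProbe): { status: HealthStatus; httpCode: number } {
  if (!probe.reachable) {
    // DB unreachable: the only case that returns a non-200 code.
    return { status: "down", httpCode: 503 };
  }
  if (probe.value !== 1) {
    // Query ran but returned something unexpected.
    return { status: "degraded", httpCode: 200 };
  }
  return { status: "ok", httpCode: 200 };
}
```

Note that under this mapping a monitor that only inspects the HTTP code cannot tell degraded from ok, since both return 200.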

Response Format

Healthy (200 OK):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Down (503 Service Unavailable):

{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
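
For consumers of the endpoint, the two payloads above suggest roughly the following shape. The interface is inferred from the examples only (the degraded payload is not shown in this document), and `summarizeHealth` is a hypothetical helper, not part of the codebase.

```typescript
// Shape inferred from the two example payloads; treat it as an assumption.
interface HealthResponse {
  data: {
    status: "ok" | "degraded" | "down";
    version: string;
    uptime: number; // seconds since server start
    checks: {
      // "pass" or "fail" in the examples; latencyMs and driver are absent on failure.
      db: { status: string; latencyMs?: number; driver?: string };
      services: { mode: string };
    };
    timestamp: string; // ISO 8601
  };
}

// Hypothetical helper: one-line summary suitable for logs.
function summarizeHealth(res: HealthResponse): string {
  const { status, checks } = res.data;
  const latency =
    checks.db.latencyMs !== undefined ? `${checks.db.latencyMs}ms` : "n/a";
  return `status=${status} db=${checks.db.status} latency=${latency}`;
}
```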

Container Health Checks

Docker Compose (MVP)

Source: docker-compose.yml:12-17

healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
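
The `retries` setting means Docker reports the container as unhealthy only after that many consecutive failed probes. The sketch below models just that rule; `ProbeTracker` is a hypothetical name, and the model is simplified (it ignores the "starting" state and the `start_period` grace window).

```typescript
// Simplified model of the consecutive-failure rule behind `retries: 3`.
class ProbeTracker {
  private consecutiveFailures = 0;
  private readonly retries: number;

  constructor(retries: number) {
    this.retries = retries;
  }

  // Record one probe result and return the resulting state.
  // Any successful probe resets the failure streak.
  record(success: boolean): "healthy" | "unhealthy" {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.retries ? "unhealthy" : "healthy";
  }
}
```

With `interval: 30s` and `retries: 3`, a hard failure therefore takes on the order of a minute or more to surface as unhealthy.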

Docker Compose (Production)

Source: docker-compose.production.yml:9-14

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5

The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).

Fly.io

Source: fly.toml:19-23

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"

Fly.io uses this health check to determine machine readiness and to route traffic.


Current Monitoring State

What Exists

  • Health check endpoint with real database verification (not hardcoded)
  • Container-level health checks (Docker + Fly.io)
  • Automatic restart on failure (restart: unless-stopped in docker-compose)
  • Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet

  • External uptime monitoring service (see the BetterStack section below for the recommended configuration; UptimeRobot is documented as an alternative)
  • Application Performance Monitoring (APM)
  • Structured logging (JSON format)
  • Log aggregation and forwarding
  • Database performance monitoring
  • Rate limit monitoring/metrics
  • Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking

Status: REMOVED (MC #1271, Sentry uninstalled)


Slack Alerting

Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts

Features

  • Operational alerts sent to Slack webhook
  • 10-minute cooldown per alert title (prevents spam)
  • Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
  • Graceful degradation when webhook URL not set (dev mode)

Setup Instructions

  1. Create incoming webhook in Slack workspace:
    • Go to Slack App Directory → Incoming Webhooks
    • Choose channel (e.g., #ops or #alerts)
    • Copy webhook URL
  2. Set environment variable:
    # .env.local (server-side secret)
    SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
    

Required Environment Variable

| Variable | Required | Description |
|---|---|---|
| SLACK_WEBHOOK_URL | Yes (production) | Slack incoming webhook URL |

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.

Alert Types and Severities

| Severity | Emoji | Use Case |
|---|---|---|
| info | ℹ️ | Application startup, normal operations |
| warning | ⚠️ | Degraded performance, non-critical issues |
| critical | 🚨 | Service outages, data loss, security incidents |
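
The severity prefixes and the console fallback described above could be combined roughly as follows. This is a sketch under assumptions rather than the contents of src/lib/alerts.ts: `buildSlackPayload`, `deliverAlert`, and `PostJson` are hypothetical names, and the bold-title message layout is invented for illustration. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
type Severity = "info" | "warning" | "critical";

interface Alert {
  severity: Severity;
  title: string;
  message: string;
}

const EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

// Builds the JSON body for a Slack incoming webhook ({ text }).
// The exact layout is an assumption.
function buildSlackPayload(alert: Alert): { text: string } {
  return { text: `${EMOJI[alert.severity]} *${alert.title}*\n${alert.message}` };
}

// Minimal poster signature so the sketch does not depend on a fetch typing.
type PostJson = (url: string, body: string) => Promise<{ ok: boolean }>;

// Returns "logged" when no webhook URL is configured (dev mode),
// otherwise "sent" or "failed" depending on the HTTP result.
async function deliverAlert(
  alert: Alert,
  webhookUrl: string | undefined,
  post: PostJson,
): Promise<"sent" | "failed" | "logged"> {
  const payload = buildSlackPayload(alert);
  if (!webhookUrl) {
    console.log(`[alert] ${payload.text}`);
    return "logged";
  }
  const res = await post(webhookUrl, JSON.stringify(payload));
  return res.ok ? "sent" : "failed";
}
```

In a real deployment `post` would wrap an HTTP client call with `Content-Type: application/json`; injecting it keeps the sketch testable without network access.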

Cooldown Behavior

  • Each alert title has a 10-minute cooldown
  • Same title sent within 10 minutes → skipped (prevents spam)
  • Different titles → sent immediately (independent tracking)
  • Cooldown resets on app restart (in-memory tracking)

Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
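
A per-title cooldown of this kind can be sketched with an in-memory map. `AlertCooldown` and `shouldSend` are hypothetical names; the actual implementation in src/lib/alerts.ts may differ.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes

class AlertCooldown {
  // Title -> timestamp (ms) of the last alert that was actually sent.
  // Held in memory only, so it is lost on restart.
  private lastSent = new Map<string, number>();

  // Returns true and records the send if the title is not cooling down;
  // returns false (skip) if the same title was sent less than 10 minutes ago.
  shouldSend(title: string, now: number): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && now - last < COOLDOWN_MS) {
      return false;
    }
    this.lastSent.set(title, now);
    return true;
  }
}
```

Passing `now` explicitly (for example `Date.now()`) rather than reading the clock inside the method keeps the behavior easy to test.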

Usage in Code

import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Warning-level alert
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
});

Current Integrations

  • App startup: Sends info alert when server starts (instrumentation.ts)
  • App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
  • Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
  • Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)

Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

How it works:

  1. Every server error (HTTP 5xx) is tracked via trackError()
  2. trackError() maintains a rolling 1-minute window of error timestamps
  3. When the count exceeds the threshold (more than 5 errors in 60 seconds), a critical alert is sent
  4. Tracking is wired into the middleware error handling

Threshold: 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/middleware.ts:jsonError()

Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
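
The rolling-window approach can be sketched as below. This is an illustration, not the body of trackError(): `ErrorSpikeDetector` is a hypothetical name, and the sketch assumes the ">5 errors" reading, so it fires on the sixth error inside the window.

```typescript
const WINDOW_MS = 60_000; // 60-second rolling window
const THRESHOLD = 5;      // alert when the count exceeds this

class ErrorSpikeDetector {
  // Timestamps (ms) of recent errors; in-memory, lost on restart.
  private timestamps: number[] = [];

  // Records an error at `now` and returns true when the number of errors
  // within the last 60 seconds is greater than THRESHOLD.
  track(now: number): boolean {
    this.timestamps.push(now);
    const cutoff = now - WINDOW_MS;
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > THRESHOLD;
  }
}
```

A caller would typically pair a `true` result with a critical alert and rely on the per-title cooldown to avoid repeating it on every subsequent error.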


BetterStack Uptime Monitoring

Status: Ready to configure (setup guide available)
Documentation: BETTERSTACK-SETUP.md

Overview

BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.

Free tier includes:

  • 10 monitors (enough for Drop production)
  • 3-minute check interval
  • Unlimited integrations (Slack, email)
  • Public status page
  • SSL expiry monitoring

Monitors to configure:

| Monitor | URL | Purpose | Expected Response |
|---|---|---|---|
| Health Endpoint | https://drop.alai.no/api/health | API + DB connectivity | 200, body contains "status":"ok" |
| Landing Page | https://drop.alai.no | Public website | 200, body contains Send penger |
| Multi-Region Check | https://drop.alai.no/api/health | Geographic availability | 200, body contains "status":"ok" |
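
The "status code plus keyword" rule used by these monitors amounts to the check sketched below. `MonitorSpec` and `evaluateMonitor` are hypothetical names for illustration. The keyword `"status":"ok"` has no space after the colon, so it matches only if the endpoint serializes its JSON compactly, which is assumed here.

```typescript
interface MonitorSpec {
  name: string;
  expectedStatus: number;
  keyword: string;
}

// A check passes only if both the HTTP status and the body keyword match.
function evaluateMonitor(spec: MonitorSpec, httpStatus: number, body: string): boolean {
  return httpStatus === spec.expectedStatus && body.includes(spec.keyword);
}

const healthMonitor: MonitorSpec = {
  name: "Health Endpoint",
  expectedStatus: 200,
  keyword: '"status":"ok"',
};
```

This is why a degraded response, which still returns HTTP 200, fails the keyword check and is reported as an incident.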

Alert Escalation

BetterStack sends alerts through multiple channels:

Minute 0:   Alert fires → Slack #drop-ops (immediate)
Minute 5:   Still down → Email to [email protected]
Minute 15:  Still down → SMS (requires paid plan)
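
The escalation timeline above can be expressed as data plus a small lookup. This models the schedule only; it is not BetterStack configuration syntax, and `EscalationStep` and `channelsNotified` are hypothetical names.

```typescript
interface EscalationStep {
  afterMinutes: number;
  channel: string;
}

const ESCALATION: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack" },  // immediate
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" },   // requires a paid plan
];

// Returns every channel that has been notified once an incident
// has been open for `minutesDown` minutes.
function channelsNotified(minutesDown: number): string[] {
  return ESCALATION
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```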

Status Page

Public status page shows real-time service status:

  • URL: https://drop-status.betteruptime.com
  • Components: API Health, Landing Page, Global Network
  • Auto-updates: Incidents automatically posted and resolved
  • Subscriptions: Users can subscribe to email updates

Setup Instructions

Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md

Setup includes:

  1. Account creation (free tier)
  2. Configure 3 monitors (health, landing, multi-region)
  3. Slack integration (#drop-ops channel)
  4. On-call schedule and escalation policy
  5. Public status page creation
  6. Testing and verification

Key Features

Proactive monitoring:

  • 3-minute check interval (free tier) or 30s (paid)
  • Keyword verification (not just HTTP 200)
  • SSL certificate expiry warnings (14 days)
  • Multi-region checks (detect geographic issues)

Incident management:

  • Automatic incident creation on downtime
  • Status page updates (public transparency)
  • Escalation to multiple channels (Slack → Email → SMS)
  • Maintenance window support (suppress alerts during deployments)

Reporting:

  • Uptime SLA tracking (99.9% target)
  • Incident history and analysis
  • Response time graphs
  • Downtime duration reports

Integration with Drop Alerting

BetterStack complements Drop's internal alerting (src/lib/alerts.ts):

| Feature | Drop Internal Alerts | BetterStack External |
|---|---|---|
| Detects | Application errors, error spikes | Infrastructure outages |
| When | App is running | App is unreachable |
| Source | Application logs | External HTTP checks |
| Delivery | Slack webhook (direct) | Escalation policy |
| Use case | Code bugs, DB issues | Container crashes, network failures |

Example: Database connection fails:

  1. Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate)
  2. BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min

Maintenance Windows

When performing planned maintenance (deployments, upgrades):

  1. Create maintenance window in BetterStack
  2. Select affected monitors
  3. Set duration (e.g., 1 hour)
  4. Effect: Alerts suppressed, status page shows "Scheduled Maintenance"

Prevents: False downtime alerts during intentional service interruptions.

Best Practices

Do's:

  • ✅ Test alerts monthly (pause monitor to verify escalation)
  • ✅ Use keyword checks (not just HTTP status codes)
  • ✅ Monitor SSL expiry (14-day warnings)
  • ✅ Create maintenance windows for deployments
  • ✅ Review incident history monthly

Don'ts:

  • ❌ Don't ignore degraded status (investigate even if not fully down)
  • ❌ Don't disable monitors (use pause for temporary suppression)
  • ❌ Don't skip keyword checks (HTTP 200 ≠ working API)
  • ❌ Don't rely solely on external monitoring (combine with internal checks)

External Uptime Monitoring (Alternative: UptimeRobot)

Status: Alternative to BetterStack (not recommended)

BetterStack is recommended over UptimeRobot for Drop because:

  • Better Slack integration (richer notifications)
  • Built-in status page (UptimeRobot charges extra)
  • Better UI/UX for incident management
  • More flexible escalation policies

UptimeRobot Setup (if BetterStack unavailable)

Cost: Free tier (50 monitors, 5-minute interval)

  1. Create account at uptimerobot.com
  2. Add HTTP(S) monitor:
    • Friendly Name: Drop Production
    • URL: https://drop.alai.no/api/health
    • Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
  3. Configure alert contacts:
  4. Set Keyword Monitoring: Response contains "status":"ok"

Limitations:

  • No built-in escalation policies (requires third-party integrations)
  • Status page requires paid plan
  • Less detailed incident reports
  • 5-minute check interval (vs 3-minute for BetterStack free)

Monitoring Stack Summary

Implemented (MC #1184)

  • Health check endpoint -- /api/health with real database verification
  • Container health checks -- Docker + Fly.io auto-restart on failure
  • Error tracking -- Sentry REMOVED (MC #1271)
  • Slack alerting -- Operational alerts with cooldown protection
  • Lifecycle monitoring -- App startup and graceful shutdown alerts
  • Error spike detection -- Automatic alerting when >5 errors/minute

Planned (📋, not yet implemented):

  • 📋 External uptime monitoring -- UptimeRobot checking /api/health every 5 minutes
  • 📋 Structured logging -- JSON log format with request IDs for correlation
  • 📋 Metrics dashboard -- Request latency, error rates, database query times
  • 📋 Audit logging -- Tracked as security requirement (security/drop-security-rapport.md finding L3)

Future Enhancements (TODO)

  • Database performance monitoring (slow query alerts)
  • Rate limit metrics (track 429 errors per endpoint)
  • Business metrics dashboard (transactions per hour, success rate)
  • Redis-backed error counter (persistent across restarts)
  • Per-endpoint error tracking (isolate problematic routes)

Environment Variables Reference

Required for Production

# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

Dev Mode (All Optional)

All monitoring features gracefully degrade when env vars are not set:

  • No SLACK_WEBHOOK_URL: Alerts logged to console only

This allows development to work without external services configured.