Monitoring & Alerting

Drop Monitoring

Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts


Health Check Endpoint

Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35

What It Checks

  1. Database connectivity -- Executes SELECT 1 as ok against the database
  2. Database latency -- Measures query execution time in milliseconds
  3. Database driver -- Reports whether using pg (PostgreSQL) or sqlite
  4. Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
  5. Application uptime -- Tracks seconds since server start
  6. Application version -- Reads from npm_package_version env var, defaults to 0.1.0

Status Values

  • ok -- All checks pass (HTTP 200)
  • degraded -- DB query returned unexpected result (HTTP 200)
  • down -- DB unreachable (HTTP 503)

Response Format

Healthy (200 OK):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Down (503 Service Unavailable):

{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
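
The mapping from database probe to status value and HTTP code can be sketched as follows. This is a minimal illustration, not the code in route.ts: the names ProbeOutcome, probeDatabase, and toHealthStatus are hypothetical, and the query runner is passed in so the sketch does not depend on a specific driver. Only the three status values and their HTTP codes are taken from the documentation above.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

// Hypothetical reduction of the `SELECT 1 as ok` probe to three outcomes.
type ProbeOutcome = "expected" | "unexpected" | "unreachable";

interface ProbeResult {
  outcome: ProbeOutcome;
  latencyMs?: number; // only measured when the query completed
}

// Runs the probe through an injected query function (assumption: it resolves
// to the rows returned by `SELECT 1 as ok` and rejects when the DB is unreachable).
async function probeDatabase(
  runQuery: () => Promise<Array<{ ok: unknown }>>,
): Promise<ProbeResult> {
  const started = Date.now();
  try {
    const rows = await runQuery();
    const latencyMs = Date.now() - started;
    const outcome: ProbeOutcome =
      rows.length === 1 && Number(rows[0]?.ok) === 1 ? "expected" : "unexpected";
    return { outcome, latencyMs };
  } catch {
    return { outcome: "unreachable" };
  }
}

// Status values and HTTP codes as documented: ok/200, degraded/200, down/503.
function toHealthStatus(outcome: ProbeOutcome): { status: HealthStatus; httpStatus: 200 | 503 } {
  switch (outcome) {
    case "expected":
      return { status: "ok", httpStatus: 200 };
    case "unexpected":
      return { status: "degraded", httpStatus: 200 };
    case "unreachable":
      return { status: "down", httpStatus: 503 };
  }
}
```

Because degraded still returns HTTP 200, a monitor that only looks at the status code cannot distinguish it from ok; the response body has to be inspected as well.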

Container Health Checks

Docker Compose (MVP)

Source: docker-compose.yml:12-17

healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s

Docker Compose (Production)

Source: docker-compose.production.yml:9-14

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5

The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).

Fly.io

Source: fly.toml:19-23

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"

Fly.io uses this health check to determine machine readiness and to route traffic.


Current Monitoring State

What Exists

  • Health check endpoint with real database verification (not hardcoded)
  • Container-level health checks (Docker + Fly.io)
  • Automatic restart on failure (restart: unless-stopped in docker-compose)
  • Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet

  • External uptime monitoring service (see UptimeRobot setup below for recommended configuration)
  • Application Performance Monitoring (APM)
  • Structured logging (JSON format)
  • Log aggregation and forwarding
  • Database performance monitoring
  • Rate limit monitoring/metrics
  • Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking

Status: REMOVED (MC #1271, Sentry uninstalled)


Slack Alerting

Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts

Features

  • Operational alerts sent to Slack webhook
  • 10-minute cooldown per alert title (prevents spam)
  • Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
  • Graceful degradation when webhook URL not set (dev mode)

Setup Instructions

  1. Create incoming webhook in Slack workspace:
    • Go to Slack App Directory → Incoming Webhooks
    • Choose channel (e.g., #ops or #alerts)
    • Copy webhook URL
  2. Set environment variable:
    # .env.local (server-side secret)
    SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
    

Required Environment Variable

Variable           Required          Description
SLACK_WEBHOOK_URL  Yes (production)  Slack incoming webhook URL

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.

Alert Types and Severities

Severity  Emoji  Use Case
info      ℹ️     Application startup, normal operations
warning   ⚠️     Degraded performance, non-critical issues
critical  🚨     Service outages, data loss, security incidents
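
As an illustration of the severity prefixes and the console fallback, the sketch below builds a Slack payload and either sends or logs it. It is not the code in src/lib/alerts.ts: toSlackPayload, deliverAlert, and the PostJson transport type are hypothetical names, and the message layout is an assumption. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
type Severity = "info" | "warning" | "critical";

interface Alert {
  severity: Severity;
  title: string;
  message: string;
}

// Emoji prefixes from the severity table.
const SEVERITY_EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

// Builds the webhook body. The layout (bold title, message on the next line)
// is an assumption for illustration.
function toSlackPayload(alert: Alert): { text: string } {
  return { text: `${SEVERITY_EMOJI[alert.severity]} *${alert.title}*\n${alert.message}` };
}

// Hypothetical transport so the sketch can be exercised without a network.
type PostJson = (url: string, body: string) => Promise<void>;

async function deliverAlert(
  alert: Alert,
  webhookUrl: string | undefined,
  post: PostJson,
): Promise<"sent" | "logged"> {
  const payload = toSlackPayload(alert);
  if (!webhookUrl) {
    // No webhook configured (dev mode): log to console instead of sending.
    console.log(`[alert] ${payload.text}`);
    return "logged";
  }
  await post(webhookUrl, JSON.stringify(payload));
  return "sent";
}
```

In a real deployment the `post` argument would wrap an HTTP client call with `Content-Type: application/json`; injecting it keeps the fallback path easy to test.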

Cooldown Behavior

  • Each alert title has a 10-minute cooldown
  • Same title sent within 10 minutes → skipped (prevents spam)
  • Different titles → sent immediately (independent tracking)
  • Cooldown resets on app restart (in-memory tracking)

Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
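
The cooldown rules above can be sketched with an in-memory map keyed by alert title. This is a minimal illustration under the stated rules, not the implementation in src/lib/alerts.ts; the AlertCooldown class and its shouldSend method are hypothetical names, and the clock is passed in so the behavior is easy to verify.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes, as documented

class AlertCooldown {
  // In-memory only, so all cooldowns reset when the process restarts.
  private readonly lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs: number = COOLDOWN_MS) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true when an alert with this title may be sent now,
  // and records the send time in that case.
  shouldSend(title: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // same title inside the cooldown: skip
    }
    this.lastSent.set(title, nowMs);
    return true;
  }
}
```

Each title is tracked independently, which matches the example: a second "Database connection failed" at 10:05 is skipped, while "High latency detected" at 10:05 goes through.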

Usage in Code

import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Warning-level alert
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
});

Current Integrations

  • App startup: Sends info alert when server starts (instrumentation.ts)
  • App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
  • Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
  • Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)

Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

How it works:

  1. Every server error (HTTP 5xx) is tracked via trackError()
  2. Maintains rolling 1-minute window of error timestamps
  3. When count exceeds threshold (5 errors in 60 seconds), sends critical alert
  4. Integrates with middleware error handling

Threshold: more than 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/middleware.ts:jsonError()

Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
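
A rolling-window counter of this kind can be sketched as below. It illustrates the technique only and is not the body of trackError(): the ErrorSpikeDetector class and its record method are hypothetical names, and the timestamp is injectable for testing. The threshold (more than 5 errors) and window (60 seconds) follow the values documented above.

```typescript
class ErrorSpikeDetector {
  // Timestamps (ms) of recent errors; in-memory, so lost on restart.
  private timestamps: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number = 5, windowMs: number = 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one error and returns true when the number of errors inside
  // the rolling window exceeds the threshold.
  record(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop entries that have fallen out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}
```

A caller would send a critical alert whenever `record()` returns true; combined with the per-title cooldown, a sustained spike produces at most one Slack message every 10 minutes.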


BetterStack Uptime Monitoring

Status: Ready to configure (setup guide available)
Documentation: BETTERSTACK-SETUP.md

Overview

BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.

Free tier includes:

  • 10 monitors (enough for Drop production)
  • 3-minute check interval
  • Unlimited integrations (Slack, email)
  • Public status page
  • SSL expiry monitoring

Monitor settings:

  • Request timeout: 5 seconds
  • Confirmation period: 1 retry before alerting

Monitor             URL                              Purpose                  Expected Response
Health Endpoint     https://drop.alai.no/api/health  API + DB connectivity    200, body contains "status":"ok"
Landing Page        https://drop.alai.no             Public website           200, body contains Send penger
Multi-Region Check  https://drop.alai.no/api/health  Geographic availability  200, body contains "status":"ok"

Alert Escalation

BetterStack sends alerts through multiple channels:

Minute 0:   Alert fires → Slack #drop-ops (immediate)
Minute 5:   Still down → Email to [email protected]
Minute 15:  Still down → SMS (requires paid plan)
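
The escalation timeline can be modeled as a small lookup, which is handy for reasoning about who has been notified at a given point in an outage. BetterStack performs the escalation itself; this sketch only mirrors the documented timeline. The names ESCALATION_STEPS and channelsNotified are hypothetical, and the smsEnabled flag is an assumption used to represent the paid-plan requirement.

```typescript
type Channel = "slack" | "email" | "sms";

// Timeline taken from the escalation steps: 0, 5, and 15 minutes.
const ESCALATION_STEPS: ReadonlyArray<{ afterMinutes: number; channel: Channel }> = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" },
];

// Returns the channels that have been notified after `minutesDown` minutes.
// SMS is skipped unless `smsEnabled` is true (it requires a paid plan).
function channelsNotified(minutesDown: number, smsEnabled: boolean = false): Channel[] {
  return ESCALATION_STEPS
    .filter((step) => minutesDown >= step.afterMinutes)
    .filter((step) => step.channel !== "sms" || smsEnabled)
    .map((step) => step.channel);
}
```

For example, seven minutes into an outage the result contains Slack and email; on the free tier it never grows beyond those two.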
    

Status Page

The public status page shows real-time service status:

  • URL: https://drop-status.betteruptime.com
  • Components: API Health, Landing Page, Global Network
  • Auto-updates: Incidents automatically posted and resolved
  • Subscriptions: Users can subscribe to email updates

Setup Instructions

Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md

Setup includes:

  1. Account creation (free tier)
  2. Configure 3 monitors (health, landing, multi-region)
  3. Slack integration (#drop-ops channel)
  4. On-call schedule and escalation policy
  5. Public status page creation
  6. Testing and verification

Key Features

Proactive monitoring:

  • 3-minute check interval (free tier) or 30s (paid)
  • Keyword verification (not just HTTP 200)
  • SSL certificate expiry warnings (14 days)
  • Multi-region checks (detect geographic issues)

Incident management:

  • Automatic incident creation on downtime
  • Status page updates (public transparency)
  • Escalation to multiple channels (Slack → Email → SMS)
  • Maintenance window support (suppress alerts during deployments)

Reporting:

  • Uptime SLA tracking (99.9% target)
  • Incident history and analysis
  • Response time graphs
  • Downtime duration reports

Integration with Drop Alerting

BetterStack complements Drop's internal alerting (src/lib/alerts.ts):

Feature   Drop Internal Alerts              BetterStack External
Detects   Application errors, error spikes  Infrastructure outages
When      App is running                    App is unreachable
Source    Application logs                  External HTTP checks
Delivery  Slack webhook (direct)            Escalation policy
Use case  Code bugs, DB issues              Container crashes, network failures

Example: the database connection fails:

  1. Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate)
  2. BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min

Maintenance Windows

When performing planned maintenance (deployments, upgrades):

  1. Create maintenance window in BetterStack
  2. Select affected monitors
  3. Set duration (e.g., 1 hour)
  4. Effect: Alerts suppressed, status page shows "Scheduled Maintenance"

Prevents: False downtime alerts during intentional service interruptions.

Best Practices

Do's:

  • ✅ Test alerts monthly (pause monitor to verify escalation)
  • ✅ Use keyword checks (not just HTTP status codes)
  • ✅ Monitor SSL expiry (14-day warnings)
  • ✅ Create maintenance windows for deployments
  • ✅ Review incident history monthly

Don'ts:

  • ❌ Don't ignore degraded status (investigate even if not fully down)
  • ❌ Don't disable monitors (use pause for temporary suppression)
  • ❌ Don't skip keyword checks (HTTP 200 ≠ working API)
  • ❌ Don't rely solely on external monitoring (combine with internal checks)
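
To show why a keyword check catches what a status-code check misses, the sketch below evaluates a probe response two ways. It is an illustration only: passesKeywordCheck and passesParsedCheck are hypothetical helpers, and whether a given monitoring tool matches the keyword as a literal substring is an assumption. The keyword `"status":"ok"` and the degraded-with-HTTP-200 behavior come from this document.

```typescript
interface ProbeResponse {
  status: number;
  body: string;
}

// Literal substring check, modeled on a keyword monitor.
function passesKeywordCheck(res: ProbeResponse, keyword: string = '"status":"ok"'): boolean {
  return res.status === 200 && res.body.includes(keyword);
}

// Whitespace-tolerant alternative: parse the JSON and read data.status.
function passesParsedCheck(res: ProbeResponse): boolean {
  if (res.status !== 200) return false;
  try {
    const parsed = JSON.parse(res.body) as { data?: { status?: unknown } } | null;
    return parsed?.data?.status === "ok";
  } catch {
    return false; // body was not valid JSON
  }
}
```

A degraded response (HTTP 200 with `"status":"degraded"`) fails both checks even though the status code alone looks healthy. Note that the literal match in this sketch is whitespace-sensitive: a pretty-printed body containing `"status": "ok"` would fail it, so the keyword should be verified against the exact bytes the endpoint returns.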

External Uptime Monitoring (Alternative: UptimeRobot)

Status: Alternative to BetterStack (not recommended)

BetterStack is recommended over UptimeRobot for Drop because:

  • Better Slack integration (richer notifications)
  • Built-in status page (UptimeRobot charges extra)
  • Better UI/UX for incident management
  • More flexible escalation policies

UptimeRobot Setup (if BetterStack unavailable)

Cost: Free tier (50 monitors, 5-minute interval)

  1. Create account at uptimerobot.com
  2. Add HTTP(S) monitor:
    • Friendly Name: Drop Production
    • URL: https://drop.alai.no/api/health
    • Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
  3. Configure alert contacts
  4. Set Keyword Monitoring: Response contains "status":"ok"

Limitations compared to BetterStack:

  • No built-in escalation policies (requires third-party integrations)
  • Status page requires paid plan
  • Less detailed incident reports
  • 5-minute check interval on the free tier (vs 3-minute on BetterStack's free tier)

Check                   Endpoint/Target     Interval  Timeout
Health check            GET /api/health     60s       5s
SSL certificate expiry  Domain certificate  Daily     N/A
Response time           GET /api/health     60s       500ms threshold
Homepage                GET /               300s      10s

Alert Channels

Channel        Use Case         Setup
Slack #alerts  All incidents    Add Slack integration in monitoring tool
Email          P1/P2 incidents  Add team email addresses
SMS (paid)     P1 only          Add phone numbers for on-call

Escalation Policy

Minute 0:   Alert fires → Slack #alerts (automatic)
Minute 5:   Still down → Email to on-call engineer
Minute 15:  Still down → SMS to CTO
Minute 30:  Still down → Phone call to CEO

Configure this in BetterStack under Escalation Policies or in UptimeRobot under Alert Contacts.

Note: This is external to the Drop application -- no code changes needed, purely configuration.


Monitoring Stack Summary

Implemented (MC #1184)

  • Health check endpoint -- /api/health with real database verification
  • Container health checks -- Docker + Fly.io auto-restart on failure
  • Error tracking -- Sentry REMOVED (MC #1271)
  • Slack alerting -- Operational alerts with cooldown protection
  • Lifecycle monitoring -- App startup and graceful shutdown alerts
  • Error spike detection -- Automatic alerting when >5 errors/minute

Planned (not yet implemented):

  • 📋 External uptime monitoring -- UptimeRobot checking /api/health every 5 minutes
  • 📋 Structured logging -- JSON log format with request IDs for correlation
  • 📋 Metrics dashboard -- Request latency, error rates, database query times
  • 📋 Audit logging -- Tracked as security requirement (security/drop-security-rapport.md finding L3)

Future Enhancements (TODO)

  • Database performance monitoring (slow query alerts)
  • Rate limit metrics (track 429 errors per endpoint)
  • Business metrics dashboard (transactions per hour, success rate)
  • Redis-backed error counter (persistent across restarts)
  • Per-endpoint error tracking (isolate problematic routes)

Environment Variables Reference

Required for Production

# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

Dev Mode (All Optional)

All monitoring features gracefully degrade when env vars are not set:

  • No SLACK_WEBHOOK_URL: Alerts logged to console only

This allows development to work without external services configured.