Monitoring & Alerting

Drop Monitoring

Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts, sentry.server.config.ts


Health Check Endpoint

Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35

What It Checks

  1. Database connectivity -- Executes SELECT 1 as ok against the database
  2. Database latency -- Measures query execution time in milliseconds
  3. Database driver -- Reports whether using pg (PostgreSQL) or sqlite
  4. Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
  5. Application uptime -- Tracks seconds since server start
  6. Application version -- Reads from npm_package_version env var, defaults to 0.1.0

Status Values

  • ok -- All checks pass (HTTP 200)
  • degraded -- DB query returned unexpected result (HTTP 200)
  • down -- DB unreachable (HTTP 503)
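
A minimal sketch of how the three status values above could be derived from the result of the SELECT 1 as ok probe. The DbProbe type and deriveHealth function are hypothetical names used for illustration; they are not taken from route.ts.

```typescript
// Hypothetical shape for the outcome of the "SELECT 1 as ok" probe.
type DbProbe =
  | { reachable: true; rows: Array<{ ok?: unknown }>; latencyMs: number }
  | { reachable: false };

type HealthStatus = 'ok' | 'degraded' | 'down';

// Maps a probe outcome to the documented status value and HTTP code.
function deriveHealth(probe: DbProbe): { status: HealthStatus; httpStatus: number } {
  // Database unreachable: report "down" with 503.
  if (!probe.reachable) return { status: 'down', httpStatus: 503 };

  // Query ran but did not return ok = 1: report "degraded", still HTTP 200.
  const first = probe.rows[0];
  const okValue = first ? Number(first.ok) : NaN;
  if (okValue !== 1) return { status: 'degraded', httpStatus: 200 };

  // Query ran and returned the expected row.
  return { status: 'ok', httpStatus: 200 };
}
```

Because degraded still returns HTTP 200, a monitor that only looks at the status code cannot tell it apart from ok; the response body has to be inspected as well.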

Response Format

Healthy (200 OK):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Down (503 Service Unavailable):

{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Container Health Checks

Docker Compose (MVP)

Source: docker-compose.yml:12-17

healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s

Docker Compose (Production)

Source: docker-compose.production.yml:9-14

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5

The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).

Fly.io

Source: fly.toml:19-23

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"

Fly.io uses this health check to determine machine readiness and to route traffic.


Current Monitoring State

What Exists

  • Health check endpoint with real database verification (not hardcoded)
  • Container-level health checks (Docker + Fly.io)
  • Automatic restart on failure (restart: unless-stopped in docker-compose)
  • Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet

  • External uptime monitoring service (see External Uptime Monitoring below for setup options)
  • Application Performance Monitoring (APM)
  • Structured logging (JSON format)
  • Log aggregation and forwarding
  • Database performance monitoring
  • Rate limit monitoring/metrics
  • Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking

Status: Configured and ready (MC #1183)
Source: sentry.server.config.ts, sentry.client.config.ts, src/lib/sentry.ts

What It Captures

  • Unhandled exceptions in API routes
  • Server component errors
  • Client-side errors (browser)
  • Middleware errors
  • Performance traces (10% sample rate by default)

Required Environment Variables

Variable                Where        Required  Description
SENTRY_DSN              Server-side  Yes       Sentry project DSN (secret, server-only)
NEXT_PUBLIC_SENTRY_DSN  Client-side  Yes       Public DSN for browser error tracking

Optional Environment Variables

Variable                   Default  Description
SENTRY_AUTH_TOKEN          None     Upload source maps for stack traces
SENTRY_ORG                 None     Sentry organization slug
SENTRY_PROJECT             None     Sentry project slug
SENTRY_TRACES_SAMPLE_RATE  0.1      Performance monitoring sample rate (0.0-1.0)
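
A sketch of reading SENTRY_TRACES_SAMPLE_RATE with the documented default of 0.1. The parseSampleRate helper is hypothetical, and the fallback and clamping rules are assumptions; this page does not say how the real config handles invalid values.

```typescript
// Parses a sample rate from an environment variable value.
// Assumption: missing or non-numeric input falls back to the default,
// and out-of-range input is clamped to [0, 1].
function parseSampleRate(raw: string | undefined, fallback: number = 0.1): number {
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  if (!isFinite(value)) return fallback;
  return Math.min(1, Math.max(0, value));
}
```

In a config file this would typically be called as parseSampleRate(process.env.SENTRY_TRACES_SAMPLE_RATE) and the result passed as the tracesSampleRate option.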

Data Scrubbing (Privacy)

The following data is automatically removed before sending to Sentry:

  • Headers: authorization, cookie, x-auth-token
  • Cookies: auth-token, session
  • Query params: token, password, pin
  • Request body: password, pin, cardNumber, cvv
  • Breadcrumbs: Sensitive fields in request bodies

No PII is sent to Sentry (sendDefaultPii: false).
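
A minimal sketch of the scrubbing rules listed above, written as a pure function that could be called from a Sentry beforeSend hook. ScrubbableEvent is a simplified stand-in for the request part of a Sentry event, and omitKeys, scrubQuery, and scrubEvent are hypothetical helpers; the actual logic in sentry.server.config.ts may differ. Breadcrumb scrubbing is not covered here.

```typescript
// Simplified stand-in for the request part of a Sentry event (assumption).
interface ScrubbableEvent {
  request?: {
    headers?: Record<string, string>;
    cookies?: Record<string, string>;
    query_string?: string;
    data?: Record<string, unknown>;
  };
}

// Field names taken from the list above.
const SENSITIVE_HEADERS = ['authorization', 'cookie', 'x-auth-token'];
const SENSITIVE_COOKIES = ['auth-token', 'session'];
const SENSITIVE_QUERY = ['token', 'password', 'pin'];
const SENSITIVE_BODY = ['password', 'pin', 'cardNumber', 'cvv'];

// Returns a copy of `obj` without the listed keys.
// With ignoreCase, keys are lowercased before comparison (useful for headers).
function omitKeys<T>(
  obj: Record<string, T> | undefined,
  keys: string[],
  ignoreCase: boolean = false,
): Record<string, T> | undefined {
  if (!obj) return obj;
  const out: Record<string, T> = {};
  for (const key of Object.keys(obj)) {
    const probe = ignoreCase ? key.toLowerCase() : key;
    if (keys.indexOf(probe) === -1) out[key] = obj[key] as T;
  }
  return out;
}

// Drops sensitive name=value pairs from a raw query string.
function scrubQuery(qs: string | undefined): string | undefined {
  if (!qs) return qs;
  return qs
    .split('&')
    .filter((pair) => SENSITIVE_QUERY.indexOf(pair.split('=')[0] ?? '') === -1)
    .join('&');
}

// Returns a scrubbed copy of the event; the input is not mutated.
function scrubEvent(event: ScrubbableEvent): ScrubbableEvent {
  if (!event.request) return event;
  const { headers, cookies, query_string, data } = event.request;
  return {
    ...event,
    request: {
      ...event.request,
      headers: omitKeys(headers, SENSITIVE_HEADERS, true),
      cookies: omitKeys(cookies, SENSITIVE_COOKIES),
      query_string: scrubQuery(query_string),
      data: omitKeys(data, SENSITIVE_BODY),
    },
  };
}
```

Header names are compared case-insensitively because HTTP header casing varies between clients; body and cookie keys are compared exactly as listed.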

Setup Instructions

  1. Create a project at sentry.io
  2. Get DSN from Project Settings → Client Keys (DSN)
  3. Set environment variables:
    # .env.local (server-side)
    SENTRY_DSN=https://<public_key>@<host>/<project_id>
    
    # .env (public, safe to commit)
    NEXT_PUBLIC_SENTRY_DSN=https://<public_key>@<host>/<project_id>
    
  4. Optional: Set up source map uploads for better stack traces:
    SENTRY_AUTH_TOKEN=your_auth_token
    SENTRY_ORG=your_org_slug
    SENTRY_PROJECT=your_project_slug
    

Usage in Code

import { captureError, captureMessage } from '@/lib/sentry';

// Capture exception
try {
  await dangerousOperation();
} catch (error) {
  captureError(error, {
    userId: user.id,
    requestId: req.headers.get('x-request-id'),
    tags: { feature: 'remittance' },
    extra: { amount, currency },
  });
}

// Capture message
captureMessage('Critical threshold exceeded', 'warning', {
  tags: { metric: 'error_rate' },
});

Graceful Degradation: When SENTRY_DSN is not set, all Sentry functions become no-ops (dev mode works without Sentry).


Slack Alerting

Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts

Features

  • Operational alerts sent to Slack webhook
  • 10-minute cooldown per alert title (prevents spam)
  • Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
  • Graceful degradation when webhook URL not set (dev mode)
  • Optional Sentry URL attachment

Setup Instructions

  1. Create incoming webhook in Slack workspace:
    • Go to Slack App Directory → Incoming Webhooks
    • Choose channel (e.g., #ops or #alerts)
    • Copy webhook URL
  2. Set environment variable:
    # .env.local (server-side secret)
    SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
    

Required Environment Variable

Variable           Required          Description
SLACK_WEBHOOK_URL  Yes (production)  Slack incoming webhook URL

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.

Alert Types and Severities

Severity  Emoji  Use Case
info      ℹ️     Application startup, normal operations
warning   ⚠️     Degraded performance, non-critical issues
critical  🚨     Service outages, data loss, security incidents
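
A sketch of how an alert could be turned into a Slack incoming-webhook payload with the severity emoji as a prefix. Incoming webhooks accept a JSON body with a text field; the message layout and the formatSlackPayload name are assumptions, not the actual format used in src/lib/alerts.ts.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Mirrors the fields passed to sendAlert() in the usage examples.
interface AlertInput {
  severity: Severity;
  title: string;
  message: string;
  sentryUrl?: string;
}

const SEVERITY_EMOJI: Record<Severity, string> = {
  info: 'ℹ️',
  warning: '⚠️',
  critical: '🚨',
};

// Builds the JSON body for a Slack incoming webhook.
// Layout (emoji, bold title, message, optional Sentry link) is illustrative.
function formatSlackPayload(alert: AlertInput): { text: string } {
  const lines = [`${SEVERITY_EMOJI[alert.severity]} *${alert.title}*`, alert.message];
  if (alert.sentryUrl) lines.push(`Sentry: ${alert.sentryUrl}`);
  return { text: lines.join('\n') };
}
```

The resulting object would be serialized with JSON.stringify and sent as the body of a POST request to SLACK_WEBHOOK_URL.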

Cooldown Behavior

  • Each alert title has a 10-minute cooldown
  • Same title sent within 10 minutes → skipped (prevents spam)
  • Different titles → sent immediately (independent tracking)
  • Cooldown resets on app restart (in-memory tracking)

Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "Sentry error spike" can still be sent at 10:05.
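
The cooldown rules can be sketched as an in-memory map from alert title to the time it was last sent. The createCooldown factory is a hypothetical illustration, not the code in src/lib/alerts.ts, and it assumes the cooldown is measured from the last alert that was actually sent, not from the last attempt.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes

// Returns a function that decides whether an alert with a given title may be
// sent now. State lives in memory, so it is lost on restart.
function createCooldown(cooldownMs: number = COOLDOWN_MS) {
  const lastSent: Record<string, number> = Object.create(null);

  return function shouldSend(title: string, now: number = Date.now()): boolean {
    const previous = lastSent[title];
    // Same title within the cooldown window: skip.
    if (previous !== undefined && now - previous < cooldownMs) return false;
    // First send for this title, or cooldown elapsed: record and allow.
    lastSent[title] = now;
    return true;
  };
}

// Usage mirroring the example above (times in ms relative to 10:00).
const shouldSend = createCooldown();
const minute = 60 * 1000;
shouldSend('Database connection failed', 0);          // true: first send
shouldSend('Database connection failed', 5 * minute); // false: within cooldown
shouldSend('Sentry error spike', 5 * minute);         // true: different title
```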

Usage in Code

import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Alert with Sentry link
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
  sentryUrl: 'https://sentry.io/organizations/drop/issues/12345/',
});

Current Integrations

  • App startup: Sends info alert when server starts (instrumentation.ts)
  • App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
  • Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
  • Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)

Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

How it works:

  1. Every server error (HTTP 5xx) is tracked via trackError()
  2. A rolling 1-minute window of error timestamps is maintained
  3. When the window holds more than 5 errors, a critical alert is sent
  4. Tracking is wired into both Sentry error capture and middleware error handling

Threshold: more than 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/sentry.ts:captureError() and src/lib/middleware.ts:jsonError()

Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
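
A minimal sketch of the rolling-window approach, assuming the alert fires once the window holds more than 5 errors. createSpikeDetector is a hypothetical name; the real trackError() presumably also sends the alert, which this sketch leaves out.

```typescript
// Returns a tracker that records one error per call and reports whether the
// number of errors inside the rolling window exceeds the threshold.
function createSpikeDetector(threshold: number = 5, windowMs: number = 60 * 1000) {
  let timestamps: number[] = [];

  return function track(now: number = Date.now()): boolean {
    // Drop timestamps that have fallen out of the window.
    const cutoff = now - windowMs;
    timestamps = timestamps.filter((t) => t > cutoff);
    // Record the current error and compare against the threshold.
    timestamps.push(now);
    return timestamps.length > threshold;
  };
}
```

With the defaults, five errors one second apart do not trigger the detector and the sixth does. Without a cooldown, every further error in the same window would report a spike again, so pairing this with the per-title alert cooldown keeps Slack quiet during a sustained incident.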


External Uptime Monitoring (BetterStack or UptimeRobot)

Status: Not yet configured (manual setup required)

Why External Monitoring?

  • Detects infrastructure failures (container crashes, network issues)
  • Independent from the application (catches total outages)
  • Provides uptime SLA tracking
  • Multiple notification channels (email, Slack, SMS)

Option A: BetterStack (Recommended)

Cost: Free tier (5 monitors, 3-minute interval)

  1. Create account at betterstack.com
  2. Go to Monitors > Create Monitor
  3. Configure:
    • Monitor type: HTTP(s)
    • URL: https://your-domain.com/api/health
    • Check interval: 60 seconds (paid) or 180 seconds (free)
    • Request timeout: 5 seconds
    • Confirmation period: 1 retry before alerting
  4. Set Expected status code: 200
  5. Add Keyword check: Response body contains "status":"ok"
  6. Configure alert policy (see Escalation Policy below)
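
The combination of expected status code and keyword check matters because a degraded response still returns HTTP 200. The sketch below mimics how a monitor would evaluate a response; evaluateCheck is a hypothetical helper, and it assumes the keyword is matched as a literal substring against compact JSON (no whitespace after the colon).

```typescript
interface MonitorResult {
  up: boolean;
  reason: string;
}

// Evaluates a health response the way an external monitor with a status-code
// rule and a keyword rule would.
function evaluateCheck(
  statusCode: number,
  body: string,
  keyword: string = '"status":"ok"',
): MonitorResult {
  if (statusCode !== 200) return { up: false, reason: `unexpected status ${statusCode}` };
  if (body.indexOf(keyword) === -1) return { up: false, reason: 'keyword not found' };
  return { up: true, reason: 'ok' };
}

// A degraded response passes the status-code rule but fails the keyword rule.
const degradedBody = JSON.stringify({ data: { status: 'degraded' } });
const degradedResult = evaluateCheck(200, degradedBody);
// degradedResult.up is false, degradedResult.reason is 'keyword not found'
```

If the endpoint ever pretty-prints its JSON, the literal keyword "status":"ok" would stop matching, so it is worth confirming the exact response body before saving the monitor.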

Option B: UptimeRobot

Cost: Free tier (50 monitors, 5-minute interval)

  1. Create account at uptimerobot.com
  2. Add HTTP(S) monitor:
    • Friendly Name: Drop Production
    • URL: https://your-domain.com/api/health
    • Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
  3. Configure alert contacts (see Escalation Policy below)
  4. Set Keyword Monitoring: Response contains "status":"ok"

Check                   Endpoint/Target     Interval  Timeout
Health check            GET /api/health     60s       5s
SSL certificate expiry  Domain certificate  Daily     N/A
Response time           GET /api/health     60s       500ms threshold
Homepage                GET /               300s      10s

Alert Channels

Channel        Use Case         Setup
Slack #alerts  All incidents    Add Slack integration in monitoring tool
Email          P1/P2 incidents  Add team email addresses
SMS (paid)     P1 only          Add phone numbers for on-call

Escalation Policy

Minute 0:   Alert fires → Slack #alerts (automatic)
Minute 5:   Still down → Email to on-call engineer
Minute 15:  Still down → SMS to CTO
Minute 30:  Still down → Phone call to CEO

Configure this in BetterStack under Escalation Policies or in UptimeRobot under Alert Contacts.

Note: This is external to the Drop application -- no code changes needed, purely configuration.


Monitoring Stack Summary

Implemented (MC #1184)

  • Health check endpoint -- /api/health with real database verification
  • Container health checks -- Docker + Fly.io auto-restart on failure
  • Error tracking -- Sentry with automatic PII scrubbing
  • Slack alerting -- Operational alerts with cooldown protection
  • Lifecycle monitoring -- App startup and graceful shutdown alerts
  • Error spike detection -- Automatic alerting when >5 errors/minute

Planned (not yet implemented)

  • 📋 External uptime monitoring -- UptimeRobot checking /api/health every 5 minutes
  • 📋 Structured logging -- JSON log format with request IDs for correlation
  • 📋 Metrics dashboard -- Request latency, error rates, database query times
  • 📋 Audit logging -- Tracked as security requirement (security/drop-security-rapport.md finding L3)

Future Enhancements (TODO)

  • Database performance monitoring (slow query alerts)
  • Rate limit metrics (track 429 errors per endpoint)
  • Business metrics dashboard (transactions per hour, success rate)
  • Redis-backed error counter (persistent across restarts)
  • Per-endpoint error tracking (isolate problematic routes)

Environment Variables Reference

Required for Production

# Sentry error tracking
SENTRY_DSN=https://<public_key>@<host>/<project_id>
NEXT_PUBLIC_SENTRY_DSN=https://<public_key>@<host>/<project_id>

# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

Optional (Enhances Features)

# Sentry source maps (better stack traces)
SENTRY_AUTH_TOKEN=sntrys_xxx
SENTRY_ORG=your_org_slug
SENTRY_PROJECT=your_project_slug

# Sentry performance monitoring (default: 0.1 = 10%)
SENTRY_TRACES_SAMPLE_RATE=0.1

Dev Mode (All Optional)

All monitoring features gracefully degrade when env vars are not set:

  • No SENTRY_DSN: Errors logged to console only
  • No SLACK_WEBHOOK_URL: Alerts logged to console only

This allows development to work without external services configured.