Monitoring & Alerting

Drop Monitoring

Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts, sentry.server.config.ts

Health Check Endpoint

Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35

What It Checks

Database connectivity -- Executes SELECT 1 as ok against the database
Database latency -- Measures query execution time in milliseconds
Database driver -- Reports whether using pg (PostgreSQL) or sqlite
Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
Application uptime -- Tracks seconds since server start
Application version -- Reads from npm_package_version env var, defaults to 0.1.0

Status Values

ok -- All checks pass (HTTP 200)
degraded -- DB query returned unexpected result (HTTP 200)
down -- DB unreachable (HTTP 503)
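
The mapping from the database probe to a status value and HTTP code can be sketched as below. This is an illustration only, not the code in route.ts: the names `DbProbe` and `deriveStatus` are hypothetical, and the sketch assumes the driver returns the `ok` column as the number 1.

```typescript
// Hypothetical shape of the database probe outcome (not taken from route.ts).
type DbProbe =
  | { reachable: false }
  | { reachable: true; ok: unknown };

type HealthStatus = "ok" | "degraded" | "down";

// Map a probe outcome to the documented status and HTTP code.
function deriveStatus(probe: DbProbe): { status: HealthStatus; httpStatus: number } {
  if (!probe.reachable) {
    return { status: "down", httpStatus: 503 }; // DB unreachable
  }
  if (probe.ok !== 1) {
    return { status: "degraded", httpStatus: 200 }; // query ran, unexpected result
  }
  return { status: "ok", httpStatus: 200 }; // SELECT 1 as ok returned 1
}
```

Note that `degraded` still answers with HTTP 200, so a monitor that only looks at the status code will not notice it.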

Response Format

Healthy (200 OK):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Down (503 Service Unavailable):

{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
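
For consumers of the endpoint, the two examples above suggest a response type along the following lines. The interface is inferred from those examples and may not match route.ts exactly; `isHealthy` is a hypothetical helper showing how a client might interpret the body.

```typescript
// Response shape inferred from the two examples above; fields may differ in route.ts.
interface HealthResponse {
  data: {
    status: "ok" | "degraded" | "down";
    version: string;
    uptime: number; // seconds since server start
    checks: {
      db: { status: "pass" | "fail"; latencyMs?: number; driver?: string };
      services: { mode: string };
    };
    timestamp: string; // ISO 8601
  };
}

// Hypothetical helper: decide whether a raw response body reports a healthy app.
function isHealthy(body: string): boolean {
  try {
    const parsed = JSON.parse(body) as Partial<HealthResponse>;
    return parsed.data?.status === "ok" && parsed.data?.checks?.db?.status === "pass";
  } catch {
    return false; // not valid JSON, or not an object
  }
}
```

Parsing the JSON is more robust than a plain substring match on `"status":"ok"`, which depends on the server emitting the body without whitespace between key and value.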

Container Health Checks

Docker Compose (MVP)

Source: docker-compose.yml:12-17

healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s

Docker Compose (Production)

Source: docker-compose.production.yml:9-14

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5

The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).

Fly.io

Source: fly.toml:19-23

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"

Fly.io uses this health check to determine machine readiness and to route traffic.

Current Monitoring State

What Exists

Health check endpoint with real database verification (not hardcoded)
Container-level health checks (Docker + Fly.io)
Automatic restart on failure (restart: unless-stopped in docker-compose)
Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet

External uptime monitoring service (see External Uptime Monitoring below for recommended configuration)
Application Performance Monitoring (APM)
Structured logging (JSON format)
Log aggregation and forwarding
Database performance monitoring
Rate limit monitoring/metrics
Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking

Status: Configured and ready (MC #1183)
Source: sentry.server.config.ts, sentry.client.config.ts, src/lib/sentry.ts

What It Captures

Unhandled exceptions in API routes
Server component errors
Client-side errors (browser)
Middleware errors
Performance traces (10% sample rate by default)

Required Environment Variables

Variable	Where	Required	Description
`SENTRY_DSN`	Server-side	Yes	Sentry project DSN (secret, server-only)
`NEXT_PUBLIC_SENTRY_DSN`	Client-side	Yes	Public DSN for browser error tracking

Optional Environment Variables

Variable	Default	Description
`SENTRY_AUTH_TOKEN`	None	Upload source maps for stack traces
`SENTRY_ORG`	None	Sentry organization slug
`SENTRY_PROJECT`	None	Sentry project slug
`SENTRY_TRACES_SAMPLE_RATE`	`0.1`	Performance monitoring sample rate (0.0-1.0)

Data Scrubbing (Privacy)

The following data is automatically removed before sending to Sentry:

Headers: authorization, cookie, x-auth-token
Cookies: auth-token, session
Query params: token, password, pin
Request body: password, pin, cardNumber, cvv
Breadcrumbs: Sensitive fields in request bodies

No PII is sent to Sentry (sendDefaultPii: false).
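
A minimal sketch of this kind of scrubbing follows. The key lists are taken from the section above, but the helper names (`dropKeys`, `scrubRequest`) and the request shape are assumptions; the actual logic in the Sentry config files may differ. With the Sentry SDK, a function like this would typically be called from a `beforeSend` hook, though the source does not state where it is wired in.

```typescript
// Key lists from the section above; helper names and request shape are hypothetical.
const SENSITIVE_HEADERS = ["authorization", "cookie", "x-auth-token"];
const SENSITIVE_BODY_KEYS = ["password", "pin", "cardNumber", "cvv"];

type Dict = Record<string, unknown>;

// Return a copy of `source` without the listed keys.
function dropKeys(source: Dict, keys: string[], caseInsensitive = false): Dict {
  const wanted = new Set(caseInsensitive ? keys.map((k) => k.toLowerCase()) : keys);
  const result: Dict = {};
  for (const [key, value] of Object.entries(source)) {
    const probe = caseInsensitive ? key.toLowerCase() : key;
    if (!wanted.has(probe)) {
      result[key] = value;
    }
  }
  return result;
}

// Strip sensitive headers and body fields from a request-like object.
function scrubRequest(request: { headers?: Dict; data?: Dict }): { headers?: Dict; data?: Dict } {
  return {
    ...request,
    // Header names are case-insensitive in HTTP, so match them that way.
    headers: request.headers && dropKeys(request.headers, SENSITIVE_HEADERS, true),
    data: request.data && dropKeys(request.data, SENSITIVE_BODY_KEYS),
  };
}
```

The sketch only handles top-level keys; nested request bodies would need a recursive walk.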

Setup Instructions

Create a project at sentry.io
Get DSN from Project Settings → Client Keys (DSN)

Set environment variables:

# .env.local (server-side)
SENTRY_DSN=https://<key>@<host>/<project_id>

# .env (public, safe to commit)
NEXT_PUBLIC_SENTRY_DSN=https://<key>@<host>/<project_id>

Optional: Set up source map uploads for better stack traces:

SENTRY_AUTH_TOKEN=your_auth_token
SENTRY_ORG=your_org_slug
SENTRY_PROJECT=your_project_slug

Usage in Code

import { captureError, captureMessage } from '@/lib/sentry';

// Capture exception
try {
  await dangerousOperation();
} catch (error) {
  captureError(error, {
    userId: user.id,
    requestId: req.headers.get('x-request-id'),
    tags: { feature: 'remittance' },
    extra: { amount, currency },
  });
}

// Capture message
captureMessage('Critical threshold exceeded', 'warning', {
  tags: { metric: 'error_rate' },
});

Graceful Degradation: When SENTRY_DSN is not set, all Sentry functions become no-ops (dev mode works without Sentry).

Slack Alerting

Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts

Features

Operational alerts sent to Slack webhook
10-minute cooldown per alert title (prevents spam)
Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
Graceful degradation when webhook URL not set (dev mode)
Optional Sentry URL attachment

Setup Instructions

Create incoming webhook in Slack workspace:
- Go to Slack App Directory → Incoming Webhooks
- Choose channel (e.g., #ops or #alerts)
- Copy webhook URL

Set environment variable:

# .env.local (server-side secret)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX

Required Environment Variable

Variable	Required	Description
`SLACK_WEBHOOK_URL`	Yes (production)	Slack incoming webhook URL

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.

Alert Types and Severities

Severity	Emoji	Use Case
`info`	ℹ️	Application startup, normal operations
`warning`	⚠️	Degraded performance, non-critical issues
`critical`	🚨	Service outages, data loss, security incidents
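
As an illustration of how severity maps onto the message sent to Slack, the sketch below builds a webhook body. Slack incoming webhooks accept a JSON object with a `text` field; the message layout and the names `Alert` and `buildSlackPayload` are assumptions, not the format used in src/lib/alerts.ts.

```typescript
type Severity = "info" | "warning" | "critical";

// Emoji prefixes from the table above.
const SEVERITY_EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

interface Alert {
  severity: Severity;
  title: string;
  message: string;
  sentryUrl?: string;
}

// Build the JSON body for a Slack incoming webhook (layout is assumed).
function buildSlackPayload(alert: Alert): { text: string } {
  const lines = [`${SEVERITY_EMOJI[alert.severity]} *${alert.title}*`, alert.message];
  if (alert.sentryUrl) {
    lines.push(`Sentry: ${alert.sentryUrl}`);
  }
  return { text: lines.join("\n") };
}
```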

Cooldown Behavior

Each alert title has a 10-minute cooldown
Same title sent within 10 minutes → skipped (prevents spam)
Different titles → sent immediately (independent tracking)
Cooldown resets on app restart (in-memory tracking)

Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "Sentry error spike" can still be sent at 10:05.
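
The cooldown rules above can be sketched with an in-memory map keyed by alert title. `AlertCooldown` and `shouldSend` are hypothetical names, and the clock is passed in as a parameter so the behavior can be checked without waiting; the actual implementation in src/lib/alerts.ts may be structured differently.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes, per the rules above

// Hypothetical in-memory tracker keyed by alert title.
class AlertCooldown {
  private lastSent = new Map<string, number>();

  // Returns true if the alert may be sent now, and records the send time.
  shouldSend(title: string, now: number = Date.now()): boolean {
    const previous = this.lastSent.get(title);
    if (previous !== undefined && now - previous < COOLDOWN_MS) {
      return false; // same title is still cooling down
    }
    this.lastSent.set(title, now);
    return true;
  }
}

// Replaying the example above, with times expressed in minutes after 10:00.
const minute = 60 * 1000;
const cooldown = new AlertCooldown();
const sentAt1000 = cooldown.shouldSend("Database connection failed", 0);
const spikeAt1005 = cooldown.shouldSend("Sentry error spike", 5 * minute);
const repeatAt1009 = cooldown.shouldSend("Database connection failed", 9 * minute);
```

Because the map lives in process memory, a restart clears it, which matches the reset behavior described above.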

Usage in Code

import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Alert with Sentry link
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
  sentryUrl: 'https://sentry.io/organizations/drop/issues/12345/',
});

Current Integrations

App startup: Sends info alert when server starts (instrumentation.ts)
App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)

Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

How it works:

Every server error (HTTP 5xx) is tracked via trackError()
Maintains a rolling 1-minute window of error timestamps
When the count exceeds the threshold (more than 5 errors in 60 seconds), sends a critical alert
Integrates with both Sentry error tracking and middleware error handling

Threshold: more than 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/sentry.ts:captureError() and src/lib/middleware.ts:jsonError()

Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
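
A rolling-window counter of this kind can be sketched as follows. `ErrorSpikeDetector` is a hypothetical stand-in for `trackError()`, whose real signature is not shown in this document; the sketch assumes the alert fires when the window holds more than 5 errors.

```typescript
const WINDOW_MS = 60 * 1000; // rolling 1-minute window
const THRESHOLD = 5;         // alert when the count exceeds this

// Hypothetical stand-in for the tracker in src/lib/alerts.ts.
class ErrorSpikeDetector {
  private timestamps: number[] = [];

  // Record one error; returns true when the window holds more than THRESHOLD errors.
  track(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    // Keep only the entries that are still inside the window.
    this.timestamps = this.timestamps.filter((t) => now - t < WINDOW_MS);
    return this.timestamps.length > THRESHOLD;
  }
}
```

Once the threshold is crossed, every further error in the window also returns true, so the caller would normally rely on the per-title alert cooldown to avoid repeated notifications.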

External Uptime Monitoring (BetterStack or UptimeRobot)

Status: Not yet configured (manual setup required)

Why External Monitoring?

Detects infrastructure failures (container crashes, network issues)
Independent from the application (catches total outages)
Provides uptime SLA tracking
Multiple notification channels (email, Slack, SMS)

Option A: BetterStack (Recommended)

Cost: Free tier (5 monitors, 3-minute interval)

Create account at betterstack.com
Go to Monitors > Create Monitor
Configure:
- Monitor type: HTTP(s)
- URL: https://your-domain.com/api/health
- Check interval: 60 seconds (paid) or 180 seconds (free)
- Request timeout: 5 seconds
- Confirmation period: 1 retry before alerting
Set Expected status code: 200
Add Keyword check: Response body contains "status":"ok"
Configure alert policy (see Escalation Policy below)

Option B: UptimeRobot

Cost: Free tier (50 monitors, 5-minute interval)

Create account at uptimerobot.com
Add HTTP(S) monitor:
- Friendly Name: Drop Production
- URL: https://your-domain.com/api/health
- Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
Configure alert contacts (see Escalation Policy below)
Set Keyword Monitoring: Response contains "status":"ok"

Recommended Checks

Check	Endpoint/Target	Interval	Timeout
Health check	`GET /api/health`	60s	5s
SSL certificate expiry	Domain certificate	Daily	N/A
Response time	`GET /api/health`	60s	500ms threshold
Homepage	`GET /`	300s	10s

Alert Channels

Channel	Use Case	Setup
Slack `#alerts`	All incidents	Add Slack integration in monitoring tool
Email	P1/P2 incidents	Add team email addresses
SMS (paid)	P1 only	Add phone numbers for on-call

Escalation Policy

Minute 0:   Alert fires → Slack #alerts (automatic)
Minute 5:   Still down → Email to on-call engineer
Minute 15:  Still down → SMS to CTO
Minute 30:  Still down → Phone call to CEO

Configure this in BetterStack under Escalation Policies or in UptimeRobot under Alert Contacts.
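
The policy itself lives in the monitoring tool, but expressing it as data can help when reviewing or documenting it. The sketch below transcribes the four steps above; the type and function names are hypothetical and nothing like this exists in the Drop codebase.

```typescript
// Steps transcribed from the policy above; names are hypothetical.
interface EscalationStep {
  afterMinutes: number;
  channel: string;
  target: string;
}

const ESCALATION_POLICY: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack", target: "#alerts" },
  { afterMinutes: 5, channel: "email", target: "on-call engineer" },
  { afterMinutes: 15, channel: "sms", target: "CTO" },
  { afterMinutes: 30, channel: "phone", target: "CEO" },
];

// List every step that should have fired after `minutesDown` minutes of downtime.
function stepsDue(minutesDown: number): EscalationStep[] {
  return ESCALATION_POLICY.filter((step) => minutesDown >= step.afterMinutes);
}
```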

Note: This is external to the Drop application -- no code changes needed, purely configuration.

Monitoring Stack Summary

Implemented (MC #1184)

✅ Health check endpoint — /api/health with real database verification
✅ Container health checks — Docker + Fly.io auto-restart on failure
✅ Error tracking — Sentry with automatic PII scrubbing
✅ Slack alerting — Operational alerts with cooldown protection
✅ Lifecycle monitoring — App startup and graceful shutdown alerts
✅ Error spike detection — Automatic alerting when >5 errors/minute

Recommended (Manual Setup)

📋 External uptime monitoring — UptimeRobot checking /api/health every 5 minutes
📋 Structured logging — JSON log format with request IDs for correlation
📋 Metrics dashboard — Request latency, error rates, database query times
📋 Audit logging — Tracked as security requirement (security/drop-security-rapport.md finding L3)

Future Enhancements (TODO)

Database performance monitoring (slow query alerts)
Rate limit metrics (track 429 errors per endpoint)
Business metrics dashboard (transactions per hour, success rate)
Redis-backed error counter (persistent across restarts)
Per-endpoint error tracking (isolate problematic routes)

Environment Variables Reference

Required for Production

# Sentry error tracking
SENTRY_DSN=https://<key>@<host>/<project_id>
NEXT_PUBLIC_SENTRY_DSN=https://<key>@<host>/<project_id>

# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

Optional (Enhances Features)

# Sentry source maps (better stack traces)
SENTRY_AUTH_TOKEN=sntrys_xxx
SENTRY_ORG=your_org_slug
SENTRY_PROJECT=your_project_slug

# Sentry performance monitoring (default: 0.1 = 10%)
SENTRY_TRACES_SAMPLE_RATE=0.1

Dev Mode (All Optional)

All monitoring features gracefully degrade when env vars are not set:

No SENTRY_DSN: Errors logged to console only
No SLACK_WEBHOOK_URL: Alerts logged to console only

This allows development to work without external services configured.