Monitoring & Alerting

Drop Monitoring

Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts

Health Check Endpoint

Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35

What It Checks

Database connectivity -- Executes SELECT 1 as ok against the database
Database latency -- Measures query execution time in milliseconds
Database driver -- Reports whether using pg (PostgreSQL) or sqlite
Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
Application uptime -- Tracks seconds since server start
Application version -- Reads from npm_package_version env var, defaults to 0.1.0
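
The connectivity and latency checks above can be sketched as a single probe. This is a minimal illustration, not the code in route.ts: `probeDatabase`, `DbProbe`, and `QueryFn` are hypothetical names, and the query function and clock are injected so the sketch runs without a real database.

```typescript
// Hypothetical shapes for illustration; the real route.ts may differ.
type DbProbe =
  | { reachable: true; expected: boolean; latencyMs: number }
  | { reachable: false };

type QueryFn = (sql: string) => Promise<Array<{ ok?: unknown }>>;

async function probeDatabase(
  runQuery: QueryFn,
  now: () => number = Date.now,
): Promise<DbProbe> {
  const startedAt = now();
  try {
    const rows = await runQuery("SELECT 1 as ok");
    const latencyMs = now() - startedAt;
    // Drivers may return the value as a number or as a numeric string.
    const expected = rows.length === 1 && Number(rows[0].ok) === 1;
    return { reachable: true, expected, latencyMs };
  } catch {
    // Any thrown error is treated as "database unreachable".
    return { reachable: false };
  }
}
```

Injecting `runQuery` keeps the probe independent of the driver, so the same logic would apply to both pg and sqlite.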

Status Values

ok -- All checks pass (HTTP 200)
degraded -- DB query returned unexpected result (HTTP 200)
down -- DB unreachable (HTTP 503)
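
The mapping from database outcome to status value and HTTP code can be written as one small function. The names `DbOutcome` and `resolveHealth` are assumptions made for this sketch; only the three status values and their HTTP codes come from the list above.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

// Hypothetical summary of what the database check observed.
type DbOutcome = "expected" | "unexpected" | "unreachable";

function resolveHealth(
  outcome: DbOutcome,
): { status: HealthStatus; httpStatus: 200 | 503 } {
  switch (outcome) {
    case "expected":
      return { status: "ok", httpStatus: 200 };
    case "unexpected":
      // Degraded still answers 200, so only the body reveals the problem.
      return { status: "degraded", httpStatus: 200 };
    case "unreachable":
      return { status: "down", httpStatus: 503 };
  }
}
```

Because degraded returns HTTP 200, a monitor that inspects only the status code cannot tell it apart from ok.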

Response Format

Healthy (200 OK):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Down (503 Service Unavailable):

{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
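
For clients that consume this endpoint, the two payloads above suggest roughly the following type. The interface is inferred from the examples only, so the optional fields and the `parseHealthResponse` helper are assumptions rather than the route's actual contract.

```typescript
// Inferred from the sample payloads; field optionality is an assumption.
interface HealthResponse {
  data: {
    status: "ok" | "degraded" | "down";
    version: string;
    uptime: number;
    checks: {
      db: { status: string; latencyMs?: number; driver?: string };
      services: { mode: string };
    };
    timestamp: string;
  };
}

// Hypothetical helper: returns null for anything that does not look like a health payload.
function parseHealthResponse(raw: string): HealthResponse | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  const data = (value as { data?: Record<string, unknown> } | null)?.data;
  if (typeof data !== "object" || data === null) return null;
  const statusOk =
    data.status === "ok" || data.status === "degraded" || data.status === "down";
  const checks = data.checks as { db?: { status?: unknown } } | undefined;
  if (!statusOk || typeof data.uptime !== "number") return null;
  if (typeof checks?.db?.status !== "string") return null;
  return value as HealthResponse;
}
```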

Container Health Checks

Docker Compose (MVP)

Source: docker-compose.yml:12-17

healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s

Docker Compose (Production)

Source: docker-compose.production.yml:9-14

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5

The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).

Fly.io

Source: fly.toml:19-23

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"

Fly.io uses this health check to determine machine readiness and to route traffic.

Current Monitoring State

What Exists

Health check endpoint with real database verification (not hardcoded)
Container-level health checks (Docker + Fly.io)
Automatic restart on failure (restart: unless-stopped in docker-compose)
Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet

External uptime monitoring service (see the BetterStack section below for the recommended configuration)
Application Performance Monitoring (APM)
Structured logging (JSON format)
Log aggregation and forwarding
Database performance monitoring
Rate limit monitoring/metrics
Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking

Status: REMOVED (MC #1271, Sentry uninstalled)

Slack Alerting

Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts

Features

Operational alerts sent to Slack webhook
10-minute cooldown per alert title (prevents spam)
Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
Graceful degradation when webhook URL not set (dev mode)

Setup Instructions

Create incoming webhook in Slack workspace:
- Go to Slack App Directory → Incoming Webhooks
- Choose channel (e.g., #ops or #alerts)
- Copy webhook URL

Set environment variable:

# .env.local (server-side secret)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX

Required Environment Variable

Variable	Required	Description
`SLACK_WEBHOOK_URL`	Yes (production)	Slack incoming webhook URL

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.
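
The fallback described in the note can be sketched as a pure decision on the environment. `AlertTransport` and `chooseTransport` are hypothetical names; the actual logic in src/lib/alerts.ts may be structured differently.

```typescript
type AlertTransport =
  | { kind: "slack"; webhookUrl: string }
  | { kind: "console" };

// Hypothetical helper: picks Slack when the webhook URL is set, console otherwise.
function chooseTransport(
  env: Record<string, string | undefined>,
): AlertTransport {
  const webhookUrl = env.SLACK_WEBHOOK_URL?.trim();
  // An unset or blank variable falls back to console logging.
  return webhookUrl ? { kind: "slack", webhookUrl } : { kind: "console" };
}
```

In a Node.js process this would typically be called as `chooseTransport(process.env)`. Treating a blank value the same as an unset one is a design choice of this sketch, not something the source confirms.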

Alert Types and Severities

Severity	Emoji	Use Case
`info`	ℹ️	Application startup, normal operations
`warning`	⚠️	Degraded performance, non-critical issues
`critical`	🚨	Service outages, data loss, security incidents
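
As an illustration of the severity prefixes, the sketch below builds a Slack message from an alert. Slack incoming webhooks accept a JSON body with a `text` field; the exact message layout used by src/lib/alerts.ts is not documented here, so `toSlackPayload` and its formatting are assumptions.

```typescript
type Severity = "info" | "warning" | "critical";

const SEVERITY_EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

interface Alert {
  severity: Severity;
  title: string;
  message: string;
}

// Hypothetical formatter: emoji prefix, bold title, message on the next line.
function toSlackPayload(alert: Alert): { text: string } {
  const emoji = SEVERITY_EMOJI[alert.severity];
  return { text: `${emoji} *${alert.title}*\n${alert.message}` };
}
```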

Cooldown Behavior

Each alert title has a 10-minute cooldown
Same title sent within 10 minutes → skipped (prevents spam)
Different titles → sent immediately (independent tracking)
Cooldown resets on app restart (in-memory tracking)

Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
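
The cooldown rules above amount to a per-title timestamp map. The following is a minimal sketch under that assumption; `AlertCooldown` is a hypothetical class, and the clock is injected so the behavior can be checked without waiting ten minutes.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000;

class AlertCooldown {
  // In-memory only, so the state is lost on restart, as described above.
  private readonly lastSentAt = new Map<string, number>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  /** Returns true and records the send if the title is not cooling down. */
  shouldSend(title: string): boolean {
    const current = this.now();
    const previous = this.lastSentAt.get(title);
    if (previous !== undefined && current - previous < COOLDOWN_MS) {
      return false; // same title within 10 minutes: skip
    }
    this.lastSentAt.set(title, current);
    return true;
  }
}
```

Each title is tracked independently, so a second alert with a different title is never delayed by the first.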

Usage in Code

import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Warning-level alert
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
});

Current Integrations

App startup: Sends info alert when server starts (instrumentation.ts)
App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)

Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

How it works:

Every server error (HTTP 5xx) is tracked via trackError()
A rolling 1-minute window of error timestamps is maintained
When the count exceeds the threshold (more than 5 errors in 60 seconds), a critical alert is sent
Tracking is wired into the middleware error handling

Threshold: more than 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/middleware.ts:jsonError()

Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
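
A rolling window of this kind can be sketched with an array of timestamps that is pruned on every call. This is an illustration of the technique, not the contents of src/lib/alerts.ts: `ErrorSpikeDetector` is a hypothetical class, and the real `trackError()` may have a different signature.

```typescript
const WINDOW_MS = 60_000;
const THRESHOLD = 5;

class ErrorSpikeDetector {
  private timestamps: number[] = [];
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  /** Records one error; returns true when the window holds more than THRESHOLD errors. */
  trackError(): boolean {
    const current = this.now();
    // Drop timestamps that have fallen out of the 60-second window.
    this.timestamps = this.timestamps.filter((t) => current - t < WINDOW_MS);
    this.timestamps.push(current);
    return this.timestamps.length > THRESHOLD;
  }
}
```

With a threshold of 5, the sixth error inside one minute is the first call that returns true.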

BetterStack Uptime Monitoring

Status: Ready to configure (setup guide available)
Documentation: BETTERSTACK-SETUP.md

Overview

BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.

Free tier includes:

10 monitors (enough for Drop production)
3-minute check interval
Unlimited integrations (Slack, email)
Public status page
SSL expiry monitoring

Recommended Monitors

Monitor	URL	Purpose	Expected Response
Health Endpoint	`https://drop.alai.no/api/health`	API + DB connectivity	`200`, body contains `"status":"ok"`
Landing Page	`https://drop.alai.no`	Public website	`200`, body contains `Send penger`
Multi-Region Check	`https://drop.alai.no/api/health`	Geographic availability	`200`, body contains `"status":"ok"`
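
The expected-response column combines a status code with a literal substring match. The sketch below mimics that check locally; `passesKeywordCheck` is a hypothetical helper, not part of BetterStack's API. It shows why the keyword matters: a degraded response still returns HTTP 200 but fails the keyword match.

```typescript
// Hypothetical helper mirroring a status-plus-keyword monitor check.
function passesKeywordCheck(
  httpStatus: number,
  body: string,
  keyword: string,
): boolean {
  return httpStatus === 200 && body.includes(keyword);
}

const KEYWORD = '"status":"ok"';

// JSON.stringify without an indent argument emits no space after the colon,
// so a compact body matches the keyword literally.
const healthyBody = JSON.stringify({ data: { status: "ok" } });
const degradedBody = JSON.stringify({ data: { status: "degraded" } });
const prettyBody = JSON.stringify({ data: { status: "ok" } }, null, 2);

const healthyPasses = passesKeywordCheck(200, healthyBody, KEYWORD); // true
const degradedPasses = passesKeywordCheck(200, degradedBody, KEYWORD); // false, despite HTTP 200
const prettyPasses = passesKeywordCheck(200, prettyBody, KEYWORD); // false: '"status": "ok"' has a space
```

The last case is worth noting: the keyword only matches if the endpoint serializes JSON compactly, so the monitor keyword should be verified against the live response body.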

Alert Escalation

BetterStack sends alerts through multiple channels:

Minute 0:   Alert fires → Slack #drop-ops (immediate)
Minute 5:   Still down → Email to [email protected]
Minute 15:  Still down → SMS (requires paid plan)
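
The timeline above can be modeled as a list of steps ordered by delay. This sketch only illustrates the escalation logic; `EscalationStep` and `channelsNotified` are hypothetical names and do not correspond to any BetterStack configuration format.

```typescript
interface EscalationStep {
  afterMinutes: number;
  channel: string;
}

const ESCALATION: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" }, // requires a paid plan
];

// Returns every channel that has been notified after `minutesDown` minutes of downtime.
function channelsNotified(
  minutesDown: number,
  steps: EscalationStep[] = ESCALATION,
): string[] {
  return steps
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```

For example, `channelsNotified(7)` yields `["slack", "email"]`, since the SMS step has not been reached yet.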

Status Page

Public status page shows real-time service status:

URL: https://drop-status.betteruptime.com
Components: API Health, Landing Page, Global Network
Auto-updates: Incidents automatically posted and resolved
Subscriptions: Users can subscribe to email updates

Setup Instructions

Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md

Setup includes:

Account creation (free tier)
Configure 3 monitors (health, landing, multi-region)
Slack integration (#drop-ops channel)
On-call schedule and escalation policy
Public status page creation
Testing and verification

Key Features

Proactive monitoring:

3-minute check interval (free tier) or 30s (paid)
Keyword verification (not just HTTP 200)
SSL certificate expiry warnings (14 days)
Multi-region checks (detect geographic issues)

Incident management:

Automatic incident creation on downtime
Status page updates (public transparency)
Escalation to multiple channels (Slack → Email → SMS)
Maintenance window support (suppress alerts during deployments)

Reporting:

Uptime SLA tracking (99.9% target)
Incident history and analysis
Response time graphs
Downtime duration reports

Integration with Drop Alerting

BetterStack complements Drop's internal alerting (src/lib/alerts.ts):

Feature	Drop Internal Alerts	BetterStack External
Detects	Application errors, error spikes	Infrastructure outages
When	App is running	App is unreachable
Source	Application logs	External HTTP checks
Delivery	Slack webhook (direct)	Escalation policy
Use case	Code bugs, DB issues	Container crashes, network failures

Example: Database connection fails:

Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate)
BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min

Maintenance Windows

When performing planned maintenance (deployments, upgrades):

Create maintenance window in BetterStack
Select affected monitors
Set duration (e.g., 1 hour)
Effect: Alerts suppressed, status page shows "Scheduled Maintenance"

Prevents: False downtime alerts during intentional service interruptions.

Best Practices

Do's:

✅ Test alerts monthly (pause monitor to verify escalation)
✅ Use keyword checks (not just HTTP status codes)
✅ Monitor SSL expiry (14-day warnings)
✅ Create maintenance windows for deployments
✅ Review incident history monthly

Don'ts:

❌ Don't ignore degraded status (investigate even if not fully down)
❌ Don't disable monitors (use pause for temporary suppression)
❌ Don't skip keyword checks (HTTP 200 ≠ working API)
❌ Don't rely solely on external monitoring (combine with internal checks)

External Uptime Monitoring (Alternative: UptimeRobot)

Status: Alternative to BetterStack (not recommended)

BetterStack is recommended over UptimeRobot for Drop because:

Better Slack integration (richer notifications)
Built-in status page (UptimeRobot charges extra)
Better UI/UX for incident management
More flexible escalation policies

UptimeRobot Setup (if BetterStack unavailable)

Cost: Free tier (50 monitors, 5-minute interval)

Create account at uptimerobot.com
Add HTTP(S) monitor:
- Friendly Name: Drop Production
- URL: https://drop.alai.no/api/health
- Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
Configure alert contacts:
- Slack webhook (via Alert Contacts)
- Email ([email protected])
Set Keyword Monitoring: Response contains "status":"ok"

Limitations:

No built-in escalation policies (requires third-party integrations)
Status page requires paid plan
Less detailed incident reports
5-minute check interval (vs 3-minute for BetterStack free)

Monitoring Stack Summary

Implemented (MC #1184)

✅ Health check endpoint — /api/health with real database verification
✅ Container health checks — Docker + Fly.io auto-restart on failure
❌ Error tracking — Sentry REMOVED (MC #1271)
✅ Slack alerting — Operational alerts with cooldown protection
✅ Lifecycle monitoring — App startup and graceful shutdown alerts
✅ Error spike detection — Automatic alerting when >5 errors/minute

Recommended (Manual Setup)

📋 External uptime monitoring -- BetterStack checking /api/health every 3 minutes (free tier); UptimeRobot at 5 minutes is the fallback
📋 Structured logging -- JSON log format with request IDs for correlation
📋 Metrics dashboard -- Request latency, error rates, database query times
📋 Audit logging -- Tracked as security requirement (security/drop-security-rapport.md finding L3)

Future Enhancements (TODO)

Database performance monitoring (slow query alerts)
Rate limit metrics (track 429 errors per endpoint)
Business metrics dashboard (transactions per hour, success rate)
Redis-backed error counter (persistent across restarts)
Per-endpoint error tracking (isolate problematic routes)

Environment Variables Reference

Required for Production

# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

Dev Mode (All Optional)

All monitoring features gracefully degrade when env vars are not set:

No SLACK_WEBHOOK_URL: Alerts logged to console only

This allows development to work without external services configured.