Monitoring & Alerting

Drop Monitoring

Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts

Health Check Endpoint

Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35

What It Checks

Database connectivity -- Executes SELECT 1 as ok against the database
Database latency -- Measures query execution time in milliseconds
Database driver -- Reports whether using pg (PostgreSQL) or sqlite
Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
Application uptime -- Tracks seconds since server start
Application version -- Reads from npm_package_version env var, defaults to 0.1.0

Status Values

ok -- All checks pass (HTTP 200)
degraded -- DB query returned unexpected result (HTTP 200)
down -- DB unreachable (HTTP 503)
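
The mapping from check outcome to status and HTTP code can be sketched as follows. This is a minimal illustration, not the code in route.ts: `checkHealth`, `RunQuery`, and the rule that treats any thrown query error as "down" are assumptions made for the example.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

interface HealthOutcome {
  status: HealthStatus;
  httpStatus: 200 | 503;
  latencyMs?: number; // only known when the query completed
}

// Hypothetical stand-in for the app's database client.
type RunQuery = (sql: string) => Promise<Array<{ ok?: unknown }>>;

async function checkHealth(
  runQuery: RunQuery,
  now: () => number = Date.now,
): Promise<HealthOutcome> {
  const startedAt = now();
  try {
    const rows = await runQuery("SELECT 1 as ok");
    const latencyMs = now() - startedAt;
    // Number() tolerates drivers that return the value as a string or bigint.
    const expected = rows.length === 1 && Number(rows[0]?.ok) === 1;
    return expected
      ? { status: "ok", httpStatus: 200, latencyMs }
      : { status: "degraded", httpStatus: 200, latencyMs };
  } catch {
    // Assumption: any query error means the database is unreachable.
    return { status: "down", httpStatus: 503 };
  }
}
```

Passing the query function in as a parameter keeps the sketch independent of whether the driver is pg or sqlite.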

Response Format

Healthy (200 OK):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}

Down (503 Service Unavailable):

{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
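
For consumers of this endpoint, the two payloads above can be described with one type. The sketch below is inferred from the examples only; `HealthResponse` and `summarizeHealth` are hypothetical names, and the optional fields reflect that `latencyMs` and `driver` are absent in the failing example.

```typescript
interface HealthResponse {
  data: {
    status: "ok" | "degraded" | "down";
    version: string;
    uptime: number; // seconds since server start
    checks: {
      db: { status: "pass" | "fail"; latencyMs?: number; driver?: "pg" | "sqlite" };
      services: { mode: "mock" | "live" };
    };
    timestamp: string; // ISO 8601
  };
}

// Turns a raw response body into a one-line summary for logs or dashboards.
function summarizeHealth(body: string): string {
  const parsed = JSON.parse(body) as HealthResponse;
  const { status, checks } = parsed.data;
  if (status === "down") return "down: database unreachable";
  const latency = checks.db.latencyMs;
  return latency === undefined
    ? status
    : `${status}: db ${latency}ms via ${checks.db.driver ?? "unknown"}`;
}
```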

Container Health Checks

Docker Compose (MVP)

Source: docker-compose.yml:12-17

healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
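
To make the effect of `retries: 3` concrete, the sketch below replays a sequence of probe results the way Docker counts them: a container becomes unhealthy only after that many consecutive failures, and a passing probe resets the count. `replayProbes` is a hypothetical helper for illustration, and it deliberately ignores `start_period` handling.

```typescript
type ProbeState = "starting" | "healthy" | "unhealthy";

// Replays probe outcomes (true = probe passed) and returns the final state.
function replayProbes(results: boolean[], retries: number): ProbeState {
  let state: ProbeState = "starting";
  let consecutiveFailures = 0;
  for (const passed of results) {
    if (passed) {
      consecutiveFailures = 0;
      state = "healthy";
    } else {
      consecutiveFailures += 1;
      if (consecutiveFailures >= retries) state = "unhealthy";
    }
  }
  return state;
}
```

With a 30s interval, three consecutive failures mean an outage is flagged after the third failed probe, not the first.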

Docker Compose (Production)

Source: docker-compose.production.yml:9-14

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5

The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).

Fly.io

Source: fly.toml:19-23

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"

Fly.io uses this health check to determine machine readiness and to route traffic.

Current Monitoring State

What Exists

Health check endpoint with real database verification (not hardcoded)
Container-level health checks (Docker + Fly.io)
Automatic restart on failure (restart: unless-stopped in docker-compose)
Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet

External uptime monitoring service (see the BetterStack setup below for the recommended configuration)
Application Performance Monitoring (APM)
Structured logging (JSON format)
Log aggregation and forwarding
Database performance monitoring
Rate limit monitoring/metrics
Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking

Status: REMOVED (MC #1271: Sentry uninstalled)

Slack Alerting

Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts

Features

Operational alerts sent to Slack webhook
10-minute cooldown per alert title (prevents spam)
Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
Graceful degradation when webhook URL not set (dev mode)

Setup Instructions

Create incoming webhook in Slack workspace:
- Go to Slack App Directory → Incoming Webhooks
- Choose channel (e.g., #ops or #alerts)
- Copy webhook URL

Set environment variable:

# .env.local (server-side secret)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX

Required Environment Variable

Variable	Required	Description
`SLACK_WEBHOOK_URL`	Yes (production)	Slack incoming webhook URL

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.
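
The fallback described in the note can be sketched as below. This is an illustration under assumptions, not the code in src/lib/alerts.ts: `deliverAlert`, `PostJson`, and the log line format are hypothetical. The `{ "text": ... }` body is the standard minimal payload for Slack incoming webhooks.

```typescript
interface Alert {
  severity: "info" | "warning" | "critical";
  title: string;
  message: string;
}

// Hypothetical transport; in a real app this would wrap an HTTP POST.
type PostJson = (url: string, jsonBody: string) => Promise<void>;

async function deliverAlert(
  alert: Alert,
  webhookUrl: string | undefined,
  postJson: PostJson,
  log: (line: string) => void = console.log,
): Promise<"sent" | "logged"> {
  const text = `[${alert.severity}] ${alert.title}: ${alert.message}`;
  if (!webhookUrl) {
    // No webhook configured (typical in dev): log instead of failing.
    log(text);
    return "logged";
  }
  await postJson(webhookUrl, JSON.stringify({ text }));
  return "sent";
}
```

Injecting the transport keeps the function testable without network access.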

Alert Types and Severities

Severity	Emoji	Use Case
`info`	ℹ️	Application startup, normal operations
`warning`	⚠️	Degraded performance, non-critical issues
`critical`	🚨	Service outages, data loss, security incidents
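
One way to apply the severity prefixes from the table when building the Slack message is sketched below. `SEVERITY_EMOJI` and `formatAlertText` are hypothetical names, and the bold title is an assumption about layout; Slack renders `*text*` as bold.

```typescript
type Severity = "info" | "warning" | "critical";

const SEVERITY_EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

// Builds the message text: emoji prefix, bold title, then the message body.
function formatAlertText(severity: Severity, title: string, message: string): string {
  return `${SEVERITY_EMOJI[severity]} *${title}*\n${message}`;
}
```

Typing the map as `Record<Severity, string>` makes the compiler reject a new severity that has no emoji assigned.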

Cooldown Behavior

Each alert title has a 10-minute cooldown
Same title sent within 10 minutes → skipped (prevents spam)
Different titles → sent immediately (independent tracking)
Cooldown resets on app restart (in-memory tracking)

Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
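
The per-title cooldown can be sketched with an in-memory map. This is a minimal illustration, not the implementation in src/lib/alerts.ts: `createCooldown` and `shouldSend` are hypothetical names, and the sketch assumes that skipped attempts do not extend the cooldown.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes

function createCooldown(cooldownMs: number = COOLDOWN_MS) {
  // Title -> time of the last alert actually sent. Lost on restart.
  const lastSentAt = new Map<string, number>();

  return function shouldSend(title: string, nowMs: number): boolean {
    const last = lastSentAt.get(title);
    if (last !== undefined && nowMs - last < cooldownMs) {
      return false; // still cooling down for this title
    }
    lastSentAt.set(title, nowMs);
    return true;
  };
}
```

Because the map lives in process memory, each application instance tracks cooldowns independently.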

Usage in Code

import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Warning-level alert
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
});

Current Integrations

App startup: Sends info alert when server starts (instrumentation.ts)
App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)

Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

How it works:

Every server error (HTTP 5xx) is tracked via trackError()
Maintains rolling 1-minute window of error timestamps
When count exceeds threshold (5 errors in 60 seconds), sends critical alert
Integrates with middleware error handling

Threshold: more than 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/middleware.ts:jsonError()

Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
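
The rolling-window idea can be sketched as follows. This is a simplified illustration rather than the actual `trackError()` code: `createSpikeDetector` and `recordError` are hypothetical names, and the caller is assumed to send the critical alert when `true` is returned.

```typescript
function createSpikeDetector(threshold: number = 5, windowMs: number = 60 * 1000) {
  // Timestamps of recent errors, kept in memory only.
  let timestamps: number[] = [];

  // Records one error and reports whether the window now exceeds the threshold.
  return function recordError(nowMs: number): boolean {
    // Drop errors that have aged out of the window, then add the new one.
    timestamps = timestamps.filter((t) => nowMs - t < windowMs);
    timestamps.push(nowMs);
    return timestamps.length > threshold;
  };
}
```

Combined with a per-title cooldown, a sustained spike would produce one alert per cooldown period rather than one per error.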

BetterStack Uptime Monitoring

Status: Ready to configure (setup guide available)
Documentation: BETTERSTACK-SETUP.md

Overview

BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.

Free tier includes:

10 monitors (enough for Drop production)
3-minute check interval
Unlimited integrations (Slack, email)
Public status page
SSL expiry monitoring

Recommended Monitors

Monitor	URL	Purpose	Expected Response
Health Endpoint	`https://drop.alai.no/api/health`	API + DB connectivity	`200`, body contains `"status":"ok"`
Landing Page	`https://drop.alai.no`	Public website	`200`, body contains `Send penger`
Multi-Region Check	`https://drop.alai.no/api/health`	Geographic availability	`200`, body contains `"status":"ok"`
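
The "Expected Response" column combines a status-code check with a keyword check. The sketch below shows why both matter for the health endpoint: a degraded response still returns HTTP 200. `passesKeywordCheck` is a hypothetical helper, and the example assumes the endpoint serializes JSON compactly; a pretty-printed body with a space after the colon would not match the keyword.

```typescript
// True only when the status code is 200 and the body contains the keyword.
function passesKeywordCheck(httpStatus: number, body: string, keyword: string): boolean {
  return httpStatus === 200 && body.includes(keyword);
}

const KEYWORD = '"status":"ok"';

// Compact JSON bodies, as JSON.stringify produces them.
const okBody = JSON.stringify({ data: { status: "ok" } });
const degradedBody = JSON.stringify({ data: { status: "degraded" } });

const healthyPasses = passesKeywordCheck(200, okBody, KEYWORD);
const degradedPasses = passesKeywordCheck(200, degradedBody, KEYWORD);
```

Here `healthyPasses` is true and `degradedPasses` is false, even though both responses carry HTTP 200.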

Alert Escalation

BetterStack sends alerts through multiple channels:

Minute 0: Alert fires → Slack #drop-ops (immediate)
Minute 5: Still down → Email to [email protected]
Minute 15: Still down → SMS (requires paid plan)
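
The escalation timeline can be expressed as data, which makes it easy to see which channels have been notified at a given point in an outage. This is only a model of the policy for illustration; the real policy is configured in the BetterStack UI, and `ESCALATION_STEPS` and `channelsNotified` are hypothetical names.

```typescript
interface EscalationStep {
  afterMinutes: number;
  channel: "slack" | "email" | "sms";
}

const ESCALATION_STEPS: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" }, // requires a paid plan
];

// Lists the channels that have fired once an incident has lasted `minutesDown`.
function channelsNotified(minutesDown: number): string[] {
  return ESCALATION_STEPS
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```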

Status Page

Public status page shows real-time service status:

URL: https://drop-status.betteruptime.com

Components: API Health, Landing Page, Global Network

Auto-updates: Incidents automatically posted and resolved

Subscriptions: Users can subscribe to email updates

Setup Instructions

Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md

Setup includes:

Account creation (free tier)

Configure 3 monitors (health, landing, multi-region)

Slack integration (#drop-ops channel)

On-call schedule and escalation policy

Public status page creation

Testing and verification

Key Features

Proactive monitoring:

3-minute check interval (free tier) or 30s (paid)

Keyword verification (not just HTTP 200)

SSL certificate expiry warnings (14 days)

Multi-region checks (detect geographic issues)

Incident management:

Automatic incident creation on downtime

Status page updates (public transparency)

Escalation to multiple channels (Slack → Email → SMS)

Maintenance window support (suppress alerts during deployments)

Reporting:

Uptime SLA tracking (99.9% target)

Incident history and analysis

Response time graphs

Downtime duration reports

Integration with Drop Alerting

BetterStack complements Drop's internal alerting (src/lib/alerts.ts):

Feature	Drop Internal Alerts	BetterStack External
Detects	Application errors, error spikes	Infrastructure outages
When	App is running	App is unreachable
Source	Application logs	External HTTP checks
Delivery	Slack webhook (direct)	Escalation policy
Use case	Code bugs, DB issues	Container crashes, network failures

Example: Database connection fails:

Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate)

BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min

Maintenance Windows

When performing planned maintenance (deployments, upgrades):

Create maintenance window in BetterStack

Select affected monitors

Set duration (e.g., 1 hour)

Effect: Alerts suppressed, status page shows "Scheduled Maintenance"

Prevents: False downtime alerts during intentional service interruptions.

Best Practices

Do's:

✅ Test alerts monthly (pause monitor to verify escalation)

✅ Use keyword checks (not just HTTP status codes)

✅ Monitor SSL expiry (14-day warnings)

✅ Create maintenance windows for deployments

✅ Review incident history monthly

Don'ts:

❌ Don't ignore degraded status (investigate even if not fully down)

❌ Don't disable monitors (use pause for temporary suppression)

❌ Don't skip keyword checks (HTTP 200 ≠ working API)

❌ Don't rely solely on external monitoring (combine with internal checks)

External Uptime Monitoring (Alternative: UptimeRobot)

Status: Alternative to BetterStack (not recommended)

BetterStack is recommended over UptimeRobot for Drop because:

Better Slack integration (richer notifications)

Built-in status page (UptimeRobot charges extra)

Better UI/UX for incident management

More flexible escalation policies

UptimeRobot Setup (if BetterStack unavailable)

Cost: Free tier (50 monitors, 5-minute interval)

Create account at uptimerobot.com

Add HTTP(S) monitor:

Friendly Name: Drop Production

URL: https://drop.alai.no/api/health

Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)

Configure alert contacts:

Slack webhook (via Alert Contacts)

Email ([email protected])

Set Keyword Monitoring: Response contains "status":"ok"

Limitations:

No built-in escalation policies (requires third-party integrations)

Status page requires paid plan

Less detailed incident reports

5-minute check interval (vs 3-minute for BetterStack free)

Monitoring Stack Summary

Implemented (MC #1184)

✅ Health check endpoint — /api/health with real database verification
✅ Container health checks — Docker + Fly.io auto-restart on failure
❌ Error tracking — Sentry REMOVED (MC #1271)
✅ Slack alerting — Operational alerts with cooldown protection
✅ Lifecycle monitoring — App startup and graceful shutdown alerts
✅ Error spike detection — Automatic alerting when >5 errors/minute

Recommended (Manual Setup)

📋 External uptime monitoring — BetterStack checking /api/health every 3 minutes (free tier)
📋 Structured logging — JSON log format with request IDs for correlation
📋 Metrics dashboard — Request latency, error rates, database query times
📋 Audit logging — Tracked as security requirement (security/drop-security-rapport.md finding L3)

Future Enhancements (TODO)

Database performance monitoring (slow query alerts)
Rate limit metrics (track 429 errors per endpoint)
Business metrics dashboard (transactions per hour, success rate)
Redis-backed error counter (persistent across restarts)
Per-endpoint error tracking (isolate problematic routes)

Environment Variables Reference

Required for Production

# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

Dev Mode (All Optional)

All monitoring features gracefully degrade when env vars are not set:

No SLACK_WEBHOOK_URL: Alerts logged to console only

This allows development to work without external services configured.
