Monitoring & Alerting
Drop Monitoring
Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts, sentry.server.config.ts
Health Check Endpoint
Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35
What It Checks
- Database connectivity -- Executes SELECT 1 as ok against the database
- Database latency -- Measures query execution time in milliseconds
- Database driver -- Reports whether using pg (PostgreSQL) or sqlite
- Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
- Application uptime -- Tracks seconds since server start
- Application version -- Reads from npm_package_version env var, defaults to 0.1.0
Status Values
- ok -- All checks pass (HTTP 200)
- degraded -- DB query returned unexpected result (HTTP 200)
- down -- DB unreachable (HTTP 503)
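The mapping from the database check to a status value and HTTP code can be sketched as below. This is a minimal illustration and not the code in route.ts: the names DbCheckResult, HealthOutcome and deriveHealth are hypothetical, and it assumes the first row of the query result is passed in already parsed.

```typescript
// Hypothetical shapes for illustration; the real route.ts may differ.
type HealthStatus = 'ok' | 'degraded' | 'down';

type DbCheckResult =
  | { reachable: true; row: { ok?: unknown } | undefined }
  | { reachable: false };

interface HealthOutcome {
  status: HealthStatus;
  httpStatus: 200 | 503;
}

// Maps the outcome of `SELECT 1 as ok` to the documented status values.
function deriveHealth(db: DbCheckResult): HealthOutcome {
  if (!db.reachable) {
    // DB unreachable -> "down" with HTTP 503
    return { status: 'down', httpStatus: 503 };
  }
  // Number() tolerates a driver that returns the column as the string "1".
  const ok = db.row !== undefined && Number(db.row.ok) === 1;
  // An unexpected result is "degraded" but still answers HTTP 200.
  return { status: ok ? 'ok' : 'degraded', httpStatus: 200 };
}
```

Because degraded still returns HTTP 200, a monitor that only inspects the status code cannot tell it apart from ok; it has to read the response body.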
Response Format
Healthy (200 OK):
{
"data": {
"status": "ok",
"version": "0.1.0",
"uptime": 3600,
"checks": {
"db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
"services": { "mode": "live" }
},
"timestamp": "2026-02-17T12:00:00.000Z"
}
}
Down (503 Service Unavailable):
{
"data": {
"status": "down",
"version": "0.1.0",
"uptime": 3600,
"checks": {
"db": { "status": "fail" },
"services": { "mode": "live" }
},
"timestamp": "2026-02-17T12:00:00.000Z"
}
}
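For consumers of this endpoint, the two payloads above suggest a type along the following lines. This is a sketch inferred from the examples only: which fields are optional is an assumption, and HealthResponse and isHealthy are hypothetical names, not exports of the application.

```typescript
// Inferred from the example payloads; optionality of latencyMs/driver is assumed.
interface HealthResponse {
  data: {
    status: 'ok' | 'degraded' | 'down';
    version: string;
    uptime: number; // seconds since server start
    checks: {
      db: { status: 'pass' | 'fail'; latencyMs?: number; driver?: 'pg' | 'sqlite' };
      services: { mode: 'mock' | 'live' };
    };
    timestamp: string; // ISO 8601
  };
}

// Returns true only when the body parses as JSON and reports status "ok".
function isHealthy(body: string): boolean {
  try {
    const parsed = JSON.parse(body) as Partial<HealthResponse> | null;
    return parsed?.data?.status === 'ok';
  } catch {
    return false; // not JSON at all, e.g. a proxy error page
  }
}
```

A check like this treats both degraded and down as unhealthy, which is stricter than relying on the HTTP status code alone.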
Container Health Checks
Docker Compose (MVP)
Source: docker-compose.yml:12-17
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
Docker Compose (Production)
Source: docker-compose.production.yml:9-14
Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:
healthcheck:
test: ["CMD-SHELL", "pg_isready -U drop"]
interval: 10s
timeout: 5s
retries: 5
The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).
Fly.io
Source: fly.toml:19-23
[[http_service.checks]]
grace_period = "10s"
interval = "30s"
method = "GET"
path = "/api/health"
timeout = "5s"
Fly.io uses this health check to determine machine readiness and to route traffic.
Current Monitoring State
What Exists
- Health check endpoint with real database verification (not hardcoded)
- Container-level health checks (Docker + Fly.io)
- Automatic restart on failure (restart: unless-stopped in docker-compose)
- Auto-scaling on Fly.io (scale to zero, auto-start on request)
What Does Not Exist Yet
- External uptime monitoring service (see External Uptime Monitoring below for the recommended BetterStack or UptimeRobot configuration)
- Application Performance Monitoring (APM)
- Structured logging (JSON format)
- Log aggregation and forwarding
- Database performance monitoring
- Rate limit monitoring/metrics
- Business metrics dashboard (transactions per hour, success rate)
Sentry Error Tracking
Status: REMOVED (MC #1271); previously configured and ready (MC #1183)
Source: sentry.server.config.ts, sentry.client.config.ts, src/lib/sentry.ts
What It Captures
- Unhandled exceptions in API routes
- Server component errors
- Client-side errors (browser)
- Middleware errors
- Performance traces (10% sample rate by default)
Required Environment Variables
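| Variable | Description |
|---|---|
| SENTRY_DSN | Server-side Sentry DSN |
| NEXT_PUBLIC_SENTRY_DSN | Public (client-side) Sentry DSN |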
Optional Environment Variables
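| Variable | Description |
|---|---|
| SENTRY_AUTH_TOKEN | Auth token for source map uploads |
| SENTRY_ORG | Sentry organization slug |
| SENTRY_PROJECT | Sentry project slug |
| SENTRY_TRACES_SAMPLE_RATE | Performance trace sample rate (default: 0.1 = 10%) |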
Data Scrubbing (Privacy)
The following data is automatically removed before sending to Sentry:
No PII is sent to Sentry (sendDefaultPii: false).
Setup Instructions
- Create a project at sentry.io
- Get DSN from Project Settings → Client Keys (DSN)
- Set environment variables:
# .env.local (server-side)
SENTRY_DSN=https://[email protected]/zzz
# .env (public, safe to commit)
NEXT_PUBLIC_SENTRY_DSN=https://[email protected]/zzz
- Optional: Set up source map uploads for better stack traces:
SENTRY_AUTH_TOKEN=your_auth_token
SENTRY_ORG=your_org_slug
SENTRY_PROJECT=your_project_slug
Usage in Code
import { captureError, captureMessage } from '@/lib/sentry';
// Capture exception
try {
await dangerousOperation();
} catch (error) {
captureError(error, {
userId: user.id,
requestId: req.headers.get('x-request-id'),
tags: { feature: 'remittance' },
extra: { amount, currency },
});
}
// Capture message
captureMessage('Critical threshold exceeded', 'warning', {
tags: { metric: 'error_rate' },
});
Graceful Degradation: When SENTRY_DSN is not set, all Sentry functions become no-ops (dev mode works without Sentry).
Slack Alerting
Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts
Features
- Operational alerts sent to Slack webhook
- 10-minute cooldown per alert title (prevents spam)
- Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
- Graceful degradation when webhook URL not set (dev mode)
- Optional Sentry URL attachment
Setup Instructions
- Create incoming webhook in Slack workspace:
  - Go to Slack App Directory → Incoming Webhooks
  - Choose channel (e.g., #ops or #alerts)
  - Copy webhook URL
- Set environment variable:
# .env.local (server-side secret)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
Required Environment Variable
| Variable | Required | Description |
|---|---|---|
| SLACK_WEBHOOK_URL | Yes (production) | Slack incoming webhook URL |
Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.
Alert Types and Severities
| Severity | Emoji | Use Case |
|---|---|---|
| info | ℹ️ | Application startup, normal operations |
| warning | ⚠️ | Degraded performance, non-critical issues |
| critical | 🚨 | Service outages, data loss, security incidents |
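The severity-to-emoji mapping can be sketched as a lookup table feeding the message body. This is an illustrative sketch, not the code in src/lib/alerts.ts: SEVERITY_EMOJI and buildSlackPayload are hypothetical names, and the exact message layout is an assumption. Slack incoming webhooks accept a JSON body with a text field.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Emoji prefixes as listed in the severity table.
const SEVERITY_EMOJI: Record<Severity, string> = {
  info: 'ℹ️',
  warning: '⚠️',
  critical: '🚨',
};

// Builds the JSON body for a Slack incoming webhook.
// The "*title*" syntax renders as bold in Slack's mrkdwn format.
function buildSlackPayload(severity: Severity, title: string, message: string): { text: string } {
  return { text: `${SEVERITY_EMOJI[severity]} *${title}*\n${message}` };
}
```

Typing the table as Record<Severity, string> means the compiler rejects a new severity that has no emoji assigned.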
Cooldown Behavior
- Each alert title has a 10-minute cooldown
- Same title sent within 10 minutes → skipped (prevents spam)
- Different titles → sent immediately (independent tracking)
- Cooldown resets on app restart (in-memory tracking)
Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High error rate detected" can still be sent at 10:05.
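The cooldown rules above amount to a per-title timestamp map. The sketch below shows one way to express them; shouldSend and lastSent are hypothetical names, and the actual implementation in src/lib/alerts.ts may be organized differently.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes

// In-memory map of alert title -> time the alert was last sent (ms since epoch).
// Being in-memory, it is empty again after an app restart.
const lastSent = new Map<string, number>();

// Returns true if the alert should be sent now, and records the send time.
function shouldSend(title: string, now: number = Date.now()): boolean {
  const previous = lastSent.get(title);
  if (previous !== undefined && now - previous < COOLDOWN_MS) {
    return false; // same title inside the cooldown window: skip
  }
  lastSent.set(title, now);
  return true;
}
```

In this sketch a skipped attempt does not extend the cooldown; the window is always measured from the last alert that was actually sent.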
Usage in Code
import { sendAlert } from '@/lib/alerts';
// Basic alert
await sendAlert({
severity: 'critical',
title: 'Database connection failed',
message: 'PostgreSQL unreachable after 3 retries',
});
// Alert with Sentry link
await sendAlert({
severity: 'warning',
title: 'High error rate detected',
message: '15 errors in last 5 minutes',
sentryUrl: 'https://sentry.io/organizations/drop/issues/12345/',
});
Current Integrations
- App startup: Sends info alert when server starts (instrumentation.ts)
- App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
- Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
- Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)
Error Spike Detection
The alerting system automatically detects error spikes using a rolling window approach:
How it works:
- Every server error (HTTP 5xx) is tracked via trackError()
- Maintains rolling 1-minute window of error timestamps
- When count exceeds threshold (5 errors in 60 seconds), sends critical alert
- Integrates with both Sentry error tracking and middleware error handling
Threshold: 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/sentry.ts:captureError() and src/lib/middleware.ts:jsonError()
Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
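A rolling-window counter of this kind can be sketched in a few lines. The code below is an illustration of the technique, not a copy of trackError(): recordError and recentErrors are hypothetical names, and the sketch assumes the alert fires when more than 5 errors fall inside the window.

```typescript
const WINDOW_MS = 60_000; // 60-second rolling window
const THRESHOLD = 5;      // spike when more than 5 errors are inside the window

// Timestamps (ms since epoch) of recent errors. In-memory only.
let recentErrors: number[] = [];

// Records one error and returns true when the spike threshold is exceeded.
function recordError(now: number = Date.now()): boolean {
  recentErrors.push(now);
  // Keep only timestamps that are still inside the window.
  recentErrors = recentErrors.filter((t) => now - t < WINDOW_MS);
  return recentErrors.length > THRESHOLD;
}
```

On its own this returns true for every error during a sustained spike, so the caller would normally pair it with the per-title alert cooldown to avoid repeated notifications.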
External Uptime Monitoring (BetterStack or UptimeRobot)
Status: Not yet configured (manual setup required)
Why External Monitoring?
- Detects infrastructure failures (container crashes, network issues)
- Independent from the application (catches total outages)
- Provides uptime SLA tracking
- Multiple notification channels (email, Slack, SMS)
Option A: BetterStack (Recommended)
Cost: Free tier (5 monitors, 3-minute interval)
- Create account at betterstack.com
- Go to Monitors > Create Monitor
- Configure:
  - Monitor type: HTTP(s)
  - URL: https://your-domain.com/api/health
  - Check interval: 60 seconds (paid) or 180 seconds (free)
  - Request timeout: 5 seconds
  - Confirmation period: 1 retry before alerting
- Set Expected status code: 200
- Add Keyword check: Response body contains "status":"ok"
- Configure alert policy (see Escalation Policy below)
Option B: UptimeRobot
Cost: Free tier (50 monitors, 5-minute interval)
- Create account at uptimerobot.com
- Add HTTP(S) monitor:
  - Friendly Name: Drop Production
  - URL: https://your-domain.com/api/health
  - Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
- Configure alert contacts (see Escalation Policy below)
- Set Keyword Monitoring: Response contains "status":"ok"
Recommended Checks
| Check | Endpoint/Target | Interval | Timeout |
|---|---|---|---|
| Health check | GET /api/health | 60s | 5s |
| SSL certificate expiry | Domain certificate | Daily | N/A |
| Response time | GET /api/health | 60s | 500ms threshold |
| Homepage | GET / | 300s | 10s |
Alert Channels
| Channel | Use Case | Setup |
|---|---|---|
| Slack #alerts | All incidents | Add Slack integration in monitoring tool |
| Email | P1/P2 incidents | Add team email addresses |
| SMS (paid) | P1 only | Add phone numbers for on-call |
Escalation Policy
Minute 0: Alert fires → Slack #alerts (automatic)
Minute 5: Still down → Email to on-call engineer
Minute 15: Still down → SMS to CTO
Minute 30: Still down → Phone call to CEO
Configure this in BetterStack under Escalation Policies or in UptimeRobot under Alert Contacts.
Note: This is external to the Drop application -- no code changes needed, purely configuration.
Monitoring Stack Summary
Implemented (MC #1184)
- ✅ Health check endpoint -- /api/health with real database verification
- ✅ Container health checks -- Docker + Fly.io auto-restart on failure
- ❌ Error tracking -- Sentry REMOVED (MC #1271)
- ✅ Slack alerting -- Operational alerts with cooldown protection
- ✅ Lifecycle monitoring -- App startup and graceful shutdown alerts
- ✅ Error spike detection -- Automatic alerting when >5 errors/minute
Recommended (Manual Setup)
- 📋 External uptime monitoring -- UptimeRobot checking /api/health every 5 minutes
- 📋 Structured logging -- JSON log format with request IDs for correlation
- 📋 Metrics dashboard -- Request latency, error rates, database query times
- 📋 Audit logging -- Tracked as security requirement (security/drop-security-rapport.md finding L3)
Future Enhancements (TODO)
- Database performance monitoring (slow query alerts)
- Rate limit metrics (track 429 errors per endpoint)
- Business metrics dashboard (transactions per hour, success rate)
- Redis-backed error counter (persistent across restarts)
- Per-endpoint error tracking (isolate problematic routes)
Environment Variables Reference
Required for Production
# Sentry error tracking
SENTRY_DSN=https://[email protected]/zzz
NEXT_PUBLIC_SENTRY_DSN=https://[email protected]/zzz
# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX
Optional (Enhances Features)
# Sentry source maps (better stack traces)
SENTRY_AUTH_TOKEN=sntrys_xxx
SENTRY_ORG=your_org_slug
SENTRY_PROJECT=your_project_slug
# Sentry performance monitoring (default: 0.1 = 10%)
SENTRY_TRACES_SAMPLE_RATE=0.1
Dev Mode (All Optional)
All monitoring features gracefully degrade when env vars are not set:
- No SENTRY_DSN: Errors logged to console only
- No SLACK_WEBHOOK_URL: Alerts logged to console only
This allows development to work without external services configured.
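The fallback for alerts can be pictured as a single decision on the environment. The sketch below is an assumption about how such a check might look, not the project's code: resolveAlertSink and AlertSink are hypothetical names, and treating a blank value the same as an unset one is a design choice of this sketch.

```typescript
type AlertSink = 'slack' | 'console';

// Decides where an alert goes, based on the environment passed in.
// A missing or blank SLACK_WEBHOOK_URL falls back to console logging.
function resolveAlertSink(env: Record<string, string | undefined>): AlertSink {
  const url = env['SLACK_WEBHOOK_URL']?.trim();
  return url ? 'slack' : 'console';
}
```

In an application this would typically be called with process.env; passing the environment as an argument keeps the function easy to test.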