Monitoring & Alerting
Drop Monitoring
Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts
Health Check Endpoint
Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35
What It Checks
- Database connectivity -- Executes SELECT 1 as ok against the database
- Database latency -- Measures query execution time in milliseconds
- Database driver -- Reports whether using pg (PostgreSQL) or sqlite
- Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE (mock or live)
- Application uptime -- Tracks seconds since server start
- Application version -- Reads from npm_package_version env var, defaults to 0.1.0
Status Values
- ok -- All checks pass (HTTP 200)
- degraded -- DB query returned unexpected result (HTTP 200)
- down -- DB unreachable (HTTP 503)
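The mapping from the database probe to a status value and HTTP code can be sketched as below. This is a minimal illustration, not the code in route.ts: the names DbProbe, HealthOutcome, and deriveHealth are hypothetical, and the probe shape is an assumption.

```typescript
// Hypothetical types for illustration; the real route.ts may be structured differently.
type HealthStatus = "ok" | "degraded" | "down";

type DbProbe =
  | { reachable: true; rowOk: boolean } // query ran; rowOk is true when SELECT 1 as ok returned the expected row
  | { reachable: false };               // query threw (connection refused, timeout, ...)

interface HealthOutcome {
  status: HealthStatus;
  httpStatus: 200 | 503;
}

// Maps the probe result to the status values listed above.
function deriveHealth(probe: DbProbe): HealthOutcome {
  if (!probe.reachable) return { status: "down", httpStatus: 503 };
  if (!probe.rowOk) return { status: "degraded", httpStatus: 200 };
  return { status: "ok", httpStatus: 200 };
}
```

Note that degraded still answers with HTTP 200, so a monitor that only looks at the status code will not notice it.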
Response Format
Healthy (200 OK):
{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
Down (503 Service Unavailable):
{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
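A client consuming this endpoint could type the payload roughly as follows. The HealthResponse interface is inferred from the two sample responses above and is an assumption (fields absent from the samples, such as the db check for a degraded response, are guesses). summarizeHealth is a hypothetical helper.

```typescript
// Shape inferred from the sample responses; not taken from the application source.
interface HealthResponse {
  data: {
    status: "ok" | "degraded" | "down";
    version: string;
    uptime: number; // seconds since server start
    checks: {
      // latencyMs and driver are absent in the "down" sample, so they are optional here
      db: { status: "pass" | "fail"; latencyMs?: number; driver?: string };
      services: { mode: string };
    };
    timestamp: string; // ISO 8601
  };
}

// Produces a one-line summary suitable for logs or a CLI check.
function summarizeHealth(raw: string): string {
  const body = JSON.parse(raw) as HealthResponse;
  const { status, checks } = body.data;
  const latency = checks.db.latencyMs !== undefined ? `${checks.db.latencyMs}ms` : "n/a";
  return `${status} (db: ${checks.db.status}, latency: ${latency})`;
}
```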
Container Health Checks
Docker Compose (MVP)
Source: docker-compose.yml:12-17
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
Docker Compose (Production)
Source: docker-compose.production.yml:9-14
Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5
The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).
Fly.io
Source: fly.toml:19-23
[[http_service.checks]]
grace_period = "10s"
interval = "30s"
method = "GET"
path = "/api/health"
timeout = "5s"
Fly.io uses this health check to determine machine readiness and to route traffic.
Current Monitoring State
What Exists
- Health check endpoint with real database verification (not hardcoded)
- Container-level health checks (Docker + Fly.io)
- Automatic restart on failure (restart: unless-stopped in docker-compose)
- Auto-scaling on Fly.io (scale to zero, auto-start on request)
What Does Not Exist Yet
- External uptime monitoring service (see the BetterStack section below for the recommended configuration)
- Application Performance Monitoring (APM)
- Structured logging (JSON format)
- Log aggregation and forwarding
- Database performance monitoring
- Rate limit monitoring/metrics
- Business metrics dashboard (transactions per hour, success rate)
Sentry Error Tracking
Status: REMOVED (MC #1271: Sentry uninstalled)
Slack Alerting
Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts
Features
- Operational alerts sent to Slack webhook
- 10-minute cooldown per alert title (prevents spam)
- Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
- Graceful degradation when webhook URL not set (dev mode)
Setup Instructions
- Create incoming webhook in Slack workspace:
  - Go to Slack App Directory → Incoming Webhooks
  - Choose channel (e.g., #ops or #alerts)
  - Copy webhook URL
- Set environment variable:
  # .env.local (server-side secret)
  SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
Required Environment Variable
| Variable | Required | Description |
|---|---|---|
| SLACK_WEBHOOK_URL | Yes (production) | Slack incoming webhook URL |
Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.
Alert Types and Severities
| Severity | Emoji | Use Case |
|---|---|---|
| info | ℹ️ | Application startup, normal operations |
| warning | ⚠️ | Degraded performance, non-critical issues |
| critical | 🚨 | Service outages, data loss, security incidents |
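The severity prefix and the console fallback can be sketched as below. This is an illustration under assumptions, not the contents of src/lib/alerts.ts: buildSlackPayload and planDelivery are hypothetical names, and the message layout is invented. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a text field.

```typescript
type Severity = "info" | "warning" | "critical";

interface Alert {
  severity: Severity;
  title: string;
  message: string;
}

// Emoji prefixes from the table above.
const SEVERITY_EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

// Slack incoming webhooks accept { "text": "..." }; the layout of the text is an assumption.
function buildSlackPayload(alert: Alert): { text: string } {
  return { text: `${SEVERITY_EMOJI[alert.severity]} *${alert.title}*\n${alert.message}` };
}

type Delivery =
  | { target: "slack"; url: string; body: string }
  | { target: "console"; line: string };

// Decides where an alert goes: Slack when a webhook URL is configured, console otherwise.
function planDelivery(alert: Alert, webhookUrl: string | undefined): Delivery {
  if (!webhookUrl) {
    return { target: "console", line: `[alert:${alert.severity}] ${alert.title}: ${alert.message}` };
  }
  return { target: "slack", url: webhookUrl, body: JSON.stringify(buildSlackPayload(alert)) };
}
```

Keeping the decision separate from the HTTP call makes the dev-mode fallback testable without a network.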
Cooldown Behavior
- Each alert title has a 10-minute cooldown
- Same title sent within 10 minutes → skipped (prevents spam)
- Different titles → sent immediately (independent tracking)
- Cooldown resets on app restart (in-memory tracking)
Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
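A per-title cooldown of this kind can be kept in a simple in-memory map. The sketch below is illustrative: AlertCooldown and shouldSend are hypothetical names, and the actual implementation in src/lib/alerts.ts may differ. The clock is passed in so the behavior can be checked without waiting.

```typescript
const TEN_MINUTES_MS = 10 * 60 * 1000;

// In-memory cooldown tracker; state is lost on restart, matching the behavior described above.
class AlertCooldown {
  private readonly lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs: number = TEN_MINUTES_MS) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true and records the send time if the title is outside its cooldown.
  shouldSend(title: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // same title within the cooldown: skip
    }
    this.lastSent.set(title, nowMs);
    return true;
  }
}
```

Because each title has its own entry, "High latency detected" is unaffected by a recent "Database connection failed".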
Usage in Code
import { sendAlert } from '@/lib/alerts';
// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});
// Alert with details
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
});
Current Integrations
- App startup: Sends info alert when server starts (instrumentation.ts)
- App shutdown: Sends info alert on SIGTERM/SIGINT (instrumentation.ts)
- Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (src/lib/alerts.ts:trackError)
- Unhandled exceptions: Logged and tracked via process event handlers (instrumentation.ts)
Error Spike Detection
The alerting system automatically detects error spikes using a rolling window approach:
How it works:
- Every server error (HTTP 5xx) is tracked via trackError()
- Maintains a rolling 1-minute window of error timestamps
- When the count exceeds the threshold (5 errors in 60 seconds), sends a critical alert
- Integrates with middleware error handling
Threshold: 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/middleware.ts:jsonError()
Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
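The rolling-window idea can be sketched as follows. This is not the actual trackError() implementation: ErrorSpikeDetector is a hypothetical class, and it assumes the alert fires when the count in the window exceeds 5, as the ">5 errors" wording suggests.

```typescript
// Rolling-window spike detector; timestamps live in memory only.
class ErrorSpikeDetector {
  private timestamps: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number = 5, windowMs: number = 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one error and returns true when the window now holds more than `threshold` errors.
  track(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}
```

In practice the true result would be passed through the alert cooldown, so a sustained spike produces one critical alert per 10 minutes rather than one per error.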
BetterStack Uptime Monitoring
Status: Ready to configure (setup guide available)
Documentation: BETTERSTACK-SETUP.md
Overview
BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.
Free tier includes:
- 10 monitors (enough for Drop production)
- 3-minute check interval
- Unlimited integrations (Slack, email)
- Public status page
- SSL expiry monitoring
Recommended Monitors
| Monitor | URL | Purpose | Expected |
|---|---|---|---|
| Health | https:// | API | 200, body contains "status":"ok" |
| Landing Page | https://drop.alai.no | Public website | 200, body contains Send penger |
| Multi-Region Check | https://drop.alai.no/api/health | Geographic availability | 200, body contains "status":"ok" |
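The "Expected" column combines a status-code check with a keyword check. A minimal sketch of that pass/fail rule is shown below. ProbeResult and monitorPasses are hypothetical names, and the rule is an approximation of what an external monitor does, not BetterStack's implementation.

```typescript
interface ProbeResult {
  httpStatus: number;
  body: string;
}

// A monitor passes only when the status code is 200 AND the body contains the keyword.
function monitorPasses(result: ProbeResult, keyword: string): boolean {
  return result.httpStatus === 200 && result.body.includes(keyword);
}

// Keyword from the table above; it assumes the endpoint emits compact JSON (no space after the colon).
const HEALTH_KEYWORD = '"status":"ok"';

// A degraded response still returns HTTP 200, so only the keyword check catches it.
const degradedProbe: ProbeResult = {
  httpStatus: 200,
  body: JSON.stringify({ data: { status: "degraded" } }),
};
const degradedPasses = monitorPasses(degradedProbe, HEALTH_KEYWORD); // false
```

If the health route ever pretty-prints its JSON, the keyword would need to be adjusted to match the new spacing.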
Alert Escalation
BetterStack sends alerts through multiple channels:
Minute 0: Alert fires → Slack #drop-ops (immediate)
Minute 5: Still down → Email to [email protected]
Minute 15: Still down → SMS (requires paid plan)
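The escalation timeline can be read as a list of thresholds. The sketch below models it as data to make the behavior explicit. It is illustrative only: the real policy is configured in the BetterStack UI, and EscalationStep and channelsNotified are hypothetical names.

```typescript
interface EscalationStep {
  afterMinutes: number;
  channel: string;
}

// Steps copied from the timeline above.
const ESCALATION_STEPS: EscalationStep[] = [
  { afterMinutes: 0, channel: "Slack #drop-ops" },
  { afterMinutes: 5, channel: "Email" },
  { afterMinutes: 15, channel: "SMS (requires paid plan)" },
];

// Returns every channel that has been notified once an outage has lasted `minutesDown` minutes.
function channelsNotified(minutesDown: number): string[] {
  return ESCALATION_STEPS
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```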
Status Page
Public status page shows real-time service status:
- URL: https://drop-status.betteruptime.com
- Components: API Health, Landing Page, Global Network
- Auto-updates: Incidents automatically posted and resolved
- Subscriptions: Users can subscribe to email updates
Setup Instructions
Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md
Setup includes:
- Account creation (free tier)
- Configure 3 monitors (health, landing, multi-region)
- Slack integration (#drop-ops channel)
- On-call schedule and escalation policy
- Public status page creation
- Testing and verification
Key Features
Proactive monitoring:
- 3-minute check interval (free tier) or 30s (paid)
- Keyword verification (not just HTTP 200)
- SSL certificate expiry warnings (14 days)
- Multi-region checks (detect geographic issues)
Incident management:
- Automatic incident creation on downtime
- Status page updates (public transparency)
- Escalation to multiple channels (Slack → Email → SMS)
- Maintenance window support (suppress alerts during deployments)
Reporting:
- Uptime SLA tracking (99.9% target)
- Incident history and analysis
- Response time graphs
- Downtime duration reports
Integration with Drop Alerting
BetterStack complements Drop's internal alerting (src/lib/alerts.ts):
| Feature | Drop Internal Alerts | BetterStack External |
|---|---|---|
| Detects | Application errors, error spikes | Infrastructure outages |
| When | App is running | App is unreachable |
| Source | Application logs | External HTTP checks |
| Delivery | Slack webhook (direct) | Escalation policy |
| Use case | Code bugs, DB issues | Container crashes, network failures |
Example: Database connection fails:
- Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate)
- BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min
Maintenance Windows
When performing planned maintenance (deployments, upgrades):
- Create maintenance window in BetterStack
- Select affected monitors
- Set duration (e.g., 1 hour)
- Effect: Alerts suppressed, status page shows "Scheduled Maintenance"
Prevents: False downtime alerts during intentional service interruptions.
Best Practices
Do's:
- ✅ Test alerts monthly (pause monitor to verify escalation)
- ✅ Use keyword checks (not just HTTP status codes)
- ✅ Monitor SSL expiry (14-day warnings)
- ✅ Create maintenance windows for deployments
- ✅ Review incident history monthly
Don'ts:
- ❌ Don't ignore degraded status (investigate even if not fully down)
- ❌ Don't disable monitors (use pause for temporary suppression)
- ❌ Don't skip keyword checks (HTTP 200 ≠ working API)
- ❌ Don't rely solely on external monitoring (combine with internal checks)
External Uptime Monitoring (Alternative: UptimeRobot)
Status: Alternative to BetterStack (not recommended)
BetterStack is recommended over UptimeRobot for Drop because:
- Better Slack integration (richer notifications)
- Built-in status page (UptimeRobot charges extra)
- Better UI/UX for incident management
- More flexible escalation policies
UptimeRobot Setup (if BetterStack unavailable)
Cost: Free tier (50 monitors, 5-minute interval)
- Create account at uptimerobot.com
- Add HTTP(S) monitor:
  - Friendly Name: Drop Production
  - URL: https://drop.alai.no/api/health
  - Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
- Configure alert contacts:
  - Slack webhook (via Alert Contacts)
  - Email ([email protected])
- Set Keyword Monitoring: Response contains "status":"ok"
Limitations:
- No built-in escalation policies (requires third-party integrations)
- Status page requires paid plan
- Less detailed incident reports
- 5-minute check interval (free)
Recommended Checks
| Check | Endpoint/Target | Interval | Timeout |
|---|---|---|---|
| Health check | GET /api/health | 60s | 5s |
| SSL certificate expiry | Domain certificate | Daily | N/A |
| Response time | GET /api/health | 60s | 500ms threshold |
| Homepage | GET / | 300s | 10s |
Escalation Policy
Minute 0: Alert fires → Slack #alerts (automatic)
Minute 5: Still down → Email to on-call engineer
Minute 15: Still down → SMS to CTO
Minute 30: Still down → Phone call to CEO
Configure this in BetterStack under Escalation Policies or in UptimeRobot under Alert Contacts.
Note: This is external to the Drop application -- no code changes needed, purely configuration.
Monitoring Stack Summary
Implemented (MC #1184)
- ✅ Health check endpoint -- /api/health with real database verification
- ✅ Container health checks -- Docker + Fly.io auto-restart on failure
- ❌ Error tracking -- Sentry REMOVED (MC #1271)
- ✅ Slack alerting -- Operational alerts with cooldown protection
- ✅ Lifecycle monitoring -- App startup and graceful shutdown alerts
- ✅ Error spike detection -- Automatic alerting when >5 errors/minute
Recommended (Manual Setup)
- 📋 External uptime monitoring -- BetterStack (recommended) checking /api/health every 3 minutes, or UptimeRobot every 5 minutes
- 📋 Structured logging -- JSON log format with request IDs for correlation
- 📋 Metrics dashboard -- Request latency, error rates, database query times
- 📋 Audit logging -- Tracked as security requirement (security/drop-security-rapport.md finding L3)
Future Enhancements (TODO)
- Database performance monitoring (slow query alerts)
- Rate limit metrics (track 429 errors per endpoint)
- Business metrics dashboard (transactions per hour, success rate)
- Redis-backed error counter (persistent across restarts)
- Per-endpoint error tracking (isolate problematic routes)
Environment Variables Reference
Required for Production
# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX
Dev Mode (All Optional)
All monitoring features gracefully degrade when env vars are not set:
- No SLACK_WEBHOOK_URL: Alerts logged to console only
This allows development to work without external services configured.