# Monitoring & Alerting

# Drop Monitoring

**Last updated:** 2026-02-17
**Source:** `src/drop-app/src/app/api/health/route.ts`, `docker-compose.yml`, `docker-compose.production.yml`, `fly.toml`, `src/lib/alerts.ts`, `src/lib/middleware.ts`, `instrumentation.ts`

---

## Health Check Endpoint

**Route:** `GET /api/health`
**Source:** `src/drop-app/src/app/api/health/route.ts:1-35`

### What It Checks
1. **Database connectivity** -- Executes `SELECT 1 as ok` against the database
2. **Database latency** -- Measures query execution time in milliseconds
3. **Database driver** -- Reports `pg` (PostgreSQL 16 via Drizzle ORM)
4. **Service mode** -- Reports `NEXT_PUBLIC_SERVICE_MODE` (`mock` or `live`)
5. **Application uptime** -- Tracks seconds since server start
6. **Application version** -- Reads from `npm_package_version` env var, defaults to `0.1.0`

### Status Values
- **ok** -- All checks pass (HTTP 200)
- **degraded** -- DB query returned unexpected result (HTTP 200)
- **down** -- DB unreachable (HTTP 503)
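
The mapping from probe outcome to status can be sketched as follows. This is a minimal illustration, not the contents of `route.ts`: `DbProbe`, `HealthVerdict`, and `classifyHealth` are hypothetical names, and treating any row whose `ok` column is not `1` as `degraded` is an assumption about what counts as an unexpected result.

```typescript
// Hypothetical shapes for illustration; the real route.ts may differ.
type DbProbe =
  | { reachable: true; row: { ok?: unknown } | undefined }
  | { reachable: false };

interface HealthVerdict {
  status: 'ok' | 'degraded' | 'down';
  httpStatus: 200 | 503;
}

// Maps the outcome of `SELECT 1 as ok` to the documented status values.
function classifyHealth(probe: DbProbe): HealthVerdict {
  if (!probe.reachable) {
    // The query threw or timed out: the database is unreachable.
    return { status: 'down', httpStatus: 503 };
  }
  // Assumption: drivers may return 1 as a number or as the string "1".
  const passed = Number(probe.row?.ok) === 1;
  // Both ok and degraded answer with HTTP 200; only the body differs.
  return { status: passed ? 'ok' : 'degraded', httpStatus: 200 };
}
```

Because `degraded` still returns HTTP 200, a monitor that only inspects the status code cannot tell it apart from `ok`.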

### Response Format

**Healthy (200 OK):**
```json
{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
```

**Down (503 Service Unavailable):**
```json
{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
```
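
For consumers of the endpoint, the two payloads above suggest the following shape. This is a sketch inferred only from the examples shown here: `HealthResponse` and `summarizeHealth` are hypothetical names, and marking `latencyMs` and `driver` as optional is an assumption based on their absence in the 503 example.

```typescript
// Shape inferred from the example payloads; not taken from route.ts.
interface HealthResponse {
  data: {
    status: 'ok' | 'degraded' | 'down';
    version: string;
    uptime: number; // seconds since server start
    checks: {
      db: { status: 'pass' | 'fail'; latencyMs?: number; driver?: string };
      services: { mode: 'mock' | 'live' };
    };
    timestamp: string; // ISO 8601
  };
}

// Produces a one-line summary suitable for logs or a CLI probe.
function summarizeHealth(rawBody: string): string {
  const { data } = JSON.parse(rawBody) as HealthResponse;
  const db = data.checks.db;
  const dbPart =
    db.latencyMs === undefined
      ? `db ${db.status}`
      : `db ${db.status} in ${db.latencyMs} ms`;
  return `${data.status} (v${data.version}, up ${data.uptime}s, ${dbPart}, mode ${data.checks.services.mode})`;
}
```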

---

## Container Health Checks

### Docker Compose (MVP)
**Source:** `docker-compose.yml:12-17`

```yaml
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
```

### Docker Compose (Production)
**Source:** `docker-compose.production.yml:9-14`

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

```yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5
```

The `drop-app` service depends on PostgreSQL being healthy before starting (`depends_on.postgres.condition: service_healthy`).

### Fly.io
**Source:** `fly.toml:19-23`

```toml
[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"
```

Fly.io uses this health check to determine machine readiness and to route traffic.

---

## Current Monitoring State

### What Exists
- Health check endpoint with real database verification (not hardcoded)
- Container-level health checks (Docker + Fly.io)
- Automatic restart on failure (`restart: unless-stopped` in docker-compose)
- Auto-scaling on Fly.io (scale to zero, auto-start on request)

### What Does Not Exist Yet
- External uptime monitoring service (see the BetterStack section below for the recommended configuration)
- Application Performance Monitoring (APM)
- Structured logging (JSON format)
- Log aggregation and forwarding
- Database performance monitoring
- Rate limit monitoring/metrics
- Business metrics dashboard (transactions per hour, success rate)

---

## Sentry Error Tracking

**Status:** REMOVED (MC #1271, Sentry uninstalled)

---

## Slack Alerting

**Status:** Implemented (MC #1183)
**Source:** `src/lib/alerts.ts`, `instrumentation.ts`

### Features
- Operational alerts sent to Slack webhook
- 10-minute cooldown per alert title (prevents spam)
- Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
- Graceful degradation when webhook URL not set (dev mode)

### Setup Instructions
1. Create incoming webhook in Slack workspace:
   - Go to **Slack App Directory → Incoming Webhooks**
   - Choose a channel (e.g., `#drop-ops`, the channel used throughout this document)
   - Copy webhook URL
2. Set environment variable:
   ```bash
   # .env.local (server-side secret)
   SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
   ```

### Required Environment Variable
| Variable | Required | Description |
|----------|----------|-------------|
| `SLACK_WEBHOOK_URL` | Yes (production) | Slack incoming webhook URL |

**Note:** When `SLACK_WEBHOOK_URL` is not set, alerts are logged to console but not sent to Slack.
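
The fallback described in the note can be sketched as a small decision function. `AlertSink` and `resolveAlertSink` are hypothetical names, and treating a blank value the same as an unset one is an assumption; the actual logic in `src/lib/alerts.ts` may differ.

```typescript
// Where an alert ends up: Slack when configured, the console otherwise.
type AlertSink =
  | { kind: 'slack'; webhookUrl: string }
  | { kind: 'console' };

function resolveAlertSink(env: Record<string, string | undefined>): AlertSink {
  // Assumption: an empty or whitespace-only value counts as "not set".
  const webhookUrl = (env['SLACK_WEBHOOK_URL'] ?? '').trim();
  return webhookUrl.length > 0
    ? { kind: 'slack', webhookUrl }
    : { kind: 'console' };
}
```

In a Node.js process this would typically be called as `resolveAlertSink(process.env)`.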

### Alert Types and Severities
| Severity | Emoji | Use Case |
|----------|-------|----------|
| `info` | ℹ️ | Application startup, normal operations |
| `warning` | ⚠️ | Degraded performance, non-critical issues |
| `critical` | 🚨 | Service outages, data loss, security incidents |
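
As an illustration of how the severity prefix might be applied, the sketch below builds the JSON body for a Slack incoming webhook, which accepts a `text` field. `SEVERITY_EMOJI` and `buildSlackPayload` are hypothetical names, and the exact message layout (bold title, message on the next line) is an assumption rather than the format used by `src/lib/alerts.ts`.

```typescript
type Severity = 'info' | 'warning' | 'critical';

interface Alert {
  severity: Severity;
  title: string;
  message: string;
}

// Emoji prefixes from the severity table above.
const SEVERITY_EMOJI: Record<Severity, string> = {
  info: 'ℹ️',
  warning: '⚠️',
  critical: '🚨',
};

// Builds the request body for a Slack incoming webhook.
// Assumed layout: "<emoji> *<title>*" followed by the message on a new line.
function buildSlackPayload(alert: Alert): { text: string } {
  const prefix = SEVERITY_EMOJI[alert.severity];
  return { text: `${prefix} *${alert.title}*\n${alert.message}` };
}
```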

### Cooldown Behavior
- Each alert **title** has a 10-minute cooldown
- Same title sent within 10 minutes → skipped (prevents spam)
- Different titles → sent immediately (independent tracking)
- Cooldown resets on app restart (in-memory tracking)

**Example:** If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
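
The cooldown rules above can be sketched as an in-memory gate keyed by title. `createCooldownGate` is a hypothetical name, the injected `now` parameter exists only to make the behavior easy to test, and not extending the cooldown on skipped attempts is an assumption about the real implementation.

```typescript
// Returns a function that decides whether an alert with a given title may be sent.
function createCooldownGate(cooldownMs: number = 10 * 60 * 1000) {
  // Title -> timestamp (ms) of the last alert that was actually sent.
  // Lives in memory only, so it resets when the app restarts.
  const lastSentAt = new Map<string, number>();

  return function shouldSend(title: string, now: number = Date.now()): boolean {
    const previous = lastSentAt.get(title);
    if (previous !== undefined && now - previous < cooldownMs) {
      return false; // Same title is still cooling down: skip.
    }
    lastSentAt.set(title, now);
    return true;
  };
}
```

Replaying the example: "Database connection failed" at 10:00 is sent, a repeat at 10:09 is skipped, and "High latency detected" at 10:05 is sent because each title is tracked independently.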

### Usage in Code
```typescript
import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
  severity: 'critical',
  title: 'Database connection failed',
  message: 'PostgreSQL unreachable after 3 retries',
});

// Warning-level alert with a different title (tracked under its own cooldown)
await sendAlert({
  severity: 'warning',
  title: 'High error rate detected',
  message: '15 errors in last 5 minutes',
});
```

### Current Integrations
- **App startup:** Sends info alert when server starts (`instrumentation.ts`)
- **App shutdown:** Sends info alert on SIGTERM/SIGINT (`instrumentation.ts`)
- **Error spike detection:** Automatically tracks errors and alerts when >5 errors occur in 60 seconds (`src/lib/alerts.ts:trackError`)
- **Unhandled exceptions:** Logged and tracked via process event handlers (`instrumentation.ts`)

### Error Spike Detection

The alerting system automatically detects error spikes using a rolling window approach:

**How it works:**
1. Every server error (HTTP 5xx) is recorded via `trackError()`
2. `trackError()` keeps a rolling 60-second window of error timestamps
3. When the window holds more than 5 errors, a critical alert is sent
4. `trackError()` is called from the middleware error handler (`jsonError()`)

**Threshold:** More than 5 errors within 60 seconds
**Alert severity:** Critical (🚨)
**Implementation:** `src/lib/alerts.ts:trackError()`, wired into `src/lib/middleware.ts:jsonError()`

**Note:** Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
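
The rolling-window approach can be sketched as below. This is an illustration only: `createErrorSpikeDetector` is a hypothetical name, the real `trackError()` signature is not shown in this document, and the injected `now` parameter is there to keep the sketch testable.

```typescript
// Returns a tracker that reports true when the window holds more errors than the threshold.
function createErrorSpikeDetector(threshold: number = 5, windowMs: number = 60 * 1000) {
  // In-memory timestamps (ms) of recent errors; lost on restart.
  let timestamps: number[] = [];

  return function recordError(now: number = Date.now()): boolean {
    // Drop errors that have aged out of the rolling window, then add this one.
    timestamps = timestamps.filter((t) => now - t < windowMs);
    timestamps.push(now);
    return timestamps.length > threshold;
  };
}
```

A caller would send a critical alert whenever `recordError()` returns `true`; combined with the per-title cooldown, a sustained spike would produce one Slack message per 10 minutes rather than one per error.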

---

## BetterStack Uptime Monitoring

**Status:** Ready to configure (setup guide available)
**Documentation:** [BETTERSTACK-SETUP.md](BETTERSTACK-SETUP.md)

### Overview

BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.

**Free tier includes:**
- 10 monitors (enough for Drop production)
- 3-minute check interval
- Unlimited integrations (Slack, email)
- Public status page
- SSL expiry monitoring

### Recommended Monitors

| Monitor | URL | Purpose | Expected Response |
|---------|-----|---------|-------------------|
| Health Endpoint | `https://drop.alai.no/api/health` | API + DB connectivity | `200`, body contains `"status":"ok"` |
| Landing Page | `https://drop.alai.no` | Public website | `200`, body contains `Send penger` |
| Multi-Region Check | `https://drop.alai.no/api/health` | Geographic availability | `200`, body contains `"status":"ok"` |
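
The reason for pairing the status code with a keyword can be shown in a few lines. `monitorPasses` is a hypothetical helper that mimics what an external monitor does with its expected-response rule; it is not a BetterStack API. The sketch also assumes the route emits compact JSON (the default `JSON.stringify` output), since the keyword `"status":"ok"` would not match a pretty-printed body.

```typescript
interface ProbeResult {
  status: number; // HTTP status code
  body: string;   // raw response body
}

// True only if both the status code and the keyword match.
function monitorPasses(result: ProbeResult, expectedStatus: number, keyword: string): boolean {
  return result.status === expectedStatus && result.body.includes(keyword);
}

const KEYWORD = '"status":"ok"';

// A degraded response still returns HTTP 200, so only the keyword check catches it.
const degraded: ProbeResult = {
  status: 200,
  body: JSON.stringify({ data: { status: 'degraded' } }),
};
const degradedPasses = monitorPasses(degraded, 200, KEYWORD); // false
```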

### Alert Escalation

BetterStack sends alerts through multiple channels:

```
Minute 0:   Alert fires → Slack #drop-ops (immediate)
Minute 5:   Still down → Email to alem@alai.no
Minute 15:  Still down → SMS (requires paid plan)
```

### Status Page

Public status page shows real-time service status:
- **URL:** `https://drop-status.betteruptime.com`
- **Components:** API Health, Landing Page, Global Network
- **Auto-updates:** Incidents automatically posted and resolved
- **Subscriptions:** Users can subscribe to email updates

### Setup Instructions

Complete setup guide with step-by-step instructions: [BETTERSTACK-SETUP.md](BETTERSTACK-SETUP.md)

**Setup includes:**
1. Account creation (free tier)
2. Configure 3 monitors (health, landing, multi-region)
3. Slack integration (`#drop-ops` channel)
4. On-call schedule and escalation policy
5. Public status page creation
6. Testing and verification

### Key Features

**Proactive monitoring:**
- 3-minute check interval (free tier) or 30s (paid)
- Keyword verification (not just HTTP 200)
- SSL certificate expiry warnings (14 days)
- Multi-region checks (detect geographic issues)

**Incident management:**
- Automatic incident creation on downtime
- Status page updates (public transparency)
- Escalation to multiple channels (Slack → Email → SMS)
- Maintenance window support (suppress alerts during deployments)

**Reporting:**
- Uptime SLA tracking (99.9% target)
- Incident history and analysis
- Response time graphs
- Downtime duration reports

### Integration with Drop Alerting

BetterStack complements Drop's internal alerting (`src/lib/alerts.ts`):

| Feature | Drop Internal Alerts | BetterStack External |
|---------|---------------------|----------------------|
| **Detects** | Application errors, error spikes | Infrastructure outages |
| **When** | App is running | App is unreachable |
| **Source** | Application logs | External HTTP checks |
| **Delivery** | Slack webhook (direct) | Escalation policy |
| **Use case** | Code bugs, DB issues | Container crashes, network failures |

**Example:** When the database connection fails, both systems fire:
1. Drop internal alert: "Database connection failed" → Slack `#drop-ops` (immediate)
2. BetterStack: Health check returns 503 → Slack `#drop-ops` + Email after 5 min

### Maintenance Windows

When performing planned maintenance (deployments, upgrades):
1. Create maintenance window in BetterStack
2. Select affected monitors
3. Set duration (e.g., 1 hour)
4. **Effect:** Alerts suppressed, status page shows "Scheduled Maintenance"

**Prevents:** False downtime alerts during intentional service interruptions.

### Best Practices

**Do's:**
- ✅ Test alerts monthly (pause monitor to verify escalation)
- ✅ Use keyword checks (not just HTTP status codes)
- ✅ Monitor SSL expiry (14-day warnings)
- ✅ Create maintenance windows for deployments
- ✅ Review incident history monthly

**Don'ts:**
- ❌ Don't ignore degraded status (investigate even if not fully down)
- ❌ Don't disable monitors (use pause for temporary suppression)
- ❌ Don't skip keyword checks (HTTP 200 ≠ working API)
- ❌ Don't rely solely on external monitoring (combine with internal checks)

---

## External Uptime Monitoring (Alternative: UptimeRobot)

**Status:** Alternative to BetterStack (not recommended)

BetterStack is recommended over UptimeRobot for Drop because:
- Better Slack integration (richer notifications)
- Built-in status page (UptimeRobot charges extra)
- Better UI/UX for incident management
- More flexible escalation policies

### UptimeRobot Setup (if BetterStack unavailable)

**Cost:** Free tier (50 monitors, 5-minute interval)

1. Create account at uptimerobot.com
2. Add HTTP(S) monitor:
   - **Friendly Name:** Drop Production
   - **URL:** `https://drop.alai.no/api/health`
   - **Monitoring Interval:** 5 minutes (free tier) or 1 minute (paid)
3. Configure alert contacts:
   - Slack webhook (via Alert Contacts)
   - Email (`alem@alai.no`)
4. Set **Keyword Monitoring:** Response contains `"status":"ok"`

**Limitations:**
- No built-in escalation policies (requires third-party integrations)
- Status page requires paid plan
- Less detailed incident reports
- 5-minute check interval (vs 3-minute for BetterStack free)

---

## Monitoring Stack Summary

### Implemented (MC #1184)
- ✅ **Health check endpoint** — `/api/health` with real database verification
- ✅ **Container health checks** — Docker + Fly.io auto-restart on failure
- ❌ **Error tracking** — Sentry REMOVED (MC #1271)
- ✅ **Slack alerting** — Operational alerts with cooldown protection
- ✅ **Lifecycle monitoring** — App startup and graceful shutdown alerts
- ✅ **Error spike detection** — Automatic alerting when >5 errors/minute

### Recommended (Manual Setup)
- 📋 **External uptime monitoring** -- BetterStack checking `/api/health` every 3 minutes on the free tier (UptimeRobot at 5 minutes as fallback)
- 📋 **Structured logging** — JSON log format with request IDs for correlation
- 📋 **Metrics dashboard** — Request latency, error rates, database query times
- 📋 **Audit logging** — Tracked as security requirement (`security/drop-security-rapport.md` finding L3)

### Future Enhancements (TODO)
- Database performance monitoring (slow query alerts)
- Rate limit metrics (track 429 errors per endpoint)
- Business metrics dashboard (transactions per hour, success rate)
- Redis-backed error counter (persistent across restarts)
- Per-endpoint error tracking (isolate problematic routes)

---

## Environment Variables Reference

### Required for Production
```bash
# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX
```

### Dev Mode (All Optional)
All monitoring features gracefully degrade when env vars are not set:
- **No SLACK_WEBHOOK_URL:** Alerts logged to console only

This allows development to work without external services configured.