Monitoring & Observability

Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)

Document History

Version	Date	Author	Changes
0.1	2026-02-23	Platform Architect (AI)	Initial draft from source code and infrastructure analysis

1. Observability Strategy

Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.

Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)
Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.

Core Questions We Must Be Able to Answer:

Is Drop up and serving users? (BetterStack monitors /api/health)
Is the database connected and responding? (health endpoint DB query)
Are errors spiking? (error spike detection in alerts.ts, reported to Slack)
What changed before the problem? (CloudWatch App Runner logs)
When was the last successful deployment? (App Runner deployment history)

2. Three Pillars

2.1 Metrics

Infrastructure Metrics

Metric	Source	Alert Threshold	Severity
App Runner service status	AWS CloudWatch / App Runner API	`RUNNING` → any other state	Critical
RDS instance status	AWS CloudWatch / RDS API	Not `available`	Critical
RDS CPU utilization	AWS CloudWatch `CPUUtilization`	> 80% (warn), > 95% (critical)	Warning / Critical
RDS free storage	AWS CloudWatch `FreeStorageSpace`	< 1GB (warn), < 256MB (critical)	Warning / Critical
RDS database connections	AWS CloudWatch `DatabaseConnections`	> 70 (warn, db.t4g.micro max ~85)	Warning
App Runner concurrent requests	AWS CloudWatch	TBD — no baseline yet	TBD

Application Metrics (RED Method)

Metric	Source	Alert Threshold	Severity
Request rate	CloudWatch App Runner	Baseline TBD	Informational
Error rate	`src/lib/alerts.ts` `trackError()`	> 5 errors in 60 seconds → Slack alert	Critical
DB query latency	`/api/health` `checks.db.latencyMs` field	> 100ms (warn)	Warning
Health endpoint status	BetterStack + `/api/health`	`status` != `"ok"` in body, or HTTP 503	Critical
Rate limit hits (429)	App logs	Spike of 429 responses	Warning

Business Metrics (Planned for v1.0)

Metric	Description	Target
Transactions per hour	Successful remittances + QR payments	TBD
Transaction success rate	Completed / total initiated	> 99%
KYC approval rate	Sumsub approvals / attempts	> 80%
BankID login success rate	Successful OIDC callbacks / initiated	> 99%

2.2 Logs

Log Sources

Source	Log Group	Format	Retention
App Runner (application)	`/aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application`	Next.js console output	30 days (verify the log group setting; the CloudWatch Logs default is never expire)
App Runner (system)	`/aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service`	App Runner events	30 days
RDS PostgreSQL	RDS error log via CloudWatch	PostgreSQL log format	7 days

Log Access

# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Filter for errors in the last hour (GNU date syntax; on macOS/BSD use: date -u -v-1H +%s)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --filter-pattern "ERROR" \
  --region eu-west-1

# Download RDS error log
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1

Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.
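As a rough illustration of what the planned JSON logging could look like, the sketch below emits one JSON object per line so that CloudWatch can filter on fields. The function names (`formatLogLine`, `log`) and the field names are assumptions made for this example, not part of the current codebase.

```typescript
// Hypothetical sketch: one JSON object per log line.
// formatLogLine and log are illustrative names, not existing Drop functions.

type LogLevel = "debug" | "info" | "warn" | "error";

function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // Spread caller fields first so they cannot overwrite the core keys.
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    message,
  });
}

function log(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
): void {
  // App Runner forwards stdout to the CloudWatch application log group.
  console.log(formatLogLine(level, message, fields));
}
```

With lines in this shape, a CloudWatch Logs JSON filter pattern such as `{ $.level = "error" }` could replace the plain-text `"ERROR"` match used above.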

2.3 Traces

Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).

Planned for v1.0: Request ID correlation across middleware and DB queries.
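One possible shape for request ID correlation, assuming the handlers run on the Node.js runtime, is to carry the ID in an `AsyncLocalStorage` context so that any code in the call chain can read it without passing it explicitly. The helper names `withRequestId` and `currentRequestId` are hypothetical.

```typescript
// Hypothetical sketch: propagate a request ID through async call chains.
// Assumes a Node.js runtime; withRequestId/currentRequestId are illustrative names.
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  requestId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Run fn with a request ID bound to the current async context.
// If the caller has no incoming ID (e.g. from a header), generate one.
function withRequestId<T>(requestId: string | undefined, fn: () => T): T {
  return requestContext.run({ requestId: requestId ?? randomUUID() }, fn);
}

// Read the ID anywhere below withRequestId; undefined outside a request.
function currentRequestId(): string | undefined {
  return requestContext.getStore()?.requestId;
}
```

A logger or DB wrapper could call `currentRequestId()` and attach the value to every log line, which is what makes lines from the same request correlatable.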

3. Health Check System

3.1 Health Endpoint (`GET /api/health`)

The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.

Source: src/drop-app/src/app/api/health/route.ts

Success Response (HTTP 200):

{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}

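Degraded Response (HTTP 200): status: "degraded". The DB query ran but returned an unexpected result. Because the HTTP code is still 200, only a body check on "status":"ok" (as used by the BetterStack monitors) detects this state.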

Down Response (HTTP 503):

{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
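The mapping from DB probe outcome to the three responses above can be sketched as follows. This is not the code in route.ts; the type and function names (`ProbeOutcome`, `probeDb`, `classifyHealth`) and the idea of comparing the query result against an expected value are assumptions for illustration.

```typescript
// Hypothetical sketch of the ok / degraded / down decision.
// Names and the "expected value" comparison are assumptions, not route.ts code.

type ProbeOutcome =
  | { kind: "ok"; latencyMs: number }         // query returned the expected value
  | { kind: "unexpected"; latencyMs: number } // query ran, result was not as expected
  | { kind: "error" };                        // query threw or timed out

interface HealthResult {
  httpStatus: 200 | 503;
  status: "ok" | "degraded" | "down";
}

// Run the injected query and time it; never throws.
async function probeDb(
  runQuery: () => Promise<unknown>,
  expected: unknown,
): Promise<ProbeOutcome> {
  const started = Date.now();
  try {
    const value = await runQuery();
    const latencyMs = Date.now() - started;
    return value === expected
      ? { kind: "ok", latencyMs }
      : { kind: "unexpected", latencyMs };
  } catch {
    return { kind: "error" };
  }
}

// Map the probe outcome to the HTTP code and status string.
function classifyHealth(probe: ProbeOutcome): HealthResult {
  switch (probe.kind) {
    case "ok":
      return { httpStatus: 200, status: "ok" };
    case "unexpected":
      return { httpStatus: 200, status: "degraded" };
    case "error":
      return { httpStatus: 503, status: "down" };
  }
}
```

Keeping the classification separate from the probe makes the status logic testable without a database.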

3.2 Container Health Checks

Platform	Check	Interval	Timeout	Retries
Docker Compose (MVP)	`wget /api/health`	30s	10s	3
Docker Compose (Production)	`wget /api/health`	30s	10s	3
Fly.io (staging)	`GET /api/health`	30s	5s	—
AWS App Runner	`GET /api/health`	30s	5s	3

4. Alerting

4.1 Slack Alerting (Internal)

Source: src/drop-app/src/lib/alerts.ts
Channel: #drop-ops on alai-talk.slack.com
Webhook: SLACK_WEBHOOK_URL environment variable

Alert Type	Trigger	Severity	Emoji
App startup	Application boots	Info	ℹ️
App shutdown	SIGTERM/SIGINT received	Info	ℹ️
Error spike	> 5 errors in 60 seconds	Critical	🚨
Unhandled exception	Process event handler catches error	Critical	🚨
Custom alert	`sendAlert()` called in code	Variable	Variable

Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.

Error spike detection algorithm:

Every HTTP 5xx error calls trackError()
Rolling 1-minute window of error timestamps maintained in memory
When count > 5 in window → sends critical Slack alert
Alert cooldown prevents duplicate alerts within 10 minutes
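The steps above can be sketched as a small in-memory detector. This is a minimal illustration under the stated thresholds, not the actual contents of alerts.ts; the class name `SpikeDetector` and its methods are hypothetical, and the clock is passed in to keep the logic testable.

```typescript
// Hypothetical sketch: rolling-window spike detection with a per-title cooldown.
// SpikeDetector, record and shouldSend are illustrative names, not alerts.ts exports.

const WINDOW_MS = 60_000;        // rolling 1-minute window
const SPIKE_THRESHOLD = 5;       // alert when the count exceeds 5
const COOLDOWN_MS = 10 * 60_000; // 10-minute cooldown per alert title

class SpikeDetector {
  private timestamps: number[] = [];
  private lastSent = new Map<string, number>();

  // Record one 5xx error. Returns true when a Slack alert should be sent.
  record(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    // Drop timestamps that have fallen out of the rolling window.
    this.timestamps = this.timestamps.filter((t) => now - t < WINDOW_MS);
    if (this.timestamps.length <= SPIKE_THRESHOLD) return false;
    return this.shouldSend("Error spike", now);
  }

  // Cooldown gate: at most one alert per title every COOLDOWN_MS.
  shouldSend(title: string, now: number = Date.now()): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && now - last < COOLDOWN_MS) return false;
    this.lastSent.set(title, now);
    return true;
  }
}
```

Because both the timestamps and the cooldown map live in process memory, a restart clears them, which matches the "resets on app restart" behavior noted above.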

Usage in code:

import { sendAlert, trackError } from '@/lib/alerts';

// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });

// Track error (called automatically in middleware)
await trackError();

4.2 BetterStack External Monitoring

Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md)
Plan: Free tier (10 monitors, 3-minute check interval)

Monitor	URL	Check	Expected
Drop Health Check	`https://drop.alai.no/api/health`	HTTP GET + keyword	Status 200, body contains `"status":"ok"`
Drop Landing Page	`https://drop.alai.no`	HTTP GET + keyword	Status 200, body contains `Send penger`
Drop Health (US East)	`https://drop.alai.no/api/health`	HTTP GET (from US region)	Status 200, body contains `"status":"ok"`
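To make the "HTTP GET + keyword" expectation concrete, the sketch below models a monitor that passes only when the status code is 200 and the body contains the keyword. It assumes a plain substring match; BetterStack's exact matching semantics are not confirmed here, and `passesKeywordCheck` is a hypothetical name.

```typescript
// Hypothetical model of a status + keyword uptime check.
// Assumes plain substring matching; not BetterStack's actual implementation.

const HEALTH_KEYWORD = '"status":"ok"';

function passesKeywordCheck(
  httpStatus: number,
  body: string,
  keyword: string = HEALTH_KEYWORD,
): boolean {
  return httpStatus === 200 && body.includes(keyword);
}
```

Under this model a degraded response (HTTP 200, `"status":"degraded"`) fails the check even though the status code looks healthy. It also means the keyword only matches compact JSON: a body serialized with a space after the colon would not contain `"status":"ok"`.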

Status page: https://drop-status.betteruptime.com (public)

Escalation policy (Drop Production Incidents):

Minute 0:   Service down → Slack #drop-ops (immediate)
Minute 5:   Still down  → Email [email protected]
Minute 15:  Still down  → SMS +47 40 47 42 51 (requires paid BetterStack plan)
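The escalation timeline above can also be expressed as data, which makes it easy to check which channels should have been notified at a given point in an incident. This is an illustrative sketch only; the real policy lives in BetterStack's configuration, and the names `ESCALATION_POLICY` and `channelsNotified` are hypothetical.

```typescript
// Hypothetical sketch: the escalation policy as a lookup table.
// The actual policy is configured in BetterStack, not in application code.

type Channel = "slack" | "email" | "sms";

interface EscalationStep {
  afterMinutes: number; // minutes since the service went down
  channel: Channel;
}

const ESCALATION_POLICY: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" }, // requires a paid BetterStack plan
];

// Channels that should have fired once the outage has lasted minutesDown.
function channelsNotified(minutesDown: number): Channel[] {
  return ESCALATION_POLICY
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```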

SSL expiry warning: 14 days before certificate expiration.

5. Alerting Rules Reference

Condition	Source	Channel	Severity	Action
`/api/health` returns 503	BetterStack	Slack + email	Critical	Investigate DB + App Runner
Error spike (>5 in 60s)	`alerts.ts`	Slack `#drop-ops`	Critical	Check app logs
App Runner service not `RUNNING`	AWS Console / CloudWatch	Manual check	Critical	`aws apprunner start-deployment`
RDS CPU > 80%	CloudWatch (manual setup needed)	TBD	Warning	Investigate query patterns
RDS storage < 1GB	CloudWatch (manual setup needed)	TBD	Warning	Increase storage
SSL certificate expiring	BetterStack	Email	Warning	Renew certificate
App startup/shutdown	`alerts.ts`	Slack `#drop-ops`	Info	No action needed

6. Dashboards

6.1 AWS CloudWatch Dashboard (Planned)

Target dashboard widgets:

App Runner: Request count, 5xx error count, latency
RDS: CPU, connections, free storage, read/write IOPS
Health check latency over time (from /api/health responses)

Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.

6.2 BetterStack Status Page

Public status page: https://drop-status.betteruptime.com

Components:

API & Health Endpoint (linked to Drop Health Check monitor)
Landing Page (linked to Drop Landing Page monitor)
Global Network (linked to US East monitor)

7. On-Call Procedures

7.1 Escalation Matrix

Time	Action	Who
Alert fires (0 min)	Acknowledge Slack alert, investigate	Alem Bašić
5 min: still down	Email alert auto-sent, try restart	Alem Bašić
15 min: still down	SMS alert (if configured), escalate	Alem Bašić
30 min: unresolved	Follow the DR runbook for the matching scenario	Alem Bašić + John (AI)

7.2 First Response Checklist

# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' --output text --region eu-west-1

# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --since 10m --region eu-west-1

# 3. Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1

# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq

# 5. Restart App Runner if needed
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

8. Monitoring Gaps (Planned for v1.0)

Gap	Impact	Priority
Structured JSON logging	Cannot correlate requests across log lines	High
CloudWatch alarms for RDS	No automated alerting on DB metrics	High
APM / request tracing	Cannot trace slow requests	Medium
Business metrics dashboard	No visibility into transaction volume/success	Medium
Redis-backed error counter	Error counter resets on restart	Low
Audit logging stream	Required for compliance (AML)	High
Per-endpoint error tracking	Cannot isolate problematic routes	Medium

Approval

Role	Name	Date
Author	Platform Architect (AI)	2026-02-23
Reviewer
Approver	Alem Bašić
