Monitoring & Observability
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: In Review
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-23 | Platform Architect (AI) | Initial draft from source code and infrastructure analysis |
1. Observability Strategy
Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.
Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)
Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.
Core Questions We Must Be Able to Answer:
- Is Drop up and serving users? (BetterStack monitors /api/health)
- Is the database connected and responding? (health endpoint DB query)
- Are errors spiking? (Slack alerts.ts error spike detection)
- What changed before the problem? (CloudWatch App Runner logs)
- When was the last successful deployment? (App Runner deployment history)
2. Three Pillars
2.1 Metrics
Infrastructure Metrics
| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| App Runner service status | AWS CloudWatch / App Runner API | RUNNING → any other state | Critical |
| RDS instance status | AWS CloudWatch / RDS API | Not available | Critical |
| RDS CPU utilization | AWS CloudWatch CPUUtilization | > 80% (warn), > 95% (critical) | Warning / Critical |
| RDS free storage | AWS CloudWatch FreeStorageSpace | < 1GB (warn), < 256MB (critical) | Warning / Critical |
| RDS database connections | AWS CloudWatch DatabaseConnections | > 70 (warn, db.t4g.micro max ~85) | Warning |
| App Runner concurrent requests | AWS CloudWatch | TBD (no baseline yet) | TBD |
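The RDS thresholds in this table can be captured as data so that a future CloudWatch alarm definition and any ad hoc check agree on the same numbers. The following is a minimal sketch, not existing Drop code: `RDS_THRESHOLDS` and `classify` are hypothetical names, and it assumes "1GB" and "256MB" mean binary units (CloudWatch reports FreeStorageSpace in bytes).

```typescript
type Level = 'ok' | 'warning' | 'critical';

interface Threshold {
  // 'above' breaches when the value exceeds the limit, 'below' when it drops under it.
  direction: 'above' | 'below';
  warning: number;
  critical?: number; // omitted where the table defines only a warning level
}

// Limits copied from the Infrastructure Metrics table, keyed by CloudWatch metric name.
const RDS_THRESHOLDS: Record<string, Threshold> = {
  CPUUtilization: { direction: 'above', warning: 80, critical: 95 }, // percent
  // Assumption: 1GB = 1024^3 bytes and 256MB = 256 * 1024^2 bytes.
  FreeStorageSpace: { direction: 'below', warning: 1024 * 1024 * 1024, critical: 256 * 1024 * 1024 },
  DatabaseConnections: { direction: 'above', warning: 70 },
};

// Maps one metric sample to the severity the table assigns to it.
function classify(metric: string, value: number): Level {
  const t = RDS_THRESHOLDS[metric];
  if (!t) throw new Error(`No threshold defined for ${metric}`);
  const breached = (limit: number): boolean =>
    t.direction === 'above' ? value > limit : value < limit;
  if (t.critical !== undefined && breached(t.critical)) return 'critical';
  if (breached(t.warning)) return 'warning';
  return 'ok';
}
```

Because DatabaseConnections has no critical limit in the table, `classify` never returns `'critical'` for it, however high the value.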
Application Metrics (RED Method)
| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| Request rate | CloudWatch App Runner | Baseline TBD | Informational |
| Error rate | src/lib/alerts.ts trackError() | > 5 errors in 60 seconds → Slack alert | Critical |
| DB query latency | /api/health checks.db.latencyMs field | > 100ms (warn) | Warning |
| Health endpoint status | BetterStack + /api/health | status != "ok" → 503 | Critical |
| Rate limit hits (429) | App logs | Spike of 429 responses | Warning |
Business Metrics (Planned for v1.0)
| Metric | Description | Target |
|---|---|---|
| Transactions per hour | Successful remittances + QR payments | TBD |
| Transaction success rate | Completed / total initiated | > 99% |
| KYC approval rate | Sumsub approvals / attempts | > 80% |
| BankID login success rate | Successful OIDC callbacks / initiated | > 99% |
2.2 Logs
Log Sources
| Source | Log Group | Format | Retention |
|---|---|---|---|
| App Runner (application) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application | Next.js console output | 30 days (CloudWatch default) |
| App Runner (system) | /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service | App Runner events | 30 days |
| RDS PostgreSQL | RDS error log via CloudWatch | PostgreSQL log format | 7 days |
Log Access
# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow \
--region eu-west-1
# Filter for errors in last hour
aws logs filter-log-events \
--log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR" \
--region eu-west-1
# Download RDS error log
aws rds download-db-log-file-portion \
--db-instance-identifier drop-db \
--log-file-name error/postgresql.log \
--region eu-west-1
Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.
2.3 Traces
Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).
Planned for v1.0: Request ID correlation across middleware and DB queries.
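Since neither structured logging nor request ID correlation exists yet, the following is only a sketch of what the v1.0 plan could look like: one JSON object per log line, with every line from the same request carrying the same ID. The names `formatLogLine` and `createRequestLogger` and the field layout are assumptions, not part of the current codebase.

```typescript
interface LogFields {
  level: 'info' | 'warn' | 'error';
  message: string;
  requestId: string;
  [key: string]: unknown; // free-form context such as route or latency
}

// Serialises one log event as a single JSON line.
function formatLogLine(fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...fields });
}

// Returns a logger bound to one request ID. `write` is injected so the sketch
// stays independent of console or any logging library.
function createRequestLogger(requestId: string, write: (line: string) => void) {
  const log =
    (level: LogFields['level']) =>
    (message: string, extra: Record<string, unknown> = {}): void =>
      write(formatLogLine({ ...extra, level, message, requestId }));
  return { info: log('info'), warn: log('warn'), error: log('error') };
}
```

With lines in this shape, a CloudWatch filter pattern on the `requestId` field would return every line belonging to one request, which is the correlation that section 8 lists as missing.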
3. Health Check System
3.1 Health Endpoint (GET /api/health)
The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.
Source: src/drop-app/src/app/api/health/route.ts
Success Response (HTTP 200):
{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}
Degraded Response (HTTP 200): status: "degraded" — DB returned unexpected result.
Down Response (HTTP 503):
{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
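The three responses above follow one decision: did the DB query run, and did it return what was expected? The sketch below illustrates that mapping under stated assumptions. It is not the contents of route.ts: `toOutcome`, `checkHealth`, and the `runQuery` callback are hypothetical, and it assumes the probe query is expected to return the number 1 (for example from `SELECT 1`).

```typescript
type HealthStatus = 'ok' | 'degraded' | 'down';

interface HealthOutcome {
  status: HealthStatus;
  httpStatus: 200 | 503;
  latencyMs?: number; // only known when the DB answered
}

// Result of the DB probe: either the value the query returned, or a failure.
type ProbeResult =
  | { reachable: true; value: unknown; latencyMs: number }
  | { reachable: false };

// Pure mapping from probe result to the three documented outcomes.
function toOutcome(probe: ProbeResult): HealthOutcome {
  if (!probe.reachable) return { status: 'down', httpStatus: 503 };
  // Assumption: any value other than 1 counts as an "unexpected result".
  const status: HealthStatus = probe.value === 1 ? 'ok' : 'degraded';
  return { status, httpStatus: 200, latencyMs: probe.latencyMs };
}

// Runs the probe, times it, and converts any thrown error into "down".
async function checkHealth(
  runQuery: () => Promise<unknown>,
  now: () => number = Date.now,
): Promise<HealthOutcome> {
  const startedAt = now();
  try {
    const value = await runQuery();
    return toOutcome({ reachable: true, value, latencyMs: now() - startedAt });
  } catch {
    return toOutcome({ reachable: false });
  }
}
```

Keeping `toOutcome` free of I/O means the status-to-HTTP-code mapping can be unit tested without a database.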
3.2 Container Health Checks
| Platform | Check | Interval | Timeout | Retries |
|---|---|---|---|---|
| Docker Compose (MVP) | wget /api/health | 30s | 10s | 3 |
| Docker Compose (Production) | wget /api/health | 30s | 10s | 3 |
| Fly.io (staging) | GET /api/health | 30s | 5s | - |
| AWS App Runner | GET /api/health | 30s | 5s | 3 |
4. Alerting
4.1 Slack Alerting (Internal)
Source: src/drop-app/src/lib/alerts.ts
Channel: #drop-ops on alai-talk.slack.com
Webhook: SLACK_WEBHOOK_URL environment variable
| Alert Type | Trigger | Severity | Emoji |
|---|---|---|---|
| App startup | Application boots | Info | ℹ️ |
| App shutdown | SIGTERM/SIGINT received | Info | ℹ️ |
| Error spike | > 5 errors in 60 seconds | Critical | 🚨 |
| Unhandled exception | Process event handler catches error | Critical | 🚨 |
| Custom alert | sendAlert() called in code | Variable | Variable |
Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.
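A per-title cooldown of this kind can be kept in a small in-memory map, which also explains why it resets on restart. The class below is a sketch under that assumption. `AlertCooldown` and `shouldSend` are hypothetical names and do not describe how alerts.ts is actually written.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes, as stated above

class AlertCooldown {
  // Last send time per alert title. Lives in process memory, so a restart clears it.
  private lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs: number = COOLDOWN_MS) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if an alert with this title may be sent now, and records the send.
  shouldSend(title: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && nowMs - last < this.cooldownMs) return false;
    this.lastSent.set(title, nowMs);
    return true;
  }
}
```

Suppressed attempts do not extend the cooldown in this sketch: the timer runs from the last alert that was actually sent.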
Error spike detection algorithm:
- Every HTTP 5xx error calls trackError()
- Rolling 1-minute window of error timestamps maintained in memory
- When count > 5 in window → sends critical Slack alert
- Alert cooldown prevents duplicate alerts within 10 minutes
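The rolling-window steps above can be sketched as follows. This is an illustration of the described algorithm, not the source of `trackError()`: the class name `ErrorSpikeDetector` and its `track` method are hypothetical, and the cooldown is left out so the window logic stands alone.

```typescript
class ErrorSpikeDetector {
  // Timestamps (ms) of errors seen inside the current window.
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number = 60 * 1000, threshold: number = 5) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one error and returns true when the window now holds
  // more than `threshold` errors, i.e. when a spike alert is due.
  track(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have aged out of the rolling window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}
```

With the defaults, the sixth error inside one minute is the first call to return true, which matches the "> 5 errors in 60 seconds" trigger.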
Usage in code:
import { sendAlert, trackError } from '@/lib/alerts';
// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });
// Track error (called automatically in middleware)
await trackError();
4.2 BetterStack External Monitoring
Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md)
Plan: Free tier (10 monitors, 3-minute check interval)
| Monitor | URL | Check | Expected |
|---|---|---|---|
| Drop Health Check | https://drop.alai.no/api/health | HTTP GET + keyword | Status 200, body contains "status":"ok" |
| Drop Landing Page | https://drop.alai.no | HTTP GET + keyword | Status 200, body contains Send penger |
| Drop Health (US East) | https://drop.alai.no/api/health | HTTP GET (from US region) | Status 200, body contains "status":"ok" |
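The keyword checks in this table depend on the exact serialised form of the health response. The sketch below models a monitor as "expected status plus substring match" to show which responses would pass. It assumes BetterStack performs a plain substring match on the body; `monitorPasses` and the two interfaces are hypothetical names.

```typescript
interface MonitorSpec {
  expectedStatus: number;
  keyword: string;
}

interface HttpResult {
  status: number;
  body: string;
}

// A monitor passes only when both the status code and the keyword match.
function monitorPasses(spec: MonitorSpec, res: HttpResult): boolean {
  return res.status === spec.expectedStatus && res.body.includes(spec.keyword);
}

// Keyword from the table: no whitespace between key, colon and value.
const healthMonitor: MonitorSpec = { expectedStatus: 200, keyword: '"status":"ok"' };

// Compact JSON, as produced by JSON.stringify without an indent argument.
const compactBody = JSON.stringify({ data: { status: 'ok' } });
// Pretty-printed JSON inserts a space after the colon, so the keyword no longer matches.
const prettyBody = JSON.stringify({ data: { status: 'ok' } }, null, 2);
// A degraded response still returns HTTP 200 but does not contain the keyword.
const degradedBody = JSON.stringify({ data: { status: 'degraded' } });
```

Two consequences follow under this assumption: a degraded response (HTTP 200) would still fail the monitor because the keyword is absent, and switching the endpoint to pretty-printed JSON would fail it as well, even though the service is healthy.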
Status page: https://drop-status.betteruptime.com (public)
Escalation policy (Drop Production Incidents):
Minute 0: Service down → Slack #drop-ops (immediate)
Minute 5: Still down → Email [email protected]
Minute 15: Still down → SMS +47 40 47 42 51 (requires paid BetterStack plan)
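The escalation timeline above can be written as data, which makes it easy to check which channels should already have fired for an outage of a given length. This is a sketch only: the actual policy lives in BetterStack configuration, and `ESCALATION` and `channelsDue` are hypothetical names.

```typescript
interface EscalationStep {
  afterMinutes: number;
  channel: 'slack' | 'email' | 'sms';
}

// Steps copied from the Drop Production Incidents policy.
const ESCALATION: EscalationStep[] = [
  { afterMinutes: 0, channel: 'slack' },
  { afterMinutes: 5, channel: 'email' },
  { afterMinutes: 15, channel: 'sms' }, // requires a paid BetterStack plan
];

// Returns every channel that should have been notified after `minutesDown` minutes.
function channelsDue(minutesDown: number): string[] {
  return ESCALATION.filter((step) => minutesDown >= step.afterMinutes).map(
    (step) => step.channel,
  );
}
```

For a seven-minute outage, `channelsDue(7)` returns the Slack and email channels but not SMS.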
SSL expiry warning: 14 days before certificate expiration.
5. Alerting Rules Reference
| Condition | Source | Channel | Severity | Action |
|---|---|---|---|---|
| /api/health returns 503 | BetterStack | Slack + email | Critical | Investigate DB + App Runner |
| Error spike (>5 in 60s) | alerts.ts | Slack #drop-ops | Critical | Check app logs |
| App Runner service not RUNNING | AWS Console / CloudWatch | Manual check | Critical | aws apprunner start-deployment |
| RDS CPU > 80% | CloudWatch (manual setup needed) | TBD | Warning | Investigate query patterns |
| RDS storage < 1GB | CloudWatch (manual setup needed) | TBD | Warning | Increase storage |
| SSL certificate expiring | BetterStack | | Warning | Renew certificate |
| App startup/shutdown | alerts.ts | Slack #drop-ops | Info | No action needed |
6. Dashboards
6.1 AWS CloudWatch Dashboard (Planned)
Target dashboard widgets:
- App Runner: Request count, 5xx error count, latency
- RDS: CPU, connections, free storage, read/write IOPS
- Health check latency over time (from /api/health responses)
Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.
6.2 BetterStack Status Page
Public status page: https://drop-status.betteruptime.com
Components:
- API & Health Endpoint (linked to Drop Health Check monitor)
- Landing Page (linked to Drop Landing Page monitor)
- Global Network (linked to US East monitor)
7. On-Call Procedures
7.1 Escalation Matrix
| Time | Action | Who |
|---|---|---|
| Alert fires (0 min) | Acknowledge Slack alert, investigate | Alem Bašić |
| 5 min: still down | Email alert auto-sent, try restart | Alem Bašić |
| 15 min: still down | SMS alert (if configured), escalate | Alem Bašić |
| 30 min: unresolved | Follow DR runbook for scenario | Alem Bašić + John (AI) |
7.2 First Response Checklist
# 1. Check service status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' --output text --region eu-west-1
# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--since 10m --region eu-west-1
# 3. Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1
# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq
# 5. Restart App Runner if needed
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
8. Monitoring Gaps (Planned for v1.0)
| Gap | Impact | Priority |
|---|---|---|
| Structured JSON logging | Cannot correlate requests across log lines | High |
| CloudWatch alarms for RDS | No automated alerting on DB metrics | High |
| APM / request tracing | Cannot trace slow requests | Medium |
| Business metrics dashboard | No visibility into transaction volume/success | Medium |
| Redis-backed error counter | Error counter resets on restart | Low |
| Audit logging stream | Required for compliance (AML) | High |
| Per-endpoint error tracking | Cannot isolate problematic routes | Medium |
Related Documents
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |