Monitoring & Observability
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | | | Initial draft from source code and infrastructure analysis |
1. Observability Strategy
Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.
Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)
Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.
Core Questions We Must Be Able to Answer:
- Is Drop up and serving users? (BetterStack monitors /api/health)
- Is the database connected and responding? (health endpoint DB query)
- Are errors spiking? (Slack alerts.ts error spike detection)
- What changed before the problem? (CloudWatch App Runner logs)
- When was the last successful deployment? (App Runner deployment history)
2. Three Pillars
2.1 Metrics
Infrastructure Metrics
| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| App Runner service status | AWS CloudWatch / App Runner API | RUNNING → any other state | Critical |
| RDS instance status | AWS CloudWatch / RDS API | Not available | Critical |
| RDS CPU utilization | CPUUtilization | > | Warning / Critical |
| RDS free storage space | FreeStorageSpace | | Warning / Critical |
| RDS database connections | DatabaseConnections | > | Warning |
Application Metrics (RED Method)
| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| Request rate | CloudWatch App Runner | Baseline TBD | Informational |
| Error rate | src/lib/alerts.ts trackError() | > 5 errors in 60 seconds → Slack alert | Critical |
| DB query latency | /api/health dbLatencyMs field | > 100ms (warn) | Warning |
| Health endpoint status | BetterStack + /api/health | status != "ok" → 503 | Critical |
| Rate limit hits (429) | App logs | Spike of 429 responses | Warning |
Business Metrics (Planned for v1.0)
| Metric | Description | Target |
|---|---|---|
| Transactions per hour | Successful remittances + QR payments | TBD |
| Transaction success rate | Completed / total initiated | > 99% |
| KYC approval rate | Sumsub approvals / attempts | > 80% |
| BankID login success rate | Successful OIDC callbacks / initiated | > 99% |
2.2 Logs
Log Sources
| Log Group | Format | Retention |
|---|---|---|
| /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application | | |
| /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/service | | |
Log Access
# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow \
--region eu-west-1
# Filter for errors in last hour
aws logs filter-log-events \
--log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR" \
--region eu-west-1
# Download RDS error log
aws rds download-db-log-file-portion \
--db-instance-identifier drop-db \
--log-file-name error/postgresql.log \
--region eu-west-1
Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.
2.3 Traces
Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).
Planned for v1.0: Request ID correlation across middleware and DB queries.
3. Health Check System
3.1 Health Endpoint (GET /api/health)
The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.
Source: src/drop-app/src/app/api/health/route.ts
Success Response (HTTP 200):
{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "production" }
    },
    "timestamp": "2026-02-23T12:00:00.000Z"
  }
}
Degraded Response (HTTP 200): status: "degraded" — DB returned unexpected result.
Down Response (HTTP 503):
{
  "data": {
    "status": "down",
    "checks": { "db": { "status": "fail" } },
    "timestamp": "..."
  }
}
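The mapping from DB probe result to the three responses above can be sketched as follows. This is a minimal illustration, not the contents of route.ts: `checkHealth`, `runQuery`, and the `SELECT 1`-style probe that is expected to resolve to `1` are assumptions, and the `db` check value reported in the degraded case is a guess, since the source only shows it for the ok and down cases.

```typescript
// Hypothetical types; the real route.ts may be shaped differently.
type HealthStatus = "ok" | "degraded" | "down";
type DbCheck = { status: "pass" | "fail"; latencyMs?: number };

interface HealthResult {
  httpStatus: 200 | 503;
  body: { data: { status: HealthStatus; checks: { db: DbCheck }; timestamp: string } };
}

function respond(status: HealthStatus, db: DbCheck, nowMs: number): HealthResult {
  return {
    // Only "down" maps to 503; "ok" and "degraded" both return 200.
    httpStatus: status === "down" ? 503 : 200,
    body: { data: { status, checks: { db }, timestamp: new Date(nowMs).toISOString() } },
  };
}

// `runQuery` is an injected stand-in for the real DB client (assumption).
// It is expected to resolve to 1 for a trivial probe such as `SELECT 1`.
async function checkHealth(
  runQuery: () => Promise<unknown>,
  now: () => number = Date.now,
): Promise<HealthResult> {
  const startedAt = now();
  try {
    const result = await runQuery();
    const latencyMs = now() - startedAt;
    // The DB answered: an unexpected value is "degraded", not "down".
    const status: HealthStatus = result === 1 ? "ok" : "degraded";
    return respond(status, { status: "pass", latencyMs }, now());
  } catch {
    // The query threw (connection refused, timeout, ...): report "down".
    return respond("down", { status: "fail" }, now());
  }
}
```

Injecting the query function keeps the sketch runnable without a database and makes each of the three outcomes easy to exercise in a unit test.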
3.2 Container Health Checks
| Check | Retries |
|---|---|
| wget | 3 |
| wget /api/health | 3 |
| GET /api/health | - |
| GET | 3 |
4. Alerting
4.1 Slack Alerting (Internal)
Source: src/drop-app/src/lib/alerts.ts
Channel: #drop-ops on alai-talk.slack.com
Webhook: SLACK_WEBHOOK_URL environment variable
| Type | Trigger | Severity | Icon |
|---|---|---|---|
| App startup | Application boots | Info | ℹ️ |
| App shutdown | SIGTERM/SIGINT received | Info | ℹ️ |
| Error spike | > 5 errors in 60 seconds | Critical | 🚨 |
| Unhandled exception | Process event handler catches error | Critical | 🚨 |
| Custom alert | sendAlert() called in code | Variable | Variable |
Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.
Error spike detection algorithm:
- Every HTTP 5xx error calls trackError()
- Rolling 1-minute window of error timestamps maintained in memory
- When count > 5 in window → sends critical Slack alert
- Alert cooldown prevents duplicate alerts within 10 minutes
Usage in code:
import { sendAlert, trackError } from '@/lib/alerts';
// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });
// Track error (called automatically in middleware)
await trackError();
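The error spike detection and cooldown described in this section can be sketched as below. This is an illustration under stated assumptions, not the code in alerts.ts: `createErrorSpikeDetector`, the `notify` callback (standing in for the Slack webhook call), and the alert title are hypothetical, and the real `trackError()` takes no arguments, whereas this version accepts a timestamp so it can be tested deterministically.

```typescript
// Stand-in for posting to the Slack webhook (assumption).
type Notify = (title: string, message: string) => void;

interface SpikeOptions {
  windowMs: number;   // rolling window length
  threshold: number;  // alert when count is strictly greater than this
  cooldownMs: number; // minimum gap between alerts with the same title
}

const defaults: SpikeOptions = { windowMs: 60_000, threshold: 5, cooldownMs: 10 * 60_000 };

function createErrorSpikeDetector(notify: Notify, options: SpikeOptions = defaults) {
  // Both structures live in memory only, so they reset on app restart.
  let timestamps: number[] = [];
  const lastAlertAt = new Map<string, number>();

  function alertWithCooldown(title: string, message: string, now: number): boolean {
    const last = lastAlertAt.get(title);
    if (last !== undefined && now - last < options.cooldownMs) return false;
    lastAlertAt.set(title, now);
    notify(title, message);
    return true;
  }

  // Returns true only when an alert was actually sent.
  function trackError(now: number = Date.now()): boolean {
    timestamps.push(now);
    // Keep only the errors that fall inside the rolling window.
    timestamps = timestamps.filter((t) => now - t < options.windowMs);
    if (timestamps.length <= options.threshold) return false;
    const seconds = options.windowMs / 1000;
    return alertWithCooldown("Error spike", `${timestamps.length} errors in the last ${seconds}s`, now);
  }

  return { trackError };
}
```

Because the timestamps and cooldown map are held in process memory, each App Runner instance would count errors independently in this sketch.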
4.2 BetterStack External Monitoring
Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md)
Plan: Free tier (10 monitors, 3-minute check interval)
| Monitor | URL | Check | Expected |
|---|---|---|---|
| Drop Health Check | https://drop.alai.no/api/health | HTTP GET + keyword | Status 200, body contains "status":"ok" |
| Drop Landing Page | https://drop.alai.no | HTTP GET + keyword | Status 200, body contains Send penger |
| Drop Health (US East) | https://drop.alai.no/api/health | HTTP GET (from US region) | Status 200, body contains "status":"ok" |
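To make the "status + keyword" expectation concrete, the sketch below evaluates a response the way such a monitor is described in the table. It is not BetterStack's implementation; `KeywordMonitor` and `evaluateKeywordCheck` are hypothetical names, and a plain substring match is assumed.

```typescript
interface KeywordMonitor {
  name: string;
  expectedStatus: number;
  keyword: string;
}

interface CheckResult {
  up: boolean;
  reason: string;
}

// Assumes the keyword check is a plain substring match on the response body.
function evaluateKeywordCheck(monitor: KeywordMonitor, httpStatus: number, body: string): CheckResult {
  if (httpStatus !== monitor.expectedStatus) {
    return { up: false, reason: `expected HTTP ${monitor.expectedStatus}, got ${httpStatus}` };
  }
  if (!body.includes(monitor.keyword)) {
    return { up: false, reason: `keyword ${monitor.keyword} not found in body` };
  }
  return { up: true, reason: "status and keyword matched" };
}

const healthMonitor: KeywordMonitor = {
  name: "Drop Health Check",
  expectedStatus: 200,
  keyword: '"status":"ok"',
};

// Compact JSON contains the keyword; pretty-printed JSON has a space after the colon.
const compactBody = JSON.stringify({ data: { status: "ok" } });
const prettyBody = JSON.stringify({ data: { status: "ok" } }, null, 2);

const compactResult = evaluateKeywordCheck(healthMonitor, 200, compactBody);
const prettyResult = evaluateKeywordCheck(healthMonitor, 200, prettyBody);
```

Under the substring assumption, `compactResult.up` is true and `prettyResult.up` is false, so the keyword only matches while the health endpoint serializes its JSON without whitespace.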
Status page: https://drop-status.betteruptime.com (public)
Escalation policy (Drop Production Incidents):
Minute 0: Service down → Slack #drop-ops (immediate)
Minute 5: Still down → Email [email protected]
Minute 15: Still down → SMS +47 40 47 42 51 (requires paid BetterStack plan)
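The escalation policy above is a time-ordered list of notification steps. A minimal sketch of that idea follows; the `EscalationStep` shape and `stepsDue` helper are hypothetical, and the real policy is configured in BetterStack rather than in application code.

```typescript
interface EscalationStep {
  afterMinutes: number; // minutes since the incident started
  notify: string;       // channel to notify at that point
}

// Mirrors the three steps listed in the policy above.
const dropProductionIncidents: EscalationStep[] = [
  { afterMinutes: 0, notify: "Slack #drop-ops" },
  { afterMinutes: 5, notify: "Email" },
  { afterMinutes: 15, notify: "SMS (requires paid BetterStack plan)" },
];

// Returns every channel that should have been notified by `minutesDown`.
function stepsDue(policy: EscalationStep[], minutesDown: number): string[] {
  return policy
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.notify);
}
```

For example, `stepsDue(dropProductionIncidents, 7)` yields the Slack and email steps but not SMS.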
SSL expiry warning: 14 days before certificate expiration.
5. Alerting Rules Reference
| Condition | Source | Channel | Severity | Action |
|---|---|---|---|---|
| returns 503 | | Slack + email | | |
| Error spike (>5 in 60s) | | | | |
| App Runner service not | Manual check | | Critical | |
| RDS CPU > 80% | CloudWatch (manual setup needed) | TBD | Warning | Investigate query patterns |
| RDS storage < 1GB | CloudWatch (manual setup needed) | TBD | Warning | Increase storage |
| SSL certificate expiring | BetterStack | | Warning | Renew certificate |
| App startup/shutdown | | | | |
6. Dashboards
6.1 AWS CloudWatch Dashboard (Planned)
Target dashboard widgets:
- App Runner: Request count, 5xx error count, latency
- RDS: CPU, connections, free storage, read/write IOPS
- Health check latency over time (from /api/health responses)
Setup: TBD — requires CloudWatch dashboard configuration via AWS Console or Terraform.
6.2 BetterStack Status Page
Public status page: https://drop-status.betteruptime.com
Components:
- API & Health Endpoint (linked to Drop Health Check monitor)
- Landing Page (linked to Drop Landing Page monitor)
- Global Network (linked to US East monitor)
7. On-Call Procedures
7.1 Escalation Matrix
- Alert fires (0 min)
- 5 min: still down
- 15 min: still down
- 30 min: unresolved
7.2 First Response
# 1. Check service status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' --output text --region eu-west-1
# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--since 10m --region eu-west-1
# 3. Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1
# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq
# 5. Restart App Runner if needed
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
8. Monitoring Gaps (Planned for v1.0)
Related Documents
- Deployment Architecture
- Disaster Recovery Plan
- BetterStack Setup Guide
- Slack Alerting Source
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | Platform Architect (AI) | 2026-02-23 | |
| Reviewer | | | |
| Approver | Alem Bašić | | |