Monitoring & Observability
Project: Drop
Version: 0.1.0
Date: 2026-02-23
Author: Platform Architect (AI)
Status: Draft | In Review | Approved
Reviewers: Alem Bašić (CEO)
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | | | Initial draft |
1. Observability Strategy
Drop's observability stack is intentionally lean for the MVP phase: a health check endpoint with real DB verification, Slack alerting with error spike detection, BetterStack external uptime monitoring, and AWS CloudWatch for App Runner logs. Sentry was removed (MC #1271). Full APM and structured logging are planned for v1.0.
Observability Platform: BetterStack (external uptime) + Slack alerting (internal) + CloudWatch (AWS logs)
Strategy: Alert on symptoms (service down, error spike), verify via health check, investigate via CloudWatch logs.
Core Questions We Must Be Able to Answer:
- Is Drop up and serving users? (BetterStack monitors /api/health)
- Is the database connected and responding? (health endpoint DB query)
- Are errors spiking? (Slack alerts.ts error spike detection)
- What changed before the problem? (CloudWatch App Runner logs)
- When was the last successful deployment? (App Runner deployment history)
2. Three Pillars
2.1 Metrics
Infrastructure Metrics
| Metric | Source | Alert Threshold | Severity |
|---|---|---|---|
| Network in/out | Node exporter / CloudWatch | > {{NET_LIMIT}} Mbps sustained | Warning |
| Node not ready | Kubernetes | Any | Critical |
Application Metrics (RED Method)
| Metric | Description | Target | Alert |
|---|---|---|---|
| Request rate | Requests per second per service | | |
| Error rate | % requests returning 5xx | | |
| P95 latency | 95th percentile response time | < {{P95}}ms | > {{P95_ALERT}}ms |
| P99 latency | 99th percentile response time | < {{P99}}ms | > {{P99_ALERT}}ms |
Business Metrics (Planned for v1.0)
| Metric | Description | Target | Alert |
|---|---|---|---|
2.2 Logs
Log Sources
Log Access
# Stream App Runner application logs (live)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--follow \
--region eu-west-1
# Filter for errors in last hour
aws logs filter-log-events \
--log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR" \
--region eu-west-1
# Download RDS error log
aws rds download-db-log-file-portion \
--db-instance-identifier drop-db \
--log-file-name error/postgresql.log \
--region eu-west-1
Structured logging status: Not yet implemented. Current output is Next.js default console format. JSON structured logging planned for v1.0.
2.3 Traces
Status: Not implemented. Sentry (which would provide trace-level error context) was removed (MC #1271).
Planned for v1.0: Request ID correlation across middleware and DB queries.
3. Health Check System
3.1 Health Endpoint (GET /api/health)
The health endpoint performs a real DB query and reports application status. It is the primary signal for all monitoring layers.
Source: src/drop-app/src/app/api/health/route.ts
Success Response (HTTP 200):
{
"data": {
"status": "ok",
"version": "0.1.0",
"uptime": 3600,
"checks": {
"db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
"services": { "mode": "production" }
},
"timestamp": "2026-02-23T12:00:00.000Z"
}
}
Degraded Response (HTTP 200): status: "degraded" — DB returned unexpected result.
Down Response (HTTP 503):
{
"data": {
"status": "down",
"checks": { "db": { "status": "fail" } },
"timestamp": "..."
}
}
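The status rules above (HTTP 200 for ok and degraded, HTTP 503 for down) can be sketched as a small classification function. This is an illustration only, not the contents of route.ts: the `DbProbe` shape and the `classifyHealth` name are hypothetical, and the sketch assumes that "unexpected result" means the probe query completed but returned something other than the value the check expects.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

// Hypothetical summary of one DB probe; the real route.ts may model this differently.
interface DbProbe {
  reachable: boolean;        // the probe query completed without throwing
  resultAsExpected: boolean; // the probe query returned the expected value
}

// Maps a probe result to the response status and HTTP code described above.
function classifyHealth(probe: DbProbe): { status: HealthStatus; httpStatus: 200 | 503 } {
  if (!probe.reachable) {
    return { status: "down", httpStatus: 503 };
  }
  if (!probe.resultAsExpected) {
    return { status: "degraded", httpStatus: 200 };
  }
  return { status: "ok", httpStatus: 200 };
}
```

Because degraded still returns HTTP 200, an uptime monitor that only inspects the status code will not flag it; a monitor would need to inspect the response body to catch that state.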
3.2 Container Health Checks
4. Alerting
4.1 Slack Alerting (Internal)
Source: src/drop-app/src/lib/alerts.ts
Channel: #drop-ops on alai-talk.slack.com
Webhook: SLACK_WEBHOOK_URL environment variable
Cooldown: 10-minute cooldown per alert title (prevents spam). Resets on app restart.
Error spike detection algorithm:
- Every HTTP 5xx error calls trackError()
- Rolling 1-minute window of error timestamps maintained in memory
- When count > 5 in window → sends critical Slack alert
- Alert cooldown prevents duplicate alerts within 10 minutes
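The steps above can be sketched as a small in-memory detector. This is a minimal sketch under stated assumptions, not the code in alerts.ts: the `ErrorSpikeDetector` class and its `track` method are hypothetical names, and the caller passes the current time in so the logic can be exercised without waiting.

```typescript
const WINDOW_MS = 60 * 1000;        // rolling 1-minute window
const SPIKE_THRESHOLD = 5;          // alert when the count exceeds 5
const COOLDOWN_MS = 10 * 60 * 1000; // 10-minute alert cooldown

// Hypothetical detector; state lives in memory, so it resets on app restart.
class ErrorSpikeDetector {
  private timestamps: number[] = [];
  private lastAlertAt: number | null = null;

  // Records one 5xx error at time `now` (ms since epoch).
  // Returns true when a critical alert should be sent.
  track(now: number): boolean {
    this.timestamps.push(now);
    // Keep only errors that are still inside the rolling window.
    this.timestamps = this.timestamps.filter((t) => now - t < WINDOW_MS);

    if (this.timestamps.length <= SPIKE_THRESHOLD) {
      return false;
    }
    if (this.lastAlertAt !== null && now - this.lastAlertAt < COOLDOWN_MS) {
      return false; // still cooling down, suppress the duplicate
    }
    this.lastAlertAt = now;
    return true;
  }
}
```

With these constants, the sixth error inside one minute triggers an alert, and further errors within the next ten minutes are suppressed.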
Usage in code:
import { sendAlert, trackError } from '@/lib/alerts';
// Send manual alert
await sendAlert({ severity: 'critical', title: 'Database failover detected', message: '...' });
// Track error (called automatically in middleware)
await trackError();
4.2 BetterStack External Monitoring
Status: Ready to configure (setup guide: docs/infrastructure/BETTERSTACK-SETUP.md)
Plan: Free tier (10 monitors, 3-minute check interval)
Status page: https://drop-status.betteruptime.com (public)
Escalation policy (Drop Production Incidents):
Minute 0: Service down → Slack #drop-ops (immediate)
Minute 5: Still down → Email [email protected]
Minute 15: Still down → SMS +47 40 47 42 51 (requires paid BetterStack plan)
SSL expiry warning: 14 days before certificate expiration.
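The escalation policy above is configured in BetterStack rather than in code, but its timing can be modelled as data for documentation or testing purposes. The sketch below is hypothetical: the `EscalationStep` type, the channel labels, and `channelsNotified` are illustrative names, not part of any BetterStack API.

```typescript
interface EscalationStep {
  afterMinutes: number; // minutes of continuous downtime before this step fires
  channel: string;      // illustrative label only
}

// Mirrors the three steps listed above.
const ESCALATION_POLICY: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack:#drop-ops" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" }, // requires a paid BetterStack plan
];

// Returns every channel that has been notified after `minutesDown` minutes.
function channelsNotified(minutesDown: number): string[] {
  return ESCALATION_POLICY
    .filter((step) => minutesDown >= step.afterMinutes)
    .map((step) => step.channel);
}
```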
5. Alerting Rules Reference
6. Dashboards
6.1 AWS CloudWatch Dashboard (Planned)
Target dashboard widgets:
- App Runner: Request count, 5xx error count, latency
- RDS: CPU, connections, free storage, read/write IOPS
- Health check latency over time (from /api/health responses)
Setup: TBD. Requires CloudWatch dashboard configuration via the AWS Console or Terraform.
6.2 BetterStack Status Page
Public status page: https://drop-status.betteruptime.com
Components:
- API Health Endpoint (linked to Drop Health Check monitor)
- Landing Page (linked to Drop Landing Page monitor)
- Global Network (linked to US East monitor)
7. On-Call Procedures
7.1 Escalation Matrix
7.2 First Response Checklist
# 1. Check service status
aws apprunner describe-service \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--query 'Service.Status' --output text --region eu-west-1
# 2. Check recent logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
--since 10m --region eu-west-1
# 3. Check RDS status
aws rds describe-db-instances \
--db-instance-identifier drop-db \
--query 'DBInstances[0].DBInstanceStatus' --output text --region eu-west-1
# 4. Direct health check
curl -s https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health | jq
# 5. Restart App Runner if needed
aws apprunner start-deployment \
--service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
--region eu-west-1
8. Monitoring Gaps (Planned for v1.0)
Production log level: INFO and above
Structured Logging Format
{
"timestamp": "2026-01-15T10:30:00.000Z",
"level": "INFO",
"service": "{{SERVICE_NAME}}",
"version": "{{VERSION}}",
"trace_id": "abc123def456",
"span_id": "789xyz",
"user_id": "{{HASHED_OR_OMIT}}",
"request_id": "req-uuid-here",
"message": "Order created successfully",
"order_id": "ord-123",
"duration_ms": 45
}
Required fields: timestamp, level, service, message, trace_id
Forbidden in logs: passwords, tokens, credit card numbers, SSN, full email addresses (hash or truncate)
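One possible way to enforce the required fields and keep forbidden keys out of log lines is a small formatter that every log call goes through. The sketch below is an assumption, not an existing Drop module: `formatLog` and the `FORBIDDEN_KEYS` deny-list are hypothetical, and the deny-list only matches on key names, so it does not detect sensitive values stored under other keys.

```typescript
type LogLevel = "TRACE" | "DEBUG" | "INFO" | "WARN" | "ERROR";

// Required fields from the format above, plus arbitrary extra context.
interface LogEntry {
  timestamp: string;
  level: LogLevel;
  service: string;
  message: string;
  trace_id: string;
  [extra: string]: unknown;
}

// Hypothetical deny-list of key names that must never be logged.
const FORBIDDEN_KEYS: string[] = ["password", "token", "card_number", "ssn", "email"];

function formatLog(
  level: LogLevel,
  service: string,
  message: string,
  traceId: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const safeExtra: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(extra)) {
    if (!FORBIDDEN_KEYS.includes(key.toLowerCase())) {
      safeExtra[key] = value;
    }
  }
  // Required fields are spread last so extra context cannot overwrite them.
  const entry: LogEntry = {
    ...safeExtra,
    timestamp: now.toISOString(),
    level,
    service,
    message,
    trace_id: traceId,
  };
  return JSON.stringify(entry);
}
```

Writing the returned string to stdout produces one JSON object per line.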
Log Aggregation Pipeline
flowchart LR
APP[Application] -->|stdout/stderr| AGENT[Log Agent<br/>Fluent Bit / Filebeat]
AGENT -->|structured JSON| STORE[Log Store<br/>Loki / Elasticsearch / CloudWatch]
STORE --> QUERY[Query Interface<br/>Grafana / Kibana]
STORE --> ALERT[Alert Engine<br/>AlertManager / PagerDuty]
| Stage | Tool | Notes |
|---|---|---|
| Transport | {{LOG_TRANSPORT}} | TLS encrypted |
| Storage | {{LOG_STORE}} | Indexed, compressed |
| Query | {{LOG_QUERY}} | Access via dashboard |
Log Retention Policy
| Environment | Retention | Storage Tier |
|---|---|---|
| Dev | 7 days | Hot |
| Staging | 30 days | Hot |
| Production | {{PROD_LOG_RETENTION}} days | Hot (30d) → Cold archive |
| Audit logs | 1 year (regulatory) | Hot (90d) → Cold archive |
PII in Logs — Masking Strategy
| Data Type | Strategy | Example |
|---|---|---|
| Email address | Hash + truncate | user:sha256(email)[:8] |
| Phone number | Redact | [PHONE_REDACTED] |
| IP address | Anonymize last octet | 192.168.1.xxx |
| Payment data | Never log | Use [PAYMENT_DATA_OMITTED] |
| Auth tokens | Never log | Use [TOKEN_OMITTED] |
| Names | Omit or pseudonymize | Reference by ID only |
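The strategies in the table can be sketched as small masking helpers applied before a value reaches the logger. This is a minimal sketch with hypothetical function names. The SHA-256 implementation is passed in as a parameter (`sha256Hex`) rather than imported, so the example makes no assumption about which crypto library the project uses; the `[IP_REDACTED]` fallback for non-IPv4 input is also an assumption, not something the table specifies.

```typescript
// Replaces the last octet of an IPv4 address: 192.168.1.42 becomes 192.168.1.xxx.
function anonymizeIp(ip: string): string {
  const parts = ip.split(".");
  if (parts.length !== 4) {
    return "[IP_REDACTED]"; // hypothetical fallback for anything that is not IPv4
  }
  return [...parts.slice(0, 3), "xxx"].join(".");
}

// Phone numbers are redacted entirely, regardless of input.
function redactPhone(_phone: string): string {
  return "[PHONE_REDACTED]";
}

// Email: hash, then keep the first 8 hex characters, prefixed with "user:".
// `sha256Hex` must return the lowercase hex SHA-256 digest of its input.
function maskEmail(email: string, sha256Hex: (input: string) => string): string {
  return "user:" + sha256Hex(email).slice(0, 8);
}
```

Payment data and auth tokens are absent on purpose: the table says never to log them, so they should be dropped before logging rather than masked.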
2.3 Traces
Distributed Tracing Setup
Tracing Framework: {{TRACE_FRAMEWORK}}
Backend: {{TRACE_BACKEND}}
Auto-instrumentation: {{AUTO_INSTRUMENT}}
| Service | Instrumented | Framework | Notes |
|---|---|---|---|
| {{SERVICE_1}} | Yes | OpenTelemetry | HTTP, DB, Redis |
| {{SERVICE_2}} | Yes | OpenTelemetry | HTTP, external calls |
Trace Sampling Strategy
| Environment | Strategy | Rate | Notes |
|---|---|---|---|
| Dev | Always-on | 100% | Full visibility |
| Staging | Always-on | 100% | Full visibility |
| Production | Tail-based | {{SAMPLE_RATE}}% + errors | Error traces always kept |
Tail-based sampling rules:
- Always sample: traces with errors, traces > {{SLOW_THRESHOLD}}ms
- Sample rate: {{SAMPLE_RATE}}% of successful, fast traces
- Head-based fallback: {{HEAD_SAMPLE_RATE}}% if tail-based collector unavailable
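The tail-based rules above amount to a decision made once a trace is complete. A minimal sketch follows, assuming hypothetical `CompletedTrace` and `SamplingConfig` types; the slow threshold and sample rate are parameters because the document leaves them as placeholders, and the random source is injectable so the decision can be tested deterministically.

```typescript
interface CompletedTrace {
  hasError: boolean;
  durationMs: number;
}

interface SamplingConfig {
  slowThresholdMs: number; // traces slower than this are always kept
  sampleRate: number;      // fraction (0..1) of successful, fast traces to keep
}

// Decides whether a finished trace is kept, following the rules listed above.
function shouldKeepTrace(
  trace: CompletedTrace,
  config: SamplingConfig,
  random: () => number = Math.random,
): boolean {
  if (trace.hasError) {
    return true; // error traces are always kept
  }
  if (trace.durationMs > config.slowThresholdMs) {
    return true; // slow traces are always kept
  }
  return random() < config.sampleRate;
}
```

The head-based fallback is not modelled here, since it applies before a trace completes and would live in a different component.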
Span Naming Conventions
| Operation Type | Naming Pattern | Example |
|---|---|---|
| HTTP handler | HTTP {{METHOD}} {{ROUTE}} | HTTP POST /api/orders |
| DB query | db.{{operation}} {{table}} | db.select orders |
| Cache | cache.{{operation}} {{key_pattern}} | cache.get user:* |
| Queue | queue.{{operation}} {{queue_name}} | queue.publish order-events |
| External HTTP | {{service}} {{METHOD}} {{path}} | stripe POST /charges |
Context Propagation
Standard: W3C TraceContext (traceparent header)
Baggage: W3C Baggage (for user_id, tenant_id propagation)
Async: Inject context into message queue headers / job metadata
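For the async case, the value carried in queue headers or job metadata is the same `traceparent` string used on HTTP requests. The sketch below parses and formats version `00` of that header (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`). The `TraceContext` type and function names are hypothetical, other versions are rejected for simplicity, and in practice an OpenTelemetry propagator would normally handle this.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace flags
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed context, or null if the header is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) {
    return null;
  }
  const [, traceId, parentId, flags] = match;
  if (!traceId || !parentId || !flags) {
    return null;
  }
  // All-zero trace and parent IDs are invalid per the W3C specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Serializes a context back into a version-00 traceparent value.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}
```

A producer would store the output of `formatTraceparent` in the message headers, and the consumer would call `parseTraceparent` before starting its own span.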
3. Alerting
3.1 Alert Rules
| Alert Name | Condition | Duration | Severity | Channel | Runbook |
|---|---|---|---|---|---|
| HighErrorRate | | | Critical | PagerDuty | [link] |
| SlowP99 | p99_latency > {{P99_ALERT}}ms | 5 min | Warning | Slack #alerts | [link] |
| ServiceDown | health_check failing | 1 min | Critical | PagerDuty | [link] |
| HighCPU | cpu > {{CPU_CRIT}}% | 10 min | Warning | Slack #alerts | [link] |
| DiskAlmostFull | disk > {{DISK_CRIT}}% | 5 min | Critical | PagerDuty | [link] |
| DeploymentFailed | deployment status = failed | Immediate | Critical | Slack #deployments | [link] |
| CertificateExpiringSoon | cert_expiry < 30 days | — | Warning | Slack #ops | [link] |
| BackupFailed | backup job = failed | — | Critical | PagerDuty | [link] |
| SLOBudgetBurning | error_budget < 10% remaining | — | Critical | PagerDuty | [link] |
3.2 Alert Routing & Escalation
flowchart TD
ALERT[Alert fires] --> SEVERITY{Severity?}
SEVERITY -->|Critical| ONCALL[On-call engineer<br/>PagerDuty / phone]
SEVERITY -->|Warning| SLACK[Slack #alerts<br/>No immediate response required]
ONCALL -->|Not acknowledged in 5min| ESCALATE[Escalate to secondary]
ESCALATE -->|Not acknowledged in 10min| MANAGER[Notify engineering lead]
| Severity | Response SLA | Channel | Escalation |
|---|---|---|---|
| Critical (P1) | Acknowledge in 5 min, resolve in 1h | PagerDuty + call | Escalate at 5 min |
| High (P2) | Acknowledge in 30 min, resolve in 4h | PagerDuty | Escalate at 30 min |
| Warning (P3) | Review within 1 business day | Slack | Manual |
| Info | No response required | Slack | None |
3.3 On-Call Rotation
Schedule: {{ONCALL_SCHEDULE}}
Calendar: {{ONCALL_TOOL}}
Primary rotation: {{ONCALL_MEMBERS}}
Secondary (escalation): {{ESCALATION_MEMBERS}}
Minimum rotation size: 3 people (to avoid burnout)
3.4 Alert Fatigue Prevention
- Alert review cadence: Monthly — remove/adjust alerts with < {{ACTIONABLE_RATE}}% actionable rate
- Minimum alert duration: 2+ minutes (no single-spike alerts)
- Deduplication window: {{DEDUP_WINDOW}} minutes
- Business hours suppression: Allowed for non-critical alerts {{SUPPRESSION_HOURS}}
- Post-mortem requirement: Every Critical alert reviewed after incident
4. Dashboards
4.1 Dashboard Inventory
| Dashboard | Purpose | Link | Audience |
|---|---|---|---|
| System Overview | High-level health of all services | {{LINK}} | Everyone |
| {{SERVICE_1}} | Service-level detail | {{LINK}} | Dev team |
| Infrastructure | Host/container metrics | {{LINK}} | DevOps |
| Business | | | Leadership, PM |
| Error | | | Engineering lead |
| | | | On-call engineer |
4.2 Key Dashboard Specs — System Overview
Required panels:
- Service health matrix (all services, green/red/yellow)
- Request rate (all services, last 1h)
- Error rate (all services, last 1h)
- P99 latency (all services, last 1h)
- Active incidents count
- Error budget remaining (all SLOs)
- Last deployment (service, version, time)
- Infrastructure health (CPU, memory, disk — aggregate)
5. SLOs / SLIs
5.1 SLI Definitions
| SLI | Definition | Measurement Method |
|---|---|---|
| Availability | % requests returning non-5xx | (total_requests - 5xx_requests) / total_requests |
| Error rate | % requests not returning errors (expressed as a success ratio, so higher is better) | (total_requests - error_requests) / total_requests |
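The availability formula in the table is a single division, but it needs a defined result when there is no traffic in the measurement window. The sketch below is one hedged way to handle that; the `availability` function name and the choice to return `null` for an empty window are assumptions.

```typescript
// Returns the fraction of requests that did not fail (0..1),
// or null when the window contains no requests to measure.
function availability(totalRequests: number, failedRequests: number): number | null {
  if (totalRequests <= 0) {
    return null;
  }
  return (totalRequests - failedRequests) / totalRequests;
}
```

Returning `null` instead of 1 or 0 keeps an idle period from being counted as either perfect or failed availability.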
5.2 SLO Targets
| Service | SLI | Target | Window | Error Budget |
|---|---|---|---|---|
| {{SERVICE}} | Availability | {{AVAIL_TARGET}}% | 30 days | {{BUDGET_MINUTES}} min/month |
| {{SERVICE}} | Latency (P95 < {{P95}}ms) | {{LATENCY_TARGET}}% | 30 days | {{LATENCY_BUDGET_MINUTES}} min/month |
5.3 Error Budget Tracking
| Service | Monthly Budget | Burned This Month | Remaining | Burn Rate (24h) |
|---|---|---|---|---|
| {{SERVICE}} | {{BUDGET}}min | TBD | TBD | TBD |
Error budget policy:
- Budget > 50% remaining: Move fast, deploy freely
- Budget 10-50% remaining: Slow down, prioritize reliability work
- Budget < 10% remaining: Freeze non-critical deploys, focus on reliability
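The budget figures and the three policy tiers above can be derived with a few lines of arithmetic. The sketch below uses hypothetical names (`errorBudgetMinutes`, `budgetPolicy`, and the policy labels) and treats the 10% and 50% boundaries as belonging to the middle tier, which is an assumption since the policy does not say which tier owns the exact boundary values.

```typescript
type BudgetPolicy = "deploy-freely" | "prioritize-reliability" | "freeze-non-critical";

// Allowed downtime in minutes for an availability target (e.g. 99.9) over a window.
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - targetPercent / 100);
}

// Maps the remaining budget (fraction 0..1) to the policy tiers listed above.
function budgetPolicy(remainingFraction: number): BudgetPolicy {
  if (remainingFraction > 0.5) {
    return "deploy-freely";
  }
  if (remainingFraction >= 0.1) {
    return "prioritize-reliability";
  }
  return "freeze-non-critical";
}
```

As a worked example, a 99.9% target over a 30-day window allows 43,200 × 0.001 = 43.2 minutes of downtime.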
6. Tooling
| Tool | Version | Purpose | Hosted |
|---|---|---|---|
| {{METRICS_TOOL}} | {{VERSION}} | Metrics collection & storage | {{HOSTING}} |
| {{LOG_TOOL}} | {{VERSION}} | Log aggregation | {{HOSTING}} |
| {{TRACE_TOOL}} | {{VERSION}} | Distributed tracing | {{HOSTING}} |
| {{DASHBOARD_TOOL}} | {{VERSION}} | Visualization | {{HOSTING}} |
| {{ALERT_TOOL}} | {{VERSION}} | Alert routing & on-call | {{HOSTING}} |
Related Documents
- Deployment Architecture
- Disaster Recovery Plan
- BetterStack Setup Guide
- Slack Alerting
- Incident Report
- Operational Runbook
- SLA Report
Approval
| Role | Name | Date | Signature |
|---|---|---|---|
| Author | | | |
| Reviewer | | | |
| Approver | | | |