

Drop — System Support & Observability Plan

Client: Drop (Digital Banking)
Executing Company: FlowForge (DevOps & Infrastructure)
Support Company: HelixSupport (Production Support & SLA)
Status: DRAFT, awaiting CEO approval
Created: 2026-02-20


Current State (AS-IS)

Drop already has:

  • ✅ Structured JSON logging (custom logger, request ID tracking)
  • ✅ Health check endpoint (/api/health with DB verification)
  • ✅ Slack alerting (error spike detection, lifecycle events)
  • ✅ Container health checks (Docker + AWS App Runner)
  • ✅ Automated SQLite backup (WAL-safe, 30-day retention)
  • ✅ CI/CD with security scanning (Trivy, npm audit)
  • ✅ Terraform IaC (AWS + Cloudflare)

Drop does not have:

  • ❌ External uptime monitoring
  • ❌ Log aggregation (logs exist only in docker logs)
  • ❌ Error tracking (Sentry was removed)
  • ❌ DB performance monitoring
  • ❌ Business metrics dashboard
  • ❌ APM / distributed tracing
  • ❌ Alerting escalation (Slack only)

Target State (TO-BE)

Tier 1: Essential (Week 1)

Cost: ~$0 — free tiers

Component | Tool | Why
Uptime monitoring | BetterStack (free: 10 monitors) | Independent of AWS, so you know about an outage BEFORE users do
Error tracking | Sentry (free: 5K events/mo) | Stack traces, user context, release tracking
Log shipping | AWS CloudWatch Logs (App Runner native) | Searchable logs, retention, metric filters

Tier 2: Visibility (Week 2)

Cost: ~$0-20/mo

Component | Tool | Why
DB monitoring | RDS Performance Insights (free tier) | Slow queries, connection pool, wait events
CDN analytics | Cloudflare Analytics (free) | Traffic patterns, threats, cache hit rate
Alerting escalation | BetterStack On-call (free: 1 team) | Slack → Email → SMS escalation chain

Tier 3: Intelligence (Week 3-4)

Cost: ~$0-50/mo

Component | Tool | Why
Business metrics | Custom endpoint + Grafana Cloud (free: 10K metrics) | Tx/hour, success rate, revenue
Application metrics | Prometheus client in app | Request latency, error rate, saturation
Dashboards | Grafana Cloud (free tier) | SLO tracking, operational dashboards

Tier 4: Advanced (Future)

When needed (production scale)

Component | Tool | Why
Distributed tracing | OpenTelemetry → Grafana Tempo | Cross-service request flow
Log aggregation | Grafana Loki | Centralized searchable logs
Chaos engineering | Manual game days | Resilience validation

FlowForge Execution Plan

Phase 1: Plan (this document)

  • AS-IS analysis of the Drop infrastructure
  • Tool selection (free-first approach)
  • Tiered rollout plan
  • CEO approval

Gate: Alem says GO

Phase 2: Provision (Week 1)

FlowForge SRE → implementation:

2a. BetterStack Uptime (30 min)

  • Create a BetterStack account (free tier)
  • Monitors: health endpoint (60s), landing page (5min), API (60s)
  • Alerting: Slack #drop-ops → email → SMS
  • Status page: public URL for clients

2b. Sentry Re-integration (1-2h)

  • npm install @sentry/nextjs in drop-app
  • DSN stored in AWS Secrets Manager
  • Source maps upload in CI/CD
  • Error boundary in React components
  • Server-side error capturing in API routes
  • Alert rules: new issue → Slack, spike → email

2c. CloudWatch Logs (1h)

  • App Runner → CloudWatch Logs (native, just enable)
  • Log group: /drop/production
  • Retention: 30 days
  • Metric filters: ERROR count, latency p99, 5xx count
  • CloudWatch Alarm: 5xx > 5/min → SNS → Slack
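The 5xx alarm rule above amounts to a per-minute count over structured log lines. The following is a minimal sketch of that logic, assuming each log line is a JSON object with hypothetical `timestamp` (ISO 8601, UTC) and `status` fields. The plan does not show Drop's actual logger schema, and in production the counting would be done by a CloudWatch metric filter and alarm rather than by application code.

```typescript
// Hypothetical shape of one structured log entry; Drop's real field names may differ.
interface LogEntry {
  timestamp: string; // ISO 8601 in UTC, e.g. "2026-02-20T10:15:30Z"
  status: number; // HTTP response status code
}

// Count 5xx responses per UTC minute. Lines that are not JSON objects are skipped.
function count5xxPerMinute(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // non-JSON output such as a startup banner
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const { timestamp, status } = parsed as Partial<LogEntry>;
    if (typeof timestamp !== "string" || typeof status !== "number") continue;
    if (status < 500 || status > 599) continue;
    const minute = timestamp.slice(0, 16); // "YYYY-MM-DDTHH:MM"
    counts.set(minute, (counts.get(minute) ?? 0) + 1);
  }
  return counts;
}

// Minutes whose 5xx count exceeds the threshold (plan value: more than 5 per minute).
function minutesInAlarm(counts: Map<string, number>, threshold: number = 5): string[] {
  return Array.from(counts.entries())
    .filter(([, count]) => count > threshold)
    .map(([minute]) => minute)
    .sort();
}
```

Running this against a sample of exported logs is one way to sanity-check the threshold before the real alarm is enabled, for example to see how often 5 per minute would have fired in the past week.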

Phase 3: Deploy (Week 2)

3a. RDS Performance Insights (15 min)

  • Enable in Terraform (performance_insights_enabled = true)
  • Terraform apply

3b. Cloudflare Analytics (15 min)

  • Already active; just verify that it works
  • Webhook for DDoS alerts → Slack

3c. Alerting Escalation (30 min)

  • BetterStack On-call team setup
  • Escalation: Slack (0min) → Email (5min) → SMS (15min)
  • On-call schedule: John (primary)
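The escalation timing above can be written down as data plus one lookup function. This is an illustrative sketch only: BetterStack configures escalation through its own on-call policies, and the names below (`ESCALATION_CHAIN`, `channelsNotified`) are hypothetical. It also assumes that acknowledging an alert stops any step that has not yet fired.

```typescript
type Channel = "slack" | "email" | "sms";

interface EscalationStep {
  channel: Channel;
  afterMinutes: number; // minutes after the alert fired
}

// Chain from the plan: Slack at 0 min, email at 5 min, SMS at 15 min.
const ESCALATION_CHAIN: EscalationStep[] = [
  { channel: "slack", afterMinutes: 0 },
  { channel: "email", afterMinutes: 5 },
  { channel: "sms", afterMinutes: 15 },
];

// Channels that have been notified for an alert that fired `minutesOpen` minutes ago.
// If the alert was acknowledged, only steps scheduled strictly before the
// acknowledgement have fired.
function channelsNotified(minutesOpen: number, acknowledgedAtMinute?: number): Channel[] {
  return ESCALATION_CHAIN
    .filter((step) => step.afterMinutes <= minutesOpen)
    .filter((step) => acknowledgedAtMinute === undefined || step.afterMinutes < acknowledgedAtMinute)
    .map((step) => step.channel);
}
```

For example, an alert that is still unacknowledged after 7 minutes has reached Slack and email, while one acknowledged at minute 3 never leaves Slack.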

Phase 4: Monitor (Week 3-4)

4a. Business Metrics Endpoint (2-3h)

  • New API route: GET /api/metrics/business
  • Metrics: transactions/hour, success rate, active users, avg amount
  • Prometheus format output
  • Grafana Cloud dashboard
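To make "Prometheus format output" concrete, here is a dependency-free sketch of how the business endpoint's response body might be rendered. The metric names (`drop_transactions_last_hour` and so on) and the `BusinessSnapshot` shape are assumptions made for illustration; the plan does not define them, and how the snapshot is queried from the database is left out.

```typescript
// Hypothetical input: values the route handler would compute from the database.
interface BusinessSnapshot {
  transactionsLastHour: number;
  successfulLastHour: number;
  activeUsers: number;
  successfulAmountLastHour: number; // sum of amounts of successful transactions
}

// Render the snapshot as gauges in the Prometheus text exposition format.
function renderBusinessMetrics(s: BusinessSnapshot): string {
  // With no traffic there are no failures, so the ratio is reported as 1 (an assumption).
  const successRatio =
    s.transactionsLastHour === 0 ? 1 : s.successfulLastHour / s.transactionsLastHour;
  const avgAmount =
    s.successfulLastHour === 0 ? 0 : s.successfulAmountLastHour / s.successfulLastHour;

  const gauges: Array<{ name: string; help: string; value: number }> = [
    { name: "drop_transactions_last_hour", help: "Transactions in the last hour", value: s.transactionsLastHour },
    { name: "drop_transaction_success_ratio", help: "Share of successful transactions (0-1)", value: successRatio },
    { name: "drop_active_users", help: "Currently active users", value: s.activeUsers },
    { name: "drop_transaction_amount_avg", help: "Average amount of successful transactions", value: avgAmount },
  ];

  return gauges
    .map((g) => `# HELP ${g.name} ${g.help}\n# TYPE ${g.name} gauge\n${g.name} ${g.value}\n`)
    .join("");
}
```

A route handler would return this string with the content type `text/plain; version=0.0.4`. Because the endpoint exposes revenue-related figures, it would likely need to sit behind authentication or a private network path.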

4b. Application Metrics (2-3h)

  • prom-client npm package
  • Metrics: http_request_duration, http_requests_total, db_query_duration
  • GET /metrics endpoint (Prometheus scrape)
  • Ship scraped metrics to Grafana Cloud via Prometheus remote write
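The request-duration metric above is typically a histogram. The plan uses prom-client for this, which provides a histogram type; the hand-rolled class below is not that library. It is a dependency-free sketch of what a histogram records and what a scrape of /metrics would return for it. The bucket bounds and the `_seconds` suffix on the metric name are assumptions.

```typescript
// Minimal Prometheus-style histogram: cumulative buckets plus sum and count.
class LatencyHistogram {
  private readonly name: string;
  private readonly buckets: Array<{ le: number; count: number }>;
  private sum = 0;
  private count = 0;

  constructor(name: string, bounds: number[]) {
    this.name = name;
    this.buckets = bounds
      .slice()
      .sort((a, b) => a - b)
      .map((le) => ({ le, count: 0 }));
  }

  // Record one observation. Buckets are cumulative, so every bucket whose
  // upper bound is at or above the value is incremented.
  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    for (const bucket of this.buckets) {
      if (seconds <= bucket.le) bucket.count += 1;
    }
  }

  // Render in the Prometheus text exposition format.
  render(): string {
    const lines = [`# TYPE ${this.name} histogram`];
    for (const bucket of this.buckets) {
      lines.push(`${this.name}_bucket{le="${bucket.le}"} ${bucket.count}`);
    }
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n") + "\n";
  }
}
```

Bucket bounds matter for the latency SLO: a p99 target of 500 ms can only be estimated accurately if there is a bucket boundary at or near 0.5 seconds.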

4c. SLO Dashboard (1-2h)

  • Grafana dashboard with:
    • Uptime SLO (target: 99.9%)
    • Latency SLO (p99 < 500ms)
    • Error rate SLO (< 0.1%)
    • Business KPIs
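The SLO targets above translate into an error budget and three pass/fail checks. The sketch below works through that arithmetic, assuming a 30-day evaluation window (the plan does not state one). All names are hypothetical.

```typescript
// Targets taken from the SLO list: 99.9% uptime, p99 under 500 ms, error rate under 0.1%.
const SLO_TARGETS = {
  uptimeRatio: 0.999,
  p99LatencyMs: 500,
  errorRatio: 0.001,
};

// Allowed downtime in minutes for a given uptime target and window length.
function errorBudgetMinutes(uptimeTarget: number, windowDays: number): number {
  return (1 - uptimeTarget) * windowDays * 24 * 60;
}

interface SloMeasurement {
  uptimeRatio: number; // fraction of the window the service was up
  p99LatencyMs: number;
  errorRatio: number; // failed requests / total requests
}

interface SloStatus {
  uptimeMet: boolean;
  latencyMet: boolean;
  errorRateMet: boolean;
}

// Compare measured values against the targets.
function evaluateSlos(m: SloMeasurement): SloStatus {
  return {
    uptimeMet: m.uptimeRatio >= SLO_TARGETS.uptimeRatio,
    latencyMet: m.p99LatencyMs < SLO_TARGETS.p99LatencyMs,
    errorRateMet: m.errorRatio < SLO_TARGETS.errorRatio,
  };
}
```

Under the 30-day assumption, `errorBudgetMinutes(0.999, 30)` comes to roughly 43.2 minutes. A single outage that runs to the full 1 h P1 recovery target would therefore exceed the monthly budget on its own, which is worth keeping in mind when tuning detection and response times.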

Phase 5: Optimize (Ongoing)

HelixSupport takes over:

  • Weekly metrics review
  • Incident response per SLA (P1: 15min, P2: 1h)
  • Post-mortem for every P1/P2
  • Monthly SLA report for Alem
  • Monitoring tuning (alert thresholds, noise reduction)

Cost Summary

Item | Monthly Cost
BetterStack Uptime (free tier) | $0
Sentry (free tier, 5K events) | $0
CloudWatch Logs (App Runner) | ~$5
RDS Performance Insights (free) | $0
Grafana Cloud (free tier) | $0
Total | ~$5/mo

When Drop scales, upgrade to paid tiers (~$50-100/mo).


Success Metrics

Metric | Target
MTTD (Mean Time to Detect) | < 5 min
MTTR (Mean Time to Recover) | < 1h (P1)
Uptime SLO | 99.9%
Undetected outages | 0
Alert noise (false positives) | < 10%
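MTTD and MTTR can be computed from incident records for the weekly review and the monthly SLA report. This is a minimal sketch, assuming a hypothetical `Incident` record with three ISO 8601 timestamps and assuming MTTR is measured from the start of the outage. Some teams measure it from detection instead, so the definition should be agreed before reporting.

```typescript
// Hypothetical incident record; timestamps are ISO 8601 strings in UTC.
interface Incident {
  startedAt: string; // when the outage actually began
  detectedAt: string; // when the first alert fired
  resolvedAt: string; // when service was restored
}

function minutesBetween(from: string, to: string): number {
  return (Date.parse(to) - Date.parse(from)) / 60000;
}

// Arithmetic mean, or null when there is nothing to average.
function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((total, v) => total + v, 0) / values.length;
}

// Mean Time to Detect: average of (detectedAt - startedAt), in minutes.
function mttdMinutes(incidents: Incident[]): number | null {
  return mean(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt)));
}

// Mean Time to Recover: average of (resolvedAt - startedAt), in minutes.
function mttrMinutes(incidents: Incident[]): number | null {
  return mean(incidents.map((i) => minutesBetween(i.startedAt, i.resolvedAt)));
}
```

Returning `null` for an empty list keeps a month with no incidents from being reported as an MTTR of zero.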

Dependencies

  • AWS credentials (Terraform apply access)
  • BetterStack account creation
  • Sentry project creation
  • Grafana Cloud account creation

Approval

CEO Decision Required:

  • Approve plan (GO / NO-GO / MODIFY)
  • Approve tool selection (BetterStack, Sentry, Grafana)
  • Approve timeline (4 weeks)
  • Budget confirmation (~$5/mo)