Drop — System Support & Observability Plan

Client: Drop (Digital Banking)
Executing Company: FlowForge (DevOps & Infrastructure)
Support Company: HelixSupport (Production Support & SLA)
Status: DRAFT, awaiting CEO approval
Created: 2026-02-20

Current State (AS-IS)

Drop already has:

✅ Structured JSON logging (custom logger, request ID tracking)
✅ Health check endpoint (/api/health with DB verification)
✅ Slack alerting (error spike detection, lifecycle events)
✅ Container health checks (Docker + AWS App Runner)
✅ Automated SQLite backup (WAL-safe, 30-day retention)
✅ CI/CD with security scanning (Trivy, npm audit)
✅ Terraform IaC (AWS + Cloudflare)

Drop lacks:

❌ External uptime monitoring
❌ Log aggregation (logs exist only in docker logs)
❌ Error tracking (Sentry was removed)
❌ DB performance monitoring
❌ Business metrics dashboard
❌ APM / distributed tracing
❌ Alerting escalation (Slack only)

Target State (TO-BE)

Tier 1: Essential (Week 1)

Cost: ~$0-5/mo (free tiers; CloudWatch Logs ~$5, per the Cost Summary)

Component	Tool	Why
Uptime monitoring	BetterStack (free: 10 monitors)	Independent of AWS, so you learn about an outage BEFORE users do
Error tracking	Sentry (free: 5K events/mo)	Stack traces, user context, release tracking
Log shipping	AWS CloudWatch Logs (App Runner native)	Searchable logs, retention, metric filters

Tier 2: Visibility (Week 2)

Cost: ~$0-20/mo

Component	Tool	Why
DB monitoring	RDS Performance Insights (free tier)	Slow queries, connection pool, wait events
CDN analytics	Cloudflare Analytics (free)	Traffic patterns, threats, cache hit rate
Alerting escalation	BetterStack On-call (free: 1 team)	Slack → Email → SMS escalation chain

Tier 3: Intelligence (Week 3-4)

Cost: ~$0-50/mo

Component	Tool	Why
Business metrics	Custom endpoint + Grafana Cloud (free: 10K metrics)	Tx/hour, success rate, revenue
Application metrics	Prometheus client in app	Request latency, error rate, saturation
Dashboards	Grafana Cloud (free tier)	SLO tracking, operational dashboards

Tier 4: Advanced (Future)

When needed (production scale)

Component	Tool	Why
Distributed tracing	OpenTelemetry → Grafana Tempo	Cross-service request flow
Log aggregation	Grafana Loki	Centralized searchable logs
Chaos engineering	Manual game days	Resilience validation

FlowForge Execution Plan

Phase 1: Plan (this document)

AS-IS analysis of Drop infrastructure
Tool selection (free-first approach)
Tiered rollout plan
CEO approval

Gate: Alem says GO

Phase 2: Provision (Week 1)

FlowForge SRE → implementation:

2a. BetterStack Uptime (30 min)

Create a BetterStack account (free tier)
Monitors: health endpoint (60s), landing page (5min), API (60s)
Alerting: Slack #drop-ops → email → SMS
Status page: public URL for clients

2b. Sentry Re-integration (1-2h)

npm install @sentry/nextjs in drop-app
DSN stored in AWS Secrets Manager
Source maps upload in CI/CD
Error boundary in React components
Server-side error capturing in API routes
Alert rules: new issue → Slack, spike → email
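
The re-integration steps above could start from a small options builder like the sketch below. It is an illustration, not the project's actual configuration: the environment variable names (SENTRY_DSN, APP_ENV, GIT_SHA) and the 10% sample rate are assumptions. The returned fields (dsn, environment, release, tracesSampleRate, enabled) are option names that Sentry.init accepts.

```typescript
// Hypothetical environment values available at startup (names are assumptions).
interface SentryEnv {
  SENTRY_DSN?: string; // injected from AWS Secrets Manager
  APP_ENV?: string;    // e.g. "production" or "staging"
  GIT_SHA?: string;    // release identifier set by CI/CD
}

interface SentryOptions {
  dsn: string | undefined;
  environment: string;
  release: string | undefined;
  tracesSampleRate: number;
  enabled: boolean;
}

function buildSentryOptions(env: SentryEnv): SentryOptions {
  const environment = env.APP_ENV ?? "development";
  return {
    dsn: env.SENTRY_DSN,
    environment,
    release: env.GIT_SHA,
    // Sample 10% of transactions in production to limit event volume (assumed rate).
    tracesSampleRate: environment === "production" ? 0.1 : 1.0,
    // Stay silent when no DSN is configured, e.g. on a developer machine.
    enabled: Boolean(env.SENTRY_DSN),
  };
}
```

In a Next.js app the result would typically be passed to the SDK in the server and client config files, for example Sentry.init(buildSentryOptions(process.env)).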

2c. CloudWatch Logs (1h)

App Runner → CloudWatch Logs (native, only needs enabling)
Log group: /drop/production
Retention: 30 days
Metric filters: ERROR count, latency p99, 5xx count
CloudWatch Alarm: 5xx > 5/min → SNS → Slack
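
To make the "5xx > 5/min" rule concrete, the sketch below counts 5xx responses per minute from structured JSON log lines and reports the minutes that would breach the threshold. In production this logic lives in a CloudWatch metric filter plus alarm, not in application code. The log field names timestamp and status are assumptions about the custom logger's output.

```typescript
// Assumed shape of one structured log line (field names are hypothetical).
interface LogEntry {
  timestamp: string; // ISO 8601, e.g. "2026-02-20T10:15:30.000Z"
  status: number;    // HTTP status code
}

// Parse one JSON log line; return null for lines that are not valid entries.
function parseLogLine(line: string): LogEntry | null {
  try {
    const obj = JSON.parse(line);
    if (typeof obj?.timestamp === "string" && typeof obj?.status === "number") {
      return { timestamp: obj.timestamp, status: obj.status };
    }
  } catch {
    // Not JSON: ignore the line.
  }
  return null;
}

// Count 5xx responses per UTC minute, keyed by "YYYY-MM-DDTHH:MM".
function count5xxPerMinute(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    const entry = parseLogLine(line);
    if (entry === null || entry.status < 500 || entry.status > 599) continue;
    const minute = entry.timestamp.slice(0, 16);
    counts[minute] = (counts[minute] ?? 0) + 1;
  }
  return counts;
}

// Minutes whose 5xx count is strictly above the threshold (plan value: 5).
function minutesInAlarm(lines: string[], threshold: number = 5): string[] {
  const counts = count5xxPerMinute(lines);
  return Object.keys(counts).filter((minute) => counts[minute] > threshold);
}
```

The comparison is strict, so exactly five 5xx responses in a minute do not raise the alarm. If the logger does emit a numeric status field, the matching metric filter pattern would look something like { $.status >= 500 }.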

Phase 3: Deploy (Week 2)

3a. RDS Performance Insights (15 min)

Enable in Terraform (performance_insights_enabled = true); this assumes the database runs on Amazon RDS, whereas the AS-IS state lists SQLite backups
Terraform apply

3b. Cloudflare Analytics (15 min)

Already active; just verify that it works
Webhook for DDoS alerts → Slack

3c. Alerting Escalation (30 min)

BetterStack On-call team setup
Escalation: Slack (0min) → Email (5min) → SMS (15min)
On-call schedule: John (primary)
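
The escalation chain above can be read as a small policy table. The sketch below models it to show which channels should have fired for an alert that has been open for a given number of minutes. BetterStack configures this in its own UI, so the types and function names here are purely illustrative.

```typescript
type Channel = "slack" | "email" | "sms";

interface EscalationStep {
  channel: Channel;
  afterMinutes: number;
}

// Delays taken from the plan: Slack at 0 min, email at 5 min, SMS at 15 min.
const ESCALATION_POLICY: EscalationStep[] = [
  { channel: "slack", afterMinutes: 0 },
  { channel: "email", afterMinutes: 5 },
  { channel: "sms", afterMinutes: 15 },
];

// Channels that should have been notified for an alert open for `minutesOpen`
// minutes. Escalation stops at acknowledgement; a step due exactly at the
// acknowledgement minute still fires in this sketch.
function channelsNotified(minutesOpen: number, acknowledgedAt?: number): Channel[] {
  const cutoff =
    acknowledgedAt === undefined ? minutesOpen : Math.min(minutesOpen, acknowledgedAt);
  return ESCALATION_POLICY
    .filter((step) => step.afterMinutes <= cutoff)
    .map((step) => step.channel);
}
```

For example, channelsNotified(7) yields Slack and email, while channelsNotified(20, 3) yields only Slack because the alert was acknowledged before the email step.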

Phase 4: Monitor (Week 3-4)

4a. Business Metrics Endpoint (2-3h)

New API route: GET /api/metrics/business
Metrics: transactions/hour, success rate, active users, avg amount
Prometheus format output
Grafana Cloud dashboard
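
A minimal sketch of the "Prometheus format output" step follows. It renders the four business metrics as gauges in the Prometheus text exposition format. The metric names (drop_ prefix), the input shape, and the choice to report a success ratio of 1 when there were no transactions are all assumptions made for illustration.

```typescript
// Hypothetical snapshot produced by a database query (shape is an assumption).
interface BusinessMetrics {
  transactionsLastHour: number;
  successfulLastHour: number;
  activeUsers: number;
  avgAmount: number;
}

function toPrometheusText(m: BusinessMetrics): string {
  // With zero transactions there were no failures; report 1 (assumed convention).
  const successRatio =
    m.transactionsLastHour === 0 ? 1 : m.successfulLastHour / m.transactionsLastHour;

  // [metric name, help text, value]
  const gauges: Array<[string, string, number]> = [
    ["drop_transactions_last_hour", "Transactions in the last hour", m.transactionsLastHour],
    ["drop_transaction_success_ratio", "Share of successful transactions (0-1)", successRatio],
    ["drop_active_users", "Currently active users", m.activeUsers],
    ["drop_transaction_amount_avg", "Average transaction amount", m.avgAmount],
  ];

  const out: string[] = [];
  for (const [metric, help, value] of gauges) {
    out.push(`# HELP ${metric} ${help}`);
    out.push(`# TYPE ${metric} gauge`);
    out.push(`${metric} ${value}`);
  }
  // The exposition format expects a trailing newline.
  return out.join("\n") + "\n";
}
```

The route handler would return this string with the content type text/plain; version=0.0.4. Because the endpoint exposes revenue-adjacent figures, it likely needs authentication before a scraper is pointed at it.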

4b. Application Metrics (2-3h)

prom-client npm package
Metrics: http_request_duration, http_requests_total, db_query_duration
GET /metrics endpoint (Prometheus scrape)
Prometheus remote write → Grafana Cloud
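
The prom-client package would provide the histogram for request duration. The sketch below is not prom-client: it is a dependency-free stand-in that shows what such a histogram records and how the scrape output is shaped. The bucket bounds are assumptions chosen around the 500 ms latency target, and the _seconds suffix follows the Prometheus naming convention for base units.

```typescript
// Bucket upper bounds in seconds (assumed values).
const BUCKETS: number[] = [0.1, 0.25, 0.5, 1, 2.5];

class DurationHistogram {
  private bucketCounts: number[] = BUCKETS.map(() => 0);
  private sum = 0;
  private count = 0;

  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: an observation counts in every bucket
    // whose upper bound is greater than or equal to the value.
    BUCKETS.forEach((bound, i) => {
      if (seconds <= bound) this.bucketCounts[i] += 1;
    });
  }

  render(metric: string): string {
    const out: string[] = [`# TYPE ${metric} histogram`];
    BUCKETS.forEach((bound, i) => {
      out.push(`${metric}_bucket{le="${bound}"} ${this.bucketCounts[i]}`);
    });
    // The +Inf bucket always equals the total observation count.
    out.push(`${metric}_bucket{le="+Inf"} ${this.count}`);
    out.push(`${metric}_sum ${this.sum}`);
    out.push(`${metric}_count ${this.count}`);
    return out.join("\n") + "\n";
  }
}
```

A request middleware would call observe() with the elapsed time of each request, and GET /metrics would return render("http_request_duration_seconds"). Having a bucket boundary at exactly 0.5 s makes the share of requests under the latency target directly readable from the histogram.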

4c. SLO Dashboard (1-2h)

Grafana dashboard with:
- Uptime SLO (target: 99.9%)
- Latency SLO (p99 < 500ms)
- Error rate SLO (< 0.1%)
- Business KPIs
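
The SLO targets above translate into concrete numbers. As a worked sketch, a 99.9% uptime target over a 30-day window allows 30 × 24 × 60 × 0.001 = 43.2 minutes of downtime. The helper functions below are hypothetical, and the 30-day window and nearest-rank percentile method are assumptions; Grafana would compute the equivalents from the stored metrics.

```typescript
// Allowed downtime, in minutes, for an uptime SLO over a rolling window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of the error budget still unspent (negative once the SLO is breached).
function budgetRemaining(sloPercent: number, windowDays: number, downtimeMinutes: number): number {
  const budget = errorBudgetMinutes(sloPercent, windowDays);
  return (budget - downtimeMinutes) / budget;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Latency SLO from the plan: p99 below 500 ms.
function meetsLatencySlo(latenciesMs: number[], targetMs: number = 500): boolean {
  return percentile(latenciesMs, 99) < targetMs;
}
```

With these definitions, a single 20-minute outage in a month spends a little under half of the 43.2-minute budget.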

Phase 5: Optimize (Ongoing)

HelixSupport takes over:

Weekly metrics review
Incident response per SLA (P1: 15min, P2: 1h)
Post-mortem for every P1/P2
Monthly SLA report for Alem
Monitoring tuning (alert thresholds, noise reduction)

Cost Summary

Item	Monthly Cost
BetterStack Uptime (free tier)	$0
Sentry (free tier, 5K events)	$0
CloudWatch Logs (App Runner)	~$5
RDS Performance Insights (free)	$0
Grafana Cloud (free tier)	$0
Total	~$5/mo

When Drop scales → upgrade to paid tiers (~$50-100/mo).

Success Metrics

Metric	Target
MTTD (Mean Time to Detect)	< 5 min
MTTR (Mean Time to Recover)	< 1h (P1)
Uptime SLO	99.9%
Undetected outages	0
Alert noise (false positives)	< 10%
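
For the monthly SLA report, MTTD and MTTR can be derived from incident timestamps. The sketch below assumes a hypothetical incident record with three ISO 8601 timestamps, and it measures MTTR from the start of the fault; some teams measure it from detection instead, so the definition should be agreed before reporting.

```typescript
// Hypothetical incident record (field names are assumptions).
interface Incident {
  startedAt: string;  // when the fault began
  detectedAt: string; // when the first alert fired
  resolvedAt: string; // when service was restored
}

function minutesBetween(from: string, to: string): number {
  return (Date.parse(to) - Date.parse(from)) / 60000;
}

// Arithmetic mean; NaN when there are no incidents to average.
function meanOf(values: number[]): number {
  if (values.length === 0) return NaN;
  return values.reduce((acc, v) => acc + v, 0) / values.length;
}

// Mean Time to Detect, in minutes.
function mttdMinutes(incidents: Incident[]): number {
  return meanOf(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt)));
}

// Mean Time to Recover, in minutes, measured from fault start (assumed definition).
function mttrMinutes(incidents: Incident[]): number {
  return meanOf(incidents.map((i) => minutesBetween(i.startedAt, i.resolvedAt)));
}
```

The results compare directly against the targets in the table: MTTD under 5 minutes and, for P1 incidents, MTTR under 60 minutes.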

Dependencies

AWS credentials (Terraform apply access)
BetterStack account creation
Sentry project creation
Grafana Cloud account creation

Approval

CEO Decision Required:

Approve plan (GO / NO-GO / MODIFY)
Approve tool selection (BetterStack, Sentry, Grafana)
Approve timeline (4 weeks)
Budget confirmation (~$5/mo)