Drop — System Support & Observability Plan
Client: Drop (Digital Banking)
Executing Company: FlowForge (DevOps & Infrastructure)
Support Company: HelixSupport (Production Support & SLA)
Status: DRAFT (awaiting CEO approval)
Created: 2026-02-20
Current State (AS-IS)
Drop already has:
- ✅ Structured JSON logging (custom logger, request ID tracking)
- ✅ Health check endpoint (/api/health with DB verification)
- ✅ Slack alerting (error spike detection, lifecycle events)
- ✅ Container health checks (Docker + AWS App Runner)
- ✅ Automated SQLite backup (WAL-safe, 30-day retention)
- ✅ CI/CD with security scanning (Trivy, npm audit)
- ✅ Terraform IaC (AWS + Cloudflare)
Drop does not have:
- ❌ External uptime monitoring
- ❌ Log aggregation (logs only in docker logs)
- ❌ Error tracking (Sentry was removed)
- ❌ DB performance monitoring
- ❌ Business metrics dashboard
- ❌ APM / distributed tracing
- ❌ Alerting escalation (Slack only)
Target State (TO-BE)
Tier 1: Essential (Week 1)
Cost: ~$0 — free tiers
| Component | Tool | Why |
|---|---|---|
| Uptime monitoring | BetterStack (free: 10 monitors) | Independent of AWS, so you know about an outage BEFORE users do |
| Error tracking | Sentry (free: 5K events/mo) | Stack traces, user context, release tracking |
| Log shipping | AWS CloudWatch Logs (App Runner native) | Searchable logs, retention, metric filters |
Tier 2: Visibility (Week 2)
Cost: ~$0-20/mo
| Component | Tool | Why |
|---|---|---|
| DB monitoring | RDS Performance Insights (free tier) | Slow queries, connection pool, wait events |
| CDN analytics | Cloudflare Analytics (free) | Traffic patterns, threats, cache hit rate |
| Alerting escalation | BetterStack On-call (free: 1 team) | Slack → Email → SMS escalation chain |
Tier 3: Intelligence (Week 3-4)
Cost: ~$0-50/mo
| Component | Tool | Why |
|---|---|---|
| Business metrics | Custom endpoint + Grafana Cloud (free: 10K metrics) | Tx/hour, success rate, revenue |
| Application metrics | Prometheus client in app | Request latency, error rate, saturation |
| Dashboards | Grafana Cloud (free tier) | SLO tracking, operational dashboards |
Tier 4: Advanced (Future)
When the need arises (production scale)
| Component | Tool | Why |
|---|---|---|
| Distributed tracing | OpenTelemetry → Grafana Tempo | Cross-service request flow |
| Log aggregation | Grafana Loki | Centralized searchable logs |
| Chaos engineering | Manual game days | Resilience validation |
FlowForge Execution Plan
Phase 1: Plan (this document)
- AS-IS analysis of Drop infrastructure
- Tool selection (free-first approach)
- Tiered rollout plan
- CEO approval
Gate: Alem says GO
Phase 2: Provision (Week 1)
FlowForge SRE → implementation:
2a. BetterStack Uptime (30 min)
- Create a BetterStack account (free tier)
- Monitors: health endpoint (60s), landing page (5min), API (60s)
- Alerting: Slack #drop-ops → email → SMS
- Status page: public URL for clients
2b. Sentry Re-integration (1-2h)
- npm install @sentry/nextjs in drop-app
- DSN in AWS Secrets Manager
- Source maps upload in CI/CD
- Error boundary in React components
- Server-side error capturing in API routes
- Alert rules: new issue → Slack, spike → email
2c. CloudWatch Logs (1h)
- App Runner → CloudWatch Logs (native, only needs enabling)
- Log group: /drop/production
- Retention: 30 days
- Metric filters: ERROR count, latency p99, 5xx count
- CloudWatch Alarm: 5xx > 5/min → SNS → Slack
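The "5xx > 5/min" alarm rule can be checked against sample logs before it is configured in CloudWatch. The following is a minimal TypeScript sketch of that rule, not the CloudWatch implementation. It assumes each log line is a JSON object with `timestamp` (ISO 8601) and `status` fields; both field names are hypothetical and may not match Drop's actual logger schema.

```typescript
// Hypothetical shape of one structured log line; Drop's real field names may differ.
interface LogEntry {
  timestamp: string; // ISO 8601, e.g. "2026-02-20T10:00:05Z"
  status: number;    // HTTP status code
}

// Parse one log line, returning null for anything that is not a usable entry.
function parseEntry(line: string): LogEntry | null {
  try {
    const parsed: unknown = JSON.parse(line);
    if (typeof parsed !== "object" || parsed === null) return null;
    const candidate = parsed as { timestamp?: unknown; status?: unknown };
    if (typeof candidate.timestamp !== "string" || typeof candidate.status !== "number") {
      return null;
    }
    return { timestamp: candidate.timestamp, status: candidate.status };
  } catch {
    return null; // not valid JSON
  }
}

// Count 5xx responses per one-minute bucket, keyed by the bucket start (ms since epoch).
function count5xxPerMinute(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    const entry = parseEntry(line);
    if (entry === null || entry.status < 500 || entry.status > 599) continue;
    const time = Date.parse(entry.timestamp);
    if (isNaN(time)) continue; // unparseable timestamp
    const bucket = String(Math.floor(time / 60000) * 60000);
    counts[bucket] = (counts[bucket] || 0) + 1;
  }
  return counts;
}

// Return the minute buckets that exceed the threshold (default mirrors "5xx > 5/min").
function breachingMinutes(lines: string[], threshold: number = 5): number[] {
  const counts = count5xxPerMinute(lines);
  return Object.keys(counts)
    .filter((bucket) => (counts[bucket] || 0) > threshold)
    .map(Number)
    .sort((a, b) => a - b);
}
```

Running `breachingMinutes` over a captured slice of production logs gives a rough idea of how often the alarm would have fired, which helps tune the threshold before it starts paging anyone.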
Phase 3: Deploy (Week 2)
3a. RDS Performance Insights (15 min)
- Enable in Terraform (performance_insights_enabled = true)
- Terraform apply
3b. Cloudflare Analytics (15 min)
- Already active, only needs verification that it works
- Webhook for DDoS alerts → Slack
3c. Alerting Escalation (30 min)
- BetterStack On-call team setup
- Escalation: Slack (0min) → Email (5min) → SMS (15min)
- On-call schedule: John (primary)
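The escalation chain above amounts to a small time-based policy. The sketch below models it in TypeScript to make the timing explicit. It is an illustration only: BetterStack evaluates the real policy, and the type and function names here are hypothetical.

```typescript
type Channel = "slack" | "email" | "sms";

interface EscalationStep {
  channel: Channel;
  afterMinutes: number; // minutes an alert stays unacknowledged before this step fires
}

// Mirrors the chain in the plan: Slack (0min) → Email (5min) → SMS (15min).
const ESCALATION_POLICY: EscalationStep[] = [
  { channel: "slack", afterMinutes: 0 },
  { channel: "email", afterMinutes: 5 },
  { channel: "sms", afterMinutes: 15 },
];

// Channels that should have been notified by now for a given alert.
function channelsToNotify(
  minutesUnacknowledged: number,
  acknowledged: boolean,
  policy: EscalationStep[] = ESCALATION_POLICY
): Channel[] {
  if (acknowledged || minutesUnacknowledged < 0) return [];
  return policy
    .filter((step) => minutesUnacknowledged >= step.afterMinutes)
    .map((step) => step.channel);
}
```

For example, an alert left unacknowledged for 7 minutes yields `["slack", "email"]`, and acknowledging it at any point stops further escalation.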
Phase 4: Monitor (Week 3-4)
4a. Business Metrics Endpoint (2-3h)
- New API route: GET /api/metrics/business
- Metrics: transactions/hour, success rate, active users, avg amount
- Prometheus format output
- Grafana Cloud dashboard
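To show what "Prometheus format output" could look like for this endpoint, here is a minimal TypeScript sketch that renders gauges and counters in the Prometheus text exposition format without any library. The metric names in the usage note (such as `drop_active_users`) are hypothetical placeholders, not names defined by this plan, and the route handler wiring is left out.

```typescript
interface BusinessMetric {
  name: string;              // must match the Prometheus metric name pattern
  help: string;              // human-readable description
  type: "gauge" | "counter";
  value: number;
}

const METRIC_NAME_PATTERN = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;

// In HELP text, backslashes and line feeds must be escaped.
function escapeHelp(help: string): string {
  return help.replace(/\\/g, "\\\\").replace(/\n/g, "\\n");
}

// Render metrics as Prometheus text exposition format (one sample per metric, no labels).
function renderPrometheus(metrics: BusinessMetric[]): string {
  const lines: string[] = [];
  for (const metric of metrics) {
    if (!METRIC_NAME_PATTERN.test(metric.name)) {
      throw new Error(`Invalid metric name: ${metric.name}`);
    }
    if (!isFinite(metric.value)) {
      // Kept simple: this sketch rejects NaN and infinities instead of encoding them.
      throw new Error(`Non-finite value for metric: ${metric.name}`);
    }
    lines.push(`# HELP ${metric.name} ${escapeHelp(metric.help)}`);
    lines.push(`# TYPE ${metric.name} ${metric.type}`);
    lines.push(`${metric.name} ${metric.value}`);
  }
  return lines.join("\n") + "\n";
}
```

A handler for the business metrics route could call `renderPrometheus([{ name: "drop_active_users", help: "Active users", type: "gauge", value: 42 }])` and return the string with a `text/plain` content type. Because the output includes business figures, the endpoint would likely need authentication before being exposed to a scraper.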
4b. Application Metrics (2-3h)
- prom-client npm package
- Metrics: http_request_duration, http_requests_total, db_query_duration
- GET /metrics endpoint (Prometheus scrape)
- Grafana Cloud → Prometheus remote write
4c. SLO Dashboard (1-2h)
- Grafana dashboard with:
  - Uptime SLO (target: 99.9%)
  - Latency SLO (p99 < 500ms)
  - Error rate SLO (< 0.1%)
  - Business KPIs
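The SLO targets above translate into concrete error budgets, which is what the dashboard would track in practice. The TypeScript sketch below works through the arithmetic. The 30-day window is an assumption (the plan does not state an SLO window), and the function names are illustrative.

```typescript
// Downtime allowed by an availability SLO over a window, in minutes.
// Example: a 99.9% target over 30 days allows roughly 43.2 minutes.
function allowedDowntimeMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget > 1) {
    throw new Error("sloTarget must be in (0, 1]");
  }
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Fraction of the request-based error budget still unspent.
// 1 = untouched, 0 = exhausted, negative = SLO already breached.
function errorBudgetRemaining(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number
): number {
  const allowedFailures = (1 - sloTarget) * totalRequests;
  if (allowedFailures === 0) {
    return failedRequests === 0 ? 1 : -Infinity;
  }
  return 1 - failedRequests / allowedFailures;
}
```

With the error rate SLO of < 0.1% (a 0.999 success target), 1,000,000 requests allow about 1,000 failures, so 250 failed requests leave roughly 75% of the budget.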
Phase 5: Optimize (Ongoing)
HelixSupport takes over:
- Weekly metrics review
- Incident response per SLA (P1: 15min, P2: 1h)
- Post-mortem for every P1/P2
- Monthly SLA report for Alem
- Monitoring tuning (alert thresholds, noise reduction)
Cost Summary
| Item | Monthly Cost |
|---|---|
| BetterStack Uptime (free tier) | $0 |
| Sentry (free tier, 5K events) | $0 |
| CloudWatch Logs (App Runner) | ~$5 |
| RDS Performance Insights (free) | $0 |
| Grafana Cloud (free tier) | $0 |
| Total | ~$5/mo |
When Drop scales → upgrade to paid tiers (~$50-100/mo).
Success Metrics
| Metric | Target |
|---|---|
| MTTD (Mean Time to Detect) | < 5 min |
| MTTR (Mean Time to Recover) | < 1h (P1) |
| Uptime SLO | 99.9% |
| Undetected outages | 0 |
| Alert noise (false positives) | < 10% |
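These targets are only useful if they are computed the same way every month. The TypeScript sketch below shows one way to derive MTTD, MTTR, and alert noise from incident records. The `Incident` shape is hypothetical, and it assumes MTTR is measured from the start of the fault (not from detection); the plan does not specify which definition HelixSupport will use.

```typescript
// Hypothetical incident record; timestamps are ISO 8601 strings.
interface Incident {
  startedAt: string;  // when the fault began
  detectedAt: string; // when monitoring first alerted
  resolvedAt: string; // when service was restored
}

function minutesBetween(from: string, to: string): number {
  return (Date.parse(to) - Date.parse(from)) / 60000;
}

// Arithmetic mean, or null when there is nothing to average.
function meanOrNull(values: number[]): number | null {
  if (values.length === 0) return null;
  let sum = 0;
  for (const value of values) sum += value;
  return sum / values.length;
}

// Mean Time to Detect: fault start → first alert.
function mttdMinutes(incidents: Incident[]): number | null {
  return meanOrNull(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt)));
}

// Mean Time to Recover: fault start → service restored (assumed definition).
function mttrMinutes(incidents: Incident[]): number | null {
  return meanOrNull(incidents.map((i) => minutesBetween(i.startedAt, i.resolvedAt)));
}

// Share of alerts that were false positives; target in the table is < 0.10.
function alertNoiseRatio(totalAlerts: number, falsePositives: number): number | null {
  if (totalAlerts <= 0) return null;
  return falsePositives / totalAlerts;
}
```

Two incidents detected after 2 and 6 minutes and resolved after 30 and 60 minutes give an MTTD of 4 minutes and an MTTR of 45 minutes, both within the targets above.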
Dependencies
- AWS credentials (Terraform apply access)
- BetterStack account creation
- Sentry project creation
- Grafana Cloud account creation
Approval
CEO Decision Required:
- Approve plan (GO / NO-GO / MODIFY)
- Approve tool selection (BetterStack, Sentry, Grafana)
- Approve timeline (4 weeks)
- Budget confirmation (~$5/mo)