
# Drop — System Support & Observability Plan

**Client:** Drop (Digital Banking)
**Executing Company:** FlowForge (DevOps & Infrastructure)
**Support Company:** HelixSupport (Production Support & SLA)
**Status:** DRAFT, awaiting CEO approval
**Created:** 2026-02-20

---

## Current State (AS-IS)

Drop already has:
- ✅ Structured JSON logging (custom logger, request ID tracking)
- ✅ Health check endpoint (`/api/health` with DB verification)
- ✅ Slack alerting (error spike detection, lifecycle events)
- ✅ Container health checks (Docker + AWS App Runner)
- ✅ Automated SQLite backup (WAL-safe, 30-day retention)
- ✅ CI/CD with security scanning (Trivy, npm audit)
- ✅ Terraform IaC (AWS + Cloudflare)

Drop does not have:
- ❌ External uptime monitoring
- ❌ Log aggregation (logs are only available via `docker logs`)
- ❌ Error tracking (Sentry was removed)
- ❌ DB performance monitoring
- ❌ Business metrics dashboard
- ❌ APM / distributed tracing
- ❌ Alerting escalation (Slack only)

---

## Target State (TO-BE)

### Tier 1: Essential (Week 1)
**Cost: ~$0-5/mo (free tiers; CloudWatch Logs is the only paid item)**

| Component | Tool | Why |
|-----------|------|-----|
| Uptime monitoring | BetterStack (free: 10 monitors) | Independent of AWS, so an outage is detected BEFORE users notice it |
| Error tracking | Sentry (free: 5K events/mo) | Stack traces, user context, release tracking |
| Log shipping | AWS CloudWatch Logs (App Runner native) | Searchable logs, retention, metric filters |

### Tier 2: Visibility (Week 2)
**Cost: ~$0-20/mo**

| Component | Tool | Why |
|-----------|------|-----|
| DB monitoring | RDS Performance Insights (free tier) | Slow queries, connection pool, wait events |
| CDN analytics | Cloudflare Analytics (free) | Traffic patterns, threats, cache hit rate |
| Alerting escalation | BetterStack On-call (free: 1 team) | Slack → Email → SMS escalation chain |

### Tier 3: Intelligence (Week 3-4)
**Cost: ~$0-50/mo**

| Component | Tool | Why |
|-----------|------|-----|
| Business metrics | Custom endpoint + Grafana Cloud (free: 10K metrics) | Tx/hour, success rate, revenue |
| Application metrics | Prometheus client in app | Request latency, error rate, saturation |
| Dashboards | Grafana Cloud (free tier) | SLO tracking, operational dashboards |

### Tier 4: Advanced (Future)
**When needed (production scale)**

| Component | Tool | Why |
|-----------|------|-----|
| Distributed tracing | OpenTelemetry → Grafana Tempo | Cross-service request flow |
| Log aggregation | Grafana Loki | Centralized searchable logs |
| Chaos engineering | Manual game days | Resilience validation |

---

## FlowForge Execution Plan

### Phase 1: Plan (this document)
- [x] AS-IS analysis of Drop infrastructure
- [x] Tool selection (free-first approach)
- [x] Tiered rollout plan
- [ ] CEO approval

**Gate:** Alem says GO

### Phase 2: Provision (Week 1)

**FlowForge SRE → implementation:**

#### 2a. BetterStack Uptime (30 min)
- Create a BetterStack account (free tier)
- Monitors: health endpoint (60s), landing page (5min), API (60s)
- Alerting: Slack #drop-ops → email → SMS
- Status page: public URL for clients

#### 2b. Sentry Re-integration (1-2h)
- `npm install @sentry/nextjs` in drop-app
- DSN stored in AWS Secrets Manager
- Source map upload in CI/CD
- Error boundary in React components
- Server-side error capturing in API routes
- Alert rules: new issue → Slack, spike → email
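
The server-side capture step could be sketched as a wrapper around API route handlers. This is a minimal sketch, not drop-app's actual code: the names `withErrorCapture`, `Reporter`, `ApiRequest`, and `ApiResponse` are hypothetical, and the reporter is injected so the example stays self-contained. In drop-app the reporter would presumably delegate to `Sentry.captureException` from `@sentry/nextjs`.

```typescript
// Hypothetical request/response shapes; the real Next.js types differ.
type ApiRequest = { requestId: string; path: string };
type ApiResponse = { status: number; body: unknown };
type Handler = (req: ApiRequest) => Promise<ApiResponse>;

// Reporter abstraction (assumption). In drop-app this would likely call
// Sentry.captureException(err, { tags }) from @sentry/nextjs.
type Reporter = (err: unknown, tags: Record<string, string>) => void;

function withErrorCapture(handler: Handler, report: Reporter): Handler {
  return async (req) => {
    try {
      return await handler(req);
    } catch (err) {
      // Tag the event with the request ID so it can be correlated with the JSON logs.
      report(err, { requestId: req.requestId, path: req.path });
      // Return a generic 500 so internal details never reach the client.
      return {
        status: 500,
        body: { error: "Internal Server Error", requestId: req.requestId },
      };
    }
  };
}
```

Wrapping every route the same way keeps the request ID from the existing logger attached to each reported error, which makes it easier to jump from a Sentry issue to the matching log lines.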

#### 2c. CloudWatch Logs (1h)
- App Runner → CloudWatch Logs (native, only needs to be enabled)
- Log group: `/drop/production`
- Retention: 30 days
- Metric filters: ERROR count, latency p99, 5xx count
- CloudWatch Alarm: 5xx > 5/min → SNS → Slack
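
The metric filters above only work if each log line carries fields that a CloudWatch JSON filter pattern can match. The sketch below shows one possible log shape. The field names `level`, `status`, and `latencyMs` are assumptions, since the custom logger's schema is not shown here, and the filter patterns in the comments should be checked against the CloudWatch filter pattern documentation before use.

```typescript
// Assumed log schema; the real custom logger may use different field names.
interface RequestLog {
  timestamp: string;
  level: "INFO" | "WARN" | "ERROR";
  requestId: string;
  path: string;
  status: number;
  latencyMs: number;
}

// Emits one JSON object per line. Candidate CloudWatch filter patterns:
//   ERROR count: { $.level = "ERROR" }
//   5xx count:   { $.status >= 500 }
//   latency:     { $.latencyMs = * } with metric value $.latencyMs
//                (p99 is then a statistic chosen on the resulting metric)
function formatRequestLog(
  entry: Omit<RequestLog, "timestamp" | "level">,
  now: Date = new Date(),
): string {
  // Mapping status codes to levels this way is an assumption of the sketch.
  const level = entry.status >= 500 ? "ERROR" : entry.status >= 400 ? "WARN" : "INFO";
  const log: RequestLog = { timestamp: now.toISOString(), level, ...entry };
  return JSON.stringify(log);
}

// Local mirror of the alarm rule "5xx > 5/min", useful for unit-testing thresholds.
// `lines` is expected to hold the log lines from a single one-minute window.
function breachesFiveXxAlarm(lines: string[], threshold = 5): boolean {
  const count = lines.filter((l) => (JSON.parse(l) as RequestLog).status >= 500).length;
  return count > threshold;
}
```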

### Phase 3: Deploy (Week 2)

#### 3a. RDS Performance Insights (15 min)
- Enable in Terraform (`performance_insights_enabled = true`)
- Terraform apply
- Precondition: the database runs on RDS (the AS-IS section lists SQLite, which Performance Insights does not cover)

#### 3b. Cloudflare Analytics (15 min)
- Already active; just verify that it works
- Webhook for DDoS alerts → Slack

#### 3c. Alerting Escalation (30 min)
- BetterStack On-call team setup
- Escalation: Slack (0min) → Email (5min) → SMS (15min)
- On-call schedule: John (primary)
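
The escalation chain itself is configured in BetterStack, not in code. The following sketch only models the intended behavior so the timings can be reviewed or unit-tested; `ESCALATION_STEPS` and `channelsNotified` are hypothetical names.

```typescript
type Channel = "slack" | "email" | "sms";

// Escalation steps from the plan: Slack at 0 min, Email at 5 min, SMS at 15 min.
const ESCALATION_STEPS: ReadonlyArray<{ afterMinutes: number; channel: Channel }> = [
  { afterMinutes: 0, channel: "slack" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" },
];

// Returns every channel that should have been notified for an alert that has
// stayed unacknowledged for the given number of minutes.
function channelsNotified(minutesUnacknowledged: number): Channel[] {
  return ESCALATION_STEPS
    .filter((step) => minutesUnacknowledged >= step.afterMinutes)
    .map((step) => step.channel);
}
```

For example, an alert that has gone unacknowledged for 7 minutes has reached Slack and email but not yet SMS.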

### Phase 4: Monitor (Week 3-4)

#### 4a. Business Metrics Endpoint (2-3h)
- Novi API route: `GET /api/metrics/business`
- Metrics: transactions/hour, success rate, active users, avg amount
- Prometheus format output
- Grafana Cloud dashboard
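
As a rough illustration of the Prometheus format output, the sketch below renders a few business gauges in the Prometheus text exposition format. The metric names (`drop_transactions_last_hour` and so on), the `HourlyStats` shape, and the handling of an hour with zero transactions are all assumptions, not part of the plan.

```typescript
interface Gauge {
  name: string;
  help: string;
  value: number;
}

// Hypothetical input: aggregates the route would read from the database.
interface HourlyStats {
  txLastHour: number;
  txSucceeded: number;
  activeUsers: number;
  totalAmount: number;
}

function businessGauges(stats: HourlyStats): Gauge[] {
  // With no transactions nothing failed, so the ratio is reported as 1 (a choice, not a rule).
  const successRatio = stats.txLastHour === 0 ? 1 : stats.txSucceeded / stats.txLastHour;
  const avgAmount = stats.txLastHour === 0 ? 0 : stats.totalAmount / stats.txLastHour;
  return [
    { name: "drop_transactions_last_hour", help: "Transactions in the last hour", value: stats.txLastHour },
    { name: "drop_transaction_success_ratio", help: "Share of successful transactions", value: successRatio },
    { name: "drop_active_users", help: "Active users", value: stats.activeUsers },
    { name: "drop_transaction_amount_avg", help: "Average transaction amount", value: avgAmount },
  ];
}

// Text exposition format: a HELP line, a TYPE line, then "name value",
// with every line (including the last) terminated by a newline.
function renderPrometheus(gauges: Gauge[]): string {
  return gauges
    .map((g) => `# HELP ${g.name} ${g.help}\n# TYPE ${g.name} gauge\n${g.name} ${g.value}\n`)
    .join("");
}
```

The route handler for `GET /api/metrics/business` would return this string with a plain-text content type.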

#### 4b. Application Metrics (2-3h)
- `prom-client` npm package
- Metrics: http_request_duration, http_requests_total, db_query_duration
- `GET /metrics` endpoint (Prometheus scrape)
- Ship metrics to Grafana Cloud via Prometheus remote write
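
In practice `prom-client` provides a histogram type that handles the bookkeeping for a metric like `http_request_duration`. The dependency-free sketch below only illustrates the cumulative bucket counting behind such a histogram, so that bucket boundaries can be reasoned about when choosing them for the p99 < 500ms target. The class name `DurationHistogram` and the `_seconds` suffix on the metric name are assumptions.

```typescript
// Minimal cumulative histogram in the style Prometheus expects.
class DurationHistogram {
  private readonly upperBounds: number[];
  private readonly counts: number[];
  private sum = 0;
  private total = 0;

  constructor(upperBounds: number[]) {
    // Bounds are kept sorted ascending so the rendered buckets are in order.
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.counts = this.upperBounds.map(() => 0);
  }

  observe(seconds: number): void {
    this.sum += seconds;
    this.total += 1;
    // Buckets are cumulative: an observation counts in every bucket whose bound it fits under.
    this.upperBounds.forEach((le, i) => {
      if (seconds <= le) this.counts[i] = (this.counts[i] ?? 0) + 1;
    });
  }

  render(name: string): string {
    const lines = this.upperBounds.map(
      (le, i) => `${name}_bucket{le="${le}"} ${this.counts[i] ?? 0}`,
    );
    lines.push(`${name}_bucket{le="+Inf"} ${this.total}`);
    lines.push(`${name}_sum ${this.sum}`);
    lines.push(`${name}_count ${this.total}`);
    return lines.join("\n") + "\n";
  }
}
```

Placing a bucket boundary exactly at 0.5 seconds lets the dashboard read the share of requests under the latency target directly from the `le="0.5"` bucket.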

#### 4c. SLO Dashboard (1-2h)
- Grafana dashboard with:
  - Uptime SLO (target: 99.9%)
  - Latency SLO (p99 < 500ms)
  - Error rate SLO (< 0.1%)
  - Business KPIs
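
The uptime target translates into a concrete downtime allowance (error budget). The sketch below works that out; the 30-day window and the function names are assumptions, since the plan does not state the SLO measurement window.

```typescript
// Minutes of downtime allowed by an availability SLO over a rolling window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - sloPercent)) / 100;
}

// Remaining budget after the downtime already consumed; negative means the SLO is breached.
function budgetRemainingMinutes(
  sloPercent: number,
  windowDays: number,
  downtimeMinutes: number,
): number {
  return errorBudgetMinutes(sloPercent, windowDays) - downtimeMinutes;
}

// 99.9% over 30 days: 43,200 minutes * 0.1% is roughly 43.2 minutes of allowed downtime.
const monthlyBudget = errorBudgetMinutes(99.9, 30);
```

Under that assumed window, a single P1 that uses the full 1h MTTR target would overspend the monthly budget, so the two targets are worth reviewing together.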

### Phase 5: Optimize (Ongoing)

**HelixSupport takes over:**
- Weekly metrics review
- Incident response per SLA (P1: 15min, P2: 1h)
- Post-mortem for every P1/P2
- Monthly SLA report for Alem
- Monitoring tuning (alert thresholds, noise reduction)

---

## Cost Summary

| Item | Monthly Cost |
|------|-------------|
| BetterStack Uptime (free tier) | $0 |
| Sentry (free tier, 5K events) | $0 |
| CloudWatch Logs (App Runner) | ~$5 |
| RDS Performance Insights (free) | $0 |
| Grafana Cloud (free tier) | $0 |
| **Total** | **~$5/mo** |

When Drop scales, upgrade to paid tiers (~$50-100/mo).

---

## Success Metrics

| Metric | Target |
|--------|--------|
| MTTD (Mean Time to Detect) | < 5 min |
| MTTR (Mean Time to Recover) | < 1h (P1) |
| Uptime SLO | 99.9% |
| Undetected outages | 0 |
| Alert noise (false positives) | < 10% |
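
One way these targets could be computed for the monthly report is sketched below. The `Incident` shape is hypothetical, and both MTTD and MTTR are measured from the start of the incident here, which is an assumption: some teams measure MTTR from detection instead.

```typescript
// Hypothetical incident record; real data would come from the incident tracker.
interface Incident {
  startedAt: Date;
  detectedAt: Date;
  resolvedAt: Date;
}

const MS_PER_MINUTE = 60_000;

function meanMinutes(durationsMs: number[]): number {
  if (durationsMs.length === 0) return 0;
  const totalMs = durationsMs.reduce((acc, ms) => acc + ms, 0);
  return totalMs / durationsMs.length / MS_PER_MINUTE;
}

// Mean time from incident start to detection.
function mttdMinutes(incidents: Incident[]): number {
  return meanMinutes(incidents.map((i) => i.detectedAt.getTime() - i.startedAt.getTime()));
}

// Mean time from incident start to recovery (assumption: measured from start, not detection).
function mttrMinutes(incidents: Incident[]): number {
  return meanMinutes(incidents.map((i) => i.resolvedAt.getTime() - i.startedAt.getTime()));
}

// Share of alerts that turned out to be false positives.
function alertNoiseRatio(falsePositives: number, totalAlerts: number): number {
  return totalAlerts === 0 ? 0 : falsePositives / totalAlerts;
}
```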

---

## Dependencies

- AWS credentials (Terraform apply access)
- BetterStack account creation
- Sentry project creation
- Grafana Cloud account creation

---

## Approval

**CEO Decision Required:**
- [ ] Approve plan (GO / NO-GO / MODIFY)
- [ ] Approve tool selection (BetterStack, Sentry, Grafana)
- [ ] Approve timeline (4 weeks)
- [ ] Budget confirmation (~$5/mo)