Drop Support Systems Analysis

Date: 2026-02-22
Author: John (AI Director)
Status: MVP Hardening Phase (0.5)
Purpose: Comprehensive analysis of support systems for production-ready fintech deployment


Executive Summary

Drop currently has foundational support systems in place but requires critical enhancements before production launch. The application has health checks, CI/CD, error tracking (client-side), and basic alerting, but lacks enterprise-grade observability, audit logging, and incident response procedures required for a PSD2-compliant fintech service.

Key Findings:

  • ✅ Strong foundation: Comprehensive CI/CD with >80% coverage, health checks, structured logging
  • ⚠️ Critical gaps: No server-side error tracking, no audit trails, no APM, limited incident response
  • 🚨 Production blockers: 6 P0 items must be addressed before go-live (see Gap Analysis)

Recommendation: Implement P0 systems immediately (est. 2-3 days), defer P1 to Phase 2 (banking integration), and P2 to post-launch optimization.


Current State

1. Monitoring: Uptime & Health Checks

What Exists

  • ✅ Health endpoint: /api/health with database connectivity verification
    • Checks: DB query latency, driver type (pg/sqlite), service mode, uptime
    • Returns: ok (200), degraded (200), or down (503)
    • Source: src/drop-app/src/app/api/health/route.ts
  • ✅ Container health checks:
    • Docker: 30s interval, 10s timeout, 3 retries
    • Fly.io: 30s interval, 10s grace period, 5s timeout
    • Auto-restart on failure
  • ✅ External uptime monitoring (ready to deploy):
    • BetterStack setup guide documented
    • Free tier: 10 monitors, 3-min interval, SMS/email/Slack alerts
    • Documentation: docs/infrastructure/BETTERSTACK-SETUP.md
  • ✅ Cron health check script:
    • infrastructure/health-check.sh (targets the AWS App Runner endpoint)
    • Slack webhook integration (optional)
    • Can run via cron for local monitoring

What's Missing

  • ❌ Synthetic monitoring: No transaction flow testing (login → send money → verify)
  • ❌ Multi-region checks: No geographic availability testing
  • ❌ SLA tracking: No uptime percentage calculation or reporting
  • ❌ Dependency monitoring: No checks for external services (Swan API, BankID, Sumsub)

Assessment

Status: Adequate for MVP, requires enhancement for production. Gap: External monitoring configured but not deployed. Synthetic checks needed.
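
The ok/degraded/down contract of /api/health can be sketched as a pure function. This is an illustration only: the names `classifyHealth` and `HealthReport` and the 500 ms threshold are assumptions, since the analysis does not state when "degraded" applies. The real logic lives in route.ts.

```typescript
// Hypothetical sketch: maps a DB probe result onto the documented health contract.
type HealthStatus = "ok" | "degraded" | "down";

interface HealthReport {
  status: HealthStatus;
  httpCode: 200 | 503;
  dbLatencyMs: number | null;
}

// Assumed threshold; the source does not define the "degraded" cutoff.
const DEGRADED_LATENCY_MS = 500;

function classifyHealth(dbLatencyMs: number | null): HealthReport {
  // A null latency means the DB probe failed (no connectivity).
  if (dbLatencyMs === null) {
    return { status: "down", httpCode: 503, dbLatencyMs };
  }
  const status: HealthStatus = dbLatencyMs > DEGRADED_LATENCY_MS ? "degraded" : "ok";
  // "ok" and "degraded" both answer 200, matching the documented contract.
  return { status, httpCode: 200, dbLatencyMs };
}
```

Because "degraded" still returns 200, a status-code-only monitor will not see it. An external monitor needs a body keyword check to notice degradation.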


2. Logging: Centralized Log Aggregation

What Exists

  • ✅ Structured logging:
    • JSON format with timestamp, level, message, requestId, metadata
    • Source: src/drop-app/src/lib/logger.ts
    • Writes to stdout (Docker-friendly)
  • ✅ Request correlation:
    • x-request-id header extraction or UUID generation
    • Request context propagation through logger instances
  • ✅ Log levels: debug, info, warn, error

What's Missing

  • ❌ Log aggregation: Logs write to stdout but aren't collected or indexed
  • ❌ Log retention: No policy for how long logs are kept
  • ❌ Log search: No way to query logs across time/instances
  • ❌ Log forwarding: No integration with log management service
  • ❌ Sensitive data scrubbing: Logger doesn't automatically redact PII

Assessment

Status: Foundation exists, but logs are ephemeral (lost on container restart). Gap: Critical for incident investigation and compliance audits. Need CloudWatch Logs or similar.
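
The missing PII scrubbing could be added where the JSON line is built. The sketch below assumes a key-based denylist. `SENSITIVE_KEYS`, `redact`, and `formatLogLine` are hypothetical names, and the key list is an illustration rather than the project's actual list.

```typescript
// Hypothetical redaction helper; the key names are assumptions, not the project's list.
const SENSITIVE_KEYS = new Set(["password", "pin", "cardnumber", "fodselsnummer", "token"]);

type LogLevel = "debug" | "info" | "warn" | "error";

// Recursively replaces values of sensitive keys. Handles plain JSON-like data only.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

// Builds one structured log line with the fields listed above.
function formatLogLine(
  level: LogLevel,
  message: string,
  requestId: string,
  metadata: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    requestId,
    metadata: redact(metadata),
  });
}
```

A key-based denylist only catches PII stored under known field names. Free-text messages would still need pattern-based scrubbing.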


3. Error Tracking: Error Capture & Alerting

What Exists

  • ✅ Client-side error tracking:
    • Sentry browser integration (@sentry/browser)
    • PII scrubbing (passwords, pins, card numbers, fødselsnummer)
    • 10% trace sampling for performance monitoring
    • Source: src/drop-app/src/lib/sentry.ts, SENTRY.md
  • ✅ Error spike detection:
    • Tracks errors in rolling 1-minute window
    • Alerts when >5 errors in 60 seconds
    • Source: src/drop-app/src/lib/alerts.ts:trackError()
  • ✅ Global error boundaries:
    • React error boundaries for component crashes
    • global-error.tsx catches unhandled errors

What's Missing

  • ❌ Server-side error tracking: Sentry removed from server due to Next.js 16 Turbopack incompatibility (MC #1271)
  • ❌ API error context: Server errors log to console only, no structured capture
  • ❌ Error attribution: Can't trace errors to specific users or transactions
  • ❌ Error deduplication: Same error reported multiple times clogs alerts

Assessment

Status: Client errors tracked, server errors blind. Gap: CRITICAL. Server-side errors (API, DB, integrations) are invisible. P0 fix required.
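
The spike rule described for trackError() (more than 5 errors inside a rolling 60-second window) can be sketched as follows. `ErrorSpikeDetector` is a hypothetical class written for illustration, not the code in alerts.ts.

```typescript
// Hypothetical rolling-window spike detector; not the actual trackError() implementation.
class ErrorSpikeDetector {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs = 60_000, threshold = 5) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one error at `nowMs`; returns true when the window holds more than `threshold` errors.
  track(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop entries that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}
```

State held in process memory resets on restart and is not shared between instances, so errors spread across several containers may never cross the threshold on any single one.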


4. Alerting: On-Call & Escalation

What Exists

  • ✅ Slack alerting:
    • Operational alerts with severity levels (info/warning/critical)
    • 10-minute cooldown per alert title (spam prevention)
    • Source: src/drop-app/src/lib/alerts.ts
  • ✅ Lifecycle alerts:
    • App startup notification
    • Graceful shutdown notification
    • Source: instrumentation.ts
  • ✅ Error spike alerts:
    • Automatic critical alert when >5 errors/minute

What's Missing

  • ❌ On-call rotation: No defined on-call schedule or escalation policy
  • ❌ Alert routing: All alerts go to same Slack channel, no severity-based routing
  • ❌ Alert escalation: No automatic escalation after N minutes of unresolved incident
  • ❌ Alert acknowledgment: Can't mark alerts as "acknowledged" or "resolved"
  • ❌ SMS/phone alerts: Critical incidents only notify via Slack (single point of failure)
  • ❌ Alert testing: No way to test alert pipeline without triggering real incidents

Assessment

Status: Basic alerting works for small team, inadequate for 24/7 production. Gap: Need on-call schedule, escalation policy, and multi-channel delivery.
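
The 10-minute cooldown per alert title works like a small gate in front of the Slack webhook. The sketch below assumes an in-memory map keyed by title. `AlertCooldown` and `shouldSend` are hypothetical names.

```typescript
// Hypothetical cooldown gate keyed by alert title; the real alerts.ts may differ.
class AlertCooldown {
  private readonly lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs = 10 * 60_000) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the alert should be delivered, false if it is suppressed.
  shouldSend(title: string, nowMs: number = Date.now()): boolean {
    const previous = this.lastSent.get(title);
    if (previous !== undefined && nowMs - previous < this.cooldownMs) return false;
    // Only delivered alerts restart the cooldown; suppressed ones do not extend it.
    this.lastSent.set(title, nowMs);
    return true;
  }
}
```

Accepting the clock as a parameter also gives a way to exercise the alert pipeline in tests without triggering a real incident.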


5. Security Monitoring: WAF, DDoS, Anomaly Detection, Audit Logs

What Exists

  • ✅ WAF rules defined:
    • CSRF origin validation (implemented in middleware)
    • Rate limiting on auth endpoints (10 req/60s)
    • CSP headers with nonce-based script loading
    • Source: infrastructure/waf-rules.md, src/drop-app/src/middleware.ts
  • ✅ Container security scanning:
    • Trivy vulnerability scanner in CI/CD
    • Blocks HIGH/CRITICAL vulnerabilities
    • SARIF upload to GitHub Security tab
  • ✅ Dependency scanning:
    • npm audit in CI pipeline (prod deps only)
  • ✅ AML transaction monitoring:
    • 5 automated rules: structuring, velocity, high amount, high-risk corridor, unusual pattern
    • Alerts stored in aml_alerts table
    • Source: src/drop-app/src/lib/transaction-monitor.ts

What's Missing

  • ❌ WAF deployment: Rules defined but not deployed (requires CDN/reverse proxy)
  • ❌ DDoS protection: No rate limiting at network edge, only app-level
  • ❌ Intrusion detection: No IDS/IPS monitoring unusual access patterns
  • ❌ Audit logs: No immutable log of authentication, authorization, data access events (PSD2 requirement)
  • ❌ Security incident response plan: No runbook for security breaches
  • ❌ Penetration testing: No external security audit completed

Assessment

Status: Security-aware codebase, but monitoring/audit infrastructure missing. Gap: CRITICAL. Audit logs are a PSD2/GDPR compliance requirement. P0 fix.
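
The app-level limit on auth endpoints (10 requests per 60 seconds) can be illustrated with a fixed-window counter. This is a sketch under assumptions: the middleware's actual algorithm is not stated, and `FixedWindowRateLimiter` and `RateWindow` are hypothetical names.

```typescript
// Hypothetical fixed-window limiter (10 requests per 60 s per key); in-memory sketch only.
interface RateWindow {
  windowStart: number;
  count: number;
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, RateWindow>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 10, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed, false when it should be rejected (e.g. HTTP 429).
  allow(key: string, nowMs: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs - current.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A per-process counter like this multiplies the effective limit by the number of running instances and does nothing against volumetric traffic, which is why the edge-level gap above still matters.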


6. Performance: APM, Latency Tracking, Resource Utilization

What Exists

  • ✅ Health check latency:
    • DB query time measured in health endpoint
    • Reported in milliseconds
  • ✅ Performance budgets in CI:
    • Coverage thresholds enforced (80/70/80/80)

What's Missing

  • ❌ APM (Application Performance Monitoring): No distributed tracing
  • ❌ API latency tracking: Don't know which endpoints are slow
  • ❌ Database performance: No slow query alerts or query profiling
  • ❌ Resource utilization: No CPU/memory/disk usage monitoring
  • ❌ Frontend performance: No Core Web Vitals tracking (LCP, INP, CLS)
  • ❌ Transaction timing: Can't measure end-to-end payment latency

Assessment

Status: Minimal. Can detect total outage but not performance degradation. Gap: Need before production to identify bottlenecks and capacity issues.


7. Database: Backups, Replication, Monitoring

What Exists

  • ✅ Automated backups (RDS):
    • Daily automated snapshots, 7-day retention
    • Point-in-time recovery within 7 days
    • Source: docs/dr-runbook.md
  • ✅ Multi-AZ (production):
    • RDS configured for high availability (if enabled)
  • ✅ Database health check:
    • SELECT 1 query in health endpoint verifies connectivity

What's Missing

  • ❌ Backup verification: Snapshots created but never tested for restore
  • ❌ Backup monitoring: No alerts if backup fails
  • ❌ Replication lag monitoring: No alerts if replica falls behind
  • ❌ Connection pool monitoring: No visibility into connection usage
  • ❌ Query performance: No slow query log analysis
  • ❌ Storage monitoring: No alerts before disk fills up

Assessment

Status: Basic backup/restore exists, monitoring gaps. Gap: Backup testing and proactive monitoring needed before production.


8. Incident Response: Runbooks, Status Page, Communication Plan

What Exists

  • ✅ DR runbook:
    • Procedures for App Runner down, RDS down, full redeploy
    • Environment variable checklist
    • Contact escalation (John → Alem)
    • Source: docs/dr-runbook.md
  • ✅ Incident checklist:
    • 8-step incident response workflow
    • Post-mortem requirement (48h)

What's Missing

  • ❌ Status page: No public/customer-facing status page
  • ❌ Incident templates: No standardized incident report format
  • ❌ Communication plan: No templates for customer notifications during outages
  • ❌ Runbook coverage: Only covers infrastructure, missing:
    • Payment failures (PISP/AISP errors)
    • BankID integration issues
    • KYC/AML false positive handling
    • Data breach response
  • ❌ Runbook testing: Procedures documented but never executed

Assessment

Status: Basic DR runbook exists, lacks fintech-specific scenarios. Gap: Need payment/banking integration runbooks before Phase 2.


9. CI/CD: Build Pipeline, Deployment, Rollback

What Exists

  • ✅ Comprehensive CI pipeline:
    • Multi-package change detection
    • Lint, typecheck, unit tests, E2E (Playwright), mutation testing (Stryker)
    • Coverage thresholds enforced (80/70/80/80) with ratchet (never decrease)
    • Docker build + Trivy security scan
    • Quality gate (required status check)
    • Source: .github/workflows/ci.yml
  • ✅ Deployment workflows:
    • GitHub Actions for deploy (backend, mobile)
    • Terraform for infrastructure
    • Source: .github/workflows/deploy.yml, terraform-ci.yml
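
The coverage ratchet ("never decrease") amounts to comparing each metric against the higher of the fixed floor and the last recorded value. The sketch below assumes that 80/70/80/80 maps to statements/branches/functions/lines, which the analysis does not confirm. `ratchetViolations` is a hypothetical helper.

```typescript
// Hypothetical ratchet check: a metric fails if it drops below the floor or the recorded baseline.
interface Coverage {
  statements: number;
  branches: number;
  functions: number;
  lines: number;
}

// Assumed mapping of the 80/70/80/80 thresholds onto metric names.
const FLOOR: Coverage = { statements: 80, branches: 70, functions: 80, lines: 80 };

function ratchetViolations(current: Coverage, baseline: Coverage): string[] {
  const violations: string[] = [];
  for (const metric of Object.keys(FLOOR) as (keyof Coverage)[]) {
    // The bar is whichever is higher: the fixed floor or the previous run.
    const required = Math.max(FLOOR[metric], baseline[metric]);
    if (current[metric] < required) {
      violations.push(`${metric}: ${current[metric]}% < required ${required}%`);
    }
  }
  return violations;
}
```

An empty result means the run may pass and its numbers can be stored as the next baseline.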

What's Missing

  • ❌ Automated rollback: Deployment failure doesn't auto-revert
  • ❌ Canary deployments: All-or-nothing deployment, no gradual rollout
  • ❌ Deployment monitoring: No automatic health check after deploy
  • ❌ Deployment notifications: Team not notified of deployments/failures
  • ❌ Infrastructure drift detection: Terraform state not continuously validated

Assessment

Status: Strong quality gate, weak deployment safety. Gap: Add post-deployment health checks and rollback automation.
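
A post-deployment gate could poll the health endpoint a few times and signal a rollback if it never turns healthy. This is a minimal sketch. `verifyDeployment`, the attempt count, and the delay are assumptions, and the probe is injected so the logic can be tested without a network.

```typescript
// Hypothetical post-deploy gate: polls a health probe and recommends rollback if it never succeeds.
type Probe = () => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function verifyDeployment(
  probe: Probe,
  attempts = 5,
  delayMs = 10_000,
  sleep: Sleep = realSleep,
): Promise<"healthy" | "rollback"> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await probe()) return "healthy";
    } catch {
      // A thrown probe (e.g. connection refused) counts as a failed attempt.
    }
    // Wait between attempts, but not after the last one.
    if (attempt < attempts) await sleep(delayMs);
  }
  return "rollback";
}
```

In a deploy workflow the probe would call /api/health on the new revision. A "rollback" result would trigger the revert step, which is not shown here.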


10. Compliance: Audit Trails, Data Retention, GDPR/PSD2 Logging

What Exists

  • ✅ AML monitoring:
    • Transaction alerts stored in aml_alerts table
    • 5 risk categories tracked
  • ✅ Security audit completed:
    • 4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW findings documented
    • Source: security/drop-security-rapport.md
  • ✅ Data retention service:
    • Code exists for GDPR compliance
    • Source: src/drop-app/src/lib/services/data-retention.ts

What's Missing

  • ❌ Audit logs: No immutable record of:
    • User authentication events (login, logout, failed attempts)
    • Authorization decisions (who accessed what, when)
    • Data modifications (user profile changes, transaction edits)
    • Administrative actions (KYC approvals, AML reviews)
  • ❌ Audit log retention policy: PSD2 requires 5+ years
  • ❌ Audit log integrity: No cryptographic proof of non-tampering
  • ❌ Compliance reporting: No automated report generation for regulators
  • ❌ STR (Suspicious Transaction Report) workflow: AML alerts created but no submission process

Assessment

Status: CRITICAL GAP. Audit logs are a PSD2 legal requirement. Gap: P0. Must be implemented before production launch.
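
One common way to get "cryptographic proof of non-tampering" is a hash chain, where every entry commits to the hash of the previous one. The sketch below shows the technique and is not a design the project has chosen. `ChainedEntry`, `appendEntry`, `firstTamperedIndex`, and `GENESIS` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical hash-chained audit entries: each entry commits to the previous entry's hash.
interface ChainedEntry {
  payload: string;  // serialized audit record
  prevHash: string; // hash of the preceding entry, or GENESIS for the first
  hash: string;     // SHA-256 over prevHash + payload
}

const GENESIS = "0".repeat(64);

function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Returns a new chain with the payload appended; existing entries are never modified.
function appendEntry(chain: ChainedEntry[], payload: string): ChainedEntry[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...chain, { payload, prevHash, hash: sha256(prevHash + payload) }];
}

// Returns the index of the first entry that fails verification, or -1 if the chain is intact.
function firstTamperedIndex(chain: ChainedEntry[]): number {
  let prevHash = GENESIS;
  return chain.findIndex((entry) => {
    const valid = entry.prevHash === prevHash && entry.hash === sha256(prevHash + entry.payload);
    prevHash = entry.hash;
    return !valid;
  });
}
```

A chain stored next to the data proves tampering only if the latest hash is also anchored somewhere the database writer cannot change, for example a periodic export to separate storage.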


Gap Analysis

P0: Production Blockers (Must Fix Before Go-Live)

| # | Category          | Gap                                              | Impact                                 | Effort                |
|---|-------------------|--------------------------------------------------|----------------------------------------|-----------------------|
| 1 | Error Tracking    | No server-side error monitoring                  | Can't detect/debug API failures        | 4h                    |
| 2 | Compliance        | No audit logs (auth, data access, admin actions) | PSD2 non-compliance, legal risk        | 8h                    |
| 3 | Security          | WAF rules defined but not deployed               | Vulnerable to SQLi, XSS, DDoS          | 2h (config)           |
| 4 | Logging           | No log aggregation/retention                     | Can't investigate incidents            | 2h (CloudWatch setup) |
| 5 | Monitoring        | BetterStack configured but not deployed          | No external incident detection         | 1h (account setup)    |
| 6 | Incident Response | No payment/banking failure runbooks              | Can't recover from PISP/BankID outages | 4h                    |

Total P0 effort: ~21 hours (2-3 days)


P1: Needed Soon (Before Phase 2: Banking Integration)

| #  | Category    | Gap                                         | Impact                                        | Effort         |
|----|-------------|---------------------------------------------|-----------------------------------------------|----------------|
| 7  | Alerting    | No on-call rotation or escalation policy    | Incidents may go unnoticed outside work hours | 2h             |
| 8  | Performance | No APM for distributed tracing              | Can't diagnose slow transactions              | 4h             |
| 9  | Database    | No backup testing or monitoring             | Backups may be corrupt, undetected            | 3h             |
| 10 | Security    | No penetration testing                      | Unknown vulnerabilities                       | 16h (external) |
| 11 | CI/CD       | No automated rollback on deployment failure | Bad deploys cause extended outages            | 6h             |
| 12 | Compliance  | No STR submission workflow                  | Can't fulfill AML obligations                 | 8h             |

Total P1 effort: ~39 hours (5 days)


P2: Nice to Have (Post-Launch Optimization)

| #  | Category          | Gap                                        | Impact                             | Effort |
|----|-------------------|--------------------------------------------|------------------------------------|--------|
| 13 | Monitoring        | No synthetic transaction monitoring        | Can't detect broken user flows     | 8h     |
| 14 | Performance       | No Core Web Vitals tracking                | Poor user experience undetected    | 4h     |
| 15 | Alerting          | No SMS/phone alerts for critical incidents | Slack outage = missed alerts       | 2h     |
| 16 | Database          | No slow query alerts                       | Performance degradation undetected | 6h     |
| 17 | Security          | No IDS/IPS for intrusion detection         | Advanced attacks undetected        | 16h    |
| 18 | Incident Response | No public status page                      | Customers unaware of outages       | 4h     |

Total P2 effort: ~40 hours (5 days)


Implementation Plan

Phase 1: P0 Production Blockers (NOW, before Phase 1 demo)

Goal: Address legal/compliance requirements and critical observability gaps.

1.1 Server-Side Error Tracking (4h)

Problem: All server errors invisible after Sentry removed (Next.js 16 Turbopack incompatibility).

Solution:

  • Option A: Sentry Edge SDK (compatible with Next.js middleware)
    • Install: @sentry/nextjs with edge-only config
    • Capture server errors via captureException() in middleware
    • Source maps via Sentry webpack plugin
  • Option B: Custom error aggregation service
    • POST errors to internal /api/errors/capture endpoint
    • Store in error_logs table with context
    • Alert on spike detection

Deliverable:

  • src/drop-app/sentry.edge.config.ts (if Option A)
  • Updated src/drop-app/src/lib/sentry-server.ts with edge-compatible capture
  • Test: Trigger 500 error, verify Sentry event created

Files: infrastructure/error-tracking-setup.md


1.2 Audit Logging System (8h)

Problem: PSD2 requires immutable audit trail for auth, data access, admin actions.

Solution:

  • Create audit_logs table:

    CREATE TABLE audit_logs (
      id TEXT PRIMARY KEY,
      timestamp TEXT NOT NULL,
      user_id TEXT,
      action TEXT NOT NULL, -- 'login_success', 'data_access', 'kyc_approval', etc.
      resource_type TEXT, -- 'user', 'transaction', 'aml_alert'
      resource_id TEXT,
      metadata JSON,
      ip_address TEXT,
      user_agent TEXT,
      request_id TEXT,
      result TEXT -- 'success', 'failure', 'denied'
    );
    CREATE INDEX idx_audit_user ON audit_logs(user_id, timestamp);
    CREATE INDEX idx_audit_action ON audit_logs(action, timestamp);
    
  • Audit functions:

    auditLog({
      userId: 'usr_123',
      action: 'login_success',
      resourceType: 'user',
      resourceId: 'usr_123',
      metadata: { method: 'bankid' },
      ip: '1.2.3.4',
      userAgent: 'Mozilla...',
      requestId: 'req_456'
    });
    
  • Integrate at:

    • POST /api/auth/login (login_success, login_failure)
    • POST /api/auth/logout (logout)
    • GET /api/users/:id (data_access)
    • PATCH /api/users/:id/kyc (kyc_approval, kyc_rejection)
    • PATCH /api/aml-alerts/:id (aml_review)
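
A minimal sketch of what auditLog() might look like, assuming an injected append-only sink rather than a concrete database client. `AuditSink`, the `aud_` id format, and the default result of 'success' are assumptions for illustration. Unlike the call shown above, this version takes the sink as its first argument.

```typescript
// Hypothetical sketch of auditLog(): builds a row matching the audit_logs columns.
type AuditResult = "success" | "failure" | "denied";

interface AuditInput {
  userId?: string;
  action: string;
  resourceType?: string;
  resourceId?: string;
  metadata?: Record<string, unknown>;
  ip?: string;
  userAgent?: string;
  requestId?: string;
  result?: AuditResult;
}

interface AuditRow {
  id: string;
  timestamp: string;
  user_id: string | null;
  action: string;
  resource_type: string | null;
  resource_id: string | null;
  metadata: string; // JSON-encoded
  ip_address: string | null;
  user_agent: string | null;
  request_id: string | null;
  result: AuditResult;
}

// Assumed storage interface: insert only, so no update or delete path is exposed.
interface AuditSink {
  insert(row: AuditRow): void;
}

let auditSeq = 0;

function auditLog(sink: AuditSink, input: AuditInput, now: Date = new Date()): AuditRow {
  auditSeq += 1;
  const row: AuditRow = {
    id: `aud_${now.getTime()}_${auditSeq}`, // assumed id format
    timestamp: now.toISOString(),
    user_id: input.userId ?? null,
    action: input.action,
    resource_type: input.resourceType ?? null,
    resource_id: input.resourceId ?? null,
    metadata: JSON.stringify(input.metadata ?? {}),
    ip_address: input.ip ?? null,
    user_agent: input.userAgent ?? null,
    request_id: input.requestId ?? null,
    result: input.result ?? "success",
  };
  sink.insert(row);
  return row;
}
```

Failed logins would pass result: 'failure' explicitly. Metadata should be scrubbed of PII before it reaches this function.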

Deliverable:

  • src/drop-app/src/lib/audit-log.ts (audit logging functions)
  • Migration: migrations/003_audit_logs.sql
  • Integration in auth routes and admin endpoints
  • Retention policy: Document 5-year retention for PSD2 compliance

Files: support/audit-logging-setup.md


1.3 WAF Deployment (2h)

Problem: WAF rules defined but not enforced (requires reverse proxy).

Solution:

  • Option A: Cloudflare WAF (recommended)
    • Already using Cloudflare for DNS (terraform module exists)
    • Free tier includes basic WAF rules
    • Configure: SQLi, XSS, path traversal rules from infrastructure/waf-rules.md
  • Option B: AWS WAF (if using App Runner directly)
    • $5/month + $1/million requests
    • Associate with App Runner service

Deliverable:

  • Cloudflare WAF configuration (Terraform or UI)
  • Test: Send SQLi payload, verify 403 response
  • Document: Update infrastructure/waf-rules.md with deployment steps

Files: infrastructure/cloudflare-waf-setup.md


1.4 Log Aggregation (2h)

Problem: Structured logs write to stdout but aren't retained or searchable.

Solution:

  • AWS CloudWatch Logs (App Runner auto-integrates):

    • App Runner streams stdout → CloudWatch Logs automatically
    • Configure retention: 30 days (production), 7 days (staging)
    • Set up log insights queries for common patterns
  • Fly.io (staging):

    • fly logs stores last 24h by default
    • Optional: Forward to external service (Papertrail, Logtail)

Deliverable:

  • CloudWatch Logs retention policy configured
  • Log Insights queries:
    • All errors: fields @timestamp, message | filter level = "error"
    • User actions: fields @timestamp, userId, message | filter userId = "usr_123"
    • Request trace: fields @timestamp, requestId, message | filter requestId = "req_456"
  • Documentation: infrastructure/logging-setup.md

Files: infrastructure/cloudwatch-logs-setup.md


1.5 External Uptime Monitoring (1h)

Problem: BetterStack documented but not deployed.

Solution:

  • Sign up: https://betterstack.com/uptime (free tier)
  • Create monitors:
    1. Production health: https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health
      • Interval: 3 minutes
      • Keyword check: "status":"ok"
    2. Staging health: https://drop-staging.fly.dev/api/health
    3. Landing page: https://getdrop.no (when live)
  • Slack integration: Connect to #drop-ops channel
  • Email alerts: [email protected]

Deliverable:

  • BetterStack account with 3 monitors configured
  • Test: Pause monitor, verify alert received
  • Documentation: Update docs/infrastructure/BETTERSTACK-SETUP.md with the deployed monitor configuration; store credentials in a secrets manager, not in the doc

Files: support/betterstack-deployment.md


1.6 Payment/Banking Failure Runbooks (4h)

Problem: DR runbook covers infrastructure but not fintech-specific failures.

Solution:

  • Create runbooks for:

    1. BankID integration failure (authentication blocked)
    2. PISP payment failure (remittance/QR payment rejected)
    3. AISP balance retrieval failure (can't fetch account balance)
    4. Swan API outage (BaaS provider down)
    5. Sumsub KYC failure (identity verification unavailable)
    6. Open Banking provider outage (provider TBD)
  • Each runbook includes:

    • Symptoms (what users see)
    • Diagnosis steps (check service status, logs, error codes)
    • Recovery procedure (fallback, retry, escalation)
    • Customer communication template

Deliverable:

  • support/runbooks/bankid-failure.md
  • support/runbooks/pisp-payment-failure.md
  • support/runbooks/aisp-balance-failure.md
  • support/runbooks/swan-api-outage.md
  • support/runbooks/sumsub-kyc-failure.md
  • support/runbooks/neonomics-outage.md

Files: Created in /Users/makinja/ALAI/products/Drop/support/runbooks/


Phase 2: P1 Items (Phase 2: Banking Integration)

Defer to Phase 2 when real banking integrations are live and need production-grade support.

Priority order:

  1. Penetration testing (external security audit)
  2. APM for transaction tracing (identify slow payments)
  3. On-call rotation and escalation policy
  4. Automated rollback on failed deployments
  5. Backup testing and monitoring
  6. STR submission workflow (AML compliance)

Phase 3: P2 Items (Post-Launch)

Optimize after initial production deployment and user feedback.

Priority order:

  1. Synthetic transaction monitoring (test critical user flows)
  2. Public status page (customer transparency)
  3. Core Web Vitals tracking (frontend performance)
  4. SMS/phone alerts (redundancy)
  5. Slow query monitoring (database optimization)
  6. IDS/IPS (advanced threat detection)

Architecture

Support Systems Connectivity

┌───────────────────────────────────────────────────────────────┐
│                       Drop Application                        │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │  drop-app   │  │  drop-api    │  │  drop-mobile (Expo)  │  │
│  │  (Next.js)  │  │  (Hono)      │  │  (React Native)      │  │
│  └─────────────┘  └──────────────┘  └──────────────────────┘  │
│         │                │                      │             │
│         └────────────────┴──────────────────────┘             │
│                          │                                    │
└──────────────────────────┼────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────────────────┐
        │                  │                              │
        ▼                  ▼                              ▼
┌───────────────┐  ┌──────────────┐           ┌──────────────────┐
│ Structured    │  │ Health Check │           │ Audit Logs       │
│ Logging       │  │ Endpoint     │           │ (audit_logs      │
│ (JSON stdout) │  │ /api/health  │           │  table)          │
└───────┬───────┘  └───────┬──────┘           └───────────┬──────┘
        │                  │                              │
        ▼                  │                              │
┌────────────────┐         │                              │
│ CloudWatch     │         │                              │
│ Logs           │         │                              │
│ (30d retention)│         │                              │
└───────┬────────┘         │                              │
        │                  │                              │
        │                  ▼                              │
        │          ┌───────────────┐                      │
        │          │ BetterStack   │                      │
        │          │ (external     │                      │
        │          │  monitoring)  │                      │
        │          └───────┬───────┘                      │
        │                  │                              │
        └──────────────────┼──────────────────────────────┘
                           │
                           ▼
                   ┌────────────────┐
                   │ Alerting Layer │
                   │ (alerts.ts)    │
                   └───────┬────────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
   │ Slack       │  │ Sentry      │  │ Email       │
   │ Webhook     │  │ (client +   │  │ (SMTP)      │
   │ (#drop-ops) │  │  edge)      │  │             │
   └─────────────┘  └─────────────┘  └─────────────┘

Data Flows

  1. Error Flow:
    • Client error → Sentry browser → Slack alert (if spike)
    • Server error → Sentry edge → CloudWatch Logs → Slack alert
    • API 5xx → trackError() → Spike detection → Slack
  2. Monitoring Flow:
    • App → stdout → CloudWatch Logs
    • App → /api/health → BetterStack → Slack/Email/SMS
    • Container → Docker health check → Auto-restart
  3. Audit Flow:
    • User action → auditLog() → audit_logs table
    • Compliance query → SQL export → Regulator submission
  4. Incident Flow:
    • Alert → Slack #drop-ops
    • Unacknowledged (5 min) → Email to Alem
    • Unresolved (15 min) → SMS (BetterStack escalation)
    • Incident → Runbook → Recovery → Post-mortem
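
The incident flow's timing rules (email after 5 minutes without acknowledgment, SMS after 15 minutes unresolved) can be written as a small decision function. This sketch is hypothetical: `channelsDue` and `IncidentState` are illustrative names, and in practice BetterStack would handle the SMS step.

```typescript
// Hypothetical escalation policy mirroring the incident flow (5 min -> email, 15 min -> SMS).
type Channel = "slack" | "email" | "sms";

interface IncidentState {
  openedAtMs: number;
  acknowledgedAtMs?: number;
  resolvedAtMs?: number;
}

const MINUTE = 60_000;

// Returns every channel that should have been notified by `nowMs`.
function channelsDue(incident: IncidentState, nowMs: number): Channel[] {
  if (incident.resolvedAtMs !== undefined) return [];
  const ageMs = nowMs - incident.openedAtMs;
  const due: Channel[] = ["slack"]; // every alert goes to #drop-ops first
  if (incident.acknowledgedAtMs === undefined && ageMs >= 5 * MINUTE) due.push("email");
  if (ageMs >= 15 * MINUTE) due.push("sms"); // still unresolved after 15 minutes
  return due;
}
```

Acknowledging an incident stops the email step but not the SMS step, which depends only on the incident staying unresolved.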

Cost Estimate

Free Tier (MVP)

  • ✅ CloudWatch Logs: 5 GB ingestion/month free (AWS Free Tier)
  • ✅ BetterStack: 10 monitors, 3-min interval, unlimited alerts
  • ✅ Sentry: 5K events/month free
  • ✅ GitHub Actions: 2000 minutes/month free
  • ✅ Terraform state: S3 free tier (first 12 months)

Total MVP cost: $0/month

Paid Services (Production)

  • CloudWatch Logs: ~$5/month (30 GB ingestion estimate)
  • BetterStack Pro: $20/month (30s interval, SMS alerts)
  • Sentry Team: $26/month (50K events, enhanced features)
  • Optional: Datadog APM: $15/host/month (~$45 for 3 hosts)

Total production cost: ~$50-100/month (without APM)


Recommendations

Immediate (This Week)

  1. ✅ Deploy BetterStack (1h): external monitoring is a fast win
  2. ✅ Configure CloudWatch retention (30 min): logs already flow, just set the policy
  3. ✅ Create audit log schema (2h): start with the table, integrate incrementally

Before Phase 1 Demo (Next 2 Weeks)

  1. ✅ Implement server-side error tracking (4h): Sentry edge or custom
  2. ✅ Write payment failure runbooks (4h): prepare for demo questions
  3. ✅ Deploy Cloudflare WAF (2h): security hygiene

Before Phase 2 Go-Live (Next 2-3 Months)

  1. 🔲 External penetration test (hire security firm, ~$5K budget)
  2. 🔲 APM implementation (Datadog or Sentry Performance)
  3. 🔲 On-call rotation (define schedule, test escalation)
  4. 🔲 Backup testing (restore from snapshot, verify data integrity)

Post-Launch Optimization

  1. 🔲 Synthetic monitoring (Checkly or custom Playwright tests)
  2. 🔲 Public status page (BetterStack included, just enable)
  3. 🔲 Core Web Vitals (Google Lighthouse CI integration)

Success Metrics

Before Go-Live (P0 Checklist)

  • Server errors visible in Sentry (test: trigger 500, verify event)
  • Audit logs capture login/logout (test: log in, check audit_logs table)
  • WAF blocks SQLi attack (test: ?id=1' OR '1'='1, expect 403)
  • CloudWatch Logs retain 30 days (verify retention policy)
  • BetterStack alerts on downtime (test: stop app, receive alert <5 min)
  • Runbooks tested (simulate BankID failure, follow procedure)

Production KPIs

  • Uptime: >99.9% (measured by BetterStack)
  • MTTD (Mean Time To Detect): <3 minutes (external monitoring interval)
  • MTTR (Mean Time To Recover): <15 minutes (via runbooks)
  • Error rate: <0.1% of requests (tracked via Sentry)
  • Log retention: 100% compliance (30 days CloudWatch, 5 years audit logs)
  • Alert noise: <5 false positives/week (cooldown + severity tuning)
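
The uptime and error-rate KPIs reduce to simple arithmetic. The helpers below are a hypothetical sketch for SLA reporting, which is currently listed as missing, and do not describe an existing module.

```typescript
// Hypothetical KPI helpers; a sketch of the arithmetic only.
function uptimePercent(totalMinutes: number, downtimeMinutes: number): number {
  return ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
}

// Downtime allowed in a period while still meeting the target (e.g. 99.9).
function downtimeBudgetMinutes(totalMinutes: number, targetPercent: number): number {
  return totalMinutes * (1 - targetPercent / 100);
}

// Share of failed requests, as a percentage; zero traffic counts as a 0% error rate.
function errorRatePercent(failedRequests: number, totalRequests: number): number {
  return totalRequests === 0 ? 0 : (failedRequests / totalRequests) * 100;
}
```

For a 30-day month (43,200 minutes), a 99.9% target leaves about 43.2 minutes of downtime. With up to 3 minutes to detect and 15 minutes to recover, roughly two full incidents per month fit within that budget.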

Appendices

A. Related Documents

  • docs/infrastructure/MONITORING.md: Current monitoring setup
  • docs/infrastructure/BETTERSTACK-SETUP.md: External monitoring guide
  • docs/dr-runbook.md: Infrastructure disaster recovery
  • infrastructure/waf-rules.md: WAF rule definitions
  • security/drop-security-rapport.md: Security audit findings

B. External Services

  • BetterStack: https://betterstack.com/uptime
  • Sentry: https://sentry.io/
  • AWS CloudWatch: https://console.aws.amazon.com/cloudwatch/
  • Cloudflare: https://dash.cloudflare.com/

C. Change History

  • 2026-02-22: Initial analysis (John)

Next Actions:

  1. Review this analysis with Alem
  2. Approve P0 implementation plan
  3. Begin P0 work (estimated 21 hours / 2-3 days)
  4. Track progress in Mission Control tasks