Drop Support Systems Analysis

Date: 2026-02-22
Author: John (AI Director)
Status: MVP Hardening Phase (0.5)
Purpose: Comprehensive analysis of support systems for production-ready fintech deployment

Executive Summary

Drop currently has foundational support systems in place but requires critical enhancements before production launch. The application has health checks, CI/CD, error tracking (client-side), and basic alerting, but lacks enterprise-grade observability, audit logging, and incident response procedures required for a PSD2-compliant fintech service.

Key Findings:

✅ Strong foundation: Comprehensive CI/CD with >80% coverage, health checks, structured logging
⚠️ Critical gaps: No server-side error tracking, no audit trails, no APM, limited incident response
🚨 Production blockers: 6 P0 items must be addressed before go-live (see Gap Analysis)

Recommendation: Implement P0 systems immediately (est. 2-3 days), defer P1 to Phase 2 (banking integration), and P2 to post-launch optimization.

Current State

1. Monitoring — Uptime & Health Checks

What Exists

✅ Health endpoint: /api/health with database connectivity verification
- Checks: DB query latency, driver type (pg/sqlite), service mode, uptime
- Returns: ok (200), degraded (200), or down (503)
- Source: src/drop-app/src/app/api/health/route.ts
✅ Container health checks:
- Docker: 30s interval, 10s timeout, 3 retries
- Fly.io: 30s interval, 10s grace period, 5s timeout
- Auto-restart on failure
✅ External uptime monitoring (ready to deploy):
- BetterStack setup guide documented
- Free tier: 10 monitors, 3-min interval, SMS/email/Slack alerts
- Documentation: docs/infrastructure/BETTERSTACK-SETUP.md
✅ Cron health check script:
- infrastructure/health-check.sh — AWS App Runner endpoint
- Slack webhook integration (optional)
- Can run via cron for local monitoring
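
The three-state health response listed above (ok, degraded, down) could be derived along these lines. This is a minimal sketch, not the code in route.ts: the function name `classifyHealth`, the `DbProbe` shape, and the 500 ms latency threshold are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names and the 500 ms threshold are assumptions,
// not taken from src/drop-app/src/app/api/health/route.ts.
type HealthStatus = "ok" | "degraded" | "down";

interface DbProbe {
  reachable: boolean; // did the connectivity query succeed?
  latencyMs: number;  // how long the query took
}

const DEGRADED_LATENCY_MS = 500; // assumed cut-off for "slow but alive"

function classifyHealth(probe: DbProbe): { status: HealthStatus; httpCode: number } {
  if (!probe.reachable) {
    // No database connectivity: report down with 503.
    return { status: "down", httpCode: 503 };
  }
  if (probe.latencyMs > DEGRADED_LATENCY_MS) {
    // Degraded still maps to 200, matching the status codes listed above.
    return { status: "degraded", httpCode: 200 };
  }
  return { status: "ok", httpCode: 200 };
}
```

Because degraded returns 200, an external monitor that only looks at the status code cannot tell it apart from ok; a keyword check on the response body is needed for that.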

What's Missing

❌ Synthetic monitoring: No transaction flow testing (login → send money → verify)
❌ Multi-region checks: No geographic availability testing
❌ SLA tracking: No uptime percentage calculation or reporting
❌ Dependency monitoring: No checks for external services (Swan API, BankID, Sumsub)

Assessment

Status: Adequate for MVP, requires enhancement for production. Gap: External monitoring configured but not deployed. Synthetic checks needed.

2. Logging — Centralized Log Aggregation

What Exists

✅ Structured logging:
- JSON format with timestamp, level, message, requestId, metadata
- Source: src/drop-app/src/lib/logger.ts
- Writes to stdout (Docker-friendly)
✅ Request correlation:
- x-request-id header extraction or UUID generation
- Request context propagation through logger instances
✅ Log levels: debug, info, warn, error
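
For reference, request-scoped structured logging of the kind described above can be sketched as follows. This is an illustration only and not the contents of logger.ts; `resolveRequestId`, `createLogger`, and the injectable `write` parameter are hypothetical names.

```typescript
// Hypothetical sketch of a request-scoped JSON logger; not the code in logger.ts.
import { randomUUID } from "node:crypto";

type Level = "debug" | "info" | "warn" | "error";
type Meta = Record<string, unknown>;

// Reuse the caller's x-request-id when present, otherwise mint a UUID.
function resolveRequestId(headers: Record<string, string | undefined>): string {
  return headers["x-request-id"] ?? randomUUID();
}

function createLogger(requestId: string, write: (line: string) => void = console.log) {
  const log = (level: Level, message: string, metadata?: Meta): void => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      requestId,
      metadata, // omitted from the JSON output when undefined
    };
    write(JSON.stringify(entry)); // one JSON object per line on stdout
  };
  return {
    debug: (message: string, metadata?: Meta) => log("debug", message, metadata),
    info: (message: string, metadata?: Meta) => log("info", message, metadata),
    warn: (message: string, metadata?: Meta) => log("warn", message, metadata),
    error: (message: string, metadata?: Meta) => log("error", message, metadata),
  };
}
```

Every line written through one logger instance carries the same requestId, which is what makes a single request traceable once the logs are aggregated.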

What's Missing

❌ Log aggregation: Logs write to stdout but aren't collected or indexed
❌ Log retention: No policy for how long logs are kept
❌ Log search: No way to query logs across time/instances
❌ Log forwarding: No integration with log management service
❌ Sensitive data scrubbing: Logger doesn't automatically redact PII

Assessment

Status: Foundation exists, but logs are ephemeral (lost on container restart). Gap: Critical for incident investigation and compliance audits. Need CloudWatch Logs or similar.

3. Error Tracking — Error Capture & Alerting

What Exists

✅ Client-side error tracking:
- Sentry browser integration (@sentry/browser)
- PII scrubbing (passwords, pins, card numbers, fødselsnummer)
- 10% trace sampling for performance monitoring
- Source: src/drop-app/src/lib/sentry.ts, SENTRY.md
✅ Error spike detection:
- Tracks errors in rolling 1-minute window
- Alerts when >5 errors in 60 seconds
- Source: src/drop-app/src/lib/alerts.ts:trackError()
✅ Global error boundaries:
- React error boundaries for component crashes
- global-error.tsx catches unhandled errors
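
The spike rule above (more than 5 errors inside a rolling 60-second window) can be sketched as a small sliding-window counter. The class below is a hypothetical illustration; the real trackError() in alerts.ts may be structured differently.

```typescript
// Hypothetical sketch; the real trackError() in alerts.ts may differ.
const SPIKE_WINDOW_MS = 60_000; // rolling 1-minute window
const SPIKE_THRESHOLD = 5;      // alert when more than 5 errors fall inside the window

class ErrorSpikeDetector {
  private timestamps: number[] = [];

  // Records one error and returns true when the spike threshold is exceeded.
  track(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    // Keep only errors that are still inside the rolling window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < SPIKE_WINDOW_MS);
    return this.timestamps.length > SPIKE_THRESHOLD;
  }
}
```

Passing the clock in as `nowMs` keeps the detector testable without waiting for real time to pass.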

What's Missing

❌ Server-side error tracking: Sentry removed from server due to Next.js 16 Turbopack incompatibility (MC #1271)
❌ API error context: Server errors log to console only, no structured capture
❌ Error attribution: Can't trace errors to specific users or transactions
❌ Error deduplication: Same error reported multiple times clogs alerts

Assessment

Status: Client errors are tracked; server errors are invisible. Gap (CRITICAL): server-side errors (API, DB, integrations) are not captured anywhere. P0 fix required.

4. Alerting — On-Call & Escalation

What Exists

✅ Slack alerting:
- Operational alerts with severity levels (info/warning/critical)
- 10-minute cooldown per alert title (spam prevention)
- Source: src/drop-app/src/lib/alerts.ts
✅ Lifecycle alerts:
- App startup notification
- Graceful shutdown notification
- Source: instrumentation.ts
✅ Error spike alerts:
- Automatic critical alert when >5 errors/minute
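
The 10-minute cooldown per alert title can be sketched as a map from title to last-sent time. This is an assumed shape for illustration, not the code in alerts.ts; `AlertCooldown` and `shouldSend` are hypothetical names.

```typescript
// Hypothetical sketch of per-title alert suppression; not the code in alerts.ts.
const ALERT_COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes

class AlertCooldown {
  private lastSent = new Map<string, number>();

  // Returns true if an alert with this title may be sent now, and records the send.
  shouldSend(title: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && nowMs - last < ALERT_COOLDOWN_MS) {
      return false; // same title was sent less than 10 minutes ago
    }
    this.lastSent.set(title, nowMs);
    return true;
  }
}
```

Keying on the title means two different incidents with the same title suppress each other, which is one reason the deduplication and routing gaps below matter.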

What's Missing

❌ On-call rotation: No defined on-call schedule or escalation policy
❌ Alert routing: All alerts go to same Slack channel, no severity-based routing
❌ Alert escalation: No automatic escalation after N minutes of unresolved incident
❌ Alert acknowledgment: Can't mark alerts as "acknowledged" or "resolved"
❌ SMS/phone alerts: Critical incidents only notify via Slack (single point of failure)
❌ Alert testing: No way to test alert pipeline without triggering real incidents

Assessment

Status: Basic alerting works for a small team but is inadequate for 24/7 production. Gap: Need on-call schedule, escalation policy, and multi-channel delivery.

5. Security Monitoring — WAF, DDoS, Anomaly Detection, Audit Logs

What Exists

✅ WAF rules defined:
- CSRF origin validation (implemented in middleware)
- Rate limiting on auth endpoints (10 req/60s)
- CSP headers with nonce-based script loading
- Source: infrastructure/waf-rules.md, src/drop-app/src/middleware.ts
✅ Container security scanning:
- Trivy vulnerability scanner in CI/CD
- Blocks HIGH/CRITICAL vulnerabilities
- SARIF upload to GitHub Security tab
✅ Dependency scanning:
- npm audit in CI pipeline (prod deps only)
✅ AML transaction monitoring:
- 5 automated rules: structuring, velocity, high amount, high-risk corridor, unusual pattern
- Alerts stored in aml_alerts table
- Source: src/drop-app/src/lib/transaction-monitor.ts

What's Missing

❌ WAF deployment: Rules defined but not deployed (requires CDN/reverse proxy)
❌ DDoS protection: No rate limiting at network edge, only app-level
❌ Intrusion detection: No IDS/IPS monitoring unusual access patterns
❌ Audit logs: No immutable log of authentication, authorization, data access events (PSD2 requirement)
❌ Security incident response plan: No runbook for security breaches
❌ Penetration testing: No external security audit completed

Assessment

Status: Security-aware codebase, but monitoring and audit infrastructure is missing. Gap (CRITICAL): audit logs are a PSD2/GDPR compliance requirement. P0 fix.

6. Performance — APM, Latency Tracking, Resource Utilization

What Exists

✅ Health check latency:
- DB query time measured in health endpoint
- Reported in milliseconds
✅ Quality gates in CI (no runtime performance budgets yet):
- Coverage thresholds enforced (80/70/80/80)

What's Missing

❌ APM (Application Performance Monitoring): No distributed tracing
❌ API latency tracking: Don't know which endpoints are slow
❌ Database performance: No slow query alerts or query profiling
❌ Resource utilization: No CPU/memory/disk usage monitoring
❌ Frontend performance: No Core Web Vitals tracking (LCP, INP, CLS)
❌ Transaction timing: Can't measure end-to-end payment latency

Assessment

Status: Minimal. Can detect a total outage but not performance degradation. Gap: APM and latency tracking are needed before production to identify bottlenecks and capacity issues.

7. Database — Backups, Replication, Monitoring

What Exists

✅ Automated backups (RDS):
- Daily automated snapshots, 7-day retention
- Point-in-time recovery within 7 days
- Source: docs/dr-runbook.md
✅ Multi-AZ (production):
- RDS configured for high availability (if enabled)
✅ Database health check:
- SELECT 1 query in health endpoint verifies connectivity

What's Missing

❌ Backup verification: Snapshots created but never tested for restore
❌ Backup monitoring: No alerts if backup fails
❌ Replication lag monitoring: No alerts if replica falls behind
❌ Connection pool monitoring: No visibility into connection usage
❌ Query performance: No slow query log analysis
❌ Storage monitoring: No alerts before disk fills up

Assessment

Status: Basic backup/restore exists, monitoring gaps. Gap: Backup testing and proactive monitoring needed before production.

8. Incident Response — Runbooks, Status Page, Communication Plan

What Exists

✅ DR runbook:
- Procedures for App Runner down, RDS down, full redeploy
- Environment variable checklist
- Contact escalation (John → Alem)
- Source: docs/dr-runbook.md
✅ Incident checklist:
- 8-step incident response workflow
- Post-mortem requirement (48h)

What's Missing

❌ Status page: No public/customer-facing status page
❌ Incident templates: No standardized incident report format
❌ Communication plan: No templates for customer notifications during outages
❌ Runbook coverage: Only covers infrastructure, missing:
- Payment failures (PISP/AISP errors)
- BankID integration issues
- KYC/AML false positive handling
- Data breach response
❌ Runbook testing: Procedures documented but never executed

Assessment

Status: Basic DR runbook exists, lacks fintech-specific scenarios. Gap: Need payment/banking integration runbooks before Phase 2.

9. CI/CD — Build Pipeline, Deployment, Rollback

What Exists

✅ Comprehensive CI pipeline:
- Multi-package change detection
- Lint, typecheck, unit tests, E2E (Playwright), mutation testing (Stryker)
- Coverage thresholds enforced (80/70/80/80) with ratchet (never decrease)
- Docker build + Trivy security scan
- Quality gate (required status check)
- Source: .github/workflows/ci.yml
✅ Deployment workflows:
- GitHub Actions for deploy (backend, mobile)
- Terraform for infrastructure
- Source: .github/workflows/deploy.yml, terraform-ci.yml
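
The coverage ratchet in the CI pipeline above (thresholds of 80/70/80/80 that may never decrease) amounts to comparing each run against both a fixed floor and the previous baseline. The sketch below is illustrative only: it assumes the four numbers map to statements/branches/functions/lines in that order, which the source does not state, and `ratchetViolations` is a hypothetical helper rather than anything in ci.yml.

```typescript
// Hypothetical sketch. The mapping of 80/70/80/80 to these four metrics is an assumption.
interface Coverage {
  statements: number;
  branches: number;
  functions: number;
  lines: number;
}

const COVERAGE_FLOOR: Coverage = { statements: 80, branches: 70, functions: 80, lines: 80 };

// Returns one message per metric that fell below the floor or below the previous baseline.
function ratchetViolations(baseline: Coverage, current: Coverage): string[] {
  const metrics: (keyof Coverage)[] = ["statements", "branches", "functions", "lines"];
  const violations: string[] = [];
  for (const metric of metrics) {
    // The effective requirement is whichever is higher: the floor or the last accepted value.
    const required = Math.max(COVERAGE_FLOOR[metric], baseline[metric]);
    if (current[metric] < required) {
      violations.push(`${metric}: ${current[metric]} < ${required}`);
    }
  }
  return violations;
}
```

An empty result means the quality gate passes; any entry would fail the required status check.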

What's Missing

❌ Automated rollback: Deployment failure doesn't auto-revert
❌ Canary deployments: All-or-nothing deployment, no gradual rollout
❌ Deployment monitoring: No automatic health check after deploy
❌ Deployment notifications: Team not notified of deployments/failures
❌ Infrastructure drift detection: Terraform state not continuously validated

Assessment

Status: Strong quality gate, weak deployment safety. Gap: Add post-deployment health checks and rollback automation.

10. Compliance — Audit Trails, Data Retention, GDPR/PSD2 Logging

What Exists

✅ AML monitoring:
- Transaction alerts stored in aml_alerts table
- 5 risk categories tracked
✅ Security audit completed:
- 4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW findings documented
- Source: security/drop-security-rapport.md
✅ Data retention service:
- Code exists for GDPR compliance
- Source: src/drop-app/src/lib/services/data-retention.ts

What's Missing

❌ Audit logs: No immutable record of:
- User authentication events (login, logout, failed attempts)
- Authorization decisions (who accessed what, when)
- Data modifications (user profile changes, transaction edits)
- Administrative actions (KYC approvals, AML reviews)
❌ Audit log retention policy: PSD2 requires 5+ years
❌ Audit log integrity: No cryptographic proof of non-tampering
❌ Compliance reporting: No automated report generation for regulators
❌ STR (Suspicious Transaction Report) workflow: AML alerts created but no submission process
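
One common way to close the audit log integrity gap is a hash chain, where each entry stores a hash over its own payload plus the previous entry's hash, so that editing or deleting an earlier row invalidates every later one. The following is a minimal sketch of that technique and not a design the source prescribes; `ChainedEntry`, `appendEntry`, and `verifyChain` are hypothetical names.

```typescript
// Hypothetical hash-chain sketch for tamper evidence; not an existing Drop module.
import { createHash } from "node:crypto";

interface ChainedEntry {
  payload: string;  // serialized audit event
  prevHash: string; // hash of the previous entry ("" for the first entry)
  hash: string;     // SHA-256 over prevHash followed by payload
}

function hashEntry(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Returns a new chain with the payload appended and linked to the previous entry.
function appendEntry(chain: ChainedEntry[], payload: string): ChainedEntry[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : "";
  return [...chain, { payload, prevHash, hash: hashEntry(prevHash, payload) }];
}

// Recomputes every hash from the start; any edited or removed entry breaks the chain.
function verifyChain(chain: ChainedEntry[]): boolean {
  let prevHash = "";
  for (const entry of chain) {
    if (entry.prevHash !== prevHash) return false;
    if (entry.hash !== hashEntry(prevHash, entry.payload)) return false;
    prevHash = entry.hash;
  }
  return true;
}
```

A chain like this only proves tampering relative to a trusted copy of the latest hash, so that value would need to be stored somewhere the application cannot rewrite.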

Assessment

Status: CRITICAL GAP. Audit logs are a PSD2 legal requirement. Gap: P0, must be implemented before production launch.

Gap Analysis

P0 — Production Blockers (Must Fix Before Go-Live)

#	Category	Gap	Impact	Effort
1	Error Tracking	No server-side error monitoring	Can't detect/debug API failures	4h
2	Compliance	No audit logs (auth, data access, admin actions)	PSD2 non-compliance, legal risk	8h
3	Security	WAF rules defined but not deployed	Vulnerable to SQLi, XSS, DDoS	2h (config)
4	Logging	No log aggregation/retention	Can't investigate incidents	2h (CloudWatch setup)
5	Monitoring	BetterStack configured but not deployed	No external incident detection	1h (account setup)
6	Incident Response	No payment/banking failure runbooks	Can't recover from PISP/BankID outages	4h

Total P0 effort: ~21 hours (2-3 days)

P1 — Needed Soon (Before Phase 2: Banking Integration)

#	Category	Gap	Impact	Effort
7	Alerting	No on-call rotation or escalation policy	Incidents may go unnoticed outside work hours	2h
8	Performance	No APM for distributed tracing	Can't diagnose slow transactions	4h
9	Database	No backup testing or monitoring	Backups may be corrupt, undetected	3h
10	Security	No penetration testing	Unknown vulnerabilities	16h (external)
11	CI/CD	No automated rollback on deployment failure	Bad deploys cause extended outages	6h
12	Compliance	No STR submission workflow	Can't fulfill AML obligations	8h

Total P1 effort: ~39 hours (5 days)

P2 — Nice to Have (Post-Launch Optimization)

#	Category	Gap	Impact	Effort
13	Monitoring	No synthetic transaction monitoring	Can't detect broken user flows	8h
14	Performance	No Core Web Vitals tracking	Poor user experience undetected	4h
15	Alerting	No SMS/phone alerts for critical incidents	Slack outage = missed alerts	2h
16	Database	No slow query alerts	Performance degradation undetected	6h
17	Security	No IDS/IPS for intrusion detection	Advanced attacks undetected	16h
18	Incident Response	No public status page	Customers unaware of outages	4h

Total P2 effort: ~40 hours (5 days)

Implementation Plan

Phase 1: P0 Production Blockers (NOW — before Phase 1 demo)

Goal: Address legal/compliance requirements and critical observability gaps.

1.1 Server-Side Error Tracking (4h)

Problem: All server errors invisible after Sentry removed (Next.js 16 Turbopack incompatibility).

Solution:

Option A: Sentry Edge SDK (compatible with Next.js middleware)
- Install: @sentry/nextjs with edge-only config
- Capture server errors via captureException() in middleware
- Source maps via Sentry webpack plugin
Option B: Custom error aggregation service
- POST errors to internal /api/errors/capture endpoint
- Store in error_logs table with context
- Alert on spike detection

Deliverable:

src/drop-app/sentry.edge.config.ts (if Option A)
Updated src/drop-app/src/lib/sentry-server.ts with edge-compatible capture
Test: Trigger 500 error, verify Sentry event created

Files: infrastructure/error-tracking-setup.md

1.2 Audit Logging System (8h)

Problem: PSD2 requires immutable audit trail for auth, data access, admin actions.

Solution:

Create audit_logs table:

CREATE TABLE audit_logs (
  id TEXT PRIMARY KEY,
  timestamp TEXT NOT NULL,
  user_id TEXT,
  action TEXT NOT NULL, -- 'login', 'data_access', 'kyc_approval', etc.
  resource_type TEXT, -- 'user', 'transaction', 'aml_alert'
  resource_id TEXT,
  metadata JSON,
  ip_address TEXT,
  user_agent TEXT,
  request_id TEXT,
  result TEXT -- 'success', 'failure', 'denied'
);
CREATE INDEX idx_audit_user ON audit_logs(user_id, timestamp);
CREATE INDEX idx_audit_action ON audit_logs(action, timestamp);

Audit functions:

auditLog({
  userId: 'usr_123',
  action: 'login_success',
  resourceType: 'user',
  resourceId: 'usr_123',
  metadata: { method: 'bankid' },
  ip: '1.2.3.4',
  userAgent: 'Mozilla...',
  requestId: 'req_456'
});

Integrate at:
- POST /api/auth/login (login_success, login_failure)
- POST /api/auth/logout (logout)
- GET /api/users/:id (data_access)
- PATCH /api/users/:id/kyc (kyc_approval, kyc_rejection)
- PATCH /api/aml-alerts/:id (aml_review)
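
A possible shape for the auditLog() helper, mapping the call shown above onto the audit_logs columns, is sketched below. It is an assumption-laden illustration: the `AuditEvent` and `AuditRow` types, the injected `insert` function, and the default of `result: 'success'` when the caller omits it are not taken from the source.

```typescript
// Hypothetical sketch of src/drop-app/src/lib/audit-log.ts; all names are assumptions.
import { randomUUID } from "node:crypto";

type AuditResult = "success" | "failure" | "denied";

interface AuditEvent {
  userId?: string;
  action: string;
  resourceType?: string;
  resourceId?: string;
  metadata?: Record<string, unknown>;
  ip?: string;
  userAgent?: string;
  requestId?: string;
  result?: AuditResult;
}

// Row shape mirroring the audit_logs columns defined above.
interface AuditRow {
  id: string;
  timestamp: string;
  user_id: string | null;
  action: string;
  resource_type: string | null;
  resource_id: string | null;
  metadata: string | null;
  ip_address: string | null;
  user_agent: string | null;
  request_id: string | null;
  result: AuditResult;
}

function toAuditRow(event: AuditEvent, now: Date = new Date()): AuditRow {
  return {
    id: randomUUID(),
    timestamp: now.toISOString(),
    user_id: event.userId ?? null,
    action: event.action,
    resource_type: event.resourceType ?? null,
    resource_id: event.resourceId ?? null,
    metadata: event.metadata ? JSON.stringify(event.metadata) : null,
    ip_address: event.ip ?? null,
    user_agent: event.userAgent ?? null,
    request_id: event.requestId ?? null,
    result: event.result ?? "success", // assumed default when the caller omits it
  };
}

// Append-only by construction: the helper only ever inserts, never updates or deletes.
async function auditLog(event: AuditEvent, insert: (row: AuditRow) => Promise<void>): Promise<AuditRow> {
  const row = toAuditRow(event);
  await insert(row);
  return row;
}
```

Injecting `insert` keeps the helper independent of the database driver, so the same function can write through either the pg or the sqlite path.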

Deliverable:

src/drop-app/src/lib/audit-log.ts (audit logging functions)
Migration: migrations/003_audit_logs.sql
Integration in auth routes and admin endpoints
Retention policy: Document 5-year retention for PSD2 compliance

Files: support/audit-logging-setup.md

1.3 WAF Deployment (2h)

Problem: WAF rules defined but not enforced (requires reverse proxy).

Solution:

Option A: Cloudflare WAF (recommended)
- Already using Cloudflare for DNS (terraform module exists)
- Free tier includes basic WAF rules
- Configure: SQLi, XSS, path traversal rules from infrastructure/waf-rules.md
Option B: AWS WAF (if using App Runner directly)
- $5/month + $1/million requests
- Associate with App Runner service

Deliverable:

Cloudflare WAF configuration (Terraform or UI)
Test: Send SQLi payload, verify 403 response
Document: Update infrastructure/waf-rules.md with deployment steps

Files: infrastructure/cloudflare-waf-setup.md

1.4 Log Aggregation (2h)

Problem: Structured logs write to stdout but aren't retained or searchable.

Solution:

AWS CloudWatch Logs (App Runner auto-integrates):
- App Runner streams stdout → CloudWatch Logs automatically
- Configure retention: 30 days (production), 7 days (staging)
- Set up log insights queries for common patterns
Fly.io (staging):
- fly logs stores last 24h by default
- Optional: Forward to external service (Papertrail, Logtail)

Deliverable:

CloudWatch Logs retention policy configured
Log Insights queries:
- All errors: fields @timestamp, message | filter level = "error"
- User actions: fields @timestamp, userId, message | filter userId = "usr_123"
- Request trace: fields @timestamp, requestId, message | filter requestId = "req_456"
Documentation: infrastructure/logging-setup.md

Files: infrastructure/cloudwatch-logs-setup.md

1.5 External Uptime Monitoring (1h)

Problem: BetterStack documented but not deployed.

Solution:

Sign up: https://betterstack.com/uptime (free tier)
Create monitors:
1. Production health: https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health
  - Interval: 3 minutes
  - Keyword check: "status":"ok"
2. Staging health: https://drop-staging.fly.dev/api/health
3. Landing page: https://getdrop.no (when live)
Slack integration: Connect to #drop-ops channel
Email alerts: [email protected]

Deliverable:

BetterStack account with 3 monitors configured
Test: Pause monitor, verify alert received
Documentation: Update docs/infrastructure/BETTERSTACK-SETUP.md with the monitor configuration (keep credentials in a secrets manager, not in the repository)

Files: support/betterstack-deployment.md

1.6 Payment/Banking Failure Runbooks (4h)

Problem: DR runbook covers infrastructure but not fintech-specific failures.

Solution:

Create runbooks for:
1. BankID integration failure (authentication blocked)
2. PISP payment failure (remittance/QR payment rejected)
3. AISP balance retrieval failure (can't fetch account balance)
4. Swan API outage (BaaS provider down)
5. Sumsub KYC failure (identity verification unavailable)
6. Neonomics open banking outage
Each runbook includes:
- Symptoms (what users see)
- Diagnosis steps (check service status, logs, error codes)
- Recovery procedure (fallback, retry, escalation)
- Customer communication template

Deliverable:

support/runbooks/bankid-failure.md
support/runbooks/pisp-payment-failure.md
support/runbooks/aisp-balance-failure.md
support/runbooks/swan-api-outage.md
support/runbooks/sumsub-kyc-failure.md
support/runbooks/neonomics-outage.md

Files: Created in support/runbooks/

Phase 2: P1 Items (Phase 2: Banking Integration)

Defer to Phase 2 when real banking integrations are live and need production-grade support.

Priority order:

1. Penetration testing (external security audit)
2. APM for transaction tracing (identify slow payments)
3. On-call rotation and escalation policy
4. Automated rollback on failed deployments
5. Backup testing and monitoring
6. STR submission workflow (AML compliance)

Phase 3: P2 Items (Post-Launch)

Optimize after initial production deployment and user feedback.

Priority order:

1. Synthetic transaction monitoring (test critical user flows)
2. Public status page (customer transparency)
3. Core Web Vitals tracking (frontend performance)
4. SMS/phone alerts (redundancy)
5. Slow query monitoring (database optimization)
6. IDS/IPS (advanced threat detection)

Architecture

Support Systems Connectivity

┌─────────────────────────────────────────────────────────────────┐
│                         Drop Application                        │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │  drop-app   │  │  drop-api    │  │  drop-mobile (Expo)  │  │
│  │  (Next.js)  │  │  (Hono)      │  │  (React Native)      │  │
│  └─────────────┘  └──────────────┘  └──────────────────────┘  │
│         │                │                      │               │
│         └────────────────┴──────────────────────┘               │
│                          │                                      │
└──────────────────────────┼──────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────────────────┐
        │                  │                              │
        ▼                  ▼                              ▼
┌───────────────┐  ┌──────────────┐           ┌──────────────────┐
│ Structured    │  │ Health Check │           │ Audit Logs       │
│ Logging       │  │ Endpoint     │           │ (audit_logs      │
│ (JSON stdout) │  │ /api/health  │           │  table)          │
└───────┬───────┘  └──────┬───────┘           └─────────┬────────┘
        │                 │                             │
        │                 │                             │
        ▼                 │                             │
┌────────────────┐        │                             │
│ CloudWatch     │        │                             │
│ Logs           │        │                             │
│ (30d retention)│        │                             │
└────────────────┘        │                             │
        │                 │                             │
        │                 ▼                             │
        │         ┌───────────────┐                     │
        │         │ BetterStack   │                     │
        │         │ (external     │                     │
        │         │  monitoring)  │                     │
        │         └───────┬───────┘                     │
        │                 │                             │
        └─────────────────┼─────────────────────────────┘
                          │
                          ▼
                 ┌────────────────┐
                 │ Alerting Layer │
                 │ (alerts.ts)    │
                 └────────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Slack       │  │ Sentry      │  │ Email       │
│ Webhook     │  │ (client +   │  │ (SMTP)      │
│ (#drop-ops) │  │  edge)      │  │             │
└─────────────┘  └─────────────┘  └─────────────┘

Data Flows

Error Flow:
- Client error → Sentry browser → Slack alert (if spike)
- Server error → Sentry edge → CloudWatch Logs → Slack alert
- API 5xx → trackError() → Spike detection → Slack
Monitoring Flow:
- App → stdout → CloudWatch Logs
- App → /api/health → BetterStack → Slack/Email/SMS
- Container → Docker health check → Auto-restart
Audit Flow:
- User action → auditLog() → audit_logs table
- Compliance query → SQL export → Regulator submission
Incident Flow:
- Alert → Slack #drop-ops
- Unacknowledged (5 min) → Email to Alem
- Unresolved (15 min) → SMS (BetterStack escalation)
- Incident → Runbook → Recovery → Post-mortem
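
The incident flow above can be read as a small decision function over the incident's age and state. The sketch below encodes the 5-minute and 15-minute steps from the list; `IncidentState` and `channelsDue` are hypothetical names, and treating Slack as always notified is an assumption drawn from the first step.

```typescript
// Hypothetical sketch of the escalation steps listed above; not an existing module.
type EscalationChannel = "slack" | "email" | "sms";

interface IncidentState {
  openedAtMs: number;    // when the alert first fired
  acknowledged: boolean; // someone has picked the incident up
  resolved: boolean;     // the incident is closed
}

// Returns the channels that should have been notified by `nowMs`.
function channelsDue(incident: IncidentState, nowMs: number): EscalationChannel[] {
  const channels: EscalationChannel[] = ["slack"]; // every alert goes to #drop-ops first
  if (incident.resolved) return channels;          // no further escalation once resolved

  const ageMinutes = (nowMs - incident.openedAtMs) / 60_000;
  if (!incident.acknowledged && ageMinutes >= 5) {
    channels.push("email"); // unacknowledged after 5 minutes
  }
  if (ageMinutes >= 15) {
    channels.push("sms");   // still unresolved after 15 minutes
  }
  return channels;
}
```

Note that an acknowledged but unresolved incident still reaches SMS at 15 minutes, since that step depends on resolution rather than acknowledgment.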

Cost Estimate

Free Tier (MVP)

✅ CloudWatch Logs: 5 GB ingestion/month free (AWS Free Tier)
✅ BetterStack: 10 monitors, 3-min interval, unlimited alerts
✅ Sentry: 5K events/month free
✅ GitHub Actions: 2000 minutes/month free
✅ Terraform state: S3 free tier (first 12 months)

Total MVP cost: $0/month

Paid Services (Production)

CloudWatch Logs: ~$5/month (30 GB ingestion estimate)
BetterStack Pro: $20/month (30s interval, SMS alerts)
Sentry Team: $26/month (50K events, enhanced features)
Optional: Datadog APM: $15/host/month (~$45 for 3 hosts)

Total production cost: ~$50-100/month (without APM)

Recommendations

Immediate (This Week)

🔲 Deploy BetterStack (1h): external monitoring is a fast win
🔲 Configure CloudWatch retention (30 min): logs already flow, only the policy needs setting
🔲 Create audit log schema (2h): start with the table, integrate incrementally

Before Phase 1 Demo (Next 2 Weeks)

🔲 Implement server-side error tracking (4h): Sentry edge or custom
🔲 Write payment failure runbooks (4h): prepare for demo questions
🔲 Deploy Cloudflare WAF (2h): security hygiene

Before Phase 2 Go-Live (Next 2-3 Months)

🔲 External penetration test (hire security firm, ~$5K budget)
🔲 APM implementation (Datadog or Sentry Performance)
🔲 On-call rotation (define schedule, test escalation)
🔲 Backup testing (restore from snapshot, verify data integrity)

Post-Launch Optimization

🔲 Synthetic monitoring (Checkly or custom Playwright tests)
🔲 Public status page (BetterStack included, just enable)
🔲 Core Web Vitals (Google Lighthouse CI integration)

Success Metrics

Before Go-Live (P0 Checklist)

Server errors visible in Sentry (test: trigger 500, verify event)
Audit logs capture login/logout (test: log in, check audit_logs table)
WAF blocks SQLi attack (test: ?id=1' OR '1'='1, expect 403)
CloudWatch Logs retain 30 days (verify retention policy)
BetterStack alerts on downtime (test: stop app, receive alert <5 min)
Runbooks tested (simulate BankID failure, follow procedure)

Production KPIs

Uptime: >99.9% (measured by BetterStack)
MTTD (Mean Time To Detect): <3 minutes (external monitoring interval)
MTTR (Mean Time To Recover): <15 minutes (via runbooks)
Error rate: <0.1% of requests (tracked via Sentry)
Log retention: 100% compliance (30 days CloudWatch, 5 years audit logs)
Alert noise: <5 false positives/week (cooldown + severity tuning)
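
To make the uptime KPI concrete, the target can be converted into a downtime budget: 99.9% over a 30-day month leaves roughly 43 minutes of allowed downtime. The helper below is a small worked calculation; `allowedDowntimeMinutes` is a hypothetical name and the 30-day period is an assumption.

```typescript
// Hypothetical helper: converts an uptime target into a downtime budget.
function allowedDowntimeMinutes(uptimeTarget: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60; // 30 days = 43,200 minutes
  return totalMinutes * (1 - uptimeTarget);
}

// 99.9% over 30 days: 43,200 * 0.001, about 43.2 minutes of downtime.
const monthlyBudget = allowedDowntimeMinutes(0.999, 30);
console.log(`Monthly downtime budget: ${monthlyBudget.toFixed(1)} minutes`);
```

With an MTTR target of under 15 minutes, that budget covers fewer than three full incidents per month.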

Appendices

A. Related Documents

docs/infrastructure/MONITORING.md — Current monitoring setup
docs/infrastructure/BETTERSTACK-SETUP.md — External monitoring guide
docs/dr-runbook.md — Infrastructure disaster recovery
infrastructure/waf-rules.md — WAF rule definitions
security/drop-security-rapport.md — Security audit findings

B. External Services

BetterStack: https://betterstack.com/uptime
Sentry: https://sentry.io/
AWS CloudWatch: https://console.aws.amazon.com/cloudwatch/
Cloudflare: https://dash.cloudflare.com/

C. Change History

2026-02-22: Initial analysis (John)

Next Actions:

Review this analysis with Alem
Approve P0 implementation plan
Begin P0 work (estimated 21 hours / 2-3 days)
Track progress in Mission Control tasks
