# Support & Runbooks

P0 checklists, support systems, audit logging, and operational runbooks

# Support Systems

Support systems, checklists, and audit logging

# P0: Implementation Checklist

# P0 Implementation Checklist - Drop Support Systems

**Date:** 2026-02-22
**Status:** Ready for Implementation
**Total Effort:** ~21 hours originally estimated (2-3 days); ~11 hours after the per-item revisions below
**Owner:** John (AI Director)

---

## Overview

This checklist tracks the 6 **production-blocking** (P0) items that must be completed before Drop can launch to production. Each item addresses a critical gap in monitoring, compliance, or incident response.

---

## P0 Items

### 1. Server-Side Error Tracking ⏱️ 2 hours (revised)

**Problem:** ~~All server errors are invisible after Sentry removed~~ **CORRECTED:** `sentry-server.ts` already exists with a lightweight Envelope API client (no `@sentry/node` dependency, Turbopack compatible). However, only 5 of 25+ routes have `captureServerError` integrated.

**Status:** 🟡 Partially Complete (library done, coverage gaps)

**Tasks:**
- [x] ~~Research Sentry Edge SDK compatibility~~ Already solved: custom Envelope API
- [x] ~~Install and configure~~ `src/lib/sentry-server.ts` already complete
- [x] ~~Update sentry-server.ts~~ Already has captureServerError + captureServerMessage
- [ ] **Expand captureServerError to ALL API routes** (currently only 5 routes)
- [ ] Test: Trigger 500 error in expanded routes, verify Sentry event
- [ ] Configure source maps upload (optional but recommended)

**Deliverables:**
- ✅ `src/lib/sentry-server.ts` (already complete: Envelope API, no SDK dep)
- ✅ Integrated in: bankid, bankid/callback, qr-payment, remittance, health
- 🔨 Expanding to: all remaining API routes (~20 routes)

**Acceptance Criteria:**
- ALL API routes have captureServerError in catch blocks
- Error includes context tags (endpoint name, userId)

---

### 2. Audit Logging System ⏱️ 0 hours (ALREADY COMPLETE)

**Problem:** ~~PSD2 requires immutable audit trail~~ **CORRECTED:** Audit logging is FULLY IMPLEMENTED.
**Status:** ✅ Complete

**What exists:**
- [x] `src/lib/audit.ts`: Full audit library with 30+ action types, logAudit(), getAuditLog(), countAuditEntries()
- [x] `audit_log` table in DB schema (initial migration + db.ts fallback)
- [x] Indexes on user_id, timestamp, action
- [x] 5-year retention documented (data-retention.ts explicitly excludes audit_log from cleanup)
- [x] Fire-and-forget pattern (doesn't block user actions)
- [x] Integrated in 20+ API routes: auth, transactions, cards, recipients, settings, consents, complaints, user management, GDPR endpoints
- [x] Admin audit export: `/api/admin/audit/` endpoint exists
- [x] GDPR data export: `/api/user/data-export/` includes audit log
- [x] Structured logger also captures audit events (stdout for CloudWatch)

**No action needed.** This was incorrectly flagged as missing in the initial analysis.

---

### 3. WAF Deployment ⏱️ 2 hours

**Problem:** WAF rules defined but not enforced (requires reverse proxy).

**Status:** ⬜ Not Started

**Tasks:**
- [ ] Review `infrastructure/waf-rules.md` for required rules
- [ ] Configure Cloudflare WAF (recommended):
  - [ ] Enable SQLi protection
  - [ ] Enable XSS protection
  - [ ] Enable path traversal blocking
  - [ ] Set request size limits (1MB API, 10KB auth)
- [ ] OR configure AWS WAF (alternative):
  - [ ] Create WAF web ACL
  - [ ] Associate with App Runner service
- [ ] Test WAF rules:
  - [ ] Send SQLi payload (`?id=1' OR '1'='1`), expect 403
  - [ ] Send XSS payload (``), expect 403
- [ ] Document deployment steps

**Deliverables:**
- ⬜ `infrastructure/cloudflare-waf-setup.md` (to be created)
- ⬜ Cloudflare WAF configured
- ⬜ Test results documented

**Acceptance Criteria:**
- SQLi attacks blocked with 403
- XSS attacks blocked with 403
- Legitimate requests pass through
- WAF logs visible in Cloudflare dashboard

---

### 4. Log Aggregation & Retention ⏱️ 2 hours

**Problem:** Structured logs write to stdout but aren't retained or searchable.
**Status:** ⬜ Not Started

**Tasks:**
- [ ] Set CloudWatch Logs retention policy:
  - [ ] Production: 30 days
  - [ ] Staging: 7 days
- [ ] Create CloudWatch Log Insights queries:
  - [ ] All errors (last hour)
  - [ ] User activity trace
  - [ ] Request trace by ID
  - [ ] API endpoint performance (slow queries)
  - [ ] Authentication events
  - [ ] Payment failures
- [ ] Create CloudWatch alarms:
  - [ ] High error rate (>10/min)
  - [ ] No logs received (service down)
  - [ ] Database errors (>5 in 5 min)
- [ ] Create SNS topic for alerts
- [ ] Subscribe email/Slack to SNS topic
- [ ] Test alarms (trigger error spike, verify alert)

**Deliverables:**
- ✅ `infrastructure/cloudwatch-logs-setup.md` (created)
- ⬜ CloudWatch retention policies set
- ⬜ Log Insights queries saved
- ⬜ CloudWatch alarms active

**Acceptance Criteria:**
- Logs retained for 30 days (production)
- Log Insights queries return results in <5 seconds
- Error spike triggers Slack alert within 2 minutes
- Service downtime triggers alert within 5 minutes

---

### 5. External Uptime Monitoring ⏱️ 1 hour

**Problem:** BetterStack documented but not deployed.
**Status:** ⬜ Not Started

**Tasks:**
- [ ] Sign up for BetterStack (free tier)
- [ ] Create monitors:
  - [ ] Production health: `https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health`
    - Interval: 3 minutes
    - Keyword check: `"status":"ok"`
  - [ ] Staging health: `https://drop-staging.fly.dev/api/health`
  - [ ] Landing page: `https://getdrop.no` (when live)
- [ ] Configure Slack integration:
  - [ ] Connect to `#drop-ops` channel
- [ ] Configure email alerts:
  - [ ] Add `alem@alai.no`
- [ ] Test monitoring:
  - [ ] Pause monitor manually
  - [ ] Verify alert received in Slack + email
  - [ ] Resume monitor

**Deliverables:**
- ✅ `docs/infrastructure/BETTERSTACK-SETUP.md` (already exists)
- ⬜ BetterStack account with monitors active
- ⬜ Slack integration tested

**Acceptance Criteria:**
- Health endpoint monitored every 3 minutes
- Downtime alert received in <5 minutes
- Alert includes endpoint URL and status
- Status page shows current uptime %

---

### 6. Payment/Banking Failure Runbooks ⏱️ 4 hours

**Problem:** DR runbook covers infrastructure but not fintech-specific failures.
**Status:** 🟡 Partially Complete

**Tasks:**
- [x] BankID integration failure runbook
- [x] PISP payment failure runbook (remittance + QR)
- [ ] AISP balance retrieval failure runbook
- [ ] Swan API outage runbook
- [ ] Sumsub KYC failure runbook
- [ ] Neonomics open banking outage runbook
- [ ] Test each runbook in staging (simulate failure)
- [ ] Update `docs/dr-runbook.md` to reference new runbooks

**Deliverables:**
- ✅ `support/runbooks/bankid-failure.md` (created)
- ✅ `support/runbooks/pisp-payment-failure.md` (created)
- ⬜ `support/runbooks/aisp-balance-failure.md`
- ⬜ `support/runbooks/swan-api-outage.md`
- ⬜ `support/runbooks/sumsub-kyc-failure.md`
- ⬜ `support/runbooks/neonomics-outage.md`

**Acceptance Criteria:**
- Each runbook includes: symptoms, diagnosis, solutions, escalation
- Runbooks tested (manual simulation in staging)
- Team trained on runbook usage
- Runbooks linked from main DR runbook

---

## Progress Tracking

### Completion Status

| Item | Status | Progress | Blocker |
|------|--------|----------|---------|
| 1. Server-side error tracking | 🟡 Expanding | 80% (lib done, expanding to all routes) | None |
| 2. Audit logging | ✅ COMPLETE | 100% (was already built) | None |
| 3. WAF deployment | 🟡 Ready | 90% (Terraform written, needs apply) | `terraform apply` |
| 4. Log aggregation | 🔨 Building | 50% (CloudWatch alarms being added) | None |
| 5. External monitoring | ⬜ Not Started | 0% | BetterStack account signup |
| 6. Runbooks | 🔨 Building | 33% → 100% (4 remaining being written) | None |

**Overall Progress:** ~70% (revised: audit logging was already 100%)

---

## Priority Order

**Week 1 (High Impact, Low Effort):**
1. ⬜ External monitoring (1h): Immediate visibility into outages
2. ⬜ CloudWatch retention (30min): Logs already flowing, just set policy
3. ⬜ CloudWatch alarms (1.5h): Automated alerting

**Week 2 (Critical Compliance):**
4. ✅ Audit logging schema (2h): Already complete (see item 2)
5. ✅ Audit logging integration (6h): Already complete (see item 2)

**Week 3 (Security & Error Tracking):**
6. ⬜ Server-side error tracking (2h, revised): Expand `captureServerError` to the remaining routes
7. ⬜ WAF deployment (2h): Security hardening

**Week 4 (Runbooks):**
8. ⬜ Remaining runbooks (2h): AISP, Swan, Sumsub, Neonomics

---

## Dependencies

### External Dependencies
- BetterStack account signup (5 min, no approval needed)
- Sentry organization/project (existing, or create new)
- Cloudflare account (existing for DNS, WAF is free tier)

### Internal Dependencies
- Alem approval for:
  - Audit log schema changes
  - CloudWatch cost ($17/month estimate)
  - BetterStack Pro upgrade (optional, $20/month for 30s interval)

### Blocked Items
- Some runbooks require Phase 2 context (real banking integrations)
  - Can document procedures but can't fully test without live APIs
  - Mark as "draft" until Phase 2

---

## Testing Plan

### Test 1: Error Tracking

```bash
# Trigger server error
curl -X POST http://localhost:3000/api/test/error \
  -H "Content-Type: application/json" \
  -d '{"trigger":"server_error"}'

# Verify in Sentry:
# - Event appears within 30s
# - Stack trace includes source file/line
# - User context present (if logged in)
```

### Test 2: Audit Logging

```bash
# Perform audit-worthy action
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"wrong"}'

# Check database (PostgreSQL 16):
psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 1;"

# Expected:
# audit_xxx|2026-02-22T10:00:00Z|usr_123|login_failure|...|1.2.3.4|Mozilla...
```

### Test 3: WAF

```bash
# Test SQLi blocking (-G with --data-urlencode keeps the payload a valid URL)
curl -G "https://getdrop.no/api/test" --data-urlencode "id=1' OR '1'='1" -v
# Expected: HTTP 403 Forbidden

# Test legitimate request
curl "https://getdrop.no/api/health" -v
# Expected: HTTP 200 OK
```

### Test 4: CloudWatch Alarms

```bash
# Trigger error spike (loop 15 errors)
for i in {1..15}; do
  curl http://localhost:3000/api/test/error
  sleep 2
done

# Expected:
# - CloudWatch alarm fires after 2 minutes (2 x 1min periods)
# - Slack alert received in #drop-ops
# - Email sent to alem@alai.no
```

### Test 5: BetterStack

```bash
# Stop app
docker stop drop-app

# Wait 3-5 minutes
# Expected:
# - BetterStack detects downtime
# - Slack alert in #drop-ops
# - Email to alem@alai.no

# Restart app
docker start drop-app

# Expected:
# - BetterStack detects recovery
# - "UP" notification sent
```

---

## Rollout Plan

### Phase 1: Non-Intrusive (Day 1)
- External monitoring (BetterStack)
- CloudWatch retention policies
- CloudWatch alarms (passive, alerts only)

**Risk:** None. These are read-only additions.

### Phase 2: Database Changes (Day 2)
- Audit log schema migration
- Audit log library (no integrations yet)

**Risk:** Low. New table, no app changes. Test migration in dev first.

### Phase 3: Code Integration (Day 3-4)
- Audit logging in auth endpoints
- Server-side error tracking (Sentry edge)
- WAF deployment

**Risk:** Medium. Requires code changes + deployment. Deploy to staging first, test 24h, then production.

### Phase 4: Runbooks (Day 5)
- Complete remaining runbooks
- Team training session
- Runbook testing in staging

**Risk:** None. Documentation only, no production changes.
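
The "deploy to staging first" step in Phase 3 and the BetterStack keyword check both depend on what `/api/health` returns. The following TypeScript sketch shows one way a post-deploy smoke check could interpret that response. It assumes a JSON body with a `status` field of `ok`, `degraded`, or `down`; the names `classifyHealth` and `smokeCheck` and the pass/warn/fail policy are hypothetical and not taken from the Drop codebase.

```typescript
// Hypothetical post-deploy smoke check; names and policy are assumptions.
type HealthVerdict = "pass" | "warn" | "fail";

// Classify a health response. Assumes a JSON body shaped like {"status":"ok"}.
function classifyHealth(httpStatus: number, body: string): HealthVerdict {
  if (httpStatus !== 200) return "fail"; // includes 503 for "down"
  try {
    const parsed: unknown = JSON.parse(body);
    const status =
      typeof parsed === "object" && parsed !== null
        ? (parsed as { status?: unknown }).status
        : undefined;
    if (status === "ok") return "pass";
    if (status === "degraded") return "warn";
    return "fail";
  } catch {
    return "fail"; // non-JSON body, for example a proxy error page
  }
}

// Fetch the endpoint once and classify the result (global fetch, Node 18+).
async function smokeCheck(baseUrl: string): Promise<HealthVerdict> {
  try {
    const res = await fetch(`${baseUrl}/api/health`);
    return classifyHealth(res.status, await res.text());
  } catch {
    return "fail"; // network or DNS failure
  }
}
```

A deploy script could call `smokeCheck("https://drop-staging.fly.dev")` after each release and stop the rollout on anything other than `"pass"`. Treating `degraded` as a warning rather than a failure is a design choice to revisit once real traffic patterns are known.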

---

## Success Metrics

**After P0 completion, we should achieve:**
- ✅ 100% server errors visible (Sentry events)
- ✅ 100% audit events logged (auth, admin, data access)
- ✅ >99.9% uptime detection (BetterStack)
- ✅ <5 min MTTD (mean time to detect incidents)
- ✅ <15 min MTTR (mean time to recover, using runbooks)
- ✅ 0 security vulnerabilities from WAF bypass

---

## Approvals

### Required Approvals
- [ ] Alem: Audit log schema changes
- [ ] Alem: CloudWatch cost ($17/month)
- [ ] Alem: BetterStack account (free tier OK? or Pro $20/month?)

### Sign-Off
- [ ] John (AI Director): Technical implementation complete
- [ ] Alem (CEO): Business approval for costs + rollout
- [ ] Validator (QA): Testing complete, acceptance criteria met

---

## Next Steps

1. **Review this analysis** with Alem
2. **Get approvals** for costs and schema changes
3. **Create Mission Control tasks** for each P0 item
4. **Begin implementation** (priority order above)
5. **Test thoroughly** in staging before production
6. **Document completion** in this checklist

---

## Related Documents

- `support/SUPPORT-SYSTEMS-ANALYSIS.md`: Full analysis (all P0/P1/P2 items)
- `support/audit-logging-setup.md`: Audit logging implementation guide
- `support/runbooks/bankid-failure.md`: BankID failure recovery
- `support/runbooks/pisp-payment-failure.md`: Payment failure recovery
- `infrastructure/cloudwatch-logs-setup.md`: Log aggregation setup
- `infrastructure/waf-rules.md`: WAF rule definitions

---

**Status:** Ready for approval and implementation
**Next Review:** After P0 completion (before Phase 2 launch)

# Support Overview

# Customer Support

Customer support resources for Drop project: FAQs, guides, feedback.

# Support Systems Analysis

# Drop Support Systems Analysis

**Date:** 2026-02-22
**Author:** John (AI Director)
**Status:** MVP Hardening Phase (0.5)
**Purpose:** Comprehensive analysis of support systems for production-ready fintech deployment

---

## Executive Summary

Drop currently has **foundational support systems** in place but requires **critical enhancements** before production launch. The application has health checks, CI/CD, error tracking (client-side), and basic alerting, but lacks enterprise-grade observability, audit logging, and incident response procedures required for a PSD2-compliant fintech service.

**Key Findings:**
- ✅ **Strong foundation:** Comprehensive CI/CD with >80% coverage, health checks, structured logging
- ⚠️ **Critical gaps:** No server-side error tracking, no audit trails, no APM, limited incident response
- 🚨 **Production blockers:** 6 P0 items must be addressed before go-live (see Gap Analysis)

**Recommendation:** Implement P0 systems immediately (est. 2-3 days), defer P1 to Phase 2 (banking integration), and P2 to post-launch optimization.

---

## Current State

### 1. Monitoring: Uptime & Health Checks

#### What Exists
- ✅ **Health endpoint:** `/api/health` with database connectivity verification
  - Checks: DB query latency, driver type (pg/sqlite), service mode, uptime
  - Returns: `ok` (200), `degraded` (200), or `down` (503)
  - Source: `src/drop-app/src/app/api/health/route.ts`
- ✅ **Container health checks:**
  - Docker: 30s interval, 10s timeout, 3 retries
  - Fly.io: 30s interval, 10s grace period, 5s timeout
  - Auto-restart on failure
- ✅ **External uptime monitoring (ready to deploy):**
  - BetterStack setup guide documented
  - Free tier: 10 monitors, 3-min interval, SMS/email/Slack alerts
  - Documentation: `docs/infrastructure/BETTERSTACK-SETUP.md`
- ✅ **Cron health check script:**
  - `infrastructure/health-check.sh`: AWS App Runner endpoint
  - Slack webhook integration (optional)
  - Can run via cron for local monitoring

#### What's Missing
- ❌ **Synthetic monitoring:** No transaction flow testing (login → send money → verify)
- ❌ **Multi-region checks:** No geographic availability testing
- ❌ **SLA tracking:** No uptime percentage calculation or reporting
- ❌ **Dependency monitoring:** No checks for external services (Swan API, BankID, Sumsub)

#### Assessment
**Status:** Adequate for MVP, requires enhancement for production.
**Gap:** External monitoring configured but not deployed. Synthetic checks needed.

---

### 2. Logging: Centralized Log Aggregation

#### What Exists
- ✅ **Structured logging:**
  - JSON format with timestamp, level, message, requestId, metadata
  - Source: `src/drop-app/src/lib/logger.ts`
  - Writes to stdout (Docker-friendly)
- ✅ **Request correlation:**
  - `x-request-id` header extraction or UUID generation
  - Request context propagation through logger instances
- ✅ **Log levels:** debug, info, warn, error

#### What's Missing
- ❌ **Log aggregation:** Logs write to stdout but aren't collected or indexed
- ❌ **Log retention:** No policy for how long logs are kept
- ❌ **Log search:** No way to query logs across time/instances
- ❌ **Log forwarding:** No integration with log management service
- ❌ **Sensitive data scrubbing:** Logger doesn't automatically redact PII

#### Assessment
**Status:** Foundation exists, but logs are ephemeral (lost on container restart).
**Gap:** Critical for incident investigation and compliance audits. Need CloudWatch Logs or similar.

---

### 3. Error Tracking: Error Capture & Alerting

#### What Exists
- ✅ **Client-side error tracking:**
  - Sentry browser integration (`@sentry/browser`)
  - PII scrubbing (passwords, pins, card numbers, fødselsnummer)
  - 10% trace sampling for performance monitoring
  - Source: `src/drop-app/src/lib/sentry.ts`, `SENTRY.md`
- ✅ **Error spike detection:**
  - Tracks errors in rolling 1-minute window
  - Alerts when >5 errors in 60 seconds
  - Source: `src/drop-app/src/lib/alerts.ts:trackError()`
- ✅ **Global error boundaries:**
  - React error boundaries for component crashes
  - `global-error.tsx` catches unhandled errors

#### What's Missing
- ❌ **Server-side error tracking:** Sentry removed from server due to Next.js 16 Turbopack incompatibility (MC #1271)
- ❌ **API error context:** Server errors log to console only, no structured capture
- ❌ **Error attribution:** Can't trace errors to specific users or transactions
- ❌ **Error deduplication:** Same error reported multiple times clogs alerts

#### Assessment
**Status:** Client errors tracked, server errors blind.
**Gap:** CRITICAL - server-side errors (API, DB, integrations) are invisible. P0 fix required.

---

### 4. Alerting: On-Call & Escalation

#### What Exists
- ✅ **Slack alerting:**
  - Operational alerts with severity levels (info/warning/critical)
  - 10-minute cooldown per alert title (spam prevention)
  - Source: `src/drop-app/src/lib/alerts.ts`
- ✅ **Lifecycle alerts:**
  - App startup notification
  - Graceful shutdown notification
  - Source: `instrumentation.ts`
- ✅ **Error spike alerts:**
  - Automatic critical alert when >5 errors/minute

#### What's Missing
- ❌ **On-call rotation:** No defined on-call schedule or escalation policy
- ❌ **Alert routing:** All alerts go to same Slack channel, no severity-based routing
- ❌ **Alert escalation:** No automatic escalation after N minutes of unresolved incident
- ❌ **Alert acknowledgment:** Can't mark alerts as "acknowledged" or "resolved"
- ❌ **SMS/phone alerts:** Critical incidents only notify via Slack (single point of failure)
- ❌ **Alert testing:** No way to test alert pipeline without triggering real incidents

#### Assessment
**Status:** Basic alerting works for small team, inadequate for 24/7 production.
**Gap:** Need on-call schedule, escalation policy, and multi-channel delivery.

---

### 5. Security Monitoring: WAF, DDoS, Anomaly Detection, Audit Logs

#### What Exists
- ✅ **WAF rules defined:**
  - CSRF origin validation (implemented in middleware)
  - Rate limiting on auth endpoints (10 req/60s)
  - CSP headers with nonce-based script loading
  - Source: `infrastructure/waf-rules.md`, `src/drop-app/src/middleware.ts`
- ✅ **Container security scanning:**
  - Trivy vulnerability scanner in CI/CD
  - Blocks HIGH/CRITICAL vulnerabilities
  - SARIF upload to GitHub Security tab
- ✅ **Dependency scanning:**
  - `npm audit` in CI pipeline (prod deps only)
- ✅ **AML transaction monitoring:**
  - 5 automated rules: structuring, velocity, high amount, high-risk corridor, unusual pattern
  - Alerts stored in `aml_alerts` table
  - Source: `src/drop-app/src/lib/transaction-monitor.ts`

#### What's Missing
- ❌ **WAF deployment:** Rules defined but not deployed (requires CDN/reverse proxy)
- ❌ **DDoS protection:** No rate limiting at network edge, only app-level
- ❌ **Intrusion detection:** No IDS/IPS monitoring unusual access patterns
- ❌ **Audit logs:** No immutable log of authentication, authorization, data access events (PSD2 requirement)
- ❌ **Security incident response plan:** No runbook for security breaches
- ❌ **Penetration testing:** No external security audit completed

#### Assessment
**Status:** Security-aware codebase, but monitoring/audit infrastructure missing.
**Gap:** CRITICAL - audit logs are PSD2/GDPR compliance requirement. P0 fix.

---

### 6. Performance: APM, Latency Tracking, Resource Utilization

#### What Exists
- ✅ **Health check latency:**
  - DB query time measured in health endpoint
  - Reported in milliseconds
- ✅ **Performance budgets in CI:**
  - Coverage thresholds enforced (80/70/80/80)

#### What's Missing
- ❌ **APM (Application Performance Monitoring):** No distributed tracing
- ❌ **API latency tracking:** Don't know which endpoints are slow
- ❌ **Database performance:** No slow query alerts or query profiling
- ❌ **Resource utilization:** No CPU/memory/disk usage monitoring
- ❌ **Frontend performance:** No Core Web Vitals tracking (LCP, INP, CLS)
- ❌ **Transaction timing:** Can't measure end-to-end payment latency

#### Assessment
**Status:** Minimal. Can detect total outage but not performance degradation.
**Gap:** Need before production to identify bottlenecks and capacity issues.

---

### 7. Database: Backups, Replication, Monitoring

#### What Exists
- ✅ **Automated backups (RDS):**
  - Daily automated snapshots, 7-day retention
  - Point-in-time recovery within 7 days
  - Source: `docs/dr-runbook.md`
- ✅ **Multi-AZ (production):**
  - RDS configured for high availability (if enabled)
- ✅ **Database health check:**
  - `SELECT 1` query in health endpoint verifies connectivity

#### What's Missing
- ❌ **Backup verification:** Snapshots created but never tested for restore
- ❌ **Backup monitoring:** No alerts if backup fails
- ❌ **Replication lag monitoring:** No alerts if replica falls behind
- ❌ **Connection pool monitoring:** No visibility into connection usage
- ❌ **Query performance:** No slow query log analysis
- ❌ **Storage monitoring:** No alerts before disk fills up

#### Assessment
**Status:** Basic backup/restore exists, monitoring gaps.
**Gap:** Backup testing and proactive monitoring needed before production.

---

### 8. Incident Response: Runbooks, Status Page, Communication Plan

#### What Exists
- ✅ **DR runbook:**
  - Procedures for App Runner down, RDS down, full redeploy
  - Environment variable checklist
  - Contact escalation (John → Alem)
  - Source: `docs/dr-runbook.md`
- ✅ **Incident checklist:**
  - 8-step incident response workflow
  - Post-mortem requirement (48h)

#### What's Missing
- ❌ **Status page:** No public/customer-facing status page
- ❌ **Incident templates:** No standardized incident report format
- ❌ **Communication plan:** No templates for customer notifications during outages
- ❌ **Runbook coverage:** Only covers infrastructure, missing:
  - Payment failures (PISP/AISP errors)
  - BankID integration issues
  - KYC/AML false positive handling
  - Data breach response
- ❌ **Runbook testing:** Procedures documented but never executed

#### Assessment
**Status:** Basic DR runbook exists, lacks fintech-specific scenarios.
**Gap:** Need payment/banking integration runbooks before Phase 2.

---

### 9. CI/CD: Build Pipeline, Deployment, Rollback

#### What Exists
- ✅ **Comprehensive CI pipeline:**
  - Multi-package change detection
  - Lint, typecheck, unit tests, E2E (Playwright), mutation testing (Stryker)
  - Coverage thresholds enforced (80/70/80/80) with ratchet (never decrease)
  - Docker build + Trivy security scan
  - Quality gate (required status check)
  - Source: `.github/workflows/ci.yml`
- ✅ **Deployment workflows:**
  - GitHub Actions for deploy (backend, mobile)
  - Terraform for infrastructure
  - Source: `.github/workflows/deploy.yml`, `terraform-ci.yml`

#### What's Missing
- ❌ **Automated rollback:** Deployment failure doesn't auto-revert
- ❌ **Canary deployments:** All-or-nothing deployment, no gradual rollout
- ❌ **Deployment monitoring:** No automatic health check after deploy
- ❌ **Deployment notifications:** Team not notified of deployments/failures
- ❌ **Infrastructure drift detection:** Terraform state not continuously validated

#### Assessment
**Status:** Strong quality gate, weak deployment safety.
**Gap:** Add post-deployment health checks and rollback automation.

---

### 10. Compliance: Audit Trails, Data Retention, GDPR/PSD2 Logging

#### What Exists
- ✅ **AML monitoring:**
  - Transaction alerts stored in `aml_alerts` table
  - 5 risk categories tracked
- ✅ **Security audit completed:**
  - 4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW findings documented
  - Source: `security/drop-security-rapport.md`
- ✅ **Data retention service:**
  - Code exists for GDPR compliance
  - Source: `src/drop-app/src/lib/services/data-retention.ts`

#### What's Missing
- ❌ **Audit logs:** No immutable record of:
  - User authentication events (login, logout, failed attempts)
  - Authorization decisions (who accessed what, when)
  - Data modifications (user profile changes, transaction edits)
  - Administrative actions (KYC approvals, AML reviews)
- ❌ **Audit log retention policy:** PSD2 requires 5+ years
- ❌ **Audit log integrity:** No cryptographic proof of non-tampering
- ❌ **Compliance reporting:** No automated report generation for regulators
- ❌ **STR (Suspicious Transaction Report) workflow:** AML alerts created but no submission process

#### Assessment
**Status:** CRITICAL GAP. Audit logs are PSD2 legal requirement.
**Gap:** P0: must implement before production launch.
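
The "audit log integrity" gap listed above (no cryptographic proof of non-tampering) is commonly addressed with a hash chain, in which each entry stores a hash of its own fields together with the previous entry's hash. The TypeScript sketch below illustrates that general technique only; it is not Drop's implementation. The `ChainedEntry` shape and the `hashEntry`, `appendEntry`, and `verifyChain` helpers are hypothetical, and the only dependency assumed is Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; a real audit table would carry more columns.
interface ChainedEntry {
  timestamp: string;
  userId: string;
  action: string;
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // SHA-256 over this entry's fields plus prevHash
}

const GENESIS = "0".repeat(64);

function hashEntry(e: Omit<ChainedEntry, "hash">): string {
  // Serializing an array gives an unambiguous encoding of the fields.
  const payload = JSON.stringify([e.timestamp, e.userId, e.action, e.prevHash]);
  return createHash("sha256").update(payload).digest("hex");
}

// Return a new chain with one entry appended; the input is not mutated.
function appendEntry(
  chain: readonly ChainedEntry[],
  fields: { timestamp: string; userId: string; action: string },
): ChainedEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  const base = { ...fields, prevHash };
  return [...chain, { ...base, hash: hashEntry(base) }];
}

// Recompute every hash; an edited or reordered entry fails verification.
function verifyChain(chain: readonly ChainedEntry[]): boolean {
  let prevHash = GENESIS;
  for (const entry of chain) {
    if (entry.prevHash !== prevHash || hashEntry(entry) !== entry.hash) {
      return false;
    }
    prevHash = entry.hash;
  }
  return true;
}
```

A chain like this detects edits, reordering, and deletions from the middle of the log. It does not detect truncation from the end unless the latest hash is also stored somewhere the application cannot rewrite, so that would need to be part of any real design.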

---

## Gap Analysis

### P0: Production Blockers (Must Fix Before Go-Live)

| # | Category | Gap | Impact | Effort |
|---|----------|-----|--------|--------|
| 1 | **Error Tracking** | No server-side error monitoring | Can't detect/debug API failures | 4h |
| 2 | **Compliance** | No audit logs (auth, data access, admin actions) | PSD2 non-compliance, legal risk | 8h |
| 3 | **Security** | WAF rules defined but not deployed | Vulnerable to SQLi, XSS, DDoS | 2h (config) |
| 4 | **Logging** | No log aggregation/retention | Can't investigate incidents | 2h (CloudWatch setup) |
| 5 | **Monitoring** | BetterStack configured but not deployed | No external incident detection | 1h (account setup) |
| 6 | **Incident Response** | No payment/banking failure runbooks | Can't recover from PISP/BankID outages | 4h |

**Total P0 effort:** ~21 hours (2-3 days)

---

### P1: Needed Soon (Before Phase 2: Banking Integration)

| # | Category | Gap | Impact | Effort |
|---|----------|-----|--------|--------|
| 7 | **Alerting** | No on-call rotation or escalation policy | Incidents may go unnoticed outside work hours | 2h |
| 8 | **Performance** | No APM for distributed tracing | Can't diagnose slow transactions | 4h |
| 9 | **Database** | No backup testing or monitoring | Backups may be corrupt, undetected | 3h |
| 10 | **Security** | No penetration testing | Unknown vulnerabilities | 16h (external) |
| 11 | **CI/CD** | No automated rollback on deployment failure | Bad deploys cause extended outages | 6h |
| 12 | **Compliance** | No STR submission workflow | Can't fulfill AML obligations | 8h |

**Total P1 effort:** ~39 hours (5 days)

---

### P2: Nice to Have (Post-Launch Optimization)

| # | Category | Gap | Impact | Effort |
|---|----------|-----|--------|--------|
| 13 | **Monitoring** | No synthetic transaction monitoring | Can't detect broken user flows | 8h |
| 14 | **Performance** | No Core Web Vitals tracking | Poor user experience undetected | 4h |
| 15 | **Alerting** | No SMS/phone alerts for critical incidents | Slack outage = missed alerts | 2h |
| 16 | **Database** | No slow query alerts | Performance degradation undetected | 6h |
| 17 | **Security** | No IDS/IPS for intrusion detection | Advanced attacks undetected | 16h |
| 18 | **Incident Response** | No public status page | Customers unaware of outages | 4h |

**Total P2 effort:** ~40 hours (5 days)

---

## Implementation Plan

### Phase 1: P0 Production Blockers (NOW, before Phase 1 demo)

**Goal:** Address legal/compliance requirements and critical observability gaps.

#### 1.1 Server-Side Error Tracking (4h)

**Problem:** All server errors invisible after Sentry removed (Next.js 16 Turbopack incompatibility).

**Solution:**
- **Option A:** Sentry Edge SDK (compatible with Next.js middleware)
  - Install: `@sentry/nextjs` with edge-only config
  - Capture server errors via `captureException()` in middleware
  - Source maps via Sentry webpack plugin
- **Option B:** Custom error aggregation service
  - POST errors to internal `/api/errors/capture` endpoint
  - Store in `error_logs` table with context
  - Alert on spike detection

**Deliverable:**
- `src/drop-app/sentry.edge.config.ts` (if Option A)
- Updated `src/drop-app/src/lib/sentry-server.ts` with edge-compatible capture
- Test: Trigger 500 error, verify Sentry event created

**Files:** `infrastructure/error-tracking-setup.md`

---

#### 1.2 Audit Logging System (8h)

**Problem:** PSD2 requires immutable audit trail for auth, data access, admin actions.

**Solution:**
- Create `audit_logs` table:

  ```sql
  CREATE TABLE audit_logs (
    id TEXT PRIMARY KEY,
    timestamp TEXT NOT NULL,
    user_id TEXT,
    action TEXT NOT NULL,   -- 'login', 'data_access', 'kyc_approval', etc.
    resource_type TEXT,     -- 'user', 'transaction', 'aml_alert'
    resource_id TEXT,
    metadata JSON,
    ip_address TEXT,
    user_agent TEXT,
    request_id TEXT,
    result TEXT             -- 'success', 'failure', 'denied'
  );

  CREATE INDEX idx_audit_user ON audit_logs(user_id, timestamp);
  CREATE INDEX idx_audit_action ON audit_logs(action, timestamp);
  ```

- Audit functions:

  ```typescript
  auditLog({
    userId: 'usr_123',
    action: 'login_success',
    resourceType: 'user',
    resourceId: 'usr_123',
    metadata: { method: 'bankid' },
    ip: '1.2.3.4',
    userAgent: 'Mozilla...',
    requestId: 'req_456'
  });
  ```

- Integrate at:
  - `POST /api/auth/login` (login_success, login_failure)
  - `POST /api/auth/logout` (logout)
  - `GET /api/users/:id` (data_access)
  - `PATCH /api/users/:id/kyc` (kyc_approval, kyc_rejection)
  - `PATCH /api/aml-alerts/:id` (aml_review)

**Deliverable:**
- `src/drop-app/src/lib/audit-log.ts` (audit logging functions)
- Migration: `migrations/003_audit_logs.sql`
- Integration in auth routes and admin endpoints
- Retention policy: Document 5-year retention for PSD2 compliance

**Files:** `support/audit-logging-setup.md`

---

#### 1.3 WAF Deployment (2h)

**Problem:** WAF rules defined but not enforced (requires reverse proxy).

**Solution:**
- **Option A:** Cloudflare WAF (recommended)
  - Already using Cloudflare for DNS (terraform module exists)
  - Free tier includes basic WAF rules
  - Configure: SQLi, XSS, path traversal rules from `infrastructure/waf-rules.md`
- **Option B:** AWS WAF (if using App Runner directly)
  - $5/month + $1/million requests
  - Associate with App Runner service

**Deliverable:**
- Cloudflare WAF configuration (Terraform or UI)
- Test: Send SQLi payload, verify 403 response
- Document: Update `infrastructure/waf-rules.md` with deployment steps

**Files:** `infrastructure/cloudflare-waf-setup.md`

---

#### 1.4 Log Aggregation (2h)

**Problem:** Structured logs write to stdout but aren't retained or searchable.
**Solution:**
- **AWS CloudWatch Logs** (App Runner auto-integrates):
  - App Runner streams stdout → CloudWatch Logs automatically
  - Configure retention: 30 days (production), 7 days (staging)
  - Set up log insights queries for common patterns
- **Fly.io (staging):**
  - `fly logs` stores last 24h by default
  - Optional: Forward to external service (Papertrail, Logtail)

**Deliverable:**
- CloudWatch Logs retention policy configured
- Log Insights queries:
  - All errors: `fields @timestamp, message | filter level = "error"`
  - User actions: `fields @timestamp, userId, message | filter userId = "usr_123"`
  - Request trace: `fields @timestamp, requestId, message | filter requestId = "req_456"`
- Documentation: `infrastructure/logging-setup.md`

**Files:** `infrastructure/cloudwatch-logs-setup.md`

---

#### 1.5 External Uptime Monitoring (1h)

**Problem:** BetterStack documented but not deployed.

**Solution:**
- Sign up: https://betterstack.com/uptime (free tier)
- Create monitors:
  1. **Production health:** `https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health`
     - Interval: 3 minutes
     - Keyword check: `"status":"ok"`
  2. **Staging health:** `https://drop-staging.fly.dev/api/health`
  3. **Landing page:** `https://getdrop.no` (when live)
- Slack integration: Connect to `#drop-ops` channel
- Email alerts: `alem@alai.no`

**Deliverable:**
- BetterStack account with 3 monitors configured
- Test: Pause monitor, verify alert received
- Documentation: Update `docs/infrastructure/BETTERSTACK-SETUP.md` with credentials

**Files:** `support/betterstack-deployment.md`

---

#### 1.6 Payment/Banking Failure Runbooks (4h)

**Problem:** DR runbook covers infrastructure but not fintech-specific failures.

**Solution:**
- Create runbooks for:
  1. **BankID integration failure** (authentication blocked)
  2. **PISP payment failure** (remittance/QR payment rejected)
  3. **AISP balance retrieval failure** (can't fetch account balance)
  4. **Swan API outage** (BaaS provider down)
  5. **Sumsub KYC failure** (identity verification unavailable)
  6. **Neonomics open banking outage**
- Each runbook includes:
  - Symptoms (what users see)
  - Diagnosis steps (check service status, logs, error codes)
  - Recovery procedure (fallback, retry, escalation)
  - Customer communication template

**Deliverable:**
- `support/runbooks/bankid-failure.md`
- `support/runbooks/pisp-payment-failure.md`
- `support/runbooks/aisp-balance-failure.md`
- `support/runbooks/swan-api-outage.md`
- `support/runbooks/sumsub-kyc-failure.md`
- `support/runbooks/neonomics-outage.md`

**Files:** Created in `/Users/makinja/ALAI/products/Drop/support/runbooks/`

---

### Phase 2: P1 Items (Phase 2: Banking Integration)

Defer to Phase 2 when real banking integrations are live and need production-grade support.

**Priority order:**
1. Penetration testing (external security audit)
2. APM for transaction tracing (identify slow payments)
3. On-call rotation and escalation policy
4. Automated rollback on failed deployments
5. Backup testing and monitoring
6. STR submission workflow (AML compliance)

---

### Phase 3: P2 Items (Post-Launch)

Optimize after initial production deployment and user feedback.

**Priority order:**
1. Synthetic transaction monitoring (test critical user flows)
2. Public status page (customer transparency)
3. Core Web Vitals tracking (frontend performance)
4. SMS/phone alerts (redundancy)
5. Slow query monitoring (database optimization)
6.
IDS/IPS (advanced threat detection) --- ## Architecture ### Support Systems Connectivity ``` ┌─────────────────────────────────────────────────────────────────┐ │ Drop Application │ │ ┌─────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │ │ drop-app │ │ drop-api │ │ drop-mobile (Expo) │ │ │ │ (Next.js) │ │ (Hono) │ │ (React Native) │ │ │ └─────────────┘ └──────────────┘ └──────────────────────┘ │ │ │ │ │ │ │ └────────────────┴──────────────────────┘ │ │ │ │ └──────────────────────────┼──────────────────────────────────────┘ │ ┌──────────────────┼──────────────────────────────┐ │ │ │ ▼ ▼ ▼ ┌───────────────┐ ┌──────────────┐ ┌──────────────────┐ │ Structured │ │ Health Check │ │ Audit Logs │ │ Logging │ │ Endpoint │ │ (audit_logs │ │ (JSON stdout) │ │ /api/health │ │ table) │ └───────┬───────┘ └──────┬───────┘ └─────────┬────────┘ │ │ │ │ │ │ ▼ │ │ ┌────────────────┐ │ │ │ CloudWatch │ │ │ │ Logs │ │ │ │ (30d retention)│ │ │ └────────────────┘ │ │ │ │ │ │ ▼ │ │ ┌───────────────┐ │ │ │ BetterStack │ │ │ │ (external │ │ │ │ monitoring) │ │ │ └───────┬───────┘ │ │ │ │ └─────────────────┼─────────────────────────────┘ │ ▼ ┌────────────────┐ │ Alerting Layer │ │ (alerts.ts) │ └────────┬───────┘ │ ┌─────────────────┼─────────────────┐ │ │ │ ▼ ▼ ▼ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Slack │ │ Sentry │ │ Email │ │ Webhook │ │ (client + │ │ (SMTP) │ │ (#drop-ops) │ │ edge) │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ ``` ### Data Flows 1. **Error Flow:** - Client error → Sentry browser → Slack alert (if spike) - Server error → Sentry edge → CloudWatch Logs → Slack alert - API 5xx → `trackError()` → Spike detection → Slack 2. **Monitoring Flow:** - App → stdout → CloudWatch Logs - App → `/api/health` → BetterStack → Slack/Email/SMS - Container → Docker health check → Auto-restart 3. **Audit Flow:** - User action → `auditLog()` → `audit_logs` table - Compliance query → SQL export → Regulator submission 4. 
**Incident Flow:**
   - Alert → Slack `#drop-ops`
   - Unacknowledged (5 min) → Email to Alem
   - Unresolved (15 min) → SMS (BetterStack escalation)
   - Incident → Runbook → Recovery → Post-mortem

---

## Cost Estimate

### Free Tier (MVP)

- ✅ CloudWatch Logs: 5 GB ingestion/month free (AWS Free Tier)
- ✅ BetterStack: 10 monitors, 3-min interval, unlimited alerts
- ✅ Sentry: 5K events/month free
- ✅ GitHub Actions: 2000 minutes/month free
- ✅ Terraform state: S3 free tier (first 12 months)

**Total MVP cost:** $0/month

### Paid Services (Production)

- CloudWatch Logs: ~$5/month (30 GB ingestion estimate)
- BetterStack Pro: $20/month (30s interval, SMS alerts)
- Sentry Team: $26/month (50K events, enhanced features)
- **Optional:** Datadog APM: $15/host/month (~$45 for 3 hosts)

**Total production cost:** ~$50-100/month (without APM)

---

## Recommendations

### Immediate (This Week)

1. ✅ **Deploy BetterStack** (1h): External monitoring is a fast win
2. ✅ **Configure CloudWatch retention** (30 min): Logs already flow, just set the policy
3. ✅ **Create audit log schema** (2h): Start with the table, integrate incrementally

### Before Phase 1 Demo (Next 2 Weeks)

4. ✅ **Implement server-side error tracking** (4h): Sentry edge or custom
5. ✅ **Write payment failure runbooks** (4h): Prepare for demo questions
6. ✅ **Deploy Cloudflare WAF** (2h): Security hygiene

### Before Phase 2 Go-Live (Next 2-3 Months)

7. 🔲 **External penetration test** (hire security firm, ~$5K budget)
8. 🔲 **APM implementation** (Datadog or Sentry Performance)
9. 🔲 **On-call rotation** (define schedule, test escalation)
10. 🔲 **Backup testing** (restore from snapshot, verify data integrity)

### Post-Launch Optimization

11. 🔲 **Synthetic monitoring** (Checkly or custom Playwright tests)
12. 🔲 **Public status page** (BetterStack included, just enable)
13. 🔲 **Core Web Vitals** (Google Lighthouse CI integration)

---

## Success Metrics

### Before Go-Live (P0 Checklist)

- [ ] Server errors visible in Sentry (test: trigger 500, verify event)
- [ ] Audit logs capture login/logout (test: log in, check `audit_logs` table)
- [ ] WAF blocks SQLi attack (test: `?id=1' OR '1'='1`, expect 403)
- [ ] CloudWatch Logs retain 30 days (verify retention policy)
- [ ] BetterStack alerts on downtime (test: stop app, receive alert <5 min)
- [ ] Runbooks tested (simulate BankID failure, follow procedure)

### Production KPIs

- **Uptime:** >99.9% (measured by BetterStack)
- **MTTD (Mean Time To Detect):** <3 minutes (external monitoring interval)
- **MTTR (Mean Time To Recover):** <15 minutes (via runbooks)
- **Error rate:** <0.1% of requests (tracked via Sentry)
- **Log retention:** 100% compliance (30 days CloudWatch, 5 years audit logs)
- **Alert noise:** <5 false positives/week (cooldown + severity tuning)

---

## Appendices

### A. Related Documentation

- `docs/infrastructure/MONITORING.md`: Current monitoring setup
- `docs/infrastructure/BETTERSTACK-SETUP.md`: External monitoring guide
- `docs/dr-runbook.md`: Infrastructure disaster recovery
- `infrastructure/waf-rules.md`: WAF rule definitions
- `security/drop-security-rapport.md`: Security audit findings

### B. External Services

- BetterStack: https://betterstack.com/uptime
- Sentry: https://sentry.io/
- AWS CloudWatch: https://console.aws.amazon.com/cloudwatch/
- Cloudflare: https://dash.cloudflare.com/

### C. Change History

- 2026-02-22: Initial analysis (John)

---

**Next Actions:**
1. Review this analysis with Alem
2. Approve P0 implementation plan
3. Begin P0 work (estimated 21 hours / 2-3 days)
4.
Track progress in Mission Control tasks # Audit Logging Setup # Audit Logging Setup — Drop Fintech **Date:** 2026-02-22 **Priority:** P0 (Production Blocker) **Compliance:** PSD2, GDPR **Effort:** 8 hours --- ## Overview Audit logging provides an **immutable record** of all authentication, authorization, data access, and administrative actions. This is a **legal requirement** for PSD2-regulated payment services and GDPR data protection compliance. --- ## Requirements ### PSD2 Audit Trail Requirements - All authentication events (login, logout, failed attempts) - Authorization decisions (who accessed what resource) - Transaction creation and modification - KYC/AML review actions - Administrative user actions - Data exports and bulk operations - Retention: **5 years minimum** ### GDPR Right of Access - Users must be able to request all logged actions related to their data - Export format: Human-readable (CSV or JSON) --- ## Database Schema ### Migration: `003_audit_logs.sql` ```sql -- Audit Logs Table (PostgreSQL 16 — ADR-014) -- Schema managed via Drizzle ORM (src/shared/db/schema.ts) -- Apply with: make db-push CREATE TABLE IF NOT EXISTS audit_log ( id TEXT PRIMARY KEY, timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), user_id TEXT, action TEXT NOT NULL, resource_type TEXT, resource_id TEXT, details JSONB, ip_address TEXT, user_agent TEXT, request_id TEXT, result TEXT NOT NULL DEFAULT 'success', -- 'success', 'failure', 'denied' created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); -- Indexes for common queries CREATE INDEX IF NOT EXISTS idx_audit_user_time ON audit_log(user_id, timestamp DESC); CREATE INDEX IF NOT EXISTS idx_audit_action_time ON audit_log(action, timestamp DESC); CREATE INDEX IF NOT EXISTS idx_audit_resource ON audit_log(resource_type, resource_id, timestamp DESC); CREATE INDEX IF NOT EXISTS idx_audit_request ON audit_log(request_id); CREATE INDEX IF NOT EXISTS idx_audit_result ON audit_log(result, timestamp DESC); -- Partitioning by month (production) 
CREATE TABLE audit_log_2026_02 PARTITION OF audit_log FOR VALUES FROM ('2026-02-01') TO ('2026-03-01'); ``` **Migration steps (PostgreSQL 16 via Drizzle ORM):** 1. Schema is defined in `src/shared/db/schema.ts` 2. Apply with: ```bash make db-push # or: cd src/shared && npx drizzle-kit push ``` 3. Verify table exists: ```bash psql "$DATABASE_URL" -c "SELECT table_name FROM information_schema.tables WHERE table_name='audit_log';" ``` --- ## Implementation ### Audit Log Library: `src/lib/audit-log.ts` ```typescript import { db } from '@drop/shared/db'; import { randomId } from './utils-server'; import { logger } from './logger'; export type AuditAction = // Authentication | 'login_success' | 'login_failure' | 'logout' | 'password_change' | 'session_revoked' // Authorization | 'access_granted' | 'access_denied' // Data Access | 'data_view' | 'data_export' | 'data_delete' // Transactions | 'transaction_created' | 'transaction_completed' | 'transaction_failed' // KYC/AML | 'kyc_approved' | 'kyc_rejected' | 'aml_alert_created' | 'aml_alert_reviewed' // Admin | 'user_created' | 'user_updated' | 'user_deleted' | 'role_changed'; export type AuditResult = 'success' | 'failure' | 'denied'; export interface AuditLogEntry { userId?: string; action: AuditAction; resourceType?: string; resourceId?: string; metadata?: Record; ip?: string; userAgent?: string; requestId?: string; result?: AuditResult; } /** * Create an audit log entry * * IMPORTANT: This function must NEVER throw errors. * Audit failures should not block user actions. 
*/ export async function auditLog(entry: AuditLogEntry): Promise { try { const id = randomId('audit'); const timestamp = new Date().toISOString(); await run( `INSERT INTO audit_logs ( id, timestamp, user_id, action, resource_type, resource_id, metadata, ip_address, user_agent, request_id, result ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`, [ id, timestamp, entry.userId || null, entry.action, entry.resourceType || null, entry.resourceId || null, entry.metadata ? JSON.stringify(entry.metadata) : null, entry.ip || null, entry.userAgent || null, entry.requestId || null, entry.result || 'success', ] ); logger.debug('Audit log created', { auditId: id, action: entry.action }); } catch (error) { // Log error but do NOT throw (audit failures should not block operations) logger.error('Failed to create audit log', { error: error instanceof Error ? error.message : String(error), action: entry.action, }); } } /** * Retrieve audit logs for a user (GDPR Right of Access) */ export async function getUserAuditLogs( userId: string, options?: { limit?: number; offset?: number; startDate?: string; endDate?: string } ): Promise { const { limit = 100, offset = 0, startDate, endDate } = options || {}; let sql = 'SELECT * FROM audit_logs WHERE user_id = ?'; const params: unknown[] = [userId]; if (startDate) { sql += ' AND timestamp >= ?'; params.push(startDate); } if (endDate) { sql += ' AND timestamp <= ?'; params.push(endDate); } sql += ' ORDER BY timestamp DESC LIMIT ? 
OFFSET ?'; params.push(limit, offset); const { query } = await import('./db'); return query(sql, params); } /** * Export audit logs as CSV (for compliance reporting) */ export async function exportAuditLogsCSV( filters?: { userId?: string; action?: AuditAction; startDate?: string; endDate?: string; } ): Promise { let sql = 'SELECT * FROM audit_logs WHERE 1=1'; const params: unknown[] = []; if (filters?.userId) { sql += ' AND user_id = ?'; params.push(filters.userId); } if (filters?.action) { sql += ' AND action = ?'; params.push(filters.action); } if (filters?.startDate) { sql += ' AND timestamp >= ?'; params.push(filters.startDate); } if (filters?.endDate) { sql += ' AND timestamp <= ?'; params.push(filters.endDate); } sql += ' ORDER BY timestamp DESC'; const { query } = await import('./db'); const rows = await query(sql, params); // Convert to CSV const headers = [ 'id', 'timestamp', 'user_id', 'action', 'resource_type', 'resource_id', 'metadata', 'ip_address', 'user_agent', 'request_id', 'result', ]; const csvRows = [headers.join(',')]; for (const row of rows as Record[]) { const values = headers.map((h) => { const val = row[h]; if (val === null || val === undefined) return ''; return String(val).replace(/"/g, '""'); // Escape quotes }); csvRows.push(values.map((v) => `"${v}"`).join(',')); } return csvRows.join('\n'); } ``` --- ## Integration Points ### 1. 
Authentication (`src/app/api/auth/login/route.ts`) ```typescript import { auditLog } from '@/lib/audit-log'; export async function POST(request: NextRequest) { const { email, password } = await request.json(); const ip = request.headers.get('x-forwarded-for') || request.headers.get('x-real-ip'); const userAgent = request.headers.get('user-agent'); const requestId = getRequestId(request.headers); try { const user = await getUserByEmail(email); if (!user || !await verifyPassword(password, user.password_hash)) { // Audit failed login attempt await auditLog({ userId: user?.id, action: 'login_failure', metadata: { email, reason: 'invalid_credentials' }, ip, userAgent, requestId, result: 'failure', }); return jsonError('Invalid credentials', 401); } // Audit successful login await auditLog({ userId: user.id, action: 'login_success', metadata: { email }, ip, userAgent, requestId, result: 'success', }); // ... rest of login logic } catch (error) { // ... error handling } } ``` ### 2. Logout (`src/app/api/auth/logout/route.ts`) ```typescript await auditLog({ userId: session.userId, action: 'logout', metadata: { sessionId: session.id }, ip, userAgent, requestId, }); ``` ### 3. 
Data Access (`src/app/api/users/[id]/route.ts`) ```typescript export async function GET(request: NextRequest, { params }: { params: { id: string } }) { const session = await requireAuth(request); const userId = params.id; // Check authorization if (session.userId !== userId && session.role !== 'admin') { await auditLog({ userId: session.userId, action: 'access_denied', resourceType: 'user', resourceId: userId, metadata: { reason: 'insufficient_permissions' }, ip: request.headers.get('x-forwarded-for'), userAgent: request.headers.get('user-agent'), requestId: getRequestId(request.headers), result: 'denied', }); return jsonError('Access denied', 403); } // Audit successful data access await auditLog({ userId: session.userId, action: 'data_view', resourceType: 'user', resourceId: userId, ip: request.headers.get('x-forwarded-for'), userAgent: request.headers.get('user-agent'), requestId: getRequestId(request.headers), }); const user = await getUserById(userId); return jsonSuccess(user); } ``` ### 4. KYC Approval (`src/app/api/admin/kyc/route.ts`) ```typescript await auditLog({ userId: adminSession.userId, action: 'kyc_approved', resourceType: 'user', resourceId: targetUserId, metadata: { reason: kycApprovalReason }, ip: request.headers.get('x-forwarded-for'), userAgent: request.headers.get('user-agent'), requestId: getRequestId(request.headers), }); ``` ### 5. 
Transaction Creation (`src/app/api/transactions/route.ts`) ```typescript await auditLog({ userId: session.userId, action: 'transaction_created', resourceType: 'transaction', resourceId: transactionId, metadata: { type: transactionType, amount: amount, currency: currency, }, ip: request.headers.get('x-forwarded-for'), userAgent: request.headers.get('user-agent'), requestId: getRequestId(request.headers), }); ``` --- ## Compliance Reporting ### GDPR Right of Access (User Data Export) ```typescript // src/app/api/users/[id]/audit-logs/route.ts export async function GET(request: NextRequest, { params }: { params: { id: string } }) { const session = await requireAuth(request); // Users can only access their own audit logs if (session.userId !== params.id && session.role !== 'admin') { return jsonError('Access denied', 403); } const logs = await getUserAuditLogs(params.id, { limit: 1000, // GDPR requires "all data" startDate: request.nextUrl.searchParams.get('start') || undefined, endDate: request.nextUrl.searchParams.get('end') || undefined, }); return jsonSuccess({ logs }); } ``` ### PSD2 Audit Trail Export (Admin) ```typescript // src/app/api/admin/audit/export/route.ts export async function GET(request: NextRequest) { const session = await requireAuth(request); if (session.role !== 'admin') { return jsonError('Admin access required', 403); } const startDate = request.nextUrl.searchParams.get('start'); const endDate = request.nextUrl.searchParams.get('end'); const action = request.nextUrl.searchParams.get('action'); const userId = request.nextUrl.searchParams.get('userId'); const csv = await exportAuditLogsCSV({ userId: userId || undefined, action: action as AuditAction | undefined, startDate: startDate || undefined, endDate: endDate || undefined, }); return new Response(csv, { headers: { 'Content-Type': 'text/csv', 'Content-Disposition': `attachment; filename="audit-logs-${new Date().toISOString()}.csv"`, }, }); } ``` --- ## Retention Policy ### PSD2 Requirement: 5 
Years **PostgreSQL 16 (all environments — ADR-014):** - Use table partitioning by month: ```sql CREATE TABLE audit_log ( id TEXT PRIMARY KEY, timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), -- ... other columns ) PARTITION BY RANGE (timestamp); -- Create partitions for each month CREATE TABLE audit_log_2026_02 PARTITION OF audit_log FOR VALUES FROM ('2026-02-01') TO ('2026-03-01'); ``` - Automatic cleanup script (cron weekly): ```bash #\!/bin/bash # Delete audit logs older than 5 years (PSD2 retention) psql "$DATABASE_URL" -c "DELETE FROM audit_log WHERE timestamp < NOW() - INTERVAL '5 years';" --- ## Testing ### Test Audit Logging ```bash # 1. Create audit log entry curl -X POST http://localhost:3000/api/auth/login \ -H "Content-Type: application/json" \ -d '{"email":"test@example.com","password":"wrong"}' # 2. Check audit log table (PostgreSQL 16) psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 5;" # Expected output: # audit_123 | 2026-02-22T10:00:00.000Z | usr_456 | login_failure | ... | {"email":"test@example.com","reason":"invalid_credentials"} | 1.2.3.4 | Mozilla/5.0... | req_789 | failure # 3. Export audit logs (admin) curl -X GET "http://localhost:3000/api/admin/audit/export?start=2026-02-01&end=2026-02-28" \ -H "Cookie: auth-token=" \ > audit-logs.csv # 4. Verify CSV format head -n 5 audit-logs.csv ``` --- ## Monitoring ### Alert on Audit Failures Add to `src/lib/audit-log.ts`: ```typescript import { sendAlert } from './alerts'; export async function auditLog(entry: AuditLogEntry): Promise { try { // ... 
insert logic } catch (error) { logger.error('Failed to create audit log', { error, action: entry.action }); // CRITICAL: Alert if audit logging fails (compliance risk) await sendAlert({ severity: 'critical', title: 'Audit logging failure', message: `Failed to record ${entry.action} for user ${entry.userId}`, }); } } ``` ### Metrics to Track - Audit logs created per hour (should correlate with user activity) - Failed audit log attempts (should be zero) - Audit log export requests (GDPR compliance) - Audit log storage size (retention planning) --- ## Security Considerations ### Immutability - Audit logs should NEVER be updated or deleted (except by automated retention policy) - No UPDATE or DELETE API endpoints for audit logs - Database permissions: Read-only for application, Write-only for audit service ### Access Control - Only admins can view full audit trails - Users can view their own audit logs only - Export requires elevated permissions ### Data Redaction - Do NOT log passwords, tokens, or sensitive PII in metadata - Card numbers: Log last 4 digits only - Fødselsnummer: Log checksum/hash, not full number --- ## Checklist - [ ] Migration `003_audit_logs.sql` created - [ ] Migration applied to dev database - [ ] `src/lib/audit-log.ts` implemented - [ ] Login/logout endpoints integrated - [ ] Data access endpoints integrated - [ ] KYC/AML admin actions integrated - [ ] GDPR export endpoint created - [ ] PSD2 CSV export endpoint created - [ ] Retention policy documented - [ ] Monitoring alerts configured - [ ] Testing completed (manual + automated) - [ ] Documentation updated (API docs, compliance docs) --- **Next Steps:** 1. Create migration file 2. Implement `audit-log.ts` library 3. Integrate into auth routes (high priority) 4. Add to remaining endpoints incrementally 5. Test with real login/logout flows 6. 
Deploy to staging for verification # Runbooks Operational runbooks for failure scenarios # Runbook: AISP Balance Failure # Runbook: AISP Balance Fetch Failure **Service:** AISP (Account Information Service Provider) **Severity:** MEDIUM (users can't see bank balance) **MTTR Target:** <20 minutes **Owner:** John (AI Director) --- ## Symptoms Users report they cannot see their bank account balance in Drop. Symptoms include: - Dashboard shows "Balance unavailable" or stale balance - Error message: "Could not fetch account information" - Infinite loading spinner on balance widget - Balance shows "0 kr" or "—" instead of actual amount **User impact:** Cannot verify available funds before making payments (may lead to insufficient funds errors). --- ## Diagnosis ### 1. Check Neonomics AISP Status **External status:** ```bash # Neonomics has no public status page — test via API curl -X GET https://api.neonomics.io/health \ -H "Authorization: Bearer " \ -v # Expected: HTTP 200 # If 500/503: Neonomics outage ``` **Check specific bank connectivity:** ```bash # List supported banks and their status curl -X GET https://api.neonomics.io/banks \ -H "Authorization: Bearer " \ | jq '.[] | select(.country == "NO") | {name, status}' # Look for: "status": "degraded" or "offline" ``` ### 2. Check Drop Logs ```bash # CloudWatch Logs (production) aws logs filter-log-events \ --log-group-name /aws/apprunner/drop-production \ --filter-pattern "aisp" \ --start-time $(date -u -d '15 minutes ago' +%s)000 \ --region eu-west-1 # Look for: # - "AISP consent expired" # - "AISP API timeout" # - "AISP 401 Unauthorized" # - "Bank API unavailable: DNB" ``` ### 3. Check User Consent Status ```bash # Verify Open Banking consent hasn't expired # Consent is valid for 90 days from last authorization # Check database for expired consents (PostgreSQL 16) psql "$DATABASE_URL" <" \ -v # If timeout or 503: confirmed outage ``` 2. 
**Contact Neonomics support:** - Email: support@neonomics.io - Slack: #neonomics-support (if available) - Check Neonomics Slack for incident updates 3. **Enable fallback mode:** ```bash # Show cached balances to all users aws apprunner update-service --service-arn \ --instance-configuration "EnvironmentVariables={ AISP_FALLBACK_MODE=cached, AISP_FALLBACK_CACHE_TTL=3600 }" ``` 4. **Communicate to users (Norwegian):** ``` Emne: Saldo vises med forsinkelse Hei, Vår leverandør for bankdata opplever tekniske problemer. Saldoen du ser kan være opptil 1 time gammel. Du kan fortsatt gjøre betalinger som normalt. Vi forventer at tjenesten er tilbake innen [X minutter]. Mvh, Drop ``` 5. **Monitor Neonomics status:** - Check every 10 minutes for resolution - When API is back: disable fallback mode ```bash aws apprunner update-service --service-arn \ --instance-configuration "EnvironmentVariables={ AISP_FALLBACK_MODE=live }" ``` **ETA:** Depends on Neonomics (typically <2 hours) --- ### Cause 4: Invalid or Revoked API Credentials **Probability:** 5% (after credential rotation or account issue) **Symptoms:** - Logs show: "401 Unauthorized" or "invalid_api_key" - All AISP requests fail immediately - Other Drop services work fine (auth, database, etc.) **Solution:** 1. **Verify Neonomics API credentials:** ```bash bw get item "Neonomics API" --session $BW_SESSION # Check: # - API key is not expired # - API key has AISP permissions # - Correct environment (production vs sandbox) ``` 2. **Update App Runner environment variables:** ```bash aws apprunner update-service --service-arn \ --source-configuration "ImageRepository={...}" \ --instance-configuration "EnvironmentVariables={ NEONOMICS_API_KEY=, NEONOMICS_ENVIRONMENT=production }" ``` 3. **Trigger deployment:** ```bash aws apprunner start-deployment --service-arn --region eu-west-1 # Wait 3-5 minutes for deployment to complete ``` 4. 
**Test after deployment:** ```bash # Verify AISP working curl -X GET https://getdrop.no/api/accounts/balance \ -H "Authorization: Bearer " \ -v # Expected: HTTP 200 with balance data ``` **ETA:** 10 minutes --- ### Cause 5: Network or Firewall Issues **Probability:** 5% (AWS security group misconfiguration) **Symptoms:** - Logs show: "Connection timeout" or "ECONNREFUSED" - AISP API requests never reach Neonomics - Other external APIs may also fail **Solution:** 1. **Check outbound connectivity:** ```bash # App Runner egress is unrestricted by default # If using VPC connector, check security group aws ec2 describe-security-groups \ --group-ids \ --region eu-west-1 \ | jq '.SecurityGroups[].IpPermissionsEgress' ``` 2. **Test DNS resolution:** ```bash # From your local machine or bastion host nslookup api.neonomics.io # Should resolve to Neonomics IP # If NXDOMAIN: DNS issue ``` 3. **Check AWS service health:** ```bash # Check App Runner service events aws apprunner list-operations \ --service-arn \ --region eu-west-1 \ | jq '.OperationSummaryList[] | select(.Type == "CREATE_SERVICE" or .Type == "UPDATE_SERVICE")' # Look for recent errors ``` 4. **Whitelist Neonomics IPs (if using strict firewall):** - Contact Neonomics for IP ranges - Add to security group outbound rules - Allow HTTPS (443) to Neonomics endpoints **ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes) --- ### Cause 6: Rate Limiting (High Traffic) **Probability:** 10% (during peak hours or viral event) **Symptoms:** - Logs show: HTTP 429 "Too Many Requests" - Intermittent failures (some users see balance, others don't) - Rate limit headers in logs **Solution:** 1. **Check rate limit headers:** ```bash aws logs filter-log-events \ --log-group-name /aws/apprunner/drop-production \ --filter-pattern "X-RateLimit" \ --start-time $(date -u -d '5 minutes ago' +%s)000 \ | jq -r '.events[].message' \ | grep -E "X-RateLimit-(Limit|Remaining|Reset)" ``` 2. 
**Implement request throttling:** ```typescript // src/lib/aisp-client.ts import PQueue from 'p-queue'; const queue = new PQueue({ concurrency: 10, // Max 10 concurrent requests interval: 1000, // Per second intervalCap: 50 // Max 50 requests per second }); export async function fetchBalance(userId: string) { return queue.add(() => neonomicsClient.getBalance(userId)); } ``` 3. **Cache balance aggressively during rate limit:** ```typescript // src/lib/balance-cache.ts const CACHE_TTL_NORMAL = 60; // 60 seconds const CACHE_TTL_RATE_LIMIT = 300; // 5 minutes during rate limit export async function getBalanceWithCache(userId: string) { const cached = await redis.get(`balance:${userId}`); if (cached) return JSON.parse(cached); try { const balance = await fetchBalance(userId); await redis.setex(`balance:${userId}`, CACHE_TTL_NORMAL, JSON.stringify(balance)); return balance; } catch (error) { if (error.status === 429) { // Extend cache TTL during rate limit await redis.expire(`balance:${userId}`, CACHE_TTL_RATE_LIMIT); } throw error; } } ``` 4. **Contact Neonomics to increase rate limit:** - Email support with traffic stats - Request higher API quota for production - Provide justification (user growth, peak times) **ETA:** 5 minutes (automatic caching), 1-2 days (if quota increase needed) --- ## Emergency Workarounds ### Option 1: Cached Balance Mode **Use case:** AISP provider down >30 minutes, users need to see approximate balance **Steps:** 1. Enable cached balance fallback: ```bash aws apprunner update-service --service-arn \ --instance-configuration "EnvironmentVariables={ AISP_MODE=cached, AISP_CACHE_TTL=3600 }" ``` 2. Show warning banner in app: ``` ⚠️ Saldo vises med forsinkelse Vi viser din sist kjente saldo fra [timestamp]. Tjenesten er tilbake til normal snart. ``` 3. Allow payments to proceed: - Users can still initiate payments (PISP) - Balance check uses cached value - Risk: Insufficient funds errors if balance changed 4. 
**Revert when AISP is back:** ```bash aws apprunner update-service --service-arn \ --instance-configuration "EnvironmentVariables={ AISP_MODE=live }" ``` **Risk:** Cached balance may be stale (up to 1 hour old). Users may attempt payments with insufficient funds. --- ### Option 2: Hide Balance, Allow Payments **Use case:** AISP down, no reliable cache, but PISP still works **Steps:** 1. Show "Balance unavailable" message: ``` Saldo midlertidig utilgjengelig Du kan fortsatt gjøre betalinger som normalt. Banken vil avvise betalingen hvis du ikke har nok midler. ``` 2. Allow payments without balance check: - User enters payment amount - Drop initiates payment via PISP - Bank performs real-time balance check - If insufficient funds: bank rejects, user gets clear error 3. Communicate ETA to users: ``` Vi jobber med å gjenopprette saldovisning. Estimert tid: [X minutter] ``` **Risk:** User experience degraded. May attempt failed payments. --- ## Post-Incident Actions 1. **Refresh all expired consents proactively:** ```sql -- PostgreSQL 16: send renewal reminders 7 days before expiry SELECT user_id, email, consent_expires_at FROM bank_accounts JOIN users ON users.id = bank_accounts.user_id WHERE consent_expires_at < NOW() + INTERVAL '7 days' AND consent_renewal_reminder_sent = FALSE; ``` 2. **Document incident:** ```bash touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-aisp-failure.md ``` 3. **Review caching strategy:** - Is cache TTL appropriate? - Should we cache balance longer during incidents? - Add metrics: cache hit rate, staleness 4. **Update monitoring:** - Add synthetic AISP test (fetch balance every 5 min) - Alert on AISP failure rate >10% - Track consent expiry dates 5. 
**Improve user communication:**
   - Auto-notify users when AISP is degraded
   - Show balance age: "Updated 5 minutes ago"

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 10 min | If Neonomics outage confirmed, notify Alem |
| 20 min | If not resolved, enable cached balance mode |
| 1 hour | Public communication to users (Norwegian email/push) |
| 2 hours | Contact Neonomics support via phone if no response |

---

## Contacts

- **Neonomics Support:** support@neonomics.io
- **Neonomics Slack:** #neonomics-support (if available)
- **Internal:** Alem (CEO, final decision on fallback modes)

---

## Related Documentation

- `docs/architecture/open-banking.md`: AISP flow diagrams
- `src/app/api/accounts/balance/route.ts`: Balance fetch implementation
- `docs/compliance/psd2-requirements.md`: PSD2 consent rules (90-day expiry)
- Vaultwarden item: "Neonomics API": Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: BankID Failure

# Runbook: BankID Integration Failure

**Service:** BankID OAuth Authentication
**Severity:** CRITICAL (blocks all logins)
**MTTR Target:** <15 minutes
**Owner:** John (AI Director)

---

## Symptoms

Users report they cannot log in. Symptoms include:
- Login button doesn't redirect to BankID
- BankID redirect returns error page
- OAuth callback fails with 401/403
- Error message: "Authentication service unavailable"

---

## Diagnosis

### 1. Check BankID Service Status

**External status page:**
```bash
# Check BankID status (no official status page, monitor Twitter)
# URLs are quoted so the shell does not treat "?" as a glob character
open "https://twitter.com/search?q=BankID%20Norge"

# Or check community forums
open "https://www.reddit.com/r/Norge/search?q=BankID"
```

**Quick test:**
```bash
# Try BankID login from another service (e.g., tax portal)
open https://www.skatteetaten.no/person/

# If BankID works there but not in Drop → problem is our integration
```

### 2. Check Drop Logs

```bash
# CloudWatch Logs (production)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "bankid" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "BankID OAuth error: invalid_client"
# - "BankID callback failed: invalid_state"
# - "BankID API timeout"
```

### 3. Check Environment Variables

```bash
# Verify BankID credentials are set
aws apprunner describe-service \
  --service-arn <service-arn> \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep BANKID

# Expected:
# BANKID_CLIENT_ID: <client-id>
# BANKID_CLIENT_SECRET: (value hidden)
# BANKID_CALLBACK_URL: https://getdrop.no/api/auth/bankid/callback
```

### 4. Check OAuth Flow

**Test OAuth initiation:**
```bash
# Start OAuth flow
curl -X POST https://getdrop.no/api/auth/bankid/initiate \
  -H "Content-Type: application/json" \
  -d '{"redirectUrl": "/dashboard"}' \
  -v

# Expected: HTTP 302 redirect to BankID with state parameter
# If 500: Check BANKID_CLIENT_ID and BANKID_CALLBACK_URL
```

**Test OAuth callback:**
```bash
# Simulate callback (replace <code> and <state> with real values from the BankID redirect)
curl -X GET "https://getdrop.no/api/auth/bankid/callback?code=<code>&state=<state>" \
  -v

# Expected: HTTP 302 redirect to /dashboard with auth cookie
# If 401: Check BANKID_CLIENT_SECRET
# If 400: Check state validation logic
```

---

## Common Causes & Solutions

### Cause 1: BankID Service Outage (External)

**Probability:** 5% (BankID is highly reliable)

**Symptoms:**
- All BankID logins fail across all services
- BankID status page reports incident
- Social media mentions BankID outage

**Solution:**
1. **Communicate:** Post status update to users
   ```
   Subject: Login temporarily unavailable
   Body: BankID authentication is experiencing issues.
         We're monitoring the situation and will restore service
         as soon as BankID is back online. Estimated: <X> minutes.
   ```

2. **Monitor:** Watch BankID Twitter/status for updates

3. **Fallback (if available):** If demo mode exists, consider temporary activation:
   ```bash
   # Enable demo mode (ONLY in emergency, requires Alem approval)
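   # Environment variables are set under --source-configuration, not --instance-configuration
   aws apprunner update-service --service-arn <SERVICE_ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=demo}}}"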
   ```

4. **Post-incident:** Document outage duration, user impact

**ETA:** Depends on BankID (typically <2 hours)

---

### Cause 2: Invalid OAuth Credentials

**Probability:** 20% (after credential rotation or environment change)

**Symptoms:**
- Logs show: "invalid_client" or "unauthorized_client"
- OAuth flow fails immediately (no redirect to BankID)

**Solution:**
1. **Verify credentials in Vaultwarden:**
   ```bash
   bw get item "BankID OAuth" --session $BW_SESSION
   ```

2. **Update App Runner environment variables:**
   ```bash
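   # Environment variables are set under --source-configuration, not --instance-configuration.
   # List every variable the service needs: the map is assumed to be replaced as a whole (verify first).
   aws apprunner update-service --service-arn <SERVICE_ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={BANKID_CLIENT_ID=<CLIENT_ID>,BANKID_CLIENT_SECRET=<CLIENT_SECRET>}}}"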
   ```

3. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn <SERVICE_ARN> --region eu-west-1
   ```

4. **Test:** Attempt login after deployment completes (3-5 minutes)

**ETA:** 10 minutes

---

### Cause 3: Callback URL Mismatch

**Probability:** 15% (after domain change or deployment error)

**Symptoms:**
- Logs show: "redirect_uri_mismatch"
- BankID redirects to wrong URL (404 or CORS error)

**Solution:**
1. **Check registered callback URL in BankID portal:**
   - Login to BankID integration portal
   - Navigate to OAuth settings
   - Verify callback URL: `https://getdrop.no/api/auth/bankid/callback`

2. **If mismatch, update BankID portal:**
   - Change redirect URI to match current domain
   - Save changes (may require approval, 1-2 hours)

3. **Update App Runner env var:**
   ```bash
   # Environment variables are set under --source-configuration, not --instance-configuration
   aws apprunner update-service --service-arn <SERVICE_ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={BANKID_CALLBACK_URL=https://getdrop.no/api/auth/bankid/callback}}}"
   ```

4. **Test:** Login flow should work after both changes

**ETA:** 15 minutes (if no BankID approval required), 2 hours (if approval needed)

---

### Cause 4: State Parameter Validation Failure

**Probability:** 10% (race condition or session timeout)

**Symptoms:**
- Logs show: "Invalid state parameter"
- User completes BankID flow but callback rejects

**Solution:**
1. **Check session storage:**
   - BankID state is stored in server session
   - If session expires before callback (>10 min), state is lost

2. **Increase session timeout (if needed):**
   ```typescript
   // src/lib/auth.ts
   const SESSION_TIMEOUT = 15 * 60 * 1000; // 15 minutes (was 10)
   ```

3. **Clear stale sessions:**
   ```bash
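   # If using Redis for sessions: keys stored with a TTL expire on their own.
   # WARNING: FLUSHDB deletes every key in the selected database and logs out all users.
   redis-cli FLUSHDB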

   # If using database sessions
   sqlite3 drop.db "DELETE FROM sessions WHERE expires_at < datetime('now');"
   ```

4. **Ask user to retry:** A state timeout is usually a one-time issue

**ETA:** 5 minutes
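
To make the failure mode concrete, here is a minimal sketch of how an OAuth `state` value could be issued and checked against a timeout. It is not taken from `src/lib/auth.ts`: the `PendingState` shape, the function names, and the in-memory handling are all hypothetical, and the 15-minute limit simply mirrors the `SESSION_TIMEOUT` value shown in step 2.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical shape of what the server keeps for a pending login.
interface PendingState {
  value: string;
  createdAt: number; // epoch milliseconds
}

// Assumed to match the session timeout from step 2.
const STATE_TTL_MS = 15 * 60 * 1000;

type StateCheck = "ok" | "expired" | "mismatch";

// Issue an unguessable state value when the login flow starts.
function createState(now: number = Date.now()): PendingState {
  return { value: randomBytes(32).toString("hex"), createdAt: now };
}

// Compare the state returned in the callback with the stored one.
function checkState(
  stored: PendingState,
  received: string,
  now: number = Date.now(),
): StateCheck {
  if (now - stored.createdAt > STATE_TTL_MS) return "expired";
  const a = Buffer.from(stored.value);
  const b = Buffer.from(received);
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (a.length !== b.length || !timingSafeEqual(a, b)) return "mismatch";
  return "ok";
}
```

Returning `"expired"` separately from `"mismatch"` lets the logs distinguish a slow user (retry is enough) from a tampered or replayed callback.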

---

### Cause 5: BankID API Rate Limiting

**Probability:** 5% (during high-traffic events)

**Symptoms:**
- Logs show: "rate_limit_exceeded" or HTTP 429
- Intermittent failures (some users succeed, others fail)

**Solution:**
1. **Check rate limit headers in logs:**
   ```
   X-RateLimit-Limit: 100
   X-RateLimit-Remaining: 0
   X-RateLimit-Reset: 1640000000
   ```

2. **Wait for rate limit reset:** Typically resets every 60 seconds

3. **Implement exponential backoff (if not present):**
   ```typescript
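   // src/lib/bankid-client.ts
   const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

   async function callBankIDAPI(url: string, retries = 3): Promise<Response> {
     const response = await fetch(url);
     // fetch() resolves on HTTP error statuses, so check for 429 explicitly
     if (response.status === 429 && retries > 0) {
       await sleep(1000 * 2 ** (3 - retries)); // 1s, 2s, 4s
       return callBankIDAPI(url, retries - 1);
     }
     return response;
   }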
   ```

4. **Contact BankID support:** If rate limits are too low for production traffic

**ETA:** 5 minutes (automatic), 1-2 days (if support ticket needed)
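
Instead of a fixed delay, the wait can be derived from the rate limit headers shown in step 1. The sketch below assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds (the sample value `1640000000` suggests so, but BankID's actual semantics should be confirmed) and that header names arrive with exactly this casing. The function name is hypothetical.

```typescript
// Returns how long to wait (in ms) before the next call is allowed.
// Returns 0 when quota remains or when the headers are missing or malformed.
function msUntilRateLimitReset(
  headers: Record<string, string>,
  nowMs: number = Date.now(),
): number {
  const remaining = Number(headers["X-RateLimit-Remaining"]);
  const resetEpochSeconds = Number(headers["X-RateLimit-Reset"]);
  if (!Number.isFinite(remaining) || !Number.isFinite(resetEpochSeconds)) return 0;
  if (remaining > 0) return 0;
  // Never return a negative wait if the reset time has already passed.
  return Math.max(0, resetEpochSeconds * 1000 - nowMs);
}
```

Combining this with the backoff above (waiting for whichever delay is longer) avoids retrying before the quota window has actually reset.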

---

### Cause 6: Network/Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- BankID API requests never reach destination

**Solution:**
1. **Check outbound rules (App Runner → BankID):**
   ```bash
   # App Runner egress is unrestricted by default
   # Check VPC connector security group (if using VPC)
   aws ec2 describe-security-groups --group-ids <SECURITY_GROUP_ID> --region eu-west-1
   ```

2. **Test connectivity from container:**
   ```bash
   # Exec into running container (if possible)
   curl -v https://oidc.bankid.no/.well-known/openid-configuration

   # Expected: HTTP 200 with JSON response
   # If timeout: Network/firewall issue
   ```

3. **Check DNS resolution:**
   ```bash
   nslookup oidc.bankid.no
   # Should resolve to BankID IP addresses
   ```

4. **Whitelist BankID IPs (if using strict firewall):**
   - Contact BankID for IP ranges
   - Add to AWS security group outbound rules

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)

---

## Emergency Workarounds

### Option 1: Fallback to Demo Mode (Temporary)

**Use case:** BankID outage affects all users, estimated >1 hour downtime

**Steps:**
1. Enable demo mode:
   ```bash
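   # Environment variables are set under --source-configuration, not --instance-configuration
   aws apprunner update-service --service-arn <SERVICE_ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=demo}}}"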
   ```

2. Communicate to users:
   ```
   Subject: Temporary login method available
   Body: Due to BankID outage, we've enabled demo login.
         Use email/password to access your account.
         BankID will be restored as soon as possible.
   ```

3. Monitor BankID status

4. **Revert to BankID when available:**
   ```bash
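   # Environment variables are set under --source-configuration, not --instance-configuration
   aws apprunner update-service --service-arn <SERVICE_ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=live}}}"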
   ```

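**Risk:** Demo mode may bypass KYC checks. Only use with Alem approval. Next.js inlines `NEXT_PUBLIC_*` variables at build time, so a changed value only takes effect once the image has been rebuilt with it.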

---

### Option 2: Redirect to Status Page

**Use case:** BankID outage, no ETA, no fallback available

**Steps:**
1. Deploy maintenance page:
   ```bash
   # Update health endpoint to return 503
   # This triggers BetterStack alert + status page update
   ```

2. Show user-friendly message:
   ```html
   Login Temporarily Unavailable
   Our authentication provider (BankID) is experiencing issues.
   We expect service to resume within X minutes.
   Status updates: status.drop.no
   ```

3. Monitor and communicate updates every 30 minutes

---

## Post-Incident Actions

1. **Document incident:**
   ```bash
   # Create incident report
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-bankid-failure.md
   ```

2. **Root cause analysis:**
   - What triggered the failure?
   - Why didn't monitoring detect it sooner?
   - What prevented faster recovery?

3. **Update monitoring:**
   - Add synthetic BankID login test (every 5 min)
   - Alert on OAuth callback failures >5/min

4. **Update runbook:**
   - Add new failure mode if discovered
   - Improve diagnosis steps based on what worked

5. **Team debrief (if >30 min outage):**
   - Review timeline
   - Identify improvements
   - Update on-call procedures

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 5 min | If not resolved, alert Alem via Slack + SMS |
| 15 min | If BankID outage confirmed, enable fallback (Alem approval) |
| 30 min | If still unresolved, schedule team call |
| 1 hour | If major outage, public communication via email/social media |

---

## Contacts

- **BankID Support:** support@bankid.no
- **BankID Phone:** +47 XXXX XXXX (24/7 for critical issues)
- **Internal:** Alem (CEO, final decision on fallback modes)

---

## Related Documentation

- `docs/architecture/authentication.md` — BankID OAuth flow
- `src/app/api/auth/bankid/route.ts` — BankID integration code
- `docs/dr-runbook.md` — Infrastructure disaster recovery
- Vaultwarden item: "BankID OAuth" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: PISP Payment Failure

# Runbook: PISP Payment Failure (Remittance & QR)

**Service:** Payment Initiation (PISP via Open Banking)
**Severity:** HIGH (blocks money transfers)
**MTTR Target:** <30 minutes
**Owner:** John (AI Director)

---

## Overview

PISP (Payment Initiation Service Provider) enables Drop to initiate payments directly from users' bank accounts. Failures in PISP prevent both **remittance** (send money abroad) and **QR payments** (in-store merchant payments).

---

## Symptoms

Users report they cannot complete payments:

- Payment initiation fails with error message
- Payment status stuck at "pending" indefinitely
- Bank redirect loop (never returns to Drop)
- Error: "Payment service unavailable"

**User impact:** Cannot send money or pay merchants.

---

## Diagnosis

### 1. Identify Payment Type

Determine which payment flow is affected:

- **Remittance:** User sends money to recipient abroad (`POST /api/transactions/remittance`)
- **QR Payment:** User pays merchant by scanning QR code (`POST /api/transactions/qr-payment`)

**Check recent transactions:**
```bash
# CloudWatch Logs
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "payment_initiation" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  --region eu-west-1 \
  | jq '.events[].message' \
  | grep -E "remittance|qr_payment|pisp_error"
```

### 2. Check Open Banking Provider Status

**Provider:** Neonomics (Norway), Swan BaaS (cross-border)

**Neonomics Status:**
```bash
# No official status page — check via test API call
curl -X POST https://sandbox.neonomics.io/payments/v1/payment-initiation \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"amount":100,"currency":"NOK"}' \
  -v

# Expected: HTTP 200 or 400 (validation error)
# If 500/503: Neonomics outage
```

**Swan API Status:**
```bash
# Check Swan status page
open https://status.swan.io

# Or test API
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{viewer{id}}"}' \
  -v

# Expected: HTTP 200
# If 500/503: Swan outage
```

### 3. Check Drop Logs for Error Codes

**Common PISP error codes:**

| Code | Meaning | Cause |
|------|---------|-------|
| `INSUFFICIENT_FUNDS` | User's bank account balance too low | User error |
| `ACCOUNT_NOT_ACCESSIBLE` | Bank account locked or closed | Bank issue |
| `CONSENT_EXPIRED` | Open Banking consent needs renewal | User must re-authenticate |
| `PAYMENT_REJECTED` | Bank declined payment | Fraud detection, limits |
| `TIMEOUT` | Bank API took too long to respond | Network/bank issue |
| `INVALID_IBAN` | Recipient bank account number invalid | User error |
| `LIMIT_EXCEEDED` | Payment exceeds daily limit | User or bank limit |

**Search logs for error codes:**
```bash
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "PISP_ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --region eu-west-1 \
  | jq -r '.events[].message' \
  | jq '.metadata.errorCode'
```

### 4. Test Payment Flow

**Manual test (staging environment):**
```bash
# 1. Login
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

# 2. Initiate test payment (small amount)
curl -X POST https://drop-staging.fly.dev/api/transactions/remittance \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientId": "rec_test123",
    "amount": 100,
    "currency": "NOK",
    "sendCurrency": "NOK",
    "receiveCurrency": "EUR"
  }' \
  -v

# Expected: HTTP 200, transaction created
# If 500: PISP integration broken
```

---

## Common Causes & Solutions

### Cause 1: Open Banking Provider Outage

**Probability:** 10% (Neonomics/Swan service disruption)

**Symptoms:**
- All payments fail with timeout or 503 error
- Provider status page reports incident
- Test API call fails

**Solution:**

1. **Verify outage:**
   - Check Neonomics/Swan status pages
   - Contact provider support if no public status

2. **Communicate to users:**
   ```
   Subject: Payment processing temporarily unavailable
   Body: Our payment provider is experiencing issues.
         We're monitoring the situation and expect service
         to resume within <X> minutes.
   ```

3. **Monitor provider status:**
   - Subscribe to provider status updates
   - Check every 15 minutes for resolution

4. **Queue failed payments (if applicable):**
   - Store payment requests in `pending_payments` table
   - Retry automatically when provider is back online

**ETA:** Depends on provider (typically <2 hours)

---

### Cause 2: Expired Open Banking Consent

**Probability:** 30% (user consent expires after 90 days)

**Symptoms:**
- Error code: `CONSENT_EXPIRED` or `ACCOUNT_NOT_ACCESSIBLE`
- Payments fail for specific users only (not all)
- Logs show: "Open Banking consent invalid"

**Solution:**

1. **Identify affected users:**
   ```sql
   SELECT user_id, bank_account_id, consent_expires_at
   FROM bank_accounts
   WHERE consent_expires_at < datetime('now');
   ```

2. **Notify users to re-authenticate:**
   - Send push notification: "Please reconnect your bank account"
   - In-app banner: "Bank connection expired, tap to reconnect"

3. **Guide user through re-consent flow:**
   - User taps "Reconnect bank account"
   - Redirect to AISP consent flow (BankID + bank approval)
   - Update `consent_expires_at` in database (90 days from now)

4. **Retry payment after re-consent:**
   - Original payment request should be retryable
   - Or user initiates new payment

**ETA:** Immediate (user action required)
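
The expiry arithmetic from step 3 can be sketched as follows. The 90-day lifetime comes from this runbook; the 7-day reminder window is an assumption chosen for illustration, and both function names are hypothetical. A consent counts as expired only once `expiresAt` lies strictly in the past, matching the SQL comparison in step 1.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const CONSENT_LIFETIME_DAYS = 90; // from the re-consent step above
const REMINDER_WINDOW_DAYS = 7; // assumed lead time for reminders

type ConsentStatus = "valid" | "expiring_soon" | "expired";

// New expiry timestamp to store after a successful re-consent.
function newConsentExpiry(grantedAt: Date): Date {
  return new Date(grantedAt.getTime() + CONSENT_LIFETIME_DAYS * DAY_MS);
}

// Classify a consent so the app can show a banner before payments start failing.
function consentStatus(expiresAt: Date, now: Date): ConsentStatus {
  const remainingMs = expiresAt.getTime() - now.getTime();
  if (remainingMs < 0) return "expired";
  if (remainingMs <= REMINDER_WINDOW_DAYS * DAY_MS) return "expiring_soon";
  return "valid";
}
```

Showing the reconnect banner in the `expiring_soon` state lets users renew before a payment fails with `CONSENT_EXPIRED`.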

---

### Cause 3: Insufficient Funds in User's Bank Account

**Probability:** 25% (user error)

**Symptoms:**
- Error code: `INSUFFICIENT_FUNDS`
- Payment fails for specific transaction only
- Logs show: "Account balance too low"

**Solution:**

1. **Show clear error message to user:**
   ```
   Payment failed: Insufficient funds
   Your bank account balance is too low to complete this payment.
   Please add funds or choose a different payment method.
   ```

2. **Suggest alternatives:**
   - Link different bank account (if multi-account supported)
   - Reduce payment amount
   - Try again later

3. **No action needed on Drop side** (user must resolve)

**ETA:** N/A (user-side issue)

---

### Cause 4: Bank Fraud Detection / Payment Rejection

**Probability:** 15% (bank security systems)

**Symptoms:**
- Error code: `PAYMENT_REJECTED` or `SECURITY_BLOCK`
- Payment fails after bank redirect
- Logs show: "Bank declined transaction"

**Solution:**

1. **Advise user to contact their bank:**
   ```
   Payment failed: Your bank declined this transaction.
   This may be due to fraud protection or payment limits.
   Please contact your bank to authorize the payment.
   ```

2. **Check if payment is unusual for user:**
   - First international transfer?
   - Amount significantly higher than usual?
   - High-risk destination country?

3. **User should:**
   - Call their bank's fraud department
   - Confirm the payment is legitimate
   - Ask bank to whitelist Drop payments
   - Retry after bank approval

4. **Document pattern:**
   - If many users from same bank report this, investigate bank compatibility
   - May need to add bank-specific messaging

**ETA:** Depends on user's bank (minutes to hours)

---

### Cause 5: PISP API Rate Limiting

**Probability:** 5% (during high-traffic periods)

**Symptoms:**
- Error code: `RATE_LIMIT_EXCEEDED` or HTTP 429
- Intermittent failures (some payments succeed, others fail)
- Logs show: "Too many requests"

**Solution:**

1. **Check rate limit headers:**
   ```bash
   # Find rate limit status in logs
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern "X-RateLimit" \
     --start-time $(date -u -d '10 minutes ago' +%s)000 \
     --region eu-west-1
   ```

2. **Implement request queuing:**
   ```typescript
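   // src/lib/pisp-client.ts
   import PQueue from 'p-queue';

   // interval only throttles together with intervalCap: at most 5 calls start per second
   // (the cap of 5 is an example value, tune it to the provider's quota)
   const queue = new PQueue({ concurrency: 5, interval: 1000, intervalCap: 5 });

   async function initiatePayment(params) {
     return queue.add(() => pisService.createPayment(params));
   }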
   ```

3. **Exponential backoff on retry:**
   ```typescript
   const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

   async function retryPayment(params, attempt = 1) {
     try {
       return await initiatePayment(params);
     } catch (error) {
       // Retry only on rate limiting, with at most 3 attempts in total
       if (error.status === 429 && attempt < 3) {
         await sleep(1000 * 2 ** attempt); // 2s, then 4s
         return retryPayment(params, attempt + 1);
       }
       throw error;
     }
   }
   ```

4. **Contact provider to increase limits (if persistent):**
   - Email Neonomics support with usage stats
   - Request higher API quota for production

**ETA:** 5 minutes (automatic retry), 1-2 days (if quota increase needed)

---

### Cause 6: Invalid Recipient Bank Account (IBAN/SWIFT)

**Probability:** 20% (user input error)

**Symptoms:**
- Error code: `INVALID_IBAN` or `ACCOUNT_NOT_FOUND`
- Payment fails immediately (no bank redirect)
- Logs show: "Recipient account validation failed"

**Solution:**

1. **Show clear validation error:**
   ```
   Payment failed: Invalid recipient bank account
   The IBAN you entered is not valid. Please check and try again.
   IBAN: DE89 3704 0044 0532 0130 00 (example format)
   ```

2. **Improve frontend validation:**
   - Add real-time IBAN validation (checksum algorithm)
   - Use IBAN validation library (e.g., `ibantools`)
   - Show format hints per country

3. **Ask user to verify recipient details:**
   - Double-check IBAN/SWIFT code
   - Confirm with recipient
   - Try alternative payment method if IBAN is correct but still rejected

**ETA:** Immediate (user correction)
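
The checksum validation mentioned in step 2 is the standard ISO 13616 mod-97 check. The sketch below is a dependency-free illustration, not the `ibantools` API: it checks the general shape and the checksum, but not country-specific lengths, which a production validator should also enforce. The function name is hypothetical.

```typescript
// Validate an IBAN using the mod-97 checksum.
function isValidIban(input: string): boolean {
  const iban = input.replace(/\s+/g, "").toUpperCase();
  // Two letters, two check digits, then 11 to 30 alphanumeric characters.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;

  // Move the country code and check digits to the end.
  const rearranged = iban.slice(4) + iban.slice(0, 4);

  // Compute the remainder incrementally so no big-number type is needed.
  let remainder = 0;
  for (const ch of rearranged) {
    // Letters map to 10..35 ("A" = 10); digits keep their value.
    const value = ch >= "A" ? ch.charCodeAt(0) - 55 : Number(ch);
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

Running this check on input before the payment request is sent catches most typos early, since any single changed digit alters the remainder.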

---

## Emergency Workarounds

### Option 1: Manual Payment Processing

**Use case:** PISP provider down >2 hours, urgent payments needed

**Steps:**
1. Collect payment requests manually:
   ```sql
   SELECT id, user_id, amount, currency, recipient_iban
   FROM transactions
   WHERE status = 'pending' AND created_at > datetime('now', '-2 hours');
   ```

2. **Alem initiates payments manually** via Drop's business bank account:
   - Log into business banking portal
   - Enter recipient details manually
   - Process payment one by one

3. Update Drop transaction status:
   ```sql
   UPDATE transactions SET status = 'completed', completed_at = datetime('now')
   WHERE id = '<TRANSACTION_ID>';
   ```

4. Notify users:
   ```
   Subject: Your payment has been processed
   Body: Your payment of <AMOUNT> to <RECIPIENT> has been completed manually
         due to a temporary service issue. Thank you for your patience.
   ```

**Risk:** Manual work, prone to errors. Only use for critical/urgent payments.

---

### Option 2: Redirect to Alternative Payment Method

**Use case:** PISP down, no ETA, users need alternative

**Steps:**
1. Show modal in app:
   ```
   Payment Initiation Unavailable
   Our payment service is temporarily down.
   Alternative options:
   - Bank transfer (manual IBAN entry)
   - Try again later (we'll notify you when service is restored)
   ```

2. Provide manual bank transfer instructions:
   ```
   Transfer to:
   Account holder: Drop AS
   IBAN: NO93 8601 1117 947
   Amount: <AMOUNT>
   Reference: <REFERENCE>
   ```

3. Monitor for manual transfers:
   - Check business bank account for incoming payments
   - Match reference code to pending Drop transactions
   - Mark as completed when received

**ETA:** Immediate (user can pay via manual transfer)

---

## Monitoring & Alerts

### Metrics to Track

- **Payment success rate:** Should be >95%
- **Payment latency:** p50 <5s, p95 <15s, p99 <30s
- **Error rate by code:** Track `INSUFFICIENT_FUNDS`, `CONSENT_EXPIRED`, `TIMEOUT` separately

### Alert Rules

```typescript
// src/lib/payment-monitor.ts
export async function trackPaymentFailure(errorCode: string, transactionId: string) {
  const failureRate = await calculateFailureRate('last_5_minutes');

  if (failureRate > 0.1) { // 10% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'High payment failure rate',
      message: `${(failureRate * 100).toFixed(1)}% of payments failing in last 5 min`,
    });
  }
}
```
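
The alert rule above depends on a `calculateFailureRate` helper that is not shown. One possible way to compute a rate over the last five minutes is an in-memory sliding window, sketched below. The class and its names are hypothetical, and an in-memory window only sees the traffic of a single instance, so a multi-instance deployment would need a shared store or a query against the `transactions` table instead.

```typescript
interface PaymentOutcome {
  at: number; // epoch milliseconds
  ok: boolean;
}

class FailureRateWindow {
  private outcomes: PaymentOutcome[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record the outcome of one payment attempt.
  record(ok: boolean, at: number = Date.now()): void {
    this.outcomes.push({ at, ok });
  }

  // Share of failed payments inside the window (0 when there is no data).
  failureRate(now: number = Date.now()): number {
    this.outcomes = this.outcomes.filter((o) => now - o.at <= this.windowMs);
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((o) => !o.ok).length;
    return failures / this.outcomes.length;
  }
}
```

With very little traffic a single failure can exceed the 10% threshold, so a minimum sample size before alerting is worth considering.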

### Dashboard Queries

```sql
-- Payment success rate (last 24h)
SELECT
  COUNT(*) FILTER (WHERE status = 'completed') * 100.0 / COUNT(*) as success_rate,
  COUNT(*) as total_payments
FROM transactions
WHERE created_at > datetime('now', '-24 hours');

-- Top error codes (last hour)
SELECT error_code, COUNT(*) as count
FROM transactions
WHERE status = 'failed' AND created_at > datetime('now', '-1 hour')
GROUP BY error_code
ORDER BY count DESC;
```

---

## Post-Incident Actions

1. **Update transaction status:**
   ```sql
   -- Mark timed-out payments as failed (after 1 hour)
   UPDATE transactions
   SET status = 'failed', error_code = 'TIMEOUT', error_message = 'Payment timed out'
   WHERE status = 'pending' AND created_at < datetime('now', '-1 hour');
   ```

2. **Notify affected users:**
   - Send email/push notification about failed payment
   - Offer to retry or refund

3. **Document incident:**
   - Create post-mortem in `comms/incidents/`
   - Track downtime duration
   - Calculate financial impact (lost transactions)

4. **Review provider SLA:**
   - Check if outage violates SLA
   - Request compensation/credits if applicable

5. **Improve resilience:**
   - Add payment retry queue
   - Implement circuit breaker for provider API
   - Consider multi-provider failover (backup PISP)
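
As an illustration of the circuit breaker idea from item 5, the sketch below tracks consecutive failures and stops calling the provider for a cooldown period once a threshold is reached. It is a simplified state machine with hypothetical names and thresholds: in the half-open state it lets every caller through until one result is recorded, whereas a production breaker would usually admit a single probe.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Ask before each provider call; false means fail fast without calling.
  canRequest(now: number = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half_open"; // cooldown over, allow a probe
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```

Failing fast while the breaker is open keeps requests from piling up against a provider that is already down and gives users an immediate, clear error.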

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 10 min | If provider outage confirmed, notify Alem |
| 30 min | If not resolved, assess manual processing need |
| 1 hour | If critical payments pending, start manual workaround (Alem approval) |
| 2 hours | Public communication to all users |

---

## Contacts

- **Neonomics Support:** support@neonomics.io, Slack: #neonomics-support
- **Swan Support:** support@swan.io (email), Swan Slack (if available)
- **Internal:** Alem (CEO, manual payment approval)

---

## Related Documentation

- `docs/architecture/payments.md` — PISP flow diagrams
- `src/app/api/transactions/remittance/route.ts` — Remittance implementation
- `src/app/api/transactions/qr-payment/route.ts` — QR payment implementation
- `docs/compliance/psd2-requirements.md` — Regulatory requirements

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)
**Test Status:** Pending (Phase 2 live payments)

# Runbook: Sumsub KYC Failure

# Runbook: Sumsub KYC/AML Verification Failure

**Service:** Sumsub Identity Verification (KYC/AML)
**Severity:** HIGH (blocks new user registrations)
**MTTR Target:** <30 minutes
**Owner:** John (AI Director)

---

## Overview

Sumsub provides automated identity verification (KYC - Know Your Customer) and AML (Anti-Money Laundering) checks for Drop. Required for regulatory compliance before users can make payments.

**KYC Process:**
1. User uploads ID document (passport, driver's license, national ID)
2. User takes selfie (liveness check)
3. Sumsub verifies document authenticity
4. Sumsub performs AML sanctions screening
5. Result: APPROVED, REJECTED, or MANUAL_REVIEW

**Impact:** If Sumsub fails, new users cannot complete registration. Existing users are unaffected.

---

## Symptoms

Users report they cannot complete identity verification:

- ID upload fails with error
- Verification stuck at "Processing..." indefinitely
- Error message: "Verification service unavailable"
- Webhook never receives result from Sumsub
- User status stuck at "pending_kyc"

**User impact:** Cannot complete registration, cannot make payments.

---

## Diagnosis

### 1. Check Sumsub Service Status

**External status:**
```bash
# Sumsub does not have a public status page
# Test via API health check
curl https://api.sumsub.com/resources/healthcheck \
  -H "X-App-Token: <APP_TOKEN>" \
  -v

# Expected: HTTP 200
# If 500/503: Sumsub outage
```

### 2. Check Drop Logs

```bash
# CloudWatch Logs (production)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "sumsub" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "Sumsub API timeout"
# - "Sumsub webhook failed"
# - "KYC verification failed: document_expired"
# - "AML sanctions match: [name]"
```

### 3. Check Sumsub Dashboard

```bash
# Login to Sumsub Dashboard
open https://cockpit.sumsub.com

# Check:
# - Recent applicants (last 1 hour)
# - Failed verifications
# - Manual review queue length
# - Webhook delivery status
```

### 4. Check Webhook Delivery

**Verify webhook endpoint is reachable:**
```bash
# Sumsub sends webhooks to: https://getdrop.no/api/webhooks/sumsub
# Test endpoint manually
curl -X POST https://getdrop.no/api/webhooks/sumsub \
  -H "Content-Type: application/json" \
  -H "X-Sumsub-Signature: test" \
  -d '{"type":"applicantReviewed","reviewResult":{"reviewAnswer":"GREEN"}}' \
  -v

# Expected with the dummy signature above: HTTP 401 (endpoint reachable, signature check enforced)
# If 200: signature validation is not being enforced
# If 404: Webhook endpoint not deployed
```
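
For context on the 401 case, this is roughly what signature validation on a webhook endpoint looks like. The sketch assumes an HMAC-SHA256 hex digest of the raw request body; the header name used in the curl example above, the digest algorithm, and the secret handling are all assumptions that must be checked against Sumsub's webhook documentation and Drop's actual handler. Function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the hex signature for a payload (useful for tests and local replays).
function signPayload(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Verify a received hex signature against the raw, unparsed request body.
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // Length check first: timingSafeEqual throws on unequal lengths.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The signature must be computed over the exact bytes received: re-serializing parsed JSON can change whitespace or key order and make valid webhooks fail validation.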

### 5. Test KYC Flow

**Manual test (staging):**
```bash
# 1. Create test applicant
curl -X POST https://api.sumsub.com/resources/applicants \
  -H "X-App-Token: <APP_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "externalUserId": "test-user-123",
    "levelName": "basic-kyc-level",
    "email": "test@example.com"
  }' \
  -v

# Expected: HTTP 201, applicant created
# If 400: Invalid request
# If 500: Sumsub API issue
```

---

## Common Causes & Solutions

### Cause 1: Sumsub API Outage (External)

**Probability:** 5% (Sumsub service disruption)

**Symptoms:**
- All KYC verifications fail
- Sumsub API health check returns 503
- Dashboard shows no recent applicants
- Logs show API timeouts

**Solution:**

1. **Verify outage:**
   ```bash
   # Test Sumsub API from different networks
   curl https://api.sumsub.com/resources/healthcheck \
     -H "X-App-Token: <APP_TOKEN>" \
     -v

   # If consistent failure: confirmed outage
   ```

2. **Contact Sumsub support:**
   - Email: support@sumsub.com
   - Live chat: https://cockpit.sumsub.com (bottom-right)
   - Phone: Check Sumsub Dashboard for support number

3. **Communicate to users (Norwegian):**
   ```
   Emne: Identitetsverifisering midlertidig utilgjengelig

   Hei,

   Vi opplever for øyeblikket tekniske problemer med identitetsverifisering.
   Du kan fortsette registreringen senere.

   Vi forventer at tjenesten er tilbake innen [X minutter/timer].

   Mvh,
   Drop
   ```

4. **Queue pending verifications:**
   ```sql
   -- Mark users as pending KYC retry
   UPDATE users
   SET kyc_status = 'pending_retry',
       kyc_retry_at = datetime('now', '+1 hour')
   WHERE kyc_status = 'pending_kyc'
   AND created_at > datetime('now', '-2 hours');
   ```

5. **Retry when Sumsub is back:**
   ```bash
   # Cron job to retry pending KYC
   node ~/ALAI/products/Drop/scripts/retry-kyc.js
   ```

**ETA:** Depends on Sumsub (typically <2 hours)

---

### Cause 2: Document Verification Failure (User Error)

**Probability:** 40% (user uploads poor quality or invalid document)

**Symptoms:**
- Specific users fail KYC (not all users)
- Logs show: "document_not_readable", "document_expired", "document_type_mismatch"
- Sumsub dashboard shows rejection reason

**Common rejection reasons:**
- Blurry photo (document not readable)
- Expired document (passport/ID expired)
- Wrong document type (e.g., bank statement instead of ID)
- Photo cropped (missing corners/edges)
- Underage (user < 18 years old)

**Solution:**

1. **Identify rejection reason:**
   ```sql
   SELECT user_id, kyc_rejection_reason, kyc_rejected_at
   FROM users
   WHERE kyc_status = 'rejected'
   ORDER BY kyc_rejected_at DESC
   LIMIT 10;
   ```

2. **Show clear error to user (Norwegian):**

   **Blurry document:**
   ```
   Dokumentet er ikke leselig
   Ta et nytt bilde i godt lys.
   Sørg for at all tekst er skarp og leselig.
   ```

   **Expired document:**
   ```
   Dokumentet er utløpt
   Vennligst last opp et gyldig pass eller førerkort.
   Dokumentet må være gyldig i minst 1 måned.
   ```

   **Wrong document type:**
   ```
   Feil dokumenttype
   Vi godtar kun: Pass, Nasjonalt ID-kort, Førerkort.
   Bankkort og regninger godtas ikke.
   ```

   **Underage:**
   ```
   Du må være 18 år eller eldre
   Drop er kun tilgjengelig for brukere over 18 år.
   ```

3. **Allow user to retry:**
   - Show "Try Again" button in app
   - Provide tips for better photo quality
   - Link to FAQ: "How to take a good ID photo"

4. **Track retry success rate:**
   ```sql
   -- How many users succeed on 2nd attempt?
   SELECT
     COUNT(*) FILTER (WHERE kyc_attempt = 1 AND kyc_status = 'approved') as first_attempt_success,
     COUNT(*) FILTER (WHERE kyc_attempt = 2 AND kyc_status = 'approved') as second_attempt_success,
     COUNT(*) FILTER (WHERE kyc_attempt >= 3) as multiple_retries
   FROM users;
   ```

**ETA:** Immediate (user must retry with better document)

---

### Cause 3: AML Sanctions Match (Compliance Issue)

**Probability:** 3% (user flagged by sanctions screening)

**Symptoms:**
- Specific user's KYC fails with: "AML_SANCTIONS_MATCH"
- Sumsub dashboard shows "Red flag" or "Manual review required"
- User name matches sanctions list (OFAC, EU, UN, etc.)

**Solution:**

1. **Identify flagged users:**
   ```sql
   SELECT user_id, email, full_name, kyc_rejection_reason
   FROM users
   WHERE kyc_rejection_reason LIKE '%sanctions%'
   OR kyc_status = 'manual_review_aml';
   ```

2. **Review Sumsub dashboard:**
   - Login: https://cockpit.sumsub.com
   - Navigate to applicant
   - Check AML screening results
   - Review sanctions list match details

3. **False positive (common names):**
   - Example: "Ali Hassan" may match many sanctioned individuals
   - Sumsub shows match details (date of birth, nationality)
   - If clearly different person: manually approve in Sumsub

4. **True positive (actual sanctions match):**
   - **DO NOT approve.** This is a legal/regulatory issue.
   - Reject user registration immediately
   - Document incident for compliance records

5. **Notify user (if false positive, manually approved):**
   ```
   Din identitetsverifisering er godkjent
   Takk for tålmodigheten. Du kan nå bruke Drop.
   ```

6. **Notify user (if true positive, rejected):**
   ```
   Vi kan dessverre ikke godkjenne din registrering
   På grunn av regulatoriske krav kan vi ikke tilby tjenester til deg.
   Ta kontakt med support@getdrop.no hvis du mener dette er en feil.
   ```

7. **Escalate to Alem if uncertain:**
   - AML compliance is critical
   - False rejection = bad UX, but false approval = legal risk
   - Alem makes final call on borderline cases

**ETA:** 10 minutes (false positive), N/A (true positive - reject)

---

### Cause 4: Webhook Delivery Failure

**Probability:** 15% (Drop webhook endpoint down or unreachable)

**Symptoms:**
- Sumsub completes verification, but Drop never updates user status
- Logs show: "Webhook not received"
- Sumsub dashboard shows "Webhook delivery failed"
- User stuck at "pending_kyc" despite Sumsub showing "approved"

**Solution:**

1. **Check webhook endpoint health:**
   ```bash
   # Test webhook endpoint
   curl -X POST https://getdrop.no/api/webhooks/sumsub \
     -H "Content-Type: application/json" \
     -d '{"type":"ping"}' \
     -v

   # Expected: HTTP 200
   # If 401/403: endpoint is reachable but rejected the unsigned test payload
   # If 404/500: Drop webhook endpoint broken
   ```

2. **Check Sumsub webhook delivery logs:**
   - Login: https://cockpit.sumsub.com
   - Navigate to Settings → Webhooks
   - Check recent delivery attempts
   - Look for: 404, 500, timeout errors

3. **Manually retry failed webhooks:**
   - Sumsub Dashboard → Applicant → "Resend Webhook"
   - This triggers new webhook delivery to Drop
   - Verify Drop receives and processes it

4. **Fetch verification results via API (if webhook lost):**
   ```bash
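   # Manually fetch applicant status from Sumsub
   # (Sumsub also expects the signed-request headers X-App-Access-Ts and
   # X-App-Access-Sig; X-App-Token alone is not sufficient)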
   curl -X GET https://api.sumsub.com/resources/applicants//status \
     -H "X-App-Token: " \
     -v

   # Parse result and update Drop database
   ```

5. **Update Drop database manually:**
   ```sql
   UPDATE users
   SET kyc_status = 'approved',
       kyc_approved_at = datetime('now')
   WHERE sumsub_applicant_id = '';
   ```

6. **Fix webhook endpoint (if broken):**
   - Check App Runner deployment status
   - Verify webhook route exists: `src/app/api/webhooks/sumsub/route.ts`
   - Check signature validation (Sumsub signs webhooks with HMAC)

**ETA:** 10 minutes (manual retry), 30 minutes (if endpoint fix needed)
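
When the webhook route rejects every delivery, the signature check is a common culprit. The sketch below shows one way to verify an HMAC digest over the raw request body. It assumes the digest arrives hex-encoded and that the algorithm label is one of the three listed; the function name, the labels, and the header names in the usage note are assumptions to confirm against the Sumsub webhook documentation and the actual route code.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed mapping from the sender's algorithm label to a Node digest name.
const ALGORITHMS: Record<string, string> = {
  HMAC_SHA1_HEX: "sha1",
  HMAC_SHA256_HEX: "sha256",
  HMAC_SHA512_HEX: "sha512",
};

/**
 * Returns true when receivedDigest equals the HMAC of the raw request body.
 * rawBody must be the exact text received, before any JSON parsing.
 */
export function verifyWebhookSignature(
  rawBody: string,
  receivedDigest: string,
  secretKey: string,
  algorithmLabel: string = "HMAC_SHA256_HEX",
): boolean {
  const algorithm = ALGORITHMS[algorithmLabel];
  if (!algorithm) return false; // unknown algorithm: reject

  const expected = createHmac(algorithm, secretKey).update(rawBody).digest();
  const received = Buffer.from(receivedDigest, "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

In a route handler this would typically be called with `await request.text()` and the digest header (assumed here to be `x-payload-digest`) before the body is parsed. A frequent failure mode is computing the HMAC over re-serialized JSON instead of the raw body.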

---

### Cause 5: Invalid or Expired API Credentials

**Probability:** 5% (after credential rotation)

**Symptoms:**
- Logs show: "401 Unauthorized" or "403 Forbidden"
- All Sumsub API calls fail
- Webhook signature validation fails

**Solution:**

1. **Verify Sumsub API credentials:**
   ```bash
   bw get item "Sumsub API" --session $BW_SESSION

   # Check:
   # - App Token is correct
   # - Secret Key is correct (for webhook signature)
   # - Environment: production vs sandbox
   ```

2. **Regenerate API credentials (if needed):**
   - Login: https://cockpit.sumsub.com
   - Navigate to Settings → API
   - Generate new App Token + Secret Key
   - Copy to Vaultwarden

3. **Update App Runner environment variables:**
   ```bash
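   # Env vars are set under --source-configuration ("..." = existing image settings);
   # --instance-configuration only accepts Cpu, Memory, and InstanceRoleArn.
   aws apprunner update-service --service-arn  \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={
       SUMSUB_APP_TOKEN=,
       SUMSUB_SECRET_KEY=,
       SUMSUB_ENVIRONMENT=production
     }}}"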
   ```

4. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn  --region eu-west-1
   ```

5. **Test after deployment:**
   ```bash
   # Try creating test applicant
   curl -X POST https://getdrop.no/api/kyc/initiate \
     -H "Authorization: Bearer " \
     -v

   # Expected: HTTP 200, Sumsub applicant created
   ```

**ETA:** 10 minutes
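
To confirm that a rotated App Token and Secret Key actually work together, it helps to see how a signed Sumsub request is built. The sketch below follows the signing scheme as commonly documented by Sumsub (HMAC-SHA256 over timestamp, method, path, and body); treat the header names and the function name `signSumsubRequest` as assumptions to check against the current API reference.

```typescript
import { createHmac } from "node:crypto";

/**
 * Builds the authentication headers for one Sumsub API request.
 * pathWithQuery is the request path including any query string,
 * e.g. the applicant status path used earlier in this runbook.
 */
export function signSumsubRequest(
  method: string,
  pathWithQuery: string,
  secretKey: string,
  appToken: string,
  body: string = "",
  timestampSeconds: number = Math.floor(Date.now() / 1000),
): Record<string, string> {
  // Signature input: timestamp + upper-case method + path + raw body.
  const payload = `${timestampSeconds}${method.toUpperCase()}${pathWithQuery}${body}`;
  const signature = createHmac("sha256", secretKey).update(payload).digest("hex");

  return {
    "X-App-Token": appToken,
    "X-App-Access-Ts": String(timestampSeconds),
    "X-App-Access-Sig": signature,
  };
}
```

If a request signed this way still returns 401, the usual suspects are a token and secret from different environments (sandbox vs production) or a server clock that has drifted.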

---

### Cause 6: Liveness Check Failure (Selfie)

**Probability:** 20% (user fails selfie/liveness verification)

**Symptoms:**
- Specific users fail at selfie stage
- Logs show: "liveness_check_failed", "face_mismatch"
- Sumsub dashboard shows "Selfie does not match ID photo"

**Common reasons:**
- Poor lighting (too dark, too bright)
- User wears sunglasses/hat
- Multiple people in frame
- Photo of a photo (not live person)
- Face does not match ID document

**Solution:**

1. **Show clear instructions before selfie (Norwegian):**
   ```
   Slik tar du et godt selfie-bilde:
   ✓ God belysning (dagslys er best)
   ✓ Fjern briller/solbriller
   ✓ Se rett i kameraet
   ✓ Kun ditt ansikt i bildet
   ✗ Ikke bruk foto av foto
   ```

2. **Allow retry with better instructions:**
   ```
   Selfie-verifisering mislyktes
   Prøv igjen med bedre belysning.
   Sørg for at ansiktet ditt er tydelig synlig.
   ```

3. **Improve liveness detection settings (if too strict):**
   - Login: https://cockpit.sumsub.com
   - Navigate to Settings → Verification Levels
   - Adjust liveness sensitivity (low/medium/high)
   - Balance: security vs user friction

4. **Manual review (if automated fails repeatedly):**
   - Some users may need manual review
   - Sumsub team reviews video/photos manually
   - ETA: 1-24 hours depending on Sumsub queue

**ETA:** Immediate (user retry), 1-24 hours (manual review)

---

## Emergency Workarounds

### Option 1: Manual KYC Review (Temporary)

**Use case:** Sumsub down >1 hour, urgent user needs verification

**Steps:**

1. Collect KYC documents manually:
   - Ask user to email ID photo + selfie to support@getdrop.no
   - Subject: "KYC Manual Review - [User ID]"

2. **Alem or John reviews manually:**
   - Verify ID document authenticity (check security features)
   - Compare selfie to ID photo
   - Check ID expiry date
   - Verify age >= 18

3. **Manual AML check:**
   - Search user name on: https://sanctionssearch.ofac.treas.gov
   - Check EU sanctions list: https://eeas.europa.eu/topics/sanctions-policy
   - Document findings

4. **Approve in database (if passes checks):**
   ```sql
   UPDATE users
   SET kyc_status = 'approved_manual',
       kyc_approved_at = datetime('now'),
       kyc_approved_by = 'john',
       kyc_notes = 'Manual review during Sumsub outage'
   WHERE user_id = '';
   ```

5. **Notify user:**
   ```
   Din identitet er verifisert
   Velkommen til Drop! Du kan nå gjøre betalinger.
   ```

**Risk:** Manual review is slow, error-prone, not scalable. Only for critical cases.
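
Two of the manual checks above (age and document expiry) are mechanical and easy to get wrong by hand. The sketch below is a hypothetical helper, not part of the Drop codebase; it applies the 18-year limit and the one-month remaining validity stated in the user-facing copy, and returns rejection reasons using the same labels as the metrics section.

```typescript
export type KycCheckFailure = "underage" | "expired";

// Same calendar day N months later, computed in UTC.
function addMonthsUtc(date: Date, months: number): Date {
  return new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth() + months, date.getUTCDate()));
}

// Completed years between dateOfBirth and today (UTC calendar dates).
function ageInYears(dateOfBirth: Date, today: Date): number {
  let age = today.getUTCFullYear() - dateOfBirth.getUTCFullYear();
  const beforeBirthday =
    today.getUTCMonth() < dateOfBirth.getUTCMonth() ||
    (today.getUTCMonth() === dateOfBirth.getUTCMonth() &&
      today.getUTCDate() < dateOfBirth.getUTCDate());
  if (beforeBirthday) age -= 1;
  return age;
}

/** Returns the list of failed checks; an empty list means both checks pass. */
export function checkManualKyc(
  dateOfBirth: Date,
  documentExpiry: Date,
  today: Date = new Date(),
): KycCheckFailure[] {
  const failures: KycCheckFailure[] = [];
  if (ageInYears(dateOfBirth, today) < 18) failures.push("underage");
  // Document must remain valid for at least one more month.
  if (documentExpiry.getTime() < addMonthsUtc(today, 1).getTime()) failures.push("expired");
  return failures;
}
```

This does not replace the authenticity, selfie, or sanctions checks, which still need a human reviewer.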

---

### Option 2: Delay Registration, Notify When Ready

**Use case:** Sumsub down, no ETA, non-urgent registrations

**Steps:**

1. Show maintenance message:
   ```
   Identitetsverifisering midlertidig utilgjengelig
   Vi jobber med å løse problemet.
   Du vil motta en e-post når du kan fortsette registreringen.
   ```

2. Collect user email:
   ```typescript
   // src/app/api/auth/register/route.ts
   if (sumsubUnavailable) {
     await db.insert('pending_registrations', {
       email: userEmail,
       status: 'waiting_kyc',
       created_at: new Date(),
     });

     return {
       success: true,
       message: 'We will notify you when registration is available',
     };
   }
   ```

3. **When Sumsub is back, notify users:**
   ```sql
   SELECT email FROM pending_registrations WHERE status = 'waiting_kyc';
   ```

   Email (Norwegian):
   ```
   Emne: Du kan nå fullføre registreringen i Drop

   Hei,

   Identitetsverifisering er tilbake.
   Klikk her for å fortsette registreringen: [Link]

   Mvh,
   Drop
   ```

**ETA:** Delayed registration (hours to days)

---

## Monitoring & Alerts

### Metrics to Track

- **KYC success rate:** Should be >85% (accounting for user errors)
- **KYC processing time:** p50 <5min, p95 <30min, p99 <2h (includes manual review)
- **Rejection reasons:** Track document_not_readable, expired, underage, sanctions separately

### Alert Rules

```typescript
// src/lib/kyc-monitor.ts
export async function trackKYCFailure(userId: string, reason: string) {
  const failureRate = await calculateKYCFailureRate('last_hour');

  if (failureRate > 0.3) { // 30% failure rate
    await sendAlert({
      severity: 'high',
      title: 'KYC failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of KYC attempts failing`,
      reason,
    });
  }
}
```
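
The alert rule above relies on a failure-rate calculation that is not shown. One possible shape for it is a sliding window over recent attempts, sketched below. The class name `KycFailureWindow`, the in-memory storage, and the minimum-sample guard are all assumptions; a production version would more likely query the database or a metrics store.

```typescript
interface KycAttempt {
  timestampMs: number;
  succeeded: boolean;
}

export class KycFailureWindow {
  private attempts: KycAttempt[] = [];
  private readonly windowMs: number;
  private readonly minSamples: number;

  constructor(windowMs: number = 60 * 60 * 1000, minSamples: number = 10) {
    this.windowMs = windowMs;
    this.minSamples = minSamples;
  }

  record(succeeded: boolean, nowMs: number = Date.now()): void {
    this.attempts.push({ timestampMs: nowMs, succeeded });
  }

  /** Fraction of failed attempts inside the window, or 0 when there is too little data. */
  failureRate(nowMs: number = Date.now()): number {
    const cutoff = nowMs - this.windowMs;
    // Drop attempts that have aged out of the window.
    this.attempts = this.attempts.filter((a) => a.timestampMs > cutoff);
    if (this.attempts.length < this.minSamples) return 0;
    const failures = this.attempts.filter((a) => !a.succeeded).length;
    return failures / this.attempts.length;
  }
}
```

The minimum-sample guard keeps a single failed attempt during a quiet hour from reading as a 100% failure rate and paging someone.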

---

## Post-Incident Actions

1. **Retry failed KYC verifications:**
   ```sql
   UPDATE users
   SET kyc_status = 'pending_retry',
       kyc_retry_at = datetime('now')
   WHERE kyc_status IN ('failed', 'pending_kyc')
   AND created_at > datetime('now', '-24 hours');
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-sumsub-kyc-failure.md
   ```

3. **Review rejection reasons:**
   - High document_not_readable rate? Improve photo instructions
   - High liveness_check_failed rate? Adjust Sumsub settings
   - Track improvements in next month's KYC metrics

4. **Update user onboarding:**
   - Add better photo guides
   - Show example of good vs bad ID photos
   - Pre-flight check: "Is your ID expired?"

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 15 min | If Sumsub outage confirmed, notify Alem |
| 30 min | If urgent user needs KYC, consider manual review (Alem approval) |
| 1 hour | Public communication to users |
| 2 hours | Contact Sumsub support via phone if no response |

---

## Contacts

- **Sumsub Support:** support@sumsub.com
- **Sumsub Live Chat:** https://cockpit.sumsub.com (bottom-right)
- **Sumsub Phone:** Check Sumsub Dashboard for support number
- **Internal:** Alem (CEO, manual KYC approval authority)

---

## Related Documentation

- `docs/architecture/kyc-aml.md` — KYC/AML flow diagrams
- `src/app/api/kyc/initiate/route.ts` — Sumsub integration code
- `docs/compliance/kyc-requirements.md` — Regulatory requirements (age, ID types)
- Vaultwarden item: "Sumsub API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: Swan API Outage

# Runbook: Swan BaaS API Outage

**Service:** Swan Banking-as-a-Service
**Severity:** CRITICAL (blocks accounts, cards, payments if Swan is primary provider)
**MTTR Target:** <15 minutes
**Owner:** John (AI Director)

---

## Overview

Swan provides core banking infrastructure for Drop. Depending on Drop's architecture phase, Swan may handle:
- **Account creation** (virtual IBAN accounts for users)
- **Card issuance** (virtual/physical debit cards)
- **Payment processing** (domestic/international transfers)
- **Balance management** (wallet balances, not Open Banking)

**Impact:** If Swan is the primary BaaS provider, an outage affects ALL core banking operations.

---

## Symptoms

Users report critical failures:

- Cannot create new account
- Cannot view wallet balance (if using Swan wallets)
- Card payments fail or decline
- Error: "Banking service unavailable"
- Dashboard shows "System error" for account-related features

**User impact:** Complete inability to use banking features (depending on Drop's reliance on Swan).

---

## Diagnosis

### 1. Check Swan Status Page

**External status:**
```bash
# Swan official status page
open https://status.swan.io

# Check for:
# - Incident reported
# - Degraded performance
# - Scheduled maintenance
```

### 2. Test Swan API

**Health check:**
```bash
# GraphQL health query
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer " \
  -H "Content-Type: application/json" \
  -d '{"query": "{viewer{id}}"}' \
  -v

# Expected: HTTP 200, {"data": {"viewer": {"id": "..."}}}
# If 500/503: Swan API down
# If 401: Credential issue
# If timeout: Network or Swan connectivity issue
```
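
The status-code triage in the comments above can be captured as a small pure function, which is handy if the probe is later automated. This is a sketch under assumptions: the type and function names are hypothetical, and the mapping mirrors the causes listed later in this runbook.

```typescript
export type SwanProbeVerdict =
  | "healthy"
  | "swan_down"      // see Cause 1
  | "credentials"    // see Cause 2
  | "rate_limited"   // see Cause 3
  | "query_rejected" // see Cause 4
  | "network"        // see Cause 5
  | "unknown";

export interface SwanProbeResult {
  status?: number;            // HTTP status, absent when the request never completed
  timedOut?: boolean;         // true when the request timed out
  hasGraphQLErrors?: boolean; // true when the response body carries an "errors" array
}

export function classifySwanProbe(result: SwanProbeResult): SwanProbeVerdict {
  if (result.timedOut || result.status === undefined) return "network";
  if (result.status === 401 || result.status === 403) return "credentials";
  if (result.status === 429) return "rate_limited";
  if (result.status === 500 || result.status === 503) return "swan_down";
  if (result.status === 200) {
    // GraphQL servers can answer 200 and still report errors in the body.
    return result.hasGraphQLErrors ? "query_rejected" : "healthy";
  }
  return "unknown";
}
```

Checking the body for an `errors` array matters because an HTTP 200 alone does not prove the query succeeded.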

**Test account creation:**
```bash
# Attempt to create test account
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer " \
  -H "Content-Type: application/json" \
  -d '{
    "query": "mutation { createAccount(input: {name: \"Test Account\"}) { id } }"
  }' \
  -v

# Expected: HTTP 200 with account ID
# If error: Check response for Swan error codes
```

### 3. Check Drop Logs

```bash
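# CloudWatch Logs (production)
# --start-time below uses GNU date; on macOS use: $(date -u -v-15M +%s)000
# Filter patterns are case-sensitive; match the casing the logger emits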
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "swan" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "Swan API timeout"
# - "Swan GraphQL error: INTERNAL_SERVER_ERROR"
# - "Swan 503 Service Unavailable"
# - "Swan rate limit exceeded"
```

### 4. Check Swan API Credentials

```bash
# Verify Swan API key is valid
bw get item "Swan API" --session $BW_SESSION

# Check App Runner environment variables
aws apprunner describe-service \
  --service-arn  \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep SWAN

# Expected:
# SWAN_API_KEY: 
# SWAN_ENVIRONMENT: production (or sandbox)
# SWAN_PARTNER_ID: 
```

### 5. Check Recent Swan API Changes

**Review Swan changelog:**
```bash
# Swan may deprecate API endpoints or change schemas
# Check Swan developer portal for breaking changes
open https://docs.swan.io/changelog

# Review recent GraphQL schema changes
# Verify Drop uses supported API versions
```

---

## Common Causes & Solutions

### Cause 1: Swan Service Outage (External)

**Probability:** 5% (Swan is highly reliable, but incidents happen)

**Symptoms:**
- Swan status page reports incident
- All Swan API calls fail with 500/503
- No error in Drop code/config
- Social media mentions Swan issues

**Solution:**

1. **Verify outage scope:**
   - Check Swan status page
   - Test API from different networks (rule out local network issue)
   - Contact Swan support for ETA

2. **Communicate to users (Norwegian):**
   ```
   Emne: Bankfunksjoner midlertidig utilgjengelig

   Hei,

   Vår bankinfrastruktur-leverandør (Swan) opplever tekniske problemer.
   Dette påvirker:
   - Kontoopprettelse
   - Korttransaksjoner
   - Overføringer

   Vi overvåker situasjonen og forventer at tjenesten er tilbake innen [X minutter/timer].

   Mvh,
   Drop
   ```

3. **Enable degraded mode:**
   ```bash
   # Disable features that depend on Swan
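   # Env vars are set under --source-configuration ("..." = existing image settings)
   aws apprunner update-service --service-arn  \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={
       FEATURE_ACCOUNTS=disabled,
       FEATURE_CARDS=disabled,
       SWAN_MODE=degraded
     }}}"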

   # Show maintenance banner in app
   ```

4. **Monitor Swan status:**
   - Subscribe to Swan status updates (RSS/email)
   - Check every 10 minutes for resolution
   - Test API as soon as Swan reports "Resolved"

5. **Re-enable features when Swan is back:**
   ```bash
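   # Env vars are set under --source-configuration ("..." = existing image settings)
   aws apprunner update-service --service-arn  \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={
       FEATURE_ACCOUNTS=enabled,
       FEATURE_CARDS=enabled,
       SWAN_MODE=live
     }}}"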
   ```

**ETA:** Depends on Swan (typically <2 hours for major incidents)

---

### Cause 2: Invalid or Expired API Credentials

**Probability:** 15% (after credential rotation or Swan account changes)

**Symptoms:**
- Logs show: "401 Unauthorized" or "403 Forbidden"
- All Swan API requests fail immediately
- Swan API test returns authentication error

**Solution:**

1. **Verify Swan API credentials:**
   ```bash
   bw get item "Swan API" --session $BW_SESSION

   # Check:
   # - API key is not expired
   # - API key has correct permissions (accounts, cards, payments)
   # - Partner ID is correct
   ```

2. **Regenerate API key (if needed):**
   - Login to Swan Dashboard: https://dashboard.swan.io
   - Navigate to Settings → API Keys
   - Revoke old key, generate new key
   - Copy new key to Vaultwarden

3. **Update App Runner environment variables:**
   ```bash
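   # Env vars belong inside --source-configuration ("..." = existing image settings);
   # --instance-configuration only accepts Cpu, Memory, and InstanceRoleArn.
   aws apprunner update-service --service-arn  \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={
       SWAN_API_KEY=,
       SWAN_PARTNER_ID=
     }}}"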
   ```

4. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn  --region eu-west-1
   ```

5. **Test after deployment (3-5 min):**
   ```bash
   curl https://getdrop.no/api/accounts/create \
     -H "Authorization: Bearer " \
     -H "Content-Type: application/json" \
     -d '{"accountType": "personal"}' \
     -v

   # Expected: HTTP 200, account created
   ```

**ETA:** 10 minutes

---

### Cause 3: Swan API Rate Limiting

**Probability:** 10% (during high-traffic events or viral growth)

**Symptoms:**
- Logs show: HTTP 429 "Too Many Requests"
- Intermittent failures (some requests succeed, others fail)
- Rate limit headers in response

**Solution:**

1. **Check rate limit headers:**
   ```bash
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern "X-RateLimit" \
     --start-time $(date -u -d '10 minutes ago' +%s)000 \
     | jq -r '.events[].message' \
     | grep Swan
   ```

2. **Implement request queuing:**
   ```typescript
   // src/lib/swan-client.ts
   import PQueue from 'p-queue';

   const queue = new PQueue({
     concurrency: 5,     // Max 5 concurrent Swan requests
     interval: 1000,      // Per second
     intervalCap: 20      // Max 20 requests per second
   });

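   export async function swanGraphQL(query: string, variables?: Record<string, unknown>) {
     return queue.add(async () => {
       const response = await fetch('https://api.swan.io/graphql', {
         method: 'POST',
         headers: {
           'Authorization': `Bearer ${process.env.SWAN_API_KEY}`,
           'Content-Type': 'application/json',
         },
         body: JSON.stringify({ query, variables }),
       });

       // fetch() resolves on HTTP errors, so surface the status (e.g. 429) to callers
       if (!response.ok) {
         throw Object.assign(new Error(`Swan HTTP ${response.status}`), { status: response.status });
       }
       return response;
     });
   }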
   ```

3. **Exponential backoff on retry:**
   ```typescript
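   const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

   // Retries the operation up to 3 times, but only on HTTP 429
   async function retrySwan<T>(operation: () => Promise<T>, attempt = 1): Promise<T> {
     try {
       return await operation();
     } catch (error) {
       const status = (error as { status?: number } | undefined)?.status;
       if (status === 429 && attempt <= 3) {
         const delay = 1000 * Math.pow(2, attempt); // 2s, 4s, 8s
         await sleep(delay);
         return retrySwan(operation, attempt + 1);
       }
       throw error;
     }
   }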
   ```

4. **Contact Swan to increase rate limit:**
   - Email Swan support with traffic stats
   - Provide justification: user growth, peak times
   - Request higher API quota

**ETA:** 5 minutes (automatic retry), 1-2 days (if quota increase needed)

---

### Cause 4: Swan GraphQL Schema Change (Breaking)

**Probability:** 5% (Swan updates API, breaks Drop integration)

**Symptoms:**
- Logs show: "GraphQL validation error"
- Specific queries fail: "Field 'X' doesn't exist on type 'Y'"
- Swan API works for some operations, fails for others

**Solution:**

1. **Check Swan changelog:**
   ```bash
   # Review recent API changes
   open https://docs.swan.io/changelog

   # Look for:
   # - Deprecated fields
   # - Required fields added
   # - Type changes
   ```

2. **Identify breaking changes:**
   ```bash
   # Compare current Drop queries to Swan schema
   # Example: account creation query
   grep -r "createAccount" src/lib/swan-client.ts

   # Cross-reference with Swan GraphQL schema
   # https://api.swan.io/graphql (GraphQL Playground)
   ```

3. **Update Drop GraphQL queries:**
   ```graphql
   # Before (deprecated)
   mutation {
     createAccount(input: { name: "User Account" }) {
       id
       balance  # ❌ Deprecated field
     }
   }

   # After (updated)
   mutation {
     createAccount(input: { name: "User Account" }) {
       id
       balances {  # ✅ New field structure
         available
         currency
       }
     }
   }
   ```

4. **Test updated queries:**
   ```bash
   # Test in Swan GraphQL Playground first
   # Then deploy to staging
   # Verify all Swan-dependent features work
   ```

5. **Deploy fix:**
   ```bash
   git add src/lib/swan-client.ts
   git commit -m "Fix: Update Swan GraphQL queries to match latest schema"
   git push origin main

   # CI/CD triggers deployment
   ```

**ETA:** 30 minutes (if simple field change), 2 hours (if major refactor needed)

---

### Cause 5: Network or Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- Swan API requests never reach destination
- Works locally but fails in production

**Solution:**

1. **Check outbound connectivity:**
   ```bash
   # App Runner egress is unrestricted by default
   # If using VPC connector, check security group
   aws ec2 describe-security-groups \
     --group-ids  \
     --region eu-west-1 \
     | jq '.SecurityGroups[].IpPermissionsEgress'
   ```

2. **Test DNS resolution:**
   ```bash
   nslookup api.swan.io

   # Should resolve to Swan IPs
   # If NXDOMAIN: DNS issue
   ```

3. **Check AWS service health:**
   ```bash
   # Check App Runner service events
   aws apprunner list-operations \
     --service-arn  \
     --region eu-west-1 \
     | jq '.OperationSummaryList[0]'
   ```

4. **Whitelist Swan IPs (if strict firewall):**
   - Contact Swan for IP ranges
   - Add to security group outbound rules (port 443)

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)

---

### Cause 6: Swan Account Suspended or Payment Overdue

**Probability:** 2% (billing issue or compliance violation)

**Symptoms:**
- All Swan API calls fail with "Account suspended"
- Swan Dashboard shows billing alert
- Email from Swan about overdue payment or compliance issue

**Solution:**

1. **Check Swan Dashboard:**
   - Login: https://dashboard.swan.io
   - Look for alerts: billing, compliance, KYC

2. **Resolve billing issue:**
   - If overdue payment: pay immediately via Swan Dashboard
   - If billing method expired: update payment method
   - Contact Swan billing: billing@swan.io

3. **Resolve compliance issue:**
   - Swan requires KYC for partner accounts
   - Upload missing documents (company registration, director ID, etc.)
   - Respond to Swan compliance team ASAP

4. **Request urgent reactivation:**
   - Email Swan support: support@swan.io
   - Subject: "URGENT: Account reactivation needed - [Partner ID]"
   - Explain impact (users affected)
   - Provide evidence of issue resolution

**ETA:** 15 minutes (if billing), 24 hours (if compliance review needed)

---

## Emergency Workarounds

### Option 1: Degraded Mode (Disable Swan Features)

**Use case:** Swan down >30 minutes, no ETA, users need core app functionality

**Steps:**

1. Disable Swan-dependent features:
   ```bash
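   # Env vars are set under --source-configuration ("..." = existing image settings)
   aws apprunner update-service --service-arn  \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={
       FEATURE_ACCOUNTS=disabled,
       FEATURE_CARDS=disabled,
       FEATURE_SWAN_WALLETS=disabled
     }}}"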
   ```

2. Show banner in app:
   ```
   ⚠️ Noen funksjoner er midlertidig utilgjengelige
   Kontoopprettelse og korttransaksjoner er ikke tilgjengelig for øyeblikket.
   Andre funksjoner virker som normalt.
   ```

3. Allow core features to work:
   - BankID login: ✅ (not Swan-dependent)
   - Open Banking balance: ✅ (uses Neonomics, not Swan)
   - PISP payments: ✅ (uses Neonomics, not Swan)
   - Swan accounts: ❌ (disabled)

4. **Re-enable when Swan is back:**
   ```bash
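   # Env vars are set under --source-configuration ("..." = existing image settings)
   aws apprunner update-service --service-arn  \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={
       FEATURE_ACCOUNTS=enabled,
       FEATURE_CARDS=enabled,
       FEATURE_SWAN_WALLETS=enabled
     }}}"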
   ```

**Risk:** Users cannot create accounts or use cards during outage.
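
For the flags above to have any effect, each Swan-dependent route has to read them. A minimal sketch of such a check follows. The helper names are hypothetical, and treating an unset flag as enabled is an assumption chosen so that a missing variable does not silently switch features off.

```typescript
export type SwanFeatureFlag = "FEATURE_ACCOUNTS" | "FEATURE_CARDS" | "FEATURE_SWAN_WALLETS";

type Env = Record<string, string | undefined>;

/** A feature is on unless its flag is explicitly set to "disabled". */
export function isFeatureEnabled(flag: SwanFeatureFlag, env: Env): boolean {
  const value = (env[flag] ?? "enabled").trim().toLowerCase();
  return value !== "disabled";
}

/**
 * Returns null when the feature may proceed, or a 503 payload the route
 * can send back while the feature is switched off.
 */
export function guardFeature(
  flag: SwanFeatureFlag,
  env: Env,
): { status: number; body: { error: string } } | null {
  if (isFeatureEnabled(flag, env)) return null;
  return { status: 503, body: { error: `${flag} is temporarily disabled` } };
}
```

A route would call `guardFeature("FEATURE_CARDS", process.env)` first and return early when the result is not null, so the Swan client is never invoked during the outage.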

---

### Option 2: Queue Swan Operations for Later

**Use case:** Swan down, users need to create accounts but can wait

**Steps:**

1. Queue account creation requests:
   ```typescript
   // src/app/api/accounts/create/route.ts
   // Note: userId must come from the authenticated session (lookup not shown)
   export async function POST(request: Request) {
     const { accountType } = await request.json();

     try {
       const account = await swanClient.createAccount(accountType);
       return Response.json(account);
     } catch (error) {
       if ((error as { code?: string } | undefined)?.code === 'SWAN_UNAVAILABLE') {
         // Queue for later processing
         await db.insert('pending_accounts', {
           user_id: userId,
           account_type: accountType,
           status: 'queued',
           created_at: new Date(),
         });

         // 202 Accepted: the request is recorded but not yet completed
         return Response.json(
           { success: true, message: 'Account creation queued, will complete within 1 hour' },
           { status: 202 },
         );
       }
       throw error;
     }
   }

2. Process queue when Swan is back:
   ```bash
   # Run cron job to process pending accounts
   node ~/ALAI/products/Drop/scripts/process-pending-accounts.js
   ```

3. Notify users when account is ready:
   ```
   Din konto er klar!
   Takk for tålmodigheten. Du kan nå bruke alle funksjoner i Drop.
   ```

**Risk:** Delayed user experience. Users may expect instant account creation.

---

## Monitoring & Alerts

### Metrics to Track

- **Swan API success rate:** Should be >99%
- **Swan API latency:** p50 <500ms, p95 <2s, p99 <5s
- **Swan error rate by operation:** Track createAccount, issueCard, makePayment separately
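
The latency targets above are percentiles, which are easy to miscompute. The sketch below uses the nearest-rank method and checks a batch of samples against the three targets. The function names are hypothetical, and nearest-rank is one of several percentile definitions; a metrics backend may interpolate and give slightly different values.

```typescript
/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Targets from this runbook: p50 <500ms, p95 <2s, p99 <5s.
const LATENCY_TARGETS = [
  { label: "p50", p: 50, limitMs: 500 },
  { label: "p95", p: 95, limitMs: 2000 },
  { label: "p99", p: 99, limitMs: 5000 },
];

/** Returns the labels of every target the samples fail to meet. */
export function latencyBreaches(samplesMs: number[]): string[] {
  return LATENCY_TARGETS
    .filter((target) => percentile(samplesMs, target.p) >= target.limitMs)
    .map((target) => target.label);
}
```

Averages hide exactly the tail behaviour these targets are meant to catch: a batch where 10% of calls take six seconds can still have a healthy-looking median.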

### Alert Rules

```typescript
// src/lib/swan-monitor.ts
export async function trackSwanFailure(operation: string, error: any) {
  const failureRate = await calculateSwanFailureRate('last_5_minutes');

  if (failureRate > 0.05) { // 5% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'Swan API failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of Swan calls failing`,
      operation,
    });
  }
}
```

---

## Post-Incident Actions

1. **Process queued operations:**
   ```sql
   SELECT * FROM pending_accounts WHERE status = 'queued';
   -- Retry all pending account creations
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-swan-outage.md
   ```

3. **Review SLA with Swan:**
   - Check if outage violated SLA
   - Request compensation/credits
   - Discuss failover options

4. **Improve resilience:**
   - Add Swan health check (every 5 min)
   - Implement circuit breaker for Swan API
   - Consider multi-provider strategy (backup BaaS)
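
As a starting point for the circuit breaker mentioned above, here is a minimal sketch. It is not an existing Drop component: the class name, the thresholds, and the injectable clock are assumptions. After a run of consecutive failures it stops calling Swan for a cooldown period, then lets a single trial call through.

```typescript
type BreakerState = "closed" | "open" | "half_open";

export class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number = 5, cooldownMs: number = 30000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAtMs < this.cooldownMs) {
        throw new Error("circuit open: call skipped");
      }
      this.state = "half_open"; // cooldown elapsed: allow one trial call
    }
    try {
      const result = await operation();
      this.consecutiveFailures = 0;
      this.state = "closed";
      return result;
    } catch (error) {
      this.consecutiveFailures += 1;
      // A failed trial call, or too many failures in a row, opens the circuit.
      if (this.state === "half_open" || this.consecutiveFailures >= this.failureThreshold) {
        this.state = "open";
        this.openedAtMs = this.now();
      }
      throw error;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

Wrapping Swan calls this way means that during an outage most requests fail immediately instead of each waiting for a timeout, which also makes it easier to fall back to the queued or degraded paths described earlier.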

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 5 min | If Swan status page shows incident, notify Alem |
| 15 min | If not resolved, enable degraded mode |
| 30 min | Contact Swan support via phone if no ETA |
| 1 hour | Public communication to users |

---

## Contacts

- **Swan Support:** support@swan.io
- **Swan Phone:** +33 X XXXX XXXX (check Swan Dashboard for number)
- **Swan Status:** https://status.swan.io
- **Internal:** Alem (CEO, final decision on feature disabling)

---

## Related Documentation

- `docs/architecture/banking.md` — Swan BaaS integration
- `src/lib/swan-client.ts` — Swan GraphQL client
- `docs/compliance/swan-requirements.md` — Swan partner KYC/compliance
- Vaultwarden item: "Swan API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: Neonomics Outage

# Runbook: Neonomics Open Banking Outage

**Service:** Neonomics Open Banking Aggregator
**Severity:** CRITICAL (blocks AISP balance fetch and PISP payments)
**MTTR Target:** <20 minutes
**Owner:** John (AI Director)

---

## Overview

Neonomics is Drop's Open Banking aggregator for Norwegian banks. It provides:
- **AISP (Account Information):** Fetch user's bank account balance via PSD2 consent
- **PISP (Payment Initiation):** Initiate payments from user's bank account
- **Bank connectivity:** Single API to connect to all Norwegian banks (DNB, Nordea, SpareBank 1, etc.)

**Impact:** If Neonomics is down, Drop cannot:
- Show bank balances
- Initiate remittance payments
- Process QR payments

This is a **critical** outage affecting core functionality.

---

## Symptoms

Users report core features not working:

- Cannot see bank balance (shows "unavailable")
- Cannot initiate payments (error at payment step)
- Bank connection fails ("Cannot connect to bank")
- Error: "Open Banking service unavailable"

**User impact:** Cannot use core Drop features (balance, payments).

---

## Diagnosis

### 1. Check Neonomics Service Status

**External status:**
```bash
# Neonomics has no public status page
# Test via API health check
curl -X GET https://api.neonomics.io/health \
  -H "Authorization: Bearer " \
  -v

# Expected: HTTP 200
# If 500/503: Neonomics outage
# If timeout: Network or Neonomics connectivity issue
```

**Check specific bank connectivity:**
```bash
# List banks and their status
curl -X GET https://api.neonomics.io/banks \
  -H "Authorization: Bearer " \
  | jq '.[] | select(.country == "NO") | {name, status, lastChecked}'

# Look for:
# - "status": "degraded" or "offline"
# - Specific bank down (e.g., DNB) vs all banks
```

### 2. Check Drop Logs

```bash
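# CloudWatch Logs (production)
# --start-time below uses GNU date; on macOS use: $(date -u -v-15M +%s)000
# Filter patterns are case-sensitive; match the casing the logger emits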
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "neonomics" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "Neonomics API timeout"
# - "Neonomics 503 Service Unavailable"
# - "Bank API unavailable: DNB"
# - "Payment initiation failed: NEONOMICS_TIMEOUT"
```

### 3. Determine Scope of Outage

**Is it all banks or specific banks?**
```bash
# Count recent failures by bank
# (filter terms are matched literally and ANDed; ".*" is not treated as a regex)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Neonomics failed" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -o '"bank":"[^"]*"' \
  | sort | uniq -c | sort -rn

# Example output:
# 45 "bank":"DNB"        ← DNB-specific issue
# 2 "bank":"Nordea"      ← Nordea working mostly
# 1 "bank":"SpareBank1"  ← SpareBank1 working
```
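
The per-bank counts above feed a simple decision: one bank or everything? The sketch below encodes that decision. The function name and both thresholds (minimum failure count, and the 80% share that counts as one bank dominating) are assumptions to tune against real traffic.

```typescript
export interface OutageScope {
  scope: "none" | "bank_specific" | "widespread";
  bank?: string; // set only when scope is "bank_specific"
}

export function classifyOutageScope(
  failuresByBank: Record<string, number>,
  minFailures: number = 10,
  dominanceRatio: number = 0.8,
): OutageScope {
  const entries = Object.entries(failuresByBank);
  const total = entries.reduce((sum, [, count]) => sum + count, 0);

  // Too few failures to call it an incident.
  if (entries.length === 0 || total < minFailures) return { scope: "none" };

  // Find the bank with the most failures.
  const [topBank, topCount] = entries.reduce((best, entry) => (entry[1] > best[1] ? entry : best));

  return topCount / total >= dominanceRatio
    ? { scope: "bank_specific", bank: topBank }
    : { scope: "widespread" };
}
```

With the example output above (45 DNB, 2 Nordea, 1 SpareBank1), DNB accounts for about 94% of failures, so the result points at a DNB-specific problem rather than a full Neonomics outage.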

**Is it AISP, PISP, or both?**
```bash
# Check failure type
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Neonomics" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -oE '"service":"(aisp|pisp)"' \
  | sort | uniq -c

# Example:
# 30 "service":"aisp"  ← AISP failing
# 45 "service":"pisp"  ← PISP failing
# If both high: full Neonomics outage
```

### 4. Test AISP and PISP Flows

**Test AISP (balance fetch):**
```bash
# Staging environment
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

curl -X GET https://drop-staging.fly.dev/api/accounts/balance \
  -H "Authorization: Bearer $TOKEN" \
  -v

# Expected: HTTP 200, balance data
# If 500: AISP broken
```

**Test PISP (payment initiation):**
```bash
curl -X POST https://drop-staging.fly.dev/api/transactions/remittance \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientId": "rec_test123",
    "amount": 100,
    "currency": "NOK"
  }' \
  -v

# Expected: HTTP 200, payment initiated
# If 500: PISP broken
```

### 5. Check Neonomics API Credentials

```bash
# Verify API key is valid
bw get item "Neonomics API" --session $BW_SESSION

# Check App Runner environment variables
aws apprunner describe-service \
  --service-arn  \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep NEONOMICS

# Expected:
# NEONOMICS_API_KEY: 
# NEONOMICS_ENVIRONMENT: production
```

---

## Common Causes & Solutions

### Cause 1: Neonomics Full Outage (All Banks)

**Probability:** 10% (rare but critical)

**Symptoms:**
- ALL banks fail (DNB, Nordea, SpareBank 1, etc.)
- All AISP and PISP requests timeout or return 503
- Neonomics API health check fails

**Solution:**

1. **Verify full outage:**
   ```bash
   # Test multiple endpoints
   curl -X GET https://api.neonomics.io/health -v
   curl -X GET https://api.neonomics.io/banks -H "Authorization: Bearer " -v

   # If both fail: confirmed full outage
   ```

2. **Contact Neonomics support URGENTLY:**
   - Email: support@neonomics.io
   - Slack: #neonomics-support (if available)
   - Phone: +47 XXXX XXXX (check Neonomics Dashboard)

3. **Communicate to users (Norwegian):**
   ```
   Emne: Betalingstjenester midlertidig utilgjengelige

   Hei,

   Vi opplever for øyeblikket tekniske problemer med vår betalingsleverandør.
   Dette påvirker:
   - Visning av saldo
   - Nye betalinger

   Vi jobber med å gjenopprette tjenesten så raskt som mulig.
   Estimert løsning: [X minutter/timer]

   Mvh,
   Drop
   ```

4. **Enable degraded mode:**
   ```bash
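   # Show cached balances, disable new payments.
   # App Runner sets env vars via --source-configuration, not --instance-configuration.
   # Assumption: the supplied map replaces the current one, so also list the
   # variables the service already has. Copy the image fields from describe-service.
   aws apprunner update-service --service-arn <service-arn> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-identifier>",
       "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "cached",
         "PISP_MODE": "disabled",
         "NEONOMICS_FALLBACK": "true"
       }}}}'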
   ```

5. **Show maintenance banner in app:**
   ```
   ⚠️ Betalinger midlertidig utilgjengelig
   Vi opplever tekniske problemer. Saldo vises med forsinkelse.
   Betalinger er deaktivert midlertidig.
   ```

6. **Monitor Neonomics status:**
   - Check API health every 5 minutes
   - When API returns 200: test AISP/PISP flows
   - Re-enable features gradually

7. **Re-enable live mode when resolved:**
   ```bash
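   # Env vars go in --source-configuration (not --instance-configuration).
   # Assumption: the map replaces the current one, so include the other variables too.
   aws apprunner update-service --service-arn <service-arn> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-identifier>",
       "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "live",
         "PISP_MODE": "live",
         "NEONOMICS_FALLBACK": "false"
       }}}}'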
   ```

**ETA:** Depends on Neonomics (typically <2 hours for major incidents)
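
Steps 6 and 7 leave the moment of re-enabling to judgment. One way to make it less ad hoc is to require a short streak of fully passing checks first. The sketch below assumes a hypothetical result shape and a default of three consecutive passes; neither is defined anywhere in the runbook.

```typescript
// Hypothetical sketch: the check shape and the default of 3 are assumptions.
interface RecoveryCheck {
  healthOk: boolean; // API health endpoint returned 200
  aispOk: boolean;   // test balance fetch succeeded
  pispOk: boolean;   // test payment initiation succeeded
}

function readyToReenable(
  checks: RecoveryCheck[],
  requiredConsecutive = 3,
): boolean {
  if (checks.length < requiredConsecutive) return false;
  // Only the most recent checks count, so a single failure resets the streak.
  return checks
    .slice(-requiredConsecutive)
    .every((c) => c.healthOk && c.aispOk && c.pispOk);
}
```

With checks every 5 minutes, three passes means roughly 10 to 15 minutes of stable service before switching back to live mode.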

---

### Cause 2: Specific Bank API Down

**Probability:** 25% (one bank's API temporarily unavailable)

**Symptoms:**
- Only users of specific bank (e.g., DNB) affected
- Other banks work fine (Nordea, SpareBank 1)
- Logs show: "Bank API timeout: DNB"

**Common reasons:**
- Bank's API maintenance (often 02:00-06:00 CET)
- Bank's API outage
- Bank rate limiting Neonomics
- Bank API certificate expired

**Solution:**

1. **Identify affected bank:**
   ```bash
   # Count failures by bank
   # (GNU date shown; on macOS/BSD use: date -u -v-30M +%s)
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern '"Bank API"' \
     --start-time "$(date -u -d '30 minutes ago' +%s)000" \
     | jq -r '.events[].message' \
     | grep -o '"bank":"[^"]*"' \
     | sort | uniq -c | sort -rn
   ```

2. **Check bank status:**
   - **DNB:** https://www.dnb.no/drift
   - **Nordea:** https://www.nordea.no/info/driftsmeldinger
   - **SpareBank 1:** https://www.sparebank1.no/driftsmeldinger
   - Norwegian banks often announce maintenance

3. **Contact Neonomics to verify:**
   - Neonomics may already know about bank API issues
   - Ask for ETA on bank connectivity restoration

4. **Notify affected users (bank-specific):**
   ```sql
   -- Find users with affected bank
   SELECT user_id, email
   FROM bank_accounts
   JOIN users ON users.id = bank_accounts.user_id
   WHERE bank_name = 'DNB';
   ```

   Email (Norwegian):
   ```
   Emne: Problemer med [Bank] tilkobling

   Hei,

   Vi opplever for øyeblikket problemer med tilkoblingen til [Bank].
   Dette skyldes tekniske problemer hos banken.

   Andre banker virker som normalt.
   Hvis du har konto i en annen bank, kan du bruke den i mellomtiden.

   Estimert løsning: [X minutter/timer]

   Mvh,
   Drop
   ```

5. **Graceful degradation (bank-specific):**
   ```typescript
   // src/lib/neonomics-client.ts
   async function fetchBalance(userId: string, bankId: string) {
     try {
       return await neonomicsAPI.getBalance(userId, bankId);
     } catch (error: any) {
       if (error.code === 'BANK_API_TIMEOUT' && error.bank === 'DNB') {
         // Return cached balance for DNB users
         const cached = await getCachedBalance(userId);
         return {
           // `??` keeps a legitimate zero balance; `||` would turn 0 into null
           balance: cached?.balance ?? null,
           currency: 'NOK',
           lastUpdated: cached?.timestamp,
           warning: 'DNB opplever tekniske problemer. Saldo kan være utdatert.'
         };
       }
       throw error;
     }
   }
   ```

**ETA:** Depends on bank (typically <2 hours for maintenance, <4 hours for incidents)
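
The fallback in step 5 hard-codes DNB. If a different bank is affected, the same decision can be driven by a set of currently degraded banks. This is only a sketch: the error shape mirrors the snippet in step 5, while the set itself and both function names are assumed, not existing Drop settings.

```typescript
// Hypothetical sketch: the set of degraded banks is an assumed configuration value.
interface BankApiError {
  code: string;
  bank?: string;
}

function shouldServeCachedBalance(
  error: BankApiError,
  degradedBanks: ReadonlySet<string>,
): boolean {
  return (
    error.code === "BANK_API_TIMEOUT" &&
    error.bank !== undefined &&
    degradedBanks.has(error.bank)
  );
}

function staleBalanceWarning(bank: string): string {
  // Same Norwegian wording as the DNB example, with the bank name filled in.
  return `${bank} opplever tekniske problemer. Saldo kan være utdatert.`;
}
```

During an incident, only the set needs to change (for example `new Set(["Nordea"])`), not the catch block.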

---

### Cause 3: Neonomics API Rate Limiting

**Probability:** 15% (during peak hours or viral growth)

**Symptoms:**
- Logs show: HTTP 429 "Too Many Requests"
- Intermittent failures (some requests succeed, others fail)
- Rate limit headers in logs

**Solution:**

1. **Check rate limit headers:**
   ```bash
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern '"X-RateLimit"' \
     --start-time $(date -u -d '10 minutes ago' +%s)000 \
     | jq -r '.events[].message' \
     | grep -E "X-RateLimit-(Limit|Remaining|Reset)"
   ```

2. **Implement request throttling:**
   ```typescript
   // src/lib/neonomics-client.ts
   import PQueue from 'p-queue';

   const queue = new PQueue({
     concurrency: 10,      // Max 10 concurrent requests
     interval: 1000,        // Per second
     intervalCap: 50        // Max 50 requests per second
   });

   export async function callNeonomics(endpoint: string, options: any) {
     return queue.add(() =>
       fetch(`https://api.neonomics.io${endpoint}`, {
         ...options,
         headers: {
           'Authorization': `Bearer ${process.env.NEONOMICS_API_KEY}`,
           ...options.headers,
         },
       })
     );
   }
   ```

3. **Aggressive caching during rate limit:**
   ```typescript
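   // Fresh copy lives 1 minute; a stale copy lives 5 minutes as a rate-limit fallback.
   // (Previously the 429 branch could never find a cached value, because any
   // cached value had already been returned before the API call.)
   const CACHE_TTL_NORMAL = 60;      // 1 minute
   const CACHE_TTL_RATE_LIMIT = 300; // 5 minutes

   async function getBalanceWithCache(userId: string) {
     const fresh = await redis.get(`balance:${userId}`);
     if (fresh) return JSON.parse(fresh);

     try {
       const balance = await neonomicsAPI.getBalance(userId);
       const payload = JSON.stringify(balance);
       await redis.setex(`balance:${userId}`, CACHE_TTL_NORMAL, payload);
       // `balance:stale:` is a hypothetical key name for the longer-lived copy
       await redis.setex(`balance:stale:${userId}`, CACHE_TTL_RATE_LIMIT, payload);
       return balance;
     } catch (error: any) {
       if (error.status === 429) {
         // Rate limited: serve the stale copy if one still exists
         const stale = await redis.get(`balance:stale:${userId}`);
         if (stale) return JSON.parse(stale);
       }
       throw error;
     }
   }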
   ```

4. **Contact Neonomics to increase rate limit:**
   - Email: support@neonomics.io
   - Provide traffic stats (requests/day, peak times)
   - Request higher API quota

**ETA:** 5 minutes (automatic throttling), 1-2 days (if quota increase needed)
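
Throttling limits how fast requests go out, but a request that still gets a 429 needs a retry delay. The sketch below prefers a `Retry-After` value in seconds and otherwise falls back to capped exponential backoff. Whether Neonomics actually sends `Retry-After` is an assumption, and the base delay and cap are illustrative values.

```typescript
// Hypothetical helper: base delay and cap are assumed values.
// Handles only the numeric (seconds) form of Retry-After, not the HTTP-date form.
function retryDelayMs(
  attempt: number, // 0 for the first retry
  retryAfterHeader: string | null,
  baseMs = 500,
  capMs = 30000,
): number {
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, capMs);
    }
  }
  // Exponential backoff: base * 2^attempt, never above the cap.
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

A caller would wait `retryDelayMs(attempt, response.headers.get("Retry-After"))` milliseconds before re-queueing the request.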

---

### Cause 4: Invalid or Expired API Credentials

**Probability:** 5% (after credential rotation or account issue)

**Symptoms:**
- Logs show: "401 Unauthorized" or "403 Forbidden"
- All Neonomics API calls fail immediately
- API health check returns 401

**Solution:**

1. **Verify Neonomics API credentials:**
   ```bash
   bw get item "Neonomics API" --session $BW_SESSION

   # Check:
   # - API key is correct
   # - Not expired
   # - Correct environment (production vs sandbox)
   ```

2. **Regenerate API key (if needed):**
   - Login to Neonomics Dashboard (if available)
   - Navigate to Settings → API Keys
   - Generate new API key
   - Copy to Vaultwarden

3. **Update App Runner environment variables:**
   ```bash
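   # Env vars are set via --source-configuration, not --instance-configuration.
   # Assumption: the supplied map replaces the current one, so list every
   # variable the service needs. Copy the image fields from describe-service.
   aws apprunner update-service --service-arn <service-arn> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-identifier>",
       "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "NEONOMICS_API_KEY": "<new-api-key>",
         "NEONOMICS_ENVIRONMENT": "production"
       }}}}'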
   ```

4. **Trigger deployment:**
   ```bash
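   # update-service already rolls out the new configuration. Run this only to
   # redeploy the image, once the service status is back to RUNNING.
   aws apprunner start-deployment --service-arn <service-arn> --region eu-west-1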
   ```

5. **Test after deployment:**
   ```bash
   curl -X GET https://getdrop.no/api/accounts/balance \
     -H "Authorization: Bearer <token>" \
     -v

   # Expected: HTTP 200, balance data
   ```

**ETA:** 10 minutes

---

### Cause 5: PSD2 Consent Expired (AISP Only)

**Probability:** 20% (affects AISP, not PISP)

**Symptoms:**
- Only AISP (balance fetch) fails
- PISP (payments) still works
- Logs show: "CONSENT_EXPIRED" or "CONSENT_INVALID"
- Specific users affected (not all)

**Note:** This is actually a user-level issue, not a Neonomics outage. See `aisp-balance-failure.md` runbook for full details.

**Quick solution:**

1. **Identify users with expired consent:**
   ```sql
   SELECT user_id, email, bank_name, consent_expires_at
   FROM bank_accounts
   JOIN users ON users.id = bank_accounts.user_id
   WHERE consent_expires_at < CURRENT_TIMESTAMP; -- portable across PostgreSQL and SQLite
   ```

2. **Notify users to re-authorize (Norwegian):**
   ```
   Push notification:
   Banktilkobling utløpt — Trykk her for å fornye
   ```

3. **User re-authorizes via BankID + bank consent flow**

**ETA:** Immediate (user action required)
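
The SQL above only finds consents that have already expired. A small classifier can also flag consents that are about to expire, so the renewal push goes out before balance fetches start failing. This is a sketch under assumptions: the status labels and the 7-day warning window are illustrative, not part of the existing consent flow.

```typescript
// Hypothetical helper: the labels and the 7-day window are assumptions.
type ConsentStatus = "expired" | "expiring_soon" | "valid";

const DAY_MS = 24 * 60 * 60 * 1000;

function consentStatus(expiresAt: Date, now: Date, warnDays = 7): ConsentStatus {
  const msLeft = expiresAt.getTime() - now.getTime();
  if (msLeft < 0) return "expired"; // same condition as consent_expires_at < now in the query
  if (msLeft <= warnDays * DAY_MS) return "expiring_soon";
  return "valid";
}
```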

---

### Cause 6: Network or Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- Neonomics API requests never reach destination
- Works locally but fails in production

**Solution:**

1. **Check outbound connectivity:**
   ```bash
   # App Runner egress is unrestricted by default
   # If using VPC connector, check security group
   aws ec2 describe-security-groups \
     --group-ids <security-group-id> \
     --region eu-west-1 \
     | jq '.SecurityGroups[].IpPermissionsEgress'
   ```

2. **Test DNS resolution:**
   ```bash
   nslookup api.neonomics.io

   # Should resolve to Neonomics IPs
   # If NXDOMAIN: DNS issue
   ```

3. **Check AWS service health:**
   ```bash
   # Check App Runner service events
   aws apprunner list-operations \
     --service-arn <service-arn> \
     --region eu-west-1
   ```

4. **Whitelist Neonomics IPs (if using strict firewall):**
   - Contact Neonomics for IP ranges
   - Add to security group outbound rules (port 443)

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)
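
The symptoms above map fairly directly onto Node.js error codes, which helps pick the right step quickly. The sketch below covers a few well-known codes; the category labels are assumptions. With Node's built-in `fetch`, the code is usually found on `error.cause` rather than on the error itself, so check both.

```typescript
// Hypothetical helper: category labels are illustrative.
type NetworkDiagnosis = "dns" | "refused" | "timeout" | "unknown";

function diagnoseNetworkError(code: string | undefined): NetworkDiagnosis {
  switch (code) {
    case "ENOTFOUND": // hostname did not resolve
    case "EAI_AGAIN": // temporary DNS failure
      return "dns"; // go to "Test DNS resolution"
    case "ECONNREFUSED":
      return "refused"; // destination reachable but rejecting connections
    case "ETIMEDOUT":
    case "UND_ERR_CONNECT_TIMEOUT": // connect timeout reported by Node's fetch (undici)
      return "timeout"; // go to "Check outbound connectivity"
    default:
      return "unknown";
  }
}
```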

---

## Emergency Workarounds

### Option 1: Cached Balance + Disable Payments

**Use case:** Neonomics down >30 minutes, no ETA

**Steps:**

1. Enable cached balance mode:
   ```bash
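   # Env vars go in --source-configuration (not --instance-configuration).
   # Assumption: the map replaces the current one, so include the other variables too.
   aws apprunner update-service --service-arn <service-arn> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-identifier>",
       "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "cached",
         "AISP_CACHE_TTL": "3600",
         "PISP_MODE": "disabled"
       }}}}'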
   ```

2. Show warning banner in app:
   ```
   ⚠️ Betalinger midlertidig utilgjengelige
   Saldo vises med forsinkelse (opptil 1 time).
   Nye betalinger er deaktivert til tjenesten er tilbake.
   ```

3. Allow read-only features:
   - Users can see cached balance
   - Users can see transaction history
   - Cannot initiate new payments

4. **Re-enable when Neonomics is back:**
   ```bash
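   # Env vars go in --source-configuration (not --instance-configuration).
   # Assumption: the map replaces the current one, so include the other variables too.
   aws apprunner update-service --service-arn <service-arn> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-identifier>",
       "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "live",
         "PISP_MODE": "live"
       }}}}'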
   ```

**Risk:** Stale balance data. Users may believe they have more or less money than they actually do.
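
How the application might interpret `AISP_MODE` and `PISP_MODE` is sketched below. The variable names come from the commands above; the function, the result shape, and the default of `live` when a variable is unset are assumptions about behaviour the runbook does not spell out.

```typescript
// Hypothetical sketch: the default of "live" for unset variables is an assumption.
interface FeatureAvailability {
  balanceMayBeStale: boolean; // balance is served from cache
  allowNewPayments: boolean;  // payment initiation is enabled
  showBanner: boolean;        // warning banner should be visible
}

function resolveAvailability(
  env: Record<string, string | undefined>,
): FeatureAvailability {
  const aispMode = env["AISP_MODE"] ?? "live";
  const pispMode = env["PISP_MODE"] ?? "live";
  const balanceMayBeStale = aispMode === "cached";
  const allowNewPayments = pispMode === "live";
  return {
    balanceMayBeStale,
    allowNewPayments,
    // Show the banner whenever anything is degraded.
    showBanner: balanceMayBeStale || !allowNewPayments,
  };
}
```

Deriving the banner from the same flags keeps the UI warning from drifting out of sync with the actual mode.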

---

### Option 2: Queue Payments for Later Processing

**Use case:** PISP down, users need to make urgent payments

**Steps:**

1. Queue payment requests:
   ```typescript
   // src/app/api/transactions/remittance/route.ts
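   export async function POST(request: Request) {
     const paymentData = await request.json();
     // Hypothetical helper: resolve the user the same way the route already authenticates
     const userId = await getAuthenticatedUserId(request);

     try {
       const result = await neonomicsAPI.initiatePayment(paymentData);
       return Response.json(result);
     } catch (error: any) {
       if (error.code === 'NEONOMICS_UNAVAILABLE') {
         // Queue for later
         await db.insert('pending_payments', {
           user_id: userId,
           payment_data: paymentData,
           status: 'queued',
           created_at: new Date(),
         });

         // 202 Accepted: the payment is queued, not yet executed
         return Response.json(
           {
             success: true,
             message: 'Betaling satt i kø. Vil bli behandlet innen 2 timer.',
           },
           { status: 202 },
         );
       }
       throw error;
     }
   }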
   ```

2. Process queue when Neonomics is back:
   ```bash
   node ~/ALAI/products/Drop/scripts/process-pending-payments.js
   ```

3. Notify users when payment completes:
   ```
   Din betaling er behandlet
   Betalingen på [amount] til [recipient] er fullført.
   ```

**Risk:** Delayed payments. User may expect instant transfer.
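
The queued-payment message promises processing within 2 hours. One possible policy, sketched here as an assumption rather than existing behaviour, is for the queue processor to retry only payments younger than that limit and to expire older ones so the user can confirm again. The function and constant names are hypothetical.

```typescript
// Hypothetical policy sketch: expiring old entries is an assumption, not current behaviour.
type QueueDecision = "retry" | "expire";

// Matches the "innen 2 timer" promise in the user-facing message.
const MAX_QUEUE_AGE_MS = 2 * 60 * 60 * 1000;

function decideQueuedPayment(
  createdAt: Date,
  now: Date,
  maxAgeMs = MAX_QUEUE_AGE_MS,
): QueueDecision {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs <= maxAgeMs ? "retry" : "expire";
}
```

An expired entry would be marked as such in `pending_payments` and the user notified, instead of executing a transfer they may no longer expect.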

---

## Monitoring & Alerts

### Metrics to Track

- **Neonomics API success rate:** Should be >99%
- **Neonomics API latency:** p50 <2s, p95 <5s, p99 <10s
- **Bank-specific failure rate:** Track DNB, Nordea, SpareBank 1 separately

### Alert Rules

```typescript
// src/lib/neonomics-monitor.ts
export async function trackNeonomicsFailure(service: 'aisp' | 'pisp', error: any) {
  const failureRate = await calculateFailureRate('neonomics', 'last_5_minutes');

  if (failureRate > 0.1) { // 10% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'Neonomics API failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of Neonomics calls failing`,
      service,
    });
  }
}
```
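
The alert rule relies on `calculateFailureRate`, which is not shown. A minimal in-memory version could be a sliding window over recent call outcomes, as sketched below. The class name and design are assumptions; a production version would more likely read from metrics storage shared across instances.

```typescript
// Hypothetical sketch of a sliding-window failure rate; not the actual implementation.
class FailureRateWindow {
  private samples: { at: number; ok: boolean }[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Record one call outcome with its timestamp in milliseconds.
  record(ok: boolean, at: number): void {
    this.samples.push({ at, ok });
  }

  // Fraction of failed calls within the window ending at `now` (0 when no samples).
  rate(now: number): number {
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
    if (this.samples.length === 0) return 0;
    const failed = this.samples.filter((s) => !s.ok).length;
    return failed / this.samples.length;
  }
}
```

Note that the rule as written alerts on every failure while the rate stays above 10%, so some de-duplication of alerts is worth considering.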

---

## Post-Incident Actions

1. **Process queued operations:**
   ```sql
   SELECT * FROM pending_payments WHERE status = 'queued';
   -- Retry all pending payments
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-neonomics-outage.md
   ```

3. **Review SLA with Neonomics:**
   - Check if outage violated SLA
   - Request compensation/credits
   - Discuss redundancy options

4. **Improve resilience:**
   - Add Neonomics health check (synthetic test every 5 min)
   - Implement circuit breaker for Neonomics API
   - Consider multi-provider strategy (backup Open Banking aggregator)
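
The circuit breaker mentioned above could look roughly like this. It is a simplified sketch with an injected clock so it can be tested: the thresholds are assumed values, and the half-open state lets all requests through rather than a single probe.

```typescript
// Hypothetical sketch: thresholds are assumptions; half-open is simplified.
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold = 5, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the cooldown, allow trial requests again.
    return now - this.openedAt >= this.cooldownMs ? "half_open" : "open";
  }

  canRequest(now: number): boolean {
    return this.state(now) !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the circuit
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // Opens on reaching the threshold; a failure while half-open re-opens immediately.
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}
```

Wrapping each Neonomics call in `canRequest` / `recordSuccess` / `recordFailure` lets the app fall back to cached mode on its own instead of waiting for a manual environment change.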

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 10 min | If full Neonomics outage confirmed, notify Alem |
| 20 min | If not resolved, enable degraded mode (cached balance, disable payments) |
| 30 min | Contact Neonomics support via phone if no response |
| 1 hour | Public communication to all users |
| 2 hours | Assess alternative Open Banking providers (emergency only) |

---

## Contacts

- **Neonomics Support:** support@neonomics.io
- **Neonomics Slack:** #neonomics-support (if available)
- **Neonomics Phone:** +47 XXXX XXXX (check Neonomics Dashboard)
- **Internal:** Alem (CEO, final decision on fallback modes)

---

## Related Documentation

- `docs/architecture/open-banking.md` — Neonomics AISP/PISP flow
- `src/lib/neonomics-client.ts` — Neonomics API client
- `docs/compliance/psd2-requirements.md` — PSD2 regulatory requirements
- `support/runbooks/aisp-balance-failure.md` — AISP-specific failures
- `support/runbooks/pisp-payment-failure.md` — PISP-specific failures
- Vaultwarden item: "Neonomics API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Infrastructure & Internal Services

Complete runbooks for all ALAI internal services: Docker containers, LaunchAgent daemons, Cloudflare tunnel, Vaultwarden, email system, bots, and more.

# ALAI Infrastructure — Service Catalog & Runbooks


> **Last updated:** 2026-03-11 | **Maintained by:** John (AI Director)
> **Host:** Mac Studio M3 Ultra (ANVIL) | **OS:** macOS
> **Quick health:** `node ~/system/tools/daemon-health.js`

---

## 🐳 Docker Services (23 containers)

### Core Platform Services

| Service | Image | Port | External URL | Health | Restart |
|---------|-------|------|--------------|--------|---------|
| **Vaultwarden** | vaultwarden/server | :8200 | vault.basicconsulting.no | ✅ healthy | `cd ~/system/services/vaultwarden && docker compose restart` |
| **BookStack** | linuxserver/bookstack | :6875 | docs.basicconsulting.no | ✅ running | `cd ~/system/services/bookstack && docker compose restart` |
| **BookStack DB** | linuxserver/mariadb | :3306 (internal) | — | ✅ running | Restarts with BookStack |
| **Planka** | plankanban/planka | :3100 | boards.basicconsulting.no | ✅ healthy | `cd ~/system/services/planka && docker compose restart` |
| **Planka DB** | postgres:15-alpine | internal | — | ✅ healthy | Restarts with Planka |
| **Documenso** | documenso/documenso | :3003 | sign.basicconsulting.no | ✅ running | `cd ~/system/services/documenso && docker compose restart` |
| **Documenso DB** | postgres:15-alpine | internal | — | ✅ healthy | Restarts with Documenso |
| **Documenso MinIO** | minio/minio | :9002/:9003 | — | ✅ running | Restarts with Documenso |
| **Baikal (CalDAV)** | ckulka/baikal:nginx | :5232 | calendar.basicconsulting.no | ✅ running | `cd ~/system/services/baikal && docker compose restart` |
| **Qdrant (Vector DB)** | qdrant/qdrant | :6333/:6334 | — | ✅ running | `docker restart qdrant` |

### Product Database Services

| Service | Port | Product | Health | Restart |
|---------|------|---------|--------|---------|
| **drop-postgres** | :5433 | Drop | ✅ healthy | `cd ~/ALAI/products/Drop && docker compose restart drop-postgres` |
| **plock-db** | :5434 | Plock | ✅ healthy | `cd ~/ALAI/products/Plock && docker compose restart plock-db` |
| **plock-redis** | :6380 | Plock | ✅ healthy | Restarts with plock-db |
| **bilko-postgres** | :5436 | Bilko | ✅ running | `cd ~/ALAI/products/Bilko && docker compose restart bilko-postgres` |
| **bilko-redis** | :6382 | Bilko | ✅ running | Restarts with bilko |
| **lobby-postgres** | :5437 | Lobby | ✅ healthy | `cd ~/ALAI/products/Lobby && docker compose restart lobby-postgres` |
| **lumiscare-postgres** | :5432 | LumisCare | ✅ healthy | Client project |
| **lumiscare-redis** | :6379 | LumisCare | ✅ healthy | Client project |
| **backend-postgres** | :5435 | BasicFakta | ✅ healthy | `cd ~/ALAI/products/BasicFakta && docker compose restart` |
| **backend-redis** | :6381 | BasicFakta | ✅ healthy | Restarts with backend |

### Monitoring Stack (Drop)

| Service | Port | URL | Restart |
|---------|------|-----|---------|
| **Grafana** | :3300 | grafana.basicconsulting.no | `docker restart drop-grafana` |
| **Prometheus** | :9090 | prometheus.basicconsulting.no | `docker restart drop-prometheus` |
| **Node Exporter** | :9100 | — | `docker restart drop-node-exporter` |

---

## ☁️ Cloudflare Tunnel (cloudflared)

**LaunchAgent:** `com.john.cloudflared`
**Config:** `~/.cloudflared/config.yml`
**Tunnel ID:** `3315a609-7934-45c5-ad0c-56d86d16374d`

### Exposed Services

| Hostname | Backend | Purpose |
|----------|---------|---------|
| docs.basicconsulting.no | localhost:6875 | BookStack wiki |
| vault.basicconsulting.no | localhost:8200 | Vaultwarden |
| sign.basicconsulting.no | localhost:3003 | Documenso (e-signing) |
| boards.basicconsulting.no | localhost:3100 | Planka (kanban) |
| calendar.basicconsulting.no | localhost:5232 | Baikal (CalDAV) |
| mc.basicconsulting.no | localhost:3030 | MC Dashboard |
| api.basicconsulting.no | localhost:3001 | API gateway |
| drop-api.basicconsulting.no | localhost:3201 | Drop API |
| lobby.basicconsulting.no | localhost:3010 | Lobby frontend |
| lobby-api.basicconsulting.no | localhost:3009 | Lobby API |
| auth.basicconsulting.no | localhost:9000 | Authentik (SSO) |
| grafana.basicconsulting.no | localhost:3300 | Grafana dashboards |
| prometheus.basicconsulting.no | localhost:9090 | Prometheus metrics |
| track.basicconsulting.no | localhost:3456 | Email tracking pixel |
| ssh.basicconsulting.no | localhost:22 | SSH access |
| vnc.basicconsulting.no | localhost:5900 | VNC screen sharing |

### Runbook: Tunnel down

```bash
# Check status
launchctl list | grep cloudflared

# Restart
launchctl stop com.john.cloudflared
launchctl start com.john.cloudflared

# Verify
cloudflared tunnel info 3315a609-7934-45c5-ad0c-56d86d16374d

# Logs
tail -50 ~/system/logs/cloudflared.log
```

---

## 🔐 Vaultwarden

**Container:** vaultwarden | **Port:** :8200
**URL:** vault.basicconsulting.no (Cloudflare Access protected)
**Local:** http://localhost:8200 | **HTTPS proxy:** https://localhost:8443 (Caddy)
**Admin token:** In `~/system/services/vaultwarden/.env`

### Dependencies
- Docker
- Caddy HTTPS proxy (`com.john.caddy-vault`) — needed for `bw` CLI
- vault-keeper daemon (`com.john.vault-keeper`) — auto-unlock

### Runbook: Vault locked/unauthenticated

```bash
# Check status
NODE_TLS_REJECT_UNAUTHORIZED=0 bw status

# If "locked" — vault-keeper auto-fixes every 15 min. Manual:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw unlock --raw > /tmp/bw-session

# If "unauthenticated" — needs full re-login:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw login --apikey
# Enter client_id and client_secret from ~/system/config/vault-apikey.json
# Then unlock:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw unlock --raw > /tmp/bw-session

# Verify
NODE_TLS_REJECT_UNAUTHORIZED=0 BW_SESSION=$(cat /tmp/bw-session) bw list items --search "Email" | head
```

### Runbook: Caddy proxy down

```bash
# Caddy provides HTTPS for bw CLI (self-signed cert)
launchctl list | grep caddy-vault
# Restart
launchctl stop com.john.caddy-vault && launchctl start com.john.caddy-vault
# Verify
curl -sk https://localhost:8443 | head -1
```

---

## 📧 Email System

**Daemon:** `com.john.email-agent` (every 5 min)
**Accounts:** john@basicconsulting.no, info@basicconsulting.no, john@alai.no, alem@alai.no, dev@alai.no
**IMAP:** imap.one.com:993 | **SMTP:** send.one.com:465
**Credentials:** Vaultwarden (via bw CLI)

### Runbook: Email agent not processing

```bash
# Check logs
tail -30 ~/system/logs/email-agent-launchd.log

# Common issue: Vault not unlocked
NODE_TLS_REJECT_UNAUTHORIZED=0 bw status
# Fix: See Vaultwarden runbook above

# Manual test run
NODE_TLS_REJECT_UNAUTHORIZED=0 node ~/system/daemons/email-agent.js --dry-run

# Restart daemon
launchctl stop com.john.email-agent && launchctl start com.john.email-agent

# Check inbox DB
node -e "const e=require('$HOME/system/tools/email-inbox.js');console.log(JSON.stringify(e.getStats(),null,2))"
```

---

## 💬 Telegram Bot

**Daemon:** `com.john.telegram-agent` (KeepAlive)
**Bot:** @johnbasicas_bot
**Config:** macOS Keychain (telegram-bot-token)
**AI Backend:** Claude CLI → Ollama (llama3.1:8b) → static fallback

### Runbook: Bot not responding

```bash
# Check daemon
launchctl list | grep telegram-agent

# Check logs
tail -20 ~/system/logs/telegram-agent.log

# Restart
launchctl stop com.john.telegram-agent && launchctl start com.john.telegram-agent

# Test AI backend
node -e "const{getResponse}=require('$HOME/system/tools/comms-responder.js');getResponse('test',[]).then(r=>console.log(r.backend,r.text.substring(0,100)))"

# Test connection
node ~/system/tools/telegram-agent.js --test
```

---

## 💬 Slack Bot

**Daemon:** `com.john.slack-bot` (KeepAlive)
**Workspace:** ALAI Holding AS

### Runbook: Slack bot not responding

```bash
launchctl list | grep slack-bot
tail -20 ~/system/logs/slack-bot.log
launchctl stop com.john.slack-bot && launchctl start com.john.slack-bot
```

---

## 📋 BookStack (Wiki)

**Container:** bookstack + bookstack_db
**Port:** :6875 | **URL:** docs.basicconsulting.no
**API config:** ~/system/config/bookstack.json (creds in Vaultwarden)

### Runbook: BookStack down

```bash
cd ~/system/services/bookstack
docker compose ps
docker compose restart
# Check logs
docker logs bookstack --tail 20
```

---

## 📝 Documenso (E-Signing)

**Containers:** documenso + documenso-db + documenso-minio
**Port:** :3003 | **URL:** sign.basicconsulting.no

### Runbook: Documenso down

```bash
cd ~/system/services/documenso
docker compose ps
docker compose restart
docker logs documenso --tail 20
```

---

## 📋 Planka (Kanban)

**Containers:** planka + planka-db
**Port:** :3100 | **URL:** boards.basicconsulting.no

### Runbook: Planka down

```bash
cd ~/system/services/planka
docker compose ps
docker compose restart
docker logs planka --tail 20
```

---

## 📅 Baikal (CalDAV/CardDAV)

**Container:** baikal
**Port:** :5232 | **URL:** calendar.basicconsulting.no

### Runbook: Baikal down

```bash
cd ~/system/services/baikal
docker compose ps
docker compose restart
docker logs baikal --tail 20
```

---

## 🤖 Ollama (Local AI)

**Process:** ollama serve (background)
**Port:** :11434
**Models:** llama3.1:8b, qwen2.5-coder:32b, bge-m3, llama-guard3:8b, custom ALAI models

### Runbook: Ollama down

```bash
# Check
curl -s http://localhost:11434/api/tags | python3 -m json.tool | head

# Restart
ollama serve &

# Verify models
ollama list
```

---

## ⚙️ Key LaunchAgent Daemons

| Daemon | Label | Purpose | Priority |
|--------|-------|---------|----------|
| Cloudflared | com.john.cloudflared | Tunnel to internet | P1 |
| Vault Keeper | com.john.vault-keeper | Auto-unlock Vaultwarden | P1 |
| Caddy Vault | com.john.caddy-vault | HTTPS proxy for bw CLI | P1 |
| Slack Bot | com.john.slack-bot | Slack communication | P1 |
| Telegram Agent | com.john.telegram-agent | Telegram bot | P1 |
| Email Agent | com.john.email-agent | Email processing | P1 |
| Email Tracker | com.john.email-tracker | Open/click tracking | P2 |
| Comms Agent | com.john.comms-agent | Cross-platform comms | P2 |
| Ops Watchdog | com.john.ops-watchdog | Service health checks | P1 |
| Event Dispatcher | com.john.event-dispatcher | Event bus processing | P1 |
| Pi Orchestrator | com.john.pi-orchestrator | Task delegation to agents | P1 |
| Autowork | com.john.autowork | Background task execution | P2 |
| N8N | com.john.n8n | Workflow automation | P2 |
| MC Dashboard | com.john.mc-dashboard | Mission Control web UI | P2 |

### Generic daemon restart

```bash
# Stop
launchctl stop com.john.<name>
# Start
launchctl start com.john.<name>
# Full reload
launchctl unload ~/Library/LaunchAgents/com.john.<name>.plist
launchctl load ~/Library/LaunchAgents/com.john.<name>.plist
# Check status
launchctl list | grep <name>
```
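
For a scripted check, the output of `launchctl list` (tab-separated PID, last exit status, and label, with `-` as PID when the job is not running) can be parsed directly. The sketch below is an illustration with assumed names and is not how `daemon-health.js` works.

```typescript
// Hypothetical sketch: type and function names are illustrative.
interface DaemonStatus {
  label: string;
  pid: number | null;     // null when the job is not currently running
  lastExitStatus: number; // non-zero usually means the last run failed
}

function parseLaunchctlList(output: string, prefix = "com.john."): DaemonStatus[] {
  const result: DaemonStatus[] = [];
  for (const line of output.split("\n")) {
    const cols = line.trim().split(/\s+/);
    const pid = cols[0];
    const exitStatus = cols[1];
    const label = cols[2];
    if (pid === undefined || exitStatus === undefined || label === undefined) continue;
    // Skip the header row and any job outside the chosen prefix.
    if (cols.length !== 3 || !label.startsWith(prefix)) continue;
    result.push({
      label,
      pid: pid === "-" ? null : Number(pid),
      lastExitStatus: Number(exitStatus),
    });
  }
  return result;
}
```

A missing PID is expected for interval jobs such as the email agent between runs, so for those the last exit status is the more useful signal.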

---

## 🔄 Cold Start (Full System Bring-Up)

If the Mac Studio reboots:

```bash
# 1. Docker starts automatically (Docker Desktop)
# 2. LaunchAgents auto-load (RunAtLoad=true)
# 3. vault-keeper unlocks Vaultwarden (reads Keychain)
# 4. All services come up within ~2 minutes

# Verify everything:
bash ~/system/ops/cold-start.sh
node ~/system/tools/daemon-health.js
docker ps
```

---

## 🆘 Emergency Contacts

- **Alem Basic** (CEO): alem@alai.no
- **John** (AI Director): john@basicconsulting.no, @johnbasicas_bot (Telegram), #exec (Slack)