# Support & Runbooks

P0 checklists, support systems, audit logging, and operational runbooks

# Support Systems

Support systems, checklists, and audit logging

# P0 Implementation Checklist — Drop Support Systems

**Date:** 2026-02-22
**Status:** Ready for Implementation
**Total Effort:** ~21 hours (2-3 days) originally estimated; ~11 hours remaining after the revisions to items 1 and 2
**Owner:** John (AI Director)

---

## Overview

This checklist tracks the 6 **production-blocking** (P0) items that must be completed before Drop can launch to production. Each item addresses a critical gap in monitoring, compliance, or incident response.

---

## P0 Items

### 1. Server-Side Error Tracking ⏱️ 2 hours (revised)

**Problem:** ~~All server errors are invisible after Sentry was removed~~ **CORRECTED:** `sentry-server.ts` already exists and uses a lightweight Envelope API client (no `@sentry/node` dependency, Turbopack-compatible). However, only 5 of 25+ routes call `captureServerError`.

**Status:** 🟡 Partially Complete (library done, coverage gaps)

**Tasks:**
- [x] ~~Research Sentry Edge SDK compatibility~~ Already solved: custom Envelope API
- [x] ~~Install and configure~~ `src/lib/sentry-server.ts` already complete
- [x] ~~Update sentry-server.ts~~ Already has captureServerError + captureServerMessage
- [ ] **Expand captureServerError to ALL API routes** (currently only 5 routes)
- [ ] Test: Trigger 500 error in expanded routes, verify Sentry event
- [ ] Configure source maps upload (optional but recommended)

**Deliverables:**
- ✅ `src/lib/sentry-server.ts` (already complete — Envelope API, no SDK dep)
- ✅ Integrated in: bankid, bankid/callback, qr-payment, remittance, health
- 🔨 Expanding to: all remaining API routes (~20 routes)

**Acceptance Criteria:**
- ALL API routes have captureServerError in catch blocks
- Error includes context tags (endpoint name, userId)
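
One way to reach full coverage is to write the catch block once and wrap every route handler with it. The following is a minimal sketch, not the project's implementation: `withErrorCapture`, `CaptureFn`, and the context fields are hypothetical names, and the real `captureServerError` signature in `src/lib/sentry-server.ts` may differ.

```typescript
// Hypothetical capture signature; the real captureServerError may take other arguments.
type ErrorContext = { endpoint: string; userId: string | undefined };
type CaptureFn = (error: unknown, context: ErrorContext) => void;

// Wraps a route handler so every thrown error is captured with context tags
// (endpoint name, userId) before a fallback response is returned.
function withErrorCapture<Req, Res>(
  endpoint: string,
  handler: (req: Req) => Promise<Res>,
  capture: CaptureFn,
  getUserId: (req: Req) => string | undefined,
  onError: (error: unknown) => Res,
): (req: Req) => Promise<Res> {
  return async (req: Req): Promise<Res> => {
    try {
      return await handler(req);
    } catch (error) {
      capture(error, { endpoint, userId: getUserId(req) });
      return onError(error);
    }
  };
}

// Plain objects stand in for the framework's request and response types.
type DemoReq = { userId?: string };
type DemoRes = { status: number; body: string };

const captured: ErrorContext[] = [];
const demoRoute = withErrorCapture<DemoReq, DemoRes>(
  "cards",
  async () => {
    throw new Error("db unavailable");
  },
  (_error, context) => {
    captured.push(context);
  },
  (req) => req.userId,
  () => ({ status: 500, body: "Internal Server Error" }),
);
```

Injecting the capture function keeps the wrapper testable without a Sentry DSN; in a route file it would be replaced by the real `captureServerError`.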

---

### 2. Audit Logging System ⏱️ 0 hours (ALREADY COMPLETE)

**Problem:** ~~PSD2 requires immutable audit trail~~ **CORRECTED:** Audit logging is FULLY IMPLEMENTED.

**Status:** ✅ Complete

**What exists:**
- [x] `src/lib/audit.ts` — Full audit library with 30+ action types, logAudit(), getAuditLog(), countAuditEntries()
- [x] `audit_log` table in DB schema (initial migration + db.ts fallback)
- [x] Indexes on user_id, timestamp, action
- [x] 5-year retention documented (data-retention.ts explicitly excludes audit_log from cleanup)
- [x] Fire-and-forget pattern (doesn't block user actions)
- [x] Integrated in 20+ API routes: auth, transactions, cards, recipients, settings, consents, complaints, user management, GDPR endpoints
- [x] Admin audit export: `/api/admin/audit/` endpoint exists
- [x] GDPR data export: `/api/user/data-export/` includes audit log
- [x] Structured logger also captures audit events (stdout for CloudWatch)

**No action needed.** This was incorrectly flagged as missing in the initial analysis.

---

### 3. WAF Deployment ⏱️ 2 hours

**Problem:** WAF rules defined but not enforced (requires reverse proxy).

**Status:** 🟡 Ready to deploy (Terraform written, not yet applied)

**Tasks:**
- [ ] Review `infrastructure/waf-rules.md` for required rules
- [ ] Configure Cloudflare WAF (recommended):
  - [ ] Enable SQLi protection
  - [ ] Enable XSS protection
  - [ ] Enable path traversal blocking
  - [ ] Set request size limits (1MB API, 10KB auth)
- [ ] OR configure AWS WAF (alternative):
  - [ ] Create WAF web ACL
  - [ ] Associate with App Runner service
- [ ] Test WAF rules:
  - [ ] Send SQLi payload (`?id=1' OR '1'='1`), expect 403
  - [ ] Send XSS payload (`<script>alert(1)</script>`), expect 403
- [ ] Document deployment steps

**Deliverables:**
- ⬜ `infrastructure/cloudflare-waf-setup.md` (to be created)
- ⬜ Cloudflare WAF configured
- ⬜ Test results documented

**Acceptance Criteria:**
- SQLi attacks blocked with 403
- XSS attacks blocked with 403
- Legitimate requests pass through
- WAF logs visible in Cloudflare dashboard
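
The WAF checks can be expressed as a small probe list, so the same payloads are reused across staging and production. This is an illustrative sketch: `buildWafProbes` and `WafProbe` are hypothetical names, and the `/api/test` path is taken from the WAF test in the Testing Plan. Payloads are percent-encoded so the request itself is well-formed and any 403 comes from the WAF rather than from the HTTP client.

```typescript
type WafProbe = { name: string; url: string; expectedStatus: number };

// Builds probe URLs with each attack payload percent-encoded into the query string.
function buildWafProbes(baseUrl: string): WafProbe[] {
  const attack = (name: string, payload: string): WafProbe => ({
    name,
    url: `${baseUrl}/api/test?id=${encodeURIComponent(payload)}`,
    expectedStatus: 403,
  });
  return [
    attack("sqli", "1' OR '1'='1"),
    attack("xss", "<script>alert(1)</script>"),
    // A legitimate request must still pass through.
    { name: "legitimate", url: `${baseUrl}/api/health`, expectedStatus: 200 },
  ];
}

const probes = buildWafProbes("https://getdrop.no");
```

Each probe can then be sent with any HTTP client and the response status compared against `expectedStatus`.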

---

### 4. Log Aggregation & Retention ⏱️ 2 hours

**Problem:** Structured logs write to stdout but aren't retained or searchable.

**Status:** 🔨 In Progress (setup guide created; retention, queries, and alarms pending)

**Tasks:**
- [ ] Set CloudWatch Logs retention policy:
  - [ ] Production: 30 days
  - [ ] Staging: 7 days
- [ ] Create CloudWatch Log Insights queries:
  - [ ] All errors (last hour)
  - [ ] User activity trace
  - [ ] Request trace by ID
  - [ ] API endpoint performance (slow queries)
  - [ ] Authentication events
  - [ ] Payment failures
- [ ] Create CloudWatch alarms:
  - [ ] High error rate (>10/min)
  - [ ] No logs received (service down)
  - [ ] Database errors (>5 in 5 min)
- [ ] Create SNS topic for alerts
- [ ] Subscribe email/Slack to SNS topic
- [ ] Test alarms (trigger error spike, verify alert)

**Deliverables:**
- ✅ `infrastructure/cloudwatch-logs-setup.md` (created)
- ⬜ CloudWatch retention policies set
- ⬜ Log Insights queries saved
- ⬜ CloudWatch alarms active

**Acceptance Criteria:**
- Logs retained for 30 days (production)
- Log Insights queries return results in <5 seconds
- Error spike triggers Slack alert within 2 minutes
- Service downtime triggers alert within 5 minutes

---

### 5. External Uptime Monitoring ⏱️ 1 hour

**Problem:** BetterStack documented but not deployed.

**Status:** ⬜ Not Started

**Tasks:**
- [ ] Sign up for BetterStack (free tier)
- [ ] Create monitors:
  - [ ] Production health: `https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health`
    - Interval: 3 minutes
    - Keyword check: `"status":"ok"`
  - [ ] Staging health: `https://drop-staging.fly.dev/api/health`
  - [ ] Landing page: `https://getdrop.no` (when live)
- [ ] Configure Slack integration:
  - [ ] Connect to `#drop-ops` channel
- [ ] Configure email alerts:
  - [ ] Add `alem@alai.no`
- [ ] Test monitoring:
  - [ ] Pause monitor manually
  - [ ] Verify alert received in Slack + email
  - [ ] Resume monitor

**Deliverables:**
- ✅ `docs/infrastructure/BETTERSTACK-SETUP.md` (already exists)
- ⬜ BetterStack account with monitors active
- ⬜ Slack integration tested

**Acceptance Criteria:**
- Health endpoint monitored every 3 minutes
- Downtime alert received in <5 minutes
- Alert includes endpoint URL and status
- Status page shows current uptime %

---

### 6. Payment/Banking Failure Runbooks ⏱️ 4 hours

**Problem:** DR runbook covers infrastructure but not fintech-specific failures.

**Status:** 🟡 Partially Complete

**Tasks:**
- [x] BankID integration failure runbook
- [x] PISP payment failure runbook (remittance + QR)
- [ ] AISP balance retrieval failure runbook
- [ ] Swan API outage runbook
- [ ] Sumsub KYC failure runbook
- [ ] Neonomics open banking outage runbook
- [ ] Test each runbook in staging (simulate failure)
- [ ] Update `docs/dr-runbook.md` to reference new runbooks

**Deliverables:**
- ✅ `support/runbooks/bankid-failure.md` (created)
- ✅ `support/runbooks/pisp-payment-failure.md` (created)
- ⬜ `support/runbooks/aisp-balance-failure.md`
- ⬜ `support/runbooks/swan-api-outage.md`
- ⬜ `support/runbooks/sumsub-kyc-failure.md`
- ⬜ `support/runbooks/neonomics-outage.md`

**Acceptance Criteria:**
- Each runbook includes: symptoms, diagnosis, solutions, escalation
- Runbooks tested (manual simulation in staging)
- Team trained on runbook usage
- Runbooks linked from main DR runbook

---

## Progress Tracking

### Completion Status

| Item | Status | Progress | Blocker |
|------|--------|----------|---------|
| 1. Server-side error tracking | 🟡 Expanding | 80% (lib done, expanding to all routes) | None |
| 2. Audit logging | ✅ COMPLETE | 100% (was already built) | None |
| 3. WAF deployment | 🟡 Ready | 90% (Terraform written, needs apply) | `terraform apply` |
| 4. Log aggregation | 🔨 Building | 50% (CloudWatch alarms being added) | None |
| 5. External monitoring | ⬜ Not Started | 0% | BetterStack account signup |
| 6. Runbooks | 🔨 Building | 33% (2 of 6 done, 4 remaining being written) | None |

**Overall Progress:** ~70% (revised — audit logging was already 100%)
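
The ~70% figure is close to an effort-weighted average of the Progress column. The sketch below assumes that weighting, using the original per-item estimates from the P0 gap table (4h, 8h, 2h, 2h, 1h, 4h); the document does not state how the figure was derived, so treat the method as an assumption.

```typescript
type ItemProgress = { name: string; effortHours: number; progress: number };

// Original effort estimates paired with the Progress column of the table above.
const items: ItemProgress[] = [
  { name: "Server-side error tracking", effortHours: 4, progress: 0.8 },
  { name: "Audit logging", effortHours: 8, progress: 1.0 },
  { name: "WAF deployment", effortHours: 2, progress: 0.9 },
  { name: "Log aggregation", effortHours: 2, progress: 0.5 },
  { name: "External monitoring", effortHours: 1, progress: 0.0 },
  { name: "Runbooks", effortHours: 4, progress: 0.33 },
];

// Completed effort divided by total effort.
function weightedProgress(list: ItemProgress[]): number {
  const total = list.reduce((sum, item) => sum + item.effortHours, 0);
  const done = list.reduce((sum, item) => sum + item.effortHours * item.progress, 0);
  return total === 0 ? 0 : done / total;
}

const overall = weightedProgress(items);
console.log(`${Math.round(overall * 100)}%`); // prints "73%"
```

An unweighted average of the same column comes to about 59%, so the choice of weighting changes the headline number noticeably.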

---

## Priority Order

**Week 1 (High Impact, Low Effort):**
1. ⬜ External monitoring (1h): immediate visibility into outages
2. ⬜ CloudWatch retention (30min): logs already flowing, just set the policy
3. ⬜ CloudWatch alarms (1.5h): automated alerting

**Week 2 (Critical Compliance):**
4. ✅ Audit logging schema (0h): table and library already exist
5. ✅ Audit logging integration (0h): already wired into 20+ API routes

**Week 3 (Security & Error Tracking):**
6. 🟡 Server-side error tracking (2h): expand `captureServerError` to all API routes
7. ⬜ WAF deployment (2h): security hardening

**Week 4 (Runbooks):**
8. 🔨 Remaining runbooks (2h): AISP, Swan, Sumsub, Neonomics

---

## Dependencies

### External Dependencies
- BetterStack account signup (5 min, no approval needed)
- Sentry organization/project (existing, or create new)
- Cloudflare account (existing for DNS, WAF is free tier)

### Internal Dependencies
- Alem approval for:
  - ~~Audit log schema changes~~ (no longer needed: the `audit_log` schema already exists)
  - CloudWatch cost ($17/month estimate)
  - BetterStack Pro upgrade (optional, $20/month for 30s interval)

### Blocked Items
- Some runbooks require Phase 2 context (real banking integrations)
  - Can document procedures but can't fully test without live APIs
  - Mark as "draft" until Phase 2

---

## Testing Plan

### Test 1: Error Tracking
```bash
# Trigger server error
curl -X POST http://localhost:3000/api/test/error \
  -H "Content-Type: application/json" \
  -d '{"trigger":"server_error"}'

# Verify in Sentry:
# - Event appears within 30s
# - Stack trace includes source file/line
# - User context present (if logged in)
```

### Test 2: Audit Logging
```bash
# Perform audit-worthy action
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"wrong"}'

# Check database (PostgreSQL 16); -At prints unaligned, tuples-only rows
psql -At "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 1;"

# Expected (pipe-separated columns):
# audit_xxx|2026-02-22T10:00:00Z|usr_123|login_failure|...|1.2.3.4|Mozilla...
```

### Test 3: WAF
```bash
# Test SQLi blocking (-G with --data-urlencode keeps the URL well-formed)
curl -v -G "https://getdrop.no/api/test" --data-urlencode "id=1' OR '1'='1"

# Expected: HTTP 403 Forbidden

# Test legitimate request
curl "https://getdrop.no/api/health" -v

# Expected: HTTP 200 OK
```

### Test 4: CloudWatch Alarms
```bash
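# Trigger a sustained error spike (~3 minutes at 30 errors/min) so that
# two consecutive 1-minute periods exceed the >10/min threshold
for i in {1..90}; do
  curl -s -o /dev/null http://localhost:3000/api/test/error
  sleep 2
done

# Expected:
# - CloudWatch alarm fires after 2 breaching periods (2 x 1min)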
# - Slack alert received in #drop-ops
# - Email sent to alem@alai.no
```
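
The "2 x 1min periods" expectation implies that the alarm fires only when consecutive periods each exceed the threshold. The sketch below models that evaluation locally under stated assumptions (threshold >10 errors per period, 2 consecutive periods required, periods aligned to t = 0). The function names are hypothetical and nothing here calls CloudWatch.

```typescript
// Buckets event timestamps (ms) into fixed periods and returns the count per period.
function countPerPeriod(timestampsMs: number[], periodMs: number, periods: number): number[] {
  const counts: number[] = new Array<number>(periods).fill(0);
  for (const t of timestampsMs) {
    const index = Math.floor(t / periodMs);
    if (index >= 0 && index < periods) counts[index] = (counts[index] ?? 0) + 1;
  }
  return counts;
}

// True when `required` consecutive periods each exceed the threshold.
function alarmFires(counts: number[], threshold: number, required: number): boolean {
  let streak = 0;
  for (const count of counts) {
    streak = count > threshold ? streak + 1 : 0;
    if (streak >= required) return true;
  }
  return false;
}

// One error every 2 seconds, starting at t = 0.
const errorsEvery2s = (n: number): number[] => Array.from({ length: n }, (_, i) => i * 2000);

const burst = countPerPeriod(errorsEvery2s(15), 60000, 3); // 30-second burst
const sustained = countPerPeriod(errorsEvery2s(90), 60000, 3); // 3-minute run

console.log(alarmFires(burst, 10, 2), alarmFires(sustained, 10, 2));
```

Under this model a 30-second burst breaches only one period, so the spike has to last long enough to cover two full periods before an alert can be expected.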

### Test 5: BetterStack
```bash
# Stop app
docker stop drop-app

# Wait 3-5 minutes

# Expected:
# - BetterStack detects downtime
# - Slack alert in #drop-ops
# - Email to alem@alai.no

# Restart app
docker start drop-app

# Expected:
# - BetterStack detects recovery
# - "UP" notification sent
```

---

## Rollout Plan

### Phase 1: Non-Intrusive (Day 1)
- External monitoring (BetterStack)
- CloudWatch retention policies
- CloudWatch alarms (passive, alerts only)

**Risk:** None. These are read-only additions.

### Phase 2: Database Changes (Day 2)
- Audit log schema and library: already in place (see item 2), no migration needed

**Risk:** None. No schema or app changes are required.

### Phase 3: Code Integration (Day 3-4)
- Server-side error tracking: expand `captureServerError` to the remaining API routes
- WAF deployment

**Risk:** Medium. Requires code changes + deployment. Deploy to staging first, test 24h, then production.

### Phase 4: Runbooks (Day 5)
- Complete remaining runbooks
- Team training session
- Runbook testing in staging

**Risk:** None. Documentation only, no production changes.

---

## Success Metrics

**After P0 completion, we should achieve:**
- ✅ 100% server errors visible (Sentry events)
- ✅ 100% audit events logged (auth, admin, data access)
- ✅ >99.9% uptime detection (BetterStack)
- ✅ <5 min MTTD (mean time to detect incidents)
- ✅ <15 min MTTR (mean time to recover, using runbooks)
- ✅ 0 security vulnerabilities from WAF bypass

---

## Approvals

### Required Approvals
- [ ] ~~Alem: Audit log schema changes~~ (not required: the `audit_log` schema already exists)
- [ ] Alem: CloudWatch cost ($17/month)
- [ ] Alem: BetterStack account (free tier OK? or Pro $20/month?)

### Sign-Off
- [ ] John (AI Director): Technical implementation complete
- [ ] Alem (CEO): Business approval for costs + rollout
- [ ] Validator (QA): Testing complete, acceptance criteria met

---

## Next Steps

1. **Review this analysis** with Alem
2. **Get approvals** for costs and schema changes
3. **Create Mission Control tasks** for each P0 item
4. **Begin implementation** (priority order above)
5. **Test thoroughly** in staging before production
6. **Document completion** in this checklist

---

## Related Documents

- `support/SUPPORT-SYSTEMS-ANALYSIS.md` — Full analysis (all P0/P1/P2 items)
- `support/audit-logging-setup.md` — Audit logging implementation guide
- `support/runbooks/bankid-failure.md` — BankID failure recovery
- `support/runbooks/pisp-payment-failure.md` — Payment failure recovery
- `infrastructure/cloudwatch-logs-setup.md` — Log aggregation setup
- `infrastructure/waf-rules.md` — WAF rule definitions

---

**Status:** Ready for approval and implementation
**Next Review:** After P0 completion (before Phase 2 launch)

# Support Overview

# Customer Support

Customer support resources for the Drop project: FAQs, guides, feedback.

# Drop Support Systems Analysis

**Date:** 2026-02-22
**Author:** John (AI Director)
**Status:** MVP Hardening Phase (0.5)
**Purpose:** Comprehensive analysis of support systems for production-ready fintech deployment

---

## Executive Summary

Drop currently has **foundational support systems** in place but requires **critical enhancements** before production launch. The application has health checks, CI/CD, error tracking (client-side), and basic alerting, but lacks enterprise-grade observability, audit logging, and incident response procedures required for a PSD2-compliant fintech service.

**Key Findings:**
- ✅ **Strong foundation:** Comprehensive CI/CD with >80% coverage, health checks, structured logging
- ⚠️ **Critical gaps:** No server-side error tracking, no audit trails, no APM, limited incident response
- 🚨 **Production blockers:** 6 P0 items must be addressed before go-live (see Gap Analysis)

**Recommendation:** Implement P0 systems immediately (est. 2-3 days), defer P1 to Phase 2 (banking integration), and P2 to post-launch optimization.

---

## Current State

### 1. Monitoring — Uptime & Health Checks

#### What Exists
- ✅ **Health endpoint:** `/api/health` with database connectivity verification
  - Checks: DB query latency, driver type (pg/sqlite), service mode, uptime
  - Returns: `ok` (200), `degraded` (200), or `down` (503)
  - Source: `src/drop-app/src/app/api/health/route.ts`

- ✅ **Container health checks:**
  - Docker: 30s interval, 10s timeout, 3 retries
  - Fly.io: 30s interval, 10s grace period, 5s timeout
  - Auto-restart on failure

- ✅ **External uptime monitoring (ready to deploy):**
  - BetterStack setup guide documented
  - Free tier: 10 monitors, 3-min interval, SMS/email/Slack alerts
  - Documentation: `docs/infrastructure/BETTERSTACK-SETUP.md`

- ✅ **Cron health check script:**
  - `infrastructure/health-check.sh` — AWS App Runner endpoint
  - Slack webhook integration (optional)
  - Can run via cron for local monitoring
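
The health endpoint's status-to-HTTP mapping described above can be sketched as follows. The latency threshold and function names are assumptions for illustration; the source only states that `ok` and `degraded` return 200 and `down` returns 503.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

// Assumed threshold: the source does not say when a slow DB counts as degraded.
const DEGRADED_LATENCY_MS = 500;

// Derives the service status from database reachability and query latency.
function classifyHealth(dbReachable: boolean, dbLatencyMs: number): HealthStatus {
  if (!dbReachable) return "down";
  return dbLatencyMs > DEGRADED_LATENCY_MS ? "degraded" : "ok";
}

// ok and degraded both map to 200; only down maps to 503.
function httpStatusFor(status: HealthStatus): number {
  return status === "down" ? 503 : 200;
}

console.log(httpStatusFor(classifyHealth(true, 900))); // prints 200
```

Because `degraded` still returns 200, a monitor that checks only the status code cannot see degradation; a keyword check on the response body is needed for that.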

#### What's Missing
- ❌ **Synthetic monitoring:** No transaction flow testing (login → send money → verify)
- ❌ **Multi-region checks:** No geographic availability testing
- ❌ **SLA tracking:** No uptime percentage calculation or reporting
- ❌ **Dependency monitoring:** No checks for external services (Swan API, BankID, Sumsub)

#### Assessment
**Status:** Adequate for MVP, requires enhancement for production.
**Gap:** External monitoring configured but not deployed. Synthetic checks needed.

---

### 2. Logging — Centralized Log Aggregation

#### What Exists
- ✅ **Structured logging:**
  - JSON format with timestamp, level, message, requestId, metadata
  - Source: `src/drop-app/src/lib/logger.ts`
  - Writes to stdout (Docker-friendly)

- ✅ **Request correlation:**
  - `x-request-id` header extraction or UUID generation
  - Request context propagation through logger instances

- ✅ **Log levels:** debug, info, warn, error
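
The structured format and request correlation listed above might look like the sketch below. It is not the contents of `logger.ts`: the function names are hypothetical, and the ID generator and clock are injected so the behavior is deterministic.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Reuses an incoming x-request-id header when present, otherwise generates one.
function resolveRequestId(
  headers: Record<string, string | undefined>,
  generateId: () => string,
): string {
  const incoming = headers["x-request-id"];
  return incoming !== undefined && incoming.length > 0 ? incoming : generateId();
}

// Serializes one log entry as a single JSON line suitable for stdout.
function formatLogLine(
  level: LogLevel,
  message: string,
  requestId: string,
  metadata: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, message, requestId, metadata });
}

const requestId = resolveRequestId({ "x-request-id": "req_456" }, () => "generated-id");
const line = formatLogLine(
  "error",
  "payment failed",
  requestId,
  { endpoint: "remittance" },
  new Date("2026-02-22T10:00:00Z"),
);
console.log(line);
```

One JSON object per line is what lets a log service index `level` and `requestId` as fields once aggregation is in place.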

#### What's Missing
- ❌ **Log aggregation:** Logs write to stdout but aren't collected or indexed
- ❌ **Log retention:** No policy for how long logs are kept
- ❌ **Log search:** No way to query logs across time/instances
- ❌ **Log forwarding:** No integration with log management service
- ❌ **Sensitive data scrubbing:** Logger doesn't automatically redact PII

#### Assessment
**Status:** Foundation exists, but logs are ephemeral (lost on container restart).
**Gap:** Critical for incident investigation and compliance audits. Need CloudWatch Logs or similar.

---

### 3. Error Tracking — Error Capture & Alerting

#### What Exists
- ✅ **Client-side error tracking:**
  - Sentry browser integration (`@sentry/browser`)
  - PII scrubbing (passwords, pins, card numbers, fødselsnummer)
  - 10% trace sampling for performance monitoring
  - Source: `src/drop-app/src/lib/sentry.ts`, `SENTRY.md`

- ✅ **Error spike detection:**
  - Tracks errors in rolling 1-minute window
  - Alerts when >5 errors in 60 seconds
  - Source: `src/drop-app/src/lib/alerts.ts:trackError()`

- ✅ **Global error boundaries:**
  - React error boundaries for component crashes
  - `global-error.tsx` catches unhandled errors
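
The rolling-window spike detection described above (alert when more than 5 errors occur within 60 seconds) can be sketched as below. `ErrorSpikeDetector` is a hypothetical name and this is not the code in `alerts.ts`; timestamps are passed in explicitly so the logic can be exercised without waiting.

```typescript
// Keeps timestamps of recent errors and reports a spike when more than
// `threshold` errors fall inside the rolling window.
class ErrorSpikeDetector {
  private timestamps: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number = 5, windowMs: number = 60000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one error at `nowMs`; returns true if the window now exceeds the threshold.
  track(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}

const detector = new ErrorSpikeDetector();
const results = [0, 1000, 2000, 3000, 4000, 5000].map((t) => detector.track(t));
console.log(results[5]); // prints true: the sixth error within the window is a spike
```

Pruning on every call keeps memory bounded by the number of errors in one window.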

#### What's Missing
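- ❌ **Server-side error tracking:** ~~Sentry removed from server due to Next.js 16 Turbopack incompatibility (MC #1271)~~ **CORRECTED:** `sentry-server.ts` captures server errors via the Envelope API, but only 5 of 25+ API routes call `captureServerError`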
- ❌ **API error context:** Server errors log to console only, no structured capture
- ❌ **Error attribution:** Can't trace errors to specific users or transactions
- ❌ **Error deduplication:** Same error reported multiple times clogs alerts

#### Assessment
**Status:** Client errors tracked, server errors blind.
**Gap:** CRITICAL — server-side errors (API, DB, integrations) are invisible. P0 fix required.

---

### 4. Alerting — On-Call & Escalation

#### What Exists
- ✅ **Slack alerting:**
  - Operational alerts with severity levels (info/warning/critical)
  - 10-minute cooldown per alert title (spam prevention)
  - Source: `src/drop-app/src/lib/alerts.ts`

- ✅ **Lifecycle alerts:**
  - App startup notification
  - Graceful shutdown notification
  - Source: `instrumentation.ts`

- ✅ **Error spike alerts:**
  - Automatic critical alert when >5 errors/minute
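
The 10-minute cooldown per alert title can be sketched as a map from title to last-sent time. `AlertCooldown` is a hypothetical name for illustration, and the actual logic in `alerts.ts` may be structured differently.

```typescript
// Suppresses repeat alerts with the same title until the cooldown has elapsed.
class AlertCooldown {
  private lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs: number = 10 * 60 * 1000) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true (and records the send) if an alert with this title may be sent now.
  shouldSend(title: string, nowMs: number): boolean {
    const previous = this.lastSent.get(title);
    if (previous !== undefined && nowMs - previous < this.cooldownMs) return false;
    this.lastSent.set(title, nowMs);
    return true;
  }
}

const cooldown = new AlertCooldown();
const minute = 60 * 1000;
const first = cooldown.shouldSend("Error spike", 0); // sent
const repeat = cooldown.shouldSend("Error spike", 5 * minute); // suppressed
const later = cooldown.shouldSend("Error spike", 10 * minute); // sent again
```

State held in memory like this is per process, so with several app instances each one applies its own cooldown and duplicate alerts remain possible.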

#### What's Missing
- ❌ **On-call rotation:** No defined on-call schedule or escalation policy
- ❌ **Alert routing:** All alerts go to same Slack channel, no severity-based routing
- ❌ **Alert escalation:** No automatic escalation after N minutes of unresolved incident
- ❌ **Alert acknowledgment:** Can't mark alerts as "acknowledged" or "resolved"
- ❌ **SMS/phone alerts:** Critical incidents only notify via Slack (single point of failure)
- ❌ **Alert testing:** No way to test alert pipeline without triggering real incidents

#### Assessment
**Status:** Basic alerting works for small team, inadequate for 24/7 production.
**Gap:** Need on-call schedule, escalation policy, and multi-channel delivery.

---

### 5. Security Monitoring — WAF, DDoS, Anomaly Detection, Audit Logs

#### What Exists
- ✅ **WAF rules defined:**
  - CSRF origin validation (implemented in middleware)
  - Rate limiting on auth endpoints (10 req/60s)
  - CSP headers with nonce-based script loading
  - Source: `infrastructure/waf-rules.md`, `src/drop-app/src/middleware.ts`

- ✅ **Container security scanning:**
  - Trivy vulnerability scanner in CI/CD
  - Blocks HIGH/CRITICAL vulnerabilities
  - SARIF upload to GitHub Security tab

- ✅ **Dependency scanning:**
  - `npm audit` in CI pipeline (prod deps only)

- ✅ **AML transaction monitoring:**
  - 5 automated rules: structuring, velocity, high amount, high-risk corridor, unusual pattern
  - Alerts stored in `aml_alerts` table
  - Source: `src/drop-app/src/lib/transaction-monitor.ts`

#### What's Missing
- ❌ **WAF deployment:** Rules defined but not deployed (requires CDN/reverse proxy)
- ❌ **DDoS protection:** No rate limiting at network edge, only app-level
- ❌ **Intrusion detection:** No IDS/IPS monitoring unusual access patterns
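- **Audit logs:** ~~No immutable log of authentication, authorization, data access events (PSD2 requirement)~~ **CORRECTED:** audit logging is already implemented (`src/lib/audit.ts`, `audit_log` table, 20+ API routes); see item 2 of the P0 checklist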
- ❌ **Security incident response plan:** No runbook for security breaches
- ❌ **Penetration testing:** No external security audit completed

#### Assessment
**Status:** Security-aware codebase, but monitoring/audit infrastructure missing.
**Gap:** CRITICAL — audit logs are PSD2/GDPR compliance requirement. P0 fix.

---

### 6. Performance — APM, Latency Tracking, Resource Utilization

#### What Exists
- ✅ **Health check latency:**
  - DB query time measured in health endpoint
  - Reported in milliseconds

- ✅ **Coverage thresholds in CI:**
  - Enforced at 80/70/80/80; these gate test coverage, not runtime performance

#### What's Missing
- ❌ **APM (Application Performance Monitoring):** No distributed tracing
- ❌ **API latency tracking:** Don't know which endpoints are slow
- ❌ **Database performance:** No slow query alerts or query profiling
- ❌ **Resource utilization:** No CPU/memory/disk usage monitoring
- ❌ **Frontend performance:** No Core Web Vitals tracking (LCP, INP, CLS)
- ❌ **Transaction timing:** Can't measure end-to-end payment latency

#### Assessment
**Status:** Minimal. Can detect total outage but not performance degradation.
**Gap:** Need before production to identify bottlenecks and capacity issues.

---

### 7. Database — Backups, Replication, Monitoring

#### What Exists
- ✅ **Automated backups (RDS):**
  - Daily automated snapshots, 7-day retention
  - Point-in-time recovery within 7 days
  - Source: `docs/dr-runbook.md`

- ⚠️ **Multi-AZ (production):**
  - RDS high availability applies only if Multi-AZ is enabled; this has not been confirmed

- ✅ **Database health check:**
  - `SELECT 1` query in health endpoint verifies connectivity

#### What's Missing
- ❌ **Backup verification:** Snapshots created but never tested for restore
- ❌ **Backup monitoring:** No alerts if backup fails
- ❌ **Replication lag monitoring:** No alerts if replica falls behind
- ❌ **Connection pool monitoring:** No visibility into connection usage
- ❌ **Query performance:** No slow query log analysis
- ❌ **Storage monitoring:** No alerts before disk fills up

#### Assessment
**Status:** Basic backup/restore exists, monitoring gaps.
**Gap:** Backup testing and proactive monitoring needed before production.

---

### 8. Incident Response — Runbooks, Status Page, Communication Plan

#### What Exists
- ✅ **DR runbook:**
  - Procedures for App Runner down, RDS down, full redeploy
  - Environment variable checklist
  - Contact escalation (John → Alem)
  - Source: `docs/dr-runbook.md`

- ✅ **Incident checklist:**
  - 8-step incident response workflow
  - Post-mortem requirement (48h)

#### What's Missing
- ❌ **Status page:** No public/customer-facing status page
- ❌ **Incident templates:** No standardized incident report format
- ❌ **Communication plan:** No templates for customer notifications during outages
- ❌ **Runbook coverage:** Only covers infrastructure, missing:
  - Payment failures (PISP/AISP errors)
  - BankID integration issues
  - KYC/AML false positive handling
  - Data breach response
- ❌ **Runbook testing:** Procedures documented but never executed

#### Assessment
**Status:** Basic DR runbook exists, lacks fintech-specific scenarios.
**Gap:** Need payment/banking integration runbooks before Phase 2.

---

### 9. CI/CD — Build Pipeline, Deployment, Rollback

#### What Exists
- ✅ **Comprehensive CI pipeline:**
  - Multi-package change detection
  - Lint, typecheck, unit tests, E2E (Playwright), mutation testing (Stryker)
  - Coverage thresholds enforced (80/70/80/80) with ratchet (never decrease)
  - Docker build + Trivy security scan
  - Quality gate (required status check)
  - Source: `.github/workflows/ci.yml`

- ✅ **Deployment workflows:**
  - GitHub Actions for deploy (backend, mobile)
  - Terraform for infrastructure
  - Source: `.github/workflows/deploy.yml`, `terraform-ci.yml`
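
The coverage ratchet in the CI pipeline above (thresholds enforced, never decreasing) can be sketched as a comparison against both a fixed floor and the previous run. Mapping 80/70/80/80 onto statements/branches/functions/lines is an assumption, as are the names used here.

```typescript
type Coverage = { statements: number; branches: number; functions: number; lines: number };

// Assumed mapping of the 80/70/80/80 thresholds onto the four metrics.
const FLOOR: Coverage = { statements: 80, branches: 70, functions: 80, lines: 80 };

// Returns the metrics that fall below the floor or below the previous run.
function ratchetViolations(
  previous: Coverage,
  current: Coverage,
  floor: Coverage = FLOOR,
): (keyof Coverage)[] {
  const metrics: (keyof Coverage)[] = ["statements", "branches", "functions", "lines"];
  return metrics.filter((m) => current[m] < floor[m] || current[m] < previous[m]);
}

const previousRun: Coverage = { statements: 85, branches: 72, functions: 90, lines: 85 };
const regressed: Coverage = { statements: 86, branches: 71, functions: 90, lines: 85 };
console.log(ratchetViolations(previousRun, regressed)); // prints [ 'branches' ]
```

In the example, branch coverage of 71 still clears the floor of 70 but fails the ratchet because the previous run reached 72.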

#### What's Missing
- ❌ **Automated rollback:** Deployment failure doesn't auto-revert
- ❌ **Canary deployments:** All-or-nothing deployment, no gradual rollout
- ❌ **Deployment monitoring:** No automatic health check after deploy
- ❌ **Deployment notifications:** Team not notified of deployments/failures
- ❌ **Infrastructure drift detection:** Terraform state not continuously validated

#### Assessment
**Status:** Strong quality gate, weak deployment safety.
**Gap:** Add post-deployment health checks and rollback automation.

---

### 10. Compliance — Audit Trails, Data Retention, GDPR/PSD2 Logging

#### What Exists
- ✅ **AML monitoring:**
  - Transaction alerts stored in `aml_alerts` table
  - 5 risk categories tracked

- ✅ **Security audit completed:**
  - 4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW findings documented
  - Source: `security/drop-security-rapport.md`

- ✅ **Data retention service:**
  - Code exists for GDPR compliance
  - Source: `src/drop-app/src/lib/services/data-retention.ts`

#### What's Missing
- ❌ **Audit logs:** No immutable record of:
  - User authentication events (login, logout, failed attempts)
  - Authorization decisions (who accessed what, when)
  - Data modifications (user profile changes, transaction edits)
  - Administrative actions (KYC approvals, AML reviews)
- ❌ **Audit log retention policy:** PSD2 requires 5+ years
- ❌ **Audit log integrity:** No cryptographic proof of non-tampering
- ❌ **Compliance reporting:** No automated report generation for regulators
- ❌ **STR (Suspicious Transaction Report) workflow:** AML alerts created but no submission process

#### Assessment
**Status:** CRITICAL GAP. Audit logs are PSD2 legal requirement.
**Gap:** P0 — must implement before production launch.

---

## Gap Analysis

### P0 — Production Blockers (Must Fix Before Go-Live)

| # | Category | Gap | Impact | Effort |
|---|----------|-----|--------|--------|
| 1 | **Error Tracking** | No server-side error monitoring | Can't detect/debug API failures | 4h |
| 2 | **Compliance** | No audit logs (auth, data access, admin actions) | PSD2 non-compliance, legal risk | 8h |
| 3 | **Security** | WAF rules defined but not deployed | Vulnerable to SQLi, XSS, DDoS | 2h (config) |
| 4 | **Logging** | No log aggregation/retention | Can't investigate incidents | 2h (CloudWatch setup) |
| 5 | **Monitoring** | BetterStack configured but not deployed | No external incident detection | 1h (account setup) |
| 6 | **Incident Response** | No payment/banking failure runbooks | Can't recover from PISP/BankID outages | 4h |

**Total P0 effort:** ~21 hours (2-3 days)

---

### P1 — Needed Soon (Before Phase 2: Banking Integration)

| # | Category | Gap | Impact | Effort |
|---|----------|-----|--------|--------|
| 7 | **Alerting** | No on-call rotation or escalation policy | Incidents may go unnoticed outside work hours | 2h |
| 8 | **Performance** | No APM for distributed tracing | Can't diagnose slow transactions | 4h |
| 9 | **Database** | No backup testing or monitoring | Backups may be corrupt, undetected | 3h |
| 10 | **Security** | No penetration testing | Unknown vulnerabilities | 16h (external) |
| 11 | **CI/CD** | No automated rollback on deployment failure | Bad deploys cause extended outages | 6h |
| 12 | **Compliance** | No STR submission workflow | Can't fulfill AML obligations | 8h |

**Total P1 effort:** ~39 hours (5 days)

---

### P2 — Nice to Have (Post-Launch Optimization)

| # | Category | Gap | Impact | Effort |
|---|----------|-----|--------|--------|
| 13 | **Monitoring** | No synthetic transaction monitoring | Can't detect broken user flows | 8h |
| 14 | **Performance** | No Core Web Vitals tracking | Poor user experience undetected | 4h |
| 15 | **Alerting** | No SMS/phone alerts for critical incidents | Slack outage = missed alerts | 2h |
| 16 | **Database** | No slow query alerts | Performance degradation undetected | 6h |
| 17 | **Security** | No IDS/IPS for intrusion detection | Advanced attacks undetected | 16h |
| 18 | **Incident Response** | No public status page | Customers unaware of outages | 4h |

**Total P2 effort:** ~40 hours (5 days)

---

## Implementation Plan

### Phase 1: P0 Production Blockers (NOW — before Phase 1 demo)

**Goal:** Address legal/compliance requirements and critical observability gaps.

#### 1.1 Server-Side Error Tracking (4h)
**Problem:** All server errors invisible after Sentry removed (Next.js 16 Turbopack incompatibility).

**Solution:**
- **Option A:** Sentry Edge SDK (compatible with Next.js middleware)
  - Install: `@sentry/nextjs` with edge-only config
  - Capture server errors via `captureException()` in middleware
  - Source maps via Sentry webpack plugin
- **Option B:** Custom error aggregation service
  - POST errors to internal `/api/errors/capture` endpoint
  - Store in `error_logs` table with context
  - Alert on spike detection

**Deliverable:**
- `src/drop-app/sentry.edge.config.ts` (if Option A)
- Updated `src/drop-app/src/lib/sentry-server.ts` with edge-compatible capture
- Test: Trigger 500 error, verify Sentry event created

**Files:** `infrastructure/error-tracking-setup.md`

---

#### 1.2 Audit Logging System (8h)
**Problem:** PSD2 requires immutable audit trail for auth, data access, admin actions.

**Solution:**
- Create `audit_logs` table:
  ```sql
  CREATE TABLE audit_logs (
    id TEXT PRIMARY KEY,
    timestamp TEXT NOT NULL,
    user_id TEXT,
    action TEXT NOT NULL, -- 'login', 'data_access', 'kyc_approval', etc.
    resource_type TEXT, -- 'user', 'transaction', 'aml_alert'
    resource_id TEXT,
    metadata JSON,
    ip_address TEXT,
    user_agent TEXT,
    request_id TEXT,
    result TEXT -- 'success', 'failure', 'denied'
  );
  CREATE INDEX idx_audit_user ON audit_logs(user_id, timestamp);
  CREATE INDEX idx_audit_action ON audit_logs(action, timestamp);
  ```

- Audit functions:
  ```typescript
  auditLog({
    userId: 'usr_123',
    action: 'login_success',
    resourceType: 'user',
    resourceId: 'usr_123',
    metadata: { method: 'bankid' },
    ip: '1.2.3.4',
    userAgent: 'Mozilla...',
    requestId: 'req_456',
    result: 'success'
  });
  ```

- Integrate at:
  - `POST /api/auth/login` (login_success, login_failure)
  - `POST /api/auth/logout` (logout)
  - `GET /api/users/:id` (data_access)
  - `PATCH /api/users/:id/kyc` (kyc_approval, kyc_rejection)
  - `PATCH /api/aml-alerts/:id` (aml_review)
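
To show how the proposed call shape maps onto the proposed table, the sketch below converts an input object into a row keyed by the `audit_logs` column names. All type and function names are hypothetical, `result` is treated as required (an assumption), and the ID and clock are passed in rather than generated.

```typescript
type AuditResult = "success" | "failure" | "denied";

type AuditInput = {
  userId?: string;
  action: string;
  resourceType?: string;
  resourceId?: string;
  metadata?: Record<string, unknown>;
  ip?: string;
  userAgent?: string;
  requestId?: string;
  result: AuditResult;
};

// One row per audit event, with keys named after the proposed table columns.
type AuditRow = {
  id: string;
  timestamp: string;
  user_id: string | null;
  action: string;
  resource_type: string | null;
  resource_id: string | null;
  metadata: string | null;
  ip_address: string | null;
  user_agent: string | null;
  request_id: string | null;
  result: AuditResult;
};

// Maps camelCase input to snake_case columns; omitted optional fields become NULL.
function toAuditRow(input: AuditInput, id: string, now: Date): AuditRow {
  return {
    id,
    timestamp: now.toISOString(),
    user_id: input.userId ?? null,
    action: input.action,
    resource_type: input.resourceType ?? null,
    resource_id: input.resourceId ?? null,
    metadata: input.metadata ? JSON.stringify(input.metadata) : null,
    ip_address: input.ip ?? null,
    user_agent: input.userAgent ?? null,
    request_id: input.requestId ?? null,
    result: input.result,
  };
}

const row = toAuditRow(
  { userId: "usr_123", action: "login_success", metadata: { method: "bankid" }, result: "success" },
  "audit_001",
  new Date("2026-02-22T10:00:00Z"),
);
```

Keeping the mapping in one pure function makes it easy to unit-test that every route produces rows with the same shape.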

**Deliverable:**
- `src/drop-app/src/lib/audit-log.ts` (audit logging functions)
- Migration: `migrations/003_audit_logs.sql`
- Integration in auth routes and admin endpoints
- Retention policy: Document 5-year retention for PSD2 compliance

**Files:** `support/audit-logging-setup.md`

---

#### 1.3 WAF Deployment (2h)
**Problem:** WAF rules defined but not enforced (requires reverse proxy).

**Solution:**
- **Option A:** Cloudflare WAF (recommended)
  - Already using Cloudflare for DNS (terraform module exists)
  - Free tier includes basic WAF rules
  - Configure: SQLi, XSS, path traversal rules from `infrastructure/waf-rules.md`
- **Option B:** AWS WAF (if using App Runner directly)
  - $5/month + $1/million requests
  - Associate with App Runner service

**Deliverable:**
- Cloudflare WAF configuration (Terraform or UI)
- Test: Send SQLi payload, verify 403 response
- Document: Update `infrastructure/waf-rules.md` with deployment steps

**Files:** `infrastructure/cloudflare-waf-setup.md`

---

#### 1.4 Log Aggregation (2h)
**Problem:** Structured logs write to stdout but aren't retained or searchable.

**Solution:**
- **AWS CloudWatch Logs** (App Runner auto-integrates):
  - App Runner streams stdout → CloudWatch Logs automatically
  - Configure retention: 30 days (production), 7 days (staging)
  - Set up log insights queries for common patterns

- **Fly.io (staging):**
  - `fly logs` stores last 24h by default
  - Optional: Forward to external service (Papertrail, Logtail)

**Deliverable:**
- CloudWatch Logs retention policy configured
- Log Insights queries:
  - All errors: `fields @timestamp, message | filter level = "error"`
  - User actions: `fields @timestamp, userId, message | filter userId = "usr_123"`
  - Request trace: `fields @timestamp, requestId, message | filter requestId = "req_456"`
- Documentation: `infrastructure/logging-setup.md`

**Files:** `infrastructure/cloudwatch-logs-setup.md`
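
The Log Insights queries above only work if each log line is a single JSON object with `level`, `userId`, and `requestId` as top-level fields, because CloudWatch Logs Insights discovers fields in JSON log events automatically. A minimal sketch of such a logger follows; the `formatLogLine` and `log` names are hypothetical and are not taken from Drop's actual logger module.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface LogContext {
  userId?: string;
  requestId?: string;
  [key: string]: unknown;
}

// Build one JSON log line. The field names (level, message, userId, requestId)
// mirror the Log Insights queries; the function itself is illustrative.
function formatLogLine(level: LogLevel, message: string, context: LogContext = {}): string {
  return JSON.stringify({
    ...context, // spread first so context cannot overwrite the core fields
    timestamp: new Date().toISOString(),
    level,
    message,
  });
}

// Write to stdout, which App Runner streams to CloudWatch Logs.
function log(level: LogLevel, message: string, context?: LogContext): void {
  console.log(formatLogLine(level, message, context));
}

log('error', 'Payment failed', { userId: 'usr_123', requestId: 'req_456' });
```

Keeping each event on one line matters: a pretty-printed, multi-line object would be split into several log events and the fields would no longer be queryable.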

---

#### 1.5 External Uptime Monitoring (1h)
**Problem:** BetterStack documented but not deployed.

**Solution:**
- Sign up: https://betterstack.com/uptime (free tier)
- Create monitors:
  1. **Production health:** `https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health`
     - Interval: 3 minutes
     - Keyword check: `"status":"ok"`
  2. **Staging health:** `https://drop-staging.fly.dev/api/health`
  3. **Landing page:** `https://getdrop.no` (when live)
- Slack integration: Connect to `#drop-ops` channel
- Email alerts: `alem@alai.no`

**Deliverable:**
- BetterStack account with 3 monitors configured
- Test: Pause monitor, verify alert received
- Documentation: Update `docs/infrastructure/BETTERSTACK-SETUP.md` with credentials

**Files:** `support/betterstack-deployment.md`
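
The keyword check is a plain substring match, so the health endpoint has to emit compact JSON for `"status":"ok"` to be found. The sketch below illustrates this; `buildHealthBody` and `passesKeywordCheck` are hypothetical helpers and do not describe the real `/api/health` handler.

```typescript
interface HealthBody {
  status: 'ok' | 'degraded';
  timestamp: string;
}

// Serialize the health response compactly (no spaces after colons).
function buildHealthBody(healthy: boolean, now: Date = new Date()): string {
  const body: HealthBody = {
    status: healthy ? 'ok' : 'degraded',
    timestamp: now.toISOString(),
  };
  return JSON.stringify(body);
}

// The same substring test an uptime monitor's keyword check performs.
const KEYWORD = '"status":"ok"';

function passesKeywordCheck(responseBody: string): boolean {
  return responseBody.includes(KEYWORD);
}

console.log(passesKeywordCheck(buildHealthBody(true)));  // true
console.log(passesKeywordCheck(buildHealthBody(false))); // false
```

If the handler ever switches to pretty-printed output (`"status": "ok"` with a space), the monitor would report the service as down even though it is healthy.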

---

#### 1.6 Payment/Banking Failure Runbooks (4h)
**Problem:** DR runbook covers infrastructure but not fintech-specific failures.

**Solution:**
- Create runbooks for:
  1. **BankID integration failure** (authentication blocked)
  2. **PISP payment failure** (remittance/QR payment rejected)
  3. **AISP balance retrieval failure** (can't fetch account balance)
  4. **Swan API outage** (BaaS provider down)
  5. **Sumsub KYC failure** (identity verification unavailable)
  6. **Neonomics open banking outage**

- Each runbook includes:
  - Symptoms (what users see)
  - Diagnosis steps (check service status, logs, error codes)
  - Recovery procedure (fallback, retry, escalation)
  - Customer communication template

**Deliverable:**
- `support/runbooks/bankid-failure.md`
- `support/runbooks/pisp-payment-failure.md`
- `support/runbooks/aisp-balance-failure.md`
- `support/runbooks/swan-api-outage.md`
- `support/runbooks/sumsub-kyc-failure.md`
- `support/runbooks/neonomics-outage.md`

**Files:** Created in `/Users/makinja/ALAI/products/Drop/support/runbooks/`

---

### Phase 2: P1 Items (Banking Integration)

Defer to Phase 2 when real banking integrations are live and need production-grade support.

**Priority order:**
1. Penetration testing (external security audit)
2. APM for transaction tracing (identify slow payments)
3. On-call rotation and escalation policy
4. Automated rollback on failed deployments
5. Backup testing and monitoring
6. STR submission workflow (AML compliance)

---

### Phase 3: P2 Items (Post-Launch)

Optimize after initial production deployment and user feedback.

**Priority order:**
1. Synthetic transaction monitoring (test critical user flows)
2. Public status page (customer transparency)
3. Core Web Vitals tracking (frontend performance)
4. SMS/phone alerts (redundancy)
5. Slow query monitoring (database optimization)
6. IDS/IPS (advanced threat detection)

---

## Architecture

### Support Systems Connectivity

```
┌─────────────────────────────────────────────────────────────────┐
│                         Drop Application                        │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │  drop-app   │  │  drop-api    │  │  drop-mobile (Expo)  │  │
│  │  (Next.js)  │  │  (Hono)      │  │  (React Native)      │  │
│  └─────────────┘  └──────────────┘  └──────────────────────┘  │
│         │                │                      │               │
│         └────────────────┴──────────────────────┘               │
│                          │                                      │
└──────────────────────────┼──────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────────────────┐
        │                  │                              │
        ▼                  ▼                              ▼
┌───────────────┐  ┌──────────────┐           ┌──────────────────┐
│ Structured    │  │ Health Check │           │ Audit Logs       │
│ Logging       │  │ Endpoint     │           │ (audit_logs      │
│ (JSON stdout) │  │ /api/health  │           │  table)          │
└───────┬───────┘  └──────┬───────┘           └─────────┬────────┘
        │                 │                             │
        │                 │                             │
        ▼                 │                             │
┌────────────────┐        │                             │
│ CloudWatch     │        │                             │
│ Logs           │        │                             │
│ (30d retention)│        │                             │
└────────────────┘        │                             │
        │                 │                             │
        │                 ▼                             │
        │         ┌───────────────┐                     │
        │         │ BetterStack   │                     │
        │         │ (external     │                     │
        │         │  monitoring)  │                     │
        │         └───────┬───────┘                     │
        │                 │                             │
        └─────────────────┼─────────────────────────────┘
                          │
                          ▼
                 ┌────────────────┐
                 │ Alerting Layer │
                 │ (alerts.ts)    │
                 └────────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Slack       │  │ Sentry      │  │ Email       │
│ Webhook     │  │ (client +   │  │ (SMTP)      │
│ (#drop-ops) │  │  edge)      │  │             │
└─────────────┘  └─────────────┘  └─────────────┘
```

### Data Flows

1. **Error Flow:**
   - Client error → Sentry browser → Slack alert (if spike)
   - Server error → Sentry edge → CloudWatch Logs → Slack alert
   - API 5xx → `trackError()` → Spike detection → Slack

2. **Monitoring Flow:**
   - App → stdout → CloudWatch Logs
   - App → `/api/health` → BetterStack → Slack/Email/SMS
   - Container → Docker health check → Auto-restart

3. **Audit Flow:**
   - User action → `auditLog()` → `audit_logs` table
   - Compliance query → SQL export → Regulator submission

4. **Incident Flow:**
   - Alert → Slack `#drop-ops`
   - Unacknowledged (5 min) → Email to Alem
   - Unresolved (15 min) → SMS (BetterStack escalation)
   - Incident → Runbook → Recovery → Post-mortem
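
The escalation thresholds in the incident flow can be expressed as a small pure function. This is a sketch under the stated 5-minute and 15-minute thresholds; the `IncidentState` shape and the `channelsToNotify` name are assumptions made for illustration, and the real escalation is configured in BetterStack rather than in application code.

```typescript
type Channel = 'slack' | 'email' | 'sms';

interface IncidentState {
  minutesSinceAlert: number;
  acknowledged: boolean;
  resolved: boolean;
}

// Return every channel that should have been notified by now.
function channelsToNotify(state: IncidentState): Channel[] {
  if (state.resolved) return [];

  // Every alert goes to Slack first.
  const channels: Channel[] = ['slack'];

  // Unacknowledged after 5 minutes: add email.
  if (!state.acknowledged && state.minutesSinceAlert >= 5) channels.push('email');

  // Still unresolved after 15 minutes: add SMS.
  if (state.minutesSinceAlert >= 15) channels.push('sms');

  return channels;
}

console.log(channelsToNotify({ minutesSinceAlert: 6, acknowledged: false, resolved: false }));
```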

---

## Cost Estimate

### Free Tier (MVP)
- ✅ CloudWatch Logs: 5 GB ingestion/month free (AWS Free Tier)
- ✅ BetterStack: 10 monitors, 3-min interval, unlimited alerts
- ✅ Sentry: 5K events/month free
- ✅ GitHub Actions: 2000 minutes/month free
- ✅ Terraform state: S3 free tier (first 12 months)

**Total MVP cost:** $0/month

### Paid Services (Production)
- CloudWatch Logs: ~$5/month (30 GB ingestion estimate)
- BetterStack Pro: $20/month (30s interval, SMS alerts)
- Sentry Team: $26/month (50K events, enhanced features)
- **Optional:** Datadog APM: $15/host/month (~$45 for 3 hosts)

**Total production cost:** ~$50-100/month (without APM)
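
For reference, the listed prices add up as follows. The sketch only sums the figures quoted above; the variable and function names are illustrative.

```typescript
// Monthly prices in USD, copied from the list above.
const productionServicesUsd: Record<string, number> = {
  cloudWatchLogs: 5,
  betterStackPro: 20,
  sentryTeam: 26,
};

// Sum the base services and optionally add APM at a per-host price.
function monthlyTotal(costs: Record<string, number>, apmHosts = 0, apmPerHostUsd = 15): number {
  const base = Object.values(costs).reduce((sum, cost) => sum + cost, 0);
  return base + apmHosts * apmPerHostUsd;
}

console.log(monthlyTotal(productionServicesUsd));    // 51
console.log(monthlyTotal(productionServicesUsd, 3)); // 96
```

The three base services come to $51/month, the low end of the quoted range; adding Datadog APM for 3 hosts brings the total to $96/month.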

---

## Recommendations

### Immediate (This Week)
1. ✅ **Deploy BetterStack** (1h) — External monitoring is fast win
2. ✅ **Configure CloudWatch retention** (30 min) — Logs already flow, just set policy
3. ✅ **Create audit log schema** (2h) — Start with table, integrate incrementally

### Before Phase 1 Demo (Next 2 Weeks)
4. ✅ **Implement server-side error tracking** (4h) — Sentry edge or custom
5. ✅ **Write payment failure runbooks** (4h) — Prepare for demo questions
6. ✅ **Deploy Cloudflare WAF** (2h) — Security hygiene

### Before Phase 2 Go-Live (Next 2-3 Months)
7. 🔲 **External penetration test** (hire security firm, ~$5K budget)
8. 🔲 **APM implementation** (Datadog or Sentry Performance)
9. 🔲 **On-call rotation** (define schedule, test escalation)
10. 🔲 **Backup testing** (restore from snapshot, verify data integrity)

### Post-Launch Optimization
11. 🔲 **Synthetic monitoring** (Checkly or custom Playwright tests)
12. 🔲 **Public status page** (BetterStack included, just enable)
13. 🔲 **Core Web Vitals** (Google Lighthouse CI integration)

---

## Success Metrics

### Before Go-Live (P0 Checklist)
- [ ] Server errors visible in Sentry (test: trigger 500, verify event)
- [ ] Audit logs capture login/logout (test: log in, check `audit_logs` table)
- [ ] WAF blocks SQLi attack (test: `?id=1' OR '1'='1`, expect 403)
- [ ] CloudWatch Logs retain 30 days (verify retention policy)
- [ ] BetterStack alerts on downtime (test: stop app, receive alert <5 min)
- [ ] Runbooks tested (simulate BankID failure, follow procedure)

### Production KPIs
- **Uptime:** >99.9% (measured by BetterStack)
- **MTTD (Mean Time To Detect):** <3 minutes (external monitoring interval)
- **MTTR (Mean Time To Recover):** <15 minutes (via runbooks)
- **Error rate:** <0.1% of requests (tracked via Sentry)
- **Log retention:** 100% compliance (30 days CloudWatch, 5 years audit logs)
- **Alert noise:** <5 false positives/week (cooldown + severity tuning)
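
The uptime and MTTR targets interact: an uptime percentage implies a downtime budget, and the MTTR target determines how many incidents fit inside it. A short worked calculation, assuming a 30-day month (the function names are illustrative):

```typescript
// Minutes of downtime allowed in a period for a given uptime target.
function downtimeBudgetMinutes(uptimeTarget: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - uptimeTarget);
}

// Number of full incidents of a given length that fit in the budget.
function incidentsWithinBudget(budgetMinutes: number, mttrMinutes: number): number {
  return Math.floor(budgetMinutes / mttrMinutes);
}

const budget = downtimeBudgetMinutes(0.999, 30); // about 43.2 minutes
console.log(budget.toFixed(1), incidentsWithinBudget(budget, 15));
```

At 99.9% over 30 days the budget is about 43 minutes, so no more than two incidents per month can run to the full 15-minute MTTR target without breaching the uptime KPI.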

---

## Appendices

### A. Related Documentation
- `docs/infrastructure/MONITORING.md` — Current monitoring setup
- `docs/infrastructure/BETTERSTACK-SETUP.md` — External monitoring guide
- `docs/dr-runbook.md` — Infrastructure disaster recovery
- `infrastructure/waf-rules.md` — WAF rule definitions
- `security/drop-security-rapport.md` — Security audit findings

### B. External Services
- BetterStack: https://betterstack.com/uptime
- Sentry: https://sentry.io/
- AWS CloudWatch: https://console.aws.amazon.com/cloudwatch/
- Cloudflare: https://dash.cloudflare.com/

### C. Change History
- 2026-02-22: Initial analysis (John)

---

**Next Actions:**
1. Review this analysis with Alem
2. Approve P0 implementation plan
3. Begin P0 work (estimated 21 hours / 2-3 days)
4. Track progress in Mission Control tasks

# Audit Logging Setup

# Audit Logging Setup — Drop Fintech

**Date:** 2026-02-22
**Priority:** P0 (Production Blocker)
**Compliance:** PSD2, GDPR
**Effort:** 8 hours

---

## Overview

Audit logging provides an **immutable record** of all authentication, authorization, data access, and administrative actions. This is a **legal requirement** for PSD2-regulated payment services and GDPR data protection compliance.

---

## Requirements

### PSD2 Audit Trail Requirements
- All authentication events (login, logout, failed attempts)
- Authorization decisions (who accessed what resource)
- Transaction creation and modification
- KYC/AML review actions
- Administrative user actions
- Data exports and bulk operations
- Retention: **5 years minimum**

### GDPR Right of Access
- Users must be able to request all logged actions related to their data
- Export format: Human-readable (CSV or JSON)

---

## Database Schema

### Migration: `003_audit_logs.sql`

```sql
-- Audit Logs Table (PostgreSQL 16 — ADR-014)
-- Schema managed via Drizzle ORM (src/shared/db/schema.ts)
-- Apply with: make db-push

CREATE TABLE IF NOT EXISTS audit_log (
  id TEXT NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  user_id TEXT,
  action TEXT NOT NULL,
  resource_type TEXT,
  resource_id TEXT,
  details JSONB,
  ip_address TEXT,
  user_agent TEXT,
  request_id TEXT,
  result TEXT NOT NULL DEFAULT 'success', -- 'success', 'failure', 'denied'
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- A partitioned table's primary key must include the partition key
  PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_audit_user_time ON audit_log(user_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_audit_action_time ON audit_log(action, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_audit_resource ON audit_log(resource_type, resource_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_audit_request ON audit_log(request_id);
CREATE INDEX IF NOT EXISTS idx_audit_result ON audit_log(result, timestamp DESC);

-- Monthly partitions (production): create each month's partition before its first row arrives
CREATE TABLE IF NOT EXISTS audit_log_2026_02 PARTITION OF audit_log FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
```

**Migration steps (PostgreSQL 16 via Drizzle ORM):**
1. Schema is defined in `src/shared/db/schema.ts`
2. Apply with:
   ```bash
   make db-push
   # or: cd src/shared && npx drizzle-kit push
   ```
3. Verify table exists:
   ```bash
   psql "$DATABASE_URL" -c "SELECT table_name FROM information_schema.tables WHERE table_name='audit_log';"
   ```

---

## Implementation

### Audit Log Library: `src/lib/audit-log.ts`

```typescript
// Assumption: `run` is exported by the same local `./db` helper as `query` (used below)
import { run } from './db';
import { randomId } from './utils-server';
import { logger } from './logger';

export type AuditAction =
  // Authentication
  | 'login_success'
  | 'login_failure'
  | 'logout'
  | 'password_change'
  | 'session_revoked'
  // Authorization
  | 'access_granted'
  | 'access_denied'
  // Data Access
  | 'data_view'
  | 'data_export'
  | 'data_delete'
  // Transactions
  | 'transaction_created'
  | 'transaction_completed'
  | 'transaction_failed'
  // KYC/AML
  | 'kyc_approved'
  | 'kyc_rejected'
  | 'aml_alert_created'
  | 'aml_alert_reviewed'
  // Admin
  | 'user_created'
  | 'user_updated'
  | 'user_deleted'
  | 'role_changed';

export type AuditResult = 'success' | 'failure' | 'denied';

export interface AuditLogEntry {
  userId?: string;
  action: AuditAction;
  resourceType?: string;
  resourceId?: string;
  metadata?: Record<string, unknown>;
  ip?: string;
  userAgent?: string;
  requestId?: string;
  result?: AuditResult;
}

/**
 * Create an audit log entry
 *
 * IMPORTANT: This function must NEVER throw errors.
 * Audit failures should not block user actions.
 */
export async function auditLog(entry: AuditLogEntry): Promise<void> {
  try {
    const id = randomId('audit');
    const timestamp = new Date().toISOString();

    await run(
      // Table and column names follow the schema above (audit_log, details)
      `INSERT INTO audit_log (
        id, timestamp, user_id, action, resource_type, resource_id,
        details, ip_address, user_agent, request_id, result
      ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
      [
        id,
        timestamp,
        entry.userId || null,
        entry.action,
        entry.resourceType || null,
        entry.resourceId || null,
        entry.metadata ? JSON.stringify(entry.metadata) : null,
        entry.ip || null,
        entry.userAgent || null,
        entry.requestId || null,
        entry.result || 'success',
      ]
    );

    logger.debug('Audit log created', { auditId: id, action: entry.action });
  } catch (error) {
    // Log error but do NOT throw (audit failures should not block operations)
    logger.error('Failed to create audit log', {
      error: error instanceof Error ? error.message : String(error),
      action: entry.action,
    });
  }
}

/**
 * Retrieve audit logs for a user (GDPR Right of Access)
 */
export async function getUserAuditLogs(
  userId: string,
  options?: { limit?: number; offset?: number; startDate?: string; endDate?: string }
): Promise<unknown[]> {
  const { limit = 100, offset = 0, startDate, endDate } = options || {};

  let sql = 'SELECT * FROM audit_log WHERE user_id = ?';
  const params: unknown[] = [userId];

  if (startDate) {
    sql += ' AND timestamp >= ?';
    params.push(startDate);
  }

  if (endDate) {
    sql += ' AND timestamp <= ?';
    params.push(endDate);
  }

  sql += ' ORDER BY timestamp DESC LIMIT ? OFFSET ?';
  params.push(limit, offset);

  const { query } = await import('./db');
  return query(sql, params);
}

/**
 * Export audit logs as CSV (for compliance reporting)
 */
export async function exportAuditLogsCSV(
  filters?: {
    userId?: string;
    action?: AuditAction;
    startDate?: string;
    endDate?: string;
  }
): Promise<string> {
  let sql = 'SELECT * FROM audit_log WHERE 1=1';
  const params: unknown[] = [];

  if (filters?.userId) {
    sql += ' AND user_id = ?';
    params.push(filters.userId);
  }

  if (filters?.action) {
    sql += ' AND action = ?';
    params.push(filters.action);
  }

  if (filters?.startDate) {
    sql += ' AND timestamp >= ?';
    params.push(filters.startDate);
  }

  if (filters?.endDate) {
    sql += ' AND timestamp <= ?';
    params.push(filters.endDate);
  }

  sql += ' ORDER BY timestamp DESC';

  const { query } = await import('./db');
  const rows = await query(sql, params);

  // Convert to CSV
  const headers = [
    'id',
    'timestamp',
    'user_id',
    'action',
    'resource_type',
    'resource_id',
    'details',
    'ip_address',
    'user_agent',
    'request_id',
    'result',
  ];

  const csvRows = [headers.join(',')];

  for (const row of rows as Record<string, unknown>[]) {
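    const values = headers.map((h) => {
      const val = row[h];
      if (val === null || val === undefined) return '';
      // Dates and JSONB objects would otherwise stringify poorly ("[object Object]")
      const text =
        val instanceof Date ? val.toISOString()
        : typeof val === 'object' ? JSON.stringify(val)
        : String(val);
      return text.replace(/"/g, '""'); // Escape quotes
    });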
    csvRows.push(values.map((v) => `"${v}"`).join(','));
  }

  return csvRows.join('\n');
}
```

---

## Integration Points

### 1. Authentication (`src/app/api/auth/login/route.ts`)

```typescript
import { NextRequest } from 'next/server';
import { auditLog } from '@/lib/audit-log';

export async function POST(request: NextRequest) {
  const { email, password } = await request.json();
  // headers.get() returns string | null, while AuditLogEntry expects string | undefined
  const ip =
    request.headers.get('x-forwarded-for') || request.headers.get('x-real-ip') || undefined;
  const userAgent = request.headers.get('user-agent') ?? undefined;
  const requestId = getRequestId(request.headers);

  try {
    const user = await getUserByEmail(email);
    if (!user || !await verifyPassword(password, user.password_hash)) {
      // Audit failed login attempt
      await auditLog({
        userId: user?.id,
        action: 'login_failure',
        metadata: { email, reason: 'invalid_credentials' },
        ip,
        userAgent,
        requestId,
        result: 'failure',
      });

      return jsonError('Invalid credentials', 401);
    }

    // Audit successful login
    await auditLog({
      userId: user.id,
      action: 'login_success',
      metadata: { email },
      ip,
      userAgent,
      requestId,
      result: 'success',
    });

    // ... rest of login logic
  } catch (error) {
    // ... error handling
  }
}
```

### 2. Logout (`src/app/api/auth/logout/route.ts`)

```typescript
await auditLog({
  userId: session.userId,
  action: 'logout',
  metadata: { sessionId: session.id },
  ip,
  userAgent,
  requestId,
});
```

### 3. Data Access (`src/app/api/users/[id]/route.ts`)

```typescript
export async function GET(request: NextRequest, { params }: { params: { id: string } }) {
  const session = await requireAuth(request);
  const userId = params.id;

  // Collect request context once; headers.get() returns null when a header is absent
  const ip = request.headers.get('x-forwarded-for') ?? undefined;
  const userAgent = request.headers.get('user-agent') ?? undefined;
  const requestId = getRequestId(request.headers);

  // Check authorization
  if (session.userId !== userId && session.role !== 'admin') {
    await auditLog({
      userId: session.userId,
      action: 'access_denied',
      resourceType: 'user',
      resourceId: userId,
      metadata: { reason: 'insufficient_permissions' },
      ip,
      userAgent,
      requestId,
      result: 'denied',
    });

    return jsonError('Access denied', 403);
  }

  // Audit successful data access
  await auditLog({
    userId: session.userId,
    action: 'data_view',
    resourceType: 'user',
    resourceId: userId,
    ip,
    userAgent,
    requestId,
  });

  const user = await getUserById(userId);
  return jsonSuccess(user);
}
```

### 4. KYC Approval (`src/app/api/admin/kyc/route.ts`)

```typescript
await auditLog({
  userId: adminSession.userId,
  action: 'kyc_approved',
  resourceType: 'user',
  resourceId: targetUserId,
  metadata: { reason: kycApprovalReason },
  ip: request.headers.get('x-forwarded-for') ?? undefined,
  userAgent: request.headers.get('user-agent') ?? undefined,
  requestId: getRequestId(request.headers),
});
```

### 5. Transaction Creation (`src/app/api/transactions/route.ts`)

```typescript
await auditLog({
  userId: session.userId,
  action: 'transaction_created',
  resourceType: 'transaction',
  resourceId: transactionId,
  metadata: {
    type: transactionType,
    amount: amount,
    currency: currency,
  },
  ip: request.headers.get('x-forwarded-for') ?? undefined,
  userAgent: request.headers.get('user-agent') ?? undefined,
  requestId: getRequestId(request.headers),
});
```
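
Every integration point above reads the same three values from the request headers. A small helper could collect them in one place; the sketch below is an assumption about how that might look, not existing Drop code. The `getAuditContext` name is hypothetical, and `x-request-id` is an assumed header name because the `getRequestId` helper used above is not shown here.

```typescript
interface AuditRequestContext {
  ip?: string;
  userAgent?: string;
  requestId?: string;
}

// Extract the audit-relevant request context from standard Fetch API headers.
function getAuditContext(headers: Headers): AuditRequestContext {
  // x-forwarded-for may hold a comma-separated chain; the first entry is the client.
  const forwarded = headers.get('x-forwarded-for');
  const ip = forwarded?.split(',')[0]?.trim() || headers.get('x-real-ip') || undefined;

  return {
    ip,
    // headers.get() returns null when absent; normalize to undefined
    userAgent: headers.get('user-agent') ?? undefined,
    requestId: headers.get('x-request-id') ?? undefined, // assumed header name
  };
}

const example = getAuditContext(
  new Headers({ 'x-forwarded-for': '1.2.3.4, 10.0.0.1', 'user-agent': 'Mozilla/5.0' })
);
console.log(example.ip); // 1.2.3.4
```

A call site could then spread the result into the entry, for example `auditLog({ userId, action: 'logout', ...getAuditContext(request.headers) })`.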

---

## Compliance Reporting

### GDPR Right of Access (User Data Export)

```typescript
// src/app/api/users/[id]/audit-logs/route.ts
export async function GET(request: NextRequest, { params }: { params: { id: string } }) {
  const session = await requireAuth(request);

  // Users can only access their own audit logs
  if (session.userId !== params.id && session.role !== 'admin') {
    return jsonError('Access denied', 403);
  }

  const logs = await getUserAuditLogs(params.id, {
    limit: 1000, // Caps one response at 1,000 rows; a full GDPR export must page through the rest
    startDate: request.nextUrl.searchParams.get('start') || undefined,
    endDate: request.nextUrl.searchParams.get('end') || undefined,
  });

  return jsonSuccess({ logs });
}
```
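
Because GDPR access requests cover all of a user's records, a single call with a fixed `limit` is only complete for users with fewer rows than the limit. One way to close that gap is to page through the results using the `limit` and `offset` options that `getUserAuditLogs` already accepts. The generic helper below is a sketch; `fetchAllPages` and `PageFetcher` are hypothetical names.

```typescript
type PageFetcher<T> = (limit: number, offset: number) => Promise<T[]>;

// Keep requesting pages until a short (or empty) page signals the end.
async function fetchAllPages<T>(fetchPage: PageFetcher<T>, pageSize = 1000): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await fetchPage(pageSize, offset);
    all.push(...page);
    if (page.length < pageSize) break; // last page reached
  }
  return all;
}

// Hypothetical usage with the library function defined earlier:
// const logs = await fetchAllPages((limit, offset) =>
//   getUserAuditLogs(userId, { limit, offset }));
```

Offset paging over a table that is still receiving writes can skip or repeat rows, so passing a fixed `endDate` for the export window is a reasonable safeguard.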

### PSD2 Audit Trail Export (Admin)

```typescript
// src/app/api/admin/audit/export/route.ts
export async function GET(request: NextRequest) {
  const session = await requireAuth(request);

  if (session.role !== 'admin') {
    return jsonError('Admin access required', 403);
  }

  const startDate = request.nextUrl.searchParams.get('start');
  const endDate = request.nextUrl.searchParams.get('end');
  const action = request.nextUrl.searchParams.get('action');
  const userId = request.nextUrl.searchParams.get('userId');

  const csv = await exportAuditLogsCSV({
    userId: userId || undefined,
    action: action as AuditAction | undefined,
    startDate: startDate || undefined,
    endDate: endDate || undefined,
  });

  return new Response(csv, {
    headers: {
      'Content-Type': 'text/csv',
      'Content-Disposition': `attachment; filename="audit-logs-${new Date().toISOString()}.csv"`,
    },
  });
}
```

---

## Retention Policy

### PSD2 Requirement: 5 Years

**PostgreSQL 16 (all environments — ADR-014):**
- Use table partitioning by month:
  ```sql
  CREATE TABLE audit_log (
    id TEXT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- ... other columns
    PRIMARY KEY (id, timestamp) -- must include the partition key
  ) PARTITION BY RANGE (timestamp);

  -- Create partitions for each month
  CREATE TABLE audit_log_2026_02 PARTITION OF audit_log
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
  ```

- Automatic cleanup script (cron weekly):
  ```bash
  #!/bin/bash
  # Delete audit logs older than 5 years (PSD2 retention)
  psql "$DATABASE_URL" -c "DELETE FROM audit_log WHERE timestamp < NOW() - INTERVAL '5 years';"
  ```

---

## Testing

### Test Audit Logging

```bash
# 1. Create audit log entry
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"wrong"}'

# 2. Check audit log table (PostgreSQL 16)
psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 5;"

# Expected output:
# audit_123 | 2026-02-22T10:00:00.000Z | usr_456 | login_failure | ... | {"email":"test@example.com","reason":"invalid_credentials"} | 1.2.3.4 | Mozilla/5.0... | req_789 | failure

# 3. Export audit logs (admin)
curl -X GET "http://localhost:3000/api/admin/audit/export?start=2026-02-01&end=2026-02-28" \
  -H "Cookie: auth-token=<admin-jwt>" \
  > audit-logs.csv

# 4. Verify CSV format
head -n 5 audit-logs.csv
```

---

## Monitoring

### Alert on Audit Failures

Add to `src/lib/audit-log.ts`:

```typescript
import { sendAlert } from './alerts';

export async function auditLog(entry: AuditLogEntry): Promise<void> {
  try {
    // ... insert logic
  } catch (error) {
    logger.error('Failed to create audit log', { error, action: entry.action });

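    // CRITICAL: Alert if audit logging fails (compliance risk)
    try {
      await sendAlert({
        severity: 'critical',
        title: 'Audit logging failure',
        message: `Failed to record ${entry.action} for user ${entry.userId}`,
      });
    } catch (alertError) {
      // auditLog() must never throw, so a failed alert is only logged
      logger.error('Failed to send audit failure alert', { error: alertError });
    }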
  }
}
```

### Metrics to Track

- Audit logs created per hour (should correlate with user activity)
- Failed audit log attempts (should be zero)
- Audit log export requests (GDPR compliance)
- Audit log storage size (retention planning)

---

## Security Considerations

### Immutability
- Audit logs should NEVER be updated or deleted (except by automated retention policy)
- No UPDATE or DELETE API endpoints for audit logs
- Database permissions: Read-only for application, Write-only for audit service

### Access Control
- Only admins can view full audit trails
- Users can view their own audit logs only
- Export requires elevated permissions

### Data Redaction
- Do NOT log passwords, tokens, or sensitive PII in metadata
- Card numbers: Log last 4 digits only
- Fødselsnummer: Log checksum/hash, not full number
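
The redaction rules above can be applied before values reach `metadata`. The sketch below is illustrative only and the function names are hypothetical. It uses a keyed hash (HMAC-SHA-256) for the fødselsnummer rather than a plain hash, because an unkeyed hash of an 11-digit number can be reversed by enumeration; the secret key is assumed to come from the application's secret store.

```typescript
import { createHmac } from 'node:crypto';

// Keep only the last four digits of a card number, ignoring spaces and dashes.
function lastFourDigits(cardNumber: string): string {
  const digits = cardNumber.replace(/\D/g, '');
  return digits.slice(-4);
}

// Keyed hash of a national ID so the same person maps to the same token
// without the ID itself being stored. `secret` is an assumed app-level key.
function hashNationalId(nationalId: string, secret: string): string {
  return createHmac('sha256', secret).update(nationalId).digest('hex');
}

console.log(lastFourDigits('4111 1111 1111 1111')); // 1111
```

The resulting values can then be logged in place of the originals, for example `metadata: { cardLast4: lastFourDigits(card), nationalIdHash: hashNationalId(fnr, key) }`.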

---

## Checklist

- [ ] Migration `003_audit_logs.sql` created
- [ ] Migration applied to dev database
- [ ] `src/lib/audit-log.ts` implemented
- [ ] Login/logout endpoints integrated
- [ ] Data access endpoints integrated
- [ ] KYC/AML admin actions integrated
- [ ] GDPR export endpoint created
- [ ] PSD2 CSV export endpoint created
- [ ] Retention policy documented
- [ ] Monitoring alerts configured
- [ ] Testing completed (manual + automated)
- [ ] Documentation updated (API docs, compliance docs)

---

**Next Steps:**
1. Create migration file
2. Implement `audit-log.ts` library
3. Integrate into auth routes (high priority)
4. Add to remaining endpoints incrementally
5. Test with real login/logout flows
6. Deploy to staging for verification

# Runbooks

Operational runbooks for failure scenarios

# Runbook: AISP Balance Failure

# Runbook: AISP Balance Fetch Failure

**Service:** AISP (Account Information Service Provider)
**Severity:** MEDIUM (users can't see bank balance)
**MTTR Target:** <20 minutes
**Owner:** John (AI Director)

---

## Symptoms

Users report they cannot see their bank account balance in Drop. Symptoms include:

- Dashboard shows "Balance unavailable" or stale balance
- Error message: "Could not fetch account information"
- Infinite loading spinner on balance widget
- Balance shows "0 kr" or "—" instead of actual amount

**User impact:** Cannot verify available funds before making payments (may lead to insufficient funds errors).

---

## Diagnosis

### 1. Check Neonomics AISP Status

**External status:**
```bash
# Neonomics has no public status page — test via API
curl -X GET https://api.neonomics.io/health \
  -H "Authorization: Bearer <api-key>" \
  -v

# Expected: HTTP 200
# If 500/503: Neonomics outage
```

**Check specific bank connectivity:**
```bash
# List supported banks and their status
curl -X GET https://api.neonomics.io/banks \
  -H "Authorization: Bearer <api-key>" \
  | jq '.[] | select(.country == "NO") | {name, status}'

# Look for: "status": "degraded" or "offline"
```

### 2. Check Drop Logs

```bash
# CloudWatch Logs (production)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "aisp" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "AISP consent expired"
# - "AISP API timeout"
# - "AISP 401 Unauthorized"
# - "Bank API unavailable: DNB"
```

### 3. Check User Consent Status

```bash
# Verify Open Banking consent hasn't expired
# Consent is valid for 90 days from last authorization

# Check database for expired consents (PostgreSQL 16)
psql "$DATABASE_URL" <<EOF
SELECT
  user_id,
  bank_name,
  consent_expires_at,
  EXTRACT(EPOCH FROM (consent_expires_at - NOW())) / 86400 AS days_remaining
FROM bank_accounts
WHERE consent_expires_at < NOW() + INTERVAL '7 days'
ORDER BY consent_expires_at ASC
LIMIT 10;
EOF

# If days_remaining < 0: consent expired
# If days_remaining < 7: warn user to renew soon
```

### 4. Test AISP Flow

**Manual test (staging):**
```bash
# 1. Login
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

# 2. Fetch balance
curl -X GET https://drop-staging.fly.dev/api/accounts/balance \
  -H "Authorization: Bearer $TOKEN" \
  -v

# Expected: HTTP 200, { "balance": 15000.50, "currency": "NOK" }
# If 401: Consent expired
# If 500: AISP integration broken
```

### 5. Check Rate Limiting

```bash
# Check if Neonomics API rate limit exceeded
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "rate_limit" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  | jq '.events[].message' \
  | grep -E "429|X-RateLimit"

# If many 429 errors: rate limiting issue
```

---

## Common Causes & Solutions

### Cause 1: Expired Open Banking Consent

**Probability:** 40% (PSD2 consent expires after 90 days)

**Symptoms:**
- Error code: `CONSENT_EXPIRED` or `CONSENT_INVALID`
- Logs show: "AISP consent no longer valid"
- Specific users affected (not all users)

**Solution:**

1. **Identify affected users:**
   ```sql
   -- PostgreSQL 16
   SELECT user_id, email, bank_name, consent_expires_at
   FROM bank_accounts
   JOIN users ON users.id = bank_accounts.user_id
   WHERE consent_expires_at < NOW();
   ```

2. **Notify users to re-authorize:**

   **Push notification (Norwegian):**
   ```
   Banktilkobling utløpt
   Godkjenningen for å hente saldo fra [Bank] har utløpt.
   Trykk her for å fornye tilkoblingen.
   ```

   **Email (Norwegian):**
   ```
   Emne: Godkjenn tilgang til bankkonto på nytt

   Hei,

   Din godkjenning for å vise saldo fra [Bank] har utløpt etter 90 dager.
   Dette er et PSD2-sikkerhetskrav.

   Logg inn i Drop og koble til bankkontoen på nytt for å fortsette å se saldoen din.

   Mvh,
   Drop
   ```

3. **Guide user through re-consent:**
   - User taps notification → redirect to "Reconnect Bank Account" screen
   - Initiate new AISP consent flow (BankID + bank authorization)
   - Update `consent_expires_at` = NOW() + INTERVAL '90 days'

4. **Automatic consent renewal reminder:**
   ```bash
   # Cron job to warn users 7 days before expiry
   # Send reminder: "Your bank connection expires in 7 days, renew now"
   ```
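
The reminder job in step 4 comes down to selecting consents that expire within the next 7 days but have not expired yet. A minimal sketch of that selection follows; the `BankConsent` shape loosely mirrors the `bank_accounts` columns queried earlier, and all names are assumptions rather than existing Drop code.

```typescript
interface BankConsent {
  userId: string;
  bankName: string;
  consentExpiresAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Select consents that are still valid but expire within `windowDays`.
function consentsNeedingReminder(
  consents: BankConsent[],
  now: Date,
  windowDays = 7
): BankConsent[] {
  const nowMs = now.getTime();
  const cutoffMs = nowMs + windowDays * DAY_MS;
  return consents.filter((consent) => {
    const expiresMs = consent.consentExpiresAt.getTime();
    return expiresMs > nowMs && expiresMs <= cutoffMs;
  });
}
```

Already-expired consents are excluded on purpose: those users need the re-authorization message from step 2, not a reminder.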

**ETA:** Immediate (user action required)

---

### Cause 2: Bank API Outage or Maintenance

**Probability:** 15% (specific bank temporarily unavailable)

**Symptoms:**
- All users of specific bank (e.g., DNB, Nordea) cannot fetch balance
- Other banks work fine
- Logs show: "Bank API timeout" or "502 Bad Gateway"

**Solution:**

1. **Identify affected bank:**
   ```bash
   # Check which bank is failing
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern "Bank API" \
     --start-time $(date -u -d '30 minutes ago' +%s)000 \
     | jq '.events[].message' \
     | grep -o '"bank":"[^"]*"' \
     | sort | uniq -c | sort -rn

   # Example output: "bank":"DNB" appears 50 times
   ```

2. **Check bank status:**
   - Visit bank's website: check for maintenance announcements
   - Norwegian banks often schedule maintenance 02:00-06:00 CET
   - DNB status: https://www.dnb.no/drift
   - Nordea status: https://www.nordea.no/info/driftsmeldinger

3. **Notify affected users (Norwegian):**
   ```
   Emne: Saldo midlertidig utilgjengelig for [Bank]

   Hei,

   Vi opplever for øyeblikket problemer med å hente saldo fra [Bank].
   Dette skyldes tekniske problemer hos banken.

   Du kan fortsatt gjøre betalinger, men saldoen vises ikke akkurat nå.
   Vi jobber med å gjenopprette tjenesten.

   Estimert løsning: [X minutter/timer]

   Mvh,
   Drop
   ```

4. **Implement graceful degradation:**
   ```typescript
   // src/app/api/accounts/balance/route.ts
   async function fetchBalance(userId: string) {
     try {
       return await neonomicsClient.getBalance(userId);
     } catch (error) {
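       // `error` is `unknown` in strict TypeScript, so narrow before reading `code`
       const code = (error as { code?: string } | null)?.code;
       if (code === 'BANK_API_TIMEOUT') {
         // Return cached balance with warning
         const cached = await getCachedBalance(userId);
         return {
           balance: cached?.balance ?? null, // `??` keeps a legitimate 0 balance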
           currency: 'NOK',
           lastUpdated: cached?.timestamp,
           warning: 'Balance may be outdated due to bank API issues'
         };
       }
       throw error;
     }
   }
   ```

**ETA:** Depends on bank (typically <2 hours for maintenance, <1 hour for incidents)

---

### Cause 3: Neonomics API Outage

**Probability:** 10% (Neonomics service disruption)

**Symptoms:**
- ALL users cannot fetch balance regardless of bank
- Logs show: "Neonomics API unreachable" or HTTP 503
- Test API call to Neonomics fails

**Solution:**

1. **Verify Neonomics outage:**
   ```bash
   # Test Neonomics health endpoint
   curl -X GET https://api.neonomics.io/health \
     -H "Authorization: Bearer <api-key>" \
     -v

   # If timeout or 503: confirmed outage
   ```

2. **Contact Neonomics support:**
   - Email: support@neonomics.io
   - Slack: #neonomics-support (if available)
   - Check Neonomics Slack for incident updates

3. **Enable fallback mode:**
   ```bash
   # Show cached balances to all users
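   # Runtime environment variables are set via ImageConfiguration in --source-configuration
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={AISP_FALLBACK_MODE=cached,AISP_FALLBACK_CACHE_TTL=3600}}}"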
   ```

4. **Communicate to users (Norwegian):**
   ```
   Emne: Saldo vises med forsinkelse

   Hei,

   Vår leverandør for bankdata opplever tekniske problemer.
   Saldoen du ser kan være opptil 1 time gammel.

   Du kan fortsatt gjøre betalinger som normalt.
   Vi forventer at tjenesten er tilbake innen [X minutter].

   Mvh,
   Drop
   ```

5. **Monitor Neonomics status:**
   - Check every 10 minutes for resolution
   - When API is back: disable fallback mode
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={AISP_FALLBACK_MODE=live}}}"
   ```

**ETA:** Depends on Neonomics (typically <2 hours)

---

### Cause 4: Invalid or Revoked API Credentials

**Probability:** 5% (after credential rotation or account issue)

**Symptoms:**
- Logs show: "401 Unauthorized" or "invalid_api_key"
- All AISP requests fail immediately
- Other Drop services work fine (auth, database, etc.)

**Solution:**

1. **Verify Neonomics API credentials:**
   ```bash
   bw get item "Neonomics API" --session $BW_SESSION

   # Check:
   # - API key is not expired
   # - API key has AISP permissions
   # - Correct environment (production vs sandbox)
   ```

2. **Update App Runner environment variables:**
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEONOMICS_API_KEY=<correct-key>,NEONOMICS_ENVIRONMENT=production}}}"
   ```

3. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn <ARN> --region eu-west-1

   # Wait 3-5 minutes for deployment to complete
   ```

4. **Test after deployment:**
   ```bash
   # Verify AISP working
   curl -X GET https://getdrop.no/api/accounts/balance \
     -H "Authorization: Bearer <test-user-token>" \
     -v

   # Expected: HTTP 200 with balance data
   ```

**ETA:** 10 minutes

---

### Cause 5: Network or Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- AISP API requests never reach Neonomics
- Other external APIs may also fail

**Solution:**

1. **Check outbound connectivity:**
   ```bash
   # App Runner egress is unrestricted by default
   # If using VPC connector, check security group
   aws ec2 describe-security-groups \
     --group-ids <vpc-connector-sg> \
     --region eu-west-1 \
     | jq '.SecurityGroups[].IpPermissionsEgress'
   ```

2. **Test DNS resolution:**
   ```bash
   # From your local machine or bastion host
   nslookup api.neonomics.io

   # Should resolve to Neonomics IP
   # If NXDOMAIN: DNS issue
   ```

3. **Check AWS service health:**
   ```bash
   # Check App Runner service events
   aws apprunner list-operations \
     --service-arn <ARN> \
     --region eu-west-1 \
     | jq '.OperationSummaryList[] | select(.Type == "CREATE_SERVICE" or .Type == "UPDATE_SERVICE")'

   # Look for recent errors
   ```

4. **Whitelist Neonomics IPs (if using strict firewall):**
   - Contact Neonomics for IP ranges
   - Add to security group outbound rules
   - Allow HTTPS (443) to Neonomics endpoints

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)

---

### Cause 6: Rate Limiting (High Traffic)

**Probability:** 10% (during peak hours or viral event)

**Symptoms:**
- Logs show: HTTP 429 "Too Many Requests"
- Intermittent failures (some users see balance, others don't)
- Rate limit headers in logs

**Solution:**

1. **Check rate limit headers:**
   ```bash
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern "X-RateLimit" \
     --start-time $(date -u -d '5 minutes ago' +%s)000 \
     | jq -r '.events[].message' \
     | grep -E "X-RateLimit-(Limit|Remaining|Reset)"
   ```

2. **Implement request throttling:**
   ```typescript
   // src/lib/aisp-client.ts
   import PQueue from 'p-queue';

   const queue = new PQueue({
     concurrency: 10,      // Max 10 concurrent requests
     interval: 1000,        // Per second
     intervalCap: 50        // Max 50 requests per second
   });

   export async function fetchBalance(userId: string) {
     return queue.add(() => neonomicsClient.getBalance(userId));
   }
   ```

3. **Cache balance aggressively during rate limit:**
   ```typescript
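   // src/lib/balance-cache.ts
   const CACHE_TTL_NORMAL = 60;      // Fresh entry: 60 seconds
   const CACHE_TTL_RATE_LIMIT = 300; // Stale fallback copy: 5 minutes

   export async function getBalanceWithCache(userId: string) {
     const cached = await redis.get(`balance:${userId}`);
     if (cached) return JSON.parse(cached);

     try {
       const balance = await fetchBalance(userId);
       const payload = JSON.stringify(balance);
       await redis.setex(`balance:${userId}`, CACHE_TTL_NORMAL, payload);
       // Keep a longer-lived copy to serve if the provider starts returning 429
       await redis.setex(`balance:stale:${userId}`, CACHE_TTL_RATE_LIMIT, payload);
       return balance;
     } catch (error) {
       if ((error as { status?: number }).status === 429) {
         // The fresh key has already expired at this point, so extending its
         // TTL would do nothing; fall back to the stale copy instead
         const stale = await redis.get(`balance:stale:${userId}`);
         if (stale) return JSON.parse(stale);
       }
       throw error;
     }
   }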
   ```

4. **Contact Neonomics to increase rate limit:**
   - Email support with traffic stats
   - Request higher API quota for production
   - Provide justification (user growth, peak times)

**ETA:** 5 minutes (automatic caching), 1-2 days (if quota increase needed)

---

## Emergency Workarounds

### Option 1: Cached Balance Mode

**Use case:** AISP provider down >30 minutes, users need to see approximate balance

**Steps:**

1. Enable cached balance fallback:
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={AISP_MODE=cached,AISP_CACHE_TTL=3600}}}"
   ```

2. Show warning banner in app:
   ```
   ⚠️ Saldo vises med forsinkelse
   Vi viser din sist kjente saldo fra [timestamp].
   Tjenesten er tilbake til normal snart.
   ```

3. Allow payments to proceed:
   - Users can still initiate payments (PISP)
   - Balance check uses cached value
   - Risk: Insufficient funds errors if balance changed

4. **Revert when AISP is back:**
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={AISP_MODE=live}}}"
   ```

**Risk:** Cached balance may be stale (up to 1 hour old). Users may attempt payments with insufficient funds.
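
To make that staleness visible, the banner can state how old the cached value is and stop showing it once it passes the one-hour limit. The sketch below assumes a `CachedBalance` shape with a `fetchedAt` timestamp; the type and function names are hypothetical and not taken from the codebase.

```typescript
// Hypothetical cached-balance shape; field names are assumptions for illustration.
interface CachedBalance {
  balance: number;
  currency: string;
  fetchedAt: Date;
}

const MAX_STALE_MS = 60 * 60 * 1000; // 1 hour, matching the 3600-second TTL above

// Describes how old a cached balance is and whether it should still be shown.
function describeCachedBalance(cached: CachedBalance, now: Date) {
  const ageMs = Math.max(0, now.getTime() - cached.fetchedAt.getTime());
  const ageMinutes = Math.floor(ageMs / 60_000);
  const label =
    ageMinutes < 1
      ? "Updated just now"
      : `Updated ${ageMinutes} minute${ageMinutes === 1 ? "" : "s"} ago`;
  return { ageMinutes, isTooStale: ageMs > MAX_STALE_MS, label };
}
```

When `isTooStale` is true, Option 2 (hide the balance, allow payments) is likely the safer presentation.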

---

### Option 2: Hide Balance, Allow Payments

**Use case:** AISP down, no reliable cache, but PISP still works

**Steps:**

1. Show "Balance unavailable" message:
   ```
   Saldo midlertidig utilgjengelig
   Du kan fortsatt gjøre betalinger som normalt.
   Banken vil avvise betalingen hvis du ikke har nok midler.
   ```

2. Allow payments without balance check:
   - User enters payment amount
   - Drop initiates payment via PISP
   - Bank performs real-time balance check
   - If insufficient funds: bank rejects, user gets clear error

3. Communicate ETA to users:
   ```
   Vi jobber med å gjenopprette saldovisning.
   Estimert tid: [X minutter]
   ```

**Risk:** User experience degraded. May attempt failed payments.

---

## Post-Incident Actions

1. **Send renewal reminders for consents that have expired or expire within 7 days:**
   ```sql
   -- PostgreSQL 16: send renewal reminders 7 days before expiry
   SELECT user_id, email, consent_expires_at
   FROM bank_accounts
   JOIN users ON users.id = bank_accounts.user_id
   WHERE consent_expires_at < NOW() + INTERVAL '7 days'
   AND consent_renewal_reminder_sent = FALSE;
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-aisp-failure.md
   ```

3. **Review caching strategy:**
   - Is cache TTL appropriate?
   - Should we cache balance longer during incidents?
   - Add metrics: cache hit rate, staleness

4. **Update monitoring:**
   - Add synthetic AISP test (fetch balance every 5 min)
   - Alert on AISP failure rate >10%
   - Track consent expiry dates

5. **Improve user communication:**
   - Auto-notify users when AISP is degraded
   - Show balance age: "Updated 5 minutes ago"
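
The ">10% failure rate" alert in step 4 needs a rolling view of recent AISP calls. One way to compute it is sketched here, assuming outcomes are recorded in process memory; the class and function names are hypothetical, and a real deployment would more likely derive this from CloudWatch metrics.

```typescript
// Tracks request outcomes in a sliding time window and reports the failure rate.
class FailureRateWindow {
  private events: { at: number; ok: boolean }[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  record(ok: boolean, at: number = Date.now()): void {
    this.events.push({ at, ok });
  }

  // Failure rate in [0, 1] over the window ending at `now`; 0 when there is no data.
  failureRate(now: number = Date.now()): number {
    this.events = this.events.filter((e) => now - e.at <= this.windowMs);
    if (this.events.length === 0) return 0;
    const failures = this.events.filter((e) => !e.ok).length;
    return failures / this.events.length;
  }
}

const AISP_FAILURE_THRESHOLD = 0.1; // the >10% threshold from step 4

function shouldAlert(tracker: FailureRateWindow, now: number = Date.now()): boolean {
  return tracker.failureRate(now) > AISP_FAILURE_THRESHOLD;
}
```

Returning 0 for an empty window avoids paging on no traffic, at the cost of staying silent when the synthetic test itself stops running, so that case needs its own check.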

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 10 min | If Neonomics outage confirmed, notify Alem |
| 20 min | If not resolved, enable cached balance mode |
| 1 hour | Public communication to users (Norwegian email/push) |
| 2 hours | Contact Neonomics support via phone if no response |

---

## Contacts

- **Neonomics Support:** support@neonomics.io
- **Neonomics Slack:** #neonomics-support (if available)
- **Internal:** Alem (CEO, final decision on fallback modes)

---

## Related Documentation

- `docs/architecture/open-banking.md` — AISP flow diagrams
- `src/app/api/accounts/balance/route.ts` — Balance fetch implementation
- `docs/compliance/psd2-requirements.md` — PSD2 consent rules (90-day expiry)
- Vaultwarden item: "Neonomics API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: BankID Failure

# Runbook: BankID Integration Failure

**Service:** BankID OAuth Authentication
**Severity:** CRITICAL (blocks all logins)
**MTTR Target:** <15 minutes
**Owner:** John (AI Director)

---

## Symptoms

Users report they cannot log in. Symptoms include:

- Login button doesn't redirect to BankID
- BankID redirect returns error page
- OAuth callback fails with 401/403
- Error message: "Authentication service unavailable"

---

## Diagnosis

### 1. Check BankID Service Status

**External status page:**
```bash
# Check BankID status (no official status page, monitor Twitter)
# URLs are quoted so the shell does not treat "?" as a glob character
open "https://twitter.com/search?q=BankID%20Norge"

# Or check community forums
open "https://www.reddit.com/r/Norge/search?q=BankID"
```

**Quick test:**
```bash
# Try BankID login from another service (e.g., tax portal)
open https://www.skatteetaten.no/person/
# If BankID works there but not in Drop → problem is our integration
```

### 2. Check Drop Logs

```bash
# CloudWatch Logs (production)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "bankid" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "BankID OAuth error: invalid_client"
# - "BankID callback failed: invalid_state"
# - "BankID API timeout"
```

### 3. Check Environment Variables

```bash
# Verify BankID credentials are set
aws apprunner describe-service \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep BANKID

# Expected:
# BANKID_CLIENT_ID: <client-id>
# BANKID_CLIENT_SECRET: <exists> (value hidden)
# BANKID_CALLBACK_URL: https://getdrop.no/api/auth/bankid/callback
```

### 4. Check OAuth Flow

**Test OAuth initiation:**
```bash
# Start OAuth flow
curl -X POST https://getdrop.no/api/auth/bankid/initiate \
  -H "Content-Type: application/json" \
  -d '{"redirectUrl": "/dashboard"}' \
  -v

# Expected: HTTP 302 redirect to BankID with state parameter
# If 500: Check BANKID_CLIENT_ID and BANKID_CALLBACK_URL
```

**Test OAuth callback:**
```bash
# Simulate callback (replace <code> and <state> with real values from BankID redirect)
curl -X GET "https://getdrop.no/api/auth/bankid/callback?code=<code>&state=<state>" \
  -v

# Expected: HTTP 302 redirect to /dashboard with auth cookie
# If 401: Check BANKID_CLIENT_SECRET
# If 400: Check state validation logic
```

---

## Common Causes & Solutions

### Cause 1: BankID Service Outage (External)

**Probability:** 5% (BankID is highly reliable)

**Symptoms:**
- All BankID logins fail across all services
- BankID confirms an incident through its official channels
- Social media mentions BankID outage

**Solution:**
1. **Communicate:** Post status update to users
   ```
   Subject: Login temporarily unavailable
   Body: BankID authentication is experiencing issues.
         We're monitoring the situation and will restore service
         as soon as BankID is back online. Estimated: <X> minutes.
   ```

2. **Monitor:** Watch BankID Twitter/status for updates

3. **Fallback (if available):** If demo mode exists, consider temporary activation:
   ```bash
   # Enable demo mode (ONLY in emergency, requires Alem approval)
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=demo}}}"
   ```

4. **Post-incident:** Document outage duration, user impact

**ETA:** Depends on BankID (typically <2 hours)

---

### Cause 2: Invalid OAuth Credentials

**Probability:** 20% (after credential rotation or environment change)

**Symptoms:**
- Logs show: "invalid_client" or "unauthorized_client"
- OAuth flow fails immediately (no redirect to BankID)

**Solution:**
1. **Verify credentials in Vaultwarden:**
   ```bash
   bw get item "BankID OAuth" --session $BW_SESSION
   ```

2. **Update App Runner environment variables:**
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={BANKID_CLIENT_ID=<correct-client-id>,BANKID_CLIENT_SECRET=<correct-secret>}}}"
   ```

3. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn <ARN> --region eu-west-1
   ```

4. **Test:** Attempt login after deployment completes (3-5 minutes)

**ETA:** 10 minutes

---

### Cause 3: Callback URL Mismatch

**Probability:** 15% (after domain change or deployment error)

**Symptoms:**
- Logs show: "redirect_uri_mismatch"
- BankID redirects to wrong URL (404 or CORS error)

**Solution:**
1. **Check registered callback URL in BankID portal:**
   - Login to BankID integration portal
   - Navigate to OAuth settings
   - Verify callback URL: `https://getdrop.no/api/auth/bankid/callback`

2. **If mismatch, update BankID portal:**
   - Change redirect URI to match current domain
   - Save changes (may require approval, 1-2 hours)

3. **Update App Runner env var:**
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={BANKID_CALLBACK_URL=https://getdrop.no/api/auth/bankid/callback}}}"
   ```

4. **Test:** Login flow should work after both changes

**ETA:** 15 minutes (if no BankID approval required), 2 hours (if approval needed)

---

### Cause 4: State Parameter Validation Failure

**Probability:** 10% (race condition or session timeout)

**Symptoms:**
- Logs show: "Invalid state parameter"
- User completes BankID flow but callback rejects

**Solution:**
1. **Check session storage:**
   - BankID state is stored in server session
   - If session expires before callback (>10 min), state is lost

2. **Increase session timeout (if needed):**
   ```typescript
   // src/lib/auth.ts
   const SESSION_TIMEOUT = 15 * 60 * 1000; // 15 minutes (was 10)
   ```

3. **Clear stale sessions:**
   ```bash
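   # If using Redis for sessions: keys stored with a TTL expire on their own.
   # FLUSHDB deletes EVERY key in the selected database, including active
   # sessions and cached data, so treat it as a last resort.
   redis-cli FLUSHDB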

   # If using database sessions
   sqlite3 drop.db "DELETE FROM sessions WHERE expires_at < datetime('now');"
   ```

4. **Ask user to retry:** State timeout is usually one-time issue

**ETA:** 5 minutes
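
Logs for this cause are easier to read when the callback distinguishes an expired state from a mismatched one. A minimal sketch of such a check follows, assuming the state is stored server-side with its creation time; `StoredState`, `createState`, and `checkState` are hypothetical names, not the existing `src/lib/auth.ts` API.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

const STATE_TTL_MS = 10 * 60 * 1000; // the 10-minute session window from step 1

// Hypothetical record stored server-side when the OAuth flow starts.
interface StoredState {
  value: string;
  createdAt: number;
}

function createState(now: number = Date.now()): StoredState {
  return { value: randomBytes(32).toString("hex"), createdAt: now };
}

type StateCheck = "ok" | "missing" | "expired" | "mismatch";

function checkState(
  stored: StoredState | undefined,
  returned: string,
  now: number = Date.now(),
): StateCheck {
  if (!stored) return "missing";
  if (now - stored.createdAt > STATE_TTL_MS) return "expired";
  const a = Buffer.from(stored.value);
  const b = Buffer.from(returned);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (a.length !== b.length || !timingSafeEqual(a, b)) return "mismatch";
  return "ok";
}
```

An `expired` result points at the session timeout in step 2, while `missing` or `mismatch` suggests lost session storage or a tampered callback.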

---

### Cause 5: BankID API Rate Limiting

**Probability:** 5% (during high-traffic events)

**Symptoms:**
- Logs show: "rate_limit_exceeded" or HTTP 429
- Intermittent failures (some users succeed, others fail)

**Solution:**
1. **Check rate limit headers in logs:**
   ```
   X-RateLimit-Limit: 100
   X-RateLimit-Remaining: 0
   X-RateLimit-Reset: 1640000000
   ```

2. **Wait for rate limit reset:** Typically resets every 60 seconds

3. **Implement exponential backoff (if not present):**
   ```typescript
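   // src/lib/bankid-client.ts
   const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

   async function callBankIDAPI(url: string, retries = 3): Promise<Response> {
     const response = await fetch(url);
     // fetch resolves on HTTP errors, so check the status instead of catching
     if (response.status === 429 && retries > 0) {
       await sleep(1000 * 2 ** (3 - retries)); // 1s, 2s, 4s
       return callBankIDAPI(url, retries - 1);
     }
     return response;
   }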
   ```

4. **Contact BankID support:** If rate limits are too low for production traffic

**ETA:** 5 minutes (automatic), 1-2 days (if support ticket needed)
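
Instead of guessing the reset interval, the wait can be derived from the headers shown in step 1. This sketch assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, as the example value suggests; whether BankID actually sends these headers, and in this format, should be confirmed against real responses. The function name is hypothetical.

```typescript
// Computes how long to wait before retrying, based on rate-limit response headers.
// Assumes X-RateLimit-Reset is a Unix timestamp in seconds.
function msUntilRateLimitReset(headers: Headers, nowMs: number = Date.now()): number {
  const remainingRaw = headers.get("X-RateLimit-Remaining");
  const resetRaw = headers.get("X-RateLimit-Reset");
  if (remainingRaw === null || resetRaw === null) return 0; // no rate-limit info

  const remaining = Number(remainingRaw);
  const resetSeconds = Number(resetRaw);
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSeconds)) return 0;

  if (remaining > 0) return 0; // quota still available
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

A return value of 0 means "no reason to wait that we can see", so callers should still apply the backoff from step 3 when a 429 arrives without usable headers.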

---

### Cause 6: Network/Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- BankID API requests never reach destination

**Solution:**
1. **Check outbound rules (App Runner → BankID):**
   ```bash
   # App Runner egress is unrestricted by default
   # Check VPC connector security group (if using VPC)
   aws ec2 describe-security-groups --group-ids <vpc-connector-sg> --region eu-west-1
   ```

2. **Test connectivity from container:**
   ```bash
   # Exec into running container (if possible)
   curl -v https://oidc.bankid.no/.well-known/openid-configuration

   # Expected: HTTP 200 with JSON response
   # If timeout: Network/firewall issue
   ```

3. **Check DNS resolution:**
   ```bash
   nslookup oidc.bankid.no
   # Should resolve to BankID IP addresses
   ```

4. **Whitelist BankID IPs (if using strict firewall):**
   - Contact BankID for IP ranges
   - Add to AWS security group outbound rules

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)

---

## Emergency Workarounds

### Option 1: Fallback to Demo Mode (Temporary)

**Use case:** BankID outage affects all users, estimated >1 hour downtime

**Steps:**
1. Enable demo mode:
   ```bash
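   # NEXT_PUBLIC_* values are inlined into the client bundle at build time,
   # so browser code only sees this change after a rebuild
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=demo}}}"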
   ```

2. Communicate to users:
   ```
   Subject: Temporary login method available
   Body: Due to BankID outage, we've enabled demo login.
         Use email/password to access your account.
         BankID will be restored as soon as possible.
   ```

3. Monitor BankID status

4. **Revert to BankID when available:**
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...,ImageConfiguration={RuntimeEnvironmentVariables={NEXT_PUBLIC_SERVICE_MODE=live}}}"
   ```

**Risk:** Demo mode may bypass KYC checks. Only use with Alem approval.

---

### Option 2: Redirect to Status Page

**Use case:** BankID outage, no ETA, no fallback available

**Steps:**
1. Deploy maintenance page:
   ```bash
   # Update health endpoint to return 503
   # This triggers BetterStack alert + status page update
   ```

2. Show user-friendly message:
   ```html
   <h1>Login Temporarily Unavailable</h1>
   <p>Our authentication provider (BankID) is experiencing issues.</p>
   <p>We expect service to resume within <strong>X minutes</strong>.</p>
   <p>Status updates: <a href="https://status.drop.no">status.drop.no</a></p>
   ```

3. Monitor and communicate updates every 30 minutes

---

## Post-Incident Actions

1. **Document incident:**
   ```bash
   # Create incident report
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-bankid-failure.md
   ```

2. **Root cause analysis:**
   - What triggered the failure?
   - Why didn't monitoring detect it sooner?
   - What prevented faster recovery?

3. **Update monitoring:**
   - Add synthetic BankID login test (every 5 min)
   - Alert on OAuth callback failures >5/min

4. **Update runbook:**
   - Add new failure mode if discovered
   - Improve diagnosis steps based on what worked

5. **Team debrief (if >30 min outage):**
   - Review timeline
   - Identify improvements
   - Update on-call procedures

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 5 min | If not resolved, alert Alem via Slack + SMS |
| 15 min | If BankID outage confirmed, enable fallback (Alem approval) |
| 30 min | If still unresolved, schedule team call |
| 1 hour | If major outage, public communication via email/social media |

---

## Contacts

- **BankID Support:** support@bankid.no
- **BankID Phone:** +47 XXXX XXXX (24/7 for critical issues)
- **Internal:** Alem (CEO, final decision on fallback modes)

---

## Related Documentation

- `docs/architecture/authentication.md` — BankID OAuth flow
- `src/app/api/auth/bankid/route.ts` — BankID integration code
- `docs/dr-runbook.md` — Infrastructure disaster recovery
- Vaultwarden item: "BankID OAuth" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: PISP Payment Failure

# Runbook: PISP Payment Failure (Remittance & QR)

**Service:** Payment Initiation (PISP via Open Banking)
**Severity:** HIGH (blocks money transfers)
**MTTR Target:** <30 minutes
**Owner:** John (AI Director)

---

## Overview

PISP (Payment Initiation Service Provider) enables Drop to initiate payments directly from users' bank accounts. Failures in PISP prevent both **remittance** (send money abroad) and **QR payments** (in-store merchant payments).

---

## Symptoms

Users report they cannot complete payments:

- Payment initiation fails with error message
- Payment status stuck at "pending" indefinitely
- Bank redirect loop (never returns to Drop)
- Error: "Payment service unavailable"

**User impact:** Cannot send money or pay merchants.

---

## Diagnosis

### 1. Identify Payment Type

Determine which payment flow is affected:

- **Remittance:** User sends money to recipient abroad (`POST /api/transactions/remittance`)
- **QR Payment:** User pays merchant by scanning QR code (`POST /api/transactions/qr-payment`)

**Check recent transactions:**
```bash
# CloudWatch Logs
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "payment_initiation" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  --region eu-west-1 \
  | jq '.events[].message' \
  | grep -E "remittance|qr_payment|pisp_error"
```

### 2. Check Open Banking Provider Status

**Providers:** Neonomics (Norway), Swan BaaS (cross-border)

**Neonomics Status:**
```bash
# No official status page; this call probes the sandbox, so production may still differ
curl -X POST https://sandbox.neonomics.io/payments/v1/payment-initiation \
  -H "Authorization: Bearer <sandbox-token>" \
  -H "Content-Type: application/json" \
  -d '{"amount":100,"currency":"NOK"}' \
  -v

# Expected: HTTP 200 or 400 (validation error)
# If 500/503: Neonomics outage
```

**Swan API Status:**
```bash
# Check Swan status page
open https://status.swan.io

# Or test API
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{viewer{id}}"}' \
  -v

# Expected: HTTP 200
# If 500/503: Swan outage
```

### 3. Check Drop Logs for Error Codes

**Common PISP error codes:**

| Code | Meaning | Cause |
|------|---------|-------|
| `INSUFFICIENT_FUNDS` | User's bank account balance too low | User error |
| `ACCOUNT_NOT_ACCESSIBLE` | Bank account locked or closed | Bank issue |
| `CONSENT_EXPIRED` | Open Banking consent needs renewal | User must re-authenticate |
| `PAYMENT_REJECTED` | Bank declined payment | Fraud detection, limits |
| `TIMEOUT` | Bank API took too long to respond | Network/bank issue |
| `INVALID_IBAN` | Recipient bank account number invalid | User error |
| `LIMIT_EXCEEDED` | Payment exceeds daily limit | User or bank limit |

**Search logs for error codes:**
```bash
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "PISP_ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  | jq -r '.events[].message' \
  | jq '.metadata.errorCode'
```
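
When the log search returns many lines, a frequency count shows which cause section to read first. The sketch below assumes each log message is a JSON string carrying `metadata.errorCode`, as the `jq` filter above implies; `tallyErrorCodes` is a hypothetical helper, not an existing function.

```typescript
// Tallies PISP error codes from structured log messages, most frequent first.
// Assumes each message is a JSON string with `metadata.errorCode`.
function tallyErrorCodes(messages: string[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const raw of messages) {
    let code: unknown;
    try {
      code = JSON.parse(raw)?.metadata?.errorCode;
    } catch {
      continue; // skip non-JSON log lines
    }
    if (typeof code !== "string") continue;
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A single dominant code such as `CONSENT_EXPIRED` points at a user-specific cause, whereas a spike in `TIMEOUT` across users suggests a provider or bank outage.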

### 4. Test Payment Flow

**Manual test (staging environment):**
```bash
# 1. Login
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

# 2. Initiate test payment (small amount)
curl -X POST https://drop-staging.fly.dev/api/transactions/remittance \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientId": "rec_test123",
    "amount": 100,
    "currency": "NOK",
    "sendCurrency": "NOK",
    "receiveCurrency": "EUR"
  }' \
  -v

# Expected: HTTP 200, transaction created
# If 500: PISP integration broken
```

---

## Common Causes & Solutions

### Cause 1: Open Banking Provider Outage

**Probability:** 10% (Neonomics/Swan service disruption)

**Symptoms:**
- All payments fail with timeout or 503 error
- Provider status page reports incident
- Test API call fails

**Solution:**

1. **Verify outage:**
   - Check the Swan status page; Neonomics has no public status page, so run the test API call from Diagnosis step 2
   - Contact provider support if the result is unclear

2. **Communicate to users:**
   ```
   Subject: Payment processing temporarily unavailable
   Body: Our payment provider is experiencing issues.
         We're monitoring the situation and expect service
         to resume within <X> minutes.
   ```

3. **Monitor provider status:**
   - Subscribe to provider status updates
   - Check every 15 minutes for resolution

4. **Queue failed payments (if applicable):**
   - Store payment requests in `pending_payments` table
   - Retry automatically when provider is back online

**ETA:** Depends on provider (typically <2 hours)

---

### Cause 2: Expired Open Banking Consent

**Probability:** 30% (user consent expires after 90 days)

**Symptoms:**
- Error code: `CONSENT_EXPIRED` or `ACCOUNT_NOT_ACCESSIBLE`
- Payments fail for specific users only (not all)
- Logs show: "Open Banking consent invalid"

**Solution:**

1. **Identify affected users:**
   ```sql
   SELECT user_id, bank_account_id, consent_expires_at
   FROM bank_accounts
   WHERE consent_expires_at < datetime('now');
   ```

2. **Notify users to re-authenticate:**
   - Send push notification: "Please reconnect your bank account"
   - In-app banner: "Bank connection expired, tap to reconnect"

3. **Guide user through re-consent flow:**
   - User taps "Reconnect bank account"
   - Redirect to AISP consent flow (BankID + bank approval)
   - Update `consent_expires_at` in database (90 days from now)

4. **Retry payment after re-consent:**
   - Original payment request should be retryable
   - Or user initiates new payment

**ETA:** Immediate (user action required)

---

### Cause 3: Insufficient Funds in User's Bank Account

**Probability:** 25% (user error)

**Symptoms:**
- Error code: `INSUFFICIENT_FUNDS`
- Payment fails for specific transaction only
- Logs show: "Account balance too low"

**Solution:**

1. **Show clear error message to user:**
   ```
   Payment failed: Insufficient funds
   Your bank account balance is too low to complete this payment.
   Please add funds or choose a different payment method.
   ```

2. **Suggest alternatives:**
   - Link different bank account (if multi-account supported)
   - Reduce payment amount
   - Try again later

3. **No action needed on Drop side** (user must resolve)

**ETA:** N/A (user-side issue)

---

### Cause 4: Bank Fraud Detection / Payment Rejection

**Probability:** 15% (bank security systems)

**Symptoms:**
- Error code: `PAYMENT_REJECTED` or `SECURITY_BLOCK`
- Payment fails after bank redirect
- Logs show: "Bank declined transaction"

**Solution:**

1. **Advise user to contact their bank:**
   ```
   Payment failed: Your bank declined this transaction.
   This may be due to fraud protection or payment limits.
   Please contact your bank to authorize the payment.
   ```

2. **Check if payment is unusual for user:**
   - First international transfer?
   - Amount significantly higher than usual?
   - High-risk destination country?

3. **User should:**
   - Call their bank's fraud department
   - Confirm the payment is legitimate
   - Ask bank to whitelist Drop payments
   - Retry after bank approval

4. **Document pattern:**
   - If many users from same bank report this, investigate bank compatibility
   - May need to add bank-specific messaging

**ETA:** Depends on user's bank (minutes to hours)

---

### Cause 5: PISP API Rate Limiting

**Probability:** 5% (during high-traffic periods)

**Symptoms:**
- Error code: `RATE_LIMIT_EXCEEDED` or HTTP 429
- Intermittent failures (some payments succeed, others fail)
- Logs show: "Too many requests"

**Solution:**

1. **Check rate limit headers:**
   ```bash
   # Find rate limit status in logs
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern "X-RateLimit" \
     --start-time $(date -u -d '10 minutes ago' +%s)000
   ```

2. **Implement request queuing:**
   ```typescript
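   // src/lib/pisp-client.ts
   import PQueue from 'p-queue';

   // interval only takes effect together with intervalCap;
   // 5 requests per second is an assumed cap, tune it to the provider quota
   const queue = new PQueue({ concurrency: 5, interval: 1000, intervalCap: 5 });

   async function initiatePayment(params) {
     return queue.add(() => pisService.createPayment(params));
   }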
   ```

3. **Exponential backoff on retry:**
   ```typescript
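   const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

   async function retryPayment(params, attempt = 1) {
     try {
       return await initiatePayment(params);
     } catch (error) {
       // Retry only on 429, and give up after the third attempt
       // without sleeping first
       const status = (error as { status?: number }).status;
       if (status !== 429 || attempt >= 3) throw error;
       await sleep(1000 * Math.pow(2, attempt)); // 2s, 4s
       return retryPayment(params, attempt + 1);
     }
   }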
   ```

4. **Contact provider to increase limits (if persistent):**
   - Email Neonomics support with usage stats
   - Request higher API quota for production

**ETA:** 5 minutes (automatic retry), 1-2 days (if quota increase needed)

---

### Cause 6: Invalid Recipient Bank Account (IBAN/SWIFT)

**Probability:** 20% (user input error)

**Symptoms:**
- Error code: `INVALID_IBAN` or `ACCOUNT_NOT_FOUND`
- Payment fails immediately (no bank redirect)
- Logs show: "Recipient account validation failed"

**Solution:**

1. **Show clear validation error:**
   ```
   Payment failed: Invalid recipient bank account
   The IBAN you entered is not valid. Please check and try again.
   IBAN: DE89 3704 0044 0532 0130 00 (example format)
   ```

2. **Improve frontend validation:**
   - Add real-time IBAN validation (checksum algorithm)
   - Use IBAN validation library (e.g., `ibantools`)
   - Show format hints per country

3. **Ask user to verify recipient details:**
   - Double-check IBAN/SWIFT code
   - Confirm with recipient
   - Try alternative payment method if IBAN is correct but still rejected

**ETA:** Immediate (user correction)
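
The checksum validation mentioned in step 2 is the ISO 13616 mod-97 check. A self-contained sketch is shown here to illustrate what a library such as `ibantools` does internally; it checks general structure and checksum only, not per-country length or whether the account exists, and the function name is hypothetical.

```typescript
// Validates an IBAN using the ISO 13616 mod-97 checksum.
function isValidIban(input: string): boolean {
  const iban = input.replace(/\s+/g, "").toUpperCase();
  // Country code, two check digits, then 11 to 30 alphanumeric characters.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;

  // Move the country code and check digits to the end.
  const rearranged = iban.slice(4) + iban.slice(0, 4);

  // Replace letters with numbers (A=10 ... Z=35) and reduce mod 97 digit by digit,
  // which avoids overflowing a JavaScript number.
  let remainder = 0;
  for (const ch of rearranged) {
    const value = ch >= "A" ? String(ch.charCodeAt(0) - 55) : ch;
    for (const digit of value) {
      remainder = (remainder * 10 + Number(digit)) % 97;
    }
  }
  return remainder === 1;
}
```

Running this in the frontend catches most typos before the request reaches the PISP provider; in production a maintained library is likely the better choice because it also enforces country-specific formats.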

---

## Emergency Workarounds

### Option 1: Manual Payment Processing

**Use case:** PISP provider down >2 hours, urgent payments needed

**Steps:**
1. Collect payment requests manually:
   ```sql
   SELECT id, user_id, amount, currency, recipient_iban
   FROM transactions
   WHERE status = 'pending' AND created_at > datetime('now', '-2 hours');
   ```

2. **Alem initiates payments manually** via Drop's business bank account:
   - Log into business banking portal
   - Enter recipient details manually
   - Process payment one by one

3. Update Drop transaction status:
   ```sql
   UPDATE transactions SET status = 'completed', completed_at = datetime('now')
   WHERE id = '<transaction-id>';
   ```

4. Notify users:
   ```
   Subject: Your payment has been processed
   Body: Your payment of <amount> to <recipient> has been completed manually
         due to a temporary service issue. Thank you for your patience.
   ```

**Risk:** Manual work, prone to errors. Only use for critical/urgent payments.

---

### Option 2: Redirect to Alternative Payment Method

**Use case:** PISP down, no ETA, users need alternative

**Steps:**
1. Show modal in app:
   ```
   Payment Initiation Unavailable
   Our payment service is temporarily down.
   Alternative options:
   - Bank transfer (manual IBAN entry)
   - Try again later (we'll notify you when service is restored)
   ```

2. Provide manual bank transfer instructions:
   ```
   Transfer to:
   Account holder: Drop AS
   IBAN: NO93 8601 1117 947
   Amount: <calculated-amount>
   Reference: <unique-ref>
   ```

3. Monitor for manual transfers:
   - Check business bank account for incoming payments
   - Match reference code to pending Drop transactions
   - Mark as completed when received

**ETA:** Immediate (user can pay via manual transfer)

---

## Monitoring & Alerts

### Metrics to Track

- **Payment success rate:** Should be >95%
- **Payment latency:** p50 <5s, p95 <15s, p99 <30s
- **Error rate by code:** Track `INSUFFICIENT_FUNDS`, `CONSENT_EXPIRED`, `TIMEOUT` separately
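
The latency targets above can be checked with a nearest-rank percentile. The sketch below is illustrative only: `percentile` and `latencyWithinTargets` are hypothetical helpers, and it assumes latencies are collected in milliseconds.

```typescript
// Nearest-rank percentile: smallest sample such that at least p% of samples are <= it.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Compares measured latencies (ms) against the runbook targets:
// p50 < 5s, p95 < 15s, p99 < 30s.
export function latencyWithinTargets(latenciesMs: number[]): boolean {
  return (
    percentile(latenciesMs, 50) < 5_000 &&
    percentile(latenciesMs, 95) < 15_000 &&
    percentile(latenciesMs, 99) < 30_000
  );
}
```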

### Alert Rules

```typescript
// src/lib/payment-monitor.ts
export async function trackPaymentFailure(errorCode: string, transactionId: string) {
  const failureRate = await calculateFailureRate('last_5_minutes');

  if (failureRate > 0.1) { // Alert above a 10% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'High payment failure rate',
      message: `${(failureRate * 100).toFixed(1)}% of payments failing in last 5 min ` +
        `(latest: ${errorCode}, transaction ${transactionId})`,
    });
  }
}
```
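
The alert rule relies on a failure-rate calculation that is not shown. One possible shape is an in-memory sliding window, sketched below. `FailureRateWindow` is a hypothetical helper, not the existing `calculateFailureRate`; a production version would more likely query the `transactions` table so the rate survives restarts and covers all instances.

```typescript
type Outcome = { at: number; ok: boolean };

// In-memory sliding window over recent payment outcomes (assumption: single process).
export class FailureRateWindow {
  private outcomes: Outcome[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  record(ok: boolean, now: number = Date.now()): void {
    this.outcomes.push({ at: now, ok });
  }

  // Fraction of failed payments inside the window; 0 when there is no traffic.
  rate(now: number = Date.now()): number {
    // Drop samples that have aged out of the window
    this.outcomes = this.outcomes.filter((o) => now - o.at <= this.windowMs);
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((o) => !o.ok).length;
    return failures / this.outcomes.length;
  }
}
```

With very low traffic a single failure can exceed the 10% threshold, so it may be worth requiring a minimum sample count before alerting.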

### Dashboard Queries

```sql
-- Payment success rate (last 24h)
SELECT
  COUNT(*) FILTER (WHERE status = 'completed') * 100.0 / COUNT(*) as success_rate,
  COUNT(*) as total_payments
FROM transactions
WHERE created_at > datetime('now', '-24 hours');

-- Top error codes (last hour)
SELECT error_code, COUNT(*) as count
FROM transactions
WHERE status = 'failed' AND created_at > datetime('now', '-1 hour')
GROUP BY error_code
ORDER BY count DESC;
```

---

## Post-Incident Actions

1. **Update transaction status:**
   ```sql
   -- Mark timed-out payments as failed (after 1 hour)
   UPDATE transactions
   SET status = 'failed', error_code = 'TIMEOUT', error_message = 'Payment timed out'
   WHERE status = 'pending' AND created_at < datetime('now', '-1 hour');
   ```

2. **Notify affected users:**
   - Send email/push notification about failed payment
   - Offer to retry or refund

3. **Document incident:**
   - Create post-mortem in `comms/incidents/`
   - Track downtime duration
   - Calculate financial impact (lost transactions)

4. **Review provider SLA:**
   - Check if outage violates SLA
   - Request compensation/credits if applicable

5. **Improve resilience:**
   - Add payment retry queue
   - Implement circuit breaker for provider API
   - Consider multi-provider failover (backup PISP)
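
The circuit breaker suggested above can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the class name, the threshold of 5 consecutive failures, and the 30-second cooldown are all hypothetical defaults, not values taken from Drop's codebase.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

export class CircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number = 5, cooldownMs: number = 30_000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Call before each provider request; false means fail fast without calling the provider.
  canRequest(now: number = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // Cooldown elapsed: let probe traffic through
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  recordFailure(now: number = Date.now()): void {
    this.consecutiveFailures += 1;
    // A failed probe re-opens immediately; otherwise open at the threshold
    if (this.state === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```

A caller would check `canRequest()` before contacting the provider, then call `recordSuccess()` or `recordFailure()` with the outcome. While the breaker is open, the app can show the alternative payment options immediately instead of waiting for timeouts.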

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 10 min | If provider outage confirmed, notify Alem |
| 30 min | If not resolved, assess manual processing need |
| 1 hour | If critical payments pending, start manual workaround (Alem approval) |
| 2 hours | Public communication to all users |

---

## Contacts

- **Neonomics Support:** support@neonomics.io, Slack: #neonomics-support
- **Swan Support:** support@swan.io (email), Swan Slack (if available)
- **Internal:** Alem (CEO, manual payment approval)

---

## Related Documentation

- `docs/architecture/payments.md` — PISP flow diagrams
- `src/app/api/transactions/remittance/route.ts` — Remittance implementation
- `src/app/api/transactions/qr-payment/route.ts` — QR payment implementation
- `docs/compliance/psd2-requirements.md` — Regulatory requirements

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)
**Test Status:** Pending (Phase 2 live payments)

# Runbook: Sumsub KYC Failure

# Runbook: Sumsub KYC/AML Verification Failure

**Service:** Sumsub Identity Verification (KYC/AML)
**Severity:** HIGH (blocks new user registrations)
**MTTR Target:** <30 minutes
**Owner:** John (AI Director)

---

## Overview

Sumsub provides automated identity verification (KYC, Know Your Customer) and AML (Anti-Money Laundering) checks for Drop. Both are required for regulatory compliance before users can make payments.

**KYC Process:**
1. User uploads ID document (passport, driver's license, national ID)
2. User takes selfie (liveness check)
3. Sumsub verifies document authenticity
4. Sumsub performs AML sanctions screening
5. Result: APPROVED, REJECTED, or MANUAL_REVIEW

**Impact:** If Sumsub fails, new users cannot complete registration. Existing users are unaffected.

---

## Symptoms

Users report they cannot complete identity verification:

- ID upload fails with error
- Verification stuck at "Processing..." indefinitely
- Error message: "Verification service unavailable"
- Webhook never receives result from Sumsub
- User status stuck at "pending_kyc"

**User impact:** Cannot complete registration, cannot make payments.

---

## Diagnosis

### 1. Check Sumsub Service Status

**External status:**
```bash
# Sumsub does not have a public status page
# Test via API health check
curl https://api.sumsub.com/resources/healthcheck \
  -H "X-App-Token: <app-token>" \
  -v

# Expected: HTTP 200
# If 500/503: Sumsub outage
```
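
Note that Sumsub's API documentation describes signed requests, so a call carrying only `X-App-Token` may return 401 even when the service is healthy. The sketch below shows the signing scheme as commonly documented (HMAC-SHA256 over timestamp, method, path, and body, sent in `X-App-Access-Sig` and `X-App-Access-Ts`). Treat the header names and the scheme as assumptions to confirm against the current Sumsub docs; `signSumsubRequest` is a hypothetical helper.

```typescript
import { createHmac } from 'node:crypto';

// Assumed scheme: hex HMAC-SHA256 of ts + METHOD + path(+query) + body, keyed with the secret key.
export function signSumsubRequest(
  secretKey: string,
  method: string,
  pathWithQuery: string,
  body: string = '',
  tsSeconds: number = Math.floor(Date.now() / 1000),
): Record<string, string> {
  const signature = createHmac('sha256', secretKey)
    .update(String(tsSeconds) + method.toUpperCase() + pathWithQuery + body)
    .digest('hex');

  // These headers are sent alongside X-App-Token
  return {
    'X-App-Access-Ts': String(tsSeconds),
    'X-App-Access-Sig': signature,
  };
}
```

For the health check above, the call would be `signSumsubRequest(secret, 'GET', '/resources/healthcheck')`, with the returned headers added to the request.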

### 2. Check Drop Logs

```bash
# CloudWatch Logs (production)
# Note: `date -d` is GNU syntax; on macOS use $(date -u -v-30M +%s)000 instead
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "sumsub" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "Sumsub API timeout"
# - "Sumsub webhook failed"
# - "KYC verification failed: document_expired"
# - "AML sanctions match: [name]"
```

### 3. Check Sumsub Dashboard

```bash
# Login to Sumsub Dashboard
open https://cockpit.sumsub.com

# Check:
# - Recent applicants (last 1 hour)
# - Failed verifications
# - Manual review queue length
# - Webhook delivery status
```

### 4. Check Webhook Delivery

**Verify webhook endpoint is reachable:**
```bash
# Sumsub sends webhooks to: https://getdrop.no/api/webhooks/sumsub
# Test endpoint manually
curl -X POST https://getdrop.no/api/webhooks/sumsub \
  -H "Content-Type: application/json" \
  -H "X-Sumsub-Signature: test" \
  -d '{"type":"applicantReviewed","reviewResult":{"reviewAnswer":"GREEN"}}' \
  -v

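# Expected: HTTP 401 (the dummy signature is rejected), which confirms the endpoint is reachable
# If 200: Signature validation is not being enforced (investigate)
# If 404: Webhook endpoint not deployed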
```
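
The test payload above carries the review outcome in `reviewResult.reviewAnswer`. A minimal sketch of turning such a payload into a Drop status update follows. It assumes the `GREEN`/`RED` answers and the `reviewRejectType` values `FINAL`/`RETRY` from Sumsub's webhook format; `kycUpdateFromWebhook` and the `KycUpdate` shape are hypothetical, not the actual handler in `route.ts`.

```typescript
type ReviewResult = {
  reviewAnswer?: string;     // Assumed values: 'GREEN' or 'RED'
  reviewRejectType?: string; // Assumed values: 'FINAL' or 'RETRY' (present on RED)
};

export type KycUpdate = { kycStatus: 'approved' | 'rejected'; userMayRetry: boolean };

// Returns the status change implied by a webhook body, or null if nothing should change.
export function kycUpdateFromWebhook(rawBody: string): KycUpdate | null {
  const payload = JSON.parse(rawBody) as { type?: string; reviewResult?: ReviewResult };
  if (payload.type !== 'applicantReviewed' || !payload.reviewResult) return null;

  const { reviewAnswer, reviewRejectType } = payload.reviewResult;
  if (reviewAnswer === 'GREEN') return { kycStatus: 'approved', userMayRetry: false };
  if (reviewAnswer === 'RED') {
    // RETRY means the user can resubmit; anything else is treated as final
    return { kycStatus: 'rejected', userMayRetry: reviewRejectType === 'RETRY' };
  }
  return null; // Unknown answer: leave the user's status untouched
}
```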

### 5. Test KYC Flow

**Manual test (staging):**
```bash
# 1. Create test applicant (levelName is passed as a query parameter)
curl -X POST "https://api.sumsub.com/resources/applicants?levelName=basic-kyc-level" \
  -H "X-App-Token: <sandbox-app-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "externalUserId": "test-user-123",
    "email": "test@example.com"
  }' \
  -v

# Expected: HTTP 201, applicant created
# If 400: Invalid request
# If 500: Sumsub API issue
```

---

## Common Causes & Solutions

### Cause 1: Sumsub API Outage (External)

**Probability:** 5% (Sumsub service disruption)

**Symptoms:**
- All KYC verifications fail
- Sumsub API health check returns 503
- Dashboard shows no recent applicants
- Logs show API timeouts

**Solution:**

1. **Verify outage:**
   ```bash
   # Test Sumsub API from different networks
   curl https://api.sumsub.com/resources/healthcheck \
     -H "X-App-Token: <app-token>" \
     -v

   # If consistent failure: confirmed outage
   ```

2. **Contact Sumsub support:**
   - Email: support@sumsub.com
   - Live chat: https://cockpit.sumsub.com (bottom-right)
   - Phone: Check Sumsub Dashboard for support number

3. **Communicate to users (Norwegian):**
   ```
   Emne: Identitetsverifisering midlertidig utilgjengelig

   Hei,

   Vi opplever for øyeblikket tekniske problemer med identitetsverifisering.
   Du kan fortsette registreringen senere.

   Vi forventer at tjenesten er tilbake innen [X minutter/timer].

   Mvh,
   Drop
   ```

4. **Queue pending verifications:**
   ```sql
   -- Mark users as pending KYC retry
   UPDATE users
   SET kyc_status = 'pending_retry',
       kyc_retry_at = datetime('now', '+1 hour')
   WHERE kyc_status = 'pending_kyc'
   AND created_at > datetime('now', '-2 hours');
   ```

5. **Retry when Sumsub is back:**
   ```bash
   # Cron job to retry pending KYC
   node ~/ALAI/products/Drop/scripts/retry-kyc.js
   ```

**ETA:** Depends on Sumsub (typically <2 hours)

---

### Cause 2: Document Verification Failure (User Error)

**Probability:** 40% (user uploads poor quality or invalid document)

**Symptoms:**
- Specific users fail KYC (not all users)
- Logs show: "document_not_readable", "document_expired", "document_type_mismatch"
- Sumsub dashboard shows rejection reason

**Common rejection reasons:**
- Blurry photo (document not readable)
- Expired document (passport/ID expired)
- Wrong document type (e.g., bank statement instead of ID)
- Photo cropped (missing corners/edges)
- Underage (user < 18 years old)

**Solution:**

1. **Identify rejection reason:**
   ```sql
   SELECT user_id, kyc_rejection_reason, kyc_rejected_at
   FROM users
   WHERE kyc_status = 'rejected'
   ORDER BY kyc_rejected_at DESC
   LIMIT 10;
   ```

2. **Show clear error to user (Norwegian):**

   **Blurry document:**
   ```
   Dokumentet er ikke leselig
   Ta et nytt bilde i godt lys.
   Sørg for at all tekst er skarp og leselig.
   ```

   **Expired document:**
   ```
   Dokumentet er utløpt
   Vennligst last opp et gyldig pass eller førerkort.
   Dokumentet må være gyldig i minst 1 måned.
   ```

   **Wrong document type:**
   ```
   Feil dokumenttype
   Vi godtar kun: Pass, Nasjonalt ID-kort, Førerkort.
   Bankkort og regninger godtas ikke.
   ```

   **Underage:**
   ```
   Du må være 18 år eller eldre
   Drop er kun tilgjengelig for brukere over 18 år.
   ```

3. **Allow user to retry:**
   - Show "Try Again" button in app
   - Provide tips for better photo quality
   - Link to FAQ: "How to take a good ID photo"

4. **Track retry success rate:**
   ```sql
   -- How many users succeed on 2nd attempt?
   SELECT
     COUNT(*) FILTER (WHERE kyc_attempt = 1 AND kyc_status = 'approved') as first_attempt_success,
     COUNT(*) FILTER (WHERE kyc_attempt = 2 AND kyc_status = 'approved') as second_attempt_success,
     COUNT(*) FILTER (WHERE kyc_attempt >= 3) as multiple_retries
   FROM users;
   ```

**ETA:** Immediate (user must retry with better document)

---

### Cause 3: AML Sanctions Match (Compliance Issue)

**Probability:** 3% (user flagged by sanctions screening)

**Symptoms:**
- Specific user's KYC fails with: "AML_SANCTIONS_MATCH"
- Sumsub dashboard shows "Red flag" or "Manual review required"
- User name matches sanctions list (OFAC, EU, UN, etc.)

**Solution:**

1. **Identify flagged users:**
   ```sql
   SELECT user_id, email, full_name, kyc_rejection_reason
   FROM users
   WHERE kyc_rejection_reason LIKE '%sanctions%'
   OR kyc_status = 'manual_review_aml';
   ```

2. **Review Sumsub dashboard:**
   - Login: https://cockpit.sumsub.com
   - Navigate to applicant
   - Check AML screening results
   - Review sanctions list match details

3. **False positive (common names):**
   - Example: "Ali Hassan" may match many sanctioned individuals
   - Sumsub shows match details (date of birth, nationality)
   - If clearly different person: manually approve in Sumsub

4. **True positive (actual sanctions match):**
   - **DO NOT approve.** This is a legal/regulatory issue.
   - Reject user registration immediately
   - Document incident for compliance records

5. **Notify user (if false positive, manually approved):**
   ```
   Din identitetsverifisering er godkjent
   Takk for tålmodigheten. Du kan nå bruke Drop.
   ```

6. **Notify user (if true positive, rejected):**
   ```
   Vi kan dessverre ikke godkjenne din registrering
   På grunn av regulatoriske krav kan vi ikke tilby tjenester til deg.
   Ta kontakt med support@getdrop.no hvis du mener dette er en feil.
   ```

7. **Escalate to Alem if uncertain:**
   - AML compliance is critical
   - False rejection = bad UX, but false approval = legal risk
   - Alem makes final call on borderline cases

**ETA:** 10 minutes (false positive), N/A (true positive - reject)

---

### Cause 4: Webhook Delivery Failure

**Probability:** 15% (Drop webhook endpoint down or unreachable)

**Symptoms:**
- Sumsub completes verification, but Drop never updates user status
- Logs show: "Webhook not received"
- Sumsub dashboard shows "Webhook delivery failed"
- User stuck at "pending_kyc" despite Sumsub showing "approved"

**Solution:**

1. **Check webhook endpoint health:**
   ```bash
   # Test webhook endpoint
   curl -X POST https://getdrop.no/api/webhooks/sumsub \
     -H "Content-Type: application/json" \
     -d '{"type":"ping"}' \
     -v

   # Expected: HTTP 200
   # If 404/500: Drop webhook endpoint broken
   ```

2. **Check Sumsub webhook delivery logs:**
   - Login: https://cockpit.sumsub.com
   - Navigate to Settings → Webhooks
   - Check recent delivery attempts
   - Look for: 404, 500, timeout errors

3. **Manually retry failed webhooks:**
   - Sumsub Dashboard → Applicant → "Resend Webhook"
   - This triggers new webhook delivery to Drop
   - Verify Drop receives and processes it

4. **Fetch verification results via API (if webhook lost):**
   ```bash
   # Manually fetch applicant status from Sumsub
   curl -X GET https://api.sumsub.com/resources/applicants/<applicant-id>/status \
     -H "X-App-Token: <app-token>" \
     -v

   # Parse result and update Drop database
   ```

5. **Update Drop database manually:**
   ```sql
   UPDATE users
   SET kyc_status = 'approved',
       kyc_approved_at = datetime('now')
   WHERE sumsub_applicant_id = '<applicant-id>';
   ```

6. **Fix webhook endpoint (if broken):**
   - Check App Runner deployment status
   - Verify webhook route exists: `src/app/api/webhooks/sumsub/route.ts`
   - Check signature validation (Sumsub signs webhooks with HMAC)

**ETA:** 10 minutes (manual retry), 30 minutes (if endpoint fix needed)
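
When checking signature validation in step 6, the core of the check can be sketched as below. This is a hedged illustration: it assumes the webhook digest is a hex-encoded HMAC-SHA256 of the raw request body keyed with the webhook secret. The algorithm configured in Sumsub and the header that carries the digest should be confirmed in the dashboard; `computeDigest` and `isValidSignature` are hypothetical names.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumption: digest = hex HMAC-SHA256 over the exact raw body bytes.
export function computeDigest(secret: string, rawBody: string): string {
  return createHmac('sha256', secret).update(rawBody).digest('hex');
}

export function isValidSignature(secret: string, rawBody: string, receivedHex: string): boolean {
  const expected = Buffer.from(computeDigest(secret, rawBody), 'hex');
  const received = Buffer.from(receivedHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The check must run against the raw body exactly as received. Re-serializing parsed JSON can change whitespace or key order and cause valid webhooks to be rejected, which is a common reason for 401 responses after a framework upgrade.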

---

### Cause 5: Invalid or Expired API Credentials

**Probability:** 5% (after credential rotation)

**Symptoms:**
- Logs show: "401 Unauthorized" or "403 Forbidden"
- All Sumsub API calls fail
- Webhook signature validation fails

**Solution:**

1. **Verify Sumsub API credentials:**
   ```bash
   bw get item "Sumsub API" --session $BW_SESSION

   # Check:
   # - App Token is correct
   # - Secret Key is correct (for webhook signature)
   # - Environment: production vs sandbox
   ```

2. **Regenerate API credentials (if needed):**
   - Login: https://cockpit.sumsub.com
   - Navigate to Settings → API
   - Generate new App Token + Secret Key
   - Copy to Vaultwarden

3. **Update App Runner environment variables:**
   ```bash
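   # Env vars belong in --source-configuration (RuntimeEnvironmentVariables),
   # not --instance-configuration; include all other existing variables too
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "SUMSUB_APP_TOKEN": "<new-app-token>",
         "SUMSUB_SECRET_KEY": "<new-secret-key>",
         "SUMSUB_ENVIRONMENT": "production"
       }}}}'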
   ```

4. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn <ARN> --region eu-west-1
   ```

5. **Test after deployment:**
   ```bash
   # Try creating test applicant
   curl -X POST https://getdrop.no/api/kyc/initiate \
     -H "Authorization: Bearer <test-user-token>" \
     -v

   # Expected: HTTP 200, Sumsub applicant created
   ```

**ETA:** 10 minutes

---

### Cause 6: Liveness Check Failure (Selfie)

**Probability:** 20% (user fails selfie/liveness verification)

**Symptoms:**
- Specific users fail at selfie stage
- Logs show: "liveness_check_failed", "face_mismatch"
- Sumsub dashboard shows "Selfie does not match ID photo"

**Common reasons:**
- Poor lighting (too dark, too bright)
- User wears sunglasses/hat
- Multiple people in frame
- Photo of a photo (not live person)
- Face does not match ID document

**Solution:**

1. **Show clear instructions before selfie (Norwegian):**
   ```
   Slik tar du et godt selfie-bilde:
   ✓ God belysning (dagslys er best)
   ✓ Fjern briller/solbriller
   ✓ Se rett i kameraet
   ✓ Kun ditt ansikt i bildet
   ✗ Ikke bruk foto av foto
   ```

2. **Allow retry with better instructions:**
   ```
   Selfie-verifisering mislyktes
   Prøv igjen med bedre belysning.
   Sørg for at ansiktet ditt er tydelig synlig.
   ```

3. **Improve liveness detection settings (if too strict):**
   - Login: https://cockpit.sumsub.com
   - Navigate to Settings → Verification Levels
   - Adjust liveness sensitivity (low/medium/high)
   - Balance: security vs user friction

4. **Manual review (if automated fails repeatedly):**
   - Some users may need manual review
   - Sumsub team reviews video/photos manually
   - ETA: 1-24 hours depending on Sumsub queue

**ETA:** Immediate (user retry), 1-24 hours (manual review)

---

## Emergency Workarounds

### Option 1: Manual KYC Review (Temporary)

**Use case:** Sumsub down >1 hour, urgent user needs verification

**Steps:**

1. Collect KYC documents manually:
   - Ask user to email ID photo + selfie to support@getdrop.no
   - Subject: "KYC Manual Review - [User ID]"

2. **Alem or John reviews manually:**
   - Verify ID document authenticity (check security features)
   - Compare selfie to ID photo
   - Check ID expiry date
   - Verify age >= 18

3. **Manual AML check:**
   - Search user name on: https://sanctionssearch.ofac.treas.gov
   - Check EU sanctions list: https://eeas.europa.eu/topics/sanctions-policy
   - Document findings

4. **Approve in database (if passes checks):**
   ```sql
   UPDATE users
   SET kyc_status = 'approved_manual',
       kyc_approved_at = datetime('now'),
       kyc_approved_by = '<reviewer>',  -- 'alem' or 'john', whoever performed the review
       kyc_notes = 'Manual review during Sumsub outage'
   WHERE user_id = '<user-id>';
   ```

5. **Notify user:**
   ```
   Din identitet er verifisert
   Velkommen til Drop! Du kan nå gjøre betalinger.
   ```

**Risk:** Manual review is slow, error-prone, not scalable. Only for critical cases.

---

### Option 2: Delay Registration, Notify When Ready

**Use case:** Sumsub down, no ETA, non-urgent registrations

**Steps:**

1. Show maintenance message:
   ```
   Identitetsverifisering midlertidig utilgjengelig
   Vi jobber med å løse problemet.
   Du vil motta en e-post når du kan fortsette registreringen.
   ```

2. Collect user email:
   ```typescript
   // src/app/api/auth/register/route.ts
   if (sumsubUnavailable) {
     await db.insert('pending_registrations', {
       email: userEmail,
       status: 'waiting_kyc',
       created_at: new Date(),
     });

     return {
       success: true,
       message: 'We will notify you when registration is available',
     };
   }
   ```

3. **When Sumsub is back, notify users:**
   ```sql
   SELECT email FROM pending_registrations WHERE status = 'waiting_kyc';
   ```

   Email (Norwegian):
   ```
   Emne: Du kan nå fullføre registreringen i Drop

   Hei,

   Identitetsverifisering er tilbake.
   Klikk her for å fortsette registreringen: [Link]

   Mvh,
   Drop
   ```

**ETA:** Delayed registration (hours to days)

---

## Monitoring & Alerts

### Metrics to Track

- **KYC success rate:** Should be >85% (accounting for user errors)
- **KYC processing time:** p50 <5min, p95 <30min, p99 <2h (includes manual review)
- **Rejection reasons:** Track document_not_readable, expired, underage, sanctions separately

### Alert Rules

```typescript
// src/lib/kyc-monitor.ts
export async function trackKYCFailure(userId: string, reason: string) {
  const failureRate = await calculateKYCFailureRate('last_hour');

  if (failureRate > 0.3) { // 30% failure rate
    await sendAlert({
      severity: 'high',
      title: 'KYC failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of KYC attempts failing`,
      reason,
    });
  }
}
```

---

## Post-Incident Actions

1. **Retry failed KYC verifications:**
   ```sql
   UPDATE users
   SET kyc_status = 'pending_retry',
       kyc_retry_at = datetime('now')
   WHERE kyc_status IN ('failed', 'pending_kyc')
   AND created_at > datetime('now', '-24 hours');
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-sumsub-kyc-failure.md
   ```

3. **Review rejection reasons:**
   - High document_not_readable rate? Improve photo instructions
   - High liveness_check_failed rate? Adjust Sumsub settings
   - Track improvements in next month's KYC metrics

4. **Update user onboarding:**
   - Add better photo guides
   - Show example of good vs bad ID photos
   - Pre-flight check: "Is your ID expired?"

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 15 min | If Sumsub outage confirmed, notify Alem |
| 30 min | If urgent user needs KYC, consider manual review (Alem approval) |
| 1 hour | Public communication to users |
| 2 hours | Contact Sumsub support via phone if no response |

---

## Contacts

- **Sumsub Support:** support@sumsub.com
- **Sumsub Live Chat:** https://cockpit.sumsub.com (bottom-right)
- **Sumsub Phone:** Check Sumsub Dashboard for support number
- **Internal:** Alem (CEO, manual KYC approval authority)

---

## Related Documentation

- `docs/architecture/kyc-aml.md` — KYC/AML flow diagrams
- `src/app/api/kyc/initiate/route.ts` — Sumsub integration code
- `docs/compliance/kyc-requirements.md` — Regulatory requirements (age, ID types)
- Vaultwarden item: "Sumsub API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: Swan API Outage

# Runbook: Swan BaaS API Outage

**Service:** Swan Banking-as-a-Service
**Severity:** CRITICAL (blocks accounts, cards, payments if Swan is primary provider)
**MTTR Target:** <15 minutes
**Owner:** John (AI Director)

---

## Overview

Swan provides core banking infrastructure for Drop. Depending on Drop's architecture phase, Swan may handle:
- **Account creation** (virtual IBAN accounts for users)
- **Card issuance** (virtual/physical debit cards)
- **Payment processing** (domestic/international transfers)
- **Balance management** (wallet balances, not Open Banking)

**Impact:** If Swan is the primary BaaS provider, an outage affects ALL core banking operations.

---

## Symptoms

Users report critical failures:

- Cannot create new account
- Cannot view wallet balance (if using Swan wallets)
- Card payments fail or decline
- Error: "Banking service unavailable"
- Dashboard shows "System error" for account-related features

**User impact:** Complete inability to use banking features (depending on Drop's reliance on Swan).

---

## Diagnosis

### 1. Check Swan Status Page

**External status:**
```bash
# Swan official status page
open https://status.swan.io

# Check for:
# - Incident reported
# - Degraded performance
# - Scheduled maintenance
```

### 2. Test Swan API

**Health check:**
```bash
# GraphQL health query
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{viewer{id}}"}' \
  -v

# Expected: HTTP 200, {"data": {"viewer": {"id": "..."}}}
# If 500/503: Swan API down
# If 401: Credential issue
# If timeout: Network or Swan connectivity issue
```

**Test account creation:**
```bash
# Attempt to create test account
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "mutation { createAccount(input: {name: \"Test Account\"}) { id } }"
  }' \
  -v

# Expected: HTTP 200 with account ID
# If error: Check response for Swan error codes
```

### 3. Check Drop Logs

```bash
# CloudWatch Logs (production)
# Note: `date -d` is GNU syntax; on macOS use $(date -u -v-15M +%s)000 instead
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "swan" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "Swan API timeout"
# - "Swan GraphQL error: INTERNAL_SERVER_ERROR"
# - "Swan 503 Service Unavailable"
# - "Swan rate limit exceeded"
```

### 4. Check Swan API Credentials

```bash
# Verify Swan API key is valid
bw get item "Swan API" --session $BW_SESSION

# Check App Runner environment variables
aws apprunner describe-service \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep SWAN

# Expected:
# SWAN_API_KEY: <exists>
# SWAN_ENVIRONMENT: production (or sandbox)
# SWAN_PARTNER_ID: <partner-id>
```

### 5. Check Recent Swan API Changes

**Review Swan changelog:**
```bash
# Swan may deprecate API endpoints or change schemas
# Check Swan developer portal for breaking changes
open https://docs.swan.io/changelog

# Review recent GraphQL schema changes
# Verify Drop uses supported API versions
```

---

## Common Causes & Solutions

### Cause 1: Swan Service Outage (External)

**Probability:** 5% (Swan is highly reliable, but incidents happen)

**Symptoms:**
- Swan status page reports incident
- All Swan API calls fail with 500/503
- No error in Drop code/config
- Social media mentions Swan issues

**Solution:**

1. **Verify outage scope:**
   - Check Swan status page
   - Test API from different networks (rule out local network issue)
   - Contact Swan support for ETA

2. **Communicate to users (Norwegian):**
   ```
   Emne: Bankfunksjoner midlertidig utilgjengelig

   Hei,

   Vår bankinfrastruktur-leverandør (Swan) opplever tekniske problemer.
   Dette påvirker:
   - Kontoopprettelse
   - Korttransaksjoner
   - Overføringer

   Vi overvåker situasjonen og forventer at tjenesten er tilbake innen [X minutter/timer].

   Mvh,
   Drop
   ```

3. **Enable degraded mode:**
   ```bash
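   # Disable features that depend on Swan.
   # Env vars live under SourceConfiguration; treat the map as a full replacement
   # and include the unchanged variables (SWAN_API_KEY, etc.) as well.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "FEATURE_ACCOUNTS": "disabled",
         "FEATURE_CARDS": "disabled",
         "SWAN_MODE": "degraded"
       }}}}'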

   # Show maintenance banner in app
   ```

4. **Monitor Swan status:**
   - Subscribe to Swan status updates (RSS/email)
   - Check every 10 minutes for resolution
   - Test API as soon as Swan reports "Resolved"

5. **Re-enable features when Swan is back:**
   ```bash
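   # Pass the full variable map (treat it as a replacement), not just the toggles
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "FEATURE_ACCOUNTS": "enabled",
         "FEATURE_CARDS": "enabled",
         "SWAN_MODE": "live"
       }}}}'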
   ```

**ETA:** Depends on Swan (typically <2 hours for major incidents)
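
On the application side, the degraded-mode variables need to be read somewhere. A minimal sketch of such a check follows. It assumes that `SWAN_MODE=degraded` switches off every Swan-backed feature and that an unset flag means enabled; `isFeatureEnabled` is a hypothetical helper, not an existing Drop function.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: degraded mode overrides the individual feature flags.
export function isFeatureEnabled(env: Env, feature: 'ACCOUNTS' | 'CARDS'): boolean {
  if (env.SWAN_MODE === 'degraded') return false;
  // Unset flags default to enabled so a missing variable does not cause an outage
  return env[`FEATURE_${feature}`] !== 'disabled';
}
```

In a route handler this would be called as `isFeatureEnabled(process.env, 'CARDS')`, returning a maintenance response when it is false.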

---

### Cause 2: Invalid or Expired API Credentials

**Probability:** 15% (after credential rotation or Swan account changes)

**Symptoms:**
- Logs show: "401 Unauthorized" or "403 Forbidden"
- All Swan API requests fail immediately
- Swan API test returns authentication error

**Solution:**

1. **Verify Swan API credentials:**
   ```bash
   bw get item "Swan API" --session $BW_SESSION

   # Check:
   # - API key is not expired
   # - API key has correct permissions (accounts, cards, payments)
   # - Partner ID is correct
   ```

2. **Regenerate API key (if needed):**
   - Login to Swan Dashboard: https://dashboard.swan.io
   - Navigate to Settings → API Keys
   - Revoke old key, generate new key
   - Copy new key to Vaultwarden

3. **Update App Runner environment variables:**
   ```bash
   aws apprunner update-service --service-arn <ARN> \
     --source-configuration "ImageRepository={...}" \
     --instance-configuration "EnvironmentVariables={
       SWAN_API_KEY=<new-key>,
       SWAN_PARTNER_ID=<partner-id>
     }"
   ```

4. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn <ARN> --region eu-west-1
   ```

5. **Test after deployment (3-5 min):**
   ```bash
   curl https://getdrop.no/api/accounts/create \
     -H "Authorization: Bearer <test-user-token>" \
     -H "Content-Type: application/json" \
     -d '{"accountType": "personal"}' \
     -v

   # Expected: HTTP 200, account created
   ```

**ETA:** 10 minutes

---

### Cause 3: Swan API Rate Limiting

**Probability:** 10% (during high-traffic events or viral growth)

**Symptoms:**
- Logs show: HTTP 429 "Too Many Requests"
- Intermittent failures (some requests succeed, others fail)
- Rate limit headers in response

**Solution:**

1. **Check rate limit headers:**
   ```bash
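   # Note: `date -d` is GNU syntax; on macOS use $(date -u -v-10M +%s)000 instead
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern "X-RateLimit" \
     --start-time $(date -u -d '10 minutes ago' +%s)000 \
     --region eu-west-1 \
     | jq -r '.events[].message' \
     | grep -i swan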
   ```

2. **Implement request queuing:**
   ```typescript
   // src/lib/swan-client.ts
   import PQueue from 'p-queue';

   const queue = new PQueue({
     concurrency: 5,     // Max 5 concurrent Swan requests
     interval: 1000,      // Per second
     intervalCap: 20      // Max 20 requests per second
   });

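   export async function swanGraphQL(query: string, variables?: Record<string, unknown>) {
     return queue.add(async () => {
       const res = await fetch('https://api.swan.io/graphql', {
         method: 'POST',
         headers: {
           'Authorization': `Bearer ${process.env.SWAN_API_KEY}`,
           'Content-Type': 'application/json',
         },
         body: JSON.stringify({ query, variables }),
       });
       // fetch resolves on 4xx/5xx; throw so callers can inspect `status` (e.g. retry on 429)
       if (!res.ok) {
         throw Object.assign(new Error(`Swan HTTP ${res.status}`), { status: res.status });
       }
       return res;
     });
   }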
   ```

3. **Exponential backoff on retry:**
   ```typescript
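   const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

   // Retries only on HTTP 429, up to 3 times; other errors are rethrown immediately
   async function retrySwan<T>(operation: () => Promise<T>, attempt = 1): Promise<T> {
     try {
       return await operation();
     } catch (error) {
       const status = (error as { status?: number }).status;
       if (status === 429 && attempt <= 3) {
         const delay = 1000 * Math.pow(2, attempt); // 2s, 4s, 8s
         await sleep(delay);
         return retrySwan(operation, attempt + 1);
       }
       throw error;
     }
   }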
   ```

4. **Contact Swan to increase rate limit:**
   - Email Swan support with traffic stats
   - Provide justification: user growth, peak times
   - Request higher API quota

**ETA:** 5 minutes (automatic retry), 1-2 days (if quota increase needed)
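
If the provider's 429 responses include a `Retry-After` header, honoring it is usually better than a fixed backoff schedule. The sketch below assumes the standard HTTP semantics of that header (delay in seconds, or an HTTP date) and falls back to the 2s/4s/8s schedule used above. Whether Swan sends this header is an assumption to verify in the logged rate limit headers; `retryDelayMs` is a hypothetical helper.

```typescript
// Chooses how long to wait before the next retry attempt, in milliseconds.
export function retryDelayMs(
  retryAfter: string | null,
  attempt: number,
  now: number = Date.now(),
): number {
  const fallback = 1000 * 2 ** attempt; // 2s, 4s, 8s for attempts 1-3
  if (!retryAfter) return fallback;

  // Form 1: delay in seconds, e.g. "5"
  const seconds = Number(retryAfter);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const date = Date.parse(retryAfter);
  return Number.isNaN(date) ? fallback : Math.max(0, date - now);
}
```

The header value would come from `response.headers.get('Retry-After')` on the 429 response, and the result replaces the fixed `delay` in the retry loop.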

---

### Cause 4: Swan GraphQL Schema Change (Breaking)

**Probability:** 5% (Swan updates API, breaks Drop integration)

**Symptoms:**
- Logs show: "GraphQL validation error"
- Specific queries fail: "Field 'X' doesn't exist on type 'Y'"
- Swan API works for some operations, fails for others

**Solution:**

1. **Check Swan changelog:**
   ```bash
   # Review recent API changes
   open https://docs.swan.io/changelog

   # Look for:
   # - Deprecated fields
   # - Required fields added
   # - Type changes
   ```

2. **Identify breaking changes:**
   ```bash
   # Compare current Drop queries to Swan schema
   # Example: account creation query
   grep -r "createAccount" src/lib/swan-client.ts

   # Cross-reference with Swan GraphQL schema
   # https://api.swan.io/graphql (GraphQL Playground)
   ```

3. **Update Drop GraphQL queries:**
   ```graphql
   # Before (deprecated)
   mutation {
     createAccount(input: { name: "User Account" }) {
       id
       balance  # ❌ Deprecated field
     }
   }

   # After (updated)
   mutation {
     createAccount(input: { name: "User Account" }) {
       id
       balances {  # ✅ New field structure
         available
         currency
       }
     }
   }
   ```
4. **Test updated queries:**
   ```bash
   # Test in Swan GraphQL Playground first
   # Then deploy to staging
   # Verify all Swan-dependent features work
   ```

5. **Deploy fix:**
   ```bash
   git add src/lib/swan-client.ts
   git commit -m "Fix: Update Swan GraphQL queries to match latest schema"
   git push origin main

   # CI/CD triggers deployment
   ```

**ETA:** 30 minutes (if simple field change), 2 hours (if major refactor needed)
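
To separate a schema break from an outage quickly, the error payload can be inspected in code. This is a minimal sketch, assuming Swan returns the standard GraphQL `errors` array. The `GRAPHQL_VALIDATION_FAILED` extension code and the message patterns are assumptions based on common server conventions, not confirmed Swan behaviour, and `isSchemaError` is a hypothetical helper.

```typescript
// Shape of one entry in a standard GraphQL "errors" array.
interface GraphQLErrorEntry {
  message: string;
  extensions?: { code?: string };
}

interface GraphQLResponseBody {
  data?: unknown;
  errors?: GraphQLErrorEntry[];
}

// Assumption: typical messages for unknown-field / unknown-argument failures.
const SCHEMA_ERROR_PATTERNS: RegExp[] = [
  /Cannot query field/i,
  /doesn't exist on type/i,
  /Unknown argument/i,
];

// True when at least one error looks like a schema/validation failure
// rather than a runtime or availability problem.
export function isSchemaError(body: GraphQLResponseBody): boolean {
  return (body.errors ?? []).some(
    (e) =>
      e.extensions?.code === "GRAPHQL_VALIDATION_FAILED" ||
      SCHEMA_ERROR_PATTERNS.some((p) => p.test(e.message)),
  );
}
```

Tagging log lines with the result would let the "Swan works for some operations, fails for others" symptom be confirmed from logs alone.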

---

### Cause 5: Network or Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- Swan API requests never reach destination
- Works locally but fails in production

**Solution:**

1. **Check outbound connectivity:**
   ```bash
   # App Runner egress is unrestricted by default
   # If using VPC connector, check security group
   aws ec2 describe-security-groups \
     --group-ids <vpc-connector-sg> \
     --region eu-west-1 \
     | jq '.SecurityGroups[].IpPermissionsEgress'
   ```

2. **Test DNS resolution:**
   ```bash
   nslookup api.swan.io

   # Should resolve to Swan IPs
   # If NXDOMAIN: DNS issue
   ```

3. **Check AWS service health:**
   ```bash
   # Check App Runner service events
   aws apprunner list-operations \
     --service-arn <ARN> \
     --region eu-west-1 \
     | jq '.OperationSummaryList[0]'
   ```

4. **Whitelist Swan IPs (if strict firewall):**
   - Contact Swan for IP ranges
   - Add to security group outbound rules (port 443)

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)

---

### Cause 6: Swan Account Suspended or Payment Overdue

**Probability:** 2% (billing issue or compliance violation)

**Symptoms:**
- All Swan API calls fail with "Account suspended"
- Swan Dashboard shows billing alert
- Email from Swan about overdue payment or compliance issue

**Solution:**

1. **Check Swan Dashboard:**
   - Login: https://dashboard.swan.io
   - Look for alerts: billing, compliance, KYC

2. **Resolve billing issue:**
   - If overdue payment: pay immediately via Swan Dashboard
   - If billing method expired: update payment method
   - Contact Swan billing: billing@swan.io

3. **Resolve compliance issue:**
   - Swan requires KYC for partner accounts
   - Upload missing documents (company registration, director ID, etc.)
   - Respond to Swan compliance team ASAP

4. **Request urgent reactivation:**
   - Email Swan support: support@swan.io
   - Subject: "URGENT: Account reactivation needed - [Partner ID]"
   - Explain impact (users affected)
   - Provide evidence of issue resolution

**ETA:** 15 minutes (if billing), 24 hours (if compliance review needed)

---

## Emergency Workarounds

### Option 1: Degraded Mode (Disable Swan Features)

**Use case:** Swan down >30 minutes, no ETA, users need core app functionality

**Steps:**

1. Disable Swan-dependent features:
   ```bash
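   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "FEATURE_ACCOUNTS": "disabled",
         "FEATURE_CARDS": "disabled",
         "FEATURE_SWAN_WALLETS": "disabled"
       }}}}'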
   ```

2. Show banner in app:
   ```
   ⚠️ Noen funksjoner er midlertidig utilgjengelige
   Kontoopprettelse og korttransaksjoner er ikke tilgjengelig for øyeblikket.
   Andre funksjoner virker som normalt.
   ```

3. Allow core features to work:
   - BankID login: ✅ (not Swan-dependent)
   - Open Banking balance: ✅ (uses Neonomics, not Swan)
   - PISP payments: ✅ (uses Neonomics, not Swan)
   - Swan accounts: ❌ (disabled)

4. **Re-enable when Swan is back:**
   ```bash
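   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "FEATURE_ACCOUNTS": "enabled",
         "FEATURE_CARDS": "enabled",
         "FEATURE_SWAN_WALLETS": "enabled"
       }}}}'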
   ```

**Risk:** Users cannot create accounts or use cards during outage.
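
On the application side, the flags have to be read somewhere before a Swan call is attempted. The sketch below shows one way that check could look. `isFeatureEnabled` is a hypothetical helper, and treating an unset variable as "enabled" is an assumption about the desired default.

```typescript
// Flags named in this runbook; values are the strings "enabled" / "disabled".
type FeatureFlag = "FEATURE_ACCOUNTS" | "FEATURE_CARDS" | "FEATURE_SWAN_WALLETS";

// The environment is passed in (rather than read globally) so the check is easy to test.
export function isFeatureEnabled(
  flag: FeatureFlag,
  env: Record<string, string | undefined>,
): boolean {
  // Assumption: an unset flag means the feature is on, so a missing variable
  // never disables a feature by accident.
  const value = env[flag];
  return value === undefined || value.trim().toLowerCase() !== "disabled";
}
```

A route handler could call it with `process.env` as the second argument and return the banner text above when the result is `false`.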

---

### Option 2: Queue Swan Operations for Later

**Use case:** Swan down, users need to create accounts but can wait

**Steps:**

1. Queue account creation requests:
   ```typescript
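   // src/app/api/accounts/create/route.ts
   export async function POST(request: Request) {
     const { accountType } = await request.json();
     // Assumption: getAuthenticatedUserId stands in for the project's auth helper.
     // Never take the user ID from the request body.
     const userId = await getAuthenticatedUserId(request);

     try {
       const account = await swanClient.createAccount(accountType);
       return Response.json(account);
     } catch (error) {
       if ((error as { code?: string }).code === 'SWAN_UNAVAILABLE') {
         // Queue for later processing
         await db.insert('pending_accounts', {
           user_id: userId,
           account_type: accountType,
           status: 'queued',
           created_at: new Date(),
         });

         // 202 Accepted: the request is queued, not yet fulfilled
         return Response.json(
           {
             success: true,
             message: 'Account creation queued, will complete within 1 hour',
           },
           { status: 202 },
         );
       }
       throw error;
     }
   }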
   ```

2. Process queue when Swan is back:
   ```bash
   # Run cron job to process pending accounts
   node ~/ALAI/products/Drop/scripts/process-pending-accounts.js
   ```

3. Notify users when account is ready:
   ```
   Din konto er klar!
   Takk for tålmodigheten. Du kan nå bruke alle funksjoner i Drop.
   ```

**Risk:** Delayed user experience. Users may expect instant account creation.
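
The queue processor script itself is not shown in this runbook. The following is only a sketch of the loop such a script might run, with the database and Swan client injected as functions. All names here (`processPendingAccounts`, `fetchQueued`, `markStatus`) are hypothetical.

```typescript
interface PendingAccount {
  id: number;
  userId: string;
  accountType: string;
}

// Dependencies are injected so the sketch does not depend on the real db or Swan client.
interface ProcessorDeps {
  fetchQueued: () => Promise<PendingAccount[]>;
  createAccount: (userId: string, accountType: string) => Promise<void>;
  markStatus: (id: number, status: "completed") => Promise<void>;
}

// Processes queued rows one at a time; a row that fails stays queued for the next run.
export async function processPendingAccounts(
  deps: ProcessorDeps,
): Promise<{ completed: number; stillQueued: number }> {
  let completed = 0;
  let stillQueued = 0;
  for (const row of await deps.fetchQueued()) {
    try {
      await deps.createAccount(row.userId, row.accountType);
      await deps.markStatus(row.id, "completed");
      completed += 1;
    } catch {
      // Swan may still be failing for this row; leave it queued and continue.
      stillQueued += 1;
    }
  }
  return { completed, stillQueued };
}
```

If `createAccount` succeeds but `markStatus` fails, the row would be retried and the account created twice, so a real implementation would likely need an idempotency key or a lookup before creation.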

---

## Monitoring & Alerts

### Metrics to Track

- **Swan API success rate:** Should be >99%
- **Swan API latency:** p50 <500ms, p95 <2s, p99 <5s
- **Swan error rate by operation:** Track createAccount, issueCard, makePayment separately

### Alert Rules

```typescript
// src/lib/swan-monitor.ts
export async function trackSwanFailure(operation: string, error: any) {
  const failureRate = await calculateSwanFailureRate('last_5_minutes');

  if (failureRate > 0.05) { // 5% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'Swan API failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of Swan calls failing`,
      operation,
    });
  }
}
```
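
`calculateSwanFailureRate` is not defined in this runbook. One way it could be backed is an in-memory sliding window, sketched below. The class name and the single-instance assumption are illustrative only; several App Runner instances would need a shared store instead.

```typescript
// In-memory sliding window of call outcomes (single instance only).
export class FailureRateWindow {
  private events: { at: number; ok: boolean }[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record one call outcome at the given time (defaults to now).
  record(ok: boolean, now: number = Date.now()): void {
    this.events.push({ at: now, ok });
  }

  // Fraction of failed calls inside the window; 0 when there were no calls.
  failureRate(now: number = Date.now()): number {
    const cutoff = now - this.windowMs;
    this.events = this.events.filter((e) => e.at > cutoff);
    if (this.events.length === 0) return 0;
    const failures = this.events.filter((e) => !e.ok).length;
    return failures / this.events.length;
  }
}
```

With very low traffic a single failed call yields a 100% rate, so an alert rule built on this would probably also want a minimum sample size.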

---

## Post-Incident Actions

1. **Process queued operations:**
   ```sql
   SELECT * FROM pending_accounts WHERE status = 'queued';
   -- Retry all pending account creations
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-swan-outage.md
   ```

3. **Review SLA with Swan:**
   - Check if outage violated SLA
   - Request compensation/credits
   - Discuss failover options

4. **Improve resilience:**
   - Add Swan health check (every 5 min)
   - Implement circuit breaker for Swan API
   - Consider multi-provider strategy (backup BaaS)
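
The circuit breaker mentioned above could take roughly the following shape. This is a sketch of the general pattern, not an existing Drop component; the class name, the threshold and the cooldown are all assumptions to be tuned.

```typescript
type BreakerState = "closed" | "open" | "half-open";

export class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Call before each request; false means "fail fast, do not call Swan".
  canRequest(now: number = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow trial requests
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `canRequest()` before each Swan call and report the outcome with `recordSuccess()` or `recordFailure()`. While the circuit is open, requests can go straight to the queueing path from Option 2.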

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 5 min | If Swan status page shows incident, notify Alem |
| 15 min | If not resolved, enable degraded mode |
| 30 min | Contact Swan support via phone if no ETA |
| 1 hour | Public communication to users |

---

## Contacts

- **Swan Support:** support@swan.io
- **Swan Phone:** +33 X XXXX XXXX (check Swan Dashboard for number)
- **Swan Status:** https://status.swan.io
- **Internal:** Alem (CEO, final decision on feature disabling)

---

## Related Documentation

- `docs/architecture/banking.md` — Swan BaaS integration
- `src/lib/swan-client.ts` — Swan GraphQL client
- `docs/compliance/swan-requirements.md` — Swan partner KYC/compliance
- Vaultwarden item: "Swan API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Runbook: Neonomics Outage

# Runbook: Neonomics Open Banking Outage

**Service:** Neonomics Open Banking Aggregator
**Severity:** CRITICAL (blocks AISP balance fetch and PISP payments)
**MTTR Target:** <20 minutes
**Owner:** John (AI Director)

---

## Overview

Neonomics is Drop's Open Banking aggregator for Norwegian banks. It provides:
- **AISP (Account Information):** Fetch user's bank account balance via PSD2 consent
- **PISP (Payment Initiation):** Initiate payments from user's bank account
- **Bank connectivity:** Single API to connect to all Norwegian banks (DNB, Nordea, SpareBank 1, etc.)

**Impact:** If Neonomics is down, Drop cannot:
- Show bank balances
- Initiate remittance payments
- Process QR payments

This is a **critical** outage affecting core functionality.

---

## Symptoms

Users report core features not working:

- Cannot see bank balance (shows "unavailable")
- Cannot initiate payments (error at payment step)
- Bank connection fails ("Cannot connect to bank")
- Error: "Open Banking service unavailable"

**User impact:** Cannot use core Drop features (balance, payments).

---

## Diagnosis

### 1. Check Neonomics Service Status

**External status:**
```bash
# Neonomics has no public status page
# Test via API health check
curl -X GET https://api.neonomics.io/health \
  -H "Authorization: Bearer <api-key>" \
  -v

# Expected: HTTP 200
# If 500/503: Neonomics outage
# If timeout: Network or Neonomics connectivity issue
```
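
The status-code interpretation above, combined with the 401/403 and 429 cases described under Common Causes, can be captured in a small lookup for use in scripts or synthetic checks. This is a sketch; `classifyHealthProbe` and the verdict names are hypothetical.

```typescript
type HealthVerdict =
  | "healthy"
  | "credentials"
  | "rate-limited"
  | "provider-outage"
  | "network-or-connectivity"
  | "unknown";

// probe is the HTTP status of the health check, or "timeout" when no response arrived.
export function classifyHealthProbe(probe: number | "timeout"): HealthVerdict {
  if (probe === "timeout") return "network-or-connectivity";
  if (probe >= 200 && probe < 300) return "healthy";
  if (probe === 401 || probe === 403) return "credentials"; // see Cause 4
  if (probe === 429) return "rate-limited"; // see Cause 3
  if (probe >= 500) return "provider-outage"; // 500/503: Neonomics outage
  return "unknown";
}
```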

**Check specific bank connectivity:**
```bash
# List banks and their status
curl -X GET https://api.neonomics.io/banks \
  -H "Authorization: Bearer <api-key>" \
  | jq '.[] | select(.country == "NO") | {name, status, lastChecked}'

# Look for:
# - "status": "degraded" or "offline"
# - Specific bank down (e.g., DNB) vs all banks
```

### 2. Check Drop Logs

```bash
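# CloudWatch Logs (production)
# Filter patterns are case-sensitive; "?a ?b" matches either term.
# The epoch arithmetic works with both GNU and BSD/macOS date.
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern '?neonomics ?Neonomics ?NEONOMICS' \
  --start-time $(( ($(date +%s) - 900) * 1000 )) \
  --region eu-west-1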

# Look for:
# - "Neonomics API timeout"
# - "Neonomics 503 Service Unavailable"
# - "Bank API unavailable: DNB"
# - "Payment initiation failed: NEONOMICS_TIMEOUT"
```

### 3. Determine Scope of Outage

**Is it all banks or specific banks?**
```bash
# Count recent failures by bank
# Both quoted terms must appear; jq -r emits raw messages so grep can match the quotes.
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern '"Neonomics" "failed"' \
  --start-time $(( ($(date +%s) - 1800) * 1000 )) \
  --region eu-west-1 \
  | jq -r '.events[].message' \
  | grep -o '"bank":"[^"]*"' \
  | sort | uniq -c | sort -rn

# Example output:
# 45 "bank":"DNB"        ← DNB-specific issue
# 2 "bank":"Nordea"      ← Nordea working mostly
# 1 "bank":"SpareBank1"  ← SpareBank1 working
```
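
The same grouping can be done in code when the messages are already in hand, for example inside a monitoring job. This sketch assumes each log message is a JSON object with an optional `bank` field, matching the fragments grepped above. `countFailuresByBank` is a hypothetical helper.

```typescript
// Counts failures per bank and returns [bank, count] pairs, highest count first
// (the equivalent of `sort | uniq -c | sort -rn`).
export function countFailuresByBank(messages: string[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const message of messages) {
    let bank: unknown;
    try {
      bank = (JSON.parse(message) as { bank?: unknown }).bank;
    } catch {
      continue; // skip lines that are not JSON objects
    }
    if (typeof bank === "string") {
      counts.set(bank, (counts.get(bank) ?? 0) + 1);
    }
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

If one bank dominates the result, Cause 2 (specific bank API down) is the likely match. An even spread points to Cause 1.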

**Is it AISP, PISP, or both?**
```bash
# Check failure type
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Neonomics" \
  --start-time $(( ($(date +%s) - 900) * 1000 )) \
  --region eu-west-1 \
  | jq -r '.events[].message' \
  | grep -oE '"service":"(aisp|pisp)"' \
  | sort | uniq -c

# Example:
# 30 "service":"aisp"  ← AISP failing
# 45 "service":"pisp"  ← PISP failing
# If both high: full Neonomics outage
```

### 4. Test AISP and PISP Flows

**Test AISP (balance fetch):**
```bash
# Staging environment
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

curl -X GET https://drop-staging.fly.dev/api/accounts/balance \
  -H "Authorization: Bearer $TOKEN" \
  -v

# Expected: HTTP 200, balance data
# If 500: AISP broken
```

**Test PISP (payment initiation):**
```bash
curl -X POST https://drop-staging.fly.dev/api/transactions/remittance \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientId": "rec_test123",
    "amount": 100,
    "currency": "NOK"
  }' \
  -v

# Expected: HTTP 200, payment initiated
# If 500: PISP broken
```

### 5. Check Neonomics API Credentials

```bash
# Verify API key is valid
bw get item "Neonomics API" --session $BW_SESSION

# Check App Runner environment variables
aws apprunner describe-service \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep NEONOMICS

# Expected:
# NEONOMICS_API_KEY: <exists>
# NEONOMICS_ENVIRONMENT: production
```

---

## Common Causes & Solutions

### Cause 1: Neonomics Full Outage (All Banks)

**Probability:** 10% (rare but critical)

**Symptoms:**
- ALL banks fail (DNB, Nordea, SpareBank 1, etc.)
- All AISP and PISP requests timeout or return 503
- Neonomics API health check fails

**Solution:**

1. **Verify full outage:**
   ```bash
   # Test multiple endpoints
   curl -X GET https://api.neonomics.io/health -v
   curl -X GET https://api.neonomics.io/banks -H "Authorization: Bearer <key>" -v

   # If both fail: confirmed full outage
   ```

2. **Contact Neonomics support URGENTLY:**
   - Email: support@neonomics.io
   - Slack: #neonomics-support (if available)
   - Phone: +47 XXXX XXXX (check Neonomics Dashboard)

3. **Communicate to users (Norwegian):**
   ```
   Emne: Betalingstjenester midlertidig utilgjengelige

   Hei,

   Vi opplever for øyeblikket tekniske problemer med vår betalingsleverandør.
   Dette påvirker:
   - Visning av saldo
   - Nye betalinger

   Vi jobber med å gjenopprette tjenesten så raskt som mulig.
   Estimert løsning: [X minutter/timer]

   Mvh,
   Drop
   ```

4. **Enable degraded mode:**
   ```bash
   # Show cached balances, disable new payments
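   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "cached",
         "PISP_MODE": "disabled",
         "NEONOMICS_FALLBACK": "true"
       }}}}'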
   ```

5. **Show maintenance banner in app:**
   ```
   ⚠️ Betalinger midlertidig utilgjengelig
   Vi opplever tekniske problemer. Saldo vises med forsinkelse.
   Betalinger er deaktivert midlertidig.
   ```

6. **Monitor Neonomics status:**
   - Check API health every 5 minutes
   - When API returns 200: test AISP/PISP flows
   - Re-enable features gradually

7. **Re-enable live mode when resolved:**
   ```bash
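   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "live",
         "PISP_MODE": "live",
         "NEONOMICS_FALLBACK": "false"
       }}}}'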
   ```

**ETA:** Depends on Neonomics (typically <2 hours for major incidents)

---

### Cause 2: Specific Bank API Down

**Probability:** 25% (one bank's API temporarily unavailable)

**Symptoms:**
- Only users of specific bank (e.g., DNB) affected
- Other banks work fine (Nordea, SpareBank 1)
- Logs show: "Bank API timeout: DNB"

**Common reasons:**
- Bank's API maintenance (often 02:00-06:00 CET)
- Bank's API outage
- Bank rate limiting Neonomics
- Bank API certificate expired

**Solution:**

1. **Identify affected bank:**
   ```bash
   # Count failures by bank (jq -r emits raw messages so grep can match the quotes)
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern '"Bank API"' \
     --start-time $(( ($(date +%s) - 1800) * 1000 )) \
     --region eu-west-1 \
     | jq -r '.events[].message' \
     | grep -o '"bank":"[^"]*"' \
     | sort | uniq -c | sort -rn
   ```

2. **Check bank status:**
   - **DNB:** https://www.dnb.no/drift
   - **Nordea:** https://www.nordea.no/info/driftsmeldinger
   - **SpareBank 1:** https://www.sparebank1.no/driftsmeldinger
   - Norwegian banks often announce maintenance

3. **Contact Neonomics to verify:**
   - Neonomics may already know about bank API issues
   - Ask for ETA on bank connectivity restoration

4. **Notify affected users (bank-specific):**
   ```sql
   -- Find users with affected bank
   SELECT user_id, email
   FROM bank_accounts
   JOIN users ON users.id = bank_accounts.user_id
   WHERE bank_name = 'DNB';
   ```

   Email (Norwegian):
   ```
   Emne: Problemer med [Bank] tilkobling

   Hei,

   Vi opplever for øyeblikket problemer med tilkoblingen til [Bank].
   Dette skyldes tekniske problemer hos banken.

   Andre banker virker som normalt.
   Hvis du har konto i en annen bank, kan du bruke den i mellomtiden.

   Estimert løsning: [X minutter/timer]

   Mvh,
   Drop
   ```

5. **Graceful degradation (bank-specific):**
   ```typescript
   // src/lib/neonomics-client.ts
   async function fetchBalance(userId: string, bankId: string) {
     try {
       return await neonomicsAPI.getBalance(userId, bankId);
     } catch (error) {
       if (error.code === 'BANK_API_TIMEOUT' && error.bank === 'DNB') {
         // Return cached balance for DNB users
         const cached = await getCachedBalance(userId);
         return {
           balance: cached?.balance ?? null, // ?? keeps a legitimate cached balance of 0
           currency: 'NOK',
           lastUpdated: cached?.timestamp,
           warning: 'DNB opplever tekniske problemer. Saldo kan være utdatert.'
         };
       }
       throw error;
     }
   }
   ```

**ETA:** Depends on bank (typically <2 hours for maintenance, <4 hours for incidents)

---

### Cause 3: Neonomics API Rate Limiting

**Probability:** 15% (during peak hours or viral growth)

**Symptoms:**
- Logs show: HTTP 429 "Too Many Requests"
- Intermittent failures (some requests succeed, others fail)
- Rate limit headers in logs

**Solution:**

1. **Check rate limit headers:**
   ```bash
   aws logs filter-log-events \
     --log-group-name /aws/apprunner/drop-production \
     --filter-pattern '"X-RateLimit"' \
     --start-time $(( ($(date +%s) - 600) * 1000 )) \
     --region eu-west-1 \
     | jq -r '.events[].message' \
     | grep -E "X-RateLimit-(Limit|Remaining|Reset)"
   ```

2. **Implement request throttling:**
   ```typescript
   // src/lib/neonomics-client.ts
   import PQueue from 'p-queue';

   const queue = new PQueue({
     concurrency: 10,      // Max 10 concurrent requests
     interval: 1000,        // Per second
     intervalCap: 50        // Max 50 requests per second
   });

   export async function callNeonomics(endpoint: string, options: any) {
     return queue.add(() =>
       fetch(`https://api.neonomics.io${endpoint}`, {
         ...options,
         headers: {
           'Authorization': `Bearer ${process.env.NEONOMICS_API_KEY}`,
           ...options.headers,
         },
       })
     );
   }
   ```

3. **Aggressive caching during rate limit:**
   ```typescript
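   // Serve cached balances for up to 5 minutes during rate limiting (vs 1 minute normally)
   const CACHE_TTL_NORMAL = 60;      // 1 minute: younger entries are served without an API call
   const CACHE_TTL_RATE_LIMIT = 300; // 5 minutes: oldest entry served while rate limited

   async function getBalanceWithCache(userId: string) {
     const key = `balance:${userId}`;
     const raw = await redis.get(key);
     const entry: { balance: unknown; fetchedAt: number } | null = raw ? JSON.parse(raw) : null;

     // Fresh entry: skip the API call
     if (entry && Date.now() - entry.fetchedAt < CACHE_TTL_NORMAL * 1000) return entry.balance;

     try {
       const balance = await neonomicsAPI.getBalance(userId);
       // Store with the longer TTL so a stale copy survives as a 429 fallback
       await redis.setex(key, CACHE_TTL_RATE_LIMIT, JSON.stringify({ balance, fetchedAt: Date.now() }));
       return balance;
     } catch (error) {
       // Rate limited: serve the stale entry if one is still cached
       if ((error as { status?: number }).status === 429 && entry) return entry.balance;
       throw error;
     }
   }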
   ```

4. **Contact Neonomics to increase rate limit:**
   - Email: support@neonomics.io
   - Provide traffic stats (requests/day, peak times)
   - Request higher API quota

**ETA:** 5 minutes (automatic throttling), 1-2 days (if quota increase needed)
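
The rate limit headers can also drive the back-off directly. This sketch assumes `X-RateLimit-Reset` carries a Unix timestamp in seconds. Some APIs send a number of seconds to wait instead, so this must be checked against Neonomics documentation before use. `msUntilReset` is a hypothetical helper.

```typescript
// Returns how long to pause before the next request, in milliseconds.
export function msUntilReset(
  headers: Record<string, string>,
  nowMs: number = Date.now(),
): number {
  // Header names are case-insensitive, so normalise before lookup.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    lower[name.toLowerCase()] = headers[name];
  }
  const remaining = Number(lower["x-ratelimit-remaining"]);
  const resetSeconds = Number(lower["x-ratelimit-reset"]);

  // Quota left, or headers missing/unparseable: no need to wait.
  if (Number.isNaN(remaining) || remaining > 0) return 0;
  if (!Number.isFinite(resetSeconds)) return 0;

  // Assumption: Reset is an absolute Unix time in seconds.
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```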

---

### Cause 4: Invalid or Expired API Credentials

**Probability:** 5% (after credential rotation or account issue)

**Symptoms:**
- Logs show: "401 Unauthorized" or "403 Forbidden"
- All Neonomics API calls fail immediately
- API health check returns 401

**Solution:**

1. **Verify Neonomics API credentials:**
   ```bash
   bw get item "Neonomics API" --session $BW_SESSION

   # Check:
   # - API key is correct
   # - Not expired
   # - Correct environment (production vs sandbox)
   ```

2. **Regenerate API key (if needed):**
   - Login to Neonomics Dashboard (if available)
   - Navigate to Settings → API Keys
   - Generate new API key
   - Copy to Vaultwarden

3. **Update App Runner environment variables:**
   ```bash
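   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "NEONOMICS_API_KEY": "<new-key>",
         "NEONOMICS_ENVIRONMENT": "production"
       }}}}'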
   ```

4. **Trigger deployment:**
   ```bash
   aws apprunner start-deployment --service-arn <ARN> --region eu-west-1
   ```

5. **Test after deployment:**
   ```bash
   curl -X GET https://getdrop.no/api/accounts/balance \
     -H "Authorization: Bearer <test-user-token>" \
     -v

   # Expected: HTTP 200, balance data
   ```

**ETA:** 10 minutes

---

### Cause 5: PSD2 Consent Expired (AISP Only)

**Probability:** 20% (affects AISP, not PISP)

**Symptoms:**
- Only AISP (balance fetch) fails
- PISP (payments) still works
- Logs show: "CONSENT_EXPIRED" or "CONSENT_INVALID"
- Specific users affected (not all)

**Note:** This is actually a user-level issue, not a Neonomics outage. See `aisp-balance-failure.md` runbook for full details.

**Quick solution:**

1. **Identify users with expired consent:**
   ```sql
   SELECT user_id, email, bank_name, consent_expires_at
   FROM bank_accounts
   JOIN users ON users.id = bank_accounts.user_id
   WHERE consent_expires_at < NOW();
   ```

2. **Notify users to re-authorize (Norwegian):**
   ```
   Push notification:
   Banktilkobling utløpt — Trykk her for å fornye
   ```

3. **User re-authorizes via BankID + bank consent flow**

**ETA:** Immediate (user action required)
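
The same expiry rule can be applied in application code, for example to decide whether to show the renewal prompt before a balance fetch is attempted. This is a sketch only. `consentState` and the "expiring-soon" warning window are hypothetical additions, not existing Drop behaviour.

```typescript
type ConsentState = "expired" | "expiring-soon" | "valid";

// Mirrors the SQL above: a consent is expired once its expiry time is in the past.
export function consentState(expiresAt: Date, now: Date, warnDays: number): ConsentState {
  const msLeft = expiresAt.getTime() - now.getTime();
  if (msLeft < 0) return "expired";
  const warnMs = warnDays * 24 * 60 * 60 * 1000;
  return msLeft <= warnMs ? "expiring-soon" : "valid";
}
```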

---

### Cause 6: Network or Firewall Issues

**Probability:** 5% (AWS security group misconfiguration)

**Symptoms:**
- Logs show: "Connection timeout" or "ECONNREFUSED"
- Neonomics API requests never reach destination
- Works locally but fails in production

**Solution:**

1. **Check outbound connectivity:**
   ```bash
   # App Runner egress is unrestricted by default
   # If using VPC connector, check security group
   aws ec2 describe-security-groups \
     --group-ids <vpc-connector-sg> \
     --region eu-west-1 \
     | jq '.SecurityGroups[].IpPermissionsEgress'
   ```

2. **Test DNS resolution:**
   ```bash
   nslookup api.neonomics.io

   # Should resolve to Neonomics IPs
   # If NXDOMAIN: DNS issue
   ```

3. **Check AWS service health:**
   ```bash
   # Check App Runner service events
   aws apprunner list-operations \
     --service-arn <ARN> \
     --region eu-west-1
   ```

4. **Whitelist Neonomics IPs (if using strict firewall):**
   - Contact Neonomics for IP ranges
   - Add to security group outbound rules (port 443)

**ETA:** 15 minutes (if quick fix), 1 hour (if requires networking changes)

---

## Emergency Workarounds

### Option 1: Cached Balance + Disable Payments

**Use case:** Neonomics down >30 minutes, no ETA

**Steps:**

1. Enable cached balance mode:
   ```bash
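   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "cached",
         "AISP_CACHE_TTL": "3600",
         "PISP_MODE": "disabled"
       }}}}'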
   ```

2. Show warning banner in app:
   ```
   ⚠️ Betalinger midlertidig utilgjengelige
   Saldo vises med forsinkelse (opptil 1 time).
   Nye betalinger er deaktivert til tjenesten er tilbake.
   ```

3. Allow read-only features:
   - Users can see cached balance
   - Users can see transaction history
   - Cannot initiate new payments

4. **Re-enable when Neonomics is back:**
   ```bash
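   # Env vars live in SourceConfiguration, not --instance-configuration. Start from the current
   # ImageRepository block (aws apprunner describe-service) and change only these variables.
   aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
     --source-configuration '{"ImageRepository": {
       "ImageIdentifier": "<current-image-identifier>", "ImageRepositoryType": "ECR",
       "ImageConfiguration": {"RuntimeEnvironmentVariables": {
         "AISP_MODE": "live",
         "PISP_MODE": "live"
       }}}}'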
   ```

**Risk:** Stale balance data. Users may think they have more/less money than reality.

---

### Option 2: Queue Payments for Later Processing

**Use case:** PISP down, users need to make urgent payments

**Steps:**

1. Queue payment requests:
   ```typescript
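   // src/app/api/transactions/remittance/route.ts
   export async function POST(request: Request) {
     const paymentData = await request.json();
     // Assumption: getAuthenticatedUserId stands in for the project's auth helper.
     // Never take the user ID from the request body.
     const userId = await getAuthenticatedUserId(request);

     try {
       const payment = await neonomicsAPI.initiatePayment(paymentData);
       return Response.json(payment);
     } catch (error) {
       if ((error as { code?: string }).code === 'NEONOMICS_UNAVAILABLE') {
         // Queue for later
         await db.insert('pending_payments', {
           user_id: userId,
           payment_data: paymentData,
           status: 'queued',
           created_at: new Date(),
         });

         // 202 Accepted: the payment is queued, not yet initiated
         return Response.json(
           {
             success: true,
             message: 'Betaling satt i kø. Vil bli behandlet innen 2 timer.',
           },
           { status: 202 },
         );
       }
       throw error;
     }
   }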
   ```

2. Process queue when Neonomics is back:
   ```bash
   node ~/ALAI/products/Drop/scripts/process-pending-payments.js
   ```

3. Notify users when payment completes:
   ```
   Din betaling er behandlet
   Betalingen på [amount] til [recipient] er fullført.
   ```

**Risk:** Delayed payments. User may expect instant transfer.

---

## Monitoring & Alerts

### Metrics to Track

- **Neonomics API success rate:** Should be >99%
- **Neonomics API latency:** p50 <2s, p95 <5s, p99 <10s
- **Bank-specific failure rate:** Track DNB, Nordea, SpareBank 1 separately
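
The latency targets above can be checked against raw samples with a percentile helper. This sketch uses the nearest-rank method, which is one of several percentile definitions and may differ slightly from what Grafana or CloudWatch report. The function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
export function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Targets from the list above, in milliseconds.
const NEONOMICS_LATENCY_TARGETS_MS = { p50: 2000, p95: 5000, p99: 10000 };

export function latencyWithinTargets(samplesMs: number[]): boolean {
  return (
    percentile(samplesMs, 50) < NEONOMICS_LATENCY_TARGETS_MS.p50 &&
    percentile(samplesMs, 95) < NEONOMICS_LATENCY_TARGETS_MS.p95 &&
    percentile(samplesMs, 99) < NEONOMICS_LATENCY_TARGETS_MS.p99
  );
}
```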

### Alert Rules

```typescript
// src/lib/neonomics-monitor.ts
export async function trackNeonomicsFailure(service: 'aisp' | 'pisp', error: any) {
  const failureRate = await calculateFailureRate('neonomics', 'last_5_minutes');

  if (failureRate > 0.1) { // 10% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'Neonomics API failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of Neonomics calls failing`,
      service,
    });
  }
}
```

---

## Post-Incident Actions

1. **Process queued operations:**
   ```sql
   SELECT * FROM pending_payments WHERE status = 'queued';
   -- Retry all pending payments
   ```

2. **Document incident:**
   ```bash
   touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-neonomics-outage.md
   ```

3. **Review SLA with Neonomics:**
   - Check if outage violated SLA
   - Request compensation/credits
   - Discuss redundancy options

4. **Improve resilience:**
   - Add Neonomics health check (synthetic test every 5 min)
   - Implement circuit breaker for Neonomics API
   - Consider multi-provider strategy (backup Open Banking aggregator)

---

## Escalation

| Time | Action |
|------|--------|
| 0 min | John starts diagnosis |
| 10 min | If full Neonomics outage confirmed, notify Alem |
| 20 min | If not resolved, enable degraded mode (cached balance, disable payments) |
| 30 min | Contact Neonomics support via phone if no response |
| 1 hour | Public communication to all users |
| 2 hours | Assess alternative Open Banking providers (emergency only) |

---

## Contacts

- **Neonomics Support:** support@neonomics.io
- **Neonomics Slack:** #neonomics-support (if available)
- **Neonomics Phone:** +47 XXXX XXXX (check Neonomics Dashboard)
- **Internal:** Alem (CEO, final decision on fallback modes)

---

## Related Documentation

- `docs/architecture/open-banking.md` — Neonomics AISP/PISP flow
- `src/lib/neonomics-client.ts` — Neonomics API client
- `docs/compliance/psd2-requirements.md` — PSD2 regulatory requirements
- `support/runbooks/aisp-balance-failure.md` — AISP-specific failures
- `support/runbooks/pisp-payment-failure.md` — PISP-specific failures
- Vaultwarden item: "Neonomics API" — Credentials

---

**Last Updated:** 2026-02-22
**Next Review:** Before Phase 2 (Banking Integration)

# Infrastructure & Internal Services

Complete runbooks for all ALAI internal services: Docker containers, LaunchAgent daemons, Cloudflare tunnel, Vaultwarden, email system, bots, and more.

# ALAI Infrastructure — Service Catalog & Runbooks

> **Last updated:** 2026-03-11 | **Maintained by:** John (AI Director)
> **Host:** Mac Studio M3 Ultra (ANVIL) | **OS:** macOS
> **Quick health:** `node ~/system/tools/daemon-health.js`

---

## 🐳 Docker Services (23 containers)

### Core Platform Services

| Service | Image | Port | External URL | Health | Restart |
|---------|-------|------|--------------|--------|---------|
| **Vaultwarden** | vaultwarden/server | :8200 | vault.basicconsulting.no | ✅ healthy | `cd ~/system/services/vaultwarden && docker compose restart` |
| **BookStack** | linuxserver/bookstack | :6875 | docs.basicconsulting.no | ✅ running | `cd ~/system/services/bookstack && docker compose restart` |
| **BookStack DB** | linuxserver/mariadb | :3306 (internal) | — | ✅ running | Restarts with BookStack |
| **Planka** | plankanban/planka | :3100 | boards.basicconsulting.no | ✅ healthy | `cd ~/system/services/planka && docker compose restart` |
| **Planka DB** | postgres:15-alpine | internal | — | ✅ healthy | Restarts with Planka |
| **Documenso** | documenso/documenso | :3003 | sign.basicconsulting.no | ✅ running | `cd ~/system/services/documenso && docker compose restart` |
| **Documenso DB** | postgres:15-alpine | internal | — | ✅ healthy | Restarts with Documenso |
| **Documenso MinIO** | minio/minio | :9002/:9003 | — | ✅ running | Restarts with Documenso |
| **Baikal (CalDAV)** | ckulka/baikal:nginx | :5232 | calendar.basicconsulting.no | ✅ running | `cd ~/system/services/baikal && docker compose restart` |
| **Qdrant (Vector DB)** | qdrant/qdrant | :6333/:6334 | — | ✅ running | `docker restart qdrant` |

### Product Database Services

| Service | Port | Product | Health | Restart |
|---------|------|---------|--------|---------|
| **drop-postgres** | :5433 | Drop | ✅ healthy | `cd ~/ALAI/products/Drop && docker compose restart drop-postgres` |
| **plock-db** | :5434 | Plock | ✅ healthy | `cd ~/ALAI/products/Plock && docker compose restart plock-db` |
| **plock-redis** | :6380 | Plock | ✅ healthy | Restarts with plock-db |
| **bilko-postgres** | :5436 | Bilko | ✅ running | `cd ~/ALAI/products/Bilko && docker compose restart bilko-postgres` |
| **bilko-redis** | :6382 | Bilko | ✅ running | Restarts with bilko |
| **lobby-postgres** | :5437 | Lobby | ✅ healthy | `cd ~/ALAI/products/Lobby && docker compose restart lobby-postgres` |
| **lumiscare-postgres** | :5432 | LumisCare | ✅ healthy | Client project |
| **lumiscare-redis** | :6379 | LumisCare | ✅ healthy | Client project |
| **backend-postgres** | :5435 | BasicFakta | ✅ healthy | `cd ~/ALAI/products/BasicFakta && docker compose restart` |
| **backend-redis** | :6381 | BasicFakta | ✅ healthy | Restarts with backend |

### Monitoring Stack (Drop)

| Service | Port | URL | Restart |
|---------|------|-----|---------|
| **Grafana** | :3300 | grafana.basicconsulting.no | `docker restart drop-grafana` |
| **Prometheus** | :9090 | prometheus.basicconsulting.no | `docker restart drop-prometheus` |
| **Node Exporter** | :9100 | — | `docker restart drop-node-exporter` |

---

## ☁️ Cloudflare Tunnel (cloudflared)

**LaunchAgent:** `com.john.cloudflared`
**Config:** `~/.cloudflared/config.yml`
**Tunnel ID:** `3315a609-7934-45c5-ad0c-56d86d16374d`

### Exposed Services

| Hostname | Backend | Purpose |
|----------|---------|---------|
| docs.basicconsulting.no | localhost:6875 | BookStack wiki |
| vault.basicconsulting.no | localhost:8200 | Vaultwarden |
| sign.basicconsulting.no | localhost:3003 | Documenso (e-signing) |
| boards.basicconsulting.no | localhost:3100 | Planka (kanban) |
| calendar.basicconsulting.no | localhost:5232 | Baikal (CalDAV) |
| mc.basicconsulting.no | localhost:3030 | MC Dashboard |
| api.basicconsulting.no | localhost:3001 | API gateway |
| drop-api.basicconsulting.no | localhost:3201 | Drop API |
| lobby.basicconsulting.no | localhost:3010 | Lobby frontend |
| lobby-api.basicconsulting.no | localhost:3009 | Lobby API |
| auth.basicconsulting.no | localhost:9000 | Authentik (SSO) |
| grafana.basicconsulting.no | localhost:3300 | Grafana dashboards |
| prometheus.basicconsulting.no | localhost:9090 | Prometheus metrics |
| track.basicconsulting.no | localhost:3456 | Email tracking pixel |
| ssh.basicconsulting.no | localhost:22 | SSH access |
| vnc.basicconsulting.no | localhost:5900 | VNC screen sharing |

### Runbook: Tunnel down

```bash
# Check status
launchctl list | grep cloudflared

# Restart
launchctl stop com.john.cloudflared
launchctl start com.john.cloudflared

# Verify
cloudflared tunnel info 3315a609-7934-45c5-ad0c-56d86d16374d

# Logs
tail -50 ~/system/logs/cloudflared.log
```

---

## 🔐 Vaultwarden

**Container:** vaultwarden | **Port:** :8200
**URL:** vault.basicconsulting.no (Cloudflare Access protected)
**Local:** http://localhost:8200 | **HTTPS proxy:** https://localhost:8443 (Caddy)
**Admin token:** In `~/system/services/vaultwarden/.env`

### Dependencies
- Docker
- Caddy HTTPS proxy (`com.john.caddy-vault`) — needed for `bw` CLI
- vault-keeper daemon (`com.john.vault-keeper`) — auto-unlock

### Runbook: Vault locked/unauthenticated

```bash
# Check status
NODE_TLS_REJECT_UNAUTHORIZED=0 bw status

# If "locked": vault-keeper auto-fixes every 15 min. Manual unlock
# (umask 077 keeps a newly created session file readable only by you):
(umask 077; NODE_TLS_REJECT_UNAUTHORIZED=0 bw unlock --raw > /tmp/bw-session)

# If "unauthenticated": needs a full re-login:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw login --apikey
# Enter client_id and client_secret from ~/system/config/vault-apikey.json
# Then unlock:
(umask 077; NODE_TLS_REJECT_UNAUTHORIZED=0 bw unlock --raw > /tmp/bw-session)

# Verify
NODE_TLS_REJECT_UNAUTHORIZED=0 BW_SESSION=$(cat /tmp/bw-session) bw list items --search "Email" | head
```
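
The branching in this runbook (locked versus unauthenticated) can be scripted. The sketch below assumes `bw status` prints a JSON object containing a `"status"` field, and maps that value to the matching runbook step. `vault_state` and `vault_next_step` are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for triaging the vault state.

# Extract the "status" field from `bw status` JSON on stdin.
vault_state() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Map a state to the runbook step that fixes it.
vault_next_step() {
  case "$1" in
    unlocked)        echo "ok: vault is unlocked" ;;
    locked)          echo "run: bw unlock --raw" ;;
    unauthenticated) echo "run: bw login --apikey, then bw unlock --raw" ;;
    *)               echo "unknown state: ${1:-<empty>}" ;;
  esac
}
```

Example call: `vault_next_step "$(NODE_TLS_REJECT_UNAUTHORIZED=0 bw status | vault_state)"`.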

### Runbook: Caddy proxy down

```bash
# Caddy provides HTTPS for bw CLI (self-signed cert)
launchctl list | grep caddy-vault
# Restart
launchctl stop com.john.caddy-vault && launchctl start com.john.caddy-vault
# Verify
curl -sk https://localhost:8443 | head -1
```

---

## 📧 Email System

**Daemon:** `com.john.email-agent` (every 5 min)
**Accounts:** john@basicconsulting.no, info@basicconsulting.no, john@alai.no, alem@alai.no, dev@alai.no
**IMAP:** imap.one.com:993 | **SMTP:** send.one.com:465
**Credentials:** Vaultwarden (via bw CLI)

### Runbook: Email agent not processing

```bash
# Check logs
tail -30 ~/system/logs/email-agent-launchd.log

# Common issue: Vault not unlocked
NODE_TLS_REJECT_UNAUTHORIZED=0 bw status
# Fix: See Vaultwarden runbook above

# Manual test run
NODE_TLS_REJECT_UNAUTHORIZED=0 node ~/system/daemons/email-agent.js --dry-run

# Restart daemon
launchctl stop com.john.email-agent && launchctl start com.john.email-agent

# Check inbox DB
node -e "const e=require('$HOME/system/tools/email-inbox.js');console.log(JSON.stringify(e.getStats(),null,2))"
```

---

## 💬 Telegram Bot

**Daemon:** `com.john.telegram-agent` (KeepAlive)
**Bot:** @johnbasicas_bot
**Config:** macOS Keychain (telegram-bot-token)
**AI Backend:** Claude CLI → Ollama (llama3.1:8b) → static fallback

### Runbook: Bot not responding

```bash
# Check daemon
launchctl list | grep telegram-agent

# Check logs
tail -20 ~/system/logs/telegram-agent.log

# Restart
launchctl stop com.john.telegram-agent && launchctl start com.john.telegram-agent

# Test AI backend
node -e "const{getResponse}=require('$HOME/system/tools/comms-responder.js');getResponse('test',[]).then(r=>console.log(r.backend,r.text.substring(0,100)))"

# Test connection
node ~/system/tools/telegram-agent.js --test
```

---

## 💬 Slack Bot

**Daemon:** `com.john.slack-bot` (KeepAlive)
**Workspace:** ALAI Holding AS

### Runbook: Slack bot not responding

```bash
launchctl list | grep slack-bot
tail -20 ~/system/logs/slack-bot.log
launchctl stop com.john.slack-bot && launchctl start com.john.slack-bot
```

---

## 📋 BookStack (Wiki)

**Container:** bookstack + bookstack_db
**Port:** :6875 | **URL:** docs.basicconsulting.no
**API config:** ~/system/config/bookstack.json (creds in Vaultwarden)

### Runbook: BookStack down

```bash
cd ~/system/services/bookstack
docker compose ps
docker compose restart
# Check logs
docker logs bookstack --tail 20
```

---

## 📝 Documenso (E-Signing)

**Containers:** documenso + documenso-db + documenso-minio
**Port:** :3003 | **URL:** sign.basicconsulting.no

### Runbook: Documenso down

```bash
cd ~/system/services/documenso
docker compose ps
docker compose restart
docker logs documenso --tail 20
```

---

## 📋 Planka (Kanban)

**Containers:** planka + planka-db
**Port:** :3100 | **URL:** boards.basicconsulting.no

### Runbook: Planka down

```bash
cd ~/system/services/planka
docker compose ps
docker compose restart
docker logs planka --tail 20
```

---

## 📅 Baikal (CalDAV/CardDAV)

**Container:** baikal
**Port:** :5232 | **URL:** calendar.basicconsulting.no

### Runbook: Baikal down

```bash
cd ~/system/services/baikal
docker compose ps
docker compose restart
docker logs baikal --tail 20
```
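
The BookStack, Documenso, Planka, and Baikal runbooks share the same three steps, so they can be folded into one function. This is a minimal sketch: `restart_stack` is a hypothetical name, and it assumes each service directory under `~/system/services` is named after its main container, as in the runbooks above. The `DOCKER` and `SERVICES_ROOT` variables are overridable so the function can be dry-run.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the per-service compose runbooks.
SERVICES_ROOT="${SERVICES_ROOT:-$HOME/system/services}"
DOCKER="${DOCKER:-docker}"

# Usage: restart_stack <service>   (e.g. bookstack, documenso, planka, baikal)
restart_stack() {
  local name="$1" dir="$SERVICES_ROOT/$1"
  if [ ! -d "$dir" ]; then
    echo "no such service directory: $dir" >&2
    return 1
  fi
  # Show container state, then restart every container in the stack.
  ( cd "$dir" && "$DOCKER" compose ps && "$DOCKER" compose restart ) || return 1
  # Tail the main container's log to confirm it came back.
  "$DOCKER" logs "$name" --tail 20
}
```

Setting `DOCKER=echo` before calling `restart_stack planka` prints the commands instead of running them.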

---

## 🤖 Ollama (Local AI)

**Process:** ollama serve (background)
**Port:** :11434
**Models:** llama3.1:8b, qwen2.5-coder:32b, bge-m3, llama-guard3:8b, custom ALAI models

### Runbook: Ollama down

```bash
# Check
curl -s http://localhost:11434/api/tags | python3 -m json.tool | head

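# Restart (nohup keeps the server alive after the terminal closes;
# /tmp/ollama.log is an arbitrary log location)
nohup ollama serve > /tmp/ollama.log 2>&1 &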

# Verify models
ollama list
```
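
The Telegram bot falls back to `llama3.1:8b`, so it is worth confirming that the model is actually installed, not only that the server answers. The sketch below assumes the `/api/tags` response lists models as objects with a `"name"` field. `has_model` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the /api/tags JSON on stdin lists the model.
# Whitespace is stripped first so compact and pretty-printed JSON both match.
# Names must match exactly as Ollama reports them (tags included).
has_model() {
  tr -d '[:space:]' | grep -qF "\"name\":\"$1\""
}
```

Example call: `curl -s http://localhost:11434/api/tags | has_model llama3.1:8b || echo "llama3.1:8b missing"`.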

---

## ⚙️ Key LaunchAgent Daemons

| Daemon | Label | Purpose | Priority |
|--------|-------|---------|----------|
| Cloudflared | com.john.cloudflared | Tunnel to internet | P1 |
| Vault Keeper | com.john.vault-keeper | Auto-unlock Vaultwarden | P1 |
| Caddy Vault | com.john.caddy-vault | HTTPS proxy for bw CLI | P1 |
| Slack Bot | com.john.slack-bot | Slack communication | P1 |
| Telegram Agent | com.john.telegram-agent | Telegram bot | P1 |
| Email Agent | com.john.email-agent | Email processing | P1 |
| Email Tracker | com.john.email-tracker | Open/click tracking | P2 |
| Comms Agent | com.john.comms-agent | Cross-platform comms | P2 |
| Ops Watchdog | com.john.ops-watchdog | Service health checks | P1 |
| Event Dispatcher | com.john.event-dispatcher | Event bus processing | P1 |
| Pi Orchestrator | com.john.pi-orchestrator | Task delegation to agents | P1 |
| Autowork | com.john.autowork | Background task execution | P2 |
| N8N | com.john.n8n | Workflow automation | P2 |
| MC Dashboard | com.john.mc-dashboard | Mission Control web UI | P2 |

### Generic daemon restart

```bash
# Stop
launchctl stop com.john.<name>
# Start
launchctl start com.john.<name>
# Full reload (needed after editing the plist; stop/start does not re-read it)
launchctl unload ~/Library/LaunchAgents/com.john.<name>.plist
launchctl load ~/Library/LaunchAgents/com.john.<name>.plist
# Check status
launchctl list | grep <name>
```
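
To check several daemons from the table at once, the output of `launchctl list` can be summarised per label. This sketch assumes the usual three-column format (PID, last exit status, label), where a PID of `-` means the job is loaded but not currently running. `daemon_report` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarise `launchctl list` output for the given labels.
# Reads the listing on stdin; arguments are the labels to look up.
daemon_report() {
  local listing label line pid
  listing="$(cat)"
  for label in "$@"; do
    # Column 3 of `launchctl list` is the job label.
    line="$(printf '%s\n' "$listing" | awk -v l="$label" '$3 == l')"
    if [ -z "$line" ]; then
      printf '%-32s not loaded\n' "$label"
      continue
    fi
    pid="$(printf '%s\n' "$line" | awk '{print $1}')"
    if [ "$pid" = "-" ]; then
      printf '%-32s loaded, not running\n' "$label"
    else
      printf '%-32s running (pid %s)\n' "$label" "$pid"
    fi
  done
}
```

Example call: `launchctl list | daemon_report com.john.cloudflared com.john.vault-keeper com.john.email-agent`. Interval jobs such as the email agent (every 5 min) will normally show as loaded but not running between runs.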

---

## 🔄 Cold Start (Full System Bring-Up)

If the Mac Studio reboots, steps 1-4 below happen automatically; run the verification commands afterwards:

```bash
# 1. Docker starts automatically (Docker Desktop)
# 2. LaunchAgents auto-load (RunAtLoad=true)
# 3. vault-keeper unlocks Vaultwarden (reads Keychain)
# 4. All services come up within ~2 minutes

# Verify everything:
bash ~/system/ops/cold-start.sh
node ~/system/tools/daemon-health.js
docker ps
```
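
Because services take up to about two minutes to come up, a verification run immediately after reboot can report false failures. A small retry loop avoids that. `wait_until` is a hypothetical helper, not part of `cold-start.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command once per second until it succeeds
# or <timeout> seconds have passed. Usage: wait_until <timeout> <command...>
wait_until() {
  local timeout="$1"; shift
  local elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

Example call: `wait_until 120 curl -sf -o /dev/null http://localhost:8200 || echo "Vaultwarden not up after 2 min"`.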

---

## 🆘 Emergency Contacts

- **Alem Basic** (CEO): alem@alai.no
- **John** (AI Director): john@basicconsulting.no, @johnbasicas_bot (Telegram), #exec (Slack)