Support & Runbooks

P0 checklists, support systems, audit logging, and operational runbooks

Support Systems

P0: Implementation Checklist
Support Overview
Support Systems Analysis
Audit Logging Setup

Runbooks

Runbook: AISP Balance Failure
Runbook: BankID Failure
Runbook: PISP Payment Failure
Runbook: Sumsub KYC Failure
Runbook: Swan API Outage
Runbook: Neonomics Outage

Infrastructure & Internal Services

ALAI Infrastructure — Service Catalog & Runbooks

Support Systems

Support systems, checklists, and audit logging

P0 Implementation Checklist — Drop Support Systems

Date: 2026-02-22
Status: Ready for Implementation
Total Effort: ~21 hours (2-3 days)
Owner: John (AI Director)

Overview

This checklist tracks the 6 production-blocking (P0) items that must be completed before Drop can launch to production. Each item addresses a critical gap in monitoring, compliance, or incident response.

P0 Items

1. Server-Side Error Tracking ⏱️ 2 hours (revised)

Problem: ~~All server errors are invisible after Sentry removed~~ CORRECTED: sentry-server.ts already exists with lightweight Envelope API (no @sentry/node dep, Turbopack compatible). However, only 5/25+ routes have captureServerError integrated.

Status: 🟡 Partially Complete (library done, coverage gaps)

Tasks:

~~Research Sentry Edge SDK compatibility~~ Already solved: custom Envelope API
~~Install and configure~~ src/lib/sentry-server.ts already complete
~~Update sentry-server.ts~~ Already has captureServerError + captureServerMessage
Expand captureServerError to ALL API routes (currently only 5 routes)
Test: Trigger 500 error in expanded routes, verify Sentry event
Configure source maps upload (optional but recommended)

Deliverables:

✅ src/lib/sentry-server.ts (already complete — Envelope API, no SDK dep)
✅ Integrated in: bankid, bankid/callback, qr-payment, remittance, health
🔨 Expanding to: all remaining API routes (~20 routes)

Acceptance Criteria:

ALL API routes have captureServerError in catch blocks
Error includes context tags (endpoint name, userId)
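
One way to reach full route coverage without hand-editing every catch block is a small wrapper around each handler. The sketch below is illustrative only: the `captureServerError(error, context)` signature is assumed (the real one in src/lib/sentry-server.ts may differ), and `withErrorCapture`, `ErrorContext`, and the in-memory stub are hypothetical names used so the example runs on its own.

```typescript
// Hypothetical context shape; the real tags in sentry-server.ts may differ.
interface ErrorContext {
  endpoint: string;
  userId?: string;
}

// Stand-in for the real captureServerError so this sketch is self-contained.
const captured: Array<{ message: string; context: ErrorContext }> = [];
function captureServerError(error: unknown, context: ErrorContext): void {
  const message = error instanceof Error ? error.message : String(error);
  captured.push({ message, context });
}

type Handler<T> = (userId?: string) => Promise<T>;

// Wraps a route handler so every thrown error is reported with context tags
// and then rethrown, leaving the normal 500 handling untouched.
function withErrorCapture<T>(endpoint: string, handler: Handler<T>): Handler<T> {
  return async (userId?: string) => {
    try {
      return await handler(userId);
    } catch (error) {
      captureServerError(error, { endpoint, userId });
      throw error;
    }
  };
}
```

Wrapping a handler as `withErrorCapture("remittance", handler)` would tag every reported error with the endpoint name and, when available, the user ID, which is what the acceptance criteria ask for.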

2. Audit Logging System ⏱️ 0 hours (ALREADY COMPLETE)

Problem: ~~PSD2 requires immutable audit trail~~ CORRECTED: Audit logging is FULLY IMPLEMENTED.

Status: ✅ Complete

What exists:

src/lib/audit.ts — Full audit library with 30+ action types, logAudit(), getAuditLog(), countAuditEntries()
audit_log table in DB schema (initial migration + db.ts fallback)
Indexes on user_id, timestamp, action
5-year retention documented (data-retention.ts explicitly excludes audit_log from cleanup)
Fire-and-forget pattern (doesn't block user actions)
Integrated in 20+ API routes: auth, transactions, cards, recipients, settings, consents, complaints, user management, GDPR endpoints
Admin audit export: /api/admin/audit/ endpoint exists
GDPR data export: /api/user/data-export/ includes audit log
Structured logger also captures audit events (stdout for CloudWatch)

No action needed. This was incorrectly flagged as missing in the initial analysis.
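
For readers unfamiliar with the fire-and-forget pattern listed above, here is a minimal sketch. It is not the code in src/lib/audit.ts: the `AuditEntry` fields, the `writeAuditRow` helper, and the in-memory `rows` and `failures` arrays are assumptions made so the example is runnable in isolation.

```typescript
interface AuditEntry {
  userId: string | null;
  action: string;    // e.g. "login_failure"
  timestamp: string; // ISO 8601
}

const rows: AuditEntry[] = [];  // stand-in for the audit_log table
const failures: string[] = [];  // stand-in for the structured logger

// Hypothetical async writer; a real one would INSERT into audit_log.
async function writeAuditRow(entry: AuditEntry): Promise<void> {
  if (entry.action === "") throw new Error("action is required");
  rows.push(entry);
}

// Fire-and-forget: the caller does not await the write, and a failed write
// is recorded instead of propagating into the user-facing request.
function logAudit(userId: string | null, action: string): void {
  const entry: AuditEntry = { userId, action, timestamp: new Date().toISOString() };
  writeAuditRow(entry).catch((err: unknown) => {
    failures.push(err instanceof Error ? err.message : String(err));
  });
}
```

The trade-off is that a failed write loses the entry unless something retries it, so failures should at least reach the structured logger, as the `catch` branch does here.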

3. WAF Deployment ⏱️ 2 hours

Problem: WAF rules defined but not enforced (requires reverse proxy).

Status: ⬜ Not Started

Tasks:

Deliverables:

⬜ infrastructure/cloudflare-waf-setup.md (to be created)
⬜ Cloudflare WAF configured
⬜ Test results documented

Acceptance Criteria:

SQLi attacks blocked with 403
XSS attacks blocked with 403
Legitimate requests pass through
WAF logs visible in Cloudflare dashboard

4. Log Aggregation & Retention ⏱️ 2 hours

Problem: Structured logs write to stdout but aren't retained or searchable.

Status: ⬜ Not Started

Tasks:

Deliverables:

✅ infrastructure/cloudwatch-logs-setup.md (created)
⬜ CloudWatch retention policies set
⬜ Log Insights queries saved
⬜ CloudWatch alarms active

Acceptance Criteria:

Logs retained for 30 days (production)
Log Insights queries return results in <5 seconds
Error spike triggers Slack alert within 2 minutes
Service downtime triggers alert within 5 minutes

5. External Uptime Monitoring ⏱️ 1 hour

Problem: BetterStack documented but not deployed.

Status: ⬜ Not Started

Tasks:

Deliverables:

✅ docs/infrastructure/BETTERSTACK-SETUP.md (already exists)
⬜ BetterStack account with monitors active
⬜ Slack integration tested

Acceptance Criteria:

Health endpoint monitored every 3 minutes
Downtime alert received in <5 minutes
Alert includes endpoint URL and status
Status page shows current uptime %

6. Payment/Banking Failure Runbooks ⏱️ 4 hours

Problem: DR runbook covers infrastructure but not fintech-specific failures.

Status: 🟡 Partially Complete (2 of 6 runbooks created)

Tasks:

BankID integration failure runbook
PISP payment failure runbook (remittance + QR)
AISP balance retrieval failure runbook
Swan API outage runbook
Sumsub KYC failure runbook
Neonomics open banking outage runbook
Test each runbook in staging (simulate failure)
Update docs/dr-runbook.md to reference new runbooks

Deliverables:

✅ support/runbooks/bankid-failure.md (created)
✅ support/runbooks/pisp-payment-failure.md (created)
⬜ support/runbooks/aisp-balance-failure.md
⬜ support/runbooks/swan-api-outage.md
⬜ support/runbooks/sumsub-kyc-failure.md
⬜ support/runbooks/neonomics-outage.md

Acceptance Criteria:

Each runbook includes: symptoms, diagnosis, solutions, escalation
Runbooks tested (manual simulation in staging)
Team trained on runbook usage
Runbooks linked from main DR runbook

Progress Tracking

Completion Status

Item	Status	Progress	Blocker
1. Server-side error tracking	🟡 Expanding	80% (lib done, expanding to all routes)	None
2. Audit logging	✅ COMPLETE	100% (was already built)	None
3. WAF deployment	🟡 Ready	90% (Terraform written, needs apply)	`terraform apply`
4. Log aggregation	🔨 Building	50% (CloudWatch alarms being added)	None
5. External monitoring	⬜ Not Started	0%	BetterStack account signup
6. Runbooks	🔨 Building	33% → 100% (4 remaining being written)	None

Overall Progress: ~70% (revised — audit logging was already 100%)

Priority Order

Week 1 (High Impact, Low Effort):

1. ✅ External monitoring (1h): Immediate visibility into outages
2. ✅ CloudWatch retention (30min): Logs already flowing, just set policy
3. ⬜ CloudWatch alarms (1.5h): Automated alerting

Week 2 (Critical Compliance):

4. ⬜ Audit logging schema (2h): Create table and library
5. ⬜ Audit logging integration (6h): Wire into endpoints

Week 3 (Security & Error Tracking):

6. ⬜ Server-side error tracking (4h): Sentry edge setup
7. ⬜ WAF deployment (2h): Security hardening

Week 4 (Runbooks):

8. ⬜ Remaining runbooks (2h): AISP, Swan, Sumsub, Neonomics

Dependencies

External Dependencies

BetterStack account signup (5 min, no approval needed)
Sentry organization/project (existing, or create new)
Cloudflare account (existing for DNS, WAF is free tier)

Internal Dependencies

Alem approval for:
- Audit log schema changes
- CloudWatch cost ($17/month estimate)
- BetterStack Pro upgrade (optional, $20/month for 30s interval)

Blocked Items

Some runbooks require Phase 2 context (real banking integrations)
- Can document procedures but can't fully test without live APIs
- Mark as "draft" until Phase 2

Testing Plan

Test 1: Error Tracking

# Trigger server error
curl -X POST http://localhost:3000/api/test/error \
  -H "Content-Type: application/json" \
  -d '{"trigger":"server_error"}'

# Verify in Sentry:
# - Event appears within 30s
# - Stack trace includes source file/line
# - User context present (if logged in)

Test 2: Audit Logging

# Perform audit-worthy action
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"wrong"}'

# Check database (PostgreSQL 16):
psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 1;"

# Expected:
# audit_xxx|2026-02-22T10:00:00Z|usr_123|login_failure|...|1.2.3.4|Mozilla...

Test 3: WAF

# Test SQLi blocking (payload is URL-encoded so curl accepts the spaces and quotes)
curl -G "https://getdrop.no/api/test" --data-urlencode "id=1' OR '1'='1" -v

# Expected: HTTP 403 Forbidden

# Test legitimate request
curl "https://getdrop.no/api/health" -v

# Expected: HTTP 200 OK

Test 4: CloudWatch Alarms

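# Trigger error spike (15 errors, same request as Test 1)
for i in {1..15}; do
  curl -s -o /dev/null -X POST http://localhost:3000/api/test/error \
    -H "Content-Type: application/json" \
    -d '{"trigger":"server_error"}'
  sleep 2
done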

# Expected:
# - CloudWatch alarm fires after 2 minutes (2 x 1min periods)
# - Slack alert received in #drop-ops
# - Email sent to alem@alai.no

Test 5: BetterStack

# Stop app
docker stop drop-app

# Wait 3-5 minutes

# Expected:
# - BetterStack detects downtime
# - Slack alert in #drop-ops
# - Email to alem@alai.no

# Restart app
docker start drop-app

# Expected:
# - BetterStack detects recovery
# - "UP" notification sent

Rollout Plan

Phase 1: Non-Intrusive (Day 1)

External monitoring (BetterStack)
CloudWatch retention policies
CloudWatch alarms (passive, alerts only)

Risk: None. These are read-only additions.

Phase 2: Database Changes (Day 2)

Audit log schema migration
Audit log library (no integrations yet)

Risk: Low. New table, no app changes. Test migration in dev first.

Phase 3: Code Integration (Day 3-4)

Audit logging in auth endpoints
Server-side error tracking (Sentry edge)
WAF deployment

Risk: Medium. Requires code changes + deployment. Deploy to staging first, test 24h, then production.

Phase 4: Runbooks (Day 5)

Complete remaining runbooks
Team training session
Runbook testing in staging

Risk: None. Documentation only, no production changes.

Success Metrics

After P0 completion, we should achieve:

✅ 100% server errors visible (Sentry events)
✅ 100% audit events logged (auth, admin, data access)
✅ >99.9% uptime detection (BetterStack)
✅ <5 min MTTD (mean time to detect incidents)
✅ <15 min MTTR (mean time to recover, using runbooks)
✅ 0 security vulnerabilities from WAF bypass

Approvals

Required Approvals

Alem: Audit log schema changes
Alem: CloudWatch cost ($17/month)
Alem: BetterStack account (free tier OK? or Pro $20/month?)

Sign-Off

John (AI Director): Technical implementation complete
Alem (CEO): Business approval for costs + rollout
Validator (QA): Testing complete, acceptance criteria met

Next Steps

Review this analysis with Alem
Get approvals for costs and schema changes
Create Mission Control tasks for each P0 item
Begin implementation (priority order above)
Test thoroughly in staging before production
Document completion in this checklist

Related Documents

support/SUPPORT-SYSTEMS-ANALYSIS.md: Full analysis (all P0/P1/P2 items)
support/audit-logging-setup.md: Audit logging implementation guide
support/runbooks/bankid-failure.md: BankID failure recovery
support/runbooks/pisp-payment-failure.md: Payment failure recovery
infrastructure/cloudwatch-logs-setup.md: Log aggregation setup
infrastructure/waf-rules.md: WAF rule definitions

Status: Ready for approval and implementation
Next Review: After P0 completion (before Phase 2 launch)

Support Overview

Customer Support

Customer support resources for Drop project: FAQs, guides, feedback.

Drop Support Systems Analysis

Date: 2026-02-22
Author: John (AI Director)
Status: MVP Hardening Phase (0.5)
Purpose: Comprehensive analysis of support systems for production-ready fintech deployment

Executive Summary

Drop currently has foundational support systems in place but requires critical enhancements before production launch. The application has health checks, CI/CD, error tracking (client-side), and basic alerting, but lacks enterprise-grade observability, audit logging, and incident response procedures required for a PSD2-compliant fintech service.

Key Findings:

✅ Strong foundation: Comprehensive CI/CD with >80% coverage, health checks, structured logging
⚠️ Critical gaps: No server-side error tracking, no audit trails, no APM, limited incident response
🚨 Production blockers: 6 P0 items must be addressed before go-live (see Gap Analysis)

Recommendation: Implement P0 systems immediately (est. 2-3 days), defer P1 to Phase 2 (banking integration), and P2 to post-launch optimization.

Current State

1. Monitoring — Uptime & Health Checks

What Exists

✅ Health endpoint: /api/health with database connectivity verification
- Checks: DB query latency, driver type (pg/sqlite), service mode, uptime
- Returns: ok (200), degraded (200), or down (503)
- Source: src/drop-app/src/app/api/health/route.ts
✅ Container health checks:
- Docker: 30s interval, 10s timeout, 3 retries
- Fly.io: 30s interval, 10s grace period, 5s timeout
- Auto-restart on failure
✅ External uptime monitoring (ready to deploy):
- BetterStack setup guide documented
- Free tier: 10 monitors, 3-min interval, SMS/email/Slack alerts
- Documentation: docs/infrastructure/BETTERSTACK-SETUP.md
✅ Cron health check script:
- infrastructure/health-check.sh — AWS App Runner endpoint
- Slack webhook integration (optional)
- Can run via cron for local monitoring
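
The three health states described above map onto two HTTP status codes. The following sketch shows one plausible mapping; it is not the code in route.ts, and the 500 ms `DEGRADED_LATENCY_MS` cutoff and the `evaluateHealth` name are assumptions, since this document does not state the real threshold.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

interface HealthResult {
  status: HealthStatus;
  httpStatus: 200 | 503;
  dbLatencyMs: number | null;
}

// Assumed threshold; the real cutoff for "degraded" is not given in this document.
const DEGRADED_LATENCY_MS = 500;

// dbLatencyMs is null when the connectivity query failed.
function evaluateHealth(dbLatencyMs: number | null): HealthResult {
  if (dbLatencyMs === null) {
    return { status: "down", httpStatus: 503, dbLatencyMs };
  }
  const status: HealthStatus = dbLatencyMs > DEGRADED_LATENCY_MS ? "degraded" : "ok";
  // Both "ok" and "degraded" answer 200; only "down" answers 503.
  return { status, httpStatus: 200, dbLatencyMs };
}
```

Because "degraded" still answers 200, status-code-only checks (Docker, Fly.io) treat a slow database as healthy; only a monitor that inspects the response body can tell the two states apart.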

What's Missing

❌ Synthetic monitoring: No transaction flow testing (login → send money → verify)
❌ Multi-region checks: No geographic availability testing
❌ SLA tracking: No uptime percentage calculation or reporting
❌ Dependency monitoring: No checks for external services (Swan API, BankID, Sumsub)

Assessment

Status: Adequate for MVP, requires enhancement for production.
Gap: External monitoring configured but not deployed. Synthetic checks needed.

2. Logging — Centralized Log Aggregation

What Exists

✅ Structured logging:
- JSON format with timestamp, level, message, requestId, metadata
- Source: src/drop-app/src/lib/logger.ts
- Writes to stdout (Docker-friendly)
✅ Request correlation:
- x-request-id header extraction or UUID generation
- Request context propagation through logger instances
✅ Log levels: debug, info, warn, error
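
To make the log shape and request correlation concrete, here is a minimal sketch of a logger with those properties. It is an illustration rather than the contents of logger.ts: `formatLog`, `resolveRequestId`, and `createRequestLogger` are hypothetical names, and a real logger would write each line to stdout instead of returning it.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface LogLine {
  timestamp: string;
  level: Level;
  message: string;
  requestId: string;
  metadata?: Record<string, unknown>;
}

// Builds one JSON log line with the fields listed above.
function formatLog(
  level: Level,
  message: string,
  requestId: string,
  metadata?: Record<string, unknown>,
): string {
  const line: LogLine = { timestamp: new Date().toISOString(), level, message, requestId };
  if (metadata !== undefined) line.metadata = metadata;
  return JSON.stringify(line);
}

// Reuses the incoming x-request-id header value when present, else generates one.
function resolveRequestId(headerValue: string | null, generate: () => string): string {
  return headerValue !== null && headerValue.trim() !== "" ? headerValue : generate();
}

// Binds a request ID once so every line logged during the request carries it.
function createRequestLogger(requestId: string) {
  return {
    info: (message: string, metadata?: Record<string, unknown>) =>
      formatLog("info", message, requestId, metadata),
    error: (message: string, metadata?: Record<string, unknown>) =>
      formatLog("error", message, requestId, metadata),
  };
}
```

Binding the ID in `createRequestLogger` is what makes a later "Request trace" query possible: filtering on one `requestId` returns every line from that request.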

What's Missing

❌ Log aggregation: Logs write to stdout but aren't collected or indexed
❌ Log retention: No policy for how long logs are kept
❌ Log search: No way to query logs across time/instances
❌ Log forwarding: No integration with log management service
❌ Sensitive data scrubbing: Logger doesn't automatically redact PII
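
The missing scrubbing step could take the form of a key-based redaction pass applied to log metadata before serialization. The sketch below is one possible approach, not an existing part of logger.ts; the `SENSITIVE_KEYS` list is an assumption modelled on the fields the client-side Sentry integration is said to scrub.

```typescript
// Assumed key list; extend it to match the fields the application actually logs.
const SENSITIVE_KEYS = ["password", "pin", "cardnumber", "fodselsnummer", "fødselsnummer"];

// Recursively replaces values of sensitive keys in plain JSON-like data.
// Non-plain objects (for example Date instances) are out of scope for this sketch.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      // Normalize so that card_number, card-number and cardNumber all match.
      const normalized = key.toLowerCase().replace(/[_-]/g, "");
      out[key] = SENSITIVE_KEYS.includes(normalized) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}
```

Key-based redaction only catches PII stored under known field names; values embedded in free-text messages would need a separate pattern-based pass.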

Assessment

Status: Foundation exists, but logs are ephemeral (lost on container restart).
Gap: Critical for incident investigation and compliance audits. Need CloudWatch Logs or similar.

3. Error Tracking — Error Capture & Alerting

What Exists

✅ Client-side error tracking:
- Sentry browser integration (@sentry/browser)
- PII scrubbing (passwords, pins, card numbers, fødselsnummer)
- 10% trace sampling for performance monitoring
- Source: src/drop-app/src/lib/sentry.ts, SENTRY.md
✅ Error spike detection:
- Tracks errors in rolling 1-minute window
- Alerts when >5 errors in 60 seconds
- Source: src/drop-app/src/lib/alerts.ts:trackError()
✅ Global error boundaries:
- React error boundaries for component crashes
- global-error.tsx catches unhandled errors
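
The spike rule described above (more than 5 errors inside a rolling 60-second window) can be sketched as follows. This is an illustration of the technique, not the contents of alerts.ts; the `ErrorSpikeDetector` class is hypothetical, and time is passed in explicitly so the behaviour can be tested without waiting.

```typescript
const WINDOW_MS = 60_000; // rolling 1-minute window
const THRESHOLD = 5;      // alert when the window holds more than 5 errors

class ErrorSpikeDetector {
  private timestamps: number[] = [];

  // Records one error at time `now` (milliseconds) and returns true
  // when the rolling window contains more than THRESHOLD errors.
  trackError(now: number): boolean {
    this.timestamps.push(now);
    const cutoff = now - WINDOW_MS;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > THRESHOLD;
  }
}
```

In production the caller would pass `Date.now()` and send a critical alert when the method returns true. Note that state like this lives in one process, so each app instance would count its own errors.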

What's Missing

❌ Server-side error tracking: Sentry removed from server due to Next.js 16 Turbopack incompatibility (MC #1271)
❌ API error context: Server errors log to console only, no structured capture
❌ Error attribution: Can't trace errors to specific users or transactions
❌ Error deduplication: Same error reported multiple times clogs alerts

Assessment

Status: Client errors tracked, server errors blind.
Gap: CRITICAL. Server-side errors (API, DB, integrations) are invisible. P0 fix required.

4. Alerting — On-Call & Escalation

What Exists

✅ Slack alerting:
- Operational alerts with severity levels (info/warning/critical)
- 10-minute cooldown per alert title (spam prevention)
- Source: src/drop-app/src/lib/alerts.ts
✅ Lifecycle alerts:
- App startup notification
- Graceful shutdown notification
- Source: instrumentation.ts
✅ Error spike alerts:
- Automatic critical alert when >5 errors/minute
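
The 10-minute cooldown per alert title can be illustrated with a small throttle. This is a sketch under the assumption that the cooldown is measured from the last alert actually sent; the `AlertThrottle` name is hypothetical and the real logic in alerts.ts may differ.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes per alert title

class AlertThrottle {
  private lastSent = new Map<string, number>();

  // Returns true when an alert with this title may be sent at time `now` (ms).
  // Suppressed alerts do not reset the cooldown.
  shouldSend(title: string, now: number): boolean {
    const previous = this.lastSent.get(title);
    if (previous !== undefined && now - previous < COOLDOWN_MS) return false;
    this.lastSent.set(title, now);
    return true;
  }
}
```

Keying on the title means two different incidents with the same title are merged during the cooldown, which is the spam-prevention behaviour described above but can also hide a second failure.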

What's Missing

❌ On-call rotation: No defined on-call schedule or escalation policy
❌ Alert routing: All alerts go to same Slack channel, no severity-based routing
❌ Alert escalation: No automatic escalation after N minutes of unresolved incident
❌ Alert acknowledgment: Can't mark alerts as "acknowledged" or "resolved"
❌ SMS/phone alerts: Critical incidents only notify via Slack (single point of failure)
❌ Alert testing: No way to test alert pipeline without triggering real incidents

Assessment

Status: Basic alerting works for small team, inadequate for 24/7 production.
Gap: Need on-call schedule, escalation policy, and multi-channel delivery.

5. Security Monitoring — WAF, DDoS, Anomaly Detection, Audit Logs

What Exists

✅ WAF rules defined:
- CSRF origin validation (implemented in middleware)
- Rate limiting on auth endpoints (10 req/60s)
- CSP headers with nonce-based script loading
- Source: infrastructure/waf-rules.md, src/drop-app/src/middleware.ts
✅ Container security scanning:
- Trivy vulnerability scanner in CI/CD
- Blocks HIGH/CRITICAL vulnerabilities
- SARIF upload to GitHub Security tab
✅ Dependency scanning:
- npm audit in CI pipeline (prod deps only)
✅ AML transaction monitoring:
- 5 automated rules: structuring, velocity, high amount, high-risk corridor, unusual pattern
- Alerts stored in aml_alerts table
- Source: src/drop-app/src/lib/transaction-monitor.ts
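
The auth rate limit listed above (10 requests per 60 seconds) can be enforced in several ways. The sketch below uses a sliding-window log keyed by client; this document does not say which algorithm middleware.ts uses, so treat `SlidingWindowLimiter` as a hypothetical illustration.

```typescript
const LIMIT = 10;         // requests allowed per window
const WINDOW_MS = 60_000; // 60-second window

class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  // Returns true when the request is allowed, false when the caller
  // should reject it (typically with HTTP 429).
  allow(key: string, now: number): boolean {
    const cutoff = now - WINDOW_MS;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= LIMIT) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

An in-process limiter like this only protects a single instance, which is why the "What's Missing" list below still calls for rate limiting at the network edge.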

What's Missing

❌ WAF deployment: Rules defined but not deployed (requires CDN/reverse proxy)
❌ DDoS protection: No rate limiting at network edge, only app-level
❌ Intrusion detection: No IDS/IPS monitoring unusual access patterns
❌ Audit logs: No immutable log of authentication, authorization, data access events (PSD2 requirement)
❌ Security incident response plan: No runbook for security breaches
❌ Penetration testing: No external security audit completed

Assessment

Status: Security-aware codebase, but monitoring/audit infrastructure missing.
Gap: CRITICAL. Audit logs are a PSD2/GDPR compliance requirement. P0 fix required.

6. Performance — APM, Latency Tracking, Resource Utilization

What Exists

✅ Health check latency:
- DB query time measured in health endpoint
- Reported in milliseconds
✅ Performance budgets in CI:
- Coverage thresholds enforced (80/70/80/80)

What's Missing

❌ APM (Application Performance Monitoring): No distributed tracing
❌ API latency tracking: Don't know which endpoints are slow
❌ Database performance: No slow query alerts or query profiling
❌ Resource utilization: No CPU/memory/disk usage monitoring
❌ Frontend performance: No Core Web Vitals tracking (LCP, INP, CLS)
❌ Transaction timing: Can't measure end-to-end payment latency

Assessment

Status: Minimal. Can detect total outage but not performance degradation.
Gap: Need before production to identify bottlenecks and capacity issues.

7. Database — Backups, Replication, Monitoring

What Exists

✅ Automated backups (RDS):
- Daily automated snapshots, 7-day retention
- Point-in-time recovery within 7 days
- Source: docs/dr-runbook.md
✅ Multi-AZ (production):
- RDS configured for high availability (if enabled)
✅ Database health check:
- SELECT 1 query in health endpoint verifies connectivity

What's Missing

❌ Backup verification: Snapshots created but never tested for restore
❌ Backup monitoring: No alerts if backup fails
❌ Replication lag monitoring: No alerts if replica falls behind
❌ Connection pool monitoring: No visibility into connection usage
❌ Query performance: No slow query log analysis
❌ Storage monitoring: No alerts before disk fills up

Assessment

Status: Basic backup/restore exists, monitoring gaps.
Gap: Backup testing and proactive monitoring needed before production.

8. Incident Response — Runbooks, Status Page, Communication Plan

What Exists

✅ DR runbook:
- Procedures for App Runner down, RDS down, full redeploy
- Environment variable checklist
- Contact escalation (John → Alem)
- Source: docs/dr-runbook.md
✅ Incident checklist:
- 8-step incident response workflow
- Post-mortem requirement (48h)

What's Missing

❌ Status page: No public/customer-facing status page
❌ Incident templates: No standardized incident report format
❌ Communication plan: No templates for customer notifications during outages
❌ Runbook coverage: Only covers infrastructure, missing:
- Payment failures (PISP/AISP errors)
- BankID integration issues
- KYC/AML false positive handling
- Data breach response
❌ Runbook testing: Procedures documented but never executed

Assessment

Status: Basic DR runbook exists, lacks fintech-specific scenarios.
Gap: Need payment/banking integration runbooks before Phase 2.

9. CI/CD — Build Pipeline, Deployment, Rollback

What Exists

✅ Comprehensive CI pipeline:
- Multi-package change detection
- Lint, typecheck, unit tests, E2E (Playwright), mutation testing (Stryker)
- Coverage thresholds enforced (80/70/80/80) with ratchet (never decrease)
- Docker build + Trivy security scan
- Quality gate (required status check)
- Source: .github/workflows/ci.yml
✅ Deployment workflows:
- GitHub Actions for deploy (backend, mobile)
- Terraform for infrastructure
- Source: .github/workflows/deploy.yml, terraform-ci.yml

What's Missing

❌ Automated rollback: Deployment failure doesn't auto-revert
❌ Canary deployments: All-or-nothing deployment, no gradual rollout
❌ Deployment monitoring: No automatic health check after deploy
❌ Deployment notifications: Team not notified of deployments/failures
❌ Infrastructure drift detection: Terraform state not continuously validated

Assessment

Status: Strong quality gate, weak deployment safety.
Gap: Add post-deployment health checks and rollback automation.
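
A post-deployment health check could look like the following sketch, which polls a probe until it sees a run of consecutive 200 responses. Everything here is hypothetical (`verifyDeployment`, `VerifyOptions`, the injected `probe` and `sleep` functions); in practice the probe would request /api/health on the freshly deployed service.

```typescript
type Probe = () => Promise<number>; // resolves to an HTTP status code

interface VerifyOptions {
  attempts: number;          // maximum number of probes
  requiredSuccesses: number; // consecutive 200s needed to pass
  delayMs: number;           // pause between probes
}

// Returns true when the probe answers 200 `requiredSuccesses` times in a row
// within `attempts` tries, false otherwise.
async function verifyDeployment(
  probe: Probe,
  options: VerifyOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  let streak = 0;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    let status = 0;
    try {
      status = await probe();
    } catch {
      status = 0; // network errors count as failed probes
    }
    streak = status === 200 ? streak + 1 : 0;
    if (streak >= options.requiredSuccesses) return true;
    if (attempt < options.attempts) await sleep(options.delayMs);
  }
  return false;
}
```

A deploy workflow could run this after each release and start a rollback when it returns false; the rollback mechanism itself is outside the scope of this sketch.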

10. Compliance — Audit Trails, Data Retention, GDPR/PSD2 Logging

What Exists

✅ AML monitoring:
- Transaction alerts stored in aml_alerts table
- 5 risk categories tracked
✅ Security audit completed:
- 4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW findings documented
- Source: security/drop-security-rapport.md
✅ Data retention service:
- Code exists for GDPR compliance
- Source: src/drop-app/src/lib/services/data-retention.ts

What's Missing

❌ Audit logs: No immutable record of:
- User authentication events (login, logout, failed attempts)
- Authorization decisions (who accessed what, when)
- Data modifications (user profile changes, transaction edits)
- Administrative actions (KYC approvals, AML reviews)
❌ Audit log retention policy: PSD2 requires 5+ years
❌ Audit log integrity: No cryptographic proof of non-tampering
❌ Compliance reporting: No automated report generation for regulators
❌ STR (Suspicious Transaction Report) workflow: AML alerts created but no submission process
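
One common way to provide the missing tamper evidence is a hash chain, where each entry stores the hash of the entry before it. The sketch below shows the idea with Node's built-in `node:crypto` module; the `ChainedEntry` shape and the function names are hypothetical and are not part of the existing audit schema.

```typescript
import { createHash } from "node:crypto";

interface ChainedEntry {
  payload: string;  // serialized audit record
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // SHA-256 over prevHash followed by payload
}

const GENESIS = "0".repeat(64);

function hashEntry(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Returns a new chain with the payload appended and linked to the previous entry.
function append(chain: ChainedEntry[], payload: string): ChainedEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  return [...chain, { payload, prevHash, hash: hashEntry(prevHash, payload) }];
}

// Returns the index of the first entry that fails verification, or -1 when intact.
function verify(chain: ChainedEntry[]): number {
  let expectedPrev = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const entry = chain[i];
    const recomputed = hashEntry(entry.prevHash, entry.payload);
    if (entry.prevHash !== expectedPrev || entry.hash !== recomputed) return i;
    expectedPrev = entry.hash;
  }
  return -1;
}
```

A chain like this only detects tampering if the most recent hash is also stored somewhere an attacker with database access cannot rewrite, for example in a separate log stream.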

Assessment

Status: CRITICAL GAP. Audit logs are a PSD2 legal requirement.
Gap: P0. Must be implemented before production launch.

Gap Analysis

P0 — Production Blockers (Must Fix Before Go-Live)

#	Category	Gap	Impact	Effort
1	Error Tracking	No server-side error monitoring	Can't detect/debug API failures	4h
2	Compliance	No audit logs (auth, data access, admin actions)	PSD2 non-compliance, legal risk	8h
3	Security	WAF rules defined but not deployed	Vulnerable to SQLi, XSS, DDoS	2h (config)
4	Logging	No log aggregation/retention	Can't investigate incidents	2h (CloudWatch setup)
5	Monitoring	BetterStack configured but not deployed	No external incident detection	1h (account setup)
6	Incident Response	No payment/banking failure runbooks	Can't recover from PISP/BankID outages	4h

Total P0 effort: ~21 hours (2-3 days)

P1 — Needed Soon (Before Phase 2: Banking Integration)

#	Category	Gap	Impact	Effort
7	Alerting	No on-call rotation or escalation policy	Incidents may go unnoticed outside work hours	2h
8	Performance	No APM for distributed tracing	Can't diagnose slow transactions	4h
9	Database	No backup testing or monitoring	Backups may be corrupt, undetected	3h
10	Security	No penetration testing	Unknown vulnerabilities	16h (external)
11	CI/CD	No automated rollback on deployment failure	Bad deploys cause extended outages	6h
12	Compliance	No STR submission workflow	Can't fulfill AML obligations	8h

Total P1 effort: ~39 hours (5 days)

P2 — Nice to Have (Post-Launch Optimization)

#	Category	Gap	Impact	Effort
13	Monitoring	No synthetic transaction monitoring	Can't detect broken user flows	8h
14	Performance	No Core Web Vitals tracking	Poor user experience undetected	4h
15	Alerting	No SMS/phone alerts for critical incidents	Slack outage = missed alerts	2h
16	Database	No slow query alerts	Performance degradation undetected	6h
17	Security	No IDS/IPS for intrusion detection	Advanced attacks undetected	16h
18	Incident Response	No public status page	Customers unaware of outages	4h

Total P2 effort: ~40 hours (5 days)

Implementation Plan

Phase 1: P0 Production Blockers (NOW — before Phase 1 demo)

Goal: Address legal/compliance requirements and critical observability gaps.

1.1 Server-Side Error Tracking (4h)

Problem: All server errors invisible after Sentry removed (Next.js 16 Turbopack incompatibility).

Solution:

Option A: Sentry Edge SDK (compatible with Next.js middleware)
- Install: @sentry/nextjs with edge-only config
- Capture server errors via captureException() in middleware
- Source maps via Sentry webpack plugin
Option B: Custom error aggregation service
- POST errors to internal /api/errors/capture endpoint
- Store in error_logs table with context
- Alert on spike detection

Deliverable:

src/drop-app/sentry.edge.config.ts (if Option A)
Updated src/drop-app/src/lib/sentry-server.ts with edge-compatible capture
Test: Trigger 500 error, verify Sentry event created

Files: infrastructure/error-tracking-setup.md

1.2 Audit Logging System (8h)

Problem: PSD2 requires immutable audit trail for auth, data access, admin actions.

Solution:

Create audit_logs table:

CREATE TABLE audit_logs (
  id TEXT PRIMARY KEY,
  timestamp TEXT NOT NULL,
  user_id TEXT,
  action TEXT NOT NULL, -- 'login', 'data_access', 'kyc_approval', etc.
  resource_type TEXT, -- 'user', 'transaction', 'aml_alert'
  resource_id TEXT,
  metadata JSON,
  ip_address TEXT,
  user_agent TEXT,
  request_id TEXT,
  result TEXT -- 'success', 'failure', 'denied'
);
CREATE INDEX idx_audit_user ON audit_logs(user_id, timestamp);
CREATE INDEX idx_audit_action ON audit_logs(action, timestamp);

Audit functions:

auditLog({
  userId: 'usr_123',
  action: 'login_success',
  resourceType: 'user',
  resourceId: 'usr_123',
  metadata: { method: 'bankid' },
  ip: '1.2.3.4',
  userAgent: 'Mozilla...',
  requestId: 'req_456'
});

Integrate at:
- POST /api/auth/login (login_success, login_failure)
- POST /api/auth/logout (logout)
- GET /api/users/:id (data_access)
- PATCH /api/users/:id/kyc (kyc_approval, kyc_rejection)
- PATCH /api/aml-alerts/:id (aml_review)

Deliverable:

src/drop-app/src/lib/audit-log.ts (audit logging functions)
Migration: migrations/003_audit_logs.sql
Integration in auth routes and admin endpoints
Retention policy: Document 5-year retention for PSD2 compliance

Files: support/audit-logging-setup.md

1.3 WAF Deployment (2h)

Problem: WAF rules defined but not enforced (requires reverse proxy).

Solution:

Option A: Cloudflare WAF (recommended)
- Already using Cloudflare for DNS (terraform module exists)
- Free tier includes basic WAF rules
- Configure: SQLi, XSS, path traversal rules from infrastructure/waf-rules.md
Option B: AWS WAF (if using App Runner directly)
- $5/month + $1/million requests
- Associate with App Runner service

Deliverable:

Cloudflare WAF configuration (Terraform or UI)
Test: Send SQLi payload, verify 403 response
Document: Update infrastructure/waf-rules.md with deployment steps

Files: infrastructure/cloudflare-waf-setup.md

1.4 Log Aggregation (2h)

Problem: Structured logs write to stdout but aren't retained or searchable.

Solution:

AWS CloudWatch Logs (App Runner auto-integrates):
- App Runner streams stdout → CloudWatch Logs automatically
- Configure retention: 30 days (production), 7 days (staging)
- Set up log insights queries for common patterns
Fly.io (staging):
- fly logs stores last 24h by default
- Optional: Forward to external service (Papertrail, Logtail)

Deliverable:

CloudWatch Logs retention policy configured
Log Insights queries:
- All errors: fields @timestamp, message | filter level = "error"
- User actions: fields @timestamp, userId, message | filter userId = "usr_123"
- Request trace: fields @timestamp, requestId, message | filter requestId = "req_456"
Documentation: infrastructure/logging-setup.md

Files: infrastructure/cloudwatch-logs-setup.md

1.5 External Uptime Monitoring (1h)

Problem: BetterStack documented but not deployed.

Solution:

Sign up: https://betterstack.com/uptime (free tier)
Create monitors:
1. Production health: https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health
  - Interval: 3 minutes
  - Keyword check: "status":"ok"
2. Staging health: https://drop-staging.fly.dev/api/health
3. Landing page: https://getdrop.no (when live)
Slack integration: Connect to #drop-ops channel
Email alerts: alem@alai.no
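
To show how a status-plus-keyword check of this kind behaves, the sketch below models the decision as a pure function. It is not BetterStack's implementation; `evaluateMonitor` and `CheckResult` are hypothetical names, and the sample bodies are assumed shapes for the /api/health response.

```typescript
interface CheckResult {
  up: boolean;
  reason: string;
}

// Treats the endpoint as up only when the status is 2xx AND the body
// contains the configured keyword as an exact substring.
function evaluateMonitor(httpStatus: number, body: string, keyword: string): CheckResult {
  if (httpStatus < 200 || httpStatus >= 300) {
    return { up: false, reason: `HTTP ${httpStatus}` };
  }
  if (!body.includes(keyword)) {
    return { up: false, reason: "keyword missing" };
  }
  return { up: true, reason: "ok" };
}
```

Two consequences follow from matching on `"status":"ok"`: a degraded response (HTTP 200 with a different status value) would be reported as down, and the match depends on the JSON being serialized without a space after the colon.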

Deliverable:

BetterStack account with 3 monitors configured
Test: Pause monitor, verify alert received
Documentation: Update docs/infrastructure/BETTERSTACK-SETUP.md with credentials

Files: support/betterstack-deployment.md

1.6 Payment/Banking Failure Runbooks (4h)

Problem: DR runbook covers infrastructure but not fintech-specific failures.

Solution:

Create runbooks for:
1. BankID integration failure (authentication blocked)
2. PISP payment failure (remittance/QR payment rejected)
3. AISP balance retrieval failure (can't fetch account balance)
4. Swan API outage (BaaS provider down)
5. Sumsub KYC failure (identity verification unavailable)
6. Neonomics open banking outage
Each runbook includes:
- Symptoms (what users see)
- Diagnosis steps (check service status, logs, error codes)
- Recovery procedure (fallback, retry, escalation)
- Customer communication template

Deliverable:

support/runbooks/bankid-failure.md
support/runbooks/pisp-payment-failure.md
support/runbooks/aisp-balance-failure.md
support/runbooks/swan-api-outage.md
support/runbooks/sumsub-kyc-failure.md
support/runbooks/neonomics-outage.md

Files: Created in support/runbooks/

Phase 2: P1 Items (Phase 2: Banking Integration)

Defer to Phase 2 when real banking integrations are live and need production-grade support.

Priority order:

Penetration testing (external security audit)
APM for transaction tracing (identify slow payments)
On-call rotation and escalation policy
Automated rollback on failed deployments
Backup testing and monitoring
STR submission workflow (AML compliance)

Phase 3: P2 Items (Post-Launch)

Optimize after initial production deployment and user feedback.

Priority order:

Synthetic transaction monitoring (test critical user flows)
Public status page (customer transparency)
Core Web Vitals tracking (frontend performance)
SMS/phone alerts (redundancy)
Slow query monitoring (database optimization)
IDS/IPS (advanced threat detection)

Architecture

Support Systems Connectivity

┌─────────────────────────────────────────────────────────────────┐
│                         Drop Application                        │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │  drop-app   │  │  drop-api    │  │  drop-mobile (Expo)  │  │
│  │  (Next.js)  │  │  (Hono)      │  │  (React Native)      │  │
│  └─────────────┘  └──────────────┘  └──────────────────────┘  │
│         │                │                      │               │
│         └────────────────┴──────────────────────┘               │
│                          │                                      │
└──────────────────────────┼──────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────────────────┐
        │                  │                              │
        ▼                  ▼                              ▼
┌───────────────┐  ┌──────────────┐           ┌──────────────────┐
│ Structured    │  │ Health Check │           │ Audit Logs       │
│ Logging       │  │ Endpoint     │           │ (audit_logs      │
│ (JSON stdout) │  │ /api/health  │           │  table)          │
└───────┬───────┘  └──────┬───────┘           └─────────┬────────┘
        │                 │                             │
        │                 │                             │
        ▼                 │                             │
┌────────────────┐        │                             │
│ CloudWatch     │        │                             │
│ Logs           │        │                             │
│ (30d retention)│        │                             │
└────────────────┘        │                             │
        │                 │                             │
        │                 ▼                             │
        │         ┌───────────────┐                     │
        │         │ BetterStack   │                     │
        │         │ (external     │                     │
        │         │  monitoring)  │                     │
        │         └───────┬───────┘                     │
        │                 │                             │
        └─────────────────┼─────────────────────────────┘
                          │
                          ▼
                 ┌────────────────┐
                 │ Alerting Layer │
                 │ (alerts.ts)    │
                 └────────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Slack       │  │ Sentry      │  │ Email       │
│ Webhook     │  │ (client +   │  │ (SMTP)      │
│ (#drop-ops) │  │  edge)      │  │             │
└─────────────┘  └─────────────┘  └─────────────┘

Data Flows

Error Flow:
- Client error → Sentry browser → Slack alert (if spike)
- Server error → Sentry edge → CloudWatch Logs → Slack alert
- API 5xx → trackError() → Spike detection → Slack

Monitoring Flow:
- App → stdout → CloudWatch Logs
- App → /api/health → BetterStack → Slack/Email/SMS
- Container → Docker health check → Auto-restart

Audit Flow:
- User action → auditLog() → audit_logs table
- Compliance query → SQL export → Regulator submission

Incident Flow:
- Alert → Slack #drop-ops
- Unacknowledged (5 min) → Email to Alem
- Unresolved (15 min) → SMS (BetterStack escalation)
- Incident → Runbook → Recovery → Post-mortem
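
The "API 5xx → trackError() → Spike detection → Slack" path can be sketched as a sliding-window counter with a cooldown. This is only an illustration, not the contents of alerts.ts: the SpikeDetector class, the 10-errors-per-60-seconds threshold, and the 5-minute cooldown are all assumptions chosen for the example.

```typescript
// Hypothetical sliding-window spike detector; names and thresholds are illustrative.
class SpikeDetector {
  private timestamps: number[] = [];
  private lastAlertAt = -Infinity;
  private readonly windowMs: number;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(windowMs: number, threshold: number, cooldownMs: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  /** Record one error at `now` (epoch ms). Returns true when an alert should fire. */
  record(now: number): boolean {
    this.timestamps.push(now);
    // Drop errors that have fallen out of the window
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);

    const spiking = this.timestamps.length >= this.threshold;
    const coolingDown = now - this.lastAlertAt < this.cooldownMs;
    if (spiking && !coolingDown) {
      this.lastAlertAt = now;
      return true;
    }
    return false;
  }
}

// Assumed settings: 10 errors within 60 s, at most one alert every 5 minutes
const apiErrorSpikes = new SpikeDetector(60000, 10, 300000);

function trackErrorSketch(now: number): void {
  if (apiErrorSpikes.record(now)) {
    console.log('Spike detected: would post to Slack #drop-ops');
  }
}
```

The cooldown is what keeps a sustained outage from flooding the channel: after the first alert, further errors are still counted but stay silent until the cooldown has elapsed.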

Cost Estimate

Free Tier (MVP)

✅ CloudWatch Logs: 5 GB ingestion/month free (AWS Free Tier)
✅ BetterStack: 10 monitors, 3-min interval, unlimited alerts
✅ Sentry: 5K events/month free
✅ GitHub Actions: 2000 minutes/month free
✅ Terraform state: S3 free tier (first 12 months)

Total MVP cost: $0/month

Paid Services (Production)

CloudWatch Logs: ~$5/month (30 GB ingestion estimate)
BetterStack Pro: $20/month (30s interval, SMS alerts)
Sentry Team: $26/month (50K events, enhanced features)
Optional: Datadog APM: $15/host/month (~$45 for 3 hosts)

Total production cost: ~$50-100/month (without APM)
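
The production total can be checked by summing the listed prices. The sketch below uses only the figures from the list above; the object layout and function name are illustrative.

```typescript
// Monthly prices in USD, copied from the paid-services list above.
const paidServices: Record<string, number> = {
  cloudWatchLogs: 5,
  betterStackPro: 20,
  sentryTeam: 26,
};
const datadogApmPerHost = 15;

function monthlyTotalUsd(hostsWithApm: number): number {
  const base = Object.values(paidServices).reduce((sum, price) => sum + price, 0);
  return base + datadogApmPerHost * hostsWithApm;
}

console.log(monthlyTotalUsd(0)); // 51 (without APM)
console.log(monthlyTotalUsd(3)); // 96 (with Datadog APM on 3 hosts)
```

Both figures fall inside the ~$50-100 range quoted above.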

Recommendations

Immediate (This Week)

✅ Deploy BetterStack (1h) — External monitoring is a fast win
✅ Configure CloudWatch retention (30 min) — Logs already flow, just set policy
✅ Create audit log schema (2h) — Start with table, integrate incrementally

Before Phase 1 Demo (Next 2 Weeks)

✅ Implement server-side error tracking (4h) — Sentry edge or custom
✅ Write payment failure runbooks (4h) — Prepare for demo questions
✅ Deploy Cloudflare WAF (2h) — Security hygiene

Before Phase 2 Go-Live (Next 2-3 Months)

🔲 External penetration test (hire security firm, ~$5K budget)
🔲 APM implementation (Datadog or Sentry Performance)
🔲 On-call rotation (define schedule, test escalation)
🔲 Backup testing (restore from snapshot, verify data integrity)

Post-Launch Optimization

🔲 Synthetic monitoring (Checkly or custom Playwright tests)
🔲 Public status page (BetterStack included, just enable)
🔲 Core Web Vitals (Google Lighthouse CI integration)

Success Metrics

Before Go-Live (P0 Checklist)

Server errors visible in Sentry (test: trigger 500, verify event)
Audit logs capture login/logout (test: log in, check audit_logs table)
WAF blocks SQLi attack (test: ?id=1' OR '1'='1, expect 403)
CloudWatch Logs retain 30 days (verify retention policy)
BetterStack alerts on downtime (test: stop app, receive alert <5 min)
Runbooks tested (simulate BankID failure, follow procedure)

Production KPIs

Uptime: >99.9% (measured by BetterStack)
MTTD (Mean Time To Detect): <3 minutes (external monitoring interval)
MTTR (Mean Time To Recover): <15 minutes (via runbooks)
Error rate: <0.1% of requests (tracked via Sentry)
Log retention: 100% compliance (30 days CloudWatch, 5 years audit logs)
Alert noise: <5 false positives/week (cooldown + severity tuning)
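
To make the uptime target concrete, the allowed downtime can be derived from the percentage. A small worked calculation, assuming a 30-day month (the function name is illustrative):

```typescript
// Allowed downtime in minutes for a given uptime target over a period of days.
function downtimeBudgetMinutes(uptimeTargetPercent: number, days: number): number {
  const totalMinutes = days * 24 * 60;
  const budget = (totalMinutes * (100 - uptimeTargetPercent)) / 100;
  return Math.round(budget * 100) / 100; // round away floating-point noise
}

const monthlyDowntimeBudget = downtimeBudgetMinutes(99.9, 30);
console.log(monthlyDowntimeBudget);                  // 43.2
console.log(Math.floor(monthlyDowntimeBudget / 15)); // 2
```

At 99.9% the budget is about 43 minutes per 30 days, which leaves room for two incidents that each run to the full 15-minute MTTR target.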

Appendices

A. Related Documents

docs/infrastructure/MONITORING.md — Current monitoring setup
docs/infrastructure/BETTERSTACK-SETUP.md — External monitoring guide
docs/dr-runbook.md — Infrastructure disaster recovery
infrastructure/waf-rules.md — WAF rule definitions
security/drop-security-rapport.md — Security audit findings

B. External Services

BetterStack: https://betterstack.com/uptime
Sentry: https://sentry.io/
AWS CloudWatch: https://console.aws.amazon.com/cloudwatch/
Cloudflare: https://dash.cloudflare.com/

C. Change History

2026-02-22: Initial analysis (John)

Next Actions:

Review this analysis with Alem
Approve P0 implementation plan
Begin P0 work (estimated 21 hours / 2-3 days)
Track progress in Mission Control tasks

Audit Logging Setup — Drop Fintech

Date: 2026-02-22
Priority: P0 (Production Blocker)
Compliance: PSD2, GDPR
Effort: 8 hours

Overview

Audit logging provides an immutable record of all authentication, authorization, data access, and administrative actions. This is a legal requirement for PSD2-regulated payment services and GDPR data protection compliance.

Requirements

PSD2 Audit Trail Requirements

All authentication events (login, logout, failed attempts)
Authorization decisions (who accessed what resource)
Transaction creation and modification
KYC/AML review actions
Administrative user actions
Data exports and bulk operations
Retention: 5 years minimum

GDPR Right of Access Requirements

Users must be able to request all logged actions related to their data
Export format: Human-readable (CSV or JSON)

Database Schema

Migration: `003_audit_logs.sql`

-- Audit Logs Table (PostgreSQL 16 — ADR-014)
-- Schema managed via Drizzle ORM (src/shared/db/schema.ts)
-- Apply with: make db-push

CREATE TABLE IF NOT EXISTS audit_log (
  id TEXT PRIMARY KEY,
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  user_id TEXT,
  action TEXT NOT NULL,
  resource_type TEXT,
  resource_id TEXT,
  details JSONB,
  ip_address TEXT,
  user_agent TEXT,
  request_id TEXT,
  result TEXT NOT NULL DEFAULT 'success', -- 'success', 'failure', 'denied'
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_audit_user_time ON audit_log(user_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_audit_action_time ON audit_log(action, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_audit_resource ON audit_log(resource_type, resource_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_audit_request ON audit_log(request_id);
CREATE INDEX IF NOT EXISTS idx_audit_result ON audit_log(result, timestamp DESC);

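-- Partitioning by month (production). This statement only works if audit_log is
-- created with PARTITION BY RANGE (timestamp), as shown under Retention Policy.
CREATE TABLE audit_log_2026_02 PARTITION OF audit_log FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');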

Migration steps (PostgreSQL 16 via Drizzle ORM):

Schema is defined in src/shared/db/schema.ts

Apply with:

make db-push
# or: cd src/shared && npx drizzle-kit push

Verify table exists:

psql "$DATABASE_URL" -c "SELECT table_name FROM information_schema.tables WHERE table_name='audit_log';"

Implementation

Audit Log Library: `src/lib/audit-log.ts`

import { run } from './db'; // assumption: ./db exports run() alongside the query() helper used below
import { randomId } from './utils-server';
import { logger } from './logger';

export type AuditAction =
  // Authentication
  | 'login_success'
  | 'login_failure'
  | 'logout'
  | 'password_change'
  | 'session_revoked'
  // Authorization
  | 'access_granted'
  | 'access_denied'
  // Data Access
  | 'data_view'
  | 'data_export'
  | 'data_delete'
  // Transactions
  | 'transaction_created'
  | 'transaction_completed'
  | 'transaction_failed'
  // KYC/AML
  | 'kyc_approved'
  | 'kyc_rejected'
  | 'aml_alert_created'
  | 'aml_alert_reviewed'
  // Admin
  | 'user_created'
  | 'user_updated'
  | 'user_deleted'
  | 'role_changed';

export type AuditResult = 'success' | 'failure' | 'denied';

export interface AuditLogEntry {
  userId?: string;
  action: AuditAction;
  resourceType?: string;
  resourceId?: string;
  metadata?: Record<string, unknown>;
  ip?: string | null; // headers.get() returns null when the header is absent
  userAgent?: string | null;
  requestId?: string;
  result?: AuditResult;
}

/**
 * Create an audit log entry
 *
 * IMPORTANT: This function must NEVER throw errors.
 * Audit failures should not block user actions.
 */
export async function auditLog(entry: AuditLogEntry): Promise<void> {
  try {
    const id = randomId('audit');
    const timestamp = new Date().toISOString();

    // Table and column names follow the schema above (audit_log, details)
    await run(
      `INSERT INTO audit_log (
        id, timestamp, user_id, action, resource_type, resource_id,
        details, ip_address, user_agent, request_id, result
      ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
      [
        id,
        timestamp,
        entry.userId || null,
        entry.action,
        entry.resourceType || null,
        entry.resourceId || null,
        entry.metadata ? JSON.stringify(entry.metadata) : null,
        entry.ip || null,
        entry.userAgent || null,
        entry.requestId || null,
        entry.result || 'success',
      ]
    );

    logger.debug('Audit log created', { auditId: id, action: entry.action });
  } catch (error) {
    // Log error but do NOT throw (audit failures should not block operations)
    logger.error('Failed to create audit log', {
      error: error instanceof Error ? error.message : String(error),
      action: entry.action,
    });
  }
}

/**
 * Retrieve audit logs for a user (GDPR Right of Access)
 */
export async function getUserAuditLogs(
  userId: string,
  options?: { limit?: number; offset?: number; startDate?: string; endDate?: string }
): Promise<unknown[]> {
  const { limit = 100, offset = 0, startDate, endDate } = options || {};

  let sql = 'SELECT * FROM audit_log WHERE user_id = ?';
  const params: unknown[] = [userId];

  if (startDate) {
    sql += ' AND timestamp >= ?';
    params.push(startDate);
  }

  if (endDate) {
    sql += ' AND timestamp <= ?';
    params.push(endDate);
  }

  sql += ' ORDER BY timestamp DESC LIMIT ? OFFSET ?';
  params.push(limit, offset);

  const { query } = await import('./db');
  return query(sql, params);
}

/**
 * Export audit logs as CSV (for compliance reporting)
 */
export async function exportAuditLogsCSV(
  filters?: {
    userId?: string;
    action?: AuditAction;
    startDate?: string;
    endDate?: string;
  }
): Promise<string> {
  let sql = 'SELECT * FROM audit_log WHERE 1=1';
  const params: unknown[] = [];

  if (filters?.userId) {
    sql += ' AND user_id = ?';
    params.push(filters.userId);
  }

  if (filters?.action) {
    sql += ' AND action = ?';
    params.push(filters.action);
  }

  if (filters?.startDate) {
    sql += ' AND timestamp >= ?';
    params.push(filters.startDate);
  }

  if (filters?.endDate) {
    sql += ' AND timestamp <= ?';
    params.push(filters.endDate);
  }

  sql += ' ORDER BY timestamp DESC';

  const { query } = await import('./db');
  const rows = await query(sql, params);

  // Convert to CSV
  const headers = [
    'id',
    'timestamp',
    'user_id',
    'action',
    'resource_type',
    'resource_id',
    'details',
    'ip_address',
    'user_agent',
    'request_id',
    'result',
  ];

  const csvRows = [headers.join(',')];

  for (const row of rows as Record<string, unknown>[]) {
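    const values = headers.map((h) => {
      const val = row[h];
      if (val === null || val === undefined) return '';
      // Serialize dates and JSONB objects explicitly; String() would yield "[object Object]"
      const text =
        val instanceof Date ? val.toISOString()
        : typeof val === 'object' ? JSON.stringify(val)
        : String(val);
      return text.replace(/"/g, '""'); // Escape quotes
    });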
    csvRows.push(values.map((v) => `"${v}"`).join(','));
  }

  return csvRows.join('\n');
}

Integration Points

1. Authentication (`src/app/api/auth/login/route.ts`)

import { auditLog } from '@/lib/audit-log';

export async function POST(request: NextRequest) {
  const { email, password } = await request.json();
  const ip = request.headers.get('x-forwarded-for') || request.headers.get('x-real-ip');
  const userAgent = request.headers.get('user-agent');
  const requestId = getRequestId(request.headers);

  try {
    const user = await getUserByEmail(email);
    if (!user || !await verifyPassword(password, user.password_hash)) {
      // Audit failed login attempt
      await auditLog({
        userId: user?.id,
        action: 'login_failure',
        metadata: { email, reason: 'invalid_credentials' },
        ip,
        userAgent,
        requestId,
        result: 'failure',
      });

      return jsonError('Invalid credentials', 401);
    }

    // Audit successful login
    await auditLog({
      userId: user.id,
      action: 'login_success',
      metadata: { email },
      ip,
      userAgent,
      requestId,
      result: 'success',
    });

    // ... rest of login logic
  } catch (error) {
    // ... error handling
  }
}

2. Logout (`src/app/api/auth/logout/route.ts`)

await auditLog({
  userId: session.userId,
  action: 'logout',
  metadata: { sessionId: session.id },
  ip,
  userAgent,
  requestId,
});

3. Data Access (`src/app/api/users/[id]/route.ts`)

export async function GET(request: NextRequest, { params }: { params: { id: string } }) {
  const session = await requireAuth(request);
  const userId = params.id;

  // Check authorization
  if (session.userId !== userId && session.role !== 'admin') {
    await auditLog({
      userId: session.userId,
      action: 'access_denied',
      resourceType: 'user',
      resourceId: userId,
      metadata: { reason: 'insufficient_permissions' },
      ip: request.headers.get('x-forwarded-for'),
      userAgent: request.headers.get('user-agent'),
      requestId: getRequestId(request.headers),
      result: 'denied',
    });

    return jsonError('Access denied', 403);
  }

  // Audit successful data access
  await auditLog({
    userId: session.userId,
    action: 'data_view',
    resourceType: 'user',
    resourceId: userId,
    ip: request.headers.get('x-forwarded-for'),
    userAgent: request.headers.get('user-agent'),
    requestId: getRequestId(request.headers),
  });

  const user = await getUserById(userId);
  return jsonSuccess(user);
}

4. KYC Approval (`src/app/api/admin/kyc/route.ts`)

await auditLog({
  userId: adminSession.userId,
  action: 'kyc_approved',
  resourceType: 'user',
  resourceId: targetUserId,
  metadata: { reason: kycApprovalReason },
  ip: request.headers.get('x-forwarded-for'),
  userAgent: request.headers.get('user-agent'),
  requestId: getRequestId(request.headers),
});

5. Transaction Creation (`src/app/api/transactions/route.ts`)

await auditLog({
  userId: session.userId,
  action: 'transaction_created',
  resourceType: 'transaction',
  resourceId: transactionId,
  metadata: {
    type: transactionType,
    amount: amount,
    currency: currency,
  },
  ip: request.headers.get('x-forwarded-for'),
  userAgent: request.headers.get('user-agent'),
  requestId: getRequestId(request.headers),
});
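
Every integration point above repeats the same three header lookups. One way to centralize them is a small helper like the sketch below. It is hypothetical, not part of the existing codebase: the name auditContext and the use of an x-request-id header are assumptions (the document's own getRequestId() may work differently).

```typescript
interface AuditRequestContext {
  ip: string | undefined;
  userAgent: string | undefined;
  requestId: string | undefined;
}

// Hypothetical helper: collects the audit fields that every route currently reads by hand.
function auditContext(headers: Headers): AuditRequestContext {
  // x-forwarded-for may hold a comma-separated proxy chain; the first entry is the client
  const firstForwarded = headers.get('x-forwarded-for')?.split(',')[0]?.trim();
  return {
    ip: firstForwarded || headers.get('x-real-ip') || undefined,
    userAgent: headers.get('user-agent') || undefined,
    // Assumption: the request ID travels in an x-request-id header
    requestId: headers.get('x-request-id') || undefined,
  };
}

// Usage inside a route handler (sketch):
//   await auditLog({ userId, action: 'data_view', ...auditContext(request.headers) });
```

Note that x-forwarded-for is only trustworthy when it is set by a proxy you control; a client can otherwise send any value it likes.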

Compliance Reporting

GDPR Data Access Request (User)

// src/app/api/users/[id]/audit-logs/route.ts
export async function GET(request: NextRequest, { params }: { params: { id: string } }) {
  const session = await requireAuth(request);

  // Users can only access their own audit logs
  if (session.userId !== params.id && session.role !== 'admin') {
    return jsonError('Access denied', 403);
  }

  const logs = await getUserAuditLogs(params.id, {
    limit: 1000, // Page size only: GDPR access covers all data, so page with offset until no rows remain
    startDate: request.nextUrl.searchParams.get('start') || undefined,
    endDate: request.nextUrl.searchParams.get('end') || undefined,
  });

  return jsonSuccess({ logs });
}

PSD2 Audit Trail Export (Admin)

// src/app/api/admin/audit/export/route.ts
export async function GET(request: NextRequest) {
  const session = await requireAuth(request);

  if (session.role !== 'admin') {
    return jsonError('Admin access required', 403);
  }

  const startDate = request.nextUrl.searchParams.get('start');
  const endDate = request.nextUrl.searchParams.get('end');
  const action = request.nextUrl.searchParams.get('action');
  const userId = request.nextUrl.searchParams.get('userId');

  const csv = await exportAuditLogsCSV({
    userId: userId || undefined,
    action: action as AuditAction | undefined,
    startDate: startDate || undefined,
    endDate: endDate || undefined,
  });

  return new Response(csv, {
    headers: {
      'Content-Type': 'text/csv',
      'Content-Disposition': `attachment; filename="audit-logs-${new Date().toISOString()}.csv"`,
    },
  });
}

Retention Policy

PSD2 Requirement: 5 Years

PostgreSQL 16 (all environments — ADR-014):

Use table partitioning by month:

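CREATE TABLE audit_log (
  id TEXT NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- ... other columns
  -- PostgreSQL requires the partition key in the primary key of a partitioned table
  PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);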

-- Create partitions for each month
CREATE TABLE audit_log_2026_02 PARTITION OF audit_log
  FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
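
Monthly partitions have to exist before rows for that month arrive, so the statement above is usually generated rather than written by hand. A minimal sketch of such a generator follows; the function name is hypothetical and the output simply mirrors the naming used above.

```typescript
// Builds the DDL for one monthly partition of audit_log. `month` is 1-based (1 = January).
function monthlyPartitionDdl(year: number, month: number): string {
  if (!Number.isInteger(month) || month < 1 || month > 12) {
    throw new RangeError(`month must be 1-12, got ${month}`);
  }
  const pad = (n: number): string => String(n).padStart(2, '0');
  // Date.UTC takes a 0-based month, so passing the 1-based value yields the following month;
  // it also rolls December over into January of the next year.
  const next = new Date(Date.UTC(year, month, 1));
  const from = `${year}-${pad(month)}-01`;
  const to = `${next.getUTCFullYear()}-${pad(next.getUTCMonth() + 1)}-01`;
  return (
    `CREATE TABLE IF NOT EXISTS audit_log_${year}_${pad(month)} PARTITION OF audit_log ` +
    `FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}

console.log(monthlyPartitionDdl(2026, 2));
// CREATE TABLE IF NOT EXISTS audit_log_2026_02 PARTITION OF audit_log FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
```

Running this from a scheduled job a month ahead avoids insert failures when a new month starts without a matching partition.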

Automatic cleanup script (cron weekly):

#!/bin/bash
# Delete audit logs older than 5 years (PSD2 retention)
psql "$DATABASE_URL" -c "DELETE FROM audit_log WHERE timestamp < NOW() - INTERVAL '5 years';"

Testing

Test Audit Logging

# 1. Create audit log entry
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"wrong"}'

# 2. Check audit log table (PostgreSQL 16)
psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 5;"

# Expected output:
# audit_123 | 2026-02-22T10:00:00.000Z | usr_456 | login_failure | ... | {"email":"test@example.com","reason":"invalid_credentials"} | 1.2.3.4 | Mozilla/5.0... | req_789 | failure

# 3. Export audit logs (admin)
curl -X GET "http://localhost:3000/api/admin/audit/export?start=2026-02-01&end=2026-02-28" \
  -H "Cookie: auth-token=<admin-jwt>" \
  > audit-logs.csv

# 4. Verify CSV format
head -n 5 audit-logs.csv

Monitoring

Alert on Audit Failures

Add to src/lib/audit-log.ts:

import { sendAlert } from './alerts';

export async function auditLog(entry: AuditLogEntry): Promise<void> {
  try {
    // ... insert logic
  } catch (error) {
    logger.error('Failed to create audit log', { error, action: entry.action });

    // CRITICAL: Alert if audit logging fails (compliance risk)
    try {
      await sendAlert({
        severity: 'critical',
        title: 'Audit logging failure',
        message: `Failed to record ${entry.action} for user ${entry.userId}`,
      });
    } catch {
      // Alert delivery failed too; swallow it so auditLog() still never throws
    }
  }
}

Metrics to Track

Audit logs created per hour (should correlate with user activity)
Failed audit log attempts (should be zero)
Audit log export requests (GDPR compliance)
Audit log storage size (retention planning)

Security Considerations

Immutability

Audit logs should NEVER be updated or deleted (except by automated retention policy)
No UPDATE or DELETE API endpoints for audit logs
Database permissions: the role that writes audit entries gets INSERT only (no UPDATE or DELETE); all other application roles get read-only access

Access Control

Only admins can view full audit trails
Users can view their own audit logs only
Export requires elevated permissions

Data Redaction

Do NOT log passwords, tokens, or sensitive PII in metadata
Card numbers: Log last 4 digits only
Fødselsnummer: Log checksum/hash, not full number
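
The redaction rules above can be applied in one place before metadata reaches auditLog(). The sketch below is illustrative only: the function names, the list of sensitive keys, and the cardNumber field name are assumptions. For the fødselsnummer it uses a keyed HMAC-SHA-256 rather than a plain hash, because an unkeyed hash of an 11-digit number can be reversed by brute force.

```typescript
import { createHmac } from 'node:crypto';

// Keep only the last 4 digits of a card number
function maskCardNumber(cardNumber: string): string {
  const digits = cardNumber.replace(/\D/g, '');
  return `****${digits.slice(-4)}`;
}

// Keyed hash of a national ID; the key must come from a secret store, never from code
function hashNationalId(nationalId: string, secretKey: string): string {
  return createHmac('sha256', secretKey).update(nationalId).digest('hex');
}

// Assumed list of keys that must never be written to the audit log
const SENSITIVE_KEYS = new Set(['password', 'token', 'accessToken', 'refreshToken']);

function redactMetadata(metadata: Record<string, unknown>): Record<string, unknown> {
  const redacted: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (SENSITIVE_KEYS.has(key)) continue; // drop secrets entirely
    redacted[key] =
      key === 'cardNumber' && typeof value === 'string' ? maskCardNumber(value) : value;
  }
  return redacted;
}
```

Calling redactMetadata() inside auditLog() itself, rather than at each call site, would make the rule hard to bypass by accident.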

Checklist

Next Steps:

Create migration file
Implement audit-log.ts library
Integrate into auth routes (high priority)
Add to remaining endpoints incrementally
Test with real login/logout flows
Deploy to staging for verification

Runbooks

Operational runbooks for failure scenarios

Runbook: AISP Balance Fetch Failure

Service: AISP (Account Information Service Provider)
Severity: MEDIUM (users can't see bank balance)
MTTR Target: <20 minutes
Owner: John (AI Director)

Symptoms

Users report they cannot see their bank account balance in Drop. Symptoms include:

Dashboard shows "Balance unavailable" or stale balance
Error message: "Could not fetch account information"
Infinite loading spinner on balance widget
Balance shows "0 kr" or "—" instead of actual amount

User impact: Cannot verify available funds before making payments (may lead to insufficient funds errors).

Diagnosis

1. Check Neonomics AISP Status

External status:

# Neonomics has no public status page — test via API
curl -X GET https://api.neonomics.io/health \
  -H "Authorization: Bearer <api-key>" \
  -v

# Expected: HTTP 200
# If 500/503: Neonomics outage

Check specific bank connectivity:

# List supported banks and their status
curl -X GET https://api.neonomics.io/banks \
  -H "Authorization: Bearer <api-key>" \
  | jq '.[] | select(.country == "NO") | {name, status}'

# Look for: "status": "degraded" or "offline"

2. Check Drop Logs

# CloudWatch Logs (production)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "aisp" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1

# Look for:
# - "AISP consent expired"
# - "AISP API timeout"
# - "AISP 401 Unauthorized"
# - "Bank API unavailable: DNB"

3. Check User Consent Status

# Verify Open Banking consent hasn't expired
# Consent is valid for 90 days from last authorization

# Check database for expired consents (PostgreSQL 16)
psql "$DATABASE_URL" <<EOF
SELECT
  user_id,
  bank_name,
  consent_expires_at,
  EXTRACT(EPOCH FROM (consent_expires_at - NOW())) / 86400 AS days_remaining
FROM bank_accounts
WHERE consent_expires_at < NOW() + INTERVAL '7 days'
ORDER BY consent_expires_at ASC
LIMIT 10;
EOF

# If days_remaining < 0: consent expired
# If days_remaining < 7: warn user to renew soon

4. Test AISP Flow

Manual test (staging):

# 1. Login
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

# 2. Fetch balance
curl -X GET https://drop-staging.fly.dev/api/accounts/balance \
  -H "Authorization: Bearer $TOKEN" \
  -v

# Expected: HTTP 200, { "balance": 15000.50, "currency": "NOK" }
# If 401: Consent expired
# If 500: AISP integration broken

5. Check Rate Limiting

# Check if Neonomics API rate limit exceeded
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "rate_limit" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  | jq '.events[].message' \
  | grep -E "429|X-RateLimit"

# If many 429 errors: rate limiting issue

Common Causes & Solutions

Cause 1: Expired Open Banking Consent

Probability: 40% (PSD2 consent expires after 90 days)

Symptoms:

Error code: CONSENT_EXPIRED or CONSENT_INVALID
Logs show: "AISP consent no longer valid"
Specific users affected (not all users)

Solution:

Identify affected users:

-- PostgreSQL 16
SELECT user_id, email, bank_name, consent_expires_at
FROM bank_accounts
JOIN users ON users.id = bank_accounts.user_id
WHERE consent_expires_at < NOW();

Notify users to re-authorize:

Push notification (Norwegian):

Banktilkobling utløpt
Godkjenningen for å hente saldo fra [Bank] har utløpt.
Trykk her for å fornye tilkoblingen.

Email (Norwegian):

Emne: Godkjenn tilgang til bankkonto på nytt

Hei,

Din godkjenning for å vise saldo fra [Bank] har utløpt etter 90 dager.
Dette er et PSD2-sikkerhetskrav.

Logg inn i Drop og koble til bankkontoen på nytt for å fortsette å se saldoen din.

Mvh,
Drop

Guide user through re-consent:
- User taps notification → redirect to "Reconnect Bank Account" screen
- Initiate new AISP consent flow (BankID + bank authorization)
- Update consent_expires_at = NOW() + INTERVAL '90 days'

Automatic consent renewal reminder:

# Cron job to warn users 7 days before expiry
# Send reminder: "Your bank connection expires in 7 days, renew now"
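
The reminder job described above comes down to classifying each consent by days remaining and picking those inside the 7-day warning window that have not been reminded yet. A minimal sketch of that selection logic follows; the types and function names are hypothetical, with fields mirroring the consent_expires_at and consent_renewal_reminder_sent columns used elsewhere in this runbook.

```typescript
interface BankConsent {
  userId: string;
  bankName: string;
  consentExpiresAt: Date;
  reminderSent: boolean;
}

type ConsentState = 'expired' | 'expiring_soon' | 'valid';

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function consentState(expiresAt: Date, now: Date, warnDays = 7): ConsentState {
  const daysRemaining = (expiresAt.getTime() - now.getTime()) / MS_PER_DAY;
  if (daysRemaining < 0) return 'expired';
  if (daysRemaining < warnDays) return 'expiring_soon';
  return 'valid';
}

// Consents that should receive a renewal reminder on this run
function consentsNeedingReminder(consents: BankConsent[], now: Date): BankConsent[] {
  return consents.filter(
    (c) => !c.reminderSent && consentState(c.consentExpiresAt, now) === 'expiring_soon',
  );
}
```

Already-expired consents are deliberately excluded here, since they need the re-authorization message rather than a reminder.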

ETA: Immediate (user action required)

Cause 2: Bank API Outage or Maintenance

Probability: 15% (specific bank temporarily unavailable)

Symptoms:

All users of specific bank (e.g., DNB, Nordea) cannot fetch balance
Other banks work fine
Logs show: "Bank API timeout" or "502 Bad Gateway"

Solution:

Identify affected bank:

# Check which bank is failing
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Bank API" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  | jq '.events[].message' \
  | grep -o '"bank":"[^"]*"' \
  | sort | uniq -c | sort -rn

# Example output: "bank":"DNB" appears 50 times

Check bank status:
- Visit bank's website: check for maintenance announcements
- Norwegian banks often schedule maintenance 02:00-06:00 CET
- DNB status: https://www.dnb.no/drift
- Nordea status: https://www.nordea.no/info/driftsmeldinger

Notify affected users (Norwegian):

Emne: Saldo midlertidig utilgjengelig for [Bank]

Hei,

Vi opplever for øyeblikket problemer med å hente saldo fra [Bank].
Dette skyldes tekniske problemer hos banken.

Du kan fortsatt gjøre betalinger, men saldoen vises ikke akkurat nå.
Vi jobber med å gjenopprette tjenesten.

Estimert løsning: [X minutter/timer]

Mvh,
Drop

Implement graceful degradation:

// src/app/api/accounts/balance/route.ts
async function fetchBalance(userId: string) {
  try {
    return await neonomicsClient.getBalance(userId);
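  } catch (error) {
    // Caught errors are `unknown` in strict TypeScript, so narrow before reading .code
    if ((error as { code?: string }).code === 'BANK_API_TIMEOUT') {
      // Return cached balance with warning
      const cached = await getCachedBalance(userId);
      return {
        balance: cached?.balance ?? null, // ?? keeps a legitimate 0 balance, unlike ||
        currency: 'NOK',
        lastUpdated: cached?.timestamp,
        warning: 'Balance may be outdated due to bank API issues'
      };
    }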
    throw error;
  }
}

ETA: Depends on bank (typically <2 hours for maintenance, <1 hour for incidents)

Cause 3: Neonomics API Outage

Probability: 10% (Neonomics service disruption)

Symptoms:

ALL users cannot fetch balance regardless of bank
Logs show: "Neonomics API unreachable" or HTTP 503
Test API call to Neonomics fails

Solution:

Verify Neonomics outage:

# Test Neonomics health endpoint
curl -X GET https://api.neonomics.io/health \
  -H "Authorization: Bearer <api-key>" \
  -v

# If timeout or 503: confirmed outage

Contact Neonomics support:
- Email: support@neonomics.io
- Slack: #neonomics-support (if available)
- Check Neonomics Slack for incident updates

Enable fallback mode:

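# Show cached balances to all users
# App Runner takes runtime environment variables under --source-configuration
# (ImageConfiguration), not --instance-configuration (CPU, memory, instance role only).
# Assumption: list every variable the service needs, as the map may be replaced as a whole.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration "ImageRepository={ImageIdentifier=<image-uri>,ImageRepositoryType=<ECR|ECR_PUBLIC>,ImageConfiguration={RuntimeEnvironmentVariables={AISP_FALLBACK_MODE=cached,AISP_FALLBACK_CACHE_TTL=3600}}}"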

Communicate to users (Norwegian):

Emne: Saldo vises med forsinkelse

Hei,

Vår leverandør for bankdata opplever tekniske problemer.
Saldoen du ser kan være opptil 1 time gammel.

Du kan fortsatt gjøre betalinger som normalt.
Vi forventer at tjenesten er tilbake innen [X minutter].

Mvh,
Drop

Monitor Neonomics status:

Check every 10 minutes for resolution
When API is back: disable fallback mode

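# Runtime environment variables belong under --source-configuration, not --instance-configuration
aws apprunner update-service --service-arn <ARN> \
  --source-configuration "ImageRepository={ImageIdentifier=<image-uri>,ImageRepositoryType=<ECR|ECR_PUBLIC>,ImageConfiguration={RuntimeEnvironmentVariables={AISP_FALLBACK_MODE=live}}}"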

ETA: Depends on Neonomics (typically <2 hours)

Cause 4: Invalid or Revoked API Credentials

Probability: 5% (after credential rotation or account issue)

Symptoms:

Logs show: "401 Unauthorized" or "invalid_api_key"
All AISP requests fail immediately
Other Drop services work fine (auth, database, etc.)

Solution:

Verify Neonomics API credentials:

bw get item "Neonomics API" --session $BW_SESSION

# Check:
# - API key is not expired
# - API key has AISP permissions
# - Correct environment (production vs sandbox)

Update App Runner environment variables:

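# Runtime environment variables belong under --source-configuration, not --instance-configuration
aws apprunner update-service --service-arn <ARN> \
  --source-configuration "ImageRepository={ImageIdentifier=<image-uri>,ImageRepositoryType=<ECR|ECR_PUBLIC>,ImageConfiguration={RuntimeEnvironmentVariables={NEONOMICS_API_KEY=<correct-key>,NEONOMICS_ENVIRONMENT=production}}}"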

Trigger deployment:

aws apprunner start-deployment --service-arn <ARN> --region eu-west-1

# Wait 3-5 minutes for deployment to complete

Test after deployment:

# Verify AISP working
curl -X GET https://getdrop.no/api/accounts/balance \
  -H "Authorization: Bearer <test-user-token>" \
  -v

# Expected: HTTP 200 with balance data

ETA: 10 minutes

Cause 5: Network or Firewall Issues

Probability: 5% (AWS security group misconfiguration)

Symptoms:

Logs show: "Connection timeout" or "ECONNREFUSED"
AISP API requests never reach Neonomics
Other external APIs may also fail

Solution:

Check outbound connectivity:

# App Runner egress is unrestricted by default
# If using VPC connector, check security group
aws ec2 describe-security-groups \
  --group-ids <vpc-connector-sg> \
  --region eu-west-1 \
  | jq '.SecurityGroups[].IpPermissionsEgress'

Test DNS resolution:

# From your local machine or bastion host
nslookup api.neonomics.io

# Should resolve to Neonomics IP
# If NXDOMAIN: DNS issue

Check AWS service health:

# Check App Runner service events
aws apprunner list-operations \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.OperationSummaryList[] | select(.Type == "CREATE_SERVICE" or .Type == "UPDATE_SERVICE")'

# Look for recent errors

Whitelist Neonomics IPs (if using strict firewall):
- Contact Neonomics for IP ranges
- Add to security group outbound rules
- Allow HTTPS (443) to Neonomics endpoints

ETA: 15 minutes (if quick fix), 1 hour (if requires networking changes)

Cause 6: Rate Limiting (High Traffic)

Probability: 10% (during peak hours or viral event)

Symptoms:

Logs show: HTTP 429 "Too Many Requests"
Intermittent failures (some users see balance, others don't)
Rate limit headers in logs

Solution:

Check rate limit headers:

aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "X-RateLimit" \
  --start-time $(date -u -d '5 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -E "X-RateLimit-(Limit|Remaining|Reset)"

Implement request throttling:

// src/lib/aisp-client.ts
import PQueue from 'p-queue';

const queue = new PQueue({
  concurrency: 10,      // Max 10 concurrent requests
  interval: 1000,        // Per second
  intervalCap: 50        // Max 50 requests per second
});

export async function fetchBalance(userId: string) {
  return queue.add(() => neonomicsClient.getBalance(userId));
}

Cache balance aggressively during rate limit:

// src/lib/balance-cache.ts
const CACHE_TTL_NORMAL = 60;      // 60 seconds
const CACHE_TTL_RATE_LIMIT = 300; // 5 minutes during rate limit

export async function getBalanceWithCache(userId: string) {
  const cached = await redis.get(`balance:${userId}`);
  if (cached) return JSON.parse(cached);

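  try {
    const balance = await fetchBalance(userId);
    const payload = JSON.stringify(balance);
    await redis.setex(`balance:${userId}`, CACHE_TTL_NORMAL, payload);
    // Keep a longer-lived copy to serve while rate limited
    await redis.setex(`balance:stale:${userId}`, CACHE_TTL_RATE_LIMIT, payload);
    return balance;
  } catch (error) {
    if ((error as { status?: number }).status === 429) {
      // The fresh key has already expired at this point, so extending its TTL
      // would be a no-op; fall back to the stale copy instead
      const stale = await redis.get(`balance:stale:${userId}`);
      if (stale) return JSON.parse(stale);
    }
    throw error;
  }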
}

Contact Neonomics to increase rate limit:
- Email support with traffic stats
- Request higher API quota for production
- Provide justification (user growth, peak times)

ETA: 5 minutes (automatic caching), 1-2 days (if quota increase needed)

Emergency Workarounds

Option 1: Cached Balance Mode

Use case: AISP provider down >30 minutes, users need to see approximate balance

Steps:

Enable cached balance fallback:

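# Runtime environment variables belong under --source-configuration, not --instance-configuration.
# Variable names match the fallback mode described under Cause 3.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration "ImageRepository={ImageIdentifier=<image-uri>,ImageRepositoryType=<ECR|ECR_PUBLIC>,ImageConfiguration={RuntimeEnvironmentVariables={AISP_FALLBACK_MODE=cached,AISP_FALLBACK_CACHE_TTL=3600}}}"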

Show warning banner in app:

⚠️ Saldo vises med forsinkelse
Vi viser din sist kjente saldo fra [timestamp].
Tjenesten er tilbake til normal snart.

Allow payments to proceed:
- Users can still initiate payments (PISP)
- Balance check uses cached value
- Risk: Insufficient funds errors if balance changed

Revert when AISP is back:

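# Environment variables are set through --source-configuration; ECR image assumed
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {"ImageIdentifier": "<image-uri>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {"AISP_MODE": "live"}}}}'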

Risk: Cached balance may be stale (up to 1 hour old). Users may attempt payments with insufficient funds.

Option 2: Hide Balance, Allow Payments

Use case: AISP down, no reliable cache, but PISP still works

Steps:

Show "Balance unavailable" message:

Saldo midlertidig utilgjengelig
Du kan fortsatt gjøre betalinger som normalt.
Banken vil avvise betalingen hvis du ikke har nok midler.

Allow payments without balance check:
- User enters payment amount
- Drop initiates payment via PISP
- Bank performs real-time balance check
- If insufficient funds: bank rejects, user gets clear error

Communicate ETA to users:

Vi jobber med å gjenopprette saldovisning.
Estimert tid: [X minutter]

Risk: User experience degraded. May attempt failed payments.

Post-Incident Actions

Refresh all expired consents proactively:

-- PostgreSQL 16: send renewal reminders 7 days before expiry
SELECT user_id, email, consent_expires_at
FROM bank_accounts
JOIN users ON users.id = bank_accounts.user_id
WHERE consent_expires_at < NOW() + INTERVAL '7 days'
AND consent_renewal_reminder_sent = FALSE;

Document incident:

touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-aisp-failure.md

Review caching strategy:
- Is cache TTL appropriate?
- Should we cache balance longer during incidents?
- Add metrics: cache hit rate, staleness
Update monitoring:
- Add synthetic AISP test (fetch balance every 5 min)
- Alert on AISP failure rate >10%
- Track consent expiry dates
Improve user communication:
- Auto-notify users when AISP is degraded
- Show balance age: "Updated 5 minutes ago"
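
The balance age label can be derived from the timestamp stored alongside the cached balance. A minimal sketch, assuming a hypothetical helper named formatBalanceAge; the English strings stand in for the Norwegian copy the app would actually show:

```typescript
// Hypothetical helper: turns the time a balance was fetched into a short label.
function formatBalanceAge(fetchedAt: Date, now: Date = new Date()): string {
  // Clamp to zero so clock skew never produces a negative age
  const elapsedMs = Math.max(0, now.getTime() - fetchedAt.getTime());
  const minutes = Math.floor(elapsedMs / 60_000);

  if (minutes < 1) return 'Updated just now';
  if (minutes === 1) return 'Updated 1 minute ago';
  if (minutes < 60) return `Updated ${minutes} minutes ago`;

  const hours = Math.floor(minutes / 60);
  return hours === 1 ? 'Updated 1 hour ago' : `Updated ${hours} hours ago`;
}
```

Passing the current time as a parameter keeps the helper easy to test and lets the server and client agree on one reference clock.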

Escalation

Time	Action
0 min	John starts diagnosis
10 min	If Neonomics outage confirmed, notify Alem
20 min	If not resolved, enable cached balance mode
1 hour	Public communication to users (Norwegian email/push)
2 hours	Contact Neonomics support via phone if no response

Contacts

Neonomics Support: support@neonomics.io
Neonomics Slack: #neonomics-support (if available)
Internal: Alem (CEO, final decision on fallback modes)

docs/architecture/open-banking.md — AISP flow diagrams
src/app/api/accounts/balance/route.ts — Balance fetch implementation
docs/compliance/psd2-requirements.md — PSD2 consent rules (90-day expiry)
Vaultwarden item: "Neonomics API" — Credentials

Last Updated: 2026-02-22
Next Review: Before Phase 2 (Banking Integration)

Runbook: BankID Failure

Runbook: BankID Integration Failure

Service: BankID OAuth Authentication
Severity: CRITICAL (blocks all logins)
MTTR Target: <15 minutes
Owner: John (AI Director)

Symptoms

Users report they cannot log in. Symptoms include:

Diagnosis

1. Check BankID Service Status

External status signals:

# Check BankID status (no official status page, monitor Twitter)
# URLs are quoted so the shell does not treat "?" as a glob character
open "https://twitter.com/search?q=BankID%20Norge"

# Or check community forums
open "https://www.reddit.com/r/Norge/search?q=BankID"

Quick test:

# Try BankID login from another service (e.g., tax portal)
open https://www.skatteetaten.no/person/
# If BankID works there but not in Drop → problem is our integration

2. Check Drop Logs

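# CloudWatch Logs (production)
# Filter patterns are case-sensitive, so match both spellings.
# --start-time is 10 minutes ago in epoch milliseconds (works with GNU and BSD date).
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern '?bankid ?BankID' \
  --start-time $(( ($(date +%s) - 600) * 1000 )) \
  --region eu-west-1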

# Look for:
# - "BankID OAuth error: invalid_client"
# - "BankID callback failed: invalid_state"
# - "BankID API timeout"

3. Check Environment Variables

# Verify BankID credentials are set (the secret value is masked before printing)
aws apprunner describe-service \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '(.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables // {})
        | with_entries(select(.key | startswith("BANKID")))
        | if has("BANKID_CLIENT_SECRET") then .BANKID_CLIENT_SECRET = "<exists>" else . end'

# Expected:
# BANKID_CLIENT_ID: <client-id>
# BANKID_CLIENT_SECRET: <exists> (value hidden)
# BANKID_CALLBACK_URL: https://getdrop.no/api/auth/bankid/callback

4. Check OAuth Flow

Test OAuth initiation:

# Start OAuth flow
curl -X POST https://getdrop.no/api/auth/bankid/initiate \
  -H "Content-Type: application/json" \
  -d '{"redirectUrl": "/dashboard"}' \
  -v

# Expected: HTTP 302 redirect to BankID with state parameter
# If 500: Check BANKID_CLIENT_ID and BANKID_CALLBACK_URL

Test OAuth callback:

# Simulate callback (replace <code> and <state> with real values from BankID redirect)
curl -X GET "https://getdrop.no/api/auth/bankid/callback?code=<code>&state=<state>" \
  -v

# Expected: HTTP 302 redirect to /dashboard with auth cookie
# If 401: Check BANKID_CLIENT_SECRET
# If 400: Check state validation logic

Common Causes & Solutions

Cause 1: BankID Service Outage (External)

Probability: 5% (BankID is highly reliable)

Symptoms:

All BankID logins fail across all services
BankID's public channels report an incident
Social media mentions BankID outage

Solution:

Communicate: Post status update to users

Subject: Login temporarily unavailable
Body: BankID authentication is experiencing issues.
      We're monitoring the situation and will restore service
      as soon as BankID is back online. Estimated: <X> minutes.

Monitor: Watch BankID Twitter/status for updates

Fallback (if available): If demo mode exists, consider temporary activation:

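# Enable demo mode (ONLY in emergency, requires Alem approval)
# Environment variables are set through --source-configuration (ImageConfiguration);
# --instance-configuration only covers CPU, memory, and the instance role. ECR image assumed.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {"ImageIdentifier": "<image-uri>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {"NEXT_PUBLIC_SERVICE_MODE": "demo"}}}}'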

Post-incident: Document outage duration, user impact

ETA: Depends on BankID (typically <2 hours)

Cause 2: Invalid OAuth Credentials

Probability: 20% (after credential rotation or environment change)

Symptoms:

Logs show: "invalid_client" or "unauthorized_client"
OAuth flow fails immediately (no redirect to BankID)

Solution:

Verify credentials in Vaultwarden:

bw get item "BankID OAuth" --session $BW_SESSION

Update App Runner environment variables:

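# Environment variables are set through --source-configuration (ImageConfiguration);
# --instance-configuration only covers CPU, memory, and the instance role. ECR image assumed.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {"ImageIdentifier": "<image-uri>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "BANKID_CLIENT_ID": "<correct-client-id>",
      "BANKID_CLIENT_SECRET": "<correct-secret>"}}}}'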

Trigger deployment:

aws apprunner start-deployment --service-arn <ARN> --region eu-west-1

Test: Attempt login after deployment completes (3-5 minutes)

ETA: 10 minutes

Cause 3: Callback URL Mismatch

Probability: 15% (after domain change or deployment error)

Symptoms:

Logs show: "redirect_uri_mismatch"
BankID redirects to wrong URL (404 or CORS error)

Solution:

Check registered callback URL in BankID portal:
- Login to BankID integration portal
- Navigate to OAuth settings
- Verify callback URL: https://getdrop.no/api/auth/bankid/callback
If mismatch, update BankID portal:
- Change redirect URI to match current domain
- Save changes (may require approval, 1-2 hours)

Update App Runner env var:

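# Environment variables are set through --source-configuration; ECR image assumed
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {"ImageIdentifier": "<image-uri>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "BANKID_CALLBACK_URL": "https://getdrop.no/api/auth/bankid/callback"}}}}'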

Test: Login flow should work after both changes

ETA: 15 minutes (if no BankID approval required), 2 hours (if approval needed)

Cause 4: State Parameter Validation Failure

Probability: 10% (race condition or session timeout)

Symptoms:

Logs show: "Invalid state parameter"
User completes BankID flow but callback rejects

Solution:

Check session storage:
- BankID state is stored in server session
- If session expires before callback (>10 min), state is lost

Increase session timeout (if needed):

// src/lib/auth.ts
const SESSION_TIMEOUT = 15 * 60 * 1000; // 15 minutes (was 10)

Clear stale sessions:

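# If using Redis for sessions
# WARNING: FLUSHDB deletes every key in the selected database, not only stale
# sessions. All users are logged out and any other data in that database
# (for example cached balances) is lost. Keys with a TTL expire on their own.
redis-cli FLUSHDB

# If using database sessions
sqlite3 drop.db "DELETE FROM sessions WHERE expires_at < datetime('now');"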

Ask user to retry: State timeout is usually one-time issue

ETA: 5 minutes
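
When debugging state failures it helps to tell a missing session, an expired state, and a mismatched value apart, since each points to a different fix. A minimal sketch of that check, assuming a hypothetical PendingState record kept in the server session; the names createState and checkState are illustrative and not taken from src/lib/auth.ts:

```typescript
import { randomBytes, timingSafeEqual } from 'node:crypto';

// Hypothetical shape of what the server session holds while a login is in flight.
interface PendingState {
  value: string;
  createdAt: number; // epoch milliseconds
}

// Mirrors the 15-minute SESSION_TIMEOUT shown above
const STATE_TTL_MS = 15 * 60 * 1000;

type StateCheck = 'ok' | 'missing' | 'expired' | 'mismatch';

function createState(now: number = Date.now()): PendingState {
  // 32 random bytes, hex-encoded, gives an unguessable 64-character state
  return { value: randomBytes(32).toString('hex'), createdAt: now };
}

function checkState(
  stored: PendingState | undefined,
  received: string,
  now: number = Date.now(),
): StateCheck {
  if (!stored) return 'missing';
  if (now - stored.createdAt > STATE_TTL_MS) return 'expired';

  const expected = Buffer.from(stored.value);
  const actual = Buffer.from(received);
  // timingSafeEqual throws on different lengths, so compare lengths first
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) {
    return 'mismatch';
  }
  return 'ok';
}
```

Logging the returned reason makes the "Invalid state parameter" log line actionable: 'expired' suggests a timeout, 'missing' a lost session, and 'mismatch' a tampered or replayed callback.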

Cause 5: BankID API Rate Limiting

Probability: 5% (during high-traffic events)

Symptoms:

Logs show: "rate_limit_exceeded" or HTTP 429
Intermittent failures (some users succeed, others fail)

Solution:

Check rate limit headers in logs:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000

Wait for rate limit reset: Typically resets every 60 seconds

Implement exponential backoff (if not present):

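// src/lib/bankid-client.ts
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callBankIDAPI(url: string, retries = 3): Promise<Response> {
  const response = await fetch(url);
  // fetch resolves on HTTP error statuses, so check the status instead of catching
  if (response.status === 429 && retries > 0) {
    await sleep(1000 * 2 ** (3 - retries)); // 1s, 2s, 4s
    return callBankIDAPI(url, retries - 1);
  }
  return response;
}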

Contact BankID support: If rate limits are too low for production traffic

ETA: 5 minutes (automatic), 1-2 days (if support ticket needed)
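
Instead of waiting a fixed interval, the wait can be computed from the rate limit headers. A minimal sketch, assuming X-RateLimit-Reset carries an epoch timestamp in seconds as in the sample headers above; the function name and the HeaderReader type are illustrative:

```typescript
// Anything with a get(name) method works here, including a fetch Response's headers.
interface HeaderReader {
  get(name: string): string | null;
}

// Returns how many milliseconds to wait before the next call, or 0 if no wait is needed.
function msUntilRateLimitReset(headers: HeaderReader, nowMs: number = Date.now()): number {
  const remainingRaw = headers.get('X-RateLimit-Remaining');
  const resetRaw = headers.get('X-RateLimit-Reset');
  // Missing headers: nothing to base a wait on
  if (remainingRaw === null || resetRaw === null) return 0;

  const remaining = Number(remainingRaw);
  const resetSeconds = Number(resetRaw);
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSeconds)) return 0;

  // Quota still available: no need to wait
  if (remaining > 0) return 0;

  // Assumption: reset is an absolute epoch time in seconds
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

If the provider sends a relative value (seconds until reset) rather than an absolute timestamp, the last line would need to change accordingly.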

Cause 6: Network/Firewall Issues

Probability: 5% (AWS security group misconfiguration)

Symptoms:

Logs show: "Connection timeout" or "ECONNREFUSED"
BankID API requests never reach destination

Solution:

Check outbound rules (App Runner → BankID):

# App Runner egress is unrestricted by default
# Check VPC connector security group (if using VPC)
aws ec2 describe-security-groups --group-ids <vpc-connector-sg> --region eu-west-1

Test connectivity from container:

# Exec into running container (if possible)
curl -v https://oidc.bankid.no/.well-known/openid-configuration

# Expected: HTTP 200 with JSON response
# If timeout: Network/firewall issue

Check DNS resolution:

nslookup oidc.bankid.no
# Should resolve to BankID IP addresses

Whitelist BankID IPs (if using strict firewall):
- Contact BankID for IP ranges
- Add to AWS security group outbound rules

ETA: 15 minutes (if quick fix), 1 hour (if requires networking changes)

Emergency Workarounds

Option 1: Fallback to Demo Mode (Temporary)

Use case: BankID outage affects all users, estimated >1 hour downtime

Steps:

Enable demo mode:

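# Environment variables are set through --source-configuration; ECR image assumed
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {"ImageIdentifier": "<image-uri>", "ImageRepositoryType": "ECR", "ImageConfiguration": {"RuntimeEnvironmentVariables": {"NEXT_PUBLIC_SERVICE_MODE": "demo"}}}}'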

Communicate to users:

Subject: Temporary login method available
Body: Due to BankID outage, we've enabled demo login.
      Use email/password to access your account.
      BankID will be restored as soon as possible.

Monitor BankID status

Revert to BankID when available:

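# Environment variables are set through --source-configuration; ECR image assumed
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {"ImageIdentifier": "<image-uri>", "ImageRepositoryType": "ECR", "ImageConfiguration": {"RuntimeEnvironmentVariables": {"NEXT_PUBLIC_SERVICE_MODE": "live"}}}}'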

Risk: Demo mode may bypass KYC checks. Only use with Alem approval.

Option 2: Redirect to Status Page

Use case: BankID outage, no ETA, no fallback available

Steps:

Deploy maintenance page:

# Update health endpoint to return 503
# This triggers BetterStack alert + status page update

Show user-friendly message:

<h1>Login Temporarily Unavailable</h1>
<p>Our authentication provider (BankID) is experiencing issues.</p>
<p>We expect service to resume within <strong>X minutes</strong>.</p>
<p>Status updates: <a href="https://status.drop.no">status.drop.no</a></p>

Monitor and communicate updates every 30 minutes

Post-Incident Actions

Document incident:

# Create incident report
touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-bankid-failure.md

Root cause analysis:
- What triggered the failure?
- Why didn't monitoring detect it sooner?
- What prevented faster recovery?
Update monitoring:
- Add synthetic BankID login test (every 5 min)
- Alert on OAuth callback failures >5/min
Update runbook:
- Add new failure mode if discovered
- Improve diagnosis steps based on what worked
Team debrief (if >30 min outage):
- Review timeline
- Identify improvements
- Update on-call procedures

Escalation

Time	Action
0 min	John starts diagnosis
5 min	If not resolved, alert Alem via Slack + SMS
15 min	If BankID outage confirmed, enable fallback (Alem approval)
30 min	If still unresolved, schedule team call
1 hour	If major outage, public communication via email/social media

Contacts

BankID Support: support@bankid.no
BankID Phone: +47 XXXX XXXX (24/7 for critical issues)
Internal: Alem (CEO, final decision on fallback modes)

docs/architecture/authentication.md — BankID OAuth flow
src/app/api/auth/bankid/route.ts — BankID integration code
docs/dr-runbook.md — Infrastructure disaster recovery
Vaultwarden item: "BankID OAuth" — Credentials

Last Updated: 2026-02-22
Next Review: Before Phase 2 (Banking Integration)

Runbook: PISP Payment Failure

Runbook: PISP Payment Failure (Remittance & QR)

Service: Payment Initiation (PISP via Open Banking)
Severity: HIGH (blocks money transfers)
MTTR Target: <30 minutes
Owner: John (AI Director)

Overview

PISP (Payment Initiation Service Provider) enables Drop to initiate payments directly from users' bank accounts. Failures in PISP prevent both remittance (send money abroad) and QR payments (in-store merchant payments).

Symptoms

Users report they cannot complete payments:

Payment initiation fails with error message
Payment status stuck at "pending" indefinitely
Bank redirect loop (never returns to Drop)
Error: "Payment service unavailable"

User impact: Cannot send money or pay merchants.

Diagnosis

1. Identify Payment Type

Determine which payment flow is affected:

Remittance: User sends money to recipient abroad (POST /api/transactions/remittance)
QR Payment: User pays merchant by scanning QR code (POST /api/transactions/qr-payment)

Check recent transactions:

# CloudWatch Logs
# --start-time is 30 minutes ago in epoch milliseconds (works with GNU and BSD date)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "payment_initiation" \
  --start-time $(( ($(date +%s) - 1800) * 1000 )) \
  --region eu-west-1 \
  | jq '.events[].message' \
  | grep -E "remittance|qr_payment|pisp_error"

2. Check Open Banking Provider Status

Provider: Neonomics (Norway), Swan BaaS (cross-border)

Neonomics Status:

# No official status page — check via test API call
curl -X POST https://sandbox.neonomics.io/payments/v1/payment-initiation \
  -H "Authorization: Bearer <sandbox-token>" \
  -H "Content-Type: application/json" \
  -d '{"amount":100,"currency":"NOK"}' \
  -v

# Expected: HTTP 200 or 400 (validation error)
# If 500/503: Neonomics outage

Swan API Status:

# Check Swan status page
open https://status.swan.io

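# Or test API (GraphQL requests need a JSON content type)
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{viewer{id}}"}' \
  -v

# Expected: HTTP 200 (GraphQL reports query errors inside a 200 response, so also read the body)
# If 500/503: Swan outage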

3. Check Drop Logs for Error Codes

Common PISP error codes:

Code	Meaning	Cause
`INSUFFICIENT_FUNDS`	User's bank account balance too low	User error
`ACCOUNT_NOT_ACCESSIBLE`	Bank account locked or closed	Bank issue
`CONSENT_EXPIRED`	Open Banking consent needs renewal	User must re-authenticate
`PAYMENT_REJECTED`	Bank declined payment	Fraud detection, limits
`TIMEOUT`	Bank API took too long to respond	Network/bank issue
`INVALID_IBAN`	Recipient bank account number invalid	User error
`LIMIT_EXCEEDED`	Payment exceeds daily limit	User or bank limit
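
The table above can be encoded once so that logging, user messaging, and retry logic agree on how each code is treated. A minimal sketch; the category names and the rule that only TIMEOUT is retried automatically are assumptions made for illustration, not behavior confirmed by the Drop codebase:

```typescript
type PispErrorCategory = 'user' | 'bank' | 'consent' | 'transient' | 'unknown';

// Categories follow the "Cause" column of the table above.
const PISP_ERROR_CATEGORIES: Record<string, PispErrorCategory> = {
  INSUFFICIENT_FUNDS: 'user',
  ACCOUNT_NOT_ACCESSIBLE: 'bank',
  CONSENT_EXPIRED: 'consent',
  PAYMENT_REJECTED: 'bank',
  TIMEOUT: 'transient',
  INVALID_IBAN: 'user',
  LIMIT_EXCEEDED: 'bank', // "user or bank limit"; grouped with bank here
};

function categorizePispError(code: string): PispErrorCategory {
  // hasOwnProperty guards against inherited keys such as "toString"
  const known = Object.prototype.hasOwnProperty.call(PISP_ERROR_CATEGORIES, code)
    ? PISP_ERROR_CATEGORIES[code]
    : undefined;
  return known ?? 'unknown';
}

// Assumption: only transient failures are safe to retry without user action.
function isRetryable(code: string): boolean {
  return categorizePispError(code) === 'transient';
}
```

Codes that appear later in this runbook but not in the table (for example RATE_LIMIT_EXCEEDED) fall into 'unknown' until they are added to the map.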

Search logs for error codes:

# --start-time is 1 hour ago in epoch milliseconds (works with GNU and BSD date)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "PISP_ERROR" \
  --start-time $(( ($(date +%s) - 3600) * 1000 )) \
  --region eu-west-1 \
  | jq -r '.events[].message' \
  | jq '.metadata.errorCode'

4. Test Payment Flow

Manual test (staging environment):

# 1. Login
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

# 2. Initiate test payment (small amount)
curl -X POST https://drop-staging.fly.dev/api/transactions/remittance \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientId": "rec_test123",
    "amount": 100,
    "currency": "NOK",
    "sendCurrency": "NOK",
    "receiveCurrency": "EUR"
  }' \
  -v

# Expected: HTTP 200, transaction created
# If 500: PISP integration broken

Common Causes & Solutions

Cause 1: Open Banking Provider Outage

Probability: 10% (Neonomics/Swan service disruption)

Symptoms:

All payments fail with timeout or 503 error
Provider status page reports incident
Test API call fails

Solution:

Verify outage:
- Check Neonomics/Swan status pages
- Contact provider support if no public status

Communicate to users:

Subject: Payment processing temporarily unavailable
Body: Our payment provider is experiencing issues.
      We're monitoring the situation and expect service
      to resume within <X> minutes.

Monitor provider status:
- Subscribe to provider status updates
- Check every 15 minutes for resolution
Queue failed payments (if applicable):
- Store payment requests in pending_payments table
- Retry automatically when provider is back online

ETA: Depends on provider (typically <2 hours)

Cause 2: Expired Open Banking Consent

Probability: 30% (user consent expires after 90 days)

Symptoms:

Error code: CONSENT_EXPIRED or ACCOUNT_NOT_ACCESSIBLE
Payments fail for specific users only (not all)
Logs show: "Open Banking consent invalid"

Solution:

Identify affected users:

SELECT user_id, bank_account_id, consent_expires_at
FROM bank_accounts
WHERE consent_expires_at < datetime('now');

Notify users to re-authenticate:
- Send push notification: "Please reconnect your bank account"
- In-app banner: "Bank connection expired, tap to reconnect"
Guide user through re-consent flow:
- User taps "Reconnect bank account"
- Redirect to AISP consent flow (BankID + bank approval)
- Update consent_expires_at in database (90 days from now)
Retry payment after re-consent:
- Original payment request should be retryable
- Or user initiates new payment

ETA: Immediate (user action required)
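
The 90-day expiry and a reminder window can be computed up front so that users are prompted before payments start failing. A minimal sketch; the 7-day reminder window and all function names are illustrative assumptions:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const CONSENT_VALIDITY_DAYS = 90; // consent lifetime stated in this runbook
const REMINDER_WINDOW_DAYS = 7;   // assumed lead time for renewal reminders

type ConsentState = 'valid' | 'expiring_soon' | 'expired';

// Expiry is a fixed number of days after the consent was granted.
function consentExpiresAt(grantedAt: Date): Date {
  return new Date(grantedAt.getTime() + CONSENT_VALIDITY_DAYS * DAY_MS);
}

function consentState(expiresAt: Date, now: Date = new Date()): ConsentState {
  const remainingMs = expiresAt.getTime() - now.getTime();
  if (remainingMs <= 0) return 'expired';
  if (remainingMs <= REMINDER_WINDOW_DAYS * DAY_MS) return 'expiring_soon';
  return 'valid';
}
```

Checking the state before initiating a payment lets the app route the user into the re-consent flow instead of surfacing a CONSENT_EXPIRED error from the provider.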

Cause 3: Insufficient Funds in User's Bank Account

Probability: 25% (user error)

Symptoms:

Error code: INSUFFICIENT_FUNDS
Payment fails for specific transaction only
Logs show: "Account balance too low"

Solution:

Show clear error message to user:

Payment failed: Insufficient funds
Your bank account balance is too low to complete this payment.
Please add funds or choose a different payment method.

Suggest alternatives:
- Link different bank account (if multi-account supported)
- Reduce payment amount
- Try again later
No action needed on Drop side (user must resolve)

ETA: N/A (user-side issue)

Cause 4: Bank Fraud Detection / Payment Rejection

Probability: 15% (bank security systems)

Symptoms:

Error code: PAYMENT_REJECTED or SECURITY_BLOCK
Payment fails after bank redirect
Logs show: "Bank declined transaction"

Solution:

Advise user to contact their bank:

Payment failed: Your bank declined this transaction.
This may be due to fraud protection or payment limits.
Please contact your bank to authorize the payment.

Check if payment is unusual for user:
- First international transfer?
- Amount significantly higher than usual?
- High-risk destination country?
User should:
- Call their bank's fraud department
- Confirm the payment is legitimate
- Ask bank to whitelist Drop payments
- Retry after bank approval
Document pattern:
- If many users from same bank report this, investigate bank compatibility
- May need to add bank-specific messaging

ETA: Depends on user's bank (minutes to hours)

Cause 5: PISP API Rate Limiting

Probability: 5% (during high-traffic periods)

Symptoms:

Error code: RATE_LIMIT_EXCEEDED or HTTP 429
Intermittent failures (some payments succeed, others fail)
Logs show: "Too many requests"

Solution:

Check rate limit headers:

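# Find rate limit status in logs
# --start-time is 10 minutes ago in epoch milliseconds (works with GNU and BSD date)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "X-RateLimit" \
  --start-time $(( ($(date +%s) - 600) * 1000 )) \
  --region eu-west-1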

Implement request queuing:

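// src/lib/pisp-client.ts
import PQueue from 'p-queue';

// interval only limits throughput together with intervalCap.
// 5 requests per second is an assumed value; set it from the provider quota.
const queue = new PQueue({ concurrency: 5, interval: 1000, intervalCap: 5 });

async function initiatePayment(params) {
  return queue.add(() => pisService.createPayment(params));
}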

Exponential backoff on retry:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryPayment(params, attempt = 1) {
  try {
    return await initiatePayment(params);
  } catch (error) {
    // Retry only on rate limiting, and give up after the third retry
    if (error.status !== 429 || attempt > 3) throw error;
    await sleep(1000 * Math.pow(2, attempt)); // 2s, 4s, 8s
    return retryPayment(params, attempt + 1);
  }
}

Contact provider to increase limits (if persistent):
- Email Neonomics support with usage stats
- Request higher API quota for production

ETA: 5 minutes (automatic retry), 1-2 days (if quota increase needed)

Cause 6: Invalid Recipient Bank Account (IBAN/SWIFT)

Probability: 20% (user input error)

Symptoms:

Error code: INVALID_IBAN or ACCOUNT_NOT_FOUND
Payment fails immediately (no bank redirect)
Logs show: "Recipient account validation failed"

Solution:

Show clear validation error:

Payment failed: Invalid recipient bank account
The IBAN you entered is not valid. Please check and try again.
IBAN: DE89 3704 0044 0532 0130 00 (example format)

Improve frontend validation:
- Add real-time IBAN validation (checksum algorithm)
- Use IBAN validation library (e.g., ibantools)
- Show format hints per country
Ask user to verify recipient details:
- Double-check IBAN/SWIFT code
- Confirm with recipient
- Try alternative payment method if IBAN is correct but still rejected
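
The checksum mentioned above is the ISO 13616 mod-97 check. A minimal sketch of it follows, for illustration only; a maintained library such as ibantools also checks country-specific lengths and formats, which this function does not:

```typescript
// Validates the IBAN structure and mod-97 checksum. Does not check
// country-specific lengths or whether the account actually exists.
function isValidIban(input: string): boolean {
  const iban = input.replace(/\s+/g, '').toUpperCase();
  // Two-letter country code, two check digits, then 11 to 30 alphanumerics
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;

  // Move the country code and check digits to the end
  const rearranged = iban.slice(4) + iban.slice(0, 4);

  // Compute the remainder digit by digit to avoid overflowing a number
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    // Digits keep their value; letters map to 10..35 (A = 10, Z = 35)
    const value = code >= 65 ? code - 55 : code - 48;
    // A letter contributes two decimal digits, so shift by 100 instead of 10
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

Running this check in the form before submission catches most typing errors immediately, without a round trip to the payment provider.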

ETA: Immediate (user correction)

Emergency Workarounds

Option 1: Manual Payment Processing

Use case: PISP provider down >2 hours, urgent payments needed

Steps:

Collect payment requests manually:

SELECT id, user_id, amount, currency, recipient_iban
FROM transactions
WHERE status = 'pending' AND created_at > datetime('now', '-2 hours');

Alem initiates payments manually via Drop's business bank account:
- Log into business banking portal
- Enter recipient details manually
- Process payment one by one

Update Drop transaction status:

UPDATE transactions SET status = 'completed', completed_at = datetime('now')
WHERE id = '<transaction-id>';

Notify users:

Subject: Your payment has been processed
Body: Your payment of <amount> to <recipient> has been completed manually
      due to a temporary service issue. Thank you for your patience.

Risk: Manual work, prone to errors. Only use for critical/urgent payments.

Option 2: Redirect to Alternative Payment Method

Use case: PISP down, no ETA, users need alternative

Steps:

Show modal in app:

Payment Initiation Unavailable
Our payment service is temporarily down.
Alternative options:
- Bank transfer (manual IBAN entry)
- Try again later (we'll notify you when service is restored)

Provide manual bank transfer instructions:

Transfer to:
Account holder: Drop AS
IBAN: NO93 8601 1117 947
Amount: <calculated-amount>
Reference: <unique-ref>

Monitor for manual transfers:
- Check business bank account for incoming payments
- Match reference code to pending Drop transactions
- Mark as completed when received

ETA: Immediate (user can pay via manual transfer)

Monitoring & Alerts

Metrics to Track

Payment success rate: Should be >95%
Payment latency: p50 <5s, p95 <15s, p99 <30s
Error rate by code: Track INSUFFICIENT_FUNDS, CONSENT_EXPIRED, TIMEOUT separately

Alert Rules

// src/lib/payment-monitor.ts
export async function trackPaymentFailure(errorCode: string, transactionId: string) {
  const failureRate = await calculateFailureRate('last_5_minutes');

  if (failureRate > 0.1) { // 10% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'High payment failure rate',
      message: `${(failureRate * 100).toFixed(1)}% of payments failing in last 5 min`,
    });
  }
}
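
calculateFailureRate is not shown in this runbook. One possible shape is sketched below, assuming payment outcomes are available as timestamped records; the PaymentOutcome type and the minimum sample size are illustrative assumptions:

```typescript
// Hypothetical record of one finished payment attempt.
interface PaymentOutcome {
  at: number;  // epoch milliseconds
  ok: boolean; // true if the payment completed
}

// Share of failed payments among those inside the window ending at `now`.
function failureRateInWindow(outcomes: PaymentOutcome[], windowMs: number, now: number): number {
  const recent = outcomes.filter((o) => o.at > now - windowMs && o.at <= now);
  if (recent.length === 0) return 0; // no traffic is not treated as a failure
  const failed = recent.filter((o) => !o.ok).length;
  return failed / recent.length;
}

// Requiring a minimum sample avoids paging on one failure out of two payments.
function shouldAlert(outcomes: PaymentOutcome[], windowMs: number, now: number, minSamples = 20): boolean {
  const sampleSize = outcomes.filter((o) => o.at > now - windowMs && o.at <= now).length;
  return sampleSize >= minSamples && failureRateInWindow(outcomes, windowMs, now) > 0.1;
}
```

The 10% threshold matches the alert rule above; the minimum sample size is a tuning choice that depends on payment volume.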

Dashboard Queries

-- Payment success rate (last 24h)
-- NULLIF avoids a division by zero when there were no payments
SELECT
  COUNT(*) FILTER (WHERE status = 'completed') * 100.0 / NULLIF(COUNT(*), 0) as success_rate,
  COUNT(*) as total_payments
FROM transactions
WHERE created_at > datetime('now', '-24 hours');

-- Top error codes (last hour)
SELECT error_code, COUNT(*) as count
FROM transactions
WHERE status = 'failed' AND created_at > datetime('now', '-1 hour')
GROUP BY error_code
ORDER BY count DESC;

Post-Incident Actions

Update transaction status:

-- Mark timed-out payments as failed (after 1 hour)
UPDATE transactions
SET status = 'failed', error_code = 'TIMEOUT', error_message = 'Payment timed out'
WHERE status = 'pending' AND created_at < datetime('now', '-1 hour');

Notify affected users:
- Send email/push notification about failed payment
- Offer to retry or refund
Document incident:
- Create post-mortem in comms/incidents/
- Track downtime duration
- Calculate financial impact (lost transactions)
Review provider SLA:
- Check if outage violates SLA
- Request compensation/credits if applicable
Improve resilience:
- Add payment retry queue
- Implement circuit breaker for provider API
- Consider multi-provider failover (backup PISP)
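
A circuit breaker stops calling a failing provider for a cooldown period, then lets a single trial call through. A minimal sketch under stated assumptions: the class name, threshold, and cooldown are illustrative, and nothing here is taken from the existing Drop code:

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly clock: () => number;

  constructor(failureThreshold: number, cooldownMs: number, clock: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.clock = clock;
  }

  state(): BreakerState {
    if (this.openedAt === null) return 'closed';
    // After the cooldown, allow a trial call through
    return this.clock() - this.openedAt >= this.cooldownMs ? 'half_open' : 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial call, or too many consecutive failures, (re)opens the circuit
    if (this.openedAt !== null || this.failures >= this.failureThreshold) {
      this.openedAt = this.clock();
    }
  }

  // Wraps a provider call: fails fast while open, records the outcome otherwise.
  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state() === 'open') throw new Error('Circuit open: provider calls are paused');
    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
}
```

Usage would look like `breaker.call(() => pisService.createPayment(params))`, with one breaker instance per provider so that a Neonomics outage does not pause Swan calls.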

Escalation

Time	Action
0 min	John starts diagnosis
10 min	If provider outage confirmed, notify Alem
30 min	If not resolved, assess manual processing need
1 hour	If critical payments pending, start manual workaround (Alem approval)
2 hours	Public communication to all users

Contacts

Neonomics Support: support@neonomics.io, Slack: #neonomics-support
Swan Support: support@swan.io (email), Swan Slack (if available)
Internal: Alem (CEO, manual payment approval)

docs/architecture/payments.md — PISP flow diagrams
src/app/api/transactions/remittance/route.ts — Remittance implementation
src/app/api/transactions/qr-payment/route.ts — QR payment implementation
docs/compliance/psd2-requirements.md — Regulatory requirements

Last Updated: 2026-02-22
Next Review: Before Phase 2 (Banking Integration)
Test Status: Pending (Phase 2 live payments)

Runbook: Sumsub KYC Failure

Runbook: Sumsub KYC/AML Verification Failure

Service: Sumsub Identity Verification (KYC/AML)
Severity: HIGH (blocks new user registrations)
MTTR Target: <30 minutes
Owner: John (AI Director)

Overview

Sumsub provides automated identity verification (KYC - Know Your Customer) and AML (Anti-Money Laundering) checks for Drop. Required for regulatory compliance before users can make payments.

KYC Process:

User uploads ID document (passport, driver's license, national ID)
User takes selfie (liveness check)
Sumsub verifies document authenticity
Sumsub performs AML sanctions screening
Result: APPROVED, REJECTED, or MANUAL_REVIEW

Impact: If Sumsub fails, new users cannot complete registration. Existing users are unaffected.

Symptoms

Users report they cannot complete identity verification:

ID upload fails with error
Verification stuck at "Processing..." indefinitely
Error message: "Verification service unavailable"
Webhook never receives result from Sumsub
User status stuck at "pending_kyc"

User impact: Cannot complete registration, cannot make payments.

Diagnosis

1. Check Sumsub Service Status

External status:

# Sumsub does not have a public status page
# Test via API health check
curl https://api.sumsub.com/resources/healthcheck \
  -H "X-App-Token: <app-token>" \
  -v

# Expected: HTTP 200
# If 500/503: Sumsub outage

2. Check Drop Logs

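# CloudWatch Logs (production)
# Filter patterns are case-sensitive, so match both spellings.
# --start-time is 30 minutes ago in epoch milliseconds (works with GNU and BSD date).
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern '?sumsub ?Sumsub' \
  --start-time $(( ($(date +%s) - 1800) * 1000 )) \
  --region eu-west-1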

# Look for:
# - "Sumsub API timeout"
# - "Sumsub webhook failed"
# - "KYC verification failed: document_expired"
# - "AML sanctions match: [name]"

3. Check Sumsub Dashboard

# Login to Sumsub Dashboard
open https://cockpit.sumsub.com

# Check:
# - Recent applicants (last 1 hour)
# - Failed verifications
# - Manual review queue length
# - Webhook delivery status

4. Check Webhook Delivery

Verify webhook endpoint is reachable:

# Sumsub sends webhooks to: https://getdrop.no/api/webhooks/sumsub
# Test endpoint manually
curl -X POST https://getdrop.no/api/webhooks/sumsub \
  -H "Content-Type: application/json" \
  -H "X-Sumsub-Signature: test" \
  -d '{"type":"applicantReviewed","reviewResult":{"reviewAnswer":"GREEN"}}' \
  -v

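# Expected: HTTP 200, or HTTP 401 if the endpoint enforces signature validation
#           ("test" is not a valid signature, so a 401 still shows the endpoint is deployed and reachable)
# If 404: Webhook endpoint not deployed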
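
For reference when debugging signature failures, webhook validation usually means recomputing an HMAC over the raw request body and comparing it with the received value. A minimal sketch; the use of HMAC-SHA256 with hex encoding and the function name are assumptions, so confirm the algorithm and header name against the Sumsub webhook settings:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumption: the signature is a hex-encoded HMAC-SHA256 of the raw body.
function isValidWebhookSignature(rawBody: string, receivedHex: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  // Invalid hex decodes to a shorter buffer, which fails the length check below
  const received = Buffer.from(receivedHex, 'hex');
  // timingSafeEqual throws on different lengths, so compare lengths first
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The check must run on the raw body exactly as received. Parsing the JSON and re-serializing it before hashing is a common cause of signature mismatches.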

5. Test KYC Flow

Manual test (staging):

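# 1. Create test applicant
# Sumsub expects levelName as a query parameter and signed requests
# (X-App-Access-Ts and X-App-Access-Sig alongside X-App-Token)
curl -X POST "https://api.sumsub.com/resources/applicants?levelName=basic-kyc-level" \
  -H "X-App-Token: <sandbox-app-token>" \
  -H "X-App-Access-Ts: <unix-timestamp-seconds>" \
  -H "X-App-Access-Sig: <request-signature>" \
  -H "Content-Type: application/json" \
  -d '{
    "externalUserId": "test-user-123",
    "email": "test@example.com"
  }' \
  -v

# Expected: HTTP 201, applicant created
# If 400: Invalid request
# If 401: Missing or invalid signature headers
# If 500: Sumsub API issue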
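
Authenticated Sumsub API calls are, as far as this sketch assumes, signed with an HMAC-SHA256 over the timestamp, HTTP method, path with query string, and body. The helper below is illustrative only; verify the exact scheme and header names against the Sumsub API documentation before relying on it:

```typescript
import { createHmac } from 'node:crypto';

interface SumsubAuthHeaders {
  'X-App-Access-Ts': string;
  'X-App-Access-Sig': string;
}

// Assumed scheme: hex HMAC-SHA256 of timestamp + METHOD + path (with query) + body.
function signSumsubRequest(
  secretKey: string,
  method: string,
  pathWithQuery: string,
  body: string,
  timestampSeconds: number = Math.floor(Date.now() / 1000),
): SumsubAuthHeaders {
  const payload = `${timestampSeconds}${method.toUpperCase()}${pathWithQuery}${body}`;
  const signature = createHmac('sha256', secretKey).update(payload).digest('hex');
  return {
    'X-App-Access-Ts': String(timestampSeconds),
    'X-App-Access-Sig': signature,
  };
}
```

The two returned headers are sent together with X-App-Token. The body passed to the helper must be byte-for-byte identical to the body sent in the request.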

Common Causes & Solutions

Cause 1: Sumsub API Outage (External)

Probability: 5% (Sumsub service disruption)

Symptoms:

All KYC verifications fail
Sumsub API health check returns 503
Dashboard shows no recent applicants
Logs show API timeouts

Solution:

Verify outage:

# Test Sumsub API from different networks
curl https://api.sumsub.com/resources/healthcheck \
  -H "X-App-Token: <app-token>" \
  -v

# If consistent failure: confirmed outage

Contact Sumsub support:
- Email: support@sumsub.com
- Live chat: https://cockpit.sumsub.com (bottom-right)
- Phone: Check Sumsub Dashboard for support number

Communicate to users (Norwegian):

Emne: Identitetsverifisering midlertidig utilgjengelig

Hei,

Vi opplever for øyeblikket tekniske problemer med identitetsverifisering.
Du kan fortsette registreringen senere.

Vi forventer at tjenesten er tilbake innen [X minutter/timer].

Mvh,
Drop

Queue pending verifications:

-- Mark users as pending KYC retry
UPDATE users
SET kyc_status = 'pending_retry',
    kyc_retry_at = datetime('now', '+1 hour')
WHERE kyc_status = 'pending_kyc'
AND created_at > datetime('now', '-2 hours');

Retry when Sumsub is back:

# Cron job to retry pending KYC
node ~/ALAI/products/Drop/scripts/retry-kyc.js

ETA: Depends on Sumsub (typically <2 hours)

Cause 2: Document Verification Failure (User Error)

Probability: 40% (user uploads poor quality or invalid document)

Symptoms:

Specific users fail KYC (not all users)
Logs show: "document_not_readable", "document_expired", "document_type_mismatch"
Sumsub dashboard shows rejection reason

Common rejection reasons:

Blurry photo (document not readable)
Expired document (passport/ID expired)
Wrong document type (e.g., bank statement instead of ID)
Photo cropped (missing corners/edges)
Underage (user < 18 years old)

Solution:

Identify rejection reason:

SELECT user_id, kyc_rejection_reason, kyc_rejected_at
FROM users
WHERE kyc_status = 'rejected'
ORDER BY kyc_rejected_at DESC
LIMIT 10;

Show clear error to user (Norwegian):

Blurry document:

Dokumentet er ikke leselig
Ta et nytt bilde i godt lys.
Sørg for at all tekst er skarp og leselig.

Expired document:

Dokumentet er utløpt
Vennligst last opp et gyldig pass eller førerkort.
Dokumentet må være gyldig i minst 1 måned.

Wrong document type:

Feil dokumenttype
Vi godtar kun: Pass, Nasjonalt ID-kort, Førerkort.
Bankkort og regninger godtas ikke.

Underage:

Du må være 18 år eller eldre
Drop er kun tilgjengelig for brukere over 18 år.

Allow user to retry:
- Show "Try Again" button in app
- Provide tips for better photo quality
- Link to FAQ: "How to take a good ID photo"

Track retry success rate:

-- How many users succeed on 2nd attempt?
SELECT
  COUNT(*) FILTER (WHERE kyc_attempt = 1 AND kyc_status = 'approved') as first_attempt_success,
  COUNT(*) FILTER (WHERE kyc_attempt = 2 AND kyc_status = 'approved') as second_attempt_success,
  COUNT(*) FILTER (WHERE kyc_attempt >= 3) as multiple_retries
FROM users;

ETA: Immediate (user must retry with better document)

Cause 3: AML Sanctions Match (Compliance Issue)

Probability: 3% (user flagged by sanctions screening)

Symptoms:

Specific user's KYC fails with: "AML_SANCTIONS_MATCH"
Sumsub dashboard shows "Red flag" or "Manual review required"
User name matches sanctions list (OFAC, EU, UN, etc.)

Solution:

Identify flagged users:

SELECT user_id, email, full_name, kyc_rejection_reason
FROM users
WHERE kyc_rejection_reason LIKE '%sanctions%'
OR kyc_status = 'manual_review_aml';

Review Sumsub dashboard:
- Login: https://cockpit.sumsub.com
- Navigate to applicant
- Check AML screening results
- Review sanctions list match details

False positive (common names):
- Example: "Ali Hassan" may match many sanctioned individuals
- Sumsub shows match details (date of birth, nationality)
- If clearly different person: manually approve in Sumsub

True positive (actual sanctions match):
- DO NOT approve. This is a legal/regulatory issue.
- Reject user registration immediately
- Document incident for compliance records

Notify user (if false positive, manually approved):

Din identitetsverifisering er godkjent
Takk for tålmodigheten. Du kan nå bruke Drop.

Notify user (if true positive, rejected):

Vi kan dessverre ikke godkjenne din registrering
På grunn av regulatoriske krav kan vi ikke tilby tjenester til deg.
Ta kontakt med support@getdrop.no hvis du mener dette er en feil.

Escalate to Alem if uncertain:
- AML compliance is critical
- False rejection = bad UX, but false approval = legal risk
- Alem makes final call on borderline cases

ETA: 10 minutes (false positive), N/A (true positive - reject)

Cause 4: Webhook Delivery Failure

Probability: 15% (Drop webhook endpoint down or unreachable)

Symptoms:

Sumsub completes verification, but Drop never updates user status
Logs show: "Webhook not received"
Sumsub dashboard shows "Webhook delivery failed"
User stuck at "pending_kyc" despite Sumsub showing "approved"

Solution:

Check webhook endpoint health:

# Test webhook endpoint
curl -X POST https://getdrop.no/api/webhooks/sumsub \
  -H "Content-Type: application/json" \
  -d '{"type":"ping"}' \
  -v

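# Expected: HTTP 200
# If 401/403: route is reachable but rejected the unsigned test payload (likely when HMAC signature validation is enforced)
# If 404/500: Drop webhook endpoint broken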

Check Sumsub webhook delivery logs:
- Login: https://cockpit.sumsub.com
- Navigate to Settings → Webhooks
- Check recent delivery attempts
- Look for: 404, 500, timeout errors

Manually retry failed webhooks:
- Sumsub Dashboard → Applicant → "Resend Webhook"
- This triggers new webhook delivery to Drop
- Verify Drop receives and processes it

Fetch verification results via API (if webhook lost):

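# Manually fetch applicant status from Sumsub
# Sumsub API requests must be signed: send X-App-Access-Ts (Unix seconds) and
# X-App-Access-Sig (HMAC-SHA256 over ts + method + path + body) alongside the app token
curl -X GET https://api.sumsub.com/resources/applicants/<applicant-id>/status \
  -H "X-App-Token: <app-token>" \
  -H "X-App-Access-Ts: <unix-timestamp>" \
  -H "X-App-Access-Sig: <hmac-signature>" \
  -v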

# Parse result and update Drop database
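
For the manual fetch above, the request signature has to be computed before calling Sumsub. A minimal sketch, assuming Sumsub's documented scheme (HMAC-SHA256 over timestamp + HTTP method + path + body, hex-encoded, with the Secret Key from Vaultwarden); `signSumsubRequest` is a hypothetical helper name, and the scheme should be checked against the current Sumsub API documentation before use:

```typescript
import { createHmac } from 'node:crypto';

// Builds the two signing headers for one Sumsub API request.
// Assumption: signature = HMAC-SHA256(secretKey, ts + METHOD + path + body) as hex,
// where ts is the Unix time in seconds.
function signSumsubRequest(
  secretKey: string,
  method: string,
  pathWithQuery: string,
  unixSeconds: number,
  body: string = '',
): Record<string, string> {
  const ts = String(unixSeconds);
  const signature = createHmac('sha256', secretKey)
    .update(ts + method.toUpperCase() + pathWithQuery + body)
    .digest('hex');
  return {
    'X-App-Access-Ts': ts,
    'X-App-Access-Sig': signature,
  };
}

// Example: headers for the status lookup (placeholder applicant id and secret).
const exampleHeaders = signSumsubRequest(
  'placeholder-secret',
  'GET',
  '/resources/applicants/placeholder-id/status',
  Math.floor(Date.now() / 1000),
);
console.log(Object.keys(exampleHeaders));
```

The headers are sent together with `X-App-Token`; the timestamp must be fresh, so the signature cannot be precomputed and stored.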

Update Drop database manually:

UPDATE users
SET kyc_status = 'approved',
    kyc_approved_at = datetime('now')
WHERE sumsub_applicant_id = '<applicant-id>';

Fix webhook endpoint (if broken):
- Check App Runner deployment status
- Verify webhook route exists: src/app/api/webhooks/sumsub/route.ts
- Check signature validation (Sumsub signs webhooks with HMAC)
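
When checking the signature validation step, it helps to know what a correct check looks like. A minimal sketch, assuming the webhook carries a hex HMAC-SHA256 digest of the raw request body (Sumsub sends the digest in a header such as `x-payload-digest`; the exact header name and algorithm are assumptions here and depend on the webhook settings in the Sumsub dashboard). `isValidWebhookSignature` is a hypothetical helper name:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verifies a hex-encoded HMAC-SHA256 digest of the raw webhook body.
// The raw body must be used exactly as received (before JSON parsing).
function isValidWebhookSignature(
  rawBody: string,
  receivedDigestHex: string,
  secretKey: string,
): boolean {
  const expected = createHmac('sha256', secretKey).update(rawBody).digest();
  const received = Buffer.from(receivedDigestHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) {
    return false;
  }
  return timingSafeEqual(received, expected);
}
```

A common cause of failed validation is hashing a re-serialized JSON object instead of the raw body, or using the sandbox secret in production.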

ETA: 10 minutes (manual retry), 30 minutes (if endpoint fix needed)

Cause 5: Invalid or Expired API Credentials

Probability: 5% (after credential rotation)

Symptoms:

Logs show: "401 Unauthorized" or "403 Forbidden"
All Sumsub API calls fail
Webhook signature validation fails

Solution:

Verify Sumsub API credentials:

bw get item "Sumsub API" --session $BW_SESSION

# Check:
# - App Token is correct
# - Secret Key is correct (for webhook signature)
# - Environment: production vs sandbox

Regenerate API credentials (if needed):
- Login: https://cockpit.sumsub.com
- Navigate to Settings → API
- Generate new App Token + Secret Key
- Copy to Vaultwarden

Update App Runner environment variables:

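# App Runner keeps environment variables in the source configuration (not --instance-configuration).
# RuntimeEnvironmentVariables replaces the whole map, so include all existing variables as well.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "SUMSUB_APP_TOKEN": "<new-app-token>",
      "SUMSUB_SECRET_KEY": "<new-secret-key>",
      "SUMSUB_ENVIRONMENT": "production"
    }}}}'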

Trigger deployment:

aws apprunner start-deployment --service-arn <ARN> --region eu-west-1

Test after deployment:

# Try creating test applicant
curl -X POST https://getdrop.no/api/kyc/initiate \
  -H "Authorization: Bearer <test-user-token>" \
  -v

# Expected: HTTP 200, Sumsub applicant created

ETA: 10 minutes

Cause 6: Liveness Check Failure (Selfie)

Probability: 20% (user fails selfie/liveness verification)

Symptoms:

Specific users fail at selfie stage
Logs show: "liveness_check_failed", "face_mismatch"
Sumsub dashboard shows "Selfie does not match ID photo"

Common reasons:

Poor lighting (too dark, too bright)
User wears sunglasses/hat
Multiple people in frame
Photo of a photo (not live person)
Face does not match ID document

Solution:

Show clear instructions before selfie (Norwegian):

Slik tar du et godt selfie-bilde:
✓ God belysning (dagslys er best)
✓ Fjern briller/solbriller
✓ Se rett i kameraet
✓ Kun ditt ansikt i bildet
✗ Ikke bruk foto av foto

Allow retry with better instructions:

Selfie-verifisering mislyktes
Prøv igjen med bedre belysning.
Sørg for at ansiktet ditt er tydelig synlig.

Improve liveness detection settings (if too strict):
- Login: https://cockpit.sumsub.com
- Navigate to Settings → Verification Levels
- Adjust liveness sensitivity (low/medium/high)
- Balance: security vs user friction

Manual review (if automated fails repeatedly):
- Some users may need manual review
- Sumsub team reviews video/photos manually
- ETA: 1-24 hours depending on Sumsub queue

ETA: Immediate (user retry), 1-24 hours (manual review)

Emergency Workarounds

Option 1: Manual KYC Review (Temporary)

Use case: Sumsub down >1 hour, urgent user needs verification

Steps:

Collect KYC documents manually:
- Ask user to email ID photo + selfie to support@getdrop.no
- Subject: "KYC Manual Review - [User ID]"

Alem or John reviews manually:
- Verify ID document authenticity (check security features)
- Compare selfie to ID photo
- Check ID expiry date
- Verify age >= 18

Manual AML check:
- Search user name on: https://sanctionssearch.ofac.treas.gov
- Check EU sanctions list: https://eeas.europa.eu/topics/sanctions-policy
- Document findings

Approve in database (if passes checks):

UPDATE users
SET kyc_status = 'approved_manual',
    kyc_approved_at = datetime('now'),
    kyc_approved_by = '<reviewer>',  -- whoever performed the review: 'alem' or 'john'
    kyc_notes = 'Manual review during Sumsub outage'
WHERE user_id = '<user-id>';

Notify user:

Din identitet er verifisert
Velkommen til Drop! Du kan nå gjøre betalinger.

Risk: Manual review is slow, error-prone, not scalable. Only for critical cases.

Option 2: Delay Registration, Notify When Ready

Use case: Sumsub down, no ETA, non-urgent registrations

Steps:

Show maintenance message:

Identitetsverifisering midlertidig utilgjengelig
Vi jobber med å løse problemet.
Du vil motta en e-post når du kan fortsette registreringen.

Collect user email:

// src/app/api/auth/register/route.ts
if (sumsubUnavailable) {
  await db.insert('pending_registrations', {
    email: userEmail,
    status: 'waiting_kyc',
    created_at: new Date(),
  });

  // Route handlers must return a Response, not a plain object
  return Response.json({
    success: true,
    message: 'We will notify you when registration is available',
  });
}

When Sumsub is back, notify users:

SELECT email FROM pending_registrations WHERE status = 'waiting_kyc';

Email (Norwegian):

Emne: Du kan nå fullføre registreringen i Drop

Hei,

Identitetsverifisering er tilbake.
Klikk her for å fortsette registreringen: [Link]

Mvh,
Drop

ETA: Delayed registration (hours to days)

Monitoring & Alerts

Metrics to Track

KYC success rate: Should be >85% (accounting for user errors)
KYC processing time: p50 <5min, p95 <30min, p99 <2h (includes manual review)
Rejection reasons: Track document_not_readable, expired, underage, sanctions separately

Alert Rules

// src/lib/kyc-monitor.ts
export async function trackKYCFailure(userId: string, reason: string) {
  const failureRate = await calculateKYCFailureRate('last_hour');

  if (failureRate > 0.3) { // 30% failure rate
    await sendAlert({
      severity: 'high',
      title: 'KYC failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of KYC attempts failing`,
      reason,
    });
  }
}
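
`calculateKYCFailureRate` is referenced above but not shown. One way it could work is a sliding window over recent attempts. A minimal in-memory sketch (the class name and the injected `now` parameter are assumptions for illustration; a deployment with several instances would need a shared store instead):

```typescript
interface AttemptEvent {
  at: number; // epoch milliseconds
  ok: boolean;
}

class FailureRateWindow {
  private readonly windowMs: number;
  private events: AttemptEvent[] = [];

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record one attempt; `now` is injectable to keep the class testable.
  record(ok: boolean, now: number = Date.now()): void {
    this.events.push({ at: now, ok });
    this.prune(now);
  }

  // Fraction of failed attempts inside the window (0 when there are none).
  rate(now: number = Date.now()): number {
    this.prune(now);
    if (this.events.length === 0) {
      return 0;
    }
    const failures = this.events.filter((e) => !e.ok).length;
    return failures / this.events.length;
  }

  // Drop events older than the window; events are stored in arrival order.
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    for (;;) {
      const first = this.events[0];
      if (first === undefined || first.at >= cutoff) {
        break;
      }
      this.events.shift();
    }
  }
}
```

With very few attempts a single failure pushes the rate to 100%, so pairing the 30% threshold with a minimum sample count may be worth considering.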

Post-Incident Actions

Retry failed KYC verifications:

UPDATE users
SET kyc_status = 'pending_retry',
    kyc_retry_at = datetime('now')
WHERE kyc_status IN ('failed', 'pending_kyc')
AND created_at > datetime('now', '-24 hours');

Document incident:

touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-sumsub-kyc-failure.md

Review rejection reasons:
- High document_not_readable rate? Improve photo instructions
- High liveness_check_failed rate? Adjust Sumsub settings
- Track improvements in next month's KYC metrics

Update user onboarding:
- Add better photo guides
- Show example of good vs bad ID photos
- Pre-flight check: "Is your ID expired?"

Escalation

Time	Action
0 min	John starts diagnosis
15 min	If Sumsub outage confirmed, notify Alem
30 min	If urgent user needs KYC, consider manual review (Alem approval)
1 hour	Public communication to users
2 hours	Contact Sumsub support via phone if no response

Contacts

Sumsub Support: support@sumsub.com
Sumsub Live Chat: https://cockpit.sumsub.com (bottom-right)
Sumsub Phone: Check Sumsub Dashboard for support number
Internal: Alem (CEO, manual KYC approval authority)

docs/architecture/kyc-aml.md — KYC/AML flow diagrams
src/app/api/kyc/initiate/route.ts — Sumsub integration code
docs/compliance/kyc-requirements.md — Regulatory requirements (age, ID types)
Vaultwarden item: "Sumsub API" — Credentials

Last Updated: 2026-02-22
Next Review: Before Phase 2 (Banking Integration)

Runbooks

Runbook: Swan API Outage

Runbook: Swan BaaS API Outage

Service: Swan Banking-as-a-Service
Severity: CRITICAL (blocks accounts, cards, payments if Swan is primary provider)
MTTR Target: <15 minutes
Owner: John (AI Director)

Overview

Swan provides core banking infrastructure for Drop. Depending on Drop's architecture phase, Swan may handle:

Account creation (virtual IBAN accounts for users)
Card issuance (virtual/physical debit cards)
Payment processing (domestic/international transfers)
Balance management (wallet balances, not Open Banking)

Impact: If Swan is the primary BaaS provider, an outage affects ALL core banking operations.

Symptoms

Users report critical failures:

Cannot create new account
Cannot view wallet balance (if using Swan wallets)
Card payments fail or decline
Error: "Banking service unavailable"
Dashboard shows "System error" for account-related features

User impact: Complete inability to use banking features (depending on Drop's reliance on Swan).

Diagnosis

1. Check Swan Status Page

External status:

# Swan official status page
open https://status.swan.io

# Check for:
# - Incident reported
# - Degraded performance
# - Scheduled maintenance

2. Test Swan API

Health check:

# GraphQL health query
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{viewer{id}}"}' \
  -v

# Expected: HTTP 200, {"data": {"viewer": {"id": "..."}}}
# If 500/503: Swan API down
# If 401: Credential issue
# If timeout: Network or Swan connectivity issue
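
The status-to-diagnosis mapping in the comments above can be captured in a small function, for example inside an automated health check. A minimal sketch; the type and function names are hypothetical, and the 403 and 429 branches are taken from Causes 2 and 3 of this runbook rather than from the health-check comments:

```typescript
type SwanDiagnosis =
  | 'healthy'
  | 'credential_issue'
  | 'rate_limited'
  | 'swan_down'
  | 'network_issue'
  | 'unexpected';

// Outcome of one probe: either an HTTP status or a timeout.
type ProbeResult = { kind: 'response'; status: number } | { kind: 'timeout' };

function classifySwanProbe(result: ProbeResult): SwanDiagnosis {
  if (result.kind === 'timeout') {
    return 'network_issue'; // network or Swan connectivity issue
  }
  const { status } = result;
  if (status === 200) return 'healthy';
  if (status === 401 || status === 403) return 'credential_issue';
  if (status === 429) return 'rate_limited';
  if (status >= 500) return 'swan_down';
  return 'unexpected';
}
```

A GraphQL API can return HTTP 200 with an `errors` array in the body, so a `healthy` result here only says the endpoint answered; the response body still needs a look.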

Test account creation:

# Attempt to create test account
curl https://api.swan.io/graphql \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "mutation { createAccount(input: {name: \"Test Account\"}) { id } }"
  }' \
  -v

# Expected: HTTP 200 with account ID
# If error: Check response for Swan error codes

3. Check Drop Logs

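# CloudWatch Logs (production)
# Filter patterns are case-sensitive: "?swan ?Swan" matches either spelling
# Note: `date -d` is GNU syntax; on macOS (BSD date) use $(date -u -v-15M +%s)000
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "?swan ?Swan" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1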

# Look for:
# - "Swan API timeout"
# - "Swan GraphQL error: INTERNAL_SERVER_ERROR"
# - "Swan 503 Service Unavailable"
# - "Swan rate limit exceeded"

4. Check Swan API Credentials

# Verify Swan API key is valid
bw get item "Swan API" --session $BW_SESSION

# Check App Runner environment variables
aws apprunner describe-service \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep SWAN

# Expected:
# SWAN_API_KEY: <exists>
# SWAN_ENVIRONMENT: production (or sandbox)
# SWAN_PARTNER_ID: <partner-id>

5. Check Recent Swan API Changes

Review Swan changelog:

# Swan may deprecate API endpoints or change schemas
# Check Swan developer portal for breaking changes
open https://docs.swan.io/changelog

# Review recent GraphQL schema changes
# Verify Drop uses supported API versions

Common Causes & Solutions

Cause 1: Swan Service Outage (External)

Probability: 5% (Swan is highly reliable, but incidents happen)

Symptoms:

Swan status page reports incident
All Swan API calls fail with 500/503
No error in Drop code/config
Social media mentions Swan issues

Solution:

Verify outage scope:
- Check Swan status page
- Test API from different networks (rule out local network issue)
- Contact Swan support for ETA

Communicate to users (Norwegian):

Emne: Bankfunksjoner midlertidig utilgjengelig

Hei,

Vår bankinfrastruktur-leverandør (Swan) opplever tekniske problemer.
Dette påvirker:
- Kontoopprettelse
- Korttransaksjoner
- Overføringer

Vi overvåker situasjonen og forventer at tjenesten er tilbake innen [X minutter/timer].

Mvh,
Drop

Enable degraded mode:

# Disable features that depend on Swan
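# Environment variables belong to the source configuration (not --instance-configuration).
# RuntimeEnvironmentVariables replaces the whole map, so include all existing variables as well.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "FEATURE_ACCOUNTS": "disabled",
      "FEATURE_CARDS": "disabled",
      "SWAN_MODE": "degraded"
    }}}}'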

# Show maintenance banner in app

Monitor Swan status:
- Subscribe to Swan status updates (RSS/email)
- Check every 10 minutes for resolution
- Test API as soon as Swan reports "Resolved"

Re-enable features when Swan is back:

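# Same caveat as when disabling: the map is replaced as a whole, so list every variable.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "FEATURE_ACCOUNTS": "enabled",
      "FEATURE_CARDS": "enabled",
      "SWAN_MODE": "live"
    }}}}'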

ETA: Depends on Swan (typically <2 hours for major incidents)

Cause 2: Invalid or Expired API Credentials

Probability: 15% (after credential rotation or Swan account changes)

Symptoms:

Logs show: "401 Unauthorized" or "403 Forbidden"
All Swan API requests fail immediately
Swan API test returns authentication error

Solution:

Verify Swan API credentials:

bw get item "Swan API" --session $BW_SESSION

# Check:
# - API key is not expired
# - API key has correct permissions (accounts, cards, payments)
# - Partner ID is correct

Regenerate API key (if needed):
- Login to Swan Dashboard: https://dashboard.swan.io
- Navigate to Settings → API Keys
- Revoke old key, generate new key
- Copy new key to Vaultwarden

Update App Runner environment variables:

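# Environment variables belong to the source configuration (not --instance-configuration).
# RuntimeEnvironmentVariables replaces the whole map, so include all existing variables as well.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "SWAN_API_KEY": "<new-key>",
      "SWAN_PARTNER_ID": "<partner-id>"
    }}}}'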

Trigger deployment:

aws apprunner start-deployment --service-arn <ARN> --region eu-west-1

Test after deployment (3-5 min):

curl https://getdrop.no/api/accounts/create \
  -H "Authorization: Bearer <test-user-token>" \
  -H "Content-Type: application/json" \
  -d '{"accountType": "personal"}' \
  -v

# Expected: HTTP 200, account created

ETA: 10 minutes

Cause 3: Swan API Rate Limiting

Probability: 10% (during high-traffic events or viral growth)

Symptoms:

Logs show: HTTP 429 "Too Many Requests"
Intermittent failures (some requests succeed, others fail)
Rate limit headers in response

Solution:

Check rate limit headers:

aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "X-RateLimit" \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep Swan

Implement request queuing:

// src/lib/swan-client.ts
import PQueue from 'p-queue';

const queue = new PQueue({
  concurrency: 5,     // Max 5 concurrent Swan requests
  interval: 1000,      // Per second
  intervalCap: 20      // Max 20 requests per second
});

export async function swanGraphQL(query: string, variables?: any) {
  return queue.add(() =>
    fetch('https://api.swan.io/graphql', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.SWAN_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ query, variables }),
    })
  );
}

Exponential backoff on retry:

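const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries only on HTTP 429. fetch() does not throw on 429, so `operation`
// must throw an error carrying `status` (e.g. after checking response.status).
async function retrySwan<T>(operation: () => Promise<T>, attempt = 1): Promise<T> {
  try {
    return await operation();
  } catch (error) {
    const status = (error as { status?: number }).status;
    if (status === 429 && attempt <= 3) {
      const delay = 1000 * Math.pow(2, attempt); // 2s, 4s, 8s
      await sleep(delay);
      return retrySwan(operation, attempt + 1);
    }
    throw error;
  }
}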
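
The retry helper above only retries when the thrown error carries a `status` field, while `fetch` resolves normally on HTTP 429. A minimal sketch of a guard that turns a non-OK response into such an error; `HttpError` and `assertOk` are hypothetical names, not part of the existing client:

```typescript
// Error type that carries the HTTP status so callers can branch on it.
class HttpError extends Error {
  readonly status: number;

  constructor(status: number, message: string) {
    super(message);
    this.name = 'HttpError';
    this.status = status;
  }
}

// Minimal shape shared by the Fetch API Response and test doubles.
interface ResponseLike {
  ok: boolean;
  status: number;
  statusText: string;
}

// Throws HttpError for any non-2xx response; does nothing otherwise.
function assertOk(response: ResponseLike): void {
  if (!response.ok) {
    throw new HttpError(
      response.status,
      `Swan request failed: ${response.status} ${response.statusText}`,
    );
  }
}
```

Calling `assertOk(response)` right after the `fetch` inside the retried operation makes a 429 surface as a thrown error with `status === 429`, which is what the backoff logic checks.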

Contact Swan to increase rate limit:
- Email Swan support with traffic stats
- Provide justification: user growth, peak times
- Request higher API quota

ETA: 5 minutes (automatic retry), 1-2 days (if quota increase needed)

Cause 4: Swan GraphQL Schema Change (Breaking)

Probability: 5% (Swan updates API, breaks Drop integration)

Symptoms:

Logs show: "GraphQL validation error"
Specific queries fail: "Field 'X' doesn't exist on type 'Y'"
Swan API works for some operations, fails for others

Solution:

Check Swan changelog:

# Review recent API changes
open https://docs.swan.io/changelog

# Look for:
# - Deprecated fields
# - Required fields added
# - Type changes

Identify breaking changes:

# Compare current Drop queries to Swan schema
# Example: account creation query
grep -r "createAccount" src/lib/swan-client.ts

# Cross-reference with Swan GraphQL schema
# https://api.swan.io/graphql (GraphQL Playground)

Update Drop GraphQL queries:

# Before (deprecated)
mutation {
  createAccount(input: { name: "User Account" }) {
    id
    balance  # ❌ Deprecated field
  }
}

# After (updated)
mutation {
  createAccount(input: { name: "User Account" }) {
    id
    balances {  # ✅ New field structure
      available
      currency
    }
  }
}

Test updated queries:

# Test in Swan GraphQL Playground first
# Then deploy to staging
# Verify all Swan-dependent features work

Deploy fix:

git add src/lib/swan-client.ts
git commit -m "Fix: Update Swan GraphQL queries to match latest schema"
git push origin main

# CI/CD triggers deployment

ETA: 30 minutes (if simple field change), 2 hours (if major refactor needed)

Cause 5: Network or Firewall Issues

Probability: 5% (AWS security group misconfiguration)

Symptoms:

Logs show: "Connection timeout" or "ECONNREFUSED"
Swan API requests never reach destination
Works locally but fails in production

Solution:

Check outbound connectivity:

# App Runner egress is unrestricted by default
# If using VPC connector, check security group
aws ec2 describe-security-groups \
  --group-ids <vpc-connector-sg> \
  --region eu-west-1 \
  | jq '.SecurityGroups[].IpPermissionsEgress'

Test DNS resolution:

nslookup api.swan.io

# Should resolve to Swan IPs
# If NXDOMAIN: DNS issue

Check AWS service health:

# Check App Runner service events
aws apprunner list-operations \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.OperationSummaryList[0]'

Whitelist Swan IPs (if strict firewall):
- Contact Swan for IP ranges
- Add to security group outbound rules (port 443)

ETA: 15 minutes (if quick fix), 1 hour (if requires networking changes)

Cause 6: Swan Account Suspended or Payment Overdue

Probability: 2% (billing issue or compliance violation)

Symptoms:

All Swan API calls fail with "Account suspended"
Swan Dashboard shows billing alert
Email from Swan about overdue payment or compliance issue

Solution:

Check Swan Dashboard:
- Login: https://dashboard.swan.io
- Look for alerts: billing, compliance, KYC

Resolve billing issue:
- If overdue payment: pay immediately via Swan Dashboard
- If billing method expired: update payment method
- Contact Swan billing: billing@swan.io

Resolve compliance issue:
- Swan requires KYC for partner accounts
- Upload missing documents (company registration, director ID, etc.)
- Respond to Swan compliance team ASAP

Request urgent reactivation:
- Email Swan support: support@swan.io
- Subject: "URGENT: Account reactivation needed - [Partner ID]"
- Explain impact (users affected)
- Provide evidence of issue resolution

ETA: 15 minutes (if billing), 24 hours (if compliance review needed)

Emergency Workarounds

Option 1: Degraded Mode (Disable Swan Features)

Use case: Swan down >30 minutes, no ETA, users need core app functionality

Steps:

Disable Swan-dependent features:

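# Environment variables belong to the source configuration (not --instance-configuration).
# RuntimeEnvironmentVariables replaces the whole map, so include all existing variables as well.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "FEATURE_ACCOUNTS": "disabled",
      "FEATURE_CARDS": "disabled",
      "FEATURE_SWAN_WALLETS": "disabled"
    }}}}'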

Show banner in app:

⚠️ Noen funksjoner er midlertidig utilgjengelige
Kontoopprettelse og korttransaksjoner er ikke tilgjengelig for øyeblikket.
Andre funksjoner virker som normalt.

Allow core features to work:
- BankID login: ✅ (not Swan-dependent)
- Open Banking balance: ✅ (uses Neonomics, not Swan)
- PISP payments: ✅ (uses Neonomics, not Swan)
- Swan accounts: ❌ (disabled)
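
The dependency list above can be expressed as data, so that degraded mode is derived from which provider is down instead of being toggled feature by feature. A minimal sketch; the feature keys and the `availableFeatures` helper are hypothetical names, and the `cards` entry is inferred from the Swan feature flags used in this runbook:

```typescript
type Provider = 'swan' | 'neonomics' | 'bankid';

// Which external provider each user-facing feature depends on.
const FEATURE_PROVIDER: Record<string, Provider> = {
  bankid_login: 'bankid',
  open_banking_balance: 'neonomics',
  pisp_payments: 'neonomics',
  swan_accounts: 'swan',
  cards: 'swan',
};

// Returns the features that remain usable while the given providers are down.
function availableFeatures(down: ReadonlySet<Provider>): string[] {
  return Object.entries(FEATURE_PROVIDER)
    .filter(([, provider]) => !down.has(provider))
    .map(([feature]) => feature);
}

// During a Swan outage only the Swan-backed features drop out.
console.log(availableFeatures(new Set<Provider>(['swan'])));
```

The same table answers the reverse question during a Neonomics outage, which keeps the two runbooks consistent with each other.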

Re-enable when Swan is back:

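# Same caveat as when disabling: the map is replaced as a whole, so list every variable.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "FEATURE_ACCOUNTS": "enabled",
      "FEATURE_CARDS": "enabled",
      "FEATURE_SWAN_WALLETS": "enabled"
    }}}}'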

Risk: Users cannot create accounts or use cards during outage.

Option 2: Queue Swan Operations for Later

Use case: Swan down, users need to create accounts but can wait

Steps:

Queue account creation requests:

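// src/app/api/accounts/create/route.ts
export async function POST(request: Request) {
  const { accountType } = await request.json();
  // Assumption: getUserId is a placeholder for Drop's existing session/auth lookup
  const userId = await getUserId(request);

  try {
    // Assumes swanClient.createAccount returns plain account data
    const account = await swanClient.createAccount(accountType);
    return Response.json(account);
  } catch (error) {
    if ((error as { code?: string }).code === 'SWAN_UNAVAILABLE') {
      // Queue for later processing
      await db.insert('pending_accounts', {
        user_id: userId,
        account_type: accountType,
        status: 'queued',
        created_at: new Date(),
      });

      // Route handlers must return a Response, not a plain object
      return Response.json({
        success: true,
        message: 'Account creation queued, will complete within 1 hour',
      });
    }
    throw error;
  }
}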

Process queue when Swan is back:

# Run cron job to process pending accounts
node ~/ALAI/products/Drop/scripts/process-pending-accounts.js

Notify users when account is ready:

Din konto er klar!
Takk for tålmodigheten. Du kan nå bruke alle funksjoner i Drop.

Risk: Delayed user experience. Users may expect instant account creation.

Monitoring & Alerts

Metrics to Track

Swan API success rate: Should be >99%
Swan API latency: p50 <500ms, p95 <2s, p99 <5s
Swan error rate by operation: Track createAccount, issueCard, makePayment separately
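
To check the latency targets above against a batch of measured durations, the percentiles have to be computed first. A minimal sketch using the nearest-rank method (one of several common percentile definitions); `percentile` and `meetsSwanLatencyTargets` are hypothetical helper names:

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error('no samples');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in (0, 100]');
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1] as number;
}

// Targets from this runbook: p50 < 500ms, p95 < 2s, p99 < 5s.
function meetsSwanLatencyTargets(samplesMs: number[]): boolean {
  return (
    percentile(samplesMs, 50) < 500 &&
    percentile(samplesMs, 95) < 2000 &&
    percentile(samplesMs, 99) < 5000
  );
}
```

With small sample counts the high percentiles collapse onto the slowest request, so the check is only meaningful over a reasonably large window.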

Alert Rules

// src/lib/swan-monitor.ts
export async function trackSwanFailure(operation: string, error: any) {
  const failureRate = await calculateSwanFailureRate('last_5_minutes');

  if (failureRate > 0.05) { // 5% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'Swan API failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of Swan calls failing`,
      operation,
    });
  }
}

Post-Incident Actions

Process queued operations:

SELECT * FROM pending_accounts WHERE status = 'queued';
-- Retry all pending account creations

Document incident:

touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-swan-outage.md

Review SLA with Swan:
- Check if outage violated SLA
- Request compensation/credits
- Discuss failover options

Improve resilience:
- Add Swan health check (every 5 min)
- Implement circuit breaker for Swan API
- Consider multi-provider strategy (backup BaaS)
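
The circuit breaker mentioned above could look roughly like this. A minimal sketch of the usual closed / open / half-open state machine; the class name, thresholds, and the injected `now` parameter are assumptions for illustration, not an existing Drop component:

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

class CircuitBreaker {
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private failures = 0;
  private openedAt: number | null = null;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // open: calls are blocked; half_open: cooldown elapsed, a trial call is allowed.
  state(now: number = Date.now()): BreakerState {
    if (this.openedAt === null) {
      return 'closed';
    }
    return now - this.openedAt >= this.cooldownMs ? 'half_open' : 'open';
  }

  canRequest(now: number = Date.now()): boolean {
    return this.state(now) !== 'open';
  }

  // Any success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number = Date.now()): void {
    if (this.state(now) === 'half_open') {
      this.openedAt = now; // trial call failed: reopen and restart the cooldown
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = now;
    }
  }
}
```

Wrapping Swan calls with `canRequest()` before the call and `recordSuccess()` / `recordFailure()` after it lets the app fail fast during an outage instead of waiting on timeouts.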

Escalation

Time	Action
0 min	John starts diagnosis
5 min	If Swan status page shows incident, notify Alem
15 min	If not resolved, enable degraded mode
30 min	Contact Swan support via phone if no ETA
1 hour	Public communication to users

Contacts

Swan Support: support@swan.io
Swan Phone: +33 X XXXX XXXX (check Swan Dashboard for number)
Swan Status: https://status.swan.io
Internal: Alem (CEO, final decision on feature disabling)

docs/architecture/banking.md — Swan BaaS integration
src/lib/swan-client.ts — Swan GraphQL client
docs/compliance/swan-requirements.md — Swan partner KYC/compliance
Vaultwarden item: "Swan API" — Credentials

Last Updated: 2026-02-22
Next Review: Before Phase 2 (Banking Integration)

Runbook: Neonomics Outage

Runbook: Neonomics Open Banking Outage

Service: Neonomics Open Banking Aggregator
Severity: CRITICAL (blocks AISP balance fetch and PISP payments)
MTTR Target: <20 minutes
Owner: John (AI Director)

Overview

Neonomics is Drop's Open Banking aggregator for Norwegian banks. It provides:

AISP (Account Information): Fetch user's bank account balance via PSD2 consent
PISP (Payment Initiation): Initiate payments from user's bank account
Bank connectivity: Single API to connect to all Norwegian banks (DNB, Nordea, SpareBank 1, etc.)

Impact: If Neonomics is down, Drop cannot:

Show bank balances
Initiate remittance payments
Process QR payments

This is a critical outage affecting core functionality.

Symptoms

Users report core features not working:

Cannot see bank balance (shows "unavailable")
Cannot initiate payments (error at payment step)
Bank connection fails ("Cannot connect to bank")
Error: "Open Banking service unavailable"

User impact: Cannot use core Drop features (balance, payments).

Diagnosis

1. Check Neonomics Service Status

External status:

# Neonomics has no public status page
# Test via API health check
curl -X GET https://api.neonomics.io/health \
  -H "Authorization: Bearer <api-key>" \
  -v

# Expected: HTTP 200
# If 500/503: Neonomics outage
# If timeout: Network or Neonomics connectivity issue

Check specific bank connectivity:

# List banks and their status
curl -X GET https://api.neonomics.io/banks \
  -H "Authorization: Bearer <api-key>" \
  | jq '.[] | select(.country == "NO") | {name, status, lastChecked}'

# Look for:
# - "status": "degraded" or "offline"
# - Specific bank down (e.g., DNB) vs all banks

2. Check Drop Logs

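# CloudWatch Logs (production)
# Filter patterns are case-sensitive: "?neonomics ?Neonomics" matches either spelling
# Note: `date -d` is GNU syntax; on macOS (BSD date) use $(date -u -v-15M +%s)000
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "?neonomics ?Neonomics" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  --region eu-west-1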

# Look for:
# - "Neonomics API timeout"
# - "Neonomics 503 Service Unavailable"
# - "Bank API unavailable: DNB"
# - "Payment initiation failed: NEONOMICS_TIMEOUT"

3. Determine Scope of Outage

Is it all banks or specific banks?

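# Count recent failures by bank
# The filter pattern matches events containing both terms (CloudWatch patterns are not regexes);
# jq -r emits raw messages so the quotes in "bank":"..." are not escaped
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Neonomics failed" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -o '"bank":"[^"]*"' \
  | sort | uniq -c | sort -rn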

# Example output:
# 45 "bank":"DNB"        ← DNB-specific issue
# 2 "bank":"Nordea"      ← Nordea working mostly
# 1 "bank":"SpareBank1"  ← SpareBank1 working
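
The judgment made from the per-bank counts above can be written down as a rule. A minimal sketch, assuming one bank counts as the culprit when it accounts for at least 80% of recent failures; the threshold and the `classifyOutageScope` name are assumptions for illustration:

```typescript
type OutageScope =
  | { scope: 'none' }
  | { scope: 'bank_specific'; bank: string }
  | { scope: 'widespread' };

// Classifies failures by how concentrated they are in a single bank.
function classifyOutageScope(
  failuresByBank: Record<string, number>,
  dominanceThreshold: number = 0.8,
): OutageScope {
  let total = 0;
  let topBank = '';
  let topCount = 0;
  for (const [bank, count] of Object.entries(failuresByBank)) {
    total += count;
    if (count > topCount) {
      topBank = bank;
      topCount = count;
    }
  }
  if (total === 0) {
    return { scope: 'none' };
  }
  if (topCount / total >= dominanceThreshold) {
    return { scope: 'bank_specific', bank: topBank };
  }
  return { scope: 'widespread' };
}

// The example output above: DNB dominates, so the issue is bank-specific.
console.log(classifyOutageScope({ DNB: 45, Nordea: 2, SpareBank1: 1 }));
```

Raw failure counts ignore how many users each bank has, so a bank with a large user base can look dominant even during a general outage; failure rates per bank would be a more robust input where available.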

Is it AISP, PISP, or both?

# Check failure type
# jq -r emits raw messages; grep -o extracts only the service field so uniq -c counts per service
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Neonomics" \
  --start-time $(date -u -d '15 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -oE '"service":"(aisp|pisp)"' \
  | sort | uniq -c

# Example:
# 30 "service":"aisp"  ← AISP failing
# 45 "service":"pisp"  ← PISP failing
# If both high: full Neonomics outage

4. Test AISP and PISP Flows

Test AISP (balance fetch):

# Staging environment
TOKEN=$(curl -X POST https://drop-staging.fly.dev/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test1234"}' \
  | jq -r '.data.token')

curl -X GET https://drop-staging.fly.dev/api/accounts/balance \
  -H "Authorization: Bearer $TOKEN" \
  -v

# Expected: HTTP 200, balance data
# If 500: AISP broken

Test PISP (payment initiation):

curl -X POST https://drop-staging.fly.dev/api/transactions/remittance \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "recipientId": "rec_test123",
    "amount": 100,
    "currency": "NOK"
  }' \
  -v

# Expected: HTTP 200, payment initiated
# If 500: PISP broken

5. Check Neonomics API Credentials

# Verify API key is valid
bw get item "Neonomics API" --session $BW_SESSION

# Check App Runner environment variables
aws apprunner describe-service \
  --service-arn <ARN> \
  --region eu-west-1 \
  | jq '.Service.SourceConfiguration.ImageRepository.ImageConfiguration.RuntimeEnvironmentVariables' \
  | grep NEONOMICS

# Expected:
# NEONOMICS_API_KEY: <exists>
# NEONOMICS_ENVIRONMENT: production

Common Causes & Solutions

Cause 1: Neonomics Full Outage (All Banks)

Probability: 10% (rare but critical)

Symptoms:

ALL banks fail (DNB, Nordea, SpareBank 1, etc.)
All AISP and PISP requests timeout or return 503
Neonomics API health check fails

Solution:

Verify full outage:

# Test multiple endpoints
curl -X GET https://api.neonomics.io/health -v
curl -X GET https://api.neonomics.io/banks -H "Authorization: Bearer <key>" -v

# If both fail: confirmed full outage

Contact Neonomics support URGENTLY:
- Email: support@neonomics.io
- Slack: #neonomics-support (if available)
- Phone: +47 XXXX XXXX (check Neonomics Dashboard)

Communicate to users (Norwegian):

Emne: Betalingstjenester midlertidig utilgjengelige

Hei,

Vi opplever for øyeblikket tekniske problemer med vår betalingsleverandør.
Dette påvirker:
- Visning av saldo
- Nye betalinger

Vi jobber med å gjenopprette tjenesten så raskt som mulig.
Estimert løsning: [X minutter/timer]

Mvh,
Drop

Enable degraded mode:

# Show cached balances, disable new payments
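# Environment variables belong to the source configuration (not --instance-configuration).
# RuntimeEnvironmentVariables replaces the whole map, so include all existing variables as well.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "AISP_MODE": "cached",
      "PISP_MODE": "disabled",
      "NEONOMICS_FALLBACK": "true"
    }}}}'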

Show maintenance banner in app:

⚠️ Betalinger midlertidig utilgjengelig
Vi opplever tekniske problemer. Saldo vises med forsinkelse.
Betalinger er deaktivert midlertidig.

Monitor Neonomics status:
- Check API health every 5 minutes
- When API returns 200: test AISP/PISP flows
- Re-enable features gradually

Re-enable live mode when resolved:

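# Same caveat as when enabling degraded mode: the map is replaced as a whole, so list every variable.
aws apprunner update-service --service-arn <ARN> --region eu-west-1 \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<image-uri>", "ImageRepositoryType": "<ECR|ECR_PUBLIC>",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "AISP_MODE": "live",
      "PISP_MODE": "live",
      "NEONOMICS_FALLBACK": "false"
    }}}}'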

ETA: Depends on Neonomics (typically <2 hours for major incidents)

Cause 2: Specific Bank API Down

Probability: 25% (one bank's API temporarily unavailable)

Symptoms:

Only users of specific bank (e.g., DNB) affected
Other banks work fine (Nordea, SpareBank 1)
Logs show: "Bank API timeout: DNB"

Common reasons:

Bank's API maintenance (often 02:00-06:00 CET)
Bank's API outage
Bank rate limiting Neonomics
Bank API certificate expired

Solution:

Identify affected bank:

# Count failures by bank
# jq -r emits raw messages so the quotes in "bank":"..." are not escaped
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern "Bank API" \
  --start-time $(date -u -d '30 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -o '"bank":"[^"]*"' \
  | sort | uniq -c | sort -rn

Check bank status:
- DNB: https://www.dnb.no/drift
- Nordea: https://www.nordea.no/info/driftsmeldinger
- SpareBank 1: https://www.sparebank1.no/driftsmeldinger
- Norwegian banks often announce maintenance

Contact Neonomics to verify:
- Neonomics may already know about bank API issues
- Ask for ETA on bank connectivity restoration

Notify affected users (bank-specific):

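-- Find users with affected bank
-- Columns are qualified to avoid ambiguity; assumes users is keyed by user_id as in the other queries
SELECT DISTINCT u.user_id, u.email
FROM bank_accounts b
JOIN users u ON u.user_id = b.user_id
WHERE b.bank_name = 'DNB';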

Email (Norwegian):

Emne: Problemer med [Bank] tilkobling

Hei,

Vi opplever for øyeblikket problemer med tilkoblingen til [Bank].
Dette skyldes tekniske problemer hos banken.

Andre banker virker som normalt.
Hvis du har konto i en annen bank, kan du bruke den i mellomtiden.

Estimert løsning: [X minutter/timer]

Mvh,
Drop

Graceful degradation (bank-specific):

// src/lib/neonomics-client.ts
async function fetchBalance(userId: string, bankId: string) {
  try {
    return await neonomicsAPI.getBalance(userId, bankId);
  } catch (error) {
    const { code, bank } = error as { code?: string; bank?: string };
    if (code === 'BANK_API_TIMEOUT' && bank === 'DNB') {
      // Return cached balance for DNB users
      const cached = await getCachedBalance(userId);
      return {
        // `??` (not `||`) so a genuine cached balance of 0 is not turned into null
        balance: cached?.balance ?? null,
        currency: 'NOK',
        lastUpdated: cached?.timestamp,
        warning: 'DNB opplever tekniske problemer. Saldo kan være utdatert.'
      };
    }
    throw error;
  }
}

ETA: Depends on bank (typically <2 hours for maintenance, <4 hours for incidents)
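
The per-bank failure count from the log query in this section can also be computed in application code, for example to feed a dashboard. This is a minimal sketch, assuming each log message is a JSON object with a string `bank` field (the grep pattern suggests this shape, but the real log format may differ). `countFailuresByBank` is a hypothetical helper, not an existing function in the codebase.

```typescript
// Sketch: tally failures per bank from raw log messages.
// Assumption: each message is a JSON object with a string "bank" field.
function countFailuresByBank(messages: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const raw of messages) {
    let bank: unknown;
    try {
      bank = (JSON.parse(raw) as { bank?: unknown }).bank;
    } catch {
      continue; // skip lines that are not valid JSON objects
    }
    if (typeof bank === "string") {
      counts.set(bank, (counts.get(bank) ?? 0) + 1);
    }
  }
  // Highest count first, mirroring `sort | uniq -c | sort -rn`
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

If one bank dominates the result while the others stay near zero, that points to Cause 2 rather than a full Neonomics outage.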

Cause 3: Neonomics API Rate Limiting

Probability: 15% (during peak hours or viral growth)

Symptoms:

Logs show: HTTP 429 "Too Many Requests"
Intermittent failures (some requests succeed, others fail)
Rate limit headers in logs

Solution:

Check rate limit headers:

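# GNU date syntax; on macOS use $(date -u -v-10M +%s)000
# The pattern is wrapped in double quotes because it contains a hyphen
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-production \
  --filter-pattern '"X-RateLimit"' \
  --start-time $(date -u -d '10 minutes ago' +%s)000 \
  | jq -r '.events[].message' \
  | grep -E "X-RateLimit-(Limit|Remaining|Reset)"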

Implement request throttling:

// src/lib/neonomics-client.ts
import PQueue from 'p-queue';

const queue = new PQueue({
  concurrency: 10,      // Max 10 concurrent requests
  interval: 1000,        // Per second
  intervalCap: 50        // Max 50 requests per second
});

export async function callNeonomics(endpoint: string, options: any) {
  return queue.add(() =>
    fetch(`https://api.neonomics.io${endpoint}`, {
      ...options,
      headers: {
        'Authorization': `Bearer ${process.env.NEONOMICS_API_KEY}`,
        ...options.headers,
      },
    })
  );
}

Aggressive caching during rate limit:

// Cache balance for 1 minute normally; keep a 5-minute stale copy to serve during rate limiting
const CACHE_TTL_NORMAL = 60;      // 1 minute
const CACHE_TTL_RATE_LIMIT = 300; // 5 minutes

async function getBalanceWithCache(userId: string) {
  const fresh = await redis.get(`balance:${userId}`);
  if (fresh) return JSON.parse(fresh);

  try {
    const balance = await neonomicsAPI.getBalance(userId);
    const payload = JSON.stringify(balance);
    await redis.setex(`balance:${userId}`, CACHE_TTL_NORMAL, payload);
    // The stale copy outlives the fresh entry, so it still exists when a 429 arrives.
    // (The `balance:stale:` key name is illustrative.)
    await redis.setex(`balance:stale:${userId}`, CACHE_TTL_RATE_LIMIT, payload);
    return balance;
  } catch (error: any) {
    if (error?.status === 429) {
      // Rate limited: fall back to the stale copy if one exists
      const stale = await redis.get(`balance:stale:${userId}`);
      if (stale) return JSON.parse(stale);
    }
    throw error;
  }
}

Contact Neonomics to increase rate limit:
- Email: support@neonomics.io
- Provide traffic stats (requests/day, peak times)
- Request higher API quota

ETA: 5 minutes (automatic throttling), 1-2 days (if quota increase needed)
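
Throttling lowers the chance of hitting the limit, but individual calls can still receive HTTP 429. One possible complement is to retry with exponential backoff and honor a `Retry-After` header when one is present. The sketch below rests on assumptions: it is not confirmed that Neonomics sends `Retry-After`, and the names `retryDelayMs` and `withRetryOn429` as well as the default timings are hypothetical.

```typescript
// Minimal response shape so the sketch does not depend on a specific HTTP client.
interface HttpResult {
  status: number;
  headers: { get(name: string): string | null };
}

// Delay before the next attempt: Retry-After (in seconds) if usable, else exponential backoff.
function retryDelayMs(
  attempt: number,
  retryAfter?: string | null,
  baseMs = 500,
  maxMs = 30000,
): number {
  const seconds = retryAfter ? Number(retryAfter) : NaN;
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, maxMs);
  }
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// Repeats the call while it returns 429, up to maxAttempts calls in total.
async function withRetryOn429<T extends HttpResult>(
  call: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let result = await call();
  for (let attempt = 0; result.status === 429 && attempt < maxAttempts - 1; attempt++) {
    await sleep(retryDelayMs(attempt, result.headers.get("Retry-After")));
    result = await call();
  }
  return result;
}
```

A call such as `withRetryOn429(() => callNeonomics('/accounts', {}))` would wrap the throttled client. The endpoint path here is only an example.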

Cause 4: Invalid or Expired API Credentials

Probability: 5% (after credential rotation or account issue)

Symptoms:

Logs show: "401 Unauthorized" or "403 Forbidden"
All Neonomics API calls fail immediately
API health check returns 401

Solution:

Verify Neonomics API credentials:

bw get item "Neonomics API" --session $BW_SESSION

# Check:
# - API key is correct
# - Not expired
# - Correct environment (production vs sandbox)

Regenerate API key (if needed):
- Login to Neonomics Dashboard (if available)
- Navigate to Settings → API Keys
- Generate new API key
- Copy to Vaultwarden

Update App Runner environment variables:

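# Env vars are set via --source-configuration, not --instance-configuration (image-based ECR service assumed).
# List every variable the service needs: the supplied map may replace the existing one.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<IMAGE_URI>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "NEONOMICS_API_KEY": "<new-key>",
      "NEONOMICS_ENVIRONMENT": "production"
    }}
  }}'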

Trigger deployment (only needed if the configuration update did not start one on its own):

aws apprunner start-deployment --service-arn <ARN> --region eu-west-1

Test after deployment:

curl -X GET https://getdrop.no/api/accounts/balance \
  -H "Authorization: Bearer <test-user-token>" \
  -v

# Expected: HTTP 200, balance data

ETA: 10 minutes

Cause 5: PSD2 Consent Expired (AISP Only)

Probability: 20% (affects AISP, not PISP)

Symptoms:

Only AISP (balance fetch) fails
PISP (payments) still works
Logs show: "CONSENT_EXPIRED" or "CONSENT_INVALID"
Specific users affected (not all)

Note: This is actually a user-level issue, not a Neonomics outage. See aisp-balance-failure.md runbook for full details.

Quick solution:

Identify users with expired consent:

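-- NOW() is PostgreSQL syntax (Drop runs on drop-postgres); datetime('now') is SQLite-only
SELECT bank_accounts.user_id, users.email, bank_accounts.bank_name, bank_accounts.consent_expires_at
FROM bank_accounts
JOIN users ON users.id = bank_accounts.user_id
WHERE bank_accounts.consent_expires_at < NOW();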

Notify users to re-authorize (Norwegian):

Push notification:
Banktilkobling utløpt — Trykk her for å fornye

User re-authorizes via BankID + bank consent flow

ETA: Immediate (user action required)
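
Because expired consent needs user action, it may help to remind users shortly before expiry instead of after balance fetches start failing. The sketch below classifies consents using the same comparison as the SQL query in this section. The `BankConsent` shape, the function names, and the 7-day warning window are assumptions for illustration.

```typescript
interface BankConsent {
  userId: string;
  bankName: string;
  consentExpiresAt: Date;
}

type ConsentState = "expired" | "expiring_soon" | "valid";

const DAY_MS = 24 * 60 * 60 * 1000;

function classifyConsent(consent: BankConsent, now: Date, warnDays = 7): ConsentState {
  const msLeft = consent.consentExpiresAt.getTime() - now.getTime();
  if (msLeft < 0) return "expired"; // same condition as the SQL: consent_expires_at < now
  if (msLeft <= warnDays * DAY_MS) return "expiring_soon";
  return "valid";
}

// User IDs that should receive a renewal reminder, one entry per user.
function usersToRemind(consents: BankConsent[], now: Date, warnDays = 7): string[] {
  const ids = consents
    .filter((c) => classifyConsent(c, now, warnDays) === "expiring_soon")
    .map((c) => c.userId);
  return Array.from(new Set(ids)); // de-duplicate users with several accounts
}
```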

Cause 6: Network or Firewall Issues

Probability: 5% (AWS security group misconfiguration)

Symptoms:

Logs show: "Connection timeout" or "ECONNREFUSED"
Neonomics API requests never reach destination
Works locally but fails in production

Solution:

Check outbound connectivity:

# App Runner egress is unrestricted by default
# If using VPC connector, check security group
aws ec2 describe-security-groups \
  --group-ids <vpc-connector-sg> \
  --region eu-west-1 \
  | jq '.SecurityGroups[].IpPermissionsEgress'

Test DNS resolution:

nslookup api.neonomics.io

# Should resolve to Neonomics IPs
# If NXDOMAIN: DNS issue

Check AWS service health:

# Check App Runner service events
aws apprunner list-operations \
  --service-arn <ARN> \
  --region eu-west-1

Whitelist Neonomics IPs (if using strict firewall):
- Contact Neonomics for IP ranges
- Add to security group outbound rules (port 443)

ETA: 15 minutes (if quick fix), 1 hour (if requires networking changes)
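
When a VPC connector is in use, the egress rules returned by `describe-security-groups` can be checked programmatically for outbound HTTPS. This is a rough sketch under stated assumptions: it covers only IPv4 `IpRanges`, matches the CIDR by exact string (no subnet containment), and ignores prefix lists. `allowsHttpsEgress` is a hypothetical helper.

```typescript
// Subset of the shape found under SecurityGroups[].IpPermissionsEgress
interface EgressRule {
  IpProtocol: string; // "-1" means all protocols and ports
  FromPort?: number;
  ToPort?: number;
  IpRanges?: { CidrIp: string }[];
}

// True if at least one rule permits TCP 443 to the given CIDR (exact match only).
function allowsHttpsEgress(rules: EgressRule[], cidr = "0.0.0.0/0"): boolean {
  return rules.some((rule) => {
    const allTraffic = rule.IpProtocol === "-1";
    const protocolOk = allTraffic || rule.IpProtocol === "tcp";
    const portOk =
      allTraffic || ((rule.FromPort ?? 0) <= 443 && (rule.ToPort ?? 65535) >= 443);
    const rangeOk = (rule.IpRanges ?? []).some((range) => range.CidrIp === cidr);
    return protocolOk && portOk && rangeOk;
  });
}
```

A `false` result does not prove that traffic is blocked, since a narrower CIDR may still cover the Neonomics addresses. It only flags the security group for manual review.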

Emergency Workarounds

Option 1: Cached Balance + Disable Payments

Use case: Neonomics down >30 minutes, no ETA

Steps:

Enable cached balance mode:

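# Env vars are set via --source-configuration, not --instance-configuration (image-based ECR service assumed).
# List every variable the service needs: the supplied map may replace the existing one.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<IMAGE_URI>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "AISP_MODE": "cached",
      "AISP_CACHE_TTL": "3600",
      "PISP_MODE": "disabled"
    }}
  }}'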

Show warning banner in app:

⚠️ Betalinger midlertidig utilgjengelige
Saldo vises med forsinkelse (opptil 1 time).
Nye betalinger er deaktivert til tjenesten er tilbake.

Allow read-only features:
- Users can see cached balance
- Users can see transaction history
- Cannot initiate new payments

Re-enable when Neonomics is back:

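# Env vars are set via --source-configuration, not --instance-configuration (image-based ECR service assumed).
# List every variable the service needs: the supplied map may replace the existing one.
aws apprunner update-service --service-arn <ARN> \
  --source-configuration '{"ImageRepository": {
    "ImageIdentifier": "<IMAGE_URI>", "ImageRepositoryType": "ECR",
    "ImageConfiguration": {"RuntimeEnvironmentVariables": {
      "AISP_MODE": "live",
      "PISP_MODE": "live"
    }}
  }}'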

Risk: Stale balance data. Users may think they have more/less money than reality.
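
How the application reads these switches is not shown in this runbook. The sketch below illustrates one way the `AISP_MODE`, `AISP_CACHE_TTL`, and `PISP_MODE` variables could be resolved at startup. `resolveModes` and the `ServiceModes` type are hypothetical, and the 60-second default mirrors the normal cache TTL used elsewhere in this runbook.

```typescript
type AispMode = "live" | "cached";
type PispMode = "live" | "disabled";

interface ServiceModes {
  aisp: AispMode;
  pisp: PispMode;
  cacheTtlSeconds: number;
}

// Unknown or missing values fall back to normal operation ("live").
function resolveModes(env: Record<string, string | undefined>): ServiceModes {
  const ttl = Number.parseInt(env["AISP_CACHE_TTL"] ?? "", 10);
  return {
    aisp: env["AISP_MODE"] === "cached" ? "cached" : "live",
    pisp: env["PISP_MODE"] === "disabled" ? "disabled" : "live",
    cacheTtlSeconds: Number.isFinite(ttl) && ttl > 0 ? ttl : 60,
  };
}
```

Falling back to `live` on unrecognized values means a typo in the variable cannot silently disable payments. A stricter variant could refuse to start instead.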

Option 2: Queue Payments for Later Processing

Use case: PISP down, users need to make urgent payments

Steps:

Queue payment requests:

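// src/app/api/transactions/remittance/route.ts
export async function POST(request: Request) {
  const paymentData = await request.json();
  // getAuthenticatedUserId is a placeholder for the app's own session lookup
  const userId = await getAuthenticatedUserId(request);

  try {
    return await neonomicsAPI.initiatePayment(paymentData);
  } catch (error: any) {
    if (error?.code === 'NEONOMICS_UNAVAILABLE') {
      // Queue for later
      await db.insert('pending_payments', {
        user_id: userId,
        payment_data: paymentData,
        status: 'queued',
        created_at: new Date(),
      });

      // Route handlers must return a Response; 202 signals "accepted, not yet executed"
      return Response.json(
        {
          success: true,
          // "Payment queued. It will be processed within 2 hours."
          message: 'Betaling satt i kø. Vil bli behandlet innen 2 timer.',
        },
        { status: 202 },
      );
    }
    throw error;
  }
}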

Process queue when Neonomics is back:

node ~/ALAI/products/Drop/scripts/process-pending-payments.js

Notify users when payment completes:

Din betaling er behandlet
Betalingen på [amount] til [recipient] er fullført.

Risk: Delayed payments. User may expect instant transfer.

Monitoring & Alerts

Metrics to Track

Neonomics API success rate: Should be >99%
Neonomics API latency: p50 <2s, p95 <5s, p99 <10s
Bank-specific failure rate: Track DNB, Nordea, SpareBank 1 separately
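
The latency targets can be checked against raw samples with a percentile calculation. The sketch below uses the nearest-rank method. `percentile` and `meetsLatencyTargets` are hypothetical helpers, and a metrics backend such as Prometheus would normally compute these values instead.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// Targets from the list above: p50 < 2 s, p95 < 5 s, p99 < 10 s (values in milliseconds).
function meetsLatencyTargets(samplesMs: number[]): boolean {
  return (
    percentile(samplesMs, 50) < 2000 &&
    percentile(samplesMs, 95) < 5000 &&
    percentile(samplesMs, 99) < 10000
  );
}
```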

Alert Rules

// src/lib/neonomics-monitor.ts
export async function trackNeonomicsFailure(service: 'aisp' | 'pisp', error: any) {
  const failureRate = await calculateFailureRate('neonomics', 'last_5_minutes');

  if (failureRate > 0.1) { // 10% failure rate
    await sendAlert({
      severity: 'critical',
      title: 'Neonomics API failure rate high',
      message: `${(failureRate * 100).toFixed(1)}% of Neonomics calls failing`,
      service,
    });
  }
}
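
The alert rule relies on `calculateFailureRate`, which is not shown in this runbook. One way to compute a rate over the last 5 minutes is sketched below under assumptions: call outcomes are available as an in-memory array, and a minimum sample count (a hypothetical safeguard, not part of the original rule) keeps a single failed call during quiet periods from triggering a critical alert.

```typescript
interface CallOutcome {
  timestampMs: number;
  ok: boolean;
}

// Fraction of failed calls inside the window ending at nowMs.
// Returns 0 when there are too few calls to judge.
function failureRate(
  outcomes: CallOutcome[],
  nowMs: number,
  windowMs = 5 * 60 * 1000,
  minSamples = 20,
): number {
  const recent = outcomes.filter(
    (o) => o.timestampMs > nowMs - windowMs && o.timestampMs <= nowMs,
  );
  if (recent.length < minSamples) return 0;
  const failed = recent.filter((o) => !o.ok).length;
  return failed / recent.length;
}
```

With 20 calls in the window and 3 failures the rate is 0.15, which exceeds the 0.1 threshold used by the alert rule.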

Post-Incident Actions

Process queued operations:

SELECT * FROM pending_payments WHERE status = 'queued';
-- Retry all pending payments

Document incident:

touch ~/ALAI/products/Drop/comms/incidents/$(date +%Y-%m-%d)-neonomics-outage.md

Review SLA with Neonomics:
- Check if outage violated SLA
- Request compensation/credits
- Discuss redundancy options

Improve resilience:
- Add Neonomics health check (synthetic test every 5 min)
- Implement circuit breaker for Neonomics API
- Consider multi-provider strategy (backup Open Banking aggregator)
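
The circuit breaker mentioned above could look roughly like the following. This is a minimal sketch rather than a production implementation: the thresholds are illustrative, the class is hypothetical, and it tracks consecutive failures only. The clock is injectable so the behavior can be tested without waiting.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the cooldown, trial calls are allowed through
    return this.now() - this.openedAt >= this.cooldownMs ? "half_open" : "open";
  }

  // Call before each Neonomics request; skip the request when this returns false.
  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    // Open on reaching the threshold; a failed trial call restarts the cooldown
    if (this.openedAt !== null || this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

While the breaker is open, the client can serve cached balances immediately instead of waiting for each request to time out.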

Escalation

Time	Action
0 min	John starts diagnosis
10 min	If full Neonomics outage confirmed, notify Alem
20 min	If not resolved, enable degraded mode (cached balance, disable payments)
30 min	Contact Neonomics support via phone if no response
1 hour	Public communication to all users
2 hours	Assess alternative Open Banking providers (emergency only)

Contacts

Neonomics Support: support@neonomics.io
Neonomics Slack: #neonomics-support (if available)
Neonomics Phone: +47 XXXX XXXX (check Neonomics Dashboard)
Internal: Alem (CEO, final decision on fallback modes)

Related Documents

docs/architecture/open-banking.md — Neonomics AISP/PISP flow
src/lib/neonomics-client.ts — Neonomics API client
docs/compliance/psd2-requirements.md — PSD2 regulatory requirements
support/runbooks/aisp-balance-failure.md — AISP-specific failures
support/runbooks/pisp-payment-failure.md — PISP-specific failures
Vaultwarden item: "Neonomics API" — Credentials

Last Updated: 2026-02-22
Next Review: Before Phase 2 (Banking Integration)

Infrastructure & Internal Services

Complete runbooks for all ALAI internal services: Docker containers, LaunchAgent daemons, Cloudflare tunnel, Vaultwarden, email system, bots, and more.

ALAI Infrastructure — Service Catalog & Runbooks

Last updated: 2026-03-11 | Maintained by: John (AI Director)
Host: Mac Studio M3 Ultra (ANVIL) | OS: macOS
Quick health: node ~/system/tools/daemon-health.js

🐳 Docker Services (23 containers)

Core Platform Services

Service	Image	Port	External URL	Health	Restart
Vaultwarden	vaultwarden/server	:8200	vault.basicconsulting.no	✅ healthy	`cd ~/system/services/vaultwarden && docker compose restart`
BookStack	linuxserver/bookstack	:6875	docs.basicconsulting.no	✅ running	`cd ~/system/services/bookstack && docker compose restart`
BookStack DB	linuxserver/mariadb	:3306 (internal)	—	✅ running	Restarts with BookStack
Planka	plankanban/planka	:3100	boards.basicconsulting.no	✅ healthy	`cd ~/system/services/planka && docker compose restart`
Planka DB	postgres:15-alpine	internal	—	✅ healthy	Restarts with Planka
Documenso	documenso/documenso	:3003	sign.basicconsulting.no	✅ running	`cd ~/system/services/documenso && docker compose restart`
Documenso DB	postgres:15-alpine	internal	—	✅ healthy	Restarts with Documenso
Documenso MinIO	minio/minio	:9002/:9003	—	✅ running	Restarts with Documenso
Baikal (CalDAV)	ckulka/baikal:nginx	:5232	calendar.basicconsulting.no	✅ running	`cd ~/system/services/baikal && docker compose restart`
Qdrant (Vector DB)	qdrant/qdrant	:6333/:6334	—	✅ running	`docker restart qdrant`

Product Database Services

Service	Port	Product	Health	Restart
drop-postgres	:5433	Drop	✅ healthy	`cd ~/ALAI/products/Drop && docker compose restart drop-postgres`
plock-db	:5434	Plock	✅ healthy	`cd ~/ALAI/products/Plock && docker compose restart plock-db`
plock-redis	:6380	Plock	✅ healthy	Restarts with plock-db
bilko-postgres	:5436	Bilko	✅ running	`cd ~/ALAI/products/Bilko && docker compose restart bilko-postgres`
bilko-redis	:6382	Bilko	✅ running	Restarts with bilko
lobby-postgres	:5437	Lobby	✅ healthy	`cd ~/ALAI/products/Lobby && docker compose restart lobby-postgres`
lumiscare-postgres	:5432	LumisCare	✅ healthy	Client project
lumiscare-redis	:6379	LumisCare	✅ healthy	Client project
backend-postgres	:5435	BasicFakta	✅ healthy	`cd ~/ALAI/products/BasicFakta && docker compose restart`
backend-redis	:6381	BasicFakta	✅ healthy	Restarts with backend

Monitoring Stack (Drop)

Service	Port	URL	Restart
Grafana	:3300	grafana.basicconsulting.no	`docker restart drop-grafana`
Prometheus	:9090	prometheus.basicconsulting.no	`docker restart drop-prometheus`
Node Exporter	:9100	—	`docker restart drop-node-exporter`

☁️ Cloudflare Tunnel (cloudflared)

LaunchAgent: com.john.cloudflared
Config: ~/.cloudflared/config.yml
Tunnel ID: 3315a609-7934-45c5-ad0c-56d86d16374d

Exposed Services

Hostname	Backend	Purpose
docs.basicconsulting.no	localhost:6875	BookStack wiki
vault.basicconsulting.no	localhost:8200	Vaultwarden
sign.basicconsulting.no	localhost:3003	Documenso (e-signing)
boards.basicconsulting.no	localhost:3100	Planka (kanban)
calendar.basicconsulting.no	localhost:5232	Baikal (CalDAV)
mc.basicconsulting.no	localhost:3030	MC Dashboard
api.basicconsulting.no	localhost:3001	API gateway
drop-api.basicconsulting.no	localhost:3201	Drop API
lobby.basicconsulting.no	localhost:3010	Lobby frontend
lobby-api.basicconsulting.no	localhost:3009	Lobby API
auth.basicconsulting.no	localhost:9000	Authentik (SSO)
grafana.basicconsulting.no	localhost:3300	Grafana dashboards
prometheus.basicconsulting.no	localhost:9090	Prometheus metrics
track.basicconsulting.no	localhost:3456	Email tracking pixel
ssh.basicconsulting.no	localhost:22	SSH access
vnc.basicconsulting.no	localhost:5900	VNC screen sharing

Runbook: Tunnel down

# Check status
launchctl list | grep cloudflared

# Restart
launchctl stop com.john.cloudflared
launchctl start com.john.cloudflared

# Verify
cloudflared tunnel info 3315a609-7934-45c5-ad0c-56d86d16374d

# Logs
tail -50 ~/system/logs/cloudflared.log

🔐 Vaultwarden

Container: vaultwarden | Port: :8200
URL: vault.basicconsulting.no (Cloudflare Access protected)
Local: http://localhost:8200 | HTTPS proxy: https://localhost:8443 (Caddy)
Admin token: In ~/system/services/vaultwarden/.env

Dependencies

Docker
Caddy HTTPS proxy (com.john.caddy-vault) — needed for bw CLI
vault-keeper daemon (com.john.vault-keeper) — auto-unlock

Runbook: Vault locked/unauthenticated

# Check status
NODE_TLS_REJECT_UNAUTHORIZED=0 bw status

# If "locked" — vault-keeper auto-fixes every 15 min. Manual:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw unlock --raw > /tmp/bw-session

# If "unauthenticated" — needs full re-login:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw login --apikey
# Enter client_id and client_secret from ~/system/config/vault-apikey.json
# Then unlock:
NODE_TLS_REJECT_UNAUTHORIZED=0 bw unlock --raw > /tmp/bw-session

# Verify
NODE_TLS_REJECT_UNAUTHORIZED=0 BW_SESSION=$(cat /tmp/bw-session) bw list items --search "Email" | head

Runbook: Caddy proxy down

# Caddy provides HTTPS for bw CLI (self-signed cert)
launchctl list | grep caddy-vault
# Restart
launchctl stop com.john.caddy-vault && launchctl start com.john.caddy-vault
# Verify
curl -sk https://localhost:8443 | head -1

📧 Email System

Daemon: com.john.email-agent (every 5 min)
Accounts: john@basicconsulting.no, info@basicconsulting.no, john@alai.no, alem@alai.no, dev@alai.no
IMAP: imap.one.com:993 | SMTP: send.one.com:465
Credentials: Vaultwarden (via bw CLI)

Runbook: Email agent not processing

# Check logs
tail -30 ~/system/logs/email-agent-launchd.log

# Common issue: Vault not unlocked
NODE_TLS_REJECT_UNAUTHORIZED=0 bw status
# Fix: See Vaultwarden runbook above

# Manual test run
NODE_TLS_REJECT_UNAUTHORIZED=0 node ~/system/daemons/email-agent.js --dry-run

# Restart daemon
launchctl stop com.john.email-agent && launchctl start com.john.email-agent

# Check inbox DB
node -e "const e=require('$HOME/system/tools/email-inbox.js');console.log(JSON.stringify(e.getStats(),null,2))"

💬 Telegram Bot

Daemon: com.john.telegram-agent (KeepAlive)
Bot: @johnbasicas_bot
Config: macOS Keychain (telegram-bot-token)
AI Backend: Claude CLI → Ollama (llama3.1:8b) → static fallback

Runbook: Bot not responding

# Check daemon
launchctl list | grep telegram-agent

# Check logs
tail -20 ~/system/logs/telegram-agent.log

# Restart
launchctl stop com.john.telegram-agent && launchctl start com.john.telegram-agent

# Test AI backend
node -e "const{getResponse}=require('$HOME/system/tools/comms-responder.js');getResponse('test',[]).then(r=>console.log(r.backend,r.text.substring(0,100)))"

# Test connection
node ~/system/tools/telegram-agent.js --test

💬 Slack Bot

Daemon: com.john.slack-bot (KeepAlive)
Workspace: ALAI Holding AS

Runbook: Slack bot not responding

launchctl list | grep slack-bot
tail -20 ~/system/logs/slack-bot.log
launchctl stop com.john.slack-bot && launchctl start com.john.slack-bot

📋 BookStack (Wiki)

Container: bookstack + bookstack_db
Port: :6875 | URL: docs.basicconsulting.no
API config: ~/system/config/bookstack.json (creds in Vaultwarden)

Runbook: BookStack down

cd ~/system/services/bookstack
docker compose ps
docker compose restart
# Check logs
docker logs bookstack --tail 20

📝 Documenso (E-Signing)

Containers: documenso + documenso-db + documenso-minio
Port: :3003 | URL: sign.basicconsulting.no

Runbook: Documenso down

cd ~/system/services/documenso
docker compose ps
docker compose restart
docker logs documenso --tail 20

📋 Planka (Kanban)

Containers: planka + planka-db
Port: :3100 | URL: boards.basicconsulting.no

Runbook: Planka down

cd ~/system/services/planka
docker compose ps
docker compose restart
docker logs planka --tail 20

📅 Baikal (CalDAV/CardDAV)

Container: baikal
Port: :5232 | URL: calendar.basicconsulting.no

Runbook: Baikal down

cd ~/system/services/baikal
docker compose ps
docker compose restart
docker logs baikal --tail 20

🤖 Ollama (Local AI)

Process: ollama serve (background)
Port: :11434
Models: llama3.1:8b, qwen2.5-coder:32b, bge-m3, llama-guard3:8b, custom ALAI models

Runbook: Ollama down

# Check
curl -s http://localhost:11434/api/tags | python3 -m json.tool | head

# Restart
ollama serve &

# Verify models
ollama list

⚙️ Key LaunchAgent Daemons

Daemon	Label	Purpose	Priority
Cloudflared	com.john.cloudflared	Tunnel to internet	P1
Vault Keeper	com.john.vault-keeper	Auto-unlock Vaultwarden	P1
Caddy Vault	com.john.caddy-vault	HTTPS proxy for bw CLI	P1
Slack Bot	com.john.slack-bot	Slack communication	P1
Telegram Agent	com.john.telegram-agent	Telegram bot	P1
Email Agent	com.john.email-agent	Email processing	P1
Email Tracker	com.john.email-tracker	Open/click tracking	P2
Comms Agent	com.john.comms-agent	Cross-platform comms	P2
Ops Watchdog	com.john.ops-watchdog	Service health checks	P1
Event Dispatcher	com.john.event-dispatcher	Event bus processing	P1
Pi Orchestrator	com.john.pi-orchestrator	Task delegation to agents	P1
Autowork	com.john.autowork	Background task execution	P2
N8N	com.john.n8n	Workflow automation	P2
MC Dashboard	com.john.mc-dashboard	Mission Control web UI	P2

Generic daemon restart

# Stop
launchctl stop com.john.<name>
# Start
launchctl start com.john.<name>
# Full reload
launchctl unload ~/Library/LaunchAgents/com.john.<name>.plist
launchctl load ~/Library/LaunchAgents/com.john.<name>.plist
# Check status
launchctl list | grep <name>
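
Checking many daemons one at a time with `grep` gets tedious. The output of `launchctl list` (three columns: PID, last exit status, label) can be parsed in one pass instead. The sketch below is an illustration only and does not describe how `daemon-health.js` works; `parseLaunchctlList` and `findUnhealthy` are hypothetical names. A PID of `-` is treated as normal, since interval-based agents such as the email agent are idle between runs.

```typescript
interface AgentStatus {
  label: string;
  pid: number | null; // null when the job is loaded but not currently running
  lastExitStatus: number; // 0 means the last run exited cleanly
}

function parseLaunchctlList(output: string): AgentStatus[] {
  const agents: AgentStatus[] = [];
  for (const line of output.split("\n")) {
    const parts = line.trim().split(/\s+/);
    if (parts.length !== 3) continue;
    const pidText = parts[0] ?? "";
    const statusText = parts[1] ?? "";
    const label = parts[2] ?? "";
    const lastExitStatus = Number.parseInt(statusText, 10);
    if (Number.isNaN(lastExitStatus)) continue; // header row or malformed line
    const pid = pidText === "-" ? null : Number.parseInt(pidText, 10);
    agents.push({ label, pid, lastExitStatus });
  }
  return agents;
}

// Labels that are not loaded at all, or whose last run exited with a non-zero status.
function findUnhealthy(agents: AgentStatus[], expectedLabels: string[]): string[] {
  const byLabel = new Map(agents.map((a) => [a.label, a] as [string, AgentStatus]));
  return expectedLabels.filter((label) => {
    const agent = byLabel.get(label);
    return agent === undefined || agent.lastExitStatus !== 0;
  });
}
```

Any label returned by `findUnhealthy` is a candidate for the stop/start or unload/load sequence above.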

🔄 Cold Start (Full System Bring-Up)

If the Mac Studio reboots:

# 1. Docker starts automatically (Docker Desktop)
# 2. LaunchAgents auto-load (RunAtLoad=true)
# 3. vault-keeper unlocks Vaultwarden (reads Keychain)
# 4. All services come up within ~2 minutes

# Verify everything:
bash ~/system/ops/cold-start.sh
node ~/system/tools/daemon-health.js
docker ps

🆘 Emergency Contacts

Alem Basic (CEO): alem@alai.no
John (AI Director): john@basicconsulting.no, @johnbasicas_bot (Telegram), #exec (Slack)
