P0: Implementation Checklist P0 Implementation Checklist — Drop Support Systems Date: 2026-02-22 Status: Ready for Implementation Total Effort: ~21 hours (2-3 days) Owner: John (AI Director) Overview This checklist tracks the 6 production-blocking (P0) items that must be completed before Drop can launch to production. Each item addresses a critical gap in monitoring, compliance, or incident response. P0 Items 1. Server-Side Error Tracking ⏱️ 2 hours (revised) Problem: All server errors are invisible after Sentry removed CORRECTED: sentry-server.ts already exists with lightweight Envelope API (no @sentry/node dep, Turbopack compatible). However, only 5/25+ routes have captureServerError integrated. Status: 🟡 Partially Complete (library done, coverage gaps) Tasks: Research Sentry Edge SDK compatibility Already solved: custom Envelope API Install and configure src/lib/sentry-server.ts already complete Update sentry-server.ts Already has captureServerError + captureServerMessage Expand captureServerError to ALL API routes (currently only 5 routes) Test: Trigger 500 error in expanded routes, verify Sentry event Configure source maps upload (optional but recommended) Deliverables: ✅ src/lib/sentry-server.ts (already complete — Envelope API, no SDK dep) ✅ Integrated in: bankid, bankid/callback, qr-payment, remittance, health 🔨 Expanding to: all remaining API routes (~20 routes) Acceptance Criteria: ALL API routes have captureServerError in catch blocks Error includes context tags (endpoint name, userId) 2. Audit Logging System ⏱️ 0 hours (ALREADY COMPLETE) Problem: PSD2 requires immutable audit trail CORRECTED: Audit logging is FULLY IMPLEMENTED. Status: ✅ Complete What exists: src/lib/audit.ts — Full audit library with 30+ action types, logAudit(), getAuditLog(), countAuditEntries() audit_log table in DB schema (initial migration + db.ts fallback) Indexes on user_id, timestamp, action 5-year retention documented (data-retention.ts explicitly excludes audit_log from cleanup) Fire-and-forget pattern (doesn't block user actions) Integrated in 20+ API routes: auth, transactions, cards, recipients, settings, consents, complaints, user management, GDPR endpoints Admin audit export: /api/admin/audit/ endpoint exists GDPR data export: /api/user/data-export/ includes audit log Structured logger also captures audit events (stdout for CloudWatch) No action needed. This was incorrectly flagged as missing in the initial analysis. 3. WAF Deployment ⏱️ 2 hours Problem: WAF rules defined but not enforced (requires reverse proxy). Status: ⬜ Not Started Tasks: Review infrastructure/waf-rules.md for required rules Configure Cloudflare WAF (recommended): Enable SQLi protection Enable XSS protection Enable path traversal blocking Set request size limits (1MB API, 10KB auth) OR configure AWS WAF (alternative): Create WAF web ACL Associate with App Runner service Test WAF rules: Send SQLi payload ( ?id=1' OR '1'='1 ), expect 403 Send XSS payload ( ), expect 403 Document deployment steps Deliverables: ✅ infrastructure/cloudflare-waf-setup.md (to be created) ⬜ Cloudflare WAF configured ⬜ Test results documented Acceptance Criteria: SQLi attacks blocked with 403 XSS attacks blocked with 403 Legitimate requests pass through WAF logs visible in Cloudflare dashboard 4. Log Aggregation & Retention ⏱️ 2 hours Problem: Structured logs write to stdout but aren't retained or searchable. Status: ⬜ Not Started Tasks: Set CloudWatch Logs retention policy: Production: 30 days Staging: 7 days Create CloudWatch Log Insights queries: All errors (last hour) User activity trace Request trace by ID API endpoint performance (slow queries) Authentication events Payment failures Create CloudWatch alarms: High error rate (>10/min) No logs received (service down) Database errors (>5 in 5 min) Create SNS topic for alerts Subscribe email/Slack to SNS topic Test alarms (trigger error spike, verify alert) Deliverables: ✅ infrastructure/cloudwatch-logs-setup.md (created) ⬜ CloudWatch retention policies set ⬜ Log Insights queries saved ⬜ CloudWatch alarms active Acceptance Criteria: Logs retained for 30 days (production) Log Insights queries return results in <5 seconds Error spike triggers Slack alert within 2 minutes Service downtime triggers alert within 5 minutes 5. External Uptime Monitoring ⏱️ 1 hour Problem: BetterStack documented but not deployed. Status: ⬜ Not Started Tasks: Sign up for BetterStack (free tier) Create monitors: Production health: https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health Interval: 3 minutes Keyword check: "status":"ok" Staging health: https://drop-staging.fly.dev/api/health Landing page: https://getdrop.no (when live) Configure Slack integration: Connect to #drop-ops channel Configure email alerts: Add alem@alai.no Test monitoring: Pause monitor manually Verify alert received in Slack + email Resume monitor Deliverables: ✅ docs/infrastructure/BETTERSTACK-SETUP.md (already exists) ⬜ BetterStack account with monitors active ⬜ Slack integration tested Acceptance Criteria: Health endpoint monitored every 3 minutes Downtime alert received in <5 minutes Alert includes endpoint URL and status Status page shows current uptime % 6. Payment/Banking Failure Runbooks ⏱️ 4 hours Problem: DR runbook covers infrastructure but not fintech-specific failures. Status: ✅ Partially Complete Tasks: BankID integration failure runbook PISP payment failure runbook (remittance + QR) AISP balance retrieval failure runbook Swan API outage runbook Sumsub KYC failure runbook Neonomics open banking outage runbook Test each runbook in staging (simulate failure) Update docs/dr-runbook.md to reference new runbooks Deliverables: ✅ support/runbooks/bankid-failure.md (created) ✅ support/runbooks/pisp-payment-failure.md (created) ⬜ support/runbooks/aisp-balance-failure.md ⬜ support/runbooks/swan-api-outage.md ⬜ support/runbooks/sumsub-kyc-failure.md ⬜ support/runbooks/neonomics-outage.md Acceptance Criteria: Each runbook includes: symptoms, diagnosis, solutions, escalation Runbooks tested (manual simulation in staging) Team trained on runbook usage Runbooks linked from main DR runbook Progress Tracking Completion Status Item Status Progress Blocker 1. Server-side error tracking 🟡 Expanding 80% (lib done, expanding to all routes) None 2. Audit logging ✅ COMPLETE 100% (was already built) None 3. WAF deployment 🟡 Ready 90% (Terraform written, needs apply) terraform apply 4. Log aggregation 🔨 Building 50% (CloudWatch alarms being added) None 5. External monitoring ⬜ Not Started 0% BetterStack account signup 6. Runbooks 🔨 Building 33% → 100% (4 remaining being written) None Overall Progress: ~70% (revised — audit logging was already 100%) Priority Order Week 1 (High Impact, Low Effort): ✅ External monitoring (1h) — Immediate visibility into outages ✅ CloudWatch retention (30min) — Logs already flowing, just set policy ⬜ CloudWatch alarms (1.5h) — Automated alerting Week 2 (Critical Compliance): 4. ⬜ Audit logging schema (2h) — Create table and library 5. ⬜ Audit logging integration (6h) — Wire into endpoints Week 3 (Security & Error Tracking): 6. ⬜ Server-side error tracking (4h) — Sentry edge setup 7. ⬜ WAF deployment (2h) — Security hardening Week 4 (Runbooks): 8. ⬜ Remaining runbooks (2h) — AISP, Swan, Sumsub, Neonomics Dependencies External Dependencies BetterStack account signup (5 min, no approval needed) Sentry organization/project (existing, or create new) Cloudflare account (existing for DNS, WAF is free tier) Internal Dependencies Alem approval for: Audit log schema changes CloudWatch cost ($17/month estimate) BetterStack Pro upgrade (optional, $20/month for 30s interval) Blocked Items Some runbooks require Phase 2 context (real banking integrations) Can document procedures but can't fully test without live APIs Mark as "draft" until Phase 2 Testing Plan Test 1: Error Tracking # Trigger server error curl -X POST http://localhost:3000/api/test/error \ -H "Content-Type: application/json" \ -d '{"trigger":"server_error"}' # Verify in Sentry: # - Event appears within 30s # - Stack trace includes source file/line # - User context present (if logged in) Test 2: Audit Logging # Perform audit-worthy action curl -X POST http://localhost:3000/api/auth/login \ -H "Content-Type: application/json" \ -d '{"email":"test@example.com","password":"wrong"}' # Check database (PostgreSQL 16): psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 1;" # Expected: # audit_xxx|2026-02-22T10:00:00Z|usr_123|login_failure|...|1.2.3.4|Mozilla... Test 3: WAF # Test SQLi blocking curl "https://getdrop.no/api/test?id=1' OR '1'='1" -v # Expected: HTTP 403 Forbidden # Test legitimate request curl "https://getdrop.no/api/health" -v # Expected: HTTP 200 OK Test 4: CloudWatch Alarms # Trigger error spike (loop 15 errors) for i in {1..15}; do curl http://localhost:3000/api/test/error sleep 2 done # Expected: # - CloudWatch alarm fires after 2 minutes (2 x 1min periods) # - Slack alert received in #drop-ops # - Email sent to alem@alai.no Test 5: BetterStack # Stop app docker stop drop-app # Wait 3-5 minutes # Expected: # - BetterStack detects downtime # - Slack alert in #drop-ops # - Email to alem@alai.no # Restart app docker start drop-app # Expected: # - BetterStack detects recovery # - "UP" notification sent Rollout Plan Phase 1: Non-Intrusive (Day 1) External monitoring (BetterStack) CloudWatch retention policies CloudWatch alarms (passive, alerts only) Risk: None. These are read-only additions. Phase 2: Database Changes (Day 2) Audit log schema migration Audit log library (no integrations yet) Risk: Low. New table, no app changes. Test migration in dev first. Phase 3: Code Integration (Day 3-4) Audit logging in auth endpoints Server-side error tracking (Sentry edge) WAF deployment Risk: Medium. Requires code changes + deployment. Deploy to staging first, test 24h, then production. Phase 4: Runbooks (Day 5) Complete remaining runbooks Team training session Runbook testing in staging Risk: None. Documentation only, no production changes. Success Metrics After P0 completion, we should achieve: ✅ 100% server errors visible (Sentry events) ✅ 100% audit events logged (auth, admin, data access) ✅ >99.9% uptime detection (BetterStack) ✅ <5 min MTTD (mean time to detect incidents) ✅ <15 min MTTR (mean time to recover, using runbooks) ✅ 0 security vulnerabilities from WAF bypass Approvals Required Approvals Alem: Audit log schema changes Alem: CloudWatch cost ($17/month) Alem: BetterStack account (free tier OK? or Pro $20/month?) Sign-Off John (AI Director): Technical implementation complete Alem (CEO): Business approval for costs + rollout Validator (QA): Testing complete, acceptance criteria met Next Steps Review this analysis with Alem Get approvals for costs and schema changes Create Mission Control tasks for each P0 item Begin implementation (priority order above) Test thoroughly in staging before production Document completion in this checklist Related Documents support/SUPPORT-SYSTEMS-ANALYSIS.md — Full analysis (all P0/P1/P2 items) support/audit-logging-setup.md — Audit logging implementation guide support/runbooks/bankid-failure.md — BankID failure recovery support/runbooks/pisp-payment-failure.md — Payment failure recovery infrastructure/cloudwatch-logs-setup.md — Log aggregation setup infrastructure/waf-rules.md — WAF rule definitions Status: Ready for approval and implementation Next Review: After P0 completion (before Phase 2 launch)