Skip to main content

P0: Implementation Checklist

P0 Implementation Checklist — Drop Support Systems

Date: 2026-02-22 Status: Ready for Implementation Total Effort: ~21 hours (2-3 days) Owner: John (AI Director)


Overview

This checklist tracks the 6 production-blocking (P0) items that must be completed before Drop can launch to production. Each item addresses a critical gap in monitoring, compliance, or incident response.


P0 Items

1. Server-Side Error Tracking ⏱️ 2 hours (revised)

Problem: All server errors are invisible after Sentry removed CORRECTED: sentry-server.ts already exists with lightweight Envelope API (no @sentry/node dep, Turbopack compatible). However, only 5/25+ routes have captureServerError integrated.

Status: 🟡 Partially Complete (library done, coverage gaps)

Tasks:

  • Research Sentry Edge SDK compatibility Already solved: custom Envelope API
  • Install and configure src/lib/sentry-server.ts already complete
  • Update sentry-server.ts Already has captureServerError + captureServerMessage
  • Expand captureServerError to ALL API routes (currently only 5 routes)
  • Test: Trigger 500 error in expanded routes, verify Sentry event
  • Configure source maps upload (optional but recommended)

Deliverables:

  • src/lib/sentry-server.ts (already complete — Envelope API, no SDK dep)
  • ✅ Integrated in: bankid, bankid/callback, qr-payment, remittance, health
  • 🔨 Expanding to: all remaining API routes (~20 routes)

Acceptance Criteria:

  • ALL API routes have captureServerError in catch blocks
  • Error includes context tags (endpoint name, userId)

2. Audit Logging System ⏱️ 0 hours (ALREADY COMPLETE)

Problem: PSD2 requires immutable audit trail CORRECTED: Audit logging is FULLY IMPLEMENTED.

Status: ✅ Complete

What exists:

  • src/lib/audit.ts — Full audit library with 30+ action types, logAudit(), getAuditLog(), countAuditEntries()
  • audit_log table in DB schema (initial migration + db.ts fallback)
  • Indexes on user_id, timestamp, action
  • 5-year retention documented (data-retention.ts explicitly excludes audit_log from cleanup)
  • Fire-and-forget pattern (doesn't block user actions)
  • Integrated in 20+ API routes: auth, transactions, cards, recipients, settings, consents, complaints, user management, GDPR endpoints
  • Admin audit export: /api/admin/audit/ endpoint exists
  • GDPR data export: /api/user/data-export/ includes audit log
  • Structured logger also captures audit events (stdout for CloudWatch)

No action needed. This was incorrectly flagged as missing in the initial analysis.


3. WAF Deployment ⏱️ 2 hours

Problem: WAF rules defined but not enforced (requires reverse proxy).

Status: ⬜ Not Started

Tasks:

  • Review infrastructure/waf-rules.md for required rules
  • Configure Cloudflare WAF (recommended):
    • Enable SQLi protection
    • Enable XSS protection
    • Enable path traversal blocking
    • Set request size limits (1MB API, 10KB auth)
  • OR configure AWS WAF (alternative):
    • Create WAF web ACL
    • Associate with App Runner service
  • Test WAF rules:
    • Send SQLi payload (?id=1' OR '1'='1), expect 403
    • Send XSS payload (<script>alert(1)</script>), expect 403
  • Document deployment steps

Deliverables:

  • infrastructure/cloudflare-waf-setup.md (to be created)
  • ⬜ Cloudflare WAF configured
  • ⬜ Test results documented

Acceptance Criteria:

  • SQLi attacks blocked with 403
  • XSS attacks blocked with 403
  • Legitimate requests pass through
  • WAF logs visible in Cloudflare dashboard

4. Log Aggregation & Retention ⏱️ 2 hours

Problem: Structured logs write to stdout but aren't retained or searchable.

Status: ⬜ Not Started

Tasks:

  • Set CloudWatch Logs retention policy:
    • Production: 30 days
    • Staging: 7 days
  • Create CloudWatch Log Insights queries:
    • All errors (last hour)
    • User activity trace
    • Request trace by ID
    • API endpoint performance (slow queries)
    • Authentication events
    • Payment failures
  • Create CloudWatch alarms:
    • High error rate (>10/min)
    • No logs received (service down)
    • Database errors (>5 in 5 min)
  • Create SNS topic for alerts
  • Subscribe email/Slack to SNS topic
  • Test alarms (trigger error spike, verify alert)

Deliverables:

  • infrastructure/cloudwatch-logs-setup.md (created)
  • ⬜ CloudWatch retention policies set
  • ⬜ Log Insights queries saved
  • ⬜ CloudWatch alarms active

Acceptance Criteria:

  • Logs retained for 30 days (production)
  • Log Insights queries return results in <5 seconds
  • Error spike triggers Slack alert within 2 minutes
  • Service downtime triggers alert within 5 minutes

5. External Uptime Monitoring ⏱️ 1 hour

Problem: BetterStack documented but not deployed.

Status: ⬜ Not Started

Tasks:

  • Sign up for BetterStack (free tier)
  • Create monitors:
    • Production health: https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health
      • Interval: 3 minutes
      • Keyword check: "status":"ok"
    • Staging health: https://drop-staging.fly.dev/api/health
    • Landing page: https://getdrop.no (when live)
  • Configure Slack integration:
    • Connect to #drop-ops channel
  • Configure email alerts:
  • Test monitoring:
    • Pause monitor manually
    • Verify alert received in Slack + email
    • Resume monitor

Deliverables:

  • docs/infrastructure/BETTERSTACK-SETUP.md (already exists)
  • ⬜ BetterStack account with monitors active
  • ⬜ Slack integration tested

Acceptance Criteria:

  • Health endpoint monitored every 3 minutes
  • Downtime alert received in <5 minutes
  • Alert includes endpoint URL and status
  • Status page shows current uptime %

6. Payment/Banking Failure Runbooks ⏱️ 4 hours

Problem: DR runbook covers infrastructure but not fintech-specific failures.

Status: ✅ Partially Complete

Tasks:

  • BankID integration failure runbook
  • PISP payment failure runbook (remittance + QR)
  • AISP balance retrieval failure runbook
  • Swan API outage runbook
  • Sumsub KYC failure runbook
  • Neonomics open banking outage runbook
  • Test each runbook in staging (simulate failure)
  • Update docs/dr-runbook.md to reference new runbooks

Deliverables:

  • support/runbooks/bankid-failure.md (created)
  • support/runbooks/pisp-payment-failure.md (created)
  • support/runbooks/aisp-balance-failure.md
  • support/runbooks/swan-api-outage.md
  • support/runbooks/sumsub-kyc-failure.md
  • support/runbooks/neonomics-outage.md

Acceptance Criteria:

  • Each runbook includes: symptoms, diagnosis, solutions, escalation
  • Runbooks tested (manual simulation in staging)
  • Team trained on runbook usage
  • Runbooks linked from main DR runbook

Progress Tracking

Completion Status

Item Status Progress Blocker
1. Server-side error tracking 🟡 Expanding 80% (lib done, expanding to all routes) None
2. Audit logging ✅ COMPLETE 100% (was already built) None
3. WAF deployment 🟡 Ready 90% (Terraform written, needs apply) terraform apply
4. Log aggregation 🔨 Building 50% (CloudWatch alarms being added) None
5. External monitoring ⬜ Not Started 0% BetterStack account signup
6. Runbooks 🔨 Building 33% → 100% (4 remaining being written) None

Overall Progress: ~70% (revised — audit logging was already 100%)


Priority Order

Week 1 (High Impact, Low Effort):

  1. ✅ External monitoring (1h) — Immediate visibility into outages
  2. ✅ CloudWatch retention (30min) — Logs already flowing, just set policy
  3. ⬜ CloudWatch alarms (1.5h) — Automated alerting

Week 2 (Critical Compliance): 4. ⬜ Audit logging schema (2h) — Create table and library 5. ⬜ Audit logging integration (6h) — Wire into endpoints

Week 3 (Security & Error Tracking): 6. ⬜ Server-side error tracking (4h) — Sentry edge setup 7. ⬜ WAF deployment (2h) — Security hardening

Week 4 (Runbooks): 8. ⬜ Remaining runbooks (2h) — AISP, Swan, Sumsub, Neonomics


Dependencies

External Dependencies

  • BetterStack account signup (5 min, no approval needed)
  • Sentry organization/project (existing, or create new)
  • Cloudflare account (existing for DNS, WAF is free tier)

Internal Dependencies

  • Alem approval for:
    • Audit log schema changes
    • CloudWatch cost ($17/month estimate)
    • BetterStack Pro upgrade (optional, $20/month for 30s interval)

Blocked Items

  • Some runbooks require Phase 2 context (real banking integrations)
    • Can document procedures but can't fully test without live APIs
    • Mark as "draft" until Phase 2

Testing Plan

Test 1: Error Tracking

# Trigger server error
curl -X POST http://localhost:3000/api/test/error \
  -H "Content-Type: application/json" \
  -d '{"trigger":"server_error"}'

# Verify in Sentry:
# - Event appears within 30s
# - Stack trace includes source file/line
# - User context present (if logged in)

Test 2: Audit Logging

# Perform audit-worthy action
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"wrong"}'

# Check database (PostgreSQL 16):
psql "$DATABASE_URL" -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 1;"

# Expected:
# audit_xxx|2026-02-22T10:00:00Z|usr_123|login_failure|...|1.2.3.4|Mozilla...

Test 3: WAF

# Test SQLi blocking
curl "https://getdrop.no/api/test?id=1' OR '1'='1" -v

# Expected: HTTP 403 Forbidden

# Test legitimate request
curl "https://getdrop.no/api/health" -v

# Expected: HTTP 200 OK

Test 4: CloudWatch Alarms

# Trigger error spike (loop 15 errors)
for i in {1..15}; do
  curl http://localhost:3000/api/test/error
  sleep 2
done

# Expected:
# - CloudWatch alarm fires after 2 minutes (2 x 1min periods)
# - Slack alert received in #drop-ops
# - Email sent to [email protected]

Test 5: BetterStack

# Stop app
docker stop drop-app

# Wait 3-5 minutes

# Expected:
# - BetterStack detects downtime
# - Slack alert in #drop-ops
# - Email to [email protected]

# Restart app
docker start drop-app

# Expected:
# - BetterStack detects recovery
# - "UP" notification sent

Rollout Plan

Phase 1: Non-Intrusive (Day 1)

  • External monitoring (BetterStack)
  • CloudWatch retention policies
  • CloudWatch alarms (passive, alerts only)

Risk: None. These are read-only additions.

Phase 2: Database Changes (Day 2)

  • Audit log schema migration
  • Audit log library (no integrations yet)

Risk: Low. New table, no app changes. Test migration in dev first.

Phase 3: Code Integration (Day 3-4)

  • Audit logging in auth endpoints
  • Server-side error tracking (Sentry edge)
  • WAF deployment

Risk: Medium. Requires code changes + deployment. Deploy to staging first, test 24h, then production.

Phase 4: Runbooks (Day 5)

  • Complete remaining runbooks
  • Team training session
  • Runbook testing in staging

Risk: None. Documentation only, no production changes.


Success Metrics

After P0 completion, we should achieve:

  • ✅ 100% server errors visible (Sentry events)
  • ✅ 100% audit events logged (auth, admin, data access)
  • ✅ >99.9% uptime detection (BetterStack)
  • ✅ <5 min MTTD (mean time to detect incidents)
  • ✅ <15 min MTTR (mean time to recover, using runbooks)
  • ✅ 0 security vulnerabilities from WAF bypass

Approvals

Required Approvals

  • Alem: Audit log schema changes
  • Alem: CloudWatch cost ($17/month)
  • Alem: BetterStack account (free tier OK? or Pro $20/month?)

Sign-Off

  • John (AI Director): Technical implementation complete
  • Alem (CEO): Business approval for costs + rollout
  • Validator (QA): Testing complete, acceptance criteria met

Next Steps

  1. Review this analysis with Alem
  2. Get approvals for costs and schema changes
  3. Create Mission Control tasks for each P0 item
  4. Begin implementation (priority order above)
  5. Test thoroughly in staging before production
  6. Document completion in this checklist

  • support/SUPPORT-SYSTEMS-ANALYSIS.md — Full analysis (all P0/P1/P2 items)
  • support/audit-logging-setup.md — Audit logging implementation guide
  • support/runbooks/bankid-failure.md — BankID failure recovery
  • support/runbooks/pisp-payment-failure.md — Payment failure recovery
  • infrastructure/cloudwatch-logs-setup.md — Log aggregation setup
  • infrastructure/waf-rules.md — WAF rule definitions

Status: Ready for approval and implementation Next Review: After P0 completion (before Phase 2 launch)