P0 Implementation Checklist — Drop Support Systems
Date: 2026-02-22
Status: Ready for Implementation
Total Effort: ~21 hours (2-3 days) in the initial estimate; the revised per-item estimates below sum to 11 hours
Owner: John (AI Director)
Overview
This checklist tracks the 6 production-blocking (P0) items that must be completed before Drop can launch to production. Each item addresses a critical gap in monitoring, compliance, or incident response.
P0 Items
1. Server-Side Error Tracking ⏱️ 2 hours (revised)
Problem (initial analysis, superseded): All server errors are invisible after Sentry was removed.
CORRECTED: sentry-server.ts already exists and uses a lightweight Envelope API client (no @sentry/node dependency, Turbopack compatible). However, only 5 of 25+ routes have captureServerError integrated.
Status: 🟡 Partially Complete (library done, coverage gaps)
Tasks:
- Research Sentry Edge SDK compatibility (already solved: custom Envelope API)
- Install and configure (src/lib/sentry-server.ts already complete)
- Update sentry-server.ts (already has captureServerError + captureServerMessage)
- Expand captureServerError to ALL API routes (currently only 5 routes)
- Test: Trigger 500 error in expanded routes, verify Sentry event
- Configure source maps upload (optional but recommended)
Deliverables:
- ✅ src/lib/sentry-server.ts (already complete: Envelope API, no SDK dependency)
- ✅ Integrated in: bankid, bankid/callback, qr-payment, remittance, health
- 🔨 Expanding to: all remaining API routes (~20 routes)
Acceptance Criteria:
- ALL API routes have captureServerError in catch blocks
- Error includes context tags (endpoint name, userId)
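One way to reach full coverage without hand-editing every catch block is a shared wrapper. The sketch below is an illustration only: `withErrorCapture`, `CaptureFn`, `ApiResult`, and the context shape are hypothetical names, and the real `captureServerError` signature in src/lib/sentry-server.ts may differ.

```typescript
// Hypothetical signature: the real captureServerError may take different arguments.
type CaptureFn = (
  error: unknown,
  context: { endpoint: string; userId: string | undefined },
) => void;

// Simplified stand-in for a route response.
interface ApiResult {
  status: number;
  body: unknown;
}

type Handler<Req> = (req: Req) => Promise<ApiResult>;

// Wraps a route handler so every thrown error is reported with context tags
// (endpoint name, userId) before a generic 500 is returned to the client.
function withErrorCapture<Req>(
  endpoint: string,
  getUserId: (req: Req) => string | undefined,
  capture: CaptureFn,
  handler: Handler<Req>,
): Handler<Req> {
  return async (req: Req): Promise<ApiResult> => {
    try {
      return await handler(req);
    } catch (error) {
      capture(error, { endpoint, userId: getUserId(req) });
      return { status: 500, body: { error: "Internal server error" } };
    }
  };
}
```

Passing the capture function as a parameter keeps the wrapper testable without a Sentry DSN. In the routes themselves it would presumably be bound to `captureServerError` once.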
2. Audit Logging System ⏱️ 0 hours (ALREADY COMPLETE)
Problem (initial analysis, superseded): PSD2 requires an immutable audit trail.
CORRECTED: Audit logging is FULLY IMPLEMENTED.
Status: ✅ Complete
What exists:
- src/lib/audit.ts: full audit library with 30+ action types, logAudit(), getAuditLog(), countAuditEntries()
- audit_log table in DB schema (initial migration + db.ts fallback)
- Indexes on user_id, timestamp, action
- 5-year retention documented (data-retention.ts explicitly excludes audit_log from cleanup)
- Fire-and-forget pattern (doesn't block user actions)
- Integrated in 20+ API routes: auth, transactions, cards, recipients, settings, consents, complaints, user management, GDPR endpoints
- Admin audit export: /api/admin/audit/ endpoint exists
- GDPR data export: /api/user/data-export/ includes audit log
- Structured logger also captures audit events (stdout for CloudWatch)
No action needed. This was incorrectly flagged as missing in the initial analysis.
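For reference, the fire-and-forget pattern listed above can be sketched as follows. This is not the code in src/lib/audit.ts: `AuditEntry`, `AuditWriter`, and `logAuditFireAndForget` are hypothetical names, and the real `logAudit()` arguments are not shown in this checklist.

```typescript
// Hypothetical entry shape, for illustration only.
interface AuditEntry {
  userId: string;
  action: string;
  timestamp: string;
}

type AuditWriter = (entry: AuditEntry) => Promise<void>;

// Starts the audit write without awaiting it, so the user-facing response is
// not delayed. Failures (sync or async) go to onError instead of the caller.
function logAuditFireAndForget(
  write: AuditWriter,
  entry: AuditEntry,
  onError: (error: unknown) => void,
): void {
  void Promise.resolve()
    .then(() => write(entry))
    .catch(onError);
}
```

The trade-off is that a write still in flight when the process exits can be lost, so the `onError` path (for example the structured logger) matters for a compliance trail.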
3. WAF Deployment ⏱️ 2 hours
Problem: WAF rules defined but not enforced (requires reverse proxy).
Status: ⬜ Not Started
Tasks:
- Review infrastructure/waf-rules.md for required rules
- Configure Cloudflare WAF (recommended):
- Enable SQLi protection
- Enable XSS protection
- Enable path traversal blocking
- Set request size limits (1MB API, 10KB auth)
- OR configure AWS WAF (alternative):
- Create WAF web ACL
- Associate with App Runner service
- Test WAF rules:
- Send SQLi payload (?id=1' OR '1'='1), expect 403
- Send XSS payload (<script>alert(1)</script>), expect 403
- Document deployment steps
Deliverables:
- ⬜ infrastructure/cloudflare-waf-setup.md (to be created)
- ⬜ Cloudflare WAF configured
- ⬜ Test results documented
Acceptance Criteria:
- SQLi attacks blocked with 403
- XSS attacks blocked with 403
- Legitimate requests pass through
- WAF logs visible in Cloudflare dashboard
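The WAF test payloads contain spaces, quotes, and angle brackets, so they need to be percent-encoded before they are sent. A minimal sketch of a probe helper follows; the function names are hypothetical, and the expected statuses follow the acceptance criteria above.

```typescript
// Payloads copied from the test tasks above.
const SQLI_PAYLOAD = "1' OR '1'='1";
const XSS_PAYLOAD = "<script>alert(1)</script>";

// Builds a probe URL with the payload encoded as a query parameter, so the
// request reaches the WAF as a valid URL instead of failing in the client.
function buildProbeUrl(baseUrl: string, param: string, payload: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set(param, payload);
  return url.toString();
}

// Malicious probes should be blocked with 403; legitimate requests should get 200.
function probePassed(kind: "malicious" | "legitimate", status: number): boolean {
  return kind === "malicious" ? status === 403 : status === 200;
}
```

Each probe URL can then be requested with any HTTP client and the status code passed to `probePassed` when documenting test results.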
4. Log Aggregation & Retention ⏱️ 2 hours
Problem: Structured logs write to stdout but aren't retained or searchable.
Status: ⬜ Not Started
Tasks:
- Set CloudWatch Logs retention policy:
- Production: 30 days
- Staging: 7 days
- Create CloudWatch Log Insights queries:
- All errors (last hour)
- User activity trace
- Request trace by ID
- API endpoint performance (slow queries)
- Authentication events
- Payment failures
- Create CloudWatch alarms:
- High error rate (>10/min)
- No logs received (service down)
- Database errors (>5 in 5 min)
- Create SNS topic for alerts
- Subscribe email/Slack to SNS topic
- Test alarms (trigger error spike, verify alert)
Deliverables:
- ✅ infrastructure/cloudwatch-logs-setup.md (created)
- ⬜ CloudWatch retention policies set
- ⬜ Log Insights queries saved
- ⬜ CloudWatch alarms active
Acceptance Criteria:
- Logs retained for 30 days (production)
- Log Insights queries return results in <5 seconds
- Error spike triggers Slack alert within 2 minutes
- Service downtime triggers alert within 5 minutes
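The high-error-rate alarm combines a threshold (>10/min) with a number of consecutive breaching periods. The sketch below is a simplified model of that evaluation, assuming one datapoint per minute and no missing data. It is not the CloudWatch implementation, and `alarmFires` is a hypothetical name.

```typescript
// Returns true when the most recent `evaluationPeriods` buckets all exceed
// the threshold, i.e. the alarm needs N consecutive breaching periods.
function alarmFires(
  errorsPerPeriod: number[],
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (evaluationPeriods <= 0 || errorsPerPeriod.length < evaluationPeriods) {
    return false;
  }
  return errorsPerPeriod
    .slice(-evaluationPeriods)
    .every((count) => count > threshold);
}

// A 30-second burst of 15 errors breaches only one 1-minute period.
const shortBurst = alarmFires([15, 0], 10, 2); // false

// A sustained spike breaches two consecutive periods.
const sustained = alarmFires([12, 14], 10, 2); // true
```

Under this model, a test spike has to stay above the threshold for at least as many minutes as the alarm has evaluation periods.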
5. External Uptime Monitoring ⏱️ 1 hour
Problem: BetterStack documented but not deployed.
Status: ⬜ Not Started
Tasks:
- Sign up for BetterStack (free tier)
- Create monitors:
- Production health: https://9ef3szvvsb.eu-west-1.awsapprunner.com/api/health
- Interval: 3 minutes
- Keyword check: "status":"ok"
- Staging health: https://drop-staging.fly.dev/api/health
- Landing page: https://getdrop.no (when live)
- Configure Slack integration:
- Connect to #drop-ops channel
- Configure email alerts:
- Test monitoring:
- Pause monitor manually
- Verify alert received in Slack + email
- Resume monitor
Deliverables:
- ✅ docs/infrastructure/BETTERSTACK-SETUP.md (already exists)
- ⬜ BetterStack account with monitors active
- ⬜ Slack integration tested
Acceptance Criteria:
- Health endpoint monitored every 3 minutes
- Downtime alert received in <5 minutes
- Alert includes endpoint URL and status
- Status page shows current uptime %
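The keyword check for the health monitors can be reproduced locally before they are created. This is a minimal sketch of the check as described above; `isHealthy` is a hypothetical helper, not part of BetterStack or the Drop codebase.

```typescript
// A response counts as healthy only if it returns HTTP 200 AND the body
// contains the exact keyword configured in the monitor.
function isHealthy(
  status: number,
  body: string,
  keyword: string = '"status":"ok"',
): boolean {
  return status === 200 && body.includes(keyword);
}
```

The match is an exact substring, so a body serialized as `{"status": "ok"}` (with a space after the colon) would fail it. It is worth confirming how /api/health serializes its JSON before relying on the keyword.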
6. Payment/Banking Failure Runbooks ⏱️ 4 hours
Problem: DR runbook covers infrastructure but not fintech-specific failures.
Status: 🟡 Partially Complete
Tasks:
- BankID integration failure runbook
- PISP payment failure runbook (remittance + QR)
- AISP balance retrieval failure runbook
- Swan API outage runbook
- Sumsub KYC failure runbook
- Neonomics open banking outage runbook
- Test each runbook in staging (simulate failure)
- Update docs/dr-runbook.md to reference new runbooks
Deliverables:
- ✅ support/runbooks/bankid-failure.md (created)
- ✅ support/runbooks/pisp-payment-failure.md (created)
- ⬜ support/runbooks/aisp-balance-failure.md
- ⬜ support/runbooks/swan-api-outage.md
- ⬜ support/runbooks/sumsub-kyc-failure.md
- ⬜ support/runbooks/neonomics-outage.md
Acceptance Criteria:
- Each runbook includes: symptoms, diagnosis, solutions, escalation
- Runbooks tested (manual simulation in staging)
- Team trained on runbook usage
- Runbooks linked from main DR runbook
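The first acceptance criterion (symptoms, diagnosis, solutions, escalation) can be checked mechanically. The sketch below assumes each runbook marks those sections with markdown headings, which this checklist does not confirm; `missingRunbookSections` is a hypothetical helper.

```typescript
// Section names taken from the acceptance criteria above.
const REQUIRED_SECTIONS = ["symptoms", "diagnosis", "solutions", "escalation"] as const;

// Returns the required sections that have no matching markdown heading.
function missingRunbookSections(markdown: string): string[] {
  const headings = markdown
    .split("\n")
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.toLowerCase());
  return REQUIRED_SECTIONS.filter(
    (section) => !headings.some((heading) => heading.includes(section)),
  );
}
```

An empty result means all four sections are present. The check says nothing about whether the content of each section is adequate.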
Progress Tracking
Completion Status
| Item | Status | Progress | Blocker |
|---|---|---|---|
| 1. Server-side error tracking | 🟡 Expanding | 80% (lib done, expanding to all routes) | None |
| 2. Audit logging | ✅ COMPLETE | 100% (was already built) | None |
| 3. WAF deployment | 🟡 Ready | 90% (Terraform written, needs apply) | terraform apply |
| 4. Log aggregation | 🔨 Building | 50% (CloudWatch alarms being added) | None |
| 5. External monitoring | ⬜ Not Started | 0% | BetterStack account signup |
| 6. Runbooks | 🔨 Building | 33% → 100% (4 remaining being written) | None |
Overall Progress: ~70% (revised — audit logging was already 100%)
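The ~70% figure appears to be an unweighted mean of the six items with runbooks counted at their 100% target; that reading is an assumption, as the checklist does not state how the figure was derived. The arithmetic, using the table's own percentages:

```typescript
// Progress values copied from the Completion Status table (items 1-6).
const reportedProgress: number[] = [80, 100, 90, 50, 0, 33];

// Unweighted mean, rounded to the nearest whole percent.
function overallProgress(percentages: number[]): number {
  const total = percentages.reduce((sum, value) => sum + value, 0);
  return Math.round(total / percentages.length);
}

// Runbooks counted at their current 33%: 353 / 6.
const currentProgress = overallProgress(reportedProgress); // 59

// Runbooks counted at the 100% target ("33% → 100%"): 420 / 6.
const progressWithRunbooksDone = overallProgress([80, 100, 90, 50, 0, 100]); // 70
```

Until the four remaining runbooks are written, the unweighted figure is closer to 59%.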
Priority Order
Week 1 (High Impact, Low Effort):
1. ✅ External monitoring (1h): immediate visibility into outages
2. ✅ CloudWatch retention (30min): logs already flowing, just set policy
3. ⬜ CloudWatch alarms (1.5h): automated alerting
Week 2 (Critical Compliance):
4. ✅ Audit logging schema (0h, originally 2h): table and library already exist (see item 2)
5. ✅ Audit logging integration (0h, originally 6h): already wired into 20+ API routes
Week 3 (Security & Error Tracking):
6. ⬜ Server-side error tracking (2h, revised from 4h): expand captureServerError to all API routes
7. ⬜ WAF deployment (2h): security hardening
Week 4 (Runbooks):
8. ⬜ Remaining runbooks (2h): AISP, Swan, Sumsub, Neonomics
Dependencies
External Dependencies
- BetterStack account signup (5 min, no approval needed)
- Sentry organization/project (existing, or create new)
- Cloudflare account (existing for DNS, WAF is free tier)
Internal Dependencies
- Alem approval for:
- Audit log schema changes
- CloudWatch cost ($17/month estimate)
- BetterStack Pro upgrade (optional, $20/month for 30s interval)
Blocked Items
- Some runbooks require Phase 2 context (real banking integrations)
- Can document procedures but can't fully test without live APIs
- Mark as "draft" until Phase 2
Testing Plan
Test 1: Error Tracking
# Trigger server error
curl -X POST http://localhost:3000/api/test/error \
  -H "Content-Type: application/json" \
  -d '{"trigger":"server_error"}'
# Verify in Sentry:
# - Event appears within 30s
# - Stack trace includes source file/line
# - User context present (if logged in)
Test 2: Audit Logging
# Perform audit-worthy action
curl -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"wrong"}'
# Check database (PostgreSQL 16); -At prints unaligned, tuples-only rows separated by "|"
psql "$DATABASE_URL" -At -c "SELECT * FROM audit_log ORDER BY timestamp DESC LIMIT 1;"
# Expected (one pipe-separated row):
# audit_xxx|2026-02-22T10:00:00Z|usr_123|login_failure|...|1.2.3.4|Mozilla...
Test 3: WAF
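# Test SQLi blocking (-G with --data-urlencode sends the payload as an encoded query string;
# a raw URL containing spaces is rejected by curl before it reaches the WAF)
curl -v -G "https://getdrop.no/api/test" --data-urlencode "id=1' OR '1'='1"
# Expected: HTTP 403 Forbidden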
# Test legitimate request
curl "https://getdrop.no/api/health" -v
# Expected: HTTP 200 OK
Test 4: CloudWatch Alarms
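# Trigger a sustained error spike: ~30 errors/min for 3 minutes,
# so two consecutive 1-minute periods exceed the 10/min threshold
for i in {1..90}; do
  curl -s -o /dev/null http://localhost:3000/api/test/error
  sleep 2
done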
# Expected:
# - CloudWatch alarm fires after 2 minutes (2 x 1min periods)
# - Slack alert received in #drop-ops
# - Email sent to [email protected]
Test 5: BetterStack
# Stop app
docker stop drop-app
# Wait 3-5 minutes
# Expected:
# - BetterStack detects downtime
# - Slack alert in #drop-ops
# - Email to [email protected]
# Restart app
docker start drop-app
# Expected:
# - BetterStack detects recovery
# - "UP" notification sent
Rollout Plan
Phase 1: Non-Intrusive (Day 1)
- External monitoring (BetterStack)
- CloudWatch retention policies
- CloudWatch alarms (passive, alerts only)
Risk: None. These are read-only additions.
Phase 2: Database Changes (Day 2)
- Audit log schema migration
- Audit log library (no integrations yet)
Risk: Low. New table, no app changes. Test migration in dev first.
Phase 3: Code Integration (Day 3-4)
- Audit logging in auth endpoints
- Server-side error tracking (Sentry edge)
- WAF deployment
Risk: Medium. Requires code changes + deployment. Deploy to staging first, test 24h, then production.
Phase 4: Runbooks (Day 5)
- Complete remaining runbooks
- Team training session
- Runbook testing in staging
Risk: None. Documentation only, no production changes.
Success Metrics
After P0 completion, we should achieve:
- ✅ 100% server errors visible (Sentry events)
- ✅ 100% audit events logged (auth, admin, data access)
- ✅ >99.9% uptime detection (BetterStack)
- ✅ <5 min MTTD (mean time to detect incidents)
- ✅ <15 min MTTR (mean time to recover, using runbooks)
- ✅ 0 security vulnerabilities from WAF bypass
Approvals
Required Approvals
- Alem: Audit log schema changes
- Alem: CloudWatch cost ($17/month)
- Alem: BetterStack account (free tier OK? or Pro $20/month?)
Sign-Off
- John (AI Director): Technical implementation complete
- Alem (CEO): Business approval for costs + rollout
- Validator (QA): Testing complete, acceptance criteria met
Next Steps
- Review this analysis with Alem
- Get approvals for costs and schema changes
- Create Mission Control tasks for each P0 item
- Begin implementation (priority order above)
- Test thoroughly in staging before production
- Document completion in this checklist
Related Documents
- support/SUPPORT-SYSTEMS-ANALYSIS.md: full analysis (all P0/P1/P2 items)
- support/audit-logging-setup.md: audit logging implementation guide
- support/runbooks/bankid-failure.md: BankID failure recovery
- support/runbooks/pisp-payment-failure.md: payment failure recovery
- infrastructure/cloudwatch-logs-setup.md: log aggregation setup
- infrastructure/waf-rules.md: WAF rule definitions
Status: Ready for approval and implementation
Next Review: After P0 completion (before Phase 2 launch)