# Infrastructure & DevOps

Deployment, CI/CD, monitoring, DevOps/SRE stack, WAF

## Deployment & Environment

### Environment Setup

**Drop Environment Configuration**

Last updated: 2026-02-13
Source: `src/drop-app/package.json`, `next.config.ts`, `Dockerfile`, `docker-compose.yml`, `fly.toml`

#### Technology Stack

| Layer | Technology | Version | Source |
|---|---|---|---|
| Runtime | Node.js | 22 (Alpine) | `Dockerfile:2` |
| Framework | Next.js | 16.1.6 | `package.json:14` |
| UI | React | 19.2.3 | `package.json:15-16` |
| Database (all environments) | PostgreSQL 16 via Drizzle ORM | `drizzle-orm` | `src/shared/db/schema.ts` |
| Auth | JWT via jose | ^6.1.3 | `package.json:8` |
| Password hashing | bcryptjs | ^3.0.3 | `package.json:5` |
| Styling | Tailwind CSS | ^4 | `package.json:33` |
| UI Components | Radix UI | ^1.4.3 | `package.json:13` |
| Icons | Lucide React | ^0.563.0 | `package.json:9` |
| Theme | next-themes | ^0.4.6 | `package.json:10` |
| Toasts | Sonner | ^2.0.7 | `package.json:17` |

#### Dev Dependencies

| Tool | Version | Purpose | Source |
|---|---|---|---|
| Vitest | ^4.0.18 | Unit/integration testing | `package.json:36` |
| Playwright | ^1.58.2 | E2E testing | `package.json:21` |
| TypeScript | ^5 | Type checking | `package.json:35` |
| ESLint | ^9 | Linting | `package.json:29` |
| shadcn | ^3.8.4 | UI component generation | `package.json:32` |

#### NPM Scripts

Source: `src/drop-app/package.json:5-12`

| Script | Command | Description |
|---|---|---|
| `dev` | `next dev` | Start development server (port 3000) |
| `build` | `next build` | Build for production (standalone output) |
| `start` | `next start` | Start production server |
| `lint` | `eslint` | Run ESLint |
| `test` | `vitest run` | Run unit/integration tests (single run) |
| `test:watch` | `vitest` | Run tests in watch mode |

#### Next.js Configuration

Source: `src/drop-app/next.config.ts:1-49`

| Setting | Value | Purpose |
|---|---|---|
| `output` | `"standalone"` | Self-contained server for Docker (`next.config.ts:4`) |
| `devIndicators` | `false` | Disable dev indicators (`next.config.ts:5`) |

#### Security Headers

All responses include these headers (configured in `next.config.ts:6-58`):

| Header | Value (Production) | Value (Development) | Purpose |
|---|---|---|---|
| Content-Security-Policy | `default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; font-src 'self'; img-src 'self' data: blob:; connect-src 'self'; frame-ancestors 'none'` | `default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; font-src 'self'; img-src 'self' data: blob:; connect-src 'self'; frame-ancestors 'none'` | XSS and injection protection |
| X-Frame-Options | `DENY` | `DENY` | Clickjacking prevention |
| X-Content-Type-Options | `nosniff` | `nosniff` | MIME sniffing prevention |
| Referrer-Policy | `strict-origin-when-cross-origin` | `strict-origin-when-cross-origin` | Referrer leakage prevention |
| Permissions-Policy | `camera=(self), microphone=(), geolocation=(self)` | `camera=(self), microphone=(), geolocation=(self)` | Feature restriction |
| Strict-Transport-Security | `max-age=63072000; includeSubDomains; preload` | `max-age=63072000; includeSubDomains; preload` | Force HTTPS |

Note: The CSP is stricter in production, where `script-src` allows neither `'unsafe-inline'` nor `'unsafe-eval'`. Development mode allows both so that HMR (Hot Module Replacement) works.

#### Environment Modes

Development

- `NODE_ENV=development` (default)
- Demo user seeded automatically
- Login page shows demo credentials hint
- In-memory rate limiting fallback
- PostgreSQL 16 via Docker (`docker compose up -d`), port 5433

Production

- `NODE_ENV=production`
- Demo seed data disabled
- `JWT_SECRET` required (fatal error if missing)
- Cookies set with `secure: true`
- PostgreSQL 16 on AWS RDS via `DATABASE_URL`

Test

- `NODE_ENV=test`
- PostgreSQL 16 test database (`drop_test`), created via the `pg-test-db.ts` helper
- Tables truncated between tests; schema pushed via Drizzle before the suite runs
- Mocked Next.js modules (server, headers)

#### Port Mapping

| Service | Internal Port | External Port | Protocol |
|---|---|---|---|
| Drop App | 3000 | 3000 | HTTP |
| PostgreSQL (local dev) | 5432 | 5433 | TCP |
| PostgreSQL (production RDS) | 5432 | 5432 | TCP |

#### Docker Image Details

- Base: `node:22-alpine`
- User: `nextjs` (UID 1001)
- Working dir: `/app`
- Exposed port: 3000
- Entrypoint: `node server.js`
- Build context: `src/drop-app/`

Image contents (runner stage):

- `/app/public/` -- Static assets
- `/app/.next/standalone/` -- Next.js standalone server
- `/app/.next/static/` -- Static build output
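The production and development Content-Security-Policy values listed under Security Headers differ only in `script-src`. The sketch below shows one way such a header value could be assembled per environment. `buildCsp` is a hypothetical name chosen for illustration; the actual logic in `next.config.ts` may be organized differently.

```typescript
// Hypothetical helper for illustration; not taken from next.config.ts.
type CspDirective = [name: string, sources: string[]];

function buildCsp(isDevelopment: boolean): string {
  const directives: CspDirective[] = [
    ["default-src", ["'self'"]],
    // HMR relies on inline and eval'd scripts, so both are allowed in development only.
    [
      "script-src",
      isDevelopment ? ["'self'", "'unsafe-inline'", "'unsafe-eval'"] : ["'self'"],
    ],
    ["style-src", ["'self'", "'unsafe-inline'"]],
    ["font-src", ["'self'"]],
    ["img-src", ["'self'", "data:", "blob:"]],
    ["connect-src", ["'self'"]],
    ["frame-ancestors", ["'none'"]],
  ];
  // Each directive becomes "name source source ...", joined with "; ".
  return directives
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}
```

Calling `buildCsp(false)` yields the production value from the table, and `buildCsp(true)` yields the development value. Keeping the directives in one list makes the single environment-dependent entry easy to review.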
### Secrets Management

Last updated: 2026-02-17
Source: `src/drop-app/src/lib/secrets.ts`

#### Overview

Drop uses an abstracted secrets management system with pluggable providers. The system is backward compatible: if no secrets provider is configured, it reads directly from environment variables (existing behavior).

#### Provider Selection

The provider is selected automatically based on which environment variables are set:

| Priority | Condition | Provider | Description |
|---|---|---|---|
| 1 | `DOPPLER_TOKEN` set | Doppler | Cloud secrets manager via Doppler API |
| 2 | `AWS_SECRET_ARN` set | AWS | AWS Secrets Manager (requires AWS SDK) |
| 3 | (default) | env | Reads from `process.env` |

Initialization (call once at app startup):

```typescript
import { initSecrets } from '@/lib/secrets';

// Auto-detect provider based on env vars
initSecrets();

// Optional: custom cache TTL (default 5 minutes)
initSecrets({ ttlMs: 10 * 60 * 1000 }); // 10 minutes
```

Usage:

```typescript
import { getSecret } from '@/lib/secrets';

const jwtSecret = await getSecret('JWT_SECRET');
const dbUrl = await getSecret('DATABASE_URL');
```

#### Caching

All secret values are cached in memory with a configurable TTL (default: 5 minutes). This reduces API calls to external providers while ensuring secrets are refreshed periodically.

- Cache is cleared on each `initSecrets()` call
- Cache entries expire individually based on TTL
- If a provider returns `undefined`, the system falls back to `process.env`

#### Rotation Procedures

##### JWT_SECRET

Impact: All active user sessions will be invalidated.

1. Generate a new secret: `openssl rand -base64 48`
2. Update it in the secrets provider (Doppler/AWS/env)
3. Call `rotateSecret('JWT_SECRET', newValue)` or restart the app
4. Users will need to log in again

Recommended frequency: Every 90 days or after a suspected compromise.

##### DATABASE_URL (PostgreSQL credentials)

Impact: Application loses DB connectivity until updated.

1. Create new PostgreSQL credentials
2. Update the PostgreSQL user: `ALTER USER drop WITH PASSWORD 'new_value';`
3. Update `DATABASE_URL` in the secrets provider with the new credentials
4. Restart the application (or call `rotateSecret`)

Recommended frequency: Every 90 days.

##### SENTRY_DSN

Status: REMOVED (MC #1271, Sentry uninstalled)

##### SLACK_WEBHOOK_URL

Impact: Alerts stop sending to Slack until updated.

1. Create a new incoming webhook in the Slack workspace
2. Update `SLACK_WEBHOOK_URL` in the secrets provider
3. Restart the application

Recommended frequency: Only on suspected compromise.

##### Open Banking API Keys

Impact: Bank connectivity (AISP/PISP) stops working.

1. Regenerate keys in the Open Banking provider dashboard
2. Update the relevant env vars in the secrets provider
3. Restart the application
4. Verify bank account connectivity via `/api/health`

Recommended frequency: Per provider policy or every 180 days.

#### Environment Setup per Provider

##### Environment Variables (Default)

No setup required. Set secrets as environment variables:

```bash
# .env.local (development)
JWT_SECRET=dev-secret-do-not-use-in-production

# Production (Fly.io)
fly secrets set JWT_SECRET="$(openssl rand -base64 48)"
fly secrets set DATABASE_URL="postgresql://..."

# Production (Docker)
# Pass via -e flags or docker-compose environment section
```

##### Doppler

1. Create an account at doppler.com
2. Create project "drop" with environments (dev, staging, production)
3. Add all secrets in the Doppler dashboard
4. Generate a service token for each environment
5. Set `DOPPLER_TOKEN` in your deployment:

```bash
# Fly.io
fly secrets set DOPPLER_TOKEN="dp.st.production.xxxxx"

# Docker (pass as environment variable)
```

##### AWS Secrets Manager

1. Create a secret in AWS Secrets Manager (JSON format):

```json
{
  "JWT_SECRET": "your-jwt-secret",
  "DATABASE_URL": "postgresql://...",
  "SLACK_WEBHOOK_URL": "https://..."
}
```

2. Note the secret ARN
3. Ensure the application has IAM permissions for `secretsmanager:GetSecretValue`
4. Install the AWS SDK: `npm install @aws-sdk/client-secrets-manager`
5. Set `AWS_SECRET_ARN` in your deployment

#### Audit Trail

All secret rotation events are logged to the `audit_log` table:

| Field | Value |
|---|---|
| `action` | `secret_rotated` |
| `resource_type` | `secret` |
| `resource_id` | Secret key name (e.g., `JWT_SECRET`) |
| `details` | JSON with provider name and rotation timestamp |

Query rotation history:

```sql
SELECT * FROM audit_log
WHERE action = 'secret_rotated'
ORDER BY timestamp DESC;
```

### Deployment Checklist

**Deployment Checklist: [PROJECT NAME]**

- Release: v[X.Y.Z]
- Date: YYYY-MM-DD
- Deploy Lead: DevOps
- Approved by: Tech Lead + John
- Environment: Staging → Production

#### Pre-Deployment (T-1 Day)

##### Verification

- [ ] All tests passing in CI
- [ ] Code review approved and merged
- [ ] UAT sign-off received
- [ ] Release notes prepared
- [ ] No critical/high open bugs

##### Preparation

- [ ] Database backup completed
- [ ] Staging environment matches production config
- [ ] Rollback procedure tested
- [ ] Stakeholders notified of deployment window
- [ ] On-call person confirmed and available

##### Configuration

- [ ] Environment variables verified
- [ ] API keys / secrets rotated if needed
- [ ] DNS changes prepared (if applicable)
- [ ] SSL certificates valid (expiry > 30 days)
- [ ] Third-party service limits adequate

#### Deployment (T-0)

##### Window

- Allowed: Tue-Thu, 10:00-16:00
- Never: Fridays
- Hotfix: Anytime during business hours (Tech Lead + John approval)
- Emergency: Anytime (John + Alem approval)

##### Execution

- [ ] Announce deployment start in channel
- [ ] Deploy to staging and verify
- [ ] Run staging smoke tests
- [ ] Manual approval gate: Tech Lead confirms
- [ ] Deploy to production
- [ ] Monitor deployment logs for errors

#### Post-Deployment (T+0)

##### Smoke Tests

- [ ] Homepage loads correctly
- [ ] Authentication works (login/logout)
- [ ] Core user flow #1 works
- [ ] Core user flow #2 works
- [ ] API health endpoint returns 200
- [ ] No errors in error tracking (Sentry)

##### Monitoring (First 30 Minutes)

- [ ] Error rate normal (< 1%)
- [ ] Response times normal (p95 < 500ms)
- [ ] No 5xx errors in logs
- [ ] Database connections stable
- [ ] Memory/CPU usage normal

##### Communication

- [ ] Announce deployment complete
- [ ] Send release notes to stakeholders
- [ ] Update project status

#### Rollback Plan

##### Rollback Triggers

- Critical functionality broken
- Data integrity issues
- Security vulnerability discovered
- Error rate > 5%
- Response time > 3x normal

##### Rollback Procedure

1. Announce rollback in channel
2. Revert to previous version
3. Restore database backup (if schema changed)
4. Verify rollback successful
5. Announce rollback complete
6. Create incident report

##### Rollback Time Targets

- Application rollback: < 15 minutes
- Database rollback: < 30 minutes
- Full rollback: < 1 hour

#### Sign-off

| Role | Name | Pre-Deploy | Post-Deploy |
|---|---|---|---|
| DevOps | | ☐ | ☐ |
| Tech Lead | | ☐ | ☐ |
| John | | ☐ | ☐ |

### DR Runbook

**Drop: Disaster Recovery Runbook**

#### Infrastructure Overview

##### Production Environment

- Service: AWS App Runner
- Region: eu-west-1 (Ireland)
- Service ARN: `arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec`
- Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com
- ECR Repository: `324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web`

##### Database

- RDS Instance: `drop-db`
- Endpoint: `drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432`
- Database Name: `dropapp`
- Username: `dropuser`
- Backup Strategy: Automated snapshots, 7-day retention
- Backup Window: 23:24-23:54 UTC daily

##### Staging Environment

- Platform: Fly.io
- App Name: `drop-staging`
- Region: arn (Stockholm)
- Database: PostgreSQL 16 (RDS, eu-north-1, or Docker in CI)

##### Domain

- Production: getdrop.no (future)
- Current: App Runner subdomain

#### Backup Strategy

##### RDS PostgreSQL (Production)

- Automated Snapshots: Daily at 23:24 UTC
- Retention Period: 7 days
- Point-in-Time Recovery: Enabled (5-minute granularity)
- Manual Snapshots: Created before major changes
- Storage: Same region (eu-west-1)

##### Staging PostgreSQL (RDS)

- Automated Snapshots: Daily, 7-day retention (same config as production)
- Backup Method: Manual export via `flyctl ssh console` and `pg_dump` (PostgreSQL 16; `sqlite3` no longer applies, see ADR-014)
- Recommended: Export before major changes

#### Recovery Procedures

##### Scenario 1: App Runner Service Down

###### Symptoms

- Service health checks failing
- 5xx errors from App Runner URL
- CloudWatch alarms triggered

###### Investigation Steps

```bash
# 1. Check service status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# 2. View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --since 10m \
  --region eu-west-1

# 3. Check deployment history
aws apprunner list-operations \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1
```

###### Recovery Actions

**Option A: Restart Service**

```bash
# Trigger new deployment (no code change)
aws apprunner start-deployment \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --region eu-west-1

# Monitor deployment status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' \
  --region eu-west-1
```

**Option B: Rollback to Previous Image**

```bash
# 1. List recent ECR images
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:]'

# 2. Update service to use previous image tag
# (Manual step: Update .github/workflows/deploy-aws.yml with previous tag and push)

# 3. Or update directly via App Runner console (rollback to previous deployment)
```

RTO: 5-10 minutes (restart) / 15-20 minutes (rollback)

##### Scenario 2: RDS Database Failure

###### Symptoms

- Connection timeouts to `drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com`
- Database errors in App Runner logs
- RDS CloudWatch metrics show instance down

###### Investigation Steps

```bash
# 1. Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBInstances[0].DBInstanceStatus'

# 2. Check for automated snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'

# 3. Review recent events
aws rds describe-events \
  --source-identifier drop-db \
  --source-type db-instance \
  --region eu-west-1 \
  --duration 60
```

###### Recovery Actions

**Option A: Restore from Latest Automated Snapshot**

```bash
# 1. Identify latest snapshot
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)
echo "Latest snapshot: $LATEST_SNAPSHOT"

# 2. Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-restored \
  --db-snapshot-identifier "$LATEST_SNAPSHOT" \
  --db-instance-class db.t4g.micro \
  --vpc-security-group-ids sg-XXXXX \
  --db-subnet-group-name default \
  --region eu-west-1

# 3. Wait for restore to complete (10-20 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier drop-db-restored \
  --region eu-west-1

# 4. Update DATABASE_URL in App Runner
# (Manual step: Update environment variable via AWS Console or CLI)

# 5. Verify connection
NEW_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier drop-db-restored \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text \
  --region eu-west-1)
echo "New endpoint: $NEW_ENDPOINT"
```

**Option B: Point-in-Time Recovery**

```bash
# Restore to specific timestamp (e.g., 1 hour ago; `date -d` is GNU date syntax)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier drop-db \
  --target-db-instance-identifier drop-db-pitr \
  --restore-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

# Wait for restore
aws rds wait db-instance-available \
  --db-instance-identifier drop-db-pitr \
  --region eu-west-1
```

RPO: 24 hours (snapshot) / 5 minutes (PITR)
RTO: 30 minutes (snapshot) / 30 minutes (PITR)

##### Scenario 3: Data Corruption

###### Symptoms

- Application reports data inconsistencies
- Missing or incorrect records in database
- User reports of lost data

###### Investigation Steps

```bash
# 1. Connect to RDS and inspect data
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"

# 2. Check audit_log table for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "SELECT * FROM audit_log WHERE action IN ('DELETE', 'UPDATE') ORDER BY timestamp DESC LIMIT 50;"

# 3. Identify time of corruption
# Review application logs and database query logs
```

###### Recovery Actions

**Option A: Selective Data Restore (if corruption is isolated)**

```bash
# 1. Create temporary snapshot of current state
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-before-restore-$(date +%Y%m%d-%H%M) \
  --region eu-west-1

# 2. Restore clean snapshot to temporary instance
#    (set CLEAN_SNAPSHOT to the identifier of a known-clean snapshot; the runbook gives no value)
CLEAN_SNAPSHOT=
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-temp \
  --db-snapshot-identifier "$CLEAN_SNAPSHOT" \
  --db-instance-class db.t4g.micro \
  --region eu-west-1

# 3. Export affected tables from clean instance
#    (TEMP_ENDPOINT is a placeholder for the drop-db-temp endpoint; the runbook gives no host)
pg_dump -h "$TEMP_ENDPOINT" \
  -U dropuser \
  -d dropapp \
  -t users \
  -t transactions \
  --data-only \
  > clean_data.sql

# 4. Selectively import into production (after verification)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  < clean_data.sql

# 5. Terminate temporary instance
aws rds delete-db-instance \
  --db-instance-identifier drop-db-temp \
  --skip-final-snapshot \
  --region eu-west-1
```

**Option B: Full Database Restore (see Scenario 2)**

RTO: 1-2 hours (selective) / 30 minutes (full restore)
RPO: Depends on snapshot age

##### Scenario 4: Full Region Outage (eu-west-1)

###### Current State

- No automated cross-region failover
- No replica in secondary region
- Manual failover required

###### Investigation Steps

```bash
# 1. Check AWS Service Health Dashboard
# https://health.aws.amazon.com/health/status

# 2. Verify RDS snapshots are accessible
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1

# 3. Check ECR images (may need to copy to secondary region)
aws ecr describe-images \
  --repository-name drop-web \
  --region eu-west-1
```

###### Recovery Actions (Manual Failover to eu-north-1)

```bash
# 1. Copy latest RDS snapshot to eu-north-1
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)
# Compute the target name once so steps 1 and 2 agree even across midnight
FAILOVER_SNAPSHOT="drop-db-failover-$(date +%Y%m%d)"

aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST_SNAPSHOT" \
  --target-db-snapshot-identifier "$FAILOVER_SNAPSHOT" \
  --region eu-north-1

# The copy must finish before it can be restored
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$FAILOVER_SNAPSHOT" \
  --region eu-north-1

# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier drop-db-failover \
  --db-snapshot-identifier "$FAILOVER_SNAPSHOT" \
  --db-instance-class db.t4g.micro \
  --region eu-north-1

# 3. Copy ECR image to eu-north-1
# (Manual: create ECR repo in eu-north-1, retag and push latest image)

# 4. Deploy App Runner in eu-north-1
# (Manual: create new App Runner service via console with failover database endpoint)

# 5. Update DNS (when getdrop.no is active)
# Point getdrop.no to new App Runner URL
```

RTO: 2-4 hours (manual process)
RPO: Last snapshot before outage (24 hours worst case, 5 minutes with PITR if available)

##### Scenario 5: Security Incident

###### Symptoms

- Suspicious database activity
- Unauthorized access attempts
- AML alerts triggered
- STR report filed

###### Investigation Steps

```bash
# 1. Check audit logs for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"

# 2. Review AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"

# 3. Check AWS CloudTrail for API activity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
  --region eu-west-1 \
  --max-results 50

# 4. Review App Runner access logs (start time in epoch milliseconds; GNU date syntax)
aws logs filter-log-events \
  --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --start-time "$(date -u -d '24 hours ago' +%s)000" \
  --region eu-west-1
```

###### Containment Actions

```bash
# 1. Revoke compromised sessions
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"

# 2. Temporarily disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"

# 3. Rotate database credentials
#    (NEW_DB_PASSWORD is a placeholder; the runbook gives no value)
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --master-user-password "$NEW_DB_PASSWORD" \
  --apply-immediately \
  --region eu-west-1
# Update DATABASE_URL in App Runner with new password

# 4. Enable enhanced monitoring
aws rds modify-db-instance \
  --db-instance-identifier drop-db \
  --monitoring-interval 1 \
  --monitoring-role-arn arn:aws:iam::324480209768:role/rds-monitoring-role \
  --region eu-west-1

# 5. Take forensic snapshot
aws rds create-db-snapshot \
  --db-instance-identifier drop-db \
  --db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
  --region eu-west-1
```

###### Investigation & Remediation

1. Analyze audit logs: identify the scope of the breach
2. File STR reports if financial crime is suspected (via the `str_reports` table)
3. Notify the authorities if user data is compromised: GDPR Article 33 requires notifying the data protection authority (Datatilsynet in Norway) within 72 hours; notify Finanstilsynet where financial-sector incident reporting applies
4. Update security policies and patch vulnerabilities
5. User communication: notify affected users if required by GDPR

RTO: Immediate containment (revoke sessions) / 24-48 hours full investigation

#### RTO/RPO Targets

| Scenario | RTO | RPO |
|---|---|---|
| App Runner restart | 5-10 minutes | 0 (no data loss) |
| App Runner rollback | 15-20 minutes | 0 (no data loss) |
| RDS snapshot restore | 30 minutes | 24 hours (last snapshot) |
| RDS PITR restore | 30 minutes | 5 minutes (PITR granularity) |
| Full region failover | 2-4 hours | 24 hours (manual process) |
| Security incident containment | Immediate | 0 (logs preserved) |

#### Contacts

##### Primary

- Alem Bašić (CEO): +47 40 47 42 51
- Email: alem@alai.no

##### AI Operations

- John (AI Director): Slack `#drop-alerts` channel

##### External Support

- AWS Support: Premium support via AWS Console
- Fly.io Support: Email support@fly.io

#### Runbook Maintenance

##### Review Schedule

- Quarterly review: verify all ARNs, endpoints, and procedures
- After incidents: update based on lessons learned
- Before major releases: verify backup and rollback procedures

##### Test Schedule

- Annually: full DR drill (restore from snapshot to temporary instance)
- Quarterly: App Runner restart and rollback tests
- Monthly: verify snapshot creation and retention
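The monthly snapshot verification comes down to comparing the age of the newest automated snapshot against the 24-hour snapshot RPO in the targets table. The sketch below shows that comparison. `checkSnapshotAge` and its parameters are hypothetical names, and the timestamp is assumed to be an ISO 8601 string such as the `SnapshotCreateTime` value returned by `aws rds describe-db-snapshots`.

```typescript
// Hypothetical helper for illustration; not part of the Drop codebase.
interface SnapshotCheck {
  ageHours: number;
  withinRpo: boolean;
}

function checkSnapshotAge(
  snapshotCreateTime: string, // assumed ISO 8601 timestamp
  now: Date,
  rpoHours: number = 24, // snapshot RPO from the RTO/RPO targets table
): SnapshotCheck {
  const created = new Date(snapshotCreateTime);
  if (isNaN(created.getTime())) {
    throw new Error(`Unparseable snapshot timestamp: ${snapshotCreateTime}`);
  }
  const millisPerHour = 60 * 60 * 1000;
  const ageHours = (now.getTime() - created.getTime()) / millisPerHour;
  // A snapshot older than the RPO means a restore would lose more data than targeted.
  return { ageHours, withinRpo: ageHours <= rpoHours };
}
```

Passing `now` explicitly keeps the function deterministic, so the same check can run in a scheduled job or in a unit test with a fixed clock.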
##### Change Log

| Date | Change | Author |
|---|---|---|
| 2026-02-18 | Initial version created | Builder 3 (AI) |

#### Appendix: Useful Commands

##### Quick Health Check

```bash
# Check App Runner status
aws apprunner describe-service \
  --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
  --query 'Service.Status' \
  --output text \
  --region eu-west-1

# Check RDS status
aws rds describe-db-instances \
  --db-instance-identifier drop-db \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text \
  --region eu-west-1

# Check latest snapshot age
aws rds describe-db-snapshots \
  --db-instance-identifier drop-db \
  --region eu-west-1 \
  --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].SnapshotCreateTime' \
  --output text
```

##### Database Connection Test

```bash
# Test connection from local machine
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
  -U dropuser \
  -d dropapp \
  -c "SELECT 1;"
```

##### Log Streaming

```bash
# Stream App Runner application logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
  --follow \
  --region eu-west-1

# Download the latest portion of the RDS error log (this command does not follow the log)
aws rds download-db-log-file-portion \
  --db-instance-identifier drop-db \
  --log-file-name error/postgresql.log \
  --region eu-west-1
```

### Deployment Guide

**Drop Deployment Guide**

Last updated: 2026-03-03
Source: `src/drop-app/Dockerfile`, `docker-compose.yml`, `DOCKER.md`

> NOTE (2026-03-03): This document was updated for ADR-014 (PostgreSQL-only). The SQLite single-container deployment and the `better-sqlite3` native dependency have been removed. Current deployment: Docker + PostgreSQL 16 (dev), AWS App Runner + RDS (production).

#### Architecture Overview

Drop uses a multi-stage Docker build producing a minimal Node.js 22 Alpine production image. The application is a Next.js 16 standalone server.

Build stages (from `Dockerfile:1-41`):

| Stage | Base | Purpose |
|---|---|---|
| `deps` | `node:22-alpine` | Install `node_modules` via `npm ci`. |
| `builder` | `node:22-alpine` | Copy deps + source, run `npm run build` (Next.js standalone output). |
| `runner` | `node:22-alpine` | Minimal production image. Copies only `public/`, `.next/standalone/`, `.next/static/`. |

Security features in the runner stage (`Dockerfile:25-26`):

- Non-root user: `nextjs` (UID 1001, GID 1001)
- Data directory `/app/data` owned by `nextjs:nodejs`
- No build tools or source code in production image

#### Deployment Configurations

##### 1. Local Development -- docker-compose.yml

PostgreSQL 16 + Drop app (ADR-014).

File: `src/drop-app/docker-compose.yml:1-22`

```yaml
services:
  drop-app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - JWT_SECRET=${JWT_SECRET:?JWT_SECRET is required}
      - NODE_ENV=production
      - NEXT_PUBLIC_SERVICE_MODE=mock
    volumes:
      - drop_data:/app/data
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    restart: unless-stopped
```

Quick start:

```bash
export JWT_SECRET="your-secure-random-string-min-32-chars"
docker compose up -d
```

Data persistence: PostgreSQL data is stored in the Docker volume `drop_pgdata`.

##### 2. Production (PostgreSQL) -- docker-compose.production.yml

Multi-container setup with a separate PostgreSQL 16 database.

File: `src/drop-app/docker-compose.production.yml:1-38`

```yaml
services:
  drop-app:
    build: .
    ports:
      - "3000:3000"
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=drop
      - POSTGRES_USER=drop
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-drop_local_dev}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U drop"]
      interval: 10s
      timeout: 5s
      retries: 5
```

Quick start:

```bash
export JWT_SECRET="your-secure-random-string-min-32-chars"
export POSTGRES_PASSWORD="secure-postgres-password"
docker compose -f docker-compose.production.yml up -d
```

##### 3. Fly.io Staging -- fly.toml

File: `src/drop-app/fly.toml:1-28`

| Setting | Value |
|---|---|
| App name | `drop-staging` |
| Region | `arn` (Stockholm -- closest to Norway) |
| Internal port | 3000 |
| Force HTTPS | true |
| Auto-stop machines | stop (scales to zero) |
| Auto-start machines | true |
| Min machines | 0 |
| Persistent storage | Volume `drop_data` mounted at `/app/data` |

Health check: `GET /api/health` every 30s, 5s timeout, 10s grace period.

#### Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `JWT_SECRET` | Yes (production) | Dev: `process.cwd()` hash | JWT signing secret. Minimum 32 characters. Fatal error if missing in production. |
| `NODE_ENV` | No | `development` | Set to `production` in containers. Controls seed data gating. |
| `NEXT_PUBLIC_SERVICE_MODE` | No | - | Set to `mock` for MVP mode (no external API calls). |
| `DATABASE_URL` | Yes | - | PostgreSQL 16 connection string. Required in all environments. Local dev: `postgresql://drop:dev_only_not_a_secret@localhost:5433/drop_dev` |
| `POSTGRES_PASSWORD` | Production only | `drop_local_dev` | PostgreSQL password (production compose). |
| `PORT` | No | `3000` | HTTP server port. |
| `HOSTNAME` | No | `0.0.0.0` | Server bind address. |

Database: PostgreSQL 16 is required in all environments. There is no SQLite fallback (ADR-014).

#### Health Check

Endpoint: `GET /api/health`
Source: `src/drop-app/src/app/api/health/route.ts:1-35`

The health check performs a real database query (`SELECT 1 as ok`) and reports latency.

Success response (200):

```json
{
  "status": "ok",
  "version": "0.1.0",
  "uptime": 123,
  "db": "connected",
  "dbLatencyMs": 5,
  "timestamp": "2026-02-13T12:00:00.000Z"
}
```

Failure response (503):

```json
{
  "status": "error",
  "db": "disconnected",
  "timestamp": "..."
}
```

#### Building from Source

```bash
# Build Docker image
docker build -t drop-app .

# Run standalone container
docker run -d \
  -p 3000:3000 \
  -e JWT_SECRET="your-secret-min-32-chars" \
  -v drop_data:/app/data \
  --name drop-app \
  drop-app
```

#### Data Backup and Restore

##### Production Backups (AWS RDS)

Production database is PostgreSQL 16 on AWS RDS.
Backups are managed by AWS:

- Automated backups: Daily snapshots, 7-day retention (configured in RDS)
- Point-in-time recovery: Available within the 7-day retention window
- Manual snapshot: Via AWS Console or CLI before major deployments

Create a manual RDS snapshot before deployments:

```bash
aws rds create-db-snapshot \
  --db-instance-identifier drop-production \
  --db-snapshot-identifier drop-pre-deploy-$(date +%Y%m%d-%H%M%S)
```

Restore from snapshot: Via AWS Console → RDS → Snapshots → Restore.

##### Local Dev Backups (Docker)

Local development data in the `drop_pgdata` Docker volume is disposable. Recreate with:

```bash
docker compose down -v   # Remove volume (deletes local data)
docker compose up -d
make db-push && npm run db:seed
```

##### Backup Verification

Verify production database connectivity and integrity:

```bash
# Check health endpoint
curl https://your-app-runner-url/api/health

# Connect to RDS (requires VPN or bastion)
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users;"
```

#### Demo User

In non-production mode (`NODE_ENV !== 'production'`), a demo user is seeded:

| Field | Value |
|---|---|
| Email | amir@example.com |
| Password | demo1234 |
| Role | merchant |

Source: Drizzle seed script in `src/shared/db/seed.ts`. Gated behind `NODE_ENV !== 'production'`.
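The seed gate described above can be sketched as a small pure function. `demoUserToSeed` and the `DemoUser` shape are hypothetical names for illustration; the real seed script in `src/shared/db/seed.ts` is not shown in this guide and may be structured differently (for example, it would hash the password before inserting).

```typescript
// Hypothetical sketch of the NODE_ENV gate; not the actual seed script.
interface DemoUser {
  email: string;
  password: string;
  role: string;
}

// Values taken from the Demo User table.
const DEMO_USER: DemoUser = {
  email: "amir@example.com",
  password: "demo1234",
  role: "merchant",
};

// Returns the demo user to seed, or null when seeding must be skipped.
// An unset NODE_ENV counts as development, matching the documented default.
function demoUserToSeed(nodeEnv: string | undefined): DemoUser | null {
  return nodeEnv !== "production" ? DEMO_USER : null;
}
```

A caller would pass `process.env.NODE_ENV` and skip the insert when the result is `null`.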
#### Troubleshooting

Container won't start:

```bash
docker compose logs
docker compose exec drop-app env | grep JWT_SECRET
```

Database connection issues:

```bash
# Check PostgreSQL container is running
docker compose ps

# Test connection
docker compose exec db psql -U drop -d drop_dev -c "SELECT COUNT(*) FROM users;"

# Check app DATABASE_URL is set correctly
docker compose exec drop-app env | grep DATABASE_URL
```

Permission denied:

```bash
docker compose down -v   # Remove volumes
docker compose up -d     # Recreate with correct permissions
```

Cleanup:

```bash
docker compose down      # Stop containers
docker compose down -v   # Stop + remove volumes (WARNING: deletes data)
docker rmi drop-app      # Remove image
```

## CI/CD & Monitoring

### CI/CD Pipeline

**Drop CI/CD Pipeline**

Last updated: 2026-02-13
Source: `src/drop-app/package.json`, `Dockerfile`, `fly.toml`, `vitest.config.ts`, `playwright.config.ts`

#### Current State

Drop is in MVP/pre-production stage. Core CI/CD infrastructure exists, including a GitHub Actions workflow.

What exists:

- GitHub Actions CI workflow (`.github/workflows/ci.yml`) with 5 jobs: lint-and-typecheck, test, build, e2e, docker-build
- Dockerfile with multi-stage build (`Dockerfile:1-63`)
- docker-compose for local and production (`docker-compose.yml`, `docker-compose.production.yml`)
- Fly.io deployment config (`fly.toml`)
- Vitest unit/integration test framework (`vitest.config.ts`)
- Playwright E2E test framework (`playwright.config.ts`)
- Health check endpoint (`/api/health`)
- QA report generation via `scripts/qa-report.js` (automated in CI)

What does not exist yet:

- Automated deployment pipeline (CI builds but does not deploy)
- Container registry integration
- Automated security scanning (npm audit, Snyk)
- Test coverage reporting
- Staging environment (Fly.io config exists but not deployed)

#### Build Pipeline

##### Step 1: Install Dependencies

```bash
npm ci
```

Installs exact versions from `package-lock.json`.
##### Step 2: Lint

```bash
npm run lint   # eslint
```

##### Step 3: Type Check

```bash
npx tsc --noEmit
```

##### Step 4: Unit + Integration Tests

```bash
npm test   # vitest run
```

Runs all tests in `tests/**/*.test.ts` (from `vitest.config.ts:7`). Test setup: `tests/setup.ts` sets `NODE_ENV=test`.

##### Step 5: Build

```bash
npm run build   # next build
```

Produces standalone output for Docker deployment.

##### Step 6: Docker Build

```bash
docker build -t drop-app .
```

Multi-stage build: deps -> builder -> runner.

##### Step 7: E2E Tests (requires running server)

```bash
npx playwright test
```

Requires a dev server on `http://localhost:3000`. Playwright auto-starts it via the `webServer` config.

#### Test Framework Configuration

##### Vitest (Unit + Integration)

Config: `src/drop-app/vitest.config.ts:1-15`

| Setting | Value |
|---|---|
| Environment | `node` |
| Include | `tests/**/*.test.ts` |
| Setup | `tests/setup.ts` |
| Path alias | `@` -> `./src` |

##### Playwright (E2E)

Config: `src/drop-app/playwright.config.ts:1-39`

| Setting | Value |
|---|---|
| Test dir | `./tests/e2e` |
| Parallel | false (serial -- rate limiter is shared) |
| Workers | 1 |
| Retries (CI) | 2 |
| Timeout | 30,000ms |
| Base URL | `http://localhost:3000` |
| Reporter | HTML |
| Trace | on-first-retry |

Test projects:

- `user-flows` -- Basic user journey tests (`user-flows.spec.ts`)
- `full-flows` -- Complete feature journeys (`full-flows.spec.ts`)
- `input-chaos` -- Malicious/edge-case input testing (`input-chaos.spec.ts`). Depends on `user-flows`.

Web server config: Auto-starts `npm run dev` for E2E tests. Reuses an existing server if one is running. 30s timeout.

#### Deployment Targets

##### Fly.io (Staging)

Config: `fly.toml:1-28`

```bash
# Deploy to Fly.io staging
fly deploy

# Set secrets
fly secrets set JWT_SECRET="your-secret"
fly secrets set NEXT_PUBLIC_SERVICE_MODE="mock"
```

Region: `arn` (Stockholm)
Auto-scaling: Scales to 0 when idle, auto-starts on request.

##### Docker (Self-hosted)

```bash
# Local dev (PostgreSQL 16 via Docker)
docker compose up -d

# Apply schema
make db-push
```

#### Existing GitHub Actions CI Workflow

File: `.github/workflows/ci.yml`

Triggers on push/PR to `main` or `master`.

Jobs:

1. `lint-and-typecheck`: `npm ci`, `npm run lint`, `tsc --noEmit`
2. `test`: `npm ci`, `npm test --if-present` (depends on lint-and-typecheck)
3. `build`: `npm ci`, `npm run build` with a `JWT_SECRET` placeholder (depends on lint-and-typecheck)
4. `e2e`: `npm ci`, `npx playwright install chromium`, `npm run build`, `npm run start` (production mode), `npx playwright test` for user-flows + full-flows, generate QA report, upload artifacts (depends on build)
5. `docker-build`: `docker build -t drop-app:ci` (depends on test + build + e2e)

Artifacts uploaded:

- `playwright-report/`: Playwright HTML report (7 day retention)
- `qa-report.html`: QA metrics report (pass/fail, execution time)

Not yet implemented:

- Security scan (npm audit, Snyk)
- Deploy to staging (`fly deploy`)
- Deploy to production (manual approval gate)

Status: Full CI pipeline including E2E tests is in place. CD deployment is tracked in the security hardening checklist (`security/hardening-checklist.md:120-126`).

### Monitoring & Alerting

**Drop Monitoring**

Last updated: 2026-02-17
Source: `src/drop-app/src/app/api/health/route.ts`, `docker-compose.yml`, `fly.toml`, `src/lib/alerts.ts`

#### Health Check Endpoint

Route: `GET /api/health`
Source: `src/drop-app/src/app/api/health/route.ts:1-35`

##### What It Checks

- Database connectivity -- Executes `SELECT 1 as ok` against the database
- Database latency -- Measures query execution time in milliseconds
- Database driver -- Reports `pg` (PostgreSQL 16 via Drizzle ORM)
- Service mode -- Reports `NEXT_PUBLIC_SERVICE_MODE` (`mock` or `live`)
- Application uptime -- Tracks seconds since server start
- Application version -- Reads from the `npm_package_version` env var, defaults to `0.1.0`

##### Status Values

- `ok` -- All checks pass (HTTP 200)
- `degraded` -- DB query returned unexpected result (HTTP 200)
- `down` -- DB unreachable (HTTP 503)

##### Response Format

Healthy (200 OK):

```json
{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
```

Down (503 Service Unavailable):

```json
{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
```

#### Container Health Checks

##### Docker Compose (MVP)

Source: `docker-compose.yml:12-17`

```yaml
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
```

##### Docker Compose (Production)

Source: `docker-compose.production.yml:9-14`

Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:

```yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5
```

The `drop-app` service depends on PostgreSQL being healthy before starting (`depends_on.postgres.condition: service_healthy`).

##### Fly.io

Source: `fly.toml:19-23`

```toml
[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  path = "/api/health"
  timeout = "5s"
```

Fly.io uses this health check to determine machine readiness and to route traffic.
Current Monitoring State

What Exists
Health check endpoint with real database verification (not hardcoded)
Container-level health checks (Docker + Fly.io)
Automatic restart on failure ( restart: unless-stopped in docker-compose)
Auto-scaling on Fly.io (scale to zero, auto-start on request)

What Does Not Exist Yet
External uptime monitoring service (see the BetterStack section below for the recommended configuration)
Application Performance Monitoring (APM)
Structured logging (JSON format)
Log aggregation and forwarding
Database performance monitoring
Rate limit monitoring/metrics
Business metrics dashboard (transactions per hour, success rate)

Sentry Error Tracking
Status: REMOVED (MC #1271 — Sentry uninstalled)

Slack Alerting
Status: Implemented (MC #1183)
Source: src/lib/alerts.ts , instrumentation.ts

Features
Operational alerts sent to Slack webhook
10-minute cooldown per alert title (prevents spam)
Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
Graceful degradation when webhook URL not set (dev mode)

Setup Instructions
Create incoming webhook in Slack workspace:
Go to Slack App Directory → Incoming Webhooks
Choose channel (e.g., #ops or #alerts )
Copy webhook URL
Set environment variable:

    # .env.local (server-side secret)
    SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX

Required Environment Variable
Variable | Required | Description
SLACK_WEBHOOK_URL | Yes (production) | Slack incoming webhook URL

Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.
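The fallback described in the note can be sketched as a pure function that decides where an alert goes. This is an illustration only, not the contents of `src/lib/alerts.ts`: `planDelivery`, `buildSlackPayload`, and the `Delivery` type are hypothetical names, and the message layout is an assumption. The one external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
type Severity = "info" | "warning" | "critical";

// Emoji prefixes as listed under Features.
const SEVERITY_EMOJI: Record<Severity, string> = {
  info: "ℹ️",
  warning: "⚠️",
  critical: "🚨",
};

interface Alert {
  severity: Severity;
  title: string;
  message: string;
}

// Hypothetical result type: either a Slack request to make or a console line to print.
type Delivery =
  | { target: "slack"; url: string; payload: { text: string } }
  | { target: "console"; line: string };

// Build the JSON body for a Slack incoming webhook (layout is an assumption).
function buildSlackPayload(alert: Alert): { text: string } {
  return { text: `${SEVERITY_EMOJI[alert.severity]} *${alert.title}*\n${alert.message}` };
}

// Decide where the alert goes: Slack when a webhook URL is configured,
// console only when it is missing or empty (the dev-mode fallback).
function planDelivery(alert: Alert, webhookUrl: string | undefined): Delivery {
  if (!webhookUrl) {
    return {
      target: "console",
      line: `[alert:${alert.severity}] ${alert.title}: ${alert.message}`,
    };
  }
  return { target: "slack", url: webhookUrl, payload: buildSlackPayload(alert) };
}
```

A caller would then POST `payload` as JSON to `url` for the `slack` case and print `line` for the `console` case. Keeping the decision separate from the network call makes the fallback easy to unit test.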
Alert Types and Severities Severity Emoji Use Case info ℹ️ Application startup, normal operations warning ⚠️ Degraded performance, non-critical issues critical 🚨 Service outages, data loss, security incidents Cooldown Behavior Each alert title has a 10-minute cooldown Same title sent within 10 minutes → skipped (prevents spam) Different titles → sent immediately (independent tracking) Cooldown resets on app restart (in-memory tracking) Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05. Usage in Code import { sendAlert } from '@/lib/alerts'; // Basic alert await sendAlert({ severity: 'critical', title: 'Database connection failed', message: 'PostgreSQL unreachable after 3 retries', }); // Alert with details await sendAlert({ severity: 'warning', title: 'High error rate detected', message: '15 errors in last 5 minutes', }); Current Integrations App startup: Sends info alert when server starts ( instrumentation.ts ) App shutdown: Sends info alert on SIGTERM/SIGINT ( instrumentation.ts ) Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds ( src/lib/alerts.ts:trackError ) Unhandled exceptions: Logged and tracked via process event handlers ( instrumentation.ts ) Error Spike Detection The alerting system automatically detects error spikes using a rolling window approach: How it works: Every server error (HTTP 5xx) is tracked via trackError() Maintains rolling 1-minute window of error timestamps When count exceeds threshold (5 errors in 60 seconds), sends critical alert Integrates with middleware error handling Threshold: 5 errors within 60 seconds Alert severity: Critical (🚨) Implementation: src/lib/alerts.ts:trackError() , wired into src/lib/middleware.ts:jsonError() Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters. 
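The cooldown and rolling-window behavior described above can be sketched as two small in-memory helpers. This is an illustration under stated assumptions, not the code in `src/lib/alerts.ts`: the class names `AlertCooldown` and `ErrorSpikeDetector` are hypothetical, timestamps are passed in explicitly so the logic is testable, and the spike check reads "exceeds threshold" as strictly more than 5 errors inside the window.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10-minute cooldown per alert title
const SPIKE_WINDOW_MS = 60 * 1000;  // rolling 1-minute window
const SPIKE_THRESHOLD = 5;          // alert when the count exceeds this value

// Tracks the last send time per alert title (in memory, lost on restart).
class AlertCooldown {
  private lastSent = new Map<string, number>();

  // Returns true if an alert with this title may be sent at `now` (ms)
  // and records the send; returns false while the title is cooling down.
  shouldSend(title: string, now: number): boolean {
    const last = this.lastSent.get(title);
    if (last !== undefined && now - last < COOLDOWN_MS) {
      return false;
    }
    this.lastSent.set(title, now);
    return true;
  }
}

// Keeps error timestamps from the last 60 seconds (in memory, lost on restart).
class ErrorSpikeDetector {
  private timestamps: number[] = [];

  // Records one error at `now` (ms); returns true when the number of errors
  // inside the rolling window exceeds the threshold.
  track(now: number): boolean {
    this.timestamps.push(now);
    const cutoff = now - SPIKE_WINDOW_MS;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > SPIKE_THRESHOLD;
  }
}
```

In this sketch the two helpers compose naturally: a spike that persists for several minutes keeps returning `true` from `track`, while `shouldSend` limits the resulting alert to one message per title every 10 minutes.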
BetterStack Uptime Monitoring Status: Ready to configure (setup guide available) Documentation: BETTERSTACK-SETUP.md Overview BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures. Free tier includes: 10 monitors (enough for Drop production) 3-minute check interval Unlimited integrations (Slack, email) Public status page SSL expiry monitoring Recommended Monitors Monitor URL Purpose Expected Response Health Endpoint https://drop.alai.no/api/health API + DB connectivity 200 , body contains "status":"ok" Landing Page https://drop.alai.no Public website 200 , body contains Send penger Multi-Region Check https://drop.alai.no/api/health Geographic availability 200 , body contains "status":"ok" Alert Escalation BetterStack sends alerts through multiple channels: Minute 0: Alert fires → Slack #drop-ops (immediate) Minute 5: Still down → Email to alem@alai.no Minute 15: Still down → SMS (requires paid plan) Status Page Public status page shows real-time service status: URL: https://drop-status.betteruptime.com Components: API Health, Landing Page, Global Network Auto-updates: Incidents automatically posted and resolved Subscriptions: Users can subscribe to email updates Setup Instructions Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md Setup includes: Account creation (free tier) Configure 3 monitors (health, landing, multi-region) Slack integration ( #drop-ops channel) On-call schedule and escalation policy Public status page creation Testing and verification Key Features Proactive monitoring: 3-minute check interval (free tier) or 30s (paid) Keyword verification (not just HTTP 200) SSL certificate expiry warnings (14 days) Multi-region checks (detect geographic issues) Incident management: Automatic incident creation on downtime Status page updates (public transparency) 
Escalation to multiple channels (Slack → Email → SMS) Maintenance window support (suppress alerts during deployments) Reporting: Uptime SLA tracking (99.9% target) Incident history and analysis Response time graphs Downtime duration reports Integration with Drop Alerting BetterStack complements Drop's internal alerting ( src/lib/alerts.ts ): Feature Drop Internal Alerts BetterStack External Detects Application errors, error spikes Infrastructure outages When App is running App is unreachable Source Application logs External HTTP checks Delivery Slack webhook (direct) Escalation policy Use case Code bugs, DB issues Container crashes, network failures Example: Database connection fails: Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate) BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min Maintenance Windows When performing planned maintenance (deployments, upgrades): Create maintenance window in BetterStack Select affected monitors Set duration (e.g., 1 hour) Effect: Alerts suppressed, status page shows "Scheduled Maintenance" Prevents: False downtime alerts during intentional service interruptions. 
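To make the escalation timeline and the effect of a maintenance window concrete, the sketch below models them as plain data. It is not BetterStack's implementation or API. The step timings come from the Alert Escalation list in this document, while the channel labels, the `MaintenanceWindow` shape, and the function names are assumptions made for illustration.

```typescript
interface EscalationStep {
  afterMinutes: number; // minutes of continuous downtime before this step fires
  channel: string;      // illustrative label, not a BetterStack identifier
}

// Steps as described under Alert Escalation: Slack at once, email at 5 min, SMS at 15 min.
const ESCALATION: EscalationStep[] = [
  { afterMinutes: 0, channel: "slack:#drop-ops" },
  { afterMinutes: 5, channel: "email" },
  { afterMinutes: 15, channel: "sms" },
];

// Hypothetical shape for a planned maintenance window.
interface MaintenanceWindow {
  startMs: number;
  durationMs: number;
}

// True when `nowMs` falls inside any maintenance window.
function inMaintenance(nowMs: number, windows: MaintenanceWindow[]): boolean {
  return windows.some((w) => nowMs >= w.startMs && nowMs < w.startMs + w.durationMs);
}

// Channels that would have been notified for an outage that began at `downSinceMs`.
// During a maintenance window all alerts are suppressed.
function channelsNotified(
  downSinceMs: number,
  nowMs: number,
  windows: MaintenanceWindow[] = [],
): string[] {
  if (inMaintenance(nowMs, windows)) {
    return [];
  }
  const elapsedMinutes = (nowMs - downSinceMs) / 60000;
  return ESCALATION.filter((s) => elapsedMinutes >= s.afterMinutes).map((s) => s.channel);
}
```

For example, an outage that has lasted 6 minutes outside any window yields the Slack and email channels, while the same outage inside a 1-hour deployment window yields none.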
Best Practices

Do's:
✅ Test alerts monthly (pause monitor to verify escalation)
✅ Use keyword checks (not just HTTP status codes)
✅ Monitor SSL expiry (14-day warnings)
✅ Create maintenance windows for deployments
✅ Review incident history monthly

Don'ts:
❌ Don't ignore degraded status (investigate even if not fully down)
❌ Don't disable monitors (use pause for temporary suppression)
❌ Don't skip keyword checks (HTTP 200 ≠ working API)
❌ Don't rely solely on external monitoring (combine with internal checks)

External Uptime Monitoring (Alternative: UptimeRobot)
Status: Alternative to BetterStack (not recommended)

BetterStack is recommended over UptimeRobot for Drop because:
Better Slack integration (richer notifications)
Built-in status page (UptimeRobot charges extra)
Better UI/UX for incident management
More flexible escalation policies

UptimeRobot Setup (if BetterStack unavailable)
Cost: Free tier (50 monitors, 5-minute interval)
Create account at uptimerobot.com
Add HTTP(S) monitor:
Friendly Name: Drop Production
URL: https://drop.alai.no/api/health
Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
Configure alert contacts:
Slack webhook (via Alert Contacts)
Email ( alem@alai.no )
Set Keyword Monitoring: Response contains "status":"ok"

Limitations:
No built-in escalation policies (requires third-party integrations)
Status page requires paid plan
Less detailed incident reports
5-minute check interval (vs 3-minute for BetterStack free)

Monitoring Stack Summary

Implemented (MC #1184)
✅ Health check endpoint — /api/health with real database verification
✅ Container health checks — Docker + Fly.io auto-restart on failure
❌ Error tracking — Sentry REMOVED (MC #1271)
✅ Slack alerting — Operational alerts with cooldown protection
✅ Lifecycle monitoring — App startup and graceful shutdown alerts
✅ Error spike detection — Automatic alerting when >5 errors/minute

Recommended (Manual Setup)
📋 External uptime monitoring — BetterStack checking /api/health every 3
minutes 📋 Structured logging — JSON log format with request IDs for correlation 📋 Metrics dashboard — Request latency, error rates, database query times 📋 Audit logging — Tracked as security requirement ( security/drop-security-rapport.md finding L3) Future Enhancements (TODO) Database performance monitoring (slow query alerts) Rate limit metrics (track 429 errors per endpoint) Business metrics dashboard (transactions per hour, success rate) Redis-backed error counter (persistent across restarts) Per-endpoint error tracking (isolate problematic routes) Environment Variables Reference Required for Production # Slack alerting SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX Dev Mode (All Optional) All monitoring features gracefully degrade when env vars are not set: No SLACK_WEBHOOK_URL: Alerts logged to console only This allows development to work without external services configured. Production Deployment Drop AWS Amplify Deployment Guide Rebrand note (2026-02-14): Originally titled "FontelePay". Product rebranded to Drop . Some env var references (Swan, Stripe) are FUTURE integrations — Drop uses a PSD2 pass-through model. See Drop CLAUDE.md . This guide covers deploying Drop to AWS Amplify in the Frankfurt (eu-central-1) region. Prerequisites AWS Account with Amplify access GitHub repository with Drop code Environment variables ready (see .env.example ) Step 1: Create Amplify App Go to AWS Amplify Console Ensure you're in eu-central-1 (Frankfurt) region Click Create new app Select Host web app Step 2: Connect Repository Choose GitHub as your Git provider Authorize AWS Amplify to access your GitHub account Select the Drop repository Choose the branch to deploy (e.g., main or production ) Step 3: Configure Build Settings Amplify will auto-detect Next.js. 
Verify the settings match amplify.yml :

    version: 1
    frontend:
      phases:
        preBuild:
          commands:
            - npm ci
        build:
          commands:
            - npm run build
      artifacts:
        baseDirectory: .next
        files:
          - '**/*'
      cache:
        paths:
          - node_modules/**/*
          - .next/cache/**/*

Step 4: Configure Environment Variables
In Amplify Console, go to App settings > Environment variables and add:

Required Variables
Variable | Description | Example
NODE_ENV | Environment | production
NEXT_PUBLIC_APP_URL | Your app URL | https://drop.amplifyapp.com

Swan BaaS
Variable | Description
SWAN_API_URL | https://api.swan.io (production)
SWAN_CLIENT_ID | OAuth2 Client ID
SWAN_CLIENT_SECRET | OAuth2 Client Secret
SWAN_PROJECT_ID | Project ID
SWAN_WEBHOOK_SECRET | Webhook validation secret

Stripe
Variable | Description
NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY | Publishable key (pk_live_...)
STRIPE_SECRET_KEY | Secret key (sk_live_...)
STRIPE_WEBHOOK_SECRET | Webhook secret (whsec_...)

Sumsub KYC
Variable | Description
SUMSUB_APP_TOKEN | App token
SUMSUB_SECRET_KEY | Secret key
SUMSUB_WEBHOOK_SECRET | Webhook secret
SUMSUB_LEVEL_NAME | KYC flow level

Database
Variable | Description
DATABASE_URL | PostgreSQL connection string
REDIS_URL | Redis connection string

Authentication
Variable | Description
JWT_SECRET | Min 32 characters
SESSION_SECRET | Min 32 characters

Step 5: Configure Next.js for Standalone Output
Update next.config.ts to enable standalone output for optimal Amplify deployment:

    import type { NextConfig } from "next";

    const nextConfig: NextConfig = {
      output: 'standalone',
    };

    export default nextConfig;

Step 6: Deploy
Click Save and deploy
Monitor the build in the Amplify Console
Once complete, your app will be available at https://<branch>.<app-id>.amplifyapp.com

Step 7: Configure Custom Domain (Optional)
Go to App settings > Domain management
Click Add domain
Enter your domain (e.g., app.getdrop.no )
Follow DNS configuration instructions
SSL certificate is automatically provisioned

Step 8: Set Up Branch Deployments
For staging/production workflows:
Go to App settings > General
Click Edit
Enable
Branch auto-detection Configure branch patterns: main -> Production staging -> Staging feature/* -> Preview environments Monitoring & Health Checks Health Endpoint The app exposes /api/health for load balancer health checks: curl https://your-app.amplifyapp.com/api/health Response: { "status": "healthy", "timestamp": "2026-02-05T12:00:00.000Z", "version": "0.1.0", "uptime": 3600, "checks": {} } CloudWatch Logs Go to App settings > Monitoring View build logs and access logs Set up CloudWatch alarms for errors Troubleshooting Build Fails Check build logs in Amplify Console Verify package.json scripts are correct Ensure all dependencies are in package.json Environment Variables Not Working Verify variables are set in Amplify Console Remember: NEXT_PUBLIC_ prefix required for client-side access Redeploy after changing environment variables 502/503 Errors Check /api/health endpoint Review CloudWatch logs Verify database connections are correct Check memory limits (adjust if needed) Cold Starts For serverless functions, cold starts may occur. 
Mitigate by: Using connection pooling for databases Keeping functions warm with scheduled pings Optimizing bundle size Security Checklist All secrets in Environment Variables (not in code) HTTPS enforced (automatic in Amplify) CORS configured correctly Rate limiting implemented Webhook signatures validated No sensitive data in logs Cost Optimization Use cache.paths in amplify.yml to speed up builds Enable CloudFront caching for static assets Monitor build minutes usage Consider reserved concurrency for predictable traffic Rollback To rollback to a previous deployment: Go to Deployments in Amplify Console Find the previous successful deployment Click Redeploy this version Support AWS Amplify Documentation Next.js on AWS Amplify Drop Internal Docs BetterStack Setup BetterStack Uptime Monitoring Setup Guide Last updated: 2026-02-20 Related: MONITORING.md , health-check.sh Purpose: External uptime monitoring for Drop production environment Why BetterStack? BetterStack provides external uptime monitoring independent of Drop's infrastructure: Detects infrastructure failures (AWS App Runner crashes, network issues) Alerts when the entire application is unreachable Provides uptime SLA tracking and historical reports Multiple notification channels (Slack, Email, SMS) Status page for client transparency Key difference from internal health checks: Internal checks (Docker, Fly.io) only work when the container is running. BetterStack catches total outages. 
Free Tier Limits Plan: Free tier (no credit card required) Limits: 10 monitors (enough for Drop production) 3-minute check interval (paid plan: 30s minimum) 1 status page Unlimited team members Unlimited integrations (Slack, email, webhooks) Upgrade required for: Faster check intervals (<3 minutes) More than 10 monitors (e.g., multi-region checks) Advanced features (maintenance windows, custom headers) Account Setup Step 1: Create Account Go to https://betterstack.com/uptime Click "Start free trial" (becomes free tier after trial) Sign up with Alem's email: alem@alai.no Verify email address Create workspace name: "ALAI Products" (shared across Drop, BasicFakta) Step 2: Configure Team Navigate to Settings > Team Add team members: alem@alai.no (Owner) john@basicconsulting.no (Admin) Set Default timezone: Europe/Oslo (UTC+1) Monitor Configuration Monitor 1: Health Endpoint (Primary) Purpose: Verify API health and database connectivity Go to Monitors > Create Monitor Configure: Monitor name: Drop Health Check Monitor type: HTTP URL: https://drop.alai.no/api/health Check interval: 3 minutes (free tier) Request timeout: 5 seconds Method: GET Confirmation period: 30 seconds (1 retry before alerting) Expected Response: Status code: 200 Keyword check: Enable Response body contains: "status":"ok" Why: Ensures health endpoint returns valid JSON, not just HTTP 200 Advanced settings: Follow redirects: Enabled (default) Verify SSL certificate: Enabled SSL expiry warning: 14 days before expiration Click Create Monitor Monitor 2: Landing Page Purpose: Verify public website availability Go to Monitors > Create Monitor Configure: Monitor name: Drop Landing Page Monitor type: HTTP URL: https://drop.alai.no Check interval: 3 minutes Request timeout: 10 seconds (landing page has more assets) Method: GET Confirmation period: 30 seconds Expected Response: Status code: 200 Keyword check: Enable Response body contains: Send penger (tagline verification) Click Create Monitor Monitor 3: 
Multi-Region Health Check Purpose: Detect regional networking issues Go to Monitors > Create Monitor Configure: Monitor name: Drop Health (US East) Monitor type: HTTP URL: https://drop.alai.no/api/health Check interval: 3 minutes Request timeout: 5 seconds Method: GET Confirmation period: 30 seconds Expected Response: Status code: 200 Keyword check: Response body contains "status":"ok" Advanced settings: Region: US East (different from default EU region) Why: Detects if Drop is unreachable from specific geographies Click Create Monitor Slack Integration Step 1: Create Slack Incoming Webhook Go to your Slack workspace: alai-talk.slack.com Navigate to Slack App Directory > Incoming Webhooks Click Add to Slack Select channel: #drop-ops (create if doesn't exist) Click Add Incoming Webhooks Integration Copy webhook URL (format: https://hooks.slack.com/services/T.../B.../XXX ) Save this URL securely (needed for BetterStack) Step 2: Add Slack Integration in BetterStack In BetterStack, go to Integrations Click Add Integration > Slack Paste webhook URL from Step 1 Configure: Integration name: Drop Ops Slack Notification channel: #drop-ops Test integration: Click Send test message Verify message appears in #drop-ops channel Click Save Integration On-Call Team Setup Step 1: Create On-Call Schedule Go to On-Call > Create Schedule Configure: Schedule name: Drop Primary On-Call Timezone: Europe/Oslo Add rotation: Team member: alem@alai.no Schedule type: 24/7 (always on-call for now) Click Create Schedule Step 2: Configure Escalation Policy Go to Escalation Policies > Create Policy Configure: Policy name: Drop Production Incidents Add escalation steps: Step 1 (Immediate): Who: Drop Ops Slack integration Delay: 0 minutes Step 2 (If still down after 5 minutes): Who: alem@alai.no (Email) Delay: 5 minutes Step 3 (If still down after 15 minutes): Who: alem@alai.no (SMS) — Requires phone number Delay: 15 minutes Note: SMS requires paid plan or verified phone number Click Create Policy 
Step 3: Assign Policy to Monitors Go to Monitors For each monitor ( Drop Health Check , Drop Landing Page , Drop Health (US East) ): Click monitor name Go to Settings > Escalation Policy Select: Drop Production Incidents Click Save Status Page Setup Purpose Public status page allows clients and stakeholders to check Drop availability without contacting support. Step 1: Create Status Page Go to Status Pages > Create Status Page Configure: Page name: Drop Status Subdomain: drop-status (URL: https://drop-status.betteruptime.com ) Custom domain (optional): status.drop.alai.no (requires DNS setup) Design settings: Logo: Upload Drop logo (green rounded rectangle) Brand color: #0B6E35 (Drop primary green) Header text: Drop Status Tagline: Real-time service status and incident updates Visibility: Public: Yes (anyone can view) Search engine indexing: No (prevent Google indexing) Click Create Status Page Step 2: Add Components In the status page settings, go to Components Click Add Component Add three components: Component 1: Name: API & Health Endpoint Linked monitor: Drop Health Check Description: Core API functionality and database connectivity Component 2: Name: Landing Page Linked monitor: Drop Landing Page Description: Public website and marketing content Component 3: Name: Global Network Linked monitor: Drop Health (US East) Description: International access and routing Click Save Components Step 3: Configure Incident Communication Go to Status Pages > Settings > Incident Updates Enable: Auto-create incidents: Yes (when monitor goes down) Auto-resolve incidents: Yes (when monitor recovers) Notification subscribers: Email subscriptions: Enabled (users can subscribe to updates) Webhook notifications: Disabled (optional for future) Step 4: Share Status Page Once created, share the status page URL: Internal: Add to #drop-ops Slack channel description External: Link from Drop landing page footer (optional) Clients: Include in onboarding emails Status Page URL: 
https://drop-status.betteruptime.com Verification Checklist After completing setup, verify: Monitors running: All 3 monitors show green status Slack alerts working: Test by pausing a monitor (triggers down alert) Email notifications working: Verify Alem receives email on test alert Status page public: Open status page URL in incognito mode Escalation policy assigned: All monitors use Drop Production Incidents policy SSL expiry alerts: Monitors configured to warn 14 days before cert expiration Testing the Setup Test 1: Manual Down Alert Go to Monitors > Drop Health Check Click Pause Monitor (simulates downtime) Expected behavior: Slack alert in #drop-ops within 30 seconds Email to alem@alai.no after 5 minutes (if still paused) Click Resume Monitor to clear alert Test 2: Actual Downtime SSH into production server (or use AWS App Runner console) Stop the Drop application container temporarily Wait for BetterStack to detect downtime (max 3 minutes + 30s confirmation) Expected behavior: Monitor shows red status Slack alert in #drop-ops Status page component shows "Down" Restart application and verify recovery alert Test 3: SSL Expiry Warning Go to Monitors > Drop Health Check Verify SSL expiry warning is enabled (14 days) Expected behavior: Alert sent 14 days before SSL certificate expiration Action required: Renew certificate before expiry Alert Examples Downtime Alert (Slack) 🚨 Drop Health Check is DOWN Monitor: Drop Health Check Status: DOWN Response: Connection timeout Region: EU West Time: 2026-02-20 10:30 UTC View incident: https://betterstack.com/incidents/... Recovery Alert (Slack) ✅ Drop Health Check is UP Monitor: Drop Health Check Status: UP Response: 200 OK (2ms) Downtime duration: 3 minutes Time: 2026-02-20 10:33 UTC Incident closed: https://betterstack.com/incidents/... 
SSL Expiry Warning (Email) Subject: [BetterStack] SSL certificate expiring in 14 days Monitor: Drop Health Check Domain: drop.alai.no Certificate expiry: 2026-03-06 23:59 UTC Action required: Renew SSL certificate before expiration. Maintenance Mode When performing planned maintenance (deployments, infrastructure upgrades): Go to Maintenance Windows > Create Window Configure: Name: Drop Deployment Start time: 2026-02-20 22:00 UTC Duration: 1 hour Affected monitors: Select all Drop monitors Notification: Status page update: Yes (shows maintenance banner) Alert suppression: Yes (no downtime alerts during window) Click Create Maintenance Window Effect: During maintenance, downtime alerts are suppressed and status page shows "Scheduled Maintenance" instead of "Down". Best Practices Do's ✅ Test alerts monthly — Pause a monitor to verify escalation works ✅ Update on-call schedule — Rotate on-call duty if team grows ✅ Monitor SSL expiry — Enable 14-day warnings to prevent outages ✅ Use maintenance windows — Prevent false alerts during deployments ✅ Review incident history — Monthly review of downtime patterns Don'ts ❌ Don't ignore degraded status — Investigate even if not fully down ❌ Don't disable monitors — Use pause for temporary suppression only ❌ Don't skip keyword checks — HTTP 200 alone doesn't guarantee working API ❌ Don't forget to update URLs — When domain changes, update all monitors ❌ Don't rely solely on external monitoring — Combine with internal health checks Troubleshooting Monitor shows false positives (frequent up/down) Cause: Network instability or slow response times Fix: Increase Request timeout from 5s to 10s Increase Confirmation period from 30s to 60s Check Drop API latency in logs Slack alerts not received Cause: Webhook URL incorrect or channel archived Fix: Go to Integrations > Drop Ops Slack Click Send test message If fails, regenerate webhook in Slack and update BetterStack Email alerts delayed Cause: Email provider spam filtering Fix: 
Whitelist notifications@betterstack.com in email settings Check spam/junk folder Verify email address in BetterStack team settings Status page not updating Cause: Monitor not linked to status page component Fix: Go to Status Pages > Drop Status > Components Ensure each component has a Linked monitor assigned Save changes and trigger test alert Related Documentation MONITORING.md — Full monitoring stack overview health-check.sh — Internal health check script alerts.ts — Slack alerting implementation /api/health route — Health endpoint source code Support BetterStack Support: Documentation: https://betterstack.com/docs Email: support@betterstack.com Status: https://status.betterstack.com Internal Contact: Slack: #drop-ops Email: alem@alai.no Sentry Setup Drop Sentry Setup Last updated: 2026-02-20 Source: src/drop-app/src/lib/sentry.ts , src/drop-app/src/lib/sentry-server.ts , src/drop-api/src/lib/sentry.ts , src/drop-app/.env.example Overview Drop uses Sentry for error tracking and performance monitoring across three components: drop-app (client-side) - Browser errors via @sentry/browser drop-app (server-side) - Next.js middleware/API errors via custom envelope API drop-api - Backend API errors via @sentry/node All three components share the same DSN and gracefully degrade to console-only logging when Sentry is not configured. Sentry Account Setup 1. Create Free Sentry Account Visit sentry.io and sign up (free tier: 5,000 errors/month) Confirm email and log in 2. 
Create Projects Create two separate projects (one for app, one for API): Project 1: drop-app Click Projects → Create Project Platform: Next.js Project name: drop-app Team: Default team (or create drop-team ) Alert frequency: On every new issue Click Create Project Copy the DSN (format: https://examplePublicKey@o0.ingest.sentry.io/0 ) Project 2: drop-api Repeat steps above with platform Node.js Project name: drop-api Copy the DSN (different from drop-app) IMPORTANT: Use separate projects to keep frontend and backend errors isolated. Environment Variables Configuration drop-app (.env.local) Add these variables to src/drop-app/.env.local : # --- Sentry (Error Tracking) --- # Client-side error tracking (browser) NEXT_PUBLIC_SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_PROJECT_ID # Server-side error tracking (middleware/API routes) # NOTE: drop-app server uses custom envelope API (no @sentry/nextjs due to Turbopack incompatibility) # Both client and server use the SAME DSN (NEXT_PUBLIC_SENTRY_DSN) # Optional: Performance monitoring sample rate (0.0 to 1.0, default: 0.1 = 10%) NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1 # Optional: For source map uploads (requires auth token from Sentry → Settings → Auth Tokens) SENTRY_ORG=your-org-slug SENTRY_PROJECT=drop-app SENTRY_AUTH_TOKEN=your-auth-token drop-api (.env) Add these variables to src/drop-api/.env : # --- Sentry (Error Tracking) --- SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_API_PROJECT_ID # Optional: Performance monitoring sample rate (0.0 to 1.0, default: 0.1 = 10%) SENTRY_TRACES_SAMPLE_RATE=0.1 # Optional: For source map uploads SENTRY_ORG=your-org-slug SENTRY_PROJECT=drop-api SENTRY_AUTH_TOKEN=your-auth-token Where to find these values: DSN: Project Settings → Client Keys (DSN) Org slug: Settings → Organization → General Settings → Organization Slug Project name: Project Settings → General → Project Name Auth token: Settings → Auth Tokens → Create New Token (scopes: project:releases , 
project:write ) Verification Test Client-Side Error Capture (drop-app) Start the app: npm run dev (in src/drop-app/ ) Open browser console: http://localhost:3000 Trigger test error via console: throw new Error("Sentry test error - client-side"); Check Sentry dashboard: Projects → drop-app → Issues You should see the test error appear within 10 seconds Expected behavior: Error logged to browser console: [Sentry] Error captured: Error: Sentry test error - client-side Error appears in Sentry dashboard with stack trace, breadcrumbs, and browser context Test Server-Side Error Capture (drop-app) Create test API route: src/drop-app/src/app/api/sentry-test/route.ts import { NextResponse } from 'next/server'; import { captureServerError } from '@/lib/sentry-server'; export async function GET() { try { throw new Error('Sentry test error - server-side'); } catch (error) { captureServerError(error as Error, { tags: { test: 'true' } }); return NextResponse.json({ error: 'Test error sent to Sentry' }, { status: 500 }); } } Visit: http://localhost:3000/api/sentry-test Check server console: [Sentry Server] Error captured: Error: Sentry test error - server-side Check Sentry dashboard: Projects → drop-app → Issues Test API Error Capture (drop-api) Start the API: npm run dev (in src/drop-api/ ) Trigger test error via curl: curl http://localhost:4000/api/sentry-test OR create test endpoint in src/drop-api/src/routes/test.ts : import { Router } from 'express'; import { captureError } from '../lib/sentry.js'; const router = Router(); router.get('/sentry-test', (req, res) => { try { throw new Error('Sentry test error - API'); } catch (error) { captureError(error as Error, { tags: { test: 'true' } }); res.status(500).json({ error: 'Test error sent to Sentry' }); } }); export default router; Check Sentry dashboard: Projects → drop-api → Issues Source Map Upload Setup Source maps allow Sentry to show readable stack traces instead of minified code. 1. 
Install Sentry CLI # macOS (Homebrew) brew install getsentry/tools/sentry-cli # Or via npm (global) npm install -g @sentry/cli 2. Configure Sentry CLI Create .sentryclirc in project root: [defaults] url=https://sentry.io/ org=your-org-slug project=drop-app [auth] token=your-auth-token IMPORTANT: Add .sentryclirc to .gitignore (contains auth token). 3. Add Build Script (drop-app) Update src/drop-app/package.json : { "scripts": { "build": "next build", "build:sentry": "next build && sentry-cli sourcemaps upload --validate .next/static" } } 4. Test Source Map Upload cd src/drop-app npm run build:sentry Expected output: > Analyzing source maps for sentry > Uploading source maps to Sentry ✓ Successfully uploaded source maps 5. CI/CD Integration For automated uploads in CI/CD, add these secrets to your deployment platform: Vercel/Railway/Fly.io: SENTRY_ORG SENTRY_PROJECT SENTRY_AUTH_TOKEN Then update build command: npm run build && sentry-cli sourcemaps upload --validate .next/static Alert Rules Configuration Recommended Alert Rules 1. New Issue Alert (drop-app) Go to Projects → drop-app → Settings → Alerts Click Create Alert Rule Configure: Conditions: When a new issue is created Filters: Environment = production Actions: Send notification to: Slack channel #drop-alerts Send email to: alem@alai.no Save rule 2. High Error Rate Alert (drop-app) Create new alert rule Configure: Conditions: Number of events in an issue is more than 100 in 1 hour Filters: Environment = production, Level = error Actions: Send notification to: Slack channel #drop-alerts Send email to: alem@alai.no Save rule 3. Critical Error Alert (drop-api) Go to Projects → drop-api → Settings → Alerts Create alert rule: Conditions: When a new issue is created AND Level = fatal Filters: Environment = production Actions: Send notification to: Slack channel #drop-critical Send email to: alem@alai.no Save rule 4. 
Performance Degradation Alert (drop-app) Create alert rule: Conditions: Average transaction duration is above 2000ms for 5 minutes Filters: Environment = production, Transaction = /api/transactions/* Actions: Send notification to: Slack channel #drop-performance Save rule Slack Integration (Optional) Go to Settings → Integrations → Slack Click Add Workspace Authorize Sentry to access your Slack workspace Select channels: #drop-alerts , #drop-critical , #drop-performance Test integration by triggering a test error PII Scrubbing All three Sentry integrations automatically scrub sensitive data before sending events: Scrubbed fields: password pin cardNumber cvv fødselsnummer authorization headers cookie headers Implementation: drop-app (client): src/drop-app/src/lib/sentry.ts (lines 51-76) drop-app (server): Custom envelope API (no PII in server-side events) drop-api: src/drop-api/src/lib/sentry.ts (lines 48-139) Verification: Trigger error with sensitive data: try { throw new Error('Login failed for user with password=secret123'); } catch (error) { captureError(error, { extra: { cardNumber: '1234567890123456' } }); } Check Sentry event: Message should show: Login failed for user with password=[REDACTED] Extra context should show: cardNumber: [REDACTED] Environment-Specific Configuration Development DSN: Optional (errors log to console only if not set) Sample rate: 1.0 (capture all errors for debugging) Source maps: Not required (local stack traces are readable) # .env.local (development) NEXT_PUBLIC_SENTRY_DSN= # Leave empty to disable Sentry in dev Staging DSN: Required (test Sentry integration before production) Sample rate: 0.5 (capture 50% of transactions) Source maps: Enabled (verify uploads work) # .env.staging NEXT_PUBLIC_SENTRY_DSN=https://YOUR_KEY@sentry.io/YOUR_PROJECT_ID NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.5 SENTRY_AUTH_TOKEN=your-auth-token Production DSN: Required (critical for production monitoring) Sample rate: 0.1 (capture 10% of transactions to 
stay within free tier) Source maps: Enabled (required for readable stack traces) # .env.production NEXT_PUBLIC_SENTRY_DSN=https://YOUR_KEY@sentry.io/YOUR_PROJECT_ID NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1 SENTRY_AUTH_TOKEN=your-auth-token Troubleshooting No errors appearing in Sentry dashboard Check 1: DSN configured? # drop-app echo $NEXT_PUBLIC_SENTRY_DSN # drop-api echo $SENTRY_DSN Check 2: Console output? Errors should ALWAYS log to console, even if Sentry upload fails Look for: [Sentry] Error captured: ... Check 3: Network errors? Open browser DevTools → Network tab Filter by sentry.io Check for failed requests (should see POST to https://o0.ingest.sentry.io/api/.../envelope/ ) Check 4: Environment mismatch? Sentry filters events by environment ( production , development , staging ) Verify NEXT_PUBLIC_APP_ENV or NODE_ENV matches your Sentry project filters Source maps not working (minified stack traces) Check 1: Source maps uploaded? cd src/drop-app sentry-cli releases list Check 2: Release version matches? Sentry matches source maps by release version Verify package.json version matches uploaded release Check 3: Upload command ran? 
# Manually test upload
sentry-cli sourcemaps upload --validate .next/static

PII still appearing in events

Check 1: Verify beforeSend hook
- Inspect src/drop-app/src/lib/sentry.ts (client) or src/drop-api/src/lib/sentry.ts (API)
- Confirm the beforeSend function is scrubbing sensitive keys

Check 2: Add custom scrubbing
If new sensitive fields appear, add them to the scrubbing list:

    const sensitiveKeys = ["password", "pin", "yourNewField"];

Cost Management

Sentry Free Tier:
- 5,000 errors per month
- 10,000 performance units per month
- 1 GB attachments
- 30 days retention

Staying within free tier:
- Lower sample rate: Set SENTRY_TRACES_SAMPLE_RATE=0.1 (10%)
- Filter noisy errors: Use beforeSend to ignore expected errors (e.g., 404s)
- Set up quotas: Sentry → Settings → Quotas → Set monthly limits

Example: Ignore 404 errors

    beforeSend(event, hint) {
      // Ignore 404 errors raised by API requests
      if (event.request?.url?.includes('/api/') &&
          hint?.originalException?.message?.includes('404')) {
        return null; // Don't send to Sentry
      }
      return event;
    }

Security Considerations

Auth token storage:
- NEVER commit .sentryclirc to git
- Store SENTRY_AUTH_TOKEN in CI/CD secrets, not .env files

DSN exposure:
- NEXT_PUBLIC_SENTRY_DSN is exposed to client-side code (safe - it's public)
- Sentry rate-limits abuse via DSN quotas

PII scrubbing:
- Always verify PII scrubbing works before deploying to production
- Test with real-world data patterns (Norwegian fødselsnummer, BankID tokens)

Access control:
- Limit Sentry dashboard access to authorized team members only
- Use Sentry Teams to restrict project access

References
- Sentry Docs: https://docs.sentry.io/platforms/javascript/guides/nextjs/
- Sentry CLI: https://docs.sentry.io/product/cli/
- Source Maps: https://docs.sentry.io/platforms/javascript/sourcemaps/
- PII Scrubbing: https://docs.sentry.io/platforms/javascript/data-management/sensitive-data/
- Alert Rules: https://docs.sentry.io/product/alerts/

Next Steps
- Create Sentry account and projects (drop-app, drop-api)
- Add DSN to .env.local (development) and .env.production
(production)
- Test error capture in all three components
- Configure alert rules (new issues, high error rate, critical errors)
- Set up source map uploads for production builds
- Integrate Slack notifications (optional)
- Monitor error dashboard daily during initial deployment

CloudWatch Logs Setup (Drop Production)

Date: 2026-02-22
Priority: P0 (Production Blocker)
Effort: 2 hours
Cost: ~$5/month (30 GB ingestion)

Overview

AWS App Runner automatically streams application logs (stdout/stderr) to CloudWatch Logs. This setup guide configures retention policies, Log Insights queries, and alarms for production monitoring.

Prerequisites
- AWS CLI configured with credentials
- App Runner service deployed to eu-west-1
- Application writes JSON logs to stdout (already implemented via src/lib/logger.ts)

Configuration

1. Set Log Retention Policy

Default: CloudWatch Logs are retained indefinitely (expensive)
Recommendation: 30 days (production), 7 days (staging)

    # Production: 30 days retention
    aws logs put-retention-policy \
      --log-group-name /aws/apprunner/drop-production \
      --retention-in-days 30 \
      --region eu-west-1

    # Staging: 7 days retention
    aws logs put-retention-policy \
      --log-group-name /aws/apprunner/drop-staging \
      --retention-in-days 7 \
      --region eu-west-1

Verify retention:

    aws logs describe-log-groups \
      --log-group-name-prefix /aws/apprunner/drop \
      --region eu-west-1 \
      | jq '.logGroups[] | {name: .logGroupName, retention: .retentionInDays}'

    # Expected:
    # {
    #   "name": "/aws/apprunner/drop-production",
    #   "retention": 30
    # }

2. Create Log Insights Queries

Purpose: Pre-built queries for common investigations.
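The saved queries in this section filter on fields such as level, requestId, and metadata.userId. CloudWatch Logs Insights discovers fields in JSON log events automatically and addresses nested fields with dot notation, so the queries only match if each log line is a single JSON object with those keys. The sketch below shows one entry shape that would satisfy them. It is an assumption for illustration: the real src/lib/logger.ts is not reproduced in this guide, and the names LogEntry, buildEntry, formatEntry, and log are hypothetical.

```typescript
// Hypothetical sketch; the real src/lib/logger.ts is not reproduced in this guide.
type LogLevel = "debug" | "info" | "warn" | "error";

// Field names mirror what the saved Log Insights queries filter on.
interface LogEntry {
  timestamp: string; // assumed ISO 8601; CloudWatch also adds its own @timestamp
  level: LogLevel;
  message: string;
  requestId?: string;
  metadata?: Record<string, unknown>;
}

// Build an entry with the current time; optional fields are dropped when undefined.
function buildEntry(
  level: LogLevel,
  message: string,
  requestId?: string,
  metadata?: Record<string, unknown>,
): LogEntry {
  return { timestamp: new Date().toISOString(), level, message, requestId, metadata };
}

// One JSON object per line, so each log event parses as a single JSON document.
function formatEntry(entry: LogEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Write to stdout, which App Runner streams to CloudWatch Logs.
function log(
  level: LogLevel,
  message: string,
  requestId?: string,
  metadata?: Record<string, unknown>,
): void {
  process.stdout.write(formatEntry(buildEntry(level, message, requestId, metadata)));
}

// Example: an entry that the error and payment-failure style filters would match.
log("error", "Payment rejected", "req_abc123", {
  userId: "usr_123",
  errorCode: "PAYMENT_REJECTED",
});
```

Keeping request-scoped values at the top level (requestId) and everything else under metadata matches how the queries reference them; if the actual logger nests fields differently, the field paths in the queries need to change accordingly.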
Query 1: All Errors (Last Hour) fields @timestamp, level, message, metadata.error, metadata.userId, requestId | filter level = "error" | sort @timestamp desc | limit 100 Save as: drop-errors-last-hour Query 2: User Activity Trace fields @timestamp, level, message, metadata.userId, metadata.action, requestId | filter metadata.userId = "usr_123" | sort @timestamp desc | limit 500 Save as: drop-user-activity-trace Query 3: Request Trace by ID fields @timestamp, level, message, metadata | filter requestId = "req_abc123" | sort @timestamp asc Save as: drop-request-trace Query 4: API Endpoint Performance fields @timestamp, message, metadata.endpoint, metadata.latencyMs | filter metadata.latencyMs > 1000 | stats avg(metadata.latencyMs) as avg_latency, max(metadata.latencyMs) as max_latency, count() as slow_requests by metadata.endpoint | sort slow_requests desc Save as: drop-slow-endpoints Query 5: Authentication Events fields @timestamp, level, message, metadata.action, metadata.userId, metadata.ip | filter metadata.action in ["login_success", "login_failure", "logout"] | sort @timestamp desc | limit 100 Save as: drop-auth-events Query 6: Payment Failures fields @timestamp, level, message, metadata.errorCode, metadata.transactionId, metadata.userId | filter metadata.errorCode in ["INSUFFICIENT_FUNDS", "PAYMENT_REJECTED", "TIMEOUT"] | sort @timestamp desc | limit 50 Save as: drop-payment-failures 3. 
Create CloudWatch Alarms Alarm 1: High Error Rate Metric: Error log entries per minute Threshold: >10 errors/minute for 2 consecutive periods Action: Send SNS notification → Slack webhook # Create metric filter aws logs put-metric-filter \ --log-group-name /aws/apprunner/drop-production \ --filter-name drop-error-count \ --filter-pattern '{ $.level = "error" }' \ --metric-transformations \ metricName=ErrorCount,metricNamespace=Drop/Logs,metricValue=1,unit=Count \ --region eu-west-1 # Create alarm aws cloudwatch put-metric-alarm \ --alarm-name drop-high-error-rate \ --alarm-description "Alert when error rate exceeds threshold" \ --metric-name ErrorCount \ --namespace Drop/Logs \ --statistic Sum \ --period 60 \ --evaluation-periods 2 \ --threshold 10 \ --comparison-operator GreaterThanThreshold \ --treat-missing-data notBreaching \ --alarm-actions \ --region eu-west-1 Alarm 2: No Logs Received (Service Down) Metric: Log ingestion stopped Threshold: No logs for 5 minutes Action: Send SNS notification aws cloudwatch put-metric-alarm \ --alarm-name drop-no-logs-received \ --alarm-description "Alert when no logs received (service may be down)" \ --metric-name IncomingLogEvents \ --namespace AWS/Logs \ --dimensions Name=LogGroupName,Value=/aws/apprunner/drop-production \ --statistic Sum \ --period 300 \ --evaluation-periods 1 \ --threshold 1 \ --comparison-operator LessThanThreshold \ --treat-missing-data breaching \ --alarm-actions \ --region eu-west-1 Alarm 3: Database Errors Metric: Database connection errors Threshold: >5 DB errors in 5 minutes aws logs put-metric-filter \ --log-group-name /aws/apprunner/drop-production \ --filter-name drop-db-errors \ --filter-pattern '{ $.message = "*database*" && $.level = "error" }' \ --metric-transformations \ metricName=DatabaseErrors,metricNamespace=Drop/Logs,metricValue=1,unit=Count \ --region eu-west-1 aws cloudwatch put-metric-alarm \ --alarm-name drop-database-errors \ --metric-name DatabaseErrors \ --namespace Drop/Logs \ 
--statistic Sum \ --period 300 \ --evaluation-periods 1 \ --threshold 5 \ --comparison-operator GreaterThanThreshold \ --alarm-actions \ --region eu-west-1 4. SNS Topic for Alerts Create SNS topic (if not exists): aws sns create-topic \ --name drop-cloudwatch-alerts \ --region eu-west-1 # Output: # { # "TopicArn": "arn:aws:sns:eu-west-1:324480209768:drop-cloudwatch-alerts" # } Subscribe Slack webhook: # Option 1: Email subscription (immediate) aws sns subscribe \ --topic-arn arn:aws:sns:eu-west-1:324480209768:drop-cloudwatch-alerts \ --protocol email \ --notification-endpoint alem@alai.no \ --region eu-west-1 # Confirm subscription via email link # Option 2: Lambda → Slack (requires Lambda function) # See: infrastructure/cloudwatch-to-slack-lambda.md (future enhancement) 5. Export Logs to S3 (Compliance/Archival) Purpose: Long-term storage (>30 days) for compliance, cheaper than CloudWatch. Create S3 bucket: aws s3 mb s3://drop-logs-archive --region eu-west-1 # Set lifecycle policy (move to Glacier after 90 days) cat > lifecycle.json < --region eu-west-1 Via Log Streaming (Real-Time) # Stream logs in real-time (like tail -f) aws logs tail /aws/apprunner/drop-production \ --follow \ --format short \ --region eu-west-1 # Filter by error level aws logs tail /aws/apprunner/drop-production \ --follow \ --filter-pattern '{ $.level = "error" }' \ --region eu-west-1 Troubleshooting Issue: No logs appearing in CloudWatch Diagnosis: # Check if log group exists aws logs describe-log-groups \ --log-group-name-prefix /aws/apprunner/drop \ --region eu-west-1 # Check App Runner service logs integration aws apprunner describe-service \ --service-arn \ --region eu-west-1 \ | jq '.Service.ObservabilityConfiguration' Solution: App Runner auto-creates log group on first log output Verify app is writing to stdout (not file) Check IAM permissions (App Runner role needs logs:CreateLogStream , logs:PutLogEvents ) Issue: Logs not in JSON format Diagnosis: # Check log entries aws logs tail 
/aws/apprunner/drop-production --format short --region eu-west-1 | head -10

Solution:
- Ensure app uses logger.ts for all logging (not console.log)
- Verify process.stdout.write(JSON.stringify(entry) + "\n") is used

Checklist
- Retention policy set (30 days production, 7 days staging)
- Log Insights queries saved (6 queries)
- Metric filters created (error count, DB errors)
- CloudWatch alarms configured (3 alarms)
- SNS topic created and subscribed (email/Slack)
- S3 export bucket created (with lifecycle policy)
- Cost estimate reviewed and approved
- Team trained on log querying (AWS Console + CLI)
- Documentation updated

Next Steps
- Deploy retention policies (run commands above)
- Test alarms (trigger error spike, verify alert received)
- Save Log Insights queries (via AWS Console)
- Schedule monthly S3 export (manual for now, automate later)
- Monitor costs (set billing alert at $20/month)

Related Documentation
- docs/infrastructure/MONITORING.md: Overall monitoring setup
- src/lib/logger.ts: Structured logging implementation
- infrastructure/error-tracking-setup.md: Sentry integration
- AWS CloudWatch Logs docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/

Last Updated: 2026-02-22
Owner: John (AI Director)

DevOps/SRE Stack for Drop (originally FontelePay)

Rebrand note (2026-02-14): FontelePay was renamed to Drop. Some references to FontelePay remain in this document (metric names, Sentry projects, API URLs). These should be updated when implementing the actual DevOps stack. Drop uses a PSD2 pass-through model: no wallet, no balance held by Drop.

Table of Contents
1. Executive Summary
2. CI/CD Pipeline
3. Testing Strategy
4. Monitoring & Observability
5. Error Tracking
6. Alerting & Incident Management
7. Documentation
8. Security Operations
9. Cost Summary
10. Implementation Priority
11. Integration Diagram

1.
Executive Summary

Stack Philosophy

Drop requires a DevOps/SRE stack that balances:
- Fintech compliance (audit trails, security, GDPR)
- Cost efficiency for MVP phase
- Scalability for growth to 100K+ users
- EU data residency where possible
- Small team maintainability (1-2 DevOps engineers)

Recommended Stack Overview

| Area | MVP Tool | Scale Tool | Reason |
| --- | --- | --- | --- |
| CI/CD | GitHub Actions | GitHub Actions + ArgoCD | Native GitHub, EU runners available |
| E2E Testing | Playwright | Playwright | Open-source, excellent mobile web |
| Load Testing | k6 | k6 + Grafana Cloud | Grafana ecosystem, scriptable |
| APM | Grafana Cloud | Grafana Cloud | EU-hosted, cost-effective |
| Logs | Grafana Loki | Grafana Loki | Part of Grafana stack |
| Errors | Sentry | Sentry | Best-in-class, EU hosting |
| Alerts | Slack + PagerDuty | PagerDuty | Start simple, scale |
| Secrets | AWS Secrets Manager | AWS Secrets Manager | Native AWS, compliant |
| Security Scan | Snyk | Snyk + DAST | Developer-friendly |

Total MVP Monthly Cost: EUR 30-110/month (itemized in 9.1)
Total Scale Monthly Cost: EUR 2,090-3,660/month (itemized in 9.3)

2. CI/CD Pipeline

2.1 Recommendation: GitHub Actions

Why GitHub Actions over alternatives:

| Criteria | GitHub Actions | GitLab CI | CircleCI |
| --- | --- | --- | --- |
| Native Integration | Best (GitHub) | Requires migration | Good |
| EU Runners | Yes (Azure EU) | Yes | Limited |
| Free Tier | 2,000 min/month | 400 min/month | 6,000 min/month |
| Secrets Management | Native | Native | Native |
| Self-hosted Runners | Yes | Yes | Limited |
| Marketplace | Largest | Growing | Medium |
| Learning Curve | Low | Medium | Medium |
| OIDC for AWS | Native | Requires setup | Requires setup |

Decision: GitHub Actions
- Already using GitHub for source control
- Native OIDC integration with AWS (no long-lived credentials)
- EU-hosted runners available
- Excellent ecosystem of actions
- Cost-effective at scale

2.2 Pipeline Architecture

    # .github/workflows/main.yml structure
    Triggers:
      - push to main/develop
      - pull request
      - manual dispatch

    Jobs:
      1. lint-and-format
         - ESLint, Prettier
         - Parallel for speed
      2. security-scan
         - Snyk dependency check
         - Secret scanning
         - SAST (CodeQL)
      3.
test-unit - Jest (backend/frontend) - Coverage threshold: 80% 4. test-integration - Database tests - API contract tests 5. build - Docker image build - Multi-arch (amd64/arm64) 6. test-e2e (staging only) - Playwright - Against staging environment 7. deploy-staging - Automatic on develop merge 8. deploy-production - Manual approval required - Canary deployment 2.3 Deployment Strategies MVP Phase: Rolling Deployment Simple, works with small user base Zero-downtime with K8s rolling updates Easy rollback Scale Phase: Canary Deployment Production Traffic: ├── 95% → Current Version └── 5% → New Version (canary) Promotion: Manual after metrics validation Rollback: Automatic on error rate spike Implementation: ArgoCD + Argo Rollouts GitOps model (infrastructure as code) Automated sync from Git Progressive delivery Audit trail of all deployments 2.4 Branch Strategy main (production) ↑ └── develop (staging) ↑ └── feature/* (development) └── hotfix/* (emergency fixes) Rules: main : Protected, requires PR + approval + passing CI develop : Protected, requires PR + passing CI Feature branches: Deleted after merge Hotfixes: Can bypass develop in emergencies 2.5 GitHub Actions Cost Estimate Phase Minutes/Month Cost MVP (5 devs) ~3,000 Free (2,000) + EUR 20 Scale (15 devs) ~15,000 EUR 120/month 3. 
Testing Strategy 3.1 Testing Pyramid ┌─────────┐ │ E2E │ ~10% of tests │ (Slow) │ Critical user journeys └────┬────┘ │ ┌──────┴──────┐ │ Integration │ ~20% of tests │ (Medium) │ API contracts, DB └──────┬──────┘ │ ┌─────────┴─────────┐ │ Unit │ ~70% of tests │ (Fast) │ Business logic └───────────────────┘ 3.2 Unit Testing Current Stack: Jest (already configured) Coverage Requirements: Component Minimum Target Business Logic 90% 95% API Controllers 80% 90% Utilities 70% 80% UI Components 60% 70% Best Practices: Test business logic, not implementation Mock external dependencies Use factories for test data Run on every commit 3.3 Integration Testing Tools: Testcontainers - Spin up PostgreSQL, Redis in Docker Supertest - HTTP assertions for API testing Pact - Contract testing between services What to Test: Database queries (with real PostgreSQL) Redis caching behavior API contract between services BaaS webhook handlers Payment flow integration (sandbox) 3.4 E2E Testing Recommendation: Playwright Criteria Playwright Cypress Browser Support All major + mobile Chrome, Firefox, Edge Speed Faster (parallel) Slower Auto-wait Built-in Built-in Mobile Testing Better (device emulation) Limited CI Integration Excellent Good Cost Free Free (cloud paid) Learning Curve Medium Lower Decision: Playwright Better mobile web testing (critical for Drop) True parallel execution Multiple browser contexts API testing built-in Network interception for mocking Critical User Journeys to Test: User registration + KYC start Login flow (email + biometric) View balance and transactions Send P2P transfer Card top-up flow Card freeze/unfreeze SEPA transfer initiation Playwright Configuration: // playwright.config.ts { projects: [ { name: 'Desktop Chrome', use: { ...devices['Desktop Chrome'] } }, { name: 'Mobile Safari', use: { ...devices['iPhone 14'] } }, { name: 'Mobile Chrome', use: { ...devices['Pixel 7'] } }, ], retries: 2, reporter: [['html'], ['junit', { outputFile: 'results.xml' }]], } 3.5 
Load Testing Recommendation: k6 Why k6: Open-source, scriptable in JavaScript Integrates with Grafana (our monitoring stack) Cloud option available for distributed load Can run locally or in CI/CD Load Test Scenarios: Scenario Virtual Users Duration Success Criteria Baseline 50 5 min p95 < 500ms Peak 200 10 min p95 < 1000ms Stress 500 5 min No crashes Soak 100 1 hour No memory leaks Critical Endpoints: POST /api/auth/login - 100 req/sec target GET /api/accounts/balance - 500 req/sec target POST /api/transfers - 50 req/sec target GET /api/transactions - 200 req/sec target 3.6 Security Testing SAST (Static Analysis): CodeQL (GitHub native) - Free, good coverage Snyk Code - Better for JavaScript/TypeScript SonarQube - Alternative if self-hosted preferred DAST (Dynamic Analysis): OWASP ZAP - Free, CI-integrated Burp Suite - For manual penetration testing Dependency Scanning: Snyk - Primary recommendation Dependabot - Free, GitHub native (backup) Schedule: Test Type Frequency Blocker? SAST Every PR Yes (high severity) Dependency Scan Daily Yes (critical) DAST Weekly No (review) Pen Test Quarterly N/A (manual) 4. 
Monitoring & Observability 4.1 Strategy: Unified Grafana Stack Why Grafana Cloud over alternatives: Criteria Grafana Cloud Datadog New Relic EU Hosting Yes (Frankfurt) Yes Yes Pricing Model Usage-based Per-host Per-user MVP Cost EUR 0-200 EUR 400+ EUR 300+ Scale Cost EUR 500-1,000 EUR 2,000+ EUR 1,500+ Open Standards Full (Prometheus, OTel) Partial Partial Vendor Lock-in Low High High Self-host Option Yes (fallback) No No Decision: Grafana Cloud Best cost/value for startup EU data residency (Frankfurt region) Open standards (can migrate if needed) Unified platform (metrics, logs, traces) Free tier generous for MVP 4.2 Metrics (Prometheus + Grafana) Infrastructure Metrics: CPU, Memory, Disk, Network Kubernetes pod health Database connections, query latency Redis hit/miss ratio Application Metrics: Request rate, latency, error rate (RED) Active users (DAU/MAU) Transaction volume and value KYC conversion funnel Card activation rate Business Metrics (Custom): fontelepay_transactions_total{type="p2p|sepa|card"} fontelepay_transaction_value_eur{type="p2p|sepa|card"} fontelepay_users_registered_total fontelepay_users_kyc_passed_total fontelepay_cards_issued_total{type="virtual|physical"} fontelepay_api_latency_seconds{endpoint="/api/..."} 4.3 Log Aggregation (Loki) Why Loki: Part of Grafana stack (unified UI) Cost-effective (indexes labels, not content) Kubernetes native Query language similar to Prometheus Log Structure (JSON): { "timestamp": "2026-02-05T10:30:00Z", "level": "info", "service": "payment-service", "trace_id": "abc123", "user_id": "usr_xxx", // pseudonymized "message": "Transfer initiated", "amount_eur": 100, "transfer_type": "sepa" } Retention Policy: Log Type Retention Reason Application 30 days Debugging Security/Audit 7 years Compliance Access Logs 90 days Security review GDPR Considerations: No PII in logs (use pseudonymized IDs) User IDs hashed or tokenized IP addresses masked after 30 days 4.4 Distributed Tracing (Tempo) Implementation: OpenTelemetry 
Why OpenTelemetry: Vendor-neutral standard Supports all our languages (Java, Node.js, Dart) Auto-instrumentation available Future-proof (industry standard) Trace Critical Paths: User login (app -> API -> auth -> DB) Payment initiation (app -> API -> payment -> BaaS -> ledger) Card transaction (webhook -> processor -> notification) Sampling Strategy: 100% for errors 100% for slow requests (>1s) 10% for successful requests (MVP) 1% for successful requests (scale) 4.5 Real User Monitoring (RUM) For Web (Next.js): Grafana Faro (free, part of Grafana) Captures: Page load, Web Vitals, JS errors For Mobile (Flutter): Custom implementation with OpenTelemetry Track: App start time, screen transitions, API calls Key Metrics: Metric Target Threshold LCP (Largest Contentful Paint) <2.5s <4s FID (First Input Delay) <100ms <300ms CLS (Cumulative Layout Shift) <0.1 <0.25 App Cold Start <2s <3s API Response (p95) <500ms <1s 4.6 Grafana Cloud Cost Estimate Component MVP Usage MVP Cost Scale Usage Scale Cost Metrics 10K series Free 50K series EUR 150 Logs 50 GB/mo Free 200 GB/mo EUR 200 Traces 10 GB/mo Free 50 GB/mo EUR 100 Total - EUR 0-50 - EUR 450 5. 
Error Tracking

5.1 Recommendation: Sentry

Comparison:

| Criteria | Sentry | Bugsnag | Rollbar |
| --- | --- | --- | --- |
| EU Hosting | Yes | Yes | No |
| Flutter SDK | Excellent | Good | Limited |
| Source Maps | Automatic | Automatic | Manual |
| Performance | Included | Separate | Included |
| Pricing (MVP) | Free | EUR 100 | EUR 100 |
| Pricing (Scale) | EUR 300 | EUR 400 | EUR 350 |
| Slack Integration | Native | Native | Native |
| Issue Grouping | Best | Good | Good |

Decision: Sentry
- Best Flutter support (critical for mobile)
- EU data residency available
- Excellent source map integration
- Issue grouping reduces noise
- Performance monitoring included
- Generous free tier (5K errors/month)

5.2 Sentry Configuration

Projects:
- fontelepay-web (Next.js frontend)
- fontelepay-api (Node.js/Java backend)
- fontelepay-mobile (Flutter app)

Settings:

    // sentry.config.js
    {
      dsn: "https://xxx@sentry.io/xxx",
      environment: process.env.NODE_ENV,
      release: process.env.GIT_SHA,
      tracesSampleRate: 0.1, // 10% of transactions

      // Filter sensitive data
      beforeSend(event) {
        // Remove PII
        if (event.user) {
          delete event.user.email;
          delete event.user.ip_address;
        }
        return event;
      }
    }

Alert Rules:

| Condition | Action | Priority |
| --- | --- | --- |
| New issue (high severity) | Slack + PagerDuty | P1 |
| Issue spike (>10x baseline) | Slack + PagerDuty | P1 |
| New issue (medium) | Slack only | P2 |
| Regression (resolved issue reopened) | Slack | P2 |

5.3 Source Maps

Web (Next.js):
- Automatic upload via @sentry/nextjs
- Hidden from production (security)

Mobile (Flutter):
- Upload dSYM (iOS) and mapping files (Android)
- Integrated with CI/CD

5.4 Sentry Cost Estimate

| Phase | Events/Month | Cost |
| --- | --- | --- |
| MVP | <5,000 | Free |
| Growth | ~50,000 | EUR 26/month |
| Scale | ~500,000 | EUR 300/month |

6.
Alerting & Incident Management 6.1 Phased Approach MVP (Team <5): Slack + Grafana Alerts Simple, no additional cost On-call rotation manual Suitable for low traffic Growth (Team 5-15): Add PagerDuty Proper escalation policies On-call schedules Mobile alerts Incident timeline Scale (Team 15+): Full Incident Management PagerDuty + Statuspage War room automation Post-incident reviews 6.2 Alert Levels Level Response Time Examples Notification P1 - Critical 15 min Payment processing down, data breach PagerDuty + Slack + SMS P2 - High 1 hour High error rate, degraded performance PagerDuty + Slack P3 - Medium 4 hours Non-critical service degraded Slack only P4 - Low Next business day Warning thresholds Slack (daily digest) 6.3 Critical Alerts (P1) Alert Condition Action API Down 0 successful requests for 2 min Page on-call Payment Failures >5% failure rate for 5 min Page on-call Database Unreachable Connection failures >10/min Page on-call Security Event Suspicious activity detected Page on-call + security Error Spike 10x baseline errors Page on-call 6.4 On-Call Rotation MVP Setup: Week 1: Dev A (primary) Week 2: Dev B (primary) Week 3: Dev A (primary) ... Escalation: 0-15 min: Primary on-call 15-30 min: Secondary on-call 30+ min: Engineering lead PagerDuty Cost: Plan Cost Features Free EUR 0 5 users, basic Professional EUR 21/user/mo Full features MVP: Free tier (5 users) Scale: Professional for core team 6.5 Incident Response Runbook Template ## Incident: [Title] ### Detection - Alert source: [Grafana/Sentry/PagerDuty] - Time detected: [timestamp] - Severity: [P1/P2/P3] ### Impact - Users affected: [estimate] - Services affected: [list] - Financial impact: [if applicable] ### Timeline - HH:MM - [Event] - HH:MM - [Event] ### Root Cause [Description] ### Resolution [Steps taken] ### Action Items - [ ] [Preventive measure] - [ ] [Process improvement] ### Participants - Incident Commander: [name] - Responders: [names] 7. 
Documentation 7.1 API Documentation Recommendation: OpenAPI 3.1 + Swagger UI Why: Industry standard Auto-generated from code annotations Interactive testing Client SDK generation Implementation: # openapi.yaml (partial) openapi: 3.1.0 info: title: Drop API version: 1.0.0 description: Mobile banking API servers: - url: https://api.fontelepay.com/v1 description: Production - url: https://api.staging.fontelepay.com/v1 description: Staging security: - bearerAuth: [] paths: /accounts/{id}/balance: get: summary: Get account balance tags: [Accounts] ... Hosting: Swagger UI at /docs endpoint Redoc as alternative (cleaner for external) Postman collection export for testing 7.2 Runbooks Location: /docs/runbooks/ in repository Required Runbooks: Runbook Purpose deploy-production.md Production deployment steps rollback.md How to rollback a bad deploy database-migration.md Safe DB migration process incident-response.md General incident handling scaling.md How to scale services secrets-rotation.md Rotating API keys, certs disaster-recovery.md Full recovery procedures Runbook Template: # Runbook: [Title] ## Overview [What this runbook covers] ## Prerequisites - [ ] Access to [system] - [ ] Permissions: [list] ## Steps 1. [Step with command examples] 2. [Step with verification] ## Verification [How to confirm success] ## Rollback [If something goes wrong] ## Contacts - Primary: [name/slack] - Escalation: [name/slack] 7.3 Architecture Decision Records (ADRs) Location: /docs/adr/ in repository Format: # ADR-001: Use PostgreSQL as Primary Database ## Status Accepted ## Context We need a reliable, ACID-compliant database for financial transactions. ## Decision Use PostgreSQL 16 as our primary database. 
## Consequences ### Positive - Strong ACID compliance - Excellent JSON support - Proven in fintech ### Negative - Requires more ops than managed NoSQL - Horizontal scaling more complex ## Alternatives Considered - MySQL: Less JSON support - MongoDB: Not ACID by default - CockroachDB: Higher cost, complexity Key ADRs to Create: ADR-001: Database selection (PostgreSQL) ADR-002: Cloud provider (AWS) ADR-003: BaaS provider (Swan) ADR-004: Mobile framework (Flutter) ADR-005: Monitoring stack (Grafana) ADR-006: CI/CD platform (GitHub Actions) 7.4 Documentation Tooling Type Tool Cost API Docs Swagger/OpenAPI Free Internal Docs Notion or Confluence Free-EUR 50/mo Runbooks Git repository Free Diagrams Mermaid (in Markdown) Free Postmortems Notion template Free 8. Security Operations 8.1 Dependency Scanning Recommendation: Snyk Why Snyk: Best JavaScript/TypeScript support Dart/Flutter support Automatic PR fixes License compliance Container scanning Integration: # .github/workflows/security.yml - name: Snyk Security Scan uses: snyk/actions/node@master with: args: --severity-threshold=high Policy: Severity Action SLA Critical Block PR, fix immediately 24 hours High Block PR, fix before merge 72 hours Medium Warning, fix in sprint 2 weeks Low Track, fix when convenient 1 month Snyk Cost: Plan Cost Limits Free EUR 0 200 tests/month Team EUR 52/dev/mo Unlimited MVP: Free tier Scale: Team plan 8.2 Secret Management Recommendation: AWS Secrets Manager Why AWS Secrets Manager: Native AWS integration (using AWS already) Automatic rotation support Audit trail via CloudTrail GDPR compliant (EU region) No additional infrastructure Alternative: HashiCorp Vault More features but more operational overhead Consider for Scale phase if multi-cloud Secrets to Manage: Secret Rotation Access Database credentials 90 days Backend services API keys (Swan, Stripe) 180 days Backend services JWT signing keys 365 days Auth service Encryption keys Never (versioned) All services Implementation: // 
secrets.ts
    import { SecretsManager } from '@aws-sdk/client-secrets-manager';

    const client = new SecretsManager({ region: 'eu-central-1' });

    // Fetch a secret's string value by name or ARN.
    export async function getSecret(name: string): Promise<string> {
      const response = await client.getSecretValue({ SecretId: name });
      if (response.SecretString === undefined) {
        // Binary secrets populate SecretBinary instead of SecretString.
        throw new Error(`Secret ${name} has no string value`);
      }
      return response.SecretString;
    }

AWS Secrets Manager Cost:

| Secrets | Cost |
|---|---|
| 10 secrets | EUR 4/month |
| 50 secrets | EUR 20/month |
| 100 secrets | EUR 40/month |

8.3 Penetration Testing

Schedule:

| Test Type | Frequency | Provider |
|---|---|---|
| Automated DAST | Weekly | OWASP ZAP |
| Web App Pen Test | Quarterly | External firm |
| Mobile App Pen Test | Quarterly | External firm |
| Infrastructure Pen Test | Annually | External firm |

Budget:

| Test | Cost |
|---|---|
| Web + API Pen Test | EUR 5,000-10,000 |
| Mobile Pen Test | EUR 5,000-8,000 |
| Infrastructure | EUR 8,000-15,000 |
| Annual Total | EUR 25,000-45,000 |

EU-Based Pen Testing Firms:
- Cure53 (Germany) - Excellent reputation
- Securitum (Poland) - Cost-effective
- WithSecure (Finland) - Enterprise grade
- Secura (Netherlands) - Banking expertise

8.4 Security Monitoring

SIEM Considerations:
- MVP: CloudWatch + Grafana alerts (sufficient)
- Scale: Consider AWS Security Hub or Elastic SIEM

Security Alerts:

| Event | Action |
|---|---|
| Failed login spike | Alert + temp block |
| New device login | User notification |
| Large transfer | Manual review queue |
| Admin action | Audit log + alert |
| API key usage anomaly | Alert + investigate |

8.5 Compliance Automation

Tools:
- AWS Config - Configuration compliance
- Prowler - AWS security assessment (free)
- Checkov - Infrastructure as code scanning

Automated Checks:
- S3 buckets not public
- Encryption at rest enabled
- Security groups not overly permissive
- IAM policies least-privilege
- Audit logging enabled

9.
Cost Summary

9.1 MVP Phase (Monthly)

| Category | Tool | Cost (EUR) |
|---|---|---|
| CI/CD | GitHub Actions | 20-50 |
| Monitoring | Grafana Cloud (free tier) | 0-50 |
| Error Tracking | Sentry (free tier) | 0 |
| Alerting | Slack + PagerDuty Free | 0 |
| Security | Snyk (free tier) | 0 |
| Secrets | AWS Secrets Manager | 10 |
| Testing | Playwright, k6 (OSS) | 0 |
| Total | | EUR 30-110 |

9.2 Growth Phase (Monthly)

| Category | Tool | Cost (EUR) |
|---|---|---|
| CI/CD | GitHub Actions | 100-150 |
| Monitoring | Grafana Cloud | 200-400 |
| Error Tracking | Sentry Team | 100-300 |
| Alerting | PagerDuty Professional | 100-200 |
| Security | Snyk Team | 200-400 |
| Secrets | AWS Secrets Manager | 20-40 |
| Testing | k6 Cloud (load testing) | 100-200 |
| Total | | EUR 820-1,690 |

9.3 Scale Phase (Monthly)

| Category | Tool | Cost (EUR) |
|---|---|---|
| CI/CD | GitHub Actions + ArgoCD | 200-300 |
| Monitoring | Grafana Cloud | 500-1,000 |
| Error Tracking | Sentry Business | 300-500 |
| Alerting | PagerDuty + Statuspage | 300-500 |
| Security | Snyk + DAST | 500-800 |
| Secrets | AWS Secrets Manager | 40-60 |
| Testing | k6 Cloud | 200-400 |
| Documentation | Confluence | 50-100 |
| Total | | EUR 2,090-3,660 |

9.4 Annual Security Costs

| Item | Cost (EUR) |
|---|---|
| Penetration Testing (4x/year) | 25,000-45,000 |
| Compliance Audit (annual) | 10,000-20,000 |
| Security Training | 2,000-5,000 |
| Total | EUR 37,000-70,000 |

10.
Implementation Priority

10.1 Phase 1: Foundation (Week 1-2)

Must Have:
- GitHub Actions basic pipeline (lint, test, build)
- Sentry error tracking (all environments)
- Basic Slack alerting
- AWS Secrets Manager setup
- Snyk dependency scanning

Outcome: Can deploy safely with visibility into errors

10.2 Phase 2: Observability (Week 3-4)

Must Have:
- Grafana Cloud setup (metrics, logs)
- Prometheus metrics in application
- Structured logging (JSON)
- Basic dashboards (RED metrics)
- Critical alerts configured

Outcome: Can monitor application health

10.3 Phase 3: Testing (Week 5-6)

Must Have:
- Unit test coverage >70%
- Integration tests for critical paths
- Playwright E2E for happy paths
- k6 load test baseline
- Test runs in CI/CD

Outcome: Confidence in deployments

10.4 Phase 4: Security (Week 7-8)

Must Have:
- CodeQL SAST enabled
- OWASP ZAP in staging
- Security headers configured
- Audit logging implemented
- First penetration test scheduled

Outcome: Security baseline established

10.5 Phase 5: Operations (Week 9-12)

Should Have:
- PagerDuty on-call rotation
- Runbooks for critical scenarios
- Disaster recovery tested
- OpenAPI documentation complete
- ADRs documented

Outcome: Production-ready operations

10.6 Checklist Summary
- Week 1-2: CI/CD + Errors + Secrets
- Week 3-4: Monitoring + Logs + Alerts
- Week 5-6: Tests + E2E + Load
- Week 7-8: Security + Audit + Pen Test
- Week 9-12: On-call + Docs + DR

11.
Integration Diagram ┌─────────────────────────────────────────────────────────────────────────────┐ │ DEVELOPER WORKFLOW │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────┐ ┌─────────┐ ┌─────────────────────────────────────────┐ │ │ │ Code │───>│ PR │───>│ GitHub Actions │ │ │ │ (IDE) │ │ (GitHub)│ │ ┌─────┐ ┌────┐ ┌────┐ ┌─────┐ ┌─────┐ │ │ │ └─────────┘ └─────────┘ │ │Lint │ │Test│ │SAST│ │Build│ │Snyk │ │ │ │ │ └──┬──┘ └──┬─┘ └──┬─┘ └──┬──┘ └──┬──┘ │ │ │ └────┼───────┼──────┼──────┼───────┼─────┘ │ │ └───────┴──────┴──────┴───────┘ │ │ │ │ └────────────────────────────────────────────────────┼────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ │ DEPLOYMENT (ArgoCD) │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │ │ Staging │────────>│ Canary │────────>│ Production │ │ │ │ (automatic) │ │ (5% traffic) │ │ (95% -> 100%)│ │ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │ │ │ │ │ │ └─────────────────────────┴─────────────────────────┘ │ │ │ │ └────────────────────────────────────┼────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ │ KUBERNETES CLUSTER (AWS EKS) │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ API Gateway│ │ Auth │ │ Payment │ │ Card │ │ │ │ (Kong) │ │ Service │ │ Service │ │ Service │ │ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │ │ │ │ │ │ │ └────────────────┴────────────────┴────────────────┘ │ │ │ │ │ ┌─────────────────────────┼─────────────────────────┐ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ PostgreSQL │ │ Redis │ │ Kafka │ │ │ │ (RDS) │ │(ElastiCache)│ │ (MSK) │ │ │ └─────────────┘ 
└─────────────┘ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ │ │ Telemetry ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ │ OBSERVABILITY STACK │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ GRAFANA CLOUD (EU) │ │ │ │ │ │ │ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │ │ │ │ Prometheus │ │ Loki │ │ Tempo │ │ │ │ │ │ (Metrics) │ │ (Logs) │ │ (Traces) │ │ │ │ │ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │ │ │ │ └─────────────────┴─────────────────┘ │ │ │ │ │ │ │ │ │ ┌──────┴──────┐ │ │ │ │ │ Dashboards │ │ │ │ │ │ & Alerts │ │ │ │ │ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌────────────────┐ ┌────────────────┐ │ │ │ Sentry │ │ PagerDuty │ │ │ │ (Error Track) │ │ (Alerting) │ │ │ └───────┬────────┘ └───────┬────────┘ │ │ │ │ │ │ └───────────────────┬───────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────┐ │ │ │ Slack │ │ │ │ (Notif Hub) │ │ │ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ SECURITY LAYER │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ Snyk │ │ CodeQL │ │ OWASP ZAP │ │ AWS Secrets │ │ │ │ (Deps) │ │ (SAST) │ │ (DAST) │ │ Manager │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ Appendix A: Tool Links Tool URL Purpose GitHub Actions github.com/features/actions CI/CD ArgoCD argoproj.github.io/cd GitOps deployment Grafana Cloud grafana.com/cloud Monitoring Sentry sentry.io Error tracking PagerDuty pagerduty.com Incident management 
Snyk snyk.io Security scanning Playwright playwright.dev E2E testing k6 k6.io Load testing OpenTelemetry opentelemetry.io Observability Appendix B: Decision Matrix Decision Options Considered Winner Key Factor CI/CD GitHub Actions, GitLab, CircleCI GitHub Actions Native GitHub, EU runners Monitoring Datadog, New Relic, Grafana Grafana Cloud Cost, EU hosting, open standards E2E Testing Playwright, Cypress Playwright Mobile web support, speed Error Tracking Sentry, Bugsnag, Rollbar Sentry Flutter SDK, EU hosting Alerting PagerDuty, Opsgenie, Slack PagerDuty Industry standard, free tier Secrets AWS SM, Vault, GCP SM AWS Secrets Manager Already on AWS, simple Security Snyk, Dependabot, Sonar Snyk Best JS/TS coverage Appendix C: Compliance Mapping Requirement Solution Evidence PCI DSS 10.x (Logging) Grafana Loki, 7yr retention CloudTrail + Loki GDPR (Data Residency) Grafana EU, Sentry EU Region configs GDPR (Right to Erasure) Pseudonymized logs No PII in logs SOC 2 (Change Mgmt) GitHub PRs, ArgoCD Audit trail ISO 27001 (Incident) PagerDuty, Runbooks Incident records Document created: 2026-02-05 Last updated: 2026-02-05 Author: DevOps Research WAF Rules WAF Rules — Drop Payment App MC #1229 — Web Application Firewall configuration for Drop fintech. Overview Drop runs on Fly.io which does not provide a built-in WAF. Protection is layered: Middleware-level (Next.js Edge Middleware) — first line of defense Fly.io Proxy — TLS termination, DDoS mitigation at network edge Application-level — input validation, parameterized SQL, CSRF checks Middleware WAF Rules (Implemented in src/drop-app/src/middleware.ts ) 1. CSRF Origin Validation Rule: All mutation requests (POST/PUT/PATCH/DELETE) to /api/* must have valid Origin or Referer header Action: Block with 403 Bypass: None 2. Rate Limiting Rule: Per-IP rate limits on auth endpoints (10 req/window) Action: Block with 429 Scope: /api/auth/* 3. 
Content-Security-Policy Rule: Strict CSP with nonce-based script/style loading in production Action: Browser enforcement (block inline scripts/styles without nonce) Dev mode: unsafe-inline permitted for HMR Recommended Reverse Proxy Rules (Fly.io / Cloudflare) If a CDN or reverse proxy is added in front of Fly.io, configure these rules: SQL Injection (SQLi) Pattern: Block requests containing SQL keywords in query params and body: UNION SELECT , OR 1=1 , DROP TABLE , ; -- , ' OR ' Action: Block with 403 Note: Drop uses parameterized queries exclusively — this is defense-in-depth Cross-Site Scripting (XSS) Pattern: Block requests containing: " # Expected: 403 Forbidden # Test path traversal blocking curl "https://getdrop.no/../../etc/passwd" # Expected: 403 Forbidden Monitoring All WAF blocks should be logged with: timestamp, rule ID, client IP, request path, matched pattern Alert on >100 blocks/hour from single IP (potential attack) Weekly WAF report for security review Cloud Deployment Options Cloud Deployment Options for Drop Rebrand note (2026-02-14): Originally titled "FontelePay". Product rebranded to Drop . See Drop CLAUDE.md . 
Date: 2026-02-05 Purpose: Evaluate cloud deployment options for European mobile banking MVP Requirements Summary Requirement Priority Next.js support (static + SSR/API routes) Must-have EU data residency (GDPR) Must-have Financial compliance ready (PCI-DSS, SOC2) Must-have Cost-effective for MVP High Easy CI/CD integration High Scalability for production Medium Provider Comparison Overview Table Feature Vercel AWS (Amplify/Lambda) Google Cloud Run Next.js Support Native (created by Vercel) Full SSR support (v15) Via container deployment EU Regions Edge caching only Frankfurt, Ireland, Paris, Stockholm + ESC Frankfurt, Belgium, Netherlands, Zurich Data Residency US-based storage* Full EU residency available Full EU residency available PCI-DSS v4.0 (SAQ-D AOC) v4.0.1 certified v4.0.1 certified SOC 2 Type 2 certified Type 2 certified Type 2 certified ISO 27001 Certified Certified Certified GDPR EU-US DPF certified Compliant Compliant Ease of Use Excellent Moderate Moderate Vendor Lock-in Medium Low Low *Vercel: Static assets and function responses cached in EU, but primary storage remains US-based. Detailed Analysis 1. Vercel Strengths: Native Next.js support (Vercel created Next.js) Zero-config deployment from Git Excellent DX (Developer Experience) Edge Functions for low latency Preview deployments per PR PCI-DSS v4.0 compliant SOC 2 Type 2, ISO 27001 certified Weaknesses: No true EU data residency - data primarily stored in US Per-seat pricing scales poorly for teams Limited backend flexibility Enterprise tier required for some compliance features Pricing: Tier Cost Includes Hobby Free 100GB bandwidth, limited features Pro $20/user/month 1TB bandwidth, $20 credits, viewer seats free Enterprise Custom SAML SSO, SLAs, dedicated support GDPR Concern: Vercel is certified under EU-US Data Privacy Framework, but for banking applications requiring strict EU data residency, this may not be sufficient. 
Functions can run in EU regions, but metadata and logs may still traverse US infrastructure. 2. AWS (Amplify + Lambda) Strengths: True EU data residency with European Sovereign Cloud (ESC) Full Next.js 15 SSR support via Amplify 140+ security certifications including PCI-DSS v4.0.1 Frankfurt region well-established for EU fintech Pay-per-use with generous free tier No per-seat pricing Full infrastructure control Weaknesses: Steeper learning curve Complex billing (multiple services) Requires AWS expertise CI/CD via external tools (GitHub Actions, GitLab) Pricing (AWS Amplify): Resource Free Tier Paid Build minutes 1,000/month $0.01/min Data served 15 GB/month $0.15/GB Data stored 5 GB/month $0.023/GB SSR requests Varies ~$0.20/1M Estimated MVP Cost: $5-25/month for low-moderate traffic European Sovereign Cloud (ESC): Launched January 2026, provides EU-resident personnel and hardware-enforced access restrictions. Ideal for regulated financial services. 3. Google Cloud Run Strengths: Containerized deployment (flexible) Full EU data residency (Frankfurt, Belgium, Netherlands, Zurich) PCI-DSS v4.0.1 and SOC 2 certified Generous free tier Auto-scaling to zero Pay only for actual compute time Weaknesses: Requires containerization (Dockerfile) No native Next.js integration More DevOps overhead Less seamless than Vercel for frontend Pricing (Tier 1 - EU regions): Resource Free Tier Paid CPU 180,000 vCPU-seconds/month $0.000024/vCPU-second Memory 360,000 GiB-seconds/month $0.0000025/GiB-second Requests 2 million/month $0.40/million Estimated MVP Cost: $0-15/month for low-moderate traffic (often within free tier) Compliance Matrix for Fintech Certification Vercel AWS GCP Required for Drop PCI-DSS v4.0+ Yes Yes Yes Yes (payment processing) SOC 2 Type 2 Yes Yes Yes Yes (enterprise clients) ISO 27001 Yes Yes Yes Recommended GDPR DPF Full Full Yes (EU operations) EU Data Residency Partial Full Full Critical Recommendation MVP Phase (0-6 months) Primary: AWS Amplify (Frankfurt 
region) Rationale: True EU data residency - critical for banking MVP regulatory approval Full Next.js support - SSR, API routes, ISR all work Cost-effective - likely $10-30/month for MVP traffic Compliance-ready - PCI-DSS, SOC 2, ISO 27001 from day one No per-seat pricing - scales with team growth Path to production - same platform, just scale up Setup recommendation: Region: eu-central-1 (Frankfurt) CI/CD: GitHub Actions Database: Aurora Serverless or PlanetScale (EU region) Auth: Cognito or Auth0 (EU tenant) Production Phase (6+ months) Stay with AWS but consider: AWS European Sovereign Cloud (ESC) for maximum compliance ECS/EKS for more control if needed Multi-region deployment (Frankfurt + Ireland) for redundancy Why Not Vercel? Despite excellent DX, Vercel's partial EU data residency is a significant concern for a banking application. While Vercel is PCI-DSS compliant, regulators may question data flows through US infrastructure. For an MVP seeking banking licenses or partnerships, demonstrating full EU data residency is simpler with AWS or GCP. Why Not GCP Cloud Run? 
GCP is technically excellent but: Requires containerization overhead Less native Next.js support Smaller fintech ecosystem in EU compared to AWS AWS has more established EU banking relationships Cost Projection (12 months) Scenario Vercel Pro AWS Amplify GCP Cloud Run MVP (2 devs, 10k users) $480/year $120-300/year $0-180/year Growth (5 devs, 50k users) $1,200/year $300-600/year $200-400/year Scale (10 devs, 200k users) $2,400/year $600-1,500/year $500-1,200/year AWS and GCP costs vary based on usage patterns; Vercel costs fixed per-seat Action Items Set up AWS account with Frankfurt region default Configure Amplify for Next.js deployment Implement GitHub Actions CI/CD pipeline Document compliance controls for future audits Evaluate AWS ESC when banking license process begins Sources Vercel Pricing Vercel Security & Compliance Vercel PCI Compliance Guide AWS Amplify Pricing AWS European Sovereign Cloud AWS PCI DSS Compliance Google Cloud Run Pricing GCP PCI DSS Compliance GCP SOC 2 Compliance Infrastructure Overview Infrastructure Resources Infrastructure resources for Drop project: deployment, monitoring, CI/CD. 
Cloud Migration Strategy: GCP → Azure

Cloud Migration Strategy: Drop

Date: 2026-02-18
Status: Planning
Decision: Azure as the production platform, GCP for dev/staging

SpareBank 1: Technical Stack (Research)

| Layer | Technology |
|---|---|
| Cloud | Azure (primary); the Eunomia platform for 13 banks |
| Secondary | AWS (smaller workloads) |
| Backend | Kotlin/Java (Spring Boot) |
| Frontend | React + TypeScript |
| Orchestration | Kubernetes / OpenShift |
| Message queue | Apache Kafka |
| Authentication | BankID (Norwegian eID) |
| API Gateway | Axway |
| Partnership | Microsoft (strategic partner) |

Drop's Current Stack

| Layer | Technology |
|---|---|
| Frontend | Next.js 16 + React 19 + Tailwind v4 |
| Backend | Next.js API Routes (Node.js) |
| Database | SQLite (better-sqlite3) → PostgreSQL (prod) |
| Auth | JWT (jose) in httpOnly cookies + BankID |
| Hosting | Fly.io (staging), Vercel (planned prod) |

Migration Strategy

Phase 1: GCP Dev/Staging (NOW)
- Free trial: $300 / NOK 2,884 until 20 May 2026
- Services: Cloud Run (containerized Next.js), Cloud SQL (PostgreSQL), Cloud Storage
- Purpose: development environment + CI/CD testing
- No regulatory risk (test data only)

Phase 2: Azure Production (when credits arrive)
- Microsoft Founders Hub: application submitted (#1362)
- Services: Azure App Service or AKS, Azure Database for PostgreSQL, Azure Blob Storage
- Purpose: production environment
- SpareBank 1 alignment: running on the same cloud platform reduces friction in a partnership

Phase 3: Multi-Cloud Readiness
- AWS: application submitted (#1360), backup/DR
- Containerized architecture: Docker, and possibly Kubernetes, keeps us vendor-independent

GCP Deploy Plan (Phase 1)

Step 1: Containerization

    # Dockerfile for Drop Next.js
    FROM node:20-alpine
    WORKDIR /app
    COPY package*.json ./
    # Install all dependencies: `next build` needs devDependencies (TypeScript, Tailwind),
    # so `npm ci --production` would break the build step below.
    RUN npm ci
    COPY . .
    RUN npm run build
    EXPOSE 3000
    CMD ["npm", "start"]

Step 2: GCP Setup
- Create the GCP project (already exists: project-72cd303f)
- Enable the Cloud Run API
- Create a Cloud SQL PostgreSQL instance (db-f1-micro for dev)
- Configure secrets in Secret Manager
- Set up Cloud Build for CI/CD

Step 3: Deploy Pipeline

    # Build and push the container
    gcloud builds submit --tag gcr.io/PROJECT_ID/drop-web

    # Deploy to Cloud Run
    gcloud run deploy drop-web \
      --image gcr.io/PROJECT_ID/drop-web \
      --platform managed \
      --region europe-north1 \
      --allow-unauthenticated \
      --set-env-vars DATABASE_URL=postgresql://...

Step 4: DNS + SSL
- Custom domain: drop.alai.no → Cloud Run
- SSL: automatic via Google-managed certificates
- CDN: Cloud CDN in front of Cloud Run (optional)

Regulatory Requirements (Finanstilsynet)
- 60-day notice before cloud outsourcing of financial services
- Documentation requirements: security measures, data residency, exit strategy
- Data residency: Europe (europe-north1 = Finland, closest to Norway)
- Action: prepare the documentation BEFORE the production migration

Kotlin/Java Assessment

Decision: No, we keep Next.js/TypeScript.

Why:
- Drop is ~7,600 lines of TypeScript; a full rewrite to Kotlin would take 2-3 months
- SpareBank 1's API is REST/GraphQL, so the language on our side is irrelevant
- TypeScript full stack = one language, one team, faster iteration
- Next.js API Routes are sufficient at our scale (fintech MVP)
- When we need microservices, we will consider Kotlin for specific services

Estimated Costs

| Service | GCP (dev) | Azure (prod) |
|---|---|---|
| Compute | Cloud Run: ~$0 (free tier) | App Service B1: ~$13/mo |
| Database | Cloud SQL micro: ~$7/mo | PostgreSQL Basic: ~$25/mo |
| Storage | 5GB: ~$0.10/mo | 5GB: ~$0.10/mo |
| Total | ~$7/mo (covered by credits) | ~$38/mo (covered by credits) |

Related Tasks
- MC #1360: AWS credits application (submitted)
- MC #1361: GCP credits application (submitted, free trial active)
- MC #1362: Azure/Microsoft Founders Hub (submitted)
- MC #1364: Anthropic credits application

Load Test Results: 2026-02-18

Load Test Results: Drop Staging

Date: 2026-02-18
Tool: k6 v1.6.1
Target: https://drop-staging.fly.dev
Server: Fly.io shared-cpu-1x (256MB RAM, 1 shared CPU, Stockholm)

Test Setup

Two scenarios were run concurrently:

Scenario 1: Public Stress (health + exchange rates)
- Ramps from 0 → 200 concurrent users over 2m40s
- No authentication; tests raw server capacity

Scenario 2: Authenticated user flow
- Ramps from 0 → 30 concurrent users
- Login → Dashboard → Transactions → Recipients → Profile
- JWT token reused per VU

Results

| Concurrent users | Median latency | p95 latency | Error rate | Status |
|---|---|---|---|---|
| 1-10 | 74ms | ~90ms | 0% | Works excellently |
| 25-50 | ~500ms | ~3s | ~5% | Degradation begins |
| 75-100 | ~2-3s | ~6s | ~30% | Severe problems |
| 150-200 | 3s+ | 27s+ | 47% | Effectively down |

Detailed numbers (k6 output)

Public endpoints:
- health_duration: avg=1134ms, min=54ms, med=74ms, max=44s, p90=3.5s, p95=6.2s
- rates_duration: avg=1077ms, min=55ms, med=74ms, max=45s, p90=3.3s, p95=6.4s
- /api/rates failed in 95% of requests under high load
- /api/health held up (always 200)

Authenticated endpoints:
- http_req_duration (auth_flow): med=3.26s, p90=14.8s, p95=27.6s
- Dashboard and transaction history were hit hardest

Total:
- 12,841 HTTP requests over 2m53s
- 74 req/s on average
- 48.4% of all requests failed

Breaking Point

~25-30 concurrent users. Beyond this point response times explode and endpoints begin to fail.

Bottlenecks Identified

1. Machine resources (CRITICAL)
- shared-cpu-1x = 256MB RAM, 1 shared CPU
- Literally the smallest Fly.io plan
- CPU saturation at ~50 concurrent requests

2. SQLite single-writer (HIGH)
- SQLite WAL mode helps with concurrent reads
- But ALL writes (rate_limits, sessions) are serialized
- Under load: the write lock blocks reads

3. No caching (MEDIUM)
- No Redis or in-memory cache
- Exchange rates are fetched from the DB on every request
- User sessions are validated against the DB every time

4. bcrypt 12 rounds (MEDIUM)
- Password hashing costs ~300ms of CPU per login
- Saturates the shared CPU quickly during login bursts

5.
Single instance (MEDIUM)
- No horizontal scaling
- auto_stop_machines = stop → cold starts (3.8s for the first request)
- min_machines_running = 0 → no always-on instances

Upgrade Plan

| Upgrade | Effect | Cost |
|---|---|---|
| Fly.io performance-1x (2GB RAM, 1 dedicated CPU) | ~3x capacity (~75 users) | ~$30/mo |
| + PostgreSQL instead of SQLite | Concurrent writes, connection pooling | ~$15/mo |
| + Redis cache (rates, sessions) | 10x faster on read endpoints | ~$10/mo |
| + 2 instances (auto-scale) | ~150+ users | ~$60/mo total |
| Full production setup | ~500+ users | ~$100/mo |

Conclusion

For the MVP/demo with SpareBank 1: the current setup handles 10-15 concurrent users, which is sufficient for a demo. A pilot with real users needs at minimum PostgreSQL and a larger machine.

See also: Cloud Migration Strategy: GCP → Azure for the migration plan.

GCP Architecture: Cloud Run + Cloud SQL

GCP Architecture for Drop

Date: 2026-02-18
Region: europe-north1 (Finland, closest to Norway)
Context: Migration from Fly.io shared-cpu-1x, which handles ~25 concurrent users

Current Fly.io vs GCP: Comparison

Tier 1: Minimum (Dev/Demo), ~25 users

| Component | Fly.io (now) | GCP equivalent | GCP cost |
|---|---|---|---|
| Compute | shared-cpu-1x (256MB) | Cloud Run: 1 vCPU, 512MB | ~$0 (free tier) |
| Database | SQLite on Fly Volume | Cloud SQL db-f1-micro (0.6GB, 10GB) | ~$9/mo |
| Cache | None | None | $0 |
| Total | ~$5/mo | | ~$9/mo |
| Capacity | ~25 concurrent | ~25 concurrent | |

Tier 2: Pilot (SpareBank 1 demo), ~100 users

| Component | GCP service | Specification | Cost |
|---|---|---|---|
| Compute | Cloud Run | 2 vCPU, 1GB RAM, min 1 instance | ~$15/mo |
| Database | Cloud SQL | db-g1-small (1.7GB, 20GB SSD) | ~$30/mo |
| Cache | Memorystore Redis | Basic 1GB | ~$35/mo |
| CDN | Cloud CDN | 10GB egress | ~$1/mo |
| Secrets | Secret Manager | 10 secrets | ~$0 (free tier) |
| Monitoring | Cloud Monitoring | Basic | ~$0 (free tier) |
| Total | | | ~$81/mo |
| Capacity | | ~100-150 concurrent | |

Tier 3: Production (real users), ~500+ users

| Component | GCP service | Specification | Cost |
|---|---|---|---|
| Compute | Cloud Run | 2 vCPU, 2GB RAM, min 2 instances, auto-scale to 10 | ~$50/mo |
| Database | Cloud SQL | Enterprise: 2 vCPU, 8GB RAM, 50GB SSD, HA | ~$150/mo |
| Cache | Memorystore Redis | Standard 1GB (HA) | ~$70/mo |
| CDN | Cloud CDN + Load Balancer | Global | ~$25/mo |
| Secrets | Secret Manager | | ~$0 |
| Monitoring | Cloud Monitoring + Logging | | ~$10/mo |
| Backup | Automated DB backup | | ~$5/mo |
| Total | | | ~$310/mo |
| Capacity | | ~500-1000 concurrent | |

Cloud Run Pricing (Tier 1 region)

| Resource | Price | Free tier |
|---|---|---|
| CPU | $0.000024/vCPU-second | 180,000 vCPU-sec/mo |
| Memory | $0.0000025/GiB-second | 360,000 GiB-sec/mo |
| Requests | $0.40/million | 2 million/mo |
| Egress | $0.12/GB (after 1GB free) | 1 GB/mo |

Free tier covers: ~50 hours with 1 vCPU + 256MB, enough for dev/staging with low traffic.

Important: The free tier applies only to us-central1/us-east1/us-west1. In europe-north1 EVERYTHING is billed from first use, but it is covered by the $300 free trial credits.

Cloud SQL PostgreSQL Pricing

| Instance type | vCPU | RAM | Price (approx.) |
|---|---|---|---|
| db-f1-micro | Shared | 0.6 GB | ~$9/mo |
| db-g1-small | Shared | 1.7 GB | ~$27/mo |
| db-custom-2-8192 | 2 | 8 GB | ~$130/mo |

Storage: ~$0.17/GB/mo (SSD)
Backup: ~$0.08/GB/mo

Deploy architecture on GCP

    ┌─────────────────────────────────────────────┐
    │                  Cloud CDN                  │
    │          (static files + caching)           │
    └──────────────────┬──────────────────────────┘
                       │
    ┌──────────────────▼──────────────────────────┐
    │              Cloud Run Service              │
    │        drop-web (Next.js standalone)        │
    │            Region: europe-north1            │
    │         Auto-scale: 0-10 instances          │
    │        CPU: 1-2 vCPU, RAM: 512MB-2GB        │
    └──────┬──────────────────┬───────────────────┘
           │                  │
    ┌──────▼──────┐    ┌──────▼──────┐
    │  Cloud SQL  │    │ Memorystore │
    │ PostgreSQL  │    │    Redis    │
    │  europe-n1  │    │   (cache)   │
    │ Private IP  │    │  Basic 1GB  │
    └─────────────┘    └─────────────┘

Alt: Serverless VPC Connector for private networking

Migration Steps

Step 1: Containerization (day 1)
- The Dockerfile already exists (Node 22 Alpine)
- Switch SQLite → PostgreSQL via the DATABASE_URL env var
- Drop already has a PostgreSQL adapter in the code

Step 2: GCP setup (day 1-2)

    # Create project (already exists: project-72cd303f)
    gcloud config set project project-72cd303f-66e5-46ee-a4c

    # Enable APIs
    gcloud services enable run.googleapis.com
    gcloud services enable sqladmin.googleapis.com
    gcloud services enable secretmanager.googleapis.com
    gcloud services enable cloudbuild.googleapis.com

    # Cloud SQL instance
    gcloud sql instances create drop-db \
      --database-version=POSTGRES_15 \
      --tier=db-f1-micro \
      --region=europe-north1 \
      --storage-size=10GB \
      --storage-type=SSD

    # Create database
    gcloud sql databases create drop --instance=drop-db

    # Create user
    gcloud sql users create drop-user \
      --instance=drop-db \
      --password=

Step 3: Deploy to Cloud Run (day 2)

    # Build and push the container
    gcloud builds submit --tag gcr.io/PROJECT_ID/drop-web

    # Deploy
    gcloud run deploy drop-web \
      --image gcr.io/PROJECT_ID/drop-web \
      --platform managed \
      --region europe-north1 \
      --allow-unauthenticated \
      --memory 512Mi \
      --cpu 1 \
      --min-instances 0 \
      --max-instances 5 \
      --set-env-vars NODE_ENV=production \
      --set-cloudsql-instances PROJECT_ID:europe-north1:drop-db \
      --set-secrets DATABASE_URL=drop-db-url:latest

Step 4: DNS + SSL (day 2)

    # Custom domain
    gcloud run domain-mappings create \
      --service drop-web \
      --domain drop-dev.alai.no \
      --region europe-north1

Step 5: CI/CD (day 3)
- Cloud Build trigger on GitHub push
- Automatic deploy on push to the main/staging branch
- Build + deploy takes ~2-3 minutes

Cost Coverage

| Source | Amount | Covers |
|---|---|---|
| GCP Free Trial | $300 (NOK 2,884) | Tier 1+2 for ~3 months |
| GCP Startups Program (applied) | Up to $100,000 | Everything for 1-2 years |
| Microsoft Founders Hub (applied) | Up to $150,000 Azure | Azure migration later |

With the free trial alone: $300 / ~$9 per month (Tier 1) = 33 months for dev. With Tier 2 (~$81/mo) = ~3.7 months.

Alternative: A cheaper cache than Memorystore

Memorystore Redis (1GB basic = ~$35/mo) is expensive for an MVP. Alternatives:

| Alternative | Price | Trade-off |
|---|---|---|
| Upstash Redis (serverless) | Free up to 10K commands/day | Zero cost for dev |
| In-memory cache in Cloud Run | $0 | Lost on restart |
| Cloud Run + node-cache | $0 | Simple, per-instance cache |

Recommendation: Start without Redis.
Add Upstash or an in-memory cache first. Use Memorystore only if we need a cache shared between instances.

Capacity estimate per tier

| Tier | Concurrent users | Response time (p95) | Cost |
|---|---|---|---|
| Tier 1 (db-f1-micro, 1 vCPU) | ~25-30 | <200ms | ~$9/mo |
| Tier 2 (db-g1-small, 2 instances) | ~100-150 | <500ms | ~$80/mo |
| Tier 3 (2 vCPU DB, auto-scale) | ~500-1000 | <300ms | ~$310/mo |

PostgreSQL alone gives ~2-3x better concurrent performance than SQLite, thanks to connection pooling and parallel writes.

AWS Deploy: App Runner + RDS (Live)

AWS Deploy: Drop Staging

Date: 2026-02-18
Status: LIVE
Region: eu-west-1 (Ireland)

Infrastructure

| Component | Service | Details |
|---|---|---|
| Compute | App Runner | 1 vCPU, 2GB RAM, auto-scale |
| Container | ECR | 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web |
| Database | RDS PostgreSQL 16.6 | db.t3.micro (Free Tier), 20GB gp3 |
| Region | eu-west-1 | Ireland |

URLs

| Service | URL |
|---|---|
| App | https://9ef3szvvsb.eu-west-1.awsapprunner.com |
| RDS | drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432 |

Credentials

| Key | Value |
|---|---|
| AWS Account | 324480209768 |
| IAM User | john-deploy (AdministratorAccess) |
| RDS User | dropuser |
| RDS Database | dropapp |
| JWT Secret | drop-aws-jwt-secret-2026-xK9mP2vL |

Note: Passwords are stored in Vaultwarden, not in BookStack.

Load Test: Comparison

| Metric | Fly.io (256MB) | AWS (2GB) | Improvement |
|---|---|---|---|
| Throughput | 74 req/s | 186 req/s | 2.5x |
| Health p95 | 6,216ms | 614ms | 10x faster |
| Capacity | ~25 users | ~75-100 users | 3-4x |

Next steps
- Connect App Runner to RDS PostgreSQL (DATABASE_URL); this needs a VPC Connector
- Set up a custom domain (drop-staging.alai.no)
- CI/CD via GitHub Actions → ECR → App Runner
- Load test with PostgreSQL (further improvement expected)

Cost

| Service | Estimate |
|---|---|
| App Runner (1 vCPU, 2GB) | ~$7/mo (idle) |
| RDS db.t3.micro | $0 (Free Tier, 12 months) |
| ECR | ~$1/mo |
| Total | ~$8/mo |

Covered by AWS Activate credits ($1,000 applied for).
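The improvement factors and the monthly total in the tables above are simple ratios and sums, so they can be re-derived whenever a new load test or price estimate comes in. The following is a minimal sketch, not part of the Drop codebase: the names `Metric`, `improvement`, `totalMonthlyCost` and `runwayMonths` are hypothetical, and the input figures are copied from the comparison and cost tables. The $1,000 credit amount has only been applied for, and the $0 RDS line holds only during the 12-month Free Tier, so the runway figure is an upper bound under those assumptions.

```typescript
// Figures copied from the load test comparison table (Fly.io vs AWS).
interface Metric {
  name: string;
  flyIo: number;
  aws: number;
  higherIsBetter: boolean; // throughput: true, latency: false
}

const metrics: Metric[] = [
  { name: "Throughput (req/s)", flyIo: 74, aws: 186, higherIsBetter: true },
  { name: "Health p95 (ms)", flyIo: 6216, aws: 614, higherIsBetter: false },
];

// Improvement factor: a value above 1 means AWS performed better than Fly.io.
function improvement(m: Metric): number {
  return m.higherIsBetter ? m.aws / m.flyIo : m.flyIo / m.aws;
}

// Monthly line items in USD, copied from the cost table.
interface CostItem {
  service: string;
  usdPerMonth: number;
}

const costs: CostItem[] = [
  { service: "App Runner (1 vCPU, 2GB, idle)", usdPerMonth: 7 },
  { service: "RDS db.t3.micro (Free Tier)", usdPerMonth: 0 },
  { service: "ECR", usdPerMonth: 1 },
];

function totalMonthlyCost(items: CostItem[]): number {
  return items.reduce((sum, item) => sum + item.usdPerMonth, 0);
}

// Hypothetical helper: whole months a credit balance lasts at a flat monthly cost.
function runwayMonths(creditsUsd: number, monthlyUsd: number): number {
  return Math.floor(creditsUsd / monthlyUsd);
}

for (const m of metrics) {
  console.log(`${m.name}: ${improvement(m).toFixed(1)}x`);
}
console.log(`Total: ~$${totalMonthlyCost(costs)}/month`);
console.log(`Runway on $1,000 credits: ${runwayMonths(1000, totalMonthlyCost(costs))} months`);
```

Keeping the raw numbers in one place makes it easy to add the planned PostgreSQL load test as a third column and see whether the expected further improvement shows up.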
Cloud Audit

Cloud infrastructure audit and multi-cloud design

Cloud Audit: Resource Inventory

Drop: AWS Resource Inventory

Date: 2026-02-19
Region: eu-west-1 (Ireland)
Account: Drop production
Auditor: infra-lead (CloudForge cloud-audit team)
MC Task: #1443

Executive Summary

Drop runs a minimal AWS footprint: one App Runner service fronting a PostgreSQL RDS instance, with container images stored in ECR. Total estimated cost is $48-60/month.

Four CRITICAL security findings (C1-C4) require immediate action:
- The RDS database is publicly accessible (C1)
- Its security group is open to the entire internet, 0.0.0.0/0 on port 5432 (C2)
- Plaintext secrets (DATABASE_URL with password, JWT_SECRET) sit in App Runner environment variables (C3)
- Database storage is unencrypted (C4)

In addition, there is no WAF, no CloudFront, no CloudWatch monitoring, and no Route53 DNS management, and Secrets Manager is provisioned but empty.

Resource Table

| Resource | Type | ID / Name | Region | Status | Key Config |
|---|---|---|---|---|---|
| App Runner | Service | drop-web | eu-west-1 | RUNNING | 1 vCPU, 2 GB RAM, port 3000 |
| RDS | PostgreSQL 16.6 | drop-db | eu-west-1a | Available | db.t3.micro, 20 GB gp3, single-AZ |
| ECR | Repository | drop-web | eu-west-1 | Active | ScanOnPush: TRUE, Encryption: AES256 |
| Security Group | SG | drop-db-sg | eu-west-1 | In use | Inbound: 0.0.0.0/0 : 5432 |
| VPC | Default | - | eu-west-1 | Active | 172.31.0.0/16 |
| IAM User | User | john-deploy | Global | Active | Programmatic access |
| IAM Role | Role | AppRunnerECRAccessRole | Global | Active | ECR pull permissions |
| Secrets Manager | - | (empty) | eu-west-1 | Provisioned | 0 secrets stored |
| CloudWatch | - | - | - | NOT CONFIGURED | No alarms, no dashboards |
| CloudFront | - | - | - | NOT PROVISIONED | No CDN |
| WAF | - | - | - | NOT PROVISIONED | No web application firewall |
| Route53 | - | - | - | NOT PROVISIONED | DNS managed externally |
| S3 | - | - | - | NOT PROVISIONED | No buckets |

Architecture Diagram

           INTERNET
              |
              | HTTPS (public ingress)
              v
     +------------------+
     |    App Runner    |
     |    drop-web      |
     |                  |
     |  1 vCPU / 2 GB   |
     |  Port 3000       |
     |  ECR source      |
     |                  |
     |  ENV (plaintext):|
     |  - DATABASE_URL  |
     |  - JWT_SECRET    |
     +--------+---------+
              |
              | VPC Connector (egress)
              |
+-------------+-------------+ | Default VPC | | 172.31.0.0/16 | | | | +-------------------+ | | | drop-db-sg | | | | 0.0.0.0/0:5432 | | | +--------+----------+ | | | | | +--------v----------+ | | | RDS | | | | drop-db | | | | | | | | PostgreSQL 16.6 | | | | db.t3.micro | | | | 20 GB gp3 | | | | single-AZ (a) | | | | | | | | Public: YES | | | | Encrypted: NO | | | | Backup: 7 days | | | | DeletionProt: ON| | | | Monitoring: OFF | | | +-------------------+ | +---------------------------+ +----------+ +---------------------+ | ECR | | Secrets Manager | | drop-web | | (EMPTY) | | ScanPush | +---------------------+ +----------+ +----------+ +---------------------+ | IAM | | MISSING | | john- | | CloudWatch | | deploy | | CloudFront | | ECR Role | | WAF / Route53 / S3| +----------+ +---------------------+ Security Findings CRITICAL # Finding Resource Risk Remediation C1 Database publicly accessible RDS drop-db Direct internet access to PostgreSQL. Any attacker can attempt connections. Set PubliclyAccessible=false . App Runner already uses VPC Connector for egress — RDS only needs private subnet access. C2 Security group allows 0.0.0.0/0 on port 5432 drop-db-sg Combined with C1, the database is wide open to brute-force and exploitation from any IP on Earth. Restrict inbound rule to App Runner VPC Connector security group only. Remove 0.0.0.0/0 CIDR. C3 Plaintext secrets in App Runner env vars App Runner drop-web DATABASE_URL contains full connection string with password. JWT_SECRET in plaintext. Anyone with console/API access sees credentials. Visible in CloudTrail, config exports, and deployment logs. Migrate secrets to AWS Secrets Manager (already provisioned, currently empty). Reference via App Runner secret ARN configuration. Rotate both DATABASE_URL password and JWT_SECRET after migration. C4 Database storage unencrypted RDS drop-db Data at rest is not encrypted. Violates baseline security posture and most compliance frameworks (SOC2, GDPR, PCI). 
Enable storage encryption. Note: cannot enable on existing instance — requires snapshot, restore to encrypted instance, DNS/connection swap. Plan downtime window. HIGH # Finding Resource Risk Remediation H1 Single-AZ deployment RDS drop-db AZ failure = full database outage. No automatic failover. Enable Multi-AZ for production. Cost increase ~$14/mo for db.t3.micro. H2 No monitoring or alerting CloudWatch (missing) No CPU, memory, connection, or storage alarms. No visibility into failures, performance degradation, or security events. Silent failures. Configure CloudWatch alarms: CPU > 80%, FreeStorageSpace < 2 GB, DatabaseConnections > 80%, FreeableMemory < 200 MB. Enable Enhanced Monitoring on RDS. H3 No WAF WAF (missing) No protection against OWASP Top 10 attacks (SQLi, XSS, SSRF, etc.) at the edge. App Runner public endpoint is directly exposed. Deploy AWS WAF with managed rule groups (AWSManagedRulesCommonRuleSet, AWSManagedRulesSQLiRuleSet). Attach to CloudFront distribution (see H4). MEDIUM # Finding Resource Risk Remediation M1 No CDN / CloudFront CloudFront (missing) All traffic hits App Runner origin directly. No edge caching, no DDoS protection (Shield Standard), higher latency for distant users. Deploy CloudFront distribution in front of App Runner. Enables WAF attachment, caching, and Shield Standard. M2 Default VPC VPC 172.31.0.0/16 Default VPC has broad routing, public subnets by default, and no network segmentation. Not suitable for production workloads. Create custom VPC with private subnets for RDS, public subnets for NAT Gateway / ALB if needed. Migrate RDS to private subnet. M3 No DNS management Route53 (missing) DNS managed outside AWS. No health checks, no failover routing, no alias records for AWS resources. Consider Route53 for DNS if domain is Drop-owned. Enables health-check-based routing and simpler AWS integration. M4 TCP health check only App Runner drop-web TCP checks confirm port is open but not that the application is healthy. 
A process could accept connections while returning 500s. Configure HTTP health check on a dedicated /health endpoint that verifies database connectivity. LOW # Finding Resource Risk Remediation L1 No S3 buckets S3 (missing) If the app needs file storage in future, ensure encryption-at-rest (SSE-S3 or SSE-KMS), versioning, and public access block from day one. Provision with secure defaults when needed. L2 IAM user john-deploy IAM Long-lived access keys. No indication of key rotation policy or MFA. Audit key age. Enable MFA. Consider OIDC federation for CI/CD instead of IAM user. Rotate keys on a 90-day schedule. Cost Breakdown Service Specification Estimated Monthly Cost App Runner 1 vCPU, 2 GB, always running $29 - $36 RDS db.t3.micro, 20 GB gp3, single-AZ $15 - $18 ECR Image storage (~1-5 GB) $0.50 - $1.00 Data Transfer Minimal (< 10 GB/mo estimate) $1 - $2 Secrets Manager 0 secrets (currently unused) $0 Total $46 - $57/mo Cost Notes App Runner pricing: $0.064/vCPU-hour + $0.007/GB-hour (provisioned mode) RDS db.t3.micro: ~$0.018/hour ($13.14/mo) + $0.115/GB-month storage No NAT Gateway cost (App Runner VPC Connector handles egress) Adding Multi-AZ RDS: +$13-15/mo Adding CloudFront: +$0-5/mo (depends on traffic) Adding WAF: +$5-10/mo (depends on rules and requests) Gaps Analysis Category Current State Target State Priority Secrets management Plaintext env vars Secrets Manager with rotation CRITICAL Network security Public RDS + open SG Private subnet + restricted SG CRITICAL Encryption at rest Disabled AES-256 (KMS or default) CRITICAL Monitoring None CloudWatch alarms + dashboards HIGH High availability Single-AZ Multi-AZ RDS HIGH Edge security No WAF / CDN CloudFront + WAF HIGH Network architecture Default VPC Custom VPC with segmentation MEDIUM Health checks TCP only HTTP application-level MEDIUM IAM hygiene Long-lived keys OIDC + key rotation + MFA MEDIUM DNS External Route53 (optional) LOW Backup/DR 7-day automated only Cross-region snapshot copy LOW 
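Finding C2 can be checked mechanically once a security group's inbound rules have been exported. The following is a minimal sketch under stated assumptions: `InboundRule` is a simplified, hypothetical shape for illustration (not the AWS SDK's response format), and `findWorldOpenRules` is an illustrative helper name. The "restricted" example uses the VPC CIDR from the resource table as a stand-in; the remediation above recommends referencing the VPC Connector security group instead.

```typescript
// Hypothetical, simplified rule shape; real AWS API responses are richer.
interface InboundRule {
  protocol: "tcp" | "udp" | "all";
  fromPort: number;
  toPort: number;
  cidr: string; // e.g. "0.0.0.0/0" or "172.31.0.0/16"
}

const WORLD_OPEN_CIDRS = new Set(["0.0.0.0/0", "::/0"]);

// Returns the rules that expose the given TCP port to the whole internet.
function findWorldOpenRules(rules: InboundRule[], port: number): InboundRule[] {
  return rules.filter(
    (rule) =>
      WORLD_OPEN_CIDRS.has(rule.cidr) &&
      (rule.protocol === "tcp" || rule.protocol === "all") &&
      rule.fromPort <= port &&
      port <= rule.toPort,
  );
}

// Current state of drop-db-sg as recorded in the audit.
const dropDbSg: InboundRule[] = [
  { protocol: "tcp", fromPort: 5432, toPort: 5432, cidr: "0.0.0.0/0" },
];

// Illustrative target state: only traffic from inside the VPC.
const restricted: InboundRule[] = [
  { protocol: "tcp", fromPort: 5432, toPort: 5432, cidr: "172.31.0.0/16" },
];

console.log(findWorldOpenRules(dropDbSg, 5432).length); // 1 offending rule
console.log(findWorldOpenRules(restricted, 5432).length); // 0
```

A check like this could run after each infrastructure change so that a reintroduced 0.0.0.0/0 rule on port 5432 is caught early.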
Recommendations (Priority Order)

Phase 1: Immediate (Week 1), CRITICAL Security

1. Lock down RDS network access
   - Set PubliclyAccessible=false on drop-db
   - Update drop-db-sg: remove 0.0.0.0/0, allow only the App Runner VPC Connector SG
   - Verify App Runner can still connect via the VPC Connector
2. Migrate secrets to Secrets Manager
   - Create secrets: drop/database-url, drop/jwt-secret
   - Update the App Runner service to reference the secret ARNs
   - Remove plaintext env vars from the App Runner config
   - Rotate the database password and JWT secret post-migration
3. Enable RDS encryption
   - Snapshot the current instance
   - Copy the snapshot with encryption enabled, then restore the encrypted copy to a new instance
   - Update the connection string to the new endpoint
   - Verify, then delete the old unencrypted instance
   - Requires brief downtime, so schedule a maintenance window

Phase 2: Short Term (Week 2-3), HIGH Priority

1. Configure CloudWatch monitoring
   - RDS alarms: CPU, storage, connections, memory
   - App Runner alarms: request count, error rate, latency
   - SNS topic for alert notifications
   - Enable RDS Enhanced Monitoring
2. Enable Multi-AZ RDS
   - Modify instance to Multi-AZ
   - Near-zero downtime (AWS handles failover setup)
3. Deploy CloudFront + WAF
   - CloudFront distribution pointing to App Runner
   - WAF with AWS managed rule sets (Common, SQLi, Known Bad Inputs)
   - Update DNS to point to CloudFront

Phase 3: Medium Term (Month 2), Hardening

1. Custom VPC migration
   - Design VPC: 2 private subnets (RDS), 2 public subnets (NAT if needed)
   - Migrate RDS to private subnets
   - Update App Runner VPC Connector
2. HTTP health checks
   - Implement /health endpoint in the Drop application (DB connectivity check)
   - Configure the App Runner HTTP health check path
3. IAM improvements
   - Audit john-deploy key age
   - Enable MFA on the IAM user
   - Consider GitHub Actions OIDC for CI/CD (eliminates long-lived keys)

Risk Matrix

| Risk | Likelihood | Impact | Severity | Mitigation |
|---|---|---|---|---|
| Database breach via public access + open SG | HIGH | CRITICAL | CRITICAL | Phase 1: Lock down network (C1, C2) |
| Credential leak from plaintext env vars | MEDIUM | CRITICAL | CRITICAL | Phase 1: Secrets Manager (C3) |
| Data exposure from unencrypted storage | LOW | HIGH | HIGH | Phase 1: Enable encryption (C4) |
| Database outage (single-AZ failure) | LOW | HIGH | HIGH | Phase 2: Multi-AZ (H1) |
| Silent application failure (no monitoring) | MEDIUM | MEDIUM | HIGH | Phase 2: CloudWatch (H2) |
| Application-layer attack (no WAF) | MEDIUM | HIGH | HIGH | Phase 2: WAF (H3) |
| DDoS / performance degradation (no CDN) | LOW | MEDIUM | MEDIUM | Phase 2: CloudFront (M1) |
| Lateral movement via default VPC | LOW | MEDIUM | MEDIUM | Phase 3: Custom VPC (M2) |
| IAM key compromise | LOW | HIGH | MEDIUM | Phase 3: Key rotation + OIDC (L2) |

Appendix: Raw Resource Details

App Runner: drop-web
    Service: drop-web
    Status: RUNNING
    Region: eu-west-1
    Source: ECR (container image)
    CPU: 1 vCPU
    Memory: 2 GB
    Port: 3000
    Ingress: Public
    Egress: VPC Connector
    Health Check: TCP
    Environment:
      DATABASE_URL (plaintext, contains password)
      JWT_SECRET (plaintext)

RDS: drop-db
    Engine: PostgreSQL 16.6
    Instance Class: db.t3.micro
    Storage: 20 GB gp3
    AZ: eu-west-1a (single-AZ)
    VPC: Default (172.31.0.0/16)
    Public Access: TRUE
    Encrypted: FALSE
    Deletion Prot: TRUE
    Backup: 7-day automated
    Monitoring: DISABLED

ECR: drop-web
    Repository: drop-web
    Scan on Push: TRUE
    Encryption: AES256 (default)

Security Groups: drop-db-sg
    Inbound Rules:
      - Protocol: TCP
        Port: 5432
        Source: 0.0.0.0/0 (ALL TRAFFIC)

IAM
    User: john-deploy (programmatic access, deployment)
    Role: AppRunnerECRAccessRole (App Runner → ECR pull)

Secrets Manager
    Secrets stored: 0 (service provisioned but unused)

Cloud Audit: Multi-Cloud Design

Drop: Multi-Cloud Architecture Design

Date: 2026-02-19
Auditor: solution-arch (CloudForge cloud-audit team)
MC Task: #1443

Executive Summary

Drop is 85% cloud-portable thanks to Docker containerization and PostgreSQL. Main AWS lock-in: App Runner (easily replaceable). Recommendation: stay on AWS, optimize the current setup, and design Terraform with abstraction for future portability.

1.
Provider Comparison Matrix Service AWS (Current) Azure GCP Compute App Runner ($25-35/mo) Container Apps ($20-30/mo) Cloud Run ($15-25/mo) Database RDS PostgreSQL ($15-18/mo) Azure DB for PG ($15-20/mo) Cloud SQL ($12-18/mo) Registry ECR ($1-2/mo) ACR ($5/mo) Artifact Registry ($1-2/mo) Secrets Secrets Manager ($0.40/secret) Key Vault ($0.03/10k ops) Secret Manager ($0.06/10k ops) CDN CloudFront ($0-5/mo) Front Door ($35+/mo) Cloud CDN ($0-5/mo) WAF AWS WAF ($5+/mo) Azure WAF ($20+/mo) Cloud Armor ($5+/mo) Monitoring CloudWatch ($3-10/mo) Azure Monitor ($5-15/mo) Cloud Monitoring ($0-8/mo) Total estimate $50-75/mo $100-130/mo $35-60/mo 2. Portable Architecture Cloudflare (DNS + CDN + WAF) ← Cloud-agnostic edge | | HTTPS v ┌──────────────────┐ │ CaaS Platform │ ← App Runner / Container Apps / Cloud Run │ ┌──────────┐ │ │ │ Docker │ │ ← Identical image everywhere │ │ Next.js │ │ │ │ :3000 │ │ │ └──────────┘ │ └────────┬────────┘ │ DATABASE_URL ┌────────┴────────┐ │ Managed PG │ ← RDS / Azure DB / Cloud SQL └─────────────────┘ Abstraction Strategy Layer Approach Compute Docker image to any CaaS. No platform SDK Database Standard PostgreSQL via DATABASE_URL Secrets Terraform abstracts provider. App reads env vars DNS/CDN/WAF Cloudflare (cloud-agnostic, free tier) Monitoring Sentry (errors) + structured logs to any aggregator CI/CD GitHub Actions (already cloud-agnostic) 3. 
Migration Paths AWS to Azure (3-5 days) Push image to ACR Create Azure DB for PostgreSQL Flexible Server pg_dump/pg_restore data migration Deploy to Azure Container Apps Update Cloudflare DNS Write Azure Terraform modules AWS to GCP (2-3 days) Push image to Artifact Registry Create Cloud SQL PostgreSQL pg_dump/pg_restore Deploy to Cloud Run (most similar to App Runner) Update Cloudflare DNS Write GCP Terraform modules Lock-In Assessment Component Lock-In Notes App Runner LOW Standard Docker, replaceable RDS PostgreSQL LOW Standard PG, any managed PG works ECR LOW Standard OCI registry VPC Connector MEDIUM AWS-specific networking IAM Roles MEDIUM AWS-specific auth model Secrets Manager LOW App reads env vars regardless 4. Recommendation: Stay AWS, Optimize Rationale: $50-75/mo already low No business need to migrate 85% portable — migration possible in 2-5 days if needed Azure costs MORE (~$100-130/mo) GCP saves ~$15/mo but not worth effort now Immediate Actions Security fixes (encrypt RDS, restrict SG, use Secrets Manager) Add Cloudflare free tier (DNS, CDN, WAF — cloud-agnostic) Terraform all resources (reproducibility) Add CloudWatch basic alarms ($3-5/mo) Future Migration Triggers AWS cost > $200/mo → evaluate GCP Cloud Run EU data sovereignty requirement → Azure Norway East Multi-region needed → Cloudflare Workers + D1 Kubernetes requirement → EKS or GKE 5. 12-Month Cost Projection Scenario Monthly Annual Current (no changes) $50-75 $600-900 Optimized AWS $55-80 $660-960 AWS + Cloudflare $55-80 $660-960 Azure equivalent $100-130 $1,200-1,560 GCP equivalent $35-60 $420-720 Cloud Audit: App Cloud Readiness Drop Application Cloud-Readiness Audit MC Task: #1443 Date: 2026-02-19 Auditor: software-arch (CloudForge team) Application: Drop Fintech Payment App (Next.js 15 + SQLite/PostgreSQL dual-driver) NOTE (2026-03-03): This audit was performed on 2026-02-19. ADR-014 (2026-03-03) removed SQLite and the dual-driver architecture. 
Drop now uses PostgreSQL 16 exclusively in all environments. SQLite concerns noted in this audit are resolved. The better-sqlite3 dependency has been removed. 1. Twelve-Factor Compliance I. Codebase — PASS Evidence: Single Git repository at /Users/makinja/ALAI/products/Drop/ .github/workflows/ci.yml triggers on main and develop branches One codebase tracked in revision control, multiple deploys (staging via Fly.io, production via Docker Compose) II. Dependencies — PASS Evidence: package.json:1-55 declares all dependencies explicitly npm ci used in CI ( ci.yml:36 ) and Dockerfile ( Dockerfile:6 ) for deterministic installs package-lock.json referenced in Dockerfile COPY ( Dockerfile:5 ) and CI cache ( ci.yml:32 ) Native modules (better-sqlite3) handled via apk add python3 make g++ in Dockerfile III. Config — PASS Evidence: .env.example:1-87 documents all env vars with clear groupings env.ts:1-45 validates critical vars at startup, crashes if missing in production fly.toml:16-20 injects env vars at runtime docker-compose.production.yml:7-8 uses ${JWT_SECRET:?} required substitution db.ts:9 — database driver selected via DATABASE_URL env var db.ts:26-30 — SQLite path varies by environment (Vercel /tmp , Docker /app/data , local ./data ) Feature flags externalized as NEXT_PUBLIC_FF_* env vars ( Dockerfile:19-26 ) Minor concern: NEXT_PUBLIC_* vars are baked into the build at compile time (Next.js limitation), requiring rebuild for changes. This is inherent to Next.js, not a code deficiency. IV. Backing Services — PASS Evidence: db.ts:9-22 — database treated as attached resource via DATABASE_URL PostgreSQL connection string is a single env var; switching databases requires zero code changes docker-compose.production.yml:17-35 — PostgreSQL is a separate service with its own health check BankID, PISP, AISP, Stripe, Sumsub — all configured via env vars ( .env.example:19-53 ) V. 
Build, Release, Run — PASS Evidence: Dockerfile uses 3-stage build (deps → builder → runner) Dockerfile:1-6 — Stage 1: dependency installation Dockerfile:9-37 — Stage 2: application build with next build Dockerfile:39-64 — Stage 3: minimal production runner next.config.ts:8 — output: "standalone" generates self-contained deployment CI builds Docker image tagged with commit SHA ( ci.yml:63 ) Build-time vs runtime config cleanly separated (ARG for build, ENV for runtime) VI. Processes — PARTIAL Evidence: Application runs as a single node server.js process ( Dockerfile:64 ) SQLite concern: When running with SQLite (no DATABASE_URL ), the process is stateful — data lives on local filesystem at /app/data/drop.db . This works on Fly.io with mounted volumes ( fly.toml:36-38 ) but violates share-nothing for horizontal scaling. PostgreSQL mode: Fully stateless — pg.Pool connects to external database ( db.ts:17-22 ). Multiple processes can run concurrently. Rate limiting: rate_limits table in the database ( middleware.ts:15-43 ), which works for single-instance but has race conditions under horizontal scale with SQLite. Assessment: PARTIAL because SQLite mode is actively used (Fly.io staging). In PostgreSQL mode this would be PASS. VII. Port Binding — PASS Evidence: Dockerfile:61-62 — EXPOSE 3000 , ENV PORT=3000 , ENV HOSTNAME="0.0.0.0" fly.toml:23 — internal_port = 3000 docker-compose.production.yml:5 — ports: "3000:3000" Self-contained via Next.js standalone server, no external HTTP server dependency. VIII. Concurrency — PARTIAL Evidence: Node.js single-threaded event loop handles concurrent requests via async I/O db.ts:16-22 — PostgreSQL connection pool (pg.Pool) supports concurrent queries fly.toml:25-27 — auto_stop_machines / auto_start_machines enables horizontal scaling Limitation: No explicit worker process types. Background work (e.g., exchange rate refresh) runs inline. No separate queue workers. 
For a fintech app, transaction processing should eventually be separated into dedicated worker processes. Limitation: SQLite mode limits to single process (WAL mode allows concurrent reads but single writer). IX. Disposability — PASS Evidence: Process starts fast — Next.js standalone is ~500ms cold start db.ts:719-789 — initDb() is idempotent with _initialized guard; safe for restarts Schema uses CREATE TABLE IF NOT EXISTS — safe for repeated initialization fly.toml:25-27 — machines auto-stop/start, confirming disposability design Graceful shutdown handled by Node.js default SIGTERM behavior PostgreSQL pool ( pg.Pool ) handles connection cleanup on process exit X. Dev/Prod Parity — PASS Evidence: db.ts:9-13 — dual-driver architecture (SQLite for dev, PostgreSQL for prod) with unified async API db.ts:47-63 — SQL compatibility layer translates SQLite idioms to PostgreSQL (placeholder conversion, INSERT OR IGNORE → ON CONFLICT DO NOTHING , datetime('now') → CURRENT_TIMESTAMP ) db.ts:204-460 (SQLITE_SCHEMA) and db.ts:462-690 (PG_SCHEMA) — parallel schemas maintained in sync migrations/0001_initial-schema.ts — node-pg-migrate for PostgreSQL schema versioning Docker Compose production config ( docker-compose.production.yml ) mirrors production topology locally Minor gap: SQLite schema is maintained inline in db.ts while PostgreSQL uses proper migrations ( node-pg-migrate ). Schema drift is possible if one is updated without the other. XI. Logs — PARTIAL Evidence: Health endpoint uses createLogger() ( health/route.ts:16 ) middleware.ts:82-84 — error tracking via trackError() and Sentry integration .env.example:62-74 — Sentry DSN configurable via env vars Concern: No structured logging to stdout visible in the codebase. Next.js default logging goes to stdout which is good for containers, but there's no consistent structured logging format (JSON lines) that cloud log aggregators can parse efficiently. console.error is used in places ( middleware.ts:83 ). XII. 
Admin Processes: PASS

Evidence:
- package.json:12-14: migration scripts migrate:up, migrate:down, migrate:create via node-pg-migrate
- db.ts:735-774: programmatic ALTER TABLE migrations for schema evolution
- Seed data controlled by the SEED_DEMO env var and isDemoMode() check; admin data seeding is decoupled from the main app
- No one-off scripts embedded in application startup (seeding only runs when the database is empty)

2. Containerization Quality

Multi-Stage Build: EXCELLENT

3-stage Dockerfile (Dockerfile:1-64):
- Stage 1 (deps): node:22-alpine, installs native build tools, runs npm ci
- Stage 2 (builder): copies deps, builds the Next.js app
- Stage 3 (runner): minimal Alpine, copies only standalone output + static assets

Image Size

- Base: node:22-alpine (minimal, ~180MB base)
- Issue: Stage 3 installs python3 make g++ (Dockerfile:42) for the better-sqlite3 native module rebuild. This adds ~200MB to the production image unnecessarily if running in PostgreSQL mode. These build tools are a security and size concern in production.
- Recommendation: Either pre-build better-sqlite3 in stage 2 and copy the binary, or conditionally exclude it when PostgreSQL is the target.
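The "conditionally exclude" option depends on the driver choice being known before any native module is loaded. The sketch below illustrates that idea, assuming driver selection by `DATABASE_URL` presence as the audit describes for `db.ts:9`. The names `selectDriver`, `loadDriver`, and the loader map are hypothetical, and the loaders are injected stubs so the snippet imports no real database package. Lazy loading alone does not shrink the image; it only makes it safe to leave the unused package out of a PostgreSQL-only build. Per the 2026-03-03 note in this audit, SQLite has since been removed, so this reflects the audit-time recommendation only.

```typescript
type DriverName = "postgres" | "sqlite";

// Mirrors the rule described in the audit: DATABASE_URL present -> PostgreSQL.
function selectDriver(env: Record<string, string | undefined>): DriverName {
  return env.DATABASE_URL ? "postgres" : "sqlite";
}

// Hypothetical loader map: in real code each loader would wrap a dynamic
// import, so the unused driver's native module is never loaded.
type Loader<T> = () => Promise<T>;

async function loadDriver<T>(
  env: Record<string, string | undefined>,
  loaders: Record<DriverName, Loader<T>>,
): Promise<T> {
  return loaders[selectDriver(env)]();
}

// Usage with stub loaders standing in for the real modules.
const stubLoaders: Record<DriverName, Loader<string>> = {
  postgres: async () => "pg pool",
  sqlite: async () => "sqlite handle",
};

loadDriver({ DATABASE_URL: "postgres://example" }, stubLoaders).then((driver) =>
  console.log(driver),
);
```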
Security Non-root user: nextjs:nodejs (UID/GID 1001) created and used ( Dockerfile:48-49, 58 ) NEXT_TELEMETRY_DISABLED=1 set ( Dockerfile:14, 46 ) Data directory owned by non-root user ( Dockerfile:56 ) CI runs Trivy vulnerability scanner on built image ( ci.yml:67-73 ) with HIGH/CRITICAL severity gate SARIF results uploaded to GitHub Security tab ( ci.yml:85-89 ) Layer Caching Dependencies cached in separate stage ( Dockerfile:5-6 — COPY package.json package-lock.json* before source) Source code copy happens in stage 2 after deps, enabling Docker layer cache for unchanged dependencies Good practice: Build args for feature flags allow cache invalidation only when flags change Missing No .dockerignore verified (could copy unnecessary files like .git , node_modules into build context) No image tagging strategy beyond CI SHA tag 3. Database Portability Dual-Driver Architecture — STRONG Implementation: db.ts:9-13 — Runtime driver selection via DATABASE_URL presence Unified API: query() , getOne() , run() , transaction() — all async, both drivers ( db.ts:67-199 ) Type exports: DbClient interface ( db.ts:136-140 ) for transaction context SQL Translation Layer SQLite Idiom PostgreSQL Translation Location ? placeholders $1, $2, ... db.ts:47-50 INSERT OR IGNORE INTO INSERT INTO ... ON CONFLICT DO NOTHING db.ts:56, 104-118 INSERT OR REPLACE INTO INSERT INTO ... 
ON CONFLICT (col) DO UPDATE SET db.ts:58, 120-134 datetime('now') CURRENT_TIMESTAMP db.ts:60 INTEGER PRIMARY KEY AUTOINCREMENT SERIAL PRIMARY KEY db.ts:278 vs 530 hex(randomblob(32)) encode(gen_random_bytes(32), 'hex') db.ts:248 vs 504 Transaction Support PostgreSQL: BEGIN/COMMIT/ROLLBACK with pgClient.connect() and proper release in finally block ( db.ts:142-173 ) SQLite: db.exec("BEGIN/COMMIT/ROLLBACK") wrapper ( db.ts:174-198 ) Error handling: Both paths catch and rollback on failure Migrations node-pg-migrate for PostgreSQL ( package.json:12-14 , migrations/0001_initial-schema.ts ) Proper up() and down() functions with ordered table creation/deletion SQLite uses inline schema with CREATE TABLE IF NOT EXISTS + ALTER TABLE try/catch migrations ( db.ts:756-774 ) Risk: Two parallel schema definitions (SQLITE_SCHEMA and PG_SCHEMA in db.ts + node-pg-migrate files) could drift. No automated parity check exists. Indexes 22 indexes defined for both drivers (identical set) Partial indexes supported: idx_users_national_id WHERE national_id_hash IS NOT NULL , idx_tx_idempotency WHERE idempotency_key IS NOT NULL 4. 
Config Externalization Environment Variables Category Variables Source Core JWT_SECRET , JWT_EXPIRY , NODE_ENV .env.example:12-14 Database DATABASE_URL db.ts:9 Service Mode NEXT_PUBLIC_SERVICE_MODE , DROP_MODE .env.example:8 Auth (BankID) BANKID_CLIENT_ID/SECRET/URLS , BANKID_MOCK .env.example:19-29 Payments PISP_API_URL/KEY , AISP_API_URL/KEY .env.example:32-40 Cards STRIPE_SECRET_KEY , STRIPE_PUBLISHABLE_KEY .env.example:43-47 KYC SUMSUB_APP_TOKEN , SUMSUB_SECRET_KEY .env.example:50-52 Monitoring SENTRY_DSN , SENTRY_TRACES_SAMPLE_RATE .env.example:63-74 Feature Flags 8x NEXT_PUBLIC_FF_* .env.example:77-87 Exchange EXCHANGE_RATE_API_KEY/URL .env.example:55-59 Secrets Management env.ts:14-45 validates critical vars at production startup Dockerfile:15 — JWT_SECRET=build-phase-placeholder (safe build-time placeholder) env.ts:21-25 — Skip validation during build phase (detects NEXT_PHASE or placeholder) env.ts:36-38 — Rejects known dev placeholder in production runtime docker-compose.production.yml:7 — ${JWT_SECRET:?} required substitution (fails if missing) No hardcoded secrets found in source code Feature Flags 8 client-side feature flags via NEXT_PUBLIC_FF_* env vars Defaults to false (safe) for all card-related features NEXT_PUBLIC_FF_NOTIFICATIONS=true and NEXT_PUBLIC_FF_MERCHANT_DASHBOARD=true as defaults Build-time injection for client code ( Dockerfile:19-35 ), runtime for server code 5. 
CI/CD Quality Pipeline Structure ( ci.yml ) lint-test (parallel) docker-scan (sequential, needs lint-test) -- npm ci -- docker build -- eslint -- Trivy scan (table, exit-code=1 on HIGH/CRITICAL) -- tsc --noEmit -- Trivy SARIF -> GitHub Security -- vitest run -- npm audit (production) Reproducibility Pinned Node.js version: NODE_VERSION: "22" ( ci.yml:15 ) npm ci for deterministic installs ( ci.yml:36 ) Dependency caching via actions/setup-node with cache-dependency-path ( ci.yml:30-32 ) Docker image tagged with commit SHA ( ci.yml:63 ) Security Scanning npm audit: Production dependencies, HIGH level, continue-on-error ( ci.yml:48-49 ) Trivy: Container vulnerability scan, blocks on HIGH/CRITICAL unfixed vulns ( ci.yml:67-73 ) SARIF: Results uploaded to GitHub Security tab ( ci.yml:85-89 ) Permissions: Minimal — contents: read , security-events: write ( ci.yml:11-12 ) Testing vitest run in CI ( ci.yml:44 ) Unit test framework configured ( package.json:10-11 ) Coverage tool available: @vitest/coverage-v8 ( package.json:43 ) Missing: No coverage threshold enforcement in CI Missing: No E2E/integration tests in CI pipeline (Playwright is in devDependencies but not wired into CI) Deployment Fly.io staging configured ( fly.toml ) with health checks, auto-scaling, volume mounts Docker Compose production ( docker-compose.production.yml ) for self-hosted deployments Missing: No automated deployment step in CI (manual fly deploy or similar) Missing: No environment promotion pipeline (develop -> staging -> production) 6. Overall Score and Top 5 Improvements Overall Cloud-Readiness Score: 7.5 / 10 The application demonstrates strong cloud-native fundamentals: Excellent dual-driver database abstraction Proper multi-stage Dockerfile with security hardening Configuration fully externalized via environment variables Comprehensive CI with security scanning (Trivy + npm audit) Health endpoint with real database connectivity check Top 5 Improvements (Priority Order) 1. 
Eliminate Build Tools from Production Image (HIGH)
- File: Dockerfile:42
- Issue: python3 make g++ in the production stage adds ~200MB and attack surface
- Fix: Pre-compile better-sqlite3 in the builder stage and copy only the .node binary. Or use a conditional build that excludes better-sqlite3 entirely when targeting PostgreSQL.

2. Add Structured Logging (HIGH)
- Files: Throughout. console.error is used in middleware.ts:83; the health endpoint has createLogger() but no consistent format
- Issue: Cloud log aggregators (CloudWatch, Datadog, ELK) need structured JSON logs. The current mix of console.log/error and an ad-hoc logger makes log parsing unreliable.
- Fix: Adopt pino or a similar JSON logger, output to stdout in { level, message, timestamp, requestId } format.

3. Add CI Coverage Enforcement and E2E Tests (MEDIUM)
- File: ci.yml (no coverage gate, no Playwright CI step)
- Issue: @vitest/coverage-v8 and @playwright/test are in devDeps but not enforced in CI
- Fix: Add --coverage --coverage.thresholds.lines=80 to vitest. Add a Playwright E2E job with a containerized app.

4. Automate Schema Parity Check (MEDIUM)
- File: db.ts:204-690 (two parallel schema definitions, SQLite + PostgreSQL)
- Issue: Manual sync between SQLITE_SCHEMA, PG_SCHEMA, and node-pg-migrate files. Drift will cause runtime errors that only surface in specific deployment targets.
- Fix: Write a CI check that extracts table/column definitions from both schemas and compares them. Or generate both schemas from a single source of truth.

5. Add Deployment Pipeline and Environment Promotion (MEDIUM)
- File: ci.yml (CI only, no CD)
- Issue: No automated deployment from CI. Fly.io deploy is manual. No staging -> production promotion gate.
- Fix: Add a fly deploy step on develop push (staging) and a manual approval gate for main (production). Add a smoke test after deploy. Consider GitHub Environments for approval workflows.
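Improvement 2 asks for JSON log lines on stdout in `{ level, message, timestamp, requestId }` format. The following dependency-free sketch shows that shape only; it is a stand-in to illustrate the format, not pino's API, and `formatLogLine` and `log` are hypothetical names rather than functions from the Drop codebase.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  level: LogLevel;
  message: string;
  timestamp: string; // ISO 8601
  requestId?: string;
}

// Builds one JSON line; the clock is injectable so the output is testable.
function formatLogLine(
  level: LogLevel,
  message: string,
  requestId?: string,
  now: Date = new Date(),
): string {
  const entry: LogEntry = { level, message, timestamp: now.toISOString() };
  if (requestId !== undefined) entry.requestId = requestId;
  return JSON.stringify(entry);
}

// Writes to stdout, one line per event, so a container log collector
// can parse each line independently.
function log(level: LogLevel, message: string, requestId?: string): void {
  console.log(formatLogLine(level, message, requestId));
}

log("error", "database health check failed", "req-123");
```

Keeping the formatter separate from the writer makes it straightforward to swap in a real logger later without changing call sites.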
Honorable Mentions SQLite mode limits horizontal scaling — document clearly when to switch to PostgreSQL Rate limiting via database has race conditions under concurrent writes (consider Redis for high-throughput) No readiness probe separate from liveness (health endpoint serves both) No graceful shutdown handler (SIGTERM -> drain connections -> exit) playwright-core in production dependencies ( package.json:27 ) — should be devDependencies only Appendix: File Reference File Purpose src/drop-app/src/lib/db.ts Dual-driver database abstraction (SQLite + PostgreSQL) src/drop-app/Dockerfile 3-stage multi-stage build src/drop-app/.env.example Environment variable documentation (87 lines) src/drop-app/fly.toml Fly.io deployment config (Stockholm region) src/drop-app/docker-compose.production.yml Self-hosted production config src/drop-app/package.json Dependencies and scripts .github/workflows/ci.yml CI pipeline (lint, test, type-check, Trivy) src/drop-app/migrations/0001_initial-schema.ts PostgreSQL migration (node-pg-migrate) src/drop-app/next.config.ts Next.js config (standalone output, security headers) src/drop-app/src/middleware.ts Edge middleware (CSRF, CSP nonce) src/drop-app/src/lib/middleware.ts Server middleware (rate limiting, auth, validation, audit) src/drop-app/src/app/api/health/route.ts Health endpoint (real DB check) src/drop-app/src/lib/env.ts Environment validation at startup Cloud Audit: Validation Report Drop — Validation + Security + Cost Report Date: 2026-02-19 Auditor: cloud-tester (CloudForge cloud-audit team) MC Task: #1443 Executive Summary Drop's AWS infrastructure has 3 CRITICAL and 4 HIGH security findings requiring immediate remediation. Current spend is ~$50-75/mo, well-optimized for scale. The application is cloud-portable (7.5/10) and the recommended path is to stay on AWS with security hardening + Terraform IaC. 1. 
Security Posture Assessment

Current vs Improved

| Area | Current State | After Remediation | Risk Reduction |
|---|---|---|---|
| Secrets | Plaintext in App Runner env vars | AWS Secrets Manager | CRITICAL → LOW |
| RDS Access | Publicly accessible, SG open 0.0.0.0/0 | Private, VPC-only access | CRITICAL → LOW |
| Encryption | RDS unencrypted at rest | AES-256 encryption enabled | CRITICAL → RESOLVED |
| Monitoring | None (no CloudWatch) | Basic alarms + Performance Insights | HIGH → LOW |
| WAF | None | Cloudflare WAF (free tier) | HIGH → LOW |
| CDN | None (direct App Runner URL) | Cloudflare CDN | HIGH → LOW |
| SSL/TLS | App Runner managed cert | Cloudflare + App Runner | MEDIUM → LOW |
| IAM | Single user (john-deploy) | Least-privilege roles | MEDIUM → LOW |

Security Findings Summary

| # | Severity | Finding | Remediation | Effort |
|---|---|---|---|---|
| S1 | CRITICAL | RDS publicly accessible with SG allowing 0.0.0.0/0:5432 | Set publicly_accessible=false, restrict SG to VPC CIDR | 1 hour |
| S2 | CRITICAL | Database password in plaintext App Runner env var | Migrate to Secrets Manager, update App Runner to read from SM | 2 hours |
| S3 | CRITICAL | JWT_SECRET in plaintext App Runner env var | Migrate to Secrets Manager | 1 hour |
| S4 | HIGH | RDS storage not encrypted at rest | Enable encryption (requires snapshot + restore for existing DB) | 2-4 hours |
| S5 | HIGH | No monitoring or alerting configured | Add CloudWatch alarms for CPU, memory, DB connections | 1 hour |
| S6 | HIGH | No WAF protection | Add Cloudflare WAF (free tier) | 30 min |
| S7 | HIGH | No CDN (direct App Runner URL exposed) | Add Cloudflare CDN | 30 min |
| S8 | MEDIUM | Sentry DSN in plaintext (not secret, but cleanup) | Move to Secrets Manager for consistency | 30 min |
| S9 | MEDIUM | Docker image has build tools in runner (attack surface) | Remove python3/make/g++ from runner stage | 1 hour |
| S10 | MEDIUM | No structured logging (incident investigation gaps) | Add pino/winston with JSON output | 2 days |
| S11 | LOW | ECR image tag mutability (tag overwrite risk) | Set image_tag_mutability = IMMUTABLE | 5 min |
| S12 | LOW | No lifecycle policy for ECR images | Add policy to clean old images | 15 min |

Compliance Checklist

| Item | Status | Notes |
|---|---|---|
| GDPR data tables (consents, data_access_requests) | PASS | Schema includes consent tracking, DSAR, right to erasure |
| Audit logging | PASS | audit_log table with IP, user_agent, request_id |
| AML/KYC compliance | PASS | aml_alerts, str_reports, screening_results tables |
| Encryption at rest | FAIL | RDS storage unencrypted |
| Encryption in transit | PARTIAL | App Runner HTTPS, but RDS sslmode=no-verify |
| Secrets management | FAIL | Plaintext in env vars |
| Access control | PARTIAL | Single IAM user, no MFA enforcement |
| Backup & recovery | PASS | RDS 7-day automated backups |
| DeletionProtection | PASS | Enabled on RDS |

2. Cost Comparison

Current AWS Spend

| Resource | Monthly Cost | Notes |
|---|---|---|
| App Runner (1 vCPU, 2GB) | $25-35 | Always-on, no auto-stop |
| RDS db.t3.micro | $15-18 | Single-AZ, 20GB gp3 |
| ECR | $1-2 | Image storage |
| VPC Connector | $5 | Flat fee |
| Data transfer | $2-5 | Low traffic |
| Total | $48-65 | |

Optimized AWS (after fixes)

| Resource | Monthly Cost | Change |
|---|---|---|
| App Runner | $25-35 | No change |
| RDS (encrypted) | $15-18 | No cost increase |
| ECR | $1-2 | No change |
| Secrets Manager (3 secrets) | $1.20 | +$1.20 |
| CloudWatch (basic alarms) | $3-5 | +$3-5 |
| Cloudflare (free tier) | $0 | Free CDN/WAF/DNS |
| Total | $52-70 | +$4-7 |

Multi-Cloud Equivalent

| Provider | Monthly | Annual | vs Current |
|---|---|---|---|
| AWS (optimized) | $52-70 | $624-840 | +$4-7/mo |
| Azure | $100-130 | $1,200-1,560 | +$50-65/mo |
| GCP | $35-60 | $420-720 | -$5-15/mo |

Verdict: AWS is cost-effective. GCP saves ~$10/mo, but the migration effort is not justified at current scale.

3. Risk Matrix

| Risk | Probability | Impact | Current Mitigation | Recommended |
|---|---|---|---|---|
| Data breach via public RDS | HIGH | CRITICAL | DeletionProtection only | Restrict SG, disable public access |
| Secret exposure | MEDIUM | CRITICAL | None (plaintext) | Secrets Manager + rotation |
| Service downtime | LOW | HIGH | App Runner auto-scaling | Add health checks, CloudWatch alarms |
| Data loss | LOW | CRITICAL | 7-day RDS backups | Add cross-region backup copy |
| Cost overrun | LOW | MEDIUM | None | Add AWS Budgets alarm at $100 |
| Vendor lock-in | LOW | MEDIUM | Docker + PostgreSQL | Terraform abstraction modules |
| DDoS attack | MEDIUM | HIGH | None | Cloudflare WAF + rate limiting |
| Compliance failure | MEDIUM | HIGH | Tables exist, no encryption | Enable encryption, structured logging |

4. Implementation Roadmap

Phase 1: Security Fixes (Immediate, Day 1)
- Create Secrets Manager secrets (DATABASE_URL, JWT_SECRET, SENTRY_DSN)
- Update App Runner to read from Secrets Manager
- Restrict RDS security group to VPC CIDR
- Disable RDS public accessibility
Effort: 4-6 hours | Cost impact: +$1.20/mo

Phase 2: IaC Migration (Week 1)
- Create S3 bucket for Terraform state
- Import existing resources into Terraform state
- Run terraform plan to verify no drift
- Add terraform-ci.yml to GitHub Actions
Effort: 1-2 days | Cost impact: $0

Phase 3: Monitoring & Observability (Week 2)
- Enable RDS Performance Insights
- Add CloudWatch alarms (CPU > 80%, memory > 80%, DB connections > 80%)
- Add structured logging (pino) to application
- Configure Sentry properly (traces, breadcrumbs)
Effort: 2-3 days | Cost impact: +$3-5/mo

Phase 4: Edge Security (Week 2-3)
- Set up Cloudflare (DNS, CDN, WAF)
- Custom domain (getdrop.no) through Cloudflare
- Enable Cloudflare WAF rules
- Add rate limiting at edge
Effort: 1 day | Cost impact: $0 (free tier)

Phase 5: RDS Encryption (Week 3)
- Create encrypted snapshot from current DB
- Restore to new encrypted instance
- Update Secrets Manager with new endpoint
- Verify and swap
Effort: 2-4 hours (with downtime) | Cost impact: $0

Phase 6: Multi-Cloud Readiness (Month 2+)
- Create Azure Terraform modules (optional)
- Create GCP Terraform modules (optional)
- Test migration to staging on alternative cloud
Effort: 3-5 days | Cost impact: Only if migrated

5. Recommendations Summary

| Priority | Action | Status |
|---|---|---|
| P0 (NOW) | Fix RDS public access + SG | Terraform module created |
| P0 (NOW) | Move secrets to Secrets Manager | Terraform module created |
| P1 (Week 1) | Enable RDS encryption | Requires snapshot/restore |
| P1 (Week 1) | Deploy Terraform IaC | Modules ready |
| P2 (Week 2) | Add monitoring (CloudWatch + Performance Insights) | In Terraform |
| P2 (Week 2) | Add Cloudflare CDN/WAF | Manual setup |
| P3 (Month 1) | Add structured logging | Application code change |
| P3 (Month 1) | Add graceful shutdown handler | Application code change |
| P4 (Month 2+) | Multi-cloud Terraform modules | As needed |

Overall Assessment: Drop's infrastructure is functional but needs immediate security hardening. The Terraform IaC created by this audit provides a complete, reproducible foundation. Total investment: ~1 week of engineering time, ~$5/mo additional cost, significant risk reduction.
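For the Drop audit's cost comparison, the monthly totals can be re-derived from the per-resource ranges. The sketch below is only a cross-check, and it assumes that the VPC Connector and data-transfer lines (which the optimized table does not list) carry over unchanged.

```python
# Monthly cost ranges (low, high) in USD, copied from the Current AWS Spend table.
CURRENT = {
    "App Runner": (25.0, 35.0),
    "RDS db.t3.micro": (15.0, 18.0),
    "ECR": (1.0, 2.0),
    "VPC Connector": (5.0, 5.0),
    "Data transfer": (2.0, 5.0),
}

# Lines added by the remediation, copied from the Optimized AWS table.
ADDED = {
    "Secrets Manager (3 secrets)": (1.20, 1.20),
    "CloudWatch (basic alarms)": (3.0, 5.0),
    "Cloudflare (free tier)": (0.0, 0.0),
}


def total(items):
    """Sum the low and high ends of every range separately."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return round(low, 2), round(high, 2)


current = total(CURRENT)
added = total(ADDED)
# Assumption: nothing in CURRENT gets cheaper or more expensive after the fixes.
optimized = (round(current[0] + added[0], 2), round(current[1] + added[1], 2))

if __name__ == "__main__":
    print("current:", current)
    print("added:", added)
    print("optimized:", optimized)
```

Under that assumption the current range reproduces the table's $48-65, and the optimized range comes out at $52.20 to $71.20, so the upper bound sits about a dollar above the rounded $70 shown in the table.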
Bilko Deploy: Standard Operating Procedure

Last updated: 2026-04-22
Owner: FlowForge (Kelsey Hightower)
Status: ACTIVE

Cloud Run Architecture

GCP Project: tribal-sign-487920-k0
Region: europe-north1

Services:
- bilko-web: Next.js 15 frontend (main branch → bilko-demo.alai.no)
- bilko-api: Express API (main branch → bilko-api-762788903040.europe-north1.run.app)
- bilko-intesa-demo: Intesa pitch demo (feat/intesa-bih-demo → manual deploy only)

Deploy Map

| Branch | Service | URL | CI Workflow | Last Verified |
|---|---|---|---|---|
| main | bilko-web | https://bilko-demo.alai.no | gcp-deploy.yml (BROKEN) | 2026-04-22 |
| main | bilko-api | https://bilko-api-762788903040.europe-north1.run.app | gcp-deploy.yml (BROKEN) | 2026-04-18 |
| feat/intesa-bih-demo | bilko-intesa-demo | https://bilko-intesa-demo-762788903040.europe-north1.run.app | Manual gcloud only | 2026-04-17 |

Pre-Flight Checks (ZAKON PI2 Check 2)

MANDATORY: run these 4 commands and paste the output into the MC task BEFORE touching code:

    # 1. Target URL alive?
    curl -sI https://bilko-demo.alai.no | head -3

    # 2. Branch state?
    git log main --oneline -5

    # 3. CI health?
    gh run list --repo alai-holding/bilko --branch main --limit 3

    # 4. Cloud Run service status?
    gcloud run services describe bilko-web \
      --region europe-north1 \
      --project tribal-sign-487920-k0 \
      --format='value(status.latestReadyRevisionName,status.url,status.traffic)'

If any command returns something unexpected: STOP and escalate to John. Do not proceed.

CI Pipeline Status

Status: BROKEN (2026-04-15 onwards)

Root causes:
- GitHub Actions minutes quota exhausted (monthly limit reached)
- The --no-traffic flag on line 206 of gcp-deploy.yml prevents traffic promotion for existing services

Workaround: use the manual deploy path (see below) until CI is fixed.
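The "returns unexpected: STOP" rule for pre-flight check 1 can be made mechanical. The following is a minimal sketch, not part of the SOP tooling: `http_status` and `preflight_ok` are hypothetical helper names, and the input is assumed to be the text printed by `curl -sI <url> | head -3`.

```python
import re


def http_status(head_output: str) -> int:
    """Return the status code from the first line of `curl -sI` output."""
    first_line = head_output.strip().splitlines()[0]
    # Accepts both "HTTP/2 200" and "HTTP/1.1 200 OK".
    match = re.match(r"HTTP/\d(?:\.\d)?\s+(\d{3})", first_line)
    if match is None:
        raise ValueError(f"not an HTTP status line: {first_line!r}")
    return int(match.group(1))


def preflight_ok(head_output: str, expected: int = 200) -> bool:
    """True when the probe returned the expected status; False means STOP."""
    try:
        return http_status(head_output) == expected
    except (ValueError, IndexError):
        # Empty or malformed output counts as an unexpected result.
        return False


if __name__ == "__main__":
    sample = "HTTP/2 200\r\ncontent-type: text/html\r\n"
    print("proceed" if preflight_ok(sample) else "STOP, escalate")
```

Treating empty or unparsable output as a failure keeps the gate conservative: a probe that prints nothing is handled the same way as a wrong status code.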
Manual Deploy Path (Emergency + CI Broken)

When CI is broken, or for emergency fixes, follow this path.

Step 1: Build Docker Image

    cd /Users/makinja/ALAI/products/Bilko
    docker build \
      --platform linux/amd64 \
      -f apps/web/Dockerfile \
      --build-arg NEXT_PUBLIC_API_URL=https://bilko-api-762788903040.europe-north1.run.app/api/v1 \
      -t europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web:fix-- \
      .

Image tag convention:
- ✅ fix-bugs-22apr, fix-logo-23apr
- ❌ latest (not traceable)

Context reduction (.dockerignore): as of 2026-04-22, .dockerignore reduces the build context from 4.1GB to 50MB by excluding node_modules, .next, apps/e2e, docs, etc.

Step 2: Push to Artifact Registry

    gcloud auth configure-docker europe-north1-docker.pkg.dev
    docker push europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web:fix--

Step 3: Deploy to Cloud Run

CRITICAL: Do NOT use the --no-traffic flag for existing services. It blocks traffic promotion.

    gcloud run deploy bilko-web \
      --image europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web:fix-- \
      --region europe-north1 \
      --platform managed \
      --allow-unauthenticated \
      --max-instances 10 \
      --min-instances 0 \
      --memory 512Mi \
      --cpu 1 \
      --concurrency 100 \
      --timeout 60s \
      --port 3000 \
      --set-env-vars NEXT_PUBLIC_API_URL=https://bilko-api-762788903040.europe-north1.run.app/api/v1,NEXT_TELEMETRY_DISABLED=1 \
      --project=tribal-sign-487920-k0

Step 4: Verify Deployment

    # Check revisions
    gcloud run revisions list \
      --service bilko-web \
      --region europe-north1 \
      --project=tribal-sign-487920-k0 \
      --limit=5

    # Verify traffic routing (should show 100% on latest revision)
    gcloud run services describe bilko-web \
      --region europe-north1 \
      --project=tribal-sign-487920-k0 \
      --format='value(status.traffic)'

Post-Deploy Evidence Gate (ZAKON PI2 Check 5)

The MC task CANNOT move to done without ALL three:

1. curl checks: paste output showing HTTP 200 for expected routes

    curl -sI https://bilko-demo.alai.no | head -3
    curl -sI https://bilko-demo.alai.no/invoices/new | head -3
    curl -sI https://bilko-demo.alai.no/settings | head -3
    curl -sI https://bilko-demo.alai.no/intesa-bridge | head -3  # Should be 404

2. Playwright screenshots: stored in docs/evidence//*.png
   - Home page
   - Feature verified (e.g., invoice template save button)
   - Any isolation checks (e.g., 404 for client routes on main)

3. verification.json: machine-readable evidence file

    {
      "task_id": 8730,
      "timestamp": "2026-04-22T21:41:10Z",
      "revision": "bilko-web-00019-7tl",
      "traffic_100_percent": true,
      "curl_checks": { "home": 200, "intesa-bridge": 404, ... },
      "playwright_pass": true,
      "screenshots": ["home.png", "invoices-new.png", ...]
    }

Deploy Flow Diagram

    flowchart LR
      A[Code Change] --> B{CI Healthy?}
      B -->|Yes| C[CI: Build + Push]
      B -->|No| D[Manual Build]
      C --> E[Artifact Registry]
      D --> E
      E --> F[Cloud Run Deploy]
      F --> G{Traffic Routing}
      G -->|100%| H[Live]
      G -->|0%| I[Blocked - Check --no-traffic flag]
      H --> J[Evidence Gate]
      J --> K{All 3 checks pass?}
      K -->|Yes| L[MC task done]
      K -->|No| M[Block - Add evidence]

Known Issues + Workarounds

Issue 1: CI broken since 2026-04-15
- Symptom: All main branch pushes fail at the deploy step
- Root cause: GitHub Actions quota + --no-traffic flag
- Workaround: Use the manual deploy path above

Issue 2: Intesa content leaked to public URL (fixed 2026-04-22)
- Symptom: /intesa-bridge route returned 200 on bilko-demo.alai.no
- Root cause: Intesa feature branch merged to main
- Fix: Deleted intesa routes from main (commit 66d2220) + added branch-purity.yml CI check

Issue 3: Manual paste-copy anti-pattern
- Symptom: CEO had to manually paste docker build output and gcloud commands
- Root cause: FlowForge task dispatched after the image was built locally
- Fix: Always dispatch FlowForge BEFORE the build step and let the agent own the full flow

Branch Purity Rules

Client-specific routes MUST NOT appear on main.
Reserved prefixes:
- intesa-* → feat/intesa-bih-demo → bilko-intesa-demo Cloud Run
- corpint-* → TBD client branch → TBD Cloud Run service

CI enforcement: .github/workflows/branch-purity.yml runs on every PR to main:

    find apps/web/app -type d \( -name "intesa-*" -o -name "corpint-*" \) | grep . && exit 1 || exit 0

Registry: ~/system/rules/client-prefix-registry.md

Domain Mapping
- bilko-demo.alai.no → Cloud Run service bilko-web (configured via GCP Console)
- DNS: Cloudflare proxy enabled
- Mapping verified: 2026-04-22

Related Documentation
- DEPLOY-MAP.md: /Users/makinja/ALAI/products/Bilko/DEPLOY-MAP.md
- Incident Postmortem: BookStack → ALAI / Incidents / incident-2026-04-22-bilko-deploy-fix
- ZAKON PI2: ~/system/rules/zakon-pi2-deploy-verification.md
- CI Workflow: .github/workflows/gcp-deploy.yml
- Dockerfile: apps/web/Dockerfile

Escalation
- Owner: FlowForge
- Escalate to: John → pi-orchestrator
- MC category: devops + priority: H

Created by ALAI Skillforge, 2026-04-22

Bilko CI/CD: Stage→Prod Pipeline (MC #99477)

Overview

- Stage pipeline: push-main → bilko-stage-auto-deploy → cloudbuild-stage.yaml → bilko-{web,api}-stage
- Prod pipeline: tag v* → bilko-main-deploy → cloudbuild.yaml → bilko-{web,api}

The stage pipeline is optimized for FAST FEEDBACK and has no quality gates. The prod pipeline has 8 production gates, including SHA verification, Trivy scanning, Flyway migrations, and Cloud Build native approval.
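The routing described in the overview (a push to main starts the stage pipeline, a v* tag starts the prod pipeline) can be illustrated with a small matcher. This is a sketch of the decision only, not the Cloud Build trigger configuration: `trigger_for` is a hypothetical helper, and the tag regex is an assumed reading of "v*" as a three-part semver tag.

```python
import re
from typing import Optional

# Assumption: the stage trigger matches exactly the main branch.
STAGE_BRANCH = re.compile(r"^main$")
# Assumption: "tag v*" means a semver tag such as v1.4.0.
PROD_TAG = re.compile(r"^v\d+\.\d+\.\d+$")


def trigger_for(ref_type: str, name: str) -> Optional[str]:
    """Return the trigger name that a pushed ref would fire, or None."""
    if ref_type == "branch" and STAGE_BRANCH.match(name):
        return "bilko-stage-auto-deploy"
    if ref_type == "tag" and PROD_TAG.match(name):
        return "bilko-main-deploy"
    return None


if __name__ == "__main__":
    for ref in [("branch", "main"), ("tag", "v1.4.0"), ("branch", "feat/intesa-bih-demo")]:
        print(ref, "->", trigger_for(*ref))
```

Feature branches fall through to `None`, which mirrors the deploy map: feat/intesa-bih-demo is deployed manually and has no trigger.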
Stage Pipeline

| Step | Purpose | Image Tag | Duration (avg) |
|---|---|---|---|
| sanity-check | Verify Docker socket + Artifact Registry reachability (environment health, NOT a quality gate) | - | ~2.3s |
| build-web | Build Next.js app with docker buildx (apps/web/Dockerfile) | :stage-${SHORT_SHA}, :stage-latest | ~3m |
| push-web | Push image to Artifact Registry (europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web) | - | ~7s |
| migrate-db | Run Flyway migrations against Cloud SQL bilko-staging-db (POSTGRES_16) via Cloud SQL proxy | - | ~22s |
| deploy-web-stage | Deploy bilko-web-stage Cloud Run service with :stage-${SHORT_SHA} image, --no-traffic | - | ~39s |
| promote-web-stage | Route 100% traffic to new revision (no canary for stage) | - | ~10s |
| deploy-api-stage | Deploy bilko-api-stage (redeploys EXISTING image only; no API build step, see OCD-1) | - | ~19s |
| smoke-test | curl -sf https://bilko-api-stage-dh4m46blja-lz.a.run.app/api/v1/health (exit 1 if non-200) | - | ~2.5s |

Total duration: ~5 minutes (build 6f2236f6, validated 2026-05-06)

Prod Pipeline

The existing prod pipeline (cloudbuild.yaml) has 8 gates and MUST NOT be rewritten. References:
- SHA verification (Git commit SHA in image metadata)
- Trivy vulnerability scanning
- Flyway migration validation
- Cloud Build native approval (approval_required=true in modules/build/main.tf)
- Smoke tests (health endpoint + web homepage)
- Gradual traffic rollout (0% → 100%)
- Rollback on smoke test failure

The prod pipeline is BLOCKED on OCD-5 (the bilko-db Cloud SQL instance does not exist; provisioning requires CEO approval).
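The last two prod gates listed above, gradual rollout and rollback on smoke test failure, interact in a simple loop. The sketch below illustrates that control flow only; it is not taken from cloudbuild.yaml. `gradual_rollout`, `set_traffic` and `smoke_ok` are hypothetical names, and the intermediate percentages are assumptions (the source states only 0% → 100%).

```python
from typing import Callable, Sequence


def gradual_rollout(
    set_traffic: Callable[[int], None],
    smoke_ok: Callable[[], bool],
    steps: Sequence[int] = (10, 50, 100),  # assumed percentages
) -> bool:
    """Shift traffic step by step; roll back to 0% on the first failed smoke test."""
    for percent in steps:
        set_traffic(percent)
        if not smoke_ok():
            # Rollback: the new revision stops receiving traffic.
            set_traffic(0)
            return False
    return True


if __name__ == "__main__":
    history = []
    ok = gradual_rollout(history.append, lambda: True)
    print(ok, history)
```

Passing the traffic setter and the smoke probe as callables keeps the loop testable without touching Cloud Run; in a real pipeline they would wrap the corresponding deploy and health-check steps.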
Triggers

| Trigger Name | Filename | Branch/Tag | Approval | Service Account |
|---|---|---|---|---|
| bilko-stage-auto-deploy | infrastructure/gcp/cloudbuild-stage.yaml | ^main$ | No (auto-deploy) | 762788903040@cloudbuild.gserviceaccount.com |
| bilko-main-deploy | infrastructure/gcp/cloudbuild.yaml | v* (semver tag) | Yes (Cloud Build UI) | 762788903040@cloudbuild.gserviceaccount.com |

GCP project: tribal-sign-487920-k0, region: europe-north1

Open Risks: 5 CEO Decisions Required

These items require CEO judgment and are NOT resolved in this implementation.

OCD-1: bilko-api Build Pipeline Gap

Status: OPEN, BLOCKER for API continuous delivery

Current state: bilko-api-stage is live and serving traffic at https://bilko-api-stage-dh4m46blja-lz.a.run.app/api/v1 with image api:stage-b7e8a59. No Cloud Build pipeline exists for the Kotlin/Ktor API. Dockerfile path unconfirmed.

Impact: The deploy-api-stage step in cloudbuild-stage.yaml redeploys the EXISTING API image only and cannot build new API images. API deployments must be manual via gcloud run deploy until this is resolved.

CEO decisions needed:
- What is the canonical Dockerfile path for apps/api?
- Should the API have its own Cloud Build step in cloudbuild-stage.yaml or a separate trigger?
- Is bilko-api currently deployed manually via gcloud run deploy?

OCD-2: Stage Hostname (bilko-stage.alai.no vs Raw .run.app URL)

Status: OPEN, affects CORS configuration

Current state: ENV-MATRIX.md CORS_ORIGINS for staging references staging.bilko.io (STALE). terraform.tfvars stage_api_url points to the raw .a.run.app URL. The stage pipeline uses the raw .run.app URL as default.

Impact: Frontend CORS errors if staging.bilko.io DNS is ever pointed at stage services.

CEO decision needed: Should bilko-stage.alai.no be the canonical stage hostname? If yes: a Cloudflare DNS entry (manual, not in the Bilko TF stack) and a CORS_ORIGINS update are required via a separate MC.
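To make the OCD-2 concern concrete, the sketch below shows how an allow-list check would treat the candidate stage hostname while CORS_ORIGINS still carries the stale value. It is an illustration only: the real CORS middleware is not shown in this document, `origin_allowed` is a hypothetical helper, and both the comma-separated format and the https scheme are assumptions.

```python
def parse_origins(cors_origins: str) -> set:
    """Split a comma-separated CORS_ORIGINS value into normalized origins."""
    return {
        origin.strip().rstrip("/").lower()
        for origin in cors_origins.split(",")
        if origin.strip()
    }


def origin_allowed(origin: str, cors_origins: str) -> bool:
    """True when the request Origin is on the allow-list."""
    return origin.strip().rstrip("/").lower() in parse_origins(cors_origins)


# Stale staging value described in OCD-2 (scheme is an assumption).
STALE_CORS_ORIGINS = "https://staging.bilko.io"

if __name__ == "__main__":
    print(origin_allowed("https://bilko-stage.alai.no", STALE_CORS_ORIGINS))
    print(origin_allowed("https://staging.bilko.io", STALE_CORS_ORIGINS))
```

With the stale value in place, a frontend served from bilko-stage.alai.no would be rejected, which is why the decision on the canonical hostname has to be paired with a CORS_ORIGINS update.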
OCD-3: Postgres Version Mismatch (Stage POSTGRES_16 vs Prod POSTGRES_15)

Status: OPEN, CRITICAL for financial data integrity

Current state: bilko-staging-db runs POSTGRES_16 (confirmed live). envs/prod/main.tf line 94 specifies POSTGRES_15 for prod (bilko-db does not exist yet, see OCD-5). Stage validates migrations and queries against PG16; prod would run PG15.

Impact: For a financial accounting SaaS, stage validation on PG16 while prod runs PG15 invalidates the "stage-as-test-environment" premise. Schema compatibility is unverified. SQL dialect differences (PG15→PG16) may surface as prod-only bugs.

CEO decision needed: Upgrade prod to POSTGRES_16 (requires maintenance window, pg_upgrade or dump/restore) OR downgrade stage to POSTGRES_15? The ALAI standard tech stack (ALAI/CLAUDE.md) mandates POSTGRES_16 for all products, suggesting the prod config is non-compliant.

OCD-4: Stage → Prod SHA Promotion Strategy

Status: OPEN, architectural decision

Current state: The prod trigger fires on a semver tag push and rebuilds from source. The stage-validated image digest is NOT carried to the prod build. Stage tests one SHA and prod deploys a different build. If a hot dependency updates between the stage build and the prod build (e.g., the npm registry serves a new patch version), stage and prod can diverge on identical Git SHAs.

CEO decision needed:
- Option A: Accept rebuild-on-tag (simpler, current model) with acknowledgment of the hot-dependency risk.
- Option B: Implement digest promotion, where the prod trigger accepts an image digest input parameter and skips the rebuild. Requires a Cloud Build trigger API call from a promotion script or Google Cloud Deploy.

OCD-5: Prod Cloud SQL bilko-db Existence

Status: OPEN, BLOCKER for prod terraform apply

Current state: gcloud sql instances list --project=tribal-sign-487920-k0 shows ONLY bilko-staging-db. No bilko-db (prod) exists. envs/prod/main.tf explicitly notes "bilko-db (prod) — TBD — audit required" (lines 4-6 and import.sh).

Impact: Any terraform apply on envs/prod would attempt to create a REGIONAL HA POSTGRES_15 db-custom-2-7680 instance (~$100+/month). Without CEO sign-off, prod infra is BLOCKED.

CEO decision needed: Approve prod DB provisioning (cost + data migration strategy if migrating from elsewhere) before ANY envs/prod TF apply is ever run. If bilko-db exists elsewhere (on-prem? Railway?), import.sh must be run first.

Validation

- Evidence file: /tmp/99477-proveo-evidence.md
- Build ID: 6f2236f6-86ec-444c-96b7-7c22f63cf5a2
- Build log: View in GCP Console
- Validation date: 2026-05-06T20:28Z
- Validator: Angie Jones (Proveo)
- Verdict: PASS, 7/7 Acceptance Criteria met

Acceptance criteria:
- AC1: Build SUCCESS (all 8 steps SUCCESS)
- AC2: bilko-web-stage HTTP/2 200
- AC3: bilko-api-stage health endpoint 200 {"status":"ok"}
- AC4: New web revision deployed within 5min window
- AC5: Flyway migrate-db ran without error (21.5s)
- AC6: No gate-* steps executed (0 quality gates)
- AC7: Image pushed with :stage-${SHORT_SHA} tag (stage-277dd5a confirmed in Artifact Registry)

Related MCs

- #99395: VAT enum-cast genesis (billing_country ENUM cast to TEXT in Flyway migration)
- #99422: Sibling task (stage Cloud Run services health check)
- #99477: This task (Stage CI/CD pipeline implementation)

ZAKON PI2 Compliance Status

Stage pipeline: ✅ COMPLIANT
- DEPLOY-MAP.md exists at repo root ✅
- Pre-flight checks executed (4 probes: triggers, GCS bucket, Cloud Run services, SQL instances) ✅
- Post-deploy validation (curl 200 + Cloud Run revision evidence) ✅
- Evidence files delivered (/tmp/99477-preflight.txt, /tmp/99477-proveo-evidence.md) ✅

Prod pipeline: ⏸ BLOCKED (awaiting OCD-5 bilko-db provisioning approval)

Last Updated: 2026-05-06, owner: FlowForge (Kelsey Hightower)

ALAI CI/CD Blueprint Standardization 2026-05-08

Master MC: #99881
Owner: John (AI Director) + Petter Graff persona for canonical refresh
Status: All 4 phases verified closed. Triple-layer enforcement live.
Cost: ~$15-30 LLM tokens

Context

CEO directive 2026-05-08 in a single-day push: "Discuss CI/CD pipelines and blueprints" → triple-layer mechanical enforcement live + 7/7 fleet compliance + free-first routing across persona blueprints.

4-phase arc summary

| Phase | MC | Outcome |
|---|---|---|
| 1. Audit | #99882 | 4 artifacts in ~/system/specs/cicd-audit-2026-05-08/ (gap matrix, deploy-map matrix, canonical self-audit, summary). 1 real bug caught: DropSrbija/BUILD-BLUEPRINT.md line 225 stale "Postgres 5434" comment (actual port 5436). |
| 2. Canonical refresh | #99886 | UNIVERSAL bumped to v3.0 (§13 6-mandatory files including DEPLOY-MAP, §15 forma-only variant, §16.3 CI gates, ZAKON PI2 invariant). DEPLOY bumped to v2.0 (multi-profile §1A GCP / §1B Azure VM / §1C Cloudflare Pages / §1D Vercel deprecated). blueprint-format.md disambiguation header (YAML agent layer vs MD product layer). alai-cicd-architecture.md staleness notice (sections §5.2 AWS, §9 Phase 3 superseded). |
| 3. Product migration | #99896 | 7 in-scope products migrated to v2 §1A/§1B/§1C profiles. 6 new mandatory files created (web PIPELINE/RUNBOOK/CHANGELOG, Gotiva RUNBOOK/CHANGELOG, Drop PIPELINE). Drop §1B refactor reached FULL_COMPLIANCE 5/5 schema. Excluded: BasicFakta (MC #99893 Vercel→CF Pages migration), DropSrbija (MC #99883 scope decision), akershus-fylke (forma-only). |
| 4. Enforcement | #99911 | Triple-layer mechanical enforcement live. |

Triple-layer enforcement (all live, all verified)

1. Linter: ~/system/tools/blueprint-check.js v2

Dual-mode (backward compatible with mehanik-commit + pre-dispatch-gate Check 9):
- Rubric mode (default, original): scores BUILD-BLUEPRINT.md 0-100 across 6 checks. Exit 0 if ≥ 60.
- Inventory mode (--inventory): checks 6 mandatory files per UNIVERSAL v3 §13. Validates DEPLOY-MAP.md schema 5/5 per DEPLOY v2 §4. Respects the forma-only flag. Verdict states: FULL_COMPLIANCE / FORMA_ONLY_OK / PARTIAL_SCHEMA / MISSING_FILES. JSON output reusable by hook + daemon.

2. PostToolUse hook: ~/.claude/hooks/blueprint-schema-validator.sh

- Registered in settings.json under the Write|Edit|MultiEdit matcher.
- Triggers on writes to product-root DEPLOY-MAP.md files under ~/business/ALAI-Holding-AS/{products,web,finance}/*/.
- Blocks with exit 2 + structured BLOCKED message + missing sections + template pointers when the schema fails.
- Override marker: .
- Trace log: ~/system/state/blueprint-schema-validator-trace.log.

3. Nightly daemon: ~/system/daemons/blueprint-fleet-watchdog.js

- LaunchAgent com.alai.blueprint-fleet-watchdog schedules a daily run at 06:15.
- Scans 10 product roots, persists state to ~/system/state/blueprint-fleet-status.json, and detects regressions (verdict drop, schema score drop, file removal) with a differential alert.
- Exit 1 on regression.

Free-first routing (CEO directive: "enable free models wherever you can")

~/system/config/tier-routing.json updated:
- MLX FORGE tiers added: M2 (gemma-4-26b@11435), M2c (qwen3-coder-30b@11437), M3 (qwen3-32b@11436). All 3 servers verified live via curl before adding to canonical.
- callerRoutes added: verifier→2cHQ, fix-builder→2c, redzo-reviewer→M2c.
- providerFallback chains: verifier (MLX → Ollama ANVIL → Claude secondary), fix-builder (Ollama → Ollama → Claude secondary).

Persona blueprint sweep (MC #99923): 13 yaml files, made up of 9 all-sonnet personas (AgentForge, Axiom, Finverge, FlowForge, Lexicon, Proveo, Resolver, Skybound, Vizu) + 4 CodeCraft yaml (api-backend, codecraft-api, nextjs-app, openapi-sdk-package). 46 phase declarations swept sonnet → local-first (qwen2.5-coder:32b@anvil for general phases, qwen3-coder:latest@forge for code-gen phases). 6 KEPT-sonnet phases with explicit rationale: 3 Lexicon legal phases (Norwegian law / GDPR / PSD2 regulatory precision), 3 Resolver cross-company phases (multi-domain reasoning).
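The providerFallback chains in the free-first routing section amount to "take the first provider that answers". A minimal sketch of that selection follows; it does not read tier-routing.json, the provider labels are illustrative strings rather than the file's actual keys, and `pick_provider` and `is_up` are hypothetical names.

```python
from typing import Callable, Optional

# Chains copied from the providerFallback description.
# Assumption: labels are illustrative, not the real keys in tier-routing.json.
FALLBACK = {
    "verifier": ["MLX", "Ollama ANVIL", "Claude secondary"],
    "fix-builder": ["Ollama", "Ollama", "Claude secondary"],
}


def pick_provider(caller: str, is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider in the caller's chain that reports healthy."""
    for provider in FALLBACK.get(caller, []):
        if is_up(provider):
            return provider
    return None  # every provider in the chain is down, or the caller is unknown


if __name__ == "__main__":
    # Simulate the MLX server being offline.
    print(pick_provider("verifier", lambda name: name != "MLX"))
```

Because the paid provider sits last in each chain, it is only reached when every local option has failed, which is the free-first behaviour the directive asks for.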
Verifier pattern proven

The bp-verifier background agent ran ~15 rounds and ~178 atomic claims, and caught 2 real bugs:
- DropSrbija/BUILD-BLUEPRINT.md line 225 stale comment "Postgres 5434" (actual port 5436 per docker-compose.yml). Fixed in both the audit artifact and the product blueprint.
- Drop/DEPLOY-MAP.md schema 3/5 PARTIAL: no formal OPEN RISK / OCD register, no SA distinction. Fixed via §1B-appropriate equivalents (SSH key → Trigger SA equivalent, container USER → Service SA equivalent).

Pattern recommendation: For every multi-phase project, spawn a named bp-verifier in the background ( Agent({subagent_type: "verifier", name: "bp-verifier", run_in_background: true}) ), send each artifact via SendMessage for atomic claim validation, and fix-loop on FAIL. Cost: $0.10 per round on Claude ($0 if MLX is primary per the new tier-routing).

Fleet compliance final (verified by daemon 2026-05-08)

| Product | Verdict | Files | Schema | Profile |
|---|---|---|---|---|
| Bilko | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP |
| Tok | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP |
| Drop | FULL_COMPLIANCE | 6/6 | 5/5 | §1B Azure VM |
| Lobby | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP (stub) |
| Plock | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP (stub) |
| Gotiva | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP multi-service |
| web | FULL_COMPLIANCE | 6/6 | 5/5 | §1C CF Pages |
| akershus-fylke | FORMA_ONLY_OK | 1/1 | N/A | non-deployable |
| BasicFakta | MISSING_FILES | 5/6 | 0/5 | §1D Vercel deprecated (MC #99893 migration backlog) |
| DropSrbija | MISSING_FILES | 3/6 | 0/5 | scope decision pending (MC #99883) |

Open follow-ups (parked, not blocking arc closure)

- #99883 DropSrbija scope decision (separate product vs Drop multi-tenant): needs petter-graff arch memo
- #99893 BasicFakta Vercel→CF Pages migration: 3-4h work + 30d soak
- #99895 Coverage threshold review scheduled 2026-05-22 (after 2-week observability)
- #99955 Securion task/owner schema canonical alignment (L)

Git audit trail

- ~/system commit: a02fd0109, 29 files, +6184/-122 (canonical v3 + audit artifacts + linter v2 + daemon + tier-routing + 13 persona blueprints)
- ~/.claude commit: bf2ca2d49, hook + settings.json registration
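The verdicts in the fleet compliance table, and the regression conditions the nightly daemon watches for (verdict drop, schema score drop, file removal), can be sketched as two small functions. This is not the code of blueprint-check.js or blueprint-fleet-watchdog.js: the function names, the precedence of the checks and the verdict ranking are assumptions chosen so that the table's rows come out as documented.

```python
from typing import Optional

# Assumed ordering used to decide whether a verdict got worse.
RANK = {"MISSING_FILES": 0, "PARTIAL_SCHEMA": 1, "FORMA_ONLY_OK": 2, "FULL_COMPLIANCE": 2}


def verdict(files_present: int, files_required: int,
            schema_score: Optional[int], forma_only: bool = False) -> str:
    """Classify one product root from its file count and DEPLOY-MAP schema score."""
    if files_present < files_required:
        return "MISSING_FILES"
    if forma_only:
        return "FORMA_ONLY_OK"  # schema is not applicable to forma-only roots
    if schema_score is None or schema_score < 5:
        return "PARTIAL_SCHEMA"
    return "FULL_COMPLIANCE"


def regressions(previous: dict, current: dict) -> list:
    """Names of products whose verdict, schema score or file count dropped."""
    dropped = []
    for product, before in previous.items():
        after = current.get(product)
        if after is None:
            continue  # product no longer scanned; out of scope for this sketch
        if (RANK[after["verdict"]] < RANK[before["verdict"]]
                or after["schema"] < before["schema"]
                or after["files"] < before["files"]):
            dropped.append(product)
    return sorted(dropped)


if __name__ == "__main__":
    print(verdict(6, 6, 5))        # e.g. Bilko
    print(verdict(5, 6, 0))        # e.g. BasicFakta
    print(verdict(1, 1, None, forma_only=True))  # e.g. akershus-fylke
```

A daemon built on this would exit non-zero whenever `regressions` returns a non-empty list, matching the "Exit 1 on regression" behaviour described for the watchdog.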
Lessons

- Verifier-in-background catches real bugs: propagated stale comments and schema gaps. USE THIS PATTERN for every multi-phase project.
- Mehanik (mechanical) enforcement >> ZAKON-only: hook + daemon catch what a memo can't. UNIVERSAL §13 / DEPLOY §4 are now mechanically enforced.
- Local-first is viable for builder/verifier: qwen2.5-coder + qwen3-coder + MLX qwen3-coder-30b are sufficient for schema validation, code gen, and doc drafts. Sonnet remains for high-stakes synthesis (legal, cross-company).
- Closure-loop discipline: the pattern is build-verify-mark-done, not build-verify-stop. The CEO caught a gap in mid-session closure ("is everything documented, merged, and closed per the rules?") and triggered this BookStack publish + git commit + memory entry.

References

- Memory project entry: ~/.claude/projects/-Users-makinja/memory/project_cicd_standardization_2026-05-08.md
- Audit artifacts: ~/system/specs/cicd-audit-2026-05-08/{blueprint-gap-matrix,deploy-map-gap-matrix,canonical-self-audit,summary}.md
- v3 drafts (review trail): ~/system/specs/cicd-canonical-v3-drafts/
- Canonical (production): ~/system/specs/{ALAI-UNIVERSAL-BLUEPRINT,DEPLOY-BLUEPRINT,blueprint-format,alai-cicd-architecture}.md
- Pre-promotion backups: ~/system/specs/_backups/20260508-111700/

Slack bot token SSOT: slack.json (MC #102830), 2026-06-03

Summary

MC #102830 makes ~/system/config/slack.json the single source of truth (SSOT) for the Slack bot's tokens, with environment-variable fallback, and removes the hardcoded tokens from the LaunchAgent plist. Previously com.john.slack-bot.plist hardcoded both SLACK_BOT_TOKEN and SLACK_APP_TOKEN in EnvironmentVariables, so a token rotation that wasn't mirrored into the plist would strand the daemon with a stale token.

Change

- slack.json (~/system/config/slack.json, mode 0600): now holds token (xoxb bot), app_token (xapp), workspace, bot_name.
- slack-bot.js loadSlackTokens() (line ~443): reads slack.json first (SSOT); returns {botToken, appToken} when both are present; otherwise silently falls through to the existing Keychain → vault → env chain. Env-var fallback preserved.
- com.john.slack-bot.plist: SLACK_BOT_TOKEN and SLACK_APP_TOKEN removed from EnvironmentVariables (GROQ/HOME/PATH untouched). plutil -lint OK.
- run-slack-bot-reload.sh: reload wrapper added under ~/system/tools/.

Token rotation procedure (new)

1. Edit ~/system/config/slack.json and update token (xoxb) and/or app_token (xapp).
2. Run: bash ~/system/tools/run-slack-bot-reload.sh

No plist edit.
No risk of stranding the daemon on rotation.

Verification
- plist: grep -c SLACK_*_TOKEN = 0; plutil -lint OK.
- slack.json: mode 0600; keys token/app_token/workspace/bot_name.
- slack-bot.js: node --check SYNTAX_OK; SSOT branch at line 443.
- Daemon: launchctl PID 42749, LastExitStatus 0, stable; log shows "Tokens loaded from slack.json (SSOT)" and "Slack bot started (Socket Mode)".
- Live: slack.js send #ops succeeded (token valid).
- Independent verifier (Company Mesh / eval-Proveo): PASS, mesh-thr-b04409c5-ab59-4ff1-bc24-b163433bd063 .
- til-done: DONE, /tmp/til-done/102830-20260603T142915Z.json .

Security note
This also improves posture: secrets moved out of a (potentially world-readable) LaunchAgent plist into the 0600 slack.json. Token values are never logged (masked).

Bilko CI: integration-test job (Testcontainers), MC #102843, 2026-06-03

Summary
MC #102843 adds an integration-test job to Bilko's .github/workflows/ci.yml . Previously the backend-test job ran only ./gradlew test , and tasks.test does excludeTags("integration") (apps/api/build.gradle.kts:159), so the integrationTest task (Testcontainers/Postgres, includeTags("integration") ) never ran in CI. PRs that broke integration tests passed green (surfaced manually by Proveo during MC #102798).

Change (PR #245, base main, not merged)
- New integration-test job: ubuntu-latest, Java 21, ./gradlew integrationTest --no-daemon in apps/api .
- Testcontainers spins up its own Postgres (no services: block needed).
- Non-blocking for now ( continue-on-error: true , NOT in build needs: ).

Why non-blocking (important)
Running the suite revealed that it is currently broken on main: ~78/1147 integration tests fail (FlywayMigrateException in SettingsServiceRlsTest, ExposedSQLException in VatReportStatutoryGroupingTest, and others). These tests had never run in CI, so making the job a required gate immediately would red-lock every PR. The job is therefore visible on every PR (failures now surface) but does not block merges yet.
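The failure count quoted above can be re-derived from the JUnit XML reports that Gradle writes for a test task. The following is a minimal sketch, not the tooling used in this MC: it assumes the reports sit in Gradle's default location for the task (for example apps/api/build/test-results/integrationTest/ ) and that each file has a testsuite root carrying tests , failures , errors and skipped attributes. The function names and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_suites(xml_documents):
    """Sum the counters found on every <testsuite> element in the given XML strings."""
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for document in xml_documents:
        root = ET.fromstring(document)
        # iter() yields the root itself when it is a <testsuite>,
        # and also covers files that wrap suites in <testsuites>.
        for suite in root.iter("testsuite"):
            for key in totals:
                totals[key] += int(suite.get(key, 0))
    return totals


def summarize_directory(report_dir):
    """Aggregate all TEST-*.xml files in report_dir (assumed Gradle default layout)."""
    files = sorted(Path(report_dir).glob("TEST-*.xml"))
    return summarize_suites(f.read_text(encoding="utf-8") for f in files)


# Hypothetical sample data, only to show the shape of the result.
SAMPLE = [
    '<testsuite name="A" tests="3" failures="1" errors="0" skipped="0"/>',
    '<testsuite name="B" tests="2" failures="0" errors="1" skipped="1"/>',
]

if __name__ == "__main__":
    print(summarize_suites(SAMPLE))
```

Running summarize_directory against the report directory after a CI run gives one line of totals that can be compared between runs without opening the HTML report.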
Path to required gate
Tracked in MC #102874 (H): fix the 78 failing integration tests. Once green, promotion is a small CI change: remove continue-on-error: true and add integration-test to build needs: [lint, unit, backend-test, integration-test] .

Verification
- CI run 26900887524 : integration-test job executed; log shows Task :integrationTest , 1147 tests / 78 failed, Testcontainers Postgres started.
- yaml-lint PASS, actionlint PASS, gitleaks 0, diff = ci.yml only.
- Independent pre-verifier (Company Mesh / Proveo): PASS, mesh-thr-8f34975b-a1ef-4d99-8bd8-cd9f894a022a .
- til-done: DONE, /tmp/til-done/102843-20260603T172857Z.json .

Incident (logged, low severity)
During implementation a build branch was accidentally pushed to origin/main (commit ecf5a97 ) and immediately reverted ( 036e2c6 ). It triggered bilko-stage-auto-deploy twice; both runs were SUCCESS, the change was ci.yml -only (no app artifact change), and stage is non-customer-facing. origin/main was verified clean afterwards. Lesson recorded: build agents must git push -u origin HEAD: and verify upstream ≠ origin/main (a push to Bilko main auto-deploys stage).

Bilko integrationTest suite green: 79->0 failures (MC #102874), 2026-06-03

Summary
MC #102874 took the Bilko backend integrationTest suite from 79 failing → 0 failing (1213 tests, 91 suites). These integration tests had never run in CI ( tasks.test does excludeTags("integration") ); the new CI job from MC #102843 exposed the rot. PR #246 (base main , not merged).

Root-cause clusters fixed
- A. Stale error-envelope assertions (~39): tests asserted legacy "FORBIDDEN" / "NOT_FOUND" bodies, while the app correctly returns RFC7807 with errorCode BILKO-AUTH-003 / BILKO-INV-001 . Tests were updated to assert the strict error codes (stronger, not weaker).
- B. Flyway init cascade (3 init suites): V30_1__ensure_bilko_admin_role.sql ran GRANT bilko_admin TO CURRENT_USER as bilko_admin → self-membership SQLSTATE 0LP01 on newer PG → migration abort.
  Fixed test-side (Testcontainers withUsername → bilko_test , a non-admin user). Production migration left immutable.
- C1. PasswordReset SQL: SET LOCAL cannot take a subquery → resolve orgId via Exposed, then pass the literal UUID.
- C3. Expense semantics: self-approval / unknown-vendor exception types aligned with the documented MC #102746 behavior.
- Final 3 (this session):
  - VatReportStatutoryGroupingTest init error: the invoice_items seed omitted the NOT-NULL line_number → added it to 5 inserts.
  - HrFullPathE2ETest T1 + ImpersonationSessionIntegrationTest T8 (same cause): the ArchiveService.triggerBackgroundBuild fire-and-forget coroutine wrote completed_at to an already-closed Hikari pool → an uncaught ExposedSQLException leaked into unrelated tests. Fixed: catch ExposedSQLException / SQLException -on-closed-pool and log (graceful degrade), rethrow CancellationException . Production-correct hardening.

Assertion-strength (anti-gate-gaming)
Two tests had been widened to multi-code accept lists; they were re-pinned to single deterministic codes:
- InvoiceRoutes TS-INV-01 → exactly 503 MARKET_NOT_AVAILABLE (RS org deterministic).
- BankingRoutes non-existent accountId → exactly 201 (auto-provisions Asset GL).
No @Disabled / @Ignore / excludeTags / deleted tests. Flyway migrations untouched.

Verification
- John forced a fresh run: ./gradlew cleanIntegrationTest integrationTest --rerun-tasks --no-daemon → :integrationTest executed, BUILD SUCCESSFUL 2m31s . XML aggregate 1213 / 0 / 0 / 0 .
- Independent pre-verifier (Company Mesh / Proveo): PASS, mesh-thr-5947b05c-a6c5-4b3b-9ac8-37d7e6a7ea2c .
- til-done: DONE, /tmp/til-done/102874-20260603T193125Z.json .
- PR #246: 36 source files, 0 build artifacts; origin/main untouched.

Process note
Multi-session: one session produced the WIP fix (~76 failures), a John review session affirmed it ("NOT gate-gaming") and flagged 3 items, and this session ran the decisive green run, fixed the last 3 + the 2 flags, and closed.
Earlier in the campaign a build agent accidentally pushed to main (reverted); the lesson was recorded, and this PR was pushed cleanly with an explicit refspec, leaving main untouched.

Downstream (NOT part of this fix)
- Merge PR #246 → main green.
- Then flip the CI gate (MC #102843): remove continue-on-error and add integration-test to build needs: . The gate cannot be flipped before the merge (main is still red until then).

Prometheus Best Practices: USE vs RED

Prometheus Best Practices and Pitfalls
Source: YouTube Learning, Julius Volz (Prometheus co-founder), Swiss Cloud Native Day 2021
Indexed: 2026-06-15 (MC #103620)

USE vs RED: Decision Framework

USE Method (Resource-Oriented Systems)
For infrastructure components (CPU, memory, disk, network):
- Utilization: % busy (0-100%)
- Saturation: degree of queuing (wait time, queue length)
- Errors: error count/rate
When to use: Cloud Run instances, Azure Container Apps, database connections, worker threads, storage volumes.

RED Method (Request-Oriented Systems)
For services handling requests:
- Rate: requests/second
- Errors: failed request count or %
- Duration: latency (p50, p95, p99)
When to use: REST APIs, BFF layers, RPC services, HTTP endpoints.
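To make the three RED signals concrete, here is a minimal sketch that derives them from a batch of observed requests. It is an offline illustration, not how Prometheus computes them: the Request type, the red_summary function, the "status >= 500 counts as an error" rule and the nearest-rank percentile are all assumptions chosen for clarity.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    duration: float    # request latency in seconds


def red_summary(requests, window_seconds):
    """Return Rate, Errors and Duration figures for requests seen in one window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration for r in requests)

    def percentile(q):
        # Nearest-rank percentile: smallest value with at least q of the data at or below it.
        if not durations:
            return None
        rank = max(1, math.ceil(q * len(durations)))
        return durations[rank - 1]

    return {
        "rate": total / window_seconds,                      # R: requests per second
        "error_ratio": errors / total if total else 0.0,     # E: share of failed requests
        "p50": percentile(0.50),                             # D: latency percentiles
        "p95": percentile(0.95),
        "p99": percentile(0.99),
    }


if __name__ == "__main__":
    # Hypothetical traffic: nine fast successes and one slow server error in a 5 s window.
    sample = [Request(200, d / 10) for d in range(1, 10)] + [Request(500, 1.0)]
    print(red_summary(sample, window_seconds=5))
```

In a live service these figures would normally come from a counter and a histogram queried with PromQL rather than from raw request records; the sketch only shows what each letter of RED measures.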
Custom Metrics in Application Code

Best Practices
- Counter for events that only go up (requests, errors, jobs completed)
- Gauge for values that go up/down (active connections, queue size, temperature)
- Histogram for bucketed observations (latency, request size); auto-generates _sum , _count , _bucket
- Summary for client-side quantiles (use histogram + server-side quantiles in PromQL instead)

Common Pitfalls
- High cardinality labels (user IDs, UUIDs, timestamps) → cardinality explosion → OOM
- Missing units in metric names ( http_request_duration vs http_request_duration_seconds )
- Inconsistent naming (mix of snake_case/camelCase)
- Not exposing the /metrics endpoint early in service development
- Using Summary instead of Histogram (histograms aggregate better)

PromQL Essentials

    # Rate of HTTP errors over 5min
    rate(http_requests_total{status=~"5.."}[5m])

    # 95th percentile latency
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

    # CPU utilization (USE)
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Error rate (RED)
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

How This Applies to ALAI

Current Infrastructure
- Grafana: https://grafana.alai.no (monitoring hub)
- Bilko APIs/BFF: Java/Spring Boot → RED metrics for /api/* endpoints
- LumisCare BFF/services: Kotlin/Ktor → RED metrics for REST + USE metrics for connection pools
- Cloud Run / Azure Container Apps: platform exposes USE metrics (CPU, memory, request queue)

Recommended Next Steps
- Instrument Bilko/LumisCare services with Micrometer (on Spring Boot, Actuator exposes the Prometheus endpoint at /actuator/prometheus ; Ktor services need their own scrape route)
- Add RED dashboards for all user-facing APIs (Grafana template: https://grafana.com/grafana/dashboards/4701)
- Add USE dashboards for Cloud Run / ACA resource health
- Alert on SLIs: error rate >1%, p95 latency >2s, CPU >80%

ALAI-Specific Pitfall to Avoid
Do NOT add per-user or per-client labels to core metrics.
Use organization_id buckets (max ~50) or aggregate at service level. High cardinality = Prometheus death.

References
- Prometheus docs: https://prometheus.io/docs/practices/naming/
- USE Method: http://www.brendangregg.com/usemethod.html
- RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- Micrometer + Spring Boot: https://micrometer.io/docs/registry/prometheus
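One possible reading of the " organization_id buckets (max ~50)" advice in the ALAI-specific pitfall is to map each organization onto a fixed, small set of label values before it reaches a metric. The sketch below does that with a stable hash. The source does not say how the buckets are formed, so the hashing scheme, the org_bucket name and the label format are all assumptions.

```python
import hashlib


def org_bucket(organization_id: str, buckets: int = 50) -> str:
    """Map an organization id to one of `buckets` stable label values."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would move organizations between buckets on restart.
    digest = hashlib.sha256(organization_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % buckets
    return f"org_bucket_{index:02d}"


if __name__ == "__main__":
    # Hypothetical ids: the label space stays bounded however many organizations exist.
    labels = {org_bucket(f"org-{i}") for i in range(10_000)}
    print(len(labels))
```

The trade-off is that several organizations share one bucket, so the label bounds cardinality but cannot identify a single tenant; per-tenant investigation would have to rely on logs or traces instead.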