Infrastructure & DevOps
Deployment, CI/CD, monitoring, DevOps/SRE stack, WAF

Deployment & Environment

Environment Setup
Drop Environment Configuration 
 Last updated: 2026-02-13
 Source: src/drop-app/package.json, next.config.ts, Dockerfile, docker-compose.yml, fly.toml
 
 Technology Stack 
 
 
 
| Layer | Technology | Version | Source |
| --- | --- | --- | --- |
| Runtime | Node.js | 22 (Alpine) | Dockerfile:2 |
| Framework | Next.js | 16.1.6 | package.json:14 |
| UI | React | 19.2.3 | package.json:15-16 |
| Database (all environments) | PostgreSQL 16 via Drizzle ORM | drizzle-orm | src/shared/db/schema.ts |
| Auth | JWT via jose | ^6.1.3 | package.json:8 |
| Password hashing | bcryptjs | ^3.0.3 | package.json:5 |
| Styling | Tailwind CSS | ^4 | package.json:33 |
| UI Components | Radix UI | ^1.4.3 | package.json:13 |
| Icons | Lucide React | ^0.563.0 | package.json:9 |
| Theme | next-themes | ^0.4.6 | package.json:10 |
| Toasts | Sonner | ^2.0.7 | package.json:17 |
 
 
 
 Dev Dependencies 
 
 
 
| Tool | Version | Purpose | Source |
| --- | --- | --- | --- |
| Vitest | ^4.0.18 | Unit/integration testing | package.json:36 |
| Playwright | ^1.58.2 | E2E testing | package.json:21 |
| TypeScript | ^5 | Type checking | package.json:35 |
| ESLint | ^9 | Linting | package.json:29 |
| shadcn | ^3.8.4 | UI component generation | package.json:32 |
 
 
 
 
 NPM Scripts 
 Source: src/drop-app/package.json:5-12 
 
 
 
| Script | Command | Description |
| --- | --- | --- |
| dev | next dev | Start development server (port 3000) |
| build | next build | Build for production (standalone output) |
| start | next start | Start production server |
| lint | eslint | Run ESLint |
| test | vitest run | Run unit/integration tests (single run) |
| test:watch | vitest | Run tests in watch mode |
 
 
 
 
 Next.js Configuration 
 Source: src/drop-app/next.config.ts:1-49 
 
 
 
| Setting | Value | Purpose |
| --- | --- | --- |
| output | "standalone" | Self-contained server for Docker (next.config.ts:4) |
| devIndicators | false | Disable dev indicators (next.config.ts:5) |
 
 
 
 Security Headers 
 All responses include these headers (configured in next.config.ts:6-58 ): 
 
 
 
| Header | Value (Production) | Value (Development) | Purpose |
| --- | --- | --- | --- |
| Content-Security-Policy | default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; font-src 'self'; img-src 'self' data: blob:; connect-src 'self'; frame-ancestors 'none' | default-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; font-src 'self'; img-src 'self' data: blob:; connect-src 'self'; frame-ancestors 'none' | XSS and injection protection |
| X-Frame-Options | DENY | DENY | Clickjacking prevention |
| X-Content-Type-Options | nosniff | nosniff | MIME sniffing prevention |
| Referrer-Policy | strict-origin-when-cross-origin | strict-origin-when-cross-origin | Referrer leakage prevention |
| Permissions-Policy | camera=(self), microphone=(), geolocation=(self) | camera=(self), microphone=(), geolocation=(self) | Feature restriction |
| Strict-Transport-Security | max-age=63072000; includeSubDomains; preload | max-age=63072000; includeSubDomains; preload | Force HTTPS |
 
 
 
 Note: CSP is stricter in production: script-src allows neither 'unsafe-inline' nor 'unsafe-eval'. Development mode allows both so that HMR (Hot Module Replacement) works.
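 The two CSP values differ only in the script-src directive. As a minimal sketch of how the header value could be assembled per environment (the helper name `buildCsp` is an assumption for illustration and is not taken from next.config.ts):

```typescript
// Hypothetical helper: assembles the CSP value from the directives listed in
// the Security Headers table. Only script-src changes between environments.
function buildCsp(isDev: boolean): string {
  const scriptSrc = isDev
    ? "script-src 'self' 'unsafe-inline' 'unsafe-eval'" // needed for HMR
    : "script-src 'self'";
  return [
    "default-src 'self'",
    scriptSrc,
    "style-src 'self' 'unsafe-inline'",
    "font-src 'self'",
    "img-src 'self' data: blob:",
    "connect-src 'self'",
    "frame-ancestors 'none'",
  ].join("; ");
}
```

 Keeping the directives in one list makes it harder for the development and production policies to drift apart by accident.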
 
 Environment Modes 
 Development 
 
 NODE_ENV=development (default) 
 Demo user seeded automatically 
 Login page shows demo credentials hint 
 In-memory rate limiting fallback 
 PostgreSQL 16 via Docker ( docker compose up -d ), port 5433 
 
 Production 
 
 NODE_ENV=production 
 Demo seed data disabled 
 JWT_SECRET required (fatal error if missing) 
 Cookies set with secure: true 
 PostgreSQL 16 on AWS RDS via DATABASE_URL 
 
 Test 
 
 NODE_ENV=test 
 PostgreSQL 16 test database ( drop_test ), created via pg-test-db.ts helper 
 Tables truncated between tests; schema pushed via Drizzle before suite runs 
 Mocked Next.js modules (server, headers) 
 
 
 Port Mapping 
 
 
 
| Service | Internal Port | External Port | Protocol |
| --- | --- | --- | --- |
| Drop App | 3000 | 3000 | HTTP |
| PostgreSQL (local dev) | 5432 | 5433 | TCP |
| PostgreSQL (production RDS) | 5432 | 5432 | TCP |
 
 
 
 
 Docker Image Details 
 Base: node:22-alpine 
 User: nextjs (UID 1001)
 Working dir: /app 
 Exposed port: 3000
 Entrypoint: node server.js 
 Build context: src/drop-app/ 
 Image contents (runner stage): 
 
 /app/public/ -- Static assets 
 /app/.next/standalone/ -- Next.js standalone server 
 /app/.next/static/ -- Static build output

Secrets Management
 Last updated: 2026-02-17
 Source: src/drop-app/src/lib/secrets.ts 
 
 Overview 
 Drop uses an abstracted secrets management system with pluggable providers. The system is backward compatible -- if no secrets provider is configured, it reads directly from environment variables (existing behavior). 
 
 Provider Selection 
 The provider is selected automatically based on which environment variables are set: 
 
 
 
| Priority | Condition | Provider | Description |
| --- | --- | --- | --- |
| 1 | DOPPLER_TOKEN set | Doppler | Cloud secrets manager via Doppler API |
| 2 | AWS_SECRET_ARN set | AWS | AWS Secrets Manager (requires AWS SDK) |
| 3 | (default) | env | Reads from process.env |
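 The priority order can be read as a simple first-match rule. A minimal sketch, assuming the provider names shown here (`selectProvider` and the `ProviderName` type are illustrative and not the actual exports of src/lib/secrets.ts):

```typescript
type ProviderName = "doppler" | "aws" | "env";

// Hypothetical stand-in for the auto-detection step: the first matching
// environment variable wins, otherwise fall back to plain process.env reads.
function selectProvider(env: Record<string, string | undefined>): ProviderName {
  if (env.DOPPLER_TOKEN) return "doppler"; // priority 1
  if (env.AWS_SECRET_ARN) return "aws"; // priority 2
  return "env"; // priority 3 (default)
}
```

 Note that when both variables are set, Doppler takes precedence and AWS_SECRET_ARN is ignored.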
 
 
 
 Initialization (call once at app startup): 
 import { initSecrets } from '@/lib/secrets';

// Auto-detect provider based on env vars
initSecrets();

// Optional: custom cache TTL (default 5 minutes)
initSecrets({ ttlMs: 10 * 60 * 1000 }); // 10 minutes
 
 Usage: 
 import { getSecret } from '@/lib/secrets';

const jwtSecret = await getSecret('JWT_SECRET');
const dbUrl = await getSecret('DATABASE_URL');
 
 
 Caching 
 All secret values are cached in memory with a configurable TTL (default: 5 minutes). This reduces API calls to external providers while ensuring secrets are refreshed periodically. 
 
 Cache is cleared on initSecrets() call 
 Cache entries expire individually based on TTL 
 If a provider returns undefined , the system falls back to process.env 
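 The caching rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/lib/secrets.ts: `createSecretCache`, `Fetcher`, and the injectable `now` clock are hypothetical names introduced here.

```typescript
type Fetcher = (key: string) => Promise<string | undefined>;
type Env = Record<string, string | undefined>;

// Hypothetical TTL cache: each entry expires individually, and a provider
// miss (undefined) falls back to the environment map.
function createSecretCache(
  fetchFromProvider: Fetcher,
  env: Env,
  ttlMs: number = 5 * 60 * 1000, // default TTL: 5 minutes
  now: () => number = Date.now,
) {
  const entries = new Map<string, { value: string; expiresAt: number }>();
  return {
    async get(key: string): Promise<string | undefined> {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) return hit.value; // still fresh
      const value = (await fetchFromProvider(key)) ?? env[key];
      if (value !== undefined) {
        entries.set(key, { value, expiresAt: now() + ttlMs });
      }
      return value;
    },
    // Mirrors "cache is cleared on initSecrets() call".
    clear(): void {
      entries.clear();
    },
  };
}
```

 Passing the clock as a parameter is a design choice for testability: expiry can be exercised without waiting for real time to pass.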
 
 
 Rotation Procedures 
 JWT_SECRET 
 Impact: All active user sessions will be invalidated. 
 
 Generate new secret: openssl rand -base64 48 
 Update in secrets provider (Doppler/AWS/env) 
 Call rotateSecret('JWT_SECRET', newValue) or restart the app 
 Users will need to log in again 
 
 Recommended frequency: Every 90 days or after a suspected compromise. 
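 Where openssl is not available, an equivalent secret can be generated from Node.js. A minimal sketch (the function name `generateSecret` is an assumption; `randomBytes` is the standard Node.js crypto API):

```typescript
import { randomBytes } from "node:crypto";

// Same idea as `openssl rand -base64 48`: 48 random bytes, base64-encoded.
// 48 bytes encode to exactly 64 base64 characters (no padding).
function generateSecret(byteLength: number = 48): string {
  return randomBytes(byteLength).toString("base64");
}
```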
 DATABASE_URL (PostgreSQL credentials) 
 Impact: Application loses DB connectivity until updated. 
 
 Create new PostgreSQL credentials 
 Update PostgreSQL user: ALTER USER drop WITH PASSWORD 'new_value'; 
 Update DATABASE_URL in secrets provider with new credentials 
 Restart the application (or call rotateSecret ) 
 
 Recommended frequency: Every 90 days. 
 SENTRY_DSN 
 Status: REMOVED (MC #1271 — Sentry deinstalled) 
 SLACK_WEBHOOK_URL 
 Impact: Alerts stop sending to Slack until updated. 
 
 Create new incoming webhook in Slack workspace 
 Update SLACK_WEBHOOK_URL in secrets provider 
 Restart the application 
 
 Recommended frequency: Only on suspected compromise. 
 Open Banking API Keys 
 Impact: Bank connectivity (AISP/PISP) stops working. 
 
 Regenerate keys in the Open Banking provider dashboard 
 Update the relevant env vars in secrets provider 
 Restart the application 
 Verify bank account connectivity via /api/health 
 
 Recommended frequency: Per provider policy or every 180 days. 
 
 Environment Setup per Provider 
 Environment Variables (Default) 
 No setup required. Set secrets as environment variables: 
 # .env.local (development)
JWT_SECRET=dev-secret-do-not-use-in-production

# Fly.io (used for staging; production runs on AWS App Runner)
fly secrets set JWT_SECRET="$(openssl rand -base64 48)"
fly secrets set DATABASE_URL="postgresql://..."

# Production (Docker)
# Pass via -e flags or docker-compose environment section
 
 Doppler 
 
 Create account at doppler.com 
 Create project "drop" with environments (dev, staging, production) 
 Add all secrets in the Doppler dashboard 
 Generate a service token for each environment 
 Set DOPPLER_TOKEN in your deployment: 
 
 # Fly.io
fly secrets set DOPPLER_TOKEN="dp.st.production.xxxxx"

# Docker (pass as environment variable)
 
 AWS Secrets Manager 
 
 Create a secret in AWS Secrets Manager (JSON format):
 {
 "JWT_SECRET": "your-jwt-secret",
 "DATABASE_URL": "postgresql://...",
 "SLACK_WEBHOOK_URL": "https://..."
}
 
 
 Note the secret ARN 
 Ensure the application has IAM permissions for secretsmanager:GetSecretValue 
 Install the AWS SDK: npm install @aws-sdk/client-secrets-manager 
 Set AWS_SECRET_ARN in your deployment 
 
 
 Audit Trail 
 All secret rotation events are logged to the audit_log table: 
 
 
 
| Field | Value |
| --- | --- |
| action | secret_rotated |
| resource_type | secret |
| resource_id | Secret key name (e.g., JWT_SECRET) |
| details | JSON with provider name and rotation timestamp |
 
 
 
 Query rotation history: 
 SELECT * FROM audit_log
WHERE action = 'secret_rotated'
ORDER BY timestamp DESC;

Deployment Checklist
Deployment Checklist: [PROJECT NAME] 
 Release: v[X.Y.Z]
 Date: YYYY-MM-DD
 Deploy Lead: DevOps
 Approved by: Tech Lead + John
 Environment: Staging → Production 
 
 Pre-Deployment (T-1 Day) 
 Verification 
 
 All tests passing in CI 
 Code review approved and merged 
 UAT sign-off received 
 Release notes prepared 
 No critical/high open bugs 
 
 Preparation 
 
 Database backup completed 
 Staging environment matches production config 
 Rollback procedure tested 
 Stakeholders notified of deployment window 
 On-call person confirmed and available 
 
 Configuration 
 
 Environment variables verified 
 API keys / secrets rotated if needed 
 DNS changes prepared (if applicable) 
 SSL certificates valid (expiry > 30 days) 
 Third-party service limits adequate 
 
 Deployment (T-0) 
 Window 
 
 Allowed: Tue-Thu, 10:00-16:00 
 Never: Fridays 
 Hotfix: Any time during business hours (Tech Lead + John approval) 
 Emergency: Any time (John + Alem approval) 
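 The standard window can be enforced in a pipeline gate. A minimal sketch, assuming the 16:00 boundary is exclusive and that the Date passed in carries the team's local time (`isStandardDeployWindow` is a hypothetical name; hotfix and emergency approvals are out of scope here):

```typescript
// Returns true when `date` falls inside the standard window:
// Tuesday to Thursday, 10:00 up to (but not including) 16:00 local time.
function isStandardDeployWindow(date: Date): boolean {
  const day = date.getDay(); // 0 = Sunday ... 6 = Saturday
  const minutes = date.getHours() * 60 + date.getMinutes();
  const isTueToThu = day >= 2 && day <= 4;
  return isTueToThu && minutes >= 10 * 60 && minutes < 16 * 60;
}
```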
 
 Execution 
 
 Announce deployment start in channel 
 Deploy to staging — verify 
 Run staging smoke tests 
 Manual approval gate — Tech Lead confirms 
 Deploy to production 
 Monitor deployment logs for errors 
 
 Post-Deployment (T+0) 
 Smoke Tests 
 
 Homepage loads correctly 
 Authentication works (login/logout) 
 Core user flow #1 works 
 Core user flow #2 works 
 API health endpoint returns 200 
 No new errors in error tracking or application logs (Sentry was removed, see MC #1271) 
 
 Monitoring (First 30 Minutes) 
 
 Error rate normal (< 1%) 
 Response times normal (p95 < 500ms) 
 No 5xx errors in logs 
 Database connections stable 
 Memory/CPU usage normal 
 
 Communication 
 
 Announce deployment complete 
 Send release notes to stakeholders 
 Update project status 
 
 Rollback Plan 
 Rollback Triggers 
 
 Critical functionality broken 
 Data integrity issues 
 Security vulnerability discovered 
 Error rate > 5% 
 Response time > 3x normal 
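 The two numeric triggers lend themselves to an automated check. A minimal sketch, assuming "response time" means p95 latency compared against a pre-deploy baseline (the `DeployMetrics` shape and `rollbackReasons` are hypothetical; the non-numeric triggers still need human judgment):

```typescript
interface DeployMetrics {
  errorRate: number; // fraction of failed requests, e.g. 0.02 for 2%
  p95Ms: number; // current p95 response time in milliseconds
  baselineP95Ms: number; // p95 before the deployment ("normal")
}

// Returns the numeric rollback triggers that currently fire (empty = none).
function rollbackReasons(m: DeployMetrics): string[] {
  const reasons: string[] = [];
  if (m.errorRate > 0.05) reasons.push("error rate > 5%");
  if (m.p95Ms > 3 * m.baselineP95Ms) reasons.push("response time > 3x normal");
  return reasons;
}
```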
 
 Rollback Procedure 
 
 Announce rollback in channel 
 Revert to previous version 
 Restore database backup (if schema changed) 
 Verify rollback successful 
 Announce rollback complete 
 Create incident report 
 
 Rollback Time Targets 
 
 Application rollback: < 15 minutes 
 Database rollback: < 30 minutes 
 Full rollback: < 1 hour 
 
 Sign-off 
 
 
 
| Role | Name | Pre-Deploy | Post-Deploy |
| --- | --- | --- | --- |
| DevOps | | ☐ | ☐ |
| Tech Lead | | ☐ | ☐ |
| John | | ☐ | ☐ |

DR Runbook
Drop — Disaster Recovery Runbook 
 Infrastructure Overview 
 Production Environment 
 
 Service: AWS App Runner 
 Region: eu-west-1 (Ireland) 
 Service ARN: arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec 
 Service URL: https://9ef3szvvsb.eu-west-1.awsapprunner.com 
 ECR Repository: 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web 
 
 Database 
 
 RDS Instance: drop-db 
 Endpoint: drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432 
 Database Name: dropapp 
 Username: dropuser 
 Backup Strategy: Automated snapshots, 7-day retention 
 Backup Window: 23:24-23:54 UTC daily 
 
 Staging Environment 
 
 Platform: Fly.io 
 App Name: drop-staging 
 Region: arn (Stockholm) 
 Database: PostgreSQL 16 (RDS, eu-north-1, or Docker in CI) 
 
 Domain 
 
 Production: getdrop.no (future) 
 Current: App Runner subdomain 
 
 
 Backup Strategy 
 RDS PostgreSQL (Production) 
 
 Automated Snapshots: Daily at 23:24 UTC 
 Retention Period: 7 days 
 Point-in-Time Recovery: Enabled (5-minute granularity) 
 Manual Snapshots: Created before major changes 
 Storage: Same region (eu-west-1) 
 
 Staging PostgreSQL (RDS) 
 
 Automated Snapshots: Daily, 7-day retention (same config as production) 
 Backup Method: Manual export via flyctl ssh console and pg_dump (PostgreSQL 16 — sqlite3 no longer applies; see ADR-014) 
 Recommended: Export before major changes 
 
 
 Recovery Procedures 
 Scenario 1: App Runner Service Down 
 Symptoms 
 
 Service health checks failing 
 5xx errors from App Runner URL 
 CloudWatch alarms triggered 
 
 Investigation Steps 
 # 1. Check service status
aws apprunner describe-service \
 --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
 --region eu-west-1

# 2. View recent logs (last 10 minutes)
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
 --follow \
 --since 10m \
 --region eu-west-1

# 3. Check deployment history
aws apprunner list-operations \
 --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
 --region eu-west-1
 
 Recovery Actions 
 Option A: Restart Service 
 # Trigger new deployment (no code change)
aws apprunner start-deployment \
 --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
 --region eu-west-1

# Monitor deployment status
aws apprunner describe-service \
 --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
 --query 'Service.Status' \
 --region eu-west-1
 
 Option B: Rollback to Previous Image 
 # 1. List recent ECR images
aws ecr describe-images \
 --repository-name drop-web \
 --region eu-west-1 \
 --query 'sort_by(imageDetails, &imagePushedAt)[-5:]'

# 2. Update service to use previous image tag
# (Manual step: Update .github/workflows/deploy-aws.yml with previous tag and push)

# 3. Or update directly via App Runner console (rollback to previous deployment)
 
 RTO: 5-10 minutes (restart) / 15-20 minutes (rollback) 
 
 Scenario 2: RDS Database Failure 
 Symptoms 
 
 Connection timeouts to drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com 
 Database errors in App Runner logs 
 RDS CloudWatch metrics show instance down 
 
 Investigation Steps 
 # 1. Check RDS instance status
aws rds describe-db-instances \
 --db-instance-identifier drop-db \
 --region eu-west-1 \
 --query 'DBInstances[0].DBInstanceStatus'

# 2. Check for automated snapshots
aws rds describe-db-snapshots \
 --db-instance-identifier drop-db \
 --region eu-west-1 \
 --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-5:]'

# 3. Review recent events
aws rds describe-events \
 --source-identifier drop-db \
 --source-type db-instance \
 --region eu-west-1 \
 --duration 60
 
 Recovery Actions 
 Option A: Restore from Latest Automated Snapshot 
 # 1. Identify latest snapshot
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
 --db-instance-identifier drop-db \
 --region eu-west-1 \
 --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
 --output text)

echo "Latest snapshot: $LATEST_SNAPSHOT"

# 2. Restore to new instance
aws rds restore-db-instance-from-db-snapshot \
 --db-instance-identifier drop-db-restored \
 --db-snapshot-identifier $LATEST_SNAPSHOT \
 --db-instance-class db.t4g.micro \
 --vpc-security-group-ids sg-XXXXX \
 --db-subnet-group-name default \
 --region eu-west-1

# 3. Wait for restore to complete (10-20 minutes)
aws rds wait db-instance-available \
 --db-instance-identifier drop-db-restored \
 --region eu-west-1

# 4. Update DATABASE_URL in App Runner
# (Manual step: Update environment variable via AWS Console or CLI)

# 5. Verify connection
NEW_ENDPOINT=$(aws rds describe-db-instances \
 --db-instance-identifier drop-db-restored \
 --query 'DBInstances[0].Endpoint.Address' \
 --output text \
 --region eu-west-1)

echo "New endpoint: $NEW_ENDPOINT"
 
 Option B: Point-in-Time Recovery 
 # Restore to specific timestamp (e.g., 1 hour ago)
aws rds restore-db-instance-to-point-in-time \
 --source-db-instance-identifier drop-db \
 --target-db-instance-identifier drop-db-pitr \
 --restore-time $(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ') \
 --db-instance-class db.t4g.micro \
 --region eu-west-1

# Wait for restore
aws rds wait db-instance-available \
 --db-instance-identifier drop-db-pitr \
 --region eu-west-1
 
 RPO: 24 hours (snapshot) / 5 minutes (PITR)
 RTO: 30 minutes (snapshot) / 30 minutes (PITR) 
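 The `date -u -d '1 hour ago'` form used for --restore-time is GNU date syntax and does not work with the BSD date shipped on macOS. As a portable alternative, the timestamp can be computed in Node.js. A minimal sketch (`restoreTime` is a hypothetical helper name):

```typescript
// Formats the instant `minutesAgo` minutes before `now` as a UTC timestamp
// without milliseconds, matching the '+%Y-%m-%dT%H:%M:%SZ' format above.
function restoreTime(minutesAgo: number, now: Date = new Date()): string {
  const t = new Date(now.getTime() - minutesAgo * 60_000);
  return t.toISOString().replace(/\.\d{3}Z$/, "Z");
}
```

 For example, `restoreTime(60)` yields the value to pass as --restore-time for a restore to one hour ago.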
 
 Scenario 3: Data Corruption 
 Symptoms 
 
 Application reports data inconsistencies 
 Missing or incorrect records in database 
 User reports of lost data 
 
 Investigation Steps 
 # 1. Connect to RDS and inspect data
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "SELECT COUNT(*) FROM users WHERE deleted_at IS NOT NULL;"

# 2. Check audit_log table for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "SELECT * FROM audit_log WHERE action IN ('DELETE', 'UPDATE') ORDER BY timestamp DESC LIMIT 50;"

# 3. Identify time of corruption
# Review application logs and database query logs
 
 Recovery Actions 
 Option A: Selective Data Restore (if corruption is isolated) 
 # 1. Create temporary snapshot of current state
aws rds create-db-snapshot \
 --db-instance-identifier drop-db \
 --db-snapshot-identifier drop-db-before-restore-$(date +%Y%m%d-%H%M) \
 --region eu-west-1

# 2. Restore clean snapshot to temporary instance
CLEAN_SNAPSHOT="<snapshot-before-corruption>"  # placeholder: replace with the identifier of a snapshot taken before the corruption

aws rds restore-db-instance-from-db-snapshot \
 --db-instance-identifier drop-db-temp \
 --db-snapshot-identifier $CLEAN_SNAPSHOT \
 --db-instance-class db.t4g.micro \
 --region eu-west-1

# 3. Export affected tables from clean instance
pg_dump -h "<temp-endpoint>" \
 -U dropuser \
 -d dropapp \
 -t users \
 -t transactions \
 --data-only \
 > clean_data.sql

# 4. Selectively import into production (after verification)
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 < clean_data.sql

# 5. Terminate temporary instance
aws rds delete-db-instance \
 --db-instance-identifier drop-db-temp \
 --skip-final-snapshot \
 --region eu-west-1
 
 Option B: Full Database Restore (see Scenario 2) 
 RTO: 1-2 hours (selective) / 30 minutes (full restore)
 RPO: Depends on snapshot age 
 
 Scenario 4: Full Region Outage (eu-west-1) 
 Current State 
 
 No automated cross-region failover 
 No replica in secondary region 
 Manual failover required 
 
 Investigation Steps 
 # 1. Check AWS Service Health Dashboard
# https://health.aws.amazon.com/health/status

# 2. Verify RDS snapshots are accessible
aws rds describe-db-snapshots \
 --db-instance-identifier drop-db \
 --region eu-west-1

# 3. Check ECR images (may need to copy to secondary region)
aws ecr describe-images \
 --repository-name drop-web \
 --region eu-west-1
 
 Recovery Actions (Manual Failover to eu-north-1) 
 # 1. Copy latest RDS snapshot to eu-north-1
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
 --db-instance-identifier drop-db \
 --region eu-west-1 \
 --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
 --output text)

aws rds copy-db-snapshot \
 --source-db-snapshot-identifier arn:aws:rds:eu-west-1:324480209768:snapshot:$LATEST_SNAPSHOT \
 --target-db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
 --region eu-north-1

# 2. Restore RDS in eu-north-1
aws rds restore-db-instance-from-db-snapshot \
 --db-instance-identifier drop-db-failover \
 --db-snapshot-identifier drop-db-failover-$(date +%Y%m%d) \
 --db-instance-class db.t4g.micro \
 --region eu-north-1

# 3. Copy ECR image to eu-north-1
# (Manual: create ECR repo in eu-north-1, retag and push latest image)

# 4. Deploy App Runner in eu-north-1
# (Manual: create new App Runner service via console with failover database endpoint)

# 5. Update DNS (when getdrop.no is active)
# Point getdrop.no to new App Runner URL
 
 RTO: 2-4 hours (manual process)
 RPO: Last snapshot before outage (24 hours worst case, 5 minutes with PITR if available) 
 
 Scenario 5: Security Incident 
 Symptoms 
 
 Suspicious database activity 
 Unauthorized access attempts 
 AML alerts triggered 
 STR report filed 
 
 Investigation Steps 
 # 1. Check audit logs for suspicious activity
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "SELECT * FROM audit_log WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"

# 2. Review AML alerts
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "SELECT * FROM aml_alerts WHERE status = 'open' OR created_at > NOW() - INTERVAL '24 hours';"

# 3. Check AWS CloudTrail for API activity
aws cloudtrail lookup-events \
 --lookup-attributes AttributeKey=ResourceName,AttributeValue=drop-db \
 --region eu-west-1 \
 --max-results 50

# 4. Review App Runner access logs
aws logs filter-log-events \
 --log-group-name /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
 --start-time $(date -u -d '24 hours ago' +%s)000 \
 --region eu-west-1
 
 Containment Actions 
 # 1. Revoke compromised sessions
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "UPDATE sessions SET revoked = 1 WHERE user_id IN (SELECT user_id FROM aml_alerts WHERE status = 'open');"

# 2. Temporarily disable affected users
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "UPDATE users SET kyc_status = 'rejected' WHERE id IN (SELECT user_id FROM aml_alerts WHERE severity = 'critical');"

# 3. Rotate database credentials
aws rds modify-db-instance \
 --db-instance-identifier drop-db \
 --master-user-password "<new-password>" \
 --apply-immediately \
 --region eu-west-1

# Update DATABASE_URL in App Runner with new password

# 4. Enable enhanced monitoring
aws rds modify-db-instance \
 --db-instance-identifier drop-db \
 --monitoring-interval 1 \
 --monitoring-role-arn arn:aws:iam::324480209768:role/rds-monitoring-role \
 --region eu-west-1

# 5. Take forensic snapshot
aws rds create-db-snapshot \
 --db-instance-identifier drop-db \
 --db-snapshot-identifier drop-db-incident-$(date +%Y%m%d-%H%M) \
 --region eu-west-1
 
 Investigation & Remediation 
 
 Analyze audit logs: identify the scope of the breach 
 File STR reports: if financial crime is suspected (via str_reports table) 
 Notify Datatilsynet: within 72 hours if personal data was compromised (GDPR Art. 33); notify Finanstilsynet where financial-sector incident reporting applies 
 Update security policies: patch vulnerabilities 
 User communication: notify affected users if required by GDPR 
 
 RTO: Immediate containment (revoke sessions) / 24-48 hours full investigation 
 
 RTO/RPO Targets 
 
 
 
| Scenario | RTO | RPO |
| --- | --- | --- |
| App Runner restart | 5-10 minutes | 0 (no data loss) |
| App Runner rollback | 15-20 minutes | 0 (no data loss) |
| RDS snapshot restore | 30 minutes | 24 hours (last snapshot) |
| RDS PITR restore | 30 minutes | 5 minutes (PITR granularity) |
| Full region failover | 2-4 hours (manual process) | 24 hours (last snapshot) |
| Security incident containment | Immediate | 0 (logs preserved) |
 
 
 
 
 Contacts 
 Primary 
 
 Alem Bašić (CEO): +47 40 47 42 51 
 Email: alem@alai.no 
 
 AI Operations 
 
 John (AI Director): Slack #drop-alerts channel 
 
 External Support 
 
 AWS Support: Premium support via AWS Console 
 Fly.io Support: Email support@fly.io 
 
 
 Runbook Maintenance 
 Review Schedule 
 
 Quarterly review — verify all ARNs, endpoints, and procedures 
 After incidents — update based on lessons learned 
 Before major releases — verify backup and rollback procedures 
 
 Test Schedule 
 
 Annually — full DR drill (restore from snapshot to temporary instance) 
 Quarterly — App Runner restart and rollback tests 
 Monthly — verify snapshot creation and retention 
 
 Change Log 
 
 
 
| Date | Change | Author |
| --- | --- | --- |
| 2026-02-18 | Initial version created | Builder 3 (AI) |
 
 
 
 
 Appendix: Useful Commands 
 Quick Health Check 
 # Check App Runner status
aws apprunner describe-service \
 --service-arn arn:aws:apprunner:eu-west-1:324480209768:service/drop-web/8e45b0d335304487a1880f4e32d6aeec \
 --query 'Service.Status' \
 --output text \
 --region eu-west-1

# Check RDS status
aws rds describe-db-instances \
 --db-instance-identifier drop-db \
 --query 'DBInstances[0].DBInstanceStatus' \
 --output text \
 --region eu-west-1

# Check latest snapshot age
aws rds describe-db-snapshots \
 --db-instance-identifier drop-db \
 --region eu-west-1 \
 --query 'DBSnapshots[?SnapshotType==`automated`] | sort_by(@, &SnapshotCreateTime)[-1].SnapshotCreateTime' \
 --output text
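 The snapshot timestamp returned by the last command can be compared against the 24-hour snapshot RPO. A minimal sketch, assuming the CLI output is an ISO 8601 string that `Date` can parse (`snapshotAgeHours` and `meetsSnapshotRpo` are hypothetical helper names):

```typescript
// Age of a snapshot in hours, given its SnapshotCreateTime as an ISO 8601 string.
function snapshotAgeHours(snapshotCreateTime: string, now: Date = new Date()): number {
  return (now.getTime() - new Date(snapshotCreateTime).getTime()) / 3_600_000;
}

// True when the latest snapshot is recent enough for the stated RPO.
function meetsSnapshotRpo(
  snapshotCreateTime: string,
  now: Date = new Date(),
  rpoHours: number = 24,
): boolean {
  return snapshotAgeHours(snapshotCreateTime, now) <= rpoHours;
}
```

 A monthly snapshot verification could fail fast when `meetsSnapshotRpo` returns false, since that means the daily backup window was missed.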
 
 Database Connection Test 
 # Test connection from local machine
psql -h drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com \
 -U dropuser \
 -d dropapp \
 -c "SELECT 1;"
 
 Log Streaming 
 # Stream App Runner application logs
aws logs tail /aws/apprunner/drop-web/8e45b0d335304487a1880f4e32d6aeec/application \
 --follow \
 --region eu-west-1

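# Download the most recent portion of the RDS PostgreSQL log (one-off download, not a live stream).
# List the available file names first with:
#   aws rds describe-db-log-files --db-instance-identifier drop-db --region eu-west-1
aws rds download-db-log-file-portion \
 --db-instance-identifier drop-db \
 --log-file-name error/postgresql.log \
 --output text \
 --region eu-west-1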

Deployment Guide
Drop Deployment Guide 
 Last updated: 2026-03-03
 Source: src/drop-app/Dockerfile, docker-compose.yml, DOCKER.md
 
 NOTE (2026-03-03): This document was updated for ADR-014 (PostgreSQL-only). The SQLite
single-container deployment and better-sqlite3 native dependency have been removed.
Current deployment: Docker + PostgreSQL 16 (dev), AWS App Runner + RDS (production). 
 
 
 Architecture Overview 
 Drop uses a multi-stage Docker build producing a minimal Node.js 22 Alpine production image. The application is a Next.js 16 standalone server. 
 Build stages (from Dockerfile:1-41 ): 
 
 
 
| Stage | Base | Purpose |
| --- | --- | --- |
| deps | node:22-alpine | Install node_modules via npm ci. |
| builder | node:22-alpine | Copy deps + source, run npm run build (Next.js standalone output). |
| runner | node:22-alpine | Minimal production image. Copies only public/, .next/standalone/, .next/static/. |
 
 
 
 Security features in the runner stage ( Dockerfile:25-26 ): 
 
 Non-root user: nextjs (UID 1001, GID 1001) 
 Data directory /app/data owned by nextjs:nodejs 
 No build tools or source code in production image 
 
 
 Deployment Configurations 
 1. Local Development -- docker-compose.yml 
 PostgreSQL 16 + Drop app (ADR-014). 
 File: src/drop-app/docker-compose.yml:1-22 
services:
  drop-app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - JWT_SECRET=${JWT_SECRET:?JWT_SECRET is required}
      - NODE_ENV=production
      - NEXT_PUBLIC_SERVICE_MODE=mock
    volumes:
      - drop_data:/app/data
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    restart: unless-stopped
 
 Quick start: 
 export JWT_SECRET="your-secure-random-string-min-32-chars"
docker compose up -d
 
 Data persistence: PostgreSQL data stored in Docker volume drop_pgdata . 
 2. Production (PostgreSQL) -- docker-compose.production.yml 
 Multi-container setup with separate PostgreSQL 16 database. 
 File: src/drop-app/docker-compose.production.yml:1-38 
services:
  drop-app:
    build: .
    ports:
      - "3000:3000"
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=drop
      - POSTGRES_USER=drop
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-drop_local_dev}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U drop"]
      interval: 10s
      timeout: 5s
      retries: 5
 
 Quick start: 
 export JWT_SECRET="your-secure-random-string-min-32-chars"
export POSTGRES_PASSWORD="secure-postgres-password"
docker compose -f docker-compose.production.yml up -d
 
 3. Fly.io Staging -- fly.toml 
 File: src/drop-app/fly.toml:1-28 
 
 
 
| Setting | Value |
| --- | --- |
| App name | drop-staging |
| Region | arn (Stockholm, closest to Norway) |
| Internal port | 3000 |
| Force HTTPS | true |
| Auto-stop machines | stop (scales to zero) |
| Auto-start machines | true |
| Min machines | 0 |
| Persistent storage | Volume drop_data mounted at /app/data |
 
 
 
 Health check: GET /api/health every 30s, 5s timeout, 10s grace period. 
 
 Environment Variables 
 
 
 
| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| JWT_SECRET | Yes (production) | Dev: process.cwd() hash | JWT signing secret. Minimum 32 characters. Fatal error if missing in production. |
| NODE_ENV | No | development | Set to production in containers. Controls seed data gating. |
| NEXT_PUBLIC_SERVICE_MODE | No | - | Set to mock for MVP mode (no external API calls). |
| DATABASE_URL | Yes | - | PostgreSQL 16 connection string. Required in all environments. Local dev: postgresql://drop:dev_only_not_a_secret@localhost:5433/drop_dev |
| POSTGRES_PASSWORD | Production only | drop_local_dev | PostgreSQL password (production compose). |
| PORT | No | 3000 | HTTP server port. |
| HOSTNAME | No | 0.0.0.0 | Server bind address. |
 
 
 
 Database: PostgreSQL 16 is required in all environments. There is no SQLite fallback (ADR-014). 
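 The "Required" column can be turned into a startup check. A minimal sketch covering only the two hard requirements from the table (`validateEnv` is a hypothetical helper and not part of the Drop codebase as documented here):

```typescript
// Returns a list of configuration errors; an empty list means the
// environment satisfies the documented requirements.
function validateEnv(env: Record<string, string | undefined>): string[] {
  const errors: string[] = [];
  const isProd = env.NODE_ENV === "production";
  if (!env.DATABASE_URL) {
    errors.push("DATABASE_URL is required in all environments");
  }
  if (isProd && !env.JWT_SECRET) {
    errors.push("JWT_SECRET is required in production");
  }
  if (env.JWT_SECRET && env.JWT_SECRET.length < 32) {
    errors.push("JWT_SECRET must be at least 32 characters");
  }
  return errors;
}
```

 Collecting all errors before failing, rather than throwing on the first one, lets a misconfigured deployment be fixed in a single pass.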
 
 Health Check 
 Endpoint: GET /api/health 
 Source: src/drop-app/src/app/api/health/route.ts:1-35 
 The health check performs a real database query ( SELECT 1 as ok ) and reports latency. 
 Success response (200): 
 {
 "status": "ok",
 "version": "0.1.0",
 "uptime": 123,
 "db": "connected",
 "dbLatencyMs": 5,
 "timestamp": "2026-02-13T12:00:00.000Z"
}
 
 Failure response (503): 
 {
 "status": "error",
 "db": "disconnected",
 "timestamp": "..."
}
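 The two response shapes can be modelled as a discriminated union so that callers handle both cases explicitly. A minimal sketch based on the JSON examples above (the type names, `isHealthy`, and the 100 ms latency threshold are assumptions, not values from route.ts):

```typescript
interface HealthOk {
  status: "ok";
  version: string;
  uptime: number;
  db: "connected";
  dbLatencyMs: number;
  timestamp: string;
}

interface HealthError {
  status: "error";
  db: "disconnected";
  timestamp: string;
}

type HealthResponse = HealthOk | HealthError;

// Treats the service as healthy only on HTTP 200, status "ok", and a
// database latency at or below the (assumed) threshold.
function isHealthy(httpStatus: number, body: HealthResponse, maxDbLatencyMs: number = 100): boolean {
  return httpStatus === 200 && body.status === "ok" && body.dbLatencyMs <= maxDbLatencyMs;
}
```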
 
 
 Building from Source 
 # Build Docker image
docker build -t drop-app .

# Run standalone container
docker run -d \
 -p 3000:3000 \
 -e JWT_SECRET="your-secret-min-32-chars" \
 -e DATABASE_URL="postgresql://..." \
 -v drop_data:/app/data \
 --name drop-app \
 drop-app
 
 
 Data Backup and Restore 
 Production Backups (AWS RDS) 
 Production database is PostgreSQL 16 on AWS RDS. Backups are managed by AWS: 
 
 Automated backups: Daily snapshots, 7-day retention (configured in RDS) 
 Point-in-time recovery: Available within the 7-day retention window 
 Manual snapshot: Via AWS Console or CLI before major deployments 
 
 Create a manual RDS snapshot before deployments: 
 aws rds create-db-snapshot \
 --db-instance-identifier drop-db \
 --db-snapshot-identifier drop-pre-deploy-$(date +%Y%m%d-%H%M%S) \
 --region eu-west-1
 
 Restore from snapshot: Via AWS Console → RDS → Snapshots → Restore. 
 Local Dev Backups (Docker) 
 Local development data in the drop_pgdata Docker volume is disposable. Recreate with: 
 docker compose down -v # Remove volume (deletes local data)
docker compose up -d
make db-push && npm run db:seed
 
 Backup Verification 
 Verify production database connectivity and integrity: 
 # Check health endpoint
curl https://your-app-runner-url/api/health

# Connect to RDS (requires VPN or bastion)
psql $DATABASE_URL -c "SELECT COUNT(*) FROM users;"
 
 
 Demo User 
 In non-production mode ( NODE_ENV !== 'production' ), a demo user is seeded: 
 
 
 
| Field | Value |
| --- | --- |
| Email | amir@example.com |
| Password | demo1234 |
| Role | merchant |
 
 
 
 Source: Drizzle seed script in src/shared/db/seed.ts . Gated behind NODE_ENV !== 'production' . 
 
 Troubleshooting 
 Container won't start: 
 docker compose logs
docker compose exec drop-app env | grep JWT_SECRET
 
 Database connection issues: 
 # Check PostgreSQL container is running
docker compose ps

# Test connection
docker compose exec db psql -U drop -d drop_dev -c "SELECT COUNT(*) FROM users;"

# Check app DATABASE_URL is set correctly
docker compose exec drop-app env | grep DATABASE_URL
 
 Permission denied: 
 docker compose down -v # Remove volumes
docker compose up -d # Recreate with correct permissions
 
 Cleanup: 
 docker compose down # Stop containers
docker compose down -v # Stop + remove volumes (WARNING: deletes data)
docker rmi drop-app # Remove image

CI/CD & Monitoring

CI/CD Pipeline
Drop CI/CD Pipeline 
 Last updated: 2026-02-13
 Source: src/drop-app/package.json , Dockerfile , fly.toml , vitest.config.ts , playwright.config.ts 
 
 Current State 
 Drop is at the MVP/pre-production stage. The CI side of the pipeline is in place, including a GitHub Actions workflow; automated deployment is not yet implemented. 
 What exists: 
 
 GitHub Actions CI workflow ( .github/workflows/ci.yml ) with 5 jobs: lint-and-typecheck, test, build, e2e, docker-build 
 Dockerfile with multi-stage build ( Dockerfile:1-63 ) 
 docker-compose for local and production ( docker-compose.yml , docker-compose.production.yml ) 
 Fly.io deployment config ( fly.toml ) 
 Vitest unit/integration test framework ( vitest.config.ts ) 
 Playwright E2E test framework ( playwright.config.ts ) 
 Health check endpoint ( /api/health ) 
 QA report generation via scripts/qa-report.js (automated in CI) 
 
 What does not exist yet: 
 
 Automated deployment pipeline (CI builds but does not deploy) 
 Container registry integration 
 Automated security scanning (npm audit, Snyk) 
 Test coverage reporting 
 Staging environment (Fly.io config exists but not deployed) 
 
 
 Build Pipeline 
 Step 1: Install Dependencies 
 npm ci
 
 Installs exact versions from package-lock.json . 
 Step 2: Lint 
 npm run lint # eslint
 
 Step 3: Type Check 
 npx tsc --noEmit
 
 Step 4: Unit + Integration Tests 
 npm test # vitest run
 
 Runs all tests in tests/**/*.test.ts (from vitest.config.ts:7 ). Test setup: tests/setup.ts sets NODE_ENV=test . 
 Step 5: Build 
 npm run build # next build
 
 Produces standalone output for Docker deployment. 
 Step 6: Docker Build 
 docker build -t drop-app .
 
 Multi-stage build: deps -> builder -> runner. 
 Step 7: E2E Tests (requires running server) 
 npx playwright test
 
 Requires dev server on http://localhost:3000 . Playwright auto-starts it via webServer config. 
 
 Test Framework Configuration 
 Vitest (Unit + Integration) 
 Config: src/drop-app/vitest.config.ts:1-15 
 
 
 
 | Setting | Value |
 | --- | --- |
 | Environment | node |
 | Include | tests/**/*.test.ts |
 | Setup | tests/setup.ts |
 | Path alias | @ -> ./src |
 
 
 
 Playwright (E2E) 
 Config: src/drop-app/playwright.config.ts:1-39 
 
 
 
 | Setting | Value |
 | --- | --- |
 | Test dir | ./tests/e2e |
 | Parallel | false (serial -- rate limiter is shared) |
 | Workers | 1 |
 | Retries (CI) | 2 |
 | Timeout | 30,000 ms |
 | Base URL | http://localhost:3000 |
 | Reporter | HTML |
 | Trace | on-first-retry |
 
 
 
 Test projects: 
 
 user-flows -- Basic user journey tests ( user-flows.spec.ts ) 
 full-flows -- Complete feature journeys ( full-flows.spec.ts ) 
 input-chaos -- Malicious/edge-case input testing ( input-chaos.spec.ts ). Depends on user-flows . 
 
 Web server config: Auto-starts npm run dev for E2E tests. Reuses existing server if running. 30s timeout. 
 
 Deployment Targets 
 Fly.io (Staging) 
 Config: fly.toml:1-28 
 # Deploy to Fly.io staging
fly deploy

# Set secrets
fly secrets set JWT_SECRET="your-secret"
fly secrets set NEXT_PUBLIC_SERVICE_MODE="mock"
 
 Region: arn (Stockholm)
 Auto-scaling: Scales to 0 when idle, auto-starts on request. 
 Docker (Self-hosted) 
 # Local dev (PostgreSQL 16 via Docker)
docker compose up -d

# Apply schema
make db-push
 
 
 Existing GitHub Actions CI Workflow 
 File: .github/workflows/ci.yml 
 Triggers on push/PR to main or master : 
 Jobs:
 1. lint-and-typecheck — npm ci, npm run lint, tsc --noEmit
 2. test — npm ci, npm test --if-present (depends on lint-and-typecheck)
 3. build — npm ci, npm run build with JWT_SECRET placeholder (depends on lint-and-typecheck)
 4. e2e — npm ci, npx playwright install chromium, npm run build, npm run start (production mode), npx playwright test user-flows + full-flows, generate QA report, upload artifacts (depends on build)
 5. docker-build — docker build -t drop-app:ci (depends on test + build + e2e)
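 To make the job ordering above easier to see, the sketch below derives execution "waves" from the dependency list. The job names and dependencies are taken from the list above; the needs map and the executionWaves helper are illustrative and do not appear in ci.yml.

```typescript
// Job dependencies as listed above; the scheduling code is illustrative only.
const needs: Record<string, string[]> = {
  "lint-and-typecheck": [],
  test: ["lint-and-typecheck"],
  build: ["lint-and-typecheck"],
  e2e: ["build"],
  "docker-build": ["test", "build", "e2e"],
};

/** Groups jobs into waves: each job's dependencies all sit in earlier waves. */
function executionWaves(graph: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const waves: string[][] = [];
  const jobs = Object.keys(graph);
  while (done.size < jobs.length) {
    // A job is ready once every dependency finished in a previous wave.
    const ready = jobs.filter(
      (job) => !done.has(job) && (graph[job] ?? []).every((dep) => done.has(dep)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    ready.forEach((job) => done.add(job));
    waves.push(ready);
  }
  return waves;
}
```

 With the dependencies listed, test and build can run in parallel once lint-and-typecheck passes, while e2e and docker-build run strictly afterwards, so the critical path is lint-and-typecheck, build, e2e, docker-build.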
 
 Artifacts uploaded: 
 
 playwright-report/ — Playwright HTML report (7 day retention) 
 qa-report.html — QA metrics report (pass/fail, execution time) 
 
 Not yet implemented: 
 
 Security scan (npm audit, Snyk) 
 Deploy to staging (fly deploy) 
 Deploy to production (manual approval gate) 
 
 Status: Full CI pipeline including E2E tests in place. CD deployment tracked in security hardening checklist ( security/hardening-checklist.md:120-126 ).

Monitoring & Alerting
Drop Monitoring 
 Last updated: 2026-02-17
 Source: src/drop-app/src/app/api/health/route.ts , docker-compose.yml , fly.toml , src/lib/alerts.ts 
 
 Health Check Endpoint 
 Route: GET /api/health 
 Source: src/drop-app/src/app/api/health/route.ts:1-35 
 What It Checks 
 
 Database connectivity -- Executes SELECT 1 as ok against the database 
 Database latency -- Measures query execution time in milliseconds 
 Database driver -- Reports pg (PostgreSQL 16 via Drizzle ORM) 
 Service mode -- Reports NEXT_PUBLIC_SERVICE_MODE ( mock or live ) 
 Application uptime -- Tracks seconds since server start 
 Application version -- Reads from npm_package_version env var, defaults to 0.1.0 
 
 Status Values 
 
 ok -- All checks pass (HTTP 200) 
 degraded -- DB query returned unexpected result (HTTP 200) 
 down -- DB unreachable (HTTP 503) 
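 The mapping from probe outcome to status and HTTP code can be sketched as below. This is a sketch of the rules listed above, not the code in route.ts; DbProbe and resolveHealth are hypothetical names.

```typescript
type HealthStatus = "ok" | "degraded" | "down";

// Hypothetical summary of what the SELECT 1 probe observed.
interface DbProbe {
  reachable: boolean; // false when the query threw (connection refused, timeout, ...)
  returnedExpected: boolean; // true when the probe query returned the expected row
}

/** Maps a probe outcome to the documented status value and HTTP code. */
function resolveHealth(probe: DbProbe): { status: HealthStatus; httpCode: 200 | 503 } {
  if (!probe.reachable) return { status: "down", httpCode: 503 };
  if (!probe.returnedExpected) return { status: "degraded", httpCode: 200 };
  return { status: "ok", httpCode: 200 };
}
```

 Note that degraded still answers with HTTP 200, so a monitor that only checks the status code cannot tell it apart from ok; the response body has to be inspected.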
 
 Response Format 
 Healthy (200 OK): 
{
  "data": {
    "status": "ok",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
 
 Down (503 Service Unavailable): 
{
  "data": {
    "status": "down",
    "version": "0.1.0",
    "uptime": 3600,
    "checks": {
      "db": { "status": "fail" },
      "services": { "mode": "live" }
    },
    "timestamp": "2026-02-17T12:00:00.000Z"
  }
}
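 For consumers of this endpoint, the two payloads above suggest a shape like the following. The HealthResponse interface is inferred from the examples only (optional fields are a guess based on the down payload omitting them), and summarizeHealth is a hypothetical helper.

```typescript
// Shape inferred from the example payloads above; not exported by the app.
interface HealthResponse {
  data: {
    status: "ok" | "degraded" | "down";
    version: string;
    uptime: number;
    checks: {
      db: { status: "pass" | "fail"; latencyMs?: number; driver?: string };
      services: { mode: string };
    };
    timestamp: string;
  };
}

/** Produces a one-line summary from a raw /api/health response body. */
function summarizeHealth(rawBody: string): string {
  const body = JSON.parse(rawBody) as HealthResponse;
  // The status lives under the "data" wrapper, not at the top level.
  const { status, checks } = body.data;
  const latency = checks.db.latencyMs !== undefined ? `${checks.db.latencyMs}ms` : "n/a";
  return `status=${status} db=${checks.db.status} latency=${latency}`;
}
```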
 
 
 Container Health Checks 
 Docker Compose (MVP) 
 Source: docker-compose.yml:12-17 
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
 
 Docker Compose (Production) 
 Source: docker-compose.production.yml:9-14 
 Same health check configuration as MVP. Additionally, PostgreSQL has its own health check: 
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U drop"]
  interval: 10s
  timeout: 5s
  retries: 5
 
 The drop-app service depends on PostgreSQL being healthy before starting ( depends_on.postgres.condition: service_healthy ). 
 Fly.io 
 Source: fly.toml:19-23 
 [[http_service.checks]]
 grace_period = "10s"
 interval = "30s"
 method = "GET"
 path = "/api/health"
 timeout = "5s"
 
 Fly.io uses this health check to determine machine readiness and to route traffic. 
 
 Current Monitoring State 
 What Exists 
 
 Health check endpoint with real database verification (not hardcoded) 
 Container-level health checks (Docker + Fly.io) 
 Automatic restart on failure ( restart: unless-stopped in docker-compose) 
 Auto-scaling on Fly.io (scale to zero, auto-start on request) 
 
 What Does Not Exist Yet 
 
 External uptime monitoring service (see the BetterStack section below for the recommended configuration) 
 Application Performance Monitoring (APM) 
 Structured logging (JSON format) 
 Log aggregation and forwarding 
 Database performance monitoring 
 Rate limit monitoring/metrics 
 Business metrics dashboard (transactions per hour, success rate) 
 
 
 Sentry Error Tracking 
 Status: REMOVED (MC #1271, Sentry uninstalled) 
 
 Slack Alerting 
 Status: Implemented (MC #1183)
 Source: src/lib/alerts.ts , instrumentation.ts 
 Features 
 
 Operational alerts sent to Slack webhook 
 10-minute cooldown per alert title (prevents spam) 
 Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical) 
 Graceful degradation when webhook URL not set (dev mode) 
 
 Setup Instructions 
 
 Create incoming webhook in Slack workspace:
 
 Go to Slack App Directory → Incoming Webhooks 
 Choose channel (e.g., #ops or #alerts ) 
 Copy webhook URL 
 
 
 Set environment variable:
 # .env.local (server-side secret)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
 
 
 
 Required Environment Variable 
 
 
 
 | Variable | Required | Description |
 | --- | --- | --- |
 | SLACK_WEBHOOK_URL | Yes (production) | Slack incoming webhook URL |
 
 
 
 Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack. 
 Alert Types and Severities 
 
 
 
 | Severity | Emoji | Use Case |
 | --- | --- | --- |
 | info | ℹ️ | Application startup, normal operations |
 | warning | ⚠️ | Degraded performance, non-critical issues |
 | critical | 🚨 | Service outages, data loss, security incidents |
 
 
 
 Cooldown Behavior 
 
 Each alert title has a 10-minute cooldown 
 Same title sent within 10 minutes → skipped (prevents spam) 
 Different titles → sent immediately (independent tracking) 
 Cooldown resets on app restart (in-memory tracking) 
 
 Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05. 
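 The cooldown rules above can be sketched with an in-memory map keyed by alert title. This is an illustration under stated assumptions, not the code in src/lib/alerts.ts: AlertCooldown and shouldSend are hypothetical names, the clock is passed in to keep the example testable, and it assumes a skipped alert does not extend the cooldown.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes, as documented above

class AlertCooldown {
  // Last send time per alert title; lives in memory, so it resets on restart.
  private lastSent = new Map<string, number>();

  /** Returns true if an alert with this title may be sent at `now` (ms since epoch). */
  shouldSend(title: string, now: number): boolean {
    const previous = this.lastSent.get(title);
    if (previous !== undefined && now - previous < COOLDOWN_MS) {
      return false; // same title within the cooldown: skip
    }
    this.lastSent.set(title, now);
    return true;
  }
}
```

 Keying on the title keeps unrelated alerts independent, at the cost that two different failures sharing a title are throttled together.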
 Usage in Code 
 import { sendAlert } from '@/lib/alerts';

// Basic alert
await sendAlert({
 severity: 'critical',
 title: 'Database connection failed',
 message: 'PostgreSQL unreachable after 3 retries',
});

// Warning-level alert (same shape, lower severity)
await sendAlert({
 severity: 'warning',
 title: 'High error rate detected',
 message: '15 errors in last 5 minutes',
});
 
 Current Integrations 
 
 App startup: Sends info alert when server starts ( instrumentation.ts ) 
 App shutdown: Sends info alert on SIGTERM/SIGINT ( instrumentation.ts ) 
 Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds ( src/lib/alerts.ts:trackError ) 
 Unhandled exceptions: Logged and tracked via process event handlers ( instrumentation.ts ) 
 
 Error Spike Detection 
 The alerting system automatically detects error spikes using a rolling window approach: 
 How it works: 
 
 Every server error (HTTP 5xx) is tracked via trackError() 
 Maintains rolling 1-minute window of error timestamps 
 When count exceeds threshold (5 errors in 60 seconds), sends critical alert 
 Integrates with middleware error handling 
 
 Threshold: 5 errors within 60 seconds
 Alert severity: Critical (🚨)
 Implementation: src/lib/alerts.ts:trackError() , wired into src/lib/middleware.ts:jsonError() 
 Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters. 
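 A rolling-window counter of this kind can be sketched as follows. It is a minimal illustration of the technique rather than the actual trackError() implementation; ErrorSpikeDetector and record are hypothetical names, and the threshold is read as "more than 5 errors in the window".

```typescript
const WINDOW_MS = 60 * 1000; // rolling 1-minute window
const THRESHOLD = 5; // a spike is more than 5 errors inside the window

class ErrorSpikeDetector {
  private timestamps: number[] = [];

  /** Records one error at `now` (ms) and reports whether the window count exceeds the threshold. */
  record(now: number): boolean {
    this.timestamps.push(now);
    // Keep only the errors that are still inside the rolling window.
    this.timestamps = this.timestamps.filter((t) => now - t < WINDOW_MS);
    return this.timestamps.length > THRESHOLD;
  }
}
```

 In practice the boolean would feed a critical alert, and the alert cooldown keeps a sustained spike from producing one message per error.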
 
 BetterStack Uptime Monitoring 
 Status: Ready to configure (setup guide available)
 Documentation: BETTERSTACK-SETUP.md 
 Overview 
 BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures. 
 Free tier includes: 
 
 10 monitors (enough for Drop production) 
 3-minute check interval 
 Unlimited integrations (Slack, email) 
 Public status page 
 SSL expiry monitoring 
 
 Recommended Monitors 
 
 
 
 | Monitor | URL | Purpose | Expected Response |
 | --- | --- | --- | --- |
 | Health Endpoint | https://drop.alai.no/api/health | API + DB connectivity | 200, body contains "status":"ok" |
 | Landing Page | https://drop.alai.no | Public website | 200, body contains Send penger |
 | Multi-Region Check | https://drop.alai.no/api/health | Geographic availability | 200, body contains "status":"ok" |
 
 
 
 Alert Escalation 
 BetterStack sends alerts through multiple channels: 
 Minute 0: Alert fires → Slack #drop-ops (immediate)
Minute 5: Still down → Email to alem@alai.no
Minute 15: Still down → SMS (requires paid plan)
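 Read as data, the escalation ladder above looks like the sketch below. The step delays and channels come from the timeline; EscalationStep and channelsNotified are hypothetical names used only to illustrate which channels have fired after a given outage duration.

```typescript
interface EscalationStep {
  afterMinutes: number; // minutes since the alert fired
  channel: string;
}

// Steps as documented above; SMS needs a paid plan.
const ESCALATION: EscalationStep[] = [
  { afterMinutes: 0, channel: "Slack #drop-ops" },
  { afterMinutes: 5, channel: "Email" },
  { afterMinutes: 15, channel: "SMS" },
];

/** Lists the channels that have been notified once an incident has lasted `minutesDown`. */
function channelsNotified(minutesDown: number): string[] {
  return ESCALATION.filter((step) => minutesDown >= step.afterMinutes).map((step) => step.channel);
}
```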
 
 Status Page 
 Public status page shows real-time service status: 
 
 URL: https://drop-status.betteruptime.com 
 Components: API Health, Landing Page, Global Network 
 Auto-updates: Incidents automatically posted and resolved 
 Subscriptions: Users can subscribe to email updates 
 
 Setup Instructions 
 Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md 
 Setup includes: 
 
 Account creation (free tier) 
 Configure 3 monitors (health, landing, multi-region) 
 Slack integration ( #drop-ops channel) 
 On-call schedule and escalation policy 
 Public status page creation 
 Testing and verification 
 
 Key Features 
 Proactive monitoring: 
 
 3-minute check interval (free tier) or 30s (paid) 
 Keyword verification (not just HTTP 200) 
 SSL certificate expiry warnings (14 days) 
 Multi-region checks (detect geographic issues) 
 
 Incident management: 
 
 Automatic incident creation on downtime 
 Status page updates (public transparency) 
 Escalation to multiple channels (Slack → Email → SMS) 
 Maintenance window support (suppress alerts during deployments) 
 
 Reporting: 
 
 Uptime SLA tracking (99.9% target) 
 Incident history and analysis 
 Response time graphs 
 Downtime duration reports 
 
 Integration with Drop Alerting 
 BetterStack complements Drop's internal alerting ( src/lib/alerts.ts ): 
 
 
 
 | Feature | Drop Internal Alerts | BetterStack External |
 | --- | --- | --- |
 | Detects | Application errors, error spikes | Infrastructure outages |
 | When | App is running | App is unreachable |
 | Source | Application logs | External HTTP checks |
 | Delivery | Slack webhook (direct) | Escalation policy |
 | Use case | Code bugs, DB issues | Container crashes, network failures |
 
 
 
 Example: Database connection fails: 
 
 Drop internal alert: "Database connection failed" → Slack #drop-ops (immediate) 
 BetterStack: Health check returns 503 → Slack #drop-ops + Email after 5 min 
 
 Maintenance Windows 
 When performing planned maintenance (deployments, upgrades): 
 
 Create maintenance window in BetterStack 
 Select affected monitors 
 Set duration (e.g., 1 hour) 
 Effect: Alerts suppressed, status page shows "Scheduled Maintenance" 
 
 Prevents: False downtime alerts during intentional service interruptions. 
 Best Practices 
 Do's: 
 
 ✅ Test alerts monthly (pause monitor to verify escalation) 
 ✅ Use keyword checks (not just HTTP status codes) 
 ✅ Monitor SSL expiry (14-day warnings) 
 ✅ Create maintenance windows for deployments 
 ✅ Review incident history monthly 
 
 Don'ts: 
 
 ❌ Don't ignore degraded status (investigate even if not fully down) 
 ❌ Don't disable monitors (use pause for temporary suppression) 
 ❌ Don't skip keyword checks (HTTP 200 ≠ working API) 
 ❌ Don't rely solely on external monitoring (combine with internal checks) 
 
 
 External Uptime Monitoring (Alternative: UptimeRobot) 
 Status: Alternative to BetterStack (not recommended) 
 BetterStack is recommended over UptimeRobot for Drop because: 
 
 Better Slack integration (richer notifications) 
 Built-in status page (UptimeRobot charges extra) 
 Better UI/UX for incident management 
 More flexible escalation policies 
 
 UptimeRobot Setup (if BetterStack unavailable) 
 Cost: Free tier (50 monitors, 5-minute interval) 
 
 Create account at uptimerobot.com 
 Add HTTP(S) monitor:
 
 Friendly Name: Drop Production 
 URL: https://drop.alai.no/api/health 
 Monitoring Interval: 5 minutes (free tier) or 1 minute (paid) 
 
 
 Configure alert contacts:
 
 Slack webhook (via Alert Contacts) 
 Email ( alem@alai.no ) 
 
 
 Set Keyword Monitoring: Response contains "status":"ok" 
 
 Limitations: 
 
 No built-in escalation policies (requires third-party integrations) 
 Status page requires paid plan 
 Less detailed incident reports 
 5-minute check interval (vs 3-minute for BetterStack free) 
 
 
 Monitoring Stack Summary 
 Implemented (MC #1184) 
 
 ✅ Health check endpoint — /api/health with real database verification 
 ✅ Container health checks — Docker + Fly.io auto-restart on failure 
 ❌ Error tracking — Sentry REMOVED (MC #1271) 
 ✅ Slack alerting — Operational alerts with cooldown protection 
 ✅ Lifecycle monitoring — App startup and graceful shutdown alerts 
 ✅ Error spike detection — Automatic alerting when >5 errors/minute 
 
 Recommended (Manual Setup) 
 
 📋 External uptime monitoring — BetterStack checking /api/health every 3 minutes (UptimeRobot at 5 minutes as the fallback) 
 📋 Structured logging — JSON log format with request IDs for correlation 
 📋 Metrics dashboard — Request latency, error rates, database query times 
 📋 Audit logging — Tracked as security requirement ( security/drop-security-rapport.md finding L3) 
 
 Future Enhancements (TODO) 
 
 Database performance monitoring (slow query alerts) 
 Rate limit metrics (track 429 errors per endpoint) 
 Business metrics dashboard (transactions per hour, success rate) 
 Redis-backed error counter (persistent across restarts) 
 Per-endpoint error tracking (isolate problematic routes) 
 
 
 Environment Variables Reference 
 Required for Production 
 # Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX
 
 Dev Mode (All Optional) 
 All monitoring features gracefully degrade when env vars are not set: 
 
 No SLACK_WEBHOOK_URL: Alerts logged to console only 
 
 This allows development to work without external services configured.

Production Deployment
Drop AWS Amplify Deployment Guide 
 
 Rebrand note (2026-02-14): Originally titled "FontelePay". Product rebranded to Drop . Some env var references (Swan, Stripe) are FUTURE integrations — Drop uses a PSD2 pass-through model. See Drop CLAUDE.md . 
 
 This guide covers deploying Drop to AWS Amplify in the Frankfurt (eu-central-1) region. 
 Prerequisites 
 
 AWS Account with Amplify access 
 GitHub repository with Drop code 
 Environment variables ready (see .env.example ) 
 
 Step 1: Create Amplify App 
 
 Go to AWS Amplify Console 
 Ensure you're in eu-central-1 (Frankfurt) region 
 Click Create new app 
 Select Host web app 
 
 Step 2: Connect Repository 
 
 Choose GitHub as your Git provider 
 Authorize AWS Amplify to access your GitHub account 
 Select the Drop repository 
 Choose the branch to deploy (e.g., main or production ) 
 
 Step 3: Configure Build Settings 
 Amplify will auto-detect Next.js. Verify the settings match amplify.yml : 
version: 1
frontend:
  phases:
    preBuild:
      commands:
        - npm ci
    build:
      commands:
        - npm run build
  artifacts:
    baseDirectory: .next
    files:
      - '**/*'
  cache:
    paths:
      - node_modules/**/*
      - .next/cache/**/*
 
 Step 4: Configure Environment Variables 
 In Amplify Console, go to App settings > Environment variables and add: 
 Required Variables 
 
 
 
 | Variable | Description | Example |
 | --- | --- | --- |
 | NODE_ENV | Environment | production |
 | NEXT_PUBLIC_APP_URL | Your app URL | https://drop.amplifyapp.com |
 
 
 
 Swan BaaS 
 
 
 
 | Variable | Description |
 | --- | --- |
 | SWAN_API_URL | https://api.swan.io (production) |
 | SWAN_CLIENT_ID | OAuth2 Client ID |
 | SWAN_CLIENT_SECRET | OAuth2 Client Secret |
 | SWAN_PROJECT_ID | Project ID |
 | SWAN_WEBHOOK_SECRET | Webhook validation secret |
 
 
 
 Stripe 
 
 
 
 | Variable | Description |
 | --- | --- |
 | NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY | Publishable key (pk_live_...) |
 | STRIPE_SECRET_KEY | Secret key (sk_live_...) |
 | STRIPE_WEBHOOK_SECRET | Webhook secret (whsec_...) |
 
 
 
 Sumsub KYC 
 
 
 
 | Variable | Description |
 | --- | --- |
 | SUMSUB_APP_TOKEN | App token |
 | SUMSUB_SECRET_KEY | Secret key |
 | SUMSUB_WEBHOOK_SECRET | Webhook secret |
 | SUMSUB_LEVEL_NAME | KYC flow level |
 
 
 
 Database 
 
 
 
 | Variable | Description |
 | --- | --- |
 | DATABASE_URL | PostgreSQL connection string |
 | REDIS_URL | Redis connection string |
 
 
 
 Authentication 
 
 
 
 | Variable | Description |
 | --- | --- |
 | JWT_SECRET | Min 32 characters |
 | SESSION_SECRET | Min 32 characters |
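 The 32-character minimum on the authentication secrets is easy to get wrong when pasting values into the console, so a startup check along these lines may help. This is a sketch only: validateAuthEnv is a hypothetical helper, and whether the app already performs such a check is not stated here.

```typescript
const MIN_SECRET_LENGTH = 32;

/** Returns a list of human-readable problems; an empty list means the secrets look usable. */
function validateAuthEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const name of ["JWT_SECRET", "SESSION_SECRET"]) {
    const value = env[name];
    if (!value) {
      problems.push(`${name} is not set`);
    } else if (value.length < MIN_SECRET_LENGTH) {
      problems.push(`${name} must be at least ${MIN_SECRET_LENGTH} characters`);
    }
  }
  return problems;
}
```

 In a Node.js process it could be called as validateAuthEnv(process.env) during startup, failing fast if the returned list is non-empty.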
 
 
 
 Step 5: Configure Next.js for Standalone Output 
 Update next.config.ts to enable standalone output for optimal Amplify deployment: 
 import type { NextConfig } from "next";

const nextConfig: NextConfig = {
 output: 'standalone',
};

export default nextConfig;
 
 Step 6: Deploy 
 
 Click Save and deploy 
 Monitor the build in the Amplify Console 
 Once complete, your app will be available at https://<branch>.<app-id>.amplifyapp.com 
 
 Step 7: Configure Custom Domain (Optional) 
 
 Go to App settings > Domain management 
 Click Add domain 
 Enter your domain (e.g., app.getdrop.no ) 
 Follow DNS configuration instructions 
 SSL certificate is automatically provisioned 
 
 Step 8: Set Up Branch Deployments 
 For staging/production workflows: 
 
 Go to App settings > General 
 Click Edit 
 Enable Branch auto-detection 
 Configure branch patterns:
 
 main -> Production 
 staging -> Staging 
 feature/* -> Preview environments 
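 The branch patterns above amount to the mapping sketched below. Amplify performs this matching itself; the targetForBranch function and the DeployTarget type are hypothetical and only illustrate which branches end up where.

```typescript
type DeployTarget = "production" | "staging" | "preview" | "none";

/** Maps a Git branch name to the environment the documented patterns would select. */
function targetForBranch(branch: string): DeployTarget {
  if (branch === "main") return "production";
  if (branch === "staging") return "staging";
  // feature/* : any non-empty name under the feature/ prefix
  if (branch.startsWith("feature/") && branch.length > "feature/".length) return "preview";
  return "none"; // branches outside the patterns are not auto-deployed
}
```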
 
 
 
 Monitoring & Health Checks 
 Health Endpoint 
 The app exposes /api/health for load balancer health checks: 
 curl https://your-app.amplifyapp.com/api/health
 
 Response: 
 {
 "status": "healthy",
 "timestamp": "2026-02-05T12:00:00.000Z",
 "version": "0.1.0",
 "uptime": 3600,
 "checks": {}
}
 
 CloudWatch Logs 
 
 Go to App settings > Monitoring 
 View build logs and access logs 
 Set up CloudWatch alarms for errors 
 
 Troubleshooting 
 Build Fails 
 
 Check build logs in Amplify Console 
 Verify package.json scripts are correct 
 Ensure all dependencies are in package.json 
 
 Environment Variables Not Working 
 
 Verify variables are set in Amplify Console 
 Remember: NEXT_PUBLIC_ prefix required for client-side access 
 Redeploy after changing environment variables 
 
 502/503 Errors 
 
 Check /api/health endpoint 
 Review CloudWatch logs 
 Verify database connections are correct 
 Check memory limits (adjust if needed) 
 
 Cold Starts 
 For serverless functions, cold starts may occur. Mitigate by: 
 
 Using connection pooling for databases 
 Keeping functions warm with scheduled pings 
 Optimizing bundle size 
 
 Security Checklist 
 
 All secrets in Environment Variables (not in code) 
 HTTPS enforced (automatic in Amplify) 
 CORS configured correctly 
 Rate limiting implemented 
 Webhook signatures validated 
 No sensitive data in logs 
 
 Cost Optimization 
 
 Use cache.paths in amplify.yml to speed up builds 
 Enable CloudFront caching for static assets 
 Monitor build minutes usage 
 Consider reserved concurrency for predictable traffic 
 
 Rollback 
 To rollback to a previous deployment: 
 
 Go to Deployments in Amplify Console 
 Find the previous successful deployment 
 Click Redeploy this version 
 
 Support 
 
 AWS Amplify Documentation 
 Next.js on AWS Amplify 
 Drop Internal Docs

BetterStack Setup
BetterStack Uptime Monitoring Setup Guide 
 Last updated: 2026-02-20
 Related: MONITORING.md , health-check.sh 
 Purpose: External uptime monitoring for Drop production environment 
 
 Why BetterStack? 
 BetterStack provides external uptime monitoring independent of Drop's infrastructure: 
 
 Detects infrastructure failures (AWS App Runner crashes, network issues) 
 Alerts when the entire application is unreachable 
 Provides uptime SLA tracking and historical reports 
 Multiple notification channels (Slack, Email, SMS) 
 Status page for client transparency 
 
 Key difference from internal health checks: Internal checks (Docker, Fly.io) only work when the container is running. BetterStack catches total outages. 
 
 Free Tier Limits 
 Plan: Free tier (no credit card required)
 Limits: 
 
 10 monitors (enough for Drop production) 
 3-minute check interval (paid plan: 30s minimum) 
 1 status page 
 Unlimited team members 
 Unlimited integrations (Slack, email, webhooks) 
 
 Upgrade required for: 
 
 Faster check intervals (<3 minutes) 
 More than 10 monitors (e.g., multi-region checks) 
 Advanced features (maintenance windows, custom headers) 
 
 
 Account Setup 
 Step 1: Create Account 
 
 Go to https://betterstack.com/uptime 
 Click "Start free trial" (becomes free tier after trial) 
 Sign up with Alem's email: alem@alai.no 
 Verify email address 
 Create workspace name: "ALAI Products" (shared across Drop, BasicFakta) 
 
 Step 2: Configure Team 
 
 Navigate to Settings > Team 
 Add team members:
 
 alem@alai.no (Owner) 
 john@basicconsulting.no (Admin) 
 
 
 Set Default timezone: Europe/Oslo (CET, UTC+1; CEST, UTC+2 in summer) 
 
 
 Monitor Configuration 
 Monitor 1: Health Endpoint (Primary) 
 Purpose: Verify API health and database connectivity 
 
 
 Go to Monitors > Create Monitor 
 
 
 Configure: 
 
 Monitor name: Drop Health Check 
 Monitor type: HTTP 
 URL: https://drop.alai.no/api/health 
 Check interval: 3 minutes (free tier) 
 Request timeout: 5 seconds 
 Method: GET 
 Confirmation period: 30 seconds (1 retry before alerting) 
 
 
 
 Expected Response: 
 
 Status code: 200 
 Keyword check: Enable
 
 Response body contains: "status":"ok" 
 Why: Confirms the endpoint reports a healthy status in its body, not just an HTTP 200 
 
 
 
 
 
 Advanced settings: 
 
 Follow redirects: Enabled (default) 
 Verify SSL certificate: Enabled 
 SSL expiry warning: 14 days before expiration 
 
 
 
 Click Create Monitor 
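 The effect of combining the status-code check with the keyword check can be sketched as below. It assumes the keyword check is a plain substring match on the response body; MonitorResult and passesKeywordCheck are hypothetical names, not part of BetterStack.

```typescript
interface MonitorResult {
  statusCode: number;
  body: string;
}

/** Passes only when the response is HTTP 200 and the body contains the keyword verbatim. */
function passesKeywordCheck(result: MonitorResult, keyword: string): boolean {
  return result.statusCode === 200 && result.body.includes(keyword);
}

// Example bodies: compact JSON (no spaces) versus pretty-printed JSON.
const compactOk = JSON.stringify({ data: { status: "ok" } });
const prettyOk = JSON.stringify({ data: { status: "ok" } }, null, 2);
const degradedBody = JSON.stringify({ data: { status: "degraded" } });
```

 Under that assumption, a degraded response (HTTP 200 with a different status) fails the check, which is the point of enabling it. A substring match is also sensitive to whitespace: a pretty-printed body containing "status": "ok" with a space would not match, so the keyword should be verified against the real response.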
 
 
 
 Monitor 2: Landing Page 
 Purpose: Verify public website availability 
 
 
 Go to Monitors > Create Monitor 
 
 
 Configure: 
 
 Monitor name: Drop Landing Page 
 Monitor type: HTTP 
 URL: https://drop.alai.no 
 Check interval: 3 minutes 
 Request timeout: 10 seconds (landing page has more assets) 
 Method: GET 
 Confirmation period: 30 seconds 
 
 
 
 Expected Response: 
 
 Status code: 200 
 Keyword check: Enable
 
 Response body contains: Send penger (tagline verification) 
 
 
 
 
 
 Click Create Monitor 
 
 
 
 Monitor 3: Multi-Region Health Check 
 Purpose: Detect regional networking issues 
 
 
 Go to Monitors > Create Monitor 
 
 
 Configure: 
 
 Monitor name: Drop Health (US East) 
 Monitor type: HTTP 
 URL: https://drop.alai.no/api/health 
 Check interval: 3 minutes 
 Request timeout: 5 seconds 
 Method: GET 
 Confirmation period: 30 seconds 
 
 
 
 Expected Response: 
 
 Status code: 200 
 Keyword check: Response body contains "status":"ok" 
 
 
 
 Advanced settings: 
 
 Region: US East (different from default EU region) 
 Why: Detects if Drop is unreachable from specific geographies 
 
 
 
 Click Create Monitor 
 
 
 
 Slack Integration 
 Step 1: Create Slack Incoming Webhook 
 
 Go to your Slack workspace: alai-talk.slack.com 
 Navigate to Slack App Directory > Incoming Webhooks 
 Click Add to Slack 
 Select channel: #drop-ops (create it if it doesn't exist) 
 Click Add Incoming Webhooks Integration 
 Copy webhook URL (format: https://hooks.slack.com/services/T.../B.../XXX ) 
 Save this URL securely (needed for BetterStack) 
 
 Step 2: Add Slack Integration in BetterStack 
 
 In BetterStack, go to Integrations 
 Click Add Integration > Slack 
 Paste webhook URL from Step 1 
 Configure:
 
 Integration name: Drop Ops Slack 
 Notification channel: #drop-ops 
 
 
 Test integration: Click Send test message 
 
 Verify message appears in #drop-ops channel 
 
 
 Click Save Integration 
 
 
 On-Call Team Setup 
 Step 1: Create On-Call Schedule 
 
 Go to On-Call > Create Schedule 
 Configure:
 
 Schedule name: Drop Primary On-Call 
 Timezone: Europe/Oslo 
 
 
 Add rotation:
 
 Team member: alem@alai.no 
 Schedule type: 24/7 (always on-call for now) 
 
 
 Click Create Schedule 
 
 Step 2: Configure Escalation Policy 
 
 
 Go to Escalation Policies > Create Policy 
 
 
 Configure: 
 
 Policy name: Drop Production Incidents 
 
 
 
 Add escalation steps: 
 Step 1 (Immediate): 
 
 Who: Drop Ops Slack integration 
 Delay: 0 minutes 
 
 Step 2 (If still down after 5 minutes): 
 
 Who: alem@alai.no (Email) 
 Delay: 5 minutes 
 
 Step 3 (If still down after 15 minutes): 
 
 Who: alem@alai.no (SMS) — Requires phone number 
 Delay: 15 minutes 
 Note: SMS requires paid plan or verified phone number 
 
 
 
 Click Create Policy 
 
 
 Step 3: Assign Policy to Monitors 
 
 Go to Monitors 
 For each monitor ( Drop Health Check , Drop Landing Page , Drop Health (US East) ):
 
 Click monitor name 
 Go to Settings > Escalation Policy 
 Select: Drop Production Incidents 
 Click Save 
 
 
 
 
 Status Page Setup 
 Purpose 
 Public status page allows clients and stakeholders to check Drop availability without contacting support. 
 Step 1: Create Status Page 
 
 
 Go to Status Pages > Create Status Page 
 
 
 Configure: 
 
 Page name: Drop Status 
 Subdomain: drop-status (URL: https://drop-status.betteruptime.com ) 
 Custom domain (optional): status.drop.alai.no (requires DNS setup) 
 
 
 
 Design settings: 
 
 Logo: Upload Drop logo (green rounded rectangle) 
 Brand color: #0B6E35 (Drop primary green) 
 Header text: Drop Status 
 Tagline: Real-time service status and incident updates 
 
 
 
 Visibility: 
 
 Public: Yes (anyone can view) 
 Search engine indexing: No (prevent Google indexing) 
 
 
 
 Click Create Status Page 
 
 
 Step 2: Add Components 
 
 
 In the status page settings, go to Components 
 
 
 Click Add Component 
 
 
 Add three components: 
 Component 1: 
 
 Name: API & Health Endpoint 
 Linked monitor: Drop Health Check 
 Description: Core API functionality and database connectivity 
 
 Component 2: 
 
 Name: Landing Page 
 Linked monitor: Drop Landing Page 
 Description: Public website and marketing content 
 
 Component 3: 
 
 Name: Global Network 
 Linked monitor: Drop Health (US East) 
 Description: International access and routing 
 
 
 
 Click Save Components 
 
 
 Step 3: Configure Incident Communication 
 
 Go to Status Pages > Settings > Incident Updates 
 Enable:
 
 Auto-create incidents: Yes (when monitor goes down) 
 Auto-resolve incidents: Yes (when monitor recovers) 
 
 
 Notification subscribers: 
 
 Email subscriptions: Enabled (users can subscribe to updates) 
 Webhook notifications: Disabled (optional for future) 
 
 
 
 Step 4: Share Status Page 
 Once created, share the status page URL: 
 
 Internal: Add to #drop-ops Slack channel description 
 External: Link from Drop landing page footer (optional) 
 Clients: Include in onboarding emails 
 
 Status Page URL: https://drop-status.betteruptime.com 
 
 Verification Checklist 
 After completing setup, verify: 
 
 Monitors running: All 3 monitors show green status 
 Slack alerts working: Test by pausing a monitor (triggers down alert) 
 Email notifications working: Verify Alem receives email on test alert 
 Status page public: Open status page URL in incognito mode 
 Escalation policy assigned: All monitors use Drop Production Incidents policy 
 SSL expiry alerts: Monitors configured to warn 14 days before cert expiration 
 
 
 Testing the Setup 
 Test 1: Manual Down Alert 
 
 Go to Monitors > Drop Health Check 
 Click Pause Monitor (simulates downtime) 
 Expected behavior: 
 
 Slack alert in #drop-ops within 30 seconds 
 Email to alem@alai.no after 5 minutes (if still paused) 
 
 
 Click Resume Monitor to clear alert 
 
 Test 2: Actual Downtime 
 
 SSH into production server (or use AWS App Runner console) 
 Stop the Drop application container temporarily 
 Wait for BetterStack to detect downtime (max 3 minutes + 30s confirmation) 
 Expected behavior: 
 
 Monitor shows red status 
 Slack alert in #drop-ops 
 Status page component shows "Down" 
 
 
 Restart application and verify recovery alert 
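 For planning this test, the worst-case notification times follow from the check interval, the confirmation period and the escalation delays. The sketch below assumes the escalation delays (5 and 15 minutes) are counted from the moment the incident is confirmed; worstCaseTimeline and NotificationTimeline are hypothetical names.

```typescript
interface NotificationTimeline {
  slackSeconds: number; // seconds after the outage starts
  emailSeconds: number;
  smsSeconds: number;
}

/** Worst case: the outage begins right after a successful check. */
function worstCaseTimeline(checkIntervalSeconds: number, confirmationSeconds: number): NotificationTimeline {
  const detection = checkIntervalSeconds + confirmationSeconds;
  return {
    slackSeconds: detection, // step 1: immediate on confirmation
    emailSeconds: detection + 5 * 60, // step 2: 5 minutes later
    smsSeconds: detection + 15 * 60, // step 3: 15 minutes later
  };
}
```

 With the free-tier values (180 s interval, 30 s confirmation) the Slack alert can take up to 210 s, so the container should stay stopped for at least that long before the test is judged.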
 
 Test 3: SSL Expiry Warning 
 
 Go to Monitors > Drop Health Check 
 Verify SSL expiry warning is enabled (14 days) 
 Expected behavior: 
 
 Alert sent 14 days before SSL certificate expiration 
 Action required: Renew certificate before expiry 
 
 
 
 
 Alert Examples 
 Downtime Alert (Slack) 
 🚨 Drop Health Check is DOWN

Monitor: Drop Health Check
Status: DOWN
Response: Connection timeout
Region: EU West
Time: 2026-02-20 10:30 UTC

View incident: https://betterstack.com/incidents/...
 
 Recovery Alert (Slack) 
 ✅ Drop Health Check is UP

Monitor: Drop Health Check
Status: UP
Response: 200 OK (2ms)
Downtime duration: 3 minutes
Time: 2026-02-20 10:33 UTC

Incident closed: https://betterstack.com/incidents/...
 
 SSL Expiry Warning (Email) 
 Subject: [BetterStack] SSL certificate expiring in 14 days

Monitor: Drop Health Check
Domain: drop.alai.no
Certificate expiry: 2026-03-06 23:59 UTC

Action required: Renew SSL certificate before expiration.
 
 
 Maintenance Mode 
 When performing planned maintenance (deployments, infrastructure upgrades): 
 
 Go to Maintenance Windows > Create Window 
 Configure:
 
 Name: Drop Deployment 
 Start time: 2026-02-20 22:00 UTC 
 Duration: 1 hour 
 Affected monitors: Select all Drop monitors 
 
 
 Notification: 
 
 Status page update: Yes (shows maintenance banner) 
 Alert suppression: Yes (no downtime alerts during window) 
 
 
 Click Create Maintenance Window 
 
 Effect: During maintenance, downtime alerts are suppressed and status page shows "Scheduled Maintenance" instead of "Down". 
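The suppression rule can be sketched as a simple interval test. This is a minimal model, assuming a hypothetical `MaintenanceWindow` shape; it is not BetterStack's API.

```typescript
// Hypothetical model of a maintenance window; field names are illustrative.
interface MaintenanceWindow {
  start: Date;
  durationMinutes: number;
}

// True while `now` falls inside the window (start inclusive, end exclusive).
function isInMaintenance(w: MaintenanceWindow, now: Date): boolean {
  const startMs = w.start.getTime();
  const endMs = startMs + w.durationMinutes * 60 * 1000;
  const t = now.getTime();
  return t >= startMs && t < endMs;
}

// A downtime alert is sent only when no maintenance window is active.
function shouldSendDowntimeAlert(windows: MaintenanceWindow[], now: Date): boolean {
  return !windows.some((w) => isInMaintenance(w, now));
}
```

For the window configured above (22:00 UTC, 1 hour), a failure at 22:30 is suppressed, while one at 23:00 alerts as usual.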
 
 Best Practices 
 Do's 
 
 ✅ Test alerts monthly — Pause a monitor to verify escalation works 
 ✅ Update on-call schedule — Rotate on-call duty if team grows 
 ✅ Monitor SSL expiry — Enable 14-day warnings to prevent outages 
 ✅ Use maintenance windows — Prevent false alerts during deployments 
 ✅ Review incident history — Monthly review of downtime patterns 
 
 Don'ts 
 
 ❌ Don't ignore degraded status — Investigate even if not fully down 
 ❌ Don't disable monitors — Use pause for temporary suppression only 
 ❌ Don't skip keyword checks — HTTP 200 alone doesn't guarantee working API 
 ❌ Don't forget to update URLs — When domain changes, update all monitors 
 ❌ Don't rely solely on external monitoring — Combine with internal health checks 
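To illustrate why keyword checks matter, here is a minimal sketch of the pass/fail logic. The `ProbeResult` shape and the keyword itself are assumptions; the actual keyword depends on what the /api/health route returns.

```typescript
// Hypothetical shape of a probe result; not BetterStack's API.
interface ProbeResult {
  status: number;
  body: string;
}

// Healthy only if the status is 2xx AND the body contains the expected keyword.
function passesKeywordCheck(result: ProbeResult, keyword: string): boolean {
  const statusOk = result.status >= 200 && result.status < 300;
  return statusOk && result.body.indexOf(keyword) !== -1;
}
```

A proxy error page served with HTTP 200 fails this check, which a status-only monitor would miss.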
 
 
 Troubleshooting 
 Monitor shows false positives (frequent up/down) 
 Cause: Network instability or slow response times
 Fix: 
 
 Increase Request timeout from 5s to 10s 
 Increase Confirmation period from 30s to 60s 
 Check Drop API latency in logs 
 
 Slack alerts not received 
 Cause: Webhook URL incorrect or channel archived
 Fix: 
 
 Go to Integrations > Drop Ops Slack 
 Click Send test message 
 If fails, regenerate webhook in Slack and update BetterStack 
 
 Email alerts delayed 
 Cause: Email provider spam filtering
 Fix: 
 
 Whitelist notifications@betterstack.com in email settings 
 Check spam/junk folder 
 Verify email address in BetterStack team settings 
 
 Status page not updating 
 Cause: Monitor not linked to status page component
 Fix: 
 
 Go to Status Pages > Drop Status > Components 
 Ensure each component has a Linked monitor assigned 
 Save changes and trigger test alert 
 
 
 Related Documentation 
 
 MONITORING.md — Full monitoring stack overview 
 health-check.sh — Internal health check script 
 alerts.ts — Slack alerting implementation 
 /api/health route — Health endpoint source code 
 
 
 Support 
 BetterStack Support: 
 
 Documentation: https://betterstack.com/docs 
 Email: support@betterstack.com 
 Status: https://status.betterstack.com 
 
 Internal Contact: 
 
 Slack: #drop-ops 
 Email: alem@alai.no

Sentry Setup
Drop Sentry Setup 
 Last updated: 2026-02-20
 Source: src/drop-app/src/lib/sentry.ts, src/drop-app/src/lib/sentry-server.ts, src/drop-api/src/lib/sentry.ts, src/drop-app/.env.example 
 
 Overview 
 Drop uses Sentry for error tracking and performance monitoring across three components: 
 
 drop-app (client-side) - Browser errors via @sentry/browser 
 drop-app (server-side) - Next.js middleware/API errors via custom envelope API 
 drop-api - Backend API errors via @sentry/node 
 
 The two drop-app components share one DSN (NEXT_PUBLIC_SENTRY_DSN); drop-api uses the DSN of its own project (SENTRY_DSN). All three gracefully degrade to console-only logging when Sentry is not configured. 
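The console-only fallback can be pictured as a thin wrapper around the upload step. This is a sketch under assumptions, not the code in the sentry.ts modules: `createCapture`, the injected `send` transport, and the `log` callback are hypothetical names.

```typescript
// Hypothetical transport signature; the real modules talk to Sentry themselves.
type Transport = (error: Error) => void;

// Builds a capture function that always logs and only forwards when a DSN is set.
function createCapture(
  dsn: string | undefined,
  send: Transport,
  log: (line: string) => void,
): (error: Error) => boolean {
  const enabled = typeof dsn === "string" && dsn.trim().length > 0;
  return (error: Error): boolean => {
    log("[Sentry] Error captured: " + error.name + ": " + error.message);
    if (!enabled) {
      return false; // console-only mode
    }
    try {
      send(error);
      return true;
    } catch {
      return false; // a failed upload must never break the caller
    }
  };
}
```

The returned boolean reports whether the error was handed to the transport, which makes the degraded mode easy to assert in tests.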
 
 Sentry Account Setup 
 1. Create Free Sentry Account 
 
 Visit sentry.io and sign up (free tier: 5,000 errors/month) 
 Confirm email and log in 
 
 2. Create Projects 
 Create two separate projects (one for app, one for API): 
 Project 1: drop-app 
 
 Click Projects → Create Project 
 Platform: Next.js 
 Project name: drop-app 
 Team: Default team (or create drop-team ) 
 Alert frequency: On every new issue 
 Click Create Project 
 Copy the DSN (format: https://examplePublicKey@o0.ingest.sentry.io/0 ) 
 
 Project 2: drop-api 
 
 Repeat steps above with platform Node.js 
 Project name: drop-api 
 Copy the DSN (different from drop-app) 
 
 IMPORTANT: Use separate projects to keep frontend and backend errors isolated. 
 
 Environment Variables Configuration 
 drop-app (.env.local) 
 Add these variables to src/drop-app/.env.local : 
 # --- Sentry (Error Tracking) ---
# Client-side error tracking (browser)
NEXT_PUBLIC_SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_PROJECT_ID

# Server-side error tracking (middleware/API routes)
# NOTE: drop-app server uses custom envelope API (no @sentry/nextjs due to Turbopack incompatibility)
# Both client and server use the SAME DSN (NEXT_PUBLIC_SENTRY_DSN)

# Optional: Performance monitoring sample rate (0.0 to 1.0, default: 0.1 = 10%)
NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1

# Optional: For source map uploads (requires auth token from Sentry → Settings → Auth Tokens)
SENTRY_ORG=your-org-slug
SENTRY_PROJECT=drop-app
SENTRY_AUTH_TOKEN=your-auth-token
 
 drop-api (.env) 
 Add these variables to src/drop-api/.env : 
 # --- Sentry (Error Tracking) ---
SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_API_PROJECT_ID

# Optional: Performance monitoring sample rate (0.0 to 1.0, default: 0.1 = 10%)
SENTRY_TRACES_SAMPLE_RATE=0.1

# Optional: For source map uploads
SENTRY_ORG=your-org-slug
SENTRY_PROJECT=drop-api
SENTRY_AUTH_TOKEN=your-auth-token
 
 Where to find these values: 
 
 DSN: Project Settings → Client Keys (DSN) 
 Org slug: Settings → Organization → General Settings → Organization Slug 
 Project name: Project Settings → General → Project Name 
 Auth token: Settings → Auth Tokens → Create New Token (scopes: project:releases, project:write) 
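Because the drop-app server posts to the envelope endpoint directly, it helps to see how a DSN maps onto that URL. The sketch below assumes the SaaS DSN shape shown earlier (`https://<publicKey>@<host>/<projectId>`) and ignores self-hosted path prefixes; `parseDsn` and `envelopeUrl` are hypothetical helper names.

```typescript
// Parsed pieces of a DSN of the form https://<publicKey>@<host>/<projectId>.
interface ParsedDsn {
  publicKey: string;
  host: string;
  projectId: string;
}

// Returns null when the value does not look like a DSN (for example, when it is empty).
function parseDsn(dsn: string): ParsedDsn | null {
  const match = /^https:\/\/([^@\/]+)@([^\/]+)\/([^\/]+)$/.exec(dsn.trim());
  if (!match) {
    return null;
  }
  return { publicKey: match[1], host: match[2], projectId: match[3] };
}

// Envelope endpoint derived from the DSN host and project ID.
function envelopeUrl(dsn: ParsedDsn): string {
  return "https://" + dsn.host + "/api/" + dsn.projectId + "/envelope/";
}
```

A `null` result is a convenient signal for falling back to console-only logging.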
 
 
 Verification 
 Test Client-Side Error Capture (drop-app) 
 
 Start the app: npm run dev (in src/drop-app/ ) 
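 Open http://localhost:3000 and open the browser console 
 Trigger a test error from the console. Wrap it in setTimeout so it surfaces through the page's global error handler; errors thrown directly in DevTools are sandboxed and may not be reported:
 setTimeout(() => { throw new Error("Sentry test error - client-side"); }, 0);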
 
 
 Check Sentry dashboard: Projects → drop-app → Issues 
 You should see the test error appear within 10 seconds 
 
 Expected behavior: 
 
 Error logged to browser console: [Sentry] Error captured: Error: Sentry test error - client-side 
 Error appears in Sentry dashboard with stack trace, breadcrumbs, and browser context 
 
 Test Server-Side Error Capture (drop-app) 
 
 Create test API route: src/drop-app/src/app/api/sentry-test/route.ts 
 import { NextResponse } from 'next/server';
import { captureServerError } from '@/lib/sentry-server';

export async function GET() {
 try {
 throw new Error('Sentry test error - server-side');
 } catch (error) {
 captureServerError(error as Error, { tags: { test: 'true' } });
 return NextResponse.json({ error: 'Test error sent to Sentry' }, { status: 500 });
 }
}
 
 
 Visit: http://localhost:3000/api/sentry-test 
 Check server console: [Sentry Server] Error captured: Error: Sentry test error - server-side 
 Check Sentry dashboard: Projects → drop-app → Issues 
 
 Test API Error Capture (drop-api) 
 
 Start the API: npm run dev (in src/drop-api/ ) 
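 Trigger a test error via curl (this requires the test endpoint shown below to exist):
 curl http://localhost:4000/api/sentry-test
 
 
 If the endpoint does not exist yet, create it in src/drop-api/src/routes/test.ts and mount the router under /api: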
 import { Router } from 'express';
import { captureError } from '../lib/sentry.js';

const router = Router();

router.get('/sentry-test', (req, res) => {
 try {
 throw new Error('Sentry test error - API');
 } catch (error) {
 captureError(error as Error, { tags: { test: 'true' } });
 res.status(500).json({ error: 'Test error sent to Sentry' });
 }
});

export default router;
 
 
 Check Sentry dashboard: Projects → drop-api → Issues 
 
 
 Source Map Upload Setup 
 Source maps allow Sentry to show readable stack traces instead of minified code. 
 1. Install Sentry CLI 
 # macOS (Homebrew)
brew install getsentry/tools/sentry-cli

# Or via npm (global)
npm install -g @sentry/cli
 
 2. Configure Sentry CLI 
 Create .sentryclirc in project root: 
 [defaults]
url=https://sentry.io/
org=your-org-slug
project=drop-app

[auth]
token=your-auth-token
 
 IMPORTANT: Add .sentryclirc to .gitignore (contains auth token). 
 3. Add Build Script (drop-app) 
 Update src/drop-app/package.json : 
 {
 "scripts": {
 "build": "next build",
 "build:sentry": "next build && sentry-cli sourcemaps upload --validate .next/static"
 }
}
 
 4. Test Source Map Upload 
 cd src/drop-app
npm run build:sentry
 
 Expected output: 
 > Analyzing source maps for sentry
> Uploading source maps to Sentry
✓ Successfully uploaded source maps
 
 5. CI/CD Integration 
 For automated uploads in CI/CD, add these secrets to your deployment platform: 
 Vercel/Railway/Fly.io: 
 
 SENTRY_ORG 
 SENTRY_PROJECT 
 SENTRY_AUTH_TOKEN 
 
 Then update build command: 
 npm run build && sentry-cli sourcemaps upload --validate .next/static
 
 
 Alert Rules Configuration 
 Recommended Alert Rules 
 1. New Issue Alert (drop-app) 
 
 Go to Projects → drop-app → Settings → Alerts 
 Click Create Alert Rule 
 Configure:
 
 Conditions: When a new issue is created 
 Filters: Environment = production 
 Actions: 
 
 Send notification to: Slack channel #drop-alerts 
 Send email to: alem@alai.no 
 
 
 
 
 Save rule 
 
 2. High Error Rate Alert (drop-app) 
 
 Create new alert rule 
 Configure:
 
 Conditions: Number of events in an issue is more than 100 in 1 hour 
 Filters: Environment = production, Level = error 
 Actions: 
 
 Send notification to: Slack channel #drop-alerts 
 Send email to: alem@alai.no 
 
 
 
 
 Save rule 
 
 3. Critical Error Alert (drop-api) 
 
 Go to Projects → drop-api → Settings → Alerts 
 Create alert rule:
 
 Conditions: When a new issue is created AND Level = fatal 
 Filters: Environment = production 
 Actions: 
 
 Send notification to: Slack channel #drop-critical 
 Send email to: alem@alai.no 
 
 
 
 
 Save rule 
 
 4. Performance Degradation Alert (drop-app) 
 
 Create alert rule:
 
 Conditions: Average transaction duration is above 2000ms for 5 minutes 
 Filters: Environment = production, Transaction = /api/transactions/* 
 Actions: 
 
 Send notification to: Slack channel #drop-performance 
 
 
 
 
 Save rule 
 
 Slack Integration (Optional) 
 
 Go to Settings → Integrations → Slack 
 Click Add Workspace 
 Authorize Sentry to access your Slack workspace 
 Select channels: #drop-alerts , #drop-critical , #drop-performance 
 Test integration by triggering a test error 
 
 
 PII Scrubbing 
 All three Sentry integrations automatically scrub sensitive data before sending events: 
 Scrubbed fields: 
 
 password 
 pin 
 cardNumber 
 cvv 
 fødselsnummer 
 authorization headers 
 cookie headers 
 
 Implementation: 
 
 drop-app (client): src/drop-app/src/lib/sentry.ts (lines 51-76) 
 drop-app (server): Custom envelope API (no PII in server-side events) 
 drop-api: src/drop-api/src/lib/sentry.ts (lines 48-139) 
 
 Verification: 
 
 Trigger error with sensitive data:
 try {
 throw new Error('Login failed for user with password=secret123');
} catch (error) {
 captureError(error, { extra: { cardNumber: '1234567890123456' } });
}
 
 
 Check Sentry event:
 
 Message should show: Login failed for user with password=[REDACTED] 
 Extra context should show: cardNumber: [REDACTED] 
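The scrubbing behaviour verified above can be sketched as two small functions. This is an illustration under assumptions, not the code at the referenced line ranges: `scrub` and `scrubMessage` are hypothetical names, and the key list simply mirrors the scrubbed fields listed in this section.

```typescript
const REDACTED = "[REDACTED]";
// Field names from the list above, lowercased so headers match case-insensitively.
const SENSITIVE_KEYS = ["password", "pin", "cardnumber", "cvv", "fødselsnummer", "authorization", "cookie"];

function isSensitiveKey(key: string): boolean {
  return SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1;
}

// Recursively replaces the values of sensitive keys in plain objects and arrays.
function scrub(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => scrub(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = isSensitiveKey(key) ? REDACTED : scrub(source[key]);
    }
    return out;
  }
  return value;
}

// Redacts key=value pairs embedded in free-text error messages.
function scrubMessage(message: string): string {
  return message.replace(/\b(password|pin|cardNumber|cvv)=\S+/gi, "$1=" + REDACTED);
}
```

Free-text patterns such as a bare fødselsnummer inside a message are not covered by this sketch and would need their own pattern.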
 
 
 
 
 Environment-Specific Configuration 
 Development 
 
 DSN: Optional (errors log to console only if not set) 
 Sample rate: 1.0 (trace every transaction while debugging) 
 Source maps: Not required (local stack traces are readable) 
 
 # .env.local (development)
NEXT_PUBLIC_SENTRY_DSN= # Leave empty to disable Sentry in dev
 
 Staging 
 
 DSN: Required (test Sentry integration before production) 
 Sample rate: 0.5 (capture 50% of transactions) 
 Source maps: Enabled (verify uploads work) 
 
 # .env.staging
NEXT_PUBLIC_SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_PROJECT_ID
NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.5
# Prefer injecting the auth token from CI/CD secrets rather than committing it
SENTRY_AUTH_TOKEN=your-auth-token
 
 Production 
 
 DSN: Required (critical for production monitoring) 
 Sample rate: 0.1 (capture 10% of transactions to stay within free tier) 
 Source maps: Enabled (required for readable stack traces) 
 
 # .env.production
NEXT_PUBLIC_SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_PROJECT_ID
NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1
# Prefer injecting the auth token from CI/CD secrets rather than committing it
SENTRY_AUTH_TOKEN=your-auth-token
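Since the sample rate arrives as an environment string, it is worth validating before use. A minimal sketch, assuming a hypothetical helper `resolveSampleRate`; the 0.1 fallback is the documented default.

```typescript
// Parses a traces sample rate, falling back to the documented default of 0.1.
function resolveSampleRate(raw: string | undefined, fallback: number = 0.1): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (isNaN(parsed)) {
    return fallback;
  }
  // Clamp into the valid range 0.0 to 1.0.
  return Math.min(1, Math.max(0, parsed));
}
```

In drop-app the input would be the value of NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE; in drop-api, SENTRY_TRACES_SAMPLE_RATE.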
 
 
 Troubleshooting 
 No errors appearing in Sentry dashboard 
 Check 1: DSN configured? 
 # drop-app
echo $NEXT_PUBLIC_SENTRY_DSN

# drop-api
echo $SENTRY_DSN
 
 Check 2: Console output? 
 
 Errors should ALWAYS log to console, even if Sentry upload fails 
 Look for: [Sentry] Error captured: ... 
 
 Check 3: Network errors? 
 
 Open browser DevTools → Network tab 
 Filter by sentry.io 
 Check for failed requests (should see POST to https://o0.ingest.sentry.io/api/.../envelope/ ) 
 
 Check 4: Environment mismatch? 
 
 Sentry filters events by environment ( production , development , staging ) 
 Verify NEXT_PUBLIC_APP_ENV or NODE_ENV matches your Sentry project filters 
 
 Source maps not working (minified stack traces) 
 Check 1: Source maps uploaded? 
 cd src/drop-app
sentry-cli releases list
 
 Check 2: Release version matches? 
 
 Sentry matches source maps by release version 
 Verify package.json version matches uploaded release 
 
 Check 3: Upload command ran? 
 # Manually test upload
sentry-cli sourcemaps upload --validate .next/static
 
 PII still appearing in events 
 Check 1: Verify beforeSend hook 
 
 Inspect src/drop-app/src/lib/sentry.ts (client) or src/drop-api/src/lib/sentry.ts (API) 
 Confirm beforeSend function is scrubbing sensitive keys 
 
 Check 2: Add custom scrubbing 
 
 If new sensitive fields appear, add them to scrubbing list:
 const sensitiveKeys = ["password", "pin", "yourNewField"];
 
 
 
 
 Cost Management 
 Sentry Free Tier: 
 
 5,000 errors per month 
 10,000 performance units per month 
 1 GB attachments 
 30 days retention 
 
 Staying within free tier: 
 
 Lower sample rate: Set SENTRY_TRACES_SAMPLE_RATE=0.1 in drop-api and NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1 in drop-app (10%) 
 Filter noisy errors: Use beforeSend to ignore expected errors (e.g., 404s) 
 Set up quotas: Sentry → Settings → Quotas → Set monthly limits 
 
 Example: Ignore 404 errors 
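 beforeSend(event, hint) {
 // Ignore expected 404 errors raised by API calls
 const original = hint?.originalException;
 // originalException is not always an Error, so read the message defensively
 const message = original instanceof Error ? original.message : '';
 if (event.request?.url?.includes('/api/') && message.includes('404')) {
 return null; // Don't send to Sentry
 }
 return event;
}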
 
 
 Security Considerations 
 
 
 Auth token storage: 
 
 NEVER commit .sentryclirc to git 
 Store SENTRY_AUTH_TOKEN in CI/CD secrets, not .env files 
 
 
 
 DSN exposure: 
 
 NEXT_PUBLIC_SENTRY_DSN is exposed to client-side code (safe - it's public) 
 Sentry rate-limits abuse via DSN quotas 
 
 
 
 PII scrubbing: 
 
 Always verify PII scrubbing works before deploying to production 
 Test with real-world data patterns (Norwegian fødselsnummer, BankID tokens) 
 
 
 
 Access control: 
 
 Limit Sentry dashboard access to authorized team members only 
 Use Sentry Teams to restrict project access 
 
 
 
 
 References 
 
 Sentry Docs: https://docs.sentry.io/platforms/javascript/guides/nextjs/ 
 Sentry CLI: https://docs.sentry.io/product/cli/ 
 Source Maps: https://docs.sentry.io/platforms/javascript/sourcemaps/ 
 PII Scrubbing: https://docs.sentry.io/platforms/javascript/data-management/sensitive-data/ 
 Alert Rules: https://docs.sentry.io/product/alerts/ 
 
 
 Next Steps 
 
 Create Sentry account and projects (drop-app, drop-api) 
 Add DSN to .env.local (development) and .env.production (production) 
 Test error capture in all three components 
 Configure alert rules (new issues, high error rate, critical errors) 
 Set up source map uploads for production builds 
 Integrate Slack notifications (optional) 
 Monitor error dashboard daily during initial deployment

CloudWatch Logs Setup
CloudWatch Logs Setup — Drop Production 
 Date: 2026-02-22
 Priority: P0 (Production Blocker)
 Effort: 2 hours
 Cost: ~$17/month (30 GB ingestion plus storage and queries; see Cost Estimate) 
 
 Overview 
 AWS App Runner automatically streams application logs (stdout/stderr) to CloudWatch Logs. This guide configures retention policies, Logs Insights queries, and alarms for production monitoring. 
 
 Prerequisites 
 
 AWS CLI configured with credentials 
 App Runner service deployed to eu-west-1 
 Application writes JSON logs to stdout (already implemented via src/lib/logger.ts ) 
 
 
 Configuration 
 1. Set Log Retention Policy 
 Default: CloudWatch Logs retains log events indefinitely, which gets expensive
 Recommendation: 30 days (production), 7 days (staging) 
 # Production: 30 days retention
aws logs put-retention-policy \
 --log-group-name /aws/apprunner/drop-production \
 --retention-in-days 30 \
 --region eu-west-1

# Staging: 7 days retention
aws logs put-retention-policy \
 --log-group-name /aws/apprunner/drop-staging \
 --retention-in-days 7 \
 --region eu-west-1
 
 Verify retention: 
 aws logs describe-log-groups \
 --log-group-name-prefix /aws/apprunner/drop \
 --region eu-west-1 \
 | jq '.logGroups[] | {name: .logGroupName, retention: .retentionInDays}'

# Expected:
# {
# "name": "/aws/apprunner/drop-production",
# "retention": 30
# }
 
 
 2. Create Log Insights Queries 
 Purpose: Pre-built queries for common investigations. 
 Query 1: All Errors (Last Hour) 
 fields @timestamp, level, message, metadata.error, metadata.userId, requestId
| filter level = "error"
| sort @timestamp desc
| limit 100
 
 Save as: drop-errors-last-hour 
 Query 2: User Activity Trace 
 fields @timestamp, level, message, metadata.userId, metadata.action, requestId
| filter metadata.userId = "usr_123"
| sort @timestamp desc
| limit 500
 
 Save as: drop-user-activity-trace 
 Query 3: Request Trace by ID 
 fields @timestamp, level, message, metadata
| filter requestId = "req_abc123"
| sort @timestamp asc
 
 Save as: drop-request-trace 
 Query 4: API Endpoint Performance 
 fields @timestamp, message, metadata.endpoint, metadata.latencyMs
| filter metadata.latencyMs > 1000
| stats avg(metadata.latencyMs) as avg_latency, max(metadata.latencyMs) as max_latency, count() as slow_requests by metadata.endpoint
| sort slow_requests desc
 
 Save as: drop-slow-endpoints 
 Query 5: Authentication Events 
 fields @timestamp, level, message, metadata.action, metadata.userId, metadata.ip
| filter metadata.action in ["login_success", "login_failure", "logout"]
| sort @timestamp desc
| limit 100
 
 Save as: drop-auth-events 
 Query 6: Payment Failures 
 fields @timestamp, level, message, metadata.errorCode, metadata.transactionId, metadata.userId
| filter metadata.errorCode in ["INSUFFICIENT_FUNDS", "PAYMENT_REJECTED", "TIMEOUT"]
| sort @timestamp desc
| limit 50
 
 Save as: drop-payment-failures 
 
 3. Create CloudWatch Alarms 
 Alarm 1: High Error Rate 
 Metric: Error log entries per minute
 Threshold: >10 errors/minute for 2 consecutive periods
 Action: Send SNS notification → Slack webhook 
 # Create metric filter
aws logs put-metric-filter \
 --log-group-name /aws/apprunner/drop-production \
 --filter-name drop-error-count \
 --filter-pattern '{ $.level = "error" }' \
 --metric-transformations \
 metricName=ErrorCount,metricNamespace=Drop/Logs,metricValue=1,unit=Count \
 --region eu-west-1

# Create alarm
aws cloudwatch put-metric-alarm \
 --alarm-name drop-high-error-rate \
 --alarm-description "Alert when error rate exceeds threshold" \
 --metric-name ErrorCount \
 --namespace Drop/Logs \
 --statistic Sum \
 --period 60 \
 --evaluation-periods 2 \
 --threshold 10 \
 --comparison-operator GreaterThanThreshold \
 --treat-missing-data notBreaching \
 --alarm-actions <SNS-TOPIC-ARN> \
 --region eu-west-1
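The alarm condition (more than 10 errors per minute for 2 consecutive periods) can be modelled in a few lines. This is a simplified sketch of the evaluation, not CloudWatch's full algorithm; `isInAlarm` and the `Datapoint` type are hypothetical.

```typescript
// Per-period error counts, oldest first; null means no datapoint for that period.
type Datapoint = number | null;

// ALARM when each of the last `evaluationPeriods` datapoints exceeds the threshold.
// Missing datapoints count as not breaching, mirroring --treat-missing-data notBreaching.
function isInAlarm(points: Datapoint[], threshold: number, evaluationPeriods: number): boolean {
  if (points.length < evaluationPeriods) {
    return false;
  }
  const recent = points.slice(points.length - evaluationPeriods);
  return recent.every((p) => p !== null && p > threshold);
}
```

Note that the comparison is strict: a minute with exactly 10 errors does not breach a threshold of 10.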
 
 Alarm 2: No Logs Received (Service Down) 
 Metric: Log ingestion stopped
 Threshold: No logs for 5 minutes
 Action: Send SNS notification 
 aws cloudwatch put-metric-alarm \
 --alarm-name drop-no-logs-received \
 --alarm-description "Alert when no logs received (service may be down)" \
 --metric-name IncomingLogEvents \
 --namespace AWS/Logs \
 --dimensions Name=LogGroupName,Value=/aws/apprunner/drop-production \
 --statistic Sum \
 --period 300 \
 --evaluation-periods 1 \
 --threshold 1 \
 --comparison-operator LessThanThreshold \
 --treat-missing-data breaching \
 --alarm-actions <SNS-TOPIC-ARN> \
 --region eu-west-1
 
 Alarm 3: Database Errors 
 Metric: Database connection errors
 Threshold: >5 DB errors in 5 minutes 
 aws logs put-metric-filter \
 --log-group-name /aws/apprunner/drop-production \
 --filter-name drop-db-errors \
 --filter-pattern '{ $.message = "*database*" && $.level = "error" }' \
 --metric-transformations \
 metricName=DatabaseErrors,metricNamespace=Drop/Logs,metricValue=1,unit=Count \
 --region eu-west-1

aws cloudwatch put-metric-alarm \
 --alarm-name drop-database-errors \
 --alarm-description "Alert when database errors exceed threshold" \
 --metric-name DatabaseErrors \
 --namespace Drop/Logs \
 --statistic Sum \
 --period 300 \
 --evaluation-periods 1 \
 --threshold 5 \
 --comparison-operator GreaterThanThreshold \
 --treat-missing-data notBreaching \
 --alarm-actions <SNS-TOPIC-ARN> \
 --region eu-west-1
 
 
 4. SNS Topic for Alerts 
 Create SNS topic (if not exists): 
 aws sns create-topic \
 --name drop-cloudwatch-alerts \
 --region eu-west-1

# Output:
# {
# "TopicArn": "arn:aws:sns:eu-west-1:324480209768:drop-cloudwatch-alerts"
# }
 
 Subscribe to the topic (email now, Slack later): 
 # Option 1: Email subscription (immediate)
aws sns subscribe \
 --topic-arn arn:aws:sns:eu-west-1:324480209768:drop-cloudwatch-alerts \
 --protocol email \
 --notification-endpoint alem@alai.no \
 --region eu-west-1

# Confirm subscription via email link

# Option 2: Lambda → Slack (requires Lambda function)
# See: infrastructure/cloudwatch-to-slack-lambda.md (future enhancement)
 
 
 5. Export Logs to S3 (Compliance/Archival) 
 Purpose: Long-term storage (>30 days) for compliance, cheaper than CloudWatch. 
 Create S3 bucket: 
 aws s3 mb s3://drop-logs-archive --region eu-west-1

# Set lifecycle policy (move to Glacier after 90 days)
cat > lifecycle.json <<EOF
{
 "Rules": [
 {
 "Id": "archive-old-logs",
 "Filter": { "Prefix": "" },
 "Status": "Enabled",
 "Transitions": [
 {
 "Days": 90,
 "StorageClass": "GLACIER"
 }
 ],
 "Expiration": {
 "Days": 1825
 }
 }
 ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
 --bucket drop-logs-archive \
 --lifecycle-configuration file://lifecycle.json
 
 Create export task (manual, run monthly): 
 # Export the last 30 days to S3 (run on day 1 of each month, before the 30-day retention removes them)
# Timestamps are epoch milliseconds; the -d flag requires GNU date
START_TIME=$(date -u -d '30 days ago' +%s)000
END_TIME=$(date -u +%s)000

aws logs create-export-task \
 --log-group-name /aws/apprunner/drop-production \
 --from $START_TIME \
 --to $END_TIME \
 --destination drop-logs-archive \
 --destination-prefix logs/$(date +%Y-%m) \
 --region eu-west-1

# Check export status
aws logs describe-export-tasks --region eu-west-1
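The export task takes its --from and --to bounds as epoch milliseconds. A small sketch of computing a window that covers the most recent 30 days, which is the range still present under a 30-day retention policy; `exportWindow` and `destinationPrefix` are hypothetical helpers and use UTC throughout.

```typescript
// Epoch-millisecond bounds for an export covering the last `days` days.
interface ExportWindow {
  from: number;
  to: number;
}

function exportWindow(now: Date, days: number): ExportWindow {
  const to = now.getTime();
  const from = to - days * 24 * 60 * 60 * 1000;
  return { from, to };
}

// Destination prefix in the logs/YYYY-MM form used above (UTC month).
function destinationPrefix(now: Date): string {
  const month = now.getUTCMonth() + 1;
  return "logs/" + now.getUTCFullYear() + "-" + (month < 10 ? "0" + month : String(month));
}
```

Exporting a range older than the retention period would return nothing, since those events have already been deleted.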
 
 Automate with Lambda (future): 
 
 Schedule Lambda to run monthly 
 Export previous month's logs to S3 
 Delete from CloudWatch after successful export 
 
 
 Log Format 
 Current Format (Structured JSON) 
 Example log entry: 
 {
 "timestamp": "2026-02-22T10:30:45.123Z",
 "level": "info",
 "message": "User logged in",
 "requestId": "req_abc123",
 "metadata": {
 "userId": "usr_456",
 "email": "user@example.com",
 "ip": "1.2.3.4",
 "action": "login_success"
 }
}
 
 CloudWatch Logs Insights automatically parses JSON fields, enabling queries like: 
 | filter metadata.userId = "usr_456"
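For reference, producing an entry in this shape takes little code. The sketch below is an assumption about how such a logger could look, not the contents of src/lib/logger.ts; `formatEntry`, `shouldLog`, and the "warn" level are hypothetical.

```typescript
type Level = "debug" | "info" | "warn" | "error";
const LEVEL_ORDER: Level[] = ["debug", "info", "warn", "error"];

// Serializes one entry as a single JSON line in the format shown above.
function formatEntry(
  level: Level,
  message: string,
  requestId: string,
  metadata: Record<string, unknown>,
  now: Date,
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, message, requestId, metadata });
}

// True when `level` is at or above the configured minimum level.
function shouldLog(level: Level, minLevel: Level): boolean {
  return LEVEL_ORDER.indexOf(level) >= LEVEL_ORDER.indexOf(minLevel);
}
```

Writing exactly one JSON object per line is what lets Logs Insights discover fields such as metadata.userId without extra parsing.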
 
 
 Cost Estimate 
 CloudWatch Logs Pricing (EU-West-1) 
 
 Ingestion: $0.50 per GB 
 Storage: $0.03 per GB/month 
 Log Insights queries: $0.005 per GB scanned 
 
 Expected Usage (Production) 
 
 Log volume: ~1 GB/day (30 GB/month) 
 Ingestion cost: 30 GB × $0.50 = $15/month 
 Storage cost (30-day retention): 30 GB × $0.03 = $0.90/month 
 Query cost: ~10 queries/day × 1 GB × $0.005 × 30 = $1.50/month 
 
 Total: ~$17/month 
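The estimate can be reproduced, and re-run for other log volumes, with a short calculation. The prices are the ones quoted in this section; `estimateMonthlyCost` and its input shape are hypothetical.

```typescript
interface LogCostInput {
  gbPerDay: number;
  days: number;
  queriesPerDay: number;
  gbScannedPerQuery: number;
}

// Prices quoted above, in USD.
const INGESTION_PER_GB = 0.5;
const STORAGE_PER_GB_MONTH = 0.03;
const QUERY_PER_GB_SCANNED = 0.005;

function estimateMonthlyCost(input: LogCostInput): { ingestion: number; storage: number; queries: number; total: number } {
  const monthlyGb = input.gbPerDay * input.days;
  const ingestion = monthlyGb * INGESTION_PER_GB;
  const storage = monthlyGb * STORAGE_PER_GB_MONTH;
  const queries = input.queriesPerDay * input.gbScannedPerQuery * QUERY_PER_GB_SCANNED * input.days;
  return { ingestion, storage, queries, total: ingestion + storage + queries };
}
```

With 1 GB/day over 30 days and 10 queries/day scanning 1 GB each, the components are roughly $15.00, $0.90, and $1.50. Ingestion dominates, so reducing log volume is the most effective lever.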
 Cost Optimization 
 
 
 Reduce log verbosity (filter debug logs in production): 
 // src/lib/logger.ts
const minLevel = process.env.NODE_ENV === 'production' ? 'info' : 'debug';
 
 
 
 Use sampling for high-volume events: 
 if (Math.random() < 0.1) { // Log 10% of requests
 logger.debug('Request details', { ... });
}
 
 
 
 Export to S3 for long-term storage ($0.023/GB/month, 23% cheaper) 
 
 
 
 Querying Logs 
 Via AWS Console 
 
 Open CloudWatch Console: https://console.aws.amazon.com/cloudwatch/ 
 Navigate to: Logs → Log groups → /aws/apprunner/drop-production 
 Click "Search log group" or "Insights queries" 
 Select saved query or write custom query 
 
 Via AWS CLI 
 # Run an ad hoc Insights query (start/end times are epoch seconds; -d requires GNU date)
aws logs start-query \
 --log-group-name /aws/apprunner/drop-production \
 --start-time $(date -u -d '1 hour ago' +%s) \
 --end-time $(date -u +%s) \
 --query-string 'fields @timestamp, level, message | filter level = "error" | sort @timestamp desc' \
 --region eu-west-1

# Get query results (use queryId from previous command)
aws logs get-query-results --query-id <query-id> --region eu-west-1
 
 Via Log Streaming (Real-Time) 
 # Stream logs in real-time (like tail -f)
aws logs tail /aws/apprunner/drop-production \
 --follow \
 --format short \
 --region eu-west-1

# Filter by error level
aws logs tail /aws/apprunner/drop-production \
 --follow \
 --filter-pattern '{ $.level = "error" }' \
 --region eu-west-1
 
 
 Troubleshooting 
 Issue: No logs appearing in CloudWatch 
 Diagnosis: 
 # Check if log group exists
aws logs describe-log-groups \
 --log-group-name-prefix /aws/apprunner/drop \
 --region eu-west-1

# Check App Runner service logs integration
aws apprunner describe-service \
 --service-arn <ARN> \
 --region eu-west-1 \
 | jq '.Service.ObservabilityConfiguration'
 
 Solution: 
 
 App Runner auto-creates log group on first log output 
 Verify app is writing to stdout (not file) 
 Check IAM permissions (App Runner role needs logs:CreateLogStream, logs:PutLogEvents) 
 
 Issue: Logs not in JSON format 
 Diagnosis: 
 # Check log entries
aws logs tail /aws/apprunner/drop-production --format short --region eu-west-1 | head -10
 
 Solution: 
 
 Ensure app uses logger.ts for all logging (not console.log ) 
 Verify process.stdout.write(JSON.stringify(entry) + "\n") is used 
 
 
 Checklist 
 
 Retention policy set (30 days production, 7 days staging) 
 Log Insights queries saved (6 queries) 
 Metric filters created (error count, DB errors) 
 CloudWatch alarms configured (3 alarms) 
 SNS topic created and subscribed (email/Slack) 
 S3 export bucket created (with lifecycle policy) 
 Cost estimate reviewed and approved 
 Team trained on log querying (AWS Console + CLI) 
 Documentation updated 
 
 
 Next Steps 
 
 Deploy retention policies (run commands above) 
 Test alarms (trigger error spike, verify alert received) 
 Save Log Insights queries (via AWS Console) 
 Schedule monthly S3 export (manual for now, automate later) 
 Monitor costs (set billing alert at $20/month) 
 
 
 Related Documentation 
 
 docs/infrastructure/MONITORING.md — Overall monitoring setup 
 src/lib/logger.ts — Structured logging implementation 
 infrastructure/error-tracking-setup.md — Sentry integration 
 AWS CloudWatch Logs docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/ 
 
 
 Last Updated: 2026-02-22
 Owner: John (AI Director)

DevOps Stack

DevOps/SRE Stack
DevOps/SRE Stack for Drop (originally FontelePay) 
 
 Rebrand note (2026-02-14): FontelePay was renamed to Drop. Some references to FontelePay remain in this document (metric names, Sentry projects, API URLs). These should be updated when implementing the actual DevOps stack. Drop uses a PSD2 pass-through model — no wallet, no balance held by Drop. 
 
 Table of Contents 
 
 Executive Summary 
 CI/CD Pipeline 
 Testing Strategy 
 Monitoring & Observability 
 Error Tracking 
 Alerting & Incident Management 
 Documentation 
 Security Operations 
 Cost Summary 
 Implementation Priority 
 Integration Diagram 
 
 
 1. Executive Summary 
 Stack Philosophy 
 Drop requires a DevOps/SRE stack that balances: 
 
 Fintech compliance (audit trails, security, GDPR) 
 Cost efficiency for MVP phase 
 Scalability for growth to 100K+ users 
 EU data residency where possible 
 Small team maintainability (1-2 DevOps engineers) 
 
 Recommended Stack Overview 
 
 
 
 | Area | MVP Tool | Scale Tool | Reason |
 |---|---|---|---|
 | CI/CD | GitHub Actions | GitHub Actions + ArgoCD | Native GitHub, EU runners available |
 | E2E Testing | Playwright | Playwright | Open-source, excellent mobile web |
 | Load Testing | k6 | k6 + Grafana Cloud | Grafana ecosystem, scriptable |
 | APM | Grafana Cloud | Grafana Cloud | EU-hosted, cost-effective |
 | Logs | Grafana Loki | Grafana Loki | Part of Grafana stack |
 | Errors | Sentry | Sentry | Best-in-class, EU hosting |
 | Alerts | Slack + PagerDuty | PagerDuty | Start simple, scale |
 | Secrets | AWS Secrets Manager | AWS Secrets Manager | Native AWS, compliant |
 | Security Scan | Snyk | Snyk + DAST | Developer-friendly |
 
 
 
 Total MVP Monthly Cost: EUR 800-1,200/month 
 Total Scale Monthly Cost: EUR 2,500-4,000/month 
 
 2. CI/CD Pipeline 
 2.1 Recommendation: GitHub Actions 
 Why GitHub Actions over alternatives: 
 
 
 
 | Criteria | GitHub Actions | GitLab CI | CircleCI |
 |---|---|---|---|
 | Native Integration | Best (GitHub) | Requires migration | Good |
 | EU Runners | Yes (Azure EU) | Yes | Limited |
 | Free Tier | 2,000 min/month | 400 min/month | 6,000 min/month |
 | Secrets Management | Native | Native | Native |
 | Self-hosted Runners | Yes | Yes | Limited |
 | Marketplace | Largest | Growing | Medium |
 | Learning Curve | Low | Medium | Medium |
 | OIDC for AWS | Native | Requires setup | Requires setup |
 
 
 
 Decision: GitHub Actions 
 
 Already using GitHub for source control 
 Native OIDC integration with AWS (no long-lived credentials) 
 EU-hosted runners available 
 Excellent ecosystem of actions 
 Cost-effective at scale 
 
 2.2 Pipeline Architecture 
 # .github/workflows/main.yml structure

Triggers:
 - push to main/develop
 - pull request
 - manual dispatch

Jobs:
 1. lint-and-format
 - ESLint, Prettier
 - Parallel for speed

 2. security-scan
 - Snyk dependency check
 - Secret scanning
 - SAST (CodeQL)

 3. test-unit
 - Vitest (backend/frontend)
 - Coverage threshold: 80%

 4. test-integration
 - Database tests
 - API contract tests

 5. build
 - Docker image build
 - Multi-arch (amd64/arm64)

 6. test-e2e (staging only)
 - Playwright
 - Against staging environment

 7. deploy-staging
 - Automatic on develop merge

 8. deploy-production
 - Manual approval required
 - Canary deployment
 
 2.3 Deployment Strategies 
 MVP Phase: Rolling Deployment 
 
 Simple, works with small user base 
 Zero-downtime with K8s rolling updates 
 Easy rollback 
 
 Scale Phase: Canary Deployment 
 Production Traffic:
 ├── 95% → Current Version
 └── 5% → New Version (canary)

Promotion: Manual after metrics validation
Rollback: Automatic on error rate spike
 
 Implementation: ArgoCD + Argo Rollouts 
 
 GitOps model (infrastructure as code) 
 Automated sync from Git 
 Progressive delivery 
 Audit trail of all deployments 
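The automatic rollback rule can be expressed as a comparison between canary and baseline error rates. This is a sketch only: the ratio and floor thresholds are illustrative assumptions, not values from this document, and Argo Rollouts would express the same idea through its own analysis configuration.

```typescript
interface CanaryMetrics {
  baselineErrorRate: number; // errors / requests on the current version
  canaryErrorRate: number;   // errors / requests on the canary
  canaryRequests: number;    // requests observed on the canary so far
}

// Roll back when the canary error rate exceeds both an absolute floor and
// `maxRatio` times the baseline. Thresholds are illustrative assumptions.
function shouldRollback(m: CanaryMetrics, maxRatio: number = 2, floor: number = 0.01): boolean {
  if (m.canaryRequests === 0) {
    return false; // no traffic yet, nothing to judge
  }
  const limit = Math.max(m.baselineErrorRate * maxRatio, floor);
  return m.canaryErrorRate > limit;
}
```

The absolute floor avoids rolling back on noise when the baseline error rate is close to zero.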
 
 2.4 Branch Strategy 
 main (production)
 ↑
 └── develop (staging)
 ↑
 └── feature/* (development)
 └── hotfix/* (emergency fixes)
 
 Rules: 
 
 main : Protected, requires PR + approval + passing CI 
 develop : Protected, requires PR + passing CI 
 Feature branches: Deleted after merge 
 Hotfixes: Can bypass develop in emergencies 
 
 2.5 GitHub Actions Cost Estimate 
 
 
 
 | Phase | Minutes/Month | Cost |
 |---|---|---|
 | MVP (5 devs) | ~3,000 | Free (2,000) + EUR 20 |
 | Scale (15 devs) | ~15,000 | EUR 120/month |
 
 
 
 
 3. Testing Strategy 
 3.1 Testing Pyramid 
 ┌─────────┐
 │ E2E │ ~10% of tests
 │ (Slow) │ Critical user journeys
 └────┬────┘
 │
 ┌──────┴──────┐
 │ Integration │ ~20% of tests
 │ (Medium) │ API contracts, DB
 └──────┬──────┘
 │
 ┌─────────┴─────────┐
 │ Unit │ ~70% of tests
 │ (Fast) │ Business logic
 └───────────────────┘
 
 3.2 Unit Testing 
 Current Stack: Vitest (already configured; see the test script in package.json) 
 Coverage Requirements: 
 
 
 
 | Component | Minimum | Target |
 |---|---|---|
 | Business Logic | 90% | 95% |
 | API Controllers | 80% | 90% |
 | Utilities | 70% | 80% |
 | UI Components | 60% | 70% |
 
 
 
 Best Practices: 
 
 Test business logic, not implementation 
 Mock external dependencies 
 Use factories for test data 
 Run on every commit 
 
 3.3 Integration Testing 
 Tools: 
 
 Testcontainers - Spin up PostgreSQL, Redis in Docker 
 Supertest - HTTP assertions for API testing 
 Pact - Contract testing between services 
 
 What to Test: 
 
 Database queries (with real PostgreSQL) 
 Redis caching behavior 
 API contract between services 
 BaaS webhook handlers 
 Payment flow integration (sandbox) 
 
 3.4 E2E Testing 
 Recommendation: Playwright 
 
 
 
| Criteria | Playwright | Cypress |
|---|---|---|
| Browser Support | All major + mobile | Chrome, Firefox, Edge |
| Speed | Faster (parallel) | Slower |
| Auto-wait | Built-in | Built-in |
| Mobile Testing | Better (device emulation) | Limited |
| CI Integration | Excellent | Good |
| Cost | Free | Free (cloud paid) |
| Learning Curve | Medium | Lower |
 
 
 
 Decision: Playwright 
 
 Better mobile web testing (critical for Drop) 
 True parallel execution 
 Multiple browser contexts 
 API testing built-in 
 Network interception for mocking 
 
 Critical User Journeys to Test: 
 
 User registration + KYC start 
 Login flow (email + biometric) 
 View balance and transactions 
 Send P2P transfer 
 Card top-up flow 
 Card freeze/unfreeze 
 SEPA transfer initiation 
 
 Playwright Configuration: 
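 // playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // One project per target: desktop plus the two mobile emulation profiles
  projects: [
    { name: 'Desktop Chrome', use: { ...devices['Desktop Chrome'] } },
    { name: 'Mobile Safari', use: { ...devices['iPhone 14'] } },
    { name: 'Mobile Chrome', use: { ...devices['Pixel 7'] } },
  ],
  retries: 2, // retry a failing test up to twice before reporting it as failed
  reporter: [['html'], ['junit', { outputFile: 'results.xml' }]],
});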
 
 3.5 Load Testing 
 Recommendation: k6 
 Why k6: 
 
 Open-source, scriptable in JavaScript 
 Integrates with Grafana (our monitoring stack) 
 Cloud option available for distributed load 
 Can run locally or in CI/CD 
 
 Load Test Scenarios: 
 
 
 
| Scenario | Virtual Users | Duration | Success Criteria |
|---|---|---|---|
| Baseline | 50 | 5 min | p95 < 500ms |
| Peak | 200 | 10 min | p95 < 1000ms |
| Stress | 500 | 5 min | No crashes |
| Soak | 100 | 1 hour | No memory leaks |
 
 
 
 Critical Endpoints: 
 
 POST /api/auth/login - 100 req/sec target 
 GET /api/accounts/balance - 500 req/sec target 
 POST /api/transfers - 50 req/sec target 
 GET /api/transactions - 200 req/sec target 
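
The p95 success criteria above reduce to a percentile check over recorded latencies. The following is a minimal sketch of that check in plain TypeScript rather than a k6 script; the nearest-rank percentile method and the `Scenario` shape are assumptions chosen for illustration.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical shape for the latency-bound scenarios in the table.
interface Scenario {
  name: string;
  p95LimitMs: number;
}

const latencyScenarios: Scenario[] = [
  { name: "Baseline", p95LimitMs: 500 },
  { name: "Peak", p95LimitMs: 1000 },
];

// A scenario passes when the observed p95 is strictly below its limit.
function passes(scenario: Scenario, samplesMs: number[]): boolean {
  return percentile(samplesMs, 95) < scenario.p95LimitMs;
}
```

The Stress and Soak scenarios are left out on purpose: "no crashes" and "no memory leaks" are not latency thresholds and need separate checks.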
 
 3.6 Security Testing 
 SAST (Static Analysis): 
 
 CodeQL (GitHub native) - Free, good coverage 
 Snyk Code - Better for JavaScript/TypeScript 
 SonarQube - Alternative if self-hosted preferred 
 
 DAST (Dynamic Analysis): 
 
 OWASP ZAP - Free, CI-integrated 
 Burp Suite - For manual penetration testing 
 
 Dependency Scanning: 
 
 Snyk - Primary recommendation 
 Dependabot - Free, GitHub native (backup) 
 
 Schedule: 
 
 
 
| Test Type | Frequency | Blocker? |
|---|---|---|
| SAST | Every PR | Yes (high severity) |
| Dependency Scan | Daily | Yes (critical) |
| DAST | Weekly | No (review) |
| Pen Test | Quarterly | N/A (manual) |
 
 
 
 
 4. Monitoring & Observability 
 4.1 Strategy: Unified Grafana Stack 
 Why Grafana Cloud over alternatives: 
 
 
 
| Criteria | Grafana Cloud | Datadog | New Relic |
|---|---|---|---|
| EU Hosting | Yes (Frankfurt) | Yes | Yes |
| Pricing Model | Usage-based | Per-host | Per-user |
| MVP Cost | EUR 0-200 | EUR 400+ | EUR 300+ |
| Scale Cost | EUR 500-1,000 | EUR 2,000+ | EUR 1,500+ |
| Open Standards | Full (Prometheus, OTel) | Partial | Partial |
| Vendor Lock-in | Low | High | High |
| Self-host Option | Yes (fallback) | No | No |
 
 
 
 Decision: Grafana Cloud 
 
 Best cost/value for startup 
 EU data residency (Frankfurt region) 
 Open standards (can migrate if needed) 
 Unified platform (metrics, logs, traces) 
 Free tier generous for MVP 
 
 4.2 Metrics (Prometheus + Grafana) 
 Infrastructure Metrics: 
 
 CPU, Memory, Disk, Network 
 Kubernetes pod health 
 Database connections, query latency 
 Redis hit/miss ratio 
 
 Application Metrics: 
 
 Request rate, latency, error rate (RED) 
 Active users (DAU/MAU) 
 Transaction volume and value 
 KYC conversion funnel 
 Card activation rate 
 
 Business Metrics (Custom): 
 fontelepay_transactions_total{type="p2p|sepa|card"}
fontelepay_transaction_value_eur{type="p2p|sepa|card"}
fontelepay_users_registered_total
fontelepay_users_kyc_passed_total
fontelepay_cards_issued_total{type="virtual|physical"}
fontelepay_api_latency_seconds{endpoint="/api/..."}
 
 4.3 Log Aggregation (Loki) 
 Why Loki: 
 
 Part of Grafana stack (unified UI) 
 Cost-effective (indexes labels, not content) 
 Kubernetes native 
 Query language similar to Prometheus 
 
 Log Structure (JSON): 
 {
 "timestamp": "2026-02-05T10:30:00Z",
 "level": "info",
 "service": "payment-service",
 "trace_id": "abc123",
 "user_id": "usr_xxx", // pseudonymized
 "message": "Transfer initiated",
 "amount_eur": 100,
 "transfer_type": "sepa"
}
 
 Retention Policy: 
 
 
 
| Log Type | Retention | Reason |
|---|---|---|
| Application | 30 days | Debugging |
| Security/Audit | 7 years | Compliance |
| Access Logs | 90 days | Security review |
 
 
 
 GDPR Considerations: 
 
 No PII in logs (use pseudonymized IDs) 
 User IDs hashed or tokenized 
 IP addresses masked after 30 days 
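
One way to satisfy the "no PII in logs" rule is to pseudonymize the user ID before it reaches the log line. The sketch below assumes a keyed HMAC-SHA256 for that step and a hypothetical `formatLog` helper; neither is taken from the Drop codebase, and the key would come from the secret store in practice.

```typescript
import { createHmac } from "node:crypto";

// Keyed hash so IDs cannot be reversed or recomputed without the key (assumed scheme).
function pseudonymizeUserId(userId: string, key: string): string {
  const digest = createHmac("sha256", key).update(userId).digest("hex");
  return `usr_${digest.slice(0, 16)}`;
}

type LogLevel = "debug" | "info" | "warn" | "error";

interface LogFields {
  [field: string]: string | number | boolean;
}

// Builds one JSON log line in the structure shown above.
function formatLog(
  level: LogLevel,
  service: string,
  message: string,
  userId: string,
  fields: LogFields,
  key: string,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields, // spread first so callers cannot overwrite the core fields below
    timestamp: now.toISOString(),
    level,
    service,
    user_id: pseudonymizeUserId(userId, key),
    message,
  });
}
```

Because the same key always maps a user to the same pseudonym, log lines for one user can still be correlated during debugging without exposing the raw ID.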
 
 4.4 Distributed Tracing (Tempo) 
 Implementation: OpenTelemetry 
 Why OpenTelemetry: 
 
 Vendor-neutral standard 
 Supports all our languages (Java, Node.js, Dart) 
 Auto-instrumentation available 
 Future-proof (industry standard) 
 
 Trace Critical Paths: 
 
 User login (app -> API -> auth -> DB) 
 Payment initiation (app -> API -> payment -> BaaS -> ledger) 
 Card transaction (webhook -> processor -> notification) 
 
 Sampling Strategy: 
 
 100% for errors 
 100% for slow requests (>1s) 
 10% for successful requests (MVP) 
 1% for successful requests (scale) 
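
The sampling rules above can be expressed as a single keep-or-drop decision. Note that keeping all errors and all slow requests requires deciding after the request has finished (tail-based sampling), because neither is known when the trace starts. The sketch below is a standalone illustration with assumed names, not an OpenTelemetry sampler implementation.

```typescript
// Hypothetical summary of a completed trace.
interface FinishedTrace {
  isError: boolean;
  durationMs: number;
}

type Phase = "mvp" | "scale";

// Rates for successful, fast requests, as listed in the strategy above.
const SUCCESS_SAMPLE_RATE: Record<Phase, number> = { mvp: 0.1, scale: 0.01 };
const SLOW_THRESHOLD_MS = 1000;

// `random` is injectable so the decision can be tested deterministically.
function shouldKeepTrace(
  trace: FinishedTrace,
  phase: Phase,
  random: () => number = Math.random,
): boolean {
  if (trace.isError) return true; // 100% of errors
  if (trace.durationMs > SLOW_THRESHOLD_MS) return true; // 100% of slow requests
  return random() < SUCCESS_SAMPLE_RATE[phase];
}
```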
 
 4.5 Real User Monitoring (RUM) 
 For Web (Next.js): 
 
 Grafana Faro (free, part of Grafana) 
 Captures: Page load, Web Vitals, JS errors 
 
 For Mobile (Flutter): 
 
 Custom implementation with OpenTelemetry 
 Track: App start time, screen transitions, API calls 
 
 Key Metrics: 
 
 
 
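| Metric | Target | Threshold |
|---|---|---|
| LCP (Largest Contentful Paint) | <2.5s | <4s |
| INP (Interaction to Next Paint; replaced FID as a Core Web Vital in March 2024) | <200ms | <500ms |
| CLS (Cumulative Layout Shift) | <0.1 | <0.25 |
| App Cold Start | <2s | <3s |
| API Response (p95) | <500ms | <1s |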
 
 
 
 4.6 Grafana Cloud Cost Estimate 
 
 
 
| Component | MVP Usage | MVP Cost | Scale Usage | Scale Cost |
|---|---|---|---|---|
| Metrics | 10K series | Free | 50K series | EUR 150 |
| Logs | 50 GB/mo | Free | 200 GB/mo | EUR 200 |
| Traces | 10 GB/mo | Free | 50 GB/mo | EUR 100 |
| Total | - | EUR 0-50 | - | EUR 450 |
 
 
 
 
 5. Error Tracking 
 5.1 Recommendation: Sentry 
 Comparison: 
 
 
 
| Criteria | Sentry | Bugsnag | Rollbar |
|---|---|---|---|
| EU Hosting | Yes | Yes | No |
| Flutter SDK | Excellent | Good | Limited |
| Source Maps | Automatic | Automatic | Manual |
| Performance | Included | Separate | Included |
| Pricing (MVP) | Free | EUR 100 | EUR 100 |
| Pricing (Scale) | EUR 300 | EUR 400 | EUR 350 |
| Slack Integration | Native | Native | Native |
| Issue Grouping | Best | Good | Good |
 
 
 
 Decision: Sentry 
 
 Best Flutter support (critical for mobile) 
 EU data residency available 
 Excellent source map integration 
 Issue grouping reduces noise 
 Performance monitoring included 
 Generous free tier (5K errors/month) 
 
 5.2 Sentry Configuration 
 Projects: 
 
 fontelepay-web (Next.js frontend) 
 fontelepay-api (Node.js/Java backend) 
 fontelepay-mobile (Flutter app) 
 
 Settings: 
 // sentry.config.js (shown for the Next.js SDK)
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: "https://xxx@sentry.io/xxx",
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA,
  tracesSampleRate: 0.1, // 10% of transactions

  // Filter sensitive data before the event leaves the process
  beforeSend(event) {
    // Remove PII
    if (event.user) {
      delete event.user.email;
      delete event.user.ip_address;
    }
    return event;
  },
});
 
 Alert Rules: 
 
 
 
| Condition | Action | Priority |
|---|---|---|
| New issue (high severity) | Slack + PagerDuty | P1 |
| Issue spike (>10x baseline) | Slack + PagerDuty | P1 |
| New issue (medium) | Slack only | P2 |
| Regression (resolved issue reopened) | Slack | P2 |
 
 
 
 5.3 Source Maps 
 Web (Next.js): 
 
 Automatic upload via @sentry/nextjs 
 Hidden from production (security) 
 
 Mobile (Flutter): 
 
 Upload dSYM (iOS) and mapping files (Android) 
 Integrated with CI/CD 
 
 5.4 Sentry Cost Estimate 
 
 
 
| Phase | Events/Month | Cost |
|---|---|---|
| MVP | <5,000 | Free |
| Growth | ~50,000 | EUR 26/month |
| Scale | ~500,000 | EUR 300/month |
 
 
 
 
 6. Alerting & Incident Management 
 6.1 Phased Approach 
 MVP (Team <5): Slack + Grafana Alerts 
 
 Simple, no additional cost 
 On-call rotation manual 
 Suitable for low traffic 
 
 Growth (Team 5-15): Add PagerDuty 
 
 Proper escalation policies 
 On-call schedules 
 Mobile alerts 
 Incident timeline 
 
 Scale (Team 15+): Full Incident Management 
 
 PagerDuty + Statuspage 
 War room automation 
 Post-incident reviews 
 
 6.2 Alert Levels 
 
 
 
| Level | Response Time | Examples | Notification |
|---|---|---|---|
| P1 - Critical | 15 min | Payment processing down, data breach | PagerDuty + Slack + SMS |
| P2 - High | 1 hour | High error rate, degraded performance | PagerDuty + Slack |
| P3 - Medium | 4 hours | Non-critical service degraded | Slack only |
| P4 - Low | Next business day | Warning thresholds | Slack (daily digest) |
 
 
 
 6.3 Critical Alerts (P1) 
 
 
 
| Alert | Condition | Action |
|---|---|---|
| API Down | 0 successful requests for 2 min | Page on-call |
| Payment Failures | >5% failure rate for 5 min | Page on-call |
| Database Unreachable | Connection failures >10/min | Page on-call |
| Security Event | Suspicious activity detected | Page on-call + security |
| Error Spike | 10x baseline errors | Page on-call |
 
 
 
 6.4 On-Call Rotation 
 MVP Setup: 
 Week 1: Dev A (primary)
Week 2: Dev B (primary)
Week 3: Dev A (primary)
...

Escalation:
 0-15 min: Primary on-call
 15-30 min: Secondary on-call
 30+ min: Engineering lead
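
The rotation and escalation ladder above can be captured in two small functions. This is a sketch under stated assumptions: the boundary minutes (15 and 30) are treated as belonging to the next tier, which the source leaves ambiguous, and the function names are hypothetical.

```typescript
type Responder = "primary on-call" | "secondary on-call" | "engineering lead";

// Who should be paged, given how long the alert has gone unacknowledged.
function responderFor(minutesUnacknowledged: number): Responder {
  if (minutesUnacknowledged < 0) {
    throw new RangeError("minutesUnacknowledged must be non-negative");
  }
  if (minutesUnacknowledged < 15) return "primary on-call";
  if (minutesUnacknowledged < 30) return "secondary on-call";
  return "engineering lead";
}

// Weekly round-robin: week 1 goes to the first developer, week 2 to the next, and so on.
function primaryForWeek(weekNumber: number, devs: string[]): string {
  if (devs.length === 0) throw new Error("rotation needs at least one developer");
  if (!Number.isInteger(weekNumber) || weekNumber < 1) {
    throw new RangeError("weekNumber starts at 1");
  }
  return devs[(weekNumber - 1) % devs.length];
}
```

With `["Dev A", "Dev B"]` the rotation alternates exactly as in the MVP setup, and adding a third name extends it without code changes.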
 
 PagerDuty Cost: 
 
 
 
| Plan | Cost | Features |
|---|---|---|
| Free | EUR 0 | 5 users, basic |
| Professional | EUR 21/user/mo | Full features |
 
 
 
 MVP: Free tier (5 users) 
 Scale: Professional for core team 
 6.5 Incident Response Runbook Template 
 ## Incident: [Title]

### Detection
- Alert source: [Grafana/Sentry/PagerDuty]
- Time detected: [timestamp]
- Severity: [P1/P2/P3]

### Impact
- Users affected: [estimate]
- Services affected: [list]
- Financial impact: [if applicable]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Description]

### Resolution
[Steps taken]

### Action Items
- [ ] [Preventive measure]
- [ ] [Process improvement]

### Participants
- Incident Commander: [name]
- Responders: [names]
 
 
 7. Documentation 
 7.1 API Documentation 
 Recommendation: OpenAPI 3.1 + Swagger UI 
 Why: 
 
 Industry standard 
 Auto-generated from code annotations 
 Interactive testing 
 Client SDK generation 
 
 Implementation: 
 # openapi.yaml (partial)
openapi: 3.1.0
info:
  title: Drop API
  version: 1.0.0
  description: Mobile banking API

servers:
  - url: https://api.fontelepay.com/v1
    description: Production
  - url: https://api.staging.fontelepay.com/v1
    description: Staging

security:
  - bearerAuth: []

paths:
  /accounts/{id}/balance:
    get:
      summary: Get account balance
      tags: [Accounts]
      ...
 
 Hosting: 
 
 Swagger UI at /docs endpoint 
 Redoc as alternative (cleaner for external) 
 Postman collection export for testing 
 
 7.2 Runbooks 
 Location: /docs/runbooks/ in repository 
 Required Runbooks: 
 
 
 
| Runbook | Purpose |
|---|---|
| deploy-production.md | Production deployment steps |
| rollback.md | How to rollback a bad deploy |
| database-migration.md | Safe DB migration process |
| incident-response.md | General incident handling |
| scaling.md | How to scale services |
| secrets-rotation.md | Rotating API keys, certs |
| disaster-recovery.md | Full recovery procedures |
 
 
 
 Runbook Template: 
 # Runbook: [Title]

## Overview
[What this runbook covers]

## Prerequisites
- [ ] Access to [system]
- [ ] Permissions: [list]

## Steps
1. [Step with command examples]
2. [Step with verification]

## Verification
[How to confirm success]

## Rollback
[If something goes wrong]

## Contacts
- Primary: [name/slack]
- Escalation: [name/slack]
 
 7.3 Architecture Decision Records (ADRs) 
 Location: /docs/adr/ in repository 
 Format: 
 # ADR-001: Use PostgreSQL as Primary Database

## Status
Accepted

## Context
We need a reliable, ACID-compliant database for financial transactions.

## Decision
Use PostgreSQL 16 as our primary database.

## Consequences
### Positive
- Strong ACID compliance
- Excellent JSON support
- Proven in fintech

### Negative
- Requires more ops than managed NoSQL
- Horizontal scaling more complex

## Alternatives Considered
- MySQL: Less JSON support
- MongoDB: Not ACID by default
- CockroachDB: Higher cost, complexity
 
 Key ADRs to Create: 
 
 ADR-001: Database selection (PostgreSQL) 
 ADR-002: Cloud provider (AWS) 
 ADR-003: BaaS provider (Swan) 
 ADR-004: Mobile framework (Flutter) 
 ADR-005: Monitoring stack (Grafana) 
 ADR-006: CI/CD platform (GitHub Actions) 
 
 7.4 Documentation Tooling 
 
 
 
| Type | Tool | Cost |
|---|---|---|
| API Docs | Swagger/OpenAPI | Free |
| Internal Docs | Notion or Confluence | Free-EUR 50/mo |
| Runbooks | Git repository | Free |
| Diagrams | Mermaid (in Markdown) | Free |
| Postmortems | Notion template | Free |
 
 
 
 
 8. Security Operations 
 8.1 Dependency Scanning 
 Recommendation: Snyk 
 Why Snyk: 
 
 Best JavaScript/TypeScript support 
 Dart/Flutter support 
 Automatic PR fixes 
 License compliance 
 Container scanning 
 
 Integration: 
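 # .github/workflows/security.yml
- name: Snyk Security Scan
  uses: snyk/actions/node@master
  env:
    # The Snyk action authenticates with this token; store it as a repository secret
    SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
  with:
    args: --severity-threshold=high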
 
 Policy: 
 
 
 
| Severity | Action | SLA |
|---|---|---|
| Critical | Block PR, fix immediately | 24 hours |
| High | Block PR, fix before merge | 72 hours |
| Medium | Warning, fix in sprint | 2 weeks |
| Low | Track, fix when convenient | 1 month |
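
The severity policy above maps directly onto a lookup table. The sketch below encodes it so a CI step could decide whether a PR is blocked and when a fix is due. It assumes "1 month" means 30 days, and the type and function names are hypothetical rather than part of Snyk's API.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

interface SeverityPolicy {
  blocksPr: boolean;
  slaHours: number;
}

// Values mirror the policy table; "2 weeks" = 14 days, "1 month" assumed to be 30 days.
const VULNERABILITY_POLICY: Record<Severity, SeverityPolicy> = {
  critical: { blocksPr: true, slaHours: 24 },
  high: { blocksPr: true, slaHours: 72 },
  medium: { blocksPr: false, slaHours: 14 * 24 },
  low: { blocksPr: false, slaHours: 30 * 24 },
};

// A PR is blocked as soon as any finding has a blocking severity.
function prIsBlocked(findings: Severity[]): boolean {
  return findings.some((s) => VULNERABILITY_POLICY[s].blocksPr);
}

// Deadline for fixing a finding discovered at `foundAt`.
function fixDeadline(severity: Severity, foundAt: Date): Date {
  const ms = VULNERABILITY_POLICY[severity].slaHours * 60 * 60 * 1000;
  return new Date(foundAt.getTime() + ms);
}
```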
 
 
 
 Snyk Cost: 
 
 
 
| Plan | Cost | Limits |
|---|---|---|
| Free | EUR 0 | 200 tests/month |
| Team | EUR 52/dev/mo | Unlimited |
 
 
 
 MVP: Free tier 
 Scale: Team plan 
 8.2 Secret Management 
 Recommendation: AWS Secrets Manager 
 Why AWS Secrets Manager: 
 
 Native AWS integration (using AWS already) 
 Automatic rotation support 
 Audit trail via CloudTrail 
 GDPR compliant (EU region) 
 No additional infrastructure 
 
 Alternative: HashiCorp Vault 
 
 More features but more operational overhead 
 Consider for Scale phase if multi-cloud 
 
 Secrets to Manage: 
 
 
 
| Secret | Rotation | Access |
|---|---|---|
| Database credentials | 90 days | Backend services |
| API keys (Swan, Stripe) | 180 days | Backend services |
| JWT signing keys | 365 days | Auth service |
| Encryption keys | Never (versioned) | All services |
 
 
 
 Implementation: 
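 // secrets.ts
import { SecretsManager } from '@aws-sdk/client-secrets-manager';

const client = new SecretsManager({ region: 'eu-central-1' });

export async function getSecret(name: string): Promise<string> {
  const response = await client.getSecretValue({ SecretId: name });
  // SecretString is undefined for binary secrets, so fail loudly instead of asserting non-null
  if (response.SecretString === undefined) {
    throw new Error(`Secret ${name} has no string value`);
  }
  return response.SecretString;
}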
 
 AWS Secrets Manager Cost: 
 
 
 
| Secrets | Cost |
|---|---|
| 10 secrets | EUR 4/month |
| 50 secrets | EUR 20/month |
| 100 secrets | EUR 40/month |
 
 
 
 8.3 Penetration Testing 
 Schedule: 
 
 
 
| Test Type | Frequency | Provider |
|---|---|---|
| Automated DAST | Weekly | OWASP ZAP |
| Web App Pen Test | Quarterly | External firm |
| Mobile App Pen Test | Quarterly | External firm |
| Infrastructure Pen Test | Annually | External firm |
 
 
 
 Budget: 
 
 
 
| Test | Cost |
|---|---|
| Web + API Pen Test | EUR 5,000-10,000 |
| Mobile Pen Test | EUR 5,000-8,000 |
| Infrastructure | EUR 8,000-15,000 |
| Annual Total | EUR 25,000-45,000 |
 
 
 
 EU-Based Pen Testing Firms: 
 
 Cure53 (Germany) - Excellent reputation 
 Securitum (Poland) - Cost-effective 
 WithSecure (Finland) - Enterprise grade 
 Secura (Netherlands) - Banking expertise 
 
 8.4 Security Monitoring 
 SIEM Considerations: 
 
 MVP: CloudWatch + Grafana alerts (sufficient) 
 Scale: Consider AWS Security Hub or Elastic SIEM 
 
 Security Alerts: 
 
 
 
| Event | Action |
|---|---|
| Failed login spike | Alert + temp block |
| New device login | User notification |
| Large transfer | Manual review queue |
| Admin action | Audit log + alert |
| API key usage anomaly | Alert + investigate |
 
 
 
 8.5 Compliance Automation 
 Tools: 
 
 AWS Config - Configuration compliance 
 Prowler - AWS security assessment (free) 
 Checkov - Infrastructure as code scanning 
 
 Automated Checks: 
 
 S3 buckets not public 
 Encryption at rest enabled 
 Security groups not overly permissive 
 IAM policies least-privilege 
 Audit logging enabled 
 
 
 9. Cost Summary 
 9.1 MVP Phase (Monthly) 
 
 
 
| Category | Tool | Cost (EUR) |
|---|---|---|
| CI/CD | GitHub Actions | 20-50 |
| Monitoring | Grafana Cloud (free tier) | 0-50 |
| Error Tracking | Sentry (free tier) | 0 |
| Alerting | Slack + PagerDuty Free | 0 |
| Security | Snyk (free tier) | 0 |
| Secrets | AWS Secrets Manager | 10 |
| Testing | Playwright, k6 (OSS) | 0 |
| Total | | EUR 30-110 |
 
 
 
 9.2 Growth Phase (Monthly) 
 
 
 
| Category | Tool | Cost (EUR) |
|---|---|---|
| CI/CD | GitHub Actions | 100-150 |
| Monitoring | Grafana Cloud | 200-400 |
| Error Tracking | Sentry Team | 100-300 |
| Alerting | PagerDuty Professional | 100-200 |
| Security | Snyk Team | 200-400 |
| Secrets | AWS Secrets Manager | 20-40 |
| Testing | k6 Cloud (load testing) | 100-200 |
| Total | | EUR 820-1,690 |
 
 
 
 9.3 Scale Phase (Monthly) 
 
 
 
| Category | Tool | Cost (EUR) |
|---|---|---|
| CI/CD | GitHub Actions + ArgoCD | 200-300 |
| Monitoring | Grafana Cloud | 500-1,000 |
| Error Tracking | Sentry Business | 300-500 |
| Alerting | PagerDuty + Statuspage | 300-500 |
| Security | Snyk + DAST | 500-800 |
| Secrets | AWS Secrets Manager | 40-60 |
| Testing | k6 Cloud | 200-400 |
| Documentation | Confluence | 50-100 |
| Total | | EUR 2,090-3,660 |
 
 
 
 9.4 Annual Security Costs 
 
 
 
| Item | Cost (EUR) |
|---|---|
| Penetration Testing (4x/year) | 25,000-45,000 |
| Compliance Audit (annual) | 10,000-20,000 |
| Security Training | 2,000-5,000 |
| Total | EUR 37,000-70,000 |
 
 
 
 
 10. Implementation Priority 
 10.1 Phase 1: Foundation (Week 1-2) 
 Must Have: 
 
 GitHub Actions basic pipeline (lint, test, build) 
 Sentry error tracking (all environments) 
 Basic Slack alerting 
 AWS Secrets Manager setup 
 Snyk dependency scanning 
 
 Outcome: Can deploy safely with visibility into errors 
 10.2 Phase 2: Observability (Week 3-4) 
 Must Have: 
 
 Grafana Cloud setup (metrics, logs) 
 Prometheus metrics in application 
 Structured logging (JSON) 
 Basic dashboards (RED metrics) 
 Critical alerts configured 
 
 Outcome: Can monitor application health 
 10.3 Phase 3: Testing (Week 5-6) 
 Must Have: 
 
 Unit test coverage >70% 
 Integration tests for critical paths 
 Playwright E2E for happy paths 
 k6 load test baseline 
 Test runs in CI/CD 
 
 Outcome: Confidence in deployments 
 10.4 Phase 4: Security (Week 7-8) 
 Must Have: 
 
 CodeQL SAST enabled 
 OWASP ZAP in staging 
 Security headers configured 
 Audit logging implemented 
 First penetration test scheduled 
 
 Outcome: Security baseline established 
 10.5 Phase 5: Operations (Week 9-12) 
 Should Have: 
 
 PagerDuty on-call rotation 
 Runbooks for critical scenarios 
 Disaster recovery tested 
 OpenAPI documentation complete 
 ADRs documented 
 
 Outcome: Production-ready operations 
 10.6 Checklist Summary 
 Week 1-2: CI/CD + Errors + Secrets
Week 3-4: Monitoring + Logs + Alerts
Week 5-6: Tests + E2E + Load
Week 7-8: Security + Audit + Pen Test
Week 9-12: On-call + Docs + DR
 
 
 11. Integration Diagram 
 ┌─────────────────────────────────────────────────────────────────────────────┐
│ DEVELOPER WORKFLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────────────────────────┐ │
│ │ Code │───>│ PR │───>│ GitHub Actions │ │
│ │ (IDE) │ │ (GitHub)│ │ ┌─────┐ ┌────┐ ┌────┐ ┌─────┐ ┌─────┐ │ │
│ └─────────┘ └─────────┘ │ │Lint │ │Test│ │SAST│ │Build│ │Snyk │ │ │
│ │ └──┬──┘ └──┬─┘ └──┬─┘ └──┬──┘ └──┬──┘ │ │
│ └────┼───────┼──────┼──────┼───────┼─────┘ │
│ └───────┴──────┴──────┴───────┘ │
│ │ │
└────────────────────────────────────────────────────┼────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT (ArgoCD) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Staging │────────>│ Canary │────────>│ Production │ │
│ │ (automatic) │ │ (5% traffic) │ │ (95% -> 100%)│ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │ │ │ │
│ └─────────────────────────┴─────────────────────────┘ │
│ │ │
└────────────────────────────────────┼────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER (AWS EKS) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ API Gateway│ │ Auth │ │ Payment │ │ Card │ │
│ │ (Kong) │ │ Service │ │ Service │ │ Service │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ └────────────────┴────────────────┴────────────────┘ │
│ │ │
│ ┌─────────────────────────┼─────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ Kafka │ │
│ │ (RDS) │ │(ElastiCache)│ │ (MSK) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
 │
 │ Telemetry
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY STACK │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ GRAFANA CLOUD (EU) │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Prometheus │ │ Loki │ │ Tempo │ │ │
│ │ │ (Metrics) │ │ (Logs) │ │ (Traces) │ │ │
│ │ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │ │
│ │ └─────────────────┴─────────────────┘ │ │
│ │ │ │ │
│ │ ┌──────┴──────┐ │ │
│ │ │ Dashboards │ │ │
│ │ │ & Alerts │ │ │
│ │ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Sentry │ │ PagerDuty │ │
│ │ (Error Track) │ │ (Alerting) │ │
│ └───────┬────────┘ └───────┬────────┘ │
│ │ │ │
│ └───────────────────┬───────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Slack │ │
│ │ (Notif Hub) │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ SECURITY LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Snyk │ │ CodeQL │ │ OWASP ZAP │ │ AWS Secrets │ │
│ │ (Deps) │ │ (SAST) │ │ (DAST) │ │ Manager │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
 
 
 Appendix A: Tool Links 
 
 
 
| Tool | URL | Purpose |
|---|---|---|
| GitHub Actions | github.com/features/actions | CI/CD |
| ArgoCD | argoproj.github.io/cd | GitOps deployment |
| Grafana Cloud | grafana.com/cloud | Monitoring |
| Sentry | sentry.io | Error tracking |
| PagerDuty | pagerduty.com | Incident management |
| Snyk | snyk.io | Security scanning |
| Playwright | playwright.dev | E2E testing |
| k6 | k6.io | Load testing |
| OpenTelemetry | opentelemetry.io | Observability |
 
 
 
 
 Appendix B: Decision Matrix 
 
 
 
| Decision | Options Considered | Winner | Key Factor |
|---|---|---|---|
| CI/CD | GitHub Actions, GitLab, CircleCI | GitHub Actions | Native GitHub, EU runners |
| Monitoring | Datadog, New Relic, Grafana | Grafana Cloud | Cost, EU hosting, open standards |
| E2E Testing | Playwright, Cypress | Playwright | Mobile web support, speed |
| Error Tracking | Sentry, Bugsnag, Rollbar | Sentry | Flutter SDK, EU hosting |
| Alerting | PagerDuty, Opsgenie, Slack | PagerDuty | Industry standard, free tier |
| Secrets | AWS SM, Vault, GCP SM | AWS Secrets Manager | Already on AWS, simple |
| Security | Snyk, Dependabot, Sonar | Snyk | Best JS/TS coverage |
 
 
 
 
 Appendix C: Compliance Mapping 
 
 
 
| Requirement | Solution | Evidence |
|---|---|---|
| PCI DSS 10.x (Logging) | Grafana Loki, 7yr retention | CloudTrail + Loki |
| GDPR (Data Residency) | Grafana EU, Sentry EU | Region configs |
| GDPR (Right to Erasure) | Pseudonymized logs | No PII in logs |
| SOC 2 (Change Mgmt) | GitHub PRs, ArgoCD | Audit trail |
| ISO 27001 (Incident) | PagerDuty, Runbooks | Incident records |
 
 
 
 
 Document created: 2026-02-05 
 Last updated: 2026-02-05 
 Author: DevOps Research

WAF Rules
WAF Rules — Drop Payment App 
 MC #1229 — Web Application Firewall configuration for Drop fintech. 
 Overview 
 Drop runs on Fly.io, which does not provide a built-in WAF, so protection is layered across three levels: 
 
 Middleware-level (Next.js Edge Middleware) — first line of defense 
 Fly.io Proxy — TLS termination, DDoS mitigation at network edge 
 Application-level — input validation, parameterized SQL, CSRF checks 
 
 Middleware WAF Rules (Implemented in src/drop-app/src/middleware.ts ) 
 1. CSRF Origin Validation 
 
 Rule: All mutation requests (POST/PUT/PATCH/DELETE) to /api/* must have valid Origin or Referer header 
 Action: Block with 403 
 Bypass: None 
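
The CSRF rule above can be sketched as a pure function over the request method, path, and headers. This is an illustration under stated assumptions, not the contents of `middleware.ts`: the function name, the header object shape, and the allow-list parameter are hypothetical.

```typescript
const MUTATING_METHODS = new Set(["POST", "PUT", "PATCH", "DELETE"]);

// Returns true when the request may proceed; false means respond with 403.
function passesOriginCheck(
  method: string,
  path: string,
  headers: { origin?: string; referer?: string },
  allowedOrigins: Set<string>,
): boolean {
  // Only mutation requests to /api/* are subject to the rule.
  if (!MUTATING_METHODS.has(method.toUpperCase())) return true;
  if (!path.startsWith("/api/")) return true;

  // Prefer Origin, fall back to Referer; a request with neither is rejected.
  const source = headers.origin ?? headers.referer;
  if (!source) return false;
  try {
    // URL.origin normalizes both headers to scheme://host[:port].
    return allowedOrigins.has(new URL(source).origin);
  } catch {
    return false; // unparsable values such as "null" are rejected
  }
}
```

Comparing parsed origins rather than raw strings avoids prefix tricks such as `https://getdrop.no.evil.example`.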
 
 2. Rate Limiting 
 
 Rule: Per-IP rate limits on auth endpoints (10 req/window) 
 Action: Block with 429 
 Scope: /api/auth/* 
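
A per-IP limit of this kind is commonly implemented as a fixed-window counter. The sketch below shows that technique; it is not the middleware's actual code, and the window length passed in is an assumption because the source states only "10 req/window".

```typescript
// Fixed-window counter keyed by client IP (in-memory, single instance only).
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it should get a 429.
  allow(ip: string, nowMs: number = Date.now()): boolean {
    const entry = this.hits.get(ip);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.hits.set(ip, { windowStart: nowMs, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Assumed configuration: 10 requests per 60-second window.
const authLimiter = new FixedWindowLimiter(10, 60_000);
```

An in-memory map only limits per process and never evicts idle IPs, so a multi-instance deployment would need shared storage and periodic cleanup.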
 
 3. Content-Security-Policy 
 
 Rule: Strict CSP with nonce-based script/style loading in production 
 Action: Browser enforcement (block inline scripts/styles without nonce) 
 Dev mode: unsafe-inline permitted for HMR 
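
A nonce-based policy needs a fresh random nonce per response and a header that references it. The following sketch shows one way to build both with Web Crypto globals; the directive list beyond `script-src` and `style-src` is an assumption, and the real header in `middleware.ts` may differ.

```typescript
// Fresh 128-bit nonce, base64-encoded, generated once per response.
function generateNonce(): string {
  const bytes = crypto.getRandomValues(new Uint8Array(16));
  return btoa(String.fromCharCode(...Array.from(bytes)));
}

// Builds the Content-Security-Policy header value.
function buildCsp(nonce: string, isDev: boolean): string {
  // Production allows only nonce-tagged inline code; dev relaxes this for HMR.
  const inlineSource = isDev ? "'unsafe-inline'" : `'nonce-${nonce}'`;
  return [
    "default-src 'self'",
    `script-src 'self' ${inlineSource}`,
    `style-src 'self' ${inlineSource}`,
    "object-src 'none'", // assumed hardening directive, not stated in the source
    "base-uri 'self'", // assumed hardening directive, not stated in the source
  ].join("; ");
}
```

The same nonce has to be attached to every inline `<script>` and `<style>` tag rendered for that response, otherwise the browser blocks them.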
 
 Recommended Reverse Proxy Rules (Fly.io / Cloudflare) 
 If a CDN or reverse proxy is added in front of Fly.io, configure these rules: 
 SQL Injection (SQLi) 
 
 Pattern: Block requests containing SQL keywords in query params and body:
 
 UNION SELECT , OR 1=1 , DROP TABLE , ; -- , ' OR ' 
 
 
 Action: Block with 403 
 Note: Drop uses parameterized queries exclusively — this is defense-in-depth 
 
 Cross-Site Scripting (XSS) 
 
 Pattern: Block requests containing:
 
 <script> , javascript: , on\w+= , <img.*onerror 
 
 
 Action: Block with 403 
 Note: React auto-escapes output; CSP blocks inline scripts 
 
 Path Traversal 
 
 Pattern: Block requests containing:
 
 ../ , ..\\ , %2e%2e , /etc/passwd , /proc/self 
 
 
 Action: Block with 403 
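
The SQLi, XSS, and path traversal patterns listed above could be evaluated by a small matcher like the one below. This is a sketch of signature matching in general, not a Cloudflare or Fly.io rule syntax; the regular expressions are one reading of the listed patterns, and the rule IDs are hypothetical.

```typescript
interface WafRule {
  id: string;
  pattern: RegExp;
}

// Case-insensitive signatures derived from the pattern lists above.
const WAF_RULES: WafRule[] = [
  { id: "sqli", pattern: /union\s+select|\bor\s+1\s*=\s*1\b|drop\s+table|;\s*--|'\s*or\s*'/i },
  { id: "xss", pattern: /<script\b|javascript:|\bon\w+\s*=|<img[^>]*onerror/i },
  { id: "path-traversal", pattern: /\.\.\/|\.\.\\|%2e%2e|\/etc\/passwd|\/proc\/self/i },
];

// Returns the ID of the first matching rule, or null if the input looks clean.
function matchWafRule(input: string): string | null {
  for (const rule of WAF_RULES) {
    if (rule.pattern.test(input)) return rule.id;
  }
  return null;
}
```

Signatures this broad produce false positives (for example, a legitimate field named `once=` matches the `on\w+=` pattern), which is one reason the document treats them as defense-in-depth behind parameterized queries and CSP.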
 
 Request Size Limits 
 
 Rule: Max request body 1MB (API), 10KB (auth endpoints) 
 Action: Block with 413 
 
 Geo-blocking (Optional) 
 
 Rule: Drop targets Norway/Scandinavia. Consider restricting to EU/EEA IPs for reduced attack surface. 
 Action: Block with 403 for non-allowed regions 
 Note: Requires Cloudflare or similar CDN with geo-IP support 
 
 Bot Protection 
 
 Rule: Rate limit on /api/auth/* endpoints (already in middleware) 
 Supplemental: Add CAPTCHA challenge after 3 failed BankID attempts 
 Action: Challenge or block 
 
 Implementation Priority 
 
 
 
| Priority | Rule | Status |
|---|---|---|
| P0 | CSRF Origin check | Implemented (middleware.ts) |
| P0 | CSP headers | Implemented (middleware.ts + next.config.ts) |
| P0 | Rate limiting | Implemented (per-endpoint) |
| P1 | Trivy container scan | Implemented (CI/CD) |
| P1 | npm audit | Implemented (CI/CD) |
| P2 | SQLi WAF rules | Pending (requires CDN/proxy) |
| P2 | XSS WAF rules | Pending (requires CDN/proxy) |
| P2 | Path traversal rules | Pending (requires CDN/proxy) |
| P3 | Geo-blocking | Pending (requires CDN/proxy) |
| P3 | Bot protection (CAPTCHA) | Pending (requires frontend integration) |
 
 
 
 Testing WAF Rules 
 When WAF rules are deployed via CDN: 
 # Test SQLi blocking
curl -X POST "https://getdrop.no/api/test" -d "id=1 OR 1=1"
# Expected: 403 Forbidden

# Test XSS blocking
curl -X POST "https://getdrop.no/api/test" -d "name=<script>alert(1)</script>"
# Expected: 403 Forbidden

# Test path traversal blocking
# --path-as-is stops curl from collapsing the ../ segments before sending the request
curl --path-as-is "https://getdrop.no/../../etc/passwd"
# Expected: 403 Forbidden
 
 Monitoring 
 
 All WAF blocks should be logged with: timestamp, rule ID, client IP, request path, matched pattern 
 Alert on >100 blocks/hour from single IP (potential attack) 
 Weekly WAF report for security review
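
The "more than 100 blocks per hour from a single IP" alert can be sketched as a sliding-window counter. The class below is a minimal in-memory illustration with hypothetical names; a real deployment would more likely derive this from the aggregated WAF logs.

```typescript
const BLOCK_WINDOW_MS = 60 * 60 * 1000; // one hour
const BLOCK_ALERT_THRESHOLD = 100;

// Tracks block timestamps per client IP over a sliding one-hour window.
class WafBlockMonitor {
  private readonly blocksByIp = new Map<string, number[]>();

  // Records one block and returns true when the IP has crossed the alert threshold.
  record(ip: string, nowMs: number = Date.now()): boolean {
    const recent = (this.blocksByIp.get(ip) ?? []).filter(
      (t) => nowMs - t < BLOCK_WINDOW_MS,
    );
    recent.push(nowMs);
    this.blocksByIp.set(ip, recent);
    return recent.length > BLOCK_ALERT_THRESHOLD;
  }
}
```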

Cloud Deployment Options
Cloud Deployment Options for Drop 
 
 Rebrand note (2026-02-14): Originally titled "FontelePay". Product rebranded to Drop . See Drop CLAUDE.md . 
 
 Date: 2026-02-05
 Purpose: Evaluate cloud deployment options for European mobile banking MVP 
 
 Requirements Summary 
 
 
 
| Requirement | Priority |
|---|---|
| Next.js support (static + SSR/API routes) | Must-have |
| EU data residency (GDPR) | Must-have |
| Financial compliance ready (PCI-DSS, SOC2) | Must-have |
| Cost-effective for MVP | High |
| Easy CI/CD integration | High |
| Scalability for production | Medium |
 
 
 
 
 Provider Comparison 
 Overview Table 
 
 
 
| Feature | Vercel | AWS (Amplify/Lambda) | Google Cloud Run |
|---|---|---|---|
| Next.js Support | Native (created by Vercel) | Full SSR support (v15) | Via container deployment |
| EU Regions | Edge caching only | Frankfurt, Ireland, Paris, Stockholm + ESC | Frankfurt, Belgium, Netherlands, Zurich |
| Data Residency | US-based storage* | Full EU residency available | Full EU residency available |
| PCI-DSS | v4.0 (SAQ-D AOC) | v4.0.1 certified | v4.0.1 certified |
| SOC 2 | Type 2 certified | Type 2 certified | Type 2 certified |
| ISO 27001 | Certified | Certified | Certified |
| GDPR | EU-US DPF certified | Compliant | Compliant |
| Ease of Use | Excellent | Moderate | Moderate |
| Vendor Lock-in | Medium | Low | Low |
 
 
 
 *Vercel: Static assets and function responses cached in EU, but primary storage remains US-based. 
 
 Detailed Analysis 
 1. Vercel 
 Strengths: 
 
 Native Next.js support (Vercel created Next.js) 
 Zero-config deployment from Git 
 Excellent DX (Developer Experience) 
 Edge Functions for low latency 
 Preview deployments per PR 
 PCI-DSS v4.0 compliant 
 SOC 2 Type 2, ISO 27001 certified 
 
 Weaknesses: 
 
 No true EU data residency - data primarily stored in US 
 Per-seat pricing scales poorly for teams 
 Limited backend flexibility 
 Enterprise tier required for some compliance features 
 
 Pricing: 
 
 
 
| Tier | Cost | Includes |
|---|---|---|
| Hobby | Free | 100GB bandwidth, limited features |
| Pro | $20/user/month | 1TB bandwidth, $20 credits, viewer seats free |
| Enterprise | Custom | SAML SSO, SLAs, dedicated support |
 
 
 
 GDPR Concern: Vercel is certified under EU-US Data Privacy Framework, but for banking applications requiring strict EU data residency, this may not be sufficient. Functions can run in EU regions, but metadata and logs may still traverse US infrastructure. 
 
 2. AWS (Amplify + Lambda) 
 Strengths: 
 
 True EU data residency with European Sovereign Cloud (ESC) 
 Full Next.js 15 SSR support via Amplify 
 140+ security certifications including PCI-DSS v4.0.1 
 Frankfurt region well-established for EU fintech 
 Pay-per-use with generous free tier 
 No per-seat pricing 
 Full infrastructure control 
 
 Weaknesses: 
 
 Steeper learning curve 
 Complex billing (multiple services) 
 Requires AWS expertise 
 CI/CD via external tools (GitHub Actions, GitLab) 
 
 Pricing (AWS Amplify): 
 
 
 
| Resource | Free Tier | Paid |
|---|---|---|
| Build minutes | 1,000/month | $0.01/min |
| Data served | 15 GB/month | $0.15/GB |
| Data stored | 5 GB/month | $0.023/GB |
| SSR requests | Varies | ~$0.20/1M |
 
 
 
 Estimated MVP Cost: $5-25/month for low-moderate traffic 
 European Sovereign Cloud (ESC): Launched January 2026, provides EU-resident personnel and hardware-enforced access restrictions. Ideal for regulated financial services. 
 
 3. Google Cloud Run 
 Strengths: 
 
 Containerized deployment (flexible) 
 Full EU data residency (Frankfurt, Belgium, Netherlands, Zurich) 
 PCI-DSS v4.0.1 and SOC 2 certified 
 Generous free tier 
 Auto-scaling to zero 
 Pay only for actual compute time 
 
 Weaknesses: 
 
 Requires containerization (Dockerfile) 
 No native Next.js integration 
 More DevOps overhead 
 Less seamless than Vercel for frontend 
 
 Pricing (Tier 1 - EU regions): 
 
 
 
 Resource 
 Free Tier 
 Paid 
 
 
 
 
 CPU 
 180,000 vCPU-seconds/month 
 $0.000024/vCPU-second 
 
 
 Memory 
 360,000 GiB-seconds/month 
 $0.0000025/GiB-second 
 
 
 Requests 
 2 million/month 
 $0.40/million 
 
 
 
 Estimated MVP Cost: $0-15/month for low-moderate traffic (often within free tier) 
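The estimate above can be reproduced from the pricing table with a small model. This is a rough sketch, not a billing calculator: the instance size, active hours and request counts in the two scenarios are illustrative assumptions, and only the unit prices and free-tier allowances come from the table.

```python
# Unit prices and free-tier allowances copied from the Cloud Run table above.
VCPU_PRICE = 0.000024        # USD per vCPU-second
MEM_PRICE = 0.0000025        # USD per GiB-second
REQ_PRICE = 0.40 / 1_000_000 # USD per request
FREE_VCPU_S = 180_000
FREE_GIB_S = 360_000
FREE_REQUESTS = 2_000_000


def cloud_run_monthly_cost(vcpu, mem_gib, active_seconds, requests):
    """Monthly cost in USD after subtracting the free-tier allowances."""
    vcpu_s = max(0, vcpu * active_seconds - FREE_VCPU_S)
    gib_s = max(0, mem_gib * active_seconds - FREE_GIB_S)
    billable_requests = max(0, requests - FREE_REQUESTS)
    return vcpu_s * VCPU_PRICE + gib_s * MEM_PRICE + billable_requests * REQ_PRICE


# Hypothetical scenarios (assumed, not measured): 1 vCPU, 0.5 GiB instance.
light = cloud_run_monthly_cost(1, 0.5, 40 * 3600, 500_000)     # 40 active hours
busy = cloud_run_monthly_cost(1, 0.5, 200 * 3600, 3_000_000)   # 200 active hours
print(f"light: ${light:.2f}, busy: ${busy:.2f}")
```

Under these assumptions the light scenario stays inside the free tier and the busy one lands at roughly $13, which is consistent with the $0-15/month range.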
 
 Compliance Matrix for Fintech 
 
 
 
 Certification 
 Vercel 
 AWS 
 GCP 
 Required for Drop 
 
 
 
 
 PCI-DSS v4.0+ 
 Yes 
 Yes 
 Yes 
 Yes (payment processing) 
 
 
 SOC 2 Type 2 
 Yes 
 Yes 
 Yes 
 Yes (enterprise clients) 
 
 
 ISO 27001 
 Yes 
 Yes 
 Yes 
 Recommended 
 
 
 GDPR 
 DPF 
 Full 
 Full 
 Yes (EU operations) 
 
 
 EU Data Residency 
 Partial 
 Full 
 Full 
 Critical 
 
 
 
 
 Recommendation 
 MVP Phase (0-6 months) 
 Primary: AWS Amplify (Frankfurt region) 
 Rationale: 
 
 True EU data residency - critical for banking MVP regulatory approval 
 Full Next.js support - SSR, API routes, ISR all work 
 Cost-effective - likely $10-25/month for MVP traffic, in line with the 12-month cost projection 
 Compliance-ready - PCI-DSS, SOC 2, ISO 27001 from day one 
 No per-seat pricing - scales with team growth 
 Path to production - same platform, just scale up 
 
 Setup recommendation: 
 
 Region: eu-central-1 (Frankfurt) 
 CI/CD: GitHub Actions 
 Database: Aurora Serverless or PlanetScale (EU region) 
 Auth: Cognito or Auth0 (EU tenant) 
 
 Production Phase (6+ months) 
 Stay with AWS but consider: 
 
 AWS European Sovereign Cloud (ESC) for maximum compliance 
 ECS/EKS for more control if needed 
 Multi-region deployment (Frankfurt + Ireland) for redundancy 
 
 Why Not Vercel? 
 Despite excellent DX, Vercel's partial EU data residency is a significant concern for a banking application. While Vercel is PCI-DSS compliant, regulators may question data flows through US infrastructure. For an MVP seeking banking licenses or partnerships, demonstrating full EU data residency is simpler with AWS or GCP. 
 Why Not GCP Cloud Run? 
 GCP is technically excellent but: 
 
 Requires containerization overhead 
 Less native Next.js support 
 Smaller fintech ecosystem in EU compared to AWS 
 AWS has more established EU banking relationships 
 
 
 Cost Projection (12 months) 
 
 
 
 Scenario 
 Vercel Pro 
 AWS Amplify 
 GCP Cloud Run 
 
 
 
 
 MVP (2 devs, 10k users) 
 $480/year 
 $120-300/year 
 $0-180/year 
 
 
 Growth (5 devs, 50k users) 
 $1,200/year 
 $300-600/year 
 $200-400/year 
 
 
 Scale (10 devs, 200k users) 
 $2,400/year 
 $600-1,500/year 
 $500-1,200/year 
 
 
 
 AWS and GCP costs vary with usage patterns; Vercel costs are fixed per seat. 
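The Vercel Pro column follows directly from the $20/user/month seat price. A minimal check, assuming every developer occupies a paid seat and that usage stays within the included allowances:

```python
VERCEL_PRO_SEAT_USD = 20  # per user per month, from the Vercel pricing table


def vercel_pro_annual(devs):
    """Annual Vercel Pro cost in USD for a given number of paid seats."""
    return VERCEL_PRO_SEAT_USD * devs * 12


for devs in (2, 5, 10):
    print(devs, "devs:", vercel_pro_annual(devs), "USD/year")
```

The three team sizes give $480, $1,200 and $2,400 per year, matching the table.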
 
 Action Items 
 
 Set up AWS account with Frankfurt region default 
 Configure Amplify for Next.js deployment 
 Implement GitHub Actions CI/CD pipeline 
 Document compliance controls for future audits 
 Evaluate AWS ESC when banking license process begins 
 
 
 Sources 
 
 Vercel Pricing 
 Vercel Security & Compliance 
 Vercel PCI Compliance Guide 
 AWS Amplify Pricing 
 AWS European Sovereign Cloud 
 AWS PCI DSS Compliance 
 Google Cloud Run Pricing 
 GCP PCI DSS Compliance 
 GCP SOC 2 Compliance

Infrastructure Overview
Infrastructure Resources 
 Infrastructure resources for Drop project: deployment, monitoring, CI/CD.

Cloud Migration Strategy — GCP → Azure
Cloud Migration Strategy — Drop 
 Date: 2026-02-18
 Status: Planning
 Decision: Azure as the production platform, GCP for dev/staging 
 
 SpareBank 1: Technical Stack (Research) 
 
 
 
 Layer 
 Technology 
 
 
 
 
 Cloud 
 Azure (primary): the Eunomia platform for 13 banks 
 
 
 Secondary 
 AWS (smaller workloads) 
 
 
 Backend 
 Kotlin/Java (Spring Boot) 
 
 
 Frontend 
 React + TypeScript 
 
 
 Orchestration 
 Kubernetes / OpenShift 
 
 
 Message queue 
 Apache Kafka 
 
 
 Authentication 
 BankID (Norwegian eID) 
 
 
 API Gateway 
 Axway 
 
 
 Partnership 
 Microsoft (strategic partner) 
 
 
 
 Drop's Current Stack 
 
 
 
 Layer 
 Technology 
 
 
 
 
 Frontend 
 Next.js 16 + React 19 + Tailwind v4 
 
 
 Backend 
 Next.js API Routes (Node.js) 
 
 
 Database 
 SQLite (better-sqlite3) → PostgreSQL (prod) 
 
 
 Auth 
 JWT (jose) in httpOnly cookies + BankID 
 
 
 Hosting 
 Fly.io (staging), Vercel (planned prod) 
 
 
 
 Migration Strategy 
 Phase 1: GCP Dev/Staging (NOW) 
 
 Free trial: $300 (kr 2,884) until 20 May 2026 
 Services: Cloud Run (containerized Next.js), Cloud SQL (PostgreSQL), Cloud Storage 
 Purpose: Development environment + CI/CD testing 
 No regulatory risk (test data only) 
 
 Phase 2: Azure Production (when credits arrive) 
 
 Microsoft Founders Hub: application submitted (#1362) 
 Services: Azure App Service or AKS, Azure Database for PostgreSQL, Azure Blob Storage 
 Purpose: Production environment 
 SpareBank 1 alignment: using the same cloud platform reduces friction in a partnership 
 
 Phase 3: Multi-Cloud Readiness 
 
 AWS: application submitted (#1360), backup/DR 
 Containerized architecture: Docker, and possibly Kubernetes, keeps Drop vendor-independent 
 
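 GCP Deploy Plan (Phase 1) 
 Step 1: Containerization 
 # Dockerfile for Drop Next.js
# Node 22 (Alpine) matches the runtime listed in the technology stack
FROM node:22-alpine
WORKDIR /app
# Copy the manifests first so the dependency layer is cached between builds
COPY package*.json ./
# Install all dependencies: `next build` needs devDependencies (TypeScript, Tailwind)
RUN npm ci
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]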
 
 Step 2: GCP Setup 
 
 Create the GCP project (already exists: project-72cd303f) 
 Enable the Cloud Run API 
 Create a Cloud SQL PostgreSQL instance (db-f1-micro for dev) 
 Configure secrets in Secret Manager 
 Set up Cloud Build for CI/CD 
 
 Step 3: Deploy Pipeline 
 # Build and push the container
gcloud builds submit --tag gcr.io/PROJECT_ID/drop-web

# Deploy to Cloud Run
gcloud run deploy drop-web \
 --image gcr.io/PROJECT_ID/drop-web \
 --platform managed \
 --region europe-north1 \
 --allow-unauthenticated \
 --set-env-vars DATABASE_URL=postgresql://...
 
 Step 4: DNS + SSL 
 
 Custom domain: drop.alai.no → Cloud Run 
 SSL: automatic via Google-managed certificates 
 CDN: Cloud CDN in front of Cloud Run (optional) 
 
 Regulatory Requirements (Finanstilsynet) 
 
 60-day notice before cloud outsourcing of financial services 
 Documentation requirements: security measures, data residency, exit strategy 
 Data residency: Europe (europe-north1 = Finland, closest to Norway) 
 Action: prepare the documentation BEFORE the production migration 
 
 Kotlin/Java Assessment 
 Decision: No. We keep Next.js/TypeScript 
 Why: 
 
 Drop is ~7,600 lines of TypeScript; a full rewrite to Kotlin would take 2-3 months 
 SpareBank 1's API is REST/GraphQL, so the language on our side is irrelevant 
 TypeScript full stack = one language, one team, faster iteration 
 Next.js API Routes are sufficient at our scale (fintech MVP) 
 If we need microservices later, we will consider Kotlin for specific services 
 
 Estimated Costs 
 
 
 
 Service 
 GCP (dev) 
 Azure (prod) 
 
 
 
 
 Compute 
 Cloud Run: ~$0 (free tier) 
 App Service B1: ~$13/mo 
 
 
 Database 
 Cloud SQL micro: ~$7/mo 
 PostgreSQL Basic: ~$25/mo 
 
 
 Storage 
 5GB: ~$0.10/mo 
 5GB: ~$0.10/mo 
 
 
 Total 
 ~$7/mo (covered by credits) 
 ~$38/mo (covered by credits) 
 
 
 
 Related Tasks 
 
 MC #1360: AWS credits application (submitted) 
 MC #1361: GCP credits application (submitted, free trial active) 
 MC #1362: Azure/Microsoft Founders Hub (submitted) 
 MC #1364: Anthropic credits application

Load Test Results — 2026-02-18
Load Test Results — Drop Staging 
 Date: 2026-02-18
 Tool: k6 v1.6.1
 Target: https://drop-staging.fly.dev
 Server: Fly.io shared-cpu-1x (256MB RAM, 1 shared CPU, Stockholm) 
 
 Test Setup 
 Two scenarios run concurrently: 
 Scenario 1: Public Stress (health + exchange rates) 
 
 Ramps from 0 → 200 concurrent users over 2m40s 
 No authentication; tests raw server capacity 
 
 Scenario 2: Authenticated user flow 
 
 Ramps from 0 → 30 concurrent users 
 Login → Dashboard → Transactions → Recipients → Profile 
 JWT token reused per VU 
 
 Results 
 
 
 
 Concurrent users 
 Median latency 
 p95 latency 
 Error rate 
 Status 
 
 
 
 
 1-10 
 74ms 
 ~90ms 
 0% 
 Works very well 
 
 
 25-50 
 ~500ms 
 ~3s 
 ~5% 
 Degradation begins 
 
 
 75-100 
 ~2-3s 
 ~6s 
 ~30% 
 Serious problems 
 
 
 150-200 
 3s+ 
 27s+ 
 47% 
 Effectively down 
 
 
 
 Detailed figures (k6 output) 
 Public endpoints: 
 
 health_duration : avg=1134ms, min=54ms, med=74ms, max=44s, p90=3.5s, p95=6.2s 
 rates_duration : avg=1077ms, min=55ms, med=74ms, max=45s, p90=3.3s, p95=6.4s 
 /api/rates failed in 95% of requests under high load 
 /api/health held up (always returned 200) 
 
 Authenticated endpoints: 
 
 http_req_duration (auth_flow) : med=3.26s, p90=14.8s, p95=27.6s 
 Dashboard and transaction history were hit hardest 
 
 Total: 
 
 12,841 HTTP requests over 2m53s 
 74 req/s on average 
 48.4% of all requests failed 
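The totals are internally consistent, which a few lines of arithmetic can confirm. This sketch only re-derives figures from the numbers reported above; it does not parse real k6 output.

```python
total_requests = 12_841
duration_s = 2 * 60 + 53      # 2m53s
failed_share = 0.484          # 48.4% of all requests

requests_per_second = total_requests / duration_s
failed_requests = round(total_requests * failed_share)

print(f"{requests_per_second:.1f} req/s, about {failed_requests} failed requests")
```

The result is about 74 req/s, with roughly 6,200 of the 12,841 requests failing.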
 
 
 Breaking Point 
 ~25-30 concurrent users 
 Beyond this point, response times explode and endpoints start to fail. 
 
 Bottlenecks identified 
 1. Machine resources (CRITICAL) 
 
 shared-cpu-1x = 256MB RAM, 1 shared CPU 
 Literally the smallest Fly.io plan 
 CPU saturation at ~50 concurrent requests 
 
 2. SQLite single-writer (HIGH) 
 
 SQLite WAL mode helps with concurrent reads 
 But ALL writes (rate_limits, sessions) are serialized 
 Under load: the write lock blocks reads 
 
 3. No caching (MEDIUM) 
 
 No Redis or in-memory cache 
 Exchange rates are fetched from the DB on every request 
 User sessions are validated against the DB every time 
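One way to relieve the exchange-rate lookups is a small per-instance cache with a time-to-live. The sketch below is written in Python purely for illustration (Drop itself is TypeScript); the class name, the 60-second TTL and the `load_rates` loader are hypothetical, not taken from the codebase.

```python
import time


class TTLCache:
    """Per-process cache that reloads a value once its time-to-live expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._store = {}             # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the database
        value = loader()
        self._store[key] = (now + self._ttl, value)
        return value


db_calls = 0


def load_rates():
    """Hypothetical stand-in for the database query behind /api/rates."""
    global db_calls
    db_calls += 1
    return {"rates": "placeholder"}


cache = TTLCache(ttl_seconds=60)
for _ in range(1000):
    cache.get_or_load("rates", load_rates)
print("database calls for 1000 requests:", db_calls)
```

A cache like this is lost on restart and is not shared between instances, so it suits slowly changing data such as rates better than session state.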
 
 4. bcrypt 12 rounds (MEDIUM) 
 
 Password hashing costs ~300ms of CPU per login 
 Quickly saturates the shared CPU during login bursts 
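The hashing cost puts a hard ceiling on login throughput. A back-of-the-envelope sketch, assuming the ~300 ms figure above and that hashing has a full core to itself (on a shared CPU the real ceiling is lower):

```python
HASH_CPU_SECONDS = 0.300  # ~300 ms of CPU per login, from the finding above


def max_logins_per_second(cpu_cores):
    """Upper bound on logins/s if the cores did nothing but hash passwords."""
    return cpu_cores / HASH_CPU_SECONDS


ceiling = max_logins_per_second(1)
burst_cpu_seconds = 30 * HASH_CPU_SECONDS  # the 30 VUs of scenario 2 logging in at once

print(f"ceiling: {ceiling:.1f} logins/s, 30-user burst: {burst_cpu_seconds:.0f} s of CPU")
```

With one core the ceiling is about 3.3 logins per second, so a burst of 30 simultaneous logins ties up around 9 seconds of CPU before any other request is served.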
 
 5. Single instance (MEDIUM) 
 
 No horizontal scaling 
 auto_stop_machines = stop → cold starts (3.8s first request) 
 min_machines_running = 0 → no always-on instances 
 
 
 Upgrade Plan 
 
 
 
 Upgrade 
 Effect 
 Cost 
 
 
 
 
 Fly.io performance-1x (2GB RAM, 1 dedicated CPU) 
 ~3x capacity (~75 users) 
 ~$30/mo 
 
 
 + PostgreSQL instead of SQLite 
 Concurrent writes, connection pooling 
 ~$15/mo 
 
 
 + Redis cache (rates, sessions) 
 10x faster on read endpoints 
 ~$10/mo 
 
 
 + 2 instances (auto-scale) 
 ~150+ users 
 ~$60/mo total 
 
 
 Full production setup 
 ~500+ users 
 ~$100/mo 
 
 
 
 
 Conclusion 
 For the MVP/demo with SpareBank 1: the current setup handles 10-15 concurrent users, which is sufficient for a demo. A pilot with real users needs at least PostgreSQL and a larger machine. 
 See also the Cloud Migration Strategy page (GCP → Azure) for the migration plan.

GCP Architecture — Cloud Run + Cloud SQL
GCP Architecture for Drop 
 Date: 2026-02-18
 Region: europe-north1 (Finland, closest to Norway)
 Context: Migration from Fly.io shared-cpu-1x, which handles ~25 concurrent users 
 
 Current Fly.io vs GCP: Comparison 
 Tier 1: Minimum (Dev/Demo), ~25 users 
 
 
 
 Component 
 Fly.io (now) 
 GCP equivalent 
 GCP cost 
 
 
 
 
 Compute 
 shared-cpu-1x (256MB) 
 Cloud Run: 1 vCPU, 512MB 
 ~$0 (free tier) 
 
 
 Database 
 SQLite on Fly Volume 
 Cloud SQL db-f1-micro (0.6GB, 10GB) 
 ~$9/mo 
 
 
 Cache 
 None 
 None 
 $0 
 
 
 Total 
 ~$5/mo 
 
 ~$9/mo 
 
 
 Capacity 
 ~25 concurrent 
 
 ~25 concurrent 
 
 
 
 Tier 2: Pilot (SpareBank 1 demo), ~100 users 
 
 
 
 Component 
 GCP service 
 Specification 
 Cost 
 
 
 
 
 Compute 
 Cloud Run 
 2 vCPU, 1GB RAM, min 1 instance 
 ~$15/mo 
 
 
 Database 
 Cloud SQL 
 db-g1-small (1.7GB, 20GB SSD) 
 ~$30/mo 
 
 
 Cache 
 Memorystore Redis 
 Basic 1GB 
 ~$35/mo 
 
 
 CDN 
 Cloud CDN 
 10GB egress 
 ~$1/mo 
 
 
 Secrets 
 Secret Manager 
 10 secrets 
 ~$0 (free tier) 
 
 
 Monitoring 
 Cloud Monitoring 
 Basic 
 ~$0 (free tier) 
 
 
 Total 
 
 
 ~$81/mo 
 
 
 Capacity 
 
 
 ~100-150 concurrent 
 
 
 
 Tier 3: Production (real users), ~500+ users 
 
 
 
 Component 
 GCP service 
 Specification 
 Cost 
 
 
 
 
 Compute 
 Cloud Run 
 2 vCPU, 2GB RAM, min 2 instances, auto-scale to 10 
 ~$50/mo 
 
 
 Database 
 Cloud SQL 
 Enterprise: 2 vCPU, 8GB RAM, 50GB SSD, HA 
 ~$150/mo 
 
 
 Cache 
 Memorystore Redis 
 Standard 1GB (HA) 
 ~$70/mo 
 
 
 CDN 
 Cloud CDN + Load Balancer 
 Global 
 ~$25/mo 
 
 
 Secrets 
 Secret Manager 
 
 ~$0 
 
 
 Monitoring 
 Cloud Monitoring + Logging 
 
 ~$10/mo 
 
 
 Backup 
 Automated DB backup 
 
 ~$5/mo 
 
 
 Total 
 
 
 ~$310/mo 
 
 
 Capacity 
 
 
 ~500-1000 concurrent 
 
 
 
 
 Cloud Run Pricing (Tier 1 region) 
 
 
 
 Resource 
 Price 
 Free tier 
 
 
 
 
 CPU 
 $0.000024/vCPU-second 
 180,000 vCPU-seconds/mo 
 
 
 Memory 
 $0.0000025/GiB-second 
 360,000 GiB-seconds/mo 
 
 
 Requests 
 $0.40/million 
 2 million/mo 
 
 
 Egress 
 $0.12/GB (after 1GB free) 
 1 GB/mo 
 
 
 
 The free tier covers ~50 hours with 1 vCPU + 256MB, enough for dev/staging with low traffic. 
 Important: The free tier applies only to us-central1/us-east1/us-west1. In europe-north1 EVERYTHING is billed from first use, but it is covered by the $300 free trial credits. 
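The ~50 hour figure comes from the vCPU allowance, which runs out long before the memory allowance for a small instance. A minimal sketch of that calculation, using only the allowances from the table:

```python
FREE_VCPU_SECONDS = 180_000
FREE_GIB_SECONDS = 360_000


def free_hours(vcpu, mem_gib):
    """Hours of instance time covered per month; the smaller allowance wins."""
    cpu_hours = FREE_VCPU_SECONDS / vcpu / 3600
    mem_hours = FREE_GIB_SECONDS / mem_gib / 3600
    return min(cpu_hours, mem_hours), cpu_hours, mem_hours


covered, cpu_hours, mem_hours = free_hours(vcpu=1, mem_gib=0.25)  # 1 vCPU + 256MB
print(f"covered: {covered:.0f} h (CPU {cpu_hours:.0f} h, memory {mem_hours:.0f} h)")
```

For 1 vCPU and 256MB the CPU allowance covers 50 hours and the memory allowance 400 hours, so CPU is the binding limit.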
 Cloud SQL PostgreSQL Pricing 
 
 
 
 Instance type 
 vCPU 
 RAM 
 Price (approx.) 
 
 
 
 
 db-f1-micro 
 Shared 
 0.6 GB 
 ~$9/mo 
 
 
 db-g1-small 
 Shared 
 1.7 GB 
 ~$27/mo 
 
 
 db-custom-2-8192 
 2 
 8 GB 
 ~$130/mo 
 
 
 
 Storage: ~$0.17/GB/mo (SSD)
 Backup: ~$0.08/GB/mo 
 
 Deployment architecture on GCP 
 ┌─────────────────────────────────────────────┐
│ Cloud CDN │
│ (static files + caching) │
└──────────────────┬──────────────────────────┘
 │
┌──────────────────▼──────────────────────────┐
│ Cloud Run Service │
│ drop-web (Next.js standalone) │
│ Region: europe-north1 │
│ Auto-scale: 0-10 instances │
│ CPU: 1-2 vCPU, RAM: 512MB-2GB │
└──────┬──────────────────┬───────────────────┘
 │ │
┌──────▼──────┐ ┌──────▼──────┐
│ Cloud SQL │ │ Memorystore │
│ PostgreSQL │ │ Redis │
│ europe-n1 │ │ (cache) │
│ Private IP │ │ Basic 1GB │
└─────────────┘ └─────────────┘

Alt: Serverless VPC Connector for private networking
 
 Migration Steps 
 Step 1: Containerization (day 1) 
 
 A Dockerfile already exists (Node 22 Alpine) 
 Switch SQLite → PostgreSQL via the DATABASE_URL env variable 
 Drop already has a PostgreSQL adapter in the code 
 
 Step 2: GCP setup (day 1-2) 
 # Select the project (already created: project-72cd303f)
gcloud config set project project-72cd303f-66e5-46ee-a4c

# Enable APIs
gcloud services enable run.googleapis.com
gcloud services enable sqladmin.googleapis.com
gcloud services enable secretmanager.googleapis.com
gcloud services enable cloudbuild.googleapis.com

# Cloud SQL instance
gcloud sql instances create drop-db \
 --database-version=POSTGRES_15 \
 --tier=db-f1-micro \
 --region=europe-north1 \
 --storage-size=10GB \
 --storage-type=SSD

# Create the database
gcloud sql databases create drop --instance=drop-db

# Create the user
gcloud sql users create drop-user \
 --instance=drop-db \
 --password=<generated>
 
 Step 3: Deploy to Cloud Run (day 2) 
 # Build and push the container
gcloud builds submit --tag gcr.io/PROJECT_ID/drop-web

# Deploy
gcloud run deploy drop-web \
 --image gcr.io/PROJECT_ID/drop-web \
 --platform managed \
 --region europe-north1 \
 --allow-unauthenticated \
 --memory 512Mi \
 --cpu 1 \
 --min-instances 0 \
 --max-instances 5 \
 --set-env-vars NODE_ENV=production \
 --set-cloudsql-instances PROJECT_ID:europe-north1:drop-db \
 --set-secrets DATABASE_URL=drop-db-url:latest
 
 Step 4: DNS + SSL (day 2) 
 # Custom domain
gcloud run domain-mappings create \
 --service drop-web \
 --domain drop-dev.alai.no \
 --region europe-north1
 
 Step 5: CI/CD (day 3) 
 
 Cloud Build trigger on GitHub push 
 Automatic deploy on push to the main/staging branch 
 Build + deploy takes ~2-3 minutes 
 
 
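 Cost Coverage 
 
 
 
 Source 
 Amount 
 Covers 
 
 
 
 
 GCP Free Trial 
 $300 (kr 2,884) 
 Tier 1+2 for ~3 months 
 
 
 GCP Startups Program (applied) 
 Up to $100,000 
 Everything for 1-2 years 
 
 
 Microsoft Founders Hub (applied) 
 Up to $150,000 Azure 
 Azure migration later 
 
 
 
 With the free trial alone: $300 / ~$9 per month (Tier 1) = ~33 months of dev by budget. With Tier 2 (~$81/mo) = ~3.7 months. The trial credits also have an expiry date (20 May 2026 according to the migration strategy), so for Tier 1 the usable window is set by that date rather than by the budget. 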
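The runway figures are a simple division of credit by monthly spend. The sketch below reproduces them and adds an optional cap for credits that expire; the 3-month cap in the last line is an assumption based on the "~3 months" entry in the table, not an exact date calculation.

```python
def runway_months(credit_usd, monthly_cost_usd, cap_months=None):
    """Months of spend a credit covers, optionally capped by an expiry window."""
    months = credit_usd / monthly_cost_usd
    return months if cap_months is None else min(months, cap_months)


tier1 = runway_months(300, 9)    # Tier 1 at ~$9/month
tier2 = runway_months(300, 81)   # Tier 2 at ~$81/month
tier1_capped = runway_months(300, 9, cap_months=3)  # assumed expiry window

print(f"Tier 1: {tier1:.1f}, Tier 2: {tier2:.1f}, Tier 1 capped: {tier1_capped}")
```

By budget alone Tier 1 lasts about 33 months and Tier 2 about 3.7 months; once an expiry window applies, Tier 1 is limited by time rather than money.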
 
 Alternative: a cheaper cache than Memorystore 
 Memorystore Redis (1GB basic = ~$35/mo) is expensive for an MVP. Alternatives: 
 
 
 
 Alternative 
 Price 
 Trade-off 
 
 
 
 
 Upstash Redis (serverless) 
 Free up to 10K commands/day 
 Zero cost for dev 
 
 
 In-memory cache in Cloud Run 
 $0 
 Lost on restart 
 
 
 Cloud Run + node-cache 
 $0 
 Simple, per-instance cache 
 
 
 
 Recommendation: Start without Redis. Add Upstash or an in-memory cache first. Use Memorystore only if we need a cache shared between instances. 
 
 Capacity estimate per tier 
 
 
 
 Tier 
 Concurrent users 
 Response time (p95) 
 Cost 
 
 
 
 
 Tier 1 (db-f1-micro, 1 vCPU) 
 ~25-30 
 <200ms 
 ~$9/mo 
 
 
 Tier 2 (db-g1-small, 2 instances) 
 ~100-150 
 <500ms 
 ~$81/mo 
 
 
 Tier 3 (2 vCPU DB, auto-scale) 
 ~500-1000 
 <300ms 
 ~$310/mo 
 
 
 
 PostgreSQL alone gives ~2-3x better concurrent performance than SQLite, thanks to connection pooling and parallel writes.

AWS Deploy — App Runner + RDS (Live)
AWS Deploy — Drop Staging 
 Date: 2026-02-18
 Status: LIVE
 Region: eu-west-1 (Ireland) 
 
 Infrastructure 
 
 
 
 Component 
 Service 
 Details 
 
 
 
 
 Compute 
 App Runner 
 1 vCPU, 2GB RAM, auto-scale 
 
 
 Container 
 ECR 
 324480209768.dkr.ecr.eu-west-1.amazonaws.com/drop-web 
 
 
 Database 
 RDS PostgreSQL 16.6 
 db.t3.micro (Free Tier), 20GB gp3 
 
 
 Region 
 eu-west-1 
 Ireland 
 
 
 
 URLs 
 
 
 
 Service 
 URL 
 
 
 
 
 App 
 https://9ef3szvvsb.eu-west-1.awsapprunner.com 
 
 
 RDS 
 drop-db.czu2qe4quy4v.eu-west-1.rds.amazonaws.com:5432 
 
 
 
 Credentials 
 
 
 
 Key 
 Value 
 
 
 
 
 AWS Account 
 324480209768 
 
 
 IAM User 
 john-deploy (AdministratorAccess) 
 
 
 RDS User 
 dropuser 
 
 
 RDS Database 
 dropapp 
 
 
 JWT Secret 
 [redacted: keep in Vaultwarden, not in BookStack] 
 
 
 
 Note: Passwords are kept in Vaultwarden, not in BookStack. 
 Load Test: Comparison 
 
 
 
 Metric 
 Fly.io (256MB) 
 AWS (2GB) 
 Improvement 
 
 
 
 
 Throughput 
 74 req/s 
 186 req/s 
 2.5x 
 
 
 Health p95 
 6,216ms 
 614ms 
 10x faster 
 
 
 Capacity 
 ~25 users 
 ~75-100 users 
 3-4x 
 
 
 
 Next steps 
 
 Connect App Runner to RDS PostgreSQL (DATABASE_URL); this requires a VPC Connector 
 Set up the custom domain (drop-staging.alai.no) 
 CI/CD via GitHub Actions → ECR → App Runner 
 Load test with PostgreSQL (further improvement expected) 
 
 Cost 
 
 
 
 Service 
 Estimated 
 
 
 
 
 App Runner (1 vCPU, 2GB) 
 ~$7/mo (idle) 
 
 
 RDS db.t3.micro 
 $0 (Free Tier, 12 months) 
 
 
 ECR 
 ~$1/mo 
 
 
 Total 
 ~$8/mo 
 
 
 
 Covered by AWS Activate credits ($1,000 applied for).

Cloud Audit
Cloud infrastructure audit and multi-cloud design

Cloud Audit: Resource Inventory
Drop — AWS Resource Inventory 
 Date: 2026-02-19
 Region: eu-west-1 (Ireland)
 Account: Drop production
 Auditor: infra-lead (CloudForge cloud-audit team)
 MC Task: #1443 
 
 Executive Summary 
 Drop runs a minimal AWS footprint: one App Runner service fronting a PostgreSQL RDS instance, with container images stored in ECR. Total estimated cost is $46-57/month (see Cost Breakdown). 
 Four CRITICAL security findings (C1-C4) require immediate action, grouped here into three areas: 
 
 RDS database is publicly accessible with security group open to the entire internet (0.0.0.0/0 on port 5432) 
 Database storage is unencrypted 
 Plaintext secrets (DATABASE_URL with password, JWT_SECRET) in App Runner environment variables 
 
 No WAF, no CloudFront, no CloudWatch monitoring, no Route53 DNS management, and Secrets Manager is provisioned but empty. 
 
 Resource Table 
 
 
 
 Resource 
 Type 
 ID / Name 
 Region 
 Status 
 Key Config 
 
 
 
 
 App Runner 
 Service 
 drop-web 
 eu-west-1 
 RUNNING 
 1 vCPU, 2 GB RAM, port 3000 
 
 
 RDS 
 PostgreSQL 16.6 
 drop-db 
 eu-west-1a 
 Available 
 db.t3.micro, 20 GB gp3, single-AZ 
 
 
 ECR 
 Repository 
 drop-web 
 eu-west-1 
 Active 
 ScanOnPush: TRUE, Encryption: AES256 
 
 
 Security Group 
 SG 
 drop-db-sg 
 eu-west-1 
 In use 
 Inbound: 0.0.0.0/0 : 5432 
 
 
 VPC 
 Default 
 — 
 eu-west-1 
 Active 
 172.31.0.0/16 
 
 
 IAM User 
 User 
 john-deploy 
 Global 
 Active 
 Programmatic access 
 
 
 IAM Role 
 Role 
 AppRunnerECRAccessRole 
 Global 
 Active 
 ECR pull permissions 
 
 
 Secrets Manager 
 — 
 (empty) 
 eu-west-1 
 Provisioned 
 0 secrets stored 
 
 
 CloudWatch 
 — 
 — 
 — 
 NOT CONFIGURED 
 No alarms, no dashboards 
 
 
 CloudFront 
 — 
 — 
 — 
 NOT PROVISIONED 
 No CDN 
 
 
 WAF 
 — 
 — 
 — 
 NOT PROVISIONED 
 No web application firewall 
 
 
 Route53 
 — 
 — 
 — 
 NOT PROVISIONED 
 DNS managed externally 
 
 
 S3 
 — 
 — 
 — 
 NOT PROVISIONED 
 No buckets 
 
 
 
 
 Architecture Diagram 
 INTERNET
 |
 | HTTPS (public ingress)
 v
 +------------------+
 | App Runner |
 | drop-web |
 | |
 | 1 vCPU / 2 GB |
 | Port 3000 |
 | ECR source |
 | |
 | ENV (plaintext):|
 | - DATABASE_URL |
 | - JWT_SECRET |
 +--------+---------+
 |
 | VPC Connector (egress)
 |
 +-------------+-------------+
 | Default VPC |
 | 172.31.0.0/16 |
 | |
 | +-------------------+ |
 | | drop-db-sg | |
 | | 0.0.0.0/0:5432 | |
 | +--------+----------+ |
 | | |
 | +--------v----------+ |
 | | RDS | |
 | | drop-db | |
 | | | |
 | | PostgreSQL 16.6 | |
 | | db.t3.micro | |
 | | 20 GB gp3 | |
 | | single-AZ (a) | |
 | | | |
 | | Public: YES | |
 | | Encrypted: NO | |
 | | Backup: 7 days | |
 | | DeletionProt: ON| |
 | | Monitoring: OFF | |
 | +-------------------+ |
 +---------------------------+

 +----------+ +---------------------+
 | ECR | | Secrets Manager |
 | drop-web | | (EMPTY) |
 | ScanPush | +---------------------+
 +----------+

 +----------+ +---------------------+
 | IAM | | MISSING |
 | john- | | CloudWatch |
 | deploy | | CloudFront |
 | ECR Role | | WAF / Route53 / S3|
 +----------+ +---------------------+
 
 
 Security Findings 
 CRITICAL 
 
 
 
 # 
 Finding 
 Resource 
 Risk 
 Remediation 
 
 
 
 
 C1 
 Database publicly accessible 
 RDS drop-db 
 Direct internet access to PostgreSQL. Any attacker can attempt connections. 
 Set PubliclyAccessible=false . App Runner already uses VPC Connector for egress — RDS only needs private subnet access. 
 
 
 C2 
 Security group allows 0.0.0.0/0 on port 5432 
 drop-db-sg 
 Combined with C1, the database is wide open to brute-force and exploitation from any IP on Earth. 
 Restrict inbound rule to App Runner VPC Connector security group only. Remove 0.0.0.0/0 CIDR. 
 
 
 C3 
 Plaintext secrets in App Runner env vars 
 App Runner drop-web 
 DATABASE_URL contains full connection string with password. JWT_SECRET in plaintext. Anyone with console/API access sees credentials. Visible in CloudTrail, config exports, and deployment logs. 
 Migrate secrets to AWS Secrets Manager (already provisioned, currently empty). Reference via App Runner secret ARN configuration. Rotate both DATABASE_URL password and JWT_SECRET after migration. 
 
 
 C4 
 Database storage unencrypted 
 RDS drop-db 
 Data at rest is not encrypted. Violates baseline security posture and most compliance frameworks (SOC2, GDPR, PCI). 
 Enable storage encryption. Note: cannot enable on existing instance — requires snapshot, restore to encrypted instance, DNS/connection swap. Plan downtime window. 
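Findings C1 and C2 hinge on one condition: an ingress rule that opens the database port to every address. The sketch below checks for that condition on a simplified rule list. The dictionary shape is a hypothetical simplification for illustration, not the format returned by the AWS API, and the restricted example uses the default VPC range only as a stand-in for a tighter source.

```python
import ipaddress


def open_ingress(rules, port):
    """Return the rules that expose `port` to every IPv4 or IPv6 address."""
    exposed = []
    for rule in rules:
        if not rule["from_port"] <= port <= rule["to_port"]:
            continue
        # A prefix length of 0 means 0.0.0.0/0 or ::/0, i.e. the whole internet.
        if any(ipaddress.ip_network(cidr).prefixlen == 0 for cidr in rule["cidrs"]):
            exposed.append(rule)
    return exposed


# Current state of drop-db-sg as described in finding C2.
drop_db_sg = [
    {"protocol": "tcp", "from_port": 5432, "to_port": 5432, "cidrs": ["0.0.0.0/0"]},
]
# Illustrative restricted rule (assumed source range).
restricted_sg = [
    {"protocol": "tcp", "from_port": 5432, "to_port": 5432, "cidrs": ["172.31.0.0/16"]},
]

print("current:", len(open_ingress(drop_db_sg, 5432)), "open rule(s)")
print("restricted:", len(open_ingress(restricted_sg, 5432)), "open rule(s)")
```

A check of this kind could run in CI against exported security group definitions so that an open database port fails the build instead of reaching production.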
 
 
 
 HIGH 
 
 
 
 # 
 Finding 
 Resource 
 Risk 
 Remediation 
 
 
 
 
 H1 
 Single-AZ deployment 
 RDS drop-db 
 AZ failure = full database outage. No automatic failover. 
 Enable Multi-AZ for production. Cost increase ~$14/mo for db.t3.micro. 
 
 
 H2 
 No monitoring or alerting 
 CloudWatch (missing) 
 No CPU, memory, connection, or storage alarms. No visibility into failures, performance degradation, or security events. Silent failures. 
 Configure CloudWatch alarms: CPU > 80%, FreeStorageSpace < 2 GB, DatabaseConnections > 80%, FreeableMemory < 200 MB. Enable Enhanced Monitoring on RDS. 
 
 
 H3 
 No WAF 
 WAF (missing) 
 No protection against OWASP Top 10 attacks (SQLi, XSS, SSRF, etc.) at the edge. App Runner public endpoint is directly exposed. 
 Deploy AWS WAF with managed rule groups (AWSManagedRulesCommonRuleSet, AWSManagedRulesSQLiRuleSet). Attach to CloudFront distribution (see M1). 
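The alarm thresholds proposed in H2 can be written down as plain conditions before they are turned into CloudWatch alarms. This is a sketch of the logic only: the metric names match the RDS CloudWatch metrics, but the sample values and the `max_connections` figure are hypothetical, and 2 GB / 200 MB are interpreted here as binary units.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def breached_alarms(sample, max_connections):
    """Return the names of the H2 thresholds that a metrics sample violates."""
    alerts = []
    if sample["CPUUtilization"] > 80.0:                        # percent
        alerts.append("CPUUtilization")
    if sample["FreeStorageSpace"] < 2 * GIB:                   # bytes
        alerts.append("FreeStorageSpace")
    if sample["DatabaseConnections"] > 0.8 * max_connections:  # share of the limit
        alerts.append("DatabaseConnections")
    if sample["FreeableMemory"] < 200 * MIB:                   # bytes
        alerts.append("FreeableMemory")
    return alerts


# Hypothetical sample: high CPU and low memory, storage and connections fine.
sample = {
    "CPUUtilization": 91.0,
    "FreeStorageSpace": 12 * GIB,
    "DatabaseConnections": 20,
    "FreeableMemory": 150 * MIB,
}
print(breached_alarms(sample, max_connections=80))
```

Note that DatabaseConnections is an absolute count, so the 80% threshold needs the instance's connection limit as an input.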
 
 
 
 MEDIUM 
 
 
 
 # 
 Finding 
 Resource 
 Risk 
 Remediation 
 
 
 
 
 M1 
 No CDN / CloudFront 
 CloudFront (missing) 
 All traffic hits App Runner origin directly. No edge caching, no DDoS protection (Shield Standard), higher latency for distant users. 
 Deploy CloudFront distribution in front of App Runner. Enables WAF attachment, caching, and Shield Standard. 
 
 
 M2 
 Default VPC 
 VPC 172.31.0.0/16 
 Default VPC has broad routing, public subnets by default, and no network segmentation. Not suitable for production workloads. 
 Create custom VPC with private subnets for RDS, public subnets for NAT Gateway / ALB if needed. Migrate RDS to private subnet. 
 
 
 M3 
 No DNS management 
 Route53 (missing) 
 DNS managed outside AWS. No health checks, no failover routing, no alias records for AWS resources. 
 Consider Route53 for DNS if domain is Drop-owned. Enables health-check-based routing and simpler AWS integration. 
 
 
 M4 
 TCP health check only 
 App Runner drop-web 
 TCP checks confirm port is open but not that the application is healthy. A process could accept connections while returning 500s. 
 Configure HTTP health check on a dedicated /health endpoint that verifies database connectivity. 
 
 
 
 LOW 
 
 
 
 # 
 Finding 
 Resource 
 Risk 
 Remediation 
 
 
 
 
 L1 
 No S3 buckets 
 S3 (missing) 
 If the app needs file storage in future, ensure encryption-at-rest (SSE-S3 or SSE-KMS), versioning, and public access block from day one. 
 Provision with secure defaults when needed. 
 
 
 L2 
 IAM user john-deploy 
 IAM 
 Long-lived access keys. No indication of key rotation policy or MFA. 
 Audit key age. Enable MFA. Consider OIDC federation for CI/CD instead of IAM user. Rotate keys on a 90-day schedule. 
 
 
 
 
 Cost Breakdown 
 
 
 
 Service 
 Specification 
 Estimated Monthly Cost 
 
 
 
 
 App Runner 
 1 vCPU, 2 GB, always running 
 $29 - $36 
 
 
 RDS 
 db.t3.micro, 20 GB gp3, single-AZ 
 $15 - $18 
 
 
 ECR 
 Image storage (~1-5 GB) 
 $0.50 - $1.00 
 
 
 Data Transfer 
 Minimal (< 10 GB/mo estimate) 
 $1 - $2 
 
 
 Secrets Manager 
 0 secrets (currently unused) 
 $0 
 
 
 Total 
 
 $46 - $57/mo 
 
 
 
 Cost Notes 
 
 App Runner pricing: $0.064/vCPU-hour + $0.007/GB-hour (provisioned mode) 
 RDS db.t3.micro: ~$0.018/hour ($13.14/mo) + $0.115/GB-month storage 
 No NAT Gateway cost (App Runner VPC Connector handles egress) 
 Adding Multi-AZ RDS: +$13-15/mo 
 Adding CloudFront: +$0-5/mo (depends on traffic) 
 Adding WAF: +$5-10/mo (depends on rules and requests) 
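The App Runner and RDS estimates can be related to the unit prices in the notes above. The sketch assumes a 730-hour month and assumes that App Runner bills memory whenever an instance is provisioned but bills vCPU only while the instance is actively handling requests; the active fractions are illustrative, not measured.

```python
HOURS_PER_MONTH = 730
VCPU_HOUR = 0.064      # USD, from the cost notes
GB_HOUR = 0.007        # USD, from the cost notes
RDS_HOUR = 0.018       # USD, db.t3.micro
RDS_STORAGE_GB = 0.115 # USD per GB-month


def app_runner_monthly(vcpu, mem_gb, active_fraction):
    """Memory billed all month, vCPU billed only for the active share."""
    memory = mem_gb * GB_HOUR * HOURS_PER_MONTH
    compute = vcpu * VCPU_HOUR * HOURS_PER_MONTH * active_fraction
    return memory + compute


def rds_monthly(storage_gb):
    return RDS_HOUR * HOURS_PER_MONTH + storage_gb * RDS_STORAGE_GB


for fraction in (0.0, 0.4, 0.55, 1.0):
    print(f"active {fraction:.0%}: ${app_runner_monthly(1, 2, fraction):.2f}")
print(f"RDS with 20 GB: ${rds_monthly(20):.2f}")
```

Under these assumptions an idle service costs about $10/month and a fully active one about $57/month, so the $29-36 estimate corresponds to the instance being active roughly 40-55% of the time. The RDS figure comes out at about $15.44, at the low end of the $15-18 range.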
 
 
 Gaps Analysis 
 
 
 
 Category 
 Current State 
 Target State 
 Priority 
 
 
 
 
 Secrets management 
 Plaintext env vars 
 Secrets Manager with rotation 
 CRITICAL 
 
 
 Network security 
 Public RDS + open SG 
 Private subnet + restricted SG 
 CRITICAL 
 
 
 Encryption at rest 
 Disabled 
 AES-256 (KMS or default) 
 CRITICAL 
 
 
 Monitoring 
 None 
 CloudWatch alarms + dashboards 
 HIGH 
 
 
 High availability 
 Single-AZ 
 Multi-AZ RDS 
 HIGH 
 
 
 Edge security 
 No WAF / CDN 
 CloudFront + WAF 
 HIGH 
 
 
 Network architecture 
 Default VPC 
 Custom VPC with segmentation 
 MEDIUM 
 
 
 Health checks 
 TCP only 
 HTTP application-level 
 MEDIUM 
 
 
 IAM hygiene 
 Long-lived keys 
 OIDC + key rotation + MFA 
 MEDIUM 
 
 
 DNS 
 External 
 Route53 (optional) 
 LOW 
 
 
 Backup/DR 
 7-day automated only 
 Cross-region snapshot copy 
 LOW 
 
 
 
 
 Recommendations (Priority Order) 
 Phase 1 — Immediate (Week 1) — CRITICAL Security 
 
 
 Lock down RDS network access 
 
 Set PubliclyAccessible=false on drop-db 
 Update drop-db-sg: remove 0.0.0.0/0, allow only App Runner VPC Connector SG 
 Verify App Runner can still connect via VPC Connector 
 
 
 
 Migrate secrets to Secrets Manager 
 
 Create secrets: drop/database-url , drop/jwt-secret 
 Update App Runner service to reference secret ARNs 
 Remove plaintext env vars from App Runner config 
 Rotate database password and JWT secret post-migration 
 
 
 
 Enable RDS encryption 
 
 Snapshot current instance 
 Restore snapshot with encryption enabled 
 Update connection string to new endpoint 
 Verify, then delete old unencrypted instance 
 Requires brief downtime — schedule maintenance window 
 
 
 
 Phase 2 — Short Term (Week 2-3) — HIGH Priority 
 
 
 Configure CloudWatch monitoring 
 
 RDS alarms: CPU, storage, connections, memory 
 App Runner alarms: request count, error rate, latency 
 SNS topic for alert notifications 
 Enable RDS Enhanced Monitoring 
 
 
 
 Enable Multi-AZ RDS 
 
 Modify instance to Multi-AZ 
 Near-zero downtime (AWS handles failover setup) 
 
 
 
 Deploy CloudFront + WAF 
 
 CloudFront distribution pointing to App Runner 
 WAF with AWS managed rule sets (Common, SQLi, Known Bad Inputs) 
 Update DNS to point to CloudFront 
 
 
 
 Phase 3 — Medium Term (Month 2) — Hardening 
 
 
 Custom VPC migration 
 
 Design VPC: 2 private subnets (RDS), 2 public subnets (NAT if needed) 
 Migrate RDS to private subnets 
 Update App Runner VPC Connector 
 
 
 
 HTTP health checks 
 
 Implement /health endpoint in Drop application (DB connectivity check) 
 Configure App Runner HTTP health check path 
 
 
 
 IAM improvements 
 
 Audit john-deploy key age 
 Enable MFA on IAM user 
 Consider GitHub Actions OIDC for CI/CD (eliminates long-lived keys) 
 
 
 
 
 Risk Matrix 
 
 
 
 Risk 
 Likelihood 
 Impact 
 Severity 
 Mitigation 
 
 
 
 
 Database breach via public access + open SG 
 HIGH 
 CRITICAL 
 CRITICAL 
 Phase 1: Lock down network (C1, C2) 
 
 
 Credential leak from plaintext env vars 
 MEDIUM 
 CRITICAL 
 CRITICAL 
 Phase 1: Secrets Manager (C3) 
 
 
 Data exposure from unencrypted storage 
 LOW 
 HIGH 
 HIGH 
 Phase 1: Enable encryption (C4) 
 
 
 Database outage (single-AZ failure) 
 LOW 
 HIGH 
 HIGH 
 Phase 2: Multi-AZ (H1) 
 
 
 Silent application failure (no monitoring) 
 MEDIUM 
 MEDIUM 
 HIGH 
 Phase 2: CloudWatch (H2) 
 
 
 Application-layer attack (no WAF) 
 MEDIUM 
 HIGH 
 HIGH 
 Phase 2: WAF (H3) 
 
 
 DDoS / performance degradation (no CDN) 
 LOW 
 MEDIUM 
 MEDIUM 
 Phase 2: CloudFront (M1) 
 
 
 Lateral movement via default VPC 
 LOW 
 MEDIUM 
 MEDIUM 
 Phase 3: Custom VPC (M2) 
 
 
 IAM key compromise 
 LOW 
 HIGH 
 MEDIUM 
 Phase 3: Key rotation + OIDC (L2) 
 
 
 
 
 Appendix: Raw Resource Details 
 App Runner — drop-web 
 Service: drop-web
Status: RUNNING
Region: eu-west-1
Source: ECR (container image)
CPU: 1 vCPU
Memory: 2 GB
Port: 3000
Ingress: Public
Egress: VPC Connector
Health Check: TCP
Environment: DATABASE_URL (plaintext, contains password)
 JWT_SECRET (plaintext)
 
 RDS — drop-db 
 Engine: PostgreSQL 16.6
Instance Class: db.t3.micro
Storage: 20 GB gp3
AZ: eu-west-1a (single-AZ)
VPC: Default (172.31.0.0/16)
Public Access: TRUE
Encrypted: FALSE
Deletion Prot: TRUE
Backup: 7-day automated
Monitoring: DISABLED
 
 ECR — drop-web 
 Repository: drop-web
Scan on Push: TRUE
Encryption: AES256 (default)
 
 Security Groups — drop-db-sg 
 Inbound Rules:
 - Protocol: TCP
 Port: 5432
 Source: 0.0.0.0/0 (ALL TRAFFIC)
 
 IAM 
 User: john-deploy (programmatic access, deployment)
Role: AppRunnerECRAccessRole (App Runner → ECR pull)
 
 Secrets Manager 
 Secrets stored: 0 (service provisioned but unused)

Cloud Audit: Multi-Cloud Design
Drop — Multi-Cloud Architecture Design 
 Date: 2026-02-19
 Auditor: solution-arch (CloudForge cloud-audit team)
 MC Task: #1443 
 
 Executive Summary 
 Drop is 85% cloud-portable thanks to Docker containerization and PostgreSQL. The main AWS lock-in is App Runner, which is easily replaceable. Recommendation: stay on AWS, optimize the current setup, and design the Terraform with an abstraction layer for future portability. 
 
 1. Provider Comparison Matrix 
 
 
 
 | Service | AWS (Current) | Azure | GCP |
 |---|---|---|---|
 | Compute | App Runner ($25-35/mo) | Container Apps ($20-30/mo) | Cloud Run ($15-25/mo) |
 | Database | RDS PostgreSQL ($15-18/mo) | Azure DB for PG ($15-20/mo) | Cloud SQL ($12-18/mo) |
 | Registry | ECR ($1-2/mo) | ACR ($5/mo) | Artifact Registry ($1-2/mo) |
 | Secrets | Secrets Manager ($0.40/secret) | Key Vault ($0.03/10k ops) | Secret Manager ($0.06/10k ops) |
 | CDN | CloudFront ($0-5/mo) | Front Door ($35+/mo) | Cloud CDN ($0-5/mo) |
 | WAF | AWS WAF ($5+/mo) | Azure WAF ($20+/mo) | Cloud Armor ($5+/mo) |
 | Monitoring | CloudWatch ($3-10/mo) | Azure Monitor ($5-15/mo) | Cloud Monitoring ($0-8/mo) |
 | Total estimate | $50-75/mo | $100-130/mo | $35-60/mo |
 
 
 
 
 2. Portable Architecture 
 Cloudflare (DNS + CDN + WAF)  ← Cloud-agnostic edge
          |
          | HTTPS
          v
 ┌─────────────────┐
 │  CaaS Platform  │  ← App Runner / Container Apps / Cloud Run
 │  ┌──────────┐   │
 │  │  Docker  │   │  ← Identical image everywhere
 │  │ Next.js  │   │
 │  │  :3000   │   │
 │  └──────────┘   │
 └────────┬────────┘
          │ DATABASE_URL
 ┌────────┴────────┐
 │   Managed PG    │  ← RDS / Azure DB / Cloud SQL
 └─────────────────┘
 
 Abstraction Strategy 
 
 
 
 | Layer | Approach |
 |---|---|
 | Compute | Docker image to any CaaS. No platform SDK |
 | Database | Standard PostgreSQL via DATABASE_URL |
 | Secrets | Terraform abstracts provider. App reads env vars |
 | DNS/CDN/WAF | Cloudflare (cloud-agnostic, free tier) |
 | Monitoring | Sentry (errors) + structured logs to any aggregator |
 | CI/CD | GitHub Actions (already cloud-agnostic) |
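To illustrate the "Standard PostgreSQL via DATABASE_URL" row: because the application only sees a connection string, the same parsing logic applies whichever provider hosts the database. A minimal Python sketch follows; the host name and credentials are made up, and `parse_database_url` is a hypothetical helper, not code from the Drop repository.

```python
from urllib.parse import parse_qs, urlsplit

def parse_database_url(url: str) -> dict:
    """Split a PostgreSQL connection URL into provider-independent parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,          # PostgreSQL default port
        "user": parts.username,
        "database": parts.path.lstrip("/"),
        # The password is deliberately left out so the result is safe to log.
        "sslmode": parse_qs(parts.query).get("sslmode", [None])[0],
    }

# Hypothetical URL; only the host would change between RDS, Azure DB and Cloud SQL.
settings = parse_database_url(
    "postgresql://drop:example-password@db.example.internal:5432/drop?sslmode=require"
)
```

Only the host portion of the URL differs between providers, which is why the abstraction table treats the database layer as low effort to move.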
 
 
 
 
 3. Migration Paths 
 AWS to Azure (3-5 days) 
 
 Push image to ACR 
 Create Azure DB for PostgreSQL Flexible Server 
 pg_dump/pg_restore data migration 
 Deploy to Azure Container Apps 
 Update Cloudflare DNS 
 Write Azure Terraform modules 
 
 AWS to GCP (2-3 days) 
 
 Push image to Artifact Registry 
 Create Cloud SQL PostgreSQL 
 pg_dump/pg_restore 
 Deploy to Cloud Run (most similar to App Runner) 
 Update Cloudflare DNS 
 Write GCP Terraform modules 
 
 Lock-In Assessment 
 
 
 
 | Component | Lock-In | Notes |
 |---|---|---|
 | App Runner | LOW | Standard Docker, replaceable |
 | RDS PostgreSQL | LOW | Standard PG, any managed PG works |
 | ECR | LOW | Standard OCI registry |
 | VPC Connector | MEDIUM | AWS-specific networking |
 | IAM Roles | MEDIUM | AWS-specific auth model |
 | Secrets Manager | LOW | App reads env vars regardless |
 
 
 
 
 4. Recommendation: Stay AWS, Optimize 
 Rationale: 
 
 Current spend of $50-75/mo is already low 
 No business need to migrate 
 85% portable: migration is possible in 2-5 days if needed 
 Azure costs more (~$100-130/mo) 
 GCP would save ~$15/mo, which does not justify the migration effort now 
 
 Immediate Actions 
 
 Security fixes (encrypt RDS, restrict SG, use Secrets Manager) 
 Add Cloudflare free tier (DNS, CDN, WAF — cloud-agnostic) 
 Terraform all resources (reproducibility) 
 Add CloudWatch basic alarms ($3-5/mo) 
 
 Future Migration Triggers 
 
 AWS cost > $200/mo → evaluate GCP Cloud Run 
 EU/EEA data sovereignty requirement → Azure Norway East 
 Multi-region needed → Cloudflare Workers + D1 
 Kubernetes requirement → EKS or GKE 
 
 
 5. 12-Month Cost Projection 
 
 
 
 | Scenario | Monthly | Annual |
 |---|---|---|
 | Current (no changes) | $50-75 | $600-900 |
 | Optimized AWS | $55-80 | $660-960 |
 | AWS + Cloudflare | $55-80 | $660-960 |
 | Azure equivalent | $100-130 | $1,200-1,560 |
 | GCP equivalent | $35-60 | $420-720 |

Cloud Audit: App Cloud Readiness
Drop Application Cloud-Readiness Audit 
 MC Task: #1443
 Date: 2026-02-19
 Auditor: software-arch (CloudForge team)
 Application: Drop Fintech Payment App (Next.js 16 + SQLite/PostgreSQL dual-driver) 
 
 NOTE (2026-03-03): This audit was performed on 2026-02-19. ADR-014 (2026-03-03) removed SQLite and the dual-driver architecture. Drop now uses PostgreSQL 16 exclusively in all environments. SQLite concerns noted in this audit are resolved. The better-sqlite3 dependency has been removed. 
 
 
 1. Twelve-Factor Compliance 
 I. Codebase — PASS 
 
 Evidence: Single Git repository at /Users/makinja/ALAI/products/Drop/ 
 .github/workflows/ci.yml triggers on main and develop branches 
 One codebase tracked in revision control, multiple deploys (staging via Fly.io, production via Docker Compose) 
 
 II. Dependencies — PASS 
 
 Evidence: package.json:1-55 declares all dependencies explicitly 
 npm ci used in CI ( ci.yml:36 ) and Dockerfile ( Dockerfile:6 ) for deterministic installs 
 package-lock.json referenced in Dockerfile COPY ( Dockerfile:5 ) and CI cache ( ci.yml:32 ) 
 Native modules (better-sqlite3) handled via apk add python3 make g++ in Dockerfile 
 
 III. Config — PASS 
 
 Evidence: .env.example:1-87 documents all env vars with clear groupings 
 env.ts:1-45 validates critical vars at startup, crashes if missing in production 
 fly.toml:16-20 injects env vars at runtime 
 docker-compose.production.yml:7-8 uses ${JWT_SECRET:?} required substitution 
 db.ts:9 — database driver selected via DATABASE_URL env var 
 db.ts:26-30 — SQLite path varies by environment (Vercel /tmp , Docker /app/data , local ./data ) 
 Feature flags externalized as NEXT_PUBLIC_FF_* env vars ( Dockerfile:19-26 ) 
 Minor concern: NEXT_PUBLIC_* vars are baked into the build at compile time (Next.js limitation), requiring rebuild for changes. This is inherent to Next.js, not a code deficiency. 
 
 IV. Backing Services — PASS 
 
 Evidence: db.ts:9-22 — database treated as attached resource via DATABASE_URL 
 PostgreSQL connection string is a single env var; switching databases requires zero code changes 
 docker-compose.production.yml:17-35 — PostgreSQL is a separate service with its own health check 
 BankID, PISP, AISP, Stripe, Sumsub — all configured via env vars ( .env.example:19-53 ) 
 
 V. Build, Release, Run — PASS 
 
 Evidence: Dockerfile uses 3-stage build (deps → builder → runner) 
 Dockerfile:1-6 — Stage 1: dependency installation 
 Dockerfile:9-37 — Stage 2: application build with next build 
 Dockerfile:39-64 — Stage 3: minimal production runner 
 next.config.ts:8 — output: "standalone" generates self-contained deployment 
 CI builds Docker image tagged with commit SHA ( ci.yml:63 ) 
 Build-time vs runtime config cleanly separated (ARG for build, ENV for runtime) 
 
 VI. Processes — PARTIAL 
 
 Evidence: Application runs as a single node server.js process ( Dockerfile:64 ) 
 SQLite concern: When running with SQLite (no DATABASE_URL ), the process is stateful — data lives on local filesystem at /app/data/drop.db . This works on Fly.io with mounted volumes ( fly.toml:36-38 ) but violates share-nothing for horizontal scaling. 
 PostgreSQL mode: Fully stateless — pg.Pool connects to external database ( db.ts:17-22 ). Multiple processes can run concurrently. 
 Rate limiting: rate_limits table in the database ( middleware.ts:15-43 ), which works for single-instance but has race conditions under horizontal scale with SQLite. 
 Assessment: PARTIAL because SQLite mode is actively used (Fly.io staging). In PostgreSQL mode this would be PASS. 
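The race condition mentioned for database-backed rate limiting typically comes from a read-then-write sequence (read the counter, then write it back incremented). One common way to avoid it is to do the increment in a single upsert statement. The sketch below is an illustration only: the column names (`bucket`, `window_start`, `hits`) are assumptions, the real `rate_limits` schema is not shown in this audit, and an in-memory `sqlite3` database is used purely so the example runs without a server.

```python
import sqlite3

def make_store() -> sqlite3.Connection:
    """In-memory stand-in for the rate_limits table (column names are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rate_limits ("
        " bucket TEXT NOT NULL,"
        " window_start INTEGER NOT NULL,"
        " hits INTEGER NOT NULL,"
        " PRIMARY KEY (bucket, window_start))"
    )
    return conn

def allow(conn: sqlite3.Connection, bucket: str, window_start: int, limit: int) -> bool:
    """Count one request and report whether the bucket is still within its limit."""
    with conn:  # commits on success, rolls back on error
        # Single statement: insert the first hit or increment the existing row,
        # so two concurrent requests cannot both read the same stale counter.
        conn.execute(
            "INSERT INTO rate_limits (bucket, window_start, hits) VALUES (?, ?, 1) "
            "ON CONFLICT (bucket, window_start) "
            "DO UPDATE SET hits = rate_limits.hits + 1",
            (bucket, window_start),
        )
        (hits,) = conn.execute(
            "SELECT hits FROM rate_limits WHERE bucket = ? AND window_start = ?",
            (bucket, window_start),
        ).fetchone()
    return hits <= limit

store = make_store()
results = [allow(store, "ip:203.0.113.7", 0, 2) for _ in range(3)]
```

With a limit of 2 per window, the third call in the same window is rejected, while a call in a new window starts again from one hit.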
 
 VII. Port Binding — PASS 
 
 Evidence: Dockerfile:61-62 — EXPOSE 3000 , ENV PORT=3000 , ENV HOSTNAME="0.0.0.0" 
 fly.toml:23 — internal_port = 3000 
 docker-compose.production.yml:5 — ports: "3000:3000" 
 Self-contained via Next.js standalone server, no external HTTP server dependency. 
 
 VIII. Concurrency — PARTIAL 
 
 Evidence: Node.js single-threaded event loop handles concurrent requests via async I/O 
 db.ts:16-22 — PostgreSQL connection pool (pg.Pool) supports concurrent queries 
 fly.toml:25-27 — auto_stop_machines / auto_start_machines enables horizontal scaling 
 Limitation: No explicit worker process types. Background work (e.g., exchange rate refresh) runs inline. No separate queue workers. For a fintech app, transaction processing should eventually be separated into dedicated worker processes. 
 Limitation: SQLite mode limits to single process (WAL mode allows concurrent reads but single writer). 
 
 IX. Disposability — PASS 
 
 Evidence: Process starts fast — Next.js standalone is ~500ms cold start 
 db.ts:719-789 — initDb() is idempotent with _initialized guard; safe for restarts 
 Schema uses CREATE TABLE IF NOT EXISTS — safe for repeated initialization 
 fly.toml:25-27 — machines auto-stop/start, confirming disposability design 
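 Shutdown relies on Node.js default SIGTERM behavior (the process exits without draining in-flight requests); no explicit handler exists, as noted under Honorable Mentions 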
 PostgreSQL pool ( pg.Pool ) handles connection cleanup on process exit 
 
 X. Dev/Prod Parity — PASS 
 
 Evidence: db.ts:9-13 — dual-driver architecture (SQLite for dev, PostgreSQL for prod) with unified async API 
 db.ts:47-63 — SQL compatibility layer translates SQLite idioms to PostgreSQL (placeholder conversion, INSERT OR IGNORE → ON CONFLICT DO NOTHING , datetime('now') → CURRENT_TIMESTAMP ) 
 db.ts:204-460 (SQLITE_SCHEMA) and db.ts:462-690 (PG_SCHEMA) — parallel schemas maintained in sync 
 migrations/0001_initial-schema.ts — node-pg-migrate for PostgreSQL schema versioning 
 Docker Compose production config ( docker-compose.production.yml ) mirrors production topology locally 
 Minor gap: SQLite schema is maintained inline in db.ts while PostgreSQL uses proper migrations ( node-pg-migrate ). Schema drift is possible if one is updated without the other. 
 
 XI. Logs — PARTIAL 
 
 Evidence: Health endpoint uses createLogger() ( health/route.ts:16 ) 
 middleware.ts:82-84 — error tracking via trackError() and Sentry integration 
 .env.example:62-74 — Sentry DSN configurable via env vars 
 Concern: No structured logging to stdout visible in the codebase. Next.js default logging goes to stdout which is good for containers, but there's no consistent structured logging format (JSON lines) that cloud log aggregators can parse efficiently. console.error is used in places ( middleware.ts:83 ). 
 
 XII. Admin Processes — PASS 
 
 Evidence: package.json:12-14 — migration scripts: migrate:up , migrate:down , migrate:create via node-pg-migrate 
 db.ts:735-774 — programmatic ALTER TABLE migrations for schema evolution 
 Seed data controlled by SEED_DEMO env var and isDemoMode() check — admin data seeding decoupled from main app 
 No one-off scripts embedded in application startup (seeding only runs when database is empty) 
 
 
 2. Containerization Quality 
 Multi-Stage Build — EXCELLENT 
 
 3-stage Dockerfile ( Dockerfile:1-64 ):
 
 Stage 1 ( deps ): node:22-alpine , installs native build tools, runs npm ci 
 Stage 2 ( builder ): Copies deps, builds Next.js app 
 Stage 3 ( runner ): Minimal alpine, copies only standalone output + static assets 
 
 
 
 Image Size 
 
 Base: node:22-alpine (minimal, ~180MB base) 
 Issue: Stage 3 installs python3 make g++ ( Dockerfile:42 ) for better-sqlite3 native module rebuild. This adds ~200MB to the production image unnecessarily if running in PostgreSQL mode. These build tools are a security and size concern in production. 
 Recommendation: Either pre-build better-sqlite3 in stage 2 and copy the binary, or conditionally exclude it when PostgreSQL is the target. 
 
 Security 
 
 Non-root user: nextjs:nodejs (UID/GID 1001) created and used ( Dockerfile:48-49, 58 ) 
 NEXT_TELEMETRY_DISABLED=1 set ( Dockerfile:14, 46 ) 
 Data directory owned by non-root user ( Dockerfile:56 ) 
 CI runs Trivy vulnerability scanner on built image ( ci.yml:67-73 ) with HIGH/CRITICAL severity gate 
 SARIF results uploaded to GitHub Security tab ( ci.yml:85-89 ) 
 
 Layer Caching 
 
 Dependencies cached in separate stage ( Dockerfile:5-6 — COPY package.json package-lock.json* before source) 
 Source code copy happens in stage 2 after deps, enabling Docker layer cache for unchanged dependencies 
 Good practice: Build args for feature flags allow cache invalidation only when flags change 
 
 Missing 
 
 No .dockerignore verified (could copy unnecessary files like .git , node_modules into build context) 
 No image tagging strategy beyond CI SHA tag 
 
 
 3. Database Portability 
 Dual-Driver Architecture — STRONG 
 
 Implementation: db.ts:9-13 — Runtime driver selection via DATABASE_URL presence 
 Unified API: query() , getOne() , run() , transaction() — all async, both drivers ( db.ts:67-199 ) 
 Type exports: DbClient interface ( db.ts:136-140 ) for transaction context 
 
 SQL Translation Layer 
 
 
 
 | SQLite Idiom | PostgreSQL Translation | Location |
 |---|---|---|
 | ? placeholders | $1, $2, ... | db.ts:47-50 |
 | INSERT OR IGNORE INTO | INSERT INTO ... ON CONFLICT DO NOTHING | db.ts:56, 104-118 |
 | INSERT OR REPLACE INTO | INSERT INTO ... ON CONFLICT (col) DO UPDATE SET | db.ts:58, 120-134 |
 | datetime('now') | CURRENT_TIMESTAMP | db.ts:60 |
 | INTEGER PRIMARY KEY AUTOINCREMENT | SERIAL PRIMARY KEY | db.ts:278 vs 530 |
 | hex(randomblob(32)) | encode(gen_random_bytes(32), 'hex') | db.ts:248 vs 504 |
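The first, second and fourth translations in the table can be sketched as plain string rewriting. This is a simplified Python illustration of the idea, not a port of `db.ts`: the function name is hypothetical, and the naive `?` replacement would also rewrite question marks inside string literals, which a real translation layer has to guard against.

```python
import re

def to_postgres(sql: str) -> str:
    """Rewrite a small subset of SQLite idioms into PostgreSQL syntax."""
    counter = 0

    def numbered(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"${counter}"

    # ? placeholders become $1, $2, ... in order of appearance.
    sql = re.sub(r"\?", numbered, sql)
    # datetime('now') becomes CURRENT_TIMESTAMP.
    sql = sql.replace("datetime('now')", "CURRENT_TIMESTAMP")
    # INSERT OR IGNORE INTO becomes INSERT INTO ... ON CONFLICT DO NOTHING.
    if re.match(r"\s*INSERT OR IGNORE INTO", sql, flags=re.IGNORECASE):
        sql = re.sub(r"INSERT OR IGNORE INTO", "INSERT INTO", sql,
                     count=1, flags=re.IGNORECASE)
        sql = sql.rstrip().rstrip(";") + " ON CONFLICT DO NOTHING"
    return sql

select_sql = to_postgres("SELECT * FROM users WHERE id = ? AND status = ?")
insert_sql = to_postgres(
    "INSERT OR IGNORE INTO consents (user_id, created_at) VALUES (?, datetime('now'))"
)
```

The `INSERT OR REPLACE` row is left out on purpose: it needs the conflict column and the list of columns to update, which cannot be derived from the statement text alone.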
 
 
 
 Transaction Support 
 
 PostgreSQL: BEGIN/COMMIT/ROLLBACK with pgClient.connect() and proper release in finally block ( db.ts:142-173 ) 
 SQLite: db.exec("BEGIN/COMMIT/ROLLBACK") wrapper ( db.ts:174-198 ) 
 Error handling: Both paths catch and rollback on failure 
 
 Migrations 
 
 node-pg-migrate for PostgreSQL ( package.json:12-14 , migrations/0001_initial-schema.ts ) 
 Proper up() and down() functions with ordered table creation/deletion 
 SQLite uses inline schema with CREATE TABLE IF NOT EXISTS + ALTER TABLE try/catch migrations ( db.ts:756-774 ) 
 Risk: Two parallel schema definitions (SQLITE_SCHEMA and PG_SCHEMA in db.ts + node-pg-migrate files) could drift. No automated parity check exists. 
 
 Indexes 
 
 22 indexes defined for both drivers (identical set) 
 Partial indexes supported: idx_users_national_id WHERE national_id_hash IS NOT NULL , idx_tx_idempotency WHERE idempotency_key IS NOT NULL 
 
 
 4. Config Externalization 
 Environment Variables 
 
 
 
 | Category | Variables | Source |
 |---|---|---|
 | Core | JWT_SECRET, JWT_EXPIRY, NODE_ENV | .env.example:12-14 |
 | Database | DATABASE_URL | db.ts:9 |
 | Service Mode | NEXT_PUBLIC_SERVICE_MODE, DROP_MODE | .env.example:8 |
 | Auth (BankID) | BANKID_CLIENT_ID/SECRET/URLS, BANKID_MOCK | .env.example:19-29 |
 | Payments | PISP_API_URL/KEY, AISP_API_URL/KEY | .env.example:32-40 |
 | Cards | STRIPE_SECRET_KEY, STRIPE_PUBLISHABLE_KEY | .env.example:43-47 |
 | KYC | SUMSUB_APP_TOKEN, SUMSUB_SECRET_KEY | .env.example:50-52 |
 | Monitoring | SENTRY_DSN, SENTRY_TRACES_SAMPLE_RATE | .env.example:63-74 |
 | Feature Flags | 8x NEXT_PUBLIC_FF_* | .env.example:77-87 |
 | Exchange | EXCHANGE_RATE_API_KEY/URL | .env.example:55-59 |
 
 
 
 Secrets Management 
 
 env.ts:14-45 validates critical vars at production startup 
 Dockerfile:15 — JWT_SECRET=build-phase-placeholder (safe build-time placeholder) 
 env.ts:21-25 — Skip validation during build phase (detects NEXT_PHASE or placeholder) 
 env.ts:36-38 — Rejects known dev placeholder in production runtime 
 docker-compose.production.yml:7 — ${JWT_SECRET:?} required substitution (fails if missing) 
 No hardcoded secrets found in source code 
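The startup checks described above (skip during the build phase, fail in production when a critical variable is missing, reject a known dev placeholder) can be sketched as follows. This is a Python illustration of the described behavior, not a translation of `env.ts`: the list of required variables and the dev placeholder value are assumptions, and only the build placeholder string is quoted from the audit.

```python
BUILD_PLACEHOLDER = "build-phase-placeholder"  # value quoted from Dockerfile:15
DEV_PLACEHOLDER = "dev-only-secret"            # hypothetical; the real value is not shown
REQUIRED = ("JWT_SECRET", "DATABASE_URL")      # assumed set of critical variables

def env_problems(env: dict) -> list:
    """Return human-readable problems; an empty list means the config is acceptable."""
    # Build phase: validation is skipped so the image build can use a placeholder.
    if env.get("NEXT_PHASE") or env.get("JWT_SECRET") == BUILD_PLACEHOLDER:
        return []
    # Outside production the checks are not enforced.
    if env.get("NODE_ENV") != "production":
        return []
    problems = [f"{name} is missing" for name in REQUIRED if not env.get(name)]
    if env.get("JWT_SECRET") == DEV_PLACEHOLDER:
        problems.append("JWT_SECRET is still the development placeholder")
    return problems

def validate_env(env: dict) -> None:
    """Fail fast at startup instead of running with a broken configuration."""
    problems = env_problems(env)
    if problems:
        raise RuntimeError("Invalid production configuration: " + "; ".join(problems))
```

In a real service the call would be `validate_env(dict(os.environ))` at process start, so a misconfigured deployment fails during startup rather than on the first request.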
 
 Feature Flags 
 
 8 client-side feature flags via NEXT_PUBLIC_FF_* env vars 
 Defaults to false (safe) for all card-related features 
 NEXT_PUBLIC_FF_NOTIFICATIONS=true and NEXT_PUBLIC_FF_MERCHANT_DASHBOARD=true as defaults 
 Build-time injection for client code ( Dockerfile:19-35 ), runtime for server code 
 
 
 5. CI/CD Quality 
 Pipeline Structure ( ci.yml ) 
 lint-test (parallel)
 -- npm ci
 -- eslint
 -- tsc --noEmit
 -- vitest run
 -- npm audit (production)
 docker-scan (sequential, needs lint-test)
 -- docker build
 -- Trivy scan (table, exit-code=1 on HIGH/CRITICAL)
 -- Trivy SARIF -> GitHub Security
 
 Reproducibility 
 
 Pinned Node.js version: NODE_VERSION: "22" ( ci.yml:15 ) 
 npm ci for deterministic installs ( ci.yml:36 ) 
 Dependency caching via actions/setup-node with cache-dependency-path ( ci.yml:30-32 ) 
 Docker image tagged with commit SHA ( ci.yml:63 ) 
 
 Security Scanning 
 
 npm audit: Production dependencies, HIGH level, continue-on-error ( ci.yml:48-49 ) 
 Trivy: Container vulnerability scan, blocks on HIGH/CRITICAL unfixed vulns ( ci.yml:67-73 ) 
 SARIF: Results uploaded to GitHub Security tab ( ci.yml:85-89 ) 
 Permissions: Minimal — contents: read , security-events: write ( ci.yml:11-12 ) 
 
 Testing 
 
 vitest run in CI ( ci.yml:44 ) 
 Unit test framework configured ( package.json:10-11 ) 
 Coverage tool available: @vitest/coverage-v8 ( package.json:43 ) 
 Missing: No coverage threshold enforcement in CI 
 Missing: No E2E/integration tests in CI pipeline (Playwright is in devDependencies but not wired into CI) 
 
 Deployment 
 
 Fly.io staging configured ( fly.toml ) with health checks, auto-scaling, volume mounts 
 Docker Compose production ( docker-compose.production.yml ) for self-hosted deployments 
 Missing: No automated deployment step in CI (manual fly deploy or similar) 
 Missing: No environment promotion pipeline (develop -> staging -> production) 
 
 
 6. Overall Score and Top 5 Improvements 
 Overall Cloud-Readiness Score: 7.5 / 10 
 The application demonstrates strong cloud-native fundamentals: 
 
 Excellent dual-driver database abstraction 
 Proper multi-stage Dockerfile with security hardening 
 Configuration fully externalized via environment variables 
 Comprehensive CI with security scanning (Trivy + npm audit) 
 Health endpoint with real database connectivity check 
 
 Top 5 Improvements (Priority Order) 
 1. Eliminate Build Tools from Production Image (HIGH) 
 
 File: Dockerfile:42 
 Issue: python3 make g++ in production stage adds ~200MB and attack surface 
 Fix: Pre-compile better-sqlite3 in builder stage, copy only the .node binary. Or use a conditional build that excludes better-sqlite3 entirely when targeting PostgreSQL. 
 
 2. Add Structured Logging (HIGH) 
 
 Files: Throughout — console.error used in middleware.ts:83 , health endpoint has createLogger() but no consistent format 
 Issue: Cloud log aggregators (CloudWatch, Datadog, ELK) need structured JSON logs. Current mix of console.log/error and ad-hoc logger makes log parsing unreliable. 
 Fix: Adopt pino or similar JSON logger, output to stdout in { level, message, timestamp, requestId } format. 
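The target line shape from the fix above, one JSON object per line with level, message, timestamp and requestId, can be illustrated independently of the logging library. The sketch below uses Python's standard `logging` module only to show the format; in the application itself the equivalent would be pino or a similar Node.js logger, and the class and function names here are hypothetical.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonLineFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # requestId is attached per call via logging's `extra` argument.
            "requestId": getattr(record, "requestId", None),
        }
        return json.dumps(entry)

def make_logger(name: str = "drop", stream=sys.stdout) -> logging.Logger:
    """Build a logger that writes JSON lines to stdout for the container runtime."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger().error("db unreachable", extra={"requestId": "req-123"})` then emits one parseable line that a log aggregator can index by field.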
 
 3. Add CI Coverage Enforcement and E2E Tests (MEDIUM) 
 
 File: ci.yml — no coverage gate, no Playwright CI step 
 Issue: @vitest/coverage-v8 and @playwright/test are in devDeps but not enforced in CI 
 Fix: Add --coverage --coverage.thresholds.lines=80 to vitest. Add Playwright E2E job with containerized app. 
 
 4. Automate Schema Parity Check (MEDIUM) 
 
 File: db.ts:204-690 — two parallel schema definitions (SQLite + PostgreSQL) 
 Issue: Manual sync between SQLITE_SCHEMA, PG_SCHEMA, and node-pg-migrate files. Drift will cause runtime errors that only surface in specific deployment targets. 
 Fix: Write a CI check that extracts table/column definitions from both schemas and compares. Or generate both schemas from a single source of truth. 
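A parity check of the kind proposed in the fix could start from something like the sketch below. It is a deliberately naive illustration: it assumes each statement has the form `CREATE TABLE [IF NOT EXISTS] name (...);`, compares column names only (not types or constraints), and all function names are hypothetical.

```python
import re

TABLE_RE = re.compile(
    r"CREATE TABLE (?:IF NOT EXISTS )?(\w+)\s*\((.*?)\);",
    re.IGNORECASE | re.DOTALL,
)
CONSTRAINT_WORDS = {"primary", "foreign", "unique", "check", "constraint"}

def split_top_level(body: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

def table_columns(schema_sql: str) -> dict:
    """Map each table name to the set of column names it declares."""
    tables = {}
    for name, body in TABLE_RE.findall(schema_sql):
        columns = set()
        for part in split_top_level(body):
            words = part.strip().split()
            if words and words[0].lower() not in CONSTRAINT_WORDS:
                columns.add(words[0].strip('"').lower())
        tables[name.lower()] = columns
    return tables

def schema_drift(first_sql: str, second_sql: str) -> dict:
    """Report tables whose column sets differ between the two schemas."""
    first, second = table_columns(first_sql), table_columns(second_sql)
    drift = {}
    for table in sorted(set(first) | set(second)):
        only_first = first.get(table, set()) - second.get(table, set())
        only_second = second.get(table, set()) - first.get(table, set())
        if only_first or only_second:
            drift[table] = {"only_first": sorted(only_first),
                            "only_second": sorted(only_second)}
    return drift
```

A CI step could fail the build whenever `schema_drift` returns a non-empty dictionary. Generating both schemas from a single source of truth, as the fix also suggests, would remove the need for such a check.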
 
 5. Add Deployment Pipeline and Environment Promotion (MEDIUM) 
 
 File: ci.yml — CI only, no CD 
 Issue: No automated deployment from CI. Fly.io deploy is manual. No staging -> production promotion gate. 
 Fix: Add fly deploy step on develop push (staging) and manual approval gate for main (production). Add smoke test after deploy. Consider GitHub Environments for approval workflows. 
 
 Honorable Mentions 
 
 SQLite mode limits horizontal scaling — document clearly when to switch to PostgreSQL 
 Rate limiting via database has race conditions under concurrent writes (consider Redis for high-throughput) 
 No readiness probe separate from liveness (health endpoint serves both) 
 No graceful shutdown handler (SIGTERM -> drain connections -> exit) 
 playwright-core in production dependencies ( package.json:27 ) — should be devDependencies only 
 
 
 Appendix: File Reference 
 
 
 
 | File | Purpose |
 |---|---|
 | src/drop-app/src/lib/db.ts | Dual-driver database abstraction (SQLite + PostgreSQL) |
 | src/drop-app/Dockerfile | 3-stage multi-stage build |
 | src/drop-app/.env.example | Environment variable documentation (87 lines) |
 | src/drop-app/fly.toml | Fly.io deployment config (Stockholm region) |
 | src/drop-app/docker-compose.production.yml | Self-hosted production config |
 | src/drop-app/package.json | Dependencies and scripts |
 | .github/workflows/ci.yml | CI pipeline (lint, test, type-check, Trivy) |
 | src/drop-app/migrations/0001_initial-schema.ts | PostgreSQL migration (node-pg-migrate) |
 | src/drop-app/next.config.ts | Next.js config (standalone output, security headers) |
 | src/drop-app/src/middleware.ts | Edge middleware (CSRF, CSP nonce) |
 | src/drop-app/src/lib/middleware.ts | Server middleware (rate limiting, auth, validation, audit) |
 | src/drop-app/src/app/api/health/route.ts | Health endpoint (real DB check) |
 | src/drop-app/src/lib/env.ts | Environment validation at startup |

Cloud Audit: Validation Report
Drop — Validation + Security + Cost Report 
 Date: 2026-02-19
 Auditor: cloud-tester (CloudForge cloud-audit team)
 MC Task: #1443 
 
 Executive Summary 
 Drop's AWS infrastructure has 3 CRITICAL and 4 HIGH security findings requiring immediate remediation. Current spend is ~$48-65/mo (itemized in Section 2), which is low for the current scale. The application is cloud-portable (cloud-readiness score 7.5/10), and the recommended path is to stay on AWS with security hardening and Terraform IaC. 
 
 1. Security Posture Assessment 
 Current vs Improved 
 
 
 
 | Area | Current State | After Remediation | Risk Reduction |
 |---|---|---|---|
 | Secrets | Plaintext in App Runner env vars | AWS Secrets Manager | CRITICAL → LOW |
 | RDS Access | Publicly accessible, SG open 0.0.0.0/0 | Private, VPC-only access | CRITICAL → LOW |
 | Encryption | RDS unencrypted at rest | AES-256 encryption enabled | CRITICAL → RESOLVED |
 | Monitoring | None (no CloudWatch) | Basic alarms + Performance Insights | HIGH → LOW |
 | WAF | None | Cloudflare WAF (free tier) | HIGH → LOW |
 | CDN | None (direct App Runner URL) | Cloudflare CDN | HIGH → LOW |
 | SSL/TLS | App Runner managed cert | Cloudflare + App Runner | MEDIUM → LOW |
 | IAM | Single user (john-deploy) | Least-privilege roles | MEDIUM → LOW |
 
 
 
 Security Findings Summary 
 
 
 
 | # | Severity | Finding | Remediation | Effort |
 |---|---|---|---|---|
 | S1 | CRITICAL | RDS publicly accessible with SG allowing 0.0.0.0/0:5432 | Set publicly_accessible=false, restrict SG to VPC CIDR | 1 hour |
 | S2 | CRITICAL | Database password in plaintext App Runner env var | Migrate to Secrets Manager, update App Runner to read from SM | 2 hours |
 | S3 | CRITICAL | JWT_SECRET in plaintext App Runner env var | Migrate to Secrets Manager | 1 hour |
 | S4 | HIGH | RDS storage not encrypted at rest | Enable encryption (requires snapshot + restore for existing DB) | 2-4 hours |
 | S5 | HIGH | No monitoring or alerting configured | Add CloudWatch alarms for CPU, memory, DB connections | 1 hour |
 | S6 | HIGH | No WAF protection | Add Cloudflare WAF (free tier) | 30 min |
 | S7 | HIGH | No CDN (direct App Runner URL exposed) | Add Cloudflare CDN | 30 min |
 | S8 | MEDIUM | Sentry DSN in plaintext (not secret, but cleanup) | Move to Secrets Manager for consistency | 30 min |
 | S9 | MEDIUM | Docker image has build tools in runner (attack surface) | Remove python3/make/g++ from runner stage | 1 hour |
 | S10 | MEDIUM | No structured logging (incident investigation gaps) | Add pino/winston with JSON output | 2 days |
 | S11 | LOW | ECR image tag mutability (tag overwrite risk) | Set image_tag_mutability = IMMUTABLE | 5 min |
 | S12 | LOW | No lifecycle policy for ECR images | Add policy to clean old images | 15 min |
 
 
 
 Compliance Checklist 
 
 
 
 | Item | Status | Notes |
 |---|---|---|
 | GDPR data tables (consents, data_access_requests) | PASS | Schema includes consent tracking, DSAR, right to erasure |
 | Audit logging | PASS | audit_log table with IP, user_agent, request_id |
 | AML/KYC compliance | PASS | aml_alerts, str_reports, screening_results tables |
 | Encryption at rest | FAIL | RDS storage unencrypted |
 | Encryption in transit | PARTIAL | App Runner HTTPS, but RDS sslmode=no-verify |
 | Secrets management | FAIL | Plaintext in env vars |
 | Access control | PARTIAL | Single IAM user, no MFA enforcement |
 | Backup & recovery | PASS | RDS 7-day automated backups |
 | DeletionProtection | PASS | Enabled on RDS |
 
 
 
 
 2. Cost Comparison 
 Current AWS Spend 
 
 
 
 | Resource | Monthly Cost | Notes |
 |---|---|---|
 | App Runner (1 vCPU, 2GB) | $25-35 | Always-on, no auto-stop |
 | RDS db.t3.micro | $15-18 | Single-AZ, 20GB gp3 |
 | ECR | $1-2 | Image storage |
 | VPC Connector | $5 | Flat fee |
 | Data transfer | $2-5 | Low traffic |
 | Total | $48-65 | |
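The total follows from summing the low and high ends of each line item. A short Python check of that arithmetic, using only the figures from the table (the helper names are illustrative):

```python
# Monthly cost ranges in USD as (low, high), copied from the table above.
LINE_ITEMS = {
    "App Runner (1 vCPU, 2GB)": (25, 35),
    "RDS db.t3.micro": (15, 18),
    "ECR": (1, 2),
    "VPC Connector": (5, 5),
    "Data transfer": (2, 5),
}

def total_range(items: dict) -> tuple:
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return (low, high)

def annualize(monthly: tuple) -> tuple:
    """Convert a monthly range into a 12-month range."""
    return (monthly[0] * 12, monthly[1] * 12)

monthly_total = total_range(LINE_ITEMS)   # (48, 65)
annual_total = annualize(monthly_total)   # (576, 780)
```

Keeping the line items as ranges rather than midpoints makes it easy to see which resource drives the spread between the low and high estimate.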
 
 
 
 
 Optimized AWS (after fixes) 
 
 
 
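 | Resource | Monthly Cost | Change |
 |---|---|---|
 | App Runner | $25-35 | No change |
 | RDS (encrypted) | $15-18 | No cost increase |
 | ECR | $1-2 | No change |
 | Secrets Manager (3 secrets) | $1.20 | +$1.20 |
 | CloudWatch (basic alarms) | $3-5 | +$3-5 |
 | Cloudflare (free tier) | $0 | Free CDN/WAF/DNS |
 | Total (including the unchanged VPC Connector and data transfer from the current-spend table) | $52-70 | +$4-7 |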
 
 
 
 Multi-Cloud Equivalent 
 
 
 
 | Provider | Monthly | Annual | vs Current |
 |---|---|---|---|
 | AWS (optimized) | $52-70 | $624-840 | +$4-7/mo |
 | Azure | $100-130 | $1,200-1,560 | +$50-65/mo |
 | GCP | $35-60 | $420-720 | -$5-15/mo |
 
 
 
 Verdict: AWS is cost-effective. GCP saves ~$10/mo but migration effort not justified at current scale. 
 
 3. Risk Matrix 
 
 
 
 | Risk | Probability | Impact | Current Mitigation | Recommended |
 |---|---|---|---|---|
 | Data breach via public RDS | HIGH | CRITICAL | DeletionProtection only | Restrict SG, disable public access |
 | Secret exposure | MEDIUM | CRITICAL | None (plaintext) | Secrets Manager + rotation |
 | Service downtime | LOW | HIGH | App Runner auto-scaling | Add health checks, CloudWatch alarms |
 | Data loss | LOW | CRITICAL | 7-day RDS backups | Add cross-region backup copy |
 | Cost overrun | LOW | MEDIUM | None | Add AWS Budgets alarm at $100 |
 | Vendor lock-in | LOW | MEDIUM | Docker + PostgreSQL | Terraform abstraction modules |
 | DDoS attack | MEDIUM | HIGH | None | Cloudflare WAF + rate limiting |
 | Compliance failure | MEDIUM | HIGH | Tables exist, no encryption | Enable encryption, structured logging |
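One way to turn the matrix into an ordering is to map each level to a number and multiply probability by impact. The numeric scale below is an assumption made for illustration; the report does not state how, or whether, the auditors scored the risks.

```python
# Assumed ordinal scale; the report itself only gives the labels.
LEVEL = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}

# (risk, probability, impact) copied from the matrix above.
RISKS = [
    ("Data breach via public RDS", "HIGH", "CRITICAL"),
    ("Secret exposure", "MEDIUM", "CRITICAL"),
    ("Service downtime", "LOW", "HIGH"),
    ("Data loss", "LOW", "CRITICAL"),
    ("Cost overrun", "LOW", "MEDIUM"),
    ("Vendor lock-in", "LOW", "MEDIUM"),
    ("DDoS attack", "MEDIUM", "HIGH"),
    ("Compliance failure", "MEDIUM", "HIGH"),
]

def score(probability: str, impact: str) -> int:
    """Simple probability-times-impact score on the assumed scale."""
    return LEVEL[probability] * LEVEL[impact]

# Highest score first; ties keep their order from the table.
ranked = sorted(RISKS, key=lambda risk: score(risk[1], risk[2]), reverse=True)
```

Under this scale the two highest-ranked risks are the public RDS exposure and secret exposure, which matches the P0 items in the recommendations.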
 
 
 
 
 4. Implementation Roadmap 
 Phase 1: Security Fixes (Immediate — Day 1) 
 
 Create Secrets Manager secrets (DATABASE_URL, JWT_SECRET, SENTRY_DSN) 
 Update App Runner to read from Secrets Manager 
 Restrict RDS security group to VPC CIDR 
 Disable RDS public accessibility 
 Effort: 4-6 hours | Cost impact: +$1.20/mo 
 
 Phase 2: IaC Migration (Week 1) 
 
 Create S3 bucket for Terraform state 
 Import existing resources into Terraform state 
 Run terraform plan to verify no drift 
 Add terraform-ci.yml to GitHub Actions 
 Effort: 1-2 days | Cost impact: $0 
 
 Phase 3: Monitoring & Observability (Week 2) 
 
 Enable RDS Performance Insights 
 Add CloudWatch alarms (CPU > 80%, memory > 80%, DB connections > 80%) 
 Add structured logging (pino) to application 
 Configure Sentry properly (traces, breadcrumbs) 
 Effort: 2-3 days | Cost impact: +$3-5/mo 
 
 Phase 4: Edge Security (Week 2-3) 
 
 Set up Cloudflare (DNS, CDN, WAF) 
 Custom domain (getdrop.no) through Cloudflare 
 Enable Cloudflare WAF rules 
 Add rate limiting at edge 
 Effort: 1 day | Cost impact: $0 (free tier) 
 
 Phase 5: RDS Encryption (Week 3) 
 
 Take a snapshot of the current DB and copy it with encryption enabled 
 Restore to new encrypted instance 
 Update Secrets Manager with new endpoint 
 Verify and swap 
 Effort: 2-4 hours (with downtime) | Cost impact: $0 
 
 Phase 6: Multi-Cloud Readiness (Month 2+) 
 
 Create Azure Terraform modules (optional) 
 Create GCP Terraform modules (optional) 
 Test migration to staging on alternative cloud 
 Effort: 3-5 days | Cost impact: Only if migrated 
 
 
 5. Recommendations Summary 
 
 
 
 | Priority | Action | Status |
 |---|---|---|
 | P0 (NOW) | Fix RDS public access + SG | Terraform module created |
 | P0 (NOW) | Move secrets to Secrets Manager | Terraform module created |
 | P1 (Week 1) | Enable RDS encryption | Requires snapshot/restore |
 | P1 (Week 1) | Deploy Terraform IaC | Modules ready |
 | P2 (Week 2) | Add monitoring (CloudWatch + Performance Insights) | In Terraform |
 | P2 (Week 2) | Add Cloudflare CDN/WAF | Manual setup |
 | P3 (Month 1) | Add structured logging | Application code change |
 | P3 (Month 1) | Add graceful shutdown handler | Application code change |
 | P4 (Month 2+) | Multi-cloud Terraform modules | As needed |
 
 
 
 
 Overall Assessment: Drop's infrastructure is functional but needs immediate security hardening. The Terraform IaC created by this audit provides a complete, reproducible foundation. Total investment: ~1 week of engineering time, ~$5/mo additional cost, significant risk reduction.

Bilko Deploy — Standard Operating Procedure
Bilko Deploy — Standard Operating Procedure 

 Last updated: 2026-04-22 
 Owner: FlowForge (Kelsey Hightower) 
 Status: ACTIVE 

 Cloud Run Architecture 

 GCP Project: tribal-sign-487920-k0 
 Region: europe-north1 
 Services: 

 
 bilko-web — Next.js 15 frontend (main branch → bilko-demo.alai.no) 
 bilko-api — Express API (main branch → bilko-api-762788903040.europe-north1.run.app) 
 bilko-intesa-demo — Intesa pitch demo (feat/intesa-bih-demo → manual deploy only) 
 

 Deploy Map 

 
 
 
 | Branch | Service | URL | CI Workflow | Last Verified |
 |---|---|---|---|---|
 | main | bilko-web | https://bilko-demo.alai.no | gcp-deploy.yml (BROKEN) | 2026-04-22 |
 | main | bilko-api | https://bilko-api-762788903040.europe-north1.run.app | gcp-deploy.yml (BROKEN) | 2026-04-18 |
 | feat/intesa-bih-demo | bilko-intesa-demo | https://bilko-intesa-demo-762788903040.europe-north1.run.app | Manual gcloud only | 2026-04-17 |
 
 
 

 Pre-Flight Checks (ZAKON PI2 Check 2) 

 MANDATORY: Run these 4 commands and paste the output into the MC task BEFORE touching code: 

 # 1. Target URL alive?
curl -sI https://bilko-demo.alai.no | head -3

# 2. Branch state?
git log main --oneline -5

# 3. CI health?
gh run list --repo alai-holding/bilko --branch main --limit 3

# 4. Cloud Run service status?
gcloud run services describe bilko-web \
 --region europe-north1 \
 --project tribal-sign-487920-k0 \
 --format='value(status.latestReadyRevisionName,status.url,status.traffic)'
 

 If any command returns unexpected output: STOP and escalate to John. Do not proceed. 

 CI Pipeline Status 

 Status: BROKEN (2026-04-15 onwards) 
 Root Causes: 

 
 GitHub Actions minutes quota exhausted (monthly limit reached) 
 --no-traffic flag on line 206 of gcp-deploy.yml prevents traffic promotion for existing services 
 

 Workaround: Use the manual deploy path (see below) until CI is fixed. 

 Manual Deploy Path (Emergency + CI Broken) 

 When CI is broken or for emergency fixes, follow this path: 

 Step 1: Build Docker Image 

 cd /Users/makinja/ALAI/products/Bilko

docker build \
 --platform linux/amd64 \
 -f apps/web/Dockerfile \
 --build-arg NEXT_PUBLIC_API_URL=https://bilko-api-762788903040.europe-north1.run.app/api/v1 \
 -t europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web:fix-<purpose>-<DDmon> \
 .
 

 Image tag convention: 
 
 ✅ fix-bugs-22apr , fix-logo-23apr 
 ❌ latest (not traceable) 
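The tag convention can be checked before pushing. The sketch below is one possible reading of `fix-<purpose>-<DDmon>`: it assumes the purpose is lowercase letters and digits (optionally hyphen-separated) and that the date is a two-digit day followed by a three-letter English month. Both helper names are hypothetical.

```python
import re
from datetime import date

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

# fix-<purpose>-<DDmon>, e.g. fix-bugs-22apr (the purpose charset is an assumption).
TAG_RE = re.compile(
    r"fix-[a-z0-9]+(?:-[a-z0-9]+)*-(?:0[1-9]|[12][0-9]|3[01])(?:" + "|".join(MONTHS) + r")"
)

def is_traceable_tag(tag: str) -> bool:
    """True if the tag follows the fix-<purpose>-<DDmon> convention."""
    return TAG_RE.fullmatch(tag) is not None

def make_tag(purpose: str, day: date) -> str:
    """Build a tag for the given purpose and date without relying on the locale."""
    return f"fix-{purpose}-{day.day:02d}{MONTHS[day.month - 1]}"

example = make_tag("bugs", date(2026, 4, 22))
```

Calling `is_traceable_tag` in a wrapper script before `docker push` would reject `latest` and similar untraceable tags.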
 

 Context reduction (.dockerignore): As of 2026-04-22, .dockerignore reduces build context from 4.1GB → 50MB by excluding node_modules , .next , apps/e2e , docs , etc. 

 Step 2: Push to Artifact Registry 

 gcloud auth configure-docker europe-north1-docker.pkg.dev

docker push europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web:fix-<purpose>-<DDmon>
 

 Step 3: Deploy to Cloud Run 

 CRITICAL: Do NOT use --no-traffic flag for existing services. It blocks traffic promotion. 

 gcloud run deploy bilko-web \
 --image europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web:fix-<purpose>-<DDmon> \
 --region europe-north1 \
 --platform managed \
 --allow-unauthenticated \
 --max-instances 10 \
 --min-instances 0 \
 --memory 512Mi \
 --cpu 1 \
 --concurrency 100 \
 --timeout 60s \
 --port 3000 \
 --set-env-vars NEXT_PUBLIC_API_URL=https://bilko-api-762788903040.europe-north1.run.app/api/v1,NEXT_TELEMETRY_DISABLED=1 \
 --project=tribal-sign-487920-k0
 

 Step 4: Verify Deployment 

 # Check revisions
gcloud run revisions list \
 --service bilko-web \
 --region europe-north1 \
 --project=tribal-sign-487920-k0 \
 --limit=5

# Verify traffic routing (should show 100% on latest revision)
gcloud run services describe bilko-web \
 --region europe-north1 \
 --project=tribal-sign-487920-k0 \
 --format='value(status.traffic)'
 

 Post-Deploy Evidence Gate (ZAKON PI2 Check 5) 

 MC task CANNOT move to done without ALL three: 

 
 curl checks: Paste output showing HTTP 200 for expected routes
 curl -sI https://bilko-demo.alai.no | head -3
curl -sI https://bilko-demo.alai.no/invoices/new | head -3
curl -sI https://bilko-demo.alai.no/settings | head -3
curl -sI https://bilko-demo.alai.no/intesa-bridge | head -3 # Should be 404
 
 
 Playwright screenshots: Stored in docs/evidence/<task-id>/*.png 
 
 Home page 
 Feature verified (e.g., invoice template save button) 
 Any isolation checks (e.g., 404 for client routes on main) 
 
 
 verification.json: Machine-readable evidence file
 {
 "task_id": 8730,
 "timestamp": "2026-04-22T21:41:10Z",
 "revision": "bilko-web-00019-7tl",
 "traffic_100_percent": true,
 "curl_checks": { "home": 200, "intesa-bridge": 404, ... },
 "playwright_pass": true,
 "screenshots": ["home.png", "invoices-new.png", ...]
}
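
 The curl checks in the evidence gate can be scripted so that each route is compared against its expected status code rather than read by eye. The sketch below is an illustration under stated assumptions: check_routes and fetch_status are hypothetical names, and the path=code argument syntax is invented for this example (it does not handle paths that contain an equals sign).

```shell
# Return the HTTP status code of a HEAD request (curl prints 000 if the request fails).
fetch_status() {
  curl -s -o /dev/null -I -w '%{http_code}' --max-time 10 "$1"
}

# Hypothetical helper: compare each "path=expected_code" pair against the live status.
# Usage: check_routes <base_url> <path=code> [<path=code> ...]
check_routes() {
  base_url="$1"
  shift
  failures=0
  for pair in "$@"; do
    path="${pair%%=*}"       # text before the first "="
    expected="${pair##*=}"   # text after the last "="
    actual="$(fetch_status "$base_url$path")"
    if [ "$actual" = "$expected" ]; then
      echo "OK   $path $actual"
    else
      echo "FAIL $path expected=$expected actual=$actual"
      failures=$((failures + 1))
    fi
  done
  # Succeed only when every route matched.
  [ "$failures" -eq 0 ]
}
```

 A call such as check_routes https://bilko-demo.alai.no /=200 /invoices/new=200 /settings=200 /intesa-bridge=404 mirrors the four curl lines above. The printed lines can still be pasted as evidence, and the exit status can gate the next step.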
 
 
 

 Deploy Flow Diagram 

 flowchart LR
 A[Code Change] --> B{CI Healthy?}
 B -->|Yes| C[CI: Build + Push]
 B -->|No| D[Manual Build]
 C --> E[Artifact Registry]
 D --> E
 E --> F[Cloud Run Deploy]
 F --> G{Traffic Routing}
 G -->|100%| H[Live]
 G -->|0%| I[Blocked - Check --no-traffic flag]
 H --> J[Evidence Gate]
 J --> K{All 3 checks pass?}
 K -->|Yes| L[MC task done]
 K -->|No| M[Block - Add evidence]
 

 Known Issues + Workarounds 

 Issue 1: CI broken since 2026-04-15 
 Symptom: All main branch pushes fail at deploy step 
 Root cause: GitHub Actions quota + --no-traffic flag 
 Workaround: Use manual deploy path above 

 Issue 2: Intesa content leaked to public URL (fixed 2026-04-22) 
 Symptom: /intesa-bridge route returned 200 on bilko-demo.alai.no 
 Root cause: Intesa feature branch merged to main 
 Fix: Deleted intesa routes from main (commit 66d2220) + added branch-purity.yml CI check 

 Issue 3: Manual paste-copy anti-pattern 
 Symptom: CEO had to manually paste docker build output and gcloud commands 
 Root cause: FlowForge task dispatched after image built locally 
 Fix: Always dispatch FlowForge BEFORE build step, let agent own full flow 

 Branch Purity Rules 

 Client-specific routes MUST NOT appear on main. Reserved prefixes: 
 
 intesa-* → feat/intesa-bih-demo → bilko-intesa-demo Cloud Run 
 corpint-* → TBD client branch → TBD Cloud Run service 
 

 CI Enforcement: .github/workflows/branch-purity.yml runs on every PR to main: 

 find apps/web/app -type d \( -name "intesa-*" -o -name "corpint-*" \) | grep . && exit 1 || exit 0
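
 The one-liner above exits 0 when find produces no output, which also happens if apps/web/app is missing. A more explicit variant is sketched below. It is an illustration, not the content of branch-purity.yml, and the function name check_branch_purity is hypothetical.

```shell
# Hypothetical variant of the branch-purity check.
# Returns 0 when clean, 1 when reserved prefixes are found, 2 when the directory is missing.
check_branch_purity() {
  app_dir="${1:-apps/web/app}"

  if [ ! -d "$app_dir" ]; then
    echo "route directory not found: $app_dir" >&2
    return 2
  fi

  # Same match as the CI one-liner: directories named intesa-* or corpint-*.
  matches="$(find "$app_dir" -type d \( -name 'intesa-*' -o -name 'corpint-*' \))"

  if [ -n "$matches" ]; then
    echo "Client-specific routes found:" >&2
    echo "$matches" >&2
    return 1
  fi
  return 0
}
```

 Separating the "directory missing" case from the "clean" case means a renamed or moved app directory fails loudly instead of passing the check.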
 

 Registry: ~/system/rules/client-prefix-registry.md 

 Domain Mapping 

 
 bilko-demo.alai.no → Cloud Run service bilko-web (configured via GCP Console) 
 DNS: Cloudflare proxy enabled 
 Mapping verified: 2026-04-22 
 

 Related Documentation 

 
 DEPLOY-MAP.md: /Users/makinja/ALAI/products/Bilko/DEPLOY-MAP.md 
 Incident Postmortem: BookStack → ALAI / Incidents / incident-2026-04-22-bilko-deploy-fix 
 ZAKON PI2: ~/system/rules/zakon-pi2-deploy-verification.md 
 CI Workflow: .github/workflows/gcp-deploy.yml 
 Dockerfile: apps/web/Dockerfile 
 

 Escalation 

 Owner: FlowForge 
 Escalate to: John → pi-orchestrator 
 MC category: devops + priority: H 

 

 Created by ALAI Skillforge, 2026-04-22

Bilko CI/CD — Stage→Prod Pipeline (MC #99477)
Overview 
 Stage pipeline: push-main → bilko-stage-auto-deploy → cloudbuild-stage.yaml → bilko-{web,api}-stage 
 Prod pipeline: tag v* → bilko-main-deploy → cloudbuild.yaml → bilko-{web,api} 
 Stage pipeline is optimized for FAST FEEDBACK — no quality gates. Prod pipeline has 8 production gates including SHA verification, Trivy scanning, Flyway migrations, and Cloud Build native approval. 

 Stage Pipeline 
 
 
 
 | Step | Purpose | Image Tag | Duration (avg) |
 | --- | --- | --- | --- |
 | sanity-check | Verify Docker socket + Artifact Registry reachability (environment health, NOT a quality gate) | - | ~2.3s |
 | build-web | Build Next.js app with docker buildx (apps/web/Dockerfile) | :stage-${SHORT_SHA} :stage-latest | ~3m |
 | push-web | Push image to Artifact Registry (europe-north1-docker.pkg.dev/tribal-sign-487920-k0/bilko/web) | - | ~7s |
 | migrate-db | Run Flyway migrations against Cloud SQL bilko-staging-db (POSTGRES_16) via Cloud SQL proxy | - | ~22s |
 | deploy-web-stage | Deploy bilko-web-stage Cloud Run service with :stage-${SHORT_SHA} image, --no-traffic | - | ~39s |
 | promote-web-stage | Route 100% traffic to new revision (no canary for stage) | - | ~10s |
 | deploy-api-stage | Deploy bilko-api-stage (redeploys EXISTING image only; no API build step, see OCD-1) | - | ~19s |
 | smoke-test | curl -sf https://bilko-api-stage-dh4m46blja-lz.a.run.app/api/v1/health ; exit 1 if non-200 | - | ~2.5s |
 
 
 
 Total duration: ~5 minutes (build 6f2236f6, validated 2026-05-06) 

 Prod Pipeline 
 The existing prod pipeline (cloudbuild.yaml) has 8 gates and MUST NOT be rewritten. Gates referenced here: 
 
 SHA verification (Git commit SHA in image metadata) 
 Trivy vulnerability scanning 
 Flyway migration validation 
 Cloud Build native approval (approval_required=true in modules/build/main.tf) 
 Smoke tests (health endpoint + web homepage) 
 Gradual traffic rollout (0% → 100%) 
 Rollback on smoke test failure 
 
 Prod pipeline is BLOCKED on OCD-5 (bilko-db Cloud SQL instance does not exist — requires CEO approval for provisioning). 

 Triggers 
 
 
 
 | Trigger Name | Filename | Branch/Tag | Approval | Service Account |
 | --- | --- | --- | --- | --- |
 | bilko-stage-auto-deploy | infrastructure/gcp/cloudbuild-stage.yaml | ^main$ | No (auto-deploy) | 762788903040@cloudbuild.gserviceaccount.com |
 | bilko-main-deploy | infrastructure/gcp/cloudbuild.yaml | v* (semver tag) | Yes (Cloud Build UI) | 762788903040@cloudbuild.gserviceaccount.com |
 
 
 
 GCP project: tribal-sign-487920-k0 , region: europe-north1 

 Open Risks — 5 CEO Decisions Required 
 These items require CEO judgment and are NOT resolved in this implementation: 

 OCD-1: bilko-api Build Pipeline Gap 
 Status: OPEN — BLOCKER for API continuous delivery 
 Current state: bilko-api-stage is live and serving traffic at https://bilko-api-stage-dh4m46blja-lz.a.run.app/api/v1 with image api:stage-b7e8a59 . No Cloud Build pipeline exists for the Kotlin/Ktor API. Dockerfile path unconfirmed. 
 Impact: Stage cloudbuild-stage.yaml deploy-api-stage step redeploys the EXISTING API image only — cannot build new API images. API deployments must be manual via gcloud run deploy until resolved. 
 CEO decisions needed: 
 
 What is the canonical Dockerfile path for apps/api? 
 Should API have its own Cloud Build step in cloudbuild-stage.yaml or a separate trigger? 
 Is bilko-api currently deployed manually via gcloud run deploy ? 
 

 OCD-2: Stage Hostname — bilko-stage.alai.no vs Raw .run.app URL 
 Status: OPEN — affects CORS configuration 
 Current state: ENV-MATRIX.md CORS_ORIGINS for staging references staging.bilko.io (STALE). terraform.tfvars stage_api_url points to raw .a.run.app URL. Stage pipeline uses raw .run.app URL as default. 
 Impact: Frontend CORS errors if staging.bilko.io DNS is ever pointed at stage services. 
 CEO decision needed: Should bilko-stage.alai.no be the canonical stage hostname? If yes: Cloudflare DNS entry (manual — not in Bilko TF stack) + CORS_ORIGINS update required via separate MC. 

 OCD-3: Postgres Version Mismatch — Stage POSTGRES_16 vs Prod POSTGRES_15 
 Status: OPEN — CRITICAL for financial data integrity 
 Current state: bilko-staging-db runs POSTGRES_16 (confirmed live). envs/prod/main.tf line 94 specifies POSTGRES_15 for prod (bilko-db does not exist yet — see OCD-5). Stage validates migrations and queries against PG16; prod would run PG15. 
 Impact: For a financial accounting SaaS, stage validation on PG16 while prod runs PG15 invalidates the "stage-as-test-environment" premise. Schema compatibility unverified. SQL dialect differences (PG15→PG16) may surface as prod-only bugs. 
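 CEO decision needed: Align prod to POSTGRES_16 OR downgrade stage to POSTGRES_15? Because bilko-db does not exist yet (see OCD-5), moving prod to POSTGRES_16 is a Terraform change at envs/prod/main.tf line 94; a maintenance window with pg_upgrade or dump/restore would be needed only if an existing PG15 instance has to be migrated. The ALAI standard tech stack (ALAI/CLAUDE.md) mandates POSTGRES_16 for all products, which suggests the prod config is non-compliant. 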

 OCD-4: Stage → Prod SHA Promotion Strategy 
 Status: OPEN — architectural decision 
 Current state: The prod trigger fires on a semver tag push and rebuilds from source. The stage-validated image digest is NOT carried to the prod build, so stage tests one image and prod deploys a different image built from the same commit. If a hot dependency updates between the stage build and the prod build (e.g., the npm registry serves a new patch version), stage and prod can diverge on identical Git SHAs. 
 CEO decision needed: 
 
 Option A: Accept rebuild-on-tag (simpler, current model) with acknowledgment of hot-dependency risk. 
 Option B: Implement digest promotion where prod trigger accepts an image digest input parameter and skips rebuild. Requires Cloud Build trigger API call from a promotion script or Google Cloud Deploy. 
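
 To make Option B concrete: digest promotion means prod deploys the exact image that stage validated, addressed as repository@sha256:... rather than by a mutable tag. The helper below is a sketch under assumptions. The name image_ref_by_digest is hypothetical, and how the stage digest would be captured and passed to the prod trigger is not defined in this document.

```shell
# Hypothetical helper: turn a repository path and a digest into an immutable image reference.
image_ref_by_digest() {
  repo="$1"
  digest="$2"

  # A sha256 digest is "sha256:" followed by 64 lowercase hex characters.
  if ! printf '%s\n' "$digest" | grep -Eq '^sha256:[0-9a-f]{64}$'; then
    echo "not a sha256 digest: '$digest'" >&2
    return 1
  fi

  printf '%s@%s\n' "$repo" "$digest"
}
```

 The resulting reference could then be passed to gcloud run deploy --image in place of a tag, so no rebuild happens between stage validation and the prod rollout. Rejecting anything that is not a full digest keeps a tag such as stage-latest from slipping through.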
 

 OCD-5: Prod Cloud SQL bilko-db Existence 
 Status: OPEN — BLOCKER for prod terraform apply 
 Current state: gcloud sql instances list --project=tribal-sign-487920-k0 shows ONLY bilko-staging-db. No bilko-db (prod) exists. envs/prod/main.tf explicitly notes "bilko-db (prod) — TBD — audit required" (lines 4-6 and import.sh). 
 Impact: Any terraform apply on envs/prod would attempt to create a REGIONAL HA POSTGRES_15 db-custom-2-7680 instance (~$100+/month). Without CEO sign-off, prod infra is BLOCKED. 
 CEO decision needed: Approve prod DB provisioning (cost + data migration strategy if migrating from elsewhere) before ANY envs/prod TF apply is ever run. If bilko-db exists elsewhere (on-prem? Railway?), import.sh must be run first. 

 Validation 
 Evidence file: /tmp/99477-proveo-evidence.md 
 Build ID: 6f2236f6-86ec-444c-96b7-7c22f63cf5a2 
 Build log: View in GCP Console 
 Validation date: 2026-05-06T20:28Z 
 Validator: Angie Jones (Proveo) 
 Verdict: PASS — 7/7 Acceptance Criteria met 
 
 AC1: Build SUCCESS (all 8 steps SUCCESS) 
 AC2: bilko-web-stage HTTP/2 200 
 AC3: bilko-api-stage health endpoint 200 {"status":"ok"} 
 AC4: New web revision deployed within 5min window 
 AC5: Flyway migrate-db ran without error (21.5s) 
 AC6: No gate-* steps executed (0 quality gates) 
 AC7: Image pushed with :stage-${SHORT_SHA} tag (stage-277dd5a confirmed in Artifact Registry) 
 

 Related MCs 
 
 #99395 — VAT enum-cast genesis (billing_country ENUM cast to TEXT in Flyway migration) 
 #99422 — Sibling task (stage Cloud Run services health check) 
 #99477 — This task (Stage CI/CD pipeline implementation) 
 

 ZAKON PI2 Compliance Status 
 Stage pipeline: ✅ COMPLIANT 
 
 DEPLOY-MAP.md exists at repo root ✅ 
 Pre-flight checks executed (4 probes: triggers, GCS bucket, Cloud Run services, SQL instances) ✅ 
 Post-deploy validation (curl 200 + Cloud Run revision evidence) ✅ 
 Evidence files delivered (/tmp/99477-preflight.txt, /tmp/99477-proveo-evidence.md) ✅ 
 
 Prod pipeline: ⏸ BLOCKED (awaiting OCD-5 bilko-db provisioning approval) 

 Last Updated 
 2026-05-06, owner: FlowForge (Kelsey Hightower)

ALAI CI/CD Blueprint Standardization 2026-05-08
 Master MC: #99881
 Owner: John (AI Director) + Petter Graff persona for canonical refresh
 Status: All 4 phases verified closed. Triple-layer enforcement live.
 Cost: ~$15-30 in LLM tokens 
 Context 
 A CEO directive on 2026-05-08 ("Discuss CI/CD pipelines and blueprints") was delivered in a single-day push: triple-layer mechanical enforcement live, 7/7 fleet compliance, and free-first routing across persona blueprints. 
 4-phase arc summary 
 
 
 
 | Phase | MC | Outcome |
 | --- | --- | --- |
 | 1: Audit | #99882 | 4 artifacts in ~/system/specs/cicd-audit-2026-05-08/ (gap matrix, deploy-map matrix, canonical self-audit, summary). 1 real bug caught: DropSrbija/BUILD-BLUEPRINT.md line 225 stale "Postgres 5434" comment (actual port 5436). |
 | 2: Canonical refresh | #99886 | UNIVERSAL bumped to v3.0 (§13 6-mandatory files including DEPLOY-MAP, §15 forma-only variant, §16.3 CI gates, ZAKON PI2 invariant). DEPLOY bumped to v2.0 (multi-profile §1A GCP / §1B Azure VM / §1C Cloudflare Pages / §1D Vercel deprecated). blueprint-format.md disambiguation header (YAML agent layer vs MD product layer). alai-cicd-architecture.md staleness notice (sections §5.2 AWS, §9 Phase 3 superseded). |
 | 3: Product migration | #99896 | 7 in-scope products migrated to v2 §1A/§1B/§1C profiles. 6 new mandatory files created (web PIPELINE/RUNBOOK/CHANGELOG, Gotiva RUNBOOK/CHANGELOG, Drop PIPELINE). Drop §1B refactor reached FULL_COMPLIANCE 5/5 schema. Excluded: BasicFakta (MC #99893 Vercel→CF Pages migration), DropSrbija (MC #99883 scope decision), akershus-fylke (forma-only). |
 | 4: Enforcement | #99911 | Triple-layer mechanical enforcement live. |
 
 
 
 Triple-layer enforcement (all live, all verified) 
 1. Linter — ~/system/tools/blueprint-check.js v2 
 Dual-mode (backward compat with mehanik-commit + pre-dispatch-gate Check 9): 
 
 Rubric mode (default, original): scores BUILD-BLUEPRINT.md 0-100 across 6 checks. Exit 0 if ≥ 60. 
 Inventory mode ( --inventory ): checks 6 mandatory files per UNIVERSAL v3 §13. Validates DEPLOY-MAP.md schema 5/5 per DEPLOY v2 §4. Respects forma-only flag. Verdict states: FULL_COMPLIANCE / FORMA_ONLY_OK / PARTIAL_SCHEMA / MISSING_FILES. 
 
 JSON output reusable by hook + daemon. 
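
 The four inventory verdict states can be read as a simple decision order. The sketch below is an inference from the verdicts reported in the fleet compliance table, not the actual logic of blueprint-check.js. The function name inventory_verdict, its argument order, and the precedence (missing files first, then the forma-only flag, then schema score) are all assumptions.

```shell
# Hypothetical sketch of the inventory verdict; the real logic lives in blueprint-check.js.
# Args: files_present files_required schema_score schema_max [forma_only: 0 or 1]
inventory_verdict() {
  files_present="$1"
  files_required="$2"
  schema_score="$3"
  schema_max="$4"
  forma_only="${5:-0}"

  if [ "$files_present" -lt "$files_required" ]; then
    echo MISSING_FILES        # any mandatory file absent
  elif [ "$forma_only" -eq 1 ]; then
    echo FORMA_ONLY_OK        # non-deployable product, schema not applicable
  elif [ "$schema_score" -lt "$schema_max" ]; then
    echo PARTIAL_SCHEMA       # files present, DEPLOY-MAP.md schema incomplete
  else
    echo FULL_COMPLIANCE
  fi
}
```

 Under these assumptions, inventory_verdict 6 6 5 5 prints FULL_COMPLIANCE and inventory_verdict 5 6 0 5 prints MISSING_FILES.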
 2. PostToolUse hook — ~/.claude/hooks/blueprint-schema-validator.sh 
 Registered in settings.json under Write|Edit|MultiEdit matcher. Triggers on writes to product-root DEPLOY-MAP.md files under ~/business/ALAI-Holding-AS/{products,web,finance}/*/ . Blocks with exit 2 + structured BLOCKED message + missing sections + template pointers when schema fails. Override marker: <!-- blueprint-schema-validator: skip --> . 
 Trace log: ~/system/state/blueprint-schema-validator-trace.log . 
 3. Nightly daemon — ~/system/daemons/blueprint-fleet-watchdog.js 
 LaunchAgent com.alai.blueprint-fleet-watchdog schedules daily 06:15. Scans 10 product roots, persists state to ~/system/state/blueprint-fleet-status.json , detects regressions (verdict drop, schema score drop, file removal) with differential alert. Exit 1 on regression. 
 Free-first routing (CEO directive: "enable free models wherever you can") 
 ~/system/config/tier-routing.json updated: 
 
 MLX FORGE tiers added : M2 (gemma-4-26b@11435), M2c (qwen3-coder-30b@11437), M3 (qwen3-32b@11436). All 3 servers verified live via curl before adding to canonical. 
 callerRoutes added : verifier→2cHQ , fix-builder→2c , redzo-reviewer→M2c . 
 providerFallback chains : verifier (MLX → Ollama ANVIL → Claude secondary), fix-builder (Ollama → Ollama → Claude secondary). 
 
 Persona blueprint sweep (MC #99923): 13 yaml files — 9 all-sonnet personas (AgentForge, Axiom, Finverge, FlowForge, Lexicon, Proveo, Resolver, Skybound, Vizu) + 4 CodeCraft yaml (api-backend, codecraft-api, nextjs-app, openapi-sdk-package). 46 phase declarations swept sonnet → local-first (qwen2.5-coder:32b@anvil for general phases, qwen3-coder:latest@forge for code-gen phases). 6 KEPT-sonnet phases with explicit rationale: 3 Lexicon legal phases (Norwegian law / GDPR / PSD2 regulatory precision), 3 Resolver cross-company phases (multi-domain reasoning). 
 Verifier pattern proven 
 bp-verifier background agent ran ~15 rounds, ~178 atomic claims, 2 real bugs caught: 
 
 DropSrbija/BUILD-BLUEPRINT.md line 225 stale comment "Postgres 5434" (actual port 5436 per docker-compose.yml). Fixed in both audit artifact + product blueprint. 
 Drop/DEPLOY-MAP.md schema 3/5 PARTIAL — no formal OPEN RISK / OCD register, no SA distinction. Fixed via §1B-appropriate equivalents (SSH key → Trigger SA equivalent, container USER → Service SA equivalent). 
 
 Pattern recommendation : For every multi-phase project, spawn named bp-verifier in BG ( Agent({subagent_type: "verifier", name: "bp-verifier", run_in_background: true}) ), send each artifact via SendMessage for atomic claim validation, fix-loop on FAIL. Cost: $0.10 per round Claude ( $0 if MLX primary per new tier-routing). 
 Fleet compliance final (verified by daemon 2026-05-08) 
 
 
 
 | Product | Verdict | Files | Schema | Profile |
 | --- | --- | --- | --- | --- |
 | Bilko | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP |
 | Tok | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP |
 | Drop | FULL_COMPLIANCE | 6/6 | 5/5 | §1B Azure VM |
 | Lobby | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP (stub) |
 | Plock | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP (stub) |
 | Gotiva | FULL_COMPLIANCE | 6/6 | 5/5 | §1A GCP multi-service |
 | web | FULL_COMPLIANCE | 6/6 | 5/5 | §1C CF Pages |
 | akershus-fylke | FORMA_ONLY_OK | 1/1 | N/A | non-deployable |
 | BasicFakta | MISSING_FILES | 5/6 | 0/5 | §1D Vercel deprecated (MC #99893 migration backlog) |
 | DropSrbija | MISSING_FILES | 3/6 | 0/5 | scope decision pending (MC #99883) |
 
 
 
 Open follow-ups (parked, not blocking arc closure) 
 
 #99883 DropSrbija scope decision (separate product vs Drop multi-tenant) — needs petter-graff arch memo 
 #99893 BasicFakta Vercel→CF Pages migration — 3-4h work + 30d soak 
 #99895 Coverage threshold review scheduled 2026-05-22 (after 2-week observability) 
 #99955 Securion task/owner schema canonical alignment (L) 
 
 Git audit trail 
 
 ~/system commit: a02fd0109 — 29 files, +6184/-122 (canonical v3 + audit artifacts + linter v2 + daemon + tier-routing + 13 persona blueprints) 
 ~/.claude commit: bf2ca2d49 — hook + settings.json registration 
 
 Lessons 
 
 Verifier-in-bg catches real bugs: propagated stale comments + schema gaps. USE THIS PATTERN for every multi-phase project. 
 Mehanik enforcement >> ZAKON-only: hook + daemon catch what a memo can't. UNIVERSAL §13 / DEPLOY §4 are now mechanically enforced. 
 Local-first viable for builder/verifier: qwen2.5-coder + qwen3-coder + MLX qwen3-coder-30b are sufficient for schema validation, code gen, doc draft. Sonnet stays for high-stakes synthesis (legal, cross-company). 
 Closure-loop discipline: build-verify-mark-done pattern, not build-verify-stop. The CEO caught a gap in mid-session closure ("is everything documented, merged, and closed by the book?") and triggered this BookStack publish + git commit + memory entry. 
 
 References 
 
 Memory project entry: ~/.claude/projects/-Users-makinja/memory/project_cicd_standardization_2026-05-08.md 
 Audit artifacts: ~/system/specs/cicd-audit-2026-05-08/{blueprint-gap-matrix,deploy-map-gap-matrix,canonical-self-audit,summary}.md 
 v3 drafts (review trail): ~/system/specs/cicd-canonical-v3-drafts/ 
 Canonical (production): ~/system/specs/{ALAI-UNIVERSAL-BLUEPRINT,DEPLOY-BLUEPRINT,blueprint-format,alai-cicd-architecture}.md 
 Pre-promotion backups: ~/system/specs/_backups/20260508-111700/

Slack bot token SSOT — slack.json (MC #102830) — 2026-06-03
Summary 
 MC #102830 makes ~/system/config/slack.json the single source of truth (SSOT) for the Slack bot's tokens, with environment-variable fallback, and removes the hardcoded tokens from the LaunchAgent plist. Previously the com.john.slack-bot.plist hardcoded both SLACK_BOT_TOKEN and SLACK_APP_TOKEN in EnvironmentVariables — so a token rotation that wasn't mirrored into the plist would strand the daemon with a stale token. 
 Change 
 
 slack.json ( ~/system/config/slack.json , mode 0600): now holds token (xoxb bot), app_token (xapp), workspace , bot_name . 
 slack-bot.js loadSlackTokens() (line ~443): reads slack.json first (SSOT); returns {botToken, appToken} when both present; otherwise silently falls through to the existing Keychain → vault → env chain. Env-var fallback preserved. 
 com.john.slack-bot.plist : SLACK_BOT_TOKEN and SLACK_APP_TOKEN removed from EnvironmentVariables (GROQ/HOME/PATH untouched). plutil -lint OK. 
 run-slack-bot-reload.sh : reload wrapper added under ~/system/tools/ . 
 
 Token rotation procedure (new) 
 
 Edit ~/system/config/slack.json — update token (xoxb) and/or app_token (xapp). 
 bash ~/system/tools/run-slack-bot-reload.sh 
 
 No plist edit. No risk of stranding the daemon on rotation. 
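
 Before running the reload wrapper, the state described above can be sanity-checked from the shell. This is a minimal sketch, not part of run-slack-bot-reload.sh: the function name check_slack_ssot is hypothetical, and the grep patterns assume slack.json stores the tokens as plain "token" and "app_token" string values.

```shell
# Hypothetical pre-reload check for the Slack token files.
# Args: path to slack.json, path to the LaunchAgent plist.
check_slack_ssot() {
  json="$1"
  plist="$2"

  # slack.json must exist with mode 0600 exactly (find -perm without a prefix is an exact match).
  if [ -z "$(find "$json" -prune -perm 600 2>/dev/null)" ]; then
    echo "slack.json missing or not mode 0600" >&2
    return 1
  fi

  # Both tokens must be present with the expected Slack prefixes.
  if ! grep -Eq '"token"[[:space:]]*:[[:space:]]*"xoxb-' "$json"; then
    echo "bot token (xoxb) missing from slack.json" >&2
    return 1
  fi
  if ! grep -Eq '"app_token"[[:space:]]*:[[:space:]]*"xapp-' "$json"; then
    echo "app token (xapp) missing from slack.json" >&2
    return 1
  fi

  # The plist must exist and must no longer carry either token.
  if [ ! -f "$plist" ]; then
    echo "plist not found: $plist" >&2
    return 1
  fi
  if grep -Eq 'SLACK_(BOT|APP)_TOKEN' "$plist"; then
    echo "plist still references SLACK_*_TOKEN" >&2
    return 1
  fi
}
```

 The check only inspects key names and token prefixes, so it never prints a token value, which keeps it consistent with the masking rule in the security note.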
 Verification 
 
 plist: grep -c SLACK_*_TOKEN = 0; plutil -lint OK. 
 slack.json: mode 0600; keys token/app_token/workspace/bot_name. 
 slack-bot.js: node --check SYNTAX_OK; SSOT branch at line 443. 
 Daemon: launchctl PID 42749, LastExitStatus 0, stable; log Tokens loaded from slack.json (SSOT) + Slack bot started (Socket Mode) . 
 Live: slack.js send #ops succeeded (token valid). 
 Independent verifier (Company Mesh / eval-Proveo): PASS — mesh-thr-b04409c5-ab59-4ff1-bc24-b163433bd063 . 
 til-done: DONE — /tmp/til-done/102830-20260603T142915Z.json . 
 
 Security note 
 This also improves posture: secrets moved out of a (potentially world-readable) LaunchAgent plist into the 0600 slack.json. Token values are never logged (masked).

Bilko CI — integration-test job (Testcontainers) MC #102843 — 2026-06-03
Summary 
 MC #102843 adds an integration-test job to Bilko's .github/workflows/ci.yml . Previously the backend-test job ran only ./gradlew test , and tasks.test does excludeTags("integration") (apps/api/build.gradle.kts:159) — so the integrationTest task (Testcontainers/Postgres, includeTags("integration") ) never ran in CI . PRs that broke integration tests passed green (surfaced manually by Proveo during MC #102798). 
 Change (PR #245, base main, not merged) 
 
 New integration-test job: ubuntu-latest, Java 21, ./gradlew integrationTest --no-daemon in apps/api . Testcontainers spins its own Postgres (no services: block needed). 
 Non-blocking for now ( continue-on-error: true , NOT in build needs: ). 
 
 Why non-blocking (important) 
 Running the suite revealed it is currently broken on main: ~78/1147 integration tests fail (FlywayMigrateException in SettingsServiceRlsTest, ExposedSQLException in VatReportStatutoryGroupingTest, and others). These had never run in CI. Making the job a required gate immediately would red-lock every PR. So the job is visible on every PR (failures now surface) but does not block merges yet. 
 Path to required gate 
 Tracked in MC #102874 (H): fix the 78 failing integration tests. Once green, promotion is a small CI change: remove continue-on-error: true and add integration-test to build needs: [lint, unit, backend-test, integration-test] . 
 Verification 
 
 CI run 26900887524 : integration-test job executed; log shows Task :integrationTest , 1147 tests / 78 failed, Testcontainers Postgres started. 
 yaml-lint PASS, actionlint PASS, gitleaks 0, diff = ci.yml only. 
 Independent pre-verifier (Company Mesh / Proveo): PASS — mesh-thr-8f34975b-a1ef-4d99-8bd8-cd9f894a022a . 
 til-done: DONE — /tmp/til-done/102843-20260603T172857Z.json . 
 
 Incident (logged, low severity) 
 During implementation a build branch was accidentally pushed to origin/main (commit ecf5a97 ) and immediately reverted ( 036e2c6 ). It triggered bilko-stage-auto-deploy twice; both SUCCESS, change was ci.yml -only (no app artifact change), stage is non-customer-facing. origin/main verified clean afterwards. Lesson recorded: build agents must git push -u origin HEAD:<branch> and verify upstream ≠ origin/main (push to Bilko main auto-deploys stage).
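
 The lesson above can be backed by a small guard that build agents call before pushing. This is a sketch under assumptions: the function name safe_refspec is hypothetical, and it only covers the literal branch names main and refs/heads/main.

```shell
# Hypothetical guard: refuse to produce a refspec that targets main.
# Prints the refspec to use with `git push -u origin` on success.
safe_refspec() {
  branch="$1"
  case "$branch" in
    ""|main|refs/heads/main)
      echo "refusing to push to '$branch'" >&2
      return 1
      ;;
  esac
  printf 'HEAD:%s\n' "$branch"
}
```

 A caller could run refspec="$(safe_refspec "$BRANCH")" && git push -u origin "$refspec" so that a rejected branch name stops the push entirely. Afterwards, git rev-parse --abbrev-ref '@{upstream}' shows the configured upstream and should not print origin/main.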

Bilko integrationTest suite green — 79->0 failures (MC #102874) — 2026-06-03
Summary 
 MC #102874 took the Bilko backend integrationTest suite from 79 failing → 0 failing (1213 tests, 91 suites) . These integration tests had never run in CI ( tasks.test does excludeTags("integration") ); the new CI job from MC #102843 exposed the rot. PR #246 (base main , not merged). 
 Root-cause clusters fixed 
 
 A. Stale error-envelope assertions (~39): tests asserted legacy "FORBIDDEN" / "NOT_FOUND" bodies; the app correctly returns RFC7807 with errorCode BILKO-AUTH-003 / BILKO-INV-001 . Tests updated to assert the strict error codes (stronger, not weaker). 
 B. Flyway init cascade (3 init suites): V30_1__ensure_bilko_admin_role.sql ran GRANT bilko_admin TO CURRENT_USER as bilko_admin → self-membership SQLSTATE 0LP01 on newer PG → migration abort. Fixed test-side (Testcontainers withUsername → bilko_test , a non-admin user). Production migration left immutable. 
 C1. PasswordReset SQL: SET LOCAL cannot take a subquery → resolve orgId via Exposed then pass the literal UUID. 
 C3. Expense semantics: self-approval / unknown-vendor exception types aligned with documented MC #102746 behavior. 
 Final 3 (this session): 
 
 VatReportStatutoryGroupingTest init error — invoice_items seed omitted NOT-NULL line_number → added it to 5 inserts. 
 HrFullPathE2ETest T1 + ImpersonationSessionIntegrationTest T8 (same cause) — ArchiveService.triggerBackgroundBuild fire-and-forget coroutine wrote completed_at to an already-closed Hikari pool → uncaught ExposedSQLException leaked into unrelated tests. Fixed: catch ExposedSQLException / SQLException -on-closed-pool and log (graceful degrade), rethrow CancellationException . Production-correct hardening. 
 
 
 
 Assertion-strength (anti-gate-gaming) 
 Two tests had been widened to multi-code accept lists; re-pinned to single deterministic codes: 
 
 InvoiceRoutes TS-INV-01 → exactly 503 MARKET_NOT_AVAILABLE (RS org deterministic). 
 BankingRoutes non-existent accountId → exactly 201 (auto-provisions Asset GL).
No tests were disabled ( @Disabled / @Ignore ), excluded via excludeTags , or deleted. Flyway migrations untouched. 
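A minimal sketch of the difference between a widened and a pinned assertion, written in Python for illustration only (the real tests live in the Bilko backend suite). The helper name assert_pinned, the response dict shape, and the assumption that MARKET_NOT_AVAILABLE appears in an errorCode body field are all hypothetical.

```python
def assert_pinned(response, status, error_code=None):
    """Accept exactly one status (and optionally one error code), never a list."""
    assert response["status"] == status, (
        f"expected {status}, got {response['status']}"
    )
    if error_code is not None:
        actual = response["body"].get("errorCode")  # field name is an assumption
        assert actual == error_code, f"expected {error_code}, got {actual}"


# A widened check such as `response["status"] in (200, 404, 503)` passes for
# three different outcomes, so it can hide a regression. The pinned form
# below passes for one outcome only.
response = {"status": 503, "body": {"errorCode": "MARKET_NOT_AVAILABLE"}}
assert_pinned(response, 503, "MARKET_NOT_AVAILABLE")
```

A pinned check fails as soon as the endpoint starts returning any other status or code, which is the property the re-pinning above restores.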
 
 Verification 
 
 John forced fresh run: ./gradlew cleanIntegrationTest integrationTest --rerun-tasks --no-daemon → :integrationTest executed, BUILD SUCCESSFUL 2m31s . XML aggregate 1213 / 0 / 0 / 0 . 
 Independent pre-verifier (Company Mesh / Proveo): PASS — mesh-thr-5947b05c-a6c5-4b3b-9ac8-37d7e6a7ea2c . 
 til-done: DONE — /tmp/til-done/102874-20260603T193125Z.json . 
 PR #246: 36 source files, 0 build artifacts ; origin/main untouched. 
 
 Process note 
 Multi-session: one session produced the WIP fix (~76 failures); a John review session affirmed it ("NOT gate-gaming") and flagged 3 items; this session ran the decisive green run, fixed the last 3 failures plus the 2 flags, and closed. Earlier in the campaign a build agent accidentally pushed to main (reverted); the lesson is recorded. This PR was pushed cleanly with an explicit refspec, and main is untouched. 
 Downstream (NOT part of this fix) 
 
 Merge PR #246 → main green. 
 Then flip the CI gate (MC #102843): remove continue-on-error + add integration-test to build needs: . Cannot flip before merge (main is still red until then).

Prometheus Best Practices — USE vs RED
Prometheus Best Practices and Pitfalls 
 Source: YouTube Learning — Julius Volz (Prometheus co-founder), Swiss Cloud Native Day 2021 
 Indexed: 2026-06-15 (MC #103620) 
 
 USE vs RED: Decision Framework 
 USE Method (Resource-Oriented Systems) 
 For infrastructure components (CPU, memory, disk, network): 
 
 Utilization: % busy (0-100%) 
 Saturation: degree of queuing (wait time, queue length) 
 Errors: error count/rate 
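As a rough illustration, the three USE signals for a database connection pool could be derived from two pool snapshots as below. PoolSnapshot and its field names are hypothetical and not taken from any particular pool library.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Hypothetical point-in-time view of a connection pool."""
    active: int          # connections currently in use
    max_size: int        # configured pool limit
    waiting: int         # callers queued for a connection
    timeouts_total: int  # acquisition timeouts since start (a counter)


def use_signals(prev: PoolSnapshot, cur: PoolSnapshot, interval_s: float) -> dict:
    """Derive Utilization, Saturation and Errors from two snapshots."""
    return {
        # Utilization: share of the resource that is busy right now.
        "utilization_pct": 100.0 * cur.active / cur.max_size,
        # Saturation: work that is queued because the resource is full.
        "saturation_waiting": cur.waiting,
        # Errors: counter delta turned into a per-second rate.
        "errors_per_s": (cur.timeouts_total - prev.timeouts_total) / interval_s,
    }


prev = PoolSnapshot(active=6, max_size=10, waiting=0, timeouts_total=3)
cur = PoolSnapshot(active=8, max_size=10, waiting=2, timeouts_total=6)
signals = use_signals(prev, cur, interval_s=60.0)
print(signals)
```

Utilization and saturation are read from the latest snapshot, while errors need two snapshots because the underlying value is a counter.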
 
 When to use: Cloud Run instances, Azure Container Apps, database connections, worker threads, storage volumes. 
 RED Method (Request-Oriented Systems) 
 For services handling requests: 
 
 Rate: requests/second 
 Errors: failed request count or % 
 Duration: latency (p50, p95, p99) 
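A small sketch of how the three RED signals fall out of a window of request records. The record shape (status code, duration in seconds), the sample data, and the choice of nearest-rank percentiles are assumptions for illustration; Prometheus itself estimates percentiles from histogram buckets instead.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, pct in (0, 100]."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_signals(requests, window_s):
    """requests: list of (status_code, duration_seconds) seen in the window."""
    durations = sorted(duration for _, duration in requests)
    failed = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
    }


# Made-up sample: 20 requests in a 10-second window, one of them a 503.
sample = [(200, 0.05 * i) for i in range(1, 20)] + [(503, 2.0)]
red = red_signals(sample, window_s=10.0)
print(red)
```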
 
 When to use: REST APIs, BFF layers, RPC services, HTTP endpoints. 
 
 Custom Metrics in Application Code 
 Best Practices 
 
 Counter for events that only go up (requests, errors, jobs completed) 
 Gauge for values that go up/down (active connections, queue size, temperature) 
 Histogram for bucketed observations (latency, request size) — auto-generates _sum , _count , _bucket 
 Summary for client-side quantiles; these cannot be aggregated across instances, so prefer a Histogram with server-side quantiles in PromQL 
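To make the semantics of the first three types concrete, here is a toy, dependency-free model of them. It is a sketch only, not the API of any Prometheus client library; the class and method names are chosen for illustration.

```python
import bisect


class Counter:
    """Monotonic: can only be incremented."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Arbitrary value that may rise or fall."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Cumulative buckets plus a running sum and count."""
    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)
        self.raw = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the first bound >= value, i.e. its "le" bucket.
        self.raw[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def buckets(self):
        """Return {le: cumulative count}, the shape of the _bucket series."""
        out, running = {}, 0
        for le, n in zip(self.bounds + [float("inf")], self.raw):
            running += n
            out[le] = running
        return out


requests_total = Counter()
in_flight = Gauge()
latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    requests_total.inc()
    latency.observe(seconds)
in_flight.set(3)
print(latency.buckets(), latency.sum, latency.count)
```

The bucket counts are cumulative: every observation is counted in its own bucket and in all larger ones, which is what lets quantiles be estimated server-side.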
 
 Common Pitfalls 
 
 High cardinality labels (user IDs, UUIDs, timestamps) → cardinality explosion → OOM 
 Missing units in metric names ( http_request_duration vs http_request_duration_seconds ) 
 Inconsistent naming (mix of snake_case/camelCase) 
 Not exposing /metrics endpoint early in service development 
 Using Summary instead of Histogram (histograms aggregate better) 
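The first three pitfalls can be checked mechanically. The sketch below estimates the worst-case series count as the product of distinct label values and applies a simple naming check. The label counts are made-up numbers, and UNIT_SUFFIXES is an assumed, non-exhaustive allow-list.

```python
import re
from math import prod

UNIT_SUFFIXES = ("_seconds", "_bytes", "_ratio", "_total")  # assumed allow-list


def series_count(label_values: dict) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(label_values.values())


def name_problems(name: str) -> list:
    """Return naming issues: non-snake_case and/or missing unit suffix."""
    problems = []
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        problems.append("not snake_case")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("no unit suffix")
    return problems


bounded = {"method": 5, "status": 6, "route": 40}
with_user_id = {**bounded, "user_id": 100_000}

print(series_count(bounded))       # bounded label set
print(series_count(with_user_id))  # same metric with a per-user label
print(name_problems("http_request_duration"))
print(name_problems("http_request_duration_seconds"))
```

For a histogram, the series count is multiplied again by the number of buckets, so an unbounded label is even more expensive there.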
 
 
 PromQL Essentials 
 # Per-second rate of HTTP 5xx responses, averaged over the last 5 minutes
rate(http_requests_total{status=~"5.."}[5m])

# 95th percentile latency, aggregated across all series (the le label must be kept)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# CPU utilization in percent, per instance (USE)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Error ratio: share of requests that returned 5xx (RED)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) 
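To show what the percentile query computes, the sketch below re-implements the linear interpolation that histogram_quantile performs over cumulative buckets. It is a simplified model (it assumes 0 < q <= 1 and a lower bound of 0 for the first bucket), and the bucket counts are made up. In a real query the inputs are per-second rates from rate() rather than raw counts, but the interpolation is the same.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: ascending list of (upper_bound, cumulative_count), ending with
    the +Inf bucket. Assumes 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if upper == float("inf"):
                # Quantile falls in the open-ended bucket: clamp to the
                # highest finite bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative


sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p95 = histogram_quantile(0.95, sample_buckets)
print(p95)
```

Because the result is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries bracket the latencies that matter, for example the alert threshold.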
 
 How This Applies to ALAI 
 Current Infrastructure 
 
 Grafana: https://grafana.alai.no (monitoring hub) 
 Bilko APIs/BFF: Java/Spring Boot → RED metrics for /api/* endpoints 
 LumisCare BFF/services: Kotlin/Ktor → RED metrics for REST + USE metrics for connection pools 
 Cloud Run / Azure Container Apps: Platform exposes USE metrics (CPU, memory, request queue) 
 
 Recommended Next Steps 
 
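 Instrument Bilko/LumisCare services with Micrometer. On Spring Boot, adding micrometer-registry-prometheus and exposing the Actuator endpoint serves metrics at /actuator/prometheus ; the Ktor services have no Actuator and need their own scrape route (for example via Ktor's MicrometerMetrics plugin) 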
 Add RED dashboards for all user-facing APIs (Grafana template: https://grafana.com/grafana/dashboards/4701) 
 Add USE dashboards for Cloud Run / ACA resource health 
 Alert on SLIs: Error rate >1%, p95 latency >2s, CPU >80% 
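The thresholds above, written out as data with a small evaluation helper. This only sketches the comparison logic; in practice these would be Prometheus alerting rules, and the measurement names used here are hypothetical.

```python
THRESHOLDS = {
    "error_ratio": 0.01,      # error rate > 1%
    "p95_latency_s": 2.0,     # p95 latency > 2s
    "cpu_utilization": 0.80,  # CPU > 80%
}


def breached(measurements: dict) -> list:
    """Return the names of all thresholds exceeded by the measurements."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if measurements.get(name, 0.0) > limit
    )


current = {"error_ratio": 0.02, "p95_latency_s": 1.2, "cpu_utilization": 0.85}
print(breached(current))
```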
 
 ALAI-Specific Pitfall to Avoid 
 Do NOT add per-user or per-client labels to core metrics. Use organization_id buckets (max ~50) or aggregate at service level. High cardinality = Prometheus death. 
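One possible reading of "organization_id buckets" is to hash each organization into a fixed, small set of label values, so the label can never exceed the cap regardless of how many organizations exist. The sketch below assumes that reading; org_bucket and the label format are hypothetical names.

```python
import hashlib

MAX_BUCKETS = 50  # cap taken from the guideline above


def org_bucket(organization_id: str, buckets: int = MAX_BUCKETS) -> str:
    """Map an organization ID to one of a fixed number of label values."""
    # sha256 keeps the mapping stable across processes and restarts,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(organization_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % buckets
    return f"org_bucket_{index:02d}"


labels = {org_bucket(f"org-{i}") for i in range(10_000)}
print(len(labels))
```

The trade-off is that a bucket no longer identifies a single organization; per-organization detail would have to come from logs or traces rather than metric labels.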
 
 References 
 
 Prometheus docs: https://prometheus.io/docs/practices/naming/ 
 USE Method: http://www.brendangregg.com/usemethod.html 
 RED Method: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/ 
 Micrometer + Spring Boot: https://micrometer.io/docs/registry/prometheus