CI/CD & Monitoring
- CI/CD Pipeline
- Monitoring & Alerting
- Production Deployment
- BetterStack Setup
- Sentry Setup
- CloudWatch Logs Setup
CI/CD Pipeline
Drop CI/CD Pipeline
Last updated: 2026-02-13
Source: src/drop-app/package.json, Dockerfile, fly.toml, vitest.config.ts, playwright.config.ts
Current State
Drop is in MVP/pre-production stage. Core CI/CD infrastructure exists including a GitHub Actions workflow.
What exists:
- GitHub Actions CI workflow (
.github/workflows/ci.yml) with 5 jobs: lint-and-typecheck, test, build, e2e, docker-build - Dockerfile with multi-stage build (
Dockerfile:1-63) - docker-compose for local and production (
docker-compose.yml,docker-compose.production.yml) - Fly.io deployment config (
fly.toml) - Vitest unit/integration test framework (
vitest.config.ts) - Playwright E2E test framework (
playwright.config.ts) - Health check endpoint (
/api/health) - QA report generation via
scripts/qa-report.js(automated in CI)
What does not exist yet:
- Automated deployment pipeline (CI builds but does not deploy)
- Container registry integration
- Automated security scanning (npm audit, Snyk)
- Test coverage reporting
- Staging environment (Fly.io config exists but not deployed)
Build Pipeline
Step 1: Install Dependencies
npm ci
Installs exact versions from package-lock.json.
Step 2: Lint
npm run lint # eslint
Step 3: Type Check
npx tsc --noEmit
Step 4: Unit + Integration Tests
npm test # vitest run
Runs all tests in tests/**/*.test.ts (from vitest.config.ts:7). Test setup: tests/setup.ts sets NODE_ENV=test.
Step 5: Build
npm run build # next build
Produces standalone output for Docker deployment.
Step 6: Docker Build
docker build -t drop-app .
Multi-stage build: deps -> builder -> runner.
Step 7: E2E Tests (requires running server)
npx playwright test
Requires dev server on http://localhost:3000. Playwright auto-starts it via webServer config.
Test Framework Configuration
Vitest (Unit + Integration)
Config: src/drop-app/vitest.config.ts:1-15
| Setting | Value |
|---|---|
| Environment | node |
| Include | tests/**/*.test.ts |
| Setup | tests/setup.ts |
| Path alias | @ -> ./src |
Playwright (E2E)
Config: src/drop-app/playwright.config.ts:1-39
| Setting | Value |
|---|---|
| Test dir | ./tests/e2e |
| Parallel | false (serial -- rate limiter is shared) |
| Workers | 1 |
| Retries (CI) | 2 |
| Timeout | 30,000ms |
| Base URL | http://localhost:3000 |
| Reporter | HTML |
| Trace | on-first-retry |
Test projects:
user-flows-- Basic user journey tests (user-flows.spec.ts)full-flows-- Complete feature journeys (full-flows.spec.ts)input-chaos-- Malicious/edge-case input testing (input-chaos.spec.ts). Depends onuser-flows.
Web server config: Auto-starts npm run dev for E2E tests. Reuses existing server if running. 30s timeout.
Deployment Targets
Fly.io (Staging)
Config: fly.toml:1-28
# Deploy to Fly.io staging
fly deploy
# Set secrets
fly secrets set JWT_SECRET="your-secret"
fly secrets set NEXT_PUBLIC_SERVICE_MODE="mock"
Region: arn (Stockholm)
Auto-scaling: Scales to 0 when idle, auto-starts on request.
Docker (Self-hosted)
# Local dev (PostgreSQL 16 via Docker)
docker compose up -d
# Apply schema
make db-push
Existing GitHub Actions CI Workflow
File: .github/workflows/ci.yml
Triggers on push/PR to main or master:
Jobs:
1. lint-and-typecheck — npm ci, npm run lint, tsc --noEmit
2. test — npm ci, npm test --if-present (depends on lint-and-typecheck)
3. build — npm ci, npm run build with JWT_SECRET placeholder (depends on lint-and-typecheck)
4. e2e — npm ci, npx playwright install chromium, npm run build, npm run start (production mode), npx playwright test user-flows + full-flows, generate QA report, upload artifacts (depends on build)
5. docker-build — docker build -t drop-app:ci (depends on test + build + e2e)
Artifacts uploaded:
playwright-report/— Playwright HTML report (7 day retention)qa-report.html— QA metrics report (pass/fail, execution time)
Not yet implemented:
- Security scan (npm audit, Snyk)
- Deploy to staging (fly deploy)
- Deploy to production (manual approval gate)
Status: Full CI pipeline including E2E tests in place. CD deployment tracked in security hardening checklist (security/hardening-checklist.md:120-126).
Monitoring & Alerting
Drop Monitoring
Last updated: 2026-02-17
Source: src/drop-app/src/app/api/health/route.ts, docker-compose.yml, fly.toml, src/lib/alerts.ts
Health Check Endpoint
Route: GET /api/health
Source: src/drop-app/src/app/api/health/route.ts:1-35
What It Checks
- Database connectivity -- Executes
SELECT 1 as okagainst the database - Database latency -- Measures query execution time in milliseconds
- Database driver -- Reports
pg(PostgreSQL 16 via Drizzle ORM) - Service mode -- Reports
NEXT_PUBLIC_SERVICE_MODE(mockorlive) - Application uptime -- Tracks seconds since server start
- Application version -- Reads from
npm_package_versionenv var, defaults to0.1.0
Status Values
- ok -- All checks pass (HTTP 200)
- degraded -- DB query returned unexpected result (HTTP 200)
- down -- DB unreachable (HTTP 503)
Response Format
Healthy (200 OK):
{
"data": {
"status": "ok",
"version": "0.1.0",
"uptime": 3600,
"checks": {
"db": { "status": "pass", "latencyMs": 2, "driver": "pg" },
"services": { "mode": "live" }
},
"timestamp": "2026-02-17T12:00:00.000Z"
}
}
Down (503 Service Unavailable):
{
"data": {
"status": "down",
"version": "0.1.0",
"uptime": 3600,
"checks": {
"db": { "status": "fail" },
"services": { "mode": "live" }
},
"timestamp": "2026-02-17T12:00:00.000Z"
}
}
Container Health Checks
Docker Compose (MVP)
Source: docker-compose.yml:12-17
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
Docker Compose (Production)
Source: docker-compose.production.yml:9-14
Same health check configuration as MVP. Additionally, PostgreSQL has its own health check:
healthcheck:
test: ["CMD-SHELL", "pg_isready -U drop"]
interval: 10s
timeout: 5s
retries: 5
The drop-app service depends on PostgreSQL being healthy before starting (depends_on.postgres.condition: service_healthy).
Fly.io
Source: fly.toml:19-23
[[http_service.checks]]
grace_period = "10s"
interval = "30s"
method = "GET"
path = "/api/health"
timeout = "5s"
Fly.io uses this health check to determine machine readiness and to route traffic.
Current Monitoring State
What Exists
- Health check endpoint with real database verification (not hardcoded)
- Container-level health checks (Docker + Fly.io)
- Automatic restart on failure (
restart: unless-stoppedin docker-compose) - Auto-scaling on Fly.io (scale to zero, auto-start on request)
What Does Not Exist Yet
- External uptime monitoring service (see UptimeRobot setup below for recommended configuration)
- Application Performance Monitoring (APM)
- Structured logging (JSON format)
- Log aggregation and forwarding
- Database performance monitoring
- Rate limit monitoring/metrics
- Business metrics dashboard (transactions per hour, success rate)
Sentry Error Tracking
Status: REMOVED (MC #1271 — Sentry deinstalled)
Slack Alerting
Status: Implemented (MC #1183)
Source: src/lib/alerts.ts, instrumentation.ts
Features
- Operational alerts sent to Slack webhook
- 10-minute cooldown per alert title (prevents spam)
- Severity-based emoji prefixes (ℹ️ info, ⚠️ warning, 🚨 critical)
- Graceful degradation when webhook URL not set (dev mode)
Setup Instructions
- Create incoming webhook in Slack workspace:
- Go to Slack App Directory → Incoming Webhooks
- Choose channel (e.g.,
#opsor#alerts) - Copy webhook URL
- Set environment variable:
# .env.local (server-side secret) SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX
Required Environment Variable
| Variable | Required | Description |
|---|---|---|
SLACK_WEBHOOK_URL |
Yes (production) | Slack incoming webhook URL |
Note: When SLACK_WEBHOOK_URL is not set, alerts are logged to console but not sent to Slack.
Alert Types and Severities
| Severity | Emoji | Use Case |
|---|---|---|
info |
ℹ️ | Application startup, normal operations |
warning |
⚠️ | Degraded performance, non-critical issues |
critical |
🚨 | Service outages, data loss, security incidents |
Cooldown Behavior
- Each alert title has a 10-minute cooldown
- Same title sent within 10 minutes → skipped (prevents spam)
- Different titles → sent immediately (independent tracking)
- Cooldown resets on app restart (in-memory tracking)
Example: If "Database connection failed" is sent at 10:00, the next attempt before 10:10 will be skipped. But "High latency detected" can still be sent at 10:05.
Usage in Code
import { sendAlert } from '@/lib/alerts';
// Basic alert
await sendAlert({
severity: 'critical',
title: 'Database connection failed',
message: 'PostgreSQL unreachable after 3 retries',
});
// Alert with details
await sendAlert({
severity: 'warning',
title: 'High error rate detected',
message: '15 errors in last 5 minutes',
});
Current Integrations
- App startup: Sends info alert when server starts (
instrumentation.ts) - App shutdown: Sends info alert on SIGTERM/SIGINT (
instrumentation.ts) - Error spike detection: Automatically tracks errors and alerts when >5 errors occur in 60 seconds (
src/lib/alerts.ts:trackError) - Unhandled exceptions: Logged and tracked via process event handlers (
instrumentation.ts)
Error Spike Detection
The alerting system automatically detects error spikes using a rolling window approach:
How it works:
- Every server error (HTTP 5xx) is tracked via
trackError() - Maintains rolling 1-minute window of error timestamps
- When count exceeds threshold (5 errors in 60 seconds), sends critical alert
- Integrates with middleware error handling
Threshold: 5 errors within 60 seconds
Alert severity: Critical (🚨)
Implementation: src/lib/alerts.ts:trackError(), wired into src/lib/middleware.ts:jsonError()
Note: Error counter is in-memory and resets on app restart. For production workloads requiring persistent tracking, consider Redis-backed counters.
BetterStack Uptime Monitoring
Status: Ready to configure (setup guide available) Documentation: BETTERSTACK-SETUP.md
Overview
BetterStack provides external uptime monitoring independent of Drop's infrastructure. Unlike internal health checks (Docker, Fly.io) that only work when containers are running, BetterStack detects total infrastructure failures.
Free tier includes:
- 10 monitors (enough for Drop production)
- 3-minute check interval
- Unlimited integrations (Slack, email)
- Public status page
- SSL expiry monitoring
Recommended Monitors
| Monitor | URL | Purpose | Expected Response |
|---|---|---|---|
| Health Endpoint | https://drop.alai.no/api/health |
API + DB connectivity | 200, body contains "status":"ok" |
| Landing Page | https://drop.alai.no |
Public website | 200, body contains Send penger |
| Multi-Region Check | https://drop.alai.no/api/health |
Geographic availability | 200, body contains "status":"ok" |
Alert Escalation
BetterStack sends alerts through multiple channels:
Minute 0: Alert fires → Slack #drop-ops (immediate)
Minute 5: Still down → Email to alem@alai.no
Minute 15: Still down → SMS (requires paid plan)
Status Page
Public status page shows real-time service status:
- URL:
https://drop-status.betteruptime.com - Components: API Health, Landing Page, Global Network
- Auto-updates: Incidents automatically posted and resolved
- Subscriptions: Users can subscribe to email updates
Setup Instructions
Complete setup guide with step-by-step instructions: BETTERSTACK-SETUP.md
Setup includes:
- Account creation (free tier)
- Configure 3 monitors (health, landing, multi-region)
- Slack integration (
#drop-opschannel) - On-call schedule and escalation policy
- Public status page creation
- Testing and verification
Key Features
Proactive monitoring:
- 3-minute check interval (free tier) or 30s (paid)
- Keyword verification (not just HTTP 200)
- SSL certificate expiry warnings (14 days)
- Multi-region checks (detect geographic issues)
Incident management:
- Automatic incident creation on downtime
- Status page updates (public transparency)
- Escalation to multiple channels (Slack → Email → SMS)
- Maintenance window support (suppress alerts during deployments)
Reporting:
- Uptime SLA tracking (99.9% target)
- Incident history and analysis
- Response time graphs
- Downtime duration reports
Integration with Drop Alerting
BetterStack complements Drop's internal alerting (src/lib/alerts.ts):
| Feature | Drop Internal Alerts | BetterStack External |
|---|---|---|
| Detects | Application errors, error spikes | Infrastructure outages |
| When | App is running | App is unreachable |
| Source | Application logs | External HTTP checks |
| Delivery | Slack webhook (direct) | Escalation policy |
| Use case | Code bugs, DB issues | Container crashes, network failures |
Example: Database connection fails:
- Drop internal alert: "Database connection failed" → Slack
#drop-ops(immediate) - BetterStack: Health check returns 503 → Slack
#drop-ops+ Email after 5 min
Maintenance Windows
When performing planned maintenance (deployments, upgrades):
- Create maintenance window in BetterStack
- Select affected monitors
- Set duration (e.g., 1 hour)
- Effect: Alerts suppressed, status page shows "Scheduled Maintenance"
Prevents: False downtime alerts during intentional service interruptions.
Best Practices
Do's:
- ✅ Test alerts monthly (pause monitor to verify escalation)
- ✅ Use keyword checks (not just HTTP status codes)
- ✅ Monitor SSL expiry (14-day warnings)
- ✅ Create maintenance windows for deployments
- ✅ Review incident history monthly
Don'ts:
- ❌ Don't ignore degraded status (investigate even if not fully down)
- ❌ Don't disable monitors (use pause for temporary suppression)
- ❌ Don't skip keyword checks (HTTP 200 ≠ working API)
- ❌ Don't rely solely on external monitoring (combine with internal checks)
External Uptime Monitoring (Alternative: UptimeRobot)
Status: Alternative to BetterStack (not recommended)
BetterStack is recommended over UptimeRobot for Drop because:
- Better Slack integration (richer notifications)
- Built-in status page (UptimeRobot charges extra)
- Better UI/UX for incident management
- More flexible escalation policies
UptimeRobot Setup (if BetterStack unavailable)
Cost: Free tier (50 monitors, 5-minute interval)
- Create account at uptimerobot.com
- Add HTTP(S) monitor:
- Friendly Name: Drop Production
- URL:
https://drop.alai.no/api/health - Monitoring Interval: 5 minutes (free tier) or 1 minute (paid)
- Configure alert contacts:
- Slack webhook (via Alert Contacts)
- Email (
alem@alai.no)
- Set Keyword Monitoring: Response contains
"status":"ok"
Limitations:
- No built-in escalation policies (requires third-party integrations)
- Status page requires paid plan
- Less detailed incident reports
- 5-minute check interval (vs 3-minute for BetterStack free)
Monitoring Stack Summary
Implemented (MC #1184)
- ✅ Health check endpoint —
/api/healthwith real database verification - ✅ Container health checks — Docker + Fly.io auto-restart on failure
- ❌ Error tracking — Sentry REMOVED (MC #1271)
- ✅ Slack alerting — Operational alerts with cooldown protection
- ✅ Lifecycle monitoring — App startup and graceful shutdown alerts
- ✅ Error spike detection — Automatic alerting when >5 errors/minute
Recommended (Manual Setup)
- 📋 External uptime monitoring — UptimeRobot checking
/api/healthevery 5 minutes - 📋 Structured logging — JSON log format with request IDs for correlation
- 📋 Metrics dashboard — Request latency, error rates, database query times
- 📋 Audit logging — Tracked as security requirement (
security/drop-security-rapport.mdfinding L3)
Future Enhancements (TODO)
- Database performance monitoring (slow query alerts)
- Rate limit metrics (track 429 errors per endpoint)
- Business metrics dashboard (transactions per hour, success rate)
- Redis-backed error counter (persistent across restarts)
- Per-endpoint error tracking (isolate problematic routes)
Environment Variables Reference
Required for Production
# Slack alerting
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX
Dev Mode (All Optional)
All monitoring features gracefully degrade when env vars are not set:
- No SLACK_WEBHOOK_URL: Alerts logged to console only
This allows development to work without external services configured.
Production Deployment
Drop AWS Amplify Deployment Guide
Rebrand note (2026-02-14): Originally titled "FontelePay". Product rebranded to Drop. Some env var references (Swan, Stripe) are FUTURE integrations — Drop uses a PSD2 pass-through model. See Drop CLAUDE.md.
This guide covers deploying Drop to AWS Amplify in the Frankfurt (eu-central-1) region.
Prerequisites
- AWS Account with Amplify access
- GitHub repository with Drop code
- Environment variables ready (see
.env.example)
Step 1: Create Amplify App
- Go to AWS Amplify Console
- Ensure you're in eu-central-1 (Frankfurt) region
- Click Create new app
- Select Host web app
Step 2: Connect Repository
- Choose GitHub as your Git provider
- Authorize AWS Amplify to access your GitHub account
- Select the Drop repository
- Choose the branch to deploy (e.g.,
mainorproduction)
Step 3: Configure Build Settings
Amplify will auto-detect Next.js. Verify the settings match amplify.yml:
version: 1
frontend:
phases:
preBuild:
commands:
- npm ci
build:
commands:
- npm run build
artifacts:
baseDirectory: .next
files:
- '**/*'
cache:
paths:
- node_modules/**/*
- .next/cache/**/*
Step 4: Configure Environment Variables
In Amplify Console, go to App settings > Environment variables and add:
Required Variables
| Variable | Description | Example |
|---|---|---|
NODE_ENV |
Environment | production |
NEXT_PUBLIC_APP_URL |
Your app URL | https://drop.amplifyapp.com |
Swan BaaS
| Variable | Description |
|---|---|
SWAN_API_URL |
https://api.swan.io (production) |
SWAN_CLIENT_ID |
OAuth2 Client ID |
SWAN_CLIENT_SECRET |
OAuth2 Client Secret |
SWAN_PROJECT_ID |
Project ID |
SWAN_WEBHOOK_SECRET |
Webhook validation secret |
Stripe
| Variable | Description |
|---|---|
NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY |
Publishable key (pk_live_...) |
STRIPE_SECRET_KEY |
Secret key (sk_live_...) |
STRIPE_WEBHOOK_SECRET |
Webhook secret (whsec_...) |
Sumsub KYC
| Variable | Description |
|---|---|
SUMSUB_APP_TOKEN |
App token |
SUMSUB_SECRET_KEY |
Secret key |
SUMSUB_WEBHOOK_SECRET |
Webhook secret |
SUMSUB_LEVEL_NAME |
KYC flow level |
Database
| Variable | Description |
|---|---|
DATABASE_URL |
PostgreSQL connection string |
REDIS_URL |
Redis connection string |
Authentication
| Variable | Description |
|---|---|
JWT_SECRET |
Min 32 characters |
SESSION_SECRET |
Min 32 characters |
Step 5: Configure Next.js for Standalone Output
Update next.config.ts to enable standalone output for optimal Amplify deployment:
import type { NextConfig } from "next";
const nextConfig: NextConfig = {
output: 'standalone',
};
export default nextConfig;
Step 6: Deploy
- Click Save and deploy
- Monitor the build in the Amplify Console
- Once complete, your app will be available at
https://<branch>.<app-id>.amplifyapp.com
Step 7: Configure Custom Domain (Optional)
- Go to App settings > Domain management
- Click Add domain
- Enter your domain (e.g.,
app.getdrop.no) - Follow DNS configuration instructions
- SSL certificate is automatically provisioned
Step 8: Set Up Branch Deployments
For staging/production workflows:
- Go to App settings > General
- Click Edit
- Enable Branch auto-detection
- Configure branch patterns:
main-> Productionstaging-> Stagingfeature/*-> Preview environments
Monitoring & Health Checks
Health Endpoint
The app exposes /api/health for load balancer health checks:
curl https://your-app.amplifyapp.com/api/health
Response:
{
"status": "healthy",
"timestamp": "2026-02-05T12:00:00.000Z",
"version": "0.1.0",
"uptime": 3600,
"checks": {}
}
CloudWatch Logs
- Go to App settings > Monitoring
- View build logs and access logs
- Set up CloudWatch alarms for errors
Troubleshooting
Build Fails
- Check build logs in Amplify Console
- Verify
package.jsonscripts are correct - Ensure all dependencies are in
package.json
Environment Variables Not Working
- Verify variables are set in Amplify Console
- Remember:
NEXT_PUBLIC_prefix required for client-side access - Redeploy after changing environment variables
502/503 Errors
- Check
/api/healthendpoint - Review CloudWatch logs
- Verify database connections are correct
- Check memory limits (adjust if needed)
Cold Starts
For serverless functions, cold starts may occur. Mitigate by:
- Using connection pooling for databases
- Keeping functions warm with scheduled pings
- Optimizing bundle size
Security Checklist
- All secrets in Environment Variables (not in code)
- HTTPS enforced (automatic in Amplify)
- CORS configured correctly
- Rate limiting implemented
- Webhook signatures validated
- No sensitive data in logs
Cost Optimization
- Use
cache.pathsinamplify.ymlto speed up builds - Enable CloudFront caching for static assets
- Monitor build minutes usage
- Consider reserved concurrency for predictable traffic
Rollback
To rollback to a previous deployment:
- Go to Deployments in Amplify Console
- Find the previous successful deployment
- Click Redeploy this version
Support
BetterStack Setup
BetterStack Uptime Monitoring Setup Guide
Last updated: 2026-02-20 Related: MONITORING.md, health-check.sh Purpose: External uptime monitoring for Drop production environment
Why BetterStack?
BetterStack provides external uptime monitoring independent of Drop's infrastructure:
- Detects infrastructure failures (AWS App Runner crashes, network issues)
- Alerts when the entire application is unreachable
- Provides uptime SLA tracking and historical reports
- Multiple notification channels (Slack, Email, SMS)
- Status page for client transparency
Key difference from internal health checks: Internal checks (Docker, Fly.io) only work when the container is running. BetterStack catches total outages.
Free Tier Limits
Plan: Free tier (no credit card required) Limits:
- 10 monitors (enough for Drop production)
- 3-minute check interval (paid plan: 30s minimum)
- 1 status page
- Unlimited team members
- Unlimited integrations (Slack, email, webhooks)
Upgrade required for:
- Faster check intervals (<3 minutes)
- More than 10 monitors (e.g., multi-region checks)
- Advanced features (maintenance windows, custom headers)
Account Setup
Step 1: Create Account
- Go to https://betterstack.com/uptime
- Click "Start free trial" (becomes free tier after trial)
- Sign up with Alem's email:
alem@alai.no - Verify email address
- Create workspace name: "ALAI Products" (shared across Drop, BasicFakta)
Step 2: Configure Team
Monitor Configuration
Monitor 1: Health Endpoint (Primary)
Purpose: Verify API health and database connectivity
-
Go to Monitors > Create Monitor
-
Configure:
- Monitor name:
Drop Health Check - Monitor type:
HTTP - URL:
https://drop.alai.no/api/health - Check interval:
3 minutes(free tier) - Request timeout:
5 seconds - Method:
GET - Confirmation period:
30 seconds(1 retry before alerting)
- Monitor name:
-
Expected Response:
- Status code:
200 - Keyword check: Enable
- Response body contains:
"status":"ok" - Why: Ensures health endpoint returns valid JSON, not just HTTP 200
- Response body contains:
- Status code:
-
Advanced settings:
- Follow redirects:
Enabled(default) - Verify SSL certificate:
Enabled - SSL expiry warning:
14 days before expiration
- Follow redirects:
-
Click Create Monitor
Monitor 2: Landing Page
Purpose: Verify public website availability
-
Go to Monitors > Create Monitor
-
Configure:
- Monitor name:
Drop Landing Page - Monitor type:
HTTP - URL:
https://drop.alai.no - Check interval:
3 minutes - Request timeout:
10 seconds(landing page has more assets) - Method:
GET - Confirmation period:
30 seconds
- Monitor name:
-
Expected Response:
- Status code:
200 - Keyword check: Enable
- Response body contains:
Send penger(tagline verification)
- Response body contains:
- Status code:
-
Click Create Monitor
Monitor 3: Multi-Region Health Check
Purpose: Detect regional networking issues
-
Go to Monitors > Create Monitor
-
Configure:
- Monitor name:
Drop Health (US East) - Monitor type:
HTTP - URL:
https://drop.alai.no/api/health - Check interval:
3 minutes - Request timeout:
5 seconds - Method:
GET - Confirmation period:
30 seconds
- Monitor name:
-
Expected Response:
- Status code:
200 - Keyword check: Response body contains
"status":"ok"
- Status code:
-
Advanced settings:
- Region:
US East(different from default EU region) - Why: Detects if Drop is unreachable from specific geographies
- Region:
-
Click Create Monitor
Slack Integration
Step 1: Create Slack Incoming Webhook
- Go to your Slack workspace: alai-talk.slack.com
- Navigate to Slack App Directory > Incoming Webhooks
- Click Add to Slack
- Select channel: #drop-ops (create if doesn't exist)
- Click Add Incoming Webhooks Integration
- Copy webhook URL (format:
https://hooks.slack.com/services/T.../B.../XXX) - Save this URL securely (needed for BetterStack)
Step 2: Add Slack Integration in BetterStack
- In BetterStack, go to Integrations
- Click Add Integration > Slack
- Paste webhook URL from Step 1
- Configure:
- Integration name:
Drop Ops Slack - Notification channel:
#drop-ops
- Integration name:
- Test integration: Click Send test message
- Verify message appears in
#drop-opschannel
- Verify message appears in
- Click Save Integration
On-Call Team Setup
Step 1: Create On-Call Schedule
- Go to On-Call > Create Schedule
- Configure:
- Schedule name:
Drop Primary On-Call - Timezone:
Europe/Oslo
- Schedule name:
- Add rotation:
- Team member:
alem@alai.no - Schedule type:
24/7(always on-call for now)
- Team member:
- Click Create Schedule
Step 2: Configure Escalation Policy
-
Go to Escalation Policies > Create Policy
-
Configure:
- Policy name:
Drop Production Incidents
- Policy name:
-
Add escalation steps:
Step 1 (Immediate):
- Who:
Drop Ops Slackintegration - Delay:
0 minutes
Step 2 (If still down after 5 minutes):
- Who:
alem@alai.no(Email) - Delay:
5 minutes
Step 3 (If still down after 15 minutes):
- Who:
alem@alai.no(SMS) — Requires phone number - Delay:
15 minutes - Note: SMS requires paid plan or verified phone number
- Who:
-
Click Create Policy
Step 3: Assign Policy to Monitors
- Go to Monitors
- For each monitor (
Drop Health Check,Drop Landing Page,Drop Health (US East)):- Click monitor name
- Go to Settings > Escalation Policy
- Select:
Drop Production Incidents - Click Save
Status Page Setup
Purpose
Public status page allows clients and stakeholders to check Drop availability without contacting support.
Step 1: Create Status Page
-
Go to Status Pages > Create Status Page
-
Configure:
- Page name:
Drop Status - Subdomain:
drop-status(URL:https://drop-status.betteruptime.com) - Custom domain (optional):
status.drop.alai.no(requires DNS setup)
- Page name:
-
Design settings:
- Logo: Upload Drop logo (green rounded rectangle)
- Brand color:
#0B6E35(Drop primary green) - Header text:
Drop Status - Tagline:
Real-time service status and incident updates
-
Visibility:
- Public: Yes (anyone can view)
- Search engine indexing: No (prevent Google indexing)
-
Click Create Status Page
Step 2: Add Components
-
In the status page settings, go to Components
-
Click Add Component
-
Add three components:
Component 1:
- Name:
API & Health Endpoint - Linked monitor:
Drop Health Check - Description:
Core API functionality and database connectivity
Component 2:
- Name:
Landing Page - Linked monitor:
Drop Landing Page - Description:
Public website and marketing content
Component 3:
- Name:
Global Network - Linked monitor:
Drop Health (US East) - Description:
International access and routing
- Name:
-
Click Save Components
Step 3: Configure Incident Communication
- Go to Status Pages > Settings > Incident Updates
- Enable:
- Auto-create incidents: Yes (when monitor goes down)
- Auto-resolve incidents: Yes (when monitor recovers)
- Notification subscribers:
- Email subscriptions: Enabled (users can subscribe to updates)
- Webhook notifications: Disabled (optional for future)
Step 4: Share Status Page
- Internal: Add to
#drop-opsSlack channel description - External: Link from Drop landing page footer (optional)
- Clients: Include in onboarding emails
Status Page URL: https://drop-status.betteruptime.com
Verification Checklist
After completing setup, verify:
- Monitors running: All 3 monitors show green status
- Slack alerts working: Test by pausing a monitor (triggers down alert)
- Email notifications working: Verify Alem receives email on test alert
- Status page public: Open status page URL in incognito mode
- Escalation policy assigned: All monitors use
Drop Production Incidentspolicy - SSL expiry alerts: Monitors configured to warn 14 days before cert expiration
Testing the Setup
Test 1: Manual Down Alert
- Go to Monitors >
Drop Health Check - Click Pause Monitor (simulates downtime)
- Expected behavior:
- Slack alert in
#drop-opswithin 30 seconds - Email to
alem@alai.noafter 5 minutes (if still paused)
- Slack alert in
- Click Resume Monitor to clear alert
Test 2: Actual Downtime
- SSH into production server (or use AWS App Runner console)
- Stop the Drop application container temporarily
- Wait for BetterStack to detect downtime (max 3 minutes + 30s confirmation)
- Expected behavior:
- Monitor shows red status
- Slack alert in
#drop-ops - Status page component shows "Down"
- Restart application and verify recovery alert
Test 3: SSL Expiry Warning
- Go to Monitors >
Drop Health Check - Verify SSL expiry warning is enabled (14 days)
- Expected behavior:
- Alert sent 14 days before SSL certificate expiration
- Action required: Renew certificate before expiry
Alert Examples
Downtime Alert (Slack)
🚨 Drop Health Check is DOWN
Monitor: Drop Health Check
Status: DOWN
Response: Connection timeout
Region: EU West
Time: 2026-02-20 10:30 UTC
View incident: https://betterstack.com/incidents/...
Recovery Alert (Slack)
✅ Drop Health Check is UP
Monitor: Drop Health Check
Status: UP
Response: 200 OK (2ms)
Downtime duration: 3 minutes
Time: 2026-02-20 10:33 UTC
Incident closed: https://betterstack.com/incidents/...
SSL Expiry Warning (Email)
Subject: [BetterStack] SSL certificate expiring in 14 days
Monitor: Drop Health Check
Domain: drop.alai.no
Certificate expiry: 2026-03-06 23:59 UTC
Action required: Renew SSL certificate before expiration.
Maintenance Mode
When performing planned maintenance (deployments, infrastructure upgrades):
- Go to Maintenance Windows > Create Window
- Configure:
- Name:
Drop Deployment - Start time:
2026-02-20 22:00 UTC - Duration:
1 hour - Affected monitors: Select all Drop monitors
- Name:
- Notification:
- Status page update: Yes (shows maintenance banner)
- Alert suppression: Yes (no downtime alerts during window)
- Click Create Maintenance Window
Effect: During maintenance, downtime alerts are suppressed and status page shows "Scheduled Maintenance" instead of "Down".
Best Practices
Do's
- ✅ Test alerts monthly — Pause a monitor to verify escalation works
- ✅ Update on-call schedule — Rotate on-call duty if team grows
- ✅ Monitor SSL expiry — Enable 14-day warnings to prevent outages
- ✅ Use maintenance windows — Prevent false alerts during deployments
- ✅ Review incident history — Monthly review of downtime patterns
Don'ts
- ❌ Don't ignore degraded status — Investigate even if not fully down
- ❌ Don't disable monitors — Use pause for temporary suppression only
- ❌ Don't skip keyword checks — HTTP 200 alone doesn't guarantee working API
- ❌ Don't forget to update URLs — When domain changes, update all monitors
- ❌ Don't rely solely on external monitoring — Combine with internal health checks
Troubleshooting
Monitor shows false positives (frequent up/down)
Cause: Network instability or slow response times Fix:
- Increase Request timeout from 5s to 10s
- Increase Confirmation period from 30s to 60s
- Check Drop API latency in logs
Slack alerts not received
Cause: Webhook URL incorrect or channel archived Fix:
- Go to Integrations >
Drop Ops Slack - Click Send test message
- If fails, regenerate webhook in Slack and update BetterStack
Email alerts delayed
Cause: Email provider spam filtering Fix:
- Whitelist
notifications@betterstack.comin email settings - Check spam/junk folder
- Verify email address in BetterStack team settings
Status page not updating
Cause: Monitor not linked to status page component Fix:
- Go to Status Pages >
Drop Status> Components - Ensure each component has a Linked monitor assigned
- Save changes and trigger test alert
Related Documentation
- MONITORING.md — Full monitoring stack overview
- health-check.sh — Internal health check script
- alerts.ts — Slack alerting implementation
- /api/health route — Health endpoint source code
Support
BetterStack Support:
- Documentation: https://betterstack.com/docs
- Email: support@betterstack.com
- Status: https://status.betterstack.com
Internal Contact:
- Slack:
#drop-ops - Email:
alem@alai.no
Sentry Setup
Drop Sentry Setup
Last updated: 2026-02-20
Source: src/drop-app/src/lib/sentry.ts, src/drop-app/src/lib/sentry-server.ts, src/drop-api/src/lib/sentry.ts, src/drop-app/.env.example
Overview
Drop uses Sentry for error tracking and performance monitoring across three components:
- drop-app (client-side) - Browser errors via
@sentry/browser - drop-app (server-side) - Next.js middleware/API errors via custom envelope API
- drop-api - Backend API errors via
@sentry/node
All three components share the same DSN and gracefully degrade to console-only logging when Sentry is not configured.
Sentry Account Setup
1. Create Free Sentry Account
- Visit sentry.io and sign up (free tier: 5,000 errors/month)
- Confirm email and log in
2. Create Projects
Create two separate projects (one for app, one for API):
Project 1: drop-app
- Click Projects → Create Project
- Platform: Next.js
- Project name:
drop-app - Team: Default team (or create
drop-team) - Alert frequency: On every new issue
- Click Create Project
- Copy the DSN (format:
https://examplePublicKey@o0.ingest.sentry.io/0)
Project 2: drop-api
- Repeat steps above with platform Node.js
- Project name:
drop-api - Copy the DSN (different from drop-app)
IMPORTANT: Use separate projects to keep frontend and backend errors isolated.
Environment Variables Configuration
drop-app (.env.local)
Add these variables to src/drop-app/.env.local:
# --- Sentry (Error Tracking) ---
# Client-side error tracking (browser)
NEXT_PUBLIC_SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_PROJECT_ID
# Server-side error tracking (middleware/API routes)
# NOTE: drop-app server uses custom envelope API (no @sentry/nextjs due to Turbopack incompatibility)
# Both client and server use the SAME DSN (NEXT_PUBLIC_SENTRY_DSN)
# Optional: Performance monitoring sample rate (0.0 to 1.0, default: 0.1 = 10%)
NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1
# Optional: For source map uploads (requires auth token from Sentry → Settings → Auth Tokens)
SENTRY_ORG=your-org-slug
SENTRY_PROJECT=drop-app
SENTRY_AUTH_TOKEN=your-auth-token
drop-api (.env)
Add these variables to src/drop-api/.env:
# --- Sentry (Error Tracking) ---
SENTRY_DSN=https://YOUR_PUBLIC_KEY@o0.ingest.sentry.io/YOUR_API_PROJECT_ID
# Optional: Performance monitoring sample rate (0.0 to 1.0, default: 0.1 = 10%)
SENTRY_TRACES_SAMPLE_RATE=0.1
# Optional: For source map uploads
SENTRY_ORG=your-org-slug
SENTRY_PROJECT=drop-api
SENTRY_AUTH_TOKEN=your-auth-token
Where to find these values:
- DSN: Project Settings → Client Keys (DSN)
- Org slug: Settings → Organization → General Settings → Organization Slug
- Project name: Project Settings → General → Project Name
- Auth token: Settings → Auth Tokens → Create New Token (scopes:
project:releases,project:write)
Verification
Test Client-Side Error Capture (drop-app)
- Start the app:
npm run dev(insrc/drop-app/) - Open browser console:
http://localhost:3000 - Trigger test error via console:
throw new Error("Sentry test error - client-side"); - Check Sentry dashboard: Projects → drop-app → Issues
- You should see the test error appear within 10 seconds
Expected behavior:
- Error logged to browser console:
[Sentry] Error captured: Error: Sentry test error - client-side - Error appears in Sentry dashboard with stack trace, breadcrumbs, and browser context
Test Server-Side Error Capture (drop-app)
- Create test API route:
src/drop-app/src/app/api/sentry-test/route.tsimport { NextResponse } from 'next/server'; import { captureServerError } from '@/lib/sentry-server'; export async function GET() { try { throw new Error('Sentry test error - server-side'); } catch (error) { captureServerError(error as Error, { tags: { test: 'true' } }); return NextResponse.json({ error: 'Test error sent to Sentry' }, { status: 500 }); } } - Visit:
http://localhost:3000/api/sentry-test - Check server console:
[Sentry Server] Error captured: Error: Sentry test error - server-side - Check Sentry dashboard: Projects → drop-app → Issues
Test API Error Capture (drop-api)
- Start the API:
npm run dev(insrc/drop-api/) - Trigger test error via curl:
curl http://localhost:4000/api/sentry-test - OR create test endpoint in
src/drop-api/src/routes/test.ts:import { Router } from 'express'; import { captureError } from '../lib/sentry.js'; const router = Router(); router.get('/sentry-test', (req, res) => { try { throw new Error('Sentry test error - API'); } catch (error) { captureError(error as Error, { tags: { test: 'true' } }); res.status(500).json({ error: 'Test error sent to Sentry' }); } }); export default router; - Check Sentry dashboard: Projects → drop-api → Issues
Source Map Upload Setup
Source maps allow Sentry to show readable stack traces instead of minified code.
1. Install Sentry CLI
# macOS (Homebrew)
brew install getsentry/tools/sentry-cli
# Or via npm (global)
npm install -g @sentry/cli
2. Configure Sentry CLI
Create .sentryclirc in project root:
[defaults]
url=https://sentry.io/
org=your-org-slug
project=drop-app
[auth]
token=your-auth-token
IMPORTANT: Add .sentryclirc to .gitignore (contains auth token).
3. Add Build Script (drop-app)
Update src/drop-app/package.json:
{
"scripts": {
"build": "next build",
"build:sentry": "next build && sentry-cli sourcemaps upload --validate .next/static"
}
}
4. Test Source Map Upload
cd src/drop-app
npm run build:sentry
Expected output:
> Analyzing source maps for sentry
> Uploading source maps to Sentry
✓ Successfully uploaded source maps
5. CI/CD Integration
For automated uploads in CI/CD, add these secrets to your deployment platform:
Vercel/Railway/Fly.io:
SENTRY_ORGSENTRY_PROJECTSENTRY_AUTH_TOKEN
Then update build command:
npm run build && sentry-cli sourcemaps upload --validate .next/static
Alert Rules Configuration
Recommended Alert Rules
1. New Issue Alert (drop-app)
- Go to Projects → drop-app → Settings → Alerts
- Click Create Alert Rule
- Configure:
- Conditions: When a new issue is created
- Filters: Environment = production
- Actions:
- Send notification to: Slack channel
#drop-alerts - Send email to: alem@alai.no
- Send notification to: Slack channel
- Save rule
2. High Error Rate Alert (drop-app)
- Create new alert rule
- Configure:
- Conditions: Number of events in an issue is more than 100 in 1 hour
- Filters: Environment = production, Level = error
- Actions:
- Send notification to: Slack channel
#drop-alerts - Send email to: alem@alai.no
- Send notification to: Slack channel
- Save rule
3. Critical Error Alert (drop-api)
- Go to Projects → drop-api → Settings → Alerts
- Create alert rule:
- Conditions: When a new issue is created AND Level = fatal
- Filters: Environment = production
- Actions:
- Send notification to: Slack channel
#drop-critical - Send email to: alem@alai.no
- Send notification to: Slack channel
- Save rule
4. Performance Degradation Alert (drop-app)
- Create alert rule:
- Conditions: Average transaction duration is above 2000ms for 5 minutes
- Filters: Environment = production, Transaction = /api/transactions/*
- Actions:
- Send notification to: Slack channel
#drop-performance
- Send notification to: Slack channel
- Save rule
Slack Integration (Optional)
- Go to Settings → Integrations → Slack
- Click Add Workspace
- Authorize Sentry to access your Slack workspace
- Select channels:
#drop-alerts,#drop-critical,#drop-performance - Test integration by triggering a test error
PII Scrubbing
All three Sentry integrations automatically scrub sensitive data before sending events:
Scrubbed fields:
passwordpincardNumbercvvfødselsnummerauthorizationheaderscookieheaders
Implementation:
- drop-app (client):
src/drop-app/src/lib/sentry.ts(lines 51-76) - drop-app (server): Custom envelope API (no PII in server-side events)
- drop-api:
src/drop-api/src/lib/sentry.ts(lines 48-139)
Verification:
- Trigger error with sensitive data:
try { throw new Error('Login failed for user with password=secret123'); } catch (error) { captureError(error, { extra: { cardNumber: '1234567890123456' } }); } - Check Sentry event:
- Message should show:
Login failed for user with password=[REDACTED] - Extra context should show:
cardNumber: [REDACTED]
- Message should show:
Environment-Specific Configuration
Development
- DSN: Optional (errors log to console only if not set)
- Sample rate: 1.0 (capture all errors for debugging)
- Source maps: Not required (local stack traces are readable)
# .env.local (development)
NEXT_PUBLIC_SENTRY_DSN= # Leave empty to disable Sentry in dev
Staging
- DSN: Required (test Sentry integration before production)
- Sample rate: 0.5 (capture 50% of transactions)
- Source maps: Enabled (verify uploads work)
# .env.staging
NEXT_PUBLIC_SENTRY_DSN=https://YOUR_KEY@sentry.io/YOUR_PROJECT_ID
NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.5
SENTRY_AUTH_TOKEN=your-auth-token
Production
- DSN: Required (critical for production monitoring)
- Sample rate: 0.1 (capture 10% of transactions to stay within free tier)
- Source maps: Enabled (required for readable stack traces)
# .env.production
NEXT_PUBLIC_SENTRY_DSN=https://YOUR_KEY@sentry.io/YOUR_PROJECT_ID
NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE=0.1
SENTRY_AUTH_TOKEN=your-auth-token
Troubleshooting
No errors appearing in Sentry dashboard
Check 1: DSN configured?
# drop-app
echo $NEXT_PUBLIC_SENTRY_DSN
# drop-api
echo $SENTRY_DSN
Check 2: Console output?
- Errors should ALWAYS log to console, even if Sentry upload fails
- Look for:
[Sentry] Error captured: ...
Check 3: Network errors?
- Open browser DevTools → Network tab
- Filter by
sentry.io - Check for failed requests (should see POST to
https://o0.ingest.sentry.io/api/.../envelope/)
Check 4: Environment mismatch?
- Sentry filters events by environment (
production,development,staging) - Verify
NEXT_PUBLIC_APP_ENVorNODE_ENVmatches your Sentry project filters
Source maps not working (minified stack traces)
Check 1: Source maps uploaded?
cd src/drop-app
sentry-cli releases list
Check 2: Release version matches?
- Sentry matches source maps by release version
- Verify
package.jsonversion matches uploaded release
Check 3: Upload command ran?
# Manually test upload
sentry-cli sourcemaps upload --validate .next/static
PII still appearing in events
Check 1: Verify beforeSend hook
- Inspect
src/lib/sentry.ts(client) orsrc/lib/sentry.ts(API) - Confirm
beforeSendfunction is scrubbing sensitive keys
Check 2: Add custom scrubbing
- If new sensitive fields appear, add them to scrubbing list:
const sensitiveKeys = ["password", "pin", "yourNewField"];
Cost Management
Sentry Free Tier:
- 5,000 errors per month
- 10,000 performance units per month
- 1 GB attachments
- 30 days retention
Staying within free tier:
- Lower sample rate: Set
SENTRY_TRACES_SAMPLE_RATE=0.1(10%) - Filter noisy errors: Use
beforeSendto ignore expected errors (e.g., 404s) - Set up quotas: Sentry → Settings → Quotas → Set monthly limits
Example: Ignore 404 errors
beforeSend(event, hint) {
// Ignore 404 errors
if (event.request?.url?.includes('/api/') && hint?.originalException?.message?.includes('404')) {
return null; // Don't send to Sentry
}
return event;
}
Security Considerations
-
Auth token storage:
- NEVER commit
.sentryclircto git - Store
SENTRY_AUTH_TOKENin CI/CD secrets, not.envfiles
- NEVER commit
-
DSN exposure:
NEXT_PUBLIC_SENTRY_DSNis exposed to client-side code (safe - it's public)- Sentry rate-limits abuse via DSN quotas
-
PII scrubbing:
- Always verify PII scrubbing works before deploying to production
- Test with real-world data patterns (Norwegian fødselsnummer, BankID tokens)
-
Access control:
- Limit Sentry dashboard access to authorized team members only
- Use Sentry Teams to restrict project access
References
- Sentry Docs: https://docs.sentry.io/platforms/javascript/guides/nextjs/
- Sentry CLI: https://docs.sentry.io/product/cli/
- Source Maps: https://docs.sentry.io/platforms/javascript/sourcemaps/
- PII Scrubbing: https://docs.sentry.io/platforms/javascript/data-management/sensitive-data/
- Alert Rules: https://docs.sentry.io/product/alerts/
Next Steps
- Create Sentry account and projects (drop-app, drop-api)
- Add DSN to
.env.local(development) and.env.production(production) - Test error capture in all three components
- Configure alert rules (new issues, high error rate, critical errors)
- Set up source map uploads for production builds
- Integrate Slack notifications (optional)
- Monitor error dashboard daily during initial deployment
CloudWatch Logs Setup
CloudWatch Logs Setup — Drop Production
Date: 2026-02-22 Priority: P0 (Production Blocker) Effort: 2 hours Cost: ~$5/month (30 GB ingestion)
Overview
AWS App Runner automatically streams application logs (stdout/stderr) to CloudWatch Logs. This setup guide configures retention policies, log insights queries, and alarms for production monitoring.
Prerequisites
- AWS CLI configured with credentials
- App Runner service deployed to
eu-west-1 - Application writes JSON logs to stdout (already implemented via
src/lib/logger.ts)
Configuration
1. Set Log Retention Policy
Default: CloudWatch Logs retain forever (expensive) Recommendation: 30 days (production), 7 days (staging)
# Production: 30 days retention
aws logs put-retention-policy \
--log-group-name /aws/apprunner/drop-production \
--retention-in-days 30 \
--region eu-west-1
# Staging: 7 days retention
aws logs put-retention-policy \
--log-group-name /aws/apprunner/drop-staging \
--retention-in-days 7 \
--region eu-west-1
Verify retention:
aws logs describe-log-groups \
--log-group-name-prefix /aws/apprunner/drop \
--region eu-west-1 \
| jq '.logGroups[] | {name: .logGroupName, retention: .retentionInDays}'
# Expected:
# {
# "name": "/aws/apprunner/drop-production",
# "retention": 30
# }
2. Create Log Insights Queries
Purpose: Pre-built queries for common investigations.
Query 1: All Errors (Last Hour)
fields @timestamp, level, message, metadata.error, metadata.userId, requestId
| filter level = "error"
| sort @timestamp desc
| limit 100
Save as: drop-errors-last-hour
Query 2: User Activity Trace
fields @timestamp, level, message, metadata.userId, metadata.action, requestId
| filter metadata.userId = "usr_123"
| sort @timestamp desc
| limit 500
Save as: drop-user-activity-trace
Query 3: Request Trace by ID
fields @timestamp, level, message, metadata
| filter requestId = "req_abc123"
| sort @timestamp asc
Save as: drop-request-trace
Query 4: API Endpoint Performance
fields @timestamp, message, metadata.endpoint, metadata.latencyMs
| filter metadata.latencyMs > 1000
| stats avg(metadata.latencyMs) as avg_latency, max(metadata.latencyMs) as max_latency, count() as slow_requests by metadata.endpoint
| sort slow_requests desc
Save as: drop-slow-endpoints
Query 5: Authentication Events
fields @timestamp, level, message, metadata.action, metadata.userId, metadata.ip
| filter metadata.action in ["login_success", "login_failure", "logout"]
| sort @timestamp desc
| limit 100
Save as: drop-auth-events
Query 6: Payment Failures
fields @timestamp, level, message, metadata.errorCode, metadata.transactionId, metadata.userId
| filter metadata.errorCode in ["INSUFFICIENT_FUNDS", "PAYMENT_REJECTED", "TIMEOUT"]
| sort @timestamp desc
| limit 50
Save as: drop-payment-failures
3. Create CloudWatch Alarms
Alarm 1: High Error Rate
Metric: Error log entries per minute Threshold: >10 errors/minute for 2 consecutive periods Action: Send SNS notification → Slack webhook
# Create metric filter
aws logs put-metric-filter \
--log-group-name /aws/apprunner/drop-production \
--filter-name drop-error-count \
--filter-pattern '{ $.level = "error" }' \
--metric-transformations \
metricName=ErrorCount,metricNamespace=Drop/Logs,metricValue=1,unit=Count \
--region eu-west-1
# Create alarm
aws cloudwatch put-metric-alarm \
--alarm-name drop-high-error-rate \
--alarm-description "Alert when error rate exceeds threshold" \
--metric-name ErrorCount \
--namespace Drop/Logs \
--statistic Sum \
--period 60 \
--evaluation-periods 2 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--alarm-actions <SNS-TOPIC-ARN> \
--region eu-west-1
Alarm 2: No Logs Received (Service Down)
Metric: Log ingestion stopped Threshold: No logs for 5 minutes Action: Send SNS notification
aws cloudwatch put-metric-alarm \
--alarm-name drop-no-logs-received \
--alarm-description "Alert when no logs received (service may be down)" \
--metric-name IncomingLogEvents \
--namespace AWS/Logs \
--dimensions Name=LogGroupName,Value=/aws/apprunner/drop-production \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--treat-missing-data breaching \
--alarm-actions <SNS-TOPIC-ARN> \
--region eu-west-1
Alarm 3: Database Errors
Metric: Database connection errors Threshold: >5 DB errors in 5 minutes
aws logs put-metric-filter \
--log-group-name /aws/apprunner/drop-production \
--filter-name drop-db-errors \
--filter-pattern '{ $.message = "*database*" && $.level = "error" }' \
--metric-transformations \
metricName=DatabaseErrors,metricNamespace=Drop/Logs,metricValue=1,unit=Count \
--region eu-west-1
aws cloudwatch put-metric-alarm \
--alarm-name drop-database-errors \
--metric-name DatabaseErrors \
--namespace Drop/Logs \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions <SNS-TOPIC-ARN> \
--region eu-west-1
4. SNS Topic for Alerts
Create SNS topic (if not exists):
aws sns create-topic \
--name drop-cloudwatch-alerts \
--region eu-west-1
# Output:
# {
# "TopicArn": "arn:aws:sns:eu-west-1:324480209768:drop-cloudwatch-alerts"
# }
Subscribe Slack webhook:
# Option 1: Email subscription (immediate)
aws sns subscribe \
--topic-arn arn:aws:sns:eu-west-1:324480209768:drop-cloudwatch-alerts \
--protocol email \
--notification-endpoint alem@alai.no \
--region eu-west-1
# Confirm subscription via email link
# Option 2: Lambda → Slack (requires Lambda function)
# See: infrastructure/cloudwatch-to-slack-lambda.md (future enhancement)
5. Export Logs to S3 (Compliance/Archival)
Purpose: Long-term storage (>30 days) for compliance, cheaper than CloudWatch.
Create S3 bucket:
aws s3 mb s3://drop-logs-archive --region eu-west-1
# Set lifecycle policy (move to Glacier after 90 days)
cat > lifecycle.json <<EOF
{
"Rules": [
{
"Id": "archive-old-logs",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "GLACIER"
}
],
"Expiration": {
"Days": 1825
}
}
]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
--bucket drop-logs-archive \
--lifecycle-configuration file://lifecycle.json
Create export task (manual, run monthly):
# Export last 30 days to S3 (run on day 1 of each month)
START_TIME=$(date -u -d '60 days ago' +%s)000
END_TIME=$(date -u -d '30 days ago' +%s)000
aws logs create-export-task \
--log-group-name /aws/apprunner/drop-production \
--from $START_TIME \
--to $END_TIME \
--destination drop-logs-archive \
--destination-prefix logs/$(date +%Y-%m) \
--region eu-west-1
# Check export status
aws logs describe-export-tasks --region eu-west-1
Automate with Lambda (future):
- Schedule Lambda to run monthly
- Export previous month's logs to S3
- Delete from CloudWatch after successful export
Log Format
Current Format (Structured JSON)
Example log entry:
{
"timestamp": "2026-02-22T10:30:45.123Z",
"level": "info",
"message": "User logged in",
"requestId": "req_abc123",
"metadata": {
"userId": "usr_456",
"email": "user@example.com",
"ip": "1.2.3.4",
"action": "login_success"
}
}
CloudWatch Logs Insights automatically parses JSON fields, enabling queries like:
| filter metadata.userId = "usr_456"
Cost Estimate
CloudWatch Logs Pricing (EU-West-1)
- Ingestion: $0.50 per GB
- Storage: $0.03 per GB/month
- Log Insights queries: $0.005 per GB scanned
Expected Usage (Production)
- Log volume: ~1 GB/day (30 GB/month)
- Ingestion cost: 30 GB × $0.50 = $15/month
- Storage cost (30-day retention): 30 GB × $0.03 = $0.90/month
- Query cost: ~10 queries/day × 1 GB × $0.005 × 30 = $1.50/month
Total: ~$17/month
Cost Optimization
-
Reduce log verbosity (filter debug logs in production):
// src/lib/logger.ts const minLevel = process.env.NODE_ENV === 'production' ? 'info' : 'debug'; -
Use sampling for high-volume events:
if (Math.random() < 0.1) { // Log 10% of requests logger.debug('Request details', { ... }); } -
Export to S3 for long-term storage ($0.023/GB/month, 23% cheaper)
Querying Logs
Via AWS Console
- Open CloudWatch Console: https://console.aws.amazon.com/cloudwatch/
- Navigate to: Logs → Log groups →
/aws/apprunner/drop-production - Click "Search log group" or "Insights queries"
- Select saved query or write custom query
Via AWS CLI
# Run saved query
aws logs start-query \
--log-group-name /aws/apprunner/drop-production \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, level, message | filter level = "error" | sort @timestamp desc' \
--region eu-west-1
# Get query results (use queryId from previous command)
aws logs get-query-results --query-id <query-id> --region eu-west-1
Via Log Streaming (Real-Time)
# Stream logs in real-time (like tail -f)
aws logs tail /aws/apprunner/drop-production \
--follow \
--format short \
--region eu-west-1
# Filter by error level
aws logs tail /aws/apprunner/drop-production \
--follow \
--filter-pattern '{ $.level = "error" }' \
--region eu-west-1
Troubleshooting
Issue: No logs appearing in CloudWatch
Diagnosis:
# Check if log group exists
aws logs describe-log-groups \
--log-group-name-prefix /aws/apprunner/drop \
--region eu-west-1
# Check App Runner service logs integration
aws apprunner describe-service \
--service-arn <ARN> \
--region eu-west-1 \
| jq '.Service.ObservabilityConfiguration'
Solution:
- App Runner auto-creates log group on first log output
- Verify app is writing to stdout (not file)
- Check IAM permissions (App Runner role needs
logs:CreateLogStream,logs:PutLogEvents)
Issue: Logs not in JSON format
Diagnosis:
# Check log entries
aws logs tail /aws/apprunner/drop-production --format short --region eu-west-1 | head -10
Solution:
- Ensure app uses
logger.tsfor all logging (notconsole.log) - Verify
process.stdout.write(JSON.stringify(entry) + "\n")is used
Checklist
- Retention policy set (30 days production, 7 days staging)
- Log Insights queries saved (6 queries)
- Metric filters created (error count, DB errors)
- CloudWatch alarms configured (3 alarms)
- SNS topic created and subscribed (email/Slack)
- S3 export bucket created (with lifecycle policy)
- Cost estimate reviewed and approved
- Team trained on log querying (AWS Console + CLI)
- Documentation updated
Next Steps
- Deploy retention policies (run commands above)
- Test alarms (trigger error spike, verify alert received)
- Save Log Insights queries (via AWS Console)
- Schedule monthly S3 export (manual for now, automate later)
- Monitor costs (set billing alert at $20/month)
Related Documentation
docs/infrastructure/MONITORING.md— Overall monitoring setupsrc/lib/logger.ts— Structured logging implementationinfrastructure/error-tracking-setup.md— Sentry integration- AWS CloudWatch Logs docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/
Last Updated: 2026-02-22 Owner: John (AI Director)