ALAI Hosting Operations Runbook
Owner: FlowForge (Kelsey Hightower) | Updated: 2026-04-20 | MC: #8491
1. Overview
This runbook covers operational procedures for ALAI's static site hosting on Cloudflare Pages. For architecture and migration plan, see the ALAI Static Hosting Blueprint (Infrastructure chapter).
In Scope:
- Cloudflare Pages deployments (9 static sites)
- DNS configuration (Cloudflare DNS)
- SSL certificate management (auto-renewal)
- Rollback procedures (< 60s target)
- SENTINEL uptime monitoring integration
Out of Scope:
- Azure VM services (BookStack, Documenso, Planka, Vaultwarden) — see individual runbooks
- GCP Cloud Run (Bilko API, Intesa demo) — see Bilko runbooks
- Dynamic Next.js apps (app.getdrop.no) — see Drop runbook
2. Rollback Procedure
When: Deploy caused production issue (5xx errors, broken UI, functionality regression)
Target: < 60 seconds from decision to live rollback
Step 1: Identify Last Known Good Deployment
# List recent deployments
npx wrangler pages deployment list --project-name=<project-name>
# Example output:
# ID: abc123def456
# Created: 2026-04-20 14:30:00
# Branch: main
# Status: active
Step 2: Execute Rollback
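# Roll back to the last known good deployment (use ID from step 1).
# Confirm with `npx wrangler pages deployment --help` that your wrangler version offers a rollback subcommand;
# if it does not, use the dashboard: Project > Deployments > Rollback to this deployment.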
npx wrangler pages deployment rollback <deployment-id> --project-name=<project-name>
# Example:
npx wrangler pages deployment rollback abc123def456 --project-name=alai-no
Step 3: Verify
# Check HTTP status
curl -I https://<domain>
# Expected: HTTP/2 200
# If 5xx persists → escalate to L2 (Kelsey)
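The verification step can be scripted so that it polls until the site recovers and reports how long that took against the 60-second target. The following is a minimal sketch, not part of the existing tooling; the function name `wait_for_200` and its defaults are assumptions made for illustration.

```shell
# Poll a URL until it returns HTTP 200 or the timeout elapses.
# Hypothetical helper; usage: wait_for_200 <url> [timeout_seconds] [interval_seconds]
wait_for_200() {
  local url="$1" timeout="${2:-60}" interval="${3:-5}"
  local start status elapsed
  start="$(date +%s)"
  while :; do
    # -s: silent, -o /dev/null: discard body, -w: print only the status code
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
    elapsed=$(( $(date +%s) - start ))
    if [ "$status" = "200" ]; then
      echo "OK: $url returned 200 after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "FAIL: $url last status ${status:-none} after ${elapsed}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (run right after the rollback command):
#   wait_for_200 "https://<domain>" 60 5 || echo "escalate to L2"
```

A non-zero exit status means the site did not return 200 within the window, which corresponds to the escalation case above.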
Step 4: Alert & Document
# Post to Slack
node ~/system/tools/slack.js send "#infra-alerts" \
"ROLLBACK executed: <project-name> to deployment <deployment-id> at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Create incident report (if > 5 min downtime)
node ~/system/tools/mc.js add "Incident: <domain> rollback" \
--desc "Reason: [fill]. Rollback target: [deployment-id]. Downtime: [X min]" \
--priority H --owner kelsey
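Whether an incident report is needed depends on the 5-minute downtime threshold. One way to make that decision mechanical is sketched below; the function names are hypothetical, and rounding partial minutes up is an assumption rather than documented policy.

```shell
# Downtime in whole minutes between two epoch timestamps (seconds).
# Partial minutes are rounded up, so 5 min 1 s counts as 6 (assumption).
downtime_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Exit status 0 when downtime exceeded 5 minutes, i.e. a report is required.
needs_incident_report() {
  local minutes
  minutes="$(downtime_minutes "$1" "$2")"
  [ "$minutes" -gt 5 ]
}

# Example:
#   start=$(date +%s)            # when the issue was detected
#   ...rollback and verify...
#   end=$(date +%s)
#   needs_incident_report "$start" "$end" && echo "create incident report"
```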
3. SSL Certificate Auto-Renewal
Cloudflare Pages manages SSL certificates automatically via Cloudflare's CA. Certificates renew 30 days before expiry.
No manual action required.
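To check independently of SENTINEL how long the currently served certificate remains valid, something like the following sketch could be used. It assumes `openssl` is installed on the machine running the check; the function name `cert_valid_for_days` is hypothetical.

```shell
# Exit status 0 if the certificate served for <domain> stays valid for more
# than <days> days (default 30). Hypothetical helper.
cert_valid_for_days() {
  local domain="$1" days="${2:-30}" seconds
  seconds=$(( days * 86400 ))
  # s_client fetches the certificate presented on port 443 (with SNI);
  # x509 -checkend exits 0 when the certificate will NOT expire within
  # the given number of seconds.
  openssl s_client -connect "${domain}:443" -servername "$domain" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$seconds" >/dev/null
}

# Example:
#   cert_valid_for_days "<domain>" 30 || echo "cert expires within 30 days or could not be read"
```

A non-zero exit status also covers the case where the certificate could not be fetched at all, so it should be read as "needs a closer look" rather than as proof of imminent expiry.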
Troubleshooting: SSL Cert Warning
If SENTINEL alerts "SSL cert expiry < 30 days":
# Step 1: Verify domain DNS points to Cloudflare
dig <domain> +short
# Expected: CNAME to <project-name>.pages.dev or Cloudflare IP range
# Step 2: Check Cloudflare dashboard
open "https://dash.cloudflare.com/pages"
# Navigate to: Project > Settings > Custom domains
# Verify: "SSL/TLS certificate" shows "Active"
# Step 3: If the cert is not renewing, escalate to Cloudflare support
# (Cloudflare Pages does not expose a manual renewal API)
node ~/system/tools/slack.js send "#infra-alerts" \
"SSL cert not auto-renewing for <domain> — escalating to Cloudflare support"
4. Migration Workflow: New Site
Input: New static site needs hosting (markdown, React, Next.js static export, Astro)
Output: Site live on custom domain with SSL, SENTINEL monitoring enabled
Step 1: Validate Static Export
# For Next.js: verify static export enabled
grep 'output.*export' /path/to/site/next.config.js
# Expected: output: 'export'
# Build locally to verify
cd /path/to/site && npm run build
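# Expected: static output directory exists (out/ for Next.js static export, dist/ for Astro)
# If only .next/ is present, the Next.js static export is not enabled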
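The two checks above can be combined into one pre-flight function. This is a sketch under assumptions: the function name `check_static_export` is hypothetical, only `next.config.js` is inspected, and only `out/` and `dist/` are treated as static output directories.

```shell
# Print the static output directory name (out or dist) for a built site.
# Fails if next.config.js exists without output: 'export', or if no
# static output directory is present. Hypothetical helper.
check_static_export() {
  local site_dir="$1" config d
  config="$site_dir/next.config.js"
  if [ -f "$config" ] && ! grep -Eq "output:[[:space:]]*['\"]export['\"]" "$config"; then
    echo "next.config.js found but output: 'export' is not set" >&2
    return 1
  fi
  for d in out dist; do
    if [ -d "$site_dir/$d" ]; then
      echo "$d"
      return 0
    fi
  done
  echo "no static output directory (out/ or dist/) in $site_dir" >&2
  return 1
}

# Example (after npm run build):
#   check_static_export /path/to/site
```

The printed directory name is the value to enter as the output directory in the build settings.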
Step 2: Create Cloudflare Pages Project
# Option A: Dashboard (recommended for first-time)
open "https://dash.cloudflare.com/pages"
# Click: Create a project > Connect to Git > Select repo
# Option B: CLI
npx wrangler pages project create <project-name> --production-branch main
Step 3: Configure Build Settings
In Cloudflare dashboard: Project > Settings > Builds
| Framework | Build command | Output directory |
|---|---|---|
| Static HTML | (none) | / |
| Next.js (static export) | npm run build | out |
| Astro | npm run build | dist |
Save settings.
Step 4: Add GitHub Actions Workflow
Copy from template:
cp /Users/makinja/system/specs/templates/cf-pages-deploy.yml \
/path/to/site/.github/workflows/deploy.yml
Commit and push to trigger first deploy.
Step 5: Add Custom Domain
# In Cloudflare dashboard: Project > Custom domains > Add custom domain
# Enter: <domain>
# If domain DNS is already on Cloudflare: CNAME record auto-created
# If domain DNS is external: Manual CNAME to <project-name>.pages.dev required
Verify SSL activates (usually < 5 min).
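For externally hosted DNS, the manual CNAME can be verified from the command line before waiting on SSL. The sketch below assumes `dig` is available; the function name `points_to_pages` is hypothetical. For domains on Cloudflare DNS, the record may be flattened or proxied and return no CNAME at all, in which case this check is inconclusive rather than failing.

```shell
# Exit status 0 if <domain> has a CNAME pointing at <project-name>.pages.dev.
# Hypothetical helper; usage: points_to_pages <domain> <project-name>
points_to_pages() {
  local domain="$1" project="$2" target
  # dig prints the CNAME target with a trailing dot, e.g. "alai-no.pages.dev."
  target="$(dig "$domain" CNAME +short | head -n 1)"
  # Strip the trailing dot before comparing
  [ "${target%.}" = "${project}.pages.dev" ]
}

# Example:
#   points_to_pages "<domain>" "<project-name>" || echo "CNAME missing or not visible yet"
```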
Step 6: Enable SENTINEL Monitoring
Add domain to /Users/makinja/system/tools/sentinel-uptime.sh:
# Open file
nano /Users/makinja/system/tools/sentinel-uptime.sh
# Add line to SITES array:
"https://<domain>"
# Save and test
bash /Users/makinja/system/tools/sentinel-uptime.sh
Verify Slack alert NOT sent (indicates site UP).
Step 7: Document
Update site inventory:
# Add line to ~/system/docs/infrastructure-inventory.md
echo "| <domain> | Cloudflare Pages | <project-name> | [GitHub repo URL] | ACTIVE |" \
>> ~/system/docs/infrastructure-inventory.md
5. SENTINEL Uptime Integration
SENTINEL checks all ALAI sites every 5 minutes via cron.
Script: /Users/makinja/system/tools/sentinel-uptime.sh
Cron: */5 * * * * bash /Users/makinja/system/tools/sentinel-uptime.sh
Alert Channel: #infra-alerts (Slack)
Add New Site to SENTINEL
# Edit SITES array
nano /Users/makinja/system/tools/sentinel-uptime.sh
# Add:
"https://<domain>"
# Test manually
bash /Users/makinja/system/tools/sentinel-uptime.sh
# Expected: No output (site UP) or Slack alert (site DOWN)
Troubleshoot False Alerts
If SENTINEL reports DOWN but site is UP:
# Test from command line
curl -I --max-time 10 https://<domain>
# If returns 200: SENTINEL script has timeout issue (increase --max-time)
# If returns 5xx: Real issue — investigate Cloudflare Pages logs
# If returns 301/302: Update SENTINEL to accept redirects
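The three outcomes above can be expressed as a small classification step. The actual contents of sentinel-uptime.sh are not shown in this runbook, so the following is only a sketch of how such a check could treat redirects and timeouts. The function names and the `MAX_TIME` variable are hypothetical, and counting 4xx as DOWN is an assumption.

```shell
# Map an HTTP status code to UP / REDIRECT / DOWN.
classify_status() {
  case "$1" in
    2??) echo "UP" ;;
    3??) echo "REDIRECT" ;;
    *)   echo "DOWN" ;;   # 4xx, 5xx, and 000 (curl could not connect or timed out)
  esac
}

# Check every URL passed as an argument and print one line per site.
# Returns non-zero if any site is DOWN. MAX_TIME (seconds) defaults to 10.
check_sites() {
  local url status verdict failed=0
  for url in "$@"; do
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time "${MAX_TIME:-10}" "$url")"
    verdict="$(classify_status "$status")"
    echo "$verdict $status $url"
    if [ "$verdict" = "DOWN" ]; then
      failed=1
    fi
  done
  return "$failed"
}

# Example with a longer timeout:
#   MAX_TIME=20 check_sites "https://<domain>"
```

In a bash script that keeps its URLs in a SITES array, the call would be `check_sites "${SITES[@]}"`. Reporting redirects separately from DOWN avoids alerting on a site that merely redirects.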
6. Emergency DR: Serve from Azure VM
Scenario: Cloudflare Pages is down (e.g., Cloudflare incident) AND site is business-critical (e.g., alai.no during client demo).
Target: Site accessible within 120 seconds.
Step 1: Copy Build Output to VM
# From local machine:
cd /path/to/site
npm run build
scp -r ./out <user>@<vm-host>:/var/www/<site-name>
Step 2: Serve via Caddy
# SSH to VM
ssh -i ~/.ssh/azure_alai <user>@<vm-host>
# Start the simple HTTP server first, so Caddy has an upstream to proxy to
cd /var/www/<site-name> && python3 -m http.server 8080 &
# Start Caddy reverse proxy in front of it
sudo caddy reverse-proxy --from <domain> --to localhost:8080 &
Step 3: Update DNS (if needed)
# If Cloudflare DNS is also down, update DNS to point to Azure VM IP
# This requires registrar access — NOT recommended unless multi-hour Cloudflare outage
Step 4: Monitor & Rollback
# Verify site accessible
curl -I https://<domain>
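# When Cloudflare recovers: if DNS was not changed in Step 3, traffic returns to Pages on its own
# (the CNAME to .pages.dev still exists). If DNS was repointed to the VM, revert it manually.
# Stop Caddy and the temporary HTTP server on the VM
sudo killall caddy
pkill -f "http.server 8080"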
7. Escalation
| Issue | L1 Action | L2 Escalation | L3 Escalation |
|---|---|---|---|
| Deploy failure | Review build logs; check package.json/next.config.js | Kelsey investigates Cloudflare Pages logs | Contact Cloudflare support via dashboard |
| 5xx errors (< 5 min) | Execute rollback (Section 2) | Kelsey reviews last commit for breaking change | CEO notification + DR activation (Section 6) |
| SSL cert not renewing | Verify DNS (Section 3) | Kelsey contacts CF support (no manual renewal API, see Section 3) | Switch to Let's Encrypt via Azure VM |
| SENTINEL false alerts | Verify site UP via curl; adjust timeout | Kelsey reviews SENTINEL script logic | Disable SENTINEL for that site; use external monitor |
| DNS not resolving | Verify Cloudflare DNS records; check registrar NS | Kelsey checks registrar portal for NS change | Contact registrar support |
Key Contacts:
- L2: Kelsey Hightower (FlowForge agent) via MC task
- L3: CEO (Alem Basic) via Slack DM or phone (+47 404 74 251)
8. Maintenance Schedule
| Task | Frequency | Owner | How |
|---|---|---|---|
| Test rollback procedure | Monthly | Proveo (Angie Jones) | Execute rollback on staging site; verify < 60s |
| Review SENTINEL alerts | Weekly | Kelsey | Check Slack #infra-alerts for false positives |
| Update dependency versions | Weekly | Renovate bot | Auto-merge minor/patch; manual review major |
| Backup DNS zone config | Weekly | Automated cron | Exports to ~/system/backups/dns/ |
| Verify SSL certs valid | Daily | SENTINEL | Auto-alert if < 30 days to expiry |
9. Related Docs
- ALAI Static Hosting Blueprint — Architecture & migration plan
- Infrastructure Inventory — All ALAI sites & services
- SENTINEL Reliability Sprint — Monitoring architecture
- Incident Response Playbook — General incident workflow
10. Change Log
| Date | Change | Author |
|---|---|---|
| 2026-04-20 | Initial version — rollback, SSL, migration, SENTINEL | Skillforge (MC #8491) |