ALAI Hosting Operations Runbook

Owner: FlowForge (Kelsey Hightower) | Updated: 2026-04-20 | MC: #8491


1. Overview

This runbook covers operational procedures for ALAI's static site hosting on Cloudflare Pages. For architecture and migration plan, see the ALAI Static Hosting Blueprint (Infrastructure chapter).

In Scope:

  • Cloudflare Pages deployments (9 static sites)
  • DNS configuration (Cloudflare DNS)
  • SSL certificate management (auto-renewal)
  • Rollback procedures (< 60s target)
  • SENTINEL uptime monitoring integration

Out of Scope:

  • Azure VM services (BookStack, Documenso, Planka, Vaultwarden) — see individual runbooks
  • GCP Cloud Run (Bilko API, Intesa demo) — see Bilko runbooks
  • Dynamic Next.js apps (app.getdrop.no) — see Drop runbook

2. Rollback Procedure

When: Deploy caused production issue (5xx errors, broken UI, functionality regression)

Target: < 60 seconds from decision to live rollback

Step 1: Identify Last Known Good Deployment

# List recent deployments
npx wrangler pages deployment list --project-name=<project-name>

# Example output:
# ID: abc123def456
# Created: 2026-04-20 14:30:00
# Branch: main
# Status: active

Step 2: Execute Rollback

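# Rollback to previous deployment (use ID from step 1)
# If your wrangler version lacks this subcommand (check: npx wrangler pages deployment --help),
# use the dashboard instead: Project > Deployments > Rollback to this deployment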
npx wrangler pages deployment rollback <deployment-id> --project-name=<project-name>

# Example:
npx wrangler pages deployment rollback abc123def456 --project-name=alai-no
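
If the installed wrangler version does not offer the rollback subcommand used above, the Cloudflare REST API also has a Pages rollback endpoint. The following is a minimal sketch, not a tested procedure. The environment variable names CF_ACCOUNT_ID and CF_API_TOKEN and both function names are assumptions, and the endpoint path should be checked against the current Cloudflare API reference before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: roll back a Pages project through the Cloudflare REST API.
# CF_ACCOUNT_ID and CF_API_TOKEN are assumed environment variables.

# Build the rollback endpoint URL for a given account, project, and deployment.
pages_rollback_url() {
  local account_id="$1" project="$2" deployment_id="$3"
  printf 'https://api.cloudflare.com/client/v4/accounts/%s/pages/projects/%s/deployments/%s/rollback\n' \
    "$account_id" "$project" "$deployment_id"
}

# POST to the rollback endpoint; exits non-zero on HTTP errors.
pages_rollback() {
  local project="$1" deployment_id="$2"
  curl --fail --silent --show-error -X POST \
    -H "Authorization: Bearer ${CF_API_TOKEN:?set CF_API_TOKEN}" \
    "$(pages_rollback_url "${CF_ACCOUNT_ID:?set CF_ACCOUNT_ID}" "$project" "$deployment_id")"
}

# Usage: pages_rollback alai-no abc123def456
```

Keeping the URL builder separate makes it possible to print and review the target before sending the request.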

Step 3: Verify

# Check HTTP status
curl -I https://<domain>

# Expected: HTTP/2 200
# If 5xx persists → escalate to L2 (Kelsey)
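
A single curl can hit the site before the rollback has fully propagated. The sketch below polls until the site answers 200 or the attempts run out. The function names, the 12 x 5s defaults, and the output messages are assumptions chosen to match the 60-second target, not part of any existing tool.

```shell
#!/usr/bin/env bash
# Sketch: poll a URL until it returns HTTP 200 or the attempts are exhausted.

# Print the HTTP status code for a URL ("000" if the request fails or times out).
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1" || true
}

# Poll every <interval> seconds, up to <attempts> times.
# Returns 0 as soon as the site answers 200, 1 otherwise.
wait_for_200() {
  local url="$1" attempts="${2:-12}" interval="${3:-5}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    code="$(http_status "$url")"
    if [ "$code" = "200" ]; then
      echo "OK after ${i} attempt(s)"
      return 0
    fi
    echo "attempt ${i}: HTTP ${code}" >&2
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still failing after ${attempts} attempts: escalate to L2" >&2
  return 1
}

# Usage: wait_for_200 "https://<domain>" 12 5   # 12 attempts x 5s, roughly the 60s target
```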

Step 4: Alert & Document

# Post to Slack
node ~/system/tools/slack.js send "#infra-alerts" \
  "ROLLBACK executed: <project-name> to deployment <deployment-id> at $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Create incident report (if > 5 min downtime)
node ~/system/tools/mc.js add "Incident: <domain> rollback" \
  --desc "Reason: [fill]. Rollback target: [deployment-id]. Downtime: [X min]" \
  --priority H --owner kelsey
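
The 5-minute threshold for the incident report can be checked mechanically if the start and end of the outage are recorded as epoch seconds. This is a small sketch; both function names are hypothetical helpers, not existing tools.

```shell
#!/usr/bin/env bash
# Sketch: decide whether an incident report is needed (> 5 min downtime).
# Timestamps are Unix epoch seconds, e.g. captured with: date -u +%s

# Whole minutes of downtime between two epoch timestamps (rounded down).
downtime_minutes() {
  local start="$1" end="$2"
  echo $(( (end - start) / 60 ))
}

# Returns 0 when downtime exceeds the 5-minute threshold (300 seconds).
needs_incident_report() {
  local start="$1" end="$2"
  [ $(( end - start )) -gt 300 ]
}

# Usage:
#   START=$(date -u +%s)   # when the issue was detected
#   END=$(date -u +%s)     # when the rollback was verified
#   needs_incident_report "$START" "$END" && echo "Downtime: $(downtime_minutes "$START" "$END") min"
```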

3. SSL Certificate Auto-Renewal

Cloudflare Pages manages SSL certificates automatically via Cloudflare's CA. Certificates renew 30 days before expiry.

No manual action required.

Troubleshooting: SSL Cert Warning

If SENTINEL alerts "SSL cert expiry < 30 days":

# Step 1: Verify domain DNS points to Cloudflare
dig <domain> +short

# Expected: CNAME to <project-name>.pages.dev or Cloudflare IP range

# Step 2: Check Cloudflare dashboard
open "https://dash.cloudflare.com/pages"
# Navigate to: Project > Settings > Custom domains
# Verify: "SSL/TLS certificate" shows "Active"

# Step 3: If the cert is not renewing, escalate to Cloudflare support
# (Cloudflare Pages does not expose a manual renewal API)
node ~/system/tools/slack.js send "#infra-alerts" \
  "SSL cert not auto-renewing for <domain> — escalating to Cloudflare support"
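
To confirm a SENTINEL expiry alert independently, the certificate the domain actually serves can be inspected with openssl. This is a sketch under the assumption that openssl is installed on the operator's machine; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether the certificate served for a domain expires soon.

# Convert days to seconds for openssl's -checkend option.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Fetch the leaf certificate served for <domain> on port 443 (PEM on stdout).
fetch_cert() {
  local domain="$1"
  openssl s_client -connect "${domain}:443" -servername "$domain" </dev/null 2>/dev/null \
    | openssl x509
}

# Reads a PEM certificate on stdin.
# Returns 0 if it is still valid <days> days from now, 1 if it will have expired by then.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(days_to_seconds "$1")" >/dev/null
}

# Usage:
#   fetch_cert "<domain>" | openssl x509 -noout -enddate        # print the expiry date
#   fetch_cert "<domain>" | cert_valid_for_days 30 || echo "expires within 30 days"
```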

4. Migration Workflow: New Site

Input: New static site needs hosting (markdown, React, Next.js static export, Astro)

Output: Site live on custom domain with SSL, SENTINEL monitoring enabled

Step 1: Validate Static Export

# For Next.js: verify static export enabled
grep 'output.*export' /path/to/site/next.config.js

# Expected: output: 'export'

# Build locally to verify
cd /path/to/site && npm run build

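# Expected: output directory exists (out/ for Next.js static export, dist/ for Astro)
# Note: .next/ is created by every Next.js build and does not confirm a static export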
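
The two checks above can be wrapped into small pre-flight helpers. This is a sketch with hypothetical function names. The grep pattern assumes the config writes the option literally as output: 'export' (single or double quotes), and the index.html check assumes the site has a root page.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight checks before creating the Pages project.

# Returns 0 if the Next.js config file sets output: 'export'.
has_static_export() {
  grep -Eq "output:[[:space:]]*['\"]export['\"]" "$1"
}

# Returns 0 if the build output directory contains an index.html.
has_build_output() {
  [ -f "$1/index.html" ]
}

# Usage (from the site directory):
#   has_static_export next.config.js && npm run build && has_build_output out
```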

Step 2: Create Cloudflare Pages Project

# Option A: Dashboard (recommended for first-time)
open "https://dash.cloudflare.com/pages"
# Click: Create a project > Connect to Git > Select repo

# Option B: CLI
npx wrangler pages project create <project-name> --production-branch main

Step 3: Configure Build Settings

In Cloudflare dashboard: Project > Settings > Builds

| Framework | Build command | Output directory |
| --- | --- | --- |
| Static HTML | (none) | / |
| Next.js (static export) | npm run build | out |
| Astro | npm run build | dist |

Save settings.

Step 4: Add GitHub Actions Workflow

Copy from template:

cp /Users/makinja/system/specs/templates/cf-pages-deploy.yml \
   /path/to/site/.github/workflows/deploy.yml

Commit and push to trigger first deploy.

Step 5: Add Custom Domain

# In Cloudflare dashboard: Project > Custom domains > Add custom domain
# Enter: <domain>

# If domain DNS is already on Cloudflare: CNAME record auto-created
# If domain DNS is external: Manual CNAME to <project-name>.pages.dev required

Verify SSL activates (usually < 5 min).
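
For a domain whose DNS is hosted externally, the manual CNAME can be verified from the command line. This is a sketch with a hypothetical function name. It only applies to the external-DNS case: a proxied record on Cloudflare DNS answers with Cloudflare IPs rather than a CNAME, so an empty CNAME answer there is not a failure.

```shell
#!/usr/bin/env bash
# Sketch: confirm an externally hosted domain's CNAME targets the Pages project.

# Returns 0 if the CNAME answer for <domain> is <project-name>.pages.dev.
cname_points_to_pages() {
  local domain="$1" project="$2" target
  # dig prints the target with a trailing dot; strip it before comparing.
  target="$(dig +short CNAME "$domain" | head -n 1 | sed 's/\.$//')"
  [ "$target" = "${project}.pages.dev" ]
}

# Usage: cname_points_to_pages "<domain>" "<project-name>" || echo "CNAME missing or wrong"
```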

Step 6: Enable SENTINEL Monitoring

Add domain to /Users/makinja/system/tools/sentinel-uptime.sh:

# Open file
nano /Users/makinja/system/tools/sentinel-uptime.sh

# Add line to SITES array:
"https://<domain>"

# Save and test
bash /Users/makinja/system/tools/sentinel-uptime.sh

Verify that no Slack alert is sent: silence means the site is UP.

Step 7: Document

Update site inventory:

# Add line to ~/system/docs/infrastructure-inventory.md
echo "| <domain> | Cloudflare Pages | <project-name> | [GitHub repo URL] | ACTIVE |" \
  >> ~/system/docs/infrastructure-inventory.md

5. SENTINEL Uptime Integration

SENTINEL checks all ALAI sites every 5 minutes via cron.

Script: /Users/makinja/system/tools/sentinel-uptime.sh

Cron: */5 * * * * bash /Users/makinja/system/tools/sentinel-uptime.sh

Alert Channel: #infra-alerts (Slack)
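
The script itself is not reproduced in this runbook. As orientation only, a check of this kind might look like the hypothetical sketch below; the real sentinel-uptime.sh may differ in structure, thresholds, and message format. The alert call mirrors the slack.js invocation used elsewhere in this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a SENTINEL-style check; the real sentinel-uptime.sh may differ.

# Print the HTTP status code for a URL ("000" when the request fails or times out).
probe() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1" || true
}

# Send an alert to the infra channel.
alert() {
  node ~/system/tools/slack.js send "#infra-alerts" "$1"
}

# Check every URL passed as an argument. Healthy sites produce no output;
# anything that does not answer 200 triggers an alert.
# The exit status is the number of failing sites.
check_sites() {
  local url code failures=0
  for url in "$@"; do
    code="$(probe "$url")"
    if [ "$code" != "200" ]; then
      alert "DOWN: ${url} (HTTP ${code})"
      failures=$((failures + 1))
    fi
  done
  return "$failures"
}

# Usage inside a script that defines a SITES array:
#   check_sites "${SITES[@]}"
```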

Add New Site to SENTINEL

# Edit SITES array
nano /Users/makinja/system/tools/sentinel-uptime.sh

# Add:
"https://<domain>"

# Test manually
bash /Users/makinja/system/tools/sentinel-uptime.sh

# Expected: No output (site UP) or Slack alert (site DOWN)

Troubleshoot False Alerts

If SENTINEL reports DOWN but site is UP:

# Test from command line
curl -I --max-time 10 https://<domain>

# If returns 200: SENTINEL script has timeout issue (increase --max-time)
# If returns 5xx: Real issue — investigate Cloudflare Pages logs
# If returns 301/302: Update SENTINEL to accept redirects
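
The three cases above can be expressed as a small classifier. This is a sketch with hypothetical function names, and it is not taken from the SENTINEL script. It shows one way to accept redirects: let curl follow them with --location, so a 301 that leads to a healthy page is reported as UP.

```shell
#!/usr/bin/env bash
# Sketch: classify a probe result so redirects do not count as downtime.

# Map an HTTP status code to UP, REDIRECT, or DOWN.
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo "UP" ;;
    301|302|307|308) echo "REDIRECT" ;;
    *) echo "DOWN" ;;
  esac
}

# Probe a URL with a configurable timeout (seconds) and classify the final response.
# --location makes curl follow redirects, so REDIRECT only appears here when
# curl gives up following them; classify_status is also usable on its own.
probe_site() {
  local url="$1" timeout="${2:-10}" code
  code="$(curl --silent --location --output /dev/null --max-time "$timeout" \
    --write-out '%{http_code}' "$url" || true)"
  classify_status "$code"
}

# Usage: probe_site "https://<domain>" 20   # raise the timeout for slow responses
```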

6. Emergency DR: Serve from Azure VM

Scenario: Cloudflare Pages is down (e.g., Cloudflare incident) AND site is business-critical (e.g., alai.no during client demo).

Target: Site accessible within 120 seconds.

Step 1: Copy Build Output to VM

# From local machine:
cd /path/to/site
npm run build
scp -r ./out <user>@<vm-host>:/var/www/<site-name>

Step 2: Serve via Caddy

# SSH to VM
ssh -i ~/.ssh/azure_alai <user>@<vm-host>

# Start Caddy reverse proxy
sudo caddy reverse-proxy --from <domain> --to localhost:8080 &

# Start simple HTTP server
cd /var/www/<site-name> && python3 -m http.server 8080 &

Step 3: Update DNS (if needed)

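# Traffic reaches the VM only once <domain> resolves to the Azure VM IP.
# If Cloudflare DNS is still up: repoint the record for <domain> in the Cloudflare dashboard.
# If Cloudflare DNS is also down: changing DNS requires registrar access (nameserver change);
# NOT recommended unless the Cloudflare outage lasts multiple hours.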

Step 4: Monitor & Rollback

# Verify site accessible
curl -I https://<domain>

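# When Cloudflare recovers: restore any DNS record changed in Step 3.
# If DNS was left untouched, the CNAME to .pages.dev is still in place and no revert is needed.
# Stop Caddy and the temporary HTTP server on the VM
sudo killall caddy
pkill -f "http.server 8080"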

7. Escalation

| Issue | L1 Action | L2 Escalation | L3 Escalation |
| --- | --- | --- | --- |
| Deploy failure | Review build logs; check package.json/next.config.js | Kelsey investigates Cloudflare Pages logs | Contact Cloudflare support via dashboard |
| 5xx errors (< 5 min) | Execute rollback (Section 2) | Kelsey reviews last commit for breaking change | CEO notification + DR activation (Section 6) |
| SSL cert not renewing | Verify DNS (Section 3) | Kelsey triggers manual renewal or contacts CF support | Switch to Let's Encrypt via Azure VM |
| SENTINEL false alerts | Verify site UP via curl; adjust timeout | Kelsey reviews SENTINEL script logic | Disable SENTINEL for that site; use external monitor |
| DNS not resolving | Verify Cloudflare DNS records; check registrar NS | Kelsey checks registrar portal for NS change | Contact registrar support |

Key Contacts:

  • L2: Kelsey Hightower (FlowForge agent) via MC task
  • L3: CEO (Alem Basic) via Slack DM or phone (+47 404 74 251)

8. Maintenance Schedule

| Task | Frequency | Owner | How |
| --- | --- | --- | --- |
| Test rollback procedure | Monthly | Proveo (Angie Jones) | Execute rollback on staging site; verify < 60s |
| Review SENTINEL alerts | Weekly | Kelsey | Check Slack #infra-alerts for false positives |
| Update dependency versions | Weekly | Renovate bot | Auto-merge minor/patch; manual review major |
| Backup DNS zone config | Weekly | Automated cron | Exports to ~/system/backups/dns/ |
| Verify SSL certs valid | Daily | SENTINEL | Auto-alert if < 30 days to expiry |


9. Change Log

| Date | Change | Author |
| --- | --- | --- |
| 2026-04-20 | Initial version: rollback, SSL, migration, SENTINEL | Skillforge (MC #8491) |