# ALAI Hosting Operations Runbook

**Owner:** FlowForge (Kelsey Hightower) | **Updated:** 2026-04-20 | **MC:** #8491

---

## 1. Overview

This runbook covers operational procedures for ALAI's static site hosting on Cloudflare Pages. For architecture and migration plan, see the [ALAI Static Hosting Blueprint](https://docs.alai.no) (Infrastructure chapter).

**In Scope:**
- Cloudflare Pages deployments (9 static sites)
- DNS configuration (Cloudflare DNS)
- SSL certificate management (auto-renewal)
- Rollback procedures (< 60s target)
- SENTINEL uptime monitoring integration

**Out of Scope:**
- Azure VM services (BookStack, Documenso, Planka, Vaultwarden) — see individual runbooks
- GCP Cloud Run (Bilko API, Intesa demo) — see Bilko runbooks
- Dynamic Next.js apps (app.getdrop.no) — see Drop runbook

---

## 2. Rollback Procedure

**When:** A deploy caused a production issue (5xx errors, broken UI, or a functionality regression)

**Target:** < 60 seconds from decision to live rollback

### Step 1: Identify Last Known Good Deployment

```bash
# List recent deployments; note the ID of the most recent one created before the failing deploy
npx wrangler pages deployment list --project-name=<project-name>

# Example output:
# ID: abc123def456
# Created: 2026-04-20 14:30:00
# Branch: main
# Status: active
```

### Step 2: Execute Rollback

```bash
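# Rollback to previous deployment (use ID from step 1)
# If the installed Wrangler version does not offer this subcommand, roll back from the
# Cloudflare dashboard instead (Project > Deployments > Rollback to this deployment)
npx wrangler pages deployment rollback <deployment-id> --project-name=<project-name>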

# Example:
npx wrangler pages deployment rollback abc123def456 --project-name=alai-no
```
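
As an alternative to the Wrangler CLI, Cloudflare's REST API documents a rollback endpoint for Pages deployments. The following is a minimal sketch under stated assumptions, not a tested part of this runbook: the function names and the `CF_ACCOUNT_ID` and `CF_API_TOKEN` environment variables are hypothetical, and the token is assumed to have edit permission on Pages.

```bash
# pages_rollback_url: build the REST endpoint for rolling back to a deployment.
pages_rollback_url() {
  local account_id="$1" project="$2" deployment_id="$3"
  echo "https://api.cloudflare.com/client/v4/accounts/${account_id}/pages/projects/${project}/deployments/${deployment_id}/rollback"
}

# pages_rollback: POST to the rollback endpoint.
# Assumes CF_ACCOUNT_ID and CF_API_TOKEN are exported (hypothetical variable names).
pages_rollback() {
  local project="$1" deployment_id="$2"
  curl -sS -X POST \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    "$(pages_rollback_url "${CF_ACCOUNT_ID}" "$project" "$deployment_id")"
}
```

Usage would look like `pages_rollback alai-no abc123def456`; inspect the JSON response for `"success": true` before moving on to verification.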

### Step 3: Verify

```bash
# Check HTTP status
curl -I https://<domain>

# Expected: HTTP/2 200
# If 5xx persists → escalate to L2 (Kelsey)
```
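
Because the target is under 60 seconds, a single `curl` may run before the rollback has propagated. A minimal polling sketch follows; `probe_status` and `wait_for_200` are hypothetical helper names, and the elapsed time is approximate because it does not count time spent inside `curl`.

```bash
# probe_status: print the HTTP status code for a URL (000 if the request fails).
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# wait_for_200: poll until the URL returns 200 or the deadline (seconds) passes.
# Arguments: URL, deadline in seconds (default 60), poll interval in seconds (default 5).
wait_for_200() {
  local url="$1" deadline="${2:-60}" interval="${3:-5}" elapsed=0
  while [ "$elapsed" -lt "$deadline" ]; do
    if [ "$(probe_status "$url")" = "200" ]; then
      echo "OK after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "TIMEOUT after ${deadline}s"
  return 1
}
```

For example, `wait_for_200 https://<domain> 60 5 || echo "escalate to L2"` ties the check to the 60-second target.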

### Step 4: Alert & Document

```bash
# Post to Slack
node ~/system/tools/slack.js send "#infra-alerts" \
  "ROLLBACK executed: <project-name> to deployment <deployment-id> at $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Create incident report (if > 5 min downtime)
node ~/system/tools/mc.js add "Incident: <domain> rollback" \
  --desc "Reason: [fill]. Rollback target: [deployment-id]. Downtime: [X min]" \
  --priority H --owner kelsey
```

---

## 3. SSL Certificate Auto-Renewal

Cloudflare Pages provisions and renews SSL certificates for custom domains automatically. Certificates renew 30 days before expiry, so a SENTINEL "< 30 days" alert suggests that a renewal is overdue.

**No manual action required.**

### Troubleshooting: SSL Cert Warning

If SENTINEL alerts "SSL cert expiry < 30 days":

```bash
# Step 1: Verify domain DNS points to Cloudflare
dig <domain> +short

# Expected: CNAME to <project-name>.pages.dev or Cloudflare IP range

# Step 2: Check Cloudflare dashboard
open "https://dash.cloudflare.com/pages"
# Navigate to: Project > Settings > Custom domains
# Verify: "SSL/TLS certificate" shows "Active"

# Step 3: If the cert is not renewing, escalate
# (Cloudflare Pages does not expose a manual renewal API, so contact Cloudflare support)
node ~/system/tools/slack.js send "#infra-alerts" \
  "SSL cert not auto-renewing for <domain> — escalating to Cloudflare support"
```
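
To confirm a SENTINEL expiry alert independently, the served certificate can be inspected with OpenSSL. This is a sketch rather than the check SENTINEL itself performs: the function names are hypothetical, and it relies on `openssl x509 -checkend`, which exits non-zero when the certificate expires within the given number of seconds.

```bash
# days_to_seconds: convert a number of days to seconds for -checkend.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# cert_valid_for_days: succeed if the certificate served for a domain
# remains valid for at least N more days (default 30).
cert_valid_for_days() {
  local domain="$1" days="${2:-30}"
  openssl s_client -servername "$domain" -connect "$domain:443" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$(days_to_seconds "$days")" >/dev/null
}
```

For example, `cert_valid_for_days <domain> 30 || echo "cert expires within 30 days"` reproduces the alert threshold from the command line.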

---

## 4. Migration Workflow: New Site

**Input:** New static site needs hosting (markdown, React, Next.js static export, Astro)

**Output:** Site live on custom domain with SSL, SENTINEL monitoring enabled

### Step 1: Validate Static Export

```bash
# For Next.js: verify static export enabled
grep 'output.*export' /path/to/site/next.config.js

# Expected: output: 'export'

# Build locally to verify
cd /path/to/site && npm run build

# Expected: Output directory exists (out/ for Next.js static export, dist/ for Astro)
# A .next/ directory alone means a server build, not a static export
```
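
The output-directory check can be scripted so it fails loudly before a project is created. A minimal sketch, assuming the build output lands in `out/` or `dist/` as in the build settings table; `find_output_dir` is a hypothetical helper name.

```bash
# find_output_dir: print the first non-empty static output directory (out or dist)
# found under a site path; return non-zero if neither exists.
find_output_dir() {
  local site="$1" dir
  for dir in out dist; do
    if [ -d "$site/$dir" ] && [ -n "$(ls -A "$site/$dir")" ]; then
      echo "$dir"
      return 0
    fi
  done
  return 1
}
```

For example, `find_output_dir /path/to/site || echo "no static output found; check the export config"`.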

### Step 2: Create Cloudflare Pages Project

```bash
# Option A: Dashboard (recommended for first-time)
open "https://dash.cloudflare.com/pages"
# Click: Create a project > Connect to Git > Select repo

# Option B: CLI
npx wrangler pages project create <project-name> --production-branch main
```

### Step 3: Configure Build Settings

In Cloudflare dashboard: Project > Settings > Builds

| Framework | Build command | Output directory |
|-----------|--------------|------------------|
| Static HTML | (none) | / |
| Next.js (static export) | `npm run build` | `out` |
| Astro | `npm run build` | `dist` |

Save settings.

### Step 4: Add GitHub Actions Workflow

Copy from template:

```bash
cp /Users/makinja/system/specs/templates/cf-pages-deploy.yml \
   /path/to/site/.github/workflows/deploy.yml
```

Commit and push to trigger first deploy.

### Step 5: Add Custom Domain

```bash
# In Cloudflare dashboard: Project > Custom domains > Add custom domain
# Enter: <domain>

# If domain DNS is already on Cloudflare: CNAME record auto-created
# If domain DNS is external: Manual CNAME to <project-name>.pages.dev required
```

Verify SSL activates (usually < 5 min).

### Step 6: Enable SENTINEL Monitoring

Add domain to `/Users/makinja/system/tools/sentinel-uptime.sh`:

```bash
# Open file
nano /Users/makinja/system/tools/sentinel-uptime.sh

# Add line to SITES array:
"https://<domain>"

# Save and test
bash /Users/makinja/system/tools/sentinel-uptime.sh
```

Verify that no Slack alert is sent (this indicates the site is UP).
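
Before editing, it can help to check whether the domain is already listed so the entry is not added twice. A minimal sketch, assuming each entry appears in the script as a double-quoted URL as shown above; `site_listed` is a hypothetical helper name.

```bash
# site_listed: succeed if the exact double-quoted URL already appears in the file.
# Arguments: path to the SENTINEL script, URL to look for.
site_listed() {
  grep -qF "\"$2\"" "$1"
}
```

For example, `site_listed /Users/makinja/system/tools/sentinel-uptime.sh "https://<domain>" || echo "not monitored yet"`.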

### Step 7: Document

Update site inventory:

```bash
# Append a row to ~/system/docs/infrastructure-inventory.md
# (assumes the inventory table is the last element in the file)
echo "| <domain> | Cloudflare Pages | <project-name> | [GitHub repo URL] | ACTIVE |" \
  >> ~/system/docs/infrastructure-inventory.md
```

---

## 5. SENTINEL Uptime Integration

SENTINEL checks all ALAI sites every 5 minutes via cron.

**Script:** `/Users/makinja/system/tools/sentinel-uptime.sh`

**Cron:** `*/5 * * * * bash /Users/makinja/system/tools/sentinel-uptime.sh`

**Alert Channel:** `#infra-alerts` (Slack)
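
For orientation, the core of such a check can be sketched as follows. This is a hypothetical illustration, not the contents of `sentinel-uptime.sh`: the function names `http_code` and `report_down` are assumptions, and the real script may use a different timeout, follow redirects differently, or send alerts another way.

```bash
# http_code: print the final HTTP status code for a URL, following redirects
# (000 if the request fails or times out).
http_code() {
  curl -s -L -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# report_down: print one line per URL that does not return 200.
# No output means every site is UP.
report_down() {
  local url code
  for url in "$@"; do
    code=$(http_code "$url")
    if [ "$code" != "200" ]; then
      echo "DOWN $url (HTTP ${code:-000})"
    fi
  done
}
```

With a `SITES` array defined as in the script, this would be called as `report_down "${SITES[@]}"`, and any output would be forwarded to `#infra-alerts`.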

### Add New Site to SENTINEL

```bash
# Edit SITES array
nano /Users/makinja/system/tools/sentinel-uptime.sh

# Add:
"https://<domain>"

# Test manually
bash /Users/makinja/system/tools/sentinel-uptime.sh

# Expected: No output (site UP) or Slack alert (site DOWN)
```

### Troubleshoot False Alerts

If SENTINEL reports DOWN but site is UP:

```bash
# Test from command line
curl -I --max-time 10 https://<domain>

# If returns 200: SENTINEL script has timeout issue (increase --max-time)
# If returns 5xx: Real issue — investigate Cloudflare Pages logs
# If returns 301/302: Update SENTINEL to accept redirects
```
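
The three cases above can be captured in a small lookup so the next action is unambiguous during an alert. A minimal sketch; `triage_status` is a hypothetical helper name, and the fallback branch for unlisted codes is an addition, not part of the procedure above.

```bash
# triage_status: map an HTTP status code to the next troubleshooting action.
triage_status() {
  case "$1" in
    200)     echo "site is UP: raise --max-time in SENTINEL" ;;
    301|302) echo "redirect: let SENTINEL follow redirects" ;;
    5??)     echo "real issue: investigate Cloudflare Pages logs" ;;
    *)       echo "unlisted status: investigate manually" ;;
  esac
}
```

For example, `triage_status "$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://<domain>)"`.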

---

## 6. Emergency DR: Serve from Azure VM

**Scenario:** Cloudflare Pages is down (e.g., Cloudflare incident) AND site is business-critical (e.g., alai.no during client demo).

**Target:** Site accessible within 120 seconds.

### Step 1: Copy Build Output to VM

```bash
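# From local machine:
cd /path/to/site
npm run build
# Uses the same key as the SSH step below
# Note: if /var/www/<site-name> already exists, scp -r places the files under /var/www/<site-name>/out
scp -i ~/.ssh/azure_alai -r ./out alai-admin@4.223.110.181:/var/www/<site-name>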
```

### Step 2: Serve via Caddy

```bash
# SSH to VM
ssh -i ~/.ssh/azure_alai alai-admin@4.223.110.181

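# Start a simple HTTP server first, bound to localhost so only Caddy can reach it
# (cd into the directory that actually contains index.html)
cd /var/www/<site-name> && python3 -m http.server 8080 --bind 127.0.0.1 &

# Start Caddy reverse proxy in front of it
sudo caddy reverse-proxy --from <domain> --to 127.0.0.1:8080 &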
```
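
Since `scp -r` nests the build under an extra `out/` directory when the destination already exists, it is worth resolving the real document root before serving. A minimal sketch, assuming the site has an `index.html` at its root; `site_root` is a hypothetical helper name.

```bash
# site_root: print the directory under a base path that actually contains index.html.
# Checks the base path first, then a nested out/ directory; fails if neither has it.
site_root() {
  local base="$1"
  if [ -f "$base/index.html" ]; then
    echo "$base"
  elif [ -f "$base/out/index.html" ]; then
    echo "$base/out"
  else
    return 1
  fi
}
```

On the VM this could feed Caddy's built-in static file server as a single-process alternative to the proxy plus Python server, for example `root=$(site_root /var/www/<site-name>) && sudo caddy file-server --root "$root" --domain <domain>`. That variant is a suggestion and has not been validated against the 120-second target.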

### Step 3: Update DNS (if needed)

```bash
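# The VM only receives traffic once <domain> resolves to its IP (4.223.110.181)
# If Cloudflare DNS is also down, update DNS to point to the Azure VM IP
# This requires registrar access and is NOT recommended unless the Cloudflare outage lasts multiple hours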
```

### Step 4: Monitor & Rollback

```bash
# Verify site accessible
curl -I https://<domain>

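# When Cloudflare recovers:
# - If DNS was not changed, no DNS action is needed (CNAME to .pages.dev still exists)
# - If DNS was changed in Step 3, restore the CNAME manually; it does not revert on its own
# Stop Caddy and the Python HTTP server on the VM
sudo killall caddy
pkill -f "http.server 8080"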
```

---

## 7. Escalation

| Issue | L1 Action | L2 Escalation | L3 Escalation |
|-------|-----------|---------------|---------------|
| Deploy failure | Review build logs; check package.json/next.config.js | Kelsey investigates Cloudflare Pages logs | Contact Cloudflare support via dashboard |
| 5xx errors (< 5 min) | Execute rollback (Section 2) | Kelsey reviews last commit for breaking change | CEO notification + DR activation (Section 6) |
| SSL cert not renewing | Verify DNS (Section 3) | Kelsey contacts CF support (Pages exposes no manual renewal API) | Switch to Let's Encrypt via Azure VM |
| SENTINEL false alerts | Verify site UP via curl; adjust timeout | Kelsey reviews SENTINEL script logic | Disable SENTINEL for that site; use external monitor |
| DNS not resolving | Verify Cloudflare DNS records; check registrar NS | Kelsey checks registrar portal for NS change | Contact registrar support |

**Key Contacts:**
- L2: Kelsey Hightower (FlowForge agent) via MC task
- L3: CEO (Alem Basic) via Slack DM or phone (+47 404 74 251)

---

## 8. Maintenance Schedule

| Task | Frequency | Owner | How |
|------|-----------|-------|-----|
| Test rollback procedure | Monthly | Proveo (Angie Jones) | Execute rollback on staging site; verify < 60s |
| Review SENTINEL alerts | Weekly | Kelsey | Check Slack `#infra-alerts` for false positives |
| Update dependency versions | Weekly | Renovate bot | Auto-merge minor/patch; manual review major |
| Backup DNS zone config | Weekly | Automated cron | Exports to `~/system/backups/dns/` |
| Verify SSL certs valid | Daily | SENTINEL | Auto-alert if < 30 days to expiry |

---

## 9. Related Docs

- [ALAI Static Hosting Blueprint](https://docs.alai.no) — Architecture & migration plan
- [Infrastructure Inventory](~/system/docs/infrastructure-inventory.md) — All ALAI sites & services
- [SENTINEL Reliability Sprint](~/system/docs/runbooks/sentinel-reliability.md) — Monitoring architecture
- [Incident Response Playbook](~/system/docs/runbooks/incident-response-playbook.md) — General incident workflow

---

## 10. Change Log

| Date | Change | Author |
|------|--------|--------|
| 2026-04-20 | Initial version — rollback, SSL, migration, SENTINEL | Skillforge (MC #8491) |